Abstract- Text clustering is an important application of data mining. It is concerned with grouping similar text documents together. Text document clustering plays an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. Clustering method has to embed the documents in a suitable similarity space. In this paper we compare four popular similarity measures : cosine similarity, Jaccard similarity, Euclidean measure and Correlation Coefficient in conjunction with different types of vector space representation (boolean, term frequency and inverse document frequency) of documents. Clustering of documents is performed using generalized k-Means; a Partitioned based clustering technique on high dimensional sparse data representing text documents. Performance is measured against a human-imposed classification of Topic and Place categories. We conducted a number of experiments and used entropy measure to assure statistical significance of results. Cosine, Pearson correlation and Jaccard similarities emerge as the best measures to capture human categorization behavior, while Euclidean measures perform poor.
Keywords: clustering, Jaccard similarity, Cosine similarity, Euclidean measure, Correlation coefficient, k-means.
Today, with the rapid advancements in technology we are able to accumulate huge amounts of data of different kinds. Data mining emerged as a field concerned with the extraction of useful knowledge from data 1. Data mining techniques have been applied to solve a wide range of real-world problems. Clustering is an unsupervised data mining technique where the labels of data objects are unknown. It is the job of the clustering technique to identify the categorization of data objects under examination. Clustering can be applied to different kinds of data including text. When dealing with textual data, objects can be documents, paragraphs, or words 2. Text clustering refers to the process of grouping similar text documents together. The problem can be formulated as follows: given a set of documents it is required to divide them into multiple groups, such that documents in the same group are more similar to each other than to documents in other groups. There are many applications of text clustering including: document organization and browsing, corpus summarization, and document classification clustering has been proposed for use in browsing a collection of documents 3 or in organizing the results returned by a search engine in response to user’s query 4 or help users quickly identify and focus on the relevant set of results. Customer comments are clustered in many online stores, such as Amazon.com to provide collaborative recommendations. In collaborative bookmarking or tagging, clusters of users that share certain traits are identified by their annotations. Document clustering has also been used to automatically generate Hierarchical clusters of documents 5. This paper is organized as follows. The section 2 deals with the related work in text document clustering, section 3 describes the document representation used in the experiments. Section 4 discusses the similarity measures and their semantics. Section 5 presents the K-means clustering algorithm and Section 6 explains experiment settings, evaluation approaches, results and analysis and Section 7 concludes and discusses future work.
2. Related Work:
Text clustering is one of the important applications of data mining. In this section, we review some of the related work in this field.
Luo et al. 3 used the concepts of document neighbors and links in order to enhance the performance of k-means and bisecting k-means clustering. Using a pair wise similarity function and a given similarity threshold, the neighbors of a document are the documents that are considered similar to it. A link between two documents is the number of common neighbors. The concepts were used in the selection of initial cluster centroids and in document similarity measuring.
Many clustering techniques have been proposed in the literature. Clustering algorithms are mainly categorized into Hierarchical and Partitioning methods 2, 3, 4, 5. Hierarchical clustering method works by grouping data objects into a tree of clusters 6. These methods can further be classified into agglomerative and divisive Hierarchical clustering depending on whether the Hierarchical decomposition is formed in a bottom-up or top-down fashion. K-means and its variants 7, 8, 9 are the most well-known partitioning methods 10.
Bide and Shedge proposed a clustering pipeline to improve the performance of k-means clustering. The authors adopted a divide-and-conquer approach to cluster documents in the 20 Newsgroup dataset Documents were divided into groups where preprocessing, feature extraction, and k-means clustering were applied on each group. Document similarity was calculated using the cosine similarity measure. The proposed approach achieved better results as compared to standard k-means in terms of both cluster quality and execution time.
Hierarchical techniques produce a nested sequence of partitions, with a single, all inclusive cluster at the top and singleton clusters of individual points at the bottom. Each intermediate level can be viewed as combining two clusters from the next lower level (or splitting a cluster from the next higher level). The result of a Hierarchical clustering algorithm can be graphically displayed as tree, called a dendogram.
In contrast to Hierarchical techniques, Partitional clustering techniques create a one-level (un-nested) partitioning of the data points. If K is the desired number of clusters, then Partitional approaches typically find all K clusters at once. Contrast this with traditional Hierarchical schemes, which bisect a cluster to get two clusters or merge two clusters to get one. Of course, a Hierarchical approach can be used to generate a flat partition of K clusters, and likewise, the repeated application of a Partitional scheme can derive Hierarchical clustering.