Title: A Comparison of Document Clustering Techniques
Abstract: This paper presents the results of an experimental study of some common document clustering techniques. In particular, we compare the two main approaches to document clustering, agglomerative hierarchical clustering and K-means. (For K-means we used a “standard” K-means algorithm and a variant of K-means, “bisecting” K-means.) Hierarchical clustering is often portrayed as the better quality clustering approach, but is limited because of its quadratic time complexity. In contrast, K-means and its variants have a time complexity which is linear in the number of documents, but are thought to produce inferior clusters. Sometimes K-means and agglomerative hierarchical approaches are combined so as to “get the best of both worlds.” However, our results indicate that the bisecting K-means technique is better than the standard K-means approach and as good or better than the hierarchical approaches that we tested for a variety of cluster evaluation metrics. We propose an explanation for these results that is based on an analysis of the specifics of the clustering algorithms and the nature of document
Publication Year: 2000
Publication Date: 2000-05-23
Language: en
Type: article
Access and Citation
Cited By Count: 2486
AI Researcher Chatbot
Get quick answers to your questions about the article from our AI researcher chatbot