- Neepa Shah and Sunita Mahajan 2012. Document Clustering: A Detailed Review. International Journal of Applied Information Systems. 4, 5 (October 2012), 30-38. DOI=http://dx.doi.org/10.5120/ijais450691
-
@article{10.5120/ijais2017451568,
  author = {Neepa Shah and Sunita Mahajan},
  title = {Document Clustering: A Detailed Review},
  journal = {International Journal of Applied Information Systems},
  issue_date = {October 2012},
  volume = {4},
  number = {},
  month = {October},
  year = {2012},
  issn = {},
  pages = {30-38},
  numpages = {},
  url = {/archives/volume4/number5/300-0691},
  doi = {10.5120/ijais12-450691},
  publisher = {© 2010 by IJAIS Journal},
  address = {}
}
-
%1 450691
%A Neepa Shah
%A Sunita Mahajan
%T Document Clustering: A Detailed Review
%J International Journal of Applied Information Systems
%@
%V 4
%N
%P 30-38
%D 2012
%I © 2010 by IJAIS Journal
Abstract
Document clustering is the automatic organization of documents into clusters so that documents within a cluster are highly similar to one another compared with documents in other clusters. It has been studied intensively because of its wide applicability in areas such as web mining, search engines, and information retrieval. It involves measuring the similarity between documents and grouping similar documents together. It provides an efficient representation and visualization of the documents, and thus also makes navigation easier. In this paper, we give an overview of the document clustering methods studied and researched over the last few years, from basic traditional methods to fuzzy-based, genetic, co-clustering, and heuristic-oriented approaches. The document clustering procedure with its feature selection process, applications, challenges in document clustering, similarity measures, and the evaluation of document clustering algorithms are also explained.
Keywords
Document clustering, document clustering applications, document clustering procedure, similarity measures for document clustering, evaluation of document clustering algorithm, challenges in document clustering