Document Clustering Using Graph Based Document Representation with Constraints

Authors

  • F. Amin
  • M. Rafi
  • M. Shahid

Abstract

Document clustering is an unsupervised approach in which a large collection of documents (a corpus) is subdivided into smaller, meaningful, identifiable, and verifiable sub-groups (clusters). Representing documents meaningfully and implicitly identifying the patterns on which this separation is based are the challenging parts of document clustering. We propose a document clustering technique that uses a graph-based document representation with constraints. A graph data structure can easily capture non-linear relationships among nodes; a document contains various feature terms that can be non-linearly connected, and hence a graph can readily represent this information. Constraints are explicit conditions for document clustering in which background knowledge is used to set the direction for linking or not linking a set of documents to a target cluster, thus guiding the clustering process. We regard clustering as an ill-defined problem: many clustering results are possible for the same corpus, and background knowledge can be used to drive the clustering algorithm in the right direction. We propose three types of constraints: instance-level, corpus-level, and cluster-level. We also propose a new algorithm, Constrained HAC, which incorporates instance-level constraints as prior knowledge to guide the clustering process toward better results. An extensive set of experiments was performed on both synthetic and standard document clustering datasets. The results are compared using standard clustering measures: purity, entropy, and F-measure. They indicate that the proposed approach improves cluster quality.
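
To make two of the evaluation measures named in the abstract concrete, the following is a minimal sketch of purity and entropy using their common textbook definitions. It is an illustration under stated assumptions, not the authors' evaluation code: the function names, the toy cluster assignments, and the class labels are all hypothetical, and F-measure is left out for brevity.

```python
import math
from collections import Counter


def _class_counts(clusters, labels):
    """Group gold labels by cluster id and count each class per cluster."""
    counts = {}
    for cluster_id, label in zip(clusters, labels):
        counts.setdefault(cluster_id, Counter())[label] += 1
    return counts


def purity(clusters, labels):
    """Fraction of documents that belong to the majority class of their cluster."""
    counts = _class_counts(clusters, labels)
    return sum(max(c.values()) for c in counts.values()) / len(labels)


def entropy(clusters, labels):
    """Size-weighted average of the class entropy (in bits) of each cluster."""
    counts = _class_counts(clusters, labels)
    total = len(labels)
    result = 0.0
    for c in counts.values():
        n = sum(c.values())
        h = -sum((k / n) * math.log2(k / n) for k in c.values())
        result += (n / total) * h
    return result


# Hypothetical toy data: six documents, two clusters, two gold classes.
toy_clusters = [0, 0, 0, 1, 1, 1]
toy_labels = ["sports", "sports", "politics", "politics", "politics", "politics"]

toy_purity = purity(toy_clusters, toy_labels)
toy_entropy = entropy(toy_clusters, toy_labels)
```

Higher purity and lower entropy both indicate clusters that agree more closely with the gold classes, which is the sense in which the abstract speaks of improved cluster quality.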

Published

2016-06-22

Section

Electrical Engineering and Computer Science