Finding Topics in Urdu: A Study of Applicability of Document Clustering in Urdu Language

Authors

  • Toqeer Ehsan, Department of Computer Science and Engineering, University of Engineering and Technology, Lahore (ORCID: http://orcid.org/0000-0002-6724-6705)
  • H. M. Shahzad Asif, Department of Computer Science and Engineering, University of Engineering and Technology, Lahore

Abstract

In this research, we present the results of a study conducted to ascertain the applicability of document clustering techniques to an Urdu-language corpus. This study, which is the first of its kind, employs a fully probabilistic Bayesian method, Latent Dirichlet Allocation, to cluster the corpus using features collected from the documents. The results are compared with those obtained from a simple classification technique. Analysis of the results shows that both supervised and unsupervised techniques for grouping documents perform reasonably well on this corpus. The results further indicate that the Urdu document clustering technique outperforms the document classification technique in some cases, with an accuracy above 90%.
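
As an illustration of the kind of clustering the abstract describes, the following is a minimal sketch of Latent Dirichlet Allocation used to group documents, assuming collapsed Gibbs sampling for inference and assigning each document to its most probable topic. The abstract does not state the inference method, the features, or the hyperparameters the authors used, so the function names (lda_gibbs, cluster_labels), the values of alpha and beta, and the romanized toy tokens are all hypothetical and chosen only for illustration.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling (assumed here, not taken from the
    paper) and return the topic proportions of each document."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]               # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                              # tokens in topic k
    assignments = []

    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    # Smoothed per-document topic proportions.
    theta = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        theta.append([(doc_topic[d][t] + alpha) / denom for t in range(num_topics)])
    return theta


def cluster_labels(theta):
    """Assign each document to its highest-probability topic."""
    return [max(range(len(row)), key=row.__getitem__) for row in theta]


# Hypothetical toy corpus of romanized tokens; not data from the study.
DOCS = [
    ["kriket", "team", "match", "kriket", "khiladi"],
    ["match", "team", "khiladi", "kriket"],
    ["hukumat", "wazir", "intikhabat", "hukumat"],
    ["wazir", "intikhabat", "hukumat", "parliament"],
]

theta = lda_gibbs(DOCS, num_topics=2)
labels = cluster_labels(theta)
print(labels)
```

Treating the dominant topic as a cluster label is one simple way to turn a topic model into a document clustering; comparing those labels with known categories would then give an accuracy figure of the kind the abstract reports.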

Published

2018-09-11

Section

Computer Science