Urdu Named Entity Recognition System using Hidden Markov Model

Muhammad Kamran Malik, Syed Mansoor Sarwar


Named Entity Recognition (NER) is the process of identifying person, organization, and location names, along with other miscellaneous information such as numbers, dates, and measures, in text. In this paper, we describe the development of an NER system for the Urdu language using a Hidden Markov Model (HMM). We first compare the IOB2 and IOE2 tagging schemes. We then describe how Urdu text is preprocessed before it is fed to the HMM for training with the IOE2 tagging scheme. Finally, we use part-of-speech (POS) information, gazetteers, and rules to improve the accuracy of the system. Our system yields 66.71%, 71.70%, and 69.12% for precision, recall, and F-measure, respectively.
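To make the difference between the two tagging schemes concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the common definitions in which IOB2 marks the first token of every entity with `B-` and IOE2 marks the last token of every entity with `E-`. The function names `encode` and `f_measure`, the labels `PER` and `LOC`, and the transliterated example sentence are all hypothetical and chosen only for illustration. The sketch also recomputes the F-measure as the harmonic mean of the reported precision and recall.

```python
def encode(n_tokens, spans, scheme="IOB2"):
    """Tag a sentence of n_tokens given entity spans.

    spans: list of (start, end_exclusive, label) tuples (hypothetical format).
    scheme: "IOB2" (B- on the first token of each entity) or
            "IOE2" (E- on the last token of each entity).
    """
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        # Every token inside an entity starts as I-<label>.
        for i in range(start, end):
            tags[i] = f"I-{label}"
        # The boundary token is then overwritten according to the scheme.
        if scheme == "IOB2":
            tags[start] = f"B-{label}"
        elif scheme == "IOE2":
            tags[end - 1] = f"E-{label}"
        else:
            raise ValueError(f"unknown scheme: {scheme}")
    return tags


def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Illustrative (made-up) sentence; labels PER and LOC are assumptions.
tokens = ["Muhammad", "Kamran", "Malik", "lives", "in", "Lahore"]
spans = [(0, 3, "PER"), (5, 6, "LOC")]

iob2 = encode(len(tokens), spans, "IOB2")
ioe2 = encode(len(tokens), spans, "IOE2")

for token, b, e in zip(tokens, iob2, ioe2):
    print(f"{token:10s} {b:6s} {e:6s}")

# Precision and recall as reported in the abstract (in percent).
f1 = f_measure(66.71, 71.70)
print(f"F-measure: {f1:.2f}")
```

Under these assumptions, a single-token entity receives `B-` in IOB2 and `E-` in IOE2, and the recomputed F-measure agrees with the reported 69.12% to two decimal places.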




Copyright (c) 2017 Pakistan Journal of Engineering and Applied Sciences
