Language Independent Keyword Based Information Retrieval System of Handwritten Documents using SVM Classifier and Converting Words into Shapes

Authors

  • Muhammad Rashid Hussain
  • Asif Masood
  • Haris Ahmad Khan
  • Khurram Khurshid
  • Imran Siddiqi

Abstract

This work presents a language-independent, keyword-based document indexing and retrieval system that uses a support vector machine (SVM) as its classifier. Word spotting is an attractive alternative to traditional Optical Character Recognition (OCR) systems: instead of converting the image into text, retrieval is based on matching the images of words using pattern classification techniques. The proposed technique extracts words from images of handwritten documents and converts each word image into a shape represented by its contour. A set of multiple features is then extracted from each word image, and instances of the same word are grouped into clusters. These clusters are used to train a multi-class SVM, which learns the different word classes. The documents to be indexed are segmented into words, and the closest cluster for each word is determined using the SVM. An index file maintained for each word records the locations of that word within each document. A query word presented to the system is matched against the clusters in the database, and the documents containing occurrences of the query word are retrieved. The system achieved promising precision and recall rates on the IAM database of handwritten documents.
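
The indexing and retrieval steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a nearest-centroid rule stands in for the trained multi-class SVM, the two-dimensional feature vectors are made up in place of contour-based features, and every name (CLUSTER_CENTROIDS, classify, build_index, retrieve) is hypothetical.

```python
from collections import defaultdict
from math import dist

# Hypothetical cluster centroids. In the paper, word classes are learned from
# contour-based features; these 2-D points are made up for illustration only.
CLUSTER_CENTROIDS = {
    "letter": (0.0, 0.0),
    "house": (5.0, 5.0),
}


def classify(features):
    """Stand-in for the trained multi-class SVM: pick the nearest centroid."""
    return min(
        CLUSTER_CENTROIDS,
        key=lambda label: dist(features, CLUSTER_CENTROIDS[label]),
    )


def build_index(documents):
    """Map each word class to the (document id, word position) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, word_features in documents.items():
        for position, features in enumerate(word_features):
            index[classify(features)].append((doc_id, position))
    return index


def retrieve(index, query_features):
    """Classify the query word and return the indexed locations of its class."""
    return index.get(classify(query_features), [])


# Each document is a list of per-word feature vectors (assumed already segmented).
documents = {
    "doc1": [(0.2, -0.1), (4.8, 5.3)],
    "doc2": [(5.1, 4.9)],
}
index = build_index(documents)
hits = retrieve(index, (4.9, 5.0))
print(hits)
```

In this sketch the query is classified once and answered by a single index lookup, so retrieval cost does not grow with the number of indexed word images; the classifier is only run over the whole collection at indexing time.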

Published

2016-08-05

Section

Electrical Engineering and Computer Science