An Efficient Algorithm To Collect Minimal Speech Corpora

Saad Irtza, Sarmad Hussain


Phonetically rich and balanced corpora are widely used to train speech recognition systems, but they are costly to develop. Several greedy algorithms have been developed to collect such corpora. Because recording and transcribing these corpora requires significant effort, there is motivation to further reduce their size. This paper presents an algorithm for doing so. Earlier work shows that different phonemes require different amounts of training data. The current work builds on these findings to reduce the amount of phonetically rich training data. Experiments show that the algorithm reduces the size of an Urdu speech corpus by 56.49% without degrading accuracy.
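
The abstract does not spell out the selection procedure, so the following is only a minimal sketch of the general idea of greedy corpus selection with per-phoneme targets, not the authors' algorithm. It assumes a hypothetical input layout (sentence ids mapped to phoneme lists) and a hypothetical `targets` table giving how many occurrences of each phoneme are wanted. The scoring rule, which counts only occurrences that are still needed, is likewise an assumption made for illustration.

```python
from collections import Counter


def greedy_select(sentences, targets):
    """Greedily pick sentences until every phoneme reaches its target count.

    sentences: dict mapping sentence id -> list of phoneme symbols (assumed layout)
    targets:   dict mapping phoneme -> minimum occurrences wanted (assumed values)
    Returns the chosen sentence ids and any targets that could not be met.
    """
    remaining = dict(targets)  # occurrences still needed per phoneme
    pool = dict(sentences)     # candidate sentences not yet chosen
    chosen = []

    def gain(sid):
        # Count only the occurrences that still reduce an unmet target.
        counts = Counter(pool[sid])
        return sum(min(c, remaining.get(p, 0)) for p, c in counts.items())

    while pool and any(n > 0 for n in remaining.values()):
        best = max(pool, key=gain)
        if gain(best) == 0:
            break  # no remaining sentence contributes a needed phoneme
        for p, c in Counter(pool.pop(best)).items():
            if p in remaining:
                remaining[p] = max(0, remaining[p] - c)
        chosen.append(best)

    unmet = {p: n for p, n in remaining.items() if n > 0}
    return chosen, unmet
```

Giving each phoneme its own target, rather than one uniform count, reflects the observation that different phonemes need different amounts of training data. The actual target values would have to come from experiments such as those the paper reports.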


Copyright (c) 2016 Saad Irtza
