Highly accurate children’s speech recognition for interactive reading tutors using subword units
Andreas Hagen, Bryan Pellom *, Ronald Cole
Center for Spoken Language Research, University of Colorado at Boulder, 1777 Exposition Drive, Suite #171, Boulder, CO 80301, USA
Received 15 December 2005; received in revised form 20 February 2007; accepted 9 May 2007
Abstract

Speech technology offers great promise in the field of automated literacy and reading tutors for children. In such applications, speech recognition can be used to track the child's reading position, detect oral reading miscues, assess comprehension of the text being read by estimating whether the prosodic structure of the speech is appropriate to the discourse structure of the story, or engage the child in interactive dialogs to assess and train comprehension. Despite this promise, speech recognition systems exhibit higher error rates for children due to variability in vocal tract length, formant frequency, pronunciation, and grammar. When children read out loud, these problems are compounded by difficulties in recognizing printed words, which affect speech production and cause pauses, repeated syllables, and other phenomena. To overcome these challenges, we present advances in speech recognition that improve accuracy and modeling capability in the context of an interactive literacy tutor for children. Specifically, this paper focuses on a novel set of speech recognition techniques that can be applied to improve oral reading recognition. First, we demonstrate that speech recognition error rates for interactive read-aloud can be reduced by more than 50% through a combination of advances in both statistical language modeling and acoustic modeling. Next, we propose extending our baseline system by introducing a novel token-passing search architecture targeting subword unit based
A. Hagen et al. / Speech Communication 49 (2007) 861–873