In this paper we address the issue of pronunciation modeling for conversational speech synthesis. We experiment with two different HMM topologies (a fully connected state model and a forward connected state model) for sub-phonetic modeling, to capture the deletion and insertion of sub-phonetic states during the speech production process. We show that these HMM topologies have a higher log likelihood than the traditional 5-state sequential model. We also study the first and second mentions of content words and their influence on pronunciation variation. Finally, we report phone recognition experiments using the modified HMM topologies.
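To make the three topologies concrete, the following minimal sketch builds 5-state transition matrices for the sequential, forward-connected and fully connected cases. The uniform transition probabilities and helper names are illustrative assumptions, not the parameterization used in our experiments.

```python
import numpy as np

N = 5  # number of emitting states per phone model

def sequential(n=N):
    """Traditional 5-state sequential (left-to-right) topology:
    self-loop plus a transition to the next state only."""
    A = np.zeros((n, n))
    for i in range(n - 1):
        A[i, i] = 0.5
        A[i, i + 1] = 0.5
    A[-1, -1] = 1.0          # final state loops until the phone ends
    return A

def forward_connected(n=N):
    """Forward-connected topology: self-loop plus transitions to every
    later state, so sub-phonetic states can be skipped (deleted)."""
    A = np.zeros((n, n))
    for i in range(n):
        A[i, i:] = 1.0 / (n - i)   # uniform over self and all later states
    return A

def fully_connected(n=N):
    """Fully connected (ergodic) topology: transitions between every pair
    of states, allowing both skips and returns to earlier states."""
    return np.full((n, n), 1.0 / n)

for name, topo in [("sequential", sequential()),
                   ("forward connected", forward_connected()),
                   ("fully connected", fully_connected())]:
    print(name, "row sums:", topo.sum(axis=1))
```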
1. INTRODUCTION
Modeling of pronunciation variation in conversational speech is essential for speech recognition as well as speech synthesis. State-of-the-art speech synthesis systems are built using unit selection databases of carefully read speech recorded in a controlled environment. While these systems produce high-quality natural speech, they convey little of the feel of a conversation and lack the genre and style of conversational speech. … the pronunciation variations [2]. Jande used a phonological rule system to adapt pronunciations to faster speech rates [3]. Bennett et al. used acoustic models trained on a single-speaker database to label the alternate pronunciations of the words "to, for, a, the", and used a CART tree to predict the probable pronunciation in a given context [4].
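As an illustration of this kind of context-dependent prediction, the sketch below fits a CART-style decision tree on made-up context features for the word "the"; the library choice (scikit-learn), the features, the data and the pronunciation labels are all hypothetical, not those of [4].

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical context features for tokens of the word "the":
# [next word starts with a vowel (1/0), local speaking rate in syllables/sec]
X = [[1, 3.5], [1, 4.0], [0, 5.5], [0, 6.0],
     [1, 5.0], [0, 3.0], [0, 4.5], [1, 6.5]]

# Pronunciation labels: full form "dh iy" vs. reduced form "dh ax"
y = ["dh iy", "dh iy", "dh ax", "dh ax",
     "dh iy", "dh ax", "dh ax", "dh ax"]

# CART-style decision tree over the context features
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

# Predict the probable pronunciation for a new context:
# next word starts with a consonant, relatively fast speech
print(clf.predict([[0, 6.0]]))    # e.g. ['dh ax']
```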
There has been considerable research in the speech recognition field towards capturing pronunciation variants. Bates et al. showed that prosodic features derived from energy, F0 and duration could be cues for modeling pronunciation variability [5]. Nedel et al. used a phone-splitting technique to model the pronunciation variants of the two phones AA and IY [6].
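The sketch below shows one way such frame-level prosodic cues might be extracted; the use of librosa, the file name and the parameter values are assumptions for illustration only, and phone or word durations would additionally require a forced alignment.

```python
import numpy as np
import librosa

# Load an utterance (file name is a placeholder)
y, sr = librosa.load("utterance.wav", sr=16000)

# Frame-level energy (RMS) and F0 contour as candidate prosodic cues
energy = librosa.feature.rms(y=y)[0]
f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=60, fmax=400, sr=sr)

# Crude utterance-level summaries; duration cues from a forced alignment
# are omitted here
cues = {
    "mean_energy": float(np.mean(energy)),
    "mean_f0": float(np.nanmean(f0)),        # NaN frames are unvoiced
    "voiced_fraction": float(np.mean(voiced_flag)),
}
print(cues)
```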
Most of the work in speech recognition and speech synthesis uses multiple entries in the dictionary, generated either manually or by