1.0 INTRODUCTION
Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech.
The ultimate goal of text-to-speech synthesis is to convert ordinary orthographic text into an acoustic signal that is indistinguishable from human speech. The conversion process, illustrated in Figure 1, is usually divided into two parts, because each involves different types of knowledge and processes. The front end handles problems in text analysis and higher-level linguistic features; it interprets orthographic text and outputs a phonetic transcription that specifies the phonemes and an intonation for the text. The back end handles problems in phonetics, acoustics, and signal processing; it converts the phonetic transcription into a speech waveform containing appropriate values for acoustic parameters such as pitch, amplitude, duration, and spectral characteristics.
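The division into a front end and a back end can be made concrete with a minimal sketch. Everything in it is an illustrative assumption rather than part of any real system: the names PhoneticTranscription, AcousticFrame, front_end, and back_end, the two-word toy lexicon, the ARPAbet-style phoneme symbols, and the fixed pitch, duration, and amplitude values are all hypothetical. The back end here stops at per-phoneme acoustic parameters and does not generate an actual waveform.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PhoneticTranscription:
    """Output of the front end: phonemes plus an intonation label."""
    phonemes: List[str]
    intonation: str  # "rising" or "falling" (toy labels, assumed here)


@dataclass
class AcousticFrame:
    """Acoustic parameters the back end assigns to one phoneme."""
    phoneme: str
    pitch_hz: float
    duration_ms: float
    amplitude: float


# Hypothetical two-word lexicon with ARPAbet-style symbols.
TOY_LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


def front_end(text: str) -> PhoneticTranscription:
    """Text analysis: orthographic text -> phonemes + intonation."""
    words = text.lower().strip("?.! ").split()
    phonemes: List[str] = []
    for word in words:
        # Words missing from the toy lexicon become a placeholder symbol.
        phonemes.extend(TOY_LEXICON.get(word, ["<unk>"]))
    # Crude intonation guess: questions rise, everything else falls.
    intonation = "rising" if text.strip().endswith("?") else "falling"
    return PhoneticTranscription(phonemes, intonation)


def back_end(transcription: PhoneticTranscription) -> List[AcousticFrame]:
    """Phonetic transcription -> acoustic parameters (no waveform here)."""
    count = len(transcription.phonemes)
    slope = 20.0 if transcription.intonation == "rising" else -20.0
    frames = []
    for index, phoneme in enumerate(transcription.phonemes):
        position = index / max(count - 1, 1)  # 0.0 at start, 1.0 at end
        frames.append(
            AcousticFrame(
                phoneme=phoneme,
                pitch_hz=120.0 + slope * position,  # assumed base pitch
                duration_ms=80.0,                   # assumed fixed duration
                amplitude=1.0,                      # assumed fixed amplitude
            )
        )
    return frames


if __name__ == "__main__":
    for frame in back_end(front_end("Hello world?")):
        print(frame)
```

The only data passed between the two functions is the phonetic transcription, which mirrors the separation of linguistic knowledge from acoustic knowledge described above.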
For both parts, the conversion process is usually performed through one of two approaches. One approach is to rely on rules formulated by experts in natural language, linguistics, or acoustics. The other approach is to avoid hand-written rules and instead rely on large lists from machine-readable dictionaries, or to derive rules automatically from statistical analysis of a large corpus of transcribed speech. The trend today is to use the corpus-based approach. Because this approach depends on processing large amounts of data, it requires automated techniques such as those used in automatic speech recognition. Crucial issues for this approach are the coverage of the corpus and how a system deals with cases outside the corpus.
Figure 1
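The coverage issue can be illustrated with a small sketch of grapheme-to-phoneme conversion that consults a dictionary first and falls back to rules for words the dictionary does not cover. This combination is one possible way of handling out-of-coverage cases, not a method the text prescribes. The names LEXICON, LETTER_RULES, rules_to_phonemes, and pronounce, the lexicon entries, and the letter-to-sound rules are all hypothetical and far too small for real use.

```python
from typing import Dict, List, Tuple

# Hypothetical machine-readable dictionary (ARPAbet-style symbols).
LEXICON: Dict[str, List[str]] = {
    "cat": ["K", "AE", "T"],
    "ship": ["SH", "IH", "P"],
}

# Hypothetical hand-written letter-to-sound rules.
# Digraphs come first so that "sh" is tried before "s".
LETTER_RULES: List[Tuple[str, str]] = [
    ("sh", "SH"),
    ("ch", "CH"),
    ("a", "AE"),
    ("i", "IH"),
    ("c", "K"),
    ("t", "T"),
    ("p", "P"),
    ("h", "HH"),
    ("s", "S"),
]


def rules_to_phonemes(word: str) -> List[str]:
    """Rule-based fallback: scan left to right, applying the first matching rule."""
    phonemes: List[str] = []
    position = 0
    while position < len(word):
        for grapheme, phoneme in LETTER_RULES:
            if word.startswith(grapheme, position):
                phonemes.append(phoneme)
                position += len(grapheme)
                break
        else:
            # No rule covers this letter: emit a placeholder and move on.
            phonemes.append("<unk>")
            position += 1
    return phonemes


def pronounce(word: str) -> Tuple[List[str], str]:
    """Return the phonemes and which knowledge source produced them."""
    key = word.lower()
    if key in LEXICON:
        return LEXICON[key], "lexicon"
    return rules_to_phonemes(key), "rules"


if __name__ == "__main__":
    for word in ["cat", "chat", "zip"]:
        print(word, pronounce(word))
```

In this sketch "cat" is answered from the dictionary, "chat" is outside the dictionary but fully covered by the rules, and "zip" falls outside both, so its first letter is returned as a placeholder symbol.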
Text-to-speech synthesis has had a long history, one that can be traced back at