Dept. of Computer Science and Engineering
Oregon Graduate Institute of Science & Technology
Human speakers indicate the relative importance of syllables or words in an utterance through variations in intonation and speaking rate, as well as variations in the degree of articulation and vocal effort. To generate natural-sounding computer speech, the corresponding acoustic variables, such as fundamental frequency, phoneme durations, formant frequencies, and spectral balance, must be accurately controlled. In most speech synthesis systems, however, parameters related to the degree of articulation are not explicitly controlled. The resulting speech often sounds over-articulated and requires high listening effort, because acoustic cues to the semantic and linguistic structure of the message are missing.

The degree of articulation can be defined as the phonetic quality of a phoneme, i.e., the extent to which the target sound associated with the phoneme is realized. Moon and Lindblom proposed a contextual model of articulation in which the degree of articulation of vowels is described by three factors: (1) the phonetic context, (2) the phoneme duration, and (3) the spectral rate-of-change. In this thesis, we investigate whether this model can explain variations in the degree of articulation in fluent speech, and how the model may be integrated into a concatenative speech synthesis system.

We focus on the spectral rate-of-change of phonetic transitions, which reflects the articulation effort of the speaker. We design and analyze a balanced database to study the interaction between the spectral rate-of-change and prosodic factors such as stress, accent, word position, and speaking style, and we propose a numerical model to predict the spectral rate-of-change from these factors. We then investigate how the contextual model of articulation can be integrated into a concatenative speech synthesis system by modifying the spectral rate-of-change of acoustic units according to the prosodic context. Spectral modification is realized with a sinusoidal + all-pole representation of the acoustic units, which avoids the shortcomings of modification methods based on inverse filtering. The results show that vowel reduction and coarticulation between sonorants can be produced at synthesis time, reducing the number of speech units needed in the acoustic inventory. Concatenation mismatches between acoustic units are also reduced.
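The abstract does not spell out how the spectral rate-of-change is computed. One plausible operationalization, common in the literature on spectral dynamics, is the frame-to-frame distance of a short-time spectral parameterization divided by the frame shift. The sketch below assumes a cepstral (or formant) trajectory; the function name and arguments are illustrative, not the thesis's implementation.

```python
import numpy as np

def spectral_rate_of_change(cepstra, frame_shift):
    """Frame-to-frame spectral rate-of-change of a parameter trajectory.

    cepstra     : (n_frames, n_coeffs) array of cepstral (or formant) vectors
    frame_shift : analysis frame shift in seconds
    Returns an (n_frames - 1,) array of spectral distance per second.
    """
    diffs = np.diff(cepstra, axis=0)                    # successive frame differences
    return np.linalg.norm(diffs, axis=1) / frame_shift  # distance per unit time
```

Under this reading, a steep transition between two sonorants yields large values (strong articulation effort), while a reduced vowel yields a flat, low-valued trajectory.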
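The sinusoidal + all-pole representation is likewise only named, not described. A minimal sketch of the general idea, as commonly presented in the sinusoidal-modeling literature: an all-pole filter captures the spectral envelope of each frame, and sinusoids at harmonics of the fundamental carry the excitation, so spectral modification can operate on the envelope directly rather than on an inverse-filtered residual. All names and parameter values below are assumptions for illustration, not the thesis's implementation.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """All-pole (LPC) coefficients via the autocorrelation normal equations."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.concatenate(([1.0], np.linalg.solve(R, -r[1:order + 1])))

def harmonic_amplitudes(frame, sr, f0, order=18, n_fft=1024):
    """Amplitudes of sinusoids at multiples of f0, sampled from the
    all-pole envelope |1/A(e^jw)| of one windowed speech frame."""
    a = lpc_coefficients(frame * np.hanning(len(frame)), order)
    envelope = 1.0 / np.abs(np.fft.rfft(a, n_fft))       # gain factor omitted
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    harmonics = np.arange(f0, sr / 2, f0)                # harmonic frequencies in Hz
    return harmonics, np.interp(harmonics, freqs, envelope)
```

Because the envelope is an explicit model parameter, reshaping it (and resampling the harmonic amplitudes) modifies the spectrum without the artifacts that inverse filtering can introduce.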
Wouters, Johan, "Analysis and synthesis of degree of articulation" (2001). Scholar Archive. 163.