Author

Johan Wouters

Date

June 2001

Document Type

Dissertation

Degree Name

Ph.D.

Department

Dept. of Computer Science and Engineering

Institution

Oregon Graduate Institute of Science & Technology

Abstract

Human speakers indicate the relative importance of syllables or words in an utterance through variations in intonation and speaking rate, as well as variations in the degree of articulation and vocal effort. To generate natural-sounding computer speech, corresponding acoustic variables, such as the fundamental frequency, phoneme durations, formant frequencies, and spectral balance, must be accurately controlled. However, in most speech synthesis systems, parameters related to the degree of articulation are not explicitly controlled. The resulting speech often sounds over-articulated, and requires a high listening effort as acoustic cues related to the semantic and linguistic structure of the message are missing. The degree of articulation can be defined as the phonetic quality of a phoneme, i.e., the extent to which the target sound associated with a phoneme is realized. Moon and Lindblom have proposed a contextual model of articulation, in which the degree of articulation of vowels is described by three factors: (1) the phonetic context, (2) the phoneme duration, and (3) the spectral rate-of-change. In this thesis, we investigate whether this model can explain variations of the degree of articulation in fluent speech, and how the model may be integrated in a concatenative speech synthesis system. We focus on the spectral rate-of-change of phonetic transitions, which reflects the articulation effort used by a speaker. We design and analyze a balanced database to study the interaction between the spectral rate-of-change and prosodic factors such as stress, accent, word position, and speaking style. A numerical model is proposed to predict the spectral rate-of-change from prosodic factors. Then, we investigate how the contextual model of articulation can be integrated in a concatenative speech synthesis system, by modifying the spectral rate-of-change of acoustic units according to the prosodic context. 
Spectral modification is realized using a sinusoidal + all-pole representation of the acoustic units, which avoids shortcomings of modification methods based on inverse filtering. The results show that vowel reduction and coarticulation between sonorants can be produced at synthesis time, reducing the number of speech units needed in the acoustic inventory. Concatenation mismatch between acoustic units is also successfully diminished.

Identifier

doi:10.6083/M4XG9P2M
