Author

Qi Miao

Date

March 2012

Document Type

Thesis

Degree Name

M.S.

Department

Dept. of Biomedical Engineering

Institution

Oregon Health & Science University

Abstract

Concatenative synthesis is currently the most widely used Text-to-Speech (TTS) framework. However, it cannot guarantee that both the target cost and the concatenation cost are minimized at the same time. As a result, the units selected for concatenation may come from entirely different phonemic and prosodic contexts, which can lead to audible discontinuities in the output speech at the concatenation points. Various speech modification methods have been studied and applied during concatenation. In most cases they can create a locally smooth transition between two units, but the resulting speech may be far from the target. In a previous study, a linear cross-fading weight function was used to remove spectral and time-domain discontinuities during concatenative speech synthesis. We learned that concatenation through a linearly weighted cross-fading function can produce smooth yet unnaturally shaped formant trajectories; in addition, we noted that the precise details of how to cross-fade a specific pair of units may be highly context dependent. We propose a new algorithm that uses a unit-dependent parameterized cross-fading weight function to create more natural-looking formant trajectories and, it is hoped, better-sounding output speech. The proposed algorithm uses a perceptually based objective function to capture differences between cross-faded and natural trajectories across the whole region of the phoneme, and uses the phoneme identity, prosodic context, and acoustic features of the units to predict optimal cross-fading parameters. This thesis reports a study on the feasibility of developing such perceptual cost functions. A special corpus was designed to produce a variety of formant frequency trajectory shapes in different linguistic environments. A perceptual experiment was performed to determine whether the perceptual quality of output speech could be predicted from acoustic distance measures. We generated a range of synthetic/natural stimulus pairs, where the synthetic stimuli were generated using three types of cross-fading models applied to different regions of the vowel. The results show that the perceptual cost function can be reliably predicted from the distance measures. Moreover, the results support our hypotheses that: a) the quality of the output speech is influenced by the shape of the formant trajectories across the entire vowel; and b) human perceptual scores are correlated with both the absolute distance and the first derivative of the absolute distance of the formant trajectories.
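To make the contrast described above concrete, the sketch below illustrates a linear cross-fade, a unit-dependent parameterized cross-fade, and a trajectory distance measure of the kind named in hypothesis (b). It is a minimal illustration, not the thesis's implementation: the sigmoid weight (with hypothetical center and steepness parameters standing in for the unit-dependent parameters predicted from phoneme identity, prosodic context, and acoustic features) and the weighting constants alpha and beta are assumptions, since the abstract does not specify the functional forms.

```python
import numpy as np

def linear_crossfade(left, right):
    """Baseline scheme from the previous study: the weight ramps
    linearly from 0 to 1 across the overlap region, which can yield
    smooth but unnaturally shaped formant trajectories."""
    w = np.linspace(0.0, 1.0, len(left))
    return (1.0 - w) * left + w * right

def parameterized_crossfade(left, right, center=0.5, steepness=10.0):
    """Unit-dependent cross-fade with a hypothetical sigmoid weight.
    `center` and `steepness` stand in for the per-unit parameters the
    proposed algorithm would predict from phoneme identity, prosodic
    context, and acoustic features."""
    t = np.linspace(0.0, 1.0, len(left))
    w = 1.0 / (1.0 + np.exp(-steepness * (t - center)))
    return (1.0 - w) * left + w * right

def trajectory_cost(synthetic, natural, alpha=1.0, beta=1.0):
    """Distance over the whole vowel region, combining the absolute
    distance and the first derivative of the absolute distance
    between formant trajectories (hypothesis b). `alpha` and `beta`
    are assumed weighting constants."""
    d = np.abs(synthetic - natural)              # absolute distance
    return alpha * d.mean() + beta * np.abs(np.diff(d)).mean()

# Example: cross-fade two overlapping F2 trajectories (Hz) and score
# each result against a natural reference trajectory.
left = np.linspace(1800.0, 1500.0, 50)       # trajectory of unit A
right = np.linspace(1400.0, 1700.0, 50)      # trajectory of unit B
natural = np.linspace(1750.0, 1650.0, 50)    # natural reference
for fade in (linear_crossfade, parameterized_crossfade):
    print(fade.__name__, trajectory_cost(fade(left, right), natural))
```

In this framing, predicting the per-unit weight parameters is what makes the cross-fade unit dependent; the thesis's actual parameterization and perceptual cost function may differ from this sketch.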

Identifier

doi:10.6083/M4ZG6Q7F

Division

Center for Spoken Language Understanding

School

School of Medicine
