Dept. of Biomedical Engineering
Oregon Health & Science University
Concatenative synthesis is currently the most widely used Text-to-Speech (TTS) framework. However, it cannot guarantee that the target cost and the concatenation cost are minimized simultaneously. As a result, the units selected for concatenation may come from entirely different phonemic and prosodic contexts, which can lead to audible discontinuities in the output speech at the concatenation points. Various speech modification methods have been studied and applied during concatenation. In most cases they can create a locally smooth transition between two units, but the resulting speech may be far from the target. In a previous study, a linear cross-fading weight function was used to remove spectral and time-domain discontinuities during concatenative speech synthesis. We learned that concatenation through a linearly weighted cross-fading function can produce smooth yet unnaturally shaped formant trajectories; in addition, we noted that the precise details of how to cross-fade a specific pair of units may be highly context dependent. We propose a new algorithm that uses a unit-dependent parameterized cross-fading weight function to create more natural-looking formant trajectories and, it is hoped, better-sounding output speech. The proposed algorithm uses a perceptually based objective function to capture differences between cross-faded and natural trajectories across the whole region of the phoneme, and uses the phoneme identity, prosodic contexts, and acoustic features of the units to predict optimal cross-fading parameters. This thesis reports a study on the feasibility of developing such perceptual cost functions. A special corpus was designed to produce a variety of formant frequency trajectory shapes in different linguistic environments. A perceptual experiment was performed to determine whether the perceptual quality of the output speech can be predicted from acoustic distance measures.
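The contrast between a fixed linear cross-fade and a unit-dependent parameterized cross-fade can be sketched as follows. This is an illustrative sketch only: the sigmoid form and its `center`/`slope` parameters are assumptions for demonstration, not the specific parameterization developed in the thesis.

```python
import numpy as np

def linear_crossfade(unit_a, unit_b):
    """Baseline scheme: cross-fade two equal-length formant-trajectory
    segments with a linear weight that falls from 1 to 0 on unit_a."""
    n = len(unit_a)
    w = np.linspace(1.0, 0.0, n)          # linear weight on unit_a
    return w * np.asarray(unit_a, float) + (1.0 - w) * np.asarray(unit_b, float)

def sigmoid_crossfade(unit_a, unit_b, center=0.5, slope=10.0):
    """Hypothetical unit-dependent cross-fade: a sigmoid weight whose
    `center` and `slope` could be predicted per unit pair from phoneme
    identity, prosodic context, and acoustic features (assumed form)."""
    n = len(unit_a)
    t = np.linspace(0.0, 1.0, n)
    w = 1.0 / (1.0 + np.exp(slope * (t - center)))  # falls from ~1 to ~0
    return w * np.asarray(unit_a, float) + (1.0 - w) * np.asarray(unit_b, float)
```

Under this sketch, predicting `center` and `slope` per unit pair is what makes the cross-fade "unit-dependent": a steep slope approximates an abrupt splice, while a shallow one approaches the linear baseline.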
We generated a range of synthetic/natural stimulus pairs, where the synthetic stimuli were produced using three types of cross-fading models applied to different regions of the vowel. The results show that the perceptual cost function can be reliably predicted from the distance measures. Moreover, the results support our hypotheses that: a) the quality of the output speech is influenced by the shape of the formant trajectories across the entire vowel; and b) human perceptual scores are correlated with both the absolute distance and the first derivative of the absolute distance between the formant trajectories.
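The two distance measures named in hypothesis (b) can be sketched as below. The exact weighting and normalization used in the thesis are not reproduced; the averages here are an assumption for illustration.

```python
import numpy as np

def trajectory_distances(synthetic_f, natural_f):
    """Compute two distance measures between a cross-faded and a natural
    formant trajectory sampled at the same frames across the vowel:
      - the mean absolute distance |synthetic - natural|, and
      - the mean absolute first derivative (frame-to-frame change) of
        that distance.
    Averaging over frames is an illustrative choice, not the thesis's
    exact formulation."""
    d = np.abs(np.asarray(synthetic_f, float) - np.asarray(natural_f, float))
    abs_dist = d.mean()                     # average |difference| over the vowel
    deriv_dist = np.abs(np.diff(d)).mean()  # average |change in difference|
    return abs_dist, deriv_dist
```

The second measure penalizes trajectories whose error changes rapidly across the vowel, capturing the hypothesis that perceived quality depends on trajectory shape over the whole region, not only on pointwise error.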
Center for Spoken Language Understanding
School of Medicine
Miao, Qi, "Perceptual cost function for cross-fading based concatenation" (2012). Scholar Archive. 720.