Date

October 2001

Document Type

Dissertation

Degree Name

Ph.D.

Department

Dept. of Computer Science and Engineering

Institution

Oregon Health & Science University

Abstract

Speaker identity, the sound of a person's voice, plays an important role in human communication. With speech systems becoming increasingly ubiquitous, Voice Transformation (VT), a technology that modifies a source speaker's speech utterance to sound as if a target speaker had spoken it, offers a number of useful applications. For example, a novice user can adapt a text-to-speech system to speak with a new voice quickly and inexpensively. In this dissertation, we consider new approaches in both the design and the evaluation of VT techniques. We propose a new type of speech corpus that is especially suited to VT research and development because it consists of naturally time-aligned sentences. Consequently, removing individual prosodic characteristics, such as fundamental pitch and durations, requires very little processing and results in high-quality speech samples that differ only in their segmental properties, which are our focus of transformation. These "prosody-normalized" speech samples are used for training VT systems, as well as for evaluating their transformation performance objectively and subjectively. Our baseline transformation system (SET) is based on transforming the spectral envelope as represented by the LPC spectrum, using a harmonic sinusoidal model for analysis and synthesis. The transformation function is implemented as a regressive, joint-density Gaussian mixture model, trained on aligned LSF vectors by an expectation maximization algorithm. We improve upon the baseline by adding a residual prediction module, which predicts target LPC residuals from transformed LPC spectral envelopes, using a classifier and residual codebooks. The resulting high resolution transformation system (HRT) is capable of rendering transformed speech with a high degree of spectral detail. Because of the severe shortcomings of evaluating VT performance objectively, we propose a subjective evaluation strategy consisting of several listening tests.
In a speaker discrimination test, the HRT system performed significantly better than the SET system. However, its discrimination performance remained below that of natural utterances. Similarly, listeners selected the HRT system over other systems in a system comparison test. Finally, listeners rated the speech quality of the HRT system as better than that of the SET system. However, the quality of natural utterances was considered better than that of transformed speech.
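
As an illustration of the regressive, joint-density Gaussian mixture model mentioned in the abstract, the following is a minimal sketch of how such a model can map a source feature vector to a target feature vector. It is not the dissertation's implementation. It assumes the mixture parameters have already been estimated (for example by expectation maximization) over joint vectors z = [x; y], where x is a source LSF vector and y is the time-aligned target LSF vector. The function names `gaussian_logpdf` and `gmm_regress` and the parameter layout are hypothetical, chosen only for this example.

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log density of a multivariate normal evaluated at one vector x."""
    d = x.shape[0]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def gmm_regress(x, weights, means, covs):
    """Return E[y | x] under a joint-density GMM over z = [x; y].

    Assumed (hypothetical) layout: the first len(x) entries of each
    component mean, and the matching block of each covariance,
    describe the source features; the remaining entries describe the
    target features.
    """
    d = x.shape[0]
    # Posterior probability of each component given the source vector,
    # computed from the marginal density over x only.
    log_post = np.array([
        np.log(w) + gaussian_logpdf(x, m[:d], c[:d, :d])
        for w, m, c in zip(weights, means, covs)
    ])
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    # Posterior-weighted sum of the per-component linear regressions
    # mu_y + S_yx S_xx^{-1} (x - mu_x).
    y = np.zeros(means[0].shape[0] - d)
    for p, m, c in zip(post, means, covs):
        y += p * (m[d:] + c[d:, :d] @ np.linalg.solve(c[:d, :d], x - m[:d]))
    return y
```

With a single mixture component this reduces to ordinary linear regression between source and target features. With several components, each component contributes its own local linear mapping, weighted by how likely it is to have generated the source vector.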

Identifier

doi:10.6083/M41J97PX

School

OGI School of Science and Engineering
