April 2004


Dept. of Computer Science and Engineering


Oregon Health & Science University


Prosody plays an important role in discriminating between languages and speakers. Because estimating relevant prosodic information is complex, most recognition systems rely on the notion that distribution statistics of the fundamental frequency (a proxy for pitch) and speech energy (a proxy for loudness and stress) can capture prosodic differences between speakers and languages. This simplistic notion, however, disregards the temporal aspects of, and the relationships between, the prosodic features that determine phenomena such as intonation and stress. We propose alternative approaches that exploit the dynamics of the fundamental frequency and speech energy to capture prosodic differences. The aim is to characterize the intonation, stress, and rhythm patterns produced by variation in the fundamental frequency and speech energy contours. In these approaches, the continuous speech signal is converted into a sequence of discrete units that describe the signal in terms of the dynamics of the fundamental frequency and speech energy contours. Using simple statistical models, we show that the statistical dependency between such discrete units captures language- and speaker-specific information. On the extended-data task of the 2001 and 2002 NIST Speaker Recognition Evaluations, such an approach achieves a relative improvement of at least 17% over a system based on the distribution statistics of fundamental frequency, speech energy, and their deltas. We also show that the proposed approaches are robust to communication-channel effects compared with a state-of-the-art speaker recognition system.

Segmental information is then incorporated to capture dependencies between segmental and prosodic information. In this approach, a new set of segment classes is estimated from the time alignment between a sequence of phonemes or phones (i.e., segmental information) and the new prosodic representation.
We show that this approach characterizes speaker-dependent information. Since conventional recognition systems do not fully exploit these different levels of information, we show that their performance improves when the proposed approaches are fused with them. In the 2003 NIST Language Recognition Evaluation, fusing the prosodic speech representation with a conventional system yields a relative improvement in performance of 14%. Fusing with the state-of-the-art speaker recognition system achieves relative improvements of about 28% and 12% on the extended-data task of the 2001 and 2002 NIST Speaker Recognition Evaluations, respectively.
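The core idea described above can be sketched in a few lines. The following is a minimal illustration, not the dissertation's actual implementation: per-frame fundamental-frequency and energy values are mapped to discrete symbols encoding the joint direction of change of the two contours, and smoothed bigram statistics over those symbols serve as a simple model of their dynamics. All function names, thresholds, and symbol definitions here are assumptions made for illustration.

```python
# Sketch: discretize joint F0/energy contour dynamics into symbols,
# then model the symbol sequence with bigram statistics.
# Thresholds and symbol definitions are illustrative assumptions,
# not the dissertation's actual parameters.
import math
from collections import Counter
from itertools import product

def contour_symbols(f0, energy, eps=1e-3):
    """Map per-frame (F0, energy) pairs to symbols encoding the joint
    direction of change, e.g. 'RF' = F0 rising while energy falls."""
    def direction(prev, cur):
        d = cur - prev
        if d > eps:
            return 'R'   # rising
        if d < -eps:
            return 'F'   # falling
        return 'S'       # steady
    return [direction(f0[i - 1], f0[i]) + direction(energy[i - 1], energy[i])
            for i in range(1, len(f0))]

def bigram_log_probs(symbols, alphabet, alpha=1.0):
    """Add-alpha smoothed bigram log-probabilities over the alphabet."""
    counts = Counter(zip(symbols, symbols[1:]))  # bigram counts
    ctx = Counter(symbols[:-1])                  # context (unigram) counts
    v = len(alphabet)
    return {(a, b): math.log((counts[(a, b)] + alpha) / (ctx[a] + alpha * v))
            for a, b in product(alphabet, repeat=2)}

def score(symbols, model):
    """Average bigram log-likelihood of a symbol sequence under a model."""
    pairs = list(zip(symbols, symbols[1:]))
    return sum(model[p] for p in pairs) / len(pairs)
```

In use, one would train a bigram model per speaker (or language) from its contour symbol sequences and score test utterances against each model; a higher average log-likelihood suggests dynamics closer to those of the modeled class.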




OGI School of Science and Engineering


