Dept. of Computer Science and Engineering
OGI School of Science and Engineering
Oregon Health & Science University
Prosody plays an important role in discriminating between languages and speakers. Because estimating the relevant prosodic information is complex, most recognition systems rely on the notion that the distribution statistics of the fundamental frequency (as a proxy for pitch) and of the speech energy (as a proxy for loudness/stress) are sufficient to capture prosodic differences between speakers and languages. However, this simplistic notion disregards the temporal aspects of, and the relationships between, the prosodic features that determine phenomena such as intonation and stress. We propose alternative approaches that exploit the joint dynamics of the fundamental frequency and speech energy to capture prosodic differences. The aim is to characterize the intonation, stress, and rhythm patterns produced by the variation of the fundamental frequency and speech energy contours. In these approaches, the continuous speech signal is converted into a sequence of discrete units that describe the signal in terms of the dynamics of the fundamental frequency and speech energy contours. Using simple statistical models, we show that the statistical dependency between such discrete units captures language- and speaker-specific information.

On the extended-data task of the 2001 and 2002 NIST Speaker Recognition Evaluations, this approach achieves a relative improvement of at least 17% over a system based on the distribution statistics of the fundamental frequency, the speech energy, and their deltas. We also show that the proposed approaches are robust to communication-channel effects when compared with the state-of-the-art speaker recognition system. Segmental information is then incorporated to capture dependencies between segmental and prosodic information: a new set of segment classes is estimated from the time alignment between a sequence of phonemes or phones (i.e., the segmental information) and the new prosodic representation. We show that this approach characterizes speaker-dependent information.

Because conventional recognition systems do not fully exploit these different levels of information, their performance improves when they are fused with the proposed approaches. In the 2003 NIST Language Recognition Evaluation, the fusion of the prosodic speech representation and a conventional system yields a relative improvement in performance of 14%. Fusion with the state-of-the-art speaker recognition system achieves relative improvements of about 28% and 12% on the extended-data task of the 2001 and 2002 NIST Speaker Recognition Evaluations, respectively.
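To make the discrete-unit representation concrete, the following is a minimal sketch in Python, assuming frame-level fundamental-frequency (F0) and energy contours have already been extracted. The symbol inventory, the thresholds, and the helper names (contour_to_symbols, train_bigrams, bigram_log_likelihood) are illustrative assumptions for this example, not the exact procedure used in the work; the bigram model simply stands in for the "simple statistical models" mentioned above.

    import numpy as np
    from collections import Counter

    def contour_to_symbols(f0, energy, slope_threshold=0.0):
        """Map per-frame F0/energy contours to discrete joint-dynamics symbols.

        Illustrative quantization: each frame-to-frame transition is labeled by
        the direction of the F0 and energy contours; unvoiced frames (F0 <= 0)
        get their own symbol. Thresholds and labels are assumptions.
        """
        symbols = []
        for i in range(1, len(f0)):
            if f0[i] <= 0 or f0[i - 1] <= 0:   # unvoiced frames carry no pitch
                symbols.append("unvoiced")
                continue
            f0_dir = "F0+" if f0[i] - f0[i - 1] > slope_threshold else "F0-"
            en_dir = "E+" if energy[i] - energy[i - 1] > slope_threshold else "E-"
            symbols.append(f0_dir + en_dir)     # e.g. "F0+E-": rising pitch, falling energy
        return symbols

    def train_bigrams(symbol_sequences):
        """Estimate bigram probabilities from training sequences (one per utterance)."""
        counts, context = Counter(), Counter()
        for seq in symbol_sequences:
            for prev, curr in zip(seq[:-1], seq[1:]):
                counts[(prev, curr)] += 1
                context[prev] += 1
        return {bg: c / context[bg[0]] for bg, c in counts.items()}

    def bigram_log_likelihood(symbols, bigram_probs):
        """Score a test symbol sequence against a speaker- or language-specific model."""
        score = 0.0
        for prev, curr in zip(symbols[:-1], symbols[1:]):
            score += np.log(bigram_probs.get((prev, curr), 1e-6))  # floor unseen bigrams
        return score

In such a setup, one bigram model would be trained per target speaker or language and test utterances scored against each model; the point of the sketch is only that the temporal dependency between successive dynamics symbols, rather than the raw F0 and energy distributions, carries the discriminative information.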
Adami, Andre Gustavo, "Modeling prosodic differences for speaker and language recognition" (2004). Scholar Archive. 18.