Dept. of Computer Science and Engineering
Oregon Graduate Institute of Science & Technology
One requirement for researching and building spoken language systems is the availability of speech data that have been labeled and time-aligned at the phonetic level. Although manual phonetic alignment is considered more accurate than automatic methods, it is too time-consuming to be commonly used for aligning large corpora. One reason for the greater accuracy of human labeling is that humans are better able to locate distinct events in the speech signal that correspond to specific phonetic characteristics. The development of the proposed method was motivated by the belief that if an automatic alignment method were to use such acoustic-phonetic information, its accuracy would come closer to that of human labelers. Our hypothesis is that the integration of acoustic-phonetic information into a state-of-the-art automatic phonetic alignment system will significantly improve its accuracy and robustness. In developing an alignment system that uses acoustic-phonetic information, we use a measure of intensity discrimination in detecting voicing, glottalization, and burst-related impulses. We propose and implement a method of voicing determination that has an average accuracy of 97.25% (an average 58% reduction in error over a baseline system), a fundamental-frequency extraction method with an average absolute error of 3.12 Hz (a 45% reduction in error), and a method for detecting burst-related impulses with an accuracy of 86.8% on the TIMIT corpus (a 45% reduction in error compared to reported results). In addition to these features, we propose a means of using acoustics-dependent transition information in the HMM framework. One aspect of the successful implementation of this method is the use of distinctive phonetic features. To evaluate the proposed and baseline phonetic alignment systems, we measure agreement with manual alignments and robustness. On the TIMIT corpus, the proposed method has 92.57% agreement within 20 msec. The average agreement of the proposed method represents a 28% reduction in error over our state-of-the-art baseline system. In measuring robustness, the proposed method has a 14% lower standard deviation when evaluated on 12 versions of the TIMIT corpus.
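The two evaluation quantities quoted above, agreement within a tolerance and relative reduction in error, can be illustrated with a minimal sketch. This is not the evaluation code used in the thesis. The function names are hypothetical, boundaries are assumed to be given in milliseconds with a one-to-one correspondence between automatic and manual boundaries, and treating a difference of exactly 20 msec as agreement is an assumption. All numbers in the example are made up for illustration.

```python
def agreement_within(auto_ms, manual_ms, tol_ms=20.0):
    """Percentage of automatic boundaries within tol_ms of the manual ones.

    Assumes both lists describe the same boundaries in the same order
    (hypothetical input format, not taken from the thesis).
    """
    if len(auto_ms) != len(manual_ms):
        raise ValueError("boundary lists must have the same length")
    if not auto_ms:
        raise ValueError("at least one boundary is required")
    # Assumption: a difference exactly equal to the tolerance counts as agreement.
    hits = sum(1 for a, m in zip(auto_ms, manual_ms) if abs(a - m) <= tol_ms)
    return 100.0 * hits / len(auto_ms)


def relative_error_reduction(baseline_pct, proposed_pct):
    """Relative reduction in error (percent) between two accuracy figures.

    Error is taken to be 100 minus the accuracy (or agreement) percentage.
    """
    baseline_err = 100.0 - baseline_pct
    proposed_err = 100.0 - proposed_pct
    if baseline_err <= 0:
        raise ValueError("baseline has no error to reduce")
    return 100.0 * (baseline_err - proposed_err) / baseline_err


if __name__ == "__main__":
    # Made-up boundary times in milliseconds, for illustration only.
    manual = [100.0, 250.0, 400.0, 610.0]
    auto = [105.0, 275.0, 395.0, 612.0]
    print(agreement_within(auto, manual))        # 3 of 4 boundaries agree
    # Made-up accuracies: error drops from 20% to 10%.
    print(relative_error_reduction(80.0, 90.0))
```

Under this definition, a reduction in error is measured against the baseline's error rate rather than its accuracy, which is why a change of a few percentage points in agreement can correspond to a much larger relative reduction.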
Hosom, John-Paul, "Automatic time alignment of phonemes using acoustic-phonetic information" (2000). Scholar Archive. 175.