Oregon Health & Science University
Speech recognition systems consist of three components: an acoustic model, a pronunciation model, and a language model. The acoustic and language models are typically trained separately and, moreover, optimized for different cost functions. This framework arose from historical and practical considerations, such as the limited availability of training data and computational cost. These constraints are now being overcome. Arguably, learning both models jointly to directly minimize the word error rate will result in a better recognizer.
One of the contributions of this thesis is a detailed investigation of a discriminative framework for jointly learning the parameters of the acoustic, language, and duration models (the latter commonly captured by the acoustic model's parameters). The acoustic state transition parameters, the n-gram language model parameters, and the state duration parameters are learned with a reranking framework previously employed for discriminative language models. We report experiments on the GALE Arabic transcription task, a NIST benchmark, with about 200 hours of training data and two test sets of about 2.5 hours each. Our results demonstrate that the model improves performance by about 1.4-1.6% absolute word error rate (WER) over the baseline system.

Continuing with the joint modeling framework, we next apply it to learn pronunciation variations particular to African American Vernacular English (AAVE) speech. Popular speaker adaptation methods adapt the acoustic models quickly using small amounts of data, for example by estimating a few linear transforms, but such transformations cannot adequately capture systematic phonetic transformations. We investigate strategies for learning phonetic transformations jointly with the discriminative language model, and compare the new models on NPR's StoryCorps corpus, which consists of stories from self-identified AAVE and Standard American English (SAE) speakers. The joint discriminative pronunciation and language model improves the performance of the AAVE recognizer by about 2.0% WER, of which about 0.5% can be attributed to the pronunciation model. Improvements on the SAE data are smaller and mainly attributable to the discriminative language model.
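To make the reranking idea concrete, the following is a minimal sketch of n-best reranking with a joint linear model and a structured-perceptron update. The feature names (`am`, `lm`) and the update rule are illustrative assumptions for exposition, not the exact parameterization or learning algorithm used in the thesis.

```python
# Hypothetical sketch: rerank an n-best list with a linear model whose
# weights jointly cover acoustic, language-model, and duration features.

def score(hypothesis, weights):
    """Dot product of a hypothesis's feature vector with the model weights."""
    return sum(weights.get(f, 0.0) * v for f, v in hypothesis["features"].items())

def rerank(nbest, weights):
    """Return the highest-scoring hypothesis from an n-best list."""
    return max(nbest, key=lambda h: score(h, weights))

def perceptron_update(nbest, oracle_index, weights, lr=1.0):
    """One structured-perceptron step: move the weights toward the oracle
    (lowest-WER) hypothesis and away from the current model best."""
    best = rerank(nbest, weights)
    oracle = nbest[oracle_index]
    if best is not oracle:
        for f, v in oracle["features"].items():
            weights[f] = weights.get(f, 0.0) + lr * v
        for f, v in best["features"].items():
            weights[f] = weights.get(f, 0.0) - lr * v
    return weights
```

In this setup, learning the acoustic transition, n-gram, and duration parameters jointly amounts to placing all of their features in one weight vector and updating them together.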
Finally, we examine how joint modeling of acoustic and lexical variations can improve the performance of a downstream application, a narrative retelling assessment tool. We develop a conditional random field (CRF)-based model that incorporates both kinds of variation, and demonstrate a 6.3% gain over a generative baseline in the F-score for detecting story elements on a clinical task, the Wechsler Logical Memory test.
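Detecting story elements with a linear-chain CRF reduces, at decode time, to Viterbi search over per-token label scores plus label-transition scores. The sketch below shows that decoding step only; the label set and scores are illustrative assumptions, not the thesis's actual feature set or training procedure.

```python
# Hypothetical sketch: Viterbi decoding for a linear-chain CRF, e.g. to
# tag tokens of a retelling with story-element labels.

def viterbi(obs_scores, trans_scores, labels):
    """obs_scores: per-token dicts {label: score};
    trans_scores: {(prev_label, cur_label): score}.
    Returns the highest-scoring label sequence."""
    # best[t][label] = (best path score ending in label at t, back-pointer)
    best = [{l: (obs_scores[0].get(l, 0.0), None) for l in labels}]
    for t in range(1, len(obs_scores)):
        col = {}
        for cur in labels:
            prev_score, prev_label = max(
                (best[t - 1][p][0] + trans_scores.get((p, cur), 0.0), p)
                for p in labels
            )
            col[cur] = (prev_score + obs_scores[t].get(cur, 0.0), prev_label)
        best.append(col)
    # trace back the best path from the final position
    last = max(labels, key=lambda l: best[-1][l][0])
    path = [last]
    for t in range(len(best) - 1, 0, -1):
        last = best[t][last][1]
        path.append(last)
    return list(reversed(path))
```

Acoustic and lexical variations would enter such a model as additional features contributing to the observation and transition scores.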
Center for Spoken Language Understanding
School of Medicine
Lehr, Maider, "Discriminative joint modeling of acoustic and lexical variations for spoken language processing" (2014). Scholar Archive. 3560.