Maider Lehr




Oregon Health & Science University


Speech recognition systems consist of three components: the acoustic model, the pronunciation model and the language model. The acoustic and language models are typically learned separately and, furthermore, optimized for different cost functions. This framework is the result of historical and practical considerations, such as the limited availability of training data and the computational cost of joint training. These constraints are now being overcome. Arguably, learning both models jointly to directly minimize the word error rate will result in a better recognizer.
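Since word error rate is the optimization target argued for above, it may help to make the metric concrete. A minimal sketch (not code from the thesis): WER is the word-level Levenshtein distance between reference and hypothesis, normalized by reference length.

```python
def wer(ref, hyp):
    """Word error rate: word-level edit distance (substitutions,
    insertions, deletions) divided by the reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```

For example, `wer("the cat sat", "the hat sat")` is 1/3, since one substitution is needed over a three-word reference.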

One of the contributions of this thesis is a detailed investigation of a discriminative framework to jointly learn the parameters of the acoustic, language and duration models (the latter commonly captured within the acoustic model parameters). The acoustic state transition parameters, the n-gram language model parameters and the state duration parameters are learned using a reranking framework, which has previously been employed in discriminative language modeling [156]. We report experiments on the GALE Arabic transcription task, a NIST benchmark, with about 200 hours of training data and two test sets of about 2.5 hours each. Our results demonstrate that our model reduces word error rate by about 1.4-1.6% absolute over the baseline system.

Next, we apply the joint modeling framework to learn pronunciation variations particular to African American Vernacular English (AAVE) speech. Popular speaker adaptation methods adapt the acoustic models quickly using small amounts of data, for example, by estimating a few linear transforms. Such transformations are incapable of appropriately capturing systematic phonetic transformations. We investigate strategies for learning phonetic transformations jointly with the discriminative language model. We compare our new models on NPR's StoryCorps corpus, which consists of stories from self-identified AAVE and Standard American English (SAE) speakers. The joint discriminative pronunciation and language model improves the performance of the AAVE recognizer by about 2.0% absolute WER, of which about 0.5% can be attributed to the pronunciation models. Improvements on the SAE data are smaller and mainly attributable to the discriminative language model.
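The reranking framework used throughout these experiments can be illustrated with a toy sketch (not the thesis implementation, which learns acoustic, language and duration parameters jointly): a structured perceptron scores each hypothesis in an n-best list with sparse n-gram features, and updates weights toward the oracle (lowest-WER) hypothesis whenever the current top-scoring one differs.

```python
from collections import defaultdict

def bigram_features(words):
    """Illustrative sparse feature set: unigram and bigram counts."""
    feats = defaultdict(int)
    for w in words:
        feats["u:" + w] += 1
    for a, b in zip(words, words[1:]):
        feats["b:" + a + "_" + b] += 1
    return feats

def score(weights, feats):
    return sum(weights[f] * v for f, v in feats.items())

def perceptron_rerank(nbest_lists, oracles, epochs=5):
    """For each utterance, promote the oracle hypothesis and demote the
    current top-scoring hypothesis (standard perceptron reranking update)."""
    w = defaultdict(float)
    for _ in range(epochs):
        for hyps, oracle in zip(nbest_lists, oracles):
            best = max(hyps, key=lambda h: score(w, bigram_features(h)))
            if best != oracle:
                for f, v in bigram_features(oracle).items():
                    w[f] += v
                for f, v in bigram_features(best).items():
                    w[f] -= v
    return w
```

In practice the feature set would also include the baseline recognizer's acoustic and language model scores, so the reranker corrects rather than replaces the first-pass model.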

Finally, we examine how joint modeling of acoustic and lexical variations can improve the performance of a downstream application, a narrative retelling assessment tool. We develop a conditional random field (CRF)-based model to incorporate both variations, and demonstrate gains of 6.3% over a generative baseline in the F-score of detecting story elements on a clinical task, the Wechsler Logical Memory test.




Center for Spoken Language Understanding


School of Medicine


