Date

August 1997

Document Type

Dissertation

Degree Name

Ph.D.

Department

Dept. of Computer Science and Engineering

Institution

Oregon Graduate Institute of Science & Technology

Abstract

Automatic speech recognition (ASR) is performed imperfectly by computers. For some designated part (e.g., word or phrase) of the ASR output, rejection is deciding (yes or no) whether it is correct, and confidence is the probability (0.0 to 1.0) of it being correct. This thesis presents new methods of rejecting errors and estimating confidence for telephone speech. These are also called word or utterance verification and can be used in wordspotting or voice-response systems. Open-set or out-of-vocabulary situations are a primary focus. Language models are not considered. In vocabulary-dependent rejection all words in the target vocabulary are known in advance and a strategy can be developed for confirming each word. A word-specific artificial neural network (ANN) is shown to discriminate well, and scores from such ANNs are shown on a closed-set recognition task to reorder the N-best hypothesis list (N=3) for improved recognition performance. Segment-based duration and perceptual linear prediction (PLP) features are shown to perform well for such ANNs. The majority of the thesis concerns vocabulary and task-independent confidence and rejection based on phonetic word models. These can be computed for words even when no training examples of those words have been seen. New techniques are developed using phoneme ranks instead of probabilities in each frame. These are shown to perform as well as the best other methods examined despite the data reduction involved. Certain new weighted averaging schemes are studied but found to give no performance benefit. Hierarchical averaging is shown to improve performance significantly: frame scores combine to make segment (phoneme state) scores, which combine to make phoneme scores, which combine to make word scores. Use of intermediate syllable scores is shown to not affect performance. Normalizing frame scores by an average of the top probabilities in each frame is shown to improve performance significantly. Perplexity of the wrong-word set is shown to be an important factor in computing the impostor probability used in the likelihood ratio. Bootstrap parameter estimation techniques are used to assess the significance of performance differences.

Identifier

doi:10.6083/M4D798B5

Share

COinS
 
 

To view the content in your browser, please download Adobe Reader or, alternately,
you may Download the file to your hard drive.

NOTE: The latest versions of Adobe Reader do not support viewing PDF files within Firefox on Mac OS and if you are using a modern (Intel) Mac, there is no official plugin for viewing PDF files within the browser window.