Dept. of Computer Science and Electrical Engineering
Oregon Health & Science University
New language constantly emerges from complex, collaborative human-human interactions like meetings, such as when a presenter handwrites a new term on a whiteboard while redundantly saying it. Fixed-vocabulary recognizers fail on such new terms, which are often critical to dialogue understanding. This dissertation presents SHACER, our Speech and HAndwriting reCognizER (pronounced "shaker"). SHACER learns out-of-vocabulary terms dynamically by integrating information from instances of redundant handwriting and speaking. SHACER can automatically populate an MS Project™ Gantt chart by observing a whiteboard scheduling meeting. To document the occurrence and importance of such multimodal redundancy, we examine (1) whiteboard presentations, (2) a spontaneous brainstorming meeting, and (3) informal annotation discussions about travel photographs. Averaged across these three contexts, 96.5% of handwritten words were also spoken redundantly. We also find that redundantly presented terms are (a) highly topic-specific and thus likely to be out-of-vocabulary, (b) more memorable, and (c) significantly better query terms for later search and retrieval. To combine information, SHACER normalizes handwriting and speech recognizer outputs by applying letter-to-sound and sound-to-letter transformations. SHACER then uses an articulatory-feature-based distance metric to align handwriting to redundant speech. Phone sequence information from that aligned segment then constrains a second-pass phone recognition over cached speech features. The resulting refined pronunciation serves as a measure against which the integration of all orthographic and pronunciation hypotheses is scored. High-scoring integrations are enrolled in the system's dictionaries and reinforcement tables. When a presenter subsequently says a newly enrolled term, it is more easily recognized.
If an abbreviation is handwritten at the same time, then the already recognized spelling is compared to the handwriting hypotheses. If there is a first-letter or prefix match, then the handwritten abbreviation dynamically acquires that full spelling as its expanded meaning. On a held-out test set, SHACER significantly reduced the absolute number of recognition errors for abbreviated Gantt chart labels, by 37%. For cognitive systems to be accepted as cooperative assistants, they need to learn as easily as humans. Dynamically learning new vocabulary, as SHACER does by leveraging multimodal redundancy, is a significant step in that direction.
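The abbreviation step described above can be illustrated with a minimal sketch. It is not SHACER's implementation: the function names, the reading of "first-letter match" as a match against word initials, the reading of "prefix match" as a proper prefix of the spelling, and the example terms are all assumptions made for illustration.

```python
from typing import Iterable, Optional


def first_letter_match(abbrev: str, spelling: str) -> bool:
    """True if abbrev equals the initials of the words in spelling (assumed reading)."""
    initials = "".join(word[0] for word in spelling.split())
    return bool(abbrev) and abbrev.lower() == initials.lower()


def prefix_match(abbrev: str, spelling: str) -> bool:
    """True if abbrev is a proper prefix of spelling, ignoring case and spaces."""
    a = abbrev.lower().rstrip(".")
    s = spelling.lower().replace(" ", "")
    return 0 < len(a) < len(s) and s.startswith(a)


def expand_abbreviation(hw_hypotheses: Iterable[str], spelling: str) -> Optional[str]:
    """Return the first handwriting hypothesis that abbreviates spelling, else None."""
    for hyp in hw_hypotheses:
        if first_letter_match(hyp, spelling) or prefix_match(hyp, spelling):
            return hyp
    return None


# Hypothetical example: the speech side already yielded a full spelling,
# and the handwriting recognizer produced a ranked list of hypotheses.
expansions = {}
match = expand_abbreviation(["J8", "JB"], "Joe Browning")
if match is not None:
    expansions[match] = "Joe Browning"  # abbreviation acquires its expanded meaning
```

In this sketch the handwriting hypotheses are checked in rank order, so a lower-ranked hypothesis can still be selected when it is the one that agrees with the spoken term.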
OGI School of Science and Engineering
Kaiser, Edward C., "Leveraging multimodal redundancy for dynamic learning, with SHACER, a speech and handwriting recognizer" (2007). Scholar Archive. 113.