Oregon Health & Science University
Samples of everyday conversations are being collected and analyzed in a growing number of applications, ranging from studying behavior in social psychology to clinical assessment of voice pathology and even cognitive function. Aside from the spoken words, the acoustic properties of speech samples can provide important cues in these applications.
The goal of this study is to develop novel algorithms for robust and accurate estimation of speech features and to employ them in building probabilistic speech models for characterizing and analyzing clinical speech. Specifically, we aim for accurate and reliable estimation of voiced segments, fundamental frequency, harmonic-to-noise ratio (HNR), jitter, and shimmer. Towards this goal, we adopt a harmonic model (HM) of speech, overcome certain drawbacks of this model, and introduce an improved version of the HM that yields accurate and reliable estimates of these features. We evaluate the performance of our improved HM against other state-of-the-art techniques for voicing detection and pitch estimation on the Keele data set. Through extensive experiments under several noisy conditions, we demonstrate that the proposed improvements provide substantial gains over other popular methods across different noise levels and environments. We also employ our improved harmonic model to develop a novel algorithm for estimating jitter, shimmer, and HNR that is less sensitive to noise and can capture variations within the frame.
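For orientation, a harmonic model represents voiced speech as a sum of sinusoids at integer multiples of the fundamental frequency, fit to each frame by least squares; pitch can then be chosen as the candidate f0 with the smallest residual, and HNR follows from the ratio of harmonic to residual energy. The minimal sketch below illustrates only this baseline idea, not the improved HM of the thesis; all function names and the grid-search parameters are illustrative assumptions.

```python
import numpy as np

def harmonic_fit(frame, f0, fs, n_harmonics=10):
    """Least-squares fit of a harmonic-plus-DC basis at candidate pitch f0.

    Returns the fitted harmonic component and the residual (noise) part.
    """
    t = np.arange(len(frame)) / fs
    cols = []
    for k in range(1, n_harmonics + 1):
        cols.append(np.cos(2 * np.pi * k * f0 * t))
        cols.append(np.sin(2 * np.pi * k * f0 * t))
    A = np.column_stack(cols + [np.ones_like(t)])  # harmonic basis + DC term
    coeffs, *_ = np.linalg.lstsq(A, frame, rcond=None)
    harmonic = A @ coeffs
    return harmonic, frame - harmonic

def estimate_f0_hnr(frame, fs, f0_grid):
    """Coarse grid search over candidate f0; real systems add continuity
    constraints or priors to avoid subharmonic (octave) errors."""
    best = None
    for f0 in f0_grid:
        h, r = harmonic_fit(frame, f0, fs)
        err = np.sum(r ** 2)
        if best is None or err < best[1]:
            best = (f0, err, h, r)
    f0, _, h, r = best
    hnr_db = 10 * np.log10(np.sum(h ** 2) / max(np.sum(r ** 2), 1e-12))
    return f0, hnr_db
```

The harmonic-energy-to-residual ratio here plays the role of HNR; the thesis's contribution lies in making such estimates robust to noise, which this plain least-squares sketch does not address.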
We further verify the robustness of these measures for detecting disordered voices due to Parkinson's disease (PD). Next, we turn our attention to the utility of the developed measures in clinical applications. We perform empirical studies on speech-based assessment of cognitive impairments including PD, autism spectrum disorder (ASD), and clinical depression. We demonstrate that the severity of PD can be inferred from speech with a mean absolute error of about 5.5, explaining 61% of the variance and performing consistently well above chance. Leveraging the same mechanisms developed for inferring PD, we detect children with ASD and classify them into four categories. We find that our features improve the performance, measured in terms of unweighted average recall (UAR), of detecting ASD by 2.3% and of classifying the disorder into four categories by 2.8% compared to a state-of-the-art baseline. We also examine the use of our features in detecting clinical depression in adolescents, conducting experiments that compare the performance of our developed features with that of features obtained from openSMILE, a standard feature extraction tool. Our experiments show that our HM-based features improve the performance of detecting depression from spoken utterances for speaker-level decisions. Finally, we explore the feasibility of detecting social contexts from audio recordings of everyday life, such as life-logs.
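As a reference point for the perturbation measures named above, the classical frame-level definitions of local jitter and shimmer are simple relative averages of cycle-to-cycle differences in period and amplitude. The sketch below shows only these conventional formulas, which the thesis's HM-based algorithm improves upon by also capturing variation within a frame; the function names are illustrative assumptions.

```python
import numpy as np

def local_jitter(periods):
    """Local jitter (%): mean absolute difference between consecutive
    pitch periods, relative to the mean period."""
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods) * 100.0

def local_shimmer(amplitudes):
    """Local shimmer (%): the analogous measure on peak amplitudes
    of consecutive cycles."""
    amps = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(amps))) / np.mean(amps) * 100.0
```

Both measures depend on first extracting per-cycle periods and amplitudes, which is exactly the step that degrades in noisy recordings; this motivates the model-based estimation pursued in the thesis.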
Again, we find that the features developed in this thesis outperform MFCC and openSMILE features on these tasks. This holds even when we apply recently developed deep neural networks (DNNs) for classification, achieving classification accuracies as high as 87.7% and 86.8% for speakers' location and activity, respectively.
Center for Spoken Language Understanding
School of Medicine
Asgari, Meysam, "Algorithms for extracting robust and accurate speech features and their application in clinical domain" (2014). Scholar Archive. 3555.