Addressing Pitch Mismatch for Children's Automatic Speech Recognition

This thesis addresses the acoustic mismatch caused by pitch differences between adults' and children's speech when children's automatic speech recognition (ASR) is performed on models trained on adults' speech. The work is motivated by a study of several acoustic sources of mismatch: pitch, speaking rate, formant frequencies and glottal flow parameters (open quotient, return quotient and speed quotient). The effect of variation in each of these acoustic correlates across speech signals is studied on Mel frequency cepstral coefficient (MFCC) features and on ASR models. Their relative significance is then explored for children's speech recognition on adults' speech trained models in a consistent setup. It is found that, apart from the formant frequencies, pitch is the other major source of acoustic mismatch between adults' and children's speech. An increase in the pitch of the signals is found to significantly increase the dynamic range, and in turn the variances, of the higher-order coefficients of the MFCC (C0–C12) features. Motivated by this, the pitch robustness of perceptual linear prediction cepstral coefficient (PLPCC) and perceptual minimum variance distortionless response (PMVDR) cepstral coefficient features is studied to assess their efficacy, relative to MFCC features, for children's ASR on adults' speech trained models. It is found that MFCC features outperform PLPCC features, while PMVDR features, with suitable optimization of the model order, are more pitch-robust than MFCC features. However, the children's ASR performance obtained with MFCC features after explicit pitch normalization of children's speech is found to be comparable to that obtained with PMVDR features after optimization of the model order for children's speech.
Following these observations, a pitch normalization algorithm is proposed for children's ASR on adults' speech trained models. It modifies the Mel filterbank during MFCC feature extraction for the test data, based on the average pitch of the test signal. A Mel cepstral truncation based method is also proposed for reducing the pitch mismatch between the training and the test data. This algorithm automatically selects the appropriate length of the base MFCC features for each test signal, without prior knowledge about the speaker of the test utterance. Significant improvements in children's speech recognition performance are obtained with the proposed algorithms on the adults' speech trained models. With the proposed adaptive MFCC feature truncation algorithm, significant improvements are also found in both children's and adults' ASR performance on children's speech trained models. The improvements obtained with the proposed algorithms are further found to be additive to those obtained with the existing speaker normalization and model adaptation techniques, namely VTLN, MLLR and CMLLR.
Keywords: Children's speech recognition, acoustic mismatch, pitch, speaking rate, glottal flow parameters, MFCC, PLPCC, PMVDR, Mel filterbank, cepstral truncation...
Supervisor: Rohit Sinha