Subsegmental, Segmental and Suprasegmental Processing of Linear Prediction Residual for Speaker Information

Speaker-specific information in speech is mostly attributed to the shape, size and dynamics of the vocal tract and the excitation source. The excitation information can be viewed at the subsegmental (3-5 msec), segmental (10-30 msec) and suprasegmental (100-300 msec) levels. These levels include glottal cycle activity, the periodicity and strength of vocal fold vibration, and the speaker's learned habits. This work proposes methods to model this information from the linear prediction (LP) residual and uses the resulting models in combination to develop a speaker verification system. The significance and distinct nature of the subsegmental, segmental and suprasegmental excitation information present in the LP residual are demonstrated by processing it directly in the time domain. The segmental excitation information provides the best performance, followed by the subsegmental level. Owing to large intra-speaker variability and the text-independent mode of operation, the suprasegmental excitation information provides the lowest performance. Combining the evidence from all three levels further improves performance. For compact and effective representation, different methods of parameterizing the LP residual are explored. The following are proposed as possible ways of representing the subsegmental, segmental and suprasegmental excitation information, respectively: Liljencrants-Fant (LF) parameters computed from the LP residual; the combined use of the spectral flatness measure and cepstral coefficients from the mel-warped spectrum of the LP residual; and the combined use of pitch, epoch strength and mel frequency cepstral trajectories of the LP residual. The vocal tract based system provides relatively good performance, but suffers severely in noisy conditions.
In this sense, the proposed excitation information based system is relatively more robust and hence may be useful.

The contributions of the work reported in this thesis on subsegmental, segmental and suprasegmental processing of the LP residual for speaker information include:
- Implicit processing of the LP residual in the time domain with different frame sizes and shifts for the extraction of subsegmental, segmental and suprasegmental speaker-specific excitation information.
- Implicit processing of the analytic representation of the LP residual in the time domain for independent modeling of amplitude and sequence information.
- A suggested modification to the zero-frequency filtering method for accurate estimation of pitch and epoch strength from telephone speech.
- A proposed efficient approach for the computation of the LF parameters.
- Compact representation of the subsegmental excitation information using LF parameters.
- Investigation of filter shapes for processing the LP residual in the frequency and cepstral domains for compact representation of the segmental excitation information.
- Effective modelling of the suprasegmental excitation information by the combined use of pitch, epoch strength and cepstral trajectory vectors.
- Development of a speaker verification system using excitation information.

Keywords: Subsegmental, segmental, suprasegmental, LP residual, vocal tract and excitation information, LF parameters, mel warped spectrum, speaker recognition.
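The first contribution, time-domain processing of the LP residual with different frame sizes and shifts, can be illustrated with a minimal sketch. It is not the thesis implementation. It assumes an 8 kHz sampling rate, an LP order of 10, a single LP fit over the whole signal (rather than frame-wise analysis), and one frame size per level chosen from inside the ranges quoted in the abstract; the frame shifts, the synthetic test signal and all function names are hypothetical.

```python
import numpy as np

FS = 8000  # assumed sampling rate in Hz (not stated in the abstract)

# (frame size, frame shift) in msec. Sizes lie inside the ranges quoted in
# the abstract; the shifts are illustrative assumptions.
LEVELS_MS = {
    "subsegmental": (5.0, 2.5),
    "segmental": (20.0, 10.0),
    "suprasegmental": (250.0, 50.0),
}


def lp_coefficients(signal, order):
    """Autocorrelation-method LP: solve the normal equations R a = r."""
    n = len(signal)
    full = np.correlate(signal, signal, mode="full")
    r = full[n - 1 : n + order]  # autocorrelation at lags 0..order
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R += 1e-9 * r[0] * np.eye(order)  # tiny ridge for numerical stability
    return np.linalg.solve(R, r[1 : order + 1])


def lp_residual(signal, order=10):
    """Residual e[n] = s[n] - sum_k a_k s[n-k], k = 1..order."""
    a = lp_coefficients(signal, order)
    predictor = np.concatenate(([0.0], a))  # zero tap at lag 0
    predicted = np.convolve(signal, predictor)[: len(signal)]
    return signal - predicted


def frame_signal(x, frame_len, shift):
    """Split x into overlapping frames, returned as a 2-D array."""
    count = 1 + (len(x) - frame_len) // shift
    idx = np.arange(frame_len)[None, :] + shift * np.arange(count)[:, None]
    return x[idx]


def frame_levels(residual, fs=FS):
    """Frame the residual at the three levels defined in LEVELS_MS."""
    frames = {}
    for name, (size_ms, shift_ms) in LEVELS_MS.items():
        frame_len = int(round(size_ms * fs / 1000))
        shift = int(round(shift_ms * fs / 1000))
        frames[name] = frame_signal(residual, frame_len, shift)
    return frames


def synthetic_signal(n_samples, seed=0):
    """Hypothetical stand-in for speech: white noise through a stable AR(2) filter."""
    rng = np.random.default_rng(seed)
    excitation = rng.standard_normal(n_samples)
    s = np.zeros(n_samples)
    for n in range(n_samples):
        s[n] = excitation[n]
        if n >= 1:
            s[n] += 1.3 * s[n - 1]
        if n >= 2:
            s[n] -= 0.6 * s[n - 2]
    return s


signal = synthetic_signal(FS)  # one second of synthetic signal
residual = lp_residual(signal, order=10)
for name, frames in frame_levels(residual).items():
    print(name, frames.shape)
```

Each row of a returned array is one residual frame that could be passed to a speaker model directly, which is one reading of "implicit" time-domain processing; the choice of model is outside the scope of this sketch.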
Supervisor: S. R. M. Prasanna