Simultaneous Speech and Speaker Recognition Using Hybrid Architecture

Title: Simultaneous Speech and Speaker Recognition Using Hybrid Architecture
Publication Type: Technical Report
Year of Publication: 1999
Authors: Genoud, D., Ellis, D. P. W., & Morgan, N.
Other Numbers: 1168

Automatic processing of the human voice is often divided into speech recognition and speaker recognition. These two areas use the same input signal (the voice), but for different purposes: speech recognition aims to recognize the message uttered by any speaker, whereas speaker recognition aims to identify the person who is talking. However, more and more applications need to use both kinds of information simultaneously. The current examples given below illustrate this trend.

State-of-the-art speech recognition systems tend to be speaker independent, both by using models (phonemes, diphones, triphones) estimated on huge databases containing numerous speakers and by using parameterization techniques that try to suppress speaker-dependent characteristics (PLP, RASTA-PLP). However, for some types of applications it can be important to re-adapt the speaker-independent speech recognizer to a given speaker, for example to improve noise robustness, or simply to improve speech recognition performance by adding some knowledge of the speaker. Recent results show that speaker adaptation of a speech recognizer improves system performance [DARPA, 1998].

Nowadays, numerous speech information retrieval applications require the automatic extraction of the content of shows and the retrieval of the speech of a particular speaker on a particular subject. In this case, speech recognition and speaker recognition should be carried out in parallel. Furthermore, detecting speaker changes in a conversation (speaker A/speaker B or speaker/music) may also be very useful for indexing and labeling the huge databases available.

Finally, speaker recognition is needed for applications such as secured voice access to information (for example, a bank account or a voice-mail box). In this case, speaker recognition can be text independent if the content of the utterance is not checked. However, better results are obtained with text-dependent speaker recognition, both because what is said can be checked and because more accurate models (phonemes, words) can be built. In any case, text-dependent speaker recognition has to be preceded by a speech recognition step to check and segment the message properly.

All these applications show the need for simultaneous speaker and speech recognition. This report shows that possibilities exist for carrying out these two tasks simultaneously.

Bibliographic Notes

ICSI Technical Report TR-99-012

Abbreviated Authors

D. Genoud, D. Ellis, and N. Morgan

ICSI Research Group


ICSI Publication Type

Technical Report