Merging Multilayer Perceptrons and Hidden Markov Models: Some Experiments in Continuous Speech Recognition

Title: Merging Multilayer Perceptrons and Hidden Markov Models: Some Experiments in Continuous Speech Recognition
Publication Type: Technical Report
Year of Publication: 1989
Authors: Bourlard, H., & Morgan, N.
Other Numbers: 532

The statistical and sequential nature of the human speech production system makes automatic speech recognition difficult. Hidden Markov Models (HMMs) have provided a good representation of these characteristics of speech, and were a breakthrough in speech recognition research. However, the a priori choice of a model topology and weak discriminative power limit HMM capabilities. Recently, connectionist models have been recognized as an alternative tool. Their main useful properties are their discriminative power and their ability to capture input-output relationships. They have also proved useful in dealing with statistical data. However, the sequential character of speech is difficult to handle with connectionist models. We have used a classic form of a connectionist system, the Multilayer Perceptron (MLP), for the recognition of continuous speech as part of an HMM system. We show theoretically and experimentally that the outputs of the MLP approximate the probability distribution over output classes conditioned on the input (i.e., the Maximum a Posteriori (MAP) probabilities). We also report the results of a series of speech recognition experiments. By using contextual information at the input of the MLP, frame classification performance improves significantly over the corresponding performance for simple Maximum Likelihood probabilities, or even MAP probabilities without the benefit of context. However, it was not so easy to improve the recognition of words in continuous speech by the use of an MLP, although it was clear that the classification at the frame and phoneme levels was better than we achieved with our HMM system. We present several modifications of the original methods that were required to achieve acceptable performance at the word level. Preliminary results are reported for a 1000-word vocabulary, phoneme-based, speaker-dependent continuous speech recognition system embedding an MLP into an HMM. These results show equivalent recognition performance using either Maximum Likelihood estimates or the outputs of an MLP to estimate the emission probabilities of an HMM.
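The idea of using MLP outputs as HMM emission scores can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the report: the posterior values, the priors, the two-state transition matrix, and the function names `viterbi` and `scaled_log_likelihoods` are all hypothetical. In particular, dividing the posteriors by the class priors (Bayes' rule with the p(x) term dropped) is common practice in hybrid systems, but the abstract does not say whether the report does this.

```python
import math

def viterbi(log_emis, log_trans, log_init):
    """Return the most likely state path for a sequence of frames.

    log_emis[t][k]:  log emission score of state k at frame t
    log_trans[j][k]: log transition probability from state j to state k
    log_init[k]:     log initial probability of state k
    """
    n_states = len(log_init)
    score = [log_init[k] + log_emis[0][k] for k in range(n_states)]
    back = []  # back[t-1][k] = best predecessor of state k at frame t
    for t in range(1, len(log_emis)):
        new_score, ptr = [], []
        for k in range(n_states):
            best_j = max(range(n_states),
                         key=lambda j: score[j] + log_trans[j][k])
            ptr.append(best_j)
            new_score.append(score[best_j] + log_trans[best_j][k]
                             + log_emis[t][k])
        score = new_score
        back.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda k: score[k])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

def scaled_log_likelihoods(posteriors, priors):
    """Assumed step: log(P(class | x) / P(class)) for every frame."""
    return [[math.log(p) - math.log(q) for p, q in zip(frame, priors)]
            for frame in posteriors]

# Hypothetical MLP outputs: 4 frames, 2 phoneme classes (rows sum to 1).
posteriors = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
priors = [0.5, 0.5]  # hypothetical class priors
log_trans = [[math.log(0.7), math.log(0.3)],
             [math.log(0.3), math.log(0.7)]]
log_init = [math.log(0.5), math.log(0.5)]

path = viterbi(scaled_log_likelihoods(posteriors, priors),
               log_trans, log_init)
print(path)
```

With these toy numbers the decoded path stays in the first class for two frames and then switches to the second, following the frame-level posteriors while the transition probabilities penalize switching.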

Bibliographic Notes

ICSI Technical Report TR-89-033

Abbreviated Authors

H. Bourlard and N. Morgan

ICSI Publication Type

Technical Report