Deep and Wide Learning for Automatic Speech Recognition

Principal Investigator(s): 
Nelson Morgan

In this project, speech researchers are examining trade-offs between two approaches to automatic speech recognition (ASR): signal processing that extracts multiple acoustic features, versus simpler features combined with machine learning algorithms that replace feature engineering. The goal is not only to improve accuracy on difficult examples, but also to understand the computational consequences for high-performance computing (HPC).

Automatically transcribing multimedia sources that contain speech requires ASR. Large-vocabulary continuous speech recognition systems can be quite accurate, particularly when they are trained with sufficient data whose characteristics resemble the test conditions. A potential limitation of all trained systems, however, is poor performance when the training data does not adequately represent the characteristics of the test data. This problem is common to many tasks: real-world data often has many sources of variability unrelated to the task at hand (e.g., noise, reverberation, altered speaking style), so the training and test data can be mismatched. Even if unlimited data is available, experience has shown diminishing returns from increasing the training data much past a few thousand hours of speech, particularly for difficult test conditions (such as the audio in consumer videos).

Some degradation due to acoustic variability can be mitigated by signal processing transformations, often inspired by models of hearing, once the nature of the problem is understood. But learning effective representations from the data (or from simple transformations such as short-term spectral energy) has often proven very powerful. What is the appropriate trade-off between choosing signal features by design (e.g., from models of human auditory processing) and deriving them through machine learning?

These considerations can have a significant effect on the computational requirements for the training and recognition tasks, particularly for parallelization over many cores. If predefined feature processing is irrelevant and all that is required is sufficient depth and clever machine learning algorithms, the emphasis will need to be on matching the learning algorithms to HPC architectures. If, on the other hand, features are important and the neural network architecture (and learning algorithms) can be relatively simple, the emphasis might need to be on more complex signal processing. That in turn could lead to parallelizing multiple subnets that each process a different representation, a “wide” approach to the acoustic front end that has sometimes proven useful. It may well be that there is no single correct answer to this conundrum, and that what should be sought instead is a guide to the trade-offs between intelligent signal representations and deep (and wide) learning.
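
To make the “wide” idea concrete, the following is a minimal sketch in Python, not the project's implementation. It assumes three hypothetical feature streams (the names and dimensions are placeholders), gives each stream its own small one-layer subnet with random weights, runs the subnets concurrently, and concatenates their outputs for a later stage to combine.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def make_subnet(n_in, n_hidden, seed):
    """Return random weights and zero biases for a one-layer subnet (illustrative only)."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
    biases = [0.0] * n_hidden
    return weights, biases


def run_subnet(subnet, features):
    """Apply one tanh layer to a single frame of features."""
    weights, biases = subnet
    return [
        math.tanh(sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(weights, biases)
    ]


def wide_front_end(subnets, streams):
    """Run each subnet on its own feature stream concurrently, then concatenate."""
    with ThreadPoolExecutor(max_workers=len(subnets)) as pool:
        outputs = list(pool.map(run_subnet, subnets, streams))
    return [value for output in outputs for value in output]


# Hypothetical streams for one frame; names and dimensions are assumptions.
stream_dims = {"spectral": 40, "auditory_model": 24, "modulation": 16}
rng = random.Random(0)
streams = [[rng.gauss(0.0, 1.0) for _ in range(d)] for d in stream_dims.values()]
subnets = [make_subnet(d, 8, seed=i) for i, d in enumerate(stream_dims.values())]

combined = wide_front_end(subnets, streams)
print(len(combined))  # 3 subnets x 8 hidden units = 24
```

The subnets share no state, so they can be evaluated independently. Threads are used here only to show that structure; in CPython they would not speed up this pure-Python arithmetic, and a real system would map each subnet to separate cores or devices.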

This project explores the relationship between the data/parameters ratio and the importance of robust signal processing in the mismatched case (close-mic’d training, distant-mic testing), with an emphasis on the consequences for computation on many cores.
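
One way to read the data/parameters ratio is as the number of training frames divided by the number of trainable network parameters. The sketch below computes it for a fully connected network. The frame rate of 100 frames per second (a 10 ms step) and the layer sizes in the example are assumptions chosen for illustration, not values taken from the project.

```python
def mlp_parameter_count(layer_sizes):
    """Count weights plus biases in a fully connected network."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def training_frames(hours, frames_per_second=100):
    """Number of feature frames in `hours` of speech (100 frames/s assumes a 10 ms step)."""
    return int(hours * 3600 * frames_per_second)


def data_to_parameter_ratio(hours, layer_sizes, frames_per_second=100):
    """Training frames available per trainable parameter."""
    return training_frames(hours, frames_per_second) / mlp_parameter_count(layer_sizes)


# Hypothetical configuration: 100 hours of speech and a network with three hidden layers.
layers = [360, 1024, 1024, 1024, 2000]
print(mlp_parameter_count(layers))
print(round(data_to_parameter_ratio(100, layers), 2))
```

Under this reading, adding hours of training data raises the ratio, while adding or widening layers lowers it.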