Projects

Speech

   
 

Global Autonomous Language Exploitation (GALE)

The goal of the DARPA GALE program is to develop and apply technologies to absorb, analyze and interpret huge volumes of speech and text in multiple languages. Automatic processing "engines" will convert and distill the data, delivering pertinent, consolidated information in easy-to-understand forms to military personnel and monolingual English-speaking analysts in response to direct or implicit requests.

GALE consists of three major engines: Transcription, Translation and Distillation. The output of each engine is English text. The input to the transcription engine is speech and to the translation engine, text. Engines pass along pointers to relevant source language data that will be available to humans and downstream processes. The distillation engine integrates information of interest to its user from multiple sources and documents.

ICSI is participating in GALE as part of the SRI multi-site team called "Nightingale". ICSI is collaborating with SRI and other sites to develop improved speech recognition, diarization (who spoke when), segmentation (e.g., of sentences or sentence-like word strings), and distillation.

Speech Processing for Meetings

ICSI researchers seek to develop algorithms and systems for the recognition of speech from meetings, as well as methods for information retrieval and other applications that such recognition would make possible. Funding for this research is provided by the "Mapping Meetings" project from the US National Science Foundation (NSF); the European Union's AMIDA project (Augmented Multi-party Interaction with Distance Access); and the Swiss project IM2 (Interactive Multimodal Information Management). Our NSF project is aimed at developing intermediate representations that could prove useful for subsequent information retrieval, extraction, and summarization. AMIDA is an international effort focused on computer-enhanced multimodal interaction in the context of meetings. IM2 is aimed at the advancement of research, and the development of prototypes, in the field of man-machine interaction.

ROADMAP (RObust Automatic Detection of Meeting events with Audiovisual Perception) is a joint project between ICSI and the IDIAP Research Center in Martigny, Switzerland. The goal of the project is to automatically discover and identify conversational events and trends exhibited by groups in meetings by extracting and analyzing multiple perceptual cues, and to answer specific questions such as "who was at the meeting?", "what is the relationship between participants?", and "what events took place?" using both visual and audio cues. ICSI's contribution is the analysis of videos in the compressed domain as well as in the pixel domain, and the distillation of information from audio recorded in the meeting. Using speaker diarization and speaker identification techniques, questions such as "how many speakers participated in the meeting?" or "who spoke when?" will be answered automatically with high accuracy and speed.

Speaker Identification

This project is concerned with the discovery of highly speaker-characteristic behaviors ("speaker performances") for use in speaker recognition and related speech technologies. The intention is to move beyond the low-level, short-term spectral features that dominate speaker recognition systems today, and to focus instead on higher-level sources of speaker information, including idiosyncratic word usage and pronunciation, prosodic patterns, and vocal gestures.
The project goal is twofold: to conduct fundamental research to discover new speaker-distinctive features and encode them into richer, more informative speaker models; and to evaluate the utility of these feature sets and models for speaker recognition and other speech technology applications. The feature discovery efforts are necessarily exploratory and pursue two tracks: a "knowledge-based" track, which builds on existing linguistic constructs and is guided by insights from psycholinguistics and human performance studies, and a more speculative "data-driven" track, which seeks idiosyncratic "vocal performances", that is, spectro-temporal patterns with high speaker-characterizing power, independent of linguistic constraints.

Dialogue Systems

Speech Technology for Developing Countries

ICSI researchers are developing speech recognition technologies for "emerging regions". As part of this effort, they have developed simple recognizers for Tamil, a language spoken by over 50 million people in southeastern India, where illiteracy rates hover around 50% for men and between 60% and 80% for women. Speech recognition, especially in combination with speech synthesis and compelling visual user interfaces, may be key to increasing access to technology for primarily oral communities. The researchers have designed and field-tested prototypes for speech recognition applications, collectively called Open Sesame. These include a multi-modal system that accepts both voice and touch input to provide farmers and other rural community members with information on agricultural innovations and crop varieties, as recommended by local experts in Tamil Nadu. The system is one example of ICSI's capability to rapidly design and deploy low-cost speech prototypes using openly available technology.

SmartWeb

The SmartWeb project, which is being led by the German Research Center for Artificial Intelligence (DFKI), deals with access to semantic Web services on mobile devices such as mobile phones. Speech input and output are well suited to mobile devices and will be a major focus of SmartWeb. A major part of the project vision is the ability for a user to ask a question using a mobile device and immediately receive an answer based on information drawn from the Web. A demonstration system that can answer questions related to the soccer World Cup is being constructed and should be ready in time for the 2006 World Cup. ICSI staff are involved in the creation of the English-language version of the demonstration system and in the development of speech recognition technology for the overall project.

Mutaphrase

Many natural language processing (NLP) applications implicitly or explicitly depend on content being expressed in a particular way. Thus, a process which is programmed or trained for the sequence "You weren't smart to eat fugu" will not necessarily handle the semantically equivalent paraphrase "Eating blowfish was dumb of you". The mutaphraser automatically generates variants of an input sentence using the semantics and syntax encoded in FrameNet and the lexical semantic information in WordNet. The utility of mutaphrasing is tested on various NLP applications including speech recognition, machine translation training, and machine translation evaluation.

Speech Coding

With the proliferation of high-fidelity digital audio, wideband audio coding algorithms have become increasingly important. The main purpose of audio coding is to represent an audio signal with a minimum number of bits while maintaining perceptual transparency. While representing an audio signal with fewer bits generally results in degraded quality, there are many applications where bandwidth constraints exist and encoding at lower bit rates is required. Delivering high-quality audio over mobile service networks has proven difficult for traditional audio coding techniques, which generally need much higher bit rates to encode audio transparently. Additionally, current audio codecs do not always compress speech well.

In contrast to current speech and audio codecs, our coding technique uses a time-frequency decomposition followed by Frequency-Domain Linear Prediction (FDLP), which models a signal's temporal evolution by fitting an autoregressive model to the signal's Hilbert envelope. This method is performed over long temporal segments, on the order of one second, in sub-bands. Our focus is on low-to-medium bit rate applications where latency requirements are less stringent.
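As a rough illustration of the envelope modeling described above, the following Python sketch estimates a temporal envelope with FDLP for a single full-band segment. It is a minimal sketch under stated assumptions, not the codec itself: the function names (dct_ii, levinson, fdlp_envelope), the model order of 20, and the choice of a DCT followed by autocorrelation-method linear prediction are illustrative, and the sub-band decomposition, quantization, and residual coding are omitted.

```python
import numpy as np

def dct_ii(x):
    """Unnormalized DCT-II via an explicit cosine matrix (fine for short segments)."""
    n = len(x)
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    return np.cos(np.pi * k * (2 * t + 1) / (2 * n)) @ x

def levinson(r, order):
    """Levinson-Durbin recursion: autocorrelation r -> AR coefficients a (a[0] = 1)."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        prev = a.copy()
        a[1:i] = prev[1:i] + k * prev[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)                # updated prediction error
    return a, err

def fdlp_envelope(x, order=20):
    """Approximate the squared temporal envelope of x with an all-pole model.

    Linear prediction is applied to the DCT coefficients (the frequency
    domain), so the resulting AR "spectrum" is a function of time.
    """
    c = dct_ii(np.asarray(x, dtype=float))
    n = len(c)
    # Autocorrelation of the DCT sequence for lags 0..order.
    r = np.array([np.dot(c[:n - m], c[m:]) for m in range(order + 1)])
    a, err = levinson(r, order)
    # Evaluate 1/|A|^2 on the grid of angles that corresponds to sample times.
    w = np.pi * (2 * np.arange(n) + 1) / (2 * n)
    A = np.exp(-1j * np.outer(w, np.arange(order + 1))) @ a
    return err / np.abs(A) ** 2
```

For a test signal such as noise with a short energy burst, the returned envelope should be large around the burst and small elsewhere. The model order controls how much temporal detail is kept.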

Objective measures and a series of listening tests have shown that this technique performs comparably to state-of-the-art codecs at 64 kbps. However, we would like to further reduce bit rate while maintaining quality. Current research involves improving compression efficiency through implementation of a psychoacoustic model and entropy coding. Additional research involves reducing algorithmic latency and comparing the FDLP technique with codecs at lower bit rates. This research is funded by QUALCOMM.

Multiple Stream Speech Recognition

This project has three components.

(1) Cortically-inspired speech recognition: Acoustic events such as speech exhibit distinctive spectro-temporal amplitude modulations. These modulations are not well captured by conventional feature extraction methods, which apply either spectral processing or temporal processing, but not both at once.

Recent findings from measurements of receptive fields in the mammalian auditory cortex suggest that biological systems are highly tuned to spectro-temporal modulations. The spectro-temporal receptive fields (STRFs) of cortical cells are found to resemble 2-D spectro-temporal Gabor filters. In prior work, researchers have used 2-D Gabor filters to extract spectro-temporal features for speech recognition and speech discrimination. However, these studies have been limited to either single streams of task-optimized features or very large multi-dimensional representations of spectro-temporal responses. There is therefore a need to explore the use of multiple streams of spectro-temporal features in speech recognition, which may preserve the organizational map of STRFs and avoid cumbersome computation on sizable data.

This research aims to develop, evaluate, and incorporate multi-stream spectro-temporal features for robust speech recognition.
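To make the idea of a 2-D spectro-temporal Gabor filter concrete, the following Python sketch builds one such filter and applies it to a spectrogram-like array. It is a minimal sketch, not the project's feature extractor: the function names, the filter size, and the modulation and bandwidth parameters are all illustrative assumptions.

```python
import numpy as np

def gabor_2d(n_freq, n_time, omega_f, omega_t, sigma_f, sigma_t):
    """Real 2-D Gabor filter: Gaussian envelope times a spectro-temporal ripple.

    omega_f and omega_t are the spectral and temporal modulation frequencies
    in radians per channel and per frame (assumed parameterization).
    """
    f = np.arange(n_freq) - (n_freq - 1) / 2
    t = np.arange(n_time) - (n_time - 1) / 2
    F, T = np.meshgrid(f, t, indexing="ij")
    envelope = np.exp(-0.5 * ((F / sigma_f) ** 2 + (T / sigma_t) ** 2))
    carrier = np.cos(omega_f * F + omega_t * T)
    g = envelope * carrier
    return g - g.mean()  # remove DC so flat spectrogram regions give no response

def filter_spectrogram(spec, kernel):
    """Same-size 2-D correlation with zero padding (kernel dimensions must be odd)."""
    kf, kt = kernel.shape
    padded = np.pad(spec, ((kf // 2, kf // 2), (kt // 2, kt // 2)))
    windows = np.lib.stride_tricks.sliding_window_view(padded, kernel.shape)
    return np.einsum("ijkl,kl->ij", windows, kernel)
```

A bank of such filters with different modulation frequencies would give one response map per filter. Grouping those maps into several streams, rather than one stream or one very large vector, is the direction described above.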

(2) Feature selection for speech recognition: Significant gains in automatic speech recognition (ASR) performance have often been obtained using a multi-stream approach, in which multiple classifiers that each use a different feature extraction method are run in parallel and have their decisions combined. However, the problem of how best to distribute the possible features among the classifiers has received little study. Usually, features coming from a particular feature extraction method are treated as a single, indivisible group, of which either all or no members are assigned to a classifier in the system. We are working on a different approach in which an automatic search procedure is used to assign features to classifiers at the level of individual features, using an objective function related to the classification accuracy of the resulting multi-stream system. We hope this approach can increase ASR accuracy and give insight into the relative usefulness of different features.
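The text does not say which search procedure is used, so the following Python sketch only illustrates the general shape of a search over per-feature assignments. It assumes simple hill climbing and a caller-supplied objective. The function name hill_climb_assignment and the score callable are hypothetical; in a real system the score would come from training and evaluating the multi-stream recognizer.

```python
import random

def hill_climb_assignment(n_features, n_classifiers, score, n_iters=500, seed=0):
    """Assign each feature to one classifier by randomized hill climbing.

    score(assignment) must return a number to maximize, where assignment[i]
    is the index of the classifier that receives feature i.
    """
    rng = random.Random(seed)
    assignment = [rng.randrange(n_classifiers) for _ in range(n_features)]
    best = score(assignment)
    for _ in range(n_iters):
        feature = rng.randrange(n_features)
        candidate = rng.randrange(n_classifiers)
        if candidate == assignment[feature]:
            continue
        previous = assignment[feature]
        assignment[feature] = candidate      # try moving one feature
        new_score = score(assignment)
        if new_score > best:
            best = new_score                 # keep the improvement
        else:
            assignment[feature] = previous   # otherwise undo the move
    return assignment, best
```

Because each step moves a single feature, the search works at the level of individual features rather than whole feature groups. Hill climbing can stall in local optima, so a practical search would likely need restarts or a more global strategy.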

(3) Parallel processing for speech recognition: In noisy or reverberant environments, more processing will be needed for speech recognition. If a mobile device is used, the device will often be some distance from the user's mouth, which hurts ASR accuracy. For instance, in the most recent NIST evaluations, the best word error rate for multi-microphone speech recognition in a conference room was about 40%. That system used beamforming but did not yet include the techniques we propose below, which have the potential to reduce this error rate significantly, at the expense of much more computational power.

A parallel processing approach that could help further is the multi-stream methodology, in which multiple signal representations are used to generate posterior probabilities of speech sound classes, which are then combined and further transformed (Gaussianized and orthogonalized) to generate input features for a statistical speech recognition engine. Multi-layer perceptrons generate the individual posterior probabilities. These methods have been used successfully with 2 to 15 streams, but we would ultimately like to work with much larger ensembles of feature generators. We will start our work using the Quicknet libraries that were developed at ICSI, parallelizing them for the target approaches discussed in this proposal. We will then develop code that incorporates these libraries in a system that permits experimentation and ultimately exhibits much greater robustness for speech recognition in moderate noise and reverberation with microphones that are not head-mounted. This work is closely connected with the Berkeley ParLab.
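The combination and transformation steps described above can be sketched in Python as follows. This is a minimal sketch under assumptions the text does not confirm: streams are merged by averaging log posteriors, the logarithm serves as the Gaussianizing step, and principal component analysis serves as the orthogonalizing step. The function names are hypothetical and the code does not use Quicknet.

```python
import numpy as np

def combine_streams(posteriors, eps=1e-10):
    """Merge per-stream class posteriors into one log-domain feature matrix.

    posteriors has shape (n_streams, n_frames, n_classes). Averaging the log
    posteriors is an assumed combination rule; other rules are possible.
    """
    log_p = np.log(np.clip(posteriors, eps, 1.0))  # log makes values less skewed
    return log_p.mean(axis=0)                      # shape (n_frames, n_classes)

def pca_decorrelate(features, n_keep):
    """Project features onto their top principal components (orthogonalization)."""
    centered = features - features.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_keep]     # largest variance first
    return centered @ eigvecs[:, order]
```

The per-stream posteriors can be computed independently, which is where parallel hardware helps as the number of streams grows. The combination and projection steps are cheap by comparison.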

 

Copyright © 2007 International Computer Science Institute. All Rights Reserved.