Event


Multimodal Speaker Diarization and Localization

Gerald Friedland

ICSI

Tuesday, November 25, 2008
12:30

Research in cognitive psychology suggests that the human brain is able to integrate different sensory modalities, such as sight, sound, and touch, into a perceptual experience that is coherent and unified. Experiments show that by considering input from multiple sensors, perceptual problems can be solved more robustly and even more efficiently. In computer science, however, synergistic use of data encoded for different sensory modalities has not always lived up to its promise.

This talk presents speaker diarization as an example of a multimedia content analysis task in which the integrated use of video and audio information is beneficial. Traditionally, speaker diarization tries to automatically identify speakers from a single-source audio track, with the goal of answering the question "who spoke when". Incorporating information from a low-resolution video camera significantly improves the accuracy of the ICSI speaker diarization engine. The talk also shows how the same engine can localize the speakers as a side effect, extending the question answered by the approach to "who spoke when and from where".

Copyright © 2005 International Computer Science Institute. All Rights Reserved.