On the Applicability of Speaker Diarization to Audio Concept Detection for Multimedia Retrieval

Title	On the Applicability of Speaker Diarization to Audio Concept Detection for Multimedia Retrieval
Publication Type	Conference Paper
Year of Publication	2011
Authors	Mertens, R., Huang P-S., Gottlieb L., Friedland G., & Divakaran A.
Page(s)	446-451
Other Numbers	3201
Keywords	Audio Clustering, Audio Indexing, Speaker Diarization, Video Indexing
Abstract	Recently, audio concepts emerged as a useful building block in multimodal video retrieval systems. Information like "this file contains laughter", "this file contains engine sounds", or "this file contains slow music" can significantly improve purely visual-based retrieval. The weak point of current approaches to audio concept detection is that they rely heavily on human annotators. In most approaches, audio material is manually inspected to identify relevant concepts. Then instances that contain examples of relevant concepts are selected, again manually, and used to train concept detectors. This approach comes with two major disadvantages: (1) it leads to rather abstract audio concepts that hardly cover the audio domain at hand, and (2) the way human annotators identify audio concepts likely differs from the way a computer algorithm clusters audio data, introducing additional noise in training data. This paper explores whether unsupervised audio segmentation systems can be used to identify useful audio concepts by analyzing training data automatically and whether these audio concepts can be used for multimedia document classification and retrieval. A modified version of the ICSI (International Computer Science Institute) speaker diarization system finds segments in an audio track that have similar perceptual properties and groups these segments. This article provides an in-depth analysis of the statistical properties of similar acoustic segments identified by the diarization system in a predefined document set and the theoretical fitness of this approach to discern one document class from another.
Acknowledgment	This work was partially supported by funding provided to ICSI by the Intelligence Advanced Research Projects Agency (IARPA). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors or originators and do not necessarily reflect the views of IARPA or of the U.S. Government.
URL	http://www.icsi.berkeley.edu/pubs/speech/applicabilityof12.pdf
Bibliographic Notes	Proceedings of the IEEE International Symposium on Multimedia (ISM 2011), Dana Point, California, pp. 446-451
Abbreviated Authors	R. Mertens, P.-S. Huang, L. Gottlieb, G. Friedland, and A. Divakaran
ICSI Research Group	Audio and Multimedia
ICSI Publication Type	Article in conference proceedings
