On the Applicability of Speaker Diarization to Audio Concept Detection for Multimedia Retrieval
Title | On the Applicability of Speaker Diarization to Audio Concept Detection for Multimedia Retrieval |
Publication Type | Conference Paper |
Year of Publication | 2011 |
Authors | Mertens, R., Huang, P.-S., Gottlieb, L., Friedland, G., & Divakaran, A. |
Page(s) | 446-451 |
Other Numbers | 3201 |
Keywords | Audio Clustering, Audio Indexing, Speaker Diarization, Video Indexing |
Abstract | Recently, audio concepts have emerged as a useful building block in multimodal video retrieval systems. Information like "this file contains laughter", "this file contains engine sounds", or "this file contains slow music" can significantly improve purely visual-based retrieval. The weak point of current approaches to audio concept detection is that they rely heavily on human annotators. In most approaches, audio material is manually inspected to identify relevant concepts. Then instances that contain examples of relevant concepts are selected, again manually, and used to train concept detectors. This approach comes with two major disadvantages: (1) it leads to rather abstract audio concepts that hardly cover the audio domain at hand and (2) the way human annotators identify audio concepts likely differs from the way a computer algorithm clusters audio data, introducing additional noise in training data. This paper explores whether unsupervised audio segmentation systems can be used to identify useful audio concepts by analyzing training data automatically and whether these audio concepts can be used for multimedia document classification and retrieval. A modified version of the ICSI (International Computer Science Institute) speaker diarization system finds segments in an audio track that have similar perceptual properties and groups these segments. This article provides an in-depth analysis of the statistical properties of similar acoustic segments identified by the diarization system in a predefined document set and the theoretical fitness of this approach to discern one document class from another. |
Acknowledgment | This work was partially supported by funding provided to ICSI by the Intelligence Advanced Research Projects Agency (IARPA). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors or originators and do not necessarily reflect the views of IARPA or of the U.S. Government. |
URL | http://www.icsi.berkeley.edu/pubs/speech/applicabilityof12.pdf |
Bibliographic Notes | Proceedings of the IEEE International Symposium on Multimedia (ISM 2011), Dana Point, California, pp. 446-451 |
Abbreviated Authors | R. Mertens, P.-S. Huang, L. Gottlieb, G. Friedland, and A. Divakaran |
ICSI Research Group | Audio and Multimedia |
ICSI Publication Type | Article in conference proceedings |