Acoustic Super Models for Large Scale Video Event Detection

Given the exponential growth of videos published on the Internet, mechanisms for clustering, searching, and browsing large numbers of videos have become a major research area. More importantly, there is a demand for event detectors that go beyond simply finding objects and instead detect more abstract concepts, such as “feeding an animal” or a “wedding ceremony”. This article presents an approach for event classification that enables searching for arbitrary events, including more abstract concepts, in found video collections based on the analysis of the audio track. The approach does not rely on speech processing and is language-independent; instead, it generates models for a set of example query videos using a mixture of two types of audio features: Linear-Frequency Cepstral Coefficients and Modulation Spectrogram Features. This approach can be used to complement video analysis and requires no domain-specific tagging. Application of the approach to the TRECVid MED 2011 development set, which consists of more than 4000 random “wild” videos from the Internet, has shown a


