Audio Concept Ranking for Video Event Detection on User-Generated Content

Video event detection on user-generated content (UGC) aims to find videos that show an observable event, such as a wedding ceremony or a birthday party, rather than an object, such as a wedding dress, or an audio concept, such as music, speech, or clapping. Different events are better described by different concepts, so proper audio concept classification improves the search for acoustic cues in this task. However, audio concepts for training are typically chosen and annotated by humans, and they are not necessarily relevant to a specific event, nor are they necessarily the distinguishing factor for a particular event. A typical ad hoc annotation process ignores the complex characteristics of UGC audio, such as concept ambiguity, overlapping concepts, and concept duration. This paper presents a methodology for ranking audio concepts by their relevance to the events and their contribution to discriminability. A ranking measure guides an automatic or user-based selection of concepts in order to improve audio concept classification, with the goal of improving video event detection. The ranking helps to identify and select the most relevant concepts for each event, discard meaningless concepts, and combine ambiguous sounds to enhance a concept, thereby suggesting a focus for annotation and for understanding UGC audio. Experiments show an improvement in the mean classification accuracy of the audio concepts, as well as a better-defined diagonal in the confusion matrix. Selecting the top 40 audio concepts with our methodology outperforms a best-accuracy-based selection by a relative 17.56% and a frame-frequency-based selection by 5.74%.


