Insights into Audio-Based Multimedia Event Classification with Neural Networks

Title: Insights into Audio-Based Multimedia Event Classification with Neural Networks
Publication Type: Conference Paper
Year of Publication: 2015
Authors: Ravanelli, M., Elizalde Martinez, B., Bernd, J., & Friedland, G.
Other Numbers: 3822

Multimedia Event Detection (MED) aims to identify events (also called scenes) in videos, such as a flash mob or a wedding ceremony. Audio content information complements cues such as visual content and text. In this paper, we explore the optimization of neural networks (NNs) for audio-based multimedia event classification, and discuss some insights towards more effectively using this paradigm for MED. We explore different architectures, in terms of number of layers and number of neurons. We also assess the performance impact of pre-training with Restricted Boltzmann Machines (RBMs) in contrast with random initialization, and explore the effect of varying the context window for the input to the NNs. Lastly, we compare the performance of Hidden Markov Models (HMMs) with a discriminative classifier for the event classification. We used the publicly available event-annotated YLI-MED dataset. Our results showed a performance improvement of more than 6% absolute accuracy compared to the latest results reported in the literature. Interestingly, these results were obtained with a single-layer neural network with random initialization, suggesting that standard approaches with deep learning and RBM pre-training are not fully adequate to address the high-level video event-classification task.
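The best-performing configuration described above (a single-hidden-layer network with random initialization, fed a context window of stacked audio frames and producing per-event posteriors) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the layer sizes, feature dimensions, sigmoid hidden activation, and weight scale are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not the paper's exact setup):
n_frames, n_feats = 11, 13          # context window of 11 frames, 13 features each
input_dim = n_frames * n_feats      # frames are stacked into one input vector
hidden_dim, n_events = 512, 10      # YLI-MED defines 10 target event categories

# Random (non-pretrained) initialization, as in the best-performing setup
W1 = rng.normal(0.0, 0.01, (input_dim, hidden_dim))
b1 = np.zeros(hidden_dim)
W2 = rng.normal(0.0, 0.01, (hidden_dim, n_events))
b2 = np.zeros(n_events)

def forward(x):
    """Single-hidden-layer forward pass: sigmoid hidden units, softmax output."""
    h = 1.0 / (1.0 + np.exp(-(x @ W1 + b1)))     # hidden activations
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max())            # numerically stable softmax
    return e / e.sum()                           # posterior over event classes

x = rng.normal(size=input_dim)   # stand-in for one stacked feature window
p = forward(x)                   # probability distribution over the 10 events
```

In practice such per-window posteriors would be aggregated over a whole video before assigning an event label; the aggregation and training procedure are beyond this sketch.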


This work was supported in part by the National Science Foundation under Award IIS-1251276 (SMASH: Scalable Multimedia content AnalysiS in a High-level language), and by Lawrence Livermore National Laboratory, operated by Lawrence Livermore National Security, LLC, for the U.S. Department of Energy, National Nuclear Security Administration, under Contract DE-AC52-07NA27344. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of a Tesla K40 GPU used for this research. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors or originators and do not necessarily reflect the views of the funders.

Bibliographic Notes

Proceedings of the 2015 Workshop on Community-Organized Multimodal Mining: Opportunities for Novel Solutions (MMCommons '15), Brisbane, Australia, pp. 19-23

Abbreviated Authors

M. Ravanelli, B. Elizalde, J. Bernd, and G. Friedland

ICSI Research Group

Audio and Multimedia

ICSI Publication Type

Article in conference proceedings