Capturing the Acoustic Scene Characteristics for Audio Scene Detection

Scene detection on user-generated content (UGC) aims to classify an audio recording as belonging to a specific scene, such as a busy street, an office, or a supermarket, rather than to identify an individual sound such as car noise, a computer keyboard, or a cash machine. The difficulty of scene content analysis on UGC lies in the audio's lack of structure and its acoustic variability. The i-vector system is state of the art in speaker verification and scene detection, and outperforms conventional Gaussian Mixture Model (GMM)-based approaches. It compensates for undesired acoustic variability and extracts information from the acoustic environment, making it a suitable choice for scene detection on UGC. This paper reports our results on the IEEE-AASP Scene Classification Challenge dataset, obtained with a hand-tuned i-vector system and MFCC features. Compared with the MFCC+GMM baseline system, our approach increased classification accuracy by 26.4% relative, to 65.8%. We discuss our approach and highlight the parameters in our system that significantly improved classification accuracy.
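
A relative improvement is easy to misread as an absolute one, so the arithmetic behind the reported figures can be sketched as follows. This is a minimal illustration, assuming "26.4% relative" means the new accuracy equals the baseline accuracy multiplied by 1.264; the function name `baseline_from_relative_gain` is hypothetical, and the baseline value it produces is implied by the two reported numbers rather than quoted from the paper.

```python
def baseline_from_relative_gain(new_accuracy: float, relative_gain: float) -> float:
    """Return the baseline accuracy implied by a relative improvement.

    Assumes new_accuracy = baseline * (1 + relative_gain).
    """
    return new_accuracy / (1.0 + relative_gain)


new_accuracy = 0.658   # accuracy reported in the abstract
relative_gain = 0.264  # relative improvement reported in the abstract

# Implied baseline accuracy (derived, not quoted from the paper)
implied_baseline = baseline_from_relative_gain(new_accuracy, relative_gain)

# Absolute gain in accuracy, as a fraction
absolute_gain = new_accuracy - implied_baseline

print(f"implied baseline accuracy: {implied_baseline:.1%}")
print(f"absolute gain: {absolute_gain:.1%}")
```

Under this assumption, the MFCC+GMM baseline sits at roughly 52% accuracy, and the i-vector system's gain is a little under 14 percentage points in absolute terms.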


