Lost in Segmentation: Three Approaches for Speech/Non-Speech Detection in Consumer-Produced Videos

Traditional speech/non-speech segmentation systems have been designed for specific acoustic conditions, such as broadcast news or meetings. However, little research has been done on consumer-produced audio. This type of media is constantly growing and has complex characteristics such as low-quality recordings, environmental noise, and overlapping sounds. This paper discusses an evaluation of three different approaches for speech/non-speech detection on consumer-produced audio. The approaches are state-of-the-art speech/non-speech detectors: one based on Gaussian Mixture Models (GMM), another on Support Vector Machines (SVM), and the last on Neural Networks (NN). Using the TRECVID MED 2012 database, we designed combinations of training and testing sets to aid the understanding of what speech/non-speech detection on consumer-produced media entails and how traditional approaches to this detection performed in this domain. The results revealed that the cross-domain state-of-the-art GMM and SVM systems’ tests underperformed


