Fast Speaker Diarization Using a High-Level Scripting Language
Title | Fast Speaker Diarization Using a High-Level Scripting Language |
Publication Type | Conference Paper |
Year of Publication | 2011 |
Authors | Gonina, E., Friedland, G., Cook, H., & Keutzer, K. |
Other Numbers | 3200 |
Abstract | Most current speaker diarization systems use agglomerative clustering of Gaussian Mixture Models (GMMs) to determine who spoke when in an audio recording. While state-of-the-art in accuracy, this method is computationally costly, mostly due to the GMM training, and thus limits the performance of current approaches to be roughly real-time. Increased sizes of current datasets require processing of hundreds of hours of data and thus make more efficient processing methods highly desirable. With the emergence of highly parallel multicore and manycore processors, such as graphics processing units (GPUs), one can re-implement GMM training to achieve faster than real-time performance by taking advantage of parallelism in the training computation. However, developing and maintaining the complex low-level GPU code is difficult and requires a deep understanding of the hardware architecture of the parallel processor. Furthermore, such low-level implementations are not readily reusable in other applications and not portable to other platforms, limiting programmer productivity. In this paper we present a speaker diarization system captured in under 50 lines of Python that achieves 50-250× faster than real-time performance by using a specialization framework to automatically map and execute computationally intensive GMM training on an NVIDIA GPU, without significant loss in accuracy. |
Acknowledgment | This work was partially supported by funding provided to ICSI by the U.S. Defense Advanced Research Projects Agency (DARPA). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors or originators and do not necessarily reflect the views of DARPA or of the U.S. Government. This work was also partially supported by funding provided by CISCO, Microsoft, Intel, and U.C. Discovery. |
URL | http://www.icsi.berkeley.edu/pubs/speech/fastspeakerdiarization11.pdf |
Bibliographic Notes | Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2011), Big Island, Hawaii |
Abbreviated Authors | E. Gonina, G. Friedland, H. Cook, and K. Keutzer |
ICSI Research Group | Audio and Multimedia |
ICSI Publication Type | Article in conference proceedings |
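
To put the abstract's "50-250× faster than real-time" figure in context, the following is a minimal sketch of the arithmetic relating a real-time speedup factor to wall-clock processing time. The function name `processing_hours` and the 100-hour corpus size are assumptions chosen for illustration; neither comes from the paper.

```python
def processing_hours(audio_hours: float, speedup: float) -> float:
    """Wall-clock hours needed to process `audio_hours` of audio.

    `speedup` is the real-time factor: 1.0 means processing takes as long
    as the audio itself, 50.0 means fifty times faster than that.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_hours / speedup


# Hypothetical corpus size, used only for illustration (not from the paper).
corpus_hours = 100.0

# Compare roughly real-time processing with the two ends of the reported range.
for speedup in (1.0, 50.0, 250.0):
    hours = processing_hours(corpus_hours, speedup)
    print(f"{speedup:>5.0f}x real-time: {hours:.1f} h")
```

Under these assumptions, a 100-hour corpus that needs about 100 hours at roughly real-time speed would take 2 hours at 50× and 0.4 hours (24 minutes) at 250×.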