YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition

Title: YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition
Publication Type: Conference Paper
Year of Publication: 2013
Authors: Guadarrama S., Krishnamoorthy N., Malkarnenkar G., Mooney R., Darrell T., & Saenko K.
Other Numbers: 3615
Abstract

Despite a recent push towards large-scale object recognition, activity recognition remains limited to narrow domains and small vocabularies of actions. In this paper, we tackle the challenge of recognizing and describing activities "in-the-wild". We present a solution that takes a short video clip and outputs a brief sentence that sums up the main activity in the video, such as the actor, the action and its object. Unlike previous work, our approach works on out-of-domain actions: it does not require training videos of the exact activity. If it cannot find an accurate prediction for a pre-trained model, it finds a less specific answer that is also plausible from a pragmatic standpoint. We use semantic hierarchies learned from the data to help to choose an appropriate level of generalization, and priors learned from web-scale natural language corpora to penalize unlikely combinations of actors/actions/objects; we also use a web-scale language model to "fill in" novel verbs, i.e. when the verb does not appear in the training set. We evaluate our method on a large YouTube corpus and demonstrate it is able to generate short sentence descriptions of video clips better than baseline approaches.

Acknowledgment

This work was partially supported by funding provided through National Science Foundation grants IIS-1212928 ("Reconstructive Recognition: Uniting Statistical Scene Understanding and Physics-Based Visual Reasoning"), IIS-1016312 ("Perceptually Grounded Learning of Instructional Language"), and IIS-1116411 ("Hierarchical Probabilistic Layers for Visual Recognition of Complex Objects"). Additional support was provided by DARPA's MSEE Program, U.S. ARO Award W911NF-10-2-0059, and Toyota. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors or originators and do not necessarily reflect the views of the funders.

URL: https://www.icsi.berkeley.edu/pubs/vision/youtube2text13.pdf
Bibliographic Notes

Proceedings of the International Conference on Computer Vision 2013 (ICCV 2013), Sydney, Australia

Abbreviated Authors

S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, R. Mooney, T. Darrell, and K. Saenko

ICSI Research Group

Vision

ICSI Publication Type

Article in conference proceedings