YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition

Title: YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition
Publication Type: Conference Paper
Year of Publication: 2013
Authors: Guadarrama, S., Krishnamoorthy, N., Malkarnenkar, G., Mooney, R., Darrell, T., & Saenko, K.
Other Numbers: 3615

Despite a recent push towards large-scale object recognition, activity recognition remains limited to narrow domains and small vocabularies of actions. In this paper, we tackle the challenge of recognizing and describing activities “in-the-wild”. We present a solution that takes a short video clip and outputs a brief sentence that sums up the main activity in the video, such as the actor, the action and its object. Unlike previous work, our approach works on out-of-domain actions: it does not require training videos of the exact activity. If it cannot find an accurate prediction for a pre-trained model, it finds a less specific answer that is also plausible from a pragmatic standpoint. We use semantic hierarchies learned from the data to help to choose an appropriate level of generalization, and priors learned from web-scale natural language corpora to penalize unlikely combinations of actors/actions/objects; we also use a web-scale language model to “fill in” novel verbs, i.e. when the verb does not appear in the training set. We evaluate our method on a large YouTube corpus and demonstrate it is able to generate short sentence descriptions of video clips better than baseline approaches.
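
The idea of falling back to a less specific answer can be illustrated with a toy sketch. This is not the authors' implementation: the hierarchy below is hand-written (the paper learns its hierarchies from data), and the names `PARENT`, `ancestors`, `node_score`, `back_off` and the threshold values are all hypothetical. The sketch assumes each internal node is scored by summing the classifier scores of the leaves beneath it, and it climbs from the best leaf until that score reaches a confidence threshold.

```python
# Hypothetical toy hierarchy (child -> parent); illustrative only.
PARENT = {
    "slice": "cut",
    "chop": "cut",
    "cut": "prepare",
    "stir": "prepare",
}


def ancestors(label, parent):
    """Return the chain [label, parent, grandparent, ...] up to the root."""
    chain = [label]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def node_score(node, leaf_scores, parent):
    """Sum the scores of every leaf that lies at or below `node`."""
    return sum(
        score
        for leaf, score in leaf_scores.items()
        if node in ancestors(leaf, parent)
    )


def back_off(leaf_scores, parent, threshold):
    """Pick the most specific label whose aggregated score reaches `threshold`.

    Starts at the highest-scoring leaf and moves towards the root; if no
    node is confident enough, the root is returned.
    """
    best_leaf = max(leaf_scores, key=leaf_scores.get)
    for node in ancestors(best_leaf, parent):
        score = node_score(node, leaf_scores, parent)
        if score >= threshold:
            return node, score
    return node, score


# Assumed classifier outputs for one clip (made-up numbers).
scores = {"slice": 0.40, "chop": 0.35, "stir": 0.25}
specific_label, _ = back_off(scores, PARENT, threshold=0.3)
general_label, general_score = back_off(scores, PARENT, threshold=0.7)
root_label, _ = back_off(scores, PARENT, threshold=0.9)
```

With these made-up scores, a low threshold keeps the specific verb, while a higher threshold trades specificity for confidence by returning a more general verb.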


This work was partially supported by funding provided through National Science Foundation grants IIS-1212928 ("Reconstructive recognition: Uniting statistical scene understanding and physics-based visual reasoning"), IIS-1016312 ("Perceptually Grounded Learning of Instructional Language"), and IIS-1116411 ("Hierarchical Probabilistic Layers for Visual Recognition of Complex Objects"). Additional support was provided by DARPA’s MSEE Program, U.S. ARO Award W911NF-10-2-0059 and Toyota. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors or originators and do not necessarily reflect the views of the funders.

Bibliographic Notes

Proceedings of the International Conference on Computer Vision 2013 (ICCV 2013), Sydney, Australia

Abbreviated Authors

S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, R. Mooney, T. Darrell, and K. Saenko

ICSI Research Group


ICSI Publication Type

Article in conference proceedings