YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition
Title | YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition |
Publication Type | Conference Paper |
Year of Publication | 2013 |
Authors | Guadarrama, S., Krishnamoorthy N., Malkarnenkar G., Mooney R., Darrell T., & Saenko K. |
Other Numbers | 3615 |
Abstract | Despite a recent push towards large-scale object recognition, activity recognition remains limited to narrow domains and small vocabularies of actions. In this paper, we tackle the challenge of recognizing and describing activities in-the-wild. We present a solution that takes a short video clip and outputs a brief sentence that sums up the main activity in the video, such as the actor, the action and its object. Unlike previous work, our approach works on out-of-domain actions: it does not require training videos of the exact activity. If it cannot find an accurate prediction for a pre-trained model, it finds a less specific answer that is also plausible from a pragmatic standpoint. We use semantic hierarchies learned from the data to help to choose an appropriate level of generalization, and priors learned from web-scale natural language corpora to penalize unlikely combinations of actors/actions/objects; we also use a web-scale language model to fill in novel verbs, i.e. when the verb does not appear in the training set. We evaluate our method on a large YouTube corpus and demonstrate it is able to generate short sentence descriptions of video clips better than baseline approaches. |
Acknowledgment | This work was partially supported by funding provided through National Science Foundation grants IIS-1212928 ("Reconstructive Recognition: Uniting Statistical Scene Understanding and Physics-Based Visual Reasoning"), IIS-1016312 ("Perceptually Grounded Learning of Instructional Language"), and IIS-1116411 ("Hierarchical Probabilistic Layers for Visual Recognition of Complex Objects"). Additional support was provided by DARPA's MSEE Program, U.S. ARO Award W911NF-10-2-0059, and Toyota. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors or originators and do not necessarily reflect the views of the funders. |
URL | https://www.icsi.berkeley.edu/pubs/vision/youtube2text13.pdf |
Bibliographic Notes | Proceedings of the International Conference on Computer Vision 2013 (ICCV 2013), Sydney, Australia |
Abbreviated Authors | S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, R. Mooney, T. Darrell, and K. Saenko |
ICSI Research Group | Vision |
ICSI Publication Type | Article in conference proceedings |