First Person Action-Object Detection with EgoNet

Title: First Person Action-Object Detection with EgoNet
Publication Type: Miscellaneous
Year of Publication: 2016
Authors: Bertasius, G., Park, H. Soo, Yu, S. X., & Shi, J.
Keywords: action-object detection, saliency, visual attention

Objects afford visual sensation and motor actions. A first person camera, placed on the person's head, captures unscripted moments of our visual sensorimotor interactions with objects. Can a single first person image tell us about our momentary visual attention and motor action with objects, without a gaze tracking device or tactile sensors? To study the holistic correlation of visual attention with motor action, we introduce the concept of action-objects: objects associated with seeing and touching actions, which exhibit a characteristic 3D spatial distance and orientation with respect to the person. A predictive action-object model is designed to re-organize the space of interactions in terms of visual and tactile sensations, and is realized by our proposed EgoNet network. EgoNet is composed of two convolutional neural networks: 1) a Semantic Gaze Pathway that learns 2D appearance cues with a first person coordinate embedding, and 2) a 3D Spatial Pathway that focuses on 3D depth and height measurements relative to the person, with brightness reflectance attached. Retaining two distinct pathways enables effective learning from a limited number of examples, diversified prediction from complementary visual signals, and a flexible architecture that remains functional on RGB images without depth information. We show that our model correctly predicts action-objects in a first person image, outperforming existing approaches across different datasets.
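The two-pathway design described above can be sketched in miniature. The snippet below is an illustrative toy, not the authors' implementation: the pathway functions, the hand-crafted scores, the averaging fusion, and the ~2 m / ~1 m thresholds are all assumptions chosen only to show how complementary 2D-appearance and 3D-spatial cues might be combined, with an RGB-only fallback when depth is unavailable.

```python
# Toy sketch of a two-pathway fusion (NOT the EgoNet CNNs themselves).
# Both pathway scores and the averaging rule are illustrative assumptions.

def semantic_gaze_pathway(rgb_patch, xy_coords):
    """Toy 2D-appearance score: mean patch intensity, weighted toward
    the image center (a stand-in for the coordinate embedding)."""
    mean_intensity = sum(rgb_patch) / len(rgb_patch)
    x, y = xy_coords  # normalized coordinates, (0, 0) = image center
    center_weight = 1.0 - min(1.0, (x ** 2 + y ** 2) ** 0.5)
    return mean_intensity * center_weight

def spatial_pathway(depth_m, height_m):
    """Toy 3D score: nearby surfaces at roughly hand height score highest."""
    near = max(0.0, 1.0 - depth_m / 2.0)             # within ~2 m
    reachable = max(0.0, 1.0 - abs(height_m - 1.0))  # near ~1 m height
    return near * reachable

def fuse_pathways(rgb_patch, xy_coords, depth_m=None, height_m=None):
    """Average the two pathway scores; fall back to appearance only
    when no depth measurement is available."""
    s = semantic_gaze_pathway(rgb_patch, xy_coords)
    if depth_m is None:
        return s
    return 0.5 * (s + spatial_pathway(depth_m, height_m))

# A bright patch at image center, 1 m away at hand height:
score = fuse_pathways([0.8, 0.9, 0.85], (0.0, 0.0), depth_m=1.0, height_m=1.0)
print(score)  # 0.675
```

The fallback branch mirrors the abstract's claim that the architecture remains functional on RGB images without depth information; only the fusion weight changes, not the appearance pathway.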
