Representation Learning

Principal Investigator(s): 
Trevor Darrell

Researchers at ICSI and UC Berkeley are developing new representation learning models for visual detection, leveraging advances in discriminatively trained convolutional neural networks. In 2013, they established important results with these models, including observations of their ability to generalize to new tasks and domains and, importantly, their applicability to detection and segmentation tasks. They developed a new “Region-CNN” model (R-CNN), which outperformed all competing methods on the most important visual detection benchmark, the PASCAL VOC challenge.
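A minimal sketch of the R-CNN pipeline stages (region proposal, per-region CNN feature extraction, per-category scoring) is given below. The proposal and feature functions here are hypothetical stand-ins for illustration only, not the released implementation.

    import numpy as np

    def propose_regions(image):
        """Stand-in for a region proposal step (e.g., selective search);
        returns candidate boxes as (x1, y1, x2, y2) rows."""
        h, w = image.shape[:2]
        return np.array([[0, 0, w // 2, h // 2], [w // 4, h // 4, w, h]])

    def region_features(image, box):
        """Stand-in for warping a region to a fixed size and running it
        through a pretrained convolutional network; returns a feature vector."""
        x1, y1, x2, y2 = box
        crop = image[y1:y2, x1:x2]
        return np.array([crop.mean(), crop.std(), float((y2 - y1) * (x2 - x1))])

    def score_regions(image, class_weights):
        """R-CNN-style scoring: extract features per proposal, then apply
        one linear classifier per category."""
        boxes = propose_regions(image)
        feats = np.stack([region_features(image, b) for b in boxes])
        scores = feats @ class_weights.T          # (num_boxes, num_classes)
        return boxes, scores

    if __name__ == "__main__":
        image = np.random.rand(256, 256)
        weights = np.random.randn(3, 3)           # 3 toy categories, 3-d features
        boxes, scores = score_regions(image, weights)
        print(boxes[scores.argmax(axis=0)[0]])    # highest-scoring box for category 0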

Recently they have extended R-CNN with Large Scale Detection through Adaptation, or LSDA, a new framework for large scale detection that transfers knowledge from classification tasks to detection tasks. They combine adaptation techniques with deep convolutional models to create a fast and effective large scale detection network. They have released a 7604-category detector model for use within the CAFFE framework. The categories correspond to the 7404 leaf nodes of the ImageNet dataset, plus 200 categories with stronger detectors trained using bounding box data from the ILSVRC2013 challenge dataset (a rough sketch of the adaptation idea appears at the end of this section). They continue to extend this model with additional vocabulary entries, and plan to scale it up to include explicit adjective-noun pairs in the coming months, covering well over 10K concepts.

They are also continuing to combine NLP and visual models for open-vocabulary video description; they are training action and activity recognition models based on time-sequence neural network models, together with joint visual-text semantic embeddings, with the goal of improving the ability to generate sentences from open-domain (e.g., YouTube) video sources.
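The core idea behind LSDA, adapting classifiers into detectors by transferring the classification-to-detection weight change from categories that have bounding box annotations to those that do not, can be sketched roughly as follows. The array names and the nearest-neighbor transfer shown here are illustrative assumptions, not the released model.

    import numpy as np

    def adapt_classifiers_to_detectors(w_cls, det_ids, w_det_known, k=3):
        """LSDA-style weight transfer (illustrative): categories in det_ids
        have detector weights learned from bounding box data; each remaining
        category's classifier weights are shifted by the average
        (detector - classifier) delta of its k nearest annotated neighbors."""
        deltas = w_det_known - w_cls[det_ids]      # per-category adaptation
        w_det = w_cls.copy()
        w_det[det_ids] = w_det_known
        for c in range(w_cls.shape[0]):
            if c in det_ids:
                continue
            # nearest annotated categories in classifier-weight space
            dists = np.linalg.norm(w_cls[det_ids] - w_cls[c], axis=1)
            nearest = np.argsort(dists)[:k]
            w_det[c] = w_cls[c] + deltas[nearest].mean(axis=0)
        return w_det

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        w_cls = rng.normal(size=(10, 5))           # 10 categories, 5-d weights
        det_ids = [0, 2, 5]                        # categories with box data
        w_det_known = w_cls[det_ids] + rng.normal(scale=0.1, size=(3, 5))
        print(adapt_classifiers_to_detectors(w_cls, det_ids, w_det_known).shape)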