Active Learning with Multiple Views
For many real-world object recognition tasks, a common difficulty is that a large pool of unlabeled images is available while the cost of labeling each image is high. To learn a concept with the help of a human expert, we aim to pick only a small subset of examples that is most 'helpful' for the classifier. Active learning tackles this setting by enabling a classifier to pose specific queries chosen from an unlabeled dataset.
To create a representation of an image that a computer can process, numerical features of the image are calculated. These features may describe, for example, the shape, brightness, or texture of an object. They are usually concatenated to form a long (high-dimensional) feature vector. However, such high-dimensional feature vectors can make it difficult to find global optima in the parameter space of the classification model.
One way to overcome this problem is feature selection or feature weighting, which determines the most relevant features for a classification task. Many such methods have been developed, but they rely on a sufficiently large labeled dataset. In many problem settings (such as this active learning setting), labeled data may not be available; we therefore take a different perspective and describe objects with different views.
For example, each feature module in image classification can be seen as a view on the object. Each view can describe different concepts, and each view can contribute to a certain degree to the target concept that is to be learned. Having multiple representations can improve classification performance when, in addition to labeled examples, many unlabeled examples are available.
The aim of this research project is to combine active learning and multi-view methods to derive new, improved selection strategies and to improve classification accuracy with few labeled examples.
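As a rough illustration of how a selection strategy might use multiple views, the sketch below scores each unlabeled example by how uncertain the averaged per-view prediction is and by how strongly the views disagree. The function name `select_query`, the `view_probs` input layout, and the `disagreement_weight` parameter are assumptions made for this example; this is a generic uncertainty-plus-disagreement heuristic, not necessarily the strategy developed in the project.

```python
from statistics import mean, pvariance


def select_query(view_probs, disagreement_weight=1.0):
    """Return the index of the unlabeled example to query next.

    view_probs[v][i] is the positive-class probability that the classifier
    trained on view v assigns to unlabeled example i (hypothetical input).
    """
    n_examples = len(view_probs[0])
    best_index, best_score = None, float("-inf")
    for i in range(n_examples):
        probs = [view[i] for view in view_probs]
        # Uncertainty: 1 when the averaged prediction is 0.5, 0 when it is 0 or 1.
        uncertainty = 1.0 - 2.0 * abs(mean(probs) - 0.5)
        # Disagreement: population variance of the per-view predictions.
        disagreement = pvariance(probs)
        score = uncertainty + disagreement_weight * disagreement
        if score > best_score:
            best_index, best_score = i, score
    return best_index


# Two views, three unlabeled images: the views contradict each other on image 2.
shape_view = [0.90, 0.50, 0.10]
texture_view = [0.95, 0.50, 0.90]
next_query = select_query([shape_view, texture_view])
```

With these made-up probabilities, the third image is selected: its averaged prediction is maximally uncertain and the two views contradict each other, so a label for it is likely to be informative.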
Bayesian Localized Multiple Kernel Learning
Multiple kernel learning approaches form a set of techniques for performing classification that can easily combine information from multiple data sources, e.g., by adding or multiplying kernels. Most methods, however, are limited by their assumption of a single, global kernel weight per view. For many problems, the set of features important for discriminating between examples can vary locally. As a consequence, these global techniques suffer in the presence of complex noise processes, such as heteroscedastic noise, or when the discriminative properties of each feature type vary across the input space. In this work, we propose a localized multiple kernel learning approach with Gaussian Processes that learns a local weighting over each view. The approach can obtain accurate classification performance and deal with insufficient views corrupted by complex noise, e.g., per-sample occlusion. We demonstrate our approach on the tasks of audio-visual gesture recognition and object category classification.
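The difference between a global and a local view weighting can be sketched as follows. The code assumes each example is a list of per-view feature vectors and uses the gating form k(x, y) = sum_v g_v(x) k_v(x, y) g_v(y) that is common in the localized multiple kernel learning literature. The `occlusion_gate` function is a hand-written, hypothetical stand-in for a learned weighting, and none of this is claimed to be the Gaussian Process model proposed in this work.

```python
import math


def rbf(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def global_kernel(x, y, weights):
    """Global weighting: one weight per view, shared by every example."""
    return sum(w * rbf(xv, yv) for w, xv, yv in zip(weights, x, y))


def local_kernel(x, y, gate):
    """Local weighting: sum over views of g_v(x) * k_v(x, y) * g_v(y)."""
    gx, gy = gate(x), gate(y)
    return sum(a * rbf(xv, yv) * b for a, xv, yv, b in zip(gx, x, y, gy))


def occlusion_gate(x):
    """Hypothetical gate: give zero weight to views that are entirely zero."""
    raw = [0.0 if all(v == 0 for v in view) else 1.0 for view in x]
    total = sum(raw) or 1.0
    return [r / total for r in raw]


# Each example has two views; the second view of `b` is occluded (all zeros).
a = [[1.0, 2.0], [0.5]]
b = [[1.0, 2.0], [0.0]]
k_global = global_kernel(a, b, weights=[0.5, 0.5])
k_local = local_kernel(a, b, gate=occlusion_gate)
```

In this toy case the global kernel still mixes in the similarity computed from the occluded view, whereas the local kernel ignores that view for the corrupted example only.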
Facial Image Indexing Interfaces
During a disaster, a large number of children may become separated from their families. Many of these children, especially the younger ones, may be unable or unwilling to identify themselves, making the task of reuniting them with their families especially difficult. Without a system in place for hospitals to document their unidentified children and to help parents search, families could be separated for months. After Katrina, it took six months until the last child was reunited with her family. We are working on a system in which each hospital takes digital photos of the children's faces, and the system automatically extracts features useful for identification. We also hope to extend the system to automatically refine image searches based on the identification of similar-looking faces. Along those lines, we are working on determining a metric for feature importance in facial similarity.
Grounded Semantics
This project explores how to define the meaning of prepositions using visual data. One potential application is commanding a robot to arrange objects in a room. For example, for a robot to follow the command "Put the cup there, on the front of the table", the robot must identify the target location of the cup. The robot can only identify this location if it understands the meaning of each component of the command. The project specifically focuses on defining the meanings of prepositions because they are perceptible in images and can be composed to form higher-level meanings.
Hashing Algorithms for Scalable Image Search
A common problem with large-scale data is that of quickly extracting the nearest neighbors to a query from a large database. In computer vision, for example, this problem arises in content-based image retrieval, 3D image reconstruction, human body pose estimation, object recognition, and other problems. This project focuses on developing algorithms for quickly and accurately performing large-scale image searches using hashing techniques. Particular contributions include hashing methods for learned metrics and methods for performing locality-sensitive hashing over arbitrary kernel functions, two prominent scenarios arising in modern computer vision applications. Recent work has aimed to learn appropriate hash functions for a given image search task in order to minimize the memory overhead required for accurate searches. We have applied our algorithms to several large-scale data sets, including the 80 million images of the Tiny Image data set and other large content-based image retrieval data sets.
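To make the general idea of hashing for nearest-neighbor search concrete, here is a minimal sketch of basic random-hyperplane locality-sensitive hashing for cosine similarity. It is the textbook baseline, not the learned-metric or kernelized hashing developed in this project, and all function names and parameters (`make_hyperplanes`, `hash_vector`, `build_index`, `query`, `n_bits`) are chosen for this example.

```python
import random
from collections import defaultdict


def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random hyperplane normals with Gaussian entries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def hash_vector(vector, hyperplanes):
    """Map a vector to a bit tuple: one sign bit per random hyperplane."""
    return tuple(
        int(sum(h * v for h, v in zip(plane, vector)) >= 0.0)
        for plane in hyperplanes
    )


def build_index(vectors, hyperplanes):
    """Group database vectors into buckets keyed by their hash code."""
    index = defaultdict(list)
    for i, vector in enumerate(vectors):
        index[hash_vector(vector, hyperplanes)].append(i)
    return index


def query(index, vector, hyperplanes):
    """Return indices of database vectors that share the query's bucket."""
    return index.get(hash_vector(vector, hyperplanes), [])


# Toy "image descriptors"; a real system would use much longer feature vectors.
database = [[1.0, 2.0, 3.0], [-1.0, 0.5, -2.0], [0.2, -0.7, 0.1]]
planes = make_hyperplanes(n_bits=8, dim=3)
index = build_index(database, planes)
candidates = query(index, [2.0, 4.0, 6.0], planes)
```

Vectors separated by a small angle are likely to fall into the same bucket, so only the returned candidates need an exact comparison. More bits make buckets smaller and searches faster, at the cost of missing more true neighbors.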
Nonrigid Object Recognition and Tracking
In our everyday life, we manipulate many nonrigid objects, such as clothes. In the context of personal robotics, it is therefore important for a robot to correctly recognize and track these objects in order to interact with them. While tracking and recognition of rigid objects have received a lot of attention in the computer vision community, similar tasks for deformable objects remain largely unstudied. The main challenges arise from the much larger appearance variability of such objects. Furthermore, the wide range of shapes that a piece of cloth can assume makes 3D reconstruction and tracking very challenging.
In this project, we intend to study machine learning and computer vision methods to solve the following problems:
- Texture-based classification of the different parts of a single object, e.g. boundaries vs. interior parts.
- Instance-level recognition of particular pieces of cloth.
- Category-level / material recognition, e.g. jeans vs. t-shirts.
- 3D shape estimation and tracking.
Most of the above-mentioned problems can be tackled in a multi-modal context, where different types of input, such as video or laser, are available. Another subject of interest is the study of a principled way of combining these inputs, in particular when they are asynchronous.
Transparent Object Recognition
Despite the omnipresence of transparent objects in our daily environment, little research has been conducted on how to recognize and detect such objects. The difficulties of this task lie in the complex interactions between scene geometry and illuminants that lead to changing refraction patterns. Realizing that a complete physical model of these phenomena is out of reach at the moment, we seek a machine learning solution to this problem.
In particular, we investigate a latent local additive feature model. In stark contrast to previous approaches, this method seeks to separate different contributions to the overall gradient statistic in an unsupervised decomposition approach.
Visual Sense Disambiguation Using Multiple Modalities
Traditionally, object recognition requires manually labeled images of objects for training. However, there often exist additional sources of information that can be used as weak labels, reducing the need for human supervision. In this project we use different modalities and information sources to help learn visual models of object categories. The first type of information we use is the speech uttered by a user referring to an object. Such spoken utterances can occur when interacting with an assistant robot, voice-tagging a photo, etc. We propose a method that uses both the image of the object and the speech segment referring to the object to recognize the underlying category label. In preliminary experiments, we have shown that even noisy speech input helps visual recognition, and vice versa. We also explore two sources of information in the text modality: the words surrounding images on the web, and dictionary entries for words that refer to objects. Words that co-occur with images on the web have been used as weak object labels, but this tends to produce noisy datasets with many unrelated images. We use text and dictionary information to learn a refined model of which sense an image found on the web is likely to belong to. We apply this model to a dataset of images of polysemous words collected via image search and show that it improves both retrieval of specific senses and the resulting object classifiers.
Probabilistic Models for Multi-View Learning and Distributed Feature Selection
Many problems in machine learning involve datasets that comprise multiple independent feature sets or views, e.g., audio and video, text and images, and multi-sensor data. In this setting, each view provides a potentially redundant sample of the class or event of interest. Techniques in multi-view learning exploit this property to learn under weak supervision by maximizing the agreement of a set of classifiers defined in each view over the training data. Performing reliable inference and learning in the presence of multi-view data is a challenging problem that is complicated by many factors, including view insufficiency, i.e., learning from real-world noisy observations, and coping with the potentially large amounts of information that arise when incorporating possibly many information sources for classification. In this work we propose probabilistic models built upon multi-view Gaussian Processes (GPs) for learning from noisy real-world multi-view data and for performing distributed feature selection in bandwidth-constrained environments such as those typically encountered in multi-source sensor networks. Initial experiments on audio-visual gesture and multi-view image datasets demonstrate that our probabilistic multi-view learning approach is able to learn under significant amounts of complex view corruption, e.g., per-sample occlusions. Our work on GP-based multi-view feature selection has shown promising results for achieving compact feature descriptions from multiple sensors while preserving classification performance on a multi-view object categorization task.
Interactive Image Matching for Information Retrieval and Human Computer Interaction
Recent advances in content-based image retrieval have made it possible to index and search millions of images accurately and efficiently. Finding images, rather than being an end in itself, can be an effective means for human users to perform a wide variety of interesting tasks in information retrieval and human-computer interaction. For example, when a system finds online images that resemble an object a user is looking at, the text surrounding those images may contain useful information about the object. Moreover, in terms of usability, the user may find it more intuitive to simply take an image of the object as the input query than to describe the object with keywords. In this project, we examine various ways to exploit the vast amount of online multimedia data that contains both images and text. Specifically, we develop prototype systems that search online catalogs for product information and screenshots for software tutorials. We also investigate usability issues such as learnability (how easy is it to learn to use the system to find information) and efficiency (how much more quickly can the user find it).