Event

Cross-lingual information distillation using statistical classification methods

Adish Kumar Singla

EPFL and ICSI

Friday, February 08, 2008
2:45 p.m.

As the amount of information available to users increases, accessing it efficiently becomes difficult. The goal of information distillation is to extract useful pieces of information for a user's query from massive multilingual audio and text sources. In the framework of the DARPA GALE project, these sources are English, Arabic, and Mandarin newswire, blogs, broadcast news, and conversations. One approach to answering queries from non-English sources is to train models mainly on annotated data from English documents and to apply them to machine translation (MT) output of the non-English sources when searching for answers. Although the noise introduced by automatic translation reduces the accuracy of the distillation process, more annotated data is available from the English sources.

In this work, we first experimented with word-based statistical approaches for answering users' queries from Arabic and Mandarin sources. To benefit both from the data available on the English side and from the absence of automatic translation noise on the source-language side, we combined models from the source languages and English. This approach yielded a 5%-15% absolute improvement over previous work on the GALE Year-1 and Year-2 data sets.

 