Cross-lingual information distillation using statistical classification methods
Adish Kumar Singla
EPFL and ICSI
Friday, February 08, 2008
2:45 p.m.
As the amount of information available to users increases, accessing
information efficiently becomes difficult. The goal of
information distillation is to extract useful pieces of information
for a user's query from massive multi-lingual audio and textual
document sources. In the framework of the DARPA GALE project, these
sources are English, Arabic and Mandarin newswire, blogs, broadcast
news and conversations. One approach to answering queries from
non-English sources is to train models mainly on annotated data
from English documents and to apply them to machine translation (MT)
output of the non-English sources when searching for answers. Although
the noise introduced by automatic translation reduces the accuracy of
the distillation process, more annotated data is available from the
English sources.
In this work, we first experimented with word-based statistical
approaches for answering users' queries from Arabic and Mandarin
sources. To benefit both from the data on the English side and from the
absence of automatic translation noise on the source-language side, we
combined models from the source languages and English. We show
that this approach yields a 5%-15% absolute improvement on the
GALE Year-1 and Year-2 data sets over previous work.