Combining Discriminative Re-Ranking and Co-Training for Parsing Mandarin Speech Transcripts
Wen Wang
SRI
Tuesday, February 17, 2009
12:30
Parsing aims at resolving structural ambiguity. State-of-the-art
statistical parsers require treebanks to estimate their parameters,
but their performance degrades when there is a mismatch in genre or
domain between the training treebank and the data to parse.
Furthermore, creating a high-quality in-genre/in-domain treebank for
the data to parse is expensive and difficult. The DARPA GALE program,
however, covers new genres besides newswire text, namely,
broadcast news (BN), broadcast conversation (BC), newsgroup (NG), and
web log (WL). Generating high-quality parse trees for Chinese data in
these genres can be useful for various tasks within GALE, including
syntax-guided translation and reordering models for
Chinese-to-English machine translation (MT), named entity detection,
and structured language modeling for automatic speech recognition
(ASR) on Mandarin BN and BC audio. In our earlier research, we
employed the weakly supervised co-training technique on two
state-of-the-art parsers, Charniak's parser and the Berkeley parser,
to bootstrap them from a newswire Chinese Treebank and a small
annotated BN and BC seed treebank, together with a large amount of unlabeled
BN and BC transcripts, in order to achieve high parsing accuracy on
Mandarin BN and BC transcripts. By employing co-training, we obtained
a 2.2%--2.6% absolute improvement in F-measure for parsing BN and BC
transcripts. Separately, discriminative reranking for parsers
has produced significant improvements in parsing accuracy. In this
work, we explore the effectiveness of combining discriminative
reranking and co-training to further improve parsing performance on
Mandarin BN and BC transcripts. We investigated co-training reranked
parsers and co-training rerankers and observed that co-training
rerankers outperforms co-training reranked parsers and is
also computationally more efficient. We show that combining
discriminative reranking and co-training could improve the F-measure
by 1.8%--2% absolute compared to co-training two state-of-the-art
Chinese parsers without reranking, for parsing Mandarin broadcast news
and conversation transcripts.
Short bio of the speaker:
Wen Wang received her Ph.D. in Computer Engineering from Purdue
University in 2003 and her B.S. in Electrical Engineering (1996)
and M.S. in Computer Engineering (1998) from Shanghai Jiao Tong
University, Shanghai, China. She is currently a research engineer at the
Speech Technology and Research Laboratory of SRI International. Her
research interests are in statistical language modeling, speech
recognition, machine translation, natural language processing
techniques and applications, and optimization. She is a member of the IEEE
and the Association for Computational Linguistics.