Event

Combining Discriminative Re-Ranking and Co-Training for Parsing Mandarin Speech Transcripts

Wen Wang

SRI

Tuesday, February 17, 2009
12:30

Parsing aims to resolve structural ambiguity. State-of-the-art statistical parsers require treebanks to estimate their parameters, but their performance degrades when the genre or domain of the training treebank does not match that of the data to be parsed. Furthermore, creating a high-quality in-genre, in-domain treebank for the data to be parsed is expensive and difficult. Under the DARPA GALE program, there are new genres beyond newswire text, namely broadcast news (BN), broadcast conversation (BC), newsgroup (NG), and web log (WL). Generating high-quality parse trees for Chinese data in these genres can be useful for various tasks within GALE, including syntax-guided translation and reordering models for Chinese-to-English machine translation (MT), named entity detection, and structured language modeling for automatic speech recognition (ASR) on Mandarin BN and BC audio.

In our earlier research, we applied the weakly supervised co-training technique to two state-of-the-art parsers, Charniak's parser and the Berkeley parser. We bootstrapped them from a newswire Chinese Treebank and a small annotated BN and BC seed treebank, together with a large amount of unlabeled BN and BC transcripts, in order to achieve high parsing accuracy on Mandarin BN and BC transcripts. With co-training, we obtained an absolute improvement of 2.2% to 2.6% in F-measure for parsing BN and BC transcripts.

Separately, discriminative reranking for parsers has produced significant improvements in parsing accuracy. In this work, we explore the effectiveness of combining discriminative reranking and co-training to further improve parsing performance on Mandarin BN and BC transcripts. We investigated two approaches, co-training reranked parsers and co-training rerankers, and observed that co-training rerankers outperforms co-training reranked parsers and is also computationally more efficient. We show that combining discriminative reranking and co-training can improve the F-measure by 1.8% to 2% absolute for parsing Mandarin broadcast news and conversation transcripts, compared to co-training two state-of-the-art Chinese parsers without reranking.

Short bio of the speaker: Wen Wang received her Ph.D. in Computer Engineering from Purdue University in 2003, and her B.S. in Electrical Engineering (1996) and M.S. in Computer Engineering (1998) from Shanghai Jiao Tong University, Shanghai, China. She is currently a research engineer at the Speech Technology and Research Laboratory of SRI International. Her research interests are in statistical language modeling, speech recognition, machine translation, natural language processing techniques and applications, and optimization. She is a member of the IEEE and the Association for Computational Linguistics.

 