CorporAl: a Method and Tool for Handling Overlapping Parallel Corpora
This work introduces a method and tool for handling overlapping parallel corpora, i.e. corpora that are based on the same source material. The method is insensitive to minor changes in the text, to different segmentation levels of the corpora, and to material omitted from either corpus. The aim is to detect matching sentence pairs and then either combine the overlapping corpora or compare them and assess their quality relative to each other. The tool lets the user define the desired behavior when combining corpus pairs, resulting in pure comparison, maximum-size, or maximum-quality versions of the combination. We test the tool on two cases of overlapping parallel corpora and five language pairs. We also evaluate the impact of using the method on two translation systems: a phrase-based and a parsing-based one.
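To make the idea of matching sentence pairs despite minor textual changes concrete, here is a minimal Python sketch. It is not the CorporAl implementation: the function names (`normalize`, `match_pairs`, `combine_max_size`), the normalization rules, and the choice to match on the source side only are all assumptions made for illustration. It also does not handle differing segmentation levels, and it leaves out the maximum-quality mode, since that would need a quality measure not specified here.

```python
import re
import unicodedata


def normalize(sentence):
    """Hypothetical normalization: ignore case, punctuation and spacing."""
    text = unicodedata.normalize("NFKC", sentence).lower()
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    return " ".join(text.split())         # collapse runs of whitespace


def match_pairs(corpus_a, corpus_b):
    """Return (pair_from_a, pair_from_b) tuples whose source sides match.

    Each corpus is a list of (source, target) sentence pairs. Matching on
    the normalized source side is an assumption of this sketch.
    """
    index_b = {}
    for pair in corpus_b:
        # Keep the first pair seen for each normalized source sentence.
        index_b.setdefault(normalize(pair[0]), pair)
    return [
        (pair, index_b[normalize(pair[0])])
        for pair in corpus_a
        if normalize(pair[0]) in index_b
    ]


def combine_max_size(corpus_a, corpus_b):
    """Keep all of corpus A, then add pairs from B with an unseen source."""
    seen = {normalize(src) for src, _ in corpus_a}
    combined = list(corpus_a)
    for src, tgt in corpus_b:
        key = normalize(src)
        if key not in seen:
            seen.add(key)
            combined.append((src, tgt))
    return combined
```

In this sketch, `match_pairs` corresponds loosely to the pure-comparison use case (the matched pairs can then be inspected side by side), while `combine_max_size` corresponds loosely to a maximum-size combination.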