We present a semi-supervised, language- and domain-independent approach to high-precision sentence alignment. The key idea is to bootstrap a supervised discriminative learner from wood-standard alignments, i.e., alignments that have been automatically generated by state-of-the-art sentence alignment tools. We deploy three unsupervised sentence aligners (Opus, Hunalign, Gargantua) on two datasets (movie subtitles and novels) and show experimentally that bootstrapping consistently and significantly improves precision, so that, with one exception, we obtain an overall gain in F-score.