We present a semi-supervised, language- and domain-independent approach to high-precision sentence alignment. The key idea is to bootstrap a supervised discriminative learner from wood-standard alignments, i.e., alignments generated automatically by state-of-the-art sentence alignment tools. We use three unsupervised sentence aligners (Opus, Hunalign, Gargantua) on two datasets (movie subtitles and novels) and show experimentally that bootstrapping consistently and significantly improves precision, so that, with one exception, we obtain an overall gain in F-score.
Bannard, Colin and Chris Callison-Burch. Paraphrasing with bilingual parallel corpora. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 597-604, Ann Arbor, MI, 2005.
Blunsom, Phil and Trevor Cohn. Discriminative word alignment with conditional random fields. In Proceedings of the joint conference of the International Committee on Computational Linguistics and the Association for Computational Linguistics (COLING-ACL’06), pages 65-72, Sydney, Australia, 2006.
Braune, Fabienne and Alexander Fraser. Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING’10), pages 81-89, Beijing, China, 2010.
Gale, William A. and Kenneth W. Church. A program for aligning sentences in bilingual corpora. Computational Linguistics, 19(1):75-102, 1993.
Gao, Jianfeng, Jian-Yun Nie, and Ming Zhou. Statistical query translation models for cross language information retrieval. ACM Transactions on Asian Language Information Processing, 5(4):323-359, 2006.
Koehn, Philipp. Statistical Machine Translation. Cambridge University Press, 2010.
Kraaij, Wessel, Jian-Yun Nie, and Michel Simard. Embedding web-based statistical translation models in cross-language information retrieval. Computational Linguistics, 29(3):381-419, 2003.
Lavergne, Thomas, Olivier Cappé, and François Yvon. Practical very large scale CRFs. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL’10), pages 504-513, Uppsala, Sweden, 2010.
Lu, Bin, Benjamin K. Tsou, Jingbo Zhu, Tao Jiang, and Oi Yee Kwong. The construction of a Chinese-English patent parallel corpus. In Proceedings of the MT Summit XII, pages 17-24, Ottawa, Canada, 2009.
Moore, Robert. Fast and accurate sentence alignment of bilingual corpora. In Proceedings of the 5th Conference of the Association for Machine Translation in the Americas (AMTA’02), pages 135-144, Tiburon, CA, 2002.
Munteanu, Dragos Stefan and Daniel Marcu. Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics, 31(4):477-504, 2005.
Resnik, Philip and Noah A. Smith. The web as a parallel corpus. Computational Linguistics, 29(3):349-380, 2003.
Schmid, Helmut. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK, 1994.
Smith, Jason R., Chris Quirk, and Kristina Toutanova. Extracting parallel sentences from comparable corpora using document level alignment. In Proceedings of Human Language Technologies: The 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT’10), pages 403-411, Los Angeles, CA, 2010.
Tiedemann, Jörg. Improved sentence alignment for movie subtitles. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP’07), pages 582-588, Borovets, Bulgaria, 2007.
Tiedemann, Jörg. News from OPUS - a collection of multilingual parallel corpora with tools and interfaces. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP’09), pages 1-12, Borovets, Bulgaria, 2009.
Utiyama, Masao and Hitoshi Isahara. A Japanese-English patent parallel corpus. In Proceedings of MT Summit XI, pages 475-482, Copenhagen, Denmark, 2007.
Varga, Dániel, László Németh, Péter Halácsy, András Kornai, Viktor Trón, and Viktor Nagy. Parallel corpora for medium density languages. In Proceedings of the Recent Advances in Natural Language Processing 2005 Conference, pages 590-596, Borovets, Bulgaria, 2005.
Xu, Jinxi, Ralph Weischedel, and Chanh Nguyen. Evaluating a probabilistic model for cross-lingual information retrieval. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 105-110, New Orleans, LA, 2001.