In statistical machine translation, optimisation is usually performed towards the BLEU score, but the relevance of this metric to human evaluation has been questioned. Many other metrics exist, yet none of them correlates perfectly with human judgement. On the other hand, most evaluation campaigns report multiple metrics (BLEU, TER, METEOR, etc.). Statistical machine translation systems can be optimised for metrics other than BLEU, but such optimisation usually tends to decrease the BLEU score, the main metric used in MT evaluation campaigns.
In this paper we extend the minimum error rate training tool of the popular Moses SMT toolkit with a scorer for TER, as well as for any linear combination of the existing metrics. The TER scorer was reimplemented in C++, which results in roughly ten times faster execution than the reference Java code.
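To make the two extensions concrete, the sketch below shows (under loose assumptions, not the actual Moses code) the two quantities involved: a simplified TER, computed here as word-level edit distance divided by reference length while ignoring the block-shift moves of full TER, and a weighted linear combination of metric scores such as w1*BLEU + w2*(1 - TER), oriented so that higher is always better for the optimiser. The function names are illustrative only.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Simplified TER: word edit distance / reference length. Full TER also
// allows block shifts at unit cost; they are omitted here for brevity.
double simplifiedTer(const std::vector<std::string>& hyp,
                     const std::vector<std::string>& ref) {
    const size_t n = hyp.size(), m = ref.size();
    // Standard Levenshtein dynamic-programming table over words.
    std::vector<std::vector<size_t>> d(n + 1, std::vector<size_t>(m + 1));
    for (size_t i = 0; i <= n; ++i) d[i][0] = i;
    for (size_t j = 0; j <= m; ++j) d[0][j] = j;
    for (size_t i = 1; i <= n; ++i)
        for (size_t j = 1; j <= m; ++j) {
            size_t sub = d[i - 1][j - 1] + (hyp[i - 1] == ref[j - 1] ? 0 : 1);
            d[i][j] = std::min({sub, d[i - 1][j] + 1, d[i][j - 1] + 1});
        }
    return m ? static_cast<double>(d[n][m]) / m : 0.0;
}

// Linear combination of metric scores, e.g. {BLEU, 1 - TER} with fixed
// weights, so that a single "higher is better" value drives the tuning.
double combinedScore(const std::vector<double>& weights,
                     const std::vector<double>& scores) {
    assert(weights.size() == scores.size());
    double s = 0.0;
    for (size_t i = 0; i < weights.size(); ++i) s += weights[i] * scores[i];
    return s;
}
```

Because TER counts errors (lower is better) while BLEU counts matches (higher is better), a combination typically uses 1 - TER, as above, so that both terms pull in the same direction during tuning.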
We have performed experiments with two large-scale phrase-based SMT systems to show the benefit of the new options of the minimum error rate training in Moses. The first one translates from French into English (WMT 2011 evaluation). The second one was developed in the framework of the DARPA GALE project to translate from Arabic into English in three different genres (news, web, and transcribed broadcast news and conversations).
Banerjee, Satanjeev, and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization at the 43rd Annual Meeting of the Association for Computational Linguistics, 2005.
Bertoldi, Nicola, Barry Haddow, and Jean-Baptiste Fouet. Improved minimum error rate training in Moses. The Prague Bulletin of Mathematical Linguistics, 91, 2009.
Cer, Daniel, Michel Galley, Daniel Jurafsky, and Christopher Manning. Phrasal: A toolkit for statistical machine translation with facilities for extraction and incorporation of arbitrary model features. In North American Association of Computational Linguistics - Demo Session (NAACL-10), 2010a.
Cer, Daniel, Christopher D. Manning, and Daniel Jurafsky. The best lexical metric for phrase-based statistical MT system optimization. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT '10, pages 555-563, Stroudsburg, PA, USA, 2010b. Association for Computational Linguistics. ISBN 1-932432-65-5. URL http://portal.acm.org/citation.cfm?id=1857999.1858079
Hammon, Olivier. Rapport du projet CESTA : Campagne d'évaluation des systèmes de traduction automatique. Technical report, ELDA, 2007.
Koehn, Philipp, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In ACL demonstration session, 2007.
Mauser, Arne, Saša Hasan, and Hermann Ney. Automatic evaluation measures for statistical machine translation system optimization. In LREC'08, 2008.
Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL, 2002.
Schwenk, Holger, Patrik Lambert, Loïc Barrault, Christophe Servan, Haithem Afli, Sadaf Abdul-Rauf, and Kashif Shah. LIUM's SMT machine translation systems for WMT 2011. In 6th Workshop on Statistical Machine Translation, 2011.
Snover, Matthew, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. A study of translation edit rate with targeted human annotation. In ACL, 2006.
Snover, Matthew, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. Fluency, adequacy, or HTER? Exploring different human judgments with a tunable MT metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation at the 12th Meeting of the European Chapter of the Association for Computational Linguistics (EACL-2009), pages 259-268, 2009.