Comparative Quality Estimation for Machine Translation: Observations on Machine Learning and Features

  German Research Center for Artificial Intelligence (DFKI Berlin)


A deeper analysis of Comparative Quality Estimation is presented, extending the state-of-the-art methods with adequacy and grammatical features from other Quality Estimation tasks. The previously used linear method, unable to cope with the augmented feature set, is replaced with a boosting classifier assisted by feature selection. The resulting methods show improved performance for 6 language pairs when applied to the output of MT systems developed over 7 years. The improved models compete better with reference-aware metrics.

Notable conclusions are reached by examining the contribution of the features in the models, which also makes it possible to identify common MT errors captured by the features. Many grammatical/fluency features contribute well and a few adequacy features contribute somewhat, whereas source complexity features are of no use. The importance of many fluency and adequacy features is language-specific.
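
As a rough illustration of the comparative setting, the sketch below ranks alternative MT outputs of one source sentence by counting pairwise wins. It assumes the ranking is decomposed into pairwise decisions, which the abstract does not state. The names `rank_by_pairwise_wins` and `toy_is_better`, the two features, and all values are hypothetical; the toy comparison rule merely stands in for a trained classifier such as the boosting model mentioned above.

```python
from itertools import permutations
from typing import Callable, Dict, List, Sequence

Features = Sequence[float]


def rank_by_pairwise_wins(
    candidates: Dict[str, Features],
    is_better: Callable[[Features, Features], bool],
) -> List[str]:
    """Order system outputs from best to worst by counting pairwise wins."""
    wins = {name: 0 for name in candidates}
    # Compare every ordered pair of candidate translations.
    for a, b in permutations(candidates, 2):
        if is_better(candidates[a], candidates[b]):
            wins[a] += 1
    # More wins means a better rank; sorted() is stable, so ties keep input order.
    return sorted(candidates, key=lambda name: -wins[name])


def toy_is_better(a: Features, b: Features) -> bool:
    """Hypothetical stand-in for a trained pairwise classifier.

    Feature 0: number of parse errors (lower is better, assumed).
    Feature 1: language model log-probability (higher is better, assumed).
    """
    return (-a[0], a[1]) > (-b[0], b[1])


# Invented feature values for three outputs of the same source sentence.
outputs = {
    "system_A": (2.0, -35.1),
    "system_B": (0.0, -40.2),
    "system_C": (1.0, -33.7),
}

ranking = rank_by_pairwise_wins(outputs, toy_is_better)
print(ranking)
```

No reference translation is involved at any point: the order depends only on features of the candidate outputs, which is what distinguishes this setting from reference-aware metrics.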



