Learning Morphological Normalization for Translation from and into Morphologically Rich Languages

Franck Burlot 1 and François Yvon 1
  • 1 LIMSI, CNRS, Université Paris-Saclay, France


When translating between a morphologically rich language (MRL) and English, word forms in the MRL often encode grammatical information that is irrelevant to English, which leads to data sparsity. This problem can be mitigated by normalizing the MRL so that the irrelevant information is removed. Such preprocessing is usually performed deterministically, using hand-crafted rules that yield suboptimal representations. We introduce a simple way to automatically compute an appropriate normalization of the MRL and show that it can improve machine translation in both directions.
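To make the sparsity argument concrete, here is a minimal sketch of the deterministic, rule-based kind of normalization that the abstract describes as the usual practice. It is not the learned method the paper introduces. The toy corpus, the feature names, and the helper functions `normalize` and `vocabulary` are all assumptions made up for illustration; a real pipeline would take lemmas and features from a morphological analyzer.

```python
from collections import Counter

# Hypothetical toy corpus: (lemma, features) pairs standing in for the
# output of a morphological analyzer. All values are illustrative.
corpus = [
    ("kniha", {"pos": "N", "number": "sg", "case": "nom"}),
    ("kniha", {"pos": "N", "number": "sg", "case": "acc"}),
    ("kniha", {"pos": "N", "number": "sg", "case": "gen"}),
    ("kniha", {"pos": "N", "number": "pl", "case": "nom"}),
    ("kniha", {"pos": "N", "number": "pl", "case": "gen"}),
]


def normalize(lemma, feats, keep):
    """Rebuild a token from the lemma and only the features listed in `keep`."""
    kept = "+".join(f"{name}={feats[name]}" for name in keep if name in feats)
    return f"{lemma}|{kept}" if kept else lemma


def vocabulary(tokens, keep):
    """Count normalized token types for a given set of retained features."""
    return Counter(normalize(lemma, feats, keep) for lemma, feats in tokens)


# Keeping every feature: each inflected form is its own vocabulary entry.
full = vocabulary(corpus, keep=("pos", "number", "case"))

# Hand-written rule (an assumption): case is treated as irrelevant to English,
# while number is kept because English also marks it.
reduced = vocabulary(corpus, keep=("pos", "number"))

print(len(full), len(reduced))  # prints "5 2"
```

In this toy setting, dropping case merges five token types into two, so each remaining type is observed more often. Which features to drop is hard-coded in the `keep` argument here; choosing them automatically is the problem the paper addresses.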

