State-of-the-art neural machine translation (NMT) systems must use a fixed-size word vocabulary to control model complexity, which is an important bottleneck on performance, especially for morphologically rich languages. Conventional methods that aim to overcome this problem with sub-word or character-level representations rely solely on statistics and disregard the linguistic properties of words, which breaks up word structure and causes semantic and syntactic losses. In this paper, we propose a new vocabulary reduction method for NMT that can reduce the vocabulary of a given input corpus at any rate while also considering the morphological properties of the language. Our method is based on unsupervised morphology learning and can, in principle, be used to pre-process any language pair. We also present an alternative word segmentation method based on supervised morphological analysis, which helps us measure the accuracy of our model. We evaluate our method on a Turkish-to-English NMT task, where the input language is morphologically rich and agglutinative. We analyze different representation methods in terms of translation accuracy as well as the semantic and syntactic properties of the generated output. Our method obtains a significant improvement of 2.3 BLEU points over the conventional vocabulary reduction technique, showing that it can provide better accuracy in open vocabulary translation of morphologically rich languages.