Source-Side Discontinuous Phrases for Machine Translation: A Comparative Study on Phrase Extraction and Search

Matthias Huck 1, Erik Scharwächter 1 and Hermann Ney 1
  • 1 Human Language Technology and Pattern Recognition Group, RWTH Aachen University

Abstract

Standard phrase-based statistical machine translation systems generate translations based on an inventory of continuous bilingual phrases. In this work, we extend a phrase-based decoder with the ability to make use of phrases that are discontinuous in the source part. Our dynamic programming beam search algorithm supports separate pruning of coverage hypotheses per cardinality and of lexical hypotheses per coverage, as well as coverage constraints that impose restrictions on the possible reorderings. In addition to investigating these aspects, which are related to the decoding procedure, we also concentrate our attention on the question of how to obtain source-side discontinuous phrases from parallel training data. Two approaches (hierarchical and discontinuous extraction) are presented and compared. On a large-scale Chinese→English translation task, we conduct a thorough empirical evaluation in order to study a number of system configurations with source-side discontinuous phrases, and to compare them to setups which employ continuous phrases only.


