eppex: Epochal Phrase Table Extraction for Statistical Machine Translation

Open access

We present a tool that extracts phrase pairs from a word-aligned parallel corpus and filters them on the fly based on a user-defined frequency threshold. This substantially reduces the number of phrase pairs to be scored, making the whole phrase table construction process faster with no significant harm to the final phrase table quality as measured by BLEU. Technically, our tool is an alternative to the extract component of the phrase-extract toolkit bundled with the Moses SMT software and covers some of the functionality of sigfilter.
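
The abstract does not spell out how the on-the-fly filtering works. As an illustration only, the sketch below assumes an epoch-based approximate counter in the style of lossy counting (Manku and Motwani, 2002): phrase pairs are counted as they stream past, and at the end of each epoch the pairs that are too rare are pruned. The names `LossyCounter`, `epsilon`, and `support` are hypothetical and do not reflect the tool's actual interface or options.

```python
import math


class LossyCounter:
    """Approximate frequency counter over a stream (lossy counting sketch).

    Hypothetical illustration; not the actual eppex implementation.
    """

    def __init__(self, epsilon):
        self.epsilon = epsilon                 # maximum relative counting error
        self.width = math.ceil(1 / epsilon)    # number of items per epoch
        self.n = 0                             # items seen so far
        self.epoch = 1                         # index of the current epoch
        self.entries = {}                      # item -> [count, max_missed]

    def add(self, item):
        self.n += 1
        if item in self.entries:
            self.entries[item][0] += 1
        else:
            # An item first seen in epoch e may have been missed
            # at most e - 1 times before it was (re)inserted.
            self.entries[item] = [1, self.epoch - 1]
        if self.n % self.width == 0:
            # Epoch boundary: drop items that cannot be frequent yet.
            self.entries = {
                k: v for k, v in self.entries.items()
                if v[0] + v[1] > self.epoch
            }
            self.epoch += 1

    def frequent(self, support):
        """Return items whose counted frequency is at least (support - epsilon) * n."""
        threshold = (support - self.epsilon) * self.n
        return {k: v[0] for k, v in self.entries.items() if v[0] >= threshold}


def filter_phrase_pairs(pairs, epsilon, support):
    """Count (source, target) phrase pairs from an iterable and keep the frequent ones."""
    counter = LossyCounter(epsilon)
    for pair in pairs:
        counter.add(pair)
    return counter.frequent(support)
```

With this scheme memory use stays bounded because rare pairs are discarded at every epoch boundary rather than after the whole corpus has been read, which is the general idea behind reducing the number of phrase pairs passed on to scoring.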

