Parallel Phrase Scoring for Extra-large Corpora

Open access

This paper presents a C++ implementation of the phrase scoring step of phrase-based systems. It exploits the available computing resources more efficiently and trains very large systems in reasonable time without sacrificing system performance as measured by BLEU score.

Three parallelization tools are made freely available. The first exploits shared-memory parallelism and multiple disks for parallel I/O, while the other two run in a distributed environment.

We demonstrate the efficiency and consistency of our tools in the framework of the Fr-En systems we developed for the WMT and IWSLT evaluation campaigns, where we generated the phrase table in one third to one seventh of the time taken by Moses on the same tasks.

