Empirical Investigation of Optimization Algorithms in Neural Machine Translation

Parnia Bahar, Tamer Alkhouli, Jan-Thorsten Peter, Christopher Jan-Steffen Brix, and Hermann Ney
  • Human Language Technology and Pattern Recognition Group, RWTH Aachen University, Aachen, Germany


Training neural networks is a non-convex, high-dimensional optimization problem. In this paper, we provide a comparative study of the most popular stochastic optimization techniques used to train neural networks. We evaluate the methods in terms of convergence speed, translation quality, and training stability. In addition, we investigate combinations that seek to improve optimization with respect to these criteria. We train state-of-the-art attention-based models for neural machine translation and report results on two tasks: WMT 2016 En→Ro and WMT 2015 De→En.
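As a minimal NumPy sketch (not the paper's code), the two extremes of the optimizer families compared here can be written as update rules: plain SGD (Robbins and Monro, 1951) and the adaptive Adam method (Kingma and Ba, 2014), applied to a toy quadratic loss f(w) = 0.5·||w||². The hyperparameter values are illustrative defaults, not the settings used in the paper's experiments.

```python
import numpy as np

def sgd_step(w, grad, lr=0.1):
    """Vanilla stochastic gradient descent: step against the gradient."""
    return w - lr * grad

def adam_step(w, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: per-parameter step sizes from bias-corrected moment estimates."""
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * grad           # first moment (running mean)
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment (running uncentred variance)
    m_hat = m / (1 - beta1 ** t)                 # bias correction for zero-initialized moments
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, (m, v, t)

# Toy loss f(w) = 0.5 * ||w||^2, whose gradient is simply w.
w_sgd = np.array([1.0, -2.0])
w_adam = w_sgd.copy()
state = (np.zeros(2), np.zeros(2), 0)
for _ in range(1000):
    w_sgd = sgd_step(w_sgd, w_sgd)
    w_adam, state = adam_step(w_adam, w_adam, state, lr=0.05)

print("SGD:  ", np.linalg.norm(w_sgd))
print("Adam: ", np.linalg.norm(w_adam))
```

Both runs drive the parameters toward the optimum at the origin; the interesting differences studied in the paper (convergence speed, final translation quality, stability across seeds) only emerge on the non-convex NMT training objective.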



