Performance Evaluation of Deep Neural Networks Applied to Speech Recognition: RNN, LSTM and GRU

Abstract

Deep Neural Networks (DNNs) are neural networks with many hidden layers. They are becoming popular in automatic speech recognition, where a good acoustic model is combined with a language model. Standard feedforward neural networks cannot handle speech data well because they have no recurrent connections to carry information from one time step to the next. Recurrent Neural Networks (RNNs) were therefore introduced to take temporal dependencies into account. However, RNNs cannot handle long-term dependencies because of the vanishing/exploding gradient problem. Long Short-Term Memory (LSTM) networks, a special case of RNNs, were introduced to capture long-term dependencies in speech in addition to short-term ones. Similarly, Gated Recurrent Unit (GRU) networks are a simpler variant of LSTM networks that also take long-term dependencies into consideration. In this paper, we evaluate RNN, LSTM, and GRU networks and compare their performance on a reduced TED-LIUM speech data set. The results show that LSTM achieves the best word error rates, while GRU optimization is faster and achieves word error rates close to those of LSTM.
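To make the architectural differences concrete, the following is a minimal, illustrative PyTorch sketch (not code from the paper) that instantiates the three recurrent architectures under comparison. The feature dimension, hidden size, and batch shapes are assumptions chosen for demonstration only.

```python
# Illustrative sketch: the three recurrent architectures compared in the paper.
# All sizes below are assumptions for demonstration, not the paper's settings.
import torch
import torch.nn as nn

input_dim = 39    # e.g., an MFCC feature dimension (assumed)
hidden_dim = 128  # hidden state size (assumed)

rnn = nn.RNN(input_dim, hidden_dim, batch_first=True)    # plain recurrence, no gates
lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)  # gated, with a separate cell state
gru = nn.GRU(input_dim, hidden_dim, batch_first=True)    # gated, no separate cell state

# A dummy batch of 4 utterances, each 100 frames long.
x = torch.randn(4, 100, input_dim)

out_rnn, h_rnn = rnn(x)               # final hidden state only
out_lstm, (h_lstm, c_lstm) = lstm(x)  # LSTM also returns a cell state
out_gru, h_gru = gru(x)

# All three produce one hidden vector per frame; the gating in LSTM/GRU
# is what lets gradients survive across long frame sequences.
print(out_rnn.shape, out_lstm.shape, out_gru.shape)  # each: (4, 100, 128)
```

The sketch also reflects why GRU training tends to be cheaper: the GRU has two gates and no separate cell state, so it carries fewer parameters per unit than the three-gate LSTM.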
