Domain Adaptation of Deep Neural Networks for Automatic Speech Recognition via Wireless Sensors

Open access


Wireless sensors are recent, portable, low-power devices designed to record and transmit observations of their environment, such as speech. To remain portable they must be small and light; this, together with their low power consumption, usually means that they carry only quite basic recording equipment (e.g. a microphone). Current speech technology applications typically require several dozen hours of audio recordings for training (hundreds of hours are now common), and this amount of material is usually not available from such sensors. Since systems trained on studio-quality utterances tend to perform suboptimally on sensor recordings, a sensible approach is to adapt models that were trained on existing, larger, noise-free corpora. In this study, we experimented with adapting Deep Neural Network-based acoustic models trained on noise-free speech data to recognize utterances recorded by wireless sensors. With this adaptation, we achieved a 5% relative error reduction compared to training only on the restricted set of sensor-recorded utterances.
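To make the reported metric concrete, the following minimal Python sketch shows how a relative error reduction is usually computed from two error rates. The function name and the error rates below are illustrative assumptions, not values or code from the study; the abstract reports only the relative figure.

```python
def relative_error_reduction(baseline_error: float, adapted_error: float) -> float:
    """Return the relative error reduction as a fraction of the baseline error.

    Both arguments are error rates on the same scale (e.g. 0.20 for 20%).
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - adapted_error) / baseline_error


# Hypothetical error rates chosen for illustration only (NOT from the paper):
# a model trained only on sensor data vs. a model adapted from clean speech.
baseline = 0.20
adapted = 0.19

rer = relative_error_reduction(baseline, adapted)
print(f"Relative error reduction: {rer:.1%}")
```

Note that a relative reduction is measured against the baseline error, so in this illustration an absolute improvement of one percentage point corresponds to a relative reduction of about 5%.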

Journal of Electrical Engineering

The Journal of Slovak University of Technology


