Two-Microphone Dereverberation for Automatic Speech Recognition of Polish

Open access

Abstract

Reverberation is a common problem for many speech technologies, such as automatic speech recognition (ASR) systems. This paper investigates a novel combination of precedence, binaural, and statistical-independence cues for enhancing reverberant speech prior to ASR when two microphone signals are available. The enhancement is evaluated both by signal-level measures and by recognition accuracy on English and Polish ASR tasks. The signal measures and the recognition results are not always consistent with each other; in recognition, however, the proposed method consistently outperforms all other cue combinations and the spectral-subtraction baseline.

Archives of Acoustics

The Journal of the Institute of Fundamental Technological Research of the Polish Academy of Sciences
