System for Automatic Transcription of Sessions of the Polish Senate

Open access

Abstract

This paper describes the research behind a Large-Vocabulary Continuous Speech Recognition (LVCSR) system for transcribing Polish-language Senate speeches. The system comprises several components: a phonetic transcription system, language and acoustic model training systems, a Voice Activity Detector (VAD), an LVCSR decoder, and a subtitle generator and presentation system. Some modules rely on existing tools, while others had to be built from scratch; in both cases the authors used the most advanced techniques available to them at the time. Finally, several experiments were performed to compare the performance of more modern and more conventional technologies.

Archives of Acoustics

The Journal of the Institute of Fundamental Technological Research of the Polish Academy of Sciences

