A phoneme segmentation method based on the analysis of discrete wavelet transform spectra is described. Locating phoneme boundaries is particularly useful in speech recognition: it allows more accurate acoustic models to be used, since phoneme length provides additional information for parametrization. The method relies on the values of power envelopes and their first derivatives in six frequency subbands. It searches for specific scenarios that are typical of phoneme boundaries. The discrete times at which such events occur are noted and graded using a distribution-like event function, which represents the change of the energy distribution in the frequency domain. The exact definition of the method is given in the paper. The final decision on the location of boundaries is made by analysing the event function, so boundaries are extracted using information from all subbands. The method was developed on a small set of hand-segmented Polish words and tested on a separate, large corpus containing 16 425 utterances. A recall and precision measure designed specifically for evaluating speech segmentation was adapted using fuzzy sets. With this measure, an F-score of 72.49% was obtained.
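The pipeline outlined above (subband decomposition, power envelopes, first derivatives, event function) can be illustrated with a rough sketch. It is not the authors' implementation, because the exact definition is given only in the paper. The Haar wavelet, the moving-average smoothing, the relative threshold of 0.2, and all function names are assumptions made for illustration.

```python
import numpy as np

def haar_dwt_details(signal, levels=6):
    """Detail coefficients of a Haar DWT, one array per dyadic subband.
    The Haar wavelet is an assumption; the abstract does not name one."""
    approx = np.asarray(signal, dtype=float)
    details = []
    for _ in range(levels):
        if len(approx) % 2:                      # pad odd lengths by repeating the last sample
            approx = np.append(approx, approx[-1])
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))
        approx = (even + odd) / np.sqrt(2.0)
    return details

def power_envelopes(details, n_frames, smooth=3):
    """Smoothed power of each subband, resampled onto a common time grid.
    `smooth` must be odd so the smoothed array keeps its length."""
    kernel = np.ones(smooth) / smooth
    grid = np.linspace(0.0, 1.0, n_frames)
    envs = []
    for d in details:
        padded = np.pad(d ** 2, smooth // 2, mode="edge")   # avoid edge dips
        power = np.convolve(padded, kernel, mode="valid")
        src = np.linspace(0.0, 1.0, len(power))
        envs.append(np.interp(grid, src, power))
    return np.array(envs)                        # shape: (levels, n_frames)

def event_function(envs, rel_threshold=0.2):
    """Per frame, count the subbands whose envelope derivative is large
    relative to that subband's maximum (a hypothetical grading rule)."""
    deriv = np.diff(envs, axis=1)                # first derivative in time
    scale = np.abs(deriv).max(axis=1, keepdims=True) + 1e-12
    return (np.abs(deriv) / scale > rel_threshold).sum(axis=0)

if __name__ == "__main__":
    # Synthetic signal: a low tone followed by a high tone.
    t = np.arange(2048)
    sig = np.concatenate([np.sin(2 * np.pi * 0.01 * t),
                          np.sin(2 * np.pi * 0.20 * t)])
    envs = power_envelopes(haar_dwt_details(sig), n_frames=64)
    events = event_function(envs)
    print("candidate boundary frame:", int(events.argmax()))
```

Frames where the event function is high are candidate boundaries, because several subbands change energy there at once. A real system would still need a decision rule over the event function, which this sketch omits.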