Storytelling Voice Conversion: Evaluation Experiment Using Gaussian Mixture Models

Jiří Přibil 1, Anna Přibilová 2, and Daniela Ďuračková 2
  • 1 Department of Imaging Methods, Institute of Measurement Science, Slovak Academy of Sciences in Bratislava
  • 2 Institute of Electronics and Photonics, Faculty of Electrical Engineering and Information Technology STU, Ilkovičova 3, SK-812 19 Bratislava


Developing voice conversion and personification for text-to-speech (TTS) systems requires feedback on how users perceive the quality of the resulting synthetic speech. The main aim of the experiments described in this paper was therefore to find out whether a classifier based on Gaussian mixture models (GMM) can be used to evaluate different storytelling voices created by transforming sentences generated by a Czech and Slovak TTS system. We assume that this GMM-based statistical evaluation can be combined with the classical evaluation by listening tests, or can replace it. The results obtained correlated well with those of a conventional listening test, which confirms the practical usability of the developed GMM classifier. The analysis was also used to determine the optimal setting of the initial parameters and the structure of the input feature set for recognition of the storytelling voices.
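As a rough illustration of the general idea behind a GMM-based classifier, the following minimal sketch trains one diagonal-covariance GMM per voice class with the EM algorithm and assigns a test utterance to the class whose model gives the highest total log-likelihood. Everything specific here is an assumption made for illustration and is not taken from the paper: the class labels `voice_a` and `voice_b`, the synthetic two-dimensional "feature frames", the number of mixture components, the iteration count, and the helper names. The actual feature set and parameter settings of the described classifier are not given in this abstract.

```python
import math
import random


def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of log values."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def frame_log_terms(x, model):
    """Per-component log(weight) + log density for one frame."""
    weights, means, variances = model
    return [
        math.log(w) + log_gauss(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]


def fit_gmm(frames, n_mix=2, n_iter=25, seed=0, floor=1e-3):
    """Fit a diagonal-covariance GMM with EM (hypothetical default settings)."""
    rng = random.Random(seed)
    dim = len(frames[0])
    means = [list(x) for x in rng.sample(frames, n_mix)]  # init from data points
    variances = [[1.0] * dim for _ in range(n_mix)]
    weights = [1.0 / n_mix] * n_mix
    for _ in range(n_iter):
        model = (weights, means, variances)
        # E-step: responsibility of each component for each frame.
        resp = []
        for x in frames:
            terms = frame_log_terms(x, model)
            norm = logsumexp(terms)
            resp.append([math.exp(t - norm) for t in terms])
        # M-step: re-estimate weights, means and (floored) variances.
        weights, means, variances = [], [], []
        for k in range(n_mix):
            nk = max(sum(r[k] for r in resp), 1e-12)
            mean_k = [
                sum(r[k] * x[d] for r, x in zip(resp, frames)) / nk
                for d in range(dim)
            ]
            var_k = [
                max(floor, sum(r[k] * (x[d] - mean_k[d]) ** 2
                               for r, x in zip(resp, frames)) / nk)
                for d in range(dim)
            ]
            weights.append(nk / len(frames))
            means.append(mean_k)
            variances.append(var_k)
    return weights, means, variances


def score(frames, model):
    """Total log-likelihood of all frames of an utterance under one GMM."""
    return sum(logsumexp(frame_log_terms(x, model)) for x in frames)


def classify(models, frames):
    """Return the label of the class model with the highest score."""
    return max(models, key=lambda label: score(frames, models[label]))


def make_frames(center, n, rng, spread=0.5):
    """Synthetic stand-in for real feature frames (assumption, not real data)."""
    return [[rng.gauss(c, spread) for c in center] for _ in range(n)]


rng = random.Random(1)
models = {
    "voice_a": fit_gmm(make_frames((0.0, 0.0), 200, rng)),
    "voice_b": fit_gmm(make_frames((3.0, 3.0), 200, rng)),
}
test_utterance = make_frames((3.0, 3.0), 20, rng)
print(classify(models, test_utterance))
```

In a setup like this, agreement between the predicted class and the voice a listener reports hearing is what would allow the classifier's decisions to be compared with listening-test results.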


