Storytelling Voice Conversion: Evaluation Experiment Using Gaussian Mixture Models

Open access


In the development of voice conversion and personification for text-to-speech (TTS) systems, feedback on users' opinion of the resulting synthetic speech quality is essential. The main aim of the experiments described in this paper was therefore to determine whether a classifier based on Gaussian mixture models (GMMs) can be applied to evaluate different storytelling voices created by transforming sentences generated by the Czech and Slovak TTS system. We suppose that this GMM-based statistical evaluation can be combined with, or even replace, the classical evaluation in the form of listening tests. The results obtained in this way correlated well with those of a conventional listening test, confirming the practical usability of the developed GMM classifier. The performed analysis also yielded the optimal setting of the initial parameters and the structure of the input feature set for recognition of the storytelling voices.
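The evaluation approach described above can be illustrated with a minimal sketch: one GMM is trained per voice class and an utterance is assigned to the class whose model gives the highest likelihood. This is only an illustration under assumed tooling (NumPy and scikit-learn, with random clusters standing in for the spectral and prosodic features used in the paper), not the authors' implementation.

```python
# Minimal sketch of GMM-based voice-class evaluation.
# Assumptions: scikit-learn's GaussianMixture; synthetic Gaussian clusters
# stand in for the spectral/prosodic feature vectors analysed in the paper.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Placeholder "feature vectors" for two storytelling voices, A and B.
voice_a_train = rng.normal(loc=0.0, scale=1.0, size=(200, 4))
voice_b_train = rng.normal(loc=3.0, scale=1.0, size=(200, 4))

# One GMM per voice class, as in text-independent speaker identification.
gmm_a = GaussianMixture(n_components=2, random_state=0).fit(voice_a_train)
gmm_b = GaussianMixture(n_components=2, random_state=0).fit(voice_b_train)

def classify(frames):
    """Assign the voice whose GMM gives the higher mean log-likelihood."""
    return "A" if gmm_a.score(frames) > gmm_b.score(frames) else "B"

# Frames drawn near voice B's cluster should be classified as "B".
test_frames = rng.normal(loc=3.0, scale=1.0, size=(50, 4))
print(classify(test_frames))
```

In the paper the classifier's per-voice decisions are then compared against listening-test scores; the sketch shows only the classification step itself.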


