Duration and speed of speech events: A selection of methods

Open access


The study of speech timing, i.e. the duration and speed or tempo of speech events, has increased in importance over the past twenty years. This is in particular due to increased demands for accuracy, intelligibility and naturalness in speech technology, to applications in language teaching and testing, and to the study of speech timing patterns in language typology. However, the methods used in such studies are very diverse, and so far there is no accessible overview of them. Since the field is too broad for an exhaustive account, we have made two choices. First, we provide a framework of paradigmatic (classificatory), syntagmatic (compositional) and functional (discourse-oriented) dimensions for duration analysis. Second, we provide worked examples of a selection of methods associated primarily with these three dimensions. Some of the methods covered are established state-of-the-art approaches (e.g. the paradigmatic Classification and Regression Tree (CART) analysis); others are discussed critically (e.g. so-called ‘rhythm metrics’). A set of syntagmatic approaches applies to the tokenisation and tree parsing of duration hierarchies based on speech annotations, and a functional approach describes duration distributions in relation to sociolinguistic variables. Several of the methods are supported by a new web-based software tool for analysing annotated speech data, the Time Group Analyser.
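As an illustration of the kind of ‘rhythm metric’ mentioned above, the following is a minimal sketch of the raw and normalised Pairwise Variability Index (rPVI, nPVI), computed over a sequence of interval durations such as vocalic intervals taken from a speech annotation. It is not the authors' implementation and does not reproduce the Time Group Analyser. The function names, the millisecond unit and the assumption that durations arrive as a plain list of positive numbers are illustrative choices.

```python
from typing import Iterator, Sequence, Tuple


def _pairs(durations: Sequence[float]) -> Iterator[Tuple[float, float]]:
    """Return successive (d_k, d_k+1) pairs; at least two intervals are required."""
    if len(durations) < 2:
        raise ValueError("at least two interval durations are required")
    return zip(durations, durations[1:])


def rpvi(durations: Sequence[float]) -> float:
    """Raw PVI: mean absolute difference between successive interval durations."""
    diffs = [abs(a - b) for a, b in _pairs(durations)]
    return sum(diffs) / len(diffs)


def npvi(durations: Sequence[float]) -> float:
    """Normalised PVI: each difference is divided by the mean of its pair, then scaled by 100.

    Assumes strictly positive durations, so the pair mean is never zero.
    """
    terms = [abs(a - b) / ((a + b) / 2) for a, b in _pairs(durations)]
    return 100 * sum(terms) / len(terms)


if __name__ == "__main__":
    # Hypothetical vocalic-interval durations in milliseconds (illustrative only).
    vowels_ms = [100, 200, 100]
    print(rpvi(vowels_ms))            # 100.0
    print(round(npvi(vowels_ms), 2))  # 66.67
```

A perfectly regular sequence yields 0 for both indices. Dividing each difference by the local pair mean is intended to make nPVI less sensitive to overall tempo than rPVI, which keeps the units of the input durations.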


Lingua Posnaniensis

The Journal of Poznan Society for the Advancement of the Arts and Sciences and Adam Mickiewicz University, Institute of Linguistics


