Early Modern Multiloquent Authors (EMMA): Designing a large-scale corpus of individuals’ languages

Peter Petré¹, Lynn Anthonissen¹,², Sara Budts¹, Enrique Manjavacas¹, Emma-Louise Silva¹, William Standing¹, and Odile A.O. Strik¹
  • ¹ University of Antwerp
  • ² Ludwig Maximilian University of Munich


The present article provides a detailed description of the corpus of Early Modern Multiloquent Authors (EMMA), as well as two small case studies that illustrate its benefits. As a large-scale specialized corpus, EMMA aims to balance big data with sociolinguistic coverage. It comprises the writings of 50 carefully selected authors across five generations, mostly drawn from 17th-century London society. EMMA enables the study of language as both a social and a cognitive phenomenon and allows us to explore the interaction between the individual and aggregate levels.

The first part of the article describes EMMA’s first release and the sociolinguistic and methodological principles that underlie its design and compilation. We cover the conceptual decisions and practical implementations at various stages of the compilation process: from text markup, encoding, and data preprocessing to metadata enrichment and verification.

In the second part, we present two small case studies to illustrate how rich contextualization can guide the interpretation of quantitative corpus-linguistic findings. The first case study compares the past tense formation of strong verbs in writers without access to higher education with that of writers with extensive training in Latin. The second case study relates s/th-variation in the language of a single writer, Margaret Cavendish, to major shifts in her personal life.


