Guidelines for normalising Early Modern English corpora: Decisions and justifications

Open access


Corpora of Early Modern English have been collected and released for research for a number of years. With large-scale digitisation gathering pace in the last decade, much more historical textual data is now available for research on topics such as historical linguistics and conceptual history. We summarise previous research showing that historical spelling variants must be mapped to modern equivalents before natural language processing and corpus linguistics methods can be applied successfully. Manual and semi-automatic methods have been devised to support this normalisation and standardisation process. We argue that good results depend on a linguistically meaningful rationale for the process. To that end, we propose a number of guidelines for normalising corpora and show how they have been applied in the Corpus of English Dialogues.

