Modest XPath and XQuery for corpora: Exploiting deep XML annotation

Christoph Rühlemann 1, Andrej Bagoutdinov 2 and Matthew Brook O’Donnell 3
  • 1 Philipps University, Marburg
  • 2 Ludwig-Maximilians University, Munich
  • 3 University of Pennsylvania


This paper outlines a modest approach to XPath and XQuery, tools that allow the navigation and exploitation of XML-encoded texts. The paper picks up where Andrew Hardie’s paper “Modest XML for corpora: Not a standard, but a suggestion” (Hardie 2014) left the reader, namely wondering how one’s corpus can be usefully analyzed once its XML encoding is finished, a question that paper did not address. Hardie argued persuasively that “there is a clear benefit to be had from a set of recommendations (not a standard) that outlines general best practices in the use of XML in corpora without going into any of the more technical aspects of XML or the full weight of TEI encoding” (Hardie 2014: 73). In a similar vein, this paper argues that even a basic understanding of XPath and XQuery can bring great benefits to corpus linguists. To make this point, we not only present a modest introduction to the basic structures underlying XPath and XQuery syntax but also demonstrate their analytical potential using Obama’s 2009 Inaugural Address as a test bed. The speech was encoded in XML, automatically PoS-tagged and manually annotated on additional layers that target two rhetorical figures, anaphora and isocola. We refer to this resource as the Inaugural Rhetorical Corpus (IRC). Further, we created a companion website hosting not only the Inaugural Rhetorical Corpus but also the Inaugural Training Corpus (an abbreviated version of the IRC that allows manual checks of query results) as well as an extensive list of tried and tested queries for use with either corpus. All of the queries presented in this paper are at beginner to lower-intermediate levels of XPath/XQuery expertise. Nonetheless, they yield fruitful results: they show how Obama uses the inclusive pronouns we and our as a discursive strategy to advance his political aim of re-focusing American politics on economic and domestic matters. Further, they demonstrate how sentence length contributes to the build-up of climactic tension. Finally, they suggest that Obama’s signature rhetorical figure is the isocolon and that the overwhelming majority of isocola in the speech instantiate the crescens type, where the cola gradually increase in length over the sequence.
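As a rough illustration of the kind of beginner-level query described above, the following sketch uses the limited XPath subset in Python's standard library to count inclusive pronouns and measure sentence length. The sample text is invented, and the element and attribute names (`<s>`, `<w>`, `n`, `pos`) are assumptions made for illustration; they are not the IRC's actual annotation scheme, and the paper's own queries are written in XPath/XQuery rather than Python.

```python
import xml.etree.ElementTree as ET

# Invented toy sample, not text from the speech. The markup (<s>, <w>,
# n, pos) is a hypothetical scheme, not the IRC's real annotation.
SAMPLE = """
<speech>
  <s n="1"><w pos="PRP">We</w> <w pos="VBP">build</w> <w pos="PRP$">our</w> <w pos="NN">future</w></s>
  <s n="2"><w pos="PRP">We</w> <w pos="VBP">work</w></s>
</speech>
"""

root = ET.fromstring(SAMPLE)

# Path step .//w selects every <w> descendant; the word-form filter
# (we/our, case-insensitive) is applied in Python.
inclusive = [w.text for w in root.findall(".//w")
             if w.text.lower() in {"we", "our"}]

# Sentence length in tokens: count the <w> children of each <s>.
lengths = {s.get("n"): len(s.findall("w")) for s in root.findall(".//s")}

# Attribute predicate [@pos='PRP'] selects words tagged as personal pronouns.
personal_pronouns = [w.text for w in root.findall(".//w[@pos='PRP']")]

print(inclusive)
print(lengths)
print(personal_pronouns)
```

The same path expressions (`//w`, `//s`, `//w[@pos='PRP']`) could be run in a full XPath or XQuery processor, where the word-form filter and the counting would be expressed inside the query itself.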

  • Biria, Reza and Azadeh Mohammadi. 2012. The socio pragmatic functions of inaugural speech: A critical discourse analysis approach. Journal of Pragmatics 44: 1290-1302.

  • Clark, James and Steve DeRose. 1999. XML Path Language XPath Version 1.0, available at (last accessed December 2014).

  • Gleim, Rüdiger, Ulli Waltinger, Alexander Mehler and Peter Menke. 2009. eHumanities Desktop - An extensible online system for corpus management and analysis. In M. Mahlberg, V. González-Díaz and C. Smith (eds.). Proceedings of the Corpus Linguistics 2009 Conference, available at (last accessed December 2014).

  • Gries, Stefan Th. 2009. Quantitative corpus linguistics with R. A practical introduction. New York and London: Routledge.

  • Gries, Stefan Th. 2010. Methodological skills in corpus linguistics: A polemic and some pointers towards quantitative methods. In T. Harris and M. Moreno Jaén (eds.). Corpus linguistics in language teaching, 121-146. Frankfurt am Main: Peter Lang.

  • Gries, Stefan Th. 2013. Statistics for linguistics with R. A practical introduction. 2nd rev. and ext. ed. Berlin and New York: De Gruyter Mouton.

  • Hardie, Andrew. 2014. Modest XML for corpora: Not a standard, but a suggestion. ICAME Journal 38: 73-103.

  • Hoffmann, Sebastian, Stefan Evert, Nicholas Smith, David Lee and Ylva Berglund Prytz. 2008. Corpus linguistics with BNCweb - A practical guide. Frankfurt/Main: Peter Lang.

  • Leech, Geoffrey. 2007. New resources, or just better ones? The holy grail of representativeness. In M. Hundt, N. Nesselhauf and C. Biewer (eds.). Corpus linguistics and the web, 133-150. Amsterdam/New York, NY: Rodopi.

  • Leith, Sam. 2011. You talkin’ to me? Rhetoric from Aristotle to Obama. London: Profile Books.

  • Levinson, Stephen C. 1983. Pragmatics. Cambridge: Cambridge University Press.

  • Longacre, Robert E. 1983. The grammar of discourse. New York: Plenum Press.

  • Mahlow, Cerstin, Christian Grün, Alexander Holupirek and Marc H. Scholl. 2012. A framework for retrieval and annotation in digital humanities using XQuery Full Text and Update in BaseX. Proceedings of the 2012 ACM Symposium on Document Engineering, September 4-7, 2012, Paris, France, 195-204. New York, NY: ACM. Available at (last accessed December 2014).

  • O’Donnell, Matthew B., Mike Scott, Michaela Mahlberg and Michael Hoey. 2012. Exploring text-initial words, clusters and concgrams in a newspaper corpus. Special issue of Corpus Linguistics and Linguistic Theory 8(1): 73-101.

  • O’Donnell, Matthew B. and Ute Römer. 2012. From student hard drive to web corpus (Part 2): The annotation and online distribution of the Michigan Corpus of Upper-level Student Papers (MICUSP). Corpora 7(1): 1-18.

  • R Development Core Team. 2010. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0.

  • Rehm, Georg, Richard Eckart, Christian Chiarcos and Johannes Dellert. 2008. Ontology-based XQuery’ing of XML-encoded language resources on multiple annotation layers. In N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis and D. Tapias (eds.). Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08). Paris: ELRA, available at (last accessed December 2014).

  • Rühlemann, Christoph. 2013. Narrative in English conversation. A corpus analysis. Cambridge: Cambridge University Press.

  • Rühlemann, Christoph and Matthew B. O’Donnell. 2012. Towards a corpus of conversational narrative. Construction and annotation of the Narrative Corpus. Corpus Linguistics and Linguistic Theory 8(2): 313-350.

  • Rühlemann, Christoph, Matthew B. O’Donnell and Andrej Bagoutdinov. 2013. Windows on the mind: Pauses in conversational narrative. In G. Gilquin and S. De Cock (eds.). Errors and disfluencies in spoken corpora (Benjamins Current Topics 52), 59-91. Amsterdam/Philadelphia: John Benjamins.

  • Rühlemann, Christoph and Matthew B. O’Donnell. 2015. Deixis. In K. Aijmer and C. Rühlemann (eds.). Corpus pragmatics. A handbook. Cambridge: Cambridge University Press.

  • Scott, Mike. 2010. WordSmith Tools version 5.0. Liverpool: Lexical Analysis Software.

  • Scott, Mike and Christopher Tribble. 2006. Textual patterns. Key words and corpus analysis in language education. Amsterdam/New York: John Benjamins.

  • Siegel, Erik and Adam Retter. 2014. eXist: A NoSQL Document Database and Application Platform. Sebastopol/CA: O’Reilly.

  • Walmsley, Priscilla. 2007. XQuery. Sebastopol/CA: O’Reilly.

  • Watt, Andrew. 2002. XPath essentials. New York: Wiley and Sons.

