Modest XPath and XQuery for corpora: Exploiting deep XML annotation

Open access


This paper outlines a modest approach to XPath and XQuery, tools for navigating and exploiting XML-encoded texts. It picks up where Andrew Hardie’s paper “Modest XML for corpora: Not a standard, but a suggestion” (Hardie 2014) left the reader: wondering how a corpus can be usefully analyzed once its XML encoding is finished, a question that paper did not address. Hardie argued persuasively that “there is a clear benefit to be had from a set of recommendations (not a standard) that outlines general best practices in the use of XML in corpora without going into any of the more technical aspects of XML or the full weight of TEI encoding” (Hardie 2014: 73). In a similar vein, this paper argues that even a basic understanding of XPath and XQuery can bring great benefits to corpus linguists. To make this point, we not only present a modest introduction to the basic structures underlying XPath and XQuery syntax but also demonstrate their analytical potential, using Obama’s 2009 Inaugural Address as a test bed. The speech was encoded in XML, automatically PoS-tagged, and manually annotated on additional layers that target two rhetorical figures, anaphora and isocola. We refer to this resource as the Inaugural Rhetorical Corpus (IRC). Further, we created a companion website hosting not only the Inaugural Rhetorical Corpus but also the Inaugural Training Corpus (an abbreviated version of the IRC that allows manual checks of query results), as well as an extensive list of tried and tested queries for use with either corpus. All of the queries presented in this paper are at beginner to lower-intermediate levels of XPath/XQuery expertise. Nonetheless, they yield fruitful results: they show how Obama uses the inclusive pronouns we and our as a discursive means of advancing his political strategy to re-focus American politics on economic and domestic matters. Further, they demonstrate how sentence length contributes to the build-up of climactic tension. Finally, they suggest that Obama’s signature rhetorical figure is the isocolon and that the overwhelming majority of isocola in the speech instantiate the crescens type, where the cola gradually increase in length over the sequence.
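To give a flavor of the three kinds of query named above (pronoun counts, sentence length, and the crescens pattern), the following is a minimal sketch in Python, using the limited XPath subset of the standard library's xml.etree.ElementTree. It is not taken from the paper or its companion website. The toy text and all element and attribute names (s, w, pos, isocolon, colon, n) are assumptions made for illustration and need not match the IRC's actual markup, and "crescens" is read here as strictly increasing colon length, which is one possible interpretation of "gradually increase".

```python
import xml.etree.ElementTree as ET

# Invented toy fragment. The markup scheme (s, w, pos, isocolon, colon, n)
# is hypothetical and is NOT the actual encoding of the IRC.
SAMPLE = """
<speech>
  <s n="1">
    <w pos="PRP">We</w> <w pos="VBP">gather</w> <w pos="RB">here</w>
  </s>
  <s n="2">
    <isocolon>
      <colon><w pos="PRP$">our</w> <w pos="NNS">homes</w></colon>
      <colon><w pos="PRP$">our</w> <w pos="JJ">small</w> <w pos="NNS">towns</w></colon>
      <colon><w pos="PRP$">our</w> <w pos="JJ">great</w> <w pos="JJ">wide</w> <w pos="NN">country</w></colon>
    </isocolon>
  </s>
</speech>
"""

root = ET.fromstring(SAMPLE)

# Path expression ".//w": every <w> element at any depth below the root.
words = root.findall(".//w")

# Inclusive pronouns: filter word tokens case-insensitively.
inclusive = [w.text for w in words if w.text.lower() in {"we", "our"}]

# Sentence length in word tokens, keyed by the (assumed) n attribute.
sentence_lengths = {
    s.get("n"): len(s.findall(".//w")) for s in root.findall(".//s")
}


def is_crescens(isocolon):
    """True if each colon has more word tokens than the one before it."""
    lengths = [len(colon.findall("w")) for colon in isocolon.findall("colon")]
    return len(lengths) > 1 and all(a < b for a, b in zip(lengths, lengths[1:]))


crescens_flags = [is_crescens(i) for i in root.findall(".//isocolon")]

print(inclusive)
print(sentence_lengths)
print(crescens_flags)
```

In a full XPath 2.0 or XQuery processor, the pronoun filter could be written as a single expression along the lines of `//w[lower-case(.) = ('we', 'our')]`, again assuming this hypothetical markup.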


  • Biria, Reza and Azadeh Mohammadi. 2012. The socio pragmatic functions of inaugural speech: A critical discourse analysis approach. Journal of Pragmatics 44: 1290-1302.

  • Clark, James and Steve DeRose. 1999. XML Path Language (XPath) Version 1.0. Available at (last accessed December 2014).

  • Gleim, Rüdiger, Ulli Waltinger, Alexander Mehler and Peter Menke. 2009. eHumanities Desktop - An extensible online system for corpus management and analysis. Proceedings of the Corpus Linguistics 2009 Conference. In M. Mahlberg, V. González-Díaz and C. Smith (eds.). Proceedings of the Corpus Linguistics Conference. Available at (last accessed December 2014).

  • Gries, Stefan Th. 2009. Quantitative corpus linguistics with R. A practical introduction. New York and London: Routledge.

  • Gries, Stefan Th. 2010. Methodological skills in corpus linguistics: A polemic and some pointers towards quantitative methods. In T. Harris and M. Moreno Jaén (eds.). Corpus linguistics in language teaching, 121-146. Frankfurt am Main: Peter Lang.

  • Gries, Stefan Th. 2013. Statistics for linguistics with R. A practical introduction. 2nd rev. and ext. ed. Berlin and New York: De Gruyter Mouton.

  • Hardie, Andrew. 2014. Modest XML for corpora: Not a standard, but a suggestion. ICAME Journal 38: 73-103.

  • Hoffmann, Sebastian, Stefan Evert, Nicholas Smith, David Lee and Ylva Berglund Prytz. 2008. Corpus linguistics with BNCweb - A practical guide. Frankfurt am Main: Peter Lang.

  • Leech, Geoffrey. 2007. New resources or just better ones? The holy grail of representativeness. In M. Hundt, N. Nesselhauf and C. Biewer (eds.). Corpus linguistics and the web, 133-150. Amsterdam/New York, NY: Rodopi.

  • Leith, Sam. 2011. You talkin’ to me? Rhetoric from Aristotle to Obama. London: Profile Books.

  • Levinson, Stephen C. 1983. Pragmatics. Cambridge: Cambridge University Press.

  • Longacre, Robert E. 1983. The grammar of discourse. New York: Plenum Press.

  • Mahlow, Cerstin, Christian Grün, Alexander Holupirek and Marc H. Scholl. 2012. A framework for retrieval and annotation in digital humanities using XQuery full text and update in BaseX. Proceedings of the 2012 ACM Symposium on Document Engineering, September 4-7, 2012, Paris, France, 195-204. New York, NY: ACM. Available at (last accessed December 2014).

  • O’Donnell, Matthew B., Mike Scott, Michaela Mahlberg and Michael Hoey. 2012. Exploring text-initial words, clusters and concgrams in a newspaper corpus. Special issue of Corpus Linguistics and Linguistic Theory 8(1): 73-101.

  • O’Donnell, Matthew B. and Ute Römer. 2012. From student hard drive to web corpus (Part 2): The annotation and online distribution of the Michigan Corpus of Upper-level Student Papers (MICUSP). Corpora 7(1): 1-18.

  • R Development Core Team. 2010. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0. URL

  • Rehm, Georg, Richard Eckart, Christian Chiarcos and Johannes Dellert. 2008. Ontology-based XQuery’ing of XML-encoded language resources on multiple annotation layers. In N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis and D. Tapias (eds.). Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08). Paris: ELRA. Available at (last accessed December 2014).

  • Rühlemann, Christoph. 2013. Narrative in English conversation. A corpus analysis. Cambridge: Cambridge University Press.

  • Rühlemann, Christoph and Matthew B. O’Donnell. 2012. Towards a corpus of conversational narrative. Construction and annotation of the Narrative Corpus. Corpus Linguistics and Linguistic Theory 8(2): 313-350.

  • Rühlemann, Christoph, Matthew B. O’Donnell and Andrej Bagoutdinov. 2013. Windows on the mind: Pauses in conversational narrative. In G. Gilquin and S. De Cock (eds.). Errors and disfluencies in spoken corpora (Benjamins Current Topics 52), 59-91. Amsterdam/Philadelphia: John Benjamins.

  • Rühlemann, Christoph and Matthew B. O’Donnell. 2015. Deixis. In K. Aijmer and C. Rühlemann (eds.). Corpus pragmatics. A handbook. Cambridge: Cambridge University Press.

  • Scott, Mike. 2010. WordSmith Tools version 5.0. Liverpool: Lexical Analysis Software.

  • Scott, Mike and Christopher Tribble. 2006. Textual patterns. Key words and corpus analysis in language education. Amsterdam/New York: John Benjamins.

  • Siegel, Erik and Adam Retter. 2014. eXist: A NoSQL Document Database and Application Platform. Sebastopol, CA: O’Reilly.

  • Walmsley, Priscilla. 2007. XQuery. Sebastopol, CA: O’Reilly.

  • Watt, Andrew. 2002. XPath essentials. New York: Wiley and Sons.
