Text collections for evaluation of Russian morphological taggers

Open access

Abstract

The paper describes the preparation and development of the text collections for the MorphoRuEval-2017 shared task, an evaluation campaign designed to stimulate the development of automatic morphological processing technologies for Russian. The main challenge for the organizers was to convert all available Russian corpora with manually verified, high-quality tagging to a single format (Universal Dependencies CoNLL-U). The data sources were the disambiguated subcorpus of the Russian National Corpus, SynTagRus, OpenCorpora.org data, and the GICR corpus with resolved homonymy; they differ in tagsets, lemmatization rules, pipeline architecture, technical solutions, and the systematicity of their errors. The collections include both normative texts (news and modern literature) and more informal discourse (social media and spoken data). The texts are available under the CC BY-NC-SA 3.0 license.
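
For readers unfamiliar with the target format, the following is a minimal sketch of how a CoNLL-U file can be read. It relies only on the public layout of the format: one token per line, ten tab-separated columns, `_` for an empty value, `#` for comment lines, and a blank line between sentences. The sample sentence and its feature values are invented for illustration and are not taken from the MorphoRuEval-2017 collections; the names `parse_conllu` and `parse_feats` are hypothetical helpers, not part of any released tooling.

```python
# Column names of the CoNLL-U format, in order.
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def parse_feats(feats):
    """Turn 'Case=Nom|Number=Sing' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def parse_conllu(text):
    """Return a list of sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():            # a blank line closes a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):        # sentence-level comment
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        token = dict(zip(CONLLU_FIELDS, cols))
        token["feats"] = parse_feats(token["feats"])
        current.append(token)
    if current:                          # file may lack a trailing blank line
        sentences.append(current)
    return sentences


# Illustrative sample only (assumed annotation, not corpus data).
ROWS = [
    ["1", "Мама", "мама", "NOUN", "_",
     "Animacy=Anim|Case=Nom|Gender=Fem|Number=Sing", "_", "_", "_", "_"],
    ["2", "мыла", "мыть", "VERB", "_",
     "Aspect=Imp|Gender=Fem|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act",
     "_", "_", "_", "_"],
    ["3", "раму", "рама", "NOUN", "_",
     "Animacy=Inan|Case=Acc|Gender=Fem|Number=Sing", "_", "_", "_", "_"],
    ["4", ".", ".", "PUNCT", "_", "_", "_", "_", "_", "_"],
]
SAMPLE = "# text = Мама мыла раму.\n" + "\n".join("\t".join(r) for r in ROWS) + "\n\n"

for token in parse_conllu(SAMPLE)[0]:
    print(token["form"], token["lemma"], token["upos"], token["feats"])
```

A tagger evaluated on such data is typically compared against the `lemma`, `upos`, and `feats` columns, which is why the sketch parses the feature string into a dictionary rather than keeping it as raw text.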


