Microsyntactic Annotation of Corpora and its Use in Computational Linguistics Tasks

Open access


Microsyntax is a linguistic discipline dealing with idiomatic elements whose important properties are strongly related to syntax. These elements may be viewed as transitional entities between the lexicon and the grammar, which explains why they are often underrepresented in both types of resource: the lexicographer fails to see such elements as full-fledged lexical units, while the grammarian finds them too specific to justify writing individual, well-developed rules. As a result, such elements are poorly covered by the linguistic models used in advanced computational linguistics tasks such as high-quality machine translation or deep semantic analysis. One way to remedy this and improve the coverage and treatment of microsyntactic units in linguistic resources is to develop corpora with microsyntactic annotation, closely linked to specially designed lexicons. The paper shows how this task is addressed in SynTagRus, the deeply annotated corpus of Russian.


Journal of Linguistics/Jazykovedný časopis

The Journal of Ľudovít Štúr Institute of Linguistics, SAV


