CzeDLex – A Lexicon of Czech Discourse Connectives

Jiří Mírovský, Pavlína Synková, Magdaléna Rysová and Lucie Poláková
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics

Abstract

CzeDLex is a new electronic lexicon of Czech discourse connectives, planned for publication by the end of this year. Its data format and structure are based on a study of similar existing resources and are adapted to the Czech syntactic tradition, to the specifics of Czech, and to the Prague approach to the annotation of semantic discourse relations in text.

In the article, we first put the lexicon in the context of related resources and discuss theoretical aspects of building it: we present arguments for our choice of data structure and for the selection of features in the lexicon entries. Special attention is paid to a consistent and, as far as possible, uniform encoding of both primary connectives (such as English because, therefore) and secondary connectives (e.g. for this reason, this is the reason why). Apart from the lexical form of the connective, the main principle adopted for nesting entries in the lexicon is the discourse-semantic type (sense) expressed by the given connective. This principle enables us to deal with the broad formal variability of connectives and is convenient for interlinking CzeDLex with lexicons in other languages.
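The nesting principle described above can be illustrated with a minimal Python sketch. It is not the actual CzeDLex format, which is encoded in the Prague Markup Language; the class names `Entry` and `Usage`, their fields, and the sense label `reason-result` are assumptions introduced here for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Usage:
    """One discourse-semantic type (sense) of a connective (hypothetical structure)."""
    sense: str
    examples: List[str] = field(default_factory=list)


@dataclass
class Entry:
    """One lexicon entry, keyed by the lexical form of the connective."""
    lemma: str
    kind: str  # "primary" or "secondary"
    usages: Dict[str, Usage] = field(default_factory=dict)

    def add_example(self, sense: str, text: str) -> Usage:
        # Nest the example under its sense, creating the sense on first use.
        usage = self.usages.setdefault(sense, Usage(sense))
        usage.examples.append(text)
        return usage


# Primary and secondary connectives share the same encoding.
lexicon: Dict[str, Entry] = {
    "because": Entry("because", "primary"),
    "for this reason": Entry("for this reason", "secondary"),
}
lexicon["because"].add_example("reason-result", "He stayed home because it rained.")
lexicon["for this reason"].add_example("reason-result", "It rained. For this reason, he stayed home.")
```

Because both kinds of connective nest their usages under a sense label, entries that express the same sense can be matched across lexicons by that label, independently of their lexical form.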

Second, we introduce the chosen technical solution, based on the Prague Markup Language, which allows the lexicon to be incorporated efficiently into the family of Prague treebanks: it can be opened and edited directly in the tree editor TrEd, processed from the command line with btred, interlinked with its source corpus, and queried in the PML Tree Query engine.

Third, we describe how data for the lexicon were obtained from a large corpus manually annotated with discourse relations, the Prague Discourse Treebank 2.0: we elaborate on the automatic extraction, the post-extraction checks, and the manual addition of supplementary linguistic information.
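The extraction step can be sketched roughly as follows, assuming the annotated relations have already been exported as (connective, sense) pairs. The function names, the toy records, and the frequency threshold are hypothetical and do not reproduce the authors' actual procedure or the treebank's data format.

```python
from collections import Counter, defaultdict


def extract_lexicon(relations):
    """Group annotated relations by connective form, then by sense, with counts."""
    lexicon = defaultdict(Counter)
    for connective, sense in relations:
        # Normalise case so sentence-initial forms merge with the others.
        lexicon[connective.lower()][sense] += 1
    return {lemma: dict(senses) for lemma, senses in lexicon.items()}


def rare_pairs(lexicon, min_count=2):
    """List (connective, sense) pairs seen fewer than min_count times,
    as candidates for a manual post-extraction check."""
    return sorted(
        (lemma, sense)
        for lemma, senses in lexicon.items()
        for sense, count in senses.items()
        if count < min_count
    )


# Toy records invented for illustration; they are not taken from the treebank.
relations = [
    ("protože", "reason-result"),
    ("Protože", "reason-result"),
    ("proto", "reason-result"),
    ("ale", "opposition"),
    ("ale", "concession"),
]
lexicon = extract_lexicon(relations)
to_check = rare_pairs(lexicon)
```

In this sketch, low-frequency pairs are only flagged for review; supplementary linguistic information would still have to be added by hand, as the text describes.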


