Kateřina Rysová, Magdaléna Rysová, Michal Novák, Jiří Mírovský and Eva Hajičová
In this paper, we present the EVALD applications (Evaluator of Discourse) for automated essay scoring. EVALD is the first tool of this type for Czech. It evaluates texts written by both native and non-native speakers of Czech. We first describe the history and the present state of automated essay scoring, illustrated by examples of systems for other languages, mainly English. Then we focus on the methodology of creating the EVALD applications and describe the datasets used for testing as well as for the supervised training that EVALD builds on. Furthermore, we analyze in detail a sample of newly acquired language data (texts written by non-native speakers reaching the threshold level of Czech language acquisition required, e.g., for permanent residence in the Czech Republic), and we focus on linguistic differences between the available text levels. We present the feature set used by EVALD and, based on the analysis, extend it with new spelling features. Finally, we evaluate the overall performance of several variants of EVALD and analyze the collected results.
Daniel Kondratyuk, Ronald Cardenas and Ondřej Bojar
Recent developments in machine translation experiment with the idea that a model can improve translation quality by performing multiple tasks, e.g., translating from source to target and also labeling each source word with syntactic information. The intuition is that the network would generalize knowledge over the multiple tasks, improving translation performance, especially in low-resource conditions. We devised an experiment that casts doubt on this intuition. We perform similar experiments in both multi-decoder and interleaving setups that label each target word either with a syntactic tag or with a completely random tag. Surprisingly, we show that the model performs nearly as well on uncorrelated random tags as on true syntactic tags. We hint at some possible explanations of this behavior.
The main message of our article is that experimental results with deep neural networks should always be complemented with trivial baselines to document that the observed gain is not due to some unrelated properties of the system or training effects. True confidence in where the gains come from will probably remain problematic anyway.
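The contrast between true and random tags in the interleaving setup can be illustrated with a minimal sketch. It assumes a whitespace-tokenized target sentence, a small hypothetical tag inventory, and a tag placed before each word; the names `interleave` and `random_tags` are illustrative and are not taken from the authors' code.

```python
import random

def interleave(words, tags):
    """Build one target sequence alternating tag and word: t1 w1 t2 w2 ..."""
    if len(words) != len(tags):
        raise ValueError("need exactly one tag per word")
    out = []
    for tag, word in zip(tags, words):
        out.extend([tag, word])
    return out

def random_tags(words, tagset, seed=0):
    """Draw one tag per word uniformly at random, ignoring the word itself."""
    rng = random.Random(seed)  # fixed seed keeps the baseline reproducible
    return [rng.choice(tagset) for _ in words]

TAGSET = ["NOUN", "VERB", "DET", "ADJ"]   # hypothetical tag inventory
words = "the cat sleeps".split()
true_tags = ["DET", "NOUN", "VERB"]       # assumed output of some tagger

# Target side for the "syntactic" system and for the trivial baseline.
syntactic_target = interleave(words, true_tags)
baseline_target = interleave(words, random_tags(words, TAGSET))
```

Training one system on `syntactic_target`-style data and another on `baseline_target`-style data, with everything else held fixed, is the kind of trivial-baseline comparison the abstract argues for.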
The paper proposes a design of a generic database for multiword expressions (MWEs), based on the requirements for implementing a lexicon of Czech MWEs. The lexicon serves several goals: lexicography, teaching Czech as a foreign language, theoretical study of MWEs as entities standing between lexicon and grammar, and NLP tasks such as tagging and parsing, identification and search of MWEs, or word sense and semantic disambiguation. The database is designed to account for flexibility in morphology and word order, syntactic and lexical variants, and even creatively used fragments. The current state of the implementation is presented together with some emerging issues, problems and solutions.
Graph theory, which quantitatively measures the precise structure and complexity of any network, uncovers an optimal force balance in sentential graphs generated by the computational procedures of human natural language (CHL). It provides an alternative way to evaluate grammaticality by calculating the ‘feature potential’ of nodes and the ‘feature current’ along edges. An optimal force balance becomes visible by expressing ‘feature current’ through different point sizes of lines. Graph theory provides insights into syntax and contradicts Chomsky’s current proposal to discard tree notations. We propose an error minimization hypothesis for CHL: a good sentential network possesses an error-free self-organized force balance. CHL minimizes errors by (a) converting bottom-up flow (structure building) to top-down flow (parsing), (b) removing head projection edges, (c) preserving edges related to feature checking, (d) deleting DP-movement trajectories headed by an intermediate copy, (e) ensuring that covert wh-movement trajectories have infinitesimally small currents and conserving flow directions, and (f) robustly remedying a gap in the wh-loop by using an infinitesimally inexpensive wh-internally-merged (wh-IM) edge with the original flow direction. The CHL compels the sensorimotor (SM) interface to ground nodes so that Kirchhoff’s current law (a fundamental balance law) is satisfied. Internal merges are built-in grounding operations at the CHL–SM interface that generate loops and optimal force balance in sentential networks.
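As a generic illustration of the balance condition mentioned above, Kirchhoff's current law on a directed graph requires that the current flowing into a node equals the current flowing out of it. The sketch below checks this on a toy graph. The edge list, the node names, and the treatment of grounded nodes as terminals exempt from the balance check are all assumptions made for the example, not the paper's own formalization.

```python
from collections import defaultdict

def net_current(edges):
    """Return inflow minus outflow for every node.

    edges: iterable of (source, target, current) triples.
    """
    net = defaultdict(float)
    for src, dst, current in edges:
        net[src] -= current   # current leaves the source node
        net[dst] += current   # current enters the target node
    return dict(net)

def satisfies_kcl(edges, grounded=(), tol=1e-9):
    """True if every non-grounded node has (numerically) zero net current."""
    net = net_current(edges)
    return all(abs(v) <= tol for node, v in net.items() if node not in grounded)

# Toy graph (hypothetical): current splits at A and rejoins at D.
edges = [("A", "B", 2.0), ("A", "C", 1.0), ("B", "D", 2.0), ("C", "D", 1.0)]

balanced_with_grounding = satisfies_kcl(edges, grounded={"A", "D"})
balanced_without_grounding = satisfies_kcl(edges)
```

In this toy graph the inner nodes B and C are balanced, while A and D are balanced only if they are treated as grounded terminals, which is why the second check fails.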
Agata Savary, Silvio Ricardo Cordeiro, Timm Lichte, Carlos Ramisch, Uxoa Iñurrieta and Voula Giouli
Multiword expressions (MWEs) can have both idiomatic and literal occurrences. For instance, pulling strings can be understood either as making use of one’s influence, or literally. Distinguishing these two cases has been addressed in linguistic and psycholinguistic studies, and is also considered one of the major challenges in MWE processing. We suggest that literal occurrences should be considered in both semantic and syntactic terms, which motivates their study in a treebank. We propose heuristics to automatically pre-identify candidate sentences that might contain literal occurrences of verbal MWEs (VMWEs), and we apply them to existing treebanks in five typologically different languages: Basque, German, Greek, Polish and Portuguese. We also perform a linguistic study of the literal occurrences extracted by the different heuristics. The results suggest that literal occurrences constitute a rare phenomenon. We also identify some properties that may distinguish them from their idiomatic counterparts. This article is a largely extended version of Savary and Cordeiro (2018).
Jetic Gū, Anahita Mansouri Bigvand and Anoop Sarkar
In this paper, we present a new word aligner with built-in support for alignment types, as well as comparisons between various models and existing aligner systems. It is open-source software that can be easily extended to use models of the users’ own design. We expect it to meet the needs of academics as well as scientists working in industry, both for word alignment and for experimenting with their own new models. We introduce the basic design and structure of the system and provide examples and demos.
Václava Kettnerová, Markéta Lopatková, Eduard Bejček and Petra Barančíková
This paper summarizes the results of a theoretical analysis of the syntactic behavior of Czech light verb constructions (LVCs) and their verification in the linguistic annotation of a large number of these constructions. The concept of LVCs is based on the observation that nouns denoting actions, states, or properties have a strong tendency to select semantically underspecified verbs, which leads to a specific rearrangement of valency complementations of both nouns and verbs in the syntactic structure. On the basis of the description of deep and surface syntactic properties of LVCs, a formal model of their lexicographic representation is proposed here. In addition, the resulting data annotation, capturing almost 1,500 LVCs, is described in detail. This annotation has been integrated in a new version of the VALLEX lexicon, release 3.5.
We present NMT-Keras, a flexible toolkit for training deep learning models, which puts a particular emphasis on the development of advanced applications of neural machine translation systems, such as interactive-predictive translation protocols and long-term adaptation of the translation system via continuous learning. NMT-Keras is based on an extended version of the popular Keras library, and it runs on Theano and TensorFlow. State-of-the-art neural machine translation models are deployed and used following the high-level framework provided by Keras. Given its high modularity and flexibility, it has also been extended to tackle different problems, such as image and video captioning, sentence classification and visual question answering.
Thomas Zenkel, Matthias Sperber, Jan Niehues, Markus Müller, Ngoc-Quan Pham, Sebastian Stüker and Alex Waibel
In this paper, we introduce an open-source toolkit for speech translation. While there already exists a wide variety of open-source tools for the essential tasks of a speech translation system, our goal is to provide an easy-to-use recipe for the complete pipeline of translating speech. We provide a Docker container with a ready-to-use pipeline of the following components: a neural speech recognition system, a sentence segmentation system and an attention-based translation system. We provide recipes for training and evaluating models for the task of translating English lectures and TED talks to German. Additionally, we provide pre-trained models for this task. With this toolkit, we hope to facilitate the development of speech translation systems and to encourage researchers to improve their overall performance.
We present PanParser, a Python framework dedicated to transition-based structured prediction, and notably suitable for dependency parsing. On top of providing an easy way to train state-of-the-art parsers, as empirically validated on UD 2.0, PanParser is especially useful for research purposes: its modular architecture makes it possible to implement most state-of-the-art transition-based methods under the same unified framework (several of which are already built in), which facilitates fair benchmarking and allows for an exhaustive exploration of slight variants of those methods. PanParser additionally includes a number of fine-grained evaluation utilities, which have already been successfully leveraged in several past studies, to perform extensive error analysis of monolingual as well as cross-lingual parsing.