Search Results

1-10 of 20 items matching "Manual evaluation"



In this article we present a novel linguistically driven evaluation method and apply it to the main approaches of Machine Translation (Rule-based, Phrase-based, Neural) to gain insights into their strengths and weaknesses in much more detail than provided by current evaluation schemes. Translating between two languages requires substantial modelling of knowledge about the two languages, about translation, and about the world. Using English-German IT-domain translation as a case study, we also enhance the Phrase-based system by exploiting parallel treebanks for syntax-aware phrase extraction and by interfacing with Linked Open Data (LOD) for extracting named entity translations in a post-decoding framework.
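The abstract mentions extracting named entity translations from Linked Open Data. One common way to obtain a translated label for an entity is a SPARQL query against a LOD endpoint such as DBpedia, filtering `rdfs:label` by language. A minimal sketch follows; the helper name and the example resource URI are illustrative assumptions, not taken from the paper:

```python
def entity_label_query(resource_uri: str, lang: str) -> str:
    """Build a SPARQL query that fetches the rdfs:label of a LOD
    resource in the requested language (e.g. "de" for German)."""
    return (
        "SELECT ?label WHERE {\n"
        f"  <{resource_uri}> rdfs:label ?label .\n"
        f"  FILTER (lang(?label) = \"{lang}\")\n"
        "}"
    )

# Example: ask for the German label of a hypothetical DBpedia resource.
query = entity_label_query("http://dbpedia.org/resource/Printer_(computing)", "de")
```

Sending such a query to a public SPARQL endpoint returns the localized label, which can then be substituted for the named entity in the MT output.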


$$w_{TF-IDF\_Location} = w_{Location} \ast w_{TF-IDF}$$ (10)

$$w_{MMR\_Location} = w_{Location} \ast w_{MMR}$$ (11)

4 Experiments and Results Analysis

4.1 Evaluation Method

In this paper, we invited two volunteers to manually evaluate the generated surveys, assigning each survey a score between 1 and 5, where 5 means the survey is very comprehensive and 1 means it is very poor and lacks logical structure. The volunteers are fifth-year undergraduate students majoring in clinical medicine.
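The weighting scheme in Eqs. (10) and (11) simply multiplies a sentence's location weight by its relevance weight (TF-IDF or MMR). A one-function sketch, with illustrative weight values that are assumptions rather than values from the paper:

```python
def combined_weight(w_location: float, w_relevance: float) -> float:
    """Combine a sentence's location weight with a relevance weight
    (TF-IDF or MMR) by multiplication, as in Eqs. (10) and (11)."""
    return w_location * w_relevance

# Illustrative values only.
w_tfidf_location = combined_weight(0.8, 0.5)  # Eq. (10)
w_mmr_location = combined_weight(0.8, 0.3)    # Eq. (11)
```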


We compare three approaches to statistical machine translation (pure phrase-based, factored phrase-based and neural) by performing a fine-grained manual evaluation via error annotation of the systems’ outputs. The error types in our annotation are compliant with the multidimensional quality metrics (MQM), and the annotation is performed by two annotators. Inter-annotator agreement is high for such a task, and results show that the best performing system (neural) reduces the errors produced by the worst system (phrase-based) by 54%.
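Two quantities in this abstract are easy to make concrete: inter-annotator agreement for the error annotation, and the relative error reduction of the best system over the worst. The abstract does not say which agreement statistic was used; Cohen's kappa is a common choice for two annotators, so the sketch below is an assumption in that respect, and the label values are invented for illustration:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

def error_reduction(worst_errors: int, best_errors: int) -> float:
    """Relative reduction in error count of the best system vs. the worst."""
    return 1 - best_errors / worst_errors
```

For example, a system with 46 errors against a baseline with 100 yields a 54% reduction, the figure reported in the abstract.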


We propose a manual evaluation method for machine translation (MT), in which annotators rank only translations of short segments instead of whole sentences. This results in easier and more efficient annotation. We have conducted an annotation experiment and evaluated a set of MT systems using this method. The obtained results are very close to the official WMT14 evaluation results. We also use the collected database of annotations to automatically evaluate new, unseen systems and to tune the parameters of a statistical machine translation system. The evaluation of unseen systems, however, does not work, and we analyze the reasons.
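The abstract does not specify how segment-level rank annotations are turned into system scores; one common WMT-style aggregation is "expected wins", i.e. the fraction of pairwise comparisons a system wins. The sketch below is a simplified version of that idea under stated assumptions (ties are excluded; lower rank means better), not the paper's exact procedure:

```python
from collections import defaultdict
from itertools import combinations

def expected_wins(rankings):
    """Aggregate per-segment rank annotations into system scores.

    `rankings` is a list of dicts mapping system name -> rank for one
    annotated segment (lower rank = better). A system's score is the
    fraction of its pairwise comparisons it wins, ignoring ties.
    """
    wins = defaultdict(int)
    comparisons = defaultdict(int)
    for ranking in rankings:
        for s1, s2 in combinations(ranking, 2):
            if ranking[s1] == ranking[s2]:
                continue  # tie: no winner, no comparison counted
            comparisons[s1] += 1
            comparisons[s2] += 1
            winner = s1 if ranking[s1] < ranking[s2] else s2
            wins[winner] += 1
    return {s: wins[s] / comparisons[s] for s in comparisons}
```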


This article presents an attempt to design artificial neural networks for the empirical laboratory test results of tracers No. 5, No. 7 and No. 8. These tracers are used in cartridges with calibres from 37 mm to 122 mm that are still in use and in storage, in both marine and land climates. The results of over 40 years of laboratory tests of the tracers were analysed and prepared in accordance with the requirements for designing neural networks. Only the evaluation module of these tracers was modelled, because this element of the tests fulfilled the assumptions needed to build artificial neural networks. Several hundred artificial neural networks were built for each type of analysed tracer. After an in-depth analysis of the results, the best neural network was chosen, and its main parameters are described and discussed in the article. The results produced by the built neural networks were compared with the previously used manual evaluation module for these tracers. On the basis of the conducted analyses, a modification of the test methodology is proposed: replacing the previous manual evaluation modules with the elaborated automatic artificial neural network models. Artificial neural networks have a very important feature, namely that they can predict specific output data; this feature can also be applied in diagnostic tests of other elements of ammunition.


We integrate new mechanisms into a document-level machine translation decoder to improve the lexical consistency of document translations. First, we develop a document-level feature designed to score the lexical consistency of a translation. This feature, which applies to words that have been translated into different forms within the document, uses word embeddings to measure the adequacy of each word translation given its context. Second, we extend the decoder with a new stochastic mechanism that, at translation time, allows the decoder to introduce changes in the translation aimed at improving its lexical consistency. We evaluate our system on English–Spanish document translation, and we conduct automatic and manual assessments of its quality. The automatic evaluation metrics, applied mainly at sentence level, do not reflect significant variations. On the contrary, the manual evaluation shows that the system dealing with lexical consistency is preferred over both a standard sentence-level and a standard document-level phrase-based MT system.
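The consistency feature described above scores how well each translation of a repeated source word fits its context, using word embeddings. The exact formulation is not given in the abstract; a minimal sketch, assuming cosine similarity between each translation's embedding and a context vector, with the function names and toy vectors being illustrative assumptions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def consistency_score(translation_embeddings, context_embedding):
    """Average similarity between the embeddings of the chosen
    translations of one source word and a document-context vector:
    higher means the translations fit the context more consistently."""
    sims = [cosine(e, context_embedding) for e in translation_embeddings]
    return sum(sims) / len(sims)
```

A decoder feature of this shape rewards hypotheses whose repeated words are translated into forms that all sit close to the document context in embedding space.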

Analyzing Error Types in English-Czech Machine Translation

This paper examines two techniques of manual evaluation that can be used to identify error types of individual machine translation systems. The first technique, "blind post-editing", has been used in WMT evaluation campaigns since 2009, and manually constructed data of this type are available for various language pairs. The second technique, explicit marking of errors, has been used in the past as well.

We propose a method for interpreting blind post-editing data at a finer level and compare the results with explicit marking of errors. While the human annotation of either technique is not exactly reproducible (relatively low agreement), both techniques lead to similar observations of differences between the systems. Specifically, we are able to suggest which errors in MT output are easy and hard to correct with no access to the source, a situation experienced by users who do not understand the source language.

Quiz-Based Evaluation of Machine Translation

This paper proposes a new method of manual evaluation for statistical machine translation, the so-called quiz-based evaluation, estimating whether people are able to extract information from machine-translated texts reliably. We apply the method to two commercial and two experimental MT systems that participated in WMT 2010 in English-to-Czech translation. We report inter-annotator agreement for the evaluation as well as the outcomes of the individual systems. The quiz-based evaluation suggests a rather different ranking of the systems compared to the WMT 2010 manual and automatic metrics. We also see that, overall, MT quality is becoming acceptable for obtaining information from text: about 80% of questions can be answered correctly given only the machine-translated text.