Artuur Leeuwenberg, Mihaela Vela, Jon Dehdari and Josef van Genabith
In this paper we present a novel approach to minimally supervised synonym extraction. The approach is based on word embeddings and aims to provide a synonym-extraction method that is extensible to various languages.
We report experiments with word vectors trained using both the continuous bag-of-words model (CBoW) and the skip-gram model (SG), investigating the effects of different settings for the contextual window size, the number of dimensions, and the type of word vectors. We analyze the word categories that are (cosine-)similar in the vector space, showing that cosine similarity on its own is a poor indicator of whether two words are synonymous. In this context, we propose a new measure, relative cosine similarity, which computes similarity relative to other cosine-similar words in the corpus. We show that calculating similarity relative to other words boosts the precision of the extraction. We also experiment with combining similarity scores from differently trained vectors and explore the advantages of using a part-of-speech tagger as a way of introducing light supervision, thus aiding extraction.
We perform both intrinsic and extrinsic evaluation of our final system: intrinsic evaluation is carried out manually by two human evaluators, and for extrinsic evaluation we use the output of our system in a machine translation task, showing that the extracted synonyms improve the evaluation metric.
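The relative cosine similarity measure described in this abstract can be sketched as follows. This is our own minimal illustration in plain Python, not the authors' code; the function names, the toy vocabulary, and the choice of n are our assumptions. The idea is that a candidate's cosine similarity to the target word is divided by the summed cosine similarity of the target's n most similar words, so a true synonym must claim a large share of that similarity mass rather than merely have a high raw cosine score.

```python
from math import sqrt

def cosine(u, v):
    # Standard cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def relative_cosine_similarity(word, candidate, vectors, n=10):
    """Cosine similarity of `candidate` to `word`, divided by the summed
    cosine similarity of `word`'s n most cosine-similar words (our reading
    of the measure sketched in the abstract)."""
    sims = {w: cosine(vectors[word], v)
            for w, v in vectors.items() if w != word}
    top_n = sorted(sims.values(), reverse=True)[:n]
    return sims[candidate] / sum(top_n)
```

With toy vectors such as `{"car": [1.0, 0.0], "auto": [0.99, 0.01], "bank": [0.0, 1.0]}`, the candidate `auto` receives almost the entire similarity mass of `car`, while `bank` receives none, even though both comparisons use the same underlying cosine function.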
Treating morphologically complex words (MCWs) as atomic units in translation does not yield desirable results. Such words are complicated constituents with meaningful subunits. A complex word in a morphologically rich language (MRL) can correspond to several words, or even a full sentence, in a simpler language, which means the surface form of a complex word should be accompanied by auxiliary morphological information in order to provide a precise translation and a better alignment. In this paper we follow this idea and propose two different methods to convey such information to statistical machine translation (SMT) models. In the first model we enrich factored SMT engines by introducing a new morphological factor which relies on subword-aware word embeddings. In the second model we focus on the language-modeling component: we explore a subword-level neural language model (NLM) to capture sequence-, word-, and subword-level dependencies. Our NLM approximates better scores for conditional word probabilities, so the decoder generates more fluent translations. We studied two languages, Farsi and German, in our experiments and observed significant improvements for both.
Kriya - An end-to-end Hierarchical Phrase-based MT System
This paper describes Kriya, a new statistical machine translation (SMT) system that uses hierarchical phrases, which were first introduced in the Hiero machine translation system (Chiang, 2007). Kriya supports both a grammar extraction module for synchronous context-free grammars (SCFGs) and a CKY-based decoder. There are several re-implementations of Hiero in the machine translation community, but Kriya offers the following novel contributions: (a) grammar extraction in Kriya supports extraction of the full set of Hiero-style SCFG rules, but also supports the extraction of several types of compact rule sets, which lead to faster decoding for different language pairs without compromising the BLEU scores. Kriya currently supports extraction of compact SCFGs such as grammars with one non-terminal and grammar pruning based on certain rule patterns; and (b) the Kriya decoder offers some unique improvements in the implementation of cube pruning, such as increased diversity in the target-language n-best output and novel methods for language model (LM) integration. The Kriya decoder can take advantage of parallelization using a networked cluster, and supports both KenLM and SRILM for language model queries. This paper also provides several experimental results demonstrating that the translation quality of Kriya compares favourably to the Moses (Koehn et al., 2007) phrase-based system in several language pairs, while showing a substantial improvement for Chinese-English, similar to Chiang (2007). We also quantify the model sizes for phrase-based and Hiero-style systems and present experiments comparing variants of Hiero models.
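The cube-pruning improvements mentioned in this abstract sit on top of the standard lazy k-best frontier search over a grid of cost-sorted hypotheses. As a rough sketch of that core idea only (our own illustration, not Kriya's code; a real decoder also rescores each popped item with the language model, which is what makes the grid only approximately monotone and the search approximate):

```python
import heapq

def cube_prune(a, b, k):
    """Lazily enumerate the k lowest-cost combinations of two cost-sorted
    lists (e.g. rule costs and subderivation costs) without building the
    full cross-product. Returns (cost, i, j) triples in ascending order."""
    # Start from the best corner of the grid.
    heap = [(a[0] + b[0], 0, 0)]
    seen = {(0, 0)}
    out = []
    while heap and len(out) < k:
        cost, i, j = heapq.heappop(heap)
        out.append((cost, i, j))
        # Push the two unvisited grid neighbours of the popped cell.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(a) and nj < len(b) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (a[ni] + b[nj], ni, nj))
    return out
```

For `a = [0.1, 0.5, 0.9]` and `b = [0.2, 0.4]` with `k = 3`, only a small frontier of cells is ever expanded, rather than all six combinations.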
, and Alex Waibel. Pre-Translation for Neural Machine Translation. In Proceedings of the 26th International Conference on Computational Linguistics (COLING 2016), pages 1828–1836, Osaka, Japan, December 2016.
Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), pages 311–318, Philadelphia, PA, July 2002.
Popović, Maja. chrF: Character n-gram F-score for Automatic MT Evaluation
University Press, 2009.
Madnani, Nitin. iBLEU: Interactively debugging and scoring statistical machine translation systems. In 2011 Fifth IEEE International Conference on Semantic Computing (ICSC), pages 213–214. IEEE, 2011.
Och, Franz Josef and Hermann Ney. A Comparison of Alignment Models for Statistical Machine Translation. In Proceedings of the 18th Conference on Computational Linguistics, pages 1086–1090. Association for Computational Linguistics, 2000.
Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th ACL, pages 311–318, Philadelphia, PA, USA, 2002.
Alberto Poncelas, Gideon Maillette de Buy Wenniger and Andy Way
-English PB-SMT. 2009.
Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th ACL, pages 311–318, Philadelphia, PA, USA, 2002.
Poncelas, Alberto, Andy Way, and Antonio Toral. Extending Feature Decay Algorithms using Alignment Entropy. 2016.
Popović, Maja. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal, 2015.
Snover, Matthew, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. A Study of Translation Edit Rate with Targeted Human Annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas (AMTA 2006), pages 223–231, Cambridge, MA, USA, 2006.
Pedregosa, Fabian, et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
Ravi, Sujith, Kevin Knight, and Radu Soricut. Automatic prediction of parser accuracy. In Proc. of the Conference on Empirical Methods in Natural Language Processing, EMNLP '08, pages 887–896, Stroudsburg, PA, USA, 2008. URL http://dl.acm.org/citation.cfm?id=1613715.1613829.
Seginer, Yoav. Learning Syntactic Structure. PhD thesis, Universiteit van Amsterdam, 2007.
Sekine, Satoshi and Michael J. Collins. Evalb - Bracket Scoring Program
, Dec 2008. doi: 10.1109/SLT.2008.4777890.
Mansour, Saab, Joern Wuebker, and Hermann Ney. Combining translation and language model scoring for domain-specific data filtering. In International Workshop on Spoken Language Translation, pages 222-229, San Francisco, California, USA, Dec. 2011.
Moore, Robert C. and William Lewis. Intelligent selection of language model training data. In Proc. of the Association for Computational Linguistics 2010 Conference Short Papers, pages 220-224, Uppsala, Sweden, July 2010. URL http
Álvaro Peris, Mara Chinea-Ríos and Francisco Casacuberta
Mansour, Saab, Joern Wuebker, and Hermann Ney. Combining translation and language model scoring for domain-specific data filtering. In Proc. of IWSLT, pages 222–229, 2011.
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Proc. of NIPS, pages 3111–3119, 2013.
Moore, Robert C. and William Lewis. Intelligent selection of language model training data. In Proc. of ACL, pages 220–224, 2010.
Och, Franz Josef. Minimum error rate training in statistical machine translation. In Proc. of ACL, pages 160–167, 2003.
Och, Franz Josef and
Sheila Castilho, Joss Moorkens, Federico Gaspari, Iacer Calixto, John Tinsley and Andy Way
Popović, Maja. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal, September 2015.
Sennrich, Rico and Barry Haddow. Linguistic Input Features Improve Neural Machine Translation. In Proceedings of the First Conference on Machine Translation, pages 83–91, Berlin, Germany, August 2016.
Sennrich, Rico, Barry Haddow, and Alexandra Birch. Edinburgh Neural Machine Translation Systems for WMT 16. In Proceedings of the First Conference on Machine Translation (WMT16