Ncode: an Open Source Bilingual N-gram SMT Toolkit
This paper describes Ncode, an open source statistical machine translation (SMT) toolkit in which translation models are estimated as n-gram language models over bilingual units (tuples). The toolkit includes tools for extracting tuples, estimating models, and performing translation, and it can easily be coupled with several other open source toolkits to yield a complete SMT pipeline. We review the main features of the toolkit and explain how to build a translation engine with Ncode. We also report a short comparison with the widely used Moses system: Ncode requires less memory and translates faster than Moses, and it also achieves slightly higher translation accuracy.
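To illustrate what an n-gram language model over bilingual tuples means, the sketch below trains a toy bigram model on sequences of (source, target) tuples and scores a tuple sequence. It is a minimal illustration and not Ncode's implementation: the function names, the toy corpus, and the add-alpha smoothing are all assumptions made for this example.

```python
from collections import Counter
import math

# Sentence-boundary pseudo-tuples (hypothetical markers for this sketch).
BOS = ("<s>", "<s>")
EOS = ("</s>", "</s>")


def train_bigram(tuple_sequences):
    """Count contexts and bigrams over sequences of (source, target) tuples."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocab = {EOS}  # events that can be predicted: every tuple plus end-of-sentence
    for seq in tuple_sequences:
        padded = [BOS] + list(seq) + [EOS]
        vocab.update(seq)
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, vocab


def log_prob(seq, model, alpha=1.0):
    """Log-probability of a tuple sequence under an add-alpha smoothed bigram model.

    The joint probability of the sentence pair is the product of
    p(u_k | u_{k-1}) over the tuples u_k of its segmentation.
    """
    context_counts, bigram_counts, vocab = model
    padded = [BOS] + list(seq) + [EOS]
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        numerator = bigram_counts[(prev, cur)] + alpha
        denominator = context_counts[prev] + alpha * len(vocab)
        total += math.log(numerator / denominator)
    return total


# Toy English-French corpus, already segmented into tuples (illustrative data only).
corpus = [
    [("we", "nous"), ("want", "voulons"), ("translations", "des traductions")],
    [("we", "nous"), ("want", "voulons"),
     ("perfect translations", "des traductions parfaites")],
]
model = train_bigram(corpus)
```

Because each tuple pairs a source segment with its translation, a single n-gram model captures both the translation choices and their ordering context. In this toy setup, a tuple sequence seen in training scores higher than the same tuples in reversed order.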
When translating between a morphologically rich language (MRL) and English, word forms in the MRL often encode grammatical information that is irrelevant to English, which leads to data sparsity. This problem can be mitigated by normalizing the MRL to remove the irrelevant information. Such preprocessing is usually performed deterministically, using hand-crafted rules that yield suboptimal representations. We introduce a simple way to automatically compute an appropriate normalization of the MRL and show that it can improve machine translation in both directions.