Search Results

1 - 10 of 14 items:

  • "Parallel treebanks"

Conference of the Association for Machine Translation in the Americas (AMTA '06), pp. 128-137. Boston, MA. Och, Franz Josef and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1): 19-51. Samuelsson, Yvonne and Martin Volk. 2007. Alignment Tools for Parallel Treebanks. In Data Structures for Linguistic Resources and Applications: Proceedings of the Biennial GLDV Conference 2007, eds. Georg Rehm, Andreas Witt and Lothar Lemnitzer. Tübingen, Germany: Gunter Narr. Wu, Dekai. 2000. Bracketing and aligning

Computational Linguistics (ACL’05), Ann Arbor, Michigan, 2005, pp. 271-279. 9. Galley, M., et al. Scalable Inference and Training of Context-Rich Syntactic Models. – In: Proc. of 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL’06), Sydney, Australia, 2006, pp. 961-968. 10. Tinsley, J., M. Hearne, A. Way. Exploiting Parallel Treebanks to Improve Phrase-Based Statistical Machine Translation. – In: Proc. of 6th International Workshop on Treebanks and Linguistic Theories (TLT’07

CzEng 0.9: Large Parallel Treebank with Rich Annotation

We describe our ongoing efforts in collecting a Czech-English parallel corpus CzEng. The paper provides full details on the current version 0.9 and focuses on its new features: (1) data from new sources were added, most importantly a few hundred electronically available books, technical documentation and also some parallel web pages, (2) the full corpus has been automatically annotated up to the tectogrammatical layer (surface and deep syntactic analysis), (3) sentence segmentation has been refined, and (4) several heuristic filters to improve corpus quality were implemented. In total, we provide a sentence-aligned automatic parallel treebank of about 8.0 million sentences, 93 million English and 82 million Czech words. CzEng 0.9 is freely available for non-commercial research purposes.


We present a work in progress aimed at extracting translation pairs of source and target dependency treelets to be used in a dependency-based machine translation system. We introduce a novel unsupervised method for parallel tree segmentation based on Gibbs sampling. Using the data from a Czech-English parallel treebank, we show that the procedure converges to a dictionary containing reasonably sized treelets; in some cases, the segmentation seems to have interesting linguistic interpretations.

References Bojar, Ondřej and Zdeněk Žabokrtský. CzEng 0.9: Large Parallel Treebank with Rich Annotation. Prague Bulletin of Mathematical Linguistics, 92, 2009. Kaalep, Heiki-Jaan and Kaarel Veskis. Comparing parallel corpora and evaluating their quality. In Proceedings of MT Summit XI, pages 275-279, Copenhagen, Denmark, 2007. Koehn, Philipp. Europarl: A parallel corpus for statistical machine translation. In Proceedings of MT Summit X, pages 79-86, Phuket, Thailand, 2005. Koehn, Philipp, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico

(1):15-26, 2010. Bojar, Ondřej. Analyzing Error Types in English-Czech Machine Translation. Prague Bulletin of Mathematical Linguistics, 95, 2011. Bojar, Ondřej and Zdeněk Žabokrtský. CzEng 0.9: Large parallel treebank with rich annotation. Prague Bulletin of Mathematical Linguistics, 92:63-83, 2009. Denkowski, Michael and Alon Lavie. Extending the METEOR Machine Translation Evaluation Metric to the Phrase Level. In Proc. of HLT-NAACL'10, pages 250-253, 2010. Giménez, Jesús and Lluis Màrquez. Towards heterogeneous automatic MT error analysis. In Proc. of LREC'08

References Bojar, Ondřej and Zdeněk Žabokrtský. CzEng 0.9: Large Parallel Treebank with Rich Annotation. Prague Bulletin of Mathematical Linguistics, 92:63-83, 2009. ISSN 0032-6585. Clark, Jonathan H., Chris Dyer, Alon Lavie, and Noah A. Smith. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proc. of ACL/HLT, pages 176-181, Portland, Oregon, USA, June 2011. Goyal, Amit, Hal Daumé, III, and Suresh Venkatasubramanian

] Colhon, M. Acquiring Syntactic Translation Rules from a Parallel Treebank, Journal of Information and Library Science INFOtheca, XIII(2) (2012), 19-32. [5] Colhon, M.; Ţăndăreanu, N. A Semantic Schema-based Approach for Natural Language Translation, WSEAS Journal Transactions on Computers, 9(11) (2010), 1307-1317. [6] Dai, Y.; Zhang, S.; Chen, J.; Chen, T.; Zhang, W. Semantic Network Language Generation based on a Semantic Networks Serialization Grammar, World Wide Web 13(3) (2010), 307-341. [7] Dale, R.; Di Eugenio, B.; Scott, D. Introduction to the

. Mettler. Alignment tools for parallel treebanks. In Proc. of The Linguistic Annotation Workshop at the Association for Computational Linguistics (LAW-ACL), 2007. Wellington, Benjamin, Sonjia Waxmonsky, and I. Dan Melamed. Empirical lower bounds on the complexity of translational equivalence. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2006. Wu, Dekai. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377-403, 1997. Zens, Richard and Hermann Ney. A

., 2009. Yasuda, Keiji, Fumiaki Sugaya, Toshiyuki Takezawa, Seiichi Yamamoto, and Masuzo Yanagida. Automatic machine translation selection scheme to output the best result. In Proceedings of LREC2002, pages 525–528, Las Palmas, Spain, 2002. Zhechev, Ventsislav. Unsupervised Generation of Parallel Treebank through Sub-Tree Alignment. Prague Bulletin of Mathematical Linguistics, 91:89–98, 2009.