Historical Documents Modernization

Open access

Abstract

Historical documents are mostly accessible only to scholars who specialize in the period in which a document originated. To make these documents accessible to a broader audience and to help preserve cultural heritage, we propose a method to modernize them. The method is based on statistical machine translation and aims to translate historical documents into a modern version of their original language. We tested this method in two different scenarios, obtaining very encouraging results.

The Prague Bulletin of Mathematical Linguistics

The Journal of Charles University
