A Comparison of Four Character-Level String-to-String Translation Models for (OCR) Spelling Error Correction

Open access

Abstract

We consider the isolated spelling error correction problem as a specific subproblem of the more general string-to-string translation problem. In this context, we investigate four general string-to-string transformation models that have been suggested in recent years and apply them within the spelling error correction paradigm. In particular, we investigate how a simple ‘k-best decoding plus dictionary lookup’ strategy performs in this context and find that such an approach can significantly outperform baselines such as edit distance, weighted edit distance, and the noisy channel Brill and Moore model for spelling error correction. We also consider elementary combination techniques for our models, such as language-model-weighted majority voting and center string combination. Finally, we consider real-world OCR post-correction for a dataset sampled from medieval Latin texts.
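The two simplest ideas named in the abstract can be illustrated with a minimal Python sketch. It assumes that a transduction model has already produced a ranked k-best list of candidate corrections, and it reads "center string combination" as picking the system output with the smallest summed edit distance to all other outputs. The function names, the unit-cost edit distance, and the fallback to the top candidate are illustrative assumptions, not details taken from the paper.

```python
from typing import Iterable, Sequence


def correct_with_lexicon(kbest: Sequence[str], lexicon: Iterable[str]) -> str:
    """Return the highest-ranked candidate that is a dictionary word.

    `kbest` is assumed to be sorted from most to least probable. If no
    candidate is in the lexicon, fall back to the top-ranked candidate
    (an assumption made for this sketch).
    """
    if not kbest:
        raise ValueError("kbest must contain at least one candidate")
    words = set(lexicon)
    for candidate in kbest:
        if candidate in words:
            return candidate
    return kbest[0]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def center_string(outputs: Sequence[str]) -> str:
    """Pick the output with the smallest total edit distance to all outputs."""
    return min(outputs, key=lambda s: sum(edit_distance(s, t) for t in outputs))
```

For example, `correct_with_lexicon(["teh", "the", "ten"], {"the", "ten"})` skips the non-word "teh" and returns the next-ranked dictionary word, while `center_string` applied to the outputs of several systems returns the output closest to all of them.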

Bartlett, Susan, Grzegorz Kondrak, and Colin Cherry. Automatic Syllabification with Structured SVMs for Letter-to-Phoneme Conversion. In McKeown, Kathleen, Johanna D. Moore, Simone Teufel, James Allan, and Sadaoki Furui, editors, ACL, pages 568–576. The Association for Computational Linguistics, 2008. ISBN 978-1-932432-04-6. URL http://dblp.uni-trier.de/db/conf/acl/acl2008.html#BartlettKC08.

Bisani, Maximilian and Hermann Ney. Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication, 50(5):434–451, 2008. URL http://dblp.uni-trier.de/db/journals/speech/speech50.html#BisaniN08.

Brill, Eric and Robert C. Moore. An Improved Error Model for Noisy Channel Spelling Correction. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL), ACL ’00, pages 286–293, Stroudsburg, PA, USA, 2000. Association for Computational Linguistics. doi: 10.3115/1075218.1075255. URL http://dx.doi.org/10.3115/1075218.1075255.

Cortes, Corinna, Mehryar Mohri, and Jason Weston. A General Regression Technique for Learning Transductions. In Proceedings of the 22nd International Conference on Machine Learning (ICML), pages 153–160, New York, NY, USA, 2005. ACM. ISBN 1-59593-180-5. doi: 10.1145/1102351.1102371. URL http://doi.acm.org/10.1145/1102351.1102371.

Cortes, Corinna, Vitaly Kuznetsov, and Mehryar Mohri. Ensemble Methods for Structured Prediction. In Proceedings of the 31st International Conference on Machine Learning (ICML), 2014.

Cotterell, Ryan, Nanyun Peng, and Jason Eisner. Stochastic Contextual Edit Distance and Probabilistic FSTs. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), Baltimore, June 2014. URL http://cs.jhu.edu/~jason/papers/#cotterell-peng-eisner-2014. 6 pages.

Cotterell, Ryan, Nanyun Peng, and Jason Eisner. Modeling Word Forms Using Latent Underlying Morphs and Phonology. Transactions of the Association for Computational Linguistics, 3:433–447, 2015. ISSN 2307-387X. URL https://tacl2013.cs.columbia.edu/ojs/index.php/tacl/article/view/480.

Cucerzan, S. and E. Brill. Spelling Correction as an Iterative Process that Exploits the Collective Knowledge of Web Users. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2004.

Dreyer, Markus, Jason Smith, and Jason Eisner. Latent-Variable Modeling of String Transductions with Finite-State Methods. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1080–1089. ACL, 2008. URL http://dblp.uni-trier.de/db/conf/emnlp/emnlp2008.html#DreyerSE08.

Duan, Huizhong and Bo-June (Paul) Hsu. Online Spelling Correction for Query Completion. In Proceedings of the 20th International Conference on World Wide Web, WWW ’11, pages 117–126, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0632-4. doi: 10.1145/1963405.1963425. URL http://doi.acm.org/10.1145/1963405.1963425.

Eger, Steffen. S-Restricted Monotone Alignments: Algorithm, Search Space, and Applications. In Proceedings of the Conference on Computational Linguistics (COLING), pages 781–798, 2012.

Eger, Steffen. Sequence Segmentation by Enumeration: An Exploration. Prague Bull. Math. Linguistics, 100:113–132, 2013. URL http://dblp.uni-trier.de/db/journals/pbml/pbml100.html#Eger13.

Eger, Steffen. Designing and comparing G2P-type lemmatizers for a morphology-rich language. In Fourth International Workshop on Systems and Frameworks for Computational Morphology, pages 27–40. Springer International Publishing Switzerland, 2015a.

Eger, Steffen. Do we need bigram alignment models? On the effect of alignment quality on transduction accuracy in G2P. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1175–1185, Lisbon, Portugal, September 2015b. Association for Computational Linguistics. URL http://aclweb.org/anthology/D15-1139.

Eger, Steffen. Improving G2P from wiktionary and other (web) resources. In INTERSPEECH 2015, 16th Annual Conference of the International Speech Communication Association, Dresden, Germany, September 6-10, 2015, pages 3340–3344, 2015c.

Eger, Steffen. Multiple Many-to-Many Sequence Alignment for Combining String-Valued Variables: A G2P Experiment. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 909–919, Beijing, China, July 2015d. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P15-1088.

Farra, Noura, Nadi Tomeh, Alla Rozovskaya, and Nizar Habash. Generalized Character-Level Spelling Error Correction. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 161–167, Baltimore, Maryland, June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P/P14/P14-2027.

Gubanov, Sergey, Irina Galinskaya, and Alexey Baytin. Improved Iterative Correction for Distant Spelling Errors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, Volume 2: Short Papers, pages 168–173, 2014. URL http://aclweb.org/anthology/P/P14/P14-2028.pdf.

Gusfield, Dan. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology. Cambridge University Press, 1997. ISBN 0-521-58519-8.

Jiampojamarn, Sittichai, Grzegorz Kondrak, and Tarek Sherif. Applying Many-to-Many Alignments and Hidden Markov Models to Letter-to-Phoneme Conversion. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 372–379, Rochester, New York, April 2007. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/N/N07/N07-1047.

Jiampojamarn, Sittichai, Colin Cherry, and Grzegorz Kondrak. Joint Processing and Discriminative Training for Letter-to-Phoneme Conversion. In Proceedings of ACL-08: HLT, pages 905–913, Columbus, Ohio, June 2008. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P/P08/P08-1103.

Jiampojamarn, Sittichai, Aditya Bhargava, Qing Dou, Kenneth Dwyer, and Grzegorz Kondrak. DirecTL: a Language Independent Approach to Transliteration. In Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration (NEWS 2009), pages 28–31, Suntec, Singapore, August 2009. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/W/W09/W09-3504.

Jiampojamarn, Sittichai, Colin Cherry, and Grzegorz Kondrak. Integrating Joint n-gram Features into a Discriminative Training Framework. In Proceedings of HLT-NAACL, pages 697–700. The Association for Computational Linguistics, 2010a. ISBN 978-1-932432-65-7. URL http://dblp.uni-trier.de/db/conf/naacl/naacl2010.html#JiampojamarnCK10.

Jiampojamarn, Sittichai, Kenneth Dwyer, Shane Bergsma, Aditya Bhargava, Qing Dou, Mi-Young Kim, and Grzegorz Kondrak. Transliteration Generation and Mining with Limited Training Resources. In Proceedings of the 2010 Named Entities Workshop, pages 39–47, Uppsala, Sweden, July 2010b. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/W10-2405.

Koehn, Philipp. Europarl: A Parallel Corpus for Statistical Machine Translation. In Conference Proceedings: the tenth Machine Translation Summit, pages 79–86, Phuket, Thailand, 2005. AAMT. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

Kukich, Karen. Techniques for Automatically Correcting Words in Text. ACM Comput. Surv., 24(4):377–439, Dec. 1992. ISSN 0360-0300. doi: 10.1145/146370.146380. URL http://doi.acm.org/10.1145/146370.146380.

Mehler, Alexander, Tim vor der Brück, Rüdiger Gleim, and Tim Geelhaar. Towards a Network Model of the Coreness of Texts: An Experiment in Classifying Latin Texts using the TTLab Latin Tagger. In Biemann, Chris and Alexander Mehler, editors, Text Mining: From Ontology Learning to Automated Text Processing Applications, Theory and Applications of Natural Language Processing, pages 87–112. Springer, Berlin/New York, 2015.

Migne, Jacques-Paul, editor. Patrologiae cursus completus: Series latina. 1–221. Chadwyck-Healey, Cambridge, 1844–1855.

Mitankin, Petar, Stefan Gerdjikov, and Stoyan Mihov. An Approach to Unsupervised Historical Text Normalisation. In Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage, Proceedings of DATeCH ’14, pages 29–34, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2588-2. doi: 10.1145/2595188.2595191. URL http://doi.acm.org/10.1145/2595188.2595191.

Müller, Thomas, Helmut Schmid, and Hinrich Schütze. Efficient Higher-Order CRFs for Morphological Tagging. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 322–332, Seattle, Washington, USA, October 2013. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D13-1032.

Nicolai, Garrett, Colin Cherry, and Grzegorz Kondrak. Inflection Generation as Discriminative String Transduction. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 922–931, Denver, Colorado, May–June 2015. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/N15-1093.

Novak, Josef Robert, Nobuaki Minematsu, and Keikichi Hirose. Phonetisaurus: Exploring grapheme-to-phoneme conversion with joint n-gram models in the WFST framework. Natural Language Engineering, 2015.

Okazaki, Naoaki, Yoshimasa Tsuruoka, Sophia Ananiadou, and Jun’ichi Tsujii. A Discriminative Candidate Generator for String Transformations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), EMNLP ’08, pages 447–456, Stroudsburg, PA, USA, 2008. Association for Computational Linguistics. URL http://dl.acm.org/citation.cfm?id=1613715.1613772.

Pirinen, Tommi A. and Krister Lindén. State-of-the-Art in Weighted Finite-State Spell-Checking. In Computational Linguistics and Intelligent Text Processing - 15th International Conference, CICLing 2014, Kathmandu, Nepal, April 6-12, 2014, Proceedings, Part II, pages 519–532, 2014. doi: 10.1007/978-3-642-54903-8_43. URL http://dx.doi.org/10.1007/978-3-642-54903-8_43.

Raaijmakers, Stephan. A deep graphical model for spelling correction. In Proceedings BNAIC 2013, 2013.

Reynolds, L. D. and Nigel Wilson. Scribes and scholars. A guide to the transmission of Greek and Latin literature. Clarendon Press, Oxford, 3rd edition, 1991. ISBN 0-19-872145-5.

Rosti, Antti-Veikko I., Necip Fazil Ayan, Bing Xiang, Spyridon Matsoukas, Richard M. Schwartz, and Bonnie J. Dorr. Combining Outputs from Multiple Machine Translation Systems. In Sidner, Candace L., Tanja Schultz, Matthew Stone, and ChengXiang Zhai, editors, Proceedings of HLT-NAACL, pages 228–235. The Association for Computational Linguistics, 2007. URL http://dblp.uni-trier.de/db/conf/naacl/naacl2007.html#RostiAXMSD07.

Springmann, Uwe, Dietmar Najock, Hermann Morgenroth, Helmut Schmid, Annette Gotscharek, and Florian Fink. OCR of historical printings of Latin texts: problems, prospects, progress. In Digital Access to Textual Cultural Heritage 2014, DATeCH 2014, Madrid, Spain, May 19-20, 2014, pages 71–75, 2014. doi: 10.1145/2595188.2595205.

Stolcke, Andreas. SRILM - an extensible language modeling toolkit. In Proceedings International Conference on Spoken Language Processing, pages 257–286, November 2002.

Sun, Xu, Jianfeng Gao, Daniel Micol, and Chris Quirk. Learning Phrase-Based Spelling Error Models from Clickthrough Data. In Hajic, Jan, Sandra Carberry, and Stephen Clark, editors, Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 266–274. The Association for Computational Linguistics, 2010. ISBN 978-1-932432-67-1. URL http://dblp.uni-trier.de/db/conf/acl/acl2010.html#SunGMQ10.

Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3104–3112, 2014.

Toutanova, Kristina and Robert C. Moore. Pronunciation Modeling for Improved Spelling Correction. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 144–151. ACL, 2002. URL http://dblp.uni-trier.de/db/conf/acl/acl2002.html#ToutanovaM02.

vor der Brück, Tim, Alexander Mehler, and Md. Zahurul Islam. ColLex.EN: Automatically Generating and Evaluating a Full-form Lexicon for English. In Proceedings of LREC 2014, Reykjavik, Iceland, 2014.

Wang, Ziqi, Gu Xu, Hang Li, and Ming Zhang. A Probabilistic Approach to String Transformation. IEEE Trans. Knowl. Data Eng., 26(5):1063–1075, 2014. doi: 10.1109/TKDE.2013.11. URL http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.11.

Whitelaw, Casey, Ben Hutchinson, Grace Y. Chung, and Gerard Ellis. Using the Web for Language Independent Spellchecking and Autocorrection. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2, EMNLP ’09, pages 890–899, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics. ISBN 978-1-932432-62-6. URL http://dl.acm.org/citation.cfm?id=1699571.1699629.

Yao, Kaisheng and Geoffrey Zweig. Sequence-to-Sequence Neural Net Models for Grapheme-to-Phoneme Conversion. CoRR, abs/1506.00196, 2015.

Yao, Lei and Grzegorz Kondrak. Joint Generation of Transliterations from Multiple Representations. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 943–952, Denver, Colorado, May–June 2015. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/N15-1095.

The Prague Bulletin of Mathematical Linguistics

The Journal of Charles University
