RealText-lex: A Lexicalization Framework for RDF Triples

Open access

Abstract

The online era has made vast amounts of information available in the public and semi-restricted domains, prompting the development of a corresponding host of technologies to organize and navigate it. One of these technologies encodes information from free-form natural language into a structured form as RDF triples. This representation enables machine processing of the data; however, the processed information cannot be directly converted back to human language. This has created a need to lexicalize machine-processed data stored as triples into natural language, so that there is a seamless transition between the machine representation of information and information meant for human consumption. This paper presents a framework to lexicalize RDF triples extracted from DBpedia, a central interlinking hub for the emerging Web of Data. The framework comprises four pattern mining modules that generate lexicalization patterns to transform triples into natural language sentences. Three of these modules are based on lexicons, and the fourth extracts relations from unstructured text to generate lexicalization patterns. A linguistic accuracy evaluation and a human evaluation on a sub-sample showed that the framework can produce patterns that are accurate and exhibit the qualities of human-generated text.
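To make the idea of a lexicalization pattern concrete, the following is a minimal sketch of how a triple might be turned into a sentence once a pattern for its predicate is available. It is not the framework's implementation: the `Triple` type, the `PATTERNS` table, the `{s}`/`{o}` placeholder syntax, and the example predicates are all assumptions made for illustration, and the pattern mining step itself is not shown.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    """A simplified RDF triple with plain-string labels (illustrative only)."""
    subject: str
    predicate: str
    obj: str


# Hypothetical pattern table: predicate -> sentence template.
# In the framework such patterns would come from the mining modules;
# here they are written by hand purely as an example.
PATTERNS = {
    "birthPlace": "{s} was born in {o}.",
    "spouse": "{s} is married to {o}.",
}


def lexicalize(triple: Triple, patterns: dict = PATTERNS) -> Optional[str]:
    """Fill the pattern for the triple's predicate, or return None if no pattern exists."""
    template = patterns.get(triple.predicate)
    if template is None:
        return None
    return template.format(s=triple.subject, o=triple.obj)


if __name__ == "__main__":
    print(lexicalize(Triple("Steve Jobs", "birthPlace", "San Francisco")))
```

Returning `None` for an unknown predicate keeps the sketch honest about coverage: a triple without a mined pattern is left unlexicalized rather than forced into a generic sentence.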

The Prague Bulletin of Mathematical Linguistics

The Journal of Charles University
