Linking Datasets Using Semantic Textual Similarity

Open access


Linked data is widely recognized as an important paradigm for representing data, and one of the most important aspects of supporting its use is the discovery of links between datasets. Many datasets carry a significant amount of textual information in the form of labels, descriptions and documentation about their elements, and precise linking rests on applying semantic textual similarity to this text. However, most linking tools so far rely only on simple string similarity metrics such as Jaccard scores. We present an evaluation of some metrics that have performed well in recent semantic textual similarity evaluations and apply these to linking existing datasets.
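
To make the limitation of simple string metrics concrete, the following minimal sketch computes a Jaccard score over the word sets of two labels. It is an illustration only, not the evaluation setup of the paper: the function name `jaccard`, the whitespace tokenization, and the example labels are all assumptions chosen for this sketch.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard score between the lowercase word sets of two labels.

    Tokenization by whitespace is an assumption made for this sketch.
    """
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        # Two empty labels are treated as identical by convention here.
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Hypothetical label pairs, invented for illustration.
pairs = [
    ("heart attack", "myocardial infarction"),  # synonyms, no shared words
    ("kidney stone", "stone in the kidney"),    # paraphrase, shared words
]

for left, right in pairs:
    print(f"{left!r} vs {right!r}: {jaccard(left, right):.2f}")
```

In this sketch the first pair shares no words and therefore receives the lowest possible score even though the two labels mean the same thing, which is the kind of case where a semantic textual similarity metric would be expected to help.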
