Evaluation of Fingerprint Selection Algorithms for Local Text Reuse Detection

  • 1 Riga Technical University, Riga, Latvia

Abstract

Detection of local text reuse is central to a variety of applications, including plagiarism detection, origin detection, and information flow analysis. This paper evaluates and compares the effectiveness of fingerprint selection algorithms for the source retrieval stage of local text reuse detection. Six algorithms are compared: Every p-th, 0 mod p, Winnowing, Hailstorm, Frequency-biased Winnowing (FBW), and a proposed modified version of FBW (MFBW).
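To make the idea of fingerprint selection concrete, the following is a minimal Python sketch of the three simplest algorithms named above: Every p-th, 0 mod p, and Winnowing. It is an illustration under stated assumptions, not the implementation evaluated in the paper. The function names, the word 3-gram size, the MD5-based 32-bit hash, and the parameter values are all hypothetical choices made here. The Winnowing tie-breaking rule (take the rightmost minimum in a window) follows the basic rule described in the original Winnowing paper by Schleimer et al. Hailstorm, FBW, and MFBW are not sketched.

```python
import hashlib


def ngram_hashes(words, n=3):
    """Hash every word n-gram to a 32-bit integer.

    Assumption: n=3 and an MD5-derived hash are illustrative choices only.
    """
    hashes = []
    for i in range(len(words) - n + 1):
        gram = " ".join(words[i:i + n])
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        hashes.append(int.from_bytes(digest[:4], "big"))
    return hashes


def every_pth(hashes, p):
    """Every p-th: keep the hashes at positions 0, p, 2p, ..."""
    return [(i, h) for i, h in enumerate(hashes) if i % p == 0]


def zero_mod_p(hashes, p):
    """0 mod p: keep the hashes whose value is divisible by p."""
    return [(i, h) for i, h in enumerate(hashes) if h % p == 0]


def winnowing(hashes, w):
    """Winnowing: keep the minimum hash of every window of w hashes.

    Ties are broken by taking the rightmost minimum, and a position that
    stays the minimum across consecutive windows is recorded only once.
    Sequences shorter than w yield no fingerprints in this sketch.
    """
    selected = []
    last_pos = -1
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Index of the rightmost occurrence of the minimum in this window.
        offset = max(j for j, h in enumerate(window) if h == smallest)
        pos = start + offset
        if pos != last_pos:
            selected.append((pos, smallest))
            last_pos = pos
    return selected


if __name__ == "__main__":
    text = "detection of local text reuse is central to many applications"
    hashes = ngram_hashes(text.split())
    print("every 2nd :", every_pth(hashes, 2))
    print("0 mod 2   :", zero_mod_p(hashes, 2))
    print("winnowing :", winnowing(hashes, 4))
```

Each function returns (position, hash) pairs so that a match found during source retrieval can be traced back to its location in the document. The three methods trade off differently. Every p-th depends on absolute position, so an inserted word shifts all later selections. 0 mod p depends only on content but gives no bound on the gap between selected fingerprints. Winnowing selects at least one fingerprint from every window of w hashes.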

Most previously published studies in local text reuse detection are based on datasets in which the text reuse is artificially generated, long, or unobfuscated. To evaluate the performance of the algorithms, this study builds a new dataset of real text reuse cases from Bachelor's and Master's theses (written in English in the field of computer science), in which about half of the cases involve less than 1% of the document text and about two-thirds of the cases involve paraphrasing.

In the experiments, the best overall detection quality is reached by Winnowing, 0 mod p, and MFBW. The proposed MFBW algorithm is a considerable improvement over FBW and is among the best-performing algorithms.

The software developed for this study is freely available at the author’s website http://www.cs.rtu.lv/jekabsons/.


