Git Blame Who?: Stylistic Authorship Attribution of Small, Incomplete Source Code Fragments

Edwin Dauber 1, Aylin Caliskan 2, Richard Harang 3, Gregory Shearer 4, Michael Weisman 5, Frederica Nelson 6, and Rachel Greenstadt 7
  • 1 Drexel University
  • 2 George Washington University
  • 3 Sophos Data Science Team
  • 4 ICF International
  • 5 United States Army Research Laboratory
  • 6 United States Army Research Laboratory
  • 7 New York University


Program authorship attribution has implications for the privacy of programmers who wish to contribute code anonymously. While previous work has shown that individually authored complete files can be attributed, these efforts have focused on ideal data sets such as contest submissions and student assignments. We explore the problem of authorship attribution “in the wild,” examining source code obtained from open-source version control systems, and investigate how contributions can be attributed to their authors on either an individual or a per-account basis. We present a study of attribution of code collected from collaborative environments and identify factors that make attribution of code fragments more or less successful. For individual contributions, we show that previous methods, adapted to short code fragments, identify the correct author out of a set of 104 programmers with an accuracy of approximately 50% or 60%, depending on whether we average by sample or by author. By ensembling the classification probabilities of a sufficiently large set of samples belonging to the same author, we achieve much higher accuracy for assigning the set of samples to the correct author from a known suspect set. Additionally, we propose the use of calibration curves to identify which samples are by unknown and previously unencountered authors.
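
The ensembling step described above can be illustrated with a minimal sketch. The abstract does not say how the per-sample probabilities are combined, so the plain averaging used here is an assumption, and the function name `ensemble_author`, the author labels, and the probability values are all hypothetical.

```python
from typing import Dict, List


def ensemble_author(sample_probs: List[Dict[str, float]]) -> str:
    """Average per-sample class probabilities and return the top-scoring author.

    Assumption: simple averaging; the paper may combine probabilities differently.
    """
    if not sample_probs:
        raise ValueError("need at least one sample")
    totals: Dict[str, float] = {}
    for probs in sample_probs:
        for author, p in probs.items():
            totals[author] = totals.get(author, 0.0) + p
    n = len(sample_probs)
    averaged = {author: total / n for author, total in totals.items()}
    return max(averaged, key=averaged.get)


# Hypothetical classifier outputs for three fragments known to share one author.
fragments = [
    {"alice": 0.40, "bob": 0.45, "carol": 0.15},
    {"alice": 0.50, "bob": 0.20, "carol": 0.30},
    {"alice": 0.45, "bob": 0.35, "carol": 0.20},
]
print(ensemble_author(fragments))
```

In this made-up example the first fragment taken alone would be attributed to `bob`, while the averaged probabilities point to `alice`, which is the kind of correction that pooling samples from the same account is meant to provide.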


