Program authorship attribution has implications for the privacy of programmers who wish to contribute code anonymously. While previous work has shown that individually authored complete files can be attributed, these efforts have focused on ideal data sets such as contest submissions and student assignments. We explore the problem of authorship attribution “in the wild,” examining source code obtained from open-source version control systems, and investigate how contributions can be attributed to their authors, either on an individual or a per-account basis. In this work, we present a study of attribution of code collected from collaborative environments and identify factors that make attribution of code fragments more or less successful. For individual contributions, we show that previous methods (adapted to be applied to short code fragments) yield an accuracy of approximately 50% or 60%, depending on whether we average by sample or by author, at identifying the correct author out of a set of 104 programmers. By ensembling the classification probabilities of a sufficiently large set of samples belonging to the same author, we achieve much higher accuracy for assigning the set of samples to the correct author from a known suspect set. Additionally, we propose the use of calibration curves to identify which samples are by unknown and previously unencountered authors.
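The idea of ensembling per-sample classification probabilities can be illustrated with a minimal sketch. It assumes that each code fragment has already been scored by some classifier that outputs one probability per candidate author, and that the fragments are combined by simple averaging; the abstract does not state the exact aggregation rule, so the averaging step, the function name `ensemble_attribution`, and the example numbers are all hypothetical.

```python
from typing import List, Sequence, Tuple


def ensemble_attribution(
    sample_probs: Sequence[Sequence[float]],
) -> Tuple[int, List[float]]:
    """Average per-sample author probabilities and pick the top author.

    sample_probs[i][a] is the (assumed) classifier probability that
    fragment i was written by candidate author a.
    Returns (index of the most likely author, averaged probabilities).
    """
    if not sample_probs:
        raise ValueError("need at least one sample")
    n_authors = len(sample_probs[0])
    if any(len(p) != n_authors for p in sample_probs):
        raise ValueError("all samples must score the same set of authors")

    n_samples = len(sample_probs)
    # Assumption: plain arithmetic mean over fragments from the same account.
    averaged = [
        sum(p[a] for p in sample_probs) / n_samples for a in range(n_authors)
    ]
    best = max(range(n_authors), key=averaged.__getitem__)
    return best, averaged


# Hypothetical scores: three fragments from one account, three suspects.
# Taken alone, the second fragment would be attributed to author 1.
probs = [
    [0.40, 0.35, 0.25],
    [0.30, 0.45, 0.25],
    [0.50, 0.20, 0.30],
]
best, averaged = ensemble_attribution(probs)
print(best, [round(x, 3) for x in averaged])
```

In this toy input the individual fragments disagree, while the averaged distribution favors a single author, which is the intuition behind attributing a set of samples rather than each sample separately.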