Principal component analysis for authorship attribution
Background: To identify the authors of texts with statistical tools, one must first decide which features characterize an author and then extract those features from the texts. The features extracted are mostly counts of so-called function words. Objectives: The extracted data are compressed into a representation with fewer features, in such a way that the compressed data remain effective discriminators between authors. The resulting feature space has a lower dimensionality than the text itself. Methods/Approach: In this paper, data collected by counting words and characters in around a thousand paragraphs of each sample book underwent a principal component analysis performed using neural networks. Once the analysis was complete, the first principal component was used to distinguish the books written by a given author. Results: The results show that every author leaves a unique signature in written text, which can be discovered by analyzing counts of short words per paragraph. Conclusions: In this article we have demonstrated that authorship can be traced by applying principal component analysis to counts of short words per paragraph. The methodology could be used for other purposes, such as fraud detection in auditing.
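The pipeline summarized in the abstract (count short words per paragraph, run a neural principal component analysis, keep the first component) can be illustrated with a minimal sketch. The abstract does not say which network was used; the sketch assumes Oja's rule, a single linear neuron whose weight vector converges to the first principal direction. The function names, the toy vocabulary, the toy paragraphs, and the learning parameters are all hypothetical and chosen only for illustration.

```python
import random
import re


def short_word_counts(paragraph, vocabulary):
    """Count how often each vocabulary word occurs in one paragraph."""
    tokens = re.findall(r"\w+", paragraph.lower())
    return [float(tokens.count(word)) for word in vocabulary]


def center(rows):
    """Subtract the per-feature mean; return centered rows and the means."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    return centered, means


def first_component_oja(rows, epochs=200, rate=0.01, seed=0):
    """Estimate the first principal direction of centered rows with Oja's rule.

    Assumption: the paper's network is not specified in the abstract;
    Oja's single-neuron rule is used here as one neural approach to PCA.
    """
    rng = random.Random(seed)
    w = [rng.uniform(-1.0, 1.0) for _ in range(len(rows[0]))]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]
    for _ in range(epochs):
        for x in rows:
            y = sum(wi * xi for wi, xi in zip(w, x))  # neuron output
            # Hebbian term y*x with the decay term y*y*w that bounds the norm
            w = [wi + rate * y * (xi - y * wi) for wi, xi in zip(w, x)]
    norm = sum(v * v for v in w) ** 0.5
    return [v / norm for v in w]


def project(rows, w):
    """Score each centered row on the estimated first component."""
    return [sum(wi * xi for wi, xi in zip(w, x)) for x in rows]


if __name__ == "__main__":
    # Hypothetical toy data; the study used about a thousand paragraphs per book.
    vocabulary = ["the", "and", "of", "to"]
    paragraphs = [
        "The bridge and the river and the town of stone.",
        "To walk to the bridge is to remember the town.",
        "Of all the roads, the road to the river is the oldest.",
    ]
    counts = [short_word_counts(p, vocabulary) for p in paragraphs]
    centered, _ = center(counts)
    w = first_component_oja(centered)
    print(project(centered, w))
```

The sign of the estimated direction is arbitrary, so scores from different runs or books are comparable only after fixing a sign convention. With real data, each paragraph's score on the first component would be the quantity compared across books.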