Open Access

Historical Bibliometrics Using Google Scholar: The Case of Roman Law, 1727–2016



Introduction

Bibliometrics, the quantitative study of publications and citations, can be a useful resource for social sciences and humanities (SSH) research beyond its role in research evaluation and funding schemes (Scharnhorst & Garfield, 2010). From a sociology of science perspective, it enables us to understand the size and growth of disciplines in terms of the number of publications and researchers, as well as changes in dissemination and citation patterns. Additionally, bibliometrics can be used to study the intellectual base, development and exchange of ideas within and between disciplines, in order to illuminate and tackle discipline-specific historiographical questions.

This paper is an extended version of a poster presented at the ISSI2019 conference: Pölönen, J. & Hammarfelt, B., “Historical bibliometrics using Google Scholar: the case of Roman law, 1500–2016”. In G. Catalano et al. (eds), Proceedings of the 17th International Conference of the International Society for Scientometrics and Informetrics, Vol II. Rome: Edizioni Efesto, pp. 2491–2492.

Research in the field of bibliometrics predominantly focuses on contemporary developments, using datasets that rarely provide historical perspectives. Still, historical approaches are not unheard of: de Solla Price's seminal studies of the growth of science are one key example of how developments over longer time periods can be studied (de Solla Price, 1986). Later studies using a longer time frame have examined, for example, the distribution of citations over time (Larivière, Gingras, & Archambault, 2009) and the production of papers per author (Fanelli & Larivière, 2016). Few studies, however, have incorporated publications dating from before the beginning of the 20th century. Hence, the argument can be made that the field of bibliometrics has not fully explored the potential of what Hérubel (1999) calls “historical bibliometrics”. In particular, there is a lack of studies that incorporate not only references to other scholarly publications but also primary sources in the form of historical documents (Colavizza, 2018). There have been several attempts to go beyond established databases in order to study, for example, Catalan literature (Ardanuy, Urbano, & Quintana, 2009), Swedish literature (Hammarfelt, 2012), and Venetian historiography (Colavizza, 2018). Yet the approaches and methods used are often time-consuming and not easily transferred to other contexts and materials.

Considering the lack of coverage in established citation databases, as well as the limitations of local and specific approaches, we propose Google Scholar as a potentially valuable data source for historical studies using bibliometric data. In this paper we therefore study and discuss the potential that Google Scholar data has for studying the development of research fields from a historical perspective. Using the example of “Roman law”, we demonstrate the potential of this approach while also highlighting the challenges and difficulties of applying bibliometric methods to this kind of heterogeneous data. Our approach is a probing one: rather than delivering a fixed recipe for how to conduct “historical bibliometrics”, we hope to open a discussion on what such an approach, which we consider promising, might have to offer in the future. In particular, we find that this and similar initiatives might provide a bridge between bibliometrics and the dynamic field of “digital humanities”.

Background

A well-known problem when studying fields within the humanities is that well-structured international databases (Web of Science and Scopus) offer limited coverage of publications and citations, especially for books and for languages other than English (Nederhof, 2006). Studies comparing international database coverage with comprehensive national publication data drawn from institutional research information systems show that law is among the least covered fields (Kulczycki et al., 2018). This is because in law, articles in books and monographs, as well as national languages, play an important role in the dissemination of research results (Kulczycki et al., 2020; Pichonnaz, 2014; van Leeuwen, 2013).

In comparison, Google Scholar (GS) indexes publication data from a wider range of documents available online, and a recent estimate suggests that it is the most comprehensive database, with more than 389 million records (Gusenbauer, 2018). However, since its launch in 2004 the quality and stability of data in GS have been questioned (Jacsó, 2010), and although the database has developed over the years it is still recommended to include other data sources in order to compare and confirm findings (Halevi, Moed, & Bar-Ilan, 2017). With these precautions in mind, GS offers great advantages when studying research in the social sciences and humanities, its significantly broader coverage of non-English and non-journal materials being particularly important (Halevi, Moed, & Bar-Ilan, 2017; Prins et al., 2016).

In general, studies of GS have focused on issues concerning coverage, both generally and in specific disciplines, and on questions regarding data quality. The age of the sources indexed has not received the same amount of attention, despite temporal coverage being a key question when using bibliometrics to study the development of fields and disciplines. An initial suggestion was that “GS does not perform well for older publications, as these publications and the sources that cite them have not yet been posted on the web” (Harzing & Van der Wal, 2008). Whether this assertion still holds true remains an open question, as the rapid digitisation of older materials makes an increasing amount of material available for indexing by Google Scholar.

Roman law (RL) has constituted an international research field within legal academia since the foundation of the modern Western university in Bologna in the 12th century around the study of the Emperor Justinian's codification (Orestano, 1987; Stein, 1999). Even today, classical studies and Roman law are multilingual research fields (Scheidel, 2007). Since Latin ceased to be the lingua franca, there have been five international RL publishing languages: English, French, German, Italian, and Spanish. Yet, following the growth of national legal systems, an important part of the publishing activity in these languages has consisted of research and elementary publications addressed to national audiences. RL has also grown into an increasingly interdisciplinary field, involving a wide range of SSH researchers and approaches (Pölönen, 2016). In all, Roman law literature provides a good case for probing the historical and linguistic coverage of Google Scholar.

Methods and materials

First, a longitudinal study of the size and growth of the field in terms of the number of publications and authors in five major languages is attempted. The main challenge is the identification of Roman law publications. We propose to resolve this problem by limiting the analysis to publications whose title contains “Roman law” in one of the five languages studied. The rationale is that at any time there are probably only a few RL specialists who have never published an article or book whose title includes “Roman law” in the language of the publication. This method will not yield a complete record of Roman law publications, as not all relevant publications have “Roman law” in their title, but it produces a subset of publications that are most likely relevant to our probing investigation.

Therefore, in order to create a basic dataset, all publication records whose title includes words denoting “Roman law” in the current international publishing languages (English, French, German, Italian, or Spanish), published between the years 1500 and 2016, were retrieved from Google Scholar in August 2017, in blocks not exceeding 1,000 records, using the Publish or Perish (PoP) interface (see Table 3 below). The publication records retrieved from GS with PoP were copied to Excel in RIS format and processed with the BibExcel toolbox developed by Olle Persson (Persson, Danell, & Schneider, 2009).
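The title-based selection rule can be illustrated with a minimal Python sketch. It is not the retrieval procedure performed with Publish or Perish: the function name `detect_language` and the plain substring matching are assumptions made for illustration, and the phrase list follows the search terms reported in Table 3 (with the third German variant spelled “römisches”, which is an assumption about the intended form).

```python
# Title phrases per language, following the search terms in Table 3.
# Assumption: the third German variant is spelled "römisches".
TITLE_PHRASES = {
    "English": ["roman law"],
    "French": ["droit romain"],
    "Italian": ["diritto romano"],
    "Spanish": ["derecho romano"],
    "German": [
        "römische recht",
        "römischen recht",
        "römisches recht",
        "römischen rechts",
    ],
}


def detect_language(title):
    """Return the first language whose phrase occurs in the title, else None.

    Matching is a case-insensitive substring test, which is a simplification
    of how a search engine would match title words.
    """
    lowered = title.casefold()
    for language, phrases in TITLE_PHRASES.items():
        if any(phrase in lowered for phrase in phrases):
            return language
    return None


print(detect_language("A Summary of the Roman Law"))
print(detect_language("Sinopse de direito romano: com tábuas"))
```

Titles that contain none of the phrases, such as the Portuguese example above, return `None` and would fall outside the five language groups.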

Secondly, this dataset of Roman law publications is analyzed to establish the number of publications and authors, differentiating between the five language groups, from 1500 to 2016. The growth of the field is estimated on the basis of the development of the absolute number and average yearly number of publications, as well as the number of authors involved in producing them, in different periods.

Thirdly, bibliometric measurements are performed on the data to investigate its properties and consistency. These include the average number of publications per author (publication productivity), as well as the concentration of publications and citations.

Google Scholar data

For each record (Table 1), the GS data contains the following information (with RIS format tags): reference type (TY), information about the authors (AU), publication title (TI), publisher (PB), publication year (PY), query date (M1), number of and link to GS citations (M1), number of citations since publication year (N1), and an empty end-of-record tag (ER). In the data used for this study, language information was added to each record according to the language in which “Roman law” occurred in the title. Note that this is not necessarily the publication language, for instance if the publication is a review in English of a Roman law book in German that includes “römische Recht” in the title.

Table 1. Google Scholar RIS format tags.

| Tag | Meaning | Example record |
|---|---|---|
| TY | Type of reference (must be the first tag) | TY - CITATION |
| AU | Author (each author on its own line preceded by the tag) | AU - Sciascia, G |
| TI | Title | TI - Sinopse de direito romano: com tábuas |
| PB | Publisher | PB - Oficinas Gráficas de Saraiva |
| PY | Publication year (YYYY/MM/DD) | PY - 1955/// |
| M1 | Number | M1 - Query date: 2017-04-27 |
| M1 | Number | M1 - 14 cites: https://scholar.google.com/scholar?cites=4390463247417945507&as_sdt=2005&sciodt=0,5&hl=en&num=20 |
| N1 | Notes | N1 - Cited By (since 1955): 14 |
| LA | Language | LA - Portuguese |
| ER | End Record | ER - |
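To make the record structure concrete, the following Python sketch parses records written in the tag layout shown in Table 1. It is only an illustration under stated assumptions, not the BibExcel processing used in the study: the function names `parse_ris`, `publication_year` and `citation_count` are hypothetical, and the line pattern assumes the `TAG - value` layout of the example record.

```python
import re
from collections import defaultdict

# Assumed line layout: two-character tag, a hyphen, then the value.
LINE = re.compile(r"^([A-Z][A-Z0-9])\s*-\s*(.*)$")

SAMPLE = """TY - CITATION
AU - Sciascia, G
TI - Sinopse de direito romano: com tábuas
PB - Oficinas Gráficas de Saraiva
PY - 1955///
M1 - Query date: 2017-04-27
M1 - 14 cites: https://scholar.google.com/scholar?cites=4390463247417945507&as_sdt=2005&sciodt=0,5&hl=en&num=20
N1 - Cited By (since 1955): 14
LA - Portuguese
ER -
"""


def parse_ris(text):
    """Parse RIS text into a list of records mapping tag -> list of values."""
    records, current = [], None
    for raw in text.splitlines():
        match = LINE.match(raw.strip())
        if not match:
            continue
        tag, value = match.group(1), match.group(2).strip()
        if tag == "TY":            # TY opens a new record
            current = defaultdict(list)
            current[tag].append(value)
        elif tag == "ER":          # ER closes the current record
            if current is not None:
                records.append(dict(current))
            current = None
        elif current is not None:  # repeated tags (AU, M1) accumulate
            current[tag].append(value)
    return records


def publication_year(record):
    """Return the year from a PY value such as '1955///', or None."""
    for value in record.get("PY", []):
        match = re.match(r"(\d{4})", value)
        if match:
            return int(match.group(1))
    return None


def citation_count(record):
    """Return the count from the 'N cites:' M1 line, or 0 if absent."""
    for value in record.get("M1", []):
        match = re.match(r"(\d+) cites:", value)
        if match:
            return int(match.group(1))
    return 0


record = parse_ris(SAMPLE)[0]
print(record["AU"], publication_year(record), citation_count(record))
```

Because M1 carries both the query date and the citation line, the values are kept as lists per tag rather than overwriting one another.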

In the GS data, each publication is assigned to one of five record types, of which 76% are citations, 13% journal articles, 8% books, 2% PDFs and 1% HTML documents (Table 2). Citations are publications not indexed by GS but referred to by publications indexed by GS, so-called “non-source items”. There are some differences in the distribution of the record types between the languages, though it is not clear what effect, if any, this has on the results. The availability of documents on the internet is greatest for English and German. Books are on average the most highly cited document type.

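Table 2. Google Scholar publication types.

| Record type | English | French | German | Italian | Spanish | All | Average citations per document |
|---|---|---|---|---|---|---|---|
| Citation* | 45 % | 92 % | 57 % | 72 % | 76 % | 76 % | 7.9 |
| Journal | 40 % | 3 % | 21 % | 14 % | 7 % | 13 % | 7.4 |
| Book | 11 % | 3 % | 19 % | 12 % | 8 % | 8 % | 28.1 |
| PDF | 3 % | 1 % | 2 % | 2 % | 6 % | 2 % | 4.9 |
| HTML | 1 % | 1 % | 1 % | 0 % | 3 % | 1 % | 4.5 |
| Other | 0 % | 0 % | 0 % | 0 % | 0 % | 0 % | 1.0 |
| Total | 100 % | 100 % | 100 % | 100 % | 100 % | 100 % | 9.4 |

* Non-source item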

Findings
Number of “Roman law” publications in GS

The data retrieved from GS contains a total of 21,300 publications published between the years 1500 and 2016 whose title includes the words “Roman law” in one of the five languages. The largest group of records consists of 9,983 French publications, which account for 47% of all records. English-language publications make up 18%, Italian publications 13%, Spanish publications 13% and German publications 9% of the records (Table 3).

Table 3. GS records for publications with title including words “Roman law” in five languages 1500–2016.

| Language | Title words | Number of records | Share |
|---|---|---|---|
| English | roman law | 3,783 | 17.8 % |
| French | droit romain | 9,983 | 46.9 % |
| Italian | diritto romano | 2,803 | 13.2 % |
| Spanish | derecho romano | 2,731 | 12.8 % |
| German | “römische Recht” “römischen Recht” “römishes Recht” “römischen Rechts” | 2,000 | 9.4 % |
| All | Total | 21,300 | 100 % |

Earliest Roman law publications in the GS record

The oldest publication year in the record is 1727 for French, 1730 for German, 1772 for English, 1796 for Spanish, and 1833 for Italian (Table 4). The English records contain an error: Cairns, JW, Slavery and the Roman Law of Evidence in Eighteenth-Century Scotland, published in 2006, is mistakenly assigned the publication year 1770. The oldest English-language publication in the record is therefore Taylor, J, A Summary of the Roman Law, from the same author's Elements of the Civil Law, from 1772.

Table 4. Earliest publications in the GS record in each language group.

| Language | Authors | Title | Year |
|---|---|---|---|
| French | Brillon, PJ | . . . : contenant par ordre alphabétique les matières bénéficiales, civiles et criminelles, les maximes du droit ecclésiastique, du droit Romain, du droit public, des . . . | 1727 |
| German | Telgmann, RF | Rud. Fridr. Telgmanns Einleitung zu der Historie der römischen Rechts-Gelehrsamkeit | 1730 |
| English | Cairns, JW; Burrows, A | Slavery and the Roman Law of Evidence in Eighteenth-Century Scotland | 1770 (2006) |
| English | Taylor, J | A Summary of the Roman Law, from the same author's Elements of the Civil Law | 1772 |
| Spanish | Posadilla, J Álvarez; Ibarra, J | . . ., arreglando sus decisiones á las leyes y resoluciones más modernas que en el dia rigen: obra útil a todos los que no hayan estudiado el derecho romano . . . | 1796 |
| Italian | Ursino, S | Discorso per lo stabilimento ed apertura della cattedra del codice leggi civili col confronto del diritto romano | 1833 |

Development of Roman law publications

The earliest publications in the data are from the early 18th century. When the records are divided by publication year into groups of 25 years, the number of Roman law publications can be seen to have increased from around 10 publications in the earliest periods to around 4,000 publications in 2000–2016 (Table 5). Note that the latest time frame covers only 17 years. The early 19th century is the period when the number of publications begins to increase in all language groups.

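Table 5. GS records for publications 1725–2016.

| Period | English | French | German | Italian | Spanish | All |
|---|---|---|---|---|---|---|
| 1725–1749 | 0 | 3 | 7 | 0 | 0 | 10 |
| 1750–1774 | 4 | 1 | 4 | 0 | 0 | 9 |
| 1775–1799 | 3 | 8 | 14 | 0 | 1 | 26 |
| 1800–1824 | 3 | 23 | 41 | 0 | 1 | 68 |
| 1825–1849 | 8 | 354 | 85 | 5 | 26 | 478 |
| 1850–1874 | 32 | 2,540 | 116 | 22 | 32 | 2,742 |
| 1875–1899 | 106 | 5,592 | 213 | 148 | 56 | 6,115 |
| 1900–1924 | 259 | 162 | 218 | 214 | 48 | 901 |
| 1925–1949 | 406 | 276 | 168 | 415 | 125 | 1,390 |
| 1950–1974 | 616 | 414 | 359 | 629 | 301 | 2,319 |
| 1975–1999 | 928 | 294 | 359 | 613 | 721 | 2,915 |
| 2000–2016 | 1,347 | 300 | 363 | 708 | 1,352 | 4,070 |
| No date | 71 | 16 | 53 | 49 | 68 | 257 |
| Total | 3,783 | 9,983 | 2,000 | 2,803 | 2,731 | 21,300 |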
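The grouping of records into 25-year periods, as used in Table 5, can be sketched in Python as follows. The function names `period_label` and `count_by_period`, and the choice to raise an error for years outside 1725–2016, are assumptions made for illustration rather than a description of the tooling used in the study.

```python
from collections import Counter


def period_label(year, start=1725, width=25, last_year=2016):
    """Return the 25-year period label for a year, or 'No date' if missing.

    The final period is truncated at last_year, so 2000-2016 covers 17 years.
    """
    if year is None:
        return "No date"
    if year < start or year > last_year:
        raise ValueError(f"year {year} is outside {start}-{last_year}")
    lower = start + ((year - start) // width) * width
    upper = min(lower + width - 1, last_year)
    return f"{lower}–{upper}"


def count_by_period(years):
    """Count publication years per period label."""
    return Counter(period_label(year) for year in years)


# Publication years mentioned in the text, plus one undated record.
print(count_by_period([1727, 1730, 1772, 2006, None]))
```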

The largest number of publications (6,115) was published in 1875–1899. Looking at the different language groups, there is a very large number of French publications (mostly theses and dissertations) in the record in the period 1825–1899 (Table 5 and Figure 1). The requirements of French legal education could explain the growth in the number of Roman law publications in this period, as well as the sudden decrease at the beginning of the 20th century. Following the introduction of the Code Napoléon in 1804 and the reform of law schools, between 1808 and 1895 a doctoral thesis in law consisted of two dissertations, one of which had to be based on Roman law (Imbert, 1984).

Figure 1

Average number of publications per year 1725–2016.

The average yearly number of publications ceases to increase after 1975 in Italian, German, and French languages, while it continues to rise in English and Spanish (Figures 1 and 2).
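The average yearly number of publications is the period count divided by the number of years in the period, which matters because the final period is shorter than the others. A small Python sketch using the “All” column of Table 5 follows; the function name `average_per_year` is a hypothetical helper introduced for illustration.

```python
def average_per_year(count, first_year, last_year):
    """Average number of publications per year over an inclusive period."""
    return count / (last_year - first_year + 1)


# Period counts taken from the "All" column of Table 5.
periods = [
    (1875, 1899, 6115),
    (1975, 1999, 2915),
    (2000, 2016, 4070),  # only 17 years
]

for first, last, count in periods:
    print(f"{first}–{last}: {average_per_year(count, first, last):.1f} per year")
```

Under this calculation the 17-year period 2000–2016 averages roughly 239 publications per year, close to the roughly 245 per year of 1875–1899, even though its absolute count is lower.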

Figure 2

Average number of publications per year 1900–2016.

Number of Roman law authors in GS

According to the GS data as it stands, the 21,300 Roman law publications published between 1727 and 2016 have a total of 11,420 different authors. Because one author can appear in more than one language group, the sum of the authors of all language groups is larger than 11,420. The largest group consists of the 6,326 authors of French publications, who account for 51% of all authors. The authors of English-language publications make up 18%, Spanish publications 11%, German publications 10%, and Italian publications 10% of all authors (Table 6). Differences in the number of publications per author are relatively small between the language groups, with a notably higher average for the authors of Italian publications.

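Table 6. Number of authors in the GS records for publications 1500–2016.

| Language | Authors | Share | Publications per author |
|---|---|---|---|
| English | 2,163 | 17.6 % | 1.7 |
| French | 6,326 | 51.4 % | 1.6 |
| German | 1,241 | 10.1 % | 1.6 |
| Italian | 1,185 | 9.6 % | 2.4 |
| Spanish | 1,404 | 11.4 % | 1.9 |
| All | 12,319 | 100 % | 1.7 |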

To study the development in the number of authors, each author in the data was assigned the average year of all their publications, on the basis of which authors were divided into groups representing periods of 25 years. Overall, the development of the number of authors follows quite closely the pattern already observed for publications (Table 5). The largest number of authors is attested in 1875–1899, the vast majority being related to the French publications (Table 7). The average number of publications per author in the GS dataset has increased somewhat (Figure 3).
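The assignment of authors to periods by the average year of their publications can be sketched as follows. This is a minimal illustration, assuming that undated publications are ignored and that authors with no dated publication are left out (corresponding to a “No date” group); the function name `author_mean_years` and the placeholder author names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def author_mean_years(publications):
    """Map each author to the mean year of their dated publications.

    publications: iterable of (list_of_authors, year_or_None) pairs.
    Authors with only undated publications are omitted from the result.
    """
    years = defaultdict(list)
    for authors, year in publications:
        if year is None:
            continue
        for author in authors:
            years[author].append(year)
    return {author: mean(values) for author, values in years.items()}


# Placeholder records, not taken from the dataset.
publications = [
    (["Author A"], 1921),
    (["Author A", "Author B"], 1939),
    (["Author C"], None),
]
print(author_mean_years(publications))
```

Each mean year can then be placed in a 25-year period in the same way as a publication year.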

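Table 7. GS records for authors 1725–2016.

| Period | English | French | German | Italian | Spanish | All |
|---|---|---|---|---|---|---|
| 1725–1749 | 0 | 1 | 4 | 0 | 0 | 5 |
| 1750–1774 | 2 | 5 | 2 | 2 | 0 | 11 |
| 1775–1799 | 3 | 11 | 3 | 3 | 2 | 22 |
| 1800–1824 | 8 | 21 | 8 | 3 | 4 | 44 |
| 1825–1849 | 54 | 220 | 23 | 25 | 32 | 354 |
| 1850–1874 | 269 | 1,387 | 163 | 147 | 178 | 2,144 |
| 1875–1899 | 448 | 2,236 | 269 | 245 | 314 | 3,512 |
| 1900–1924 | 124 | 228 | 75 | 65 | 84 | 576 |
| 1925–1949 | 142 | 255 | 106 | 93 | 94 | 690 |
| 1950–1974 | 242 | 358 | 130 | 148 | 167 | 1,045 |
| 1975–1999 | 398 | 502 | 187 | 181 | 224 | 1,492 |
| 2000–2016 | 446 | 961 | 252 | 251 | 283 | 2,193 |
| No date | 27 | 141 | 19 | 22 | 22 | 231 |
| Total | 2,163 | 6,326 | 1,241 | 1,185 | 1,404 | 12,319 |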

Figure 3

Publications per author 1725–2016.

Distribution of publications to authors

Publications are unevenly distributed among authors. The 10 most prolific authors in the dataset are Savigny with 79 publications, followed by Bonfante (71), Gaudemet (64), Hamza (61), Kaser (58), Arangio-Ruiz (57), Buckland (56), Lemosse (56), Wenger (54), and Stein (49). Of all 11,423 authors in the GS dataset, however, only 2% have more than 10 publications, while 65% are recorded with only one publication with “Roman law” in the title in any one of the five languages (Figure 4). Publications are thus concentrated among a small share of authors: one-half of all publications is produced by the 16% most prolific authors (Figure 5).
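A concentration figure of this kind, the smallest share of authors that together account for half of all publications, can be computed as in the Python sketch below. The function name `share_needed_for_half` and the toy counts are assumptions for illustration; the sketch also assumes that each author's publication count is taken as given, without correcting for co-authorship.

```python
def share_needed_for_half(counts, target=0.5):
    """Smallest share of items (e.g. authors) whose counts reach the target
    fraction of the total, taking the largest counts first."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    if total == 0:
        raise ValueError("counts must sum to a positive number")
    running = 0
    for rank, value in enumerate(ordered, start=1):
        running += value
        if running >= target * total:
            return rank / len(ordered)


# Toy publication counts per author, not taken from the dataset.
print(share_needed_for_half([10, 5, 2, 1, 1, 1]))
```

In the toy data one of six authors already accounts for half of the 20 publications, so the function returns one sixth.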

Figure 4

Concentration of publication to authors.

Figure 5

Concentration of citations to publications.

Distribution of citations to publications

Citations are even more unevenly distributed. The 21,300 Roman law publications from 1727 to 2016 have accumulated a total of 61,121 citations. However, 73% of the publications have received no citations in other scholarly publications recorded in Google Scholar. The most highly cited 1% of publications account for one-half of all citations (Figure 5). The most highly cited author in the GS dataset is Buckland with 1,758 citations, followed by Zimmermann (1,716), Nicholas (1,581), Savigny (1,486), Watson (1,477), Stein (1,283), Schulz (1,083), Berger (911), Jolowicz (856), and Gardner (835).

The distribution of citations across publications is highly skewed, even more so than what is usually found in bibliometric studies. In comparison, in a study of citation distributions across research fields, Bornmann and Leydesdorff (2017) found that 10% of the papers accounted for about 50% of the total citations, and that the skewness was highest in the humanities, where 68.5% of the citations were generated by the 10% most cited papers. A suggested explanation for this was the low coverage of sources indexed in Web of Science. While coverage in Google Scholar might be an issue here as well, we suggest that other factors could also explain the skewness. One possible explanation is the long time frame used, which might emphasise the role of older classic materials. Notably, the most cited publication, A text-book of Roman law: From Augustus to Justinian by Buckland, was first published in 1921 and has since been published in several editions. The reliance on classic works, which in turn results in a narrow intellectual base for the discipline, might thus be one explanation for the skewness of citations.
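The comparison above rests on the share of all citations received by the most cited fraction of publications. A minimal Python sketch of that measure follows; the function name `top_share`, the rounding rule for the size of the top group, and the toy citation counts are assumptions for illustration.

```python
def top_share(citations, top_fraction=0.10):
    """Share of all citations received by the top fraction of publications."""
    ordered = sorted(citations, reverse=True)
    total = sum(ordered)
    if total == 0:
        raise ValueError("citations must sum to a positive number")
    # Assumption: the top group size is rounded, with at least one item.
    top_n = max(1, round(top_fraction * len(ordered)))
    return sum(ordered[:top_n]) / total


# Toy citation counts for ten publications, not taken from the dataset.
toy = [50, 20, 10, 5, 5, 4, 3, 2, 1, 0]
print(top_share(toy, 0.10))
print(top_share(toy, 0.20))
```

Calling the function with `top_fraction=0.01` on real citation counts would give the kind of figure reported for the most highly cited 1% of publications.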

Limitations

The quality of data in GS must be considered in any study of this kind. Two or more authors with the same name can be conflated into one author, and small differences in a name can result in one person being mistaken for two or more different authors. Thus, homonyms and synonyms, as well as misspellings and duplicates, may cause errors in the data. Such errors were indeed discovered in the analysis: for example, a monograph from 2006 was mistakenly indexed as having been published in 1770. The prevalence of such errors is difficult to estimate, and automated detection and correction of errors was not deemed feasible due to the heterogeneous nature of the data in terms of publication types, languages and age of sources. Manual checking of records would have required extensive resources, especially when checking so-called “non-source items”, i.e. publications that are not indexed by GS but referred to by publications indexed by GS. Thus, the cleaning and checking of data remains a major challenge in using Google Scholar as a source for historical bibliometrics, and methods for improving data quality are a central issue in developing this line of research.

Discussion

Using Roman law as an example, this study tested the possibilities of using Google Scholar data for historical analysis. The analysis reveals that Google Scholar can be used to find key publications and track citations over a long period of time. In this case we were able to find citations to works dating back to 1727. Moreover, the distribution of publications from different language areas over time allowed us to gain insights into larger developments within the field of Roman law. Our data shows a large surge of French Roman law publications in the 19th century, a possible explanation being the specific requirements of French legal education at the time. While one should be careful not to draw overly strong conclusions from these findings (they may partly be due to the availability of digitised materials), we find that historical approaches might be used for further, more detailed analysis of intellectual development. At the same time, it is obvious that studies of this kind need to be supported by experts in the field for findings to be contextualised and corroborated.

Obviously, the choice of GS also comes with a range of methodological challenges, many of which have already been discussed in the literature. The data used is automatically collected, and issues concerning inadequate data, duplicates, homonyms etc. are therefore persistent. A further issue when working with material that is heterogeneous in terms of languages is how to deal with different translations of the same work and, of course, different editions of seminal texts.

An insight from working with this material is that “historical bibliometrics” places specific demands on researchers, as they need to combine historical insights with bibliometric competencies. Working with a field such as Roman law necessitates language skills that allow access to the material analysed. Moreover, when conducting bibliometric analysis of older materials there is a need to understand the publication culture, or knowledge infrastructure, in which these materials were produced. Publication patterns, dissemination paths, and referencing practices (Grafton, 1999) are all important aspects to consider.

In this paper we have used data retrieved from GS. Possible next steps in developing the dataset and the method include manual checking of the data quality, identification of duplicate records and of mistakes in author attribution or publication year, and estimation of the effect of data cleaning on the results. It is also possible to investigate to what extent the GS dataset of “Roman law” publications overlaps with similar datasets retrieved from Web of Science, Scopus, and Microsoft Academic. These datasets, and the GS dataset, could be compared to a more complete field-specific bibliography, L’année philologique, in which references to scholarly literature concerning classical studies, including Roman law, have been collected since 1888. More generally, further inquiries into the specificities of historical citation patterns and distributions, including the institutional and disciplinary affiliations of authors, could provide valuable insights into the social and intellectual history of research fields.

Conclusions

In all, we find Google Scholar to be a promising data source for historical bibliometrics: it is accessible, has broad coverage and, as demonstrated, has considerable historical depth, which allows for the analysis of older materials. At the same time there are distinct disadvantages: the quality of data is low, and as the database is continuously updated it is hard to reproduce earlier searches and data collections. The historical coverage is moreover dependent on the continuing digitisation and online availability of older materials, which may in turn have a large influence on the coverage in specific fields. Still, the possibilities for historical bibliometrics will most likely increase as the digitisation of older materials progresses. Hence, while the approach taken here is a probing one, with many difficulties still to be solved, we find that employing Google Scholar data for historical studies of fields and disciplines is a promising path for the future, and it is likely that such a path might attract travellers among bibliometricians as well as historians and other digital humanists.

eISSN: 2543-683X
Language: English
Publication timeframe: 4 times per year
Journal Subjects: Computer Sciences, Information Technology, Project Management, Databases and Data Mining