Document- and Keyword-based Author Co-citation Analysis


In scientometrics, the principal purpose of author co-citation analysis (ACA) is to map knowledge domains by quantifying the relationships between co-cited author pairs. However, traditional ACA has been criticized because its input, which simply counts authors’ co-citation frequencies, is insufficiently informative. To address this issue, this paper introduces a new method, Document- and Keyword-Based Author Co-Citation Analysis (DKACA), which reconstructs the raw co-citation matrices by taking into account document unit counts and the keywords of references. Building on traditional ACA, DKACA counts co-citation pairs by document units instead of authors from the global network perspective. Moreover, by incorporating keywords from cited papers, DKACA captures the semantic similarity between co-cited papers. To validate the method, we use network visualization and MDS measurement to evaluate the effectiveness of DKACA. Results suggest that DKACA not only reveals previously unknown insights but also improves the performance and accuracy of knowledge domain mapping, providing a new basis for further studies.


1 Introduction

Author co-citation analysis (ACA) was first proposed by Drs. H. D. White and B. C. Griffith in 1981. As a significant branch of scientometrics, ACA aims to map scientific domains by revealing the relationships among co-cited authors (McCain, 1990). Over the past decades, ACA has developed considerably and has been applied extensively in various domains, such as library and information science (Anegón et al., 1998), computer science (Eom, 1998), management science (Charvet et al., 2008), psychology (Bruer, 2010), management information systems (Chen, 2011), public policy (Nykiforuk, Osler, & Viehbeck, 2010), and supply chain (Charvet, Cooper, & Gardner, 2008).

However, many scholars have criticized traditional ACA for its inaccuracy (Jeong et al., 2014): it focuses only on the authors themselves and ignores other information, such as how many times the same author is cited in each paper. To improve the output of traditional ACA by taking actual cited paper counts into account, we propose a document-based counting algorithm that weighs each co-cited author pair by the number of co-cited papers of the two authors rather than simply by whether the authors are co-cited. Changing the counting method in this way makes the raw co-citation matrix more informative and accurate and uncovers more details that improve the results of knowledge domain mapping.

Furthermore, keywords have rarely been used in previous ACA studies because topical similarity is difficult to calculate. Nevertheless, keywords are helpful in citation analysis because they express the core content of a paper and facilitate understanding of its meaning (Griffith & Steyvers, 2004). To utilize keywords and improve the knowledge domain map, we construct a keyword-based thesaurus that can measure the relationships between words as well as between papers. We collected the keywords of references, calculated their semantic similarities according to their relative positions in the thesaurus, and then combined these semantic similarities with traditional ACA. As elaborated in the results, this method clusters scientific communities better in knowledge domain mapping, making subfields easier to discern.

This paper is organized as follows. Related work is described in Section 2, and the problem is stated in Section 3. In Section 4, we describe our dataset and the process of Document- and Keyword-Based Author Co-Citation Analysis (DKACA). Performance evaluation and analysis of the results are detailed in Section 5. Finally, we summarize our contributions and limitations in the last section.

2 Related Studies

2.1 Author Co-Citation Analysis

Traditional ACA constructs the raw co-citation matrix mainly by calculating co-citation frequencies among co-cited authors and maps the knowledge domains of certain field(s) via a series of analyses and transformations (Jeong et al., 2014). ACA also helps identify developing disciplines and the distribution of sub-domains, which promotes academic integration and communication. In this paper, we introduce the related work following the main steps of ACA (McCain, 1990), as shown in Figure 1.

Figure 1

Steps of the traditional ACA.

Citation: Data and Information Management 2, 2; 10.2478/dim-2018-0009

  1. Data collection: ACA reveals knowledge structures at a macro level by describing subfields or research groups and their interdependent relationships. Hence, it is important to specify the field to be analyzed (McCain, 1990). Researchers can then select the important monographs, research groups, academic journals, and conferences in several ways, for instance, by consulting specialists in the field, analyzing the content and influence of a carrier (White & McCain, 1998), snowballing (Cothill et al., 1989), or using a personal knowledge bank (Anegón et al., 1998).

  2. Constructing raw co-citation matrix: In traditional ACA, if the first-authored publications of two authors co-occur in the reference list of the same paper, their co-citation count increases by one. To increase the accuracy of ACA, Zhao and Strotmann (2006, 2008b, 2011) refined the matrix by taking all authors into consideration rather than only the first authors. Bu, Liu, and Huang (2016) added more metadata, such as venue and keyword information, to reconstruct the raw matrix. Moreover, there is a long-running debate on how to define “co-citation” on the main diagonal of the matrix (Eom, 2008).

  3. Transforming raw co-citation matrix into correlation co-citation matrix: The matrix must be transformed because further analyses require normalization (McCain, 1990). A common way is to use Pearson’s correlation coefficient. However, whether Pearson’s r should be used in ACA matrix transformation has been debated for several years, and the pros and cons of Pearson’s r compared with other metrics (e.g., Euclidean and Chi-square) have also been discussed. Mêgnigbêto (2013) summarized the major points of the debate in two questions: (1) whether Pearson’s correlation coefficient is suitable for ACA; and (2) whether it is better than other methods (such as cosine, Jaccard, Euclidean, and Chi-square distance). Considering several drawbacks of Pearson’s coefficient in ACA, we used cosine similarity to transform the co-citation matrix, as in one of our previous works (Bu et al., 2018).

  4. Data analysis, visualization, and result interpretation: Traditional ACA mainly uses cluster analysis, multidimensional scaling (MDS), and factor analysis to analyze and visualize the data. These methods complement each other. Recently, network analysis has become popular owing to improved visualization tools and technologies. Furthermore, MDS measurement, as distinct from MDS itself, calculates the separation and cohesion of the clustering to evaluate the effect of ACA (Bu et al., 2016). Reasonable and appropriate interpretations of the results, validated by peers, are also required.
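The transformation in step 3 can be illustrated with a minimal sketch of a cosine-similarity transform. It assumes that each author's profile is the corresponding row of the raw co-citation matrix; the function name `cosine_similarity_matrix` and the toy counts for three hypothetical authors are illustrative only and are not taken from the study's data.

```python
import math


def cosine_similarity_matrix(raw):
    """Return the matrix of cosine similarities between the rows of `raw`."""
    n = len(raw)
    norms = [math.sqrt(sum(v * v for v in row)) for row in raw]
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if norms[i] == 0 or norms[j] == 0:
                # An author without any co-citation has no defined profile;
                # leaving the similarity at 0.0 is an assumption of this sketch.
                continue
            dot = sum(a * b for a, b in zip(raw[i], raw[j]))
            sim[i][j] = dot / (norms[i] * norms[j])
    return sim


# Toy raw co-citation counts for three hypothetical authors (diagonal set to zero).
raw = [
    [0, 6, 10],
    [6, 0, 15],
    [10, 15, 0],
]
corr = cosine_similarity_matrix(raw)
```

Unlike Pearson's r, the cosine measure does not center the rows, so adding authors with zero co-citations to the matrix leaves the similarity between two existing profiles unchanged.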

2.2 Content-based Bibliometric Research

Content-based bibliometric research can be divided into two subtopics: syntactic and semantic citation context analyses. In syntactic citation context analysis, Voos and Dagaev (1976) first studied the distribution of citations by location and found that the introduction contains more citations than any other section. Similarly, Maričić et al. (1998) discussed which citation locations are meaningful and valuable. Nanba and Okumura (1999) used the citation area to generate summaries, extract relationships between papers, and classify reference types. In 2004, the term “citance” (Nakov et al., 2004) was proposed to denote sentences in the full text that contain citations. Through these refinements, the research unit of bibliometrics shifted from papers (articles) to sentences, which improved knowledge maps by dissecting the full-text context in detail. Nevertheless, most of these studies focused purely on the framework of papers and the structure of sentences and ignored their literal meaning, which makes their results less persuasive.

While syntactic citation context analysis is mainly quantitative, semantic citation context analysis is more qualitative. Garfield (1964), for instance, first identified 15 possible citation motivations to capture more detail about the relationship between citations and content. In addition, Chubin and Moitra (1975) examined the contributions of different types of citations by setting up a tree hierarchy that illustrates their relationships. Small (1978) studied the scientific content of citations by analyzing the text surrounding them and concluded that the referenced context serves as an interpretation of concepts or methods.

With the growth of empirical citation data, data-driven methods such as natural language processing and machine learning have become capable of handling large-scale content-based analysis. Teufel, Siddharthan, and Tidhar (2006), for example, applied a supervised machine learning algorithm to classify citation functions, labeled with four top-level types. Small (2011) analyzed the context surrounding citations to extract shallow and deep natural language features. Nevertheless, because corpora are difficult to obtain, the sample size of most current full-text citation analyses is limited, which means that only a small part of the citation contexts has been examined. Jeong et al. (2014) collected full-text journal papers and extracted citing sentences to calculate their similarity distances, thereby refining the result of traditional ACA. Kim, Jeong, and Song (2016) considered both citation content and proximity by extracting citation sentences and locations from full-text papers.

Furthermore, as entities commonly used in content-based analysis, keywords can effectively represent and capture the research fields of authors. Indeed, the distance between words can be used to measure the similarity of words, documents, and even authors. Thesaurus construction and corpus statistics are the two main approaches to quantifying word similarity. The most representative English thesaurus is WordNet (Miller, 1995), which combines traditional lexicographic information with modern computing. As for statistical methods, Jiang and Conrath (1997) proposed an approach for measuring semantic similarity based mainly on corpus statistics and lexical taxonomy. Recent advances in machine learning such as Word2Vec (Goldberg & Levy, 2014) are also effective ways to capture the “distance” between words. Turning to the more specific theme of co-citation, Bu et al. (2016) incorporated keyword information of references into traditional ACA to refine previous results; methodologically, however, their method only considers whether two keywords are the same (i.e., a binary judgment) and does not capture their latent relationships. In this paper, we constructed an “information thesaurus network” based on WordNet and the Oxford English Dictionary (Simpson & Weiner, 1989) to represent the relationship between two keywords of references and thereby improve keyword-based co-citation analysis.

2.3 Weighted Citation Analysis

Studies of weighted citation analysis can be divided into three levels: journals, documents, and authors. At the journal level, Garfield (1955) first proposed citation counts to measure journal impact by counting the citations of the average paper published in a journal. Citation counts are also regarded as an important measure of an author’s impact, obtained by aggregating the impact indicators of all of the author’s papers. Pinski and Narin (1976) argued that citations from prestigious journals should be weighed higher than those from peripheral journals. At the document level, Small (1973) introduced document co-citation analysis (DCA) and drew more attention to document-weighted citation research. Furthermore, Yan and Ding (2010) proposed three factors that define paper status: the number of citations a paper has received, the prestige of the citing journals, and the citation time interval. At the author level, Yan and Ding (2010) concluded that methods of distributing weights among authors fall into three types: straight counting, where only the first author’s contribution is acknowledged; unit counting, where each coauthor’s contribution is counted equally; and adjusted counting, where each coauthor’s contribution is divided based on the number of coauthors. In addition, the time of each citing paper matters greatly for evaluating citation impact (Yin & Wang, 2017; Wang, Song, & Barabási, 2013). The Discounted Cumulated Impact (DCI) index, for example, uses a decay parameter that devalues old citations to account for the influence of time, and allows citations to be weighed by the citation weight of the citing publication (Järvelin & Persson, 2008).
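The three author-level counting schemes described above can be contrasted in a small sketch. The function name `author_weights` and the scheme labels are hypothetical, and the equal split used for adjusted counting is an assumption; other fractional schemes are possible.

```python
def author_weights(authors, scheme):
    """Distribute the credit of one citation over an ordered list of coauthors."""
    if scheme == "straight":
        # Only the first author is credited.
        return {a: (1.0 if k == 0 else 0.0) for k, a in enumerate(authors)}
    if scheme == "unit":
        # Every coauthor receives full credit.
        return {a: 1.0 for a in authors}
    if scheme == "adjusted":
        # Credit is split by the number of coauthors (equal split assumed here).
        return {a: 1.0 / len(authors) for a in authors}
    raise ValueError(f"unknown scheme: {scheme}")


coauthors = ["First", "Second", "Third"]  # placeholder names
straight = author_weights(coauthors, "straight")
unit = author_weights(coauthors, "unit")
adjusted = author_weights(coauthors, "adjusted")
```

Under straight counting the total credit per citation is one, under unit counting it grows with the number of coauthors, and under adjusted counting it is again one but shared among all coauthors.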

Overall, we argue that ACA should weigh authors using more metadata, since it aims to capture knowledge domain maps. For instance, the publication time of references, venues, and keywords have been combined into the raw author co-citation matrix with different weights to make traditional ACA more informative (Bu et al., 2016). However, studies of weighted citation analysis at the author level have focused only on the authors themselves and ignored the actual citation counts of documents. To solve this problem, we borrow the idea of document co-citation analysis (DCA) proposed by Small (1973), so that different numbers of citations to the same author receive different weights. Changing the counting method in this way improves the author co-citation matrix and increases the amount of information in the input of traditional ACA.

3 Problem Statement

Traditional ACA counts the co-cited frequency of author pairs (White & Griffith, 1981). However, if two or more papers of authors A and B are co-cited by the same paper, the co-cited count of A and B is still only one under the definition of traditional ACA. Such uninformative input limits the ability to distinguish co-cited authors. We therefore refined the granularity of analysis from co-cited authors to co-cited documents (papers). For example, in the paper by Kim et al. (2016), “Content- and proximity-based author co-citation analysis using citation sentences”, Zhao (Zhao & Strotmann, 2008a, 2008b, 2011, 2014; Zhao, 2006) was cited five times while Shen (Shen et al., 2014) was cited only once, so weighing them equally seems inaccurate. In fact, both Kim and Zhao focus on bibliometrics, while Shen concentrates on cancer studies. From this perspective, counting citations at the document level can distinguish authors more accurately. Consequently, the current paper proposes a document-based method for ACA that makes the weights of different authors more realistic.

Moreover, the keywords of references carry information on the field dimension: authors whose references use similar keywords tend to work on similar issues. For example, Kim et al. (2016) cited papers written separately by Shen et al. (2014), Eom (1996), and Eto (2013); in traditional ACA, every pair among these three authors would receive the same co-citation weight. However, the keywords used in these three references (Table 1) tell a different story. Specifically, Shen et al. (2014) adopted “circulating microRNAs”, “breast cancer”, “biomarkers”, and “detection” as keywords, which shows that Shen concentrated on cancer studies. Eom (1996) used the keywords “decision support systems”, “intellectual structure”, “bibliometrics”, “cocitation analysis”, and “factor analysis”, while Eto (2013) selected “citation searching”, “co-citation”, “co-citation context”, and “similar document search”; intuitively, these two authors both pay more attention to bibliometrics and scientometrics. Consequently, the keywords of references need to be taken into consideration and integrated with the co-citation parameter to improve the results of author co-citation analysis.

Table 1

Keyword list of Shen et al. (2014), Eom (1996), and Eto (2013) co-cited by Kim et al. (2016).

| First author | Keywords |
| --- | --- |
| J. Shen | Circulating microRNAs; breast cancer; biomarkers; detection |
| S. B. Eom | Decision support systems; Intellectual structure; Bibliometrics; cocitation analysis; factor analysis |
| M. Eto | Citation searching; co-citation; co-citation context; similar document search |

4 Methodology

4.1 Data

In this paper, we collected the articles published in the Journal of the Association for Information Science and Technology (JASIST; named the Journal of the American Society for Information Science and Technology before 2014) from 2012 to 2015. We downloaded all general descriptive metadata of the papers and their citations, including title, author(s), publication time, publication venue, volume and issue, keywords, and number of pages. The dataset contains 866 papers and 38,910 references. For simplicity, we use only the first-author information of the papers, and author names were disambiguated and filtered manually. To avoid sparseness of the matrix, we select the 100 authors with the highest number of citations for the subsequent analysis. The main diagonal values of the co-citation matrices are set to zero. Network analysis is then performed with Gephi (Bastian, Heymann, & Jacomy, 2009) to show the performance of the proposed methods.

Table 2 lists the 20 most cited authors, whose research spans various areas of information science. For example, Dr. Leydesdorff plays a significant role in informetrics (Leydesdorff & Etzkowitz, 1996), while the influence of Dr. Belkin on information seeking behavior and interactive information retrieval cannot be overstated (Belkin, 1978).

Table 2

The 20 most cited authors in our dataset.

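| Rank | Author name | Rank | Author name |
| --- | --- | --- | --- |
| 1 | L. Leydesdorff | 11 | C. Kuhlthau |
| 2 | L. Bornmann | 12 | B. Hjørland |
| 3 | C. Vining | 13 | H. Small |
| 4 | L. Egghe | 14 | A. Spink |
| 5 | R. Garfield | 15 | T. Saracevic |
| 6 | M. Thelwall | 16 | P. Vakkari |
| 7 | W. Glänzel | 17 | V. Savolainen |
| 8 | L. Waltman | 18 | N. Belkin |
| 9 | H. Moed | 19 | J. Hirsch |
| 10 | B. J. Jensen | 20 | B. Dervin |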

4.2 The Framework of the Proposed DKACA

Our newly proposed DKACA combines a document unit counting method with semantic similarity measurement, as described step by step in Figure 2. The gray blocks in Figure 2 highlight the differences between our method and traditional ACA. DKACA consists of two parts, DACA (document-based ACA) and KACA (keyword-based ACA); the former incorporates document-level information into ACA, while the latter incorporates keyword-level information. Specifically, our method first selects data from JASIST and disambiguates the authors’ names manually. In the DACA branch, we process the metadata of references and build a citation network, after which we count authors’ co-citation frequencies based on the co-cited paper pairs and adjust the citation network by modifying the weight between two given authors. In the KACA branch, we follow a four-step procedure:

Figure 2

The framework of keyword-based ACA.

  1. select high-frequency keywords and delete stop words;

  2. “standardize” keywords (e.g., both verbs and nouns are standardized to nouns) and construct the keyword-based thesaurus network;

  3. find the shortest path between words and quantify their similarity; and

  4. calculate the author and paper similarity based on the similarity of keywords.

We also combine the parameters derived by KACA and DACA and transform the raw matrix into a correlation matrix by cosine similarity. Further data analysis and visualization are then implemented. The following sections detail the processes of DACA and KACA and how they are combined into DKACA.

4.3 Document-based Author Co-citation Analysis (DACA)

In traditional ACA, authors cited different numbers of times are treated equally, which ignores the differing impact of each scholar. The proposed method therefore assigns different weights to co-cited author pairs by counting co-citations based on documents rather than authors. Specifically, when first-authored papers of two authors are co-cited and the co-cited paper set contains many papers of one author but only one paper of the other, the new algorithm lets the co-citation frequency (count) of the two authors reflect these numbers instead of fixing it at one.

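Suppose that $\Gamma$ is the obtained paper set and $\psi$ is the set of cited authors of $\Gamma$. The size of $\Gamma$ is $n$, while that of $\psi$ is $I$. Given that paper $P_l \in \Gamma$ cites papers $D_1, D_2, \ldots, D_m$ of author $A_i \in \psi$ and papers $E_1, E_2, \ldots, E_n$ of author $A_j \in \psi$, the co-cited count $S^{fc}_{A_i A_j}$ of $A_i$ and $A_j$ in the set is:

$$S^{fc}_{A_i A_j} = \sum_{P_l \in \Gamma} m \times n$$

where, for each citing paper $P_l$, $m$ and $n$ are the numbers of cited papers of $A_i$ and $A_j$, respectively.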


For example, suppose that one paper by author A co-cites two papers written by author B (B1 and B2), three papers written by author C (C1, C2, and C3), and five papers written by author D (D1, D2, D3, D4, and D5). According to the definition of traditional ACA, the raw co-citation matrix is calculated as summarized in Table 3.

Table 3

An example of raw co-citation matrix under traditional ACA.


As Table 3 shows, the co-cited counts of the author pairs do not differ, because the distinct co-cited papers are ignored. However, the relationship between authors C and D should be recognized as much closer than that between B and C. To solve this problem, our proposed DACA method counts six co-citations of B and C: (B1, C1), (B1, C2), (B1, C3), (B2, C1), (B2, C2), and (B2, C3). The raw co-citation matrix obtained with document-based counting is summarized in Table 4.

Table 4

An example of raw co-citation matrix under document-based counting method.
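The counts in the example above can be reproduced with a short sketch that contrasts author-based and document-based counting. The function `cocitation_counts`, its input layout (one list of `(first_author, paper_id)` tuples per citing paper), and the flag `document_based` are illustrative assumptions rather than the implementation used in this study.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cocitation_counts(reference_lists, document_based=True):
    """Count co-citations of author pairs over a collection of citing papers.

    reference_lists: one list per citing paper, each holding
    (first_author, paper_id) tuples for the cited references.
    """
    counts = Counter()
    for refs in reference_lists:
        papers_by_author = defaultdict(set)
        for author, paper_id in refs:
            papers_by_author[author].add(paper_id)
        for a, b in combinations(sorted(papers_by_author), 2):
            if document_based:
                # Every pair of co-cited documents counts once.
                counts[(a, b)] += len(papers_by_author[a]) * len(papers_by_author[b])
            else:
                # Traditional ACA: one count per citing paper, whatever the
                # number of cited documents.
                counts[(a, b)] += 1
    return counts


# The reference list of the paper by author A in the example.
refs_of_a = (
    [("B", f"B{k}") for k in range(1, 3)]
    + [("C", f"C{k}") for k in range(1, 4)]
    + [("D", f"D{k}") for k in range(1, 6)]
)
traditional = cocitation_counts([refs_of_a], document_based=False)
daca = cocitation_counts([refs_of_a], document_based=True)
```

With this input, traditional counting gives every pair a count of one, whereas document-based counting gives 6 for (B, C), 10 for (B, D), and 15 for (C, D), so the pair (C, D) is weighted most heavily.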


4.4 Keyword-based Author Co-citation Analysis (KACA)

As significant labels of papers, keywords explicitly indicate papers’ research domains. Furthermore, the similarity between the keywords of two papers indicates, to some degree, how similar the papers, and even their authors, are. It is therefore essential to find an appropriate method to quantify the similarity between two keywords. To this end, we construct a keyword-based thesaurus according to the ontology of the knowledge domain, which depicts the relationships between keywords clearly.

We developed the map structure of the thesaurus based on WordNet and the Oxford English Dictionary. Three types of relationships are defined in our thesaurus. First, synonyms are words with the same meaning that can easily replace one another, such as “earphone” and “headphone”. Second, hierarchical relationships link two entities in different “layers” (detailed in the next paragraph). Third, related terms are keywords that can be referred to from each other but do not have the same meaning.

The hierarchical relationships of a thesaurus are represented as a tree structure, which yields an inheritance relationship from the “root” to the “leaves”. This inheritance relationship falls into three categories: non-coexistent relationship, whole–part relationship, and exemplification. A non-coexistent relationship links an object to one of its specific types, as with “shoes” and “boots”, where “boots” are a type of “shoes”. A whole–part relationship links an object to one of its parts; for example, a “finger” is part of a “hand”, so the relationship between “finger” and “hand” is a whole–part relationship. The third category, exemplification, links an object to a real instance of it, as with “search engine” and “Google”. Taking “computer” as an example, its relevant words and detailed explanations are listed in Table 5.

Table 5

Three types of hierarchical relationships.

| Descriptors | Relationships | Corresponding descriptors |
| --- | --- | --- |
| Computer | Whole–part | Keyboard, Screen |
| Computer | Exemplification | Dell, Lenovo |

We initialize the word relationship map from WordNet and then refine its structure based on these three types of relationships. Each child node is stored as a linked list. In the map, arrows link synonymous keywords, dotted lines link related terms, and solid lines link keywords with hierarchical relationships.

Having constructed the keyword-based thesaurus qualitatively, we also measured the weights of its relationships quantitatively. To increase reliability, two research assistants with specialized knowledge were asked to construct the keyword thesaurus separately and then discuss their divergent views. Figure 3 shows an example of the keyword-based thesaurus network, in which different types of ties between keywords indicate distinct relationships.

Figure 3

An example of the information thesaurus network.

Furthermore, by assigning each type of relationship a constant weight and adding up the weights of the steps along a path, we quantify the relevance between two keywords according to their path in the thesaurus. To simplify the model, all relationship types are given the same value, as summarized in Table 6. The total weight between two papers and, in turn, between two authors can then be derived from the weights between their keywords.

Table 6

The value of relationships between words.

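| Word relationship | Symbol | Given value |
| --- | --- | --- |
| Synonyms (USE/UF) | $K_{UseUf}$ | 1 |
| Related terms (RT) | $K_{RT}$ | 1 |
| Genus/species relationship | $K_{NTGBTG}$ | 1 |
| Part/whole relationship | $K_{NTPBTP}$ | 1 |
| Instance relationship | $K_{NTIBTI}$ | 1 |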
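One possible way to hold such a thesaurus in memory is a weighted graph, sketched below. The relation codes, the helper `build_thesaurus`, and the choice to treat every relationship as an undirected edge are assumptions made for illustration; the example terms are those mentioned in the text and in Table 5.

```python
# Weight per relationship type, all set to 1 as in Table 6.
# The slash-separated codes are a notation chosen for this sketch.
RELATION_WEIGHTS = {
    "USE/UF": 1,   # synonyms
    "RT": 1,       # related terms
    "NTG/BTG": 1,  # genus/species
    "NTP/BTP": 1,  # part/whole
    "NTI/BTI": 1,  # instance
}


def build_thesaurus(triples, weights=RELATION_WEIGHTS):
    """Build an undirected weighted graph {term: {neighbor: weight}}."""
    graph = {}
    for term_a, relation, term_b in triples:
        w = weights[relation]
        graph.setdefault(term_a, {})
        graph.setdefault(term_b, {})
        # If two terms are linked by several relations, keep the smallest weight.
        graph[term_a][term_b] = min(w, graph[term_a].get(term_b, w))
        graph[term_b][term_a] = min(w, graph[term_b].get(term_a, w))
    return graph


triples = [
    ("computer", "NTP/BTP", "keyboard"),
    ("computer", "NTP/BTP", "screen"),
    ("computer", "NTI/BTI", "Dell"),
    ("computer", "NTI/BTI", "Lenovo"),
    ("earphone", "USE/UF", "headphone"),
    ("shoes", "NTG/BTG", "boots"),
]
thesaurus = build_thesaurus(triples)
```

Keeping the weights in a separate table makes it easy to experiment later with relationship types that are not all valued equally.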

In detail, to quantify the relevance between the keywords of two papers, we use Dijkstra’s algorithm (Dijkstra, 1959) to find the shortest distance between the nodes representing two keywords in the keyword-based thesaurus, which is essentially a network-structured tree. We then calculate the relevance degree WordsDistance between the two keywords by summing over the steps of the path found, where $K_{Step}$ represents the weight of each step. In general, the longer the distance, the less relevant the two words are.

$$\mathit{WordsDistance} = \sum_{\mathit{Step}} K_{\mathit{Step}}$$


For instance, consider “Baidu” and “tag”. If the weights of the five relationships are all the same, e.g., set to one, the length of their shortest path is five according to Dijkstra’s algorithm (red lines in Figure 4 indicate the shortest path).

Figure 4

An example of using Dijkstra’s algorithm.

Following Dijkstra (1959), we can easily obtain the WordsDistance between the descriptors “Baidu” and “tag”. The larger WordsDistance is, the weaker the relationship between the two words. Consequently, the weight of the relationship, WordsSimilarity, is taken as the reciprocal of WordsDistance:

$$\mathit{WordsSimilarity} = \frac{1}{\mathit{WordsDistance}}$$


In the above-mentioned case, the keyword similarity between “Baidu” and “tag” can be calculated as follows:

$$\mathit{WordsSimilarity}(\text{Baidu}, \text{tag}) = \frac{1}{5} = 0.2$$
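The calculation can be sketched with a plain implementation of Dijkstra's algorithm over a toy graph. The intermediate terms between “Baidu” and “tag” are invented for illustration and do not reproduce the network in Figure 4; all edges have weight one, as in Table 6. Treating identical keywords as having similarity 1.0 and unconnected keywords as having similarity 0.0 are further assumptions of this sketch.

```python
import heapq


def words_distance(graph, source, target):
    """Length of the shortest weighted path between two keywords (Dijkstra)."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, weight in graph.get(node, {}).items():
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return float("inf")  # the two keywords are not connected


def words_similarity(graph, a, b):
    """Reciprocal of the shortest distance between two keywords."""
    d = words_distance(graph, a, b)
    if d == 0:
        return 1.0  # identical keywords (assumption of this sketch)
    if d == float("inf"):
        return 0.0  # unconnected keywords (assumption of this sketch)
    return 1.0 / d


# Hypothetical thesaurus fragment: a five-step path and a six-step detour.
edges = [
    ("Baidu", "search engine"),
    ("search engine", "information retrieval"),
    ("information retrieval", "indexing"),
    ("indexing", "metadata"),
    ("metadata", "tag"),
    ("search engine", "web"),
    ("web", "social media"),
    ("social media", "folksonomy"),
    ("folksonomy", "annotation"),
    ("annotation", "tag"),
]
graph = {}
for a, b in edges:
    graph.setdefault(a, {})[b] = 1
    graph.setdefault(b, {})[a] = 1

distance = words_distance(graph, "Baidu", "tag")
similarity = words_similarity(graph, "Baidu", "tag")
```

In this toy graph the shortest path has five steps, so the similarity is 1/5, matching the worked example.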


Suppose that the number of keywords of $Paper_i$ is $I$, while that of $Paper_j$ is $J$; there would be IJ2 pairs of relationship. Thus, the PaperSimilarity (the relationship value of two papers) between papers $P_i$ and $P_j$ could be:


AuthorSimilarity (the similarity of two authors) depends on the PaperSimilarity of their co-cited papers and can be calculated using the following formula (supposing that the number of papers of $Author_i$ is $M$, while that of $Author_j$ is $N$):
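As a rough illustration of these two aggregation steps, the sketch below averages word similarities over all keyword pairs of two papers and then averages paper similarities over all paper pairs of two authors. Plain averaging is an assumption of this sketch and may differ from the normalization used in the formulas; the function names and the toy similarity table are hypothetical.

```python
from itertools import product


def paper_similarity(keywords_i, keywords_j, word_sim):
    """Mean word similarity over all keyword pairs of two papers (assumed form)."""
    pairs = list(product(keywords_i, keywords_j))
    if not pairs:
        return 0.0
    return sum(word_sim(a, b) for a, b in pairs) / len(pairs)


def author_similarity(papers_i, papers_j, word_sim):
    """Mean paper similarity over all paper pairs of two authors (assumed form).

    Each paper is represented by its list of keywords.
    """
    pairs = list(product(papers_i, papers_j))
    if not pairs:
        return 0.0
    return sum(paper_similarity(p, q, word_sim) for p, q in pairs) / len(pairs)


# Toy word similarities; a real run would use thesaurus-based values instead.
TOY_TABLE = {frozenset(("citation", "co-citation")): 0.5}


def toy_word_sim(a, b):
    if a == b:
        return 1.0
    return TOY_TABLE.get(frozenset((a, b)), 0.0)


p_sim = paper_similarity(["citation", "co-citation"], ["citation"], toy_word_sim)
a_sim = author_similarity(
    [["citation", "co-citation"]],
    [["citation"], ["breast cancer"]],
    toy_word_sim,
)
```

Passing the word-level similarity in as a function keeps the aggregation independent of how keyword distances are obtained.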


4.5 Combination between Document- and Keyword-based Author Co-citation Analysis

The document-based counting matrix and the keyword-based ACA matrix need to be standardized before they can be integrated. Let the raw matrices of traditional ACA, document-based ACA, and keyword-based ACA be $R_a$, $R_d$, and $R_k$, respectively. The normalized matrices of traditional ACA, the document-based counting method, and keyword-based ACA, denoted $N_a$, $N_d$, and $N_k$, respectively, are:

$$N_a = \frac{R_a}{\mathrm{Max}(R_a)}$$

$$N_d = \frac{R_d}{\mathrm{Max}(R_d)}$$

$$N_k = \frac{R_k}{\mathrm{Max}(R_k)}$$




where Max(·) is the maximum element of the given matrix. Normalization limits the elements of each matrix to [0, 1], which facilitates the later integration and weight assignment. Eq. 9 shows how to adopt different weights to combine them:

$$N = w_a N_a + w_d N_d + w_k N_k \tag{9}$$


Here, $w_a$, $w_d$, and $w_k$ are the weights of the normalized traditional ACA matrix, the normalized document-based counting matrix, and the normalized keyword-based ACA matrix, respectively. If $w_d = w_k = 0$, the method reduces to traditional ACA. When $w_k = 0$ but $w_a$ and $w_d$ are nonzero, document counting is incorporated into ACA but keyword relationships are not; we call this method document-based ACA (DACA). Only when $w_d w_k \neq 0$ is the method DKACA.
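The normalization and weighting can be sketched as follows, assuming max-normalization of each raw matrix and a weighted sum of the normalized matrices. The toy matrices and the weights 0.2, 0.4, and 0.4 are placeholders chosen for illustration and are not the values used in this study.

```python
def normalize(matrix):
    """Divide every element by the largest element of the matrix."""
    peak = max(max(row) for row in matrix)
    if peak == 0:
        return [[0.0 for _ in row] for row in matrix]
    return [[v / peak for v in row] for row in matrix]


def combine(r_a, r_d, r_k, w_a, w_d, w_k):
    """Weighted sum of the normalized ACA, document-based and keyword-based matrices."""
    n_a, n_d, n_k = normalize(r_a), normalize(r_d), normalize(r_k)
    size = len(r_a)
    return [
        [
            w_a * n_a[i][j] + w_d * n_d[i][j] + w_k * n_k[i][j]
            for j in range(size)
        ]
        for i in range(size)
    ]


# Toy inputs for three authors (diagonals set to zero).
r_a = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]                       # traditional counts
r_d = [[0, 6, 10], [6, 0, 15], [10, 15, 0]]                   # document-based counts
r_k = [[0.0, 0.2, 0.5], [0.2, 0.0, 1.0], [0.5, 1.0, 0.0]]     # keyword similarities

traditional_only = combine(r_a, r_d, r_k, 1.0, 0.0, 0.0)  # w_d = w_k = 0
combined = combine(r_a, r_d, r_k, 0.2, 0.4, 0.4)          # all three sources
```

Setting `w_d` and `w_k` to zero returns the normalized traditional matrix, which mirrors the reduction to traditional ACA described above.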

5 Results and Discussion

5.1 Network Analysis

To explore the effectiveness of DKACA, we map a two-dimensional graph (network). Figures 5–7 show the network visualization results of ACA, DACA, and DKACA, respectively. The weight values of ACA, DACA, and DKACA (Table 7) were optimized through repeated experiments. In Figures 5–7, each node represents an author, and node size is proportional to the node’s weighted degree. Node colors are determined by Newman’s (2006) modularity algorithm: nodes of the same color belong to the same group and share similar research interests, while nodes of different colors have distinct research interests. The distance between two nodes reflects their cross-value in the input matrix; a shorter distance indicates a higher cross-value, and vice versa. We combined two layout algorithms, Yifan Hu (Hu, 2007) and ForceAtlas2 (Jacomy et al., 2014), to produce the visualizations.
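Two of the quantities mentioned above, the weighted degree that drives node size and the modularity of a grouping, can be computed directly from a symmetric weight matrix. The sketch below evaluates the standard modularity score Q for a partition that is already given; it does not search for the best partition, which is what a community detection tool does, and the four-node network is a toy example.

```python
def weighted_degrees(matrix):
    """Weighted degree of each node: the sum of its edge weights."""
    return [sum(row) for row in matrix]


def modularity(matrix, groups):
    """Modularity Q of a given partition of a weighted, undirected network.

    matrix: symmetric weight matrix with a zero diagonal.
    groups: groups[i] is the group label of node i.
    """
    degrees = weighted_degrees(matrix)
    two_m = sum(degrees)  # twice the total edge weight
    if two_m == 0:
        return 0.0
    q = 0.0
    n = len(matrix)
    for i in range(n):
        for j in range(n):
            if groups[i] == groups[j]:
                q += matrix[i][j] - degrees[i] * degrees[j] / two_m
    return q / two_m


# Toy network: two pairs of authors, each pair linked by one edge.
toy = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
q_split = modularity(toy, [0, 0, 1, 1])   # each pair in its own group
q_merged = modularity(toy, [0, 0, 0, 0])  # everything in one group
```

A grouping that follows the actual link structure scores higher than one that lumps all authors together, which is the intuition behind coloring nodes by modularity class.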

Figure 5. ACA visualization.

Citation: Data and Information Management 2, 2; 10.2478/dim-2018-0009

Table 7. Weight values of the abovementioned methods.

Document-based ACA (DACA)
Document- and keyword-based ACA (DKACA)

All authors, represented by the nodes, are divided into three categories in Figures 5 and 6 and into four categories in Figure 7. After collecting the authors’ research domains from their homepages, we manually interpreted what each cluster (a given color of nodes) means and which subdiscipline of Library and Information Science (LIS) it belongs to. Authors shown as orange nodes focus on information retrieval and information-seeking behavior. Green nodes represent informetrics and scientometrics. Purple nodes represent authors working in computer science, engineering, and network-based informatics. The pink nodes in Figure 7 correspond to applied behavioral science and business-related information science.

Figure 6. DACA visualization.

Figure 7. DKACA visualization.

Figures 5–7 show that nodes of the same color move closer together as document and keyword information is involved; that is, authors in similar domains are closer to each other in the maps of Figures 6 and 7 than in Figure 5. For example, H. Small, L. Waltman, and H. F. Moed are all scholars in the fields of bibliometrics and scientometrics, and the nodes representing them stay close to each other in all three figures, moving closer still when our proposed methods are used. A. L. Barabási and J. E. Hirsch are another example: both have a background in physics or network science and a strong interest in scientific productivity research. Dr. Barabási, for instance, has done much research on how a scholar succeeds in a scientific career and how to measure that success (Jeong, Neda, & Barabási, 2003). Some of his papers on the importance of scientific collaboration and social networks (Barabási et al., 2015) are cited by many informetric studies. Meanwhile, Dr. Hirsch created and improved the h-index (Hirsch, 2005, 2008), which considers an author’s productivity and citation count simultaneously and aims to reflect a scholar’s academic achievements. Dr. Hirsch’s papers proposing and promoting the h-index (Hirsch, 2010) contribute to his co-citation relationship with other authors in the scientific evaluation and informetric area (Raan, 2005). Therefore, from the perspective of the science of science and scientific success, the research areas of Drs. Barabási and Hirsch are similar, as both have had close relationships with, and made great contributions to, bibliometrics and the science of science; hence, it makes sense that the nodes representing them move closer together in Figures 6 and 7.

Moreover, nodes in different colors (i.e., scholars in different areas) move farther apart in the visualizations shown in Figures 6 and 7, so the distance between authors in different categories is more recognizable. For example, B. Dervin, J. E. Hirsch, and M. Thelwall are researchers in different fields (Table 8). However, because of their great contributions to informetrics, they have been co-cited many times, which places them together in the purple group in the visualization generated by the traditional ACA method. We refined traditional ACA by combining it with document information and keywords from cited papers. As shown in Figures 5–7, the nodes representing the three authors then move farther apart and finally take three different colors. This outcome is consistent with the real situation.

Table 8. Three authors’ research areas.

Author | Domains
B. Dervin | Communication, library science, information science
J. E. Hirsch | Physics, statistics, science of science
M. Thelwall | Webometrics, altmetrics, sentiment analysis

However, we also found some unexpected phenomena. For example, the research interests of Dr. E. Garfield include information retrieval. As a result, the node representing him should be a purple one. However, this node is far away from the other purple nodes in all the figures. Our interpretation is that Dr. Garfield tended to do more quantitative research, although he was a very important scientist in information retrieval; in addition, as one of the founders of scientometrics, his papers were often co-cited with papers about informetrics. As a result, he is clustered into the informetrics field in these figures.

As more metadata are involved, the maps also reveal how an author’s research interests change and develop over time. For example, Dr. D. Zhao is represented as a purple node in traditional ACA (Figure 5), but the node changes to orange in DACA and DKACA (Figures 6 and 7). Indeed, her research gradually focused on informetrics and scientometrics after 2003. Moreover, her publications from 2003 onward are better reflected by the latter two methods, especially her research on all-author co-citation analysis and first-author co-citation analysis (Zhao, 2006; Zhao & Strotmann, 2008b, 2011). In addition, her earlier knowledge representation research on XML and the semantic web (Zhao & Logan, 2002) is more relevant to ontology than to pure scientometrics. Although such research is also relevant to “networks”, it differs considerably from computer science and “network science”. Being able to visualize how the color of an author’s node changes under different algorithms is a useful advance. Many previous studies used multidimensional scaling (MDS) with SPSS, in which the authors’ assigned areas (i.e., the colors of their corresponding nodes) are fixed. As a result, such maps cannot reflect an author’s “switching” between areas, that is, the general change in research areas and interests. The dynamic classification pattern, by contrast, has a strong capacity for mapping the inner connections between different authors.

5.2 MDS Measurement Analysis

In the previous section, we analyzed and discussed the network analysis results qualitatively. In this section, we evaluate the abovementioned methods from a quantitative perspective using MDS measurement (which is different from MDS) (Bu, Ni, & Huang, 2017). It is an important and useful method for assessing a clustering result in a multidimensional graph. The MDS measurement value, denoted σ, is influenced by two factors, c and S, representing cohesion and separation, respectively. Higher cohesion (larger c) within the same category and higher separation (smaller S) between different categories indicate a better clustering result. The MDS measurement results of the abovementioned methods are summarized in Table 9. We found that σ(DKACA) < σ(DACA) < σ(ACA); that is, σ becomes smaller as more factors are involved in the experiments. This implies that nodes in the same category are closer and those in different categories are more separated from each other, especially when more bibliometric entities are involved. This result confirms the findings in the “Network Analysis” section. To sum up, the DACA and DKACA methods perform better than traditional ACA in terms of clustering when we map knowledge domains.
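To make the ideas of cohesion and separation concrete, the sketch below compares two toy layouts using a generic ratio of mean within-cluster distance to mean between-cluster distance. This ratio is a stand-in chosen for illustration only: it is not the σ defined by Bu, Ni, and Huang (2017), whose exact formula is not reproduced here, and the function names and coordinates are hypothetical. Like σ, a smaller value of this stand-in indicates tighter, better-separated clusters.

```python
import math


def mean_pair_distance(points, labels, same_cluster):
    """Average Euclidean distance over point pairs.

    same_cluster=True averages pairs that share a label (cohesion side);
    same_cluster=False averages pairs with different labels (separation side).
    """
    total, count = 0.0, 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if (labels[i] == labels[j]) == same_cluster:
                total += math.dist(points[i], points[j])
                count += 1
    return total / count if count else 0.0


def compactness_ratio(points, labels):
    """Within-cluster mean distance divided by between-cluster mean distance.

    A generic stand-in, NOT the sigma of Bu, Ni, & Huang (2017).
    """
    within = mean_pair_distance(points, labels, same_cluster=True)
    between = mean_pair_distance(points, labels, same_cluster=False)
    if between == 0:
        raise ValueError("need at least two separated clusters")
    return within / between


# Two toy layouts of the same four authors in two clusters.
labels = ["a", "a", "b", "b"]
loose_layout = [(0.0, 0.0), (2.0, 0.0), (3.0, 0.0), (5.0, 0.0)]
tight_layout = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]

loose_ratio = compactness_ratio(loose_layout, labels)
tight_ratio = compactness_ratio(tight_layout, labels)
```

In the tighter layout, authors in the same cluster sit closer together and the two clusters sit farther apart, so the ratio drops, which mirrors the qualitative reading of a smaller σ in the text.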

Table 9. MDS measurement result of the methods.


6 Conclusion

We used the metadata of scientific articles published in JASIST as the dataset in this paper and proposed document- and keyword-based author co-citation analysis (DKACA) methods. Compared with traditional ACA, the major innovation of these approaches is to construct the raw co-citation matrix by (1) counting co-cited documents instead of co-cited authors and (2) taking the keywords of references into consideration. The empirical results show that our proposed methods offer a better understanding than the traditional ACA method, hence improving the visualization of knowledge mapping. Moreover, they enhance the performance of ACA, because the depiction of scientific domain mapping is more accurate.

The main contributions of the proposed document- and keyword-based ACA are as follows: (1) by adding more information to author co-citation analysis, it supports more detailed and nuanced analyses of the dataset; (2) document-based calculation refines the original algorithm by using documents rather than authors as the counting unit, thereby making the clustering result more accurate; and (3) keyword-based ACA demonstrates the analysis of knowledge domains well.

Besides the methodological discussion, we want to highlight that our proposed approaches provide a keyword similarity calculation method from a semantic perspective and can facilitate information recommendation in search engines. In addition, the proposed methods build a bridge between document co-citation analysis (DCA) and ACA, thereby considering document and author information simultaneously. Beyond ACA itself, the document- and keyword-based method can also be applied to bibliographic coupling, co-authorship, co-occurrence, and topical network analysis.

Overall, we have changed the method of counting co-citations and have combined the co-citation matrix with keywords, thus making the raw co-citation matrix more informative. In the future, we will study ACA combined with topic modeling methods, such as Latent Dirichlet Allocation (LDA) and the Author-Conference-Topic model, an extended LDA algorithm (Tang, Jin, & Zhang, 2008), and with network analysis methods such as Node2Vec (Grover & Leskovec, 2016), which can be analyzed with methods similar to those used in this paper. Moreover, this paper has utilized only the first author to analyze knowledge domains and has ignored other authors, which is likely to introduce some bias into this research; in the future, we plan to consider all authors’ information, as in Zhao (2006), and we believe that the results can be further improved.


We would like to thank the research assistants in our group, Yuchen Wang, Wenyi Shang, and Bo He, for their work in data collection and preprocessing. We are also grateful to two anonymous reviewers and Dr. Xiaojuan Zhang, the executive editor-in-chief of Data and Information Management for providing meaningful suggestions and help on the improvement of the manuscript.


Anegón, F. D. M., Contreras, E. J., & Corrochano, M. (1998). Research fronts in library and information science in Spain (1985-1994). Scientometrics, 42(2), 229–246.

Barabási, A.-L., Jeong, H., Néda, Z., Ravasz, E., & Schubert, A. (2015). Evolution of the social network of scientific collaborations. Veterinary Surgery, 6(2), 66–70.

Bastian, M., Heymann, S., & Jacomy, M. (2009). Gephi: An Open Source Software for Exploring and Manipulating Networks. In Proceedings of the Third International ICWSM Conference, (pp.361-362), May 17-20, 2009, San Jose, California, U.S.A.

Belkin, N. J. (1978). Information concepts for information science. Journal of Documentation, 34(1), 55–85.

Bruer, J. T. (2010). Can we talk? How the cognitive neuroscience of attention emerged from neurobiology and psychology, 1980-2005. Scientometrics, 83(3), 751–764.

Bu, Y., Liu, T., & Huang, W.-B. (2016). MACA: A modified author cocitation analysis method combined with general descriptive metadata of citations. Scientometrics, 108(1), 143–166.

Bu, Y., Ni, S., & Huang, W.-B. (2017). Combining multiple scholarly relationships with author cocitation analysis: A preliminary exploration on improving knowledge domain mappings. Journal of Informetrics, 11(3), 810–822.

Bu, Y., Wang, B., Huang, W.-B., Che, S., & Huang, Y. (2018). Using the appearance of citations in full text on author co-citation analysis. Scientometrics, 116(1), 275–289.

Charvet, F. F., Cooper, M. C., & Gardner, J. T. (2008). The intellectual structure of supply chain management: A bibliometrics approach. Journal of Business Logistics, 29(1), 47–73.

Chen, L. C., & Lien, Y. H. (2011). Using author co-citation analysis to examine the intellectual structure of e-learning: A MIS perspective. Scientometrics, 89(3), 867–886.

Chubin, D. E., & Moitra, S. D. (1975). Content analysis of references: Adjunct or alternative to citation counting? Social Studies of Science, 5(4), 423–441.

Cothill, C. A., Rogers, E. M., & Mills, T. (1989). Co-citation analysis of the scientific literature of innovation research traditions. Science Communication, 11(2), 181–208.

Dijkstra, E. W. (1959). A note on two problems in connection with graphs. Numerische Mathematik, 1(1), 269–271.

Eom, S. (1996). Mapping the intellectual structure of research in decision support systems through author cocitation analysis (1971–1993). Decision Support Systems, 16(4), 315–338.

Eom, S. (1998). Relationships between the decision support system subspecialties and reference disciplines: An empirical investigation. European Journal of Operational Research, 104(1), 31–45.

Eom, S. (2008). Author cocitation analysis: Quantitative methods for mapping the intellectual structure of an academic discipline. Hershey, Pennsylvania: IGI Global.

Eto, M. (2013). Evaluations of context-based co-citation searching. Scientometrics, 94(2), 651–673.

Garfield, E. (1955). Citation indexes for science; A new dimension in documentation through association of ideas. Science, 122(3159), 108–111.

Garfield, E. (1964). Can citation indexing be automated? In Proceedings Symposium of the Statistical Association Methods for Mechanized Documentation, (pp. 2-4), March 17-19, 1964, Washington D.C., U.S.A.

Goldberg, Y., & Levy, O. (2014). Word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.

Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101(Suppl 1), 5228–5235.

Grover, A., & Leskovec, J. (2016). Node2vec: Scalable feature learning for networks. In Proceedings of the Twenty-second ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (pp. 855-864), August 13-17, 2016, San Francisco, California, U.S.A.

Hirsch, J. E. (2005). An index to quantify an individual’s scientific research output. Proceedings of the National Academy of Sciences of the United States of America, 102(46), 16569–16572.

Hirsch, J. E. (2007). Does the H index have predictive power? Proceedings of the National Academy of Sciences of the United States of America, 104(49), 19193–19198.

Hirsch, J. E. (2010). An index to quantify an individual’s scientific research output that takes into account the effect of multiple coauthorship. Scientometrics, 85(3), 741–754.

Jacomy, M., Venturini, T., Heymann, S., & Bastian, M. (2014). ForceAtlas2, a continuous graph layout algorithm for handy network visualization designed for the Gephi software. PLoS One, 9(6), e98679.

Järvelin, K., & Persson, O. (2008). The DCI index: Discounted cumulated impact-based research evaluation. Journal of the American Society for Information Science and Technology, 59(9), 1433–1440.

Jeong, H., Neda, Z., & Barabási, A.-L. (2003). Measuring preferential attachment for evolving networks. Europhysics Letters, 61(4), 567–572.

Jeong, Y.-K., Song, M., & Ding, Y. (2014). Content-based author cocitation analysis. Journal of Informetrics, 8(1), 197–211.

Jiang, J.J., & Conrath, D.W. (1997). Semantic similarity based on corpus statistics and lexical taxonomy. arXiv preprint cmp-lg/9709008.

Kim, H.-J., Jeong, Y.-K., & Song, M. (2016). Content- and proximity-based author co-citation analysis using citation sentences. Journal of Informetrics, 10(4), 954–966.

Leydesdorff, L., & Etzkowitz, H. (1996). Emergence of a triple helix of university-industry-government relations. Science & Public Policy, 23(5), 279–286.

Maričić, S., Spaventi, J., Pavičić, L., & Pifat-Mrzljak, G. (1998). Citation context versus the frequency counts of citation histories. Journal of the American Society for Information Science and Technology, 49(6), 530–540.

McCain, K. W. (1990). Mapping authors in intellectual space: A technical overview. Journal of the American Society for Information Science, 41(6), 433–443.

Mêgnigbêto, E. (2013). Controversies arising from which similarity measures can be used in co-citation analysis. Malaysian Journal of Library and Information Science, 18(2), 25–31.

Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11), 39–41.

Nakov, P.I., Schwartz, A.S., & Hearst, M.A. (2004). Citances: Citation sentences for semantic analysis of bioscience text. In Proceedings of the Twenty-seventh ACM SIGIR Conference Workshop on Search and Discovery in Bioinformatics, July 25-29, 2004, Sheffield, U.K.

Nanba, H., & Okumura, M. (1999). Towards multi-paper summarization using reference information. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, (pp. 926–931), July 31-August 6, 1999, San Francisco, California, U.S.A.

Newman, M. E. J. (2006). Modularity and community structure in networks. Proceedings of the National Academy of Sciences of the United States of America, 103(23), 8577–8582.

Nykiforuk, C. I. J., Osler, G. E., & Viehbeck, S. (2010). The evolution of smoke-free spaces policy literature: A bibliometric analysis. Health Policy (Amsterdam), 97(1), 1–7.

Pinski, G., & Narin, F. (1976). Citation influence for journal aggregates of scientific publications: Theory, with application to the literature of physics. Information Processing & Management, 12(5), 297–312.

Raan, A. F. J. V. (2005). Comparison of the Hirsch-index with standard bibliometric Indicators and with peer judgment for 147 chemistry research groups. Scientometrics, 67(3), 491–502.

Shen, J., Hu, Q., Schrauder, M., Yan, L., Wang, D., Medico, L.,…Liu, S. (2014). Circulating miR-148b and miR-133a as biomarkers for breast cancer detection. Oncotarget, 5(14), 5284–5294.

Simpson, J., & Weiner, E. S. (1989). Oxford English dictionary online. Oxford: Clarendon Press. Retrieved March, 6, 2008.

Small, H. G. (1973). Co-citation in the scientific literature: A new measure of the relationship between two documents. Journal of the American Society for Information Science, 24(4), 265–269.

Small, H. G. (1978). Cited documents as concept symbols. Social Studies of Science, 8(3), 327–340.

Small, H. G. (2011). Interpreting maps of science using citation context sentiments: A preliminary investigation. Scientometrics, 87(2), 373–388.

Tang, J., Jin, R., & Zhang, J. (2008). A topic modeling approach and its integration into the random walk framework for academic search. In Proceedings of the Eighth IEEE International Conference on Data Mining, (pp. 1055-1060), December 15-19, 2008, Pisa, Italy.

Teufel, S., Siddharthan, A., & Tidhar, D. (2006). An annotation scheme for citation function. In Proceedings of the Seventh SIGDIAL Workshop on Discourse and Dialogue, (pp. 80-87), July 15-16, 2006, Sydney, Australia.

Voos, H., & Dagaev, K. S. (1976). Are all citations equal? Or, did we op. cit. your idem? Journal of Academic Librarianship, 1(6), 19–21.

Wang, D., Song, C., & Barabási, A. L. (2013). Quantifying long-term scientific impact. Science, 342(6154), 127–132.

White, H. D., & Griffith, B. C. (1981). Author co-citation: A literature measure of intellectual structure. Journal of the American Society for Information Science, 32(3), 163–171.

White, H. D., & McCain, K. W. (1998). Visualizing a discipline: An author cocitation analysis of information science (1972-1995). Journal of the American Society for Information Science, 49(4), 327–335.

Yan, E., & Ding, Y. (2010). Weighted citation: An indicator of an article’s prestige. Journal of the American Society for Information Science and Technology, 61(8), 1635–1643.

Yin, Y., & Wang, D. (2017). The time dimension of science: Connecting the past to the future. Journal of Informetrics, 11(2), 608–621.

Zhao, D., & Logan, E. (2002). Citation analysis using scientific publications on the web as data source: A case study in the XML research area. Scientometrics, 54(3), 449–472.

Zhao, D. (2006). Towards all-author co-citation analysis. Information Processing & Management, 42(6), 1578–1591.

Zhao, D., & Strotmann, A. (2008a). Information science during the first decade of the web: An enriched author cocitation analysis. Journal of the American Society for Information Science, 59(6), 916–937.

Zhao, D., & Strotmann, A. (2008b). Comparing all-author and first-author co-citation analyses of information science. Journal of Informetrics, 2(3), 229–239.

Zhao, D., & Strotmann, A. (2011). Counting first, last, or all authors in citation analysis: A comprehensive comparison in the highly collaborative stem cell research field. Journal of the American Society for Information Science and Technology, 62(4), 654–676.

Zhao, D., & Strotmann, A. (2014). The knowledge base and research front of information science 2006–2010: An author cocitation and bibliographic coupling analysis. Journal of the Association for Information Science and Technology, 65(5), 995–1006.
