Open Access

Identifying Scientific Project-generated Data Citation from Full-text Articles: An Investigation of TCGA Data Citation



Introduction

The scientific community benefits from data sharing. By reusing previously collected research data, researchers can advance scientific discovery far beyond the original analysis (Piwowar & Vision, 2013). Reusing scientific data facilitates confirmation of original results, and it supports the generation of new hypotheses when the data are combined with other types of data.

Biomedical data sharing policies have been established to ensure that data are publicly available. For example, it is mandatory to share large-scale genomic data that were generated or analyzed with funding from the U.S. National Institutes of Health (Green et al., 2015). Grantees deposit their data in a public database and serve as the data authors. To build an enduring collection of scientific data and create a sustainable data ecosystem, all of the key players in data management, such as data authors, data curators, data users, and funding agencies, must actively fulfill their responsibilities (Bourne, Lorsch, & Green, 2015; National Science Board, 2005). In this data management lifecycle, data authors conform to data standards and data quality requirements, and they produce scientific data that are then deposited in a public database in a comprehensive manner. Moreover, data users adhere to license or copyright requirements regarding the use of data generated by data authors, and they must correctly cite the data used in their scientific publications to indicate that their studies rely on other researchers’ data. Identifying data citations is important for funding agencies to evaluate grantees’ scientific contributions and grant outcomes. Additionally, a dataset that is frequently cited by other researchers is more likely to be regarded as high-quality data and to reflect domain trends.

It is challenging to identify data citations in full-text literature, although there is a long tradition of partnership between scientific literature and public data in the field of medical sciences (Kafkas, Kim, & McEntyre, 2013). A few studies have been conducted to identify data citation using unique data accession numbers in the database. However, most project-generated data focuses on one specific scientific goal and lacks either a well-defined data identifier or standardized citation regulations.

The Cancer Genome Atlas (TCGA) project was launched in 2005 and funded by the US government, and it aims to catalogue and discover major cancer-causing genomic alterations to help improve the clinical outcome of cancers (Tomczak, Czerwinska, & Wiznerowicz, 2015). A major goal of the project was to provide publicly available cancer genomic datasets (https://tcga-data.nci.nih.gov/tcga/) that include over 30 human cancer types (e.g. brain cancer, lung cancer, and breast cancer) with multiple genomic profiles based on recent high-throughput platforms (e.g. RNA sequencing and single nucleotide polymorphisms). The TCGA program encouraged scientists worldwide to conduct comprehensive analyses of the large-scale dataset collected by the project, which contributes to the common goal of improving cancer diagnosis, treatment, and prevention (Chin et al., 2011; Tomczak, Czerwinska, & Wiznerowicz, 2015). Thus, the TCGA data are widely used and support meaningful scientific discoveries. However, it is challenging to track TCGA data usage because of the complexity of the data and the lack of data identifiers. Researchers can use the TCGA data by freely combining data from multiple types of cancer that were tested using multiple high-throughput platforms. In this preliminary study, we aimed to identify TCGA data citations through full-text literature mining.

Related Work

Text-mining methods have been developed to identify database citations from the literature by characterizing database entry accession numbers. Neveol et al. developed a machine learning method to extract data deposition statements from full-text literature (Neveol et al., 2011). Furthermore, they analyzed link curation between deposition databases (e.g. GEO and PDB) and the literature and proposed that text-mining tools can improve the links between literature and biological databases (Neveol et al., 2012). Kafkas, Kim, and McEntyre applied the patterns of ENA, UniProt, and PDB accession numbers to identify database citations from full-text literature (Kafkas, Kim, & McEntyre, 2013) and from article supplemental files (Kafkas et al., 2015). Piwowar et al. investigated citation relationships between microarray databases (e.g. GEO and ArrayExpress) and the literature (Piwowar & Chapman, 2010; Piwowar & Vision, 2013). Yu et al. constructed a database link network from pairs of databases that were co-mentioned in the methodology sections of full-text literature to track database usage, connection, and evolution (Yu et al., 2015). These efforts have improved the understanding of data citations for specific databases that have well-defined accession numbers. However, few studies have been conducted to identify data-literature citation relationships for data generated by a scientific project in which the data are required to be shared but lack a predefined identifier. In this study, we selected a publicly funded project with publicly available project data: TCGA (Tomczak, Czerwinska, & Wiznerowicz, 2015).

Methodology

To identify TCGA data usage from full-text articles, we proposed a computational framework (Figure 1). We collected TCGA-related full-text articles from PubMed Central, constructed a benchmark dataset of articles that truly used the TCGA data, and analyzed data usage according to the specific cancer type and high-throughput platform.

Figure 1

Computational workflow for identifying TCGA data usage.

TCGA-related Literature

PubMed Central (PMC, http://www.ncbi.nlm.nih.gov/pmc/), a publicly available literature archive, was used as the full-text article resource. In the literature, the TCGA-related terms were mentioned in the form of both abbreviations and full-name descriptions. We extracted a set of full-text articles from PMC using the query “TCGA” or “Cancer Genome Atlas” and restricting the publication date to 2008 through 2015 (Search term: (tcga OR “cancer genome atlas”) AND (“2008” [Publication Date]: “2015” [Publication Date])). In total, 5,372 papers in XML format were collected as of October, 2015. We then removed the articles that mentioned “TCGA” or “Cancer Genome Atlas” only in the article reference section. The remaining 5,005 full-text articles were included in the raw dataset for further analysis.
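The reference-only filter described above could be approximated as follows. This is a minimal sketch, not the authors' implementation: it assumes JATS-style PMC XML in which the bibliography is held in `<ref-list>` elements, and the function name `mentions_tcga_outside_references` and the regular expression are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumed pattern for the two query terms; case-insensitive.
TCGA_PATTERN = re.compile(r"\bTCGA\b|cancer\s+genome\s+atlas", re.IGNORECASE)


def mentions_tcga_outside_references(xml_text):
    """Return True if a TCGA term occurs anywhere except the reference list."""
    root = ET.fromstring(xml_text)
    # Snapshot the elements first, then drop every <ref-list> subtree.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "ref-list":
                parent.remove(child)
    remaining_text = " ".join(root.itertext())
    return bool(TCGA_PATTERN.search(remaining_text))


used = (
    "<article><body><p>We used TCGA data.</p></body>"
    "<back><ref-list><ref>The Cancer Genome Atlas</ref></ref-list></back></article>"
)
cited_only = (
    "<article><body><p>We studied tumors.</p></body>"
    "<back><ref-list><ref>The Cancer Genome Atlas Network</ref></ref-list></back></article>"
)
print(mentions_tcga_outside_references(used))
print(mentions_tcga_outside_references(cited_only))
```

Articles for which the function returns False would be the ones dropped from the raw dataset under this assumption.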

Benchmark Dataset

We collected 25 open access publications that used TCGA data, as confirmed by the TCGA Network, from the official website (http://cancergenome.nih.gov/publications). These articles constitute the benchmark dataset for the following analysis. We attempted to characterize the patterns of articles that use TCGA data by analyzing the benchmark dataset. We primarily investigated the location of the key term, “Cancer Genome Atlas” or its abbreviation, in the full-text literature. Because the structure of articles varies across journals, we divided each article into six sections: title, abstract, introduction/background, method/material, results, and discussion/conclusion. We manually intervened when the PMC XML parser failed to identify these sections.
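One way to map the varying section headings of different journals onto these normalized sections is a keyword lookup. The sketch below is illustrative only: the keyword lists and the function name `normalize_section` are assumptions, title and abstract are assumed to come from dedicated XML elements rather than headings, and a return value of None marks a heading that would need the manual intervention mentioned above.

```python
# Ordered keyword rules; the first rule that matches wins.
SECTION_KEYWORDS = [
    ("introduction/background", ("introduction", "background")),
    ("method/material", ("method", "material")),
    ("results", ("result",)),
    ("discussion/conclusion", ("discussion", "conclusion")),
]


def normalize_section(heading):
    """Map a raw section heading to a normalized label, or None if unknown."""
    lowered = heading.lower()
    for label, keywords in SECTION_KEYWORDS:
        if any(keyword in lowered for keyword in keywords):
            return label
    return None  # unrecognized heading: left for manual review


for raw in ["Materials and Methods", "RESULTS", "Background", "Patients and Samples"]:
    print(raw, "->", normalize_section(raw))
```

Because the rules are ordered, a combined heading such as "Results and Discussion" is assigned to the results section in this sketch; a different policy could be chosen.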

TCGA Data Usage Analysis

We developed a full-text extraction method that parses the full-text articles in XML format, extracts metadata such as publication date and author country, and identifies the TCGA-related key terms (“TCGA” or “Cancer Genome Atlas”).
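A simplified version of such an extraction step might look like the following. It is a sketch under stated assumptions rather than the method used in the study: the article is assumed to follow a JATS-like layout (`<front>`, `<body>`, top-level `<sec>` elements with a `<title>`, `<pub-date>/<year>`, and `<country>` inside affiliations), and `extract_article_features` is a hypothetical name.

```python
import re
import xml.etree.ElementTree as ET

TCGA_PATTERN = re.compile(r"\bTCGA\b|cancer\s+genome\s+atlas", re.IGNORECASE)


def text_of(element):
    """Concatenate all text below an element; empty string if it is missing."""
    return " ".join(element.itertext()) if element is not None else ""


def extract_article_features(xml_text):
    root = ET.fromstring(xml_text)
    front = root.find("front")
    title_el = front.find(".//article-title") if front is not None else None
    abstract_el = front.find(".//abstract") if front is not None else None

    hits = []
    if TCGA_PATTERN.search(text_of(title_el)):
        hits.append("title")
    if TCGA_PATTERN.search(text_of(abstract_el)):
        hits.append("abstract")
    # Only top-level body sections are inspected in this sketch.
    for sec in root.findall("./body/sec"):
        if TCGA_PATTERN.search(text_of(sec)):
            hits.append((sec.findtext("title") or "").strip())

    return {
        "year": root.findtext(".//pub-date/year"),
        "countries": sorted({c.text.strip() for c in root.iter("country") if c.text}),
        "tcga_sections": hits,
    }


sample = """<article>
  <front><article-meta>
    <title-group><article-title>Genomic landscape of glioblastoma</article-title></title-group>
    <aff><country>USA</country></aff>
    <pub-date><year>2013</year></pub-date>
    <abstract><p>We analyzed tumors.</p></abstract>
  </article-meta></front>
  <body>
    <sec><title>Introduction</title><p>Glioblastoma is aggressive.</p></sec>
    <sec><title>Results</title><p>Samples from The Cancer Genome Atlas were sequenced.</p></sec>
  </body>
</article>"""
features = extract_article_features(sample)
print(features)
```

The raw section headings returned here would still need to be normalized to the six sections defined for the benchmark analysis.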

The cancer type and high-throughput platform are two characteristic classes of key words in the TCGA data usage statements. Here, the cancer type refers to a list of cancers investigated in the TCGA program, whereas the high-throughput platform refers to a list of high-throughput biotechnologies used by the TCGA investigators to measure cancer genomic information. In the TCGA program from 2005 to 2014, over 30 cancers were studied using microarray and next-generation sequencing platforms, producing large-scale data such as gene expression, exon expression, miRNA, copy number variation (CNV), single nucleotide polymorphism (SNP), loss of heterozygosity (LOH), mutations, DNA methylation, and protein expression. Referring to Disease Ontology (Kibbe et al., 2015) and the TCGA data matrix (TCGA data matrix, 2015), we developed a controlled vocabulary for the TCGA cancer type (Table 1) and high-throughput platform (Table 2).

Table 1. Examples of TCGA cancer-type concepts.

| Concept ID | Name | TCGA-defined term ([abbr] – [full name]) | Synonyms | DO mapping |
| --- | --- | --- | --- | --- |
| D0001 | Glioblastoma | GBM – Glioblastoma Multiforme | Glioblastoma, GBM, adult glioblastoma multiforme, primary glioblastoma multiforme, spongioblastoma multiforme | DOID: 3068 |
| D0002 | Breast cancer | BRCA – Breast Invasive Carcinoma | Breast cancer, breast tumor, breast neoplasm, mammary cancer, mammary tumor, mammary neoplasm, malignant tumor of breast | DOID: 1612 |
| D0003 | Ovarian cancer | OV – Ovarian Serous Cystadenocarcinoma | Ovarian cancer, ovarian tumor, ovarian neoplasm, ovary cancer, ovary tumor, ovary neoplasm, malignant tumor of ovary | DOID: 2394 |
| D0004 | Acute myeloid leukemia | LAML – Acute Myeloid Leukemia | Acute myeloid leukemia, AML, acute myeloblastic leukemia, acute myelogenous leukemia | DOID: 9119 |

Table 2. Examples of TCGA high-throughput platform concepts.

| Concept ID | Name | TCGA-defined terms | Generated data |
| --- | --- | --- | --- |
| P0001 | RNASeq | IlluminaGA_RNASeq, IlluminaHiSeq_RNASeq | Nucleotide sequence, gene expression |
| P0002 | miRNASeq | IlluminaGA_miRNASeq | miRNAs, microRNA, microRNA sequence |
| P0003 | SNP | Genome_Wide_SNP | SNPs, single nucleotide polymorphisms, CNV, copy number variation |
| P0004 | Methylation | Human methylation | DNA methylation |

As shown in Tables 1 and 2, the TCGA-defined terms were used to standardize the description of program-generated data; however, they are not the terms used in the full-text articles. For example, the results section of one article (PMCID: PMC3910500) described the genomic landscape of glioblastoma using whole-exome sequencing (WES), whole-genome sequencing (WGS), and RNA sequencing (RNA-Seq) (Brennan et al., 2013). To identify the TCGA cancer type and high-throughput platform concepts in free text, we developed a named entity recognition method based on a biomedical text mining tool (Leaman, Islamaj, & Lu, 2013).
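To illustrate how free-text mentions can be mapped to controlled-vocabulary concepts, the sketch below uses plain dictionary lookup. This is a deliberately simpler stand-in for the named entity recognition tool cited above, not a reproduction of it. The cancer-type synonyms are taken from Table 1; the platform surface forms (for example "RNA-sequencing") and the function name `find_concepts` are assumptions added for illustration.

```python
import re

# Synonyms copied from Table 1 (subset).
CANCER_VOCAB = {
    "D0001": ["glioblastoma", "GBM", "glioblastoma multiforme"],
    "D0002": ["breast cancer", "breast tumor", "breast neoplasm", "mammary cancer"],
    "D0003": ["ovarian cancer", "ovarian tumor", "ovary cancer"],
    "D0004": ["acute myeloid leukemia", "AML", "acute myelogenous leukemia"],
}

# Hypothetical surface forms; the TCGA-defined terms themselves rarely
# appear verbatim in article text.
PLATFORM_VOCAB = {
    "P0001": ["RNASeq", "RNA-Seq", "RNA sequencing", "RNA-sequencing"],
    "P0002": ["miRNASeq", "miRNA-Seq", "microRNA sequencing"],
    "P0004": ["DNA methylation"],
}


def find_concepts(text, vocabulary):
    """Return the sorted concept IDs whose synonyms occur as whole terms in text."""
    found = set()
    for concept_id, synonyms in vocabulary.items():
        for synonym in synonyms:
            # Lookarounds keep "AML" from matching inside a longer token.
            pattern = r"(?<!\w)" + re.escape(synonym) + r"(?!\w)"
            if re.search(pattern, text, flags=re.IGNORECASE):
                found.add(concept_id)
                break
    return sorted(found)


sentence = (
    "We describe the genomic landscape of glioblastoma using "
    "whole-exome sequencing and RNA-sequencing."
)
print(find_concepts(sentence, CANCER_VOCAB))
print(find_concepts(sentence, PLATFORM_VOCAB))
```

Exact matching of this kind misses spelling variants and ambiguous abbreviations, which is one reason a dedicated normalization tool is preferable in practice.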

Results

Overview of the TCGA-related Publications

The number of TCGA-related articles has increased as the program has continued. Figure 2 shows the number of TCGA-related PMC articles published from 2008 to 2015; over 1,600 TCGA articles were published in 2014. The 2015 reduction is due to data incompleteness as of September, 2015. TCGA data accumulation and data sharing contributed to the significant increase in TCGA publications. Phase I of the TCGA program (a 3-year pilot study) aimed to collect cancer tissues, process the biospecimens, apply high-throughput platforms to identify cancer genomic information, and analyze genetic changes involved in the cancer. Since 2009 (phase II), the data generated by the TCGA program have been centrally managed at the TCGA Coordinating Center and entered into public databases, allowing scientists to continually search, download, and analyze the data.

Figure 2

Number of TCGA-related publications in PMC.

Figure 3 shows the geographical distribution of the TCGA-related publications. Researchers from 37 countries used the TCGA data in their studies. The United States was the most productive country, followed by China, Canada, Australia, and Germany.

Figure 3

Geographical distribution of TCGA-related publications.

TCGA Key Terms That Were Mentioned in the Full-text Articles

We compared the TCGA key term features, TCGA term positions, and the TCGA-related concepts mentioned in the retrieved PMC articles (Section 3.1) and in the benchmark dataset (Section 3.2). Table 3 shows the true positive rate (TPR) of full-text articles in each dataset that have the TCGA key term features. The TCGA term (i.e. ‘TCGA’ or ‘Cancer Genome Atlas’) was most likely to appear in the results section in both the retrieved PMC article set (74%) and the benchmark dataset (96%). Additionally, studies using the TCGA data are likely to describe the cancer type and high-throughput platform in the full-text articles of both datasets. Although there was a similar TCGA feature distribution within the retrieved PMC article set and within the benchmark dataset (χ2 test, p<0.05), the article proportions were lower in the retrieved PMC article set than in the benchmark dataset. This is because some articles in the retrieved set merely mentioned the TCGA term rather than actually using the data. In the following analysis, we focused on the PMC articles in which the TCGA term occurred in the methods/materials or the results section, because these studies were more likely to have used TCGA data.

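Table 3. Distribution of TCGA key terms in full-text articles.

| Feature group | Feature | Retrieved PMC article set (%) | Benchmark dataset (%) |
| --- | --- | --- | --- |
| TCGA term position | Title | 1 | 4 |
| TCGA term position | Abstract | 11 | 28 |
| TCGA term position | Introduction/Background | 12 | 20 |
| TCGA term position | Method/Material | 31 | 68 |
| TCGA term position | Result | 74 | 96 |
| TCGA term position | Discussion/Conclusion | 20 | 36 |
| TCGA-related concept mention | Cancer type mention | 73 | 100 |
| TCGA-related concept mention | Platform mention | 66 | 96 |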
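The percentages in Table 3 amount to counting, for each feature, the share of articles in a dataset that exhibit it. The sketch below shows that calculation on made-up inputs; the function name `feature_percentages` and the toy feature sets are hypothetical, and only the benchmark size of 25 articles comes from the text.

```python
from collections import Counter


def feature_percentages(articles):
    """articles: one set of feature labels per article.

    Returns the rounded percentage of articles that exhibit each feature.
    """
    total = len(articles)
    if total == 0:
        return {}
    counts = Counter()
    for features in articles:
        counts.update(set(features))  # count each feature once per article
    return {feature: round(100 * n / total) for feature, n in counts.items()}


# Toy data: 24 of 25 articles mention the TCGA term in the results section.
toy_benchmark = [{"result"} for _ in range(24)] + [{"abstract"}]
print(feature_percentages(toy_benchmark))
```

With 25 articles, one article corresponds to 4 percentage points, which is why every benchmark value in the table is a multiple of 4.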

TCGA Cancer Type and High-throughput Platform Generated Data Usage

To investigate the specific TCGA data usage, we identified the TCGA cancer types and high-throughput platforms mentioned in the methods/materials and results sections of the PMC full-text articles (Section 3.3). Figure 4 shows the proportion of different TCGA cancer types in the retrieved PMC article set. Glioblastoma (28%), lung cancer (18%), and breast cancer (11%) were the cancer types whose data were used most frequently. Glioblastoma was the first cancer studied by the TCGA program, leading to TCGA infrastructure development that included data collection and sharing (Cancer Genome Atlas Research Network, 2008). This may be the main reason that the TCGA glioblastoma data were used more frequently.

Figure 4

Distribution of TCGA cancer types.

As shown in Figure 5, the data generated by the RNASeq platform are the most widely used (48%). Compared with traditional DNA sequencing technology, RNA sequencing can help researchers understand the transcriptome by precisely and rapidly deriving a wide range of information, such as transcripts, isoforms, gene fusions, and non-coding RNAs (Wang et al., 2009). The TCGA data generated by the RNASeq platform provide researchers with standardized and comprehensive cancer transcriptome profiles to discover biomarkers related to tumorigenesis and metastasis (Peng et al., 2015).

Figure 5

Distribution of the TCGA high-throughput platform.

Discussion

In this preliminary study, we investigated how to track the use of scientific data generated by a long-term government-funded program. We selected the TCGA program and analyzed over 5,000 full-text articles collected from PMC. We constructed a benchmark dataset of articles that truly used TCGA data, and we compared it with the full-text articles retrieved from PMC. Furthermore, we built a controlled vocabulary tailored to the TCGA program that describes the cancer type and high-throughput platform, which provides insights into which specific data were used. Our work can contribute to the integration of scientific data and scientific literature. As shown in the box in Figure 6, the TCGA funding agencies manually collected the articles and linked them to their source data (TCGA publication, 2016). Our efforts may help develop an automatic method to identify recent publications that use TCGA data.

Figure 6

Manually linked literature that uses TCGA data.

However, this study has limitations. (1) The benchmark set may introduce bias. We collected only 25 articles from the TCGA website to construct the benchmark dataset. The patterns of full-text articles that actually cite the TCGA data were not validated on a large-scale dataset. Here, we compared only the TCGA term position and the TCGA-related concepts mentioned in the retrieved PMC articles and in the benchmark dataset. In the future, we may manually construct a benchmark dataset that includes more full-text articles that actually cite TCGA data. (2) The performance of TCGA-related term identification requires evaluation. Here, we applied a biomedical text mining tool to identify the mentioned TCGA cancer types and high-throughput platforms without validating the named entity recognition. (3) Natural language processing is needed to confirm the relationships between cancer type and platform. The data usage statement in full-text literature describes which cancer type samples are tested by which platforms; however, we have not yet considered these specific relationships.

Conclusion

We present a workflow to identify scientific project-generated data citations via full-text article analysis, and we applied this workflow to track TCGA data citations in PMC literature. In contrast to previous studies, the scientific data entries in our study lacked predefined accession numbers. Although our preliminary study has limitations, this work is a step towards integrating literature with scientific data generated by a government-funded project. In future work, we expect to improve the construction of the scientific data citation benchmark dataset, normalize the full-text article sections, map the project's self-defined vocabulary, and evaluate the performance of data citation identification.

eISSN:
2543-683X
Language:
English
Publication timeframe:
4 times per year
Journal Subjects:
Computer Sciences, Information Technology, Project Management, Databases and Data Mining