<abstract xmlns="http://www.w3.org/1999/xhtml"><p>In the current data-intensive era, the traditional hands-on method of conducting scientific research by exploring related publications to generate a testable hypothesis is well on its way to becoming obsolete, perhaps within just a year or two. Analyzing the literature and data to automatically generate a hypothesis might become the <italic>de facto</italic> approach to inform the core research efforts of those trying to master the exponentially rapid expansion of publications and datasets. Here, viewpoints are provided and discussed to aid understanding of the challenges of data-driven discovery.</p><p>The Panama Canal, the 77-kilometer waterway connecting the Atlantic and Pacific oceans, has played a crucial role in international trade for more than a century. However, digging the Panama Canal was an exceedingly challenging process. A French effort in the late 19<sup>th</sup> century was abandoned because of equipment issues and a significant loss of labor due to tropical diseases transmitted by mosquitoes. The United States officially took control of the project in 1902 and replaced the unusable French equipment with new construction equipment designed for a much larger and faster scale of work. Colonel William C. Gorgas was appointed chief sanitation officer and charged with eliminating mosquito-spread illnesses. After overcoming these and additional trials and tribulations, the Canal successfully opened on August 15, 1914. The triumphant completion of the Panama Canal demonstrates that using the right tools and eliminating significant threats are critical steps in any project.</p><p>More than 100 years later, a paradigm shift is occurring as we move into a data-centered era. Today, data are extremely rich but overwhelming, and extracting information from data requires not only the right tools and methods but also awareness of major threats. 
In this data-intensive era, the traditional method of exploring the related publications and available datasets from previous experiments to arrive at a testable hypothesis is becoming obsolete. Consider that a new article is published every 30 seconds (<xref ref-type="bibr" rid="j_jdis.201622_ref_013_w2aab2b8b3b1b7b1ab2ac13Aa">Jinha, 2010</xref>). For the common disease of diabetes alone, roughly 500,000 articles have been published to date; even a scientist who reads 20 papers per day would need more than 68 years to wade through all the material. The standard method simply cannot deal with the large volume of documents or the exponential growth of datasets. A major threat is that the canon of domain knowledge cannot be consumed and held in human memory. Without efficient methods to process information and without a way to eliminate the fundamental threat of limited memory and time to handle the data deluge, we may find ourselves facing failure as the French did on the Isthmus of Panama more than a century ago.</p><p>Scouring the literature and data to generate a hypothesis might become the <italic>de facto</italic> approach to inform the core research efforts of those trying to master the exponentially rapid expansion of publications and datasets (<xref ref-type="bibr" rid="j_jdis.201622_ref_010_w2aab2b8b3b1b7b1ab2ac10Aa">Evans &amp; Foster, 2011</xref>). 
In reality, most scholars have never been able to keep completely up to date with publications and datasets, given the unending increase in the quantity and diversity of research within their own areas of focus, let alone in related conceptual areas in which knowledge may be segregated by syntactically impenetrable keyword barriers or an entirely different research corpus.</p><p>Research communities in many disciplines are finally recognizing that, with advances in information technology, new ways are needed to extract entities from increasingly data-intensive publications and to integrate and analyze large-scale datasets. This provides a compelling opportunity to improve the process of knowledge discovery from the literature and datasets through the use of knowledge graphs and an associated framework that integrates scholars, domain knowledge, datasets, workflows, and machines on a scale previously beyond our reach (<xref ref-type="bibr" rid="j_jdis.201622_ref_009_w2aab2b8b3b1b7b1ab2ab9Aa">Ding et al., 2013</xref>).</p></abstract>
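
The reading-time figures quoted in the abstract follow from simple arithmetic. The sketch below reproduces them in Python as a back-of-the-envelope check; the 365-day year, the assumption of reading every single day, and the function names are illustrative choices, not something taken from the article.

```python
# Back-of-the-envelope check of the figures quoted in the abstract.
# Assumptions (not from the article): a 365-day year, reading every day.

SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365


def years_to_read(total_articles: int, papers_per_day: int) -> float:
    """Years needed to read a fixed backlog at a constant daily rate."""
    return total_articles / papers_per_day / DAYS_PER_YEAR


def articles_per_year(seconds_per_article: float) -> float:
    """Articles published per year if one appears every N seconds."""
    return SECONDS_PER_DAY / seconds_per_article * DAYS_PER_YEAR


if __name__ == "__main__":
    # 500,000 diabetes articles read at 20 papers per day
    print(f"{years_to_read(500_000, 20):.1f} years")
    # one new article every 30 seconds
    print(f"{articles_per_year(30):,.0f} articles per year")
```

Under these assumptions the diabetes backlog alone takes roughly 68.5 years to read, while one article every 30 seconds adds about 2,880 articles per day, so the backlog grows far faster than any single reader can shrink it.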



<div xmlns="http://www.w3.org/1999/xhtml"><p>Dr Ying Ding is an Associate Professor at Indiana University, USA, and Co-Editor-in-Chief of <italic>Journal of Data and Information Science</italic> (JDIS). She is Associate Director of the Data Science Online Program and Director of the Web Science Lab. She is a Changjiang Scholar at Wuhan University and Elsevier Guest Professor at Tongji University. Her research interests include scholarly communication for knowledge discovery, semantic Web for drug discovery, social network analysis for research impact, and data integration and mediation in Web 2.0. She has published more than 200 papers, which have been cited over 4,000 times. She is the Co-Editor of <italic>Semantic Web Synthesis</italic> by Morgan &amp; Claypool and serves on the editorial boards of several leading international journals.</p><p><figure id="j_jdis.201622_fig_001_w2aab2b8b3b1b7b1ab1b1aAa" position="float" fig-type="figure"><img xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="graphic/j_jdis.201622_fig_001.jpg" src="https://sciendo-parsed.s3.eu-central-1.amazonaws.com/64720f07215d2f6c89dba645/j_jdis.201622_fig_001.jpg?X-Amz-Algorithm=AWS4-HMAC-SHA256&amp;X-Amz-Date=20240425T054057Z&amp;X-Amz-SignedHeaders=host&amp;X-Amz-Expires=18000&amp;X-Amz-Credential=AKIA6AP2G7AKP25APDM2%2F20240425%2Feu-central-1%2Fs3%2Faws4_request&amp;X-Amz-Signature=e850056f1911052cc80f3c9c05ea19b6a4b0ed03d242f54eceae6862eb76d594" class="mw-100"/></figure></p><sec id="j_jdis.201622_s_001_w2aab2b8b3b1b7b1ab1b2Aa"><div>Scientific Discovery</div><p>Scientific discovery revolves around the process of problem solving. It either uses existing well-established methods to explore a new area or invents new methods to solve existing problems. Either way, it is a journey into unknown terrain. Trial and error remains the most common approach to testing new ideas, learning from failures, and, eventually, finding success. 
The problem-solving process can be viewed as a search for a path connecting the initial state and the goal state (<a ref-type="bibr" href="#j_jdis.201622_ref_016_w2aab2b8b3b1b7b1ab2ac16Aa">Klahr, 2000</a>). In cognitive science, a problem space contains the set of states, operators, goals, and constraints, and this problem space can be huge or small depending on whether you are on the right path to the final goal. The time to reach the final goal can be significantly shortened if the right tools are used.</p><p>How challenging the problem-solving process is also depends on the basic components of a problem space. The vagueness of some of these components can easily make scientific discovery purposeless. For example, one can have a task with a well-defined goal state (e.g. proving a mathematical equation) but a vague initial state, a task with a clear initial state (e.g. finding potential binding drugs for a given target) but an unclear goal state, or even a task with an ill-defined initial state and goal state (e.g. inventing a cool tool). More knowledge available to the problem-solver can significantly reduce the vagueness of the basic components and set clear boundaries on the problem space. It is important to understand the problem space and foresee next steps.</p></sec><sec id="j_jdis.201622_s_002_w2aab2b8b3b1b7b1ab1b3Aa"><div>Knowledge Discovery</div><p>Hypotheses can be generated from different sources. The dominant approach to developing a hypothesis in biology and medicine, for example, is through first-hand observation, which includes experimental data, electronic medical records, gene sequence data, and lab test results. The alternative method of generating a hypothesis from literature is viewed as a serendipitous process with great uncertainty, even more so now that the vast amount of published research contains a diversity of knowledge beyond what domain experts can humanly reason about. 
Especially for researchers in transdisciplinary domains, it is no longer possible for domain experts in one domain to fully master the knowledge in another domain.</p><p>Mining literature to generate hypotheses is not confined to biology or medicine but can be done in almost any science. Publications are no longer just an output of research but rather a vital part of the scientific process. A significant number of associations between different biological entities (e.g. disease, gene, drug, side effect, and pathway) are scattered across millions of biomedical articles. Mining these documented associations can infer innovative associations and generate novel hypotheses, especially in translational research.</p><p>Science is being conducted in a totally different way than it was 20 years ago. For example, biology is shifting from conventional biology to conceptual biology (<a ref-type="bibr" href="#j_jdis.201622_ref_004_w2aab2b8b3b1b7b1ab2ab4Aa">Blagosklonny &amp; Pardee, 2002</a>) and moving further to systems biology (<a ref-type="bibr" href="#j_jdis.201622_ref_015_w2aab2b8b3b1b7b1ab2ac15Aa">Kell, 2006</a>; <a ref-type="bibr" href="#j_jdis.201622_ref_022_w2aab2b8b3b1b7b1ab2ac22Aa">Oprea et al., 2007</a>), in part because of a strong opinion that the conceptual review and systems thinking of available published knowledge should take its place as an essential component of scientific research. The world of ideas (i.e. published knowledge), interplaying with high-throughput experiments, computational modelling, and technology, can generate intelligent hypotheses that will end the aimless fishing adventures of conventional biology. New knowledge, derived from tens of thousands of publications and manually curated datasets, can be linked back to published knowledge to form a self-evolving ecological knowledge base (<a ref-type="bibr" href="#j_jdis.201622_ref_020_w2aab2b8b3b1b7b1ab2ac20Aa">Mons et al., 2011</a>). 
Predictions and experiments that were carried out for other reasons can be reused or revealed in a new context that fully embraces the holistic view of knowledge processing.</p><p>New ways of conducting research are in high demand, and examples of new methods can be found in many disciplines (<a ref-type="bibr" href="#j_jdis.201622_ref_009_w2aab2b8b3b1b7b1ab2ab9Aa">Ding et al., 2013</a>). <a ref-type="bibr" href="#j_jdis.201622_ref_026_w2aab2b8b3b1b7b1ab2ac26Aa">Don Swanson’s (1986)</a> work on undiscovered public knowledge has had a wide impact on association discovery and demonstrated that new knowledge can be discovered from disjoint sets of scientific articles. Swanson’s vision of the hidden value of the scientific literature in biomedical digital databases is remarkably innovative for information scientists, biologists, and physicians (<a ref-type="bibr" href="#j_jdis.201622_ref_003_w2aab2b8b3b1b7b1ab2ab3Aa">Bekhuis, 2006</a>; <a ref-type="bibr" href="#j_jdis.201622_ref_027_w2aab2b8b3b1b7b1ab2ac27Aa">Swanson, Smalheiser, &amp; Bookstein, 2001</a>). Literature-related discovery that mines knowledge in two disparate sets of literature has identified several non-drug approaches that can be used to halt or reverse the symptoms of multiple sclerosis, cataracts, and other chronic diseases (<a ref-type="bibr" href="#j_jdis.201622_ref_017_w2aab2b8b3b1b7b1ab2ac17Aa">Kostoff, 2012</a>). By combining PubMed literature and public datasets, <a ref-type="bibr" href="#j_jdis.201622_ref_007_w2aab2b8b3b1b7b1ab2ab7Aa">Chen, Ding, and Wild (2012)</a> predicted potential drug-target pairs. 
The method performs extremely well in correctly identifying known drug-target pairs in the data and compares favorably with the established Similarity Ensemble Approach (SEA) method (<a ref-type="bibr" href="#j_jdis.201622_ref_014_w2aab2b8b3b1b7b1ab2ac14Aa">Keiser et al., 2009</a>) for predicting new drug-target interactions, as well as with the Connectivity Map (CMAP) (<a ref-type="bibr" href="#j_jdis.201622_ref_018_w2aab2b8b3b1b7b1ab2ac18Aa">Lamb et al., 2006</a>) for associating drugs with changes in gene expression levels.</p><p><a ref-type="bibr" href="#j_jdis.201622_ref_025_w2aab2b8b3b1b7b1ab2ac25Aa">Spangler and colleagues (2014)</a> mined information contained in published articles to identify new protein kinases that phosphorylate the tumor suppressor protein p53. They successfully demonstrated that it is possible to automatically generate hypotheses for domain experts based on existing published scholarly articles. Even in the humanities, Franco Moretti’s distant reading approach tackles literary problems by applying computational methods to aggregate and analyze massive amounts of data and generate hypotheses. He argues that distant reading is needed because nobody is able to read the 60,000 novels published in 19<sup>th</sup>-century England to understand Victorian fiction (<a ref-type="bibr" href="#j_jdis.201622_ref_023_w2aab2b8b3b1b7b1ab2ac23Aa">Schulz, 2011</a>). All of these examples show that generating hypotheses by mining existing literature and open datasets can advance science and generate huge societal impact.</p><p>And while these examples highlight that human brains have a great capacity for integrating information and recognizing patterns, computers are catching up. 
IBM Watson, a supercomputer, can process millions of articles, patents, Wikipedia pages, and datasets to facilitate research and diagnostic decision making in lung cancer treatment (<a ref-type="bibr" href="#j_jdis.201622_ref_029_w2aab2b8b3b1b7b1ab2ac29Aa">Upbin, 2013</a>). It also famously defeated two of the best human <italic>Jeopardy!</italic> players, Ken Jennings and Brad Rutter, in 2011, by parsing keywords in a large set of data to search for related terms as responses. While it is fast, it has the disadvantage of sometimes misunderstanding the context of keywords. In addition, image recognition powered by deep learning has recently outperformed humans (<a ref-type="bibr" href="#j_jdis.201622_ref_028_w2aab2b8b3b1b7b1ab2ac28Aa">Thomsen, 2015</a>). Project Adam, an initiative by Microsoft, can accurately identify a dog’s breed based on a single photo. Soon, it will be possible for computers to provide nutritional information about a meal or help diagnose skin diseases (<a ref-type="bibr" href="#j_jdis.201622_ref_006_w2aab2b8b3b1b7b1ab2ab6Aa">Chansanchai, 2014</a>).</p></sec><sec id="j_jdis.201622_s_003_w2aab2b8b3b1b7b1ab1b4Aa"><div>Translational Thinking</div><p>What Hal Varian called “combinatorial innovation” combines or recombines component parts of previous innovations or ideas to generate new innovations (<a ref-type="bibr" href="#j_jdis.201622_ref_019_w2aab2b8b3b1b7b1ab2ac19Aa">McKinsey, 2009</a>). The polymerase chain reaction, which earned Kary Banks Mullis the 1993 Nobel Prize in Chemistry, is the result of recombining well-understood techniques in biochemistry (<a ref-type="bibr" href="#j_jdis.201622_ref_005_w2aab2b8b3b1b7b1ab2ab5Aa">Brynjolfsson &amp; McAfee, 2014</a>). Dozens and dozens of publications that documented previous research outputs can be used to trigger translational thinking. 
These publications can be analyzed and mapped to show a researcher the scholarly landscape of unfamiliar fields and to suggest high-impact works to study and potential collaborators with whom to work. Other examples of combinatorial innovation include medical scientists who mine literature and open data to facilitate diagnostic decision making in cancer treatment, and healthcare professionals who study literature to generate practical guidelines for wound care (<a ref-type="bibr" href="#j_jdis.201622_ref_011_w2aab2b8b3b1b7b1ab2ac11Aa">Flanagan, 2004</a>).</p><p>More and more scientists are thinking about the translational value of their work. Sociologists apply the social concept of structural holes to understand scientific collaboration, and educators use literature as a scaffolding technique to enhance active learning. Transdisciplinary collaboration among materials scientists, immunologists, and bioengineers has produced an implantable vaccine depot, built from a polymer matrix, that can kill cancer cells and prolong survival, with significant impact on the well-being of society (<a ref-type="bibr" href="#j_jdis.201622_ref_002_w2aab2b8b3b1b7b1ab2ab2Aa">Ali et al., 2009</a>). Publications and open datasets are ideal instruments for studying the success of translational endeavors to further advance scientific innovation.</p></sec><sec id="j_jdis.201622_s_004_w2aab2b8b3b1b7b1ab1b5Aa"><div>Transparent Analytics</div><p>The process of scientific endeavors, from data curation and analysis to discovery, should be transparent and easily accessible to every researcher so that replication can be easily done and the derived knowledge can be clearly interpreted (<a ref-type="bibr" href="#j_jdis.201622_ref_008_w2aab2b8b3b1b7b1ab2ab8Aa">Editorial, 2009</a>). Promoting transparency in science is crucial to ensure the reusability of knowledge, avoid reinventing the wheel, and make scientific discovery dedicated. 
Research, both quantitative and qualitative, is experiencing a methodological revolution (<a ref-type="bibr" href="#j_jdis.201622_ref_021_w2aab2b8b3b1b7b1ab2ac21Aa">Moravcsik, 2014</a>). Every researcher should make their work completely transparent to fellow scholars, and the process from data to conclusions should be interpretable and reproducible.</p><p>In recent years, the American Political Science Association (APSA) formally established transparency standards for qualitative and quantitative research by reinforcing the ethical obligation of researchers to facilitate the evaluation of their evidence-based knowledge claims through data access, production transparency, and analytic transparency. APSA proposed a new way of citing references called “active citation,” which suggests that any citation in a scholarly publication should be annotated with an explanation of how the citation supports the knowledge claim and should include a hyperlink to an excerpt (ca. 50–100 words) from the original source. These active citations can be located in a “transparent appendix” at the end of the document so that the path from data to conclusions is only one click away for researchers. This can foster healthy scholarship by actively engaging researchers in establishing rigorous research ethics and in criticizing, evaluating, and extending fellow scholars’ research. Provenance has been introduced to data and workflows in scientific research to provide the detailed documentation that enables scientific reproducibility. The World Wide Web Consortium has recommended a standard representation for provenance that is both human readable and machine understandable (<a ref-type="bibr" href="#j_jdis.201622_ref_012_w2aab2b8b3b1b7b1ab2ac12Aa">Groth &amp; Moreau, 2013</a>). 
Transparency must be considered essential and achieved through active citation and provenance to further advance transparent sciences.</p></sec><sec id="j_jdis.201622_s_005_w2aab2b8b3b1b7b1ab1b6Aa"><div>Connecting Intelligence</div><p>Machines taking their full place at the table of data-driven discovery is a significant step; these new participants make possible what was unimaginable 20 years ago. With machines, it is now possible to systematically collect, interdigitate, analyze, and disseminate publications and data in ways that will greatly impact the tradition of conducting research while providing powerful new resources that significantly advance the progress of both theoretical and applied research. Further, machines can be used to discover new knowledge and afford breakthroughs in current vexing research questions that can only be answered through transdisciplinary innovations.</p><p>The ever-increasing success in applying full-text indexing, taxonomies, and ontologies has dramatically improved the categorization and discovery of related content (<a ref-type="bibr" href="#j_jdis.201622_ref_024_w2aab2b8b3b1b7b1ab2ac24Aa">Song et al., 2013</a>). The movie <italic>The Imitation Game</italic> has rekindled the memory of Alan Turing’s achievements in machine intelligence (<a ref-type="bibr" href="#j_jdis.201622_ref_030_w2aab2b8b3b1b7b1ab2ac30Aa">You, 2015</a>). In the current data-enriched era, it may be the right time to revisit machine intelligence and connect it with human intelligence. The next generation of artificial intelligence researchers is proposing a new Turing Championship to develop machines with a deeper understanding of the world (e.g. machine comprehension of grammatically ambiguous sentences, machine storytelling from pictures, and machine “humanness” that enables non-disruptive communication between machine and human).</p><p>The teamwork of machines and humans can make machines smarter and humans more efficient. 
The industrial revolution (driven mainly by the steam engine) bent the curve of human history in the 19<sup>th</sup> century, freeing humans from physical labor and making modern mass production possible. Now, the so-called Second Machine Age will soon bend the curve of human history again by freeing humans from mental labor. This will trigger massive innovation that brings science fiction into reality, as these innovations are generated not only by humans but also by machines.</p><p>The combination of human and machine power can bring about new capabilities to compile, review, and mash up related research entities and receive alerts on their activities and interactions, perhaps reaching a scale that was unimaginable 15 years ago. Much like the recent debut of driverless cars, distant scientific dreams could be realized in just a few years, demonstrating the power of current progress in data and machines (<a ref-type="bibr" href="#j_jdis.201622_ref_005_w2aab2b8b3b1b7b1ab2ab5Aa">Brynjolfsson &amp; McAfee, 2014</a>). In the new world of scholarly analytics, attention to, and extraction of, deeply buried content and findings are the pathways to golden discoveries. Gradually, advances in information technologies, such as the advent of open access, Linked Open Data, semantic publishing, and open science, will make it possible to gather, annotate, and acquire related publications and other data sources and from those discover related content, findings, and conclusions. This could lead to the sudden discovery of unanticipated correlations and connections within an incredibly large and expanding research corpus. We are working on one of the oldest and toughest challenges associated with the combination of computer and human intelligence. 
The combinatorial innovation of human and machine intelligence will allow us to connect the dots for things that have been disconnected and accomplish through research what has been unimaginable, allowing us to dig the canal to connect data with knowledge.</p></sec></div>
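
The literature-mining idea discussed under Knowledge Discovery, inferring an unstated association from two sets of articles that do not reference each other, can be illustrated with a toy sketch. It is not the method of any work cited here: every term pair is a hypothetical placeholder rather than a real finding, and the function name `infer_hidden_links` is an assumption made for illustration. The sketch proposes a link from A to C whenever some intermediate term B is associated with both and no article already reports A with C directly.

```python
from collections import defaultdict

# Hypothetical associations "documented" in two separate bodies of literature.
# Every term below is a placeholder, not a real finding.
LITERATURE_ONE = [("drug_A", "mechanism_B1"), ("drug_A", "mechanism_B2")]
LITERATURE_TWO = [
    ("mechanism_B1", "disease_C"),
    ("mechanism_B2", "disease_C"),
    ("mechanism_B2", "disease_D"),
]
# Pairs already reported together, so they are not novel hypotheses.
KNOWN_DIRECT = {("drug_A", "disease_D")}


def infer_hidden_links(a_to_b, b_to_c, known):
    """Return {(a, c): sorted bridging b terms} for pairs not yet reported."""
    # Index the second literature by its intermediate term.
    b_index = defaultdict(set)
    for b, c in b_to_c:
        b_index[b].add(c)

    # Chain A-B with B-C, skipping pairs that are already known.
    bridges = defaultdict(set)
    for a, b in a_to_b:
        for c in b_index.get(b, ()):
            if (a, c) not in known:
                bridges[(a, c)].add(b)
    return {pair: sorted(terms) for pair, terms in bridges.items()}


if __name__ == "__main__":
    links = infer_hidden_links(LITERATURE_ONE, LITERATURE_TWO, KNOWN_DIRECT)
    for (a, c), via in links.items():
        print(f"candidate hypothesis: {a} -> {c} via {', '.join(via)}")
```

A candidate produced this way is only a hypothesis to be tested; in practice the associations would have to be extracted from text and ranked, which this sketch leaves out.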

Data-driven Discovery: A New Era of Exploiting the Literature and Data

Department of Information and Library Science

Journal of Data and Information Science

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.

