Alzheimer’s disease (AD) is a neurodegenerative disorder that causes dementia. In its early stage, people cannot remember recent events; they gradually have more difficulty managing daily tasks and even recognizing family and friends. AD is to date incurable (Alzheimer’s Association, 2015), although a great deal of resources is spent on treating symptoms. The large number of sufferers, from mild to severe cases, creates a high social impact and tremendous costs for medical care that manages symptoms rather than offering a cure. It is estimated that 4.7 million Americans have dementia, and the total number in 2050 is projected to be 13.8 million (Hebert et al., 2013). The economic costs of dementia were estimated to be $818 billion in 2015 (Prince et al., 2015). Because the disease is incurable, the costs borne by several sectors of society, such as long-term care, home services, and nonprofessional caregivers, are greater than the cost of direct medical care (Bullock, 2004; Winblad et al., 2016; Yokoyama et al., 2016). Given the severity of dementia and the increasing number of people suffering from it, it is very important to promote further research on AD.
The current information available on the disease, however, is actually overwhelming. For example, simply searching “Alzheimer” in PubMed brings up 90,000+ articles
The machine reading technique of data mining is based on the idea that machines can integrate and summarize information for humans by reading and understanding large amounts of text (Hirschberg & Manning, 2015). Previous work on machine reading and AD used computational techniques such as topic modeling to read a large number of papers and extract their main points, but its general purpose was to obtain an overview (Hughes et al., 2014; Lee et al., 2015; Song, Heo, & Lee, 2015; Sorensen, 2009; Sorensen, Seary, & Riopelle, 2010). Indeed, in order to understand the domain of AD, we first need an overview that helps to identify key details, such as the major topics in the literature. This in turn helps determine what specific information is needed. For instance, unless we are aware that some papers within the AD literature discuss AIDS, we cannot ask how AIDS is related to AD (addressed in Section 4.2). This study first uses a topic modeling technique (Blei, Ng, & Jordan, 2003) to obtain the major topics within the AD literature.
Once the overview is obtained, specific questions are addressed with a focus on how AD is connected to AIDS/HIV. Answering these questions requires techniques to read the content of related papers, recognize key entities mentioned in the text, and identify the relations among these entities. Gathering entities and relations from texts is called information extraction (IE), which conventionally assumes pre-defined target information. For example, extracting gene-disease interactions information is a common IE task, but it requires that the target genes, diseases, and types of expected interactions among them are already identified. Yet it is not advisable to limit the types of information being sought in advance, as this weeds out potentially significant topics or details that can be helpful to the search. Moreover, even after the information being sought is fixed, it can change frequently depending on how we have understood the literature at that point, or due to shifts in perspectives on the topic. For example, we might be interested in which pathway contains a particular gene after a gene-disease extractor found that the gene is associated with AD. In this case, we need to build the pathway-gene extractor again if conventional IE techniques are employed. Therefore, an IE technique is used that does not require pre-defined targets, called open information extraction (Open IE) (Fader, Soderland, & Etzioni, 2011).
Open IE is an information extraction technique applied in natural language processing (Fader, Zettlemoyer, & Etzioni, 2014; Mausam, 2016) that gathers facts in the form of triples
The two methods of topic modeling and Open IE are complementary, in that Open IE answers key questions raised by topic modeling overviews. Topic modeling, specifically latent Dirichlet allocation (LDA), has been applied to a wide range of fields to reveal hidden topics in textual data (DiMaggio, Nag, & Blei, 2013; Hall, Jurafsky, & Manning, 2008; Hu et al., 2015). It is often difficult to make sense of each topic, however, even with substantial domain knowledge. This is because LDA outputs each topic only as a distribution over terms, and does not provide information on how the terms in a topic are linked together. In contrast, Open IE can indicate how terms are specifically linked in texts. It was developed as a natural language processing (NLP) technique, and is applied to NLP tasks such as question answering. Yet when trying to understand a large collection of texts, searchers do not always have specific questions to ask in advance. An overview is therefore needed to identify specific questions. Combining LDA and Open IE is complementary, as LDA provides the overview, which helps to infer specific questions that are then answered by Open IE.
The rest of this paper is organized as follows. Section 2 briefly summarizes the key related work. Section 3 describes the proposed machine reading approach using the two distinct methods of LDA and Open IE to better comprehend large collections of literature both at the overview and specific levels. Section 4 presents the results of this approach for the medical domain of Alzheimer’s disease, whose related papers are far beyond what a single researcher can read. Due to the high social and fiscal impact of the disease, the need for further research is urgent. Finally, Section 5 concludes the paper.
Computational techniques have been extensively used to understand scientific domains, and their application to AD has also gathered a great deal of attention (Hughes et al., 2014; Lee et al., 2015; Song, Heo, & Lee, 2015; Sorensen, 2009; Sorensen et al., 2010). Sorensen (2009), for instance, investigated the productivity and impact of the top 100 AD researchers using citation analysis, and identified the role of AD within the field of neurodegenerative diseases. In defining an AD-specific
Obtaining specific information from texts has been well studied as information extraction in natural language processing. Information extraction (IE) is the task of automatically extracting structured information from texts. For example, many IE systems can extract entities such as genes, diseases, and drugs, and the relations between them, from the medical literature (e.g. Song et al., 2015). However, because these systems require predefined relations, they are not effective when the relations of interest are not defined in advance. Open IE systems (Mausam, 2016) overcome this issue by using raw textual phrases as relations, and have been applied to several NLP tasks such as question answering (Fader et al., 2014). To the best of our knowledge, this paper is the first to use Open IE in combination with topic modeling methods to understand a large body of literature in a specific medical domain.
The methodology of this paper is visually summarized in Figure 1. First, we collected literature with a focus that can be specified by key terms, specific periods, and/or target journals. After consulting domain experts, we collected a set of PubMed papers relevant to AD. The query was performed in October 2015.
Based on the methods shown above, we obtained 1,469,008 triples and organized them in a relational database so that they can be traced back to a specific sentence in a paper. The extracted triples For the details on calculating topic popularity, see Chen et al. (2017).
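The storage step described above can be sketched as follows. This is a minimal illustration only: the table names, column names, the helper functions `add_triple` and `trace`, the placeholder paper identifier, and the example sentence are all assumptions made for this sketch, not the schema or data actually used in the study.

```python
import sqlite3

# Hypothetical schema: one row per sentence, one row per extracted triple.
# Each triple keeps a foreign key to its sentence so it can be traced back.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sentence (
    id   INTEGER PRIMARY KEY,
    pmid TEXT NOT NULL,      -- paper identifier (placeholder values below)
    text TEXT NOT NULL
);
CREATE TABLE triple (
    id          INTEGER PRIMARY KEY,
    subject     TEXT NOT NULL,
    relation    TEXT NOT NULL,
    object      TEXT NOT NULL,
    sentence_id INTEGER NOT NULL REFERENCES sentence(id)
);
""")


def add_triple(conn, pmid, sentence_text, subject, relation, obj):
    """Store a sentence and one triple extracted from it."""
    cur = conn.execute(
        "INSERT INTO sentence (pmid, text) VALUES (?, ?)",
        (pmid, sentence_text),
    )
    conn.execute(
        "INSERT INTO triple (subject, relation, object, sentence_id) "
        "VALUES (?, ?, ?, ?)",
        (subject, relation, obj, cur.lastrowid),
    )


def trace(conn, subject, relation, obj):
    """Return (pmid, sentence) pairs that support a given triple."""
    return conn.execute(
        "SELECT s.pmid, s.text FROM triple t "
        "JOIN sentence s ON s.id = t.sentence_id "
        "WHERE t.subject = ? AND t.relation = ? AND t.object = ?",
        (subject, relation, obj),
    ).fetchall()


# Made-up example row; "PMID-0001" is a placeholder, not a real paper.
add_triple(
    conn,
    "PMID-0001",
    "Huntington disease is an inherited neurodegenerative disorder.",
    "Huntington disease",
    "is",
    "an inherited neurodegenerative disorder",
)
sources = trace(
    conn, "Huntington disease", "is", "an inherited neurodegenerative disorder"
)
```

Keeping sentences in their own table means that a triple extracted many times still points to each supporting sentence, which is what makes manual inspection of an extraction possible.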
Table 1. MeSH LDA results ranked by popularity.

| Year | 1st topic | 2nd topic | 3rd topic | 4th topic | 5th topic |
|---|---|---|---|---|---|
| All (1945–2015) | Huntington disease | Mental disorders | Tau proteins | Aging | Creutzfeldt-Jakob syndrome |
| | Parkinson disease | Caregivers | Amyloid beta-protein precursor | Cognition | Apolipoproteins E |
| | Neurons | Dementia, vascular | Neurodegenerative diseases | Cholinesterase inhibitors | Magnetic resonance imaging |
| | Cerebral cortex | AIDS dementia complex | Brain diseases | Memory disorders | Nursing homes |
| | Nerve tissue proteins | Schizophrenia | Amyloid | Neuropsychological tests | Genetic predisposition to disease |
| 1995–2004 | Creutzfeldt-Jakob syndrome | Amyloid beta-protein precursor | Parkinson disease | Cholinesterase inhibitors | AIDS dementia complex |
| | Apolipoproteins E | Neurons | Caregivers | Dementia, vascular | Aging |
| | Huntington disease | Membrane proteins | Neuropsychological tests | Memory disorders | Peptide fragments |
| | Tau proteins | Nerve tissue proteins | Cognition | Nootropic agents | HIV-1 |
| | Magnetic resonance imaging | Neurodegenerative diseases | Memory | Schizophrenia | HIV infections |
| 2005–2014 | Neurons | Aging | Cognition | Parkinson disease | Cholinesterase inhibitors |
| | Peptide fragments | Neuropsychological tests | Neurodegenerative diseases | Caregivers | Neuroprotective agents |
| | Tau proteins | Magnetic resonance imaging | Mental disorders | Amyloid | Dementia, vascular |
| | Amyloid beta-protein precursor | Memory disorders | Nursing homes | Creutzfeldt-Jakob syndrome | Frontotemporal dementia |
| | Huntington disease | Memory | Amyotrophic lateral sclerosis | Depression | Amyloid precursor protein secretases |
A basic characteristic of LDA is that it represents each topic as a distribution over terms. This means the first term in a topic is its most representative term. Moreover, LDA can represent each paper as a distribution over topics, which enables the ranking of topics by popularity. Based on these characteristics, the popularity observations for MeSH terms and for genes are presented in turn.
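The two characteristics above can be illustrated with a small sketch. The popularity measure used here, the mean topic proportion across papers, is an assumption chosen for illustration; the measure actually used in this study is described in Chen et al. (2017). The function names and all numbers are likewise made up for the example.

```python
def rank_topics_by_popularity(doc_topic):
    """Rank topics by their mean proportion across papers.

    doc_topic: list of per-paper topic distributions (each row sums to 1).
    Returns (ranking, popularity), where ranking lists topic indices from
    most to least popular. Mean proportion is an assumed popularity measure.
    """
    n_docs = len(doc_topic)
    n_topics = len(doc_topic[0])
    popularity = [
        sum(doc[k] for doc in doc_topic) / n_docs for k in range(n_topics)
    ]
    ranking = sorted(range(n_topics), key=lambda k: popularity[k], reverse=True)
    return ranking, popularity


def top_terms(topic_term, k, n=5):
    """Return the n most probable terms of topic k (most representative first)."""
    dist = topic_term[k]
    return sorted(dist, key=dist.get, reverse=True)[:n]


# Toy inputs with invented probabilities: three papers, three topics.
doc_topic = [
    [0.2, 0.7, 0.1],
    [0.1, 0.6, 0.3],
    [0.5, 0.4, 0.1],
]
topic_term = [
    {"Caregivers": 0.5, "Nursing homes": 0.3, "Aging": 0.2},
    {"Tau proteins": 0.6, "Neurons": 0.3, "Amyloid": 0.1},
    {"Memory": 0.4, "Cognition": 0.35, "Depression": 0.25},
]

ranking, popularity = rank_topics_by_popularity(doc_topic)
most_popular_terms = top_terms(topic_term, ranking[0], n=2)
```

In this toy input the second topic has the largest mean proportion, so it is ranked first and its top terms are listed in order of probability.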
In MeSH term topic modeling, Huntington’s disease (HD) always appears in the first topic, whether the ranking covers all years or only the last or second-to-last decade (Table 1). This means HD has consistently held a certain popularity within the AD literature. Moreover, Creutzfeldt-Jakob syndrome (CJS), which like AD is a neurodegenerative disease, was found to be popular in the period 1995–2004 but not more recently. This is because the disease is the first word in the first topic of that period, whereas in 2005–2014 it appears only in the fourth topic.
In gene topic modeling, APP and APOE are always the most popular genes, as they appear as top words in either the first or second topic, regardless of time periods (Table 2). Other than the first or second topic, HTT became popular in recent periods as it appears as the top term in the third topic in 2005–2014, while it appears as the top term in the fifth topic in 1995–2004.
Table 2. Gene LDA results ranked by popularity.
Columns: Year, 1st topic, 2nd topic, 3rd topic, 4th topic, 5th topic
All (1945–2015): INS MS MAPT TNF BCHE PRNP SDS BDNF CAT PSEN1 GFAP SST BACE1 CA3 CA1 ALB GRN NGF CA1 MDD1 PSD
1995–2004: PRNP TNF MS BDNF INS PSEN1 CA1 CD4 ACT NGF GFAP SDS SPY A2M TF MAPT ALB PSEN1 LDLR TTR BCHE CA3
2005–2014: MS PSEN1 INS GFAP PRNP BDNF TNF CD4 CA1 MAPT CAT BACE1 BCHE ALB ACE NOTCH3 NGF SYP PSEN2
Another characteristic of LDA is that highly co-occurring terms constitute a topic. Moreover, it sometimes distinguishes the term co-occurrences in different contexts by having the same term in multiple topics. From these characteristics, observations of a term in different contexts for MeSH terms and genes are presented, respectively.
In MeSH terms, CJS appears in three different topics: the first topic in the period 1995–2004, the fourth topic in the period 2005–2014, and the fifth topic in all years. We examined the co-occurring terms: magnetic resonance imaging (MRI) co-occurs with CJS in the 1995–2004 topic, whereas the term caregivers co-occurs with it in 2005–2014. MRI is a brain-imaging technique, while a caregiver is a person who takes care of a patient suffering from the disease. This observation indicates that CJS can be studied in two contexts: brain imaging research and patient care. Interestingly, when investigating the topic in all years, we find that the two contexts are merged into one topic, because it contains both MRI and nursing homes, where patients are given care.
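The lookup behind this observation can be sketched as below. The topic lists are transcribed from Table 1 under the assumption that each topic is read as one column of the table; only four topics are included, and the function name `topics_containing` is hypothetical.

```python
# Topics keyed by (period, rank). Term lists are transcribed from Table 1,
# assuming each topic corresponds to one table column (an assumption).
topics = {
    ("All", 1): [
        "Huntington disease", "Parkinson disease", "Neurons",
        "Cerebral cortex", "Nerve tissue proteins",
    ],
    ("All", 5): [
        "Creutzfeldt-Jakob syndrome", "Apolipoproteins E",
        "Magnetic resonance imaging", "Nursing homes",
        "Genetic predisposition to disease",
    ],
    ("1995–2004", 1): [
        "Creutzfeldt-Jakob syndrome", "Apolipoproteins E",
        "Huntington disease", "Tau proteins", "Magnetic resonance imaging",
    ],
    ("2005–2014", 4): [
        "Parkinson disease", "Caregivers", "Amyloid",
        "Creutzfeldt-Jakob syndrome", "Depression",
    ],
}


def topics_containing(term, topics):
    """Map each topic that contains `term` to its other (co-occurring) terms."""
    return {
        key: [t for t in terms if t != term]
        for key, terms in topics.items()
        if term in terms
    }


cjs_contexts = topics_containing("Creutzfeldt-Jakob syndrome", topics)
```

Comparing the co-occurring terms across the returned topics is what separates the imaging context (MRI) from the care context (caregivers, nursing homes).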
We also found that APP and APOE appear in multiple topics and at multiple ranks among genes. For example, APP/APOE is always the top gene in the first and second topics, but also appears as the fifth gene in the third topic in all years, and as the second and third genes in the fifth topic in 2005–2014 (Table 2). This observation indicates that APP and APOE can be studied in a context where each gene is itself the key to the topic, but also in a context where it is secondary to the topic.
This section demonstrates the power of Open IE by answering AD-specific questions that arise from the LDA results and their interpretation. For example, a topic related to AIDS/HIV is found within the AD literature, and Open IE can indicate how AIDS/HIV is actually related to AD, as shown later in this section. Open IE can also answer more basic questions, such as the definition of terms, which could help researchers with limited knowledge of AD (e.g. information scientists or scientometricians who expect to study AD from the literature) to understand the domain from the literature alone. For example, the results discussed in Section 4.1 cannot be interpreted without knowing what HD, CJS, MRI, APP, and APOE are. Therefore, we first show how Open IE can answer these simple questions in order to understand these basic terms.
We first use the example of Huntington’s disease (HD), which in the previous section was observed to consistently hold a certain popularity within the AD literature. The immediate question “What is Huntington’s disease?” can be answered by searching for triples with the pattern <Huntington disease, is, ?x>, where ?x is a variable that matches any phrase. In this case, 446 distinct triples were found. The two most frequent answers and three other randomly sampled results are shown in Table 3. The number in parentheses is the number of triples that matched the pattern. Now it can be discovered that HD is
Table 3. Question answering example: What is Huntington’s disease?

| Question | Answer example |
|---|---|
| What is Huntington’s disease? | an inherited neurodegenerative disorder (44) |
| | a neurodegenerative disorder (28) |
| | a hereditary brain disease (2) |
| | an incurable genetic neurodegenerative disorder (1) |
| | a complex, single gene (1) |
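A pattern query such as <Huntington disease, is, ?x> can be sketched as a filter over stored triples followed by a frequency count. This is a minimal sketch under stated assumptions: the function names are hypothetical, matching is by exact string (real extractions would likely need normalization of case and spelling variants), and the toy triples and their counts are invented, not the counts reported in Table 3.

```python
from collections import Counter


def match(triples, subject=None, relation=None, obj=None):
    """Return triples matching a pattern; None plays the role of a variable (?x)."""
    return [
        (s, r, o)
        for s, r, o in triples
        if (subject is None or s == subject)
        and (relation is None or r == relation)
        and (obj is None or o == obj)
    ]


def answer_counts(triples, subject, relation):
    """Count the distinct objects bound to ?x in <subject, relation, ?x>."""
    objects = (o for _, _, o in match(triples, subject, relation))
    return Counter(objects).most_common()


# Toy triples with invented frequencies.
toy_triples = (
    [("Huntington disease", "is", "an inherited neurodegenerative disorder")] * 3
    + [("Huntington disease", "is", "a neurodegenerative disorder")] * 2
    + [("Huntington disease", "is", "a hereditary brain disease")]
    + [("Alzheimer disease", "is", "a neurodegenerative disorder")]
)

answers = answer_counts(toy_triples, "Huntington disease", "is")
```

Sorting answers by frequency puts widely repeated definitions first, while rare answers at the tail are the ones most worth inspecting manually.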
Extraction from texts does not always result in correct data, so some manual inspections are required. Limitations of this approach include the wrong answer of
The same approach is also applied herein to find definitions of CJS, MRI, APP, and APOE (Table 4). These definitions are complementary in allowing more LDA results to be interpreted. It can be confirmed that, for example, APP and APOE are strongly related to AD. This fact coincides with the observation in Section 4.1, that they are always the most popular genes.
Table 4. What are CJS, MRI, APP, and APOE?

| Term | Definition |
|---|---|
| AD | a neurodegenerative disorder |
| AD | a genetically complex and heterogeneous disorder |
| CJS | a rare neurodegenerative disease |
| CJS | a fatal neurodegenerative illness |
| CJS | an incurable disease |
| MRI | a useful diagnostic marker |
| MRI | a promising AD biomarker |
| MRI | the most widely used and less invasive medical imaging technique |
| APP | a transmembrane glycoprotein |
| APP | an extremely complex molecule |
| APP | |
| APOE | the major apolipoprotein |
| APOE | |
| APOE | the most prevalent and best established genetic risk factor for late-onset AD |
This study also addresses questions more interesting than term definitions. Suppose we know that HD, like AD, is a neurodegenerative disease. The next natural question is, “How is AD related to HD?” We first queried <AD, ?x, HD>, but could not find relevant relations. We then extended the pattern to <AD, ?r1, ?x> & <?x, ?r2, HD>, which is equivalent to finding two-step paths from node AD to node HD on a directed graph. It is also possible to find paths the other way, from HD to AD, which gives similar results. The query resulted in 408 paths with 61 distinct middle nodes, some of which are shown in Figure 2. It can be observed that AD and HD share some symptoms such as
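The two-step pattern <AD, ?r1, ?x> & <?x, ?r2, HD> can be sketched as a path search on a directed graph whose edges are triples. The function name, the placeholder relation labels (`r1` to `r6`), and the placeholder middle nodes (`x1` to `x3`) are assumptions for illustration and carry no medical meaning.

```python
from collections import defaultdict


def two_step_paths(triples, source, target):
    """Find all paths source -r1-> x -r2-> target, treating triples as edges."""
    out_edges = defaultdict(list)
    for s, r, o in triples:
        out_edges[s].append((r, o))

    paths = []
    for r1, middle in out_edges.get(source, []):
        for r2, end in out_edges.get(middle, []):
            if end == target:
                paths.append((source, r1, middle, r2, target))
    return paths


# Placeholder triples: x1..x3 stand for arbitrary extracted phrases.
toy_triples = [
    ("AD", "r1", "x1"),
    ("x1", "r2", "HD"),
    ("AD", "r3", "x2"),
    ("x2", "r4", "HD"),
    ("AD", "r5", "x3"),
    ("x3", "r6", "other"),
]

paths = two_step_paths(toy_triples, "AD", "HD")
middle_nodes = {path[2] for path in paths}
```

Swapping the `source` and `target` arguments gives the search in the opposite direction, from HD to AD.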
Another important observation from the LDA results relates to AIDS and HIV, which are very different diseases (HIV is common to AIDS, in that all people with AIDS generally have HIV, but is not the full-blown AIDS disease in terms of symptoms and treatment). A domain expert working on AD brain imaging was also consulted, but he did not know how HIV/AIDS is related to AD. As in the HD example, triples were therefore queried with <AD, ?r1, ?x> & <?x, ?r2, HIV>. The results shown in Figure 3 confirm meaningful facts: HIV actually has similar symptoms to AD such as
Some questions need additional resources to answer. For example, this study confirmed that the APOE gene is strongly correlated with AD (as seen in the previous section), and we are now interested in other genes that are also highly correlated with AD. The natural way to find answers is to search for triples with the pattern <?x, correlated with, AD> & <?x, is, gene>. But this query gave no results, because researchers rarely write a sentence of the form “[a gene name] is a gene.” To solve this issue, we used an additional resource of NCBI human genes
Some questions raised by the topic modeling results cannot be answered by Open IE. These are “why” questions, because Open IE simply extracts facts mentioned in the papers. For example, we observed that CJS was popular in the period 1995–2004, but not in 2005–2014. Open IE, however, cannot answer why this change happened. A doctoral student working on AD whom we consulted suggested that the reason might be that Stanley B. Prusiner was awarded the Nobel Prize in 1997 for his discovery of prions, the pathogen of CJS, which created an upsurge in interest. However, this cause-and-effect relation cannot be verified, as Open IE can by no means infer it.
This paper provides a case study of using the machine reading method to understand the domain of Alzheimer’s disease (AD) and its relation to other diseases such as HIV and AIDS. AD is a field in which the number of related papers is overwhelmingly high, while there is a vital need for further research that may help find the causes of the disease as well as a cure. We demonstrate that machine reading helps identify specific information and offers a better understanding, guided by the overviews that topic modeling provides. Using LDA and Open IE in a mutually complementary way, we found that topic modeling connects AD with HIV/AIDS. Based on this observation, querying the Open IE extractions shows that the two diseases have different mechanisms but share some symptoms such as dementia.
This study has several implications. First of all, it shows that the literature on a topic can itself be used to answer specific questions about that topic, which has not been attempted to date. From the perspective of Alzheimer’s disease, the approach provided in this article could help domain experts find important relations between entities, in the same way that this study identified relations between AD and HIV/AIDS. Methodologically, this approach can serve as a preliminary knowledge extraction step for literature-based knowledge discovery if future researchers hope to construct a curated knowledge base for a specific purpose.
One limitation of this approach is that the data must be cleaned manually, for example by removing false extractions. Moreover, this approach cannot answer abstract questions whose answers are not written explicitly in the texts. In the future, it would be helpful to develop a method that detects false extractions automatically. We could also integrate existing medical knowledge bases to answer more complex or nuanced questions.