The concept of Big Data is popular in a variety of domains. The purpose of this review was to summarize the features, applications, analysis approaches, and challenges of Big Data in health care. Big Data in health care has its own features, such as heterogeneity, incompleteness, timeliness and longevity, privacy, and ownership. These features bring a series of challenges for data storage, mining, and sharing to promote health-related research. To deal with these challenges, analysis approaches focusing on Big Data in health care need to be developed and laws and regulations for making use of Big Data in health care need to be enacted. From a patient perspective, application of Big Data analysis could bring about improved treatment and lower costs. In addition to patients, government, hospitals, and research institutions could also benefit from the Big Data in health care.
Big Data is the generic term for data sets of structured and unstructured data that are so large and complex that traditional software, algorithms, and data repositories are inadequate to collect, process, analyze, and store them (Asante-Korang & Jacobs, 2016; Jee & Kim, 2013; Khoury & Ioannidis, 2014; Tan, Gao, & Koch, 2015), and it has become an intensively studied area in recent years. With the development of the Internet, the mobile Internet, the Internet of things, social media, biology, finance, and digital medicine, the volume of data has increased dramatically. Big Data not only describes the large size of data, as its name suggests, but also implies rapid data processing ability and novel technologies and approaches for handling the data (Krumholz, 2014). Since the start of the 21st century, Big Data has gone through a series of evolutionary steps, and software suited to it has been developed. With the growth of information exchange, Big Data has expanded not only in size but also in the supporting data technology. Given its five main characteristics (volume, variety, velocity, variability, and veracity), state-of-the-art techniques, technologies, and equipment are required to deal with Big Data in correlation analysis, clustering analysis, modeling, prediction, and hypothesis verification. Thus, advanced hardware and software are required for data acquisition, extraction, processing, analysis, and storage. Currently, infrastructure for Big Data includes servers, storage systems, cloud services, and networking equipment. Software for Big Data includes parallel and distributed file systems, retrieval software, and data-mining software (Anderson & Chang, 2015).
The advanced analytical technologies developed for Big Data have driven its applications in many areas such as combating crime, business execution, finance, Global Positioning System (GPS), commerce, travel, urban informatics, meteorology, genomics, complex physics simulations, biology, environmental research, and health care (Chen, Mao, & Liu, 2014). Health care data are one of the driving forces of Big Data. With advanced data generation technology, the volume of data is increasing exponentially. For example, as shown by the Human Genome Project, completed in 2003, a single human genome occupies 100–150 gigabytes (Marx, 2013; O’Driscoll, Daugelaite, & Sleator, 2013). In terms of data size, Big Data in health care exceeded 150 exabytes after 2011 (Wang, Kung, Ting, & Byrd, 2015), and one study estimated that data size in health care would reach around 40 ZB in 2020, about 50 times the 2009 figure of 0.8 ZB (O’Driscoll et al., 2013) (Fig. 1A).
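The 50-fold figure follows directly from the two quoted sizes. The short sketch below reproduces that arithmetic and, under the added assumption of a constant yearly growth rate (an assumption made here for illustration, not a claim from the cited study), derives the implied annual growth of roughly 43%. The function name is hypothetical.

```python
# Sizes quoted in the text, in zettabytes (ZB).
SIZE_2009_ZB = 0.8
SIZE_2020_ZB = 40.0


def growth_summary(start_size, end_size, start_year, end_year):
    """Return (overall growth factor, implied compound annual growth rate).

    The constant-rate assumption is a simplification for illustration only.
    """
    factor = end_size / start_size
    years = end_year - start_year
    annual_rate = factor ** (1.0 / years) - 1.0
    return factor, annual_rate


factor, annual_rate = growth_summary(SIZE_2009_ZB, SIZE_2020_ZB, 2009, 2020)
print(f"growth factor: {factor:.0f}x, implied annual growth: {annual_rate:.1%}")
```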
In addition, as researchers continue to make progress in health care, the quantity of research literature has grown dramatically (Fig. 1B).
2 Major Types and Sources of Big Data in Health Care
Health care has become an important issue in developed countries and middle-income countries (Jee & Kim, 2013). Big Data in health care can be classified into four main types based on the data sources: Big Data in medicine, also called medical/clinical Big Data; Big Data in public health and behavior; Big Data in medical experiments; and Big Data in medical literature. Table 1 summarizes the major data types.
2.1 Big Data in medicine and clinics
Big Data in medicine and clinics includes the various types and large amounts of data generated by hospitals, such as clinical data and medical imaging. It is often closely associated with doctors and patients. In other words, Big Data in medicine is generated from historical clinical activities (Tsumoto, Hirano, & Iwata, 2013) and has significant effects on the medical industry. For instance, it can assist in planning treatment paths for patients, supporting clinical decision support (CDS), and improving health care technology and systems (Jee & Kim, 2013).
In the medical domain, Big Data comes from hospital information resources, surgeons’ work, anesthesia activities, physical examinations, radiography, magnetic resonance imaging (MRI), computed tomography (CT), patient information, pharmacy, treatment, medical imaging, and imaging reports (Tan et al., 2015; Wang & Alexander, 2013). These clinical activities generate a large number of records, including patient identification information, diagnoses, medication schemes, physicians’ notes, and sensor data (Tan et al., 2015; Wang & Alexander, 2013). The major data from clinical activities are the electronic health record (EHR)/electronic medical record (EMR), the personal health record (PHR), and medical images. An EMR comprises structured and unstructured data that contain all the medical activity information of a patient and is often used for treatment and treatment decisions, while an EHR holds health-related information for an individual, such as medical and financial information, that is closely related to that individual’s health care (L. Wang & Alexander, 2013; Wu et al., 2017). EHR and EMR differ in several respects. First, an EHR can be shared between different systems in different organizations (Heart, Ben-Assuli, & Shabtai, 2017; Joshi & Yesha, 2012; L. Wang & Alexander, 2013) and is the whole-life record of a patient from birth to death stored in the medical institution, while an EMR is the complete record of a patient’s disease stored in the hospital. Second, the EHR focuses on the health management of residents, while the EMR focuses on the clinical diagnosis of patients. Third, the EHR also contains demographics, medical history, medications and allergies, immunization status, laboratory test results, radiology images, vital signs, personal statistics, and billing information (M, 2014). Finally, the EMR is the record of a care delivery organization (CDO) and belongs to the CDO, while the EHR is a subset of the CDO’s record and belongs to the patients or stakeholders (Garets & Davis, 2007).
EHRs have been adopted by many countries and generated about 500 petabytes of data in 2012, a figure that was expected to reach 25,000 petabytes by 2020 (Feldman, Martin, & Skotnes, 2012).
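The distinction between the two record types can be made concrete with a small data-structure sketch. It assumes, as a simplification, that an EMR is held by a single organization while an EHR is a shareable, longitudinal record that collects contributions from several organizations. All class and field names are hypothetical and chosen only to mirror the items listed in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EMR:
    """Record of a patient's care held by one care delivery organization."""
    patient_id: str
    organization: str
    diagnoses: List[str] = field(default_factory=list)
    medications: List[str] = field(default_factory=list)
    physician_notes: List[str] = field(default_factory=list)


@dataclass
class EHR:
    """Longitudinal, shareable record for one individual (illustrative model)."""
    patient_id: str
    demographics: Dict[str, str] = field(default_factory=dict)
    allergies: List[str] = field(default_factory=list)
    immunizations: List[str] = field(default_factory=list)
    contributions: List[EMR] = field(default_factory=list)

    def add_emr(self, emr: EMR) -> None:
        """Attach one organization's record, refusing records of other patients."""
        if emr.patient_id != self.patient_id:
            raise ValueError("EMR belongs to a different patient")
        self.contributions.append(emr)

    def organizations(self) -> List[str]:
        """Organizations that have contributed to this record, sorted by name."""
        return sorted({emr.organization for emr in self.contributions})


ehr = EHR(patient_id="p-001", demographics={"birth_year": "1980"})
ehr.add_emr(EMR("p-001", "City Hospital", diagnoses=["I10"]))
ehr.add_emr(EMR("p-001", "Regional Clinic", medications=["metformin"]))
print(ehr.organizations())
```

Keeping the per-organization records intact inside the shared record is one possible design; a real system would more likely exchange selected fields through a standard such as HL7.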
Table 1. Summary of Major Data Types of Big Data in Health Care
|Data type||Data name||Data description||Data acquisition||Technology/database/system|
|Big Data in medicine and clinics||Electronic health record (EHR)/electronic medical record (EMR)||Standard data collection of medical and health information for patients that can be shared among different organizations (Gunter & Terry, 2005). Often comes from medical activities and public health data||Hospital information resources, surgeons’ work, anesthesia activities, physical examination, radiography, magnetic resonance imaging (MRI), computed tomography (CT), patient information, pharmacy, treatment, medical imaging, imaging reports, patient identification information, clinical diagnosis, medication schemes, physicians’ notes, sensor data (Belle et al., 2015; Wang & Alexander, 2013), patient demographics, clinic or inpatient notes, electronic reports||Medical record data exchange, standards: Health Level 7 (HL7), Continuity of Care Record (CCR), Continuity of Care Document (CCD), controlled medical vocabulary (CMV), computerized provider order entry (CPOE) (Valdes, Kibbe, Tolleson, Kunik, & Petersen, 2004; Garets & Davis, 2007), Allscripts, Epic Systems, Practice Fusion, NextGen Healthcare, clinical decision support systems, pharmacy management system, EMR Adoption Model (Wang & Alexander, 2013; Garets & Davis, 2007), NoSQL database, clinical data repository (CDR) (Garets & Davis, 2007)|
|Personal health record (PHR)||As its name suggests, the health-related data and information of patients (Tang, Ash, Bates, Overhage, & Sands, 2006), covering people’s lifelong health information. It is available for further use (Chen et al., 2012)||Allergies and adverse drug reactions, chronic diseases, family history, illnesses and hospitalizations, imaging reports, laboratory test results, medications and dosing, prescription records, surgeries and other procedures, vaccinations, and observations of daily living, as reported by patients (Rumsfeld, Joynt, & Maddox, 2016)||Cloud computing, Health Insurance Portability and Accountability Act (HIPAA), and HL7 (Chen et al., 2012); stored on paper, such as printed laboratory reports, copies of clinic notes, and health histories created by the individual; electronic devices such as personal computer-based software, CD, DVD, and smart card; web applications such as HealthVault and PatientsLikeMe; and cloud servers (Chen et al., 2012)|
|Medical images||Data that present visual information of the interior of the human body||X-ray, CT, histology, positron-emission tomography (PET), radiography, MRI, nuclear medicine, elastography, tactile imaging, photoacoustic imaging, echocardiography (Kovalev & Kalinovsky, 2015), ultrasonography, angiography||Statistical shape models (SSMs), medial models, clustering, active appearance models (AAMs), active shape models (ASMs) (Heimann & Meinzer, 2009), image segmentation algorithm, fuzzy C-means (FCM) algorithm (Zhang & Chen, 2004), image registration, picture archiving and communication systems (PACS), Super PACS, RIS, and Digital Imaging and Communications in Medicine (DICOM) (Luo, Wu, Gopukumar, & Zhao, 2016)|
|Electrocardiogram||Electrical graph recording the heartbeat activity of a person over a period of time, such as 1 minute||Electrocardiograph (ECG) signal||MIT-BIH Arrhythmia Database, American Heart Association (AHA) database, Common Standards for Electrocardiography database, ST-T database, Physikalisch-Technische Bundesanstalt (PTB), and Paroxysmal Atrial Fibrillation (PAF)|
|Big Data in public health and behavior||Vitals||Mainly refers to the four signs (temperature, pulse, respiratory rate, and blood pressure) and other physiological data collected outside the health-care setting (Rumsfeld et al., 2016)||Temperature, pulse, respiratory rate, and blood pressure||Mobile technology, portable equipment, wearable systems, and advanced devices such as smartphones with third-party applications (HealthKit from Apple, Google Fit from Google, and S Health from Samsung), Android watches and Google Glasses (Safavi & Shukur, 2014), and medical devices such as implantable cardioverter-defibrillators (Rumsfeld et al., 2016)|
|-omics data||Biology information data in the molecular-level catalog (Skotnes, 2012). Reflects characteristics of the individual for treatment (Rumsfeld et al., 2016)||Genomics, transcriptomics – whole genome sequencing, RNA-seq, metabolomics – nuclear magnetic resonance (NMR), mass spectrometry, proteomics – mass spectrometry, methylomics – pyrosequencing, and ChIP-on-chip||Data Analysis Tool Extension (DAnTE) and DanteR|
|Molecular biology experiment||Interaction and regulation of biological activity within cells, such as interactions between DNA, RNA, proteins, and biosynthesis||Molecular cloning, polymerase chain reaction (PCR), macromolecule blotting and probing, microarrays, and next-generation sequencing||NCBI|
|Human body samples||Data and samples of cells, tissues, and organs in human body (Bagayoko, Dufour, Chaacho, Bouhaddou, & Fieschi, 2010)||Cells, tissues, and organs||Mayo Clinic Biobanks (|
|Big Data in medical experiment||Clinical trials||Experiments for evaluating new medical treatment (e.g., drug, device) (Kanagaraj & Sumathi, 2012)||Drug efficacy, toxicity, new treatment devices, and procedures||ClinicalTrials.gov|
|Journal/ conference article||Research articles written by researchers||Pubmed.com, New England Journal of Medicine, Lancet, Nature, Science, and Cell||Website of journal articles, Google Scholar, and Science Citation Index (SCI)|
|Big Data in medical literature||Structured knowledge||MeSH and International Classification of Diseases 10th revision (ICD-10)||Database in MeSH||NCBI|
PHR comes from a variety of patient health and social information; its main role is as a data source for medical analysis and clinical decision support (Poulymenopoulou et al., 2015). It includes data on allergies and adverse drug reactions (ADRs), chronic diseases, family history, illnesses and hospitalizations, imaging reports, laboratory test results, medications and dosing, prescription records, surgeries and other procedures, vaccinations, and observations of daily living (ODLs). Unlike document or text data, medical imaging mainly comes from X-ray, CT, histology, PET, radiography, magnetic resonance imaging (MRI), nuclear medicine, ultrasound, elastography, tactile imaging, photoacoustic imaging, echocardiography, and so on. Because it contains visual elements, the data are usually very large (Kovalev & Kalinovsky, 2015).
2.2 Big Data in public health and behavior
Big Data in public health and behavior focuses on the physiological data of users, which are often collected by portable equipment (Yan, Qin, Fan, & Wang, 2014). It covers electrocardiograms, vitals, contagion, wearable devices, daily health records, sports, and diet.
An electrocardiogram is an electrical graph recording the heartbeat activity of a person over a period of time, e.g., 1 minute; the recording process involves placing electrodes on the skin. Vitals, short for vital signs, include temperature, pulse, respiratory rate, and blood pressure. These are the four most important signs of the body’s function. Wearable devices in public health are equipment that records details about people’s lifestyle and vitals, which can assist physicians in diagnosing and treating patients. Advanced devices such as smartphones with third-party applications (HealthKit from Apple, Google Fit from Google, and S Health from Samsung), Android watches, and Google Glasses have been developed with sensors for the health care area (Safavi & Shukur, 2014). Since people have become more concerned with their own health on a day-to-day basis, ODLs have come to play a key role in recording personal daily health and behavior, signs, and symptoms of patients (Backonja et al., 2012). Additionally, data on people’s sports and diet also contribute significantly to Big Data in public health and behavior. In the Apple iTunes store alone, there are more than 40,000 health care apps available (Aitken & Gauntlett, 2013). It was predicted that more than 1.7 billion people would have downloaded health care apps by 2017.
In terms of infectious diseases in public health, there is a well-known case in which Google successfully predicted the time and scale of an influenza outbreak by analyzing search engine queries.
2.3 Big Data in medical experiment
This part of Big Data mainly focuses on molecular biology, human body data set, clinical trials, biology samples, gene sequences, and clinical and medical research laboratory tests and “omics” data (Table 1).
Molecular biology, a vital part of both biological and medical experiments, focuses on the interaction and regulation of biological activities within cells, such as interactions between DNA, RNA, and proteins and their biosynthesis (Fenderson & Bruce, 2008). It is closely related to the fields of biochemistry and genetics in the study of proteins and genes (Lodish, 2008). The main techniques of molecular biology include molecular cloning, polymerase chain reaction (PCR), macromolecule blotting and probing, microarrays, and so on. Human body data sets include samples of cells, tissues, and organs of the human body, as well as the cross-sectional photographs of the human body in the Visible Human Project, which are used to visualize human anatomy in support of medical activities (Vesna, 2000). Similar to human body data sets, biological laboratory specimens also come from sampling of the human body and are stored in biorepositories. When a new drug, novel vaccine, or new medical device has been created, clinical trials should be conducted before it comes into use. A clinical trial, a kind of experiment or observation in medical or clinical research, is a procedure for evaluating the effectiveness of a new medical treatment through studies on human volunteers (DerSimonian & Laird, 1986). Gene sequencing, mainly referring to DNA sequencing, is a medical research activity that obtains the precise order of nucleotides within DNA. This process produces a large amount of data recording DNA sequences. Medical research is often performed by researchers in universities, research institutions, and industry. The objective of their work is to make breakthroughs in understanding the cellular, molecular, and physiological mechanisms of the human body for health care; its fundamental parts also include molecular biology, medical genetics, immunology, neuroscience, and psychology (Obenshain, 2004).
Omics data are the biology information data in the molecular level catalog, which include genomics, proteomics, metabolomics, transcriptomics, epigenomics, lipidomics, immunomics, glycomics, and RNomics (Wu et al., 2017).
2.4 Big Data in medical literature
As the medical/clinical area has developed, research articles and structured knowledge are now being produced at high speed. There are also many older materials in the medical/clinical area. This literature makes a significant contribution to Big Data in health care.
2.5 Hospital information system (HIS) and its evolution
Technologies for Big Data storage and processing, such as the Cassandra database, have been applied; the main characteristic of this tool is that it can accommodate about two million columns in one row, making it more convenient to deal with large volumes of data (Jee & Kim, 2013). In Big Data, including that in health care, Hadoop, an Apache project and one of the most popular processing tools, uses distributed processing to handle tremendous volumes of data (Asante-Korang & Jacobs, 2016; Jee & Kim, 2013). In terms of data management, data warehouses are used for supporting decision-making, online transaction processing (OLTP), and online analytical processing (OLAP) (Sheta & Eldeen, 2013). In addition, machine learning in data mining seems to be the most popular technological approach in Big Data analysis, and technologies such as retrieval, web mining, decision trees, support vector machines (SVMs), clustering, neural networks, network analysis, knowledge maps, Natural Language Processing (NLP), and Multi-Layer Perceptron (MLP) approaches have been used. For instance, named-entity recognition is one of the most important techniques in BioNLP, used to recognize particular entities in support of processes such as gene normalization and event extraction (Usami, Cho, Okazaki, & Tsujii, 2011). Various techniques for -omics data analysis have emerged with different usages, such as amplified fragment length polymorphism (AFLP) for DNA fingerprinting and interpretation, validation tools for -omics data (Hassani S, 2010), and the statistical tools data analysis tool extension (DAnTE) and data analysis tool extension R (DanteR) (Polpitiya et al., 2008; Taverner et al., 2012). In addition to the techniques in data processing, techniques for health care data have progressed in HISs.
For example, a typical system has been developed for data collection, data management, and data sharing in the hospital information system (HIS) (Abernethy, Wheeler, & Bull, 2011). Currently, new technologies and new models have been found to be effective for structured and unstructured Big Data in health care. Data mining, as well as NLP, has been incorporated in the Big Data platform to handle complex research-oriented problems.
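The distributed processing idea behind tools such as Hadoop can be illustrated with the map, shuffle, and reduce steps of the MapReduce model. The sketch below runs all three steps in a single Python process over a few synthetic records; it does not use the Hadoop API, and the record layout and function names are assumptions made for illustration. In a real cluster, each partition would be mapped on a different node.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit a (diagnosis_code, 1) pair for every code in one record."""
    return [(code, 1) for code in record["codes"]]


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values emitted for one key into a single count."""
    return key, sum(values)


def count_codes(partitions):
    """Count diagnosis codes across partitions of records."""
    mapped = chain.from_iterable(
        map_phase(record) for part in partitions for record in part
    )
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Synthetic records split into two partitions; the ICD-10 codes are examples only.
partitions = [
    [{"codes": ["E11", "I10"]}, {"codes": ["I10"]}],
    [{"codes": ["E11"]}, {"codes": ["J45", "I10"]}],
]
code_counts = count_codes(partitions)
print(code_counts)
```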
As a sociotechnical subsystem, HIS commonly provides a quality community for historical data resources, information, and knowledge in health care, supporting hospital administration and patient health care (Bagayoko & Dufour, 2010; Kanagaraj & Sumathi, 2011; Roberts, 1985; Tsumoto et al., 2013) (Table 2). HIS was developed only for administrative management in the early 1960s and gradually expanded to information management after 1970 (Pai & Huang, 2011). Broadly speaking, there are many types of HIS. For instance, PACS, short for picture archiving and communication systems, is a common HIS for storing and transferring digital images (Joshi & Yesha, 2012). Additionally, the laboratory information system (LIS), radiology information system (RIS), ultrasound information system (UIS), and the EHR, EMR, and PHR systems are also included (He, Jin, Zhao, & Xiang, 2010; Joshi & Yesha, 2012).
Table 2. Systems for Acquiring Medical/Clinical Big Data
|HIS||Hospital information system; the system provides a quality community for historical data resources, information, and knowledge in health care for hospital administration and patient health care (Bagayoko et al., 2010; Kanagaraj & Sumathi, 2011; Sirintrapun & Artz, 2016; Tsumoto, Hirano, & Iwata, 2013)|
|LIS||Laboratory information system; often used to collect, store, archive, process, extract, and analyze data in the laboratory; this system aims to improve the turnaround time (TAT) of records, the quality of resource utilization, and public health support (Blaya et al., 2007; Sepulveda & Young, 2013)|
|RIS||Radiology information system; it is used to capture and store data including images, demographic and clinical information, and so on, and also assists in patient registration, report repository, and physician directory with advanced technology (Nance, Meenan, & Nagy, 2013)|
|PACS (super sound PACS, endoscope PACS)||Picture archiving and communication systems; a common HIS for storing and transferring digital images (Joshi & Yesha, 2012)|
|EMR||The EMR system is used to maintain medical records and to store, process, and retrieve information. Its aim is to ensure the accuracy of information in order to provide patient control and transparency, interdepartmental communication, and strong reporting capabilities for treatment (Kumar & Aldrich, 2010)|
|Cost accounting||System for collecting, recording, classifying, analyzing, summarizing, allocating, and evaluating financial cost in the medical area|
|Physical examination system||System for checking signs of patient|
In terms of handling HL7 format data, the Open Archival Information System (OAIS) model was applied (Celesti, Fazio, Romano, & Villari, 2016). HIS presents the ability to capture, store, and process health care data and often requires a large number of techniques to assist it. In other words, one of the major research challenges is how to integrate advanced techniques of information processing into HIS (Roberts, 1985). Cloud computing, a technique for data storage and sharing, is widely used in information systems. The use of cloud computing in HIS is well known and very common for data processing, data backup, and information sharing between different organizations, such as cloud-based PACS and cloud-based EHR systems (He et al., 2010; Joshi & Yesha, 2012; Kanagaraj & Sumathi, 2011). Cloud security requires a high-quality security management platform that covers many aspects, including data security, application security, system security, network security, and physical security. Additionally, novel techniques have been proposed to improve the quality of HIS. For example, in order to achieve data-level interoperability, an adaptive AdapteR Interoperability ENgine (ARIEN) mediation system was proposed (Khan et al., 2014) for HIS with different health care standards. Open-source software is also available for supporting HIS development. According to Bagayoko & Dufour (2010), web infrastructure, server operating systems, developer tools, and databases are commonly used in Europe and North America.
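To give a sense of what HL7 format data look like, the following sketch parses a small synthetic HL7 version 2 message, in which segments are separated by carriage returns and fields by the `|` character. It is a minimal illustration only: it ignores escape sequences, repeated fields, and the delimiters declared in the MSH segment, it is unrelated to the archive model cited above, and a production system would rely on a dedicated HL7 library. The message content and function names are made up for this example.

```python
# Synthetic HL7 v2 message; all identifiers and values are invented.
SAMPLE_MESSAGE = "\r".join([
    "MSH|^~\\&|LAB|HOSP|EHR|HOSP|202001011200||ORU^R01|MSG0001|P|2.5",
    "PID|1||12345^^^HOSP^MR||DOE^JANE||19800101|F",
    "OBX|1|NM|GLU^Glucose||5.4|mmol/L",
])


def parse_segments(message):
    """Split a message into {segment_id: [field lists]}."""
    segments = {}
    for line in message.strip().split("\r"):
        if not line:
            continue
        fields = line.split("|")
        segments.setdefault(fields[0], []).append(fields)
    return segments


def get_field(fields, index):
    """Return the field at a position, or an empty string if it is absent."""
    return fields[index] if index < len(fields) else ""


def patient_summary(message):
    """Extract a few patient and observation fields from PID and OBX segments."""
    segments = parse_segments(message)
    pid = segments["PID"][0]
    family, given = (get_field(pid, 5).split("^") + ["", ""])[:2]
    observations = [
        {
            "code": get_field(obx, 3).split("^")[0],  # OBX-3: observation identifier
            "value": get_field(obx, 5),               # OBX-5: observation value
            "units": get_field(obx, 6),               # OBX-6: units
        }
        for obx in segments.get("OBX", [])
    ]
    return {
        "patient_id": get_field(pid, 3).split("^")[0],  # PID-3: patient identifier
        "family_name": family,                          # PID-5: patient name
        "given_name": given,
        "birth_date": get_field(pid, 7),                # PID-7: date of birth
        "sex": get_field(pid, 8),                       # PID-8: administrative sex
        "observations": observations,
    }


summary = patient_summary(SAMPLE_MESSAGE)
print(summary)
```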
3 Unique Features of Big Data in Health Care
In addition to the “5V” features of Big Data, Big Data in health care has its own unique features, such as heterogeneity, incompleteness, timeliness and longevity, data privacy, and ownership.
3.1 Heterogeneity
Big Data in health care often comes in incompatible formats, which can be classified into structured and unstructured data. For example, some EHRs collect data in structured formats, and the International Classification of Diseases 10th revision (ICD-10) is structured (Asante-Korang & Jacobs, 2016). However, the majority of Big Data in health care is unstructured, including data from CT, MRI, X-ray, Holter monitoring, angiography, and laboratories (Swan, 2013).
The sources of Big Data in health care can be classified into four categories (Table 1). There is a shortage of tools to analyze the information from these heterogeneous sources. A German calciphylaxis registry proposed a framework and developed a tool to integrate medical records, imaging data, and signal data for the purpose of improving knowledge of rare diseases (Deserno et al., 2014). Windridge and Bober (2014) proposed a kernel-based framework to analyze heterogeneous data in the medical domain, which addressed the missing data problem presented by patients with sparse or absent data modalities. Using the kernel method, regression and classification of heterogeneous medical information can be achieved. Cismondi et al. (2013) developed a classifier to determine which missing ICU data should be imputed and which should not. In a simulated test bed, this method performed better than previous work.
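One simple way to see how a kernel can combine heterogeneous modalities while tolerating absent ones is sketched below. Each modality gets its own radial basis function kernel, and a modality contributes zero whenever either patient lacks it, which keeps the sum a valid (positive semidefinite) kernel. This is a deliberately simplified stand-in, not the framework of Windridge and Bober (2014); the modality names, feature vectors, and functions are hypothetical.

```python
import math

# Hypothetical modality names for illustration.
MODALITIES = ("labs", "imaging", "signal")


def rbf(x, y, gamma=1.0):
    """Radial basis function kernel between two equal-length feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def combined_kernel(p, q, gamma=1.0):
    """Sum per-modality kernels over modalities present in both patients.

    A modality missing (None or absent) in either patient contributes zero.
    The sum is divided by the fixed number of modalities.
    """
    total = 0.0
    for modality in MODALITIES:
        x, y = p.get(modality), q.get(modality)
        if x is None or y is None:
            continue
        total += rbf(x, y, gamma)
    return total / len(MODALITIES)


# Synthetic patients: one lacks signal data, the other lacks imaging data.
patient_a = {"labs": [1.0, 2.0], "imaging": [0.5], "signal": None}
patient_b = {"labs": [1.0, 2.0], "imaging": None, "signal": [3.0]}
print(combined_kernel(patient_a, patient_b))
```

The resulting similarity values could be passed to any kernel-based regression or classification method.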
3.2 Incompleteness
Because the data created by monitoring devices, such as electrocardiograms, consist of continuous data streams, it is difficult to consistently save them in the longitudinal record (Kruse, Goswamy, Raval, & Marawi, 2016). It is too expensive to store all the Big Data in health care, a situation that leads to data incompleteness. Additionally, the EHR requires doctors or nurses to record disease information of patients, such as medications and allergies, and this process may also lead to data incompleteness (Hong, Kaur, Farrokhyar, & Thoma, 2015). In Menelik II Referral Hospital, inpatient medical record completeness was 73%, which is low against the standard. Medical records not only support direct patient care but also support clinical audit, epidemiology, medical research, and resource allocation. Improving the completeness of medical records is therefore important for improving the quality of health care (Tola, Abebe, Gebremariam, & Jikamo, 2017).
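A completeness figure of this kind can be computed as the share of required fields that are actually filled in. The sketch below shows one such calculation; the list of required fields and the rule for what counts as filled are assumptions made for illustration and are not the audit criteria used in the cited study.

```python
# Hypothetical list of fields an audit might require in every record.
REQUIRED_FIELDS = (
    "patient_id", "diagnosis", "medications", "allergies", "discharge_summary",
)


def is_filled(value):
    """Treat None, blank strings, and empty containers as missing."""
    if value is None:
        return False
    if isinstance(value, str):
        return value.strip() != ""
    if isinstance(value, (list, tuple, dict, set)):
        return len(value) > 0
    return True


def record_completeness(record, required=REQUIRED_FIELDS):
    """Fraction of required fields that are filled in one record."""
    filled = sum(1 for name in required if is_filled(record.get(name)))
    return filled / len(required)


def mean_completeness(records, required=REQUIRED_FIELDS):
    """Average completeness over a non-empty collection of records."""
    return sum(record_completeness(r, required) for r in records) / len(records)


# Synthetic records: the second lacks medications and allergies.
records = [
    {"patient_id": "p1", "diagnosis": "I10", "medications": ["amlodipine"],
     "allergies": "none known", "discharge_summary": "stable"},
    {"patient_id": "p2", "diagnosis": "E11", "medications": [],
     "allergies": "", "discharge_summary": "follow-up in 4 weeks"},
]
print(f"mean completeness: {mean_completeness(records):.0%}")
```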
3.3 Timeliness and longevity
For HIS, there is a delay between the time EHR information is entered into the system and the time it becomes available for electronic access (Medicare & Medicaid Services, 2010). Medical signals such as electrocardiogram (ECG), single-photon emission computed tomography (SPECT) images, MRI, and electroencephalogram (EEG) are functions of time and are thus strongly time-sensitive. Keeping medical/health information current is a major challenge for Big Data analytics in health care, and HIS should maximize the timeliness of data. At the same time, the retention period of medical records differs among hospitals. For some familial or genetic diseases, it is useful to know the family history in order to support medical decision-making. To date, however, there is no link between a person’s medical records and those of his or her family members.
3.4 Data privacy
Owing to the sensitivity of health care data, there are significant concerns regarding privacy and security (Kruse et al., 2016; Naito, 2014). Extreme care should be taken to protect patient privacy, and privacy concerns limit the linking of external data to individual insured data, a linkage that might otherwise improve consumers’ health-related experience and personalize service and care (Yuen-Reed & Mojsilović, 2016). Because much health care information is centralized, the data are highly vulnerable to attacks (Mohr, Burns, Schueller, Clarke, & Klinkman, 2013). Owing to privacy issues, Herland et al. (2014) used synthesized EMR/EHR and PHRs with help from a medical professional to conduct their research. Health care mobile phone applications, such as Google Health, promise consumers “complete control over your data,” meaning that personal information will not be sold or shared without the consumer’s explicit permission (Steinbrook, 2008). Across countries, there are two patterns of policies and regulations to protect health care data. In one pattern, governments build on basic privacy laws and pass additional laws, policies, and regulations to protect personal health care information, such as HIPAA in the US, the Health Records and Information Privacy Act 2002 in Australia, and the Medical Privacy Act and Healthcare Insurance Act in France. In the other pattern, governments treat personal health care information as part of personal or sensitive information and pass laws to protect that broader category, such as the Data Protection Act in England and the Personal Information Protection and Electronic Documents Act (PIPEDA) in Canada.
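One common technical safeguard when health records are shared for research is pseudonymization: direct identifiers are dropped and the patient identifier is replaced by a keyed hash, so that records of the same patient can still be linked without exposing who the patient is. The sketch below illustrates the idea with Python's standard `hmac` and `hashlib` modules. The field names are hypothetical, and this step alone should not be read as satisfying HIPAA or any other regulation mentioned above.

```python
import hashlib
import hmac

# Hypothetical names of fields that identify a person directly.
DIRECT_IDENTIFIERS = ("name", "address", "phone")


def pseudonymize(record, secret_key, id_field="patient_id"):
    """Drop direct identifiers and replace the patient ID with a keyed hash.

    secret_key must be bytes and must be kept separate from the shared data.
    """
    token = hmac.new(
        secret_key, str(record[id_field]).encode("utf-8"), hashlib.sha256
    ).hexdigest()
    cleaned = {
        key: value
        for key, value in record.items()
        if key not in DIRECT_IDENTIFIERS and key != id_field
    }
    cleaned["pseudonym"] = token
    return cleaned


# Synthetic record for illustration.
record = {
    "patient_id": "12345",
    "name": "Jane Doe",
    "phone": "555-0100",
    "diagnosis": "E11",
}
shared = pseudonymize(record, secret_key=b"example-key-not-for-production")
print(sorted(shared.keys()))
```

Because the same key always maps a given patient to the same pseudonym, visits can be linked across data sets, while anyone without the key cannot recompute the mapping from the identifier.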
3.5 Ownership
Although consumers who have medical needs legally own their health data, the data may be stored and controlled by hospitals, physicians, laboratories, clinics, pharmacies, and government agencies in innumerable, incompatible data silos, so consumers may lack access to and control over their own health care data. To solve this problem, the cooperative, an old and successful form of corporation that is entirely owned by citizens, is an effective approach. Each consumer has one account that stores and manages all of his or her health care data and can share subsets or all of the data for research purposes (Pentland, Reid, & Heibeck, 2013).
4 Importance of Big Data in Health Care
It is important to extract valuable information from Big Data and discard useless fragments. As the main issue for this discussion, Big Data in health care could produce considerable economic benefits with the application of Big Data analytics (BDA). For example, a significant amount of money could be saved in the health care industry (Asante-Korang & Jacobs, 2016). Additionally, BDA could be applied in clinical diagnosis, medical research, hospital management, and fundamental demands in medicine. Through the use of Big Data techniques, patients may receive personalized medicine and patient-centric care. This argument supposes that Big Data would help to provide novel approaches to deal with issues in health care (Kruse, Goswamy, Raval, & Marawi, 2016).
4.1 The perspective of the research institution and the hospital
Research institutions could better understand the mechanisms and effects of newly developed drugs through BDA. For example, BDA could be used to reprocess cancer data in the hunt for new cancer drugs (Marx, 2013). By using statistical tools and algorithms, researchers could improve clinical trial design and reduce trial failures (Raghupathi & Raghupathi, 2014).
Physicians could use clinical decision support systems (CDSS) with BDA to make more informed decisions, which may improve the quality of patient care (Jee & Kim, 2013; Kim, Park, Yi, & Kim, 2014). When Big Data is allowed to inform clinical decision-making, new practices and treatment guidelines from clinical research may be integrated, leading to optimized results. BDA and computer-aided diagnostics may be used to save time in cancer detection and to reduce the false-positive rate of cancer diagnosis (Costa, 2014). In cardiology, computing and Big Data technology now enable cardiologists to read patients’ medical records on smartphones, which helps in identifying emergency cases in need of immediate treatment (Hsieh, Li, & Yang, 2013).
4.2 The perspective of the government or the public
BDA could reduce costs in the medical domain, estimated at approximately 8% of national health care expenditures for the US government (Manyika et al., 2011). In Italy, by exploiting the admissions for “laparoscopic appendectomy” surgery in different sanitary districts, it was possible to categorize districts based on cost efficiency and timeliness by using the number of admissions and the average days of hospitalization. This data analysis provides an automatic and continuous monitoring of the sanitary districts. The results of this data analysis provide useful insights into reducing cost and increasing the effectiveness and efficacy of health care services (Mancini, 2014a).
BDA could help governments prevent the spread of infectious diseases. In Pakistan, BDA with smartphone technology helped detect dengue fever epidemics at an early stage and prevent their spread. The method was also used to detect outbreaks of flu epidemics in the US (Pentland et al., 2013). Governments can thus respond more quickly to epidemics and help people avoid the disease.
BDA has the potential to reveal regional health problems. For example, Duke University led a project that involved building an integrated clinical data warehouse by combining millions of patient records from their EHRs with geographic information system data (Braunstein, 2015). Based on the combined data, this project reveals the social determinants of health.
4.3 The perspective of patients and their relatives
Using health care mobile phone applications and other online health-related websites, patients can store, retrieve, manage, and share their health data. Over the long term, this process will improve health care and decrease costs, especially for patients who have complicated chronic conditions (Steinbrook, 2008), such as diabetes. Some diabetes applications offer a variety of functions, including medication or insulin logs, self-monitoring blood glucose recording, and prandial insulin dose calculators (Demidowich, Lu, Tamler, & Bloomgarden, 2012), and others integrate health care providers who can access the patients’ records and formulate personalized feedback. Thus, patients can take the right treatments and live healthier, more comfortable lives (Asri, Mousannif, Al Moatassime, & Noel, 2015).
Through Big Data techniques, patients may have personalized medicine and patient-centric care (Chawla & Davis, 2013; Collins, 2016). Chawla and Davis (2013) constructed a framework called the Collaborative Assessment and Recommendation Engine (CARE) for patient-centered disease prediction and management. It can generate personalized disease predictions and management plans. In addition, through BDA, three drugs have been identified and used in specific groups of cancer patients. Dabrafenib is used to treat melanoma carrying the BRAF mutation V600E; trastuzumab, a targeted therapy, is used to treat breast cancer with amplification or overexpression of the gene encoding Her2/Neu; and imatinib is used to treat different types of tumor that contain the fusion protein BCR-ABL (Costa, 2014).
Through BDA, patients may have their diseases detected earlier, receive treatment earlier, and have better outcomes (K. Jee & G.-H. Kim, 2013; Kim et al., 2014). In daily life, BDA can help patients and their relatives monitor their respective conditions.
5 Common Approaches for Analyzing Big Data in Health Care
With the growing awareness of data as an asset, more and more data-mining approaches are adopted in order to gain insights from large volumes of data. In medicine and health care, a data-rich environment generates an enormous amount of data every day. Thus, we need to use data-mining approaches such as classification, clustering, regression analysis, and association rules to analyze big health care data.
5.1 Classification
Classification is the process of organizing data into categories for its most effective and efficient use. It is widely applied in mining health care data, and several representative studies are described below.
Primary care influences child health outcomes by managing illness and providing preventive and health promotion services. New Zealand is in a strong position to analyze patterns of childhood morbidity due to universal enrollment with a primary care provider at birth. However, analyzing morbidity patterns within these extracted data is problematic because primary care practices do not consistently or frequently use diagnostic labeling and there is marked variability between clinicians and conditions. A study conducted by MacRae et al. (2015) aimed to extend the use of Pattern Recognition Over Standard Aesculapian Information Collections (PROSAIC) to identify childhood respiratory conditions within primary care consultations by building an algorithm to classify the unstructured clinical narrative written by clinicians. Three independent sets of 1,200 child consultation records were randomly extracted from a data set of all general practitioner consultations in participating practices between January 1, 2008, and December 31, 2013, for children younger than 18 years of age (n=754,242). Each consultation record within these sets was independently classified by two expert clinicians as respiratory or non-respiratory and subclassified according to respiratory diagnostic categories to create three “gold standard” sets of classified records. These three “gold standard” record sets were used to train, test, and validate the algorithm. Then, sensitivity, specificity, positive predictive value, and F-measure were calculated to illustrate the algorithm’s ability to replicate judgments of expert clinicians within the 1,200 record “gold standard” validation set. This algorithm that uses primary care Big Data can accurately classify the content of clinical consultations. It enables accurate estimation of the prevalence of childhood respiratory illness in primary care and the resultant service utilization. 
The algorithm is able to analyze very large data sets, including routinely recorded unstructured clinical narratives. These data sets would be impractical to analyze manually.
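The evaluation measures named above (sensitivity, specificity, positive predictive value, and F-measure) can be illustrated with a minimal sketch. The labels below are invented toy data, not results from MacRae et al. (2015), and the function name `classification_metrics` is hypothetical; the sketch also assumes that no denominator is zero.

```python
def classification_metrics(gold, predicted):
    """Compare binary predictions with gold-standard labels (True = respiratory)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)          # true positives
    fp = sum(1 for g, p in pairs if not g and p)      # false positives
    fn = sum(1 for g, p in pairs if g and not p)      # false negatives
    tn = sum(1 for g, p in pairs if not g and not p)  # true negatives

    sensitivity = tp / (tp + fn)   # share of true respiratory records found
    specificity = tn / (tn + fp)   # share of non-respiratory records rejected
    ppv = tp / (tp + fp)           # positive predictive value (precision)
    f_measure = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "f_measure": f_measure,
    }


# Toy example: ten consultation records labeled by clinicians and by an algorithm.
gold = [True, True, True, True, False, False, False, False, False, False]
pred = [True, True, True, False, True, False, False, False, False, False]
metrics = classification_metrics(gold, pred)
```

In this toy case the algorithm misses one respiratory record and raises one false alarm, so sensitivity and positive predictive value are both 0.75.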
Frantzidis et al. (2010) applied data classification techniques to emotion recognition for health care applications, drawing on the bidirectional emotion theory model, which treats emotions as mixtures of two orthogonal and independent dimensions, namely, valence and arousal. Specifically, the paper uses classification rules derived from the C4.5 algorithm and a pattern classifier based on the Mahalanobis distance. It also argues for the use of multiphysiological recordings to enhance emotion discrimination and for metadata structures designed in the extensible markup language (XML) to link the various system components.
Fan et al. (2011) developed a hybrid model named case-based reasoning and fuzzy decision tree (CBFDT) for medical data classification in two medical domains: breast cancer diagnosis and liver disorder diagnosis. In this paper, they introduced the method and algorithm of a case-based fuzzy decision tree (FDT) model for medical classification problems. Two medical data sets including liver disorders and Breast Cancer Wisconsin are selected from University of California Irvine (UCI) database. More than 900 data sets are used to conduct this experiment. Decision tree induction is free from parametric assumptions, and it generates a reasonable tree by progressively selecting attributes to branch the tree. By combining all kinds of medical features of liver disorders and Breast Cancer Wisconsin database, this research applies an FDT to develop a forecasting model for generating decision rules in disease classification. This classification model integrates a data clustering technique, an FDT, and genetic algorithms (GAs) to construct a medical classification system based on medical database. It can be divided into four major steps: (1) screening medical database from UCI data set; (2) clustering case library into smaller cases; (3) establishing FDT; and finally (4) outputting the classification results.
Clinical data usually contain numerous features but small sample sizes, so classifiers suffer from the curse of dimensionality: irrelevant features not only reduce classification accuracy and efficiency but also make it harder to find potentially useful knowledge. Azar and Hassanien (2015) presented a linguistic hedges neuro-fuzzy classifier with selected features (LHNFCSF) for dimensionality reduction, feature selection, and classification. The new classifier was compared with other classifiers on several classification problems using four public data sets from the well-known UCI machine learning repository: breast cancer Wisconsin diagnostic, breast cancer Wisconsin prognostic, erythemato-squamous disease, and thyroid disease. The results indicate that applying LHNFCSF not only reduces the dimensionality of the problem but also improves classification performance by discarding redundant, noise-corrupted, or unimportant features. The results also suggest that the proposed method can speed up the computation time of a learning algorithm and simplify the classification tasks.
Estella et al. (2012) designed an advanced system for autonomously classifying brain MRI images of neurodegenerative diseases, with the main purpose of assisting in decision-making in classification tasks. The method was tested on data from a large database (more than 1,500 patients were analyzed), with a sensitivity of and specificity close to 90%, which are considerably better than those predicted by human experts.
5.2 Clustering
Clustering is the task of grouping a set of objects in such a way that objects in the same cluster are more similar to each other than to those in other clusters. Clustering techniques are widely used for exploratory data analysis, with applications including patient segmentation, detection of outliers in health care data, and disease prediction.
Elbattah & Molloy (2017) employed clustering to segment patients from a data-driven viewpoint. The Irish Hip Fracture Database (IHFD) is the primary source of data used in the study. Its records contain ample information about patients’ journeys from admission to discharge. A set of data pre-processing procedures is conducted for two purposes: (1) dealing with data anomalies and (2) extracting additional features that are considered indicators of care quality. The authors use the k-means algorithm as the partitional clustering approach. The k-means algorithm uses a simple iterative technique to group the points in a data set into clusters whose members have similar characteristics.
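The iterative technique behind k-means can be sketched in a few lines. This is a generic illustration, not the pipeline used by Elbattah & Molloy (2017); the two features (length of stay in days, time to surgery in hours), the data points, and the starting centroids are all hypothetical.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Repeat two steps until the centroids stop moving:
    assign each point to its nearest centroid, then recompute each centroid
    as the mean of the points assigned to it."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(values) / len(c) for values in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical patients: (length of stay in days, time to surgery in hours).
patients = [(5, 20), (6, 24), (7, 22), (20, 60), (22, 70), (21, 65)]
centroids, clusters = kmeans(patients, centroids=[(5, 20), (20, 60)])
```

The starting centroids are passed in explicitly so that the result is reproducible; in practice they are usually chosen at random or by a seeding heuristic, and the features are normally scaled first.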
Christy et al. (2015) proposed two outlier detection algorithms: distance-based outlier detection and cluster-based outlier detection. The main purpose of the algorithms was to remove outliers that are irrelevant or only weakly relevant to the analysis of health care data. Experimental evaluation based on the F-score and likelihood ratio shows that the cluster-based method outperforms the distance-based method.
Huang and Yao (2016) proposed a novel clustering approach for multidimensional physical health data based on artificial ant colony optimization. This method is determined through testing to be an effective and efficient approach to clustering health and medical data for further analysis.
Paul and Hoque (2010) proposed to use the background knowledge of medical domain in the clustering process to predict the likelihood of diseases. The developed algorithm can handle both continuous and discrete data and perform clustering based on anticipated likelihood attributes with core attributes of disease in data point. In this paper, its effectiveness has been demonstrated by testing it on a real-world patient data set.
Hastie et al. (2005) conducted a test in which 188 individuals (59.0% female) completed several psychological instruments and underwent ischemic, pressure, and thermal pain assessments. Then, 13 separate pain measures were obtained by using three experimental pain modalities with several parameters tested within each modality. Cluster analyses of PSI scores revealed four distinct clusters, and significant correlations were found between psychological measures and index scores. These findings highlight the need for future investigation to identify patterns of responses across different pain modalities in order to more accurately characterize individual differences in responses to experimental pain.
5.3 Regression analysis
Regression analysis is widely used in analyzing health care Big Data for estimating the relationships among variables or properties. The main research issues include trend features of data sequences, prediction of data sequences, and relationships between data.
With the emergence of administrative databases, the ability to access longitudinal patient data to adjust for comorbidity has improved considerably. This raises the issue of the most appropriate lookback period to determine patients’ disease status for risk estimation. Most research has used relatively short lookback durations, but longer lookback periods are likely to capture more conditions per patient, as well as assign comorbidities to a greater proportion of patients. Preen et al. (2006) conducted a study to determine the impact of different comorbidity ascertainment lookback periods on modeling post-hospitalization mortality and readmission. Data were extracted for ~1.1 million patients admitted to hospital in Western Australia from July 1990 to December 1996. Hierarchically nested Cox regressions were used to model mortality within one year and readmission within 30 days of index separation. Additionally, deaths within one year and readmissions within 30 days of index hospitalization were analyzed using logistic regression, and the receiver operating characteristic (ROC) area under the curve (AUC) was determined for each hierarchically nested lookback model in order to estimate the predictive power of the different models. Longer lookback periods identified more comorbidity. For the entire sample, 46.8% of comorbidity observed across the five-year lookback period was recorded at index hospitalization. For readmission, lookback periods of five years performed better than shorter durations for both patient groups.
Risk adjustment is an important component of outcomes and quality analysis in surgical health care. However, some concerns about it could be addressed if risk-adjustment models avoided subjective data elements, such as history of comorbidities, and relied on objective data, such as laboratory values or other machine-collected variables that do not require subjective interpretation and input by hospital personnel.
A study by Anderson and Chang (2015) was conducted to determine whether machine-collected data elements could perform as well as a traditional, full risk-adjustment model that includes other physician-assessed and physician-recorded data elements. This research used all available National Surgical Quality Improvement Program (NSQIP) data from January 1, 2005, to December 31, 2010. This nationally validated program measures more than 135 variables on each patient and follows up each patient for 30 days postoperatively. The primary analysis included all patients in the database who were categorized as having had an operation performed by a general surgeon or surgeons in some surgery subspecialties and having an adverse event. Multivariate logistic regression models were created to predict either mortality or any complication in the inpatient setting or within 30 days of surgery. The researchers then compared the ROC AUC of each regression using objective preoperative risk variables to its corresponding regression with all variables. A total of 745,053 patients were included. The difference in AUC between models with all variables and models with only objective variables ranged from −0.0073 to 0.1944 for mortality and from 0.0198 to 0.0687 for complications. These data suggest that it is possible to create a risk-adjustment system with a high discriminatory value based only on objective variables. By restricting data collection to objective data, we can reduce concerns about reliability and validity as well as threats of gaming the system by attempting to increase the risk score of patients through subjective variables.
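The comparison of ROC AUC values between a model with all variables and a model with objective variables only can be illustrated with a minimal sketch. The outcome labels and predicted risks below are invented toy values, not NSQIP data, and the AUC is computed directly from its rank interpretation rather than from fitted logistic regressions.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a randomly chosen positive case receives
    a higher score than a randomly chosen negative case (ties count as 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical 30-day mortality outcomes (1 = died) and predicted risks.
died = [1, 1, 0, 0, 0, 1]
risk_all_variables = [0.9, 0.8, 0.3, 0.2, 0.4, 0.7]
risk_objective_only = [0.9, 0.35, 0.3, 0.2, 0.4, 0.7]

auc_all = roc_auc(died, risk_all_variables)
auc_objective = roc_auc(died, risk_objective_only)
auc_difference = auc_all - auc_objective
```

A small `auc_difference` would indicate that dropping the subjective variables costs little discriminatory power, which is the kind of comparison the study reports.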
Kennedy et al. (2013) conducted a retrospective cohort study. In this paper, they identified all Veterans Health Administration (VHA) patients without recent cerebral and cardiovascular (CCV) events treated at twelve facilities from 2003 to 2007 and predicted risk using the Framingham risk score (FRS), logistic regression, generalized additive modeling, and gradient tree boosting.
Oztekin et al. (2009) used three different variable selection methods on a large and feature-rich data set to generate a consolidated set of factors and use them to develop Cox regression models for heart–lung graft survival. The main objective of this study was to improve the prediction of outcomes following combined heart–lung transplantation by proposing an integrated data-mining methodology. The data files were obtained from United Network for Organ Sharing (UNOS) using a formal data requisition procedure. The complete data set consists of 443 variables and 61,391 records. These variables included the socio-demographic and health-related factors of both the donor and the recipients. There are also procedure-related factors included in the data set. The results indicated that the proposed integrated data-mining methodology using Cox hazard models better predicted graft survival with different variables than the conventional approaches commonly used in the literature.
5.4 Association rules
Association rule mining aims to discover associations between items in large databases. The typical association rule mining methods include Apriori (Agrawal, Imieliński, & Swami, 1993) and Frequent Pattern (FP)-tree growth (Han, Pei, & Yin, 2000). Association rule mining is normally a two-step process where in the first step, frequent item-sets are discovered (i.e., item-sets whose support is no less than a minimum support) and in the second step, association rules are derived from the frequent item-sets using some measures of interestingness.
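The two-step process described above can be sketched as follows: a level-wise search for frequent item-sets in the spirit of Apriori, followed by rule generation filtered by confidence. The transactions are invented toy diagnosis sets, the function names are hypothetical, and confidence stands in for the more general "measures of interestingness".

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Step 1: level-wise search that keeps only item-sets meeting min_support."""
    items = sorted({item for t in transactions for item in t})
    frequent = {}
    level = [frozenset([i]) for i in items]
    while level:
        kept = {c: support(c, transactions) for c in level}
        kept = {c: s for c, s in kept.items() if s >= min_support}
        frequent.update(kept)
        # Join: merge pairs of frequent item-sets that differ in a single item.
        candidates = {a | b for a, b in combinations(kept, 2)
                      if len(a | b) == len(a) + 1}
        # Prune: every subset one item smaller must itself be frequent.
        level = [c for c in candidates
                 if all(frozenset(s) in kept
                        for s in combinations(c, len(c) - 1))]
    return frequent


def association_rules(frequent, min_confidence):
    """Step 2: derive rules antecedent -> consequent from the frequent item-sets."""
    found = []
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                confidence = supp / frequent[antecedent]
                if confidence >= min_confidence:
                    found.append((set(antecedent), set(itemset - antecedent),
                                  supp, confidence))
    return found


# Hypothetical patient records, each a set of recorded conditions.
records = [
    {"diabetes", "hypertension"},
    {"diabetes", "hypertension", "obesity"},
    {"diabetes", "obesity"},
    {"hypertension"},
    {"diabetes", "hypertension"},
]
frequent = frequent_itemsets(records, min_support=0.5)
rules = association_rules(frequent, min_confidence=0.7)
```

With these toy records, only the pair diabetes and hypertension is frequent, and it yields one rule in each direction.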
Antonie et al. (2001) used the Apriori algorithm to discover association rules among the features extracted from the mammography database and the category to which each mammogram belongs. They constrained the association rules to be discovered such that the antecedent of the rules is composed of a conjunction of features from the mammogram, while the consequent of the rules is always the category to which the mammogram belongs. Once the association rules are found, they are used to construct a classification system that categorizes the mammograms as normal, malignant, or benign.
In a medical database, the most complete and detailed information is anamnesis data, which contain the disease name, prescription, detailed patient information, etc. By mining these data, it is possible to find association rules between diseases. Driven by this, Kuo et al. (2007) proposed a novel data-mining framework that clusters the data first and then applies association rule mining. The first stage uses the ant system-based clustering algorithm (ASCA) and ant k-means (AK) to cluster the database, while the ant colony system (ACS)-based association rule mining algorithm is applied to mine the association rules for each cluster. Experimentation on the data sets provided by the National Health Insurance Plan of Taiwan demonstrates that the proposed method can find hidden rules that may occur less often but have robust relationships.
6 Systems and Applications for Analyzing Big Data in Health Care
Big Data can provide support across many aspects of health care. BDA has made progress to different degrees in clinical decision support (CDS), remote medical information services, public health, disease pattern analysis, and personalized medicine. Specific applications and potential opportunities in these areas are described below.
6.1 Clinical decision support systems
A CDSS can provide a large amount of medical support for clinicians, helping them to make diagnoses and choose the best treatments. A CDSS helps to supplement the knowledge of clinicians, prevent human negligence, and reduce costs while improving the quality of medical treatment. Representative data-driven CDSSs include the Health Evaluation Through Logical Processing (HELP) system, Quick Medical Reference (QMR) system, Iliad system, and MYCIN system.
6.1.1 The HELP system
The Health Evaluation Through Logical Processing system (Gardner, Pryor, & Warner, 1999) is the first data-driven clinical decision-making system and hospital information system (HIS). The system uses the knowledge base to make decisions from the multi-source clinical data stored in its integrated clinical database. For example, a serum potassium of 6.2 meq/L will trigger an elevated potassium alert to the nurse caring for a patient via a digital pager. Time-driven decision-making capabilities are also available within the HELP system. Using natural language processing, data from transcribed reports such as handwritten medical records have become a major source of data for decision-making.
The HELP system consists of a knowledge base, decision-making processor, data and time driver, data review alerts, accounting system, longitudinal patient data repository, and other components.
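The data-driven alert described for the HELP system can be illustrated with a minimal sketch in which storing a laboratory result triggers a rule check. This is not the HELP implementation: the `LabResult` structure, the `evaluate` function, and in particular the alert threshold are hypothetical, since the text only states that a value of 6.2 meq/L raises an alert.

```python
from dataclasses import dataclass


@dataclass
class LabResult:
    patient_id: str
    test: str
    value: float
    unit: str


# Hypothetical threshold chosen for illustration only; not taken from HELP.
POTASSIUM_ALERT_THRESHOLD = 6.0


def evaluate(result):
    """Return an alert message when a newly stored result satisfies a rule,
    otherwise return None."""
    if result.test == "serum_potassium" and result.value >= POTASSIUM_ALERT_THRESHOLD:
        return (f"Elevated potassium alert: patient {result.patient_id}, "
                f"{result.value} {result.unit}")
    return None


alert = evaluate(LabResult("P001", "serum_potassium", 6.2, "meq/L"))
no_alert = evaluate(LabResult("P002", "serum_potassium", 4.1, "meq/L"))
```

In a real system the returned message would be routed to the responsible nurse, for example through a pager as the text describes.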
6.1.2 The QMR system
QMR is a typical CDSS that helps physicians by using the knowledge base of INTERNIST-1/CADUCEUS. This knowledge base, which is widely used as a medical reference book, contains 750 diseases, 5,000 clinical symptoms, and more than 50,000 disease relationships. QMR was one of the earliest CDSSs to use artificial intelligence and a probabilistic ranking system.
Because many of the diseases in the system are rare and documented, an ad hoc scoring model is used to encode the relationship between specific clinical symptoms and diseases. One of the factors limiting the use of QMR is that its knowledge base needs to be constantly updated. The significance of QMR lies in its powerful knowledge base, which has served as the basic model for other knowledge base systems.
6.1.3 The Iliad system
Iliad is a medical expert consulting system developed by the University of Utah School of Medicine. It is used as a consultation tool or a simulation training tool for CDS and teaching (Lincoln, 1998).
The Iliad consultant utilizes a number of inferencing mechanisms to emulate the strategy of a medical expert in working with a patient. The knowledge in Iliad is represented in Bayesian and Boolean frames. These frames permit the use of sensitivities and specificities to describe the relationship of a disease to its manifestations and provide a basis for explaining its conclusions. Iliad has four basic components: the inference engine, the user interface, the data driver, and the best information algorithm.
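The role of sensitivities and specificities in a Bayesian frame can be illustrated with a single application of Bayes' theorem. This is a generic sketch, not Iliad's inference engine; the prior, sensitivity, and specificity values are hypothetical.

```python
def posterior(prior, sensitivity, specificity, finding_present):
    """Update the probability of a disease after observing one finding.

    sensitivity = P(finding present | disease)
    specificity = P(finding absent  | no disease)
    """
    if finding_present:
        p_given_disease = sensitivity
        p_given_no_disease = 1 - specificity
    else:
        p_given_disease = 1 - sensitivity
        p_given_no_disease = specificity
    numerator = p_given_disease * prior
    return numerator / (numerator + p_given_no_disease * (1 - prior))


# Hypothetical values: 10% prior, finding with 90% sensitivity and 80% specificity.
p_after_positive = posterior(0.10, 0.90, 0.80, finding_present=True)
p_after_negative = posterior(0.10, 0.90, 0.80, finding_present=False)
```

With these values a positive finding raises the disease probability from 10% to about 33%, and the same numbers can be shown to the user as an explanation of the conclusion.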
6.1.4 The MYCIN system
MYCIN is an interactive expert system for the diagnosis and treatment of central nervous system infections (Berner, 2003). It is composed of three subsystems: consultation, interpretation, and rules. According to the clinical manifestations and laboratory results of patients, MYCIN imitates the expert reasoning process, assists clinicians in determining bacterial species, and makes clinical recommendations. The system uses if–then inference rules and contains more than 400 rules that encode expert judgment.
6.2 Remote medical information systems
The aggregated electrocardiogram (ECG) and images from hospitals worldwide can become Big Data, which could be used to develop an e-consultation program helping on-site practitioners deliver appropriate treatment. Real-time teleconsultation and telediagnosis of ECG and images can be practiced via an e-platform for clinical, research, and educational purposes.
With respect to large-scale data research, Chia and Syed (2011) used Big Data computing to generate a predictor of mortality risk for patients with acute coronary syndromes. This predictor was developed through data mining and machine learning, based on 24-hour continuous ECG readings from more than 4,000 patients in clinical trials. In each trial, 24-hour ECG readings were collected over a two-year period. This Big Data-based predictor can predict over 50% of deaths with fewer false positives than traditional ECG analysis, which is based on a smaller segment of ECG signals. This approach can be easily extended to other clinical and non-clinical applications focused on approximate sequential pattern discovery in massive time-series data sets.
To make telemedicine more efficient, medical wearable devices that apply Big Data-mining and analysis techniques are used. For example, patients with dementia (such as Alzheimer type) need to be looked after day and night in order to manage their negative behaviors, which requires a large investment of labor and capital. To address this problem, real-time health monitoring devices have been developed to capture a large amount of data. Based on these real-time data, it can be determined whether a patient with dementia is agitated. At the same time, medical Big Data also pose challenges for data cleaning; poor-quality data should be identified and rejected to ensure that the results of data mining are correct. Moreover, data captured from remote monitoring devices can be mined to support long-term prognoses.
A Context Processing Algorithm (CPA) (Moore, Xhafa, Barolli, & Thomas, 2013) has been proposed to address the issues encountered in decision support for medical diagnosis and potential prognoses, based on the event–condition–action (ECA) rule concept. CPA regards captured Big Data as a kind of contextual information to carry out data processing in intelligent context-aware systems.
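The event–condition–action rule concept can be illustrated with a minimal sketch: when an event arrives, each rule registered for that event checks its condition against the current context and, if the condition holds, runs its action. This is a generic illustration of ECA rules, not the CPA itself; the rule, the context fields (`bpm`, `activity`), and the threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EcaRule:
    event: str                          # the event type that triggers the rule
    condition: Callable[[dict], bool]   # evaluated against the context
    action: Callable[[dict], str]       # executed only if the condition holds


def dispatch(rules, event, context):
    """Run the actions of all rules that match the event and whose condition holds."""
    return [rule.action(context)
            for rule in rules
            if rule.event == event and rule.condition(context)]


# Hypothetical rule: a high heart rate while resting prompts a notification.
rules = [
    EcaRule(
        event="heart_rate_reading",
        condition=lambda ctx: ctx["bpm"] > 120 and ctx["activity"] == "resting",
        action=lambda ctx: f"notify clinician: resting heart rate {ctx['bpm']} bpm",
    )
]

fired = dispatch(rules, "heart_rate_reading", {"bpm": 135, "activity": "resting"})
quiet = dispatch(rules, "heart_rate_reading", {"bpm": 135, "activity": "exercising"})
```

The same reading leads to different outcomes depending on the context, which is the point of treating captured data as contextual information.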
On the basis of Big Data, pervasive remote medical systems are designed for both healthy and ill people. Páez et al. (2015) proposed an architecture including the application of cloud computing, Big Data, and Internet of things approaches to make sure chronic or non-chronic patients as well as healthy people are monitored in different environments. Family members, emergency systems, and hospitals can interact with the patients whenever and wherever possible.
While Big Data promotes the function of medical remote monitoring and diagnosis, the development of telemedicine also enriches the connotation of Big Data. Traditionally, medical Big Data referred to EHR and remote monitoring health data. Now, however, medical Big Data also includes rapidly generated data on users’ behavior, physical strength, and mental state (Redmond et al., 2014). Technological advances in the medical field, such as medical video communications, also provide a new type of medical Big Data. For instance, a light-field-based 3D cloud telemedicine system (Wang, Xiang, Pickering, & Zhang, 2016) that combines Big Data analysis with 3D technologies has been proposed to mine big video data.
6.3 Applications in public health
In the field of public health, BDA represents a new solution that can mine web-based and social media data to predict disease outbreaks based on consumers’ searches, social content, and query activity. Systems in public health also support clinicians and epidemiologists performing analyses across patients and care venues to help identify disease trends and drug safety.
BDA is often used for disease monitoring. An example is Google’s use of BDA to study the timing and location of search engine queries to predict disease outbreaks. Research shows that one-third of consumers currently use social networking services (Facebook, YouTube, blogs, Google, Twitter) for health care purposes. As demand for access to health information from social networking sites continues to grow, BDA can potentially support key prevention programs such as disease surveillance and outbreak management.
The Global Burden of Disease Study (GBD) is a comprehensive regional and global research program of disease burden that assesses mortality and disability from major diseases, injuries, and risk factors. GBD is a collaboration of more than 1,800 researchers using medical Big Data from 127 countries. The 2015 report (Collaborators, 2017) showed that globally, diarrhea was a leading cause of death among all ages, as well as a leading cause of disability-adjusted life years (DALYs) because of its disproportionate impact on young children.
BDA is also widely applied to monitor drug safety, particularly adverse drug reactions (ADRs), and to identify susceptible populations. An ADR is defined as an appreciably harmful or unpleasant reaction resulting from an intervention related to the use of a medicinal product; it predicts hazard from future administration and warrants prevention, specific treatment, alteration of the dosage regimen, or withdrawal of the product (Edwards & Aronson, 2000).
With the help of Big Data, health departments or medical companies can efficiently take actions when they detect potential ADRs among the people who take the medication. In 2004, Wilson et al. proposed that Knowledge Discovery in Databases (KDD) is a more effective way to determine the presence and assess the strength of ADR signals. At this point, numerous data-mining techniques have been used in drug safety, such as cluster analysis, link analysis, deviation detection, and disproportionality assessment.
As Big Data emerges, health social media sites are regarded as a fast and direct data resource for scientists to obtain first-hand ADR information. Compared with ADRs recorded by health professionals, spontaneous reports on health social media sites are much more abundant, open, and timely. Building on these advantages, Christopher et al. (2009) used association mining and the proportional reporting ratio to analyze the ADRs detected for different drugs on the basis of social data. Given the growth of medical research, especially in the ADR field, and the advantages of Big Data, Shah et al. (2012) argued that Big Data in biomedical informatics will grow considerably. The age of data-driven medicine is expected to create a proactive, predictive, preventive, participatory, and patient-centered health system.
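The proportional reporting ratio, one of the disproportionality measures mentioned above, compares how often an event is reported for a drug of interest with how often it is reported for all other drugs. The sketch below uses the standard two-by-two table; the report counts are invented for illustration.

```python
def proportional_reporting_ratio(a, b, c, d):
    """PRR from a 2x2 table of report counts.

    a: reports of the drug of interest with the event of interest
    b: reports of the drug of interest with any other event
    c: reports of all other drugs with the event of interest
    d: reports of all other drugs with any other event
    """
    rate_for_drug = a / (a + b)
    rate_for_other_drugs = c / (c + d)
    return rate_for_drug / rate_for_other_drugs


# Hypothetical counts: 20 of 200 reports for the drug mention the event,
# compared with 50 of 4,800 reports for all other drugs.
prr = proportional_reporting_ratio(a=20, b=180, c=50, d=4750)
```

Here the event is reported 9.6 times as often for the drug as for other drugs; whether such a value counts as a signal depends on thresholds and case counts that the analyst must choose.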
Apart from the great potential shown in drug safety, Big Data can also be powerful in identifying susceptible populations. The large collections of EHRs accumulated through medical treatment provide an opportunity to build statistical models of high-risk patients. Such models aim to reduce the cost of health care and conserve limited health resources. Bates et al. (2014) suggested that predictive medical systems are best applied by identifying and managing six practical use cases: high-cost patients, readmissions, triage, decompensation, adverse events, and treatment optimization for diseases affecting multiple organ systems.
6.4 Applications in disease pattern analysis and personalized medicine
Hay et al. (2013) combined new sources of data, such as social data, with relevant environmental information to create a dynamic, real-time global infectious disease map. On the basis of infectious disease risk maps, people can deepen their knowledge of infectious diseases and improve the ability to triage spatially and issue infectious disease outbreak alerts. Lazer et al. (2014) stated that “Big Data hubris” is the often implicit assumption that Big Data is a substitute for, rather than a supplement to, traditional data collection and analysis. Given that most Big Data cannot reach the standard required for scientific statistical analysis, the results can contain large errors. Additionally, medical algorithms are not constant; they are dynamic and undergo a continuous series of adjustments.
Big medical data can be applied not only to mining public medical patterns but also to personalized medical care. At present, health care is moving from a disease-centered model toward a patient-centered model. In a disease-centered model, physicians’ decision making is centered on the clinical expertise and data from medical evidence and various tests. In a patient-centered model, patients actively participate in their own care and receive services focused on individual needs and preferences.
Personalized health care is a data-driven approach: a patient-centered medical model that assesses the relationships among patients who are exposed to similar risk, lifestyle, and environmental factors. In light of this, Chawla and Davis (2013) developed a system named CARE that uses a collaborative filtering method to capture patient similarities and produce personalized disease profiles for individual disease risk prediction.
Panahiazar et al. (2014) presented the main challenges of personalized medicine from four standpoints: the variety, quality, volume, and velocity of the data. Alyass et al. (2015) proposed that personalized medicine may widen the growing gap in health systems between rich and poor countries. Moreover, they attributed the slow transition from conventional to personalized medicine to several factors: generation of cost-effective high-throughput data, hybrid education and multidisciplinary teams, data storage and processing, data integration and interpretation, and individual and global economic relevance.
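The general idea of collaborative filtering over patient histories can be sketched as follows. This is not the CARE system itself: the Jaccard similarity, the weighting scheme, the function names, and the patient records are all assumptions made for illustration.

```python
from collections import defaultdict


def jaccard(a: set, b: set) -> float:
    """Similarity of two diagnosis sets: size of overlap divided by size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def disease_risk_scores(target: set, others: dict) -> dict:
    """Score diseases the target patient does not have yet.

    Each similar patient votes for their own diagnoses, weighted by how
    similar their history is to the target's. Scores are normalized by
    the total similarity, so they fall between 0 and 1.
    """
    scores = defaultdict(float)
    total_similarity = 0.0
    for history in others.values():
        similarity = jaccard(target, history)
        if similarity == 0.0:
            continue  # patients with no shared diagnoses carry no signal
        total_similarity += similarity
        for disease in history - target:
            scores[disease] += similarity
    if total_similarity == 0.0:
        return {}
    return {disease: s / total_similarity for disease, s in scores.items()}


# Hypothetical patient histories (sets of diagnosis labels).
patients = {
    "p1": {"hypertension", "obesity", "type 2 diabetes"},
    "p2": {"hypertension", "obesity", "sleep apnea"},
    "p3": {"asthma"},
}
target_history = {"hypertension", "obesity"}

scores = disease_risk_scores(target_history, patients)
ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
print(ranked)
```

In this toy data, the two patients who share the target's diagnoses each contribute one candidate disease, while the unrelated patient contributes nothing. A production system would need far richer features and validation than this sketch suggests.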
7 Challenges for Mining Big Data in Health Care
7.1 Data mining
Clinical Big Data contains a large amount of unstructured data, such as natural language or handwritten records (Jee & Kim, 2013), which are difficult to integrate, analyze, and store. At the current stage, sharing structured data among agencies is already inefficient, and sharing unstructured data, even within the same organization, is more difficult still. Determining how to effectively mine a large amount of unstructured data will continue to be a major challenge (Sejdic, 2014). One of the characteristics of Big Data is variability in data sources (Dieringer & Schlotterer, 2003), and medical data are also highly time sensitive; for example, personalized medical care has strict timeliness requirements. The medical industry places extreme demands on data processing speed, especially when a patient’s condition deteriorates rapidly. In addition, when real-time applications such as cloud computing are used to access and analyze data, the privacy and security of patient data are also a challenge (Jee & Kim, 2013). Cloud computing now offers new possibilities for mining and sharing medical Big Data. However, several challenges must be overcome before cloud computing can become more practical. First, although cloud computing offers an easy and flexible way to mine resources, it also increases the risk of privacy disclosure, a fact that is particularly evident in fields such as clinical informatics and public health informatics. Second, in medicine, a large amount of data (at the petabyte level) often needs to be imported to or exported from the cloud. Network bandwidth constraints limit the speed of data transmission and also increase the cost of cloud computing (J. J. Chen, Qian, Yan, & Shen, 2013). At present, attention to Big Data focuses mainly on its accuracy; data mining that is both timely and accurate is another challenge, and work on it is still in the initial stage (Abenstein & Tompkins, 1982; Xu et al., 2010).
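To give a sense of scale for the bandwidth constraint, the rough calculation below estimates how long moving one petabyte would take over a dedicated link. The link speeds, the decimal definition of a petabyte, and the assumption of a fully utilized link are illustrative choices, not figures from the cited studies.

```python
SECONDS_PER_DAY = 86_400
PETABYTE = 10**15  # bytes, decimal definition (an assumption of this sketch)


def transfer_days(data_bytes: float, link_bits_per_second: float,
                  efficiency: float = 1.0) -> float:
    """Days needed to move data_bytes over a link of the given raw speed.

    efficiency is the fraction of the raw link speed actually achieved
    (1.0 means a perfectly utilized link, which is optimistic).
    """
    if link_bits_per_second <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    seconds = data_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / SECONDS_PER_DAY


days_at_1_gbps = transfer_days(PETABYTE, 1e9)
days_at_10_gbps = transfer_days(PETABYTE, 10e9)
print(f"1 PB at 1 Gbit/s:  {days_at_1_gbps:.1f} days")
print(f"1 PB at 10 Gbit/s: {days_at_10_gbps:.1f} days")
```

Even under these optimistic assumptions, a single petabyte takes roughly three months at 1 Gbit/s, which illustrates why transmission speed and cost are treated as practical limits.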
7.2 Data storage
The current difficulties in data storage are mainly due to high costs, which arise from three main sources. The first is the sheer volume of medical data. With the development of medical information technology, the medical industry has produced a large amount of data, ranging from medical diagnostic images to pathological analysis maps. For example, regional medical data are usually derived from a region with millions of people and hundreds of medical institutions, and the amount of data continues to grow. In accordance with the relevant provisions of the medical industry, a patient’s data typically need to be retained for more than 50 years. A patient’s data not only contain a large amount of online or real-time data but also include diagnosis and medication recommendations in CDS, various structured data tables, semi-structured and unstructured text documents, medical images, and other information. The massive size of the data inevitably increases the cost and difficulty of storage. The second is the cost of moving the data from one place to another and of analyzing them. Finally, the types of medical data are diverse, including numerical data that record various disease tests, as well as diagnostic images, records made by doctors and nurses, and even diagnostic speech, video, and other unstructured data. Unstructured data are more difficult to store, analyze, and manipulate than structured data, which further increases the cost of storage. It is also a challenge to maintain safety and privacy in the process of storing, extracting, and downloading patient-related data (Youssef, 2014).
7.3 Data sharing
7.3.1 Limited data standardization and interoperability
The current standards and technologies are inadequate to meet the requirements of integrative applications of health care Big Data. The difficulties are twofold. First, the data lack uniform standards, consistent description formats, and presentation methods. Second, integrating structured, semi-structured, and unstructured data at different levels is difficult. At the same time, each database uses different software and data formats; the differing formats in particular make data comparison, analysis, transfer, sharing, and other processes more difficult (Chawla & Davis, 2013; Mohr, Burns, Schueller, Clarke, & Klinkman, 2013; W. Raghupathi & Raghupathi, 2014). Data integration can also reduce costs. Hillestad et al. (2005) compared health care with the use of IT in other industries and estimated that the use of interoperable electronic medical record (EMR) systems could save $142–371 billion.
7.3.2 Information barriers
Users of Big Data in the medical field cover a wide range, such as hospital clinics, regional medical centers, medical insurance companies, drug management analysis units, and medical equipment monitoring centers. The corresponding data resources are scattered in different data pools, including hospital medical records, settlement and cost data, medical firms’ records, academic medical research data, residents’ health records collected by regional health information platforms, and population and public health data from government surveys. There is little connection between these data sets. At the same time, data sharing mechanisms are imperfect because of the information barriers among hospitals, scientific research institutions, and other institutions (Kruse, Goswamy, Raval, & Marawi, 2016). For example, in China, medical institutions as a whole have limited communication and sharing with each other (Rui, 2015). With the globalization of data, Big Data in health care will also face varying degrees of language, terminology, and standardization barriers (Kruse et al., 2016).
7.3.3 Volume of data
The massive volume of health care Big Data, at the terabyte (TB) level and even the petabyte (PB) level, is now beyond the capabilities of personal computers and network file sharing programs, so a new sharing mechanism is urgently needed (Kruse et al., 2016; Service).
7.3.4 Insufficient data integration
More data integration is needed. The data have not yet been fully embedded in business processes and organizational management practices. For example, in many cases, patient monitoring data have not yet been integrated into clinical diagnosis and treatment, and clinical data have not yet been integrated into public health services and infectious disease monitoring (Dai, 2016).
7.4 Data privacy
Health care data are more sensitive and centralized than other types of Big Data, and there are significant concerns regarding confidentiality (Mancini, 2014b; D. C. Mohr et al., 2013). However, no perfect solution to the problem of patient data privacy protection has yet emerged. Patient data leakage may have unpredictable consequences (including injury, discrimination, and others), and there are many real cases at home and abroad. Big Data technology exposes personal medical data to greater risk. Some people even believe that in the era of Big Data, protecting personal privacy is impossible (Schadt, 2012). The problem can be alleviated by special processing (such as de-identification and digital identity encryption), but the identification and de-identification of information still require people or applications to process identifiable information, which may cause the patient’s health information to be misappropriated by others without the patient’s knowledge or authorization (Rothstein, 2010). Big Data increases the risks to patient data for two reasons. First is the risk of the data itself. The data can be copied and preserved without space and time constraints, which makes the risk both high and long lasting under Big Data conditions. Second is the risk of Big Data technology. Even if a Big Data database uses anonymized or encrypted personal data, a residual risk of re-identification remains: personal identities can be re-determined by data linkage techniques, because pseudonymized personal data retain attributes that can be matched against other sources (Ward, 2014). This risk is greater when different data sets are linked. De-anonymization is an attack in which anonymous data and other sources of data are compared in order to re-identify the anonymous data sources (Yom-Tov, 2016). For example, comparing voter registration data and hospital discharge data can determine whether a person is sick. Voter registration data contain date of birth, sex, zip code, address, date last voted, name, date registered, and other details. Hospital discharge data contain date of birth, sex, zip code, diagnosis, ethnicity, medication, procedure, visit date, and other information. By comparing the fields shared by the two data sources, such as date of birth, sex, and zip code, an attacker can identify the specific individual and then determine that person’s illness and voting record. In the example in Table 3, through the comparison of these two data sources, it is not difficult to determine that the person whose date of birth, sex, and zip code are 06/18/90, female, 77889, respectively, is Angela and that she is suffering from diabetes.
Table 3. An Example of Data Privacy Breach
|Data source|Fields|
|Voter registration data (publicly available)|Name, sex, zip code, date of birth, address|
|Hospital discharge data|Sex, zip code, date of birth, disease|
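The linkage attack described above amounts to a join on the shared fields. The sketch below illustrates it on toy data: the Angela record uses the values given in the text, while the other records, the field names, and the function name are hypothetical.

```python
# Publicly available voter registration records (the second row is invented).
voters = [
    {"name": "Angela", "sex": "female", "zip": "77889", "dob": "06/18/90"},
    {"name": "Brian", "sex": "male", "zip": "77001", "dob": "02/03/85"},
]

# "Anonymized" hospital discharge records: no names, but the
# quasi-identifiers are still present (the second row is invented).
discharges = [
    {"sex": "female", "zip": "77889", "dob": "06/18/90", "disease": "diabetes"},
    {"sex": "male", "zip": "77450", "dob": "11/30/72", "disease": "asthma"},
]

QUASI_IDENTIFIERS = ("dob", "sex", "zip")


def link_records(voters, discharges, keys=QUASI_IDENTIFIERS):
    """Return (name, disease) pairs for discharge rows that match exactly one voter."""
    voters_by_key = {}
    for voter in voters:
        voters_by_key.setdefault(tuple(voter[k] for k in keys), []).append(voter)

    matches = []
    for row in discharges:
        candidates = voters_by_key.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match means the row is re-identified
            matches.append((candidates[0]["name"], row["disease"]))
    return matches


reidentified = link_records(voters, discharges)
print(reidentified)  # [('Angela', 'diabetes')]
```

The same check can be used defensively: before releasing a data set, a curator can count how many records are unique on their quasi-identifiers and generalize or suppress those fields until no record is unique.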
In the future, in order to better achieve individualized treatment, our individual genomes may also be added to the EHR. The individual genome is private, and the gene sequence may lead to many privacy-related issues. Lin et al. (2004) found that “Specifying DNA sequence at only 30 to 80 statistically independent SNP positions will uniquely define a single person”. Privacy protection therefore becomes a central concern.
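One rough way to see why so few positions can single out a person is to count the possible genotype combinations. The sketch below is a back-of-the-envelope bound, not the analysis in Lin et al. (2004): it assumes each SNP has three possible genotypes, that positions are independent, and that the world population is about 8 billion.

```python
GENOTYPES_PER_SNP = 3             # assumption: biallelic SNP with genotypes AA, Aa, aa
WORLD_POPULATION = 8_000_000_000  # rough assumption


def max_distinct_profiles(n_snps: int) -> int:
    """Upper bound on distinct genotype profiles over n independent SNPs."""
    return GENOTYPES_PER_SNP ** n_snps


def min_snps_to_exceed(population: int) -> int:
    """Smallest number of SNPs whose profile count exceeds the population."""
    n = 0
    while max_distinct_profiles(n) <= population:
        n += 1
    return n


print(min_snps_to_exceed(WORLD_POPULATION))  # 21
print(max_distinct_profiles(30))             # 205891132094649
```

This count treats every genotype as equally likely, which real populations do not satisfy; uneven genotype frequencies mean that more positions are needed in practice than this idealized bound suggests.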
7.5 Data technologies and talent
As described in the main characteristics of Big Data, in terms of data size, Big Data in health care exceeded 150 exabytes after 2011 (Y. C. Wang et al., 2015). A study showed that data size in health care is estimated to be around 40 ZB in 2020 (Fig. 1) (O’Driscoll, Daugelaite, & Sleator, 2013). The complexity of the data is also growing rapidly, with data diversity, fast change, low value density, and other complex features becoming increasingly significant. This complexity poses a serious challenge to traditional computing and information technology (Hey et al., 2012). At present, it is difficult for a distributed system to provide availability, consistency, and partition tolerance all at once. It is also difficult to solve problems such as health care data collection, real-time processing and dynamic indexing, and the lack of prior knowledge (Zhang, Zhou, Du, Luo, & Mei, 2014). Even some widely used Big Data technologies have their challenges. For example, Hadoop helps solve the storage problems of Big Data, reduces the cost of data storage, and improves the speed of operation. However, Hadoop faces technical problems such as low security and data that cannot be interconnected (Augustine, 2014; Jee & Kim, 2013). In addition, promoting the development of health care Big Data applications requires human experts who have both clinical and analytic knowledge (Mavandadi et al., 2012). According to McKinsey, even in the U.S., the leading information technology power, the related talent gap will reach 140,000–190,000 in 2018 (Manyika et al., 2011). Many of today’s data technologies, including Hadoop and cloud computing, are challenging for many businesses, especially small firms. The skills required are in many cases not simple; they involve data mining, analysis, manipulation, and other techniques that are too difficult and expensive for most small firms to master (Jee & Kim, 2013).
At present, only a small number of companies in the world have mastered the core technology of Big Data analysis. The world needs more data analysts who can use information technology to visualize the data before presenting them to policy makers. Finally, we also need personnel skilled in professional technology management, data processing, and medical data management. They can use an appropriate management model to make the information infrastructure a continuous research and application platform, ensure continuity, and achieve cross-cutting cooperation (Sepulveda, 2013; Youssef, 2014).
Medical research that integrates Big Data will contribute to a higher level of human health at a broader and deeper level. This paper summarizes related research on medical data at home and abroad in recent years. It introduces the concepts, background, and main applications of medical Big Data, along with several key technologies related to it. In addition, we summarize and reflect on the opportunities and challenges in the study of big medical data. In general, current research on medical data is not yet mature, and many problems remain to be resolved. To take full advantage of the patterns contained in the massive data, technologies for Big Data storage, mining, and analysis, together with related talent, are essential. These technologies and talents will support research on health care Big Data and further serve a wide range of medical applications such as public health, medical care, medical insurance, and many others.
ML wrote sections 1 and 2, RW wrote sections 3 and 4, LH wrote sections 5 and 6, and PL wrote sections 7 and 8. WL provided critical suggestions for the paper. LL designed the paper structure, integrated all sections, and supervised the paper writing. We thank Lina Zhou and Ni Wen for assistance in literature search. This paper is supported in part by The National Key Research and Development Program of China (No. 2016YFB1000603), Key Program of the Major Research Plan of the National Natural Science Foundation of China (No. 91646206), National Natural Science Foundation of China grants (Nos. 31601083 and 61772375), and the Recruitment Program of Global Experts (No. 104413100019).
Abenstein, J. P., & Tompkins, W. J. (1982). A new data-reduction algorithm for real-time ECG analysis. IEEE Transactions on Biomedical Engineering, 29(1), 43–48.
Agrawal, R., Imieliński, T., & Swami, A. (1993, May). Mining association rules between sets of items in large databases. In B. Peter & J. Sunshil (Eds.), Proceedings of the ACM SIGMOD Conference on Management of Data (pp. 207–216). Washington, DC: ACM Press.
Aitken, M., & Gauntlett, C. (2013). Patient apps for improved healthcare: From novelty to mainstream. IMS Institute for Healthcare Informatics. Retrieved from https://www.mendeley.com/catalogue/patient-apps-improved-healthcare-novelty-mainstream/
Antonie, M. L., Zaïane, O. R., & Coman, A. (2001). Application of data mining techniques for medical image classification. Proceedings of the Second International Conference on Multimedia Data Mining, 94–101. doi:10.1.1.23.9742
Asri, H., Mousannif, H., Al Moatassime, H., & Noel, T. (2015, June). Big data in healthcare: Challenges and opportunities. Proceedings of 2015 International Conference on Cloud Computing Technologies and Applications, Marrakech, Morocco.
Backonja, U., Kim, K., Casper, G. R., Patton, T., Ramly, E., & Brennan, P. F. (2012, June). Observations of daily living: Putting the “personal” in personal health records. NI 2012: 11th International Congress on Nursing Informatics, Montreal, Canada.
Bagayoko, C. O., Dufour, J. C., Chaacho, S., Bouhaddou, O., & Fieschi, M. (2010). Open source challenges for hospital information system (HIS) in developing countries: A pilot project in Mali. BMC Medical Informatics and Decision Making, 10(22), 1–13.
Bates, D. W., Saria, S., Ohno-Machado, L., Shah, A., & Escobar, G. (2014). Big data in health care: Using analytics to identify and manage high-risk and high-cost patients. Health Affairs, 33(7), 1123–1131.
Belle, A., Thiagarajan, R., Soroushmehr, S. M., Navidi, F., Beard, D. A., & Najarian, K. (2015). Big data analytics in healthcare. BioMed Research International, 2015, 370194, 1–16.
Blaya, J. A., Shin, S. S., Yagui, M. J., Yale, G., Suarez, C. Z., Asencios, L. L., Fraser, H. S. (2007). A web-based laboratory information system to improve quality of care of tuberculosis patients in Peru: Functional requirements, implementation and usage statistics. BMC Medical Informatics and Decision Making, 7(1), 33–43.
Braunstein, M. L. (2015). Health big data and analytics. Practitioner’s Guide to Health Informatics (pp. 133–149). Berlin, Germany: Springer International Publishing.
Celesti, A., Fazio, M., Romano, A., & Villari, M. (2016). A hospital cloud-based archival information system for the efficient management of HL7 big data. 2016 39th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), Opatija, Croatia.
Centers for Medicare & Medicaid Services (CMS), HHS. (2010). Medicare and Medicaid programs; electronic health record incentive program. Final rule, Federal Register, 75(144), 44313–44588. PMID:20677415
Chen, J., Qian, F., Yan, W., & Shen, B. (2013). Translational biomedical informatics in the cloud: Present and future. BioMed Research International, 2013, 658925. PMID:23586054
Chia, C.-C., & Syed, Z. (2011). Computationally generated cardiac biomarkers: Heart rate patterns to predict death following coronary attacks. Proceedings of the 2011 SIAM International Conference on Data Mining, 735–746.
Christopher C. Yang, H. Y., Jiang, L., & Zhang, M. (2009). Social media mining for drug safety signal detection. Proceedings of the 2012 International Workshop on Smart Health and Wellbeing.
Christy, A., Gandhi, G. M., & Vaithyasubramanian, S. (2015). Cluster based outlier detection algorithm for healthcare data. Procedia Computer Science, 50, 209–215.
Dai, T. (2016). Health and medical big data development perspective. Journal of Medical Informatics, 37(2), 2–8.
Deserno, T. M., Haak, D., Brandenburg, V., Deserno, V., Classen, C., & Specht, P. (2014). Integrated image data and medical record management for rare disease registries. A general framework and its instantiation to the German Calciphylaxis Registry. Journal of Digital Imaging, 27(6), 702–713.
Feldman, B., Martin, E. M., & Skotnes, T. (2012). Big data in healthcare: Hype and hope. Dr. Bonnie, 2012(1), 122–125.
Garets, D., & Davis, M. (2007). Electronic medical records vs Electronic health records: Yes, there is a difference. Zhongguo Yiyuan, 11(5), 38–39.
Gunter, T. D., & Terry, N. P. (2005). The emergence of national electronic health record architectures in the United States and Australia: Models, costs, and questions. Journal of Medical Internet Research, 7(1), 13-15.
Han, J., Pei, J., & Yin, Y. (2000, May). Mining frequent patterns without candidate generation: A frequent-pattern tree approach. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (pp. 1–12), Texas, USA.
Hay, S. I., George, D. B., Moyes, C. L., & Brownstein, J. S. (2013). Big data opportunities for global infectious disease surveillance. PLoS Medicine, 10(4), e1001413.
He, C., Jin, X., Zhao, Z., & Xiang, T. (2010, December). A cloud computing solution for hospital information system. Paper presented at the 2010 IEEE International Conference on Intelligent Computing and Intelligent Systems, Xiamen, China.
Herland, M., Khoshgoftaar, T. M., & Wald, R. (2014). A review of data mining using big data in health informatics. Journal of Big Data, 1(2), 1–35.
Huang, X. J., & Yao, Y. (2016, August). Multi-dimensions clustering approach for physical health data based on artificial ant colony optimization. Paper presented at the 8th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC), Hangzhou, China.
Joshi, K., & Yesha, Y. (2012). Workshop on analytics for big data generated by healthcare and personalized medicine domain. Proceedings of the 2012 Conference of the Center for Advanced Studies on Collaborative Research, 267–269.
Kanagaraj, G., & Sumathi, A. C. (2011, December). Proposal of an open-source cloud computing system for exchanging medical images of a Hospital Information System. Paper presented at the 3rd International Conference on Trendz in Information Sciences & Computing (TISC2011), Chennai, India.
Khan, W. A., Khattak, A. M., Hussain, M., Amin, M. B., Afzal, M., Nugent, C., & Lee, S. (2014). An adaptive semantic based mediation system for data interoperability among Health Information Systems. Journal of Medical Systems, 38(8), 1-18.
Khoury, M. J., & Ioannidis, J. P. A. (2014). Medicine. Big data meets public health. Science, 346(6213), 1054–1055.
Kim, T.-W., Park, K.-H., Yi, S.-H., & Kim, H.-C. (2014). A big data framework for u-Healthcare systems utilizing vital signs. Paper presented at the 2014 International Symposium on Computer, Consumer and Control, Taichung, Taiwan.
Kovalev, V., & Kalinovsky, A. (2015). Big medical data: Image mining, retrieval and analytics. Paper presented at Big Data and Predictive Analytics, Minsk, Belarus.
Lazer, D., Kennedy, R., King, G., & Vespignani, A. (2014). Big data. The parable of Google Flu: Traps in big data analysis. Science, 343(6176), 1203–1205.
Lincoln, M. J. (1998). Applying commonly available expert systems in physician assistant education. Perspective on Physician Assistant Education, 9(3), 144–151.
Lodish, H. (2008). Molecular cell biology. San Francisco, CA: W. H. Freeman and Company.
Luo, J., Wu, M., Gopukumar, D., & Zhao, Y. (2016). Big data application in biomedical research and health care: A literature review. Biomedical Informatics Insights, 8, 1–10.
Mancini, M. (2014). Exploiting big data for improving healthcare services. Journal of e-Learning and Knowledge Society, 10(2), 23-33.
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big data: The next frontier for innovation, competition, and productivity. Retrieved from McKinsey Global Institute website: https://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/big-data-the-next-frontier-for-innovation
Moore, P., Xhafa, F., Barolli, L., & Thomas, A. (2013, October). Monitoring and detection of agitation in dementia: Towards real-time and big-data solutions. Paper presented at the 2013 Eighth International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, Compiegne, France.
Páez, D. G., Rodríguez, M. D. B., Sánz, E. P., Villalba, M. T., & Gil, R. M. (2015). Big data processing using wearable devices for wellbeing and healthy activities promotion. In I. Cleland, L. Guerrero, & J. Bravo (Eds.), IWAAL: Ambient assisted living. ICT-based Solutions in Real Life Situations (pp. 196–205). Cham, Switzerland: Springer.
Panahiazar, M., Taslimitehrani, V., Jadhav, A., & Pathak, J. (2014, October). Empowering personalized medicine with big data and semantic web technology: Promises, challenges, and use cases. 2014 IEEE International Conference on Big Data, Washington, DC.
Paul, R., & Hoque, A. S. M. L. (2010). Clustering medical data to predict the likelihood of diseases. 2010 Fifth International Conference on Digital Information Management, 44–49. Thunder Bay, Canada.
Pentland, A., Reid, T., & Heibeck, T. (2013). Big data and health: Revolutionizing medicine and public health. Report of the Big Data and Health Working Group 2013. Retrieved from http://www.wish-qatar.org/summits/wish-2013/forums-research-chairs/big-data-healthcare/
Poulymenopoulou, M., Malamateniou, F., Prentza, A., & Vassilacopoulos, G. (2015). Challenges of evolving PINCLOUD PHR into a PHR-based health analytics system. Paper presented at the European, Mediterranean & Middle Eastern Conference on Information Systems (EMCIS).
Redmond, S. J., Lovell, N. H., Yang, G. Z., Horsch, A., Lukowicz, P., Murrugarra, L., & Marschollek, M. (2014). What does big data mean for wearable sensor systems? Yearbook of Medical Informatics, 9(1), 135–142.
Roberts, E. B. (1985). Health information systems. Clinics in Laboratory Medicine, 23(5), 672–676.
Rui, Y. (2015). Medical big data: The next industry windy spot. Business School [Chinese], 4, 100–103.
Schadt, E. E. (2012). The changing privacy landscape in the era of big data. Molecular Systems Biology, 8(1), 612.
Tola, K., Abebe, H., Gebremariam, Y., & Jikamo, B. (2017). Improving completeness of inpatient medical records in Menelik II Referral Hospital, Addis Ababa, Ethiopia. Advances in Public Health, 2017, 1–5.
Hey, T., Tansley, S., & Tolle, K. (2012). The fourth paradigm: Data-intensive scientific discovery. Berlin, Germany: Springer-Verlag Berlin Heidelberg.
Tsumoto, S., Hirano, S., & Iwata, H. (2013). Mining nursing care plan from data extracted from hospital information system. Paper presented at the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, Niagara Falls, ON, Canada.
Usami, Y., Cho, H. C., Okazaki, N., & Tsujii, J. I. (2011). Automatic acquisition of huge training data for bio-medical named entity recognition. Proceedings of BioNLP 2011 Workshop, 5, 65–73.
Wang, L., & Alexander, C. A. (2013). Applications of automated identification technology in EHR/EMR. International Journal of Public Health Science, 2(3), 109–122.
Wang, Y., Kung, L., Ting, C., & Byrd, T. A. (2015). Beyond a technical perspective: Understanding big data capabilities in health care. Proceedings of the 48th Annual Hawaii International Conference on System Sciences (pp. 3044–3053). Hawaii, USA.
White, S. E. (2013). De-identification and the sharing of big data. Journal of American Health Information Management Association, 84(4), 44–47.
Wilson, A. M., Thabane, L., & Holbrook, A. (2004). Application of data mining techniques in pharmacovigilance. British Journal of Clinical Pharmacology, 57(2), 127–134.
Windridge, D., & Bober, M. (2014). A kernel-based framework for medical big-data analytics. In A. Holzinger & I. Jurisica (Eds.), Interactive knowledge discovery and data mining in biomedical informatics (pp. 197–208). Berlin, Germany: Springer-Verlag.
Xu, J., Wise, C., Varma, V., Fang, H., Ning, B., Hong, H., Kaput, J. (2010). Two new ArrayTrack libraries for personalized biomedical research. BMC Bioinformatics, 11(Suppl 6), S6.
Yan, Y., Qin, X., Fan, J., & Wang, L. (2014). A review on healthcare big data research. E-Science Technology & Application, [Chinese], 5(6), 3-16.
Yom-Tov, E. (2016). Crowdsourced health: How what you do on the Internet will improve medicine. Cambridge, MA: MIT Press.
Zhang, Z., Zhou, Y., Du, S. H., Luo, X. Q., & Mei, T. (2014). Medical big data and the facing opportunities and challenge. Journal of Medical Informatics, 6, 2–8.