Growing health awareness and the widespread adoption of wearable devices, wireless communications, and big data in the context of smart cities have created a need for smart and connected health (SCH). SCH refers to any digital healthcare solution or system that can operate remotely and that integrates innovative computational and engineering approaches to support the transformation of health and medical services (Clancy, 2006; National Science Foundation). It is a young but promising field of study at the intersection of medical informatics, public health, big data, engineering, business, and environment, resulting in intelligent healthcare services or enhanced cognitive capabilities through the IoT (Internet of Things) (Pramanik, Lau, Demirkan, & Azad, 2017). Compared with traditional healthcare, SCH has significant potential advantages: speeding up treatment and diagnostic processes, decreasing the cost of physician visits, and enhancing the quality of patient care. Owing to the growing number of technological devices and the ability to process the data gathered from them with minimal error, SCH has received considerable attention from researchers and industry.
Previous research on SCH can be classified into two main streams. The first stream concentrated on the concept of SCH. Röcker et al. asserted that SCH integrates ideas from ubiquitous computing and ambient intelligence and applies them to predictive, personalized, preventive, and participatory healthcare systems (Röcker, Ziefle, & Holzinger, 2014). Furthermore, Solanas et al. pointed out that SCH is the provision of health services using the context-aware network and sensing infrastructure of smart cities (Solanas et al., 2014). Suryadevara and Mukhopadhyay proposed that SCH is strongly connected to the concepts of wellness and wellbeing (Suryadevara & Mukhopadhyay, 2014). The second stream focused on the design and development of SCH applications, such as an IoT and cloud-based smart healthcare system (Muhammad, & Rahman, 2017), a smart healthcare system for elderly people (Hossain, 2016), and a smart health framework for ICU (Intensive Care Unit) access control and tracking (Lopez-Iturri et al., 2018). Sannino et al. developed a wellness app for smart health that gathers and classifies information about the subject and makes personal recommendations to enhance his or her well-being (Sannino, Forastiere, & De Pietro, 2017). Samsung Electronics built a smart health platform named Sleep Sense, which analyzes users' sleep status through a wearable device and controls the surrounding environment, such as the TV, air conditioner, air cleaner, and lights, to create an atmosphere suited to that sleep status (Samsung). However, like any other technological advancement, SCH systems and applications face challenges in system design, performance, and adaptation (Baig & Gholamhosseini, 2013; Venkatesh, Aksanli, Chan, Akyurek, & Rosing, 2018). Much more research is needed to explore SCH system design, user experience, and the effect on healthcare.
The U.S. National Science Foundation (NSF) has supported numerous scientific research projects that have played important roles in global economic growth and in improving the quality of people’s lives and health. In 2012 it announced an innovative cross-division program called Smart Health and Wellbeing (SHB) (National Science Foundation). That program was later renamed the Smart and Connected Health (SCH) program, which provides major funding support for fundamental, technical, and scientific research that transforms healthcare from reactive and hospital-centered to proactive, efficient, patient-centered, and focused on wellbeing rather than disease control (Chen, Roger, & Storey, 2012). As stated in NSF’s call for proposals, funded projects need to focus on breakthrough ideas in a variety of areas of value to health, such as sensor technology, networking, information and machine learning technology, modeling of cognitive processes, system and process modeling, and social and economic issues (National Science Foundation). To date, NSF has funded 265 SCH projects motivated by specific challenges in health and wellbeing.
What are the characteristics of these funded projects? What specific topics have they covered, and can different analytical approaches complement each other to provide a better picture of them? These questions motivated our study. The purpose of this study is to apply text analysis, text mining, and case study methods to the SCH projects funded by NSF in order to examine what has been funded, the characteristics of the funded projects, and the topics they address. Our analysis is expected to help researchers understand the scope and characteristics of current NSF-funded SCH projects so that they can better prepare their NSF proposals. It also provides data science students and educators with a case study of how text analysis can be conducted in this particular context.
2 A Text Analysis Framework
For the purpose of this study, we applied the framework depicted in Figure 1 to analyze the SCH projects funded by NSF. In analytic research on textual data, one first identifies the data source based on the purpose of the study or the research questions. The data are then collected from that source. If the desired data are part of a larger data collection, or big data, information retrieval is usually conducted to locate the relevant text documents or records.
The dotted rectangle in Figure 1 depicts the text analysis performed on the manageable text data obtained after information retrieval. There are many different ways to analyze a collection of text data, depending on the purpose of the analysis. In the field of natural language processing, text analysis can be performed at different levels: morphemes, words, phrases, sentences, discourse, and multiple documents or records. Low-level analysis is usually more accurate than analysis at the discourse level. Typical text analysis tasks such as classification, clustering, topic analysis, information extraction, and summarization require an understanding of the meaning of the texts and remain challenging.
This study applies three types of text analysis and processing: (1) low-level natural language processing, such as stop-word identification and filtering and stemming. The result helps to create a high-quality word cloud that reveals the most frequent content words in the project abstracts; (2) descriptive or bibliometric analysis. This is possible because the records of these NSF projects are well-organized datasets, as described in Table 1. Bibliometric analysis can reveal many characteristics of SCH projects, as presented in Section 3 below; (3) automatic content, clustering, and topic analysis. Multiple methods may be applied to discover the topics covered in the projects so that their results can be verified or complemented. Finally, it is important to combine the findings from the different analyses to arrive at a better understanding of these projects and of what can be learned from them.
Selected Metadata Information of a Sample NSF-Funded Project
|Title||SCH: INT: Collaborative Research: FITTLE+: Theory and Models for Smartphone Ecological Momentary Intervention|
|Program(s)||Smart and Connected Health|
|Abstract||Many health conditions are caused by unhealthy lifestyles and can be improved by behavior change. Traditional behavior-change methods (e.g., weight-loss clinics; personal trainers) have bottlenecks in providing expert personalized day-to-day support to large populations for long periods. …The collaborative team of researchers works with weight-loss interventionists at one of nation’s largest health organization’s facility in Hawaii. …|
3 General Characteristics: A Descriptive Analysis
The awards advanced search page of the NSF website (National Science Foundation) was used to retrieve the data for this study. The data were collected in June 2018. A total of 405 records were collected, of which 265 were SCH projects, including active and completed awards. One project may have more than one record, because NSF allows different institutions to file their proposals separately even when they are collaborating on the same project. The 265 records were retrieved and downloaded into an Excel file. The 25 metadata elements of each record are: Award Number, Title, NSF Division, Program(s), Start Date, Last Amendment Date, Principal Investigator (PI), State, Organization, Award Instrument, Program Manager, End Date, Awarded Amount to Date, Co-principal Investigators (Co-PI), PI email address, Organization Information (street, city, state, zip code, phone), NSF Directorate, Program Element Code(s), Program Reference Code(s), ARRA Amount, and Abstract. For the purpose of our analysis, we selected and used 10 of the metadata elements from each record. Table 1 shows a sample NSF project record with the 10 elements.
To understand the general characteristics of these projects, we focused on the distribution of the funded projects over years, over the 50 states, number of principal investigators (PIs) and co-principal investigators (Co-PIs), organizations involved, and others. Below we report the specific results.
3.1 Number of Funded Projects over Years
The number of projects funded each year is shown in Fig. 2. NSF began supporting SCH research projects in 2011. The total number of SCH projects from 2011 to 2018 is 265. Among them, 179 projects were funded during the period 2014 to 2017, accounting for 68% of all the SCH projects. Fewer projects were funded in the first three years (2011, 2012, and 2013); the number then increased significantly in later years. Note that the number of projects in 2018 is lower because the annual data are incomplete.
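The yearly distribution reported above amounts to a frequency count over award start dates. The following is a minimal sketch of that count, assuming each record carries a Start Date in MM/DD/YYYY form; the date format, the sample values, and the function names are hypothetical and are not taken from the dataset.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample of "Start Date" values; the real dataset has 265 records.
start_dates = ["09/01/2011", "08/15/2014", "10/01/2014", "09/01/2016", "01/15/2018"]

def projects_per_year(dates, fmt="%m/%d/%Y"):
    """Count records by the calendar year of their start date."""
    return Counter(datetime.strptime(d, fmt).year for d in dates)

def share_in_period(counts, first_year, last_year):
    """Fraction of all records whose start year lies in [first_year, last_year]."""
    total = sum(counts.values())
    in_period = sum(n for year, n in counts.items() if first_year <= year <= last_year)
    return in_period / total

counts = projects_per_year(start_dates)
print(sorted(counts.items()))
print(f"{share_in_period(counts, 2014, 2017):.0%}")
```

Applied to the full set of records, the same two functions would yield the per-year counts and the share of projects funded between 2014 and 2017.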
3.2 Geographical Distribution
Table 2 shows that the funded SCH projects were distributed across 37 states of the U.S. The top 5 states were Massachusetts, New York, California, Pennsylvania, and Texas, all of which had at least 19 projects. Some states, such as New Mexico and Hawaii, did not have any project funded by NSF. However, that does not mean that no researchers in these states were involved in NSF-funded SCH projects. Because NSF records list only the PI’s state, researchers in those states who participated in SCH projects as Co-PIs or research staff may simply not be listed. Nevertheless, the analysis of geographical distribution not only offers a perspective on the research level of different states in the area of SCH, but also provides an effective way for interdisciplinary researchers to find collaborators.
Geographical Distribution of Funded Projects
|State||Number of projects||State||Number of projects||State||Number of projects|
|Arizona||9||Missouri||4||District of Columbia||1|
In terms of funding amount, California received the most funds ($14,507,762), followed by Massachusetts and New York. West Virginia received only $140,000, the least among the 37 states that have received NSF funds for SCH. The majority of the states (34 states, 91.9%) received less than $10,000,000. We noticed that the states with the most funded projects are also the ones with the best-ranked medical schools. For example, Massachusetts has four medical schools accredited by the Liaison Committee on Medical Education: Tufts University, Boston University, University of Massachusetts, and Harvard. For states with few sponsored projects, NSF may wish to consider allocating more funding, provided that these states focus on research into area-specific endemic health problems and seek cooperation with experienced institutions.
3.3 Number of PI and Co-PIs
The number of PIs and Co-PIs on a project indicates, to some degree, the level of collaboration in that project. The nature of SCH projects demands that multidisciplinary teams work together to address multi-dimensional challenges ranging from fundamental science to clinical practice (National Science Foundation). The distribution of the number of PIs and Co-PIs is reported in Fig. 3. It shows that 128 SCH records had two or more investigators, accounting for 48.3% of the total records. Among them, 63 records have one PI and one Co-PI, 35 records have 2 Co-PIs, and 19 records have 3 Co-PIs. Furthermore, 9 records have 5 investigators (1 PI and 4 Co-PIs) and 2 records have 6 investigators (1 PI and 5 Co-PIs). A single-PI project may indicate that the PI’s team is itself multidisciplinary and can perform the tasks on its own. For example, the single-PI project named “Twitter Health: Learning Fine-Grained Models of Health Influences and Interactions from Social Media” is led by Henry Kautz alone. Professor Henry Kautz and his research team at the University of Rochester have long been devoted to research on social media analytics to improve public health, including tracking the worldwide spread of contagious diseases and locating sources of food poisoning. Alternatively, a single-PI record may be part of a collaborative project whose proposals were filed separately, so a single PI does not mean that the project has no other partners or collaborators. The number of projects also clearly decreases as the number of PIs and Co-PIs increases. A possible reason is that multiparty cooperation is not easy to develop.
3.4 Other Features: Organization Involved, Project Duration, and Fund Distribution
Our analysis found that a total of 129 organizations received at least one SCH award, that is, had at least one SCH project funded by NSF. The distribution of projects among organizations is presented in Table 3. Indiana University received 10 awards, the most of any organization, which suggests that it has cultivated the most project teams in the study of SCH. Our analysis also indicates that 2 organizations, Georgia Tech Research Corporation and University of Southern California, each have 7 funded SCH projects. Three other universities, Arizona State University, Carnegie-Mellon University, and University of Minnesota-Twin Cities, each had 6 funded SCH projects.
Number of Projects and Organizations
|Number of projects||Organizations||Number of organizations|
|7||Georgia Tech Research Corporation, University of Southern California||2|
|6||Arizona State University, Carnegie-Mellon University, University of Minnesota-Twin Cities||3|
|5||University of Memphis, University of Rochester||2|
|4||Columbia University, University of California-San Diego, University of Colorado at Boulder, University of Florida, University of Virginia Main Campus, University of Washington, Virginia Polytechnic Institute and State University||7|
|3||Clemson University, Cornell University, Johns Hopkins University, …||20|
|2||Dartmouth College, Drexel University, Florida International University, Kansas State University,…||31|
Fig. 4 depicts the distribution of project durations. The largest group (96 projects, or 36%) was proposed to be completed within 4 years, and 41 SCH projects were expected to take at least 5 years to complete. Further, 40 SCH projects were proposed to be completed in 3 years, 36 in 2 years, and 39 in 1 year. If we consider time to be a measure of the complexity of the projects, this indicates that most of the SCH projects are dealing with research challenges that cannot be solved in less than a year.
The complexity of projects can also be measured by the amount of funding. Table 4 shows that 12%, or 32 of the 265 projects, received more than $1,000,000, 25% received between $500,001 and $1,000,000, 44% received between $100,001 and $500,000, and 19% received $100,000 or less. The results indicate that, as in other NSF programs, NSF funds SCH projects at all scales.
Amount of Funds for Each SCH Record
|Award Amount||Number of Projects||Percent (%)|
|More than $1,000,000||32||12|
|Between $500,001- $1,000,000||66||25|
|Between $100,001- $500,000||116||44|
|Between $50,001- $100,000||13||5|
|Less than or equal $50,000||38||14|
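The percentages in the table above can be recomputed from the project counts. The sketch below does so and also shows one way an individual award amount could be assigned to a range; the function names and the example amount are illustrative assumptions rather than part of the original analysis.

```python
# Project counts per award range, copied from the table above.
TABLE_COUNTS = {
    "More than $1,000,000": 32,
    "$500,001-$1,000,000": 66,
    "$100,001-$500,000": 116,
    "$50,001-$100,000": 13,
    "Less than or equal $50,000": 38,
}

# Inclusive upper bound of each range, smallest first.
UPPER_BOUNDS = [
    (50_000, "Less than or equal $50,000"),
    (100_000, "$50,001-$100,000"),
    (500_000, "$100,001-$500,000"),
    (1_000_000, "$500,001-$1,000,000"),
]

def award_range(amount):
    """Return the label of the range that a dollar amount falls into."""
    for upper, label in UPPER_BOUNDS:
        if amount <= upper:
            return label
    return "More than $1,000,000"

def percentages(counts):
    """Convert counts to whole-number percentages of the total."""
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}

print(sum(TABLE_COUNTS.values()))   # total number of records in the table
print(percentages(TABLE_COUNTS))
print(award_range(140_000))         # arbitrary example amount
```

The counts sum to 265, and the rounded shares agree with the percentage column of the table.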
In addition, we found that the curve of award amounts follows a trend similar to that of the number of funded projects over the years. In 2012, the total award amount was $17,790,810, which is $5,901,161 more than in 2011. The total award amount in 2014 was $21,281,174, which is $4,279,903 more than in 2013. In 2016, the total award amount was $21,397,495, which is $4,527,541 more than in 2015. However, the award amount for SCH projects in 2017 was lower than in the previous year, which indicates that the average funding per project was reduced in 2017. Owing to the incomplete annual data, the award amount for 2018 was only $1,829,404; it is expected to reach $11,000,000 to $20,000,000 (National Science Foundation). The amount of funds over the years is shown in Fig. 5.
4 Text Mining and Text Analysis
This section reports the procedures and results of text mining and analysis applied to the abstracts of the selected projects. The purpose of the analysis is to understand in depth what these projects have investigated, mainly the research topics covered and the health challenges addressed.
Unlike bibliometric analysis, which mainly reveals the external characteristics of the projects, the analysis reported in this section is based on the text content, with a higher degree of automation and finer granularity. We first performed low-level natural language processing to prepare the text data, and then applied term frequency and word cloud, K-means clustering, and topic analysis methods to make sense of the text content.
4.1 Content Analysis of Project Titles
We first applied content analysis to the titles of the 265 projects using the software NVivo 12. As shown in Table 5, ‘SCH’, ‘research’, ‘collaborative’, ‘INT’, ‘health’, ‘EXP’, ‘SHB’, ‘data’, ‘support’ and ‘modeling’ were the top 10 high-frequency keywords. These keywords reveal that 130 of the 265 projects were collaborative ones with multiple investigators. The dataset contains 80 INT (Integrative Projects) projects, 59 EXP (Exploratory Projects) projects, and 50 SHB (Smart Health and Wellbeing) projects. A deeper content analysis found that the titles can be categorized into 7 facets: modeling, sensor, monitoring, detection, data, systems, and intervention. These facets need to be verified and explained by the topic analysis of the abstracts, which provide much richer information than the respective titles.
High Frequency Keywords in Title
4.2 Abstract Data Preparation
We chose the abstract fields of the retrieved NSF SCH project records as our data source for further text analysis.
The abstracts were extensive summaries written by the PIs, which usually contain rich information regarding the projects’ research purposes, targeted health problems or diseases, methodologies, devices and impacts.
We first removed duplicate records, leaving 188 records for the analysis. We then tokenized the text (“word cutting”) using the word_tokenize method in Python and applied the WordNet Lemmatizer in the Natural Language Toolkit (NLTK) (Bird). The lemmatizer was used to reduce inflectional forms and derivationally related forms of a word to its common base form. Compared with stemming, the WordNet Lemmatizer does not simply chop off inflections but instead relies on a lexical knowledge base, WordNet, to obtain the correct base forms of words, which produces better output.
Next, high-frequency function words such as prepositions, articles, and conjunctions were removed by applying an extended stopword list from NLTK. The list was extended by adding function words that were missing from the original list. Lastly, the “CountVectorizer” method in scikit-learn was used to examine the 5,080 words in the whole corpus. The 178 words that appear in more than 20% of the records were filtered out, leaving 4,902 unique terms for further text analysis, as reported in the remaining sections.
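The final filtering step can be illustrated without any library. The sketch below is a plain-Python approximation of a document-frequency cutoff (scikit-learn's CountVectorizer exposes a comparable `max_df` parameter). The toy documents, the tiny stopword set, and the function names are hypothetical, and tokenization is reduced to whitespace splitting rather than NLTK's tokenizer and lemmatizer.

```python
from collections import Counter

# Tiny hypothetical stopword set; the study used an extended NLTK list.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "for"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stopwords."""
    tokens = (t.strip(".,;:()\"'").lower() for t in text.split())
    return [t for t in tokens if t and t not in STOPWORDS]

def filter_by_document_frequency(docs, max_df=0.2):
    """Drop terms that occur in more than a max_df share of the documents."""
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    limit = max_df * len(docs)
    too_common = {term for term, n in df.items() if n > limit}
    kept = [[t for t in tokens if t not in too_common] for tokens in tokenized]
    return kept, too_common

# Hypothetical one-line "abstracts".
docs = [
    "Wearable sensors for health monitoring",
    "Privacy of health records",
    "Epidemic forecasting and health data",
    "Gait rehabilitation for the elderly",
    "Ultrasound imaging in health care",
]
kept, too_common = filter_by_document_frequency(docs, max_df=0.2)
print(too_common)
print(kept[1])
```

In this toy corpus only "health" exceeds the 20% threshold, which mirrors how very common words would be removed from the abstracts before clustering.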
4.3 Term Frequency and Word Cloud
With the normalized results from the above data preparation procedures, we obtained a list of words with their term frequencies. The top 40 words are listed in Table 6. These terms are mostly nouns and verbs in general use. No disease names or health-specific terms are among the top 40 words. Fig. 6 is the word cloud of the top 403 content words. The word cloud was created using wordart.com, a free word cloud generator (WordArt).
Most Frequent Terms in Titles and Abstracts
We can make sense of the projects by observing the term frequency table and the word cloud. Most of the SCH projects are developing something, whether a new device, new models, new technologies, new approaches, new systems, or new algorithms. The funded projects are well aligned with the NSF program solicitation, which focuses on patient, health, data, medicine, student, and care. Many projects involve proposing, designing, and using systems and models, and the new technologies are used for monitoring, improving, and supporting human care activities. Note that the counts for several of the most frequent terms were obtained by manually normalizing the different forms of the term. For example, the frequency count for “model” was accumulated from terms such as “modeling,” “Modeling,” “models,” and “Models.”
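The normalization of variant forms described above can be sketched as a lookup applied before counting. The variant mapping and the token list below are hypothetical examples; in the study the normalization was done by hand.

```python
from collections import Counter

# Hypothetical mapping from surface forms to one canonical term.
VARIANTS = {"modeling": "model", "models": "model", "systems": "system"}

def term_frequencies(tokens, variants=VARIANTS):
    """Count terms after lowercasing and folding known variants together."""
    counts = Counter()
    for tok in tokens:
        tok = tok.lower()
        counts[variants.get(tok, tok)] += 1
    return counts

tokens = ["Modeling", "models", "model", "Models", "system", "Systems", "patient"]
freq = term_frequencies(tokens)
print(freq.most_common(3))
```

Here the four surface forms of "model" are accumulated under a single entry, which is the effect the manual normalization has on the frequency table.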
4.4 K-Means Clustering
The term frequency analysis gave us a preliminary picture of these NSF SCH projects, but it did not reveal the more domain-specific patterns contained in them. We therefore turned to unsupervised clustering, which groups similar texts into the same cluster and separates dissimilar ones into different clusters. Clustering is a common technique for statistical data analysis that has been used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.
Cluster analysis can be achieved by various algorithms that differ significantly in their understanding of what constitutes a cluster and how to find clusters efficiently. It is an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and error. It is often necessary to modify the data preprocessing and model parameters until the result achieves the desired properties. With the preprocessed data, we applied the K-means method from the scikit-learn library in the Python environment, because K-means is the most widely used clustering method, is relatively easy to apply, and is highly efficient. We tried generating 20, 15, 10, 9, and 8 clusters, and each time the results were examined by 2 of the authors. The results with 8 clusters seem to be the most meaningful, as shown in Table 7. Here is our interpretation of the 8 clusters, starting from Cluster 0.
K-Means Clustering Results of the NSF SCH Abstracts
|Cluster ID||Keywords in the cluster|
|Cluster0||Doctoral, forum, participant, expert, career, consortium, travel, chase informatics, engage, scientific, participation, peer, connected, supportive, receive, international, pervasive, internationally-recognized , critique|
|Cluster1||behavior, mobile, measurement, adolescent, cell, informatics, phone, scale, inference, adaptive, wearable, predictive, mhealth, compute, inspire, sensor, environment, participant, wireless, collect|
|Cluster2||privacy, contact, empowerment, privacy-preserving, tracing, breach, similarity, security, publishing, preserve, trace, genomic, secondary, protect, ehr, public, re-identification, hipaa, eliminate, genetic|
|Cluster3||epidemic, forecasting, tracing, dynamic, infection, dynamical, sepsis, infectious, estimate, hcv, lh, forecast, symposium, influenza, ebola, prevalence, demographic, ass, reference, rop|
|Cluster4||mobility, motion, impairment, physical, elderly, assistive, child, prosthesis, therapy, assistance, movement, caregiver, gait, exoskeleton, cartilage, assist, rehabilitation, energy, gerontechnology, orthosis|
|Cluster5||Social, twitter, behavioral, media, health, measure, wellness, signal, mobility, volume, dynamic, monitor, symptom, personalized, drug, management, condition, theory, identity|
|Cluster6||decision, treatment, data, artificial, ai, system, model, intelligence, methods, behavioral, diagnostic, clinician, practice, software, prediction, processing, platform, software, assessment, exploratory|
|Cluster7||ultrasound, image, behavior, therapy, visualization, high-dimensional, medicine, adaptive, 3-d, modeling, imaging, processing, telemedicine infrastructure, device, signal|
- Cluster 0 is about education and academic activities related to smart and connected health. Its content includes doctoral consortiums, annual conferences, an institute on global healthcare education, travel support for students, and some international forums held in recent years;
- Cluster 1 covers mobile health (mHealth) applications and research, which collect patients’ personal health data through phones, sensors, or wireless wearable devices to better monitor, measure, or predict patients’ diseases;
- Cluster 2 is about research related to EHRs (Electronic Health Records), which may cover privacy-preserving issues in the preserving, tracing, breaching, and publishing of electronic health records;
- Cluster 3 covers research on infectious and epidemic diseases. Several epidemic and infectious diseases appear, such as HCV, Ebola, influenza, and sepsis. Projects in this cluster seem to focus on the dynamic tracing, detection, and forecasting of these infectious diseases based on demographic and disease prevalence attributes;
- Cluster 4 seems to cover research on “gerontechnology and rehabilitation,” which may help the elderly receive good care or better therapy. The conditions discussed in this cluster may be exoskeleton-related or cartilage-related and may cause physical movement impairment; medical devices such as orthoses and prostheses also appear among the keywords;
- Cluster 5 contains projects about how social media affects people’s health and interactions. It includes research such as “Twitter Health: Learning Fine-Grained Models of Health Influences and Interactions from Social Media” and “CRUFS: A Unified Framework for Social Media Analysis of Adverse Drug Events”.
BTM Results of the NSF SCH Abstracts
|Topic ID||Keywords||Representative Projects|
|topic0||behavior condition family treatment environment behavioral physical sleep management people caregiver dynamic adult service cognitive||doc21:EAGER: Agile Data Integration to Facilitate Scaling of Air Quality Research doc116:SCH: INT: Large-Scale Probabilistic Phenotyping Applied to Patient Record Summarization|
|topic1||participant expert scientific informatics engage doctoral connected forum collaboration career program compute undergraduate international multidisciplinary||doc173:Student Travel Grant: Fifth IEEE International Conference on Healthcare Informatics (ICHI 2017) doc172:Student Mentoring and Travel Support for the 5th International Conference on Ambulatory Monitoring of Physical Activity and Movement 2017|
|topic2||decision drug dynamic modeling training methodology cancer treatment behavioral simulation generate surgical assess adaptive cpr||doc99:SCH: INT: Collaborative Research: Diagnostic Driving: Real Time Driver Condition Detection Through Analysis of Driving Behavior doc136:SHB: Small: An Assistive, Robotic Table [ART] Promoting Independent Living|
|topic3||signal body device volume child measure sensing mobility motion image sweat impairment movement platform structure||doc119:SCH: INT: Multispectral Panoramic 3-D Endoscopic Imaging doc60:SCH: EAGER: New Approach: Early Diagnosis of Alzheimer’s Disease Based on Magnetic Resonance Imaging (MRI) via High-Dimensional Image Feature Identification|
|topic4||record phenotype electronic decision diabetes alert clinician physician air ehr software breath source factor exist||doc53:RAPID: SCH: A Framework for Epidemic Contact Tracing Using Multi-contextual Information doc52:RAPID: COLLABORATIVE RESEARCH: Building Infrastructure to Prevent Disasters like Hurricane Maria|
|topic5||sleep brain event feature processing cell phone device signal alarm prediction factor collect assessment extract||doc25:EAGER: Collaborative Research: CRUFS: A Unified Framework for Social Media Analysis of Adverse Drug Events doc15:CRII: SCH: Modeling and Analysis of Genetic Regulatory Networks under Drug Perturbation|
|topic6||image imaging brain device ultrasound 3d dynamic pressure 3-d automation mri feature physical cartilage pose||doc29:EAGER: Feasibility of Using Speech as Biomarker for Concussions doc13:CRII: SCH: A Smart Biosensor for Monitoring Cell Sickling in Patients with Sickle Cell Disease|
|topic7||privacy contact ehrs heart water public personal failure collect disaster device privacy-preserving tracing facilitate environmental||doc121:SCH: INT: Novel Textile Based Sensors for Inner Prosthetic Socket Environment Monitoring doc122:SCH: INT: Optimal Prosthesis Design with Energy Regeneration|
- Cluster 6 covers SCH-related systems and modeling research that helps clinicians make better diagnoses and provides better decision-support services. Related research includes “EAGER: Collaborative Research: Data Science Applications In Cyberphysical Systems for Health”, “CRII:SCH: Computational Methods to Mine Multi-omic Data for Systems Biology of Complex Diseases”, and “EAGER/Collaborative Research: Sensing, Modeling and Optimization of Postoperative Heart Health Management”;
- Cluster 7 contains projects in areas related to medical imaging, ultrasound, 3-D, and visualization, such as “Synergy: Collaborative Research: MRI Powered & Guided Tetherless Effectors for Localized Therapeutic Interventions” and “SCH: Visualization for Better Medical Decision-Making.”
The 8 clusters presented in Table 7 are meaningful and easy to interpret. The clustering approach therefore seems quite effective for analyzing this type of text data.
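The clustering step in this section relied on scikit-learn's K-means implementation. To make the underlying procedure concrete, the following is a from-scratch sketch of the basic assign-and-update loop (Lloyd's algorithm). It is a simplification and not the library's implementation: it uses the first k points as initial centroids, whereas library implementations typically use randomized initialization, and the three-term vocabulary and count vectors are hypothetical.

```python
import math

def kmeans(points, k, iterations=100):
    """Basic K-means: the first k points serve as the initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the result is stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical term-count vectors over the vocabulary (privacy, epidemic, gait).
points = [(3, 0, 0), (0, 4, 0), (0, 0, 5), (2, 0, 1), (0, 3, 1), (1, 0, 4)]
labels, centroids = kmeans(points, k=3)
print(labels)
```

Each toy document ends up in the cluster whose dominant term it shares, which is the same grouping behavior that produced the keyword clusters in Table 7, only on a much smaller scale.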
4.5 Topic Modeling
As a hard clustering method, K-means assumes that each document belongs to one and only one cluster. Topic modeling, by contrast, assumes that each document contains multiple topics, and the corpus is clustered by those topics. The “topics” produced by topic modeling techniques are clusters of similar words. Topic modeling is a frequently used text-mining tool for discovering hidden semantic structures in a text body. In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents (Chen, Chen, Qu, Chen, & Ding, 2018). Topic models have been used to detect instructive structures in data such as genetic information, images, and networks. They also have applications in other fields such as bioinformatics.
Although topic modeling is compelling, our dataset is not large enough to run properly the LDA topic modeling that we used in our previous paper (Chen, Chen, Qu, Chen, & Ding, 2018). We therefore applied a more recently proposed topic model, BTM (biterm topic model), to see what could be learned from our available data (Yan, Guo, Lan, & Cheng, 2013).
Biterm Topic Modeling learns topics over short texts by directly modeling the generation of all the biterms (i.e. word co-occurrence patterns) in the whole corpus. A biterm denotes an unordered word-pair co-occurring in a short context (i.e. an instance of word co-occurrence pattern). In short texts, since documents are usually short and specific, we just take each document as an individual context unit. We extract any two distinct words in a short text document as a biterm (Yan, Guo, Lan, & Cheng, 2013).
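Biterm extraction, the input stage described above, can be sketched in a few lines. The sketch covers only the extraction of unordered pairs of distinct words from each short document, not the topic inference itself; the function names and the toy documents are hypothetical.

```python
from collections import Counter
from itertools import combinations

def biterms(tokens):
    """All unordered pairs of distinct words in one short document."""
    unique = sorted(set(tokens))  # sorting makes each pair order-independent
    return list(combinations(unique, 2))

def corpus_biterms(docs):
    """Count how often each biterm occurs across the whole corpus."""
    counts = Counter()
    for tokens in docs:
        counts.update(biterms(tokens))
    return counts

# Hypothetical preprocessed documents.
docs = [["privacy", "ehr", "tracing"], ["privacy", "ehr"]]
counts = corpus_biterms(docs)
print(counts.most_common())
```

A biterm topic model is then fitted to these corpus-level co-occurrence counts rather than to per-document word counts, which is what makes the approach suitable for short texts.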
With the preprocessed data, we ran the BTM method using the scikit-learn library in a Python environment. We again tried 20, 15, 10, 9, and 8 topics. The results of each experiment were examined by two of the authors. The result with eight topics, which was the most coherent and readable, is shown below.
We can see that some of the topics match the clusters obtained through cluster analysis. For example, topic 1 closely resembles cluster 0, which covers healthcare-related international conferences, forums, and education; topic 3 resembles cluster 1, which is about mHealth-related wearable mobility devices that monitor or measure body motion; topic 6 concerns medical imaging, the same content as cluster 7; and topic 7 concerns privacy preservation and EHR-related work, which is very similar to cluster 2.
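In this study the topics and clusters were matched by inspection. One possible way to support such a comparison numerically is to measure the overlap between the top words of each topic and the top terms of each cluster. The sketch below uses the Jaccard index for this purpose. It is an illustrative assumption, not the procedure used in this paper, and the word lists are invented placeholders rather than actual results.

```python
def jaccard(a, b):
    """Jaccard index of two word lists: |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_matches(topics, clusters):
    """For each topic, return the cluster with the largest word overlap."""
    matches = {}
    for topic_id, topic_words in topics.items():
        scores = {cid: jaccard(topic_words, words)
                  for cid, words in clusters.items()}
        best = max(scores, key=scores.get)
        matches[topic_id] = (best, scores[best])
    return matches


# Placeholder word lists, invented for illustration only.
topics = {
    "topic_a": ["imaging", "ultrasound", "visualization", "mri"],
    "topic_b": ["privacy", "record", "patient", "secure"],
}
clusters = {
    "cluster_x": ["imaging", "mri", "visualization", "3d"],
    "cluster_y": ["privacy", "record", "sharing", "patient"],
}
matches = best_matches(topics, clusters)
for topic_id, (cluster_id, score) in matches.items():
    print(topic_id, "->", cluster_id, round(score, 2))
```

A low best score for a topic would suggest that it has no clear counterpart among the clusters, which is itself useful information when deciding whether to split or merge categories.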
5.1 Characteristics of NSF funded SCH Projects
The bibliometric analysis revealed many characteristics of these projects. For example, SCH research started in 2011, when there were 24 SCH projects, and the number increased significantly in later years. This indicates that SCH is of great importance and receives continuous funding support from NSF. In terms of geographical distribution, the 265 funded SCH projects were spread across 37 U.S. states; the top 5 states were Massachusetts, New York, California, Pennsylvania, and Texas. Nearly half of the SCH projects are collaborative projects with more than one principal investigator. In terms of organizational distribution, Indiana University has cultivated the most project teams in the study of SCH and is the leading organization. Most projects were proposed to be completed within 4 years. Moreover, in terms of award amount, 44% of SCH projects received between $100,001 and $500,000.
There is more to analyze in these projects, depending on the interests of the users. For example, named entity extraction could be a useful approach to identify disease names in the abstracts of these projects.
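As a rough illustration of the idea, the sketch below finds disease names by simple dictionary lookup. This is a much simpler stand-in for a trained named entity recognizer and is not something performed in this study. The lexicon contains only the disease names mentioned in this paper, and the sample sentence is hypothetical.

```python
import re

# Disease names mentioned in this paper; a real lexicon would be far larger.
DISEASE_LEXICON = [
    "asthma",
    "type-ii diabetes mellitus",
    "heart failure",
    "infection",
]


def find_disease_mentions(text, lexicon=DISEASE_LEXICON):
    """Return the lexicon terms that occur in the text as whole words."""
    lowered = text.lower()
    found = []
    for term in lexicon:
        # \b keeps "infection" from matching inside "disinfection".
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            found.append(term)
    return found


# Hypothetical abstract sentence, written for this example.
sample = "We develop a wearable sensor to monitor asthma and heart failure at home."
mentions = find_disease_mentions(sample)
print(mentions)
```

Dictionary lookup misses spelling variants and abbreviations, so a statistical or domain-specific entity recognizer would likely be needed for a full analysis.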
5.2 Topics Covered by the Projects
Through the title analysis, K-Means clustering, and BTM, we obtained different categories. The “verification/sense making” step presented in Fig. 1 requires that we verify these automatically generated results to make sure the classifications are accurate. We therefore examined the titles of all the projects manually to determine their major research areas. We found that none of the automatic analyses above was accurate; facets, clusters, and topics could be further split or combined. For example, among the 7 facets from the title analysis (modeling, sensor, monitoring, detection, data, systems, and intervention), sensor and data can be combined because these projects focus on data collection. Also, most projects with terms related to monitoring, detection, and intervention address smart health devices or algorithms for these purposes. Similarly, the 8 clusters from K-Means and the 8 topics from BTM were individually analyzed to decide whether each cluster or topic could represent a research area by itself or should be combined with others to form an area.
We concluded that the funded projects have explored the following main research areas: 1) developing medical systems, user interfaces, and platforms to help treat, monitor, predict, or understand diseases such as asthma, Type-II Diabetes Mellitus (T2DM), infection, and heart failure; 2) modeling for intelligent clinical decision-making, health management, temporal EHR, disease treatment, and reducing the risk of rehospitalization; 3) designing smart health devices, such as a chair for proactive injury prevention and electronic textiles for ambulatory health monitoring; 4) exploring clinical data and its applications in cohort identification, assessment of acute respiratory distress syndrome, and privacy preservation; 5) conducting education and academic activities on SCH, including international conferences, an mHealth summer training institute, a doctoral consortium, etc.
This study extends our previous study (Chen, Chen, Qu, Chen, & Ding, 2018), which analyzed only 100 active SCH projects in 2018. In this study, we applied different automatic analysis approaches, such as clustering and biterm topic modeling, but did not perform manual content analysis on the abstracts. Nevertheless, comparing the topics and clusters discovered by the automatic approaches across the 265 projects with the research areas and challenges identified through manual analysis in the previous study, we found similar research areas and challenges. We believe more research is needed in SCH to help solve many health-related problems and issues.
5.3 Can Different Analytical Approaches Complement Each Other?
To explore the characteristics of these projects and the topics they investigate, we applied descriptive analysis, content analysis of titles, word cloud analysis, clustering, and biterm topic modeling. Our analysis demonstrated that applying multiple analytical approaches rather than a single approach has the following benefits: (1) it achieves a more comprehensive understanding of the projects, since the bibliometric analysis and topic analysis allow us to examine the projects from different angles; (2) the approaches complement each other and increase the accuracy of the analysis.
Each analytical approach has its own application context, advantages, and limitations. For example, the word cloud can present the most frequent general content terms across projects, such as Model, Development, and Learn, in addition to terms like Health and Data. It is, however, unable to identify subject terms or named entities in the texts; for example, we cannot identify disease names using the word cloud. Clustering and topic modeling, in contrast, help to identify the subjects or themes of these projects, including system or platform development, modeling or algorithmic development for various purposes, designing smart health devices, clinical data collection and application, and education and academic activities of SCH.
Identifying and applying the most appropriate analytical approach to a collection of texts is always challenging. We have conducted manual content analysis (Chen, Chen, Qu, Chen, & Ding, 2018) and found that the manual approach might not achieve more accurate results than automatic methods, especially when the coders lack domain knowledge about the texts. Automatic methods are desirable because they are not only cost effective but also more objective. However, some automatic methods may not be applicable or may fail to provide good analytical results. For example, our data set is not large enough to run LDA topic modeling. Instead, we applied both K-Means and BTM to our data to seek a better result and to enable a comparison, which can also serve as a reference for other researchers’ future studies. Unfortunately, the BTM result was not as good as we had expected, so the second clustering based on the BTM result seems meaningless. Fortunately, the K-Means result is good enough to divide the data into eight coherent clusters, which are comparatively easier to explain.
This study is still quite preliminary, as we attempt to test different approaches to understanding a collection of short texts. The data size is small and the study is restricted to only one funding agency. Our future research will address these issues.
6 Summary and Future Research
This paper analyzes 265 NSF projects that were identified as belonging to the Smart and Connected Health program, based on information retrieved from the NSF website. Content analysis, descriptive statistical analysis, clustering analysis, and biterm topic modeling were carried out to understand the characteristics of these projects and the topics they cover.
Our content analysis of the project titles and clustering/topic analysis of the abstracts indicate that SCH is a very important research area with many challenging research problems, and that researchers interested in conducting SCH research will need a collaborative spirit and the ability to work as part of a team. Moreover, the project titles fall into 7 categories: modeling, sensor, monitoring, detection, data, systems, and intervention. The descriptive statistical analysis characterized the number of funded projects over the years, geographical distribution, number of PIs and co-PIs, organizations involved, project duration, and fund distribution. Combining the results of K-Means clustering and BTM, we recognized five main interdisciplinary topics: system, interface, or platform development; modeling or algorithmic development for health management and disease treatment; smart health device design; clinical data collection and application; and education and academic activities of SCH.
We believe there are many opportunities for researchers to seek funding from NSF and other agencies in the area of smart and connected health. Our attempts in this study are also meaningful, especially now that data analytics is considered important for improving human life through a better understanding of the data around us. Furthermore, some of the facts presented in this paper may help funders, and researchers who seek funding support, to understand the current funding status of the SCH program comprehensively.
This study is the beginning of our endeavor on smart and connected health. Our future research will assess the impact of these funded projects through citation analysis of their publications and the actual commercialization of the research findings. This may help the funding agency assess the impact of its support for SCH. Another research direction is to conduct a more thorough analysis across agencies and to explore more sophisticated text analysis techniques for effective and efficient understanding and mining of texts.
Baig, M. M., & Gholamhosseini, H. (2013). Smart health monitoring systems: An overview of design and modeling. Journal of Medical Systems, 37(2), 9898.
Bird, S., Klein, E., & Loper, E. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. Retrieved from http://www.nltk.org/book/
Chen, H., Chiang, R. H. L., & Storey, V. C. (2012). Business intelligence and analytics: From big data to big impact. Management Information Systems Quarterly, 36(4), 1165–1188.
Chen, J., Chen, M., Qu, J., Chen, H., & Ding, J. (2018, July). Smart and connected health projects: Characteristics and research challenges. Paper presented at the International Conference on Smart Health, Wuhan, China.
Hossain, M. S. (2016, July). Patient status monitoring for smart home healthcare. Paper presented at the Multimedia & Expo Workshops (ICMEW), 2016 IEEE International Conference, Seattle, WA, USA.
Lopez-Iturri, P., Aguirre, E., Trigo, J. D., Astrain, J. J., Azpilicueta, L., Serrano, L., & Falcone, F. (2018). Implementation and Operational Analysis of an Interactive Intensive Care Unit within a Smart Health Context. Sensors (Basel), 18(2), 389.
Muhammad, G., Rahman, S. M. M., Alelaiwi, A., & Alamri, A. (2017). Smart health solution integrating IoT and cloud: A case study of voice pathology monitoring. IEEE Communications Magazine, 55(1), 69–73.
National Science Foundation. NSF Award Search: Advanced Search. Retrieved from https://www.nsf.gov/awardsearch/advancedSearch.jsp
National Science Foundation. Smart and Connected Health (SCH). Retrieved from https://www.nsf.gov/pubs/2016/nsf16601/nsf16601.htm
National Science Foundation. Smart and Connected Health (SCH): Connecting Data People and Systems. Retrieved from https://www.nsf.gov/pubs/2018/nsf18541/nsf18541.htm
National Science Foundation. Smart Health and Wellbeing (SHB) Program Solicitation NSF 12-512. Retrieved from https://www.nsf.gov/pubs/2012/nsf12512/nsf12512.htm
Pramanik, M. I., Lau, R. Y., Demirkan, H., & Azad, M. A. K. (2017). Smart health: Big data enabled health paradigm within smart cities. Expert Systems with Applications, 87, 370–383.
Röcker, C., Ziefle, M., & Holzinger, A. (2014). From computer innovation to human integration: Current trends and challenges for pervasive Health Technologies. In A. Holzinger, M. Ziefle, & C. Röcker (Eds.), Pervasive health (pp. 1–17). Human–Computer Interaction Series. London: Springer. doi:10.1007/978-1-4471-6413-5_1
Samsung. (2015). Samsung Announces Samsung SleepSense. Retrieved from https://www.samsung.com/uk/news/global/samsung-announces-samsung-sleepsense/
Sannino, G., Forastiere, M., & De Pietro, G. (2017). A Wellness Mobile Application for Smart Health: Pilot Study Design and Results. Sensors (Basel), 17(3), 611.
Solanas, A., Patsakis, C., Conti, M., Vlachos, I. S., Ramos, V., Falcone, F., Postolache, O., Pérez-Martínez, P. A., Pietro, R. D., Perrea, D. N., & Martínez-Ballesté, A. (2014). Smart health: A context-aware health paradigm within smart cities. IEEE Communications Magazine, 52(8), 74–81. doi:10.1109/MCOM.2014.6871673
Suryadevara, N. K., & Mukhopadhyay, S. C. (2014). Determining wellness through an ambient assisted living environment. IEEE Intelligent Systems, 29(3), 30–37.
Venkatesh, J., Aksanli, B., Chan, C. S., Akyurek, A. S., & Rosing, T. S. (2018). Modular and personalized smart health application design in a smart city environment. IEEE Internet of Things Journal, 5(2), 614–623.
WordArt. Word Cloud Art Creator. Retrieved from https://wordart.com/
Yan, X., Guo, J., Lan, Y., & Cheng, X. (2013, May). A biterm topic model for short texts. Proceedings of the 22nd International Conference on World Wide Web, 1445–1456.