Many investigators have carried out text mining of the biomedical literature for a variety of purposes, ranging from the assignment of indexing terms to the disambiguation of author names. A common approach is to define positive and negative training examples, extract features from article metadata, and use machine learning algorithms. At present, each research group tackles each problem from scratch, in isolation from other projects, which causes redundancy and considerable wasted effort. Here, we propose and describe the design of a generic platform for biomedical text mining, which can serve as a shared resource for machine learning projects and as a public repository for their outputs. We initially focus on a specific goal, namely, classifying articles according to publication type, and emphasize how feature sets can be made more powerful and robust through the use of multiple, heterogeneous similarity measures as input to machine learning models. We then discuss how the generic platform can be extended to include a wide variety of other machine learning-based goals and projects, and can be used as a public platform for disseminating the results of natural language processing (NLP) tools to end-users as well.
Many investigators have carried out text mining of the biomedical literature for a variety of purposes, ranging from the assignment of indexing terms to the disambiguation of author names to automated summarization of articles. A common approach is to define positive and negative training examples, extract features from article metadata or full-text, and use machine learning algorithms. A search in PubMed for [text AND (“machine learning” OR automated)] yields more than 2650 articles. The text mining research community is active, relatively cohesive, and stands poised to participate in the revolution in medical care that is associated with the digitization of medical records, knowledge bases, and scientific publications (Simpson & Demner-Fushman, 2012; Przybyła, Shardlow, Aubin, Bossy, de Castilho, Piperidis, et al., 2016).
The public resources provided by the National Library of Medicine (
Nevertheless, at present, most informatics investigators maintain independent databases, extract features on their own for each project, and generally carry out their research efforts on small subsets of the biomedical literature and largely in isolation from each other. Thus, they do not benefit fully from the savings and diversity (and possible reuse) associated with shared resources. In particular, we believe that additional savings can be achieved if the community of text mining researchers can reuse and contribute to a
Features are the results that are generated when an NLP tool or machine learning model processes a passage or corpus of text. These features can be rather simple – for example, here is the title of a PubMed article: Discovering foodborne illness in online restaurant reviews.
One can employ the Porter stemmer (Porter, 1980) to process the text, which will result in: Discov foodborn ill in onlin restaur reviews.
One can also employ a Bio tokenizer/stemmer (Torvik, Smalheiser, Weeber, 2007) instead, which will result in: discovering foodborne illness in online restaurant review.
The two stemmers produce quite different results; the Porter stemmer is designed to collapse word variants into a single form (largely by stripping word endings), whereas the Bio tokenizer/stemmer is designed to be very gentle, merely stripping the final –s of “reviews”. The resulting processed text can then be used as input for further text mining and modeling, e.g., stopword removal and part-of-speech tagging.
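To make the contrast concrete, here is a minimal sketch of a “gentle” stemmer in the spirit described above. This is an illustrative toy, not the actual Bio tokenizer/stemmer; the rule (strip only a final plural -s, leaving words like “illness” intact) is an assumption made for the example.

```python
import re

def gentle_stem(text):
    """Lowercase, tokenize, and strip only a final plural -s
    (leaving double-s words like 'illness' intact), mimicking the
    gentle behavior described for the Bio tokenizer/stemmer."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = []
    for tok in tokens:
        if tok.endswith("s") and not tok.endswith("ss") and len(tok) > 3:
            tok = tok[:-1]
        stems.append(tok)
    return " ".join(stems)

print(gentle_stem("Discovering foodborne illness in online restaurant reviews."))
# discovering foodborne illness in online restaurant review
```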
In the case just discussed, the raw text is readily available as a shared resource in open-access form (from MEDLINE, PubMed, and the publisher). The tools to process short text passages are also openly available in this case (from a query interface maintained at our project website
The issue becomes even more compelling for features that require more sophisticated, large-scale modeling to produce, and that do not merely process a piece of text based upon the text itself but draw upon external databases and knowledge bases, which may undergo incremental updating over time. In such cases, it may not be feasible for others to attempt to duplicate the modeling or tagging on their own. Rather than distributing the code and back-end databases to users, which are large, complex, and cumbersome to distribute and get running at another site, it is much more efficient simply to provide users with the end results. Indeed, our laboratory has created a suite of such precomputed resources that are freely available online for viewing or download (
In this paper, we outline our vision for
For simplicity and concreteness, we first describe the generic approach and show how it could support the indexing of articles according to one or more publication types (PTs) (
The overall framework of our generic approach is to represent each PubMed article as a vector consisting of n metadata features. Each training set is represented as a set, cloud, or cluster of these vectors in n-dimensional space. The distance between any two PubMed articles can be calculated as a weighted sum of the pairwise similarity scores of the underlying features between the two articles. Then, the overall distance between a PubMed article and a training set will be some function of the weighted pairwise similarity scores (for each of the articles that make up the training set). Finally, articles can be classified as belonging to one or more categories (depending on the relative distance of an article to the positive vs. negative training sets) or similar articles can be clustered together (the preferred clustering strategy and end point may vary depending on the project).
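As a minimal sketch of this framework, the weighted collapse of a pairwise similarity vector and the averaging over a training set might look like the following. The feature names, weights, and scores are illustrative placeholders, not the actual Author-ity parameters.

```python
def pair_similarity(sim_vector, weights):
    """Collapse a heterogeneous pairwise similarity vector into a single
    score as a weighted sum (one weight per similarity feature)."""
    return sum(weights[f] * score for f, score in sim_vector.items())

def similarity_to_set(sims_to_members, weights):
    """Overall similarity between one article and a training set, taken
    here as the mean weighted similarity to each member of the set."""
    scores = [pair_similarity(v, weights) for v in sims_to_members]
    return sum(scores) / len(scores)

# Illustrative features: shared title words and shared MeSH terms.
weights = {"title": 2.0, "mesh": 1.0}
sims = [{"title": 0.5, "mesh": 1.0},   # similarity vector vs. member 1
        {"title": 0.0, "mesh": 0.0}]   # similarity vector vs. member 2
print(similarity_to_set(sims, weights))  # 1.0
```

In practice, the function of the member-level scores need not be the mean; other aggregations (closest member, median) are discussed later in the text.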
2.2 Index each article in PubMed by representing it as a multidimensional set of article features
These features should cover all of the basic types of metadata that are likely to be relevant for text mining researchers. However, in any given project, only a subset of these features is likely to be informative (these can be selected either manually or via automated feature selection strategies). Furthermore, the complete set of useful metadata features is likely to expand over time as new techniques are invented and released. The system presented here has no practical limit on the number of metadata feature types, and metadata features can be added to the system and made available for future use at any time.
As a matter of policy, each feature at a minimum should be represented in its most basic “raw” or unprocessed form; if these features are processed or further encoded for the purposes of a specific project, the processed form should be represented as a separate feature. In this manner, it is easy to customize and process features in new ways to meet the demands of new projects. For example, the title of the article (encoded as a string of raw text) would constitute one feature in the feature set. Note that different investigators and different projects call for preprocessing text differently, so that no single or uniform method of preprocessing is likely to satisfy everyone. Thus, the title of the article after processing (e.g., via a particular NLP pipeline of tokenizing, making lower case, stoplisting, and stemming) would be placed as a separate feature in the article feature set. Further processing this form of the title into a “bag of words” encoding with counts for each non-stopword token would form another feature. Each of the basic metadata fields of the PubMed or MEDLINE record (title, abstract, journal, publication date, affiliations, etc.) would be extracted and possibly further processed to give rise to additional components of the article feature set. Altogether, several dozen features may be represented, some representing the same fields but in different ways. The full list of Medical Subject Headings (MeSH) extracted from the record would be one feature; another feature would be the same list but extracting only the major headings (discarding the subheadings) and removing the most frequent MeSH terms via stoplisting (Smalheiser & Bonifield, 2016).
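A sketch of this policy in code, with the raw field stored alongside each processed form as a separate feature. The stoplist and the processing pipeline here are illustrative assumptions, not the specific pipeline used in any of our projects.

```python
from collections import Counter

STOPWORDS = {"a", "an", "the", "in", "of", "for"}  # illustrative stoplist

def build_title_features(raw_title):
    """Keep the raw field, and store each processed form as its own feature."""
    tokens = raw_title.lower().rstrip(".").split()
    processed = [t for t in tokens if t not in STOPWORDS]
    return {
        "title_raw": raw_title,                    # unprocessed form
        "title_processed": " ".join(processed),    # tokenized, lowercased, stoplisted
        "title_bag_of_words": Counter(processed),  # counts per non-stopword token
    }

feats = build_title_features("Discovering foodborne illness in online restaurant reviews.")
print(feats["title_processed"])  # discovering foodborne illness online restaurant reviews
```

Because each processed form is a separate named feature, a new project can add its own customized variant without disturbing the forms that existing projects depend on.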
Besides extracting information directly from the metadata as contained in the XML record downloaded from PubMed, some of the article features may be derived from external sources. For example, if one feature is the list of author names on the article, then another feature may be the list of disambiguated author IDs as assigned in the Author-ity author name disambiguation dataset (Torvik & Smalheiser, 2009). The raw list of author names must be kept so that it is possible to identify at least the first author, last author, and middle authors. The associated features may be the frequency of each author name within MEDLINE as a whole, the affiliations associated with each author, etc. Table 1 shows a simplified schema of the article feature vectors used in the Author-ity modeling project, which can be taken as a baseline set that can be extended with additional features that may be relevant for other modeling projects. In the case of classifying articles as randomized controlled trials, we found that the number of authors listed on a paper was a significant feature (Cohen, Smalheiser, McDonagh, Yu, Adams, Davis & Yu, 2015), which is easily encoded from the raw author list feature and then stored as a processed feature set.
Article metadata used in the Author-ity author name disambiguation project feature set (simplified and updated from refs. 1 and 2).
| Feature |
| --- |
| Author first name |
| Author first initial |
| Author middle initial |
| Author last name; document frequency of last name in PubMed |
| Author suffix (e.g., Jr.) |
| Author position on article (for each co-author) |
| Author affiliations (for each co-author): city, state, country |
| Author ORCID identifier |
| Author imputed gender |
| Author imputed ethnicity |
| Capitalized title words |
| Co-author names; total number of author names on the article |
| Medical Subject Headings |
| RN terms (registry numbers) |
| Language of article |
| Other articles cited in this article |
| Other articles that cite this article |
| Top 20 most related PubMed articles |
| Top 20 “also viewed” PubMed articles |
Each article feature will be held in a central database where each one can potentially be called upon to create further processed, simplified, specific article features designed for a given purpose. Ideally, groups that use article feature information and customize it should donate the customized version(s) back to the database, so that these features can be used by others. For example, customized stoplists or text that has been processed by specific tokenizer/stemmers should be archived in the database so that they can be reused by others for processing text. This saves time and effort, contributes to reproducibility, and allows detailed comparison experiments to be performed easily and precisely.
2.3 Create pairwise similarity vectors that compare article features across any two PubMed articles
For any pair of articles, most, if not all, of the article features can be compared and scored for similarity. A collection of these similarity features can be represented as a vector and used to compute an overall paired article similarity score. Generally, similarity can be computed in more than one way. For example, the titles of two articles can be scored in terms of how many words they share using raw text; one might count shared words using stoplisted, stemmed text; or one might do a weighted counting, in which rare words are counted more heavily than frequent ones. Table 2 shows a simplified schema of the pairwise similarity measures used in the Author-ity modeling project. We anticipate that a few of the more popular pairwise similarity schemes will be implemented as part of the pairwise similarity vector for two articles. As other investigators utilize other similarity schemes for a given article feature, the scripts for processing them should be donated back so that the option can be implemented by others at will.
Pairwise similarity measures employed in the Author-ity author name disambiguation project similarity vector (simplified and updated from refs. 1 and 2).
Two articles that share the same author (last name, first initial) are compared pairwise on the following measures:

| Similarity measure |
| --- |
| Match on author middle initial |
| Match on author first name |
| Partial match for nicknames |
| Match on author email |
| Match on author ORCID identifier |
| Number of shared words in author affiliation |
| Distance between cities in author affiliation |
| Are both articles authored by a single author |
| Number of shared words in the title |
| Number of shared capitalized words in the title |
| Match on journal name |
| Partial match for similar journals using journal–journal similarity metric |
| Number of shared co-author names |
| Number of shared Medical Subject Headings |
| Partial match for similar Medical Subject Headings using MeSH–MeSH similarity metric |
| Number of shared registry number terms |
| Both articles in English; both in same non-English language |
| Difference in publication dates, in years |
| Number of shared grant acknowledgments |
| Number of shared cited articles |
| Number of shared citing articles |
| Number of shared PubMed-related articles |
| Number of shared PubMed “also viewed” articles |
| Length of longest common text string in the abstracts after … |
| Number of shared rare words (i.e., those found in <25 PubMed articles) |
| Similarity of title+abstract text as assessed by implicit weighted … |
| Similarity of title+abstract as assessed by paragraph2vec |
In any specific project, each article will be represented by only a subset of features, and each article pair will be represented by only a subset of the possible pairwise similarity measures. For example, in the Author-ity author name disambiguation project, we consider the likelihood that two articles are written by the same individual – they may tend to share similar co-authors, journals, and affiliations, among other relevant features. However, if we consider two clinical case reports describing a similar condition, there is no reason to think that they will share co-authors, journals, or affiliations; rather, shared title terms, MeSH terms, and MeSH term pairs (Smalheiser & Bonifield, 2016) are likely to be important. The point here is to make it easy to create and compute pairwise similarity vectors for any given project, drawing from a larger pool both of individual article-based features and potential pairwise similarity schemes.
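The title-comparison variants mentioned above (raw shared words, stoplisted shared words, and rarity-weighted counting) could be sketched as follows. The stoplist and the document frequencies are illustrative assumptions; the IDF-style weighting is one common choice for counting rare words more heavily.

```python
import math

STOPWORDS = {"a", "an", "the", "in", "of"}  # illustrative stoplist

def shared_words(t1, t2):
    """Number of distinct words shared by two titles (raw text)."""
    return len(set(t1.lower().split()) & set(t2.lower().split()))

def shared_words_stoplisted(t1, t2):
    """Same, after removing stopwords."""
    w1 = set(t1.lower().split()) - STOPWORDS
    w2 = set(t2.lower().split()) - STOPWORDS
    return len(w1 & w2)

def shared_words_weighted(t1, t2, doc_freq, n_docs):
    """Weighted counting: rare words (low document frequency) contribute
    more, via an IDF-style weight log(N / df); unseen words get df = 1."""
    shared = set(t1.lower().split()) & set(t2.lower().split())
    return sum(math.log(n_docs / doc_freq.get(w, 1)) for w in shared)
```

All three functions are monotone in the number (or rarity) of shared words, so any of them can serve as one component of a pairwise similarity vector.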
2.4 A machine learning algorithm is trained that optimally computes the similarity of the two articles in the context of a particular project (Figure 1)
Given a similarity vector representing multiple, heterogeneous measures corresponding to a pair of articles, one needs to “collapse” the vector to a single real number that represents the overall paired article similarity in the context of the given task, in order to have a single value that can be used for clustering similar articles together. This may be done in many ways, but perhaps the simplest method is to compute the similarity value as a weighted sum of the pairwise similarity scores. These weights can be determined using a machine learning algorithm, such as a support vector machine, logistic regression, or a neural network, which carries out training on appropriately labeled data. Given a sufficient number of data samples, the labels can be somewhat noisy without degrading performance of the model (Cohen, Smalheiser, McDonagh, Yu, Adams, Davis & Yu, 2015; Agarwal, Podchiyska, Banda, Goel, Leung, Minty, et al., 2016; Aslam & Decatur, 1996).
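With logistic regression, for instance, the collapse is a sigmoid of the weighted sum; the weights and bias below are placeholders standing in for values that would be learned from labeled article pairs.

```python
import math

def pair_score(sim_vector, weights, bias=0.0):
    """Collapse a pairwise similarity vector into a single value in (0, 1):
    a logistic function of the weighted sum of the similarity features.
    Higher output = the pair is more likely to belong together."""
    z = bias + sum(w * s for w, s in zip(weights, sim_vector))
    return 1.0 / (1.0 + math.exp(-z))

# With all-zero similarities (and zero bias) the model is maximally uncertain:
print(pair_score([0.0, 0.0, 0.0], [1.2, 0.8, 2.0]))  # 0.5
```

A convenient side effect of this formulation is that the output is already a probability-like score, which fits naturally with the probabilistic class assignment discussed below.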
Ideally, for each project, one should define sets of articles that comprise positive and/or negative training sets for machine learning. To define a positive training set, we pull a set of articles and their associated article feature vectors that have some desired property, that is, they are “similar” in a manner we are interested in. For example, for author name disambiguation, we might take a set of articles known to be authored by a particular individual; for training a model to identify randomized controlled trials, we may take as the positive set those articles that have been manually indexed by MEDLINE as randomized controlled trials (Cohen, Smalheiser, McDonagh, Yu, Adams, Davis & Yu, 2015). The negative training set may consist of all articles not in the positive training set. Alternatively, in a multitask classification project, there may be a series of different positive training sets, each positive for a different class, so that each positive training set is contrasted against each of the others. Each PubMed article may then be assigned to the positive training set for its PT and contrasted against the training sets for the other PTs.
One way to visualize this scheme is shown in Figure 2. Each of the articles in PubMed is represented as a point in a multidimensional article feature space. Each positive set comprises a cloud of such points. The cloud may be more or less cohesive – its points may or may not cluster tightly around a single centroid, though a good positive training set ought to be relatively cohesive. The machine learning objective is to compute pairwise similarity measures for any two articles, such that any given article that is in a positive set will be, on average, “closer” to other articles in the same positive set than to articles in the negative set (or other positive sets). One chooses some machine learning framework (e.g., SVM) and trains the model to adjust the weightings on the similarity vectors so as to minimize the average distance between the members of a positive set and maximize the average distance between the members of the positive set and the members of the negative set (or other positive sets). In this manner, one learns the optimal weightings on the different similarity measures that make up the pairwise similarity vectors, which together compute an optimal single similarity value for any pair of articles.
2.5 Using the learned similarity metric for article pairs, articles can be classified as belonging to one or more categories or similar articles can be clustered together (the preferred clustering strategy and end point may vary depending on the project)
Having trained the machine learning model as described above, one classifies any new article (not in the training sets) by computing the similarity values pairwise between that article and all the articles in the positive set, and between that article and all the articles in the negative set. This gives a distribution of similarity values for the positive set vs. the negative set (or each of the other positive sets). Then, one can ask, for this article, on average, which training set is it closest to? (see Figure 2). Depending on the nature of the classification task, one might assign the article to the closest positive set (or possibly to more than one positive set, if it is about equally close to more than one) or to the negative set, if an article is not sufficiently close to any of the positive set clouds. Rather than binary (yes/no) classification, it is also possible to assign a probability of belonging to a given class (Cohen, Smalheiser, McDonagh, Yu, Adams, Davis & Yu, 2015; Niculescu-Mizil & Caruana, 2005).
Furthermore, task-specific customization of the assignment algorithm is possible using a number of standard distance-to-cluster measures (Aggarwal & Reddy, 2013), such as closest cluster, closest cluster median, and average cluster member distance. These can be computed very efficiently and easily compared, and an optimal cluster selection method can be chosen for a given task. Moreover, the cluster selection method can be extended to produce cluster assignment probabilities by incorporating distances to multiple members of each cluster; for example, a K-nearest neighbors strategy could be used here. Another potential approach is to use the final cluster distance to compute a probability of cluster membership directly using a nonlinear transformation such as isotonic regression (Torvik, Weeber, Swanson & Smalheiser, 2005). Alternatively, similarity values across article pairs can also be used for unsupervised clustering to identify groups of articles in a data-driven manner.
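These cluster selection strategies can be expressed as different aggregations of the member-level similarities. Below is a sketch; the training-set names and similarity values are hypothetical.

```python
from statistics import mean, median

def assign_to_set(sims_by_set, aggregate=mean):
    """Assign an article to the training set with the highest aggregated
    similarity over that set's members. Passing max, mean, or median as
    the aggregate gives the closest-member, average-member-distance, and
    median-member strategies, respectively."""
    scores = {name: aggregate(sims) for name, sims in sims_by_set.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical similarities of one new article to two training sets:
sims = {"randomized controlled trial": [0.9, 0.6, 0.7],
        "review": [0.4, 0.5, 0.3]}
print(assign_to_set(sims)[0])  # randomized controlled trial
```

A K-nearest-neighbors variant would simply aggregate over only the K most similar members of each set rather than all of them.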
3 Results and Discussion
Let us first consider the concrete case of assigning PubMed articles automatically (and probabilistically) to one or more PTs. We show that the framework of encoding articles as multidimensional vectors, constructing pairwise similarity vectors for pairs of articles, and computing distances between articles and training sets (Figures 1 and 2) is well suited for this task. We further show that the scheme benefits from having a core suite of precomputed pairwise similarity features, which are publicly available from our project website
3.1 Training sets
We take the list of MEDLINE PTs (
Feature sets. Next, for each article in PubMed, we assign a feature set that includes metadata features extracted from the PubMed XML record (or computed from information contained in the record), which we know (or suspect) may provide information that will help in assigning PTs. The feature set includes a variety of textual features – for example, words that appear in the title and/or in the abstract, as well as low-dimensional vector representations of these words (e.g., implicit term metrics (Smalheiser & Bonifield, 2018) or word2vec neural embeddings (Smalheiser & Bonifield, 2018; Mikolov, Sutskever, Chen, Corrado, Dean, 2013)). The feature set also includes journal name (since PTs are not distributed equally across journals), MeSH, and other features such as number of authors listed on the article (note that reviews are often single authored, whereas clinical trials generally have many author names on each paper). Feature selection may be performed on the basic set to select only those features that have the most utility for discriminating different PTs (Cohen, Smalheiser, McDonagh, Yu, Adams, Davis, Yu, 2015) and to minimize collinearity (Aggarwal & Reddy, 2013; Witten, Frank, Hall, & Pal, 2016).
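As one simple feature selection criterion of the kind alluded to above, the mutual information between a binary feature (e.g., “title contains the word randomized”) and a PT label can be computed from a 2×2 contingency table of article counts. This is a generic textbook sketch, not the specific selection method used in the cited papers.

```python
import math

def mutual_information(n11, n10, n01, n00):
    """Mutual information (in bits) between a binary feature and a binary
    class label, from a 2x2 contingency table of article counts:
    n11 = feature present & class positive, n10 = present & negative,
    n01 = absent & positive,  n00 = absent & negative.
    Higher MI indicates a more discriminative feature."""
    n = n11 + n10 + n01 + n00
    mi = 0.0
    for n_cell, n_feat, n_class in [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]:
        if n_cell:
            mi += (n_cell / n) * math.log2(n * n_cell / (n_feat * n_class))
    return mi
```

A feature that is independent of the label scores 0, and a perfectly predictive binary feature scores 1 bit; ranking candidate features by this score is one way to prune an initial feature set.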
3.2 Pairwise similarity measures for each feature
Once each article has a feature set to describe it, the next step is to construct a pairwise similarity vector that contains multiple, heterogeneous similarity measures that contribute to the overall similarity of any two articles. Here, each feature is compared pairwise and a feature similarity score is assigned; this is done for each pairwise feature comparison within the similarity vector. The essential requirement is monotonicity; that is, for any given feature, a higher similarity score corresponds to a higher probability that the two articles share the same PT. We have precomputed pairwise similarity metrics for journals (D’Souza & Smalheiser, 2014); MeSH (Smalheiser & Bonifield, 2016); biomedical terms including words, bigrams, trigrams, and abbreviations (Smalheiser & Bonifield, 2018); and the title+abstract considered as a single text passage (Smalheiser & Bonifield, 2018). (Except for the title+abstract similarity measures, which are still in the process of being made available, all of these measures can be downloaded from the project website.) These measures cover almost all the pairwise features that are likely to be included in the multitask model, and we have shown that there is limited redundancy between term-based and title+abstract-based similarity measures (Smalheiser & Bonifield, 2018), so that including both types of features is likely to be warranted. We believe this should be a valuable resource for the biomedical text mining community (
3.3 Optimizing the weighted similarity metric for one article to another and for one article to a training set
The next goal is to learn how to estimate the weighting of the different similarity scores in the pairwise similarity vector, to estimate the overall similarity of any two articles. We examine each PT in turn, and for each article in the positive training set for this PT, we use machine learning to train a model that minimizes the pairwise distance of this article to the other articles in the same training set, while maximizing the distance of this article to the articles in the other PTs. A variety of machine learning methods could be explored, e.g., SVMs (linear or nonlinear), isotonic regression, random forests, or neural networks.
Note that in the description above, each article in PubMed is assigned a single feature set, yet it is possible that each PT may utilize a different optimized similarity vector and weighting for comparing any two articles; for example, the optimal weighting scheme for discriminating review articles from clinical case reports may be different from the optimal weighting scheme for discriminating review articles from editorials. Another alternative approach is to customize the feature set for each PT training set individually for making similarity comparisons. For example, the word “randomized” has a high discriminative value when assigning articles to randomized controlled trials, whereas the word “cohort” has a high value when assigning articles to cohort studies. Given any PubMed article, it might be selectively compared for similarity against discriminative terms such as “randomized” when comparing the article to the randomized controlled trial training set but compared for similarity against terms such as “cohort” when comparing the article to the cohort studies training set. This is a topic that will require further research. Similarly, as discussed in the Methods section, given a list of distances from one article to a given PT training set, it is an open question how best to compute an “overall” distance. A popular choice is to represent the entire training set by its centroid, but this may not be appropriate if the training set is not coherent or if one is using a nonlinear similarity metric instead of a simple weighted sum of feature similarity scores.
The similarity-based multitask framework described here differs from our previous method (Cohen, Smalheiser, McDonagh, Yu, Adams, Davis, Yu, 2015) to estimate the probability that a given article is a randomized controlled trial. Our previous study of classifying randomized controlled trials used features derived directly from metadata – for example, title bigrams. In contrast, the present strategy formulates
3.4 Can this framework be generalized to other biomedical text mining tasks?
Although the use of implicit similarity metrics is common in certain machine learning applications (e.g., image analysis and bioinformatics), in our experience, the pairwise similarity-based approach we propose here has been less commonly used for biomedical text mining projects. Certainly, the list of our currently precomputed similarity features is not exhaustive. For example, a set of PubMed articles can be subjected to topic modeling and the articles can then be represented as a weighted vector of these topics (Hashimoto, Kontonatsios, Miwa & Ananiadou, 2016). In addition, two text passages can be assessed in terms of their string similarity (Mrabet, Kilicoglu, Demner-Fushman, 2017). Author name matches on first name (with partial matches given for nicknames), middle initial, and suffix are features important for author name disambiguation (Torvik, Weeber, Swanson & Smalheiser, 2005; Torvik & Smalheiser, 2009) and other tasks (Tables 1 and 2). Therefore, we envision our project website as comprising an open repository, wherein outside groups can not only utilize our existing resources but also donate their own processed features and similarity metrics (subject to evaluation and space limitations). We have added a Repository of Processed Text and Resources page to the project website that encourages others to donate their processed text and features back to us so that we can integrate them into our suite and host them publicly. This is not unlike the UCSC Genome Browser (
Our approach is compatible with other text mining frameworks, such as PubRunner (Anekalla, Courneya, Fiorini, Lever, Muchow, Busby, 2017), for updating processed citations with the latest PubMed entries, and the many available text processing toolkits, which can be used to process raw article metadata into processed feature sets, e.g., the NLTK (
Having a central infrastructure repository of metadata features and similarity measures can benefit the broader biomedical community of investigators as well. One can envision that specialized (perhaps proprietary) NLP tools can be run on PubMed articles as a public service and the results can be stored publicly so that end-users can utilize the results without having to acquire or learn how to use the tools themselves. For example, RobotReviewer (Marshall, Kuiper, Wallace, 2015) processes clinical trial articles to identify the clinical populations and interventions studied in the trial (among other things). If one were to store the results as metadata attached to the articles, then teams writing systematic reviews could obtain the results of tools such as RobotReviewer without needing to process articles themselves.
It may be argued that our emphasis on detailed feature engineering is old-fashioned, and even obsolete, in the face of recent advances in deep learning. Deep learning can in principle learn the most relevant features and detect higher-order associations automatically. However, this depends on having enough data (billions of points) and a sufficiently deep underlying architecture, both of which lie beyond the reach of most deep learning efforts in biomedical text mining today. Moreover, deep learning is not certain to capture all the relevant implicit associations anyway, especially those that draw upon external reference data from the UMLS or other knowledge bases.
3.5 Extension to full-text features
We have emphasized the use of metadata as features for text mining, in part because full-text of biomedical articles has not been generally accessible. The PubMed Central Open Access dataset currently contains 1.8 million full-text articles available for download in the XML format. This can be augmented further by precomputing features that can be archived for use by others. For example, Europe PMC (
Our studies are supported by NIH grants R01LM10817 and P01AG03934. We thank Sophia Ananiadou for discussions about ways to share NLP tools and their products with end-users.
Przybyła, P., Shardlow, M., Aubin, S., Bossy, R., Eckart de Castilho, R., Piperidis, S.,… & Ananiadou, S. (2016). Text mining resources for the life sciences. Database, 2016, baw145.
Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J. R., Bethard, S., & McClosky, D. (2014, June). The Stanford CoreNLP natural language processing toolkit. In ACL (System Demonstrations) (pp. 55-60).
Savova, G. K., Masanz, J. J., Ogren, P. V., Zheng, J., Sohn, S., Kipper-Schuler, K. C., & Chute, C. G. (2010). Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. Journal of the American Medical Informatics Association: JAMIA, 17(5), 507–513. http://doi.org/10.1136/jamia.2009.001560.
Clarke, J., Srikumar, V., Sammons, M., & Roth, D. (2012). An NLP Curator (or: How I Learned to Stop Worrying and Love NLP Pipelines). In LREC (pp. 3276-3283).
Zeng, X., & Luo, G. (2017). Progressive sampling-based Bayesian optimization for efficient and automatic machine learning model selection. Health Information Science and Systems, 5, 2. https://doi.org/10.1007/s13755-017-0023-z.
Porter, M. F. (1980). An algorithm for suffix stripping, Program, 14(3) pp 130-137.
Torvik VI, Smalheiser NR, Weeber, M. 2007. A simple Perl tokenizer and stemmer for biomedical text. Unpublished technical report, accessed January 15, 2018 from http://arrowsmith.psych.uic.edu/cgi-bin/arrowsmith_uic/tokenizer.cgi.
Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2005). A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the Association for Information Science and Technology, 56(2), 140-158.
Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data (TKDD), 3(3), 11.
Cohen, A. M., Smalheiser, N. R., McDonagh, M. S., Yu, C., Adams, C. E., Davis, J. M., & Yu, P. S. (2015). Automated confidence ranked classification of randomized controlled trial articles: an aid to evidence-based medicine. Journal of the American Medical Informatics Association, 22(3), 707-717.
Smalheiser, N. R., & Bonifield, G. (2016). Two similarity metrics for Medical Subject Headings (MeSH): An aid to biomedical text mining and author name disambiguation. Journal of Biomedical Discovery and Collaboration, 7.
Smalheiser, N. R., & Bonifield, G. (2018). Unsupervised Low-Dimensional Vector Representations for Words, Phrases and Text that are Transparent, Scalable, and produce Similarity Metrics that are Complementary to Neural Embeddings. arXiv preprint arXiv:1801.01884.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).
Niculescu-Mizil, A., & Caruana, R. (2005, August). Predicting good probabilities with supervised learning. In Proceedings of the 22nd international conference on Machine learning (pp. 625-632). ACM.
Aggarwal, C. C., & Reddy, C. K. (Eds.). (2013). Data clustering: algorithms and applications. CRC press.
Law, M. T., Yu, Y., Urtasun, R., Zemel, R. S., & Xing, E. P. (2017). Efficient multiple instance metric learning using weakly supervised data. In CVPR. http://www.cs.toronto.edu/~zemel/documents/mimlca_cvpr_2017.pdf
Witten, I. H., Frank, E., Hall, M. A., & Pal, C. J. (2016). Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann.
Mohammadi, S., Kylasa, S., Kollias, G., & Grama, A. (2016, December). Context-Specific Recommendation System for Predicting Similar PubMed Articles. In Data Mining Workshops (ICDMW), 2016 IEEE 16th International Conference on (pp. 1007-1014). IEEE.
Mrabet Y, Kilicoglu H, Demner-Fushman D. TextFlow: A Text Similarity Measure based on Continuous Sequences. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2017 (Vol. 1, pp. 763-772).
Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman, S., Liu, D.,… & Xin, D. (2016). Mllib: Machine learning in apache spark. The Journal of Machine Learning Research, 17(1), 1235-1241.
Shanahan, J. G., & Dai, L. (2015, August). Large scale distributed data science using apache spark. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 2323-2324). ACM.
Marshall, I. J., Kuiper, J., & Wallace, B. C. (2015). RobotReviewer: evaluation of a system for automatically assessing bias in clinical trials. Journal of the American Medical Informatics Association, 23(1), 193-201.