Search Results

1 - 10 of 10 items :

Investigating Weak Supervision in Deep Ranking

25 is the most common heuristic for generating weak relevance labels ( Dehghani et al., 2017 ; MacAvaney et al., 2017 ). Dehghani et al. (2017) used BM25 as the heuristic to generate weak labels and reported that their fine-tuned neural models outperformed BM25. By using documents’ titles as pseudo queries and BM25 scores as weak labels, MacAvaney et al. (2017) introduced a filtering method to effectively produce positive and negative query–document pairs. However, the implicit assumption that the exact matching signals can represent relevance usually brings

Open access
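
The weak-labelling idea in the snippet above can be illustrated with a minimal sketch: score each document against a pseudo query (for example a title) with BM25 and keep the scores as weak relevance labels. This is not the cited authors' code. The function name `bm25_scores`, the whitespace tokenizer, the IDF variant with `+ 1` inside the logarithm, and the defaults `k1=1.2`, `b=0.75` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document for the given query.

    Assumptions (not from the source): whitespace tokenization, lowercasing,
    and the non-negative IDF variant log((N - n + 0.5) / (n + 0.5) + 1).
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # exact matching only: unseen terms add nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy collection; the pseudo query plays the role of a title.
docs = [
    "neural ranking models for search",
    "weak supervision for neural networks",
    "library cataloguing practice",
]
weak_labels = bm25_scores("neural ranking", docs)
```

Because the score depends only on exact term matches, the third document receives a weak label of zero however relevant it might be, which is the limitation the snippet goes on to raise.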
Filtering and Classifying Relevant Short Text with a Few Seed Words

topics over the pseudo-documents. Illustrated in Figure 3 , it consists of two components: word relevance estimation and short text classification and filtering . Given a set of seed words for each category of interest, we estimate the relevance score between a word and a category based on the inferred hidden topics by the word network topic model (WNTM) over the short text collection. The resultant relevance estimation serves as prior knowledge for SSCF to supervise the topic inference over two kinds of topics: category-topics and general-topics. At last, the

Open access
Design of a generic, open platform for machine learning-assisted indexing and clustering of articles in PubMed, a biomedical bibliographic database

-dimensional space. The distance between any two PubMed articles can be calculated as a weighted sum of the pairwise similarity scores of the underlying features between each PubMed article. Then, the overall distance between a PubMed article and a training set will be some function of the weighted pairwise similarity scores (for each of the articles that make up the training set). Finally, articles can be classified as belonging to one or more categories (depending on the relative distance of an article to the positive vs. negative training sets) or similar articles can be

Open access
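
The classification scheme described in the snippet above can be sketched roughly as follows. The snippet does not say which features, which pairwise similarity, or which aggregation function the platform uses, so everything concrete here is an assumption: Jaccard overlap as the per-feature similarity, the mean as the "some function" over a training set, and the feature names `mesh` and `title`. Scores are treated as similarities, so a higher value means closer.

```python
from statistics import mean


def jaccard(a, b):
    """Jaccard overlap of two term collections (assumed per-feature similarity)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def article_similarity(x, y, weights):
    """Weighted sum of per-feature similarity scores between two articles."""
    return sum(w * jaccard(x.get(f, ()), y.get(f, ())) for f, w in weights.items())


def set_similarity(article, training_set, weights):
    """Aggregate similarity to a training set; the mean is an assumption."""
    return mean(article_similarity(article, t, weights) for t in training_set)


def classify(article, positives, negatives, weights):
    """True if the article is closer to the positive set than to the negative set."""
    return set_similarity(article, positives, weights) > set_similarity(
        article, negatives, weights
    )


# Hypothetical feature weights and toy articles.
weights = {"mesh": 0.7, "title": 0.3}
positives = [{"mesh": ["Neoplasms", "Humans"], "title": ["tumor", "growth"]}]
negatives = [{"mesh": ["Plants"], "title": ["leaf", "growth"]}]
article = {"mesh": ["Neoplasms", "Humans"], "title": ["tumor", "therapy"]}

is_positive = classify(article, positives, negatives, weights)
```

Keeping the weights in a plain dictionary makes it easy to add or reweight features without touching the scoring functions.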
To Phrase or Not to Phrase – Impact of User versus System Term Dependence upon Retrieval

between quotes. Each of these 52 queries was assessed by 101 users. The scores in brackets in Table 1 show the average user agreement on the most popular user choice for each query, which we computed as the % of users (out of all 101 users) who agree on the most popular term dependence option for each query. For instance, the average agreement of 69% for “rain man” means that 70 out of 101 users (≈69%) selected the option “rain man”. The 52 train queries are sorted in Table 1 by decreasing user agreement. Table 1 Train queries used on the CrowdFlower

Open access
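
The agreement figure described in the snippet above is the share of users who picked the most popular option for a query. A minimal sketch of that calculation follows; the function name `top_choice_agreement` is hypothetical, and the 31 votes for the unquoted option are invented filler so that the totals match the 70 out of 101 users mentioned in the snippet.

```python
from collections import Counter


def top_choice_agreement(choices):
    """Return the most popular option and the percentage of users who chose it."""
    counts = Counter(choices)
    option, votes = counts.most_common(1)[0]
    return option, 100.0 * votes / len(choices)


# 70 of 101 users chose the quoted phrase; the remaining split is assumed.
choices = ['"rain man"'] * 70 + ["rain man"] * 31
option, agreement = top_choice_agreement(choices)
```

With these inputs the agreement is 70 / 101, or about 69%, in line with the example given in the snippet.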
Supporting Book Search: A Comprehensive Comparison of Tags vs. Controlled Vocabulary Metadata

bibliographical metadata change the search performance? Answer: There is no significant difference when combining Core bibliographical metadata with CVs. Including Core bibliographical metadata in general achieves a better performance. Any real-world book search engine would always include the core bibliographic data in its documents. The NDCG@10 scores seem to benefit from adding the Core elements to other metadata elements. These differences are significant according to a two-tailed paired t-test (t(1307) = 4.799, p < .0005, ES = 0.13, 95% CI [0.0083, 0

Open access
Usability Evaluation of E-Dunhuang Cultural Heritage Digital Library

control of the mouse: “I don’t know what’s happening, why it spins so fast, I just use my mouse to drag left and right because I want to check out the paintings around, why it is so hard to control!” (participant 4). Other participants also complained about the automatic spinning feature of the panoramic function: “I feel quite dizzy that it spins all the time!” (participant 1); “Why can’t I stop it from spinning?” (participant 8).
4.3.2 Organization of information
Another criterion with lower scores was “organization of information” where

Open access
Big Data in Health Care: Applications and Challenges

characteristics. Christy et al. (2015) proposed two outlier detection algorithms: distance-based outlier detection and cluster-based outlier detection. The main purpose of the algorithms was to remove outliers that are irrelevant or only weakly relevant to the analysis of health care data. Experimental evaluation based on the metrics of F-score and likelihood ratio shows that the cluster-based outlier detection method outperforms the distance-based outlier detection method. Huang and Yao (2016) proposed a novel clustering approach for multidimensional

Open access
When Econometrics Meets Machine Learning

mining, as well as a variety of econometric models to discover valuable information, which we have been doing during the past few decades. Take my research with a commercial bank as an example. What we found from the decision tree generated to explain the commercial loaning process for medium-sized companies is that the loan is based on the financial attribute and the risk level of the applicant. The numbers one to five represent the risk level perceived for the small- and medium-sized businesses, with the score one meaning the most secure and five meaning the riskiest

Open access
Improving Publication Pipeline with Automated Biological Entity Detection and Validation Service

entities of importance to authors. Since the ABNER program does not offer a way to sort entities, we used the total number of entities found by ABNER, which resulted in a very low precision score. This also reinforces our motivation that existing entity tools cannot be used to solve this problem directly. Additional features and functionalities must be developed before they can be used in practice.
Table 2 Results of Entity Recognition Against Author Curation as Ground Truth
Total number of Entities | Total Entities in Ground Truth | Total Recall | Total Precision

Open access
Enhancing Clinical Decision Support Systems with Public Knowledge Bases

We assume that a disease is highly likely to be correct if it is predicted as the true diagnosis by both SemMedDB and Wikipedia. From the results of the experiments in the next section, we found that Wikipedia gives better and more robust predictions across three datasets, and hence we use the following rules to combine the prediction outputs from Wiki-DP and SMDB-DP:
– Only the top 10 diseases are considered in both ranking lists.
– If the two lists share the same diseases, the shared diseases are kept and ranked by their Wikipedia ranking score.
– If the

Open access