Yukun Zheng, Yiqun Liu, Zhen Fan, Cheng Luo, Qingyao Ai, Min Zhang and Shaoping Ma
BM25 is the most common heuristic for generating weak relevance labels (Dehghani et al., 2017; MacAvaney et al., 2017). Dehghani et al. (2017) used BM25 as the heuristic to generate weak labels and reported that their fine-tuned neural models outperformed BM25 itself. By using documents’ titles as pseudo-queries and BM25 scores as weak labels, MacAvaney et al. (2017) introduced a filtering method to effectively produce positive and negative query–document pairs. However, the implicit assumption that exact matching signals can represent relevance usually brings
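The weak-supervision setup described above can be sketched in a few lines. This is an illustrative implementation, not the cited papers' exact configuration: the scorer below is the standard Okapi BM25 formula over a toy in-memory corpus, and the resulting scores serve directly as weak relevance labels.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document (a list of tokens) against the query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()                       # document frequency of each term
    for d in docs:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

# BM25 scores used as weak relevance labels, one per document.
docs = [["rain", "man", "movie"], ["weather", "rain", "forecast"], ["stock", "market"]]
labels = bm25_scores(["rain", "man"], docs)
```

In a title-as-pseudo-query setup, the highest-scored documents would be kept as positives and low-scored ones sampled as negatives before filtering.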
topics over the pseudo-documents. As illustrated in Figure 3, it consists of two components: word relevance estimation and short text classification and filtering. Given a set of seed words for each category of interest, we estimate the relevance score between a word and a category based on the hidden topics inferred by the word network topic model (WNTM) over the short text collection. The resulting relevance estimates serve as prior knowledge for SSCF to supervise topic inference over two kinds of topics: category-topics and general-topics. Finally, the
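As a rough illustration of the word relevance estimation step: one simple way to relate a word to a category through inferred topics is to compare the word's topic distribution with the mean topic distribution of the category's seed words. This cosine-based scoring is an assumption for illustration only (the WNTM-based computation in the paper is more involved), and `topic_word` stands in for a hypothetical topic-model output mapping each word to its per-topic probabilities.

```python
import math

def topic_relevance(word, seeds, topic_word):
    """Relevance of `word` to a category: cosine similarity between the word's
    topic distribution and the mean topic distribution of the seed words.
    `topic_word[w]` is a list giving p(topic_k | w) for each topic k."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    n_topics = len(next(iter(topic_word.values())))
    mean_seed = [sum(topic_word[s][k] for s in seeds) / len(seeds)
                 for k in range(n_topics)]
    return cosine(topic_word[word], mean_seed)
```

A word that concentrates on the same topics as the seed words scores near 1; a word dominated by unrelated topics scores near 0.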
-dimensional space. The distance between any two PubMed articles can be calculated as a weighted sum of the pairwise similarity scores of the underlying features between the two articles. Then, the overall distance between a PubMed article and a training set is some function of the weighted pairwise similarity scores (one for each article that makes up the training set). Finally, articles can be classified as belonging to one or more categories (depending on the relative distance of an article to the positive vs. negative training sets), or similar articles can be
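The distance-then-classify scheme above can be sketched as follows. All names, the per-feature dissimilarity, and the weights are illustrative assumptions, not the system's actual features: each article is modelled as a dict of feature token lists, article-to-article distance is a weighted sum of per-feature dissimilarities, and an article is labelled by whichever training set it is closer to on average.

```python
def feature_dissim(a, b):
    """Toy per-feature dissimilarity: 1 minus the Jaccard overlap of term sets."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b) if (a or b) else 0.0

def article_distance(x, y, weights):
    """Weighted sum of per-feature dissimilarities between two articles."""
    return sum(w * feature_dissim(x[f], y[f]) for f, w in weights.items())

def set_distance(article, training_set, weights):
    """Mean distance from an article to every article in a training set."""
    return sum(article_distance(article, t, weights) for t in training_set) / len(training_set)

def classify(article, positives, negatives, weights):
    """Assign the article to the closer of the positive/negative training sets."""
    return ("positive"
            if set_distance(article, positives, weights) < set_distance(article, negatives, weights)
            else "negative")

weights = {"title": 0.6, "abstract": 0.4}
pos = [{"title": ["cancer", "genomics"], "abstract": ["tumor", "dna"]}]
neg = [{"title": ["economics"], "abstract": ["market", "trade"]}]
art = {"title": ["cancer"], "abstract": ["tumor", "growth"]}
label = classify(art, pos, neg, weights)
```

The "some function" of the text is instantiated here as a simple mean; a real system might use a minimum or a learned aggregation instead.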
Christina Lioma, Birger Larsen and Peter Ingwersen
between quotes. Each of these 52 queries was assessed by 101 users. The scores in brackets in Table 1 show the average user agreement on the most popular user choice for each query, which we computed as the % of users (out of all 101 users) who agree on the most popular term dependence option for each query. For instance, the average agreement of 69% for “rain man” means that 70 out of 101 users (≈69%) selected the option “rain man”. The 52 train queries are sorted in Table 1 by decreasing user agreement.
Train queries used on the CrowdFlower
bibliographical metadata change the search performance?
Answer: There is no significant difference when combining Core bibliographical metadata with CVs. Including Core bibliographical metadata generally achieves better performance.
Any real-world book search engine would always include the core bibliographic data in its documents. The NDCG@10 scores seem to benefit from adding the Core elements to other metadata elements. These differences are significant according to a two-tailed paired t-test (t(1307) = 4.799, p < .0005, ES = 0.13, 95% CI [0.0083, 0
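The paired t-test reported above operates on per-query score pairs (here, 1308 queries give 1307 degrees of freedom). A minimal pure-Python sketch of the paired t statistic and an effect size for paired differences follows; the scores below are made up for illustration, and the paper's ES of 0.13 may have been computed with a different effect-size formula.

```python
import math

def paired_t(a, b):
    """Paired t statistic and Cohen's d (on the differences) for paired samples."""
    n = len(a)
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)   # sample variance
    sd = math.sqrt(var_d)
    t = mean_d / (sd / math.sqrt(n))       # df = n - 1
    cohens_d = mean_d / sd                 # effect size of the paired differences
    return t, cohens_d

# Hypothetical NDCG@10 per query, with vs. without the Core elements.
with_core = [0.5, 0.6, 0.7, 0.8]
without_core = [0.4, 0.55, 0.65, 0.7]
t, d = paired_t(with_core, without_core)
```

In practice one would obtain the two-tailed p-value from the t distribution with n-1 degrees of freedom (e.g. via `scipy.stats.ttest_rel`).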
control of the mouse:
“I don’t know what’s happening, why it spins so fast, I just use my mouse to drag left and right because I want to check out the paintings around, why it is so hard to control!” (participant 4).
Other participants also complained about the automatic spinning feature of the panoramic function:
“I feel quite dizzy that it spins all the time!” (participant 1)
“Why can’t I stop it from spinning?” (participant 8)
Organization of information
Another criterion with lower scores was “organization of information” where
Liang Hong, Mengqi Luo, Ruixue Wang, Peixin Lu, Wei Lu and Long Lu
Christy et al. (2015) proposed two outlier detection algorithms: distance-based outlier detection and cluster-based outlier detection. The main purpose of these algorithms was to remove outliers that are irrelevant or only weakly relevant to the analysis of health care data. An experimental evaluation based on the metrics of F-score and likelihood ratio shows that the cluster-based outlier detection method outperforms the distance-based one.
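To make the distance-based variant concrete, here is a minimal sketch of one common formulation (the exact criterion, `k`, and threshold used by Christy et al. are assumptions here): a point is flagged as an outlier when its mean distance to its k nearest neighbours is much larger than is typical for the dataset.

```python
import statistics

def distance_outliers(points, k=2, threshold=2.0):
    """Flag points whose mean distance to their k nearest neighbours exceeds
    `threshold` times the median of that statistic over all points."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    knn_mean = []
    for i, p in enumerate(points):
        ds = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        knn_mean.append(sum(ds[:k]) / k)

    med = statistics.median(knn_mean)
    return [i for i, m in enumerate(knn_mean) if m > threshold * med]

# A tight cluster plus one distant point: only the distant point is flagged.
outliers = distance_outliers([(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)])
```

A cluster-based variant would instead cluster the data first and flag points far from every cluster centroid, which is less sensitive to local density.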
Huang and Yao (2016) proposed a novel clustering approach for multidimensional
Eric Zheng, Yong Tan, Paulo Goes, Ramnath Chellappa, D.J. Wu, Michael Shaw, Olivia Sheng and Alok Gupta
mining, as well as a variety of econometric models to discover valuable information, which we have been doing during the past few decades. Take my research with a commercial bank as an example. What we found from the decision tree generated to explain the commercial loaning process for medium-sized companies is that the loan is based on the financial attribute and the risk level of the applicant. The numbers one to five represent the risk level perceived for the small- and medium-sized businesses, with the score one meaning the most secure and five meaning the riskiest
Weijia Xu, Amit Gupta, Pankaj Jaiswal, Crispin Taylor, Patti Lockhart and Jennifer Regala
entities of importance to authors. Since the ABNER program does not offer a way to sort entities, we used the total number of entities found by ABNER, which resulted in a very low precision score. This also reinforces our motivation: existing entity-recognition tools cannot be used to solve this problem directly. Additional features and functionalities must be developed for them to be used in practice.
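The precision figure here is the standard set-overlap measure against the author-curated ground truth. A minimal sketch (entity names below are hypothetical):

```python
def precision(predicted, ground_truth):
    """Fraction of predicted entities that appear in the ground-truth set."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    return len(predicted & ground_truth) / len(predicted) if predicted else 0.0

# Two of three predicted entities match the curated set.
p = precision(["BRCA1", "TP53", "foo"], ["BRCA1", "TP53", "EGFR"])
```

Reporting the total entity count as the prediction set, as described above, inflates the denominator and drives precision down.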
Results of Entity Recognition Against Author Curation as Ground Truth
Total number of Entities
Total Entities in Ground Truth
We assume that a disease is highly likely to be correct if it is predicted as a true diagnosis by both SemMedDB and Wikipedia. From the results of the experiments in the next section, we found that Wikipedia gives a better and more robust prediction across the three datasets; hence, we use the following rules to combine the prediction outputs from Wiki-DP and SMDB-DP:
Only the top 10 diseases in each ranking list are considered.
If the two lists share the same diseases, the shared diseases are kept and ranked by their Wikipedia ranking scores.
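The combination rules above can be sketched directly (the disease names and function name below are illustrative, not from the paper):

```python
def combine_predictions(wiki_ranked, smdb_ranked, top_k=10):
    """Keep diseases that appear in the top-k of both ranked lists,
    ordered by their Wikipedia rank (rule: Wikipedia score decides order)."""
    wiki_top = wiki_ranked[:top_k]
    smdb_top = set(smdb_ranked[:top_k])
    return [d for d in wiki_top if d in smdb_top]

# Diseases predicted by both sources survive, in Wikipedia order.
combined = combine_predictions(["flu", "cold", "asthma"], ["asthma", "flu", "migraine"])
```

Because the final order comes from the Wikipedia list, the more robust predictor determines the ranking while SemMedDB acts as a filter.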