In the field of scientometrics, the principal purpose of author co-citation analysis (ACA) is to map knowledge domains by quantifying the relationships between co-cited author pairs. However, traditional ACA has been criticized because its input, which simply counts authors’ co-citation frequencies, is insufficiently informative. To address this issue, this paper introduces a new method, Document- and Keyword-Based Author Co-Citation Analysis (DKACA), which reconstructs the raw co-citation matrices by taking into account document unit counts and the keywords of references. Building on traditional ACA, DKACA counts co-citation pairs by document units instead of authors from the global network perspective. Moreover, by incorporating keyword information from cited papers, DKACA captures the semantic similarity between co-cited papers. To validate the method, we implemented network visualization and MDS measurement to evaluate the effectiveness of DKACA. Results suggest that the proposed DKACA method not only reveals insights that were previously unknown but also improves the performance and accuracy of knowledge domain mapping, representing a new basis for further studies.
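As a rough illustration of the difference between counting co-citations by author and by document unit, the following minimal sketch may help. It is not the DKACA implementation: the toy data, the function names `cocitation_counts` and `keyword_similarity`, the first-author mapping, and the use of Jaccard overlap as a keyword similarity are all assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(reference_lists):
    """Count how often each unordered pair of items appears in the same reference list."""
    counts = Counter()
    for refs in reference_lists:
        # Deduplicate within one citing paper, then count every pair once.
        for pair in combinations(sorted(set(refs)), 2):
            counts[pair] += 1
    return counts


def keyword_similarity(keywords_a, keywords_b):
    """Jaccard overlap of two keyword lists (an assumed stand-in for semantic similarity)."""
    a, b = set(keywords_a), set(keywords_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy data: each citing paper lists the documents it cites.
citing_papers = [["d1", "d2", "d3"], ["d1", "d2"], ["d2", "d3"]]
first_author = {"d1": "A", "d2": "B", "d3": "A"}
keywords = {
    "d1": ["citation", "mapping"],
    "d2": ["mapping", "network"],
    "d3": ["retrieval"],
}

# Document-level counting keeps d1 and d3 apart even though both are by author A.
document_counts = cocitation_counts(citing_papers)

# Author-level counting collapses d1 and d3 into a single author A.
author_counts = cocitation_counts(
    [[first_author[d] for d in refs] for refs in citing_papers]
)

print(document_counts)
print(author_counts)
print(keyword_similarity(keywords["d1"], keywords["d2"]))
```

In this toy example the author-level count sees only one pair (A, B), whereas the document-level count distinguishes three document pairs, which is the kind of extra information a document-based input can retain.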
Digital libraries have been strategic in preserving and making non-movable cultural heritage information accessible to everyone with network connections. In light of their cultural and historical importance in the ancient “Silk Road,” murals and stone caves in Dunhuang, a remote city in northwest China, have been digitized, and the first batch of digitized visual materials has been made available to the general public through the e-Dunhuang digital library since May 2016. The aim of this study was to systematically evaluate e-Dunhuang from users’ perspectives, through usability testing with nine user tasks at different complexity levels and in-depth interviews with regard to a set of user experience criteria. The results of quantitative analysis confirmed the overall effectiveness of e-Dunhuang in supporting user task completion and demonstrated significant improvements in several criteria over an earlier panorama collection of Dunhuang caves. The results of qualitative analysis revealed in-depth reasons why participants felt satisfied with some criteria but had concerns about others. Based on the findings, suggestions are proposed for further improvement of e-Dunhuang. As e-Dunhuang is a representative repository of digitized visual materials of cultural heritage, this study offers insights and empirical findings on user-centered evaluation of cultural heritage digital libraries.
In this study, we investigated the quantity and impact of worldwide research production in the field of “project management” over the past 38 years. We performed a bibliometric analysis using the Scopus database between 1980 and 2017 to develop an understanding of the evolution of research on “project management.” Using the knowledge of a domain expert in the field of “project management,” we first compiled a set of reliable keywords that represented the field. Second, we developed a data extraction strategy for searching the phrase “project management” in the title, keywords, or abstract of publications, limiting our sources to journals in English. We observed the evolution of this field by analyzing not only the quantity of publications but also their impact (citations) per year, and compared their growth trends across four periods. The results of our analysis confirmed that not only the research themes or topics but also the active parties involved in project management research have experienced phenomenal changes over time.
Citation performance of a publication depends heavily on its academic field. Some words in keywords, titles, and abstracts of publications may be indicative of their academic field. Therefore, analysis of differences in citation performance of these words helps us understand inter-field differences in citation performance. In this article, we analyzed citation performance of publications that contain certain words in their keywords, titles, and abstracts in Web of Science from 2010 to 2012. We found that some words do not have a consistent performance. For instance, publications that use a certain word in their keywords have a different average performance compared to publications that use the same word in their titles. Next, we investigated keywords, titles, and abstracts separately. We laid out the words that have the lowest and highest average citations. Words that contain animal names, country names, and mathematical concepts are among the worst performers. Words that contain terminology specific to a scientific field and have relatively lower frequency are among the best performers.
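The comparison described above, in which the same word can perform differently depending on whether it appears in keywords or in titles, can be sketched as follows. This is a minimal sketch under stated assumptions and not the authors' procedure: the record layout, the function name `mean_citations_by_word`, the whitespace tokenization, and the toy records are all hypothetical.

```python
from collections import defaultdict


def mean_citations_by_word(records, field):
    """Average citation count of the publications whose `field` contains each word."""
    totals = defaultdict(lambda: [0, 0])  # word -> [citation sum, publication count]
    for rec in records:
        # Count each word once per publication, ignoring case.
        for word in set(rec[field].lower().split()):
            totals[word][0] += rec["citations"]
            totals[word][1] += 1
    return {word: total / n for word, (total, n) in totals.items()}


# Hypothetical toy records; real data would come from a bibliographic database.
records = [
    {"title": "Graphene sensors", "keywords": "graphene nanomaterials", "citations": 40},
    {"title": "Sheep grazing patterns", "keywords": "sheep ecology", "citations": 4},
    {"title": "Grazing models", "keywords": "sheep simulation", "citations": 10},
]

by_title = mean_citations_by_word(records, "title")
by_keyword = mean_citations_by_word(records, "keywords")

# The same word can have a different average depending on the field it appears in.
print(by_title["sheep"], by_keyword["sheep"])
```

Sorting either dictionary by value would give the lowest- and highest-performing words for that field.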
Linked data is becoming a mature technology as a lightweight realization of the Semantic Web, as well as a way of facilitating knowledge reorganization and discovery. As a use case and starting point, the Shanghai Library implemented a genealogy knowledge service platform based on linked data technology to provide knowledge discovery and open data services. This article explains the design and development of the Genealogy Knowledge Service Platform, describes the method and process of the implementation, and introduces four examples of how the platform helps users to discover, raise, and solve questions in their research, illustrating how linked data can be used in the digital humanities.
Many investigators have carried out text mining of the biomedical literature for a variety of purposes, ranging from the assignment of indexing terms to the disambiguation of author names. A common approach is to define positive and negative training examples, extract features from article metadata, and use machine learning algorithms. At present, each research group tackles each problem from scratch, in isolation from other projects, which causes redundancy and a great waste of effort. Here, we propose and describe the design of a generic platform for biomedical text mining, which can serve as a shared resource for machine learning projects and as a public repository for their outputs. We initially focus on a specific goal, namely, classifying articles according to publication type, and emphasize how feature sets can be made more powerful and robust through the use of multiple, heterogeneous similarity measures as input to machine learning models. We then discuss how the generic platform can be extended to include a wide variety of other machine learning-based goals and projects, and how it can also be used as a public platform for disseminating the results of natural language processing (NLP) tools to end-users.
Christina Lioma, Birger Larsen and Peter Ingwersen
When submitting queries to information retrieval (IR) systems, users often have the option of specifying which, if any, of the query terms are heavily dependent on each other and should be treated as a fixed phrase, for instance by placing them between quotes. In addition to such cases where users specify term dependence, automatic ways also exist for IR systems to detect dependent terms in queries. Most IR systems use both user and algorithmic approaches. It is not clear, however, whether and to what extent user-defined term dependence agrees with algorithmic estimates of term dependence, nor which of the two may fetch higher performance gains. Simply put, is it better to trust users or the system to detect term dependence in queries? To answer this question, we experiment with 101 crowdsourced search engine users and 334 queries (52 train and 282 test TREC queries) and we record 10 assessments per query. We find that (i) user assessments of term dependence differ significantly from algorithmic assessments of term dependence (their overlap is approximately 30%); (ii) there is little agreement among users about term dependence in queries, and this disagreement increases as queries become longer; (iii) the potential retrieval gain that can be fetched by treating term dependence (both user- and system-defined) over a bag of words baseline is reserved to a small subset (approximately 8%) of the queries, and is much higher for low-depth than deep precision measures. Points (ii) and (iii) constitute novel insights into term dependence.
Yiming Zhao, Baitong Chen, Jin Zhang, Ying Ding, Jin Mao and Lihong Zhou
This study investigates the evolution of diabetics’ concerns based on the analysis of terms in the Diabetes category logs on the Yahoo! Answers website. Two sets of question-and-answer (Q&A) log data were collected: one from December 2, 2005 to December 1, 2006; the other from April 1, 2013 to March 31, 2014. Network analysis and a t-test were performed to analyze the differences in diabetics’ concerns between these two data sets. Community detection and topic evolution were used to reveal detailed changes in diabetics’ concerns in the examined period. Increases in average node degree and graph density imply that the size of the vocabulary that diabetics use to post questions has decreased while the scope of questions has become more focused. The networks of key terms in the Q&A log data of 2005–2006 and 2013–2014 are significantly different according to the t-test analysis of degree centrality and betweenness centrality. Specifically, there is a shift in diabetics’ focus in that they have become more concerned about daily life and other nonmedical issues, including diet, food, and nutrients. The recent changes and the evolution paths of diabetics’ concerns were visualized using an alluvial diagram. The food- and diet-related terms have become prominent, as deduced from the visualization results.
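The two network measures mentioned above, average node degree and graph density, can be computed for an undirected term network as in the following minimal sketch. The edge lists are invented toy data and the function name `network_stats` is hypothetical; the study's own term networks and tooling are not described here.

```python
def network_stats(edges):
    """Return (average degree, density) of a simple undirected graph given as an edge list."""
    # Treat (a, b) and (b, a) as the same edge and ignore self-loops.
    edge_set = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    nodes = {node for edge in edge_set for node in edge}
    n, m = len(nodes), len(edge_set)
    avg_degree = 2 * m / n if n else 0.0
    # Density is the share of possible node pairs that are actually connected.
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0
    return avg_degree, density


# Hypothetical term co-occurrence edges for two periods.
earlier = [("insulin", "dose"), ("dose", "pump"), ("pump", "sugar")]
later = [("diet", "food"), ("food", "insulin"), ("diet", "insulin")]

print(network_stats(earlier))
print(network_stats(later))
```

In this toy example the later network has fewer terms but a higher average degree and density, which mirrors the pattern the abstract interprets as a smaller vocabulary with a more focused scope.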
Second-order h-type indicators are suggested for identifying top units in scientometrics. Basically, the re-ranking of an h-type series leads to the second-order h-type indicator. The second-order h-type indicators provide an interesting and natural method for identifying top units, yielding a fixed h-top. In contrast to the series of artificially defined highly cited percentile classes, the h-top contributes a natural, definite top in the series of highly cited classes. In theory, the second-order h-index concerns 3% of the h-top whereas the first-order h-index refers to 10% of the h-core. The ratio of the first- and second-order h-index, hT/h, is 30%. Empirically, the ratio of the first- and second-order h-index, hT/h, is <30%. The approach to calculating second-order h-type indicators is exemplified using journals in two fields.
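For readers unfamiliar with the first-order indicator that the second-order variants build on, the standard h-index can be computed as in the minimal sketch below: it is the largest h such that h publications each have at least h citations. The re-ranking step that produces the second-order indicator is not specified in this abstract, so it is deliberately not implemented here; the function name `h_index` and the citation counts are illustrative assumptions.

```python
def h_index(citations):
    """Largest h such that at least h items have h or more citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical citation counts for the publications of one unit.
citations = [10, 8, 5, 4, 3]
h = h_index(citations)

# The h-core is the set of the h most cited publications.
h_core = sorted(citations, reverse=True)[:h]
print(h, h_core)
```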
Currently, we are witnessing the emergence and abundance of many different data repositories and archival systems for scientific data discovery, use, and analysis. With the burgeoning of available data-sharing platforms, this study addresses how scientists working in the fields of natural resources and environmental sciences navigate these diverse data sources, what their concerns and value propositions are toward multiple data discovery channels, and most importantly, how they perceive the characteristics and compare the functionalities of different types of data repository systems. Through user community research with domain scientists on their data use dynamics and insights, this research provides strategies and discusses ideas on how to leverage these different platforms. Furthermore, it proposes a novel, top-down approach to the processes of searching, browsing, and visualizing for the dynamic exploration of environmental data.