Filtering out irrelevant documents and classifying the relevant ones into topical categories is a common task in many applications. However, supervised learning solutions require substantial human effort for document labeling. In this paper, we propose a novel seed-guided topic model for dataless short text classification and filtering, named SSCF. Without using any labeled documents, SSCF takes a few “seed words” for each category of interest, and conducts short text filtering and classification in a weakly supervised manner. To overcome the issues of data sparsity and imbalance, the short text collection is mapped to a collection of pseudo-documents, one for each word. SSCF infers two kinds of topics on pseudo-documents: category-topics and general-topics. Each category-topic is associated with one category of interest, covering the meaning of the latter. In SSCF, we devise a novel word relevance estimation process based on the seed words, for hidden topic inference. The dominating topic of a short text is identified through post inference and then used for filtering and classification. On two real-world datasets in two languages, experimental results show that our proposed SSCF consistently achieves better classification accuracy than state-of-the-art baselines. We also observe that SSCF can even outperform the supervised classifiers supervised latent Dirichlet allocation (sLDA) and support vector machine (SVM) on some testing tasks.
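The mapping from short texts to word-level pseudo-documents can be illustrated with a minimal sketch. The abstract does not specify how a pseudo-document is built, so the sketch assumes that the pseudo-document of a word is the bag of all words co-occurring with it in the same short text. The function name `build_pseudo_documents` and the sample texts are hypothetical.

```python
from collections import defaultdict


def build_pseudo_documents(short_texts):
    """Map tokenized short texts to one pseudo-document per word.

    Assumption (not stated in the abstract): the pseudo-document of a
    word is the list of all words that co-occur with it in a short text.
    """
    pseudo_docs = defaultdict(list)
    for tokens in short_texts:
        for i, word in enumerate(tokens):
            # Every other token in the same short text is context for `word`.
            context = tokens[:i] + tokens[i + 1:]
            pseudo_docs[word].extend(context)
    return dict(pseudo_docs)


# Hypothetical toy collection of two tokenized short texts.
texts = [
    ["cheap", "flight", "deals"],
    ["flight", "delay", "refund"],
]
pseudo_docs = build_pseudo_documents(texts)
print(pseudo_docs["flight"])
```

Because a frequent word gathers context from every short text it appears in, its pseudo-document is much longer than any single short text, which is one plausible reading of how the mapping eases data sparsity.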
Yukun Zheng, Yiqun Liu, Zhen Fan, Cheng Luo, Qingyao Ai, Min Zhang and Shaoping Ma
A number of deep neural networks have been proposed to improve the performance of document ranking in information retrieval studies. However, training these models usually requires large amounts of labeled data, so data shortage has become a major hindrance to improving the performance of neural ranking models. Recently, several weakly supervised methods have been proposed to address this challenge, using heuristics or user interactions on Search Engine Result Pages (SERPs) to generate weak relevance labels. In this work, we adopt two kinds of weakly supervised relevance, BM25-based relevance and click model-based relevance, and investigate in depth how they differ in the training of neural ranking models. Experimental results show that BM25-based relevance helps models capture more exact matching signals, while click model-based relevance enhances the rankings of documents that may be preferred by users. We further propose a cascade ranking framework to combine the two kinds of weakly supervised relevance, which significantly improves the ranking performance of neural ranking models and outperforms the best result in the last NTCIR-13 We Want Web (WWW) task. This work reveals the potential of constructing better document retrieval systems based on multiple kinds of weak relevance signals.
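As an illustration of BM25-based weak relevance, the following minimal sketch scores tokenized documents against a query with Okapi BM25 and uses the resulting order as weak labels. The abstract does not give the BM25 configuration, so the parameter values `k1=1.2` and `b=0.75`, the non-negative IDF variant, and the toy documents are all assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query (Okapi BM25).

    k1 and b are conventional defaults, assumed here for illustration.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturation with document length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy data.
docs = [
    ["neural", "ranking", "model"],
    ["ranking", "documents", "with", "clicks"],
    ["weather", "report", "today"],
]
query = ["neural", "ranking"]
scores = bm25_scores(query, docs)
# Document indices ordered by descending score serve as weak relevance labels.
weak_ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(weak_ranking)
```

Since the score depends only on exact term overlap, labels generated this way would naturally push a model toward exact matching signals, consistent with the finding reported above.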
The study aims to reveal the role of social media and its influence on information sharing within public organizations, with an emphasis on the distribution affordance to facilitate information processes. Existing literature has emphasized different aspects of social media in the public sector that promote the relationship between government and citizens or provide better public service, for example, innovation, policies, openness, and communication. However, there is a wide gap in the literature regarding social media use and information sharing within public organizations. The current study addresses this gap by conducting semi-structured interviews with 15 employees in public organizations in Chaohu city, China and applying content analysis to the interviews. Unlike the existing literature, this study divides the target group into three levels, (i) senior-level, (ii) middle-level, and (iii) junior-level employees, to get a better view of social media use. The study is based on grounded theory for coding analysis. We provide an overview of social media use within Chinese public organizations and discuss five social media affordances involved in the public organizations. Finally, we provide the implications, limitations, recommendations, and future research directions for this research area.
The Danmu function, an augmented comment feature, has been adopted by almost all live streaming platforms in China to foster interaction between viewers and the streamer. However, few studies have been conducted to understand the determinants of users’ Danmu sending behavior on live streaming platforms. This study examines this phenomenon through the lens of effectance theory and the S-O-R framework. We propose that two effectances, Danmu effectance and live streaming effectance, play an essential role in active Danmu participation. In addition, we explore the effects of the time-enhanced (synchronicity) and space-enhanced (visibility) technical characteristics of Danmu on live streaming platforms on the two effectances. Data analysis of 877 participants from the Douyu platform in mainland China indicates that active Danmu participation is positively associated with Danmu effectance and live streaming effectance, which are influenced by both the time-enhanced technical feature (synchronicity) and the space-enhanced technical feature (visibility). In addition, the study finds that demographic characteristics, namely education and income, also affect active Danmu participation.
This study aims to construct a microblog influence prediction model and to reveal how the user, time, and content features of microblog entries about public health emergencies affect the influence of those entries. Microblog entries about the Ebola outbreak are selected as data sets. The BM25 latent Dirichlet allocation model (LDA-BM25) is used to extract topics from the microblog entries. A microblog influence prediction model is proposed by using the random forest method. Results reveal that the proposed model can predict the influence of microblog entries about public health emergencies with a precision rate reaching 88.8%. The individual features that play a role in the influence of microblog entries, as well as their influence tendencies, are also analyzed. The proposed microblog influence prediction model consists of user, time, and content features. It addresses the deficiency that content features are often ignored by other microblog influence prediction models. The roles of the three features in the influence of microblog entries are also discussed.
A prevalent belief is that it is advantageous to have surname initials that are placed early in the alphabet (early surname initials) in academic fields in which authors are ordered alphabetically (alphabetic academic fields), because first authors are more visible. However, it is not certain that the advantage is strong enough to affect academic careers. In this paper, the advantage of having such early surname initials is analyzed by using data from 1,345 course catalogs that span 100 years. We obtained academic titles and surname initials of 19,353 faculty members who appeared 211,816 times in these course catalogs. Two alphabetic academic fields – economics and mathematics – and four other academic fields that are not alphabetic were analyzed. We found that there are some years when faculty members who have early surname initials are more likely to be full professors. However, there are many other years when faculty members who have early surname initials are less likely to be full professors. We also analyzed the career path of each faculty member. Economists who have early surname initials are found to be more likely to become full professors. However, this result is not significant and does not extend to mathematicians.
This paper reports the results of an international survey on research data management (RDM) services in libraries. More than 240 practicing librarians responded to the survey and outlined their roles and levels of preparedness in providing RDM services, the challenges their libraries face, and the knowledge and skills that they deemed essential to advance RDM practice. Findings of the study revealed not only a number of location and organizational differences in the RDM services and tools provided but also the impact of the level of preparedness and degree of development in RDM roles on the types of RDM services provided. Respondents’ perceptions of both the current challenges and future roles of RDM services were also examined. With a majority of the respondents recognizing the importance of RDM and hoping to receive more training while expressing concerns about a lack of bandwidth or capacity in this area, it is clear that, in order to grow RDM services, institutional commitment to resources and training opportunities is crucial. As an emergent profession, data librarians need to be nurtured, mentored, and further trained. The study makes a case for developing a global community of practice where data librarians work together, exchange information, help one another grow, and strive to advance RDM practice around the world.
Internationalization is important for research quality and for specialization on new themes in the social sciences and humanities (SSH). Interaction with society, however, is just as important in these areas of research for realizing the ultimate aims of knowledge creation. This article demonstrates how the heterogeneous publishing patterns of the SSH may reflect and fulfill both purposes. The limited coverage of the SSH in Scopus and Web of Science is discussed along with ideas about how to achieve a more complete representation of all the languages and publication types that are actually used in the SSH. A dynamic and empirical concept of balanced multilingualism is introduced to support combined strategies for internationalization and societal interaction. The argument is that all the communication purposes in all different areas of research, and all the languages and publication types needed to fulfill these purposes, should be considered in a holistic manner without exclusions or priorities whenever research in the SSH is evaluated.
Maria Esteva, Ramona L. Walls, Andrew B. Magill, Weijia Xu, Ruizhu Huang, James Carson and Jawon Song
The Identifier Services (IDS) project conducted research into and built a prototype to manage distributed genomics datasets remotely and over time. Inspired by archival concepts, IDS allows researchers to track dataset evolution through multiple copies, modifications, and derivatives, independent of where data are located – both symbolically, in the research lifecycle, and physically, in a repository or storage facility. The prototype implementation is based on a three-step data modeling process involving: a) understanding and recording of different researcher workflows, b) mapping the workflows and data to a generic data model and identifying functions, and c) integrating the data model as architecture and interactive functions into cyberinfrastructure (CI). Identity functions are operationalized as continuous tracking of authenticity attributes including data location, differences between seemingly identical datasets, metadata, data integrity, and the roles of different types of local and global identifiers used during the research lifecycle. CI resources were used to conduct identity functions at scale, including scheduling content comparison tasks on high-performance computing resources. The prototype was developed and evaluated considering six data test cases, and feedback was received through a focus-group activity. While there are some technical roadblocks to overcome, our project demonstrates that identity functions are innovative solutions to manage large distributed genomic datasets.
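One identity function mentioned above, detecting differences between seemingly identical datasets, can be sketched with content checksums. This is a minimal illustration and not the IDS implementation, which the abstract does not describe. The function names, the file locations, and the use of SHA-256 are assumptions, and in-memory byte strings stand in for real dataset files.

```python
import hashlib


def checksum(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hex digest of `data` under the given hash algorithm."""
    h = hashlib.new(algorithm)
    h.update(data)
    return h.hexdigest()


def group_copies_by_content(copies: dict) -> dict:
    """Group dataset copies (location -> bytes) by content checksum.

    Copies that share a checksum are treated as identical; a dataset whose
    copies fall into more than one group has diverged somewhere.
    """
    groups = {}
    for location, data in copies.items():
        groups.setdefault(checksum(data), []).append(location)
    return groups


# Hypothetical copies of the "same" dataset held at three locations.
copies = {
    "repository/reads.fasta": b">seq1\nACGT\n",
    "hpc_scratch/reads.fasta": b">seq1\nACGT\n",
    "laptop/reads.fasta": b">seq1\nACGA\n",
}
groups = group_copies_by_content(copies)
for digest, locations in groups.items():
    print(digest[:12], locations)
```

For datasets too large to hold in memory, the same idea applies by feeding each file to the hash object in chunks, which is the kind of task that could be scheduled on high-performance computing resources.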
Weijia Xu, Amit Gupta, Pankaj Jaiswal, Crispin Taylor, Patti Lockhart and Jennifer Regala
With the increasing number of digital journal submissions, there is a need to deploy new scalable computational methods to improve information accessibility. One common task is to identify useful information and named entities in text documents such as journal article submissions. However, many technical challenges limit the applicability of general methods, and general tools are lacking. In this paper, we present the domain informational vocabulary extraction (DIVE) project, which aims to enrich digital publications by detecting entities and key informational words and by adding annotations. In what is, to our knowledge, the first effort of its kind, our system engages authors of peer-reviewed articles and journal publishers by integrating the DIVE implementation into the manuscript proofing and publication process. The system implements multiple strategies for biological entity detection, including regular expression rules, ontologies, and a keyword dictionary. The extracted entities are then stored in a database and made accessible through an interactive web application for curation and evaluation by authors. Through the web interface, the authors can make additional annotations and corrections to the current results. The updates can then be used to improve entity detection in subsequently processed articles. We describe our framework and deployment in detail. In a pilot program, we have deployed the first phase of development as a service integrated with the journals Plant Physiology and The Plant Cell published by the American Society of Plant Biologists (ASPB). We present usage statistics since the service entered production in April 2018. We compare automated recognition results from DIVE with results from author curation and show that the service achieved on average 80% recall and 70% precision per article. In contrast, an existing biological entity extraction tool, A Biomedical Named Entity Recognizer (ABNER), achieves only 47% recall and returns a much larger candidate set.
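Two of the detection strategies named above, regular expression rules and a keyword dictionary, together with the per-article precision and recall evaluation against author curation, can be sketched as follows. This is not the DIVE code. The locus identifier pattern, the keyword list, the sample sentence, and the curated set are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical rule: Arabidopsis-style locus identifiers such as AT1G01010.
LOCUS_PATTERN = re.compile(r"\bAT[1-5CM]G\d{5}\b")
# Hypothetical keyword dictionary of domain terms.
KEYWORDS = {"auxin", "chloroplast", "stomata"}


def detect_entities(text):
    """Return entities found by the regex rule or the keyword dictionary."""
    found = set(LOCUS_PATTERN.findall(text))
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    found.update(t for t in tokens if t in KEYWORDS)
    return found


def precision_recall(predicted, curated):
    """Compare automated results with an author-curated set for one article."""
    true_positives = len(predicted & curated)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(curated) if curated else 0.0
    return precision, recall


# Hypothetical article text and author curation.
text = "Auxin signaling through AT1G01010 alters chloroplast development."
predicted = detect_entities(text)
curated = {"AT1G01010", "auxin", "stomata", "NAC001"}
precision, recall = precision_recall(predicted, curated)
print(sorted(predicted), precision, recall)
```

Averaging the two values over all curated articles would give per-article figures of the kind reported above.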