Abstract

Purpose

To provide an overview of the different types of citation curves.

Design/methodology/approach

The meanings of the terms “citation curve” and “citation graph” are made explicit.

Findings

A framework for the study of diachronous (and synchronous) citation curves is proposed.
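
For illustration, here is a minimal Python sketch of one common reading of the two terms, assuming a toy list of (cited year, citing year) pairs: a diachronous curve follows the citations received over time by papers published in a given year, while a synchronous curve looks at the ages of the papers cited in a single citing year.

    from collections import Counter

    # Hypothetical citation records: (publication year of the cited paper, year of the citing paper).
    citations = [(2010, 2011), (2010, 2012), (2010, 2012), (2011, 2012), (2011, 2014)]

    # Diachronous curve: citations received per year by the papers published in 2010.
    diachronous = Counter(citing for cited, citing in citations if cited == 2010)

    # Synchronous curve: age distribution of the papers cited during 2012.
    synchronous = Counter(citing - cited for cited, citing in citations if citing == 2012)

    print(sorted(diachronous.items()))  # [(2011, 1), (2012, 2)]
    print(sorted(synchronous.items()))  # [(1, 1), (2, 2)]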

Research limitations

No new practical applications are given.

Practical implications

This short note about citation curves will help readers choose the type of citation curve best suited to their applications.

Originality/value

A new scheme clarifying the meanings of the term “citation curve” is proposed.

Abstract

Purpose

The aim of this research is to propose a modification of the ANOVA-SVM method that can increase accuracy when distinguishing benign from malignant breast cancer.

Methodology

We propose a new method, ANOVA-BOOTSTRAP-SVM. It applies analysis of variance (ANOVA) to support vector machines (SVM), but uses the bootstrap instead of cross-validation as the train/test splitting procedure. We tuned the kernel and the C parameter and tested the algorithm on a set of breast cancer datasets.
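
A minimal sketch of the general idea in Python, assuming ANOVA is used as an F-test feature-selection step feeding the SVM (one common reading of ANOVA-SVM) and using scikit-learn's bundled breast cancer data in place of the paper's datasets; the out-of-bag observations of each bootstrap sample serve as the test set.

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.metrics import accuracy_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)

    # ANOVA F-test feature selection feeding an SVM; the kernel and C would be tuned in practice.
    model = make_pipeline(StandardScaler(),
                          SelectKBest(f_classif, k=10),
                          SVC(kernel="rbf", C=1.0))

    # Bootstrap resampling replaces cross-validation as the train/test splitting procedure.
    rng = np.random.default_rng(0)
    n = len(y)
    scores = []
    for _ in range(100):
        train_idx = rng.choice(n, size=n, replace=True)        # bootstrap sample
        test_idx = np.setdiff1d(np.arange(n), train_idx)       # out-of-bag observations
        model.fit(X[train_idx], y[train_idx])
        scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
    print(round(float(np.mean(scores)), 3))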

Findings

By using the proposed method, we improved accuracy by 4.5 to 8 percentage points, depending on the dataset.

Research limitations

The algorithm is sensitive to the type of kernel and value of the optimization parameter C.

Practical implications

We believe that the ANOVA-BOOTSTRAP-SVM can be used not only to recognize the type of breast cancer but also for broader research in all types of cancer.

Originality/value

Our findings are important as the algorithm can detect various types of cancer with higher accuracy compared to standard versions of the Support Vector Machines.

Abstract

Purpose

This paper aims to analyze the effectiveness of two major types of features—metadata-based (behavioral) and content-based (textual)—in opinion spam detection.

Design/methodology/approach

Based on spam-detection perspectives, our approach works in three settings: review-centric (spam detection), reviewer-centric (spammer detection) and product-centric (spam-targeted product detection). In addition, to avoid any classifier bias, we employ four classifiers to obtain a better and unbiased reflection of the results. We also propose a new set of features and compare them against those used in several well-known related works. The experiments performed on two real-world datasets show the effectiveness of different features in opinion spam detection.
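
A hedged sketch of this kind of evaluation setup in Python with scikit-learn; the feature matrices are random toy stand-ins for the YelpZip/YelpNYC features, and the four classifiers are arbitrarily chosen here since the abstract does not name them.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import LinearSVC
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    # Toy stand-ins: behavioral (metadata-based) features such as rating deviation or
    # reviewing rate, and textual (content-based) features such as TF-IDF statistics.
    X_behavioral = rng.random((200, 5))
    X_textual = rng.random((200, 50))
    y = rng.integers(0, 2, 200)          # 1 = spam, 0 = genuine (toy labels)

    classifiers = [LogisticRegression(max_iter=1000), GaussianNB(),
                   DecisionTreeClassifier(random_state=0), LinearSVC()]
    for name, X in [("behavioral", X_behavioral), ("textual", X_textual),
                    ("hybrid", np.hstack([X_behavioral, X_textual]))]:
        for clf in classifiers:
            score = cross_val_score(clf, X, y, cv=5).mean()
            print(f"{name:10s} {type(clf).__name__:22s} {score:.3f}")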

Findings

Our findings indicate that behavioral features are both more efficient and more effective than textual features for detecting opinion spam across all three settings. In addition, models trained on hybrid features produce results much closer to those trained on behavioral features than to those trained on textual features, further establishing the superiority of behavioral features as dominating indicators of opinion spam. The features used in this work provide an improvement over the features utilized in other related works. Furthermore, an analysis of the computation time of the feature extraction phase shows that behavioral features are also more cost-efficient than textual ones.

Research limitations

The analyses conducted in this paper are solely limited to two well-known datasets, viz., YelpZip and YelpNYC of Yelp.com.

Practical implications

The results obtained in this paper can be used to improve the detection of opinion spam, wherein the researchers may work on improving and developing feature engineering and selection techniques focused more on metadata information.

Originality/value

To the best of our knowledge, this study is the first of its kind which considers three perspectives (review, reviewer and product-centric) and four classifiers to analyze the effectiveness of opinion spam detection using two major types of features. This study also introduces some novel features, which help to improve the performance of opinion spam detection methods.

Abstract

Purpose

We propose InParTen2, a multi-aspect parallel factor analysis three-dimensional tensor decomposition algorithm based on the Apache Spark framework. The proposed method reduces re-decomposition cost and can handle large tensors.

Design/methodology/approach

Considering that tensor addition increases the size of a given tensor along all axes, the proposed method decomposes incoming tensors using existing decomposition results without generating sub-tensors. Additionally, InParTen2 avoids the calculation of Khatri–Rao products and minimizes shuffling by using the Apache Spark platform.
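
As a rough illustration of the incremental idea (not of InParTen2 itself), the following single-machine NumPy sketch appends new slices along the third mode by solving a least-squares problem against the existing CP factor matrices, so the previously decomposed part of the tensor is never revisited; all names and shapes here are assumptions.

    import numpy as np

    def khatri_rao(B, A):
        # Column-wise Kronecker product: (J*I) x R.
        J, R = B.shape
        I, _ = A.shape
        return np.einsum('jr,ir->jir', B, A).reshape(J * I, R)

    def append_slices_mode3(X_new, A, B):
        """Fit factor rows for new mode-3 slices X_new (I x J x K_new), keeping the
        existing CP factors A (I x R) and B (J x R) fixed, instead of re-decomposing
        the whole tensor."""
        I, J, K_new = X_new.shape
        M = khatri_rao(B, A)                              # (I*J) x R
        X3 = X_new.reshape(I * J, K_new, order='F').T     # mode-3 unfolding, K_new x (I*J)
        C_new, *_ = np.linalg.lstsq(M, X3.T, rcond=None)  # solve M @ C_new^T ~ X3^T
        return C_new.T                                    # K_new x R

In a fuller pipeline the returned rows would simply be appended to the third factor matrix, optionally followed by a few ALS refinement sweeps over all factors.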

Findings

The performance of InParTen2 is evaluated by comparing its execution time and accuracy with those of existing distributed tensor decomposition methods on various datasets. The results confirm that InParTen2 can process large tensors and reduce the re-calculation cost of tensor decomposition. Consequently, the proposed method is faster than existing tensor decomposition algorithms and can significantly reduce re-decomposition cost.

Research limitations

There are several Hadoop-based distributed tensor decomposition algorithms as well as MATLAB-based decomposition methods. However, the former require longer iteration time, and therefore their execution time cannot be compared with that of Spark-based algorithms, whereas the latter run on a single machine, thus limiting their ability to handle large data.

Practical implications

The proposed algorithm can reduce re-decomposition cost when tensors are added to a given tensor by decomposing them based on existing decomposition results without re-decomposing the entire tensor.

Originality/value

The proposed method can handle large tensors and is fast within the limited-memory framework of Apache Spark. Moreover, InParTen2 can handle static as well as incremental tensor decomposition.

Abstract

Purpose

Opinion mining and sentiment analysis in online learning communities can accurately reflect students’ learning situation, providing the necessary theoretical basis for subsequent revision of teaching plans. To improve the accuracy of topic-sentiment analysis, a novel topic-sentiment analysis model is proposed that outperforms other state-of-the-art models.

Methodology/approach

We focus on identifying and visualizing topic sentiment based on learning-topic mining and sentiment clustering at various levels of granularity. The proposed method comprises data preprocessing, topic detection, sentiment analysis, and visualization.
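
A minimal Python sketch of the topic detection and sentiment steps, assuming scikit-learn's LDA for topic detection and a toy sentiment lexicon as a stand-in for the paper's sentiment clustering; the comments are invented examples.

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    # Hypothetical learner comments; the study mines posts from an online learning community.
    comments = ["the lecture on recursion was clear and helpful",
                "homework three is confusing and the deadline is too tight",
                "loved the group project, great collaboration",
                "the grading rubric for essays feels unfair"]

    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(comments)

    # Topic detection: document-topic and topic-term matrices from LDA.
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topic = lda.fit_transform(X)
    topic_term = lda.components_

    # Very simple lexicon-based sentiment per comment (a toy stand-in for sentiment clustering).
    positive = {"clear", "helpful", "loved", "great"}
    negative = {"confusing", "tight", "unfair"}
    for text, topics in zip(comments, doc_topic):
        tokens = set(text.replace(",", "").split())
        polarity = len(tokens & positive) - len(tokens & negative)
        print(int(topics.argmax()), polarity, text)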

Findings

The proposed model can effectively capture students’ sentiment tendencies on different topics, providing a practical reference for improving the quality of information services in teaching practice.

Research limitations

The model obtains the topic-term matrix and the document-topic matrix from real users’ comments using an LDA-based topic detection approach, but it does not consider the intensity of students’ sentiments or how they evolve over time.

Practical implications

Using implication and association rules to visualize negative sentiment in comments or reviews enables teachers and administrators to locate specific complaints, which can serve as a reference for enhancing the accuracy of learning content recommendation and for evaluating the quality of their services.

Originality/value

The topic-sentiment analysis model can clarify the hierarchical dependencies between different topics, which lays the foundation for improving the accuracy of teaching content recommendation and optimizing the knowledge coherence of related courses.

Abstract

Purpose

The main aim of this study is to build a robust novel approach that can detect outliers in datasets accurately. To serve this purpose, a novel approach is introduced to determine the likelihood that an object is extremely different from the general behavior of the entire dataset.

Design/methodology/approach

This paper proposes a novel two-level approach based on the integration of bagging and voting techniques for anomaly detection problems. The proposed approach, named Bagged and Voted Local Outlier Detection (BV-LOF), benefits from the Local Outlier Factor (LOF) as the base algorithm and improves its detection rate by using ensemble methods.
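
A compact Python sketch of the bagging-plus-voting idea, assuming scikit-learn's LocalOutlierFactor as the base LOF and simple score averaging as the vote; the subsample size, k values and ensemble size below are illustrative, not the paper's settings.

    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor

    def bv_lof_scores(X, n_bags=10, k_values=range(5, 51, 5), sample_frac=0.8, seed=0):
        """Run LOF with several neighborhood sizes k on random subsamples (bagging)
        and average each point's outlier scores across all runs (voting)."""
        rng = np.random.default_rng(seed)
        n = len(X)
        score_sum = np.zeros(n)
        counts = np.zeros(n)
        for _ in range(n_bags):
            idx = rng.choice(n, size=int(sample_frac * n), replace=False)
            for k in k_values:
                lof = LocalOutlierFactor(n_neighbors=k)
                lof.fit(X[idx])
                score_sum[idx] += -lof.negative_outlier_factor_   # larger = more anomalous
                counts[idx] += 1
        return score_sum / np.maximum(counts, 1)

    X = np.random.default_rng(1).normal(size=(300, 2))
    X[:5] += 6                                  # plant a few obvious outliers
    print(np.argsort(-bv_lof_scores(X))[:5])    # indices of the highest-scoring points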

Findings

Several experiments have been performed on ten benchmark outlier detection datasets to demonstrate the effectiveness of the BV-LOF method. According to the results, the BV-LOF approach significantly outperformed LOF on nine of the ten datasets on average.

Research limitations

In the BV-LOF approach, the base algorithm is applied to each data subset multiple times, with a different neighborhood size (k) in each case and with different ensemble sizes (T). In our study, we chose the value range [1–100] for both k and T; however, these ranges can be changed according to the dataset handled and the problem addressed.

Practical implications

The proposed method can be applied to datasets from different domains (e.g. health, finance, manufacturing) without requiring any prior information. Since the BV-LOF method includes two-level ensemble operations, it may require more computation time than single-level ensemble methods; however, this drawback can be overcome by parallelization and by using a proper data structure such as an R*-tree or KD-tree.

Originality/value

The proposed approach (BV-LOF) investigates multiple neighborhood sizes (k), which captures instances with different local densities and thus makes it more likely to detect outliers that LOF may neglect. It also brings many benefits such as easy implementation, improved detection capability, higher applicability, and interpretability.

Abstract

Purpose

This paper presents the ARQUIGRAFIA project, an open, public, nonprofit and continuously growing collaborative web environment dedicated to Brazilian architectural photographic images.

Design/methodology/approach

The ARQUIGRAFIA project promotes active and collaborative participation among its institutional users (GLAMs, NGOs, laboratories and research groups) and private users (students, professionals, professors, researchers). Both can create an account and share their digitized iconographic collections in the same web environment by uploading their files, indexing and georeferencing them, and assigning a Creative Commons license.

Findings

User interaction was developed by recording semantic-differential impressions of the visible plastic-spatial aspects of the architecture, presented in synthetic infographics, and by retrieving images through an advanced search system based on those impression parameters. Through gamification, the system regularly invites users to review images in order to improve the accuracy of image data. A pilot project named Open Air Museum allows users to add audio descriptions to images in situ, and an interface for users’ digital curatorship will soon be available.

Research limitations

ARQUIGRAFIA’s multidisciplinary team, which gathers professors, researchers, and graduate and undergraduate students from the Architecture and Urbanism, Design, Information Science, and Computer Science faculties of the University of São Paulo, demands continuous financial resources for grants, contracting third-party services, participation in scientific events in Brazil and abroad, and equipment. Since 2016, significant budget cuts to the University of São Paulo’s own research funds and to Brazilian federal scientific agencies may compromise the continuity of the project.

Practical implications

An open-source template called +GRAFIA can freely help other areas of knowledge build their own collaborative visual web environments.

Originality/value

The collaborative nature of the ARQUIGRAFIA distinguishes it from institutional image databases on the internet, precisely because it involves a heterogeneous network of collaborators.

Abstract

Purpose

As more and more digital collections of various information resources become available, the challenge of assigning subject index terms and classes from quality knowledge organization systems also increases. While the ultimate purpose is to understand the value of automatically produced Dewey Decimal Classification (DDC) classes for Swedish digital collections, the paper aims to evaluate the performance of six machine learning algorithms as well as a string-matching algorithm based on characteristics of DDC.

Design/methodology/approach

State-of-the-art machine learning algorithms require at least 1,000 training examples per class. The complete data set at the time of research comprised 143,838 records, which had to be reduced to the top three hierarchical levels of DDC in order to provide sufficient training data (totaling 802 classes in the training and testing sample, out of 14,413 classes at all levels).

Findings

Evaluation shows that the Support Vector Machine with a linear kernel outperforms the other machine learning algorithms as well as the string-matching algorithm on average; the string-matching algorithm outperforms machine learning for specific classes when the characteristics of DDC are most suitable for the task. Word embeddings combined with different types of neural networks (a simple linear network, a standard neural network, a 1D convolutional neural network, and a recurrent neural network) produced worse results than the Support Vector Machine, but came close, with the benefit of a smaller representation size. Analysis of feature impact shows that using keywords, or combining titles and keywords, gives better results than using only titles as input. Stemming only marginally improves the results. Removing stop-words reduced accuracy in most cases, while removing less frequent words increased it marginally. The greatest impact is produced by the number of training examples: 81.90% accuracy on the training set is achieved when at least 1,000 records per class are available, and 66.13% when too few records (often fewer than 100 per class) are available for training; these results hold only for the top three hierarchical levels (803 instead of 14,413 classes).
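
For the machine learning side, a minimal sketch of the best-performing setup described above (a linear-kernel SVM over combined titles and keywords) might look as follows in Python with scikit-learn; the records and DDC labels are toy placeholders, not the Swedish catalogue data or the study's actual pipeline.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Toy records; real input would be bibliographic titles and keywords with DDC classes.
    records = [
        {"title": "Introduction to marine biology", "keywords": ["oceans", "ecology"], "ddc": "578"},
        {"title": "A history of medieval Europe",   "keywords": ["middle ages"],       "ddc": "940"},
        {"title": "Coral reef ecosystems",          "keywords": ["marine life"],       "ddc": "578"},
        {"title": "European feudal societies",      "keywords": ["history", "europe"], "ddc": "940"},
    ]

    texts = [r["title"] + " " + " ".join(r["keywords"]) for r in records]  # titles + keywords
    labels = [r["ddc"][:3] for r in records]                               # top three DDC levels

    clf = make_pipeline(TfidfVectorizer(), LinearSVC())
    clf.fit(texts, labels)
    print(clf.predict(["ecology of tropical oceans"]))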

Research limitations

Having to reduce the number of hierarchical levels to the top three levels of DDC, because of the lack of training data for all classes, skews the results: they hold in experimental conditions but barely for end users in operational retrieval systems.

Practical implications

In conclusion, for operative information retrieval systems, applying purely automatic DDC classification does not work, either using machine learning (because of the lack of training data for the large number of DDC classes) or using the string-matching algorithm (because DDC characteristics perform well for automatic classification only in a small number of classes). Over time, more training examples may become available, and DDC may be enriched with synonyms in order to enhance the accuracy of automatic classification, which may also benefit information retrieval performance based on DDC. In order for quality information services to reach the objective of the highest possible precision and recall, automatic classification should never be implemented on its own; instead, machine-aided indexing that combines the efficiency of automatic suggestions with the quality of human decisions at the final stage should be the way forward.

Originality/value

The study explored machine learning on a large classification system of over 14,000 classes which is used in operational information retrieval systems. Due to lack of sufficient training data across the entire set of classes, an approach complementing machine learning, that of string matching, was applied. This combination should be explored further since it provides the potential for real-life applications with large target classification systems.

Abstract

Purpose

To develop a set of metrics and identify criteria for assessing the functionality of linked open data (LOD) knowledge organization system (KOS) products, while providing common guiding principles that can be used by LOD KOS producers and users to maximize the functions and usage of LOD KOS products.

Design/methodology/approach

Data collection and analysis were conducted at three time periods in 2015–16, 2017 and 2019. The sample data used in the comprehensive data analysis comprises all datasets tagged as types of KOS in the Datahub and extracted through their respective SPARQL endpoints. A comparative study of the LOD KOS collected from terminology services Linked Open Vocabularies (LOV) and BioPortal was also performed.
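
A small Python sketch of the kind of harvesting involved, using SPARQLWrapper against a placeholder endpoint URL (the study queries each KOS dataset's own SPARQL endpoint listed in the Datahub) and counting SKOS concepts as one simple size indicator; the specific query is an assumption, not the study's instrument.

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Placeholder endpoint; real endpoints come from the Datahub dataset entries.
    endpoint = SPARQLWrapper("https://example.org/sparql")
    endpoint.setReturnFormat(JSON)
    endpoint.setQuery("""
        PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
        SELECT (COUNT(DISTINCT ?c) AS ?concepts)
        WHERE { ?c a skos:Concept . }
    """)

    result = endpoint.query().convert()
    print(result["results"]["bindings"][0]["concepts"]["value"])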

Findings

The study proposes a set of Functional, Impactful and Transformable (FIT) metrics for LOD KOS as value vocabularies. The FAIR principles, with additional recommendations, are presented for LOD KOS as open data.

Research limitations

The metrics need to be further tested and aligned with the best practices and international standards of both open data and various types of KOS.

Practical implications

Assessments performed with the FAIR and FIT metrics support the creation and delivery of user-friendly, discoverable and interoperable LOD KOS datasets, which can be used for innovative applications, act as knowledge bases, become a foundation for semantic analysis and entity extraction, and enhance research in the sciences and humanities.

Originality/value

Our research provides best practice guidelines for LOD KOS as value vocabularies.

Abstract

Purpose

This research project aims to organize the archival information of traditional Korean performing arts in a semantic web environment. Key requirements that archival records managers should consider for publishing and distributing gugak performing arts archival information in a semantic web environment are presented from the perspective of linked data.

Design/methodology/approach

This study analyzes the metadata provided by the National Gugak Center’s Gugak Archive, the search and browse menus of Gugak Archive’s website and K-PAAN, the performing arts portal site.

Findings

The importance of consistency, continuity, and systematicity—crucial qualities in traditional record management practices—is undiminished in a semantic web environment. However, a semantic web environment also requires new tools such as web identifiers (URIs), data models (RDF), and link information (interlinking).
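
To make the three new tools concrete, here is a tiny Python/rdflib sketch with entirely hypothetical URIs, properties and example values (the actual model would follow the Gugak Archive's own metadata and chosen vocabularies): a URI identifies a record, RDF triples describe it, and a final triple interlinks it with an external dataset.

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import DC, RDF

    EX = Namespace("http://example.org/gugak/")        # placeholder namespace
    g = Graph()

    record = URIRef(EX["performance/0001"])            # web identifier (URI) for one record
    g.add((record, RDF.type, EX.Performance))
    g.add((record, DC.title, Literal("Sujecheon", lang="en")))     # toy example title
    g.add((record, DC.creator, Literal("National Gugak Center")))
    # Link information (interlinking): point at a related resource in an external dataset.
    g.add((record, EX.relatedTo, URIRef("http://example.org/external/related-entity")))

    print(g.serialize(format="turtle"))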

Research limitations

The scope of this study does not include practical implementation strategies for the archival records management system and website services. The suggestions also do not discuss issues related to copyright or policy coordination between related organizations.

Practical implications

The findings of this study can assist records managers in converting a traditional performing arts information archive into a semantic web environment-based online archival service and system. This can also be useful for collaboration with records managers who are unfamiliar with relational or triple database systems.

Originality/value

This study analyzed the metadata of the Gugak Archive and its online services to present practical requirements for managing and disseminating gugak performing arts information in a semantic web environment. In applying the principles and methods of semantic web services to the Gugak Archive, this study can contribute to the improvement of information organization and services in the field of Korean traditional music.