Impact of Antibody Panel Size on Classification Accuracy
This paper experimentally studies how reducing the antibody panel size influences classification results. The study applies four classification methods and five feature evaluators to five different high-dimensional biomedical data sets (1200 features). The behaviour of the classifiers on these data sets is examined to reveal overall trends in how dimensionality reduction affects classification accuracy.
Using Data Structure Properties in Decision Tree Classifier Design
This paper studies performance enhancement techniques for decision tree classifiers (DTC) that are based on data structure analysis. Two methods are used to improve DTC performance: class decomposition, which uses the structure of class density, and taxonomy-based DTC design, which uses interactions between attribute values. The paper presents an experimental exploration of the methods, discusses their strengths and weaknesses, and outlines directions for further research.
This paper presents a literature review of articles published in the last ten years on the use of decision tree classifiers in gene microarray data analysis. The main focus is on studies that address the cancer classification problem using single decision tree classifiers (algorithms C4.5 and CART) and decision tree forests (e.g. random forests), showing the strengths and weaknesses of the proposed methodologies compared to other popular classification methods. The article also touches on the use of decision tree classifiers in gene selection.
Ivars Namatēvs, Ludmila Aleksejeva and Inese Poļaka
Extracting meaningful information with artificial neural networks, with a focus on developing new insights into sports performance and supporting decision making, is crucial for success. The aim of this article is to create a theoretical framework and structurally connect the sports and multi-layer artificial neural network domains by: (a) describing sports as a complex socio-technical system; (b) identifying a pre-processing subsystem for classification; (c) selecting features using the data-driven valued tolerance ratio method; (d) designing a predictive system model of sports performance using a backpropagation neural network. This would allow identifying, classifying, and forecasting performance levels for an enlarged data set.
This article presents an approach to bioinformatics data analysis and exploration that improves classification accuracy by learning the inner structure of the data. The diseases studied in bioinformatics (diagnostic, prognostic and other studies) often have known or as yet undiscovered subtypes that can be exploited when solving bioinformatics tasks, providing additional information and knowledge. This study examines inner class structures (probable disease subtypes) using cluster analysis to find subclasses and applies them in classification tasks. The study also analyses possible cluster merges that would best describe the classes. Evaluation is carried out using four classification methods that can be successfully used in bioinformatics: Naïve Bayes classifiers, C4.5, Random Forests and Support Vector Machines.
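The general idea of class decomposition described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the clustering step is a small k-means with deterministic farthest-first initialisation, the classifier is a nearest-centroid stand-in for the methods named above, and all function names (`kmeans`, `decompose_classes`, `fit_centroids`, `predict`) are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100):
    """Small k-means with farthest-first initialisation (an assumption)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            updated.append(tuple(sum(col) / len(members) for col in zip(*members))
                           if members else centers[c])
        if updated == centers:
            break
        centers = updated
    return labels

def decompose_classes(X, y, k=2):
    """Relabel every object as (class, cluster) by clustering each class separately."""
    sub_labels = [None] * len(X)
    for label in sorted(set(y)):
        idx = [i for i, c in enumerate(y) if c == label]
        for i, cluster in zip(idx, kmeans([X[i] for i in idx], k)):
            sub_labels[i] = (label, cluster)
    return sub_labels

def fit_centroids(X, labels):
    """Nearest-centroid 'training': one mean vector per label."""
    groups = {}
    for x, lab in zip(X, labels):
        groups.setdefault(lab, []).append(x)
    return {lab: tuple(sum(col) / len(rows) for col in zip(*rows))
            for lab, rows in groups.items()}

def predict(centroids, x):
    """Predict the nearest subclass, then map it back to its original class."""
    subclass = min(centroids, key=lambda lab: sq_dist(x, centroids[lab]))
    return subclass[0]
```

On data where each class consists of two distant groups, a classifier trained on the original labels sees nearly identical class means, whereas one trained on the `(class, cluster)` labels can separate the groups and then report the original class.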
This article focuses on cluster stability evaluation to assess the characteristics of the dataset and of the subclasses found in class decomposition. The evaluation is an iterative process that makes a small change to the dataset at every step and reapplies the cluster analysis. These small changes (in this case, removing one object from the dataset, repeated for 20 iterations) should not have any impact on the clusters if they are stable, meaning that the objects that were not removed stay in the same clusters as in the full clustering.
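The leave-one-out procedure described in this abstract can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the clustering is a small k-means with farthest-first initialisation, agreement between the full and reduced clusterings is measured with the Rand index (a choice made here because cluster numbering may differ between runs), and the names `kmeans`, `rand_index` and `leave_one_out_stability` are hypothetical.

```python
import random
from itertools import combinations

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100):
    """Small k-means with farthest-first initialisation (an assumption)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            updated.append(tuple(sum(col) / len(members) for col in zip(*members))
                           if members else centers[c])
        if updated == centers:
            break
        centers = updated
    return labels

def rand_index(a, b):
    """Fraction of object pairs on which two labelings agree (same vs. different cluster)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

def leave_one_out_stability(points, k, iterations=20, seed=0):
    """Mean agreement between the full clustering and clusterings with one object removed."""
    rng = random.Random(seed)
    full = kmeans(points, k)
    scores = []
    for _ in range(iterations):
        idx = rng.randrange(len(points))              # object to remove this iteration
        reduced = points[:idx] + points[idx + 1:]
        reduced_labels = kmeans(reduced, k)
        full_without = full[:idx] + full[idx + 1:]    # full clustering, removed object dropped
        scores.append(rand_index(full_without, reduced_labels))
    return sum(scores) / len(scores)
```

A score of 1.0 means that the remaining objects were grouped exactly as in the full clustering in every iteration; lower values indicate that removing a single object changes the grouping of the others.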
Arnis Kirshners, Inese Polaka and Ludmila Aleksejeva
Data mining methods are applied to a medical task that seeks information about the influence of Helicobacter pylori on increased gastric cancer risk by analysing adverse factors of individual lifestyle. During data preprocessing, the data are cleared of noise and non-informative attributes, reduced in dimensionality, and transformed for the task. Data classification using the C4.5, CN2 and k-nearest neighbour algorithms is carried out to find relationships between the analysed attributes and the descriptive class attribute, Helicobacter pylori presence, which could influence the risk of cancer development. Experimental analysis is carried out using data from the database of the Latvia-based project “Interdisciplinary Research Group for Early Cancer Detection and Cancer Prevention”.
Henrihs Gorskis, Ludmila Aleksejeva and Inese Poļaka
There are multiple approaches to mapping from a domain ontology to a database in ontology-based data access. External mapping documents are most commonly used for that purpose. These documents describe how the data needed to describe ontology individuals and other values are to be obtained from the database. The present paper investigates the use of special database concepts. These concepts are not separated from the domain ontology; they are mixed with domain concepts to form a combined application ontology. By creating natural relationships between database concepts and domain concepts, mapping can be implemented more easily and with a specific purpose. The paper also investigates how the use of such database concepts in addition to domain concepts affects ontology building and data retrieval.
Natalia Novoselova, Igor Tom, Arkady Borisov and Inese Polaka
This article considers a gene ranking algorithm for microarray data. The rank vector is estimated by classifying random data samples. At each iteration, the ranks of the genes that participate in a successful classification are increased. Unlike other feature selection methods, the proposed algorithm improves the generality of the classification models by constructing balanced training samples, and it takes the descriptiveness of gene combinations into account through subset estimation.
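The abstract gives only the outline of the ranking scheme, so the following is a loose sketch of that outline and not the published algorithm. The nearest-centroid classifier, the accuracy threshold that defines a "successful" classification, the subset size, and the function names `nearest_centroid_predict` and `rank_genes` are all assumptions made for illustration.

```python
import random

def nearest_centroid_predict(train_X, train_y, x):
    """Predict the class whose mean training vector is closest to x."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, lab in zip(train_X, train_y) if lab == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def rank_genes(X, y, subset_size=2, per_class=3, iterations=200,
               threshold=0.9, seed=0):
    """Raise the rank of every gene in a random subset that classifies well."""
    rng = random.Random(seed)
    n_genes = len(X[0])
    ranks = [0] * n_genes
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    for _ in range(iterations):
        genes = rng.sample(range(n_genes), subset_size)   # random gene subset
        # Balanced training sample: the same number of objects from each class.
        train_idx = [i for idxs in by_class.values()
                     for i in rng.sample(idxs, per_class)]
        held_out = [i for i in range(len(y)) if i not in set(train_idx)]

        def project(i):
            return [X[i][g] for g in genes]

        train_X = [project(i) for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct = sum(nearest_centroid_predict(train_X, train_y, project(i)) == y[i]
                      for i in held_out)
        if correct / len(held_out) >= threshold:          # "successful" classification
            for g in genes:
                ranks[g] += 1
    return ranks
```

Because whole subsets are rewarded, a gene is credited for the combinations in which it works well, not only for its individual discriminative power.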
Madara Gasparovica-Asite, Inese Polaka and Ludmila Alekseyeva
The present research examines a wide range of attribute selection methods: 86 methods that include both ranking and subset evaluation approaches. The efficacy of these methods is evaluated using bioinformatics data sets provided by the Latvian Biomedical Research and Study Centre. The data sets are intended for diagnostic tasks and contain values of more than 1000 proteomics features as well as a diagnosis (specific cancer or healthy) determined by a gold standard method (biopsy and histological analysis). The diagnostic task is solved using the classification algorithms FURIA, RIPPER, C4.5, CART, KNN, SVM, FB+ and GARF on the initial data sets and on various sets with reduced dimensionality. The paper concludes by identifying the most effective attribute subset selection methods for classification tasks in diagnostic proteomics data.