An intrusion detection system (IDS) is an important means of protecting a system against network attacks. An IDS monitors the activity within a network of connected computers and analyzes that activity for intrusive patterns. In the event of an ‘attack’, the system has to respond appropriately. Different machine learning techniques have been applied to this task in the past, and they fall into either the clustering or the classification category. This paper takes the classification approach: a neural network ensemble is employed to classify the different types of attacks. The ensemble consists of an autoencoder, a deep belief neural network, a deep neural network, and an extreme learning machine. The data used for the investigation is the NSL-KDD data set. In particular, the detection rate and false alarm rate of the implemented neural network ensemble are evaluated, along with other measures (confusion matrix, classification accuracy, and AUC).
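The detection rate and false alarm rate mentioned above can be illustrated with a minimal sketch. It assumes the usual binary reading of the confusion matrix (attack versus normal traffic), which the abstract does not spell out; the function name and the label encoding are hypothetical.

```python
def detection_and_false_alarm_rate(y_true, y_pred, attack_label=1):
    """Return (detection_rate, false_alarm_rate) for binary labels.

    Assumed definitions (not taken from the paper):
      detection rate   = TP / (TP + FN)
      false alarm rate = FP / (FP + TN)
    where "positive" means the record is labelled as an attack.
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == attack_label and pred == attack_label:
            tp += 1
        elif truth == attack_label:
            fn += 1
        elif pred == attack_label:
            fp += 1
        else:
            tn += 1
    # Guard against empty classes to avoid division by zero.
    detection_rate = tp / (tp + fn) if (tp + fn) else 0.0
    false_alarm_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return detection_rate, false_alarm_rate


# Toy data: 1 = attack, 0 = normal traffic.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 0, 1]
dr, far = detection_and_false_alarm_rate(y_true, y_pred)
```

For a multi-class setting such as the attack categories of NSL-KDD, the same counts would typically be computed per class or after collapsing all attack types into one positive class; which of the two the paper uses is not stated here.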
Krzysztof Siwek, Stanisław Osowski and Ryszard Szupiluk
Ensemble Neural Network Approach for Accurate Load Forecasting in a Power System
The paper presents an improved method for 1-24 hour load forecasting in the power system that combines the results of different neural forecasters in an ensemble system. We integrate the results of partial predictions made by three solutions, one of which relies on a multilayer perceptron and the other two on self-organizing networks of the competitive type. As the integrating expert system we apply different methods: simple averaging, SVD-based weighted averaging, principal component analysis and blind source separation. The results of numerical experiments on forecasting the hourly load of the Polish power system for the next 24 hours are presented and discussed. We compare the performance of the different ensemble methods on the basis of the mean absolute percentage error, mean squared error and maximum percentage error. The results show a significant improvement of the proposed ensemble method over the individual predictions. A comparison of our work with the results of other papers for the same data shows the superiority of our approach.
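As a rough illustration of the simplest integration scheme named above, the sketch below averages the hourly forecasts of several predictors and scores the result with the three error measures. It covers simple averaging only, not the SVD, PCA or blind source separation variants; the function names and the toy numbers are illustrative assumptions, not data from the paper.

```python
def average_forecasts(forecasts):
    """Simple averaging: mean of the individual forecasts, hour by hour."""
    n_models = len(forecasts)
    return [sum(hour) / n_models for hour in zip(*forecasts)]


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(p - a) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean squared error."""
    return sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def max_percentage_error(actual, predicted):
    """Largest absolute percentage error over all hours, in percent."""
    return max(100.0 * abs(p - a) / abs(a) for a, p in zip(actual, predicted))


# Toy example: two hours of load, three individual predictors (made-up values).
actual = [100.0, 200.0]
forecasts = [
    [110.0, 190.0],  # e.g. an MLP-style predictor
    [90.0, 210.0],   # e.g. a first self-organizing predictor
    [100.0, 206.0],  # e.g. a second self-organizing predictor
]
ensemble = average_forecasts(forecasts)
ensemble_mape = mape(actual, ensemble)
individual_mapes = [mape(actual, f) for f in forecasts]
```

In this toy case the individual errors partly cancel, so the averaged forecast has a lower MAPE than any single predictor; that cancellation is the intuition behind averaging, not a guarantee for real load data.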
When running data-mining algorithms on big data platforms, a parallel, distributed framework, such as MapReduce, may be used. However, in a parallel framework, each individual model fits the data allocated to its own computing node without necessarily fitting the entire dataset. In order to induce a single consistent model, ensemble algorithms, such as majority voting, aggregate the local models rather than analyzing the entire dataset directly. Our goal is to develop an efficient algorithm for choosing one representative model from multiple, locally induced decision-tree models. The proposed SySM (syntactic similarity method) algorithm computes the similarity between the models produced by parallel nodes and chooses the model that is most similar to the others as the best representative of the entire dataset. In 18.75% of 48 experiments on four big datasets, SySM accuracy is significantly higher than that of the ensemble; in about 43.75% of the experiments, SySM accuracy is significantly lower; in one case, the results are identical; and in the remaining 35.41% of cases, the difference is not statistically significant. Compared with ensemble methods, the representative tree models selected by the proposed methodology are more compact and interpretable, their induction consumes less memory, and, as confirmed by the empirical results, they allow faster classification of new records.
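The idea of picking the model that is most similar to the others can be sketched as follows. The abstract does not define SySM's syntactic similarity measure, so this sketch substitutes a plain Jaccard similarity over the sets of split conditions in each tree; that substitution, the string encoding of the splits, and all names below are assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (stand-in for a syntactic tree similarity)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pick_representative(models):
    """Return the index of the model with the highest total similarity to the others.

    `models` is a list of sets; each set holds the split conditions of one
    locally induced decision tree, encoded here as strings (hypothetical format).
    """
    best_index, best_score = 0, float("-inf")
    for i, model in enumerate(models):
        score = sum(jaccard(model, other) for j, other in enumerate(models) if j != i)
        if score > best_score:
            best_index, best_score = i, score
    return best_index


# Three trees induced on different data partitions (made-up splits).
local_models = [
    {"age<30", "income<50k"},
    {"age<30", "income<50k", "tenure<2"},
    {"age<30", "tenure<2"},
]
representative = pick_representative(local_models)
```

Here the second tree shares the most conditions with the other two and is therefore selected. Unlike majority voting, only that single tree is kept for classification, which matches the compactness argument made above.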
Mathematical models that explain match outcome, based on the value of technical performance indicators (PIs), can be used to identify the most important aspects of technical performance in team field-sports. The purpose of this study was to evaluate several methodological opportunities to enhance the accuracy of this type of modelling. Specifically, we evaluated the potential benefits of 1) modelling match outcome using an increased number of seasons and PIs compared with previous reports, 2) identifying eras in which technical performance characteristics were stable and 3) applying a novel feature selection method. Ninety-one PIs across sixteen Australian Football (AF) League seasons were analysed. Change-point and Segmented Regression analyses were used to identify eras, and they produced similar but non-identical outcomes. A feature selection ensemble method identified the most valuable 45 PIs for modelling. The use of a larger number of seasons for model development led to an improvement in the classification accuracy of the models compared with previous studies (88.8 vs 78.9%). This study demonstrates the potential benefits of large databases when creating models of match outcome and the pitfalls of determining whether there are eras in a longitudinal database.
Hamza Harkous, Rameez Rahman, Bojan Karlas and Karl Aberer
Third-party apps that work on top of personal cloud services, such as Google Drive and Dropbox, require access to the user’s data in order to provide some functionality. Through detailed analysis of a hundred popular Google Drive apps from Google’s Chrome store, we discover that the existing permission model is quite often misused: around two-thirds of analyzed apps are over-privileged, i.e., they access more data than is needed for them to function. In this work, we analyze three different permission models that aim to discourage users from installing over-privileged apps. In experiments with 210 real users, we discover that the most successful permission model is our novel ensemble method that we call Far-reaching Insights. Far-reaching Insights inform the users about the data-driven insights that apps can make about them (e.g., their topics of interest, collaboration and activity patterns). Thus, they seek to bridge the gap between what third parties can actually know about users and users’ perception of their privacy leakage. The efficacy of Far-reaching Insights in bridging this gap is demonstrated by our results, as Far-reaching Insights prove to be, on average, twice as effective as the current model in discouraging users from installing over-privileged apps. In an effort to promote general privacy awareness, we deployed PrivySeal, a publicly available privacy-focused app store that uses Far-reaching Insights. Based on the knowledge extracted from data of the store’s users (over 115 gigabytes of Google Drive data from 1440 users with 662 installed apps), we also delineate the ecosystem for third-party cloud apps from the standpoint of developers and cloud providers. Finally, we present several general recommendations that can guide future work in the area of privacy for the cloud. To the best of our knowledge, ours is the first work that tackles the privacy risk posed by third-party apps on cloud platforms in such depth.