In recent years, many studies have required data mining techniques that can find outliers and thereby protect system reliability. Outliers are observations that differ significantly from most of the samples in a dataset; they can be perceived as being generated by a different process. Generally, the number of outliers is much smaller than that of the normally behaving data (typically lower than 10%). Data instances containing extreme feature values outside the accepted range may negatively affect data analyses such as regression, or may provide useful information about the data, as in the detection of terrorist attacks. Outlier analysis is especially well suited to fraud detection.
If any data point is extremely different from the rest of the data in a process, it is marked as an outlier (Qi & Chen, 2018). Outliers may emerge from various causes such as malicious attacks, environmental factors, human errors, abnormal conditions, measurement errors, and hardware malfunction. Outlier detection is related to, but slightly different from, novelty detection, which is concerned with the identification of new or unknown data not covered by the training data. Outlier detection fits the model to normally behaving data, so abnormally behaving data can be detected because it does not fit the conditioned model. Local Outlier Factor (LOF) and Isolation Forest (IF) are among the soft computing methods dealing with this problem (Domingues et al., 2018).
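A minimal sketch of the two methods named above, using scikit-learn; the toy data (a Gaussian cluster plus two injected extreme points) and the parameter choices are illustrative assumptions, not settings from this study:

```python
# Flagging two injected extreme points with LOF and Isolation Forest.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),   # normally behaving data
               [[8.0, 8.0], [-9.0, 7.0]]])        # two extreme points

lof_labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X)  # -1 = outlier
if_labels = IsolationForest(random_state=42).fit(X).predict(X)  # -1 = outlier
```

Both detectors assign the label -1 to the two injected points, which sit far from the dense cluster.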
However, it is not always straightforward to perform anomaly detection successfully with many methods (Cao et al., 2017). Sampling approaches are generally not used for this task since they are affected by over-sampling minority values or under-sampling majority values. Detecting an outlier in any context is a challenging task due to the lack of a known underlying model and the strong dependence of existing methods on input parameters. To overcome these problems, in this study we propose a new ensemble-based outlier detection approach, named Bagged and Voted Local Outlier Detection (BV-LOF). Our approach uses the LOF algorithm as a base detector on various subsets of the data and aggregates the obtained results with a combining mechanism.
The main contributions of this paper are three-fold: (i) It proposes a novel approach (BV-LOF), which integrates two different styles of ensemble learning (bagging and voting) using the Local Outlier Factor (LOF) algorithm in order to detect outliers, aiming to improve the general performance of LOF in detecting anomalies. (ii) To the best of our knowledge, this is the first study that runs the LOF algorithm with different input parameter (neighborhood size) values and combines their outputs by majority voting. (iii) It demonstrates the effectiveness of the proposed approach on ten benchmark outlier datasets.
In the experimental studies, ten benchmark outlier datasets were used and the results were compared with the ones obtained by using the LOF method. Area Under the Receiver Operating Characteristics Curve (AUC) measure was used to evaluate the outlier detection performance of the proposed method for all datasets. According to the experimental results, the proposed two-level approach (BV-LOF) demonstrated better performance compared to LOF.
The rest of this paper is organized as follows. Section 2 describes outlier detection studies that have been conducted by researchers so far. Section 3 explains the proposed method (BV-LOF) in detail with its benefits and time complexity. Section 4 presents the datasets used in the experiments and discusses the experimental results obtained by the comparison of two methods: LOF and BV-LOF. Final comments on the study and possible future works are presented in Section 5.
In the majority of cases, the number of instances that belong to the outlier class is less than 10% of the total number of instances in the whole training set (Wu et al., 2018). Their identification is one of the most challenging problems in data analysis; hence, in general, many methods achieve only low probabilities of detecting an anomaly in a dataset. The studies presented in the literature have shown that many traditional classification, clustering, and regression algorithms are not capable of dealing with outliers. The reason is that learning algorithms are likely to misinterpret data containing outliers. For this reason, the approach proposed in this paper can be applied as a separate data preprocessing step before classification, regression, and clustering tasks.
In the literature, many different outlier detection algorithms have been proposed to observe the presence of outliers effectively. A group of algorithms uses the reconstruction error to find contaminations (Balamurali & Melkumvan, 2018), in which the data that have greater reconstruction errors are labeled as outliers. These algorithms assume that strong statistical correlations exist among the attributes of normally behaving data. These reconstruction-based techniques can capture the correlations by reducing the reconstruction errors of data having the expected behavior. For this reason, the outliers have comparatively larger reconstruction errors than normal data.
Different assumptions have been made by the existing outlier detection approaches. These assumptions vary and make all the approaches differ from each other. Density-based and distance-based techniques are generally the source of most of the complex algorithms (Lopes et al., 2018).
As shown in Figure 1, the learning scenarios proposed by current outlier detection methods can be grouped into three categories: supervised, semi-supervised, and unsupervised. In supervised outlier detection, class-labeled data tagged as inlier and outlier is provided for training, so distinct classes of data are available, as in binary or multi-class classification (Huang et al., 2012). Semi-supervised outlier detection can utilize labeled instances for only the inlier class, or it can use both labeled data (tagged as inlier and outlier) and unlabeled data (which may or may not contain outliers). In unsupervised outlier detection, the algorithm tries to learn from unlabeled data only (Pasillas-Diaz & Ratte, 2017), so it has no information about class labels. In this study, we propose a novel outlier detection approach that focuses on unsupervised learning, so only data without labels or classes exists.
Unsupervised outlier detection approaches have also sub-categories: clustering-based, density-based, and relative density-based methods.
First, clustering-based techniques follow the strategy that each instance either belongs to a cluster or does not (in which case it is called an outlier). In these algorithms, the data is split into clusters, and membership in a cluster determines whether an instance is an anomaly. A binary decision is made by the model; therefore, it does not provide an extensive comprehension of the identified outliers.
Secondly, density-based techniques identify instances found in areas with low density and name them as anomalies. An anomaly score is given to each instance based on the nearest neighbor (NN) distance strategy. Different density-based outlier detection methods have been proposed by researchers so far. Local Outlier Factor (LOF) is one of the most popular density-based methods that use the NN methodology. For this reason, in this study we preferred the LOF method because of its advantages: it is unsupervised, easy to understand and implement, able to deal with high-dimensional data, and requires no assumption about the underlying data.
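To illustrate the NN-based anomaly scoring just described, the following sketch assigns each instance a LOF score with scikit-learn; the 1-D toy data and the `n_neighbors` value are our own illustrative assumptions:

```python
# Scoring instances by local density: the isolated point gets the top score.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.array([[0.0], [0.1], [0.2], [0.3], [10.0]])  # last point lies in a low-density area
lof = LocalOutlierFactor(n_neighbors=2)
lof.fit_predict(X)
scores = -lof.negative_outlier_factor_  # higher score means more outlying
```

The isolated point at 10.0 receives by far the largest score, while the clustered points score close to 1.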
Thirdly, in relative density-based techniques, anomalies are defined as observations that have a lower relative density than their neighbors (Bandaragoda et al., 2017). This technique was proposed because a global density function limits the identification of local outliers in dense areas that have low relative density with respect to their neighbors. The ratio of the density of an instance to that of its neighborhood is taken as the measure of relative density. In this way, instances with low relative density are labeled as outliers.
Several variants of the LOF algorithm have been implemented in order to overcome disadvantages that researchers observed in the original LOF. For example, Qin et al. (2019) proposed Weighted LOF (WLOF) to deal with the impact of data duplication. Gan and Zhou (2018) proposed the D-LOF approach, which uses the DBSCAN algorithm to optimize the LOF algorithm by dynamically adjusting its input parameter.
Ensemble learning is a well-established mechanism in outlier detection tasks. Ensemble-based models utilize an outlier detection algorithm, or a series of different algorithms, more than once on various settings of the same dataset and combine the outputs to get the final anomaly score (Chakraborty, Narayanan, & Ghosh, 2019; Zhang et al., 2019). In the literature, several studies have used an ensemble strategy for this purpose (Chen et al., 2018; Hu et al., 2019; Li, Fang, & Yan, 2019; Wang & Mao, 2019). Previous studies have proven that ensemble learning yields advantages in many different applications such as process monitoring (Wang & Mao, 2019), video anomaly detection (Hu et al., 2019), maintenance prediction (Li, Fang, & Yan, 2019), and image-based outlier detection (Chen et al., 2018). This is because ensembles try to minimize the errors of judgment made by different models or by differently selected subsets of data (Kaneko, 2018).
Lazarevic and Kumar (2005) proposed a feature bagging mechanism to create different datasets with randomly selected attributes before applying LOF on those datasets. However, their approach does not vary the neighborhood parameter of LOF across ensemble members.
Differently from the previous studies, our approach consists of two consecutive phases: bagging and voting. In the first phase, differently chosen subsets of attributes are used to benefit from various correlations of input features. In the second phase, the effects of diverse neighborhood sizes are combined by majority voting.
Let D be a dataset of n data points and d(p, o) denote the distance between two points p and o.

The Local Outlier Factor (LOF) method aims at finding extreme data points with the help of the local deviation of a point with respect to its k nearest neighbors. Following the original formulation of Breunig et al. (2000), it is built on the definitions below.

Given a dataset D and a positive integer k, the k-distance of a point p, denoted k-distance(p), is the distance d(p, o) to a point o in D such that for at least k points o' in D \ {p} it holds that d(p, o') <= d(p, o), and for at most k-1 points o' it holds that d(p, o') < d(p, o).

Given a data point p, its k-distance neighborhood N_k(p) contains every point whose distance from p is not greater than k-distance(p), i.e., N_k(p) = {q in D \ {p} : d(p, q) <= k-distance(p)}.

For a positive number k, the reachability distance of p with respect to o is defined as reach-dist_k(p, o) = max{k-distance(o), d(p, o)}.

Given a dataset D, the local reachability density of p is lrd_k(p) = 1 / (Σ_{o in N_k(p)} reach-dist_k(p, o) / |N_k(p)|), and the Local Outlier Factor of p is LOF_k(p) = (Σ_{o in N_k(p)} lrd_k(o) / lrd_k(p)) / |N_k(p)|.
The LOF is a score assigned to each data point, which gives information about how likely the data point is to be an outlier.
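The definitions above can be turned into a direct O(n^2) reference implementation. This is an illustrative sketch in NumPy, not the paper's code, and it skips any indexing optimizations:

```python
# A direct sketch of k-distance, N_k, reachability distance, lrd, and LOF.
import numpy as np

def lof_scores(X, k):
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)               # a point is not its own neighbor
    kdist = np.sort(dist, axis=1)[:, k - 1]      # k-distance(p)
    nbrs = [np.where(dist[i] <= kdist[i])[0] for i in range(n)]  # N_k(p), ties included
    # local reachability density: inverse of the mean reachability distance
    lrd = np.empty(n)
    for i in range(n):
        reach = np.maximum(kdist[nbrs[i]], dist[i, nbrs[i]])  # reach-dist_k(p, o)
        lrd[i] = 1.0 / reach.mean()
    # LOF: average ratio of the neighbors' lrd to the point's own lrd
    return np.array([lrd[nbrs[i]].mean() / lrd[i] for i in range(n)])

X = np.array([[0.0], [0.5], [1.0], [1.5], [10.0]])
scores = lof_scores(X, k=2)  # the isolated point gets a score far above 1
```

Points inside a uniform cluster receive LOF scores near 1, while the isolated point receives a much larger score.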
In this paper, we propose a novel two-level outlier detection approach, named Bagged and Voted Local Outlier Detection (BV-LOF). As shown in Figure 2, the BV-LOF approach has two main phases, which are bagging and voting. Bagging, which is the short form of bootstrap aggregation, finds the outlier scores on a series of randomly generated feature subsets; voting then combines these scores into a final decision.
The BV-LOF approach provides a solution to improve the performance of the LOF algorithm. The intuition behind this approach is that the combination of diverse feature subsets and multiple variants of the same algorithm can supplement each other and collective decision provides an improvement on the overall detection rate. The integration of partial features provides many advantages over the full features that can be summarized as follows:
Improving accuracy: Integration of partial features increases the possibility of selecting the right subset and thus improving the accuracy of the model. Since it is initially unknown which subset of features is more adequate for outlier detection, creating different subsets of different sizes from the full feature set and combining their decisions is an effective strategy. Previous studies (Aggarwal, 2017; Lazarevic & Kumar, 2005) have also shown that the performance of outlier detection can be significantly improved by using multiple feature subsets.

Dealing with noisy data: Feature subset selection plays a significant role in improving the performance of outlier detection, especially in the case of noisy data. When the full feature set is analyzed, the data becomes sparse and the actual outliers become masked by the noise effects of high dimensionality (Aggarwal, 2017). On the other hand, some deviations in a small feature subset may be significant enough to indicate abnormal behavior.

Increasing robustness: In distance computation, the presence of irrelevant features can significantly degrade the performance of outlier detection (Lazarevic & Kumar, 2005). This is because all dimensions are weighted equally, even the irrelevant and unimportant ones. Since outliers are often embedded in locally relevant subspaces, a set of feature subspaces can be sampled in order to improve robustness (Leng & Huang, 2011).

Taking advantage of ensemble learning: The advantages of an ensemble-based combined score for a given data point are very significant in the context of outlier detection (Aggarwal, 2017). The outlier scores obtained from different feature subspaces may differ considerably; therefore, it is usually difficult to fully trust the score of a single subspace, and so the combination of scores is crucial.
In this sense, this approach is similar to that of taking advantage of the power of many weak learners to create a single strong learner in ensemble-based classification problems.
Figure 3 illustrates the feature bagging operation performed in the first phase of the BV-LOF approach. In each ensemble iteration, a subset of the original features is randomly selected, producing a new dataset that contains all instances but only the chosen features.
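The subset-generation step can be sketched as follows. Drawing the subset size between d/2 and d-1 follows the feature-bagging convention of Lazarevic and Kumar (2005) and is an assumption here; the helper name is ours:

```python
# In each iteration, draw a random subset size and sample that many
# distinct features without replacement.
import numpy as np

def random_feature_subsets(n_features, n_subsets, seed=7):
    rng = np.random.RandomState(seed)
    subsets = []
    for _ in range(n_subsets):
        size = rng.randint(n_features // 2, n_features)  # between d/2 and d-1
        subsets.append(rng.choice(n_features, size=size, replace=False))
    return subsets

subsets = random_feature_subsets(n_features=10, n_subsets=3)
```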
In the first step of the BV-LOF approach, the LOF algorithm is applied to each data subset multiple times with different sizes of the neighborhood (k).
Figure 4 shows the majority voting operation of the BV-LOF approach. The LOF algorithm produces an output for each different size of the neighborhood (k); these outputs are then combined by majority voting to obtain a single decision per feature subset.
The pseudo-code of the BV-LOF approach is presented in Algorithm 1. For each ensemble iteration, the size of the feature subset is randomly determined, a feature subset of that size is generated, the LOF algorithm is applied on it with different k values, and the resulting output vectors are combined by majority voting; a second round of majority voting over all subsets produces the final output.
In this section, the proposed approach (BV-LOF) is illustrated with an example. Assume that three feature subsets are randomly generated from the original dataset.
Algorithm 1. Pseudo-code of the BV-LOF approach.

for each ensemble iteration do
    Randomly determine the subset size
    Generate a feature subset of that size from the dataset
    for each neighborhood size k in the given range do
        Apply LOF(subset, k)
        Obtain the output vector for this k
    end for
    Apply majority voting over the k values   // find highest total vote
    Obtain a single output vector for the subset
end for
Apply majority voting over all subsets   // find highest total vote
Obtain a single output vector as the final result
An example to illustrate bagging and voting steps of BV-LOF approach.
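The bagging and voting steps above can be sketched in a compact form. This is a minimal illustration of the two-level scheme: LOF votes are first aggregated over several k values within each random feature subset, then over the subsets themselves; the subset sizing, the k grid, and the binary-vote thresholds are our own assumptions rather than the exact settings of Algorithm 1:

```python
# Two-level sketch: vote over k values per subset, then over subsets.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def bv_lof(X, n_subsets=5, k_values=(5, 10, 20), random_state=0):
    rng = np.random.RandomState(random_state)
    n, d = X.shape
    subset_votes = []
    for _ in range(n_subsets):
        size = rng.randint(max(1, d // 2), d + 1)
        feats = rng.choice(d, size=size, replace=False)
        # first-level voting: majority over the different k values
        k_votes = np.zeros(n)
        for k in k_values:
            labels = LocalOutlierFactor(n_neighbors=k).fit_predict(X[:, feats])
            k_votes += (labels == -1)
        subset_votes.append(k_votes > len(k_values) / 2)
    # second-level voting: majority over the feature subsets
    totals = np.sum(subset_votes, axis=0)
    return (totals > n_subsets / 2).astype(int)  # 1 marks an outlier

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(size=(100, 2)), [[9.0, 9.0]]])
flags = bv_lof(X)
```

The injected far point is voted an outlier by most (subset, k) combinations, so it survives both voting rounds.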
The main advantages of the proposed approach (BV-LOF) over other outlier detection methods can be summarized as follows:
Detection of meaningful local outliers: It detects locally abnormally behaving data points with respect to the density of their neighboring data instances, instead of a global model.

Running without any single neighborhood parameter: It does not require the selection of one specific neighborhood size, since the outputs obtained with different neighborhood sizes are combined.

Applicability: Since it considers local densities of instances, it is appropriate for arbitrarily shaped data with groups of varying sizes, characteristics, and densities. Moreover, it can learn the underlying structure of the data without any prior knowledge, i.e., without any assumption about the distribution of data points. In addition, it is independent of the application domain, so it can be easily generalized to different problems such as network intrusion detection, fraud discovery in telecommunication, failure detection in manufacturing, email spam filtering, and medical data analysis. Furthermore, the BV-LOF algorithm is also suitable for high-dimensional data since it applies a feature selection scheme in its first stage.

Providing ensemble-based outlier detection: The outlier detection capability of the LOF method is improved using two-level ensemble approaches, namely bagging and voting.

Parallelization: Although the BV-LOF algorithm runs more slowly than LOF, since it operates on multiple data subsets and multiple neighborhood sizes, these independent runs can be executed in parallel.

Improved capability: The integration of various models generated by using different feature subsets and neighborhood sizes improves the overall detection rate.

Easy implementation: It can be easily implemented since it does not require any change in the general structure of the LOF algorithm.

Interpretability: The output of the algorithm, which is in the form of an observation score, can be easily interpreted by the user.
Besides the advantages of the BV-LOF approach, there are some drawbacks that can be summarized as follows:
Performance depending on the k value range: Since the majority voting treats all LOF runs with each k value equally, the overall performance depends on the chosen range of k values.

Computational cost: The BV-LOF method is slower than LOF because it requires multiple runs on different subsets.

Problem-specific outliers: The meaning of an outlier may change according to the context of the data, and the model does not provide any prior information about the significance of an outlier with respect to the usage criteria.
In this section, the proposed two-level model is analyzed with respect to time complexity. The time complexity of the BV-LOF approach mainly depends on the time complexity of the base algorithm (LOF) and the ensemble size (in other words, the number of iterations). More specifically, it depends on the following parameters: the number of instances, the number of features, the ensemble size, and the number of different neighborhood sizes tried.
The time complexity of LOF is near-quadratic with respect to the number of instances. However, different variations of LOF have been proposed in the literature to reduce its time complexity, such as IF-LOF (Cheng, Zou, & Dong, 2019), N2DLOF (Su et al., 2017), Top-n LOF (TOLF) (Yan, Cao, & Rundensteiner, 2017), BLOF (Bounded LOF) (Tang & Ngan, 2016), SimLOF (Sun et al., 2015), GridLOF (Wang & Huang, 2007), iLOF (Incremental LOF) (Pokrajac, Lazarevic, & Latecki, 2007), and MiLOF (memory efficient incremental local outlier) (Salehi et al., 2016). Therefore, instead of the traditional LOF algorithm, one of its variations can be used in the BV-LOF approach. The most computationally expensive step in BV-LOF is the k nearest neighbor search. To reduce the time complexity of BV-LOF, it is also possible to use various data structures such as R*-tree, KD-tree, Cover tree, and M-tree, which enable efficient nearest neighbor search.
Even though the running time of the BV-LOF method seems higher than the LOF algorithm due to the two-level ensemble operations such as divisions and integrations, this duration can be decreased by parallelization. If each subset is run simultaneously using different threads or processors, the negative effect of the proposed approach can be eliminated, and almost only LOF implementation and voting operations determine the length of the process.
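The parallelization idea can be sketched as follows: each LOF run (per subset or per k value) is independent, so the runs can be dispatched to separate workers. The sketch uses joblib, the parallel backend that ships with Scikit-Learn; the toy data is an illustrative assumption:

```python
# Dispatch independent LOF runs (one per k value) to separate workers.
import numpy as np
from joblib import Parallel, delayed
from sklearn.neighbors import LocalOutlierFactor

def lof_labels(X, k):
    return LocalOutlierFactor(n_neighbors=k).fit_predict(X)

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(size=(60, 3)), [[9.0, 9.0, 9.0]]])
results = Parallel(n_jobs=2)(delayed(lof_labels)(X, k) for k in (5, 10, 15))
```

The same pattern applies to the subset loop, so the wall-clock time can approach that of a single LOF run plus the voting overhead.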
In this study, the BV-LOF method was implemented using the Scikit-Learn open-source library with the Python programming language. We preferred Scikit-Learn since it provides many advantages such as ease of use, consistent and reliable APIs, a wide range of alternative algorithms, automatic hyperparameter tuning, integrated parallelization, and good documentation/tutorials. The LOF algorithm was selected as the base outlier detection method. In all experiments, the default parameters of the algorithms in the Scikit-Learn library were used, except the neighborhood size, which was varied as described below.
In order to validate the LOF and BV-LOF results, we used an external validation technique. External evaluation measures, which have been extensively studied for clustering and outlier detection tasks, use class labels known as ground truth. Since our study concentrates on datasets with class labels, it uses an external validity measure. Only the data, without class labels, is used as input to the algorithm; the measure then evaluates the goodness of the results with respect to the class labels tagged as inlier and outlier. The measure evaluates the extent to which the structure discovered by an outlier detection algorithm matches the externally given class labels. Since the external evaluation measure utilizes the true class labels in the comparison, it is a robust indicator of the true error rate of an outlier detection algorithm. In this way, the external measure performs well in predicting outlier detection errors.
When the data is imbalanced, the overall accuracy measure does not give sufficient information about the performance of the methods. Even though precision and recall scores can also be used, one of the most widely used validation measures in research studies is the AUC (Area under the Receiver Operating Characteristics Curve) score. Thus, in the experimental studies, we used the AUC validation measure to compare the performances of LOF and BV-LOF. The ROC curve is drawn by plotting the True Positive Rate (TPR) on the y-axis against the False Positive Rate (FPR) on the x-axis at various threshold settings.
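The AUC computation from outlier scores and ground-truth labels can be sketched as follows; the toy data (a Gaussian cluster plus four scattered extreme points) is an assumption for illustration:

```python
# Evaluate an outlier detector's ranking quality with roc_auc_score.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(1)
X = np.vstack([rng.normal(size=(80, 2)),
               [[6.0, 6.0], [-6.0, 5.0], [7.0, -6.0], [-5.0, -7.0]]])
y = np.array([0] * 80 + [1] * 4)        # ground truth: 1 marks an outlier

lof = LocalOutlierFactor(n_neighbors=10)
lof.fit_predict(X)
scores = -lof.negative_outlier_factor_  # higher score means more outlying
auc = roc_auc_score(y, scores)          # threshold-free ranking quality
```

Because the AUC is computed from the ranking of scores rather than from binary decisions, it is insensitive to the heavy class imbalance typical of outlier datasets.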
The proposed method BV-LOF has been tested with 10 outlier detection datasets ranging from 129 to 286,048 instances and from 6 to 41 attributes. The datasets are obtained from different domains. They are publicly available on well-known data repositories such as UCI Machine Learning Repository and Kaggle. They are frequently used for validating outlier (anomaly) detection algorithms in the literature (Reif, Goldstein, & Stahl, 2008; Yao et al., 2018; Zhou et al., 2013). Actually, these datasets are multi-class datasets having class labels. Here, the minority class (or classes) is considered as outliers compared to other larger classes. Hence, the instances in the smallest class(es) are marked as outliers, while all other instances are inliers. The overview of the datasets is given in Table 2. Generally, the density of the extreme samples is less than 10% of the entire dataset as is expected. The CoverType dataset is much larger than the rest with 286,048 records.
Basic characteristics of the datasets.
ID | Dataset | # Instance | # Feature | % Outliers |
---|---|---|---|---|
1 | CoverType | 286,048 | 10 | 0.9 |
2 | Glass | 214 | 9 | 4.2 |
3 | KDDCup | 60,632 | 41 | 0.4 |
4 | Lymphography | 148 | 18 | 4.1 |
5 | PageBlocks | 5,473 | 10 | 10.2 |
6 | PenDigits | 9,868 | 16 | 0.2 |
7 | Shuttle | 1,013 | 9 | 1.2 |
8 | Stamps | 340 | 9 | 9.1 |
9 | Thyroid | 3,772 | 6 | 2.5 |
10 | Wine | 129 | 13 | 7.7 |
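The dataset preparation described above (relabeling the smallest class or classes as outliers) can be sketched with a small helper; the helper name and the 10% cap, which mirrors the expected outlier density mentioned in the text, are our own assumptions:

```python
# Relabel the smallest class(es) as outliers (1) and the rest as inliers (0).
import numpy as np

def minority_as_outliers(y, max_outlier_ratio=0.1):
    classes, counts = np.unique(y, return_counts=True)
    order = np.argsort(counts)              # smallest classes first
    outlier_classes, total = [], 0
    for c, cnt in zip(classes[order], counts[order]):
        if (total + cnt) / len(y) <= max_outlier_ratio:
            outlier_classes.append(c)
            total += cnt
        else:
            break
    return np.isin(y, outlier_classes).astype(int)

y = np.array([0] * 90 + [1] * 8 + [2] * 2)   # three classes, two small ones
labels = minority_as_outliers(y)             # classes 1 and 2 become outliers
```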
In the experimental studies, the LOF algorithm was applied 100 times with different neighborhood sizes (k values from 1 to 100), and the same k range was used within the BV-LOF approach.
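The k sweep can be sketched as follows: LOF is run once per k and the AUC recorded for each run. A shorter range and toy data are used here to keep the example small; both are illustrative assumptions:

```python
# Sweep the neighborhood size and record the AUC per k.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(3)
X = np.vstack([rng.normal(size=(80, 2)), [[7.0, 7.0], [-7.0, 6.0], [6.0, -7.0]]])
y = np.array([0] * 80 + [1] * 3)

aucs = {}
for k in range(2, 31):
    lof = LocalOutlierFactor(n_neighbors=k)
    lof.fit_predict(X)
    aucs[k] = roc_auc_score(y, -lof.negative_outlier_factor_)
best_k = max(aucs, key=aucs.get)  # the k with the maximum AUC
```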
Figure 5 presents the comparison of the performance of the LOF and BV-LOF methods in terms of the maximum AUC values obtained in all trials. According to the results, the BV-LOF approach wins against LOF on 7 of the 10 datasets. Thus, it is clearly seen that BV-LOF generally outperforms LOF on these datasets. For instance, BV-LOF produced a notable increase (~4%) in AUC for the Shuttle dataset. BV-LOF achieved its best performance with 83.92% AUC for the PageBlocks dataset. However, it was not able to obtain better results for the Glass, Lymphography, and Wine datasets. In those datasets, different parameter settings would be required for BV-LOF to match the best single-k results of LOF.
Figure 6 shows the comparative results of LOF and BV-LOF in terms of average AUC scores. Apparently, the BV-LOF approach performs better on all datasets except the Wine data. Thus, compared to LOF, the win ratio for BV-LOF is 9/10 (90%); hence, on average, BV-LOF detects outliers significantly more accurately than conventional LOF. For instance, the outlier detection rates of LOF and BV-LOF for the Thyroid dataset are 82.68% and 90.93%, respectively. On the PageBlocks dataset, the difference in outlier detection accuracy between the BV-LOF and LOF algorithms remains high: 87.96% versus 81.71%. Improvements also exist for the other datasets, except the Wine data. These improvements indicate that a two-level ensemble approach can be more useful for the outlier detection task.
When the average results of all datasets given in Table 3 are examined, it is clearly seen that BV-LOF outperforms LOF significantly. BV-LOF (83.97%) is remarkably more accurate than LOF (80.47%) on average. Similarly, in terms of maximum performances, BV-LOF (90%) shows a small improvement over LOF (88.42%).
The average of experimental results of all datasets.
LOF (%) | BV-LOF (%) | |
---|---|---|
Maximum Outlier Detection Performance | 88.42 | 90.00 |
Average Outlier Detection Performance | 80.47 | 83.97 |
In order to reveal the influence of the parameters on the algorithms, the BV-LOF and LOF methods are compared in an experiment where both the neighborhood parameter of LOF and the corresponding parameter range of BV-LOF are varied.
It is also possible to say from the results shown in Figure 7 that the point at which the AUC measure reaches its maximum varies according to the dataset used. Namely, both the LOF and BV-LOF methods are clearly sensitive to the values of their input parameters. This indicates that parameter selection has a strong effect on detection performance.
In the tests with different k values, the relative performance of the two methods varied from dataset to dataset.
Another interesting fact is that, for each dataset, the k value at which each method performs best differs considerably.
Table 4 shows the k values at which each method reaches its maximum AUC for each dataset.
The points at which the AUC value reaches its maximum for each dataset.
Datasets | LOF Best k | BV-LOF Best k |
---|---|---|
Cover | 60 | 8 |
Glass | 12 | 1 |
KDDCup | 56 | 16 |
Lymphography | 54 | 85 |
PageBlocks | 19 | 4 |
PenDigits | 88 | 3 |
Shuttle | 7 | 4 |
Stamps | 3 | 2 |
Thyroid | 63 | 4 |
Wine | 13 | 99 |
Average | 37.5 | 22.6 |
In the BV-LOF approach, the outliers are searched in lower-dimensional subspaces thanks to feature subset selection, and so this solution attempts to reduce variance by encouraging diversity. The integration of partial features is often better than the full features since the combination function at the end recognizes the differential behavior of different subspace samples for a given data point. Multiple feature subspace selection plays an important role in improving the performance of an outlier detector and can help to deal with noisy data, to increase robustness, and to take advantage of ensemble learning. Data processing in a lower-dimensional space naturally influences the bias-variance trade-off of the base algorithm (LOF), and so the ensemble framework benefits from the existing variability in each detector. Being based on a two-level ensemble approach that promotes diversity, BV-LOF offers an improvement in outlier detection rate, at the cost of a higher running time than single-level ensemble approaches.
In this study, a novel two-level ensemble outlier detection mechanism is proposed. The method consists of two consecutive stages and, at each stage, a different ensemble approach is applied: first bagging and then voting. Therefore, it is called Bagged and Voted Local Outlier Detection (BV-LOF). In the first stage, the dataset is split into subsets by selecting different features, without changing the set of instances. As feature bagging has been proven by previous studies to produce more robust results, this methodology is used in the first stage. However, it is noticed that the choice of the input neighborhood parameter strongly affects the performance of LOF; therefore, in the second stage, the outputs obtained with different parameter values are combined by majority voting.
As future work, different kinds of base outlier detectors from different categories (clustering-based, distribution-based, etc.) can be applied in the proposed method. In addition, the majority voting could be converted into a weighted form by checking the number of outliers returned from each run.