Open Access

A Two-Level Approach based on Integration of Bagging and Voting for Outlier Detection


Cite

Figure 1

Learning scenarios for outlier detection models.
Learning scenarios for outlier detection models.

Figure 2

The general structure of the proposed BV-LOF approach.
The general structure of the proposed BV-LOF approach.

Figure 3

Feature subset selection operation in the first stage of the BV-LOF approach.
Feature subset selection operation in the first stage of the BV-LOF approach.

Figure 4

LOF application and majority voting operation in the second stage of the BV-LOF approach
LOF application and majority voting operation in the second stage of the BV-LOF approach

Figure 5

Comparison between LOF and BV-LOF methods in terms of maximum AUC values.
Comparison between LOF and BV-LOF methods in terms of maximum AUC values.

Figure 6

Comparison between LOF and BV-LOF methods in terms of average AUC values.
Comparison between LOF and BV-LOF methods in terms of average AUC values.

Figure 7

AUC values obtained from LOF with different k and from BV-LOF with different ensemble sizes (T).
AUC values obtained from LOF with different k and from BV-LOF with different ensemble sizes (T).

ALGORITHM 1 - Bagged and Voted Local Outlier Detection (BV LOF)
Inputs:T (# of iterations (in other words ensembles size))
  D = {X1, X2, ..., Xn} (the entire dataset), where n is the number of instances
  F = {F1, F2, ..., Fd} (feature set), where d is the dimension of the dataset
OUTPUT:O = {o1, o2, ..., op} (a set of objects that are assigned as outliers)
fori = 1 toTdo
  Randomly determine subset size R in [d/, d-1]
  forj = 1 toRdo
    ft = Randomly select a feature w/o replacement from F
      Si = Sift
  end for
  Generate Di that includes the features in the subset Si
  foreach neighbor size kin [1, 100] do
    Apply LOF(k) on Di
    Obtain output vectors O(Di, k)
  end for
  foreach object oinO(Di, k) do
      // find highest total vote
       hi(o)=argmaxyYk=1100vwhereY={1,1}andv{(hk(o)=1)=1(outlier)(hk(o)=1)=0(inlier)\matrix{{{h_i}(o)} {= {{\rm argmax}_{y \in Y}}\sum\nolimits_{k = 1}^{100} v} \hfill \cr \hfill {\kern 30pt} {{\rm where}\,Y = \{1, - 1\} \,{\rm and}\,v\left\{{\matrix{{({h_k}(o) = - 1) = 1} \hfill & {({\rm outlier})} \hfill \cr {({h_k}(o) = 1) = 0} \hfill &\,\,\,\, {({\rm inlier})} \hfill \cr}} \right.} \hfill}
      Obtain single output vector O(Di) for dataset Di
  end for
  O(D) = O(D) ∪ O(Di)
end for
foreach object oinO(D) do
      // find highest total vote
       h(o)=argmaxyYt=1TvwhereY={1,1}andv{(ht(o)=1)=1(outlier)(ht(o)=1)=0(inlier)\matrix{{h(o)} {= {{\rm argmax}_{y \in Y}}\sum\nolimits_{t = 1}^T v} \hfill \cr {\kern 30pt}{{\rm where}\,Y = \{1, - 1\} \,{\rm and}\,v\left\{{\matrix{{({h_t}(o) = - 1) = 1} \hfill & {({\rm outlier})} \hfill \cr {({h_t}(o) = 1) = 0} \hfill &\,\,\,\, {({\rm inlier})} \hfill \cr}} \right.} \hfill}
  Obtain a single output vector O representing all outliers in the dataset D
end for
Return O
END ALGORITHM

Basic characteristics of the datasets.

IDDataset# Instance# Feature% Outliers
1CoverType286,048100.9
2Glass21494.2
3KDDCup60,632410.4
4Lymphography148184.1
5PageBlocks5,4731010.2
6PenDigits9,868160.2
7Shuttle101391.2
8Stamps34099.1
9Thyroid3,77262.5
10Wine129137.7

The points to which AUC value reaches the maximum for each dataset.

DatasetsLOF Best k ValueBV-LOF Best T Value
Cover608
Glass121
KDDCup5616
Lymphography5485
PageBlocks194
PenDigits883
Shuttle74
Stamps32
Thyroid634
Wine1399
Average37.522.6

The average of experimental results of all datasets.

LOF (%)BV-LOF (%)
Maximum Outlier Detection Performance88.4290
Average Outlier Detection Performance80.4783.97
eISSN:
2543-683X
Language:
English
Publication timeframe:
4 times per year
Journal Subjects:
Computer Sciences, Information Technology, Project Management, Databases and Data Mining