A Two-Level Approach based on Integration of Bagging and Voting for Outlier Detection

Alican Dogan; Derya Birant

Open Access

Journal of Data and Information Science

Volume 5 (2020): Issue 2 (April 2020)

Published Online: May 20, 2020

Page range: 111 - 135

Received: Dec 13, 2019

Accepted: Apr 29, 2020

DOI: https://doi.org/10.2478/jdis-2020-0014

Keywords
Outlier detection, Local outlier factor, Ensemble learning, Bagging, Voting

© 2020 Alican Dogan et al., published by Sciendo

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.

Figure 1. Learning scenarios for outlier detection models.

Figure 2. The general structure of the proposed BV-LOF approach.

Figure 3. Feature subset selection operation in the first stage of the BV-LOF approach.

Figure 4. LOF application and majority voting operation in the second stage of the BV-LOF approach.

Figure 5. Comparison between LOF and BV-LOF methods in terms of maximum AUC values.

Figure 6. Comparison between LOF and BV-LOF methods in terms of average AUC values.

Figure 7. AUC values obtained from LOF with different k and from BV-LOF with different ensemble sizes (T).
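The figures compare the two methods by AUC. As a reminder of what that number measures, here is a minimal sketch of a rank-based AUC for outlier scores. The function name `auc` and the toy scores are illustrative assumptions, not taken from the article, and the article may have computed AUC with a different tool.

```python
def auc(scores, is_outlier):
    """Rank-based AUC: chance that a random outlier outscores a random inlier."""
    pos = [s for s, y in zip(scores, is_outlier) if y]
    neg = [s for s, y in zip(scores, is_outlier) if not y]
    if not pos or not neg:
        raise ValueError("need at least one outlier and one inlier")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # outlier ranked above inlier
            elif p == q:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy example: higher score means "more outlying" (as with LOF).
toy_scores = [1.0, 1.7, 0.9, 3.2, 1.6]
toy_labels = [False, False, False, True, True]
print(round(auc(toy_scores, toy_labels), 3))  # 0.833
```

An AUC of 1.0 means every true outlier is scored above every inlier; 0.5 corresponds to random ranking.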

ALGORITHM 1 - Bagged and Voted Local Outlier Detection (BV-LOF)

Inputs:
    T: number of iterations (in other words, the ensemble size)
    D = {X₁, X₂, ..., X_n}: the entire dataset, where n is the number of instances
    F = {F₁, F₂, ..., F_d}: the feature set, where d is the dimension of the dataset

Output:
    O = {o₁, o₂, ..., o_p}: the set of objects that are assigned as outliers

for i = 1 to T do
    Randomly determine subset size R in [d/, d-1]
    for j = 1 to R do
        ft = Randomly select a feature without replacement from F
        S_i = S_i ∪ {ft}
    end for
    Generate D_i that includes the features in the subset S_i
    for each neighbor size k in [1, 100] do
        Apply LOF_(k) on D_i
        Obtain output vectors O(D_i, k)
    end for
    for each object o in O(D_i, k) do
        // find highest total vote
        \[
        h_i(o) = \operatorname{argmax}_{y \in Y} \sum_{k=1}^{100} v,
        \quad \text{where } Y = \{1, -1\} \text{ and }
        v = \begin{cases}
        1 & \text{if } h_k(o) = -1 \ \text{(outlier)} \\
        0 & \text{if } h_k(o) = 1 \ \text{(inlier)}
        \end{cases}
        \]
        Obtain single output vector O(D_i) for dataset D_i
    end for
    O(D) = O(D) ∪ O(D_i)
end for

for each object o in O(D) do
    // find highest total vote
    \[
    h(o) = \operatorname{argmax}_{y \in Y} \sum_{t=1}^{T} v,
    \quad \text{where } Y = \{1, -1\} \text{ and }
    v = \begin{cases}
    1 & \text{if } h_t(o) = -1 \ \text{(outlier)} \\
    0 & \text{if } h_t(o) = 1 \ \text{(inlier)}
    \end{cases}
    \]
    Obtain a single output vector O representing all outliers in the dataset D
end for

Return O

END ALGORITHM
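The two voting levels in Algorithm 1 can be sketched in plain Python as follows. This is an illustrative reading, not the authors' implementation, and it rests on several assumptions: the lower bound of the subset size is truncated in the listing, so d/2 is assumed here; LOF scores are turned into labels with a hypothetical cutoff of 1.5; ties in a vote are resolved as inlier; and the helper names (`rank_neighbours`, `lof_scores`, `majority`, `bv_lof`) are invented for this sketch.

```python
import math
import random


def rank_neighbours(rows):
    """For each point, list (distance, index) of all other points, nearest first."""
    n = len(rows)
    return [
        sorted((math.dist(rows[i], rows[j]), j) for j in range(n) if j != i)
        for i in range(n)
    ]


def lof_scores(ranked, k):
    """Local Outlier Factor per point; values well above 1 suggest an outlier."""
    n = len(ranked)
    k = min(k, n - 1)
    k_dist = [ranked[i][k - 1][0] for i in range(n)]
    # k-neighbourhood, including ties at the k-distance.
    neigh = [[(dist, j) for dist, j in ranked[i] if dist <= k_dist[i]] for i in range(n)]
    # Local reachability density; the epsilon guards against duplicate points.
    lrd = []
    for i in range(n):
        reach = sum(max(k_dist[j], dist) for dist, j in neigh[i])
        lrd.append(len(neigh[i]) / max(reach, 1e-12))
    return [
        sum(lrd[j] for _, j in neigh[i]) / (len(neigh[i]) * lrd[i])
        for i in range(n)
    ]


def majority(votes):
    """Return -1 (outlier) if outlier votes outnumber inlier votes, else 1."""
    outlier_votes = sum(1 for v in votes if v == -1)
    return -1 if outlier_votes > len(votes) - outlier_votes else 1  # tie -> inlier (assumed)


def bv_lof(rows, T, k_values, threshold=1.5, seed=0):
    """Return indices of objects voted as outliers across T random feature subsets."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    level1 = []  # one voted label vector O(D_i) per feature subset
    for _ in range(T):
        R = rng.randint(max(1, d // 2), max(1, d - 1))  # lower bound d/2 is an assumption
        subset = rng.sample(range(d), R)                # features drawn without replacement
        D_i = [[row[f] for f in subset] for row in rows]
        ranked = rank_neighbours(D_i)                   # distances reused for every k
        per_k = [
            [-1 if s > threshold else 1 for s in lof_scores(ranked, k)]
            for k in k_values
        ]
        # First-level vote: combine the labels obtained with different k.
        level1.append([majority([labels[o] for labels in per_k]) for o in range(n)])
    # Second-level vote: combine the labels obtained on different feature subsets.
    final = [majority([labels[o] for labels in level1]) for o in range(n)]
    return [o for o in range(n) if final[o] == -1]


# Toy data: 20 scattered inliers plus one far-away point at index 20.
rows = [(i, 7 * i % 20, 3 * i % 20) for i in range(20)] + [(100, 100, 100)]
outliers = bv_lof(rows, T=5, k_values=range(2, 6))
print(outliers)
```

The listing sweeps k over [1, 100]; the toy call uses a much smaller range only to keep the example fast on 21 points.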

Table 1. Basic characteristics of the datasets.

ID	Dataset	# Instances	# Features	% Outliers
1	CoverType	286,048	10	0.9
2	Glass	214	9	4.2
3	KDDCup	60,632	41	0.4
4	Lymphography	148	18	4.1
5	PageBlocks	5,473	10	10.2
6	PenDigits	9,868	16	0.2
7	Shuttle	1,013	9	1.2
8	Stamps	340	9	9.1
9	Thyroid	3,772	6	2.5
10	Wine	129	13	7.7

Table 2. Parameter values at which the AUC reaches its maximum for each dataset.

Dataset	LOF Best k Value	BV-LOF Best T Value
CoverType	60	8
Glass	12	1
KDDCup	56	16
Lymphography	54	85
PageBlocks	19	4
PenDigits	88	3
Shuttle	7	4
Stamps	3	2
Thyroid	63	4
Wine	13	99
Average	37.5	22.6

Table 3. Experimental results averaged over all datasets.

Metric	LOF (%)	BV-LOF (%)
Maximum Outlier Detection Performance	88.42	90
Average Outlier Detection Performance	80.47	83.97
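To make the size of the improvement explicit, the gap between the two columns can be computed directly. The snippet below is a small arithmetic sketch; the dictionary layout and the `gain` helper are illustrative, and the numbers are copied from the table above.

```python
# Averages over all ten datasets, copied from the table above (percent).
results = {
    "maximum": {"LOF": 88.42, "BV-LOF": 90.0},
    "average": {"LOF": 80.47, "BV-LOF": 83.97},
}


def gain(row):
    """Absolute gain of BV-LOF over LOF, in percentage points."""
    return round(row["BV-LOF"] - row["LOF"], 2)


gains = {name: gain(row) for name, row in results.items()}
print(gains)  # {'maximum': 1.58, 'average': 3.5}
```

In other words, BV-LOF is ahead by about 1.58 percentage points on the maximum and 3.5 points on the average.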

eISSN: 2543-683X
Language: English
Publication timeframe: 4 times per year
Journal Subjects: Computer Sciences, Information Technology, Project Management, Databases and Data Mining
