
Improvement of the Fast Clustering Algorithm Improved by K-Means in the Big Data



Introduction

Data are the quantized symbols of information. Data clustering is the process of finding effective information and hidden structural features by reasonably dividing a data collection according to a similarity measure. It is an important data mining technique for unsupervised learning and is widely used in pattern recognition [1,2,3], machine learning [4,5], image processing [6,7] and other fields. In the era of Big Data, a great deal of valuable data is produced at all times with the rapid development of the economy, science and technology. Different from traditional data, Big Data is usually noisy, high-dimensional, sparse and heterogeneous [8,9,10]. How to construct efficient clustering models and algorithms for Big Data is a very important and challenging research topic, with important scientific value and economic benefits.

Data clustering, as an important data mining technology, aims to divide data objects into several different clusters according to a similarity measure, so that data objects within a cluster have the greatest similarity and data objects in different clusters have the smallest similarity. A variety of data clustering algorithms have been studied in past years, including nonnegative matrix factorization [11,12,13,14], mean shift [15,16,17], spectral clustering [18,19,20], sparse subspace clustering [21,22,23] and K-means [24,25,26,27]. Undoubtedly, K-means is the most commonly used and important clustering algorithm. The purpose of K-means clustering is to minimize the sum of squared Euclidean distances between each data point and its closest center [25]: $$\min_{c_j,\,\mathbb{C}_j} \sum_{j = 1}^{k} \sum_{i \in \mathbb{C}_j} dist\left( m_i, c_j \right)$$ where dist is the distance function, k is the clustering number, c_j is the clustering center, ℂ_j is the index set of the j-th cluster and $\bigcup_{j=1}^{k}\mathbb{C}_j = \{1,2,\cdots,n\}$. In fact, this process is similar to a Voronoi diagram of the function dist in the data space, so locating the globally optimal solution of this problem is NP-hard even for the simplest case. Therefore, only a local minimum of the K-means model is used in practice. The most commonly used algorithm is Lloyd's algorithm [25]. Lloyd's algorithm is a heuristic iterative refinement method, which quickly converges to a locally optimal solution. Starting from pre-selected cluster centers, the algorithm mainly carries out the following iterative processes: an allocation step and an update step. The purpose of the allocation step is to allocate each data point to a center by calculating the distance from the point to each clustering center; the update step recalculates the new clustering centers from the current allocation. From Lloyd's algorithm, we can see that different clustering models and corresponding algorithms can be obtained by defining different similarity measures dist, selecting different initial clustering centers and recalculating the clustering centers in different ways. It is precisely because the K-means model and Lloyd's algorithm have such flexible forms, which can be adapted to different problems, that they have been widely used. However, regardless of the model and algorithm, the curse of data scale and dimension is unavoidable. The most time-consuming part of Lloyd's algorithm is the allocation step. Calculating the distances between the points and centers (e.g. Euclidean distances) costs O(kdn), and the total cost of Lloyd's algorithm is O(tkdn), where t is the number of iterations. Therefore, when d and n are large, the efficiency of Lloyd's algorithm is greatly reduced [28,29,30].
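
For concreteness, the following Python sketch illustrates the allocation and update steps of Lloyd's algorithm described above. It is a minimal illustration, not the implementation used in this paper; the random initialization, the squared Euclidean distance and the stopping rule are assumptions.

```python
import numpy as np

def lloyd_kmeans(M, k, max_iter=100, seed=0):
    """Minimal sketch of Lloyd's algorithm on a (d, n) data matrix M."""
    rng = np.random.default_rng(seed)
    n = M.shape[1]
    # Random initialization; K-means++ or another scheme could be used instead.
    centers = M[:, rng.choice(n, size=k, replace=False)].astype(float)
    labels = np.zeros(n, dtype=int)
    for _ in range(max_iter):
        # Allocation step: assign each point to its nearest center, cost O(kdn).
        dist2 = ((M[:, :, None] - centers[:, None, :]) ** 2).sum(axis=0)  # shape (n, k)
        labels = dist2.argmin(axis=1)
        # Update step: each center becomes the mean of the points assigned to it.
        new_centers = centers.copy()
        for j in range(k):
            if np.any(labels == j):
                new_centers[:, j] = M[:, labels == j].mean(axis=1)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```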

In the process of clustering, the basic K-means model needs to determine three parameters: the selection of the initial iteration point c0 and the iterative process; the determination or estimation of the clustering number k; and the definition of the similarity measure dist between sample points. Many improved K-means models and algorithms can be obtained by choosing different processing methods for these three parameters. Several improved initialization schemes have been proposed to deal with the initialization issue of Lloyd's algorithm. One of the most important improved models is K-means++ [28]: assuming that $\bar{k}$ ($0 < \bar{k} < k$) initial clustering centers have already been selected, the farther a point is from the current $\bar{k}$ clustering centers, the higher the probability that it is selected as the $(\bar{k}+1)$-th clustering center. For more detail on initialization methods for K-means, we refer to Celebi et al. [30]. According to the different selection rules for the clustering number k, the improved K-means models generally include the K-J inflection point model [31] and the Iterative Self-Organizing Data Analysis Techniques Algorithm (ISODATA) [32]. The K-J model computes a series of cluster numbers k and clustering objective function values J, plots the relationship between them and uses the inflection point of the curve to determine the value of k. ISODATA repeatedly modifies the number of clustering centers during the iteration to obtain a more reasonable number of categories k; given suitable parameters, this modification is achieved by merging and splitting (refer to [33]). In order to deal with data clustering problems with different data distributions, it is necessary to select different similarity measures between sample points, from which different types of improved K-means models can be obtained. The kernel K-means model, similar in spirit to the kernel trick in support vector machines, maps all samples to another vector space and clusters them there [31]. The fuzzy C-means model introduces fuzzy factors, calculates the membership degree of each point to each cluster and uses this membership degree to determine the level to which each point belongs to a cluster [34]. The spherical K-means model solves the problem of clustering data points on a sphere [35]; when using this model, we need to normalize each data point and then use the cosine dissimilarity instead of the Euclidean distance to measure the similarity between data points. In addition, there are improved models that use the Mahalanobis distance, the Itakura–Saito divergence or the Bregman divergence instead of the Euclidean distance as the similarity measure. For sample points x, y ∈ ℝ^d, a covariance matrix S and a convex function Φ, we have [36,37,38]:

Mahalanobis distance: $d(x,y) = \sqrt{(x - y)^T S^{-1} (x - y)}$

Itakura–Saito divergence: $d(x,y) = \dfrac{x}{y} - \ln\left(\dfrac{x}{y}\right) - 1$

Bregman divergence: $d(x,y) = \Phi(x) - \Phi(y) - \left\langle \nabla\Phi(y),\, x - y \right\rangle$
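
For illustration, these measures can be written directly as small functions. The following Python sketch is an illustrative assumption rather than part of the original paper; in particular, the element-wise summation used to extend the Itakura–Saito divergence to vectors, and the caller-supplied Φ and ∇Φ for the Bregman case, are choices made here for the example.

```python
import numpy as np

def mahalanobis(x, y, S):
    """Mahalanobis distance between x and y with covariance matrix S."""
    diff = np.asarray(x, float) - np.asarray(y, float)
    return float(np.sqrt(diff @ np.linalg.solve(S, diff)))

def itakura_saito(x, y):
    """Itakura-Saito divergence, applied element-wise and summed (requires x, y > 0)."""
    r = np.asarray(x, float) / np.asarray(y, float)
    return float(np.sum(r - np.log(r) - 1.0))

def bregman(x, y, phi, grad_phi):
    """Bregman divergence generated by a convex function phi with gradient grad_phi."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(phi(x) - phi(y) - np.dot(grad_phi(y), x - y))

# Example: phi(v) = <v, v> recovers the squared Euclidean distance:
# bregman(x, y, lambda v: v @ v, lambda v: 2 * v) == ||x - y||^2
```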

In Big Data clustering problems, the advantages of K-means and its improved algorithms mainly include the following: they are concise and intuitive, computationally efficient and scalable. Comparatively, they also have obvious shortcomings: the assumptions on the distribution of the clustering data are too strict; different initial points lead to distinct clustering results, and the algorithms easily fall into local optima; and the computational complexity grows linearly with the data dimension. Therefore, when dealing with low-dimensional data clustering problems, K-means and its improved algorithms usually obtain fairly accurate results with acceptable running time. However, for high-dimensional Big Data, due to the impact of the curse of dimensionality, the data distribution, the data size, data noise and so on, K-means and its improved algorithms often cannot obtain the desired results, and their computational efficiency is low. A common way to deal with this problem is to reduce the dimensionality of the data, that is, to seek low-dimensional features of the high-dimensional data and then cluster within the low-dimensional features. However, there are some difficulties in high-dimensional data clustering algorithms based on dimensionality reduction. Firstly, are the low-dimensional data the dominant features needed when clustering practical problems? Secondly, is the mapping of distances between low-dimensional data points conducive to clustering? Generally speaking, data dimensionality reduction and low-dimensional clustering are two important steps in the Big Data clustering process: dimensionality reduction clears obstacles for low-dimensional clustering (removing data noise, reducing the data dimension, etc.), and low-dimensional clustering achieves the ultimate goal of clustering. In order to achieve excellent results, the two processes of dimensionality reduction and low-dimensional clustering should complement and match each other in Big Data clustering.

In this paper, unlike the existing improved K-means models, aiming at the low efficiency of traditional algorithms on Big Data, we propose clustering in the feature space to improve the efficiency of the algorithm while ensuring its accuracy. We point out that, as long as the clustering center and the distance function satisfy certain conditions, most K-means algorithms can be accelerated with our ideas. In addition, we prove that the preprocessing step (data dimension reduction) and the clustering step (low-dimensional clustering) of the proposed method are completely matched for the problem of Big Data clustering.

Fast K-means Algorithms for Big Data Clustering

In the classical K-means model, the mean of the data points is usually chosen as the clustering center and the initial value is chosen randomly. In Big Data clustering problems, through continuous theoretical research and practical application, researchers have found that the Euclidean distance is extremely unfavorable for measuring the similarity between high-dimensional data, especially sparse high-dimensional data. In addition, the random selection of cluster centers is also very sensitive to noise. The improved models and algorithms are mainly aimed at these two points. Next, we introduce two well-known improved models used in practical Big Data clustering problems: the spherical K-means model and the K-medoids model.

For Big Data clustering, the original data are heterogeneous, noisy, high-dimensional and sparse. The directional differences between the original data are far more important than the metric differences, because the lengths of the original data are likely to differ even when the differences between the same measurements should be significant. Based on this, in 2011, Dhillon et al. proposed using the cosine dissimilarity to measure the distance between data [35], which gives the spherical K-means model. The cosine dissimilarity of two points x, y ∈ ℝ^d is defined as follows: $$d(x,y) = 1 - \cos(x,y) = 1 - \frac{\langle x, y \rangle}{\|x\|\,\|y\|}$$

It can be obtained that the objective function of the spherical K-means model has the following form: $$\sum_{j=1}^{k}\sum_{i=1}^{n} u_{ij}\left(1 - \frac{\langle m_i, c_j\rangle}{\|m_i\|\,\|c_j\|}\right) = \sum_{j=1}^{k}\left[\sum_{i=1}^{n} u_{ij} - \left\langle \sum_{i=1}^{n} u_{ij}\frac{m_i}{\|m_i\|},\, \frac{c_j}{\|c_j\|}\right\rangle\right]$$

Using the Cauchy–Schwarz inequality, the spherical K-means model attains its optimal value if and only if: $$c_j \leftarrow \sum_{i=1}^{n} u_{ij}\frac{m_i}{\|m_i\|}, \quad \forall j$$

If the data are normalized, i.e. ‖m_i‖ = 1, the clustering center of the spherical K-means model can be expressed as: $$c_j \leftarrow \sum_{i=1}^{n} u_{ij}\frac{m_i}{\|m_i\|} = \sum_{i \in \mathbb{C}_j} m_i, \quad \forall j$$ that is, the clustering center c_j is proportional to the sum of all the data in the j-th cluster. Here m_i ∈ M, M ∈ ℝ^{d×n} is the original data, k is the clustering number, c_j is the clustering center, ℂ_j is the index set of the j-th cluster, and u_{ij} ∈ {0,1} is the indicator function.

For the spherical K-means model, the Lloyd iteration does not increase the objective function, but it cannot guarantee convergence. The algorithm is shown in Figure 1 below.

Figure 1

The algorithm of the spherical K-means model.
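
As a complement to Figure 1, the following Python sketch spells out the spherical K-means iteration described above: points are normalized, the allocation step maximizes the cosine similarity, and each center is the renormalized sum of its cluster. The random initialization and the stopping rule are assumptions made for this sketch.

```python
import numpy as np

def spherical_kmeans(M, k, max_iter=100, seed=0):
    """Minimal sketch of spherical K-means on a (d, n) data matrix with nonzero columns."""
    rng = np.random.default_rng(seed)
    n = M.shape[1]
    X = M / np.linalg.norm(M, axis=0, keepdims=True)        # m_i / ||m_i||
    C = X[:, rng.choice(n, size=k, replace=False)].copy()   # unit-norm initial centers
    for _ in range(max_iter):
        # Allocation step: maximize cosine similarity (minimize cosine dissimilarity).
        labels = (X.T @ C).argmax(axis=1)
        # Update step: c_j <- sum of the normalized points in cluster j, renormalized.
        new_C = C.copy()
        for j in range(k):
            if np.any(labels == j):
                s = X[:, labels == j].sum(axis=1)
                new_C[:, j] = s / np.linalg.norm(s)
        if np.allclose(new_C, C):
            break
        C = new_C
    return C, labels
```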

In Big Data clustering, there are always many outliers due to the nature of practical problems. Generally speaking, the accuracy of K-means is greatly reduced when the data have outliers, mainly because the clustering centers are strongly affected by them. In order to solve this problem, Kaufman and Rousseeuw proposed using medoids, points located at the center of the data, as the clustering centers. This method is therefore also called the K-medoids model. The algorithm is shown in Figure 2 below.

Figure 2

The algorithm of the K-medoids model.

The medoid minimizes the average dissimilarity to all data points in the same cluster, so it is highly robust to noise. However, computing a medoid requires comparing the distances between all data points before deciding which one is the medoid, which is much more expensive than computing a mean; this results in the low computational efficiency of the K-medoids model for high-dimensional data. It has good robustness but low efficiency in Big Data clustering applications. Partitioning Around Medoids (PAM) is the most effective algorithm for solving the K-medoids model. Given a randomly selected initial value, PAM replaces each cluster center with a data point that reduces the objective function. This process iterates until no medoid can be replaced. In each iteration, the computational complexity of PAM is O(d(n−k)²k). In order to reduce the computational complexity, Park and Jun proposed a fast K-medoids clustering algorithm. This method calculates the distance matrix and uses it iteratively to compute the new medoids. The computational complexity of this algorithm is consistent with K-means, which has made it widely used in Big Data clustering. However, when the data scale increases sharply, the cost of computing the preprocessing matrix becomes too large.
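
To make the medoid update concrete, the following Python sketch alternates assignment and medoid recomputation on a precomputed distance matrix, in the spirit of the fast K-medoids scheme discussed above. It is a simplified sketch, not PAM or the Park-Jun algorithm itself; the random initialization is an assumption.

```python
import numpy as np

def kmedoids(D, k, max_iter=100, seed=0):
    """Simplified K-medoids on a precomputed (n, n) distance matrix D."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(max_iter):
        # Allocation step: assign each point to its nearest medoid.
        labels = D[:, medoids].argmin(axis=1)
        # Update step: within each cluster, the new medoid is the member that
        # minimizes the total distance to the other members of the cluster.
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if members.size > 0:
                costs = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[costs.argmin()]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return medoids, labels
```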

The Improved K-means Algorithms in Feature Space

In order to make the concise and efficient K-means algorithm reasonably usable in Big Data clustering, we consider the improved K-means algorithm in a dimension-reduced space (feature space). Assume that the rank of the data matrix M ∈ ℝ^{d×n} is r = Rank(M) ≤ min(d,n) and that M is decomposed as M = UΣV^T by the singular value decomposition. Then: $$U^T M = \Sigma V^T = \begin{bmatrix} \Sigma_r & 0 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} \underbrace{V_1}_{r} & \underbrace{V_2}_{n-r} \end{bmatrix}^T = \begin{bmatrix} \Sigma_r V_1^T \\ 0 \end{bmatrix} \triangleq \begin{bmatrix} \hat{M} \\ 0 \end{bmatrix}$$ where V_1 contains the first r right singular vectors of the data M. We have the following theorem on the relationship between the original problem and the feature-space problem; that is, the original problem is completely equivalent to the feature-space problem under certain conditions.

Theorem 1

If the K-means problem and its extended models satisfy the following two conditions:

the cluster center c_j is a linear combination of all data points, and

the distance function dist is orthogonally invariant,

then the K-means problem in the original data space M is equivalent to the K-means problem in the feature space $\hat{M}$. Thus, the computational complexity of the K-means clustering problem is reduced from O(tkdn) to O(tkrn).

Proof

Without loss of generality, assume that the cluster center in the original space M is $c_j = \sum_l w_{lj} m_l$, and define $\hat{c}_j = \sum_l w_{lj}\hat{m}_l$ with the same weights w, where $\hat{m}_l$ is the corresponding point in the feature space $\hat{M}$. According to the definition of $\hat{M}$, the cluster center is transformed as: $$U^T c_j = U^T\left(\sum_l w_{lj} m_l\right) = \sum_l w_{lj}\left(U^T m_l\right) = \sum_l w_{lj}\begin{bmatrix} \hat{m}_l \\ 0 \end{bmatrix} = \begin{bmatrix} \hat{c}_j \\ 0 \end{bmatrix}$$

From the orthogonal invariance of the distance function dist, the following equation holds: $$dist\left(m_i, c_j\right) = dist\left(U^T m_i, U^T c_j\right) = dist\left(\hat{m}_i, \hat{c}_j\right)$$

Therefore, the objective function of the K-means problem in the original space M is the same as that in the feature space $\hat{M}$, and the way the clustering centers are selected is also the same.
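
The orthogonal-invariance argument above can also be checked numerically. The following short Python sketch (an illustrative check, not part of the original paper) verifies that pairwise Euclidean distances computed on the feature-space data $\hat{M} = \Sigma_r V_1^T$ agree with those computed on the original data:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((500, 40))                   # d = 500, n = 40, so r <= 40
U, s, Vt = np.linalg.svd(M, full_matrices=False)
M_hat = np.diag(s) @ Vt                              # feature-space data, shape (40, 40)

i, j = 3, 17
d_orig = np.linalg.norm(M[:, i] - M[:, j])           # distance in the original space
d_feat = np.linalg.norm(M_hat[:, i] - M_hat[:, j])   # distance in the feature space
print(abs(d_orig - d_feat))                          # ~1e-13: the distances coincide
```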

Theorem 2

For the standard K-means problem, the spherical K-means problem and some K-medoids problems, the clustering in the original space is consistent with that in the feature space.

Proof

First, the clustering centers of the standard K-means problem and the spherical K-means problem are linear combinations of all data points (the mean and the normalized sum, respectively), while the clustering centers of the K-medoids problem are themselves data points, so the first condition in Theorem 1 is satisfied.

Second, the Euclidean distance and the cosine dissimilarity satisfy the orthogonal invariance condition. Therefore, as long as the Euclidean distance or the cosine dissimilarity is chosen as the distance function, the K-medoids problem also satisfies the second condition of Theorem 1.

According to Theorem 1, this conclusion can be obtained.

Based on the above discussion, when using K-means to deal with Big Data clustering problems, we can first reduce the dimension of the data and then carry out the clustering analysis in the feature space. The algorithm is shown in Figure 3 below.

Figure 3

The K-means clustering algorithm for high-dimensional data.
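
A minimal sketch of this two-step procedure (SVD-based dimension reduction followed by clustering in the feature space) is given below. It reuses the lloyd_kmeans sketch from the introduction and assumes the Euclidean distance, so it illustrates the idea rather than reproducing the exact algorithm of Figure 3; the default rank threshold is also an assumption.

```python
import numpy as np

def kmeans_feature_space(M, k, r=None, **kwargs):
    """Cluster the columns of a (d, n) matrix M after projecting onto the feature space."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    if r is None:
        # Numerical rank of M (at most min(d, n)).
        r = int(np.sum(s > s[0] * max(M.shape) * np.finfo(float).eps))
    M_hat = s[:r, None] * Vt[:r, :]              # Sigma_r V_1^T, shape (r, n)
    # Any K-means variant satisfying the conditions of Theorem 1 can be run on M_hat;
    # here we reuse the Lloyd sketch, so the cost drops from O(tkdn) to O(tkrn).
    centers_hat, labels = lloyd_kmeans(M_hat, k, **kwargs)
    return labels, centers_hat
```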

Numerical Experiment

In this section, we use artificial data and actual data to test the performance of the algorithm, mainly verifying two aspects: whether the objective function is consistent and whether the running time is reduced. All numerical experiments were run on a desktop computer with an Intel Core i7-3770 CPU at 3.40 GHz and 8 GB RAM under Matlab R2017b.

We construct the following artificial data to verify the accuracy of the algorithm. First, k points in d-dimensional space are randomly selected as clustering centers. Then, Gaussian points with variance σ are added around the clustering centers, giving a total of n data points. Finally, the n data points are randomly permuted. In order to test the influence of the data dimension, we let d vary from 1,000 to 50,000. The number of samples is n = 1,000 and the number of clusters is k = 10.
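
One possible way to generate such synthetic data is sketched below in Python. The uniform range for the centers, the value of σ and the interpretation of the final random rearrangement as a permutation of the points are assumptions; this is not the script used for the experiments.

```python
import numpy as np

def make_synthetic(d, n=1000, k=10, sigma=1.0, seed=0):
    """Generate n points around k random centers in d dimensions."""
    rng = np.random.default_rng(seed)
    centers = rng.uniform(-10.0, 10.0, size=(d, k))   # random cluster centers
    labels = rng.integers(0, k, size=n)               # roughly balanced assignment
    M = centers[:, labels] + sigma * rng.standard_normal((d, n))
    perm = rng.permutation(n)                         # shuffle the point order
    return M[:, perm], labels[perm]

# Example: M, y = make_synthetic(d=50_000)   # d varies from 1,000 to 50,000 in the tests
```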

Table 1 records the differences in the K-means, spherical K-means and K-medoids objective functions between the original space and the feature space. It can be seen that the clustering results in the original space and the feature space are consistent for any initial value. Table 2 records the ratio of the running times of K-means, spherical K-means and K-medoids in the original space and the feature space. It can be seen that when the dimension d is small (compared with the number of samples), clustering in the feature space does not speed up the algorithm, because clustering in the feature space also requires computing the singular value decomposition. However, as the dimension increases, K-means clustering in the feature space saves a great deal of time (when d = 50,000, the running time of the algorithm is reduced by a factor of 44).

Table 1

The comparison of the objective functions on the artificial data

Size | K-means | Spherical K-means | K-medoids
d = 1,000 | 0 | 0 | 0
d = 2,000 | 1.863e–09 | 0 | 0
d = 5,000 | 3.725e–09 | 2.842e–13 | 3.275e–09
d = 10,000 | 3.725e–09 | 1.137e–13 | 7.451e–09
d = 20,000 | 1.490e–08 | 6.253e–13 | 0
d = 50,000 | 2.980e–08 | 0 | 5.960e–08

Table 2

The comparison of the run time on the artificial data

Size | K-means | Spherical K-means | K-medoids
d = 1,000 | 1.01 | 1.05 | 1.00
d = 2,000 | 1.46 | 1.13 | 1.30
d = 5,000 | 4.30 | 2.59 | 2.73
d = 10,000 | 7.97 | 4.71 | 5.19
d = 20,000 | 17.76 | 8.42 | 10.09
d = 50,000 | 44.10 | 33.60 | 26.30

Next, we test the performance of the algorithm on image data and DNA data. These two kinds of data come from different fields, have different meanings and differ in scale, which helps to verify the advantages and disadvantages of our algorithm on different data. Table 3 records the information of each data set. The first three data sets are image data and the last four are DNA data.

The AT&T ORL database [39] consists of n = 400 cropped face images of d = 112 × 92 pixels from k = 10 different persons, each person contributing 40 sample images captured under different conditions. All images were taken against a dark homogeneous background with the subjects in an upright, frontal position.

The Yale database [40] consists of n = 165 cropped face images of d = 100 × 100 pixels from k = 15 different persons, each of whom contributes 11 images. They cover different facial expressions or configurations, i.e. glasses, happy, normal, sad, sleepy, surprised, and wink.

The COIL-20 database [41] consists of n = 1,440 gray-scale images of d = 128 × 128 pixels from k = 20 different objects. The objects were placed on a motorized turntable against a black background. The turntable was rotated through 360 degrees to vary the object pose with respect to a fixed camera, and images of the objects were taken at pose intervals of 5 degrees.

The CMD data [42] consists of d = 7,129 dimensions with n = 60 cotton microsatellites from k = 2 different characters. In addition, CMD displays data for three of the microsatellite projects that have been screened against a panel of core germplasm. The standardized panel consists of 12 diverse genotypes including genetic standards, mapping parents, BAC donors, subgenome representatives, unique breeding lines, exotic introgression sources, and contemporary Upland cottons with significant acreage.

The DLBCL data [43] consists of d = 7,129 dimensions with n = 77 DLBCL patients from k = 2 different factors. The original data frame had more than 8,000 observations (rows) on three markers (columns) and contained measurements from biopsies of 30 DLBCL patients. Each sample was stained with three antibodies: CD3, CD5, and CD19.

The LunG data [44] consists of d = 1,000 dimensions with n = 197 lung cancer patients from k = 4 different factors.

The Prostate data [45] consists of d = 12,600 dimensions with n = 102 prostate cancer patients from k = 2 different factors. The original data set contained 97 men who had prostate cancer, with the corresponding patient information recorded.

Table 3

The data information in the algorithm test

Name | d | n | k
ORL | 4,096 | 400 | 20
YALE | 4,096 | 165 | 15
COIL20 | 16,384 | 1,440 | 20
CMD | 7,129 | 60 | 2
DLBCL | 7,129 | 77 | 2
LunG | 1,000 | 197 | 4
Prostate | 12,600 | 102 | 2

Table 4 records the differences in the K-means, spherical K-means and K-medoids objective functions between the original space and the feature space. It can be seen that the clustering results in the original space and the feature space are consistent for any initial value. Table 5 records the ratio of the running times of K-means, spherical K-means and K-medoids in the original space and the feature space. It can be seen that the acceleration effect of the algorithm on the LunG data is not obvious, because the data dimension is not high (d = 1,000) and the number of samples is relatively large (n = 197). The algorithm has the best acceleration effect (more than 10 times) on the Prostate data, mainly due to its high dimension and small number of samples.

Table 4

The comparison of the objective functions on the actual data

Name | K-means | Spherical K-means | K-medoids
ORL | 2.309e–14 | 1.705e–13 | 1.243e–13
YALE | 4.263e–14 | 2.398e–14 | 7.105e–15
COIL20 | 2.757e–12 | 1.121e–11 | 4.547e–13
CMD | 7.105e–15 | 6.128e–14 | 1.776e–14
DLBCL | 7.105e–15 | 9.548e–14 | 3.730e–14
LunG | 3.908e–14 | 4.796e–14 | 3.553e–14
Prostate | 7.105e–15 | 1.172e–13 | 3.553e–14

Table 5

The comparison of the run times on the actual data

Name | K-means | Spherical K-means | K-medoids
ORL | 10.77 | 8.45 | 2.74
YALE | 8.88 | 2.89 | 3.96
COIL20 | 8.86 | 6.22 | 5.70
CMD | 10.22 | 4.86 | 6.61
DLBCL | 7.70 | 4.76 | 6.92
LunG | 5.13 | 1.21 | 1.96
Prostate | 15.47 | 10.96 | 15.44

Conclusion

In this paper, aiming at the low efficiency of the K-means algorithm caused by high-dimensional data, we propose a clustering algorithm in the feature space that improves the efficiency of the algorithm while ensuring its accuracy. We also point out that, as long as the clustering center and the distance function satisfy certain conditions, K-means type problems and the corresponding algorithms can be accelerated with our ideas. In addition, we show in detail that the preprocessing step (data dimensionality reduction) of the proposed method perfectly matches the clustering step (low-dimensional clustering) for high-dimensional Big Data clustering problems.
