Data are quantised symbols of information. Data clustering is the process of finding effective information and hidden structural features in a data collection by dividing it reasonably according to a similarity measure. It is an important data mining technique for unsupervised learning and is widely used in pattern recognition [1,2,3], machine learning [4,5], image processing [6,7] and other fields. In the era of Big Data, a great deal of valuable data is produced at all times with the rapid development of the economy, science and technology. Unlike traditional data, Big Data is usually noisy, high-dimensional, sparse and heterogeneous [8,9,10]. How to construct efficient clustering models and algorithms for Big Data is an important and challenging research topic with scientific value and economic benefits.
Data clustering, as an important data mining technology, aims to divide data objects into several different clusters according to a similarity measure, so that data objects within a cluster have the greatest similarity and data objects in different clusters have the smallest similarity. A variety of data clustering algorithms have been studied in past years, including nonnegative matrix factorisation [11,12,13,14], mean shift [15,16,17], spectral clustering [18,19,20], sparse subspace clustering [21,22,23] and K-means [24,25,26,27]. Undoubtedly, K-means is the most commonly used and important clustering algorithm. The purpose of K-means clustering is to minimise the sum of squared Euclidean distances between each data point and its closest centre point:
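In standard notation, the objective described above is usually written as below. The symbols are assumptions borrowed from later sections (data points $m_1, \dots, m_n \in \mathbb{R}^d$ as the columns of the data matrix $M$, and centres $c_1, \dots, c_k$); the authors' own typesetting may differ.

```latex
\min_{c_1, \dots, c_k \in \mathbb{R}^d} \; \sum_{i=1}^{n} \, \min_{1 \le j \le k} \, \| m_i - c_j \|_2^2
```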
In the process of clustering, the basic K-means model needs to determine the following three parameters: the selection of the initial iteration point c0 and the iterative process; the determination or estimation of the clustering number k; and the definition of the similarity measure dist between sample points. Many improved K-means models and algorithms can be obtained by choosing different processing methods for these three parameters. Several improved initialisation schemes have been proposed to deal with the initialisation issue of Lloyd's algorithm. For example, one of the most important improved models is K-means++. This is assuming that
In Big Data clustering problems, the main advantages of K-means and related improved algorithms are that they are concise and intuitive, computationally efficient and scalable. However, they also have obvious shortcomings: the requirements on the distribution of the clustering data are too strict; different initial points lead to distinct clustering results and the iteration easily falls into a local optimal solution; and the computational complexity grows linearly with the data dimension. Therefore, when dealing with low-dimensional data clustering problems, K-means and related improved algorithms usually give accurate results in an acceptable running time. However, for high-dimensional Big Data, due to the curse of dimensionality, data distribution, data size, data noise and so on, K-means and related improved algorithms often cannot achieve the desired results and their computational efficiency is low. A common way to deal with this problem is to reduce the dimensionality of the data, that is, to seek low-dimensional features of the high-dimensional data and then to cluster within the low-dimensional features. However, high-dimensional data clustering algorithms based on dimensionality reduction face some problems. Firstly, are the low-dimensional data the dominating features needed for clustering in practical problems? Secondly, is the mapping of distances between low-dimensional data points conducive to clustering? Generally speaking, data dimensionality reduction and low-dimensional clustering are two important steps in solving large data clustering problems: data dimensionality reduction clears obstacles for low-dimensional clustering (removing data noise, reducing data dimension, etc.), and low-dimensional clustering achieves the ultimate goal of clustering.
In order to achieve excellent processing results, the two processes of data dimensionality reduction and low-dimensional clustering should complement and match each other in Big Data clustering.
In this paper, unlike the existing improved K-means models, we address the low efficiency of traditional algorithms on Big Data by proposing clustering in the feature space, which improves the efficiency of the algorithm while preserving accuracy. We point out that as long as the clustering centre and distance function satisfy certain conditions, most K-means algorithms can be accelerated with our ideas. In addition, we prove that the pre-processing step (data dimension reduction) and the clustering step (low-dimensional clustering) of the proposed method are completely matched in the problem of Big Data clustering.
2 K-means Fast Algorithms for Big Data Clustering
In the classical K-means model, the mean of the data points is usually chosen as the clustering centre and the initial value is chosen randomly. In Big Data clustering problems, through continuous theoretical research and practical application, researchers have found that the Euclidean distance is poorly suited to measuring the similarity between high-dimensional data, especially sparse high-dimensional data. In addition, the random selection of cluster centres is very sensitive to noise. The improved models and algorithms mainly target these two points. Next, we introduce two well-known improved models used in practical Big Data clustering problems: the spherical K-means model and the K-medoids model.
For Big Data clustering, the original data are heterogeneous, noisy, high-dimensional and sparse. The directional differences between the original data are far more important than the metric differences, because the lengths of the original data vectors are likely to differ, even though the differences between the same measurements should be significant. Based on this, in 2001, Dhillon et al. proposed the use of cosine dissimilarity to measure the distance between data, which gives the spherical K-means model. The cosine dissimilarity of two points $x \in \mathbb{R}^d$ and $y \in \mathbb{R}^d$ is defined as follows:
It follows that the objective function of the spherical K-means model has the following form:
Using the Cauchy–Schwarz inequality, the spherical K-means model attains its optimal value if and only if:
If the data are normalised, i.e. $\|m_i\| = 1$, the clustering centre of the spherical K-means model can be expressed as:
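For reference, the three quantities introduced above are commonly written in the following standard forms. This is a sketch in assumed notation rather than the authors' exact equations: $d_{\cos}$ for the cosine dissimilarity and $C_j$ for the index set of cluster $j$ are names introduced here for illustration.

```latex
\begin{align}
  % cosine dissimilarity between two nonzero points x, y in R^d
  d_{\cos}(x, y) &= 1 - \frac{x^{T} y}{\|x\| \, \|y\|}, \\
  % spherical K-means objective over the data points m_1, ..., m_n
  \min_{c_1, \dots, c_k} \; & \sum_{i=1}^{n} \, \min_{1 \le j \le k} \, d_{\cos}(m_i, c_j), \\
  % centre of cluster C_j for unit-norm data: the normalised sum of its members
  c_j &= \frac{\sum_{i \in C_j} m_i}{\bigl\| \sum_{i \in C_j} m_i \bigr\|}.
\end{align}
```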
For the spherical K-means model, the Lloyd iteration never increases the objective function, but it cannot guarantee convergence. The algorithm process is shown in Figure 1.
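The Lloyd-type iteration for spherical K-means can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `spherical_kmeans`, the random choice of initial centres, the fixed iteration count and the handling of empty clusters are all choices made here. Data points are the columns of `M`, as in the paper.

```python
import numpy as np

def spherical_kmeans(M, k, n_iter=50, seed=0):
    """Lloyd-style spherical K-means on the columns of M (shape d x n)."""
    rng = np.random.default_rng(seed)
    # Normalise every data point to unit length so that only direction matters.
    X = M / np.linalg.norm(M, axis=0, keepdims=True)
    n = X.shape[1]
    # Assumption: initial centres are k distinct data points chosen at random.
    C = X[:, rng.choice(n, size=k, replace=False)].copy()
    history = []  # objective value after each assignment step
    for _ in range(n_iter):
        # Assignment: largest cosine similarity = smallest cosine dissimilarity.
        sims = C.T @ X                      # shape (k, n)
        labels = np.argmax(sims, axis=0)
        history.append(float(np.sum(1.0 - sims[labels, np.arange(n)])))
        # Update: each centre becomes the normalised sum of its members.
        for j in range(k):
            s = X[:, labels == j].sum(axis=1)
            norm = np.linalg.norm(s)
            if norm > 0:                    # keep the old centre otherwise
                C[:, j] = s / norm
    labels = np.argmax(C.T @ X, axis=0)     # final assignment for the final centres
    return labels, C, history
```

The returned `history` makes it easy to check on a given data set that the objective does not increase from one iteration to the next.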
In Big Data clustering, there are always many outliers due to the influence of practical problems. Generally speaking, the accuracy of K-means is greatly reduced when the data have outliers, mainly because the clustering centre is seriously affected by outliers. In order to solve this problem, Kaufman and Rousseeuw proposed using the medoid, a data point located at the centre of a cluster, as the clustering centre. This method is therefore called the K-medoids model. The algorithm process is shown in Figure 2.
The medoid minimises the average dissimilarity to all data points in the same cluster, so it is highly robust to noise. However, computing the medoid requires comparing the distances between all data points in a cluster before deciding which one is the medoid, which is much more expensive than computing a mean and results in low computational efficiency of the K-medoids model for high-dimensional data. In Big Data clustering it therefore offers good robustness but low efficiency. Partitioning Around Medoids (PAM) is the most effective algorithm for solving the K-medoids model. Given a randomly selected initial value, PAM replaces each cluster centre with a data point that reduces the objective function. This process is iterated until no medoid can be replaced. In each iteration, the computational complexity of PAM is $O(d(n-k)^2 k)$. In order to reduce the computational complexity, Park and Jun proposed a fast K-medoids clustering algorithm. This method calculates the distance matrix and uses it iteratively to calculate the new medoids. The computational complexity of this algorithm is consistent with K-means, which makes it widely applicable in Big Data clustering. However, when the data scale increases sharply, computing the distance matrix in the preprocessing step becomes too expensive.
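A distance-matrix-based K-medoids iteration in the spirit of the alternating scheme described above can be sketched as follows. It is an illustration only and is neither PAM nor Park and Jun's exact procedure: the function name `k_medoids`, the random initial medoids and the use of the Euclidean distance are assumptions made here.

```python
import numpy as np

def k_medoids(M, k, n_iter=100, seed=0):
    """Alternating K-medoids on the columns of M (d x n), Euclidean distance."""
    rng = np.random.default_rng(seed)
    n = M.shape[1]
    # Pairwise distance matrix (n x n), computed once; this is the costly
    # preprocessing step when n is large.
    sq = np.sum(M ** 2, axis=0)
    D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * (M.T @ M), 0.0))
    np.fill_diagonal(D, 0.0)
    # Assumption: initial medoids are k distinct data points chosen at random.
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size == 0:
                continue  # keep the old medoid for an empty cluster
            # New medoid: the member with the smallest total distance
            # to all other members of the same cluster.
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break  # no medoid changed
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    objective = float(D[np.arange(n), medoids[labels]].sum())
    return medoids, labels, objective
```

Because every centre is an index into the data set, only the distance matrix `D` is needed after preprocessing, which is why its size dominates the cost for large `n`.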
3 The Improved Algorithms of K-means on Feature Space
In order to make the concise and efficient K-means algorithm usable in Big Data clustering, we consider the improved K-means algorithm in the dimension-reduced space (feature space). Assume that the rank of the data matrix $M \in \mathbb{R}^{d \times n}$ is $r = \mathrm{Rank}(M) \le \min(d, n)$ and that M is decomposed as $M = U \Sigma V^T$ by singular value decomposition; then:
If the K-means problem and its extended model satisfy the following two conditions:
- (1) the cluster centre $c_j$ is a linear combination of all data points, and
- (2) the distance function dist is orthogonally invariant,
Without loss of generality, it can be assumed that the cluster centre in the original space M is $c_j = \sum_{l} w_{lj} m_l$, and
From the orthogonal invariance of the distance function dist, the following equation holds:
Therefore, the objective function of the K-means problem in the original space M is the same as that in the feature space.
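For the Euclidean distance, the argument above can be written out as a short worked derivation. The notation is an assumption introduced here: $U \in \mathbb{R}^{d \times r}$ is the thin SVD factor with $U^T U = I_r$, $\hat{m}_i$ denotes the $i$-th column of $\Sigma V^T$ (so that $m_i = U \hat{m}_i$), and $\hat{c}_j = \sum_l w_{lj} \hat{m}_l$ is the corresponding centre in the feature space.

```latex
\begin{align}
  c_j &= \sum_{l} w_{lj} m_l = U \sum_{l} w_{lj} \hat{m}_l = U \hat{c}_j, \\
  \| m_i - c_j \|_2^2
      &= \| U (\hat{m}_i - \hat{c}_j) \|_2^2
       = (\hat{m}_i - \hat{c}_j)^{T} U^{T} U (\hat{m}_i - \hat{c}_j)
       = \| \hat{m}_i - \hat{c}_j \|_2^2 .
\end{align}
```

Each term of the objective is therefore unchanged when the $d$-dimensional points are replaced by their $r$-dimensional representations.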
For the standard K-means problem, the spherical K-means problem and some K-medoids problems, the clustering in the original space is consistent with that in the feature space.
First of all, the clustering centres of the standard K-means problem and the spherical K-means problem are averages of data points, while the clustering centres of the K-medoids problem are themselves data points, so the first condition in Theorem 1 is satisfied.
Then, the Euclidean distance and the cosine dissimilarity are both orthogonally invariant. Therefore, as long as the Euclidean distance or the cosine dissimilarity is chosen as the distance function, the K-medoids problem satisfies the second condition of Theorem 1.
According to Theorem 1, this conclusion can be obtained.
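The orthogonal invariance of the two distance functions used in this argument can be checked numerically. The sketch below is illustrative only: the helper names `euclidean` and `cosine_dissimilarity` and the random test matrix `Q` are assumptions introduced here, with `Q` playing the role of the factor U from the singular value decomposition.

```python
import numpy as np

def euclidean(x, y):
    """Euclidean distance between two vectors."""
    return float(np.linalg.norm(x - y))

def cosine_dissimilarity(x, y):
    """One minus the cosine of the angle between two nonzero vectors."""
    return float(1.0 - (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y)))

rng = np.random.default_rng(0)
d, r = 200, 5
# Q has orthonormal columns (Q^T Q = I_r), taken from a reduced QR factorisation.
Q, _ = np.linalg.qr(rng.standard_normal((d, r)))
x_hat, y_hat = rng.standard_normal(r), rng.standard_normal(r)
x, y = Q @ x_hat, Q @ y_hat  # the same two points embedded in R^d

euclid_ok = np.isclose(euclidean(x, y), euclidean(x_hat, y_hat))
cosine_ok = np.isclose(cosine_dissimilarity(x, y),
                       cosine_dissimilarity(x_hat, y_hat))
print(euclid_ok, cosine_ok)
```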
Based on the above discussion, when using K-means to deal with Big Data clustering problems, we can first reduce the dimension of the data and then carry out the clustering analysis in the feature space. The algorithm process is shown in Figure 3.
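The two-step procedure (dimension reduction, then clustering in the feature space) can be sketched as follows for standard K-means. This is a minimal sketch under stated assumptions, not the authors' MATLAB code: the function names `to_feature_space` and `lloyd_kmeans`, the rank tolerance and the small synthetic test data are all introduced here for illustration. Both runs start from the same initial data points so that the results are comparable.

```python
import numpy as np

def to_feature_space(M, tol=1e-10):
    """Return Sigma V^T (shape r x n) from the thin SVD M = U Sigma V^T."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    r = int(np.sum(s > tol * s[0]))  # numerical rank (tolerance is an assumption)
    return s[:r, None] * Vt[:r, :]

def lloyd_kmeans(X, init_idx, n_iter=100):
    """Plain Lloyd iteration on the columns of X, started from given data points."""
    C = X[:, init_idx].astype(float)

    def assign(centres):
        # Squared distances, shape (k, n); direct form, fine for small examples.
        d2 = ((X[:, None, :] - centres[:, :, None]) ** 2).sum(axis=0)
        return np.argmin(d2, axis=0)

    for _ in range(n_iter):
        labels = assign(C)
        new_C = C.copy()
        for j in range(C.shape[1]):
            mask = labels == j
            if mask.any():
                new_C[:, j] = X[:, mask].mean(axis=1)
        if np.allclose(new_C, C):
            break
        C = new_C
    labels = assign(C)
    objective = float(((X - C[:, labels]) ** 2).sum())
    return labels, objective

# Small illustrative data set: d much larger than n, three separated clusters.
rng = np.random.default_rng(0)
d, n, k = 2000, 60, 3
true_centres = 5.0 * rng.standard_normal((d, k))
M = np.repeat(true_centres, n // k, axis=1) + rng.standard_normal((d, n))
init_idx = [0, 20, 40]  # same initial points in both spaces

labels_orig, obj_orig = lloyd_kmeans(M, init_idx)
F = to_feature_space(M)                      # at most n rows instead of d
labels_feat, obj_feat = lloyd_kmeans(F, init_idx)
print(F.shape, np.array_equal(labels_orig, labels_feat), obj_orig, obj_feat)
```

In this sketch each Lloyd iteration works on vectors of length at most n rather than d, at the price of one singular value decomposition up front; whether that trade pays off depends on how d compares with n.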
4 Numerical Experiment
In this section, we use artificial data and actual data to test the performance of the algorithm, mainly verifying two aspects: whether the objective function is consistent and whether the running time is reduced. All numerical experiments were run on a desktop computer with an Intel Core i7-3770 CPU at 3.40 GHz and 8 GB RAM under MATLAB R2017b.
We construct the following artificial data to verify the accuracy of the algorithm. Firstly, k points in d-dimensional space are randomly selected as clustering centres. Then, Gaussian-distributed points with variance σ are added around the clustering centres, so that the total number of data points is n. Finally, the n data points are randomly replaced. In order to test the influence of the data dimension, we let d vary from 1,000 to 50,000. The number of samples is n = 1,000, and the number of clusters is k = 10.
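The construction of the artificial data can be sketched as follows. Several details are assumptions made here because the text does not specify them: the centres are drawn from a standard normal distribution, the points are spread evenly over the clusters, and the final step is read as a random shuffle of the data points. The function name `make_synthetic` is hypothetical.

```python
import numpy as np

def make_synthetic(d, n=1000, k=10, sigma=1.0, seed=0):
    """Generate n points in R^d around k random centres (columns are points)."""
    rng = np.random.default_rng(seed)
    # Assumption: centres are drawn from a standard normal distribution.
    centres = rng.standard_normal((d, k))
    # Assumption: points are assigned to clusters in equal proportions.
    labels = np.arange(n) % k
    # sigma is a variance, so the standard deviation is sqrt(sigma).
    noise = rng.normal(scale=np.sqrt(sigma), size=(d, n))
    M = centres[:, labels] + noise
    # Assumption: the last step is a random shuffle of the data points.
    perm = rng.permutation(n)
    return M[:, perm], labels[perm], centres

M, labels, centres = make_synthetic(d=1000)
print(M.shape, centres.shape)
```

Varying `d` between 1,000 and 50,000 with `n=1000` and `k=10` mirrors the setting described in the text.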
Table 1 records the differences between the objective functions of K-means, spherical K-means and K-medoids in the original space and in the feature space. It can be seen that the clustering results in the original space and in the feature space are consistent for any initial value. Table 2 records the running time ratio of K-means, spherical K-means and K-medoids in the original space and the feature space. It can be seen that when the dimension d is small (compared with the number of samples), clustering in the feature space does not speed up the algorithm, because clustering in the feature space also requires computing the eigenvalue decomposition. However, as the dimension increases, K-means clustering in the feature space saves a lot of time (when d = 50,000, the running time of the algorithm is reduced by a factor of 44).
Table 1. Comparison of the objective functions on the artificial data
Table 2. Comparison of the running times on the artificial data
Next, we test the performance of the algorithm on image data and DNA data. These two kinds of data come from different fields, have different meanings and differ in scale, which helps to verify the advantages and disadvantages of our algorithm on different data. Table 3 records the information of the various data sets. The first three data sets are image data and the last four are DNA data.
- The AT&T ORL database consists of n = 400 cropped face images of d = 112 × 92 pixels from k = 10 different persons, with 40 sample images per person captured under different conditions. All images were taken against a dark homogeneous background with the subjects in an upright, frontal position.
- The Yale database consists of n = 165 cropped face images of d = 100 × 100 pixels from k = 15 different persons, each of whom has 11 images. The images cover different facial expressions or configurations, e.g. glasses, happy, normal, sad, sleepy, surprised, and wink.
- The COIL-20 database consists of n = 1,440 grey-scale images of d = 128 × 128 pixels from k = 20 different objects. The objects were placed on a motorised turntable against a black background. The turntable was rotated through 360 degrees to vary the object pose with respect to a fixed camera. Images of the objects were taken at pose intervals of 5 degrees.
- The CMD data consists of d = 7,129 dimensions with n = 60 cotton microsatellites from k = 2 different characters. In addition, CMD displays data for three of the microsatellite projects that have been screened against a panel of core germplasm. The standardised panel consists of 12 diverse genotypes including genetic standards, mapping parents, BAC donors, subgenome representatives, unique breeding lines, exotic introgression sources, and contemporary Upland cottons with significant acreage.
- The DLBCL data consists of d = 7,129 dimensions with n = 77 DLBCL patients from k = 2 different factors. The original data frame had more than 8,000 observations (rows) on three markers and contained measurements from biopsies of 30 DLBCL patients. Each sample was stained with three antibodies: CD3, CD5, and CD19.
- The LunG data consists of d = 1,000 dimensions with n = 197 lung cancer patients from k = 4 different factors.
- The Prostate data consists of d = 12,600 dimensions with n = 102 prostate cancer patients from k = 2 different factors. The original data set contained records of 97 men who had prostate cancer.
Table 3. The data information in the algorithm test
Table 4 records the differences between the objective functions of K-means, spherical K-means and K-medoids in the original space and in the feature space. It can be seen that the clustering results in the original space and in the feature space are consistent for any initial value. Table 5 records the running time ratio of K-means, spherical K-means and K-medoids in the original space and the feature space. It can be seen that the acceleration effect of the algorithm on the LunG data is not obvious, because the data dimension is not high (d = 1,000) and the number of samples is relatively large (n = 197). The algorithm has the best acceleration effect (more than 10 times) on the Prostate data, mainly due to the high dimension and the small number of samples.
Table 4. Comparison of the objective functions on the actual data
Table 5. Comparison of the running times on the actual data
In this paper, we address the low efficiency of the K-means algorithm caused by high-dimensional data. A clustering algorithm in the feature space is proposed to improve efficiency while preserving accuracy. We also point out that as long as the clustering centre and distance function satisfy certain conditions, K-means-type problems and the corresponding algorithms can be accelerated with our ideas. In addition, we demonstrate in detail that the pre-processing step (data dimensionality reduction) of our proposed method perfectly matches the clustering step (low-dimensional clustering) for high-dimensional Big Data problems.
The work described in this paper was supported by the Science and Technology Research Program of Chongqing Municipal Education Commission (Grant No. KJ1709207).
Y. Y. Tang, Y. Tao and E. C. M. Lam, (2002), New method for feature extraction based on fractal behavior, Pattern Recognition, 35, 1071–1081.
Y. Y. Tang, L. Yang and J. Liu, (2000), Characterization of dirac-structure edges with wavelet transform, IEEE Transactions on Cybernetics, 30, 93–109.
T. Zhang, B. Fang, Y. Yuan, Y. Y. Yang, Z. Shang and B. Xu, (2010), Generalized discriminant analysis: A matrix exponential approach, IEEE Transactions on Cybernetics, 40, 186–197.
Y. Y. Tang and X. You, (2003), Skeletonization of ribbon-like shapes based on a new wavelet function, IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 1118–1133.
T. Xie, P. Ren, T. Zhang and Y. Y. Tang, (2018), Distribution preserving learning for unsupervised feature selection, Neurocomputing, 289, 231–240.
T. Zhang, B. Fang, Y. Y. Tang, G. He and J. Wen, (2008), Topology preserving non-negative matrix factorization for face recognition, IEEE Transactions on Image Processing, 17, 574–584.
T. Zhang, Y. Y. Tang, Z. Shang and X. Liu, (2009), Face recognition under varying illumination using gradientfaces, IEEE Transactions on Image Processing, 18, 2599–2606.
J. Han, M. Kamber and J. Pei, (2011), Data mining: concepts and techniques, Morgan Kaufmann Press.
G. Sudipto, R. Rajeev and S. Kyuseok, (2001), CURE: An efficient clustering algorithm for large databases, Information Systems, 26, 35–58.
T. Xie and F. Chen, (2018), Non-convex clustering via proximal alternating linearized minimization method, International Journal of Wavelets, Multiresolution and Information Processing, 16, 13–25.
P. Hoyer, (2004), Nonnegative matrix factorization with sparseness constraints, Machine Learning Research, 9, 1457–1469.
D. D. Lee and H. S. Seung, (1999), Learning the parts of objects by nonnegative matrix factorization, Nature, 401, 788–791.
B. Ren, P. Laurent, G. B. Zhu and D. Gaspard, (2018), Nonnegative matrix factorization: robust extraction of extended structures, The Astrophysical Journal, 852, 104–121.
Y. X. Wang and Y. J. Zhang, (2013), Nonnegative matrix factorization: A comprehensive review, IEEE Transactions on Knowledge and Data Engineering, 25, 1336–1353.
D. Comaniciu and P. Meer, (2002), Mean shift: A robust approach toward feature space analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 603–619.
Y. Du, B. Sun, R. Lu, C. Zhang and H. Wu, (2019), A method for detecting high-frequency oscillations using semi-supervised k-means and mean shift clustering, Neurocomputing, 350, 102–107.
T. Duong, G. Beck, H. Azzag and M. Lebbah, (2016), Nearest neighbor estimators of density derivatives, with application to mean shift clustering, Pattern Recognition Letters, 80, 224–230.
D. Cai and X. Chen, (2015), Large scale spectral clustering via landmark-based sparse representation, IEEE Transactions on Cybernetics, 45, 1669–1680.
J. Shi and J. Malik, (2000), Normalized cuts and image segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 888–905.
M. Brbić and I. Kopriva, (2018), Multi-view low-rank sparse subspace clustering, Pattern Recognition, 73, 247–258.
E. Elhamifar and R. Vidal, (2013), Sparse subspace clustering: algorithm, theory and application, IEEE Transactions on Pattern Analysis and Machine Intelligence, 35, 2765–2781.
Y. Ma, A. Y. Yang, H. Derksen and R. Fossum, (2008), Estimation of subspace arrangements with applications in modeling and segmenting mixed data, SIAM Review, 50, 413–458.
H. Park and C. Jun, (2009), A simple and fast algorithm for K-medoids clustering, Expert Systems with Applications, 36, 3336–3341.
S. Yu, S. Chu, C. Wang, Y. Chan and T. Chang, (2018), Two improved K-means algorithms, Applied Soft Computing, 68, 747–755.
D. Arthur and S. Vassilvitskii, (2007), K-means++: the advantages of careful seeding, Society for Industrial and Applied Mathematics, 165, 1027–1035.
O. Bachem, M. Lucic, S. H. Hassani and A. Krause, (2016), Fast and provably good seeding for k-means, IEEE The 30th Conference on Neural Information Processing Systems, 2016, 76–85.
M. E. Celebi, H. A. Kingravi and P. A. Vela, (2013), A comparative study of efficient initialization methods for the k-means clustering algorithm, Expert Systems with Applications, 40, 200–210.
I. S. Dhillon, Y. Guan and B. Kulis, (2004), Kernel k-means, spectral clustering and normalized cuts, ACM International Conference on Knowledge Discovery and Data Mining, 2004, 551–556.
G. Ball and D. Hall, (1965), ISODATA, a novel method of data analysis and pattern classification, Stanford Research Institute Press.
S. A. E. Rahman, (2015), Hyperspectral imaging classification using ISODATA algorithm: big data challenge, IEEE the 5th International Conference on e-Learning, 2015, 271–280.
J. C. Dunn, (1973), A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters, Journal of Cybernetics, 3, 32–57.
I. S. Dhillon and D. S. Modha, (2001), Concept decompositions for large sparse text data using clustering, Machine Learning, 42, 143–175.
A. Banerjee, (2004), Clustering with Bregman Divergences, SIAM International Conference on Data Mining, 2004, 234–245.
Y. Linde, A. Buzo and R. Gray, (1980), An algorithm for vector quantizer design, IEEE Transactions on Communications, 28, 84–94.
J. Mao and A. K. Jain, (1996), A self-organizing network for hyper ellipsoidal clustering, IEEE Transactions on Neural Networks, 7, 16–29.
Online, (2019), ORL, http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html.
Online, (2019), Yale database, http://cvc.yale.edu/projects/yalefaces.html.
Online, (2019), COIL-20 database, ftp://zen.cs.columbia.edu/.
Online, (2019), CMD data, http://www.cottonssr.org/.
Online, (2019), DLBCL data, http://flowrepository.org/id/FR-FCM-ZZYY/.
Online, (2019), LunG data, http://biogps.org/dataset/tag/lung/.
Online, (2019), Prostate data, http://statweb.stanford.edu/~tibs/ElemStatLearn/prostate.data/.