Autonomous Sensor Data Cleaning in Stream Mining Setting

Abstract Background: The Internet of Things (IoT), earth observation and big scientific experiments are today sources of extensive amounts of sensor data. We are faced with large amounts of data obtained at low measurement cost. A standard approach in such cases is stream mining, which implies that each measurement is examined only once during real-time processing. This requires the methods to be completely autonomous. In the past, very little attention was given to the most time-consuming part of the data mining process, i.e. data pre-processing. Objectives: In this paper we propose an algorithm for data cleaning that can be applied to real-world streaming big data. Methods/Approach: We use a short-term prediction method based on the Kalman filter to determine admissible intervals for future measurements. The model can adapt to concept drift and is useful for detecting random additive outliers in a sensor data stream. Results: For datasets with low noise, our method performs better than the method commonly used in batch processing scenarios. Our results on higher-noise datasets are comparable. Conclusions: We have demonstrated a successful application of the proposed method in real-world scenarios including groundwater level, server load and smart-grid data.


Introduction
Big Data is a term that is used for datasets that are too large in size and complexity to be handled with the current methodologies (Fan et al., 2013). The meaning of this definition changes constantly with the development of technology and advances in computer science. However, translating the data analysis into a streaming on-line process is always considered a good approach. Stream mining exposes another benefit of the methodology: real-time responsiveness of the system, which has been identified as desirable by many different authors regarding reporting (Belfo et al., 2015), intrusion detection (Al Quhtani, 2017) and others.
The field has received a lot of attention. Many stream modelling (regression, classification, clustering etc.) and evaluation methods have been developed. However, some data mining process phases, as identified in the cross-industry standard process for data mining (CRISP-DM) methodology (Shearer, 2000), have been left aside (Kandel et al., 2011; Krempl et al., 2014). One of those phases is "Data preparation", which includes data cleaning and is crucial for real-world data mining applications (Zekić-Sušac et al., 2015).
Even in a classical data mining task, where all the data is available beforehand, practitioners claim that data preparation takes up to 80% of the time (Press, 2016). A lot of this work is done manually. In a stream mining scenario there is no possibility of constant human intervention, so all data pre-processing needs to be completely autonomous.
Data cleaning represents the first step in data pre-processing. It is a permanent challenge in data analytics. If it is skipped or badly performed, it can result in inaccurate predictions and, later, in unreliable business decisions. The issue has been tackled recently by both industry and academia, mostly to address the issues of scalability (Big Data), interfaces, new abstractions and statistical techniques (Chu et al., 2016).
The field of time-series analysis has been lively for a number of decades. Kalman published his work on linear filtering as early as 1960 (Kalman, 1960). His filter stands out due to the successful application of its equations to trajectory estimation in the NASA Apollo space program. Different applications have been reported since then, and the field of time-series analysis has been reinvented in step with advances in computer science and technology. In recent years many applications have been created for on-line streaming data analysis.
To the best of our knowledge, the usage of the Kalman filter for cleaning of streaming sensor data was first proposed in our work (Kenda et al., 2013). That paper proposed an algorithm for additive outlier detection in a stream mining setting using short-term prediction based on the Kalman filter. The very same idea has been proposed in (Xu, 2015), where it has been studied in depth and extended to a wider context. The authors named the methodology the time series Kalman filter (TSKF). The method has been improved in (Kenda et al., 2017), where we proposed the usage of an unsupervised machine learning approach for automatic parameter fine-tuning and tested the method on an artificial data set. In the current work we further extend the methodology by introducing an indirect modelling-based evaluation procedure and extensive testing on 5 real-world data sets.
Recent literature examines other potential Kalman filter extensions for data cleaning. For example, (Marczak et al., 2018) studies the usability of augmented Kalman filters (AKF).
The paper is structured as follows. The "Methodology" section describes the Kalman filter algorithm and how it was implemented in our methodology. In the "Results" section we provide an evaluation of our methodology on artificial and real-world datasets. We also describe the indirect evaluation procedure. Next, we discuss the usability of our methodology in real-world scenarios and compare it to the current state of the art in the batch setting. Finally, we conclude the paper.

Methodology
The notion of additive outlier
An additive outlier is a point outlier that occurs at a given timestamp and affects a single observation. In sensor data such outliers can be a consequence of a sudden change in ambient conditions, a communication glitch or some similar unexpected event. We assume that sensor measurements arrive much faster than the underlying signal changes.
We propose a method with short-term prediction based on previous measurements. The short-term prediction is compared to the new measurement, and the measurement is classified as an outlier if the difference exceeds a specified threshold. As proposed in (Kenda et al., 2013), we introduce a safeguard to overcome a potential instability of the algorithm: the threshold is enlarged in case the detected outlier is a false positive, which might be an indication of a sudden concept drift in the data.
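The detection rule and the safeguard can be illustrated with a minimal sketch. It is not the implementation from (Kenda et al., 2013): the function name `detect_outliers`, the naive predictor (the last accepted value) and the multiplicative `growth` factor are all assumptions made only to show the control flow.

```python
def detect_outliers(measurements, base_threshold, growth=2.0):
    """Flag measurements that deviate too far from a short-term prediction.

    Assumptions (not from the paper): the prediction is simply the last
    accepted value, and the safeguard multiplies the threshold by `growth`
    after every detection, resetting it once a measurement is accepted.
    """
    flags = []
    prediction = measurements[0]  # the first value is trusted
    threshold = base_threshold
    for value in measurements:
        is_outlier = abs(value - prediction) > threshold
        flags.append(is_outlier)
        if is_outlier:
            # Safeguard: widen the admissible interval so that a genuine
            # level shift (concept drift) is eventually accepted.
            threshold *= growth
        else:
            prediction = value  # accept the value and move the prediction
            threshold = base_threshold
    return flags


spike = detect_outliers([10.0, 10.1, 10.2, 15.0, 10.3, 10.4], base_threshold=1.0)
shift = detect_outliers([10, 10, 10, 20, 20, 20, 20], base_threshold=1.0, growth=4.0)
```

In the first series only the isolated spike is flagged. In the second series the first two values after the level shift are flagged, after which the enlarged threshold lets the new level through.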

Kalman Filter
The Kalman filter is a very suitable algorithm for data cleaning in a streaming scenario. It is an on-line algorithm that can produce short-term predictions and also calculate the error covariance matrix (used to calculate a threshold for outlier classification). The algorithm assumes that our process can be described as a Gauss-Markov process. Each internal state can be inferred through its observation, which is linked to the internal state via the observation matrix and is subject to Gaussian noise.

Figure 2 Kalman filter application cycle
Source: (Kenda, Mladenić, 2017)
In general, the transition and observation matrices can change over time, but in our case they remain the same, as we assume that the underlying process does not change through time. Kalman filter equations are depicted in Figure 2.
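For reference, the cycle can be written in the standard textbook form of the discrete Kalman filter. The notation is an assumption and may differ from the symbols used in Figure 2: $\hat{x}_k^-$ and $P_k^-$ are the a priori state estimate and its covariance, $z_k$ is the observation, $A$ and $H$ are the transition and observation matrices, $Q$ and $R$ are the transition and observation noise covariances, and $K_k$ is the mixing (gain) matrix.

```latex
\begin{align}
  K_k &= P_k^- H^T \left( H P_k^- H^T + R \right)^{-1}
      && \text{(mixing matrix)} \\
  \hat{x}_k &= \hat{x}_k^- + K_k \left( z_k - H \hat{x}_k^- \right)
      && \text{(state update)} \\
  P_k &= \left( I - K_k H \right) P_k^-
      && \text{(covariance update)} \\
  \hat{x}_{k+1}^- &= A \hat{x}_k
      && \text{(state projection)} \\
  P_{k+1}^- &= A P_k A^T + Q
      && \text{(covariance projection)}
\end{align}
```

Under this notation, the variance of the predicted observation is $H P_k^- H^T + R$, which is the quantity an admissible interval around the short-term prediction can be built from.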
The Kalman filter application cycle starts with the initialization of a priori estimates for the internal state and the covariance matrix. With each new observation the state and the covariance matrix get updated. The next phase is dedicated to short-term one-step-ahead prediction (projection). Finally, the optimal new mixing matrix gets calculated (responsible for optimal updating of the projected state with an observation). The noise matrix represents the variance of the normally distributed noise. The computational complexity of our implementation of the Kalman filter is O(n^3), where n is the dimension of the internal state space. In the proposed 2nd degree model the number of internal state components is n = 3.
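A minimal sketch of such a cycle for a three-component state is given below. It is an illustration under stated assumptions, not the authors' implementation: the state is taken to be (level, first derivative, second derivative) with a constant-acceleration transition matrix, only the level is observed, the noise levels `q`, `r`, `p0` and the `k_sigma` multiplier are hypothetical values, and a measurement flagged as an outlier is simply not used in the update.

```python
import math


def matmul(a, b):
    """Multiply two small matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


class KalmanCleaner:
    """2nd degree Kalman model (n = 3) with an admissible-interval test."""

    def __init__(self, x0, dt=1.0, q=1e-4, r=1e-2, p0=1.0, k_sigma=3.0):
        # Assumed transition matrix: level, slope and curvature.
        self.A = [[1.0, dt, 0.5 * dt * dt],
                  [0.0, 1.0, dt],
                  [0.0, 0.0, 1.0]]
        self.x = [x0, 0.0, 0.0]  # a priori state estimate
        self.P = [[p0 if i == j else 0.0 for j in range(3)]
                  for i in range(3)]  # a priori covariance
        self.q, self.r, self.k_sigma = q, r, k_sigma

    def step(self, z):
        """Process one measurement; return True if it is an outlier."""
        # The observation matrix is assumed to be H = [1, 0, 0], so the
        # predicted observation is x[0] with variance P[0][0] + r.
        s = self.P[0][0] + self.r
        residual = z - self.x[0]
        is_outlier = abs(residual) > self.k_sigma * math.sqrt(s)
        if not is_outlier:
            # Update: gain K = P H^T / s, then x += K r, P = (I - K H) P.
            gain = [self.P[i][0] / s for i in range(3)]
            self.x = [self.x[i] + gain[i] * residual for i in range(3)]
            row0 = self.P[0][:]
            self.P = [[self.P[i][j] - gain[i] * row0[j] for j in range(3)]
                      for i in range(3)]
        # Projection: x = A x, P = A P A^T + q I.
        self.x = [sum(self.A[i][j] * self.x[j] for j in range(3))
                  for i in range(3)]
        self.P = matmul(matmul(self.A, self.P), transpose(self.A))
        for i in range(3):
            self.P[i][i] += self.q
        return is_outlier


series = [10.0 + 0.01 * t for t in range(50)]
series[30] += 5.0  # one additive outlier
cleaner = KalmanCleaner(x0=series[0])
flags = [cleaner.step(z) for z in series]
outlier_positions = [i for i, flag in enumerate(flags) if flag]
```

On this noise-free ramp the only flagged position is the injected spike. The generic matrix products in the projection step are what give the cubic cost in the state dimension.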

Parameter Learning
Initialization of the Kalman filtering algorithm can be very demanding and there can be many free parameters involved, depending on the observation and transition matrix dimensions. Usage of the expectation maximization (EM) algorithm (Dempster et al., 1977; Xu, 2015) can yield estimates for the initial internal state of the system and the corresponding covariance matrices. A clean initial dataset is needed to obtain these parameters.
In our experiments with time series data the EM algorithm did not give good results (confidence in the last state was exaggerated), therefore we propose an additional data-oriented approach. EM calculates estimates of the following parameters: the a priori initial state, the transition covariance, the observation covariance and the initial state covariance. We propose multiplying the EM estimates with an additional factor in order to maximize the F1 score of outlier classification on a labelled dataset. The parameters can be obtained by a grid search over a predefined multiplier space.
Grid search is time-consuming, but it can find configurations that result in a much smoother model that better follows the underlying dynamic processes in the data. We have implemented exhaustive and randomized grid searches in our solution. The reported results are based on the randomized version.
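The randomized search over multipliers can be sketched as follows. The names `f1_score`, `random_multiplier_search` and `evaluate`, the multiplier values and the number of trials are hypothetical. In practice `evaluate` would run the cleaning algorithm with the candidate parameters on a labelled dataset and return the F1 score; here a toy objective stands in for it.

```python
import random


def f1_score(true_flags, predicted_flags):
    """F1 score of outlier classification from two lists of booleans."""
    pairs = list(zip(true_flags, predicted_flags))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def random_multiplier_search(em_estimates, evaluate, multipliers,
                             n_trials=50, seed=0):
    """Scale each EM estimate by a randomly chosen multiplier and keep
    the configuration with the highest score."""
    rng = random.Random(seed)
    best_params = dict(em_estimates)
    best_score = evaluate(best_params)
    for _ in range(n_trials):
        candidate = {name: value * rng.choice(multipliers)
                     for name, value in em_estimates.items()}
        score = evaluate(candidate)
        if score > best_score:
            best_params, best_score = candidate, score
    return best_params, best_score


# Toy objective standing in for "F1 on a labelled dataset": it is highest
# when the (hypothetical) observation covariance "r" equals 1.0.
best, score = random_multiplier_search(
    {"r": 0.01},
    evaluate=lambda params: -abs(params["r"] - 1.0),
    multipliers=[0.1, 1.0, 10.0, 100.0],
)
example_f1 = f1_score([True, False, True, False], [True, True, False, False])
```

An exhaustive grid search would replace the random draws with a loop over the Cartesian product of the multipliers, which grows quickly with the number of parameters.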

Streaming Sensor Data Platform with Data Cleaning
We propose the usage of the filter at the lowest possible level in the pre-processing platform. The data-cleaning component should be implemented at the entry point of a particular data source to the pre-processing platform (see Figure 3). Clean data is then inserted into the stream pre-processing engine, which is in charge of data enrichment and heterogeneous data fusion. Finally, this data is pushed into the appropriate stream modelling method. Cleaning at this level uses only autoregressive features. On a higher level, however, data cleaning that takes advantage of data fusion could be used.

Results
We tested our method on artificial and real-world data sets. The functionality of the algorithm is illustrated in Figure 4. It shows the impact of the Kalman filter's short-term prediction and its variance on additive outlier detection. A measurement (depicted in dark blue) that falls outside the admissible interval around the short-term prediction (depicted in light blue) is considered an outlier.

Results on Annotated Artificial Data Set
We provide an artificial dataset that follows the usual daily profile of a family of typical sensors. Each time-series in the dataset introduces a different level of Gaussian noise (μ = 0; σ). We have made the dataset publicly available at ResearchGate (Kenda, 2017). All data points are subject to noise, and 1% of data points have been considered as candidates for an additive outlier. The amplitude of additive outliers has been uniformly sampled on the interval from 0 to 0.714 · max(f(t)), where max(f(t)) is the maximum value of the underlying dynamics function. Amplitudes lower than 2 × σ have been dismissed.
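The generation procedure can be sketched as below. This is not the script used to produce the published dataset: the sinusoidal profile standing in for the underlying dynamics function, the `period` of 96 samples, the positive sign of the outliers and the function name `make_series` are assumptions. The 1% candidate rate, the 0.714 factor and the 2σ cut-off are taken from the description above.

```python
import math
import random


def make_series(n_points, sigma, seed=0, period=96):
    """Return (values, labels) for one noisy series with additive outliers."""
    rng = random.Random(seed)
    # Hypothetical daily profile f(t); the real profile is not specified here.
    clean = [1.0 + math.sin(2 * math.pi * t / period) for t in range(n_points)]
    max_amplitude = 0.714 * max(clean)
    values, labels = [], []
    for f_t in clean:
        value = f_t + rng.gauss(0.0, sigma)  # Gaussian noise, mean 0
        is_outlier = False
        if rng.random() < 0.01:  # 1% of points are outlier candidates
            amplitude = rng.uniform(0.0, max_amplitude)
            if amplitude >= 2 * sigma:  # too-small amplitudes are dismissed
                value += amplitude  # assumed: outliers are added upwards
                is_outlier = True
        values.append(value)
        labels.append(is_outlier)
    return values, labels


values, labels = make_series(5000, sigma=0.036, seed=1)
```

The returned labels provide the ground truth needed for computing precision, recall and F1 of a cleaning algorithm.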
Experimental results on the artificial set are depicted in Table 1. The data sets (from 1 to 9) introduce increasing levels of Gaussian noise, which makes it more and more difficult to correctly classify the outliers. This can be observed in the decreasing values of precision, recall and F1 in Table 1. As expected, the ARIMA (batch) method gives slightly better results than the Kalman (streaming) method. F1 scores are similar, whereas the ARIMA method is optimized towards better precision and the Kalman method towards better recall.
Figure 4 shows algorithm results on 2 different datasets: on the left with little noise (σ = 0.036), on the right with more noise (σ = 0.179). The Kalman filter's short-term prediction is depicted in orange, measurements in dark blue. Any measurement outside of the admissible light-blue interval (defined by the Kalman filter variance) is considered an outlier.

Results on Real-world Data Sets
There are two major problems concerning real-world sensor data sets: (1) these data sets are not annotated, therefore it is impossible to calculate proper accuracy measures of a data cleaning algorithm, (2) without accuracy measures it is also impossible to apply machine-learning techniques for parameter learning.
To overcome these shortcomings, we need to take a look at the characteristics of sensor data. We have observed in many sensor data sources that outliers are rare. Most of the data is clean. It is therefore easy to introduce artificial outliers into the original data and use such an augmented data set to solve problem (2). In this way we are able to learn adequate parameters for a successful application of the algorithm. Solving problem (1) is more difficult. We can apply human-based anomaly classification to the rare detected outliers, which enables us to calculate precision (is a detected outlier really an outlier?). The second method is to compare modelling performance (i.e. regression) between the clean and the original datasets.
We have analysed the performance of our method on 340 time-series data sets of groundwater levels from Slovenia. Results are depicted in Figure 5. The y-axis depicts groundwater levels in meters above sea level, the x-axis depicts the Unix timestamp. Figure 5(a) shows a smooth and clean time series, which is easy to model with the Kalman filter. The algorithm successfully identifies even bigger shifts in the groundwater levels. Figures 5(b) and (c) show sensors with more noise. The timestamps where potential outliers were detected are marked with a vertical red dotted line. We can observe two true positives (the first two outliers) and one probable false positive in Figure 5(b), which is a consequence of a fast change in the data that is difficult to model in an on-line setting. Similarly, we can notice one true and two false positives in Figure 5(c). Figure 5(d) depicts extreme errors in the data that get detected correctly, even in cases where more than one consecutive noisy measurement is present.

Indirect Evaluation of Data Cleaning with Modelling Results
Without a labelled dataset from real-world scenarios, we cannot directly estimate the effect of data cleaning. We therefore estimate the benefits of data cleaning by observing the improvements of machine learning models on the data. It has been previously shown that data cleaning can significantly improve model accuracy (Krishnan et al., 2016). We have compared the root mean squared error (RMSE) of ARIMA (1, 1, 0) models on raw and on cleaned datasets. A lower RMSE means a better fit of the model to the dataset.
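The comparison can be illustrated with a small self-contained sketch. It does not reproduce the fitting procedure used in the experiments: the function name `arima_110_rmse` is hypothetical, the model has no constant term, and the autoregressive coefficient is estimated by conditional least squares on the differenced series rather than by maximum likelihood.

```python
import math


def arima_110_rmse(series):
    """In-sample one-step-ahead RMSE of an ARIMA(1, 1, 0) model.

    The model is d_t = phi * d_(t-1) + e_t on first differences d_t,
    so the prediction error of y_t equals d_t - phi * d_(t-1).
    """
    diffs = [b - a for a, b in zip(series, series[1:])]
    pairs = list(zip(diffs, diffs[1:]))  # (d_(t-1), d_t)
    numerator = sum(prev * cur for prev, cur in pairs)
    denominator = sum(prev * prev for prev, _ in pairs)
    phi = numerator / denominator if denominator else 0.0
    errors = [cur - phi * prev for prev, cur in pairs]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


cleaned = [float(t) for t in range(20)]  # toy series after cleaning
raw = list(cleaned)
raw[10] += 5.0                           # the same series with one outlier
rmse_raw = arima_110_rmse(raw)
rmse_cleaned = arima_110_rmse(cleaned)
```

In this toy case the cleaned ramp is fitted exactly, while the single outlier in the raw series inflates the error. A series counts as improved by cleaning when its RMSE on the cleaned data is lower than on the raw data.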
Furthermore, we have developed a meta-classification algorithm for time-series to detect suitable candidates for which RMSE can be improved. Based on the meta-data obtained from the time-series (such as variance, mean data frequency, Kalman filter parameters, confidence of the Kalman model, etc.) and from the learning phase of the data cleaning algorithm (such as learning parameters, number of errors, length of data frame and cleaning model score), we were able to build a classifier that can predict whether a time-series can be modelled worse, better or equally well after cleaning. The classifier has been built using the random forests algorithm (Breiman, 2001).
Experiments have been conducted on 5 different datasets: (i) 340 time-series of groundwater levels in the Ljubljana region, (ii) 67 time-series from Yahoo! A1 Server Load (Yahoo! Webscope, 2015), (iii) 400 time-series from smart-grid observations (active power) in SW Slovenia, (iv) and (v) 100 synthetic time-series from the Yahoo! anomaly detection benchmark. Results are depicted in Table 2. The table presents KPIs related to the algorithm and the meta-classifier performance as follows. Improvement indicates the fraction of time-series with a better fit after cleaning (0.805 means that 80.5% of time-series benefited from the proposed data cleaning). RMSE ratio expresses the ratio of improvements of RMSE against the losses (443.6 indicates that RMSE is improved much more than it deteriorates in cases where data cleaning fails; this happens as groundwater data contains significant human-made errors). Precision, recall and F1 are standard classifier evaluation measures for our meta-classification algorithm.
The most illustrative are the results on the two synthetic datasets. On the first dataset (Yahoo! A2) our algorithm works perfectly, while on the second dataset (Yahoo! A3) it fails completely. The main difference between these two datasets is that the period in the first dataset is much longer and the noise is much lower. The same properties are illustrated on real-world datasets, where we see the best performance (80.5%) of the algorithm on the smart-grid dataset. The typical period in this dataset is one day and measurements are taken every 15 minutes. The groundwater (i) and server load (ii) datasets have a sampling interval much closer to the typical period (a significant change in the data can happen within a single sampling interval, e.g. groundwater can rise significantly in a day with a substantial amount of rainfall). The performance of our algorithm on these datasets is 51.3% and 53.0%, respectively.
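The two algorithm-related KPIs can be computed from per-series RMSE values as sketched below. The exact definition of the RMSE ratio is not spelled out in the text, so the reading used here (total RMSE reduction over the improved series divided by total RMSE increase over the worsened series) is an assumption, as is the function name `cleaning_kpis`.

```python
def cleaning_kpis(rmse_raw, rmse_clean):
    """Return (improvement, rmse_ratio) for paired per-series RMSE values."""
    gains = [raw - clean for raw, clean in zip(rmse_raw, rmse_clean)]
    improved = [g for g in gains if g > 0]    # RMSE decreased after cleaning
    worsened = [-g for g in gains if g < 0]   # RMSE increased after cleaning
    improvement = len(improved) / len(gains)  # fraction of series with better fit
    total_loss = sum(worsened)
    # Assumed definition: summed improvements against summed losses.
    rmse_ratio = sum(improved) / total_loss if total_loss else float("inf")
    return improvement, rmse_ratio


improvement, rmse_ratio = cleaning_kpis(
    rmse_raw=[2.0, 4.0, 1.0, 3.0],
    rmse_clean=[1.0, 1.0, 1.5, 3.0],
)
```

In this toy example two of the four series improve, and the summed improvement is eight times larger than the summed loss.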
The usability of the cleaning algorithm was further improved with a meta-classifier. Based on time-series metadata, the classifier is able to identify the data sources that are likely to improve with our algorithm, with a precision that is much higher than the improvement ratio (between 73.7% and 85.0%).

Discussion
As presented in the previous section, our algorithm achieves the best performance with a typical stream of sensor data, such as those found in the Internet of Things. In such scenarios sensor measurements are frequent and systematic changes in the data between measurements are small (the sampling interval is much shorter than the period). In comparison with the ARIMA methodology commonly used in batch data pre-processing (Chen et al., 1993), our method works better with lower-noise data. An obvious downside of the ARIMA methodology is that it requires fitting an ARIMA model to the whole dataset, which makes it unusable with data streams.
Our approach is applicable in any kind of streaming scenario. However, there are some additional restrictions that need to be considered. When testing on real-world datasets we have observed heterogeneous characteristics of sensor data with respect to noise, volatility and measurement intervals. When dealing with large and diverse sets of sensors (nowadays it is not unusual to have more than 10,000 sensors in a system, e.g. in a regional smart-grid system) it is not feasible to learn an individual cleaning model for each sensor, therefore some basic clustering of sensors into groups with similar properties is needed. Fine-tuning of the parameters can then be performed on a representative time-series only and applied to the whole cluster.
The efficiency of our methodology differs between datasets, depending on their characteristics. However, the efficiency of the algorithm can be further improved with a classification algorithm on top of time-series/learning-phase metadata, which is able to select suitable time-series for the data-cleaning algorithm. In this way we were able to achieve precision between 73% and 85%.

Conclusion
In this paper we have identified that efficient data pre-processing is very important in streaming data scenarios. We have focused on the first part of the data pre-processing pipeline: data cleaning. We briefly reviewed the state of the art in the field and proposed our own method based on the Kalman filter. The method has been quantitatively tested on an artificial data set. We have compared our method to the state-of-the-art ARIMA method and have obtained better results on the datasets with a lower noise ratio and comparable results on the datasets with a higher noise ratio. The main advantage of our method is that it can work with Big Data in a streaming scenario.
Additionally, we have applied our method to a heterogeneous set of real-world time-series. We have tested the efficiency of our cleaning method with an indirect approach, where we fitted an ARIMA model to raw data and to clean data and compared the respective error measures. The proposed data cleaning was shown to be beneficial on time-series whose properties resemble those of the majority of sensor streams available in the IoT domain. We also developed a meta-classification method which can predict the success of the data cleaning with 73%-85% precision. By observing differences between the Yahoo! A2 and Yahoo! A3 datasets we identified the major limitation of our algorithm. When changes in a time-series are rapid (i.e. if the period is short in comparison to the sampling interval), many valid measurements are classified as outliers and algorithm accuracy is low. Future work should therefore be directed at improving the Kalman filter parameter fine-tuning procedure, which should capture such behaviour. Additionally, the usability of the algorithm should be tested on different real-world datasets and in a production environment.

Figure 1 Diagram of Gauss-Markov process

Figure 3 Position of data-cleaning system within the stream-mining analytical platform

Figure 4 Illustration of the algorithm results with 2 different datasets: lower noise (left) and higher noise (right); measurements outside the admissible intervals are detected as outliers

Figure 5 Illustration of the algorithm results with underground water level dataset: (a) time-series without outliers, (b) and (c) time-series with true and false positive outliers, (d) time-series with obvious outliers

Table 2 Algorithm performance on unlabelled data and prediction of the meta-classifier regarding the success of the algorithm