Spatial–Temporal Anomaly Detection Algorithm for Wireless Sensor Networks

Traditional anomaly detection algorithms cannot effectively identify spatial–temporal anomalies in wireless sensor networks (WSNs). Taking the CO2 concentration obtained by WSNs as an example, we propose a spatial–temporal anomaly detection algorithm for WSNs. First, outliers are detected through an adaptive threshold. Then, the eigenvalue (the average) of each sliding window to be detected is extracted, a spatial–temporal matrix describing the relationship between neighboring nodes in the specified interval is constructed, a fuzzy clustering method is used to analyze and classify the eigenvalues of adjacent nodes according to their spatial–temporal correlation, and the probability of abnormal leakage is identified from the classification results. Finally, real datasets are used to verify the algorithm and analyze the selected parameters. The results show that the algorithm has a high detection rate and a low false positive rate.


Introduction
In recent years, wireless sensor networks (WSNs) have been applied in many fields, such as environmental and habitat monitoring, object and inventory tracking, health and medical monitoring, battlefield observation, and industrial safety and control [1]. However, the data measured and collected by WSNs are sometimes unreliable because of the resource constraints of the nodes or status changes in the monitored object. Abnormal data in sensor networks can be divided into abnormal points, called "outliers", and abnormal events, called "events" [2]. Abnormal points result from the resource limitations of WSNs and from sensor nodes operating in poor environments, which often lead to node failure and therefore to abnormal data [3]. Abnormal events [4] are often described as a series of abnormal values in a data stream.
At present, the methods widely used in sensor anomaly detection fall into several categories, including statistical models [5], adjacent-degree techniques, wavelet analysis [6], and clustering [7]. Methods based on statistical models are unsuitable for data that do not follow a normal distribution. Methods based on clustering depend on the number of clusters. Methods based on adjacent degree and wavelet analysis are complex. The traffic forecast model [8] uses the correlation coefficient between the predicted traffic sequence and the actual flow sequence for anomaly detection. The spatial-temporal correlation characteristics of sensor data were considered in [9], which used temporal and spatial correlations to identify local outliers; the local outliers are then aggregated at the sink to obtain the global outliers. This method detects only abnormal points, whereas abnormal sequence detection can help to reveal the abnormal events that occur. Time series anomaly detection is therefore more valuable; [10] proposed rapidly comparing the similarity of two time series based on Chebyshev coefficients to find abnormal time sequences. The literature has focused on outlier detection and on anomaly detection in a single time series of sensor data, overlooking the spatial-temporal characteristics of the sensors when an abnormal event occurs [11].
The objective of this study is to identify CO2 leakage by analyzing abnormal sensor readings. We analyze the spatial-temporal characteristics of each sensor's CO2 data stream when CO2 leaks [12] and then identify the abnormal leakage effectively. In this paper, we propose the spatial-temporal anomaly detection (STAD) algorithm for WSNs. First, the 3σ rule is used for anomaly detection with an adaptive threshold. Second, the Euclidean distance is employed to determine the neighbor nodes, and the mean of the time sequence within the sliding window is extracted as the eigenvalue. Then, a fuzzy similarity spatial-temporal matrix of the neighbor nodes is constructed. Afterward, the fuzzy clustering algorithm is used to identify the abnormal leakage probability. Finally, the algorithm is verified using a real dataset, and the detection rate (DR) and false positive rate (FPR) under different parameter settings are analyzed, providing references for future parameter selection.

Problem Description and Definition

Background
The greenhouse gas CO2, which is emitted by industrial production and human activities, is gradually causing global warming, and the earth's environment on which people rely is increasingly deteriorating. Carbon capture and storage (CCS) is a technology that can reduce the greenhouse effect of CO2 by storing the gas underground. The main risk of CCS is leakage. Thus, to monitor the safety of a CCS system, various monitoring technologies are used to establish a three-dimensional monitoring system, of which surface CO2 concentration monitoring is one. The research target of this paper is to identify leakage from the monitoring data collected by sensors.

CO2 Sensor
According to the gas diffusion guidance in the guiding principles for environmental evaluation, when leakage occurs, the concentration of CO2 is mainly affected by the wind speed, wind direction, and other weather conditions. Therefore, after a comprehensive consideration of the system monitoring requirements, we selected sensors for temperature, humidity, wind speed, wind direction, and CO2 concentration. The CO2 concentration monitoring device is shown in Figure 1.

Experimental Set-up
To identify the spatial-temporal characteristics of CO2 leakage, we placed eight sensors equidistant from the leakage source, each at the same height as the source. The layout of the monitoring sensors is shown in Figure 2. Figure 3 shows that, when leakage occurs, the concentration levels of only some of the sensors change significantly, and these concentrations fluctuate rather than increase continuously, while the concentration differences detected by the remaining sensors are minimal.

Particularity of CO2 Anomaly Judgment
Many scholars provide different definitions of spatial-temporal anomaly. Anomaly detection on the CO2 concentration data obtained by WSN monitoring differs slightly from general spatial-temporal anomaly detection. First, the sensor data belong to a data stream. Second, the CO2 concentration is decided by diffusion, and the diffusion of CO2 is influenced by wind speed, wind direction, and other factors. Moreover, the responses of different sensors vary. Therefore, the particularity of a CO2 data stream is as follows:
1) Data stream The data stream has many features, such as large and continuous volume, rapidity, unpredictability, infrequent scanning, and concept drift [13]. In general, researchers have proposed landmark window, sliding window, and attenuation models, which differ in the time range they cover, to reduce storage and computational costs.
2) Abnormality feature Abnormal leakage cannot be effectively identified by analyzing the time series of a single sensor at a certain moment, or by analyzing the data of one adjacent sensor, because a single-sensor abnormality may be caused by equipment failure. In addition, the response of each sensor to concentration is different. In general, the abnormality caused by CO2 leakage is global, persistent, and fuzzy.

Definition of CO2 Leakage Anomaly
Definition 1. Sliding window: We chose a sliding window model to represent the data stream. Assuming that the window length is W, we used W as the time interval. The observations within W can be expressed as the time series $S_W = \langle s_1 = (c_1, t_1), s_2 = (c_2, t_2), \ldots, s_w = (c_w, t_w) \rangle$, where $s_i$ represents the value $c_i$ at the moment $t_i$. The schematic for the sliding window is shown in Figure 4.

Definition 3. CO2 leak anomalies: We determined the eigenvalue of the sliding window and the neighbor nodes of the sensor to be detected, and obtained the class of each node from the fuzzy clustering algorithm. The probability of abnormal leakage is represented as the proportion of all nodes that fall in the same class as the node to be detected, and it is compared with the threshold T as follows:
$$P = \frac{\mathrm{Count}(C)}{n} \geq T \qquad (1)$$
where Count(C) represents the number of sensor nodes in the same class as the node to be detected, n is the number of nodes, and T is the threshold. Sensor nodes are evenly distributed, so about 50% of the sensors, those downwind, are affected; therefore, on the basis of prior experience, we set the value of T to 50%.
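Definition 3 can be illustrated with a minimal sketch. The function names and the cluster labels below are hypothetical; the sketch assumes that the node under test counts toward its own class and that a ratio equal to T already qualifies as a leak, neither of which is stated explicitly in the text.

```python
def leak_probability(labels, target):
    """Share of all nodes that fall in the same class as the node under test.

    labels: dict mapping node id -> cluster label (output of the clustering step).
    Assumption: the target node itself is counted in Count(C).
    """
    same_class = sum(1 for label in labels.values() if label == labels[target])
    return same_class / len(labels)


def is_leak(labels, target, threshold=0.5):
    """Compare the ratio with the threshold T (50% in the paper's setting)."""
    return leak_probability(labels, target) >= threshold


# Illustrative labels only: devices 02, 04, 05, and 08 share one class.
example_labels = {
    "01": 1, "02": 0, "03": 1, "04": 0,
    "05": 0, "06": 2, "07": 2, "08": 0,
}
print(leak_probability(example_labels, "02"))  # 4 of 8 nodes share the class
print(is_leak(example_labels, "02"))
```

With eight nodes and T = 50%, at least four nodes must share the class of the node under test before a leak is reported.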

Algorithm Definition and Main Ideas
Considering the particularity of CO2 leakage, we adopted a fuzzy clustering algorithm [14] to analyze the spatial-temporal correlation of the measurements of each sensor and thereby identify anomalies effectively. Taking into account the lightweight requirements of sensor anomaly detection, we divided the algorithm into two stages. The first stage uses the sliding window to identify the abnormal points. In the second stage, the neighbor nodes are determined, the eigenvalue of each sliding window is extracted, and a fuzzy characteristic matrix is built from the sliding-window eigenvalues of the node and its neighbor nodes. Fuzzy clustering is then used to identify the abnormal leakage probability. The process of anomaly detection is shown in Figure 5.

Abnormal Point Determination of Time Sequence
The CO2 data stream exhibits a strong seasonal feature, so the detection threshold must be adaptive to remain accurate. By analyzing the time-varying characteristics and distribution of the CO2 monitoring data, we concluded, on the basis of Chebyshev's law of large numbers and the central limit theorem, that the CO2 concentrations of the data stream within a fixed sliding window conform to a normal distribution (proof omitted). The threshold for a newly arrived observation in the sliding window can therefore be determined by the 3σ rule: an observation $c_i$ is judged abnormal if $|c_i - \mu| > 3\sigma$, where $\mu$ and $\sigma$ are the mean and standard deviation of the observations in the window. Under the 3σ rule, the threshold changes with the mean and standard deviation and is therefore strongly adaptive.
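The adaptive threshold can be sketched as follows. This is a minimal illustration of a three-sigma check over a fixed-length sliding window, not the authors' implementation; the class name is hypothetical, and the choices to use the population standard deviation, to wait until the window is full, and to keep flagged values out of the window are assumptions.

```python
from collections import deque
from statistics import mean, pstdev


class SlidingWindowDetector:
    """Flags a reading that deviates from the window mean by more than 3 sigma."""

    def __init__(self, length):
        # deque with maxlen drops the oldest reading automatically.
        self.window = deque(maxlen=length)

    def check(self, value):
        """Return True if `value` is an outlier relative to the current window."""
        is_outlier = False
        if len(self.window) == self.window.maxlen:
            mu = mean(self.window)
            sigma = pstdev(self.window)
            is_outlier = abs(value - mu) > 3 * sigma
        if not is_outlier:
            # Assumption: flagged readings do not update the threshold.
            self.window.append(value)
        return is_outlier


detector = SlidingWindowDetector(length=5)
readings = [400, 402, 398, 401, 399, 400.5, 450]  # illustrative ppm values
flags = [detector.check(r) for r in readings]
print(flags)
```

Because the mean and standard deviation are recomputed as the window slides, the threshold follows slow seasonal drift while still reacting to abrupt jumps.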

Spatial-temporal Abnormal Judgment
As shown in the preceding analysis, leakage determination has its own particularities, and the eigenvalues of multiple sensors are needed for the spatial and temporal correlation analysis. The selection of the eigenvalue, the determination of the neighbor nodes, and STAD based on fuzzy clustering are discussed as follows.
1) Selecting the abnormal eigenvalue To determine the sequence similarity between the sensor under anomaly detection and its adjacent nodes, the distance between the corresponding measurements of the nodes was used in [2]. However, we observed that the changes in the sensor readings caused by CO2 leakage do not have a one-to-one relationship, as shown in Figure 6. Figure 6 shows that directly using the observed values can cause a large detection error. Considering the characteristics of the observed values under CO2 leakage, we chose the mean concentration to describe how the observations change within the sliding window, which smooths the influence of instantaneous changes to a certain degree.
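As a minimal sketch of this eigenvalue extraction, the helper below reduces each sensor stream to one mean per window. The function names are hypothetical, and the use of consecutive, non-overlapping windows is an assumption, since the text does not say whether windows overlap.

```python
def window_means(series, length):
    """Mean of each consecutive, non-overlapping window of `length` samples.

    A trailing partial window is dropped.
    """
    n_windows = len(series) // length
    return [
        sum(series[i * length:(i + 1) * length]) / length
        for i in range(n_windows)
    ]


def feature_matrix(streams, length):
    """Map each node id to its list of window means (one row per node)."""
    return {node: window_means(values, length) for node, values in streams.items()}


# Illustrative readings for two nodes, window length 3.
streams = {
    "01": [400, 401, 402, 403, 404, 405],
    "02": [400, 460, 430, 500, 470, 530],
}
print(feature_matrix(streams, 3))
```

The second node fluctuates strongly from sample to sample, but its window means rise steadily, which is the smoothing effect described above.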
2) Determining the neighbor node A voting decision was used to identify the neighbor node in [15], and a Voronoi diagram was used to determine the adjacent node in [16]. To simplify the calculation, in this paper we treat a node as a neighbor if its Euclidean distance to the node to be detected is less than a fixed value K, which is set according to the distance between sensors. Let the sensor node to be detected be O, with coordinates (x, y), and let the candidate neighbor set be $X = \{X_1, X_2, \ldots, X_n\}$, where $X_i$ has coordinates $(x_i, y_i)$. The distance between the nodes $X_i$ and O is defined as follows:
$$\mathrm{dist}(X_i, O) = \sqrt{(x_i - x)^2 + (y_i - y)^2} \qquad (2)$$
If $\mathrm{dist}(X_i, O)$ is less than K, then $X_i$ is a neighbor node of O.
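The distance test for neighbor nodes can be sketched in a few lines. The function name, node ids, and coordinates are hypothetical; the strict "less than K" comparison follows the text.

```python
import math


def neighbors(target, positions, k):
    """Return the ids of nodes whose Euclidean distance to `target` is below k.

    positions: dict mapping node id -> (x, y) coordinates.
    """
    ox, oy = positions[target]
    return [
        node
        for node, (x, y) in positions.items()
        if node != target and math.hypot(x - ox, y - oy) < k
    ]


# Illustrative layout: distances from O are 5, 10, and about 1.41.
positions = {"O": (0, 0), "A": (3, 4), "B": (6, 8), "C": (1, 1)}
print(neighbors("O", positions, 5.5))
print(neighbors("O", positions, 5.0))
```

A node lying exactly at distance K is excluded, so K should be chosen slightly larger than the sensor spacing if nodes at that spacing are meant to count as neighbors.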
3) STAD based on fuzzy clustering The data collected by sensor nodes tend to have certain spatial correlations. Generally, spatial correlation means that the data of nodes in close physical proximity change in approximately the same way. However, the spatial correlation caused by CO2 leakage has a certain particularity because CO2 does not diffuse evenly in the atmosphere. In principle, it spreads downwind, but because of atmospheric turbulence, the wind is not stable and the concentration response of each sensor differs. No strict mathematical formula is available to determine the leakage, and the leakage status cannot be identified strictly from the distances between neighbor nodes. We therefore combine fuzzy theory with the spatial-temporal variation characteristics of CO2 diffusion to judge the leakage probability. This study uses fuzzy clustering to conduct the STAD. The steps are described briefly as follows:
a) The data are preprocessed.
b) The coefficient of similarity between samples or variables is calculated, and the fuzzy similarity matrix is constructed.
c) Fuzzy operations are used to transform and synthesize the similarity matrix, generating the fuzzy equivalence matrix.
d) Fuzzy clustering is conducted by cutting the fuzzy equivalence matrix at different threshold levels.
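Steps a) to d) can be sketched with the classical transitive-closure approach to fuzzy clustering. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, preprocessing is assumed to be min-max scaling, and the similarity coefficient is assumed to be one minus the mean absolute difference, which stands in for the paper's own coefficient.

```python
def normalize(features):
    """a) Min-max scale each column (one column per window) to [0, 1]."""
    scaled_cols = []
    for col in zip(*features):
        lo, span = min(col), max(col) - min(col)
        scaled_cols.append([(v - lo) / span if span else 0.0 for v in col])
    return [list(row) for row in zip(*scaled_cols)]


def similarity_matrix(rows):
    """b) Assumed coefficient: 1 - mean absolute difference between two rows."""
    m = len(rows[0])
    return [
        [1.0 - sum(abs(a - b) for a, b in zip(r, s)) / m for s in rows]
        for r in rows
    ]


def transitive_closure(r):
    """c) Repeated max-min composition until the matrix stops changing."""
    n = len(r)
    while True:
        squared = [
            [max(min(r[i][k], r[k][j]) for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
        if squared == r:
            return r
        r = squared


def cut_clusters(equiv, level):
    """d) Nodes i and j share a cluster when equiv[i][j] >= level."""
    labels = [-1] * len(equiv)
    next_label = 0
    for i in range(len(equiv)):
        if labels[i] == -1:
            for j in range(len(equiv)):
                if equiv[i][j] >= level:
                    labels[j] = next_label
            next_label += 1
    return labels


# Illustrative window means: two quiet nodes, two nodes with raised readings.
features = [[400, 401], [402, 400], [480, 500], [470, 490]]
equiv = transitive_closure(similarity_matrix(normalize(features)))
for level in (0.1, 0.8, 0.95):
    print(level, cut_clusters(equiv, level))
```

In this example a very low cut level merges every node into one class, while higher levels separate the nodes with raised readings and eventually split them from each other, so the cut level controls how fine the classification is.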

Steps of the Algorithm

Overview of the Algorithm
The 3σ rule is applied to each sensor data stream to detect the outliers. After the outliers are found, formula (2) is used to determine the neighbor nodes, and the correlation coefficients of the eigenvalues are calculated by formula (4). Then, whether an abnormal mode has occurred is judged according to Definition 3.

Steps of the Algorithm
1) Outlier detection
Algorithm input: the point $C_i$ to be detected
Algorithm output: whether the point $C_i$ is an outlier
The algorithm steps are as follows:
a) The information of each sliding window is obtained.
c) The observed value of $C_i$ to be detected is read. If it falls outside the threshold given by the 3σ rule, it is judged as an abnormal point.
2) Abnormal pattern recognition
Algorithm input: the coordinate position (x, y) of the abnormal sensor; the number of sliding windows m; and the threshold value
Algorithm output: abnormal leakage
a) Formula (2) is used to determine the neighbor nodes.
b) The fuzzy characteristic matrix is established from the n nodes and the eigenvalues of the m sliding windows.
c) A series of transformations converts the fuzzy characteristic matrix into a fuzzy equivalence matrix.
d) The fuzzy equivalence matrix is classified according to the threshold value.
e) The probability of abnormal leakage is determined according to formula (1).

Experimental Verification and Analysis
Because no standard database is currently available for CO2 leakage testing, we verified the effectiveness of the STAD algorithm in this study by analyzing its detection results on real CO2 leakage datasets.

Experimental Setup Description
The experimental dataset consists of leakage data obtained from a field simulation of CO2 leakage.

Data Processing
Taking a sliding window length L of 10 as an example, the number of windows to be detected M is 10 and the number of nodes n is 8. The classification results of the fuzzy clustering are shown in Table 1. The results in Table 1 show that the classification is ineffective when the threshold value is too small; the greater the threshold value, the more accurate the classification.
When L is 10 and M is 10, the test results are the same for threshold values of 0.8 and 0.9. To compare the detection results, we evaluate the performance of the algorithm by the accuracy of the classification results. When the number of classes is 3, the abnormal sensor nodes in this experiment are devices 02, 04, 05, and 08. We adopted the DR and FPR, which are commonly used in anomaly detection, as indices to measure the performance of the algorithm.
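The two indices can be computed as in the sketch below, which assumes the usual definitions: DR is the share of truly abnormal nodes that are detected, and FPR is the share of normal nodes that are wrongly flagged. The function name and the detected set are hypothetical; only the abnormal devices 02, 04, 05, and 08 come from the experiment described above.

```python
def detection_rates(actual_abnormal, detected, all_nodes):
    """Return (DR, FPR) for one detection run.

    DR  = detected abnormal nodes / all abnormal nodes
    FPR = normal nodes flagged as abnormal / all normal nodes
    """
    actual = set(actual_abnormal)
    flagged = set(detected)
    normal = set(all_nodes) - actual
    dr = len(actual & flagged) / len(actual) if actual else 0.0
    fpr = len(flagged - actual) / len(normal) if normal else 0.0
    return dr, fpr


all_nodes = ["01", "02", "03", "04", "05", "06", "07", "08"]
actual_abnormal = ["02", "04", "05", "08"]
# Hypothetical detector output: misses device 08 and wrongly flags device 01.
detected = ["01", "02", "04", "05"]
print(detection_rates(actual_abnormal, detected, all_nodes))
```

With only four abnormal and four normal nodes, each missed or wrongly flagged node shifts the corresponding rate by 25 percentage points, which is worth keeping in mind when reading results from small deployments.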

Result Evaluation
1) DR The complexity of the algorithm is determined by the length and the number of the sliding windows included in the calculation. Thus, we analyzed the length and number.
First, we analyzed the length of the sliding windows, because this factor increases the complexity of the calculation. Our analysis shows that 100% of the anomalies can be identified when the sliding window length L is 10 and the number of windows M is 10. Therefore, to reduce the computational complexity and compare the DR of the algorithm, we decreased the sliding window length to 8 and 6. The DR was 100% when the lengths of the sliding windows were 10 and 8, and it dropped to 75% when the length was changed to 6. This finding shows that the computational complexity is reduced at the expense of the DR.
Second, we analyzed the number of sliding windows. The following compares the DR when the number of sliding windows ranges from 8 to 12 and the window lengths are 6, 8, and 10. Figure 9 shows that the DR of the event anomaly detection based on fuzzy clustering is high overall. Figure 9(a) shows that the DR decreased slightly when the window length is 6. Figure 9(b) and 9(c) show that the DR can reach 100% when M is sufficiently large.
2) FPR Figure 10 shows that the FPR of the event anomaly detection based on fuzzy clustering is nearly 0 when the number of windows is larger than 10; when the number of windows is 8, the FPR is higher. The FPR increases significantly when the length of the sliding windows is 6.
In summary, the algorithm achieves a high DR and a low FPR overall for event anomaly detection. For anomaly detection under these experimental conditions, considering the requirements on both the DR and the FPR, we suggest that the window length should be 8 or greater and the number of detected windows should be 10 or greater.

Conclusion
The traditional detection method, which neglects the features of the observed values and usually adopts a static threshold, may cause the FPR to be too high. Considering the temporal and spatial characteristics of CO2 leakage, we proposed a STAD algorithm based on fuzzy clustering. The algorithm is divided into two stages. First, the abnormal points of every sensor are identified using the 3σ rule. Second, the eigenvalues of the sliding windows are extracted to build a model based on the fuzzy equivalence matrix, which yields the classification results under different thresholds. This method allows the abnormal leakage probability to be identified. The algorithm extends the application scope of the fuzzy clustering algorithm. The experimental results show that the algorithm has a high DR and a low FPR. Because of the limited experimental conditions, the number of simulation nodes is small and the parameter selection is simple, so the performance of the method has been verified only initially. These findings should be verified on platforms with a larger number of sensor nodes.