Implementation of Integration VaaMSN and SEMAR for Wide Coverage Air Quality Monitoring

Current air quality monitoring systems cannot cover a large area, do not operate in real time, and have not adopted Big Data analytics with high accuracy. This work integrates a mobile sensor network with an Internet of Things system to build an air quality monitoring system with wide coverage. The system consists of Vehicle as a Mobile Sensors Network (VaaMSN) for edge computing and Smart Environment Monitoring and Analytic in Real-time (SEMAR) for cloud computing. VaaMSN is a package of air quality sensors, GPS, a 4G Wi-Fi modem and a single-board computer. SEMAR cloud computing provides a time-series database for real-time visualization and a Big Data environment, and its analytics use the Support Vector Machine (SVM) and Decision Tree (DT) algorithms. The system outputs map, table and graph visualizations. The experimental results show that the accuracy of both algorithms exceeds 90%. The Mean Squared Error (MSE) of the SVM algorithm is about 0.03076293, while the MSE of the DT algorithm is roughly ten times smaller.


Introduction
As chemical industrial construction, economic development and production activity increase, environmental pollution accidents become more likely, and air pollution accidents in urban areas become more frequent [1]. Indonesia is the 6th largest greenhouse gas emitter in the world (IEA 2015); 40% of its energy emissions come from transportation, and 90% of transportation emissions come from road transportation. In 2010, the population of Jakarta was 9,607,787, and 57.8% of the population suffered from various air pollution-related diseases. The resulting total health cost reached up to 38.5 trillion IDR/USD54 billion [2]. These diseases arise because no air quality monitoring system covers a large area in enough detail to identify the areas with poor air quality. The government measures air conditions with sensors installed at fixed-position air monitoring stations. This system covers only a small area rather than the whole city, and its maintenance and repair are difficult. It still uses a conventional database instead of a Big Data environment, and it relies on operators to maintain and protect the sensor devices. Indonesian air pollution rules define 5 air quality parameters: carbon monoxide (CO), nitrogen dioxide (NO2), sulfur dioxide (SO2), ozone (O3) and particulate matter (PM10) [3]. These parameters have an impact on human health.
One ongoing line of research on real-time environmental monitoring and analysis integrated with Big Data technology is the Smart Environment Monitoring and Analytic in Real-time System (SEMAR) [4][5][6]. SEMAR can be integrated with a water quality monitoring system on an ROV (a small robot submarine) [4]; that research was developed to solve the government's problem of monitoring river conditions. A water quality sensor was mounted on the ROV to take water quality samples and send the results to the server. The SEMAR server used Big Data storage to collect and save the sensor data on a Hadoop server [5]. The researchers also built a SEMAR extension for real-time water quality monitoring that integrates the Internet of Things (IoT) and Big Data analytics [6]. The resulting data are classified using a machine learning algorithm. In addition to the SEMAR research, there are other studies on applying the Internet of Things to environmental monitoring, namely research on location-based environmental monitoring that uses GPS sensors and Big Data technology to detect coral reef damage [7], and research on integrating a wireless sensor network with cloud computing to monitor the water quality environment in collaboration with smart aeration systems [8]. However, these systems are only implemented for water environment monitoring and are not used for monitoring over an extensive area. Therefore, the SEMAR extension can be improved to cover other kinds of environmental monitoring, such as air quality monitoring, and can be combined with a mobile sensor [9] to enlarge the monitored area. This paper has five sections, organized as follows: Section 1 introduces the research, Section 2 presents related works and previous studies, Section 3 presents the details of the system design used in our research, Section 4 presents the results and discussion of the experiments and the implementation, and Section 5 presents the conclusions, including recommendations for future work to extend the project.

Related Works
Previous works on smart air quality monitoring systems have used various technologies, which differ in the protocol they use and in whether they operate in real time. In 2016, researchers from India studied the implementation of the Internet of Things for smart pollution detection with the AWS IoT cloud, with an air quality sensor and GPS installed on a vehicle [10]. That system is used for tracking and detecting air pollutants in urban areas, but it does not use machine learning or any classification.
Mobile Enterprise Sensor Bus (M-ESB) is a research project from China for sensing the urban environment, such as road conditions and air pollution. M-ESB sends the data from sensors installed on a bus to a server, where they are stored in a database; the output of the system is displayed as electronic maps on a website [11]. In 2017, Zhihan Lv and team [12] studied the challenges of Big Data analytics and future topics in Big Data development. Their results show that the trend in IoT-Big Data platforms is for data retrieval to focus more and more on streaming and multi-sensor data, with MapReduce and machine learning algorithms used as the analysis methods. Our research also uses an IoT-Big Data platform. Figure 1 shows the IoT reference model [13], which consists of seven layers and serves as the reference for the design of our system. Following this model, the system is grouped into physical devices and controllers (1), connectivity (2), edge computing (3), data accumulation (4), data abstraction (5), application (6), and collaboration and processes (7). Figure 2 shows the overall system design, divided into layers according to the IoT reference model [13] in Figure 1.

Research Method
The following subsections briefly explain the implementation according to the layers of the IoT reference model.

Physical Devices and Controller
Physical devices and controllers are the part of the IoT architecture used to detect air quality conditions. The device is equipped with air quality sensors: a Shinyei PPD42 Particulate Matter Detector for particulates, MQ7 for carbon monoxide, MQ135 for sulfur dioxide, MQ131 for ozone and MiCS 2714 for nitrogen dioxide. The sensors are controlled by a microcontroller [14]. Data collection starts with air quality measurement by the sensors, based on a wireless sensor network [15]; wireless sensor networks are used because they can be deployed in urban areas [16]. The controller collects the data and converts them to the air quality unit ug/m3. The converted data are sent over the Wi-Fi network to another device, which processes them and forwards them to the server. In this research, the physical devices and controllers are installed on top of the vehicle and retrieve data from the sensors every 5 seconds.

Connectivity
Physical devices and controllers send data to the edge computing layer for processing over the Wi-Fi signal of a 4G modem. The 4G modem also connects the Single Board Computing (SBC) unit in the edge computing layer to cloud computing. The throughput of this 4G modem is about 20 Mbps.

Edge Computing
The smart car hub, which acts as the edge computing layer [17], consists of an SBC, a camera, GPS, a display and a 4G Wi-Fi modem. The data collected by the sensors are combined with GPS data in JSON format and then sent through the MQTT protocol. In each cycle, the SBC reads the sensor data, reads the GPS sensor, adds the sensor data and GPS data to the JSON format, and sends the data to the server using MQTT (topic, data). The SBC is the main controller that connects the VaaMSN system with the cloud computing system. Other devices are connected to it, such as the camera, which takes pictures of the conditions around the car, and the display screen, which serves as the user interface. In this research, a Raspberry Pi 3 Model B is used as the SBC. The Raspberry Pi sends a line of air quality data through the topic 'airsensor' to the MQTT broker on cloud computing. Each transmitted line has the format 'air quality data, latitude, longitude and current time'.
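The read, combine and send cycle on the SBC can be sketched in Python as follows. This is a minimal sketch, not the authors' code: the field names inside the reading and the `publish` callable are assumptions made for illustration, and only the topic name 'airsensor' comes from the text.

```python
import json
from datetime import datetime, timezone

TOPIC = "airsensor"  # topic name taken from the text


def build_payload(sensor, lat, lon, now=None):
    """Combine one sensor reading with a GPS fix into a JSON string."""
    now = now or datetime.now(timezone.utc)
    record = dict(sensor)             # air quality data (field names are assumptions)
    record["latitude"] = lat
    record["longitude"] = lon
    record["time"] = now.isoformat()  # current time, ISO 8601
    return json.dumps(record)


def send_reading(publish, sensor, lat, lon, now=None):
    """Publish one reading; `publish` is any callable taking (topic, data)."""
    payload = build_payload(sensor, lat, lon, now)
    publish(TOPIC, payload)
    return payload
```

Passing the publisher in as a callable keeps the sketch independent of a particular MQTT library; the `publish` method of an MQTT client object, for example one from the paho-mqtt package, could be supplied here.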

Data Accumulation
Data accumulation is the process of storing the data sent by edge computing in a NoSQL database that serves as the Big Data platform. Air quality data in JSON form are received by an MQTT subscriber on the topic 'airsensor'. The data are used in the prediction process to determine the air quality index in accordance with the rules in Section 2.8. The air quality index obtained from the prediction is stored in Cassandra, which serves as the NoSQL database, with a schema consisting of current timestamp, air quality data, latitude, longitude and label.
Cassandra was chosen because it has stable performance, strong security, operational simplicity for the lowest total cost of ownership and the best scalability among NoSQL platforms [18]. The data with topic 'airsensor' are stored in Cassandra through an Application Programming Interface (API) web service that uses the Representational State Transfer (REST) architecture and runs as a Node.js microservice for data storage at IP address 202.182.58.11.

Data Abstraction
This layer manages the data flow to the Big Data server and to real-time visualization. We built two connector applications. The first connector is based on Node.js [19]. It gets data from the MQTT broker, distributes the data to the Cassandra database server through the RESTful API, and sends the data to the InfluxDB visualization server as the backend. The second connector is based on the Python programming language. It subscribes to the data from the MQTT broker on behalf of the machine learning server and sends the predicted data back to the MQTT broker on a different topic. The IP addresses of the microservices are 202.182.58.10 for InfluxDB as real-time visualization, 202.182.58.12 for the MQTT broker and connector, and 202.182.58.14 for the RESTful API and Cassandra database server.
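The fan-out performed by the first connector can be illustrated with a short sketch. The authors implemented this connector in Node.js; the version below is written in Python only for illustration, and `store_row` and `write_point` are hypothetical callables standing in for the RESTful API call and the InfluxDB write.

```python
import json


def make_connector(store_row, write_point):
    """Return a message handler that forwards one MQTT message to two sinks.

    store_row   -- hypothetical callable that sends a dict to the REST API / Cassandra
    write_point -- hypothetical callable that writes a dict to the time-series database
    """
    def on_message(topic, payload):
        try:
            record = json.loads(payload)
        except (ValueError, TypeError):
            return False          # drop malformed messages instead of crashing
        record["topic"] = topic   # keep the source topic with the record
        store_row(record)         # long-term storage
        write_point(record)       # real-time visualization
        return True

    return on_message
```

Keeping both sinks behind plain callables means a failure policy (retry, queue, skip) can be added per sink without changing the handler.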

Applications
The application layer consists of 1) the learning process, 2) real-time classification, and 3) real-time visualization in the form of a table, map and graph.

Learning Process
The machine learning process builds the classification model before classification is performed in real time. Thus, the accuracy of the generated model affects the level of confidence in the classification results.
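Scikit-learn [20] was used to train the models on the dataset. Support Vector Machine [21][22][23] and Decision Tree [24][25][26] are the classification algorithms used in this research, and the better of the two is selected by comparing their results. Support Vector Machine (SVM) is an algorithm for regression and classification [20] that has been applied to big data classification [23]. Given training vectors $x_i \in R^n$, $i = 1, \dots, l$, and an indicator vector $y \in R^l$ such that $y_i \in \{1, -1\}$, SVM solves the following primal optimization problem:

$$\min_{w, b, \xi} \ \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad y_i (w^T \phi(x_i) + b) \ge 1 - \xi_i, \ \xi_i \ge 0, \ i = 1, \dots, l \tag{1}$$

where $\phi(x_i)$ maps $x_i$ into a higher-dimensional space and $C > 0$ is the regularization parameter. The corresponding dual problem is:

$$\min_{\alpha} \ \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{subject to} \quad y^T \alpha = 0, \ 0 \le \alpha_i \le C, \ i = 1, \dots, l \tag{2}$$

where $e$ is the vector of all ones and $Q_{ij} = y_i y_j K(x_i, x_j)$ with the kernel $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$. After problem (2) is solved, using the primal-dual relationship, the optimal $w$ satisfies:

$$w = \sum_{i=1}^{l} y_i \alpha_i \phi(x_i) \tag{3}$$

The decision function of the classification becomes:

$$\operatorname{sgn}\left(w^T \phi(x) + b\right) = \operatorname{sgn}\left(\sum_{i=1}^{l} y_i \alpha_i K(x_i, x) + b\right) \tag{4}$$

The original SVM separates two classes. Proper multiclass methods are required for classification problems with more than two classes; in this case, several binary classifiers are combined using one of two methods. The first method, 'one-against-one', compares every pair of classes. The second method, 'one-against-the-rest', compares one class with all the other classes.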
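The difference between the two multiclass schemes is easiest to see by enumerating the binary problems each one creates. The sketch below assumes five class labels, 0 to 4, matching the five air quality categories used in this work; the function names are illustrative only.

```python
from itertools import combinations


def one_vs_one_pairs(classes):
    """One binary classifier per unordered pair of classes."""
    return list(combinations(classes, 2))


def one_vs_rest_tasks(classes):
    """One binary classifier per class: that class against all the others."""
    return [(c, [o for o in classes if o != c]) for c in classes]


labels = [0, 1, 2, 3, 4]                 # assumed: five air quality categories
pairs = one_vs_one_pairs(labels)         # k * (k - 1) / 2 binary problems
tasks = one_vs_rest_tasks(labels)        # k binary problems
```

For k classes the pairwise scheme trains k(k-1)/2 classifiers on small subsets, while the one-against-the-rest scheme trains k classifiers on the full dataset.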
A decision tree is a machine learning algorithm that uses a tree-like model of decisions and their possible consequences, including event outcomes, resource costs and utility. The Decision Tree is one of the best classifiers in terms of classification accuracy; the algorithm learns a classification function in which the value of the dependent attribute (variable) is determined by the values of the independent (input) attributes. Some of the most well-known decision tree algorithms are C4.5, CART and Naive Bayes Tree [24]. This research uses CART, which stands for Classification and Regression Trees. CART analysis is a form of binary recursive partitioning and can handle numerical and categorical variables [24][25][26]. CART measures the impurity of the data it receives and constructs a binary tree in which each internal node splits the data into two groups on the selected attribute. The tree is constructed recursively by selecting, at each step, the attribute with the lowest Gini Index, which is found by calculating the Gini Index of each attribute. The Gini Index is calculated with the formula below, where $p_i$ is the probability of the $i$-th class among the target classes of a given attribute and $p_j$ is the probability of class $j$ [24]:

$$Gini = \sum_{i \ne j} p_i p_j = 1 - \sum_{i} p_i^2$$
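A small worked sketch may make the Gini criterion concrete. It computes the impurity of a set of labels and the weighted impurity of a binary split on a numeric attribute; the function names and the threshold-based split are assumptions for illustration, not the scikit-learn internals.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(values, labels, threshold):
    """Weighted Gini impurity of splitting on `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)
```

A pure node has impurity 0 and an evenly mixed two-class node has impurity 0.5, so the split with the lowest weighted value separates the classes best.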
The accuracy of the classifiers was evaluated by dividing the dataset into two subsets: about 70% for the training set and the remaining 30% for the test set. The training set is used to build the classification model, while the test set is used to measure the performance of the built model. This is called the hold-out method. The learning procedure is shown in Algorithm 2. This research uses the Air Pollution Standard Index (ISPU) ranges, the rule currently used by the government of the Republic of Indonesia to determine the air quality category [3], as shown in Table 1. The parameters of the Air Pollution Standard Index are carbon monoxide (CO), nitrogen dioxide (NO2), particulate matter (PM10), ozone (O3) and sulfur dioxide (SO2). The 'Range' column in Table 1 refers to the computed air quality index values; the formulation used to determine them is explained in Table 2. Table 2 describes the groupings of the values of each air quality parameter, in ug/m3, that are used to determine the air quality index. The 'Air Quality Index' column shows the maximum value of the air quality index for the air quality parameter conditions given in the parameter columns of the same row.
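The 70/30 hold-out split mentioned in this section can be sketched in a few lines of plain Python. The seed and the rounding rule are assumptions; scikit-learn provides `train_test_split` for the same purpose.

```python
import random


def hold_out_split(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows once and cut them into (train, test) subsets."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train, test = hold_out_split(range(100))
```

Because the model never sees the test rows during training, the error measured on them estimates how the model behaves on new sensor readings.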

Algorithm 2. Single Board Computing
The air quality index is determined with (6), using the value groupings and value limits set out in Table 2. When several air quality parameters are measured, the reported ISPU is that of the parameter with the highest ISPU value. For example, if the data obtained are SO2 = 71, NO2 = 55 and PM10 = 91, then the reported ISPU is 91, the air quality is "Moderate" and the dominant parameter is PM10.
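The rule of reporting the highest sub-index can be written as a short sketch. The category limits below are the commonly published ISPU ranges and are assumed to match Table 1, which is not reproduced here; the function names are illustrative.

```python
# Upper limits of each category; assumed to match the ranges in Table 1.
CATEGORIES = [
    (50, "Good"),
    (100, "Moderate"),
    (199, "Unhealthy"),
    (299, "Very Unhealthy"),
]


def category(ispu):
    """Map an ISPU value to its category name."""
    for upper, name in CATEGORIES:
        if ispu <= upper:
            return name
    return "Hazardous"


def report(sub_indices):
    """Return (ISPU, category, dominant parameter) for per-parameter ISPU values."""
    dominant = max(sub_indices, key=sub_indices.get)
    value = sub_indices[dominant]
    return value, category(value), dominant


result = report({"SO2": 71, "NO2": 55, "PM10": 91})
```

With the values from the example in the text, the sketch reports PM10 as the dominant parameter.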

Real-time Classification
The data model generated by the learning process and by analytics on large-scale data can be used to create a real-time classification system. Therefore, even when data arrive on a large scale from a large number of sensor nodes, the system is still able to perform the analysis. The purpose of this system is to bypass the data distribution delay from VaaMSN (edge computing) to data storage and visualization.
Both the real-time classification and the learning process use scikit-learn running in a Python environment. VaaMSN sends air quality data on the topic 'airsensor' through MQTT communication; the data are then parsed from JSON format so that the classifier can generate an air quality index prediction from them. The result is a number from 0 to 4 that represents the categories in Table 1, in order: good, medium, bad, very bad and hazardous. The result is stored in a variable called 'label' and added to the previously received JSON data, so that the data contain 'air quality data, latitude, longitude, the current time and label'. The combined data are then re-published on the topic "airsensoranalytic" for use in real-time visualization.
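The classification step on each incoming message can be sketched as below. The feature names and their order are assumptions, and `predict` is a placeholder for the trained model; with a fitted scikit-learn estimator `model`, one could pass `lambda x: model.predict([x])[0]`.

```python
import json

LABEL_NAMES = ["good", "medium", "bad", "very bad", "hazardous"]  # order given in the text
OUT_TOPIC = "airsensoranalytic"                                   # topic named in the text


def classify_message(payload, predict, features=("pm10", "co", "so2", "o3", "no2")):
    """Add a predicted label to one JSON message and return (topic, new payload)."""
    record = json.loads(payload)
    x = [float(record[name]) for name in features]   # feature vector in a fixed order
    label = int(predict(x))
    if not 0 <= label < len(LABEL_NAMES):
        raise ValueError("label outside the 0-4 range: %r" % label)
    record["label"] = label
    return OUT_TOPIC, json.dumps(record)
```

Converting the label to a plain `int` before serializing avoids JSON errors when the model returns a NumPy integer.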

Real-Time Visualization
The visualization stage starts when the connector sends the data to InfluxDB using the 'writepoint()' function in Node.js. We use InfluxDB as the time-series database [27]; the data collected by InfluxDB are arranged as time series and then sent to Grafana [28]. Grafana generates graphical interfaces such as tables, graphs and maps. The data scheme is "{current timestamp, sensor id, pm10, co, so2, o3, no, latitude, longitude, latitude, label}". We built three types of visualization: Figure 3 (a) is a table that shows the data of the air monitoring sensors with latitude, longitude and air quality index, Figure 3 (b-f) are graphs that show the time-series data as line charts, and the remaining panel of Figure 3 is the map.
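To show what one stored point might look like, the sketch below formats a record in the InfluxDB line protocol by hand. This is not the Node.js client call used by the authors; the measurement name, the field names and the choice of the sensor id as a tag are assumptions, and tag values are assumed to contain no spaces or commas.

```python
def to_line_protocol(record, measurement="air_quality"):
    """Format one record as: measurement,tags fields timestamp."""
    tags = "sensor_id=%s" % record["sensor_id"]
    field_names = ("pm10", "co", "so2", "o3", "no2", "latitude", "longitude")
    fields = ["%s=%s" % (name, float(record[name])) for name in field_names]
    fields.append("label=%di" % int(record["label"]))   # trailing i marks an integer field
    return "%s,%s %s %d" % (measurement, tags, ",".join(fields), int(record["timestamp_ns"]))


sample = {
    "sensor_id": "car1", "pm10": 12.5, "co": 1.0, "so2": 2.0, "o3": 3.0, "no2": 4.0,
    "latitude": -7.25, "longitude": 112.75, "label": 1,
    "timestamp_ns": 1500000000000000000,
}
line = to_line_protocol(sample)
```

Storing the sensor id as a tag keeps it indexed, so a dashboard can filter by vehicle without scanning every point.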

Collaboration and Processes
This layer provides feedback from the system to the users. The system sends a notification when the air quality in an area is bad. If the label value in the data sent on the 'airqualityanalytic' topic exceeds the bad level, the website visualization system notifies the user with a warning that the current air condition is hazardous or unhealthy. Notifications can be pushed not only through the website but also through the Mesosfer platform. Mesosfer is a Mobile Backend Platform as a Service that provides several features to simplify the creation of Internet of Things systems and speed up the development process. This platform is used to send mobile notifications that alert the user to the air quality condition. When the air quality is unhealthy, very unhealthy or hazardous, the system sends data to the Mesosfer REST API, and Mesosfer then delivers the data as a notification to the user's cell phone. The data submitted are 'air quality data', 'air quality condition', 'current time' and 'data location'.
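The alert rule can be sketched as a function that decides whether a labeled record needs a notification and, if so, builds its content. The threshold of label 2 follows from the statement that unhealthy, very unhealthy and hazardous conditions trigger an alert; the record field names are assumptions, and the call to the notification service itself is left out.

```python
CONDITIONS = ["good", "moderate", "unhealthy", "very unhealthy", "hazardous"]
ALERT_FROM = 2   # labels 2, 3 and 4 trigger a notification
POLLUTANTS = ("pm10", "co", "so2", "o3", "no2")   # assumed field names


def build_notification(record):
    """Return the notification content, or None when the air quality is acceptable."""
    label = record["label"]
    if label < ALERT_FROM:
        return None
    return {
        "air quality data": {k: record[k] for k in POLLUTANTS if k in record},
        "air quality condition": CONDITIONS[label],
        "current time": record.get("time"),
        "data location": (record.get("latitude"), record.get("longitude")),
    }
```

Returning `None` for acceptable readings lets the caller skip the network request entirely.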

Results and Discussion
In this section, we describe the experiments and present the implementation of both the software and the hardware. The experiments and analytical tests show that the system we built works well. The tests include real-time visualization testing and a comparative performance evaluation of the Linear SVM [29] and Decision Tree algorithms with their default parameters, using datasets from the sensors labeled according to the given rules. Table 3 shows the confusion matrix of the model trained with the Linear SVM algorithm, and Table 4 shows the confusion matrix for the DT algorithm. In the SVM experiment, the number of misclassified data ranges from 22 to 127 out of around 21,804 data, an error percentage of about 0.2% to 0.5%. In the Decision Tree experiment, the number of misclassified data ranges from 3 to 12 out of around 21,804 data, an error percentage of almost 0%. The experiments show that the Decision Tree is better than the SVM, with higher accuracy of the predicted labels.
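For readers who want to reproduce this kind of evaluation, a confusion matrix and the error rate derived from it can be computed as in the sketch below. The labels in the usage lines are made-up toy values, not the experimental data.

```python
def confusion_matrix(y_true, y_pred, n_classes=5):
    """Rows are true labels, columns are predicted labels."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def error_rate(matrix):
    """Fraction of samples that lie off the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - correct) / total


toy_true = [0, 0, 1, 1, 2]   # toy labels for illustration only
toy_pred = [0, 1, 1, 1, 2]
toy_matrix = confusion_matrix(toy_true, toy_pred)
```

Off-diagonal cells show which pairs of categories the model confuses, which a single accuracy figure hides.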

Classification Results
In the second experiment, we evaluate the classification results by calculating the accuracy rate and the Mean Squared Error (MSE). Table 5 shows that the Decision Tree algorithm offers a better accuracy rate, 0.99839479. Figure 4 shows the ROC (Receiver Operating Characteristic) curve, which represents the validation of the class model. The ROC curve plots the TPR (True Positive Rate) on the vertical axis against the FPR (False Positive Rate) on the horizontal axis. The area under the ROC curve is called the AUC. The AUC ranges from 0 to 1 and is better the closer it is to 1. In the experiments performed, the Decision Tree algorithm has 100% accuracy in all classes, while the SVM algorithm has an accuracy of around 98%. These AUC results are better than those of the multilabel classifier in [30], which produces an AUC of around 0.71.
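The two metrics used in this experiment can be computed directly from the true and predicted labels, as in the following sketch. The label vectors are toy values for illustration, not the experimental data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted labels."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


toy_true = [0, 1, 2, 3]   # toy labels for illustration only
toy_pred = [0, 1, 2, 1]
toy_accuracy = accuracy(toy_true, toy_pred)
toy_mse = mean_squared_error(toy_true, toy_pred)
```

Because the labels 0 to 4 are ordered, the MSE penalizes a prediction that is two categories off more heavily than one that is a single category off, while accuracy treats both as equally wrong.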

Proposed System Implementation
The experiments test the system integration from sensor readings to database storage and real-time visualization. The air quality data obtained by the air quality sensor devices are transmitted to cloud computing. The MQTT broker distributes the data to the prediction system, which determines their air quality index. After the air quality index has been added, the data are sent back to the MQTT broker on another topic, received by an MQTT subscriber and forwarded to the InfluxDB connector for real-time visualization. The data received by InfluxDB are displayed in real time in a graphical interface that can be viewed on the website at IP address 202.182.58.10.
The test is performed by installing an air quality sensor device on top of the vehicle and the SBC inside the vehicle. The vehicle is driven on the highway to collect air quality data along the road. Figure 5 (a) shows the air quality sensor device and Figure 5 (b) shows the SBC as the Smart Car Hub. Figure 6 shows the real-time visualization on the dashboard while the vehicle is in motion. Markers on the world map represent the air quality data, and the color of each marker indicates the air quality: green for good, blue for medium, yellow for unhealthy, red for very unhealthy and black for hazardous.

Conclusion
In this research, the integration of VaaMSN and SEMAR for an air quality monitoring system has been implemented successfully. The experiments show that the data from VaaMSN are sent to the Big Data platform and visualized in real time. The two algorithms used for the analytics both reach an accuracy of more than 90%, with an MSE of 0.00268096 for the DT algorithm and about 0.03076293 for the SVM algorithm, which means the experiment achieved a good result. In the future, the integration of VaaMSN and SEMAR is expected to be used for road environment monitoring that detects holes and road damage with cameras and other road condition sensors, using real-time classification.