Cluster-based water level patterns detection

Indonesian Disaster Data and Information for 2016 show that floods accounted for 32.2% of all disasters. In Tangerang, one of the regions commonly flooded in 2016, floods affected 30,949 people and destroyed more than 400 residences. Despite this, Tangerang has no systematic way of detecting flood patterns. Therefore, there is an urgent need for a system that can detect potential flood risks in Tangerang. This study explores a means to systematically find flood patterns in Tangerang and attempts to visualize the risks based on 11 years of data from four major river stations in the Tangerang vicinity. All the data were obtained from the Ciliwung Cisadane River Basin Center (BBWS) for the period 2007 to 2017, with a total of 368,184 rows. This study proposes an interactive dashboard based on the water level data covering the Angke, Pesanggrahan, and Cisadane rivers. Three clustering methods, K-Medoids, DBScan, and X-Means, are implemented to segregate the water level data from the four stations into meaningful periodic flood patterns. The output of this research is an interactive dashboard built on the newly found patterns. The dashboard is designed to be simple and easy to use for non-technical users. We believe that the output of this research could be incorporated into the decision-making process of the Ciliwung Cisadane River Basin Center (BBWS) in order to improve countermeasures in areas at risk of flooding.


Introduction
Indonesia is prone to flood disasters. According to the National Disaster Management Agency in 2016 [1], floods are among the most frequently occurring disasters in Indonesia, accounting for 32.2% of disasters with 713 incidents (Figure 1).
According to the National Disaster Management Agency, the losses and damage caused by floods in Tangerang in 2016 alone amounted to four deaths, 30,949 people affected, 5,313 people displaced, and material losses of 403 residences, 11 education facilities, and 624 ha of land damaged. The impact of such floods can be reduced if communities are informed in advance through predictions of potential flood risks. Floods occur not only in watersheds but also in urban areas or areas far from streams, for example in densely populated areas and on roads with no drainage or poor absorption, which makes them less obvious though still predictable. This flood behavior creates a need for an early warning system in such areas. One common approach to flood early warning is a visualization tool, which has been used in many similar systems.
This study focuses on exploring ways to visualize periodic flood occurrence in the Tangerang area because of the urgency of reducing the impact of flood disasters. Currently, Tangerang has no application that can detect and report potential floods, and its geographical location (together with nearby areas, i.e. Jakarta and Bogor) makes it more vulnerable to upcoming flood events. This study proposes an interactive dashboard based on the river stations within the Tangerang area. Tangerang has three rivers, namely Angke, Pesanggrahan, and Cisadane (illustrated in Figure 2). The information in the dashboard is derived from clustering methods, which are commonly used for solving high-dimensional logistic problems (cf. [2][3][4][5]).

Problem Statement and Research Method
The goal of this study is to explore possible periodic flood patterns in Tangerang and to visualize the patterns in the form of a dashboard. The visualization could be used as a benchmark for prioritizing flood prevention efforts, such as preparing water pumps at high-risk points and scheduling riverbed maintenance in areas with great potential for flooding. The visualization could also help the Ciliwung Cisadane River Basin Center (BBWS) conduct maintenance services. Given this visualization as the output of Knowledge Discovery in Databases (KDD), the Ciliwung Cisadane River Basin Center (BBWS) can see the water level patterns that occurred during the period that has been predicted.
As the research method, we follow the KDD process [6,7] of discovering useful information from a collection of data. In [6], Julian and Natalia built an application that recommends computer assemblies suited to its users' needs, using web scraping so that users can obtain a better price for the computer they need. Similarly, Monica et al. [7] present findings from analysing the large amount of data held by the Indonesian Government Tourism Office, specifically regarding tourism in Bali. They use the K-Means and X-Means algorithms to cluster the various types of tourist attractions in Bali according to their popularity, and Power BI to develop the interactive dashboard. The difference from these previous works is that in this paper we apply KDD with three clustering methods to analyze possible patterns of water level rise. Figure 3 illustrates the four steps of the KDD process based on [6,7]. The details of these steps are explained in Sections 2.1 through 2.4, and the results are presented in Section 3.

Selection
The steps undertaken in data preprocessing mainly fall into two categories, namely the removal of noise or outliers, and strategies for handling missing data fields. Specifically, in this study, we perform data preprocessing using Power BI by following these steps:
a. Delete unused rows and columns.
b. Standardize the names of the 7 columns: River, Station, Latitude, Longitude, Time, Date, and Water Level.
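The two preprocessing steps can be illustrated with a minimal Python sketch. The authors performed this step in Power BI, so this is only an approximation of the idea; the alias table, the input header spellings, and the sample rows are hypothetical.

```python
import csv
import io

# The seven target column names listed in the text.
TARGET_COLUMNS = ["River", "Station", "Latitude", "Longitude",
                  "Time", "Date", "Water Level"]

# Hypothetical header spellings mapped to the target names (assumption).
ALIASES = {
    "river": "River",
    "station": "Station",
    "station name": "Station",
    "lat": "Latitude",
    "latitude": "Latitude",
    "lon": "Longitude",
    "longitude": "Longitude",
    "time": "Time",
    "date": "Date",
    "level": "Water Level",
    "water level": "Water Level",
}


def preprocess(csv_text):
    """Drop unused columns and incomplete rows; rename headers to the target names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    cleaned = []
    for raw in reader:
        row = {}
        for header, value in raw.items():
            name = ALIASES.get((header or "").strip().lower())
            if name is not None:  # columns without an alias are treated as unused
                row[name] = (value or "").strip()
        # Keep only rows that have all seven fields and a non-empty water level.
        if all(col in row for col in TARGET_COLUMNS) and row["Water Level"]:
            cleaned.append(row)
    return cleaned


# Hypothetical sample export with one unused column and one empty reading.
SAMPLE = (
    "station name,river,lat,lon,date,time,level,operator\n"
    "Station A,Cisadane,-6.17,106.63,2016-02-01,06:00,1.85,x\n"
    "Station A,Cisadane,-6.17,106.63,2016-02-01,12:00,,x\n"
)
rows = preprocess(SAMPLE)
```

Here the `operator` column is discarded as unused and the row with the empty reading is removed, leaving one record with the seven standardized column names.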

Transformation
Data that have been through the preprocessing phase are transformed so that they can be used in the data mining process. In this case, we merge the data into one continuous vector to be processed using clustering methods in R version 3.4.4.

Data Mining
In the data mining process, clustering is done using K-Medoids, DBScan, and X-Means on two scenarios based on [8]:
Scenario 1: Using the complete data from 2007 to 2017.
Scenario 2: Based on the previous research, using the data from 2013 to 2017.
The clustering methods are used to segregate the data into meaningful groups. A pattern is considered strongly detected when the results of both scenarios agree (i.e. the clustering structure remains intact regardless of the amount of data).
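As a rough illustration of the two scenarios, the Python sketch below splits a set of dated readings by year range and compares a simple summary between them. The records are hypothetical, and `peak_month` is only a stand-in for a clustering result; it is not the comparison used in the study.

```python
from collections import defaultdict
from datetime import date


def select_scenario(records, first_year, last_year):
    """Return the (date, level) records whose year lies in [first_year, last_year]."""
    return [r for r in records if first_year <= r[0].year <= last_year]


def peak_month(records):
    """Return the month (1-12) with the highest mean water level."""
    totals = defaultdict(lambda: [0.0, 0])  # month -> [sum of levels, count]
    for day, level in records:
        totals[day.month][0] += level
        totals[day.month][1] += 1
    return max(totals, key=lambda m: totals[m][0] / totals[m][1])


# Hypothetical (date, water level) readings.
records = [
    (date(2008, 2, 1), 2.1),
    (date(2012, 11, 5), 1.9),
    (date(2014, 1, 20), 1.7),
    (date(2017, 3, 2), 1.2),
]

scenario_1 = select_scenario(records, 2007, 2017)  # complete data
scenario_2 = select_scenario(records, 2013, 2017)  # last five years

# A pattern would count as "agreed" when both scenarios give the same summary.
agree = peak_month(scenario_1) == peak_month(scenario_2)
```

In practice the summary compared between scenarios would be the cluster structure itself rather than a single peak month.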

Analysis and Results
The analysis is based on the results obtained from implementing three clustering methods: K-Medoids, DBScan, and X-Means. The results of all the clustering methods are compared to draw a final conclusion on the patterns.

Data Clustering on K-Medoids
The implementation of K-Medoids is done in R. The data are clustered based on the hourly water level. In this study, we use an index as a metric to evaluate the performance of each cluster, assuming k=3 (meaning high, medium, and low). The steps of the K-Medoids process are [9-14]:
a. Arbitrarily choose k=3 data items as the initial medoids.
b. Assign each remaining data item to the cluster with the nearest medoid.
c. Randomly select a non-medoid data item and compute the total cost of swapping an old medoid with the currently selected non-medoid data item.
d. If the total cost of swapping is less than zero, perform the swap to generate the new set of medoids.
e. Repeat steps b, c, and d until the medoids stabilize or 50 iterations are reached.
The result of the K-Medoids algorithm is displayed in a dashboard: Figure 4 shows the characteristics of each cluster's members. For scenario I and scenario II of all rivers, the averages of the water level in each cluster are presented in Table 1. The three highest average water levels are at 12:00 PM, 6:00 AM, and 6:00 PM.
The three highest average water levels are at 6:00 AM, 12:00 PM, and 6:00 PM.
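The K-Medoids steps listed in this section can be sketched in Python as follows. This is not the authors' R implementation. As an assumption of the sketch, every non-medoid is tried in each pass (instead of one random pick per iteration) so that stabilization can be detected, the data are one-dimensional with distinct values, and the sample water levels are hypothetical.

```python
import random


def total_cost(points, medoids):
    """Sum of absolute distances from each point to its nearest medoid."""
    return sum(min(abs(p - m) for m in medoids) for p in points)


def k_medoids(points, k=3, max_iter=50, seed=0):
    """Cluster 1-D water levels around k medoids using swap-based search."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)        # step a: arbitrary initial medoids
    cost = total_cost(points, medoids)     # step b is implicit in the cost
    for _ in range(max_iter):              # step e: at most max_iter passes
        improved = False
        for i in range(k):
            for candidate in points:
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                trial_cost = total_cost(points, trial)  # step c: cost of the swap
                if trial_cost - cost < 0:               # step d: swap only if cheaper
                    medoids, cost = trial, trial_cost
                    improved = True
        if not improved:                   # medoids have stabilized
            break
    medoids.sort()                         # index 0 = low, 1 = medium, 2 = high
    labels = [min(range(k), key=lambda j: abs(p - medoids[j])) for p in points]
    return medoids, labels


# Hypothetical water levels forming three well-separated groups.
levels = [0.1, 0.2, 0.3, 1.0, 1.1, 1.2, 2.0, 2.1, 2.2]
medoids, labels = k_medoids(levels)
```

Sorting the medoids lets the cluster indices be read as low, medium, and high, matching the k=3 interpretation in the text.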

Data Clustering on DBScan
The implementation of DBScan is done in R. The data are clustered based on the water level. We use different parameters from the previous study [8]:
a. Eps: previously 1.0, currently 0.01.
b. MinPts: previously 5, currently 1000 to 6000.
We use different parameter values because the previous result has a very low clustering resolution (unbalanced groups, where one of the groups has only one member). The number of clusters is determined by these two parameters (Eps and MinPts). The steps of the DBScan process are [15][16][17][18][19][20]:
a. Arbitrarily select a point p.
b. Retrieve all points density-reachable from p with respect to Eps and MinPts (Eps: previously 1.0, currently 0.01; MinPts: previously 5, currently 5000).
c. If p is a core point, a cluster is formed.
d. If p is a border point, no points are density-reachable from p, and DBScan visits the next point of the database.
e. Continue the process until all of the points have been processed.
The optimum number of clusters based on the parameters (Eps: 1.0 and MinPts: 5) is 2, namely a high cluster and a low cluster. The result of the DBScan method is displayed in a dashboard: Figure 5 shows the characteristics of each cluster's members. For scenario I and scenario II of all rivers, the averages of the water level in each cluster are presented in Table 2.
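The DBScan steps described in this section can be sketched in Python for one-dimensional water levels. This is not the authors' R implementation. The sample data and the parameter values used here (eps=0.015, min_pts=2) are hypothetical and scaled down from the values reported in the text, and the neighbor search is a simple quadratic scan.

```python
NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each 1-D point with a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)  # None means "not yet visited"

    def neighbors(i):
        # Eps-neighborhood of point i (includes the point itself).
        return [j for j in range(len(points))
                if abs(points[i] - points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):       # steps a and e: visit every point once
        if labels[i] is not None:
            continue
        seeds = neighbors(i)           # step b: points reachable from p
        if len(seeds) < min_pts:       # step d: p is not a core point
            labels[i] = NOISE
            continue
        cluster += 1                   # step c: p is a core point, new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:     # border point reached from a core point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:   # j is also a core point: keep expanding
                queue.extend(nbrs)
    return labels


# Hypothetical water levels: two dense groups and one isolated reading.
levels = [0.50, 0.51, 0.52, 0.53, 1.50, 1.51, 1.52, 3.00]
labels = dbscan(levels, eps=0.015, min_pts=2)
```

A small Eps with a large MinPts, as chosen in the text, makes the method stricter about what counts as a dense group, which is why the authors' settings avoid the single-member clusters seen with the earlier parameters.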

Data Clustering on X-Means
Similarly, the implementation of X-Means [21][22][23][24][25][26] is done in R. The data are clustered based on the water level using the parameters:
a. min_k: 2
b. max_k: 10
The number of clusters is determined by the Bayesian Information Criterion (BIC) within the range given by these two parameters (max_k and min_k) [21]. The optimum number of clusters based on the parameters (min_k: 2 and max_k: 10) is 2, namely a high cluster and a low cluster. The result of the X-Means algorithm is displayed in a dashboard in Figure 6, which shows the characteristics of each cluster's members. For scenario I and scenario II on all rivers, the averages of the water level in each cluster are presented in Table 3. The three highest average water levels are at 12:00 PM, 6:00 AM, and 6:00 PM.
The three highest average water levels are at 6:00 AM, 12:00 PM, and 6:00 PM.
The combination of these two properties signifies a shift in the water level distribution in the first seven years (2007-2013), yet still within a persistent range. Furthermore, as the third property, the distribution shift happens in the months of November and January, where the peak of the water level in 2007-2013 is shifted from November to January in 2013-2017. As the conclusion of the comparison, we propose the use of DBScan and X-Means when the distribution's gap is negligible, and K-Medoids as a more general means of clustering data with an unknown or equally sparse distribution.
Based on the clustering results of both scenarios and all methods, we find several interesting patterns. First, the highest average water level always occurs in February and diminishes in March. This suggests an important period of diminishing water level at the start of the year. Second, the clustering results differ between the scenarios: the first scenario places November among the three largest clusters, whereas the second scenario places January among them. These results suggest a shift in the water level pattern, from starting to increase in November (and starting to diminish in March) to starting to increase in January. This shift provides a clue that the important flood period has moved from November through March to January through March in the last five years.
Third, the average water level in the past five years is consistently lower than over the last 11 years, as shown in Figure 7. Fourth, the function from the second scenario is a better fit than that of the first (although both have R² = 1). This function signifies a continuous slope, which might be usable for predicting the water level in the January to March period. And finally, fifth, based on the time of day, the three highest average water levels are at 12:00 PM, 6:00 AM, and 6:00 PM, respectively, which suggests high flood risk in these time frames.
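The time-of-day finding can be reproduced in principle by averaging the readings per hour and ranking the hours. The Python sketch below shows the idea on hypothetical (hour, water level) readings; it is not the dashboard computation used by the authors.

```python
from collections import defaultdict


def top_hours(readings, n=3):
    """Return the n hours of the day with the highest mean water level."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for hour, level in readings:
        sums[hour] += level
        counts[hour] += 1
    means = {hour: sums[hour] / counts[hour] for hour in sums}
    return sorted(means, key=means.get, reverse=True)[:n]


# Hypothetical readings as (hour of day in 24-hour form, water level).
readings = [
    (6, 1.8), (6, 1.6),
    (12, 1.9), (12, 1.7),
    (18, 1.5), (18, 1.5),
    (0, 0.9), (9, 1.2),
]
highest = top_hours(readings)
```

With these made-up values the ranking is 12:00, 06:00, and 18:00, mirroring the order reported in the text.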

ISSN: 1693-6930 
Cluster-based water level patterns detection (I Made Murwantara)

Figure 7. Water level functions on the three largest clusters

However, we found a caveat that may invalidate these results: the patterns occur for a limited range of clustering parameters. Although we try to generalize as much as possible by using three different clustering methods with several parameters, high parameter values (e.g. a high Eps value and a low minimum-member value in the DBScan method) might suggest different patterns. Therefore, further investigation should be carried out more thoroughly before implementing the findings in a real flood early warning system.

Conclusions
In this study, the water level data in Tangerang for 2007 to 2017 are clustered using the K-Medoids, DBScan, and X-Means clustering methods. The data are cleaned in the preprocessing stage. The data are then examined under two scenarios based on hourly occurrence. In the data mining process, clustering is done using the three methods. The clustering result is 3 clusters for K-Medoids, 2 clusters for DBScan, and 3 clusters for X-Means. Based on the results, the K-Medoids and X-Means clusterings appear better, since the members of each cluster are more evenly distributed. The methods performed on scenario 1 appear better due to the larger amount of data available; nevertheless, the data from scenario 2 have a better predictive function. The cluster results show an almost complete convergence, suggesting a high potential flood risk in the first two months of the year. These results are visualized in the form of an interactive dashboard that is simple and easy to use for non-technical users. As a final conclusion, we believe that the results show interesting potential for a flood early warning system in Tangerang.

Figure 1. Distribution of Indonesia's disasters in 2016

Figure 3. The four steps of the KDD implemented in this research

Figure 4. Dashboard example of K-Medoids as in scenario 1

Figure 5. Dashboard example of DBScan implementation of scenario 1

Figure 6. Dashboard example of the X-Means method results on scenario 1

Table 2. The Comparison of Scenario I and II using DBScan

Table 3. The Comparison of Scenario I and II using X-Means