Comparison of machine learning performance for earthquake prediction in Indonesia using 30 years historical data

Indonesia lies in one of the most earthquake-prone regions of the world, with more than 100 active volcanoes and a high number of seismic events per year. To reduce casualties, several methods have been developed to predict earthquakes by estimating seismic movement. However, most predictions use only short-term historical data, which limits model performance. This work uses medium- to long-term historical earthquake data collected from 2 local government bodies and 8 legitimate international sources. We estimate medium- to long-term predictions with three machine learning algorithms, namely multinomial logistic regression, support vector machine and Naïve Bayes, and compare their performance. The results show that the support vector machine outperforms the other methods. We compare the root mean square error, which indicates how concentrated the data are around the line of best fit: 0.777 for multinomial logistic regression, 0.922 for Naïve Bayes and 0.751 for the support vector machine. In predicting future earthquakes, the support vector machine outperforms the other two methods, which produce significant differences in distance and magnitude from the current earthquake report.

INTRODUCTION
An earthquake is a natural disaster that occurs as a result of the movement of rock layers or the displacement of the earth's tectonic plates. This sudden movement releases a huge amount of energy in the form of seismic waves. The vibrations that pass through the earth's surface cause damage to the population living in the affected areas. Indonesia, with more than 300 million inhabitants, is located in one of the most earthquake-prone regions, usually called the Ring of Fire, which is the area with the most active tectonic movement, and has about 127 active volcanoes [1]. Moreover, Indonesia also has the Great Sumatran Fault, which spans 1900 km, and the Banda Sea convergent flat margin, which create even more seismic activity [2,3]. Nowadays, earthquake warning systems have been installed in many remote and volcanic areas, which may increase the expected number of survivors. Many research outcomes have also provided more information about earthquake characteristics and their impact on the surrounding area. Machine learning has also been used to improve the information and prediction results. However, some machine learning work still does not provide accurate predictions and sometimes raises false alarms because of insufficient data volume or the prediction method [4]. To our knowledge, earthquake prediction still has room for improvement up to a point that gives more confidence and better results. Furthermore, a good and reasonable prediction provides opportunities to manage emergency evacuation routes, which may reduce casualties.
To provide data for prediction, we collected data from several earthquake and seismological repositories. The data sources for our research are as follows: the United States Geological Survey (USGS) [5], Incorporated Research Institutions for Seismology (IRIS) [6], National Oceanic and Atmospheric Administration (NOAA) [7], European-Mediterranean Seismological Centre (EMSC) [8], International Seismological Centre (ISC) [9], Istituto Nazionale di Geofisica e Vulcanologia (INGV) [10], GeoForschungsZentrum (GFZ) [11,12], Indonesia Tsunami Early Warning System (InaTEWS) [13], Global Historical Earthquake Archive (GHEA) [14,15], and Badan Meteorologi, Klimatologi, dan Geofisika (BMKG) Indonesia [16]. The collected data amount to more than 1 TB. After cleansing to retain only data within the Indonesia region, we have around 375 GB of data, which is used as training and testing data. Considering the volume of data, this work is a Big Data research.
In this work, we compare the performance of three machine learning approaches applied to the earthquake data: multinomial logistic regression [17,18], Naïve Bayes [2,19-21] and support vector machine (SVM) [4,22-25]. Logistic regression provides information about the relationship between variables and shows how closely one or more variables are related to another. The Naïve Bayes approach allows us to compute a probability that is updated with new information. SVM is used for classification and regression analysis based on a separating hyperplane. The contribution of this paper is twofold:
(a) In predicting a disaster such as an earthquake, a comparison between different machine learning algorithms may shed light on a new approach. We propose a technique that is comparable to other approaches for earthquake prediction in the Indonesia region. Our method facilitates prediction and visualization over a range of up to 50 years of historical seismic data, which is particularly helpful for showing how the performance of different machine learning algorithms informs our prediction method. Our approach can also adjust the size of the data for better prediction. This is useful since the size of the data sometimes influences the training and testing process for the final prediction. In addition, we have flexibility in testing our results.
(b) The data collection and cleansing cover a massive volume of data, which creates a rich resource for prediction. We collect the data from legitimate organizations all over the world and compare them with the local monitoring by government bodies in Indonesia. The data cleansing takes most of our time, since it involves not only retrieving raw data but also web scraping and data transformation. Some information needs to be inspected carefully, as the monitoring data may be irrelevant to our work. For this reason, we analyze the data based on the monitoring location and the relevance of its data. For example, we check whether earthquake data released by a source were taken from a third party or were not primarily generated by a specific seismic monitoring station.

RESEARCH METHOD
Relevant works
Earthquake prediction has been improved through the use of historical seismic data. The most promising techniques use artificial intelligence (AI) and machine learning (ML), which have yielded further knowledge [26]. In [27], Bertrand et al. identify the possibility of an upcoming earthquake by forecasting the laboratory quake cycle, which reveals when the event will probably occur. In general, earthquake prediction is categorized into three terms based on the length of the historical data source. Short-term earthquake prediction needs a precursor to strengthen its accuracy [28], while intermediate- and long-term predictions make estimates using a statistical probability approach. Syifa et al. [29] use SVM to analyze the post-earthquake situation and assess the distribution of seismic destruction, which can be useful for evacuation and mitigation plans. Another technique for earthquake prediction uses meteorological data [30], based on a particle filter and support vector regression. This technique uses natural information, such as air temperature, gas concentration and wind speed, to estimate earthquake precursors.
TELKOMNIKA Telecommun Comput El Control, Vol. 18, No. 3, June 2020: 1331-1342

Background
This section will discuss the background theory of the work that covers the earthquake theory and machine learning approaches. The earthquake background theory is categorized into earthquake types, seismic wave and earthquake phenomena in Indonesia. The machine learning covers the multinomial logistic regression, Naïve Bayes and support vector machine.

Earthquake
An earthquake is a natural disaster that creates tremors or vibrations in the affected area as a result of the movement or displacement of the earth's rock layers caused by tectonic dislocation. These vibrations reach the earth's surface and cause massive destruction. There are four types of earthquake: tectonic, volcanic, collapse and explosion. As shown in Figure 1, three types of surface movement cause earthquakes, and they do not appear everywhere on earth. In general, the movement of the earth's surface causes an earthquake when (a) two plates move away from each other in different directions, (b) two plates move towards the same line and (c) two plates slide side by side in opposite directions. The earth's crust layer has a high temperature and distributes its heat to the surrounding area. This volcanic activity is generally known as heat flow convection. This kind of activity pushes magma to the surface, which creates volcanoes. Indonesia is an archipelago located in the Circum-Pacific and Mediterranean belts and has a large number of active volcanoes. As a result, Indonesia is one of the countries at high risk of earthquake disasters.
Earthquake prediction is categorized based on how the earthquake occurs. There are three categories of prediction. The first is long-term prediction, which is rarely implemented because it requires more than 10 years of historical data and additional information on sequential earthquakes resulting from the fault location. The second is intermediate prediction, which draws on the earthquake location, time and destructive power within several years. The last is short-term prediction, which estimates an earthquake using several days of data.

Machine learning
Machine learning builds insight from one or more data sets via specific algorithms. In this work, we compare the performance of three machine learning algorithms, namely Naïve Bayes, support vector machine (SVM) and multinomial logistic regression.
a. SVM
In general, SVM is used to solve classification and regression problems. SVM has gained popularity because it performs well on empirical data. It is conceptually simple, has a fast learning algorithm and very often produces accurate results. This is because SVM is a machine learning method developed on the risk minimization principle. In SVM, a training data set $D$ is given as $D = \{(x_i, y_i) \mid x_i \in \mathbb{R}^p,\ y_i \in \{-1, 1\}\}_{i=1}^{n}$, where $y_i$ is $-1$ or $1$ and indicates the class of the input, obtained by thresholding the wavelet coefficients $x_i$ to describe low or high magnitude, and each $x_i$ is a $p$-dimensional vector. A hyperplane is used to separate the classes, and it is a good separator when it lies between the classes. If $w \cdot x_1 + b = +1$ is the supporting hyperplane of class $+1$, then $w \cdot x_2 + b = -1$ is the supporting hyperplane of class $-1$. To compute the margin between the two classes, we find the distance between the two supporting hyperplanes, which is $2 / \|w\|$. For linear classification, the problem is $\min_{w,b} \frac{1}{2}\|w\|^2$ subject to $y_i (w \cdot x_i + b) \ge 1$ for all $i$, and for non-linear classification it is $\hat{a} = \arg\min_{a} \ldots$
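For reference, the non-linear case is commonly written in a kernelized dual form. The block below is the textbook hard-margin formulation, given here as an assumption and not as the exact expression used in this work; the kernel $k$, the multipliers $a_i$ and the sample count $n$ are notation introduced for illustration.

```latex
\hat{a} = \arg\min_{a} \; \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j \, y_i y_j \, k(x_i, x_j) \; - \; \sum_{i=1}^{n} a_i
\quad \text{subject to} \quad a_i \ge 0, \qquad \sum_{i=1}^{n} a_i y_i = 0
```

With the linear kernel $k(x_i, x_j) = x_i \cdot x_j$, this is the dual of the linear problem, and the weight vector is recovered as $w = \sum_{i} \hat{a}_i y_i x_i$.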

b. Multinomial logistic regression
This method analyzes the relationship between a dependent variable and independent variables when the dependent variable has more than two categories, which generalizes logistic regression to multiclass problems. A multinomial logistic regression model with three categories is defined by two logit functions, each comparing one category with a reference category.
c. Naïve Bayes
Naïve Bayes is a simple classifier that computes probabilities from the combinations of values in a given data set. This method assumes that, given the class variable, the attributes are independent of each other. Bayes' theorem derives the posterior probability from two antecedents, the prior probability and the likelihood function:
$$P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}$$
where $X$ is the data with unknown class, $H$ is the hypothesis that the data belong to a specific class, $P(H \mid X)$ is the posterior probability of hypothesis $H$ given $X$, $P(H)$ is the prior probability, $P(X \mid H)$ is the probability of observing $X$ given $H$, and $P(X)$ is the marginal probability (evidence) of $X$.
d. Evaluation method
To evaluate machine learning performance, we use the confusion matrix, mean absolute error (MAE), mean absolute percentage error (MAPE), mean square error (MSE) and root mean square error (RMSE). The confusion matrix describes the performance of a classification model across classes. The classifier is correct when it produces a true positive (TP) or a true negative (TN), and it is wrong when it produces a false positive (FP) or a false negative (FN). In measuring machine learning performance, we evaluate accuracy (the percentage of correct predictions over all test instances) and precision. In this paper, we measure performance using MAE, MAPE, MSE and RMSE:
$$\mathrm{MAE} = \frac{1}{T} \sum_{i=1}^{T} |y_i - \hat{y}_i|, \qquad \mathrm{MAPE} = \frac{100\%}{T} \sum_{i=1}^{T} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$$
$$\mathrm{MSE} = \frac{1}{T} \sum_{i=1}^{T} (y_i - \hat{y}_i)^2, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{T} \sum_{i=1}^{T} (y_i - \hat{y}_i)^2}$$
In these formulas, $\hat{y}_i$ is the predicted earthquake value, $y_i$ is the earthquake data from the sources and $T$ is the number of examples used for testing. MAE measures how far our computation deviates through under- and over-estimation [28]. MSE is the most common way to evaluate prediction results, where the error is the difference between the estimated result and the data. MAPE indicates the error of the prediction relative to the original data and is useful when the size of the variable matters for evaluating the prediction. Meanwhile, RMSE emphasizes large errors more. RMSE evaluates how close the observed data points are to the model's predicted values, and MAE describes uniformly distributed errors. It is worth noting that the RMSE value is expressed in the same unit as the outcome. For example, when it measures the depth of an earthquake, the unit is km.
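The error measures described above can be illustrated with a minimal Python sketch. It uses the standard definitions of MAE, MAPE, MSE and RMSE; the function name `regression_errors` and the sample values are hypothetical and are not taken from this study, whose experiments were run in R.

```python
import math


def regression_errors(actual, predicted):
    """Return MAE, MAPE (in percent), MSE and RMSE for paired sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    t = len(actual)
    diffs = [y - y_hat for y, y_hat in zip(actual, predicted)]
    mae = sum(abs(d) for d in diffs) / t
    # MAPE divides by the observed value, so it is undefined when y is 0.
    mape = 100.0 * sum(abs(d / y) for d, y in zip(diffs, actual)) / t
    mse = sum(d * d for d in diffs) / t
    rmse = math.sqrt(mse)
    return {"MAE": mae, "MAPE": mape, "MSE": mse, "RMSE": rmse}


# Hypothetical magnitudes, for illustration only.
observed = [5.0, 4.0, 6.0, 5.0]
estimated = [4.5, 4.0, 7.0, 5.5]
errors = regression_errors(observed, estimated)
```

Because the squared term weights the single error of 1.0 more heavily, RMSE is larger than MAE in this example, which matches the remark that RMSE emphasizes large errors.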

Data collection
This stage begins our work by collecting data from different locations and in various formats. The challenge in this activity is that only some of the data can be retrieved directly from a repository as ready-to-use data. In this work, the data collection activity is categorized into 3 methods, as follows:
(a) Retrieve the data directly from the repository when it is provided in a ready-to-use format, such as comma-separated values (CSV).
(b) Retrieve a web site manually in hypertext markup language (HTML) format, then apply web scraping to extract the information we need from the HTML text file.
Several techniques are applied to different data sources. We retrieve the EMSC data by accessing or downloading each web page covering 14 years (2004-2018). The web-scraping technique is applied to the sources from NOAA, EMSC, ISC, INGV, GFZ and BMKG. For InaTEWS, we downloaded the data manually. Other data sets were also downloaded directly, such as GHEA, where the data format is not CSV.
USGS data are in CSV format, and we could download almost all the data ranging from 1 January 1900 until 31 August 2018. For the IRIS data set, we obtained data ranging from 1968 to 2018. The INGV data set ranges from 1985 to 2018, and the BMKG data set ranges from 2008 to 2018.
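The web-scraping step for saved HTML pages can be sketched as follows. This is a minimal illustration using Python's standard `html.parser` module, not the scraping code used in this work; the class name `TableRowParser`, the table layout and the sample row are hypothetical, since each catalogue page has its own structure.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page fragment with made-up values.
SAMPLE_HTML = """
<table>
  <tr><th>date</th><th>latitude</th><th>longitude</th><th>depth</th><th>magnitude</th></tr>
  <tr><td>2000-01-01</td><td>-1.00</td><td>100.00</td><td>10</td><td>5.0</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(SAMPLE_HTML)
header, *records = parser.rows
events = [dict(zip(header, row)) for row in records]
```

Each scraped row becomes a dictionary keyed by the header cells, which makes it straightforward to write the result out as CSV alongside the sources that already provide that format.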

Data pre-processing
This stage prepares the data before we make any prediction. Most of the work in this stage is filtering, for example checking whether the date, time, latitude, longitude, magnitude and depth exist in the data set. We also remove records with a magnitude of 0 to avoid misclassification during the processing stage. Data merging is also done in this stage. For example, we group data within the same range of dates into 10-year and 30-year sets. In doing so, we obtain the intersection of data from the different sources.
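The filtering and grouping described in this stage can be sketched in Python as below. The field names, the ISO date format and the interpretation of a "10-year" or "30-year" group as the most recent window of that length are assumptions made for illustration; the records are made up.

```python
REQUIRED = ("date", "time", "latitude", "longitude", "magnitude", "depth")


def clean(records):
    """Keep records that have every required field and a non-zero magnitude."""
    kept = []
    for rec in records:
        if any(rec.get(field) in (None, "") for field in REQUIRED):
            continue  # a required field is missing
        if float(rec["magnitude"]) == 0.0:
            continue  # magnitude 0 is removed before prediction
        kept.append(rec)
    return kept


def group_by_span(records, span_years, last_year):
    """Return records dated within the last `span_years` years up to `last_year`."""
    first_year = last_year - span_years + 1
    return [r for r in records if first_year <= int(r["date"][:4]) <= last_year]


# Hypothetical records, for illustration only.
raw = [
    {"date": "2015-03-02", "time": "10:00:00", "latitude": -2.0,
     "longitude": 120.0, "magnitude": 5.1, "depth": 10.0},
    {"date": "1995-07-14", "time": "03:20:00", "latitude": -7.5,
     "longitude": 110.0, "magnitude": 0.0, "depth": 33.0},
    {"date": "1992-11-30", "time": "", "latitude": 1.2,
     "longitude": 98.5, "magnitude": 4.4, "depth": 50.0},
    {"date": "1990-01-09", "time": "21:45:00", "latitude": -3.3,
     "longitude": 128.9, "magnitude": 6.0, "depth": 20.0},
]

cleaned = clean(raw)
ten_year = group_by_span(cleaned, 10, 2018)
thirty_year = group_by_span(cleaned, 30, 2018)
```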

Prediction stage
This stage makes predictions for the specific 10-year and 30-year groups. We split the work into two parts. In the first part, we train on the data grouped by time, date, latitude, longitude, magnitude and depth to find the location and the possible energy of the earthquake. In the next part, we split the data set, already categorized into 4 groups (latitude, longitude, magnitude and depth), into training and testing sets with a split ratio of 0.8 over 1.0. We use R [31] as the prediction tool, and its libraries implement the machine learning methods that we apply. For Naïve Bayes and SVM, we use the naiveBayes and svm functions from the e1071 library [32]. Multinomial logistic regression uses the multinom function from the nnet library [33].
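The 0.8 split ratio mentioned above can be illustrated with a short Python sketch. Whether the records were shuffled before splitting is not stated, so the seeded shuffle here is an assumption, and `train_test_split` is a hypothetical helper written for this example, not a function from the R libraries used in this work.

```python
import random


def train_test_split(records, train_ratio=0.8, seed=0):
    """Shuffle a copy of `records` and split it into train and test lists."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for repeatable splits
    cut = int(round(len(shuffled) * train_ratio))
    return shuffled[:cut], shuffled[cut:]


records = list(range(100))  # stand-in for earthquake records
train, test = train_test_split(records)
```

Fixing the seed keeps the split repeatable, so that different algorithms can be compared on the same training and testing records.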
To predict an earthquake, the target is split to obtain specific results. For example, we predict the location of the earthquake as the first step. Then, the magnitude and depth of the earthquake are predicted based on the location estimated in the previous step. The prediction result is the combination of the first and second steps. In predicting the location of the earthquake, we implemented two techniques. First, we use the Geohash library to merge the latitude and longitude. Second, we also predict the location using only latitude and longitude. We split our prediction based on location as shown in Table 1. It is worth noting that the latitude and longitude are in decimal degrees. In predicting the magnitude of an earthquake, we factorize the prediction into two factors. First, the latitude and longitude are used to obtain the power of the earthquake. Second, we predict via the combination of location and depth, as depicted in Table 2. For the depth of the earthquake, the factors are the opposite of those used for the magnitude prediction, as shown in Table 3. To visualize our results, we use R with the Shiny [34] library, overlaid on a map retrieved from Google Maps using the ggmap [35] library. The final application of this work is a web-based system.
ISSN: 1693-6930
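The Geohash merge of latitude and longitude used in the location prediction can be sketched as follows. This is a plain Python rendering of the public Geohash encoding scheme, written for illustration; it is not the R Geohash library used in this work, and the function name `geohash_encode` and the default precision are assumptions.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(latitude, longitude, precision=6):
    """Encode decimal-degree coordinates as a Geohash string."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits are interleaved, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, longitude) if use_lon else (lat_range, latitude)
        mid = (rng[0] + rng[1]) / 2.0
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            bits <<= 1
            rng[1] = mid  # keep the lower half of the interval
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits map to one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


cell = geohash_encode(57.64911, 10.40744, precision=5)  # "u4pru"
```

A single Geohash string turns the two coordinates into one categorical label, and shorter strings are prefixes of longer ones, so the precision controls how large each location cell is.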

RESULTS AND ANALYSIS
Analysis
In this work, we make predictions based solely on the earthquake data set. The data are processed under two conditions: first, grouped into 10-year and 30-year sets, and second, without grouping (individual data). Naïve Bayes cannot produce predictions for the 10-year and 30-year individual data sets because the data are imbalanced. We split the data into 60% for training and 40% for testing. We assume that a smaller error indicates a more accurate prediction. To reduce the complexity of our work, we manage the predictions using a catalog that describes the method and data set, as shown in Table 4.
As shown in Table 5, the actual data are grouped into 10 years and evaluated using different techniques. SVM shows good results for Magnitude prediction, and multinomial logistic regression has better results for data with Depth. Naïve Bayes is not included in the 10-year analysis. On the other hand, SVM outperforms the other methods for the 30-year data set grouped on Magnitude and Depth, as shown in Table 5. Its MAE is 0.598473, which indicates that its earthquake predictions are more precise than those of the other methods.
In making predictions using 10 years of data without grouping, SVM outperforms the other algorithms in predicting the earthquake location based on Magnitude and Depth. In this prediction, SVM predicts only the latitude and longitude. The result, as depicted in Table 6, shows that the prediction achieves good results when Magnitude and Depth are used to estimate the coordinate location.
In predicting earthquakes for the 30-year data set without grouping, multinomial logistic regression (MLR) exceeds the other algorithms. Using Magnitude and Depth data, as shown in Table 6, MLR has a smaller error than SVM. Naïve Bayes is not included in this prediction because of imbalanced data.
In the next step, we determine which machine learning method is suitable for predicting earthquakes. To do so, we calculate the average error for each data set to see which data set provides a small error rate. As shown in Figure 7, the most applicable data sets are the 30-year grouped data and the 10-year ungrouped data, as both show a low error rate, and we conclude that these data sets have a chance of giving good predictions. In more detail, for both the 30-year grouped and the 10-year ungrouped data sets, SVM outperforms the other methods with a small error rate when using Magnitude information, which also shows a smaller error than the Depth information. Therefore, we conclude that SVM predicts earthquakes much better when using Magnitude information alone.
From the information in Table 7, we conclude that earthquake prediction should be more accurate when we use Magnitude data as the reference. In contrast, when Depth data are used as the reference, accuracy may suffer and we may have problems predicting the earthquake location. These data suggest that the Depth data might be useful for predicting the destruction that may occur at the predicted location. To measure which machine learning method is suitable for earthquake prediction in Indonesia, we compare the average error rate for the ungrouped and grouped data sets. Our results show that the 30-year grouped and 10-year ungrouped data sets give reasonable values. As shown in Table 8, SVM outperforms multinomial logistic regression and Naïve Bayes. For the 10-year ungrouped data set, SVM also shows better performance than multinomial logistic regression, as depicted in Table 9; because of imbalanced data, we cannot obtain results from the Naïve Bayes method for this data set. Overall, our evaluation of machine learning performance shows that the grouped and ungrouped data sets that use Magnitude as the reference perform better than those using Depth values. Moreover, the SVM method shows better performance than the other algorithms. We therefore believe that earthquake prediction using SVM would provide better accuracy than multinomial logistic regression and Naïve Bayes on similar data sets.

Results
To present our prediction in a more visual form, a web service is shown using the R Shiny system. The original earthquake information is retrieved from the Indonesian geological center, as shown in Figure 2(a). We compare the earthquake report from BMKG Indonesia, shown in Figure 2(a), with the prediction results we made before the date of the event, depicted in Figures 2(b), 2(c) and 2(d). Our prediction is based on the day number within a year. For example, to predict an earthquake on March 11, 2019, we count the number of days from the beginning of the year up to that day, which gives 70 days. Then, we enter this value, 70, into the web system. On our map, red shows the prediction result and yellow shows the original data.
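The day-number calculation in the example above can be checked with a few lines of Python. This is only an illustration of the arithmetic; the helper name `day_of_year` is hypothetical and is not part of the web system.

```python
from datetime import date


def day_of_year(year, month, day):
    """Return the 1-based day number of a calendar date within its year."""
    return date(year, month, day).timetuple().tm_yday


# March 11, 2019: 31 days (January) + 28 days (February) + 11 days = 70
target_day = day_of_year(2019, 3, 11)
```

In a leap year the same calendar date falls one day later in the count, so the day number for March 11 would be 71.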
Comparing the earthquake report from BMKG Indonesia with our prediction results shows that the prediction using Naïve Bayes, shown in Figure 2(b), based on the original learning data, is not good enough. Multinomial logistic regression performs better than Naïve Bayes: as shown in Figure 2(c), the predicted earthquake location is closer to the report from BMKG. Support vector machine (SVM) achieves better results for the eastern Indonesia region and outperforms the other methods. It is worth noting that the training data influence the prediction results. Overall, the prediction results show that different machine learning methods may perform differently even though similar data sets were used for training. In our analysis, SVM has a chance of giving better earthquake prediction.
Figure 2. (a) Earthquake report from BMKG Indonesia [16], (b) prediction using Naïve Bayes, (c) prediction using multinomial logistic regression, (d) prediction using SVM.

CONCLUSION
We have compared machine learning methods for predicting earthquake location, depth and magnitude in the Indonesia region. To visualize the prediction results, a web-based application has also been demonstrated. The conclusions we draw from this work are as follows. The Naïve Bayes method is not good enough for predicting on a grouped data set of only one year, but it is applicable to multi-year grouped data. Considering the average error rate, the SVM method outperforms the other algorithms, and using Magnitude data as the reference provides better results than using Depth data. This suggests that Depth can be used as an additional factor for better prediction. We use day, month and year as date properties for prediction, and our observations show that prediction based on day performs better. For the overall data set, as we expected, SVM outperforms the other methods, followed by multinomial logistic regression. Naïve Bayes performed worst among all prediction results.