Analysis of DBSCAN and K-means algorithm for evaluating outlier on RFM model of customer behaviour



Introduction
Clustering is the process of dividing objects into groups so that the objects within a group are similar to each other and dissimilar to the objects in other groups. Clustering is also referred to as data segmentation [1]. It has been widely used in various fields, such as hotspot data clustering, customer segmentation and customer behaviour analysis. Clustering has also been widely used for grouping subscribers, as in the research of [2][3][4][5][6]. Many algorithms are used for clustering, such as K-Means, Self Organizing Map, DBSCAN and others [6][7][8][9][10][11][12][13]. However, the most common and simple algorithm is K-Means [14].
The K-Means algorithm has been successfully applied in various fields [15]. However, this algorithm is very sensitive to the choice of starting point [4] and also to outliers, because these objects distort the average value of clusters [1]. In addition, the number of clusters in the K-Means algorithm must be determined by the user [1], [12], [16]. Many researchers have now developed cluster validity methods for determining the best number of clusters for a dataset. One such validity method is the Dunn Index, developed by Dunn [17].
In addition to K-Means, another clustering method is DBSCAN. This algorithm differs from K-Means in that it does not require the user to specify the number of clusters to produce. DBSCAN has been reported to perform better than K-Means: the study in [8] suggests that DBSCAN has higher sensitivity and better segmentation. DBSCAN is designed to find the portions of a dataset that contain clusters and noise [9] by using the Epsilon (eps) and Minimal Point (minpts) parameters, which set the neighbourhood distance and the minimum number of neighbors of a core point. In addition to handling noise, DBSCAN can also find outliers in arbitrary clusters [18]. Outliers are objects that differ greatly from the rest of the objects in the data set [15] and that do not have enough neighbouring points (minpts) to form a cluster [19]. Outliers are often discarded because they are considered noise [1], but the detection of outlier data, also called anomaly data, is necessary when such data provide important information to the system [19]. Important information about customers can be obtained from the analysis of customer data. Because customers do not all behave in the same way [14], [19], the data need to be analyzed to find profitable customers. One model that is able to measure customer behaviour is the RFM model, with three criteria: Recency, Frequency and Monetary.
Customers with different behaviours can give rise to outliers. Outliers found in customer data can represent favorable customer behaviour or the opposite. If a profitable customer is detected as an outlier and the outlier is discarded, the company is harmed because it loses important information about profitable customers, such as customer profiles.
Through the DBSCAN and K-Means algorithms, outliers in customer data sets can be found in order to see different behaviours among customers. In K-Means, an outlier is determined by the distance between an object and a group of objects [15]. The K-Means algorithm partitions the dataset into several clusters and then checks whether objects in each cluster are outliers [15]. Objects detected as outliers are those that are far from the centre of their cluster [15]. Because both algorithms are able to find outliers, this study compares and tests the outliers from each algorithm on the same customer data collection. The aim is to determine the consistency of the outliers found in the customer data collection with the RFM model by the DBSCAN and K-Means algorithms. Furthermore, the data identified as outliers by both algorithms are analyzed to see whether the outliers contain profitable customers or the opposite, so that decisions can be made about the services provided to customers.

Research Method

RFM Model
The RFM model was developed by Hughes in 1994 for estimating customer lifetime value [20] and customer loyalty behaviour [21] with three variables: Recency (R), Frequency (F) and Monetary (M). Recency is the time interval since the customer's last purchase within a certain period, Frequency is the number of purchases made by the customer in a certain period, and Monetary is the amount of money that the customer spends with the company in a certain period [21]. Customers' RFM values need to be known to assist companies in marketing, because customers with high RFM scores are more responsive to promotions, more likely to place repeat orders and make the most profitable purchases [22] compared with customers with low RFM scores.
The RFM value of a customer is expressed with symbols [2]: the symbol '↑' denotes a value higher than the average and the symbol '↓' denotes a value lower than the average. For F and M, a value higher than the average is better for the company and a value lower than the average is worse. For R the reverse holds: the symbol ↓ (lower than the average) is better for the company and the symbol ↑ (higher than the average) is not good for the company. The cluster with the symbol R ↓ F ↑ M ↑ is named Loyal Customer, R ↑ F ↓ M ↓ is named Lost Customer, R ↓ F ↓ M ↓ is named New Customer and R ↓ F ↑ M ↓ is named Prospect Customer. The symbols are explained in Table 1.

Table 1
Symbol | Customer type | Description
R ↓ F ↑ M ↑ | Loyal Customer | Customers who have recently made a purchase, with a high number of transactions and a high amount of money spent
R ↑ F ↓ M ↓ | Lost Customer | Customers who have not made a purchase for a long time, with a low number of transactions and a low amount of money spent
R ↓ F ↓ M ↓ | New Customer | Customers who have just made a purchase, with a low number of transactions and a still low amount of money spent
R ↓ F ↑ M ↓ | Prospect Customer | Customers who have just made a purchase, with a high number of transactions but a still low amount of money spent
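The symbol assignment described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the sample values and the handling of a value exactly equal to the average (treated here as '↓') are assumptions made for illustration.

```python
from statistics import mean

# Mapping from (R, F, M) symbols to the customer types named in Table 1.
CUSTOMER_TYPES = {
    ("↓", "↑", "↑"): "Loyal Customer",
    ("↑", "↓", "↓"): "Lost Customer",
    ("↓", "↓", "↓"): "New Customer",
    ("↓", "↑", "↓"): "Prospect Customer",
}

def rfm_symbols(customer, averages):
    """Return '↑' (above average) or '↓' (assumed: at or below average) for R, F, M."""
    return tuple("↑" if value > avg else "↓"
                 for value, avg in zip(customer, averages))

def label_customers(customers):
    """Label each (R, F, M) tuple; combinations outside Table 1 map to 'Unlabelled'."""
    averages = tuple(mean(column) for column in zip(*customers))
    return [CUSTOMER_TYPES.get(rfm_symbols(c, averages), "Unlabelled")
            for c in customers]

# Hypothetical (R, F, M) values, for illustration only (not the study's data).
sample = [(5, 20, 900.0), (300, 1, 10.0), (10, 2, 15.0), (8, 18, 20.0)]
print(label_customers(sample))
```

With these hypothetical values, the four customers fall into the four categories of Table 1 in order: Loyal, Lost, New and Prospect.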

DBSCAN Algorithm
DBSCAN is an algorithm developed by [9] with two input parameters, namely epsilon (eps) and minimum points (minpts). Eps is the maximum distance between points in forming a cluster and minpts is the minimum number of points in a formed cluster [9]. The number of clusters in this algorithm is determined by the values of eps and minpts supplied by the user. As its name, density-based spatial clustering of applications with noise (DBSCAN), suggests, this algorithm is able to detect noise, that is, data points that differ from the rest of the data set [9]. In addition to noise, this algorithm is also able to identify anomalous data, or outliers, in a data series using the same two parameters: a point is considered an outlier if it does not have the previously specified number of points (minpts) needed to form a cluster [23].

ISSN: 1693-6930 TELKOMNIKA Vol.17, No. 1, February 2019: 110-117
The steps of DBSCAN are as follows:
1) Select any point or object randomly from the data set as a candidate core point.
2) If the selected object qualifies as a core point under the minpts and epsilon values specified by the user, the object forms a new cluster with its neighbor objects. The distance between a core point object and a neighbor object is calculated using the Euclidean distance formula [8]:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (1)$$

where n is the number of points in the sequence, y_i is the mean of the sequence and x_i is the sequence of data points within each window.
3) Objects that are not included as core points or neighbor objects in step 2 are then processed as the next core point candidates. If an object qualifies as a core point, it forms the next cluster with its neighbor objects, and so on until all the objects in the data set have been tested. An object that has been tested and qualifies neither as a core point nor as a neighbor object is categorized as an outlier/noise, that is, an object whose distance to the core points is larger than epsilon and whose number of density-reachable points is less than the user-specified minpts.
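The steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: candidates are visited in index order rather than randomly, the label -1 for noise and the helper names are illustrative choices, and the sample points are hypothetical.

```python
import math

NOISE = -1  # label used here for outlier/noise points (an illustrative choice)

def region_query(points, i, eps):
    """Indices of all points within Euclidean distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, minpts):
    """Return one cluster label per point; NOISE marks outliers."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):          # step 1 (index order instead of random)
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < minpts:       # not a core point; may become a border point later
            labels[i] = NOISE
            continue
        labels[i] = cluster               # step 2: a core point starts a new cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:        # border point reached from a core point
                labels[j] = cluster
            if labels[j] is not None:     # already assigned to a cluster
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= minpts:  # j is also a core point: keep expanding
                queue.extend(j_neighbors)
        cluster += 1                      # step 3: continue with the remaining objects
    return labels

# Hypothetical 2-D points: two dense groups and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (1.0, 1.0), (1.1, 1.0), (1.0, 1.1),
          (5.0, 5.0)]
print(dbscan(points, eps=0.2, minpts=3))
```

In this sketch the isolated point (5.0, 5.0) has no neighbor within eps, so it is never reached from a core point and keeps the noise label.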

K-Means Algorithm
K-Means is an algorithm categorized as a partition clustering method [5]. This algorithm aims to assign all data in the dataset to a predetermined number of clusters [24]. The steps in the K-Means method are as follows:
1) Determine the number of clusters.
2) Select the initial centroids randomly according to the number of clusters that has been determined.
3) Calculate the distance of each data point to the centroids with the Euclidean distance formula in (1) and assign each data point to its nearest centroid.
4) Update each centroid by calculating the average value of its cluster.
5) Return to step 3 if any data point still changes cluster or any centroid value changes.
In steps 1 and 2, the number of clusters is determined with the Dunn index method, which identifies the optimal number of clusters [6]. This is because K-Means is very sensitive to the selection of starting points when partitioning items into the specified clusters [25].
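A minimal Python sketch of the five K-Means steps is given below. It is an illustration only: the function name, the fixed random seed, the iteration cap and the rule of keeping the old centroid when a cluster becomes empty are assumptions, not details taken from the study.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Return (assignment, centroids) for a list of equal-length tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # step 2: random initial centroids
    assignment = None
    for _ in range(max_iter):
        # Step 3: assign every point to its nearest centroid (Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:         # step 5: stop when no point moves
            break
        assignment = new_assignment
        # Step 4: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                          # assumed: keep old centroid if empty
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return assignment, centroids

# Hypothetical 2-D points forming two well-separated groups.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (1.0, 1.0), (1.1, 1.0), (1.0, 1.1)]
labels, centers = kmeans(points, k=2)
print(labels)
```

Because the initial centroids are random, the numeric label given to each group can differ between seeds; only the grouping itself is meaningful.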

Dunn Index
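The Dunn Index (DI) is a cluster validity measure used to identify the optimal number of clusters; the highest value indicates the best clustering [26]. For k clusters, the Dunn index is calculated with the following equation [26]:

$$DI = \min_{1 \le i \le k} \left\{ \min_{j \ne i} \left( \frac{d(c_i, c_j)}{\max_{1 \le l \le k} \operatorname{diam}(c_l)} \right) \right\} \qquad (2)$$

where d(c_i, c_j) is the dissimilarity function between clusters c_i and c_j, defined as:

$$d(c_i, c_j) = \min_{x \in c_i,\; y \in c_j} d(x, y) \qquad (3)$$

and diam(C) is the cluster diameter, which can be regarded as a measure of cluster dispersion. The diameter of a cluster C can be defined as follows:

$$\operatorname{diam}(C) = \max_{x, y \in C} d(x, y) \qquad (4)$$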
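As an illustration, the Dunn index can be computed with the short Python sketch below. It assumes the classic form of the index (smallest distance between points of different clusters divided by the largest cluster diameter) and at least one cluster with two or more distinct points; the function names and sample clusters are hypothetical.

```python
import math
from itertools import combinations

def cluster_distance(a, b):
    """Smallest Euclidean distance between a point of cluster a and a point of cluster b."""
    return min(math.dist(x, y) for x in a for y in b)

def diameter(cluster):
    """Largest Euclidean distance between two points of the same cluster."""
    return max((math.dist(x, y) for x, y in combinations(cluster, 2)), default=0.0)

def dunn_index(clusters):
    """Ratio of the smallest between-cluster distance to the largest diameter."""
    max_diameter = max(diameter(c) for c in clusters)
    min_separation = min(cluster_distance(a, b)
                         for a, b in combinations(clusters, 2))
    return min_separation / max_diameter

# Hypothetical clusters: each has diameter 1 and they are 3 units apart.
clusters = [[(0.0, 0.0), (0.0, 1.0)], [(3.0, 0.0), (3.0, 1.0)]]
print(dunn_index(clusters))  # 3.0
```

A higher value means compact clusters that are far apart, which is why the clustering with the highest index is preferred.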

Methodology
This study consisted of six steps:
1) Determine the customer data: 1866 customer records with R, F and M attributes from the Herbal Penawar Alwahida Indonesia retail company.
2) Normalize the data so that the attributes R, F and M do not have widely different ranges, because M is an amount of money in rupiah, a different unit from recency and frequency. Min-max normalization is used [1]; this method performs a linear transformation on the original data [6]. Let min_A and max_A be the minimum and maximum values of an attribute A. Min-max normalization maps a value v_i of A to v'_i in the range [new_min_A, new_max_A] by:

$$v'_i = \frac{v_i - \mathrm{min}_A}{\mathrm{max}_A - \mathrm{min}_A}\,(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A \qquad (5)$$

3) Determine the best clustering using the Dunn index validation method for each of the DBSCAN and K-Means algorithms. The best clustering results are then used in each algorithm.
4) Run the DBSCAN algorithm with the optimal epsilon and minimum points values obtained from the Dunn index validation. Run the K-Means algorithm with the optimal number of clusters, then search for outliers in each cluster using the outlier score formula and find out whether the outliers are global or collective outliers.
5) Analyze whether the outliers from the two algorithms contain the same data points.
6) Analyze the outlier data from the two algorithms to find out what information is carried by the customer data detected as outliers.
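The min-max normalization used in the second step can be sketched in Python as follows. The function name, the default target range of 0 to 1 and the handling of a constant attribute (all values mapped to the lower bound) are assumptions for illustration.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly map values from [min(values), max(values)] to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumed convention: a constant attribute maps to the lower bound.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

# Hypothetical Monetary values in rupiah, for illustration only.
monetary = [10_000, 20_000, 30_000]
print(min_max_normalize(monetary))  # [0.0, 0.5, 1.0]
```

Applying the same function separately to the R, F and M columns puts all three attributes on a common 0 to 1 scale before the distances are computed.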

Results and Analysis
The RFM model data in Table 2 are normalized with the min-max method in equation (5) to the range 0-1, and the results are shown in Table 3. This study uses the Dunn index method to search for the optimal number of clusters in both K-Means and DBSCAN, using (2), (3) and (4). Table 4 shows the Dunn index values for K-Means. Based on Table 4, the optimal number of clusters is 2, with a Dunn index value of 1.31. In this study, the experiments on the number of clusters in K-Means were carried out for 2 to 9 clusters, because the Dunn index value decreased as the number of clusters increased. Table 5 shows the Dunn index values for the DBSCAN algorithm. This study examines epsilon values from 1 to 6 and minimum points values from 0.1 to 1.0. Based on the epsilon and minimum points values tested, the highest Dunn index value is 1.02, with an optimal number of clusters of 2. Thus, based on the Dunn index values, the optimal number of clusters generated by both algorithms is two.
The next step after finding the optimal number of clusters is to determine the outliers in both algorithms. In the DBSCAN algorithm, 37 data points were detected as outliers, lying outside cluster 1 and cluster 2. The amount of data in cluster 1 was 800 and in cluster 2 was 1030. The data in cluster 1, cluster 2 and the outliers are shown in Table 6. The third column in Table 6 lists the outliers produced by the DBSCAN algorithm. The K-Means algorithm produces 2 clusters, with 1065 data points in cluster 1 and 801 in cluster 2. Outliers in K-Means are objects that are much different from the rest of the objects in the data set [15]. Sitanggang and Baehaki [15] identified two kinds of outliers in K-Means, namely global and collective outliers. Global outliers occur when an object deviates from the rest of the data set. Collective outliers occur when a whole cluster deviates from the data set [1], [15]; following the research of Sitanggang and Baehaki, a cluster is treated as a collective outlier when it holds less than 1% of the total data. The outliers in this study are global outliers, because no cluster holds less than 1% of the total data. The data detected as global outliers number 25 from cluster 1. A global outlier in K-Means is identified as an object that is far away from the centroids of all clusters. This study examined each cluster for global outliers using the outlier score in (6) [1]. The outlier score is used to examine the score generated by each data point detected as an outlier [15]. The higher the outlier score, the farther the data point is from the centroid of its cluster.
$$\text{Outlier Score} = \frac{dist(o, c_o)}{l_{c_o}} \qquad (6)$$

where o is an object in the dataset, c_o is the centroid nearest to the object o, dist(o, c_o) is the distance between the object o and its nearest centroid c_o, and l_{c_o} is the average distance from c_o to the objects assigned to c_o.
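The outlier score in (6) can be illustrated with the following Python sketch. It assumes that the centroids come from a finished K-Means run and that the average distance of a cluster includes the object being scored; the function name, the sample points and the single centroid are hypothetical.

```python
import math

def outlier_scores(points, centroids):
    """Score each point as dist(o, c_o) divided by the mean distance of c_o's members."""
    k = len(centroids)
    # Nearest centroid c_o and the distance dist(o, c_o) for every object o.
    nearest = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
               for p in points]
    dists = [math.dist(p, centroids[c]) for p, c in zip(points, nearest)]
    # Average distance from each centroid to the objects assigned to it.
    totals, counts = [0.0] * k, [0] * k
    for c, d in zip(nearest, dists):
        totals[c] += d
        counts[c] += 1
    averages = [t / n if n else 0.0 for t, n in zip(totals, counts)]
    # Assumed convention: score 0.0 when every member sits on the centroid.
    return [d / averages[c] if averages[c] > 0 else 0.0
            for d, c in zip(dists, nearest)]

# Hypothetical data: four points around (1, 1) and one distant point.
points = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0), (1.0, 9.0)]
scores = outlier_scores(points, centroids=[(1.0, 1.0)])
print([round(s, 2) for s in scores])
```

In this hypothetical example only the distant point receives a score above 2.0, while the four points near the centroid score below 1.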

The outlier scores in the K-Means algorithm range from 2.0 to 9.4, as shown in Table 7. The number of outliers in K-Means is 26 in cluster 1 and 37 in cluster 2, for a total of 63 data points. Table 7 lists the outlier scores in cluster 1 and cluster 2. The second column in Table 7 contains the outliers in cluster 1; those that are also outliers in the DBSCAN algorithm are marked in bold. Comparing the third column in Table 6 with the second column in Table 7 shows that some of the same data, marked in bold, appear in both. There are 26 such data points, shown in the fourth column of Table 8. These twenty-six data points are customer data detected as outliers by both algorithms. The last step of this study, after finding the outlier data in both algorithms, is to find important information in the customer data by looking at the Recency, Frequency and Monetary values of each customer. Four symbols are found in this study, namely R ↓ F ↓ M ↓, R ↓ F ↑ M ↓, R ↓ F ↓ M ↑ and R ↓ F ↑ M ↑. The symbol ↓ is a low value compared with the average value and the symbol ↑ is a high value compared with the average value. In this subset, the R values lie between 0.0 and 0.2243, the F values between 0.106 and 1.0 and the M values between 0.004 and 1. This means that these customers have R values below the data set average of 0.3589, F values that are both low and high relative to the data set average of 0.0227, and M values that are likewise both below and above the data set average of 0.0068.
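The comparison of the outliers from the two algorithms amounts to a set intersection, which can be sketched in Python as follows. The function name and the customer IDs are hypothetical and do not come from the tables of this study.

```python
def shared_outliers(dbscan_outliers, kmeans_outliers):
    """Return the customer IDs flagged as outliers by both algorithms, sorted."""
    return sorted(set(dbscan_outliers) & set(kmeans_outliers))

# Hypothetical customer IDs, for illustration only (not the study's data).
dbscan_ids = [3, 17, 42, 58]
kmeans_ids = [17, 21, 42, 99]
print(shared_outliers(dbscan_ids, kmeans_ids))  # [17, 42]
```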
Based on Table 1, the symbol R ↓ F ↓ M ↓ is a New Customer, whose buying behaviour shows a low number of transactions and a still low amount of money spent. The symbol R ↓ F ↑ M ↓ is a Prospect Customer, with a high number of transactions but a still low amount of money spent. The symbol R ↓ F ↓ M ↑ is a New Customer with a low number of transactions but a high amount of money spent. The symbol R ↓ F ↑ M ↑ is a Loyal Customer, with a high number of transactions and a high amount of money spent.

Conclusion and Further Research
This study found 37 outliers with DBSCAN and 63 with K-Means (26 in cluster 1 and 37 in cluster 2). The outliers shared by the two algorithms amount to 67 percent, based on the 37 outliers in DBSCAN and the 26 in K-Means. This study also found that the outliers produced by K-Means were global outliers, because the outliers belong to clusters that each hold more than 1% of the entire data. In addition, the K-Means outliers that coincide with the DBSCAN outliers are those in cluster 1 of K-Means. Further research is required to determine whether each global outlier in K-Means is also an outlier in DBSCAN.
The outliers found showed three cluster characteristics, namely Prospect Customers, New Customers and Loyal Customers. It was concluded that most of the data in this study had the characteristics of Lost Customers, namely customers who had not purchased for a long time and whose purchases involved a low number of transactions and a low amount of money. Therefore, the company needs to devise a strategy so that customers make purchases and remain loyal to the company.

Table 2. Data of RFM Models