Distance Functions Study in Fuzzy C-Means Core and Reduct Clustering

ABSTRACT

INTRODUCTION
Current technological developments produce data that is not only large but also continuous. In fact, in recent times, humans have produced more data than had ever been generated before [1]. Data is now available massively, in large quantities, and in various types [2]. This kind of data is termed big data, and it forces us to be able to extract important information from this abundance.
One of the important pieces of information in data is its group structure. Data grouping is very useful for solving various real-world problems and is often applied in customer segmentation, recommendation, image processing, and other areas [3]. Often the data groups have not been identified beforehand, so supervised learning cannot be applied. One basis for grouping data is similarity; this method of grouping is called clustering [4]-[6]. A further problem is that data are often assigned to one group arbitrarily, without considering the possibility that they also belong to other groups. The computation may run faster, but its accuracy is questionable. Fuzzy clustering has been proposed as a solution to this problem.
The degree of membership is the basis of the fuzzy clustering method [7]. Based on it, each data point is assigned a degree of membership in each cluster. Fuzzy C-Means clustering is a popular fuzzy clustering method [8]. It is a distance-based clustering method that applies the concept of fuzzy logic [9]. The clustering process goes hand in hand with an iterative process that minimizes the objective function [3][7][8]. The objective function is the sum of the products of the distance between each data point and the nearest cluster center with the degree of membership [10]. The value of the function should decrease as the number of iterations grows. The distance function used in this method has a key role [11].
ISSN 2338-3070, Jurnal Ilmiah Teknik Elektro Komputer dan Informatika (JITEKI), Vol. 7, No. 1, April 2021, pp. 118-130, Distance Functions Study in Fuzzy C-Means Core and Reduct Clustering (Joko Eliyanto)
Various studies on the effect of distance in clustering methods have been carried out. Some previous studies report that no distance is dominant and that the outputs are not much different. The results of clustering are very dependent on the dataset used [3]. The effects of the Euclidean, Manhattan / City Block, Chebyshev, and Minkowski distances have been studied in the K-Means clustering algorithm [11] [12]. The results of both studies indicate that the Manhattan distance has a slower computation time than the other distances. In another study, the Euclidean, Manhattan / City Block, Canberra, and Chebyshev distances were applied and evaluated on the fuzzy clustering algorithm [13]-[15]. That study concluded that the results of clustering were very dependent on the data used [16]. In our latest research, the combined Minkowski and Chebyshev distance could also be used to optimize Fuzzy C-Means clustering [17]. Another form of Euclidean distance, namely Average distance, can also be used in the clustering algorithm and produces better results than Euclidean distance [18].
Another way to optimize a clustering method is to apply dimension reduction [19]. Dimension reduction reduces the dimension of the data while still maintaining its characteristics [20]. One dimension reduction method is Core and Reduct. The Core and Reduct method from Rough Set theory has been shown to improve the performance of Fuzzy C-Means Clustering with the Euclidean distance function [3] [21]. This study extends that earlier result. We want to know whether Core and Reduct consistently reduces the computational load of Fuzzy C-Means Clustering across various distance functions. The second objective is to find the best distance for the new method. The data used are limited to five UCI machine learning datasets, namely the iris, yeast, seeds, sonar, and hill-valley data [22]. In this study, the method is implemented on numerical data only, and the Core and Reduct dimension reduction method used was likewise developed for numerical data only.

RESEARCH METHOD
2.1. Fuzzy C-Means Core and Reduct Clustering
Fuzzy C-Means Clustering (FCM) is a clustering method that allows a data point to belong to two or more clusters [7]. This method was invented by Dunn in 1973 and developed further by Bezdek in 1981. It is commonly applied in pattern recognition. The method is based on minimizing the objective function

$$J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{m} \, \lVert x_i - c_j \rVert^{2}, \qquad 1 < m < \infty, \tag{1}$$

where $m$ is a real number greater than 1, $N$ is the number of data points, $C$ is the number of clusters, $u_{ij}$ is the membership degree of $x_i$ in cluster $j$, $x_i$ is the $i$-th data point, $c_j$ is the centroid of cluster $j$, and $\lVert \cdot \rVert$ is the norm that expresses the similarity between a data point and a cluster centroid.
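As a minimal sketch of how the objective function (1) can be evaluated, the snippet below assumes the squared Euclidean norm and plain Python lists; the function name `fcm_objective` and the toy data are hypothetical and are not taken from the paper's program.

```python
def fcm_objective(data, centers, memberships, m=2.0):
    """Return J_m = sum_i sum_j u_ij^m * ||x_i - c_j||^2 (squared Euclidean norm assumed)."""
    total = 0.0
    for x, u_row in zip(data, memberships):
        for c, u in zip(centers, u_row):
            sq_dist = sum((xk - ck) ** 2 for xk, ck in zip(x, c))
            total += (u ** m) * sq_dist
    return total


# Hypothetical toy data: two 1-D points, each sitting exactly on one center.
data = [[0.0], [4.0]]
centers = [[0.0], [4.0]]
crisp = [[1.0, 0.0], [0.0, 1.0]]  # each point fully in its own cluster
even = [[0.5, 0.5], [0.5, 0.5]]   # each point split evenly between clusters

j_crisp = fcm_objective(data, centers, crisp)  # 0.0
j_even = fcm_objective(data, centers, even)    # 8.0 = 2 * (0.5**2 * 16)
```

The crisp assignment gives the smaller value, which illustrates why a lower objective value indicates more clearly separated groups.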
The fuzzy partition is obtained through iterative optimization of the objective function (1), updating the membership degrees $u_{ij}$ and the cluster centroids $c_j$ by

$$u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{\frac{2}{m-1}}}, \tag{2}$$

$$c_j = \frac{\sum_{i=1}^{N} u_{ij}^{m} \, x_i}{\sum_{i=1}^{N} u_{ij}^{m}}.$$

The iteration is terminated when

$$\max_{i,j} \left| u_{ij}^{(t+1)} - u_{ij}^{(t)} \right| < \varepsilon,$$

where $\varepsilon$ is a termination threshold between 0 and 1 and $t$ is the iteration step.
In this method, the value of the objective function and the membership degrees are closely related. At the initial iteration, we assume that each data point already has a membership degree for each existing cluster. These values are then repeatedly updated through (2). As (2) is updated, (1) also moves toward its minimum value.
One of the challenges of the clustering method is the large computational load. To overcome this, a dimension reduction step using Core and Reduct is carried out before the clustering process. The dataset for a clustering problem can be viewed as an information table [3]. In the information table, a set of attributes $R \subseteq A$ is called a reduct if $R$ satisfies the following two conditions:
(i) $IND(R) = IND(A)$;
(ii) $IND(R - \{a\}) \neq IND(A)$ for every $a \in R$.
Condition (i) states that every pair of objects that cannot be distinguished by the subset $R$ also cannot be distinguished by $A$, and vice versa. Condition (ii) states that there are pairs of objects that cannot be distinguished by $R - \{a\}$ but can be distinguished by $A$. This means that $R$ is a minimal set of attributes that maintains the indiscernibility relation $IND(A)$. Usually, there is more than one reduct in an information table. The set of all reducts of the information table $T$ is denoted $RED(T)$.
Then, the core of the attribute set is defined as the intersection of all reducts,

$$CORE(T) = \bigcap_{R \in RED(T)} R.$$

Algorithm 1 presents Fuzzy C-Means Clustering with Core and Reduct dimension reduction.
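The reduct and core definitions can be illustrated with a brute-force sketch. It assumes a small table of discrete values in which two objects are indiscernible when they agree exactly on the chosen attributes; how the paper's numerical variant compares values is not described here, so this is not the authors' implementation. The helper names `ind_partition`, `reducts`, and `core`, and the toy table, are hypothetical.

```python
from itertools import combinations


def ind_partition(table, attrs):
    """Group object indices that agree on every attribute in attrs (the IND classes)."""
    groups = {}
    for idx, row in enumerate(table):
        key = tuple(row[a] for a in attrs)
        groups.setdefault(key, []).append(idx)
    return {frozenset(g) for g in groups.values()}


def reducts(table):
    """Brute-force RED(T): minimal attribute subsets R with IND(R) == IND(A)."""
    all_attrs = tuple(range(len(table[0])))
    full = ind_partition(table, all_attrs)
    found = []
    for size in range(1, len(all_attrs) + 1):
        for subset in combinations(all_attrs, size):
            if any(set(r) <= set(subset) for r in found):
                continue  # contains a smaller reduct, so it is not minimal
            if ind_partition(table, subset) == full:
                found.append(subset)
    return found


def core(table):
    """Core = intersection of all reducts."""
    reds = reducts(table)
    return set.intersection(*(set(r) for r in reds)) if reds else set()


# Hypothetical table: 4 objects, 3 attributes; attributes 1 and 2 are duplicates.
table = [
    (0, 0, 0),
    (0, 1, 1),
    (1, 0, 0),
    (1, 1, 1),
]
all_reducts = reducts(table)  # [(0, 1), (0, 2)]
table_core = core(table)      # {0}
```

In this toy table either duplicate attribute can be dropped, so there are two reducts, and only attribute 0 appears in both. The exhaustive search is exponential in the number of attributes and is meant only to make the definitions concrete.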

Algorithm 1. Fuzzy C-Means Core & Reduct Clustering
INPUT: The input data takes the form of variables that express objects and attributes. The data is an $n \times m$ matrix, where $n$ is the number of data objects and $m$ is the number of data attributes.
PROCESS BEGIN
1. If the dataset is not numeric, encode the data. Otherwise, proceed to the next step.
2. Apply the Core and Reduct method so that the $m$ variables are reduced to $p$ new variables, with $p \le m$.
3. Apply the fuzzy c-means clustering method so that data clusters are obtained.
4. Evaluate the clusters.
PROCESS END
OUTPUT: The value of the objective function, computational time, purity, Davies Bouldin Index, Silhouette Score and accuracy.
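Step 3 of Algorithm 1 can be sketched with the standard Fuzzy C-Means update rules. This is a minimal pure-Python illustration under stated assumptions: the data have already been reduced to the selected attributes, the Euclidean norm is used, and the initial memberships are random with a fixed seed. The function `fcm`, its parameters, and the toy points are hypothetical.

```python
import random


def sq_euclidean(x, c):
    return sum((a - b) ** 2 for a, b in zip(x, c))


def fcm(data, n_clusters, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Minimal Fuzzy C-Means: returns (centers, memberships, iterations used)."""
    rng = random.Random(seed)
    n, dim = len(data), len(data[0])
    # Random initial memberships; each row is normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        s = sum(row)
        u.append([v / s for v in row])
    for iteration in range(1, max_iter + 1):
        # Centroid update: mean of the data weighted by u_ij^m.
        centers = []
        for j in range(n_clusters):
            w = [u[i][j] ** m for i in range(n)]
            total = sum(w)
            centers.append(
                [sum(w[i] * data[i][k] for i in range(n)) / total for k in range(dim)]
            )
        # Membership update; with squared distances D the exponent 2/(m-1)
        # on distance ratios becomes 1/(m-1) on ratios of D.
        new_u = []
        for x in data:
            d = [max(sq_euclidean(x, c), 1e-12) for c in centers]
            inv = [dj ** (-1.0 / (m - 1.0)) for dj in d]
            s = sum(inv)
            new_u.append([v / s for v in inv])
        change = max(abs(a - b) for ra, rb in zip(new_u, u) for a, b in zip(ra, rb))
        u = new_u
        if change < eps:  # largest membership change is below the threshold
            break
    return centers, u, iteration


# Hypothetical data: two well-separated groups of three 2-D points.
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
centers, u, iters = fcm(points, n_clusters=2)
labels = [row.index(max(row)) for row in u]  # hardened cluster labels
```

Swapping `sq_euclidean` for another distance changes only the membership update; the weighted-mean centroid update shown here is the one derived for the Euclidean norm, so using it with other distances is a simplification.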

Distance Function
a. Euclidean Distance
Euclidean distance is the most common and most widely applied distance in the Fuzzy C-Means clustering process. For coordinates $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$, this distance is defined as

$$d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2},$$

where $x_k$ and $y_k$ are the values of $x$ and $y$ in dimension $k$ of the $n$ dimensions. This distance is the standard distance for the fuzzy c-means clustering method [11] [18].

b. Manhattan Distance
Manhattan distance is defined as the sum of the distances over all attributes. Hence, for two data coordinates $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$ in dimension $n$, the Manhattan distance is defined as [12]:

$$d(x, y) = \sum_{k=1}^{n} |x_k - y_k|,$$

where $x_k$ and $y_k$ are the values of $x$ and $y$ in dimension $k$ of the $n$ dimensions.

c. Chebyshev Distance
This distance is also known as the maximum distance, which is defined as the maximum value of the distances over the existing attributes. The distance for two data coordinates $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$ in dimension $n$ is defined as [12]:

$$d(x, y) = \max_{1 \le k \le n} |x_k - y_k|,$$

where $x_k$ and $y_k are the values of $x$ and $y$ in dimension $k$ of the $n$ dimensions.
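The Euclidean, Manhattan, and Chebyshev distances can be compared on one pair of points with the short sketch below. It uses the standard definitions of the three distances; the function names and the sample vectors are hypothetical.

```python
import math


def euclidean(x, y):
    """Square root of the sum of squared per-attribute differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of absolute per-attribute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def chebyshev(x, y):
    """Largest absolute per-attribute difference."""
    return max(abs(a - b) for a, b in zip(x, y))


# Hypothetical points whose per-attribute differences are 3, 4 and 0.
x = [1.0, 2.0, 3.0]
y = [4.0, 6.0, 3.0]

d_euclidean = euclidean(x, y)  # 5.0
d_manhattan = manhattan(x, y)  # 7.0
d_chebyshev = chebyshev(x, y)  # 4.0
```

For any pair of points the three values are ordered Chebyshev ≤ Euclidean ≤ Manhattan, which is one reason the same dataset can cluster differently under each distance.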

d. Minkowski Distance
Minkowski distance is a generalized metric distance, which is defined as [12]:

$$d(x, y) = \left( \sum_{k=1}^{n} |x_k - y_k|^{\lambda} \right)^{1/\lambda},$$

where $\lambda \ge 1$ is the order of the distance. This distance is very sensitive to changes when the values of both analyzed coordinates are close to 0. This distance was chosen because its character is similar to that of the Manhattan distance.
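A minimal sketch of the Minkowski distance is given below, using its standard definition. The order parameter is called `lam` here, which is a hypothetical name, as are the sample vectors. The example shows that order 1 reproduces the Manhattan distance, order 2 the Euclidean distance, and a large order approaches the Chebyshev distance.

```python
def minkowski(x, y, lam=2.0):
    """Minkowski distance of order lam (lam >= 1)."""
    return sum(abs(a - b) ** lam for a, b in zip(x, y)) ** (1.0 / lam)


# Hypothetical points whose per-attribute differences are 3, 4 and 0.
x = [1.0, 2.0, 3.0]
y = [4.0, 6.0, 3.0]

d_order1 = minkowski(x, y, lam=1.0)    # 7.0, same as Manhattan
d_order2 = minkowski(x, y, lam=2.0)    # 5.0, same as Euclidean
d_order50 = minkowski(x, y, lam=50.0)  # very close to 4.0, the Chebyshev value
```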

g. Average Distance
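Average distance is a modification of the Euclidean distance. This modification is applied to improve the clustering result [18]. This distance is defined as:

$$d(x, y) = \sqrt{\frac{1}{n} \sum_{k=1}^{n} (x_k - y_k)^2},$$

where $x_k$ and $y_k$ are the values of $x$ and $y$ in dimension $k$ of the $n$ dimensions.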
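The sketch below assumes the common definition of the Average distance as the square root of the mean squared per-attribute difference, that is, the Euclidean distance divided by the square root of the number of dimensions. The function name and sample vectors are hypothetical.

```python
import math


def average_distance(x, y):
    """Root of the mean squared per-attribute difference (assumed definition)."""
    n = len(x)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / n)


# Hypothetical points whose per-attribute differences are 3, 4 and 0.
x = [1.0, 2.0, 3.0]
y = [4.0, 6.0, 3.0]

d_average = average_distance(x, y)  # sqrt(25 / 3), the Euclidean value 5 scaled by 1 / sqrt(3)
```

Because the scaling factor depends only on the number of attributes, the two distances rank pairs of points identically within one dataset; the values differ once the number of attributes changes, for example after dimension reduction.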

Cluster Evaluation
Cluster evaluation is applied to determine the level of accuracy of the clustering algorithm, taking into account the availability of cluster labels. In this research, four tests are applied to evaluate the clustering process: the purity test, the accuracy test, the Davies Bouldin Index (DBI), and the Silhouette Coefficient Score.

Purity
Purity measures how pure a cluster is. The purity of each cluster $i$, with $1 \le i \le C$, is obtained by taking the largest number of its objects that come from a single original class $h$, with $1 \le h \le C'$, where $C'$ is the number of original classes. The overall purity of the $C$ clusters is obtained by adding up the purity of each cluster and dividing by the number of objects, defined as follows:

$$\mathrm{Purity} = \frac{1}{n} \sum_{i=1}^{C} \max_{1 \le h \le C'} n_{ih},$$

where $n_{ih}$ is the number of objects in cluster $i$ that belong to class $h$ and $n$ is the number of objects. A bad clustering has a purity value close to 0, which means that no cluster results match the original class. A good clustering has a purity value of 1, which means that the cluster results are in accordance with the original class.
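A minimal sketch of the purity computation follows, assuming hardened cluster labels and known class labels given as two parallel lists. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def purity(cluster_labels, class_labels):
    """Sum of each cluster's majority-class count, divided by the number of objects."""
    per_cluster = {}
    for c, k in zip(cluster_labels, class_labels):
        per_cluster.setdefault(c, Counter())[k] += 1
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(class_labels)


# Hypothetical labels: cluster 0 holds two 'a' and one 'b'; cluster 1 holds three 'b'.
clusters = [0, 0, 0, 1, 1, 1]
classes = ["a", "a", "b", "b", "b", "b"]

p_mixed = purity(clusters, classes)                          # 5 / 6
p_perfect = purity([0, 0, 1, 1, 1, 1], classes)              # 1.0
```

Purity does not depend on how the clusters are numbered, because each cluster is scored against whichever class is most frequent inside it.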

Accuracy
Accuracy is calculated by adding up, over the clusters $i$ with $1 \le i \le C$, the number of objects that fall in the correct class, and then dividing by the number of data objects. Accuracy is defined as

$$\mathrm{Accuracy} = \frac{1}{n} \sum_{i=1}^{C} a_i,$$

where $a_i$ is the number of objects in cluster $i$ that correspond to the original class and $n$ is the number of objects. If all clusters match the original class, dividing by the amount of data produces a value of 1.
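The accuracy measure can be sketched as below. The sketch assumes that cluster index $i$ has already been matched to original class $i$, so that an object counts toward $a_i$ when its cluster label equals its class label; how the paper aligns clusters with classes is not stated, so this alignment is an assumption. The function name and toy labels are hypothetical.

```python
def clustering_accuracy(cluster_labels, class_labels):
    """Fraction of objects whose cluster label equals their class label."""
    correct = sum(1 for c, k in zip(cluster_labels, class_labels) if c == k)
    return correct / len(class_labels)


# Hypothetical labels for six objects in two classes.
classes = [0, 0, 0, 1, 1, 1]
clusters = [0, 0, 1, 1, 1, 0]

acc = clustering_accuracy(clusters, classes)          # 4 / 6
swapped = [1 - c for c in clusters]                   # same grouping, labels renamed
acc_swapped = clustering_accuracy(swapped, classes)   # 2 / 6
```

The second call shows that, under this assumption, accuracy depends on how clusters are numbered, so a label-matching step is needed before the measure is meaningful.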

Davies Bouldin Index
Davies Bouldin Index (DBI) is one of the methods used to measure cluster validity in a clustering method. The purpose of measurement with DBI is to maximize the distance between clusters (inter-cluster) and to minimize the distance between data points in the same cluster (intra-cluster) [17].
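A minimal sketch of the Davies Bouldin Index is shown below, using its usual definition: for each cluster, take the worst ratio of summed within-cluster scatter to centroid separation, then average over clusters. It assumes Euclidean distance, hardened labels, at least two clusters, and distinct centroids; the function name and toy data are hypothetical.

```python
import math


def davies_bouldin(data, labels):
    """Average over clusters of max_j (S_i + S_j) / M_ij, with Euclidean distances."""
    clusters = {}
    for x, lab in zip(data, labels):
        clusters.setdefault(lab, []).append(x)
    keys = sorted(clusters)
    # Centroid of each cluster: per-attribute mean of its members.
    centroids = {
        k: [sum(col) / len(clusters[k]) for col in zip(*clusters[k])] for k in keys
    }
    # Scatter S_k: mean distance of the members to their centroid.
    scatter = {
        k: sum(math.dist(x, centroids[k]) for x in clusters[k]) / len(clusters[k])
        for k in keys
    }
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in keys
            if j != i
        )
        for i in keys
    ]
    return sum(worst) / len(keys)


# Hypothetical 1-D data with two obvious groups.
data = [[0.0], [2.0], [10.0], [12.0]]
dbi_good = davies_bouldin(data, [0, 0, 1, 1])  # 0.2
dbi_bad = davies_bouldin(data, [0, 1, 0, 1])   # 5.0
```

The well-separated labelling gives the value closer to 0, in line with the aim of small intra-cluster and large inter-cluster distances.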

Silhouette Coefficient Score
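Silhouette Coefficient Score (SCS) is an internal metric that measures the cohesiveness and separation of clusters at the same time [24]. SCS is calculated from the average distance within a cluster and the minimum distance between an object and another cluster as follows:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}},$$

where $a(i)$ is the average distance from object $i$ to the other objects in its own cluster and $b(i)$ is the minimum average distance from object $i$ to the objects of another cluster. The score ranges from -1 to 1, where 1 means the grouping solution is correct and -1 means the grouping solution is wrong.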
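The silhouette computation can be sketched as follows, using the usual per-object definition averaged over all objects. The sketch assumes Euclidean distance, hardened labels, and at least two clusters, and it assigns a score of 0 to objects in singleton clusters, which is a common convention rather than something stated in the paper. The function name and toy data are hypothetical.

```python
import math


def silhouette_score(data, labels):
    """Mean of (b - a) / max(a, b) over all objects, with Euclidean distances."""
    n = len(data)
    scores = []
    for i in range(n):
        sums, counts = {}, {}
        for j in range(n):
            if i == j:
                continue
            d = math.dist(data[i], data[j])
            sums[labels[j]] = sums.get(labels[j], 0.0) + d
            counts[labels[j]] = counts.get(labels[j], 0) + 1
        own = labels[i]
        if own not in counts:
            scores.append(0.0)  # singleton cluster (assumed convention)
            continue
        a = sums[own] / counts[own]                             # cohesion
        b = min(sums[l] / counts[l] for l in sums if l != own)  # separation
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / n


# Hypothetical 1-D data with two obvious groups.
data = [[0.0], [2.0], [10.0], [12.0]]
s_good = silhouette_score(data, [0, 0, 1, 1])  # 158 / 198, about 0.798
s_bad = silhouette_score(data, [0, 1, 0, 1])   # -0.4
```

The pairwise loop costs time quadratic in the number of objects, which is acceptable for small datasets but not for large ones.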

Dataset
The datasets used in this study are taken from the UCI Machine Learning website [22]. Table 1 presents a brief description of the datasets used in this study. All of the datasets in Table 1 are numeric.

Research Method
This research is a numerical simulation using the Fuzzy C-Means Core and Reduct Clustering method. The program is written in Python 3.0. The clustering method is run with seven different distance functions, and the results are then visualized and analyzed. The research steps are presented in Fig. 1.

RESULTS AND DISCUSSION
The initial step of this method is to reduce the dimensions of the dataset. The dataset in Table 1 is reduced using the Core and Reduct method. The result of this dimension reduction is a new dataset with fewer variables. The results of the reduction are presented in Table 2.
Based on the results in Table 2, Core and Reduct works better for data with high dimensions. For low-dimensional datasets, it tends to be difficult to determine the core of the analyzed data. These results are consistent with research in the same field [21]. Data with lower dimensions make the computational load lower, so the computation time can be reduced significantly.
The main objective of Fuzzy C-Means clustering is to make the objective function value as low as possible. The lower the objective function, the better the result of fuzzy c-means clustering, meaning that the groups of data are more clearly separated. Fig. 2 compares the objective function values obtained from Fuzzy C-Means and from Fuzzy C-Means Core and Reduct. For the five simulated datasets, Core and Reduct is able to decrease the value of the objective function until close to 0% of it remains for the Euclidean, Manhattan, Canberra, and Minkowski distances. These results support previous research, which states that Euclidean is one of the best performing distances [12].

Fig. 2. Objective function average values
The computation works in a lower dimension, which decreases the computational load. This affects the computing time and the number of iterations performed. Fig. 3 compares the computing time of the two Fuzzy C-Means methods. At all distances, the computation time decreased significantly. The largest drop occurred at the Minkowski distance. However, the Euclidean distance still has the lowest computation time of all the distances used. These results indicate the consistency of the Core and Reduct method, which is able to reduce the computational load of the Fuzzy C-Means Clustering method. These results complement previous studies, which were limited to the Euclidean and Minkowski-Chebyshev distances [3][17].

Fig. 3. Average computing time
The lower computation time is usually also due to the number of iterations needed to converge. If the number of iterations is low, fewer steps must be taken. Fig. 4 compares the average number of iterations until convergence for the two analyzed methods. The result shows that the Core and Reduct method is reasonably effective at decreasing the computing time of Fuzzy C-Means clustering. This behavior is caused by the decrease in the number of iterations until convergence when Fuzzy C-Means Core and Reduct is applied. The decrease is not very significant because Core and Reduct only reduces the number of attributes, while the number of data records is unchanged.

Fig. 4. Number of iterations until convergence
Another expected result of applying dimension reduction is that the quality of the clustering result remains good, that is, similar to the result of the ordinary clustering process. A good cluster output should have accuracy and purity values close to 1. Accuracy describes how accurately the clustering model works. Fig. 5 presents the accuracy value for each distance function. The average accuracy for all distances is 0.47. The distance with the highest accuracy is Euclidean, with a value of 0.56, while the distance with the lowest accuracy is Minkowski, with a value of 0.38. However, the impact of Core and Reduct dimension reduction is assessed by comparing the accuracy before and after the reduction. For the Euclidean, Manhattan, Minkowski-Chebyshev, and Canberra distances, more than 80% of the average accuracy can be maintained. For the Minkowski distance, the accuracy drops drastically until only around 60% remains. This result appears to be improvable by combining it with the Chebyshev distance. This supports the research showing that the new Minkowski-Chebyshev distance has a great impact and can be used for the optimization of machine learning methods [17] [23].
reduced, the quality of the clusters can be maintained. The result of this paper expands the previous paper [3]. Fig. 6 presents these results.

Fig. 6. Average Purity
Another measure of the goodness of a cluster is the Silhouette Coefficient Score. The higher the value of this metric, the more correct the clustering results are. Interesting results emerge here. On average, Fuzzy C-Means Clustering with Core and Reduct improves the silhouette score from 0.399 to 0.507. This means that the Core and Reduct method can improve the quality of cluster results consistently across all distance functions. Fig. 7 presents the Silhouette Coefficient Score at each distance for the two methods. Nearly all distance functions show a significant increase in this measure. The Minkowski-Chebyshev distance yields the worst score in this case.
Fig. 8 shows the comparison of the average DBI results between the Fuzzy C-Means Clustering method and Fuzzy C-Means Core and Reduct Clustering. A clustering scheme is considered optimal if it has a minimal Davies Bouldin Index (DBI), close to 0 [17]. The new method decreases the DBI until 53% of its value remains. This means that Core and Reduct dimension reduction improves the results of Fuzzy C-Means clustering. This applies to all distance functions.
Some of the aforementioned results are in line with previous research [25] [26]. R. Zhao, L. Gu, and X. Zhu also conducted research in the same field. Their research combined C-Means clustering with Reduct as a Rough Set feature selection method, which was able to improve the accuracy by 1% on average [21]. The addition of the Core process in this research guarantees that the result of the dimension reduction is obtained only from the core of the dataset. The best result of this research is obtained for a certain random value of U used to initialize the fuzzy c-means algorithm. In the future, an optimization process to determine the initial value that yields the best cluster centroids can be developed to further improve Fuzzy C-Means clustering performance. Table 3 presents the evaluation of the clustering results from Fuzzy C-Means Core and Reduct Clustering with seven different distance functions. Based on these results, Euclidean distance is still considered the best distance function to apply.

CONCLUSION
The Core and Reduct dimension reduction method can reduce the computational burden of the Fuzzy C-Means Clustering method for all distance functions. The value of the objective function can be significantly reduced, so the number of iterations and the computation time can also be significantly reduced. These results indicate that Core and Reduct dimension reduction works consistently with the Fuzzy C-Means clustering method across various distance functions. Even so, the quality of the cluster results from this method can still be maintained. This is shown by the increase in the Silhouette Coefficient Score, the decrease in DBI, and the accuracy and purity values, which are still above 80%. Euclidean distance is the best distance, with the best number of iterations, computation time, and Silhouette Coefficient Score. The Fuzzy C-Means Clustering method with Core and Reduct dimension reduction is not recommended for the Minkowski-Chebyshev distance function. In the future, the Fuzzy C-Means Core and Reduct Clustering method can be developed further and applied to image data, video, or other data types.