Combination of Cluster Method for Segmentation of Web Visitor

Clustering is an important part of web usage mining for the purpose of segmenting visitors, which in turn is essential for web personalization or web modification. In this paper, we cluster web visitors by applying a combination of hierarchical and non-hierarchical clustering methods to web log data. The hierarchical clustering method is used to determine the number of clusters, and the non-hierarchical clustering method is used to form the clusters. The cluster analysis is preceded by data pre-processing and factor analysis. With this approach, the web owner can find the access patterns of web visitors more effectively and gain new knowledge about visitor segmentation. A test on ITS's web log data resulted in 6 clusters of web visitors, of which cluster 3 has the largest number of members. This information can help web management focus on the behavioral patterns of the members of cluster 3, either to personalize or to modify the web. The test results show the feasibility and efficiency of this method.


Introduction
The Internet has become a huge information source [1] and an important medium for distributing current information. This is largely due to one Internet service, namely the World Wide Web (WWW), which is capable of disseminating information as text, images, video, voice, and other multimedia. A survey conducted by Netcraft in July 2012 states that there are 665,916,461 active sites, and according to Internet World Stats, in December 2011 there were 2,267,233,742 Internet users in the world. This means that the interaction between Internet users and web sites is very high, and web servers record every visitor activity in files (web logs). Web logs have become the most important data source in Web Usage Mining (WUM) for gathering web visitor data, especially for finding visitors' access patterns, predicting visitors' behavior [2], [3], and creating user profiles [4], [5].
WUM, or web log mining [6], is one category in the field of web mining [7], namely mining conducted on the web based on web log data. Specifically, [8] states that WUM is the application of data mining techniques to discover the interaction between visitors and a website through web log data. Mining web logs is useful in a variety of fields, including web personalization [9] and web modification [10].
Techniques in WUM include statistical analysis [11], association rules [12], [13], sequential patterns [14], [15], classification [16], [17], and clustering [18], [19], [20]. Clustering is one of the important topics in WUM, used for segmenting visitors based on their access patterns or frequency of visits. The authors of [21] use a belief function method to cluster web log data. They divide web visitors into different groups and find a common access pattern for the members of each group. However, this approach still requires session identification, which makes the pre-processing stage less efficient. The authors of [22] cluster web visitors with the K-Means method, but they only show that K-Means clustering can be applied to web log data, without validating the resulting clusters.
According to [23], clustering web sessions involves three stages: pre-processing, similarity measurement, and application of the clustering algorithm. In this research, we cluster visitors based on how frequently they visit the site's pages in a given period of time, regardless of web sessions, which makes the pre-processing stage more efficient. We then perform the clustering using a combination of hierarchical and non-hierarchical clustering methods.
This paper is organized as follows: Section 1 explains the background of the research and the related work, Section 2 discusses the stages of the research and the methods used, Section 3 presents the results and analysis, and Section 4 concludes the research.

Research Method
The general stages of this research are shown in Figure 1.

Dataset
The dataset used in this research is web log data from the website of the Tenth of November Institute of Technology (ITS), Surabaya, at www.its.ac.id, collected from 3 to 16 July 2012. The web log files are in the Common Log Format (CLF) [24], the standard format used by web servers when creating a log. Each CLF line consists of host/IP address, identification, authuser, date and time, method, request, status, and bytes, as shown in Table 1.
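As an illustration of the field layout described above, the following minimal sketch splits one CLF line into its fields. The regular expression, the function name parse_clf_line, and the sample line are assumptions made for this example (the masked host is kept as written, and the time zone and protocol are invented); it is not the parser used in this study.

```python
import re
from datetime import datetime

# Common Log Format: host ident authuser [date] "method request protocol" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<request>\S+)(?: (?P<protocol>[^"]*))?" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_clf_line(line):
    """Return a dict of CLF fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status"] = int(fields["status"])
    # A "-" in the bytes field means that no response body was sent.
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    return fields

# Hypothetical line: the time zone and protocol are assumptions for illustration.
sample = '66.249.69.xxx - - [15/Jul/2012:06:45:13 +0700] "GET /index.php HTTP/1.1" 200 15319'
record = parse_clf_line(sample)
```

Only the host and request fields are needed to count visits per visitor and page; the status field could additionally be used to discard failed requests, although the source does not state whether such filtering was applied.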
From the first line of Table 1, we learn that the visitor with IP address 66.249.69.xxx accessed the web page index.php on July 15, 2012 at 06:45:13, with a status code of 200 and a file size of 15319 bytes, and so on for the other lines. This is the kind of information that is examined to obtain the web visitor segmentation. The final result of the pre-processing stage is a data matrix of the following form [22]:
\[
X = \begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1n} \\
x_{21} & x_{22} & \cdots & x_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
x_{m1} & x_{m2} & \cdots & x_{mn}
\end{bmatrix} \tag{1}
\]
where m is the number of web visitors (data), n is the number of web pages (variables), and X is the matrix whose rows are the observation vectors, so that x_{ij} is the number of visits by visitor i to page j. An example of the matrix in equation (1), containing web visitor behavior data based on the frequency of visits to each web page, is shown in Table 2.

Table 2. Matrix vector
Here p1, p2, p3, ..., pn are the variables for the web pages; for example, p1 is the web page named index.php. Likewise, u1, u2, u3, ..., um are the web visitors; for example, u1 is the web visitor with IP address 72.233.234.xxx. From Table 2, it can be seen that visitor u1 accessed web page p1 6 times, web page p2 9 times, and so on. After pre-processing the dataset, data for 165 web visitors with 57 variables (accessed web pages) were obtained. This data matrix was then processed further.
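The construction of the visitor-by-page frequency matrix can be sketched as follows. The function name build_frequency_matrix and the request list are hypothetical; only the counts for u1 (6 visits to p1, 9 visits to p2) come from the text, and the entries for u2 are invented to complete the example.

```python
from collections import Counter

def build_frequency_matrix(visits):
    """Count visits per (visitor, page) pair and arrange them as a matrix.

    visits: iterable of (visitor, page) pairs, one pair per logged request.
    Returns the row labels, the column labels, and the matrix as nested lists.
    """
    counts = Counter(visits)
    visitors = sorted({visitor for visitor, _ in counts})
    pages = sorted({page for _, page in counts})
    # A Counter returns 0 for pairs that never occur in the log.
    matrix = [[counts[(visitor, page)] for page in pages] for visitor in visitors]
    return visitors, pages, matrix

# Hypothetical requests: u1 follows the text, u2 is invented.
visits = [("u1", "p1")] * 6 + [("u1", "p2")] * 9 + [("u2", "p1")] * 2
visitors, pages, matrix = build_frequency_matrix(visits)
```

Counting requests per IP address and page, without grouping them into sessions, is what allows the pre-processing stage to skip session identification.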

Factor Analysis
The next stage is to conduct a factor analysis on the data resulting from the pre-processing stage. Factor analysis is a multivariate method used to describe the pattern of relationships between variables in order to find the underlying independent variables, called factors, that affect the objects. In this case, factor analysis aims to reduce the variables to a smaller set of indicators called factors, without losing meaningful information from the initial variables.
The first stage in factor analysis is to test the adequacy of the data and identify correlations between variables using the Measure of Sampling Adequacy (MSA) in equation (2), the Kaiser-Meyer-Olkin (KMO) measure in equation (3), and Bartlett's test in equation (4) [27].
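Assuming the standard definitions of these three statistics, they can be written as follows. The symbols R (the p × p correlation matrix of the variables), |R| (its determinant), and N (the number of observations) are notation introduced here for the example; Bartlett's statistic is compared with a chi-square distribution with p(p - 1)/2 degrees of freedom.

```latex
\begin{align}
\mathrm{MSA}_i &= \frac{\sum_{j \neq i} r_{ij}^2}
                       {\sum_{j \neq i} r_{ij}^2 + \sum_{j \neq i} a_{ij}^2} \tag{2} \\
\mathrm{KMO} &= \frac{\sum_{i} \sum_{j \neq i} r_{ij}^2}
                     {\sum_{i} \sum_{j \neq i} r_{ij}^2 + \sum_{i} \sum_{j \neq i} a_{ij}^2} \tag{3} \\
\chi^2 &= -\left[ (N - 1) - \frac{2p + 5}{6} \right] \ln |R| \tag{4}
\end{align}
```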
where: i = 1, 2, 3, ..., p and j = 1, 2, 3, ..., p; r_ij is the coefficient of correlation between variables i and j; and a_ij is the partial correlation coefficient between variables i and j. Based on this method, a data set is said to meet the data adequacy and correlation assumptions when the MSA and KMO values are greater than 0.5 and the significance value of Bartlett's test is below 0.05. Therefore, variables with MSA < 0.5 were excluded from the analysis; Table 4 lists these variables. The output of the analysis, in the form of factor scores, is used in the cluster analysis. Table 3 shows the test results using the KMO, Bartlett, and MSA methods. As shown in Table 3, the KMO value is 0.757 and the significance value of Bartlett's test is 0.0. This means that the variables and the data are acceptable and can be analyzed further, because the KMO value is > 0.5 and the significance value is < 0.05.
After testing the adequacy of the data, a factor analysis was performed, with the results shown in Figure 2. As shown in Figure 2, 14 factors (eigenvalues ≥ 1) were formed from the 57 baseline variables. The distribution of the variables and the percentage of variability explained by each factor are shown in Table 5 and Table 6. The last step in factor analysis is to compute factor scores, which replace the values of the original variables; the new variables are named f1 for factor 1, f2 for factor 2, and so on. The resulting factor scores are used for the cluster analysis.
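The adequacy tests can be sketched with NumPy as shown below, assuming the standard definitions of KMO, MSA, and Bartlett's statistic and obtaining the partial correlations from the inverse of the correlation matrix. The function names and the synthetic data are assumptions for illustration only; this is not the statistical software used to produce Table 3.

```python
import numpy as np

def kmo_and_msa(data):
    """data: (observations x variables) array. Returns (overall KMO, per-variable MSA)."""
    corr = np.corrcoef(data, rowvar=False)
    inv = np.linalg.inv(corr)
    scale = np.sqrt(np.diag(inv))
    # Partial correlations a_ij computed from the inverse correlation matrix.
    partial = -inv / np.outer(scale, scale)
    # Only the off-diagonal terms (j != i) enter the sums.
    off_diagonal = ~np.eye(corr.shape[0], dtype=bool)
    r2 = np.where(off_diagonal, corr ** 2, 0.0)
    a2 = np.where(off_diagonal, partial ** 2, 0.0)
    msa = r2.sum(axis=1) / (r2.sum(axis=1) + a2.sum(axis=1))
    kmo = r2.sum() / (r2.sum() + a2.sum())
    return kmo, msa

def bartlett_statistic(data):
    """Chi-square statistic and degrees of freedom of Bartlett's test."""
    n_obs, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    _, log_det = np.linalg.slogdet(corr)
    chi2 = -((n_obs - 1) - (2 * p + 5) / 6) * log_det
    return chi2, p * (p - 1) // 2

# Synthetic data (an assumption): four variables that share one common factor.
rng = np.random.default_rng(0)
common = rng.normal(size=(500, 1))
data = common + 0.5 * rng.normal(size=(500, 4))

kmo, msa = kmo_and_msa(data)
keep = msa >= 0.5          # variables with MSA < 0.5 would be excluded
chi2, dof = bartlett_statistic(data)
```

The significance value of Bartlett's test would then be the upper-tail probability of chi2 under a chi-square distribution with dof degrees of freedom.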
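The eigenvalue rule and the factor scores can be illustrated with the sketch below. It assumes principal-component extraction from the correlation matrix, which the text does not specify, and uses synthetic data built from two independent factors; the function name factor_scores_by_eigenvalue is hypothetical.

```python
import numpy as np

def factor_scores_by_eigenvalue(data):
    """Keep components with eigenvalue >= 1 and return their standardized scores."""
    data = np.asarray(data, dtype=float)
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    corr = np.corrcoef(z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)          # returned in ascending order
    order = np.argsort(eigvals)[::-1]                # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    n_factors = int(np.sum(eigvals >= 1.0))
    explained = eigvals[:n_factors] / eigvals.sum()  # share of total variance
    # Component scores scaled to unit variance; columns play the role of f1, f2, ...
    scores = z @ eigvecs[:, :n_factors] / np.sqrt(eigvals[:n_factors])
    return n_factors, explained, scores

# Synthetic data (an assumption): two blocks of three variables, one factor per block.
rng = np.random.default_rng(1)
factors = rng.normal(size=(1000, 2))
data = np.hstack([
    factors[:, [0]] + 0.5 * rng.normal(size=(1000, 3)),
    factors[:, [1]] + 0.5 * rng.normal(size=(1000, 3)),
])
n_factors, explained, scores = factor_scores_by_eigenvalue(data)
```

In the study, the same rule applied to the web log variables gives 14 factors, and the score columns replace the original variables as input to the cluster analysis.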

Cluster Analysis
Cluster analysis is the task of assigning a set of objects to groups (called clusters) so that the objects in the same cluster are more similar to each other than to those in other clusters. It is a non-parametric technique that is widely applicable to real-world problems. In this study, cluster analysis was carried out by combining a hierarchical clustering method with a non-hierarchical clustering method. The factor scores resulting from the factor analysis were used as input to the cluster analysis.

Hierarchical Cluster
The first phase of hierarchical clustering is to calculate the distance between objects using the Euclidean distance and to form clusters using the single linkage method. Based on the agglomeration schedule produced by this method, the number of clusters was determined using the elbow rule, as shown in Table 7. Table 7 shows the differences between successive coefficients, where the difference at stage 159 is larger than the others. Thus, based on the elbow rule and with 165 data points, 165 - 159 = 6, which gives 6 clusters. This result is used as input for the non-hierarchical cluster analysis.
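The elbow rule on a single-linkage agglomeration schedule can be sketched in plain Python as follows. The sketch relies on the fact that single-linkage merge distances are the sorted edge weights of the minimum spanning tree, built here with Prim's algorithm. The function names and the nine toy points are assumptions, and the stage is read as the last merge before the largest jump in the coefficients, which is one interpretation of Table 7.

```python
import math

def single_linkage_heights(points):
    """Merge distances of single-linkage clustering, in agglomeration order."""
    n = len(points)
    # best[i] is the Euclidean distance from point i to the growing tree.
    best = [math.dist(points[0], p) for p in points]
    in_tree = [False] * n
    in_tree[0] = True
    heights = []
    for _ in range(n - 1):
        nxt = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        heights.append(best[nxt])
        in_tree[nxt] = True
        for i in range(n):
            if not in_tree[i]:
                best[i] = min(best[i], math.dist(points[nxt], points[i]))
    return sorted(heights)

def clusters_by_elbow(heights, n_objects):
    """Number of clusters left when merging stops before the largest jump."""
    jumps = [later - earlier for earlier, later in zip(heights, heights[1:])]
    stage = jumps.index(max(jumps)) + 1   # 1-based stage just before the jump
    return n_objects - stage

# Toy data (an assumption): three tight groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 0), (10, 1), (11, 0), (0, 10), (0, 11), (1, 10)]
heights = single_linkage_heights(points)
n_clusters = clusters_by_elbow(heights, len(points))
```

With the values reported in the text, the same arithmetic is n_objects = 165 and stage = 159, giving 165 - 159 = 6 clusters.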

Non-Hierarchy Cluster
The non-hierarchical clustering method is used to determine the web visitor segmentation. In this case, the K-Means method [22] was used with the following algorithm: (i) Set k to the number of clusters to be formed and choose k starting centroids. (ii) Allocate each data object to the cluster with the nearest centroid. (iii) Recalculate the position of the k centroids. (iv) Repeat steps (ii) and (iii) until no object moves between clusters.
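Steps (i) to (iv) can be sketched in plain Python as below. The random choice of starting centroids, the seed, the iteration cap, and the four toy points are assumptions made for the example; the text does not state how the starting centroids were chosen.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    """Basic K-Means on a list of coordinate tuples. Returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # (i) choose k starting centroids
    labels = None
    for _ in range(max_iter):
        # (ii) allocate each object to the cluster with the nearest centroid
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:           # (iv) stop when no object changes cluster
            break
        labels = new_labels
        # (iii) recalculate each centroid as the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:                    # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Toy data (an assumption): two well-separated pairs of points.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels, centroids = k_means(points, k=2)
```

In the study, the points would be the rows of factor scores and k would be the 6 clusters obtained from the hierarchical stage.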

Results and Analysis
Applying the non-hierarchical clustering method with 6 clusters of web visitors gives the membership of each cluster, as shown in Table 8. All 165 data points are valid. Table 8 shows the grouping of the 165 web visitors: cluster 1 has 2 members, cluster 2 has 1, cluster 3 has 143, cluster 4 has 13, and clusters 5 and 6 have 3 members each. Detailed information can be seen in Table 9.
It can be concluded from Table 9 that web visitors (u1, u2, u3, ..., u165) within the same cluster have similar access or visiting patterns on the ITS web pages, so this information can be used as input for web personalization and modification, particularly for cluster 3, which has the most members.
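To make the dominance of cluster 3 concrete, the reported cluster sizes can be converted into shares of all visitors. The sizes are those given in the text; the variable names are chosen for this example.

```python
# Cluster sizes as reported in the text for the 6 clusters.
cluster_sizes = {1: 2, 2: 1, 3: 143, 4: 13, 5: 3, 6: 3}

total = sum(cluster_sizes.values())
# Share of all visitors in each cluster, in percent, rounded to one decimal.
shares = {cluster: round(100 * size / total, 1) for cluster, size in cluster_sizes.items()}
largest = max(cluster_sizes, key=cluster_sizes.get)
```

The sizes sum to the 165 valid visitors, and cluster 3 alone accounts for about 86.7% of them.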

Conclusion
Based on the application of the combined hierarchical and non-hierarchical clustering method to the web log data, it can be concluded that this method gives new information about web visitors' patterns or behavior, which can be used for web personalization and web modification. The test on ITS's web log data resulted in 6 clusters of web visitors, of which cluster 3 has the largest number of members (143 members). This information can help web management improve the service on the web pages frequently visited or accessed by members of cluster 3, especially if the management wants to carry out web personalization and web modification.

Figure 1. Stages of Research

ISSN: 1693-6930
Combination of Cluster Method for Segmentation of Web Visitor (Yuhevizar) 211

Table 3. Results of the testing with KMO, Bartlett and MSA methods

Table 6. Distribution and percentage of variable ability explained by resulted factor (continued)

Table 8. The number of clusters' members

Table 10. The final clusters centre