Association pattern of students' thesis examination using FP-Growth algorithms

Association rule mining is a data mining technique alongside classification, clustering, and prediction. Using data mining, a data scientist can find new insights in data. Data mining can be applied in several areas, such as discovering client behavior patterns that can be used to shape marketing strategies [1], exploring undergraduate student data [2], and extracting facts from medical data [3] through association rule mining. Data mining is also used to identify toddlers' nutritional status [4], extract risk factors for early childhood caries from highly ranked association rules [5], analyze relationships among patent documents [6], analyze ozone conditions [7], [8], and determine pharmacy drugs to prevent medication errors [9].

Educational data mining is an interesting topic to discuss. The main goal of applying data mining in education is to use experience and new insights to improve the quality of education [10] and to manage new courses properly [11]. Data mining can also be used to anticipate educational risks and opportunities, such as student drop-out [12]-[16], duration of study [17], [18], learning behaviors [19]-[21], student outcomes [22], [23], and student performance [10], [24]-[26].
In previous research, the FOLD-growth method was used to analyze frequent patterns in alumni data relating time to get a job, study duration, age, English skill, field skill, grade point average, and first salary [27]. This article discusses a distinctive application of data mining in the educational field. Using association rule mining, we are interested in analyzing student thesis examinations to find meaningful association patterns. A thesis is a written scientific work prepared by a student based on the results of research into a problem. The research is carried out carefully by the student under the guidance of a supervisor. The results are expected to contribute to knowledge and society. Every university and department has its own procedure for conducting student thesis examinations. In general, every student thesis examination involves four attributes: student, supervisor, examiner, and research topic. There is no problem in conducting thesis examinations as such, but how to evaluate the combination of the four attributes is interesting to explore. The university and department need to confirm that the way thesis examinations are arranged is the right option.
ABSTRACT The thesis examination is the final project that students must complete to graduate from their majors. A thesis is a scientific work in which a student and a supervisor seek solutions to a problem. In the thesis examination, students must present their research results to be critiqued by the examiners. This article aims to analyze the association patterns of student thesis examinations at a private university. Although thesis examinations have been carried out according to procedure, the composition of the board of examiners needs to be analyzed by examining the pattern of relationships between research topics, supervisors, and examiners. This study uses 448 records and the FP-Growth algorithm to find the rules. The research methodology consists of preparing the dataset, cleansing data, selecting data, loading data into the application, transforming data, finding frequent itemsets, forming patterns, and analyzing rules. This study found 45 association rule patterns with a minimum support value of 4 and a minimum confidence value of 100%. Of these patterns, 77.78% agree with the scientific group data. The association patterns produced in this study can help determine the composition of examiners for a student thesis examination according to the research topic and the scientific field of the examiners.

In this article, we are interested in analyzing student thesis examinations based on data from a private university. The objective of this research is to analyze the data mechanism. Besides the analysis mechanism, this research also develops a web application in Python. The application makes it simple to analyze data from another university or department and obtain its association patterns.

II. Method
The FP-Growth algorithm was introduced by Han [28]. It improves on the Apriori algorithm [29] by generating frequent patterns without candidate generation. Association rule mining [30] analyzes frequent patterns that consist of items and finds the association between X and Y, where X and Y are itemsets. The methodology in this research follows Figure 1.
1. Data selection. Select the variables based on the independent variables of this research. The variables are research topics, supervisors, and examiners.
2. Data cleaning. Remove noise from the data. Noise refers to records with null, irrelevant, or invalid values. Cleaning is done by removing these records.
3. Data load. This research develops a web application in Python. The prepared dataset is uploaded into the web application to start the analysis.

A. Dataset
The first step of this research is to collect the dataset from the academic unit. The dataset contains 448 rows with several variables, such as student number, name, department, address, registration data, and so on. Table 1 shows the collected dataset, which serves as the basis of this research. The next step is preprocessing, which produces a clean dataset to be processed by the algorithm. Preprocessing consists of data selection and data cleansing. This research not only analyzes the data but also produces a web application developed in Python. The web application also performs a preprocessing step, but only data transformation, in which the data are transformed into hash data. We chose hash data because consistent data are needed before the analysis process.

B. Data Selection
The data selection stage decides which variables will be used at the data mining stage. The variables used are research topics, supervisors, first examiners, and second examiners. The selected data are shown in Table 2. These variables were chosen because they are the ones the analysis aims to understand.

C. Data Cleaning
The cleaning process removes inconsistent data, such as noise or invalid values, from the supervisor, research topic, first examiner, and second examiner data. We applied this process to all 448 records and found no records that needed to be removed.
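The selection and cleaning steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the column names (`topic`, `supervisor`, `examiner_1`, `examiner_2`) and the sample rows are hypothetical, and "noise" is reduced here to empty values only.

```python
import csv
import io

# Hypothetical column names; the paper only names the four variables.
SELECTED = ["topic", "supervisor", "examiner_1", "examiner_2"]

def select_and_clean(rows, columns=SELECTED):
    """Keep only the selected columns and drop rows with missing values."""
    cleaned = []
    for row in rows:
        record = {col: (row.get(col) or "").strip() for col in columns}
        if all(record.values()):  # every selected field is non-empty
            cleaned.append(record)
    return cleaned

# Invented sample data, for illustration only.
SAMPLE_CSV = """student_number,name,topic,supervisor,examiner_1,examiner_2
001,Student A,Data Mining,Lecturer X,Lecturer Y,Lecturer Z
002,Student B,,Lecturer X,Lecturer Y,Lecturer Z
003,Student C,Networking,Lecturer Y,Lecturer Z,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
clean_rows = select_and_clean(rows)
print(len(rows), "rows read,", len(clean_rows), "rows kept")
```

In this toy input, the second and third rows are dropped because a selected field is empty; in the study itself no row had to be removed.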

D. Load Data to Application
A web application was developed in Python to perform association rule mining. The Python code uses the pyfpgrowth library to implement the FP-Growth algorithm. After data cleansing and data selection, the dataset is uploaded to the web application, which reads the whole dataset and stores it in the database.

E. Data Transformation
The data transformation uses the Python function convertToHash, as shown in Table 1, to transform text data into hash data. Only the supervisor, first examiner, and second examiner variables are hashed. These variables contain many identical values, and by the nature of association rules every duplicate value is removed. The same name can appear as a supervisor, a first examiner, and a second examiner, which would produce duplicate items for association rule mining. The solution is to add a marker that distinguishes the data by variable name: the supervisor column is tagged with the word "pemb", the first examiner with "1uji", and the second examiner with "2uji". The result of the data transformation is shown in Figure 2. The supervisor, first examiner, and second examiner columns have been successfully transformed into hash data. After this transformation, the three variables no longer contain duplicates and are ready to be analyzed.
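The idea behind the role markers can be illustrated with a small sketch. It is not the authors' convertToHash function: the helper names `tag_role` and `to_transaction`, the use of a truncated MD5 digest, and the choice to put the marker in front of the hash are all assumptions made for illustration. Only the three marker words come from the text.

```python
import hashlib

# Role markers quoted in the text; the hashing scheme itself is an assumption.
ROLE_TAGS = {"supervisor": "pemb", "examiner_1": "1uji", "examiner_2": "2uji"}

def tag_role(name, role):
    """Return a short, role-specific token for a lecturer name."""
    digest = hashlib.md5(name.strip().lower().encode("utf-8")).hexdigest()[:8]
    return ROLE_TAGS[role] + digest

def to_transaction(record):
    """Turn one examination record into a list of distinct items."""
    items = [record["topic"]]
    for role in ROLE_TAGS:
        items.append(tag_role(record[role], role))
    return items

# The same (invented) lecturer appears as supervisor and as second examiner.
record = {"topic": "Data Mining", "supervisor": "Lecturer X",
          "examiner_1": "Lecturer Y", "examiner_2": "Lecturer X"}
transaction = to_transaction(record)
print(transaction)
```

Because the marker differs per role, the same person yields two different items when acting as supervisor and as second examiner, so the transaction keeps all four items.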

F. Frequent Itemsets
The first step in the FP-Growth algorithm is to find the frequent itemsets. The frequency of each item is needed to decide which items will be used in the association rule analysis. Infrequent items are removed, as shown in Figure 3.
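As a minimal sketch of this step, the item frequencies can be counted and the infrequent items dropped as follows. The function names and the toy transactions are hypothetical; the threshold of 4 matches the minimum support used in this research.

```python
from collections import Counter

def frequent_items(transactions, min_supp):
    """Count in how many transactions each item occurs; keep the frequent ones."""
    counts = Counter(item for t in transactions for item in set(t))
    return {item: n for item, n in counts.items() if n >= min_supp}

def prune(transactions, frequent):
    """Drop infrequent items and order the rest by descending support."""
    pruned = []
    for t in transactions:
        kept = [item for item in set(t) if item in frequent]
        kept.sort(key=lambda item: (-frequent[item], item))  # ties: alphabetical
        pruned.append(kept)
    return pruned

# Invented transactions, for illustration only.
transactions = [
    ["data mining", "pemb_A", "1uji_B", "2uji_C"],
    ["data mining", "pemb_A", "1uji_B", "2uji_D"],
    ["data mining", "pemb_A", "1uji_B", "2uji_C"],
    ["data mining", "pemb_A", "1uji_E", "2uji_C"],
    ["networking", "pemb_F", "1uji_B", "2uji_C"],
]
frequent = frequent_items(transactions, min_supp=4)
pruned = prune(transactions, frequent)
print(frequent)
print(pruned[4])
```

The pruned, consistently ordered transactions are the input from which an FP-Tree is built.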

G. Result of FP-Growth Algorithms
The next step in the FP-Growth algorithm is to determine the minimum support and minimum confidence values [29]. In this research, we set the minimum support (min_supp) to 4 and the minimum confidence (min_conf) to 100%. Infrequent items are removed, and the frequent items are re-arranged to build an FP-Tree. Figure 4 illustrates a sample of the FP-Tree in this research. After building the FP-Tree, the FP-Growth algorithm generates a conditional pattern base, a sample of which is shown in Table 2. The conditional pattern base is derived from the tree paths that satisfy the minimum support. It contains combinations of items together with their counts from the tree. The next step is to generate a conditional FP-Tree. At this stage, the support counts in the conditional pattern base are summed for each item, and every item whose total support count is greater than or equal to the minimum support count is included in the conditional FP-Tree, as shown in Table 3. The conditional FP-Tree is then used to generate the frequent patterns: a single path is found and combined with the items in the conditional FP-Tree. Table 4 shows a sample of the frequent patterns. After the frequent patterns are generated, the web application converts the supervisor, first examiner, and second examiner back into their real names. The result of this conversion is shown in Figure 5, which also gives the confidence values. The confidence values are calculated with Formula 1 [31].
The minimum confidence in this research is 100% because only the patterns with maximal confidence are to be analyzed.
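The tree-building and conditional-pattern-base steps described in this subsection can be sketched as follows. This is a simplified illustration written from the general description of FP-Growth, not the pyfpgrowth implementation used by the web application; the class and function names and the toy transactions are assumptions, and the transactions are assumed to be already pruned and ordered by descending support.

```python
from collections import defaultdict

class Node:
    """One FP-Tree node: an item, a count, a parent link, and children."""
    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}

def build_fp_tree(ordered_transactions):
    """Insert each pruned, ordered transaction into a prefix tree."""
    root = Node(None, None)
    header = defaultdict(list)  # item -> every tree node holding that item
    for transaction in ordered_transactions:
        node = root
        for item in transaction:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1  # shared prefixes accumulate their counts
            node = child
    return root, header

def conditional_pattern_base(item, header):
    """Collect (prefix path, count) pairs for every occurrence of item."""
    base = []
    for node in header[item]:
        path = []
        parent = node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((path[::-1], node.count))
    return base

# Invented transactions, ordered by descending support of their items.
ordered = [
    ["topic_dm", "1uji_B", "pemb_A", "2uji_C"],
    ["topic_dm", "1uji_B", "pemb_A", "2uji_C"],
    ["topic_dm", "1uji_B", "pemb_A"],
    ["topic_dm", "1uji_B", "2uji_C"],
]
root, header = build_fp_tree(ordered)
print(conditional_pattern_base("2uji_C", header))
```

For the item `2uji_C`, the sketch returns two prefix paths with counts 2 and 1; summing the counts per item across these paths is what decides which items enter the conditional FP-Tree.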
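For a rule X ⇒ Y, confidence is conventionally defined as support(X ∪ Y) / support(X); this standard definition is assumed here to be what the confidence formula expresses. The sketch below applies it to a dictionary of frequent patterns and keeps only the rules that reach the minimum confidence. The function name, the dictionary layout, and the sample counts are hypothetical.

```python
from itertools import combinations

def generate_rules(patterns, min_conf):
    """Derive rules X => Y from frequent patterns and their support counts.

    `patterns` maps a frozenset of items to its support count and is assumed
    to contain every non-empty subset of each pattern.
    """
    rules = []
    for itemset, supp in patterns.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                x = frozenset(antecedent)
                conf = supp / patterns[x]  # support(X u Y) / support(X)
                if conf >= min_conf:
                    rules.append((x, itemset - x, conf))
    return rules

# Invented support counts, for illustration only.
patterns = {
    frozenset({"topic_dm"}): 5,
    frozenset({"1uji_B"}): 4,
    frozenset({"topic_dm", "1uji_B"}): 4,
}
rules = generate_rules(patterns, min_conf=1.0)
for x, y, conf in rules:
    print(sorted(x), "=>", sorted(y), f"confidence = {conf:.0%}")
```

With a minimum confidence of 100%, only the rule from `1uji_B` to `topic_dm` survives (4/4), while the reverse rule is discarded (4/5 = 80%).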

H. Analyze Association Rules
The complete result of Figure 5 is shown in Table 5, which is used in the analysis step. The focus of the investigation is to count the patterns that are appropriate when compared with the research background and the research topic of the student. Validation was done by checking the patterns against the research background data and the research group membership data. Validation found that ten patterns (22.22%) are not appropriate, meaning that these patterns, which are based on historical data, do not agree with the research group membership data. This can be due to several reasons; for example, FP-Growth performance drops when the FP-tree is very dense. The resulting patterns can serve as a basis for improving the composition of thesis examinations in the future.
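One possible way to automate this kind of check is sketched below. The validation criterion is an assumption made for illustration (a pattern counts as appropriate when every lecturer in it belongs to the research group that matches the topic), and the data structures and sample values are hypothetical. The last lines only reproduce the reported percentages from the counts 10 and 45.

```python
def validate_rules(rules, research_groups):
    """Split patterns into appropriate and not appropriate ones.

    `rules` is a list of (topic, [lecturer names]); `research_groups` maps a
    topic to the set of lecturers in the matching scientific group.
    Assumed criterion: all lecturers of a pattern belong to the topic's group.
    """
    appropriate, not_appropriate = [], []
    for topic, lecturers in rules:
        group = research_groups.get(topic, set())
        if all(name in group for name in lecturers):
            appropriate.append((topic, lecturers))
        else:
            not_appropriate.append((topic, lecturers))
    return appropriate, not_appropriate

# Invented groups and patterns, for illustration only.
research_groups = {"data mining": {"Lecturer X", "Lecturer Y"},
                   "networking": {"Lecturer Z"}}
rules = [("data mining", ["Lecturer X", "Lecturer Y"]),
         ("networking", ["Lecturer X", "Lecturer Z"])]
ok, not_ok = validate_rules(rules, research_groups)
print(len(ok), "appropriate,", len(not_ok), "not appropriate")

# Shares reported in the paper: 10 of 45 patterns are not appropriate.
share_not_ok = round(100 * 10 / 45, 2)
share_ok = round(100 * 35 / 45, 2)
print(share_ok, share_not_ok)
```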

IV. Conclusion
Based on the results and discussion, the FP-Growth algorithm works very well for finding frequent itemsets, building an FP-Tree and generating rules from the existing dataset. The results show 45 association rule patterns with a minimum support of 4 and a minimum confidence of 100%. These patterns were validated against the scientific group data: 77.78% of the patterns are appropriate and 22.22% are not.