Behavioral Analysis for Detecting Code Clones

The activities of copy and paste fragments of code from a source code into the other source code is often done by software developers because it's easier than generate code manually. This behavior leads to the increase of effort to maintain the code. One of the detection methods of semantic cloning is based on the behavior of the code. The code behavior detected by observing at an input, output and the effects of the method. Methods with the same value of input, output, and effect will indicate that semantically the same. However, the detection method based on the input, output, and effect could not be used in a void method or method without parameters, another side comprehensively detection is required. The challenge is how to detect which variab le in a method that acts as input, output, and effect. Detection of the variab le input, output, and effects in a void method done using Program Dependence Graph. The use of clone detection methods semantically based on behavior can increase the agreement value.


Introduction
The process of copying and insertion of code fragments specific code section to another section of code referred to as cloning.The process can be accompanied or not by changes or the modification of code [1].Software developers often do so for various reasons, such as software limitations of working time.Some researchers claim that there is a 20-30% overall code are cloned in the source code [1].The existence of this cloning requires good management, given that there are some impacts that may be caused.
Code cloning is believed can cut the time of generating the source code in software development.With the cloning, a developer has a very simple way to generate the code.However, it will be able to cause problems when one of the copied code fragments containing the bug.This condition will increase the cost of maintaining the source code.Increasing maintenance costs are shown by an increase in efforts [2][3], the time and expense [4][5].There is also a correlation between the presence of clones in code with the onset of defects in software.Sjoberg and Yamashita [2], states that if the maintenance is not carried out consistently, it can lead to defects in the software.The same thing also delivered by Bettenburg et al., [6] which states that disability may occur if the developer did not perform well on cloning maintenance code.
Several methods have been used in the process of clone discovery.These methods are grouped into two groups, a method that is executed based on the similarity of syntax and meaning (semantics).In the case of code, the semantic similarity of code cannot be looked from the relatedness between the words [7] or the ontology [8].Elva and Leavens [9][10], mention in different terms, the representational and functional similarities.The similarity of the syntax is done by looking at how the source code it is presented or written.This approach considers only the similarity of syntax and has minimal semantic information.Meanwhile, the semantic similarity, the search process is done by looking at the meaning of the code.The meaning of a code fragment can be seen from the function of the code [5], [9], and the relationships of the elements of the code [10][11].Because of the presence of clone can lead to the increase of the maintenance cost, then the cloned code in a source code of the software must be found to help facilitate the management process of cloning.The importanc e of knowing the existence of clone is to simplify the process of refactoring the code.Refactoring code is done to maintain the 1265 consistency of the code and to reduce the possibility of the appearance of the bug.Clones code that is not consistent can increase maintenance costs and increase the likelihood of bugs.
In the semantic similarity detection process, the additional information is needed to enhance the meaning of the fragment of the source code.In the semantic similarity detection process, the additional information is needed to enhance the meaning of the fragment of the source code.The semantic similarity detection process has been studied by several researchers.Gabel et al. are using the Program Dependence Graphs (PDG) to gain the semantic information from code [12].PDG is used to represent control data and dependencies between statements in code.Gabel et al. are assuming that the isomorphism model that represented by PDG from a couple of code fragment indicate that they are the clone [12].
Keivanloo and Rilling propose an approach that the meaning of code can be gained from the relationship between token.They assume that if we consider the text, will be difficult to measure based on the meaning of the code (semantic).Keivanloo and Rilling did an enhancement of the method using semantic web technology.They connect the code based on inheritance relation between object.The relationship between tokens is known from the inheritance relation from the token [13].Jiang and Su using difference approach for semantic clone detection.In their research, the semantic similarity in code is a functional similarity.If two fragments of code produce the same output when given the same input value, then those fragments are semantically similar [5].There is an automatic mechanism for generating input value in this research.
Elva and Leavens have the same vision as Jiang and Su.They are focused their research on method or function at Java language.They also believe that the function of the method can be identified by the value of input and output.In this research, Elva and Leavens give an additional parameter to consider in the process of functional similarity detection.The parameter is the effect.Effect of the method in Java is assumed can give additional information for strengthening the meaning of method.Then, they named the detection process as IOE -Behavior (Input, Output, Effect-Behavior) [9], [14], [10].Inputs are all parameter on the method.The output is the result of the method.And, effects are all attributes of the classes that have possibility affected by the method at the runtime.
Input and output are needed in the testing phase.The method that does not have the input parameters and return value will be assigned as a cloned as long as they have the same definition of input, output, and effect [10].All of the methods that do not have the definition of input, output, and effect have possibility contain the clone.Therefore, the deeper study is indeed to solve that problem.
To resolve that problem, in this research will produce the framework for semantic clone detection.The framework has the function to detect clone code based on the behavior of the method.The challenges of this research are, how we can identify all variable that acts as input, output, and effect to detect semantic clone.All void and parameterized method are identified so that the analysis of method can be done to all method without exception.The preliminary research has been done and the result shown that this method is very promising [15].In the preliminary research, the PDG analysis is done to all of Java methods.In this research, the PDG analysis is applied more selective, not all methods are processed but only on void and without parameters methods.And, there is additional source code that is analyzed.

Research Method
In this part describe the research methodology that is used in this research.The phases consist of (1) Identification and collection Java methods, (2) Grouping the methods based on the input, output and effect definition, (3) Reconstruction method, (4) Testing/execution method, (5) Grouping methods based on the result of testing/execution, (6) Analysis of result.The contribution of this research is in the first phase.At the first process identify and collect all methods from the source code, then identify the definition of input, output, and effect by using PDG.All the phases are described in Figure 1.

Identification and collection java methods
The first technical phase of this research is collecting all methods in Java class.In the process of collection methods, there is a way to find all methods in object-oriented source code.This process is applied to all source code files by using a library that works based on byte-code analysis approach.The library is System Dependencies Graph API (SDGAPI).The result of method analysis is saved in one object.Finally, all object of methods are saved in list data type.The collection of methods is described in Figure 2.

Figure 2. The Process of Collection Methods
There is method extraction process in Figure 2.This process is how to gain all information on the methods.An extraction of void or non-parameterized method is done using Program Dependence Graph (PDG) from SDGAPI.PDG describe the dependencies between the variable in the method body.The analysis of input, output, and effect is done using the dependencies relation between variable.Input and output variables are all nodes that start and end the PDG.
The process of collection and identification of Java methods is done by an application that developed specifically for this research.The application is developed with Java language and using the SGDAPI.The application has two kinds of input type files.The files are .javafiles and .classfile.Because of SGDAPI is working with bytecode analysis approach, that why the

. Input identification
Identification of input from the Java methods is made using PDG.The information that will be gained from this process is the type of input variable.The input variables determine the type of random value that uses in testing or execution phase.Definition 1. Input is all of the types of input parameters of one method.In the case of a method with parameters, all type of parameters will be an input definition of the method.Figure 4 describes the example of a method with parameters.From that example code, by using definition 1, the input definition of method inc are int and int.The condition will be different if the method does not have parameters.By using PDG, input in a method that does not parameters have will be discovered.Figure 4 shows the example of the method without parameters.Input is all node that affects another node, not be affected by another node.In this case, all of the nodes in PDG that become the starter of the graph are input node.The PDG of code on Figure 5 is described in Figure 6.There are three nodes that become a starter in this graph, "int x", "int y" and "String result".But, not all of that starter node becomes an input.According to definition 2, all nodes that affect and not affected by another node become an input."String result" are excluded from input because it will be affected by another node.So, the input definition of that method is int and int.

Output identification
As well as input identification, output definitions are obtained using the same process.PDG is also considered in this process.But, output has a different definition of input.Definition 3. The output is the type of return value of the methods.All of the type of return value of the method, if any, will be the type definition of output for a method.For all methods that have the return value (not void method), it can directly be assigned to output definition.Definition 4. The output is the node that becomes an estuary for all node or the node that only affected by another node.All of the nodes that have a position at the end of the graph will be assigned to output.For example Figure 7, that is a void method.There is no return value of the method.The method only does some process and not return the result value of the process.To get which variables act as the output, the use of PDG is very useful.According to definition 3 and 4, and the information from PDG, then output variable can be found.The dependence graph of the example in Figure 7 are described in Figure 8.According to definition 3 and 4, if the method has the return value, then it can be directly decided that the output is the return type of the method.If the method is a void method, then definition four will be used to find the output.For the example of dependency graph in figure 8, there are three nodes that can be output according to definition 4. The nodes are "print x ", "print y" and "print result".From the nodes, there are variable x, y, and the result, and each of them has a type of int.From the example, the definition of output is int, int, and int.

Effect identification
The effect has a different definition of output.The effect is something that happens during the process of generating or resulting output.In the case of Java programming, it is identified with the entity or attribute that lies in the body of the method.Definition 5.The effect is all attributes from the class that has the possibility to be manipulated during the process resulting in the output.
All of the attributes of a class that has the possibility to be manipulated are listed and become the effect of the method.For example, there is class A and has an attribute A1.There is class B that has method MB1.In the method MB1, A1 of A is manipulated and has the possibility the value of A1 changed.From that description, method MB1 affect A1.A.

Grouping the methods based on input, output, and effect definition
Grouping the Methods is the first step of grouping methods.This process produces several groups of a method that has a possibility having a similar function.Grouping methods are to eliminate methods that have no possibility having a similar function.All methods that have no similar definition of input, output, and effect will be erased from a collection of candidate methods.Definition 6. (Similarity of Methods), A couple of methods are stated as similar whether it has a similar definition of input, output, and effect.Input, output, and effect defined as string data.The matching process is done using exact match comparison between string.The process is performed on every element data input, output, and effect.On the other hand, this process is done to keep and find the group of methods that contextually the same.All of the methods must be given the same value of the input to keep the contextual of the method in one group [9].

Method reconstruction
The method reconstruction process is for preparing the method that will be run at testing phase.Reconstruction is done by altering the parameters of the method and appending some script inner the body method manually.This process is done at all of the methods that do not have input parameter or return value as an output of the method.The result of input, output and effect identification is used as base information for doing reconstruction.
All methods that do not have input parameter and output or one of them, and identified as a clone candidate, will be reconstructed.The method that there is no input parameter in the code definition, and identified own an input; then the code is altered by adding a parameter as the result of PDG analysis.The method that does not have a return value, and identified own output by PDG analysis, the body of code is appended with the code that has a function to listen to the value of the variable that becomes output node.The output modification cannot do by adding "return" keyword in java code because the output of method resulted by PDG analysis is possibly more than one node.All methods are prepared in this phase to all of them can be run or simulated using random data.Several things that need to consider in the method reconstruction is defined as follow.
a.The name of parameters of the method is adapted from the name of the variable that becomes an input node of the method.b.The code that defines the variable that becomes an input is deleted because this variable will put to the method parameter.c.The code that gives an initial value to a variable that becomes an input is deleted because the initial value is got from the parameter and it can cause the death code in the body.d.Input node that calls a method from another method is replaced with a variable that has the same type with the return value of the method.

Testing or execution method
All methods that have similarity the definition of input, output, and effect are tested using the same data for one group.The data are generated using a random function that can produce the value that fits with the type of parameters.The random function produces a value of primitive type.The parameter that has a type as user define reference type are defined manually.
The initial code for test the method is generated automatically at the same time with grouping the methods.Then, in the final process, the SQL script is produced.The scripts contain many information, example, the name of the method, the name of the class where the method takes place, the name of the cluster, input value, output value, and effect.With SQL script, all data can be saved to database easily.The example of initial code for executing an example of the method is descript in Figure 9. From the code in figure 9, the tested method is getTextHeight, from class TextCalculator, cluster 10, and source code of jDraw.getTextHeight is the method that already reconstructed and ready to invoke.From the example, all the needs are the code that defines to generate random data to pass to the method.The final code that already appends the additional code is descript in Figure 10.The code that blocked with yellow is additional code.Random Data is the class that has a responsibility to generate random data.There are several methods to generate data that divided based on the primitive type of data.At the example, random for an integer type of data.Then, if all random data required in this testing method is ready, the data pass ed to a method to proceed.

Grouping methods based on the result of testing
Before grouping the methods, the SQL script that resulted from the previous phas e is executed.All log of the testing process is stored in the database.This phase is the determination phase.In this phase, we decide which method that has the same behavior.The grouping method is the way to know which method has the same behavior.The behavior similarity detection is based on the definition as follow [10]: All methods that have the same value of input and output, and the same definition of effect, then those methods are semantically or behaviorally equivalent.Not all of the methods that are grouped in previous grouping process are semantically similar.The only method that has the same input, output, and effect definition and also has the same value of input and output is decided as a semantic clone.

Analysis of result
Analysis of result is conducted with the aim to evaluate the result of the implementation of the purposed method.The evaluation is done by comparing the result of the system (purposed method) and the result of manual analysis by an expert.Expertise is a very familiar and experienced in designing a software source code.Then, from the result, the value of agreement between the system and an expert are counted using Cohen's Kappa coefficient [16].
where, , and

( ) ( )
Where is the proportion of the similarity of observation and is the proportion expected by chance [16].Then, the data obtained from observations of two observers described in tables relevance shown in Table 1.Kappa value can determine the level of agreement between the expert and system.The value can be interpreted using interpretation table as shown in

Dataset and Testing Scenario
This section descript dataset and testing scenario to perform the purposed method.The dataset that used in this research is the open-source software written in Java.

Dataset
The dataset consists of some source code written in the Java language.The source code is available on several sites code repository.From some source code, the code is grouped into code sizes small, medium and large based on the number of methods contained in each source code.The code with the small size is a code that has a method of fewer than 1000 methods.The code with a medium-size contains more methods from 1000 to 1500.And, with the large size of the code is code that has methods over 1500.Provisions grouping source code based on the number of such methods is still an open issue.There are no references say for certain groupings of code size based on the number of methods.The source code used in this research is the source code that meets size classifications small, medium and large.Applications used to test in this study are described in Table 3.

Testing scenario
Testing of methods offered in this study conducted by implementing a workflow that has been proposed and described in the methodology.The workflow is implemented in the case of data that is appointed for the trial.The data used for the test is divided into three categories: small, medium and large.Testing is done by comparing the results with the results of the identification system experts.Experts elected for identification is a person who is familiar with the source code and understands the types of clones exist in the source code.There are three experts, each expert identification at the source code an application system.
The identification process by the system is done in two ways, namely automatically and manually.The result of identification by the system is recorded in a database.Then, the log data will be analyzed by comparing the results of the identification by experts.In the final phase, Kappa coefficient was calculated to see the degree of match between the identification results with expert systems.

Discussion
The analysis results in the comparison of the number of methods that claimed as clone method.There are the differences between the number of previous and proposed method.The previous method looked more accurate than proposed method.From the distribution of type of clones in Figure 13, the previous method, there is a considerable value obtained for this type of Not Clone (not clones).

Figure 13. The Distribution of Types of Clones
The number of not clones of each source code (DNSJava, jDraw, jEdit) showed the largest number compare with another type that clones.The deeper analysis is needed to find out are all the methods that claimed as clones are really clones.This analysis is done by looking at every method that claimed as a clone or not clones.These results imply that , in the previous method (without PDG), there are many false positive values.Where false positive is the number of a method that identifies as clones by the system but in reality is not clones.
The number of the false positive method can cause a decline in the value of the accuracy or the degree of fit between the methods with experts.Comparison of the average value of Kappa coefficient between the previous method and the proposed method is 0.13043 and 0.2128.0.08237 there is a difference value between the two methods.It is mean that the proposed method is more conformable with the expert than the previous method with the value of Kappa coefficient 0.08237 greater than the previous method.The proposed method analyzes every method even there is no parameter or return value at the method which was not done in the previous method.

Conclusion
Type of input, output, and effect of a method can be specified on all methods of both the void method or non-void method.The specification of input, output and the effect of the method is done by analyzing the PDG formed based bytecode of a method.Generation PDG using a library that runs using the method bytecode analysis.
The results of the experiment by the system are compared to the test by experts.This approach aims to get a Kappa coefficient of the proposed method in this study.The results of calculation of Kappa coefficient is 0.2128 which can then be interpreted that the results of the degree of fit between the system and the experts in this study were a fair agreement.The result of the system is also compared to the previous method.Comparison of the average value of Kappa coefficient between the previous method and the proposed method are 0.13043 and The difference values between the two methods are 0.08237.The proposed method is more conformable with the expert.Some of the processes of clone detection on the source code are done manually.For the next study, we try to attempt all process can be done automatically.The recapitulation of the testing in this study shows there are still methods that cannot be identified.The assumption is this happens because of the failure of library SDGAPI in generating PDG.Therefore, some methods which do not have a definition cannot be analyzed.In upcoming research, this problem has to be optimized.Generating PDG will be attempted from source code.
Figure 1.Our Approach

Figure 4 .
Figure 4.The Example of Method with Parameters

Figure 5 .
Figure 5.The Example of Method Without Parameters

Figure 7 . 1269 Figure 8 .
Figure 7.The Example of Void Method

Figure 9 .
Figure 9. Example of Initial Code

1271 Figure 10 .
Figure 10.The Final Code in Testing Phase

Table 2 .
The Table2explains that there are some categories for the result of Kappa coefficient.The value of Kappa can be classified based on that categories.

Table 3 .
List Dataset