Algorithm for Predicting Compound Protein Interaction Using Tanimoto Similarity and Klekota-Roth Fingerprint

This research aimed to develop a method for predicting interactions between chemical compounds contained in herbs and proteins related to a particular disease. The algorithm is based on the bipartite local models algorithm, with the protein similarity part omitted. The Klekota-Roth fingerprint is used to represent each compound. During development, three similarity functions were compared: Tanimoto, Cosine, and Dice. Youden's index was used to determine the optimum threshold value. The results showed that the Tanimoto similarity function yielded higher similarity values and a higher AUC value than the other two functions. Moreover, the optimum threshold value obtained was 0.65. Therefore, the Tanimoto similarity function and a threshold value of 0.65 were selected for the prediction method. The average evaluation accuracy of the developed algorithm is only about 50%. The low accuracy is presumably caused by the prediction method relying on compound similarity alone, without including protein similarity.


Introduction
Network pharmacology is a new approach in pharmacology. It developed from biological networks and polypharmacology, and is used in drug discovery. Network pharmacology builds a relational network between the compounds contained in a particular herb and the proteins related to a particular disease [1]. This approach has been used to build a network pharmacology for the triphala formula of ayurveda [2], to reveal the molecular mechanism of the Qing-Luo-Yin formula of Traditional Chinese Medicine (TCM) [3], to reveal candidate drug targets related to Complex Kidney Disease (CKD) from the Bu-shen-Huo-xue (BSHX) formula of TCM [4], and to build the C2Maps platform [5].
Jamu is a traditional herbal medicine originating from Indonesia. A jamu formula consists of natural ingredients, i.e. root, bark, leaf, fruit, and flower [6]. Just like ayurveda and TCM, jamu formulas can be studied with the network pharmacology approach. Afendi et al. [7] investigated the relationship between Indonesian herbs and jamu efficacy using a biplot configuration, with data from Indonesia's National Agency of Drug and Food Control. Fitriawan et al. [8] and Ristyawan [9] developed classification systems for jamu efficacy, using support vector machine and the voting feature interval 5 method, respectively, based on the binary data used by Afendi et al. [7]. Qomariasih [10] determined active compounds from 4 Indonesian herbs, i.e. Zingiber officinale, Blumea balsamifera, Tinospora crispa, and Momordica charantia L., using network pharmacology and simultaneous clustering analysis, in the context of medication for type 2 diabetes mellitus. Ochieng et al. [11] developed an approach to investigate the pharmacological mechanisms of the same four Indonesian herbs. Besides the formula, research on jamu has also addressed the physical appearance of the plants. Lantana et al. [12] proposed a new method for estimating spectral reflectance from jati belanda (Guazuma ulmifolia) leaves based on digital image color, which can then be used to estimate the chemical compounds contained in the leaf. Meanwhile, Karlitasari et al. [13] developed a mobile application for visualizing Indonesian medicine in 3D.
Interactions between compounds and proteins that have not been discovered yet can be predicted using several prediction methods. Liu et al. [14] filtered highly credible negative samples in silico from compound protein interaction data, to be used as training data for predicting compound protein interaction with a support vector machine as the predictor. Bleakley and Yamanishi [15] proposed a new inference method for predicting compound protein interaction, namely bipartite local models. The method works by predicting the target proteins of a given compound, and then predicting the compounds that target a given protein. The final prediction is the combination of these two predictions.
The compound and protein data used in this research can be connected to each other if and only if interaction data exist between the two entities. If the interaction data are not available, the interaction between a compound and a protein can be predicted using similarity data between compounds and between proteins, with the help of statistical methods and machine learning. In this research, we developed a method to predict the interaction between compounds from jamu herbs and proteins associated with a disease, using available compound and protein data. The predictions are expected to serve as a reference for further laboratory testing of the interaction between the compound and the protein.

The Proposed Algorithm
In this research, we proposed a new method for predicting compound protein interaction. The developed prediction method receives as input the compound and the protein whose interaction is to be predicted. The algorithm is based on the bipartite local models algorithm developed by Bleakley and Yamanishi [15], with the protein similarity part omitted. We used compound fingerprint data [16], i.e. the Klekota-Roth fingerprint [17], to represent every compound. Figure 1 shows the pseudocode of the developed prediction method in the design phase, while Figure 3 shows the complete pseudocode resulting from the development. The first steps of the prediction method are to set the threshold value and to initialize the similarity value to zero. Next, the method checks whether the compound protein interaction is already available in the database (line 4). If the interaction data is available, the prediction method returns true and no prediction is performed. Otherwise, the prediction needs to be performed.
On line 8, the prediction method retrieves the detailed data of the input compound from the compound table. On lines 9-11, the prediction method retrieves the list of compounds that target the input protein. On lines 12-15, a loop computes the similarity value between each compound in the list and the input compound, and compares the obtained similarity value with the current maximum similarity value. After the loop has finished and the maximum similarity value has been obtained, this value is compared with the threshold. If the similarity value is higher than the threshold, the interaction between the input compound and the input protein is predicted to exist, and the prediction method returns true. Otherwise, the interaction is predicted not to exist, and the prediction method returns false.
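The steps described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the in-memory set and dictionary stand in for the database tables, and the names predict_cpi, known_interactions, and fingerprints, as well as the pluggable similarity argument, are hypothetical.

```python
from typing import Callable, Dict, Set, Tuple

Fingerprint = Tuple[int, ...]  # bit vector, e.g. the 4860 Klekota-Roth bits


def predict_cpi(
    compound_id: str,
    protein_id: str,
    known_interactions: Set[Tuple[str, str]],  # assumed stand-in for the interaction table
    fingerprints: Dict[str, Fingerprint],      # assumed stand-in for the compound table
    similarity: Callable[[Fingerprint, Fingerprint], float],
    threshold: float,
) -> bool:
    """Return True if the interaction is already known or is predicted to exist."""
    max_similarity = 0.0

    # A known interaction needs no prediction.
    if (compound_id, protein_id) in known_interactions:
        return True

    # Detail data (here: only the fingerprint) of the input compound.
    query = fingerprints[compound_id]

    # Compounds already known to target the input protein.
    targeting = [c for (c, p) in known_interactions if p == protein_id]

    # Keep the highest similarity between the input compound and those compounds.
    for other in targeting:
        max_similarity = max(max_similarity, similarity(query, fingerprints[other]))

    # Predict an interaction only if the best match exceeds the threshold.
    return max_similarity > threshold
```

If no compound is known to target the input protein, the maximum similarity stays at zero and the sketch returns false for any non-negative threshold.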

Research Method

Datasets
The datasets used in this research were obtained from two sources. The first dataset was obtained from Qomariasih's research [10]. The second dataset consists of data that had been collected earlier as part of an ongoing research project. The two datasets consist of compound data, protein data, and compound-protein interaction data, collected from several web servers, such as PubChem [18] and UniProt [19]. Dataset 1 is used to evaluate the optimum similarity function and the optimum threshold, while dataset 2 is used to evaluate the prediction method.
In brief, the data preprocessing steps are generating Klekota-Roth fingerprint data for every compound in the two datasets, choosing and checking the data, and generating negative compound protein interaction data for dataset 1. The part of the compound data that is used to determine the degree of similarity between compounds is the fingerprint. The fingerprint of a compound is a simplification or abstraction of the compound's structure, denoted by a string of bits of a certain length [16]. In this research, the Klekota-Roth fingerprint [17] is used. The Klekota-Roth fingerprint string of a compound consists of 4860 bits. The Klekota-Roth fingerprint is generated using the Chemistry Development Kit (CDK) library [20], which is available through the web application ChemDes [21], with SMILES data as the input. SMILES (Simplified Molecular Input Line Entry System) is a chemical notation language that denotes molecular structure as a graph that is essentially the two-dimensional valence-oriented picture chemists draw to describe a molecule. This language is designed specifically for computer usage by chemists [22].
The compound protein interaction data available in datasets 1 and 2 can be considered as interaction data with the class label "positive", because they show that interactions exist for some compound and protein pairs. To make the dataset "balanced", in the sense that it contains interaction data with both "positive" and "negative" labels, interaction data with the label "negative" must be generated. In this context, interaction data with the label "negative" can be interpreted as compound protein interaction data whose validity is unknown. Negative interaction data are generated only for dataset 1.

Prediction Method
The prediction method used in this research is a simple one, based on a simple calculation. It considers the similarity between the input compound and other compounds in the dataset. It rests on the central premise of medicinal chemistry that structurally similar molecules have similar biological activities [23].
The concept of the developed prediction method follows that of the bipartite local models algorithm [15]. In that algorithm, the prediction of compound protein interaction is made from two sides, i.e. based on the input compound and based on the input protein. From the compound side, a classification rule is sought that separates the proteins targeted by the input compound from the proteins not targeted by it, using the proteins' genomic sequence data. From the protein side, a classification rule is sought that separates the compounds that target the input protein from the compounds that do not, using the compounds' chemical structure data. After the two classification rules are obtained, the compound protein interaction can be predicted from both sides, and the prediction results are aggregated. There are two important points in the development of the prediction method, i.e. the similarity function and the threshold value used. The similarity function determines how close the input compound is to the compounds in the dataset that interact with the input protein. The threshold value serves as the boundary between a positive and a negative prediction result for the input compound and the input protein. At the development stage, the similarity function and the threshold value have not yet been fixed, because they are evaluated and determined in the next steps.

Similarity Function
A similarity function measures the similarity, or the closeness, of two objects. How the function is implemented depends on the data to be used. In this research, the data used with the similarity function is the Klekota-Roth fingerprint, which takes the form of a string of 4860 bits. Therefore, a binary similarity function is selected.
The similarity of two binary strings can be described as follows. Suppose we have two binary strings x and y, each of which consists of p variables with value 0 or 1. The common association coefficients are calculated from the frequencies shown in Table 1, where a, b, c, and d are the frequencies of the events (x=1 and y=1), (x=1 and y=0), (x=0 and y=1), and (x=0 and y=0), respectively, in the pair of binary vectors describing the two objects; p is the total number of variables, equal to a+b+c+d, which is the length of each binary vector [24].

Table 1. Frequency Table of Four Combinations for Two Possible Binary Variables [24]
         y = 1    y = 0
x = 1    a        b
x = 0    c        d

Three similarity functions are used, i.e. the Tanimoto, Cosine, and Dice similarity functions. According to Bajusz et al. [25], these three functions are the best among the eight metrics tested. In the evaluation process, the similarity between compounds is calculated using the three similarity functions. Afterwards, the function that produces the highest similarity value for every appropriate compound pair is chosen and implemented in the prediction method. The formulas of the Tanimoto, Cosine, and Dice similarity functions are shown in Equations 1, 2, and 3, respectively.
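As an illustration, the three coefficients can be computed from the counts a, b, and c of Table 1. The sketch below assumes the standard definitions, Tanimoto = a / (a + b + c), Cosine = a / sqrt((a + b)(a + c)), and Dice = 2a / (2a + b + c), which are taken here to correspond to Equations 1 to 3. The function names and the convention of returning 0 when a denominator is zero are assumptions made for this example.

```python
import math
from typing import Sequence, Tuple


def binary_counts(x: Sequence[int], y: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return the frequencies (a, b, c, d) of Table 1 for two bit vectors."""
    if len(x) != len(y):
        raise ValueError("fingerprints must have the same length")
    a = b = c = d = 0
    for xi, yi in zip(x, y):
        if xi and yi:
            a += 1      # x = 1 and y = 1
        elif xi:
            b += 1      # x = 1 and y = 0
        elif yi:
            c += 1      # x = 0 and y = 1
        else:
            d += 1      # x = 0 and y = 0
    return a, b, c, d


def tanimoto(x: Sequence[int], y: Sequence[int]) -> float:
    a, b, c, _ = binary_counts(x, y)
    denom = a + b + c
    return a / denom if denom else 0.0  # assumed convention for all-zero vectors


def cosine(x: Sequence[int], y: Sequence[int]) -> float:
    a, b, c, _ = binary_counts(x, y)
    denom = math.sqrt((a + b) * (a + c))
    return a / denom if denom else 0.0


def dice(x: Sequence[int], y: Sequence[int]) -> float:
    a, b, c, _ = binary_counts(x, y)
    denom = 2 * a + b + c
    return 2 * a / denom if denom else 0.0
```

None of the three coefficients uses d, the number of positions where both bits are 0, so long and sparse fingerprints are compared only on the bits that are set in at least one of the two compounds.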

Datasets
Dataset 1 consists of 55 compound data, 478 protein data, and 3059 compound protein interaction data. Meanwhile, dataset 2 consists of 12755 compound data, 2952 protein data, and 175071 compound protein interaction data. The first step of data preprocessing is generating fingerprint data for every compound. After the generation, it was found that fingerprints could not be generated for all of the compounds, due to missing SMILES data or to failure of fingerprint generation from the available SMILES data. The next step is choosing the data and checking their validity, mainly for the compound protein interaction data. This is done so that the amount of data to be processed is not too large and the data meet the required criteria. The criteria are as follows: the compound is registered in PubChem [18] and has a fingerprint; and the protein is registered in UniProt [19] as a human (Homo sapiens) protein.
TELKOMNIKA ISSN: 1693-6930
The next step is generating data with the label "negative" for dataset 1. The procedure is as follows. First, list all compound protein pairs from the dataset that are not in the list of positive compound protein interactions. Then, randomly choose N compound protein pairs from that list, where N is the number of compound protein interaction data with the label "positive".
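The negative-data procedure above can be sketched as follows. This is an illustrative sketch only: the function name sample_negative_pairs, the identifier types, and the optional seed parameter are assumptions, and sampling is done without replacement so that no negative pair is repeated.

```python
import random
from itertools import product
from typing import List, Optional, Sequence, Set, Tuple

Pair = Tuple[str, str]  # (compound identifier, protein identifier)


def sample_negative_pairs(
    compounds: Sequence[str],
    proteins: Sequence[str],
    positives: Set[Pair],
    seed: Optional[int] = None,
) -> List[Pair]:
    """Randomly choose N unlabeled pairs, where N is the number of positive pairs."""
    # All compound-protein pairs that are not known positive interactions.
    candidates = [pair for pair in product(compounds, proteins) if pair not in positives]
    n = len(positives)
    if n > len(candidates):
        raise ValueError("not enough unlabeled pairs to match the positives")
    rng = random.Random(seed)
    return rng.sample(candidates, n)  # sampling without replacement
```

Because the candidate pairs are merely not recorded as interacting, some of the sampled "negative" pairs may in fact be undiscovered interactions.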
After the data preprocessing, dataset 1 contains 55 compound data, 478 protein data, and 6066 compound protein interaction data. The compound protein interaction data in dataset 1 consist of 3033 data with the label "positive" and 3033 data with the label "negative". Dataset 2 contains 6736 compound data, 1012 protein data, and 25123 compound protein interaction data.

Prediction Method
The proposed prediction method used in this research has already been discussed in Section 2. As mentioned before, the prediction method uses only compound similarity and omits protein similarity. At the time of the research, we had not found an appropriate metric to measure protein similarity, either structurally or functionally. In the design-phase pseudocode, the similarity function and the optimum threshold value are not yet specified, because both are evaluated in the next step. The similarity function and the optimum threshold are used in lines 13 and 2 of the program code, respectively.

Similarity Function & Threshold Evaluation Result
At this stage, the prediction algorithm is implemented in the programming language Python 3. The three similarity functions to be evaluated are also included in the code. After implementation, the code is executed on the compound protein interaction data from dataset 1, under the assumption that the compound protein interactions are unknown.
After the execution, once the similarity values have been obtained, 21 candidate threshold values are generated in the range [0, 1] with a step of 0.05. Afterwards, for every threshold value, the similarity values are filtered using the criterion "the similarity value is less than or equal to the threshold value", aggregated, and then evaluated. The values calculated here are true positives, false negatives, false positives, true negatives, sensitivity, specificity, Youden's index, and accuracy. The evaluation of the similarity function and of the optimal threshold is done at the same time, which is possible because the two evaluations use the same aggregate data. Accuracy and area under the ROC curve (AUC) values are used to evaluate the similarity function. Sensitivity, specificity, and Youden's index are used to evaluate the optimal threshold. The aggregated data are attached in Supplementary File 1.
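The threshold sweep can be sketched as follows. The sketch assumes that a pair is predicted positive when its maximum similarity is strictly greater than the threshold, and it uses the standard definition of Youden's index, J = sensitivity + specificity - 1. The function names and the dictionary layout are hypothetical.

```python
from typing import Dict, List, Sequence


def candidate_thresholds(steps: int = 20) -> List[float]:
    """Return 0.00, 0.05, ..., 1.00 (21 values for the default of 20 steps)."""
    return [round(i / steps, 2) for i in range(steps + 1)]


def evaluate_threshold(
    scores: Sequence[float], labels: Sequence[bool], threshold: float
) -> Dict[str, float]:
    """Confusion counts and derived measures for one candidate threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = score > threshold  # scores <= threshold are predicted negative
        if predicted and label:
            tp += 1
        elif predicted:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {
        "threshold": threshold,
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "youden": sensitivity + specificity - 1.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }


def best_threshold_by_youden(
    scores: Sequence[float], labels: Sequence[bool]
) -> Dict[str, float]:
    """Return the evaluation of the candidate with the highest Youden's index."""
    results = [evaluate_threshold(scores, labels, t) for t in candidate_thresholds()]
    return max(results, key=lambda r: r["youden"])
```

When several candidates tie on Youden's index, this sketch returns the smallest of them; how ties were resolved in the original evaluation is not stated.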
From the 6066 compound protein interaction data, the highest accuracy, 0.583, is achieved by the Tanimoto similarity function at the candidate threshold of 0.65. In the evaluation of Youden's index, the highest value, 0.166, is also achieved by the Tanimoto similarity function at the candidate threshold of 0.65. The ROC curves of the three similarity functions are then drawn, and the AUC value of each similarity function is calculated using the trapezoidal approximation. The ROC curves of the three similarity functions are shown in Figure 2. The Tanimoto similarity function has the highest AUC value among the three similarity functions, namely 0.555. From this result, it can be inferred that the optimum similarity function is the Tanimoto similarity function and the optimum threshold value is 0.65. After this step, the design of the prediction method is as shown in Figure 3.
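A trapezoidal AUC calculation over the candidate thresholds can be sketched as follows. The sketch assumes that each threshold contributes one ROC point (false positive rate, true positive rate), that a pair is predicted positive when its similarity exceeds the threshold, and that the end points (0, 0) and (1, 1) are added to close the curve. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def roc_points(
    scores: Sequence[float], labels: Sequence[bool], thresholds: Sequence[float]
) -> List[Tuple[float, float]]:
    """Return the sorted (false positive rate, true positive rate) points."""
    positives = sum(1 for label in labels if label)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes are needed to draw a ROC curve")
    points = {(0.0, 0.0), (1.0, 1.0)}  # assumed end points of the curve
    for t in thresholds:
        tp = sum(1 for s, label in zip(scores, labels) if label and s > t)
        fp = sum(1 for s, label in zip(scores, labels) if not label and s > t)
        points.add((fp / negatives, tp / positives))
    return sorted(points)


def trapezoidal_auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the piecewise-linear curve through the sorted points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

An AUC of 0.5 corresponds to a classifier that ranks positive and negative pairs no better than chance, which gives a reference point for reading the value of 0.555 reported above.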

Prediction Method Evaluation Result
At this stage, each subset used to evaluate the prediction method contains 22611 compound protein interaction data. This value is 90% of the number of compound protein interaction data in dataset 2, rounded to the nearest integer. The sub datasets are generated 10 times; in every repetition, 22611 interaction data are chosen randomly and stored in a separate table.
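The generation of the sub datasets can be sketched as follows. This is a minimal sketch: the function name make_subsets, the in-memory lists standing in for the database tables, and the choice of sampling without replacement inside each subset are assumptions.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def make_subsets(
    records: Sequence[T],
    fraction: float = 0.9,
    repetitions: int = 10,
    seed: Optional[int] = None,
) -> List[List[T]]:
    """Draw `repetitions` random subsets, each holding round(fraction * N) records."""
    size = round(fraction * len(records))
    rng = random.Random(seed)
    pool = list(records)
    # Each subset is sampled independently, so different subsets may overlap.
    return [rng.sample(pool, size) for _ in range(repetitions)]
```

For the 25123 interaction data of dataset 2, round(0.9 * 25123) gives the subset size of 22611 used here.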
After the sub datasets are generated, the prediction method is evaluated. The minimum accuracy obtained is 50.374% and the maximum accuracy obtained is 50.683%, whereas the average accuracy is 50.495%. The low accuracy is presumably caused by the prediction method using only compound similarity to predict the compound protein interaction, without including protein similarity. The evaluation result of the prediction method is shown in Table 2.

Conclusion
In this research, we developed a method for predicting the interaction between a compound and a protein. The prediction method uses the Tanimoto similarity function and a threshold value of 0.65. The evaluation of the prediction method yields an average prediction accuracy of 50.5%. The result of this research is not yet reliable enough. Therefore, we suggest the following points for future research:
a. To use machine learning approaches to build the prediction algorithm, so that it can capture the pattern of the compound protein interaction data being predicted.
b. To include a protein similarity measurement in the calculation used to predict the compound protein interaction, so that the prediction is not viewed only from the compound's side.

Figure 1. The pseudocode of the developed prediction method in the design phase

Table 2. Result of Prediction Method Evaluation