Parallel random projection using R high performance computing for planted motif search

Motif discovery in DNA sequences is one of the most important problems in bioinformatics, and algorithms that solve it accurately and quickly have long been a research goal. This study modifies the random projection algorithm so that it can be run on R high performance computing (i.e., the R package pbdMPI). The approach consists of four steps: preprocessing the data, splitting the data according to the number of batches, modifying and implementing random projection with the pbdMPI package, and aggregating the results. To validate the proposed approach, experiments were conducted on several benchmark datasets, with a sensitivity analysis on the number of cores and batches. The experimental results show that the computational cost can be reduced: with 6 cores, the computation is around 34 times faster than in standalone mode. Thus, the proposed approach can be used for motif discovery effectively and efficiently.


Introduction
Motif discovery in deoxyribonucleic acid (DNA) sequences is one of the most important problems in bioinformatics, since it may help biologists to better understand the structure and function of the molecules in a sequence [1]. A motif is a short pattern that repeats in a DNA sequence and consists of a combination of the four nitrogenous bases: Adenine (A), Guanine (G), Cytosine (C), and Thymine (T) [2]. Motif discovery problems can be categorized into three types, namely Simple Motif Search (SMS), Edit-distance-based Motif Search (EMS), and Planted Motif Search (PMS) [3]. The purpose of SMS is to find all motifs of length 1 up to a specified length in all sequences [4], while the purpose of EMS is to find all motifs in a desired number of sequences [5]. PMS aims to find the motifs that appear in every sequence [6].
In PMS, there are two important input parameters: the desired motif length, denoted by l, and the number of allowed mismatches, denoted by d [7]. For example, consider three DNA sequences: S1=ATTGCTGA, S2=GCATTGAA, and S3=CATGCTTG. With l=4 and d=1, we obtain the following repeated motifs: ATTG and TTGC. PMS is an NP-hard problem, so an algorithm that searches for all possible motifs appearing in all sequences takes exponential time [1]. Random Projection (RP) [8] is one of the algorithms used for PMS in DNA sequences. In this algorithm, each sub-sequence of the input data (l-mer) is projected onto k randomly chosen positions, producing a k-mer [1,9]. The projection is random because mutations can occur at any position. Even though many algorithms have been introduced, PMS is NP-hard, so implementing these algorithms in parallel computing is necessary.
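The (l, d) example above can be checked with a short sketch. Python is used here only for illustration (the study itself uses R and pbdMPI), and all function names are hypothetical. The first part tests whether a candidate l-mer occurs in every sequence with at most d mismatches; the second part shows the projection step of RP for one set of k positions, which is fixed here for reproducibility, whereas RP would draw the positions at random.

```python
from collections import defaultdict


def lmers(seq, l):
    """Return every length-l window of seq, in order of start position."""
    return [seq[i:i + l] for i in range(len(seq) - l + 1)]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs_in_all(candidate, sequences, d):
    """True if candidate matches some l-mer of every sequence with <= d mismatches."""
    l = len(candidate)
    return all(
        any(hamming(candidate, w) <= d for w in lmers(s, l))
        for s in sequences
    )


def project(lmer, positions):
    """Keep only the characters of lmer at the chosen positions (a k-mer)."""
    return "".join(lmer[p] for p in positions)


def bucket_by_projection(sequences, l, positions):
    """Group (sequence index, start) pairs by the projected k-mer of their l-mer."""
    buckets = defaultdict(list)
    for si, s in enumerate(sequences):
        for start, w in enumerate(lmers(s, l)):
            buckets[project(w, positions)].append((si, start))
    return dict(buckets)


sequences = ["ATTGCTGA", "GCATTGAA", "CATGCTTG"]
for motif in ("ATTG", "TTGC"):
    print(motif, occurs_in_all(motif, sequences, d=1))

# Assumed positions (k = 3); RP would choose them randomly in each trial.
buckets = bucket_by_projection(sequences, l=4, positions=(0, 1, 3))
print(buckets["ATG"])
```

l-mers that land in the same bucket agree on the sampled positions, which is why heavily populated buckets are natural starting points for motif refinement.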
Therefore, this research aims to design and implement RP for PMS with parallel computing in the R programming language. The R programming language [10] was chosen since it has become the de facto standard for statistics, data analysis, and visualization. Nowadays, many algorithms have been implemented and collected in software libraries/packages stored in the Comprehensive R Archive Network (CRAN) at https://cran.r-project.org/. One of the R packages in this repository for high performance computing and big data analysis is pbdMPI [11], which is used in this research.
In the literature, we found some relevant articles discussing implementations of motif discovery in parallel computing. For example, in Clemente & Adorna's study [12], the random projection algorithm was developed for GPUs (Graphics Processing Units). Each processor is directed into threads that work within the device (GPU), while the sequential part is executed on the host (CPU). TEIRESIAS was introduced to improve the speed of finding maximal patterns [13]. An enhancement of the PMSPRUNE algorithm has been proposed with two additional features: neighbor generation on demand and omission of the duplicate-neighbor check [14]. Furthermore, there are different approaches to pattern matching in various fields. For instance, multiple-pattern matching methods were introduced for large multi-pattern matching [15], and the scanning mode of the Square Non-symmetry and Antipacking Model (SNAM) for binary images was improved by a new neighbor-finding algorithm [16].
The rest of the paper is organized as follows. First, the global procedure of this research is presented in section 2. In section 3, the main contribution, a modification and implementation of parallel random projection using the pbdMPI package, is discussed. To validate and analyze the proposed computational model, we conduct some experiments in section 4 and analyze them in section 5. Finally, we conclude the research in section 6.

Research Method
Figure 1 shows the research design of this study. First, we perform some preparation: identifying problems, setting research objectives, and studying the literature. These activities were presented in the previous section. Then, we present the main contribution of this research, which is designing and implementing parallel random projection with R high performance computing (i.e., the pbdMPI package). This part is explained in the next section. After that, we conduct some experiments and analyze the results. Finally, we draw some conclusions.

Parallel Random Projection with the pbdMPI package
The computational model proposed in this research is shown in Figure 2. First, after reading and converting the input data from the .fasta file, we perform a modification of random projection by utilizing R high performance computing (i.e., the pbdMPI package), called parallel random projection with pbdMPI. A detailed explanation of the proposed approach is given in Figure 3. The results of this model are all motifs, their starting indices, and the computational costs.
ISSN: 1693-6930. TELKOMNIKA Vol. 17, No. 3, June 2019: 1352-1359
Figure 2. The computational model of parallel random projection with pbdMPI
According to Figure 3, besides supplying the parameters of the RP algorithm, we need to input the number of cores and the number of batches. Since the R programming language loads data into random access memory (RAM), we define the number of batches so that each batch takes less than 20% of the total memory capacity. Steps 1 to 3 and Steps 6 to 8 in Figure 3 are the same as in the RP algorithm in standalone mode, whereas Steps 4 and 5 are carried out in parallel by using pbdMPI commands. An important part of these steps is the rule for dividing the sequence into the given number of batches. The rule must ensure that no possible motif in the sequence is lost even though the sequence has been split into several batches. For this purpose, we apply rules (1) and (2), which give the starting and ending indices of each batch.
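The distribute-then-aggregate pattern of the parallel steps can be sketched as follows. This is an illustration in Python under stated assumptions, not the authors' pbdMPI code: a thread pool stands in for the MPI ranks, the function names are hypothetical, and the per-batch task is reduced to counting l-mers. The batches are those of the paper's worked example for S=CAGTGACGTAATCA with pattern length 3.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_lmers(batch, l):
    """Count every length-l window of one batch."""
    return Counter(batch[i:i + l] for i in range(len(batch) - l + 1))


def parallel_count(batches, l, cores):
    """Process batches on a pool of workers, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=cores) as pool:
        partial = list(pool.map(lambda b: count_lmers(b, l), batches))
    total = Counter()
    for counts in partial:
        total.update(counts)  # aggregation step: add up per-batch counts
    return total


# Batches overlap by l - 1 characters, so no window is lost or counted twice.
batches = ["CAGT", "GTGACG", "CGTAATCA"]
merged = parallel_count(batches, l=3, cores=2)
unsplit = count_lmers("CAGTGACGTAATCA", 3)
print(merged == unsplit)
```

Because each worker only needs its own batch, the number of batches controls memory per task while the number of cores controls how many batches run at once.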

Data Gathering
The data used in this study were obtained from the research in [17]. They can be downloaded from the University of Washington Computer Science and Engineering site at http://bio.cs.washington.edu/research/download. In total, there are 52 datasets of DNA sequences derived from four species: 6 from Drosophila melanogaster, 26 from human, 12 from mouse, and 8 from Saccharomyces cerevisiae. Each data file contains between 1 and 35 sequences, and each sequence has a length ranging from 500 to 3000 base pairs.
In this study, we use only four datasets as input data: the dm01r.fasta and dm05r.fasta files, which are DNA sequences of Drosophila melanogaster; hm01r.fasta, derived from human sequences; and mus04r.fasta, which contains mouse DNA sequences. The dm01r.fasta file contains 4 DNA sequences with a total length of 6000, while the dm05r.fasta file consists of 3 DNA sequences with a total length of 7500. The hm01r.fasta and mus04r.fasta files have total DNA sequence lengths of 36000 and 7000, respectively.
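Dataset summaries such as the number of sequences and the total length can be obtained by parsing the FASTA files. The following Python sketch is only an illustration: the parser name is hypothetical, and the records are toy data, not the contents of any benchmark file.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping record name to sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            # Fall back to a generated name when the header line is empty.
            name = header[0] if header else f"seq{len(records) + 1}"
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


# Toy input (hypothetical records).
toy = """>seq1 toy record
ATTGCTGA
>seq2
GCAT
TGAA
"""

sequences = parse_fasta(toy)
print(len(sequences), sum(len(s) for s in sequences.values()))
```

Sequence lines that wrap over several rows are joined, so the reported length is that of the full sequence.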

Results and Analysis
Because of limited space, in this section we illustrate the results and their analysis for particular datasets only. For example, in standalone mode, a comparison of the number of motifs found according to m, θ, and (l, d) on the dm01r dataset is shown in Figure 4. It can be seen that a larger number of allowed mismatches yields a larger number of motifs. In parallel computing mode, Figure 6 shows the comparison between the computational cost and the number of cores on the dm01r dataset with (l, d)=(6, 2), θ=3, and b=10. The proposed model is successful in the sense that, in general, the computational cost is reduced by adding cores.
Figure 6. The comparison between computational cost and number of cores on the dm01r dataset
To confirm this analysis, Figure 7 compares the computational cost and the number of cores for different (l, d) and m with the same θ (i.e., 3) and b (i.e., 10). The computation in standalone mode (i.e., c=1) at (l, d)=(8, 3) with m=5 took 26.98 seconds, while with 2 cores it took only 6.3 seconds. This means that standalone mode takes about four times longer than using 2 cores. Moreover, standalone mode took more than ten times longer than parallel computing with 3 cores (i.e., 2.52 seconds). Using 6 cores, the computation is around 34 times faster than in standalone mode. Hence, the proposed model is much faster than standalone mode. We also compared our computational time with the experimental results of the previous research [1], even though the data in the files dm01r and mus04r differ. The file dm01r contains 4 DNA sequences of length 1500 each, while the dataset in [1] contains 5 DNA sequences. From the file mus04r, this experiment used 7 DNA sequences of length 1000 each, while only 6 sequences were used in the previous research. The comparison is shown in Table 1. All experiments conducted in this research are faster than those in [1]. It should be noted that the research in [1] was performed in standalone mode.
5.08 0.5 (8,3) 35.45 0.52
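The speedup figures quoted for (l, d)=(8, 3) and m=5 follow directly from the reported timings. The short Python sketch below recomputes them; the function names are hypothetical, and the 6-core running time is not given in the text, so it is only inferred from the stated factor of about 34.

```python
def speedup(t_standalone, t_parallel):
    """Ratio of standalone time to parallel time."""
    return t_standalone / t_parallel


def efficiency(t_standalone, t_parallel, cores):
    """Speedup divided by the number of cores used."""
    return speedup(t_standalone, t_parallel) / cores


t1 = 26.98                    # seconds, standalone mode (c = 1)
timings = {2: 6.3, 3: 2.52}   # seconds, as reported for 2 and 3 cores

for cores, t in timings.items():
    print(cores, round(speedup(t1, t), 2), round(efficiency(t1, t, cores), 2))

# Inferred, not reported: the 6-core time implied by a 34-fold speedup.
implied_t6 = t1 / 34
print(round(implied_t6, 2))
```

An efficiency above 1 means that the measured speedup exceeds the number of cores; the text does not analyze the cause, so factors other than core count (for example, the effect of batching) may contribute.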

Conclusion
The main contributions of this research are as follows: (i) to propose a computational model that modifies the random projection algorithm, called parallel random projection, for planted motif search by utilizing R high performance computing (i.e., the pbdMPI package), and (ii) to implement the proposed model and validate it for finding motifs in DNA sequences. According to the experiments, the proposed model is able to reduce the computational cost significantly. Moreover, a comparison with the previous study shows that the proposal produces better results in terms of computational cost.
In the future, we plan to improve the model by using Big Data platforms, such as the MapReduce programming model on Apache Hadoop [18] and Resilient Distributed Datasets on Apache Spark [19]. Moreover, different tools for parallel computing, e.g., the foreach package [20], can be used as in the study in [21]. Other tasks in bioinformatics-related research can also be used to test the proposed model, such as prediction of cancer [22], kidney disease [23], and sleep disorder [24]. Additionally, another method that can be implemented for this problem is Knuth-Morris-Pratt [25,26].

Figure 1. Research design to conduct parallel random projection
Parallel random projection using R high performance... (Lala Septem Riza)
In (1) and (2), the two indices are the starting and ending indices used to cut batch i, and L, b, and l are the length of the sequence, the number of batches, and the length of the pattern, respectively. It should be noted that the starting index is computed from i=2 onward. For example, given the sequence S=CAGTGACGTAATCA and a pattern length of 3, (1) and (2) yield the following batches: S1=CAGT, S2=GTGACG, and S3=CGTAATCA. Following the way the random projection algorithm generates k-mers, the k-mers obtained from all batches are the same as the k-mers of the unsplit sequence: CAG, AGT, GTG, TGA, GAC, ACG, CGT, GTA, TAA, AAT, ATC, and TCA. This means that even though the sequence has been split and processed by different cores, the results of RP and parallel random projection are the same.
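The worked example can be reproduced with a small sketch. The rule coded below is an assumption chosen because it yields exactly the three batches of the example; it is not necessarily identical to the authors' equations (1) and (2). Python is used only for illustration, and the function names are hypothetical. Batch i ends at i times floor(L/b) (the last batch ends at L), and from i=2 onward each batch starts l-1 characters before the end of the previous one.

```python
def split_into_batches(seq, b, l):
    """Cut seq into b batches that overlap by l - 1 characters (assumed rule)."""
    L = len(seq)
    step = L // b
    if b < 1 or step < l - 1:
        raise ValueError("batches are too short for the requested pattern length")
    batches = []
    for i in range(1, b + 1):
        # 1-based indices: the first batch starts at 1, later ones reach back l - 1.
        start = 1 if i == 1 else (i - 1) * step - l + 2
        end = L if i == b else i * step
        batches.append(seq[start - 1:end])
    return batches


def windows(seq, l):
    """All length-l substrings of seq, in order."""
    return [seq[i:i + l] for i in range(len(seq) - l + 1)]


S = "CAGTGACGTAATCA"
batches = split_into_batches(S, b=3, l=3)
print(batches)

# Substrings collected batch by batch match those of the unsplit sequence.
from_batches = [w for batch in batches for w in windows(batch, 3)]
print(from_batches == windows(S, 3))
```

The overlap of l-1 characters is what keeps every length-l substring inside exactly one batch, so splitting neither loses nor duplicates candidates.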

Figure 3. The pseudo code of parallel random projection with pbdMPI

Figure 4. The comparison of the numbers of motifs found on the dm01r dataset

Figure 5. The comparison between the computational cost and datasets/length of datasets