The importance of data classification using machine learning methods in microarray data

ABSTRACT


INTRODUCTION
Bioinformatics involves the use of computers to manage biological information [1]. In general, bioinformatics involves the classification, organization, and interpretation of microarray data. This technology can be employed to solve various problems in the biological field. For example, prediction can be used to control and prevent many diseases such as cancer, and to discover new disease markers. Detecting mutations in gene expression patterns would lead to the development of more efficient therapies [2]. Genes control cell development, and the malfunction of genes leads to tumour formation or cancer. The deoxyribonucleic acid (DNA) microarray is a powerful approach that helps explore genetic defects in the human body [3]. For example, microarray technology has led to successful cancer diagnosis [4,5]. Gene expression studies that involve gene selection, classification and clustering have been carried out [6]. A suitable hybrid system in bioinformatics has been developed to detect cancer (gene mutation) and other diseases more accurately [7].

LITERATURE REVIEW
Cancer involves abnormal cell growth, which can be fatal if ignored because the developed tumour may spread to other parts of the body via the bloodstream [8]. Common cancers include lung, prostate, breast, oral, and colon cancers [9]. An example of cancer levels is shown in Figure 1. To study this, we need to understand DNA microarray technology and the kinds of machine learning that are applied to microarray data.

Microarray technology
In biology, DNA is the blueprint of an organism and therefore holds all the information necessary for biological processes. In computational biology, this information is required for prediction, analysis, and much more. Hence, it must be available in a form that is computationally suitable for processing and analysis. Gene expression profiles in numerical form are therefore used to perform feature selection and classification. Microarray technology meets this requirement. A DNA microarray (also commonly known as a DNA chip or biochip) can reveal the expression levels of many genes concurrently in a single reaction [10]. Expression is measured at the mRNA level rather than the protein level for two reasons. First, the structure of a protein differs from the structure of a gene and its analysis is still difficult, so analysing thousands of proteins would take a great deal of time. Second, one amino acid can be encoded by several codons, so a given sequence of amino acids in a protein is consistent with several possible sequences for the gene that produced it. The easiest way is therefore to extract the mRNA in the cell and measure its relative amount. Generally, microarray technology uses the hybridization principle to measure gene expression levels in the human body [11].
Figure 2 shows the basic process involved in microarray technology. To conduct gene expression profiling, a DNA sample from a patient and a control sample are obtained. The DNA in each sample is denatured into single-stranded molecules. The single-stranded molecules are then cut into smaller fragments and labelled with a fluorescent dye: green for the control sample and red for the patient sample. Both samples are applied to the chip to hybridize, or bind, with the synthetic DNA on the chip. After hybridization, gene expression can be identified through the changes of colour on the chip. This technology can therefore be used in cancer diagnosis and in studying drug response [12]. Via machine learning, significant information can then be extracted about genes that represent a disease state and about highly associated genes that share biological features [13].
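The colour change on a two-dye chip is commonly turned into a number by taking the log ratio of the two channel intensities. The sketch below illustrates that idea only; it assumes the red channel holds the patient sample and the green channel the control, and the function name, the intensity floor and the spot values are hypothetical.

```python
import math

def log2_ratio(red, green, floor=1.0):
    """Return log2(red / green) for one spot.

    Intensities are floored (an assumed safeguard) so that a zero or
    negative background-corrected value cannot break the logarithm.
    """
    r = max(red, floor)
    g = max(green, floor)
    return math.log2(r / g)

# Hypothetical background-corrected intensities: (red, green) per spot.
spots = {
    "gene_A": (1600.0, 400.0),   # brighter in red: higher in the patient sample
    "gene_B": (500.0, 500.0),    # equal channels: no change
    "gene_C": (150.0, 1200.0),   # brighter in green: lower in the patient sample
}

ratios = {name: log2_ratio(r, g) for name, (r, g) in spots.items()}
for name, value in ratios.items():
    print(name, value)
```

A positive value indicates higher expression in the red channel, a negative value higher expression in the green channel, and zero indicates no difference. Working on the log scale makes a fourfold increase and a fourfold decrease symmetric around zero.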
An accurate cancer diagnosis can be attained through microarray data classification, by building classifiers that compare the gene expression profiles of tissues of known and unknown cancer status [14]. However, the classification process can be misleading due to noisy and irrelevant data. Therefore, a feature selection method should be devised to reduce the size of the feature set, or gene set [15]. In general, a microarray diagnosis process involves feature selection and classification [16]. To date, many machine learning algorithms have been developed for detecting mutations, e.g., ANN, SVM, clustering, and swarm intelligence algorithms. Using these methods, an optimal subset of genes can be chosen to build a classification model.

Feature selection techniques
There are three feature selection techniques in classification, i.e., filter, wrapper, and embedded methods, as shown in Figure 3. Filter-based approaches are well known for data filtering or pre-processing: they rank the genes, and the highly ranked genes are then used in further analysis. In wrapper-based methods, gene selection is done using a machine learning method, and cross-validation is used to assess the score of each feature subset. Embedded methods, in contrast, perform gene selection as part of training the learning algorithm itself. Feature selection is needed because microarray data contain many non-significant features that would degrade the performance of most learning algorithms [17].
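As an illustration of the filter approach, the sketch below ranks genes by the absolute value of a two-sample t-statistic and keeps the top-ranked ones. The choice of the Welch t-statistic, the function names and the toy expression values are assumptions made for this example; the text above does not prescribe a particular ranking score.

```python
import math
from statistics import mean, variance

def t_statistic(a, b):
    """Welch t-statistic between two groups of expression values."""
    denom = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / denom if denom > 0 else 0.0

def rank_genes(expr, labels, top_k):
    """Rank genes by |t| between class 0 and class 1 and keep the top_k.

    expr   : dict mapping gene name -> list of values, one per sample
    labels : list of 0/1 class labels, in the same sample order
    """
    scores = {}
    for gene, values in expr.items():
        group0 = [v for v, y in zip(values, labels) if y == 0]
        group1 = [v for v, y in zip(values, labels) if y == 1]
        scores[gene] = abs(t_statistic(group0, group1))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy data: three normal samples (0) followed by three tumour samples (1).
labels = [0, 0, 0, 1, 1, 1]
expr = {
    "gene_A": [1.0, 1.2, 0.9, 3.1, 2.9, 3.3],  # clearly separates the classes
    "gene_B": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # uninformative
    "gene_C": [0.5, 1.5, 1.0, 1.4, 2.0, 1.1],  # weakly informative
}

top = rank_genes(expr, labels, top_k=2)
print(top)
```

Because the score of each gene is computed independently of any classifier, a filter of this kind is cheap to run on thousands of genes. A wrapper would instead retrain a classifier for each candidate subset and score it by cross-validation.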


The importance of data classification using machine learning methods in microarray data (Aws Naser Jaber)

Different methods of feature selection
Normalization involves reducing unwanted variation within arrays. Typical assumptions made in some major normalization methods are:
− Only a small number of genes are differentially expressed in terms of condition.
− Annotation: This process involves gene characterization.
− Summarization: Performing only a single measurement after performing a combination in a certain manner.
− Statistical Analysis: From a statistical point of view, the number of genes could be larger than the number of samples, thus leading to faulty classification. Feature selection should be performed by selecting the most informative gene to improve the accuracy and efficiency of the classification process and to address the problem of dimensionality.
− Biological Interpretation: To interpret microarray data, one must have an adequate number of replicate measurements to determine results that have real predictive value.
Dimensionality reduction is therefore essential.
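One simple normalization that leans on the assumption that only a small number of genes are differentially expressed is median centring: if most genes are unchanged, the median of each array should be the same, so any shift in the median is treated as unwanted variation and subtracted. The sketch below is a minimal illustration under that assumption; the method choice, the function name and the log2 values are not taken from the reviewed works.

```python
from statistics import median

def median_center(arrays):
    """Subtract each array's median from its values.

    arrays : dict mapping array name -> list of log2 expression values,
             with genes in the same order in every array.
    After centring, every array has a median of zero.
    """
    centered = {}
    for name, values in arrays.items():
        m = median(values)
        centered[name] = [v - m for v in values]
    return centered

# Hypothetical log2 intensities for five genes on two arrays.
arrays = {
    "array_1": [8.0, 9.0, 10.0, 11.0, 12.0],
    "array_2": [9.5, 10.5, 11.5, 12.5, 13.5],  # same profile, shifted by +1.5
}

normalized = median_center(arrays)
print(normalized)
```

In this toy case the second array is the first one shifted by a constant, so both arrays become identical after centring. Real data usually need more than a constant shift, which is why intensity-dependent or quantile-based methods are also widely used.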
In microarray classification, samples are classified as either abnormal (cancer) or normal based on microarray measurements [18,19]. It is challenging to train classifiers on such high-dimensional datasets [20]. Preprocessing is an essential step to address this dimensionality problem; the classification algorithm is then applied, with model complexity controlled via regularization. Machine learning enables a system to perform the learning process automatically. It is not learning in the human sense; however, the system can recognize complex data patterns and make intelligent decisions based on computational methods. Classification is a procedure used to categorize sample data into a few classes. Some popular classification methods employed in data mining and other fields are artificial neural networks (ANN), decision trees, support vector machines (SVM) and swarm intelligence.
An artificial neural network (ANN), or neural network (NN), is a method in artificial intelligence that mimics the complex processes of the human brain. An ANN consists of a large collection of units that are interconnected to permit communication between them. The units, also called nodes or neurons, are simple processors that function in parallel. Next is the decision tree method, a predictive modelling tool that falls under supervised learning. A decision tree is built from two main kinds of nodes: decision nodes, which test a feature, and leaf nodes, which hold the predicted outcome. There are also two types of decision trees: classification trees and regression trees. SVM is another popular supervised classification method. Its basic principle is to create a hyperplane that separates the dataset into classes. Finally, swarm intelligence methods use numerous simple agents, with no central control, that interact locally and globally. Popular swarm intelligence algorithms are ant colony optimization (ACO), artificial bee colony optimization (ABC), and particle swarm optimization (PSO).
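The idea of a hyperplane that separates the dataset into classes can be illustrated with a perceptron, which is a single artificial neuron. Note that this is not an SVM: a perceptron stops at any separating hyperplane, whereas an SVM looks for the one with the largest margin. The data, the function names and the parameter values below are assumptions made for the example.

```python
def train_perceptron(samples, labels, epochs=1000, lr=1.0):
    """Learn weights w and bias b so that sign(w . x + b) matches the labels.

    samples : list of feature vectors (lists of floats)
    labels  : list of +1 / -1 class labels
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(samples, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified, or exactly on the hyperplane
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                errors += 1
        if errors == 0:  # every training sample is on the correct side
            break
    return w, b

def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane x falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Toy expression values for two genes; -1 = normal, +1 = tumour.
samples = [[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
           [3.0, 3.2], [2.9, 3.1], [3.3, 2.8]]
labels = [-1, -1, -1, 1, 1, 1]

w, b = train_perceptron(samples, labels)
predictions = [predict(w, b, x) for x in samples]
print(w, b, predictions)
```

The learned hyperplane is the set of points where w . x + b = 0. For linearly separable data such as this toy set, the perceptron update is guaranteed to reach a hyperplane that classifies every training sample correctly.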

RELATED WORK
Several microarray applications have been reported in a related review [21]. Microarray analysis can be combined with machine learning algorithms, both non-swarm intelligence and swarm intelligence algorithms. After detecting and filtering gene expression datasets, samples should be accurately classified into known groups based on their gene expression features. For this purpose, support vector machines (SVM), prediction analysis of microarrays (PAM), classification and regression trees (CART), and K-nearest-neighbour (K-NN) methods can be employed. Turgut, et al., applied machine learning classifiers to microarray breast cancer data. They first ran eight machine learning algorithms without any feature selection, and then applied two different feature selection methods. The algorithms were KNN, SVM, decision trees, MLP, random forest, logistic regression, AdaBoost and gradient boosting machines [22]. They reported that MLP did not improve accuracy.
Bharathi and Natarajan minimized the gene set for more accurate classification using ANOVA [23]. The ranking of each gene was computed using ANOVA, and SVM was used as the classifier. The technique was compared with the t-test classifier. The combination of ANOVA and SVM was accurate even with a minimal number of genes. Another study proposed an artificial immune recognition system (AIRS) to classify microarray data (cancer, disease or normal tissues) [24]. In AIRS, memory cells are generated from the training samples to build a classifier. The experiment was applied to colon cancer, brain tumour, and nine tumour datasets, and the results were compared with those of other classifiers. AIRS performed better than other machine learning methods such as KNN, OneR, and Naïve Bayes.
Karayianni, et al., employed a fuzzy clustering method with viewpoints to identify unlabelled samples [24]. The clusters are identified by calculating the mean expression of each feature over the labelled samples. This technique was applied to breast cancer, brain cancer, AML, and MLL datasets. Mandal and Banerjee applied ANN to diagnose and detect cancer [25]. A particular kind of ANN, the multilayer feed-forward neural network (MLFF), was used. The performance of an ANN depends on parameters such as the number of hidden layers, the number of nodes and the weights. Different datasets consisting of breast and lung cancer cells were employed. Two analyses were performed: cross-validation and testing on a new dataset. Datasets were divided into training (80%) and testing (20%) sets. Due to the noise in the dataset, the accuracy was 96% after cross-validation and 94% for new dataset testing. The ANN was designed with a single hidden layer, but its structure can be tuned for better accuracy.
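The two evaluation schemes mentioned above, an 80%/20% hold-out split and cross-validation, can be sketched as index manipulations. The number of samples, the number of folds, the random seed and the function names are assumptions for illustration; the reviewed study does not report them here.

```python
import random

def train_test_split(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into (train, test) lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]

def k_fold(indices, k):
    """Yield (train, validation) index lists for k-fold cross-validation.

    Each index appears in exactly one validation fold; fold sizes differ
    by at most one.
    """
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

# 50 hypothetical samples: 40 for training, 10 held out for final testing.
train_idx, test_idx = train_test_split(50, test_fraction=0.2)

# 5-fold cross-validation inside the training portion only.
folds = list(k_fold(train_idx, k=5))
for train, validation in folds:
    print(len(train), len(validation))
```

Keeping the held-out test indices outside the cross-validation loop matters in microarray work: if gene selection or parameter tuning sees the test samples, the reported accuracy will be optimistic.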
In [26], the α depended degree-based feature selection method was used to address the imbalance between the number of features and the number of instances in microarray gene expression data. Classification accuracy was better with smaller gene sets than with larger ones. Nine datasets were used in this study: colon tumour, central nervous system tumour, diffuse large B-cell lymphoma, leukemia 1, AML, lung cancer, prostate cancer, breast cancer, and leukaemia. The results were compared with other techniques such as NB (Naïve Bayes), DT (decision tree), SVM (support vector machine) and K-NN (K-nearest neighbour). As reported, the K-NN classifier performed better under seven α values. Li and colleagues assessed five feature selection methods using classifiers such as KNN, C4.5, Naïve Bayes and SVM on leukemia and ovarian cancer datasets [27]. A comparative study of three feature selection methods on four datasets (prostate, colon tumour, leukemia and Hepato) was presented in [28]; SVM performed better on all the datasets.
Chanho and Sung-Bae analysed colon cancer and lymphoma datasets using seven gene selection methods and six classifiers. Ji-Gang and Hong-Wen developed a gene selection method based on the Bayes error filter (BBF) [29]. BBF can select significant genes while removing non-significant ones. It was evaluated on colon, prostate, lymphoma, leukemia, and DLBCL datasets, with SVM and KNN used to measure accuracy. They observed that SVM performed well on all the datasets used. Xing, Jordan, and Karp studied different classifiers: a Gaussian classifier, a regression classifier, and KNN. They proposed a hybrid filter and wrapper approach for feature selection in high-dimensional data, mainly using Markov blanket filtering, and then classified the data with the three classifiers. Feature reduction improved the results for all three: the classifiers performed better on the reduced, significant feature space than on the full feature space [30].
Onskog and colleagues presented a microarray classification study on seven cancer-related datasets. Double cross-validation was applied to obtain a robust error rate estimate. The results show that SVM with a radial basis kernel and with a linear kernel performed consistently on these datasets. Moreover, based on the t-test, there is a synergistic association between the classification methods and the gene selection process [31]. In addition, [32] presented a machine learning study on a prostate cancer dataset, in which the t-test and the interquartile range were combined for feature selection. The results show that the Bayes network outperformed Naïve Bayes. In [33], different discrimination methods were used for classification on three cancer gene expression datasets: nearest-neighbour classifiers, linear discriminant analysis, and classification trees. The nearest-neighbour classifier performed better than the decision tree classifier.
Furthermore, Sung-Bae and colleagues used three microarray datasets, namely leukemia, colon, and lymphoma, with feature selection and classifiers. The results show that ensemble classifiers produced the best classification rate compared to other methods [34]. Abusamra investigated eight different feature selection methods and three classification methods. The feature selection methods compared include max minority, information gain, Gini index, t-statistics, sum of variances and one-dimensional support vector machine. The classification methods were SVM, KNN, and random forest. Two glioma expression datasets were used in this experiment. The results show that selecting significant genes boosted classification accuracy. On both datasets, SVM performed better than the other classification methods [35].
The maximum relevance minimum redundancy (mRMR) algorithm is a filter-based approach that can select features that are highly predictive yet mutually uncorrelated. The algorithm selects a feature subset with the maximum association with the class (relevance) and the minimum association among the features themselves (redundancy). Features are ranked by the minimal-redundancy-maximal-relevance measure. The F-statistic is used to calculate relevance and the Pearson correlation coefficient is used to calculate redundancy [36]. In addition, [37] developed the Monte Carlo feature selection (MCFS) algorithm to identify informative features. The MCFS algorithm takes interdependencies among features into account. It has some similarity to the random forest methodology but differs in how the feature ranking is calculated [37].
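A greedy version of the mRMR idea can be sketched as follows, with the F-statistic as relevance and the mean absolute Pearson correlation with already selected genes as redundancy. Combining the two as a quotient (relevance divided by redundancy) is an assumption made here; a difference is another common choice. The function names and the toy data are hypothetical.

```python
import math
from statistics import mean

def f_statistic(values, labels):
    """One-way ANOVA F-statistic of one gene across the class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    grand = mean(values)
    k, n = len(groups), len(values)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values()) / (k - 1)
    within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g) / (n - k)
    return between / within if within > 0 else math.inf

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb) if sa > 0 and sb > 0 else 0.0

def mrmr(expr, labels, n_select):
    """Greedily pick genes with high relevance and low redundancy."""
    relevance = {g: f_statistic(v, labels) for g, v in expr.items()}
    selected = [max(relevance, key=relevance.get)]  # start with the most relevant gene
    while len(selected) < min(n_select, len(expr)):
        def score(g):
            redundancy = mean([abs(pearson(expr[g], expr[s])) for s in selected])
            return relevance[g] / max(redundancy, 1e-12)  # quotient form (assumed)
        candidates = [g for g in expr if g not in selected]
        selected.append(max(candidates, key=score))
    return selected

# Toy data: six samples in three classes, two samples per class.
labels = [0, 0, 1, 1, 2, 2]
expr = {
    "gene_A": [1.0, 1.2, 1.0, 1.2, 3.0, 3.2],  # marks class 2
    "gene_B": [1.0, 1.4, 0.9, 1.3, 2.9, 3.3],  # noisy near-copy of gene_A
    "gene_C": [0.9, 1.5, 2.9, 3.5, 1.0, 1.4],  # marks class 1
}

selected = mrmr(expr, labels, n_select=2)
by_relevance = sorted(expr, key=lambda g: f_statistic(expr[g], labels), reverse=True)[:2]
print(selected, by_relevance)
```

In this toy set, ranking by relevance alone would keep gene_A and gene_B, which carry almost the same information. The redundancy term makes the greedy search prefer gene_C as the second gene, because it is less correlated with gene_A.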
Alshamlan and colleagues proposed a new feature selection method that hybridizes minimum redundancy maximum relevance (mRMR) with an artificial bee colony (ABC). The algorithm is designed specifically to select significant genes from microarray data. The experiment was conducted with six binary and multiclass datasets. The results show that the proposed algorithm achieved better classification accuracy than the mRMR-GA and mRMR-PSO algorithms [38]. Jayger, Sengupta, and Ruzzo [39] studied various gene selection methods for microarray data classification. They used various statistical tests with gene selection methods, including Fisher, Golub, Wilcoxon, TNoM, and the t-test.
Huawen, Lei and Huijie compared various gene selection methods [40]. They compared ensemble gene selection by grouping with three other gene selection methods: FCBF, mRMR, and ECRP. They applied these techniques to five datasets with two classification methods, Naïve Bayes and KNN, and analysed which classification method was more effective. In [25], a fuzzy clustering method with viewpoints was employed to identify unlabelled samples. The viewpoints were constructed by computing the average expression of each feature (probe/gene) over the labelled samples. In that work, previously available microarray expression data were introduced as viewpoints in the clustering process. The technique was applied to breast cancer, brain cancer, AML, and MLL datasets. The method was found to be better than other clustering algorithms such as K-means, fuzzy c-means, affinity propagation, and a clustering method based on prior biological knowledge.
Table 1 shows the most related works on microarray DNA. Furthermore, [41] hybridized cellular automata with ant colony optimization to select significant genes, which were then used for classification. It produced higher accuracy than the other methods compared in that paper. Moreover, in [42], an artificial neural network (ANN) was applied to ALL and AML datasets. This study achieved 98% accuracy, with no errors on the ALL dataset and one error on the AML dataset. The Cancer Genome Atlas (TCGA) is a pilot project launched by the National Institutes of Health (NIH) to create a comprehensive atlas of cancer genomic profiles. Most of the gene expression data are publicly available at TCGA and are used in prognosis and diagnosis [43].

CONCLUSION
This paper reviews existing classification techniques applied to microarrays, which contain high-dimensional data. The high-dimensionality problem can be addressed using feature selection methods. Many gene selection methods have been used to classify cancer or other disease datasets with binary or multiple classes. The underlying challenge is the efficient detection of different affected genes with different characteristics, such as genes mutated by viruses, radiation, or mutagenic chemicals. Machine learning techniques have been proposed to analyze microarray data. Hybridized methods can eliminate noise, reduce the number of features and ease classification. Swarm intelligence algorithms such as ant colony optimization (ACO), artificial bee colony optimization (ABC), and particle swarm optimization (PSO) are powerful in feature selection. Hybridization between classical machine learning techniques and emerging techniques such as swarm intelligence algorithms can yield better results in diagnosis and classification. Researchers have developed computational methods hybridized with swarm intelligence (SI) and reported that these hybridized systems are more accurate. Nevertheless, a model that relies solely on swarm intelligence algorithms should be built and analyzed.

Table 1. Most related work for the Microarray DNA