Classification of blast cell type on acute myeloid leukemia (AML) based on image morphology of white blood cells

Introduction
Leukemia is a cancer of the blood and bone marrow. Bone marrow is the spongy tissue inside bones where blood cells are made, and cancerous blood cells damage the normal blood cells produced there [1]. Leukemia is broadly divided into chronic and acute forms. Acute leukemia includes Acute Lymphoblastic Leukemia (ALL) and Acute Myeloid Leukemia (AML) [2]. Under the French-American-British classification, AML is divided into 8 subtypes, including M0, M1, and M2 [3]. AML arises when the differentiation of myeloid-series cells stops at the blast stage, which results in a buildup of blasts in the bone marrow. ALL and AML have conventionally been diagnosed using a complete blood cell count. This approach is relatively demanding in effort, time, and cost [4]. An alternative that addresses these problems is blood cell image processing [2,5,6], which allows the identification step of the diagnosis to be computerized. A number of studies on image processing for leukemia diagnosis have been carried out. They fall into three groups: the first detects whether ALL is present or absent [7][8][9][10], the second distinguishes ALL from AML [11][12][13], and the third identifies the AML subtype.
Studies that apply blood cell image processing to the diagnosis of ALL include those of Devi et al. [14] and Selvaraj et al. [15]. The system model proposed by Devi et al. [14] is divided into several stages: pre-processing, segmentation using Otsu thresholding, feature extraction with the histogram of oriented gradients (HOG), and classification using an adaptive fuzzy inference system. Khosrosereshki et al. [16] also diagnose leukemia with a fuzzy inference system, but use the Mamdani method. Selvaraj et al. [15] use features somewhat different from those of Devi et al. [14]. The features are divided into two groups, namely shape features and densitometric features, and the features in both groups are then used to draw conclusions with a naive Bayesian classification algorithm. The next study, by Suryani et al. [17], diagnosed the AML subtypes M2 and M3. In that study, feature extraction is preceded by a segmentation process that uses a watershed distance transform. Feature extraction yields the white blood cell (WBC) area, WBC perimeter, WBC roundness, nucleus ratio, WBC mean, and WBC standard deviation. The AML subtype is finally determined using a neural network. A similar diagnostic concept is used by Harjoko et al. [18], who classify the AML subtypes M1, M2, and M3, with feature extraction preceded by an active contour operation.
ISSN: 1693-6930. TELKOMNIKA Vol. 17, No. 2, April 2019: 645~652.
Most existing research, especially on the diagnosis of AML subtypes, uses stages that clinicians do not commonly use, which makes it difficult for clinicians to follow each stage of the diagnostic process. When diagnosing AML subtypes, clinicians typically identify the types of blast cell present in the WBC and use that knowledge to determine the AML subtype. This approach was adopted by Suryani et al. [19], who identified the AML M0 and M1 subtypes by determining the dominant blast cell type in the WBC. Suryani et al. [20] conducted a further study, but for the diagnosis of AML M1 and M2. Unfortunately, both studies share a weakness, namely the low performance of blast cell type classification, which in turn lowers the performance of the diagnosis system for AML M0, AML M1, and AML M2. The poor classification performance was caused by the limited availability of white blood cell image samples identified as AML M0, AML M1, and AML M2. As a result, the data produced by the feature extraction process were unevenly distributed across the blast cell types. Imbalanced data can degrade the performance of classification algorithms [21][22][23][24][25].
Building on the studies above and their respective advantages and disadvantages, this study proposes a model for classifying blast cells in WBC images identified as the leukemia subtypes AML M0 and AML M1. The classification takes data imbalance into account, which is handled using a combination of resampling, the Synthetic Minority Over-sampling Technique (SMOTE), and randomization. System performance is measured using sensitivity, specificity, accuracy, and area under the curve (AUC).

Research Method

Data
This study used data obtained from Clinical Pathology at Dr. Moewardi Hospital, Surakarta, Indonesia. The data consisted of 50 white blood cell images identified as AML: 20 images of subtype AML M0 and 30 images of subtype AML M1. The images are in JPG format with a size of 1600x1200 pixels. The characteristics of the blast cell types found in the WBC for these AML subtypes are shown in Table 1, and images of each blast cell type are shown in Figure 1.

Proposed Method
The system model for identifying blast cell types in the diagnosis of leukemia is shown in Figure 2. The model is divided into 4 parts: image processing, oversampling, classification, and evaluation of system performance. Image processing comprises pre-processing, segmentation [19,27], and feature extraction. Feature extraction produces three features, namely WBC diameter, nucleus ratio, and nucleus roundness [19]. These three variables characterize the blast cells in the WBC. This study considers 3 types of blast cell, as shown in Table 1. The oversampling part comprises resampling, SMOTE, deletion of redundant data, and randomization. Resampling draws a new sample from the existing data for the subsequent SMOTE process. Duplicate records are then removed from the SMOTE output, and the order of the data is randomized. The third part is classification using the Random Forest algorithm; the k-NN algorithm is also tested for comparison. The fourth part is performance evaluation, which uses a 3-dimensional (three-class) confusion matrix, as shown in Table 2. From Table 2, a 2-dimensional (two-class) confusion matrix can be derived for each cell type. The derivation for the myeloblast cell type is given in (1)-(4), and the same concept is applied to the promyelocyte and myelocyte cell types. Equations (1)-(4) are in turn used to derive the performance parameters: sensitivity, specificity, accuracy, and area under the curve (AUC). System performance is validated using k-fold cross-validation with k=10. The method divides the data into k groups, uses k-1 groups as training data and 1 group for testing, and repeats the process until every group has been used for testing.
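The reduction from a three-class confusion matrix to a per-class two-class matrix can be illustrated with a short Python sketch. Equations (1)-(4) and Table 2 are not reproduced in this text, so the code assumes the standard one-vs-rest reduction, with rows as true classes and columns as predicted classes. The counts in `example_cm` are hypothetical and are not the results of this study. AUC is not computed here because it requires classifier scores rather than counts.

```python
from typing import Dict, List

CLASSES = ["myeloblast", "promyelocyte", "myelocyte"]


def one_vs_rest(cm: List[List[int]], i: int) -> Dict[str, int]:
    """Collapse a multi-class confusion matrix into TP/FN/FP/TN for class i.

    Assumption: cm[r][c] counts samples with true class r predicted as class c.
    """
    total = sum(sum(row) for row in cm)
    tp = cm[i][i]                           # class i predicted as class i
    fn = sum(cm[i]) - tp                    # class i predicted as something else
    fp = sum(row[i] for row in cm) - tp     # other classes predicted as class i
    tn = total - tp - fn - fp               # everything not involving class i
    return {"tp": tp, "fn": fn, "fp": fp, "tn": tn}


def performance(c: Dict[str, int]) -> Dict[str, float]:
    """Sensitivity, specificity and accuracy from two-class counts."""
    total = c["tp"] + c["tn"] + c["fp"] + c["fn"]
    return {
        "sensitivity": c["tp"] / (c["tp"] + c["fn"]),
        "specificity": c["tn"] / (c["tn"] + c["fp"]),
        "accuracy": (c["tp"] + c["tn"]) / total,
    }


# Hypothetical counts, for illustration only.
example_cm = [
    [50, 3, 2],
    [4, 40, 6],
    [1, 5, 44],
]

for index, name in enumerate(CLASSES):
    print(name, performance(one_vs_rest(example_cm, index)))
```

Each class is evaluated against the other two combined, so the three cell types yield three separate sets of sensitivity, specificity, and accuracy values.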

Synthetic Minority Over-sampling Technique (SMOTE)
Imbalanced data can be addressed using sampling techniques. One such technique is the Synthetic Minority Over-sampling Technique (SMOTE) [28]. SMOTE over-samples the minority classes by creating synthetic samples, operating in feature space rather than in data space, so that the data distribution across classes becomes balanced. Each synthetic sample is created by interpolating between an existing minority-class sample and a sample chosen at random from its k nearest neighbors.
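The interpolation step can be sketched in a few lines of Python. This is a simplified illustration and not the implementation used in this study: it picks the base sample at random instead of looping over every minority sample, and the function name `smote`, its parameters, and the feature values are assumptions made for the example.

```python
import math
import random
from typing import List


def euclidean(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def smote(minority: List[List[float]], n_new: int, k: int = 5,
          seed: int = 0) -> List[List[float]]:
    """Create n_new synthetic samples from the minority-class samples."""
    if len(minority) < 2:
        raise ValueError("at least two minority samples are required")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest neighbours of the base sample within the minority class
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: euclidean(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # the new point lies on the segment between base and neighbour
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical normalized features: diameter, nucleus ratio, nucleus roundness.
promyelocytes = [
    [0.40, 0.70, 0.80],
    [0.45, 0.65, 0.85],
    [0.50, 0.72, 0.78],
    [0.42, 0.68, 0.82],
]
new_points = smote(promyelocytes, n_new=6, k=2)
print(len(new_points), "synthetic samples created")
```

Because every synthetic point lies between two real minority samples, the new data stay inside the region of feature space that the minority class already occupies.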

Classification Algorithms
Classification algorithms can be grouped into two by the approach they use, i.e. black-box and non-black-box [29]. This research uses both approaches: the k-NN algorithm as the black-box approach and the random forest (RF) algorithm as the non-black-box approach. The random forest classification algorithm is an improvement on the CART classification algorithm, obtained by applying bootstrap aggregating (bagging) and random feature selection [30,31]. A random forest uses a number of decision trees, each trained on a sample of the data, and each split within a tree is chosen from a random subset of the attributes. Classification is done by taking the majority vote of the trees for each tested record.
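The three ingredients named above (bootstrap samples, random attribute subsets, and majority voting) can be sketched as follows. To keep the example short, each "tree" is a single-split stump rather than a full CART tree, so this is an illustration of the ensemble idea and not a complete random forest. All function names and the feature values are hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    """Most frequent label in a list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(X, y, features):
    """Find the best single-threshold split among the candidate features."""
    best = None  # (errors, feature, threshold, left_label, right_label)
    for f in features:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            ll, rl = majority(left), majority(right)
            errors = (sum(lab != ll for lab in left)
                      + sum(lab != rl for lab in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, ll, rl)
    if best is None:  # no valid split: always predict the majority label
        m = majority(y)
        return (0, float("inf"), m, m)
    return best[1:]


def predict_stump(stump, row):
    f, t, left_label, right_label = stump
    return left_label if row[f] <= t else right_label


def fit_forest(X, y, n_trees=25, n_features=2, seed=0):
    """Train stumps on bootstrap samples with random feature subsets."""
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]             # bagging
        feats = rng.sample(range(len(X[0])), n_features)        # random attributes
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict_forest(forest, row):
    """Majority vote over all stumps."""
    return majority([predict_stump(s, row) for s in forest])


# Hypothetical normalized features for two cell types.
X_demo = [[0.10, 0.20, 0.30], [0.20, 0.10, 0.25], [0.15, 0.30, 0.20],
          [0.90, 0.80, 0.70], [0.80, 0.90, 0.75], [0.85, 0.70, 0.80]]
y_demo = ["myeloblast"] * 3 + ["myelocyte"] * 3
forest_demo = fit_forest(X_demo, y_demo)
print(predict_forest(forest_demo, [0.05, 0.05, 0.05]))
```

Individual stumps can be wrong because each sees only a bootstrap sample and a subset of the attributes; the vote across many of them is what stabilizes the prediction.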
The k-Nearest Neighbor (k-NN) classification algorithm classifies data based on Euclidean distance [32]. The accuracy of k-NN suffers when irrelevant features are present or when the feature weights do not match their relevance to the classification. Another factor that affects the performance of k-NN is the value of k. A value of k that is too high reduces the effect of noise on the classification, but blurs the boundary between classes. A good value of k can be found by parameter optimization, for example by using a feature selection method.
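A minimal Python sketch of k-NN with Euclidean distance, evaluated with the 10-fold cross-validation described in the Proposed Method, is given below. The fold assignment (every tenth record goes to the same fold) is an assumption made to keep the example deterministic, and the data are hypothetical, not the records of this study.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of query by majority vote of its k nearest neighbours."""
    ranked = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.sqrt(
            sum((a - b) ** 2 for a, b in zip(pair[0], query))
        ),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def cross_validated_accuracy(X, y, n_folds=10, k=3):
    """k-fold cross-validation: each fold is used once for testing."""
    n = len(X)
    correct = 0
    for fold in range(n_folds):
        test_idx = list(range(fold, n, n_folds))
        held_out = set(test_idx)
        train_X = [X[i] for i in range(n) if i not in held_out]
        train_y = [y[i] for i in range(n) if i not in held_out]
        for i in test_idx:
            if knn_predict(train_X, train_y, X[i], k) == y[i]:
                correct += 1
    return correct / n


# Hypothetical, well-separated two-feature data for two cell types.
X = ([[0.10 + 0.01 * i, 0.10] for i in range(10)]
     + [[0.90 - 0.01 * i, 0.90] for i in range(10)])
y = ["myeloblast"] * 10 + ["myelocyte"] * 10

print(cross_validated_accuracy(X, y, n_folds=10, k=3))
```

Raising `k` in this sketch makes each prediction depend on more neighbours, which is the smoothing effect described above.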

Results and Analysis
The results of the blast cell classification model for WBC images identified as subtypes AML M0 and AML M1 are presented for each stage. The first stage is image processing. In this stage the image is segmented, and the result of WBC image segmentation is shown in Figure 3. In the second stage, feature extraction is performed on the segmented image. Feature extraction produces 3 attributes: the WBC diameter (in μm), the nucleus ratio, and the nucleus roundness. The values of the WBC diameter attribute are on a very different scale from those of the other attributes, so they are normalized to bring all attributes onto the same scale. The normalization method used is Min-Max [33]. This study uses 50 WBC images, of which 20 are identified as AML M0 and 30 as AML M1. Feature extraction on these 50 images, using the WBC diameter, nucleus ratio, and nucleus roundness features, yielded 165 blast cell records, as shown in Table 3: 97 myeloblasts, 31 promyelocytes, and 37 myelocytes. The distribution is imbalanced, with roughly 3 times as many myeloblast records as promyelocyte or myelocyte records. The next step in the proposed system model is to perform resampling, SMOTE, and duplicate removal. These steps produced 244 records: 64 myeloblasts, 84 promyelocytes, and 96 myelocytes. The resulting data are then classified with the Random Forest (RF) and k-NN [32] classification algorithms and validated using k-fold cross-validation with k=10. The test results are shown in Table 4 and Table 5.
The proposed model of the blast cell classification system for WBC images identified as AML M0 and AML M1 provides better performance, as shown in Table 4 and Table 5. When the distribution of data across the blast cell types is imbalanced, the average AUC falls in the poor category, whereas after the SMOTE process the performance falls in the good category (in the range 80-90%) [34]. In Table 4, the sensitivity of the Random Forest classification algorithm without SMOTE is better for myeloblasts than for the other cell types. This is because there are about 3 times more myeloblast records than records of the other cell types, which again indicates data imbalance.
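The Min-Max normalization applied to the WBC diameter earlier in this section rescales each value to the range 0 to 1 using the smallest and largest observed values. A minimal Python sketch follows; the diameters are hypothetical, and returning zeros for a constant column is a choice made for this example.

```python
from typing import List


def min_max_normalize(values: List[float]) -> List[float]:
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # constant column: no spread to rescale (assumed convention)
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical WBC diameters in micrometres.
diameters_um = [10.0, 15.0, 20.0]
print(min_max_normalize(diameters_um))
```

After this step the diameter lies on the same 0 to 1 scale as ratio-type attributes, so it no longer dominates distance-based classifiers such as k-NN.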
Compared with previous research, such as that of Suryani et al. [19], the proposed system model performs better. This is shown by a significance test using the t-test at the 95% confidence level, the results of which are given in Table 6. In the study by Suryani et al. [19], the classification algorithm used is k-NN. Without oversampling with SMOTE, changing the classification algorithm makes no significant difference: when k-NN is compared with Random Forest, the p-value is >0.05. When Random Forest in the proposed system model is replaced with another classification algorithm, such as k-NN, the model still provides better performance, with a p-value <0.05. This shows that the combination of resampling, SMOTE, and duplicate removal is able to provide better performance.
A further comparison is made with the research of Suryani et al. [20]. That study diagnosed AML M1 and AML M2 leukemia using the blast cell types present in AML M1 and M2 as decision parameters. The blast cells used were myeloblast, promyelocyte, myelocyte, and metamyelocyte. The set of cell types differs because that study also detects AML M2. The problem in that study was imbalanced data, which led to poor performance in detecting blast cell types. Compared with it, the proposed blast cell type detection model performs much better, with a p-value <0.05. The complete comparison is shown in Table 6. The proposed blast cell type classification model delivers performance in the good category with respect to the AUC value. This performance is obtained using 3 attributes that characterize each type of blast cell. The three attributes can also be analyzed to find out how much each influences the performance of the proposed system model. This influence can be examined using filter-type feature selection algorithms, such as information

Table 3. Sample Data from the Feature Extraction Results

Table 4. The Performance of the Proposed System Model (RF)

Table 5. The Performance of the Proposed System Model (k-NN)

Table 6. The Comparison of the Proposed System Model with Previous Research