Towards transparent machine learning models using feature sensitivity algorithm

Abstract
Despite advances in health care, diabetic ketoacidosis (DKA) remains a potentially serious risk for patients with diabetes. Directing diabetes patients to the appropriate unit of care is critical for both lives and healthcare resources. Missing data occur in almost all machine learning models, especially in production. Missing data can reduce predictive power and produce biased model estimates. Estimating a missing value around a 50 percent probability may lead to a completely different decision. The objective of this paper was to introduce a feature sensitivity score using the proposed feature sensitivity algorithm. The data were electronic health records containing 644 records and 28 attributes. We designed a model using a random forest classifier that predicts the likelihood of a patient developing DKA at the time of admission. The model achieved an accuracy of 80 percent using five attributes; this new model has fewer features than any model mentioned in the literature review. In addition, the feature sensitivity score (FSS) was introduced, which identifies within-feature sensitivity; the proposed algorithm enables physicians to make transparent and accurate decisions at the time of admission. This method can be applied to different diseases and datasets.

I. Introduction
There is a consensus in the medical community that diabetic ketoacidosis (DKA) is widespread among diabetes patients. According to [1], the incidence of DKA in the diabetes population is about 15.6 percent, and DKA is responsible for up to four percent mortality. DKA can complicate the diagnosis, management, and prognosis of diabetes [2]. It is a life-threatening condition [3] [4] and poses many psychosocial challenges for patients with diabetes, putting them at risk of repeated hospitalizations [5]. DKA is a state of insulin deficiency that can lead to disorders of glucose and lipid metabolism. On the one hand, DKA can obscure and often delay the diagnosis of other diabetes complications [6]. On the other hand, it leads to more severe hyperglycemia and intravascular volume depletion, both of which may increase morbidity and mortality [7].
Missing values arise from gaps in sampling or from input errors, and they cause problems when working with machine learning [8] [9]. Many studies handle this issue by removing records with missing values, but this approach risks losing information that is significant for decision making [10] [11]. A second method is replacement, which uses the mean (for numeric attributes) or the mode (for nominal attributes). To reduce the influence of extreme values, the median can also be used [12]. Handling missing values matters because the performance of machine learning models is affected by missing data, and most algorithms do not work on data with missing values [13]. Most of the methods introduced for handling missing values operate on the training data [14].
This study aimed to predict the likelihood that a diabetic patient develops DKA and to identify the key features that lead to DKA complications. We applied a feature selection technique to reduce the number of features in the original data. We introduced a method to measure and identify within-attribute sensitivity. The dataset was collected from Alsukari Hospital for building the machine learning model. The data set contained 730 records and 29 attributes. The model is useful for predicting whether a patient has DKA, and for identifying the critical points within feature domain values that have a significant impact on the physicians' decision-making process. The rest of this paper is organized into Sections two to seven as follows: related works, data collection and preparation, methodology, experiments, results and discussion, future work, and conclusion.

A. Method
We used feature importance from a tree-based classifier to reduce the number of features. We then created categories for each feature and calculated the feature sensitivity score for each category. Fig. 2 shows the general steps of the sensitivity algorithm.
Brakel and Schrauwen [15] proposed two models for imputing missing values in order to train the model. The main idea of imputation is that if a value is not available for a particular instance, it can be estimated [9]. Fortes et al. [16] introduced a method that handles missing values by attaching a confidence and an error parameter to the suggested value. Pelckmans et al. [17] proposed an approach, based on mean imputation, that takes into account the predicted outcome when missing values occur. Rajawat et al. built a predictive analysis model using a hybrid machine learning technique and filled attributes that had missing values with the median value of that attribute [18]. These previous studies all depend on estimation when filling in missing values.
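The replacement strategies mentioned in this paper (mean, mode, and median) can be illustrated with a short sketch. The helper `impute` and the toy columns are hypothetical and are not taken from the study's dataset; the example only shows why the median is less affected by an extreme value than the mean.

```python
from statistics import mean, median, mode


def impute(values, strategy):
    """Replace None entries with a statistic of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "mode":
        fill = mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


# Hypothetical toy columns (illustrative only).
sugar = [110, 130, None, 120, 900]    # numeric, with one extreme value
gender = ["F", "M", "F", None, "F"]   # nominal

sugar_mean = impute(sugar, "mean")      # the extreme value pulls the fill upward
sugar_median = impute(sugar, "median")  # the fill stays near typical readings
gender_mode = impute(gender, "mode")    # most frequent category

print(sugar_mean[2], sugar_median[2], gender_mode[3])
```

In this toy column the mean fill is far above every typical reading, while the median fill stays among them, which is the motivation given for preferring the median when extreme values are present.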
Efstathiou et al. [19] conducted a study to predict mortality in diabetic ketoacidosis. Clinical and laboratory parameter values were assessed to predict mortality in patients with DKA, and the model was built using 20 attributes. The study suggested that simple clinical and laboratory parameters available in the first 24 hours from admission may contribute to a more objective assessment. However, the large number of attributes may complicate the model and decrease performance. Deeb et al. [20] proposed a model of care to reduce the frequency of hospital admission of children and adults presenting with DKA. The data covered 158 admissions for DKA collected over four years; the authors did not mention the attributes they used as predictors. Suwarto et al. [21] focused on building a model to predict DKA mortality using six attributes and 60 records, which is a limited number of observations. Siregar et al. [22] identified the predictors of 72-hour mortality in patients with DKA. They used 301 health records of adult patients, and the results showed that 11 predictors are associated with DKA complications and mortality. The difference between the previous studies and this work is that we used fewer attributes for predicting DKA, which directly reduces the training time.

C. Feature Selection
Feature selection is the process of discovering and choosing the most significant features in a dataset. It is a vital step in the machine learning pipeline [23]. Irrelevant features do not help a machine learning model: they reduce training speed, reduce model interpretability, and, most importantly, reduce performance on the test set [24]. Though it looks simple, feature selection is one of the most challenging steps in designing a new machine learning model. Feature selection and data cleaning should be the first steps in creating the model.
Feature selection algorithms estimate feature importance based on characteristics of the features, such as feature variance and relevance to the target variable. Selecting important features is part of the data pre-processing step, after which a model is trained using the selected features [25]. Therefore, feature selection is independent of the training algorithm. Feature importance is a technique that shows how the features contribute to the model prediction, and it is one of several methods for gaining insight into black-box models. It assigns a score to each feature of the data; the higher the score, the more important or relevant the feature is to the output variable [26].
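As a minimal sketch of importance-based selection with a tree-based classifier, the example below ranks features by the impurity-based `feature_importances_` attribute of scikit-learn's `RandomForestClassifier` and keeps the top ones. The synthetic data, the feature names `f0` to `f5`, and the cut-off `top_k` are assumptions made for illustration; they are not the study's data or settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data: 300 records, 6 features.
# Only the first two features drive the label; the rest are noise.
X = rng.normal(size=(300, 6))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
feature_names = [f"f{i}" for i in range(6)]  # hypothetical names

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances are non-negative and sum to 1.
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)

top_k = 2  # assumed cut-off for this toy example
selected = [name for name, _ in ranked[:top_k]]

for name, score in ranked:
    print(f"{name}: {score:.3f}")
print("selected:", selected)
```

A reduced model would then be retrained on the selected columns only, so that the trade-off between the number of features and accuracy can be checked on held-out data.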

D. Data Collection and preparation
The dataset was collected from Alsukari Hospital. Ethical approval to use the data for research was obtained from both the Ministry of Health (MOH) and the hospital. The dataset contained 644 records and 28 attributes of diagnosed diabetes patients who were admitted to the hospital in the period from January 2018 to April 2019.
The figure above shows the features and their contribution to the model. Age is the most important feature, at about 14 percent, and each of the remaining four features has an importance of around 10 percent. Feature importance is the contribution of a feature to the model. Adding more features gives the model more information, but the number of features has to be balanced against model performance. The proposed model was able to run with fewer attributes, using only 18 percent of the original features, while maintaining its level of accuracy.
• Infection_years: The time, in years, since the patient was diagnosed with diabetes.
• BMI: Body Mass Index (Table 2) is a measure of body fat based on height and weight that applies to adult men and women.
• Sugar: Blood sugar, or glucose, is the main sugar found in blood. The level is measured in milligrams per deciliter (mg/dL) or millimoles per liter (mmol/L).
• Symptoms_days: The time, in days, from the onset of symptoms until the patient arrived at the hospital.
• Class: The target variable.

A. Experiments
Random Forest (RF) is a machine learning algorithm that is widely used for classification problems. RF is an ensemble of independent decision trees [21]. Each tree is learned from randomly selected samples and features [30]. We created a model to predict the test data: the training data were used to fit the model and the testing data to evaluate it. We used 70 percent of the data for training and 30 percent for testing. Oversampling was applied to the training set to handle class imbalance in the dataset. We also set some hyperparameters of the algorithm, namely n_estimators, max_depth, and random_state, to 40, 16, and 18, respectively. The model achieved an accuracy of 80 percent on the testing data.
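The experimental setup described above can be sketched with scikit-learn as follows. The hyperparameter values (n_estimators=40, max_depth=16, random_state=18) and the 70/30 split come from the text. Everything else is an assumption made so the example runs: the data are synthetic, the split is stratified, and plain random oversampling of the minority class stands in for the oversampling method, which the text does not name.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(18)

# Synthetic, imbalanced stand-in for a 644-record, five-feature dataset.
X = rng.normal(size=(644, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=644) > 1.0).astype(int)

# 70/30 split; stratification is an assumption, not stated in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=18
)

# Random oversampling of the minority class, on the training set only,
# so that no duplicated record leaks into the test set.
counts = np.bincount(y_train)
minority = counts.argmin()
X_min = X_train[y_train == minority]
y_min = y_train[y_train == minority]
X_extra, y_extra = resample(
    X_min, y_min, replace=True,
    n_samples=counts.max() - len(y_min), random_state=18,
)
X_bal = np.vstack([X_train, X_extra])
y_bal = np.concatenate([y_train, y_extra])

model = RandomForestClassifier(n_estimators=40, max_depth=16, random_state=18)
model.fit(X_bal, y_bal)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.2f}")
```

The accuracy printed here reflects only the synthetic data and says nothing about the 80 percent reported for the hospital dataset.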
We used two features, age and DKA (Fig. 4), to test the efficiency of our proposed algorithm. The figures below visualize the results of the feature sensitivity algorithm. From Fig. 4, it is clear that patients in the 0-10 age category are the most likely to have DKA, with a probability of 80 percent, while patients aged 11 to 30 are in the critical area. There, the feature sensitivity score (FSS) is about 50 percent, so the absence of the Age attribute value is critical for the decision-making process, particularly in the healthcare field. In contrast, patients older than 50 years have a lower probability of having DKA and are therefore not in the critical area of decision making. According to Table 4, the 31-35 category is the only sensitive category of the body mass index (BMI) feature. This category has 50% sensitivity; if its value is absent, the feature importance of BMI (9.5%) acts as an error rate in the model outcome, and a different decision may result.

B. Discussion
We built a machine learning model with an accuracy of 80 percent. The new model has fewer features than the previous models in the literature review. We introduced the feature sensitivity score (FSS) to identify within-feature sensitivity using the proposed sensitivity algorithm. According to figure 3, the critical probability points within the age feature are in the 11-20 and 21-30 categories, with a probability of 50 percent. Furthermore, from figure 4, patients in the 31-35 body mass index (BMI) range are in the critical area because of the 50 percent probability.
A critical probability point means that any under- or overestimation of a missing value around the 50 percent probability may lead to a completely different decision. The feature importance of the affected feature also has to be taken into account. Using the proposed method would give physicians transparency on model accuracy when making decisions.
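The idea of a critical probability point can be illustrated with a small sketch. The text does not give the exact formula of the feature sensitivity score, so the code below is only one possible reading: it averages a model's predicted DKA probability within each category of a feature and flags categories whose average lies close to 50 percent. The function names, the `margin` of 0.1, the age bins, and the probabilities are all hypothetical.

```python
from collections import defaultdict


def category_probabilities(values, probabilities, bins):
    """Mean predicted probability for each (low, high) category of a feature."""
    grouped = defaultdict(list)
    for value, prob in zip(values, probabilities):
        for low, high in bins:
            if low <= value <= high:
                grouped[(low, high)].append(prob)
                break
    return {b: sum(p) / len(p) for b, p in grouped.items()}


def critical_categories(cat_probs, margin=0.1):
    """Categories whose mean probability lies within `margin` of 0.5."""
    return [b for b, p in cat_probs.items() if abs(p - 0.5) <= margin]


# Hypothetical ages and model probabilities, chosen only to mirror the
# pattern reported for the age feature (not real patient data).
ages = [4, 8, 15, 18, 25, 28, 55, 60]
probs = [0.85, 0.75, 0.55, 0.45, 0.50, 0.52, 0.20, 0.10]
age_bins = [(0, 10), (11, 20), (21, 30), (31, 50), (51, 120)]

by_category = category_probabilities(ages, probs, age_bins)
critical = critical_categories(by_category)

for (low, high), p in by_category.items():
    flag = "critical" if (low, high) in critical else ""
    print(f"{low}-{high}: {p:.2f} {flag}")
```

Under this reading, a missing value that would fall in a flagged category deserves extra caution, because a small error in the imputed value could move the prediction across the 50 percent decision boundary.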

IV. Conclusion
The findings of this study show that the proposed feature sensitivity algorithm is able to determine within-feature sensitivity. The feature sensitivity score will help doctors make transparent and accurate decisions when any feature value is missing. In the medical field, accurate decisions are essential. Clinicians' decisions are among the most important factors guiding the cost and quality of medical care. They help determine which prevention programs are provided, which diagnoses are made, which tests are requested, and which treatments are offered.