Linkage Detection of Features that Cause Stroke using Feyn Qlattice Machine Learning Model

ABSTRACT

Stroke is a disease caused by brain tissue damage due to a blockage in the cerebrovascular system that disrupts the body's sensory and motor systems. It is one of the leading causes of death in the world. Data collected from Electronic Health Records (EHR) is growing and has become part of health service big data. This data can be processed and analyzed using machine learning to determine the risk groups for stroke. Machine learning can be used as a predictor of stroke causes, and the predictor clarifies the influence of each causal factor of the disease. Our contribution in this research is to evaluate the Feyn Qlattice machine learning model for detecting the influence of the main causal features of stroke. We attempt to obtain the correlation between the features of stroke, focusing on gender and on whether other features are linked to it. This research applies the Feyn Qlattice model to 4908 records of stroke predictor data. The results imply that, when the features are examined with respect to gender, age and hypertension are the main features linked to stroke. Autorun in the Feyn Qlattice model was run for ten epochs, resulting in 17596 tested models in 57 s. A query string parameter focused on the age and hypertension features resulted in 1245 models in 4 s. Accuracy increased from 0.723 to 0.731 in the training metrics and from 0.695 to 0.708 in the testing metrics. Evaluation results showed that the model is reasonably good as a predictor of stroke, as indicated by the ROC curves (blue lines) of the training and testing metrics lying close to the top-left corner of the graph.

INTRODUCTION
Stroke is a disease caused by brain tissue damage due to a blockage in the cerebrovascular system [1] that disrupts the body's sensory and motor systems [2]. This condition disrupts all body functions controlled by the affected brain tissue. Stroke is a very dangerous disease and must be treated immediately because brain cells can die within minutes. Proper treatment is needed to prevent complications. Stroke has become one of the leading causes of death in the world [3]. Many low-income countries are unable to cope with the burden posed by this disease. Moreover, Indonesia ranks first in deaths caused by stroke, with 193.3 cases per 100,000 per year [4]. Some causal factors of stroke are hypertension, obesity, smoking, increased cholesterol, physical activity, increased low-density lipoprotein, excessive alcohol consumption, and diabetes [5].
The use of Electronic Health Records (EHR) is rapidly increasing in many countries worldwide [6]. Much of the medical data produced by EHR has been collected and included in the big data of health and medical services [7] [8]. Analysis of medical data is required to determine the risk group factors of many diseases [9]. The collected data can be reprocessed using machine learning models to find new patterns that can serve as actionable knowledge and information [10]. One benefit of machine learning is that it can be used to predict several factors that may cause stroke [11] [12]. This predictor clarifies the influence of each factor causing the disease. For example, we can investigate whether age and hypertension affect someone's susceptibility to stroke.
Many machine learning classifiers have been used in previous studies, especially on stroke. Liu [1] used a random forest model to classify the causal factors of stroke, resulting in 85.03% accuracy. Zhu [13] identified ischemic stroke onset time based on DWI and FLAIR imaging with a Convolutional Neural Network (CNN) model, yielding an accuracy of 80.50%. Meanwhile, Jamthikar [14] used a random forest model for stroke prevention by integrating carotid ultrasound image-based phenotypes and their harmonics with conventional risk factors, yielding an accuracy of 93.15%.
Many machine learning models are used to predict stroke, such as SVM, XGBoost, Logistic Regression, KNN, Random Forest, Decision Tree, and others. However, few studies currently apply machine learning models to investigate the correlation or linkage between the primary causal features of stroke. Therefore, we propose an alternative machine learning model, the Feyn Qlattice model, to assess the influence of each causal feature of stroke. This model was developed by a startup named Abzu and was inspired by Richard Feynman's path integral formulation [15]. Feyn Qlattice has some advantages over neural networks and decision trees. It eliminates the black box concept found in neural networks, while providing explanations similar to those of a decision tree model. Feyn Qlattice works by searching thousands of potential models and seeking the best features to build the ideal machine learning model for a given computational problem [16].
Our contribution in this study is to analyze and evaluate the Feyn Qlattice machine learning model for detecting the influence of the main causal features of stroke. The analysis was carried out to obtain the correlation between the features of stroke, focusing on gender and on whether other features are linked to it. By applying Feyn Qlattice, thousands of trained models can be obtained, so that the best machine learning model can be selected and used to predict the main causes of stroke. The results of the analysis take the form of model evaluation data for each predictor feature used.

METHOD
This section describes the proposed framework for using the Feyn Qlattice model to predict the association of features that influence the causation of stroke. Several important steps are described in each subsection. The overall methodology used in the research can be seen in Fig. 1. After the dataset is collected, data transformation is carried out to balance the data so that the machine learning model can work effectively. In the next step, the data is separated into training data and testing data. This technique is called splitting. Then, the Feyn Qlattice model is applied to produce thousands of candidate machine learning models that can predict the main features that cause stroke. The features that cause stroke are then selected and tested with the Feyn Qlattice model as well. The model with the best performance is selected and evaluated. A more detailed explanation is given in the following subsections.

Dataset
The dataset used in this research was taken from a public dataset made by Fedesoriano and uploaded to Kaggle [17]. The dataset is formatted as Comma-Separated Values (CSV) with 5110 rows of data. It still contains noisy or incorrectly formatted data, for example empty-valued or non-uniform entries. The dataset has 12 columns that can be used to predict the cause of stroke: id, gender, age, hypertension, heart_disease, ever_married, work_type, residence_type, avg_glucose_level, bmi, smoking_status, and stroke. The stroke column serves as the class label of the dataset. Table 1 shows a sample of the dataset and the format used in this research.

Preprocessing Dataset
The collected dataset cannot be used immediately because it is imbalanced, so preprocessing is needed. This step balances the dataset by adding samples to the smaller class or removing samples from the larger class [18]. Preprocessing is essential to improve data quality so that machine learning can function properly [19]. An unprocessed dataset is usually ambiguous and incomplete because some of its attributes are missing, either in its inputs or outputs, which may negatively affect machine learning modeling [20]. Moreover, Qlattice models detect data types automatically, and incorrect detection of data types leads to poor machine learning models. Qlattice supports many variants of data transformations, such as linear, multiply, sine, tan, and gaussian transformations [16].
Ambiguous data features, such as columns with similar features, are merged or filtered so that only one column remains [21]. Empty-valued entries are also deleted in preprocessing. Data consistency is carefully maintained. This can be seen in the bmi feature, where N/A values were found among values that are otherwise mostly decimals. Therefore, the feature is adjusted so that its values are uniform.
Features with categorical values are converted to numeric categories; this technique is called categorical variable encoding [22]. Each value of a categorical feature is replaced by a number. For example, the gender feature has "male" and "female" as its categories. The value "male" is encoded as "1", and "female" is encoded as "0". The detailed categorical encoding can be seen in Table 2.
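The removal of empty-valued bmi entries described above can be illustrated with a minimal sketch. It assumes that missing values appear as the literal string "N/A" or as an empty field; the three sample rows and the helper name drop_missing_bmi are hypothetical and are not taken from the study.

```python
import csv
import io
import math

# Hypothetical three-row excerpt in the dataset's CSV layout (illustrative values only).
RAW = (
    "id,gender,age,bmi\n"
    "1,Male,67,36.6\n"
    "2,Female,61,N/A\n"
    "3,Female,49,34.4\n"
)


def drop_missing_bmi(rows):
    """Keep only rows whose bmi parses as a finite number, and cast it to float."""
    cleaned = []
    for row in rows:
        try:
            value = float(row["bmi"])
        except ValueError:
            continue  # "N/A" or an empty field: drop the row
        if math.isnan(value):
            continue  # a literal "nan" also counts as missing
        cleaned.append({**row, "bmi": value})
    return cleaned


rows = list(csv.DictReader(io.StringIO(RAW)))
clean_rows = drop_missing_bmi(rows)
print(len(rows), "->", len(clean_rows))  # 3 -> 2
```

After this step every remaining bmi value has the same decimal type, which is the uniformity the preprocessing aims for.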
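As an illustration of categorical variable encoding, the sketch below applies the gender mapping stated in the text ("male" to 1, "female" to 0). The helper encode_column and the sample rows are hypothetical; the codes for the other categorical features would come from Table 2 and are not reproduced here.

```python
# Mapping stated in the text; other columns would need their own maps from Table 2.
GENDER_CODES = {"male": 1, "female": 0}


def encode_column(rows, column, codes):
    """Return copies of the rows with the given column replaced by its integer code."""
    encoded = []
    for row in rows:
        key = str(row[column]).strip().lower()
        if key not in codes:
            # Fail loudly instead of silently inventing a code for an unknown category.
            raise ValueError(f"unmapped category {row[column]!r} in column {column!r}")
        encoded.append({**row, column: codes[key]})
    return encoded


sample = [{"gender": "Male", "age": 67}, {"gender": "Female", "age": 61}]
encoded = encode_column(sample, "gender", GENDER_CODES)
print(encoded)
```

Raising an error on unmapped categories is a design choice of this sketch: it makes non-uniform values visible during preprocessing rather than letting them pass into the model.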

Data Splitting and Data Balancing
Data that has passed the preprocessing step has better quality and is ready to be used in machine learning. The data is divided into 75% training data and 25% testing data. The splitting must be done carefully to improve the model's accuracy [23] [24]. An illustration of the splitting can be seen in Fig. 2.
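A 75%/25% split can be sketched as follows. The shuffling, the fixed seed, and the helper name split_rows are assumptions made for illustration; the study does not state how rows were assigned to each part. The record count of 4908 is the size reported for the cleaned dataset.

```python
import random


def split_rows(rows, train_fraction=0.75, seed=42):
    """Shuffle a copy of the rows and cut it into training and testing parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


records = list(range(4908))  # stand-ins for the 4908 cleaned records
train, test = split_rows(records)
print(len(train), len(test))  # 3681 1227
```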

Feyn Qlattice Model
The Feyn Qlattice model was used in this research. Data that has been split into 75% training data and 25% testing data is processed with this model. The steps used in the Feyn Qlattice model can be seen in Fig. 3.

Fig 3. Feyn Qlattice Model
Based on Fig. 3, the split dataset is reprocessed with a technique called sample weight computation to balance the data. Imbalanced data usually creates problems in machine learning [25]. Only balanced data is passed to the Feyn Qlattice. The Qlattice model uses the training data separated during data splitting to fill its train parameter. Another parameter of the model is the output name; since the purpose of the model is to predict stroke, this parameter is given "stroke" as its value. The kind parameter of the model is set to "classification" since the task is a classification problem. Meanwhile, the stypes parameter is set to "gender", since the model is used to assess the influence of the gender feature on the other features that cause stroke.
The autorun process in Qlattice takes all the parameters that have been set. This process produces thousands of models that are tested over 10 epochs within a certain time. The number of epochs is a hyperparameter that determines how many times the machine learning model processes the training data [26]. This research ran all 10 of the 10 epochs to obtain the best model and feature prediction. After the processing stages have been applied to the dataset, the machine learning run yields the best models and feature predictions. The methodological steps to get the best model and predictor features can be seen in Fig. 4.
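The text does not give the formula used for sample weight computation. One common choice, shown here purely as an assumption, is the "balanced" heuristic in which each class c receives the weight n / (k * n_c), where n is the number of samples, k the number of classes, and n_c the number of samples in class c. The helper name and the toy labels are hypothetical.

```python
from collections import Counter


def balanced_sample_weights(labels):
    """Weight each sample by n / (k * n_c) so every class has equal total weight."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    class_weight = {c: n / (k * m) for c, m in counts.items()}
    return [class_weight[y] for y in labels]


# Toy imbalanced labels: nine non-stroke (0) samples and one stroke (1) sample.
labels = [0] * 9 + [1]
weights = balanced_sample_weights(labels)
print(weights[0], weights[-1])
```

With these weights the single minority sample counts as much in training as the nine majority samples combined, which is the balancing effect the step is meant to achieve.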

Evaluation Model
The best model obtained is evaluated using a machine learning model evaluation method called the confusion matrix. This method can be used to measure the model's performance on various classification problems in machine learning [27]. The confusion matrix represents results as true positive (TP), true negative (TN), false positive (FP), and false negative (FN) [27]. TP means that the positive results predicted by the model are correct. TN means that the negative results predicted by the model are correct. Meanwhile, FP means that the positive results predicted by the model are wrong, and FN means that the negative results predicted by the model are wrong. Fig. 5 illustrates the confusion matrix table.

Fig 5. Confusion Matrix
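The four cells of the confusion matrix can be counted directly from true and predicted labels. The following sketch uses made-up labels, with 1 standing for stroke and 0 for no stroke; the function name confusion_counts is hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP and FN for binary labels where 1 is the positive class."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1:
            counts["TP" if actual == 1 else "FP"] += 1
        else:
            counts["TN" if actual == 0 else "FN"] += 1
    return counts


y_true = [1, 0, 1, 0, 0, 1]  # illustrative ground truth
y_pred = [1, 0, 0, 1, 0, 1]  # illustrative predictions
print(confusion_counts(y_true, y_pred))
# {'TP': 2, 'TN': 2, 'FP': 1, 'FN': 1}
```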
Performance evaluation with the confusion matrix yields accuracy, precision, and recall [28] [29]. Accuracy is the proportion of data points that the model predicted correctly among all data points. It can be calculated as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Precision is the percentage of positive predictions that are correct. It can be calculated as
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Meanwhile, recall is the percentage of relevant elements correctly classified by the machine learning model over all relevant elements. Recall can be calculated using
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
Along with the confusion matrix, we used the Receiver Operating Characteristic (ROC), a visual technique for assessing and choosing a suitable classifier based on its performance [30]. ROC can also be considered a performance measurement of a classification-type machine learning model [31]. It is common to compute the Area Under the ROC Curve (AUC), a recognized metric for evaluating and comparing classification models [30]. AUC is equivalent to the probability that a randomly selected positive sample receives a higher score than a randomly selected negative sample [30]. The closer the ROC curve is to the top-left corner of the graph, the better the model classifies [32].
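The accuracy, precision, and recall definitions above, together with the probabilistic reading of AUC, can be written as a short sketch. The counts and scores are illustrative only and are not the study's results; ties between a positive and a negative score are counted as one half, which is a common convention assumed here.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """Share of positive predictions that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of truly positive samples that were predicted positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def auc(y_true, scores):
    """Probability that a random positive sample outscores a random negative one."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative confusion-matrix counts.
print(accuracy(tp=2, tn=2, fp=1, fn=1), precision(tp=2, fp=1), recall(tp=2, fn=1))

# Illustrative scores: three of the four positive/negative pairs are ranked correctly.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The pairwise AUC is quadratic in the number of samples, so it suits small illustrations; for a dataset of thousands of rows a rank-based implementation would be preferable.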

Data Transformation After Preprocessing
The data generated after the preprocessing stage undergoes data transformation, which reduces the number of rows and converts each categorical feature value. After the data transformation, the number of records was reduced from 5110 to 4908. The results of the data transformation can be seen in Table 3. As seen in Table 3, the categorical features have been changed according to the categorical variable encoding. These variables previously had string values, whereas now their values are encoded as integers such as 0, 1, or 2. However, features such as avg_glucose_level and bmi still use decimals, since their values vary continuously and are not categorical. The transformed data now totals 4908 records, composed of 75% training data and 25% testing data. Hence, the training data has 3681 records and the testing data has 1227 records.

Feyn Qlattice Model
Following the stages for using the Feyn Qlattice model shown in Fig. 4, running autorun mode for 10 epochs produced 17596 models to be tested. With 'gender' as the stypes input, this run yielded the best predictor features, hypertension and age, in 57 s. The autorun process of the Feyn Qlattice model can be seen in Fig. 6. According to Fig. 6, age and hypertension are the most likely predictors of stroke when viewed based on the gender feature. The next test narrows the features with an additional query string parameter that contains age and hypertension as its values. The plot of the model resulting from the added query string parameter can be seen in Fig. 7.

Fig 7. Addition of Query Strings: Age and Hypertension
After the age and hypertension query strings were added, autorun was re-run for ten epochs and resulted in 1245 machine learning models in 4 s. The best model is then visualized with a plot that reports training and testing metrics. The training metrics are 0.731 accuracy, 0.851 AUC, 0.117 precision, and 0.809 recall. Meanwhile, the testing metrics are 0.708 accuracy, 0.818 AUC, 0.106 precision, and 0.788 recall. These results show an increase in accuracy. In the training metrics, accuracy increased from 0.723 to 0.731. Likewise, in the testing metrics, accuracy increased from 0.695 to 0.708.

Model Evaluation
The Feyn Qlattice model was evaluated with a confusion matrix and ROC curve analysis. The evaluation was applied to the model with the highest accuracy, which is 73.1%. The evaluation results for the training metrics can be seen in Fig. 8, while the results for the testing metrics can be seen in Fig. 9. The AUC was 0.85 for the training metrics and 0.82 for the testing metrics. According to Fig. 8 and Fig. 9, the quality of the training metrics was reasonably good, since the ROC curve (blue line) lies close to the top-left corner of the graph, with an AUC of 0.85. The quality of the testing metrics was also reasonably good, with a similar curve and an AUC of 0.82. The results of this study were then compared with the results of previous researchers. A comparison of the model results can be seen in Table 4.

CONCLUSION
The results of the tests carried out in this study indicate that the Feyn Qlattice model can be a solution for obtaining the features used to predict stroke. The Feyn Qlattice autorun method can identify the main features that trigger stroke with respect to a person's gender, i.e., age and hypertension. The autorun method was run for 10 epochs and produced 17596 test models in 57 s. The query string parameter in the Feyn Qlattice was then focused on the age and hypertension features. Once applied, it produced 1245 models over 10 epochs in 4 s. The experimental results showed an increase in accuracy from 0.723 to 0.731 in the training metrics and from 0.695 to 0.708 in the testing metrics. The evaluation using the confusion matrix and the ROC curve shows that this model has fairly good performance, with the ROC curve (blue line) approaching the top-left corner of the graph.