Models for predicting the quality of life domains on the general population through the orange data mining approach

The incidence of type 2 diabetes mellitus (DM) is predicted to keep increasing worldwide until 2045, and long-term treatment and lifestyle factors affect quality of life. This study aims to determine which models can be used to predict the quality-of-life domains in the general population using artificial intelligence (AI) tools. The study used a cross-sectional design. The inclusion criteria were individuals aged above 18 years who had never been diagnosed with diabetes mellitus (either type 1 or type 2 DM), had fasted for at least 8 hours, and were willing to sign an informed consent form after receiving an explanation. Participants were asked to fill out two questionnaires: the Indonesian version of the Finnish Diabetes Risk Score (FINDRISC) and the EuroQoL-5 Dimensions-5 Level (EQ-5D-5L). The AI analysis used Orange® machine learning with three predictive models: Logistic Regression, Neural Network, and Support Vector Machine (SVM). The models were evaluated using sensitivity, precision, accuracy, and the area under the receiver operating characteristic curve (AUC-ROC). The results showed that, based on the AUC value, precision, accuracy, and the ROC analysis, the neural network model was the best for predicting the utility index of the EQ-5D-5L domains from demographic data and the FINDRISC questionnaire.


INTRODUCTION
The incidence of type 2 diabetes mellitus (DM) continues to increase globally, with a predicted 693 million patients in 2045 (Cho et al., 2018). The resulting socioeconomic and health system burdens are felt by all countries with large DM populations (Magliano et al., 2019). The increase is linked to unhealthy lifestyle behaviors such as lack of exercise, a high-calorie diet, and stress (Kolb & Martin, 2017). However, most people do not realize that the development of type 2 diabetes begins with impaired glucose tolerance and prediabetes (Dall et al., 2014; Stull, 2016). Almost 50% of patients in the population had not previously been diagnosed (Cho et al., 2018). Given this situation, many efforts are needed to decrease the morbidity and mortality of DM. The long-term treatment of DM may impose a heavy burden on both patients and the government.
Because DM requires long-term treatment, quality of life is one of the treatment outcomes for this and other chronic diseases. However, many factors may influence a patient's quality of life, such as demographic characteristics (Wang et al., 2011). Predicting the quality of life from large data sets is therefore important for achieving effective treatment. Artificial intelligence (AI) is developing rapidly, especially in the medical world (Abadir et al., 2020). It is a branch of computer science that can handle the complexity of analyzing medical data (Ramesh et al., 2004). In the context of health education, AI plays a critical role in determining the diagnosis and prognosis of disease (Han et al., 2019). Models for diagnosis and prognosis may use artificial neural networks (ANN), with logistic regression as the standard for comparison. ANN is one of the effective tools for extracting information and knowledge from a large data set.
Patients' quality of life must be measured to monitor their condition after they have taken medication for a long time. If patients do not adhere to their medication, the treatment target will not be achieved. Furthermore, patients may experience complications, which will cause their quality of life to deteriorate. Thus, specific or generic instruments are needed to measure quality of life (Iqbal et al., 2017). The EQ-5D instrument, with five dimensions, namely mobility, self-care, daily activities, pain, and anxiety (Purba et al., 2017), is a simple and practical instrument used to assess the quality of life in DM patients (Arifin et al., 2020). Machine learning approaches for predicting changes in the quality-of-life dimensions are still under development. Several platforms, such as Orange, KNIME, WEKA, and IBM SPSS Modeler, have their respective advantages and disadvantages (Hosseini & Sardo, 2021). Orange is an open-source platform that offers more interactive data analysis and visualization (Demšar et al., 2013). This study aims to predict the dimensions of the quality of life of prediabetic individuals through modeling in the Orange application. This approach may help guide doctors in making informed decisions on how to manage individuals efficiently so that they do not progress to type 2 DM.
Previous studies have used several models to predict quality of life, namely two-year quality of life after surgery in breast cancer patients and quality of life after laparoscopic cholecystectomy. Both studies stated that the ANN model was more accurate in predicting the quality of life (Tsai et al., 2012). This study aims to determine the models for predicting the quality-of-life domains of the general population. To our knowledge, this is the first such study conducted in Indonesia, and it will be helpful for predicting quality of life based on demographic, medication, and clinical data already available in the patient's medical record.

MATERIALS AND METHOD

Materials
The study instrument consisted of a socio-demographic sheet, the Indonesian version of FINDRISC, and the EQ-5D-5L. FINDRISC consists of 8 questions covering age, body mass index (BMI), waist circumference, daily physical activity, daily consumption of vegetables and fruits, history of taking antihypertensive drugs, history of high blood sugar on testing, and family history of diabetes (heredity). Each answer carries a different score, and the maximum total score is 26. A higher score indicates a higher risk of DM (Lindström & Tuomilehto, 2003). The original version of FINDRISC has been translated, adapted, and validated for the Indonesian population, with a total score range of 0-26.
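To make the scoring concrete, the following is a minimal sketch of how a FINDRISC total could be computed. It assumes the item cut-offs and point values of the original Finnish instrument (Lindström & Tuomilehto, 2003); the validated Indonesian version may use different cut-offs, and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FindriscAnswers:
    """Hypothetical container for the eight FINDRISC items."""
    age: int                    # years
    bmi: float                  # kg/m^2
    waist_cm: float
    is_male: bool
    active_daily: bool          # at least 30 minutes of physical activity per day
    eats_veg_daily: bool        # vegetables or fruit every day
    on_bp_medication: bool      # has taken antihypertensive drugs
    high_glucose_history: bool  # ever found to have high blood glucose
    family_history: str = "none"  # "none", "second_degree", or "first_degree"


# Assumption: point values of the original FINDRISC, not the Indonesian adaptation.
FAMILY_POINTS = {"none": 0, "second_degree": 3, "first_degree": 5}


def findrisc_score(a: FindriscAnswers) -> int:
    """Sum the eight item scores; the result lies between 0 and 26."""
    score = 0
    # Age: 45-54 -> 2, 55-64 -> 3, 65 and over -> 4
    if a.age >= 65:
        score += 4
    elif a.age >= 55:
        score += 3
    elif a.age >= 45:
        score += 2
    # BMI: 25-30 -> 1, above 30 -> 3
    if a.bmi > 30:
        score += 3
    elif a.bmi >= 25:
        score += 1
    # Waist circumference uses sex-specific cut-offs
    low, high = (94, 102) if a.is_male else (80, 88)
    if a.waist_cm > high:
        score += 4
    elif a.waist_cm >= low:
        score += 3
    score += 0 if a.active_daily else 2
    score += 0 if a.eats_veg_daily else 1
    score += 2 if a.on_bp_medication else 0
    score += 5 if a.high_glucose_history else 0
    score += FAMILY_POINTS[a.family_history]
    return score
```

Under these assumptions, a respondent with the highest-risk answer on every item reaches the maximum of 26, which matches the score range stated above.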
The EQ-5D-5L was used to assess the quality of life and consists of 2 parts (EuroQol Research Foundation, 2019; Herdman et al., 2011). The first part is a descriptive system comprising 5 dimensions, namely mobility, self-care, daily activities, pain, and anxiety. Each dimension has 5 levels: no problems (1), mild (2), moderate (3), severe (4), and extreme problems (5). The second part is the EQ Visual Analogue Scale (EQ-VAS), a single page showing a vertical scale that resembles a thermometer. Participants were asked to rate their current state of health on this scale with a score from zero to one hundred (100 indicates the best health).
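As an illustration of the descriptive system, the sketch below represents one respondent's answers as a five-digit profile (one digit per dimension, in the order listed above) and validates it. The function and constant names are hypothetical. Converting a profile into a utility index requires a country-specific value set, which is not reproduced here.

```python
# Dimension order follows the EQ-5D-5L descriptive system described above.
DIMENSIONS = ("mobility", "self_care", "usual_activities", "pain", "anxiety")
LEVELS = {1: "no problems", 2: "mild", 3: "moderate", 4: "severe", 5: "extreme"}


def parse_profile(profile: str) -> dict:
    """Turn a profile such as '11221' into a {dimension: level} mapping."""
    if len(profile) != len(DIMENSIONS) or any(ch not in "12345" for ch in profile):
        raise ValueError("profile must be five digits, each between 1 and 5")
    return dict(zip(DIMENSIONS, (int(ch) for ch in profile)))


def describe(profile: str) -> dict:
    """Replace numeric levels with their verbal labels."""
    return {dim: LEVELS[level] for dim, level in parse_profile(profile).items()}
```

For example, `describe("11221")` reports mild problems in usual activities and pain and no problems in the other three dimensions.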

Methods
The study used a cross-sectional design. The inclusion criteria were people aged 18 years and above who had never been diagnosed with diabetes mellitus (either type 1 or type 2 DM), had fasted for at least 8 hours, and were willing to sign an informed consent form after receiving an explanation. The exclusion criterion was the use of drugs that affect blood glucose levels, such as thiazides, beta-blockers, and steroids. This study was approved by the Ethics Committee of the Faculty of Dentistry, Universitas Gadjah Mada, with ethical suitability letter number 0095/KKEP/FKG-UGM/ES/2019 dated April 25, 2019.
Data were collected from a variety of public locations, such as offices and open fields where members of the community exercise, in 3 (three) Indonesian areas: (i) Banggai Laut Regency, Central Sulawesi, (ii) Malang City, East Java, and (iii) Yogyakarta. The study locations were selected based on the number of patients with type 2 DM in the province, as well as the researchers' access to data. The sample size was based on the total number of adults in the three regions (Anonymous, 2021a, 2021b, 2021c). We calculated the sample size using the OpenEpi one-group proportion method, and the minimum sample size was 1084 at a 99.9% confidence level.
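The calculation for a single proportion with a finite population correction can be sketched as follows. This is an illustration of the general formula, not a reproduction of the study's calculation: the anticipated proportion `p`, the absolute precision `d`, the design effect, and the population size used in the example are assumptions, since the source does not report them.

```python
import math
from statistics import NormalDist


def sample_size_proportion(population: int, confidence: float = 0.999,
                           p: float = 0.5, d: float = 0.05,
                           deff: float = 1.0) -> int:
    """Minimum sample size for estimating one proportion.

    n = deff * N * p * (1 - p) / ((d^2 / z^2) * (N - 1) + p * (1 - p))
    where z is the two-sided standard normal quantile for `confidence`.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = deff * population * p * (1 - p) / (
        (d ** 2 / z ** 2) * (population - 1) + p * (1 - p)
    )
    return math.ceil(n)


# Hypothetical adult population of one million; p and d are assumed defaults.
n_required = sample_size_proportion(1_000_000)
```

With these assumed inputs the result is on the order of one thousand participants, the same order of magnitude as the minimum reported above; the exact figure depends on the population size and precision actually entered.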
Participants were males and females over the age of 17 years who had never been diagnosed with type 2 DM. They were informed about the study objectives and benefits and were given the opportunity to ask questions. They then indicated their willingness to participate by signing the consent form provided.
For prospective participants from a specific agency or organization, the approval process was conducted approximately one week before data collection. Participants were contacted 1 (one) day before data collection and asked to fast for at least 8 hours, starting from 9 p.m. or 10 p.m. until about 6 a.m. or 7 a.m. the following day. During fasting, they were only allowed to drink water.
On the day of data collection, all participants filled out a form with their demographic information (according to their ID card) and then had their height, weight, and waist circumference measured and a blood sample taken by a trained doctor or nurse. Blood sugar was measured using the EasyTouch® GCU (Bioptik Technology, Inc., Miaoli County, Taiwan).

Data Analysis
The AI analysis used machine learning in Orange® with three predictive models, namely Logistic Regression, Neural Network, and Support Vector Machine (SVM). The models were evaluated using sensitivity, precision, accuracy, and the Area Under the Curve of the Receiver Operating Characteristic (AUC-ROC). These models were chosen because they are among the most widely used in AI applications in the health sector.
The neural network was included as a nonlinear model. Each model was evaluated with the same data partitioning and the same number of repetitions so that the models could be compared (Xiao et al., 2019). Random Forest (RF) and SVM are used to evaluate a set of predictors to predict outcomes (Afzali et al., 2019).
The independent variables were the demographic and clinical characteristics, namely age, gender, BMI, abdominal circumference, fasting blood glucose, and FINDRISC score, and the dependent variables were the domains of the EQ-5D-5L. The AUC-ROC, accuracy, F1, precision, and recall were computed with the formulas implemented in ORANGE Data Mining (Test and Score) (Anonymous, 2021d; Demšar et al., 2013). The AUC-ROC measures the ability of the classifier to distinguish between classes. A higher AUC-ROC indicates a better model, one that separates the positive and negative classes more reliably. The ROC curve is an evaluation tool for binary classification models. Accuracy is the proportion of correctly classified examples. F1 is the weighted harmonic mean of precision and recall. Precision is the proportion of true positives among the examples classified as positive, and recall is the proportion of true positives among all positive examples (Anonymous, 2021d).
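The definitions above can be written out directly from the four cells of a binary confusion matrix. The sketch below is a plain-Python illustration with hypothetical counts; it does not reproduce how Orange aggregates these values across the levels of a multi-class target.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, and F1 from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives,
    and true negatives for one target class.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only.
example = binary_metrics(tp=40, fp=10, fn=20, tn=30)
```

In this example precision (0.8) exceeds recall (about 0.67), so F1 falls between the two, closer to the lower value.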

RESULT
A total of 1428 respondents were willing to fill out the questionnaire, with the distribution shown in Table 1. The table describes demographic characteristics such as Body Mass Index (BMI), fasting blood glucose, abdominal circumference, FINDRISC total score, age, and gender. The mean age of the respondents was 42.06 years, and they were predominantly female (60.29%). The mean BMI was 24.54 kg/m² (overweight), the mean fasting blood glucose was 97.49 mg/dL (normal), and the mean abdominal circumference was 87.40 cm for men and 85.11 cm for women (above normal). The mean FINDRISC total score was 7.03, and the mean VAS score was 0.87.

Table 2 shows the proportion of levels in each domain. Most of the respondents stated that they had no problems with mobility, self-care, and daily activities. However, 27% and 17% of the respondents had mild discomfort and mild depression, respectively.

Figure 1 shows the prediction workflow in ORANGE Data Mining. The data were arranged in a comma-separated values (CSV) file. We then used Edit Domain to set some variables as categorical, and Impute to replace unknown values. To assign the features and the target variable, we ran Select Columns. The workflow then ran the models, followed by Predictions, Test and Score, Confusion Matrix, and ROC Analysis. The confusion matrix compares the predicted and actual classes, and Predictions shows the predicted value for each individual record.

Figure 1. Data processing flow using ORANGE Data Mining. SVM: Support Vector Machine; kNN: k Nearest Neighbor; ROC: Receiver Operating Characteristics
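The imputation and column-selection steps of the workflow can be sketched outside Orange as follows. This assumes that unknown values are replaced with the column mean (numeric) or the most frequent value (categorical); the source does not state which imputation method was chosen, and the field names in the example are hypothetical.

```python
from statistics import mean, mode


def impute(rows, numeric_cols, categorical_cols):
    """Return copies of `rows` with None replaced by the mean or the mode."""
    filled = [dict(r) for r in rows]
    for col in numeric_cols:
        average = mean([r[col] for r in rows if r[col] is not None])
        for r in filled:
            if r[col] is None:
                r[col] = average
    for col in categorical_cols:
        most_common = mode([r[col] for r in rows if r[col] is not None])
        for r in filled:
            if r[col] is None:
                r[col] = most_common
    return filled


def select_columns(rows, features, target):
    """Split records into a feature matrix X and a target vector y."""
    X = [[r[f] for f in features] for r in rows]
    y = [r[target] for r in rows]
    return X, y


# Hypothetical records; None marks an unknown value.
records = [
    {"age": 40, "gender": "F", "mobility": 1},
    {"age": None, "gender": "F", "mobility": 2},
    {"age": 50, "gender": None, "mobility": 1},
]
clean = impute(records, numeric_cols=["age"], categorical_cols=["gender"])
X, y = select_columns(clean, features=["age", "gender"], target="mobility")
```

Each EQ-5D-5L domain would be selected as the target in turn, with the same feature columns.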
The results of the prediction analysis are shown in Table 3, in which the neural network model has the greatest precision and the highest AUC value. This model is therefore more appropriate than the others for predicting the quality of life across the five domains. The results of the ROC analysis on the five domains under "mild" conditions are shown in Figure 2. Both results indicate that the neural network model has the best performance among the models. The ROC results were similar for the other conditions, namely no limitations, moderate, and severe.
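A ROC curve for one level of a domain, such as "mild", treats that level as the positive class and all other levels as negative. The sketch below shows how the curve points and the AUC could be computed from predicted probabilities; the scores and labels are made up for illustration, and the function names are hypothetical.

```python
def roc_points(scores, labels, target):
    """(FP rate, TP rate) pairs for `target` versus all other classes.

    `scores` are predicted probabilities of the target class. Both the
    target class and at least one other class must be present in `labels`.
    """
    positives = sum(1 for label in labels if label == target)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Records with tied scores cross the threshold together
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == target:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up predictions for six respondents.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = ["mild", "mild", "none", "mild", "none", "none"]
curve = roc_points(scores, labels, target="mild")
area = auc(curve)
```

An area of 1.0 means every "mild" respondent is ranked above every other respondent, while 0.5 corresponds to random ranking.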
We did not evaluate the models on a separate test set. However, we randomly split the data into two subsets of 500 and 918 records and compared the mean and standard deviation of the EQ-5D-5L utility index between them. The mean utility index was 0.946 (SD: 0.08) for the 500-record subset and 0.938 (SD: 0.10) for the 918-record subset. There was no significant difference in the utility index between the two subsets (p > 0.05).
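The comparison of the two subsets can be checked from the reported summary statistics alone. The sketch below assumes a Welch two-sample test with a normal approximation for the p-value, which is reasonable for samples of this size; the source does not state which test was actually used.

```python
import math
from statistics import NormalDist


def welch_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Test statistic and two-sided p-value from means, SDs, and sample sizes."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    statistic = (mean1 - mean2) / standard_error
    # Normal approximation to the t distribution (assumption: large samples)
    p_value = 2 * (1 - NormalDist().cdf(abs(statistic)))
    return statistic, p_value


# Summary statistics as reported for the 500- and 918-record subsets.
statistic, p_value = welch_from_summary(0.946, 0.08, 500, 0.938, 0.10, 918)
```

Under these assumptions the p-value comes out above 0.05, consistent with the reported absence of a significant difference.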

Figure 2. ROC analysis of the five EQ-5D-5L domains under "mild" conditions. Panels: Mobility, Self-care, Usual activities, Pain, Anxiety. FP rate: False Positive rate; SVM: Support Vector Machine

DISCUSSION
Overall, this study shows that the utility index of the EQ-5D-5L domains can be predicted using the neural network method based on age, gender, BMI, abdominal circumference, fasting blood glucose, and FINDRISC scores. The neural network is still widely used in machine learning and AI. The model is inspired by the way a neuron works: it receives various signals and transmits them along the axon as an action potential. A neural network model likewise converts multiple inputs into a single output, reducing a complex pattern to one simple decision (Kriegeskorte & Golan, 2019).
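The conversion from multiple inputs to a single output can be illustrated with a forward pass through a very small network. This is a toy sketch: the weights are arbitrary, the layer sizes and the ReLU and sigmoid activation functions are assumptions, and none of it reflects the configuration used in Orange.

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def forward(features, hidden_weights, hidden_biases, output_weights, output_bias):
    """One hidden layer followed by a single sigmoid output unit."""
    # Each hidden unit forms a weighted sum of all inputs, then applies ReLU
    hidden = [
        relu(sum(w * x for w, x in zip(weights, features)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    # The output unit combines the hidden activations into one value in (0, 1)
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


# Arbitrary illustrative values: three scaled inputs, two hidden units.
probability = forward(
    features=[0.5, -1.0, 2.0],
    hidden_weights=[[0.2, 0.4, -0.1], [-0.3, 0.1, 0.5]],
    hidden_biases=[0.0, 0.1],
    output_weights=[1.0, -1.0],
    output_bias=0.0,
)
```

Training adjusts the weights and biases so that this single output matches the observed level of a domain as closely as possible.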
Most of the subjects in our study had no problems in any domain of the EQ-5D-5L, and only a small proportion reported a severe condition in some domains. This shows that communities in the three areas of Indonesia can live normally. However, around 17% and 27% of the participants experienced mild depression and mild pain, respectively. The government must be aware of this situation, because a mild condition can progress to a severe level if no preventive action is taken. We suggest increasing health promotion on topics related to mental health and pain management in particular diseases.
A previous study conducted in China stated that low levels of stress, social support, and physical activity affect the quality of life of active workers in the country (Xiao et al., 2019). Likewise, a study of elderly subjects found that quality of life is strongly influenced by physical and psychological conditions, family support, and financial aid (Risal et al., 2020). Furthermore, longitudinal research conducted in Canada showed that eleven variables, including age, final education, and race, affect the quality of life. Among these variables, social support and coping processes significantly affect the quality of life (Caron et al., 2019). Government efforts are therefore needed to create a better environment and better social habits. For a more specific group, namely patients with DM, a previous study has shown that quality of life is strongly influenced by knowledge and self-management of DM. Self-management includes independent blood sugar monitoring and diet control (Kueh et al., 2017). Government support is therefore needed to establish intervention programs that meet the needs of patients with DM and improve their quality of life.
Machine learning algorithms build complex models and make more accurate decisions based on relevant data. With a sufficient amount of data, machine learning is expected to perform well, and this study used a fairly large data set for that reason. Of the models used, logistic regression had the worst predictive ability, probably because the data are non-linear. Our study finds that the neural network model can be used to predict the EQ-5D-5L domains. By applying this model to community or characteristic data, governments may determine the level of the EQ-5D-5L domains and make efforts to improve the quality of life of the community in their area.
The neural network is superior to linear regression for describing systems. It extends the predictive range of linear regression by replacing the identity function with nonlinear activation functions. In a neural network model, overfitting can occur when the model describes random error in addition to the underlying association. To avoid overfitting, a study must use a testing set to control the criteria for determining the end of training. As a limitation, our study did not use a testing set separated from the training set. This was the first study conducted in Indonesia using ORANGE Data Mining tools, and additional variables are still needed to fully predict the quality of life of the general population.

We conducted our study in three areas of Indonesia chosen to represent specific types of region. Banggai Laut is an island regency, where particular transportation is needed to reach the islands. Yogyakarta is an education city with many schools and students from other provinces. Malang City represents the tourism cities of Java Island and also has many visitors. The study locations need more variation, for example in cultural and social background, which may influence the quality of life. The FINDRISC and EQ-5D-5L questionnaires ask about the respondent's current condition, so we expect recall bias to be minimal. A further limitation is the incomplete demographic data; including more complete demographic data should make the predictions more optimal.
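The separate testing set discussed above as a limitation could be created with a stratified holdout split, which keeps the proportion of each domain level similar in both parts. The sketch below is one possible way to do this and was not part of the study; the 30% test fraction, the seed, and the example labels are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_indices, test_indices), stratified by class label."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Hypothetical target: 70 respondents with no problems, 30 with mild problems.
levels = ["none"] * 70 + ["mild"] * 30
train_idx, test_idx = stratified_split(levels)
```

The models would then be fitted on the training indices only, and the AUC, precision, and accuracy would be reported on the held-out indices.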

CONCLUSION
Based on the AUC value, precision, accuracy, and the results of the ROC analysis, the neural network model is the best for predicting the utility index of the EQ-5D-5L domains from demographic data and the FINDRISC questionnaire. Further research should include other demographic data, such as marital status, final education, and environmental conditions, to predict the quality of life of the community in general.