A hybrid analysis model supported by machine learning algorithm and multiple linear regression to find reasons for unemployment of programmers in Iraq

Unemployment is one of the most important problems faced by most countries of the world and one of the most intractable in developing countries. In Iraq it is of particular importance because of its high rates. The problem is serious in itself because it results from mismanagement and from the structure of the economy, yet despite its importance it has not been carefully monitored. Existing studies and strategies analyze the causes of unemployment using traditional mathematical and statistical methods. This research proposes a method that uses a machine learning technique, together with multiple linear regression, to find the factors that contribute to the problem.


INTRODUCTION
Unemployment is currently one of the main problems facing most of the world and one of the most intractable problems in developing countries, particularly the Arab countries. The issue is of special importance in Iraq [1], given the high rates caused by the restructuring of the economy and the many problems resulting from it. Despite its importance, unemployment has not been accurately monitored. The evidence for this is that official statistics contradict one another and also contradict the figures published by Arab and international organizations. While World Bank statistics indicate that the unemployment rate in Iraq exceeds 50% [2], the results of a survey conducted by the Ministry of Planning and Development Cooperation in cooperation with the Ministry of Labor and Social Affairs indicated that the rate is 1.28%, and informal organizations have put it at 40-60% [3], so the reported numbers conflict.
Economic theories indicate that unemployment among 15% of the capable, job-seeking workforce signals a real crisis if the government, in cooperation with the private sector and international and civil organizations, does not adopt practical solutions to confront it [4]. Given the importance of unemployment and its repercussions, we chose a specific group of the unemployed, namely programmers, and adopted a data analysis mechanism to determine the most important causes and problems that lead to unemployment. Because modern science increasingly aims to solve societal problems, the method used here analyzes the data by combining artificial intelligence techniques with statistical processes [5]. Data mining has attracted a lot of attention in the research community over the past decade, in an attempt to develop scalable algorithms that adapt to an increasing amount of data in the search for meaningful knowledge patterns [6,7].
TELKOMNIKA Telecommun Comput El Control  A hybrid analysis model supported by machine learning algorithm and … (Mohamed A. Abdulhamed) 445
Packages of algorithms and software have grown significantly over the past decade. Data mining approaches can be divided into two basic types [8,9]:
− Descriptive exploration, which relies on reorganizing data to extract the models in it and includes association rules, sequence discovery, and aggregation.
− Predictive exploration, which tries to find the best predictions based on the data and includes classification, chronology, and prediction [10,11].
This study includes a description of demographic data for the respondents, who are graduates of colleges of computer science in Basra Governorate, based on four themes for studying unemployment: university education, investment, administrative factors, and the graduate's personality.
The data are then analyzed using multiple linear regression in SPSS to determine the effect of the demographic variables, as independent factors, on the dependent variable, which is the skill of programmers in using information technology. Descriptive exploration is then used to find the most important factors affecting the unemployment of programmers with the Apriori algorithm, one of the distinguished data mining algorithms for finding association relationships, to study the connections between the study variables using Weka. Finally, we discuss the most important conclusions obtained in this article [12,13].
Previous studies
Unemployment is a phenomenon of great interest, and many studies are related to it. The study in [14] used panel data analysis methods that account for cross-section dependency, which give more reliable results, and concluded that the impact of shocks on the unemployment rate is permanent. In [15], the Apriori algorithm was implemented in Visual Basic as a tool for determining consumer purchases; with a minimum support of 15% and a confidence of 50%, 87 rules were generated. In [16], Weka was used because it provides nearly all features of data mining techniques, and the rules generated by Apriori supported market strategies for improving product quantities. In the present article, Weka is used to find the list of possible itemsets. In [17], machine learning algorithms were applied to several scenarios of the traffic congestion problem in Greece. In particular, a comparative test was conducted using four of the most common machine learning methods, support vector regression (SVM), neural networks (NN), random forests (RF), and multiple linear regression (LR), to predict the traffic status. The mean absolute error (MAE) obtained was 6.25, 6.57, 6.44, and 6.90 for the SVM, NN, RF, and LR methods, respectively.
The study in [18] was conducted in Turkey on individuals who previously had a job and then became unemployed. Survey data were used to predict unemployment with machine learning algorithms, and the results were compared with logistic regression analysis, as an econometric approach, and with a shallow neural network. The results showed the superiority of the machine learning algorithms over logistic regression and the shallow NN, with an accuracy rate of 67% for the machine learning algorithm. In [19], phone records were combined with a household survey to predict individual employment status. Machine learning models were used to predict the proportions for approximately 18 occupations in South Asia, and the prediction accuracy was 70.4% using deep neural network models. In [20], a model was proposed to predict bid values for company tenders. Data from approximately 26 tenders were used to predict the price of the winning tender with a linear regression model, which predicted values with an error rate of 3% and a coefficient of determination R2 = 0.88167.
In [21], a model based on smart meters was built to collect data on the state of electric power and its effect on several factors, so that an intelligent model using the meter readings and machine learning could predict who is unemployed.
The importance of this research lies in:
− Reducing the main causes of unemployment by using data mining techniques, which help to find the correlations between them and to extract knowledge patterns that can be used to address the problem.
− Reducing the government's effort in restructuring the economy and supporting the building of future plans to counteract this phenomenon.
− Increasing the job opportunities available to graduates by relying on relations and indicators that define the causes practically and accurately, so that they can be overcome.

LINEAR REGRESSION ANALYSIS
Regression analysis is a statistical method used to analyze the relationship between one or more independent variables and a dependent variable. It is mostly used for three purposes [12,22]:
− Description: the linear regression model is used to describe the shape of the relationship between the independent and dependent variables.
− Estimation and prediction: the linear regression model is used to predict the values of the dependent variable that correspond to actual values of the independent variables. Estimation and forecasting are among the most important applied uses of regression analysis.
− Control: the change in the values of the dependent variable is interpreted in terms of the change in the values of the independent variable, treating the independent variable as controllable.
Linear regression analysis is divided into two parts: simple linear regression and multiple linear regression. In this study, we rely on multiple regression analysis [23,24]. Assume that $y$ denotes the dependent variable, that $(x_1, x_2, \dots, x_k)$ denote $k$ independent variables, and that the number of observations is $n$. Then the $i$-th observation $y_i$, $i = 1, 2, \dots, n$, can be expressed as a linear function of the independent variables as follows:
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + e_i \tag{1}$$
where $(\beta_0, \beta_1, \beta_2, \dots, \beta_k)$ are the regression coefficients and $e_i$ is the random error of observation $i$. The equations can be written in matrix form as shown in (2):
$$Y = X\beta + e \tag{2}$$
Among the most important hypotheses of the regression model is the independence between $(x_1, x_2, \dots, x_k)$ and the random error $e_i$, such that (3) holds, where $\Sigma$ represents the variance matrix and is expressed by (4).
Applying the method of ordinary least squares (OLS) to regression model (2), which contains $(k + 1)$ parameters, gives the estimated vector $\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_k)'$, where [24]:
$$\hat{\beta} = (X'X)^{-1} X'Y \tag{5}$$
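As an illustration of the least-squares step, the following minimal sketch solves the normal equations X'X b = X'y for a small data set. It is not the SPSS procedure used in the study: the function names `solve_linear_system` and `ols_fit` and the toy data are assumptions introduced here, and plain Python is used only so that the sketch has no dependencies.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b] so that row operations act on both sides.
    m = [list(map(float, a[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular; the predictors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for j in range(col, n + 1):
                m[row][j] -= factor * m[col][j]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols_fit(rows, y):
    """Return [b0, b1, ..., bk] that minimise the sum of squared errors."""
    design = [[1.0] + list(map(float, r)) for r in rows]  # intercept column
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical toy data generated from y = 1 + 2*x1 + 3*x2 without noise.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
beta_hat = ols_fit(rows, y)  # approximately [1.0, 2.0, 3.0]
```

Because the toy data contain no noise, the estimated coefficients recover the generating values up to rounding; with real questionnaire data the residuals would not vanish.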

APRIORI ALGORITHM
This algorithm is one of the distinctive algorithms for finding association relationships in data mining. It finds interesting relationships between variables in large databases, with the purpose of discovering and defining rules in them. The algorithm relies on two metrics to determine the strength of an association: confidence and support [16]. For more detail, see Algorithm 1 [25], which illustrates the steps of the Apriori algorithm. The first pass of the algorithm simply counts item occurrences to determine the large 1-itemsets. Each later pass k consists of two stages. Firstly, the large itemsets L_{k-1} found in pass (k-1) are used to generate the candidate itemsets C_k, using the Apriori candidate generation function shown in Algorithm 2. Secondly, the database is scanned and the support of the candidates in C_k is counted [7,11]. The Apriori generation function takes as its argument L_{k-1}, the set of all large (k-1)-itemsets, and returns a superset of the set of all large k-itemsets [20,21]. In this algorithm, L_k represents the set of large k-itemsets with minimum support, while C_k represents the set of candidate k-itemsets.
Algorithm 2. The Apriori-gen function
insert into C_k
select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
from L_{k-1} p, L_{k-1} q
where p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1};
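The candidate generation step can be sketched in Python as follows. This is a minimal illustration rather than the Weka implementation: the function name `apriori_gen` and the integer item identifiers are assumptions, and the sketch adds the usual prune step, which discards any candidate that has a (k-1)-subset that is not large.

```python
from itertools import combinations


def apriori_gen(large_prev):
    """Generate candidate k-itemsets from the large (k-1)-itemsets."""
    prev = sorted(tuple(sorted(s)) for s in large_prev)
    prev_set = set(prev)
    candidates = []
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            # Join step: same first k-2 items, and p's last item < q's last item.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                cand = p + (q[-1],)
                # Prune step: every (k-1)-subset of the candidate must be large.
                subsets = combinations(cand, len(cand) - 1)
                if all(sub in prev_set for sub in subsets):
                    candidates.append(cand)
    return candidates


# Hypothetical large 3-itemsets over items identified by integers.
large_3 = [{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}]
candidates_4 = apriori_gen(large_3)
```

In this toy input, the join produces (1, 2, 3, 4) and (1, 3, 4, 5); the second is pruned because its subset (1, 4, 5) is not among the large 3-itemsets. The same function works with string items such as "UE4=Agree".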

RESEARCH METHOD
In our proposed research, we used one of the machine learning algorithms to find the most influential reasons for programmer unemployment in Iraq. We first prepared a questionnaire covering a group of main axes that have a direct impact on unemployment in general. The survey collected nearly 100 samples, nine of which were excluded during the primary treatment process. The diagram in Figure 1 shows the steps applied to the data to obtain the results.

Description of the data
The data for our research were collected with a questionnaire built in Google Forms and published throughout Iraq. Over a period of 7 days, ninety varied responses were collected. The questionnaire is divided into two sections. The first section includes demographic information and some of the most frequently asked questions. The second section includes four axes, each with four paragraphs. To test the stability and reliability of the questionnaire, we calculated Cronbach's alpha, which was 0.775, a good value that supports accepting the questionnaire and the results of the research. The research included several variables related to graduates of colleges of computer science.
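For reference, Cronbach's alpha for k items is k/(k-1) multiplied by one minus the ratio of the summed item variances to the variance of the respondents' total scores. The sketch below computes it with the standard library; the function name `cronbach_alpha` and the Likert answers are hypothetical and are not the study's data.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha; responses holds one list of item scores per respondent."""
    k = len(responses[0])                      # number of questionnaire items
    items = list(zip(*responses))              # scores regrouped per item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(r) for r in responses])
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Hypothetical answers of five respondents to three Likert items.
responses = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [1, 2, 1],
    [3, 3, 3],
]
alpha = cronbach_alpha(responses)
```

Sample variances are used for both the items and the totals, so the ratio does not depend on the choice between sample and population variance.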

Preprocessing stage
As in any data analysis process, we first carry out some primary treatments to improve the raw data collected. This stage involved the following basic tasks:
− To facilitate the analysis, the axes and their questions are coded into short forms, as illustrated in Figure 2.
− Figure 3 describes the coding of the features presented by the questionnaire to find reasons for the unemployment of Iraqi programmers. The attributes were divided into four important axes, University Education (UE), Investment (IN), Administrative factors (AD), and the Character of the graduate (CH), depending on the reality of life in Iraq.
− We then transformed the data into the ARFF format, which is the file type that Weka Explorer can process.
− In the third pre-treatment step, we used filters that convert the data in different ways. Of the two types of filters, supervised and unsupervised, we chose an unsupervised filter called Numeric To Nominal, which converts numeric values to nominal values, because association rules in Weka support only nominal values. Figure 3 illustrates the output of applying the Numeric To Nominal filter to all attributes.
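The coding and conversion steps can be sketched as follows. This is only an illustration in the spirit of the preprocessing described, not Weka's own filter: the label set in `LIKERT_LABELS`, the relation name, and the two coded rows are assumptions (only the label "Agree" and the attribute names UE4, AD3, AD4, and CH4 appear in the results).

```python
# Hypothetical mapping from numeric Likert codes to nominal labels.
LIKERT_LABELS = {1: "Disagree", 2: "Neutral", 3: "Agree"}


def numeric_to_nominal(row):
    """Replace each numeric code in a response row with its nominal label."""
    return [LIKERT_LABELS[code] for code in row]


def to_arff(relation, attribute_names, rows):
    """Render nominal rows as ARFF text, declaring every attribute nominal."""
    labels = ",".join(LIKERT_LABELS.values())
    lines = [f"@relation {relation}", ""]
    for name in attribute_names:
        lines.append(f"@attribute {name} {{{labels}}}")
    lines += ["", "@data"]
    lines += [",".join(row) for row in rows]
    return "\n".join(lines) + "\n"


attribute_names = ["UE4", "AD3", "AD4", "CH4"]
coded_rows = [[3, 3, 3, 3], [1, 2, 3, 1]]      # hypothetical coded answers
nominal_rows = [numeric_to_nominal(r) for r in coded_rows]
arff_text = to_arff("programmers", attribute_names, nominal_rows)
```

Declaring each attribute with an explicit label set, rather than as numeric, is what allows an association rule learner that only accepts nominal values to read the file.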

First analysis based on multiple linear regression
Before using regression analysis, we present the demographic information of the respondents: gender, age, graduation year, and type of work. The research includes studying the factors affecting the unemployment of programmers. Table 2 shows that the most responsive category was the 23 to 27 age group, with a proportion of 0.844, and that the least responsive was the 38 to 42 age group, with a proportion of 0.01.
In Iraq, young people fall into three categories: workers in the government sector, employees in the private sector, and the unemployed. Table 3 details these categories; the largest, at 57%, is the unemployed. The data were also divided by graduation year, as shown in Table 4. The largest category, at 54%, is graduates of 2018-2019, while those who graduated before 2015-2016 made up 22% of the total respondents.
The variables of the regression model are defined as follows:
$Y_i$ = the skills of programmers in preparing advanced software; this is the dependent variable.
$X_{i1}$ = the gender of the programmer; an independent variable.
$X_{i2}$ = the age of the programmer; an independent variable.
$X_{i3}$ = the graduation year of the programmer; an independent variable.
$X_{i4}$ = the type of work that the programmer does after graduation; an independent variable.
The multiple linear regression model for the study is shown in (6):
$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \beta_4 X_{i4} + e_i \tag{6}$$
The multivariate regression was analyzed using the statistical analysis program SPSS, and the resulting fitted multiple linear regression equation is given in (7). The hypotheses of the multiple regression model for the study are formulated as follows:

Second analysis based on Apriori algorithm
In general, applying association rules involves two steps: first, finding all frequent itemsets in the dataset based on a minimum support; then, using a minimum confidence to construct the best rules from them. The ARFF file produced in preprocessing contains the information for each programmer graduate. We load the programmes.arff file into the Weka Explorer interface and apply the Apriori algorithm using the configuration in Table 5.
We use the Apriori algorithm to find the best association rules with minimum support = 10% and minimum confidence = 90%. The result obtained is illustrated in Figure 4. The parameters shown in Figure 4 can be explained as follows: (-N) is the required number of rules to extract, (-C) is the minimum confidence of a rule, (-D) is the delta by which the minimum support is decreased at each cycle, (-U) is the upper bound for the minimum support, and (-M) is the lower bound for the minimum support. Figure 5 shows some of the generated large itemsets for the associator model built by the Apriori algorithm on the full training set.
Association rules are represented as X -> Y, where the frequent itemsets are generated by the Apriori algorithm. The itemset X is called the antecedent and Y the consequent of the rule. The Apriori algorithm is controlled by two metrics, support and confidence. To clarify, let P(X) and P(Y) denote the number of tuples that contain the antecedent and the consequent, respectively, and let P(XY) = P(X U Y) denote the number of tuples that contain both X and Y. The best rule obtained by applying the Apriori algorithm to our dataset is illustrated in Figure 6. In this programmer dataset, the interest of each generated association rule can be calculated from the Weka results. In our analysis, only one association rule was obtained:
(UE4=Agree AD3=Agree AD4=Agree 11): P(X) = 11 itemsets
(CH4=Agree 10): P(Y) = 10 itemsets
(UE4=Agree AD3=Agree AD4=Agree 11 ==> CH4=Agree 10): P(XY) = P(X U Y) = 10
Figure 6. The best rule found in programmers' dataset
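As a worked check of the two controlling metrics, the sketch below computes the support and confidence of the single rule from the counts reported above, taking the 90 questionnaire responses as the total number of instances. The function name `rule_metrics` is an assumption made for illustration.

```python
def rule_metrics(n_total, count_x, count_xy):
    """Return (support, confidence) of a rule X -> Y from raw counts."""
    support = count_xy / n_total       # share of all instances with X and Y
    confidence = count_xy / count_x    # share of X-instances that also have Y
    return support, confidence


# Counts for UE4=Agree AD3=Agree AD4=Agree ==> CH4=Agree, with 90 instances.
support, confidence = rule_metrics(n_total=90, count_x=11, count_xy=10)

# The rule passes the thresholds used in the study (10% and 90%).
passes = support >= 0.10 and confidence >= 0.90
```

With these counts the support is 10/90, about 0.111, and the confidence is 10/11, about 0.909, so the rule clears both thresholds by a small margin.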

RESULTS AND DISCUSSION
This section summarizes the results obtained by applying the analysis techniques to the samples previously collected through a survey of unemployed graduates in Iraq. The association rules were created using the Apriori algorithm and compared with multiple linear regression. To clarify the results, we analyze them according to the methods described earlier, as follows:

Multiple linear regression
Using multiple linear regression analysis, the calculated value of F* is 0.721 at the significance level 0.05, while the tabulated F-value at degrees of freedom (k, n-k-1) is 2.49. Since F* is smaller than the tabulated value, the null hypothesis is accepted, in the sense that the independent variables, such as gender, age, and graduation year, have no significant effect on the skills of programmers working in the field of computer science. In particular, the year of graduation does not stand in the way of developing the skills of programmers.
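The overall F statistic compares the variation explained by the regression with the residual variation, F* = (SSR/k) / (SSE/(n-k-1)). The sketch below shows the computation on hypothetical toy data and then applies the decision rule to the two values reported above; the function name `overall_f_statistic` and the toy data are assumptions, and the fitted values are assumed to come from a least-squares fit with an intercept.

```python
def overall_f_statistic(y, y_hat, k):
    """F* = (SSR / k) / (SSE / (n - k - 1)) for a fit with k predictors."""
    n = len(y)
    mean_y = sum(y) / n
    ssr = sum((f - mean_y) ** 2 for f in y_hat)         # explained variation
    sse = sum((a - f) ** 2 for a, f in zip(y, y_hat))   # residual variation
    return (ssr / k) / (sse / (n - k - 1))


# Hypothetical toy data: y regressed on x = 1, 2, 3, 4 gives y_hat = 1.9 * x.
y = [2.0, 4.0, 5.0, 8.0]
y_hat = [1.9, 3.8, 5.7, 7.6]
f_star_toy = overall_f_statistic(y, y_hat, k=1)

# Decision rule with the values reported in the study.
f_star_reported, f_critical_reported = 0.721, 2.49
reject_null = f_star_reported > f_critical_reported  # False: H0 is kept
```

The null hypothesis is rejected only when F* exceeds the tabulated value, which is not the case for the reported 0.721 against 2.49.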

Apriori Algorithm result
This part shows the result of analyzing the programmer dataset with the Apriori algorithm. The results are given in Table 6, which lists the final output for five important measures used by this technique. The third column (equations) shows how the value in the last column (output) is calculated. For further clarification, Figure 7 shows the plot matrix calculated by Weka Explorer, which is a 2-D plot of the current working relation.
From the above output, it can be concluded that the most influential factors in the unemployment of programmers are those represented in the best rule obtained by applying the Apriori algorithm. Developing the capabilities of students during university education and transferring experience to them from outside Iraq has a clear impact within the university education axis. Within the administrative axis, the most influential aspects are the lack of a deliberate plan for admitting students to computer and information technology disciplines and the monopolization of positions by a specific category. Finally, the graduate's personality axis also has an impact on unemployment, because computer graduates accept any work offered to them out of financial need, even when it is far from their specialty.
In this article, multiple linear regression was first applied to the personal information in our dataset in order to use more properties. Using the F test, the calculated value is 0.721 at the significance level 0.05 and the tabulated F-value at the given degrees of freedom is 2.49, indicating that the null hypothesis is accepted and that the year of graduation does not stand in the way of developing the skills of programmers. Secondly, association rule mining was used to find hidden patterns, with the Apriori algorithm finding association rules between the attributes. Using 90 instances with 16 attributes, the minimum support and confidence were 10% and 90%, respectively.

CONCLUSION
This work presented a hybrid method of analysis. On the one hand, it relies on multiple linear regression, one of the most important methods used to analyze data. On the other hand, it uses one of the most important data mining methods based on machine learning to find the factors with the highest impact. The method determines patterns in the reasons why graduates of colleges of computer and information technology in Iraq are not employed, and these patterns are analyzed to explain the causes of this issue. The mining rule was applied to a questionnaire distributed to a group of graduates of those colleges in order to analyze the collected results. In future work, we seek to extend the questionnaire to more graduates and to add more factors. Other algorithms used for finding association rules, such as Eclat or FP-growth, can also be applied.