The classification of the modern Arabic poetry using machine learning



Introduction
Despite the many approaches to automatic classification for English and other languages, the Arabic language still needs a lot of research, especially in relation to Arabic poetry. This is due to a number of constraints in the language, including its difficulty and the need to master its rules when studying poetry. There is also a need for a full understanding of the theory of "Al Arud", which specializes in the study of Arabic poetry [1], whether as a regular text or poem, focused on the topic or on the effects [2]. Few studies have used sentiment analysis to classify Arabic texts [3]. In this study, we used Naïve Bayes (NB), Support Vector Machines (SVM), and Linear Support Vector Classification (SVC) for the classification task.
The next section of this paper reviews the related work, followed by an introduction to the four categories of modern Arabic poetry. After that, the dataset is presented, followed by the data pre-processing step, which has a direct effect on the accuracy of the classification process. The sixth and seventh sections focus on feature selection and the machine learning algorithms used. These sections are followed by those that discuss the methodology, results, and conclusions of the study.

State of the Art
Several methods have been used for the classification of emotions in English text. Some of these studies depended on keyword spotting, that is, on unambiguous words like "happy" and "sad" [4]. Lexical affinity, another effective approach in this field, depends on the emotion associated with arbitrary terms or words. In general, this method is better than the keyword spotting method, although it cannot be used as an independent model [5]. There are other methods which rely on a deep understanding of the language and semantics [5]. Reliance on psychological theory in determining desires, goals, and needs was one of the models used in classification [6]. Machine learning techniques have also been used to classify classical Arabic poetry by emotion [7]; that work classified Arabic poetry into Fakhr, Retha, Ghazal, and Heija. Polynomial networks were used in Arabic text classification [8]. Several classification algorithms have been used in the classification of Arabic text, such as SVM [8,9], NB [10], K-Nearest Neighbor (KNN) [11], Artificial Neural Network (ANN) [12], and the Rocchio feedback algorithm [13].

ISSN: 1693-6930, TELKOMNIKA Vol. 17, No. 5, October 2019: 2667-2674

Categories of Modern Arabic Poetry
Modern Arabic poetry in general consists of the following types [14]:
− Love poems: A poetic art used to express the feelings between lovers. The poet derives the meanings from his relationship with the subject, his outlook, the influence of the environment, and the reality of those feelings.
− Islamic (religious) poems: The poets benefited from the stories contained in the Holy Quran; they took the precepts, rulings, and semantics and employed them in their poetry, treating community issues and problems that spread in their country at the time.
− Social poems: Social poems aim to repair bad social conditions by diagnosing the problem, identifying its cause, and describing its resolution. The poets resort to encouragement and motivation when they want their people to contribute to promotion and progress and to avoid the ills and conditions that undermine the foundations of its renaissance.
− Political poems: This type of poetry expresses certain political orientations and the personal views of poets while preserving the way poetry is written and its literary and artistic values.

The Dataset
Arabic language research using Natural Language Processing (NLP) differs from English language research in terms of the number and size of the datasets used. Due to the limited number of freely available datasets in the Arabic language (which is an obstacle for researchers), most researchers rely on collections of data taken from magazines, news stations, and websites. Some researchers depended on Saudi newspapers [11]. In Arabic research, several approaches have divided the datasets into training and testing groups. In our work, the big problem is finding datasets for tuning and testing, because it is the first work to use machine learning for classifying modern Arabic poetry. We depended on the website for the datasets used to train and test the categories of modern Arabic poetry.

Data Pre-Processing
The Arabic language is difficult both in speaking and writing. [...]; the other is called consonant letters. Several kinds of diacritics are used in this language, such as "sukoon", "dammah", "kasra", "fatha", "tanween fatha", "tanween kasra", "tanween dammah", "shadde", and "mad". These short vowels give the correct pronunciation and meaning. Table 1 lists the short vowels and their pronunciations, and Table 2 shows words that have the same letters but different pronunciations and meanings.
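The role of diacritics can be illustrated with a minimal Python sketch. It assumes the text is Unicode, where the short-vowel marks occupy the code points U+064B to U+0652 (the "mad" sign sits outside this range and is not handled here). The helper names `strip_diacritics` and `diacritic_names` are hypothetical, and the two example words are chosen only for illustration; the paper does not state whether or how diacritics were processed.

```python
import unicodedata

# Assumption: the marks of interest are U+064B..U+0652
# (tanween forms, fatha, damma, kasra, shadda, sukun).
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_diacritics(word):
    """Return the word with the Arabic diacritic marks removed."""
    return "".join(ch for ch in word if ch not in DIACRITICS)


def diacritic_names(word):
    """List the Unicode names of the diacritics that appear in the word."""
    return [unicodedata.name(ch) for ch in word if ch in DIACRITICS]


# Two words built from the same letters (kaf, ta, ba), written with escapes.
kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"  # "he wrote"
kutiba = "\u0643\u064F\u062A\u0650\u0628\u064E"  # "it was written"

# The words differ as strings but share the same bare letters.
same_letters = strip_diacritics(kataba) == strip_diacritics(kutiba)
```

Both words reduce to the same three letters once the marks are removed, which is why text without diacritics can be ambiguous for a classifier.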
Arabic writing differs from writing that uses the Latin alphabet in that it runs from right to left. Some letters in Arabic also take several forms depending on the position of the character in the word. These features must be considered in this work, as shown in Table 3. The Arabic language has two grammatical genders, masculine and feminine, and each has different qualities and features in Arabic grammar. There are three number classes in the Arabic language: singular, dual, and plural, the last of which has two types (regular and broken). The Arabic language contains many ramifications in grammar. It is a very rich language, which makes it difficult to reach the required accuracy in the classification of modern Arabic poetry.
Pre-processing of data is important when building classification systems using machine learning, for the following reasons:
− It removes noise from the text used in the classification.
− It reduces the number of terms or characteristics on which we base our classification.
− It helps reduce the amount of memory required for the classification.
− It helps increase the accuracy of the classification.
We applied the following pre-processing steps to the data used in our work:
− Tokenization: We divided the data into tokens based on the recognition of delimiters such as punctuation, special characters, and white space.
− We removed non-Arabic terms, words, numbers, punctuation, and any other signs.
− The stop words like pronouns, prepositions, and conjunctions were also removed; we depended on the list adopted by Khoja and Garside [15,16].
− Stemming: The major aim of stemming is to shrink an inflated dataset. In Arabic, many words can be composed from the same stem. Thus, we can reduce the number of terms used in the dataset and the complexity of text classification. This also reduces the storage requirement of classification systems [17,18].
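The pre-processing steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the stop-word set and the affix lists are tiny hypothetical stand-ins (the paper relies on the Khoja and Garside list and does not name its stemmer), and removing diacritics before tokenization is an added assumption so that marked and unmarked spellings produce the same token.

```python
import re

# Hypothetical stop-word sample: "fi" (in), "min" (from), "ala" (on).
STOP_WORDS = {"\u0641\u064A", "\u0645\u0646", "\u0639\u0644\u0649"}

# Hypothetical light-stemming affixes; a real stemmer covers many more cases.
PREFIXES = ("\u0627\u0644",)  # definite article "al"
SUFFIXES = ("\u0627\u062A", "\u0648\u0646", "\u064A\u0646")  # "at", "un", "in"

ARABIC_WORD = re.compile(r"[\u0621-\u064A]+")  # runs of Arabic letters


def tokenize(text):
    """Split on anything that is not an Arabic letter.

    This drops digits, punctuation, white space, and non-Arabic words.
    """
    return ARABIC_WORD.findall(text)


def light_stem(token):
    """Strip at most one prefix and one suffix, keeping at least 3 letters."""
    for prefix in PREFIXES:
        if token.startswith(prefix) and len(token) - len(prefix) >= 3:
            token = token[len(prefix):]
            break
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            token = token[:-len(suffix)]
            break
    return token


def preprocess(text):
    """Tokenize, drop stop words, and stem, in the order listed above."""
    text = re.sub(r"[\u064B-\u0652]", "", text)  # assumption: remove diacritics
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    return [light_stem(t) for t in tokens]
```

For instance, a line containing the word for "the poetry", the preposition "fi", a year, and an English word is reduced to the single stem for "poetry".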

Features Selection
In machine learning, the way feature vectors are constructed is critical and has a significant impact on the results of the learning algorithm. Each object should be represented by its own features.

$$D = d_1 d_2 \ldots d_n \qquad (1)$$
where $D$ is a document, $d_i$ is a word, and $f$ is the function representing the relation between the domain of documents and features. $f$ may be a linear or nonlinear equation.
The mutually deducted count refers to the number of appearances of a feature in a given category, with the number of appearances of the same feature in all other categories deducted. The feature vector for a document $D$ was built once. When a feature was found, a Boolean flag was set. The Boolean vector model used in this type of classification is better than the count model [19,20].
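A minimal Python sketch of the two ideas in this section follows: a Boolean feature vector, and a per-class score equal to a feature's occurrences in one category minus its occurrences in all other categories. The function names and the toy English tokens are hypothetical, and raw counts are used in place of probabilities as a simplifying assumption.

```python
def boolean_vector(tokens, vocabulary):
    """Flag each vocabulary term as present (1) or absent (0) in a document."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]


def mutually_deducted_counts(docs_by_class, vocabulary):
    """Score each term per class: count in the class minus count elsewhere."""
    counts = {c: {t: 0 for t in vocabulary} for c in docs_by_class}
    for c, docs in docs_by_class.items():
        for tokens in docs:
            for t in tokens:
                if t in counts[c]:
                    counts[c][t] += 1
    # Total occurrences of each term over all classes.
    totals = {t: sum(counts[c][t] for c in counts) for t in vocabulary}
    return {
        c: {t: counts[c][t] - (totals[t] - counts[c][t]) for t in vocabulary}
        for c in counts
    }


# Toy data with placeholder English tokens (illustration only).
toy_docs = {
    "Love": [["heart", "night"], ["heart"]],
    "Politic": [["nation", "night"]],
}
toy_vocab = ["heart", "night", "nation"]
toy_scores = mutually_deducted_counts(toy_docs, toy_vocab)
```

A term that appears equally often inside and outside a category scores zero, so it carries no weight for that category, while a term concentrated in one category scores high there and negative elsewhere.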

Machine Learning Algorithms
In our approach, three machine learning algorithms were selected for the classification of modern Arabic poetry. These algorithms have proven successful in the classification of English text. The first algorithm is Support Vector Machines, the second is Naïve Bayes, and the third is Linear Support Vector Classification. The dataset consists of four groups (folders): Islamic contains 23 files, Love contains 25 files, Politic contains 22 files, and Social contains 22 files, as illustrated in Table 4. Classifier performance is evaluated by computing its precision [21], recall [16], and f-measure [22].
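The three evaluation measures can be computed per class as in the short Python sketch below. It uses the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), f-measure as their harmonic mean); the function name and the toy labels are hypothetical, and returning 0.0 for an empty denominator is a convention assumed here.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f_measure) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy labels using the paper's folder names (illustration only).
y_true = ["Love", "Love", "Social", "Islamic"]
y_pred = ["Love", "Social", "Social", "Islamic"]
love_scores = per_class_scores(y_true, y_pred, "Love")
```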

Support Vector Machines
SVM is a kernel-based algorithm for regression and binary data classification [17,18]. Based on structural risk minimization theory, the SVM has proven successful in solving both local minimum and high dimensionality problems. It has better generalization performance than other ML methods such as ANNs [19,20]. SVM has so far been excellent in solving several real-world predictive data mining problems such as time series prediction, text categorization, image processing, and pattern recognition [21,22]. Despite the remarkable achievements of the SVM, certain drawbacks still need to be addressed, such as the relationship of statistical learning theory with other theoretical frameworks, big data processing, parameter selection, and the generalization ability for a given problem [23,24]. With the rate of development of information systems, high-dimensional, dynamic, and complex data are easily generated [25,26].
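To make the margin idea concrete, the following Python sketch trains a binary linear SVM by subgradient descent on the regularized hinge loss. It is a simplified stand-in, not the authors' setup: it uses no kernel, and the learning rate, regularization strength, epoch count, and toy points are all assumed values chosen for illustration.

```python
def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wj * xj for wj, xj in zip(w, x))


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit weights w and bias b for labels in {-1, +1}."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (dot(w, xi) + b)
            # Regularization step: shrink the weights slightly.
            w = [wj * (1 - lr * lam) for wj in w]
            # Hinge-loss step: only points inside the margin push on w and b.
            if margin < 1:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    return 1 if dot(w, x) + b >= 0 else -1


# Toy, linearly separable points (illustration only).
X_toy = [[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]]
y_toy = [1, 1, -1, -1]
w_toy, b_toy = train_linear_svm(X_toy, y_toy)
```

Points that already lie outside the margin contribute nothing to the update, which is what ties the solution to a small set of support vectors.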

Naïve Bayes
The NB method is a classification scheme that relies on Bayes' theorem. This technique assumes the independence of its predictors. Simply put, the NB classifier assumes that there is no relationship between the existence of a certain feature in a class and that of any other feature [27][28][29][30]. This theorem was adopted in determining the class $c$ of a document $D$ according to the following equation:
$$P(c \mid D) = \frac{P(D \mid c)\,P(c)}{P(D)}$$
The important hypothesis in this algorithm is that each feature in the document does not depend on the other features, and this assumption produces the following equation:
$$P(D \mid c) = \prod_{i=1}^{n} P(d_i \mid c)$$
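The independence assumption can be illustrated with a small Naïve Bayes classifier written in plain Python. This is a sketch under assumptions: it uses term counts with add-one (Laplace) smoothing and log probabilities, neither of which the paper specifies, and the function names and toy English tokens are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Collect class frequencies, per-class term counts, and the vocabulary."""
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, c in zip(docs, labels):
        term_counts[c].update(tokens)
        vocabulary.update(tokens)
    return class_docs, term_counts, vocabulary


def predict_nb(model, tokens):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    class_docs, term_counts, vocabulary = model
    n_docs = sum(class_docs.values())
    best_class, best_score = None, -math.inf
    for c in class_docs:
        total_terms = sum(term_counts[c].values())
        score = math.log(class_docs[c] / n_docs)
        for t in tokens:
            if t in vocabulary:  # ignore terms never seen in training
                # Assumption: add-one smoothing avoids zero probabilities.
                likelihood = (term_counts[c][t] + 1) / (total_terms + len(vocabulary))
                score += math.log(likelihood)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Toy training data with placeholder English tokens (illustration only).
toy_docs = [["heart", "night"], ["heart", "longing"],
            ["nation", "freedom"], ["nation", "people"]]
toy_labels = ["Love", "Love", "Politic", "Politic"]
toy_model = train_nb(toy_docs, toy_labels)
```

Because each term contributes its own factor, the score of a document is a simple sum of per-term log likelihoods, which is exactly what the independence hypothesis permits.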

Linear Support Vector Classification
Linear SVC is a machine learning algorithm similar to the SVM. Its features include flexibility in the selection of loss functions, and it is suitable for a large number of samples. For multiclass data, researchers have found that this model uses a one-against-rest approach, whereas SVM uses a one-against-one approach. This model is used in several applications, such as the classification of text documents using sparse features [22][23][24].
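The difference between the two multiclass strategies can be seen by enumerating the binary problems each one creates for the four poetry categories. The Python sketch below only lists the sub-problems and trains nothing; the function names are hypothetical. The same split is documented for scikit-learn's `LinearSVC` (one-vs-rest) and `SVC` (one-vs-one), although the paper does not name the library it used.

```python
from itertools import combinations

CLASSES = ["Islamic", "Love", "Politic", "Social"]


def one_vs_rest_tasks(classes):
    """One binary problem per class: that class against all the others."""
    return [(c, [other for other in classes if other != c]) for c in classes]


def one_vs_one_tasks(classes):
    """One binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))


ovr = one_vs_rest_tasks(CLASSES)  # k problems for k classes
ovo = one_vs_one_tasks(CLASSES)   # k * (k - 1) / 2 problems
```

With four classes, one-against-rest needs four binary classifiers while one-against-one needs six, and the gap widens as the number of classes grows.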

Methodology
Figure 1 presents the outline of our work. First, we chose the dataset used in our work; we then segmented it into words and applied all the data pre-processing steps, including feature extraction. We used three machine learning algorithms (SVM, LSVC, and NB) for training and testing.

Results
The work was done in the Python language on a machine with the following configuration: OS: Windows 7, CPU Speed: 3.20 GHz, Processor: Intel Core i7, RAM: 4 GB. To evaluate the performance of the proposed work, precision, recall, and f-measure were measured for all types of modern Arabic poems.

Figure 1. Block diagram of the proposed method

Table 1. The Diacritics in Modern Arabic Poem
ISSN: 1693-6930, The classification of the modern arabic poetry using… (Munef Abdullah Ahmed)

Table 2. Example of the Effect of the Diacritics on the Arabic Word

Table 3. The Effect of Positioning on the Form of a Letter
The number of features is represented by $m$, where $m$ is the feature vector length. We performed the mutually deducted occurrence as follows: $P_c = P(f_i)$ represents the probability of occurrence of feature $f_i$ in category or class $c$. Therefore, the mutually deducted count feature became
$$M(f_i) = P_c(f_i) - \sum_{j=1}^{k} P_j(f_i), \quad \text{where } j \neq c.$$
The number of classes is represented by $k$.

Table 4. The Datasets for the Classification