Comparison Support Vector Machine and Naive Bayes Methods for Classifying Cyberbullying in Twitter

In 2019, Indonesia recorded 6.43 million Twitter users. Such a large user base allows anyone to express opinions freely, which can lead to cyberbullying. Victims of cyberbullying experience higher levels of depression than victims of other forms of verbal violence. The forms of cyberbullying that occur on Twitter are flaming, denigration, and body shaming. This research contributes by making social media developers and users more aware of the types of cyberbullying that users sometimes commit without realizing it. Social media developers can prevent cyberbullying through policies such as word detection and filtering features that flag cyberbullying more accurately by classifying it by type with the most accurate method. To classify the forms of cyberbullying on Twitter, this study uses the Naïve Bayes method and the Support Vector Machine (SVM) and compares them by classification accuracy. The study also identifies the words that characterize each category of cyberbullying, so that social media users can recognize each category easily and avoid cyberbullying. The results are a classification accuracy of 97.99% for Naïve Bayes and 99.60% for SVM, which means that SVM classifies the forms of cyberbullying on Twitter better than Naïve Bayes.

factors). In the case of cyberbullying, these factors can lead bullies to use different words in their messages, forming cyberbullying categories that differ from each other [11] [12]. The forms of cyberbullying that occur on Twitter are flaming, denigration, and body shaming. Flaming is fighting online using electronic messages with abusive and vulgar language, such as swearing, gossiping, or mocking. Denigration is sending or posting gossip or rumors about someone to ruin their reputation (defamation) [13] [14]. Body shaming is the act of criticizing or denouncing the shape, size, and physical appearance of others [15]; it can make victims insecure about their own bodies. Cyberbullying categories on Twitter can be classified with text mining. Text mining is the process of extracting information from a large set of documents, and it can automatically identify patterns and relationships in textual data [16].
Many methods have been developed to classify text, two of which are the Naïve Bayes Classifier (NBC) and the Support Vector Machine (SVM) [17]. Naïve Bayes is a probability-based classification method. It calculates a set of probabilities by summing the frequencies and combinations of values in a dataset. It applies Bayes' theorem and assumes that all attributes are independent of one another given the value of the class variable [18]. SVM is a non-probabilistic binary linear classification technique that represents each document as a vector in the sample space. It analyzes word-based vector data with a trained model to determine a separating hyperplane [19].
Cyberbullying has so far not been classified by its form. Previous studies are limited to sentiment analysis of social media comments to identify whether a document is cyberbullying or not. These studies include sentiment analysis to detect cyberbullying in Facebook comments [20], without classification by form of cyberbullying. Another study [21] identified cyberbullying tweets on Twitter, labeling comments as bullying or not bullying, again without classification by form.
Based on the explanation above, this study classifies tweets indicated as cyberbullying on Twitter using the Naïve Bayes Classifier and the Support Vector Machine. Tweets indicated as cyberbullying are classified by form: flaming, denigration, and body shaming. The study aims to identify the words that characterize each cyberbullying category and to determine which of Naïve Bayes and SVM classifies cyberbullying data better.
This research contributes by making social media developers and users more aware of the types of cyberbullying that users sometimes commit without realizing it. Social media developers can prevent cyberbullying through policies such as word detection and filtering features that flag cyberbullying more accurately by categorizing it by type with the most accurate method.

RESEARCH METHOD
The data used in this study are text data in the form of comments indicated as cyberbullying in Indonesia. The comments were taken from the tweets of Twitter users. The tweets were retrieved by website crawling through the Application Programming Interface (API) provided by Twitter. The data comprise 1000 tweets in three categories based on the type of cyberbullying: flaming, denigration, and body shaming. The tweets were posted between December 2019 and February 2020. To find the characteristic words of each cyberbullying category and to classify the tweets with the SVM and Naïve Bayes methods, the research steps are: website crawling using the Twitter API, text processing, k-fold cross-validation, classification with the SVM and Naïve Bayes methods, and calculation of the APPER value.

Website Crawling Using the API on Twitter
Website crawling using the Twitter API is used to collect cyberbullying tweets. The keywords used to retrieve tweets are words that indicate bullying. These steps are carried out in OSS-R software using the twitteR package.

Text Processing
Text processing prepares the data so that the system can read it properly and produce optimal classification results. The steps are as follows:
1. Select data manually, keeping tweets that are indicated as cyberbullying and discarding tweets that are not.
2. Label the cyberbullying category of each tweet manually using Microsoft Excel.
3. Clean the documents of unnecessary characters and words such as emoticons, hashtags (#), numbers, symbols, punctuation marks, slang words, and stopwords.
4. Transform case and stem the data. At this stage, all letters in the documents are changed to lowercase and words are reduced to their base forms.
5. Form a word frequency matrix using the Term Document Matrix function in OSS-R.
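The cleaning and matrix-building steps can be sketched in a few lines. The study itself used OSS-R; the Python below is only an illustrative assumption, and the stopword list, the function names `clean_tweet` and `term_document_matrix`, and the choice to drop whole hashtag tokens are hypothetical. Stemming and slang-word handling are left out because the text does not name the tools used for them.

```python
import re
from collections import Counter

# Hypothetical, tiny stopword list for illustration only; the study's actual
# stopword and slang-word lists are not given in the text.
STOPWORDS = {"yang", "dan", "di", "itu"}

def clean_tweet(text, stopwords=STOPWORDS):
    """Lowercase a tweet, strip hashtags, digits, and symbols, and drop stopwords."""
    text = text.lower()
    text = re.sub(r"#\w+", " ", text)      # assumption: drop whole hashtag tokens
    text = re.sub(r"[^a-z\s]", " ", text)  # drop digits, punctuation, emoticons
    return [tok for tok in text.split() if tok not in stopwords]

def term_document_matrix(docs):
    """Return (vocabulary, matrix) where matrix[t][d] counts term t in document d."""
    counts = [Counter(tokens) for tokens in docs]
    vocab = sorted(set().union(*counts)) if counts else []
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix
```

Rows of the returned matrix are terms and columns are documents, which matches the term-document orientation named in step 5.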

K-Fold Cross-Validation
K-fold cross-validation is a technique that splits the data into k subsets of equal size. It is used to reduce bias in the data. Training and testing are carried out k times. In the first trial, subset Sk is treated as test data and the other subsets as training data. In the second trial, subsets S1, S2, …, Sk-2, Sk become training data and Sk-1 becomes test data, and so on [22]. An example of applying k-fold cross-validation is illustrated in Fig. 1.
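The rotation of test and training subsets can be sketched as follows. This is a minimal Python illustration, not the study's R code; the function names are hypothetical, and the folds are contiguous and unshuffled for readability, whereas a real experiment would normally shuffle the data first.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    if k < 2 or k > n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # spread any remainder over the first folds
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices), starting with the last fold as test data."""
    folds = k_fold_indices(n_samples, k)
    for i in reversed(range(k)):  # S_k first, then S_{k-1}, and so on
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Every sample appears in the test set exactly once across the k trials, so the k accuracy values can be averaged into a single estimate.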

Classifying Data with SVM and Naïve Bayes Methods

Naïve Bayes Classifier
Naïve Bayes classifies texts in two stages: training and testing. In the training stage, the sample documents are analyzed to select the vocabulary, that is, the words appearing in the collection of sample documents that can represent the documents. The prior probability of each category is then determined from the sample documents. In the testing stage, the category of a document is determined from the terms that appear in the document being classified [23].
Let the collection of documents be $D = \{d_1, d_2, \ldots, d_n\}$ and the collection of categories be $V = \{v_1, v_2, \ldots, v_n\}$. NBC classification begins by calculating $P(v_j \mid d_i)$, the probability of category $v_j$ given document $d_i$. The document $d_i$ is an n-tuple of words, $\{a_1, a_2, \ldots, a_n\}$, whose frequency of occurrence is assumed to be a random variable with a Bernoulli probability distribution [24] [25]. The document is then classified by finding the category with the maximum posterior value [26]:

$$V_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n) \tag{1}$$

Applying Bayes' theorem to (1) gives

$$V_{MAP} = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)}{P(a_1, a_2, \ldots, a_n)} \tag{2}$$

Since the value of $P(a_1, a_2, \ldots, a_n)$ is constant across categories, it can be ignored, so equation (2) can be written as

$$V_{MAP} = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j) \tag{3}$$

Classification with the Naïve Bayes method further assumes that the words in $\{a_1, a_2, \ldots, a_n\}$ are independent given the category. Because $P(a_1, a_2, \ldots, a_n \mid v_j) = \prod_i P(a_i \mid v_j)$, the equation can be written as

$$V_{MAP} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j) \tag{4}$$

The value of $P(v_j)$ is determined in the training process and is approximated by

$$P(v_j) = \frac{N_j}{N} \tag{5}$$

where $N_j$ is the number of documents with category $j$ in the training data, and $N$ is the number of documents used as training data.
A term does not always appear in every category, in which case the value of $P(a_i \mid v_j)$ is zero [26]. To overcome this problem, add-one (Laplace) smoothing adds 1 to the term frequency, so that $P(a_i \mid v_j)$ can be written as

$$P(a_i \mid v_j) = \frac{n_{ij} + 1}{n_j + |W|} \tag{6}$$

where $n_{ij}$ is the frequency of term $a_i$ in category $v_j$, $n_j$ is the total number of term occurrences in category $v_j$, and $|W|$ is the number of distinct terms in the vocabulary.
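The training and testing stages can be sketched as below. This is a minimal Python illustration under stated assumptions, not the study's implementation: it counts term frequencies per category (a multinomial-style estimate), applies add-one smoothing, and works in log space to avoid numerical underflow. The function names and the choice to skip terms never seen in training are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Estimate class priors and per-class term counts from tokenized documents."""
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)
    vocab = {t for counts in term_counts.values() for t in counts}
    priors = {v: n / len(labels) for v, n in class_docs.items()}
    return priors, term_counts, vocab

def predict_nb(tokens, priors, term_counts, vocab):
    """Return the class maximizing log P(v) + sum of log P(a_i | v), add-one smoothed."""
    best, best_score = None, -math.inf
    for v, prior in priors.items():
        total = sum(term_counts[v].values())  # all term occurrences in class v
        score = math.log(prior)
        for t in tokens:
            if t not in vocab:
                continue  # assumption: ignore terms never seen in training
            score += math.log((term_counts[v][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = v, score
    return best
```

Because of the added 1, a term that never occurred in a category contributes a small positive probability instead of forcing the whole product to zero.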

Support Vector Machine
The SVM algorithm works by defining the boundary between two classes with the maximum distance from the closest data points. To find this boundary, we need the best hyperplane in the input space, obtained by measuring the hyperplane's margin and finding its maximum. The margin is the distance between the hyperplane and the closest pattern of each class. These closest patterns are called support vectors [27]. The SVM method is illustrated in Fig. 2.

Fig. 2. Illustration of the SVM method
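To make the ideas of margin and support vector concrete, the sketch below checks a given hyperplane against toy data. It assumes the standard hard-margin convention, in which labels are +1 or -1, every point must satisfy y(w·x + b) ≥ 1, and the margin equals 2/‖w‖. It does not train an SVM, and the function name `margin_report` and the toy points are hypothetical.

```python
import math

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def margin_report(points, labels, w, b, tol=1e-9):
    """Check y_i (w . x_i + b) >= 1; return (feasible, margin, support_vector_indices)."""
    values = [y * (dot(w, x) + b) for x, y in zip(points, labels)]
    feasible = all(v >= 1 - tol for v in values)
    margin = 2 / math.sqrt(dot(w, w))  # distance between the two bounding hyperplanes
    support = [i for i, v in enumerate(values) if abs(v - 1) <= tol]
    return feasible, margin, support

# Toy data: two points per class, separated by the vertical line x1 = 1.
points = [(2, 0), (3, 1), (0, 0), (-1, 1)]
labels = [1, 1, -1, -1]
print(margin_report(points, labels, w=(1, 0), b=-1))
```

In this toy case the points (2, 0) and (0, 0) lie exactly on the bounding hyperplanes, so they are the support vectors.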
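Mathematically, the problem of separating vectors that belong to two different groups in a set of documents is formulated as follows.

$$w \cdot x + b = 0 \tag{7}$$

where $w = \{w_1, w_2, \ldots, w_n\}$ is the weight vector, $n$ is the number of attributes, $x$ is the input data used as attributes, and $b$ is a scalar that acts as an additional weight. The first bounding hyperplane delimits the first class, while the second delimits the second class. With class labels $y_i \in \{+1, -1\}$, the two bounding hyperplanes are stated as follows:

$$w \cdot x_i + b \ge +1 \quad \text{for } y_i = +1 \tag{8}$$

$$w \cdot x_i + b \le -1 \quad \text{for } y_i = -1 \tag{9}$$

The margin between the bounding hyperplanes, based on the formula for the distance from a plane to the origin, is

$$\frac{1 - b - (-1 - b)}{\lVert w \rVert} = \frac{2}{\lVert w \rVert} \tag{10}$$

The margin is maximized while still complying with (8). Maximizing $1/\lVert w \rVert$ is the same as minimizing $\lVert w \rVert^2$, so the search for the best separating hyperplane with the largest margin can be formulated as a constrained optimization problem:

$$\min_{w} \ \frac{1}{2} \lVert w \rVert^2 \tag{11}$$

where $y_i (w \cdot x_i + b) - 1 \ge 0$, and the constraints can be written as follows, with $l$ the number of training samples.

$$y_i (w \cdot x_i + b) \ge 1, \quad i = 1, 2, \ldots, l \tag{12}$$

This problem can be solved using the Lagrange multiplier method. It is transformed into the Lagrangian function as follows.

$$L(w, b, \alpha) = \frac{1}{2} \lVert w \rVert^2 - \sum_{i=1}^{l} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right] \tag{13}$$

where $\alpha_i \ge 0$ are the Lagrange coefficients. Based on equation (8), equation (13) can be written as follows.

$$L(w, b, \alpha) = \frac{1}{2} \lVert w \rVert^2 - \sum_{i=1}^{l} \alpha_i y_i (w \cdot x_i + b) + \sum_{i=1}^{l} \alpha_i \tag{14}$$

To calculate the values of $\alpha_i$ (the Lagrange coefficients), $L$ in (14) must be minimized with respect to $w$ and $b$. Setting the partial derivatives of $L$ with respect to $w$ and $b$ to zero gives

$$\frac{\partial L}{\partial w} = 0 \ \Rightarrow \ w = \sum_{i=1}^{l} \alpha_i y_i x_i \tag{15}$$

$$\frac{\partial L}{\partial b} = 0 \ \Rightarrow \ \sum_{i=1}^{l} \alpha_i y_i = 0 \tag{16}$$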
The search for the best separating hyperplane is formulated as follows.
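In the standard hard-margin derivation, setting the derivatives of the Lagrangian to zero gives $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$, and substituting these back leaves a problem in the coefficients $\alpha_i$ alone. The block below is that textbook dual form, given as a sketch under the assumption that the usual derivation applies here; the symbols $l$ (number of training samples) and $y_i \in \{+1, -1\}$ (class labels) are assumed notation.

```latex
\begin{aligned}
\max_{\alpha}\quad & \sum_{i=1}^{l} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j \, y_i y_j \, (x_i \cdot x_j) \\
\text{subject to}\quad & \alpha_i \ge 0, \quad i = 1, \ldots, l,
  \qquad \sum_{i=1}^{l} \alpha_i y_i = 0
\end{aligned}
```

Training samples with $\alpha_i > 0$ in the solution are the support vectors; all other samples do not affect the hyperplane.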

Calculating the APPER Value
Apparent Error Rate (APPER) is a value used to assess the probability of error in classifying objects. Classification accuracy is calculated from the results of the classification process, summarized in a classification table (Table 1). Based on Table 1, the error in classifying objects can be calculated using APPER [30], which is defined as

$$\text{APPER} = \frac{\text{number of misclassified objects}}{\text{total number of objects}}$$

Classification accuracy is then $1 - \text{APPER}$.
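The calculation can be sketched from a classification table as follows. This Python function is an illustrative assumption rather than the study's code, and the counts in the usage example are hypothetical, not results from the paper.

```python
def apper(confusion):
    """Apparent error rate: misclassified objects divided by all objects.

    confusion[i][j] is the number of objects of actual class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return (total - correct) / total

# Hypothetical counts for three classes (rows: actual, columns: predicted).
table = [[30, 1, 0],
         [2, 33, 0],
         [0, 1, 33]]
error = apper(table)
accuracy = 1 - error
```

The diagonal holds the correctly classified objects, so everything off the diagonal counts toward the error rate.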

Flowchart
Based on the research steps described above, the process is summarized in a flowchart to simplify the explanation. Fig. 3 is a flowchart of the entire analysis process.

Table 2 gives the 10 words with the highest frequency in the cyberbullying data for the body-shaming, denigration, and flaming categories. Based on Table 2, the words that characterize the body-shaming category are "cebol", "jelek", "botak", "gendut", "item", "cungkring", "dekil", "sipit", and "buncit", with "cebol" the word most widely used by bullies in body shaming. The words that characterize the denigration category are "isu", "alih", "uang", "becus", "plagiat", "haram", "kerja", "kasus", and "gelap", with "isu" the word most often used by bullies in denigration. The words that characterize the flaming category are "anjing", "goblok", "bangsat", "bodoh", and "tai", with "anjing" the word most widely used by bullies in flaming; "anak", "tahu", "lihat", and "baru" are not considered characteristic because they are affixed words with low frequency. The word "orang" is characteristic of cyberbullying in general because it is found in every category.

RESULTS AND DISCUSSION
The classification accuracy of cyberbullying over 10 iterations with the Naïve Bayes and SVM methods is shown in Table 3 and Table 4. Based on Table 3, the Naïve Bayes Classifier can classify the three forms of cyberbullying on Twitter with an average classification accuracy of 97.99%. Based on Table 4, the Support Vector Machine can classify the three forms of cyberbullying on Twitter with an average classification accuracy of 99.60%. This shows that the SVM method classifies cyberbullying forms better than the Naïve Bayes Classifier.