Mobile Forensics for Cyberbullying Detection using Term Frequency - Inverse Document Frequency (TF-IDF)

ABSTRACT


INTRODUCTION
Crime in the digital era evolves along with technology [1]. In 2015 Indonesia ranked third in the world for cyberbullying, and 91% of the cases involved children [2]. A report released by the RSA Anti-Fraud Command Center (AFCC) states that from 2013 to 2015 cybercrime activity increased by 173% worldwide, with total losses reaching US$ 325 billion. The same report notes that in 2015, 45% of transactions were carried out through mobile channels, while 61% of fraud occurred through mobile devices [3]. Social media and messaging services such as Facebook, WhatsApp, and Twitter are among the channels through which cybercrime is committed. WhatsApp is an instant messaging (IM) application for smartphones that runs on various operating systems such as Apple iOS, BlackBerry, Android, Symbian, Nokia Series 40, and Windows Phone. WhatsApp lets users chat online, share files, exchange photos, and use other features that attract users [4]. According to Statista, as of July 2019 1.6 billion users accessed WhatsApp every month [5]. Because the user base is so large, this figure gives investigators a reason to better anticipate cybercrime that may occur in the WhatsApp application.
In Indonesia, cybercrime is regulated by the Electronic Information and Transactions (ITE) Law. ITE offenses can be prosecuted under criminal or civil law according to the severity of the offense. The authorities arrest cybercrime suspects based on evidence stored on a smartphone or on other hardware that can be used in a court of law, such as a user name, IP address, and timestamp [6]. In the field of technology, forensic analysis of digital or electronic evidence is referred to as computer forensics or digital forensics [7]. Mobile forensics is needed to conduct forensic analysis of evidence in the form of cellular devices [8]. Mobile forensics requires a reference method for identifying cyberbullying on a mobile device, making it easier for investigators to find a cyberbullying act.
A common problem in cyberbullying cases is that it is difficult to establish that the perpetrator bullied the victim, because the messages are checked by eye and there is no strong reference to prove that the act took place. This research is expected to give investigators a reference for identifying cyberbullying acts. The method used to detect cyberbullying here is TF-IDF, which searches for words in a text after the text has been preprocessed. TF-IDF matches the words in a person's conversation against keywords stored in a database, weights the matching words, and thereby highlights the sentences that point to cyberbullying.

RESEARCH METHOD
The method in this study starts from digital evidence in the form of messages exchanged between the perpetrators and victims of cyberbullying. The research method can be seen in Figure 1. The figure shows how cyberbullying is identified: the evidence is first cloned, and the analysis is carried out on the clone. Cloning yields the evidence as data, and once the chat messages have been obtained, text mining is used to extract the important words from the messages of the perpetrators and victims. The TF-IDF method is then used to measure the similarity between the keywords and the important words found in the evidence, in order to detect cyberbullying in the WhatsApp group evidence. From this, the cyberbullying can be identified.

Preprocessing
Preprocessing is the initial stage of text mining. This stage includes all the routines and processes for preparing the data that will be used in the knowledge discovery operation of a text mining system [9]. Preprocessing consists of four stages: case folding, tokenizing, stopword removal, and stemming. Case folding converts all the letters in a document to lowercase. Only the letters "a" through "z" are accepted; all other characters are dropped and treated as delimiters [10]. Tokenizing splits the text, originally in the form of sentences, into words and removes delimiters such as periods (.), commas (,), quotation marks ("), parentheses (()), spaces, and numeric characters attached to a word [11]. A stopword is a word that is not a feature (unique word) of a document, for example "di" (in), "oleh" (by), "pada" (on), "sebuah" (a), "karena" (because), and so forth. Stopwords are considered irrelevant to the main subject of the database, although they may occur frequently in documents. They include determiners, conjunctions, prepositions, and the like [12]. After stopword removal, the next step is stemming. Stemming maps the various forms of a word to its base form (stem) [13].
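The four preprocessing stages can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the study: the stopword set reuses the examples above, and the `STEMS` lookup table is a hypothetical stand-in for a real Indonesian stemmer.

```python
import re

# Hypothetical resources for illustration only.
STOPWORDS = {"di", "oleh", "pada", "sebuah", "karena"}
STEMS = {"dipukul": "pukul", "memukul": "pukul"}  # toy lookup, not a real stemmer


def preprocess(text):
    """Case folding, tokenizing, stopword removal, and stemming."""
    folded = text.lower()                                # case folding
    tokens = re.findall(r"[a-z]+", folded)               # tokenizing: keep runs of a-z only
    tokens = [t for t in tokens if t not in STOPWORDS]   # stopword removal
    return [STEMS.get(t, t) for t in tokens]             # stemming via lookup


print(preprocess("Dia DIPUKUL oleh temannya, karena 2 hal."))
# ['dia', 'pukul', 'temannya', 'hal']
```

Digits and punctuation act as delimiters here, so "2" and the comma disappear during tokenizing rather than in a separate cleaning pass.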

Term Frequency
Term Frequency (TF) is a way to weight a term within a document by counting the number of times the term occurs in that document. The more often a term appears, the larger its weight and its suitability value. The Term Frequency equation can be seen in (1).
$$TF(d,t) = f(d,t) \qquad (1)$$
Information: TF(d,t): the frequency of the term t in document d, i.e. the number of occurrences f(d,t), which is then used in the calculation of the TF-IDF weighting.
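As a small illustration, term frequency can be computed with a counter over the preprocessed tokens. The sketch assumes TF is the raw occurrence count, as the definition above suggests; the sample tokens are hypothetical.

```python
from collections import Counter


def term_frequency(tokens):
    """Return TF(d, t) as the raw count of each term t in one document d."""
    return Counter(tokens)


doc = ["tolol", "kamu", "tolol"]   # hypothetical preprocessed message
tf = term_frequency(doc)
print(tf["tolol"], tf["kamu"], tf["baik"])  # 2 1 0
```

A `Counter` returns 0 for terms that never occur, which is convenient when looking up query words that are absent from a message.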

Inverse Document Frequency
Inverse Document Frequency (IDF) is a method for measuring how a term is distributed across documents [14]. The Inverse Document Frequency equation can be seen in (2).
$$IDF_t = \log_{10}\left(\frac{N}{df_t}\right) + 1 \qquad (2)$$
Information:
N : the total number of documents in a conversation that occurred on the WhatsApp application.
df_t : the number of documents containing the target word. The fewer documents contain the target word, the greater the IDF weight.
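The IDF calculation can be sketched as follows, reading (2) as the base-10 logarithm of N divided by the number of documents containing the term, plus one. The function name and the sample values of N and the document count are assumptions for illustration.

```python
import math


def inverse_document_frequency(n_docs, doc_freq):
    """IDF = log10(N / df) + 1, where df is the number of documents containing the term."""
    if doc_freq <= 0:
        raise ValueError("the term must occur in at least one document")
    return math.log10(n_docs / doc_freq) + 1


# Hypothetical corpus of 10 documents.
print(round(inverse_document_frequency(10, 2), 3))   # 1.699 (rare term, higher weight)
print(round(inverse_document_frequency(10, 10), 3))  # 1.0   (term in every document)
```

The added 1 keeps the weight of a term that appears in every document at 1 instead of 0, so such a term still contributes to the TF-IDF product.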

TF-IDF Weighting
The TF-IDF weight is obtained by multiplying the TF weight of each word by its IDF. The TF-IDF formula can be seen in (3).
$$W(d,t) = TF(d,t) \times IDF_t \qquad (3)$$
Information:
W(d,t) : the weight, i.e. the product of term frequency and inverse document frequency.
TF(d,t) : term frequency, the number of occurrences of the term in each conversation in the WhatsApp group.
IDF_t : inverse document frequency, derived from the number of documents that contain the query term.
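A compact sketch of the weighting step is shown below. It assumes that a document's score is the sum of TF times IDF over the query terms, with IDF taken as log10(N / document count) + 1; the four toy documents and the two query words are hypothetical and are not the study's data.

```python
import math
from collections import Counter


def tfidf_weights(docs, query_terms):
    """Return, for each document, the sum of TF * IDF over the query terms."""
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    totals = []
    for count in counts:
        total = 0.0
        for term in query_terms:
            doc_freq = sum(1 for other in counts if other[term] > 0)
            if doc_freq == 0:
                continue  # the term appears nowhere, so it contributes nothing
            idf = math.log10(n_docs / doc_freq) + 1
            total += count[term] * idf
        totals.append(total)
    return totals


docs = [["tolol", "kamu"], ["halo"], ["tolol", "tolol", "bodoh"], ["baik"]]
query = ["tolol", "bodoh"]
print([round(w, 3) for w in tfidf_weights(docs, query)])
# [1.301, 0.0, 4.204, 0.0]
```

Documents that contain none of the query words score 0, so only messages with negative words receive weight.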

Normalization of Max-Min
The Min-max method is the simplest way to apply a linear transformation to the original data. Min-max normalization preserves the relative proportions between the values before and after normalization [15]. The Min-max normalization equation can be seen in (4).
$$p' = \frac{p - \min(p)}{\max(p) - \min(p)} \times 100\% \qquad (4)$$
Information:
p′ : the new value obtained from normalization, expressed as a percentage where the largest value is max(p).
p : the value to be normalized, i.e. the value whose percentage relative to the largest value max(p) is sought.
min(p) : the smallest value of the attribute, used as the lower limit of normalization.
max(p) : the largest value of the attribute, used as the upper limit of normalization.
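Min-max normalization to a percentage can be sketched as below. The optional `lo` and `hi` parameters are an addition for illustration, allowing the limits to be supplied explicitly instead of being taken from the data.

```python
def min_max_percent(values, lo=None, hi=None):
    """Scale values linearly so that lo maps to 0% and hi maps to 100%."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    if hi == lo:
        raise ValueError("max and min must differ")
    return [(v - lo) / (hi - lo) * 100 for v in values]


print(min_max_percent([0.0, 2.0, 4.0]))  # [0.0, 50.0, 100.0]
```

When the limits come from the data itself, the largest value always maps to 100% and the smallest to 0%.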

RESULTS AND DISCUSSION
The data were taken from a simulation in which four people hold a conversation in a WhatsApp group. The scheme can be seen in Figure 2, which shows the perpetrators bullying the victim through the WhatsApp group; the resulting dialogue can be seen in Table 1. Table 1 contains two columns: the first column lists the users taking part in the conversation, and the second column holds the content of the messages posted in the group. The evidence obtained is then processed to identify the level of cyberbullying committed. The identification flowchart can be seen in Figure 3. Figure 3 shows the stages used to detect bullying in a group conversation: case folding, tokenizing, stopword removal, stemming, TF-IDF, and Min-max normalization, which yields the cyberbullying weight as a percentage.

Case Folding
The results of the case folding stage can be seen in Table 2. Table 2 contains two columns: the first column lists the users taking part in the conversation, and the second column holds the content of the messages posted in the group. The difference between Table 1 and Table 2 is that uppercase letters have been changed to lowercase, as in the word "TOLOL", which becomes "tolol". Case folding converts the entire conversation to a standard form, which makes the text easier to prepare. For instance, the message from user I appears in Table 2 as "betul-betul-betul :d".

TF-IDF Weighting
After preprocessing, the conversation data are weighted using the TF-IDF method; the results can be seen in Table 6. Weighting by Term Frequency and Inverse Document Frequency is a way to detect verbal cyberbullying in WhatsApp conversations. From Table 6 it can be concluded that perpetrator "C" has the highest level of use of negative words, that is, words used to bully the victim. The query (Q) is not counted because it is only a database of negative words used to look up negative words in the conversation. The numbers in columns Q and A to I, such as 1.69, are obtained by calculating TF-IDF using (3). The worked calculation gives 0.39794, and the same calculation is then repeated to obtain the TF-IDF of the other words in the conversation.
The TF-IDF totals are then converted to percentages by normalization, using (4), to show the weight of the cyberbullying committed. The TF-IDF totals for the predetermined negative-word queries are 4.56 for actor "A", 3.22 for actor "B", 4.74 for actor "C", 0 for actor "D", 1.52 for actor "E", 0 for actor "F", 0 for actor "G", 4.56 for actor "H", and 0 for actor "I". Each total is reduced by the minimum TF-IDF value and then divided by the difference between the maximum and minimum TF-IDF values. The results are 48.144%, 33.952%, 50%, 0%, 16.048%, 0%, 0%, 48.144%, and 0%, respectively. From these figures it can be concluded that perpetrator "C" carries the heaviest bullying weight, because it has the highest calculated value.
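The normalization of the reported totals can be retraced with a short script. The text does not state which maximum and minimum were used; the values `ASSUMED_MAX = 9.48` and `ASSUMED_MIN = 0.0` below are assumptions, chosen because they reproduce the reported 50% for "C" and come close to the other reported percentages.

```python
# TF-IDF totals per actor as reported in the text.
totals = {"A": 4.56, "B": 3.22, "C": 4.74, "D": 0.0, "E": 1.52,
          "F": 0.0, "G": 0.0, "H": 4.56, "I": 0.0}

ASSUMED_MAX = 9.48  # assumption: not stated in the text
ASSUMED_MIN = 0.0   # assumption: smallest reported total

percent = {
    user: (weight - ASSUMED_MIN) / (ASSUMED_MAX - ASSUMED_MIN) * 100
    for user, weight in totals.items()
}

# The actor with the highest normalized weight.
top = max(percent, key=percent.get)
print(top, round(percent[top], 1))  # C 50.0
for user, value in percent.items():
    print(user, round(value, 2))
```

With these assumed limits, "A" and "H" come out at about 48.1%, "B" at about 34.0%, and "E" at about 16.0%. The small differences from the reported figures are consistent with the totals having been rounded to two decimals.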

CONCLUSION
In the simulated conversation between four people in a WhatsApp group, user "C" has a cyberbullying rate of 50%. This result shows that the Term Frequency and Inverse Document Frequency methods can help investigators detect cyberbullying in WhatsApp group conversations and gauge the intensity of the negative words used in bullying. Further improvement is needed for more reliable detection. In particular, the preprocessing stage needs a word normalization step to reduce detection errors, so that detection remains accurate even when words are abbreviated or altered, such as "sok2an". Overall, the research ran smoothly and achieved its expected goals.