Stemming Javanese affix words using Nazief and Adriani modifications

Stemming is the process of finding the basic word (root) of a word through several stages of affix removal. Stemming is mainly used to support spell checking and machine translation and to improve the effectiveness of the retrieval process. This study uses the Nazief and Adriani algorithm to stem Javanese affixed words. The first step is data collection and the construction of a basic word dictionary, followed by the stemming process. Before stemming, the rules of the Nazief and Adriani algorithm, which are based on the morphology of the Indonesian language, are modified to suit the morphological rules of the Javanese language. Of the 366 words tested, the algorithm produced 351 correct basic words and 15 incorrect ones. The results show that this algorithm can be used for stemming Javanese with an accuracy of 95.9%.


I. Introduction
The Javanese language has high morphological complexity because of the many affix variations a word can take [1]. Complexity also arises because a prefix can change depending on the first character of the root word. Similar to Indonesian, Javanese has affixes consisting of prefixes, infixes, suffixes, and combinations of the three.
The use of Javanese has begun to decline. Indonesian teenagers no longer consider regional languages fashionable or relevant in the era of globalization [2]-[4]. Serious effort is needed to reinvigorate the local language and to prevent Javanese language proficiency from fading, for example by building a translator or by learning vocabulary from a dictionary [5], [6].
However, a dictionary does not contain affixed words. An affixed word cannot be translated directly with the dictionary, so an alternative is needed, namely a stemming algorithm [5].
In an Information Retrieval system, a stemming algorithm serves to reduce the number of index terms of a document. It is also used to group words that share a basic word and meaning but carry different affixes [7]. In other words, stemming is the process of extracting the basic word from the original word by separating its affixes [8]. This process is important because it is one of the first steps in Information Retrieval [9]. To remove a prefix, infix, suffix, or combination of these correctly, the morphology of the language must be understood correctly.
One of the stemming algorithms is the Nazief and Adriani method [10], [11]. This method has the advantage of high accuracy with little overstemming and understemming [12]. Various studies on Javanese stemming have been done before [1], [5]. However, the use of Nazief and Adriani for Javanese has not yet been proven successful. This article discusses the application of the Nazief and Adriani stemming method to Javanese. Because the affixes of Javanese and Indonesian differ, the method must be modified so that the stemming process runs well. The remainder of this article presents the research method, the results and discussion, and the conclusions and suggestions for future development.

B. Data Set
This study uses data in the form of various Javanese affixed words. The Nazief and Adriani algorithm was developed from the morphology of the Indonesian language, so before stemming is carried out its rules must be modified in accordance with the morphology of the Javanese language. The modified rules follow the rules defined in the Javanese Paramasastra book [15] and Javanese Language Structure [13]. The affix-removal rules for Javanese are shown in Table 1. In Table 1, the symbol KD denotes the basic word. For example, the word "kumawani" is split into the prefix "kuma" and the basic word "wani". Some Javanese suffixes and examples of their use are shown in Table 2. Following the rules in Table 2, the word "silihana" is split into the basic word "silih" and the suffix "ana". In Table 3, the symbol V represents a vowel, C a consonant, and A either a vowel or a consonant. For example, the word "ngrakit" is split into the prefix "ng" and the basic word "rakit".
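The three splitting examples above can be reproduced with a minimal Python sketch. The dictionary and affix lists below are hypothetical, tiny stand-ins chosen only to cover the quoted examples; they are not the contents of Tables 1 to 3, and the function names are illustrative.

```python
# Hypothetical stand-ins for the basic word dictionary and the affix tables.
BASE_WORDS = {"wani", "silih", "rakit"}
PREFIXES = ["kuma", "ng"]   # illustrative subset only
SUFFIXES = ["ana"]          # illustrative subset only


def split_prefix(word):
    """Return (prefix, base) if removing a listed prefix leaves a dictionary word."""
    for prefix in PREFIXES:
        base = word[len(prefix):]
        if word.startswith(prefix) and base in BASE_WORDS:
            return prefix, base
    return None


def split_suffix(word):
    """Return (base, suffix) if removing a listed suffix leaves a dictionary word."""
    for suffix in SUFFIXES:
        base = word[:-len(suffix)]
        if word.endswith(suffix) and base in BASE_WORDS:
            return base, suffix
    return None


print(split_prefix("kumawani"))
print(split_suffix("silihana"))
print(split_prefix("ngrakit"))
```

Checking every candidate against the dictionary keeps the sketch from accepting a split whose remainder is not a known basic word.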
The words used to test the stemming algorithm were taken from the Complete Javanese Dictionary, a Javanese-language newspaper called "Panjebar Semangat", and several Javanese-language sites.
A total of 449 words were collected from these sources. The words used in the test consist of 235 words with a prefix, 40 words with a suffix, and 91 words with both a prefix and a suffix (366 words in total), while 83 words are not used because there are no removal rules for infixes. The number of words for each rule is shown in Figure 1. The test results are checked manually against a Javanese dictionary to identify which base words are correct and which are in error. The evaluation is then carried out by calculating the accuracy, as shown in Formula (1).
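The word counts and the reported accuracy can be checked with a short Python sketch. Formula (1) itself is not reproduced in this text, so the sketch assumes the usual definition of accuracy as the number of correct results divided by the number of tested words, times 100.

```python
# Word counts reported in the data set description.
prefix_only = 235
suffix_only = 40
prefix_and_suffix = 91
unused_infix = 83

tested = prefix_only + suffix_only + prefix_and_suffix
collected = tested + unused_infix


def accuracy(correct, total):
    """Assumed form of Formula (1): share of correct results, in percent."""
    return correct / total * 100


result = accuracy(351, tested)
print(f"{tested} tested of {collected} collected, accuracy = {result:.1f}%")
```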

C. Nazief and Adriani Stemming Algorithm
The Nazief and Adriani algorithm was developed from the Indonesian morphology rules. It groups affixes into prefixes, infixes, suffixes, and combinations of prefixes and suffixes. It is also supported by a basic word dictionary and by the restoration (recoding) of words that have been overstemmed.
Indonesian affixes are grouped into several categories as follows [16]:
Inflection suffixes are suffixes that do not change the form of the basic word. This group is divided into two: Particles (P), which include "-lah", "-kah", "-tah", and "-pun", and Possessive Pronouns (PP), which include "-ku", "-mu", and "-nya".
Derivation Suffixes (DS) are the original Indonesian suffixes that are added directly to the basic word, namely "-i", "-kan", and "-an".
Derivation Prefixes (DP) are prefixes that can be attached directly to a pure basic word, or to a basic word that already carries up to two prefixes. They include prefixes that undergo morphological change ("me-", "be-", "pe-", and "te-") and prefixes that do not ("di-", "ke-", and "se-").
Based on the classification of the affixes above, the form of affixed words in Indonesian can be modeled as follows:

[ DP+ [ DP+ [ DP+] ] ] Root Word [ [+DS] [+PP] [+P] ]
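As a rough illustration of this model, the Python sketch below assembles an affixed word in the stated order. It is only a sketch: it concatenates the affixes and ignores the spelling changes that "me-", "be-", "pe-", and "te-" can trigger, the function name `compose` is hypothetical, and the possessive pronoun set ("-ku", "-mu", "-nya") is taken from the usual description of the algorithm.

```python
PARTICLES = {"lah", "kah", "tah", "pun"}                          # P
POSSESSIVE_PRONOUNS = {"ku", "mu", "nya"}                         # PP (assumed set)
DERIVATION_SUFFIXES = {"i", "kan", "an"}                          # DS
DERIVATION_PREFIXES = {"me", "be", "pe", "te", "di", "ke", "se"}  # DP


def compose(root, prefixes=(), ds=None, pp=None, p=None):
    """Concatenate affixes in the order [DP+[DP+[DP+]]] root [[+DS][+PP][+P]]."""
    if len(prefixes) > 3:
        raise ValueError("the model allows at most three derivation prefixes")
    for prefix in prefixes:
        if prefix not in DERIVATION_PREFIXES:
            raise ValueError(f"unknown prefix: {prefix}")
    for value, group in ((ds, DERIVATION_SUFFIXES),
                         (pp, POSSESSIVE_PRONOUNS),
                         (p, PARTICLES)):
        if value is not None and value not in group:
            raise ValueError(f"unknown suffix: {value}")
    return "".join(prefixes) + root + (ds or "") + (pp or "") + (p or "")


print(compose("beri", prefixes=("di",), ds="kan", p="lah"))
```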
So that the Nazief and Adriani algorithm can be used for the Javanese language, the affix-removal rules are modified according to the morphology of the Javanese language. In addition, the removal rules for complex prefixes, which number 33 in the Indonesian version of the algorithm [11], are reduced to 16 rules for Javanese, as shown in Table 3.
The similarity in word formation structure between Javanese and Indonesian [5] makes this modification possible. The modified algorithm is presented as a flowchart, whose steps are as follows.
1. At the beginning of processing, and at each subsequent step, check the current word against the basic word dictionary. If the word is found, it is considered a basic word and the process stops.
2. Remove the suffixes that do not affect the spelling of the word, {"-ku", "-mu", "-e"}. For example, "klambimu" (your clothes) becomes "klambi" (clothes). If the word is found in the dictionary, the process stops.
3. Remove the suffixes {"-a", "-i", "-en"}, then check the dictionary. If the word is found, the process stops. If it is not found, proceed to the next suffixes, {"-an", "-ak", "-n"}. For example, the word "njagongi" is stemmed to "njagong". Because this is not a valid root word, the process continues with prefix removal.
4. Remove the prefix. In the previous step, "njagongi" was partially stemmed to "njagong". This step removes the prefix "n-" to obtain "jagong", which is a valid base word, so the process stops. If no prefix rule applies, the process stops and the algorithm reports that the root word was not found.
5. If the resulting word is not found in the dictionary, repeat step 4 (recursive process). If the word is found, the process stops.
6. If the word has still not been found after recursive prefix removal, carry out the recoding process with reference to Table 3. The columns in Table 3 show the prefix and the recoding character variants to be used when the first syllable of the base word starts with a certain letter. Not all prefixes have a recoding character.
7. If all steps are unsuccessful, the algorithm returns the original word as it was before stemming.
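The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the dictionary holds only the two example roots, the prefix list contains only "n-", the possessive suffix list ("-ku", "-mu", "-e") is assumed, the limit of three prefix passes is an assumption borrowed from the Indonesian affix model, and the recoding step is omitted because Table 3 is not reproduced here.

```python
# Hypothetical mini-dictionary and affix lists, for illustration only.
BASE_WORDS = {"klambi", "jagong"}
POSSESSIVE_SUFFIXES = ["ku", "mu", "e"]   # assumed list
SUFFIXES_FIRST = ["a", "i", "en"]
SUFFIXES_SECOND = ["an", "ak", "n"]
PREFIXES = ["n"]                          # illustrative subset
MAX_PREFIX_PASSES = 3                     # assumption


def strip_one(word, suffixes):
    """Remove the first listed suffix that matches, if any."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[:-len(suffix)]
    return word


def strip_prefixes(word):
    """Remove prefixes repeatedly until a dictionary word is reached."""
    for _ in range(MAX_PREFIX_PASSES):
        if word in BASE_WORDS:
            return word
        for prefix in PREFIXES:
            if word.startswith(prefix) and len(word) > len(prefix):
                word = word[len(prefix):]
                break
        else:
            break  # no prefix rule applies
    return word


def stem(word):
    if word in BASE_WORDS:                      # dictionary check
        return word
    current = strip_one(word, POSSESSIVE_SUFFIXES)
    if current in BASE_WORDS:
        return current
    for group in (SUFFIXES_FIRST, SUFFIXES_SECOND):
        current = strip_one(current, group)
        if current in BASE_WORDS:
            return current
    current = strip_prefixes(current)
    # Recoding would be attempted here; it is omitted in this sketch.
    return current if current in BASE_WORDS else word


print(stem("klambimu"), stem("njagongi"))
```

Returning the original word on failure mirrors the last step of the procedure, so a caller can detect an unstemmed word by comparing input and output.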

III. Results and Discussion
The test results show that the algorithm can stem Javanese words that have prefixes, suffixes, and combinations of prefixes and suffixes. However, a number of affixed words cannot be stemmed correctly.
The first mistake is in the words "kamigilan" and "kamitegan", which produce incorrect output. The correct segmentation is "kami-gila-n", but the letter "a" at the end of the root is removed together with "n" because "an" is read as a suffix, which causes overstemming. The word that should be "gila" becomes "gil".
The second mistake is in the words "pamulang" and "pamriksa". The word "pamulang" is not stemmed correctly; its root should be "wulang". Removing "pam" yields "ulang", so a rule is needed that removes the prefix "pam" and adds the letter "w" at the beginning of the word. Likewise, the word "pamriksa" cannot be stemmed because there is no rule that removes the prefix "pam" and adds the letter "p" at the beginning of the word.
The third mistake is in the word "koktekani". This error is caused by overstemming. Of the ending "ani", only the letters "n" and "i" should be removed, but the letter "a" is removed as well, because "an" at the end of the word is treated as a suffix that must be removed.
The fourth mistake is in words with the prefix "pa", namely "pamomong" and "pamudha", which cannot be stemmed. These words should become "momong" and "mudha". The error is that only the prefix "pa" should be removed, but the rule for the prefix "pam" is applied instead.
The fifth mistake is in the words "sadinane", "relapse", "gawana", "segane" and "tekane". These words share the same error, which is overstemming at the end of the word. The letters "an" at the end of a word are treated as a suffix that must be removed.
The sixth mistake is in the word "katawakake", whose root should be "tawa". Because there is no rule for the suffix "kake", removal is done in stages by removing "e" and then "ak". The final letter "a" of "tawa" is removed along with the suffix "ak", so overstemming occurs.
The seventh mistake is in the word "pangenan". The fault lies in the prefix: the existence of the prefixes "pang" and "pan" causes overstemming. The prefix "p" should have been removed, but "pang" was removed instead.
The eighth mistake is in the word "nithik", which should be stemmed to "thithik". Because there is no rule that changes the prefix "n" to "th", the word fails to be stemmed.
Among the errors above, overstemming occurs most often in words that end in "an". The results of the experiments conducted in this study are shown in Table 4. The accuracy shows that the Nazief and Adriani algorithm can be used for Javanese with good accuracy. However, the modifications made cannot yet stem words with infixes, so infix rules in accordance with the morphology of the Javanese language still need to be established.
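The "an" overstemming pattern can be reproduced with a few lines of Python. The sketch uses the "kamigilan" case described earlier and hypothetical helper functions; it only contrasts the observed removal of "an" with the intended removal of "n" and does not propose a corrected rule set.

```python
def strip_suffix(word, suffix):
    """Remove the suffix if the word ends with it."""
    return word[:-len(suffix)] if word.endswith(suffix) else word


def strip_prefix(word, prefix):
    """Remove the prefix if the word starts with it."""
    return word[len(prefix):] if word.startswith(prefix) else word


word = "kamigilan"

# Observed behaviour: "an" is matched as a suffix, so the root loses its final "a".
overstemmed = strip_prefix(strip_suffix(word, "an"), "kami")

# Intended segmentation "kami-gila-n": only "n" is removed.
intended = strip_prefix(strip_suffix(word, "n"), "kami")

print(overstemmed, intended)
```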

IV. Conclusion
The results of this research show that the modified algorithm can be used to stem Javanese words with an accuracy of 95.9%. The prefix rules give good accuracy with few errors, while most of the remaining errors occur in the suffix rules. The system can only stem words that contain prefixes and suffixes, so it cannot yet handle words that contain infixes.
Future work is expected to correct these errors, especially in the suffix rules. In addition, infix rules that fit the morphology of the Javanese language need to be established so that words with infixes can be stemmed. Producing optimal root words also requires a complete root word dictionary.