Damerau-Levenshtein distance for Indonesian spelling correction

Word correction is used to find incorrect words in writing. Levenshtein distance is one of the algorithms for correcting typing errors. It calculates the difference between two strings using three operations: insertion, deletion, and substitution. However, this algorithm has a disadvantage: it cannot handle two switched letters in the same word as a single error. The Damerau-Levenshtein algorithm addresses this issue. This research aims to analyze the Damerau-Levenshtein algorithm for correcting Indonesian spelling. The dataset consists of two fairy tale stories with 1266 words and 100 typing errors. The accuracy reaches 73% with Levenshtein distance and 75% with Damerau-Levenshtein.


I. Introduction
Levenshtein distance is a metric computed between two strings [1]. It represents the minimum number of changes required to turn one string into another [2]. The operations used in Levenshtein distance are insertion, deletion, and substitution [3]. However, this algorithm has a disadvantage: when two adjacent letters are switched, it has to change both letters, so one typing error is counted as two operations. Another algorithm is therefore needed to solve this problem, namely the Damerau-Levenshtein distance.
Damerau-Levenshtein distance is an improvement of the Levenshtein distance algorithm. It counts the minimum number of operations required to change one string into another using four operations: insertion, deletion, substitution, and the additional transposition operation [4].
This research aims to compare the Levenshtein distance and Damerau-Levenshtein distance algorithms on fairy tale stories. It is expected to provide the accuracy level of the two algorithms in correcting Indonesian spelling. The comparison will show which algorithm is better suited to Indonesian spelling.
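The three operations can be illustrated with a minimal Python sketch of the standard dynamic-programming formulation. The function name `levenshtein` and the row-by-row layout are illustrative choices, not taken from the program used in this research.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of insertions, deletions, and substitutions."""
    # previous[j] is the distance between the processed prefix of
    # source and the prefix target[:j].
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning source[:i] into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


# Two switched letters are counted as two substitutions.
print(levenshtein("MALAGN", "MALANG"))  # 2
```

The example shows the weakness discussed here: a single swap of two neighbouring letters is scored as two edits.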

A. Research Dataset
The dataset consists of two fairy tale stories from ceritadongengrakyat.com. These stories contain 1266 words and 100 typing errors. The stories are adjusted according to several problem scenarios to test the program.

B. Data Processing
The preprocessing stage is performed before the trial. This stage removes numbers and punctuation marks from a story, so that during testing no number or punctuation mark can affect the result. In addition, the dictionary used is limited to base words, so it requires additional affixed words. Examples of affixed words are "membantu", "melakukan", "menolong", "membuat", and so on.
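A minimal sketch of such a preprocessing step is shown below. The function name `preprocess`, the lowercasing, and the choice to replace every non-letter character with a space are assumptions made for illustration; the text does not specify these details.

```python
import re


def preprocess(text: str) -> list[str]:
    """Remove numbers and punctuation, then split the text into words."""
    lowered = text.lower()  # assumption: comparison is case-insensitive
    # Replace anything that is not a letter or whitespace with a space.
    letters_only = re.sub(r"[^a-z\s]", " ", lowered)
    return letters_only.split()


# Hypothetical sample sentence, not taken from the dataset.
print(preprocess("Pada tahun 1990, Malin pergi!"))
# ['pada', 'tahun', 'malin', 'pergi']
```

Under this assumption a hyphenated word such as "anak-anak" is split into two tokens; a real implementation may want to treat hyphens differently.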

C. Spelling Error
Spelling errors occur when the author's words do not conform to KBBI (Kamus Besar Bahasa Indonesia). Spelling errors are caused by several things, such as ignorance in writing, errors during ...

D. Damerau-Levenshtein Algorithm
Damerau-Levenshtein is an improvement of the Levenshtein distance algorithm. The algorithm calculates the minimum number of operations needed to convert one string into another. The operations used in the Levenshtein distance algorithm are insertion, deletion, and substitution. The Damerau-Levenshtein distance algorithm uses the same operations, with the addition of a transposition operation between two characters [5]. Damerau-Levenshtein does not distinguish between these four operations. The algorithm covers about 80% of all errors in personal writing. Errors usually take the form of a missing letter, an extra letter, or two different letters written in the wrong order [6]. The operations used in the Damerau-Levenshtein distance algorithm are as follows:
1. Insertion is an operation that inserts a character at a particular index to match the source string to the target string.
2. Deletion is an operation that deletes a character at a particular index to match the source string to the target string.
3. Substitution is an operation that replaces a character at a particular index to match the source string to the target string.
4. Transposition is an operation that swaps two characters at a particular index to match the source string to the target string.
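The four operations can be illustrated as single edits on a string. The helper names below are hypothetical, and the misspelled inputs other than MALAGN are invented for illustration; each call turns a misspelled form into the target word MALANG with exactly one operation.

```python
def insert_char(word: str, index: int, char: str) -> str:
    """Insert char so that it ends up at position index."""
    return word[:index] + char + word[index:]


def delete_char(word: str, index: int) -> str:
    """Delete the character at position index."""
    return word[:index] + word[index + 1:]


def substitute_char(word: str, index: int, char: str) -> str:
    """Replace the character at position index with char."""
    return word[:index] + char + word[index + 1:]


def transpose_chars(word: str, index: int) -> str:
    """Swap the characters at positions index and index + 1."""
    return word[:index] + word[index + 1] + word[index] + word[index + 2:]


print(insert_char("MALNG", 3, "A"))       # MALANG (missing letter)
print(delete_char("MALLANG", 2))          # MALANG (extra letter)
print(substitute_char("MALONG", 3, "A"))  # MALANG (wrong letter)
print(transpose_chars("MALAGN", 4))       # MALANG (switched letters)
```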

E. Calculation of Damerau-Levenshtein Algorithm
The Damerau-Levenshtein calculation computes the edit distance, as illustrated in Table 1 with the wrong word MALAGN and the target word MALANG. Each edit distance value is obtained at the intersection of a row and a column. The calculation starts at the first row and column and proceeds to the last row and column. The result is known once the calculation reaches the end of the last row and column; the value in that final cell is the edit distance.
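The matrix calculation can be sketched in Python as follows. This sketch assumes the restricted variant of Damerau-Levenshtein (often called optimal string alignment), in which only adjacent characters are transposed; the text does not state which variant the program uses, and the function name is illustrative.

```python
def damerau_levenshtein_matrix(source: str, target: str) -> list[list[int]]:
    """Build the edit distance matrix; the last cell is the distance."""
    rows, cols = len(source) + 1, len(target) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # i deletions turn source[:i] into ""
    for j in range(cols):
        d[0][j] = j  # j insertions turn "" into target[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Transposition: the last two characters are switched.
            if (i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d


matrix = damerau_levenshtein_matrix("MALAGN", "MALANG")
for row in matrix:
    print(row)
print(matrix[-1][-1])  # 1: one transposition of G and N
```

With this sketch the final cell for MALAGN and MALANG is 1, because the switched letters G and N are repaired by a single transposition.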

F. Words Recommendation and Accuracy
The program reports each detected word as incorrect and displays suggestions for it. The suggested words come from the dictionary stored in the program database. Accuracy is calculated by dividing the number of wrong words that can be corrected by the total number of wrong words, then multiplying by 100%, as in (1):
$$\text{Accuracy} = \frac{x}{n} \times 100\% \tag{1}$$
where x is the number of incorrect words that can be corrected, and n is the number of incorrect words.
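The suggestion lookup and the accuracy formula can be sketched as follows. The function names, the `max_distance` threshold, and the ranking by distance are assumptions for illustration only; the text does not describe how the program selects or orders its suggestions. The distance function is passed in as a parameter so that either algorithm can be used.

```python
from typing import Callable


def suggest_words(
    word: str,
    dictionary: list[str],
    distance: Callable[[str, str], int],
    max_distance: int = 2,  # hypothetical threshold, not from the text
) -> list[str]:
    """Return dictionary entries within max_distance, closest first."""
    scored = [(distance(word, entry), entry) for entry in dictionary]
    in_range = [(dist, entry) for dist, entry in scored if dist <= max_distance]
    # Sort by distance; ties are broken alphabetically.
    return [entry for _, entry in sorted(in_range)]


def accuracy_percent(corrected: int, total_errors: int) -> float:
    """Accuracy = x / n * 100, with x corrected words out of n errors."""
    if total_errors == 0:
        raise ValueError("total_errors must be greater than zero")
    return 100 * corrected / total_errors


print(accuracy_percent(75, 100))  # 75.0
```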

III. Result and Discussion
The test was performed on two fairy tale stories consisting of 1266 words and 100 spelling errors. The methods tested are Levenshtein distance and Damerau-Levenshtein. The two methods differ in the words they suggest, as in the examples in Table 2.
Table 2 shows the test results using Levenshtein distance and Damerau-Levenshtein. The results show that Damerau-Levenshtein gives better suggestions for the wrong words. As Table 2 shows, some words cannot be suggested correctly using the Levenshtein distance method. For example, for "abtang" Levenshtein distance suggested "abang", and for "bajnir" it suggested "banir", whereas Damerau-Levenshtein can suggest the intended word. Some words have no suggestion at all with Levenshtein distance, such as "makhulk", because Levenshtein distance cannot swap two interchanged letters. In Damerau-Levenshtein, the problem is overcome by transposing the two exchanged letters.
The two methods also differ in calculation time, as in Table 3. Moreover, Damerau-Levenshtein has the disadvantage that it cannot correct two words written without a space between them. Thus, some mistakes cannot be corrected by Damerau-Levenshtein, as in Table 4. The overall test results are shown in Table 5.
Based on Table 5, the Levenshtein distance algorithm corrects 73 out of 100 errors, and 27 errors cannot be corrected. The Damerau-Levenshtein algorithm corrects 75 out of 100 errors, and 25 errors cannot be corrected. The Damerau-Levenshtein algorithm therefore performs better than the Levenshtein distance algorithm, increasing the accuracy from 73% to 75%. The remaining errors cannot be corrected because neither algorithm can correct two words that are attached, that is, written without a space.

IV. Conclusion
The implementation of the Damerau-Levenshtein algorithm to correct spelling in children's fairy tale stories gives better results than the Levenshtein distance algorithm. The accuracy in the test is 73% for the Levenshtein distance algorithm and 75% for the Damerau-Levenshtein algorithm. In addition, Damerau-Levenshtein gives better word suggestions than Levenshtein distance. However, this research has some weaknesses that future researchers need to address. The unresolved problems are correcting two words that are attached (written without a space) and the large number of word suggestions. Moreover, the processing time required by the Damerau-Levenshtein algorithm tends to be longer than that of Levenshtein distance.