Latent semantic analysis and cosine similarity for hadith search engine

ABSTRACT


INTRODUCTION
Search engines have become one of the most important tools in information systems, especially online systems [1]. Search engine technology makes it easy for users to obtain information quickly [2]. Google is one of the most capable search engines, but it still has limitations in analyzing the content and meaning of search results [3]. As the flow of data on the internet grows, search engines need both speed and accuracy to return results that meet current expectations. The search function has become essential for getting information easily and quickly. However, not all search engines are designed to find specific information precisely and accurately.

In this study, a search engine was built specifically to retrieve information about hadith according to user needs. The hadith is the second most important source of law for Muslims after the Holy Qur'an [4,5]. The hadith information returned must therefore match what the user needs, so the search engine has to consider semantics, both of the entered keywords and of the hadith data stored in the system. A hadith collection in text form requires certain processes so that the meaning of the text is maintained [6], starting with converting unstructured text data into structured data [7,8]. A structured representation of text can be used in subsequent processes in both information retrieval (IR) and text mining [9].

This study uses an IR technique that combines the latent semantic analysis algorithm with cosine similarity. In contrast to text mining, where the form of the results is not known in advance, IR produces information whose form is already known, because it is the same as the data held in the collection [10][11][12]. IR is used to relate large collections of text data to keywords. The parts of IR include:
− Text operations, which include selecting words in keywords or documents (term selection) when transforming documents or keywords into term indexes.
− Query formulation, which standardizes the index terms of the keywords.
− Ranking, which finds the documents relevant to the keywords and orders them by how well they match the keywords.
− Indexing, which builds a database of indexes from the document collection. This is carried out first, before documents are searched.
An IR system accepts keywords from users and then ranks the documents in the collection by how well they match the keywords. The ranked result returned to users consists of the documents that the system judges relevant to the keywords. However, the relevance of a document to a keyword is a subjective judgment, influenced by many factors such as the topic, timing, source of information, and the objective of the user.
The latent semantic analysis algorithm is widely used to process text data with a semantic approach so that the meaning of the text is maintained. Latent semantic analysis can be used for text summarization [13][14][15], plagiarism checking [15], and automatic essay evaluation [16], and it can also be used for searching. Latent semantic analysis compares the entered text with the stored text collection based on vector representations [17][18][19], using a semantic approach to preserve the meaning of the texts. In addition to latent semantic analysis, this hadith search engine also uses cosine similarity to measure the similarity of the text data returned by the search engine, so that the results can be ranked with the closest matches at the top. Cosine similarity is one of the most popular similarity measures applied to text documents [20]. Its main advantage is that it is not affected by the length of a document, because only the term values of each document matter.

Based on the problem described above, the research questions are as follows. How can latent semantic analysis and cosine similarity be implemented to find hadith text based on the keywords entered in the hadith search engine? Can latent semantic analysis and cosine similarity in the search engine find hadith text that is correct and relevant to the entered keywords?

RESEARCH METHOD
Figure 1 describes the activity flow of this research. In general, this research used an IR technique that implements the latent semantic analysis (LSA) and cosine similarity algorithms to produce hadith information based on input keywords. The activity begins with entering the keywords (in the form of words, a phrase, or a sentence), which are processed in the text pre-processing phase to clean the text data. The LSA algorithm is then applied to create the term-document matrix and obtain the vector value of each document. Finally, the similarity between the input keywords and the hadith data collection is calculated using cosine similarity.

Latent semantic analysis is an algebraic method that extracts hidden semantic structures from words and sentences [21]. It is one of the algorithms developed in the field of information retrieval that can handle a large number of documents in a database and relate documents to one another by matching a given input. The main function of latent semantic analysis is to calculate the similarity of a text by comparing its vector representation with those of other texts [15]. The results of latent semantic analysis represent text data contextually and semantically, giving meaning to the text [21,22]. Evaluation with the latent semantic analysis method focuses on the words in a text without considering word order or grammar, so a sentence is assessed based on the key words it contains [23]. Basically, latent semantic analysis extracts information from patterns or collections of words that often appear together in different sentences. If sentences contain a collection of words that often appear together, the sentences are taken to have the same semantic meaning [21]. In general, the steps of latent semantic analysis for text data are [24]: text pre-processing, creating the term-document matrix, calculating the singular value decomposition (SVD), and calculating the vector value for each document.

Text pre-processing
The text pre-processing stage prepares unstructured text data so that it becomes a structured data representation [7,25,26]. The process consists of tokenization, removing patterns matched by regular expressions, removing non-letter characters, removing stop words, and stemming. If needed, a special process is also carried out to handle natural language features contained in the text data, such as abbreviations, slang, and regional languages. Text pre-processing is discussed further in section 3.2.

Creating term of document matrix
After the pre-processing stage has been carried out on the text data, the term-document matrix is constructed by placing the words produced by the stemming process (terms) into the rows. Each row represents a unique word, while each column represents the source in which the word was found. The source can be a sentence, a paragraph, or the whole text. An example of a term-document matrix can be seen in Table 1 (presented in Indonesian). In Table 1, the rows hold the words that have passed through pre-processing up to stemming, called stemmed terms (term 1, term 2, etc.), and the columns represent the context, namely the text. The value in each cell shows how many times a term appears in a document. For instance, term 1 appears 1 time in the first document and 2 times in the second document, but does not appear in the third document, and so on.

Table 1. Matrix example for term of document
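The construction described above can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in this study: the function name `build_term_document_matrix` and the three toy documents are assumptions, chosen so that term 1 reproduces the counts described for Table 1 (1, 2, and 0 occurrences).

```python
from collections import Counter

def build_term_document_matrix(documents):
    """Return (terms, matrix); matrix[i][j] counts term i in document j.

    `documents` is a list of already pre-processed (stemmed) strings.
    """
    tokenized = [doc.split() for doc in documents]
    # A sorted vocabulary gives a stable, reproducible row order.
    terms = sorted({token for doc in tokenized for token in doc})
    counts = [Counter(doc) for doc in tokenized]
    matrix = [[doc_counts[term] for doc_counts in counts] for term in terms]
    return terms, matrix

# Hypothetical stemmed documents (not taken from the hadith collection).
docs = ["term1 term2", "term1 term1 term3", "term2 term3"]
terms, matrix = build_term_document_matrix(docs)

for term, row in zip(terms, matrix):
    print(term, row)
```

Rows correspond to terms and columns to documents, matching the layout described for Table 1.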

Calculating singular value decomposition and vector value for each document
Singular value decomposition (SVD) is a linear algebra theorem that splits the term-document matrix into three new matrices: an orthogonal matrix of left singular vectors (U), a diagonal matrix of singular values (S), and the transpose of an orthogonal matrix of right singular vectors (V) [27][28][29], as formulated in (1) and illustrated in Figure 2.

$$A = U S V^{T} \tag{1}$$

In (1), U is a matrix of size m × k and V is a matrix of size n × k. U and V have orthonormal columns, so that

$$U^{T} U = V^{T} V = I \tag{2}$$

and S is a diagonal matrix of size k × k. The entries on the main diagonal of S are the singular values of A. The results of the SVD can be better understood if A is written in a different way. If

$$U = (u_1\ u_2\ \cdots\ u_k), \quad S = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_k), \quad V = (v_1\ v_2\ \cdots\ v_k) \tag{3}$$

where the singular values $\sigma_i$, for i = 1, 2, ..., k, in (3) are sorted from the largest to the smallest, then keeping the large values $\sigma_i$ and discarding the small (near zero) values gives a good approximation of A. So, by using SVD, a matrix can be written as a sum of the components $u_i v_i^{T}$, for i = 1, 2, …, k, each weighted by its singular value $\sigma_i$, as in (4) [30].

$$A = \sum_{i=1}^{k} \sigma_i u_i v_i^{T} \tag{4}$$
SVD can identify and order the dimensions along which the data vary most. SVD takes the term-document matrix, which consists of words and documents as in Table 1, and breaks it down into linearly independent components. The result of the SVD process is a vector for each document, which is then used to calculate similarity.
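As an illustration of the decomposition, the following Python sketch computes the k largest singular values and vectors of a small matrix. It is a teaching sketch under stated assumptions, not the implementation used in this study: it obtains V from the eigenvectors of A^T A by power iteration with deflation, then recovers each column of U as A v / σ (the column form of U = A V S^-1). The helper names and the toy matrix A are hypothetical, and a numerical library would normally be used instead.

```python
import math

def transpose(M):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    Yt = transpose(Y)
    return [[sum(a * b for a, b in zip(row, col)) for col in Yt] for row in X]

def top_eigenpair(M, iterations=500):
    """Power iteration for the dominant eigenpair of a symmetric PSD matrix."""
    n = len(M)
    # Non-uniform start vector, so it is unlikely to be orthogonal
    # to the eigenvector being sought.
    v = [1.0 / (i + 1) for i in range(n)]
    for _ in range(iterations):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    Mv = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(a * b for a, b in zip(v, Mv)), v

def truncated_svd(A, k):
    """Return (U_cols, S, V_cols): the k largest singular triplets of A."""
    AtA = matmul(transpose(A), A)
    n = len(AtA)
    U_cols, S, V_cols = [], [], []
    for _ in range(k):
        lam, v = top_eigenpair(AtA)
        sigma = math.sqrt(max(lam, 0.0))
        if sigma < 1e-12:
            raise ValueError("k exceeds the numerical rank of A")
        # u_i = A v_i / sigma_i, i.e. one column of U = A V S^-1.
        u = [sum(row[j] * v[j] for j in range(n)) / sigma for row in A]
        U_cols.append(u)
        S.append(sigma)
        V_cols.append(v)
        # Deflation: remove the component just found from A^T A.
        AtA = [[AtA[i][j] - lam * v[i] * v[j] for j in range(n)]
               for i in range(n)]
    return U_cols, S, V_cols

# Hypothetical 3 x 3 term-document matrix (rows: terms, columns: documents).
A = [[1, 2, 0],
     [1, 0, 1],
     [0, 1, 1]]
U_cols, S, V_cols = truncated_svd(A, k=2)

# Rank-2 approximation: the sum of sigma_i * u_i * v_i^T over the kept terms.
A2 = [[sum(S[c] * U_cols[c][i] * V_cols[c][j] for c in range(2))
       for j in range(3)] for i in range(3)]
print([round(s, 4) for s in S])
```

Keeping only the first k terms of the sum gives the reduced-rank approximation that LSA relies on; the discarded terms are those with the smallest singular values.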

Calculating cosine similarity
Cosine similarity is used to calculate the cosine of the angle between the document vectors in a collection and the input vector [31,32]. The smaller the angle produced, the higher the level of similarity between the texts. The formula for cosine similarity is shown in (5):

$$\cos \alpha = \frac{A \cdot B}{|A|\,|B|} \tag{5}$$

where A is a document vector, B is the input vector, A · B is the dot product of vector A and vector B, |A| is the length of vector A, |B| is the length of vector B, |A||B| is the ordinary product of the two lengths |A| and |B|, and α is the angle formed between vector A and vector B.
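The cosine similarity calculation and the resulting ranking can be sketched in Python as follows. The function names, the example vectors, and the convention of returning 0.0 for a zero-length vector are assumptions made for illustration; the vectors are not values from this study.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention: a zero vector matches nothing
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return (document index, score) pairs, most similar first."""
    scores = [(idx, cosine_similarity(query_vec, vec))
              for idx, vec in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical 2-dimensional vectors, e.g. after a rank reduction to k = 2.
query = [1.0, 1.0]
documents = [[2.0, 2.0], [1.0, 0.0], [0.0, -1.0]]
ranking = rank_documents(query, documents)
print(ranking)
```

The first document points in the same direction as the query but is twice as long, and it still receives the highest score, which illustrates why the measure is insensitive to document length.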

RESULTS AND ANALYSIS
This section presents the results of the research together with a comprehensive discussion of how LSA and cosine similarity (CS) are implemented in searching for hadith information, and presents the evaluation results of the experiment conducted.

Pre-processing for text data
Text data is unstructured data that needs special treatment before mining or searching for the information contained in the text [30]. The pre-processing stage prepares text data so that it becomes a structured data representation. In general, the two types of structured data representation for text are bag of words and multiple of words [33,34]. Latent semantic analysis is one algorithm that produces structured text representations in the form of multiple of words, in which the text is represented not only by 1 word but also by more than 1 word, also known as an n-gram. Latent semantic analysis also considers the semantics between one word and another in the word collections.

TELKOMNIKA Telecommun Comput El Control  Latent semantic analysis and cosine similarity for hadith search engine (Wahyudin Darmalaksana) 221
Pre-processing of text data starts with converting all letters to lowercase, followed by deleting characters other than letters and patterns matched by regular expressions, expanding abbreviations to their original form where necessary, and deleting unimportant words (stop word removal). The final step is stemming, which reduces each word to its root form. In this study, the stemming process uses the Nazief & Adriani algorithm because the hadith text documents are written in Indonesian. The Nazief & Adriani algorithm is the most commonly used stemming algorithm for Indonesian because it follows the syntax of Indonesian [35][36][37][38][39]. The stemming results are used as input to latent semantic analysis, which forms the term-document matrix from the text data.
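The order of the pre-processing steps can be sketched in Python as follows. This is only an illustration of the pipeline, not the implementation used in this study: the stop-word set, the abbreviation table, and the lookup-based stemmer `lookup_stem` are small hypothetical stand-ins, and the lookup table in particular is a placeholder for a real Nazief & Adriani stemmer.

```python
import re

# Hypothetical resources; a real system would load complete lists.
STOP_WORDS = frozenset({"yang", "karena", "dia"})
ABBREVIATIONS = {"yg": "yang"}
STEM_LOOKUP = {"janganlah": "jangan", "berdusta": "dusta", "namaku": "nama"}

def lookup_stem(word):
    """Placeholder stemmer: a table lookup, NOT the Nazief & Adriani algorithm."""
    return STEM_LOOKUP.get(word, word)

def preprocess(text, stem=lookup_stem):
    """Lowercase, strip non-letters, expand abbreviations, drop stop words, stem."""
    text = text.lower()
    # Replace every character that is not a letter or whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    tokens = [ABBREVIATIONS.get(token, token) for token in tokens]
    tokens = [token for token in tokens if token not in STOP_WORDS]
    return [stem(token) for token in tokens]

print(preprocess("Janganlah kalian berdusta atas namaku,"))
```

Passing the stemmer as a parameter keeps the pipeline independent of the stemming algorithm, so a full Indonesian stemmer could be substituted without changing the other steps.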

Implementation of latent semantic analysis and cosine similarity on the hadith search engine
Latent semantic analysis is applied after text pre-processing is complete. The pre-processing results are formed into the term-document matrix, which is then decomposed by SVD to produce the matrices U, S, and V. The final stage is the application of cosine similarity to measure the similarity of the information generated and to arrange it by level of similarity. The flow of latent semantic analysis and cosine similarity implemented in this study can be seen in Figure 1. For instance, consider the following 3 hadith documents (presented in Indonesian):

Document 1: Janganlah kalian berdusta atas namaku, karena siapa yang berdusta atas namaku niscaya dia masuk neraka. (Do not lie in my name, because whoever lies in my name will surely go to hell.)

Document 2: Janganlah kalian berdusta terhadapku (atas namaku), karena barangsiapa berdusta terhadapku dia akan masuk neraka. (Do not lie about me (in my name), because whoever lies about me will go to hell.)

Document 3: Barangsiapa yang sengaja melakukan kedustaan atas namaku, maka hendaklah dia menempati tempat duduknya dari neraka. (Whoever deliberately lies in my name, let him occupy his seat in hell.)

Input keywords in hadith search engine: Jangan Dusta Masuk Neraka (Do not lie, go to hell)

The text data from these three documents and the input keywords are entered into the search engine and pre-processed, producing the following text data:

Document 1: jangan kalian dusta atas nama dusta
Document 2: jangan kalian dusta atas nama dusta masuk neraka
Document 3: sengaja dusta atas nama hendak tempat duduk neraka
Input keywords in hadith search engine: jangan dusta masuk neraka

The three prepared text data are then processed to form a term-document matrix like the one in Table 1, which gives the matrix A. The main step is to decompose the matrix A into 3 other matrices using SVD, starting from finding the value of $A^{T}A$ and ending with the cosine similarity calculation. The process of applying latent semantic analysis and cosine similarity to the term-document matrix is as follows:
− Find the value of $A^{T}A$.
− Find the value of the U matrix with the formula $U = A V S^{-1}$.
− After the matrices $U S V^{T}$ are obtained, reduce the rank of the matrix in order to reduce computing time, for example to rank k = 2.

From the results of this calculation, it can be concluded that the documents, ordered by their similarity to the input keywords, are document 1, document 3, and document 2.

Experiment and result evaluation
Testing is carried out by running all the hadith queries on the system. Recall and precision values are calculated using formulas (6) and (7) [38,39].

$$R = \frac{\text{Number of relevant items retrieved}}{\text{Total number of relevant items in the collection}} \tag{6}$$

$$P = \frac{\text{Number of relevant items retrieved}}{\text{Total number of items retrieved}} \tag{7}$$

where R is recall, obtained by comparing the number of relevant items retrieved with the total number of relevant items in the collection. Recall concerns the documents returned by the system in response to a user request. A high recall value alone does not show whether a system is good or not. P is precision, obtained by comparing the number of relevant items retrieved with the total number of items retrieved. Precision concerns the documents returned from the database that the user judges relevant to the needed information. The greater the precision of a system, the better the system can be said to be.
The purpose of the recall and precision test is to assess the search results returned by the system. Search results can be judged by their recall and precision levels. Precision can be considered a measure of accuracy, while recall is a measure of completeness. The precision value is the level of agreement between the information requested by the user and the answers given by the system, while the recall value is the success level of the system in rediscovering information. The results of the recall and precision tests and the time spent searching for the tested hadith can be seen in Table 2 and Figures 3 and 4.
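Formulas (6) and (7) can be illustrated with a short Python sketch. The function name `precision_recall`, the document identifiers, and the convention of returning 0.0 when a denominator is empty are assumptions for illustration; the numbers are not results from this study.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: identifiers of the documents returned by the system.
    relevant:  identifiers of the documents judged relevant in the collection.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant items retrieved
    # Assumed convention: an empty denominator yields 0.0.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 documents returned, 3 relevant ones in the collection.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(f"precision={p:.2%} recall={r:.2%}")
```

In this toy case, 2 of the 4 returned documents are relevant (precision 50%) and 2 of the 3 relevant documents are found (recall about 67%), which shows how a system can recall most relevant items while still returning many irrelevant ones.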

CONCLUSION
Based on 50 tests of the recall and precision values (Table 2), the hadith search engine can apply the latent semantic analysis algorithm and cosine similarity quite well. Hadith information was found successfully from the keywords, phrases, or sentences entered, as indicated by a recall value of 87.83%. However, the accuracy of the returned information, that is, its agreement with the user input, was only 36.25%, as indicated by the precision value. In general, the latent semantic analysis algorithm and cosine similarity used here are able to produce hadith information well. Several factors influenced the search results other than the possibility of an error in using the algorithm, including incomplete data and too much noise. The pre-processing stage is therefore very important for producing more accurate information, because it produces the text data used as input to the latent semantic analysis algorithm and so directly affects the search results. For further research, the stored hadith collection needs to be completed so that the search engine can learn and return more precise information. In addition, the information obtained could be not only sorted by similarity but also grouped according to meaning.

Figure 3. Result of relevant information
Figure 4. Result of precision and recall value

Table 2. Tested result of latent semantic analysis and cosine similarity