Content-based recommender system architecture for similar e-commerce products

Recommendation systems are well known and are increasingly used on e-commerce platforms for a variety of purposes. The recommendation technique used also varies greatly depending on the scope and the item being recommended. Content-based filtering, for example, is used to recommend related product items based on user preferences. Building a recommendation system architecture, however, starts with creating a data model for retrieving related product items. This paper offers a system architecture that considers a problem recommendation systems usually face at the outset, namely the cold start problem. The lack of user preference data is addressed by utilizing product item documents. Product item documents are processed using the TF-IDF algorithm and the Vector Space Model to generate a data model. A query can then be applied to find items similar to those the user has seen. The resulting architecture performed well in Recall and Precision testing. Tests were carried out on data weighted by product names and product labels. The results were an average Recall of 0.84 and an average Precision of 0.78.


I. Introduction
E-commerce is used not only by large traders but also by traders who are just starting out. The types of products also vary, ranging from building materials to foodstuffs such as agricultural products [1], and the number of platforms keeps growing [2]. E-commerce is growing rapidly, and one study predicts that the e-commerce market will grow by 47.5% in 2022 compared to 2019 [3]. The large number of products sold through e-commerce makes it challenging for sellers to offer their products. Products with the same name and type may number in the dozens on a single e-commerce platform. E-commerce platforms should therefore make it convenient for users to search for relevant products [4] [5]. Moreover, as the number of products sold increases, buyers need more time to evaluate the products they intend to purchase [6].
One such convenience is recommending other products based on the products users have searched for [7]. Recommendation systems have been widely applied by companies such as Amazon, Netflix, and Tokopedia to support their business [8]. Netflix uses its recommendation system for various purposes such as top movies, trending, movie similarity, and search [9] [10]. In the tourism sector, recommendation systems are also used to make it easier for prospective tourists to plan their trips [11]. Using a recommendation system in e-commerce brings several benefits, including increased sales and increased user loyalty [12]. A recommendation system can make it easier for users to find, obtain, and decide on products online [13].
A recommendation system can be designed using several methods, namely Content-Based Filtering, Collaborative Filtering, and Hybrid [14]. Content-Based Filtering (CBF) uses the content attributes of an item to determine the similarity between items when producing recommendations [15][16] [17]. Collaborative Filtering is a recommendation approach whose attributes come not from the content of an item or product but from the similarity or relationship of user data; it has two categories, User-Based and Item-Based [18]. The User-Based approach looks at the similarities between one user and another. The Item-Based approach looks at the similarity between items that are interacted with or assessed by users. The third approach is Hybrid, which combines the filtering methods of the recommendation system [19]. This paper discusses a recommender system architecture for e-commerce that determines which products are related and similar to the product items that users have searched for and seen. The aim is to determine an appropriate system architecture and mechanism for implementing a recommendation system on an e-commerce platform. This paper uses a simplified example e-commerce platform with a data set taken from one existing platform. Only general data from the product item documents were taken, namely the product name and product description [20].
The recommendation system architecture built here uses only content-based filtering techniques. This choice is based on a problem that a recommendation system will inevitably face when only the product documents are available [21]. This problem is commonly called the cold start problem, which prevents content-based filtering from running correctly because user preferences have not yet been obtained [22].
The cold start problem can be solved with a simple mechanism: using log data of user activities when searching for and finally viewing an item [23]. This data can serve as a simple preference whose similarity with the product item model data can then be found. The recommendation system architecture built covers how the model data is stored and how it can be used as a source to find similarities between items with an adequate level of Precision.

II. Method
This paper proposes a content-based recommendation system architecture that produces product items that are similar and match the preferences of e-commerce platform users. The text mining methodology [24] is shown in Fig. 1, namely data collection, data preprocessing, weighting using TF-IDF, forming data using a vector space model, and checking similarities with cosine similarity [25]. The resulting data model is stored in JSON form, and item similarity checks are exposed through a RESTful API. Before development started, an e-commerce platform consisting of two pages was prepared. The first page is the front page of the e-commerce platform, and the second is the product detail page. The system scenario assumes that the user explores the system by viewing products on the front page and then continuing to the product detail page. The product item data used is sample data taken from an e-commerce site in Indonesia. The data taken are product names, product prices, pictures, and product descriptions.
Fig. 2 shows an outline of the recommendation system workflow on the product detail page, starting when the user browses products. The system sends a request for the product that the user views. The product id is then sent to the Recommender Engine, after which the product id is stored in the database. The Recommender Engine calculates the similarity between that product's data and the other products in the database. The Recommender Engine contains two parts, Hook and Core. The Hook is in charge of receiving data from the database and receiving the payload from the server. The data are then passed to the Core, which carries out the stages of determining a recommendation. The Core performs text preprocessing followed by TF-IDF weighting. Once the weights have been obtained, the data is converted into a vector space model so that similarity can be calculated with Cosine Similarity. The response consists of the product that the user views and the other recommended products. When the user returns to the home page, the system displays the recommendations obtained from previous activity on the product detail page.
When a user searches for a product, the server accepts the request and looks up the product in the database. The server then receives the matching data and displays it to the user. The server also sends the user's search data to the Recommender Engine, where the Hook captures it. The data, in the form of the user id and the product id being searched for, is stored in the database, and the Recommender Engine retrieves data from the database to compute the similarity between the searched product and the other products using the TF-IDF algorithm and the Vector Space Model. The resulting recommendations are then sent to the user. Thus, every time a user views a product, a list of other products similar to the one being viewed is displayed. Moreover, when users open the feed menu, recommendations appear based on products they have seen before.
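The Hook and Core flow described above can be sketched in a few lines of JavaScript. This is a simplified illustration and not the authors' implementation: the class name RecommenderEngine, the in-memory viewLog array standing in for the database, and the Jaccard token overlap used in place of TF-IDF with Cosine Similarity are all assumptions made to keep the sketch self-contained.

```javascript
// Hypothetical sketch of the Hook/Core split; names and storage are assumptions.
class RecommenderEngine {
  constructor(products, similarity) {
    this.products = products;     // Map(productId -> token array)
    this.similarity = similarity; // (a, b) => number, higher means more similar
    this.viewLog = [];            // stands in for the database table of views
  }

  // Hook: receive the payload from the server, store it, hand over to Core.
  hook({ userId, productId }) {
    if (!this.products.has(productId)) {
      throw new Error(`Unknown product: ${productId}`);
    }
    this.viewLog.push({ userId, productId });
    return this.core(productId);
  }

  // Core: score every other product against the viewed one and rank them.
  core(productId, topN = 5) {
    const viewed = this.products.get(productId);
    return [...this.products]
      .filter(([id]) => id !== productId)
      .map(([id, item]) => ({ id, score: this.similarity(viewed, item) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topN);
  }

  // Feed: recommendations based on the product this user viewed most recently.
  feed(userId) {
    const last = [...this.viewLog].reverse().find((e) => e.userId === userId);
    return last ? this.core(last.productId) : [];
  }
}

// Stand-in similarity (assumption): Jaccard overlap of token sets.
const jaccard = (a, b) => {
  const setA = new Set(a);
  const setB = new Set(b);
  const shared = [...setA].filter((t) => setB.has(t)).length;
  const union = new Set([...setA, ...setB]).size;
  return union === 0 ? 0 : shared / union;
};

// Hypothetical products, already tokenized.
const products = new Map([
  ["p1", ["bayam", "segar"]],
  ["p2", ["bayam", "ikat"]],
  ["p3", ["cabai", "merah"]],
]);

const engine = new RecommenderEngine(products, jaccard);
const recs = engine.hook({ userId: "u1", productId: "p1" });
console.log(recs.map((r) => r.id)); // p2 ranks ahead of p3
```

Keeping the similarity function as a constructor argument means the Core can later be switched to the TF-IDF and Cosine Similarity computation without changing the Hook.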
The system testing conducted in this study aims to determine how suitable the recommended products are. Testing is done by calculating the Recall and Precision [26] values for several randomly selected product samples and then averaging the values obtained. The system is tested under three schemes. In the first test, the document attributes used for the recommendation are the product name and description. In the second test, the document attribute used is the product name only. In the third test, the document attributes used are the product name and the product label.
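As an illustration of how the three test schemes differ, the following sketch builds the text that would be indexed for a product under each scheme. The field names (name, description, label), the scheme keys, and the sample product are hypothetical and do not come from the paper's data set.

```javascript
// Hypothetical composition of the indexed text for each test scheme.
const SCHEMES = {
  nameAndDescription: (p) => `${p.name} ${p.description}`, // first test
  nameOnly: (p) => p.name,                                 // second test
  nameAndLabel: (p) => `${p.name} ${p.label}`,             // third test
};

function buildDocument(product, scheme) {
  const compose = SCHEMES[scheme];
  if (!compose) throw new Error(`Unknown scheme: ${scheme}`);
  return compose(product);
}

// Invented sample product for illustration only.
const product = {
  name: "Bayam Segar",
  description: "Pengiriman setiap hari",
  label: "sayuran daun",
};

for (const scheme of Object.keys(SCHEMES)) {
  console.log(scheme, "=>", buildDocument(product, scheme));
}
```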

A. Datasets Collection Results
The data taken covers foodstuffs, agricultural products, and livestock products, such as vegetables, fruits, spices, and the like. Approximately 868 records were obtained as the data set. Each record is then labeled according to its category. Table 1 shows the 20 labels of the data set obtained. The labels distinguish specific foods; vegetables, for example, are distinguished by shape, plant part, and plant species. In Table 1, Total is the total number of documents in the database records and Label is the document label.

B. Text Preprocessing
The preprocessing stage aims to clean the product names and product descriptions of unnecessary words [27]. Text preprocessing follows four stages. The first stage is case folding, which converts all text into a standard format, here lowercase, so any text containing capital letters is first changed to lowercase. The original data and the case folding results are shown in Fig. 3. At this stage, the toLowerCase() function is used, which is a JavaScript string method. The second stage is tokenization, which converts documents into tokens or words. This stage uses the WordTokenize() function in the Natural.js library. The result of the tokenization process can be seen in Fig. 4, which shows that the description sentence has been split into separate words. The third stage is stopword removal, which aims to remove words that carry no meaning. This stage uses a stopword list data set obtained from Kaggle. The list is matched against the words in the document, and any word found on the stopword list is deleted. The last stage is stemming, which removes affixes from words to leave only the root words. At this stage, the literary algorithm [28] is used for Indonesian text and Porter's algorithm [29] for English text. The final results of the preprocessing stage are shown in the last part of Fig. 5.
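The four stages can be sketched in plain JavaScript as follows. This is a minimal illustration under stated assumptions: the stopword set is a tiny hypothetical stand-in for the Kaggle list, the tokenizer is a simple regular expression rather than the Natural.js tokenizer, and the stemmer is a placeholder that returns tokens unchanged where a real system would plug in an Indonesian or Porter stemmer.

```javascript
// Hypothetical stopword list; the paper uses a list obtained from Kaggle.
const STOPWORDS = new Set(["dan", "yang", "di", "untuk", "the", "and", "of"]);

// Placeholder stemmer (assumption): returns the token unchanged.
const identityStemmer = (token) => token;

function preprocess(text, stopwords = STOPWORDS, stem = identityStemmer) {
  // 1. Case folding: convert everything to lowercase.
  const folded = text.toLowerCase();
  // 2. Tokenization: keep runs of ASCII letters and digits as tokens.
  const tokens = folded.match(/[a-z0-9]+/g) || [];
  // 3. Stopword removal, then 4. stemming of the remaining tokens.
  return tokens.filter((t) => !stopwords.has(t)).map(stem);
}

console.log(preprocess("Bayam Segar dan Daun Bawang untuk Bumbu"));
// tokens kept: bayam, segar, daun, bawang, bumbu
```

Passing the stopword set and the stemmer as parameters keeps the pipeline the same for Indonesian and English text; only the arguments change.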

C. TF-IDF process
The TF-IDF process is carried out to count terms and determine how important each is within the data set [30]. TF-IDF starts by calculating the frequency of each term in a document (TF), a process that generates the bag-of-words. The process continues by counting the number of documents that contain a specific term (DF). After that, the Inverse Document Frequency (IDF) is calculated, and finally the TF value is multiplied by the IDF. Document weighting is done using the TfIdf() function in the Natural.js library.
Step 1 is to calculate how often each term appears in each existing document (TF).
The results are shown in Table 2.

Result information:
- bayam occurs 2 times in document 1 (D1)
- segar occurs 2 times in document 1 (D1)
- note occurs 1 time in document 1 (D1)
- sedia occurs 1 time in document 1 (D1)
- bumbu occurs 1 time in document 1 (D1)
Step 2 is to calculate df, the number of documents that contain a specific word. The df calculation is based on Table 2. The results of the df calculation are shown in Table 3.

Result information:
- The df of "bayam" is 1 because only one of the four documents, document 1, contains the word "bayam".
- The df of "segar" is 3 because three of the four documents, documents 1, 2, and 4, contain the word "segar".
- The df of "note" is 1 because only one of the four documents, document 1, contains the word "note".
- The df of "sedia" is 1 because only one of the four documents, document 1, contains the word "sedia".
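Steps 1 and 2 can be sketched as two small functions. The corpus below is hypothetical: document 1 is written so that it reproduces the counts reported in the text, while documents 2 to 4 are invented so that "segar" appears in documents 1, 2, and 4.

```javascript
// Term frequency: occurrences of each term in one tokenized document.
function termFrequency(tokens) {
  const tf = new Map();
  for (const token of tokens) {
    tf.set(token, (tf.get(token) || 0) + 1);
  }
  return tf;
}

// Document frequency: number of documents containing each term at least once.
function documentFrequency(docs) {
  const df = new Map();
  for (const tokens of docs) {
    for (const term of new Set(tokens)) {
      df.set(term, (df.get(term) || 0) + 1);
    }
  }
  return df;
}

// Hypothetical four-document corpus (D2 to D4 are invented for illustration).
const docs = [
  ["bayam", "segar", "bayam", "segar", "note", "sedia", "bumbu"], // D1
  ["daun", "bawang", "segar"],                                    // D2
  ["cabai", "merah"],                                             // D3
  ["tomat", "segar"],                                             // D4
];

const tfD1 = termFrequency(docs[0]);
const df = documentFrequency(docs);
console.log(tfD1.get("bayam"), tfD1.get("segar")); // 2 2
console.log(df.get("bayam"), df.get("segar"));     // 1 3
```

Wrapping each document in a Set before counting is what makes df count documents rather than occurrences.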
Step 3 is to calculate the IDF. The equation used is (1):
$$\mathrm{idf}_t = \ln\left(\frac{N}{\mathrm{df}_t}\right) \tag{1}$$
where
- ln: natural logarithm, that is, the logarithm to base e, where e is Euler's number, e ≈ 2.718281828459
- N: number of documents
- df: document frequency, the number of documents in which the term occurs
The results of the IDF calculation are shown in Table 4.
Step 4 is to calculate the TF-IDF. The TF-IDF calculation uses (2):
$$W_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t \tag{2}$$
Table 5 gives an example of calculating tf-idf. This weighting process produces a value or weight for each token. The high-weight tokens are bayam, segar, daun, bawang, segarikat.
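Steps 3 and 4 can be illustrated with the tf and df values stated in the text for document 1 and N = 4 documents. The sketch assumes the unsmoothed form idf = ln(N / df) and the weight W = tf × idf; library implementations may apply smoothing, so the numbers here are illustrative and need not match Tables 4 and 5.

```javascript
// IDF under the assumption idf = ln(N / df), with N the number of documents.
function inverseDocumentFrequency(df, totalDocs) {
  const idf = new Map();
  for (const [term, count] of df) {
    idf.set(term, Math.log(totalDocs / count));
  }
  return idf;
}

// TF-IDF weight of every term in one document: W = tf * idf.
function tfidfWeights(tf, idf) {
  const weights = new Map();
  for (const [term, count] of tf) {
    weights.set(term, count * (idf.get(term) || 0));
  }
  return weights;
}

// tf and df values as stated in the text for document 1.
const tfD1 = new Map([["bayam", 2], ["segar", 2], ["note", 1], ["sedia", 1]]);
const df = new Map([["bayam", 1], ["segar", 3], ["note", 1], ["sedia", 1]]);

const idf = inverseDocumentFrequency(df, 4);
const weights = tfidfWeights(tfD1, idf);
for (const [term, w] of weights) {
  console.log(term, w.toFixed(3));
}
// Under the assumed formula: bayam 2.773, segar 0.575, note 1.386, sedia 1.386
```

A term such as "segar" that occurs in most documents receives a low weight even though it is frequent in document 1, which is the intended effect of the IDF factor.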

D. Cosine Similarity
Every word in the query and in each document is converted into a vector model in order to calculate Cosine Similarity. The Cosine Similarity calculation produces a similarity value between the query and each available document. Each document is compared with the other documents to obtain the similarity between one document and another. Table 6 is an example of calculating the cosine similarity. The next step is to calculate the total weight values of the query and of the document.
After getting the total values of the query and document weights, the next step is to calculate the square root of each total. After Cosine Similarity has been calculated for the data, the resulting scores are sorted into a ranking to obtain the recommendations. The score from Cosine Similarity lies between 0 and 1. The higher the value, the more similar the recommended document is to the query document.
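The similarity and ranking steps can be sketched as follows, using the standard cosine formula (dot product divided by the product of the vector lengths). The vectors are hypothetical weights chosen for illustration and are not values from Table 6; the function names are likewise assumptions.

```javascript
// Cosine similarity between two sparse vectors stored as Map(term -> weight).
function cosineSimilarity(a, b) {
  let dot = 0;
  for (const [term, w] of a) {
    dot += w * (b.get(term) || 0);
  }
  // Vector length: square root of the sum of squared weights.
  const norm = (v) =>
    Math.sqrt([...v.values()].reduce((sum, w) => sum + w * w, 0));
  const denom = norm(a) * norm(b);
  return denom === 0 ? 0 : dot / denom;
}

// Rank all documents against a query vector, highest score first.
function rank(query, docs, topN = 3) {
  return docs
    .map(({ id, vector }) => ({ id, score: cosineSimilarity(query, vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topN);
}

// Hypothetical weights, not taken from the paper's tables.
const query = new Map([["bayam", 1.0], ["segar", 0.5]]);
const docs = [
  { id: "D2", vector: new Map([["daun", 1.0], ["segar", 1.0]]) },
  { id: "D3", vector: new Map([["cabai", 1.0]]) },
  { id: "D1", vector: new Map([["bayam", 2.0], ["segar", 1.0]]) },
];

console.log(rank(query, docs).map((r) => r.id)); // D1 first, then D2, then D3
```

Because TF-IDF weights are non-negative, the score stays between 0 and 1; a document that shares no terms with the query scores 0.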

E. Testing
Testing was done by building a recommendation model from the data set of 868 records and then calculating similarity with Cosine Similarity. The test calculates the Recall and Precision values. The greater the Recall and Precision values, the better the recommendation system with the Content-Based Filtering method provides appropriate product recommendations. There are three test schemes. In the first, the recommendation system recommends products based on the description and name of the document. In the second, it recommends products based on the name of the document only. In the third, it recommends products based on the document's name and product label. The first scheme test can be seen in Table 7. In the second scheme test, the data used to calculate the weighting is the product name only; the test results can be seen in Table 8. In the third scheme test, the data used for weighting are product names and product labels, where the labels later serve as product categories; the test results can be seen in Table 9. The Recall and Precision values produced by these tests differ significantly between the first test scheme and the second and third test schemes. The first test scheme produces average Recall and Precision values below 0.60, while the second and third test schemes produce identical mean values. The low values in the first scheme occur because product descriptions in the online store are often not relevant to the product name. For example, a merchant commonly fills the product description with a description of the shop or its conditions, such as delivery times or the delivery methods available. This makes the recommendation results unsuitable. It can be concluded that such descriptions produce a large amount of text unrelated to the product, so product data must have an exact name and description, with no additional sentences that serve no purpose.
Testing this recommendation system produces different Recall and Precision values between the first test scheme and the second and third. The first scheme, as shown in Fig. 6, produces an average Recall of 0.30 and an average Precision of 0.59, while the second and third schemes produce an average Recall of 0.84 and an average Precision of 0.78.
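The per-sample Recall and Precision and their averages can be computed as in the sketch below. It uses the standard definitions (Precision is the share of recommended items that are relevant, Recall is the share of relevant items that were recommended). The sample ids and relevance sets are hypothetical, and how relevance was judged in the paper's tests is not assumed here.

```javascript
// Precision = relevant recommended / all recommended.
// Recall    = relevant recommended / all relevant.
// Assumes the recommended list contains no duplicate ids.
function precisionRecall(recommended, relevant) {
  const relevantSet = new Set(relevant);
  const hits = recommended.filter((id) => relevantSet.has(id)).length;
  return {
    precision: recommended.length ? hits / recommended.length : 0,
    recall: relevantSet.size ? hits / relevantSet.size : 0,
  };
}

const average = (values) =>
  values.reduce((sum, v) => sum + v, 0) / values.length;

// Hypothetical test samples, invented for illustration.
const samples = [
  { recommended: ["p1", "p2", "p3", "p4"], relevant: ["p1", "p2", "p5"] },
  { recommended: ["p6", "p7"], relevant: ["p6", "p7"] },
];

const results = samples.map((s) => precisionRecall(s.recommended, s.relevant));
const avgPrecision = average(results.map((r) => r.precision));
const avgRecall = average(results.map((r) => r.recall));
console.log(avgPrecision.toFixed(2), avgRecall.toFixed(2)); // 0.75 0.83
```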

Table 7. Testing the first scheme

Table 8. Testing the second scheme

Table 9. Testing the third scheme