Research on Identification Method of Anonymous Fake Reviews in E-commerce

This paper proposes a new method for identifying anonymous fake reviews generated by click farmers in E-commerce, with the aim of improving identification rates. Anonymous fake reviews differ from genuine reviews and can be distinguished by the credibility of users, the average daily number of evaluations, the content similarity, the degree of word overlap, and the ratio between sales volume and shop running time. The proposed method uses these five features to compute the volume of fake reviews by constructing a multivariate linear regression model. Experiments show that this preliminary work performs well in identifying fake reviews on Chinese E-commerce websites. The extracted features are also useful for identifying fake reviews when the reviewer's identity is not accessible.


Introduction
E-commerce and opinion-sharing websites have gained popularity following the emergence of Web 2.0 [1]. Online commenting and reviewing are widely adopted on E-commerce platforms such as Amazon, Taobao, and Tmall, which allow users to exchange their purchasing or consuming experiences. Positive reviews can increase a shop's reputation and attract more customers, while negative ones may cause potential sales loss [2]. Consumers use these reviews not only to receive word-of-mouth (WOM) information on products, such as quality, suitability, and utility, but also to contribute their own reviews to advise other consumers [3]. As online reviews have become a critical factor in the E-commerce market, hundreds of click farming groups have unfortunately sprung up in recent years, delivering bundles of fake positive reviews to merchants who want a quick and dirty way to boost their popularity. Consumers who rely on the fake reviews posted by click farmers when making purchase decisions may be disappointed when the product does not meet their expectations [4]. Moreover, many E-commerce platforms protect user privacy by providing anonymous evaluation services, which potentially facilitates the generation of fake reviews. Data security is now the biggest issue for consumers in E-commerce [5], so the analysis of anonymous fake reviews has drawn many researchers' attention. The purpose of investigating fake reviews in E-commerce is to identify them and help consumers obtain genuine information about the product or service they want to consume. Identifying anonymous fake reviews in E-commerce is therefore necessary work.
However, very few researchers have explored anonymous fake review identification so far. Y. Fu and B. H. Dong [6] provide a model to extract fake reviews from the Taobao and Tmall websites, but their method relies on internal business data and thus may not be applicable to other E-commerce platforms. This research sets up a model that identifies anonymous fake reviews relying purely on public reviews. Instead of using a special customer database, our model achieves this goal through review feature extraction and model training on public reviews.
To address the problems mentioned above, the paper first proposes five assumptions for feature extraction based on an analysis of the evaluation process. A concept named VoFR is then defined to quantify fake reviews, and a supervised model for finding click farming products and detecting fake reviews is developed with VoFR. Finally, the experiment shows that the supervised VoFR model achieves an accuracy of 92.5%, a precision of 94.7%, and a recall of 90%. The contribution of the paper is to fill the gap in the identification of anonymous fake reviews.
The rest of the paper is organized as follows. Related work is reviewed in Section 2. Based on an analysis of the evaluation process, we propose five assumptions in Section 3, followed by the extraction of the feature functions and the supervised VoFR model in Section 4. Section 5 presents our detailed experimental results. We summarize this work in Section 6.

Related Work
Research on fake reviews falls roughly into two categories: identifying the text of fake reviews and identifying the click farmers.
For fake text detection, Jindal et al. [7][8][9] first proposed the concept of fake reviews in 2007. They collected reviews from Amazon, manually labelled the fake reviews, and employed logistic regression to identify them. Lai et al. [10] proposed a recognition method based on a unigram model, in which the grammar and format of the text are used as features for classification. Ott et al. [11] formulated the identification of fake reviews as a binary classification problem. In addition, fake reviews are identified solely through their text in [12][13][14]. Even though great progress has been made in text-based fake review detection in recent years, these methods require a high-level understanding of the words in the reviews, which is hard to achieve since fake reviews are intended to mislead consumers.
For detecting click farmers, researchers assume that click farmers generate more fake reviews than genuine reviews. References [4,15] track consumer behavior to detect click farmers. Liu et al. [16] exploited an 'undesired rule' to identify fake reviewers. However, the aforementioned algorithms can only identify a specific category of click farmers. Mukherjee et al. [17,18] considered the generation of fake reviews as a group behavior and proposed a triple-cross model that identifies click farmers by integrating group features and individual features. References [19,20] considered the relations among reviewers, reviews, and shops to identify click farmers.
In this paper, neither the userID nor the historic activity of consumers is available under the anonymous condition, so we propose a new way to solve the problem: fake reviews are identified through feature extraction and model training. The details of the method are as follows.

Click Farming Evaluation Processing
To better understand fake reviews and the characteristics of click farmers, we investigated the pipeline of generating fake reviews on Taobao and Tmall, the two largest E-commerce platforms in China, and studied a detailed model of click farming. As shown in Figure 1, the survey found that fake reviews are generated in exactly the same way as genuine reviews. Because reviews are made anonymously, there is no way to access the real identity of the consumers, which makes the investigation difficult. Therefore, we propose five assumptions to describe anonymous reviews, concerning the users' credibility, the average daily number of evaluations, the similarity between reviews and the product description, the overlap between reviews, and the ratio between sales volume and shop running time. Features are extracted based on these assumptions and then used to train a model to identify fake reviews.

Users Credibility
Once consumers have finished shopping and evaluating, merchants on Taobao and Tmall give them a positive or negative rating. A positive rating adds one point and a negative rating deducts one point, which results in different user credibility for different consumers. Thus, the more consumers use their ID for shopping, the more credibility they have.
User credibility reflects both the number of goods that the consumer has purchased and the credit that the consumer has. The credibility cannot be hidden even if the consumer chooses anonymity. According to our investigation, a click farmer's salary partially depends on credibility: higher credibility corresponds to a higher salary. However, we found that E-commerce platforms always restrict the maximum number of reviews that one account can post in a certain period, so click farmers usually have multiple userIDs whose identity has not been verified by the platform. These auxiliary accounts have fewer buying records and lower user credibility, which in the end lowers the average credibility of a product whose reviews are generated by click farmers. Also, since hiring a click farmer with higher credibility costs more than hiring one with lower credibility, merchants tend to hire more click farmers to generate more reviews instead of hiring someone with more experience. Hence, we assume that a negative correlation exists between user credibility and the number of fake reviews that a product receives.
Assumption I: The average user credibility of a product with more fake reviews is lower than the average user credibility of a product with more genuine reviews.

Average Daily Number of Evaluations
The evaluation time can be obtained from the website. Almost all merchants who choose click farming do so to improve their sales and ratings. High sales signal a better reputation and better products, so shops with higher reputation and sales attract more consumers, and their daily number of reviews increases. Merchants who resort to click farming share a common characteristic: their average daily number of evaluations is lower.
Assumption II: The average daily number of evaluations of a click farming product is lower than that of a normal product.

Similarity of Reviews and Products Description
Since fake reviews are not based on any customer experience, their contents are always monotonous and tiresome. The only way for click farmers to obtain information about the product is to read its online description. Because fake reviews are usually generated and organized based on this description, the similarity between the fake reviews and the product description is high.
Assumption III: The similarity between the fake reviews and the description of the product is higher than the similarity between the genuine reviews and the description of the product.

Overlap between Reviews
According to the expectancy theory in reference [21], the motivational force experienced by an individual to select one behavior from a larger set is a function of the perceived likelihood that the behavior will result in the attainment of various outcomes, weighted by the desirability of these outcomes to the person. Since the reward for a review is generally between 0.3 and 1.3 dollars, publishing a carefully written review is not worthwhile, and click farmers, who lack real customer experience, usually choose to edit and reorganize previous reviews. Therefore, fake reviews frequently plagiarize each other. Reference [22] describes COPS, a method for detecting text plagiarism that detects document overlap by relying on string matching and sentences. Following this method, the overlap between reviews is calculated.
Assumption IV: The overlap between fake reviews is higher than the overlap between genuine reviews.

Ratio between Sales Volume and Shop Running Time
Based on statistics, we found that there is a relatively constant ratio between the selling volume and the time since the online shop was established. A click farming shop usually has a higher selling volume but a short running time.
Assumption V: A click farming product has a lower ratio between selling volume and selling age than a normal product.

Supervised VoFR Model
We propose a new method to identify click farming products by computing the volume of fake reviews (VoFR). The VoFR is computed from the features proposed in Section 3 using a multi-linear regression model, and the result of the regression is the VoFR model. Here, the VoFR of a click farming product is 1 and the VoFR of a normal product is 0. The model is defined as follows:
$$\mathrm{VoFR}(p_i) = \omega_0 + \omega_1 F_C(p_i) + \omega_2 F_T(p_i) + \omega_3 F_S(p_i) + \omega_4 F_O(p_i) + \omega_5 F_R(p_i)$$
where $F_C(p_i)$, $F_T(p_i)$, $F_S(p_i)$, $F_O(p_i)$, and $F_R(p_i)$ respectively represent the feature functions of the users' credibility, the average daily number of evaluations, the similarity between reviews and the product description, the overlap between reviews, and the ratio between sales volume and registration time, and $\omega_0, \dots, \omega_5$ are the weights, which are learned from the training dataset.
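As an illustration of the regression step, the following is a minimal sketch that fits the six weights by ordinary least squares and applies the 0.5 critical value used later in the experiments. The function names (`fit_vofr`, `vofr`, `is_click_farming`) and the synthetic feature values are assumptions made for this example; they are not the authors' implementation or data.

```python
import numpy as np

def fit_vofr(features, labels):
    """Least-squares fit of VoFR = w0 + w1*f1 + ... + w5*f5.

    features: (n_products, 5) array of normalized feature values.
    labels:   1 for a click farming product, 0 for a normal product.
    """
    X = np.asarray(features, dtype=float)
    y = np.asarray(labels, dtype=float)
    design = np.column_stack([np.ones(len(X)), X])  # leading column of ones for w0
    weights, *_ = np.linalg.lstsq(design, y, rcond=None)
    return weights

def vofr(weights, features):
    """Evaluate the fitted linear model on one or more products."""
    X = np.atleast_2d(np.asarray(features, dtype=float))
    return weights[0] + X @ weights[1:]

def is_click_farming(weights, features, threshold=0.5):
    """A product is flagged when its VoFR output exceeds the critical value."""
    return vofr(weights, features) > threshold

# Synthetic, hypothetical training data: 20 flagged and 20 normal products.
rng = np.random.default_rng(0)
train_X = np.vstack([rng.uniform(0.6, 1.0, size=(20, 5)),
                     rng.uniform(0.0, 0.4, size=(20, 5))])
train_y = np.array([1] * 20 + [0] * 20)
w = fit_vofr(train_X, train_y)
print("learned weights:", np.round(w, 3))
```

Ordinary least squares is only one way to learn the weights; the paper states that they are learned from the training dataset without naming the solver.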

Credibility Feature Function
Although the userID is anonymous and no historic information about a click farmer can be tracked, the user's credibility can still be obtained. The weight is assigned based on Table 1. The subscript of the weight denotes the review of the product. In Taobao, 0 Golden Crown represents the user with the lowest credibility. The user with the greatest credibility is assigned the lowest weight, to ensure that the credibility of a product falls into the interval [0, 1].
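The credibility feature can be sketched as a per-product average of review weights followed by max-min normalization across products. The level-to-weight mapping below is hypothetical, since the values of Table 1 are not reproduced in this text; it only respects the stated rule that greater credibility receives a lower weight. The names `LEVEL_WEIGHT` and `credibility_feature` are likewise assumptions for this example.

```python
# Hypothetical mapping from a reviewer's credit level (0 = lowest) to a weight.
# The real values come from Table 1 and are NOT reproduced here.
LEVEL_WEIGHT = {0: 1.0, 1: 0.75, 2: 0.5, 3: 0.25, 4: 0.0}

def credibility_feature(levels_by_product, level_weight=LEVEL_WEIGHT):
    """Average the review weights of each product, then max-min normalize."""
    raw = {
        product: sum(level_weight[level] for level in levels) / len(levels)
        for product, levels in levels_by_product.items()
    }
    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:  # degenerate case; returning 0.0 is an assumption
        return {product: 0.0 for product in raw}
    return {product: (value - lo) / (hi - lo) for product, value in raw.items()}

# Credit levels of the (anonymous) reviewers of three illustrative products.
scores = credibility_feature({
    "A": [0, 0, 1],   # mostly low-credibility accounts
    "B": [3, 4, 4],   # mostly high-credibility accounts
    "C": [2, 2, 2],
})
print(scores)
```

Under this mapping, a product reviewed mainly by low-credibility accounts receives a value near 1, which matches the direction of Assumption I.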

Time Feature Function
The time feature function describes the average number of evaluations during one day and is represented by
$$F_T'(p_i) = \frac{|R_i|}{T}, \qquad F_T(p_i) = 1 - \frac{F_T'(p_i) - \min(F_T'(p))}{\max(F_T'(p)) - \min(F_T'(p))}$$
where $R_i$ represents all the reviews of $p_i$ and $T$ represents the total number of days of collection needed to obtain $R_i$.
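A small numerical sketch of the time feature follows, assuming the raw value is the number of collected reviews divided by the number of collection days and that the normalized value is inverted (one minus the max-min score), so that products with few daily evaluations score high. The function name `time_feature` and the review counts are illustrative assumptions.

```python
def time_feature(review_counts, days):
    """Inverted max-min normalization of the average daily number of reviews."""
    raw = [count / days for count in review_counts]
    lo, hi = min(raw), max(raw)
    if hi == lo:  # all products identical; returning 0.0 is an assumption
        return [0.0] * len(raw)
    return [1 - (value - lo) / (hi - lo) for value in raw]

# Three illustrative products observed over the same 60-day window.
values = time_feature([30, 600, 1200], days=60)
print(values)
```

The product with the fewest daily evaluations maps to 1 and the busiest product maps to 0, in line with Assumption II.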

Similarity Feature Function
Since click farmers do not have customer experience, their reviews are always based on the description of the product, so the similarity between the reviews and the product description can be used to characterize the genuineness of a review.

Figure 2. Calculation of the similarity
The similarity between each review and the product description is calculated to measure the objectivity of the review. The higher the similarity between a review and the product description, the fewer emotional words the review uses and the more objective it is. Based on Assumption III, the more objective the review, the more likely it is to be a fake review.
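The similarity measure itself is not spelled out in the text above. As a hedged sketch, the example below assumes term-frequency vectors and cosine similarity, averaged over the reviews of one product. The function names and the toy tokens are assumptions; real Taobao and Tmall reviews would first need Chinese word segmentation, which is outside this sketch.

```python
import math
from collections import Counter

def cosine_similarity(review_tokens, description_tokens):
    """Cosine similarity between two term-frequency vectors."""
    r, d = Counter(review_tokens), Counter(description_tokens)
    dot = sum(r[t] * d[t] for t in r.keys() & d.keys())
    norm = (math.sqrt(sum(c * c for c in r.values()))
            * math.sqrt(sum(c * c for c in d.values())))
    return dot / norm if norm else 0.0

def similarity_feature(reviews, description_tokens):
    """Average similarity between each review and the product description."""
    return sum(cosine_similarity(r, description_tokens) for r in reviews) / len(reviews)

description = ["neck", "shoulder", "massager", "heating", "portable"]
reviews = [
    ["portable", "heating", "massager"],       # echoes the description
    ["arrived", "late", "but", "works"],       # shares no words with it
]
print(similarity_feature(reviews, description))
```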

Overlap Feature Function
A word set is built for each review of product $p_i$, excluding any duplicated words. The average overlap value between two word sets is computed as follows:
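One possible reading of this step, given as a sketch only, measures the overlap of two word sets with the Jaccard index (shared words divided by all distinct words) and averages it over every pair of reviews of a product. The Jaccard choice and the names `pair_overlap` and `overlap_feature` are assumptions for this example and are not confirmed by the text.

```python
from itertools import combinations

def pair_overlap(a, b):
    """Jaccard overlap of two word sets (an assumed overlap measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def overlap_feature(reviews):
    """Average overlap over all pairs of word sets of one product."""
    word_sets = [set(tokens) for tokens in reviews]  # duplicates removed
    pairs = list(combinations(word_sets, 2))
    if not pairs:  # fewer than two reviews
        return 0.0
    return sum(pair_overlap(a, b) for a, b in pairs) / len(pairs)

# Two reviews sharing one of three distinct words.
print(overlap_feature([["good", "product"], ["product", "fast"]]))
```

Reviews that are copies or light edits of one another push this value toward 1, which is the pattern Assumption IV expects for click farming products.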

Ratio Feature Function
An analysis of the selling volume finds that the selling volume of click farming shops is lower than that of normal shops, while their registration times differ only slightly. Hence we add this feature to our model:
$$F_R'(p_i) = \frac{G_i}{L_i}, \qquad F_R(p_i) = 1 - \frac{F_R'(p_i) - \min(F_R'(p))}{\max(F_R'(p)) - \min(F_R'(p))}$$
where $G_i$ represents the total selling volume of product $p_i$ and $L_i$ represents the age of the shop of $p_i$. We employ the same data normalization method to normalize our data.

Experiment Analysis
The overview of the proposed method is shown in Figure 3.

Training Dataset Collection
To investigate the features of fake reviews and understand the pipeline of click farming, 116363 reviews were collected from the Taobao and Tmall platforms. The genuine reviews were selected and downloaded from shops where the authors had made purchases; these shops offer a truly good consumer experience and have existed for a long period of time. To collect the fake reviews, one author pretended to be a click farmer. We found that fake reviews are also widespread on the Tmall platform. After data preprocessing, features are extracted on the basis of the assumptions described in Section 3.

Fake Reviews Dataset
Through investigation, we found that click farming products usually come from newly started online shops whose reviews are few and whose selling volumes are low. The quality of the products they sell is not necessarily worse than that of their competitors. The fake reviews are mainly generated to improve their selling volume, revenue, and rankings. Hence, the following rules are employed in preprocessing.
Rule 1: Exclude any reviews that contain images or additional comments. It is unusual to insert images or additional comments in fake reviews.
Rule 2: Exclude any reviews that humans cannot identify. A label cannot be assigned if a human cannot distinguish the review.
Rule 3: Exclude any reviews that contain advertisements. In practice, these reviews can be excluded through simple heuristic rules.
Rule 4: Exclude any reviews that contain real shopping experience.
Rule 5: Exclude any negative reviews.
Based on these five rules, the reviews of a neck and shoulder massager have been preprocessed, as shown in Table 2.

Rule 1: [review containing images]
Rule 2: "Product very product, praise."
Rule 3: "Professional tattoo, please add QQ *********"
Rule 4: "Product packaging, housing also has a corresponding manufacturers. But the product gap is large, does not meet the proper quality brand. The use of fever phenomenon, the general attitude of customer service Mike."
Rule 5: "It's completely fake, I'm not black you, what is to spread the products, on the value of a 20 yuan a massage head is crooked, the intensity is very small, yet I bought 40 yuan a product, who do not believe, who regret, asked people to click farming all the praise!"

After preprocessing, the fake reviews dataset has a total of 80 products, 51 merchants, and 10708 reviews, covering eight categories including clothing, footwear, electronic appliances, and furniture.

Genuine Reviews Dataset
Like the fake reviews, the genuine reviews also come from Alibaba's Taobao and Tmall websites. For the same product and price, the higher a shop's ranking, the more likely it is to be selected, so click farming is prevalent among newly established shops and new products. Merchants whose sales rankings are already high have no need for click farming, because their earlier sales have already accumulated a great deal of popularity. We therefore choose merchants with higher sales and use their recent reviews to build the genuine reviews dataset. Products with high sales generally have tens of thousands of reviews; following the time sequence, we choose the reviews from the most recent two months, preprocess them, and manually remove advertising reviews to obtain the genuine reviews dataset used for VoFR model training. The genuine reviews dataset has a total of 80 products, 50 merchants, and 105655 reviews, covering eight categories.

Supervised VoFR Model Analysis
The calculated feature functions show that there is a difference between fake reviews and genuine reviews. Expressed as . . . . The test dataset is composed of 20 click farming products (32252, 13 shops) and 20 normal products (2453, 13 shops). The critical value is 0.5: if the VoFR output is higher than 0.5, the product is identified as a click farming product; otherwise it is identified as a normal product. The classification results are shown in Figure 10.

Evaluation Indicators
The evaluation indicators are accuracy, precision, and recall, defined as follows. |TP| is the number of click farming products correctly predicted as click farming. |FP| is the number of normal products wrongly predicted as click farming. |FN| is the number of click farming products wrongly predicted as normal. |TN| is the number of normal products correctly predicted as normal. The formulas are as follows:
$$\mathrm{Accuracy} = \frac{|TP| + |TN|}{|TP| + |FP| + |FN| + |TN|}, \qquad \mathrm{Precision} = \frac{|TP|}{|TP| + |FP|}, \qquad \mathrm{Recall} = \frac{|TP|}{|TP| + |FN|}$$
At present, almost all research on fake reviews relies on Amazon data. We compared the review data on Amazon, Taobao, and Tmall, as shown in Table 4. The differences in the data make the study difficult.
Table 4 shows that the purchase history of anonymous users cannot be tracked and that the reviews cannot be evaluated by rating, so we propose the VoFR model to classify click farming products and fake reviews.
In Reference [6], the dataset is also from Taobao and Tmall. The click farmers' userIDs are obtained and their purchasing information is tracked for analysis, yielding 14 features. The SVM and KNN algorithms are each applied to identify click farmers on Taobao. Its evaluation indicators and ours are shown in Table 5. First, Reference [6] analyzes the click farmers' userIDs and purchasing information, which are difficult to obtain. In the era of data security, such an approach could also infringe on user privacy and is therefore unsafe, whereas our dataset comes from public information on Taobao and Tmall. Second, since the data in Reference [6] is too specific, the method transfers poorly to other platforms. Third, Reference [6] uses 14 features while our research uses 5, yet the evaluation results do not differ much, and our precision is better than that of SVM and KNN. This suggests that some features in Reference [6] contribute little. Fourth, Reference [6] studies consumers, while this paper studies products, so the method can be applied to any category and more comprehensive data is available.
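The evaluation indicators used in this section can be checked with a few lines of code. The counts below (18 true positives, 1 false positive, 2 false negatives, 19 true negatives) are not stated in the paper; they are inferred here, as an assumption, from the 20 click farming and 20 normal test products together with the reported accuracy, precision, and recall.

```python
def evaluate(tp, fp, fn, tn):
    """Accuracy, precision, and recall from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Inferred counts for the 40-product test set (an assumption, see above).
acc, prec, rec = evaluate(tp=18, fp=1, fn=2, tn=19)
print(f"accuracy={acc:.1%} precision={prec:.1%} recall={rec:.1%}")
# prints: accuracy=92.5% precision=94.7% recall=90.0%
```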

Conclusions and Future Work
Click farming in E-commerce is a non-negligible and challenging issue that misleads consumers' purchase decisions. This paper compares the click farming pattern with the normal shopping pattern, and then proposes the use of feature functions and a multi-linear regression method to construct the VoFR model. By calculating the volume of fake reviews to identify click farming products, information reflecting real consumers' decisions is provided to users. The experimental datasets are obtained from the reviews on China's largest E-commerce platforms (Taobao and Tmall), and the anonymous user reviews are manually tagged to form standard datasets and ensure the accuracy of the data. Experiments show that the five feature functions, used as input data for the VoFR model, provide an effective identification method.
Future work will further study the VoFR model, extracting additional feature functions and rating the importance of features to further improve the accuracy of the VoFR model. The method will also be extended to identify other fake reviews with missing information, and to expand the applicability of

$C_{ij}$ represents the credibility of review $r_j$ of product $p_i$, averaged over the number of reviews $r_j$ of product $p_i$ with the corresponding weights $\omega_{ij}$. The average credibility is computed over all the fake and genuine reviews of one product. The final value is normalized through max-min normalization.
ISSN: 1693-6930 TELKOMNIKA Vol.14, No. 4, December 2016: 1510-1520

$r_{ij}$ represents the vector of review $r_j$ for product $p_i$, and $d_i$ represents the vector of the description of product $p_i$.

$R_i$ represents the word collection that contains all the word sets of $p_i$, and the overlap value is computed between any pair of word sets.

Figure 8. We choose 40 pairs of the data as the sample.

Figure 10. VoFR results of the test dataset
Accuracy represents the rate at which click farming products and normal products are correctly classified. Precision represents the rate at which products detected as click farming are truly click farming products; it reflects the accuracy of the classification results of the VoFR model. Recall represents the probability that a click farming product is correctly classified, that is, the proportion of detected click farming products among all click farming products.


the research results. In addition, further research can be done on the emotional information in the text of reviews. An emotional lexicon will be built to improve the efficiency of the VoFR model.

Table 1. The weight $\omega_{ij}$

Table 3. Model parameters