Automatic Summarization in Chinese Product Reviews

With the growing number of online comments, buyers find it hard to locate useful information quickly, which motivates research on automatic summarization built on product review mining. Previous studies focused mainly on extracting explicit features and often ignored implicit features, which are not stated directly but carry information needed to analyze comments. Mining features from web reviews quickly and accurately is therefore important for summarization technology. In this paper, explicit features and "feature-opinion" pairs in explicit sentences are extracted with a Conditional Random Field (CRF), and implicit product features are recognized with a bipartite graph model based on a random walk algorithm. The features and their corresponding opinions are then combined into structured text, and the summary is generated from the extraction results. The experimental results show that the proposed methods outperform the baselines.


Introduction
Chinese online markets remain highly active, and reading a flood of comments is time-consuming for customers. A few major Chinese e-commerce websites already provide aggregate statistics: for example, Tmall.com shows frequent phrases with their counts as a reference for other users, and Amazon.cn assigns star ratings to goods based on user reviews. These are coarse-grained extractions, however, and they can strip comments of their context, which limits how objectively users can understand the reviews. For example, some extracted labels represent the experience of only a few people, and some phrases are incomplete [1]. The problem becomes more prominent as the number of users grows. Generating accurate and concise summaries is therefore important for analyzing and managing product reviews; it improves the efficiency of online shopping and helps users obtain important information quickly.
Automatic summarization has developed rapidly in recent years. Summary generation can be divided into extractive summarization and abstractive (generative) summarization. Extractive summarization forms the summary from sentences in the original text: it scores the sentences in the document using predefined feature sets or machine learning algorithms, and then outputs the highest-scoring sentences as the summary [2]. Abstractive summarization may include words and phrases that do not occur in the original text, and is typically based on entity information, compression techniques and similar methods. Because abstractive summarization is still in its infancy and poses major challenges for natural language processing, it remains far from producing practical summaries [3]. Hence this paper focuses on the extractive approach.
For extractive summarization, the quality of opinion mining on the comments directly affects the quality of the generated summary. Hu and Liu [8] distinguish two kinds of features in product feature mining, namely explicit and implicit features. The existence of implicit features is widely acknowledged [4,5], but existing methods for mining them are not yet mature. Su and Xiang [6] mainly use semantic association analysis based on Point-wise Mutual Information (PMI) to match product features with opinion words according to their probabilities in the training set. In [7], a co-occurrence association rule mining (CoAR) algorithm is proposed to select implicit product features. However, these methods work well only for opinion words tied to a specific feature and are not ideal for general opinion words. Our study therefore focuses on implicit opinion mining in order to obtain high-quality summaries.
Many scholars have proposed summarization methods since Luhn [9] defined automatic summarization in 1958. Reference [10] applied hierarchical clustering to documents and then used LexRank to compute the relevance of text units and extract important sentences from each category. The Interdependent Latent Dirichlet Allocation (ILDA) model used in [11] took the shallow semantics of the documents into account but ignored text structure information. A sequence annotation model was used to address this problem in [12], which adopted a Hidden Markov Model (HMM) with fewer independence assumptions, although the HMM has limited ability to describe features of the relationships between sentences. Reference [13] combined the Hidden Topic Markov Model with the LDA topic model, breaking the topic independence assumption but ignoring semantic synonymy and relevance. In [14], a multi-document summary was built from the sentence distribution, calculated from the frequency of the words forming each sentence. A clustering approach was used in [15] to extract information, but it ignored the readability of the summary. In this paper, we train models automatically using machine learning algorithms and given feature sets. A Conditional Random Field (CRF) and a semi-supervised learning method are used to extract features, opinions and "feature-opinion" pairs from comments. These methods suit regular sentences and short comment texts, and require part of the training corpus to be labeled manually. The semi-supervised learning method operates on existing word segmentation results. In addition, we combine the product features gained from comments, covering function, capability and components, to construct a bipartite graph, and then compute the most probable implicit features with a random walk algorithm.
The summary is then generated by selecting the sentences with the lowest cost value, which is calculated from the probability distributions of the pairs.
In general, our contributions are the following: (1) We use a novel model to solve the problem of implicit feature extraction and verify its feasibility experimentally.
(2) We experimentally evaluate our methods against existing feature extraction methods in terms of precision and recall, and against current automatic summarization techniques in terms of ROUGE.
(3) We focus on product reviews and select summary sentences according to the probability distribution of product "feature-opinion" pairs.
The remaining parts of this paper are organized as follows: Section 2 introduces the related background and our approach; Section 3 presents, evaluates and discusses the experimental results; Section 4 presents our conclusions and future work.

Model Design
A number of studies have shown that web product comments, which are brief, linguistically diverse, sparse and noisy, differ from traditional documents and require special text processing techniques [16][17][18]. The approach proposed in this section therefore concentrates on opinion mining. Its main components are the identification and collocation of "feature-opinion" pairs, implicit feature extraction and automatic summarization. The system flowchart is shown in Figure 1.
Product reviews are crawled from e-commerce sites, and all reviews are treated as one document in which each sentence is a comment. To obtain high-quality and reliable experimental data, we first preprocess the review datasets. Preprocessing includes segmentation and denoising, which removes emoticons, special characters and off-topic sentences (for example "I am very happy to receive the goods" or "This style is what I want"). Because of the particularities of Chinese grammar, word segmentation is required, for which we use the ICTCLAS segmentation system. The training data are labeled with the help of HowNet [19] and used to train the models that extract product features and opinions; comment sentences are then divided into explicit and implicit sentences according to the extraction results. Next, we cluster the "feature-opinion" pairs in explicit sentences to construct a bipartite graph, and use a random walk algorithm to calculate the probability of implicit features and thereby extract their "feature-opinion" pairs. Finally, we provide the summary to users.

Collocation Extraction Based on CRF
This section extracts "feature-opinion" pairs based on the Conditional Random Field. Since CRF has been applied to Chinese word segmentation, sentiment analysis and part-of-speech tagging, we transform collocation extraction into a sequence annotation task.
Collocation extraction is defined as extracting product features and opinions, expressed as <product feature, opinion word>, from the comment text, such as <pixel, high> in the comment "苹果手机的像素很高" ("The pixel count of the iPhone is high."). Identifying features and opinions can be viewed as follows: given an input sequence of words, output the label sequence with the maximum probability. We introduce seven tags: one for the initial word of a product feature, one for an intermediate word of a feature, one for the end word of a feature, one for the opinion word adjacent to the feature, one for an intermediate word of an opinion, one for the end word of an opinion, and one for unrelated words.
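The decoding step from a tagged sequence to collocations can be sketched as follows. This is only an illustration under stated assumptions: the tag names (FB, FI, FE, OB, OI, OE, N), the word segmentation of the example, and the rule that pairs each feature with the next opinion are hypothetical choices, not the paper's own symbols or pairing procedure.

```python
# Hypothetical tag names standing in for the seven tags described above.
FEATURE_TAGS = {"FB", "FI", "FE"}  # begin / intermediate / end of a product feature
OPINION_TAGS = {"OB", "OI", "OE"}  # begin / intermediate / end of an opinion
OTHER_TAG = "N"                    # unrelated word


def extract_pairs(tokens, tags):
    """Group tagged tokens into spans, then pair each feature with the next opinion."""
    spans = []  # list of (kind, text) in sentence order
    current_kind, current = None, []
    for token, tag in zip(tokens, tags):
        if tag in FEATURE_TAGS:
            kind = "feature"
        elif tag in OPINION_TAGS:
            kind = "opinion"
        else:
            kind = None
        # A new span starts at a begin tag or whenever the span kind changes.
        starts_new = tag in {"FB", "OB"} or kind != current_kind
        if starts_new and current:
            spans.append((current_kind, "".join(current)))
            current = []
        current_kind = kind
        if kind is not None:
            current.append(token)
    if current:
        spans.append((current_kind, "".join(current)))

    pairs, pending_feature = [], None
    for kind, text in spans:
        if kind == "feature":
            pending_feature = text
        elif kind == "opinion" and pending_feature is not None:
            pairs.append((pending_feature, text))
            pending_feature = None
    return pairs


# Assumed segmentation and tags for part of a review sentence.
tokens = ["手机", "很", "精致", "，", "屏幕", "显示", "很", "细腻"]
tags = ["FB", "N", "OB", "N", "FB", "FE", "N", "OB"]
print(extract_pairs(tokens, tags))
```

In a full system the tags would come from the trained CRF model rather than being written by hand.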
Choosing good features can greatly improve extraction performance, so constructing the feature template for CRF-based sequence labeling is important. The features used in our model are the word feature, the part-of-speech feature, the position feature, the interdependent syntactic relation feature, and whether the sentence is an explicit comment sentence. After building the feature templates and training the model on the training corpus, we obtain the collocation extraction model and use it to mine collocation pairs from new corpora. For example, the results of labeling the sentence "手机很精致，屏幕显示很细腻，音量有点低，WiFi 信号接受能力很差。" ("This phone is very delicate and the screen display is fine, but with a little low volume and poor WiFi signal reception.") are shown in Table 1. The example has five columns, which represent the "word feature", "part of speech feature", "position feature", "interdependent syntactic relation feature" and "whether is an explicit comment sentence". From the labeled output we obtain four collocations: <手机，精致>, <屏幕显示，细腻>, <音量，低> and <WiFi 信号接受能力，差>.
Different words or phrases are often used by customers to describe the same feature; for example, "facade", "external" and "aspect" are all used to describe the appearance of a mobile phone. So that similar features share the same description, we cluster the n features in the "feature-opinion" pairs, which lets each opinion word be matched to a feature cluster. Our clustering method follows [20], which automatically identifies some labeled examples with a semi-supervised method; unlabeled features are then assigned to clusters by a naive Bayesian classifier within an EM formulation [21]. When EM converges, the classification labels of all the attribute words give the final grouping. Thus the implicit feature extraction problem is turned into a classification problem.

Implicit Features Extraction
Explicit features are easy to find in reviews, but their number is limited. The CRF model extracts explicit features accurately, whereas rule-based methods struggle to extract implicit features with full coverage. We therefore mine implicit features by combining the results of a random walk algorithm with the probabilities of candidate features.
In this section, our main task is to extract implicit features. We use the features and opinions collected previously to build a bipartite graph with two vertex sets: one contains the candidate features and seed features, and the other contains the opinions. An edge connects a feature vertex to an opinion vertex, and its weight is the corresponding entry of the weight matrix. Within the seed set, features that belong to the extracted feature are marked as positive examples and the others as negative examples. Given the graph and the seed feature set, our algorithm calculates the probability that each feature in the candidate feature set is the implicit feature. Taking cellphone reviews as an example, the more often a product feature and an opinion word co-occur, the greater the relevance between them. As shown in Figure 2, the opinion word "very big" is associated with the features "screen" and "memory"; because its connection with "memory" is closer than with "screen", the corresponding edge weight is higher, and the feature described by "very big" is more likely to be "memory". A small number of manually labeled features serve as the seed set of the graph model, and for the opinions in implicit sentences we obtain the implicit features from the corresponding candidate features using the random walk model.

Figure 2. The Bipartite Graph of Some Cellphone Reviews
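As a minimal sketch of how such a graph could be represented, the edge weights can be taken as raw co-occurrence counts of the extracted "feature-opinion" pairs. The use of raw counts as weights, the function name build_bipartite_weights and the toy counts below are assumptions for illustration only.

```python
from collections import Counter


def build_bipartite_weights(pairs):
    """Return (features, opinions, weights), where weights[i][j] counts how often
    features[i] co-occurs with opinions[j] in the extracted pairs."""
    counts = Counter(pairs)
    features = sorted({feature for feature, _ in counts})
    opinions = sorted({opinion for _, opinion in counts})
    weights = [[counts[(f, o)] for o in opinions] for f in features]
    return features, opinions, weights


# Toy pairs: "very big" co-occurs more often with "memory" than with "screen".
pairs = (
    [("memory", "very big")] * 3
    + [("screen", "very big")]
    + [("screen", "clear")] * 2
)
features, opinions, weights = build_bipartite_weights(pairs)
print(features)  # ['memory', 'screen']
print(opinions)  # ['clear', 'very big']
print(weights)   # [[0, 3], [2, 1]]
```

With these counts the edge between "very big" and "memory" carries the larger weight, which matches the intuition described for Figure 2.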
The random walk iteratively updates a state matrix that has one row per candidate feature and one column per category. The initial state of the candidate feature set is the state matrix at iteration zero, and the state matrix obtained when the iteration stops is the final state of all candidate features. Each entry of the state matrix is non-negative and expresses the probability that a feature belongs to a cluster category; it is calculated as shown in (1).
In (1), the relation matrix is normalized by a diagonal matrix: each diagonal entry is the sum of the elements in the corresponding row of the relation matrix, and all off-diagonal entries are zero. A weighting parameter adjusts how strongly the assignment of candidate features depends on the initial state versus the bipartite graph. The first column of the state matrix corresponds to the implicit feature category (positive examples), and the second column corresponds to the non-implicit features (negative examples). When the final state is reached, the probability that each feature belongs to the category is calculated as shown in formula (2). Based on these definitions, the random walk algorithm used to extract implicit features is shown in Table 2. The candidate features are ranked by probability from high to low, and the word with the highest probability is taken as the implicit feature related to the opinion word.
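The iteration can be illustrated with a small sketch. Because formulas (1) and (2) are not reproduced here, the code assumes a standard label-propagation form: features are related through shared opinions, the relation matrix is row-normalized, and each step mixes the propagated state with the initial state through a parameter alpha. The update rule, the parameter name alpha, the final normalization and the toy weights are all assumptions, not the paper's exact definitions.

```python
def random_walk_scores(weights, seeds, alpha=0.5, iterations=20):
    """weights: feature-by-opinion co-occurrence matrix (list of rows).
    seeds: {feature_index: column}, column 0 = positive, column 1 = negative.
    Returns, for each feature, the share of its final state in the positive column."""
    n = len(weights)
    # Assumed relation matrix: two-step walk feature -> opinion -> feature.
    relation = [
        [sum(a * b for a, b in zip(weights[i], weights[j])) for j in range(n)]
        for i in range(n)
    ]
    # Row normalization, i.e. dividing by the diagonal matrix of row sums.
    transition = []
    for row in relation:
        total = sum(row)
        transition.append([value / total if total else 0.0 for value in row])
    # Initial state: seed features carry all their mass in their labeled column.
    initial = [[0.0, 0.0] for _ in range(n)]
    for index, column in seeds.items():
        initial[index][column] = 1.0
    state = [row[:] for row in initial]
    for _ in range(iterations):
        state = [
            [
                alpha * sum(transition[i][k] * state[k][c] for k in range(n))
                + (1 - alpha) * initial[i][c]
                for c in range(2)
            ]
            for i in range(n)
        ]
    scores = []
    for positive, negative in state:
        total = positive + negative
        scores.append(positive / total if total else 0.0)
    return scores


# Toy example: rows are memory, screen, price; columns are two opinion words.
toy_weights = [[3, 0], [1, 2], [0, 3]]
toy_seeds = {0: 0, 2: 1}  # memory is a positive seed, price a negative seed
scores = random_walk_scores(toy_weights, toy_seeds)
print(scores)
```

In this toy graph the unlabeled feature "screen" ends up between the two seeds, and it lies closer to the negative seed because it shares more opinion occurrences with "price".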

Automatic Summarization
After opinion mining, we extract candidate sentences that are related to the product and contain the most keywords, and then calculate the importance of each sentence. Summary generation is divided into three steps: (1) training the CRF model and extracting keywords and their collocations; (2) calculating the probability distributions of "feature-opinion" pairs; (3) comparing the probability distributions of the comment sentences with those of the pairs, and extracting the candidate sentences.
In this paper, we calculate the probability distribution of "feature-opinion" pairs based on the CRF model and the bipartite graph, assuming that summaries have a probability distribution similar to that of high-frequency pairs. The probability distribution of a comment sentence is calculated from the collocation of product features and opinions. For a comment sentence that contains a "feature-opinion" pair, it is calculated as in (3) from four quantities. The first is the number of sentences in the pair. The second is the sum of the similarities between the comment sentence and the other sentences in the pair, which reflects how representative the comment sentence is of the pair: the higher the value, the more information the sentence carries and the more representative it is. The third is a cue-word score determined by prompt words in the sentence such as "我认为" ("I think") and "虽然……但是……" ("although ..., ..."); the more such words a comment sentence contains, the higher its score. The fourth is the sentence length measured in words, which is used to remove the preference for long sentences, since long sentences would otherwise be favored.
The similarity between a comment sentence and its corresponding pair is measured by the Kullback-Leibler (KL) divergence, calculated as $D(P \| Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$ (4), where $P$ and $Q$ are probability distributions. The lower the divergence, the smaller the difference between the comment sentence and the corresponding pair and the higher their similarity, so the best sentence is the one with the lowest divergence. We select as summary sentences those with the minimum divergence and the maximum sentence probability. So that the selection is not bound by the divergence value alone, the cost of generating summary sentences is calculated as shown in (5), in which a sigmoid function is applied. Finally, the text summary is generated by extracting the sentences with the lowest cost value.
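The selection step can be sketched as follows, assuming the divergence is the Kullback-Leibler divergence taken from the sentence distribution to the pair distribution. The direction of the divergence, the smoothing constant epsilon, and the simplified cost (a sigmoid of the divergence alone, which omits the sentence probability term of formula (5)) are assumptions made only for illustration.

```python
import math


def kl_divergence(p, q, epsilon=1e-9):
    """D(P || Q) for two distributions over the same outcomes.
    epsilon is an assumed smoothing constant that avoids division by zero."""
    return sum(
        pi * math.log((pi + epsilon) / (qi + epsilon))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def sigmoid(x):
    """Squash an unbounded value into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def pick_summary_sentence(pair_distribution, sentence_distributions):
    """Return the sentence with the lowest (simplified) cost and all costs."""
    costs = {
        sentence: sigmoid(kl_divergence(distribution, pair_distribution))
        for sentence, distribution in sentence_distributions.items()
    }
    return min(costs, key=costs.get), costs


# Toy distributions over three hypothetical "feature-opinion" outcomes.
pair_distribution = [0.7, 0.2, 0.1]
sentence_distributions = {
    "sentence_a": [0.6, 0.3, 0.1],
    "sentence_b": [0.1, 0.1, 0.8],
}
best, costs = pick_summary_sentence(pair_distribution, sentence_distributions)
print(best)  # sentence_a
```

The sigmoid keeps the cost bounded even when the divergence grows very large, which is one plausible reading of why the cost is not computed from the raw divergence.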

Results and Analysis
We conduct the experiments based on the approach we proposed. The experiment results and analyses are as follows.

Experimental Data
The experimental data consist of 121,790 comments crawled from three Chinese e-commerce sites in two product areas: 79,855 comments on mobile phones and 41,935 comments on computers. Observation of the corpus shows that most of the comments are short texts, so the comments are segmented at Chinese punctuation marks, which yields 368,963 comment clauses. After eliminating irrelevant comment sentences, 311,870 clauses remain for processing. For each of the two areas, 200 comments are manually selected as the evaluation set, comprising 100 explicit comments and 100 implicit comments.
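Splitting comments into clauses at punctuation marks can be sketched in a few lines. The exact set of marks used in the experiments is not listed, so the character class below and the function name split_clauses are assumptions.

```python
import re

# Assumed clause delimiters: common Chinese marks plus their ASCII counterparts.
CLAUSE_BREAK = re.compile(r"[，。！？；,!?;]+")


def split_clauses(comment):
    """Split one comment into non-empty clauses at the assumed delimiters."""
    return [part.strip() for part in CLAUSE_BREAK.split(comment) if part.strip()]


comment = "手机很精致，屏幕显示很细腻，音量有点低，WiFi 信号接受能力很差。"
for clause in split_clauses(comment):
    print(clause)
```

For this comment the sketch yields four clauses, one per evaluated aspect of the phone.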

Experimental Results and Analysis
We use precision and recall as the evaluation criteria. We extract explicit features and opinions together with their collocations, and compare the results with Hu and Liu's approach [8], referred to as Method One, in Table 3. Hu and Liu assume that the more important a product feature is, the more frequently it occurs. They therefore use association rules to extract high-frequency nouns and noun phrases as product features within a set text window, and extract infrequent features from the adjective collocations around the frequent features. This method is simple and efficient, but its effect depends partly on the selection of frequent item sets. Its F-values in both areas, mobile phones and computers, are lower than ours, because our CRF-based model labels features and opinions jointly and so captures the association between them in the comments. Our method achieves higher F-values on corpora of short sentences with strong regularity.
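For reference, precision, recall and the F-value over a set of extracted collocations can be computed as in the sketch below. The function name and the toy pairs are illustrative assumptions and are not taken from the experimental data.

```python
def precision_recall_f1(extracted, gold):
    """Compare extracted items with manually annotated (gold) items as sets."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: 4 extracted pairs, 3 of them correct, 6 annotated pairs in total.
extracted = [("screen", "clear"), ("volume", "low"), ("price", "cheap"), ("box", "big")]
gold = [
    ("screen", "clear"), ("volume", "low"), ("price", "cheap"),
    ("battery", "durable"), ("camera", "sharp"), ("signal", "poor"),
]
precision, recall, f1 = precision_recall_f1(extracted, gold)
print(precision, recall, f1)
```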
The convergence probability is calculated after several iterations of the random walk algorithm, with the number of iterations fixed in our experiment. This experiment investigates the accuracy for the two kinds of goods, "mobile phone" and "computer", as the weighting parameter of the random walk varies from 0.1 to 1.0. Figure 3 shows the accuracy of the first 100 results from the two product evaluation sets. As the parameter increases, the accuracy for both types of comments changes gradually and reaches a peak at a certain point. Figure 3 shows that relatively high accuracy in extracting implicit features is obtained on both types of comments at this point.
The mean absolute error (MAE) is used to measure the accuracy of implicit feature extraction in our experiment. It measures the difference between the implicit features extracted by machine recognition and those given by human annotation, and is calculated as $\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} |p_i - q_i|$ (6), where $p_i$ and $q_i$ respectively represent the implicit features extracted by machine recognition and by human annotation, and $N$ is the number of implicit features. A higher MAE indicates lower extraction quality, and vice versa. The MAE results of our method, the PMI algorithm [6] and the CoAR algorithm [7] are compared in Figure 4, and the precision, recall and F-measure of the three methods are shown in Table 4. In the experiment, we find that the PMI and CoAR algorithms handle opinion words tied to a fixed category well: for example, "便宜" (cheap) and "实惠" (affordable) lead to the appropriate product feature "price". For highly general words such as "不错" (nice) and "一般" (just so-so), however, their results are unsatisfactory, because these general opinions can modify almost any feature. The proposed method handles these opinion words well, and its MAE values are lower than those of the other two methods. Moreover, we also find that precision and recall are higher when implicit features are included than when only explicit features are considered.
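A minimal sketch of the MAE computation is given below. It assumes that the machine output and the human annotation have been encoded as aligned numeric values, for example 1 when a candidate is marked as an implicit feature and 0 otherwise; this encoding and the toy values are assumptions for illustration.

```python
def mean_absolute_error(predicted, annotated):
    """Average absolute difference between aligned machine and human values."""
    if len(predicted) != len(annotated) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, annotated)) / len(predicted)


# Toy example with an assumed 0/1 encoding for four candidate features.
machine = [1, 0, 1, 1]
human = [1, 1, 1, 0]
mae = mean_absolute_error(machine, human)
print(mae)  # 0.5
```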
The ROUGE [22] automatic evaluation toolkit is used to analyze and evaluate the automatic summarization results. The methods in [10][11][12][13] serve as baselines. As shown in Table 5, the summaries generated with the bipartite graph and the CRF model outperform the baselines both on the key information coverage index (ROUGE-1) and on the summary readability indices (ROUGE-2, ROUGE-SU).
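To make the ROUGE-N indices concrete, the sketch below computes n-gram recall between a candidate summary and a single reference. It is a simplified stand-in for the ROUGE toolkit, which also supports multiple references and the skip-bigram variant (ROUGE-SU); the function names and the toy sentences are assumptions.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Share of reference n-grams that also appear in the candidate (clipped counts)."""
    candidate_counts = ngram_counts(candidate, n)
    reference_counts = ngram_counts(reference, n)
    total = sum(reference_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(
        min(count, candidate_counts[gram]) for gram, count in reference_counts.items()
    )
    return overlap / total


reference = ["screen", "is", "very", "clear"]
candidate = ["the", "screen", "is", "clear"]
rouge_1 = rouge_n_recall(candidate, reference, n=1)
rouge_2 = rouge_n_recall(candidate, reference, n=2)
print(rouge_1, rouge_2)
```

Here three of the four reference unigrams and one of the three reference bigrams are recovered, so the unigram score is higher than the bigram score.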
The quality of summarization depends on extraction performance. Because our extraction achieves higher precision, the resulting summaries outperform those of existing methods. By capturing the hidden semantic information in the comments, our method compensates for the limitations of shallow semantic analysis, so our summaries express users' feelings adequately and are closer to the expert summaries.

Conclusion
In this work, we present separate extraction models for explicit and implicit features according to their characteristics. We use a CRF model to mine explicit features and "feature-opinion" pairs in explicit sentences, and propose a bipartite graph model based on a random walk algorithm to extract implicit features. Features and their corresponding opinions are combined into binary collocations, turning unstructured or semi-structured text into structured text. Finally, we select comment sentences as the summary by calculating the cost value. Experimental results show that our method is reasonable and effective, and that both models and the proposed automatic summarization achieve good results. Opinion mining on Chinese product reviews is a difficult subject that reflects the flexibility and uncertainty of natural language. It also provides useful information for sentiment analysis and has great research value.