Feature Selection Method Based on Improved Document Frequency



Introduction
Text classification automatically assigns a new document to one or more predefined classes based on its content. With the growth of the WWW in recent years, text categorization (TC) has become one of the key techniques for handling and organizing text data, and it is widely used in spam filtering and in the classification of Web pages and documents. It is therefore necessary and meaningful to study the key technologies of text categorization in order to improve the speed and accuracy of categorization. Classification algorithms used in TC include Support Vector Machines (SVM), k-Nearest Neighbor (kNN) and naive Bayes [1], [2].
A major difficulty of text categorization is the high dimensionality of the original feature space. Feature selection is an important method for reducing the number of features in text categorization, and its goal is to improve classification effectiveness and computational efficiency. Current feature selection methods work by computing a score for each feature word from corpus statistics, sorting the feature words by score, and then selecting the highest-scoring words as the final document features. Well-known methods include document frequency (DF), information gain (IG), expected cross entropy (ECE), the weight of evidence of text (WET), the χ² statistic (CHI) and so on [3], [4], [5]. It is highly desirable to reduce the feature space without loss of classification accuracy.
Document frequency (DF) thresholding, the simplest method with the lowest computational cost, has been shown to behave well when compared with more costly methods, and it can be reliably used instead of IG or CHI when the computation of these measures is too expensive. An experiment in [6] compares the feature extraction performance of DF, IG, WET and CHI, and its results show that the DF method has its own advantages: it is easy to implement, it is widely used in both Chinese and English text categorization, and its feature selection performance is good compared with other feature selection methods. However, this method overlooks useful low-frequency feature words within a category and does not consider the contribution of a term to each category, so it is usually considered an empirical method for improving efficiency, not a principled criterion for selecting predictive features.
In this paper we propose a feature selection method based on improved document frequency, named DFM, derived from the original definition of DF. DFM overcomes the shortcomings of DF, namely that it overlooks useful low-frequency feature words within a category and does not consider the contribution of a term to each category.
The rest of this paper is organized as follows. Section 2 describes the commonly used term selection methods and presents the improved document frequency method DFM. Section 3 discusses the classifier used in the experiments to compare DFM with other text feature selection methods, and presents the experimental results and analysis. The last section gives the conclusion and future work.

Research Method
In this section we summarize and reexamine the feature selection methods DF, IG, ECE and CHI, which are commonly used for text categorization, and we present a new feature selection method, DFM (Document Frequency Modified), whose evaluation function is based on the document frequency (DF) method.

Feature Selection Methods
The following definitions of DF, IG, ECE and CHI are taken from [7], [8] and are introduced only briefly.

Document Frequency (DF)
Document frequency is the number of documents in which a term occurs. In text categorization, a term is retained or removed according to a preset threshold on its document frequency. DF is the simplest technique for feature reduction. It scales easily to very large corpora, with a computational complexity that is approximately linear in the number of training documents [8].
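As a minimal sketch of DF thresholding, the following Python code counts document frequencies and keeps the terms that reach a threshold. The function names, the `min_df` parameter and the toy documents are assumptions made for illustration and do not come from the experiments in this paper.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        # set() ensures a term is counted at most once per document
        df.update(set(tokens))
    return df


def select_by_df(docs, min_df=2):
    """Keep the terms whose document frequency reaches the threshold.

    min_df is a hypothetical threshold name chosen for this sketch.
    """
    df = document_frequency(docs)
    return {term for term, count in df.items() if count >= min_df}


# Toy corpus: each document is a list of already segmented tokens.
docs = [
    ["stock", "market", "price"],
    ["stock", "price", "price"],
    ["football", "match"],
]
df = document_frequency(docs)
selected = select_by_df(docs, min_df=2)
```

Each document is visited once, which matches the approximately linear cost noted above. Terms that occur in only one document, such as "football" here, are dropped regardless of how informative they are for their category.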

Information Gain (IG)
Information gain measures the amount of information obtained for category prediction by knowing the presence or absence of a term in a document. Let c_1, ..., c_n be the categories of the training set. The information gain of each feature term t is computed over these categories, and terms whose information gain is less than a predetermined threshold are removed from the feature space.
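For reference, information gain is commonly written in the text categorization literature in the form below. This is a sketch of that common form, not necessarily the exact expression used by the authors; the notation $P(t)$ for the fraction of documents containing $t$ and $\bar{t}$ for the absence of $t$ is an assumption made here.

```latex
IG(t) = -\sum_{i=1}^{n} P(c_i)\,\log P(c_i)
        + P(t)\sum_{i=1}^{n} P(c_i \mid t)\,\log P(c_i \mid t)
        + P(\bar{t})\sum_{i=1}^{n} P(c_i \mid \bar{t})\,\log P(c_i \mid \bar{t})
```

The first sum is the entropy of the category distribution, and the remaining two terms subtract the expected entropy after observing whether $t$ is present or absent.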

Expected Cross Entropy (ECE)
Cross entropy is also known as the KL distance. It reflects the probability distribution of text topic classes and computes the distance between that distribution and the distribution of topic classes conditioned on a specific term. The larger the cross entropy of a term, the larger its effect on the distribution of text topic classes. The difference from information gain is that ECE only considers the relation between word occurrence and categories, that is, it only calculates the case in which the term appears in the text.
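A commonly used form of expected cross entropy is sketched below. It is an assumption that this matches the authors' definition; the symbols $P(t)$, $P(c_i)$ and $P(c_i \mid t)$ denote the term probability, the category prior and the category probability given that the term occurs.

```latex
ECE(t) = P(t)\sum_{i=1}^{n} P(c_i \mid t)\,\log\frac{P(c_i \mid t)}{P(c_i)}
```

Compared with information gain, this expression has no contribution from documents in which the term is absent, which is the difference described above.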

χ² Statistic (CHI)
The χ² statistic method measures the lack of independence between a term and a category. If term t and category c_i are independent, then CHI is 0. If there are n classes, each term has n correlation values, and these are averaged over the categories to give a single score for the term.
The above four methods are the most commonly used in experiments. ECE differs from IG in that ECE only considers the effect on the category when the word appears in the document. The DF method is simple and has lower complexity. For CHI, the greater the CHI statistic, the stronger the correlation between the feature and the category. The experiments in [3] show that IG and CHI are the most effective in English text classification, followed by DF. Experiments also show that DF can be applied to large-scale computation in place of CHI, whose complexity is higher. The literature [7], [9] points out that IG, ECE and CHI have the same effect on feature extraction in Chinese text classification, followed by DF.
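The χ² score and its category average can be sketched in Python as follows. The code uses the usual 2x2 contingency-table form of the statistic and a prior-weighted average over categories; both are assumptions about the intended formulas, and the function names and toy counts are hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square score of a term for one category from a 2x2 table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (an empty row or column): no evidence of dependence
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def chi_square_avg(tables):
    """Average the per-category scores, weighted by the category prior P(c_i).

    tables holds one (a, b, c, d) tuple per category for the same term.
    """
    score = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        prior = (a + c) / n  # fraction of documents in this category
        score += prior * chi_square(a, b, c, d)
    return score


# Term distributed identically inside and outside the category: score is 0.
independent = chi_square(10, 10, 10, 10)
# Term that occurs in every document of one category and nowhere else.
dependent = chi_square(10, 0, 0, 10)
# Two-category corpus seen from each category's point of view.
average = chi_square_avg([(10, 0, 0, 10), (0, 10, 10, 0)])
```

The first call illustrates the statement that CHI is 0 when the term and the category are independent, while the second shows the score growing with the strength of the association.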
DF is one of the simplest feature term extraction methods. Because its score depends only on how many documents in the corpus contain a term, a term that belongs to more than one class receives a high score from the evaluation function, whereas a term that belongs to a single category occurs less often and therefore receives a lower score. The DF evaluation function rests on the hypothesis that rare terms do not contain useful information, or that their information is too little to have a useful influence on classification. However, this assumption conflicts with the general view in information theory, which holds that some rare terms carry a greater amount of information and can reflect category information better than high-frequency words. Such terms should therefore not be arbitrarily removed, and selection using the DF method alone will lose some valuable features. The document frequency method is simple and easy to implement, and its effect is similar to that of other methods in Chinese and English text classification. To address the shortcomings of the DF method, we present an improved feature selection method based on DF.

A Feature Selection Method Based on Improved Document Frequency
Summarizing the research in [10], [11], a feature term is helpful for classification if it meets the following three requirements:
1. Concentration degree: in a corpus with many categories, if a feature term appears in one or a few categories but not in the texts of other categories, its representation ability is strong and it is helpful for text classification.
2. Disperse degree: if a term appears widely within a category, it has a strong correlation with that category. That is, a feature term is more helpful for classification when it is dispersed across a large number of the texts of a category.
3. Contribution degree: the stronger the correlation between a feature term and a certain category, the greater the amount of information it carries and the more valuable it is for classification.
This paper uses document frequency (DF) and adopts the following method to quantitatively describe the above three principles. (1) Concentration degree: expressed as a ratio; the larger the ratio, the more concentrated the term is in the class.
In this paper we implement a new feature selection method, DFM, whose evaluation function is based on the document frequency (DF) method and introduces the concentration degree, disperse degree and contribution degree.

Data Collections
The experimental data used in this paper come from the Chinese natural language processing group of the Department of Computing and Information Technology at Fudan University. The training corpus, "train.rar", has 20 categories and contains about 9804 documents, and "test.rar", which contains about 9833 documents, is used for testing. For reasons of algorithmic efficiency, we chose only some of the documents for our experiments. Table 1 shows the number of samples chosen in each category.

Performance Measure
To evaluate the performance of a text classifier, we use the F1 measure put forward by van Rijsbergen (1979) [12]. This measure combines recall and precision as follows:
$$F_1 = \frac{2 \times P \times R}{P + R}$$
where P is precision and R is recall.
Figure 1 shows the classification performance of SVM on the Fudan corpus after feature selection using DF, IG, ECE, CHI and DFM. It can be seen in Figure 1 that the DFM method outperforms the DF method. From Figure 1, we find that the F1 of DF rises as the feature dimension increases. IG and ECE produce similar classification performance, because ECE is a simplified version of IG that only takes into account the case in which a feature term appears in the corresponding category. CHI and DFM are the most effective in our experiment, and the F1 value of CHI remains very stable as the feature dimension changes. The DFM curve is significantly higher than those of the other methods when the feature dimension is between 5000 and 8000. In particular, at a feature dimension of 8000 the curve reaches its maximum, with an F1 value of 99.133%. The classification effect of DFM is better than that of the other four feature selection methods.
The best values of the five feature selection methods are shown in Table 2, with precision, recall and F1. From Table 2, we can see that all five feature selection methods perform well, and that DFM achieves the best categorization performance, with an F1 value of 99.133%.
Table 2. The best performance of the five feature selection methods
From Figure 1 and Table 2, we can see that DFM can extract category characteristics in Chinese text classification and improve classification accuracy, and that it is stable in feature extraction.
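As a small illustration of the F1 measure used in this section, the sketch below computes precision, recall and F1 from counts of true positives, false positives and false negatives. The function name and the example counts are hypothetical and are not results from this paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for one category.

    tp: documents correctly assigned to the category
    fp: documents wrongly assigned to the category
    fn: documents of the category that were missed
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        # No correct assignment at all: define F1 as 0 to avoid dividing by 0
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for a single category.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
```

Because F1 is the harmonic mean of precision and recall, it stays low whenever either of the two is low, which is why it is preferred over reporting precision or recall alone.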

Conclusion
This paper has proposed an improved feature selection method based on DF, named DFM. DFM implements three principles: concentration degree, disperse degree and contribution degree. The experiment shows that DFM is an effective method for extracting category characteristics in feature selection, and that it can effectively improve the performance of text categorization. In the future, we will continue to study the contribution of terms to the characterization of categories.