An adaptive clustering and classification algorithm for Twitter data streaming in Apache Spark


require tools that are compatible with Hadoop [12]. In recent years, analysing huge volumes of unstructured data has become a business need. Cluster analysis is one of the mining tasks used for investigations such as opinion mining, sentiment analysis and popularity analysis [13]. Current systems process Twitter data with tools and technologies based on event processing and one-message-at-a-time analysis [14].
One of the most recent studies employed several learning frameworks [15] such as K-Nearest Neighbour (KNN), Support Vector Machine (SVM), Random Forest (RF), and Naïve Bayes (NB) [16,17]. The RF algorithm produced the best recall, precision, and F-measure values, while SVM performed comparably, achieving about 93% accuracy in every group. In all of these earlier studies, classification was used for spam detection on Twitter. Anomaly-detection frameworks have also been developed to identify spammers on Twitter using account data and streaming tweets [18,19].
The main contributions can be stated as follows: 1) the input Twitter data is pre-processed and then clustered effectively using Improved Fuzzy C-means clustering, with the clustering further refined by an adaptive Particle Swarm Optimization (PSO) algorithm; 2) the pre-processed data is classified using a modified support vector machine (MSVM) classifier with grid search optimization. This article is organized as follows: section 2 reviews previous studies related to the proposed system, section 3 describes the suggested approach, section 4 discusses the experimental results, and section 5 presents the conclusion.

Research Method
Leilei et al. [20] suggested a Hypertext-Induced Topic Search (HITS) approach based on a Topic-Decision strategy (TD-HITS) and a Latent Dirichlet Allocation (LDA)-based Three-Step model (TS-LDA). The framework was proposed for detecting and identifying influential spreaders in social media data streams. TD-HITS can identify the number of topics and the related posts among a huge number of posts, while TS-LDA can identify powerful propagators of trending events based on client data and post content. Results on a Twitter dataset showed the efficiency of the suggested methods in recognizing events and in distinguishing powerful event propagators.
Shangsong Liang et al. [21] addressed the problem of clustering users according to their published short text streams. To obtain better user clustering performance, they proposed a collaborative user-interest tracking model that follows changes in each user's dynamic topic distribution jointly with the dynamic topic distributions of their followers, based both on the content of the current short messages and on the previously estimated distributions. They also proposed two collapsed Gibbs sampling schemes for the collaborative inference of the users' dynamic interests, for both short-term and long-term clustering-dependency topic models.
Streaming data is one of the active research areas for concept-evolution studies. When a new class appears in a data stream, it can be considered a new concept, hence the term concept evolution. Tahseen et al. [22] addressed this problem by introducing a new collaborative strategy called the "class-based" ensemble, which replaces the conventional "chunk-based" approach for recurring class detection. The study examined the attributes of the two techniques within the class-based ensemble to provide a detailed analysis and clarification, and demonstrated the superiority of "class-based" ensembles over existing procedures through empirical evaluation on several benchmark datasets comprising web comments as a text mining challenge.
Lekha et al. [23] developed a framework on Apache Spark, an open-source, cloud-based big data platform, that focuses on building machine learning models over big data streams. In this framework, a user tweets his/her health traits, and the application receives them in real time, extracts the traits, and applies a machine learning model to predict the user's health status, which is then reported back immediately so that suitable action can be taken.
Senthil and Usha [24] worked on categorizing streams of Twitter data through sentiment analysis using hybridization. The study used a URL-based security tool to collect 600 million public tweets, and feature selection was applied for the sentiment investigation. A ternary classification was performed after a pre-processing step, and the results for the tweets sent by users were collected. Then, a hybridization approach based on three optimization methods (PSO, GA and DT) was applied to improve classification accuracy in the sentiment analysis. The results were compared with previous works, and the developed strategy demonstrated better performance than the other classifiers.

Proposed Methodology

Phase 1: Adaptive Clustering for Twitter Data Streams in Apache Spark
The presented technique consists of the following steps: first, the input Twitter data is pre-processed using tokenization and stop word removal. The pre-processed data is then clustered effectively using Improved Fuzzy C-means clustering with an adaptive Particle Swarm Optimization (PSO) algorithm. Finally, Twitter data streaming with the proposed method is evaluated in the Apache Spark engine. The flow diagram of the proposed Twitter data streaming in phase 1 is given in Figure 1.

Preprocessing
In the proposed Twitter data streaming, the input data is taken from the dataset [25][26][27][28][29][30], denoted D = {d_1, d_2, d_3, ..., d_n}, where D is the Twitter input dataset. The input Twitter data is then preprocessed using tokenization and stop word removal, which remove inconsistent or noisy information from the dataset. Input data preprocessing comprises the following steps [31].
a. Tokenization
Tokenization is the task of splitting the input text into pieces, called tokens, possibly discarding certain characters, such as punctuation, at the same time. In essence, tokenization breaks the given text into units called tokens for further processing; the tokens may be words, numbers or punctuation samples [32][33][34][35]. Tokenization also removes punctuation marks such as commas, full stops, hyphens and brackets. The input data after tokenization is given in (1):

T = {t_1, t_2, t_3, ..., t_n} (1)

where T is the tokenized data and i = 1, 2, 3, ..., n.
b. Stop word removal
After tokenizing, the tokenized data T is given as input to stop word removal, where undesired words are discarded. Stop words are words that are generally considered uninformative. The purpose of this step is to remove conjunctions, prepositions, articles and other frequent words, such as adverbs, verbs and adjectives, from the textual data [36]. Some frequently used stop words are "a", "me", "of", "the", "he", "she", "you". The tokenized data after stop word elimination is given in (2):

P = {p_1, p_2, p_3, ..., p_n} (2)

where P is the preprocessed set of data after eliminating stop words and i = 1, 2, 3, ..., n.
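The two preprocessing steps, tokenization (1) and stop word removal (2), can be sketched as follows. This is a minimal illustration, not the authors' implementation; the regular expression and the stop word list (taken from the examples quoted above) are assumptions, and a real system would use a fuller stop word list.

```python
import re

# Illustrative stop-word list built from the examples quoted in the text.
STOP_WORDS = {"a", "me", "of", "the", "he", "she", "you"}

def tokenize(text):
    """Split raw tweet text into lowercase word tokens, discarding punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that carry little meaning for clustering."""
    return [t for t in tokens if t not in STOP_WORDS]

raw = "She sent me a link, of course - the stream never stops!"
tokens = tokenize(raw)               # tokenized data, as in (1)
cleaned = remove_stop_words(tokens)  # preprocessed data, as in (2)
print(cleaned)
```

For the sample tweet above, the surviving tokens are ["sent", "link", "course", "stream", "never", "stops"].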

Data Aggregation
Aggregation, or clustering, is the process of splitting the set of objects in the dataset into subsets or clusters; each subset is a cluster, and the attributes within a cluster are similar to one another. The proposed modified fuzzy c-means clustering algorithm (MFCM) is used for effective clustering, where the performance of the MFCM depends on updating the membership functions using a sigmoid function. In addition, MFCM performance is improved by using a support value based adaptive PSO algorithm: the preprocessed data is optimized with the support value based adaptive PSO before modified fuzzy c-means clustering is applied [37,38].
a. Support value based adaptive PSO
PSO was developed as a heuristic population-based optimization method inspired by the flocking behaviour of birds. PSO maintains a collection of particles, each representing a potential solution [39]. The particles follow a simple behaviour: imitate the successes of neighbouring particles and their own past successes. The position of a particle is therefore influenced by the best particle in its neighbourhood, gbest, as well as the best solution the particle has found itself, pbest_i. The particle position is adjusted using:

x_i(t+1) = x_i(t) + v_i(t+1) (3)

where the velocity component v_i signifies the step size. The velocity is updated via (4):

v_i(t+1) = w v_i(t) + c_1 r_1 (pbest_i - x_i(t)) + c_2 r_2 (gbest - x_i(t)) (4)

where w is the inertia weight, c_1 and c_2 are the acceleration coefficients with c_1, c_2 in [0,1], r_1 and r_2 are random values, pbest_i is the individual best position of particle i, and gbest is the best position among all particles. The position of each particle is then mapped into the solution space and its fitness value is evaluated according to the support value based fitness function; at the same time, pbest_i and gbest are updated if required. The support value S, computed over the input population p_1, p_2, ..., p_n, is estimated using (5). This updating process continues until a criterion is met; typically it is run for a number of iterations to find the optimum solution. The pseudo code of the support value based adaptive PSO algorithm is given in Algorithm 1.

Algorithm 1: Support value based adaptive PSO algorithm
Step 1: Initialization
  Set the iteration count k = 0
  Set a population size of NP
  Set the velocity v_j of each particle
Step 2: Evaluate the fitness of each particle using the support value based fitness function (5)
Step 3: Update the personal best pbest_j and global best gbest positions
Step 4: Update position and velocity
  For each particle, calculate the position and velocity using (3) and (4)
  End For
Step 5: Increase the generation count and repeat from Step 2 until the stopping criterion is met

Fuzzy c-means is a clustering method which permits a data point to belong to more than one cluster at a time. The suggested MFCM clustering provides better clustering performance than conventional FCM clustering methods. In modified fuzzy c-means clustering, let X = {x_1, x_2, x_3, ..., x_n} be the set of data points after adaptive particle swarm optimization and C = {c_1, c_2, c_3, ..., c_k} be the set of centers. The pseudo code of the modified fuzzy c-means clustering algorithm is given in Algorithm 2.

Algorithm 2: pseudo code of modified fuzzy c-means clustering
The MFCM algorithm assigns data to every class by using fuzzy memberships. The modified objective function for partitioning the input dataset into clusters is defined in (6) as

J = sum_i sum_j u_ij^m w_j ||x_i - c_j||^2 (6)

where x_i represents the data, c_j is the j-th cluster center and m is a constant value. The sigmoid-function weight w_j denotes the weighted mean distance in cluster j, adapted for effective clustering in (6) and given by (7). The membership function signifies the likelihood that a data point comes from a given cluster; in the FCM algorithm this probability is based on the distance of an individual point to the others in the same cluster. The membership functions and cluster center vectors are updated from the velocities and particle positions by (8) and (9):

u_ij = 1 / sum_l ( ||x_i - c_j|| / ||x_i - c_l|| )^(2/(m-1)) (8)

c_j = sum_i u_ij^m x_i / sum_i u_ij^m (9)
The cluster centroid values are computed using (9), and the algorithm continues running until the change between two iterations falls below the given sensitivity threshold:

max_ij | u_ij(t+1) - u_ij(t) | < eps (10)

where eps is a termination condition lying in the range between 0 and 1, and t indexes the iteration steps. The steps are repeated until efficient clustering is reached.
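The clustering loop above can be illustrated with a plain fuzzy c-means sketch. Since the sigmoid-weighted modification (7) is not fully specified in the text, the standard membership update (8) and centre update (9) are used, with the termination test (10) on the change in memberships; the sample data is purely illustrative.

```python
import math
import random

def fcm(points, c=2, m=2.0, eps=1e-4, max_iter=100):
    """Plain fuzzy c-means: centre update (9), membership update (8),
    iterated until the membership change satisfies the test in (10)."""
    n, dim = len(points), len(points[0])
    # random initial memberships, each row normalised to sum to 1
    u = []
    for _ in range(n):
        row = [random.random() + 1e-9 for _ in range(c)]
        s = sum(row)
        u.append([v / s for v in row])

    for _ in range(max_iter):
        # equation (9): c_j = sum_i u_ij^m x_i / sum_i u_ij^m
        centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            tot = sum(w)
            centers.append([sum(w[i] * points[i][d] for i in range(n)) / tot
                            for d in range(dim)])
        # equation (8): u_ij = 1 / sum_l (d_ij / d_il)^(2/(m-1))
        new_u = []
        for i in range(n):
            dists = [max(math.dist(points[i], centers[j]), 1e-12)
                     for j in range(c)]
            new_u.append([1.0 / sum((dists[j] / dists[l]) ** (2.0 / (m - 1.0))
                                    for l in range(c)) for j in range(c)])
        # equation (10): stop when memberships stabilise
        change = max(abs(new_u[i][j] - u[i][j])
                     for i in range(n) for j in range(c))
        u = new_u
        if change < eps:
            break
    return centers, u

data = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]]
centers, memberships = fcm(data, c=2)
```

Each row of the membership matrix sums to 1, reflecting the fuzzy assignment of a point across all clusters.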

Phase 2: Effective Classification for Higgs Data Streams in Apache Spark
In the second stage, Higgs data streaming is performed by pre-processing the input data and classifying the pre-processed data with the modified support vector machine (MSVM) classifier using grid search optimization. Finally, the optimized data is evaluated in the Spark engine, and the evaluated values are used to build the confusion matrix. The proposed phase 2 work uses the Higgs dataset for data streaming in Apache Spark. The flow diagram of the phase 2 methodology for the effective classification of Higgs data streams is given in Figure 2.

Preprocessing
In the proposed Higgs data streaming, the input data used for the proposed effective data streaming is taken from the dataset D' = {d'_1, d'_2, d'_3, ..., d'_n}, where D' is the Higgs input dataset. The input Higgs data is then preprocessed using tokenization and stop word removal to remove inconsistent or noisy information from the dataset: the input data is first tokenized as in (1), and the tokenized data is then processed with stop word removal as in (2).

Data Streaming Classification: Grid Search Based Modified SVM
The SVM is a binary classification method based on the structural risk minimization principle. The SVM begins by mapping the training data into a feature space and finding a hyperplane that divides the two classes while maximizing the margin of separation between itself and the points lying closest to it. This decision surface can then be used as a basis for categorizing unknown data [39]. The SVM classification is improved by using grid search optimization, which effectively tunes the SVM parameters for better classification.
The soft-margin SVM solves the optimization problem in (11),

min (1/2) ||w||^2 + C sum_i xi_i (11)

subject to y_i (w . phi(x_i) + b) >= 1 - xi_i and xi_i >= 0 for i = 1, ..., n. The reason for employing the Gaussian SVM, which uses the parameters C and gamma, is to transform the feature vector space so that the separation can be performed with higher accuracy. The transformation is accomplished using the kernel function K(x_i, x_j) = phi(x_i) . phi(x_j), defined for the Gaussian SVM as

K(x_i, x_j) = exp(-gamma ||x_i - x_j||^2)

The choice of proper learning parameters is a significant step in obtaining well-tuned support vector machines, and the settings of these parameters generally depend on a grid search. The pseudo code for the optimization of the SVM parameters using grid search for better classification is given in Algorithm 3. The SVM is initialized with the two main parameters C and gamma, the regularization parameters of the classifier tuned while optimizing the separating hyperplane. The parameter C controls the error penalty: as C increases, the penalty for misclassified points increases and the number of points permitted inside the error margin decreases, whereas a smaller C allows a larger error margin around the separating hyperplane. For the Gaussian SVM, the gamma parameter determines the flexibility of the hyperplane: for small values of gamma the decision boundary is almost linear, for increasing values it becomes progressively more curved, and too large a value of gamma leads to over-fitting on the training data. This grid search based modified SVM classification provides an effective data streaming process.
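The grid search itself can be sketched in a few lines: exhaustively evaluate every (C, gamma) pair and keep the best. The score function below is a stand-in assumption for the cross-validated SVM accuracy the paper optimises; its exact form is only illustrative, and the grids are hypothetical.

```python
import itertools

def grid_search(score_fn, c_grid, gamma_grid):
    """Exhaustively evaluate every (C, gamma) pair; return the best pair
    and its score. `score_fn` stands in for cross-validated SVM accuracy."""
    best_params, best_score = None, float("-inf")
    for c, gamma in itertools.product(c_grid, gamma_grid):
        s = score_fn(c, gamma)
        if s > best_score:
            best_params, best_score = (c, gamma), s
    return best_params, best_score

# Illustrative score surface with a known optimum at C=10, gamma=0.1.
score = lambda c, g: -((c - 10) ** 2 + (g - 0.1) ** 2)
params, _ = grid_search(score, c_grid=[0.1, 1, 10, 100], gamma_grid=[0.01, 0.1, 1])
print(params)  # → (10, 0.1)
```

Typical practice is to search C and gamma on logarithmic grids, exactly as the hypothetical grids above do.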

Results and Discussion
The implementation of the proposed data streaming using adaptive clustering and classification is carried out in Java Apache Spark. The Twitter dataset and the Higgs dataset are used to assess the proposed data streaming. To investigate its performance, the proposed data streaming is compared with the existing artificial bee colony (ABC) optimization and genetic algorithm (GA) techniques in terms of recall, precision, F-measure and convergence.

Performance Analysis of Proposed Clustering
The statistical metrics of F-score, precision, and recall can be expressed in terms of the TP (true positive), FP (false positive), FN (false negative) and TN (true negative) values. The performance of the proposed work is analysed using the statistical measures described in this section.

Precision
Precision is the fraction of the recognized data that is relevant to the original data:

Precision = TP / (TP + FP)

The comparison of the proposed data streaming using improved fuzzy c-means clustering with the existing Fuzzy C-means clustering (FCM) and K-means clustering in terms of precision is shown in Figure 3. It shows that the proposed improved fuzzy c-means clustering performs better in terms of precision than the existing FCM and K-means clustering.

Recall
Recall is the fraction of the data relevant to the query that is correctly recognized:

Recall = TP / (TP + FN)

The comparison of the proposed data streaming using improved fuzzy c-means clustering with the existing Fuzzy C-means clustering (FCM) and K-means clustering in terms of recall is shown in Figure 4. It shows that the proposed improved fuzzy c-means clustering (IFCM) performs better in terms of recall than the existing FCM and K-means clustering.

F-Score
This value determines the accuracy of a test; the best F-measure value is 1 and the worst is 0. F-measure is computed using (16):

F-measure = 2 x (Precision x Recall) / (Precision + Recall) (16)
The comparison of the proposed data streaming using improved fuzzy c-means clustering with the existing Fuzzy C-means clustering (FCM) and K-means clustering in terms of F-score is shown in Figure 5. It shows that the proposed improved fuzzy c-means clustering performs better in terms of F-score than the existing FCM and K-means clustering.

Convergence Graph
The convergence graph of the suggested PSO-based data streaming against the ABC optimization and GA techniques is given in Figure 6. The convergence of fitness over the number of iterations in the proposed PSO system is better than that of the existing ABC and GA approaches.

Computational Time
This is the amount of time taken for the completion of the proposed Twitter data streaming. The computational time of data streaming in seconds can be obtained from the data stream size in bits and the bit rate in bit/s as

T = S / B

where T is the computational time of classification, S is the size of the data stream, and B is the bit rate. The performance of the proposed IFCM against the existing FCM and K-means clustering in terms of computational time is given in Figure 7; the proposed improved fuzzy c-means clustering (IFCM) achieves a better computational time than FCM and K-means clustering. The comparison results for the various performance measures of the adaptive clustering are summarized in Table 1.
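As a quick check of the formula T = S / B, with illustrative (not measured) values:

```python
def streaming_time(size_bits, bit_rate_bps):
    """Computational time T = S / B: stream size in bits over bit rate in bit/s."""
    return size_bits / bit_rate_bps

# A hypothetical 10-megabit stream at 2 Mbit/s takes 5 seconds.
t = streaming_time(10_000_000, 2_000_000)
print(t)  # → 5.0
```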

Average Classification Error Percentage
The comparative assessment of the classification error percentage is given in Table 2. The classification error percentage of the proposed modified support vector machine (MSVM) is significantly lower than that of the existing SVM and Anti-Bayes Multi classification.

Receiver Operating Characteristic (ROC) Curve
The ROC curve is a probability plot which expresses the fitness of a model for class recognition; it is generated by plotting the TP rate against the FP rate. The comparison of the proposed MSVM with the existing Anti-Bayes Multi Classification and SVM in terms of ROC is displayed in Figure 8. The convergence graph of the proposed grid search optimized classification against the existing BAT and Cuckoo search optimization techniques is given in Figure 9. The convergence of fitness over the number of iterations in the proposed system is better than that of the existing BAT and Cuckoo search optimizations.

Simulation Time
This is the amount of time taken for the completion of the proposed data streaming. The simulation time of data streaming in seconds is again obtained from the data stream size in bits and the bit rate in bit/s as T = S / B, where T is the simulation time of classification, S is the size of the data stream, and B is the bit rate. The performance of the proposed MSVM against the existing SVM and Anti-Bayes Multi Classification in terms of computational time is given in Figure 10, which shows that the proposed MSVM-based data streaming yields better computational time than the existing SVM and Anti-Bayes Multi Classification.

Accuracy
Accuracy is the percentage of true outcomes, whether TP or TN, in a given population; it estimates the level of correctness of a data classification process. Accuracy is computed using (19).

Accuracy = (TN + TP) / (TN + TP + FN + FP) (19)

The comparison results for accuracy using the proposed MSVM classification with the existing SVM and Anti-Bayes Multi Classification are given in Table 3, and the corresponding comparison graph is shown in Figure 11. It illustrates that the proposed MSVM classification provides better classification results than the existing SVM and Anti-Bayes Multi Classifications.
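The four measures used in this section follow directly from the confusion-matrix counts. The sketch below computes precision, recall, F-measure (16) and accuracy (19); the counts are purely illustrative, not taken from the reported experiments.

```python
def precision(tp, fp):
    """Precision: TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Recall: TP / (TP + FN)."""
    return tp / (tp + fn)

def f_score(p, r):
    """Equation (16): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

def accuracy(tp, tn, fp, fn):
    """Equation (19): (TN + TP) / (TN + TP + FN + FP)."""
    return (tn + tp) / (tn + tp + fn + fp)

# Hypothetical confusion-matrix counts, purely for illustration.
tp, tn, fp, fn = 90, 85, 10, 15
p, r = precision(tp, fp), recall(tp, fn)
print(round(p, 2), round(r, 2), round(f_score(p, r), 2),
      round(accuracy(tp, tn, fp, fn), 2))
```

For these counts, precision is 0.9, recall about 0.86, F-measure about 0.88 and accuracy 0.875.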

Conclusion
In this paper we have presented effective Twitter data streaming using an adaptive clustering and classification algorithm. The pre-processed data is clustered effectively with Improved Fuzzy C-means clustering, further improved by an adaptive Particle Swarm Optimization (PSO) algorithm, and the modified support vector machine (MSVM) classifier with grid search optimization then performs the classification for data streaming. The experimental outcomes show that the proposed data streaming outperforms the existing ABC- and GA-optimized clustering, as well as the existing SVM and Anti-Bayes Multi classifications, on performance measures such as accuracy, precision, recall, convergence, ROC curve and F-score. These results prove that the proposed clustering technique processes Twitter data streams more effectively than the existing techniques, and that the proposed classification technique likewise processes Higgs data streams more effectively than the existing techniques.