Sentimental classification analysis of polarity multi-view textual data using data mining techniques

ABSTRACT


INTRODUCTION
In the digital age of the rapid shift to information era, enormous amounts of textual data have achieved by the participating in virtual communities on behalf of sharing data and information, social communications and daily transactions [1]. Manually processing of this huge amount of unstructured textual data and information is very costly in term of time-consuming and expensive process. Accordingly, taking advantages of this type of data and information is very hard for the reason of its availability in various forms such as unstructured or semi-structured text data forms [2,3]. The majority of indexed resources of this textual data are mainly extracted from online reports and documents, e-mails, online textual blogs, memos, digital letters, etc [3]. Textual data resources may possibly consist of data collected from multiple information contributors with different representations and usually handled by heterogeneous analytical models. This diversity in the textual data sources and procedures forms the sentiment multi-view textual data. This type of data comprise unique information features that involve information have the same instance but with various representations or have different instances with same representation [4]. Thus, the efficient understanding of sentiment multi-view textual data is becoming an extremely critical movement by the way of using data mining methods. The main promise of data mining tools and techniques to take advantage of multi-view textual data is the potential speed up of big data era [5]. Various methods and strategies of data analysis are proposed for delivering online and/or integrated solutions to solve complexities associated with the management of various data forms in sentiment multi-view textual data. The mixed methods and likewise the hybrid methods of text data mining technologies may be classified as the modern practice for analyzing sentiment multi-view textual data [6]. Textual Data Mining is "roughly equivalent to text analytics, refers to the process of deriving high-quality information from text" [7]. Moreover, it is as well considered as a path of data mining field when data mining algorithms and methods are applied for the discovery of knowledge in the textual data forms. Many of text data mining techniques can be individually applied or incorporation with other techniques for discovering sensible and valuable knowledge. Although, text data mining has achieved various successful applications in many areas, some deep sentiment text analysis tasks still remain as a challenge and not handled yet. The real distinction of technological efforts with text data mining from the general concept of data mining is the possibilities to deal with the special characteristics of implicit information in textual data that seems to be primarily unstructured. However, the typical practice of such technological possibilities is yet in the early stages to reveal the necessary knowledge that extracted from this unstructured textual data [8]. Therefore, the key aspect motivated this paper is that multi-view textual data can be characterized as a very complex implicit data structure. In this context, the main problem to deal with this type of data is to design hybrid techniques of data mining algorithms that able to handle implicit data and information. Additional combined efforts of knowledge discovery applications have the possibilities to reveal associating common information patterns in the multiple views or modalities of textual data formats. At Present, there is fewer addressing methods of data mining and machine learning for learning from these semi-structured or unstructured data types in the textual data formats [9]. M. Jakubik [10] has summarized knowledge creation management with incorporation of gathering and sharing information, and effectively developing this information and then shifting the knowledge to demanders. This research is going to investigate the track outlined by M. Jakubik, [10] for managing the creation of knowledge but then will construct this research by operating the tools and techniques of data mining in multi-view textual data. Consequently, the primary purpose of conducting this research is in the direction of find out in which way can automatically classify knowledge and more accurately identified classes from unstructured textual data documents. Consequently, it will offer further knowledge that can be realized, improved, and delivered to the demanders.

RELATED WORK
In this section, a brief description is provided to assess the existing approaches of textual sentiment analysis and the distinction between the proposed framework and related previous works. The textual sentiment analysis is very well-known as part of data mining approaches, which can essentially classify the sentimental multi-view textual data into two categories as positive or negative [11]. In this research, the proposed framework is applied on online polarity user-reviews to be classified as 'good' or 'bad' sentiments. In actual fact, extracting knowledge from information available in textual data is increasingly gaining more consideration. This consideration by related previous works have developed a number of sentiment analysis methods to measure the opinions of users about products or services. As example of early developed methods, Reddick et al. [12] investigated features of text in which social media by some public services delivery. Likewise, Müller et al. [13] had applied a variety of text analytic approaches to understand obstacles related to customer service. Some data mining approaches have involved in the knowledge discovery from sentimental multi-view textual data, such as features extraction of online user generated contents [14], to prediction and classification [15]. In the related research of user's reviews, a sentiment classification of movie reviews has been handled by Ahuja et al. [16] using dual training and dual prediction while addressing polarity shift. Consequently, Zola et al. [17] have propose a novel sentiment classification, in which a three-step methodology is explored based on balanced training, text preprocessing and machine learning using two languages: English and Italian.
In this context, the data mining algorithms are engaged strongly for knowledge creation from sentimental textual data. As examples of using data mining algorithms, Wang et al. [18] introduced a convolutional recurrent neural network for classifying text documents, using support vector machines  [19,20], using naive bayes for optimal feature selection in text categorization by Tang et al. [21], and using naive bayesian algorithm by Tang et al. [22] for text unstructured classification in the Spark computing environment. Even though some of those solutions happen to be highly effective, the challenges continue to be relevant to the variation of multi-view polarity textual data that required preprocessing, anddeciding on the proper classification procedure [23]. Therefore, this research is focusing on the difficulty of providing improved accuracies of classifying the sentimental multi-view textual data into two categories through the application of the proposed framework on online polarity user-reviews dataset upon a given topics.

RESEARCH METHOD
The method strategy that has been designed for this paper is divided into conceptual development stage of the proposed framework which relies on the combined tools and techniques of data mining. This combination of data mining algorithms can serve the purpose of handling unstructured textual datasets. The complete procedure involved with identifying comprehending knowledge from unstructured textual data sources is simply implemented using the direction outlined in a matter of this section. The following actions had been experienced to improve and refine the development of the framework design: -Identification of various source information regarding textual analysis data mining.
-Evaluation and analyzing available techniques and approaches of data mining for analysing textual databases. -Determining beneficial data mining algorithms to design hybridized framework for analysing textual databases. The next stage is the design and implementation of the framework to obtain the results and verify the objectives of this research.

Conceptual development of algorithms 3.1.1. X-means Clustering Algorithm
The X-Means clustering algorithm proposed by Pelleg and Moore [24] is a K-Means extended by an Structure Improved part. The centroid in this part of the algorithm tries to be divided automatically in its area. The join decision among the children of each centroid with the centroid itself is finalized comparing with the Bayesian Information Criterion (BIC) values. Nevertheless, the aim of X-means algorithm is to deliver a fast and effective way to cluster unstructured data.
The X-Means clustering algorithm allows to specify the range number of clusters. It starts to run with the minimum number of clusters (K) of the given range and assigns a BIC score and it continues to add centroids where they are needed until the maximum number of clusters (Kmax) range is reached. Throughout the mentioned process, the centroid set will be recorded when it is achieving the best score, and this becomes the final centroid set output. Each model has (K) to (Kmax)+1 numbers of centroids and a BIC score which will decides the fit number of clusters. Thus, it is automatically selecting the optimal resultant of clusters number. The steps flow of the X-Means clustering algorithm is given in the Figure 1, as defined in [24].
The X-means Clustering functions is associated with strengths for further improve the knowledge discovery within multi-view textual dataset in terms of the proposed framework implementation. The clustering task process is repeated with updating the best BIC score till the maximum number of clusters (Kmax) range is reached which is provides a typical number of clusters. Also, this will affects positively the required processing time. Furthermore, the use of concurrency speeds up the clustering process and the use of the BIC gives a mathematically complete measure of quality.

HotSpot algorithm of association rules
HotSpot algorithm of association rules finds out a set of rules presented in structure similar to a tree, which is maximize or minimize a target value of interest [25,26]. The HotSpot algorithm with a nominal target looks for segments of the information where there is a high chance of a minority value occurring by given the constraint of a minimum support, and with a numeric target could be attracted in discovery of segments where they are higher than average in the whole dataset. For example, in the sentiment multi-view textual data scenario, it attempts to find in which positive or negative textual review groups are at the highest emotional claim ratio, or in which positive or negative textual review groups have the highest average sequence of sensitive issues in sentimental reviews dataset. This algorithm is close to the PRIM bump hunting algorithm described by Friedman and Fisher, [27].
The Single-key Information sequences were recently linked with identified clusters using X-means clustering algorithm, which is used in this research for separation to a sentimental review categories. The goal here is to identify characteristics of each textual document as positive or negative user-review categories, where performing HotSpot association rules later for multi-view textual data is significantly helpful in terms of maximizing the target value of interest in each cluster to obtain the maximum concentration improvement by recognizing the frequent individual terms in each cluster and combines them to a larger term sets as long as those term sets appear appropriately frequented in the database. The whole algorithm can be divided into four steps as the Figure 2.
The automatic discovery of frequent term sets or multi-key frequent information rules (MKFIR) is a very promising area of data mining applications in the field of multi-view textual data. A frequent term set is described as being a collection of key phrases which usually takes place inside the text document set ın excess of the minimum support times.

Naive bayes and decision trees (J48) algorithms for classification
In this stage, two different classifiers, in particular decision trees (J48) and naive bayes classification algorithms are compared and applied in this research. The purpose of analyzing each of these distinct classification algorithms lies in the fact that each uses different selection criterias to the critical information variables. The J48 classification algorithm is based on decision tree. Naïve bayes classification algorithm is based on probability [28]. Successively, they extend right from the simple distance measure to other path of probabilistic distance measures to search the similarity criteria among a set of documents and then classify them directly based on their particular category. In this manner, the main objective of such experiments is to verify the proposed framework on such basis as maximizes the target value of interest by discovering multi-key frequent information rules, which delivers considerably better accuracy when compared to the simple termbased classification techniques. The Decision Trees and Naive Bayes classification algorithms has been examinedwith regards to classifying textual data in accordance with two different classes.

THE PROPOSED FRAMEWORK
In this section the major functions of knowledge discovery defined above are integrated to develop the proposed framework for classifying the sentimental multi-view textual data. This can be done through maximizing the target value of interest by discovering multi-key frequent information rules. The initial stages of knowledge discovery are realized by applying the X-means clustering algorithm to identify the Single-key Information sequences. Afterward, the identified Single-key Information sequences utilized by the HotSpot algorithm of association rules to generate multi-key frequent information rules. The integrated application of these data mining algorithms which generate the multi-key frequent information rules can be utilized to shape a knowledge-based relational database for improved analyzing and handling of the sentiment multi-view textual data. The benefit with discovering the key information in terms of multi-key frequent information rules serves to automate the process of categorizing the sentimental multi-view textual documents within two predefined categories (i.e. positive or negative information classes).
Furthermore, the proposed framework is outlined to analyze sentiment multi-view textual dataset, which lies to four main data processing stages, as shown in Figure 3. The knowledge discovery and text classification from these four main parts is influenced by the discovery of multi-key frequent information rules. When the information is already has been structured by the data preparation and preprocessing stage, it is then passed to the data clustering stage for arranging textual data into a number of clusters and identify the Single-key Information sequences. Identifying the sequence of key terms helps ensuring some key information rules in the multi-view textual documents.
In the context of current research, each document available in the sentimental multi-view textual dataset comprises a combination of terms to explicate impression. Maximizing target value of interest by utilizing the multi-key frequent information rules enables summarizing the key information within each particular document for further processing or classification. The illustrated framework above with integration of different efforts of data mining algorithms is proposed to analyze free formatted form of multi-view textual data. The proposed framework is achieved through the application of X-means algorithm for clustering and HotSpot algorithm of association rules to discover useful knowledge in terms of multi-key frequent information rules.

DATA ANALYSIS
The analysis of classifying the sentimental multi-view textual data into two categories through the application of the proposed framework is discussed in the scenario of online polarity user-reviews dataset.

Data collection
In order to perform the experimentation of this research study, the performances of particular test and analysis with the proposed framework is conducted using sentiment polarity reviews dataset. The sentiment polarity reviews data provided by [29] is one of the commonly used sentiment analysis datasets. The sentiment polarity reviews dataset involves 2000 online user-created reviews on a given topic archived from the cs.cornell web portal and it is available in (http://www.cs.cornell.edu/people/pabo/ movie-review-data).

Functional requirements
Weka 3.7.12 software is used to support the activities defined during the experimentation of the proposed framework, starting with the pre-processing stage toward the application of knowledge discovery level, which is used to classify the sentiment polarity reviews dataset.

Implementation of framework functionalities
The validation of the proposed framework is performed in this section in order to ensure better performance than the simple term-based classification models, through identifying key knowledge areas within sentiment polarity reviews dataset. a. Preparation and Pre-processing of Text Documents The data preparation and pre-processing tasks are being used to remove un-necessary terms from unstructured text data formats and prepare the textual data for further analysis of various data mining algorithms for subsequent analysis of the text. The data preparation and pre-processing tasks are including the following steps: -Remove the stop words or terms that can be less effective in knowledge discovery from sentimental multi-view textual data. -Perform word stemming process by way of reducing derived words or terms to its actual stem or the root form. -Structure the information into vector form in which every term is considered a word vector.
-Represent text in a form of term frequency (TF) matrix by counting the words and their corresponding frequencies, as shown in Figure 4(a). -Determine the similarities by calculating the Euclidean Distance method. -Weight terms in the set of documents through Inverse document frequency (IDF) and term frequency (TF) by using its matrices (TF*IDF). -The output is then saved in attribute relation file format (.arff) file that will be used later for further data mining analytics. b. Application of X-Means clustering algorithm The X-means algorithm is applied on the (.arff) file format which was obtained as a result from the preparation and pre-processing stage. This algorithm is very supportive to calculate automatically the appropriate number of generated clusters that capturing related information by using the Single-key Information sequences. The implementation experiment of this research has been shown that X-means algorithm adopts the total clusters for the used sentiment multi-view textual dataset to be two clusters as shown in the Figure 4(b). It gave substance to be very advantageous clustering algorithm in preserving valuable information structures that identified in the sentiment multi-view textual dataset. Thus, the X-means algorithm has been utilized as an efficient tool to capture valuable key Information within every single cluster, as shown in the Figure 4(c). c. Application of HotSpot association rules algorithm The discovery of maximum target value of interest by determining multi-key frequent information rules in multi-view textual datasets helps with finding terms that appear most frequently with each other. Knowledge discovered with the help of rules for maximum target value of interest is employed for mapping the positive or negative information document categories, as shown in the Figure 4(d).

d. Application of classification algorithms
The output of (.arff) file based on Single-key Information sequences obtained by X-means clustering and maximum target value of interest achieved by HotSpot association rules, which is then loaded into Weka 3.7.12. Two different classifier algorithms were tested to classify the sentiment multi-view textual data into positive or negative information document categories and the result accuracies that achieved by both classification algorithms are described and discussed in next (Section 5).

RESULTS AND DISCUSSION OF CLASSIFICATION ACCURACIES
The classification result accuracies are measured by way of classifying sentimental multi-view textual data as positive or negative information document categories. The classification results show the rate of correctly classifying the polarity data when both of simple-term based classification method and the proposed method are separately used for classifying sentiment multi-view textual data of polarity reviews dataset into their respective categories. The calculated correctly classifying rates are provided in Table 1.
The correctly classification rates have been measured alongside with every single classifier for predicting the correct category of each polarity information document. Consequently, the highest rate of correctly classification has the capacity with more accuracy assured in the process of decision making. Different settings of parameters have been tested by using Weka 3.7.12 software, in which different test modes like cross-validation with 10 folds and using training dataset are tested. The classifiers in the proposed method using rules for maximum target value of interest are all result improved accuracy than the simple term-based classification method, which is using (term frequency matrix) only.

CONCLUSION
This paper has been discussed the integration of different data mining algorithms for sentiment classification of multi-view polarity textual data. A data mining framework is proposed for generating Single-key Information sequences of knowledge by X-means clustering and maximum target value of interest by discovering Multi-Key Frequent Information Rules in each cluster by using HotSpot association rules will perform classification with improved accuracies of the classifiers. The analysis results have shown improved accuracies of classifying the sentimental multi-view textual data into positive or negative information document categories.