Opinion mining using combinational approach for different domains

ABSTRACT


INTRODUCTION
Opinion mining is an emerging area of research as the popularity and availability of online reviews increase. It is used to determine the polarity of a text: positive, negative, or neutral. An opinion represents an individual's ideas, judgments, assessments, and beliefs about a specific topic. Extracting opinions or sentiments is an important task in both business and academia. Every manufacturer wants to know how its products are reviewed [1]. Opinions also affect decision making, because individuals rely on what others think. Sentiment analysis, which extracts and analyzes sentiments, has therefore become a popular research stream [2][3].
Classifying documents according to their polarity is a key activity in opinion mining. Documents are written with a positive, negative, or neutral orientation. Word polarity is an important feature in opinion mining. Because word polarity is domain dependent, knowledge of the domain is required to decide whether a word is positive or negative. Reusing the knowledge of one domain in another domain is therefore a key issue in opinion mining.
Domain adaptation can be a solution to this problem: knowledge is transferred from one domain to another, reducing cost in terms of time and money. Consider the task of automatically classifying reviews from the Electronics domain into positive and negative orientation. For this task, we first have to collect many reviews from the domain and then train a classifier on the reviews with their corresponding labels. A large number of reviews is needed to maintain good classification performance, and labeling reviews for each domain is a time-consuming and expensive process. Hence the need arises for domain adaptation, which applies the knowledge of one domain to another [4].
Structural correspondence learning (SCL) was proposed to extend structural learning. SCL defines pivot features, which are common to both the source and target domains, and tries to find the correlation between pivot features and non-pivot features. Whitehead et al. [5] proposed a method for building ensemble models using lexicon similarity that yield high classification accuracy for domains in which no training was performed. They report that an adjusted form of cosine similarity between domain lexicons can be used to predict which models will be effective in a new target domain. Jialin et al. [6] proposed a general framework for cross-domain sentiment classification. Spectral feature alignment (SFA) creates meaningful clusters with the help of common words. A bipartite graph is constructed between the common (domain-independent) words and the uncommon words of the source and target domains. Mutual information is used to select the common words, and a binary classifier is trained for classification. Experimental results show that the approach performs effectively on both document-level and sentence-level classification.
Liu et al. [7] proposed a method for co-extracting opinion targets and opinion words using a word alignment model. The main focus is on detecting opinion relations between opinion targets and opinion words. Compared with previous methods based on nearest-neighbor rules and syntactic patterns, the proposed method captures opinion relations more precisely. An Opinion Relation Graph is constructed to model all candidates, and a graph co-ranking algorithm estimates the confidence of each candidate; the items with higher ranks are extracted. Experimental results on three datasets of different languages and sizes demonstrate the effectiveness of the method. As future work, the authors plan to consider additional types of relations between words, such as topical relations, in the Opinion Relation Graph.
Balamurali A. R. et al. [8] proposed an approach for cross-domain sentiment tagging. A method for creating a high-accuracy in-domain classifier using simple low-level features is introduced. A generic classifier based on a meta-classification approach, coupled with this in-domain classifier, is used to create labeled data for a new domain from domains that already have labeled data. The results showed considerable improvement in cross-domain sentiment tagging accuracy when the domains are similar; for dissimilar domains, the system exceeds the baseline accuracies by substantial margins. Bollegala D. et al. [9] proposed a method for creating a thesaurus that is sensitive to the sentiment of words from different domains. The authors used both labeled and unlabeled data. The created thesaurus was used to expand the feature vocabulary at train and test time in a classifier. Compared with many baseline methods, the proposed method showed good performance. Shoushan et al. [10] proposed an active learning approach in which source and target classifiers are trained separately. Informative samples are selected using the Query by Committee (QBC) selection strategy, and the classification decision is made by combining the classifiers. Label propagation is used to train both classifiers. The results show that the approach significantly outperforms the baseline methods.
Like ensemble classifiers, graph-based methodology is also used for domain adaptation. Inderjit S. et al. [11] proposed a graph-based domain adaptation method. A similarity graph is constructed over features from all domains, with an edge between two features if they are similar. All labeled features are used in metric-learning algorithms. The graph is constructed using a data-dependent metric, and a weight is calculated for each edge. Experimental results demonstrate a reduction in classification error. S. Bhatt et al. [12] proposed an algorithm that adapts a classification model by iteratively learning domain-specific features from the unlabeled test data. Moreover, this adaptation takes place in a similarity-aware manner by integrating the similarity between domains into the adaptation setting. Cross-domain classification experiments on different datasets, including a real-world dataset, demonstrate the efficacy of the proposed algorithm over the state of the art.
Many open-source lexicons are available that serve as databases for extracting the polarity values of opinion words. However, these generic polarity lexicons capture only the general sentiment of opinion words, whereas an opinion word can be context dependent or domain specific. The word "small" may carry a negative orientation in the hotel domain but a positive one for mobile applications. In the same way, "freezing" is good for a refrigerator but negative for software applications. Because the same word carries different opinions in different domains, generic lexicons, which contain only a generalized polarity for each word, are of limited use. Hence the need for domain-adaptable lexicons emerges. Domain adaptability is a major issue in sentiment analysis, and it is addressed in the proposed framework. The proposed approach builds a classifier that combines a maximum entropy classifier with clustering based on pointwise mutual information between words.
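To make the notion of pointwise mutual information (PMI) between words concrete, the following is a minimal sketch under stated assumptions, not the paper's implementation. It uses the standard definition PMI(w1, w2) = log2( P(w1, w2) / (P(w1) P(w2)) ) and assumes that probabilities are estimated from review-level co-occurrence; the paper does not specify the estimation window, and the function name pmi_table is hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(documents):
    """Return {(w1, w2): PMI} for word pairs that co-occur in at least one document.

    Assumption: P(w) and P(w1, w2) are estimated as the fraction of documents
    (reviews) containing the word or the pair; the paper does not state this.
    """
    n_docs = len(documents)
    word_df = Counter()   # number of documents containing each word
    pair_df = Counter()   # number of documents containing each word pair
    for doc in documents:
        vocab = sorted(set(doc))            # sorted so pair keys are canonical
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))
    table = {}
    for (w1, w2), joint in pair_df.items():
        p_joint = joint / n_docs
        p1 = word_df[w1] / n_docs
        p2 = word_df[w2] / n_docs
        table[(w1, w2)] = math.log2(p_joint / (p1 * p2))
    return table
```

A positive PMI means two words co-occur more often than independence would predict, which is the kind of signal a clustering step can use to group words; pairs that never co-occur are simply absent from the table.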

PROPOSED METHOD
An opinion lexicon is a word or group of words in a review. Identifying opinionated words or lexicons is an important task. The proposed method consists of several tasks. Data collection is a crucial task, as a large amount of data is available online. For the proposed approach, Amazon's multi-product review dataset is used. After data collection, the data must be cleaned. Stop-word removal is applied to the collected data, and part-of-speech tagging is used to tag words, for example as adjectives or adverbs. After this preprocessing step, the classifier is applied to the cleaned data. A maximum entropy classifier is used along with clustering, as shown in Figure 1.
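As an illustration of the classification step, the sketch below trains a two-class maximum entropy model, which for two classes is equivalent to logistic regression, on binary bag-of-words features. It is a minimal sketch under assumptions: the feature representation, learning rate, number of epochs, and the function names train_maxent and predict_proba are all hypothetical, and the paper does not state how its classifier was trained.

```python
import math
from collections import defaultdict


def predict_proba(weights, bias, tokens):
    """Probability that a review (list of tokens) is positive."""
    score = bias + sum(weights.get(tok, 0.0) for tok in set(tokens))
    score = max(-30.0, min(30.0, score))   # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-score))


def train_maxent(samples, epochs=200, lr=0.5):
    """Fit weights by stochastic gradient ascent on the log-likelihood.

    samples: list of (tokens, label) pairs, label 1 = positive, 0 = negative.
    Assumption: each word is a binary presence feature, with no regularization.
    """
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for tokens, label in samples:
            error = label - predict_proba(weights, bias, tokens)
            bias += lr * error
            for tok in set(tokens):
                weights[tok] += lr * error
    return weights, bias


# Toy usage with made-up reviews (not taken from the dataset).
toy = [
    (["great", "battery"], 1),
    (["excellent", "sound"], 1),
    (["poor", "battery"], 0),
    (["terrible", "sound"], 0),
]
toy_weights, toy_bias = train_maxent(toy)
print(predict_proba(toy_weights, toy_bias, ["great", "sound"]))
```

In this toy setting the polarity-bearing words receive large weights while words shared by both classes stay close to zero, which mirrors why opinion words matter more than topical nouns for polarity.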

RESULTS AND ANALYSIS
The Multi-Domain dataset of Blitzer et al. [14] is used to evaluate the proposed method. The results of each step are recorded in the following section. The preprocessed data is used for the classification and clustering process. In the first experiment, the source domain is divided into 5 parts and the target domain is taken as is. The second experiment is performed on combinations of source domains.
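The division of the source domain into 5 parts can be sketched as follows. This is only an illustration: the helper name split_into_parts is hypothetical, the split is assumed to be contiguous and nearly equal-sized, and how each part is subsequently used for training and testing is not detailed in the text.

```python
def split_into_parts(items, n_parts=5):
    """Split a list into n_parts contiguous parts whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)   # first `extra` parts get one more item
        parts.append(items[start:start + size])
        start += size
    return parts


# Example: hypothetical identifiers standing in for the reviews of one source domain.
source_reviews = [f"review_{i}" for i in range(2000)]
parts = split_into_parts(source_reviews)
print([len(p) for p in parts])
```

If the reviews are stored grouped by label, shuffling them before splitting would be needed so that each part contains both classes.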
The Multi-Domain Sentiment Dataset contains product reviews taken from Amazon.com for many product domains [14]. The dataset contains three types of files in XML format: positive, negative, and unlabeled. Each line has the form: feature:<count> .... feature:<count> #label#:<label>. These files are extracted using an XML file splitter, and the reviews are written into text files as shown in Figure 2. The dataset contains 1000 positive files and 1000 negative files for each domain.
A preprocessing step is applied to this dataset to remove noisy data. In this phase, unnecessary words, called stop words, are eliminated. This is important because it removes irrelevant data from the reviews and reduces the overhead of processing a large amount of textual data. Most English sentences include words such as "a", "an", "of", "the", "I", "it", "you", and "and". Such words do not carry any particular meaning. Information can be extracted from natural language more effectively by ignoring words that occur very often. To remove stop words from sentences, a text file containing a list of English stop words is used, as shown in Figure 3.
After stop-word removal, a parser is used to extract the part of speech of each word in a sentence, such as noun, adjective, adverb, or verb. Parsing is a vital step because it gives the opinion words as output. Assigning part-of-speech tags such as noun, preposition, verb, adjective, and adverb to the words of a given text is known as part-of-speech (POS) tagging. A part of speech is a category used in linguistics that is defined by the syntactic or morphological behavior of a word. Traditional English grammar classifies words into the following categories: verb, noun, adjective, adverb, pronoun, preposition, conjunction, and interjection. POS tagging is important for information extraction because each category plays a specific role within a sentence.
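The line format and the stop-word filtering described above can be sketched as follows. This is a minimal sketch under assumptions: the function names are hypothetical, the stop-word set is the short list quoted in the text rather than the full file the paper uses, and the sample line is made up for illustration.

```python
from collections import Counter

# Abbreviated stop-word list taken from the examples in the text;
# the paper reads its full list from a text file.
STOP_WORDS = {"a", "an", "of", "the", "i", "it", "you", "and"}


def parse_review_line(line):
    """Parse 'feature:<count> ... #label#:<label>' into (feature counts, label)."""
    counts = Counter()
    label = None
    for token in line.split():
        name, _, value = token.rpartition(":")   # split on the last colon
        if name == "#label#":
            label = value
        elif name:                               # skip tokens without a colon
            counts[name] += int(value)
    return counts, label


def remove_stop_words(counts, stop_words=STOP_WORDS):
    """Drop features whose name is a stop word (case-insensitive)."""
    return Counter({f: c for f, c in counts.items() if f.lower() not in stop_words})


counts, label = parse_review_line("the:3 battery:2 great:1 of:1 #label#:positive")
print(label, dict(remove_stop_words(counts)))
```

Only whole features that equal a stop word are removed here; multi-word features that merely contain a stop word are left untouched, which is one of several reasonable choices.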
Nouns give names to the objects or entities in reviews. An adjective describes an opinion. Some adverbs can also play a key role, as adjectives do. First, the review text is divided into sentences. The Stanford parser is used to generate the POS tag of each word in a sentence. This is essential because it helps in finding general language patterns, as shown in Figure 4.
Table 1 shows the accuracy of the same technique for 12 classification tasks. [Figure 5] Figure 6 shows the effect of combining multiple source domains. We see that the combinations of DVD and Electronics, and of Book and DVD, as source domains give the highest accuracy. We also observe that the accuracy with two source domains is always greater than the accuracy with a single source domain. The experiment was done for 400 files. The combination of Book and Kitchen as source, applied to DVD as target, shows a lower accuracy of 75.25% compared to the others. Similarly, Electronics and Kitchen, and Book and Electronics, as sources with DVD as target produce 71.5% and 79.75% accuracy, respectively. The accuracy for these source combinations with DVD as target is thus lower than for the other combinations. This indicates that the words from these source domains do not match those of the target domain well, as shown in Figure 6. The efficiency of the model depends on the similarity of the domains.

CONCLUSION
The contribution of this paper is the application of a semi-supervised method to build a domain adaptation model for feature extraction. The proposed approach uses a maximum entropy classifier to classify reviews into two classes, positive and negative. Clustering is applied to the classified words using pointwise mutual information. The experimental results show good accuracy for 5-fold cross-validation and for combinations of source domains. The multi-product dataset of Blitzer et al. [14] is used for the experiments. In the proposed framework, clustering is applied only to unigrams; bigrams can be used in the future, and non-word features can also be included in the approach.