Business intelligence analytics using sentiment analysis-a survey

ABSTRACT


INTRODUCTION
Natural language is playing key role for direct or indirect communication. Any information can be partitioned in to facts and opinions. Sentiments are the subjective statements that reflect the sentiments of an individual about an event or an object. The technique to extract subjective information from text and determining the overall contextual polarity of opinion is called as sentiment analysis. As stated by Liu [1,2], it is always useful for businesses to know impressions by their customers about their product or services; so as to know the reasons of their profits or losses. Also, the customer who wants to buy a product would like to know the impressions about that product by existing users [3]. Users may comment on overall product or feature of the product. To find the sentiment of an object, first the sentiment words for features of given object are identified. These features assign weights either on the basis of their polarity or on the basis of polarity and importance. The polarity weights of all the features of the object are aggregated in order to get the aggregate sentiment about object [4]. The techniques for measuring sentiments are broadly classified into machine learning methods and lexicon-based approaches.
Huge number of sentiments, their analysis and impact of parts of speech on sentiment polarity raised great challenges to the researches. A comparative study of different machine learning algorithms, linguistic features, natural language processing will definitely provide a proper direction to all researchers for better business intelligence decisions [5].

Review sites
Reviews about products or services are available on review site. Review sites are helpful to both manufacturers and customers of the product. The manufacturers will get the sentiments through these sites [6]. Reviews on website can not be restricted. Positive reviews may be written by businesses or individuals being reviewed, while negative reviews may be fake.

Blogs
It is a periodically updated web page, run by a person or group. Blogs are written in an informal language. The bloggers can post blogs hourly, daily or weekly and make conversation fast and real time.

Micro blogging
It is a website where a user makes short, frequent posts. Micro blogging is a broadcast that is available in the form of blogs. A micro blog is smaller in its contents than traditional blog. Using micro blogs users exchange audio, video links, images, short texts which are used to comment their reputation. These micro blogs are also called as micro posts.

News articles
Most of the websites of daily news papers allow users to comment on ongoing event or issue. Rich site summary is also helpful to get the sentiments posted by readers.

Social aetworks
Social networks like Facebook, Twitter are lifeline of conversation for today's world. People are sharing their opinion through these websites. a. Twitter It is online micro blogging service helpful for reading or sending textual posts. The post is called as 'Tweet'. b. Facebook One can create his/het profile on Facebook. Images, text message, video can be uploaded on Facebook. Based upon privilege provided to a person he/she can see the profiles of their friends if added, to exchange text messages.

SENTIMENT CLASSIFICATION OUTLINE
SA architecture shows that preprocessing, feature extraction, sentiment classifications are the steps involved in it.
3.1. Basic terminologies a. Opinion: Based on knowledge or experience about any object or event, one can express opinion about it. Mathematical expression for opinion is (o, po, f, ho, t), where o is object, po is the polarity of the opinion for particular feature of object or event, f is feature of object or event, ho is the person or group posted opinion t is the time at which opinion is impressed. b. Opinion Holder: A person who expresses their views about any object or event is called as opinion holder. c. Object: Object is any entity or event say A. It is a pair, A: (C, P), where C is the part of A, and P is the feature or attribute of A. An opinion is expressed as "I like Red Mi mobile phone", or based on attribute of Red Mi mobile phone, opinion can be given as "The display quality of Red Mi mobile phone is nice" [7]. Also, the component of an object or feature of an object can be used to express the opinion. d. Object to Feature Form Simplification: The general opinion about an object can be expressed as "I like Red Mi Phone". Whereas the opinion is expressed on feature of the object as "Red Mi phone has a very good picture quality", here Red Mi phone has a feature 'picture quality'. Let, an object O has feature f in sentiment text [8].

Preprocessing
We are interested in features of an object. For this preprocessing of input data are preprocessed using following steps. a. Tokenization: White spaces, special characters, symbols are removed; remaining words are called as tokens. b. Removal of Stop Words: The articles and common words like "a, an, the, this, that am, is", etc. c. Stemming: Reduces the tokens or words to its root form. d. Case Normalization: It changes the whole document either in lower case letters or upper case letters [9].

Feature extraction
This step deals with the following scenario. a. Feature Types: It deals with finding of types of features used for sentiments viz. term frequency, term co-occurrence, sentiment word, negation, syntactic dependency. b. Feature Selection: It deals with finding good features for sentiment classification viz. information gain, odd ratio, document frequency, mutual information. c. Feature Weighting Mechanism: It calculates weight for ranking the features using term frequency and inverse document frequency. d. Feature Reduction: The dimensionality of features is reduced for better performance.Opinion posted is classified as positive opinion, negative opinion and neutral opinion. The 3 levels of sentiments analysis are as follows.

Levels in sentiment analysis
Opinion posted is classified as positive opinion, negative opinion and neutral opinion. The 3 levels of sentiments analysis are as follows.

Document level
The whole document is considered for impressing the opinion as positive, negative or neutral. The opinion about an object may be expressed without using any opinion word. In this case natural language processing plays a vital role to mine the correct sentiments. The main challenge is to extract subjective text for inferring the overall sentiment of the whole document.

Sentence level
The documents in collection are divided into sentences and then the sentences are classified as per positive, negative or neutral polarity. A document is a combination of subjective and objective sentences. First the subjective sentences are determined and then the opinion in those subjective sentences will be calculated.
The sentence level polarity identification can be done in either of the two ways: a grammatical syntactic approach or a semantic approach. The grammatical syntactic approach takes grammatical structure of the sentence into account by considering parts of speech tags. [10].

3.4.3.Word or phrase level
When product feature is considered for sentiment analysis, it is word or phrase level sentiment analysis. It uses adjective, adverb as features .Word level sentiment can be attained by 'Dictionary Based Approach' or 'Corpus Based Approach'. a. Dictionary based approach Sometimes the opinion is not expressed by a popular keyword. Some jargons may be used to express the sentiments. Here WorldNet containing the synonyms and antonyms is considered for finding out the polarity of a word. b. Corpus based approach In this method, occurrence of any word with other word whose polarity is known is taken in to account. Adjectives joined by 'and' show the same impression and if joined by 'but' show opposite impression.

ROUTES TO ACHIEVE SENTIMENT ANALYSIS
Several algorithms, techniques are used to achieve sentiment analysis. Still many researchers are striving for improvement of existing methods and developing new effective methods. Types of sentiment analysis approaches is shown in Figure 1.

Machine learning strategy
Using Machine learning the performance P of current task T using past experience E is improved. Machine learning techniques first train the algorithm with some particular inputs to formulate a model. Lateron this model is used to test the new data for categorization. Some of the techniques are as follows:

Support vector machines (SVM)
SVM are supervised learning methods used for classification. SVM uses hyper plane to separate multidimensional data. All the data points on these planes are called as supports. The supports which are closer to hyper plane boundaries are called as support vectors, Figure 2. Number of hyper planes is possible for given data. The hyper plane with highest margin is selected for classification. With greater margin, less classification errors can be attained. SVM is performing well in text classification and in a variety of sequence processing applications [11].

Naive bayes
It is simple in implementation with high accuracy. The algorithm will take every word in the training set and calculate the probability of it being in each class (positive or negative). It assumes that a feature value does not depend on other feature value. A naive Bayes classifier describes that the correlation between features is not considered to exist any object. For example Red color, round shape, 5 to 8 cm in diameter object is tomato. If a single feature is present, it will be considered as tomato.
In machine learning we are often interested in selecting the best hypothesis (h) given data (m).In a classification problem, our hypothesis (h) may be the class to assign for a new data instance (m). One of the easiest ways of selecting the most probable hypothesis given the data that we have that we can use as our prior knowledge about the problem Bayes [11].
Bayes' Theorem is given as: P(h|m) = (P(m|h) * P(h)) / P(m) Where P(h|m)-It is posterior probability of hypothesis h when data m is given. P(m|h) -It finds probability of data m considering that hypothesis h is true. P(h) It is the prior probability of hypothesis h being true. P(m) -The data probability.

Decision trees (DT)
DT divides input training data into subset and pattern is found from the subset. These patterns are used further for classification. The knowledge gained from patterns is used to construct a tree. A decision tree consists of a root node, interior nodes and leaf nodes. Root node consists of whole dataset, interior nodes split the incoming data set into two or more parts based on dissimilarity and leaf nodes represents classification [12].
The in record form data is as follows. (x, Y)={x1 ,x2, x3........, xk, Y) Y is dependent variable. The vector x is combination of the features, x1, x2, x3.. xk. At each node best attributes to partition data into individual classes are chosen. The best partitioning attribute is selected using information gain measured are used in speech and language processing.

Rule based approach (RBA)
Relational rules represents the knowledge is the backbone of RBA. This classification technique is based on IF-THEN rules. It has very poor generalization, but provides better performance within a narrow domain. This approach includes learning classifier systems, association rule learning, and artificial immune systems. [12]. Rules can be generated by different methods. Support and confidence are the two common methods for generating rules.
The number of instances present in the training data set and which are relevant to the rules are called as support. The Confidence represents the conditional probability that the right hand side of the rule is proved if the left hand side is proved. A word with positive sentiment is assigned a score +1 and if the word is giving negative sentiment -1 score is assigned to it. A rule is analyzed by a target word. If two mobiles are compared based on their prices, the word 'price' is the target word to analyze the sentiment by RBA.

Lexicon based approach
It is unsupervised technique for SA. This approach can work with small amount of training data. Positive opinion deals with desired states and negative opinion deals with undesired states. Rich, good, beautiful are the words expressing positive sentiment. Poor; bad, dirty are the words expressing negative opinion. The opinion phrases and idioms together are called as opinion lexicon. There are 3 approaches for collecting the opinion word list: manual approach, dictionary-based approach, and corpus-based approach [8].

Dictionary based approach
Opinion words and their synonyms are found and saved as the part of dictionary. Then the errors from such dictionary are removed manually. This approach is has limited applicability. Using a small set of testing opinion words and an online dictionary this approach can be used e.g., Word Net or thesaurus [12].

Corpus based approach
A word may express different meaning in the same domain. [13,14]. The word "small" gives opposite opinions in two sentences: "The mobile phone is small" (positive) and "The storage capacity of mobile phone is small" (negative). If two sentences or two words are joined by 'AND', both of them are impressing the common sentiment. Sometimes 'AND 'may gives us opposite sentiments like "The Red Mi phone battery is excellent and takes long time to charge". It is expected that on both the sides of 'AND' same opinion is expressed. For the connectors 'OR', 'BUT', etc rules are designed. This idea is called sentiment consistency.
A large corpus is studied to find out the similar or different impression of words conjoined. Same and different opinion flow between adjectives form a graph. Any clustering algorithm is applied on this graph to form two clusters, say positive and negative.

Comparison of three major techniques 0f sentimental analysis
The three major techniques used in sentimental analysis are analyzed based on their performance and accuracy. The major advantages and disadvantages of using any approach are also discussed. The comparison of all these techniques is shown in Table 1. Excessively rely on emotional dictionary.

APPLICATIONS
As large opinionated data sources are available, it is being used in many applications. Some applications of SA are enlisted as follows [15,16]. a. Business Demand and supply decisions can be made based on the impressions by user of that product. Through these impressions, organization can check the quality of the service they are providing. To grow in the market decisions can be made based on sentiments available timely. b. Politics Political issues discussed on political blogs. The positive and negative popularity of public figure can be known by analyzing the sentiments of people available on social websites.Business c. Recommender system The product or service is will have positive sentiments if people are using it happily. These products can be recommended to a new user strongly, if user is viewing the ratings or sentiments. So sentiment analysis is having a vital role in recommender system. d. Summarization It is time consuming for a reader to read all opinions for a given entity and then judge.SA will give us the summarized opinion of any entity for a given tenure. 6. CHALLENGES Sentiment analysis is facing following challenges [17]. a. Negation handling Negation alters the polarity of the associated adjective and hence the text. Negation words (like not, neither, nor, it is not the cases) are playing a vital role for polarity change. One possible solution to handle negations is to reverse the polarity of the adjective occurring after negating word. However, this solution fails to deal with the cases like "No wonder the mobile is good" and "Not only the menu was delicious, the culture and service was also remarkable". The use of pure language processing techniques, mathematical models fails to deal with negations completely. b. Domain generalization Polarity of adjective changes when is used in different domains. For example, "The play was inspired from Shakespeare's play" (negative orientation), "I got inspired from the autobiography" (positive orientation). Here, the word 'inspired' exhibits two different polarities for two different contexts. A generalized sentiment analyzer still remains a challenge because of difference in the meaning of a word/sentence in different domains. c. Pronoun resolution Generally the opinion for an object is expressed in the first sentence of the complete text. Remaining portions may contain opinionated text expressed using pronoun. To resolve the pronoun i.e. for what it has been used for is a challenging job. d. Language generalization A separate dictionary is required for separate domain and language. Most of the sentiment analyzers are implemented in English language. A language general sentiment analyzer is fruitful as it gives a broad view of sentiments about a product. e. World knowledge One entity can be referred using other entity. For example, "He is as intelligent as Einstein". Here, to identify the sentiment orientation of the text, one has to know about 'Einstein''. f.
Detection of spam and fake reviews: The spam and fake review can be eliminated before preprocessing by using outlier analysis and by considering reputation of reviewer. g. Single sentence possessing multiple opinions A sentence may contain multiple opinions for different features. In this case the strength of opinion for a feature should be computed properly. h. Comparative sentences The order of words in a sentence expresses different meaning. The sentence, "Red Mi mobile is better than Nokia mobile" gives opposite opinion from "Nokia mobile is better than Red Mi mobile". i.
Synonym gathering Two words may express same feature of an object. These synonym words should be grouped together. In this example, "cute" and "excellent" both refer to the same feature.

CONCLUSION
In this paper, basic concepts related to sentiment analysis are discussed thoroughly. The techniques for sentiment analysis are discussed and compared on the basis of their advantages and limitations. The contribution of current research study is to distinguish positive, negative or neutral nature of sentiments based using explicit opinions. It is needed to deal with implicit opinion with its feature. In future implicit and explicit opinions and their features may be considered to find better sentiment classification.