Business recommendation based on collaborative filtering and feature engineering – a proposed approach

ABSTRACT


INTRODUCTION
Sentiment analysis (SA) of blogs plays a vital role in business decisions and in planning a good business strategy. SA is an artificial intelligence technique that quantifies sentiment as positive, negative, or neutral [1]. Sentiments are expressed at the document level, sentence level, and aspect level [2]. SA has applications in many fields, such as ranking products, services, and merchants, predicting share prices, predicting movie popularity, and making recommendations using business intelligence. SA aims to provide the right knowledge to the right person at the right time [3].
Current algorithms are applied to a single group of users or products and ignore the impact on other groups [4]. There may be a few fake posts written by fake users or competitors. It is therefore a challenge to filter out posts that are not specific to a feature of a product or service. Traditional SA algorithms do not consider the fact that the value of data for decision making decreases as time passes. Data collected over only a short period will decrease the quality of the recommendation or decision. The "bugs in, bugs out" problem still remains [5].
Clustering followed by collaborative filtering has been proposed as a solution to these issues [5]. In the first step, we preprocess the input sentiments and identify the features of the product or service described in them. Using clustering, similar blogs are selected and then fed to a collaborative filtering algorithm to fill in the missing ratings for some features [6]. One objective of the proposed recommendation system is to enhance traditional content-based filtering by building user profiles based on static information that represents the users' liking for the features of the items or services [7].

LITERATURE SURVEY
Considerable work has been carried out in the research area of sentiment classification. The main focus of this work is on classifying larger pieces of text, such as reviews of a product or event [8]. Tweets differ from reviews because they serve a different purpose. Reviews summarize the author's thoughts, whereas tweets are limited to 140 characters of text. Tweets represent the general mood of people through various reactions based on experience or as an impression of news articles [9]. Hu and Liu proposed a technique for a Feature Based Summarization (FBS) system for customer reviews of products. It also generates a sentiment-based summary as either positive or negative opinion using the adjectives in the reviews [10]. Chaovalit and Zhou compared supervised and unsupervised classification algorithms and obtained an accuracy of 83.54% for the supervised method and 77% for the unsupervised method [11]. Pang, O'Keefe and Koprinska proposed a technique to select features using attribute weights and applied Naive Bayes and SVM classifiers to classify moods [12,13]. Linguistic features are used to detect Twitter sentiment using a hash-tagged data set (HASH) and an emoticon data set. Results are evaluated using unigrams and bigrams [14,15].
The study by Hassan shows that part-of-speech features do not play a useful role in sentiment analysis in the micro-blogging domain. The author introduces a classification method for query-term sentiment analysis, in which the classifier and the feature extractor are treated as two separate components [16]. Each token is assigned a sentiment score called the total sentiment index. Using a classification algorithm, the sentiments are classified as having positive or negative polarity [17]. The political future can be analyzed by monitoring and analyzing public conversation on social sites in real time [18]. Feature vectors and the tagged content of a corpus can be used to build a model with a machine learning approach. This model is then used to classify or categorize an untagged corpus of text documents [19]. In terms of language consistency, Twitter is more informal. Emoticons are used to express opinions. Many tweets are ambiguous; they amplify the opinion for human readers but obscure it for a machine learning algorithm [20]. The sentiment classification algorithm (SCA) and SVM are used to evaluate the performance of the approach; accuracy, recall, and precision are some of the parameters on which sentiment analysis performance is evaluated [21].

Research design
A proposed research design for sentiment analysis using collaborative filtering and feature engineering is given in Figure 1.

Data collection
Correct input leads to correct output. Sentiment data is available from the Twitter website or from Kaggle data sets.

Data preprocessing
a. Case normalization
Tweets are written in mixed case, that is, they may contain both upper-case and lower-case characters. In case normalization, the entire document or sentence is generally converted to lower case.
b. Tokenization
A document is split into sentences, and sentences are divided into words. After certain characters such as punctuation marks are removed, the remaining words are the tokens. Users include Twitter usernames in their tweets in order to direct their messages. A de facto standard is to include the @ symbol before the username (e.g. @alecmgo). An equivalence class token (USERNAME) replaces all words that start with the @ symbol.
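The two preprocessing steps can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in this work: the function name preprocess and the regular expressions are assumptions, and sentence splitting is skipped for brevity.

```python
import re

USERNAME_TOKEN = "USERNAME"  # equivalence class token named in the text


def preprocess(tweet):
    """Lower-case a tweet, replace @mentions, and return its word tokens."""
    # a. Case normalization: convert the whole tweet to lower case.
    text = tweet.lower()
    # Replace every word that starts with @ by the USERNAME token
    # (assumed pattern: @ at the start of a word, followed by word characters).
    text = re.sub(r"(?<!\w)@\w+", USERNAME_TOKEN, text)
    # b. Tokenization: keep runs of letters, digits and apostrophes,
    # which drops punctuation marks.
    return re.findall(r"[A-Za-z0-9']+", text)


tokens = preprocess("@alecmgo The Camera is GREAT, but battery life is poor!")
print(tokens)
```

Lower-casing is applied before the username replacement so that the USERNAME token itself stays recognizable in upper case.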

Term frequency count and feature extraction
After preprocessing, every remaining word in the data set is matched against a list of adjectives in the dictionary to find the adjectives and thus the features they describe.
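A possible sketch of this matching step is given below. The adjective dictionary and the feature vocabulary are hypothetical, and pairing each adjective with the closest feature word is an assumed heuristic; the text does not specify how an adjective is linked to a feature.

```python
from collections import Counter

# Hypothetical dictionaries; the actual word lists are not given in the text.
ADJECTIVES = {"great", "good", "poor", "bad"}
FEATURE_WORDS = {"camera", "battery", "screen"}


def adjective_counts(tokens):
    """Term frequency of dictionary adjectives among the tokens."""
    return Counter(tok for tok in tokens if tok in ADJECTIVES)


def feature_adjective_pairs(tokens):
    """Pair each adjective with the closest feature word (earlier word wins ties)."""
    positions = [i for i, tok in enumerate(tokens) if tok in FEATURE_WORDS]
    pairs = []
    for i, tok in enumerate(tokens):
        if tok in ADJECTIVES and positions:
            nearest = min(positions, key=lambda j: (abs(j - i), j))
            pairs.append((tokens[nearest], tok))
    return pairs


tokens = ["the", "camera", "is", "great", "but", "battery", "life", "is", "poor"]
print(adjective_counts(tokens))
print(feature_adjective_pairs(tokens))
```

The nearest-word heuristic is deliberately simple and will mispair adjectives in sentences where the adjective precedes its feature and follows another one.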

Feature rating
We provide a list of adjectives, each with a crisp value from 0 to 5, where 0 stands for the worst and 5 for the best. We can thus assign a rating to every feature the user has commented on. A feature without comments receives no rating; its rating is left empty, as shown in Table 2.
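The rating step can be illustrated with a short Python sketch. The adjective scores and the feature list are hypothetical, and averaging the scores when a user comments on the same feature more than once is an assumption not stated in the text.

```python
# Hypothetical crisp values on the 0 to 5 scale (0 = worst, 5 = best).
ADJECTIVE_SCORES = {"worst": 0, "poor": 1, "average": 3, "good": 4, "best": 5}
FEATURES = ["camera", "battery", "screen"]  # hypothetical feature list


def rate_features(pairs):
    """Turn (feature, adjective) pairs of one user into a rating per feature.

    Features the user did not comment on keep an empty rating (None).
    """
    collected = {}
    for feature, adjective in pairs:
        if feature in FEATURES and adjective in ADJECTIVE_SCORES:
            collected.setdefault(feature, []).append(ADJECTIVE_SCORES[adjective])
    ratings = {feature: None for feature in FEATURES}
    for feature, scores in collected.items():
        # Assumption: several comments on one feature are averaged.
        ratings[feature] = sum(scores) / len(scores)
    return ratings


print(rate_features([("camera", "good"), ("battery", "poor")]))
```

Here the screen rating stays empty because the user made no comment on it; such gaps are what collaborative filtering later fills.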

Clustering the top k users
We need to find similar users based on their interest in the features of the product or service. Here we are interested in the top k users with similar tastes, as reflected in their impressions. A threshold value can be provided to optimize the result when clustering with an appropriate algorithm, say k-nearest neighbour.
As shown in Table 3, users 1, 3, and 4 have similar interests in the features. In the same way, we find the top k users out of P users. These top k users now represent the original data set taken as input. The top k users have not rated all features, but they have commented on similar features very closely. The missing ratings of these k users for some features are filled in by collaborative filtering.
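A nearest-neighbour selection of the top k users might look like the following sketch. The ratings are made-up illustration data (not Table 3), and the distance measure, the mean absolute difference over features that both users rated, is an assumption; any suitable distance could be substituted.

```python
def distance(a, b):
    """Mean absolute rating difference over features rated by both users."""
    common = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not common:
        return float("inf")  # no shared feature: treat as maximally distant
    return sum(abs(x - y) for x, y in common) / len(common)


def top_k_similar(ratings, target, k, max_distance=float("inf")):
    """Return up to k users closest to `target`, within an optional threshold."""
    scored = [(distance(ratings[target], row), user)
              for user, row in ratings.items() if user != target]
    scored = [(d, user) for d, user in scored if d <= max_distance]
    return [user for d, user in sorted(scored)[:k]]


# Hypothetical feature ratings; None marks a feature the user did not rate.
ratings = {
    1: [5, 4, None, 2],
    2: [1, 2, 5, None],
    3: [5, None, 3, 2],
    4: [4, 4, 3, None],
    5: [2, 1, None, 5],
}
print(top_k_similar(ratings, target=1, k=2))
```

The max_distance parameter plays the role of the threshold mentioned above: users farther away than the threshold are dropped even if fewer than k remain.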

Collaborative filtering for recommendation
Collaboration means recommending an item or service based on the features rated according to the user's choice. Filtering is the separation of similar entities based on users' likes or dislikes. The motivation for collaborative filtering comes from the idea that a person can get the best recommendation for any business, say B, from another person who already has the same interest in B. Collaborative filtering methods are used for monitoring data such as financial data, sentiment blogs for products or services, electronic commerce, and web applications. Table 4 explains the working of collaborative filtering. Consider a movie rated on 5 features, F1 to F5. Ratings range from 1 to 5, where 1 stands for dislike and 5 stands for most liked.

Customer   F1   F2   F3   F4   F5
1          5    3    4    4    ?
2          3    1    2    3    3
3          4    3    4    3    5
4          3    3    1    5    4
5          1    5    5    2    1

Step 1: Ignore the column with the missing rating and calculate the row average of the remaining ratings for each customer.
Step 2: Choose two rows and calculate their similarity using the given formula.
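$$\mathrm{Sim}(C_i, C_j) = \frac{\sum_{p} (r_{ip} - r_{i\,avg})(r_{jp} - r_{j\,avg})}{\sqrt{\sum_{p} (r_{ip} - r_{i\,avg})^2}\,\sqrt{\sum_{p} (r_{jp} - r_{j\,avg})^2}}$$
where $\mathrm{Sim}(C_i, C_j)$ is the similarity between customers i and j, $r_{ip}$ is the rating of customer i for feature p, $r_{jp}$ is the rating of customer j for feature p, $r_{i\,avg}$ is the average rating of customer i, and $r_{j\,avg}$ is the average rating of customer j. The resulting similarities show that customer 1 and customer 2 have the highest similarity in their ratings. We may conclude that the rating of customer 1 for feature 5 will be the same as that given by customer 2, so it will be 3 for customer 1.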
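Steps 1 and 2 can be checked with a short Python sketch on the ratings of Table 4. It assumes that the similarity is the Pearson correlation implied by the symbol definitions and that, following Step 1, averages are taken over F1 to F4 only; the names RATINGS and pearson are illustrative.

```python
from math import sqrt

# Ratings from Table 4; None marks the missing rating of customer 1 for F5.
RATINGS = {
    1: [5, 3, 4, 4, None],
    2: [3, 1, 2, 3, 3],
    3: [4, 3, 4, 3, 5],
    4: [3, 3, 1, 5, 4],
    5: [1, 5, 5, 2, 1],
}
MISSING_COLUMN = 4  # zero-based index of F5


def pearson(a, b, skip):
    """Pearson similarity of two rating rows, ignoring column `skip`."""
    cols = [p for p in range(len(a)) if p != skip]
    avg_a = sum(a[p] for p in cols) / len(cols)  # Step 1: row averages
    avg_b = sum(b[p] for p in cols) / len(cols)
    num = sum((a[p] - avg_a) * (b[p] - avg_b) for p in cols)
    den = (sqrt(sum((a[p] - avg_a) ** 2 for p in cols))
           * sqrt(sum((b[p] - avg_b) ** 2 for p in cols)))
    return num / den if den else 0.0


# Step 2: similarity of customer 1 to every other customer.
similarities = {c: pearson(RATINGS[1], RATINGS[c], MISSING_COLUMN)
                for c in RATINGS if c != 1}
best = max(similarities, key=similarities.get)
predicted = RATINGS[best][MISSING_COLUMN]  # copy the neighbour's F5 rating
print({c: round(s, 3) for c, s in similarities.items()})
print(best, predicted)
```

Under these assumptions the similarities of customer 1 to customers 2, 3, 4, and 5 come out at roughly 0.853, 0.707, 0.0, and -0.792, so customer 2 is the nearest neighbour and the missing rating is filled with 3.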
Step 3: In this step we find the column average of each feature over all customers. Table 5 explains the column averages for the different features. As each column average lies between 1 and 5, we can set a threshold as required to comment on the quality of a feature of any product or service. These statistics, with a threshold for every feature, can then be used for feature-based recommendation of the movie.
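Step 3 can be sketched in Python as follows, using the ratings of Table 4 with the missing rating of customer 1 for F5 filled in as 3. The threshold of 3.1 is a hypothetical value chosen only for illustration; in practice it would be set per feature as required.

```python
FEATURES = ["F1", "F2", "F3", "F4", "F5"]

# Table 4 ratings after the missing F5 rating of customer 1 is set to 3.
FILLED_RATINGS = [
    [5, 3, 4, 4, 3],
    [3, 1, 2, 3, 3],
    [4, 3, 4, 3, 5],
    [3, 3, 1, 5, 4],
    [1, 5, 5, 2, 1],
]


def column_averages(rows):
    """Average rating of each feature (column) over all customers."""
    return [sum(column) / len(rows) for column in zip(*rows)]


THRESHOLD = 3.1  # hypothetical quality threshold on the 1 to 5 scale

averages = dict(zip(FEATURES, column_averages(FILLED_RATINGS)))
recommended = [f for f, avg in averages.items() if avg >= THRESHOLD]
print(averages)
print(recommended)
```

With this threshold every feature except F2, whose column average is 3.0, would count in favour of recommending the movie.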

CONCLUSION
We have studied the proposed approach, which uses collaborative filtering and feature engineering for business recommendation. Preprocessing the input data set will improve the quality of the corpus. We obtain a proper set of features using frequently occurring adjectives. A clustering algorithm such as k-nearest neighbour provides the top k similar users, who can give the recommendation for any product or service using collaborative filtering. We can provide a threshold value for an individual feature so that a product or service can be recommended based on that specific feature only. For the approach proposed in this paper, we provide threshold values for all features considered as a system, which gives the recommendation for any product or service.
In the future, one can directly consider a Kaggle data set that provides the ratings of a product or service by m users for f features. This will reduce the role of preprocessing, and machine learning techniques can then be compared for better outcomes.