Forecasting movie rating using k-nearest neighbor based collaborative filtering

ABSTRACT


INTRODUCTION
Nowadays the internet has become an integral part of life. People are strongly connected to social media and share their emotions and reviews on different websites. These emotions take the form of sentiments or ratings for a product or service. Recommendation systems guide business intelligence for both providers and consumers. A recommendation system suggests items or services that match a user's interests, such as books, movies, or shares. Large datasets with ratings by different users are available online. Movie rating databases with users and different movie parameters are available on several popular websites such as Kaggle. Decisions supported by multiple strong historical impressions are generally superior to decisions based on a single impression from any one user. Rather than collecting all of the reviews or ratings, only the users whose ratings are strongly related to one another are collected. Using k-nearest neighbor (KNN), the recommendation system finds a small set of data that is the most similar.
Collaborative filtering (CF) takes its input from the KNN algorithm. KNN gives the top K users with ratings from the available huge dataset. Movie genres may include entertainment, educational, horror, and comedy. A movie recommendation system helps a user rate an unrated or new movie for which earlier users have already given ratings. The increasing amount of data on the internet makes it challenging for users to manage information and to obtain suggestions that predict ratings matching their interests.

LITERATURE REVIEW
Godbole et al. [1] proposed a mechanism for assigning positive or negative scores to each entity in a text collection. Annett and Kondrak [2] discuss an innovative application of support vector machines (SVM). They used movie data to compare their method with lexical-based approaches [2]. The results of different classifiers suggest that using numerous classifiers in a hybrid approach increases sentiment classification quality [3]. For sentiment analysis (SA) of travel reviews, three alternative machine learning methods are compared [4]. SentiWordNet is proposed as a source for building a dataset, and an improvement is proposed that calculates positive and negative scores of phrases to identify sentiment orientation [5].
Shambour and Lu [6] addressed how e-governments assist in the search for a reliable corporate partner to conduct business. Moshfeghi et al. [7] discuss a novel strategy for CF that uses a combined approach of semantic, emotional, and rating information. The product features on which clients offer an opinion are mined in an opinion summary, which differs from text summarization [8]. Erik Cambria outlined in great detail how SA may be used in recommendation systems. The comment is replaced by a rating, followed by a discussion of the rating approach [9]. The influence of domain information is considered when choosing feature vectors. Different classifiers are used to determine their impact on a specific domain and feature vector [10]. Pappas and Popescu-Belis [11] proposed a sentiment-aware nearest neighbor algorithm for multimedia recommendations based on user comments. A high-prediction-accuracy explicit factor model for explainable recommendations is proposed [12]. Tang et al. [13] discuss how feature-based Twitter sentiment analysis is superior to standard neural networks. A novel memory-based CF technique is suggested, which models a CF-based recommender system using user reviews and numeric ratings [14].
Yang et al. [15] proposed a clique-based data smoothing approach to solve the data sparsity problem in CF using the traditional user-based nearest neighbor algorithm. Subramaniyaswamy et al. [16] studied trends and patterns in client priorities, which can be used with filtering and clustering techniques to identify items of interest. Nilashi et al. [17] present a dimensionality reduction strategy to overcome the sparsity and scalability problems in recommender systems. The merits and limitations of basic sentiment analysis concepts are thoroughly explained and compared [18]. On several datasets, Özdemir et al. [19] demonstrated how to evaluate the efficacy of various data classifier algorithms. Mohemad et al. [20] proposed the construction of a new ontology model in the education sector to enable early detection of children with learning problems.
Babeetha et al. [21] proposed a prediction strategy for smoothing sparse original rating matrices and clustering, and discussed the accuracy and processing time of the proposed methods. Rahim et al. [22] looked into the importance of a number of key variables for innovative digital marketing. Saranya and Pravin [23] worked on developing successful prediction models and methodologies for predicting heart disease and breast cancer. Behera et al. [24] described how collaborative filtering and content-based recommendation are well-known strategies for selecting movies from a huge collection and determining their attribute-based similarity. According to Ez-Zahout et al. [25], matrix factorization is the superior technique, and it delivers a satisfactory result with a high precision score on the MovieLens dataset. Mawane et al. [26] suggested a platform that tries to discover suitable parameters of Kohonen maps, which can help satisfy the relevance of recommended items by dividing learners into homogeneous groups before generating recommendations. El Fazziki et al. [27] examined user-based CF on two datasets, FilmTrust and MovieLens, and found that it works well and enhances prediction accuracy. Rawat et al. [28] proposed a method for predicting virtual machine failure based on a time series stochastic model that accurately predicts failure. Various evaluation measures for machine learning based algorithms, such as precision, recall, F-measure, and accuracy, are discussed by Yadla and Rao [29].

RESEARCH CHALLENGES AND OBJECTIVES
Today, several recommender systems have been developed for specific domains, but they are not specific enough to fulfill the information needs of users. It is therefore important to build a better recommender system. Designing such a system raises several issues.

Sparsity problems
Recommendation quality depends on data sparsity. Data sparsity occurs when the dimensionality of the dataset increases and users leave some items unrated. Considerable research has been carried out to minimize this problem. Dimensionality reduction can mitigate it to a certain extent.
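As a rough illustration, sparsity can be quantified as the fraction of user-item cells that carry no rating. The sketch below assumes a rating matrix stored as a list of rows with `None` for a missing rating; the function name `sparsity` and the sample values are hypothetical and are not taken from the dataset used in this paper.

```python
def sparsity(matrix):
    """Return the fraction of user-item cells that have no rating (None)."""
    cells = [rating for row in matrix for rating in row]
    if not cells:
        raise ValueError("empty rating matrix")
    missing = sum(1 for rating in cells if rating is None)
    return missing / len(cells)


# Hypothetical 3 users x 4 movies; None marks an unrated movie.
ratings = [
    [5, None, 3, None],
    [None, 4, None, None],
    [2, None, None, 1],
]

print(sparsity(ratings))  # 7 of the 12 cells are missing
```

A value close to 1 means that most user-item pairs are unrated, which is the situation the sparsity problem describes.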

Cold start problem
The cold start problem occurs when new users or items enter the system. It is classified as a new user issue, a new item issue, or a new system issue. This problem degrades the performance of collaborative filtering for recommendation. To resolve it, we may look into an appropriate user group rather than the entire dataset.

ISSN: 2088-8708 Int J Elec & Comp Eng, Vol. 12, No. 6, December 2022: 6506-6512

Scalability
A small dataset can be evaluated to recommend items that match users' choices. As the dataset grows, the accuracy of the results decreases. Advanced large-scale assessment methods can be used to improve scalability.

Grey sheep problem
Sometimes a user's behavior is suspicious because the user rates related items inconsistently. This degrades the performance of the recommendation system. One remedy is to omit these users from the training dataset and to choose only the top K users who have rated all items consistently with respect to related items.

Objectives of work: i) to analyze the ratings given by different users for movies; ii) to minimize the sparsity and cold start issues; and iii) to compare the proposed approach with existing approaches.

PROPOSED METHOD ARCHITECTURE
As shown in Figure 1, the input is the MovieLens dataset obtained from Kaggle. The KNN algorithm is applied to this dataset to find the top K similar users who have rated all movies. The collaborative filtering algorithm is then applied to this top K dataset to recommend a rating for an unrated movie.

Input dataset
The input dataset contains the following features: i) user Id; ii) movie Id; and iii) rating. In Table 1, the record set contains ratings given by various users to various movies. Six users have rated 4 movies. The ratings given by users vary from 1 to 5. The record set additionally contains a few rows where rating data is missing, which will be computed using CF. To find a missing rating, we first find the top K users having higher similarity between them.
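The three features can be held as simple (user Id, movie Id, rating) records and grouped per user before any similarity is computed. The following is a minimal sketch under that assumption; the helper names `build_matrix` and `missing_cells` are hypothetical, and the sample ratings are made up for illustration and are not the values of Table 1.

```python
# Hypothetical records, NOT the values of Table 1.
# Each record is (user_id, movie_id, rating); rating is None when missing.
records = [
    ("A", 1, 4), ("A", 2, 3), ("A", 3, 5), ("A", 4, 2),
    ("B", 1, 2), ("B", 2, 5), ("B", 3, 1), ("B", 4, 4),
    ("X", 1, 3), ("X", 2, 4), ("X", 3, 2), ("X", 4, None),
]


def build_matrix(records):
    """Group the flat records into {user: {movie: rating}}."""
    matrix = {}
    for user, movie, rating in records:
        matrix.setdefault(user, {})[movie] = rating
    return matrix


def missing_cells(matrix):
    """Return the sorted (user, movie) pairs whose rating is missing."""
    return sorted(
        (user, movie)
        for user, row in matrix.items()
        for movie, rating in row.items()
        if rating is None
    )


matrix = build_matrix(records)
print(missing_cells(matrix))  # [('X', 4)]
```

The cells returned by `missing_cells` are the ones that the CF step is expected to fill in.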

Finding top K users
Here our aim is to find the top K similar users whose ratings for all movies are closer to one another than those of the remaining users. To attain this, we use the KNN algorithm. K in KNN stands for the number of neighbors. KNN does not learn anything during a training period. It simply stores the training dataset and uses it to make predictions in real time. Because of this, KNN has no training cost, unlike methods such as support vector machines and regression, which require training before prediction. To implement KNN we require a value of K and a distance function. We use the Euclidean distance function to find the top K users among the whole user set. KNN algorithm: i) select the users who have given ratings to all movies, ii) find the Euclidean distance of each user to all remaining users, iii) select the K user pairs with maximum weight (that is, the smallest distance) to form the set of top K users, and iv) our model is ready.
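The steps listed above can be sketched in a few lines of Python. This is only an illustrative reading of the procedure: the function names are hypothetical, the ratings are made up (they are not the values of Table 1), and "top K" is interpreted as the K user pairs with the smallest Euclidean distance.

```python
import math
from itertools import combinations


def complete_users(matrix):
    """Step i: keep only the users who have rated every movie (no None)."""
    return {user: row for user, row in matrix.items() if None not in row}


def euclidean(x, y):
    """Euclidean distance between two equal-length rating vectors."""
    if len(x) != len(y):
        raise ValueError("rating vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def closest_pairs(matrix, k):
    """Steps ii and iii: rank all user pairs by distance and keep the k closest."""
    users = complete_users(matrix)
    distances = [
        (euclidean(users[u], users[v]), u, v)
        for u, v in combinations(sorted(users), 2)
    ]
    return sorted(distances)[:k]


# Hypothetical ratings for four movies; user X has one missing rating.
ratings = {
    "A": [4, 3, 5, 2],
    "B": [1, 5, 2, 4],
    "C": [4, 3, 4, 2],
    "D": [5, 1, 4, 3],
    "X": [3, 4, 2, None],
}

for distance, u, v in closest_pairs(ratings, k=2):
    print(u, v, round(distance, 2))
```

User X is excluded from the distance computation because of the missing rating, which mirrors step i; no model is fitted at any point, since KNN only stores and compares the data.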

Formula for Euclidean distance
To calculate the distance between user A and user B, we substitute their ratings into (1). The ratings of user A take the place of variable x, whereas the ratings of user B take the place of variable y.

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{1}$$

Here n is the number of movies. For calculating the Euclidean distances, we consider all the users who have rated all movies. As shown in Table 2, we first find the average rating of every user.
Substituting the values from Table 2 into the formula gives the Euclidean distance between A and B:

$$d(A, B) = 3.9$$

Similarly, calculating the Euclidean distances for the user pairs (A,C), (A,D), (A,E), (B,C), (B,D), (B,E), (C,D), (C,E), and (D,E) gives the results shown in Table 3.

As shown in Table 4, for K=1 we get the single smallest distance, which is 0 for (A,C). For K=2 we get the two smallest distances, 0 and 3.16, for (A,C) and (B,E) respectively. Similarly, for K=3 we get the three smallest distances: 0 for (A,C), 3.16 for (B,E), and 3.74 for (A,D) and (C,D).

If the value of K is very small, we face the problem of overfitting, in which noisy data or outliers have a huge impact on the classifier. In the opposite case, if the value of K is very large, we face the problem of underfitting, in which the classifier tends to predict the majority classes regardless of which neighbors are nearest. To avoid both underfitting and overfitting we choose K=2. Table 5 shows only the users selected with K=2, together with their ratings.
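The selection of user pairs for a given K can be illustrated with the distances quoted in the text. The sketch below uses only those quoted values; the remaining pairs of Table 3 are not reproduced in the text and are therefore left out, which does not change the result for K=1 or K=2 because the text states that the smallest distances are 0, 3.16, and 3.74. The function names are hypothetical.

```python
# Distances quoted in the text. The other pairs of Table 3 are not
# reproduced there, so they are omitted from this sketch.
quoted_distances = {
    ("A", "C"): 0.0,
    ("B", "E"): 3.16,
    ("A", "D"): 3.74,
    ("C", "D"): 3.74,
    ("A", "B"): 3.9,
}


def top_k_pairs(distances, k):
    """Return the k user pairs with the smallest distance, closest first."""
    ranked = sorted(distances.items(), key=lambda item: item[1])
    return [pair for pair, _ in ranked[:k]]


def users_in(pairs):
    """Collect the distinct users that appear in the selected pairs."""
    return sorted({user for pair in pairs for user in pair})


for k in (1, 2):
    pairs = top_k_pairs(quoted_distances, k)
    print(k, pairs, users_in(pairs))
```

With K=2 the selected pairs are (A,C) and (B,E), so the retained users are A, B, C, and E, which are the users later compared with user X.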

Collaborative filtering
The aim of CF is to predict ratings for unrated items by considering the top K user set with their ratings. We do not consider the column with missing ratings. As shown in Table 5, we ignore the column for movie 4 because user X has not rated it. Next, we find each user's average rating over movies 1, 2, and 3. The average ratings are shown in Table 6. We then find the similarity between user X and each of the remaining users using the cosine similarity formula.

Comparing the similarity between users
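Now we compare the ratings of users A, B, C, and E with the ratings of user X to determine how close these users are to user X. To achieve this, we use the cosine similarity formula. Cosine similarity is a metric that helps determine how similar data objects are, irrespective of their size. Cosine similarity formula:

$$\mathrm{sim}(i, j) = \frac{\sum_{p} (r_{ip} - r_{i,\mathrm{avg}})(r_{jp} - r_{j,\mathrm{avg}})}{\sqrt{\sum_{p} (r_{ip} - r_{i,\mathrm{avg}})^2}\,\sqrt{\sum_{p} (r_{jp} - r_{j,\mathrm{avg}})^2}} \tag{3}$$

where $r_{ip}$ is the current (particular) rating of customer i, $r_{jp}$ is the current (particular) rating of customer j, $r_{i,\mathrm{avg}}$ is the average rating of customer i, and $r_{j,\mathrm{avg}}$ is the average rating of customer j.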
The similarity varies between -1 and 1. To calculate the similarity between users A and X, we substitute their values into (3). Similarly, calculating the cosine similarity for (B,X), (C,X), and (E,X) gives the results shown in Table 7.
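The symbol definitions above (a particular rating and an average rating for each customer) suggest a mean-centered cosine similarity, and the sketch below implements that reading as an assumption rather than as the exact formula of the paper. The function name `centered_cosine`, the sample ratings, and the choice to return 0.0 for a user whose ratings are all identical are illustrative assumptions.

```python
import math


def centered_cosine(ri, rj):
    """Cosine similarity of two rating vectors after subtracting each user's mean."""
    if len(ri) != len(rj) or not ri:
        raise ValueError("rating vectors must be non-empty and of equal length")
    ri_avg = sum(ri) / len(ri)
    rj_avg = sum(rj) / len(rj)
    di = [r - ri_avg for r in ri]
    dj = [r - rj_avg for r in rj]
    numerator = sum(a * b for a, b in zip(di, dj))
    denominator = math.sqrt(sum(a * a for a in di)) * math.sqrt(sum(b * b for b in dj))
    if denominator == 0:
        # Assumption: a user with identical ratings everywhere has no
        # direction to compare, so the similarity is reported as 0.0.
        return 0.0
    return numerator / denominator


# Hypothetical ratings for movies 1 to 3 (movie 4 is left out because
# the target user has not rated it).
user_x = [3, 4, 2]
user_e = [4, 5, 1]

print(round(centered_cosine(user_x, user_e), 2))
```

Because each vector is centered on its own average, a user who rates everything one point higher than another user still obtains a similarity of 1, and the result always stays between -1 and 1.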
From Table 7, the lowest similarity weight is -0.89 for the user pair (A,X), whereas the highest similarity weight is 0.31 for the user pair (E,X). Since the similarity between the ratings of the pair (E,X) is the highest, we assign user E's rating for movie 4 to the missing rating of user X for movie 4. So, the missing rating of user X for movie 4 is 5.
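The final assignment step can be sketched as picking the neighbor with the highest similarity and copying that neighbor's rating. The sketch uses only the values quoted in the text (-0.89 for A, 0.31 for E, and user E's rating of 5 for movie 4); the similarities for B and C are not reproduced in the text and are omitted, which is safe here because the text states that E has the highest value. The function name is hypothetical.

```python
def predict_from_best_neighbor(similarities, neighbor_ratings):
    """Return (neighbor, rating) for the neighbor most similar to the target user."""
    if not similarities:
        raise ValueError("no neighbors to compare with")
    best = max(similarities, key=similarities.get)
    return best, neighbor_ratings[best]


# Values quoted in the text; B and C are omitted because their
# similarities are not reproduced there.
similarity_to_x = {"A": -0.89, "E": 0.31}
movie4_ratings = {"E": 5}

print(predict_from_best_neighbor(similarity_to_x, movie4_ratings))  # ('E', 5)
```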

CONCLUSION
In this paper, we proposed a recommender system based on collaborative filtering with k-nearest neighbors (KNN). We obtained the unknown rating by comparing the ratings of the top K users. Selecting a proper value of K improved our prediction result. We have addressed both the underfitting and the overfitting issues. In future, deep learning techniques should be used for movie rating prediction or movie recommendation. We may also prefer a standard database obtained directly from organizations working in the same domain to get more accurate results.

Table 1. Ratings for movies by users

Table 2. Average ratings by users

Table 3. Euclidean distance between users

Table 4. User pair set for different K values

Table 5. User with ratings for K=2

Table 6. Average ratings by users without considering missing value column

Table 7. Cosine similarity between user pairs