Analysis of Mobile Service Providers Performance Using Naive Bayes Data Mining Technique

ABSTRACT


INTRODUCTION
The Malaysian Communications and Multimedia Commission is the main regulator of telecommunications in Malaysia [1], [2]. Regulatory reform and rehabilitation are important in creating effective competition within the telecommunications industry. Correspondingly, the Malaysian telecommunications industry has seen exceptional growth in recent years [3]. This growth produces huge and diverse data sets, i.e., big data, which need analysis to discover hidden correlations, customer preferences, market trends, and other valuable information that may help organizations make better business decisions. With the growing field of big data, the use of structured and unstructured data can yield valuable information that helps the telecommunications industry in Malaysia grow exponentially [4]. Consequently, making use of structured and unstructured data requires critical and analytical methods that meet the needs of the industry [5], [6]. Finding the best telecommunication service provider is also challenging, since there are now many mobile communication services to choose from, each with different service rates and speeds [7].
The contribution of this study is a solution for evaluating the performance of telecommunication service providers in the Malaysian telecommunications industry by: (i) analyzing the huge and diverse data that telecommunication service users post daily through their Twitter accounts; and (ii) ranking the performance of the telecommunication service providers in Malaysia based on their users' tweet data.

RESEARCH METHOD
The scope of this project covers the data science process itself, from preparing data for pre-processing through to conducting the analysis. The method used in this study is based on the Cross Industry Standard Process for Data Mining (CRISP-DM) [8], a well-known model for the data mining process [9]-[11]. The complete CRISP-DM process diagram is given in Figure 1, and each process in the model is described as follows. The business understanding process focuses on the purposes and requirements of the project, which comprises understanding the business objectives, success criteria, project plan, and deliverables [12]-[14]. The data understanding process starts with initial data collection and proceeds to data description and data exploration. The data preparation process includes data cleaning, sampling, normalization, and feature selection. The modeling process includes selecting modeling techniques, building and training the model, and making predictions. The evaluation process includes model validation, a review of the results, and evaluation against the success criteria. Finally, the deployment process includes result visualization and report creation. The method that suits our sentiment analysis for telecommunication business operations is defined in the workflow figure.
The computer program in this project is written in RStudio using R, a programming language for statistical computing and graphics. The data used during testing is gathered from the Twitter Application Programming Interface (API). A user who wants to access data from the Twitter API needs a Twitter account. Before any code is run, RStudio needs an API key to connect it to the Twitter API.
Once the connection succeeds, data can be gathered freely from the Twitter API, but RStudio can access only data from the seven days before the request date.
For the big data analysis, the Naïve Bayes technique is deployed in this project to classify the data as accurately as possible. The Naïve Bayes classifier is a supervised learning method and one of the simple probabilistic classifiers in machine learning, with strong (naive) independence assumptions between the features [15]-[17]. Figure 3 shows the process flowchart of the Naïve Bayes technique. The train classifier is used to train on the data, calculating Bayes-optimal estimates of the model parameters and making predictions [18]-[20]. The process flowchart of the train classifier applied in this project is given in Figure 4. Figure 5 shows how Naïve Bayes works in the test set classifier for sentiment data; its steps include calculating the probabilities using the Naïve Bayes algorithm, adding the probabilities for all the words in the data, and adding the prior probability to these probabilities. This is appropriately representative of the underlying recognition problem and yields valuable information that helps the telecommunications industry in Malaysia grow.
The worked example uses labelled training sentences, including:
A great service, good service (+)
Poor service, Poor connection (-)
A good service, great connection (+)
The training sentences contain a total of 10 unique words: I, loved, the, service, a, great, hated, good, connection, poor.
Step 2: Convert the data into a frequency table, which is given in Table 2. Next, look at the probabilities per outcome (+ or -).
Step 3: Compute the prior probabilities: P(+) from the total of the + class, and P(-) from the total of the - class.
Step 4: Compute the conditional probability of each attribute: P(I|+), P(loved|+), P(the|+), P(service|+), P(a|+), P(great|+), P(good|+), P(connection|+). In general, P(wk|+) = (nk + 1) / (n + |vocabulary|), where nk is the number of times word k occurs in the (+) cases, n is the number of words in the (+) cases (here 14), and |vocabulary| is the total number of unique words. During testing, unknown words use nk = 0, and their probability of being positive and of being negative is computed.
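The steps of the worked example can be sketched in code. The following is a minimal illustration in Python rather than the R program used in this study, and it rests on several assumptions: the first two training sentences are hypothetical, chosen only so that the data match the 10 unique words and the 14 words in the (+) case stated in the text; the prior is taken as the share of sentences in each class; and add-one (Laplace) smoothing, (nk + 1) / (n + |vocabulary|), is used for the conditional probabilities. The function names are illustrative only.

```python
import math
from collections import Counter

# Training sentences. The last three appear in the text; the first two are
# ASSUMED for illustration, chosen to match the stated 10 unique words and
# n = 14 words in the (+) class.
TRAIN = [
    ("I loved the service", "+"),
    ("I hated the service", "-"),
    ("A great service, good service", "+"),
    ("Poor service, Poor connection", "-"),
    ("A good service, great connection", "+"),
]


def tokenize(text):
    """Lowercase the text, drop commas, and split on whitespace."""
    return text.lower().replace(",", " ").split()


def train(examples):
    """Build the per-class word frequency table and the class counts."""
    word_counts = {"+": Counter(), "-": Counter()}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = set(word_counts["+"]) | set(word_counts["-"])
    return word_counts, doc_counts, vocabulary


def log_posterior(text, label, word_counts, doc_counts, vocabulary):
    """Log prior plus the sum of log conditional word probabilities."""
    n = sum(word_counts[label].values())  # number of words in this class
    # Assumption: the prior is the share of training sentences in the class.
    score = math.log(doc_counts[label] / sum(doc_counts.values()))
    for word in tokenize(text):
        n_k = word_counts[label][word]  # 0 for a word unseen in this class
        score += math.log((n_k + 1) / (n + len(vocabulary)))
    return score


def classify(text, model):
    """Return the class with the larger log posterior."""
    return max(("+", "-"), key=lambda label: log_posterior(text, label, *model))


MODEL = train(TRAIN)
print(classify("great service", MODEL))
print(classify("poor connection", MODEL))
```

Working in log space turns the product of word probabilities into a sum, which avoids numerical underflow on longer tweets without changing which class scores highest.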

DATA ANALYSIS
In this study, we use real data extracted from the Twitter API, an interface used to access core Twitter data. We save the data in .csv format, as shown in Figure 6. The dataset is then loaded into RStudio for further analysis. Not all attributes of the obtained dataset are used in the analysis; only the text attribute is selected and used for modelling. The purpose of selecting this attribute is to see the weightage of the positive, negative, and neutral words.
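Selecting the text attribute from the saved file can be sketched as follows. This is an illustration in Python, not the R code used in the study. The column names (text, created, screenName) and the two sample rows are hypothetical stand-ins for the real .csv file, which is not reproduced here.

```python
import csv
import io

# Hypothetical stand-in for the saved .csv file. The column names and the
# rows are assumptions made for illustration only.
SAMPLE_CSV = """text,created,screenName
"Great service from my provider today",2020-01-01,user_a
"The connection is poor again",2020-01-02,user_b
"""


def load_text_column(handle, column="text"):
    """Read a CSV file object and keep only the chosen attribute."""
    reader = csv.DictReader(handle)
    return [row[column] for row in reader]


tweets = load_text_column(io.StringIO(SAMPLE_CSV))
print(tweets)
```

With a real file, the same function would be called on an opened file object, for example one returned by open() with newline="" as the csv module recommends.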
For the sentiment analysis, all the tweet texts are scanned and each is given a score. The score is based on the positive and negative words in the tweet, which are identified using a positive word file and a negative word file. Figure 8 shows the tweets and their scores. These scores and results can be used to improve the customer experience and business growth by discovering unknown correlations, hidden patterns, customer preferences, market trends, and other valuable information that may help organizations make better business decisions. The technique deployed in this project is Naïve Bayes, which makes strong independence assumptions between the features used in the sentiment analysis. Furthermore, it gives a robust solution across telecommunication service providers [10].
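One common way to score a tweet against a positive and a negative word list is to count matches in each list and take the difference. The sketch below illustrates that approach in Python; it is an assumption about the scoring rule, not the study's R implementation. The two small word sets are hypothetical placeholders for the positive file and negative file.

```python
import re

# Hypothetical placeholders for the positive and negative word files.
POSITIVE_WORDS = {"good", "great", "loved"}
NEGATIVE_WORDS = {"poor", "hated"}


def sentiment_score(text, positive=POSITIVE_WORDS, negative=NEGATIVE_WORDS):
    """Return (number of positive words) minus (number of negative words)."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(word in positive for word in words)
    neg = sum(word in negative for word in words)
    return pos - neg


def polarity(score):
    """Map a numeric score to a positive, negative, or neutral label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for tweet in ["A great service, good service", "Poor service, Poor connection"]:
    print(tweet, sentiment_score(tweet), polarity(sentiment_score(tweet)))
```

A tweet with no matching words, or with equal numbers of positive and negative words, scores zero and is treated as neutral under this rule.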

FINDINGS AND RESULTS
After the scores are assigned, the results graph is plotted based on negative and positive polarity, as shown in Figure 9. Once the graph is plotted, the results are transferred to R Shiny, which is used to visualize them in a clearer and more creative way. R Shiny was chosen because its interface is easy to understand and use, even for a first-time user. In Figure 10, there are five boxes, each with a different color and value. The value in each box is the amount of raw data gathered from the Twitter API for this project. Based on polarity scores, the telecommunication service providers, referred to as Telco 1, Telco 2, Telco 3, Telco 4, and Telco 5, are ranked. Based on the result shown in Figure 11, Telco 4 is the best, receiving 92% positive Twitter comments from its customers. The lowest score belongs to Telco 3, with only 62% positive comments. From this graph, telecom service providers can easily evaluate their performance using their customers' tweet data.
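The ranking step can be illustrated with a short Python sketch, again not the R Shiny code used in the study. It assumes that a provider's score is the percentage of its tweets labelled positive, with neutral and negative tweets counted in the denominator. The provider names and labels in the example are hypothetical and are not the study's data.

```python
from collections import defaultdict


def rank_by_positive_share(labelled_tweets):
    """Rank providers by the percentage of their tweets labelled positive.

    labelled_tweets is an iterable of (provider, polarity) pairs.
    Assumption: neutral and negative tweets count in the denominator.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for provider, polarity in labelled_tweets:
        totals[provider] += 1
        if polarity == "positive":
            positives[provider] += 1
    shares = {p: 100.0 * positives[p] / totals[p] for p in totals}
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)


# Hypothetical labelled tweets, for illustration only.
EXAMPLE = [
    ("Provider A", "positive"), ("Provider A", "negative"),
    ("Provider B", "positive"), ("Provider B", "positive"),
    ("Provider B", "neutral"), ("Provider B", "positive"),
]

ranking = rank_by_positive_share(EXAMPLE)
print(ranking)
```

If neutral tweets were instead excluded from the denominator, the percentages and possibly the order would change, so the choice should be stated alongside any reported ranking.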