A large-scale sentiment analysis using political tweets

ABSTRACT


INTRODUCTION
Social media have become essential to political communication, and Twitter plays a vital role in campaigning during elections. Twitter is one of the most popular social media platforms and gives people the freedom to share their opinions, thoughts, and beliefs with the world. It is increasingly used by politicians, journalists, political strategists, and citizens as a major venue for the discussion of public issues. Governments and politicians monitor social media to see how people respond to different policies and acts. Political scientists who work with Google, Facebook, or other very large datasets may need to know about big data architectures and distributed methods for huge data sets. They can also focus more on new software for data cleaning, data management, reproducible science, data lifecycle management, and data visualization. In the era of big data, data is collected from various sources, such as mobile devices and web browsers, and stored in various formats. Traditional storage and analytics platforms cannot handle this mix of structured and unstructured data. Hadoop, a good platform for big data analytics, offers scalability, cost-efficiency, parallel processing, availability, flexibility, and fast and secure authentication. Hadoop is an open-source framework that comprises a storage part, the Hadoop distributed file system (HDFS), and a processing part, MapReduce.
Sentiment analysis (SA), one of the applications of big data, analyzes data in various ways, including identifying patterns and relationships, making informed predictions, and deriving actionable insights. SA applies text analytics to tweet data that arrives at high velocity and in large volume. Sentiment classification techniques can be roughly divided into the machine learning approach, the lexicon-based approach, and the hybrid approach. The lexicon-based approach is built on a sentiment lexicon, a known collection of precompiled sentiment terms. In this approach, emotional dictionaries are used to classify specific words and sentences from a tweet: the emotional vocabulary of the dictionary is searched for in the text, the emotional weight of each match is calculated, and an aggregate weight function is applied. The technique is therefore governed by a dictionary of previously labeled vocabulary, and the classification of a text depends on the total score obtained from its emotional elements. The machine learning approach treats sentiment analysis as text classification over labeled tweet data. Machine learning methods are mainly applied to emotional analysis through supervised classification, using well-known machine learning (ML) algorithms to analyze linguistic features. The hybrid approach combines both approaches, with the sentiment lexicon playing a key role in the majority of such methods.
A political sentiment lexicon is a lexical resource for sentiment analysis: a database of extreme opinion words collected at presidential election time, together with their sentiment polarity for the political domain. Political lexicons have been widely used for analyzing extreme opinion words in web text. SentiStrength serves as the reference resource for extreme opinion word weights when categorizing the orientation of the political domain lexicon. Building the political sentiment lexicon is one of the challenging components of lexicon-based sentiment classification, because existing sentiment lexicons do not single out extreme political words. The lexicon-based political sentiment classification approach therefore applies a political domain lexicon to classify the sentiment orientation of a sentence. The construction of the political sentiment lexicon affects the performance of both the extreme political opinion word analysis and the sentiment classifier. A political domain-specific lexicon improves the classification of sentiment words in political tweet content.
In this paper, political multi-class sentiment analysis (PMSA) is implemented on a big data analytics platform. In PMSA, a political sentiment lexicon is constructed by extracting valuable information from a large source of data. Multi-class classification is performed with three different machine learning techniques. The proposed PMSA is developed with four modules: data collection, data preprocessing, lexicon generation, and data classification.

RELATED WORKS
This section describes papers related to the proposed PMSA. It has three parts: political lexicon construction, political sentiment analysis, and the proposed work. The first part describes related papers on political sentiment lexicon generation approaches, the second describes political sentiment analysis papers, and the last summarizes the proposed work.

Political lexicon construction
Vu and Thanh [1] proposed a lexicon-based SA system for social media networks. The authors implemented a lexicon-based approach that builds on sentiment dictionaries by utilizing a heuristic method and data preprocessing. Their method was effective because it combines popular lexicon-based sentiment analysis methods, namely the Liu method and SentiWordNet. Moreover, the data preprocessing steps of this system filter the opinion-oriented words from the text before sentiment analysis is conducted. The reported performance was better than that of previous lexicon-based methods.
Almatarneh and Gamallo [2] presented a lexicon-based approach for searching for extreme opinions. They used an unsupervised approach based on the automatic construction of a new lexicon containing the most positive and most negative words. The main purpose of this system is to assign a value to extreme opinions. They described a method to automatically create a lexicon of extremely positive and negative words from labeled corpora. Their automatically created lexicons were compared with other existing lexicons on several partitions of the data.
Feng et al. [3] implemented an automatic sentiment lexicon generation approach for mobile shopping reviews. The authors proposed an automatic approach for constructing a domain-specific sentiment lexicon by considering the relationship between product features and sentiment words in mobile shopping reviews. There are two main parts. First, product features and sentiment words are selected from the original reviews, and categories are used to form the sentiment dimensions. Second, sentiment words related to mobile shopping are classified or clustered into specified categories to form a dimension of sentiment. The generated lexicon was evaluated on a sentiment classification task with reviews of various products written in both English and Chinese.

Int J Elec & Comp Eng ISSN: 2088-8708  A large-scale sentiment analysis using political tweets (Yin Min Tun)

Political sentiment analysis
Political sentiment analysis is an interesting application of SA that uses online political tweets to predict election results. Rohrschneider et al. [4] presented a political SA system for the German federal election in 2009. The purpose was to specify a platform for sentiment analysis of Twitter data and to predict the outcome of the election. The authors identified which political party or politician each tweet referred to and applied the LIWC2007 tool to extract sentiment from the related political tweets. LIWC is text analysis software developed to reveal people's thoughts, cognition, personality, and emotions from text samples. Fujiwara et al. [5] concluded that the number of tweets was directly proportional to the chances of winning the election.
Ringsquandl and Petkovic [6] developed SA on the campaign topics of presidential candidates of the Republican Party in the USA. Their system combined the frequencies of noun phrases with the pointwise mutual information (PMI) measure, with a limitation on aspect extraction. The authors described the semantic relationship between politicians and the topics they hold. According to their experimental results, the accuracy of aspect extraction in their system improved. Elghazaly et al. [7] implemented SA for the Egyptian presidential election based on the classification of Arabic text using the WEKA application. They reported that the naïve Bayes (NB) classifier achieved the highest accuracy with the lowest error rate.
Caetano et al. [8] implemented political sentiment analysis to identify political user classes and user homophily during the 2016 American presidential election. They collected 4.9 million tweets from 18,450 users and their networks from August to November 2016. The authors specified six user classes based on the sentiment expressed toward Hillary Clinton and Donald Trump: whatever, Hillary supporters, Trump supporters, neutral, negative, and positive. Their experimental results showed higher homophily levels among users with multiple connections, similar speech, or reciprocal connections.
Ullab et al. [9] proposed a political sentiment analysis system for the 2016 USA presidential election to search for optimal features and to show that these features were well suited to predicting election results. Their features, including unigram, bigram, trigram, and opinion word features, were analyzed and compared using popular data mining approaches such as random forest (RF), artificial neural network (ANN), and naïve Bayes (NB) classification methods. They applied several preprocessing methods to obtain a well-shaped dataset. Finally, they found that the unigram features gave the highest accuracy, 81%.
Based on the related research papers, the proposed system is designed as a political multi-class sentiment analysis system for Twitter data. PMSA constructs a political lexicon, and the accuracy of the lexicon is evaluated on different political data. PMSA is implemented on a big data platform.

Proposed work
Enhancing the performance of sentiment analysis for the political domain is the major objective of this work. Twitter data analysis is closely related to text mining because most Twitter posts are text messages. It encompasses methodologies such as machine learning, natural language processing, and data mining to appropriately characterize, measure, model, and mine meaningful patterns from large-scale political tweet data. This work reviews the various methods for generating political resources during the presidential election season. To extract opinion words from political tweet content for sentiment analysis, the proposed system investigates lexicon generation for extreme opinion words in the political domain. The system considers opinion data in sentiment analysis. Extreme opinion word analysis needs linguistic resources such as lexicons, corpora, and dictionaries to implement sentiment analysis on political tweets, and political tweet resources help to evaluate extreme opinion word sentiment analysis at election time. The proposed system is based on the creation of a political lexicon, the classification of political opinion words, and a political tweet content mining process. The system builds on the background theory of big data analytics platforms for political tweet sentiment analysis, which is applied in the initial experiments of this research, and it proposes a combination of political lexicon generation and machine learning for the informal short text of political tweets.

PROPOSED METHOD FOR CONSTRUCTING SENTIMENT LEXICON
Sentiment analysis still faces many challenges, and researchers from various disciplines are trying to solve them. Its applications are promising in various political and business settings. Therefore, researchers have begun to introduce lexicon-based sentiment analysis approaches for election Twitter data.

How to build political lexicon for political domain
Sentiment classification aims to automatically classify political retweet text into positive or negative sentiment. Machine learning-based, lexicon-based, and hybrid approaches can be used to classify sentiment. A sentiment lexicon is an important tool for determining the sentiment polarity of a tweet user's opinion. Two methodologies, knowledge-based and corpus-based, are commonly used to create sentiment lexicons. In a political system, domestic politicians have large numbers of supporters and voters in the places where they campaign. These people obtain political information and opinions about the candidates they want to elect, and they share their messages on the internet. Sentiment analysis and opinion mining have focused on many aspects of opinion, notably polarity classification using positive, negative, or neutral values. However, most studies have overlooked the identification of extreme opinions in spite of their significance in many applications. The political lexicon generation uses a supervised machine learning approach to search for extreme opinions, based on the automatic construction of a political lexicon containing positive, strong-positive, negative, strong-negative, and neutral words.
A political sentiment lexicon is a lexical resource for sentiment analysis: a database of extreme opinion words from presidential election time with their sentiment polarity for the political domain. It can be widely used for analyzing the extreme opinion words in web text. The political lexicon recognizes political domain-specific sentiment based on the opinions of Twitter users. Sentiment scores (positive, strong-positive, negative, strong-negative, and neutral) are generated for the opinion words found in the political tweet terms. Word occurrences are weighted by the term frequency-inverse document frequency (TF-IDF) method [10]. In the first step of this process, the part of speech of each word is extracted, and each word is ranked according to its part-of-speech (PoS) tag. In the next step, these tags are normalized. In the final step, the total score for each word is computed by multiplying its vector weight by the initial score. Sentiment score calculation is then applied to the preprocessed data to define the class label of each record. In this approach, the text is searched for entries in the emotional vocabulary dictionary, and their weights are calculated and combined by a total weight function. The proposed technique is improved with the help of a pre-tagged lexicon dictionary and the TF-IDF method for creating the input tweet vector. In this system, the political Twitter stream data contains opinion words that help to determine sentiment about the political parties. The political lexicon is used to extract the opinion words about political parties in the presidential election.
The political lexicon mainly captures the way voters express their feelings and opinions when they vote. The generated political lexicon is applied to the political domain to classify the sentiment orientation of a sentence. In other words, the construction of the political sentiment lexicon affects the performance of both the extreme political opinion word analysis and the sentiment classifier. The performance of the political domain-specific lexicon classifier is enhanced when the sentiment lexicon includes political words that carry strong sentiment, which are used to classify the opinion words in political tweet content. Extreme opinion words are extracted from political tweet content with a dictionary-based approach for the purpose of sentiment classification. An opinion seed word is a word used to collect antonyms and synonyms from the dictionary, which are then applied in sentiment analysis. Consequently, if the opinion seed words carry incorrect sentiment polarity, the sentiment classifier misclassifies the sentiment of political words. The lexicon is generated automatically by constructing it from the training data. This lexicon-based political sentiment lexicon generation is proposed for the political domain.

Knowledge-based
A knowledge base is a lexical resource organized as a graph, such as WordNet. The political lexicon is developed as a knowledge-based dictionary in order to find synonyms and antonyms of words. The closer the synonym and antonym words are to the words in the tweets, the fewer iterations are required to establish a synonym relation between the words. Such studies use the relationships between words in a knowledge base. The main idea of these methods is to manually collect an initial set of sentiment words and their orientations, and then to expand this set by searching the knowledge base for their synonyms and antonyms.
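The seed-expansion idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the small `SYNONYMS` and `ANTONYMS` tables are hypothetical stand-ins for a knowledge base such as WordNet, and the function name and seed word are chosen only for the example.

```python
from collections import deque

# Hypothetical stand-ins for a knowledge base such as WordNet.
SYNONYMS = {"good": ["great", "fine"], "great": ["excellent"],
            "bad": ["poor"], "poor": ["awful"]}
ANTONYMS = {"good": ["bad"], "bad": ["good"]}


def expand_lexicon(seeds):
    """Breadth-first expansion of a seed lexicon {word: +1 or -1}."""
    lexicon = dict(seeds)
    queue = deque(seeds)
    while queue:
        word = queue.popleft()
        polarity = lexicon[word]
        # Synonyms inherit the polarity; antonyms receive the opposite one.
        neighbours = [(s, polarity) for s in SYNONYMS.get(word, [])]
        neighbours += [(a, -polarity) for a in ANTONYMS.get(word, [])]
        for neighbour, sign in neighbours:
            if neighbour not in lexicon:
                lexicon[neighbour] = sign
                queue.append(neighbour)
    return lexicon


lexicon = expand_lexicon({"good": 1})
```

Starting from a single positive seed, the expansion reaches the negative words through the antonym link, which also shows why an incorrectly labeled seed would propagate its error to every word derived from it.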

Corpus-based
These methods rely on syntactic or co-occurrence patterns around an initial list of sentiment words to find other sentiment words in a larger corpus. Based on the corpus, changes of emotional polarity in the text can be detected. In a corpus-based system, the emotional polarity of a word is steered toward the emotional polarity of its neighboring words. Such work is based on a corpus rather than a knowledge base. The great advantage of corpus-based methods is that domain-specific words and their orientations are found in the process.

Machine learning
Machine learning methods classify data into intended categories on the basis of annotated data. They can be grouped into three broad categories: supervised, semi-supervised, and unsupervised. When annotated data is not available, a vocabulary-based (lexicon-based) approach is chosen instead. Machine learning approaches are fully automated, convenient, and capable of handling large data collections. These methods require a training dataset, which is used to develop a classification model over feature vectors, and a test dataset, which is used to predict class labels for unseen feature vectors. Most sentiment analysis work uses machine learning. In this system, three machine learning methods, multinomial naïve Bayes (MNB), decision tree (C4.5), and linear SVC, were applied to classify the sentiments expressed in politicians' retweet sentences in political tweet content.
The combination of lexicon-based and machine learning methods extracts extreme opinion words and constructs a classification model to evaluate the lexicon performance. The proposed system provides a unified framework in which lexical background information, unlabeled data, and labeled training sentences can be effectively combined. The system analyzes political retweet content and sentiment sentences from BBC News political discourse.

PROPOSED METHOD FOR POLITICAL SENTIMENT ANALYSIS
Alaoui et al. [11] presented a three-stage approach that dynamically constructs a dictionary of word polarities for a given topic based on selected hashtags. In their research, tweets related to the 2016 US election were assigned to negative and positive classes. According to their evaluation on a traditional data analytics platform, high prediction accuracy was achieved, but they did not show improved accuracy on a big data analytics platform. In 2019, a sentiment analysis system based on a supervised machine learning approach was created for the general election of India using political sentiment data from Twitter [12].
They used long short-term memory (LSTM) as the classification model and compared the experimental results with other machine learning models. They did not consider big data analysis and did not perform multi-class classification. Hasan et al. [13] applied a hybrid technique to political analysis with two machine learning techniques (SVM and naïve Bayes).
In this research, they evaluated the performance of the two machine learning approaches, but they did not consider multi-class classification. Baltas et al. [14] presented a sentiment analysis tool for analyzing microblogging messages and their sentiment. In their system, the naïve Bayes, decision tree, and logistic regression classifiers of MLlib were used for classification. According to their experimental results, the naïve Bayes classifier achieved better accuracy than the other two methods, but they also did not perform multi-class sentiment analysis [15].
Another study implemented sentiment analysis for personalized tweet recommendation and tweet classification, using a domain-specific seed list to classify the tweets. It achieved 96% accuracy on a dataset of one million tweets [16]. Thelwall et al. [17] detected sentiment strength in short informal text. The authors optimized the sentiment strengths using a machine learning approach and showed that their proposed system outperformed the baseline approaches.
Bouazizi and Ohtsuki [18] studied the challenges and classification performance of multi-class sentiment analysis on Twitter data. They proposed a new model to represent sentiments and applied it to describe the relationships between different sentiments and to discuss the difficulty of multi-class sentiment analysis. Raw data is collected from various social media sources in unstructured, semi-structured, and structured forms. Among the popular social media networks, Twitter combines features of social networks and web blogs [19]. The rapid growth of Twitter helps journalists, citizens, politicians, and political strategists, and it has made Twitter an important part of the media network for publicly mediated political affairs [20].

The framework of proposed political sentiment analysis system
This framework is designed for parallel and distributed execution, with batch data processed on different data nodes. For this purpose, Apache Hadoop has been adopted as the implementation framework [21]. On the other hand, Apache Spark offers higher performance by extending Hadoop's abilities and permitting the processing of real-time stream data [22]. The system is developed to extract the required information from political big data on social media. It is implemented on a big data analytics platform (Apache Spark) to handle tweet data of high velocity and large volume. There are four main modules in the proposed system: data collection, data preprocessing, political lexicon generation, and data classification. These modules are implemented on four layers: a data ingestion layer, a storage layer, a processing layer, and a data analytics layer. Real-time stream data from Twitter is collected using Apache Flume, and the ingested data is delivered to the Hadoop distributed file system (HDFS) through a memory channel. HDFS forms the storage layer of the system, and Spark is used at the processing layer for batch processing. The other modules are implemented at the analytics layer. To get high accuracy, this system uses a hybrid approach (a combination of three machine learning approaches and lexicon-based classification), which is very useful for sentiment analysis.

Implementation of big data platform
The multi-class political sentiment analysis system is developed on a big data analytics platform with Spark Streaming, HDFS, Spark MLlib, and Apache Flume [23]-[25]. The functions of the four layers are discussed in this section.
The layer of data ingestion: At this layer, stream data from Twitter is collected by Apache Flume. For offline processing, the data is pushed to the HDFS sink. For online processing, Spark Streaming sets up a receiver that acts as an Avro agent, and the collected data is pushed to the Avro sink.
The layer of data storage: HDFS is used at this layer to store the collected data reliably. HDFS has a master/slave architecture in which a single name node acts as the primary server. The name node performs file namespace operations such as opening, renaming, and closing files and directories, and it determines the mapping of blocks to data nodes. The data itself is stored on the data nodes.
The layer of data processing: In this layer, Spark and the YARN cluster manager are used to process large amounts of data in parallel on hardware clusters in a reliable and fault-tolerant manner. In YARN, the resource manager tracks the resources of the node managers, and each node manager controls the resources of its slave node. Each slave node runs one or more executors, and each executor is a worker process. According to the schedule, each executor runs its tasks in a separate Java virtual machine (JVM). The loaded data is split into multiple partitions of a resilient distributed dataset (RDD), and transformations are applied to these partitions. The manager has to run multiple tasks simultaneously in multiple threads.
The layer of data analytics: This layer has two main components, offline training and online classification. Offline training covers data preprocessing, class labeling, and model generation to create the classification model. Incoming real-time tweets are preprocessed and classified by online classification. The analytics layer is executed on the distributed Spark engine. The multi-class sentiment classification based on the domain-specific political sentiment lexicon was developed with the help of Spark.
As shown in Figure 1, political data relates to the activities associated with making decisions for groups and distributing resources. To extract useful political information efficiently, the implementation of the proposed system has four steps: data collection, data preprocessing, class labeling, and sentiment classification.

Data collection
In this system, stream data from Twitter is collected by Apache Flume configured for batch streaming, and the collected data is delivered to HDFS sinks [21]. English tweets are collected and filtered using keywords. Twitter data and Apache Flume therefore play the main roles in data collection.

Data pre-processing
Because the ingested data includes useless and duplicated records, it needs to be preprocessed and cleaned for an effective analysis system. Data preprocessing consists of tweet text feature selection and the removal of duplicated and noisy data, stop words, and repeated characters. This step simplifies the classification job and decreases the processing cost of the training stage.

Tweet text feature selection
The selected tweet features are used for sentiment analysis, which describes the feelings and opinions of the Twitter user. One tweet stream record has various attributes. Among them, the "text" attribute carries the opinion and feeling of the Twitter user. The values of the "text" attribute are selected because this system analyzes the mood of the people. Only English text is used in this system: when the "text" values are selected, the value of the "lang" attribute is checked to determine whether the text is in English, and only records whose "lang" attribute indicates English are kept for analysis.
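The selection step can be illustrated with a short sketch. It assumes that each stream record is one JSON object per line and that English is marked by the language code "en"; the function name and the sample records are hypothetical.

```python
import json


def select_english_text(raw_lines):
    """Return the "text" value of every record whose "lang" is "en"."""
    texts = []
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed records
        if not isinstance(record, dict):
            continue  # skip JSON values that are not tweet objects
        if record.get("lang") == "en" and "text" in record:
            texts.append(record["text"])
    return texts


sample = [
    '{"lang": "en", "text": "Great debate tonight"}',
    '{"lang": "es", "text": "Gran debate"}',
    "not json",
]
english_texts = select_english_text(sample)
```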

Removal of noisy data
In tweet data, information that is useless for classification is treated as noise. Such data includes hashtags, website URL links, character repetitions, extra white space, and @username mentions. Hashtags are replaced with the same word without the hash sign; for example, #fun is replaced with fun. In this system, non-alphabetic characters are replaced with a space.
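A minimal sketch of such a cleaning step is shown below. The regular expressions and the order in which they are applied are assumptions for illustration, as is the choice to shorten runs of three or more identical characters to two.

```python
import re


def remove_noise(text):
    text = re.sub(r"https?://\S+", " ", text)   # drop URL links
    text = re.sub(r"@\w+", " ", text)           # drop @username mentions
    text = re.sub(r"#(\w+)", r"\1", text)       # "#fun" -> "fun"
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)  # "soooo" -> "soo" (assumed rule)
    text = re.sub(r"[^A-Za-z\s]", " ", text)    # non-alphabetic -> space
    return re.sub(r"\s+", " ", text).strip()    # collapse white space


cleaned = remove_noise("@bob This is soooo #fun!!! http://t.co/xyz")
```

URLs and mentions are removed before the non-alphabetic replacement so that their fragments do not survive as stray tokens.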

Removing of character repetition
To remove duplicate tweets, each extracted tweet is checked to determine whether it already exists. A duplicated tweet adds no extra words, so it is deleted; duplicates are detected using the substring function. A tweet that is not a duplicate is added to the existing list of tweet data and analyzed once feature extraction from the tweets is complete.
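Duplicate removal can be sketched as follows. The source describes a substring-based check; this illustration swaps in a simpler set lookup on a normalized key, and the normalization (lower-casing and collapsing white space) is an assumption.

```python
def drop_duplicates(tweets):
    """Keep the first occurrence of each tweet, comparing normalized text."""
    seen = set()
    unique = []
    for tweet in tweets:
        key = " ".join(tweet.lower().split())  # assumed normalization
        if key not in seen:
            seen.add(key)
            unique.append(tweet)
    return unique


unique_tweets = drop_duplicates(["Vote now", "vote  now", "Stay home"])
```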

Removing of stopwords
Removing stopwords is an important technique for noise elimination in the classification of tweet text. Stopwords need to be removed from tweet data because they carry no meaningful information. The stopword list of this system depends not only on the classification domain but also on manual examination of the data.
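As an illustration, stopword filtering reduces to a set lookup per token. The short `STOPWORDS` set below is a hypothetical placeholder; the system's actual list is domain-dependent and manually curated, as stated above.

```python
# Illustrative list only; the real list is domain-specific and hand-checked.
STOPWORDS = {"the", "is", "a", "an", "of", "to", "and", "rt"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Lower-case, split on white space, and drop stopword tokens."""
    return [token for token in text.lower().split() if token not in stopwords]


tokens = remove_stopwords("The economy is a mess")
```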

Negation handling
Negation handling identifies the scope of a negation and inverts the polarity of the opinion words affected by it. It significantly improves the accuracy of classification. For example, the phrase "not good" represents negative sentiment, so a word that follows "not" is rewritten as "not_" + word.
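The "not_" + word rewriting can be sketched as below. The set of negation cues and the one-word negation scope are assumptions made for the example; the source only specifies the "not_" prefix.

```python
NEGATIONS = {"not", "no", "never"}  # assumed set of negation cues


def mark_negation(tokens):
    """Prefix the word that follows a negation cue with "not_"."""
    marked = []
    negate = False
    for token in tokens:
        if token in NEGATIONS:
            negate = True  # the cue itself is dropped, even if nothing follows
            continue
        marked.append("not_" + token if negate else token)
        negate = False     # assumed scope: only the next word
    return marked


marked = mark_negation(["this", "is", "not", "good", "policy"])
```

With this rewriting, "good" and "not_good" become distinct features, so a lexicon or classifier can assign them opposite polarities.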

Political lexicon generation
In lexicon generation, the cleaned data is split into chunks of unigram, bigram, trigram, and n-gram words. A JSON parser is then used to split out the POS tags. The political opinion words (Adj, Adv, NN, and Verb) that are related to the candidate are extracted. The TF-IDF method is used for counting; it assigns higher weight to terms that recur frequently in a document, treating them as important words. Using TF-IDF, relevant words or keywords in a news text are found, and each word is weighted according to how often it recurs in the political news.
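The n-gram chunking and TF-IDF weighting can be illustrated with a small pure-Python sketch. It uses the textbook form tf × log(N / df) without smoothing, which may differ from the exact variant used in the system; the function names and sample tokens are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict each."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [["great", "leader"], ["corrupt", "leader"]]
weights = tf_idf(docs)
bigrams = ngrams(["make", "america", "great"], 2)
```

A term that occurs in every document, such as "leader" here, receives zero weight, while the distinguishing opinion words keep a positive weight.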

Sentiment score calculation
Each sentence is assigned its polarity in accordance with sentence-level sentiment analysis criteria. This is typically accomplished by determining the polarity of individual words and phrases and then combining them to determine the overall polarity of the political tweet sentence. The mood expressed in each political tweet was estimated by adding the polarities of the words and phrases in its content, with each new piece of content treated as a sentence.
Political word scores for the content of political news tweets were calculated using the sentiment extraction operator, which gives the final result in terms of sentiment. Texts with sentiment scores between -1 and -5 are considered negative or strong negative, while texts with sentiment scores between +1 and +5 are considered positive or strong positive. With the political lexicon, this operator provided accurate results. We also use the sentiment score function based on (1)-(5) to calculate the total sentiment score for political news content in tweets.
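A sentence-level score of this kind can be sketched as a sum of word polarities. The lexicon entries below are hypothetical, and clamping the sum to the range -5 to +5 is an assumption made so that the result stays within the score range described above; it is not a reconstruction of equations (1)-(5).

```python
# Hypothetical lexicon entries with polarities in the range -5..+5.
POLITICAL_LEXICON = {"corrupt": -4, "weak": -2, "strong": 2,
                     "great": 4, "not_great": -4}


def sentence_score(tokens, lexicon=POLITICAL_LEXICON):
    """Sum the polarities of known words and clamp the total to [-5, 5]."""
    total = sum(lexicon.get(token, 0) for token in tokens)
    return max(-5, min(5, total))  # assumed clamping rule


score = sentence_score(["a", "strong", "and", "great", "leader"])
```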

Class labeling
The generated political lexicon is applied to the training data to define the class label. The class labeling procedure is as follows: the class label of a sentence depends on the total polarity of the sentence.
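A minimal sketch of mapping a total polarity to one of the five classes is given below. The boundary between the plain and the strong classes is not stated in this section, so the strong threshold of 3 is purely an assumption, as is treating a score of 0 as neutral.

```python
def label_for(score, strong=3):
    """Map a total sentence polarity to one of five class labels.

    `strong` is a hypothetical cut-off between the plain and strong classes.
    """
    if score >= strong:
        return "strong positive"
    if score > 0:
        return "positive"
    if score <= -strong:
        return "strong negative"
    if score < 0:
        return "negative"
    return "neutral"

labels = [label_for(s) for s in (5, 1, 0, -1, -5)]
```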

Classification model
In the final step, classification models are developed from the class labels and feature vectors. The model with the best accuracy was selected from the developed classification models of this system [26]. Three different classification methods (MNB, decision tree, and linear SVC) are implemented on the training and testing data of the proposed system [27].

Multinomial naïve Bayes (MNB)
The multinomial naïve Bayes algorithm is a probabilistic learning method based on Bayes' theorem that is mostly applied in NLP research. MNB evaluates the probability of each tag for a given sample and returns the tag with the highest probability as the output. The naïve Bayes classifier is a family of algorithms that share one common principle: each feature being classified is assumed to be unrelated to any other feature.
− Maximum likelihood, where $\theta_{yi}$ is the probability $P(x_i \mid y)$ of feature $i$ appearing in a sample belonging to class $y$.
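To make the per-class feature probability concrete, the sketch below estimates it by smoothed relative frequency, (N_yi + alpha) / (N_y + alpha * n), where N_yi is the count of feature i in class y, N_y the total feature count in class y, and n the vocabulary size. That the system uses this particular smoothed estimator is an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_mnb(samples, labels, alpha=1.0):
    """samples: list of token lists. Returns {class: (log_prior, {term: log_theta})}."""
    vocab = sorted({t for s in samples for t in s})
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for tokens, y in zip(samples, labels):
        term_counts[y].update(tokens)
    model = {}
    for y, n_y in class_counts.items():
        total = sum(term_counts[y].values())
        log_prior = math.log(n_y / len(labels))
        # Smoothed relative frequency of each vocabulary term within class y.
        log_theta = {
            t: math.log((term_counts[y][t] + alpha) / (total + alpha * len(vocab)))
            for t in vocab
        }
        model[y] = (log_prior, log_theta)
    return model

def predict_mnb(model, tokens):
    """Return the class with the highest log posterior; unseen terms are skipped."""
    def log_posterior(y):
        log_prior, log_theta = model[y]
        return log_prior + sum(log_theta[t] for t in tokens if t in log_theta)
    return max(model, key=log_posterior)

train_x = [["great", "win"], ["great", "plan"], ["bad", "loss"], ["bad", "plan"]]
train_y = ["positive", "positive", "negative", "negative"]
mnb = train_mnb(train_x, train_y)
```

Working in log space avoids numeric underflow when many small probabilities are multiplied.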

Decision tree (DT)
The decision tree-based classification model, also known as a statistical classifier, is an approach for the classification of data. C4.5 is the technique used for decision tree generation.
In its splitting criterion, $p_i$ is the class proportion for the output, $S$ is a set of cases, $A$ is a case attribute, $|S_i|$ is the count of cases in partition $i$, and $|S|$ is the count of cases in the set.
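The quantities listed above are those of the standard entropy and information gain measures, Entropy(S) = -sum_i p_i log2 p_i and Gain(S, A) = Entropy(S) - sum_i (|S_i| / |S|) Entropy(S_i). The sketch below computes both. It assumes these are the measures intended; C4.5 additionally normalizes the gain by the split information (gain ratio), which is omitted here.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Entropy of a set of cases from its class proportions p_i."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction obtained by partitioning the cases on attribute `attr`."""
    parts = defaultdict(list)
    for row, y in zip(rows, labels):
        parts[row[attr]].append(y)
    n = len(labels)
    # Weighted entropy of the partitions S_i, each weighted by |S_i| / |S|.
    remainder = sum(len(p) / n * entropy(p) for p in parts.values())
    return entropy(labels) - remainder

# Hypothetical cases: one attribute "word" and a binary class label.
rows = [{"word": "great"}, {"word": "great"}, {"word": "bad"}, {"word": "bad"}]
labels = ["pos", "pos", "neg", "neg"]
gain = information_gain(rows, labels, "word")
```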

Linear SVC
Linear SVC creates a binary classification model using the linear SVM classification model. The SVC hyperplane produces the separation that achieves the greatest distance to the nearest training data points of any class. Hinge loss is optimized with the orthant-wise limited-memory quasi-Newton (OWLQN) optimizer.
$L(w; x, y) = \max(0,\, 1 - y\,w^{T}x)$
$f(x) = w^{T}x + b$
where $L(w; x, y)$ is the loss function for linear SVC, $f(x)$ is the output for the raw tweets, $w^{T}$ is the word vector, $y$ is the sentence label, and $x$ is the emotional words for each sentence.
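The hinge loss can be illustrated numerically as below, for a binary label y in {-1, +1}. Because linear SVC is binary and the paper later reports a one-vs-rest approach for the multi-class task, the sketch also shows one-vs-rest prediction over per-class weight vectors. The weights, feature values, and function names are hypothetical, and no bias term or optimizer is included.

```python
def dot(w, x):
    """Inner product of a weight vector and a feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))

def hinge_loss(w, x, y):
    """max(0, 1 - y * w.x) for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * dot(w, x))

def predict_one_vs_rest(weights_by_class, x):
    """Pick the class whose binary classifier gives the largest raw score w.x."""
    return max(weights_by_class, key=lambda c: dot(weights_by_class[c], x))

w = [1.0, -1.0]
loss_correct = hinge_loss(w, [2.0, 0.0], 1)    # margin of 2, no loss
loss_inside = hinge_loss(w, [0.5, 0.0], 1)     # inside the margin
loss_wrong = hinge_loss(w, [2.0, 0.0], -1)     # wrong side of the hyperplane

# Hypothetical per-class weight vectors for a three-class problem.
ovr = {"positive": [1.0, -1.0], "negative": [-1.0, 1.0], "neutral": [0.0, 0.0]}
predicted = predict_one_vs_rest(ovr, [0.2, 0.9])
```

The loss is zero only for samples classified correctly with a margin of at least 1, which is what pushes the hyperplane away from the nearest training points.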

EXPERIMENTAL RESULTS AND DISCUSSION
To analyze the experimental results, the performance of the three applied classification techniques is compared to find the optimal classifier for the multi-class political sentiment analysis system. The system uses an unstructured political dataset containing tweets of Trump, Clinton, Obama, and Joe Biden for the American presidential election, together with discourses from BBC News politics, as shown in Table 1. For each dataset, 80% is used for training and 20% for testing. The common performance measures (accuracy, recall, F-measure, and precision) are evaluated for the proposed analysis system.
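The 80/20 split and the four measures can be sketched as follows. How the system averages precision, recall, and F-measure over the classes is not stated, so the macro average used here is an assumption, as are the shuffling seed and the function names.

```python
import random

def split_80_20(items, seed=0):
    """Shuffle a copy of the items and return (training 80%, testing 20%)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = len(items) * 4 // 5
    return items[:cut], items[cut:]

def evaluate(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F-measure."""
    pairs = list(zip(y_true, y_pred))
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f_measures = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f_measures.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    k = len(classes)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(precisions) / k,
        "recall": sum(recalls) / k,
        "f_measure": sum(f_measures) / k,
    }

train, test = split_80_20(range(10))
report = evaluate(["pos", "pos", "neg", "neg"], ["pos", "neg", "neg", "neg"])
```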
Twitter is a social media platform that enables the posting of tweets, or brief messages. It was used by nearly 328 million active users from 2006 to 2017. During the 2016 American Presidential Election, it was one of the most widely used online social networks. A name, a profile description, a photo, and a location are the components of a Twitter user's profile. The information was gathered from August 1, 2016, to November 30, 2016, using the official Twitter API to obtain user profiles and their contact networks for tweets. To evaluate the system, the proposed lexicon-based classifier is evaluated from the beginning of the analysis. To establish ground truth, its classified results are compared with the results of the three machine learning classifiers on the same data. Data preprocessing is applied to improve classifier performance.
To analyze the performance of this system, the political lexicon for the political domain classifier is compared across the political datasets. The experimental results of the system performance are described in Figure 2. Table 2 shows the comparison of accuracy between each method and our proposed lexicon-based classification on a political dataset. In the experimental results, the percentages of strong positive, positive, neutral, negative, and strong negative tweets are calculated for the lexicon-based classification and for the machine learning classification on the political dataset. The accuracy of naïve Bayes on the five political datasets is shown in Table 3. The results show that the Biden dataset has the best accuracy, at 71.00%.
The accuracy of the decision tree (C4.5) on the five political datasets is shown in Table 4. The Biden dataset has the best performance, with 99.30% accuracy. The accuracy of linear SVC on the five political datasets is shown in Table 5. The Biden dataset and the BBC News politics dataset have the same accuracy of 100%.
The performance comparisons of the three classifiers on the political dataset are described in Table 6. On average, linear SVC achieves higher precision, recall, F-measure, and accuracy than naïve Bayes and decision tree on the political datasets. The selected model (linear SVC) is applied to classify the newly collected tweets. The comparison of the machine learning classification models on the political dataset is illustrated in Figure 3.
In this section, the performance of the three models is compared. The results show that linear SVC performs better than the other techniques. According to the results, linear SVC, used as a one-vs-rest classifier with the proposed political lexicon-based classification model, achieves the best accuracy.

CONCLUSION
In PMSA, the system is implemented on a big data analytics platform (Apache Spark) to analyze high-velocity, large volumes of tweets effectively. The hybrid approach is applied to achieve strong performance for the multi-class classification system on a vast volume of social-political data. In the proposed system, the political lexicon is created by collecting extreme opinions with their polarity, and a classification model is constructed with three machine learning techniques (MNB, DT, and linear SVC). The generated lexicon is applied to different political datasets to show its performance. In the experiments, the performance of the political lexicon is evaluated by three different machine learning techniques on different political datasets. According to the results, linear SVC achieves an accuracy of 98%, which is better than the other two methods. PMSA can also classify discourse news in the political domain. In future work, the political multi-class sentiment analysis system will be extended with deep learning in a big data environment.

Figure 2. Percentage of tweets for political dataset on classification by political lexicon

Figure 3. Performance of tweets for political dataset on classification by different models

Table 2. Political dataset accuracy for political lexicon

Table 3. Performance evaluation of political dataset with naïve Bayes

Table 4. Performance evaluation of political dataset with decision tree (C4.5)

Table 5. Performance evaluation of political dataset with linear SVC

Table 6. Overall performance evaluation of three machine learning models