Measuring information credibility in social media using combination of user profile and message content dimensions

Received Mar 19, 2019; Revised Feb 3, 2020; Accepted Feb 12, 2020

Information credibility in social media is becoming the most important part of information sharing in society. The literature shows that no existing work labels information credibility based on user competencies and the topics users post about. This paper improves information-credibility measurement by adding 17 new features for Twitter and 49 new features for Facebook. In the first step, we perform a labeling process based on user competencies and posted topics to classify users into two groups, credible and not credible, with respect to the topics they post. These approaches are evaluated over ten thousand samples of real-field data obtained from the Twitter and Facebook networks using the Naive Bayes (NB), Support Vector Machine (SVM), Logistic Regression (Logit), and J48 classifiers. With the proposed new features, the measured credibility of information in social media improves significantly, as indicated by better accuracy than the existing techniques for all classifiers.


INTRODUCTION
It cannot be denied that the popularity of social media has increased rapidly in recent years. Currently, the micro-blogging site Twitter has about 320 million monthly active users. Twitter is a global phenomenon: 77% of Twitter accounts are outside of the United States, and Twitter supports 33 languages. Because of the efficiency, volume, and timeliness of its information, Online Social Networking (OSN), for example twitter.com, has become an important source of information [1]. According to the Twitter blog, an average of about 340 million tweets were generated per day as of March 2012. In addition to receiving information from the people they "follow", people are increasingly looking for relevant topical tweets, generating more than 1.6 billion search requests per day on Twitter's search portal.
In particular, learning about news is often an important motivation for people to read tweets [2], for example, in order to continuously receive updated information about local emergencies [3]. One of the OSN functions is to be a medium for sharing and searching for information [4, 5]. Each user can act as both a source and a spreader of information, forwarding it either in full or with modifications and additions. The role of OSN as a source of information is even more prominent in emergencies such as accidents, natural disasters, and incidents of terrorism, because it provides faster reports than conventional media [6-14].
However, false information that spreads on social media has serious consequences. Thus, a mechanism to automatically determine the credibility of a tweet is required. Morris et al. conducted a survey to understand users' perceptions of credibility on Twitter [3]. They also conducted an experiment with the purpose of uncovering user-based and content-based features used to assess credibility. Consequently, user-based features can be grouped into three categories: influence, topical

RESEARCH METHOD
The proposed information credibility model is shown in Figure 1. The dataset is divided into two parts, i.e., training data and testing data. Training data are labeled manually, while testing data are pre-processed, including feature extraction. The features extracted from the training data enter the feature selection process and then the credibility classification modeling process; the resulting model is used to predict the testing data. Finally, a Twitter credibility class with good accuracy is expected to be obtained.

Labeling
Labeling is applied based on the compatibility between user competencies and the tweet or message. In this paper, we adopt the concept that a posted tweet whose topic is correlated with the competence of the posting account is more likely to be credible than one whose topic is uncorrelated with that competence. Throughout this paper, a tweet is a message posted on Twitter, and a message is a post on Facebook. We perform labeling manually for tweet and message categories, while for user competencies we conducted a real survey. The objective of the survey was to collect information on user competencies. We ran an online survey through the website www.surveymonkey.com from January to March 2017. Respondents were asked their opinion of 256 famous people and the competence corresponding to each. The information displayed in the survey includes photos, bio profiles, the five tweets and five messages with the highest engagement, the number of followers, the number of tweets, and the number of accounts followed. The survey was completed by 188 respondents, 137 men and 51 women; the job distribution is shown in Figure 2. The four largest respondent groups are private employees (28.19%), lecturers (27.13%), students (19.15%), and the self-employed (15.43%).
The respondent distribution based on education is shown in Figure 3. The largest group is 98 respondents (52%) with a Bachelor degree, followed by 62 respondents (33%) with a Master degree, 13 respondents (7%) at the Senior High School level, 4 respondents (2%) with a 3-year Diploma, and 1 respondent with a pharmacist education. Whether a user is competent in a field is determined by the highest number of respondent opinions given for each of the 256 famous people. The survey was conducted to obtain competencies for 256 famous people, including 115 whose data are taken from Twitter. Competence sample data for 10 people are shown in Table 1. Two credibility labels are used in this study, i.e., "credible" and "not credible". We define information as credible when a famous person posts a tweet or message appropriate to their competencies. On the other hand, when the tweet or message is posted outside the famous person's competencies, the information is considered not credible. The process is shown in Figure 4. The distribution of information credibility labeling is shown in Table 2 for Twitter and in Table 3 for Facebook.
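As a minimal sketch of the labeling rule above, the following Python snippet labels a post "credible" when its topic matches one of the author's surveyed competencies. The function name and the competence data are illustrative, not taken from the paper's implementation.

```python
def label_post(author_competencies, post_topic):
    """Label a post 'credible' when its topic matches one of the
    posting account's competencies, else 'not credible'."""
    return "credible" if post_topic in author_competencies else "not credible"

# Hypothetical competence data in the style of Table 1.
competencies = {
    "user_a": {"Law", "Politics"},
    "user_b": {"Entertainment"},
}

print(label_post(competencies["user_a"], "Politics"))   # credible
print(label_post(competencies["user_b"], "Politics"))   # not credible
```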

Pre-processing
Assuming text input from the original tweet (Twitter) or post message (Facebook) content, pre-processing consists of case folding, tokenization, stop-word removal, and stemming. Case folding is the process by which the words or phrases in a tweet or post-message text are converted into lowercase letters (a to z). This is expected to solve problems that arise when the same word is written with different capitalization.

Tokenization cuts the input tweet or post message into its composing words. In principle, it separates each word in the tweet or post-message text. This process includes deleting numbers, punctuation, and characters other than alphabetical letters; these characters are considered word separators, so they are removed to prevent "noise" in further processes. Meanwhile, stop-word removal removes non-topical words that are not considered important, such as "and", "this", "that", "is", "or", "which", "through", and so on. This pre-processing helps reduce irrelevant features in the data. Finally, stemming is the process of finding root words by removing prefixes, infixes, suffixes, and confixes (combinations of prefixes and suffixes) from derivative words. After stemming, word variants that share the same root are treated as the same feature, which helps improve performance in information retrieval.
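The pre-processing steps above can be sketched as follows. This is an illustrative Python version with a toy English stop-word list; a real pipeline would use full stop-word lists for the target language and apply a proper stemmer as the final step.

```python
import re

# Toy stop-word list from the examples above; illustrative only.
STOPWORDS = {"and", "this", "that", "is", "or", "which", "through"}

def preprocess(text):
    # Case folding: convert the tweet/post message to lowercase.
    text = text.lower()
    # Tokenization: keep runs of alphabetic letters only; numbers,
    # punctuation, and other characters act as separators and are dropped.
    tokens = re.findall(r"[a-z]+", text)
    # Stop-word removal: drop non-topical words.
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("This is Breaking News #2: flood THROUGH the city!"))
# → ['breaking', 'news', 'flood', 'the', 'city']
```

Stemming is omitted here; in the full pipeline each remaining token would be reduced to its root form before feature extraction.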

Feature extraction
This section elaborates the feature extraction for Twitter and Facebook. The feature distribution for both platforms is presented, along with the user profile dimension features and the message content dimension features.

Features used on Twitter
This paper uses two dimensions of features, namely the user profile dimension and the message content dimension. The most popular features used by previous works are also summarized in this study: in total, 33 features collected from 5 different papers that use classifiers to predict credibility [3, 15, 16, 18, 22]. Furthermore, 17 new features are proposed in Table 4, indicated by underlined bold type. The description of each of the 45 features is shown in Tables 5 and 6; they include the following:

numPosWordDesc: the ratio of the number of positive-sentiment words to the number of words in an account's bio profile. The larger the ratio, the higher the account's credibility.
numNegWordDesc: the ratio of the number of negative-sentiment words to the number of words in an account's bio profile. The smaller the ratio, the higher the account's credibility.
check_web_personal: whether the profile has a URL that connects to the user's own website, which can be used to assess credibility.
check_location: whether the profile lists a location, which helps confirm the authenticity of the user's area.
is_verified: a verified account is an official account that has been authenticated by Twitter.
number_follower: the number of followers indicates how many other users want to see or follow the trail of information from the user; the more followers, the higher the level of trust.
number_statuses: the number of statuses reflects the user's level of activity on Twitter; more active users are considered more credible.
number_following: the number of accounts followed suggests that the user has many friends who may provide more sources of information.
the number of hashtags in a tweet: clicking a #hashtag on Twitter surfaces other posts with the same hashtag, helping people verify the information against a detailed and clear history.
the means used to share a tweet: divided into two, via a smartphone or a PC client.
is_url: a tweet with a URL helps deliver more information and provides trust by giving the tweet's source; the more URLs in a tweet, the more credible the information.
is_mention: a tweet containing a mention indicates that its source was taken from someone else, providing better source certainty; the mentioned user may supply evidence of the news's authenticity, for example included photos.
is_hashtag: the presence of a #hashtag helps readers check the news history in order to assess the information's credibility.
tweet sentiment: the presence of positive, neutral, or negative sentiment in the tweet; positive sentiment usually describes more credible information.

Features used on Facebook
The Facebook features follow [21]. Table 7 shows the user profile dimension features and Table 8 shows the message content dimension features. They include the number of words describing the user's profile (a detailed bio makes it easier to judge someone's credibility); length_bio, the length in characters and words of the text, which indicates whether the user gives a short or long message and may influence someone's perception; and the number of hashtags in a post message, with the same history-checking role as on Twitter. In addition, this paper also applies a new approach to spam prediction and sentiment prediction, described as follows: a. Spam prediction (check_spam). We use two corpora of spam words or phrases, 200 in English and 100 in Bahasa Indonesia, as in our previous study [23]. Both corpora are developed based on Indonesian spam words. Table 9 lists 12 examples of Bahasa Indonesia spam words or phrases [23]. b. Sentiment prediction. This paper uses a corpus containing a list of 354 sentiment words [24], categorized as negative, positive, or neutral. Some sample data are shown in Table 10 [24].
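Two of the bio-profile features described above, numPosWordDesc and numNegWordDesc, reduce to simple word-ratio computations. The sketch below is illustrative: the tiny POSITIVE/NEGATIVE word sets stand in for the 354-word sentiment corpus [24], and the helper names are ours, not the paper's.

```python
# Stand-ins for the sentiment corpus [24]; illustrative only.
POSITIVE = {"trusted", "official", "expert"}
NEGATIVE = {"fake", "scam"}

def bio_sentiment_ratios(bio):
    """Return (numPosWordDesc, numNegWordDesc): the ratios of positive
    and negative sentiment words to the total words in a bio profile."""
    words = bio.lower().split()
    n = len(words) or 1  # avoid division by zero on an empty bio
    num_pos = sum(w in POSITIVE for w in words) / n
    num_neg = sum(w in NEGATIVE for w in words) / n
    return num_pos, num_neg

def following_follower_ratio(num_following, num_follower):
    """A ratio in the spirit of the NumFollowingNumFollower feature;
    guarded against accounts with zero followers."""
    return num_following / max(num_follower, 1)

pos, neg = bio_sentiment_ratios("Official account of a trusted news expert")
print(pos, neg)  # 3 of the 7 words are positive, none negative
```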

Classification algorithm
The four learning algorithms that will be explored are Naive Bayes (NB), Support Vector Machine (SVM), Logistic Regression (Logit), and J48. As illustrated in Figure 1, the four algorithms are used to model the topic classification of tweets during the training phase. The topic model of tweets is then used to classify the credibility of new information, using the same algorithm as that used to build the model. The following is a description of each algorithm. a. Naive Bayes (NB) Naive Bayes is a classification model that estimates a probability value for each attribute with respect to each class; new data are classified by selecting the class with the maximum probability given the attribute data [25]. Naive Bayes has the advantage of easy construction, requiring no complex parameters, and it is scalable. In addition, this method is known for its simplicity, elegance, robustness, and high accuracy [26]. b. Support Vector Machine (SVM) The idea of SVM for classification is to find the optimal hyperplane (line/boundary) that separates the data into two classes in the n-dimensional feature space. With this concept, the optimal-hyperplane solution in SVM has no local optima, so the solution is unique [25]. SVM can be implemented easily and is a suitable method for high-dimensional problems with limited data samples. c. Logistic Regression (Logit) Logistic Regression is a probabilistic classification model over a real-valued input vector. The dimensions of the input vector are called features, and no restrictions are imposed on correlated features. Logistic Regression is used whenever we need to assign an input to one of several classes. The logistic function is applied to a linear combination of the features.
The output is usually binary, but Logistic Regression can also be applied to multiclass classification problems [25]. d. J48 J48 is a development of the ID3 algorithm and an implementation of the C4.5 algorithm that produces a decision tree. This algorithm can classify data with decision-tree methods, which have the advantage of handling both numerical (continuous) and discrete data, handling missing attribute values, and producing rules that are easy to interpret. Classification can be seen as a mapping from a set of attribute values to a particular class; the decision tree classifies the given data using the attribute values [27].
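A minimal sketch of the four-classifier setup, using scikit-learn as an assumed stand-in for the authors' toolchain: scikit-learn has no J48, so DecisionTreeClassifier substitutes for the C4.5-style tree, and the feature vectors here are toy values, not the paper's features.

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Toy two-feature vectors (e.g., follower count, positive-bio ratio).
X = [[120, 0.8], [5, 0.1], [300, 0.9], [2, 0.05]]
y = ["credible", "not credible", "credible", "not credible"]

models = {
    "NB": GaussianNB(),
    "SVM": SVC(),
    "Logit": LogisticRegression(),
    "J48 (stand-in)": DecisionTreeClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X, y)                     # train on the labeled vectors
    print(name, model.predict([[250, 0.7]])[0])
```

In practice, each model is trained on the selected feature vectors from the training data and then applied to the testing data, as in Figure 1.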

RESULTS AND ANALYSIS
This section provides the results and analysis of the data set and labeling scheme for Twitter and Facebook.

Data set for experiment
The Twitter data, containing Indonesian-language content, are the same as in [28], involving 115 accounts with 19401 tweets. Table 11 provides sample labeling of tweet topics from Law, Politics, and Entertainment [28]. Table 12 shows the distribution of the Twitter data: 19 topics whose distribution is unbalanced, ranging from 0.2% to 15.3% [28]. The Facebook data used in this study consist of 56 accounts with 23489 messages. Because some users have no Facebook account, not all of the 115 Twitter accounts could be retrieved. Table 13 describes the distribution of the Facebook data: again 19 topics, also with an unbalanced distribution, ranging from 0.17% to 18.38%.

Experiment
We consider three objectives in performing the experiments: (i) to compare the proposed technique with previous research on information credibility on Twitter and Facebook, (ii) to evaluate the effect of adding new features on Twitter and Facebook, and (iii) to evaluate the effect of the feature dimensions used on both Twitter and Facebook. Our experiments split the data into training and testing sets with a composition of 80:20.
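The 80:20 evaluation protocol can be sketched as below; the helper name and the use of Python's random module are our illustrative choices, not the authors' code.

```python
import random

def split_80_20(samples, seed):
    """Randomly shuffle the labeled samples and return an 80:20
    (training, testing) split, as used in the experiments."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))          # stand-in for labeled tweets/messages
train, test = split_80_20(data, seed=0)
print(len(train), len(test))     # 80 20
```

Repeating the split with several different seeds and averaging the resulting accuracies yields the averaged per-cell values reported in the tables.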

Twitter social media
In this study, each cell reports the average accuracy over five runs, each with a randomly drawn training/testing split. The results of the proposed method and the previous research are shown in Table 14. Table 14 shows that this paper succeeded in increasing the accuracy of previous studies for almost all classifiers. Compared to previous studies, the highest accuracy is 88.42%, achieved using the J48 classifier, with the lowest increase being 5.93% and the highest 27.17%. Table 14 also compares the accuracy of the user profile dimension and the message content dimension across the 4 classifiers. The user profile dimension accuracy is higher than the message content dimension accuracy for all classifiers; the highest accuracy on the user profile dimension is 82.77%, using the J48 classifier. Merging the features of both dimensions increases accuracy for the SVM and J48 classifiers, while the two other classifiers, NB and Logit, show a decrease in accuracy.
The new features are classified according to their influence on accuracy. Features that increase accuracy when added to the baseline features are classified as the increased group, features that decrease it as the decreased group, and features with no consistent effect as the mixed group. Here, the baseline features are the set used by Ross and Thirunarayan [22]. Table 15 shows the effect of the 17 proposed new features (12 based on the user profile dimension and 5 on the message content dimension) on each classifier for Twitter. All proposed Twitter features, in both dimensions, increase the accuracy of every classifier. Averaged over all classifiers, the new features increase accuracy by 6.60%: 6.67% for user-profile features and 6.45% for message-content features. The largest average gain for a single feature is 8.55%, achieved by the NumFollowingNumFollower feature. In terms of the effect on individual classifiers, the #sentiment_desc feature provides the highest accuracy improvement, +13.41%, achieved on the SVM classifier.
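The increased/decreased/mixed grouping described above amounts to comparing each feature's accuracy, when added to the baseline set, against the baseline accuracy. The sketch below simplifies "no consistent effect" to an exact tie, and the per-feature accuracies (and the tweet_length name) are hypothetical, not the reported results.

```python
def group_features(baseline_acc, accuracy_with_feature):
    """Assign each feature to 'increased', 'decreased', or 'mixed'
    based on accuracy after adding it to the baseline features.
    (Equality stands in for 'no consistent effect' here.)"""
    groups = {"increased": [], "decreased": [], "mixed": []}
    for feature, acc in accuracy_with_feature.items():
        if acc > baseline_acc:
            groups["increased"].append(feature)
        elif acc < baseline_acc:
            groups["decreased"].append(feature)
        else:
            groups["mixed"].append(feature)
    return groups

# Hypothetical per-feature accuracies against a 0.80 baseline.
result = group_features(0.80, {
    "NumFollowingNumFollower": 0.88,
    "sentiment_desc": 0.86,
    "tweet_length": 0.80,
})
print(result["increased"])  # ['NumFollowingNumFollower', 'sentiment_desc']
```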

Facebook social media
This paper has carried out two developments: first, developing a Facebook API client that can retrieve datasets online; second, adding 49 new features based on users and content. Table 16 shows the highest accuracy increase compared to Saikaew's study [21]. This paper succeeded in increasing the accuracy of previous studies for almost all classifiers; the increase is 9.91%, with an accuracy of 78.61%, using the J48 classifier. Table 16 also compares the accuracy of the user profile dimension and the message content dimension across the 4 classifiers. The user profile dimension accuracy is higher than the message content dimension accuracy for all classifiers; the highest accuracy on the user profile dimension is 76.50%, using the SVM classifier. Merging the features of both dimensions increases accuracy only for the J48 classifier, while the three other classifiers show a decrease in accuracy. Table 17 shows the effect of the 49 proposed new features (8 based on the user profile dimension and 41 on the message content dimension) on each classifier for Facebook. Here, the baseline features used for comparison are the set used in Saikaew [21]. All proposed user-profile features increase the accuracy of every classifier, whereas for the message content dimension only 27 features (65.85%) increase accuracy, while the remaining 14 (34.15%) give mixed results. Averaged over all classifiers, the new features increase accuracy by 0.57%: 2.64% for user-profile features and 0.17% for message-content features. The largest average gain for a single feature is 7.26%, achieved by the engagement_count feature. In terms of the effect on individual classifiers, the engagement_count feature also provides the highest accuracy improvement, +11.98%, achieved on the J48 classifier.
The new features added for Twitter and Facebook are found to provide the best accuracy values and to influence the credibility of the information, as shown in Tables 14 and 16. The user profile dimension clearly yields higher accuracy than the message content dimension for all classifiers. Based on these results, it can be concluded that the credibility of information can be assessed from the Twitter users themselves: when searching for information on Twitter, treating the users who provide the content or tweets as the source of information adds credibility and trust. This result confirms that the proposed concept is practical and reliable. Finally, the effects of the two feature dimensions, user profile and message content, on Twitter and Facebook are also found to provide the best accuracy values and to influence the credibility of the information, as shown in Tables 15 and 17. The user profile dimension is clearly more consistent in increasing accuracy than the message content dimension for all classifiers.

CONCLUSION
In this study, a method to measure the credibility of information on social media, i.e., Twitter and Facebook, has been proposed using a labeling process and additional new features. We introduced 17 new features for Twitter and 49 new features for Facebook, and used 4 classification methods, i.e., the NB, SVM, Logit, and J48 algorithms. By adding the new features, we obtained a measurement accuracy of about 88.42% for Twitter and 78.61% for Facebook, better than the previous results for all classifiers. In terms of the two feature dimensions, the user profile dimension accuracy is found to be better than the message content dimension accuracy for all classification conditions. Regarding the effect of the new features on accuracy, all features proposed for Twitter, in both feature dimensions, increase accuracy for all classifiers. In Facebook, all proposed features based on the user profile dimension increase accuracy for all classifiers; however, from the viewpoint of the message content dimension, only 27 features (65.85%) provided an increase in accuracy, while the remaining 14 features (34.15%) provided mixed results. For all conditions, we found that the user profile dimension is more consistent in increasing accuracy than the message content dimension for all classifiers. We expect that these results can contribute to the future development of information credibility on social media.