Predicting depression using deep learning and ensemble algorithms on raw twitter data

ABSTRACT


INTRODUCTION
Social networking sites have become a habitual part of daily life, with Twitter and Facebook ranked as the 7th and 2nd most popular sites and having millions of subscribers [1]. Such sites have become an open platform for people to reach out and express their feelings, likes, routines, etc. Unlike ordinary mood fluctuations, depression is a common mental disorder that severely affects a person's daily life. Anyone who has undergone adverse experiences such as a sudden death or unemployment is liable to it. According to the WHO, more than 300 million people of all age groups suffer from it, with fewer than 10% receiving suitable treatment, owing to reasons such as lack of suitable health care, social stigma, and the absence of a timely and correct diagnosis [2].
Frances A et al. [3] state that depression displays the following symptoms, in the given order: sad mood, avoidance of all activities, weight and sleep fluctuations, bodily agitation, energy loss and tiredness, feelings of worthlessness, loss of decision-making capacity, and finally suicidal tendencies. Emotion and sentiment analysis uses machine learning algorithms to examine a text with respect to the emotion conveyed. Sentence-level analysis is applied in this study to inspect whether a tweet is emotionally vulnerable or not. Recently, the death of a 16-year-old Malaysian teen after she posted a poll on her Instagram [4] (in which many people voted for "Death") gathered huge media attention. Such incidents support the fact that with proper and timely monitoring of such sites [5], help can be provided, hence avoiding such catastrophes. Figure 1 highlights the methodology followed in the paper. The aim of the proposed work is to predict depression in individuals from their online behavior (specifically on Twitter) [17][18][19]. This is done in two main stages. In the first stage, sentiment analysis [20] is applied to a particular individual's Twitter posts to predict binary classes (i.e. depressed/not depressed). The Twitter posts were obtained through the Twitter API from a developer Twitter account. A deep learning module known as long short-term memory (LSTM) [21,22] is employed. The proposed LSTM model used a Kaggle dataset of tweets related to depression for training and validation. The network architecture comprises an embedding layer, whose input dimension is set to the total number of tweets, output dimension to 200, and input length to the sequence length stated earlier. The next part is an LSTM layer with 500 units, dropout of 0.2, and recurrent dropout also of 0.2. The last part of the network is a dense layer with one unit and a sigmoid activation to squash the generated output between zero and one.
This completes the proposed model's architecture. The model is compiled using the Adam optimizer, binary cross-entropy loss, and accuracy as the metric for observation and evaluation.
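The architecture described above can be sketched in Keras (the framework is assumed; `vocab_size` and `seq_len` are illustrative placeholders for the "total number of tweets" and the padded sequence length from preprocessing):

```python
# A minimal sketch of the described LSTM network, assuming Keras
# (TensorFlow backend). vocab_size and seq_len are assumed values,
# not taken from the paper.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

vocab_size, seq_len = 10000, 140  # illustrative placeholders

model = Sequential([
    Input(shape=(seq_len,)),
    # Embedding layer: input dimension per the text, output dimension 200
    Embedding(input_dim=vocab_size, output_dim=200),
    # LSTM layer with 500 units, dropout 0.2, recurrent dropout 0.2
    LSTM(500, dropout=0.2, recurrent_dropout=0.2),
    # Dense layer with one unit; sigmoid keeps the output in (0, 1)
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```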

METHODOLOGY
Post compilation, the model is fitted on the earlier generated features and validated against the labels specified in the dataset. A validation split of 0.3 (70% training data, 30% test data) is used and the model is trained for 5 epochs. Finally, this model is stored in a JSON file for future use. Every new tweet obtained from the Twitter API goes through the same preprocessing procedure mentioned above before being forwarded to the LSTM model. The result is obtained from the model.predict() function and is rounded to an integer value (0 or 1 in this case).
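The final rounding step can be illustrated as follows; `probs` is a hypothetical stand-in for the sigmoid outputs of `model.predict()`:

```python
# Sketch of the inference step: round the sigmoid probabilities returned
# by model.predict() to 0/1 labels. probs is a made-up placeholder.
import numpy as np

probs = np.array([[0.12], [0.87], [0.62]])   # hypothetical model outputs
labels = np.rint(probs).astype(int).ravel()  # 0 = not depressed, 1 = depressed
```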
The obtained accuracy is compared with that of a sequential convolutional neural network (CNN) [24,25]. The preprocessing steps are exactly the same. The network architecture includes an embedding layer similar to the LSTM embedding; the only difference is that its weights argument is given an embedding matrix of shape (total number of tweets, 200) filled with random values multiplied by 0.01. This is followed by a dropout of 0.4, which feeds the data to four 1D convolution layers. Each convolution layer has a kernel size of 3, valid padding, ReLU activation, and a stride of 1. The only thing that differs between them is the filter count (the dimensionality of the output space), which is halved at each layer: the first layer has 600 filters, the second 300, the third 150, and the fourth 75.
After this, a flatten layer is included in the model architecture to flatten the input, followed by a dense layer with 600 units and ReLU activation, a dropout of 0.5, a dense layer with one unit, and finally a sigmoid activation. The CNN model is compiled with exactly the same loss, metric, and optimizer. The second stage tries to improve the outcome of the proposed work using basic machine learning classifiers [26,27] and a few optimized ensembles. The classifiers used are logistic regression, linear support vector classifier (SVC), multinomial naive Bayes, and Bernoulli naive Bayes, along with ensembles such as the random forest classifier and the gradient boosting classifier.
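A sketch of the comparison CNN in Keras follows; the framework, `vocab_size`, and `seq_len` are assumptions, and the random embedding matrix is injected after construction to stand in for the weights argument described above:

```python
# A sketch of the comparison CNN, assuming Keras. Shapes and the random
# embedding matrix (scaled by 0.01, per the description) are illustrative.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Input, Embedding, Dropout, Conv1D,
                                     Flatten, Dense)

vocab_size, seq_len = 10000, 140  # assumed values
embedding_matrix = np.random.rand(vocab_size, 200) * 0.01

model = Sequential([
    Input(shape=(seq_len,)),
    Embedding(vocab_size, 200),
    Dropout(0.4),
    # four Conv1D layers; filters halve at each layer: 600, 300, 150, 75
    Conv1D(600, kernel_size=3, padding="valid", activation="relu", strides=1),
    Conv1D(300, kernel_size=3, padding="valid", activation="relu", strides=1),
    Conv1D(150, kernel_size=3, padding="valid", activation="relu", strides=1),
    Conv1D(75, kernel_size=3, padding="valid", activation="relu", strides=1),
    Flatten(),
    Dense(600, activation="relu"),
    Dropout(0.5),
    Dense(1, activation="sigmoid"),
])
# inject the random embedding matrix as the layer's initial weights
model.layers[0].set_weights([embedding_matrix])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```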
Preprocessing, as depicted in Figure 3, includes splitting the data into train and test sets, followed by vectorizing the tweets. This stage provides the classifiers with three different kinds of vectorizers, namely count vectorizer, TF-IDF, and n-grams [28]. For the count vectorizer, preprocessing includes fitting the vectorizer on the training data and then transforming the documents in the training data into a document-term matrix. The obtained training features are fed to the current model selected from the list. A feature-names list is initialized and populated using the count vectorizer's get_feature_names() function. Results are predicted using the predict function. The TF-IDF vectorizer is fit on the training data with min_df=5; as with the count vectorizer, the documents are transformed and the training data is fed to the current model selected from the list. Along with this, the feature names and the sorted TF-IDF indices are also computed to find the smallest and largest TF-IDF values (the least and most important coefficients). For n-grams, a similar procedure is followed, with the vectorizer fit on the training data using min_df=5 and ngram_range=(1, 2). As with the other vectorizers, the smallest and largest coefficients are noted and the resulting sentiment of the tweet is predicted.
Each classifier and ensemble mentioned above is fit on each of the three vectorizers and the results are noted. This stage is largely focused on cross-checking the predictions made by the first stage. Every new tweet obtained through the Twitter API is sent through both stages. In the second stage, every combination of model and vectorizer is executed and the results are noted. The final result is the weighted mean over all second-stage combinations, with weights assigned according to the accuracy of each model on the data. This value is cross-checked against the first-stage prediction.
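The weighted-mean cross-check can be sketched as below; the 0/1 predictions and the accuracy weights are made-up placeholders, not results from this work:

```python
# Illustrative weighted vote over the second-stage (model, vectorizer)
# combinations; all numbers are invented placeholders.
preds = [1, 1, 0, 1, 1, 0]                   # one 0/1 prediction per combination
accs = [0.90, 0.85, 0.70, 0.88, 0.92, 0.65]  # each model's accuracy = its weight

weighted_mean = sum(p * a for p, a in zip(preds, accs)) / sum(accs)
final = int(round(weighted_mean))  # cross-checked against the stage-one label
```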

RESULT
Tables 1-7 show the results obtained in this research; the mentioned classifiers' results are compared there.

CONCLUSION AND FUTURE WORK
Traditional survey-based questionnaires often fail to uncover the extent of users' mental health deterioration. Nowadays, social media is a common platform which people use to reach out. The proposed methodology therefore capitalizes on this popularity of SNS and evaluates the depression levels of users by employing natural language processing and machine learning techniques. In future, more data will be collected, more frequently, with the aim of improving the accuracy of the work and giving a better diagnosis. Tools that can predict variations in a person's mood can be an important aid for both clinical observation and self-diagnosis. The method can be time consuming, and steps must be taken to improve upon this.