Reddit social media text analysis for depression prediction: using logistic regression with enhanced term frequency-inverse document frequency features

Madan Mohan Tito Ayyalasomayajula, Akshay Agarwal, Shahnawaz Khan

Abstract


Language provides significant insights into an individual’s emotional state, social status, and personality traits. This research aims to enhance depression detection through the analysis of linguistic features and various dataset attributes. The dataset, sourced from the social networking platform Reddit, comprises posts and comments from individuals diagnosed with depression. Logistic regression with term frequency-inverse document frequency (TF-IDF) is employed as the primary model for text classification. To improve model performance, a novel feature—the average time interval between consecutive posts or comments—is introduced, contributing to a marginal but noteworthy improvement in accuracy. The proposed model demonstrates superior F1 scores compared to other models applied to the same dataset. Given the increasing recognition of mental health’s significance, accurately diagnosing mental disorders is of paramount importance. This study underscores the potential of leveraging linguistic analysis and advanced machine learning techniques to identify depressive symptoms, thereby contributing to more effective mental health diagnostics and interventions.
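The abstract does not say how the average time interval between consecutive posts or comments is computed, so the following is only a minimal sketch under stated assumptions. The function name `average_post_interval`, the use of seconds as the unit, and the fallback of 0.0 for users with fewer than two posts are all hypothetical choices made for illustration, and the timestamps are made up rather than taken from the Reddit dataset.

```python
from datetime import datetime
from statistics import mean


def average_post_interval(timestamps):
    """Return the mean gap, in seconds, between consecutive posts or comments.

    `timestamps` is an iterable of datetime objects for a single user.
    Returns 0.0 when there are fewer than two timestamps (assumed convention,
    not stated in the paper).
    """
    ordered = sorted(timestamps)  # posts may arrive out of chronological order
    if len(ordered) < 2:
        return 0.0
    gaps = [
        (later - earlier).total_seconds()
        for earlier, later in zip(ordered, ordered[1:])
    ]
    return mean(gaps)


# Hypothetical activity for one user (illustrative values only).
user_posts = [
    datetime(2023, 1, 1, 12, 0),
    datetime(2023, 1, 1, 18, 0),
    datetime(2023, 1, 2, 0, 0),
]
avg_gap_hours = average_post_interval(user_posts) / 3600
```

In a pipeline like the one described, a per-user value of this kind could be appended as one extra column alongside the TF-IDF features before fitting the logistic regression. How the authors scale or combine it is not specified here.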

Keywords


Language; Language analysis; Machine learning; Mental health; Reddit; Term frequency-depression; Term frequency-inverse document frequency

DOI: http://doi.org/10.11591/ijece.v14i5.pp5998-6005

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

International Journal of Electrical and Computer Engineering (IJECE)
p-ISSN 2088-8708, e-ISSN 2722-2578

This journal is published by the Institute of Advanced Engineering and Science (IAES) in collaboration with Intelektual Pustaka Media Utama (IPMU).