Exploring topic modelling: a comparative analysis of traditional and transformer-based approaches with emphasis on coherence and diversity

Ayesha Riaz, Omar Abdulkader, Muhammad Jawad Ikram, Sadaqat Jan

Abstract


Topic modeling (TM) is an unsupervised technique for uncovering hidden or abstract topics in large corpora by extracting meaningful patterns of words (semantics). This paper explores TM within data mining (DM), focusing on the challenges and advances in extracting insights from datasets, especially those drawn from social media platforms (SMPs). Traditional techniques such as latent Dirichlet allocation (LDA) are examined alongside newer methodologies such as bidirectional encoder representations from transformers (BERT), generative pre-trained transformers (GPT), and XLNet. The paper highlights the limitations of LDA that have prompted the adoption of embedding-based models such as BERT and GPT, which are rooted in the transformer architecture and offer stronger context awareness and semantic understanding. It emphasizes leveraging pre-trained transformer-based language models to generate document embeddings, thereby refining TM and improving accuracy. Notably, integrating BERT with XLNet-generated summaries emerges as a promising approach. By synthesizing these findings, the paper aims to guide researchers in optimizing TM techniques and, potentially, in reshaping how insights are extracted from textual data.
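For readers who want a concrete handle on the two evaluation criteria named in the title, the following minimal Python sketch (illustrative only, not the authors' pipeline) trains a small gensim LDA model and scores it with c_v topic coherence and top-word diversity; the toy corpus, the choice of two topics, and the top-5 word cutoff are assumptions made here for brevity.

```python
# Illustrative sketch (not the paper's implementation): train a tiny LDA
# model with gensim and compute the two metrics the title emphasizes,
# topic coherence (c_v) and topic diversity.
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

# Toy corpus, assumed here purely for demonstration.
docs = [
    "social media platforms generate large noisy text corpora".split(),
    "topic modeling uncovers hidden topics in document collections".split(),
    "transformer embeddings capture context and word semantics".split(),
    "latent dirichlet allocation assigns words to latent topics".split(),
]

dictionary = Dictionary(docs)
bow = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(bow, num_topics=2, id2word=dictionary, random_state=0)

# Coherence (c_v): do a topic's top words tend to co-occur in the corpus?
coherence = CoherenceModel(
    model=lda, texts=docs, dictionary=dictionary, coherence="c_v"
).get_coherence()

# Diversity: fraction of unique words across all topics' top-k word lists;
# 1.0 means no topic repeats another topic's top words.
k = 5
top_words = [word
             for topic_id in range(lda.num_topics)
             for word, _ in lda.show_topic(topic_id, topn=k)]
diversity = len(set(top_words)) / len(top_words)

print(f"coherence (c_v): {coherence:.3f}  diversity: {diversity:.3f}")
```

An embedding-based counterpart would replace the bag-of-words LDA step with document embeddings from a pre-trained transformer clustered into topics, while the coherence and diversity computations stay unchanged; this is what makes the two families of models directly comparable.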

Keywords


Bidirectional encoder representations from transformers; Generative pre-trained transformers; Latent Dirichlet allocation; Social media platforms; XLNet


DOI: http://doi.org/10.11591/ijece.v15i2.pp1933-1948

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

International Journal of Electrical and Computer Engineering (IJECE)
p-ISSN 2088-8708, e-ISSN 2722-2578

This journal is published by the Institute of Advanced Engineering and Science (IAES) in collaboration with Intelektual Pustaka Media Utama (IPMU).