K-means variations analysis for translation of English Tafseer Al-Quran text
Abstract
Text mining is a powerful modern technique used to obtain interesting information from huge datasets. Text clustering groups documents that share the same themes or topics. Because the datasets have no ground truth, clustering (unsupervised learning) must be used rather than alternatives such as classification (supervised learning). The “no free lunch” (NFL) theorem holds that no algorithm outperforms all others across a variety of conditions (here, several datasets). This study analyzes three variations of the k-means clustering algorithm (k-means, mini-batch k-means, and k-medoids) at the clustering stage. Six datasets, drawn from the English translation text of chapter Al-Baqarah (286 verses), were used at the preprocessing stage. Feature selection used term frequency–inverse document frequency (TF-IDF) to weight the terms. At the final stage, five internal cluster validation metrics were implemented: silhouette coefficient (SC), Calinski-Harabasz index (CHI), C-index (CI), Dunn’s index (DI), and Davies-Bouldin index (DBI), together with execution time (ET). The experiments showed that k-medoids outperformed the other two algorithms only in terms of ET. In terms of clustering performance, by contrast, no algorithm was superior to the others across the six datasets, which is consistent with the NFL theorem's assumption.
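The pipeline described above (TF-IDF weighting, clustering, internal validation, timing) can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation. The toy sentences, the smoothed IDF formula, the function names, and the choice of k = 2 are all assumptions made for illustration. Only k-means, k-medoids, and the silhouette coefficient are shown, and mini-batch k-means and the other four metrics are omitted for brevity.

```python
import math
import random
import time
from collections import Counter

def tfidf(docs):
    """Return L2-normalised TF-IDF vectors (dense lists), one per document."""
    tokenised = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokenised for t in doc})
    n = len(docs)
    df = Counter(t for doc in tokenised for t in set(doc))
    # Smoothed IDF (an assumption; the paper does not state its exact formula).
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for doc in tokenised:
        tf = Counter(doc)
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vectors

def dist(a, b):
    """Euclidean distance between two dense vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            for p in points]

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: centers are the means of their members."""
    centers = random.Random(seed).sample(points, k)
    for _ in range(iters):
        labels = assign(points, centers)
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign(points, centers)

def kmedoids(points, k, iters=20, seed=0):
    """Alternating k-medoids: centers are restricted to actual data points."""
    medoids = random.Random(seed).sample(range(len(points)), k)
    for _ in range(iters):
        labels = assign(points, [points[m] for m in medoids])
        new = []
        for j in range(k):
            members = [i for i, l in enumerate(labels) if l == j]
            if not members:
                new.append(medoids[j])
                continue
            # New medoid: the member with the smallest total distance to the rest.
            new.append(min(members, key=lambda i: sum(dist(points[i], points[m])
                                                      for m in members)))
        if new == medoids:
            break
        medoids = new
    return assign(points, [points[m] for m in medoids])

def silhouette(points, labels):
    """Mean silhouette coefficient; needs at least two non-empty clusters."""
    scores = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(own) / len(own)
        b = min(sum(dist(p, q) for q, l in zip(points, labels) if l == c)
                / labels.count(c)
                for c in set(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Invented placeholder sentences, not verses from any of the six datasets.
docs = [
    "rain and wind bring cold weather",
    "cold wind and heavy rain today",
    "weather turns cold with rain",
    "the market sells fresh fruit and bread",
    "fresh bread and fruit at the market",
    "buy fruit and bread from the market",
]
vectors = tfidf(docs)

results = {}
for name, algorithm in (("k-means", kmeans), ("k-medoids", kmedoids)):
    start = time.perf_counter()
    labels = algorithm(vectors, k=2)
    results[name] = (labels, time.perf_counter() - start)
    print(name, labels, round(silhouette(vectors, labels), 3))
```

On a corpus this small the timings and scores say nothing about the paper's findings. The sketch only shows where each stage of the workflow sits and how an internal metric can be computed without ground-truth labels.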
Keywords
chapter Al-Baqarah; internal cluster validation metrics; k-means variations; no free lunch theorem; Tafseer datasets; text mining
DOI: http://doi.org/10.11591/ijece.v13i3.pp3255-3265
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
International Journal of Electrical and Computer Engineering (IJECE)
p-ISSN 2088-8708, e-ISSN 2722-2578
This journal is published by the Institute of Advanced Engineering and Science (IAES) in collaboration with Intelektual Pustaka Media Utama (IPMU).