K-means variations analysis for translation of English Tafseer Al-Quran text

Mohammed A. Ahmed, Hanif Baharin, Puteri Nor Ellyza Nohuddin


Text mining is a powerful modern technique for extracting interesting information from huge datasets. Text clustering groups documents that share the same themes or topics. Because the datasets have no ground truth, clustering (unsupervised learning) must be used rather than alternatives such as classification (supervised learning). The “no free lunch” (NFL) theorem holds that no single algorithm outperforms all others across a variety of conditions (for example, across several datasets). This study analyzes three variations of the k-means clustering algorithm (k-means, mini-batch k-means, and k-medoids) at the clustering stage. Six datasets drawn from the English translation (text) of chapter Al-Baqarah, which has 286 verses, were used at the preprocessing stage. Feature selection used term frequency–inverse document frequency (TF-IDF) to obtain the term weights. At the final stage, five internal cluster validation metrics were applied: silhouette coefficient (SC), Calinski-Harabasz index (CHI), C-index (CI), Dunn’s index (DI), and Davies-Bouldin index (DBI), together with execution time (ET). The experiments showed that k-medoids outperformed the other two algorithms in terms of ET only. In contrast, no algorithm was superior to the others in terms of clustering quality across the six datasets, which supports the NFL theorem.
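The pipeline summarized in the abstract (TF-IDF weighting, clustering, internal validation) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes whitespace tokenization, one common smoothed IDF variant, plain Lloyd's k-means with hand-picked starting points, and only one of the five metrics (the silhouette coefficient). Mini-batch k-means and k-medoids are not shown. The four toy sentences and all function names are invented for illustration and do not come from the Tafseer datasets.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) where rows are L2-normalized TF-IDF vectors."""
    tokenized = [doc.lower().split()for doc in docs]  # assumption: whitespace tokens
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    # Assumption: smoothed IDF, ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        rows.append([v / norm for v in vec])
    return vocab, rows


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, init_indices, max_iter=100):
    """Plain Lloyd's k-means; initial centroids are copies of the chosen points."""
    centroids = [list(points[i]) for i in init_indices]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; needs >= 2 clusters, singletons score 0."""
    clusters = set(labels)
    scores = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within own cluster
        b = min(  # mean distance to the nearest other cluster
            sum(dist(p, q) for q, l in zip(points, labels) if l == c)
            / labels.count(c)
            for c in clusters if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy documents, invented for illustration only.
docs = [
    "rain falls on the green hills",
    "green hills get rain in spring",
    "traders sell grain at the market",
    "the market traders buy grain",
]
vocab, rows = tfidf(docs)
labels = kmeans(rows, init_indices=[0, 2])
score = silhouette(rows, labels)
print(labels, round(score, 3))
```

A silhouette value closer to 1 indicates compact, well-separated clusters. Comparing such scores across algorithms and datasets is the kind of evaluation the abstract describes, although the study itself reports five metrics plus execution time.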


Chapter Al-Baqarah; internal cluster validation metrics; k-means variations; no free lunch theorem; Tafseer datasets; text mining


DOI: http://doi.org/10.11591/ijece.v13i3.pp3255-3265

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

International Journal of Electrical and Computer Engineering (IJECE)
p-ISSN 2088-8708, e-ISSN 2722-2578