Scientific Documents clustering based on Text Summarization

Pedram Vahdani Amoli, Omid Sojoodi Sh.

Abstract


This paper proposes a novel method for clustering scientific documents. The method is a summarization-based hybrid algorithm that begins with a preprocessing phase, in which unimportant words that occur frequently in the text are removed. This reduces the amount of data to be clustered. It also improves cluster separation, because frequent items cause overlap between clusters. After preprocessing, Term Frequency/Inverse Document Frequency (TFIDF) is calculated for all words and stems over the document to score them. Text summarization is then performed at the sentence level. Finally, documents are clustered according to the calculated TFIDF scores. The hybrid pipeline, from preprocessing to document clustering, yields a fast and efficient clustering method, which is evaluated on 400 English texts covering 11 different topics, extracted from scientific databases. The proposed method is compared with the CSSA, SMTC and Max-Capture methods. The results demonstrate the proficiency of the proposed scheme in terms of computation time and of efficiency as measured by the F-measure criterion.
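The stages named in the abstract (stop-word removal, TFIDF scoring, sentence-level summarization, clustering) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the stop-word list, the sentence-scoring rule, or the clustering algorithm, so everything below is an assumption. The sketch uses a tiny hand-picked stop-word list, skips stemming, scores a sentence by the sum of its terms' TFIDF weights, clusters the summaries rather than the full texts, and substitutes a simple greedy cosine-similarity threshold clustering. All function names and the `threshold` parameter are hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list; the paper's list is not given.
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "are", "and", "to", "for", "on", "with"}


def tokenize(text):
    """Preprocessing: lowercase, split into words, drop stop words (no stemming here)."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def tfidf_scores(docs_tokens):
    """Return one {term: tf * idf} dictionary per tokenized document."""
    n_docs = len(docs_tokens)
    doc_freq = Counter(term for tokens in docs_tokens for term in set(tokens))
    scores = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        total = len(tokens) or 1
        scores.append({t: (c / total) * math.log(n_docs / doc_freq[t])
                       for t, c in counts.items()})
    return scores


def summarize(text, term_scores, n_sentences=1):
    """Keep the n sentences with the highest summed TFIDF, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: sum(term_scores.get(t, 0.0) for t in tokenize(sentences[i])),
        reverse=True,
    )
    return " ".join(sentences[i] for i in sorted(ranked[:n_sentences]))


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster(vectors, threshold=0.1):
    """Greedy single pass (an assumed stand-in for the paper's clustering step):
    a document joins the first cluster whose seed document is similar enough."""
    clusters = []  # each cluster is a list of document indices; index 0 is the seed
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def cluster_documents(docs, n_sentences=1, threshold=0.1):
    """Preprocess, score, summarize, then cluster the summaries by TFIDF."""
    scores = tfidf_scores([tokenize(d) for d in docs])
    summaries = [summarize(d, s, n_sentences) for d, s in zip(docs, scores)]
    summary_vectors = tfidf_scores([tokenize(s) for s in summaries])
    return cluster(summary_vectors, threshold)
```

Calling `cluster_documents` on a list of document strings returns a list of clusters, each a list of document indices. Clustering the short summaries instead of the full texts is what would reduce the data volume in a pipeline of this shape; whether the paper clusters summaries or full documents is not stated in the abstract.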


Keywords


Clustering; Summarization; Data Mining; Scoring


DOI: http://doi.org/10.11591/ijece.v5i4.pp782-787

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

International Journal of Electrical and Computer Engineering (IJECE)
p-ISSN 2088-8708, e-ISSN 2722-2578

This journal is published by the Institute of Advanced Engineering and Science (IAES) in collaboration with Intelektual Pustaka Media Utama (IPMU).