Text documents clustering using data mining techniques

ABSTRACT


INTRODUCTION
Web document clustering is a suitable technique for collecting documents with similar content from a set of documents spread across web pages [1][2][3]. Document clustering is one of the useful and efficient techniques for finding and understanding documents [4], because it groups similar documents in one place. Accordingly, web documents can be classified according to a collection of topics for each category. These topics focus on word tokens that appear during document analysis. Word tokens also reflect the repetition of terms in documents, and extracting terms from textual data helps in the classification of documents [5]. Consequently, documents are classified by a cluster of terms into a set of categories, based on the number of occurrences of each word token for a specific topic in those documents [6].
ISSN: 2088-8708, Text documents clustering using data mining techniques (Ahmed Adeeb Jalal)
The classification of documents is useful for researchers who perform interdisciplinary research on various topics [7], and document clustering is an important pillar in achieving this objective [8,9]. Clustering helps the user gather all relevant documents in one category, so the search can be limited to the important documents of the user's choice. Conversely, finding meaningful documents through a normal search process is a challenging and time-consuming problem, especially in view of the steady increase in the number of documents and the diversity of the major sources of documents available to the user over the internet, such as research papers, web pages, archives, technical reports, and digital repositories. Nowadays, a large number of people use the internet as their main source of information. Consequently, users need to find the most relevant information for their queries easily and conveniently [10][11][12]. However, a search engine often retrieves many irrelevant pages based on the few keywords of a user's query, resulting in long lists of URLs. Searching web pages to discover knowledge according to a user query is therefore not an easy task, considering the problem of information overload facing internet data warehouses. Web data mining can be an important technology for discovering and retrieving useful information and knowledge [13,14]. Web data mining is a subdiscipline of data mining that discovers patterns mainly in internet data. It can be categorized into three types: web structure mining, web content mining, and web usage mining [15]. All these types use a diversity of approaches, techniques, tools, and algorithms to discover patterns of information [14].
Accordingly, improving search engines with data mining techniques aims to discover useful information from large amounts of data [16,17].
Over the past decades, institutions, universities, and journals have published numerous research papers in various scientific fields. Ordinarily, research papers are not classified or clustered into categories. Consequently, many document clustering approaches [8] and recommender systems [18] have been proposed for classifying research papers based on the content characteristics or attributes of the documents. These techniques differ in many respects, such as the types of attributes used to characterize the documents, the similarity measure used, and the representation of the clusters. Related work on research paper classification and its applications is reviewed as follows.
Thushara et al. [19] proposed a document-centered system for classifying research articles published in the domains of computer science. It is based on automatic keyword extraction from research articles using the rapid automatic keyword extraction (RAKE) algorithm to obtain the best score matrix of keywords. Moreover, the proposed system adopts a hybrid approach to the classification process by applying different methods at various phases of the system. The classification process relies on semantic analysis, using the score matrix of keywords and cosine similarity to classify articles into the relevant domain. Consequently, domain classification facilitates the identification and retrieval of important articles that are in line with the researchers' actual fields of interest.
Kim and Gil [20] proposed a paper classification system that consists of four major processes: crawling, TF-IDF, topic modeling and data management, and classification. This system aims to cluster research papers into meaningful categories that contain similar topics. Accordingly, the system creates a dictionary of keywords from the crawled abstract and keyword data. These keywords consist of the top-N most frequent keywords among all keywords. The system also extracts topics from the abstract of each paper using the latent Dirichlet allocation (LDA) scheme. Finally, research papers are classified into similar subjects using the K-means clustering algorithm, which operates on the term frequency-inverse document frequency (TF-IDF) values of each paper.
Nahar et al. [21] presented an approach for classifying and clustering research papers based on concepts and contents. This clustering process uses the title, keywords, and abstract of the paper. The proposed approach mainly depends on information retrieval (IR) as the core process, along with some natural language processing (NLP) techniques, latent Dirichlet allocation (LDA), and latent semantic indexing (LSI). Moreover, it aims to improve the LDA model used for classification through topic modeling, and the LSI model used for querying. Consequently, the presented approach provides an automatic, fast, and accurate solution for classifying research papers published in the field of information technology.
Saad et al. [22] presented emotion classification for Malay folklore from children's short stories using four types of common emotions: happy, angry, fearful, and sad. This work is based on term frequency-inverse document frequency (TF-IDF) features extracted from the text stories. The stories are then classified by a support vector machine (SVM) and a decision tree (DT). This work aims to add emotions for more natural storytelling.
In addition, there are various other approaches for classifying documents by applying different techniques, such as text mining based on natural language processing [23,24], building a semantic representation of articles from their associated entities [25,26], and using N-grams with an efficient similarity measure known as the improved sqrt-cosine similarity measure [27]. As the examples above show, document clustering and classification are important for satisfying users and facilitating the retrieval of relevant documents.
This paper aims to classify and cluster research papers into categories in order to overcome these difficulties for search users. Moreover, clustering provides better coverage while avoiding complexity, not only for research papers but for various other domains as well [28][29][30]. Thus, the proposed approach to text document clustering helps users find useful information, addresses the lack of understandability, and improves searchability. Consequently, we propose a research paper classification system based on term frequency (TF), term frequency-inverse document frequency (TF-IDF), and cosine similarity to guide users according to their needs in the domain of research papers.
The second section explains the methodology and describes the proposed methods for text document clustering, such as web mining, data extraction, TF-IDF, and cosine similarity. These techniques contribute to the analysis of scientific papers by extracting data from them in order to classify the papers into groups organized according to similarity. The third section presents the results of the proposed classification approach and the algorithms used to implement it. Finally, this research outlines the challenges of research paper classification and aims to provide better clustering of research papers.

RESEARCH METHOD
In this paper, a classification approach for clustering research papers is presented, because researchers spend a lot of time identifying the relevant cluster for the papers at hand. Ordinarily, papers are classified into clusters based on their concepts and contents. Accordingly, our approach provides a clustering process that depends on three major parts of a research paper: title, abstract, and keywords. The abstract was chosen because it is one of the important parts of the paper that describes its essence after the title [31,32], and it is often the next part that users tend to read. Moreover, the abstract is rich in interesting and fundamental words/terms that express the direction of the paper and summarize its other contents.
The data set contains 518 papers published in the Bulletin of Electrical Engineering and Informatics (BEEI) journal from 2012 to 2019. These scientific papers are written in English and cover different topic scopes. The BEEI journal is issued by the Institute of Advanced Engineering and Science (IAES) of Ahmad Dahlan University. Our goal is to classify these papers into five clusters according to the scopes of the journal; for example, Cluster 1 covers Computer Science, Computer Engineering, and Informatics.
Ordinarily, research papers are classified and retrieved according to the user's query, by semantic representation, or by many other methods, as mentioned in the review of related work in the first section. In our approach, we apply a basic crawler algorithm [15] to extract the contents of the topics for each cluster separately, as well as the title, abstract, and keywords of all papers. Subsequently, we classify papers based on word tokens extracted from the topics of the five clusters covered by the BEEI journal. The classification techniques include TF-IDF and cosine similarity. Figure 1 shows the general steps of the flow diagram for the techniques used in the proposed classification approach.

Text preprocessing
Text preprocessing is one of the major components of many text mining algorithms. It usually consists of tasks such as tokenization, filtering, lemmatization, and stemming [33]. Ordinarily, clustering algorithms require specifying the type of attributes (e.g. words, terms, or phrases) to extract from the documents, and this choice underpins the performance of the clustering algorithm.
As shown in the text preprocessing step of Figure 1, the approach automatically extracts lists of word tokens using the text preprocessing tasks. Tokenization breaks the character sequence of the crawled topics into pieces (words/terms) called tokens. Filtering performs further processing on the word token lists to remove stop words and similar words, which reduces the index size and increases the accuracy of the results. Moreover, the morphological analysis of words must be taken into consideration so that the various related forms of a word are grouped together and analyzed as one item; lemmatization is preferred for this in practice. Stemming aims to obtain the stem (root) of derived words and is language dependent. Consequently, we obtain five lists of word tokens from the cluster topics, one for each cluster.
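The preprocessing tasks described above can be sketched in a few lines of Python. This is a minimal illustration only, not the implementation used in this work: the stop-word list is a tiny hypothetical sample, and the naive suffix stripping in `strip_suffix` is an assumed stand-in for a real stemmer or lemmatizer.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "for", "on", "to", "is", "are", "with"}

# Naive suffix stripping, a stand-in for a real stemmer or lemmatizer.
SUFFIXES = ("ing", "ers", "er", "es", "s")


def tokenize(text):
    """Break a character sequence into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def strip_suffix(token):
    """Remove the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, stem, and remove repeated stems in order."""
    seen = set()
    result = []
    for token in tokenize(text):
        if token in STOP_WORDS:
            continue
        stem = strip_suffix(token)
        if stem not in seen:
            seen.add(stem)
            result.append(stem)
    return result


# Hypothetical topic text for one cluster.
topics = "Computer architecture, programming, computer security, and computing"
tokens = preprocess(topics)
```

Here "computer" and "computing" collapse to the same stem, so the list for this cluster keeps a single entry for both, which is the grouping effect that lemmatization and stemming are meant to provide.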

Term frequency-inverse document frequency (TF-IDF)
TF-IDF is a numerical statistic that is used as a weighting factor in the field of information retrieval. The TF-IDF weighting indicates how important a word is based on its appearance in the content of documents. Consequently, TF-IDF is used to extract word tokens from documents, calculate degrees of similarity among documents, determine ranking by importance, etc. In our approach, we calculate TF, IDF, and TF-IDF for each word token in the lists, for both clusters and documents.
The term frequency (TF) counts how often a specific word appears in the content of a document, and is calculated as in (1). Words with a high TF value are more important in the document.

$$\mathrm{TF}(t, d) = f_{t,d} \tag{1}$$

where $f_{t,d}$ denotes the frequency of word/term $t$ in document $d$.
On the other hand, the inverse document frequency (IDF) measures the rarity and importance of a word/term across all documents, and is calculated as in (2). Words with a high IDF value are considered rare across all documents.

$$\mathrm{IDF}(t) = \log\left(\frac{N}{n_t}\right) \tag{2}$$

where $\log(N / n_t)$ is a logarithmic scaling of the total number of documents $N$ divided by the number of documents $n_t$ in which the word/term $t$ appears.
Consequently, the term frequency-inverse document frequency (TF-IDF) weighting is calculated as in (3). The TF-IDF weight increases when a specific word/term has a high frequency in a document and the number of documents in which the word/term appears is low.

$$\mathrm{TF\text{-}IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t) \tag{3}$$
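As a rough illustration of (1) to (3), the following Python sketch computes the three quantities for a small, made-up corpus. It assumes raw counts for TF and the natural logarithm for IDF; the base of the logarithm is not stated in the text, and the token lists are hypothetical.

```python
import math
from collections import Counter


def term_frequency(tokens):
    """TF: raw count of each token in one document, as in (1)."""
    return Counter(tokens)


def inverse_document_frequency(documents):
    """IDF: log(total documents / documents containing the token), as in (2)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each token once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tf_idf(documents):
    """TF-IDF weight of every token in every document, as in (3)."""
    idf = inverse_document_frequency(documents)
    weights = []
    for tokens in documents:
        tf = term_frequency(tokens)
        weights.append({term: count * idf[term] for term, count in tf.items()})
    return weights


# Hypothetical, already preprocessed token lists.
docs = [
    ["antenna", "wave", "antenna"],
    ["robot", "control", "wave"],
    ["robot", "security"],
]
weights = tf_idf(docs)
```

In this toy corpus "antenna" appears twice in the first document and nowhere else, so it receives the largest weight there, while "wave" is shared by two of the three documents and is weighted lower.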

Cosine similarity
Cosine similarity is a powerful and widely used similarity measure that quantifies the similarity between two vectors based on the cosine of the angle between them, as in (4). It is widely used for document clustering in the field of data mining. Ordinarily, cosine similarity measures the similarity between a user query and the retrieved documents based on the terms extracted from the user query. In our approach, however, we measure the similarity between the content of clusters and documents based on the word token lists, as shown in the classification step of Figure 1.

$$\cos(C, D) = \frac{C \cdot D}{\lVert C \rVert \, \lVert D \rVert} = \frac{\sum_{i=1}^{n} C_i D_i}{\sqrt{\sum_{i=1}^{n} C_i^2} \, \sqrt{\sum_{i=1}^{n} D_i^2}} \tag{4}$$

where $C$ and $D$ denote the cluster and document vectors, respectively. Documents with a higher similarity are more relevant to the cluster.
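The measure in (4) and its use for assigning a document to a cluster might be sketched as follows. The sparse dictionary representation, the `best_cluster` helper, and the example weights are assumptions made for illustration; the text does not specify how vectors are stored or how ties and zero similarities are handled.

```python
import math


def cosine_similarity(c, d):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * d.get(term, 0.0) for term, weight in c.items())
    norm_c = math.sqrt(sum(w * w for w in c.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_c == 0.0 or norm_d == 0.0:
        return 0.0  # an empty vector has no direction; treat as no similarity
    return dot / (norm_c * norm_d)


def best_cluster(clusters, document):
    """Return (name, score) of the most similar cluster, or (None, 0.0)."""
    best_name, best_score = None, 0.0
    for name, vector in clusters.items():
        score = cosine_similarity(vector, document)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


# Hypothetical weight vectors for two clusters and one paper.
clusters = {
    "cluster_1": {"comput": 1.0, "security": 1.0},
    "cluster_2": {"antenna": 1.0, "wave": 1.0},
}
paper = {"antenna": 2.0, "wave": 1.0, "security": 0.5}
name, score = best_cluster(clusters, paper)
```

In this sketch a paper that shares no token with any cluster gets `None` instead of a forced label, which is one possible way to leave such papers unclassified.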

RESULTS AND DISCUSSIONS
The proposed research paper classification system is based on web data mining techniques to manage and process research paper data. In this section, we describe the data set collected and the steps taken while running the experiments, and discuss the results through to the evaluation. We collected 518 research papers published in the BEEI journal in various subjects and scopes for use in the experiments. The papers are related to the fields of computer science, computer engineering, informatics, electronics, electrical, power engineering, telecommunication, information technology, instrumentation, and control engineering. Each of these scopes contains several topics such as computer architecture, programming, computer security, electronic materials, microelectronic systems, electrical engineering materials, antenna and wave propagation, distributed platforms, and robotics. Our goal is to classify these papers into five clusters according to those scopes. Consequently, as explained earlier in the research method section, we crawled the title, keywords, and abstract of each paper to use as the core data for classification. Meanwhile, we extracted five lists of word tokens from the topics of the scopes. Once these steps are completed, the corpus is ready to be used as input to the TF-IDF calculation module, which calculates the weight of each word token for both clusters and papers, as shown in Figure 2. Subsequently, the cosine similarity algorithm is applied to the TF-IDF weights, as shown in Figure 3. Because TF-IDF weights are non-negative, the cosine similarity value ranges from 0 to 1, where a high value indicates that a paper is well matched to its own cluster and poorly matched to neighboring clusters. As we see in Figure 2, there are five different clusters. For instance, the first cluster revolves around computer science, computer engineering, and informatics. It consists of many word tokens such as computer, programming, computing, and security, to mention a few.
Similarly, we can examine the rest of the clusters by analyzing the set of extracted topics. The results showed that most of the papers were linked to the right cluster according to the cosine similarity algorithm. Figure 4 shows the classification, number, and distribution of over 96% of the papers from 2012 to 2019. These results demonstrate the efficiency of the proposed approach.
Validation allows us to evaluate the classification of papers produced by the selected algorithms. We evaluate the proposed system using precision and recall, which are among the most common validation metrics and are based on the separation between relevant and irrelevant items. As shown in Figure 5, the validation indicates accurate labeling in the paper classification. We found that some papers contain mixed subjects, which means that many different modules, contributions, and tools have been employed in the same paper.
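For reference, per-cluster precision and recall can be computed as in the sketch below. The label lists are invented purely to exercise the function and are not results from this study; the function name and the convention of returning 0.0 for an empty denominator are assumptions.

```python
def precision_recall(true_labels, predicted_labels, cluster):
    """Precision and recall for one cluster, treating it as the relevant class."""
    pairs = list(zip(true_labels, predicted_labels))
    # Papers correctly assigned to the cluster.
    tp = sum(1 for t, p in pairs if t == cluster and p == cluster)
    # Papers assigned to the cluster that belong elsewhere.
    fp = sum(1 for t, p in pairs if t != cluster and p == cluster)
    # Papers of the cluster that were assigned elsewhere.
    fn = sum(1 for t, p in pairs if t == cluster and p != cluster)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels for six papers (cluster numbers), for illustration only.
true_labels = [1, 1, 1, 2, 2, 3]
predicted_labels = [1, 1, 2, 2, 2, 1]

precision_1, recall_1 = precision_recall(true_labels, predicted_labels, 1)
precision_2, recall_2 = precision_recall(true_labels, predicted_labels, 2)
```

Precision answers how many of the papers placed in a cluster actually belong there, while recall answers how many of the papers belonging to a cluster were placed in it. A paper with mixed subjects can lower both values for the clusters it touches.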

CONCLUSION
In this paper, we proposed a classification approach for clustering research papers in order to improve and automate the process of organizing and classifying scientific papers. The classification approach introduced in this paper uses web data mining techniques to classify research papers according to the focus and scope topics. The selected algorithms have shown accurate and reliable results in classification according to predefined clusters. Classification of papers is essential to facilitate finding scientific research and to identify the needs of researchers more effectively. The experimental results showed that it is possible to classify more than 96% of the papers into similar scopes using the cosine similarity algorithm, and these results were verified by precision and recall metrics. This paper mainly focuses on developing and analyzing the classification of research papers based on cluster topics. Future work should extend the approach to include various topics extracted from the papers themselves, in order to classify whole papers accurately and efficiently.