Text documents clustering using modified multi-verse optimizer

In this study, a multi-verse optimizer (MVO) is utilised for the text document clustering (TDC) problem. TDC is treated as a discrete optimization problem


INTRODUCTION
In the current digital era, massive amounts of online text documents inundate the web every day. Manipulating these text documents is important for improving the query results returned by search engines, unsupervised text organisation systems, text classification, text summarization, knowledge extraction processes, information retrieval services, text mining processes and scientific document clustering [1]. Many approaches have been proposed for the unsupervised organisation of text documents.
Text document clustering (TDC) is an effective and efficient technique used by researchers in this domain [2]; this field of text mining enables the organisation of large amounts of textual data. It can be defined as an unsupervised automatic document clustering technique that utilises the document similarities rule to divide documents into homogeneous clusters. In other words, text documents in the same cluster are similar, whereas those in different clusters are dissimilar [3]. Conventionally, clustering methods can be classified into two main groups: (i) partitional clustering and (ii) hierarchical clustering. K-means and K-medoids, two partitional methods, are simple and easy to use and can be tailored to suit large-scale text document datasets. They are iterative clustering techniques initiated with a predefined number of cluster centroids. At each iteration, documents are distributed into clusters according to a similarity function, depending on the distance between each document and its closest centroid. Then, each cluster centroid is updated according to the documents belonging to that cluster. The procedure stops when the cluster centroids stagnate, that is, when no document changes cluster. The main shortcoming of these methods is their convergence behaviour; they move in one direction in a single search space region and do not perform a wider scan of the whole search space. Therefore, they can easily become trapped in local optima due to the unknown shapes of search spaces. Given that K-means is a local search method and TDC is formulated as an optimization problem [4], optimization methods that can escape local optima can be utilized for TDC [5].
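The iterative assign-and-update procedure described above can be sketched in a few lines of Python. This is a minimal illustration only: the function names, the use of Euclidean distance and the choice of the first k points as initial centroids are assumptions made for the example and are not taken from the experiments of this study.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, max_iter=100):
    # Assumption: the first k points serve as initial centroids,
    # which keeps the example deterministic.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its closest centroid.
        labels = [min(range(k), key=lambda c: euclidean(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep empty clusters as-is
        # Stop when the centroids stagnate.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

labels, centroids = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
```

Because every update only moves centroids towards the mean of their current members, the result depends on the initial centroids, which is the local-optimum behaviour noted above.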
The most successful algorithms recently utilized for TDC are metaheuristic-based algorithms [6]. The first type of metaheuristic algorithms is evolutionary-based algorithms (EAs), which are initiated with a group of provisional individuals called a population. Generation after generation, the population is evolved on the basis of three main operators: recombination for mixing the individual features, mutation for diversifying the search and selection for utilising the survival-of-the-fittest principle. An EA is stopped when no further evolution can be achieved. The main shortcoming of EAs is that although they can simultaneously navigate several areas in the search space, they cannot perform deep searching in each area to which they navigate. Consequently, EAs mostly suffer from premature convergence. EAs that have been successfully utilized for TDC include the genetic algorithm (GA) [7] and harmony search [8].
The second type of metaheuristic algorithms is trajectory-based algorithms (TAs); a single solution is used to launch such an algorithm. This solution is improved repeatedly using neighbouring-move operators until a local optimal solution in the same search space region is reached [9]. While TAs can extensively search the region of the initial solution and reach local optima, they cannot navigate numerous search space regions simultaneously. The main TAs utilized for TDC are K-means and K-medoids. Other TAs used for TDC are self-organizing maps (SOM) [10] and β-hill climbing.
The last type of metaheuristic algorithms is swarm intelligence (SI); an SI algorithm is also initiated with a set of random solutions called a swarm. Iteration after iteration, the solutions in the swarm are reconstructed by attracting them towards the best solutions found so far [11]. SI-based algorithms can easily converge prematurely. Several SI-based algorithms have been utilized for TDC, such as particle swarm optimization (PSO) [12] and artificial bee colony [13].
The multi-verse optimizer (MVO) algorithm was recently proposed as a stochastic population-based algorithm [14] inspired by multi-verse theory. The big bang theory explains the origin of the universe as a massive explosion. According to this theory, the origin of everything in our universe requires one big bang. Multi-verse theory holds that more than one explosion (big bang) occurred, with each big bang creating a new and independent universe. This theory is modelled as an optimization algorithm with three concepts: white hole, black hole and wormhole, for performing exploration, exploitation and local search, respectively. MVO has been utilized for a wide range of optimization problems, such as identifying the optimal parameters of PEMFC stacks [15], unmanned aerial vehicle path planning [16], clustering problems [17], feature selection [18], neural networks [19], and optimising SVM parameters [18]. This paper adapts the MVO algorithm for the TDC problem using Euclidean distance as the similarity measure. The adaptation includes modifying the convergence behaviour of MVO operators to deal with a discrete, rather than continuous, optimization problem. The main advantage of the proposed method is that it improves the quality of final outcomes for TDC problems. A comprehensive comparative study is conducted on six text document benchmark datasets that have different numbers of clusters and documents. The quality of the final results is analysed and discussed using accuracy, precision, recall, F-measure, purity and entropy criteria. The findings of the experimental analyses reveal that the proposed method performs competitively in comparison with state-of-the-art algorithms.
The rest of this paper is organised as follows. Section 2 presents the structure of MVO for TDC. Section 3 discusses the experimental results of MVO. Section 4 gives the conclusion of this study and future work.

MULTI-VERSE OPTIMIZER FOR TEXT DOCUMENT CLUSTERING
This section describes the main components utilized for tackling TDC. The term 'components' denotes the adaptation steps, in sequence, that are carried out to solve the TDC problem using the MVO algorithm: (i) TDC pre-processing, (ii) solution representation and (iii) calculation of the objective function (evaluation of the solutions). Finally, the question of how MVO is adapted to TDC is addressed.

Text document clustering (TDC)
TDC aims to divide documents into clusters; each cluster has similar documents, whereas documents in different clusters are dissimilar [20,21].
Before any clustering algorithm is applied to TDC, the text must undergo preliminary steps (pre-processing) that filter unnecessary data, such as special formatting, special characters and numbers, out of the text. Thereafter, the pre-processed text document terms are converted into numerical form for further processing. The main goal of pre-processing is to improve the quality of features and reduce the implementation complexity of the TDC algorithm [22]. Text pre-processing includes tokenization, stop word removal and stemming, which are discussed in detail in the succeeding subsections.

Tokenization Phase
In the tokenization phase, each document is broken down into a set of tokens (words), where a token is any sequence of characters separated by spaces. Each document is then represented by its word instance counts, as in the bag-of-words model [23]. Note that the word instance counts are filtered through the removal of empty sequences, number formatting and collapsing, among other tasks [24].
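As a small illustration of the tokenization phase, the following Python sketch splits a document on whitespace and counts word instances as a bag of words. The function names are hypothetical, and lowercasing and the removal of non-alphabetic characters are assumptions chosen for the example; the exact filtering rules of this study are not specified here.

```python
from collections import Counter

def tokenize(text):
    """Split on whitespace, keep alphabetic characters, drop empty tokens."""
    tokens = []
    for raw in text.split():
        # Assumption: lowercase and strip digits/punctuation from each token.
        cleaned = "".join(ch for ch in raw.lower() if ch.isalpha())
        if cleaned:  # empty sequences (e.g. pure numbers) are removed
            tokens.append(cleaned)
    return tokens

def bag_of_words(text):
    """Word instance counts for one document."""
    return Counter(tokenize(text))

counts = bag_of_words("The cat, the CAT and 42 dogs!")
```

Here `counts` maps each surviving token to its frequency, which is the numerical form passed on to later steps.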

Stemming Phase
Stemming is the process of reducing terms to their roots by removal of affixes (prefixes and suffixes) [25]. For example, the root of the word 'stemming' is 'stem'. In the English language, many terms may share the same root; for example, the words 'connects', 'connected' and 'connecting' all stem from the same root, which is 'connect' (see www.text-processing.com/demo/stem/). The stemming process attempts to improve the clustering by reducing the number of different terms that have similar grammatical properties and stem from a single term [26].
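The idea of suffix removal can be illustrated with a deliberately simple Python sketch. It is a toy rule set written for this example (the name `simple_stem` and the three suffix rules are assumptions) and is not the stemmer used in this study; a production stemmer handles many more cases.

```python
SUFFIXES = ("ing", "ed", "s")  # hypothetical, minimal rule set

def simple_stem(word):
    """Strip one common English suffix and undo a doubled final consonant."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # 'stemming' -> 'stemm' -> 'stem'
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

roots = {w: simple_stem(w) for w in ("connects", "connected", "connecting", "stemming")}
```

With these rules the three variants of 'connect' collapse to one term, which is how stemming reduces the number of distinct features.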

Solution representation
Each solution is represented as a vector $x = (x_1, x_2, \ldots, x_d)$, where d is the number of documents and element $x_i$ holds the index of the cluster to which document i is assigned. Figure 1 shows an example of a solution representation. In this example, five clusters contain twenty documents; for example, cluster two has three documents {4, 8, 9}, and document 10 (i.e., $x_{10}$) is in cluster three.
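The representation can be made concrete with a short Python sketch. Only two facts are taken from the example above (cluster two holds documents 4, 8 and 9, and document 10 is in cluster three); every other entry of the vector is a hypothetical value chosen for illustration, as is the helper name `clusters_from_solution`.

```python
from collections import defaultdict

# Hypothetical assignment vector for d = 20 documents and k = 5 clusters.
# Position i (1-based) stores the cluster of document i.
solution = [1, 5, 1, 2, 3, 4, 5, 2, 2, 3, 1, 4, 5, 3, 1, 4, 5, 3, 4, 1]

def clusters_from_solution(x):
    """Group 1-based document numbers by their assigned cluster."""
    members = defaultdict(list)
    for doc_number, cluster in enumerate(x, start=1):
        members[cluster].append(doc_number)
    return dict(members)

clusters = clusters_from_solution(solution)
```

Storing one integer per document keeps every solution the same length regardless of how the documents are distributed among clusters.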

Objective function
In this study, the objective function of each solution x is calculated using the average distance of documents to the cluster centroid (ADDC), as shown in (1):

$$f(x) = \frac{1}{k} \sum_{i=1}^{k} \left( \frac{1}{n_i} \sum_{j=1}^{n_i} D(C_{k_i}, d_j) \right) \qquad (1)$$

where $D(C_{k_i}, d_j)$ is the distance between the centroid of cluster i and document j, $n_i$ is the number of documents in cluster i, k is the number of clusters, and f(x) is the objective function (i.e. the distance to be minimized).
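A minimal Python sketch of the ADDC computation is given below, assuming Euclidean distance, documents given as numeric vectors and 0-based cluster indices. The function names and the treatment of empty clusters (they contribute zero) are assumptions made for this example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def addc(solution, docs, k):
    """Average distance of documents to their cluster centroid.

    solution[j] is the 0-based cluster index of document j.
    """
    total = 0.0
    for i in range(k):
        members = [docs[j] for j, c in enumerate(solution) if c == i]
        if not members:
            continue  # assumption: an empty cluster adds nothing
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total += sum(euclidean(centroid, d) for d in members) / len(members)
    return total / k

docs = [(0, 0), (0, 2), (4, 0), (4, 4)]
good = addc([0, 0, 1, 1], docs, k=2)
bad = addc([0, 1, 0, 1], docs, k=2)
```

In this toy case the first assignment groups nearby documents and therefore yields the smaller ADDC value, which is the direction the optimizer is meant to follow.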

Multi-verse in optimization context
MVO [14] is inspired by multi-verse theory. According to multi-verse theory, universes connect and might even collide with each other. MVO is formulated using three main concepts: wormholes, black holes, and white holes. The inflation rate of a universe (which corresponds to the objective function in the optimization context) determines the probability with which the universe is assigned one of these holes. A universe with a high inflation rate has an increased probability of containing a white hole, whereas a low inflation rate leads to an increased probability of a black hole existing [14]. Regardless of the universe's inflation rate, wormholes move objects towards the best universe randomly [15,27]. The black and white hole concepts in MVO are formulated for exploring search spaces, and the wormhole concept is formulated for exploiting search spaces. As in other EAs, MVO is initiated by a population of individuals (universes). Thereafter, MVO improves these solutions until a stopping criterion is met. The conceptual model of MVO in [14] shows the movements of the objects between the universes via white/black hole tunnels. These hole tunnels are created between two universes on the basis of the inflation rate of each universe (i.e. one universe has a higher inflation rate than the other). Objects move from universes with high inflation rates using white holes. These objects are received by universes with low inflation rates using black holes.
After a population of solutions is initiated, all solutions in MVO are sorted from high inflation rates to low ones. Thereafter, MVO visits the solutions one by one to attract each solution towards the best one. This is done under the assumption that the solution being visited has the black hole. As for the white holes, the roulette wheel mechanism is used for selecting one solution.
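The roulette wheel mechanism mentioned above can be sketched in Python as follows. This is a generic fitness-proportional selection routine, not code from [14]; the function name, the use of non-negative weights in which a larger value means a higher chance of being chosen as the white hole, and the injected random generator are all assumptions made for the example.

```python
import random

def roulette_wheel(weights, rng):
    """Return an index chosen with probability proportional to its weight."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    pick = rng.random() * total  # a point on the wheel in [0, total)
    cumulative = 0.0
    for index, weight in enumerate(weights):
        cumulative += weight
        if pick < cumulative:
            return index
    return len(weights) - 1  # guard against floating-point rounding

rng = random.Random(0)
# Hypothetical normalised inflation rates of four sorted universes.
white_hole = roulette_wheel([0.4, 0.3, 0.2, 0.1], rng)
```

Better universes are selected more often, yet weaker ones keep a non-zero chance, which preserves diversity during the exchange of objects.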

Adapting MVO for TDC
After the pre-processing step, MVO is used to assign documents to their parent clusters. In this study, the solution representation and the objective function formulated above are used. The steps of classical MVO in [14] are adopted for TDC with certain modifications. These modifications are related to the nature of the problem variables. Given that the clustering problem is discrete in nature [28] and MVO was originally proposed for continuous optimization problems, MVO must deal with discrete values of the decision variables of each TDC solution. During the MVO execution, the generation function and the wormhole equation (2) are adjusted so that only feasible solutions are produced. A general overview of MVO for TDC is provided in Figure 2, which visualises the procedural steps.
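As a purely hypothetical illustration of what a discrete generation function and wormhole move could look like (it is not the adjusted equation (2) of this study), the Python sketch below keeps every decision variable inside the set of valid cluster indices. The names `random_solution` and `discrete_wormhole`, the 0-based cluster indices, and the use of `wep` and `tdr` as probabilities are assumptions made for the example.

```python
import random

def random_solution(d, k, rng):
    """Generate d cluster indices, each drawn from 0..k-1."""
    return [rng.randrange(k) for _ in range(d)]

def discrete_wormhole(universe, best, k, wep, tdr, rng):
    """Move objects towards the best universe while staying discrete.

    wep: probability that an object passes through a wormhole (assumed).
    tdr: probability that the object then takes a random cluster index
         instead of copying the best universe (assumed).
    """
    new_universe = list(universe)
    for j in range(len(new_universe)):
        if rng.random() < wep:
            if rng.random() < tdr:
                new_universe[j] = rng.randrange(k)  # random feasible value
            else:
                new_universe[j] = best[j]  # copy from the best universe
    return new_universe

rng = random.Random(1)
best = random_solution(10, 3, rng)
current = random_solution(10, 3, rng)
moved = discrete_wormhole(current, best, k=3, wep=0.6, tdr=0.2, rng=rng)
```

Because every assignment is either copied from an existing solution or drawn from the valid index range, no repair step is needed to restore feasibility.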

EXPERIMENTAL RESULTS
For evaluating the performance of the proposed method, a set of designed experiments is conducted using six instances of standard datasets formulated for measuring the performance of text clustering techniques. Six evaluation measures are used, as conventionally done: precision, recall, F-measure, entropy, accuracy and purity. For comparative evaluation, the results obtained in terms of these measures are compared with those obtained by three state-of-the-art algorithms (K-means clustering, GA and PSO) using the same objective function. The experiments are implemented in MATLAB. Thorough descriptions of the experimental results are given in the following subsections. Table 1 provides the characteristics of the six text document datasets used in this study: CSTR2, 20Newsgroups and Classic4 from sites.labic.icmc.usp.br/text-collections, and tr12, tr41 and Wap from glaros.dtc.umn.edu/gkhome/fetch/sw/cluto/datasets.tar.gz.

Results and discussion
The results obtained by MVOTDC are summarised in Table 2, and the parameter settings used in the experiments are given in Table 3. The results are summarised in terms of precision, recall, F-measure, entropy, accuracy and purity for the six datasets. The findings support the validity and effectiveness of the proposed MVOTDC in assigning text documents to the right clusters.
Experiments are also conducted to show the validity of the proposed method in comparison with three well-known methods: GA, K-means and PSO. Table 3 shows the parameter setting values for each compared algorithm. These parameter settings are used as suggested in [17].
A comparative analysis of K-means, GA, PSO and MVOTDC is provided in Table 2 in terms of precision, recall, F-measure, entropy, accuracy and purity; the average values for each measure are recorded. The results obtained by the K-means clustering algorithm are worse than those obtained by the other algorithms for nearly all datasets. A possible justification is that K-means is a local search algorithm; therefore, it is highly likely to fall into local optima due to its inability to explore the problem search space effectively. Meanwhile, population-based metaheuristic algorithms, such as GA, PSO and MVOTDC, can explore different areas in the search space simultaneously and can consequently achieve better exploration properties. Table 2 also shows that MVOTDC attains the minimum entropy and the maximum purity, precision, recall, F-measure and accuracy for five datasets (i.e. DS1, DS2, DS3, DS4 and DS6). The ability of MVOTDC to reach the right balance between exploitation and exploration during the search, together with a powerful learning mechanism, strengthens its performance in comparison with the other methods. Table 2 provides the results of the F-measure for all compared methods, including MVOTDC. Notably, MVOTDC produces the best F-measure values for five datasets. Furthermore, GA, PSO and MVOTDC outperform K-means in all the datasets.
From a different perspective, Table 2 also shows the accuracy of all compared algorithms. In general, the results obtained by MVOTDC are better than those of the other methods. The results can change slightly from one dataset to another because clustering algorithms are normally highly sensitive to the dataset search space. This can be validated by the finding that MVOTDC obtains the best accuracy in five datasets and the second-best for DS5.
The purity measure of clusters is another external evaluation criterion. It measures the extent to which each cluster is dominated by its largest class. In general, the closer the purity value is to 1, the better the clustering solution. Table 2 shows the results of the purity measure for all compared methods on all datasets. MVOTDC outperforms K-means, GA and PSO in five datasets (i.e. DS1, DS2, DS3, DS4 and DS6). The proposed algorithm obtains a 21.5% improvement for DS1 compared with K-means. For DS2, MVOTDC's purity values show improvements of
Entropy is another external measure used in evaluating and comparing the quality of clustering algorithms. The entropy value is zero only when all documents in a single class are placed in a single cluster. In this case, the clustering solution is considered the best. Table 2 shows the entropy measure values obtained by all the compared algorithms on the different datasets. The bigger the entropy value, the worse the clustering solution. According to the results, MVOTDC provides the lowest entropy values for most of the datasets, which indicates that it offers better clustering solutions than the other algorithms on those datasets. Notably, K-means produces the worst entropy measure for all datasets, whereas GA and PSO are again ranked in between MVOTDC and K-means. The superior performance of MVOTDC is attributed to its explorative capability in the search space.
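The two external measures discussed above can be computed as in the following Python sketch. It uses common textbook definitions (purity as the fraction of documents that belong to the majority class of their cluster, and entropy as the size-weighted class entropy of each cluster with base-2 logarithms and no further normalisation); these exact forms and the function names are assumptions and may differ from the formulas used to produce Table 2.

```python
import math
from collections import Counter, defaultdict

def _classes_per_cluster(true_labels, cluster_labels):
    """Collect the true class of every document, grouped by cluster."""
    groups = defaultdict(list)
    for truth, cluster in zip(true_labels, cluster_labels):
        groups[cluster].append(truth)
    return groups

def purity(true_labels, cluster_labels):
    groups = _classes_per_cluster(true_labels, cluster_labels)
    majority = sum(max(Counter(m).values()) for m in groups.values())
    return majority / len(true_labels)

def entropy(true_labels, cluster_labels):
    groups = _classes_per_cluster(true_labels, cluster_labels)
    n = len(true_labels)
    total = 0.0
    for members in groups.values():
        size = len(members)
        # Class entropy inside this cluster (0 when it holds one class).
        h = -sum((c / size) * math.log2(c / size)
                 for c in Counter(members).values())
        total += (size / n) * h
    return total

truth = ["a", "a", "a", "b"]
found = [0, 0, 1, 1]
scores = (purity(truth, found), entropy(truth, found))
```

A clustering in which every cluster holds a single class gives a purity of 1 and an entropy of 0 under these definitions, matching the interpretation of the two measures given above.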
The objective function is determined by ADDC for all clustering algorithms so that the distance between the documents in each cluster is minimized. Figure 3 depicts the convergence trends of GA, PSO and MVOTDC using ADDC values. The x-axis shows the iteration number, whereas the y-axis shows the ADDC value. Notably, the convergence rate of MVOTDC is fairly fast for all datasets except DS5. It is worth emphasizing that MVOTDC can be used to address specific optimization problems such as EEG signal denoising [30], gene selection [31] and power scheduling [32]. Despite its superiority among the compared algorithms, MVOTDC remains sensitive to the characteristics of the datasets, which makes its behaviour on new datasets difficult to predict.

CONCLUSION AND FUTURE WORK
This paper adapts a metaheuristic optimization algorithm, the multi-verse optimizer (MVO), for solving the text document clustering (TDC) problem; the resulting method is called MVOTDC. This method introduces a new strategy of sharing information between solutions on the basis of an objective function and learns from the best solution instead of the global best (i.e. all solutions). The convergence of MVOTDC is attributed to the method's achievement of an appropriate balance between exploitation and exploration during each run.
The proposed MVOTDC is evaluated using six text document datasets with various sizes and complexities. The numbers of documents and clusters in each dataset are given. The quality of the obtained results is assessed using six measures: precision, recall, F-measure, entropy, accuracy and purity.
These measures are also used for a comparative evaluation in which three well-known clustering algorithms are used: K-means, genetic algorithm (GA) and particle swarm optimisation (PSO). For all measures, the results obtained by MVOTDC are better than those produced by the three compared methods on five of the six datasets. In terms of computational time, MVOTDC is slower than K-means and requires nearly the same computational time as GA and PSO. Therefore, MVOTDC can be considered an efficient clustering method for the text clustering domain.
Given the successful outcomes of MVO for the TDC problem, MVOTDC can be implemented for different types of clustering problems. MVO can also be further improved by the addition or modification of its operators so that it can address other discrete optimisation problems, such as scheduling. In addition, datasets other than those used in this work can be used in future studies. Finally, MVO can be hybridized with local search strategies to improve the initial solutions and the exploitation capability during the optimization process.