A hybrid swarm intelligence feature selection approach based on time-varying transition parameter

Feature selection aims to reduce the dimensionality of a dataset by removing superfluous attributes. This paper proposes a hybrid approach for the feature selection problem that combines particle swarm optimization (PSO), grey wolf optimization (GWO), and a tournament selection (TS) mechanism. Particle swarm enhances diversification at the beginning of the search, grey wolf enhances intensification at the end of the search, while tournament selection maintains diversification not only at the beginning but also at the end of the search process to achieve local-optima avoidance. A time-varying transition parameter and a random variable are used to select among the particle swarm, grey wolf, and tournament selection techniques during the search process. This paper proposes different variants of this approach based on S-shaped and V-shaped transfer functions (TFs) to convert continuous solutions to binary ones. These variants are named hybrid tournament grey wolf particle swarm (HTGWPS), followed by the letter S or V to indicate the TF type, and then by the TF's number. The variants were evaluated using nine high-dimensional datasets. The results revealed that HTGWPS-V1 outperformed the other V-shaped variants, PSO, and GWO on 78% of the datasets in terms of the maximum classification accuracy obtained by a minimal feature subset. HTGWPS-V1 also outperformed six well-known metaheuristics on 67% of the datasets.


INTRODUCTION
Feature selection (FS) is an essential technique that has been widely utilized to improve the performance of machine learning algorithms in a variety of domains [1]. With the emergence of high-dimensional datasets, FS becomes a difficult challenge because it seeks to find the optimum combination of features that represents the whole set without information loss. According to Liu and Motoda [2], the FS problem is tackled with comprehensive search, random search, or heuristic search techniques. Comprehensive search produces and examines all possible feature subsets to pick the ideal one, while random search creates and examines stochastic subsets of features. Heuristic search generates a random subset of features and then uses the best solution to guide the search until the best feature set is found. Filter or wrapper approaches are often used to examine the generated subset of features [3]-[5]. The filter method depends on the associations between the features themselves, rather than using any feedback from the learning algorithm [4]. The wrapper model, on the other hand, integrates a learning algorithm as the assessment criterion [6]-[8]. The hybrid method combines both filter and wrapper to take advantage of both techniques [9], [10].

BACKGROUND

Particle swarm optimization
PSO was introduced by Kennedy and Eberhart [39]. It simulates the behavior of particles within a swarm while they are searching for food. Each particle has its own position, which represents a feasible solution in the search space, and its own velocity, which controls the step size and direction of the particle's movement during the search process. Each particle is guided through its movements by the best particle within the swarm and by its own best location found so far. The new velocity and the new position of each particle are calculated using (1) and (2):

veln = iw · velo + af1 · d1 · (perb − xo) + af2 · d2 · (glblb − xo)   (1)
xn = xo + veln   (2)

where veln and velo represent a particle's new and old velocity, respectively, and perb and glblb are the personal and the global best locations in the search space. The values d1 and d2 are two random numbers in the range [0, 1], and af1 and af2 are acceleration factors that are set to 2 in many studies in the literature. The inertia weight, iw, decreases linearly or non-linearly over the search to balance the exploration and exploitation phases. The new position, xn, is calculated from the particle's new velocity, veln, and its old position, xo.
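The velocity and position update in (1) and (2) can be sketched as follows, using the paper's variable names. This is a minimal NumPy sketch; the function name and random-generator handling are illustrative, not from the paper:

```python
import numpy as np

def pso_update(xo, velo, perb, glblb, iw, af1=2.0, af2=2.0, rng=None):
    """One PSO step per Eqs. (1)-(2): new velocity from inertia plus
    cognitive (personal-best) and social (global-best) pull terms."""
    rng = rng or np.random.default_rng()
    d1 = rng.random(xo.shape)  # random factors in [0, 1)
    d2 = rng.random(xo.shape)
    veln = iw * velo + af1 * d1 * (perb - xo) + af2 * d2 * (glblb - xo)
    xn = xo + veln             # Eq. (2): move by the new velocity
    return xn, veln
```

When a particle sits exactly on both its personal and the global best, only the inertia term remains, so the velocity simply shrinks by the factor iw.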

Grey wolf optimization
The grey wolf optimization (GWO) algorithm was proposed by Mirjalili et al. [40]. It mimics the hunting behavior of grey wolves. GWO initializes the algorithm by generating a random position for each wolf in the population; each wolf represents a solution in the search space. These solutions are evaluated, and the three best ones are picked. The best solutions, called alpha (α), beta (β), and delta (δ), have the best knowledge about the location of the prey. The rest of the pack are called omega wolves. At each iteration, the omega wolves follow the three best solutions, changing their locations within the search space to stay close to and encircle them. These behaviors are modeled by (3)-(5):

Dα = |C1 · Xα − X|,  Dβ = |C2 · Xβ − X|,  Dδ = |C3 · Xδ − X|   (3)
X1 = Xα − A1 · Dα,  X2 = Xβ − A2 · Dβ,  X3 = Xδ − A3 · Dδ   (4)
X(t+1) = (X1 + X2 + X3) / 3   (5)

where X is a wolf's location in the search space; Xα, Xβ, and Xδ are the positions of the three best solutions; Dα, Dβ, and Dδ are the distances between each wolf and the three best solutions; and X(t+1) is the new location of each wolf, which lies at a random position around the prey. The two coefficients A and C are calculated using (6) and (7):

A = 2a · d1 − a   (6)
C = 2 · d2   (7)

where d1 and d2 are random numbers in the range [0, 1], and the variable a decreases linearly from 2 to 0 over the iterations, calculated using (8):

a = 2 − 2t/T   (8)

where a is a variable that tunes the local and global search of the GWO algorithm, t is the current iteration, and T is the overall number of iterations.
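The GWO position update in (3)-(8) can be sketched as follows. This is a minimal NumPy sketch; the function name and the per-leader coefficient sampling are illustrative assumptions, not code from the paper:

```python
import numpy as np

def gwo_step(X, alpha, beta, delta, t, T, rng=None):
    """One GWO update per Eqs. (3)-(8): each wolf moves to the mean of
    three positions estimated from the alpha, beta and delta leaders."""
    rng = rng or np.random.default_rng()
    a = 2.0 - 2.0 * t / T                       # Eq. (8): 2 -> 0 over iterations

    def guided(leader):
        A = 2.0 * a * rng.random(X.shape) - a   # Eq. (6)
        C = 2.0 * rng.random(X.shape)           # Eq. (7)
        D = np.abs(C * leader - X)              # Eq. (3): distance to leader
        return leader - A * D                   # Eq. (4)

    return (guided(alpha) + guided(beta) + guided(delta)) / 3.0  # Eq. (5)
```

At the final iteration a = 0, so A = 0 and every wolf lands exactly on the mean of the three leaders, which is the intensification behavior described above.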

Tournament selection strategy
The TS technique is employed to choose a guiding solution rather than always picking the optimal one. In this mechanism, a collection of solutions, known as a tournament, is chosen randomly, and then the best solution in the tournament is selected as the guiding solution [7], [41]. The search agent selected by the tournament method is used to guide the other agents. This strategy emphasizes searching more areas of a large search space. The position-updating mechanism is given by (9) and (10):

D = |C · Xselected − X(t)|   (9)
X(t+1) = Xselected − A · D   (10)

where Xselected is the search agent selected by the tournament method and D is the distance between each search agent and the selected tournament agent. The two coefficients A and C are calculated using (6) and (7).
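The tournament pick and the guided move in (9)-(10) can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the tournament size k and minimization of fitness are illustrative choices, and the function names are not from the paper:

```python
import numpy as np

def tournament_pick(fitness, k=3, rng=None):
    """Draw k agents at random (the tournament) and return the index of
    the fittest one (lowest fitness, assuming minimization)."""
    rng = rng or np.random.default_rng()
    contenders = rng.choice(len(fitness), size=k, replace=False)
    return contenders[np.argmin(np.asarray(fitness)[contenders])]

def ts_step(X, X_selected, a, rng=None):
    """Move an agent toward the tournament winner per Eqs. (9)-(10),
    reusing the A and C coefficients of Eqs. (6)-(7)."""
    rng = rng or np.random.default_rng()
    A = 2.0 * a * rng.random(X.shape) - a   # Eq. (6)
    C = 2.0 * rng.random(X.shape)           # Eq. (7)
    D = np.abs(C * X_selected - X)          # Eq. (9)
    return X_selected - A * D               # Eq. (10)
```

Because the winner comes from a random subset rather than the whole population, weaker agents occasionally become guides, which is what keeps diversification alive late in the run.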

The binary optimization problem
FS is a binary problem since all attributes in a dataset can be characterized by a string of zeros and ones, where 0 indicates an unselected feature and 1 a selected one. According to Mirjalili and Lewis [32], the switch from continuous solutions to binary ones is based on two types of TFs, namely S-shaped and V-shaped. The mathematical formulas of these TFs are shown in Table 1.

Table 1. S-shaped and V-shaped TFs [32], [33], [42]
S-shaped TFs: S1(x) = 1/(1 + e^(−2x)), S2(x) = 1/(1 + e^(−x)), S3(x) = 1/(1 + e^(−x/2)), S4(x) = 1/(1 + e^(−x/3))
V-shaped TFs: V1(x) = |erf((√π/2)·x)|, V2(x) = |tanh(x)|, V3(x) = |x/√(1 + x²)|, V4(x) = |(2/π)·arctan((π/2)·x)|
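The two binarization rules can be sketched as follows, assuming the standard S1 and V2 functions from the TF families in [32] (the V-shaped rule flips the previous bit, whereas the S-shaped rule sets the bit directly). This is an illustrative NumPy sketch, not code from the paper:

```python
import numpy as np

def s1(x):
    """S1 transfer function: sigmoid squashing a position to [0, 1]."""
    return 1.0 / (1.0 + np.exp(-2.0 * x))

def v2(x):
    """V2 transfer function: |tanh(x)|, symmetric around zero."""
    return np.abs(np.tanh(x))

def binarize_s(x, rng):
    """S-shaped rule: set each bit to 1 with probability S(x)."""
    return (rng.random(x.shape) < s1(x)).astype(int)

def binarize_v(x, old_bits, rng):
    """V-shaped rule: flip the previous bit with probability V(x)."""
    flip = rng.random(x.shape) < v2(x)
    return np.where(flip, 1 - old_bits, old_bits)
```

The design difference matters: an S-shaped TF forces a definite bit regardless of history, while a V-shaped TF only changes bits where the continuous position has moved far from zero, which tends to preserve good partial solutions.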

METHOD
The FS issue is a type of optimization problem that looks for the most relevant features in a dataset. In this paper, a hybrid wrapper-based metaheuristic approach is introduced for the FS task. Search techniques must tune the explorative and exploitative phases to determine the ideal solution. Throughout this paper, PSO with strong exploration potential and GWO with high exploitation capabilities are used, while a TS operator is employed to boost the diversity and the search efficiency and to avoid getting trapped in local optima. This approach is named the hybrid tournament grey wolf particle swarm algorithm (HTGWPS). In this paper, the FS issue is treated as a binary problem, which means converting the features into a string of ones and zeros with the same length as the number of attributes in the dataset. A one represents a selected feature, while a zero represents an unselected one. In this work, four S-shaped and four V-shaped TFs were used to derive eight variants of the proposed approach. Each variant is named HTGWPS, followed by the type and the number of the TF that was used.
According to this hybrid approach, the initial population is generated randomly. Then, the solutions are optimized using the PSO, GWO, or TS techniques. The choice is governed by a time-varying transition parameter, namely Z, that is calculated using (11):

Z = 1 − t/T   (11)
where t is the current iteration, and T is the maximum number of iterations. Figure 1 shows a flowchart of the proposed HTGWPS approach. As shown in the flowchart, the solutions are initialized randomly, and then evaluated according to the tackled problem. The parameter Z is calculated. If Z is greater than or equal to 0.5, the solutions are updated using the PSO mechanism. Otherwise, the solutions are updated using either GWO or TS depending on a random value, rand, between 0 and 1. These steps are repeated until the maximum iteration is reached, and the optimal solution is retrieved. PSO in the first half of the iterations assists the algorithm in doing a comprehensive scan of the search area, using PSO's diversification ability to choose the promising region that may hold the global best solution. The GWO technique has intensification properties since it follows the three best solutions throughout the search phase. As a result, the algorithm may get stuck in a local optimum, especially when searching a large search space. To address this issue, this paper employs the TS technique with its stochastic behavior in the second half of the iterations, in competition with GWO, to sustain the exploration phase and prevent this approach from being stuck in a local optimum. Referring to the achieved results, this tactic efficiently tunes between exploration and exploitation and prevents getting trapped in local optima.
Algorithm 1 displays the pseudo-code of HTGWPS. Lines 2 to 5 set the number of iterations and the population size, and tune the algorithm's parameters. In line 8, Z is calculated. In lines 9 to 20, if the value of Z is greater than or equal to 0.5, the solutions are improved using the PSO technique based on (1) and (2); otherwise, a random value, rand, is generated to guide the solutions to update their positions based on the GWO technique using (3) to (8), or based on the TS tactic using (9) and (10). These steps are repeated until the maximum number of iterations is reached and the optimal solution is retrieved.

Algorithm 1. The pseudocode of the HTGWPS approach
1: Start
2: Set the maximum iteration T
3: Set the size of population P
4: Initialize random position and velocity for each individual
5: Initialize the parameters a, A, C, c1, c2
6: t ← 1
7: while t < T do
8:   Calculate Z based on Eq. (11)
9:   p ← 1
10:  for p < P do
11:    if Z ≥ 0.5 then
12:      Update the positions according to PSO
13:    else
14:      if rand ≤ 0.5 then
15:        Update the positions according to GWO
16:      else
17:        Update the positions according to TS
18:      end if
19:    end if
20:  end for
21: end while
22: Return the best solution
23: End
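The operator-selection logic at the heart of the loop can be sketched as follows. Two assumptions are made here: Z is taken as 1 − t/T, a decreasing form consistent with the first-half/second-half behavior described above, and the GWO/TS coin-flip threshold on rand is taken to be 0.5; neither value should be read as the paper's exact setting if it differs:

```python
def htgwps_dispatch(t, T, rand):
    """Choose the update operator at iteration t, per the HTGWPS flow:
    PSO while Z = 1 - t/T (assumed form) stays >= 0.5, i.e. the first
    half of the run; afterwards GWO or TS by a coin flip on rand
    (threshold 0.5 assumed)."""
    Z = 1.0 - t / T          # time-varying transition parameter, 1 -> 0
    if Z >= 0.5:
        return "PSO"         # exploration phase: scan the search space
    return "GWO" if rand <= 0.5 else "TS"  # exploit vs. re-diversify
```

Inside the real loop, the returned label would select between the PSO, GWO, and TS position-update equations for every agent in the population.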

DATASETS AND EXPERIMENTAL SETUP
The approach proposed in this paper was evaluated using nine high-dimensional, small-sample medical datasets, illustrated in Table 2 and mentioned in [6], [36], [37]. Interacting with this form of dataset is difficult due to the limited number of samples (observations), which makes the training of the learning model inadequate, as well as the huge number of attributes, which increases the complexity of the search process. The experimental results compare the proposed approach with the other investigated approaches in terms of a set of evaluation metrics: the average number of selected features, the average classification accuracy, the average fitness value, and the average computational time. The average classification accuracy evaluates the classifier's predictive accuracy using the subset of selected features. It is calculated using (12):

Average accuracy = (1/M) Σ_{k=1}^{M} (1/T) Σ_{i=1}^{T} match(A_i, P_i)   (12)

where match(A_i, P_i) = 1 if A_i = P_i and 0 otherwise.
where M is the number of runs, T is the number of instances, and A and P are the actual and predicted classes, respectively. The average number of selected features shows the algorithm's performance in terms of the number of features chosen while solving the FS problem. This metric is calculated using (13):

Average selected features = (1/M) Σ_{k=1}^{M} (c_k / C)   (13)
where M is the number of runs, c_k is the number of features selected in run k, and C is the overall number of features. The average fitness value combines the classification error rate and the rate of feature reduction. It is calculated using (14):

Average fitness = (1/M) Σ_{k=1}^{M} Fit_k   (14)
where M is the number of runs and Fit_k is the fitness value of the best solution in run k, which is calculated using (16). The average computational time is the amount of time needed to accomplish a computational process. It is calculated using (15):

Average computational time = (1/M) Σ_{k=1}^{M} ComTime_k   (15)
where M is the number of runs and ComTime_k is the computational time of run k.
The quality of a solution is defined by two factors: the number of selected features and the classification error recorded when using these features. The objective function that combines these two factors is shown in (16):

Fit = m · cErr + n · (|f| / |F|)   (16)

where cErr is the classification error, |f| is the number of picked attributes, and |F| is the total number of attributes in the dataset. The parameter m ∈ [0, 1] reflects the importance of the classification accuracy, while n = 1 − m reflects the importance of the feature-subset length. The instances were split randomly into 80% training and 20% testing subsets. The proposed approach was implemented in MATLAB. Testing was done on a machine with a 2.2 GHz Intel Core i7 and 8 GB RAM. The findings were gathered over 30 runs with 100 iterations and 10 search agents. The parameters m and n in the fitness equation were set to 0.99 and 0.01, respectively.
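The objective function in (16) can be sketched directly, using the experimental settings m = 0.99 and n = 0.01 stated above (the function name is illustrative):

```python
def fitness(c_err, n_selected, n_total, m=0.99):
    """Eq. (16): weighted sum of the classification error and the
    feature-subset ratio; n = 1 - m, with m = 0.99 per the setup."""
    n = 1.0 - m
    return m * c_err + n * (n_selected / n_total)
```

With m = 0.99, a one-point accuracy gain outweighs almost any reduction in subset size, which matches the paper's statement that accuracy has higher priority than the number of selected attributes.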

RESULTS AND DISCUSSION
A set of comparisons was conducted to evaluate the proposed hybrid HTGWPS approach. In section 5.1, HTGWPS was first evaluated by comparing the four S-shaped TF variants against the original binary versions of PSO and GWO. In section 5.2, the comparisons assess HTGWPS with the four V-shaped TFs against the PSO and GWO parent algorithms. Finally, in section 5.3, the top HTGWPS variants derived from the S- and V-shaped TFs are compared to well-known metaheuristics from the literature.

Evaluating HTGWPS variants based on S-shaped TFs
The results of the experimental evaluations of the HTGWPS variants based on S-shaped TFs are described and analyzed in this section. Table 3 displays the classification accuracy of the HTGWPS-S variants. According to the results, HTGWPS-S1 was ranked the best on most datasets: it attained the best classification accuracy on five datasets and reached 100% accuracy for Leukemia1 and Leukemia2. Table 4 shows the average number of features chosen by each variant for each dataset. The table shows that HTGWPS-S2 outperformed the other competitors on five datasets and recorded the best mean rank, followed by HTGWPS-S4. According to Table 5, the average fitness results also indicate that HTGWPS-S1 was the best S-shaped-TF-based approach, choosing the fewest relevant attributes with the lowest classification error. The outcomes reveal that HTGWPS-S1 is the fittest technique based on the S-shaped TFs, as evidenced by the value of the mean rank. Table 6 shows that PSO is the best approach in terms of the average computational time, followed by GWO; this superiority is due to the hybridization of the proposed approach, which needs more computational time. In this paper, the accuracy rate has a higher priority than the number of selected attributes, so, according to the outcomes, HTGWPS based on the S1 TF yielded the better performance. The findings reveal that HTGWPS-S1 offers a stable trade-off between exploration and exploitation to escape local optima and attain higher fitness results. The strength of the proposed approach is the high exploration ability at the beginning of the search process by using the PSO optimizer, while utilizing GWO gives the approach exploitation ability after half the number of iterations has passed. The TS tactic gives exploration a chance at the end of the search process to avoid being stuck in local optima.

Evaluating HTGWPS variants based on V-shaped TFs
This section presents the results of the HTGWPS variants based on V-shaped TFs in terms of the four evaluation metrics. According to Table 7, HTGWPS-V1 recorded the best average classification accuracy on 89% of the datasets, while HTGWPS-V4 was rated second by the mean-rank value. Table 8 displays the average number of features picked by each V variant. According to the results, GWO chose the fewest features on five datasets, followed by HTGWPS-V1 on four datasets. However, when the mean rank was determined, both techniques had the same overall rank across the nine datasets included in this paper. Table 9 summarizes the average fitness values of all V-shaped HTGWPS techniques. As mentioned previously, the fitness value represents the lowest classification error obtained from the fewest selected features; the strategy with the smallest feature-subset size and the lowest classification error is the most effective. Here, HTGWPS-V1 was rated first on 78% of the datasets. This indicates that the HTGWPS-V1 optimizer can perform a consistent trade-off between exploration and exploitation in order to reach the global optimum. Figure 2 shows the convergence curves of the fitness values over 100 iterations for HTGWPS-V1, GWO, and PSO on the different datasets used in this paper. According to these curves, HTGWPS-V1 obtained the best results on 7 out of 9 datasets. This is evident in Figures 2(a) to (i), since HTGWPS-V1 achieved the lowest fitness values in most curves. Figure 2(a) describes the behavior of the three algorithms on the 11-Tumors dataset. It shows that HTGWPS-V1 curves down over the iterations until it reaches the lowest fitness value, followed by GWO, while PSO got stuck in a local optimum. Figure 2(b) shows the fitness-value curves of the three approaches on the 14-Tumors dataset.
It also shows that HTGWPS-V1 improves the results by minimizing the number of features with a minimum classification error during the iterations, followed by PSO, which dropped into a local minimum. Figure 2(c) shows the convergence curves of the fitness values of the three optimizers on the Brain-Tumor1 dataset. The curves show that HTGWPS-V1 obtained the minimum fitness value after scanning the search space; this is obvious from the behavior of its curve, which slopes down during the iterations, while PSO got stuck in a local optimum in the early iterations and GWO had the highest fitness values. Figure 2(d) shows the fitness-value curves of the three approaches on the Brain-Tumor2 dataset. It shows that HTGWPS-V1 achieved the best fitness value after 40 iterations. As seen in the HTGWPS-V1 curve, the fitness values are near zero, which means the algorithm reached the minimum number of features with a minimum classification error. Figure 2(e) shows the behavior of HTGWPS-V1, GWO, and PSO on the DLBCL dataset. The curves show that HTGWPS-V1 reached the minimum fitness value, followed by GWO, then PSO. Figure 2(f) illustrates the behavior of the three algorithms on the Leukemia1 dataset. HTGWPS-V1 obtained the minimum fitness value during the iterations, followed by PSO, which got stuck in a local optimum after 20 iterations, followed by GWO. Figure 2(g) shows the fitness-value curves on the Leukemia2 dataset. The figure shows that GWO and HTGWPS-V1 reached the minimum number of features with maximum classification accuracy, while PSO could not clearly improve its result over 100 iterations. Figure 2(h) describes the behavior of the three optimizers on the Prostate-Tumor dataset. The curves show that HTGWPS-V1 and GWO obtained the minimum fitness values, near zero, while PSO could not reach the best results over 100 iterations and got trapped in a local optimum.
Figure 2(i) shows the fitness-value curves of the three optimizers on the SRBCT dataset. It shows that the best results were obtained by GWO and HTGWPS-V1, with only slight differences, followed by the PSO optimizer.

Comparing top HTGWPS variants with metaheuristics from the literature
In this section, a comparison of the top-performing S-shaped and V-shaped variants with well-established optimizers from the literature, such as GSA, ant lion optimization (ALO) [33], BA, WOA, HHO, and TLBO, is carried out. According to the preceding two sections, HTGWPS-S1 was the best-performing S-shaped approach, whereas HTGWPS-V1 was the best-performing V-shaped approach. Table 11 clearly shows that HTGWPS-V1 outperformed all other techniques in terms of classification accuracy, as it produced the best results on five datasets, with 100% accuracy on three of them. HTGWPS-S1 outperformed the other techniques on three datasets. In terms of the average number of features, Table 12 shows that WOA performed the best on seven datasets, followed by HTGWPS-V1 on two datasets, with little difference between their results. Table 13 reveals that HTGWPS-V1 outperformed the other techniques in terms of average fitness value on six datasets, followed by HTGWPS-S1. The results show that V1 is the best TF for binarizing HTGWPS, demonstrating superior tuning between global and local search to avoid local optima and attain the global solution. Table 14 displays the average computational times. It shows that BA obtained the shortest time on 78% of the datasets. The comparisons presented in this section are considered fair since all techniques were run in the same experimental environment. The comparisons clearly show that HTGWPS-V1 outperforms the other approaches in terms of average fitness value, which combines the size of the picked feature subset and the classification error obtained from those features.
The binary version of HTGWPS based on the V1 TF obtained very competitive results compared to the other competitors, followed by HTGWPS based on the S1 TF. The main reason is that HTGWPS can achieve a more stable balance between exploration and exploitation. It switches effectively between exploration and exploitation using the time-varying parameter, Z, and the random parameter, rand, to produce explorative behavior not only in the early stages of the search process but also in the last stages. This mechanism reveals the potential of this approach to jump out of local optima.

Impact of feature selection on the classification performance
This section compares the performance of the KNN classifier with and without the proposed HTGWPS approach. Based on Table 15, using HTGWPS-V1 with KNN improves the classifier's classification accuracy while reducing the feature set to the relevant ones. This is also clearly observed in Figure 3.

CONCLUSION
The hybrid approach proposed in this paper handles the complexity of the FS problem in large datasets. It combines the PSO, GWO, and TS techniques in a way that avoids local optima, especially when dealing with high-dimensional datasets. PSO's exploration ability at the beginning of the search helps the algorithm scan the search space for the promising region. The exploitation capability of GWO at the end of the search helps the algorithm converge toward the best solution. TS applies an exploratory phase at the end of the search process to escape local optima and tend toward the global optimum. This approach uses a time-varying transition parameter, named Z; depending on Z's threshold, the solutions are updated based on PSO, GWO, or TS. FS is treated as a binary problem by using TFs to convert continuous solutions to binary ones. Eight variants of the proposed approach were implemented in this paper: four based on S-shaped TFs and four based on V-shaped TFs. To evaluate the proposed variants, a variety of high-dimensional, small-instance medical datasets were used, and the KNN classifier's feedback was utilized to assess the performance of the proposed approach. The experimental findings demonstrated that HTGWPS-V1 outperformed the other investigated variants and well-known optimizers in terms of average fitness value on 67% of the datasets. Future work includes utilizing the TS mechanism with other metaheuristics that suffer from weak global search ability. Also, the Z parameter can be tuned to other thresholds, which may yield better results.