Improved feature selection using a hybrid side-blotched lizard algorithm and genetic algorithm approach

ABSTRACT

Optimization problems can be categorized based on the solution produced [11]: classical algorithms [12] and non-classical algorithms [13]. Classical search techniques are divided into two categories: gradient-based algorithms [14], which are used when the objective function has continuous derivatives, and direct search algorithms [15], which are used with partially continuous or non-differentiable objective functions [16]. One of the main issues facing classical search methods is the vastness of the search space: assuming a dataset comprises k features, there are 2^k potential solutions, which incurs a high computational cost [17]. Metaheuristic approaches are considered useful for optimization problems since they can find good solutions with less computational power and time.
A new meta-heuristic algorithm, called the side-blotched lizard algorithm (SBLA) [44], has been proposed, which emulates the polymorphic population of the lizard. Experimental results showed the superiority of SBLA over some recent meta-heuristic algorithms on several engineering problems. However, like many metaheuristic algorithms, SBLA faces issues such as getting stuck in local minima and achieving a proper trade-off between exploration and exploitation, so further modification and hybridization strategies are required to obtain better results. The main contributions of this work are: i) we developed a binary form of SBLA by using a sigmoid transfer function and ii) we introduced a hybrid method that combines the binary SBLA with GA. The experiments were performed in two phases. First, the hybrid method was compared with SBLA and GA, and the outcomes revealed the superiority of the hybrid method. Second, the hybrid method was evaluated against a variety of well-known algorithms from the literature, including Henry gas solubility optimization (HGSO), the binary dragonfly algorithm (BDA), the binary grey wolf optimizer (BGWO), and the binary whale optimization algorithm (BWOA), and the outcomes demonstrated the advantages of the hybrid approach. The method applied in this study is presented in section 2, while sections 3 and 4 give the results and conclusion, respectively.

METHOD
The SBLA with genetic algorithm (SBLAGA) approach for feature selection in machine learning is specifically tailored for classification tasks and involves the following steps, as described in Figure 1. Firstly, the input data comprises a dataset with more than one feature, a label with a nonnegative value, and features characterized by real-valued numerical descriptions. Secondly, the input data is partitioned randomly into training and testing sets, with 80% of the data assigned to the training set and 20% to the testing set during the holdout cross-validation phase. The SBLAGA algorithm is then employed for feature selection, with the KNN classifier used in each iteration of the algorithm. The aim of the optimization problem is to achieve optimal predictive performance while utilizing the fewest possible features; the best individual is determined by the value of the objective function, with the minimal value indicating the best individual. Lastly, the classifier's performance is assessed. Figure 2 and Algorithm 1 describe the proposed SBLAGA approach. SBLAGA consists of several key stages, including the transfer function, initialization, KNN, and evaluation. These phases are covered in the following subsections.
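As an illustration of the holdout phase described above, the 80/20 random partition might be sketched as follows (a NumPy-based toy, not the authors' code; the function and variable names are our own):

```python
import numpy as np

def holdout_split(X, y, train_frac=0.8, seed=0):
    # Randomly partition the data: 80% training, 20% testing
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(train_frac * len(X))
    return X[idx[:cut]], y[idx[:cut]], X[idx[cut:]], y[idx[cut:]]

# Toy data: 10 samples, 4 real-valued features
X = np.arange(40, dtype=float).reshape(10, 4)
y = np.array([0, 1] * 5)
Xtr, ytr, Xte, yte = holdout_split(X, y)  # 8 training rows, 2 test rows
```

The shuffle before the cut is what makes the split random rather than positional.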

Transfer function
SBLA is typically employed for solving optimization problems that involve continuous variables, whereas feature selection is a type of optimization problem that deals with binary variables. Each lizard position should therefore be transformed into its corresponding binary solution. To transform a continuous search space into a binary one, transfer functions are utilized for mapping purposes. S-shaped and V-shaped are the two categories of transfer functions. The proposed approach uses the sigmoid function described in (1), which is an example of an S-shaped transfer function.
S(x_i^d(t)) = 1 / (1 + e^(-x_i^d(t)))   (1)

where x_i^d(t) represents the position of the i-th lizard in the d-th dimension at iteration t; (1) is applied to determine S(x_i^d(t)). The output of the sigmoid function is still a continuous number in [0, 1], so (2) is used to convert it to a binary one:

x_i^d(t+1) = 1 if r < S(x_i^d(t)), and 0 otherwise   (2)

where r is a value chosen at random from [0, 1].
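The transfer step in (1) and (2) can be sketched as follows (a NumPy illustration with our own function names, not the authors' implementation):

```python
import numpy as np

def sigmoid_transfer(x):
    # Eq. (1): map a continuous position component into [0, 1]
    return 1.0 / (1.0 + np.exp(-x))

def binarize(position, rng):
    # Eq. (2): a component becomes 1 when a uniform random number
    # falls below its sigmoid value, otherwise 0
    r = rng.random(np.shape(position))
    return (r < sigmoid_transfer(position)).astype(int)

rng = np.random.default_rng(42)
bits = binarize(np.array([-50.0, 0.0, 50.0]), rng)
# strongly negative components map to 0 and strongly positive ones to 1
# with near certainty; components near 0 are roughly a coin flip
```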
Algorithm 1. Overview of SBLAGA in pseudo code
1  Set the parameters for the SBLA algorithm, including the maximum number of iterations (max_iter) and the population size.
2  Initialize each lizard position in the population.
3  Transform each lizard position into binary.
4  Evaluate each lizard in the population using the KNN classifier.
5  Generate every subpopulation size.
6  Assign a color to each subpopulation.
7  While (stopping criteria have not been met) do
8      Transform each lizard position into binary and evaluate it using the KNN classifier.
9      Get the current season.
10     Calculate population changes.
11     Apply eliminate, transform, and add particle functions depending on the current season and population changes.
12     Apply defensive search strategy on blue lizards.
13     Apply expansion search strategy on orange lizards.
14     Apply sneaky search strategy on yellow lizards.
15 End
16 Use the returned lizard population as input to GA.
17 Initialize the GA parameters: mutation rate, crossover rate, and number of iterations.
18 Evaluate each lizard in the population using the fitness function.
19 While (stopping criteria have not been met) do
20     Choose two pairs of lizards using the roulette wheel selection operator.
21     Employ the crossover operator with the probability specified by the crossover rate parameter.
22     Employ the mutation operator with the probability specified by the mutation rate parameter.
23     Evaluate the offspring.
24     Update the population with the new offspring.
25     Apply the fitness function to the new population.
26 End
27 Return the best solution in the population.
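The GA-phase operators named in the pseudocode (roulette wheel selection, crossover, and mutation) might be sketched as follows. This is an illustrative NumPy version under assumed details (one-point crossover, bit-flip mutation, and fitness inversion for minimization); it is not the authors' implementation:

```python
import numpy as np

def roulette_wheel(fitness, rng):
    # Selection for a minimization problem: invert fitness so that
    # smaller (better) values receive larger wheel slices (an assumption)
    weights = fitness.max() - fitness + 1e-9
    return rng.choice(len(fitness), p=weights / weights.sum())

def one_point_crossover(a, b, rate, rng):
    # With probability `rate`, swap the tails of the two parents
    # after a random cut point
    if rng.random() < rate:
        cut = int(rng.integers(1, len(a)))
        return (np.concatenate([a[:cut], b[cut:]]),
                np.concatenate([b[:cut], a[cut:]]))
    return a.copy(), b.copy()

def bit_flip_mutation(child, rate, rng):
    # Flip each bit independently with probability `rate`
    mask = rng.random(len(child)) < rate
    out = child.copy()
    out[mask] = 1 - out[mask]
    return out
```

A full GA loop would apply these three operators per iteration, then replace the population with the evaluated offspring.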

Initialization
The lizard population is initially created at random. Each lizard is represented by a vector of size n, where n denotes the number of features in the dataset. Each element of the vector is either 1 or 0: 1 signifies that the corresponding feature has been selected, and 0 indicates that it has not. As illustrated in Figure 3, five features are chosen while the rest are excluded.
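A minimal sketch of this random binary initialization (our own NumPy illustration; the guard against all-zero vectors is an added assumption, not stated in the paper):

```python
import numpy as np

def init_population(pop_size, n_features, rng):
    # Each lizard is a 0/1 vector of length n_features (1 = selected)
    pop = rng.integers(0, 2, size=(pop_size, n_features))
    # Guard (our assumption): force at least one selected feature per lizard
    empty = pop.sum(axis=1) == 0
    pop[empty, rng.integers(0, n_features)] = 1
    return pop

population = init_population(10, 8, np.random.default_rng(7))
```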

K-nearest neighbor (KNN)
KNN is among the most frequently utilized supervised machine learning techniques for classification tasks. KNN classifies a new data point based on the classes of its k nearest neighbors, where k represents the number of nearest neighbors to be considered. KNN is very simple, yet extremely powerful. Figure 4 shows an example of KNN.
Several techniques are available for computing the distance between a new data point and each of the training points. Among the most widely recognized are the Euclidean, Manhattan, and Hamming distances. This paper uses the Euclidean distance, which is computed by taking the square root of the sum of the squared differences between the new point x and the existing point y:

d(x, y) = sqrt( Σ_{i=1}^{n} (x_i − y_i)^2 )   (3)
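The distance computation and the majority vote over the k nearest training points can be illustrated as follows (a toy NumPy sketch with assumed function names, not the authors' implementation):

```python
import numpy as np

def euclidean(x, y):
    # Square root of the summed squared coordinate differences
    return float(np.sqrt(np.sum((x - y) ** 2)))

def knn_predict(X_train, y_train, x_new, k=5):
    # Classify x_new by a majority vote among its k nearest training points
    dists = np.array([euclidean(x_new, row) for row in X_train])
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```

For example, a point near a cluster of class-0 training samples is assigned class 0 by the vote.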

Evaluation
Optimizing the classification accuracy and reducing the number of features are the two objectives in solving the feature selection problem. The evaluation function formulated in [37] addresses these conflicting objectives simultaneously, as in (4):

Fitness = α · γ_R(D) + β · |R| / |N|   (4)

where γ_R(D) is the classification error rate of the KNN classifier, |R| is the number of selected features, |N| is the total number of features, α ∈ [0, 1], and β = 1 − α.
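Under this common formulation (a weighted sum of classification error and the selected-feature ratio), the evaluation function can be sketched as below; the default weight alpha = 0.99 is a typical choice and an assumption on our part, not a value taken from the paper:

```python
def fitness(error_rate, n_selected, n_total, alpha=0.99):
    # Weighted sum of the classification error rate and the ratio of
    # selected features; alpha (and beta = 1 - alpha) weight the two
    # objectives. Smaller fitness values are better.
    return alpha * error_rate + (1.0 - alpha) * n_selected / n_total
```

With the same error rate, a solution that selects fewer features gets a lower (better) fitness value.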

Datasets
In this work, the experiments were conducted on a set of 23 benchmark datasets sourced from the UCI repository. Details regarding the number of features and instances in each dataset can be found in Table 1. It is worth noting that the selected datasets represent a diverse range of real-world problems from various domains such as healthcare and finance. Furthermore, the datasets have been extensively used in the literature for evaluating the effectiveness of metaheuristic algorithms for the feature selection problem, which allows for a fair comparison of our approach against state-of-the-art techniques.

Parameter settings
Each dataset is divided into two sets: the first, comprising 80% of the data, is used as the training set, and the second, comprising 20%, is used as the test set. This partitioning has been used in many previous works. The KNN classifier is evaluated using K-fold cross-validation, and the parameter K of the KNN classifier equals five, as in [45]. For all experiments, the parameters were set as follows: a maximum of 200 iterations, a population size of 10, and a dimension corresponding to the number of features. The common parameters for all the algorithms are presented in Table 2. Each algorithm was executed 10 times with a random seed on a computer equipped with an Intel® Core™ i5-6500 processor with a clock speed of 3.20 GHz and 16 GB of RAM.
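The K-fold splitting used to evaluate the KNN classifier might be sketched as follows (our own NumPy illustration; function names are assumed):

```python
import numpy as np

def kfold_indices(n_samples, k=5, seed=0):
    # Shuffle the sample indices and split them into k near-equal folds;
    # each fold serves once as the validation set, with the remaining
    # folds used for training
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

folds = kfold_indices(20, k=5)  # 5 disjoint folds of 4 indices each
```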

Comparison of SBLA, GA, and SBLAGA
In this section the performance of SBLA, GA, and SBLAGA is compared in terms of average classification accuracy and the average number of selected features. In the proposed method SBLAGA, GA begins to execute after SBLA terminates, and the final solution from SBLA is used as the initial solution for GA. Table 3 presents the comparison among the three algorithms on the 23 datasets. The proposed SBLAGA algorithm outperforms both SBLA and GA on 19 datasets in terms of average classification accuracy. It is important to note that although there is only a small discrepancy in the number of selected features between SBLA and SBLAGA, the difference in average classification accuracy between the two is significant; therefore, SBLAGA is still considered the better algorithm.

Comparison with other meta-heuristic-based approaches
The objective of this section is to compare the hybrid algorithm SBLAGA with other optimization algorithms. The algorithms used in the comparison are popular population-based algorithms commonly used in the literature; the average fitness values and classification accuracies are presented in Tables 4 and 5, respectively. SBLAGA obtained the best average fitness value and classification accuracy on 18 datasets, while HGSO is superior on 3 datasets and BDA on 2 datasets. The standard deviations in Tables 4 and 5 also indicate that SBLAGA behaves more robustly than the other algorithms on almost all the datasets. Table 6 presents the average number of selected features; SBLAGA obtained the lowest average number of selected features across the datasets.

CONCLUSION
In this work, SBLAGA was introduced as a hybrid feature selection approach. Twenty-three benchmark datasets from the UCI repository were collected to compare the performance of the proposed approach with GA and the original SBLA. The experimental results indicate that SBLAGA outperformed both GA and SBLA in terms of average classification accuracy. SBLAGA was then compared with recent well-known meta-heuristic algorithms for feature selection, including BGWO, BDA, HGSO, and BWOA. The experiments were conducted on the same datasets, measuring average classification accuracy, fitness value, and number of selected features, and SBLAGA outperformed the four algorithms on these metrics. In future studies, a potential direction for improvement would be to parallelize the algorithm, particularly for handling high-dimensional datasets, in order to reduce computation time. Other classifiers, such as neural networks and SVM, could be used to further investigate the proposed algorithm, and real-world problems such as spam email detection and medical diagnosis could be explored using the proposed approach.

BIOGRAPHIES OF AUTHORS
Amr Abdel-aal received the bachelor's degree from Zagazig University, in 2017. He is currently pursuing the master's degree with the Faculty of Computer and Informatics, Zagazig University, Egypt. His research interests include multi-objective optimization, evolutionary algorithms, computational intelligence, and natural language processing. He can be contacted at email: AMEbrahem@fci.zu.edu.eg.

Ibrahim El-Henawy received the M.S. and Ph.D. degrees in computer science from the State University of New York, USA, in 1980 and 1983, respectively. Currently, he is a professor in the computer science department, Zagazig University. His current research interests are mathematics, networks, artificial intelligence, optimization, digital image processing, and pattern recognition. He can be contacted at email: ielhenawy@zu.edu.eg.