Neighborhood search methods with moth optimization algorithm as a wrapper method for feature selection problems

ABSTRACT


INTRODUCTION
Feature selection methods select the most relevant subset of features in a dataset, which leads to better learning performance. Eliminating a feature does not mean it carries no information; it may still have considerable statistical relations with other features [1]. Feature selection is therefore an important step during analysis and evaluation, and many feature selection algorithms have been developed and are widely used by scientists and researchers in experimental work. Feature selection methods are divided into three types depending on their relation to the classifier [2, 3]: filter methods, which select valuable features based on overall characteristics of the data regardless of the classifier; wrapper methods, which use optimization techniques to optimize both the prediction process and the selected features; and embedded methods, in which feature selection is connected to the classification, retaining the advantage of wrapper methods (interaction with the classifier) while consuming fewer computing resources, as filter methods do [2-4]. Moreover, embedded methods are more robust than wrapper methods, because feature selection is built into the classifier architecture and the classifier itself delivers the criterion for feature selection [5].

PROPOSED ALGORITHMS
This section discusses the Moth-Flame Optimization algorithm and the proposed Modified MFO algorithm for the feature selection problem.

Moth-flame optimization algorithm
Moth-Flame Optimization (MFO) is a population-based metaheuristic algorithm developed by Mirjalili in 2015 [15]. The MFO algorithm mimics the transverse orientation behavior of moths in nature [15]. Real moths use an intelligent flying method at night to travel in a straight line over long distances, keeping a fixed angle toward the moon as a light source. However, when a moth sees an artificial light, it keeps a similar angle and becomes stuck in a spiral path around it, because the artificial light is nearby compared to the light of the moon; this behavior is shown in Figure 1, where each moth eventually converges toward the light [15]. The MFO algorithm consists of two main components, moths and flames. Each moth represents a candidate solution, and the variables of the given problem are the positions of the moths in the search space. A moth is a search agent that carries out the search, so the moths can explore a multidimensional space by updating their positions. The set of n moths in d dimensions is represented by an array M = [M1, M2, ..., Mn], with a corresponding fitness array OM = [OM1, OM2, ..., OMn], where n is the number of moths and d is the dimension. The flames maintain the best positions found so far by each moth and are stored in an array F = [F1, F2, ..., Fn], similar to the moths' array, with a corresponding fitness array OF = [OF1, OF2, ..., OFn]. Having both the moth and flame arrays, each moth searches around its flame (best position) and updates it if a better solution is found. The general representation of the MFO algorithm is MFO = (I, P, T).
The MFO algorithm is a three-tuple of procedures. I is a procedure that randomly initializes the population using the following formula:

M(i, j) = (ul(i) - ll(i)) * rand() + ll(i)    (1)

In (1), ul(i) and ll(i) are the upper and lower bounds of variable i, and the objective function value of the moths is given by:

OM = FitnessFunction(M)    (2)

P is a procedure responsible for searching for neighbor solutions of the moths until the termination condition T is met, where T is a procedure that returns whether the termination condition holds. The main idea of the MFO algorithm is modeling the transverse orientation behavior. A moth updates its position with respect to a flame as follows:

Mi = S(Mi, Fj)

where Mi is the moth with index i, S is the spiral function, and Fj is the flame with index j. For the MFO algorithm, a logarithmic spiral is given by (3):

S(Mi, Fj) = Di * e^(b*t) * cos(2*pi*t) + Fj    (3)

where Di represents the distance between the i-th moth Mi and the j-th flame Fj, calculated by Di = |Fj - Mi|. In (3), t is a random number in [-1, 1] and b is a constant defining the shape of the logarithmic spiral. The procedure P is the main procedure by which the moths explore the search space. The pseudocode of the MFO algorithm is shown in Figure 2.
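The initialization procedure I and the spiral move S can be sketched in a few lines of Python. The function names, the per-dimension treatment, and the default b = 1.0 below are illustrative assumptions, not part of the original algorithm description.

```python
import math
import random

def init_population(n, d, ll, ul):
    """Procedure I of (1): M(i, j) = (ul(j) - ll(j)) * rand() + ll(j)."""
    return [[(ul[j] - ll[j]) * random.random() + ll[j] for j in range(d)]
            for _ in range(n)]

def spiral_move(m_ij, f_ij, b=1.0, t=None):
    """One dimension of the logarithmic spiral (3):
    S(Mi, Fj) = Di * e^(b*t) * cos(2*pi*t) + Fj, with Di = |Fj - Mi|."""
    if t is None:
        t = random.uniform(-1.0, 1.0)  # t is drawn randomly from [-1, 1]
    d_i = abs(f_ij - m_ij)             # distance between moth and flame
    return d_i * math.exp(b * t) * math.cos(2 * math.pi * t) + f_ij
```

Note that with t = 0 the spiral factor e^(b*t) * cos(2*pi*t) equals 1, so the moth lands a distance Di beyond the flame, while negative values of t pull it in closer to the flame.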

Modified MFO Algorithm for feature selection problem
The MFO algorithm is used to select the optimal features from a given dataset. In MFO, after a number of iterations, the moths (M) are influenced by their corresponding flames F (best positions) and stop moving toward F; if F does not change, the moths cluster around F. Because the solution of the feature selection problem is an array of 0s and 1s, there is a high probability of generating the same or similar new individuals under this updating strategy. If all or most individuals are replaced by newly generated individuals in each iteration, it becomes difficult to maintain population diversity as the number of iterations increases, and the algorithm may get trapped in local optima. To prevent the moths from getting trapped in local optima and to maintain population diversity, we propose three simple neighborhood methods: NBChange, NBMove, and NBSwap. In the Modified MFO algorithm, we use the parameter limit as a stagnation condition: if the moths are unable to improve the best solution after limit iterations, they are considered stuck in a local optimum. To allow the moths to leave the local optimum once limit is reached, we apply the proposed neighborhood methods to all moths; each moth then restarts from its resulting position and continues to search the space, as shown in Figure 3. The solution of the feature selection problem is represented as a binary vector; for example, if the dataset contains 10 features, the vector is Sol = [1, 0, 1, 0, 1, 0, 1, 1, 0, 1], where 1 means the feature is selected and 0 means it is not. Using this solution, the proposed neighborhood methods work as follows:
a. NBChange selects a random feature and inverts its value with the NOT operator. For example, if the randomly selected feature is the third feature in Sol, whose value is 1, the NOT operator changes it to 0, giving the new solution [1, 0, 0, 0, 1, 0, 1, 1, 0, 1].
b. NBMove selects a random feature and moves its value to a new position. For example, if the randomly selected feature is the first feature in Sol, whose value is 1, it is moved to a new random position (say, the fifth position).
c. NBSwap selects two random positions (features) and swaps their values. For example, if the two randomly selected features are the third and sixth features in Sol, the new solution is [1, 0, 0, 0, 1, 1, 1, 1, 0, 1].
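The three neighborhood methods above can be sketched directly on a binary solution list. The function names follow the method names in the text; the optional index arguments (used instead of purely random picks) are added for illustration, and the pop-and-insert reading of NBMove is an assumption about its exact semantics.

```python
import random

def nb_change(sol, i=None):
    """NBChange: invert the value at one (random) position with NOT."""
    i = random.randrange(len(sol)) if i is None else i
    out = sol[:]
    out[i] = 1 - out[i]
    return out

def nb_move(sol, src=None, dst=None):
    """NBMove: remove the value at one position and re-insert it at another."""
    src = random.randrange(len(sol)) if src is None else src
    dst = random.randrange(len(sol)) if dst is None else dst
    out = sol[:]
    out.insert(dst, out.pop(src))
    return out

def nb_swap(sol, i=None, j=None):
    """NBSwap: exchange the values at two (random) positions."""
    i = random.randrange(len(sol)) if i is None else i
    j = random.randrange(len(sol)) if j is None else j
    out = sol[:]
    out[i], out[j] = out[j], out[i]
    return out

sol = [1, 0, 1, 0, 1, 0, 1, 1, 0, 1]
# NBChange on the third feature (index 2) flips its 1 to 0:
# nb_change(sol, 2) -> [1, 0, 0, 0, 1, 0, 1, 1, 0, 1]
```

All three methods return a new vector of the same length, so the number of selected features changes only under NBChange.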

EXPERIMENTAL RESULTS AND DISCUSSIONS
This section investigates the effectiveness and robustness of the proposed Modified MFO. The Modified MFO is also compared with other population-based algorithms, namely the Genetic Algorithm (GA), Particle Swarm Optimization (PSO), and the Firefly Algorithm (FFA), tested on 8 datasets with diverse characteristics.
The parameters used are: number of iterations 100, population size 20, and limit 20. Table 1 gives brief details of the eight datasets used in this work. They are well-known standard datasets retrieved from the UCI repository [32] and have been considered in several well-confirmed works. The main attributes of these datasets are the number of features (Features), the number of instances (Instances), and the number of classes, as shown in Table 1. In this study, the instances in each dataset are split into training and testing sets, with 80% of the instances used for training and 20% for testing, following Friedman et al. [33]. All runs and experimental results in this research were produced on a PC with an Intel i5-5200U 2.2 GHz CPU and 8.0 GB of RAM. The reported results are averages over 30 independent runs. Table 2 reports the number of selected features (# of features) and the best obtained accuracy (ACC), comparing the Modified MFO algorithm against the GA, PSO, FFA, and MFO algorithms. From Table 2 it can be seen that the Modified MFO algorithm outperforms the other algorithms in terms of ACC on 50% of the datasets and is comparable (same accuracy) on the other 50%. The GA could not achieve high accuracy compared with the PSO, FFA, and MFO algorithms, which match the ACC of the Modified MFO algorithm on 3, 1, and 4 datasets, respectively. Table 2 also shows that the Modified MFO algorithm outperforms the other algorithms in terms of the number of features on only 3 datasets, while PSO achieves the best results on 5 datasets. Figure 4 illustrates the comparison between GA, PSO, FFA, MFO, and Modified MFO in terms of best accuracy and number of selected features.
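The wrapper evaluation behind these results (split the data, train a classifier on the selected features only, and score it on held-out instances) can be sketched as follows. The paper scores subsets with a classifier's accuracy; to keep this sketch self-contained, a tiny 1-NN classifier stands in for the classifier actually used, and the function names and toy data are illustrative assumptions.

```python
import math

def mask_row(row, mask):
    """Keep only the features selected by the binary mask."""
    return [v for v, m in zip(row, mask) if m == 1]

def predict_1nn(train_X, train_y, x):
    """Classify x by its nearest training point (Euclidean distance)."""
    best = min(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return train_y[best]

def wrapper_fitness(mask, train_X, train_y, test_X, test_y):
    """Accuracy of the classifier using only the features selected by mask."""
    if sum(mask) == 0:
        return 0.0  # an empty feature subset scores zero
    tr = [mask_row(r, mask) for r in train_X]
    te = [mask_row(r, mask) for r in test_X]
    hits = sum(predict_1nn(tr, train_y, x) == y for x, y in zip(te, test_y))
    return hits / len(test_y)
```

Each candidate binary vector produced by the moths is scored this way, so the search is driven by classification accuracy rather than any filter statistic.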
Table 3 displays the average accuracy achieved by the GA, PSO, FFA, MFO, and Modified MFO algorithms. The Modified MFO algorithm obtained the best average results on five datasets (German, SpectF, Sonar, Parkinsons, and WBC); on the Heart dataset, MFO and Modified MFO achieved the same average, while PSO and MFO obtained the highest averages on the Ionosphere and Heart datasets, respectively. The highest average accuracies are shown in bold. Figure 4 (a and b) compares the Modified MFO and the other techniques in terms of the best obtained accuracy and the number of selected features. Figure 4 shows that the proposed technique was very close in accuracy to the comparable methods, if not better in some cases, such as the Parkinsons, SpectF, Sonar, and WBC datasets. In the same sense, Figure 5 compares the proposed technique with the other approaches in terms of the average accuracy of 30 runs for the best obtained results; again, the results of the proposed method were equal to or better than the best results of the other methods. Figure 4(b) shows that the proposed technique gave the best performance on most of the datasets in terms of reducing the number of selected features. The comparison of the Modified MFO algorithm with the GA, PSO, FFA, and MFO algorithms using tenfold cross-validation is illustrated in Table 4, which reveals that the Modified MFO algorithm achieves the highest results on 3 out of 8 datasets (Ionosphere, Sonar, and WDBC) and the same results as one or more of the other algorithms on 3 datasets (German, Parkinsons, and WBC). This finding indicates that the Modified MFO algorithm obtains the highest accuracy on 38% of the datasets compared with the results gained by the other algorithms, while the FFA algorithm gains the highest accuracy on the SpectF dataset and the GA on the Heart dataset.
Figure 6 (a and b) compares the Modified MFO and the other approaches under tenfold cross-validation in terms of accuracy and number of selected features, respectively. From Figure 6 it can be observed that the proposed technique is highly acceptable in terms of both accuracy and number of selected features. Table 5 compares the results obtained by the different approaches with those of the Modified MFO algorithm based on average accuracy; the Modified MFO algorithm achieves the best average results on five datasets (German, SpectF, Sonar, WDBC, and WBC). The highest average accuracies are shown in bold. The results of the Mann-Whitney statistical test are reported in Tables 6 and 7; such statistical tests can show whether the observed differences and improvements are significantly meaningful. Table 6 displays the superiority of the Modified MFO in terms of average accuracy over the other competitors, which is statistically significant for most cases except for the MFO technique and some cases of the other approaches. The levels of marginal significance (p-values) of the Mann-Whitney test with respect to the number of features are shown in Table 7: the observed differences between the Modified MFO and GA algorithms are statistically significant for all datasets, and are statistically significant for most other competitor techniques except the MFO technique. Figure 10 shows the convergence behavior of (a) MFO and (b) the Modified MFO, where the x-axis denotes the number of iterations and the y-axis denotes the error rate measured by the SVM classifier. From Figure 10 it can easily be seen that the MFO algorithm has a very fast convergence speed and gets stuck in premature convergence [35]. Dorronsoro et al. (2013) stated that when an algorithm converges quickly, it can get stuck in a local optimum [36]. For example, the first graph in Figure 10 shows the behavior of MFO and Modified MFO on the Parkinsons dataset, where the MFO solution is not improved from iteration 5 to iteration 38.
In the Modified MFO on the same dataset, by contrast, the algorithm converges smoothly and produces better results. The MFO algorithm reaches an error rate of 13.54 at iteration 5, while the Modified MFO produces the same error rate only after iteration 70; this shows that after applying the neighborhood methods to solutions that are unable to improve, the convergence of the solutions slows down slightly, allowing a more effective search of the space and producing better final results. The Modified MFO technique showed equal or better accuracy and a better number of selected features on most of the datasets used, but not all the observed differences are statistically significant in comparison with all competitors.
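The Mann-Whitney comparisons behind Tables 6 and 7 can be reproduced with a small routine. The sketch below computes the U statistic with average ranks for ties and a two-sided p-value from the normal approximation without tie correction; the function name and that approximation are assumptions of this sketch, not details from the paper.

```python
import math

def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test using the normal approximation.

    Returns (U, p), where U is the statistic for sample a. Tied
    observations receive average ranks; no tie correction is applied
    to the variance, so the p-value is approximate."""
    combined = sorted((v, src) for src, sample in ((0, a), (1, b))
                      for v in sample)
    values = [v for v, _ in combined]
    n = len(values)
    rank_sum_a = 0.0
    pos = 0
    while pos < n:
        end = pos
        while end + 1 < n and values[end + 1] == values[pos]:
            end += 1
        avg_rank = (pos + end) / 2 + 1  # ranks are 1-based
        for k in range(pos, end + 1):
            if combined[k][1] == 0:     # observation came from sample a
                rank_sum_a += avg_rank
        pos = end + 1
    n1, n2 = len(a), len(b)
    u = rank_sum_a - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    p = math.erfc(abs(u - mu) / sigma / math.sqrt(2))
    return u, p
```

In practice one would feed the 30 per-run accuracies of two algorithms on one dataset into such a test and report the resulting p-value, as in Tables 6 and 7.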

COMPARISON OF MODIFIED MFO ALGORITHM WITH THE STATE-OF-THE-ART APPROACHES
This section compares the best results obtained by the Modified MFO algorithm with the best-known solutions in the literature for the eight tested datasets. Table 8 compares the best-known results of the Modified MFO algorithm with those of different algorithms from the literature. Accuracy is used as the main objective in comparing the performance of the algorithms, and the best accuracy is shown in bold. As observed from Table 8, the Modified MFO achieved values very close to most competitors in terms of accuracy, and better than others on some datasets, taking into consideration that the compared results are taken from the different original works.

CONCLUSION
In this paper, we presented a Modified MFO algorithm with neighborhood search methods for feature selection problems. The algorithms in this work were applied to a benchmark of 8 standard UCI datasets, and the results of the Modified MFO algorithm were compared with four methods from the literature. The proposed method demonstrated its superiority by helping to avoid premature convergence and allowing the algorithm to jump out of local optima. The experiments show that the neighborhood search methods are suitable for improving the results of the proposed algorithm, which performs well compared with the basic MFO algorithm and with state-of-the-art approaches.