Hybrid filtering methods for feature selection in high-dimensional cancer data



INTRODUCTION
High-dimensional data are data in which the number of variables or features exceeds the number of observations, as is common in breast and prostate cancer studies. Statisticians in academia and industry encounter such data daily. Researchers have long worked with data that have many features, but advances in data storage and computing capacity have made high-dimensional data common in various fields, such as genetics, signal processing, and finance. The accuracy of classification algorithms tends to decline on high-dimensional data due to a phenomenon known as the curse of dimensionality. When the dimensionality increases, the volume of the space expands rapidly, and the available data become sparse.
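The sparsity effect described above can be illustrated with a minimal sketch. It assumes points drawn uniformly from a unit hypercube, which is not the study's data, and the function name `distance_contrast` is a hypothetical helper introduced only for this illustration. It compares the nearest and farthest neighbour of one point: as the dimension grows, the two distances become almost equal, so neighbourhoods stop being informative.

```python
import math
import random


def distance_contrast(n_points, n_dims, seed=0):
    """Return nearest / farthest Euclidean distance from the first point.

    Points are drawn uniformly from the unit hypercube (an assumption made
    for illustration). A ratio close to 1 means every other point looks
    almost equally far away.
    """
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    query = points[0]
    distances = [math.dist(query, other) for other in points[1:]]
    return min(distances) / max(distances)


low = distance_contrast(n_points=50, n_dims=2)
high = distance_contrast(n_points=50, n_dims=500)
print(f"nearest/farthest ratio in 2 dimensions:   {low:.3f}")
print(f"nearest/farthest ratio in 500 dimensions: {high:.3f}")
```

With the same number of points, the ratio in 500 dimensions is expected to be much closer to 1 than in 2 dimensions, which is one way to see why distance-based reasoning degrades when features far outnumber samples.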
Furthermore, identifying regions where objects form groups with similar attributes is widely used to organize and search data. In high dimensions, however, the data become sparse, which increases errors. Designing an algorithm for high-dimensional data is therefore challenging: the ability of an algorithm to give a precise result and converge to an accurate model decreases as dimensionality increases.
Int J Elec & Comp Eng, ISSN: 2088-8708. Hybrid filtering methods for feature selection in high-dimensional cancer data (Siti Sarah Md Noh)
The next challenge in modeling high-dimensional data is to avoid overfitting. It is critical to develop a classification model that is capable of generalization: the model must perform well on both training and testing data. However, the small number of samples in high-dimensional data can cause the classification model to overfit the training data, resulting in poor generalization. To avoid these issues, feature selection must be applied to high-dimensional data beforehand so that only the significant features are retained. The problem is to determine the most efficient method for identifying the relevant features with the least loss across different sample sizes. Several studies on high-dimensional classification have been published in the literature. Liang et al. [1] proposed conditional mutual information-based feature selection with interaction to reduce performance error [2]. Tally et al. [3] used genetic algorithm feature selection with a support vector machine classifier for intrusion detection, while Sagban et al. [2] investigated the performance of feature selection applied to cervical cancer data. Ibrahim and Kamarudin [4] applied a filter feature selection method to improve heart failure data classification.
Guo et al. [5] proposed weighting and ranking-based hybrid feature selection to select essential risk factors for ischemic stroke detection, and Cekik et al. [6] developed a proportional rough feature selector to classify short text. Many researchers focus on fused feature selection methods, leaving traditional techniques such as filter, wrapper, and embedded methods understudied. Here, a traditional feature selection method is one without hybrid or novel components. Way et al. [7] also investigated how small sample size affects feature selection and classification accuracy, but more research is needed to understand how small sample sizes affect high-dimensional data. Wah et al. [8] compared information gain and correlation-based feature selection with wrapper sequential forward selection and sequential backward elimination to maximize classifier accuracy, but did not include the embedded technique. Much of this research topic therefore requires further investigation. This study proposes the best integration method for evaluating high-dimensional data classification performance, with the expectation that breast cancer classification would improve when only key features are used.
The rest of the paper is organized as follows: section 2 explains the material and method. Section 3 presents the results and the discussion of the experiment. Section 4 constitutes the conclusion.

MATERIAL AND METHOD
This section describes the structure, method, and procedure of the study. The methodology includes data collection, preprocessing, feature extraction, and modeling. The breast cancer data come from patients with T1T2N0 breast carcinoma studied at the Marie Curie Institute in Paris [9]. Prostate cancer data, first analyzed by Singh et al. [10], are also used. The data are preprocessed to remove unnecessary information.
The data are then divided into training and testing sets, and the training set is filtered using hybrid information gain (hybrid IG) and hybrid chi-square to select essential features. Figure 1 depicts the conceptual research methodology for the study. To begin, two high-dimensional data sets, breast and prostate cancer, are loaded into the R software. The data are preprocessed, which includes detecting, removing, or replacing missing values with appropriate ones and checking for redundant ones. Next, random subsamples of 100 and 75 observations are drawn alongside the full sample, and each is divided into 70% for training and 30% for testing [11], [12]. Training data samples are used as input for the classifier to learn the features and build a new learning model [13].
Meanwhile, the learning model is used during the testing phase to predict the test data. After the data have been split, the training data are filtered using two filter selection methods: information gain and chi-square. During this process, the important features are identified and the number of features is reduced based on their ranking. The ranking is sorted by the weight of the individual features, with a higher weight indicating a more important feature. The identified essential features, with d = 50, 22, and 10, where d is the dimension of the reduced feature set, are then fed into the logistic regression data mining algorithm, from which a new model is built. Finally, the testing data are used to make predictions, and the classifier's performance is evaluated using accuracy, sensitivity, specificity, and precision.
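The subsampling and 70/30 split described above can be sketched as follows. This is a minimal illustration in Python rather than the R workflow used in the study; the function name `subsample_and_split`, the fixed seed, and the use of a plain (non-stratified) random split are assumptions, since the text does not state how the partition was drawn.

```python
import random


def subsample_and_split(n_total, n_sub=None, train_frac=0.7, seed=0):
    """Return (train_idx, test_idx) row indices for one sample-size setting.

    n_sub=None keeps every row (the "full sample" configuration); otherwise
    n_sub rows are drawn at random without replacement before splitting.
    """
    rng = random.Random(seed)
    indices = list(range(n_total))
    if n_sub is not None:
        indices = rng.sample(indices, n_sub)
    rng.shuffle(indices)
    n_train = round(train_frac * len(indices))
    return indices[:n_train], indices[n_train:]


# 168 is the breast cancer sample count reported in the text.
for n_sub in (None, 100, 75):
    train_idx, test_idx = subsample_and_split(168, n_sub)
    print(n_sub or "full", len(train_idx), len(test_idx))
```

Filtering would then be fitted on the training indices only, so that the test rows play no part in choosing the features.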

Data acquisition
The breast cancer data were first explored in [9]. They were collected over ten years, from 1989 to 1999, from patients with pT1T2N0 breast carcinoma at Institute Curie. The data contain clinicopathological characteristics of the tumor and gene expression, with two classes: patients with no event after diagnosis were labeled good, and patients with early metastasis were labeled poor. The data consist of 2,905 genes with only 168 samples. Recent studies show that many researchers have used breast cancer data to tackle the problem of high-dimensional data [14], [15]. The second data set is the prostate cancer data, initially analyzed in [10]. It was gathered from 1995 to 1997 from Brigham and Women's Hospital patients who were undergoing radical prostatectomy. The data contain 102 gene expression patterns, of which 50 are from normal prostate specimens and 52 are from tumors. The data, a collection of gene expression measurements based on oligonucleotide microarrays, contain roughly 12,600 genes. Many past studies have used the prostate cancer data to investigate the classification of high-dimensional data [16]-[19]. Table 1 summarizes these high-dimensional cancer data.

Data preparation
Data preparation consists of preprocessing, identifying the sample size, and selecting features. Preprocessing entails cleaning the data by detecting, removing, or replacing missing values with appropriate ones and examining redundant ones. The sample sizes are determined based on a previous study [20]: this study uses 75, 100, and the full sample size of each data set as sample size configurations. The top 50, top 22, and top 10 important features are retained; these feature counts are based on industry standards [21].

Filtering steps
Filtering methods are chosen for the feature selection process. Two filtering methods were employed to obtain three subsets of essential features. The filter selection method is a feature ranking technique that assesses the features independently, based on the characteristics of the data, without involving any learning algorithm or classifier. Each variable is scored using a suitable ranking criterion that assigns a weight to each feature. Features whose weight falls below the threshold value are deemed unimportant and removed [22]. The reduced feature set is then fed into the learning algorithm to assess performance.
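The ranking step can be sketched in a few lines. The sketch below assumes the filter has already produced one weight per feature; the function name `top_d_features`, the gene names, and the weights are all hypothetical and serve only to show how a top-d subset is cut from a ranking.

```python
def top_d_features(weights, d):
    """Return the names of the d highest-weighted features, best first."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:d]]


# Hypothetical filter weights for five made-up gene names.
weights = {
    "gene_a": 0.02,
    "gene_b": 0.41,
    "gene_c": 0.00,
    "gene_d": 0.17,
    "gene_e": 0.33,
}
print(top_d_features(weights, d=3))  # ['gene_b', 'gene_e', 'gene_d']
```

Keeping a fixed number d of features, as here, is equivalent to choosing the threshold as the weight of the d-th ranked feature.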

Information gain
Based on the literature review, information gain is one of the most widely used univariate filters for evaluating attributes [8], [23], [24]. This filter method analyzes a single feature at a time based on the information gained. It uses entropy measures to rank the variables and is computed as the change in entropy from a previous state to the state in which a feature's value is known. The entropy of the class feature Y is presented in (1).

$$H(Y) = -\sum_{y \in Y} p(y)\,\log_2 p(y) \tag{1}$$

where p(y) denotes the marginal probability density function of the random variable Y. If the observed values of Y in the training data are partitioned according to a second feature X, and the entropy of Y with respect to the partitions induced by X is less than the entropy of Y before partitioning, then there is a relationship between features Y and X. The entropy of Y after observing X is stated in (2).

$$H(Y \mid X) = -\sum_{x \in X} p(x) \sum_{y \in Y} p(y \mid x)\,\log_2 p(y \mid x) \tag{2}$$

Once both entropies are computed, their difference gives the gain value. The reduction in the entropy of Y obtained from X is known as information gain and is calculated as in (3).

$$IG(Y; X) = H(Y) - H(Y \mid X) \tag{3}$$

Each feature is ranked by its own information gain value; a higher value indicates that the feature holds more information. After obtaining the information gain values, a threshold is needed to select the important features in ranked order. However, the weakness of information gain is that it does not remove redundant features and is biased toward features that take many values, even when they hold little information.
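The entropy difference described in this subsection can be sketched for one discrete feature. The code assumes the feature has already been discretized (gene expression values are continuous, and the discretization used in the study is not specified here); the function names and the toy labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy after partitioning by the feature."""
    n = len(labels)
    groups = {}
    for x, y in zip(feature_values, labels):
        groups.setdefault(x, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


labels = ["good", "good", "poor", "poor"]
separating = ["low", "low", "high", "high"]   # splits the classes perfectly
uninformative = ["low", "high", "low", "high"]  # same class mix in each group

print(information_gain(separating, labels))     # 1.0
print(information_gain(uninformative, labels))  # 0.0
```

A feature that separates the two classes perfectly recovers the full one bit of class entropy, while a feature whose groups have the same class mix as the whole sample gains nothing.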

Chi-square
Chi-square is a univariate filter algorithm that uses a test of independence to measure each feature's merit after applying a discretization algorithm [25]. This method assesses each feature individually by calculating the chi-square statistic with respect to each class [26], [27]. A relevant feature will have a high chi-square value for each class. The measure is shown in (4).

$$\chi^2 = \sum_{i=1}^{m} \sum_{j=1}^{k} \frac{\left(A_{ij} - \frac{R_i B_j}{N}\right)^2}{\frac{R_i B_j}{N}} \tag{4}$$

In (4), m denotes the number of intervals, k is the number of classes, N denotes the total number of instances, R_i refers to the number of instances in the i-th range, B_j is the number of instances in the j-th class, and A_ij is the number of instances in the i-th range and j-th class.
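The chi-square score of one discretized feature can be sketched from its interval-by-class contingency table. The sketch uses the standard test-of-independence statistic, in which the expected count of a cell is its row total times its column total divided by the number of instances; the function name `chi_square` and the two toy tables are hypothetical.

```python
def chi_square(table):
    """Chi-square statistic for a contingency table.

    table[i][j] is the number of instances that fall in interval i of the
    feature and belong to class j.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip empty rows or columns
                stat += (observed - expected) ** 2 / expected
    return stat


# Feature intervals line up exactly with the two classes.
print(chi_square([[10, 0], [0, 10]]))  # 20.0
# Feature intervals are independent of the classes.
print(chi_square([[5, 5], [5, 5]]))    # 0.0
```

Features are then ranked by this statistic in the same way as by information gain, with larger values indicating stronger dependence between the feature and the class.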

Logistic regression model
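Logistic regression assigns each independent variable a coefficient that explains its contribution to variation in the dependent variable. If the response is "Yes," the dependent variable takes the value 1; otherwise, it takes the value 0. The model of predicted probabilities is expressed as the natural logarithm (ln) of the odds ratio, and the linear logistic model is shown in (5) [28].

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k \tag{5}$$

and

$$p = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k}} \tag{8}$$

where p is the predicted probability that the dependent variable equals 1, x_1, ..., x_k are the independent variables, and β_0, ..., β_k are the coefficients.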
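How a fitted logistic model turns selected features into a class label can be sketched as below. The coefficients, intercept, and feature values are hypothetical, and the 0.5 decision threshold is an assumption, since the text does not state the cut-off used. The expression 1 / (1 + exp(-z)) is algebraically the same as exp(z) / (1 + exp(z)).

```python
import math


def predict_probability(coefficients, intercept, features):
    """Probability that the response is 1 under a logistic model."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-linear))


def predict_class(coefficients, intercept, features, threshold=0.5):
    """Return 1 when the predicted probability reaches the threshold, else 0."""
    p = predict_probability(coefficients, intercept, features)
    return 1 if p >= threshold else 0


# Hypothetical model with two selected features.
coefficients = [0.8, -1.2]
intercept = 0.5
features = [1.0, 2.0]  # linear predictor: 0.5 + 0.8 - 2.4 = -1.1

p = predict_probability(coefficients, intercept, features)
print(round(p, 4))
print(round(math.log(p / (1 - p)), 4))  # log-odds recovers the linear predictor
print(predict_class(coefficients, intercept, features))
```

Taking the log-odds of the predicted probability returns the linear predictor, which is the relationship the linear logistic model expresses.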

Performance measures
The performance of the algorithm is evaluated from a matrix containing the numbers of instances correctly and incorrectly classified for each class, called a confusion matrix. It is a table with two rows and two columns that displays the number of true positives (TP), true negatives (TN), false negatives (FN), and false positives (FP). Accuracy, the most used metric in past studies [29], [30], is the percentage of correct predictions and shows the effectiveness of the chosen algorithm. Equation (10) shows the calculation of accuracy.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{10}$$

Sensitivity is a test's ability to correctly identify the positive class and is also called the true positive rate. It is the ratio of true positives to the sum of true positives and false negatives, as stated in (11).

$$\text{Sensitivity} = \frac{TP}{TP + FN} \tag{11}$$

In addition, this study uses specificity as one of the performance metrics. Specificity is the test's ability to correctly identify the negative class and is also called the true negative rate. It is calculated by dividing the true negatives by the sum of the true negatives and the false positives, as stated in (12).

$$\text{Specificity} = \frac{TN}{TN + FP} \tag{12}$$

Finally, precision is the last performance metric used in this study. Precision is the ability of a test to assign the positive event to the positive class. Equation (13) states the calculation for precision.

$$\text{Precision} = \frac{TP}{TP + FP} \tag{13}$$
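The four measures can be computed directly from the confusion-matrix counts, as in the sketch below. The function name `classification_metrics` is hypothetical. The counts in the example are also hypothetical: they were chosen because they reproduce the 80.00%, 90.00%, 75.00%, and 64.29% figures quoted in the results for one prostate cancer configuration, but the paper does not list the underlying counts.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity and precision from confusion counts.

    Assumes every denominator is non-zero, i.e. both classes occur in the
    test set and at least one positive prediction was made.
    """
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


# Hypothetical counts for a 30-sample test set.
metrics = classification_metrics(tp=9, tn=15, fp=5, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value * 100:.2f}%")
```

The example also shows why the measures should be read together: a classifier can reach 90% sensitivity while its precision stays below 65% when false positives are common.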

RESULTS AND DISCUSSION
The study comprises several experiments to find a promising solution for the binary classification of breast cancer and prostate cancer. The analysis is based on experiments integrating filtering methods with logistic regression (LR). The methods are hybrid IG (d=22)+LR, hybrid IG (d=10)+LR, hybrid chi-square (d=22)+LR, and hybrid chi-square (d=10)+LR. The measurements consider the full sample size, a sample size of 100, and a sample size of 75 for the breast cancer and prostate cancer data.

Result for top d = 50 important features
Table 2 shows the performance measures for the full sample size, a sample size of 100, and a sample size of 75 applied to the high-dimensional breast and prostate cancer data. The performance of the feature selection methods was assessed using classification accuracy, sensitivity, specificity, and precision. As shown in Table 2, for the full sample size the accuracy is highest for hybrid chi-square (d=50)+LR at 72.55%, followed by hybrid IG (d=50)+LR at 66.67%. The worst accuracy, only 56.86%, is obtained when no feature selection is applied. Hybrid chi-square (d=50)+LR also performs well in sensitivity, obtaining the highest value of 84.62%, whereas hybrid IG (d=50)+LR has a reasonably good sensitivity of 76.92%. No feature selection again has the worst sensitivity, only 48.72%. However, for specificity and precision, no feature selection outperforms the others with 83.33% and 90.48%, respectively, even though it obtained the worst accuracy and sensitivity, 56.86% and 48.72%. For a sample size of 100, hybrid IG (d=50)+LR obtained the best value for all performance measures, with 66.67% accuracy, 72.22% sensitivity, 58.33% specificity, and 72.22% precision. No feature selection is the worst approach, with the lowest accuracy, specificity, and precision of 60.00%, 41.67%, and 65.00%, respectively. Every method has the same sensitivity of 72.22%.
Meanwhile, for a sample size of 75, hybrid IG (d=50)+LR holds the highest values for accuracy, sensitivity, and precision, 86.96%, 94.44%, and 89.47%, when applied to high-dimensional breast cancer data. For specificity, however, no feature selection has the highest value, 80.00%. In summary, hybrid IG (d=50)+LR performs the best for breast cancer data: although no feature selection reaches the highest specificity, hybrid IG (d=50)+LR outperforms the other methods on most performance metrics. As a result, the optimum filter selection strategy for breast cancer with a sample size of 75 and the top d=50 important features is hybrid IG (d=50)+LR. For prostate cancer data with the full sample size, hybrid chi-square (d=50)+LR shows the highest accuracy, sensitivity, and precision, with 70.97%, 66.67%, and 71.43%. A contrasting result appears when these feature selection methods were applied to prostate cancer data with sample sizes of 100 and 75, where hybrid chi-square (d=50)+LR outperformed the other methods on all four measures, with 80.00% accuracy, 90.00% sensitivity, 75.00% specificity, and 64.29% precision. Hence, hybrid chi-square (d=50)+LR is the best method for the full sample size, considering only 50 significant features, for both high-dimensional breast and prostate cancer data.

Result for top d = 22 important features
Table 3 shows the performance measures of each filter method applied to the top d=22 features with n = full sample, 100, and 75 for high-dimensional breast cancer data. As can be seen in Table 3, hybrid IG (d=22)+LR shows the highest accuracy and sensitivity, 68.63% and 79.49%, respectively. However, in terms of specificity and precision, no feature selection outperforms the other methods with 83.33% and 90.48%, respectively. Thus, based on accuracy and sensitivity, hybrid IG (d=22)+LR is the optimal feature selection approach for the full sample size of high-dimensional breast cancer data with the top d=22 features. Each method's performance was also assessed using a sample size of 100. For specificity, no feature selection, hybrid IG (d=22)+LR, and hybrid chi-square shared the same value of 41.67%. No feature selection gave the most stable performance, achieving 60.00% accuracy, 72.22% sensitivity, and 65.00% precision, while hybrid chi-square performed the worst, scoring 46.67% in accuracy and 56.25% in precision. For a sample size of 75, hybrid IG (d=22)+LR outperforms the others in accuracy and sensitivity, with 82.61% and 94.44%, whereas no feature selection yielded the lowest accuracy, 47.83%, and sensitivity, 38.89%, making it the worst choice. Hence, hybrid IG (d=22)+LR is the best feature selection method for high-dimensional breast cancer data, in particular with 22 features and a sample size of 75. These feature selection procedures were also applied to prostate cancer, and Table 3 shows that hybrid chi-square (d=22)+LR is the best procedure for the top d=22 with n = full sample and n=75, as it obtained the highest accuracy, sensitivity, specificity, and precision.

Result for top d = 10 important features
Table 4 illustrates the performance measures of each filter method applied to the top d=10 features with n = full sample, 100, and 75 for high-dimensional breast cancer data. For the full sample size, the performance metrics using ten significant features selected by the filtering techniques (hybrid IG (d=10)+LR and hybrid chi-square (d=10)+LR) were compared with no feature selection. On the high-dimensional breast cancer data, no feature selection has the highest specificity and precision, 83.33% and 90.48%, respectively, but performs poorly on the other two criteria, with the lowest accuracy, 56.86%, and sensitivity, 48.72%. Hence, no feature selection performs the worst, as it gives a very imbalanced output. Hybrid IG (d=10)+LR had the best performance, giving stable and high values of 83.87%, 80.00%, 87.50%, and 85.71% for accuracy, sensitivity, specificity, and precision, respectively.
For a sample size of 100, as shown in Table 4, the two filter methods (hybrid IG (d=10)+LR and hybrid chi-square (d=10)+LR) were applied to the high-dimensional breast cancer data using 10 important features, and the results were compared with no feature selection. Hybrid IG (d=10)+LR and hybrid chi-square (d=10)+LR both gave good results, with identical values for all metrics: 66.67% accuracy, 88.89% sensitivity, 33.33% specificity, and 66.67% precision.
For a sample size of 75, Table 4 illustrates the performance measures for high-dimensional breast cancer data when applying hybrid IG (d=10)+LR and hybrid chi-square (d=10)+LR with only 10 important features. No feature selection performs poorly, achieving the lowest value for two of the four metrics: 47.83% for accuracy and 38.89% for sensitivity. Hence, no feature selection performs the worst for high-dimensional breast cancer data. The method that gives high and consistent values for accuracy and sensitivity is hybrid IG (d=10)+LR, with 69.57% and 77.78%. Hybrid IG (d=10)+LR is also the best feature selection method when applied to high-dimensional prostate cancer data with n = full sample, n=100, and n=75 and d=10 features. Thus, hybrid IG (d=10)+LR is the best feature selection method for high-dimensional breast cancer data, and the filter feature selection method works well for both high-dimensional breast cancer and prostate cancer data.

Discussion
Several performance evaluations were used to assess the model's output. Metrics such as the gap between the two sets, the accuracy of both sets, and the precision, recall, and F1-score values reveal differences between the training and testing sets. It is interesting to note that hybrid IG+LR performs the best for high-dimensional breast and prostate cancer data, since it achieves the best performance on most criteria. Other methods perform on par for some performance metrics, and no feature selection achieved the highest specificity when applied to high-dimensional breast cancer data. Nevertheless, the optimum filter selection strategy for high-dimensional breast cancer data with a sample size of 75 and the top d=50 and d=22 important features is hybrid IG+LR. In addition, hybrid chi-square+LR gives a feasible feature selection solution for high-dimensional breast and prostate cancer data. Table 5 shows the top d=22 and d=10 important features of each filter method applied to n = full sample for breast and prostate cancer data. As shown in Table 5, several of the same features appeared in the top d=22 and top d=10 for hybrid IG+LR and hybrid chi-square+LR when these methods were applied to high-dimensional breast cancer data. However, different features were selected by hybrid chi-square+LR when this method was used for high-dimensional prostate cancer data with the top d=22 and top d=10 important features and the full sample size.
The top 50 and top 22 features outperformed the other configurations, with the highest classification accuracies of 86.96% and 82.61%, respectively, after integrating hybrid information gain with the logistic function (hybrid IG+LR) at a sample size of 75. This study therefore shows that reducing the sample size resulted in increased classification accuracy. This finding is supported by Eckstein et al. [31] and Arbabshirani et al. [32], who reported the same result. Hence, it can be assumed that sample size does influence classification accuracy. This study revealed that sample size influenced the performance of both hybrid IG+LR and hybrid chi-square+LR, so deciding on the best feature selection method for high-dimensional data is still challenging. However, this study showed that the recommended feature selection method for high-dimensional cancer data is hybrid IG+LR.

CONCLUSION
This paper provides a more detailed investigation of high-dimensional breast cancer and prostate cancer data. The research compares filter feature selection methods at different sample sizes. Logistic regression combined with hybrid information gain (hybrid IG+LR) demonstrates an improvement in binary classification accuracy, especially for small sample sizes. The filter feature selection method works well on high-dimensional breast cancer and prostate cancer data, and the result is significant for data with many features and a small sample size. In addition, the sample size configuration affects feature selection and classification performance: hybrid IG+LR with a sample size of 75 and the top 50 and top 22 important features outperformed the other configurations. This integration is therefore expected to be useful for other types of high-dimensional data. In the future, evaluating multiclass classification in a different domain is recommended.

Figure 1. Conceptual research methodology


Table 1. Summary of high-dimensional cancer data

Table 2. Performance measures of each filter method applied to top d=50 features and n = full sample, 100, and 75 for breast cancer data

Table 3. Performance measures of each filter method applied to top d=22 features and n = full sample, 100, and 75 for high-dimensional breast and prostate cancer data

Table 4. Performance measures of each filter method applied to top d=10 features and n = full sample, 100, and 75 for breast and prostate cancer data

The gap between the accuracy, training, and validation measures narrows to a reasonable level during model training. All models can classify patients. However, this study finds that hybrid information gain with logistic regression (hybrid IG+LR) provides the best results on the breast cancer data for n=100 and n=75 with the top d=50, and for n = full sample and n=75 with the top d=22, after training and testing data were applied. Surprisingly, hybrid IG+LR is the best method for all sample sizes with the top d=10. Furthermore, hybrid IG+LR still outperformed other methods when applied to high-dimensional prostate cancer data, specifically for n=100 with the top d=22 and for n=100 and n=75 with the top d=10.

Table 5. Top d=22 and top d=10 essential features of each filter method applied to n = full sample for breast and prostate cancer data