An ICA-ensemble learning approach for prediction of RNA-seq malaria vector gene expression data classification

Received Feb 4, 2020. Revised Jul 17, 2020. Accepted Sep 23, 2020.

Malaria parasites exhibit remarkable life-stage variation as they develop across the multiple environments of the mosquito vector. Transcriptomes exist for several thousand different parasites. Ribonucleic acid sequencing (RNA-seq) is a widely used gene expression tool that leads to a better understanding of genetic questions. RNA-seq measures the transcription of expressed genes. RNA-seq data call for procedural enhancements in machine learning techniques, and researchers have suggested various learning approaches for the study of biological data. This study uses the independent component analysis (ICA) feature extraction algorithm to uncover latent components in a high-dimensional RNA-seq vector dataset and evaluates its classification performance; an ensemble classification algorithm is used to carry out the experiment. The approach is tested on an RNA-seq Anopheles gambiae mosquito dataset. The experiment achieved a classification accuracy of 93.3%.


INTRODUCTION
Next-generation sequencing technology has produced many large datasets that allow biologists to examine complex gene transcripts, such as RNA relationships, and conditions such as cancer, infections (malaria), tumors, and hereditary disorders, among others [1]. In Africa, Anopheles gambiae mosquitoes are blood-sucking insects and major vectors of Plasmodium falciparum. Anopheles mosquitoes transmit the deadly malaria parasite, which is responsible for thousands of deaths. As resistance to antimalarial drugs spreads, the search for state-of-the-art drugs requires improved biological knowledge of these organisms. The precise control of gene expression in Anopheles mosquitoes has been a major concern, calling for an improved quantitative predictive model of malaria vector transcripts [2,3].
RNA-seq analysis yields sensitive biological insight and provides a starting point for functional analysis of sequencing data. Analysing RNA-seq data involves removing the curse of high dimensionality from the data, such as noise, repetition, inconsistency, redundancy, and irrelevant, incorrect, or invalid values, among others [4]. Recent innovations have enhanced approaches for designing state-of-the-art healthcare models such as tailored therapies, intelligent health surveillance systems, and disease diagnosis [5].
Numerous machine learning methods with practical advances have been developed over the years to analyze the enormous volume of RNA-seq and next-generation gene sequencing expression data by studying the related biological patterns [6]. Researchers [7,8] have applied computational approaches to large databases of human genetic diseases, in which the genes responsible for a disease can be found. Numerous approaches are used to detect differentially expressed genes (DEG). Datamining procedures are important for identifying the differences between genes derived from the human genome. Numerous machine learning methods have been adopted to examine and identify the gene expression profiles of various diseases. Gene expression profiling, and the datamining techniques used for it, are indispensable. Numerous authors have proposed work in this area, and existing studies of gene expression are reviewed in [5]. Blood-based gene expression signatures combined with datamining have been proposed for identifying transcripts that can be used in disease classification [9]. Using RNA data from the Gene Expression Omnibus database and machine learning tools, an integrated review of dimensionality reduction, clustering, and classification for RNA-seq data has been presented; it covers the direct and indirect approaches to reducing the dimension of scRNA-seq data that have recently emerged as predominant [10].
This study proposes a dimensionality reduction model that uses the ICA feature extraction technique to uncover the relevant correlated latent components in a high-dimensional gene expression dataset. A subspace ensemble classification system is then used to learn distinct biological patterns, which helps achieve improved classification accuracy and is suggested as an effective procedure for finding novel genes for malaria.

PROPOSED METHOD
In this study, the proposed framework summarized in Figure 1 is adopted. The fundamental idea is to perform a machine learning prediction task on high-dimensional RNA-seq data, for cells and genes, after reducing it to a lower-dimensional dataset. The plan is to extract the important information in a given dataset by using the ICA feature extraction method as one stage. To evaluate performance on the RNA-seq dataset, ensemble classification algorithms are compared. Numerous machine learning approaches have been adopted to examine and identify the gene expression profiles of several diseases. The need for gene expression profiling, and approaches that use specific datamining techniques, have been discussed, and recent investigations in analysing gene expression are reviewed in [5]. A supervised machine learning method for selecting RNA-seq segments was proposed; it ranks large sets of segments measured with RNA-seq using random forest variable ranking measures, specifying the frequency of extreme pseudo-samples (EPS), with variational autoencoder regressors used to rank the RNA-seq features of cancer datasets with about 1,210 samples. The results demonstrated a supervised, latent-learning-based feature selection method and highlighted the need for gene selection methods in gene expression analysis [11]. A supervised model for classifying RNA-seq datasets was proposed as a generalized method for highly accurate single-cell classification; it integrates unbiased feature selection in a reduced-dimensional space. Sc-Pred was applied to RNA-seq datasets of pancreatic tissue, cells removed from colorectal tumours, mononuclear cells, and dendritic cells, and it classified individual cells with high accuracy [12].
A machine learning analysis of RNA and DNA data was proposed for lowly expressed genes that could collectively be affected by PAH disease. A state-of-the-art feature selection procedure was used to identify a small number of highly informative genes. Clusters of lowly expressed genes were found to predict altered PAH processes [13]. Stomach cancer gene expression data classification was developed using a deep learning approach with heatmaps, PCA, and a CNN algorithm. The RNA-seq gene expression data were studied and analysed, and results of 95.96% and 50.51% were achieved [14]. RNA-seq transcription data for malaria were analysed with a variety of techniques to deconvolute transcriptional variation across different malaria parasites, revealing hidden discrete transcriptional signatures [15]. Supervised datamining approaches such as C4.5 and boosted and bagged ensemble classification algorithms were applied to publicly available oncogenic microarray data and compared; the boosted and bagged ensemble classifiers outperformed C4.5 [16]. A diagnostic classification method for genomic cancer expression data was designed using an ensemble algorithm, with recursive feature elimination (RFE) used to select effective features and AdaBoost used for improved classification results [17]. Cancer gene expression data were classified using an effective ensemble learning method that increased classification performance while depending less on the peculiarities of any individual training set [18]. An enhanced ensemble learning procedure based on wrapper feature selection and random trees was proposed to improve knowledge; it builds a subset using bagged and random trees. Irrelevant features were removed to select the best features for classification using RF, SVM, and NB, with 92% accuracy [19]. Text classification algorithms were surveyed using various text dimensionality reduction methods [20].

RESEARCH METHOD
Several authors have carried out datamining enhancements for high-dimensional datasets. Here, independent component analysis (ICA) followed by classification with an ensemble algorithm is proposed for RNA-seq malaria vector data.

Methods
The experiment used MATLAB to analyze the data obtained [21], with ICA used to extract latent features and an ensemble algorithm approach [22] used for classification, all within the MATLAB environment.

Independent component analysis (ICA)
ICA is a valuable extension of PCA that has traditionally been used for the blind separation of independent sources from their linear mixture [20]. The starting point of ICA is the uncorrelatedness property of standard PCA. Given an $n \times p$ data matrix $X$, whose rows ($i = 1, \dots, n$) correspond to the observed variables and whose columns ($j = 1, \dots, p$) are the individuals of the corresponding variables, the ICA model of $X$ is written as follows:

$$X = AS$$

Without loss of generality, $A$ is the mixing matrix and $S$ is the source matrix, subject to the requirement that its rows be as statistically independent as possible. The independent components are the new variables contained in the rows of $S$; that is, the observed variables are linear combinations of the independent components. The independent components are obtained by learning the right linear combinations of the observed variables, since the mixing can be inverted as:

$$S = A^{-1}X = WX$$

where $W = A^{-1}$ is the unmixing matrix.
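The mixing model and its inversion can be illustrated with a minimal Python sketch. It is not the MATLAB procedure used in this study: the sources, the 2 x 2 mixing matrix, and the helper names `matmul` and `inverse_2x2` are hypothetical, and the mixing matrix is assumed to be known, whereas an actual ICA algorithm has to estimate the unmixing matrix from the observed data alone.

```python
# Toy illustration of the ICA model X = A S and its inversion S = W X.
# Assumption: the mixing matrix A is known here, purely for illustration;
# a real ICA algorithm must estimate W from X alone.

def matmul(P, Q):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)]
            for row in P]

def inverse_2x2(M):
    """Invert a 2 x 2 matrix with non-zero determinant."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("mixing matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

# Two hypothetical independent sources (rows of S), four observations each.
S = [[1.0, -1.0, 1.0, -1.0],
     [0.5, 0.5, -0.5, -0.5]]
A = [[2.0, 1.0],
     [1.0, 3.0]]             # hypothetical mixing matrix

X = matmul(A, S)             # observed variables: linear mixtures of the sources
W = inverse_2x2(A)           # unmixing matrix W = A^{-1}
S_recovered = matmul(W, X)   # equals S up to floating-point error
```

Each row of `X` is a linear combination of the rows of `S`, and multiplying by `W` recovers the sources, which is the relationship the two equations express.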

Ensemble classifier
Ensemble classifiers can be trained on different subsets of the training data, with different classification parameters, or with different feature subsets as in the random subspace model [23]. An ensemble classifier integrates the outputs of several classifiers to produce a final decision, and it is frequently used to obtain highly accurate results. Ensemble classifiers are common in machine learning problems and can be employed in bioinformatics. The classification decision is obtained by merging the decisions of the individual classifiers [24]. Ensemble approaches are machine learning techniques that combine decisions to improve overall classification performance. Several terms in the literature carry similar meanings, such as multi-strategy learning, aggregation, integration of multiple classifiers, classifier fusion, combination, and committee. An ensemble classifier can perform better than its individual base classifiers. The effectiveness of ensemble approaches depends strongly on the independence of the errors committed by the individual learners, and their performance hinges on the accuracy and diversity of the base learners. Ensemble classification has two common techniques, bagging and boosting. Bootstrap aggregating (bagging) builds each training set by randomly resampling, with replacement, N items from the original training set T. The resulting training sets are called bootstrap replicates; some instances do not appear in a given replicate while others appear more than once. The final classifier C*(x) is built by combining the classifiers Ci(x) trained on the replicates, and every Ci(x) has an equal vote.
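A minimal Python sketch of bagging, under stated assumptions, is given here for illustration. The base learner `train_majority_class` is a hypothetical stand-in that always predicts the most frequent label in its bootstrap replicate; it is not the tree learner used in the experiments, and the function names and the default of 11 classifiers are likewise illustrative.

```python
import random
from collections import Counter

def bootstrap_sample(training_set, rng):
    """Draw N items with replacement from a training set of size N."""
    n = len(training_set)
    return [training_set[rng.randrange(n)] for _ in range(n)]

def majority_vote(predictions):
    """Combine base-classifier outputs C_i(x); every vote has equal weight."""
    return Counter(predictions).most_common(1)[0][0]

def train_majority_class(sample):
    """Hypothetical base learner: always predicts the most frequent label."""
    label = majority_vote([y for _, y in sample])
    return lambda x: label

def bagging_predict(training_set, x, n_classifiers=11, seed=0):
    """Train one base classifier per bootstrap replicate and combine their votes."""
    rng = random.Random(seed)
    classifiers = [train_majority_class(bootstrap_sample(training_set, rng))
                   for _ in range(n_classifiers)]
    return majority_vote([c(x) for c in classifiers])
```

Because sampling is done with replacement, a replicate has the same size as the original training set but typically omits some instances and repeats others.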
Adaptive boosting (AdaBoost) reweights the training data. Initially, the procedure assigns every instance xi an equal weight. In each iteration i, the learning algorithm attempts to minimize the weighted error on the training set and yields a classifier Ci(x). The weighted error of Ci(x) is calculated and used to update the weights of the training instances xi. The weight of an instance changes according to its effect on the classifier's performance: a misclassified xi is given a higher weight and a correctly classified xi a lower weight. The final classifier C*(x) is formed by a weighted vote of the individual Ci(x) according to their accuracy on the weighted training set [19]. Kamran, et al. [20] showed how a boosting algorithm works for datasets that are then trained by multi-model designs (ensemble learning). These advances resulted in adaptive boosting (AdaBoost). Suppose $D_t$ is constructed such that $D_1(i) = 1/m$; given $D_t$ and $h_t$:

$$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \exp\left(-\alpha_t y_i h_t(x_i)\right)$$

where $Z_t$ refers to the normalization factor and $\alpha_t$ is as follows:

$$\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$$

Here $m$ is the number of training instances, $y_i \in \{-1, +1\}$ is the label of $x_i$, and $\epsilon_t$ is the weighted error of $h_t$.

The basic ensemble classification techniques are max voting (MV), weighted averaging (WA), and averaging [25][26][27]. Ensemble learning has four combination methods: stacking (STK), blending (BLD), bagging (BAG), and boosting (BOT) [28][29][30][31].
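One round of the AdaBoost reweighting described above can be sketched in Python as follows. The sketch assumes the standard binary formulation with labels and predictions in {-1, +1}, a weak-classifier weight of half the log-odds of its weighted error, and weights renormalized to sum to one; the function name `adaboost_round` and the four-instance example are hypothetical.

```python
import math

def adaboost_round(weights, labels, predictions):
    """One AdaBoost reweighting step for labels/predictions in {-1, +1}.

    Returns (alpha, new_weights), where new_weights sums to 1.
    """
    # Weighted error of the current weak classifier.
    error = sum(w for w, y, h in zip(weights, labels, predictions) if y != h)
    if not 0.0 < error < 1.0:
        raise ValueError("weighted error must lie strictly between 0 and 1")
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Misclassified instances (y * h = -1) are scaled up, correct ones down.
    unnormalized = [w * math.exp(-alpha * y * h)
                    for w, y, h in zip(weights, labels, predictions)]
    z = sum(unnormalized)  # normalization factor
    return alpha, [u / z for u in unnormalized]

labels = [1, 1, -1, -1]
predictions = [1, 1, -1, 1]                    # the last instance is misclassified
initial = [1.0 / len(labels)] * len(labels)    # equal starting weights
alpha, updated = adaboost_round(initial, labels, predictions)
```

In this example the single misclassified instance ends the round with weight 1/2 and each correctly classified instance with weight 1/6, so the next weak classifier concentrates on the instance that was missed.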

Evaluation performance
Evaluating the performance of a datamining model requires validation metrics. Classification algorithms use the confusion matrix to analyse four quantities: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). These quantities identify the correctly and incorrectly classified instances in the sample of the dataset used to test the model [5,32].
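The way performance metrics follow from the four confusion matrix counts can be sketched in Python as below. The selection of metrics (accuracy, sensitivity, specificity, precision) uses their standard definitions and is an assumption about which metrics are of interest; the function name and the counts in the example are hypothetical and are not results of this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive standard metrics from the four confusion matrix counts."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")

    def ratio(numerator, denominator):
        # Undefined ratios (zero denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "precision": ratio(tp, tp + fp),
    }

# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```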

Applications
An enhanced approach to gene expression analysis for identifying related genes in RNA-seq data can help in the development of various applications such as tailored treatment, disease detection, gene and drug discovery, and tumor recognition, among others. Datamining techniques are used to identify patterns and offer a rich set of applicable algorithms and tools. In this study, MATLAB is used to run the program because of its user-friendly environment [16], in order to apply RNA-seq technology to the prognosis and diagnosis of malaria, using a 64-bit system with 8 GB of RAM, an iCore2 processor, and MATLAB 2015A.

RESULTS AND DISCUSSIONS
This study uses RNA-seq mosquito data with 2457 instances. The ICA algorithm was applied to extract latent components from the Anopheles data; ICA feature extraction identifies and removes uncorrelated variables, retaining the dominant variance in a reduced number of independent components that give useful gene information for further examination. The AdaBoost ensemble classification algorithm was applied to the 45 significant latent gene features extracted by ICA, which were obtained in 7.8486 seconds. 10-fold cross-validation is used to evaluate classification performance, using a holdout parameter of 0.05 for training the data, with 5% held out for testing the classification accuracy.
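The 10-fold cross-validation split can be sketched in Python as follows. This is an illustrative partitioning scheme, not the MATLAB routine used in the experiments: the function name `k_fold_indices`, the shuffling seed, and the round-robin assignment of shuffled indices to folds are assumptions; only the number of instances (2457) and the number of folds (10) come from the text.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(2457, k=10))
```

Every instance is used for testing exactly once and for training in the other nine folds, which is what reduces the dependence of the accuracy estimate on any single partition of the data.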

A classification learning and assessment procedure is used to train, test, and evaluate the experiment, with 10-fold cross-validation used to eliminate sampling bias. The results are evaluated in terms of computational time and performance metrics [32]. Classification with the AdaBoost ensemble classifier achieved 93.3% accuracy. The results and procedures are shown in the figures below. The ICA feature extraction algorithm is used to extract the hidden features from the Anopheles mosquito data, as shown in Figure 2. The extracted features are classified using the ensemble algorithm; the scatter plot and results are shown in the figures below, with the confusion matrix used to derive the performance metrics.
Figure 3 shows a scatter plot of the classification, in which dots and crosses mark the correctly classified and misclassified data points; the plot is used to observe the relationships between the classified variables. Figure 4 and Figure 5 show the confusion matrices for the classifications in the experiment, using bagged and boosted ensemble classifiers. The confusion matrix is then used to describe the performance of the classification model on the test data, for which the true values are known, in terms of true positive, false positive, true negative, and false negative values. To test the performance of the datamining learning methods, the RNA-seq data was obtained from the https://figshare.com/articles/Additional_file_4_of_RNAseq_analyses_of_changes_in_the_Anopheles_gambiae_transcriptome_associated_with_resistance_to_pyrethroids_in_Kenya_identification_of_candidateresistance_genes_and_candidateresistance_SNPs/4346279/1 repository. ICA feature extraction was applied to the 2457 gene features and extracted 1572 features with 45 latent components. Ensemble classification is used to predict the performance. The results demonstrate the efficiency of datamining approaches for gene data. The performance results for the proposed approach are reported and compared in Table 2. The outcome shows that the Subspace Discriminant ensemble classifier outperforms the bagged tree ensemble in terms of accuracy.
In this study, an improved investigation of the classification of malaria vector data was carried out. Numerous works have been proposed by other investigators, and the figures and tables above demonstrate that a dimensionality reduction model with ICA feature extraction can improve ensemble classification results. Figure 6 shows the performance chart comparing the output results. This study proposed a prediction and detection model for malaria in humans. The proposed method used ICA dimensionality reduction and ensemble classification datamining procedures, and the analysis and performance assessment of the results obtained are shown in the tables and figures.

Figure 6. Performance metrics graph

CONCLUSION
An enhanced classification approach for malaria prognosis and diagnosis using dimensionality reduction and a classification algorithm was proposed. Numerous works by researchers in this area have been reviewed, and the results of the experiment demonstrate that dimensionality reduction by ICA feature extraction can improve ensemble classification. Future work can extend this study using other ensemble classifiers with other feature extraction algorithms.