A novel optimized deep learning method for protein-protein prediction in bioinformatics

ABSTRACT


INTRODUCTION
Protein-protein interactions (PPIs) can be used to investigate the mechanisms underlying many biological processes, such as deoxyribonucleic acid (DNA) replication, protein modification, and signal transmission. Because accurate understanding and analysis of PPIs can reveal numerous roles at the molecular and proteome levels, PPIs have been a research focus [1], [2]. However, wet-lab identification methods suffer from incomplete and imprecise prediction [3], [4]. Alternatively, precise bioinformatics methods for PPI prediction can supply low-cost candidates for future experimental validation [5], [6].
Identifying PPIs with advanced experimental techniques is not only laborious and costly, but it also produces an excessive number of false positives and false negatives [7], [8]. As a result, computational tools that can aid in discovering genuine protein interactions are required. From the standpoint of machine learning, this problem can be viewed as a classification problem and tackled using supervised learning methods [9], [10]. With the accelerated growth of deep learning techniques and neural network infrastructure, several machine intelligence-based and sequence-based models for PPI prediction have been developed. Table 1 shows a summary of the state-of-the-art methods.
Li et al. [11] proposed DeepCellEss, an interpretable, sequence-based deep learning (DL) method for predicting cell line-specific essential proteins. To extract short- and long-range latent features from protein sequences, DeepCellEss uses a convolutional network and a bidirectional long short-term memory (LSTM) network. Additionally, a multi-head self-attention technique is adopted to enable residue-level interpretation. Extensive computational experiments show that DeepCellEss outperforms previous sequence-based and network-based approaches and provides effective prediction results for distinct cell lines.
Hou et al. [12] created a method for identifying PPI sites based on an ensemble DL model called the ensemble deep learning method for protein-protein interaction (EDLMPPI). It addresses the problem of modeling the properties of amino acid (AA) sequences for PPI binding by directly encoding them into distributed vector representations. However, its performance could still be improved, because the number of experimentally detected PPI sites is significantly smaller than the number of PPIs or protein domains in protein complexes.
Gao et al. [13] offered HIGH-PPI, a two-view hierarchical graph learning network, to predict PPIs and infer the relevant chemical information. In this model's hierarchical graph, each vertex of the PPI network (top, outside-of-protein view) is itself a protein graph (bottom, inside-of-protein view). To represent each protein effectively, the bottom view uses a set of chemically relevant descriptors rather than protein sequences. To build a solid machine understanding of PPIs, HIGH-PPI examines both the outside-of-protein and inside-of-protein views of the human interactome. In predicting PPIs, this model shows good accuracy and robustness.
Yue et al. [14] introduced a deep learning framework for identifying important proteins. Their research focused on three main objectives: investigating the significance of each element's value in model prediction, improving the handling of unbalanced datasets, and assessing the model's accuracy in predicting important proteins. They used node2vec for feature representation and depth-wise separable convolution for gene expression profiles. Results on Saccharomyces cerevisiae (S. cerevisiae) data demonstrated their model's superiority over traditional deep learning methods.
In their 2022 study, Díaz-Eufracio and Medina-Franco [15] developed ensemble models utilizing support vector machine (SVM), logistic regression (LR), and random forest (RF) algorithms, employing extended connectivity fingerprints of radius 2. Their primary objective was to validate newly generated PPI inhibitors from apothecary sources. The significance of their research lies in the predictive models they have created, which will empower future initiatives in designing PPI inhibitors to make informed, data-driven choices.

Table 1. Literature review on traditional models
| Reference | Technique | Advantages | Limitations | Dataset |
|---|---|---|---|---|
| [11] | CNN, BiLSTM, and multi-head self-attention | Attention scores enable the identification of sequence regions that are essential for cell line-specific predictions. They facilitate in-depth research and comparisons of critical cell line-specific proteins. | Does not reflect the relationships between several cell lines within the same tissue or cancer type. | Nucleotide and protein sequences |
| [12] | BiLSTM and capsule network | Works directly with AA sequences. | More dynamic word embedding models need to be added to the model and adapted to address further pertinent protein-identification issues. | Dset_448, Dset_72, and Dset_164 |
| [13] | Graph convolutional network | Its capacity to recognize residue significance for PPI indicates high interpretability. | Protein-level annotations were not fully explored, and memory needs increase with more views in the hierarchical graph. | PPI sequences |
| [14] | 1D convolution | Depth-wise separable convolution is applied to gene expression profiles to extract attributes. | Using a long vector to represent subcellular localization demands significant processing resources. | S. cerevisiae |
| [15] | RF, LR, SVM | Assesses ML models for classifying new inhibitors by chemists and maintains the PPI inhibitors database regularly. | New models and challenges can be applied for classification. | PPI inhibitors |
Machine learning and deep learning (DL) methodologies [16], [17] encompass a range of techniques, including support vector machines (SVM) [18], artificial neural networks (ANN) [19], and others. These approaches offer indispensable tools for the reliable prediction of PPIs by extracting essential peptide information from amino acid sequences [20]. Research demonstrates that deep learning frameworks [21] excel at handling vast, unstructured datasets with intricate characteristics, thereby enhancing the comprehension of pivotal elements in PPI prediction [22], [23]. Consequently, a novel deep learning concept based on artificial neural networks, combined with a meticulous hyperparameter tuning strategy, has been devised to facilitate precise and dependable PPI predictions.
This study offers contributions in three main areas. Firstly, the feature extraction process has been improved by incorporating a semantic similarity-based feature alongside other features, resulting in more accurate outcomes. Secondly, an approach combining LSTM and restricted Boltzmann machines (RBMs) has been devised to ensure precise predictions and minimize loss. Lastly, a novel optimization technique named Aquila and shark nose (ASN) optimization has been introduced to fine-tune the weights of both classifiers, thereby enhancing the efficiency of PPI prediction. The article is organized as follows: section 2 discusses our proposed sequence-dependent ASN PPI prediction technique; section 3 presents the experimental results; and section 4 concludes the paper, followed by the references.
Int J Elec & Comp Eng ISSN: 2088-8708. A novel optimized deep learning method for protein-protein prediction in bioinformatics (Preeti Thareja)

METHOD
Proteins are organic macromolecules composed of AAs that cells require to support living activities [24]. They are significant in biology because, through PPIs, they connect important physiological functions of cells, enabling a variety of life processes such as apoptotic and immunological responses [25]. The suggested ASN technique for predicting interactions from protein sequences is described in this section. Figure 1 depicts the architecture. Our method for predicting PPIs comprises two steps: i) to aid reliable prediction, features are gathered using a standard sequence-dependent approach and a semantic similarity approach; and ii) LSTM and RBMs are employed to execute the protein interaction prediction task. The new hybrid Aquila and shark nose (ASN) optimizer is applied at this step to produce more dependable findings by optimizing the weighting parameters. Finally, the prediction model uses feature extraction, ensemble deep learning, and the best parameters to predict protein interactions.

Feature extraction
The provided inputs were used to extract two distinct types of characteristics [26]: sequence-based physical-chemical attributes and semantic similarity features. Each feature extraction process is described in detail below.

Sequence-based physical-chemical features
To establish a strong foundation for predicting PPIs, proteins are characterized using a comprehensive set of 12 physical and chemical attributes. These attributes are derived from the constituent amino acids of the proteins and include hydrophilicity, adaptability, accessibility, torsion, external surface, polarizability, antigenic propensity, hydrophobicity, net charge of side chains, polarity, solvent-accessible surface area, and side-chain volume. Among these attributes, hydrophobicity and polarity were each assessed using two distinct measurement methods, as detailed in [27], which documents the values of 14 physical and chemical characteristic scales for the 20 standard amino acids. In this approach, each AA is transformed into a vector of 14 numerical values, one per physicochemical scale. Since proteins vary in length, this transformation yields a variable number of vectors per protein, which is difficult to process uniformly. To provide a consistent input for the ensemble meta-base learner's classifier, the protein descriptions are converted into a fixed-size matrix using auto covariance (AC). The auto covariance of the $l$-th physicochemical property scale, $AC_{lag,l}$, is given by (1) and (2):
$$AC_{lag,l} = \frac{1}{n - lag} \sum_{i=1}^{n - lag} \left(X_{i,l} - \bar{X}_{l}\right)\left(X_{i+lag,l} - \bar{X}_{l}\right) \tag{1}$$
$$\bar{X}_{l} = \frac{1}{n} \sum_{i=1}^{n} X_{i,l} \tag{2}$$
where $lag$ denotes the predefined gap, $n$ indicates the length of protein $X$, $X_{i,l}$ is the value of the $l$-th physicochemical scale at residue $i$, and $\bar{X}_{l}$ indicates the mean of the $l$-th physicochemical scale values of protein $X$. By fixing the greatest spacing to $lg$ ($lag = 1, 2, \ldots, lg$), every protein can be represented by $k \times lg$ elements, where $k$ is the number of physicochemical property scales.
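The auto covariance transformation described above can be sketched as follows. This is a minimal illustration, not the authors' MATLAB implementation: the function name `auto_covariance`, the argument names, and the row-per-residue input layout are assumptions made for the example.

```python
import numpy as np

def auto_covariance(scales: np.ndarray, max_lag: int) -> np.ndarray:
    """Map a variable-length protein to a fixed-length AC feature vector.

    scales  : (n, k) array, one row per residue and one column per
              physicochemical scale (k = 14 in the text). Hypothetical layout.
    max_lag : greatest spacing lg; must satisfy 1 <= max_lag < n.
    Returns a vector with k * max_lag elements.
    """
    n, k = scales.shape
    if not 1 <= max_lag < n:
        raise ValueError("max_lag must be in [1, n - 1]")
    # Subtract the per-scale mean over the whole protein (equation (2)).
    centered = scales - scales.mean(axis=0)
    features = np.empty((k, max_lag))
    for lag in range(1, max_lag + 1):
        # Products of residues that are `lag` positions apart (equation (1)).
        products = centered[:-lag] * centered[lag:]
        features[:, lag - 1] = products.sum(axis=0) / (n - lag)
    return features.ravel()
```

Because the output length depends only on `k` and `max_lag`, proteins of different lengths map to vectors of the same size, which is what the downstream classifiers need.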

Semantic similarity-based feature extraction
The semantic similarity identification technique is used to compute the degree of similarity. To compute the resemblance, every source is represented as a vector. A plain vector model, however, suffers from word mismatch (for example, it disregards synonymy) and lacks semantic information. The point word recognition (PWR) technique incorporates semantic meaning into the vector model, thereby removing these semantic problems. In this application, the species sensitivity distributions (SSD) approach is primarily concerned with determining the associations between every pair of resources by using a cosine similarity metric. Syntactic and semantic similarity are both integrated into the SSD cosine similarity, as expressed in (3).
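For orientation, the plain cosine similarity between two feature vectors can be sketched as below. This shows only the standard metric; the SSD variant in (3), which also integrates syntactic and semantic terms, is not reproduced here. The function name and the zero-vector convention are assumptions for the example.

```python
import numpy as np

def cosine_similarity(u, v, eps: float = 1e-12) -> float:
    """Standard cosine similarity between two vectors.

    Returns 0.0 when either vector has (near) zero norm; this convention
    is an assumption, not something stated in the paper.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom < eps:
        return 0.0
    return float(np.dot(u, v) / denom)
```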

Optimal trained hybrid classifier
The extracted features are passed to the prediction model, a hybrid that combines an improved LSTM with the RBM classifier. The hybrid concept is as follows: the features are first passed to both classifiers individually, and the mean of the two classifiers' outputs is then taken as the final outcome. To enhance prediction performance, both classifiers are trained by the proposed ASN, which tunes their weights optimally.
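The output-averaging step can be sketched as follows, assuming each classifier emits one interaction score per protein pair in the range 0 to 1. The function name `hybrid_predict` and the 0.5 decision threshold are hypothetical; the paper states only that the mean of the two outputs is used.

```python
import numpy as np

def hybrid_predict(lstm_scores, rbm_scores, threshold: float = 0.5):
    """Average two classifiers' scores and threshold the mean.

    The threshold value is an assumption made for illustration.
    """
    lstm_scores = np.asarray(lstm_scores, dtype=float)
    rbm_scores = np.asarray(rbm_scores, dtype=float)
    if lstm_scores.shape != rbm_scores.shape:
        raise ValueError("both classifiers must score the same samples")
    mean_scores = (lstm_scores + rbm_scores) / 2.0
    labels = (mean_scores >= threshold).astype(int)
    return mean_scores, labels
```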

LSTM networks
LSTM networks are the most popular type of recurrent neural network (RNN). The memory cell and the gates are the two essential parts of the LSTM. The input and forget gates alter the internal contents of the memory cell.
LSTM networks rely on four essential gates. The forget gate (f) determines what information from the previous state should be remembered or discarded. The input gate (i) decides which incoming data should be integrated into the current state. The input modulation gate (g), often considered part of the input gate, transforms incoming data so that it is suitable for updating the internal state. Finally, the output gate (o) determines how much of the updated internal state is exposed as the current output. Together, these four gates orchestrate the flow of information, control state updates, and contribute to the network's output.
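The interplay of the four gates can be illustrated with a single LSTM time step. This is a sketch of the standard LSTM cell, not the "improved LSTM" used in this work; the stacked weight layout (f, i, g, o) and all names are assumptions for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One step of a standard LSTM cell.

    x      : (D,) input vector
    h_prev : (H,) previous hidden state
    c_prev : (H,) previous memory cell state
    W, U, b: stacked parameters of shape (4H, D), (4H, H), (4H,),
             ordered as forget, input, modulation, output (assumed layout).
    """
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    f = sigmoid(z[0:H])            # forget gate: what to keep from c_prev
    i = sigmoid(z[H:2 * H])        # input gate: how much new data to admit
    g = np.tanh(z[2 * H:3 * H])    # input modulation gate: candidate values
    o = sigmoid(z[3 * H:4 * H])    # output gate: how much state to expose
    c = f * c_prev + i * g         # updated memory cell
    h = o * np.tanh(c)             # current hidden state / output
    return h, c
```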
In our work, we have employed the tanh activation function, given in (4):
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{4}$$
The tanh activation introduces the non-linearity that allows the network to capture intricate data relationships. It is widely used across neural network architectures for tasks such as feature transformation and classification.
To reduce the model's loss, we employed the cross-entropy loss function given in (5).
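For reference, a common form of the cross-entropy loss is shown below. It assumes a binary task (interacting versus non-interacting pairs); the symbols $N$, $y_i$, and $\hat{y}_i$ are introduced here for illustration, and the exact form of (5) used by the authors may differ.

```latex
% Binary cross-entropy over N training pairs (assumed form):
%   y_i       in {0, 1} is the true interaction label of pair i
%   \hat{y}_i in (0, 1) is the predicted interaction probability
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N}
  \Big[\, y_i \log \hat{y}_i + (1 - y_i) \log\!\big(1 - \hat{y}_i\big) \Big]
```

The loss is zero only when every prediction matches its label with full confidence, and it grows without bound as a confident prediction becomes wrong.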

Restricted Boltzmann machines
RBM layers, which were used for pre-training, were transformed into a feed-forward network to enable weight fine-tuning by using a new strategy. During the pre-training stage, the underlying features were learned using a greedy layer-wise unsupervised technique. During the fine-tuning step, a softmax layer was added on top of the network to improve the characteristics learned from the labeled samples.
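To make the unsupervised pre-training stage concrete, the sketch below performs one contrastive divergence (CD-1) update for a single Bernoulli RBM layer. The paper does not name the training rule, so CD-1 is an assumption, as are the function name and the parameter shapes. In greedy layer-wise pre-training, the hidden activations of one trained layer become the visible input of the next.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_vis, b_hid, lr=0.1, rng=None):
    """One CD-1 parameter update for a Bernoulli RBM (assumed training rule).

    v0    : (batch, n_vis) binary visible data
    W     : (n_vis, n_hid) weights
    b_vis : (n_vis,) visible biases
    b_hid : (n_hid,) hidden biases
    """
    rng = np.random.default_rng() if rng is None else rng
    # Positive phase: hidden probabilities given the data.
    h0_prob = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    # Negative phase: one Gibbs step back to the visible layer and up again.
    v1_prob = sigmoid(h0 @ W.T + b_vis)
    h1_prob = sigmoid(v1_prob @ W + b_hid)
    batch = v0.shape[0]
    W_new = W + lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / batch
    b_vis_new = b_vis + lr * (v0 - v1_prob).mean(axis=0)
    b_hid_new = b_hid + lr * (h0_prob - h1_prob).mean(axis=0)
    return W_new, b_vis_new, b_hid_new
```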

ASN optimizer
The proposed ASN combines the Aquila optimizer with shark nose smell optimization: the Aquila update is influenced by the shark nose algorithm. Hybrid optimization schemes of this kind generally aim for a better convergence rate and speed than the individual algorithms achieve on their own. The objective function of our proposed ASN-based PPI prediction model is given in the following subsections.

Objective function
Figure 2 shows the solution encoding that serves as the input to our proposed ASN-based method. Each candidate solution encodes the parameters that the optimizer tunes, which here are the weights of the two classifiers.

Figure 2. Solution encoding of proposed ASN-based PPI prediction technique
The primary objective of this work is to minimize the mean square error (MSE), as described in (6). MSE is a fundamental metric in optimization and statistical analysis; here it quantifies the discrepancy between predicted and observed values, and (6) defines the objective function that the optimizer minimizes.
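For reference, the usual definition of the MSE and the corresponding minimization objective are given below. The symbols $N$, $y_i$, $\hat{y}_i$, and $\mathrm{Obj}$ are introduced here for illustration and are not taken from the original equation.

```latex
% Assumed notation: N samples, y_i observed value, \hat{y}_i predicted value
\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^{2},
\qquad
\mathrm{Obj} = \min\left( \mathrm{MSE} \right)
```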

Aquila optimizer
The Aquila optimizer is a nature-inspired algorithm, based on the hunting behavior of Aquila birds, that is primarily used for optimization tasks. On its own, it may exhibit slow convergence and suboptimal results in complex optimization tasks. Aquila birds, like other true eagles, build their nests in high places and hunt using speed and talons, with ground squirrels being their common prey.
The four hunting methods used by the bird are: i) expanded exploration, in which the target is pursued with a high soar and vertical stoop; ii) narrowed exploration, the preferred technique for pursuing ground creatures like snakes and squirrels (contour flight with short glide attack); iii) expanded exploitation, the technique for pursuing slow prey (low flight with a slow descending attack); and iv) narrowed exploitation, the technique for hunting large animals (walking and grabbing the target). The mathematical expressions for these methods are given in (7)-(11).
where $X_1$, $X_2$, and $X_3$ represent the new solutions for methods 1, 2, and 3, $X_{best}$ is the best solution, $t$ is the current iteration, $T$ is the maximum iteration, $N$ is the population size, $D$ is the variable size, $rand$ is a random value in the range 0 to 1, and $X_M$ is the local mean value, as in (8). $Levy(D)$ is Levy's flight distribution, $UB$ is the upper bound, $LB$ is the lower bound, $\alpha$ and $\delta$ are exploitation parameters, and $QF$, $G_1$, and $G_2$ are quality factors as shown in (12)-(14).
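For orientation, the update rules of the standard Aquila optimizer, as commonly published, are listed below. They are an assumption about what (7)-(14) contain: the ASN variant modifies the Aquila update with the shark nose algorithm, so the authors' expressions may differ. The symbols $X_R$ (a randomly chosen solution), $x$ and $y$ (spiral search terms), and $X_4$ (the fourth method's solution) do not appear in the text above and are taken from the standard formulation.

```latex
\begin{align*}
% Expanded exploration and population mean
X_1(t+1) &= X_{best}(t)\left(1 - \frac{t}{T}\right)
            + \big(X_M(t) - X_{best}(t)\cdot rand\big) \\
X_M(t)   &= \frac{1}{N}\sum_{i=1}^{N} X_i(t) \\
% Narrowed exploration
X_2(t+1) &= X_{best}(t)\cdot Levy(D) + X_R(t) + (y - x)\cdot rand \\
% Expanded exploitation
X_3(t+1) &= \big(X_{best}(t) - X_M(t)\big)\,\alpha - rand
            + \big((UB - LB)\cdot rand + LB\big)\,\delta \\
% Narrowed exploitation
X_4(t+1) &= QF(t)\cdot X_{best}(t) - G_1\cdot X(t)\cdot rand
            - G_2\cdot Levy(D) + rand\cdot G_1 \\
% Quality factors
QF(t) &= t^{\frac{2\,rand - 1}{(1 - T)^{2}}}, \qquad
G_1 = 2\,rand - 1, \qquad
G_2 = 2\left(1 - \frac{t}{T}\right)
\end{align*}
```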

Shark nose optimizer
The shark nose optimization algorithm is a population-based metaheuristic inspired by the food foraging behavior of sharks. The algorithm computes the shark's position from two kinds of movement: i) forward movement and ii) rotational movement. The mathematical expressions for these movements are given in (15)-(17).
The shark's new position is determined using the expression shown in (17).
Gaussian mutation was also performed in our work to make the optimization more accurate and reliable. To create a new generation, Gaussian mutation adds a random value drawn from a Gaussian distribution to every element of an individual's vector. The pseudocode for the proposed algorithm is described in Table 2.
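The Gaussian mutation step can be sketched as below. The standard deviation `sigma` and the clipping of mutated values to the search bounds are assumptions made for the example; the paper does not specify either.

```python
import numpy as np

def gaussian_mutation(position, sigma, lower, upper, rng=None):
    """Add zero-mean Gaussian noise to every element of a solution vector.

    sigma        : noise standard deviation (hypothetical parameter)
    lower, upper : search bounds; clipping to them is an assumption
    """
    rng = np.random.default_rng() if rng is None else rng
    position = np.asarray(position, dtype=float)
    mutated = position + rng.normal(0.0, sigma, size=position.shape)
    return np.clip(mutated, lower, upper)
```

A small `sigma` keeps the mutated candidate close to the original, which is how the mutation supports local search around a promising position.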

Table 2. Pseudocode for proposed ASN technique
| Step number | Step name | Step procedure |
|---|---|---|
| 1 | Initialization | Set the attributes: new position, α, δ. Create an initial population. Create every decision randomly within the acceptable range. Initialize the stage counter to 1. |
| | | For each stage, up to the maximum number of stages: |
| 2 | Forward movement | Compute the velocity vector for every element. Acquire a new location of the shark depending on its forward movement, using the Aquila updating function. |
| 3 | Rotational movement | Depending on the rotational movement, acquire the new location of the shark. Depending on the two moves, choose the shark's upcoming location. |
| 4 | Gaussian mutation | Apply Gaussian mutation to increase the local search ability. |
| | | End for; increment the stage counter. |
| | | Choose the shark position with the greatest value in the final stage. |

Simulation setup
The proposed work has been implemented in MATLAB. The datasets are typical UniProt proteins with experimental gene ontology (GO) annotations and structure models predicted by I-TASSER. The performance metrics of our proposed ASN PPI prediction technique were evaluated and compared with conventional techniques such as Aquila, cat swarm optimization, hunger games search, poor rich optimization, and shark nose optimization.

Error analysis
The error analysis in this study uses several performance metrics to evaluate the model's accuracy. These metrics are the mean absolute error (MAE), which measures the absolute size of the discrepancies between actual and predicted values; the root mean square error (RMSE), which assesses the overall magnitude of the errors; the mean absolute relative error (MARE), which evaluates each error relative to the actual value; and the mean squared error (MSE), which is the average of the squared differences between predicted and actual values. Together, these metrics provide a comprehensive assessment of the model's predictive capabilities and its ability to minimize errors.
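The four error metrics can be computed as in the following sketch. The function name and the small `eps` guard against division by zero in MARE are assumptions for the example; the definitions themselves are the standard ones.

```python
import numpy as np

def error_metrics(y_true, y_pred, eps: float = 1e-12) -> dict:
    """Compute MAE, MSE, RMSE, and MARE between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mae = float(np.mean(np.abs(err)))
    mse = float(np.mean(err ** 2))
    rmse = float(np.sqrt(mse))
    # Relative error: each absolute error is divided by the actual magnitude.
    mare = float(np.mean(np.abs(err) / np.maximum(np.abs(y_true), eps)))
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MARE": mare}
```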

Our proposed ASN approach was compared with traditional optimization techniques, including cat swarm optimization (CSO), hunger games search (HGS), and poor rich optimization (PRO), using various evaluation metrics. For dataset-1, our approach achieved a lower MAE of 0.013 at a 60% learning percentage (LP) compared to PRO (0.017) and CSO (0.014). Additionally, our method demonstrated a MARE value of 1 for the 60% and 70% LPs. In contrast, HGS resulted in higher MSE and RMSE values of 0.043 and 0.21, respectively, at the 60% LP, indicating that our ASN approach is more reliable than the traditional methods. Figure 3 compares the ASN PPI prediction model with traditional models on dataset-1, giving results for MAE in Figure 3(a), MSE in Figure 3(b), MARE in Figure 3(c), and RMSE in Figure 3(d).
For dataset-2, we compared our ASN approach with traditional optimization algorithms, assessing metrics like MAE, MARE, MSE, and RMSE. Figure 4 compares the ASN PPI prediction model with traditional models on dataset-2, giving results for MAE in Figure 4(a), MSE in Figure 4(b), MARE in Figure 4(c), and RMSE in Figure 4(d). Notably, at the 60-90% LPs, our approach achieves lower MAE values (0.17, 0.18, 0.19, and 0.17) compared to CSO (0.180, 0.183, 0.194, and 0.195). Similarly, our MARE values for dataset-2 are consistently lower (1.7, 1.3, 1.8, and 1.5) across LPs, showcasing the effectiveness of our ASN technique for PPI prediction. In contrast, both the HGS and PRO techniques yield higher MAE and MARE values, underlining the superior performance of our proposed prediction strategy over traditional methods.
Figure 5 compares the performance of the proposed ASN-based prediction strategy with alternative networks for dataset-1 and dataset-2. It is divided into two sub-figures, Figures 5(a) and 5(b), each focusing on one dataset. Figure 5(a) presents the performance results for dataset-1. The key performance metric, MAE, is highlighted, showing that the ASN-based strategy achieves a low MAE of 0.135. This is contrasted with LSTM, CNN, and SVM, which exhibit significantly higher MARE values of 2.44, 2.99, and 1.96, respectively. The results emphasize the superior performance of the proposed ASN-based strategy on dataset-1.

Accuracy analysis
The performance of the proposed model is evaluated on dataset 1 and dataset 2 at learning percentages of 60, 70, 80, and 90. According to the obtained results, the proposed model attains the highest accuracy among the compared models at the different learning percentages. The results are given in Tables 3 and 4.
At a 60% learning percentage on dataset 1, the developed model achieves an accuracy of approximately 87.37%, surpassing traditional methods such as CSO, HGS, and PRO. Additionally, on dataset 2 at a 70% learning percentage, the developed model attains the highest accuracy among the alternatives. These results underscore the model's robust performance and its superiority over traditional methods in delivering accurate outcomes across datasets and learning percentages.

CONCLUSION
The current research work has focused on predicting protein-protein interactions by using sequence-based features and optimized classifiers. Different physicochemical properties have different effects on the classification of AAs in protein sequences, and the classification criteria for AAs based on their physicochemical properties are difficult to choose. This is one direction of our future efforts. In addition, the proposed machine learning approach has distinctive inherent biases, including representation biases and process biases, which significantly affect its learning behavior and performance even within the same learning task. In the future, we will develop an ensemble meta-learning strategy to overcome these issues, and it will be extensible to other domains as well. We would also employ another sequence-based model with an advanced deep learning concept. In addition, a more effective optimization approach can be developed to extend the current method.

Figure 1. Proposed ASN PPI prediction model

Figure 3. Comparison of (a) MAE, (b) MSE, (c) MARE, and (d) RMSE of the proposed ASN PPI prediction model with standard optimization algorithms for dataset-1

Table 1. Literature review on traditional models

Table 3. Comparison of accuracy of the proposed ASN approach with traditional optimization algorithms for dataset 1

Table 4. Comparison of accuracy of the proposed ASN approach with traditional optimization algorithms for dataset 2