Predicting active compounds for lung cancer based on quantitative structure-activity relationships

ABSTRACT


INTRODUCTION
Research in the field of drug discovery is important and contributes to the improvement of healthcare quality. Developing a new drug is a long and complex process that relies on the translation of a new molecular target into a proven therapy with efficient results. Drug discovery is one of the most outstanding scientific tasks. Advances in computational biology have broadly improved drug discovery pipelines. Classical methods directed towards this goal are time-consuming and expensive [1]. Therapeutic studies are crucial for designing new drugs for the benefit of patients, as well as for public health reasons [2].
Nowadays, computational techniques have expanded their focus and greatly improved pipelines in the field of pharmacological medicine, as they have demonstrated successful results compared to traditional methods. Moreover, the remarkable amount of biological data publicly available and carefully stored in repositories has enabled researchers to explore numerous computational-based methodologies. Predictive modeling is one of the most widely applied techniques to enhance drug discovery pipelines. Machine learning (ML) techniques can be utilized to construct models that effectively classify drugs into relevant therapeutic categories and accurately detect and classify various stages of tumors [3], [4]. Additionally, ML methods can be used to design new drugs based on the chemical properties of studied molecules [5].
Int J Elec & Comp Eng, Vol. 13, No. 5, October 2023: 5755-5763, ISSN: 2088-8708
Quantitative structure-activity relationship (QSAR) methods are techniques that apply ML to learn the relationships between the chemical structure and the biological activity [6] of molecules. They have also helped to establish an empirical statistical model for the computational chemistry toolkit [7]. The chemical structure of a molecule is used to calculate molecular descriptors, which essentially describe the physical and chemical properties that distinguish one molecule from another. QSAR-based models can provide insights into which chemical properties are important to inhibit a biological process. Such information is of great interest to biologists and chemists who design future molecules with more robust properties.
Bioinformatic methods have successfully enabled researchers to study molecules from a system-level perspective. They use computational processes to integrate knowledge and expertise from genomics, proteomics, transcriptomics, population genetics, and molecular phylogenetics. Bioinformatic analysis has enhanced drug target identification and drug candidate screening. Moreover, it facilitates the prediction of drug resistance, helps minimize side effects, and has become increasingly essential in drug discovery [8]. Thus, numerous ML-based algorithms have been proposed to predict interactions among biological entities, as well as to design new drugs with similar properties for specific medical treatments. One of the main challenges in building an efficient ML classifier is the absence of good-quality data. In fact, the available biological data is heterogeneous and requires a preprocessing step before the ML models can be trained. Moreover, in cancer classification problems, most of the available datasets are imbalanced, as there are far more non-active molecules than active ones [9].
Our contribution is to build a classifier able to predict active compounds for lung cancer. We extracted active and non-active molecules from the ChEMBL database to constitute our dataset. We computed fingerprint descriptors of the collected molecules to learn from them. We took full advantage of the chemical characteristics and the structure of the molecules to build a sequential neural network model. Furthermore, we conducted a comparative study between the multilayer perceptron neural network and the gradient boosting tree classifier, analyzing the feature contributions of each model in order to identify important chemical structures of active molecules for lung cancer.
This paper is part of a series of studies carried out by a team in our laboratory that is interested in exploring biological data using data mining and machine learning tools [10]-[12]. The paper is organized as follows: section 2 presents related work, section 3 presents our approach, section 4 discusses the obtained results, and the final section concludes the paper.

RELATED WORK
Experiments to identify new therapeutic targets are aimed at investigating novel molecules and improving the bioavailability of drugs. Traditional methods applied in drug discovery rely on the physical and chemical structure of the studied molecules. Genome-wide association studies (GWAS) screen a large number of genomes to identify associations between genetic variants and non-disease traits [13]. This approach has been widely used to identify single nucleotide polymorphisms (SNPs) associated with diseases and has greatly improved our understanding of biological processes [14]. Identification of drug target sites is another methodology that many studies rely on. It refers to the discovery of interactions among diverse compounds and protein targets in the human body. Lee et al. [15] experimentally demonstrated that the duration of in vivo drug-target binding is highly affected by the drug-target resistance. In another study on targeted therapies for lung cancer, Larsen et al. [16] suggested integrating genome-wide tumor analysis with drug-targeted responsive phenotypes to investigate new therapeutic strategies. This approach requires further knowledge of the binding sites. Moreover, it involves prior knowledge of related pathways to develop effective targeted therapeutics.
Structure-based approaches have significantly enhanced virtual screening, de novo design, and lead optimization [17], [18], based on the availability of ligand structures. On this subject, Almeida et al. [19] used multiple ligand-based virtual screening approaches to investigate novel potential MARK-3 inhibitors in cancer. Similarly, Li et al. [20] suggested a nanoparticle-mediated targeted drug as a novel therapeutic for hepatocellular carcinoma using ligands that recognize hepatoma cells. The main disadvantage of this approach is that it cannot be used in situations where ligands are unknown. Similarity-based methods have also been used to design novel compounds. QSAR is a methodology based on the premise that structurally similar compounds tend to possess similar biological activities [21]. Numerous studies based on this approach calculate a similarity score among drug profiles to discover potential drug-drug interactions. Vilar and Hripcsak [22] used several drug profiles to compute a similarity score between multiple compounds. Correspondingly, Ferdousi et al. [23] compared diverse molecular profiles and found that the structural profile is the best metric to predict drug-drug interactions. The major disadvantage of this method is the choice of a suitable threshold for the computed similarity, which is highly affected by the quality of the dataset used and by false positive interactions.
Likewise, classical methods used in drug discovery are time-consuming. Besides, they are less accurate because of the number of reported adverse drug reactions (ADRs).
Computational methods have significantly changed the way novel drugs are designed. Drug discovery pipelines have been largely enhanced, which has improved our understanding of biological processes. Biological networks are a great way to represent chemical interactions, as they have helped to integrate and model diverse heterogeneous biological data. Hanafi et al. [12] proposed a network-based method combined with an ML algorithm to classify and predict interactions between genes, drugs, and diseases. They were also able to rank the top 20 gene-drug pairs related to lung cancer. Topological data analysis has recently been used to study large-scale biological data. Hanafi et al. [24] built a biological network using data integration methods and explored numerous graph properties to evaluate potential gene-disease interactions. In a study on drug repositioning for non-small cell lung cancer (NSCLC), Huang et al. [25] combined topological parameter-based classification and ML algorithms to explore potential therapeutic drugs for NSCLC. They successfully suggested promising drugs for treating early and late-stage lung cancer that were supported by the literature and appeared highly effective in clinical trials and in vitro. Similarly, in a study on the identification of potent small-molecule inhibitors targeting Src kinase as a therapeutic strategy for lung cancer, Weng et al. [26] constructed a computational model for the in silico screening of Src inhibitors and evaluated the effect of potential candidate compounds based on a QSAR model. The obtained results were promising, as the candidate compounds revealed a significant inhibitory effect against Src activity.
In this paper, we present a computational QSAR-based model, combined with a tree-based classifier and a neural network model, to predict novel targeted compounds in lung cancer. We created a dataset of compounds related to NSCLC from the ChEMBL database. We computed molecular descriptors of the molecules as an 881-bit array, which we used as input features for the learning tasks. Furthermore, to evaluate our models, we conducted a feature engineering step and compared feature contributions using the SHAP values method.

OUR APPROACH
Our study follows a systematic approach to propose active compounds for lung cancer. The overall methodology is described in Figure 1. We started by collecting bioactive compounds related to non-small cell lung cancer from the ChEMBL database to construct our dataset. Afterwards, we divided the compounds into two groups, highly active and non-active drugs, based on their half-maximal inhibitory concentration, denoted as IC50. The lower the IC50, the more likely the drug is effective in inhibiting NSCLC. Then, we computed molecular descriptors and initiated two learning tasks to build our models and learn from the chemical characteristics of the calculated molecular descriptors.

Dataset construction
We constructed our dataset from the ChEMBL database, a discovery platform that covers drug-like compounds [27], [28]. It offers a large variety of data related to drugs and provides insights, tools, and resources for drug discovery. Researchers use ChEMBL to make associations between diseases and their relevant targets. It also helps identify small molecules that can be used to target newly sequenced genomes. ChEMBL integrates its content primarily from the scientific literature, making it a great and accurate tool for in silico drug design. The collected compounds have two features to predict their binding activity. Table 1 shows some samples from our dataset.
The IC50 measures the potency of a molecule in inhibiting a biological process by 50%. It indicates how much of a substance is needed to inhibit half of a given process. Consequently, drugs with lower IC50 values are highly active and have a value of 1 in the column "Activity in NSCLC". Conversely, drugs with higher IC50 values are less active and have a value of 0 in the column "Activity in NSCLC".
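The labeling rule described above can be sketched in a few lines. The paper does not report the IC50 cutoff it used, so the value `ACTIVE_MAX_IC50_NM` below is a purely hypothetical placeholder, as are the function name and the example records.

```python
# Hypothetical cutoff in nanomolar; the paper does not state the value it used.
ACTIVE_MAX_IC50_NM = 1000.0


def label_activity(ic50_nm: float, cutoff_nm: float = ACTIVE_MAX_IC50_NM) -> int:
    """Return 1 (highly active) when IC50 is at or below the cutoff, else 0."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive concentration")
    return 1 if ic50_nm <= cutoff_nm else 0


# Made-up example records: (compound id, IC50 in nM).
records = [("cmpd_a", 12.0), ("cmpd_b", 850.0), ("cmpd_c", 25000.0)]

# Value that would populate the "Activity in NSCLC" column.
labels = {name: label_activity(ic50) for name, ic50 in records}
print(labels)
```

Changing the cutoff shifts the balance between the two classes, so it is worth recording alongside the dataset.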

Molecular descriptors
Molecular descriptors can be defined as a way to encode the chemical structure of molecules into numbers, typically represented as an array of bits. Each numerical value denotes the presence or absence of a certain pattern, such as a hydrogen bond, atom, or fragment. They are used to explore physicochemical and topological properties to establish the basis for in silico predictive QSAR-based models and are also useful in performing similarity searches in molecular libraries.
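To make the similarity-search use of bit-array descriptors concrete, here is a minimal sketch of the Tanimoto coefficient, a common similarity measure for binary fingerprints. The paper does not say which similarity measure, if any, it applied, so this choice is an assumption, and the 8-bit vectors are made up (real PubChem fingerprints have 881 bits).

```python
from typing import Sequence


def tanimoto(a: Sequence[int], b: Sequence[int]) -> float:
    """Tanimoto similarity between two equal-length 0/1 fingerprints."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)    # bits set in both
    either = sum(1 for x, y in zip(a, b) if x or y)   # bits set in at least one
    # Convention used here: two all-zero fingerprints get similarity 0.0.
    return both / either if either else 0.0


# Made-up 8-bit fingerprints for illustration only.
fp_a = [1, 1, 0, 1, 0, 0, 1, 0]
fp_b = [1, 0, 0, 1, 0, 1, 1, 0]
similarity = tanimoto(fp_a, fp_b)
print(similarity)
```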
We used the PaDEL-descriptor [29] software to calculate the molecular descriptors of our collected molecules. The software was developed by the National University of Singapore with the aid of the Chemistry Development Kit. It uses the simplified molecular input line entry system (SMILES) to compute hundreds of molecular descriptors and fingerprints. SMILES is the simplest way to represent a molecule based on a line notation [30]. It is a way to encode a chemical structure using notations that can be read and understood by a computer. The ChEMBL database provides the SMILES notation for the collected compounds, and Table 2 shows some of our collected molecules with their corresponding SMILES notation.
PubChem is a chemistry database that covers substances, compounds, and bio-assays. It defines a binary substructure fingerprint for chemical structures. Each compound's SMILES in our dataset was encoded into an array of 881 bits consisting of physicochemical properties defined by PubChem. Table 3 shows a summary description of the bits used by PubChem descriptors.

Learning tasks
The dataset contains a collection of 142,852 molecules divided into active and non-active groups based on the IC50 value. We allocated 90% of our data to the training set and 10% to the test set. The target feature to predict, activity in NSCLC, takes discrete values: 0 for non-active compounds and 1 for highly active compounds. We carried out a pre-processing step that reduces the number of features used as input to the ML model. The initial number of features is 881. Low-variance features were removed using a variance threshold of 0.15, which means dropping features whose values are nearly constant (for example, a bit that takes the same value in 85% of the molecules has a variance of 0.85 × 0.15 ≈ 0.13 and is removed). We ended up with 175 high-variance features, which form a good set for the model to detect regularities present in the dataset. A smaller number of features also makes the training phase more efficient.
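The variance filter described above can be illustrated with a small standard-library sketch. The paper does not say which tool performed the filtering (scikit-learn's `VarianceThreshold` is one common choice), so the function names and the toy matrix below are assumptions for illustration only. For a 0/1 feature set in a fraction p of the molecules, the variance is p(1 - p).

```python
from typing import List, Sequence

VARIANCE_THRESHOLD = 0.15  # value reported in the text


def bit_variance(column: Sequence[int]) -> float:
    """Population variance of a 0/1 column, which equals p * (1 - p)."""
    p = sum(column) / len(column)
    return p * (1.0 - p)


def select_high_variance(rows: List[List[int]],
                         threshold: float = VARIANCE_THRESHOLD) -> List[int]:
    """Return the indices of columns whose variance exceeds the threshold."""
    keep = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        if bit_variance(column) > threshold:
            keep.append(j)
    return keep


# Toy matrix (made-up data): 20 molecules x 3 bits.
# Bit 0 is set in 10/20 molecules, bit 1 in 17/20, bit 2 in 20/20.
rows = [[1 if i < 10 else 0, 1 if i < 17 else 0, 1] for i in range(20)]

kept = select_high_variance(rows)
reduced = [[row[j] for j in kept] for row in rows]  # filtered feature matrix
print(kept)
```

Only the balanced bit survives here: the bit set in 85% of the molecules has variance 0.1275, below the threshold, and the constant bit has variance 0.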
To build our neural network, we used the Keras [31] library to define a multilayer perceptron model for binary classification. Physicochemical properties were fed into a sequential model, which consists of three hidden layers. Each layer is a dense (fully connected) layer. The input layer has 175 units, matching the length of the feature vector. The three hidden layers have 50, 10, and 2 neurons, respectively, with the rectified linear unit (ReLU) as the activation function. The output layer has one node that uses the sigmoid activation function, so the network maps the physicochemical properties to a single output value. We trained our model using binary cross-entropy as the loss function and the Adam method as the optimizer. Similarly, we applied a gradient boosting tree classifier to learn from the molecular descriptor array and predict the activity of compounds in NSCLC. We used the extreme gradient boosting (XGBoost) algorithm from the Scikit-Learn [32], [33] implementation to train and evaluate our model's performance.
These bits check for the existence of bonded atom pairs, regardless of their count and order
From 327 to 448: These bits check for the existence of atom nearest neighbor patterns, taking into account aromaticity significant bonds
From 445 to 459: These bits check for the existence of detailed atom neighborhood patterns, regardless of count, but where bond orders are specific
From 460 to 712: These bits check for the existence of simple SMARTS patterns, regardless of count, but where bond orders are specific and bond aromaticity matches both single and double bonds
From 713 to 880: These bits check for the existence of complex SMARTS patterns, regardless of count, but where bond orders and bond aromaticity are specific
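The multilayer perceptron described in this section was built with Keras. The sketch below deliberately uses only the Python standard library to show the shape of that architecture (175 inputs, hidden layers of 50, 10, and 2 ReLU units, one sigmoid output) and one forward pass. The random weights stand in for trained parameters, and all names are hypothetical; this is not the authors' code.

```python
import math
import random
from typing import List, Tuple

# Input size, three hidden layers, single output node (sizes taken from the text).
LAYER_SIZES = [175, 50, 10, 2, 1]

Layer = Tuple[List[List[float]], List[float]]  # (weights[out][in], biases[out])


def init_layers(sizes: List[int], seed: int = 0) -> List[Layer]:
    """Create small random weights as a stand-in for trained parameters."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(layers: List[Layer], x: List[float]) -> float:
    """Apply ReLU on hidden layers and sigmoid on the output node."""
    activation = x
    for index, (weights, biases) in enumerate(layers):
        z = [sum(w * a for w, a in zip(row, activation)) + b
             for row, b in zip(weights, biases)]
        if index < len(layers) - 1:
            activation = [max(0.0, v) for v in z]                  # ReLU
        else:
            activation = [1.0 / (1.0 + math.exp(-v)) for v in z]   # sigmoid
    return activation[0]


def count_parameters(layers: List[Layer]) -> int:
    """Total number of weights and biases in the network."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


layers = init_layers(LAYER_SIZES)
bit_rng = random.Random(1)
fingerprint = [float(bit_rng.randint(0, 1)) for _ in range(175)]  # made-up input
probability = forward(layers, fingerprint)  # predicted probability of class 1
print(count_parameters(layers), probability)
```

With these layer sizes the network has 8,800 + 510 + 22 + 3 = 9,335 trainable parameters, which is small relative to the size of the dataset.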

RESULTS AND DISCUSSION
Our neural network is a sequential feedforward model consisting of 3 hidden layers. The performance of the model was evaluated based on the log loss function as well as the accuracy reached during the training process over 100 epochs. The data was shuffled and split into portions called batches, with each batch consisting of 10 samples. During the learning process, the model loops over all these batches in each epoch and updates its weights. Figure 2 shows the evolution of the log loss over training cycles. The model achieved an accuracy score of 0.96 with a log loss of 0.1166.
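For reference, the two metrics reported above can be computed as follows. This is a minimal sketch of the standard definitions of binary log loss (binary cross-entropy) and accuracy; the labels and probabilities are made up, and the 0.5 decision threshold is an assumption, since the paper does not state one.

```python
import math
from typing import Sequence


def log_loss(y_true: Sequence[int], y_prob: Sequence[float], eps: float = 1e-15) -> float:
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def accuracy(y_true: Sequence[int], y_prob: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of samples whose thresholded prediction matches the label."""
    hits = sum(1 for y, p in zip(y_true, y_prob) if (p >= threshold) == bool(y))
    return hits / len(y_true)


# Made-up labels and predicted probabilities for four molecules.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.1, 0.8, 0.3]
print(log_loss(y_true, y_prob), accuracy(y_true, y_prob))
```

Log loss penalizes confident wrong predictions heavily, which is why it can keep decreasing even after accuracy has plateaued.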
Similarly, we calculated the log loss of the decision tree-based classifier to evaluate its performance over 100 epochs. We tuned the model with a grid search to find optimal values for its hyperparameters; it reached its highest performance with 100 boosted trees and a maximum depth of 6 levels. Figure 3 shows the obtained curve of the log loss function, which decreases during training and reached a value of 0.008 on the validation set. In addition, the model achieved an F1 score of 0.86 for both predicted classes. This gives us a snapshot of the training process, which was successful for both models, with the XGBoost model being more effective than the neural network-based model.
Both models achieve high prediction performance. However, the number of predictors (175) is still high. Consequently, finding relevant features is a crucial step to depict important structures of active molecules in lung cancer. Moreover, it will help set up appropriate model parameters that will enhance the classification results of our method when evaluated on unseen molecules. For that reason, we calculated the most relevant features to fully utilize the capabilities of our models, which can easily capture patterns within the structure of novel molecules, and to reduce the number of hyperparameters that need to be tuned. We plotted the Shapley values using the Shapley additive explanations (SHAP) method [34], a way to explain how the model estimates a prediction class for a given molecule. It is also a way to measure feature contribution and to find the most relevant features for the dataset used. Figures 4 and 5 depict the key features identified by the artificial neural network (ANN) and XGBoost models, respectively, highlighting the essential molecular patterns utilized by both models to make predictions. Remarkably, there are 11 common features that both models leverage, suggesting that these features play a critical role in distinguishing active and inactive drugs in NSCLC. To provide a comprehensive overview, Table 4 summarizes these 11 features, shedding light on the structural characteristics that underlie drug efficiency in NSCLC.
We then retrained our XGBoost model using a reduced number of features. The 11 most relevant features were used as input for the learning process. The model achieved an F1 score of 0.98 with a final mean squared error of 0.02. This score is higher than that reached by the study in [35], where an F1 score of 0.76 was obtained. Moreover, we compared the ability of our model to predict drugs that were revealed as highly active in the study [35]. Then, we ranked the top-10 highly active molecules in lung cancer, and Table 5 shows a list of these drugs.
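The Shapley values behind the SHAP attribution step described above can be computed exactly for a very small model by enumerating feature coalitions. The sketch below does this for a made-up three-bit scoring function. It illustrates the definition only: the SHAP library relies on model-specific or approximate algorithms to scale to hundreds of features, and the paper does not say which explainer was used. The toy model, the all-zero baseline, and all names are assumptions.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence


def shapley_values(predict: Callable[[List[int]], float],
                   x: Sequence[int],
                   baseline: Sequence[int]) -> List[float]:
    """Exact Shapley values of each feature for one input, relative to a baseline."""
    n = len(x)

    def value(subset: frozenset) -> float:
        # Features in the coalition take their value from x, the rest from the baseline.
        return predict([x[i] if i in subset else baseline[i] for i in range(n)])

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                phi[i] += weight * (value(coalition | {i}) - value(coalition))
    return phi


def toy_model(bits: List[int]) -> float:
    """Made-up score: bit 0 acts alone, bits 1 and 2 only matter together."""
    return 0.1 + 0.5 * bits[0] + 0.2 * bits[1] * bits[2]


x = [1, 1, 1]
baseline = [0, 0, 0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The contributions sum to the difference between the prediction for the molecule and the prediction for the baseline, and the two interacting bits share their joint effect equally. Ranking features by the magnitude of such contributions across a dataset is what allows the most relevant features of the two models to be compared.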
The top-10 molecules we obtained were all supported by the literature. Erlotinib, which ranked 1, is an oral anticancer drug that inhibits the epidermal growth factor receptor responsible for excessive cell development in malignant lung tumors [36]. Paclitaxel, which ranked 2, is used in combination with Cisplatin (ranked 5) as a first-line therapy for patients whose disease cannot be treated with surgery or radiation therapy [37], [38]. Specific KRAS mutations are responsible for lung cancer, and patients with these mutations are often resistant to targeted drugs such as those ranked 3 and 7 [39]. Moreover, drugs predicted to be active in NSCLC by the study in [35] were also present in our top-10 list (drugs ranked 4, 6, 8, 9, and 10), and have been validated by many studies [40]-[45].

CONCLUSION
In this paper, we propose a new approach to explore the important structures of active molecules in lung cancer. We set up two machine learning models to learn from chemical structures and predict novel drugs highly active in lung cancer, taking full advantage of a QSAR-based method. We conducted a comparative study to evaluate the performance of the two models based on several metrics. We used SHAP values to perform a feature engineering step and to list the essential chemical structures that contributed most to the models' predictions. Both models showed good results and were able to rank the top 10 highly active molecules used as therapy for patients with lung cancer. The obtained results were compared to the medical research literature and supported by several studies. Our methodology demonstrated promising results that can enhance drug discovery pipelines, not only for lung cancer but also, potentially, for other diseases.
Int J Elec & Comp Eng ISSN: 2088-8708  Predicting active compounds for lung cancer based on quantitative structure-activity … (Hamza Hanafi)

Figure 1. Overall approach followed to predict compound activity for non-small cell lung cancer

Figure 2. Log loss curve obtained for the artificial neural network model
Figure 3.

Figure 4. Feature contribution in the artificial neural network model
Figure 5.

Table 1. Some samples from the used dataset

Table 3. Description of bits defined by PubChem

Table 4. Structural patterns that describe drugs' activity in NSCLC obtained from our ML methods

Table 5. Top-10 ranked drugs in lung cancer