Use of QSAR Global Models and Molecular Docking for Developing New Inhibitors of c-src Tyrosine Kinase

A prototype of a family of at least nine members, cellular Src tyrosine kinase is a therapeutically interesting target because its inhibition might be of interest not only in a number of malignancies, but also in a diverse array of conditions, from neurodegenerative pathologies to certain viral infections. Computational methods in drug discovery are considerably cheaper than conventional methods and offer opportunities of screening very large numbers of compounds in conditions that would be simply impossible within the wet lab experimental settings. We explored the use of global quantitative structure-activity relationship (QSAR) models and molecular ligand docking in the discovery of new c-src tyrosine kinase inhibitors. Using a dataset of 1038 compounds from ChEMBL database, we developed over 350 QSAR classification models. A total of 49 models with reasonably good performance were selected and the models were assembled by stacking with a simple majority vote and used for the virtual screening of over 100,000 compounds. A total of 744 compounds were predicted by at least 50% of the QSAR models as active, 147 compounds were within the applicability domain and predicted by at least 75% of the models to be active. The latter 147 compounds were submitted to molecular ligand docking using AutoDock Vina and LeDock, and 89 were predicted to be active based on the energy of binding.


Introduction
Src (c-src, pp60-src, or p60-src) is a nonreceptor, cytoplasmic tyrosine kinase, the first of its kind to be discovered (in the 1970s) in the living world, whereas the corresponding gene was the first oncogene to be uncovered [1]. It is the prototype of a larger family, comprising at least nine members, most of them with little activity in normal cells in the absence of stimulatory signals [2]. Src kinases have been suggested to be involved in the exacerbation of neurodegenerative pathologies, whereas their inhibition would diminish microgliosis and mitigate inflammation, findings that are in line with experimental effects seen for nonspecific src inhibitors such as bosutinib or LCB-03-0110 [3]. Nonclinical evidence has pointed to the inhibition of src kinases as a possible method of therapy for the pulmonary vascular remodeling and right ventricular hypertrophy in pulmonary hypertension [4], although several reports indicate that dual Abl/src inhibitor dasatinib may actually induce pulmonary hypertension [5][6][7]; it was more recently suggested that this dasatinib effect may in fact be independent of the src inhibition [7]. Variability of the dataset illustrated by several simple constitutional descriptors or molecular properties. Blue-inactive compounds; red-active compounds. For the Lipinsky rule, "No" indicates compounds not obeying to the Lipinsky's rule of five, and "Yes" compounds satisfying the rule; among the latter the active compounds are more frequent. C% indicates the percentage of carbon atoms, N% the percentage of nitrogen atoms, whereas ALOGP is the Ghose-Crippen octanol-water partition coeff. (logP).
A dissimilarity matrix based on the Gower distance was computed (the Gower distance is appropriate for data of a heterogeneous nature), using 783 most relevant descriptors (that remained after removing autocorrelated and quasi-constant features). Although Gower distance takes values between 0 and 1, because it tends to give larger weights to binary variables [28], we rescaled the distance matrix and plotted it as a dissimilarity plot ( Figure 2). Before rescaling the maximum value of the Gower distance was 0.404, following rescaling it became 1. The median (scaled) dissimilarity values were mostly around 0.2-0.3, suggesting that the chemical diversity in the dataset was rather limited (Supplementary Figures S1-S3). Variability of the dataset illustrated by several simple constitutional descriptors or molecular properties. Blue-inactive compounds; red-active compounds. For the Lipinsky rule, "No" indicates compounds not obeying to the Lipinsky's rule of five, and "Yes" compounds satisfying the rule; among the latter the active compounds are more frequent. C% indicates the percentage of carbon atoms, N% the percentage of nitrogen atoms, whereas ALOGP is the Ghose-Crippen octanol-water partition coeff. (logP).
A dissimilarity matrix based on the Gower distance was computed (the Gower distance is appropriate for data of a heterogeneous nature), using 783 most relevant descriptors (that remained after removing autocorrelated and quasi-constant features). Although Gower distance takes values between 0 and 1, because it tends to give larger weights to binary variables [28], we rescaled the distance matrix and plotted it as a dissimilarity plot ( Figure 2). Before rescaling the maximum value of the Gower distance was 0.404, following rescaling it became 1. The median (scaled) dissimilarity values were mostly around 0.2-0.3, suggesting that the chemical diversity in the dataset was rather limited (Supplementary Figures S1-S3).

Performances of Models in Nested Cross-Validation
Using a variety of classification algorithms (random forests, support vector machines, adaboost M1, Bayesian Additive Regression Trees, C5.0, and binomial regression), of feature selection methods (17), and numbers of features (between 3 and 40-for instance, for binomial regression we used models with 3, 5, 10, and 20 features, and thus the number of models built for this classifier was 68), a total number of over 350 models were built and their performance was assessed by nested cross-validation. Only models with an acceptable performance (defined as having both a balanced accuracy higher than 70% and a positive predictive value higher than 70% in the nested cross-validation) were selected (Table 1). In instances when several models (with different numbers of features) had good performance for the same classifier and selection algorithm (over the threshold of 70%), we only tabulated the model we judged as best (highest average between balanced accuracy and positive predictive value (PPV), and for equal value of the average giving preference to higher PPV). Numbers of true positives, true negatives, false positives, and false negatives, allowing computation of other performance metrics are available in Supplementary Table S2.

Performances of Models in Nested Cross-Validation
Using a variety of classification algorithms (random forests, support vector machines, adaboost M1, Bayesian Additive Regression Trees, C5.0, and binomial regression), of feature selection methods (17), and numbers of features (between 3 and 40-for instance, for binomial regression we used models with 3, 5, 10, and 20 features, and thus the number of models built for this classifier was 68), a total number of over 350 models were built and their performance was assessed by nested cross-validation. Only models with an acceptable performance (defined as having both a balanced accuracy higher than 70% and a positive predictive value higher than 70% in the nested cross-validation) were selected (Table 1). In instances when several models (with different numbers of features) had good performance for the same classifier and selection algorithm (over the threshold of 70%), we only tabulated the model we judged as best (highest average between balanced accuracy and positive predictive value (PPV), and for equal value of the average giving preference to higher PPV). Numbers of true positives, true negatives, false positives, and false negatives, allowing computation of other performance metrics are available in Supplementary Table S2.
As the dataset includes 1038 compounds, of which 286 are active and 752 inactive, the most probable random accuracy (Q 2, rnd) [29,30] may be estimated to 60.08% (286 × 286 + 752 × 752)/(1038 × 1038). As shown by the last column in Table 1, our models have a superiority of about 20-24% over random accuracy. However, the concept of random accuracy assumes that the correct classification of the two classes is of equal importance; in fact, in our case, we were more interested in correctly predicting the active compounds (i.e., optimizing the PPV was more important than Q 2 ). The models were thus not optimized to increase the global accuracy, but rather both balanced accuracy and PPV.
As the models applied in the nested cross-validation are always based on only a subset of the data, the estimation of performance should be conservative (i.e., applying the selected models on the whole dataset has better performance). * Each model name is formed by three parts separated by an underscore: the first part of the name indicates the classifier, the second part the feature selection algorithm (in an abbreviated form), and the third part the number of features used to build the model. The names of the classification and feature selection algorithms are provided in Section 4. For instance, RF_anova_20 was a random forest based on features selected based on ANOVA (as implemented in "anova.test" within "mlr" R package) and the number of features used was 20. BA: balanced accuracy; PPV: positive predictive value; MMCE: mean misclassification error; AUC: area under the ROC curve; TPR: true positive rate; TNR: true negative rate; Q2 -accuracy; Q2, rnd -most probable random accuracy (as explained in the text).

Y-Randomization Test
As expected, despite following the same steps in building the models, scrambling the activity labels had a strong impact on the performance of the models, which was clearly inferior to those based on the initial (unscrambled) data: the average balanced accuracy of all 10 y-scrambling tests (nested cross-validation performed in the same conditions and following the same pre-processing as the true data) was 50.23%, with a standard deviation of 0.59% (minimum value 49.73% and maximum 51.45%).
In a similar way, the mean value of the positive predictive (PPV) was 20.38%, and its value varied between 0.00% and 30.00%. This provides reassurance that the performance of the models is not the result of mere chance, but rather reflects a true relationship between the descriptors and the inhibitory activity on c-src tyrosine kinase.

Descriptors Associated with c-src Inhibitory Activity
While for all models the number of features was relatively high (in most cases between 20 and 30), the largest predictive effect could be attributed to no more than five features. For instance, in the case of random forest, using ANOVA as a feature selection (filtering) algorithm, with 23 features, the area under the receiver operating characteristic (ROC) curve (AUC) was 82.56% and the balanced accuracy 70.24%; however, using only the first most important five molecular descriptors, the AUC was 77.53%, and the balanced accuracy 66.39%. Although there was an improvement for the higher number of features (23), the first five explained the largest part of the variability in the training and testing datasets. We therefore focused on the first five descriptors selected by each of the 17 selection algorithms and found that most algorithms identified the same features as being the most important. These are shown in Table 2 (and descriptor values in Supplementary Table S3).

Virtual Screening and External Validation
We applied the models to the 104,619 ZINC compounds and ranked them based on the percentage of models predicting the compounds as active. Using a threshold of 50% (i.e., compounds predicted to be "active" by more than 50% of all models applied) 744 compounds were identified. Our validation data (using the predictions on the test sets from the nested cross-validation) indicated that the PPV for this threshold was 78.57%. Increasing the decision threshold to 75% the number of compounds decreased to 158, but after eliminating the compounds that had been part of the training set and the duplicates (multiple ZINC ids may correspond to the same substance), their number decreased to 115 (Table S2); the validation data indicated a PPV value for this threshold of 85.43%. For a threshold of 90% the PPV in the validation was also close to 90% (90.1%), but the number of unique compounds was limited to 37.
For external validation purposes, we searched PubChem and ChEMBL for biological data related to the activity of the predicted compounds on the src tyrosine kinase, so as to have at least partial confirmation on the accuracy of the predictions. We found that among the 115 substances predicted as being active, for nine compounds (i.e., 7.83%) there is available evidence that they are active on the c-src tyrosine kinase. We could not find ki values for the nine compounds, but in most cases rather mean inhibition (as a percentage) at 1.0 or 0.1 µM was available. Taking into account that IC50 values are always higher than ki values for a competitive inhibitor, and the fact that percent inhibition is dependent on both substrate and inhibitor concentration, we considered as active compounds those with inhibition values of at least 30%. When a compound was labeled as "active" on the src target in one of the two public databases without further information on the endpoint or bioassay used, we also considered that compound as active (that was the case for balamapimod, reported by PubChem). Of the nine compounds labeled by us as "active", three had a mean inhibition higher than 50%, one had a ki less than 1000 nM (20 nM to be precise), one was stated as "active" by PubChem with no further information and four had a mean% inhibition between 30% and 42.23% at 1 µM). A total of 34 additional substances (29.56%), predicted by the large majority of models as being active, were in fact proven to be inactive on src-tyrosine kinase, whereas 72 of the substances (62.61%) predicted to be active, seem to have never been tested for their effect on src tyrosine kinase. If the 43 compounds that were indeed tested were representative for the rest, the rate of success for the predictions would be 20.93%.

Applicability Domain
The "applicability domain" (AD) is a concept meant to evaluate if a model may be validly applied to predict the effect of a candidate compound; such validity is conditioned on the satisfaction of the assumptions applied in the construction of the model [31]. If the new substance whose activity we are trying to predict differs substantially from those on which a QSAR a model was based, such a prediction cannot be trusted. Therefore, assessing the AD for model is of paramount importance if that model is to be used for predictions, and a wide range of methods have been proposed in the literature for this purpose, each with its own advantages and flaws [32].
We used a variety of algorithms to assess the applicability domain for the predictions of the QSAR virtual screening by different models. Using the method by Roy et al. (2015) [33], which considers as an outlier each compound with a value outside the mean ± standard deviation, none of the compounds predicted by more than 50% of our models to be active were outside of the applicability model. This was not very surprising, because that method uses a decision tree based on three standard deviations, whereas we capped, centered, and scaled values to two standard deviations. Using the Kernel Density Estimation Outlier Score (KDEOS) algorithm (with a minimum of three and a maximum of 10 neighbors), which is based on a number of k-nearest neighbors, the number of outliers among the 744 compounds predicted as active by the majority of the QSAR models was small for each model, and not higher than 15% of the total number (with a median proportion toward 5%). Selecting the compounds after filtering them based on the applicability model did not change the hierarchization of the compounds predicted as active. The Influenced Outlierness (INFLO) algorithm (with k = 5), which is also based on a number of k-nearest neighbors, but taking into account a "a reverse nearest neighborhood set", and that of F. Sahigara et al. (2013), which not only uses k-nearest neighbors, but also individual decision threshold for each data point of the training sample [34], identified a much larger proportion of compounds as outside the applicability method: for the latter, for instance, the proportion of outliers varied (for the different models) between 1.75% and 44.35%, with a median of 32.39% of the total of 744 compounds ( Figure 3). A number of 147 compounds (of which five had been in the training dataset) were predicted by 75% of the models as being active, after limiting the votes to those compounds that were within the applicability domain estimated with the F. Sahigara et al. (2013) method [34]. All compounds identified by the virtual screening (before checking the applicability domain) fell for at least some of the models within the applicability domain, but the degree of confidence in the predictions changed after checking for the applicability domain.
3). A number of 147 compounds (of which five had been in the training dataset) were predicted by 75% of the models as being active, after limiting the votes to those compounds that were within the applicability domain estimated with the F. Sahigara et al. (2013) method [34] . All compounds identified by the virtual screening (before checking the applicability domain) fell for at least some of the models within the applicability domain, but the degree of confidence in the predictions changed after checking for the applicability domain.

Molecular Docking
In order to assess the performance of docking for the two software programs used (AutoDock Vina [35] and LeDock [36]) we first compared the estimated energies of binding for 175 compounds of the training set, with known activities on the target enzyme. With LeDock, the mean binding energy in the active compound group was −8.02 kcal/mol, whereas in the inactive compound group it was −7.29 kcal/mol (p < 10 −7 , Welch t-test). For the very active compounds (ki < 20 nM), the mean binding energy was −8.43 kcal/mol (p < 10 −8 versus all inactive compounds, Welch t-test). Using the "cutpointr" package, an optimal cut-off was found at an energy of binding of −7.17 kcal/mol, which ensured an accuracy of 70.29%, with high sensitivity (90%), but low specificity (44%). In order to minimize the false positive, a cut-off point of −9.21 kcal/mol was necessary; at this level the specificity was 100% (i.e., none of the inactive compounds had such a low energy of binding in the docking runs), but with a very low sensitivity (only 9% of the active compounds had this low estimated energy of binding) ( Figure 4). As our interest was to minimize the false-positive rate, we docked the 147 compounds predicted by the QSAR models to be active and within the applicability domain and somewhat surprisingly no less than 89 of them (61.22%) had such a low energy of binding, in other words they could be considered as active ( Table 3). Considering that in our training subset, the sensitivity at this cut-off point (−9.21 kcal/mol) was only 9%, this high value does suggest that an important proportion of the compounds predicted by the QSAR models to be active might be indeed active, although when using docking one must be very cautious [37]. The root-mean-square deviation (RMSD) computed for the first cluster of poses of the ANP was 1.25, under the conventional threshold of 2.0, which may be considered reasonably well. The visual examination of the pose indicated that the ring pose was very well predicted, whereas the side chain prediction was less accurate ( Figure 5). Of the 89 compounds of Table 3, 34 (38.20%) have already been reported to inhibit one or multiple tyrosine kinases.
Following the suggestion of one of the reviewers of this paper, we also submitted the 89 compounds to the online version of PASS [38], a software that predicts potential activities for chemical compounds. A total of 24 out of the 89 compounds (26.97%) were predicted to be active on the src tyrosine kinase and 62 of the 89 compounds were predicted to be active on at least one or multiple kinases (Table S4, Supplementary Information). Nevertheless, PASS predictions are also affected by limitations, because Pf-562271, a compound that was in our training set, was not detected at all as a src-tyrosine kinase inhibitor. Gw683134a, which based on the ChEMBL data causes a 36.99% inhibition of c-src tyrosine kinase at 1 µM, was not predicted as an inhibitor at all. Bx-795, which also at 1 µM causes a 27-30% inhibition of human c-src and 77-90% inhibition of Gallus gallus c-src, was also not predicted as an inhibitor. As for lapatinib, the probabilities to be active and to be inactive predicted by PASS were only 0.086 and 0.053, respectively.     AutoDock Vina performance was inferior to that of LeDock: on the same 175 compounds from the training set, the mean energy of binding was −10.30 kcal/mol for the active compounds and −10.03 kcal/mol for the inactive (p = 0.21, Welch t-test). An optimal cut-off for the AutoDock Vina compounds was at −9.26 kcal/mol, which ensured an accuracy of only 62.86%, with a sensitivity of 87.00% and a specificity of only 30.67%. As the performance of Vina was inferior to that of LeDock, we preferred to use only LeDock for virtual screening.
Computing various ligand efficiency metrics did not improve the predictions in the case of LeDock results: the accuracy rather decreased with all ligand efficiency measures attempted. In the case of AutoDock Vina, using different ligand efficiency measures changed the values of accuracy, sensitivity, and specificity, with no spectacular improvement. For instance, dividing the energy of binding to the molecular weight decreased sensitivity (from 87% to 43%), increased specificity (from 30.67% to 81.33%), and slightly increased the AUC (from 56.85% to 62.87%), but it also slightly decreased the accuracy (from 62.86% to 59.43%). Of the different ligand efficiency measures, for the AutoDock Vina results the best was obtained by dividing the energy of binding to the squared Ghose-Crippen octanol-water partition coefficient: 78% sensitivity, 49.33% specificity, 65.71% accuracy, and 65.05% AUC. Even with this ligand efficiency measure, the results were inferior to those obtained with LeDock based on the energies of binding.    * Based on ChEMBL and PubChem data for each substance ("Yes" means that there are at least limited confirmatory data in one of the public databases, "No" means that there is no such confirmatory data; NA-data not available at all). ** For an estimation of the docking error we provided in brackets the standard deviation of the energy of binding computed from the value of the different clusters of 20 poses.

Discussion
Several studies of QSAR models for c-src tyrosine kinase inhibitors have been published up to date in the scientific literature. Five such studies have explored the use of 3D-QSAR, and all of them used a relatively small number of compounds (80,42,156, and 39, respectively), with the same basic chemical structure within each study (pyrrolo-pyrimidine, quinazoline, anilinoquinazoline and quinolinecarbonitrile, quinolinecarbonitrile, and 4,6-substituted-(diaphenylamino)quinazolines); they could, therefore, be considered "local" models [39][40][41][42]. In the QSAR field, the term "local" is used to designate models based on a data set consisting of compounds related by their chemical structure, unlike global models, that are based on data sets consisting of structurally diverse chemical substances [43]. Another paper reported on the use of 2D-QSAR for c-src inhibitors, but these models were also local, focused on ethynyl-3-quinolinecarbonitriles [44]. Therefore, our study is the first one focused on global QSAR models for inhibitors targeting the c-src tyrosine kinase. It has been argued (and it stands to reason) that local models tend to have limited predictive power, even when their apparent performance indicates that they are robust [43]. Our global models are expected to have a higher predictive power, as partially confirmed in our external validation.
By far the most important descriptor in our work, identified by multiple feature selection algorithms, was SpMax4_Bh(m), the largest eigenvalue n. 4 of Burden matrix weighted by mass. This has not generally been reported in previous works as correlating with pharmacological activities. Other two Burden eigenvalues (SpMax3_Bh(m), SpMax5_Bh(m)) have also been among the most important descriptors correlating with the inhibition of c-src. SpMax3_Bh(m) has been used in predicting depuration rate constants for environmental pollutants of the polychlorinated biphenyls group [45], and the less relevant (in our case) SpMax6_Bh(m) has been used to predict chronic toxicity of substances to Pseudokirchneriella subcapitata [46]. The second most important descriptor for our data set was DECC (eccentric topologic index), which has been previously reported to be important in the prediction of monoamine oxidase A (MAO-A) activity [47,48], placental barrier permeability [49], and gas chromatographic retention times [50]. F06[C-N] was used in a model to describe the anti-proliferative effect of phenyl 4-(2-oxoimidazolidin-1-yl)-benzenesulfonates (local QSAR model) [51], antimalaric effect [52], or skin permeability of substances [53]. P_VSA_MR_6 has also been used for modeling of skin permeability [53], whereas we identified the use of Chi1_EA(dm) only for the QSPR modeling of fluorescence properties of a number of fluorescent dyes [54]. The aromatic nitrogen (N-073) has been shown to correlate positively with HIV-1 integrase activity inhibition [55] and negatively with the inhibition of the fibroblast growth factor (FGFR) [56]. We found no previous reports on the use of the Balaban distance connectivity index (J_D) in other models in the biological field, neither of the F05[C-N].
Rarely, the 49 QSAR models with similarly good performance converged in their predictions. Only eight compounds were predicted by all models to be active, and half of them (n = 4) were already in the training data set; for the large majority of compounds at least one or more of the models had contradictory results. This illustrates the need to avoid making decisions based on the results of a single or a small number of models.
As shown in the results section, for nine compounds (7.83% of the 115 substances with the best predictions) it has been confirmed that they are active. How good is such a measure for a virtual screening exercise? If we compare it with the PPV value in the nested cross-validation, the results are rather disappointing and indicate that one should always be cautious in interpreting results even when using double cross-validation, because the real world data are likely to be different from the data set used for training and testing. For instance, it is likely that the proportion of actives in the available data set used for the construction of the models is higher than the proportion of actives in the "real world" (i.e., the wide chemical space used for virtual screening), and this may lead to a decrease in the positive predictive value in the real world. However, if we compare the results of the virtual screening with those of the most costly high throughput screening (HTS), the results are noteworthy. It has been reported that the hit rate of HTS should be expected to be less than 1% [57] and even less than 0.1% or 0.01% [58]. In one study, adding a computer-aided virtual screen was able to increase the screening hit proportion to 5.8% [57]. Thus, our success rate of at least 7.83% is reasonably good. If we compute the confirmation rate against the compounds that were assessed for their effect on src-tyrosine kinase (20.93%), the results are even better. As another positive aspect, more than a quarter of our predictions were supported by the PASS online software. Our virtual screening results showed, however, additional interesting facts.
A total of 16 additional false positives were in fact reported to be active on other members of the src family members, particularly Yes1 tyrosine kinase. This suggests that although our virtual screening exercise failed in multiple cases, the failure was often not far from the true target. Thus, from a total of 43 molecules that were tested for their effects on the src and other tyrosine kinases, 58.14% (25 compounds) were inhibitors of one or several members of the src-tyrosine kinase family (most often Yes1, sometimes also LCK or LYN tyrosin kinase).
Other false positives of the virtual screening exercise are inhibitors of proteins that src tyrosine kinase interacts directly, either activating them or being activated by them. It is known, for instance, that EGFR (epidermal growth factor receptor) can be activated by src without the presence of the EGFR ligand and that there is a direct correlation between EGFR overexpression and src activation [59].
Rather surprisingly for us, 13 compounds wrongly predicted by our models to be src tyrosine kinase inhibitors, are in fact inhibitors of EGFR, and 10 additional compounds that were inactive on src or other members of src family, were reported to be inhibitors of EGFR. Most of these 10 additional compounds (as well as most of the compounds active on src or Yes1 tyrosine kinase) are also active on ErbB4, and it has been reported that ErbB4-derived phosphopeptides are able to interact with the SH2 domain of src [60], that following stimulation by EGF, c-src is rapidly recruited to ErbB receptor complexes [61] and that activated src binds to ERBB4s80 (E4ICD), a cleaved fragment of ERBB4 [62]. Moreover, dasatinib, described often as a src inhibitor [63], has also shown to be one of the most potent ligands of ErbB4 [64]. Defactinib, apparently a false positive of our virtual screening is a potent FAK (focal adhesion kinase) inhibitor; it is known that FAK and nonreceptor src tyrosin kinase are both part of a focal adhesion complex (together with other structural, enzymatic, or adapter proteins), where they interact directly [65]. Three false positives of the virtual screening results were KIT and PDGFR inhibitors; KIT promotes phosphorylation of src and is activated by src [66], while src and PDGFR interact and phosphorylate each other at certain Tyr positions [67].
Such findings (compounds inactive on c-src tyrosine kinase, but active on kinases from the same kinase family or signaling pathway) tend to suggest that where the QSAR virtual screening fails is often not far from the target (but this is nonetheless a failure). How could these failures be explained, considering that multiple models converge in predicting a certain molecule as active on the target of interest (src tyrosine kinase)? It seems that the models manage to predict the tyrosine kinase properties of certain compounds, without having sufficient specificity to always separate those active on src from those active on other tyrosine kinases. We hypothesize that the training set is too small and does not include (a sufficient number of) molecules with selective src inhibitory properties; we intend to evaluate whether extending the data set with additional molecules inactive on src but active on other tyrosine kinases may improve the results of the virtual screening. It is also worth exploring the combining of more diverse descriptor sets in the final assembly of models with a view of improving the performance of the virtual screening.
Among the results produced by our virtual screening there is a sizeable number of antiviral molecules (vedroprevir, daclatasvir, ciluprevir, deleobuvir, ledipasvir, faldaprevir, tegobuvir, elbasvir, ombitasvir, narlaprevir), all of them approved or developed against hepatitis C viruses. They either target the NS3/NS4A (vedroprevir, ciluprevir, faldaprevir, narlaprevir) [68] or NS5A (daclatasvir, elbasvir, ombitasvir, ledipasvir) [69] or NS5B (deleobuvir, tegobuvir) [70] nonstructural proteins of the virus. It is not very surprising to see inhibitors of NS5A and NS5B here, considering that is already known that NS5A protein binds to tyrosine kinases from the src-family [71], and c-src is an essential host protein involved in the formation of the HCV replication complex, together with NS5A and NS5B [72]. It was less expected to see also inhibitors of the NS3/NS4A among the results of the virtual screening, because no direct interaction was reported between the Ns3/NS4A complex and src tyrosine kinase. This list of HCV antivirals might consist only of false positives, but it is worth testing in wet lab experiments.
The docking applied to 147 compounds predicted with a high probability by the QSAR models to be active, reduced their number to about 61% of the initial size. For a number (27.78%) of these 89 compounds, predicted by both QSAR and docking to be active, data available in ChEMBL or PubChem (from a single wet lab test) indicate that they are inactive, and for others (6.67%), that they are active, as discussed for the QSAR models. This suggests that computational results have to be interpreted with caution even when different models, with different methodologies and assumptions, converge in their predictions. On the other hand, the last decade has witnessed a growing realization of what has been dubbed "the reproducibility crisis", ascribed to the inappropriate quality of antibodies used as reagents [73], insufficiently described methodologies or simply to the biology itself [74]. Whereas positive findings have often not been reproduced when experiments were repeated in other laboratories, it is not impossible that negative findings could also not be replicable and some of the compounds shown by databases to be inactive might, as a matter of fact, be active. However, in the absence of contrary evidence, such compounds have to be considered inactive.
Virtual screening results are also influenced by potential errors affecting the input data: if the wet lab data that were used to generate the models are affected by errors, they will propagate forward in the models built and in the predictions made on new compounds. The estimated docking energies are also potentially affected by errors (in our estimation the accuracy was about 70%, but the large number of compounds used in screening may differ more from our data set, and thus accuracy might be lower). Moreover, docking methods are also prone to errors, there are often discrepancies between docking results and ligand-based studies, and there are multiple cases where top compounds identified by docking methods failed in wet lab experiments [37].

Dataset
The dataset (Table S1) was downloaded from ChEMBL (https://www.ebi.ac.uk/chembl) and included experimental data for c-src as a target (target code CHEMBL267). Only the records with ki values expressed in nM were kept. Records with "=" values in the field "Relation" were kept for analysis and labeled as "active" if ki < 1000 nM and "inactive" if ki ≥ 1000 nM; records with ">" or "<" values in the field "Relation" were kept for analysis only if they allowed unequivocal classification (e.g., records with ki > 5000 nM were kept and labeled as "inactive", whereas those with ki > 100 nM were discarded; similarly, records with ki < 5000 nM were discarded). A threshold of 1000 nM for the formal discrimination between "active" and "inactive" compounds is usual in the field and has been used in other publications [75]. We used classification rather than regression, because the data came from different laboratories and experimental settings, and although ki values have less variability than IC50, published experimental ki values still vary considerably (of the 75 compounds in our data set with multiple ki values, the relative standard deviation (RSD) of ki varied from 0% to 103%; for the first three quartiles, RSD was relatively low, under 13.85%, but for the last quartile it was quite high). Inorganic compounds were removed. For the detection and removal of duplicate compounds we proceeded in two steps: first, canonical SMILES (available in the downloaded dataset) were searched for duplicates in R (v. 3.6.0) and their ki values were replaced by the average of the duplicates. We then used ChemAxon Standardizer v. 18.8.0 (ChemAxon, Budapest, Hungary) for the standardization of the molecules, and then employed the ISIDA/Duplicates software (http://infochim.u-strasbg.fr; University of Strasbourg, Strassbourg, France) software for the identification of potential further duplicates. We used Discovery Studio Visualizer v16.1.0.15350 (Dassault Systèmes BIOVIA, San Diego, CA, USA) to convert the standardized SMILES to 2D chemical structures (sdf). Following the removal of duplication, our dataset decreased from an initial number of 1151 compounds to 1038, of which 286 were labeled as "active" and 752 as "inactive".

Descriptors
Molecular descriptors of the dataset molecules were computed using the Dragon 7 software

Feature Selection
As the number of computed descriptors is very large (almost 4000), the "dimensionality curse" precludes optimal operation of the classification or regression algorithms, which are generally designed for a relatively small number of variables, and tends to result in overfitting [76]. Feature selection, which is a process of filtering a high number of variables while keeping only the most relevant of them increases the performance of machine learning algorithms, reduces the computational costs, and strengthens the generalization ability of the models built [76]. Multiple algorithms of feature selection have been proposed in the literature, with variable performance, often depending on the nature and particularities of the data. We used 17 different feature selection algorithms, implemented directly in the "mlr" R package [77] or through other R packages: based on an ANOVA test, on a Kruskal test, on the Area Under the Curve (AUC), variance, and an univariate model performance score ('mlr'), based on a permutation importance of random forest (as implemented in the R package 'party', [78]), based on a chi-square test, gain ratio, information gain, OneR classifier, RELIEF algorithm, and symmetrical uncertainty (methods implemented in the 'FSelector' R package [79]), three algorithms based on random forest importance (as implemented in the randomForest [80] and randomForestSRC [81] packages), and two algorithms based on node impurity and permutation in random forests, as implemented in the 'ranger' R package [82]. The feature selection algorithms were applied after pre-processing consisting of removal of constant and quasi-constant features (i.e., those where less than 1% of the observations differed from the mode value) and highly correlated features (defined as those with a correlation coefficient higher than 0.9).

Machine Learning Algorithms and Model Building
For building the models we used the following algorithms: random forests, support vector machines, ada Boosting M1, Bayesian additive regression trees, binomial regression, and C5.0 decision trees and rule-based models.
Based on an arbitrary number of decision trees used as an ensemble with a majority vote to decide on the most probable class assigned to each data point, random forests (RF) are a popular classification algorithm often used with very good performance in QSAR models [83][84][85]. Each decision tree is constructed using bootstrap sets of the training set and subsets of descriptors that are selected in a random manner [86].
The support vector machines (SVM) algorithm is able to address data sets with high number of variables and has often been used with very good performance in a variety of classification and regression tasks, including QSAR applications [87,88]. It uses a variety of kernel functions (e.g., linear, polynomial, radial, etc.) to project features in a vector space maximizing the partitioning boundary between classes and to identify the hyperplane that best discriminates the classes [89].
The adaboost M1 (Adaptive Boosting) algorithms were described as "widely used in QSAR studies" [90], although they are probably less used than RF or SVM. AdaBoost is an iterative algorithm that uses weights to improve the performance of "weak" classifiers (particularly decision tress), giving higher weights to the trees with better performance (smaller misclassification rates) [90].
Bayesian Additive Regression Trees (BART) is nonlinear regression technique based on a Bayesian approach, whose performance in QSAR modelling has been stated to be competitive with that of other machine learning methods [91]. Unlike other decision trees, where decision is taken based on a majority vote or with the help of empirical weights, BART makes use of prior knowledge and likelihood to improve the performance of the decision trees.
Binomial regression (logistic regression), despite the term "regression" is a relatively simple algorithm used for classification purposes, because it linearly models the probability that an observation belongs to one of two categorical outcomes [92]. In other words, logistic regression computes the probability P = 1/(1 + e −t ), where t = a 0 + a 1 x 1 + a 2 x 2 + ... + a n x n [93].
C5.0 decision trees and rule-based models represent an extension of a classification algorithm proposed by R. Quinlan in 1993, under the name "C4.5", and builds models that can take either the form of a decision tree or a set of rules (in simple or boosted versions) [94]. Although apparently less used in QSAR modeling than other machine learning algorithms, when employed, it gave excellent performance, comparable with that of random forests or support vector machines [95].

Performance Evaluation
Nested cross-validation using five folds in the inner loop and 10 folds in the outer loop was used to evaluate the performance of the models selected, except for the Bayesian Additive Regression Trees, for which five folds were also used in the external loop (due to the long time taken by this classifier). The assessment of QSAR model performance should include both internal and external evaluations, and the external validation is generally deemed as "the gold standard" [106,107]. However, the concept of "external validation" has received different interpretations and most often is assumed to describe a holdout data set, obtained by an initial one-time split (i.e., a set that has not been seen by the model during any adjustments or hyperparameter optimization) [108]. Despite its apparent advantages of objectivity and ability to assess the generalization of the selected model(s), the use of a hold-out data set is fraught with thorny issues: the split may be simply fortunate, leading to overestimation of performance (or of contrary, it may be unfortunate, leading to underestimation of performance), it requires the holdout sample to be large (which in practice may be costly or a requirement impossible to satisfy), and the sample size needed for holdout is larger than it is necessary for cross-validation to estimate the prediction error with a similar degree of precision [106]. For these reasons, using nested cross-validation (also known as double cross-validation) not only does not reject the idea of external validation, but it extends it to the entire data set [109].
All models were assessed by computing (within the nested cross-validation) the balanced accuracy (BA), mean misclassification error (MMCE), sensitivity (true positive rate, TPR), specificity (true negative rate, TNR), area under the receiver operating characteristics curve (AUC), and positive predictive value (PPV), with their widely known definitions and equations [75,110] (formulae for their computation are available in the Supplementary Information). Particularly for virtual screening purposes PPV is important (because it indicates the likely proportion of positive values among the values predicted as positive). We therefore selected only models with a PPV higher than 70% and BA higher than 70%.
To make sure that the performance of the models is not consequential to chance, a Y-scrambling procedure was applied, where for multiple models the dependent variable (in our case the ki values) was shuffled through 1000 permutations (using the R package 'gtools' [111]), then the models were rebuilt using the same procedure from the first steps (i.e., applying the same feature selection algorithms, in the same order) and their performance evaluated. If there is a real relationship between the activity and the descriptors, following the y randomization the performance of the new models thus built should be worse.

Applicability Domain
We used two local density-based outlier methods implemented in the DDoutlier R package [112]the Kernel Density Estimation Outlier Score (KDEOS) algorithm with gaussian kernel [113], and the INFLO algorithm (which compares the density in the neighborhood of an observed value with the density in the "reverse neighborhood") [114]-adding each new test observation one at a time and computing whether it is or not an outlier in comparison with the reference (i.e., training) data set. We also applied the KNN (k nearest neighbour) approach proposed by Sahigara et al. (2013) [34] and the method advanced by Roy et al. (2015) [33] using R code written in house.

Virtual Screening by QSAR
The 49 best-performing QSAR models were used to predict the activity of a data set consisting of 104,619 ZINC database compounds (the "named" subset, i.e., compounds that have names in the ZINC 15 database [115]). The 49 models were stacked using a simple majority voting for the decision; the performance of the stacking was assessed by applying the same majority voting to the independent predictions in the nested cross-validation loops. The compounds were ranked in decreasing order, from those predicted by 100% of the models to those predicted by only 51% of the models.

Molecular Docking Study
Crystallographic data available in the PDB database (PDB ID: 4MXO [116], PDB ID: 3QLG [117]) show that src-tyrosin kinase inhibitors engage the enzyme primarily at the hinge residues, a few amino acid residues having a particular relevance: Val 281, Ala 293, Met 314, Ile 336, Met 341, Leu 393 [118]. We intended to evaluate whether the molecules ranked in our virtual screening as active with highest confidence bind in the back pocket of the src-tyrosin kinase in a similar way with dasatinib or bosutinib. Docking was performed using AutoDock Vina [35] with default parameters under Yasara (version 19.7.20), and LeDock. Human c-src protein (PDB ID: 2src [119]) was used as a target. For Vina, the protein preparation was performed in Chimera (Resource for Biocomputing, Visualization, and Informatics at the University of California, San Francisco, CA, USA) using the Dock Prep module (deleting the ligand and water molecules, eliminating alternate locations of residues, replacing selenomethionine with methionine, etc.); protonation states were assigned with the addH module of Dock Prep, at physiological pH (about 7.4), using the default method. The active site for the Vina docking was defined as a cubic cell of 5 Å around the selected residues (mentioned above). The setup was performed with the YASARA molecular modeling software (YASARA Biosciences GmbH, Vienna, Austria), the compounds being sorted by the program by the free energy of binding (the best hit of 25 runs), this being used for post-analysis, as discussed below. For LeDock the protein preparation was carried out using the LePro module (with the default values) and the docking was run with the default values of the LeDock module; the binding pocket was also a rectangular box with a radius of 5 Å. Clustering by RMSD (1.0 Å) was used to reduce redundancy, and the score of the first cluster (obtained from 20 runs) was selected for each compound for post-analysis.
The SMILES structures corresponding to the ZINC codes of the compounds predicted as active in the virtual screening by at least 75% of the models were downloaded in Python with the help of the smilite package; they were then converted to sdf format in DataWarrior (adding 3D coordinates) and then to mol2 format (with hydrogens added) in Biovia Discovery Studio and batch split to individual mol2 files with Open Babel. Ligand energy minimization was performed with Marvin Sketch, v. 19.19. The mol2 files were used in the LeDock software (Lephar Research, Stockholm, Sweden) for virtual screening.
To estimate the performance of the docking a subset of the training set comprising 175 compounds (33 with ki < 20 nM, 67 with 500 < ki < 1000 nM, 32 with 1500 < ki < 2000 nM, and 43 compounds with ki > 10,000 nM) was used and "cutpointr" R package was employed to define the best cut-off point of computed energies of binding between actives and inactives, based on the sum of sensitivity and specificity. We also computed various ligand efficiency metrics, which have been reported in the literature to improve the docking scoring; they were computed by dividing the energy of binding to the molecular weight, number of heavy atoms, number of carbon atoms, partition coefficient, and Wiener index [120]. We also explored computing ligand efficiencies by dividing the energy of binding to the squared value of the partition coefficient, to the total surface area, McGowan volume, van der Waals volume from McGowan volume, and van der Waals volume from the Zhao-Abraham-Zissimos equation (metrics not reported previously). The "cutpointr" R package [121] was used to define the best cut-off point of computed energies of binding between active and inactive compounds, based on the sum of sensitivity and specificity. For further validation we also docked the co-crystallized ligand from the c-src protein (PDB ID 2csrc), namely the phosphoaminophosphonic acid-adenylate ester, and RMSD was computed for the first cluster of poses predicted by LeDock. RMSD computation was performed in R based on the well-known formula and the results were compared with those obtained with the online DockRMSD [122], the values obtained being identical. Following the strong suggestion of one of the reviewers of this paper, we tested the compounds predicted by both the QSAR models and docking to be active and evaluate their potential effects using the online version of the program PASS [38].

Conclusions
A total of 49 global QSAR models have been developed, predicting the c-src tyrosine kinase inhibition with reasonable accuracy (>70%) and positive predictive value (>70%). The 49 models were assembled by stacking and used for the virtual screening of over 100,000 named compounds from the ZINC database. Several hundreds of compounds were predicted to be active, depending on the decision threshold used. Those with the highest probability of being active were also subjected to molecular docking and for the majority (about 61%) of them the energies of binding obtained were consistent with a hypothesis of activity. External data from ChEMBL and PubChem confirmed that at least 7.83% (in the case of QSAR) or 6.67% (in the case of integrated QSAR and molecular docking) of the compounds are active on the c-src target; more than a quarter of the predictions were also confirmed by prediction performed by the online version of PASS The ratio of active compounds is smaller than what was to be expected from the nested cross-validation data, but still better than what one should expect from any high-throughput type of screening experiments.

Acknowledgments:
The authors would like to thank the anonymous reviewers who, through their comments, contributed to the improvement of this paper.

Conflicts of Interest:
The authors declare no conflict of interest. R.A has received consultancy and speakers' fees from various pharmaceutical companies. The companies had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, and in the decision to publish the results.