Discussion
Protein kinases are dynamic molecules that transition between distinct conformational states, each linked to specific biological functions. The activation segment plays a crucial role in regulating these transitions, maintaining a balance between active and inactive states. In this study, we developed a classification scheme to accurately distinguish between these conformations using 15 geometric descriptors derived from the activation segment (see Materials and Methods). To ensure high reliability, we trained multiple machine learning models using only non-conflicting labels from our resources. During this process, we identified over 300 conflicting kinase activity labels between the KinCore resource and a previous study (
Table S1) [
13,
17].
To resolve these discrepancies, we optimized model performance using Benchmarking, Randomized Search, Bayesian Optimization, and Coordinate Descent techniques. Among all models tested, Random Forest consistently achieved perfect classification, with an accuracy of 1.0 on both the training and test sets, alongside perfect precision, recall, and F1 scores. XGBoost also performed exceptionally well, achieving near-perfect accuracy and a high ROC AUC value. In contrast, SVM and Logistic Regression demonstrated strong classification power but showed slight limitations in recall and F1 scores compared with Random Forest and XGBoost.
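The metrics quoted above can be made concrete with a minimal sketch. It is not the evaluation code used in this study. The function name `binary_metrics`, the label strings, and the toy predictions are assumptions chosen for illustration, and the active state is treated as the positive class.

```python
def binary_metrics(y_true, y_pred, positive="active"):
    """Accuracy, precision, recall and F1 with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    # Guard against empty denominators so the function never divides by zero.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels (illustrative only, not data from this study).
toy_true = ["active", "active", "inactive", "inactive"]
toy_pred = ["active", "inactive", "inactive", "inactive"]
toy_scores = binary_metrics(toy_true, toy_pred)
```

In this toy case one active structure is missed, so precision stays at 1.0 while recall drops to 0.5. This is the kind of recall and F1 shortfall described above for SVM and Logistic Regression, whereas a classifier with no errors scores 1.0 on all four metrics.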
Beyond traditional classifiers, we explored probabilistic models, including Kernel Density Estimation (KDE) and Gaussian Mixture Models (GMM), for kinase classification based on density estimation. While KDE achieved near-perfect classification on both training and test sets, GMM exhibited minor declines in accuracy and recall, indicating a slight generalization gap.
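As a rough illustration of classification by density estimation, the sketch below fits one Gaussian kernel density per class and assigns a new structure to the class with the highest prior-weighted density. It is a simplified stand-in, not the KDE model trained in this study. The fixed bandwidth, the two-dimensional toy descriptors, and the function names `log_kde` and `kde_classify` are all assumptions.

```python
import math


def log_kde(x, samples, bandwidth):
    """Log density at x under an isotropic Gaussian KDE built from `samples`."""
    d = len(x)
    # Normalisation of a d-dimensional Gaussian kernel, averaged over samples.
    log_norm = (-d * math.log(bandwidth * math.sqrt(2.0 * math.pi))
                - math.log(len(samples)))
    exponents = [
        -sum((xi - si) ** 2 for xi, si in zip(x, s)) / (2.0 * bandwidth ** 2)
        for s in samples
    ]
    # Log-sum-exp keeps the sum stable when x is far from every sample.
    m = max(exponents)
    return log_norm + m + math.log(sum(math.exp(e - m) for e in exponents))


def kde_classify(x, samples_by_class, bandwidth=0.5):
    """Return the class maximising log prior + log class-conditional density."""
    n_total = sum(len(s) for s in samples_by_class.values())
    scores = {
        label: math.log(len(s) / n_total) + log_kde(x, s, bandwidth)
        for label, s in samples_by_class.items()
    }
    return max(scores, key=scores.get)


# Toy two-descriptor training points (illustrative only).
toy_training = {
    "active": [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0)],
    "inactive": [(3.0, 3.1), (2.8, 2.9), (3.2, 3.0)],
}
label_near_active = kde_classify((0.1, 0.0), toy_training)
label_near_inactive = kde_classify((2.9, 3.0), toy_training)
```

A GMM-based classifier follows the same decision rule but replaces each per-class KDE with a small mixture of fitted Gaussians. It therefore summarises each class with far fewer parameters than one kernel per training structure.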
We compared our classification scheme with five previously published models. One of these models, published in 2006 [
9], centers on the hydrophobic regulatory spine (R-spine), a critical dynamic feature that governs kinase function. The R-spine consists of four residues (RS1–RS4), with two located in the C-lobe and two in the N-lobe of the kinase, each contributing significantly to the kinase’s structural integrity [
9]. However, we identified structures where the R-spine was partially disassembled in active structures (
Figure 5), challenging the assumption that a disrupted R-spine necessarily indicates an inactive kinase [
9]. This finding underscores the complexity of kinase regulation and suggests that the R-spine alone is not always a definitive indicator of kinase activity.
Another comparison was made with the work of McSkimming et al. [
13], who utilized 723 features to classify kinases as active or inactive based on the orientation of the activation segment, measured by φ, ψ, χ, and pseudo-dihedral angles. They trained a Random Forest model called Kinconform. In contrast, our model outperformed previously reported results while using only 15 features, which removes the need to select from a pool of more than 700 candidate features and makes our approach more efficient (
Table 11).
We also compared our approach with the classification method by Ung et al. [
11], which categorizes kinases into CIDI, CODI, CIDO, CODO (C-helix in/out and DFG in/out), and ωCD (distorted αC-helix or DFG motif) conformations. However, their classification does not differentiate between active and inactive DFG-in structures. Furthermore, their curated dataset included only 264 structures, restricting its applicability for machine learning tasks.
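The four-letter scheme can be read as a simple lookup on two binary states plus a distortion flag. The sketch below only illustrates the naming convention described above. The function `ung_label` and its boolean inputs are hypothetical, and it does not reproduce how Ung et al. derive those states from coordinates.

```python
def ung_label(c_helix_in, dfg_in, distorted=False):
    """Map C-helix and DFG states to a CIDI/CODI/CIDO/CODO or ωCD label."""
    if distorted:
        # Distorted αC-helix or DFG motif falls outside the four main classes.
        return "ωCD"
    c_part = "CI" if c_helix_in else "CO"
    d_part = "DI" if dfg_in else "DO"
    return c_part + d_part


# Every C-helix-in, DFG-in structure receives the same label, so an active and
# an inactive DFG-in structure cannot be told apart by this label alone.
label_a = ung_label(c_helix_in=True, dfg_in=True)
label_b = ung_label(c_helix_in=True, dfg_in=True)
```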
Faezov et al. defined several criteria for identifying the active form of protein kinases based on structures bound to substrates and ATP [
12]. These criteria include: (1) the DFG-in position of the DFG-Phe side chain; (2) the “BLAminus” conformation, characterized by specific backbone and side-chain dihedral angles of the XDFG motif, previously identified as essential for ATP binding; (3) the presence of an N-terminal domain salt bridge between a conserved Glu residue in the C-helix and a conserved Lys residue in the N-terminal domain beta sheet; (4) backbone-backbone hydrogen bonds involving the sixth residue of the activation loop (DFGxxX) and the residue immediately preceding the HRD motif (“X-HRD”); and (5) a contact or near-contact between the Cα atom of the APE9 residue (nine residues before the activation loop’s C-terminus) and the carbonyl oxygen of the Arg residue in the HRD motif.
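The five criteria can be viewed as a rule-based check. The sketch below assumes that each criterion has already been evaluated upstream and reduced to a boolean, and that a structure is called active only when all five hold. The geometric thresholds are not stated here, so none are encoded. The class `ActiveCriteria`, its field names, and both helper functions are hypothetical and are not part of the Faezov et al. software.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ActiveCriteria:
    """One boolean per criterion, in the order listed in the text (assumed names)."""
    dfg_in: bool                # (1) DFG-Phe side chain in the DFG-in position
    bla_minus: bool             # (2) XDFG motif in the "BLAminus" conformation
    glu_lys_salt_bridge: bool   # (3) C-helix Glu to beta-sheet Lys salt bridge
    dfg6_xhrd_hbonds: bool      # (4) backbone H-bonds, DFGxxX to X-HRD residue
    ape9_hrd_arg_contact: bool  # (5) APE9 C-alpha near HRD-Arg carbonyl oxygen


def failed_criteria(criteria):
    """Names of the criteria that are not satisfied."""
    return [f.name for f in fields(criteria) if not getattr(criteria, f.name)]


def is_active(criteria):
    """Assumption: a structure is active only if all five criteria hold."""
    return not failed_criteria(criteria)


# Toy example: everything satisfied except the Glu-Lys salt bridge.
toy_structure = ActiveCriteria(True, True, False, True, True)
toy_failures = failed_criteria(toy_structure)
```

Reporting which criteria fail, rather than only the final label, makes it easier to see why a rule-based annotation disagrees with a learned classifier on a given structure.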
While Faezov et al. used these criteria for kinase classification [
12], McSkimming et al. employed previously established classification methods to annotate kinase structures [
13]. Disagreements between the two groups’ annotations were resolved through consensus manual curation by two independent chemists. Notably, more than 300 disagreements were identified between the annotations in these two studies (
Table S1).
Our model, developed based on the consensus between these resources, was tested on structures with inconsistent annotations. It aligned with Faezov et al.’s conflicting annotations in 12.30% of cases and with McSkimming et al.’s conflicting annotations in 87.70% of cases.
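The two percentages come from a simple tally over the structures on which the two resources disagree. The sketch below shows that calculation under stated assumptions. The function `agreement_rates`, the structure identifiers, and the toy labels are hypothetical and do not reproduce the study's data.

```python
def agreement_rates(model, source_a, source_b):
    """Percent of conflicting structures on which the model sides with each source.

    All three arguments map a structure identifier to a label. Only structures
    where source_a and source_b disagree are counted.
    """
    conflicts = [k for k in model if source_a[k] != source_b[k]]
    if not conflicts:
        return 0.0, 0.0
    with_a = sum(1 for k in conflicts if model[k] == source_a[k])
    with_b = sum(1 for k in conflicts if model[k] == source_b[k])
    return 100.0 * with_a / len(conflicts), 100.0 * with_b / len(conflicts)


# Toy labels for four hypothetical structures (illustrative only).
toy_model = {"s1": "active", "s2": "inactive", "s3": "active", "s4": "active"}
toy_a = {"s1": "active", "s2": "active", "s3": "inactive", "s4": "inactive"}
toy_b = {"s1": "inactive", "s2": "inactive", "s3": "active", "s4": "active"}
pct_a, pct_b = agreement_rates(toy_model, toy_a, toy_b)
```

With binary labels, the model necessarily agrees with exactly one source on every conflicting structure, so the two percentages sum to 100, as the reported 12.30% and 87.70% do.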
Lastly, we compared our model with the approach of Reveguk et al. [
14], which initially considered 1692 structural variables spanning the catalytic domain. After identifying a smaller set of features, they trained an XGBoost model called KinActive on 3289 labeled structures. However, their annotations were biased by relying on the kinase activity labels of McSkimming et al. [
13], whereas our study used only agreed-upon annotations. The model developed by Reveguk et al. used 78 features for classification, whereas our model achieved success with only 15 features, demonstrating greater efficiency in distinguishing between active and inactive kinase states (
Table 11).
From this comparative evaluation, our classification approach proved to be highly efficient and reliable for kinase conformation prediction, with consistent performance across multiple optimization techniques. Unlike previous methods that relied on hundreds of structural features, our model achieved high classification accuracy using only 15 geometric descriptors, making it a more efficient and scalable approach. By successfully distinguishing active and inactive kinase conformations, our framework provides a powerful tool for resolving conflicting kinase annotations. Furthermore, its ability to identify misclassified structures and refine existing kinase databases enhances its practical utility in structure-based kinase research. This advancement has significant implications for guided drug design, as accurately characterizing kinase conformations is crucial for identifying druggable states and designing selective kinase inhibitors.