SMILES-Transformer-Assisted SAVI Screening for Novel Chronobiotics

Ilya A. Solovev; Gleb R. Kabachevskiy; Denis A. Golubev; Arina I. Yagovkina; Nadezhda O. Kotelina

doi:10.20944/preprints202604.2062.v1

Submitted: 28 April 2026

Posted: 29 April 2026

Abstract

The development of new chronobiotics, substances capable of selectively modulating the parameters of circadian rhythms, is hampered by the fragmented nature and limited volume of available experimental data. In the present study, a comprehensive evaluation of the applicability of the SMILES-Transformer architecture to the classification of circadian rhythm modulators was performed using the specialised ChronobioticsDB resource, and the first systematic virtual screening of the SAVI (Synthetically Accessible Virtual Inventory) library of synthetically accessible compounds for chronobiotic activity was carried out. Rigorous protocols were applied for model training and validation: Data-Efficient Modeling (DEM) assessment with 20 repeats, repeated scaffold validation (5 × 5), and a comparative analysis of training strategies (feature-based vs. end-to-end fine-tuning). The influence of three variants of circadian-effect labelling (raw, aggregated, and expert-curated) and three loss functions (BCE, Focal Loss, and Asymmetric Loss) on the quality of multi-label classification was investigated. The results demonstrate that systematic hyperparameter optimisation in end-to-end mode provides the best predictive performance (ROC-AUC 0.666 for the effect_coarse task), whereas standard fine-tuning without optimisation leads to overfitting (ROC-AUC 0.470). Scaffold validation confirmed the ability of the model to generalise to structurally novel compounds (ROC-AUC 0.587). Expert aggregation of labels improved the recognition of rare classes (F1-macro 0.254 versus 0.148 for the raw labelling). Based on the trained models, a consensus virtual screening of the SAVI library was performed using four independent classifiers (classf, effect_coarse, target, mechanism). From more than five million compounds, 10,000 of the most promising candidates were selected, among which 34 super-candidates (consensus score > 0.9) and 435 strong candidates (> 0.8) were identified. Analysis of the predicted targets revealed dominance of the CLOCK-BMAL1 complex (60.49%), while among effects the circadian phase shift prevailed (37%). All identified candidates are synthetically accessible and are recommended for prioritised experimental verification.

Keywords: chronobiotics; circadian rhythms; machine learning; SMILES-Transformer; deep learning; chemoinformatics; ChronobioticsDB; virtual screening; SAVI; scaffold validation; prediction of molecular properties

Subject: Chemistry and Materials Science - Medicinal Chemistry

1. Introduction

Throughout their evolutionary history, biological systems have developed under conditions of strict 24-hour cyclicity caused by the rotation of the Earth around its axis. This has led to the formation of a fundamental regulatory mechanism, circadian rhythms, which are endogenous autonomous oscillations with a period of approximately 24 hours [1,2,3]. In mammals, the hierarchical organisation of the circadian system is headed by the suprachiasmatic nucleus (SCN) of the hypothalamus, which functions as the central pacemaker coordinating peripheral oscillators present in virtually all tissues and cells of the organism. At the molecular level, circadian rhythms are sustained by a complex system of transcription–translation feedback loops involving the core “clock” genes Bmal1, Clock, Per1/2, Cry1/2, and Nr1d1 (REV-ERBα). The transcription factor CLOCK-BMAL1 acts as the principal activator of the circadian cycle by initiating expression of the Per and Cry genes, the products of which, as they accumulate, inhibit the activity of CLOCK-BMAL1 and close the loop with a period of about 24 hours [1,3]. Disruption of synchrony between internal biological clocks and external synchronisers, or zeitgebers, such as the light regime and meal timing, is associated with a wide spectrum of pathological states. Clinical and epidemiological data indicate links between circadian desynchronisation and sleep disorders, depression, metabolic syndrome, cardiovascular diseases, neurodegenerative pathologies, and certain forms of cancer [3]. The prevalence of circadian disturbances in modern society is steadily increasing as a consequence of artificial lighting, shift work, and transmeridian flights, which makes the development of pharmacological agents for the targeted correction of circadian rhythms highly significant from both medical and social perspectives.

In this context, chronobiotics, substances capable of selectively modulating biological clock parameters such as the amplitude, phase, or period of oscillations, represent a promising class of therapeutic agents [2,3]. Classical chronobiotics include the hormone melatonin and its synthetic agonists (agomelatine, tasimelteon, ramelteon), as well as more recent ligands of the REV-ERB receptors (SR9009, SR9011) and stabilisers of the cryptochromes (KL001). The spectrum of potential clinical applications of chronobiotics covers the correction of jet-lag and shift-work sleep disturbances, the treatment of mood disorders, the slowing of progression of neurodegenerative diseases, and chronotherapeutic approaches in oncology. However, the development of new chronobiotics faces a critical problem of scarce and fragmented experimental data. Unlike standard pharmacological targets, evaluation of a compound’s effect on circadian dynamics requires lengthy and technically complex in vitro studies using luciferase reporter systems (e.g., the Bmal1-luciferase assay) or in vivo experiments with monitoring of locomotor activity (the wheel-running activity assay). The high cost and labour-intensiveness of such experiments substantially limit the size of training sets available for machine-learning methods [4,5]. The ChronobioticsDB database, created by our group in 2025 as the first specialised resource of experimentally confirmed circadian-rhythm modulators [1], contains around 350 annotated compounds, which by the standards of deep learning constitutes an extremely small sample.

Traditional chemoinformatics methods based on pre-defined molecular descriptors (Morgan fingerprints, MACCS keys) or graph neural networks (GNN) often demonstrate insufficient generalisation ability on such small samples because of the limited feature space or a tendency towards overfitting. An alternative and promising approach is the use of textual molecular representations in the SMILES (Simplified Molecular Input Line Entry System) format combined with Transformer architectures [5,6]. SMILES-Transformer-type models borrow principles from natural-language processing: during the pre-training stage on millions of unlabelled chemical structures from the PubChem or ChEMBL databases, the model forms context-dependent vector representations (embeddings) that capture deep structural and physico-chemical regularities of chemical space. During subsequent fine-tuning on specialised small data sets, such models can effectively adapt to specific tasks of biological-activity prediction, which makes them particularly attractive for working with resources such as ChronobioticsDB [4,6]. After training and validating the predictive models, an essential next step is their practical application to virtual screening of large chemical libraries with the aim of identifying new potential chronobiotics. The SAVI (Synthetically Accessible Virtual Inventory) library is a collection of more than five million synthetically accessible compounds, each of which can be obtained in the laboratory in a limited number of synthetic steps, ensuring the practical feasibility of computational screening results.

The aim of the present work was to perform a comprehensive training and validation of SMILES-Transformer-based models for multi-label classification tasks on chronobiotics from ChronobioticsDB using rigorous evaluation protocols, and then to apply the optimised models to virtual screening of the SAVI library for chronobiotic activity, producing a prioritised list of candidates for experimental verification.

2. Materials and Methods

2.1. The ChronobioticsDB Database

The present study is based on the specialised ChronobioticsDB resource [1], the first manually curated database of compounds that modulate circadian rhythms. The database brings together information on experimentally confirmed chronobiotics, systematically extracted from peer-reviewed scientific literature published over the last fifty years. ChronobioticsDB is built on the PostgreSQL relational database management system and the Django web framework, which ensures the integrity, scalability, and accessibility of the data for the research community. The main entities of the database include chemical structures in SMILES format, pharmacological classes, biological targets, mechanisms of action, and types of observed circadian effects. For machine-learning purposes, a structured sample was prepared that included several levels of annotation of biological action. Each compound was associated with a vector of binary labels covering four aspects: (1) classf, membership in functional pharmacological groups such as CRY ligands, steroids, and melatonin receptor agonists; (2) mechanism, the detailed mechanism of interaction with the target; (3) target, the specific protein target such as MT1 or MT2 receptors, CRY1 or CRY2 proteins, REV-ERB nuclear receptors, or the CLOCK-BMAL1 complex; (4) effect, the description of the impact on circadian rhythm parameters, including phase shift, amplitude change, rhythm restoration, and sleep modulation.

Given the high variability of terminology in the primary literature, three variants of effect labelling were used in the work to evaluate the influence of annotation granularity on the quality of training: effect_raw (the original labels, six categories after filtering), effect_coarse (aggregated categories, nine labels), and effect_expert (labels that had undergone additional verification by chronobiology experts, ten labels). Filtering of labels was carried out with a minimum support threshold (min_label_support ≥ 2), excluding classes with a single representative.
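The label-filtering step can be illustrated with a short sketch. It assumes that each compound is represented by a set of label strings; the function name and the toy annotations are hypothetical and are not taken from ChronobioticsDB.

```python
from collections import Counter


def filter_labels(label_sets, min_label_support=2):
    """Drop labels carried by fewer than `min_label_support` compounds.

    Returns the kept labels (sorted) and one binary vector per compound.
    """
    support = Counter(label for labels in label_sets for label in set(labels))
    kept = sorted(label for label, n in support.items() if n >= min_label_support)
    # Binary indicator vector per compound, over the kept labels only.
    vectors = [[1 if label in labels else 0 for label in kept] for labels in label_sets]
    return kept, vectors


# Hypothetical toy annotations (illustration only).
toy_annotations = [
    {"phase shift"},
    {"phase shift", "sleep modulation"},
    {"amplitude change"},  # single representative, so this label is dropped
    {"sleep modulation"},
]
kept_labels, label_vectors = filter_labels(toy_annotations)
```

In this toy case the label with a single representative disappears, and the compound that carried only that label is left with an all-zero vector.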

2.2. Molecular Representation and Model Architecture

To convert chemical structures into a machine-readable format, the SMILES notation (Simplified Molecular Input Line Entry System) was used, which encodes the topology of a molecule as a linear sequence of symbols. To increase the robustness of training, a SMILES-augmentation strategy was applied: for each molecule, fifty randomised (non-canonical) variants of the SMILES string were generated using the Chem.MolToSmiles(mol, doRandom=True) function of the RDKit library. This increases the effective size of the training set and discourages memorisation of specific string patterns.

The work used the SMILES-Transformer architecture [6] based on the BERT (Bidirectional Encoder Representations from Transformers) encoder. The model contains 13.8 million trainable parameters and was pre-trained on a large collection of unlabelled molecular structures from the ChEMBL database. The key advantage of the Transformer architecture lies in the self-attention mechanism, which allows the model to dynamically estimate the importance of each token (atom, bond, or functional group) in the context of the entire sequence and effectively capture long-range intramolecular dependencies [5,7]. The training procedure was divided into two alternative approaches. Within the feature-based approach, the pre-trained encoder was used in a frozen-weights regime to extract fixed vector representations of molecules, on top of which a classifier (logistic regression with L1/L2 regularisation) was trained. Within the end-to-end fine-tuning approach, all weights of the Transformer encoder, together with the classification head, were jointly adapted to the specific tasks of chronobiotic classification. For the tokenisation of SMILES strings, a character-level method using a specialised dictionary of chemical symbols was applied.
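The character-level tokenisation mentioned above can be sketched as follows. This is a minimal illustration and not the tokenizer of the SMILES-Transformer implementation: the treatment of Cl and Br as single tokens, the special tokens, and the vocabulary built from the input itself are all assumptions made for the example.

```python
TWO_CHAR_ATOMS = ("Cl", "Br")  # assumed: two-letter halogens are kept as single tokens


def tokenize(smiles):
    """Split a SMILES string into symbols, one character at a time."""
    tokens = []
    i = 0
    while i < len(smiles):
        if smiles[i:i + 2] in TWO_CHAR_ATOMS:
            tokens.append(smiles[i:i + 2])
            i += 2
        else:
            tokens.append(smiles[i])
            i += 1
    return tokens


def build_vocab(token_lists):
    """Assign integer ids to tokens; ids 0 and 1 are reserved (assumed convention)."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab):
    """Map tokens to ids, falling back to <unk> for unseen symbols."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]


tokens = tokenize("ClC1=CC=CC=C1Br")
vocab = build_vocab([tokens])
ids = encode(tokens, vocab)
```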

2.3. Validation Protocols and Metrics

Evaluating the quality of models on small samples requires rigorous protocols to avoid overfitting and false-positive conclusions about predictive capability. The following validation strategies were used. Data-Efficient Modeling (DEM). This metric assesses the ability of the model to learn under conditions of data scarcity. It is calculated as the mean value of the target metric (ROC-AUC) when training on subsamples of increasing size (from 1.25% to 80% of the total sample). In the present study, DEM evaluation was carried out with 20-fold repetition for each threshold, providing robustness of the resulting estimates against the randomness of partitioning.
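A minimal sketch of how a DEM-style score could be aggregated is given below. The text specifies only the range of subsample sizes (1.25% to 80%) and the number of repeats; the doubling grid of fractions and the equal weighting of fractions are assumptions of this example, and the ROC-AUC values are invented placeholders.

```python
from statistics import mean


def dem_fractions(start=0.0125, stop=0.80):
    """Training fractions from 1.25% to 80%, doubling at each step (assumed grid)."""
    fractions = []
    f = start
    while f <= stop + 1e-12:
        fractions.append(round(f, 6))
        f *= 2
    return fractions


def dem_score(auc_by_fraction):
    """Average the per-fraction mean ROC-AUC (each fraction weighted equally)."""
    return mean(mean(aucs) for aucs in auc_by_fraction.values())


fractions = dem_fractions()
# Placeholder values for two fractions and two repeats each (not study results).
toy_score = dem_score({0.4: [0.5, 0.6], 0.8: [0.6, 0.8]})
```

In the study each fraction would contribute 20 repeats rather than the two shown here.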

Scaffold validation. The traditional random split often inflates metrics because structurally related molecules appear in both the training and test sets. Scaffold validation based on Bemis–Murcko scaffolds (the ring systems of a molecule together with the linkers connecting them) guarantees that molecules sharing an identical scaffold fall exclusively into one of the sets. This makes it possible to evaluate the model’s capacity for “scaffold hopping”, that is, generalisation to new chemical classes. The work used a repeated 5 × 5 scaffold-validation scheme: five independent partitions into training and test sets, each evaluated five times, with a total sample of 207 compounds containing 157 unique scaffolds.
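The grouping logic behind a scaffold split can be sketched in plain Python. The example assumes that a scaffold string has already been computed for every compound (in practice this would typically be done with a cheminformatics toolkit); the test fraction, the seed handling, and the greedy filling of the test set are illustrative choices, not the procedure used in the study.

```python
import random
from collections import defaultdict


def scaffold_split(scaffolds, test_fraction=0.2, seed=0):
    """Assign whole scaffold groups to train or test; returns two index lists."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)  # a different seed gives a different partition
    n_test_target = test_fraction * len(scaffolds)
    train, test = [], []
    for key in keys:
        # Fill the test set group by group until the target size is reached.
        if len(test) < n_test_target:
            test.extend(groups[key])
        else:
            train.extend(groups[key])
    return sorted(train), sorted(test)


# Hypothetical scaffold keys for ten compounds.
toy_scaffolds = ["A", "A", "B", "C", "C", "C", "D", "E", "E", "F"]
train_idx, test_idx = scaffold_split(toy_scaffolds, seed=0)
```

Repeating the call with different seeds yields independent partitions, which is one way a repeated scheme such as 5 × 5 could be organised.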

To assess the quality of multi-label classification, three metrics were used: F1 micro (the F1 score computed from the true positives, false positives, and false negatives pooled across all labels, so that frequent labels dominate), F1 macro (the unweighted average of the per-label F1 scores, which reflects the quality of prediction of rare classes), and ROC-AUC micro (the area under the ROC curve computed over all compound–label pairs pooled together).
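The difference between the two F1 averages is easiest to see on a small example. The sketch below computes both from binary indicator matrices using only the standard definitions; the toy labels and predictions are invented for illustration.

```python
def f1_scores(y_true, y_pred):
    """Return (F1 micro, F1 macro) for binary indicator matrices given as row lists."""
    n_labels = len(y_true[0])
    tp, fp, fn = [0] * n_labels, [0] * n_labels, [0] * n_labels
    for true_row, pred_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(true_row, pred_row)):
            if t and p:
                tp[j] += 1
            elif p and not t:
                fp[j] += 1
            elif t and not p:
                fn[j] += 1

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    micro = f1(sum(tp), sum(fp), sum(fn))  # counts pooled over labels
    macro = sum(f1(a, b, c) for a, b, c in zip(tp, fp, fn)) / n_labels  # per-label mean
    return micro, macro


# Label 0 is frequent and always predicted correctly; label 1 is rare and always missed.
toy_true = [[1, 0], [1, 0], [1, 1], [0, 1]]
toy_pred = [[1, 0], [1, 0], [1, 0], [0, 0]]
f1_micro, f1_macro = f1_scores(toy_true, toy_pred)
```

Here F1 micro is 0.75 while F1 macro is 0.5, because the missed rare label counts as a full label in the macro average but contributes only two false negatives to the pooled counts.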

2.4. Optimisation and Loss Functions

Particular attention was paid to addressing class imbalance, in which the number of negative examples greatly exceeds the number of positive labels. Within the end-to-end fine-tuning framework, three loss functions were implemented and tested: Binary Cross Entropy (BCE), the standard loss for independent binary tasks; Focal Loss, which introduces a modulating factor (1 − p_t)^γ, where p_t is the predicted probability of the true class, that decreases the contribution of easily classified examples; and Asymmetric Loss (ASL) [10], which extends the ideas of Focal Loss by using separate focusing parameters for positive and negative samples, thereby allowing more aggressive suppression of the gradients arising from the abundant negative labels.
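The three losses can be compared on a single prediction with a few lines of standard-library Python. This is a per-element sketch following the commonly published formulations; the default values of gamma, gamma_pos, gamma_neg, and margin are illustrative assumptions and are not the settings selected by the hyperparameter sweep.

```python
import math

EPS = 1e-8  # numerical guard against log(0)


def _clamp(p):
    return min(max(p, EPS), 1 - EPS)


def bce(p, y):
    """Binary cross-entropy for one predicted probability p and one label y in {0, 1}."""
    p = _clamp(p)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def focal(p, y, gamma=2.0):
    """Focal loss: BCE scaled by (1 - p_t)^gamma, where p_t is the true-class probability."""
    p = _clamp(p)
    p_t = p if y == 1 else 1 - p
    return -((1 - p_t) ** gamma) * math.log(p_t)


def asymmetric(p, y, gamma_pos=0.0, gamma_neg=4.0, margin=0.05):
    """Asymmetric loss: separate focusing for positives and negatives, with a
    probability margin that zeroes out very easy negatives."""
    p = _clamp(p)
    if y == 1:
        return -((1 - p) ** gamma_pos) * math.log(p)
    p_m = max(p - margin, 0.0)  # shifted probability for negatives
    return -(p_m ** gamma_neg) * math.log(max(1 - p_m, EPS))
```

With gamma set to zero the focal loss reduces to BCE, and with both focusing parameters and the margin set to zero the asymmetric loss does the same, which is a convenient sanity check when implementing them.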

The hyperparameter-tuning procedure included a systematic search (hyperparameter sweep) for the optimal values of the learning rate, batch size, warm-up ratio, and loss-function parameters using the Syne Tune framework [9] for distributed optimisation.

2.5. Virtual Screening of the SAVI Library

After completion of the training and validation stage, the trained models were applied to virtual screening of the SAVI (Synthetically Accessible Virtual Inventory) library, which contains more than five million synthetically accessible compounds. The SAVI library includes molecules each of which can be synthesised in 1–3 chemical steps from commercially available reagents, ensuring the practical feasibility of the screening results. For every molecule in the SAVI library, evaluation was performed by four independent models: classf (membership in a pharmacological class, eight labels), effect_coarse (the type of circadian effect), target (the molecular target), and mechanism (the mechanism of action). The final consensus score was calculated as a weighted combination of the normalised ranks across the four models with the weights classf = 0.2, effect_coarse = 0.5, target = 0.2, and mechanism = 0.1. The increased weight of the effect_coarse model is justified by the priority given to the functional characterisation of chronobiotic action for practical screening tasks. From the total number of molecules, 10,000 candidates with the highest consensus scores were selected. The chemical correctness of the selected structures was validated using the RDKit library.
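The consensus scoring described above can be sketched as follows. The weights are those given in the text; the rank normalisation to the interval [0, 1], the tie handling, and the toy model scores are assumptions of this example, since the exact normalisation is not specified.

```python
WEIGHTS = {"classf": 0.2, "effect_coarse": 0.5, "target": 0.2, "mechanism": 0.1}


def normalised_ranks(scores):
    """Map scores to [0, 1]: lowest score -> 0, highest -> 1 (ties broken by input order)."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    for rank, i in enumerate(order):
        ranks[i] = rank / (n - 1) if n > 1 else 1.0
    return ranks


def consensus(scores_by_model, weights=WEIGHTS):
    """Weighted sum of per-model normalised ranks, one value per molecule."""
    per_model = {m: normalised_ranks(s) for m, s in scores_by_model.items()}
    n = len(next(iter(per_model.values())))
    return [sum(weights[m] * per_model[m][i] for m in per_model) for i in range(n)]


# Invented scores for three molecules (illustration only).
toy_scores = {
    "classf": [0.9, 0.5, 0.1],
    "effect_coarse": [0.8, 0.2, 0.6],
    "target": [0.7, 0.6, 0.1],
    "mechanism": [0.3, 0.5, 0.9],
}
toy_consensus = consensus(toy_scores)
best_molecule = max(range(len(toy_consensus)), key=lambda i: toy_consensus[i])
```

Because ranks rather than raw probabilities are combined, a model with poorly calibrated outputs cannot dominate the consensus beyond its assigned weight.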

3. Results

3.1. Evaluation of Baseline Predictive Capability (DEM)

The results of model evaluation under the Data-Efficient Modeling (DEM) protocol with 20 repeats for all task types are presented in Table 1. The data reflect the average predictive performance of the model when trained on subsamples of varying size, allowing the robustness of the algorithm to data-scarce conditions to be assessed.

Analysis of Table 1 shows that the model copes most successfully with determining the pharmacological class (classf, ROC-AUC = 0.571) and the mechanism of action (mechanism, ROC-AUC = 0.567). Prediction of biological targets (target) demonstrates somewhat lower efficiency (0.552). As expected, the most difficult task proved to be predicting circadian effects, where ROC-AUC values fluctuate within the range 0.520–0.529. According to commonly accepted criteria for the interpretation of ROC-AUC, values in the interval 0.5–0.6 correspond to weak predictive ability, bordering on the level of random guessing. Nevertheless, in the context of extremely small and noisy samples (~200–300 molecules), the presence of a stable signal above the 0.5 threshold indicates that the Transformer architecture is capable of extracting basic structural regularities associated with chronobiological activity.

3.2. Robustness Under Scaffold Validation

To evaluate the model’s ability to generalise to structurally novel compounds, scaffold validation was carried out using the effect_coarse task as an example. The results of the comparison between a single split and repeated 5 × 5 cross-validation are given in Table 2.

The transition from the single split to the repeated 5 × 5 scheme produced a moderate increase in the key metrics: F1 micro rose from 0.287 to 0.319, while ROC-AUC micro increased from 0.567 to 0.587. The 95% confidence interval for F1 micro [0.299; 0.340] indicates the stability of the results under different scaffold-partitioning variants. The substantial gap between F1 micro (0.319) and F1 macro (0.254) confirms persistent difficulties in recognising rare types of effects, which contribute disproportionately to the F1 macro metric.

3.3. Comparison of Feature-Based and End-to-End Approaches

The results of the comparative analysis of different training strategies for the effect_coarse task are presented in Table 3. They reveal the critical importance of hyperparameter optimisation when using deep Transformer models on small data sets. A single fine-tuning run with default parameters (end2end_single) led to a catastrophic drop in performance (ROC-AUC = 0.470), which is below the level of random choice and indicates pronounced overfitting. Transformer architectures with their large number of parameters (13.8 million) are extremely sensitive to learning-rate and regularisation settings when trained on small samples. At the same time, systematic optimisation (end2end_sweep_best) allowed the best metrics to be achieved: F1 micro = 0.376 and ROC-AUC micro = 0.666. This confirms that with adequate tuning, the weights of a pre-trained Transformer can be effectively adapted to the specific characteristics of chronopharmacology tasks and surpass the static feature-based approach in discriminative ability (ROC-AUC).

3.4. Influence of Effect-Labelling Variants

The analysis of labelling variants (Table 4) reveals a characteristic pattern: the original labelling effect_raw shows an anomalously high ROC-AUC (0.832) accompanied by an extremely low F1 macro (0.148). This dissociation indicates a pronounced imbalance in the original data, in which the model successfully predicts a few dominant classes but completely ignores the rest. The aggregated variants (effect_coarse and effect_expert) reduce the ROC-AUC to 0.58–0.60 but at the same time substantially increase F1 macro (to 0.232–0.254). This indicates that expert curation of labels yields a more balanced task, increasing the probability of recognising biologically meaningful but rare effects—a critical property for practical virtual screening.

3.5. Results of Virtual Screening of the SAVI Library

Based on the trained and validated models, a consensus virtual screening of the SAVI library (more than five million compounds) was performed. From the total number of molecules, 10,000 of the most promising candidates were selected; the distribution of their consensus scores is presented in Table 5.

Among the 10,000 candidates, 301 molecules received high scores simultaneously from all four models (cross-task candidates), which substantially increases the reliability of the prediction for this subset. Particular attention should be paid to the 34 super-candidates with consensus scores above 0.9, for which all four independent models demonstrated consistent high confidence.

3.5.1. Predicted Pharmacological Classes

The classification model (classf) assigned 98.05% of the top candidates the label “circadian rhythm modulator”, 1.90% the label “hypnotic”, and only 0.05% the label “potential chronodisruptor”. The almost complete absence of chronodisruptors at the top indicates that the model does not recommend compounds potentially harmful to the circadian system.

3.5.2. Predicted Action Effects

The three principal effects are distributed relatively evenly, with a slight predominance of phase shift (37%). Molecules with a predicted phase-shift effect are of interest for the correction of desynchronosis associated with transmeridian flights and shift work. Candidates with a circadian-restoration effect are promising for the therapy of circadian disturbances in neurodegenerative diseases, while molecules with a sleep-modulation effect represent a potential alternative to classical hypnotics (Table 6).

3.5.3. Predicted Molecular Targets

The dominance of the CLOCK-BMAL1 complex (60.49%) among the predicted targets reflects its central role in the ChronobioticsDB training database, where most known chronobiotics interact precisely with this transcriptional complex. A substantial proportion of candidates directed at CRY1–PER2 (28.02%) indicates the potential of the identified compounds to modulate the negative feedback loop of the circadian cycle. The relatively low proportion of melatonin ligands (1.85%) is explained by the specific chemical structure of melatonin agonists (indole derivatives), which is rarely encountered in synthetic libraries of a general nature (Table 7).

3.5.4. Structural Features of the Top Candidates

Analysis of the chemical structure of the 30 candidates with the highest consensus scores revealed characteristic structural motifs (Table 8) forming the “pharmacophore portrait” of a potential chronobiotic from the SAVI library.

The 100% occurrence of aromatic rings is consistent with the well-known fact that small molecules interacting with the transcription factors CLOCK, BMAL1, CRY, and PER are predominantly planar aromatic compounds. The high frequency of imine bonds N=C (77%) indicates the ability of these structures to act as hydrogen-bond acceptors and donors of π-electrons. The presence of chlorine substituents (43%) contributes to increased lipophilicity, improving permeation through the cell membrane to intracellular targets (Figure 1).

Table 9. Ranking of super-candidate compounds for chronobiotic activity from the SAVI 1/55 block.

No.	SAVI ID	SMILES	Consensus Score
1	1A275D1E3DD3524A_059EC9AC131E1D6F_2201_UN	CC1(C)COC2=C(C3=C(N=C12)C4=C(S3)NC(C(=C4)C(=O)O)=O)C5=CC=CC=C5	0.95093
2	8D3182EB37F23078_F37BCE2DDB9434D2_2201_UN	CCOC1=CC3=C(C=C1)OCC4=C(C)C2=C(C=CC(=C2)[N](=O)=O)N=C34	0.94279
3	37D4039641B3BC96_C51990CEE4ECD988_2201_UN	ClC1=C(C=CC=C1)C2=C(C4=C(N=C2C3CCOC3)N=CC=C4)C5=CC(=CC=C5)Cl	0.93457
4	D0AE13FD123CDC8C_7435451788DFFA67_2201_UN	ClC1=C(C=CC=C1)C3=C(C2=CC(=C(OC)C=C2N=C3C4=CSC=C4)OC)C	0.93363
5	4A662B2114895A5C_17C8243AD7AA0E1B_2201_UN	ClC1=C(C=CC=C1)C2=C(C4=C(N=C2C3CCOC3)N=CC=C4)C5=CC(=CC(=C5)F)F	0.93169
6	48D2262B63F9A9ED_AC5CCD9004562092_2201_UN	ClC1=C(C=CC=C1)C2=CC4=C(N=C2C3CCOC3)C=C(C=N4)F	0.92927
7	FCAE9B9C9F85925C_C93F992927D663B2_2201_UN	CC1=NC3=C(C(=C1CCCC=C)C2=CC=C(OC)C=C2)C4=C(S3)CCC4	0.92651
8	3DC78391E8CFE6FE_04382ADD4A0EE0BF_2201_DP	OC2C1=C(C3=C(N=C1C2(C)C)C4=C(S3)NC(C(=C4)C(=O)O)=O)C5=CC=CC=C5	0.92577
9	F0EEAB259F5604BB_0784D55EEB07C643_2201_UN	CC2(C)COC3=C(C1=CC=C(C)C=C1)C4=C(N=C23)SC5=C4CCC(C5)(C)C	0.92511
10	DD91AFA3FD38EC01_EA8D819DAFE7BBFF_2201_UN	COC1=CC4=C(C=C1)C3=NC2=C(C=C(C=C2)Cl)C(=C3CC4)C5=C(C=CC=C5)F	0.92311
11	7333F032F376DB51_EEF15F3C4B182B7F_2201_UN	ClC1=C(C=CC=C1)C3=C(C)C2=C(C=C(C=C2)F)N=C3C4CCOC4	0.92239
12	1B8724E4F079AF13_D8F1E5D17DCFE5AF_2201_UN	CCOC(=O)C2CCN(CC1=CC=CC=C1)CCC3=C(C4=C(N=C23)C=C(C=C4)Cl)C5COCCC5	0.92089
13	D70D62A960D63F7C_332F7499CFA77E1C_2201_UN	ClC1=C(C=CC=C1)C3=C(C2=C(OC)C=CC=C2N=C3C4CCOC4)C	0.91971
14	26B5CA1C071C41C7_E875983BE719FFF7_2201_UN	C2=NC1=C(C=CC=C1)C(=C2CC3=CON=C3)C[S](=O)(=O)N	0.91788
15	3EA1FB1AFF1DA556_3973C278384A9C44_2201_UN	CC1=NC2=C(C(=C1CCCC=C)C(C)C)C3=C(S2)C(CC(C3)(C)C)(C)C	0.91660
16	301F30FFC04DC940_6A2991726B857875_2201_UN	CCOC1=CC2=C(C=C1)OCC3=C(C4=C(N=C23)SC5=C4CCC5)C6=CN=CC=C6	0.91629
17	AD6DC4D914DEF26E_1EB6E3F7DECCFB7A_2201_UN	CC2=NC1=C(C=NC=C1)C(=C2CCCC=C)C(F)(F)F	0.91437
18	C838FE0623C5B07D_D9320B3361EB2866_2201_UN	C3=NC1=C(C2=C(S1)CCCC2)C(=C3CC4=CON=C4)C5=CC(=CC=C5)F	0.91286
19	4220E144E1F5944B_FF7C2A854CEED719_2201_UN	CCOC1=CC3=C(C=C1)OCC4=C(C)C2=C(C=CC(=C2)O)N=C34	0.91137
20	B7C81F203CC1F6C5_F5508E7A730D0588_2201_UN	ClC1=C(C=CC=C1)C2=C(C4=C(N=C2C3CCOC3)C=CC(=C4)F)C5=CC=CC=C5	0.91109
21	B88882A3FAA95561_2A5FE6C4841D3564_2201_DP	OC2C1=CC4=C(N=C1C23CCC3)C(=NC=C4)Br	0.90852
22	5A8108D267EFBBF6_F1DF654F4F5B1E47_2201_DP	OC2C1=C5C4=C(N=C1C23CCC3)C(=CC(=C4C(C6=C5C=CC=C6)=O)Br)C(=O)O	0.90731
23	FF5E13C995B41E36_B5B6E33BB7C03FDF_2201_UN	COC1=CC5=C(C=C1)C4=NC2=C(C3=C(S2)CCC3)C(=C4CC5)C6=CC=NC=C6	0.90707
24	2E6BF2F0D8FA29A0_30D34167389A43FD_2201_DP	OC2C1=CC3=C(N=C1C2(C)C)N=CC=C3OC(F)(F)F	0.90643
25	8779633B5BFF22F4_75EA71CFD58A7E89_2201_UN	CCOC1=CC3=C(C=C1)OCC4=C(C)C2=C(C=CC(=C2)Br)N=C34	0.90493
26	C1941DBDD6103EBA_A0692FCDD1DF39C3_2201_UN	ClC1=C(C=CC=C1)C2=C(C4=C(N=C2C3CCOC3)C=CC=C4)C5=CC=CC=C5	0.90455
27	F866369954281749_B36D81BEB5370B6F_2201_UN	CC2(C)COC3=C(C1=CC(=C(C)C=C1)F)C4=C(N=C23)SC5=C4CCC5	0.90379
28	2FBF185F383D3EDA_5F2563289E30B338_2201_UN	COC1=CC5=C(C=C1)C4=NC2=C(C3=C(S2)CCC3)C(=C4CC5)C6=CC(=CC=C6)Cl	0.90329
29	F9919567E86AD30F_4D6275FD2B17E549_2201_UN	CCOC(=O)C2CCN(CC1=CC=CC=C1)CCC3=C(C4=C(N=C23)C=C(C=C4)Cl)C5COCC5	0.90323
30	4BD3AF3106A1D416_F8C04917CCC7A808_2201_UN	CC(C)(C)C3=NC2=C([N]1CCCCC1=N2)C(=C3CC4=CC=CC=C4)C5=CC=C(C=C5)Cl	0.90284
31	D1CE029DC385F1BF_37224ED6F642E061_2201_DP	OC2C1=CC4=C(N=C1C23CCC3)N=CC(=N4)Br	0.90280
32	57EA6E9327C0702B_E91BC4B810E319EB_2201_UN	C1=NC3=C(C(=C1CC2=CON=C2)C)SC4=C3C(=CC(=N4)C)C	0.90236
33	26F6C2711940E9B9_CF3B95CDD8C70C89_2201_UN	CC(C)(C)C1=NC3=C(C(=C1CC2=CC=CC=C2)C(C)=C)C=CC=C3	0.90236
34	94EB09F58F06CB7C_9E67FB133423B583_2201_DP	OC4C3=C(C)C1=C(C2=C(S1)N=C(C)C=C2C)N=C3C4(C)C	0.90148

3.5.5. Characterisation of the 34 Super-Candidates

The 34 molecules with consensus scores above 0.9 represent the principal practical outcome of the virtual screening. All 34 super-candidates are synthetically accessible compounds (28 molecules in the UN—unique synthesis—category, and 6 molecules in the DP—defined pathway—category) that have passed RDKit chemical validation (correct SMILES, proper valences, absence of anomalous structures). All four independent models simultaneously rated these structures as highly promising. Detailed structural data and scores for all 34 super-candidates are provided in the Supplementary Materials (Supplementary File S1).

4. Discussion

Interpretation of the obtained results requires consideration of fundamental limitations characteristic of the current state of data in the field of chronobiology and chronopharmacology. The ROC-AUC values in the range 0.55–0.67, recorded in the best configurations of the SMILES-Transformer, may seem moderate compared with the results achieved in computer-vision or natural-language-processing tasks. However, in the context of computational pharmacology of chronobiotics, such metrics are expected and justified. The chronobiotic effect of a compound is determined not only by its affinity for a particular molecular target but also by the complex systemic response of the organism, which depends on chronopharmacokinetic parameters: time of administration, dosage, metabolic stability, and tissue distribution—factors that are fundamentally not encoded in the SMILES string [3,4].

4.1. The Influence of Small Samples on Model Stability

One of the central problems of the study was the critical shortage of data in ChronobioticsDB (~200–300 annotated compounds), which can only be overcome as the database is filled with new experimentally verified molecules [1]. The small training-set size makes the training process of deep models extremely unstable, as clearly demonstrated by the failure of the end2end_single model (Table 3, ROC-AUC = 0.470). Transformer architectures with 13.8 million parameters, when training examples are scarce, tend to memorise noise in the data rather than to identify true chemico-biological regularities [13]. In this context, the use of pre-trained SMILES-Transformer embeddings acts as a powerful regulariser, providing the model with a priori knowledge of the regularities of chemical space derived from millions of unlabelled ChEMBL structures. The success of the end2end_sweep_best approach (ROC-AUC = 0.666) demonstrates that Transformer models can be effective even on extremely small samples, provided that hyperparameters are carefully controlled and loss functions resistant to class imbalance are applied. The performance gap between end2end_single and end2end_sweep_best (0.470 vs. 0.666) amounts to 0.196 ROC-AUC units—a magnitude which, in practical terms, may mean the difference between a useless and a workable model for virtual screening.

4.2. Scaffold Validation and Generalisation Ability

The application of scaffold validation made it possible to assess the real predictive power of the models in the most stringent scenario: prediction for molecules with chemical scaffolds absent from the training set. The ROC-AUC value of 0.587 under repeated 5 × 5 scaffold splitting (Table 2) indicates that the model has limited but statistically significant capacity to generalise to structurally novel scaffolds. In chemoinformatics, this type of generalisation is often called “out-of-distribution generalisation” and represents one of the most fundamental unresolved problems of the field [11]. The fact that the model performs above random under such a stringent partition is consistent with an advantage of the contextual embeddings of SMILES-Transformer over classical descriptors: the Transformer is capable of capturing the functional equivalence of various chemical groups even when the central scaffold of the molecule changes substantially.

At the same time, the scaffold ROC-AUC of 0.587 is only marginally above the random-guessing threshold (0.5), which objectively limits the model’s applicability for the identification of fundamentally new classes of chronobiotics. The model preferentially identifies compounds structurally related to already known chronobiotics, which must be taken into account when interpreting the results of SAVI screening.

4.3. The Role of Data Curation in Prediction Quality

The comparison of labelling variants (Table 4) underscores the critical importance of high-quality data curation for learning tasks in chemoinformatics. The use of the original labels (effect_raw) yields deceptively high ROC-AUC values (0.832), which in practice prove useless because of the model’s inability to predict anything beyond a few dominant classes (F1 macro = 0.148). The transition to expert labelling (effect_expert), despite a formal decrease in ROC-AUC to 0.584, increases F1 macro to 0.254, which is critically important for practical virtual-screening tasks where the researcher is interested in the model’s ability to identify rare but biologically significant modifiers of the circadian rhythm.

4.4. Interpretation of the SAVI Screening Results

Virtual screening of the SAVI library with a consensus of four independent models made it possible to rank more than five million compounds by their potential chronobiotic activity. Combining the predictions of four models has a substantial advantage over a single-model approach: the low correlation between models (r = 0.011 between mechanism and classf) indicates that the scores are de facto independent, which increases the reliability of the aggregated ranking. The identification of 34 super-candidates with consensus scores > 0.9 is the principal practical result of the screening. These molecules received high scores on all four criteria simultaneously, which substantially reduces the probability that they reached the top of the ranking by chance.

Structural analysis of the super-candidates revealed a characteristic pharmacophore profile that includes planar aromatic systems with imine bonds, sulfur-containing heterocycles, and halogen substituents; these motifs agree well with the known requirements for ligands of transcription factors of the circadian system. The dominance of CLOCK-BMAL1 (60.49%) among the predicted targets reflects not only its central role in the molecular clock but also a bias of the training set: most known chronobiotics in ChronobioticsDB are characterised precisely through their interaction with this complex. This observation represents both an advantage (more reliable predictions for the most well-studied target) and a limitation (potential underestimation of candidates with less studied mechanisms of action).

In the context of practical pharmacology, a ROC-AUC of 0.60–0.67 is sufficient for virtual-screening tasks. Even under a conservative estimate of screening success of 1% (one active molecule out of 100 tested), the 435 strong candidates (consensus > 0.8) correspond to an expectation of roughly 4–5 potentially active compounds (435 × 0.01 ≈ 4.35), a result that justifies the costs of chemical synthesis and biological testing of the priority candidates.
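
The consensus tiers and the yield estimate can be checked with a short sketch. It assumes, purely for illustration, that the consensus score is the unweighted mean of the four classifier outputs, that the tier boundaries of Table 5 are strict lower bounds, and that every strong candidate is tested independently with the same 1% probability of being active. These modelling choices, the helper names, and the example scores are assumptions rather than part of the published screening protocol.

```python
from statistics import mean

# Lower bounds and labels following Table 5 (boundary handling is assumed).
TIERS = [
    (0.90, "Super-candidates"),
    (0.80, "Very strong candidates"),
    (0.70, "Strong candidates"),
    (0.60, "Moderate candidates"),
    (0.50, "Weak candidates"),
]


def consensus_score(scores):
    """Assumed aggregation: unweighted mean of the per-model probabilities."""
    return mean(list(scores.values()))


def tier(score):
    """Map a consensus score to the first tier whose lower bound it exceeds."""
    for lower, label in TIERS:
        if score > lower:
            return label
    return "Lower half of the top 10,000"


def expected_hits(n_candidates, hit_rate):
    """Expected number of actives if each candidate is active with `hit_rate`."""
    return n_candidates * hit_rate


def prob_at_least_one(n_candidates, hit_rate):
    """Probability of at least one active under independent, identical trials."""
    return 1.0 - (1.0 - hit_rate) ** n_candidates


# Invented scores for one hypothetical molecule.
example = {"classf": 0.95, "effect_coarse": 0.90, "target": 0.92, "mechanism": 0.91}
example_tier = tier(consensus_score(example))

n_strong = 34 + 401  # candidates with consensus > 0.8 (Table 5)
expected = expected_hits(n_strong, 0.01)
p_any = prob_at_least_one(n_strong, 0.01)
print(f"{example_tier}: expected hits = {expected:.2f}, P(at least one) = {p_any:.3f}")
```

Under these assumptions the expectation is about 4.35 actives, in line with the 4–5 quoted above; the probability of finding at least one active is an additional quantity that follows from the same binomial model and is not reported in the study itself.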

4.5. Comparison with Results in Chemoinformatics

In the context of typical machine-learning studies in chemoinformatics (for example, on MoleculeNet benchmarks), the obtained ROC-AUC values of 0.60–0.67 appear competitive for tasks of comparable complexity and data volume. In the original SMILES-Transformer paper [6], metrics on small data subsamples were often within similar bounds. A literature review of the application of machine-learning methods in pharmacology shows that, for tasks with a training set of fewer than 300 molecules and a complex phenotypic endpoint (such as the circadian effect), ROC-AUC values above 0.65 are highly competitive [4,12,13].

5. Conclusion

In the present work, a comprehensive study of the applicability of the SMILES-Transformer architecture for the prediction of the properties of circadian-rhythm modulators based on ChronobioticsDB was carried out, and the first systematic virtual screening of the SAVI library [14] for chronobiotic activity was performed. For the first time for this class of tasks, rigorous validation protocols, including Data-Efficient Modeling and repeated scaffold validation, were applied, providing an objective evaluation of the robustness and generalisation ability of the models.

The principal conclusions of the study are as follows. First, the SMILES-Transformer architecture demonstrates statistically significant predictive capacity in tasks of chronobiotic classification, attaining a ROC-AUC of 0.666 under deep hyperparameter optimisation in end-to-end fine-tuning mode. Second, systematic optimisation of training parameters is critical for preventing overfitting on small samples: non-optimised fine-tuning leads to a catastrophic drop in performance (ROC-AUC 0.470). Third, expert aggregation of effect labels improves the prediction of rare classes (an increase in F1 macro from 0.148 to 0.254), which enhances the practical suitability of the models for virtual screening. Fourth, scaffold validation confirms the limited but statistically significant capacity of the model to generalise to structurally novel compounds (ROC-AUC 0.587). Fifth, virtual screening of more than five million compounds in the SAVI 1/55 library made it possible to identify 34 super-candidates and 435 strong candidates for prioritised experimental verification; the principal predicted target is the CLOCK-BMAL1 complex (60.49%), and the dominant effect is a phase shift of the circadian rhythm (37%).

The limitations of the current work include: the small size of the ChronobioticsDB training set (345 compounds); the absence in the SMILES string of system-level biological factors (time of administration, concentration, pharmacokinetic parameters); the limited capacity of the model to generalise to fundamentally new chemical scaffolds; and the need for experimental verification of the screening results. Future research directions should include: expansion of ChronobioticsDB through the integration of data from new publications and high-throughput experiments; the development of target-aware models that incorporate information about the three-dimensional structure of biological targets; the investigation of hybrid graph–Transformer architectures; experimental verification of the identified super-candidates on cellular reporter systems (Bmal1-luciferase assay) followed by molecular docking against the crystal structures of CLOCK-BMAL1 and CRY1–PER2 from the PDB; and the extension of the screening to other ultra-large virtual libraries (PatCID, ZINC, Enamine REAL) for maximum coverage of the chemical space of potential chronobiotics.

Supplementary Materials

The following supporting information can be downloaded at the website of this paper posted on Preprints.org. Supplementary Table S1: Complete structural data and consensus scores for the candidate compounds (file consensus_top.csv). Supplementary Data S2: Ranked candidate lists by score categories (archive candidates_by_score.zip). The source code of the project is available at: https://github.com/Glebbbbbbb-g/chronobiotics-smiles-transformer-publication.

Funding

This work was carried out with the financial support of the Russian Science Foundation grant “Creation of the world’s first pharmacological database of circadian rhythm modulators (ChronobioticsDB) and provision of access to it” (Project No. 24-75-00108).

Authorship

All members of the author group meet all four authorship criteria formulated in the ICMJE recommendations: (1) I.A.S., G.R.K., D.A.G., A.I.Y. and N.O.K. participated in the development of the concept and design or in the analysis and interpretation of the data; (2) I.A.S., G.R.K., D.A.G., A.I.Y. and N.O.K. drafted the manuscript and critically revised it for important intellectual content; (3) I.A.S., G.R.K., D.A.G., A.I.Y. and N.O.K. gave final approval of the manuscript for publication, being physicians by specialty; (4) I.A.S., G.R.K., D.A.G., A.I.Y. and N.O.K. agree to be accountable for all aspects of the work and certify that questions related to the accuracy and integrity of any part of the presented research have been duly investigated and resolved.

Compliance with Ethical Standards

This article does not contain any studies involving human participants or animal subjects.

Conflicts of Interest

The authors declare that there is no conflict of interest.

References

1. Solovev, I.A.; Golubev, D.A.; Yagovkina, A.I.; Kotelina, N.O. ChronobioticsDB: The Database of Drugs and Compounds Modulating Circadian Rhythms. Clocks Sleep 2025, 7, 30.
2. Solovev, I.A.; Shaposhnikov, M.V.; Moskalev, A.A. Chronobiotics KL001 and KS15 Extend Lifespan and Modify Circadian Rhythms of Drosophila melanogaster. Clocks Sleep 2021, 3, 429–441.
3. Solovev, I.A.; Golubev, D.A. Chronobiotics: Classifications of existing circadian clock modulators, future perspectives. Biomeditsinskaya Khimiya 2024, 70, 381–393.
4. Bi, X.; Wang, Y.; Wang, J.; Liu, C. Machine learning for multi-target drug discovery: Challenges and opportunities in systems pharmacology. Pharmaceutics 2025, 17, 1186.
5. Mswahili, M.E.; Jeong, Y.S. Transformer-based models for chemical SMILES representation: A comprehensive literature review. Heliyon 2024, 10, e39038.
6. Honda, S.; Shi, S.; Ueda, H.R. SMILES Transformer: Pre-trained molecular fingerprint for low data drug discovery. arXiv 2019, arXiv:1911.04738.
7. Gage, P. A new algorithm for data compression. C Users J. 1994, 12, 23–38.
8. Temizer, A.B.; Uludoğan, G.; Özçelik, R.; Koulani, T.; Ozkirimli, E.; Ulgen, K.O.; Karali, N.; Özgür, A. Exploring data-driven chemical SMILES tokenization approaches to identify key protein–ligand binding moieties. Mol. Inform. 2024, 43, e202300249.
9. Sharma, R.; Mukherjee, S.; Sipka, A.; Hullermeier, E.; Vollmer, S.; Redyuk, S.; Selby, D.A. X-Hacking: The Threat of Misguided AutoML. Open Access LMU (Ludwig Maximilian University of Munich). January 2025.
10. Ridnik, T.; Ben-Baruch, E.; Zamir, N.; Noy, A.; Friedman, I.; Protter, M.; Zelnik-Manor, L. Asymmetric Loss for Multi-Label Classification. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 82–91.
11. Mao, W.; Wu, J.; Liu, H.; Sui, Y.; Wang, X. Invariant graph learning meets information bottleneck for out-of-distribution generalization. Front. Comput. Sci. 2026, 20, 2001305.
12. Ramos, M.C.; Collison, C.J.; White, A.D. A review of large language models and autonomous agents in chemistry. Chem. Sci. 2025, 16, 2514–2572.
13. Naser, M.Z. A review of machine learning with small and limited data. J. Big Data 2026, 13, 18.

Figure 1. Super-candidate molecules: potential chronobiotics from the SAVI 1/55 library.

Table 1. DEM scores (20 repeats) for chronobiotic classification tasks.

Task	DEM ROC-AUC Micro	Repeats
classf	0.571	20
mechanism	0.567	20
target	0.552	20
effect	0.521	20
effect_coarse	0.529	20
effect_expert	0.520	20

Table 2. Scaffold-validation metrics for the effect_coarse task.

Protocol	F1 Micro	F1 Macro	ROC-AUC Micro	CI95 Low	CI95 High
single_split	0.287	0.264	0.567	—	—
repeated_5×5	0.319	0.254	0.587	0.299	0.340

Note: “—” indicates that the data are not available.

Table 3. Comparison of the feature-based approach and end-to-end fine-tuning variants.

Configuration	F1 Micro	F1 Macro	ROC-AUC Micro
feature_based_pro	0.304	0.232	0.597
end2end_single	0.223	0.133	0.470
end2end_sweep_best	0.376	0.144	0.666

Table 4. Comparative characterisation of effect-labelling variants.

Variant	F1 Micro	F1 Macro	ROC-AUC Micro	Number of Labels
effect_raw	0.500	0.148	0.832	6
effect_coarse	0.304	0.232	0.597	9
effect_expert	0.296	0.254	0.584	10

Table 5. Distribution of consensus scores for the top 10,000 candidates from the SAVI library.

Score Range	Number of Molecules	Interpretation
>0.90	34	Super-candidates (highest confidence of all 4 models)
0.80–0.90	401	Very strong candidates
0.70–0.80	1130	Strong candidates
0.60–0.70	2132	Moderate candidates
0.50–0.60	2299	Weak candidates
< 0.50	4004	Lower half of the top 10,000

Table 6. Distribution of predicted effects among the top 10,000 candidates.

Effect	Proportion	Potential Medical Application
Phase shift	37.00%	Correction of jet-lag, shift work
Circadian restoration	31.55%	Neurodegeneration, oncology
Sleep modulation	31.45%	Insomnia, disturbances of sleep architecture

Table 7. Distribution of predicted targets among the top 10,000 candidates.

Target	Proportion	Role in the Circadian Rhythm
CLOCK-BMAL1	60.49%	Principal activator of the circadian cycle
CRY1–PER2	28.02%	Negative regulator of CLOCK-BMAL1
Arntl (Bmal1)	6.72%	Genetic regulation of BMAL1
Melatonin receptor	1.85%	Melatonin signal transduction
Other	2.92%	Auxiliary targets

Table 8. Structural motifs of the top 30 candidates.

Structural Feature	Frequency	Chemical Significance
Aromatic rings	100% (30/30)	Binding within the active sites of proteins
Imine bond N=C	77% (23/30)	Planar structure, π–π stacking
Chlorine substituent (Cl)	43% (13/30)	Increased lipophilicity
Sulfur (thiophene/thiazole)	40% (12/30)	Pharmacophoric heterocycles

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.