The proposed PSQ method is divided into two stages. During the first stage, both StackRNN and QSAR models are trained separately with supervised learning algorithms. The StackRNN as generative model is trained to learn the building rules of molecules using SMILES notation, and QSAR as predictive model is trained to predict specific biological activities compounds. In the second stage, both two deep neural networks are trained jointly with an RL approach that optimizes target properties. In this process, the generative model acts as an agent responsible for creating novel, chemically feasible molecules. Meanwhile, the predictive model serves as a critic, assessing the properties of these generated compounds. The critic assigns a numerical reward (or penalty) to each molecule, based on its predicted properties. This reward guides the agent's behavior, and the generative model is trained to maximize the expected reward, thereby optimizing the quality and desirability of the generated molecules.
Figure 1 illustrates the overall workflow of the framework for targeted molecule generation.
2.1. Reinforcement Learning of Molecule Design
RL is employed to address and solve the problem by enabling agents to use learning strategies that maximize returns or achieve specific goals during interactions with the environment. The PPO algorithm was used to implement the DRL framework, aiming to train the generator to produce a chemical space for molecules with desired properties. Both generative () and predictive () models are combined into the PPO reinforcement learning framework to optimize the generation of SMILES (Simplified Molecular Input Line Entry System) strings, which are commonly used to represent chemical structures in drug discovery.
The action set comprises all letters and symbols utilized in the SMILES notation, such as carbon (C), oxygen (O), nitrogen (N), hydrogen (H), and various bond types (single, double, triple, and aromatic bonds). This comprehensive action set ensures that the generative model can select the appropriate character at each step, progressively constructing a complete SMILES string. For instance, the aspirin molecule can be represented by the SMILES string CC(O)OC1CCCCC1C(O)O, which includes elements like carbon and oxygen, as well as ring structures (C1CCCCC1).
The state set encompasses all possible strings composed of the alphabet, with lengths ranging from zero to a specified maximum . The initial state of length zero serves as the unique starting point. Each training episode commences from , with the generative model incrementally adding characters until reaching a terminal state characterized by strings of length , marking the end of a training episode. The subset of terminal states of includes all states with length . This approach allows the system to explore all potential molecular representations within a fixed length range. The reward function is designed to assess the quality of the generated SMILES strings at the terminal state . All intermediate state rewards for are set to zero to avoid premature evaluation of incomplete molecules: Rewards are computed only upon completion of the generation, and this strategy focuses the system on producing complete and valid molecular representations.
In this framework, the generative model
acts as the policy approximation model. At each time step
, where
,
takes the previous state
as input and estimates the probability distribution
for the next action
:
The subsequent action is then sampled from this probability distribution and appended to the current state, generating a new partial SMILES string. For example, if the previous state is the partially generated SMILES string "CC(O)", the model estimates the probabilities of selecting the next character, such as "O", "C", or others.
The predictive model
plays a crucial role in evaluating the properties of the generated SMILES strings and providing feedback on their chemical properties. The reward
at the terminal state is a function of these predicted properties:
where
is a task-specific function that evaluates the quality of the SMILES string
based on the predictions from the predictive model
.
The PPO algorithm iteratively optimizes the policy parameters
through a series of key steps. The generative model
samples a batch of molecular structures using the current policy
. The predictive model
then assesses these structures and provides corresponding reward values
. The PPO algorithm updates the policy parameters
by maximizing the expected reward while constraining the update step to ensure stability. The objective function incorporates a clipping mechanism to prevent excessively large updates, maintaining the balance between exploration and exploitation. The optimization goal is formalized as:
where
represents the set of all possible terminal states. Through this iterative process, the generative model continuously refines its policy, leading to the generation of molecular structures with desired properties as assessed by the predictive model.
2.2. Proximal Policy Optimization
The PPO algorithm plays a pivotal role in optimizing the policy by maximizing the expected reward while maintaining stability and robustness during training. The core of the Proximal Policy Optimization (PPO) algorithm lies in optimizing the policy parameters
using policy gradient methods, with the objective of maximizing the expected cumulative reward
:
where
denotes a trajectory,
is the parameterized policy,
is the discount factor, and
is the reward function.
The PPO employs a surrogate objective function
that penalizes substantial policy deviations. This objective function is defined as:
where
represents the probability ratio between the new and old policies:
and
is the advantage estimate at time step
. The clipping function
ensures that the probability ratio remains within the range
to prevent large policy updates:
Algorithm 1 delineates the essential steps of the Proximal Policy Optimization (PPO) algorithm, focused on optimizing policy parameters
. The algorithm iteratively collects trajectories, computes returns and advantages, and performs gradient ascent on a clipped objective to update the policy. This clipping mechanism ensures stability by constraining probability ratios. The process continues until convergence, resulting in optimized policy parameters that effectively balance exploration and exploitation in reinforcement learning.
| Algorithm 1. PPO |
Input: an initialized policy and value function
for k=1 to epoch
Set
Collect trajectories from
Compute returns and advantages using
for each minibatch in :
Compute probability ratio
Compute clipped objective:
Perform gradient ascent on to update
end for
end for
Output: Optimized policy
|
2.3. Generative Model
The generative model employed is a generative RNN that outputs molecules in SMILES notation (
Figure 2). We specifically use a stack-augmented RNN (Stack-RNN), known for its effectiveness in identifying algorithmic patterns [
16]. In our implementation, valid SMILES strings, which represent chemically feasible molecules, are treated as sequences of SMILES characters. The Stack-RNN's objective is to learn the underlying rules for forming these sequences, ensuring the generation of valid SMILES strings.
The Stack-RNN builds upon the standard GRU cell by incorporating two additional multiplicative gates, known as the memory stack. This structure enhances the model's ability to capture meaningful long-range dependencies. The stack memory functions as a persistent storage that retains information to be used by the hidden layer in subsequent time steps. This type of memory can be accessed only through its topmost element. The stack supports three operations: POP, which removes the top element; PUSH, which adds a new element to the top; and NO-OP, which leaves the stack unchanged. The top element of the stack has value
and is stored at position
when
, a new element
is added to the top of the stack, pushing previous elements down by one position. If
, the top element of the stack is removed, and the remaining elements are shifted up by one position. If
, the stack remains unchanged. These operations enable dynamic and flexible management of the stack memory, crucial for capturing complex dependencies in sequential data.
Now, the hidden layer
is updated as
where
is the matrix and
represents the top element of the stack at time step
. The matrix
integrates this stack element into the hidden layer update, allowing the model to leverage stack memory for capturing complex dependencies in sequential data [
8].
During the training phase, we pre-trained a generative model on the ChEMBL dataset [
17], consisting of approximately 1.5 million drug-like compounds, to ensure the model could produce chemically feasible molecules. This phase did not incorporate any property optimization. The neural network architecture featured a GRU layer with 1500 units [
18] and a stack augmentation layer with 512 units. Training was carried out on a GPU over 10,000 epochs.
The generative model operates in two modes: training and generating. In training mode, at each time step, the network processes the current prefix of the training sequence and predicts the probability distribution of the next character. The predicted character is sampled from this distribution and compared to the actual character, allowing the calculation of the cross-entropy loss. This loss is then used to update the model parameters. In generating mode, the network uses the prefix of the already generated sequences to predict and sample the next character's probability distribution. Unlike in training mode, the model parameters remain unchanged during generation.
Figure 2.
The outline of producing the SMILES string ‘C1CCCCC1’. For each step, the input character is processed by the GRU to generate a new hidden state, followed by an output probability distribution, selecting the output character, performing stack operations, and updating the input for the next time step.
Figure 2.
The outline of producing the SMILES string ‘C1CCCCC1’. For each step, the input character is processed by the GRU to generate a new hidden state, followed by an output probability distribution, selecting the output character, performing stack operations, and updating the input for the next time step.
2.4. Predictive Model
The predictor is a QSAR model designed to estimate the physical, chemical, or biological properties of molecules. This deep neural network, based on the study by Popova et al.[
8], includes an embedding layer, an LSTM layer, and two dense layers. It determines user-specified properties, such as molecular activity, by using SMILES strings as input data vectors. While similar to traditional QSAR models, this approach does not require numerical descriptors. Instead, the model learns directly from the SMILES notation, effectively relating comparisons between SMILES strings to differences in target properties.
In recently years, QSAR models have been crucial in optimizing leads and predicting biological activities of compounds in drug design [
19]. Any QSAR method can generally be expressed in the form
, where
denotes the biological activity or property of interest,
represent the molecular descriptors. The function
captures the relationship between these descriptors and the target property, and it can take various forms, including linear and non-linear functions [
8].
The QSAR model was developed to predict the pIC50 properties for Janus kinase 2 (JAK2) inhibitors. The curated dataset was split into training and validation sets using a 5-fold cross-validation (5CV) fashion. Instead of utilizing traditional calculated chemical descriptors, this model employed SMILES representations. The model consisted of an embedding layer that transformed the sequence of discrete SMILES tokens into a vector of 100 continuous numbers, an LSTM layer with 100 units and tanh nonlinearity, a dense layer with 100 units and rectify nonlinearity function, and a final dense layer with a single unit and identity activation function. Training was performed using a learning-rate decay technique until convergence was achieved.
2.5. Evaluation Metrics
To explore the utility of the PPO algorithm in a drug design setting, we conducted a case study on biological activity prediction focusing on putative inhibitors of JAK2. Specifically, we designed molecules to modulate JAK2 activity with the goal of minimizing or maximizing negative logarithm of half maximal inhibitory concentration (
) values. While most drug discovery studies focus on identifying molecules with enhanced activity, minimizing bioactivity is also crucial to mitigate off-target effects. Therefore, we aimed to explore our system's capability to bias the design of novel molecular structures toward any desired range of target properties. Janus Kinase 2 (JAK2) is a key non-receptor tyrosine kinase involved in cytokine signaling. Mutations in JAK2 are linked to myeloproliferative neoplasms, including polycythemia vera, essential thrombocythemia, and myelofibrosis [
20]. JAK2 inhibitors have shown significant promise in treating these conditions [
21].
The evaluation of generated molecules is crucial to ensure their potential efficacy and safety in drug development. Six specific metrics were chosen: molecular validity, molecular diversity, molecular weight distribution, biological activity distribution, functional group analysis, and pharmacophore features. These metrics provide a comprehensive assessment of the generated molecules from various critical perspectives.
Molecular validity ensures that the generated molecules conform to the basic rules of chemical structure, including valence rules and chemical stability, which is essential for further biological testing and development [
22]. Molecular diversity measures the variety within the set of generated molecules, indicating a broader exploration of chemical space, which is important for identifying novel compounds with unique properties [
23]. The molecular weight distribution assesses the distribution of molecular weights in the generated set, ensuring that the molecules fall within a desirable range for drug-like properties, as extreme weights (either too high or too low) can affect bioavailability and drug-likeness [
24]. Biological activity distribution evaluates the predicted biological activities of the molecules, helping to identify compounds that are likely to be effective against the target, ensuring that the generated set includes potent candidates [
25]. Functional group analysis examines the presence and frequency of functional groups within the molecules, this method is crucial for the interaction of drugs with biological targets, and their appropriate representation can influence the pharmacokinetics and pharmacodynamics of the molecules [
26]. Pharmacophore features analyze the presence of essential pharmacophore features required for the interaction with the biological target, ensuring that the generated molecules have the necessary structural motifs to engage effectively with the target, increasing the likelihood of biological efficacy [
27].