Modeling an Inverted Pendulum via Differential Equations and Reinforcement Learning Techniques

Differential equations have long underpinned control theory and constrained optimization, providing the ability to accurately model unstable and even chaotic systems. In recent research, however, systems of interest are increasingly nonlinear and difficult to model with differential equations alone. A newer approach is to use policy iteration and Reinforcement Learning (RL), techniques centered on an action-and-reward sequence for a controller. RL can be applied to control theory problems because an agent can act robustly in a dynamic environment such as the cartpole system (an inverted pendulum). This approach avoids PID controllers and other dynamics-optimization schemes in favor of a more robust, reward-based control mechanism. This paper applies RL and Q-learning to the classic cartpole problem, while also discussing the mathematical background and the differential equations used to model the system.


I. Introduction to the Cartpole Problem
The cartpole problem is a classic problem in dynamics and control theory: a pendulum has its center of gravity above its pivot point. This makes the system naturally unstable, and without an applied force or dynamic control the pendulum will fall and come to rest hanging vertically downward. The cart has a single degree of freedom along its horizontal axis, and the system has no vertical movement. The goal of most cartpole systems is to keep the pole balanced by applying forces to the cart along its axis of movement (the horizontal direction).

II. Differential Equations for Modeling the Cartpole System
We may derive the equations governing the cartpole system with cart mass $M$, pole mass $m$, angle $\theta$ (measured from the upright vertical), length $l$, and force $f$. We begin by utilizing Newton's Second Law of Motion. This is preferable since it avoids the mathematics involved in Lagrange's equations while also providing the reaction forces between the pendulum and cart at their joint. Let $R_x$ and $R_y$ represent the reaction forces and $F_N$ the normal force applied to the cart. For the cart, in the $x$ and $y$ axes respectively:

$$M\ddot{x} = f - R_x, \qquad F_N - Mg - R_y = 0.$$

We may use the first equation to solve for the horizontal reaction force. However, we must first note the acceleration of the point mass of the system. Its position vector can be described as

$$\mathbf{r}_p = (x + l\sin\theta)\,\hat{\imath} + (l\cos\theta)\,\hat{\jmath}.$$

Now, we take the derivative twice to obtain the acceleration vector in the inertial reference frame:

$$\mathbf{a}_p = \big(\ddot{x} + l\ddot{\theta}\cos\theta - l\dot{\theta}^{2}\sin\theta\big)\,\hat{\imath} - \big(l\ddot{\theta}\sin\theta + l\dot{\theta}^{2}\cos\theta\big)\,\hat{\jmath}.$$

We may now solve for the reaction forces independently via Newton's Second Law:

$$R_x = m\big(\ddot{x} + l\ddot{\theta}\cos\theta - l\dot{\theta}^{2}\sin\theta\big), \qquad R_y - mg = -m\big(l\ddot{\theta}\sin\theta + l\dot{\theta}^{2}\cos\theta\big).$$

In the event that the applied force on the system is unknown, we can use the first equation ($R_x$) to solve for it. Substituting $R_x$ into the cart equation yields

$$(M + m)\ddot{x} + ml\ddot{\theta}\cos\theta - ml\dot{\theta}^{2}\sin\theta = f.$$

By observation, this equation is equal to the first equation produced by Lagrange's system of equations for an inverted pendulum. To obtain the second equation that governs the system, we must dot the pendulum's equation of motion with an orthonormal unit vector of the body frame; in this 2-D setting we use $\hat{x}_B = \cos\theta\,\hat{\imath} - \sin\theta\,\hat{\jmath}$, perpendicular to the rod. These equations assume that the rod connecting the pole to the cart is massless, so the only forces on the point mass are gravity and the tension $T$ directed along the rod; dotting with $\hat{x}_B$ therefore eliminates the tension, relating the inertial-frame and body-frame components of the reaction forces. (To solve for the tension in the rod, we may instead dot with the unit vector along the rod, $\hat{y}_B = \sin\theta\,\hat{\imath} + \cos\theta\,\hat{\jmath}$, which gives $T = m(l\dot{\theta}^{2} - \ddot{x}\sin\theta - g\cos\theta)$, negative when the rod is in compression.) The left-hand side is the gravitational force dotted with $\hat{x}_B$:

$$\text{LHS:} \quad (-mg\,\hat{\jmath})\cdot\hat{x}_B = mg\sin\theta.$$

We may now solve for the right-hand side of the equation by dotting $\hat{x}_B$ with the mass times the acceleration of the pendulum:

$$\text{RHS:} \quad m\,\mathbf{a}_p\cdot\hat{x}_B = m\big(\ddot{x}\cos\theta + l\ddot{\theta}\big).$$

Equating the LHS and RHS and dividing by $m$, we obtain

$$l\ddot{\theta} - g\sin\theta = -\ddot{x}\cos\theta.$$

This equation models the control of the cartpole system for the angle $\theta$, length $l$, and displacement $x$. The result is the same as that obtained from the Lagrangian system of equations.
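As a sanity check on the derivation above, both governing equations can be reproduced symbolically. The following is a minimal sketch using Python's SymPy library (the symbol names are chosen here for illustration):

```python
import sympy as sp

t, M, m, l, g, f = sp.symbols('t M m l g f')
x = sp.Function('x')(t)          # cart displacement
theta = sp.Function('theta')(t)  # pole angle from upright vertical

# Position of the pendulum point mass in the inertial frame
xp = x + l * sp.sin(theta)
yp = l * sp.cos(theta)

# Differentiate twice to obtain the acceleration components
ax = sp.diff(xp, t, 2)
ay = sp.diff(yp, t, 2)

# First governing equation: cart equation with R_x substituted,
# (M + m)x'' + m*l*theta''*cos(theta) - m*l*theta'^2*sin(theta) = f
eq1 = sp.Eq(M * sp.diff(x, t, 2) + m * ax, f)

# Second governing equation: dot Newton's law for the point mass with
# x_B = (cos(theta), -sin(theta)), which eliminates the rod tension;
# gravity contributes m*g*sin(theta) on the left-hand side.
eq2 = sp.Eq(m * g * sp.sin(theta),
            m * (ax * sp.cos(theta) - ay * sp.sin(theta)))

print(sp.simplify(eq1))  # (M + m)x'' + m*l*(th''*cos - th'^2*sin) = f
print(sp.simplify(eq2))  # g*sin(theta) = x''*cos(theta) + l*theta''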

III. Analysis of Cartpole Differential Equations
The equation derived earlier captures the mechanics and the variables that control the cartpole system:

$$l\ddot{\theta} - g\sin\theta = -\ddot{x}\cos\theta.$$

Therefore, it can be reasoned that the cartpole system's balance and net force depend primarily on the length $l$, the acceleration due to gravity $g$, and the displacement $x$. In order to keep the system balanced, it is necessary to satisfy this equation, which is linear in the accelerations $\ddot{x}$ and $\ddot{\theta}$ (and becomes fully linear near the upright position, where $\sin\theta \approx \theta$ and $\cos\theta \approx 1$). For the nonlinear system, we may model the full dynamics of the cartpole as the coupled pair

$$(M + m)\ddot{x} + ml\ddot{\theta}\cos\theta - ml\dot{\theta}^{2}\sin\theta = F, \qquad l\ddot{\theta} - g\sin\theta = -\ddot{x}\cos\theta,$$

where $F$ is the input force on the nonlinear system. For the purposes of the classic cartpole problem, we will only discuss further applications of the linear dynamics.
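To make these dynamics concrete, the coupled pair can be solved for $\ddot{x}$ and $\ddot{\theta}$ and integrated numerically. Below is a minimal simulation sketch using SciPy; the parameter values are illustrative assumptions, not values taken from the paper:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Illustrative parameters (assumed, not from the paper)
M, m, l, g = 1.0, 0.1, 0.5, 9.81

def cartpole_dynamics(t, state, F=0.0):
    """Right-hand side of the nonlinear cartpole ODEs.

    state = [x, x_dot, theta, theta_dot], theta measured from upright.
    Solves the coupled pair
        (M + m)x'' + m*l*th''*cos(th) - m*l*th'^2*sin(th) = F
        l*th'' - g*sin(th) = -x''*cos(th)
    for x'' and theta''.
    """
    x, x_dot, th, th_dot = state
    sin, cos = np.sin(th), np.cos(th)
    th_ddot = ((M + m) * g * sin - F * cos - m * l * th_dot**2 * sin * cos) \
              / (l * ((M + m) - m * cos**2))
    x_ddot = (F + m * l * (th_dot**2 * sin - th_ddot * cos)) / (M + m)
    return [x_dot, x_ddot, th_dot, th_ddot]

# Start slightly off the upright equilibrium with no control force:
# the pole falls away, illustrating the instability of the system.
sol = solve_ivp(cartpole_dynamics, (0.0, 2.0), [0.0, 0.0, 0.05, 0.0],
                max_step=0.01)
print(sol.y[2, -1])  # final pole angle (radians)
```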

IV. Control Theory Techniques
In solving the physical cartpole problem, it is common to utilize a feedback control loop with a proportional-integral-derivative (PID) or linear-quadratic-regulator (LQR) controller. These mechanisms are well established in industrial applications and may similarly be applied to the nonlinear cartpole dynamics. Both LQR and PID controllers seek an optimal control signal. However, the complexity of modern systems has increased rapidly, requiring more sophisticated controllers that can respond accurately to parameter variations such as increased noise and nonlinearities in time and space. Thus, we may need more advanced algorithms to robustly handle the dynamics of the traditional cartpole system. We will now look into reinforcement learning algorithms that can effectively and successfully manipulate the dynamics of a cartpole system.
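As a point of reference before moving to RL, a feedback controller of this kind takes only a few lines. The following is a minimal, illustrative PID sketch that regulates the pole angle alone (the gains, time step, and sign convention are assumptions; a practical controller would also regulate cart position):

```python
class PIDController:
    """Minimal PID controller: u = Kp*e + Ki*integral(e) + Kd*de/dt."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, error):
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Illustrative use: drive the pole angle theta toward 0 (upright).
# At each time step, the force applied to the cart is the PID output;
# its sign convention depends on how theta and force are defined.
pid = PIDController(kp=40.0, ki=1.0, kd=5.0, dt=0.02)
theta = 0.05                      # current pole angle (radians)
force = pid.update(0.0 - theta)   # error = setpoint - measurement
print(force)
```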

V. Reinforcement Learning
Reinforcement Learning (RL) is a growing subset of Machine Learning in which software agents take actions in an attempt to maximize some prioritized reward. Several different forms of feedback may govern the methods of an RL system. Compared to supervised learning algorithms, which learn a mapping from inputs to target outputs, RL algorithms typically are not given target outputs (only inputs). There are three elements of a basic RL algorithm: the agent (which chooses to commit to actions in its current state), the environment (which responds to each action and provides new input to the agent), and the reward (the incentive, often cumulative, returned by the environment).

Figure II. Reinforcement Learning Feedback Loop
The broad goal of most RL algorithms is to strike a balance between exploration (trying new actions to gather new data points) and exploitation (using previously captured data). The immediate goal is to maximize the reward, with trials alternating between exploitation and exploration. It is important to note that there are three types of RL implementations: policy-based, value-based, and model-based. Policy-based RL involves finding a policy, i.e., a deterministic or stochastic strategy, that maximizes the cumulative reward. Value-based RL attempts to maximize a value function, $V(s)$. Model-based RL creates a virtual model of a certain environment, and the agent learns to perform within the constraints of that model.
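The exploration-exploitation balance is often implemented with an epsilon-greedy rule: with probability $\epsilon$ the agent explores a random action, and otherwise it exploits its current value estimates. A minimal sketch (the names and values are illustrative):

```python
import random

def epsilon_greedy_action(q_values, epsilon):
    """Pick a random action with probability epsilon (exploration),
    otherwise pick the action with the highest estimated value
    (exploitation). q_values maps each action to its estimate."""
    if random.random() < epsilon:
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)

# Example: two cartpole actions with current value estimates.
q_values = {"push_left": 0.4, "push_right": 0.7}
action = epsilon_greedy_action(q_values, epsilon=0.1)
print(action)  # usually "push_right", occasionally a random choice
```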

VI. Reinforcement Learning for the Cartpole System
Since RL is a form of learning characterized by trial-and-error responses to actions and their effects on the environment, it makes sense to model the cartpole system via RL: the cartpole system is heavily subject to parameter changes while having a clearly defined agent-action-environment-reward schema. The agent is the controller or algorithm which controls the movement of the cart. The action is the physical movement of the cart in response to the applied forces and torques following the swing-up phase. The environment is the physical setting of the cartpole with regard to the constrained area of the system. The reward is the ability of the cartpole to achieve sustained balance in its current state.
We will now identify the specific actions and states of the cartpole problem:

Figure III. Actions for Cartpole problem
The cartpole agent is limited to two possible actions: (1) exert a rightward constant force on the cart, or (2) exert a leftward constant force on the cart. As seen in the diagram, both forces are directed horizontally. These actions change the position of the cart, and therefore the environment, accordingly. The state of the cartpole is fully determined by the position and velocity of the cart, the pole angle $\theta$, and the pole velocity at its tip. All of these parameters were identified earlier as the basis for the differential equation that captures the properties needed to adjust and control the cartpole system.
Each time a force is applied by the controller, the controller checks whether the cumulative reward is achieved or maximized. In the cartpole problem, the angle of the pole with respect to the cart and the distance of the cart from the center determine the value/reward achieved. If the cartpole is generally upright and near the center of the environment, a reward is given and maximized for that sequence. If the reward is not maximized, the controller makes the necessary adjustments to the force and subsequent displacement. It is also important to note that there are two crucial conditions that may terminate or restart the action-environment-reward loop: the pole tips past a specified angle from vertical, or the cart leaves the specified bounds of the track. In both cases, the agent's actions led to a state that did not maximize the reward in the specified environment (the cartpole was either not upright or not within the specified bounds). This can be seen as the "punishment" of the model. On the other hand, each time the reward is gained, the "score" of the cartpole increases by 1.
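These two termination conditions and the per-step reward translate directly into code. A minimal sketch, with the angle and position thresholds chosen as assumed example values in the spirit of the classic formulation:

```python
# Assumed thresholds for illustration (radians and track units)
THETA_LIMIT = 0.21   # pole may tip roughly 12 degrees before failure
X_LIMIT = 2.4        # cart must stay within +/- 2.4 of center

def step_reward(theta, x):
    """Return (reward, done): +1 while the pole is upright and the cart
    is in bounds; otherwise the episode terminates with no reward."""
    failed = abs(theta) > THETA_LIMIT or abs(x) > X_LIMIT
    return (0.0, True) if failed else (1.0, False)

print(step_reward(0.05, 0.3))   # (1.0, False): balanced, episode continues
print(step_reward(0.30, 0.3))   # (0.0, True): pole fell, episode restarts
```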
Now that we have discussed the physical properties of the cartpole system, the mathematical model used to solve it, and the application of RL to the cartpole, we may move on to analyze two algorithms that allow for effective control and stabilization of the cartpole.

VII. Q-Learning
We know, based on the possible states of the cartpole problem, that if we make the right decision, the cartpole will stay upright and balanced. Thus, we can identify the state-action pairs which lead to higher reward in the cartpole system. We can model each pair as a function giving the expected reward, $Q(s, a)$; this value is known as a Q-value. The goal of Q-learning, an RL algorithm, is to find this function $Q(s, a)$ while applying it iteratively to the future state $s'$. The initial Q-learning function can be represented as

$$Q(s, a) = r,$$

the immediate reward for taking action $a$ in state $s$. After obtaining some reward $r$ by making an action $a$, we reach the next state $s'$. Upon reaching $s'$, the agent performs a new action $a'$ with regard to the reward. The weight we wish to place on the next reward is the discount factor $\gamma$. Thus, we update the equation as follows:

$$Q(s, a) = r + \gamma \max_{a'} Q(s', a').$$
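The update above is short to implement in tabular form. The sketch below uses the standard learning-rate variant of the update and assumes the continuous cartpole state has already been discretized into integer bucket indices (the discretization and hyperparameters are illustrative):

```python
from collections import defaultdict

ALPHA = 0.1       # learning rate for the iterative update
GAMMA = 0.99      # discount factor: weight placed on future reward
ACTIONS = (0, 1)  # 0 = push left, 1 = push right

# Q-table mapping (discretized state, action) -> Q-value
Q = defaultdict(float)

def q_update(state, action, reward, next_state):
    """One Q-learning step: move Q(s, a) toward
    r + gamma * max_a' Q(s', a')."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    target = reward + GAMMA * best_next
    Q[(state, action)] += ALPHA * (target - Q[(state, action)])

# Example transition between two discretized states:
q_update(state=(3, 5, 4, 6), action=1, reward=1.0, next_state=(3, 5, 5, 6))
print(Q[((3, 5, 4, 6), 1)])  # 0.1 after one update from zero
```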

VIII. Application of Deep Q-Learning to the Cartpole System
Based on analysis of the elements of Q-learning, it is evident that the cartpole system can be effectively modeled using Q-learning. The cartpole problem has a state space of 4 continuous dimensions $(x, \dot{x}, \theta, \dot{\theta})$ and an action space of 2 discrete values (move right or left). However, in typical Q-learning, we must store a distinct table entry for every slight change in the angle or position of the cartpole, which would require extreme memory capacity. Thus, to apply Q-learning to balancing the cartpole system, we instead approximate the model-free function $Q(s, a)$, where the input is a state-action pair $(s, a)$ and the output is some expected reward. Approximating the $Q(s, a)$ function with a deep neural network is known as a Deep Q-Network (DQN), which is more robust against parameter variations and frequent changes in the state. This technique follows the same process as Q-learning, but utilizes a deep neural network to compute $Q(s, a)$ from a trained network of nodes:

Figure VI. DQN for the cartpole system
As seen in the diagram above, the DQN uses the current state of the cartpole to calculate the expected reward and next action, returning a $Q(s, a)$ for both movement to the right and movement to the left. The DQN needs to be supplemented with a loss function. We know that the updated Q-learning equation already calculates a value for $Q(s, a)$:

$$Q(s, a) = r + \gamma \max_{a'} Q(s', a').$$

Thus, it is essential to have a loss function that minimizes the error between the approximation from the DQN and the target $Q(s, a)$ obtained from this equation. In summary, it is best to think of the overall process behind Q-learning and DQN as a "controlled trial and error" that seeks to approximate the expected reward $Q(s, a)$. Q-learning makes use of the updated Q-function, which makes iterative adjustments between discrete state-action pairs. DQN avoids the memory overuse that occurs in Q-learning with nearly infinite state-action pairs, in favor of a neural network that approximates the expected reward from previously seen continuous state-action pairs.
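A network of the shape in the figure, together with the loss just described, can be sketched briefly. The example below uses PyTorch purely as an illustration; the layer sizes are assumptions, and a complete implementation would typically add a separate target network:

```python
import torch
import torch.nn as nn

GAMMA = 0.99

# Small feed-forward network: 4 state inputs -> one Q-value per action
q_net = nn.Sequential(
    nn.Linear(4, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 2),  # outputs: Q(s, push_left), Q(s, push_right)
)

def dqn_loss(state, action, reward, next_state, done):
    """Mean squared error between the network's Q(s, a) and the
    target r + gamma * max_a' Q(s', a') from the Q-learning update."""
    q_sa = q_net(state).gather(1, action.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = reward + GAMMA * q_net(next_state).max(1).values * (1 - done)
    return nn.functional.mse_loss(q_sa, target)

# Example with a batch of one transition:
s = torch.tensor([[0.0, 0.1, 0.05, -0.2]])
s2 = torch.tensor([[0.002, 0.15, 0.046, -0.25]])
loss = dqn_loss(s, torch.tensor([1]), torch.tensor([1.0]), s2,
                torch.tensor([0.0]))
print(loss.item())
```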

IX. Training a DQN for the Cartpole System
DQNs are commonly used for the cartpole problem, and we may now examine the implementation of a DQN that maximizes the reward of the cartpole (the reward being the controller's ability to both balance and control the cartpole). It is first important to note that we can summarize the state-action-reward-state structure and environment of the cartpole system as a tuple $(S, A, R, P, \rho_0)$, where $S$ is the set of states, $A$ is the set of actions, $R$ is the reward function, $P$ gives the transition probabilities, and $\rho_0$ is the initial state distribution. The reward function awards $+1$ for every timestep in which the pole remains upright and the cart remains in bounds. This formulation of the cartpole system as a tuple is known as a Markov Decision Process (MDP). An MDP provides a method for selecting an action $a_t$ given a state $s_t$; we then observe $s_{t+1}$ and $r_{t+1}$ based on the transition probabilities $P$. MDPs also provide techniques for helping agents find the optimal long-run policy within the specified environment.

Most implementations of DQNs for this problem use small feed-forward neural networks, sometimes with batch normalization, making iterative adjustments toward $Q(s, a)$. After a DQN is implemented, it is necessary to train the model. Training simply acts as a learning phase for the cartpole system in its environment; the training stage of any RL model is similar to the commonplace example of a child learning to walk. There are a few phases of learning that the cartpole must pass through before it can effectively balance both the angle and the position of the system: 1.) learning to balance the pole alone; 2.) staying in bounds; 3.) staying in bounds but unable to balance the pole; 4.) staying in bounds while effectively balancing the pole. The ultimate goal is to solve the environment as quickly as possible (in the fewest steps/episodes). As the model trains on state-action pairs, it gradually improves the number of steps needed to solve. The figure below demonstrates a sample training run for a DQN cartpole; as seen in the graph, the cartpole eventually reduces the number of steps needed to solve the environment.
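Putting the pieces together, a training loop for such a DQN might look like the following sketch. It assumes the Gymnasium CartPole-v1 environment and reuses the q_net and dqn_loss from the previous sketch; the replay buffer size, batch size, and epsilon schedule are illustrative assumptions:

```python
import random
from collections import deque

import gymnasium as gym
import torch

env = gym.make("CartPole-v1")
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
buffer = deque(maxlen=10_000)  # replay buffer of past transitions
epsilon = 1.0                  # exploration rate, decayed per episode

for episode in range(300):
    state, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection over the two cartpole actions
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            with torch.no_grad():
                action = int(q_net(torch.tensor(state).unsqueeze(0)).argmax())
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        buffer.append((state, action, reward, next_state, float(terminated)))
        state = next_state

        # Learn from a random minibatch of stored transitions
        if len(buffer) >= 64:
            batch = random.sample(buffer, 64)
            s, a, r, s2, d = (torch.tensor(x, dtype=torch.float32)
                              for x in map(list, zip(*batch)))
            loss = dqn_loss(s, a.long(), r, s2, d)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    epsilon = max(0.05, epsilon * 0.99)  # shift exploration -> exploitation
```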

X. Conclusions
Instead of using a PID or LQR controller to manage the cartpole problem, we can apply the RL algorithms discussed above to manage the forces and torques experienced by the cartpole system. We successfully derived the equation for a cartpole system that does not require swing-up:

$$l\ddot{\theta} - g\sin(\theta) = -\ddot{x}\cos(\theta).$$

Q-learning and DQN can successfully model the trial-and-error process of trying to balance the cartpole (i.e., satisfy the equation above). These methods use state-action pairs to both calculate and approximate the expected reward $Q(s, a)$ of the specified environment. We can even implement a DQN that trains on MDP state-action pairs to find the optimal solution to the cartpole system. In the future, it may be worthwhile to investigate a more robust RL solution to the classic cartpole problem; other algorithms of interest include Monte Carlo methods, SARSA, and Actor-Critic Policy Gradient. Moreover, it may be of interest to identify an RL algorithm that can incorporate the swing-up phase of the cartpole system.