1. Introduction
The enhancement of precision for robot arm manipulation remains a core research area for achieving full autonomy of robots in sectors such as industrial manufacturing and assembly. The next logical progression in this field is to achieve complete autonomy of robotic manipulators through machine learning (ML), artificial neural networks (ANNs), and artificial intelligence (AI) as a whole, given the maturity of image recognition and vision systems [1,2]. Numerous attempts have been made to create intelligent robots that can take tasks and execute them accordingly [3,4,5,6,7,8,9,10].
Although developing a system with intelligence close to that of humans is still a long way off, robots that can perform specialized autonomous activities have already been realized, such as intelligent facial emotion recognition [11], flying in natural and man-made environments [12], driving a vehicle [13], swimming [14], carrying boxes and material over different terrains [15], and picking up and placing objects [16,17].
However, several challenges must be overcome to achieve this goal. For instance, the complexity of mapping from Cartesian space to the joint space of a robot arm increases with the number of joints and linkages of the manipulator. This is problematic because the tasks assigned to a robotic arm are specified in Cartesian space, whereas the commands (velocity or torque) are issued in joint space [18,19]. Therefore, if full autonomy of robotic manipulators is the objective, the target-reaching problem is probably one of the most crucial issues that must be addressed.
Reinforcement learning, as described in [20,21], is a type of machine learning that aims to maximize the outcome of a given system through a dynamic and autonomous trial-and-error approach. It shares an objective with human intelligence, which is characterized by the ability to perceive and retain information as knowledge to be used for environment-adaptive behavior. Central to the reinforcement learning framework are trial-and-error search and delayed rewards, which allow the learning strategy to interact with the environment by performing actions and discovering rewards [22]. Through this approach, software agents and machines can automatically select the most effective course of action in a given circumstance, thus improving performance. Reinforcement learning offers a framework and a set of tools for designing sophisticated and difficult-to-engineer behaviors in robotics [23,24]. In turn, the challenges posed by robotic problems serve as motivation, impact, and confirmation of advances in reinforcement learning. Multiple previous works on the application of reinforcement learning in robotics illustrate this fact [25,26,27,28,29].
2. Modeling of Robotic Arm
2.1. Direct Kinematic Model of Robot Arm
The rotation matrices in the DH coordinate frame represent rotations about the X and Z axes. The rotation matrices for these axes are, respectively, given as:

$$R_x(\alpha)=\begin{bmatrix}1 & 0 & 0\\ 0 & \cos\alpha & -\sin\alpha\\ 0 & \sin\alpha & \cos\alpha\end{bmatrix}$$

and

$$R_z(\theta)=\begin{bmatrix}\cos\theta & -\sin\theta & 0\\ \sin\theta & \cos\theta & 0\\ 0 & 0 & 1\end{bmatrix}$$

The homogeneous transformation matrix ($T_i^{i-1}$) that accounts for both rotation and translation is given as:

$$T_i^{i-1}=\begin{bmatrix}R_i^{i-1} & P_i^{i-1}\\ 0_{1\times 3} & 1\end{bmatrix}
=\begin{bmatrix}
\cos\theta_i & -\sin\theta_i\cos\alpha_i & \sin\theta_i\sin\alpha_i & a_i\cos\theta_i\\
\sin\theta_i & \cos\theta_i\cos\alpha_i & -\cos\theta_i\sin\alpha_i & a_i\sin\theta_i\\
0 & \sin\alpha_i & \cos\alpha_i & d_i\\
0 & 0 & 0 & 1
\end{bmatrix}$$

where:
The rotation matrix $R_i^{i-1}$ represents the orientation of the $i$-th frame relative to the $(i-1)$-th frame.
$P_i^{i-1}$ represents the origin of the link frame with components ($p_x$, $p_y$, and $p_z$).
2.1.1. DH Axis Representation
The four DH parameters describe the translation and rotation relationship between two consecutive coordinate frames as follows:
$d$: the distance between the current frame and the previous frame along the previous Z-axis;
$\theta$: the angle between the X-axis of the previous frame and the X-axis of the current frame, measured about the previous Z-axis;
$a$: the distance between the Z-axes of the current and previous frames, measured along the common normal (the current X-axis);
$\alpha$: the angle between the Z-axis of the previous frame and the Z-axis of the current frame, measured about the current X-axis.
The DH parameters for the Franka Panda robot, shown in Figure 1, are given in Table 1. From these DH parameters and the homogeneous transformation matrix given above, the transformation matrix for the Franka Panda robot is derived by chaining the per-joint transforms:

$$T_{EE}^{0}=\prod_{i=1}^{7} T_i^{i-1}(q_i)$$
The orientation and position of the end-effector are given, respectively, by the upper-left $3\times 3$ rotation block $R_{EE}^{0}$ and the upper-right $3\times 1$ translation vector $P_{EE}^{0}$ of $T_{EE}^{0}$.
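As an illustration of how the chained DH transforms can be evaluated numerically, the following sketch computes the end-effector pose from a DH table. It is a minimal NumPy example, not the authors' implementation; the placeholder DH values are hypothetical and must be replaced with the actual entries of Table 1, using the same (standard or modified) DH convention adopted there.

```python
import numpy as np

def dh_transform(theta, d, a, alpha):
    """Standard DH homogeneous transform from frame i-1 to frame i."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([
        [ct, -st * ca,  st * sa, a * ct],
        [st,  ct * ca, -ct * sa, a * st],
        [0.0,      sa,       ca,      d],
        [0.0,     0.0,      0.0,    1.0],
    ])

def forward_kinematics(q, dh_table):
    """Chain the per-joint transforms: T_EE^0 = T_1^0 @ T_2^1 @ ... @ T_7^6."""
    T = np.eye(4)
    for q_i, (d, a, alpha) in zip(q, dh_table):
        T = T @ dh_transform(q_i, d, a, alpha)
    return T

# Placeholder (d, a, alpha) rows -- replace with the Franka Panda values from Table 1.
dh_table = [(0.333, 0.0, 0.0)] + [(0.0, 0.0, -np.pi / 2)] * 6
T = forward_kinematics(np.zeros(7), dh_table)
R_ee, p_ee = T[:3, :3], T[:3, 3]   # end-effector orientation and position
```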
2.2. Incremental Inverse Kinematics of Robot Arm
The 3D position vector $x \in \mathbb{R}^{3}$ of the end-effector (EE) is given by the direct kinematics:

$$x = f(q)$$

In the case of the Panda robot, there are $n = 7$ active joint angles, $q = (q_1, \ldots, q_7)^{T}$. With these joint angles $q$ and the direct kinematics $f(q)$, the goal is to solve the inverse kinematics $q = f^{-1}(x)$ using incremental inverse kinematics. In incremental inverse kinematics, the direct kinematics is linearized around the current joint angle configuration $q_0$ as

$$f(q_0 + \Delta q) \approx f(q_0) + J(q_0)\,\Delta q.$$

The goal is to find the change in joint angles ($\Delta q$) that corresponds to a desired change in the end-effector position ($\Delta x$). The linearized form of the direct kinematics is expressed as $\Delta x \propto \Delta q$, where the proportionality symbol ($\propto$) indicates that the change in end-effector position is directly related to the change in joint angles. To solve for the change in joint angles, the Jacobian matrix, denoted as $J(q)$, is used. The Jacobian is the matrix of partial derivatives that describes how the end-effector position $x$ depends on the joint angles $q$. Specifically,

$$J(q) = \frac{\partial f(q)}{\partial q} \in \mathbb{R}^{3\times 7}.$$

Multiplying the Jacobian $J(q)$ by the change in joint angles $\Delta q$ yields an approximation of the change in the end-effector position, $\Delta x \approx J(q)\,\Delta q$. The joint angles are then updated iteratively to minimize the difference between the current end-effector position and the desired end-effector position. Each update can be solved efficiently around $q_0$ using the Jacobian, e.g. $\Delta q = J^{+}(q_0)\,\Delta x$, with $J^{+}$ the Moore–Penrose pseudoinverse.
2.2.1. Steps of Incremental Inverse Kinematics
Given: target position $x_{\text{target}}$.
Required: joint angles $q$ such that $f(q) = x_{\text{target}}$.
1. Define the starting configuration $q_0$ and set up the incremental inverse kinematics described above.
2. Determine the deviation relative to the target pose, e.g. $\Delta x = x_{\text{target}} - f(q_k)$.
3. Check for termination, e.g. $\lVert \Delta x \rVert < \varepsilon$.
4. Calculate the new joint angles, e.g. $q_{k+1} = q_k + J^{+}(q_k)\,\Delta x$, and repeat from step 2. A code sketch of this loop is given below.
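The sketch below is a minimal implementation of the loop above. It assumes a position-level forward-kinematics function f(q) (such as the one sketched in Section 2.1) and uses the pseudoinverse of a finite-difference Jacobian; an analytical Jacobian or a damped least-squares step could be substituted.

```python
import numpy as np

def numerical_jacobian(f, q, eps=1e-6):
    """Finite-difference Jacobian of the position-level forward kinematics f(q)."""
    x0 = f(q)
    J = np.zeros((x0.size, q.size))
    for i in range(q.size):
        dq = np.zeros_like(q)
        dq[i] = eps
        J[:, i] = (f(q + dq) - x0) / eps
    return J

def incremental_ik(f, q0, x_target, tol=1e-4, max_iters=200, step=0.5):
    """Iterate q <- q + step * J^+ dx until the end-effector reaches x_target."""
    q = q0.copy()
    for _ in range(max_iters):
        dx = x_target - f(q)              # deviation from the target position
        if np.linalg.norm(dx) < tol:      # termination check
            break
        J = numerical_jacobian(f, q)
        q = q + step * np.linalg.pinv(J) @ dx   # new joint angles
    return q
```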
3. Deep Reinforcement Learning Algorithm Design
3.1. Policy Gradient Algorithm
The policy gradient theorem states that the gradient of the expected return with respect to the policy parameters can be calculated as the sum, over all states $s$ and actions $a$, of the action-value function $Q^{\pi}(s,a)$ multiplied by the gradient of the policy function $\pi_\theta(a \mid s)$, weighted by the stationary distribution of states $d^{\pi}(s)$:

$$\nabla_\theta J(\theta) \propto \sum_{s} d^{\pi}(s) \sum_{a} Q^{\pi}(s,a)\,\nabla_\theta \pi_\theta(a \mid s)$$
where:
Objective function ($J(\theta)$): represents the expected cumulative reward obtained by following the policy in the given environment. The objective function is optimized by adjusting the parameter $\theta$ to maximize the expected cumulative reward.
Discounted state distribution ($d^{\pi}(s)$): represents the probability of being in a particular state $s$ under the policy $\pi_\theta$. Mathematically, $d^{\pi}(s) = \lim_{t \to \infty} P(s_t = s \mid s_0, \pi_\theta)$, where $P(s_t = s \mid s_0, \pi_\theta)$ is the probability that $s_t = s$ when starting from $s_0$ and following policy $\pi_\theta$ for $t$ time steps.
Action-value function ($Q^{\pi}(s,a)$): represents the expected cumulative reward obtained by taking action $a$ in state $s$ and following the policy thereafter.
Policy function ($\pi_\theta(a \mid s)$): represents the probability of taking action $a$ in state $s$ under the parameterized policy $\pi_\theta$.
The policy gradient theorem solves this problem by providing a formula for the gradient of the expected return with respect to the policy parameters. This formula involves the stationary distribution of the Markov chain and is given by:

$$\nabla_\theta J(\theta) \propto \sum_{s} d^{\pi}(s) \sum_{a} Q^{\pi}(s,a)\,\nabla_\theta \pi_\theta(a \mid s),$$

where $Q^{\pi}(s,a)$ is the state-action value function for policy $\pi_\theta$.
3.1.1. Derivation of Policy Gradient Theorem
The derivation starts from the state-value function $V^{\pi}(s) = \sum_{a} \pi_\theta(a \mid s)\, Q^{\pi}(s,a)$ and expands its gradient using the derivative product rule. The steps are as follows:

Step 1: Apply the derivative operator to the state-value function:

$$\nabla_\theta V^{\pi}(s) = \nabla_\theta \left( \sum_{a} \pi_\theta(a \mid s)\, Q^{\pi}(s,a) \right)$$

Step 2: Distribute the derivative operator inside the summation:

$$\nabla_\theta V^{\pi}(s) = \sum_{a} \nabla_\theta \left( \pi_\theta(a \mid s)\, Q^{\pi}(s,a) \right)$$

Step 3: Apply the product rule to differentiate the product of $\pi_\theta(a \mid s)$ and $Q^{\pi}(s,a)$ with respect to $\theta$:

$$\nabla_\theta V^{\pi}(s) = \sum_{a} \left( \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi}(s,a) + \pi_\theta(a \mid s)\, \nabla_\theta Q^{\pi}(s,a) \right)$$

Step 4: Simplify the expression by rearranging the terms. After applying the derivative product rule, the derivative of the state-value function with respect to the policy parameter $\theta$ is:

$$\nabla_\theta V^{\pi}(s) = \sum_{a} \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi}(s,a) + \sum_{a} \pi_\theta(a \mid s)\, \nabla_\theta Q^{\pi}(s,a)$$
Next, $\nabla_\theta Q^{\pi}(s,a)$ is extended by incorporating the future state value. This is done by considering the state-action pair $(s,a)$ and summing over all possible future states $s'$ and corresponding rewards $r$:

$$\nabla_\theta Q^{\pi}(s,a) = \nabla_\theta \sum_{s', r} P(s', r \mid s, a)\left( r + V^{\pi}(s') \right)$$

Since $P(s', r \mid s, a)$ and $r$ are not functions of $\theta$, the derivative operator $\nabla_\theta$ can be moved inside the summation over $s'$ and $r$ without affecting these terms.

Next, it is observed that $\nabla_\theta V^{\pi}(s')$ is the derivative of the state-value function with respect to the policy parameter $\theta$ at state $s'$. The expression can therefore be rewritten as $\sum_{s'} P(s' \mid s, a)\, \nabla_\theta V^{\pi}(s')$, where $P(s' \mid s, a)$ represents the probability of transitioning to state $s'$ given the current state-action pair $(s,a)$. The following substitution is now made:

$$\nabla_\theta V^{\pi}(s) = \sum_{a} \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi}(s,a) + \sum_{a} \pi_\theta(a \mid s) \sum_{s'} P(s' \mid s, a)\, \nabla_\theta V^{\pi}(s'),$$

where $P(s' \mid s, a) = \sum_{r} P(s', r \mid s, a)$.
Consider the following visitation sequence, and denote the probability of moving from state $s$ to state $x$ under policy $\pi_\theta$ after $k$ steps as $\rho^{\pi}(s \rightarrow x, k)$:

$$s \xrightarrow{a \sim \pi_\theta(\cdot \mid s)} s' \xrightarrow{a' \sim \pi_\theta(\cdot \mid s')} s'' \rightarrow \cdots$$
Having defined the probability of transitioning from state $s$ to state $x$ after $k$ steps, the next step is to derive a recursive formulation for $\nabla_\theta V^{\pi}(s)$. To accomplish this, a function $\phi(s)$ is introduced, defined as:

$$\phi(s) = \sum_{a} \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi}(s,a)$$

Here, $\phi(s)$ represents the sum of the gradients of the policy with respect to $\theta$, weighted by the corresponding action-value function $Q^{\pi}(s,a)$. Considering $s'$ as the middle point of a two-step transition from $s$ to $s''$, the transition probabilities compose as $\rho^{\pi}(s \rightarrow s'', 2) = \sum_{s'} \rho^{\pi}(s \rightarrow s', 1)\, \rho^{\pi}(s' \rightarrow s'', 1)$. The expression for $\nabla_\theta V^{\pi}(s)$ can then be unrolled as follows:

$$\nabla_\theta V^{\pi}(s) = \phi(s) + \sum_{s'} \rho^{\pi}(s \rightarrow s', 1)\, \nabla_\theta V^{\pi}(s') = \sum_{x \in \mathcal{S}} \sum_{k=0}^{\infty} \rho^{\pi}(s \rightarrow x, k)\, \phi(x)$$
Eliminating the derivatives of the Q-value function and inserting this result into the objective function $J(\theta) = V^{\pi}(s_0)$, starting from state $s_0$:

$$\nabla_\theta J(\theta) = \nabla_\theta V^{\pi}(s_0) = \sum_{s} \sum_{k=0}^{\infty} \rho^{\pi}(s_0 \rightarrow s, k)\, \phi(s)$$

Let

$$\eta(s) = \sum_{k=0}^{\infty} \rho^{\pi}(s_0 \rightarrow s, k).$$

Substituting $\eta(s)$ into the expression above gives

$$\nabla_\theta J(\theta) = \sum_{s} \eta(s)\, \phi(s).$$

Normalizing $\eta(s)$ to be a probability distribution, as shown below:

$$\nabla_\theta J(\theta) = \left( \sum_{s} \eta(s) \right) \sum_{s} \frac{\eta(s)}{\sum_{s'} \eta(s')}\, \phi(s)$$

Since $\sum_{s} \eta(s)$ is a constant, the gradient of the objective function is proportional to the normalized $\eta(s)$ and $\phi(s)$:

$$\nabla_\theta J(\theta) \propto \sum_{s} d^{\pi}(s) \sum_{a} \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi}(s,a),$$

where $d^{\pi}(s) = \eta(s) / \sum_{s'} \eta(s')$ is the stationary distribution. In the episodic case, the constant of proportionality $\sum_{s} \eta(s)$ is the average length of an episode; in the continuing case, it is one (1) [31]. In expectation form,

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi}\!\left[ Q^{\pi}(s,a)\, \nabla_\theta \ln \pi_\theta(a \mid s) \right]$$

when the distribution of states and actions follows policy $\pi_\theta$ (on-policy).
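The expectation form above is what Monte-Carlo policy gradient methods estimate from samples by replacing $Q^{\pi}(s,a)$ with the sampled return $G_t$. The sketch below is illustrative only; it assumes a PyTorch discrete-action policy network with hypothetical layer sizes and shows a surrogate loss whose gradient is the score-function estimate.

```python
import torch
import torch.nn as nn

# Hypothetical policy network: 4-dimensional state, 2 discrete actions.
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))

def reinforce_loss(states, actions, returns):
    """Surrogate loss whose gradient estimates
    grad J(theta) = E_pi[ G_t * grad log pi_theta(a_t | s_t) ]."""
    log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
    # Minimizing the negative weighted log-likelihood ascends the policy gradient.
    return -(returns * log_probs).mean()
```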
3.1.2. Off-Policy Policy Gradient
Since DDPG is an off-policy policy gradient algorithm, off-policy policy gradients are first discussed in detail.
The behavior policy used for collecting samples is known and labeled as $\beta(a \mid s)$. The objective function sums the reward over the state distribution defined by this behavior policy:

$$J(\theta) = \sum_{s} d^{\beta}(s) \sum_{a} Q^{\pi}(s,a)\, \pi_\theta(a \mid s) = \mathbb{E}_{s \sim d^{\beta}}\!\left[ \sum_{a} Q^{\pi}(s,a)\, \pi_\theta(a \mid s) \right],$$

where $d^{\beta}(s)$ is the stationary distribution of the behavior policy $\beta$, and $Q^{\pi}(s,a)$ is the action-value function estimated with respect to the target policy $\pi_\theta$. Given that the training observations are sampled by $a \sim \beta(a \mid s)$, the gradient can be rewritten as:
$$\nabla_\theta J(\theta) = \nabla_\theta\, \mathbb{E}_{s \sim d^{\beta}}\!\left[ \sum_{a} Q^{\pi}(s,a)\, \pi_\theta(a \mid s) \right] = \mathbb{E}_{s \sim d^{\beta}}\!\left[ \sum_{a} \left( Q^{\pi}(s,a)\, \nabla_\theta \pi_\theta(a \mid s) + \pi_\theta(a \mid s)\, \nabla_\theta Q^{\pi}(s,a) \right) \right]$$

by the derivative product rule. By ignoring the second term, $\pi_\theta(a \mid s)\, \nabla_\theta Q^{\pi}(s,a)$, the gradient becomes

$$\nabla_\theta J(\theta) \approx \mathbb{E}_{\beta}\!\left[ \frac{\pi_\theta(a \mid s)}{\beta(a \mid s)}\, Q^{\pi}(s,a)\, \nabla_\theta \ln \pi_\theta(a \mid s) \right],$$

where $\pi_\theta(a \mid s)/\beta(a \mid s)$ is the importance weight. Since $Q^{\pi}$ is a function of the target policy and, consequently, of the policy parameter $\theta$, the derivative $\nabla_\theta Q^{\pi}(s,a)$ would also have to be computed with the product rule. However, computing $\nabla_\theta Q^{\pi}(s,a)$ directly is challenging in practice. Fortunately, by approximating the gradient and ignoring the gradient of $Q^{\pi}$, policy improvement can still be guaranteed, and eventual convergence to the true local optimum is achieved.
In summary, when applying the policy gradient in the off-policy setting, it can be adjusted by a weighted sum, where the weight is the ratio of the target policy to the behavior policy, $\pi_\theta(a \mid s)/\beta(a \mid s)$ [31].
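To make the importance-weighted gradient concrete, the following sketch builds a surrogate loss whose gradient matches the approximation above, with the gradient of $Q^{\pi}$ ignored. It assumes log-probabilities under the target policy $\pi_\theta$ and the behavior policy $\beta$, and critic estimates of $Q^{\pi}$, all supplied as PyTorch tensors; the function name and interface are hypothetical.

```python
import torch

def off_policy_pg_loss(log_pi_theta, log_beta, q_values):
    """Gradient of this loss matches E_beta[(pi_theta/beta) * Q * grad log pi_theta];
    the importance weights and Q-values are treated as constants (grad of Q ignored)."""
    with torch.no_grad():
        weights = torch.exp(log_pi_theta - log_beta)   # importance ratio pi_theta / beta
    return -(weights * q_values.detach() * log_pi_theta).mean()
```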
3.2. Deterministic Policy Gradient (DPG)
The policy function is typically represented as a probability distribution over actions $A$ given the current state, making it inherently stochastic. In the Deterministic Policy Gradient (DPG), however, the policy is modeled as a deterministic decision, $a = \mu_\theta(s)$. Instead of selecting actions probabilistically, DPG directly maps states to specific actions without uncertainty. Let:
$\rho_0(s)$: the initial distribution over states;
$\rho^{\mu}(s \rightarrow s', k)$: starting from state $s$, the visitation probability density at state $s'$ after moving $k$ steps under policy $\mu$;
$\rho^{\mu}(s')$: the discounted state distribution, defined as

$$\rho^{\mu}(s') = \int_{\mathcal{S}} \sum_{k=1}^{\infty} \gamma^{k-1} \rho_0(s)\, \rho^{\mu}(s \rightarrow s', k)\, ds.$$
The objective function to optimize is:

$$J(\theta) = \int_{\mathcal{S}} \rho^{\mu}(s)\, Q\big(s, \mu_\theta(s)\big)\, ds.$$

According to the chain rule, the gradient is obtained by first taking the gradient of $Q$ with respect to the action $a$ and then the gradient of the deterministic policy function $\mu_\theta(s)$ with respect to $\theta$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^{\mu}}\!\left[ \nabla_a Q^{\mu}(s,a)\big|_{a = \mu_\theta(s)}\, \nabla_\theta \mu_\theta(s) \right].$$
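In code, this chain rule is typically realized by letting automatic differentiation backpropagate through the critic into the actor. The following is a minimal sketch under the assumption of PyTorch actor and critic modules with hypothetical names and interfaces.

```python
import torch

def dpg_actor_loss(actor, critic, states):
    """Maximizing Q(s, mu_theta(s)) realizes grad_a Q * grad_theta mu via autograd."""
    actions = actor(states)                  # a = mu_theta(s), differentiable w.r.t. theta
    return -critic(states, actions).mean()   # minimize negative Q -> gradient ascent on J
```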
3.3. Deep Deterministic Policy Gradient (DDPG)
By combining DQN and DPG, DDPG leverages the power of deep neural networks to handle high-dimensional state spaces and complex action spaces, making it suitable for a wide range of reinforcement learning tasks. The original DQN works in discrete action spaces; DDPG extends it to continuous action spaces with the actor-critic framework while learning a deterministic policy. To improve exploration, an exploration policy $\mu'$ is constructed by adding noise $\mathcal{N}$:

$$\mu'(s) = \mu_\theta(s) + \mathcal{N}.$$
Moreover, the DDPG algorithm uses soft updates (also called conservative policy iteration) to update the parameters of the actor and critic target networks. This approach uses a small parameter $\tau$, which is much smaller than 1 ($\tau \ll 1$). The soft update is formulated as

$$\theta' \leftarrow \tau \theta + (1 - \tau)\, \theta'.$$

It guarantees that the target network values change slowly over time, unlike the approach employed in DQN, where the target network is held fixed for a set number of steps.
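A minimal sketch of the soft (Polyak) update, assuming PyTorch modules for the online and target networks:

```python
import torch

@torch.no_grad()
def soft_update(target_net, online_net, tau=0.005):
    """theta_target <- tau * theta + (1 - tau) * theta_target, applied parameter-wise."""
    for p_target, p_online in zip(target_net.parameters(), online_net.parameters()):
        p_target.mul_(1.0 - tau).add_(p_online, alpha=tau)
```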
3.4. Working of DDPG Algorithm
Algorithm 1: Deep Deterministic Policy Gradient.
4. Results and Discussion
4.1. Software Configuration
The deep reinforcement learning agent was trained using Python within a Jupyter Notebook on a Linux Ubuntu 20.04 operating system. The training process spanned approximately 5.415 hours on a modest hardware configuration consisting of an Intel graphics card, 8 GB of RAM, and a 1.9 GHz processor.
4.2. Hyperparameter Selection and Initial Search
The parameters given in Table 2 were selected for this work.
4.2.1. Batch Size Comparison
Upon completing the training process, it was observed that there was minimal disparity in the success rate (Figure 2) and cumulative reward (Figure 3) achieved across the different batch sizes. Also, the decrease in critic loss (Figure 4) and actor loss (Figure 5) indicates an improvement in the actor and critic networks' ability to approximate the optimal policy and value functions. Although the success rate and cumulative reward were similar, the stronger convergence indicated by the lower losses at a batch size of 2048 suggests a more efficient learning process and potentially higher-quality learned policies.
4.2.2. Learning Rate Comparison
After the training process, the observations (Figure 8) revealed minimal disparity in the cumulative reward and success rate achieved with the two learning rates, as shown in Figure 7. However, the learning rate of 2e-4 displayed slightly superior performance compared with 1e-3 in terms of success rate. Conversely, with the learning rate of 1e-3, a notable decrease in actor loss (Figure 9) and critic loss (Figure 10) was observed, indicating improved policy and value estimation by the agent. Despite comparable cumulative rewards and success rates, the reduced losses at the learning rate of 1e-3 signify enhanced convergence and a potentially more efficient learning process, suggesting that the agent may have acquired higher-quality policies.
4.3. Selection of Optimal Hyperparameters and Extended Training of DDPG Agent
After comparing the hyperparameters, as depicted in the preceding figures, and conducting a thorough analysis of the associated results, the selection of hyperparameters with promising performance was undertaken. Following this, an extended training phase was initiated, encompassing 100,000 time steps. This extended training phase serves as the fundamental training stage, which will be elaborated upon in the subsequent section.
Table 3. Parameters and selected hyperparameters.

| Parameter | Value |
| --- | --- |
| Policy | MultiInputPolicy |
| Replay buffer class | HerReplayBuffer |
| Verbose | 1 |
| Gamma | 0.95 |
| Tau (τ) | 0.005 |
| Batch size | 2048 |
| Buffer size | 100000 |
| Replay buffer kwargs | rb kwargs |
| Learning rate | 1e-3 |
| Action noise | Normal action noise |
| Policy kwargs | Policy kwargs |
| Tensorboard log | Log path |
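The hyperparameter names in Table 3 match the Stable-Baselines3 API, so a training run with these settings could look like the sketch below. This is an assumption-laden reconstruction rather than the authors' exact script: the environment ID `PandaReach-v3` (from the panda-gym package), the HER replay-buffer kwargs, the noise scale, and the log path are illustrative placeholders.

```python
import gymnasium as gym
import numpy as np
import panda_gym  # registers the Panda robot environments (assumed environment source)
from stable_baselines3 import DDPG, HerReplayBuffer
from stable_baselines3.common.noise import NormalActionNoise

env = gym.make("PandaReach-v3")                      # assumed target-reaching environment
n_actions = env.action_space.shape[0]
action_noise = NormalActionNoise(mean=np.zeros(n_actions),
                                 sigma=0.1 * np.ones(n_actions))  # placeholder scale

model = DDPG(
    policy="MultiInputPolicy",
    env=env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(n_sampled_goal=4, goal_selection_strategy="future"),  # assumed
    gamma=0.95,
    tau=0.005,
    batch_size=2048,
    buffer_size=100_000,
    learning_rate=1e-3,
    action_noise=action_noise,
    verbose=1,
    tensorboard_log="./ddpg_panda_reach_logs/",      # placeholder log path
)
model.learn(total_timesteps=100_000)
model.save("ddpg_panda_reach")
```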
Table 4. Training metrics at 200 time steps and at 100,000 time steps.

| Category | Value at 200 time steps | Value at 100,000 time steps |
| --- | --- | --- |
| rollout/ | | |
| Episode length | 50 | 50 |
| Episode mean reward | -49.2 | -1.8 |
| Success rate | 0 | 1 |
| time/ | | |
| Episodes | 4 | 2000 |
| FPS | 18 | 5 |
| Time elapsed | 10 | 19505 |
| Total timesteps | 200 | 100000 |
| train/ | | |
| Actor loss | 0.625 | 0.0846 |
| Critic loss | 0.401 | 0.00486 |
| Learning rate | 0.001 | 0.001 |
| Number of updates | 50 | 99850 |
4.3.1. Improvement in Cumulative Reward and Success Rate
The mean episode reward (Figure 11) improves from -49.2 in the first loop to -1.8 in the last loop. The success rate (Figure 12) increases from 0 in the first loop to 1 in the last loop.
4.3.2. Frames per Second (FPS)
The training speed (Figure 13) decreases from 18 frames per second (FPS) in the first loop to 5 FPS in the last loop.
4.3.3. Improvement in Actor and Critic Losses
The actor loss (Figure 14) decreases from 0.625 in the first loop to 0.0846 in the last loop. The critic loss (Figure 15) decreases from 0.401 in the first loop to 0.00486 in the last loop.
4.4. Comparing DDPG and PPO: Off-Policy vs. On-Policy Reinforcement Learning Algorithms
Proximal Policy Optimization (PPO), an on-policy reinforcement learning algorithm, was trained to compare its performance with the Deep Deterministic Policy Gradient (DDPG), an off-policy algorithm, as shown in Figure 16 and Figure 17. As shown in Figure 16, the cumulative reward achieved by DDPG was higher than that obtained by PPO. The results of this comparison indicate that, in this particular scenario, DDPG exhibited superior performance over PPO in terms of cumulative reward.
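For reference, the PPO baseline could be trained under the same assumptions as the DDPG sketch in Section 4.3 (the environment ID and log path remain hypothetical placeholders); PPO does not use a replay buffer or action noise, so its configuration is shorter:

```python
import gymnasium as gym
import panda_gym  # registers the Panda environments (assumed)
from stable_baselines3 import PPO

env = gym.make("PandaReach-v3")                      # assumed target-reaching environment
ppo_model = PPO("MultiInputPolicy", env, verbose=1,
                tensorboard_log="./ppo_panda_reach_logs/")  # placeholder log path
ppo_model.learn(total_timesteps=100_000)
```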
5. Conclusion
In this study, the Deep Deterministic Policy Gradient (DDPG) algorithm is applied to train a robotic arm manipulator, specifically the Franka Panda robotic arm, for a target-reaching task. The objective of this task is to enable the robotic arm to accurately reach a designated target position. The DDPG algorithm is chosen because of its effectiveness in continuous control tasks and its ability to learn policies with high-dimensional action spaces. By leveraging a combination of deep neural networks and an actor-critic architecture, DDPG approximates the optimal policy for the robotic arm. Comparing the performance of PPO and DDPG after training for 100,000 time steps:
PPO achieved a negative mean episode reward, indicating that the agent struggled to achieve positive rewards on average. Despite training at a relatively fast speed, the results suggest that PPO faced challenges in finding successful strategies for the given task.
On the other hand, DDPG demonstrated superior performance, with a mean episode reward of -1.8. It achieved a success rate of 1, indicating consistent success in reaching the desired outcome. Despite a slower training speed of 5 FPS, DDPG showcased its capability to effectively learn and improve its policy over time. Based on these results, DDPG outperformed PPO in terms of cumulative reward and success rate in the given scenario.
Author Contributions
The authors contributions in this manuscript are stated as follows: Conceptualization, L.H. and Y.A.; methodology, L.H.; software, L.H.; validation, A.T., Y.S.J. and S.J.; formal analysis, L.H.; investigation, A.T., Y.S.J.; resources, L.H.; data curation, L.H.; writing—original draft preparation, L.H.; writing—review and editing, Y.A., and A.T., and S.J.; visualization, A.T. and L.H.; supervision, Y.A. and S.J.; project administration, S.J.; funding acquisition, S.J. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by the project for Smart Manufacturing Innovation R&D funded by the Korean Ministry of SMEs and Startups in 2024 (Project No. RS-2024-00434311).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
All data required for this work are available from the authors and can be provided upon request.
Acknowledgments
This work was supported by the project for Smart Manufacturing Innovation R&D funded by the Korean Ministry of SMEs and Startups in 2024 (Project No. RS-2024-00434311).
Conflicts of Interest
The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
References
- Mohsen, S.; Behrooz, A.; Roza, D. Artificial intelligence, machine learning and deep learning in advanced robotics, a review. Cognitive Robotics 2023, 3, 54–70. [Google Scholar]
- Sridharan, M.; Stone, P. Color Learning on a Mobile Robot: Towards Full Autonomy under Changing Illumination. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI); 2007; pp. 2212–2217. [Google Scholar]
- Xinle, Y.; Minghe, Sh.; Lingling, Sh. Adaptive and intelligent control of a dual-arm space robot for target manipulation during the post-capture phase. Aerospace Science and Technology 2023, 142, 108688. [Google Scholar]
- Abayasiri, R. A. M.; Jayasekara, A. G. B. P.; Gopura, R. A. R. C.; Kazuo, K. Intelligent Object Manipulation for a Wheelchair-Mounted Robotic Arm. Journal of Robotics, 2024. [Google Scholar]
- Mohammed, M. A.; Hui, L.; Norbert, St.; Kerstin, Th. Intelligent arm manipulation system in life science labs using H20 mobile robot and Kinect sensor. 2016 IEEE 8th International Conference on Intelligent Systems (IS), Sofia, Bulgaria, 2016. [Google Scholar]
- Ohmura, Y.; Kuniyoshi, Y. Humanoid robot which can lift a 30 kg box by whole body contact and tactile feedback. 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems, San Diego, CA, USA, 2007; pp. 1136–1141. [Google Scholar]
- Li, Z.; Ming, J.; Dewan, F.; Hossain, M.A. Intelligent facial emotion recognition and semantic-based topic detection for a humanoid robot. Expert Systems with Applications 2013, 40, 5160–5168. [Google Scholar]
- Martin, J. G.; Muros, F. J.; Maestre, J. M.; Camacho, E. F. Multi-robot task allocation clustering based on game theory. Robotics and Autonomous Systems 2023, 161, 104314. [Google Scholar] [CrossRef]
- Nguyen, M. N. T. Ba, D. X. A neural flexible PID controller for task-space control of robotic manipulators. Frontiers in Robotics and AI 2023, 9, 975850. [Google Scholar]
- Laurenzi, A.; Antonucci, D.; Tsagarakis, N. G.; Muratore, L. The XBot2 real-time middleware for robotics. Robotics and Autonomous Systems 2023, 163, 104379. [Google Scholar] [CrossRef]
- Zhang, L. Jiang, M., Farid, D., and Hossain, M. A. Intelligent Facial Emotion Recognition and Semantic-Based Topic Detection for a Humanoid Robot. Expert Systems with Applications 2013, 40, 5160–5168. [Google Scholar] [CrossRef]
- Floreano, D. Wood, R. J. Science, Technology, and the Future of Small Autonomous Drones. Nature 2015, 521, 460–466. [Google Scholar] [CrossRef] [PubMed]
- Chen, T. D. , Kockelman, K. M., and Hanna, J. P. Operations of a Shared, Autonomous, Electric Vehicle Fleet: Implications of Vehicle & Charging Infrastructure Decisions. Transportation Research Part A: Policy and Practice, 2016. [Google Scholar]
- Chen, Z. Jia, X., Riedel, A., and Zhang, M. A Bio-Inspired Swimming Robot. 2014 IEEE International Conference on Robotics and Automation (ICRA), 2014; pp. 2564–2564. [Google Scholar]
- Ohmura, Y. and Kuniyoshi, Y. Humanoid Robot Which Can Lift a 30kg Box by Whole Body Contact and Tactile Feedback. 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems; 2007; pp. 1136–1141. [Google Scholar]
- Kappassov, Z.; Corrales, J.-A.; Perdereau, V. Tactile Sensing in Dexterous Robot Hands. Robotics and Autonomous Systems 2015, 74, 195–220. [Google Scholar] [CrossRef]
- Arisumi, H. Miossec, S., Chardonnet, J.-R., and Yokoi, K Dynamic Lifting by Whole Body Motion of Humanoid Robots. 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems; 2008; pp. 668–675. [Google Scholar]
- Aryslan, M. Yevgeniy, L., Troy, H., Richard, P. A deep reinforcement-learning approach for inverse kinematics solution of a high degree of freedom robotic manipulator. Robotics 2022, 11, 44. [Google Scholar] [CrossRef]
- Serhat, O. Enver T., Erkan Z. Adaptive Cartesian space control of robotic manipulators: A concurrent learning based approach. Journal of the Franklin Institute 2024, 361, 106701. [Google Scholar]
- Kaelbling, L. P. Littman, M. L., and Moore, A. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research 1996, 4, 237–285. [Google Scholar] [CrossRef]
- AlMahamid, F.; Grolinger, K. Reinforcement learning algorithms: An overview and classification. 2021 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE); 2021; pp. 1–7. [Google Scholar]
- Thrun, S.; Littman, M. L. Reinforcement Learning: An Introduction. AI Magazine 2000, 21, 103. [Google Scholar]
- Amarjyoti, S. Deep reinforcement learning for robotic manipulation - the state of the art. arXiv:1701.08878, 2017.
- Kober, J.; Bagnell, J. A.; Peters, J. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research 2013, 32, 1238–1274. [Google Scholar]
- Tianci, G. Optimizing robotic arm control using deep Q-learning and artificial neural networks through demonstration-based methodologies: A case study of dynamic and static conditions. Robotics and Autonomous Systems 2024, 104771. [Google Scholar]
- Andrea, Fr., Elisa. Robotic Arm Control and Task Training through Deep Reinforcement Learning. 2020, arXiv:2005.02632v1. [Google Scholar]
- Jonaid, Sh., Michael. Optimizing Deep Reinforcement Learning for Adaptive Robotic Arm Control. 2024, arXiv:2407.02503v1. [Google Scholar]
- Roman, P. , Jakub, K. Computation 2024, 12(6), 116. [Google Scholar]
- Wanqing, X., Yuqian. Deep reinforcement learning based proactive dynamic obstacle avoidance for safe human-robot collaboration. Manufacturing Letters 2024, 1246–1256. [Google Scholar]
- Franka Emika Documentation. Control Parameters Documentation, 2024. Available at: CrossRef.
- Weng, L. Policy Gradient Algorithms. Lil’Log, 2018. 2024. Available online: https://lilianweng.github.io/posts/2018-04-08-policy-gradient.