1. Introduction
With the continuous advancement of science and technology, robotics is developing rapidly and is now used in numerous fields, often for critical tasks. Robots can take over highly repetitive and complex production tasks previously performed by humans, significantly improving production efficiency while ensuring consistent product quality. Moreover, robots can be deployed in hazardous environments, effectively enhancing operational safety [1]. Owing to these advantages, robots are now extensively applied in agriculture [2], healthcare [3], the nuclear industry [4], aerospace [5], and many other fields.
Traditional calibrated systems are those in which the system parameters (e.g., camera intrinsic and extrinsic parameters, distortion coefficients) are precisely calibrated using specific methods before image or sensor data are processed. For visual systems, both the camera parameters and the hand-eye coordinate transformation must be calibrated. The accuracy of these parameters directly affects overall performance, which imposes significant limitations in practical applications [6]. To overcome the limitations of calibrated visual servoing, researchers have proposed uncalibrated visual servoing systems [7,8,9,10]. Uncalibrated visual servoing systems do not require parameter calibration. Instead, they control the robot by analyzing real-time image features, combining them with the robot’s current state information, and using control algorithms to compute the system’s control inputs for the next time step. Compared to traditional calibrated systems, uncalibrated systems eliminate the need for precise geometric or kinematic model calibration, reducing system complexity and improving adaptability in practical applications.
As research on uncalibrated visual servoing has deepened, Model-Free Adaptive Control (MFAC) has attracted attention because it does not depend on a system model. Rather than designing the control law from an a priori mechanistic model of the controlled object, MFAC adjusts its control strategy dynamically using input-output data acquired online. Existing MFAC approaches fall into two main categories. The first comprises dynamic linearization-based methods, which construct time-varying linear approximation models by estimating a pseudo-gradient or pseudo-Jacobian matrix online. While these methods retain the structural assumption of local linearization, they eliminate dependence on global model information. The second comprises fully data-driven model-free methods, which either establish nonlinear mappings directly from input-output data or employ intelligent algorithms (such as neural networks or fuzzy logic systems) to generate control laws, thereby circumventing the modeling process inherent in traditional control methodologies. For robotic motion control, a quintessential nonlinear control problem, researchers have proposed various specialized model-free adaptive control methods.
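To make the dynamic linearization idea concrete, the following is a minimal sketch of a compact-form dynamic linearization (CFDL) MFAC loop for a single-input, single-output plant. It is an illustration under stated assumptions, not the controller used in this paper: the plant `toy_plant`, the gains `eta`, `mu`, `rho`, `lam`, the reset threshold `eps`, and all function names are hypothetical choices made for the example. The controller only sees input-output increments and never uses the plant equation.

```python
def cfdl_mfac_step(y, y_prev, u_prev, u_prev2, y_ref_next, phi_hat,
                   phi_init=1.0, eta=0.5, mu=1.0, rho=0.5, lam=1.0, eps=1e-5):
    """One CFDL-MFAC step for a SISO plant (illustrative parameter values)."""
    du_prev = u_prev - u_prev2   # last input increment
    dy = y - y_prev              # output increment it produced
    # Projection-type update of the pseudo-gradient estimate from I/O data only.
    phi_hat += eta * du_prev / (mu + du_prev ** 2) * (dy - phi_hat * du_prev)
    # Reset rule: fall back to the initial estimate when the data carry no
    # information (tiny input increment) or the estimate degenerates.
    if abs(phi_hat) <= eps or abs(du_prev) <= eps or phi_hat * phi_init < 0:
        phi_hat = phi_init
    # Incremental control law built on the time-varying linear data model.
    u = u_prev + rho * phi_hat / (lam + phi_hat ** 2) * (y_ref_next - y)
    return u, phi_hat


def toy_plant(y, u):
    """Hypothetical stable plant; its equation is unknown to the controller."""
    return 0.6 * y + 0.5 * u


def simulate(n_steps=200, y_ref=1.0):
    """Track a constant reference and return the plant outputs."""
    y_prev, y = 0.0, 0.0
    u_prev2, u_prev = 0.0, 0.0
    phi_hat = 1.0
    outputs = []
    for _ in range(n_steps):
        u, phi_hat = cfdl_mfac_step(y, y_prev, u_prev, u_prev2, y_ref, phi_hat)
        y_prev, y = y, toy_plant(y, u)
        u_prev2, u_prev = u_prev, u
        outputs.append(y)
    return outputs


if __name__ == "__main__":
    print(f"final output: {simulate()[-1]:.4f}")
```

In this sketch the pseudo-gradient estimate `phi_hat` plays the role of the local linear model, and the penalty `lam` limits how aggressively the input may change per step. A multi-input robotic system would replace the scalar `phi_hat` with a pseudo-Jacobian matrix.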
In [11], a hybrid adaptive disturbance rejection control (HADRC) algorithm was proposed, which integrates dynamic linearization, disturbance observers, and fuzzy logic control to significantly improve the control performance of inflatable robotic arms. Dynamic linearization is suitable for multiple scenarios, disturbance observers enhance anti-disturbance capabilities, and fuzzy logic control effectively handles highly nonlinear and uncertain systems. In [12], a neural network-based model-free control method was proposed, which uses neural network approximation techniques and position measurements to estimate uncertain Jacobian matrices, significantly improving the adaptability and accuracy of continuum robots in complex environments. Additionally, for the dynamic uncertainty and saturation constraints of rehabilitation exoskeleton robots, [13] proposed a data-driven model-free adaptive containment control (MFACC) strategy, which linearizes the dynamic system into an equivalent data model and designs an improved model-free controller to enhance control performance in complex environments. For the nonlinear dynamics of NAO robots in robust walking, [14] proposed a model-free method based on time-delay estimation (TDE) and fixed-time sliding mode control, which uses TDE to estimate system dynamics in real-time and combines a fixed-time observer with an improved exponential reaching law (MERL) to enhance the stability and trajectory tracking accuracy of the control system.
Although model-free control methods do not rely on precise system models and have shown significant advantages in handling complex dynamic systems, their overall performance still has limitations. First, these methods have limited adaptability to environmental changes, especially in highly nonlinear, uncertain, or strongly disturbed scenarios, where control accuracy and stability may be affected. Second, the design of model-free control often relies on empirical criteria and the selection of algorithm parameters, which poses certain limitations for complex control tasks in high-dimensional spaces. Additionally, traditional model-free control methods struggle to fully utilize large amounts of online data, limiting their potential for dynamic optimization and long-term performance improvement.
Reinforcement Learning (RL), whose core mechanism is the autonomous learning of optimal policies through interaction with the environment, provides a novel approach to overcoming traditional challenges in robotic arm visual servoing [15,16,17,18,19]. It has two key advantages. First, it eliminates the need for precise robot kinematics/dynamics models or cumbersome camera calibration, significantly reducing system complexity. Second, it is highly capable of high-dimensional policy generation, enabling direct learning of complex control strategies from high-dimensional visual inputs (e.g., camera images). These strengths have led to the widespread application of RL in robotic arm visual servoing tasks, such as target localization, grasping, trajectory tracking, and obstacle avoidance [20,21,22,23,24].
To address diverse task requirements, the RL algorithm framework has continued to evolve. Early value-based methods (e.g., DQN [25]) successfully tackled simple tasks with discrete action spaces, such as image-based target localization, but struggled to handle the continuous action spaces required for robotic arm control, often resulting in non-smooth motions. To overcome these limitations, policy gradient-based Actor-Critic methods (e.g., DDPG [26,27], SAC [28]) have emerged as the dominant approach. These methods directly output continuous actions and demonstrate superior performance in complex dynamic environments, including high-DoF precise positioning, smooth trajectory tracking, and multi-task learning. However, such methods rely heavily on online environment interaction for extensive trial-and-error learning, posing significant safety risks and high training costs when deployed on real robotic arms. Additionally, the sim-to-real transfer challenge further limits their practical efficiency.
To overcome the limitations imposed by online interaction requirements, offline reinforcement learning (Offline RL) has emerged as an effective alternative. The TD3+BC algorithm [29] represents a significant advancement in this field: it incorporates a behavior cloning (BC) regularization term into the TD3 framework so that training can use pre-collected experience datasets, thereby eliminating the risks and costs associated with online interaction. However, TD3+BC has notable constraints in practical applications. Its dual-critic architecture addresses the overestimation bias of single-critic methods by selecting the minimum Q-value estimate between its two networks, but this introduces excessive conservatism in value estimation that significantly reduces convergence speed. Consequently, although TD3+BC establishes an important methodological foundation for training robotic arms safely and efficiently, its limitations in addressing distributional shift and policy conservatism significantly impair its robustness, generalization capability, and ultimate performance in complex, high-precision visual servoing tasks. These shortcomings highlight the need for next-generation offline RL algorithms that are more robust and efficient while overcoming dataset constraints. This challenge forms the core research motivation and starting point of our study.
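The two TD3+BC ingredients discussed above can be sketched in a few lines. This is a framework-free illustration with plain Python floats standing in for network outputs. The function names are hypothetical, `alpha=2.5` follows the default reported for TD3+BC, and the small constant added to the denominator is a guard introduced here for the example.

```python
from statistics import fmean


def clipped_double_q_target(reward, done, gamma, q1_next, q2_next):
    """TD3-style target: bootstrap from the smaller of the two target critics."""
    return reward + gamma * (1.0 - done) * min(q1_next, q2_next)


def td3_bc_actor_loss(q_values, policy_actions, dataset_actions, alpha=2.5):
    """Mini-batch actor loss: normalized Q maximization plus a BC penalty.

    q_values:        Q(s, pi(s)) for each state in the batch
    policy_actions:  pi(s) for each state (list of action vectors)
    dataset_actions: the action stored in the dataset for each state
    """
    # Scale the Q term by the mean absolute Q so both terms stay comparable.
    lam = alpha / (fmean([abs(q) for q in q_values]) + 1e-8)  # 1e-8: guard added here
    # Behavior cloning term: mean squared error to the dataset actions.
    bc = fmean([
        fmean([(p - a) ** 2 for p, a in zip(pa, da)])
        for pa, da in zip(policy_actions, dataset_actions)
    ])
    return -lam * fmean(q_values) + bc


if __name__ == "__main__":
    print(clipped_double_q_target(1.0, 0.0, 0.99, q1_next=5.0, q2_next=4.0))
    print(td3_bc_actor_loss([2.0, 4.0],
                            [[0.1, 0.2], [0.0, 0.0]],
                            [[0.1, 0.0], [0.0, 0.2]]))
```

Taking the minimum in the target is what guards against overestimation, and it is also the source of the pessimism noted above, since the target can never exceed the lower of the two estimates.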
This paper proposes an uncalibrated visual servo control method for robotic arms based on improved offline reinforcement learning. Its core innovation is the Multi-Network Mean Delayed Deep Deterministic Policy Gradient Algorithm with Behavior Cloning (MN-MD3+BC). The method establishes a multi-critic network integration mechanism that uses the mean output of the critic networks as the final Q-value estimate, reducing the estimation bias inherent in single-critic approaches. The algorithm incorporates a behavior cloning regularization term into the policy gradient update, creating a dual-driven mechanism that combines this term with the traditional Q-value maximization objective. This constrains the policy from deviating from the dataset distribution while using the Q-optimization objective to offset the conservatism of BC, thereby enhancing policy optimization potential without compromising safety. At the system implementation level, the method uses an end-to-end direct mapping from visual features to joint control commands, eliminating the complex calibration processes required by conventional methods. Through the proposed data-recombination-driven offline pretraining framework, the method leverages pre-constructed high-quality datasets and improves experience reuse via data recombination, maximizing the utility of limited datasets while improving training efficiency. Compared with existing approaches, this solution maintains control precision while significantly reducing dependence on precise calibration and environmental prior knowledge, offering a new approach to robotic arm visual servo control in complex scenarios.
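The difference between a minimum-based and a mean-based critic target can be illustrated with a short sketch. The exact MN-MD3+BC formulation is given in Section 3. The code below only assumes that the bootstrap value is the arithmetic mean over the target critics' estimates, and the function name, the `reduce` switch, and the numbers are hypothetical.

```python
def td_target(reward, done, gamma, next_q_estimates, reduce="mean"):
    """Bootstrap target from several critics' estimates of Q(s', pi(s')).

    reduce="min"  : pessimistic aggregation, as in clipped double-Q learning
    reduce="mean" : averaged aggregation (assumed form of the multi-critic mean)
    """
    if reduce == "min":
        aggregated = min(next_q_estimates)
    elif reduce == "mean":
        aggregated = sum(next_q_estimates) / len(next_q_estimates)
    else:
        raise ValueError(f"unknown reduce mode: {reduce}")
    return reward + gamma * (1.0 - done) * aggregated


if __name__ == "__main__":
    estimates = [4.0, 5.0, 6.0]  # hypothetical outputs of three target critics
    print("min target :", td_target(1.0, 0.0, 0.9, estimates, reduce="min"))
    print("mean target:", td_target(1.0, 0.0, 0.9, estimates, reduce="mean"))
```

For a non-terminal transition the mean target is never below the minimum target, so averaging gives up some pessimism in exchange for a lower-variance estimate. How many critics are used and how they are trained is left to the method description.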
The remainder of this paper is organized as follows: Section 2 describes the experimental platform and outlines theoretical foundations. Section 3 details the proposed offline reinforcement learning adaptive controller and the MN-MD3+BC algorithm architecture. Section 4 presents validation results through both MATLAB simulations and WPR1 robotic arm experiments. Finally, Section 5 provides concluding remarks on the research findings.