1. Introduction
In recent years, unmanned aerial vehicles (UAVs) have found applications in various military and civilian domains owing to their high mobility, accessibility, convenient deployment, and low cost. They have gradually become indispensable in modern society, with roles in civil sectors such as agriculture [1,2], mineral exploration [3], and forest rescue [4], as well as in military reconnaissance [5] and strikes [6]. The multi-UAV target search problem is a significant issue in autonomous UAV decision-making and has recently garnered extensive academic attention. In multi-UAV target search, UAVs use on-board detection equipment to reconnoiter designated areas and share information via a communication network, thereby jointly capturing targets. Currently, three primary categories of methods are used for multi-UAV target search. The first category is pre-planning methods, such as partition search [7] and formation search [8]. These methods essentially transform the target search problem into an area-coverage planning problem, offering high reliability and solutions that are easy to evaluate. However, they require a model of the mission area in advance, involve longer planning times, and adapt poorly to dynamic environmental changes. The second category is online optimization methods, which approximate the search problem as a real-time objective function optimization problem and typically employ traditional or heuristic algorithms, such as ant colony algorithms [9] and genetic algorithms [10]. These methods adapt better to environmental dynamics than pre-planning approaches, but they depend on a central node for decision-making and adapt poorly to distributed environments. The third category is Multi-Agent Reinforcement Learning (MARL) methods, which model the problem as a Partially Observable Markov Decision Process (POMDP) and use algorithms based on the MARL framework. These methods enable agents to learn and optimize behavior through interaction with the environment and other agents, allowing adaptation to dynamic changes and rapid decision-making [11,12]. The primary challenge lies in designing the training architecture, agent exploration mechanism, and reward function to suit specific task requirements.
Currently, the design of MARL methods is a prominent focus in multi-UAV target search research. Within the MARL framework, Shen et al. [13] proposed the DNQMIX algorithm, which enhances search rate and coverage. Lu et al. [14] proposed the MUICTSTP algorithm, demonstrating superior performance in terms of anti-interference and collision avoidance. Yu et al. [15] proposed the Multi-Agent Proximal Policy Optimization (MAPPO) algorithm, which has exhibited excellent performance in multi-agent testing environments and is regarded as one of the most advanced algorithms available. Wei et al. [16] combined the MAPPO algorithm with optimal control (OC) and GPU parallelization to propose the OC-MAPPO algorithm, which accelerates UAV learning.
To better describe environmental uncertainty, Bertuccelli et al. [17] proposed a probabilistic approach that divides the task area into units, each associated with the probability of target presence, establishing a target probability graph. This method has achieved good results in target search and is widely recognized. Building on the MARL framework and the target probability graph, Zhang et al. [18] designed a confidence probability graph using evidence theory and proposed a Double Critic DDPG algorithm. This approach effectively balances the bias in action-value function estimation and the variance in strategy updates. Hou et al. [19] converted the probability function into a grid-based goal probability graph and proposed a MADDPG-based search algorithm, improving search speed and avoiding collisions and duplications.
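To make the idea of a cell-wise target probability graph concrete, the following minimal sketch applies a textbook Bayes update to the cells a UAV currently observes. It is an illustration under stated assumptions, not the formulation used in [17], [18], or [19]: the detection probability `p_d`, the false-alarm probability `p_f`, and the function names `bayes_update` and `update_map` are all hypothetical.

```python
def bayes_update(p, detected, p_d=0.9, p_f=0.1):
    """Posterior probability that a cell holds a target after one sensor reading.

    p        : prior probability of target presence in the cell
    detected : True if the sensor reported a target, False otherwise
    p_d, p_f : assumed detection and false-alarm probabilities (hypothetical values)
    """
    if detected:
        num = p_d * p
        den = p_d * p + p_f * (1.0 - p)
    else:
        num = (1.0 - p_d) * p
        den = (1.0 - p_d) * p + (1.0 - p_f) * (1.0 - p)
    # Keep the prior if the evidence has zero probability under the model.
    return num / den if den > 0.0 else p


def update_map(prob_map, observations, p_d=0.9, p_f=0.1):
    """Return a new map with observed cells updated.

    observations maps (row, col) -> bool for each cell inside the sensor footprint.
    Cells that are not observed keep their prior value.
    """
    new_map = [row[:] for row in prob_map]
    for (r, c), detected in observations.items():
        new_map[r][c] = bayes_update(prob_map[r][c], detected, p_d, p_f)
    return new_map


# Uniform prior over a 2 x 2 area; one positive and one negative reading.
prior = [[0.5, 0.5], [0.5, 0.5]]
posterior = update_map(prior, {(0, 0): True, (1, 1): False})
```

In this sketch a positive reading raises the probability of the observed cell and a negative reading lowers it, while unobserved cells are left unchanged; repeated readings would simply be applied one after another.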
Multi-UAV target search has made some progress, but two challenges remain. First, sample data are still used inefficiently, and balancing utilization and exploration is difficult. Existing MARL algorithms primarily employ neural networks such as fully connected networks and convolutional networks, which cannot exploit the temporal and spatial information in the sample data at the same time and also lack effective environmental exploration. Second, the behavioral modeling of dynamic targets is relatively simple: existing work primarily considers changes in the target's position over time, often transforming the target search problem into a target tracking problem. In actual non-cooperative target search scenarios, targets may exhibit escape behavior, actively changing their position to evade detection and potentially using environmental blind spots to hide, which prevents real-time tracking by UAVs. To address these challenges, this paper investigates the Multi-UAV Escape Target Search (METS) problem in complex environments. The contributions of this paper are summarized as follows:
A simulation environment for the METS problem is constructed. It introduces a Target Existence Probability Map (TEPM) and employs a probability update method suited to the escaping target. Based on the TEPM, a Local State Field of View is designed, with local state values obtained through entropy calculation. Additionally, a new state space, action space, and reward function are devised within the framework of the Decentralized Partially Observable Markov Decision Process (DEC-POMDP). Together, these components establish a model of the METS problem.
To enhance the MARL algorithm's ability to process spatio-temporal sequence information and improve environmental exploration, this paper proposes the Spatio-Temporal Efficient Exploration (STEE) network, constructed using a Convolutional Long Short-Term Memory network and a Noise network. This network is integrated into the MAPPO algorithm, and its impact on the overall performance of the MAPPO algorithm is validated.
To search for the escaping target in the METS problem, the Global Convolutional Local Ascent (GCLA) mechanism is proposed. Combining this mechanism with the STEE network yields a Multi-UAV Escape Target Search Algorithm Based on MAPPO (ETS-MAPPO). This algorithm effectively addresses the challenges of searching for escaping targets, and experimental comparisons with five classical MARL algorithms show significant improvements in the number of target searches, the area coverage rate, and other metrics.
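The first contribution mentions local state values obtained from the TEPM through entropy calculation. One possible reading of that idea is sketched below: the binary (Shannon) entropy of each cell's target existence probability is averaged over a square window around the UAV. The window shape, the `radius` parameter, the averaging step, and the function names are assumptions made for illustration; the paper's own definition is given in its later sections and may differ.

```python
import math


def binary_entropy(p):
    """Shannon entropy (in bits) of a Bernoulli variable with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a certain outcome carries no uncertainty
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def local_entropy(tepm, row, col, radius=1):
    """Mean cell entropy in a square window centred on (row, col).

    tepm   : 2-D list of target existence probabilities
    radius : assumed half-width of the local field of view (hypothetical parameter)
    The window is clipped at the map boundary.
    """
    rows, cols = len(tepm), len(tepm[0])
    total, count = 0.0, 0
    for r in range(max(0, row - radius), min(rows, row + radius + 1)):
        for c in range(max(0, col - radius), min(cols, col + radius + 1)):
            total += binary_entropy(tepm[r][c])
            count += 1
    return total / count


# Toy map: the top-left region is fully uncertain, the bottom-right mostly resolved.
tepm = [
    [0.5, 0.5, 1.0],
    [0.5, 0.5, 0.0],
    [0.0, 1.0, 0.0],
]
uncertain = local_entropy(tepm, 0, 0)  # window covers four cells with p = 0.5
resolved = local_entropy(tepm, 2, 2)   # window covers one uncertain cell out of four
```

Under this reading, a high local value marks a region where the presence of a target is still unknown and therefore worth exploring, whereas a low value marks a region that has already been resolved.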
The remainder of this paper is organized as follows: Section 2 defines the system model and provides a mathematical formulation of the METS problem. Section 3 introduces the ETS-MAPPO algorithm within the MARL framework and describes it in detail. Section 4 presents experimental results that validate the effectiveness of ETS-MAPPO. Section 5 concludes the paper and outlines future research.