Preprint
Article

This version is not peer-reviewed.

UAV Task Scheduling and Resource Allocation for Data Collection Applications: a Hierarchical Reinforcement Learning Approach

Submitted:

10 March 2025

Posted:

11 March 2025


Abstract
In recent years, the use of unmanned aerial vehicles (UAVs) has surged in a variety of applications, including weather monitoring, emergency search and rescue operations, and smart agriculture. In these applications, UAVs are instrumental in executing data collection tasks. However, because resources such as battery capacity are scarce, the working duration is limited, which makes it necessary to optimize UAV trajectories and resource allocation. In this paper, we introduce a hierarchical task scheduling and resource allocation scheme for UAV data collection tasks, which incorporates deep reinforcement learning (DRL) into a two-layer training framework. The continuous actions, including the UAV trajectory, communication power, and CPU main frequency, are first optimized in the lower layer, while the upper layer concentrates on the discrete action space for target allocation. In particular, the well-trained lower-layer network is integrated into the upper layer, which provides rapid global reward feedback between the two layers and accelerates the training of the upper-layer network. Experimental results demonstrate that the proposed approach outperforms the baseline methods in terms of both the amount of collected data and UAV energy consumption.

1. Introduction

As lightweight aircraft that require no human pilot, unmanned aerial vehicles (UAVs) have the advantages of low cost, strong robustness, and commercial availability [1]. UAVs equipped with various types of sensors have many applications, such as fire detection, weather monitoring, emergency search and rescue, logistics transmission, precision agriculture, and target tracking. In recent years, more and more research has focused on using UAVs equipped with high-resolution cameras to collect image data from target areas by taking photos or videos, which is referred to as a data collection task [2]. Owing to their high maneuverability and high-altitude communication advantages, UAVs are increasingly being used in data collection tasks, especially in search and rescue, surveillance, and the Internet of Things [3,4,5].
UAV-enabled data collection mainly focuses on the monitoring of disaster-affected areas in emergency scenarios [6,7]. Earthquakes or heavy rainfall often cause geological disasters such as landslides, ground subsidence, and debris flows, resulting in huge property losses and casualties. To collect information about the severity of the disaster, the scope of its impact, and its trend in the affected area, UAVs can capture image data of that area. However, due to the special nature of the data collection task, the UAV needs to cover as many target areas as possible in a limited time while also minimizing flight distance and energy consumption. This requires jointly considering two issues, task scheduling [8,9] and resource allocation [10,11], and making effective trade-offs between them. Therefore, it is necessary to plan UAV trajectories and resource allocation. Besides, in emergency scenarios the environment often changes dynamically. For example, the priority of ground targets in disaster-affected areas changes over time: during task execution, priority information is updated through new information monitored by satellites or through manual modifications by relevant personnel. Traditional optimization algorithms often struggle with such problems, whereas reinforcement learning is one of the effective ways to solve such dynamic programming problems. It is therefore of great significance and value to study the application of reinforcement learning methods to task planning for UAV data collection.
The main problem to be solved in the UAV data collection task is to optimize trajectory planning and resource allocation under limited UAV battery capacity so that the task is completed effectively. However, (1) most current research on UAV data collection optimizes the UAV trajectory under limited energy or limited working time [12,13], while ignoring the allocation of computing and communication resources during the task. It also rarely considers the limited cache space of the UAV, or the fact that the UAV has certain on-board processing capabilities and can therefore either process data on board or transmit data to ground processing nodes for processing. (2) Most existing studies use optimization methods to solve trajectory planning and resource allocation under a fixed target distribution [14]. When the target distribution changes, the solution needs to be recalculated.
Reinforcement learning, as an approach based on trial-and-error learning, can adaptively optimize UAV trajectory planning and resource allocation in a dynamic environment. Evolutionary algorithms, genetic algorithms, and other optimization methods based on natural evolution and genetic mechanisms also have good adaptability and robustness, and are widely used in UAV data collection tasks.
To solve these issues, we propose a novel hierarchical deep reinforcement learning based task scheduling and resource allocation method (HDRL-TSRAM) for UAV data collection applications. It models the UAV data collection scenario hierarchically, fully considers the processes of UAV flight, communication, computation, and data collection, and designs appropriate reward functions to improve the efficiency of UAV collection. First, the scenario modeling of UAV data collection tasks is introduced, followed by a detailed introduction of the principles and frameworks of the lower-layer resource allocation algorithm and the upper-layer task scheduling algorithm. Finally, the effectiveness of the proposed HDRL-TSRAM is verified through comparative experiments. The main contributions of this paper are summarized as follows:
  • We propose a two-layer hierarchical task scheduling and resource allocation scheme for the UAV data collection task, where the amount of collected data and the UAV energy consumption are jointly optimized (the former maximized, the latter minimized) by training the target allocation, trajectories, communication power, and CPU frequency of the UAV in a decentralized manner.
  • In the lower layer, we design a PPO-based algorithm to optimize the continuous actions, where the Lin-Kernighan-Helsgaun method is utilized to obtain the optimal trajectory as the given input of the lower-layer network (LLN).
  • In the upper layer, we leverage a DQN-based algorithm to optimize the discrete actions with the assistance of the nested well-trained LLN, which gives rapid feedback on global rewards and accelerates the training process of the upper-layer network.
  • Extensive experiments are implemented to demonstrate the effectiveness of the proposed two-layer HDRL-TSRAM framework, which achieves superior performance compared with several baseline methods.
The rest of this paper is organized as follows: Section 2 provides the related work. Section 3 introduces the system model and problem formulation. Section 4 presents the proposed HDRL-TSRAM framework. Simulation experiments and result analysis are presented in Section 5. Section 6 provides a discussion, and Section 7 concludes this paper.
Table 1. The Structure of This Paper

| Section | Content |
| --- | --- |
| 1 | Introduction |
| 2 | Related works |
| 3 | System model |
| 4 | Hierarchical DRL method |
| 5 | Experimental results and analysis |
| 6 | Discussion |
| 7 | Conclusions |

3. System model

As shown in Figure 1, the task scenario involves a UAV, a ground base station (GBS), and $M$ ground targets, which are denoted as $\mathcal{M} = \{1, 2, \ldots, M\}$. The UAV collects high-definition images or video data of the targets through high-resolution cameras within a fixed area. Due to its limited storage space, the UAV needs to process the data onboard or transmit it to the GBS for further processing during data collection. Additionally, the battery energy of the UAV is restricted and must support various activities such as aerial flight, hovering, onboard data processing, and data transmission. Therefore, efficient task scheduling and resource allocation are necessary to optimize the energy consumption and the amount of collected data.
Table 4. The Simulation Parameters

| Parameters | Meaning | Parameters | Meaning |
| --- | --- | --- | --- |
| $n$ | Time slot | $F_n$ | Main frequency |
| $T_n^f$ | Flight time | $T$ | Dynamic transition matrix |
| $T_n^s$ | Collection time | $M$ | Number of ground targets |
| $P_f$ | Flight power | $B$ | Allocated communication bandwidth |
| $P_h$ | Hovering power | $\sigma$ | Power of the random noise |
| $d_{m,n}$ | Distance | $R_n$ | Offloading data transmission rate |
| $P_n^{Tr}$ | Transmission power | $\gamma$ | Effective conversion rate of UAV |
| $E_n^c$ | Energy consumption | $P_n$ | Prioritization matrix |
| $C$ | Number of CPU cycles | $\alpha$ | Energy consumption penalty coefficient |
For ease of representation, the task time $T$ is divided into $N$ slots, and the duration of each slot should be sufficient to complete one data collection. Each slot $n \in \mathcal{N} = \{1, 2, \ldots, N\}$ consists of the flight time $T_n^f$ and the collection time $T_n^s$. The former depends on the flight distance and the latter is fixed, so the time slots vary in length.
Assuming that the maximum horizontal coverage radius of the UAV is $R_{\max}$, the UAV can capture high-definition images of the target nodes within this range. To denote the data collection relationship between the UAV and a target within any slot $n$, this paper introduces a binary indicator as follows
$$I_{m,n} \in \{0, 1\}, \quad \forall m \in \mathcal{M},\ n \in \mathcal{N} \tag{1}$$
where $I_{m,n} = 1$ indicates that the data of target $m$ will be collected at slot $n$; otherwise, $I_{m,n} = 0$. It should be noted that only the first collection result is valid when the same target is collected multiple times. In addition, the UAV can only collect data when the remaining cache is enough to complete one data collection. Hence, several constraints exist during the collection process as follows
$$I_{m,n} \cdot d_{m,n} \le R_{\max}, \quad \forall m \in \mathcal{M},\ n \in \mathcal{N} \tag{2}$$
$$\sum_{n=1}^{N} I_{m,n} = 1, \quad \forall m \in \mathcal{M},\ n \in \mathcal{N} \tag{3}$$
where $d_{m,n}$ represents the horizontal distance between the UAV and the ground target $m$ during time slot $n$, $C_n$ is the remaining cache of the UAV, and $\chi$ is the amount of collected data, which is fixed. Constraint (2) indicates that data can only be collected from targets within the coverage range of the UAV, and constraint (3) indicates that data collection for a target is only valid the first time. Besides, a binary indicator is introduced to describe whether the UAV performs a data collection task during time slot $n$, which is denoted as follows
$$I_n = \begin{cases} 1, & \text{if } \sum_{m=1}^{M} I_{m,n} > 0 \\ 0, & \text{otherwise} \end{cases} \tag{4}$$
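The indicator definitions in Eqs. (1)-(4) can be illustrated with a minimal Python sketch. The function names, the toy coordinates, and the `collected` set used to track the first-valid rule are assumptions made for illustration; the cache check mentioned above is omitted for brevity.

```python
import math

def collection_indicators(uav_xy, targets_xy, collected, r_max):
    """Return the list of I_{m,n} for one slot (illustrative helper, not from the paper)."""
    indicators = []
    for m, (tx, ty) in enumerate(targets_xy):
        # Horizontal distance d_{m,n} between the UAV and target m.
        d = math.hypot(tx - uav_xy[0], ty - uav_xy[1])
        in_range = d <= r_max            # coverage constraint, Eq. (2)
        first_time = m not in collected  # only the first collection is valid, Eq. (3)
        indicators.append(1 if (in_range and first_time) else 0)
    return indicators

def slot_indicator(indicators):
    """I_n from Eq. (4): 1 if at least one target is collected in the slot."""
    return 1 if sum(indicators) > 0 else 0

# Toy example: target 0 was already collected, target 2 is out of range.
targets = [(0.0, 0.0), (30.0, 40.0), (200.0, 0.0)]
I_mn = collection_indicators((0.0, 0.0), targets, collected={0}, r_max=60.0)
I_n = slot_indicator(I_mn)
```

In this toy case only target 1 (at horizontal distance 50) yields a valid collection, so the slot counts as a collection slot.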
Considering that the energy consumption of the UAV is a key indicator for the data collection task, the corresponding models are built to measure the value quantitatively, which consists of the flight model, the communication model, and the computation as well as the cache model.

3.1. Flight Model

Similar to prior studies, this article adopts a 3-D Cartesian coordinate system to model the stereoscopic space [43] of the data collection task. The UAV is assumed to move at a fixed height $H$, and the coordinates of the UAV at time slot $n$ are denoted by $p_n^u = (x_n, y_n, H),\ n \in \mathcal{N}$. The velocity of the UAV at time slot $n$ is denoted by $v_n$ and is assumed to remain constant within the slot, ignoring the acceleration and deceleration processes. The flight time $T_n^f$ is then given by
$$T_n^f = \frac{\left\| p_n^u - p_{n-1}^u \right\|}{v_n}. \tag{5}$$
Besides, the UAV keeps hovering during the collection. Thus the flight energy consumption at each slot can be given as the following equation:
$$E_n^f = P_f T_n^f + P_h T_n^s \tag{6}$$
where $T_n^f$ is the flight time of the UAV at slot $n$ and $T_n^s$ is the collection time, during which the UAV hovers. $P_f$ and $P_h$ are the flight power and hovering power, which are fixed constants.
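As a quick numerical illustration of Eqs. (5) and (6), the following Python sketch computes the flight time and flight energy for one slot. The function names and the numeric values (powers in watts, speed in m/s) are illustrative assumptions rather than the paper's simulation parameters.

```python
import math

def flight_time(p_prev, p_curr, speed):
    """T_n^f = ||p_n^u - p_{n-1}^u|| / v_n, Eq. (5)."""
    return math.dist(p_prev, p_curr) / speed

def flight_energy(p_prev, p_curr, speed, t_collect, p_fly, p_hover):
    """E_n^f = P_f * T_n^f + P_h * T_n^s, Eq. (6)."""
    return p_fly * flight_time(p_prev, p_curr, speed) + p_hover * t_collect

# Toy values: a 50 m hop at 10 m/s followed by 2 s of hovering collection.
t_f = flight_time((0.0, 0.0, 100.0), (30.0, 40.0, 100.0), speed=10.0)
e_f = flight_energy((0.0, 0.0, 100.0), (30.0, 40.0, 100.0), speed=10.0,
                    t_collect=2.0, p_fly=100.0, p_hover=80.0)
```

Because the height is fixed, the third coordinate cancels in the distance and only the horizontal displacement contributes to the flight time.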

3.2. Communication Model

During the collection, the UAV can offload data to the GBS to alleviate the pressure of online processing. To simulate the communication process between the UAV and the GBS, an air-ground communication channel model is established. It can be regarded as a line-of-sight channel, since there is generally little obstruction in the air. The offloading data transmission rate can be expressed as
$$R_n = B \log_2 \left( 1 + \frac{g_0 P_n^{Tr}}{\left( H^2 + d_n^2 \right) \sigma^2} \right) \tag{7}$$
where $B$ is the allocated communication bandwidth, which is fixed. Frequency-division duplexing is adopted for the uplink channel between the UAV and the targets as well as the downlink channel between the UAV and the GBS to avoid co-frequency interference. $g_0$ is the received power at a distance of 1 m with a transmission power of 1 W, $P_n^{Tr}$ is the transmission power of the UAV at time slot $n$, $d_n$ is the relative distance between the UAV and the GBS, and $\sigma$ is the power of the random noise. Therefore, the size of the offloaded data can be expressed as
$$D_n^{Tr} = R_n T_n^s \tag{8}$$
It is worth noting that the UAV only transmits data while hovering in the air, and the energy consumption for the communication between the UAV and the GBS can be formulated as
$$E_n^{Tr} = P_n^{Tr} T_n^s, \quad \forall n \in \mathcal{N} \tag{9}$$
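The communication model in Eqs. (7)-(9) can be sketched in a few lines of Python. The function names are hypothetical, and the toy inputs are chosen only so that the signal-to-noise term equals 1 and the result is easy to check by hand; they are not realistic channel parameters.

```python
import math

def offload_rate(bandwidth, g0, p_tx, height, d_gbs, sigma):
    """R_n = B * log2(1 + g0 * P / ((H^2 + d^2) * sigma^2)), Eq. (7)."""
    snr = g0 * p_tx / ((height ** 2 + d_gbs ** 2) * sigma ** 2)
    return bandwidth * math.log2(1.0 + snr)

def offloaded_bits(rate, t_collect):
    """D_n^Tr = R_n * T_n^s, Eq. (8)."""
    return rate * t_collect

def tx_energy(p_tx, t_collect):
    """E_n^Tr = P_n^Tr * T_n^s, Eq. (9)."""
    return p_tx * t_collect

# Toy inputs giving an SNR term of exactly 1, so the rate equals the bandwidth.
rate = offload_rate(bandwidth=1e6, g0=1.0, p_tx=2.0, height=1.0, d_gbs=1.0, sigma=1.0)
bits = offloaded_bits(rate, t_collect=2.0)
energy = tx_energy(p_tx=2.0, t_collect=2.0)
```

The sketch makes the trade-off visible: raising the transmission power increases the rate only logarithmically, while the transmission energy grows linearly.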

3.3. UAV Computation Model

Let $F_n$ denote the main frequency of the UAV CPU at time slot $n$ when the UAV performs online processing on the cached data. The energy consumption for computation is expressed as
$$E_n^c = \gamma F_n^3 \left( T_n^f + T_n^s \right), \quad \forall n \in \mathcal{N} \tag{10}$$
where $\gamma$ is the effective conversion rate of the UAV during online processing. Following most existing research, the amount of data processed online can be described as
$$D_n^c = \frac{F_n \left( T_n^f + T_n^s \right)}{C}, \quad \forall n \in \mathcal{N} \tag{11}$$
where $C$ is the number of CPU cycles per input bit of the data to be processed.
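A short Python sketch of Eqs. (10) and (11) shows why the CPU frequency is worth optimizing: the processed data grows linearly with $F_n$, while the energy grows cubically. The function names and numeric values are assumptions for illustration only.

```python
def compute_energy(gamma, f_cpu, t_flight, t_collect):
    """E_n^c = gamma * F_n^3 * (T_n^f + T_n^s), Eq. (10)."""
    return gamma * f_cpu ** 3 * (t_flight + t_collect)

def processed_bits(f_cpu, t_flight, t_collect, cycles_per_bit):
    """D_n^c = F_n * (T_n^f + T_n^s) / C, Eq. (11)."""
    return f_cpu * (t_flight + t_collect) / cycles_per_bit

# Toy values: a 1 GHz CPU running for a 7 s slot, 1000 cycles per bit.
e_c = compute_energy(gamma=1e-27, f_cpu=1e9, t_flight=5.0, t_collect=2.0)
d_c = processed_bits(f_cpu=1e9, t_flight=5.0, t_collect=2.0, cycles_per_bit=1000.0)

# Doubling the frequency doubles the processed data but costs eight times the energy.
energy_ratio = compute_energy(1e-27, 2e9, 5.0, 2.0) / e_c
data_ratio = processed_bits(2e9, 5.0, 2.0, 1000.0) / d_c
```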

3.4. Optimization Model

Taking the above models into account, the optimization model of UAV task scheduling and resource allocation for the data collection application is established. We select the weighted sum of the total amount of information collected and the total energy consumption of the UAV as the optimization objective. The priority of ground targets changes dynamically, and the information content of targets with different priorities varies. We model the change of target priority as a Markov process, in which the target priority at the next slot depends only on the priority at the current slot. This can be expressed as
$$P_n = P_{n-1} T \tag{12}$$
where $P_n$ is the prioritization matrix of ground targets at time slot $n$, and $T$ is the dynamic transition matrix of target priority. The optimization objective is to maximize the amount of collected data while reducing the energy consumption of the UAV, and the final optimization model is formulated as follows
$$
\begin{aligned}
\mathrm{P1}: \max_{I_{m,n},\, p_n^u,\, P_n^{Tr},\, F_n} \quad & \frac{1}{N} \sum_{n \in \mathcal{N}} \left( w_n - \alpha E_n \right) \\
\text{s.t.} \quad & \mathrm{C1}: 0 \le \left\| p_n^u - p_{n-1}^u \right\| \le l_{\max}, \quad \forall n \in \mathcal{N}, \\
& \mathrm{C2}: I_{m,n} \in \{0, 1\}, \quad \forall m \in \mathcal{M},\ n \in \mathcal{N}, \\
& \mathrm{C3}: I_n \in \{0, 1\}, \quad \forall n \in \mathcal{N}, \\
& \mathrm{C4}: I_{m,n} d_{m,n} \le R_{\max}, \quad \forall m \in \mathcal{M},\ n \in \mathcal{N}, \\
& \mathrm{C5}: \sum_{n=1}^{N} I_{m,n} \le 1, \quad \forall m \in \mathcal{M},\ n \in \mathcal{N}, \\
& \mathrm{C6}: \sum_{n=1}^{N} E_n \le E, \quad \forall n \in \mathcal{N}, \\
& \mathrm{C7}: 0 \le D_n^{Tr} + D_n^c \le C_n, \quad \forall n \in \mathcal{N}, \\
& \mathrm{C8}: C_n + I_n \chi \le C_{\max}, \quad \forall n \in \mathcal{N}.
\end{aligned} \tag{13}
$$
where $\alpha$ is the energy consumption penalty coefficient, $E_n = E_n^f + E_n^{Tr} + E_n^c$ is the total energy consumption of the UAV at slot $n$, and $w_n$ is the amount of collected data at time slot $n$, which can be described as
$$w_n = \sum_{m \in \mathcal{M}} w_m \cdot f_m \tag{14}$$
Here $\mathcal{M}$ denotes the set of targets within the coverage range, $w_m$ is the priority of target $m$, and $f_m = 1$ means that target $m$ is detected by the UAV for the first time; otherwise $f_m = 0$.
As for the constraints, C1 limits the UAV trajectory, where $l_{\max}$ is the maximum flying distance of the UAV within one time slot. C4 is the coverage limitation: because the flight altitude of the UAV is fixed, its coverage range is unchanged, and a target is considered to be within the UAV field of view when the relative distance is less than the coverage range. C5 is the first-valid limitation. C6 means that the total energy consumption of the UAV should stay within the budget $E$. C7 is the causal constraint, which means that the amount of data processed online and offloaded at time slot $n$ cannot exceed the amount of data cached on the UAV. C8 indicates that data collection cannot be performed when the remaining cache space is insufficient, where $C_{\max}$ is the maximum cache space of the UAV.
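To make the Markov priority model of Eq. (12) concrete, the sketch below propagates a priority distribution with the row-vector update $P_n = P_{n-1} T$ and also samples a single target's next priority level. The two-level transition matrix is a hypothetical toy example and is not the matrix reported in Table 5.

```python
import random

def propagate(dist, transition):
    """Row-vector update P_n = P_{n-1} T from Eq. (12)."""
    k = len(transition)
    return [sum(dist[i] * transition[i][j] for i in range(k)) for j in range(k)]

def step_priority(level, transition, rng):
    """Sample the next priority level index from row `level` of the transition matrix."""
    return rng.choices(range(len(transition)), weights=transition[level], k=1)[0]

# Hypothetical 2-level transition matrix (rows sum to 1); not taken from the paper.
T_toy = [[0.9, 0.1],
         [0.2, 0.8]]

p1 = propagate([1.0, 0.0], T_toy)   # distribution after one slot
p2 = propagate(p1, T_toy)           # distribution after two slots
next_level = step_priority(0, T_toy, random.Random(0))
```

The next level depends only on the current one, which is exactly the Markov property stated above.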

4. Hierarchical DRL Method

The deep reinforcement learning (DRL) approach utilizes the Markov decision process (MDP) to make decisions in a given environment. The MDP is described by the four-tuple $[s(n), a(n), r(n), s(n+1)]$, where $s(n)$ is the state of the agent in time slot $n$; $a(n)$ is the action taken; $r(n)$ is the reward obtained when the agent interacts with the environment, which is given by the mapping $s(n) \times a(n) \to r(n)$; and $s(n+1)$ is the state in the next time slot under $a(n)$. The objective of the DRL method is to learn a policy $\pi: s \to a$ that maximizes the long-term reward.
Although DRL methods can learn strategies through interaction with the environment and fit the mapping from state to action through neural networks, the large-scale distribution of ground targets leads to explosive growth of the state combination space. It is difficult to learn the mapping from all states to the optimal action through interaction with the environment. Besides, the proposed problem (P1) is a mixed one, in which task scheduling is a discrete action while resource allocation is a continuous one, and the training of such mixed problems is under-explored in existing research.
To address the above issues, the layered decoupling idea is introduced into the DRL methodology, and a hierarchical DRL based task scheduling and resource allocation method is proposed, aiming to jointly optimize the trajectory, transmission power, and CPU main frequency of the UAV under a random and large-scale target distribution.
HDRL-TSRAM divides the original problem (P1) into a lower-layer subproblem and an upper-layer subproblem and establishes a corresponding environment for each. The former is responsible for resource allocation, and the latter is responsible for task scheduling. The algorithm complexity can be reduced by cutting down the range of the state space. The framework of the proposed HDRL-TSRAM is shown in Figure 2. The target region is divided into a certain number of equally sized sub-regions, and the dynamic changes in target priority within a sub-region can be approximately ignored. Therefore, the lower layer can use the shortest path principle (traveling salesman problem, TSP) to plan the trajectory of the UAV in the sub-region, while optimizing the corresponding communication and computing resources. At the upper layer, only the order of access to each sub-region needs to be optimized, that is, task scheduling. After layering, different types of DRL methods (value-based and policy-based) can be used to train the upper-layer and lower-layer decision networks, reducing the overall algorithm complexity.
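The division of the target region into equally sized sub-regions can be sketched as a simple grid assignment. This is a minimal sketch under the assumption of a rectangular area split into `rows` by `cols` cells; the function name, the row-major region numbering, and the coordinates are illustrative and are not specified in the paper.

```python
def assign_subregions(targets_xy, width, height, rows, cols):
    """Map each target index to the grid cell (sub-region) that contains it."""
    cell_w, cell_h = width / cols, height / rows
    regions = {}
    for m, (x, y) in enumerate(targets_xy):
        # Clamp so that targets on the far border fall into the last cell.
        c = min(int(x // cell_w), cols - 1)
        r = min(int(y // cell_h), rows - 1)
        regions.setdefault(r * cols + c, []).append(m)
    return regions

# Toy example: a 100 x 100 area split into a 2 x 2 grid.
toy_targets = [(10.0, 10.0), (60.0, 20.0), (100.0, 100.0), (40.0, 70.0)]
regions = assign_subregions(toy_targets, width=100.0, height=100.0, rows=2, cols=2)
```

Each value in `regions` is the target list that a lower-layer planner would handle, while the keys are the discrete choices left to the upper layer.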

4.1. Lower-layer Network

With the variables $p_n^u$, $P_n^{Tr}$, $F_n$ and the relevant constraints from (P1), the optimization problem of the lower layer is formulated as follows:
$$
\begin{aligned}
\mathrm{P2}: \max_{p_n^u,\, P_n^{Tr},\, F_n} \quad & \frac{1}{N_i} \sum_{n \in \mathcal{N}_i} \left( w_n - \alpha E_n \right) \\
\text{s.t.} \quad & \mathrm{C1}: 0 \le \left\| p_n^u - p_{n-1}^u \right\| \le l_{\max}, \quad \forall n \in \mathcal{N}_i,\ i \in \mathcal{I}, \\
& \mathrm{C2}: \sum_{n=1}^{N} E_n \le E, \quad \forall n \in \mathcal{N}_i,\ i \in \mathcal{I}, \\
& \mathrm{C3}: 0 \le D_n^{Tr} + D_n^c \le C_n, \quad \forall n \in \mathcal{N}_i,\ i \in \mathcal{I}, \\
& \mathrm{C4}: C_n + I_n \chi \le C_{\max}, \quad \forall n \in \mathcal{N}_i,\ i \in \mathcal{I}.
\end{aligned} \tag{15}
$$
where $i$ is the index of the sub-region and $\mathcal{N}_i$ is the set of task time slots of the UAV in sub-region $i$.
It is worth noting that the lower layer treats the path planning problem as a TSP, so the flight time can be calculated from the shortest flight path:
$$T_i^f = \frac{path_i^{\min}}{v} \tag{16}$$
where $path_i^{\min}$ is the shortest path for the UAV to access all targets in sub-region $i$, and $v$ is the UAV cruise speed in sub-region $i$. Let the number of targets in sub-region $i$ be $K_i$ and the collection time be fixed at $\delta$. The total flight energy consumption in sub-region $i$ can then be expressed as
$$E_i^f = P_f T_i^f + P_h K_i \delta \tag{17}$$
The energy consumption for communication and computation is the same as in (P1).

4.1.1. LKH-based UAV Trajectory Planning Algorithm

The TSP is a classic combinatorial optimization problem. Given a set of cities and the distances between them, its goal is to find the shortest route such that the traveling salesman starts from one city, visits all other cities exactly once, and finally returns to the starting city. Therefore, the shortest UAV trajectory planning problem within a sub-region can be solved as a TSP.
The Lin-Kernighan-Helsgaun (LKH) method is considered one of the most effective heuristic algorithms for the TSP, and because the number of targets in each sub-region is small after division, a solution can be obtained quickly and effectively. The LKH algorithm is developed from the Lin-Kernighan (LK) heuristic algorithm. The LK algorithm adopts a variable-depth local search that generalizes the 2-opt and 3-opt exchanges, continuously exchanging edges in the path to reduce the total path length. On this basis, the LKH algorithm introduces several improvements and extensions that give it better performance on large-scale TSP instances. The initial position of the UAV is set at the center of the sub-region, and we use the LKH algorithm to obtain the shortest path for the UAV to access all targets from the center of the current sub-region. The details are shown in Algorithm 1.
Algorithm 1: LKH-based UAV Trajectory Planning Method
1: for episode = 1, …, 10000 do
2:    Generate an initial solution using a greedy algorithm.
3:    Use k-opt exchange as the local search strategy to form new paths.
4:    Adopt a backtracking strategy to record the historical optimal solutions during the search process.
5: end for
6: Output the historical optimal solution.
In step 2, the initial solution may not be the optimal solution, but it can serve as the starting point for optimization. In step 3, k-opt exchange refers to attempting to disconnect k edges and reconnect them based on the current solution. In step 4, the LKH algorithm adopts a backtracking strategy to avoid getting stuck in local optima and backtracks to the historical optimal solution under certain conditions to escape the trap of local optima. In addition, the LKH algorithm also introduces a branch and bound strategy, which determines whether to search for sub-problems by predicting their lower bounds to reduce search space and improve solution efficiency.
Algorithm 1 has a time complexity of $O(10000 \cdot k \cdot n^2)$, where 10000 represents the number of iterations, $k$ is the number of k-opt exchanges in the local search step, and $n$ is the number of nodes in the UAV trajectory. This complexity arises from the combination of the greedy initialization, the k-opt local search, and the backtracking strategy used to track historical optimal solutions.
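The local search idea behind Algorithm 1 can be illustrated with a much simpler stand-in: a nearest-neighbor (greedy) initial tour followed by plain 2-opt improvement. This is explicitly not the LKH algorithm, which uses deeper k-opt moves and candidate sets; it is a minimal sketch of the same "greedy start, then edge exchange" pattern, with all function names chosen for illustration. Index 0 is assumed to be the sub-region center where the UAV starts.

```python
import math

def tour_length(points, order):
    """Length of the closed tour that visits `points` in the given order."""
    return sum(math.dist(points[order[i]], points[order[(i + 1) % len(order)]])
               for i in range(len(order)))

def nearest_neighbor(points, start=0):
    """Greedy initial tour: always fly to the closest unvisited point."""
    unvisited = set(range(len(points))) - {start}
    order = [start]
    while unvisited:
        last = points[order[-1]]
        nxt = min(unvisited, key=lambda j: math.dist(last, points[j]))
        order.append(nxt)
        unvisited.remove(nxt)
    return order

def two_opt(points, order):
    """Reverse tour segments while doing so shortens the tour (2-opt exchange)."""
    best = order[:]
    improved = True
    while improved:
        improved = False
        for i in range(1, len(best) - 1):      # position 0 (the start) stays fixed
            for j in range(i + 1, len(best)):
                candidate = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                if tour_length(points, candidate) < tour_length(points, best) - 1e-12:
                    best, improved = candidate, True
    return best

# A deliberately crossing tour on a unit square is untangled by 2-opt.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
untangled = two_opt(square, [0, 2, 1, 3])
```

For the small target counts inside a sub-region this kind of search finishes quickly; a dedicated LKH implementation would replace `two_opt` in practice.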

4.1.2. DRL-based UAV Resource Allocation Algorithm

With the optimal trajectory, the lower-layer network (LLN) utilizes the DRL method to optimize the communication and computation resources of the UAV, which correspond to the UAV transmission power $P_n^{Tr}$ and the UAV CPU main frequency $F_n$, respectively. Considering that both variables are continuous, the policy-based proximal policy optimization (PPO) algorithm is used to train the lower-layer network.
PPO is based on the Actor-Critic framework, which combines policy gradient and function approximation. The Actor provides an action while the Critic evaluates the value of the selected action. Then the Actor will update its parameters according to the feedback of the Critic. The training objective of the Actor is to generate action with the maximal Q value, while that of the Critic is to estimate the Q value accurately.
We define the state space, action space, and reward function for the LLN in our scenario as follows:
1) State: The state space $s(n)$ consists of the remaining UAV battery energy, the remaining UAV cache, the relative distance between the UAV and the GBS, the remaining task time, and the number of targets in the current sub-region, which is formulated as:
$$s(n) = \left\{ E_n^l, C_n^l, d_n, T^l, K_i \right\} \tag{18}$$
where $T^l$ is the remaining task time. The initial values of the remaining UAV battery energy $E_n^l$ and the remaining UAV cache $C_n^l$ are set randomly to cover as many situations as possible, which is formulated as follows
$$E_n^{l_0} = E_{\min} + \mathrm{rand}(0, 1) \times \left( E - E_{\min} \right) \tag{19}$$
$$C_n^{l_0} = C_{\min} + \mathrm{rand}(0, 1) \times \left( C_{\max} - C_{\min} \right) \tag{20}$$
where $E_{\min}$ and $C_{\min}$ are the minimum energy and cache that can ensure the progress of the task.
2) Action: $a(n)$ is the action of the UAV at time slot $n$, which consists of the UAV CPU main frequency and the UAV transmission power. $a(n)$ is expressed as
$$a(n) = \left\{ F_n, P_n^{Tr} \right\} \tag{21}$$
3) Reward: $r(n)$ is the reward obtained at time slot $n$, which is formulated as
$$r(n) = D_n - \alpha E_n \tag{22}$$
where $D_n = D_n^{Tr} + D_n^c$ is the total amount of data processed within slot $n$, including the amount of data processed online and the amount of data offloaded to the GBS.
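The random initialization in Eqs. (19)-(20) and the reward in Eq. (22) can be sketched as follows. The function names and the numeric values are assumptions for illustration; in particular the energy and cache bounds are placeholders, not the paper's settings.

```python
import random

def initial_resources(e_min, e_budget, c_min, c_max, rng):
    """Randomized initial energy and cache, following Eqs. (19) and (20)."""
    energy = e_min + rng.random() * (e_budget - e_min)
    cache = c_min + rng.random() * (c_max - c_min)
    return energy, cache

def lower_reward(d_offload, d_compute, energy, alpha):
    """r(n) = D_n - alpha * E_n with D_n = D_n^Tr + D_n^c, Eq. (22)."""
    return (d_offload + d_compute) - alpha * energy

rng = random.Random(0)  # seeded only to make the sketch repeatable
e0, c0 = initial_resources(e_min=100.0, e_budget=1000.0, c_min=10.0, c_max=50.0, rng=rng)
r = lower_reward(d_offload=3.0, d_compute=2.0, energy=4.0, alpha=0.5)
```

Drawing the initial energy and cache at random exposes the lower-layer policy to the full range of resource levels it may inherit from the upper layer.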
The training framework of the lower layer is shown in Figure 3. The actor network generates an action and interacts with the environment according to the current state and policy $\pi$, and the generated four-tuples are stored in the experience buffer, from which the LLN samples data to form the training dataset. The buffer model in this study represents a finite-capacity queue that temporarily stores incoming data packets during jamming attacks, enabling controlled transmission and mitigating delays, and it supports evaluating network performance under adversarial conditions. The estimated value of the advantage function for each sample is calculated by the following equation:
$$A^{\pi}\left( s(n), a(n) \right) = Q^{\pi}\left( s(n), a(n) \right) - V^{\pi}\left( s(n) \right) \tag{23}$$
where $Q^{\pi}(s(n), a(n))$ is the Q value of the state-action pair $(s(n), a(n))$, and $V^{\pi}(s(n))$ is the value of state $s(n)$.
The loss function of the actor network can take several forms, for example the clipped surrogate objective or the trust-region (KL-constrained) objective. The clipped form is utilized here to update the parameters of the policy, and the loss function is defined as follows
$$L(\varphi) = \hat{E}_n \left[ \min \left( r_n(\varphi) \hat{A}_n,\ \mathrm{clip}\left( r_n(\varphi), 1 - \epsilon, 1 + \epsilon \right) \hat{A}_n \right) \right] \tag{24}$$
where $\varphi$ denotes the parameters of the actor network, the clip function restricts the magnitude of the policy update through clipping, and $r_n(\varphi)$ represents the probability ratio of the current policy to the old policy for selecting action $a(n)$, which is formulated as
$$r_n(\varphi) = \frac{\pi_{\varphi}\left( a(n) \mid s(n) \right)}{\pi_{\varphi_{old}}\left( a(n) \mid s(n) \right)} \tag{25}$$
The temporal-difference approach is utilized to define the loss function of the critic network
$$L(\theta) = \hat{E}_n \left[ \frac{1}{2} \left( V_{\theta}\left( s(n) \right) - \hat{R}_n \right)^2 \right] \tag{26}$$
where $\theta$ denotes the parameters of the critic network and $\hat{R}_n$ is the total discounted reward. The training process of the LLN is shown in Algorithm 2.
Algorithm 2: PPO-based Training Scheme for the Lower-layer Network
1: Initialize the network parameters θ and φ.
2: Initialize the experience buffer.
3: for episode = 1, …, 10000 do
4:    Set up the environment and select an initial state s(1).
5:    for n = 1, …, N do
6:      Generate an action a(n).
7:      Execute a(n) and calculate r(n) using Eq. (22).
8:      Update the state s(n+1).
9:      Store the experience [s(n), a(n), r(n), s(n+1)].
10:      if the experience buffer is filled then
11:         Sample a mini-batch from the experience buffer.
12:         Update the parameters of the critic network using Eq. (26).
13:         Update the parameters of the actor network using Eq. (24).
14:      end if
15:    end for
16: end for
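The two update targets used in steps 12 and 13 of Algorithm 2, namely the critic loss of Eq. (26) and the clipped actor objective of Eq. (24), can be checked numerically with a framework-free Python sketch. The function names are illustrative, and the clipping range `eps=0.2` is a commonly used default assumed here, since the value of $\epsilon$ is not stated in this section.

```python
def clipped_surrogate(ratios, advantages, eps=0.2):
    """Mini-batch estimate of Eq. (24): mean of min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    terms = []
    for r, a in zip(ratios, advantages):
        clipped = max(1.0 - eps, min(r, 1.0 + eps))
        terms.append(min(r * a, clipped * a))
    return sum(terms) / len(terms)

def critic_loss(values, returns):
    """Mini-batch estimate of Eq. (26): mean of 0.5 * (V(s) - R_hat)^2."""
    return sum(0.5 * (v - g) ** 2 for v, g in zip(values, returns)) / len(values)

# Two toy samples: one ratio above the clip range, one below it.
actor_obj = clipped_surrogate(ratios=[1.5, 0.5], advantages=[1.0, -1.0])
value_loss = critic_loss(values=[1.0, 2.0], returns=[2.0, 2.0])
```

In the first sample the ratio 1.5 is clipped to 1.2, so the objective stops rewarding further movement in that direction; in the second sample the clipped term is the more pessimistic one and is therefore selected by the minimum. In a PyTorch implementation the same expressions would be written on tensors and the actor objective would be negated to obtain a loss to minimize.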

4.2. Upper-layer Network

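With the variables $I_{m,n}$ and the relevant constraints from (P1), the optimization problem of the upper layer is formulated as follows:
$$
\begin{aligned}
\mathrm{P3}: \max_{I_i} \quad & \frac{1}{I} \sum_{i \in \mathcal{I}} \left( w_i - \alpha E_i \right) \\
\text{s.t.} \quad & \mathrm{C1}: I_i \in \{0, 1, 2, \ldots, I\}, \quad \forall i \in \mathcal{I}, \\
& \mathrm{C2}: C_{I_{i-1}} + \chi_{I_i} \le C_{\max}, \quad \forall i \in \mathcal{I}, \\
& \mathrm{C3}: E_i \le E, \quad \forall i \in \mathcal{I}, \\
& \mathrm{C4}: \mathbb{1}\left\{ I_i \ne I_{i'} \right\} = 1, \quad \forall i \in \mathcal{I},\ i' \in \mathcal{I},\ i \ne i'.
\end{aligned} \tag{27}
$$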
where $I_i$ is the task scheduling instruction for sub-region $i$, $I_i = i$ means that the UAV performs the data collection task in sub-region $i$, and $I$ is the total number of sub-regions. C2 is the cache constraint for the upper layer, which avoids the loss of part of the collected data due to cache overflow. $C_{I_{i-1}}$ is the amount of data cached in the UAV after the $(i-1)$-th data collection, and $\chi_{I_i}$ is the amount of data collected in the $i$-th data collection. C3 is the energy constraint for the upper layer, where $E_i$ is the energy consumption for the $i$-th data collection, which can be expressed as:
$$E_i = P_f \frac{d_{i,i-1}}{v} + E_i^l \tag{28}$$
where $P_f \frac{d_{i,i-1}}{v}$ is the energy consumption for the flight from the $(i-1)$-th sub-region to the $i$-th sub-region, and $d_{i,i-1}$ is the relative distance between the two sub-regions. $E_i^l$ is the energy consumption for communication and computation in sub-region $i$, which can be obtained from the lower layer.
In the upper layer, it is not necessary to optimize the resource allocation of the UAV, because the resource allocation strategy is treated as a known quantity when planning the overall service order of the UAV among the sub-regions. The ultimate objective is to maximize the overall data collection volume while minimizing the total energy consumption.
Considering that the optimization variable $I_i$ is discrete, the value-based deep Q-network (DQN) algorithm is used to train the upper-layer network (ULN). DQN utilizes a deep neural network (DNN) to estimate the Q-value and thereby avoids the limitation of the traditional Q table. For a given state, the Q value of the selected action is calculated by the DNN. Besides, DQN uses two DNNs to stabilize training: the evaluation network and the target network, which share the same structure but have different parameters. The former updates its parameters continuously, while the parameters of the latter are held temporarily fixed so that the target values do not shift with every update. The state space, action space, and reward for the ULN are defined in the following.
1) State: The state space $s(i)$ consists of the remaining UAV battery energy, the remaining UAV cache, the relative distance between the UAV and the GBS, the number of targets in each sub-region whose data have not yet been collected, the index of the sub-region in which the UAV is currently located, and the priority of the targets within each sub-region, which is formulated as:
$$s(i) = \left\{ E_i^u, C_i^u, d_i, L_i, N_i, P_i \right\} \tag{29}$$
where $P_i$ is the prioritization matrix of the ground targets in sub-region $i$, which is updated at regular intervals, and whose change is also modeled as a Markov process.
2) Action: $a(i)$ is the action of the UAV, which consists of the sub-region index for the next data collection. $a(i)$ is expressed as
$$a(i) = I_i \tag{30}$$
3) Reward: $r(i)$ is the reward obtained at the $i$-th sub-region selection, which is formulated as
$$r(i) = w_i - \alpha E_i \tag{31}$$
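How the upper-layer reward of Eq. (31) combines the transit energy of Eq. (28) with the outcome reported by the lower layer can be sketched as follows. The callable `lln_policy` is a hypothetical stand-in for the converged LLN: it is assumed to return the collected data value $w_i$ and the lower-layer energy $E_i^l$ for a sub-region, and the lambda used below is only a toy placeholder.

```python
import math

def upper_step_reward(prev_center, center, lln_policy, region_state,
                      p_fly, speed, alpha):
    """Return (r(i), E_i) for visiting one sub-region, following Eqs. (28) and (31)."""
    # Hypothetical interface: the converged lower layer reports (w_i, E_i^l).
    w_i, e_lower = lln_policy(region_state)
    # Transit energy P_f * d_{i,i-1} / v between sub-region centers.
    e_transit = p_fly * math.dist(prev_center, center) / speed
    e_i = e_transit + e_lower
    return w_i - alpha * e_i, e_i

# Toy stand-in for the lower layer: fixed data value and energy for any state.
toy_lln = lambda state: (10.0, 200.0)
reward, e_i = upper_step_reward((0.0, 0.0), (30.0, 40.0), toy_lln, region_state=None,
                                p_fly=100.0, speed=10.0, alpha=0.01)
```

Because the lower layer is queried rather than retrained, each upper-layer action receives its reward immediately, which is the source of the faster feedback described in the text.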
The framework of the upper-layer training method is shown in Figure 4. The ULN also consists of an evaluation network and a target network. The evaluation network first generates an action:
$$ a(n) = \arg\max_{a} Q\left(s(n), a \mid \theta_{\varphi}\right), \tag{32} $$
where $\theta_{\varphi}$ denotes the parameters of the evaluation network. The action is sent to the upper-layer environment, which is the converged LLN. Once the ULN generates the sub-region allocation, the upper-layer environment uses the converged LLN to obtain the optimal resource allocation strategy in real time. With the assistance of the LLN, the upper-layer reward $r(i)$ can be calculated with Eq. (31). Nesting the LLN in the upper layer in this way accelerates the convergence of upper-layer training. As in the training process of the lower layer, the experience $[s(n), a(n), r(n), s(n+1)]$ is sent to the buffer, and the target Q-value is calculated as follows:
$$ y(n) = r(n) + \gamma \max_{a} Q\left(s(n+1), a \mid \theta_{\varphi'}\right), \tag{33} $$
where $\theta_{\varphi'}$ denotes the parameters of the target network. When the experience buffer is filled, the evaluation network selects a mini-batch of samples to update its parameters as follows:
$$ \theta_{\varphi} \leftarrow \theta_{\varphi} + \alpha \left[ y(n) - Q\left(s(n), a(n) \mid \theta_{\varphi}\right) \right] \times \nabla_{\theta_{\varphi}} Q\left(s(n), a(n) \mid \theta_{\varphi}\right), \tag{34} $$
where $\alpha$ here denotes the learning rate. The parameters of the target network are then updated by the soft updating method:
$$ \theta_{\varphi'} \leftarrow \tau \theta_{\varphi} + (1 - \tau) \theta_{\varphi'}. \tag{35} $$
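The three update rules above (target value, gradient step on the evaluation parameters, and soft update of the target parameters) can be traced numerically with a minimal sketch. It assumes a linear Q-function as a stand-in for the neural network, so that the gradient is simply the state vector; the function names and the learning rate of 0.1 are illustrative assumptions, while the discount factor 0.99 and soft-update parameter 0.005 follow Table 8.

```python
def q_values(theta, state):
    """Linear stand-in for the network: Q(s, a | theta) = theta[a] . s."""
    return [sum(w * x for w, x in zip(row, state)) for row in theta]


def td_target(reward, next_state, theta_target, gamma=0.99):
    """Target value y(n) = r(n) + gamma * max_a Q(s(n+1), a | theta_target)."""
    return reward + gamma * max(q_values(theta_target, next_state))


def update_eval(theta, state, action, y, lr):
    """One gradient step on the evaluation parameters.

    For a linear Q, the gradient w.r.t. theta[action] is the state vector.
    """
    td_error = y - q_values(theta, state)[action]
    theta[action] = [w + lr * td_error * x
                     for w, x in zip(theta[action], state)]
    return td_error


def soft_update(theta_target, theta, tau=0.005):
    """Move the target parameters a fraction tau toward the evaluation ones."""
    for a, row in enumerate(theta):
        theta_target[a] = [tau * w + (1 - tau) * wt
                           for w, wt in zip(row, theta_target[a])]


theta_eval = [[0.0, 0.0], [0.0, 0.0]]      # two actions, two state features
theta_tgt = [row[:] for row in theta_eval]
y = td_target(reward=2.0, next_state=[0.5, 1.0], theta_target=theta_tgt)
err = update_eval(theta_eval, state=[1.0, 0.5], action=1, y=y, lr=0.1)
soft_update(theta_tgt, theta_eval)
```

Starting from all-zero parameters, the target value equals the reward, only the weights of the chosen action change, and the target parameters move by 0.5% of that change.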
The training strategy for the ULN is summarized as Algorithm 3.
Algorithm 3: Training strategy for the ULN
1: Initialize the evaluation network $Q(s, a \mid \theta_{\varphi})$.
2: Initialize the target network with parameters $\theta_{\varphi'} \leftarrow \theta_{\varphi}$.
3: Initialize the experience buffer $X$.
4: for episode = 1, …, 10000 do
5:    Set up the environment and reset the initial state $s(1)$.
6:    for i = 1, …, I do
7:       Choose an action $a(i)$ using Eq. (32).
8:       Execute $a(i)$ and calculate the reward $r(i)$ using Eq. (31).
9:       Update the state to $s(i+1)$.
10:      Store the experience $[s(i), a(i), r(i), s(i+1)]$.
11:      if the experience buffer is filled then
12:         Randomly sample a mini-batch from the experience buffer.
13:         Calculate the target Q-value using Eq. (33).
14:         Update the parameters of the evaluation network $\theta_{\varphi}$ using Eq. (34).
15:         Update the parameters of the target network $\theta_{\varphi'}$ using Eq. (35).
16:      end if
17:    end for
18: end for
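The control flow of Algorithm 3 can be sketched in a few lines of Python. This is only a toy under stated assumptions: a tabular Q-function replaces the neural network, a hypothetical three-region environment with fixed rewards replaces the converged LLN, and "the buffer is filled" is interpreted as "the buffer holds at least one mini-batch". The greed coefficient 0.1, discount factor 0.99, and soft-update parameter 0.005 follow Table 8; all other values and names are illustrative.

```python
import random
from collections import deque


def select_action(q_row, epsilon, rng):
    """Epsilon-greedy selection over the discrete sub-region actions."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=q_row.__getitem__)


def train(info=(1.0, 2.0, 4.0), episodes=200, steps=5, batch_size=8,
          capacity=500, epsilon=0.1, gamma=0.99, lr=0.1, tau=0.005, seed=0):
    rng = random.Random(seed)
    n = len(info)
    q_eval = [[0.0] * n for _ in range(n)]    # evaluation table (steps 1-2)
    q_target = [row[:] for row in q_eval]     # target table
    buffer = deque(maxlen=capacity)           # experience buffer (step 3)
    for _ in range(episodes):
        state = 0                             # reset: UAV starts in region 0
        for _ in range(steps):
            action = select_action(q_eval[state], epsilon, rng)
            # Toy environment: reward depends only on the chosen sub-region.
            reward, next_state = info[action], action
            buffer.append((state, action, reward, next_state))
            state = next_state
            if len(buffer) >= batch_size:
                for s, a, r, s2 in rng.sample(list(buffer), batch_size):
                    y = r + gamma * max(q_target[s2])        # target Q-value
                    q_eval[s][a] += lr * (y - q_eval[s][a])  # evaluation update
                for s in range(n):                           # soft update
                    for a in range(n):
                        q_target[s][a] = (tau * q_eval[s][a]
                                          + (1 - tau) * q_target[s][a])
    return q_eval, buffer


q_table, replay = train()
```

In the actual method, the line that computes `reward` and `next_state` would instead query the converged LLN, which is what lets the upper layer obtain its reward without retraining the lower layer.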

5. Experimental Results and Analysis

5.1. Experimental Setup

Our experiments use a custom simulation environment built on the PyTorch framework. The GBS is located at [0, 0], the UAV starts at [100, 100], and the coordinates of the ground targets are randomly generated. The initial priority of each ground target takes one of four levels, from level I (lowest) to level IV (highest). Level I targets account for about 35% of all targets, level II and level III targets each account for about 25%, and level IV targets account for about 15%. Target priorities change dynamically: the priority transfer matrix is given in Table 5, and the amount of information associated with each priority level is given in Table 6. This design makes target priorities stochastic, simulating the variability and uncertainty of different targets over time. We conducted multiple rounds of experiments to check the reproducibility of the results and the stability and reliability of the algorithm.
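To make the priority dynamics concrete, the following sketch samples initial priorities from the stated shares and advances them by one step of the Markov chain in Table 5. It assumes that rows of the matrix index the current level and columns the next level (each row sums to 1); the function and variable names are illustrative.

```python
import random

LEVELS = ["I", "II", "III", "IV"]
INITIAL_SHARE = [0.35, 0.25, 0.25, 0.15]   # approximate shares of levels I-IV
TRANSITION = [                              # Table 5: row = current, column = next
    [0.8, 0.1, 0.08, 0.02],
    [0.0, 0.8, 0.16, 0.04],
    [0.0, 0.0, 0.9, 0.1],
    [0.0, 0.0, 0.0, 1.0],
]


def step_priorities(priorities, rng):
    """Advance every target's priority level by one Markov transition."""
    return [rng.choices(range(4), weights=TRANSITION[p])[0] for p in priorities]


rng = random.Random(0)
priorities = rng.choices(range(4), weights=INITIAL_SHARE, k=20)
after = step_priorities(priorities, rng)
print([LEVELS[p] for p in after])
```

Because the matrix is upper triangular, a target's priority can only stay the same or rise, and level IV is absorbing.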
The task scenario is a 1000 m × 1000 m square area with 20 to 100 randomly distributed ground targets. The UAV flies at a constant speed of 20 m/s. Collecting the image data of one ground target takes 10 seconds, and the task time is 30 seconds when there is no target. The maximum computing frequency of the UAV is 10 GHz, and 5 GB of data is collected in a single data acquisition. The maximum data transmission power is 100 W, the communication bandwidth is 20 MHz, the flight power is 110 W, the hovering power is 80 W, the UAV battery energy is 2000 kJ, and the flight altitude is 100 m.
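A back-of-the-envelope calculation shows how these parameters translate into energy budgets. The sketch assumes that the UAV hovers while collecting data and ignores computation and transmission energy; the function names are illustrative and not taken from the paper's simulator.

```python
SPEED_M_S = 20.0          # constant flight speed
FLIGHT_POWER_W = 110.0    # power while flying
HOVER_POWER_W = 80.0      # power while hovering
COLLECT_TIME_S = 10.0     # time to collect one target's image data
BATTERY_J = 2_000_000.0   # 2000 kJ battery


def flight_energy(distance_m):
    """Energy in joules to fly a given distance at constant speed."""
    return FLIGHT_POWER_W * distance_m / SPEED_M_S


def collection_energy(num_targets):
    """Energy in joules spent hovering while collecting num_targets targets."""
    return HOVER_POWER_W * COLLECT_TIME_S * num_targets


leg = flight_energy(1000.0)               # one side of the task area
collect = collection_energy(3)            # three targets
endurance_s = BATTERY_J / FLIGHT_POWER_W  # continuous flight time on one charge
```

Under these assumptions, flying one side of the area costs 5500 J and collecting three targets costs 2400 J, both small fractions of the 2000 kJ battery.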
In the lower layer, the PPO-based LLN is compared with the soft actor-critic (SAC) algorithm and the genetic algorithm (GA); the corresponding parameters are listed in Table 7.

5.2. Numerical Results

5.2.1. Experimental Verification of Lower Layer Network

The lower-layer UAV trajectory planned by the LKH algorithm is shown in Figure 5. It is the shortest trajectory along which the UAV starts from the center of the sub-region, traverses all ground targets in the sub-region, and returns to the center. The average total rewards of SAC, PPO, and GA in the lower layer are 338.6, 338.3, and 340.5, respectively. The average total reward of SAC is similar to that of PPO, and both are slightly lower than that of GA. However, once trained, SAC and PPO can allocate UAV resources for any initial cache and battery energy, whereas GA must be recomputed for each new scenario.
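The closed tour described above (center, all targets, back to center) can be illustrated on a small instance. The sketch below uses exhaustive search over visiting orders as a simple stand-in for LKH; it is exact but only practical for a handful of targets, and the coordinates and function names are made up for illustration.

```python
import itertools
import math


def tour_length(center, order):
    """Length of the closed tour center -> targets in order -> center."""
    points = [center, *order, center]
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def shortest_tour(center, targets):
    """Exhaustive search over visiting orders (not LKH; small inputs only)."""
    best = min(itertools.permutations(targets),
               key=lambda order: tour_length(center, order))
    return list(best), tour_length(center, best)


center = (100.0, 100.0)   # center of a 200 m x 200 m sub-region
targets = [(50.0, 50.0), (150.0, 50.0), (150.0, 150.0), (50.0, 150.0)]
order, length = shortest_tour(center, targets)
```

For these four targets on the corners of a square, the best tour walks three sides of the square and uses two center-to-corner legs. LKH reaches comparable tours heuristically without enumerating all orders, which is what makes it usable for up to 10 targets per sub-region.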
Figure 6 plots the reward obtained by SAC and PPO against the number of training steps. The two algorithms reach similar average total rewards after convergence, but PPO uses data more efficiently than SAC and therefore converges faster with less experience data. GA, by contrast, has to be re-optimized in each scenario.
To verify the rationality of using the shortest-path algorithm for trajectory planning and the reinforcement learning algorithm for resource allocation in the lower layer without considering priority changes, experiments were conducted with 2, 4, 6, 8, and 10 targets per sub-region. The results are shown in Figure 7; the performance difference between the two cases is within 5%.

5.2.2. Experimental Verification of Upper Layer Network

The validation scenario for the upper-layer algorithm is a 1000 m × 1000 m square area divided into a 5×5 grid, where each cell is a 200 m × 200 m sub-region containing a certain number of distributed targets. The upper layer uses DQN to train the UAV task scheduling strategy with the assistance of the resource allocation strategy trained in the lower layer. The DQN family of algorithms is among the most commonly used deep reinforcement learning approaches for problems with discrete action spaces.
The upper-layer trajectory planning results of the UAV are shown in Figure 8. The red dots represent the ground targets, the green dots and green dashed lines represent the UAV’s positions and trajectory, and the purple dots represent the ground processing units. To reduce the complexity of solving the problem, the following assumptions are made. (1) The dwell time in each sub-region is the total time needed to complete data collection for all targets in it, with the data collection time rounded up [44]. If a sub-region contains no targets, or all of its target data has already been collected, data processing is carried out in that sub-region for 30 seconds. (2) At each flight decision, the UAV may fly to one of the nearest 9 sub-regions, including the current one. If the selected sub-region contains targets whose data has not been collected, the data is processed while it is being collected. If the selected sub-region contains too many targets for the remaining cache, the UAV stays in the current sub-region. (3) An average of 30 seconds is allocated to the data collection of each ground target, and the data collection task ends once the total time is exceeded.
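The discrete action space in assumption (2) can be sketched as a neighborhood lookup on the 5×5 grid of 200 m cells. The row-major numbering of sub-regions and the clipping of the neighborhood at the border of the area are assumptions made here for illustration; how boundary cells are handled in the original implementation is not stated.

```python
GRID = 5          # 5 x 5 sub-regions
CELL_M = 200.0    # each sub-region is 200 m x 200 m


def region_index(x, y):
    """Map a position in metres to a row-major sub-region index (0..24)."""
    col = min(int(x // CELL_M), GRID - 1)
    row = min(int(y // CELL_M), GRID - 1)
    return row * GRID + col


def candidate_regions(index):
    """Current sub-region plus its neighbours, clipped at the area border."""
    row, col = divmod(index, GRID)
    return [r * GRID + c
            for r in range(max(row - 1, 0), min(row + 1, GRID - 1) + 1)
            for c in range(max(col - 1, 0), min(col + 1, GRID - 1) + 1)]


start = region_index(100.0, 100.0)   # UAV initial position [100, 100]
corner_actions = candidate_regions(start)
center_actions = candidate_regions(12)
```

An interior cell such as index 12 has the full set of 9 candidates, whereas the starting corner cell only has 4 under this clipping rule.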
For algorithm training, a large number of target distributions were randomly generated, subject to the constraints that the total number of targets does not exceed 100 and the number of targets in a single sub-region does not exceed 10. Tests were then conducted on four combinations: GA in the lower layer with GA in the upper layer, PPO in the lower layer with GA in the upper layer, GA in the lower layer with DQN in the upper layer, and PPO in the lower layer with DQN in the upper layer. The experimental results are shown in Figures 9 and 10. Reward, UAV energy consumption, and the amount of information collected are the key indicators for evaluating algorithm performance. The reward is dimensionless and reflects the overall effectiveness of the task scheduling algorithm; UAV energy consumption is measured in watt-hours (Wh) and indicates the total energy expended by the UAV to complete its tasks; and the amount of information collected is quantified in gigabytes (GB) and represents the total volume of effective information gathered during task scheduling.
When the storage space of the UAV is 128 GB and the number of ground targets is 20, 40, 60, 80, or 100, the optimization results of each algorithm are shown in Figure 9. PPO and GA in the lower layer have almost the same impact on the upper layer, so either lower-layer algorithm yields essentially the same performance. PPO requires a longer training time than GA; however, once trained, it can make decisions directly upon reaching any sub-region without re-solving the problem.
The proposed DQN-based ULN achieves UAV task scheduling under random target distributions. When faced with constrained target distributions, it can traverse the sub-regions to complete data collection. Compared with GA, its average total reward per round is up to 65% higher. It collects more data from ground targets, but its energy consumption is 20%–30% higher than that of GA. The total information collected follows the same trend as the total reward, with DQN outperforming GA. In addition, once trained, DQN can handle any ground target distribution that satisfies the assumptions, whereas GA must be recomputed for each new distribution.

6. Conclusions

This paper proposes a Hierarchical Deep Reinforcement Learning-based Task Scheduling and Resource Allocation Method (HDRL-TSRAM) for UAV-enabled data collection applications, addressing the challenges of energy efficiency and adaptability in dynamic environments. The complex joint optimization problem is decoupled into a two-layer hierarchical framework: the lower layer employs a Proximal Policy Optimization (PPO)-based algorithm to optimize the continuous actions (trajectory, communication power, CPU frequency), while the upper layer uses a Deep Q-Network (DQN) strategy for discrete task scheduling, which reduces computational complexity and enables efficient training in large-scale scenarios. A key element is the integration of the Lin-Kernighan-Helsgaun (LKH) heuristic with DRL: LKH solves the Traveling Salesman Problem (TSP) for local trajectory planning within sub-regions, and DRL adapts to global resource constraints and environmental changes. In the experiments of Section 5.2.2, the DQN-based upper layer achieves an average total reward up to 65% higher than the genetic algorithm (GA) baseline and collects more data, at the cost of 20–30% higher energy consumption. Unlike static optimization approaches that must be recomputed when the environment changes, the framework adapts to dynamic target priorities modeled as Markov processes, and its nested training architecture accelerates convergence by using the pre-trained lower-layer network for rapid reward feedback. Furthermore, the energy models for flight, communication, and computation, which are often overlooked in prior work, are integrated to balance data collection volume and energy efficiency, which is crucial for mission-critical applications such as disaster monitoring. Potential applications include post-earthquake reconnaissance, where UAVs autonomously prioritize high-impact areas while conserving energy.
Despite these advancements, future work should address limitations such as predefined priority transition assumptions and simplified maneuverability models, while extending the framework to multi-UAV systems for large-scale surveillance. Overall, HDRL-TSRAM sets a new benchmark for UAV task scheduling, bridging theoretical optimization with real-world applicability through scalability, adaptability, and energy-conscious design.

7. Discussion

While this study has yielded notable advancements in trajectory planning and resource allocation for UAV data collection missions, there remain several avenues for improvement. (1) The dynamic characteristics of the environment considered in this work pertain solely to changes in the priority of ground targets. In real-world scenarios, however, such dynamics may not be known in advance. Consequently, developing models that capture the dynamic priority of targets in unknown environments remains an open research question. (2) This study has primarily emphasized the efficacy of the algorithms employed. The environmental modeling for UAV data collection tasks, which comprises the channel, cache, computation, and UAV maneuverability models, is relatively rudimentary. The cache model does not account for the correlation of onboard data, while the UAV maneuverability model neglects acceleration, deceleration, and turning maneuvers. Further research is therefore warranted to make the environmental modeling more realistic.

References

  1. Hanscom, A.; Bedford, M. Unmanned aircraft system (UAS) service demand 2015-2035. Literature review and projections of future usage 2013.
  2. Liang, Y.; Xu, W.; Liang, W.; Peng, J.; Jia, X.; Zhou, Y.; Duan, L. Nonredundant information collection in rescue applications via an energy-constrained UAV. IEEE Internet of Things Journal 2018, 6, 2945–2958.
  3. Nguyen, M.T.; Nguyen, C.V.; Do, H.T.; Hua, H.T.; Tran, T.A.; Nguyen, A.D.; Ala, G.; Viola, F. UAV-assisted data collection in wireless sensor networks: A comprehensive survey. Electronics 2021, 10, 2603.
  4. Yuan, X.; Hu, Y.; Zhang, J.; Schmeink, A. Joint User Scheduling and UAV Trajectory Design on Completion Time Minimization for UAV-Aided Data Collection. IEEE Transactions on Wireless Communications 2022.
  5. Wei, Z.; Zhu, M.; Zhang, N.; Wang, L.; Zou, Y.; Meng, Z.; Wu, H.; Feng, Z. UAV-assisted data collection for internet of things: A survey. IEEE Internet of Things Journal 2022, 9, 15460–15483.
  6. Khan, A.; Gupta, S.; Gupta, S.K. Emerging UAV technology for disaster detection, mitigation, response, and preparedness. Electronics 2022, 39, 905–955.
  7. Fu, Y.; Li, D.; Tang, Q.; Zhou, S. Joint speed and bandwidth optimized strategy of UAV-assisted data collection in post-disaster areas. In Proceedings of the 2022 20th Mediterranean Communication and Computer Networking Conference (MedComNet). IEEE; 2022; pp. 39–42.
  8. Xie, Z.; Song, X.; Cao, J.; Qiu, W. Providing aerial MEC service in areas without infrastructure: A tethered-UAV-based energy-efficient task scheduling framework. IEEE Internet of Things Journal 2022, 9, 25223–25236.
  9. Halder, S.; Ghosal, A.; Conti, M. Dynamic Super Round-Based Distributed Task Scheduling for UAV Networks. IEEE Transactions on Wireless Communications 2022, 22, 1014–1028.
  10. Zhou, S.; Cheng, Y.; Lei, X.; Peng, Q.; Wang, J.; Li, S. Resource allocation in UAV-assisted networks: A clustering-aided reinforcement learning approach. IEEE Transactions on Vehicular Technology 2022, 71, 12088–12103.
  11. Liu, B.; Wan, Y.; Zhou, F.; Wu, Q.; Hu, R.Q. Resource allocation and trajectory design for MISO UAV-assisted MEC networks. IEEE Transactions on Vehicular Technology 2022, 71, 4933–4948.
  12. Zhang, F.; Ding, Y.; Cao, M.; Wu, M.; Lu, W.; Nallanathan, A. Energy Efficiency Optimization of RIS-Assisted UAV Search-Based Cognitive Communication in Complex Obstacle Avoidance Environments. IEEE Transactions on Vehicular Technology.
  13. Wang, Z.; Wen, J.; He, J.; Yu, L.; Li, Z. Resource and Trajectory Optimization for Secure Enhancement in IRS-Assisted UAV-MEC Systems. IEEE Transactions on Cognitive Communications and Networking.
  14. Chen, Y.; Yang, Y.; Wu, Y.; Huang, J.; Zhao, L. Joint Trajectory Optimization and Resource Allocation in UAV-MEC Systems: A Lyapunov-Assisted DRL Approach. IEEE Transactions on Services Computing.
  15. Shen, L.; Zhang, H.; Wang, N.; Cui, Y.; Cheng, X.; Mu, X. Joint Clustering and 3-D UAV Deployment for Delay-Aware UAV-Enabled MTC Data Collection Networks. IEEE Sensors Letters 2024, 8, 1–4.
  16. Gong, J.; Chang, T.H.; Shen, C.; Chen, X. Aviation time minimization of UAV for data collection from energy constrained sensor networks. In Proceedings of the 2018 IEEE Wireless Communications and Networking Conference (WCNC). IEEE; 2018; pp. 1–6.
  17. Gong, J.; Chang, T.H.; Shen, C.; Chen, X. Flight time minimization of UAV for data collection over wireless sensor networks. IEEE Journal on Selected Areas in Communications 2018, 36, 1942–1954.
  18. Zhu, K.; Xu, X.; Han, S. Energy-efficient UAV trajectory planning for data collection and computation in mMTC networks. In Proceedings of the 2018 IEEE Globecom Workshops (GC Wkshps). IEEE; 2018; pp. 1–6.
  19. Song, C.; Zhang, X.; She, Y.; Li, B.; Zhang, Q. Trajectory Planning for UAV Swarm Tracking Moving Target Based on an Improved Model Predictive Control Fusion Algorithm. IEEE Internet of Things Journal 2025, 1–1.
  20. Zhang, H.; Li, B.; Rong, Y.; Zeng, Y.; Zhang, R. Joint Optimization of Transmit Power and Trajectory for UAV-Enabled Data Collection With Dynamic Constraints. IEEE Transactions on Communications 2025, 1–1.
  21. Li, J.; Shi, Y.; Dai, C.; Yi, C.; Yang, Y.; Zhai, X.; Zhu, K. A Learning-Based Stochastic Game for Energy Efficient Optimization of UAV Trajectory and Task Offloading in Space/Aerial Edge Computing. IEEE Transactions on Vehicular Technology 2025, 1–16.
  22. Wang, M.; Zhang, D.; Wang, B.; Li, L. Dynamic Trajectory Planning for Multi-UAV Multi-Mission Operations Using a Hybrid Strategy. IEEE Transactions on Aerospace and Electronic Systems 2025, 1–19.
  23. Zhang, X.; Yu, X.; Cai, H. Joint Trajectory and Resource Allocation Optimization for UAV-Assisted Edge Computing. In Proceedings of the 2024 16th International Conference on Wireless Communications and Signal Processing (2024 WCSP). IEEE; 2024; pp. 1068–1073.
  24. Qin, P.; Wu, X.; Ding, R.; Fu, M.; Zhao, X.; Chen, Z.; Zhou, H. Joint Resource Allocation and UAV Trajectory Design for D2D-Assisted Energy-Efficient Air–Ground Integrated Caching Network. IEEE Transactions on Vehicular Technology 2024, 73, 17558–17571.
  25. Khan, N.; Ahmad, A.; Alwarafy, A.; Shah, M.; Lakas, A.; Azeem, M. Efficient Resource Allocation and UAV Deployment in STAR-RIS and UAV-Relay Assisted Public Safety Networks for Video Transmission. IEEE Open Journal of the Communications Society 2025, 1–1.
  26. Qian, J.; Yan, Y.; Gao, F.; Ge, B.; Wei, M.; Shangguan, B.; He, G. C3DGS: Compressing 3D Gaussian Model for Surface Reconstruction of Large-Scale Scenes Based on Multiview UAV Images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 2025, 18, 4396–4409.
  27. Aldao, E.; Veiga-López, F.; Miguel González-deSantos, L.; González-Jorge, H. Enhancing UAV Classification With Synthetic Data: GMM LiDAR Simulator for Aerial Surveillance Applications. IEEE Sensors Journal 2024, 24, 26960–26970.
  28. Qiu, J.; Kuang, Z.; Huang, Z.; Lin, S. Security Offloading Scheduling and Caching Optimization Algorithm in UAV Edge Computing. IEEE Systems Journal 2025, 1–11.
  29. Wan, L.; Wang, J.; Sun, L.; Li, K.; Xiong, X.; Lin, Y. Heterogeneous UAV Resource Scheduling for Dynamic Time Sensitive Target Detection and Interference. IEEE Internet of Things Journal 2025, 1–1.
  30. Chen, X.; Chen, X. The UAV dynamic path planning algorithm research based on Voronoi diagram. In Proceedings of the 26th Chinese Control and Decision Conference (2014 CCDC). IEEE; 2014; pp. 1069–1071.
  31. Planning, P.R. On the Probabilistic Foundations of Probabilistic Roadmap Planning. Proceedings of the Robotics Research: Results of the 12th International Symposium ISRR. Springer Science and Business Media. 2007; Vol. 28, p. 83.
  32. Liu, X. Four alternative patterns of the Hilbert curve. Applied Mathematics and Computation 2004, 147, 741–752.
  33. Mokrane, A.; Braham, A.C.; Cherki, B. UAV path planning based on dynamic programming algorithm on photogrammetric DEMs. In Proceedings of the 2020 International Conference on Electrical Engineering (ICEE). IEEE; 2020; pp. 1–5.
  34. Binney, J.; Sukhatme, G.S. Branch and bound for informative path planning. In Proceedings of the 2012 IEEE International Conference on Robotics and Automation. IEEE; 2012; pp. 2147–2154.
  35. Samir, M.; Sharafeddine, S.; Assi, C.; Nguyen, T.; Ghrayeb, A. UAV trajectory planning for data collection from time-constrained IoT devices. IEEE Transactions on Wireless Communications 2019, 19, 34–46.
  36. Sun, Y.; Babu, P.; Palomar, D.P. Majorization-minimization algorithms in signal processing, communications, and machine learning. IEEE Transactions on Signal Processing 2016, 65, 794–816.
  37. Xu, Y.; Wang, J.; Wang, J.; Que, X.; Lu, D. Trajectory Design and Resource Allocation for UAV-Assisted Computation Offloading. In Proceedings of the 2024 IEEE 7th International Conference on Information Systems and Computer Aided Education (2024 ICISCAE). IEEE; 2024; pp. 963–967.
  38. Nguyen, M.; Ajib, W.; Zhu, W. Joint UAV Trajectory Control and Channel Assignment for UAV-Based Networks with Wireless Backhauling. In Proceedings of the 2024 IEEE 99th Vehicular Technology Conference (VTC2024-Spring). IEEE; 2024; pp. 1–5.
  39. Nguyen, K.K.; Duong, T.Q.; Do-Duy, T.; Claussen, H.; Hanzo, L. 3D UAV trajectory and data collection optimisation via deep reinforcement learning. IEEE Transactions on Communications 2022, 70, 2358–2371.
  40. Fu, S.; Tang, Y.; Wu, Y.; Zhang, N.; Gu, H.; Chen, C.; Liu, M. Energy-efficient UAV-enabled data collection via wireless charging: A reinforcement learning approach. IEEE Internet of Things Journal 2021, 8, 10209–10219.
  41. Kurunathan, H.; Li, K.; Ni, W.; Tovar, E.; Dressler, F. Deep reinforcement learning for persistent cruise control in UAV-aided data collection. In Proceedings of the 2021 IEEE 46th Conference on Local Computer Networks (LCN). IEEE; 2021; pp. 347–350.
  42. Oubbati, O.S.; Atiquzzaman, M.; Lim, H.; Rachedi, A.; Lakas, A. Synchronizing UAV teams for timely data collection and energy transfer by deep reinforcement learning. IEEE Transactions on Vehicular Technology 2022, 71, 6682–6697.
  43. Liao, Z.; Li, H.; Cai, W.; Zhong, Y.; Zhang, X. Phase Sensitivity-Based Fringe Angle Optimization in Telecentric Fringe Projection Profilometry. IEEE Transactions on Instrumentation and Measurement 2025, 74, 1–10.
  44. Gao, M.; Xu, G.; Song, Z.; Cheng, Y.; Niyato, D. Performance Analysis of Random 3D mmWave-Assisted UAV Communication System. IEEE Transactions on Vehicular Technology 2022, 73, 19169–19185.
Figure 1. Scenario of the UAV Data Collection Task.
Figure 2. Framework of HDRL-TSRAM.
Figure 3. Training scheme of the LLN based on the PPO framework.
Figure 4. Training scheme of the ULN based on the DQN framework.
Figure 5. Results of UAV trajectory planning in the lower layer.
Figure 6. Training results of different algorithms in the lower layer.
Figure 7. Comparison of data collection performance with and without priority changes.
Figure 8. Results of UAV task scheduling planning in the upper layer.
Figure 9. Performance comparison of the task scheduling algorithms in terms of UAV energy consumption.
Figure 10. Performance comparison of the task scheduling algorithms in terms of the amount of collected information.
Table 5. Ground Target Priority Transfer Matrix

| Priority | I | II | III | IV |
|---|---|---|---|---|
| I | 0.8 | 0.1 | 0.08 | 0.02 |
| II | 0 | 0.8 | 0.16 | 0.04 |
| III | 0 | 0 | 0.9 | 0.1 |
| IV | 0 | 0 | 0 | 1 |
Table 6. The amount of information for targets of different priorities.

| Priority | I | II | III | IV |
|---|---|---|---|---|
| The amount of information | 1 | 2 | 3 | 4 |
Table 7. The Parameters of the Algorithms in the Lower Layer

| Algorithm | Parameter | Value |
|---|---|---|
| PPO | Learning Rate | 3e-4 |
| PPO | Network Structure | 128 × 128 |
| PPO | Batch Size | 64 |
| PPO | Discount Factor | 0.99 |
| SAC | Learning Rate | 3e-4 |
| SAC | Network Structure | 128 × 128 |
| SAC | Batch Size | 64 |
| SAC | Discount Factor | 0.99 |
| GA | Population Size | 200 |
| GA | Iterations | 1000 |
Table 8. The Parameters of the Algorithms in the Upper Layer

| Algorithm | Parameter | Value |
|---|---|---|
| DQN | Energy Consumption Penalty Coefficient | 1e-5 |
| DQN | Learning Rate | 3e-5 |
| DQN | Greed Coefficient | 0.1 |
| DQN | Size of Experience Buffer | 500000 |
| DQN | Soft Update Parameter | 0.005 |
| DQN | Batch Size | 256 |
| DQN | Discount Factor | 0.99 |
| DQN | Training Steps | 10000000 |
| GA | Population Size | 200 |
| GA | Iterations | 1000 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.