Joint Power and Bandwidth Allocation for UAV Backhaul Networks: A Hierarchical Learning Approach

Abstract: Using unmanned aerial vehicles (UAVs) as relays is an effective way to extend coverage. It can also alleviate congestion and increase throughput, especially in UAV networks. However, since the energy of UAVs is limited and the resources in UAV networks are scarce, how to optimize the network delay performance under these constraints deserves careful investigation. Moreover, different resources, e.g., power and bandwidth, are coupled, which makes the optimization more complex. This article investigates joint power and bandwidth allocation in UAV backhaul networks, considering both delay performance and resource utilization efficiency. Given the heterogeneous locations of different UAVs, we formulate the optimization problem as a Stackelberg game in which the relay UAV acts as the leader and the extended UAVs act as followers. Their utility functions take both the delay and the resource consumption into account. To capture the competitive relationship among followers, the follower sub-game is proved to be an exact potential game, which guarantees the existence of a Nash equilibrium (NE). The existence of a Stackelberg equilibrium (SE) is then proved. We use a hierarchical learning algorithm (HLA) to find the best resource allocation strategies, which also reduces the computational complexity. Simulation results demonstrate the effectiveness of the proposed method.


Introduction
In recent years, unmanned aerial vehicles (UAVs) have been widely applied in military, commercial and civilian activities [1]. They can gather images, text or video and deliver them to ground control stations (GCSs) [2], and they can carry out disaster rescue or deliver supplies in extreme environments where infrastructure or necessary services are absent [3]. To accomplish such tasks, UAV networks tend to be large in scale, which makes guaranteeing performance difficult. In addition, since the energy of each UAV is constrained and the available spectrum is scarce, how to allocate resources reasonably should be well studied. Therefore, this article focuses on power and bandwidth allocation to enhance the performance of UAV networks.
Backhaul has long been the bottleneck of ground networks, and several researchers have resorted to UAVs to tackle this problem. For example, UAVs have been used to alleviate congestion in high-demand and overloaded small cell (SC) networks [4] [5] [6], to allocate resources to users on demand [7] [8], and to extend coverage at disaster scenes [9]. However, to the best of our knowledge, a backhaul scheme within UAV networks has not been studied in existing work, and it raises several challenges. First, battery-powered UAVs need energy to support their hardware and mobility. This energy is finite, so it must be managed efficiently to extend the lifespan of UAV networks [10] [11]. Second, the heavy communication load and the large scale of UAV networks make resources even scarcer. Hence, performance cannot be pursued blindly without taking resource utilization efficiency into account. Third, the resources of different UAVs, e.g., power and bandwidth, are coupled, which makes resource optimization challenging. Therefore, allocating resources reasonably in UAV networks under a backhaul scheme is a difficult task.
To consider resource utilization efficiency and delay improvement simultaneously, this paper proposes a new resource allocation strategy based on the tradeoff between resource consumption and delay. To this end, we construct utility functions in which delay and resource consumption constrain each other. To fully utilize the bandwidth, the relay small UAV (R-UAV) uses a half-duplex mechanism, and we propose a two-phase process to help the extended small UAVs (E-UAVs) deliver information. Specifically, in the first phase (E-UAVs to R-UAV), the E-UAVs divide the total bandwidth evenly and select their transmission powers so that their transmissions to the R-UAV finish at as nearly the same time as possible. In the second phase (R-UAV to the cluster head large UAV), the R-UAV forwards all the received information using the total bandwidth. Even if the amounts of information are heterogeneous, the E-UAVs can save energy and make full use of the bandwidth by selecting appropriate transmission powers in the first phase. This resource allocation method makes the UAV backhaul network more efficient, improving both resource allocation and delay.
The Stackelberg game is often used to solve distributed decision problems in wireless networks [12] [13] [14] [15] where entities have heterogeneous characteristics. In this article, small UAVs are classified into R-UAVs and E-UAVs according to their locations. A small UAV that can deliver information to the cluster head large UAV in one hop is located in the core network and can serve as a relay. The others, located in the extended network, require the relay UAV to help deliver their information. The heterogeneity of these two kinds of UAVs motivates us to apply the Stackelberg game, in which the E-UAVs act as followers and the R-UAV acts as the leader. The backward induction method [16] is used to find the Stackelberg equilibrium (SE) of the game. The lower-level sub-game of the followers is solved first and is proved to be an exact potential game, which has at least one Nash equilibrium (NE). The leader then finds its optimal strategy based on the observed responses of the followers. To solve the problem in an unknown channel environment, we use a hierarchical learning algorithm (HLA) [17], which helps users learn from their reward history and converge to an SE.
The main contributions are summarized as follows:
• We investigate the joint optimization of delay and resource consumption for UAV backhaul networks in which the coverage of the cluster head is extended. A small UAV located in the core network can act as a relay and help small UAVs in the extended network deliver information to the cluster head.
• Considering the heterogeneity of UAVs in backhaul networks, we formulate the resource allocation and delay improvement problems as a Stackelberg game. The existence of an SE is proved by backward induction. The lower-level sub-game is proved to be an exact potential game, which guarantees the existence of an NE. The best strategy of the leader is then obtained from the best responses of the followers.
• To solve the problem in an unknown environment, a hierarchical learning algorithm (HLA) is proposed to reach the SE through interactive learning at the two levels. Users update their strategies according to their historical rewards. Simulation results verify the effectiveness of the proposed method.
The rest of this paper is organized as follows. Section 2 summarizes the related work. Section 3 describes the system model and problem formulation. Section 4 formulates the Stackelberg game for optimizing the delay and resource allocation. Section 5 proposes the HLA to find the SE of the game. Simulation results are discussed in Section 6. Finally, Section 7 concludes the paper.

Related Work
UAV backhaul communication has been studied in many works [18] [19] [20] [21]. In information collection and transmission scenarios, a UAV network connected through a relay can be more robust than one in which several UAVs respond to failures individually. However, some problems, e.g., delay, may appear. The authors of [18] proposed a cyclical multiple access mechanism in which users transmit information to the relay UAV in a cyclical time-division manner, which expands the coverage of the relay UAV. In [19], the UAV relay connects with the sink and guarantees delay-free delivery only within a desired time window. In [20], a resource allocation mechanism was proposed for packet delay minimization, but it consumes all of the constrained resources. These works all aim to optimize delay, but at the expense of lower throughput and higher resource consumption. In large-scale UAV backhaul networks, spectrum and power are scarce, so an optimization mechanism that provides both reasonable resource allocation and a delay performance guarantee is urgently needed.
Resource allocation in UAV relay networks has drawn much attention [10] [11] [22] [23]. The authors of [11] jointly optimized the transmission power, bandwidth and position of the relay to maximize throughput. In [22], the author proposed a schedule that maximizes the minimum signal-to-interference-plus-noise ratio by optimizing power and time-frequency blocks. In [23], UAVs form floating relay cells inside the macro cell to achieve frequency reuse and coverage extension. In [10], a UAV maximizes its energy efficiency by optimizing its power while sharing spectrum with the primary network. The drawback of the above works is that they do not consider delay performance, so they are not well suited to information collection and transmission scenarios such as environment monitoring.
Several works address the optimization from a game-theoretic perspective [24] [25]. In [24], a network formation game was proposed to connect small base stations (SBSs) to the core network through multi-hop UAVs while optimizing delay and throughput. The work in [25] selected the optimal relay and allocated power using a Stackelberg game without complete channel state information (CSI), using an auction mechanism to obtain comparable performance. Game theory uses mathematical models to represent the interaction of incentives and studies optimal decision-making under conflict. Unlike other games, the Stackelberg game is suited to problems in which players are heterogeneous. In this paper, we employ a Stackelberg game to analyze the interactions among players' decisions because of the heterogeneity of the R-UAV and the E-UAVs. The leader has priority over the followers and therefore acts first. In the UAV backhaul network considered here, the R-UAV acts as the leader and decides the total bandwidth, while the E-UAVs are followers and decide their transmission powers, which realizes the tradeoff between resource allocation and delay.
Existing works studied the optimization of UAV relays in terms of placement, resource allocation and throughput performance. Our work differs from them as follows:
• We consider delay performance enhancement and the optimization of resource allocation simultaneously. To this end, we design utility functions and use the Stackelberg game, which is suited to modeling the interaction among different selfish players.
• We consider the tradeoff between energy consumption and delay, which helps extend the lifespan of large-scale battery-powered UAV networks.
• Facing the growing scarcity of spectrum in UAV networks, we use the bandwidth in two phases and improve bandwidth utilization by adjusting the power strategies.
• We account for the coupling among different resources, which benefits efficient resource allocation.

System model
We consider a UAV backhaul network consisting of a cluster head large UAV H, a relay small UAV (R-UAV) M and a set of extended small UAVs (E-UAVs) N. The R-UAV extends the communication coverage of the cluster head large UAV, which helps more small UAVs transmit information to the ground control station (GCS). We assume that the number of E-UAVs does not exceed the load capacity of the cluster head. In addition, a small UAV that acts as a relay must meet two conditions: a) the UAV is idle; b) the UAV is located in the core network, meaning that it can deliver information to the cluster head in one hop. We propose a backhaul allocation scheme in which the E-UAVs deliver their information to the cluster head through the R-UAV.
As shown in Fig. 1, the system consists of one R-UAV M and a set of E-UAVs $\mathcal{N} = \{1, 2, \ldots, N\}$. To eliminate interference, all E-UAVs transmit on orthogonal channels. Each E-UAV's available power set is $\mathcal{P} = \{p_1, p_2, \ldots, p_L\}$. For simplicity, we assume that all E-UAVs start transmitting to the R-UAV at the same time, with information transmission demands $T = \{t_1, t_2, \ldots, t_N\}$. The R-UAV M works in half-duplex mode. First, after the E-UAVs report their demands $t_i \in T$ to the R-UAV, the R-UAV decides the total bandwidth $B_j \in \mathcal{B}$ and broadcasts it to all the E-UAVs. Second, each E-UAV i decides its own transmission power $p_i$ and transmits its entire demand to the R-UAV, with a delay as close as possible to that of the others. In this phase, the bandwidth is divided evenly among the E-UAVs, i.e., $b_i = B_j / N$. Third, after the last E-UAV has finished transmitting, the R-UAV M forwards all the received information to the cluster head large UAV H using the total bandwidth $B_j$.
Following the UAV-to-UAV multi-hop link model in [24], we assume that UAVs transmit over the sub-6 GHz band. The free-space path loss $\xi_{i,M}$ is determined by $f_c$, the system center frequency (in Hz), and $d_{i,M}$, the distance between E-UAV i and the R-UAV M.
Following [26], the signal-to-noise ratio (SNR) $r_{i,M}$ between E-UAV i and the R-UAV M is determined by the transmission power $p_i$ from E-UAV i to the R-UAV M, the total link loss $L_{i,M} = \xi_{i,M} + \eta_{LoS}$, and the noise $\sigma$. The rate of the air-to-air (A2A) link is $R_{i,M} = b_i \log_2(1 + r_{i,M})$. The delay $d_i$ of E-UAV i delivering its information to the R-UAV M is $d_i = t_i / R_{i,M}$. We assume that all links between small UAVs are line-of-sight (LoS).
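The link model above can be illustrated with a short numerical sketch. It assumes the standard free-space path loss expression in dB for distance in metres and frequency in Hz, and an SNR of the form transmit power divided by linear-scale loss times noise power. The function names, the 2 GHz carrier, the 1 dB LoS attenuation and the noise power are illustrative assumptions, not values taken from this paper.

```python
import math

def free_space_path_loss_db(distance_m, fc_hz):
    """Free-space path loss in dB (distance in metres, carrier frequency in Hz)."""
    return 20 * math.log10(distance_m) + 20 * math.log10(fc_hz) - 147.55

def e_uav_delay(t_bits, p_watt, b_hz, distance_m,
                fc_hz=2e9, eta_los_db=1.0, noise_watt=1e-13):
    """Delay d_i = t_i / R_{i,M} for one E-UAV under the assumptions above."""
    loss_db = free_space_path_loss_db(distance_m, fc_hz) + eta_los_db  # L = xi + eta_LoS
    snr = p_watt / (10 ** (loss_db / 10) * noise_watt)  # assumed SNR form
    rate = b_hz * math.log2(1 + snr)                     # R = b * log2(1 + r)
    return t_bits / rate

# Example: N = 4 E-UAVs share B_j = 20 MHz evenly, so b_i = B_j / N.
B_j, N = 20e6, 4
d = e_uav_delay(t_bits=3e6, p_watt=0.5, b_hz=B_j / N, distance_m=500)
```

Under these assumptions, raising either the transmission power or the bandwidth share shortens the delay, which is the coupling between the two resources that the game exploits.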

Problem formulation
The objective of our study is to jointly optimize the delay and the energy consumption. Inspired by [16], the utility function is defined as profit minus cost. In our system, we seek a tradeoff between the total delay and the power or bandwidth consumption: the larger the power or bandwidth, the higher the transmission rate and the smaller the delay.
We want all the E-UAVs to finish transmitting at as nearly the same time as possible so that the bandwidth is fully used, because they divide the total bandwidth $B_j$ evenly. The R-UAV M waits for the last E-UAV to finish and then delivers the received information to the cluster head large UAV H. The power selected by E-UAV i and the bandwidth selected by the R-UAV M are denoted by $p_i \in \mathcal{P}$ and $B_j \in \mathcal{B}$, respectively. The utility function of each E-UAV i, equation (4), consists of two terms, where c is the normalization coefficient of the power. The first term represents the delay of the last E-UAV to finish its one-hop transmission to the R-UAV M. The second term is the energy consumption of E-UAV i, calculated as power multiplied by delay. Each E-UAV i therefore selects its power to maximize its utility function, and the corresponding optimization problem is (P1): $\max_{p_i \in \mathcal{P}} u_i(p_i, p_{-i}, B_j)$.
For the R-UAV M, the second-hop delay is the time needed to forward the received information from the R-UAV to the cluster head. Its utility function, equation (7), is composed of the negative of the total system delay and its bandwidth cost, where $C_L$ and $P_L$ denote the bandwidth cost coefficient and the transmission power of the R-UAV M, respectively. The second term represents the delay from the R-UAV M to the cluster head H. The optimization of the bandwidth selection is (P2): $\max_{B_j \in \mathcal{B}} u_0(p, B_j)$.
However, because the available power and bandwidth take discrete values and the environment is uncertain, P1 and P2 cannot be solved by traditional optimization methods. In the following, we model the problem as a Stackelberg game and propose the HLA to find the SE.
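The two utilities are described in this section term by term. One way to write them that agrees with that description is sketched below. These expressions are assumptions for illustration: in particular, the form of the relay delay $d_M$ and the symbol $r_{M,H}$ for the SNR of the relay-to-cluster-head link (which would depend on $P_L$) are not taken from the original equations.

```latex
% Assumed forms, consistent with the verbal description of the utilities
\begin{align}
u_i(p_i, p_{-i}, B_j) &= -\max_{n \in \mathcal{N}} d_n \;-\; c\, p_i\, d_i, \\
d_M &= \frac{\sum_{n \in \mathcal{N}} t_n}{B_j \log_2\!\left(1 + r_{M,H}\right)}, \\
u_0(p, B_j) &= -\Big(\max_{n \in \mathcal{N}} d_n + d_M\Big) \;-\; C_L B_j .
\end{align}
```

With these forms, a follower that is not the slowest can lower its power to save energy until its delay approaches the largest delay, which matches the stated aim of having all E-UAVs finish at nearly the same time.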

Stackelberg Game of UAV Backhaul Networks
In this section, we formulate the joint optimization of delay and resource allocation as a Stackelberg game. The power selection game among the E-UAVs is proved to be an exact potential game and therefore has a Nash equilibrium. The existence of a Stackelberg equilibrium is then established by combining this result with the finite bandwidth strategy set of the leader. Finally, a hierarchical stochastic learning algorithm is proposed to find the optimal solution.

Stackelberg game model
Mathematically, the resource allocation game is denoted by $G = \{\mathcal{N} \cup \{M\}, \mathcal{B}, \mathcal{P}, \{u_i\}_{i \in \mathcal{N}}, u_0\}$, where $\mathcal{N} = \{1, 2, \ldots, N\}$ is the set of E-UAVs and M is the R-UAV. $\mathcal{P}$ is the set of selectable power strategies of the E-UAVs and $\mathcal{B}$ is the set of selectable bandwidth strategies of the R-UAV M. $u_i$ and $u_0$ are the utility functions of E-UAV i and the R-UAV, respectively. The R-UAV M acts as the leader and moves first. The E-UAVs act as followers and rationally select their best responses to the leader's action. We consider a Stackelberg game with one leader and multiple followers, and solving it means finding the Stackelberg equilibrium (SE).
From the E-UAVs' side, the objective is to reduce the maximal delay of the followers' transmissions at as little energy cost as possible, and the utility function is given by equation (4). E-UAV i adjusts its power strategy to maximize its utility, i.e., $p_i^* = \arg\max_{p_i \in \mathcal{P}} u_i(p_i, p_{-i}, B_j)$.
The lower-level sub-game of the E-UAVs can be written as $G_f: \max_{p_i \in \mathcal{P}} u_i(p_i, p_{-i}, B_j), \ \forall i \in \mathcal{N}$, where all the E-UAVs in $\mathcal{N}$ are players and their available strategy set is the power set $\mathcal{P}$. Every E-UAV independently and selfishly selects a power strategy to maximize its own utility function.
For the R-UAV M, the objective is to jointly optimize the system delay and the system bandwidth cost. Its utility function is given by equation (7), and its optimization is $B_j^* = \arg\max_{B_j \in \mathcal{B}} u_0(p, B_j)$. The sub-game of the leader can be written as $\max_{B_j \in \mathcal{B}} u_0(p, B_j)$, where $\mathcal{B}$ is the set of selectable bandwidth strategies of the R-UAV M. Every E-UAV has the same power strategy set $\mathcal{P}$.

Stackelberg Game Analysis
Because of the heterogeneity of UAVs in backhaul networks, we formulate a Stackelberg game to solve the joint optimization. In this model, the R-UAV acts as the leader and the E-UAVs act as followers. The leader acts first, and the followers rationally choose their best responses to the observed leader action. In this subsection, we prove that the follower sub-game is an exact potential game, analyze its NE, and finally prove the existence of an SE.

Definition 1 (Exact Potential Game) [27]:
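A strategic-form game $G_f$ is an exact potential game if there exists a potential function $\phi$ such that, for any player's unilateral deviation, the change in $\phi$ equals the change in that player's utility function. Mathematically, $u_i(\tilde{p}_i, p_{-i}) - u_i(p_i, p_{-i}) = \phi(\tilde{p}_i, p_{-i}) - \phi(p_i, p_{-i})$ for all $i \in \mathcal{N}$, where $\tilde{p}_i$ is the action of user i after the unilateral deviation.
Definition 2 (Nash Equilibrium): A power profile $p^* = (p_1^*, \ldots, p_N^*)$ is a pure-strategy NE of $G_f$ if no player can improve its utility by deviating unilaterally, i.e., $u_i(p_i^*, p_{-i}^*) \ge u_i(p_i, p_{-i}^*)$ for all $i \in \mathcal{N}$ and all $p_i \in \mathcal{P}$, where $p_{-i}^*$ represents the actions of all participants except participant i.
Theorem 1: The follower sub-game $G_f$ with a given bandwidth strategy $B_j$ is an exact potential game and has at least one pure-strategy NE point.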

Proof.
Following [16], we construct a potential function $\phi$. Suppose E-UAV k unilaterally changes its action from $p_k$ to $\tilde{p}_k$, and compare the resulting change in its utility function with the resulting change in the potential function. Because the E-UAVs transmit on orthogonal channels with equal bandwidth shares, the change of E-UAV k's action does not affect the delays of the others. In addition, the actions of the other E-UAVs do not influence the change of the potential function caused by E-UAV k changing its power. The two changes are therefore equal, so the game $G_f$ is an exact potential game and has at least one pure-strategy NE point. The optimal pure-strategy NE of $G_f$ is the global optimal solution of problem P1.
Definition 3 (Stackelberg Equilibrium) [16]: If $B_j^*$ maximizes the utility function of the R-UAV M and $p^*$ is the best response of all the E-UAVs, the strategy combination $(B_j^*, p^*)$ is called a Stackelberg equilibrium point of the Stackelberg game. Mathematically, for any strategy combination $(B_j, p)$, the following conditions are satisfied: $u_0(p^*, B_j^*) \ge u_0(p^*, B_j)$ and $u_i(p_i^*, p_{-i}^*, B_j^*) \ge u_i(p_i, p_{-i}^*, B_j^*)$ for all $i \in \mathcal{N}$, where $p_{-i}^*$ denotes the best-response strategies of all the E-UAVs except E-UAV i. The Stackelberg game is a non-cooperative, multi-stage dynamic game. At an SE point, no participant in the hierarchy can increase its own utility by changing only its own strategy.
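The exact potential property can also be checked numerically on a small instance. The sketch below assumes a follower utility of the form "negative largest delay minus c times own energy" and a potential of the form "negative largest delay minus c times total energy". Both forms, the channel gains and the noise power are illustrative assumptions rather than the expressions used in the proof.

```python
import itertools
import math

NOISE = 1e-13  # assumed noise power (W)

def delays(powers, demands, gains, b_each):
    """d_n = t_n / (b * log2(1 + g_n * p_n / noise)) for every E-UAV."""
    return [t / (b_each * math.log2(1 + g * p / NOISE))
            for p, t, g in zip(powers, demands, gains)]

def utility(i, powers, demands, gains, b_each, c=1.0):
    """Assumed follower utility: -(largest delay) - c * (own energy)."""
    d = delays(powers, demands, gains, b_each)
    return -max(d) - c * powers[i] * d[i]

def potential(powers, demands, gains, b_each, c=1.0):
    """Assumed potential: -(largest delay) - c * (total energy)."""
    d = delays(powers, demands, gains, b_each)
    return -max(d) - c * sum(p * dn for p, dn in zip(powers, d))

def is_exact_potential(power_set, demands, gains, b_each, tol=1e-9):
    """Check every profile and every unilateral deviation by brute force."""
    n = len(demands)
    for profile in itertools.product(power_set, repeat=n):
        for k in range(n):
            for new_p in power_set:
                deviated = profile[:k] + (new_p,) + profile[k + 1:]
                du = (utility(k, deviated, demands, gains, b_each)
                      - utility(k, profile, demands, gains, b_each))
                dphi = (potential(deviated, demands, gains, b_each)
                        - potential(profile, demands, gains, b_each))
                if abs(du - dphi) > tol:
                    return False
    return True

ok = is_exact_potential(power_set=(0.5, 1.0, 1.5),
                        demands=(2e6, 4e6, 6e6),
                        gains=(1e-10, 2e-10, 5e-11),
                        b_each=5e6)
```

The check passes because a deviation by E-UAV k changes only its own delay and energy term, so the energy terms of the other E-UAVs cancel in the potential difference.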
Next, we will analyze the existence of SE in the proposed Stackelberg game.

Theorem 2:
The Stackelberg game G formulated for optimizing delay and resource consumption always has an SE point.
Proof. Given any fixed bandwidth strategy $B_j$, the Stackelberg game degenerates into the non-cooperative game $G_f = \{\mathcal{N}, \mathcal{P}, u_i(p_i, p_{-i}, B_j)\}$. We have established the existence of a Nash equilibrium in the lower-level sub-game through the exact potential game property, so an equilibrium $NE(B_j)$ always exists. In the leader sub-game, the bandwidth strategy set is finite, so an optimal bandwidth strategy of the R-UAV can always be found: $B_j^* = \arg\max_{B_j \in \mathcal{B}} u_0(NE(B_j), B_j)$. Therefore, $(B_j^*, NE(B_j^*))$ constitutes an SE of the proposed Stackelberg game.
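The backward induction argument in this proof can be sketched by enumeration on a small instance: for each bandwidth, find a follower NE by best-response dynamics, then let the leader pick the bandwidth with the largest utility. The utility forms, the relay link gain, the relay power and the bandwidth cost below are illustrative assumptions, and enumeration only stands in for the learning algorithm when the channel is known.

```python
import math

NOISE = 1e-13  # assumed noise power (W)

def rate(b_hz, p, gain):
    return b_hz * math.log2(1 + gain * p / NOISE)

def follower_utility(i, powers, B, demands, gains, c):
    b_each = B / len(powers)                       # b_i = B_j / N
    d = [t / rate(b_each, p, g) for p, t, g in zip(powers, demands, gains)]
    return -max(d) - c * powers[i] * d[i]          # assumed form

def follower_ne(B, power_set, demands, gains, c):
    """Best-response dynamics. In a finite exact potential game every strict
    improvement raises the potential, so the loop stops at a pure NE."""
    powers = [power_set[0]] * len(demands)
    improved = True
    while improved:
        improved = False
        for i in range(len(powers)):
            def u(p):
                trial = powers[:i] + [p] + powers[i + 1:]
                return follower_utility(i, trial, B, demands, gains, c)
            best = max(power_set, key=u)
            if u(best) > u(powers[i]) + 1e-12:
                powers[i] = best
                improved = True
    return powers

def leader_utility(B, powers, demands, gains, relay_gain, p_relay, cost_per_mhz):
    b_each = B / len(powers)
    d_first = max(t / rate(b_each, p, g) for p, t, g in zip(powers, demands, gains))
    d_second = sum(demands) / rate(B, p_relay, relay_gain)
    return -(d_first + d_second) - cost_per_mhz * B / 1e6   # assumed form

def stackelberg_equilibrium(bandwidth_set, power_set, demands, gains,
                            relay_gain=1e-10, p_relay=2.0,
                            cost_per_mhz=0.01, c=1.0):
    best = None
    for B in bandwidth_set:
        ne = follower_ne(B, power_set, demands, gains, c)
        u0 = leader_utility(B, ne, demands, gains, relay_gain, p_relay, cost_per_mhz)
        if best is None or u0 > best[0]:
            best = (u0, B, ne)
    return best[1], best[2]

B_star, p_star = stackelberg_equilibrium(
    bandwidth_set=[5e6, 10e6, 15e6, 20e6],
    power_set=[0.5, 1.0, 1.5, 2.0],
    demands=[2e6, 4e6, 6e6],
    gains=[1e-10, 2e-10, 5e-11])
```

Because both strategy sets are finite, the outer loop always returns some pair, which mirrors the existence argument. The cost of enumeration grows with the size of the bandwidth set and the number of best-response rounds.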

Algorithm Description
In this subsection, we propose a hierarchical learning algorithm to solve the joint optimization of delay and resource allocation. Under the learning framework, each smart agent observes the state of the environment, selects an action, and thereby optimizes its future return. The HLA requires the bandwidth and power selections to be expressed as mixed strategies. The algorithm follows the time frame structure shown in Fig. 2, following [28]. The R-UAV M updates its strategy selection probabilities at each epoch h, and each epoch is divided into K time slots, in each of which the E-UAVs update their strategy selection probabilities. The R-UAV's available bandwidth strategy set is $\mathcal{B} = \{b_1, \ldots, b_M\}$; at epoch h its mixed strategy is $\omega_0(h) = (\omega_{01}(h), \ldots, \omega_{0m}(h), \ldots, \omega_{0M}(h))$ with $\sum_{m \in \mathcal{B}} \omega_{0m}(h) = 1$. The E-UAVs' available power strategy set is $\mathcal{P} = \{p_1, \ldots, p_L\}$. At time slot k, the mixed strategy of E-UAV i is $\omega_i(k) = (\omega_{i1}(k), \ldots, \omega_{il}(k), \ldots, \omega_{iL}(k))$ with $\sum_{l \in \mathcal{P}} \omega_{il}(k) = 1$. These mixed strategies are used to initialize the HLA. In the lower-level sub-game, we use the stochastic learning automata (SLA) algorithm [17] to help the E-UAVs select their power strategies and deliver information to the R-UAV M.
The advantage of this algorithm is that it does not require users to exchange information: each user updates its own strategy selection probabilities according to its own reward. SLA finds the optimal strategy through repeated iteration in a random environment, so it is widely used for decision problems in wireless communications [29] [30] [31] [32]. At the k-th time slot, the random reward of E-UAV i is defined as an exponential function of its utility, where l is an adjustment factor that keeps the reward of E-UAV i between 0 and 1. The reward follows the same trend as the original utility function. Placing the utility of E-UAV i in the exponent ensures that the reward is non-negative, which the proposed algorithm requires.
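A minimal sketch of one SLA learning step for a single E-UAV is given here. It assumes the common linear reward-inaction update and a reward of the form exp(l · u) for a non-positive utility u. The function names and the numerical values are illustrative, and the exact update rule and reward used in this paper may differ.

```python
import math
import random

def normalized_reward(utility, l=1.0):
    """Map a non-positive utility into (0, 1] via exp(l * u); assumed form."""
    return math.exp(l * utility)

def select_action(probs, rng=random):
    """Draw a strategy index according to the current mixed strategy."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def sla_update(probs, chosen, reward, eta=0.1):
    """Linear reward-inaction step: shift probability mass toward the chosen
    strategy in proportion to the normalized reward; the sum stays equal to 1."""
    return [w + eta * reward * ((1.0 if m == chosen else 0.0) - w)
            for m, w in enumerate(probs)]

# One learning step for an E-UAV with L = 4 power levels, starting uniform.
omega = [0.25, 0.25, 0.25, 0.25]
a = select_action(omega)
omega_next = sla_update(omega, a, normalized_reward(-0.3))
```

Only the E-UAV's own reward enters the update, which is why no information exchange between users is needed.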
Step 2: In the h-th epoch, the R-UAV stochastically selects a total bandwidth $a_0(h)$ according to its mixed strategy $\omega_0(h)$ and broadcasts it to all the E-UAVs.
Step 3: Learning process of the E-UAVs. (1) At the beginning of time slot k, each E-UAV i stochastically selects its transmission power $a_i^k$ according to its current strategy selection probability vector $\omega_i(k)$.
(2) Each E-UAV i calculates its reward $U_i(a_0, a_i^k, a_{-i}^k)$.
(3) Each E-UAV updates its strategy selection probabilities according to the SLA update rule, where $0 < \eta < 1$ is the learning step and $\tilde{u}_i(k)$ is the normalized reward, whose value lies between 0 and 1.
Step 4: The R-UAV M calculates its utility u 0 (h).
Step 5: The R-UAV M updates the Q value of the selected bandwidth strategy according to its Q function, where $\kappa \in [0, 1)$ is a learning rate that satisfies the required step-size condition. The strategy selection probabilities are then updated from the Q values using a temperature parameter $\tau_0 > 0$, which trades off exploration against exploitation. When $\tau_0$ is large, the relay selects strategies more randomly; when it is small, the relay tends to select the strategy with the maximum Q value. We decrease $\tau_0$ from large to small over time to approach the optimal solution.
Step 6: The R-UAV selects an action according to the updated strategy selection probabilities.
Step 7: In the upper-level sub-game, we use a Q-learning algorithm for the R-UAV M to select its bandwidth strategy. In Q-learning, every action of the R-UAV M is associated with a Q value, which represents the relative payoff of that action when the R-UAV interacts with the environment. By repeatedly updating the Q function of the R-UAV M, actions with high returns are continuously reinforced, which helps find the optimal bandwidth strategy. Correspondingly, at the h-th epoch the reward of the R-UAV M is obtained from its utility, where s is a coefficient that keeps the reward of the R-UAV M between 0 and 1, so that the Q value is updated within a reasonable range.
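The leader's update in Steps 5 and 6 can be sketched as a stateless Q update followed by Boltzmann (softmax) action selection. This is a common form of temperature-based Q-learning and is assumed here for illustration. The function names and numerical values are hypothetical, and the paper's exact Q function and probability rule may differ.

```python
import math

def q_update(q, chosen, reward, kappa=0.1):
    """Q(h+1) = (1 - kappa) * Q(h) + kappa * reward, for the chosen strategy only."""
    new_q = list(q)
    new_q[chosen] = (1 - kappa) * q[chosen] + kappa * reward
    return new_q

def boltzmann(q, tau):
    """Softmax over Q values with temperature tau > 0."""
    top = max(q)  # subtract the maximum for numerical stability
    weights = [math.exp((v - top) / tau) for v in q]
    total = sum(weights)
    return [w / total for w in weights]

# The leader chose bandwidth strategy 1 and received a normalized reward of 0.8.
q = q_update([0.0, 0.0, 0.0], chosen=1, reward=0.8)
explore = boltzmann(q, tau=10.0)    # high temperature: close to uniform
exploit = boltzmann(q, tau=0.001)   # low temperature: concentrates on the best Q
```

Lowering the temperature across epochs moves the leader from nearly uniform exploration toward selecting the bandwidth with the largest Q value.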
Fig. 2 depicts the proposed hierarchical learning algorithm. The stopping criterion is reaching the maximum number of iterations. The proof of convergence of the proposed HLA can be found in [16], which also proves that the HLA always finds an SE.

Simulation Results
The convergence of the proposed HLA in a single simulation run is shown in Fig. 3 and Fig. 4. In the leader's sub-game, the R-UAV M runs 120 epochs to select the optimal bandwidth strategy. In Fig. 3, the selection probability $\omega_0$ of bandwidth strategy 5 (19 MHz) converges to 1 after about 100 epochs, and those of the other strategies converge to 0. Fig. 4 shows the convergence of the power strategies of E-UAV 1 in the first epoch, starting from equal probabilities. The selection probability of power 1 (0.5 W) converges to 1 after about 325 iterations, and those of the other strategies converge to 0. Fig. 5 compares the leader's utility under three schemes: power and system bandwidth jointly optimized by HLA, only the lower-level users' power optimized by SLA, and power and system bandwidth selected randomly. Fig. 5 shows that the proposed method obtains the largest leader utility. The larger the leader's utility, the smaller the sum of the system delay and the resource consumption, which demonstrates the superiority of the proposed algorithm.
Fig. 6 shows that the proposed method improves delay performance. The five curves correspond to different bandwidth strategies. The three points on each curve give the system delay under the corresponding bandwidth strategy in three cases: a) all E-UAVs in the lower layer select the minimum powers; b) all E-UAVs select the optimal powers obtained by HLA; c) all E-UAVs select the maximum powers. Each curve is downward convex, which reflects the tradeoff that our method achieves between system delay and power consumption under different system bandwidth strategies. Fig. 7 shows the tradeoff between energy efficiency and system latency obtained using HLA and random resource allocation. The four pairs of red and green points correspond to E-UAV information amounts of 1∼3, 3∼5, 5∼7 and 7∼9 Mbit, respectively, with the other parameters fixed. HLA outperforms random allocation because it both increases the energy efficiency and reduces the system delay. The energy efficiency (EE) is defined following [34]. For the half-duplex relay mode, the EE of the UAV backhaul network is expressed in terms of $R_{SD}$, the rate from the E-UAV to the cluster head large UAV, and $p_i d_i$, the energy consumed by each E-UAV in delivering information to the R-UAV. Fig. 8 shows the system bandwidth utilization. We call the information heterogeneous when the amounts carried by the E-UAVs differ widely, and homogeneous when they differ little. Homogeneous information yields higher system bandwidth utilization, because heterogeneous information makes some E-UAVs wait longer. In addition, increasing the number of E-UAVs has no significant impact on system bandwidth utilization.

Conclusion
In this paper, we investigated the joint optimization of delay and resource allocation in UAV backhaul networks. Unlike previous works that pursue network performance only, we also considered resource utilization efficiency, because energy is limited and resources are scarce in large-scale UAV networks. Moreover, the coupling between different resources increases the complexity of the problem. We achieved a tradeoff between delay and resource consumption in UAV backhaul networks by formulating the problem as a Stackelberg game, in which the R-UAV acts as the leader and the E-UAVs act as followers. The lower-level sub-game was proved to be an exact potential game and therefore has an NE, which together with the best bandwidth strategy constitutes the SE. Finally, a hierarchical learning algorithm was proposed, and simulation results showed that the proposed backhaul allocation method is effective for delay improvement and resource utilization.

Figure 1. System model of resource allocation in UAV backhaul networks.

Figure 2. The hierarchical learning algorithm model.

Figure 3. Convergence trend of bandwidth selection probabilities of the relay small UAV M.

Figure 5. Comparison of the leader's utility under different resource strategy selection algorithms.

Figure 6.The tradeoff between system delay and power consumption for different bandwidth strategies.