
Multi-Constraints Guidance and Maneuvering Penetration Strategy via Meta Deep Reinforcement Learning

This version is not peer-reviewed. Submitted: 21 August 2023; posted: 22 August 2023. A peer-reviewed article of this preprint also exists.
Abstract
In response to the problem of vehicle escape guidance, this manuscript proposes a unified intelligent control strategy synthesizing optimal guidance and Meta Deep Reinforcement Learning (DRL). Optimal control with minor energy consumption is introduced to meet the terminal latitude, longitude and altitude constraints. Maneuvering escape is realized by adding maneuver overloads in the longitudinal and lateral directions. The maneuver command decision model is computed with Soft Actor-Critic (SAC) networks. Meta learning is introduced to enhance the autonomous escape capability, which improves generalization to time-varying scenarios not encountered during training. To obtain training samples faster, the manuscript uses a prediction method to compute reward values, which avoids a large amount of numerical integration. The simulation results show that the proposed intelligent strategy can achieve highly precise guidance and effective escape.

1. Introduction

The hypersonic UAV mainly glides in near space [1]. In the early phase, a higher flight velocity is acquired owing to the thin atmospheric environment, which is an advantage for effectively avoiding interception by the defense system. At the end of the gliding flight, the velocity is mainly influenced by aerodynamic force and is subject to constraints on heat flow, dynamic pressure and overload [2]. The velocity advantage alone is insufficient for penetration, so the UAV applies orbital maneuvering to achieve penetration. The main flight mission is split into avoiding interception by the defense system and satisfying terminal multiple constraints [3]. The core of this manuscript is designing a penetration guidance strategy that exploits the orbital maneuvering capability, avoiding interception while reducing the impact of penetration on guidance accuracy.
Penetration strategies can be summarized as tactical and technical penetration strategies [4]. The technical penetration strategy changes the flight path through maneuvering, aiming to increase the miss distance and thereby penetrate successfully. Common maneuver manners include the sine maneuver, step maneuver, square-wave maneuver and spiral maneuver [5]. The technical penetration strategy has limitations and instability, because the UAV can hardly adopt the optimal penetration strategy according to the actual situation of the offensive and defensive confrontation. Compared with the traditional procedural maneuver strategy, the differential game guidance law, as a tactical penetration strategy, has the characteristics of real-time operation and intelligence [6]. Penetration problems are essentially continuous dynamic conflict problems among multiple participants, and the differential game is an essential approach to solving multi-party optimal control problems. Applying it to the attack-defense confrontation problem not only fully considers the relative information between the UAV and the interceptor, but also yields a Nash equilibrium strategy that reduces energy consumption. Many scholars have proposed differential game models of various maneuvering control strategies based on control indexes and motion models. Garcia [7] modeled the active target defense scenario as a zero-sum differential game, designed a complete differential game solution, and comprehensively considered the optimal closed-loop state feedback strategy to obtain the value function. In Ref. [8], the optimal guidance problem between an interceptor and an active-defense ballistic vehicle was studied, and an optimal guidance scheme was proposed based on the linear quadratic differential game method and the numerical solution of the Riccati differential equations. Liang [9] analyzed the pursuit-evasion attack problem of multiple players, reduced the three-body game confrontation to competition and cooperation problems, and solved the optimal solution of multiple players via differential game theory. The above methods are of great significance for analyzing and solving the confrontation process between UAV and interceptor. However, a near-space UAV has the characteristics of high velocity and short duration in the terminal guidance phase of the attack-defense confrontation [10], so the differential game guidance law can hardly show its advantages in this phase. Moreover, the differential game method has a large computational burden, and the bilateral performance indexes are difficult to model [11]; as a result, this theory is difficult to apply in practice.
DRL is a research hotspot in the field of artificial intelligence that has emerged in recent years, and remarkable learning results have been achieved in robot control, guidance and control technologies [12]. DRL specifically refers to agents learning in the process of interacting with the environment to find the best strategy that maximizes the cumulative reward [13]. With its advantages in dealing with high-dimensional abstract problems and giving decisions quickly, DRL provides a new solution for the maneuvering penetration of high-velocity UAVs. To solve the problem of intercepting a highly maneuvering target, an auxiliary DRL algorithm was proposed in Ref. [14] to optimize the head-on interception guidance control strategy based on a neural network. Simulation results showed that DRL achieved a higher hit rate and a larger terminal interception angle than traditional methods and the proximal policy optimization algorithm. Gong [15] proposed an all-aspect attack guidance law for agile missiles via DRL, which effectively deals with aerodynamic uncertainty and strong nonlinearity at high angles of attack; DRL was used to generate the attack-angle guidance law in the agile turning phase. Furfaro [16] proposed an adaptive guidance algorithm based on the classical zero-effort-miss/zero-effort-velocity formulation, whose limitations were overcome via RL, creating a closed-loop guidance algorithm that is lightweight and flexible enough to adapt to a given constrained scenario.
Compared with differential game theory, DRL makes it convenient to establish the performance index function, and it is a feasible method for solving the dynamic programming problem by utilizing the powerful numerical calculation ability of computers, skillfully avoiding the analytical solution of the value function. However, traditional DRL has some limitations, such as high sample complexity, low sample utilization and long training time. Once the mission changes, the original DRL parameters can hardly adapt to the new mission and must be learned from scratch. A change of mission or environment leads to the failure of the trained model and poor generalization ability. To solve these problems, researchers have introduced Meta learning into DRL and proposed Meta DRL [17]. By learning useful meta knowledge from a group of related missions, agents acquire the ability of learning to learn, the learning efficiency on new missions is improved, and the sample complexity is reduced. When facing new missions or environments, the network responds quickly based on previously accumulated knowledge, so that only a small number of samples are needed to adapt to the new mission.
Based on the above analysis, this manuscript proposes Meta DRL to solve the UAV guidance penetration strategy; DRL is improved to enhance the adaptability of the UAV in the complex and changeable attack-defense confrontation. Besides, the idea of Meta learning is used to make the UAV learn to learn and improve the ability of autonomous flight penetration. The core contributions of this manuscript are as follows.
(1)
By modeling the three-dimensional attack-defense scene between UAV and interceptor and analyzing the terminal and process constraints of the UAV, a guidance penetration strategy based on DRL is proposed, aiming to obtain the optimal maneuvering penetration solution under a fixed environment or mission.
(2)
Meta learning is used to improve the UAV guidance penetration strategy, so that the UAV learns to learn and the autonomous flight penetration ability is improved, which enhances the generalization performance to time-varying scenarios not encountered in the training process.
(3)
The network parameters based on Meta DRL are tested, and the flight path and state under different attack-defense distances are analyzed. Besides, the penetration strategy is analyzed, the penetration timing and maneuvering overload are explored, and the penetration tactics are summarized.

2. Modeling of the Penetration Guidance Problem

2.1. Modeling of UAV Motion

The three-degree-of-freedom motion of the UAV is described by the dynamic equations established in the ballistic coordinate system:
$$\left\{\begin{aligned} \dot v &= -\frac{\rho v^{2} S_m C_D}{2m} - g_r\sin\theta + g_{\omega_e}\left(\cos\sigma\cos\theta\cos\phi + \sin\theta\sin\phi\right) + \omega_e^{2}r\left(\cos^{2}\phi\sin\theta - \cos\phi\sin\phi\cos\sigma\cos\theta\right) \\ \dot\theta &= \frac{\rho v^{2} S_m C_L\cos\upsilon}{2mv} - \frac{g_r\cos\theta}{v} + 2\omega_e\sin\sigma\cos\phi + \frac{v\cos\theta}{r} + \frac{g_{\omega_e}}{v}\left(\cos\theta\sin\phi - \cos\sigma\sin\theta\cos\phi\right) + \frac{\omega_e^{2}r}{v}\left(\cos\phi\sin\phi\cos\sigma\sin\theta + \cos^{2}\phi\cos\theta\right) \\ \dot\sigma &= \frac{\rho v^{2} S_m C_L\sin\upsilon}{2mv\cos\theta} - \frac{g_{\omega_e}\sin\sigma\cos\phi}{v\cos\theta} + \frac{\omega_e^{2}r\cos\phi\sin\phi\sin\sigma}{v\cos\theta} + \frac{v\tan\phi\cos\theta\sin\sigma}{r} - 2\omega_e\left(\sin\phi - \cos\sigma\tan\theta\cos\phi\right) \\ \dot\phi &= \frac{v\cos\theta\cos\sigma}{r} \\ \dot\lambda &= \frac{v\cos\theta\sin\sigma}{r\cos\phi} \\ \dot r &= v\sin\theta \end{aligned}\right. \tag{1}$$
$v$ is the velocity of the UAV relative to the Earth, $\theta$ is the velocity slope angle, and $\sigma$ is the velocity azimuth, whose positive direction is clockwise from north. $r$ represents the geocentric distance, and $(\lambda, \phi)$ are the longitude and latitude. The differential argument is the flight time $t$. $g_{\omega_e}$ is the component of the Earth's gravitational acceleration along the direction of the Earth's rotational angular rate $\omega_e$, while $g_r$ is the component along the geocentric direction. $\rho$ is the atmospheric density, and $m$ and $S_m$ are the mass and reference area of the UAV. $C_D$ and $C_L$ are the drag and lift coefficients, which depend on the Mach number and the attack angle, so the control variable attack angle $\alpha$ is implicit in them; the other control variable is the bank angle $\upsilon$.
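For illustration, the following is a minimal sketch of how a simplified version of Eq.(1) can be propagated numerically. The Earth-rotation ($\omega_e$) terms are dropped (non-rotating spherical Earth), and the mass, reference area, atmosphere model and aerodynamic coefficients are placeholder assumptions rather than CAV-H data.

```python
# Minimal sketch: propagate a simplified (non-rotating Earth) form of Eq.(1) with RK4.
# MU, RHO0/HS, mass, S_m and the CL/CD model below are illustrative assumptions.
import numpy as np

MU = 3.986e14              # Earth gravitational parameter, m^3/s^2
RHO0, HS = 1.225, 7200.0   # simple exponential atmosphere (assumption)
RE = 6.371e6               # mean Earth radius, m

def dynamics(state, alpha_deg, bank, m=900.0, S_m=0.48):
    """state = [v, theta, sigma, phi, lam, r]; returns time derivatives."""
    v, theta, sigma, phi, lam, r = state
    rho = RHO0 * np.exp(-(r - RE) / HS)
    g = MU / r**2
    CL = 0.04 * alpha_deg                 # placeholder aerodynamic model
    CD = 0.02 + 0.1 * CL**2
    q = 0.5 * rho * v**2
    L, D = q * S_m * CL, q * S_m * CD
    dv     = -D / m - g * np.sin(theta)
    dtheta = (L * np.cos(bank) / m - (g - v**2 / r) * np.cos(theta)) / v
    dsigma = L * np.sin(bank) / (m * v * np.cos(theta)) \
             + v * np.cos(theta) * np.sin(sigma) * np.tan(phi) / r
    dphi   = v * np.cos(theta) * np.cos(sigma) / r
    dlam   = v * np.cos(theta) * np.sin(sigma) / (r * np.cos(phi))
    dr     = v * np.sin(theta)
    return np.array([dv, dtheta, dsigma, dphi, dlam, dr])

def rk4_step(state, alpha_deg, bank, dt=0.1):
    k1 = dynamics(state, alpha_deg, bank)
    k2 = dynamics(state + 0.5 * dt * k1, alpha_deg, bank)
    k3 = dynamics(state + 0.5 * dt * k2, alpha_deg, bank)
    k4 = dynamics(state + dt * k3, alpha_deg, bank)
    return state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
```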
For a hypersonic UAV with a large lift-to-drag ratio (L/D), the heat flow, overload and dynamic pressure are considered as flight process constraints:
$$\left\{\begin{aligned} & k_h\rho^{1/2}v^{3} \le Q_{s\max} \\ & \frac{\rho v^{2}}{2} \le q_{\max} \\ & \frac{\sqrt{D^{2}+L^{2}}}{mg_0} \le n_{\max} \end{aligned}\right. \tag{2}$$
$Q_{s\max}$, $q_{\max}$ and $n_{\max}$ are the maximum heat flow, dynamic pressure and overload, respectively, and $k_h$ is a constant coefficient. In order to keep the gliding state steady and effectively prevent trajectory jumps, the UAV is required to balance lift and gravity [18]. Stable gliding is generally treated as quasi-equilibrium gliding (QEG). At present, the international definitions of QEG can be summarized into two types [19]: the velocity slope angle is considered constant, expressed by $\dot\theta = 0$, or the altitude change rate is considered constant, expressed by $\ddot h = 0$. $\dot\theta = 0$ was adopted by the traditional QEG guidance, while $\ddot h = 0$ mainly appeared in the analytical prediction guidance for Mars landing at the end of the last century [20].
For a UAV with a large range, this manuscript adopts $\dot\theta = 0$ as the QEG condition. Referring to the second equation in Eq.(1), $\dot\theta = 0$ is transformed into Eq.(3).
$$m\left(g - \frac{v^{2}}{r}\right)\cos\theta - L\cos\upsilon = 0 \tag{3}$$
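As a simple illustration of Eqs.(2) and (3), the sketch below checks the process constraints and solves the QEG condition for the bank angle. The constant $k_h$, the limit values and the vehicle data are illustrative assumptions.

```python
# Sketch of the process-constraint check of Eq.(2) and the QEG bank angle of Eq.(3).
# k_h, Qs_max, q_max and n_max below are assumed values, not CAV-H data.
import numpy as np

def process_constraints_ok(rho, v, L, D, m,
                           k_h=5e-5, Qs_max=1.2e6, q_max=2.0e5, n_max=3.0, g0=9.81):
    heat_flux  = k_h * np.sqrt(rho) * v**3        # first row of Eq.(2)
    dyn_press  = 0.5 * rho * v**2                 # second row of Eq.(2)
    total_load = np.hypot(D, L) / (m * g0)        # third row of Eq.(2)
    return (heat_flux <= Qs_max) and (dyn_press <= q_max) and (total_load <= n_max)

def qeg_bank_angle(L, m, g, v, r, theta):
    """Eq.(3): m*(g - v^2/r)*cos(theta) - L*cos(ups) = 0  ->  solve for the bank angle."""
    c = m * (g - v**2 / r) * np.cos(theta) / L
    return np.arccos(np.clip(c, -1.0, 1.0))
```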

2.2. Description of Flight Missions

2.2.1. Guidance Mission

The physical meaning of gliding guidance is to eliminate the heading error, satisfy the complex process constraints, and minimize the energy loss. The UAV is guided to glide unpowered to the preset terminal target point ($h_f$, $\lambda_f$, $\varphi_f$), satisfying the terminal altitude, longitude and latitude. Hence, the terminal constraints are expressed by Eq.(4).
$$\left\{\begin{aligned} & h(L_{R_f}) = h_f, \quad \lambda(L_{R_f}) = \lambda_f \\ & \varphi(L_{R_f}) = \varphi_f, \quad \Delta\sigma_f \le \Delta\sigma_{\max} \end{aligned}\right. \tag{4}$$
where the terminal range $L_{R_f}$ is given, $\Delta\sigma$ represents the heading error, and $\Delta\sigma_{\max}$ is a preset allowable value. The guidance problem is the process of determining $\alpha$ and $\upsilon$.

2.2.2. Penetration Mission

The main indexes used to judge the penetration probability are as follows:
(1)
The miss distance $D_{miss}$ with respect to the interceptor at the encounter moment.
(2)
The overload $N_m$ of the interceptor in the last phase.
(3)
The line-of-sight (LOS) angular rates $\dot\theta_{int}^{los}$ and $\dot\sigma_{int}^{los}$ with respect to the interceptor at the encounter moment.
Here, $D_{miss}$ directly reflects the result of penetration: the larger $D_{miss}$ is, the greater the penetration probability. $N_m$ and the LOS angular rates indirectly reflect the result of penetration. The larger $\dot\theta_{int}^{los}$ and $\dot\sigma_{int}^{los}$ are, the more difficult it is for the interceptor to intercept successfully, because $\theta_{int}^{los}$ and $\sigma_{int}^{los}$ can hardly converge to constant values at the encounter moment, and a larger overload is required to adjust $\theta_{int}^{los}$ and $\sigma_{int}^{los}$ at the end of the interception. The larger $N_m$ is, the greater the control cost the interceptor has to pay to complete the interception mission. Once $N_m$ exceeds the overload limit of the interceptor, the control is saturated, which means the interceptor can hardly intercept under the current maximum overload constraint. Considering the size and flight characteristics of the UAV [21], the manuscript assumes the penetration mission is completed if $D_{miss}$ is greater than 2 m.
The maximum maneuvering overload of the UAV is constrained by its structure and by the QEG condition [19]. An excessively large lateral maneuver overload will cause the UAV to deviate from its course and eventually lead to the failure of the guidance mission, while a large longitudinal maneuver amplitude has a significant impact on the lift-to-drag ratio (L/D) of the UAV and affects flight safety. According to the above analysis, the maximum lateral maneuvering overload is set to 2 g, and the maximum longitudinal maneuvering overload is set to 1 g.

2.3. The Guidance Law of Interceptor

The guidance law of the interceptor relies on the inertial navigation system to obtain information such as the position and velocity of the UAV. The relative motion is shown in Figure 1.
P, T and M respectively represent the UAV, the target and the interceptor. $r$ is the relative position between the UAV and the interceptor, and $r_t$ is that with respect to the target.
The interceptor measures the LOS angular rate and the approach velocity to the UAV, and the overload control command is derived by generalized proportional navigation guidance (GPNG):
$$\left\{\begin{aligned} n_{y2}^{*} &= K_D\left|\dot r\right|\dot\theta_{int}^{los}/G_0 \\ n_{z2}^{*} &= K_T\left|\dot r\right|\dot\sigma_{int}^{los}/G_0 \end{aligned}\right. \tag{5}$$
As shown in Eq.(5), $K_D$ and $K_T$ are the navigation ratios in the longitudinal and lateral directions, respectively, and $\dot r$ is the approach velocity. $\dot\theta_{int}^{los}$ and $\dot\sigma_{int}^{los}$ represent the LOS angular rates in the longitudinal and lateral directions.
$$\left\{\begin{aligned} r &= \sqrt{(x-x_m)^{2} + (y-y_m)^{2} + (z-z_m)^{2}} \\ \theta_{int}^{los} &= \arcsin\left(\frac{y-y_m}{r}\right) \\ \sigma_{int}^{los} &= \arctan\left(\frac{z-z_m}{x-x_m}\right) \end{aligned}\right. \tag{6}$$
Differentiating Eq.(6) with respect to t , we have
$$\left\{\begin{aligned} \dot\theta_{int}^{los} &= \frac{1}{\sqrt{1-\left(\frac{y-y_m}{r}\right)^{2}}}\cdot\frac{(\dot y-\dot y_m)r - (y-y_m)\dot r}{r^{2}} \\ \dot\sigma_{int}^{los} &= \frac{1}{1+\left(\frac{z-z_m}{x-x_m}\right)^{2}}\cdot\frac{(\dot z-\dot z_m)(x-x_m) - (z-z_m)(\dot x-\dot x_m)}{(x-x_m)^{2}} \\ \dot r &= \frac{(x-x_m)(\dot x-\dot x_m) + (y-y_m)(\dot y-\dot y_m) + (z-z_m)(\dot z-\dot z_m)}{r} \end{aligned}\right. \tag{7}$$
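The interceptor guidance loop of Eqs.(5)-(7) can be sketched as follows: relative geometry gives the LOS angles, rates and closing velocity, which feed the GPNG overload commands. The navigation ratios and $G_0$ below are assumed values.

```python
# Sketch of Eqs.(5)-(7): relative state -> LOS rates and closing velocity -> GPNG commands.
import numpy as np

def gpng_commands(p_uav, v_uav, p_int, v_int, K_D=4.0, K_T=4.0, G0=9.81):
    d  = p_uav - p_int                     # (x - x_m, y - y_m, z - z_m)
    dv = v_uav - v_int
    r  = np.linalg.norm(d)                                            # Eq.(6)
    r_dot = d @ dv / r                                                # Eq.(7), third row
    theta_los_dot = (dv[1] * r - d[1] * r_dot) / (r**2 * np.sqrt(1.0 - (d[1] / r)**2))
    sigma_los_dot = (dv[2] * d[0] - d[2] * dv[0]) / (d[0]**2 + d[2]**2)
    n_y2 = K_D * abs(r_dot) * theta_los_dot / G0                      # Eq.(5)
    n_z2 = K_T * abs(r_dot) * sigma_los_dot / G0
    return n_y2, n_z2
```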

3. Design of Penetration Strategy Considering Guidance

3.1. Guidance Penetration Strategy Analysis

Generally, the UAV achieves penetration through its velocity advantage or by increasing the maneuvering overload. The former is exploited in the early gliding flight: compared with the interceptor, the velocity of the UAV is relatively large, which is conducive to penetrating the defense. More threats to the UAV come from the defense system of the intended target, so the interception threat is concentrated at the end of the gliding flight, where the flight velocity has gradually decreased and is no longer sufficient for escape by speed alone. Based on the above analysis, this manuscript designs the penetration strategy by increasing the maneuvering overload.
Firstly, the energy-optimal gliding guidance law was designed in previous research [22]. Then, avoiding interception by maneuvering becomes the key point. The real-time flight information of the interceptor is difficult for the UAV to obtain accurately, whereas the real-time flight information of the UAV is available to the interceptor. Based on the penetration mission in which only the initial launch position of the interceptor is known, this manuscript simulates the attack-defense environment between the UAV and the interceptor and applies the DRL method to solve the overload command for UAV maneuvering penetration. The DNN parameters are trained offline, and the maneuvering penetration command is constructed online. The launch position of the interceptor is changeable, and a fixed penetration command cannot adapt to the complex penetration environment. To improve the adaptability of the DNN parameters, the original DRL is optimized by adopting the idea of Meta learning, and the onboard and environmental information is fully utilized. The Meta learning optimization enhances the flight capability, the fast response to complex missions, and the in-flight self-learning ability. Finally, the UAV guidance penetration strategy based on Meta DRL is proposed in this manuscript, as shown in Figure 2.
For the gliding guidance mission, DRL and optimal control are combined to achieve the guidance penetration mission of the UAV. The optimal guidance command developed in previous research [22] is introduced to satisfy the constraints on terminal position and altitude with minimal energy loss. Maneuvering overloads are added in the longitudinal and lateral directions to achieve the penetration mission at the end of the gliding flight; these maneuvering overloads are solved by DRL, and the DNN parameters are optimized by Meta learning.
The guidance command ($n_y$, $n_z$) is generated by the optimal guidance strategy, and the maneuvering overload command ($\Delta n_y$, $\Delta n_z$) is generated by Meta DRL. The total flight overload is shown in Eq.(8).
$$\left\{\begin{aligned} N_y &= n_y + \Delta n_y \\ N_z &= n_z + \Delta n_z \end{aligned}\right. \tag{8}$$
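A minimal sketch of the command blending in Eq.(8) is given below: the optimal-guidance overload plus the Meta-DRL maneuver increment, with the increment clipped to the maneuver bounds stated in Section 2.2.2 (1 g longitudinal, 2 g lateral). Expressing the overloads in units of g is an assumption for illustration.

```python
# Sketch of Eq.(8): blend guidance overload with the DRL maneuver increment (units of g).
import numpy as np

def total_overload(n_y, n_z, dn_y, dn_z):
    dn_y = np.clip(dn_y, -1.0, 1.0)   # longitudinal maneuver bound, 1 g
    dn_z = np.clip(dn_z, -2.0, 2.0)   # lateral maneuver bound, 2 g
    return n_y + dn_y, n_z + dn_z
```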
The maneuvering penetration command via Meta DRL is the core of this manuscript. The penetration problem considering guidance is described as a Markov Decision Process (MDP), which consists of a finite-dimensional continuous flight state space, a set of longitudinal and lateral overloads, and a reward function judging the penetration strategy. Flight data are generated by numerical integration, and SAC networks are introduced to train and learn the MDP. The network parameters are optimized via Meta learning, so that when the UAV faces an online mission change, they can be adjusted with very little flight data and adapt to the new environment as quickly as possible.

3.2. Energy Optimal Gliding Guidance Method

In previous research [22], based on the QEG condition and taking the required overload as the control variable, a performance index with minimum energy loss was established. The optimal longitudinal and lateral overloads were designed to satisfy the constraints on terminal latitude, longitude, altitude and velocity. The required overload command is shown in Eq.(9).
$$\left\{\begin{aligned} u_y^{*} &= k\left(C_h L_R - C_\theta\right) + 1 \\ u_z^{*} &= \frac{\sigma_{los} - \sigma}{k\left(L_{R_f} - L_R\right)} \end{aligned}\right. \tag{9}$$
where $u_y^{*} = n_y^{*}$ and $u_z^{*} = n_z^{*}$ are the optimal overloads in the longitudinal and lateral directions, $k$ is a guidance gain determined by $g_0$, $g$ and $v^{2}$ (defined in Ref. [22]), $L_R$ is the current range, and $L_{R_f}$ is the total range of the gliding phase. $C_h$ and $C_\theta$ are the guidance coefficients based on optimal control, given in Eq.(10).
$$\left\{\begin{aligned} C_h &= \frac{6\left(\left(L_R - L_{R_f}\right)\left(\theta_f + \theta\right) - 2h + 2h_f\right)}{k^{2}\left(L_R - L_{R_f}\right)^{3}} \\ C_\theta &= \frac{2\left(L_R L_{R_f}\left(\theta - \theta_f\right) - L_{R_f}^{2}\left(2\theta + \theta_f\right) + L_R^{2}\left(2\theta_f + \theta\right) + 3\left(L_{R_f} + L_R\right)\left(h_f - h\right)\right)}{k^{2}\left(L_R - L_{R_f}\right)^{3}} \end{aligned}\right. \tag{10}$$
The control variables $\alpha$ and $\upsilon$ are then calculated from Eq.(11):
$$\left\{\begin{aligned} & \frac{\rho v^{2} S_m C_L(Ma,\alpha)}{2 m g_0} = \sqrt{n_y^{*2} + n_z^{*2}} \\ & \upsilon = \arctan\left(\frac{n_z^{*}}{n_y^{*}}\right) \end{aligned}\right. \tag{11}$$
where $g_0$ is the gravitational acceleration at sea level, and $\alpha$ is obtained by inverting the lift-coefficient relation in Eq.(11) using the aerodynamic data.
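The conversion from the optimal overload pair to the two control variables in Eq.(11) can be sketched as follows. The $C_L(Ma,\alpha)$ table and the attack-angle range are placeholder assumptions; in practice $\alpha$ is recovered by inverting the vehicle's aerodynamic data, here with simple 1-D interpolation at the current Mach number. `arctan2` is used instead of `arctan` only for numerical robustness.

```python
# Sketch of Eq.(11): (n_y*, n_z*) -> (attack angle, bank angle) via an assumed CL table.
import numpy as np

ALPHA_GRID = np.linspace(5.0, 20.0, 16)         # deg, assumed admissible range

def CL_table(mach, alpha):                       # placeholder aerodynamic model
    return (0.03 + 0.002 * mach) * alpha

def overload_to_controls(n_y, n_z, rho, v, S_m, mach, m, g0=9.81):
    bank = np.arctan2(n_z, n_y)                                  # second row of Eq.(11)
    n_total = np.hypot(n_y, n_z)
    CL_req = 2.0 * m * g0 * n_total / (rho * v**2 * S_m)         # required lift coefficient
    alpha = np.interp(CL_req, CL_table(mach, ALPHA_GRID), ALPHA_GRID)  # invert CL(alpha)
    return alpha, bank
```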

4. RL Model for Penetration Guidance

The problem of maneuvering penetration is modeled as a series of stationary MDP [23] with unknown transition probabilities. The continuous flight state space, action set, and reward function for judging the command are determined in this section.

4.1. MDP of Penetration Guidance Mission

A deterministic MDP with continuous states and actions is defined by the quintuple $(S, A, T, R, \gamma)$. $S$ is the continuous state space, $A$ is the action set, and $T: S\times A\to S$ is the state transition function, reflecting the deterministic state transition relationship. $R$ is the immediate reward, and $\gamma\in[0,1]$ is the discount factor that balances the immediate and future rewards.
At the current state $s_t$, the agent chooses an action $a_t\in A$, the state changes from $s_t$ to $s_{t+1}\in S$, and the environment returns the immediate reward $R_t = f(s_t, a_t)$ to the agent. The cumulative reward obtained with the controlled action sequence $\tau$ is shown in Eq.(12).
$$G(s_0,\tau) = R_0 + \gamma R_1 + \gamma^{2}R_2 + \cdots = \sum_{t=0}^{\infty}\gamma^{t}R_t \tag{12}$$
The goal of the MDP is to determine the policy that maximizes the expected accumulated reward:
$$\tau^{*} = \arg\max_{\tau}\left\{G(s_0,\tau)\right\} \tag{13}$$
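A minimal illustration of the discounted return in Eq.(12) is given below; the reward list is assumed to be collected along one episode.

```python
# Accumulate the discounted return of Eq.(12) along one trajectory.
def discounted_return(rewards, gamma=0.99):
    g, discount = 0.0, 1.0
    for r in rewards:          # r_0, r_1, ... collected along tau
        g += discount * r
        discount *= gamma
    return g
```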

4.1.1. State Space Design

The attack-defense environment is abstracted into the state space of the MDP, which guides the action command. The UAV can hardly acquire the flight information and guidance law of the interceptor; hence, only the information of the UAV and the target is considered in the state space.
$$S = \left\{x_r,\ y_r,\ z_r,\ \theta_{los},\ \sigma_{los}\right\} \tag{14}$$
$(x_r, y_r, z_r)$ represents the relative distance between the UAV and the target in the North-East-Down (NED) coordinate system, and $(\theta_{los}, \sigma_{los})$ are the longitudinal and lateral LOS angles, respectively. In order to eliminate dimensional differences and enhance compatibility among the states, $S$ is normalized by Eq.(15).
$$\bar S = \left\{\begin{aligned} \bar x_r &= x_r / x_{r0} \\ \bar y_r &= y_r / y_{r0} \\ \bar z_r &= z_r / z_{r0} \\ \bar\theta_{los} &= \theta_{los} / 2\pi \\ \bar\sigma_{los} &= \sigma_{los} / 2\pi \end{aligned}\right. \tag{15}$$
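The normalization of Eq.(15) can be written as a small helper; interpreting $x_{r0}$, $y_{r0}$, $z_{r0}$ as the initial relative distances used as scale factors is an assumption based on the notation above.

```python
# Sketch of the state normalization in Eq.(15).
import numpy as np

def normalize_state(x_r, y_r, z_r, theta_los, sigma_los, x_r0, y_r0, z_r0):
    return np.array([x_r / x_r0,
                     y_r / y_r0,
                     z_r / z_r0,
                     theta_los / (2.0 * np.pi),
                     sigma_los / (2.0 * np.pi)])
```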

4.1.2. Action Space Design

An action is a decision selected by the UAV based on the current state, and the action space $A$ is the set of all possible decisions. Overload directly affects the velocity azimuth and slope angle and indirectly affects the gliding flight state; hence, the manuscript chooses overload as the intermediate control variable.
On the basis of the optimal guidance law and the flight process constraints, the longitudinal overload $\Delta n_y$ and lateral overload $\Delta n_z$ are taken as the action space of the MDP. Considering the safety of the UAV and the heading error, the maximum longitudinal and lateral maneuvering overloads are assumed to be 1 g and 2 g, respectively. $A$ is defined as the continuous set shown in Eq.(16).
$$A = \left\{\begin{aligned} \Delta n_y &\in \left[-g,\ g\right] \\ \Delta n_z &\in \left[-2g,\ 2g\right] \end{aligned}\right. \tag{16}$$

4.2. Multi-Mission Reward Function Design

The reward function $f_r$ is an essential part of guiding and training the maneuvering penetration strategy. After the action command is executed, $f_r$ returns a reward value to the UAV, which reflects how fairly and scientifically the action is judged. The rationality of $f_r$ directly affects the training result and determines the efficiency of SAC training. In this manuscript, the aim of $f_r$ is to guide the UAV to achieve the guidance penetration mission while satisfying the terminal multi-constraints. Given the mission requirements, $f_r$ consists of the miss distance with respect to the interceptor and the terminal deviation with respect to the target.
$$f_r(s,a) = c_1 d_{miss} - c_2 d_{error} \tag{17}$$
where $d_{miss}$ and $d_{error}$ are the miss distance and the terminal deviation after the action command is executed. The necessary and sufficient condition for satisfying the terminal position constraint is eliminating the heading error, which is directly related to the LOS angular rate with respect to the target. Similarly, $d_{miss}$ can be reflected by the LOS angular rate at the encounter time between the UAV and the interceptor. Therefore, the normalized $f_r$ is expressed by Eq.(18).
$$f_r = c_1\sqrt{\left(\dot{\bar\theta}_{int}^{los}\right)^{2} + \left(\dot{\bar\sigma}_{int}^{los}\right)^{2}} - c_2\sqrt{\left(\dot{\bar\theta}_{los}\right)^{2} + \left(\dot{\bar\sigma}_{los}\right)^{2}} \tag{18}$$
where $\dot{\bar\theta}_{int}^{los}$ and $\dot{\bar\sigma}_{int}^{los}$ are the normalized LOS angular rates in the longitudinal and lateral directions with respect to the interceptor, and $\dot{\bar\theta}_{los}$ and $\dot{\bar\sigma}_{los}$ are the normalized LOS angular rates in the longitudinal and lateral directions with respect to the target. In this manuscript, the LOS angular rates at the encounter time and the terminal time are obtained by the analytical prediction derived below rather than by numerical integration.
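A minimal sketch of the reward in Eq.(18) follows, computed from the predicted normalized LOS angular rates at the encounter time (with respect to the interceptor) and at the terminal time (with respect to the target). The weights $c_1$, $c_2$ are assumed values.

```python
# Sketch of Eq.(18): penetration term (larger is better) minus guidance term (smaller is better).
import numpy as np

def reward(theta_dot_int, sigma_dot_int, theta_dot_tgt, sigma_dot_tgt, c1=1.0, c2=1.0):
    miss_term  = np.hypot(theta_dot_int, sigma_dot_int)   # harder for the interceptor
    guide_term = np.hypot(theta_dot_tgt, sigma_dot_tgt)   # better guidance accuracy when small
    return c1 * miss_term - c2 * guide_term
```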

4.2.1. The Solution of LOS Angular Rate in the Lateral Direction

The attack defense confrontation model in lateral direction is shown in Figure 3.
P, T and M respectively represent the UAV, the target and the interceptor. $L_{Rgo}$ and $L_{Mgo}$ represent the remaining ranges from the UAV to the target and to the interceptor, which are calculated in Eqs.(19) and (20).
The lateral relative motion model between the UAV and the target is shown in Eq.(19).
$$\left\{\begin{aligned} \dot L_{Rgo} &= -v\cos\Delta\sigma \\ L_{Rgo}\dot\sigma_{los} &= v\sin\Delta\sigma \end{aligned}\right. \tag{19}$$
The lateral relative motion model between the UAV and the interceptor is shown in Eq.(20).
$$\left\{\begin{aligned} \dot L_{Mgo} &= v\cos\Delta\sigma - v_m\cos\Delta\sigma_m \\ L_{Mgo}\dot\sigma_{int}^{los} &= v\sin\Delta\sigma - v_m\sin\Delta\sigma_m \end{aligned}\right. \tag{20}$$
In order to simplify the calculation, the relative motion equation Eq.(20) is rewritten as Eq.(21).
$$\left\{\begin{aligned} \dot L_{Mgo} &= v_r\cos\Delta\sigma_{mn} \\ L_{Mgo}\dot\sigma_{int}^{los} &= v_r\sin\Delta\sigma_{mn} \end{aligned}\right. \tag{21}$$
where $v_r$ and $\Delta\sigma_{mn}$ are calculated as in Eq.(22).
$$\left\{\begin{aligned} v_r &= \sqrt{\left(v\cos\sigma_v - v_m\cos\sigma_m\right)^{2} + \left(v\sin\sigma_v - v_m\sin\sigma_m\right)^{2}} \\ \Delta\sigma_{mn} &= \arctan\frac{v\sin\sigma_v - v_m\sin\sigma_m}{v\cos\sigma_v - v_m\cos\sigma_m} \end{aligned}\right. \tag{22}$$
To facilitate the analysis and prediction, $\dot\sigma_{los}$ is calculated as follows. Differentiating the second formula in Eq.(19) gives:
$$\dot L_{Rgo}\dot\sigma_{los} + L_{Rgo}\ddot\sigma_{los} = \dot v\sin\Delta\sigma + v\,\Delta\dot\sigma\cos\Delta\sigma \tag{23}$$
Substituting the heading error and the first formula of Eq.(19) into Eq.(23), the derivative of the LOS angular rate is obtained as Eq.(24).
$$\begin{aligned} \dot L_{Rgo}\dot\sigma_{los} + L_{Rgo}\ddot\sigma_{los} &= \dot v\sin\Delta\sigma + v\,\Delta\dot\sigma\cos\Delta\sigma \\ \dot L_{Rgo}\dot\sigma_{los} + L_{Rgo}\ddot\sigma_{los} &= \frac{\dot v}{v}L_{Rgo}\dot\sigma_{los} - \dot L_{Rgo}\dot\sigma_{los} + \dot L_{Rgo}\dot\sigma \\ \ddot\sigma_{los} &= \left(\frac{\dot v}{v} - \frac{2\dot L_{Rgo}}{L_{Rgo}}\right)\dot\sigma_{los} + \frac{\dot L_{Rgo}}{L_{Rgo}}\dot\sigma \end{aligned} \tag{24}$$
Defining $T_{goc}$ as the predicted remaining flight time, $T_{goc}$ is derived from the remaining flight range and its rate of change, as expressed in Eq.(25).
$$T_{goc} = -\frac{L_{Rgo}}{\dot L_{Rgo}} \tag{25}$$
Defining the state $x = \dot\sigma_{los}$ and the control $u = \dot\sigma$, the differential equation of the LOS angular rate is obtained, as shown in Eq.(26).
$$\dot x = \left(\frac{\dot v}{v} + \frac{2}{T_{goc}}\right)x - \frac{1}{T_{goc}}u \tag{26}$$
In the latter phase of the gliding flight, $\frac{\dot v}{v}$ is an order of magnitude smaller than $\frac{2}{T_{goc}}$, so Eq.(26) is further simplified to Eq.(27).
$$\dot x = \frac{2}{T_{goc}}x - \frac{1}{T_{goc}}u \tag{27}$$
The current remaining flight time $T_{goc}$, a future time $t$ measured from the current moment, and the remaining flight time $T_{go1}$ at time $t$ satisfy the following relationship:
$$T_{goc} = T_{go1} + t \tag{28}$$
so $\mathrm{d}T_{go1} = -\mathrm{d}t$ describes the change of the remaining flight time. For a given control input $u$, the integral of Eq.(27) is solved, and the result is shown in Eq.(29).
$$\begin{aligned} x(t) &= e^{\int\frac{2}{T_{go1}}\mathrm{d}t}\left(\int\left(-\frac{u}{T_{go1}}\right)e^{-\int\frac{2}{T_{go1}}\mathrm{d}t}\,\mathrm{d}t + C\right) \\ &= e^{\int\frac{2}{T_{goc}-t}\mathrm{d}t}\left(\int\left(-\frac{u}{T_{goc}-t}\right)e^{-\int\frac{2}{T_{goc}-t}\mathrm{d}t}\,\mathrm{d}t + C\right) \\ &= e^{-2\ln\left(T_{goc}-t\right)}\left(-\int u\left(T_{goc}-t\right)\mathrm{d}t + C\right) \\ &= \frac{1}{\left(T_{goc}-t\right)^{2}}\left(u\left(\frac{1}{2}t^{2} - T_{goc}t\right) + C\right) \end{aligned} \tag{29}$$
The LOS angular rate at the current time $t=0$ is $\dot\sigma_{los}$, so the constant $C$ is expressed by Eq.(30).
$$C = \dot\sigma_{los}T_{goc}^{2} \tag{30}$$
$\dot\sigma_{los}(t)$ is then obtained from Eq.(31).
$$\dot\sigma_{los}(t) = \frac{1}{\left(T_{goc}-t\right)^{2}}\left(u\left(\frac{1}{2}t^{2} - T_{goc}t\right) + \dot\sigma_{los}T_{goc}^{2}\right) \tag{31}$$
$\dot\sigma_{los}$ at the terminal time $t_f$ is shown in Eq.(32).
$$\dot\sigma_{los}(t_f) = \frac{1}{\left(T_{goc}-t_f\right)^{2}}\left(u\left(\frac{1}{2}t_f^{2} - T_{goc}t_f\right) + \dot\sigma_{los}T_{goc}^{2}\right) \tag{32}$$
Similarly, based on the above analysis, $\dot\sigma_{int}^{los}$ at the encounter moment is predicted analytically; the solution is shown in Eq.(33).
$$\dot\sigma_{int}^{los}(t_{int}^{f}) = \frac{1}{\left(T_{goc}^{int}-t_{int}^{f}\right)^{2}}\left(u\left(\frac{1}{2}\left(t_{int}^{f}\right)^{2} - T_{goc}^{int}\,t_{int}^{f}\right) + \dot\sigma_{int}^{los}\left(T_{goc}^{int}\right)^{2}\right) \tag{33}$$
where $T_{goc}^{int}$ represents the predicted total encounter time from the current moment, $t_{int}^{f}$ represents the encounter time with the interceptor, and $u$ is the input overload.
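The closed-form prediction of Eqs.(31)-(33) is what allows the reward to be evaluated without numerical integration. A minimal sketch (function name assumed) is:

```python
# Sketch of Eqs.(31)-(33): closed-form LOS angular rate at a future time t (t < T_goc),
# given the current rate, the predicted time-to-go and a constant control input u.
def predicted_los_rate(los_rate_now, T_goc, u, t):
    C = los_rate_now * T_goc**2                          # Eq.(30): constant at t = 0
    return (u * (0.5 * t**2 - T_goc * t) + C) / (T_goc - t)**2

# Usage (Eq.(32) and Eq.(33)); t_f and t_int_f are the predicted terminal/encounter times:
# sigma_dot_f   = predicted_los_rate(sigma_dot_los,     T_goc,     u, t_f)
# sigma_dot_int = predicted_los_rate(sigma_dot_int_los, T_int_goc, u, t_int_f)
```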

4.2.2. The Solution of LOS Angular Rate in the Longitudinal Direction

The attack defense confrontation model in longitudinal direction is shown in Figure 4.
$R_{Rgo}$ and $R_{Mgo}$ represent the remaining ranges from the UAV to the target and to the interceptor in the longitudinal plane, which are calculated in Eqs.(34) and (35).
The longitudinal relative motion model between the UAV and the target is shown in Eq.(34).
$$\left\{\begin{aligned} \dot R_{Rgo} &= -v\cos\Delta\theta \\ R_{Rgo}\dot\theta_{los} &= v\sin\Delta\theta \end{aligned}\right. \tag{34}$$
The longitudinal relative motion model between the UAV and the interceptor is shown in Eq.(35).
$$\left\{\begin{aligned} \dot R_{Mgo} &= v\cos\Delta\theta - v_m\cos\Delta\theta_m \\ R_{Mgo}\dot\theta_{int}^{los} &= v\sin\Delta\theta - v_m\sin\Delta\theta_m \end{aligned}\right. \tag{35}$$
In order to simplify the calculation, the relative motion equation Eq.(35) is rewritten as Eq.(36).
$$\left\{\begin{aligned} \dot R_{Mgo} &= v_r\cos\Delta\theta_{mn} \\ R_{Mgo}\dot\theta_{int}^{los} &= v_r\sin\Delta\theta_{mn} \end{aligned}\right. \tag{36}$$
where $v_r$ and $\Delta\theta_{mn}$ are calculated as in Eq.(37).
$$\left\{\begin{aligned} v_r &= \sqrt{\left(v\cos\theta_v - v_m\cos\theta_m\right)^{2} + \left(v\sin\theta_v - v_m\sin\theta_m\right)^{2}} \\ \Delta\theta_{mn} &= \arctan\frac{v\sin\theta_v - v_m\sin\theta_m}{v\cos\theta_v - v_m\cos\theta_m} \end{aligned}\right. \tag{37}$$
To facilitate the analysis and prediction, $\dot\theta_{los}$ is calculated as follows. Differentiating the second formula in Eq.(34) gives:
$$\dot R_{Rgo}\dot\theta_{los} + R_{Rgo}\ddot\theta_{los} = \dot v\sin\Delta\theta + v\,\Delta\dot\theta\cos\Delta\theta \tag{38}$$
Substituting the heading error and the first formula of Eq.(34) into Eq.(38), the derivative of the LOS angular rate is obtained as Eq.(39).
$$\begin{aligned} \dot R_{Rgo}\dot\theta_{los} + R_{Rgo}\ddot\theta_{los} &= \dot v\sin\Delta\theta + v\,\Delta\dot\theta\cos\Delta\theta \\ \dot R_{Rgo}\dot\theta_{los} + R_{Rgo}\ddot\theta_{los} &= \frac{\dot v}{v}R_{Rgo}\dot\theta_{los} - \dot R_{Rgo}\dot\theta_{los} + \dot R_{Rgo}\dot\theta \\ \ddot\theta_{los} &= \left(\frac{\dot v}{v} - \frac{2\dot R_{Rgo}}{R_{Rgo}}\right)\dot\theta_{los} + \frac{\dot R_{Rgo}}{R_{Rgo}}\dot\theta \end{aligned} \tag{39}$$
Based on the predicted remaining flight time $T_{goc}$ defined in the lateral prediction, the LOS angular rate at the current time $t=0$ is $\dot\theta_{los}$, so the constant $C$ is expressed by Eq.(40).
$$C = \dot\theta_{los}T_{goc}^{2} \tag{40}$$
$\dot\theta_{los}(t)$ is obtained from Eq.(41).
$$\dot\theta_{los}(t) = \frac{1}{\left(T_{goc}-t\right)^{2}}\left(u\left(\frac{1}{2}t^{2} - T_{goc}t\right) + \dot\theta_{los}T_{goc}^{2}\right) \tag{41}$$
$\dot\theta_{los}$ at the terminal time $t_f$ is shown in Eq.(42).
$$\dot\theta_{los}(t_f) = \frac{1}{\left(T_{goc}-t_f\right)^{2}}\left(u\left(\frac{1}{2}t_f^{2} - T_{goc}t_f\right) + \dot\theta_{los}T_{goc}^{2}\right) \tag{42}$$
Similarly, based on the above analysis, $\dot\theta_{int}^{los}$ at the encounter moment is predicted analytically; the solution is shown in Eq.(43).
$$\dot\theta_{int}^{los}(t_{int}^{f}) = \frac{1}{\left(T_{goc}^{int}-t_{int}^{f}\right)^{2}}\left(u\left(\frac{1}{2}\left(t_{int}^{f}\right)^{2} - T_{goc}^{int}\,t_{int}^{f}\right) + \dot\theta_{int}^{los}\left(T_{goc}^{int}\right)^{2}\right) \tag{43}$$
where $T_{goc}^{int}$ represents the predicted total encounter time from the current moment, $t_{int}^{f}$ represents the encounter time with the interceptor, and $u$ is the input overload.
According to the above analysis, $(\dot\sigma_{los}, \dot\theta_{los})$ and $(\dot\sigma_{int}^{los}, \dot\theta_{int}^{los})$ depend on the change of the LOS angle rates. For the guidance mission, $(\dot\sigma_{los}, \dot\theta_{los})$ is related to the overload of the UAV: the smaller the value, the closer the UAV approaches the target at the end of the gliding flight. For the penetration mission, $(\dot\sigma_{int}^{los}, \dot\theta_{int}^{los})$ is related to the overloads of both the UAV and the interceptor: the greater the value, the higher the cost for the interceptor in the terminal interception phase, and the easier it is for the UAV to break through if the control overload of the interceptor reaches saturation.

5. DRL Penetration Guidance Law

5.1. SAC Training Model

The standard DRL objective is to maximize the sum of expected rewards $\sum_t \mathbb{E}_{(s_t,a_t)\sim\rho_\pi}\left[f_r(s_t,a_t)\right]$. For the problem of multi-dimensional continuous state inputs and continuous action outputs, SAC networks are introduced to solve the MDP model.
Compared with other policy learning algorithms [24], SAC augments the standard RL objective with the expected policy entropy:
$$J_\pi = \sum_t\gamma^{t}\,\mathbb{E}_{(s_t,a_t)\sim\rho_\pi}\left[f_r(s_t,a_t) + \tau\,\mathcal{H}\left(\pi(\cdot|s_t)\right)\right] \tag{44}$$
The entropy term $\mathcal{H}\left(\pi(\cdot|s_t)\right)$ is shown in Eq.(45); it represents the stochasticity of the strategy and balances the exploration and learning of the networks. The temperature parameter $\tau$ determines the relative importance of the entropy against the immediate reward.
$$\mathcal{H}\left(\pi(\cdot|s_t)\right) = -\int_{a\in A}\pi(a|s_t)\log\pi(a|s_t)\,\mathrm{d}a = \mathbb{E}_{a\sim\pi(\cdot|s_t)}\left[-\log\pi(a|s_t)\right] \tag{45}$$
The optimal strategy of SAC is shown in Eq.(46); it maximizes both the cumulative reward and the policy entropy.
$$\pi^{*}_{MaxEnt} = \arg\max_{\pi}\sum_t\gamma^{t}\,\mathbb{E}_{(s_t,a_t)\sim\rho_\pi}\left[f_r(s_t,a_t) + \tau\,\mathcal{H}\left(\pi(\cdot|s_t)\right)\right] \tag{46}$$
The framework of the SAC networks is shown in Figure 5 and consists of an Actor net and a Critic net. The Actor net generates the action, and the environment returns the reward and the next state. All trajectory data are stored in the experience pool, including the state, action, reward and next state.
The Critic net is used to judge the found strategies, which impartially guides the strategy of the Actor net. At the beginning, the Actor net and Critic net are given random parameters: the Actor net can hardly generate the optimal strategy, and the Critic net can hardly judge the Actor strategy scientifically. The network parameters therefore need to be updated by continuously generating and sampling trajectory data.
To update the Critic net, the Critic net outputs the expected reward based on the samples, and the Actor net outputs the action probability, which is characterized by the entropy term $\mathcal{H}\left(\pi(\cdot|s_t)\right)$. Combining the expected reward with $\mathcal{H}\left(\pi(\cdot|s_t)\right)$, the soft value function is obtained in Eq.(47).
$$Q_{soft}(s_t,a_t) = \mathbb{E}_{(s_t,a_t)\sim\rho_\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_t,a_t) + \tau\sum_{t=1}^{\infty}\gamma^{t}\,\mathcal{H}\left(\pi(\cdot|s_t)\right)\right] \tag{47}$$
The soft Bellman equation is further obtained:
$$Q_{soft}(s_t,a_t) = \mathbb{E}_{s_{t+1}\sim\rho(s_{t+1}|s_t,a_t),\,a_{t+1}\sim\pi}\left[r(s_t,a_t) + \gamma\left(Q_{soft}(s_{t+1},a_{t+1}) + \tau\,\mathcal{H}\left(\pi(\cdot|s_{t+1})\right)\right)\right] \tag{48}$$
The loss function of the Critic net is then given by Eq.(49):
$$J_Q(\psi) = \mathbb{E}_{(s_t,a_t,s_{t+1})\sim D,\,a_{t+1}\sim\pi}\left[\frac{1}{2}\Big(Q_{soft}(s_t,a_t) - \big(r(s_t,a_t) + \gamma\left(Q_{soft}(s_{t+1},a_{t+1}) - \tau\log\pi(a_{t+1}|s_{t+1})\right)\big)\Big)^{2}\right] \tag{49}$$
To update the Actor net, the policy improvement step is shown in Eq.(50):
$$\pi_{new} = \arg\min_{\pi'\in\Pi}D_{KL}\left(\pi'(\cdot|s_t)\,\middle\|\,\frac{\exp\left(\frac{1}{\tau}Q_{soft}^{\pi_{old}}(s_t,\cdot)\right)}{Z_{soft}^{\pi_{old}}(s_t)}\right) \tag{50}$$
where $\Pi$ represents the set of admissible strategies, $Z$ is the partition function used to normalize the distribution, and $D_{KL}$ is the Kullback-Leibler (KL) divergence [25].
Combining the re-parameterization technique with Eq.(50), the loss function of the Actor net in Eq.(51) is obtained:
$$J_\pi(\phi) = \mathbb{E}_{s_t\sim D,\,\varepsilon_t\sim\mathcal{N}}\left[\tau\log\pi\left(f(\varepsilon_t;s_t)\,|\,s_t\right) - Q_{soft}\left(s_t,f(\varepsilon_t;s_t)\right)\right] \tag{51}$$
in which $a_t = f(\varepsilon_t;s_t)$, and $\varepsilon_t$ is the input noise obeying the distribution $\mathcal{N}$.
Stochastic gradient descent is used to minimize the loss functions of the networks. The optimal parameters of the Actor-Critic networks are obtained by repeating the updating process, and the parameters are passed to the target networks via soft updating.
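A minimal PyTorch sketch of the Critic and Actor losses in Eqs.(49)-(51) is given below. The network classes, the replay buffer and the target critic are assumed to exist elsewhere with the indicated interfaces (`actor.sample` returning a reparameterized action and its log-probability, `critic(s, a)` returning a soft-Q value); `tau_ent` denotes the entropy temperature.

```python
# Sketch of the SAC updates of Eqs.(49)-(51) under assumed actor/critic interfaces.
import torch
import torch.nn.functional as F

def critic_loss(critic, target_critic, actor, batch, gamma=0.99, tau_ent=0.2):
    s, a, r, s_next = batch                       # tensors sampled from the experience pool
    with torch.no_grad():
        a_next, logp_next = actor.sample(s_next)              # reparameterized sample
        q_next = target_critic(s_next, a_next)
        target = r + gamma * (q_next - tau_ent * logp_next)   # soft Bellman backup, Eqs.(48)-(49)
    q = critic(s, a)
    return 0.5 * F.mse_loss(q, target)

def actor_loss(critic, actor, batch, tau_ent=0.2):
    s, _, _, _ = batch
    a, logp = actor.sample(s)                     # a = f(eps; s), eps ~ N, Eq.(51)
    return (tau_ent * logp - critic(s, a)).mean()
```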
Figure 5. Updating principle of SAC.

5.2. Meta SAC Optimization Algorithm

The learning algorithm in DRL relies on extensive interaction between the agent and the environment and has high training costs. Once the environment changes, the original strategy is no longer applicable and has to be learned from scratch. The penetration guidance problem in a stable flight environment can be solved by SAC networks. For a changeable flight environment, for example when the initial position of the interceptor changes greatly or the interceptor guidance law deviates greatly from the preset value, the strategy solved by traditional SAC can hardly adapt and needs to be re-studied and re-designed. This manuscript introduces Meta learning to optimize and improve the SAC performance. The training goal of Meta SAC is to obtain initial SAC model parameters such that, when the UAV penetration mission changes, the UAV can adapt to the new environment and complete the corresponding guidance penetration mission after only a few episodes of learning, without re-learning the model parameters from scratch. Meta SAC enables the UAV to "learn while flying" and strengthens its adaptability.
The Meta SAC algorithm is shown in Algorithm 1 and is divided into a meta training phase and a meta testing phase. The meta training phase solves for the optimal meta learning parameters based on multiple experience missions. In the meta testing phase, the trained meta parameters are fine-tuned by interacting with the new mission environment.
Algorithm 1 Meta SAC
1: Initialize the experience pool $\Omega$ and the storage space $N$
2: Meta training:
3: Inner loop:
4: for iteration $k$ do
5:  sample mission($k$) from $T\sim p(T)$
6:  update the actor policy $\Theta$ to $\Theta'$ using SAC on mission($k$):
7:   $\Theta' = \mathrm{SAC}\left(\Theta, \mathrm{mission}(k)\right)$
8: end for
9: Outer loop:
10: $\Theta = \mathrm{mmse}\left(\sum_{i=1}^{k}\Theta_i'\right)$
11: Generate $D_1$ from $\Theta$ and estimate the reward of $\Theta$
12: Add a hidden-layer feature as a random noise
13: $\Theta_i = \Theta + \alpha_{\Theta}\nabla_{\Theta}\,\mathbb{E}_{a_t\sim\pi(a_t|s_t;\Theta,z_i),\,z_i\sim\mathcal{N}(\mu_i,\sigma_i)}\left[\sum_t R_t(s_t)\right]$
14: Carry out the meta learning process of the different missions through SGD:
15: for iteration mission($k$) do
16:  $\min_{\Theta}\sum_{T_i\sim p(T)}\mathcal{L}_{T_i}\left(f_{\Theta_i'}\right) = \sum_{T_i\sim p(T)}\mathcal{L}_{T_i}\left(f_{\Theta-\alpha\nabla_{\Theta}\mathcal{L}_{T_i}(f_{\Theta})}\right)$
17:  $\Theta = \Theta - \beta\nabla_{\Theta}\sum_{T_i\sim p(T)}\mathcal{L}_{T_i}\left(f_{\Theta_i'}\right)$
18: end for
19: Meta testing:
20: Initialize the experience pool $\Omega$ and the storage space $N$
21: Load the meta-training network parameters $\Theta$
22: Set the training parameters
23: for iteration $i$ do
24:  sample a mission from $T\sim p(T)$
25:  $\Theta' = \mathrm{SAC}\left(\Theta, \mathrm{mission}(i)\right)$
26: end for
The basic assumption of Meta SAC is that the experience missions used for meta training and the new missions used for meta testing obey the same mission distribution $p(T)$; therefore, different missions share some common characteristics. In the DRL setting, the goal is to learn a function $f_\theta$ with parameters $\theta$ that minimizes the loss function $\mathcal{L}_T$ of a specific mission $T$. In the Meta DRL setting, the goal is to learn a learning process $\theta' = \mu_\psi(D_T^{tr},\theta)$ that can quickly adapt to a new mission $T$ with a very small dataset $D_T^{tr}$. Meta SAC can thus be summarized as optimizing the parameters $\theta, \psi$ of the learning process:
$$\min_{\theta,\psi}\ \mathbb{E}_{T\sim p(T)}\left[\mathcal{L}\left(D_T^{test},\theta'\right)\right]\quad \mathrm{s.t.}\quad \theta' = \mu_\psi\left(D_T^{tr},\theta\right) \tag{52}$$
where $D_T^{tr}$ and $D_T^{test}$ respectively represent the training and testing data sampled from $p(T)$, and $\mathcal{L}(D_T^{test},\theta')$ is the testing loss function. In the meta training phase, the parameters are optimized by an inner loop and an outer loop.
In the inner loop, Meta SAC updates the model parameters with a small amount of randomly selected data of a specific mission $T$ as the training data, reducing the loss of the model on mission $T$. In this part, the update of the model parameters is the same as in the original SAC algorithm, and the agent learns several episodes on the randomly selected missions.
The minimum mean square error of the strategy parameters $\theta$ corresponding to the different missions in the inner loop is solved to obtain the initial strategy parameters $\theta_{ini}$ of the outer loop. In this manuscript, a hidden-layer feature is added to the input of the strategy $\theta_{ini}$ as a random noise. The random noise is re-sampled in each episode in order to provide temporally more consistent random exploration, which helps the agent adjust its overall exploration according to the MDP of the current mission. The goal of Meta learning is to let the agent learn how to adapt quickly to new missions by simultaneously updating a small number of gradient steps of the strategy parameters and the hidden-layer features. Therefore, the $\theta$ in $\theta' = \mu_\psi(D_T^{tr},\theta)$ includes not only the parameters of the neural network but also the distribution parameters of the hidden variables of each mission, namely the mean and variance of the Gaussian distribution, as shown in Eq.(53).
$$\begin{aligned} \mu_i &= \mu_i + \alpha_{\mu}\nabla_{\mu_i}\,\mathbb{E}_{a_t\sim\pi(a_t|s_t;\theta,z_i),\,z_i\sim\mathcal{N}(\mu_i,\sigma_i)}\left[\sum_t R_t(s_t)\right] \\ \sigma_i &= \sigma_i + \alpha_{\sigma}\nabla_{\sigma_i}\,\mathbb{E}_{a_t\sim\pi(a_t|s_t;\theta,z_i),\,z_i\sim\mathcal{N}(\mu_i,\sigma_i)}\left[\sum_t R_t(s_t)\right] \\ \theta_i &= \theta + \alpha_{\theta}\nabla_{\theta}\,\mathbb{E}_{a_t\sim\pi(a_t|s_t;\theta,z_i),\,z_i\sim\mathcal{N}(\mu_i,\sigma_i)}\left[\sum_t R_t(s_t)\right] \end{aligned} \tag{53}$$
The model is represented by a parameterized function $f_\theta$ with parameters $\theta$; when it is transferred to a new mission $T_i$, the model parameters $\theta$ are updated to $\theta_i'$ through a gradient step, as shown in Eq.(54).
$$\theta_i' = \theta - \alpha\nabla_{\theta}\mathcal{L}_{T_i}\left(f_{\theta}\right) \tag{54}$$
The update step $\alpha$ is a fixed hyperparameter. The model parameters $\theta$ are updated to maximize the performance of $f_{\theta_i'}$ across the different missions, as shown in Eq.(55).
$$\min_{\theta}\sum_{T_i\sim p(T)}\mathcal{L}_{T_i}\left(f_{\theta_i'}\right) = \sum_{T_i\sim p(T)}\mathcal{L}_{T_i}\left(f_{\theta-\alpha\nabla_{\theta}\mathcal{L}_{T_i}(f_{\theta})}\right) \tag{55}$$
The meta learning process across the different missions is carried out through SGD, and the update rule for $\theta$ is:
$$\theta = \theta - \beta\nabla_{\theta}\sum_{T_i\sim p(T)}\mathcal{L}_{T_i}\left(f_{\theta_i'}\right) \tag{56}$$
where $\beta$ is the meta update step.
In the meta testing phase, a small amount of experience in the new mission is used to quickly learn a strategy for solving it. A new mission may involve completing a new mission goal or achieving the same goal in a new environment. The updating process of the model in this phase is the same as the inner loop of the meta training phase: the loss function is calculated with the data collected in the new mission, the model is adjusted through backpropagation, and finally the agent adapts to the new mission.
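A schematic sketch of the meta training loop of Algorithm 1 and Eqs.(54)-(56) is given below. It uses a first-order approximation of the meta gradient (FOMAML-style), which may differ from the authors' exact update; `sample_mission`-style helpers, `sac_adapt` and `evaluate_loss` are assumed to exist, and `policy` is assumed to be a PyTorch module.

```python
# First-order sketch of the Meta SAC outer loop: inner SAC adaptation per mission,
# then a meta update of the initial parameters with step beta (Eq.(56)).
import copy
import torch

def meta_train(policy, missions, meta_iters=1000, inner_steps=5, beta=1e-3):
    meta_opt = torch.optim.Adam(policy.parameters(), lr=beta)
    for _ in range(meta_iters):
        meta_opt.zero_grad()
        for mission in missions:                                 # T_i ~ p(T)
            adapted = copy.deepcopy(policy)
            sac_adapt(adapted, mission, steps=inner_steps)       # inner loop: theta -> theta_i'
            loss_i = evaluate_loss(adapted, mission)             # L_{T_i}(f_{theta_i'})
            grads = torch.autograd.grad(loss_i, list(adapted.parameters()))
            # first-order approximation: apply the adapted gradients to the meta parameters
            for p, g in zip(policy.parameters(), grads):
                g = g.detach()
                p.grad = g if p.grad is None else p.grad + g
        meta_opt.step()                                          # theta <- theta - beta * sum grads
    return policy
```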

6. Simulation Analysis

In this section, the manuscript analyzes and verifies the escape guidance strategy based on Meta SAC. SAC is used to solve a specific escape guidance mission, and comprehensive experiments are conducted to verify whether the UAV can complete the guidance escape mission while satisfying the terminal and process constraints. Once the UAV guidance escape mission changes, the original SAC-based strategy can hardly adapt to the changed mission and needs to be re-learned and re-trained. The manuscript therefore proposes an optimization method via Meta learning, which improves the learning ability of the UAV during training. This section focuses on validating Meta SAC and demonstrating its performance on various new missions. Besides, the maneuvering overload commands under different pursuit-evasion distances are analyzed to explore the influence of different maneuvering timings and distances on the escape results. The CAV-H is taken as the test vehicle to verify the escape guidance performance. The initial conditions, terminal altitude and Meta SAC training parameters are given in Table 1.

6.1. Validity Verification on SAC

In order to verify the effectiveness of SAC, three different pursuit-evasion scenarios are constructed, and the terminal reward value, miss distance and terminal position deviation are analyzed. As shown in Figure 6(a), the terminal reward value is poor in the initial phase of training, which indicates the optimal strategy has not yet been found. After 500 episodes, the terminal reward value increases gradually, indicating that a better strategy is explored and the training converges. In the last 100 episodes, the optimal strategy has been learned and the network parameters have been tuned to the optimum. As can be seen from Figure 6(b), the miss distance is relatively scattered in the first 150 episodes of training, indicating that the Actor network in SAC is constantly exploring new strategies and the Critic network is still learning a scientific evaluation criterion. After 500 training episodes, the network gradually converges toward the optimal solution, and the miss distance at the encounter moment converges to about 20 m. As shown in Figure 6(c), the terminal position of the UAV has a large deviation in the early training phase, which is attributed to the exploration of the escape strategy by the network. In the later training phase, the position deviation is less affected by exploration. All pursuit-evasion scenarios tested in the manuscript achieve convergence, and the final converged values are all within 1 m.
Figure 6. Training results of SAC: (a) reward value, (b) miss distance, (c) target deviation.
In order to verify whether the SAC algorithm can solve an escape guidance strategy that meets the mission requirements in different pursuit-evasion scenarios, the pursuit-evasion distance is changed, and the training results are shown in Figure 7. In the medium-range scenario, the miss distance converges to about 2 m, and the terminal deviation converges to about 1 m.
Figure 7. Training results of SAC: (a) reward value, (b) miss distance, (c) target deviation.
As shown in Figure 8, in the long-range attack-defense scenario, the miss distance converges to about 5 m, and the terminal deviation converges to about 1 m.
Figure 8. Training results of SAC: (a) reward value, (b) miss distance, (c) target deviation.
Based on the above simulation analysis, SAC is a feasible method to solve the UAV guidance escape strategy. After a limited number of episodes of learning and training, the network parameters converge and can be used to test the flight mission.

6.2. Validity Verification on Meta SAC

When the mission of the UAV changes, the original SAC parameters cannot meet the requirements of the new mission and need to be re-trained. The SAC proposed in this manuscript is improved via Meta learning: strongly adaptive network parameters are found by learning and training, and when the pursuit-evasion environment changes, the network parameters are fine-tuned to adapt to the new environment immediately.
Meta SAC is divided into a meta training phase and a meta testing phase. Initialization parameters for the SAC network are trained in the meta training phase and are fine-tuned by interacting with the new environment in the meta testing phase. By changing the initial interceptor position, three different pursuit-evasion scenarios are constructed, representing short, medium and long distances, respectively.
The training results of Meta SAC and SAC are compared, and the terminal reward values are shown in Figure 9(a). Meta SAC effectively speeds up the training process: after 100 episodes, a better strategy is learned by the network and converges gradually, whereas the SAC network needs 500 episodes to find the optimal solution. The miss distance is shown in Figure 9(b); the better strategy is learned more quickly by Meta SAC, which is more efficient than the SAC method. Figure 9(c) shows the terminal deviation between the UAV and the target.
To explore the optimal solution as much as possible, some strategies with large terminal position deviations appear in the training process. As shown in Figure 10(b-c), in the medium-range attack-defense scenario, the miss distance converges to about 8 m based on Meta SAC, and the terminal deviation converges to about 1 m.
As shown in Figure 11(b-c), in the long-range attack-defense scenario, the miss distance converges to about 10 m based on Meta SAC, and the terminal deviation converges to about 1 m.
According to the theoretical analysis, during training Meta SAC performs micro-testing on new missions drawn from the same distribution, so the network learns more gradient descent directions toward the optimal solution. Combining the theoretical analysis and the training results, the manuscript shows that Meta learning is a feasible method to accelerate convergence and improve training efficiency.
As analyzed above, when the pursuit-evasion scenario changes, the network parameters obtained in the meta training phase are fine-tuned through a few interactions. The manuscript verifies the meta testing performance by changing the initial interceptor position, and the results compared with the SAC method are shown in Table 2. Based on the network parameters of the meta training phase, strategies meeting the escape guidance missions are found within 10 training episodes. In contrast, the network parameters based on SAC need more interaction to find solutions, basically more than 50 episodes. According to the above simulation, the adaptability of Meta SAC is much greater than that of SAC: once the escape mission changes, the new mission is completed by the UAV after very few episodes of learning, without re-learning and re-designing the strategy. The method makes it possible for the UAV to learn while flying.

6.3. Strategy Analysis Based on Meta SAC

This section tests the network parameters based on Meta SAC and analyzes the escape strategy and flight state under different pursuit-evasion distances. As shown in Figure 12(a), for the short-distance pursuit-evasion scenario, the longitudinal maneuvering overload is larger in the first half of the escape, so the velocity slope angle decreases gradually. In the second half of the escape, if the strategy were executed with the original maneuvering overload, the terminal altitude constraint could not be satisfied; therefore, the overload gradually decreases, and the velocity slope angle is reduced slowly. As shown in Figure 12(b), at the beginning of the escape, the lateral maneuvering overload is positive and the velocity azimuth angle keeps increasing. As the distance between the UAV and the interceptor decreases, the overload increases gradually in the opposite direction and the velocity azimuth angle decreases. On the one hand, this confuses the strategy of the interceptor; on the other hand, the guidance course is corrected.
As shown in Figure 13(a), compared with the short-distance pursuit-evasion scenario, the medium-distance escape process takes longer, the pursuit time left to the interceptor is longer, and the UAV flies in the direction of increasing the velocity slope angle. The timing of the maximum escape overload corresponding to the medium distance is also different. As shown in Figure 13(b), in the first half of the escape, the lateral maneuvering overload corresponding to the medium distance is larger than that in the short distance, and in the second half of the escape, the corresponding reverse maneuvering overload is smaller, so that the UAV can use the longer escape time to slowly correct the course.
As shown in Figure 14, under the long pursuit distance, the overload change of the UAV maneuver is similar to that of the medium range, and the escape timing is basically the same as in that escape strategy.
According to the above analysis, the escape guidance strategy via Meta SAC can be used as a tactical escape strategy, and the escape timing and maneuvering overload are adjusted in time under different pursuit-evasion distances. On the one hand, the overload corresponding to this strategy can confuse the interceptor and cause some interference; on the other hand, it can take the guidance mission into account, correcting the course deviation caused by the escape.
Figure 15(a) shows the flight trajectories of the interceptor and the UAV when the interceptor starts at the North-East-Down (NED) coordinate (10 km, 30 km, 30 km); the trajectory points at the encounter moment are shown in Figure 15(b), and the miss distance is 19 m in this pursuit-evasion scenario. To verify the soundness and applicability of Meta SAC, the initial position of the interceptor is changed. The flight trajectories are shown in Figure 15(c,e), and the trajectory points at the encounter moment are shown in Figure 15(d,f). The miss distances in these two pursuit-evasion scenarios are 3 m and 6 m, respectively. Based on the CAV-H structure, the miss distance between the UAV and the interceptor is greater than 2 m at the encounter moment, which means the escape mission is achieved.
Based on the principle of Meta SAC and optimal guidance, the flight states are shown in Figure 16. The longitude, latitude and altitude during the flight of the UAV are shown in Figure 16(a)-(b); under the different pursuit-evasion scenarios, the terminal position and altitude constraints are met. There are larger-amplitude modifications in the velocity slope and azimuth angles, which is attributed to the escape strategy via lateral and longitudinal maneuvering, as shown in Figure 16(c)-(d). The total change of the velocity slope and azimuth angles is within two degrees, which meets the flight process constraints. Through the analysis of the flight states, this escape strategy is an effective measure for guidance escape with high accuracy.
The flight process deviation mainly includes the aerodynamic calculation deviation and the output overload deviation. The aerodynamic coefficients are calculated by interpolation based on the flight Mach number and angle of attack, which may introduce some deviation; therefore, when calculating the aerodynamic coefficients, random noise with an amplitude of 0.1 is added to verify whether the UAV can still complete the guidance mission. As shown in Figure 17(a), the aerodynamic deviation noise causes certain disturbances to the angle of attack during flight. At the 10th second and at the end of flight, the maximum deviation of the angle of attack is 2°. Overall, however, the impact of the aerodynamic deviation on the entire flight is relatively small, and the change in angle of attack is still within the safe range of the UAV. As shown in Figure 17(b), due to the constraints of the UAV game confrontation and guidance missions, the bank angle changes significantly during the entire flight, and the aerodynamic deviation noise has a small impact on the bank angle. After adding the aerodynamic deviation noise, the miss distance between the UAV and the interceptor at the encounter time is 8.908 m, and the terminal position deviation is 0.52 m. Therefore, under the influence of aerodynamic deviation, the UAV can still complete the escape guidance mission.
For the output overload deviation, the total overload is composed of the guidance overload derived from the optimal guidance law and the maneuvering overload output by the neural network. A random maneuvering overload with an amplitude of 0.1 is added to verify whether the UAV can complete the maneuvering guidance mission. As shown in Figure 18, random overloads are added in the longitudinal and lateral directions, respectively. Through simulation testing, the miss distance between the UAV and the interceptor at the encounter point is 10.51 m, and the terminal deviation of the UAV is 0.6 m. Under this deviation, the UAV can still achieve high-precision guidance and efficient penetration.

7. Conclusion

The manuscript proposes an escape guidance strategy satisfying terminal multiple constraints via SAC. The action space is designed under the UAV process constraints. Considering that real-time interceptor information is hard to obtain in the actual escape process, the state space considering the target and the heading angle is designed. The reward function is an important index function used to guide and evaluate the training results. Based on the pursuit-evasion model, the terminal LOS angle rates in the lateral and longitudinal directions are derived to describe the deviation and miss distance. In order to improve the adaptability of the escape guidance strategy, the manuscript improves SAC via Meta learning and compares Meta SAC with SAC. The strongly adaptive escape strategy based on Meta SAC is analyzed. In view of the above theoretical and numerical analysis, we draw the following conclusions.
(1)
The escape guidance strategy based on SAC is a feasible tactical escape strategy, which can achieve high-precision guidance while meeting the escape requirements.
(2)
Meta SAC can significantly improve the adaptability of the escape strategy. When the escape mission changes, the network parameters can be fine-tuned to adapt to the mission through a small number of training episodes. This method makes it possible for the UAV to learn while flying.
(3)
The strongly adaptive escape strategy based on Meta SAC can adjust the escape timing and maneuvering overload in real time according to the pursuit-evasion distance. On the one hand, the overload corresponding to this strategy can confuse the interceptor and cause some interference; on the other hand, it can take the guidance mission into account, correcting the course deviation caused by the escape.

Author Contributions

Conceptualization, SiBo Zhao and JianWen Zhu; methodology, SiBo Zhao; software, SiBo Zhao; validation, SiBo Zhao and JianWen Zhu; formal analysis, SiBo Zhao; investigation, SiBo Zhao; resources, XiaoPing Li; data curation, HaiFeng Sun; writing—original draft preparation, HaiFeng Sun; writing—review and editing, XiaoPing Li; visualization, SiBo Zhao; supervision, WeiMin Bao. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. H. W. LI ZB, Z. J. S. ZHANG, and T. n. Review, "Summary of the Hot Spots of Near Space Vehicles in 2018," vol. 37, no. 1, p. 44, 2019. [CrossRef]
  2. Li, G.; Zhang, H.; Tang, G. Maneuver characteristics analysis for hypersonic glide vehicles. Aerosp. Sci. Technol. 2015, 43, 321–328. [CrossRef]
  3. Wang, L.; Lan, Y.; Zhang, Y.; Zhang, H.; Tahir, M.N.; Ou, S.; Liu, X.; Chen, P. Applications and Prospects of Agricultural Unmanned Aerial Vehicle Obstacle Avoidance Technology in China. Sensors 2019, 19, 642. [CrossRef]
  4. Wang, Y.; Zhou, T.; Chen, W.; He, T. Optimal maneuver penetration strategy based on power series solution of miss distance. J. Beijing Univ. Aeronaut. Astronaut. 46, 159. [CrossRef]
  5. Rim, J.-W.; Koh, I.-S. Survivability Simulation of Airborne Platform With Expendable Active Decoy Countering RF Missile. IEEE Trans. Aerosp. Electron. Syst. 2019, 56, 196–207. [CrossRef]
  6. Liu, F.; Dong, X.; Li, Q.; Ren, Z. Robust multi-agent differential games with application to cooperative guidance. Aerosp. Sci. Technol. 2021, 111, 106568. [CrossRef]
  7. Garcia, E.; Casbeer, D.W.; Pachter, M. Design and analysis of state-feedback optimal strategies for the differential game of active defense. IEEE Trans. Autom. Control 2018, 64, 553–568. [CrossRef]
  8. Liang, H.; Wang, J.; Wang, Y.; Wang, L.; Peng, L. Optimal guidance against active defense ballistic missiles via differential game strategies. Chin. J. Aeronaut. 2020, 33, 978–989. [CrossRef]
  9. Liang, L.; Deng, F.; Peng, Z.; Li, X.; Zha, W. A differential game for cooperative target defense. Automatica 2019, 102, 58–71. [CrossRef]
  10. Liu, S.; Wang, Y.; Li, Y.; Yan, B.; Zhang, T. Cooperative guidance for active defence based on line-of-sight constraint under a low-speed ratio. Aeronaut. J. 2022, 127, 491–509. [CrossRef]
  11. Zhang, D.; Zhang, T.; Lu, Y.; Zhu, Z.; Dong, B. You only propagate once: Accelerating adversarial training via maximal principle. Adv. Neural Inf. Process. Syst. 2019, 32.
  12. Ruthotto, L.; Osher, S.J.; Li, W.; Nurbekyan, L.; Fung, S.W. A machine learning framework for solving high-dimensional mean field game and mean field control problems. Proc. Natl. Acad. Sci. 2020, 117, 9183–9193. [CrossRef]
  13. Ullah, Z.; Al-Turjman, F.; Mostarda, L.; Gagliardi, R. Applications of Artificial Intelligence and Machine learning in smart cities. Comput. Commun. 2020, 154, 313–323. [CrossRef]
  14. Song, H.; Bai, J.; Yi, Y.; Wu, J.; Liu, L. Artificial Intelligence Enabled Internet of Things: Network Architecture and Spectrum Access. IEEE Comput. Intell. Mag. 2020, 15, 44–51. [CrossRef]
  15. Gong, X.; Chen, W.; Chen, Z. All-aspect attack guidance law for agile missiles based on deep reinforcement learning. Aerosp. Sci. Technol. 2022, 127. [CrossRef]
  16. Furfaro, R.; Scorsoglio, A.; Linares, R.; Massari, M. Adaptive generalized ZEM-ZEV feedback guidance for planetary landing via a deep reinforcement learning approach. Acta Astronaut. 2020, 171, 156–171. [CrossRef]
  17. Yuan, Y.; Zheng, G.; Wong, K.-K.; Letaief, K.B. Meta-Reinforcement Learning Based Resource Allocation for Dynamic V2X Communications. IEEE Trans. Veh. Technol. 2021, 70, 8964–8977. [CrossRef]
  18. Zhao, L.-B.; Xu, W.; Dong, C.; Zhu, G.-S.; Zhuang, L. Evasion guidance of re-entry vehicle satisfying no-fly zone constraints based on virtual goals. Sci. Sin. Phys. Mech. Astron. 2021, 51, 104706. [CrossRef]
  19. Guo, Y.; Li, X.; Zhang, H.; Wang, L.; Cai, M. Entry Guidance With Terminal Time Control Based on Quasi-Equilibrium Glide Condition. IEEE Trans. Aerosp. Electron. Syst. 2020, 56, 887–896. [CrossRef]
  20. Krasner, S.M.; Bruvold, K.; Call, J.A.; Fieseler, P.D.; Klesh, A.T.; Kobayashi, M.M.; Lay, N.E.; Lim, R.S.; Morabito, D.D.; Oudrhiri, K.; et al. Reconstruction of Entry, Descent, and Landing Communications for the InSight Mars Lander. J. Spacecr. Rocket. 2021, 58, 1569–1581. [CrossRef]
  21. Huang, Z.C.; Zhang, Y.S.; Liu, Y. Research on State Estimation of Hypersonic Glide Vehicle. J. Physics: Conf. Ser. 2018, 1060, 012088. [CrossRef]
  22. Zhu, J.; Su, D.; Xie, Y.; Sun, H. Impact time and angle control guidance independent of time-to-go prediction. Aerosp. Sci. Technol. 2019, 86, 818–825. [CrossRef]
  23. Ni, C.; Zhang, A.R.; Duan, Y.; Wang, M. Learning Good State and Action Representations via Tensor Decomposition. 2021, 1682–1687. [CrossRef]
  24. Ma, Y.; Wang, Z.; Castillo, I.; Rendall, R.; Bindlish, R.; Ashcraft, B.; Bentley, D.; Benton, M.G.; Romagnoli, J.A.; Chiang, L.H. Reinforcement Learning-Based Fed-Batch Optimization with Reaction Surrogate Model. 2021. [CrossRef]
  25. Yang, X.; et al. Learning high-precision bounding box for rotated object detection via Kullback–Leibler divergence. Adv. Neural Inf. Process. Syst. 2021, 34, 18381–18394.
Figure 1. Attack-defense model.
Figure 2. Guidance penetration strategy.
Figure 3. The attack-defense confrontation model in the lateral direction.
Figure 4. The attack-defense confrontation model in the longitudinal direction.
Figure 9. Meta SAC training results. (a) Reward value, (b) miss distance, (c) target deviation.
Figure 10. Meta SAC training results. (a) Reward value, (b) miss distance, (c) target deviation.
Figure 11. Meta SAC training results. (a) Reward value, (b) miss distance, (c) target deviation.
Figure 12. The maneuvering overload. (a) Longitudinal direction under short distance, (b) lateral direction under short distance.
Figure 13. The maneuvering overload. (a) Longitudinal direction under medium distance, (b) lateral direction under medium distance.
Figure 14. The maneuvering overload. (a) Longitudinal direction under long distance, (b) lateral direction under long distance.
Figure 15. Ballistic trajectories under different pursuit-evading distances: (a), (c) and (e) show the whole-course flight trajectories; (b), (d) and (f) show the trajectories at the encounter time.
Figure 16. Flight states of the UAV. (a) Longitude and latitude, (b) height, (c) velocity slope angle, (d) velocity azimuth angle.
Figure 17. (a) The attack angle, (b) the bank angle.
Figure 18. (a) The maneuvering overload in the longitudinal direction, (b) the maneuvering overload in the lateral direction.
Table 1. Simulation and Meta SAC training conditions.

Simulation Conditions | Value | Meta SAC Training Parameters | Value
UAV initial velocity | 4000 m/s | Learning episodes | 1000
Initial velocity inclination | | Guidance period | 0.1 s
Initial velocity azimuth | | Data sampling interval | 30 km
Initial position | (3°E, 1°N) | Discount factor | γ = 0.99
Initial altitude | 45 km | Soft update coefficient (tau) | 0.001
Terminal altitude | 40 km | Learning rate | 0.005
Target position | (0°E, 0°N) | Batch size per training step | 128
Interceptor initial velocity | 1500 m/s | Net layers | 2
Initial velocity inclination | longitudinal LOS angle | Net nodes | 256
Initial velocity azimuth | lateral LOS angle | Capacity of experience pool | 20000
Table 2. Results compared with the SAC method.

Interceptor initial position (km) | Interaction episodes (SAC / Meta SAC) | Miss distance in m (SAC / Meta SAC) | Terminal deviation in m (SAC / Meta SAC)
(0, 30, 0) | 74 / 1 | 3.78 / 3.29 | 0.56 / 0.61
(2, 30, 6) | 75 / 4 | 2.80 / 2.72 | 0.68 / 0.72
(4, 30, 12) | 59 / 8 | 6.93 / 3.75 | 0.69 / 0.58
(6, 30, 18) | 59 / 1 | 2.71 / 6.82 | 0.68 / 0.72
(8, 30, 24) | 26 / 2 | 3.16 / 3.70 | 0.47 / 0.50
(10, 30, 30) | 58 / 3 | 3.50 / 2.37 | 0.61 / 0.64
(12, 30, 36) | 67 / 1 | 2.86 / 2.21 | 0.68 / 0.45
(14, 30, 42) | 56 / 8 | 2.18 / 2.89 | 0.55 / 0.61
(16, 30, 48) | 69 / 1 | 2.73 / 2.23 | 0.61 / 0.72
(18, 30, 54) | 106 / 1 | 2.45 / 3.71 | 0.56 / 0.63
(20, 30, 60) | 94 / 1 | 2.70 / 2.35 | 0.49 / 0.54
(22, 30, 66) | 59 / 1 | 2.23 / 2.51 | 0.73 / 0.71
(24, 30, 72) | 62 / 1 | 2.11 / 3.47 | 0.48 / 0.67
(26, 30, 78) | 63 / 1 | 2.04 / 4.50 | 0.48 / 0.57
(28, 30, 84) | 63 / 4 | 2.64 / 5.12 | 0.47 / 0.40
(30, 30, 90) | 63 / 9 | 2.95 / 6.05 | 0.68 / 0.47
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
