CN116430904B

CN116430904B - Unmanned aerial vehicle autonomous path planning method based on lightweight continuous SAC algorithm

Info

Publication number: CN116430904B
Application number: CN202310543396.3A
Authority: CN
Inventors: 李阳阳; 李浩哲; 曹梦晨; 沈家皓; 张雪帆; 刘睿娇; 焦李成; 尚荣华
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2023-05-15
Filing date: 2023-05-15
Publication date: 2025-08-01
Anticipated expiration: 2043-05-15
Also published as: CN116430904A

Abstract

The invention discloses an unmanned aerial vehicle autonomous path planning method based on a lightweight continuous SAC algorithm, which comprises the steps of: constructing an unmanned aerial vehicle flight control mathematical model; designing a state space, an action space and a reward function; building a deep reinforcement learning neural network model; generating an experience data set; training the deep neural network by using the SAC algorithm; and performing model distillation by taking the trained network as a teacher network. The invention is based on deep reinforcement learning, takes the SAC algorithm as the basic framework of the model, autonomously designs a reward function to improve training efficiency, and reduces the network scale by model distillation, thereby realizing a highly exploratory and lightweight unmanned aerial vehicle path planning method. It solves the problems in the prior art that the flight path of the unmanned aerial vehicle is not smooth and the unmanned aerial vehicle sometimes turns in place, that misjudgment occurs under heavy noise, that training efficiency and stability are poor, and that the model responds slowly.

Description

Unmanned aerial vehicle autonomous path planning method based on lightweight continuous SAC algorithm

Technical Field

The invention belongs to the technical field of communication, and further relates to an unmanned aerial vehicle autonomous path planning method based on a lightweight continuous SAC (Soft Actor-Critic) algorithm in the technical field of unmanned aerial vehicles. The method can be applied to unmanned aerial vehicles in different environments to realize autonomous decision-making during movement, so that the unmanned aerial vehicle, without human control, reaches the target range efficiently and safely along a smoother track, thereby realizing autonomous path planning of the unmanned aerial vehicle.

Background

Unmanned aerial vehicle autonomous path planning is a technique that enables an unmanned aerial vehicle to find its way to a destination autonomously, without human intervention. During autonomous flight, the high flight speed places high demands on the response speed and accuracy of decisions, and the unmanned aerial vehicle depends strongly on its flight environment. Traditional solutions are generally based on genetic algorithms, dynamic Bayesian networks, approximate dynamic programming and similar techniques. Most of them suffer from complex modeling, low real-time decision-making efficiency, the need for large data sets, a huge amount of computation, and susceptibility to the curse of dimensionality. Because of the particular nature of unmanned aerial vehicle autonomous path planning, these problems lead to slow decisions for an unmanned aerial vehicle flying at high speed, unstable performance across application scenarios, and excessive model training cost, which makes practical application very difficult. Existing solutions based on deep reinforcement learning mostly adopt the DQN (Deep Q-Network), DDPG (Deep Deterministic Policy Gradient) or TD3 (Twin Delayed Deep Deterministic Policy Gradient) algorithm as the Markov decision model and construct a discrete action space from combinations of several basic actions. When calculating the target value, DQN and DDPG both use the same Q network to select and to evaluate actions, which yields inflated value estimates in the presence of noise and error. This is generally called the overestimation problem, and it has a great influence on the flight decisions of the unmanned aerial vehicle. The TD3 algorithm mitigates the overestimation problem of these two algorithms, but its training efficiency and stability are still poor in the real-time decision-making task of the unmanned aerial vehicle. In addition, using a discrete action space makes the autonomous flight path of the unmanned aerial vehicle insufficiently smooth, and turning in place is liable to occur.

Jinwen Hu et al., in their published paper "Autonomous Maneuver Decision Making of Dual-UAV Cooperative Air Combat Based on Deep Reinforcement Learning" (Hu, J.; Wang, L.; Hu, T.; Guo, C.; Wang, Y. Autonomous Maneuver Decision Making of Dual-UAV Cooperative Air Combat Based on Deep Reinforcement Learning. Electronics 2022, 11, 467), disclose an autonomous path planning method used in unmanned aerial vehicle air combat autonomous decision-making based on deep reinforcement learning. The method designs fifteen typical action instructions for the unmanned aerial vehicle, models the actions of the aircraft with a discrete action space, and uses the DDPG algorithm as the Markov decision model to generate the path planning strategy of the unmanned aerial vehicle. The method has two defects. First, owing to its characteristics, the DDPG algorithm suffers from the overestimation problem, so misjudgment can occur under heavy noise, and training efficiency and stability are poor. Second, a discrete action space is used; because discrete actions are not flexible enough for control, the unmanned aerial vehicle can only fly in a limited number of fixed attitudes, so its flight path is not smooth enough and it sometimes turns in place.

A patent document filed by a university of the Chinese People's Liberation Army, "Unmanned aerial vehicle combat autonomous decision-making method based on a deep reinforcement learning TD3 algorithm" (application number: 202210264539.2, application date: 2022.03.17, application publication number: CN 114706418A), discloses an autonomous path planning method used in unmanned aerial vehicle combat autonomous decision-making based on the deep reinforcement learning TD3 algorithm. The specific steps are: establishing an unmanned aerial vehicle movement model; establishing, according to the movement model, an unmanned aerial vehicle air combat model based on a Markov decision process and represented by a four-tuple comprising a state space, an action space, a reward function and a discount factor, wherein the movement model represents the state transition function in the air combat model; and training the maneuver strategy of the unmanned aerial vehicle based on the TD3 algorithm according to the air combat model. The defects of the method are that the training process of the algorithm is not stable enough and the training efficiency is poor, and that, because the TD3 algorithm adopts a relatively complex network model structure, model inference is slow and the unmanned aerial vehicle responds slowly during decision-making.

Disclosure of Invention

The invention aims to solve the following problems in the prior art: the flight path of the unmanned aerial vehicle is not smooth and the unmanned aerial vehicle sometimes turns in place; misjudgment occurs under heavy noise; training efficiency and stability are poor; and the model responds slowly.

The specific idea for achieving this purpose is as follows. To solve the real-time path planning problem of the unmanned aerial vehicle, the method first models the movement of the unmanned aerial vehicle with a three-degree-of-freedom flight model, which defines how the unmanned aerial vehicle moves in three-dimensional space in real time under different actions. The environment of the unmanned aerial vehicle is then modeled, including its state space, action space and reward function. The state space consists of the position of the unmanned aerial vehicle in three-dimensional coordinates, its speed, the coordinates of the end position, the safety distance and the like. An action space based on continuous actions is established to drive the flight of the unmanned aerial vehicle; it consists of three elements, namely tangential overload, normal overload and roll angle, which makes control of the unmanned aerial vehicle more flexible and solves the prior-art problem of unsmooth paths with turning in place. A reward function consisting mainly of three factors, distance, angle and height, is designed, with distance as the main reward and angle and height as auxiliary rewards; a potential energy-based reward mechanism gives the unmanned aerial vehicle denser rewards during flight and guides the training of its strategy network more efficiently. A deep reinforcement learning neural network based on the SAC algorithm is constructed. Compared with the DQN and DDPG algorithms, the SAC algorithm uses two Q networks and takes the smaller of their values, which solves the prior-art problem of misjudgment under heavy noise. Compared with the TD3 algorithm, the SAC algorithm introduces maximum entropy into its loss function, which greatly improves the exploration capability of the model and effectively improves its training efficiency. A prioritized experience replay mechanism is added on top of the SAC algorithm, assigning a weight to each piece of data added to the experience pool so as to improve training efficiency. By designing the potential energy-based reward function, constructing the SAC deep reinforcement learning neural network and introducing the prioritized experience replay mechanism, the method overcomes the prior-art problem of poor training efficiency and stability. Finally, model distillation is performed on the strategy network: a smaller student network is trained to make the model lightweight, which solves the prior-art problem of slow model response.

In order to achieve the above purpose, the technical scheme adopted by the invention comprises the following steps:

Step 1, constructing a three-degree-of-freedom flight model of the movement of the unmanned aerial vehicle;

Step 2, autonomously designing a state space set, a continuous action space set and a reward function based on an unmanned aerial vehicle flight control mathematical model and a Markov decision process;

Step 3, constructing a deep reinforcement learning neural network based on the SAC algorithm:

Step 3.1, constructing a strategy sub-network formed by connecting six fully connected layers in series, the numbers of nodes of the six fully connected layers being set to 14, 512, 512, 512, 512 and 3 respectively;

Step 3.2, constructing two action value evaluation sub-networks with the same structure, each formed by connecting six fully connected layers in series, the numbers of nodes of the six fully connected layers being set to 17, 512, 512, 512, 512 and 1 respectively;

Step 3.3, constructing a state value evaluation sub-network and a state value evaluation target sub-network with the same structure, each formed by connecting six fully connected layers in series, the numbers of nodes of the six fully connected layers being set to 14, 512, 512, 512, 512 and 1 respectively;

Step 3.4, connecting the two action value evaluation sub-networks in parallel and then connecting them to the strategy sub-network and to the state value sub-network respectively, to form the deep reinforcement learning neural network;

Step 4, generating an experience data set:

For each action performed by the unmanned aerial vehicle, the state space set before the action, the action space set, the reward obtained, and the state space set after the action is executed are combined into a quadruple of experience data corresponding to that action; the quadruple is stored in an experience pool, and the quadruples of at least 10000 actions stored in the experience pool form the experience data set;

Step 5, training the deep neural network by using the SAC algorithm:

Step 5.1, randomly initializing a state space set and inputting it into the strategy sub-network, which outputs an action space set; having the unmanned aerial vehicle act according to the action space set given by the strategy sub-network; storing in the experience pool the quadruple consisting of the state set of the unmanned aerial vehicle before acting, the action space set given by the strategy sub-network, the reward value generated by the reward function and the state set of the unmanned aerial vehicle after acting; assigning the highest weight to the data just stored in the experience pool, and attenuating the weights of the remaining data according to their storage order;

Step 5.2, sampling a subset of 128 pieces of experience data from the updated experience pool by prioritized experience replay and inputting it into the deep neural network, which outputs two action values, a state value 1, a state value 2 and a reward value; substituting the two currently output action values and the state value 1 into the loss function L1, substituting the smaller of the two currently output action values and the state value 2 into the loss function L2, and substituting the smaller of the two currently output action values into the loss function L3; updating the weight parameters of the action value evaluation sub-networks, the state value evaluation sub-network and the strategy sub-network respectively by gradient back-propagation; and updating the weight parameters of the state value evaluation target sub-network by the exponential decay averaging method, to obtain the updated weight parameters of the deep neural network;

Step 5.3, judging whether the currently output reward value has converged; if so, obtaining the trained deep neural network and executing step 6; otherwise, executing step 5.1;

Step 6, performing model distillation on the strategy sub-network:

Step 6.1, randomly extracting the action space sets in 10000 pieces of experience data from the experience pool as the training set of the strategy student network;

Step 6.2, inputting the training set of the strategy student network, in batches of 32 pieces of data, into the strategy network and the strategy student network respectively, substituting the outputs of the two networks into a cross-entropy loss function to calculate a loss value, and updating the weight parameters of the strategy student network by gradient back-propagation until the loss value converges, to obtain the distilled strategy sub-network;

Step 7, planning the flight path of the unmanned aerial vehicle:

Step 7.1, inputting the state space set of the unmanned aerial vehicle at the current moment of the path to be planned into the distilled strategy sub-network, which outputs the action space set at the current moment; having the unmanned aerial vehicle act according to the action space set to generate the state space set at the next moment and the motion path of the unmanned aerial vehicle at the current moment; and splicing the motion path at the current moment onto the motion path generated up to the previous moment according to the position information;

Step 7.2, judging whether the unmanned aerial vehicle has reached the target place; if so, executing step 8; otherwise, executing step 7.3;

Step 7.3, judging whether the number of actions of the unmanned aerial vehicle has reached a preset upper limit; if so, reporting that the path planning of the unmanned aerial vehicle has failed; otherwise, repeating step 7.1;

Step 8, taking the path spliced together up to the arrival of the unmanned aerial vehicle at the target place as the planned action path of the unmanned aerial vehicle.

Compared with the prior art, the invention has the following advantages:

Firstly, the invention adopts the SAC algorithm, an advanced algorithm in the field of deep reinforcement learning, adds a prioritized experience replay mechanism on top of it, and autonomously designs a potential energy-based reward function. This effectively avoids the prior-art defects of misjudgment caused by overestimation under heavy noise and of poor training efficiency and stability, so that the invention is robust, trains stably and converges quickly, which gives the unmanned aerial vehicle higher decision accuracy in different environments at lower training cost and higher efficiency.

Secondly, because continuous actions are adopted for flight control of the unmanned aerial vehicle when designing the reinforcement learning action space, the invention overcomes the prior-art defect that the path is not smooth and the unmanned aerial vehicle sometimes turns in place, and offers flexible control and a smooth flight path.

Thirdly, because model distillation is performed on the model after training is finished, the model is made lightweight and its inference is accelerated, which overcomes the prior-art defect of slow model response and gives the invention good real-time performance and a fast response.

Drawings

FIG. 1 is a flow chart of an overall implementation of the present invention;

FIG. 2 is a schematic diagram of a flight model of the unmanned aerial vehicle under a three-dimensional coordinate system;

FIG. 3 is a schematic diagram of the destination range for autonomous path finding used in the reward function design of the present invention;

FIG. 4 is a schematic diagram of a deep reinforcement learning network model constructed based on the SAC algorithm of the present invention;

FIG. 5 is a graph showing the change of the reward function during training according to the embodiment of the present invention.

Detailed Description

The invention is described in further detail below with reference to the drawings and the specific examples.

The implementation steps of the embodiment of the present invention will be described in further detail with reference to fig. 1.

Step 1, constructing a three-degree-of-freedom flight model of an unmanned aerial vehicle movement mode:

According to the flight control principle of the unmanned aerial vehicle, based on three quantities of tangential overload, normal overload and rolling angle, the track yaw angle, track pitch angle and speed of the unmanned aerial vehicle are respectively controlled, and a three-degree-of-freedom flight control mathematical model of the unmanned aerial vehicle in a three-dimensional space is constructed as follows:

Referring to the schematic diagram of the flight model of the unmanned aerial vehicle in a three-dimensional coordinate system shown in fig. 2, g represents the gravitational acceleration of the unmanned aerial vehicle; t represents the unit time of the flight process; n_x represents the tangential overload; n_z represents the normal overload; μ represents the roll angle; v represents the speed, which satisfies the constraint v ∈ [v_min, v_max], where v_min represents the minimum flight speed and v_max the maximum flight speed; γ represents the track pitch angle, namely the angle between the speed direction and the horizontal plane, which satisfies the constraint γ ∈ [-π/2, π/2]; ψ represents the track yaw angle, namely the angle between the projection of the speed direction on the horizontal plane and the y axis, which satisfies the constraint ψ ∈ [-π, π]; and x, y and z respectively represent the coordinate values of the unmanned aerial vehicle in the three-dimensional space coordinate system.

Step 2, autonomously designing a state space set, a continuous action space set and a reward function based on an unmanned aerial vehicle flight control mathematical model and a Markov decision process:

The autonomously designed state space set comprises the coordinate values of the unmanned aerial vehicle at its position in the three-dimensional space, the speed v of the unmanned aerial vehicle, the track pitch angle γ of the unmanned aerial vehicle, the track yaw angle ψ of the unmanned aerial vehicle, the radius d_in of the safe approach range of the target position, the radius d_out of the range within which the target position is judged to be successfully approached, and the coordinate values of the target point position in the three-dimensional space.

The continuous action space set comprises the tangential overload n_x of the unmanned aerial vehicle, the normal overload n_z of the unmanned aerial vehicle and the roll angle μ of the unmanned aerial vehicle.

The reward function is R_total = r_res + ω_d·r_d + ω_a·r_a + ω_h·r_h, wherein r_res, r_d, r_a and r_h respectively represent the result reward function, the distance reward function, the angle reward function and the height reward function, and ω_d, ω_a and ω_h respectively represent the coefficients of the distance reward function, the angle reward function and the height reward function;

The result reward function r_res is as follows:

Referring to the schematic diagram of the destination range for autonomous path finding shown in fig. 3, D_in represents the safe distance between the unmanned aerial vehicle and the target location, D_out represents the boundary of the range within which the unmanned aerial vehicle is judged to have successfully approached the target location, and D represents the Euclidean distance between the unmanned aerial vehicle and the target location in three-dimensional space;

The distance reward function r_d is as follows:

wherein D_max represents the flight boundary of the unmanned aerial vehicle relative to the target position in the three-dimensional coordinate system, and e^(·) represents the exponential function with the natural constant e as base;

The angle reward function r_a is as follows:

wherein θ represents the angle difference between the track yaw angle ψ of the unmanned aerial vehicle and the direction of the straight line from the unmanned aerial vehicle to the target point;

The height reward function r_h is as follows:

wherein H represents the flying height of the unmanned aerial vehicle, H_low represents the minimum safe height of the unmanned aerial vehicle, and H_high represents the maximum safe height of the unmanned aerial vehicle.
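The sketch below covers only the quantities that the definitions above fix: the Euclidean distance D, the angle difference θ between the track yaw angle ψ and the direction to the target (both taken relative to the y axis in the horizontal plane), and the weighted sum R_total. The function names and the idea of passing the component rewards in as plain numbers are illustrative assumptions; the component formulas themselves are not implemented here.

```python
import math


def distance_to_target(pos, target):
    """Euclidean distance D between the UAV and the target in 3-D space."""
    return math.dist(pos, target)


def heading_error(pos, target, psi):
    """Angle difference theta between yaw psi and the line to the target.

    Both angles are measured from the y axis in the horizontal plane,
    and the result is wrapped into [-pi, pi).
    """
    bearing = math.atan2(target[0] - pos[0], target[1] - pos[1])
    return (psi - bearing + math.pi) % (2.0 * math.pi) - math.pi


def total_reward(r_res, r_d, r_a, r_h, w_d, w_a, w_h):
    """R_total = r_res + w_d * r_d + w_a * r_a + w_h * r_h."""
    return r_res + w_d * r_d + w_a * r_a + w_h * r_h
```

For example, a vehicle at the origin with ψ = 0 and a target straight ahead on the y axis has θ = 0, while a target on the positive x axis gives θ = -π/2 under this sign convention.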

Step 3, referring to the model structure schematic diagram shown in fig. 4, constructing a deep reinforcement learning neural network based on a SAC algorithm:

In the specific embodiment, a strategy network π, two action value evaluation networks Q₁ and Q₂, and two state value evaluation networks V₁ and V₂ are built;
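As a rough size check, the sketch below counts the weights and biases of the three network shapes, using the node counts given in steps 3.1 to 3.3 of the technical scheme. It assumes that the first number in each list is the input width and that every layer carries a bias term; the function name is made up for illustration.

```python
def count_parameters(layer_sizes):
    """Count weights plus biases of a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


POLICY = [14, 512, 512, 512, 512, 3]        # strategy sub-network
ACTION_VALUE = [17, 512, 512, 512, 512, 1]  # each action value sub-network
STATE_VALUE = [14, 512, 512, 512, 512, 1]   # state value (and target) sub-network

for name, sizes in [("policy", POLICY),
                    ("action value", ACTION_VALUE),
                    ("state value", STATE_VALUE)]:
    print(name, count_parameters(sizes))
```

Under these assumptions each network has roughly 0.8 million parameters, which is the scale that the later distillation step sets out to reduce. The action value input width of 17 equals the 14 state inputs plus the 3 action outputs.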

Step 4, generating an experience data set:

Step 5, training the deep neural network by using the SAC algorithm:

In the embodiment of the invention, when the unmanned aerial vehicle transitions from the current-moment state s_t to the next-moment state s_{t+1}, the quadruple (s_t, a_t, r_t, s_{t+1}) formed by these two states together with the action a_t performed at the current moment and the reward r_t obtained at the current moment is input into the experience pool P, and P assigns a weight ω to each input quadruple;
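One way to realise such a weighting, sketched here under stated assumptions, is to give the newest quadruple weight 1, multiply the weight by a fixed decay factor for each step back in storage order, and sample in proportion to the weights. The class name, the decay factor of 0.999, the capacity and sampling with replacement are hypothetical choices; the text fixes only that the newest data gets the highest weight, that older data is attenuated by storage order, and that 128 samples are drawn per update.

```python
import random
from collections import deque


class RecencyWeightedPool:
    """Experience pool whose sampling weight decays with storage age."""

    def __init__(self, capacity=100_000, decay=0.999, seed=None):
        self.data = deque(maxlen=capacity)  # oldest entries fall out first
        self.decay = decay
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self.data.append((state, action, reward, next_state))

    def weights(self):
        """Weight 1 for the newest entry, decay**age for older ones."""
        n = len(self.data)
        return [self.decay ** (n - 1 - i) for i in range(n)]

    def sample(self, batch_size=128):
        """Draw a batch (with replacement) in proportion to the weights."""
        return self.rng.choices(list(self.data),
                                weights=self.weights(), k=batch_size)
```

Recomputing every weight on each call keeps the sketch short; a larger pool would more likely update the weights incrementally.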

In the embodiment of the invention, 128 pieces of experience data are sampled from the experience pool P with probabilities given by the weights ω. The strategy network π takes the state information s_t of the unmanned aerial vehicle at the current moment as input and outputs the decided action a_t of the unmanned aerial vehicle. The networks Q₁ and Q₂ take the state information s_t at the current moment and the action a_t taken by the unmanned aerial vehicle as input and output the scores q₁ and q₂ of the current action. The networks V₁ and V₂ take the state information s_t at the current moment as input and output the scores v₁ and v₂ of the current state information. q₁, q₂, v₁ and the loss function L1 are used to update Q₁ and Q₂; min(q₁, q₂), v₂ and the loss function L2 are used to update V₁; min(q₁, q₂) and the loss function L3 are used to update π; and finally the exponential decay averaging method is used to update V₂.

The loss function L1 is as follows:

wherein E(·) represents the expectation, Q_θ(·) represents the output value of the action value evaluation network, θ represents the parameters of the action value evaluation network, s_t represents the state of the unmanned aerial vehicle before the action is executed, a_t represents the action executed at the current moment, r(·) represents the reward value output by the reward function, ζ represents the discount factor, the remaining term represents the output value of the state value evaluation target network, and s_{t+1} represents the state of the unmanned aerial vehicle after the current action is executed.

The loss function L2 is as follows:

wherein V_ψ(·) represents the output value of the state value evaluation network, ψ represents the parameters of the state value evaluation network, π_φ(·) represents the output value of the strategy network, and φ represents the parameters of the strategy network.

The loss function L3 is as follows:

wherein the action term denotes a random action obtained by sampling from a Gaussian distribution.

The exponential decay averaging method is as follows:

wherein ψ represents the parameters of the state value evaluation network and τ represents a hyperparameter.
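For orientation only, the sketch below writes out per-sample objectives in the style of the original SAC formulation, which are consistent with the symbol definitions above: a squared error between Q_θ and the bootstrapped target for L1, a squared error between V_ψ and the smaller action value minus the log-probability of the action for L2, and the log-probability minus the smaller action value for L3. Treating these as the exact formulas of the patent is an assumption, and all function and argument names are illustrative.

```python
def q_loss(q, reward, zeta, v_target_next):
    """L1-style term: 0.5 * (Q(s_t, a_t) - (r + zeta * V_target(s_t+1)))**2."""
    return 0.5 * (q - (reward + zeta * v_target_next)) ** 2


def v_loss(v, q1, q2, log_pi):
    """L2-style term: 0.5 * (V(s_t) - (min(Q1, Q2) - log pi(a|s_t)))**2."""
    return 0.5 * (v - (min(q1, q2) - log_pi)) ** 2


def policy_loss(log_pi, q1, q2):
    """L3-style term: log pi(a|s_t) - min(Q1, Q2) for a freshly sampled action."""
    return log_pi - min(q1, q2)


def soft_update(params, target_params, tau):
    """Exponential decay average: target <- tau * params + (1 - tau) * target."""
    return [tau * p + (1.0 - tau) * t for p, t in zip(params, target_params)]
```

Averaging these terms over the sampled batch gives the batch losses. Taking the smaller of the two action values is the part that counters overestimation, and a small τ makes the target network trail the state value network slowly.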

The convergence of the reward value in the embodiment of the invention is shown in fig. 5, which plots the change of the reward function during training;

Step 6, performing model distillation on the strategy sub-network:
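A minimal sketch of such a distillation loop is given below, using the batch size of 32 and the cross-entropy loss named in step 6.2 of the technical scheme. Everything else is an assumption made so that the example runs on its own: the teacher is a fixed toy function standing in for the trained strategy network, the student is a single logistic unit, both outputs are squashed into (0, 1) so that a cross entropy is defined, and the training inputs are two-dimensional random states.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def teacher(state):
    """Toy stand-in for the trained strategy network (hypothetical)."""
    hidden = math.tanh(0.8 * state[0] - 0.5 * state[1])
    return sigmoid(2.0 * hidden + 0.3 * state[1])


def student(state, w, b):
    """Much smaller student: one logistic unit."""
    return sigmoid(w[0] * state[0] + w[1] * state[1] + b)


def cross_entropy(p_teacher, p_student, eps=1e-12):
    return -(p_teacher * math.log(p_student + eps)
             + (1.0 - p_teacher) * math.log(1.0 - p_student + eps))


def mean_loss(states, w, b):
    return sum(cross_entropy(teacher(s), student(s, w, b))
               for s in states) / len(states)


def distill(states, batch_size=32, epochs=30, lr=0.5):
    """Train the student on teacher outputs; return weights and loss history."""
    w, b = [0.0, 0.0], 0.0
    history = [mean_loss(states, w, b)]
    for _ in range(epochs):
        for start in range(0, len(states), batch_size):
            batch = states[start:start + batch_size]
            grad_w, grad_b = [0.0, 0.0], 0.0
            for s in batch:
                # Gradient of the cross entropy w.r.t. the student's logit.
                diff = student(s, w, b) - teacher(s)
                grad_w[0] += diff * s[0]
                grad_w[1] += diff * s[1]
                grad_b += diff
            w = [w[j] - lr * grad_w[j] / len(batch) for j in range(2)]
            b -= lr * grad_b / len(batch)
        history.append(mean_loss(states, w, b))
    return w, b, history


rng = random.Random(0)
states = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(320)]
w, b, history = distill(states)
print(history[0], history[-1])
```

The loss history starts at the value obtained with an untrained student and should fall as the student learns to imitate the teacher; in practice training would stop once the loss no longer changes appreciably.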

Step 7, planning the flight path of the unmanned aerial vehicle:

Step 7.1, inputting the state space set of the unmanned aerial vehicle at the current moment of the path to be planned into the distilled strategy sub-network, which outputs the action space set at the current moment; having the unmanned aerial vehicle act according to the action space set to generate the state space set at the next moment and the motion path of the unmanned aerial vehicle at the current moment, wherein the unmanned aerial vehicle path set represents the overall path formed by sequentially connecting the paths generated at each moment from the state space set at the current moment and the state space set at the next moment;

Step 7.2, judging whether the unmanned aerial vehicle reaches a target place, if so, executing step 8, otherwise, executing step 7.3;

Step 7.3, judging whether the number of actions of the unmanned aerial vehicle reaches a preset upper limit, if so, prompting the unmanned aerial vehicle that the path planning fails, otherwise, adding the motion path at the current moment into the overall planning path, and repeating the step 7.1;
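The loop in steps 7.1 to 7.3 can be sketched as follows. The callables `policy`, `transition` and `reached_target` are hypothetical placeholders for the distilled strategy sub-network, the flight model and the arrival test; the function returns the spliced path on success and `None` when the preset action limit is reached.

```python
def plan_path(state, policy, transition, reached_target, max_actions):
    """Roll the policy forward and splice the visited states into a path."""
    path = [state]
    for _ in range(max_actions):
        action = policy(state)             # step 7.1: query the policy
        state = transition(state, action)  # act and obtain the next state
        path.append(state)                 # splice onto the path so far
        if reached_target(state):          # step 7.2: arrival check
            return path
    return None                            # step 7.3: action limit reached


# Toy 1-D usage: move one unit per action towards a target at 3.
path = plan_path(0, lambda s: 1, lambda s, a: s + a,
                 lambda s: s >= 3, max_actions=10)
```

Returning `None` rather than a partial path keeps a failed plan from being mistaken for a usable one.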

Claims

1. An unmanned aerial vehicle autonomous path planning method based on a lightweight continuous SAC algorithm, characterized by using a deep reinforcement learning neural network model based on the SAC algorithm and prioritized experience replay, autonomously designing a state space, a continuous action space and a potential energy-based reward function based on a Markov decision process, and performing model distillation on the trained model, wherein the planning method comprises the following steps:

Step 4, generating an experience data set:

Step 5, training the deep neural network by using the SAC algorithm:

Step 6, performing model distillation on the strategy sub-network:

Step 7, planning the flight path of the unmanned aerial vehicle:

2. The unmanned aerial vehicle autonomous path planning method based on the lightweight continuous SAC algorithm according to claim 1, wherein the construction of the unmanned aerial vehicle flight control mathematical model in step 1 means that according to the unmanned aerial vehicle flight control principle, the track yaw angle, the track pitch angle and the speed of the unmanned aerial vehicle are respectively controlled based on three quantities of tangential overload, normal overload and roll angle, and the three-degree-of-freedom flight control mathematical model of the unmanned aerial vehicle in a three-dimensional space is constructed as follows:

wherein g represents the gravitational acceleration of the unmanned aerial vehicle; t represents the unit time of the flight process of the unmanned aerial vehicle; n_x represents the tangential overload of the unmanned aerial vehicle; n_z represents the normal overload of the unmanned aerial vehicle; μ represents the roll angle of the unmanned aerial vehicle; v represents the speed of the unmanned aerial vehicle, which satisfies the constraint v ∈ [v_min, v_max], where v_min represents the minimum flight speed and v_max represents the maximum flight speed of the unmanned aerial vehicle; γ represents the track pitch angle of the unmanned aerial vehicle, namely the angle between the speed direction and the horizontal plane, which satisfies the constraint γ ∈ [-π/2, π/2]; ψ represents the track yaw angle of the unmanned aerial vehicle, namely the angle between the projection of the speed direction on the horizontal plane and the y axis, which satisfies the constraint ψ ∈ [-π, π]; and x, y and z respectively represent the coordinate values of the unmanned aerial vehicle in the three-dimensional space coordinate system.

3. The unmanned aerial vehicle autonomous path planning method based on the lightweight continuous SAC algorithm according to claim 1, wherein the autonomously designed state space set in step 2 comprises the coordinate values of the unmanned aerial vehicle at its position in the three-dimensional space, the speed v of the unmanned aerial vehicle, the track pitch angle γ of the unmanned aerial vehicle, the track yaw angle ψ of the unmanned aerial vehicle, the radius d_in of the safe approach range of the target position, the radius d_out of the range within which the target position is judged to be successfully approached, and the coordinate values of the target point position in the three-dimensional space.

4. The unmanned aerial vehicle autonomous path planning method based on the lightweight continuous SAC algorithm according to claim 1, wherein the continuous action space set in step 2 comprises the tangential overload n_x of the unmanned aerial vehicle, the normal overload n_z of the unmanned aerial vehicle, and the roll angle μ of the unmanned aerial vehicle.

5. The unmanned aerial vehicle autonomous path planning method based on the lightweight continuous SAC algorithm according to claim 1, wherein the reward function in step 2 is R_total = r_res + ω_d·r_d + ω_a·r_a + ω_h·r_h, wherein r_res, r_d, r_a and r_h respectively represent the result reward function, the distance reward function, the angle reward function and the height reward function, and ω_d, ω_a and ω_h respectively represent the coefficients of the distance reward function, the angle reward function and the height reward function;

The result reward function r_res is as follows:

wherein D_in represents the safe distance between the unmanned aerial vehicle and the target site, D_out represents the boundary of the range within which the unmanned aerial vehicle is judged to have successfully approached the target site, and D represents the Euclidean distance between the unmanned aerial vehicle and the target site in three-dimensional space;

The distance reward function r_d is as follows:

wherein D_max represents the flight boundary of the unmanned aerial vehicle relative to the target position in the three-dimensional coordinate system, and e^(·) represents the exponential function with the natural constant e as base;

The angle reward function r_a is as follows:

The height reward function r_h is as follows:

6. The unmanned aerial vehicle autonomous path planning method based on the lightweight continuous SAC algorithm according to claim 1, wherein the loss function L1 in step 5.2 is as follows:

7. The unmanned aerial vehicle autonomous path planning method based on the lightweight continuous SAC algorithm according to claim 6, wherein the loss function L2 in step 5.2 is as follows:

8. The unmanned aerial vehicle autonomous path planning method based on the lightweight continuous SAC algorithm according to claim 7, wherein the loss function L3 in step 5.2 is as follows:

9. The unmanned aerial vehicle autonomous path planning method based on the lightweight continuous SAC algorithm according to claim 7, wherein the exponential decay averaging method in step 5.2 is as follows:

10. The unmanned aerial vehicle autonomous path planning method according to claim 1, wherein the unmanned aerial vehicle path set in step 7.3 represents an overall path in which paths generated at each time are sequentially connected according to a state space set at a current time and a state space set at a next time.