CN116430904B - Unmanned aerial vehicle autonomous path planning method based on lightweight continuous SAC algorithm - Google Patents

Unmanned aerial vehicle autonomous path planning method based on lightweight continuous SAC algorithm

Info

Publication number
CN116430904B
Authority
CN
China
Prior art keywords
unmanned aerial
aerial vehicle
network
action
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310543396.3A
Other languages
Chinese (zh)
Other versions
CN116430904A (en
Inventor
李阳阳
李浩哲
曹梦晨
沈家皓
张雪帆
刘睿娇
焦李成
尚荣华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN202310543396.3A
Publication of CN116430904A
Application granted
Publication of CN116430904B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10 Simultaneous control of position or course in three dimensions
    • G05D1/101 Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • G05D1/106 Change initiated in response to external conditions, e.g. avoidance of elevated terrain or of no-fly zones
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention discloses an unmanned aerial vehicle autonomous path planning method based on a lightweight continuous SAC algorithm. The method comprises: constructing a mathematical model of unmanned aerial vehicle flight control; designing a state space, an action space and a reward function; building a deep reinforcement learning neural network model; generating an experience data set; training the deep neural network with the SAC algorithm; and performing model distillation with the trained network as the teacher network. Based on deep reinforcement learning, the invention takes the SAC algorithm as the basic framework of the model, uses a purpose-designed reward function to improve training efficiency, and reduces the network scale through model distillation. The result is a lightweight unmanned aerial vehicle path planning method with a high degree of exploration, which addresses the problems that flight paths are insufficiently smooth and sometimes include turning in place, that misjudgments occur under heavy noise, that training efficiency and stability are poor, and that model response is slow.

Description

Unmanned aerial vehicle autonomous path planning method based on lightweight continuous SAC algorithm
Technical Field
The invention belongs to the technical field of communication, and further relates to an unmanned aerial vehicle autonomous path planning method based on a lightweight continuous SAC (Soft Actor-Critic) algorithm in the technical field of unmanned aerial vehicles. The method can be applied to unmanned aerial vehicles in different environments to make autonomous decisions during movement, so that an unmanned aerial vehicle operating without a human pilot reaches the target range efficiently and safely along a smoother track, thereby realizing autonomous path planning of the unmanned aerial vehicle.
Background
Unmanned aerial vehicle autonomous path planning is a technique that enables an unmanned aerial vehicle to find its way to a destination without human intervention. Because an unmanned aerial vehicle flies at high speed, autonomous flight places high demands on decision response speed and accuracy, and depends strongly on the flight environment. Traditional solutions are generally based on genetic algorithms, dynamic Bayesian networks, approximate dynamic programming and similar techniques. Most of them suffer from complex modeling, low real-time decision-making efficiency, the need for large supporting data sets, a huge amount of computation, and susceptibility to the curse of dimensionality. Given the particular demands of unmanned aerial vehicle autonomous path planning, these problems lead to slow decisions for an unmanned aerial vehicle flying at high speed, unstable results across application scenarios, and excessive model training cost, which makes practical application very difficult. Existing solutions based on deep reinforcement learning mostly adopt the DQN (Deep Q-Network), DDPG (Deep Deterministic Policy Gradient) or TD3 (Twin Delayed Deep Deterministic Policy Gradient) algorithm as the Markov decision model and combine a number of basic actions into a discrete action space. When calculating the target value, DQN and DDPG both use the same Q network to select and to evaluate actions, which yields inflated value estimates in the presence of noise and error. This is generally called the overestimation problem, and it has a great influence on the flight decisions of an unmanned aerial vehicle. Although the TD3 algorithm mitigates the overestimation problem of DQN and DDPG, its training efficiency and stability are still poor in the real-time decision scenario of an unmanned aerial vehicle. In addition, with a discrete action space the autonomous flight path is not smooth enough, and turning in place easily occurs.
Jinwen Hu et al., in their published paper "Autonomous Maneuver Decision Making of Dual-UAV Cooperative Air Combat Based on Deep Reinforcement Learning" (Hu, J.; Wang, L.; Hu, T.; Guo, C.; Wang, Y. Autonomous Maneuver Decision Making of Dual-UAV Cooperative Air Combat Based on Deep Reinforcement Learning. Electronics 2022, 11, 467), disclose an autonomous path planning method used in unmanned aerial vehicle air combat autonomous decision-making based on deep reinforcement learning. The method designs fifteen typical action instructions for the unmanned aerial vehicle, models the aircraft's actions with a discrete action space, and uses the DDPG algorithm as the Markov decision model to generate the path planning strategy. It has two shortcomings. First, the DDPG algorithm inherently suffers from the overestimation problem, so it can misjudge under heavy noise, and its training efficiency and stability are poor. Second, because discrete actions are not flexible enough for control, the unmanned aerial vehicle can only fly in a fixed set of attitudes, so its flight path is not smooth enough and it sometimes turns in place.
A patent document filed by a university of the Chinese People's Liberation Army, "Unmanned aerial vehicle combat autonomous decision-making method based on a deep reinforcement learning TD3 algorithm" (application number: 202210264539.2; application date: 2022.03.17; application publication number: CN 114706418A), discloses an autonomous path planning method used in unmanned aerial vehicle combat autonomous decision-making. Its specific steps are: establishing an unmanned aerial vehicle motion model; establishing, from that motion model, an unmanned aerial vehicle air combat model based on a Markov decision process, expressed as a four-tuple of state space, action space, reward function and discount factor, in which the motion model serves as the state transition function; and training the unmanned aerial vehicle's maneuver strategy with the TD3 algorithm according to the air combat model. The shortcomings of this method are that the training process is not stable enough and training efficiency is poor, and that, because the TD3 algorithm adopts a more complex network structure with slow inference, the unmanned aerial vehicle responds slowly during decision-making.
Disclosure of Invention
The invention aims to solve the following problems of the prior art: unmanned aerial vehicle flight paths are insufficiently smooth and sometimes include turning in place, misjudgments occur under heavy noise, training efficiency and stability are poor, and model response speed is low.
The specific idea for achieving this purpose is as follows. To solve the real-time path planning problem, the method first models the unmanned aerial vehicle's motion with a three-degree-of-freedom flight model, which defines how the unmanned aerial vehicle moves in three-dimensional space in real time under different actions. The environment of the unmanned aerial vehicle is then modeled, including its state space, action space and reward function. The state space consists of the unmanned aerial vehicle's position in three-dimensional coordinates, its speed, the coordinates of the end position, the safety distance and so on. An action space based on continuous actions is established to drive the flight; it consists of three elements, tangential overload, normal overload and roll angle, which makes control more flexible and solves the prior-art problem of poor smoothness and turning in place. The reward function is designed around three factors, distance, angle and height, with distance as the main reward and angle and height as auxiliary rewards; a potential-based reward mechanism makes the reward denser during flight and guides the strategy network more efficiently. A deep reinforcement learning neural network based on the SAC algorithm is constructed. Compared with the DQN and DDPG algorithms, the SAC algorithm uses two Q networks and takes the smaller of their values, which solves the prior-art problem of misjudgment under heavy noise. Compared with the TD3 algorithm, the SAC algorithm introduces a maximum-entropy term into its loss function, which greatly improves the model's exploration capacity and effectively improves training efficiency. A prioritized experience replay mechanism is added to the SAC algorithm, giving a weight to each piece of data added to the experience pool so as to improve training efficiency. By designing the potential-based reward function, constructing the SAC deep reinforcement learning neural network and introducing the prioritized experience replay mechanism, the method overcomes the prior-art problem of poor training efficiency and stability. Finally, model distillation is performed on the strategy network: a smaller student network is trained to make the model lightweight, which solves the prior-art problem of slow model response.
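The potential-based reward mechanism mentioned above can be illustrated with a short sketch. The text does not give the potential function here, so the sketch assumes the negative Euclidean distance to the target as the potential and a hypothetical discount of 0.99; the function names are illustrative and not taken from the patent.

```python
import math


def distance_potential(position, target):
    """Potential of a state: the negative Euclidean distance to the target (assumed)."""
    return -math.dist(position, target)


def shaped_reward(base_reward, position, next_position, target, discount=0.99):
    """Add the potential-based shaping term discount * phi(s') - phi(s)."""
    phi_now = distance_potential(position, target)
    phi_next = distance_potential(next_position, target)
    return base_reward + discount * phi_next - phi_now


target = (100.0, 0.0, 50.0)
start = (0.0, 0.0, 50.0)
# Moving 10 units toward the target versus 10 units away from it.
closer = shaped_reward(0.0, start, (10.0, 0.0, 50.0), target)
farther = shaped_reward(0.0, start, (-10.0, 0.0, 50.0), target)
print(closer > farther)
```

Under these assumptions every step that shortens the distance earns a positive shaping term, which is one way a reward can become denser than a reward granted only on arrival.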
In order to achieve the above purpose, the technical scheme adopted by the invention comprises the following steps:
Step 1, constructing a three-degree-of-freedom flight model of the unmanned aerial vehicle's motion;
Step 2, designing a state space set, a continuous action space set and a reward function based on the unmanned aerial vehicle flight control mathematical model and a Markov decision process;
Step 3, constructing a deep reinforcement learning neural network based on the SAC algorithm:
Step 3.1, constructing a strategy sub-network formed by six fully connected layers connected in series, with the node counts of the six layers set to 14, 512, 512, 512, 512 and 3 respectively;
Step 3.2, constructing two action value evaluation sub-networks with the same structure, each formed by six fully connected layers connected in series, with the node counts of the six layers set to 17, 512, 512, 512, 512 and 1 respectively;
Step 3.3, constructing a state value evaluation sub-network and a state value evaluation target sub-network with the same structure, each formed by six fully connected layers connected in series, with the node counts of the six layers set to 14, 512, 512, 512, 512 and 1 respectively;
Step 3.4, connecting the two action value evaluation sub-networks in parallel and then connecting them to the strategy sub-network and to the state value evaluation sub-network respectively, to form the deep reinforcement learning neural network;
Step 4, generating an experience data set:
For each action performed by the unmanned aerial vehicle, the state space set before the action, the action space set, the reward obtained and the state space set after the action is executed form the quadruple of experience data for that action, which is stored in an experience pool; the quadruples of at least 10000 actions stored in the experience pool form the experience data set;
Step 5, training the deep neural network with the SAC algorithm:
Step 5.1, randomly initializing a state space set and inputting it into the strategy sub-network, which outputs an action space set; making the unmanned aerial vehicle act according to that action space set; storing in the experience pool a quadruple consisting of the state set of the unmanned aerial vehicle before the action, the action space set given by the strategy sub-network, the reward value generated by the reward function and the state set of the unmanned aerial vehicle after the action; assigning the highest weight to the quadruple just stored, and attenuating the weights of the remaining data according to their storage order;
Step 5.2, sampling 128 pieces of experience data from the updated experience pool by prioritized experience replay and inputting them into the deep neural network, which outputs two action values, state value 1, state value 2 and a reward value; substituting the two action values and state value 1 into loss function L1, substituting the smaller of the two action values together with state value 2 into loss function L2, and substituting the smaller of the two action values into loss function L3; updating the weight parameters of the action value evaluation sub-networks, the state value evaluation sub-network and the strategy sub-network respectively by gradient back-propagation, and updating the weight parameters of the state value evaluation target sub-network by an exponentially decaying average, to obtain the updated weight parameters of the deep neural network;
Step 5.3, judging whether the currently output reward value has converged; if so, the trained deep neural network is obtained and step 6 is executed; otherwise, step 5.1 is executed;
Step 6, performing model distillation on the strategy sub-network:
Step 6.1, randomly extracting the action space sets of 10000 pieces of experience data from the experience pool as the training set of the strategy student network;
Step 6.2, inputting the training set, in batches of 32, into both the strategy network and the strategy student network; substituting the outputs of the two networks into a cross-entropy loss function to calculate a loss value; updating the weight parameters of the strategy student network by gradient back-propagation until the loss value converges, to obtain the distilled strategy sub-network;
Step 7, planning the flight path of the unmanned aerial vehicle:
Step 7.1, inputting the state space set at the current moment of the unmanned aerial vehicle whose path is to be planned into the distilled strategy sub-network, which outputs the action space set at the current moment; making the unmanned aerial vehicle act according to that action space set, which generates the state space set at the next moment and the motion path at the current moment; splicing the motion path at the current moment onto the motion path generated up to the previous moment according to position information;
Step 7.2, judging whether the unmanned aerial vehicle has reached the target place; if so, executing step 8; otherwise, executing step 7.3;
Step 7.3, judging whether the number of actions of the unmanned aerial vehicle has reached a preset upper limit; if so, reporting that path planning has failed; otherwise, repeating step 7.1;
Step 8, taking the path spliced together by the time the unmanned aerial vehicle reaches the target place as the planned unmanned aerial vehicle action path.
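The control flow of steps 7 and 8 can be sketched as a bounded rollout loop. This is a minimal illustration, not the patented implementation: `plan_path`, the dictionary-based state, the arrival test by Euclidean distance and the toy policy and dynamics are all hypothetical stand-ins for the distilled strategy sub-network and the flight model.

```python
import math


def plan_path(policy, step_fn, state, target, reach_radius, max_steps):
    """Roll the policy forward and splice positions into a path.

    Returns the spliced path on success, or None once max_steps is exhausted.
    """
    path = [state["pos"]]
    for _ in range(max_steps):
        action = policy(state)            # step 7.1: the policy outputs an action
        state = step_fn(state, action)    # the flight model yields the next state
        path.append(state["pos"])         # splice the new segment onto the path
        if math.dist(state["pos"], target) <= reach_radius:
            return path                   # step 7.2 succeeded, so step 8 returns the path
    return None                           # step 7.3: action limit reached, planning failed


def toy_policy(state):
    """Hypothetical policy: always move one unit along x."""
    return (1.0, 0.0, 0.0)


def toy_step(state, action):
    """Hypothetical dynamics: add the action to the position."""
    x, y, z = state["pos"]
    dx, dy, dz = action
    return {"pos": (x + dx, y + dy, z + dz)}


start = {"pos": (0.0, 0.0, 0.0)}
goal = (5.0, 0.0, 0.0)
path = plan_path(toy_policy, toy_step, start, goal, reach_radius=0.5, max_steps=20)
failed = plan_path(toy_policy, toy_step, start, goal, reach_radius=0.5, max_steps=3)
print(len(path), failed)
```

Returning `None` on failure keeps the "planning failed" outcome of step 7.3 distinct from a valid path.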
Compared with the prior art, the invention has the following advantages:
First, the invention adopts the SAC algorithm, an advanced algorithm in the field of deep reinforcement learning, adds a prioritized experience replay mechanism to it, and uses a purpose-designed potential-based reward function. This effectively avoids misjudgment caused by overestimation under heavy noise as well as poor training efficiency and stability, so that the invention is robust, trains stably and converges quickly, which gives the unmanned aerial vehicle higher decision accuracy in different environments at lower training cost and higher efficiency.
Second, because the invention controls flight with continuous actions in the design of the reinforcement learning action space, it overcomes the prior-art defect that poor smoothness sometimes results in turning in place, and offers flexible control and a smooth flight path.
Third, the invention performs model distillation after model training is finished, which makes the model lightweight and speeds up inference; this overcomes the prior-art defect of slow model response and provides strong real-time performance and fast response.
Drawings
FIG. 1 is a flow chart of an overall implementation of the present invention;
FIG. 2 is a schematic diagram of a flight model of the unmanned aerial vehicle under a three-dimensional coordinate system;
FIG. 3 is a schematic diagram of the autonomous navigation destination range in the reward function design of the present invention;
FIG. 4 is a schematic diagram of a deep reinforcement learning network model constructed based on the SAC algorithm of the present invention;
FIG. 5 is a graph showing the change of the reward function during training according to the embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and the specific examples.
The implementation steps of the embodiment of the present invention will be described in further detail with reference to fig. 1.
Step 1, constructing a three-degree-of-freedom flight model of an unmanned aerial vehicle movement mode:
According to the flight control principle of the unmanned aerial vehicle, three quantities, tangential overload, normal overload and roll angle, are used to control the speed, track pitch angle and track yaw angle of the unmanned aerial vehicle, and a three-degree-of-freedom flight control mathematical model of the unmanned aerial vehicle in three-dimensional space is constructed as follows:
Referring to the schematic diagram of the flight model of the unmanned aerial vehicle in a three-dimensional coordinate system shown in fig. 2: $g$ represents the gravitational acceleration of the unmanned aerial vehicle; $t$ represents unit time of the flight process; $n_x$ represents the tangential overload; $n_z$ represents the normal overload; $\mu$ represents the roll angle; $v$ represents the speed, with $v \in [v_{min}, v_{max}]$, where $v_{min}$ represents the minimum flight speed and $v_{max}$ the maximum flight speed; $\gamma$ represents the track pitch angle, namely the angle between the velocity direction and the horizontal plane; $\psi$ represents the track yaw angle, namely the angle between the projection of the velocity direction on the horizontal plane and the y axis, satisfying the constraint $\psi \in [-\pi, \pi]$; and $x$, $y$ and $z$ respectively represent the coordinate values of the unmanned aerial vehicle in the three-dimensional coordinate system.
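The equations of the model are not reproduced in this text. As a minimal sketch, the code below assumes the widely used point-mass three-degree-of-freedom form that is consistent with the symbol definitions above (overloads $n_x$, $n_z$, roll angle $\mu$, yaw angle measured from the y axis) and integrates it with one explicit Euler step. The step size `dt`, the speed limits and the function name are hypothetical.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def step_3dof(state, action, dt=0.1, v_min=10.0, v_max=100.0):
    """Advance (x, y, z, v, gamma, psi) by one Euler step.

    action = (n_x, n_z, mu): tangential overload, normal overload, roll angle.
    Assumes v > 0 and cos(gamma) != 0.
    """
    x, y, z, v, gamma, psi = state
    n_x, n_z, mu = action
    # Rates of speed, track pitch angle and track yaw angle.
    v_dot = G * (n_x - math.sin(gamma))
    gamma_dot = G / v * (n_z * math.cos(mu) - math.cos(gamma))
    psi_dot = G * n_z * math.sin(mu) / (v * math.cos(gamma))
    # Position rates; psi is measured from the y axis, as in the text.
    x_dot = v * math.cos(gamma) * math.sin(psi)
    y_dot = v * math.cos(gamma) * math.cos(psi)
    z_dot = v * math.sin(gamma)
    # Clamp the speed to [v_min, v_max] and wrap the yaw angle to [-pi, pi).
    v_new = min(max(v + v_dot * dt, v_min), v_max)
    psi_new = (psi + psi_dot * dt + math.pi) % (2.0 * math.pi) - math.pi
    return (x + x_dot * dt, y + y_dot * dt, z + z_dot * dt,
            v_new, gamma + gamma_dot * dt, psi_new)


# Level flight along the y axis: n_x = 0, n_z = 1, no roll.
level = step_3dof((0.0, 0.0, 100.0, 50.0, 0.0, 0.0), (0.0, 1.0, 0.0))
print(level)
```

With $n_x = 0$, $n_z = 1$ and $\mu = 0$ all three rates vanish, so the vehicle keeps its speed, height and heading, which is a quick sanity check on the assumed equations.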
Step 2, autonomously designing a state space set, a continuous action space set and a reward function based on an unmanned aerial vehicle flight control mathematical model and a Markov decision process:
The state space set comprises the coordinate values of the unmanned aerial vehicle's position in three-dimensional space, the speed $v$ of the unmanned aerial vehicle, the track pitch angle $\gamma$, the track yaw angle $\psi$, the radius $d_{in}$ of the safe approach range of the target position, the radius $d_{out}$ of the range within which the target position is judged to be successfully approached, and the coordinate values of the target point in three-dimensional space.
The continuous action space set comprises the tangential overload $n_x$ of the unmanned aerial vehicle, the normal overload $n_z$ of the unmanned aerial vehicle and the roll angle $\mu$ of the unmanned aerial vehicle.
The reward function is $R_{total} = r_{res} + \omega_d r_d + \omega_a r_a + \omega_h r_h$, wherein $r_{res}$, $r_d$, $r_a$ and $r_h$ respectively represent the result reward function, the distance reward function, the angle reward function and the height reward function, and $\omega_d$, $\omega_a$ and $\omega_h$ respectively represent the coefficients of the distance reward function, the angle reward function and the height reward function;
The result reward function $r_{res}$ is as follows:
Referring to the schematic diagram of the autonomous navigation destination range shown in fig. 3, $D_{in}$ represents the safe distance between the unmanned aerial vehicle and the target location, $D_{out}$ represents the boundary of the range within which the unmanned aerial vehicle is judged to have successfully approached the target location, and $D$ represents the Euclidean distance between the unmanned aerial vehicle and the target location in three dimensions;
The distance reward function $r_d$ is as follows:
Wherein $D_{max}$ represents the flight boundary of the unmanned aerial vehicle relative to the target position in the three-dimensional coordinate system, and $e^{(\cdot)}$ represents the exponential function with the natural constant $e$ as its base;
The angle reward function $r_a$ is as follows:
Wherein $\theta$ represents the angle difference between the track yaw angle $\psi$ of the unmanned aerial vehicle and the direction of the straight line from the unmanned aerial vehicle to the target point;
The height reward function $r_h$ is as follows:
Wherein $H$ represents the flying height of the unmanned aerial vehicle, $H_{low}$ represents the minimum safe height of the unmanned aerial vehicle, and $H_{high}$ represents the maximum safe height of the unmanned aerial vehicle.
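The total reward combines the four terms defined above as a result reward plus a weighted sum of the distance, angle and height rewards. The sketch below only shows that combination: the individual reward formulas are not reproduced in this text, so their values are passed in as numbers, and the default coefficients are hypothetical placeholders rather than the patent's values.

```python
def total_reward(r_res, r_d, r_a, r_h, w_d=0.6, w_a=0.2, w_h=0.2):
    """Combine the result reward with weighted distance, angle and height rewards.

    The default weights are illustrative assumptions only.
    """
    return r_res + w_d * r_d + w_a * r_a + w_h * r_h


# An intermediate step with no result reward and partial shaping rewards.
step_reward = total_reward(r_res=0.0, r_d=0.5, r_a=1.0, r_h=1.0)
# A terminal step in which only the result reward is non-zero.
terminal_reward = total_reward(r_res=10.0, r_d=0.0, r_a=0.0, r_h=0.0)
print(step_reward, terminal_reward)
```

Giving the distance term the largest coefficient mirrors the description of distance as the main reward with angle and height as auxiliary rewards.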
Step 3, referring to the model structure schematic diagram shown in fig. 4, constructing a deep reinforcement learning neural network based on the SAC algorithm:
Step 3.1, constructing a strategy sub-network formed by six fully connected layers connected in series, with the node counts of the six layers set to 14, 512, 512, 512, 512 and 3 respectively;
Step 3.2, constructing two action value evaluation sub-networks with the same structure, each formed by six fully connected layers connected in series, with the node counts of the six layers set to 17, 512, 512, 512, 512 and 1 respectively;
Step 3.3, constructing a state value evaluation sub-network and a state value evaluation target sub-network with the same structure, each formed by six fully connected layers connected in series, with the node counts of the six layers set to 14, 512, 512, 512, 512 and 1 respectively;
Step 3.4, connecting the two action value evaluation sub-networks in parallel and then connecting them to the strategy sub-network and to the state value evaluation sub-network respectively, to form the deep reinforcement learning neural network;
In the specific embodiment, a strategy network $\pi$, two action value evaluation networks $Q_1$, $Q_2$ and two state value evaluation networks $V_1$, $V_2$ are built;
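To get a feel for the size of the three sub-network types, the node counts listed in step 3 can be turned into weight-matrix shapes and parameter counts. The sketch assumes that the six node counts are consecutive layer widths (so each network has five weight matrices) and that every layer has a bias vector; neither detail is stated in the text. Note that the action value input width of 17 equals the 14 state inputs plus the 3 action outputs.

```python
def layer_shapes(nodes):
    """Pair consecutive node counts into (inputs, outputs) per weight matrix."""
    return list(zip(nodes[:-1], nodes[1:]))


def parameter_count(nodes):
    """Count weights plus biases for a plain fully connected stack."""
    return sum(n_in * n_out + n_out for n_in, n_out in layer_shapes(nodes))


POLICY = [14, 512, 512, 512, 512, 3]    # state in, action out
Q_VALUE = [17, 512, 512, 512, 512, 1]   # state plus action in, scalar value out
V_VALUE = [14, 512, 512, 512, 512, 1]   # state in, scalar value out

for name, nodes in [("policy", POLICY), ("Q", Q_VALUE), ("V", V_VALUE)]:
    print(name, layer_shapes(nodes), parameter_count(nodes))
```

Under these assumptions each network holds roughly 0.8 million parameters, almost all of them in the three 512-by-512 hidden layers, which is the part a smaller student network would shrink.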
Step 4, generating an experience data set:
For each action performed by the unmanned aerial vehicle, the state space set before the action, the action space set, the reward obtained and the state space set after the action is executed form the quadruple of experience data for that action, which is stored in an experience pool; the quadruples of at least 10000 actions stored in the experience pool form the experience data set;
Step 5, training the deep neural network with the SAC algorithm:
Step 5.1, randomly initializing a state space set and inputting it into the strategy sub-network, which outputs an action space set; making the unmanned aerial vehicle act according to that action space set; storing in the experience pool a quadruple consisting of the state set of the unmanned aerial vehicle before the action, the action space set given by the strategy sub-network, the reward value generated by the reward function and the state set of the unmanned aerial vehicle after the action; assigning the highest weight to the quadruple just stored, and attenuating the weights of the remaining data according to their storage order;
In the embodiment of the invention, when the unmanned aerial vehicle transitions from the current state $s_t$ to the next state $s_{t+1}$, the quadruple $(s_t, a_t, r_t, s_{t+1})$, formed by these two states together with the action $a_t$ performed and the reward $r_t$ obtained at the current moment, is input into the experience pool $P$, and $P$ assigns a weight $\omega$ to each input quadruple;
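The weighting rule of step 5.1 can be sketched as a small experience pool in which the newest quadruple receives the top weight and older ones are attenuated in storage order. The geometric decay factor, the class name and the sampling with replacement are assumptions for illustration; the text only states that the newest entry has the highest weight and that sampling follows the weights.

```python
import random


class DecayingPriorityPool:
    """Experience pool whose newest entry has the highest sampling weight."""

    def __init__(self, decay=0.999, seed=0):
        self.decay = decay
        self.items = []     # stored (s, a, r, s_next) quadruples, oldest first
        self.weights = []   # one weight per stored quadruple
        self.rng = random.Random(seed)

    def add(self, quadruple):
        # Attenuate every older weight, then give the newcomer the top weight.
        self.weights = [w * self.decay for w in self.weights]
        self.items.append(quadruple)
        self.weights.append(1.0)

    def sample(self, batch_size=128):
        # Draw with replacement, with probability proportional to weight.
        return self.rng.choices(self.items, weights=self.weights, k=batch_size)


pool = DecayingPriorityPool(decay=0.5)
for i in range(3):
    pool.add((i, 0, 0.0, i + 1))   # toy quadruples with integer "states"
batch = pool.sample()
print(pool.weights, len(batch))
```

Rescaling every weight on each insertion costs time linear in the pool size; a production buffer would more likely store an insertion index and compute the decayed weight lazily.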
Step 5.2, sampling 128 pieces of experience data from the updated experience pool by prioritized experience replay and inputting them into the deep neural network, which outputs two action values, state value 1, state value 2 and a reward value; substituting the two action values and state value 1 into loss function L1, substituting the smaller of the two action values together with state value 2 into loss function L2, and substituting the smaller of the two action values into loss function L3; updating the weight parameters of the action value evaluation sub-networks, the state value evaluation sub-network and the strategy sub-network respectively by gradient back-propagation, and updating the weight parameters of the state value evaluation target sub-network by an exponentially decaying average, to obtain the updated weight parameters of the deep neural network;
In the embodiment of the invention, 128 pieces of experience data are sampled from the experience pool $P$ with probabilities given by the weights $\omega$. The strategy network $\pi$ takes the state information $s_t$ of the unmanned aerial vehicle at the current moment as input and outputs the decided action $a_t$. $Q_1$ and $Q_2$ take the state information $s_t$ and the action $a_t$ taken by the unmanned aerial vehicle as input and output the scores $q_1$, $q_2$ of the current action. $V_1$ and $V_2$ take the state information $s_t$ as input and output the scores $v_1$, $v_2$ of the current state information. $q_1$, $q_2$, $v_1$ and loss function L1 are used to update $Q_1$ and $Q_2$; $\min(q_1, q_2)$, $v_2$ and loss function L2 are used to update $V_1$; $\min(q_1, q_2)$ and loss function L3 are used to update $\pi$; finally, the exponentially decaying average is used to update $V_2$.
The loss function L1 is as follows:
Wherein $E(\cdot)$ represents the expectation, $Q_\theta(\cdot)$ represents the output value of the action value evaluation network, $\theta$ represents the parameters of the action value evaluation network, $s_t$ represents the state of the unmanned aerial vehicle before executing the action, $a_t$ represents the action executed at the current moment, $r(\cdot)$ represents the reward value output by the reward function, $\zeta$ represents the discount factor, the target term is the output value of the state value evaluation target network, and $s_{t+1}$ represents the state of the unmanned aerial vehicle after the current action is executed.
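The formula for L1 is not reproduced in this text. As a minimal sketch, the code below assumes the standard soft Q-value loss of the SAC algorithm, which uses exactly the quantities listed above: the squared error between $Q_\theta(s_t, a_t)$ and the target $r + \zeta \cdot V_{target}(s_{t+1})$, averaged over the batch. Plain Python lists stand in for network outputs, and any replay weighting is omitted.

```python
def q_loss(q_values, rewards, next_target_values, discount=0.99):
    """Mean of 0.5 * (Q(s_t, a_t) - (r + discount * V_target(s_{t+1})))**2."""
    errors = [q - (r + discount * v_next)
              for q, r, v_next in zip(q_values, rewards, next_target_values)]
    return sum(0.5 * e * e for e in errors) / len(errors)


# A batch of two transitions; the first prediction already matches its target.
loss = q_loss(q_values=[1.0, 2.0],
              rewards=[0.5, 0.0],
              next_target_values=[1.0, 0.0],
              discount=0.5)
print(loss)
```

In the described method the same loss form would be applied to each of the two action value evaluation networks separately.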
The loss function L2 is as follows:
Wherein $V_\psi(\cdot)$ represents the output value of the state value evaluation network, $\psi$ represents the parameters of the state value evaluation network, $\pi_\phi(\cdot)$ represents the output value of the strategy network, and $\phi$ represents the parameters of the strategy network.
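The formula for L2 is likewise missing here. A minimal sketch, assuming the standard SAC state value loss: the state value $V_\psi(s_t)$ is regressed onto the smaller of the two action values minus the log-probability of the sampled action under the strategy network. Taking the minimum of the two action values is the step the text credits with countering overestimation. The function name and list-based inputs are illustrative.

```python
def v_loss(v_values, q1_values, q2_values, log_probs):
    """Mean of 0.5 * (V(s_t) - (min(Q1, Q2) - log pi(a_t | s_t)))**2."""
    errors = [v - (min(q1, q2) - logp)
              for v, q1, q2, logp in zip(v_values, q1_values, q2_values, log_probs)]
    return sum(0.5 * e * e for e in errors) / len(errors)


# The first sample matches its target (2 - 1 = 1); the second is off by 2.
loss_v = v_loss(v_values=[1.0, 0.0],
                q1_values=[2.0, 2.0],
                q2_values=[3.0, 3.0],
                log_probs=[1.0, 0.0])
print(loss_v)
```

The subtracted log-probability is the entropy bonus: actions the strategy network considers unlikely raise the target value, which rewards exploration.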
The loss function L3 is as follows:
Wherein the action is a random action obtained by sampling from a Gaussian distribution.
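The formula for L3 is also missing here. A minimal sketch, assuming the standard SAC policy loss with a reparameterised Gaussian sample: the action is drawn as mean plus standard deviation times unit Gaussian noise, and the loss is the batch mean of the log-probability minus the smaller of the two action values. A one-dimensional action is used for brevity, and action squashing and the three-element action vector are left out.

```python
import math
import random


def sample_action(mean, log_std, rng):
    """Reparameterised draw a = mean + std * eps with eps ~ N(0, 1).

    Returns the action and its Gaussian log-probability.
    """
    std = math.exp(log_std)
    eps = rng.gauss(0.0, 1.0)
    action = mean + std * eps
    log_prob = -0.5 * eps * eps - log_std - 0.5 * math.log(2.0 * math.pi)
    return action, log_prob


def policy_loss(log_probs, q1_values, q2_values):
    """Mean of log pi(a_t | s_t) - min(Q1, Q2) over the batch."""
    terms = [logp - min(q1, q2)
             for logp, q1, q2 in zip(log_probs, q1_values, q2_values)]
    return sum(terms) / len(terms)


rng = random.Random(0)
action, log_prob = sample_action(mean=0.0, log_std=0.0, rng=rng)
loss_pi = policy_loss([0.0], [1.0], [2.0])
print(loss_pi)
```

Writing the sample as a deterministic function of the noise is what allows gradient back-propagation through the sampled action to the strategy network's parameters.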
The exponential decay averaging method is as follows:
Where ψ represents the parameters of the state value evaluation network and τ represents the probability hyperparameter.
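The update formula itself is not reproduced in this text. As a minimal sketch, assuming the exponential decay average takes the usual soft-update form (target ← τ·online + (1 − τ)·target), the step could look as follows. The function name `soft_update` and the flat parameter lists are illustrative assumptions, not part of the patent.

```python
def soft_update(target_params, online_params, tau):
    """Exponentially decayed average of the online parameters (assumed form).

    target_params: current parameters of the state value evaluation target network
    online_params: current parameters of the state value evaluation network
    tau:           weighting hyperparameter in [0, 1]
    """
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    if len(target_params) != len(online_params):
        raise ValueError("parameter lists must have the same length")
    # Each target parameter moves a fraction tau of the way to its online value.
    return [tau * p + (1.0 - tau) * t for t, p in zip(target_params, online_params)]
```

With a small τ the target network changes slowly, which is the usual reason for keeping a separate target network.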
Step 5.3, judging whether the currently output reward value has converged; if so, obtaining the trained deep neural network and then executing step 6; otherwise, executing step 5.1;
The convergence of the reward function in the embodiment of the invention is shown in fig. 5, the chart of the change of the reward function during training of the embodiment;
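The text does not state a numerical convergence criterion. One simple possibility, given purely as an assumption, is to compare the mean reward of the most recent episodes with the mean of the episodes just before them. The function name, the window length and the tolerance below are all hypothetical.

```python
from statistics import mean


def reward_converged(episode_rewards, window=50, tolerance=0.01):
    """Return True when the mean reward of the last `window` episodes differs
    from the mean of the preceding `window` episodes by at most `tolerance`,
    measured relative to the magnitude of the earlier mean."""
    if len(episode_rewards) < 2 * window:
        return False  # not enough history to compare two windows
    recent = mean(episode_rewards[-window:])
    previous = mean(episode_rewards[-2 * window:-window])
    scale = max(abs(previous), 1e-8)  # avoid division by zero
    return abs(recent - previous) / scale <= tolerance
```

A check of this kind would be evaluated at step 5.3 after each training iteration to decide between returning to step 5.1 and proceeding to step 6.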
step 6, model distillation is carried out on the strategy subnetwork:
Step 6.1, randomly extracting the action space sets of 10000 pieces of experience data from the experience pool as the training set of the strategy student network;
Step 6.2, inputting the training set of the strategy student network, in batches of 32 data items, into the strategy network and the strategy student network respectively, substituting the outputs of the two networks into a cross-entropy loss function to calculate a loss value, and updating the weight parameters of the strategy student network by gradient back-propagation until the loss value converges, to obtain the distilled strategy sub-network;
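The distillation loss of step 6.2 can be sketched as follows. The text only says that the outputs of the two networks are substituted into a cross-entropy loss; normalising both outputs with a softmax before taking the cross-entropy is an assumption made here so that the loss is well defined, and all function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(teacher_out, student_out):
    """Cross-entropy H(p_teacher, p_student) after softmax normalisation (assumed)."""
    p = softmax(teacher_out)
    q = softmax(student_out)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))


def batches(items, batch_size=32):
    """Yield consecutive batches; the last one may be shorter."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def distillation_loss(teacher, student, batch):
    """Mean cross-entropy between teacher and student outputs over one batch."""
    return sum(cross_entropy(teacher(x), student(x)) for x in batch) / len(batch)
```

With 10000 training items and a batch size of 32, `batches` yields 313 batches, the last of which holds 16 items. The loss is smallest when the student reproduces the teacher's output, which is what the gradient updates of the student network drive towards.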
step 7, planning a flight path of the unmanned aerial vehicle:
Step 7.1, inputting the state space set at the current moment of the unmanned aerial vehicle whose path is to be planned into the distilled strategy sub-network, outputting the action space set at the current moment, causing the unmanned aerial vehicle to act according to the action space set, and generating the state space set at the next moment and the motion path of the unmanned aerial vehicle at the current moment, wherein the unmanned aerial vehicle path set represents the overall path formed by sequentially connecting, for each moment, the path generated from the state space set at the current moment and the state space set at the next moment;
Step 7.2, judging whether the unmanned aerial vehicle reaches a target place, if so, executing step 8, otherwise, executing step 7.3;
Step 7.3, judging whether the number of actions of the unmanned aerial vehicle has reached a preset upper limit; if so, reporting that the path planning of the unmanned aerial vehicle has failed; otherwise, adding the motion path at the current moment to the overall planned path and repeating step 7.1;
Step 8, taking the path spliced together up to the arrival of the unmanned aerial vehicle at the target place as the planned action path of the unmanned aerial vehicle.
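The control flow of steps 7.1 to 8 can be sketched as a simple rollout loop. The callables `policy`, `step` and `reached_target` stand in for the distilled strategy sub-network, the flight model and the arrival test; they are hypothetical placeholders, as is the function name `plan_path`.

```python
def plan_path(policy, step, reached_target, state, max_actions):
    """Roll out the distilled policy until the target is reached or the
    action budget is exhausted. Returns the list of visited states, or
    None when planning fails."""
    path = [state]
    for _ in range(max_actions):
        action = policy(state)        # step 7.1: query the distilled policy
        state = step(state, action)   # step 7.1: apply the action
        path.append(state)            # splice the new segment onto the path
        if reached_target(state):     # step 7.2: target reached, go to step 8
            return path
    return None                       # step 7.3: action limit reached, planning failed
```

For example, with a one-dimensional toy model in which the state is a position and the policy always moves one unit towards position 3, the call `plan_path(lambda s: 1, lambda s, a: s + a, lambda s: s == 3, 0, 10)` returns `[0, 1, 2, 3]`.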

Claims (10)

1. An unmanned aerial vehicle autonomous path planning method based on a lightweight continuous SAC algorithm, characterized by using a deep reinforcement learning neural network model based on the SAC algorithm and prioritized experience replay, autonomously designing a state space, a continuous action space and a potential-energy-based reward function on the basis of a Markov decision process, and performing model distillation on the trained model, wherein the planning method comprises the following steps:
step 1, constructing a three-degree-of-freedom flight model of an unmanned aerial vehicle movement mode;
Step 2, autonomously designing a state space set, a continuous action space set and a reward function based on an unmanned aerial vehicle flight control mathematical model and a Markov decision process;
step 3, constructing a deep reinforcement learning neural network based on a SAC algorithm:
Step 3.1, constructing a strategy sub-network formed by sequentially connecting six fully connected layers in series, and setting the numbers of nodes of the six fully connected layers to 14, 512, 512, 512, 512 and 3 respectively;
Step 3.2, constructing two action value evaluation sub-networks with the same structure, wherein each sub-network is formed by sequentially connecting six fully connected layers in series, and the numbers of nodes of the six fully connected layers are set to 17, 512, 512, 512, 512 and 1 respectively;
Step 3.3, constructing a state value evaluation sub-network and a state value evaluation target sub-network with the same structure, wherein each sub-network is formed by sequentially connecting six fully connected layers in series, and the numbers of nodes of the six fully connected layers are set to 14, 512, 512, 512, 512 and 1 respectively;
Step 3.4, connecting the two action value evaluation sub-networks in parallel and then connecting them to the strategy sub-network and to the state value sub-network respectively, to form the deep reinforcement learning neural network;
step 4, generating an experience data set:
For each action performed by the unmanned aerial vehicle, combining the state space set before the action, the action space set, the reward information obtained and the state space set after the action is executed into quadruple experience data corresponding to that action, storing the quadruple experience data into an experience pool, and forming the experience data set from the quadruple experience data of at least 10000 actions stored in the experience pool;
step 5, training the deep neural network by using the SAC algorithm:
Step 5.1, randomly initializing a state space set, inputting the state space set into the strategy sub-network, outputting an action space set from the strategy sub-network, causing the unmanned aerial vehicle to act according to the action space set given by the strategy sub-network, storing into the experience pool a quadruple consisting of the state set of the unmanned aerial vehicle before the action, the action space set given by the strategy sub-network, the reward value generated by the reward function and the state set of the unmanned aerial vehicle after the action, assigning the highest weight to the data most recently stored in the experience pool, and attenuating the weights of the remaining data according to their storage order;
Step 5.2, extracting a subset of 128 experience data items from the updated experience pool by prioritized experience replay, inputting the subset into the deep neural network, and outputting two action values, state value 1, state value 2 and a reward value; substituting the two currently output action values and state value 1 into the loss function L1; substituting the smaller of the two currently output action values, together with state value 2, into the loss function L2; substituting the smaller of the two currently output action values into the loss function L3; updating the weight parameters of the action value evaluation sub-networks, the state value evaluation sub-network and the strategy network respectively by gradient back-propagation; and updating the weight parameters of the state value evaluation target sub-network by the exponential decay averaging method, to obtain the updated weight parameters of the deep neural network;
Step 5.3, judging whether the currently output reward value has converged; if so, obtaining the trained deep neural network and then executing step 6; otherwise, executing step 5.1;
step 6, model distillation is carried out on the strategy subnetwork:
Step 6.1, randomly extracting the action space sets of 10000 pieces of experience data from the experience pool as the training set of the strategy student network;
Step 6.2, inputting the training set of the strategy student network, in batches of 32 data items, into the strategy network and the strategy student network respectively, substituting the outputs of the two networks into a cross-entropy loss function to calculate a loss value, and updating the weight parameters of the strategy student network by gradient back-propagation until the loss value converges, to obtain the distilled strategy sub-network;
step 7, planning a flight path of the unmanned aerial vehicle:
Step 7.1, inputting the state space set at the current moment of the unmanned aerial vehicle whose path is to be planned into the distilled strategy sub-network, outputting the action space set at the current moment, causing the unmanned aerial vehicle to act according to the action space set, generating the state space set at the next moment and the motion path of the unmanned aerial vehicle at the current moment, and splicing the motion path at the current moment onto the motion path generated up to the previous moment according to position information;
Step 7.2, judging whether the unmanned aerial vehicle reaches a target place, if so, executing step 8, otherwise, executing step 7.3;
Step 7.3, judging whether the number of actions of the unmanned aerial vehicle has reached a preset upper limit; if so, reporting that the path planning of the unmanned aerial vehicle has failed; otherwise, repeating step 7.1;
Step 8, taking the path spliced together up to the arrival of the unmanned aerial vehicle at the target place as the planned action path of the unmanned aerial vehicle.
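The experience pool of steps 4 and 5.1 can be illustrated by the sketch below. Claim 1 states only that the newest data receive the highest weight and that the weights of older data are attenuated by storage order; the geometric attenuation factor `decay`, sampling with replacement, and the class and method names are assumptions introduced for illustration.

```python
import random
from collections import deque


class RecencyWeightedPool:
    """Experience pool in which the newest entry has weight 1 and each older
    entry is attenuated by a constant factor `decay` (an assumed rule)."""

    def __init__(self, capacity, decay=0.999):
        self.entries = deque(maxlen=capacity)  # oldest entries drop out first
        self.decay = decay

    def store(self, state, action, reward, next_state):
        """Store one quadruple of experience data."""
        self.entries.append((state, action, reward, next_state))

    def weights(self):
        """Weight of each entry, in storage order (oldest first)."""
        newest = len(self.entries) - 1
        return [self.decay ** (newest - i) for i in range(len(self.entries))]

    def sample(self, batch_size=128, rng=random):
        """Draw a batch with replacement, proportionally to the weights."""
        return rng.choices(list(self.entries), weights=self.weights(), k=batch_size)
```

A pool with `capacity=3` and `decay=0.5` that has received four quadruples keeps the last three, with weights 0.25, 0.5 and 1.0 from oldest to newest.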
2. The unmanned aerial vehicle autonomous path planning method based on the lightweight continuous SAC algorithm according to claim 1, wherein the construction of the unmanned aerial vehicle flight control mathematical model in step 1 means that according to the unmanned aerial vehicle flight control principle, the track yaw angle, the track pitch angle and the speed of the unmanned aerial vehicle are respectively controlled based on three quantities of tangential overload, normal overload and roll angle, and the three-degree-of-freedom flight control mathematical model of the unmanned aerial vehicle in a three-dimensional space is constructed as follows:
Wherein g represents the gravitational acceleration of the unmanned aerial vehicle, t represents the unit time of the unmanned aerial vehicle in the flight process, n_x represents the tangential overload of the unmanned aerial vehicle, n_z represents the normal overload of the unmanned aerial vehicle, μ represents the roll angle of the unmanned aerial vehicle, v represents the speed of the unmanned aerial vehicle and satisfies the constraint v ∈ [v_min, v_max], v_min represents the minimum flight speed of the unmanned aerial vehicle, v_max represents the maximum flight speed of the unmanned aerial vehicle, γ represents the track pitch angle of the unmanned aerial vehicle, namely the included angle between the speed direction and the horizontal plane, and satisfies the constraint γ ∈ [−π/2, π/2], ψ represents the track yaw angle of the unmanned aerial vehicle, namely the included angle between the projection of the speed direction on the horizontal plane and the y axis, and satisfies the constraint ψ ∈ [−π, π], and x, y and z respectively represent the coordinate values of the unmanned aerial vehicle in a three-dimensional space coordinate system.
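The model equations themselves are not reproduced in this text. As a minimal sketch, assuming the common point-mass three-degree-of-freedom equations driven by tangential overload, normal overload and roll angle, one integration step could be written as below. The equations, the explicit Euler scheme, the value of `G` and the function name `step_3dof` are assumptions and may differ from the model of the invention.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2 (assumed value)


def step_3dof(state, action, dt, v_min, v_max):
    """One explicit-Euler step of a point-mass three-degree-of-freedom model.
    state = (x, y, z, v, gamma, psi); action = (n_x, n_z, mu)."""
    x, y, z, v, gamma, psi = state
    n_x, n_z, mu = action
    # Assumed textbook dynamics (not taken from the patent).
    dv = G * (n_x - math.sin(gamma))
    dgamma = (G / v) * (n_z * math.cos(mu) - math.cos(gamma))
    dpsi = G * n_z * math.sin(mu) / (v * math.cos(gamma))
    # Kinematics with the yaw angle psi measured from the y axis.
    dx = v * math.cos(gamma) * math.sin(psi)
    dy = v * math.cos(gamma) * math.cos(psi)
    dz = v * math.sin(gamma)
    # Apply the stated constraints on speed, pitch angle and yaw angle.
    v_new = min(max(v + dv * dt, v_min), v_max)
    gamma_new = min(max(gamma + dgamma * dt, -math.pi / 2), math.pi / 2)
    psi_new = (psi + dpsi * dt + math.pi) % (2 * math.pi) - math.pi  # wrap to [-pi, pi)
    return (x + dx * dt, y + dy * dt, z + dz * dt, v_new, gamma_new, psi_new)
```

In level flight (γ = 0, n_x = 0, n_z = 1, μ = 0) with ψ = 0, the sketch keeps speed and angles constant and advances the vehicle along the y axis only. It does not guard against a pitch angle of exactly ±π/2, where the yaw rate is ill-conditioned.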
3. The unmanned aerial vehicle autonomous path planning method based on the lightweight continuous SAC algorithm according to claim 1, wherein the autonomously designed state space set in step 2 comprises the coordinate values of the unmanned aerial vehicle at its position in three-dimensional space, the speed v of the unmanned aerial vehicle, the track pitch angle γ of the unmanned aerial vehicle, the track yaw angle ψ of the unmanned aerial vehicle, the radius d_in of the safe approach range of the target position, the radius d_out of the range within which the target position is judged to have been successfully approached, and the coordinate values of the target point position in three-dimensional space.
4. The unmanned aerial vehicle autonomous path planning method based on the lightweight continuous SAC algorithm according to claim 1, wherein the continuous action space set in step 2 comprises the tangential overload n_x of the unmanned aerial vehicle, the normal overload n_z of the unmanned aerial vehicle, and the roll angle μ of the unmanned aerial vehicle.
5. The unmanned aerial vehicle autonomous path planning method based on the lightweight continuous SAC algorithm according to claim 1, wherein the reward function in step 2 is R_total = r_res + ω_d·r_d + ω_a·r_a + ω_h·r_h, wherein r_res, r_d, r_a and r_h represent the result reward function, the distance reward function, the angle reward function and the height reward function respectively, and ω_d, ω_a and ω_h represent the coefficients of the distance reward function, the angle reward function and the height reward function respectively;
the result reward function r_res is as follows:
Wherein D_in represents the safe distance between the unmanned aerial vehicle and the target site, D_out represents the boundary of the range within which the unmanned aerial vehicle is judged to have successfully approached the target site, and D represents the Euclidean distance between the unmanned aerial vehicle and the target site in three-dimensional space;
The distance reward function r_d is as follows:
wherein D_max represents the flight boundary of the unmanned aerial vehicle with respect to the target position in a three-dimensional coordinate system, and e(·) represents the exponential function with the natural constant e as its base;
the angle reward function r_a is as follows:
wherein θ represents the angle difference between the track yaw angle ψ of the unmanned aerial vehicle and the direction of the straight line from the unmanned aerial vehicle to the target point;
the height reward function r_h is as follows:
Wherein H represents the flying height of the unmanned aerial vehicle, H_low represents the minimum safe height of the unmanned aerial vehicle, and H_high represents the maximum safe height of the unmanned aerial vehicle.
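The combination of the four reward terms can be illustrated as follows, assuming the total reward is the result reward plus the weighted distance, angle and height rewards. The formulas of the individual terms are not reproduced in this text, so they are passed in as already computed values; the function name and argument names are hypothetical.

```python
def total_reward(r_res, r_d, r_a, r_h, w_d, w_a, w_h):
    """Weighted sum of the reward terms (assumed form):
    R_total = r_res + w_d * r_d + w_a * r_a + w_h * r_h."""
    return r_res + w_d * r_d + w_a * r_a + w_h * r_h
```

The coefficients set the relative influence of the shaping terms: raising `w_h`, for instance, makes a height violation weigh more heavily against progress towards the target.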
6. The unmanned aerial vehicle autonomous path planning method based on the lightweight continuous SAC algorithm according to claim 1, wherein the loss function L1 in step 5.2 is as follows:
Wherein E(·) represents the expectation, Q_θ(·) represents the output value of the action value evaluation network, θ represents the parameters of the action value evaluation network, s_t represents the state of the unmanned aerial vehicle before executing the action, a_t represents the action executed at the current moment, r(·) represents the reward value output by the reward function, ζ represents the discount factor, the target-network term represents the output value of the state value evaluation target network, and s_{t+1} represents the state of the unmanned aerial vehicle after the current action is executed.
7. The unmanned aerial vehicle autonomous path planning method based on the lightweight continuous SAC algorithm according to claim 6, wherein the loss function L2 in step 5.2 is as follows:
wherein V_ψ(·) represents the output value of the state value evaluation network, ψ represents the parameters of the state value evaluation network, π_φ(·) represents the output value of the policy network, and φ represents the parameters of the policy network.
8. The unmanned aerial vehicle autonomous path planning method based on the lightweight continuous SAC algorithm according to claim 7, wherein the loss function L3 in step 5.2 is as follows:
Wherein the action argument is a random action obtained by sampling from a Gaussian distribution.
9. The unmanned aerial vehicle autonomous path planning method based on the lightweight continuous SAC algorithm according to claim 7, wherein the exponential decay averaging method in step 5.2 is as follows:
Where ψ represents the parameters of the state value evaluation network and τ represents the probability hyperparameter.
10. The unmanned aerial vehicle autonomous path planning method according to claim 1, wherein the unmanned aerial vehicle path set in step 7.3 represents an overall path in which paths generated at each time are sequentially connected according to a state space set at a current time and a state space set at a next time.
CN202310543396.3A 2023-05-15 2023-05-15 Unmanned aerial vehicle autonomous path planning method based on lightweight continuous SAC algorithm Active CN116430904B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310543396.3A CN116430904B (en) 2023-05-15 2023-05-15 Unmanned aerial vehicle autonomous path planning method based on lightweight continuous SAC algorithm

Publications (2)

Publication Number Publication Date
CN116430904A CN116430904A (en) 2023-07-14
CN116430904B true CN116430904B (en) 2025-08-01

Family

ID=87094569

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310543396.3A Active CN116430904B (en) 2023-05-15 2023-05-15 Unmanned aerial vehicle autonomous path planning method based on lightweight continuous SAC algorithm

Country Status (1)

Country Link
CN (1) CN116430904B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119598825B (en) * 2024-03-11 2026-04-07 北京航空航天大学 Multi-unmanned aerial vehicle collaborative countermeasure decision-making method and system
CN119915298B (en) * 2025-04-01 2025-06-20 南京师范大学 A mobile charging robot intelligent navigation method and system based on distillation strategy
CN120338387A (en) * 2025-04-07 2025-07-18 南京理工大学 Distributed mobile platform mission planning and adjustment method and system based on reinforcement learning algorithm

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115185288A (en) * 2022-05-27 2022-10-14 西北工业大学 SAC algorithm-based unmanned aerial vehicle layered flight decision method
CN115830454A (en) * 2022-12-16 2023-03-21 西安电子科技大学 Hyperspectral image band selection method based on multi-agent feature selection model

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10062218B2 (en) * 2016-06-14 2018-08-28 Bell Helicopter Textron Inc. Statistically equivalent level of safety modeling
EP3422130B8 (en) * 2017-06-29 2023-03-22 The Boeing Company Method and system for autonomously operating an aircraft
CN110781614B (en) * 2019-12-06 2024-03-22 北京工业大学 Ship-borne aircraft play recycling online scheduling method based on deep reinforcement learning
CN115562345B (en) * 2022-10-28 2023-06-27 北京理工大学 Unmanned aerial vehicle detection track planning method based on deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant