Disclosure of Invention
In order to overcome the problems in the prior art, the invention provides a multi-unmanned aerial vehicle collaborative countermeasure decision-making method and system. Based on the HASAC (Heterogeneous-Agent Soft Actor-Critic) algorithm, a multi-agent reinforcement learning algorithm, the method and system realize real-time dynamic countermeasure decision-making for heterogeneous multi-unmanned aerial vehicles.
A multi-unmanned aerial vehicle collaborative countermeasure decision-making method, the method comprising the steps of:
Step 1, constructing a multi-unmanned aerial vehicle collaborative air combat countermeasure decision-making environment by building a multi-unmanned aerial vehicle air combat countermeasure motion model and an air combat situation assessment model;
Step 2, establishing a distributed partially observable Markov decision process model of the multi-unmanned aerial vehicle collaborative countermeasure decision problem according to the action space, local observation and state of each unmanned aerial vehicle in the countermeasure decision environment;
Step 3, designing a multi-machine collaborative countermeasure reward function and the HASAC algorithm network space;
Step 4, training and generating a multi-machine collaborative countermeasure policy model through interaction between the HASAC algorithm network space and the multi-unmanned aerial vehicle collaborative countermeasure decision-making environment, wherein the multiple unmanned aerial vehicles comprise a plurality of friendly unmanned aerial vehicles and a plurality of enemy unmanned aerial vehicles, the friendly side being the red side and the enemy side being the blue side.
In the aspects and any possible implementation manner described above, there is further provided an implementation manner, where the step 1 specifically includes:
Step 11, analyzing the forces acting on each unmanned aerial vehicle and establishing a particle motion model;
Step 12, analyzing the interrelation among the multiple unmanned aerial vehicles and establishing a relative motion model of the multiple unmanned aerial vehicles;
Step 13, establishing an unmanned aerial vehicle air combat situation assessment model according to the angle, speed, height and distance of the unmanned aerial vehicles.
In the above aspect and any possible implementation, an implementation is further provided in which the particle motion model comprises kinematic and dynamic equations, given by the following formulas:
In the formulas, the variables denote, respectively, the coordinates of the unmanned aerial vehicle along the axes of the inertial coordinate system, its speed, its track inclination angle, its track deflection angle, and the gravitational acceleration; the tangential overload of the unmanned aerial vehicle acts along its flight velocity direction; the normal overload acts perpendicular to the flight velocity vector; and the roll angle is the rotation of the unmanned aerial vehicle about the velocity axis.
In the aspects and any possible implementation manner as described above, there is further provided an implementation manner, wherein the building of the relative motion model of the unmanned aerial vehicle is as follows:
In the formulas, the quantities are defined as follows. The red side and the blue side each have an index set numbering their unmanned aerial vehicles. Velocity vectors are defined for a given red unmanned aerial vehicle, for a given blue unmanned aerial vehicle, and for a teammate of the red unmanned aerial vehicle, the teammate subscript denoting the teammate's number. Distances along each coordinate axis are defined between the red unmanned aerial vehicle and the blue unmanned aerial vehicle, and between the red unmanned aerial vehicle and its teammate, together with the relative position vector of the red unmanned aerial vehicle with respect to the blue unmanned aerial vehicle and with respect to the teammate. The deviation angle between the red and blue unmanned aerial vehicles is the angle between the velocity vector of the red unmanned aerial vehicle and their relative position vector, and the departure angle is the angle between the velocity vector of the blue unmanned aerial vehicle and the same relative position vector. Likewise, the deviation angle between the red unmanned aerial vehicle and its teammate is the angle between the velocity vector of the red unmanned aerial vehicle and their relative position vector, and the departure angle is the angle between the velocity vector of the teammate and that relative position vector. The magnitude of the relative distance is defined for the red unmanned aerial vehicle with respect to the blue unmanned aerial vehicle and with respect to its teammate. The speed, track inclination angle and track deflection angle, and the position in the inertial coordinate system, are defined separately for the red unmanned aerial vehicle, the blue unmanned aerial vehicle and the teammate.
In the above aspect and any possible implementation, an implementation is further provided in which step 13 specifically comprises converting the multi-machine collaborative countermeasure into target allocation and single-machine countermeasure, wherein the target allocation is based on situation assessment and, where the red side and the blue side threaten each other, the allocated target poses the least threat to the unmanned aerial vehicle itself while the unmanned aerial vehicle poses the greatest threat to the target. The expression of the situation assessment is as follows:
Wherein the formula gives the situation evaluation value of a red unmanned aerial vehicle with respect to a blue unmanned aerial vehicle; its components represent the situation advantages of the red unmanned aerial vehicle with respect to the blue unmanned aerial vehicle in angle, height, speed and distance, respectively; and the corresponding weights satisfy the accompanying constraint.
In the above aspect and any possible implementation, an implementation is further provided in which step 2 specifically comprises: implementing the multi-unmanned aerial vehicle collaborative countermeasure decision model based on a distributed partially observable Markov decision process, which is described by a tuple whose elements are: the set of red unmanned aerial vehicles; the state space of the red unmanned aerial vehicles; the joint action space of all red unmanned aerial vehicles, each red unmanned aerial vehicle having its own action space; the local observation of each red unmanned aerial vehicle in the global state; the joint reward function of all red unmanned aerial vehicles cooperating against the blue side; the state transition function; and the discount factor. The local observation of each red unmanned aerial vehicle comprises its own information, information on the blue unmanned aerial vehicles, and information on its teammates.
In the above aspect and any possible implementation, an implementation is further provided in which the multi-machine collaborative countermeasure reward function is designed as the sum of the rewards obtained by each red unmanned aerial vehicle against the blue side, where each term of the sum denotes the reward obtained by one red unmanned aerial vehicle against all blue unmanned aerial vehicles.
Establishing the HASAC algorithm network space specifically comprises adopting a centralized-training, distributed-execution framework comprising n policy networks, two value networks and two target value networks, wherein each red unmanned aerial vehicle corresponds to one policy network; the policy networks have the same structure, are independent of one another, and are used to approximate the decision model of the unmanned aerial vehicle; and the value networks are used to evaluate the actions executed by the policy networks under given observations.
In the above aspect and any possible implementation, an implementation is further provided in which step 4 specifically comprises: taking the observation of each red unmanned aerial vehicle at the current moment as the input of its policy network, which outputs the action of that unmanned aerial vehicle under the current observation; the interactive environment then returns the observation of each red unmanned aerial vehicle at the next moment, the global state at the next moment, and the joint reward. The observations and global state at the current moment, the joint action of the red unmanned aerial vehicles, the observations and global state at the next moment, and the joint reward are stored in an experience pool connected to the HASAC algorithm network space.
The invention also provides a multi-unmanned aerial vehicle collaborative countermeasure decision-making system, which is used for implementing the above method and comprises:
The construction module is used for constructing a multi-unmanned aerial vehicle collaborative air combat countermeasure decision-making environment by constructing a multi-unmanned aerial vehicle air combat countermeasure motion model and an air combat situation assessment model;
The first establishing module is used for establishing a distributed partially observable Markov decision process model of the multi-unmanned aerial vehicle collaborative countermeasure decision problem according to the action space and the local observation and the state of each unmanned aerial vehicle in the countermeasure decision environment;
The second establishing module is used for designing a multi-machine collaborative countermeasure reward function and the HASAC algorithm network space;
The generation module is used for interacting with the multi-unmanned aerial vehicle collaborative countermeasure decision-making environment based on the HASAC algorithm network space, and for training and generating a multi-unmanned aerial vehicle collaborative countermeasure policy model, wherein the multiple unmanned aerial vehicles comprise a plurality of friendly unmanned aerial vehicles and a plurality of enemy unmanned aerial vehicles, the friendly side being the red side and the enemy side being the blue side.
Compared with the prior art, the invention has the following beneficial effects:
Building on unmanned aerial vehicle motion modeling and air combat situation assessment, the invention provides a multi-machine collaborative countermeasure decision-making method based on heterogeneous multi-agent reinforcement learning. Firstly, a three-degree-of-freedom unmanned aerial vehicle particle model and a situation assessment model are established. Secondly, a multi-machine collaborative countermeasure decision-making model is established based on the distributed partially observable Markov decision process model, and the actions, states, observations, reward function and networks of the multi-machine collaborative countermeasure are designed. Finally, the networks are trained with the HASAC algorithm as the heterogeneous multi-agent reinforcement learning algorithm to generate a multi-machine collaborative countermeasure decision model. The invention has the following beneficial effects:
(1) For the problem of multi-machine collaborative air combat countermeasure, the invention designs a dedicated global state for the multi-machine collaborative countermeasure decision-making agents; compared with directly concatenating the observation vectors of all agents, this reduces the dimension of the global state and improves training efficiency.
(2) The invention adopts the HASAC algorithm as the heterogeneous multi-agent reinforcement learning algorithm. Maximum entropy is introduced to increase the randomness of action exploration and to avoid convergence to a suboptimal Nash equilibrium; the policy networks of the agents are updated sequentially, and the trained policy networks form a joint policy, so that real-time multi-machine collaborative countermeasure decision-making can be realized.
Detailed Description
For a better understanding of the invention, a detailed description is given below; the disclosure includes, but is not limited to, this description, and similar techniques and methods should be considered as falling within the scope of protection. In order to make the technical problems to be solved, the technical solutions and the advantages clearer, the following detailed description is given with reference to the accompanying drawings and specific embodiments.
It should be understood that the described embodiments of the invention are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
The invention provides a multi-unmanned aerial vehicle collaborative countermeasure decision-making method, which comprises the following steps:
Step 1, constructing a multi-unmanned aerial vehicle collaborative air combat countermeasure decision-making environment by building a multi-unmanned aerial vehicle air combat countermeasure motion model and an air combat situation assessment model;
Step 2, establishing a distributed partially observable Markov decision process model of the multi-unmanned aerial vehicle collaborative countermeasure decision problem according to the action space and the local observation and the state of each unmanned aerial vehicle in the countermeasure decision environment;
Step 3, designing a multi-machine collaborative countermeasure reward function and the HASAC algorithm network space;
Step 4, training and generating a multi-machine collaborative countermeasure policy model through interaction between the HASAC algorithm network space and the multi-unmanned aerial vehicle collaborative countermeasure decision-making environment, wherein the multiple unmanned aerial vehicles comprise a plurality of friendly unmanned aerial vehicles and a plurality of enemy unmanned aerial vehicles, the friendly side being the red side and the enemy side being the blue side.
Preferably, the step 1 specifically includes:
Step 11, analyzing the forces acting on each unmanned aerial vehicle and establishing a particle motion model;
Step 12, analyzing the interrelation among the multiple unmanned aerial vehicles and establishing a relative motion model of the multiple unmanned aerial vehicles;
Step 13, establishing an unmanned aerial vehicle air combat situation assessment model according to the angle, speed, height and distance of the unmanned aerial vehicles.
Preferably, the particle motion model comprises kinematic and dynamic equations, given by the following formulas:
In the formulas, the variables denote, respectively, the coordinates of the unmanned aerial vehicle along the axes of the inertial coordinate system, its speed, its track inclination angle, its track deflection angle, and the gravitational acceleration; the tangential overload of the unmanned aerial vehicle acts along its flight velocity direction; the normal overload acts perpendicular to the flight velocity vector; and the roll angle is the rotation of the unmanned aerial vehicle about the velocity axis.
Preferably, the relative motion model of the unmanned aerial vehicle is established as follows:
In the formulas, the quantities are defined as follows. The red side and the blue side each have an index set numbering their unmanned aerial vehicles. Velocity vectors are defined for a given red unmanned aerial vehicle, for a given blue unmanned aerial vehicle, and for a teammate of the red unmanned aerial vehicle, the teammate subscript denoting the teammate's number. Distances along each coordinate axis are defined between the red unmanned aerial vehicle and the blue unmanned aerial vehicle, and between the red unmanned aerial vehicle and its teammate, together with the relative position vector of the red unmanned aerial vehicle with respect to the blue unmanned aerial vehicle and with respect to the teammate. The deviation angle between the red and blue unmanned aerial vehicles is the angle between the velocity vector of the red unmanned aerial vehicle and their relative position vector, and the departure angle is the angle between the velocity vector of the blue unmanned aerial vehicle and the same relative position vector. Likewise, the deviation angle between the red unmanned aerial vehicle and its teammate is the angle between the velocity vector of the red unmanned aerial vehicle and their relative position vector, and the departure angle is the angle between the velocity vector of the teammate and that relative position vector. The magnitude of the relative distance is defined for the red unmanned aerial vehicle with respect to the blue unmanned aerial vehicle and with respect to its teammate. The speed, track inclination angle and track deflection angle, and the position in the inertial coordinate system, are defined separately for the red unmanned aerial vehicle, the blue unmanned aerial vehicle and the teammate.
Preferably, step 13 specifically comprises converting the multi-machine collaborative countermeasure into target allocation and single-machine countermeasure, wherein the target allocation is based on situation assessment and, where the red side and the blue side threaten each other, the allocated target poses the least threat to the unmanned aerial vehicle itself while the unmanned aerial vehicle poses the greatest threat to the target. The expression of the situation assessment is as follows:
Wherein the formula gives the situation evaluation value of a red unmanned aerial vehicle with respect to a blue unmanned aerial vehicle; its components represent the situation advantages of the red unmanned aerial vehicle with respect to the blue unmanned aerial vehicle in angle, height, speed and distance, respectively; and the corresponding weights satisfy the accompanying constraint.
Preferably, step 2 specifically comprises: implementing the multi-unmanned aerial vehicle collaborative countermeasure decision model based on a distributed partially observable Markov decision process, which is described by a tuple whose elements are: the set of red unmanned aerial vehicles; the state space of the red unmanned aerial vehicles; the joint action space of all red unmanned aerial vehicles, each red unmanned aerial vehicle having its own action space; the local observation of each red unmanned aerial vehicle in the global state; the joint reward function of all red unmanned aerial vehicles cooperating against the blue side; the state transition function; and the discount factor. The local observation of each red unmanned aerial vehicle comprises its own information, information on the blue unmanned aerial vehicles, and information on its teammates.
Preferably, the multi-machine collaborative countermeasure reward function is designed as the sum of the rewards obtained by each red unmanned aerial vehicle against the blue side, where each term of the sum denotes the reward obtained by one red unmanned aerial vehicle against all blue unmanned aerial vehicles.
Preferably, establishing the HASAC algorithm network space specifically comprises adopting a centralized-training, distributed-execution framework comprising n policy networks, two value networks and two target value networks, wherein each red unmanned aerial vehicle corresponds to one policy network; the policy networks have the same structure, are independent of one another, and are used to approximate the decision model of the unmanned aerial vehicle; and the value networks are used to evaluate the performance of the policy networks under given observations.
Preferably, step 4 specifically comprises: taking the observation of each red unmanned aerial vehicle at the current moment as the input of its policy network, which outputs the action of that unmanned aerial vehicle under the current observation; the interactive environment then returns the observation of each red unmanned aerial vehicle at the next moment, the global state at the next moment, and the joint reward. The observations and global state at the current moment, the joint action of the red unmanned aerial vehicles, the observations and global state at the next moment, and the joint reward are stored in an experience pool connected to the HASAC algorithm network space.
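The experience pool described in step 4 can be illustrated with a minimal sketch. The class name ExperiencePool, the Transition field names, the fixed capacity with oldest-first eviction, and uniform random sampling are all assumptions made for illustration; the text only specifies which quantities are stored.

```python
import random
from collections import deque, namedtuple

# One stored interaction step; field names are illustrative assumptions.
Transition = namedtuple(
    "Transition",
    ["obs", "state", "joint_action", "reward", "next_obs", "next_state"],
)


class ExperiencePool:
    """Fixed-capacity pool; the oldest transitions are evicted first (assumed)."""

    def __init__(self, capacity, seed=None):
        self._buffer = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def store(self, obs, state, joint_action, reward, next_obs, next_state):
        # obs / next_obs: one local observation per red unmanned aerial vehicle.
        # state / next_state: the global state; reward: the joint reward.
        self._buffer.append(
            Transition(obs, state, joint_action, reward, next_obs, next_state)
        )

    def sample(self, batch_size):
        # Uniform sampling without replacement (assumed sampling scheme).
        return self._rng.sample(list(self._buffer), batch_size)

    def __len__(self):
        return len(self._buffer)
```

During training, a batch drawn with sample would be fed to the value networks and policy networks; how the batch is consumed depends on the HASAC update rules and is not shown here.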
As shown in fig. 1, fig. 2, fig. 3, fig. 4 and fig. 5, the invention assumes that the red and blue unmanned aerial vehicles are the two opposing sides, the friendly side being the red side and the opponent being the blue side. Based on sensed information, both sides can acquire the position, speed and attitude information of themselves, their opponents and their teammates. Only the decision-making link of the countermeasure is considered, namely situation assessment and maneuver decision, and the overall framework of the air combat decision is shown in fig. 1. The air combat decision is a key link of air combat countermeasure, and the specific implementation process of the invention is as follows:
S1, multi-machine collaborative air combat countermeasure decision-making environment
Taking the friendly unmanned aerial vehicles as the red side and the opponent's unmanned aerial vehicles as the blue side, a particle motion model of a single unmanned aerial vehicle and a relative motion model of multiple unmanned aerial vehicles are established. An air combat situation assessment model is then established by combining the air combat situation elements.
S1-1 building a multi-unmanned aerial vehicle air combat countermeasure motion model
A. Single unmanned aerial vehicle motion model
By simplifying and deriving the forces acting on the unmanned aerial vehicle, a three-degree-of-freedom particle model is established, with the following kinematic and dynamic equations:
(1)
(2)
In the formulas, the variables denote, respectively, the coordinates of the unmanned aerial vehicle along the axes of the inertial coordinate system, its speed, its track inclination angle, its track deflection angle, and the gravitational acceleration; the tangential overload of the unmanned aerial vehicle acts along its flight velocity direction; the normal overload acts perpendicular to the flight velocity vector; and the roll angle is the rotation of the unmanned aerial vehicle about the velocity axis.
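As a hedged illustration, the sketch below advances a three-degree-of-freedom point-mass model by one time step, using the form of the kinematic and dynamic equations commonly found in air combat studies. The specific equations, the names (x, y, z, v, gamma, psi, nx, nz, mu), the upward-pointing z axis, the value of G and the explicit Euler integrator are assumptions and may differ from formulas (1) and (2).

```python
import math
from dataclasses import dataclass

G = 9.81  # gravitational acceleration in m/s^2 (assumed value)


@dataclass
class UavState:
    x: float      # inertial position, m
    y: float
    z: float      # assumed to point upward
    v: float      # speed, m/s (must stay positive)
    gamma: float  # track inclination angle, rad
    psi: float    # track deflection angle, rad


def step(s, nx, nz, mu, dt):
    """Advance the assumed point-mass model by one explicit Euler step."""
    # Kinematics: velocity resolved along the inertial axes.
    dx = s.v * math.cos(s.gamma) * math.cos(s.psi)
    dy = s.v * math.cos(s.gamma) * math.sin(s.psi)
    dz = s.v * math.sin(s.gamma)
    # Dynamics: tangential overload nx, normal overload nz, roll angle mu.
    dv = G * (nx - math.sin(s.gamma))
    dgamma = G / s.v * (nz * math.cos(mu) - math.cos(s.gamma))
    dpsi = G * nz * math.sin(mu) / (s.v * math.cos(s.gamma))
    return UavState(
        s.x + dx * dt,
        s.y + dy * dt,
        s.z + dz * dt,
        s.v + dv * dt,
        s.gamma + dgamma * dt,
        s.psi + dpsi * dt,
    )
```

With nx = 0, nz = 1 and mu = 0 in level flight, this sketch keeps speed, inclination and heading constant, which is a quick sanity check on the signs of the assumed equations.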
B. Multi-unmanned aerial vehicle relative motion model
In the multi-machine countermeasure, the red side has a number of unmanned aerial vehicles with a corresponding index set, and the blue side has a number of unmanned aerial vehicles with a corresponding index set. The relative motion relationship between a red unmanned aerial vehicle, a blue unmanned aerial vehicle and a red teammate unmanned aerial vehicle is shown in fig. 2. Velocity vectors are defined for the red unmanned aerial vehicle, for the blue unmanned aerial vehicle, and for the teammate of the red unmanned aerial vehicle, the teammate subscript denoting the number of the teammate unmanned aerial vehicle. Distances along each coordinate axis are defined between the red unmanned aerial vehicle and the blue unmanned aerial vehicle, and between the red unmanned aerial vehicle and its teammate, together with the relative position vector of the red unmanned aerial vehicle with respect to the blue unmanned aerial vehicle and with respect to the teammate. The deviation angle between the red and blue unmanned aerial vehicles is the angle between the velocity vector of the red unmanned aerial vehicle and their relative position vector, and the departure angle is the angle between the velocity vector of the blue unmanned aerial vehicle and the same relative position vector. Likewise, the deviation angle between the red unmanned aerial vehicle and its teammate is the angle between the velocity vector of the red unmanned aerial vehicle and their relative position vector, and the departure angle is the angle between the velocity vector of the teammate and that relative position vector. The relative distance of the red unmanned aerial vehicle with respect to the blue unmanned aerial vehicle and with respect to its teammate is also defined. These quantities are calculated by the following formulas:
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12)
(13)
(14)
(15)
(16)
(17)
(18)
(19)
In the formulas, the speed, track inclination angle and track deflection angle are defined separately for the red unmanned aerial vehicle, for the blue unmanned aerial vehicle, and for the teammate of the red unmanned aerial vehicle; likewise, the position in the inertial coordinate system is defined separately for the red unmanned aerial vehicle, for the blue unmanned aerial vehicle, and for the teammate.
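The relative motion quantities described above can be sketched as follows. This is an illustration under stated assumptions rather than a reproduction of formulas (3) to (19): the function names are hypothetical, the relative position vector is assumed to point from the first unmanned aerial vehicle to the second, and the velocity vector is assumed to follow the usual decomposition by track inclination angle and track deflection angle. The same function serves for a red-blue pair and for a red-teammate pair.

```python
import math


def velocity_vector(v, gamma, psi):
    """Velocity components from speed, track inclination and deflection angles."""
    return (
        v * math.cos(gamma) * math.cos(psi),
        v * math.cos(gamma) * math.sin(psi),
        v * math.sin(gamma),
    )


def _angle_between(u, w):
    # Angle between two non-zero 3-vectors, clamped against rounding error.
    dot = sum(a * b for a, b in zip(u, w))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_w = math.sqrt(sum(b * b for b in w))
    return math.acos(max(-1.0, min(1.0, dot / (norm_u * norm_w))))


def relative_geometry(pos_a, vel_a, pos_b, vel_b):
    """Return (relative position vector, distance, deviation angle, departure angle).

    Assumes the two positions differ and both speeds are non-zero.
    """
    d = tuple(pb - pa for pa, pb in zip(pos_a, pos_b))  # per-axis distances
    distance = math.sqrt(sum(c * c for c in d))
    deviation = _angle_between(vel_a, d)  # own velocity vs. line of sight
    departure = _angle_between(vel_b, d)  # other's velocity vs. line of sight
    return d, distance, deviation, departure
```

For example, two unmanned aerial vehicles flying head-on along the x axis give a deviation angle of zero and a departure angle of pi under these conventions.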
S1-2 unmanned aerial vehicle air combat situation assessment model
During the multi-machine collaborative countermeasure, multi-target allocation is carried out on the basis of situation assessment, converting the multi-machine collaborative countermeasure problem into a target allocation problem and a single-machine countermeasure problem. The target allocation is based on situation assessment and considers both the threat of the red side to the blue side and the threat of the blue side to the red side, so that the allocated target poses the least threat to the unmanned aerial vehicle itself while the unmanned aerial vehicle poses the greatest threat to the target. The situation assessment considers the angle, speed, height and distance elements of the red and blue unmanned aerial vehicles and is calculated by the following formula:
(20)
Wherein the formula gives the situation evaluation value of a red unmanned aerial vehicle with respect to a blue unmanned aerial vehicle; its components represent the situation advantages of the red unmanned aerial vehicle with respect to the blue unmanned aerial vehicle in angle, height, speed and distance, respectively; and the corresponding weights satisfy the accompanying constraint.
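A minimal sketch of the weighted situation evaluation and of a situation-based target allocation is given below. Several points are assumptions: the evaluation is taken to be a weighted sum of the four advantages, the weights are assumed to sum to 1, the advantage values are assumed to be already computed, and the allocation rule (pick the blue unmanned aerial vehicle with the largest margin between own threat and received threat) is only one plausible reading of the allocation criterion. The function names are hypothetical.

```python
ELEMENTS = ("angle", "height", "speed", "distance")


def situation_value(advantages, weights):
    """Weighted sum of the four situation advantages (assumed form)."""
    if abs(sum(weights[k] for k in ELEMENTS) - 1.0) > 1e-9:
        raise ValueError("weights are assumed to sum to 1")
    return sum(weights[k] * advantages[k] for k in ELEMENTS)


def assign_targets(red_vs_blue, blue_vs_red):
    """Pick one blue target per red unmanned aerial vehicle.

    red_vs_blue[i][j]: situation value of red i with respect to blue j.
    blue_vs_red[j][i]: situation value of blue j with respect to red i.
    The margin criterion below is an illustrative assumption.
    """
    assignment = {}
    for i, row in enumerate(red_vs_blue):
        margins = [row[j] - blue_vs_red[j][i] for j in range(len(row))]
        assignment[i] = max(range(len(row)), key=margins.__getitem__)
    return assignment
```

This greedy rule allows several red unmanned aerial vehicles to select the same target; whether that is permitted is not stated in the text.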
The angle dominance function is designed as follows:
(21)
the height dominance function is designed as follows:
(22)
In the formula, the height advantage is expressed in terms of the z-axis coordinate difference of the red unmanned aerial vehicle relative to the blue unmanned aerial vehicle and the optimal height advantage difference between the red and blue unmanned aerial vehicles.
The speed dominance function is designed as follows:
(23)
In the formula, the speed advantage is expressed in terms of the speed difference of the red unmanned aerial vehicle relative to the blue unmanned aerial vehicle, together with the maximum speed and the minimum speed of the unmanned aerial vehicle.
The distance advantage is designed as follows:
(24)
In the formula, the distance advantage is expressed in terms of the relative distance between the red unmanned aerial vehicle and the blue unmanned aerial vehicle, together with the maximum and minimum attack distances of the airborne weapon of the unmanned aerial vehicle.
By replacing the blue unmanned aerial vehicle quantities in formulas (20) to (24) with the corresponding teammate quantities, the situation parameters of a red unmanned aerial vehicle with respect to its teammate unmanned aerial vehicle are obtained.
S2, establishing a distributed partially observable Markov decision process model of the multi-machine collaborative countermeasure decision problem. Step S1 corresponds to the two parts of the environment in fig. 3, namely the unmanned aerial vehicle model and the situation assessment. In step S2, the whole multi-machine countermeasure decision problem is modeled as a distributed partially observable Markov decision process, which is described by a tuple.
In the multi-machine collaborative countermeasure decision problem, the elements of the tuple are: the set of red unmanned aerial vehicles; the state space of the red unmanned aerial vehicles; the joint action space of all red unmanned aerial vehicles, each red unmanned aerial vehicle having its own action space; the local observation of each red unmanned aerial vehicle in the global state; the state transition function; and the discount factor. At each time step, each red unmanned aerial vehicle executes an action under its current observation according to its own policy, after which the joint reward and the state and observations at the next moment are obtained. The joint reward function is obtained in S3 and is referred to simply as R. The joint objective is to learn a red-side policy that maximizes the expected total return; a maximum-entropy term is introduced into the joint objective, which is given by the following formula:
(25)
Wherein the temperature constant trades off the reward against the maximization of entropy, the expectation operator denotes the mathematical expectation, T denotes the moment at which the episode ends, and the entropy term denotes the entropy of the policy.
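To make the role of the temperature constant concrete, the sketch below evaluates an entropy-regularized discounted return for a single recorded episode. It is an illustration, not a reproduction of formula (25): the function name soft_return, the summation of per-agent policy entropies, and the discounting of the entropy bonus together with the reward are assumptions.

```python
def soft_return(rewards, agent_entropies, gamma, alpha):
    """Discounted return with an entropy bonus at every step (assumed form).

    rewards[t]: joint reward R at step t.
    agent_entropies[t]: policy entropy of each red agent at step t.
    gamma: discount factor; alpha: temperature constant.
    """
    total = 0.0
    for t, (reward, entropies) in enumerate(zip(rewards, agent_entropies)):
        total += (gamma ** t) * (reward + alpha * sum(entropies))
    return total
```

Setting alpha to zero recovers the ordinary discounted return, while a larger alpha rewards more random policies and thus encourages exploration.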
Each red-side unmanned aerial vehicle has its own local observation. The observation $o^i$ of the $i$-th red-side unmanned aerial vehicle comprises its own information $o^i_{\mathrm{self}}$, the blue-side unmanned aerial vehicle information $o^i_{\mathrm{enemy}}$ and the teammate information $o^i_{\mathrm{team}}$. $o^i_{\mathrm{self}}$ comprises the speed, track dip angle and track deflection angle of the $i$-th red-side unmanned aerial vehicle; $o^i_{\mathrm{enemy}}$ comprises the relative angle and distance elements of the $i$-th red-side unmanned aerial vehicle relative to each blue-side unmanned aerial vehicle, together with the speed, track dip angle and track deflection angle of that blue-side unmanned aerial vehicle; $o^i_{\mathrm{team}}$ comprises the relative angle and distance elements of the $i$-th red-side unmanned aerial vehicle relative to each teammate, together with the speed, track dip angle and track deflection angle of that teammate. The subscript $-i$ denotes the teammate number. The observation is given by the following formula:
$$o^i = [o^i_{\mathrm{self}},\ o^i_{\mathrm{enemy}},\ o^i_{\mathrm{team}}]$$
where
$$o^i_{\mathrm{self}} = [v_i, \theta_i, \psi_i],$$
$$o^i_{\mathrm{enemy}} = [\varphi_{ij}, q_{ij}, \Delta p_{ij}, v_j, \theta_j, \psi_j],$$
$$o^i_{\mathrm{team}} = [\varphi_{i,-i}, q_{i,-i}, \Delta p_{i,-i}, v_{-i}, \theta_{-i}, \psi_{-i}],$$
with one such group of elements in $o^i_{\mathrm{enemy}}$ for each blue-side unmanned aerial vehicle $j$. Here $v_i, \theta_i, \psi_i$ denote the speed, track dip angle and track deflection angle of the $i$-th red-side unmanned aerial vehicle; $v_j, \theta_j, \psi_j$ denote the speed, track dip angle and track deflection angle of the $j$-th blue-side unmanned aerial vehicle; $v_{-i}, \theta_{-i}, \psi_{-i}$ denote the speed, track dip angle and track deflection angle of the teammate unmanned aerial vehicle $-i$ of the $i$-th red-side unmanned aerial vehicle; $\varphi_{ij}, q_{ij}, \Delta p_{ij}$ denote the deviation angle, disengagement angle and relative position deviation between the $i$-th red-side unmanned aerial vehicle and the $j$-th blue-side unmanned aerial vehicle; and $\varphi_{i,-i}, q_{i,-i}, \Delta p_{i,-i}$ denote the deviation angle, disengagement angle and relative position deviation between the $i$-th red-side unmanned aerial vehicle and the red-side teammate unmanned aerial vehicle $-i$.
The global state $s$ is obtained by combining the observation information of all red-side unmanned aerial vehicles and removing the redundant information. Compared with directly splicing the observations, this reduces the state dimension and helps accelerate training to convergence:
$$s = [o^i,\ \tilde{o}^{-i}]$$
where $\tilde{o}^{-i}$ denotes the observation information of the red-side teammate unmanned aerial vehicle $-i$ with the elements identical to those in the observation information of the $i$-th red-side unmanned aerial vehicle removed.
The action space $A$ is designed as a continuous action space. The joint action of the red-side unmanned aerial vehicles is $a = (a^1, \dots, a^n)$, where $a^i \in A^i$ is the action of the $i$-th red-side unmanned aerial vehicle.
S3. Design the multi-unmanned aerial vehicle collaborative countermeasure reward function.
The multi-unmanned aerial vehicle collaborative air combat joint reward function $R$ of step S2 is the sum of the rewards obtained by each red-side unmanned aerial vehicle against the blue side, namely $R = \sum_{i=1}^{n} r_i$, where $r_i$ denotes the reward of the $i$-th red-side unmanned aerial vehicle against all blue-side unmanned aerial vehicles and is built from the pairwise rewards $r_{ij}$.
Here $r_{ij}$ denotes the reward of the $i$-th red-side unmanned aerial vehicle relative to the $j$-th blue-side unmanned aerial vehicle, comprising a short-term reward $r_{ij}^{s}$ and a long-term reward $r_{ij}^{l}$, i.e., $r_{ij} = r_{ij}^{s} + r_{ij}^{l}$.
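$r_{ij}^{s}$ is a dense reward, namely the short-term reward designed on the basis of the situation assessment, and is expressed as:
$$r_{ij}^{s} = w_a r_{ij}^{a} + w_h r_{ij}^{h} + w_v r_{ij}^{v} + w_d r_{ij}^{d} \quad (26)$$
where $r_{ij}^{a}, r_{ij}^{h}, r_{ij}^{v}, r_{ij}^{d}$ are the angle, height, speed and distance rewards of the $i$-th red-side unmanned aerial vehicle relative to the $j$-th blue-side unmanned aerial vehicle, and $w_a, w_h, w_v, w_d$ are the corresponding weights of the angle, height, speed and distance rewards.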
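The weighted combination described for the short-term reward can be sketched as follows. This is a minimal illustration that assumes the four component rewards are combined as a weighted sum and that the pairwise reward is the short-term reward plus the long-term reward; the equal weights and the function names are hypothetical, and the component rewards themselves (formulas (27) to (30)) are taken as given inputs.

```python
def short_term_reward(angle_r, height_r, speed_r, distance_r,
                      weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of the angle, height, speed and distance rewards.

    The equal weights are placeholders; the source does not state their values.
    """
    w_a, w_h, w_v, w_d = weights
    return w_a * angle_r + w_h * height_r + w_v * speed_r + w_d * distance_r


def pair_reward(short_term, long_term):
    """Reward of one red-side aircraft relative to one blue-side aircraft."""
    return short_term + long_term


dense = short_term_reward(0.8, 0.4, 0.0, 0.0, weights=(0.5, 0.5, 0.0, 0.0))
print(pair_reward(dense, 0.0))
```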
(27)
(28)
In the formula, $\Delta z_{ij}$ denotes the difference in z-axis coordinate between the $i$-th red-side unmanned aerial vehicle and the $j$-th blue-side unmanned aerial vehicle.
(29)
(30)
$r_{ij}^{l}$ is a sparse reward, namely the reward value given when the engagement between the $i$-th red-side unmanned aerial vehicle and the $j$-th blue-side unmanned aerial vehicle ends. The engagement between the two unmanned aerial vehicles ends when any of the following occurs: the $i$-th red-side unmanned aerial vehicle hits the $j$-th blue-side unmanned aerial vehicle or is hit by it; the $i$-th red-side unmanned aerial vehicle exceeds the simulation boundary; the $j$-th blue-side unmanned aerial vehicle exceeds the simulation boundary; or the maximum simulation step count of the round is exceeded. The reward is calculated by the following formula:
(31)
For the $i$-th red-side unmanned aerial vehicle to hit the $j$-th blue-side unmanned aerial vehicle, the following constraints must be satisfied:
(32)
In the formula, $d_{ij}$ denotes the relative distance between the $i$-th red-side unmanned aerial vehicle and the $j$-th blue-side unmanned aerial vehicle, and $d_{\max}$ and $d_{\min}$ denote the maximum and minimum attack distances of the onboard weapon of the unmanned aerial vehicle.
An unmanned aerial vehicle inside the simulation boundary must satisfy the following formula:
(33)
The simulation step count $t$ of each round used in formula (31) must satisfy the following formula:
(34)
In the formula, $T_{\max}$ is the maximum simulation step count per round.
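The engagement-ending conditions listed above can be sketched as a single check. This is a hedged illustration only: the hit test uses the distance part of the constraint alone (any further conditions in formula (32) are omitted), the lower boundary of every coordinate is assumed to be zero, the boundary is treated as two-dimensional in metres, and all function names and labels are hypothetical. The 150 m to 900 m weapon range, the 20 km x 10 km boundary and the 500-step limit are the values given in the embodiment.

```python
def inside_boundary(position, limits):
    """True if every coordinate lies in [0, limit] (zero lower bound is assumed)."""
    return all(0.0 <= p <= lim for p, lim in zip(position, limits))


def in_attack_range(distance, d_min=150.0, d_max=900.0):
    """Distance part of the hit constraint only; other constraints are omitted."""
    return d_min <= distance <= d_max


def engagement_end(red_pos, blue_pos, distance, step, limits, max_steps=500):
    """Return a label for the first ending condition that holds, else None."""
    if in_attack_range(distance):
        return "weapon_range"
    if not inside_boundary(red_pos, limits):
        return "red_out_of_bounds"
    if not inside_boundary(blue_pos, limits):
        return "blue_out_of_bounds"
    if step > max_steps:
        return "max_steps"
    return None


limits = (20000.0, 10000.0)  # 20 km x 10 km expressed in metres
print(engagement_end((1000.0, 1000.0), (1500.0, 1000.0), 500.0, 10, limits))
```

The sparse long-term reward would then be assigned according to which label is returned, as specified by formula (31).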
S4. Design the HASAC algorithm network space.
The HASAC algorithm adopts a centralized-training, distributed-execution framework and consists of n policy networks, two value networks and two target value networks, where n denotes the number of red-side unmanned aerial vehicles and each policy network corresponds to one red-side unmanned aerial vehicle. Each policy network approximates the decision model of its unmanned aerial vehicle and generates the decision action under that unmanned aerial vehicle's local observation. The policy networks have the same structure, but they learn independently and do not share network parameters. For example, the input of the policy network $\pi^i$ of the $i$-th red-side unmanned aerial vehicle is the observation $o_t^i$ of that unmanned aerial vehicle at time $t$, and the output is the decision action distribution of that unmanned aerial vehicle under the current observation. The value networks approximate the Q value of all red-side unmanned aerial vehicles executing the joint action $a$ in a given global state $s$; that is, the input is the state and the action, and the output is the Q value. The two value networks have the same structure and are trained independently, and the smaller of their Q values is used for the policy network update in order to alleviate Q-value overestimation. Overestimation is an inherent characteristic of reinforcement learning algorithms: when a value network is used to estimate Q values, estimation errors or other factors in the training process cause the learned Q-value function to overestimate the true Q values. The target value networks are used to stabilize the training process; they have the same structure as the value networks, and their parameters are updated as a weighted average based on the value network parameters.
Each policy network takes an observation as input and outputs the decision action under that observation. Each policy network interacts with the environment, and the collected training data are stored in the experience pool. Specifically, the action commands of the red-side unmanned aerial vehicles are applied to the red-side unmanned aerial vehicle models, the whole environment then updates its state and collects data, and the environment returns o', s' and R, which are stored in the experience pool together with o, s and a. During training, calculating the loss function requires the estimated value Q(s, a) of the current state and the estimated value Q(s', a') of the next state. The s' used to estimate Q(s', a') is drawn from the experience pool, whereas a' is generated by feeding o' to the policy network. This is described in detail in step S5.
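The experience pool described above can be sketched as a fixed-capacity buffer with uniform random sampling. This is a minimal illustration, not the implementation of the invention; the class name, field names and the `seed` parameter are hypothetical, and the default capacity is the experience pool size stated in the embodiment.

```python
import random
from collections import deque, namedtuple

# One stored interaction: current observation, state and joint action,
# followed by the next observation, next state and joint reward.
Transition = namedtuple(
    "Transition", "obs state action next_obs next_state reward")


class ExperiencePool:
    """Fixed-capacity pool; the oldest entries are discarded when full."""

    def __init__(self, capacity=1_000_000, seed=None):
        self._data = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, obs, state, action, next_obs, next_state, reward):
        self._data.append(
            Transition(obs, state, action, next_obs, next_state, reward))

    def sample(self, batch_size):
        """Uniformly sample a batch of distinct stored transitions."""
        return self._rng.sample(list(self._data), batch_size)

    def __len__(self):
        return len(self._data)


pool = ExperiencePool(capacity=3, seed=0)
for k in range(5):
    pool.add(k, k, k, k + 1, k + 1, 0.0)
print(len(pool), len(pool.sample(2)))
```

Note that only s' and o' are read back from the pool for the next-state estimate; a' is regenerated by the policy networks from o' at training time.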
S5. Train on the basis of HASAC and generate the multi-unmanned aerial vehicle collaborative countermeasure decision model.
When the countermeasure training is performed on the basis of the HASAC algorithm, a centralized-training, distributed-execution framework is adopted. After the parameters of the networks, the experience pool and so on are initialized, data are collected and the network parameters are updated through interaction between the policy networks and the environment. As shown in FIG. 3, the specific training process is as follows:
1) Obtain the observations $o_t = (o_t^1, \dots, o_t^n)$ at time $t$ from the interaction environment. For each $i$, the policy network $\pi^i$ takes the observation $o_t^i$ of the $i$-th red-side unmanned aerial vehicle as input and outputs the action $a_t^i$ of that unmanned aerial vehicle; the actions of all red-side unmanned aerial vehicles form the joint action $a_t$;
2) Input the joint action $a_t$ into the interaction environment, which returns the observations $o_{t+1}$ at the next time, the global state $s_{t+1}$ and the joint reward $R_t$ for executing the action. Here, the actions are applied to the n red-side unmanned aerial vehicle models respectively to update the states of the red-side unmanned aerial vehicles. On the basis of target allocation, the blue-side decision model converts the many-to-many countermeasure decision problem into the assignment of one red-side target to each blue-side unmanned aerial vehicle followed by one-to-one countermeasure decision problems: the target allocation model outputs, on the basis of the situation assessment, the red-side target to be attacked by each blue-side unmanned aerial vehicle, and the minimax algorithm outputs the decision action of the blue-side unmanned aerial vehicle against its red-side target, after which the states of the blue-side unmanned aerial vehicles are updated. The target allocation uses the situation assessment model to evaluate the mutual threat between the red-side and blue-side unmanned aerial vehicles, so that the threat posed by the allocated red-side target to the blue-side unmanned aerial vehicle is small and the threat posed by the blue-side unmanned aerial vehicle to the red-side target is large. The observation and global state processor extracts the observations $o_{t+1}$ and the global state $s_{t+1}$ of the red-side unmanned aerial vehicles, and the joint reward is calculated on the basis of the reward function;
3) Form the tuple $(o_t, s_t, a_t, o_{t+1}, s_{t+1}, R_t)$ from the observations $o_t$ of the red-side unmanned aerial vehicles and the global state $s_t$ at the current time $t$, the joint action $a_t$ of all red-side unmanned aerial vehicles, the observations $o_{t+1}$ of the red-side unmanned aerial vehicles and the global state $s_{t+1}$ at the next time, and the joint reward $R_t$, and store it in the experience pool;
4) Randomly sample a batch of data from the experience pool as the training data of the value networks, the policy networks and the target value networks, and update the value networks, the target value networks and the policy network of each red-side unmanned aerial vehicle using the Adam optimizer.
Updating the value networks: the two value networks take the sampled $s_t$ and $a_t$ as input and output the estimated Q values $Q_1(s_t, a_t)$ and $Q_2(s_t, a_t)$ of executing the action $a_t$ in the state $s_t$, and the smaller of the two is used as the predicted Q value in the loss function. The two target value networks predict the estimated Q value at the next time on the basis of the sampled next state $s_{t+1}$ and the action $a_{t+1}$ output by the policy networks when the next observation $o_{t+1}$ is input. The target Q value is calculated from the reward $R_t$ and the entropy regularization term, and the value network parameters are updated with the loss function that minimizes the mean square error between the target Q value and the predicted Q value. The parameters of the two value networks are updated independently.
Updating the policy networks: a random permutation of the n red-side unmanned aerial vehicle numbers is generated, and the policy network of each red-side unmanned aerial vehicle is updated in the order of the permutation. When the policy network parameters of the current red-side unmanned aerial vehicle are updated, the policy networks of the red-side unmanned aerial vehicles already updated in this pass are taken into account. The loss function that guides the policy network update takes as input the smaller of the Q values estimated by the two value networks and the entropy regularization term.
Updating the target value networks: the target value network parameters are updated by soft update at intervals of a certain number of steps.
The above steps are repeated until all networks gradually converge, that is, until the reward per round no longer increases significantly and the countermeasure duration of a round no longer shortens significantly over a period of time. Each policy network is then taken as the multi-unmanned aerial vehicle collaborative countermeasure decision model of the red side: its input is the local observation of each red-side unmanned aerial vehicle at the current time, and its output is the decision action executed by that red-side unmanned aerial vehicle under the current observation.
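The three update rules in the training process above can be sketched with plain Python numbers in place of neural networks. This is a minimal illustration under stated assumptions: the target Q value is assumed to use the smaller of the two target-network estimates together with an entropy term of the form minus alpha times the summed log-probabilities of the next actions, as in standard soft actor-critic; the value of `alpha` is hypothetical; `gamma` and `tau` default to the discount factor and soft update factor given in the embodiment. The function names are illustrative only.

```python
import random


def soft_q_target(reward, next_q1, next_q2, next_log_probs,
                  gamma=0.99, alpha=0.2):
    """Target Q = R + gamma * (min(Q1', Q2') - alpha * sum_i log pi_i(a_i'|o_i'))."""
    entropy_term = -alpha * sum(next_log_probs)
    return reward + gamma * (min(next_q1, next_q2) + entropy_term)


def soft_update(target_params, online_params, tau=0.005):
    """Weighted average that moves target parameters toward online parameters."""
    return [(1.0 - tau) * t + tau * p
            for t, p in zip(target_params, online_params)]


def policy_update_order(n_agents, rng=random):
    """Random permutation that fixes the sequential policy-update order."""
    order = list(range(n_agents))
    rng.shuffle(order)
    return order


print(soft_q_target(1.0, 2.0, 3.0, [-0.5, -0.5], gamma=0.9, alpha=0.1))
print(soft_update([0.0, 1.0], [1.0, 1.0]))
print(policy_update_order(3, random.Random(0)))
```

In a full implementation each policy in the returned order would be updated in turn, with the policies updated earlier in the same pass supplying their new actions when the later ones are optimized.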
To facilitate understanding of the above technical solution of the present invention, it is described in detail below by means of a specific example.
Taking two-versus-two air combat as an example, the red side uses the HASAC algorithm for countermeasure training, and the blue side uses a traditional decision algorithm comprising target allocation and the minimax algorithm. The target allocation is designed on the basis of the situation assessment values of the blue-side and red-side unmanned aerial vehicles: a threat matrix of the red side to the blue side and a threat matrix of the blue side to the red side are constructed, the threat to the enemy target is maximized while the own risk is minimized, and the allocation is solved by linear programming. The target allocation turns the multi-unmanned aerial vehicle collaborative countermeasure into one-to-one countermeasures, and the blue-side one-to-one countermeasure decision algorithm is the minimax algorithm. The multi-unmanned aerial vehicle collaborative countermeasure decision framework based on heterogeneous multi-agent reinforcement learning is shown in FIG. 3.
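The blue-side target allocation can be illustrated for small cases as follows. The source states that the allocation is solved by linear programming; the sketch below instead uses exhaustive enumeration, which is a swapped-in technique that is adequate only for a handful of aircraft. It further assumes that "maximize threat to the enemy, minimize own risk" is scored as the difference of the two threat values with equal weight, and that several blue-side aircraft may be assigned the same red-side target. The matrix layout, function name and sample values are hypothetical.

```python
from itertools import product


def allocate_targets(threat_blue_to_red, threat_red_to_blue):
    """Pick one red-side target for every blue-side aircraft.

    threat_blue_to_red[j][i]: threat that blue j poses to red i
    threat_red_to_blue[i][j]: threat that red i poses to blue j
    Returns a tuple whose j-th entry is the red index assigned to blue j.
    """
    n_blue = len(threat_blue_to_red)
    n_red = len(threat_red_to_blue)
    best_targets, best_score = None, float("-inf")
    for targets in product(range(n_red), repeat=n_blue):
        # Reward threat to the chosen target, penalize that target's threat back.
        score = sum(threat_blue_to_red[j][i] - threat_red_to_blue[i][j]
                    for j, i in enumerate(targets))
        if score > best_score:
            best_targets, best_score = targets, score
    return best_targets


blue_to_red = [[0.9, 0.2], [0.3, 0.8]]
red_to_blue = [[0.1, 0.5], [0.6, 0.2]]
print(allocate_targets(blue_to_red, red_to_blue))
```

Each resulting pair is then handled as a one-to-one countermeasure by the blue-side minimax decision algorithm.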
The simulation environment parameters are set as follows. The simulation boundary is 20 km x 10 km. The following parameters are the same for the red-side and blue-side unmanned aerial vehicles: the speed range is 80 m/s to 400 m/s; the track dip angle and the track deflection angle are each limited to a fixed range; the weapon attack range is 150 m to 900 m; and the action quantities are the tangential overload, whose range is given as -3 g, the normal overload, whose range is given as -5 g, and the roll angle about the velocity axis, which is likewise limited to a fixed range.
The parameters of the training algorithm are set as follows: the experience pool size is 1,000,000, the experience replay batch size is 1000, the discount factor is 0.99, the soft update factor is 0.005, the maximum number of steps per round is 500, and the learning rate of every network is 0.0005.
Three sets of experiments are carried out. In the first set, the red side is initially in an evenly matched situation; in the second set, the red side is initially in a dominant situation; in the third set, the red side is initially in an inferior situation. The specific initial states of the red and blue sides are shown in Table 1.
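For reference, the training parameters listed above can be collected in one immutable configuration object. The values are those stated in the embodiment; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Training parameters of the embodiment (field names are illustrative)."""
    experience_pool_size: int = 1_000_000
    batch_size: int = 1000
    discount_factor: float = 0.99
    soft_update_factor: float = 0.005
    max_steps_per_round: int = 500
    learning_rate: float = 0.0005


config = TrainConfig()
print(config)
```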
Observations ofWherein the method comprises the steps of
I.e.
The global state $s$ is spliced from the local observations of the two red-side unmanned aerial vehicles, and the elements duplicated between $o^1$ and $o^2$ are removed to reduce the global state dimension; $\tilde{o}^2$ denotes $o^2$ with the elements repeated in $o^1$ removed, and $\tilde{o}^1$ denotes $o^1$ with the elements repeated in $o^2$ removed.
The observation space of each red-side unmanned aerial vehicle is 27-dimensional and the state space is 42-dimensional. In the above expressions, the symbols denote, respectively: the speeds, track dip angles and track deflection angles of the two red-side unmanned aerial vehicles; the speeds, track dip angles and track deflection angles of the two blue-side unmanned aerial vehicles; the relative angle and relative position information between the $i$-th red-side unmanned aerial vehicle and each of the two blue-side unmanned aerial vehicles; and the relative angle and relative position information between the $i$-th red-side unmanned aerial vehicle and its red-side teammate unmanned aerial vehicle, where $-i$ denotes the teammate number.
The corresponding parameters are shown in Table 1. The relative angles are calculated using the foregoing formulas (14), (15), (16) and (17), and the relative position information is obtained using the foregoing formulas (6) to (11).
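The stated dimensions of 27 and 42 can be checked with a short calculation. The source does not itemize the observation vector, so the sketch below shows one decomposition that reproduces both numbers: it assumes that the relative information for each other aircraft has five elements (for example two relative angles and a three-component relative position) and that the repeated elements are the blue-side kinematics and the two red-side aircraft's own kinematics. All names and the per-group sizes are assumptions.

```python
def observation_dim(n_blue=2, n_teammates=1, own_dim=3, rel_dim=5, kin_dim=3):
    """Own kinematics plus one (relative info, kinematics) group per other aircraft."""
    return own_dim + (n_blue + n_teammates) * (rel_dim + kin_dim)


def two_red_global_state_dim(n_blue=2, own_dim=3, rel_dim=5, kin_dim=3):
    """Splice two red-side observations and drop the second one's repeated elements."""
    obs = observation_dim(n_blue, 1, own_dim, rel_dim, kin_dim)
    # Repeated in the second observation: every blue aircraft's kinematics,
    # the second red aircraft's own kinematics (teammate entry of the first),
    # and the first red aircraft's kinematics (teammate entry of the second).
    repeated = n_blue * kin_dim + own_dim + kin_dim
    return obs + (obs - repeated)


print(observation_dim(), two_red_global_state_dim())
```

Under these assumptions, direct splicing would give 54 dimensions, so removing the 12 repeated elements yields the 42-dimensional state.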
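The action space of each red-side unmanned aerial vehicle is 3-dimensional, and the joint action of the red-side unmanned aerial vehicles is $a = (a^1, a^2)$, where $a^i$ is the action of the $i$-th red-side unmanned aerial vehicle.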
Joint rewarding function
Wherein, the
When the countermeasure training is performed on the basis of the HASAC algorithm, a centralized-training, distributed-execution framework is adopted: each red-side unmanned aerial vehicle has its own policy network, and all red-side unmanned aerial vehicles share centralized value networks and target value networks. The structures of the policy network and the value network are shown in FIG. 4 and FIG. 5, respectively. To alleviate Q-value overestimation, two independent value networks are designed, and two target value networks are introduced to stabilize the training process.
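The input and output sizes implied by this network layout for the two-versus-two example can be tabulated as follows. The sketch assumes that each policy network outputs a mean and a log standard deviation per action dimension (a common way to represent an action distribution, not stated in the source) and that each value network takes the global state concatenated with the joint action; the internal structures shown in FIG. 4 and FIG. 5 are not modeled. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetworkSpec:
    name: str
    input_dim: int
    output_dim: int


def hasac_network_specs(n_red=2, obs_dim=27, state_dim=42, act_dim=3):
    """List the n policy, two value and two target value networks with their sizes."""
    # Policy: local observation in, Gaussian mean and log-std out (assumed form).
    specs = [NetworkSpec(f"policy_{i + 1}", obs_dim, 2 * act_dim)
             for i in range(n_red)]
    # Value networks: global state plus joint action in, a single Q value out.
    critic_input = state_dim + n_red * act_dim
    for kind in ("value", "target_value"):
        specs += [NetworkSpec(f"{kind}_{k}", critic_input, 1) for k in (1, 2)]
    return specs


for spec in hasac_network_specs():
    print(spec)
```

Only the policy networks are needed at execution time, since each one depends solely on its own aircraft's local observation.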
After the parameters of the networks, the experience pool and so on are initialized, data are collected and the network parameters are updated through interaction between the policy networks and the environment. The specific training process, shown in FIG. 3, is as follows:
1) Obtain the observations $o_t = (o_t^1, o_t^2)$ at time $t$ from the interaction environment. Policy network 1 takes the observation $o_t^1$ of the first red-side unmanned aerial vehicle as input and outputs the action $a_t^1$ of the first unmanned aerial vehicle; policy network 2 takes the observation $o_t^2$ of the second red-side unmanned aerial vehicle as input and outputs the action $a_t^2$ of the second unmanned aerial vehicle; together they form the joint action $a_t = (a_t^1, a_t^2)$;
2) Input the joint action $a_t$ into the interaction environment, which returns the observations $o_{t+1}$ at the next time, the global state $s_{t+1}$ and the joint reward $R_t$ for executing the action. Here, the actions are applied to the red 1 and red 2 unmanned aerial vehicle models respectively to update the states of the red-side unmanned aerial vehicles. On the basis of target allocation, the blue-side decision model converts the two-versus-two countermeasure decision problem into the assignment of one red-side target to each blue-side unmanned aerial vehicle followed by one-to-one countermeasure decision problems: the target allocation model outputs, on the basis of the situation assessment, the red-side target to be attacked by each blue-side unmanned aerial vehicle, and the minimax algorithm outputs the decision action of the blue-side unmanned aerial vehicle against its red-side target, after which the states of the blue-side unmanned aerial vehicles are updated. The target allocation uses the situation assessment model to evaluate the mutual threat between the red-side and blue-side unmanned aerial vehicles, so that the threat posed by the allocated red-side target to the blue-side unmanned aerial vehicle is small and the threat posed by the blue-side unmanned aerial vehicle to the red-side target is large. The observation and global state processor extracts the observations $o_{t+1}$ and the global state $s_{t+1}$ of the red-side unmanned aerial vehicles, and the joint reward is calculated on the basis of the reward function;
3) Form the tuple $(o_t, s_t, a_t, o_{t+1}, s_{t+1}, R_t)$ from the observations $o_t$ of the two red-side unmanned aerial vehicles and the global state $s_t$ at the current time $t$, the joint action $a_t$ of the two red-side unmanned aerial vehicles, the observations $o_{t+1}$ of the two red-side unmanned aerial vehicles and the global state $s_{t+1}$ at the next time, and the joint reward $R_t$, and store it in the experience pool;
4) Randomly sample a batch of data from the experience pool as the training data of the value networks, the policy networks and the target value networks, and update the value networks, the target value networks and the policy networks of the two red-side unmanned aerial vehicles using the Adam optimizer.
Updating the value networks: the two value networks take the sampled $s_t$ and $a_t$ as input and output the estimated Q values $Q_1(s_t, a_t)$ and $Q_2(s_t, a_t)$ of executing the action $a_t$ in the state $s_t$, and the smaller of the two is used as the predicted Q value in the loss function. The two target value networks predict the estimated Q value at the next time on the basis of the sampled next state $s_{t+1}$ and the action $a_{t+1}$ output by the policy networks when the next observation $o_{t+1}$ is input. The target Q value is calculated from the reward $R_t$ and the entropy regularization term, and the value network parameters are updated with the loss function that minimizes the mean square error between the target Q value and the predicted Q value. The parameters of the two value networks are updated independently.
Updating the policy networks: a random permutation of the red-side unmanned aerial vehicle numbers is generated, and the policy networks of the two red-side unmanned aerial vehicles are updated in the order of the permutation. When the policy network parameters of the current red-side unmanned aerial vehicle are updated, the policy network of the red-side unmanned aerial vehicle already updated in this pass is taken into account. The loss function that guides the policy network update takes as input the smaller of the Q values estimated by the two value networks and the entropy regularization term.
Updating the target value networks: the target value network parameters are updated by soft update at intervals of a certain number of steps.
The above steps are repeated until all networks gradually converge. Each policy network is then taken as the collaborative countermeasure decision model of a red-side unmanned aerial vehicle: its input is the local observation of that red-side unmanned aerial vehicle in the current environment, and its output is the action executed by that red-side unmanned aerial vehicle under the current observation.
Table 1. States of the red-side and blue-side unmanned aerial vehicles under the three initial situations
The simulation results are shown in FIGS. 6, 7 and 8. Under the three basic initial situations, the policy models generated for the red side are all able to hit the enemy aircraft within a short time, and the two red-side unmanned aerial vehicles cooperate, each hitting one blue-side unmanned aerial vehicle. When the red side is initially in the dominant situation, the red-side unmanned aerial vehicles accelerate to close rapidly on the blue-side unmanned aerial vehicles and, owing to the high turning speed, the two blue-side unmanned aerial vehicles are shot down by the two red-side unmanned aerial vehicles respectively. When the red side is initially in the inferior situation, the red-side unmanned aerial vehicles first perform a turning maneuver to reverse the disadvantage while the blue-side unmanned aerial vehicles accelerate toward them; the blue-side unmanned aerial vehicles, because of their high speed, then overshoot to the front of the red-side unmanned aerial vehicles, and the two red-side unmanned aerial vehicles perform another turning maneuver and are able to shoot down the two blue-side unmanned aerial vehicles respectively.
As a disclosed embodiment, the present invention further provides a multi-unmanned aerial vehicle collaborative countermeasure decision-making system for implementing the method, the system comprising:
The construction module is used for constructing a multi-unmanned aerial vehicle collaborative air combat countermeasure decision-making environment by constructing a multi-unmanned aerial vehicle air combat countermeasure motion model and an air combat situation assessment model;
The first establishing module is used for establishing a distributed partially observable Markov decision process model of the multi-unmanned aerial vehicle collaborative countermeasure decision problem according to the action space and the local observation and the state of each unmanned aerial vehicle in the countermeasure decision environment;
The second establishing module is used for designing the multi-unmanned aerial vehicle collaborative countermeasure reward function and the HASAC algorithm network space;
The generating module is used for training and generating the multi-unmanned aerial vehicle collaborative countermeasure policy model on the basis of interaction between the HASAC algorithm network space and the multi-unmanned aerial vehicle collaborative countermeasure decision-making environment, wherein the multiple unmanned aerial vehicles comprise a plurality of friendly unmanned aerial vehicles and a plurality of enemy unmanned aerial vehicles, the friendly side being the red side and the enemy side being the blue side.
While the foregoing description illustrates and describes preferred embodiments of the present invention, it is to be understood that the invention is not limited to the forms disclosed herein, and that these forms are not to be construed as excluding other embodiments. The invention can be used in various other combinations, modifications and environments, and can be changed or modified within the scope of the inventive concept described herein, through the above teachings or through the skill or knowledge of the relevant art. Modifications and variations that do not depart from the spirit and scope of the invention are intended to fall within the protection scope of the appended claims.