CN120635750A

CN120635750A - Small target detection method for UAV based on deep learning

Info

Publication number: CN120635750A
Application number: CN202510621264.7A
Authority: CN
Inventors: 陆艳辉; 王一丁; 袁鹏辉; 于欢
Original assignee: Chengdu Liangyuan Technology Co ltd; Sichuan Tengdun Technology Co Ltd
Current assignee: Chengdu Liangyuan Technology Co ltd; Sichuan Tengdun Technology Co Ltd
Priority date: 2025-05-14
Filing date: 2025-05-14
Publication date: 2025-09-12

Abstract

The present invention relates to the technical field of deep learning image recognition, and in particular to a method for detecting small targets in unmanned aerial vehicles (UAVs) based on deep learning. The method comprises the following steps: embedding a parallelized block perception attention module in the C2f module of YOLOv8; integrating a large separable kernel attention mechanism into the SPPF module of YOLOv8; replacing the original nearest neighbor interpolation with CARAFE dynamic upsampling; replacing NMS with Soft-NMS; reconstructing the Neck part of the backbone network; combining a dynamic target detection head with three perception modules: scale, space, and task; and designing a WFS-IoU loss function that integrates Wise-IoU, Focaler-IoU, and Shape-IoU. The method addresses core bottlenecks in UAV small target detection, such as feature loss, poor scale adaptability, and defects in dense target processing. Through multi-module collaborative optimization and lightweight design, the method achieves a balance between accuracy and efficiency, thus providing an efficient and reliable detection solution for UAV aerial photography applications.

Description

Unmanned aerial vehicle small target detection method based on deep learning

Technical Field

The invention relates to the technical field of deep learning image recognition, in particular to a method for detecting a small target of an unmanned aerial vehicle based on deep learning.

Background

Unmanned aerial vehicle (UAV) aerial photography is widely used in fields such as disaster relief and city management, but targets in images captured at high altitude are extremely small and densely distributed. Traditional target detection algorithms face significant challenges in such scenes: the information of small targets is severely lost during repeated downsampling, so feature extraction is insufficient and detection accuracy drops markedly.

Specifically, a small target seen from a UAV occupies an extremely small proportion of the image pixels, and its key details are blurred or even completely lost after the multi-layer downsampling of the network. Although existing approaches attempt to alleviate this problem by adding detection layers or introducing attention mechanisms, they lack a systematic design:

Traditional attention modules do not jointly model the local details and long-range dependencies of small targets, so feature fusion is insufficient; fixed pooling operations (such as the SPPF module) have difficulty capturing multi-scale context information dynamically, which further aggravates the feature degradation of small targets; and feature maps generated by upsampling methods such as nearest neighbor interpolation suffer from jagged artifacts and recover details poorly.

In addition, UAV onboard computing resources are limited, and existing improvement schemes often cause a sharp increase in the number of model parameters, making it difficult to balance accuracy and real-time performance. An end-to-end optimization method for UAV small target detection scenes is therefore needed, one that systematically solves the information loss problem through multi-module collaborative design and meets lightweight deployment requirements while improving detection accuracy.

Disclosure of Invention

Aiming at the core bottlenecks such as feature loss, poor scale adaptability, dense target processing defects and the like in unmanned aerial vehicle small target detection, the invention realizes the balance of precision and efficiency through multi-module collaborative optimization and lightweight design, and provides a high-efficiency and reliable detection solution for unmanned aerial vehicle aerial photography application.

The invention is realized by the following technical scheme:

The unmanned aerial vehicle small target detection method based on deep learning comprises the following steps:

S1, embedding a parallelized block perception attention module in the C2f module of YOLOv8, and enhancing the local-global feature fusion capability for small targets through multi-branch feature extraction and adaptive attention processing;

S2, integrating a large separable kernel attention mechanism into the SPPF module of YOLOv8, and expanding the context perception range of multi-scale features by means of spatial dilated convolution and separable kernel decomposition;

S3, replacing the original nearest neighbor interpolation with CARAFE dynamic upsampling, and reducing the loss of detail information caused by high-rate downsampling through content-aware kernel prediction and feature reassembly;

S4, replacing NMS with Soft-NMS, and suppressing redundant detection boxes of dense small targets based on a Gaussian confidence decay strategy, thereby improving the recall rate in occluded scenes;

S5, reconstructing the Neck part of the backbone network, constructing a multi-scale feature pyramid through a scale sequence feature fusion module, a three-feature encoder and a channel and position attention mechanism, and adding a P2 detection layer to capture the low-resolution features of the bottom layer;

S6, combining the dynamic target detection head with three perception modules for scale, space and task, and improving the bounding box regression accuracy for small targets through global pooling, deformable convolution and channel modeling;

S7, designing a WFS-IoU loss function that fuses Wise-IoU, Focaler-IoU and Shape-IoU, and optimizing the bounding box localization of small targets by combining a dynamic focusing mechanism, sample difficulty grading and a shape sensitivity factor.

Further, in step S1, the parallelized block perception attention module includes a local branch, a global branch and a serial convolution branch, wherein the local branch extracts local features through non-overlapping patch aggregation, the global branch captures long-range dependencies through a displacement operation, and the convolution branch refines features through serial convolution; after the output features of the three branches are added and fused, adaptive enhancement is performed sequentially by a one-dimensional channel attention mechanism and a two-dimensional spatial attention mechanism.

Further, in step S2, the large separable kernel attention mechanism decomposes the standard convolution kernel into a separable kernel in the horizontal direction and a separable kernel in the vertical direction, which extract the horizontal and vertical spatial context information of the feature map respectively; the receptive field is expanded by spatial dilated convolutions with different dilation rates, and features of different scales are adaptively fused through a dynamic weight distribution mechanism.

Further, in step S3, the CARAFE dynamic upsampling performs weighted reassembly of local feature regions by predicting an upsampling kernel that depends on the input feature content; the upsampling kernel generation includes the steps of feature map channel compression, kernel prediction and Softmax normalization; the upsampling process retains high-frequency detail information; and the feature reassembly module performs a dot product operation on the local region features.

Further, in step S4, the Soft-NMS method decays the confidence of candidate boxes that overlap a high-confidence detection box, the decay amplitude being dynamically adjusted according to the degree of overlap through a Gaussian function, and detection boxes whose confidence is higher than a preset threshold are retained.

Further, in step S5, the scale sequence feature fusion module normalizes feature maps with different downsampling rates, stacks them, and fuses the multi-scale information through three-dimensional convolution; the three-feature encoder separates large-, medium- and small-size features, then rescales and fuses them to supplement bottom-layer detail information; the channel and position attention mechanism jointly modulates the feature response through channel weights and spatial weights; and the resolution of the P2 detection layer is 160×160 pixels.

Further, in step S6, the dynamic target detection head includes a scale perception module, a space perception module and a task perception module, wherein the scale perception module dynamically adjusts channel weights through global pooling and a fully connected layer, the space perception module captures the deformation characteristics of the target through deformable convolution, and the task perception module optimizes the weight distribution between the classification and regression tasks through a fully connected layer.

Further, in step S7, the WFS-IoU loss function reduces the gradient influence of low-quality samples through a dynamic focusing mechanism, distinguishes easy-to-detect and hard-to-detect targets through sample difficulty grading, and constrains the aspect ratio deviation between the predicted box and the ground-truth box through a shape sensitivity factor, thereby comprehensively improving the localization accuracy of small targets.

Preferably, a terminal device for the unmanned aerial vehicle small target detection method based on deep learning comprises a processor and a computer readable storage medium, wherein the computer readable storage medium stores machine executable instructions which, when executed by the processor, implement the unmanned aerial vehicle small target detection method based on deep learning.

The invention has the beneficial effects that:

(1) According to the unmanned aerial vehicle small target detection method based on deep learning, local details and global context information of the small target are effectively fused through the collaborative design of the parallelized block perception attention module and the large separable kernel attention mechanism, and the feature expression capability of the small target is remarkably improved;

(2) According to the unmanned aerial vehicle small target detection method based on deep learning, provided by the invention, the perception range of the model to targets with different scales is expanded by combining a space expansion convolution and a dynamic weight distribution mechanism, and the detection capability of the scale change targets under the view angle of the unmanned aerial vehicle is enhanced;

(3) According to the unmanned aerial vehicle small target detection method based on deep learning, a dynamic content perception up-sampling method is adopted, loss of high-frequency detail information is reduced, the problems of blurring and jagging of edges of small targets are effectively solved, and the definition of contours is improved;

(4) According to the unmanned aerial vehicle small target detection method based on deep learning, the weights of overlapping detection boxes are dynamically adjusted through a confidence decay strategy, missed detections and false detections in densely distributed or occluded scenes are reduced, and the target recall rate in complex environments is remarkably improved;

(5) According to the unmanned aerial vehicle small target detection method based on deep learning, the dynamic focusing mechanism and the loss function design of the shape sensitivity factor are fused, the sensitivity of the model to the small target position deviation is enhanced, and more accurate bounding box regression is realized.

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it will be obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort to a person skilled in the art.

Fig. 1 is an overall flowchart of the unmanned aerial vehicle small target detection method based on deep learning provided by the invention;

Fig. 2 is a structural diagram of the parallelized block perception attention (PPA) module of the method provided by the invention;

Fig. 3 is a structural diagram of the Bottleneck and P-Bottleneck of the method provided by the invention;

Fig. 4 is a structural diagram of the C2f-PPA of the method provided by the invention;

Fig. 5 is a structural diagram of the LSKA of the method provided by the invention;

Fig. 6 is a structural diagram of the SPPF-LSKA of the method provided by the invention;

Fig. 7 is a workflow diagram of the CARAFE upsampling method of the method provided by the invention;

Fig. 8 is a diagram of the scale sequence feature fusion (SSFF) of the method provided by the invention;

Fig. 9 is a diagram of the three-feature encoder (TFE) of the method provided by the invention;

Fig. 10 is a diagram of the channel and position attention mechanism (CPAM) of the method provided by the invention;

Fig. 11 is a schematic diagram of the Neck network structure after Neck reconstruction and addition of the small target detection layer in the method provided by the invention;

Fig. 12 is a structural diagram of the Dyhead of the method provided by the invention;

Fig. 13 is a bounding box function visualization diagram of the method provided by the invention;

Fig. 14 is a WFS-IoU performance comparison chart of the method provided by the invention;

Fig. 15 is a schematic diagram of a terminal device of the method provided by the invention;

Fig. 16 is a schematic diagram of a readable storage medium of the method provided by the invention;

In the figures: 200-terminal device, 210-memory, 211-RAM, 212-cache memory, 213-ROM, 214-program/utility, 215-program modules, 220-processor, 230-bus, 240-external device, 250-I/O interface, 260-network adapter, 300-program product.

Detailed Description

For the purpose of making apparent the objects, technical solutions and advantages of the present invention, the present invention will be further described in detail with reference to the following examples and the accompanying drawings, wherein the exemplary embodiments of the present invention and the descriptions thereof are for illustrating the present invention only and are not to be construed as limiting the present invention.

Example 1

The unmanned aerial vehicle small target detection method based on deep learning provided by the embodiment comprises the following steps:

1. Embedding a parallelized block perception attention module in the C2f module of YOLOv8, and enhancing the local-global feature fusion capability for small targets through multi-branch feature extraction and adaptive attention processing;

In small target detection tasks, targets are small and often have blurred contours and few pixels, so there is a significant risk of information loss during repeated downsampling, which greatly weakens feature extraction for small targets. Therefore, this embodiment adopts the PPA (parallelized block perception attention) module of HCF-Net, an infrared small target detection network, in which PPA replaces the traditional convolution operations in the basic components of the encoder and decoder and thereby mitigates the loss of key small-target information during downsampling.

The working principle of PPA in this embodiment is as follows:

(1) Multi-branch feature extraction. The main advantage of PPA is its multi-branch feature extraction strategy. As shown in Fig. 2, PPA uses parallel branches, each responsible for extracting features of a different scale and level, which helps capture the multi-scale features of targets and thus improves small target detection accuracy. The strategy comprises three parallel branches: a local branch, a global branch and a serial convolution branch. Given an input feature tensor $F \in \mathbb{R}^{H' \times W' \times C}$, point-wise convolution first adjusts it to $F' \in \mathbb{R}^{H' \times W' \times C'}$. The three branches then compute $F_{local}$, $F_{global}$ and $F_{conv}$, each in $\mathbb{R}^{H' \times W' \times C'}$, and the three results are added to obtain $\tilde{F} \in \mathbb{R}^{H' \times W' \times C'}$. The local and global branches are distinguished by the patch size parameter p, and both are implemented by aggregating and shifting non-overlapping patches in the spatial dimension. In addition, this embodiment computes an attention matrix between the non-overlapping patches to achieve local and global feature extraction and interaction.

(2) Feature fusion and attention. After multi-branch feature extraction, an attention mechanism performs adaptive feature enhancement. The attention module consists of efficient channel attention followed by spatial attention. Here, $\tilde{F}$ is processed sequentially by a one-dimensional channel attention map $M_c$ and a two-dimensional spatial attention map $M_s$. This process can be summarized as follows:

$$F_c = M_c(\tilde{F}) \otimes \tilde{F}, \qquad F_s = M_s(F_c) \otimes F_c, \qquad F'' = \delta\big(\mathcal{B}(F_s)\big)$$

where $\otimes$ denotes element-wise multiplication, $F_c$ and $F_s$ denote the features after channel and spatial selection, $\delta(\cdot)$ and $\mathcal{B}(\cdot)$ denote the rectified linear unit (ReLU) and batch normalization (BN) respectively, and $F''$ is the final output of PPA. The P-Bottleneck structure is shown in Fig. 3.
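The fusion and attention stage described above can be illustrated with a minimal NumPy sketch: the three branch outputs are summed, gated by channel attention and then spatial attention, and passed through normalization and ReLU. The function names are hypothetical, and the learned layers of the real module (the 1D convolution in channel attention, the convolution in spatial attention, the BN scale and shift) are replaced by fixed stand-ins, so this shows only the data flow, not the patented implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat):
    # feat: (C, H, W). Global average pooling gives one descriptor per channel.
    # Assumption: the learned 1D convolution that normally follows is omitted.
    desc = feat.mean(axis=(1, 2))
    return sigmoid(desc)[:, None, None]              # gate of shape (C, 1, 1)

def spatial_attention(feat):
    # Channel-wise average and max maps; a plain mean of the two stands in
    # for the learned convolution that would normally mix them (assumption).
    pooled = 0.5 * (feat.mean(axis=0) + feat.max(axis=0))
    return sigmoid(pooled)[None, :, :]               # gate of shape (1, H, W)

def ppa_fuse(f_local, f_global, f_conv, eps=1e-5):
    fused = f_local + f_global + f_conv              # additive fusion of the branches
    f_c = channel_attention(fused) * fused           # channel selection
    f_s = spatial_attention(f_c) * f_c               # spatial selection
    # Batch normalization reduced to per-channel standardization (no learned affine).
    mean = f_s.mean(axis=(1, 2), keepdims=True)
    std = f_s.std(axis=(1, 2), keepdims=True)
    return np.maximum((f_s - mean) / (std + eps), 0.0)   # ReLU

rng = np.random.default_rng(0)
branches = rng.normal(size=(3, 8, 16, 16))           # three (C, H, W) branch outputs
out = ppa_fuse(*branches)
print(out.shape)
```

The output keeps the shape of each branch, which is what allows the block to replace a Bottleneck inside C2f without changing the surrounding tensor shapes.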

The PPA module is fused with the C2f module of YOLOv8 to obtain the C2f-PPA module, which improves the feature extraction and feature fusion capability of the YOLOv8 algorithm for small targets; the specific structure is shown in Fig. 4.

2. Integrating a large separable kernel attention mechanism into the SPPF module of YOLOv8, and expanding the context perception range of multi-scale features by means of spatial dilated convolution and separable kernel decomposition;

The original SPPF in YOLOv8 contains three successive pooling layers whose outputs are concatenated to ensure multi-scale fusion. Compared with SPP, it reduces computational complexity and is markedly faster while still producing an adaptive-size output. However, this structure has difficulty integrating small, fine-grained feature information effectively, and small-target information is often tied to exactly such features. The large separable kernel attention (LSKA) mechanism is therefore introduced and combined with SPPF, which significantly enhances the multi-scale feature extraction capability of the model.
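As a rough illustration of the three successive pooling layers, the sketch below chains stride-1 max pooling three times and concatenates the input with the three pooled maps. The 5×5 pooling window is an assumption (it is not stated in this text), the function names are hypothetical, and the 1×1 convolutions that surround the pooling in the real module are omitted.

```python
import numpy as np

def max_pool_same(x, k=5):
    # x: (C, H, W) float array; stride-1 max pooling with "same" padding.
    pad = k // 2
    padded = np.pad(x, ((0, 0), (pad, pad), (pad, pad)), constant_values=-np.inf)
    windows = np.lib.stride_tricks.sliding_window_view(padded, (k, k), axis=(1, 2))
    return windows.max(axis=(-2, -1))                # back to (C, H, W)

def sppf_concat(x, k=5):
    y1 = max_pool_same(x, k)                         # receptive field k
    y2 = max_pool_same(y1, k)                        # equivalent to one 2k-1 pool
    y3 = max_pool_same(y2, k)                        # equivalent to one 3k-2 pool
    return np.concatenate([x, y1, y2, y3], axis=0)   # (4C, H, W)

x = np.arange(2 * 8 * 8, dtype=float).reshape(2, 8, 8)
print(sppf_concat(x).shape)
```

Chaining small pools reproduces the 5, 9 and 13 pixel windows of SPP at lower cost, which is the speed advantage mentioned above. In the SPPF-LSKA variant the attention would act on the concatenated tensor before the final convolution.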

The structure and function of LSKA in this embodiment are divided into the following steps:

(1) Initial convolution layers. These layers extract the horizontal and vertical features of the input feature map respectively, helping the model focus on important parts of the image and generating a preliminary attention map.

(2) Spatial dilated convolution layers. After obtaining the preliminary attention map, LSKA applies spatial dilated convolutions with different dilation rates to extract further features. These layers cover a larger receptive field without increasing the computational cost and capture a wider range of context information; operating separately in the horizontal and vertical directions allows finer processing of image features and strengthens the model's understanding of spatial relationships in the image.

(3) Fusion and attention application. After the convolution operations, LSKA fuses the resulting features through a final convolution layer to generate the final attention map. The attention map is multiplied element-wise with the original input feature map, which is how the attention is applied: each element of the original feature map is weighted by the corresponding attention value, so that important features are highlighted and unimportant ones suppressed.

LSKA is a design optimized for attention modules in convolutional neural networks. Its core idea is to decompose the 2D convolution kernel of a depthwise convolution layer into cascaded horizontal 1D and vertical 1D kernels, which makes it practical to use large depthwise kernels directly in the attention module. The advantage is that LSKA strengthens the model's ability to capture long-range dependencies through a larger kernel size, while the separable convolutions avoid the significant growth in parameter count and computational complexity that a full-size convolution would cause. The attention map is generated by capturing broad image context with large separable kernels and spatial dilated convolutions, and the original features are weighted by this attention map, which increases the network's focus on important features and improves model performance.

In this embodiment, the LSKA attention mechanism is placed between the Concat and Conv of the SPPF module to form a new SPPF-LSKA module. LSKA is applied within SPPF to weight the multi-scale feature map. Specifically, LSKA first computes spatial weights and channel weights on each feature map and then adjusts the feature map according to these weights, so that the model focuses more on target-related features and detection accuracy improves. The LSKA structure is shown in Fig. 5, and the SPPF-LSKA structure is shown in Fig. 6.

The specific output formulas of LSKA are as follows:

$$\bar{Z}^C = W^C_{(2d-1)\times 1} * \left( W^C_{1\times(2d-1)} * F^C \right)$$

$$Z^C = W^C_{\lfloor k/d \rfloor \times 1} * \left( W^C_{1\times \lfloor k/d \rfloor} * \bar{Z}^C \right)$$

$$A^C = W_{1\times 1} * Z^C$$

$$T^C = A^C \otimes F^C$$

where $\bar{Z}^C$ denotes the output of the depthwise convolution, $Z^C$ is the output of the depthwise convolution obtained by convolving a kernel of size $k \times k$ with the input feature map, $d$ is the dilation rate, $2d-1$ and $\lfloor k/d \rfloor$ are the sizes of the convolution kernels, $*$ denotes convolution, and $\otimes$ denotes the Hadamard product.

Each channel $C$ of $F$ is convolved with the corresponding channel of the kernel $W$. The output $T^C$ of LSKA is the Hadamard product of the attention map $A^C$ and the input feature map $F^C$.
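A minimal NumPy sketch of this cascade may help: a horizontal then vertical 1D depthwise convolution, the same pair again with dilation, a 1×1 convolution, and a Hadamard product with the input. The helper names, the zero padding, and the example values k = 23 and d = 3 are assumptions chosen for illustration; the weights would be learned in practice.

```python
import numpy as np

def depthwise_1d(x, w, axis, dilation=1):
    # x: (C, H, W); w: (C, K), one 1D kernel per channel (K odd); zero "same" padding.
    k = w.shape[1]
    pad = dilation * (k // 2)
    pad_width = [(0, 0)] * 3
    pad_width[axis] = (pad, pad)
    xp = np.pad(x, pad_width)
    out = np.zeros_like(x)
    n = x.shape[axis]
    for t in range(k):                               # accumulate one kernel tap at a time
        sl = [slice(None)] * 3
        sl[axis] = slice(t * dilation, t * dilation + n)
        out += w[:, t][:, None, None] * xp[tuple(sl)]
    return out

def lska(f, w_h, w_v, w_dh, w_dv, w_pw, d):
    # Cascaded horizontal (axis=2) and vertical (axis=1) 1D depthwise convolutions.
    z_bar = depthwise_1d(depthwise_1d(f, w_h, axis=2), w_v, axis=1)
    # Same pair with dilation d, enlarging the receptive field.
    z = depthwise_1d(depthwise_1d(z_bar, w_dh, axis=2, dilation=d), w_dv, axis=1, dilation=d)
    a = np.einsum("oc,chw->ohw", w_pw, z)            # 1x1 convolution gives the attention map
    return a * f                                     # Hadamard product with the input

rng = np.random.default_rng(0)
c, k, d = 4, 23, 3                                   # illustrative values only
f = rng.normal(size=(c, 12, 12))
small = [rng.normal(size=(c, 2 * d - 1)) for _ in range(2)]
large = [rng.normal(size=(c, k // d)) for _ in range(2)]
print(lska(f, *small, *large, rng.normal(size=(c, c)), d).shape)
```

With these example values the separable form needs 2(2d-1) + 2⌊k/d⌋ = 24 weights per channel, against (2d-1)² + ⌊k/d⌋² = 74 for the corresponding pair of 2D depthwise kernels, which is the saving the decomposition is meant to deliver.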

3. Replacing the original nearest neighbor interpolation with CARAFE dynamic upsampling, and reducing the loss of detail information caused by high-rate downsampling through content-aware kernel prediction and feature reassembly;

YOLOv8 performs upsampling with a nearest neighbor interpolation operator, in which the gray value of each output pixel is set equal to that of the nearest pixel in the original feature map. This method considers only the relation between adjacent pixels and ignores the contextual semantic information of the feature map. As a result, the gray values of the generated image are discontinuous, jagged edges appear, detail information is lost, fine-grained feature expression is lacking, and the feature extraction part of the algorithm is weak in detail, which is unfavorable for small target detection. This embodiment therefore introduces the CARAFE upsampling method, which dynamically adjusts the upsampling kernel to suit different upsampling requirements, so that feature information is better retained and extracted and information loss is reduced while computational efficiency is maintained. In addition, fusing feature maps of different scales yields richer context information, which strengthens the feature expression capability of the network and improves the perception range and localization accuracy of the algorithm. The specific implementation flow is shown in Fig. 7.

In the upsampling kernel prediction part, the input feature map of size $H \times W \times C$ is first compressed by a $1 \times 1$ Conv module to $C_m$ channels; a convolution operation then changes the number of output channels to $\sigma^2 k_{up}^2$, where $\sigma$ is the upsampling rate and $k_{up} \times k_{up}$ is the upsampling kernel size. Since a different upsampling kernel is used for each position of the output feature map, the predicted upsampling kernels have size $\sigma H \times \sigma W \times k_{up}^2$. Finally, Softmax normalization is applied to each upsampling kernel so that its weights sum to 1. In the feature reassembly module, each position of the output feature map is mapped back to the input feature map, the $k_{up} \times k_{up}$ region centered on the mapped position is taken out, and a dot product is computed between this region and the upsampling kernel predicted for that point; different channels at the same position share the same upsampling kernel. For each reassembly kernel, the content-aware reassembly module reassembles the features of the local region. For a target position $l'$ in the output feature map and the corresponding square region $N(\chi_l, k_{up})$ centered at $l = (i, j)$ in the input feature map, the reassembly formula is:

$$\chi'_{l'} = \sum_{n=-r}^{r} \sum_{m=-r}^{r} W_{l'}(n, m) \cdot \chi_{(i+n,\, j+m)}$$

where $r = \lfloor k_{up}/2 \rfloor$ and $W_{l'}$ is the upsampling kernel predicted for $l'$, with which $N(\chi_l, k_{up})$ forms the dot product. Every channel at the same position uses the same upsampling kernel; in other words, the upsampled pixel values are a reweighted recombination of the source pixels, and each pixel in $N(\chi_l, k_{up})$ contributes differently according to the content of the feature map. After the upsampling operation, an output feature map of shape $\sigma H \times \sigma W \times C$ is obtained.
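The reassembly step can be sketched in NumPy as follows. Only the Softmax normalization and the weighted recombination are shown; the kernel prediction network is learned, so its output is replaced here by an arbitrary array of logits. The function names and the zero padding at the borders are assumptions made for this illustration.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def carafe_reassemble(feat, kernel_logits, sigma, k_up):
    # feat: (H, W, C); kernel_logits: (sigma*H, sigma*W, k_up*k_up), one kernel
    # per output position (here a stand-in for the predicted kernels).
    h, w, c = feat.shape
    r = k_up // 2
    kernels = softmax(kernel_logits, axis=-1)        # each kernel sums to 1
    padded = np.pad(feat, ((r, r), (r, r), (0, 0)))  # zero padding at the borders
    out = np.zeros((sigma * h, sigma * w, c))
    for i_out in range(sigma * h):
        for j_out in range(sigma * w):
            i, j = i_out // sigma, j_out // sigma    # map output position back to input
            patch = padded[i:i + k_up, j:j + k_up, :]        # k_up x k_up neighbourhood
            wts = kernels[i_out, j_out].reshape(k_up, k_up, 1)
            out[i_out, j_out] = (wts * patch).sum(axis=(0, 1))  # same kernel for all channels
    return out

rng = np.random.default_rng(0)
feat = rng.normal(size=(4, 5, 3))
logits = rng.normal(size=(8, 10, 25))                # sigma = 2, k_up = 5
print(carafe_reassemble(feat, logits, sigma=2, k_up=5).shape)
```

If every kernel puts all of its weight on the centre tap, the function reduces to nearest neighbor upsampling, which makes the relationship between the two methods concrete: CARAFE generalizes the fixed copy rule into a content-dependent weighted average.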

4. Replacing NMS with Soft-NMS, and suppressing redundant detection boxes of dense small targets based on a Gaussian confidence decay strategy, thereby improving the recall rate in occluded scenes;

YOLOv8 is an anchor-free model: it predicts the center of an object directly rather than locating the object by predicting an offset from a known anchor box. Because this reduces the number of box predictions, it speeds up non-maximum suppression (NMS), a very complex inference step; however, accuracy is still not fully guaranteed for dense, mutually occluding small targets.

This embodiment introduces Soft-NMS, an improvement on the traditional NMS algorithm that suppresses overlapping detections by lowering their confidence rather than discarding them, thereby improving detection accuracy. An additional hyperparameter is introduced to determine, by means of a Gaussian function, how strongly the confidence score of an overlapping detection is discounted.

Soft-NMS differs from NMS in how redundant boxes are handled: NMS deletes redundant boxes directly, whereas Soft-NMS lowers their confidence and then thresholds the boxes in the candidate list again. Since redundant boxes only have their confidence reduced, boxes whose confidence is not 0 remain in the candidate list. The boxes in the candidate list are then filtered once more with the threshold, and boxes below the threshold are deleted. Under this algorithm, an overlapping box is taken to indicate another overlapping object nearby only if the model outputs a sufficiently high confidence for it.

Specifically, Soft-NMS takes the IoU of two anchors and applies a penalty, modifying the anchor's confidence according to the degree of penalty: when the IoU is 0 there is no penalty, and the penalty grows gradually as the IoU increases. The calculation logic of NMS and Soft-NMS is as follows:
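For orientation, the rescoring rules in their commonly published form are given below; this is a sketch under the assumption that the embodiment follows the standard hard NMS rule and the Gaussian variant of Soft-NMS. The symbols are introduced here for illustration: $M$ is the highest-scoring box, $b_i$ a remaining candidate with score $s_i$, $N_t$ the overlap threshold, and $\sigma$ the Gaussian hyperparameter.

$$\text{NMS:}\qquad s_i = \begin{cases} s_i, & \mathrm{IoU}(M, b_i) < N_t \\ 0, & \mathrm{IoU}(M, b_i) \ge N_t \end{cases}$$

$$\text{Soft-NMS (Gaussian):}\qquad s_i = s_i \, \exp\!\left(-\frac{\mathrm{IoU}(M, b_i)^2}{\sigma}\right)$$

In the Gaussian form the factor equals 1 when the IoU is 0 and decreases monotonically as the IoU grows, which matches the penalty behavior described above.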

the specific steps of Soft-NMS are as follows:

(1) Sorting the anchors according to the confidence scores;

(2) Adding an anchor with the highest score into the output list, and deleting the anchor from the bounding box list;

(3) Calculating IoU of the anchor with the highest score and other anchors in the step (2);

(4) If the IoU exceeds the set threshold, decreasing the confidence score of the other anchor (not the highest-scoring anchor from step (2)), and removing anchors whose confidence falls below the threshold;

(5) The above steps are repeated until the bounding box list is emptied.
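The steps above can be sketched as a short NumPy routine. It uses the Gaussian decay, which penalizes every overlapping box continuously instead of applying a separate IoU threshold in step (4); the function names and the default values of `sigma` and `score_thresh` are assumptions chosen for this illustration.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    # Boxes are (x1, y1, x2, y2); returns the IoU of `box` with each row of `boxes`.
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float).copy()
    remaining = list(range(len(scores)))
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])   # steps (1)-(2): highest score
        remaining.remove(best)
        keep.append(best)
        if not remaining:
            break
        idx = np.array(remaining)
        ious = iou_one_to_many(boxes[best], boxes[idx])  # step (3)
        scores[idx] *= np.exp(-(ious ** 2) / sigma)      # step (4): Gaussian decay
        remaining = [i for i in remaining if scores[i] >= score_thresh]
    return keep, scores[keep]                            # step (5): loop until empty

boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]]
keep, kept_scores = soft_nms(boxes, [0.9, 0.8, 0.7])
print(keep, kept_scores)
```

In this example the second box overlaps the first heavily, so its score is reduced and it is ranked last rather than deleted, while the distant third box keeps its score unchanged. Hard NMS with a typical threshold of 0.5 would have discarded the second box outright.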

5. Reconstructing the Neck part of the backbone network, constructing a multi-scale feature pyramid through a scale sequence feature fusion module, a three-feature encoder and a channel and position attention mechanism, and adding a P2 detection layer to capture the low-resolution features of the bottom layer;

ASF-YOLO, a YOLO framework based on attentional scale sequence fusion, combines spatial and scale features. Built on the YOLO segmentation framework, it uses a scale sequence feature fusion (SSFF) module to strengthen the network's multi-scale information extraction capability and a three-feature encoder (TFE) module to fuse feature maps of different scales and add detail information. Finally, a channel and position attention mechanism (CPAM) is introduced to integrate the SSFF and TFE modules; it focuses on informative channels and on the spatial positions of small objects to improve small-object detection performance.

In this embodiment, the SSFF module provides complementary information for small-object segmentation. The TFE module captures the local fine details of small objects: it separates large-, medium- and small-size features, adds the larger feature maps, and then performs feature amplification to enhance detailed feature information, so that local and global feature information are combined to generate a more accurate segmentation map. The CPAM consists of two parts. The first part is a channel attention network, which receives input 1 from the TFE and extracts the representative feature information contained in different channels. The second part is a position attention network, which receives the output of the channel attention network together with input 2 from the SSFF and superimposes them to introduce position information. In this way, the CPAM integrates different attention mechanisms and makes comprehensive use of channel and position information to improve target recognition performance.

As shown in fig. 8, the scale sequence feature fusion SSFF corresponds to the ScalSeq structure: the feature maps of P3, P4 and P5 are normalized to the same size, up-sampled, and then stacked together as the input of a 3D convolution that fuses the multi-scale features.
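The stacking step described for SSFF can be illustrated with a minimal sketch. It assumes single-channel feature maps stored as nested lists, integer size ratios between levels, nearest-neighbour upsampling, and P3 as the common target size; none of these details are specified in this description, and the function names are hypothetical.

```python
def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of a 2-D map (list of rows) by an integer factor."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def stack_scale_sequence(p3, p4, p5):
    """Bring P4 and P5 to the size of P3 and stack the three maps.

    The result has shape (3, H, W): the first axis is the scale axis that a
    3-D convolution would slide over together with height and width.
    """
    h = len(p3)
    return [
        p3,
        upsample_nearest(p4, h // len(p4)),
        upsample_nearest(p5, h // len(p5)),
    ]


# Toy maps: P3 is 8x8, P4 is 4x4, P5 is 2x2 (each level halves the size).
p3 = [[1] * 8 for _ in range(8)]
p4 = [[2] * 4 for _ in range(4)]
p5 = [[3] * 2 for _ in range(2)]
volume = stack_scale_sequence(p3, p4, p5)
print(len(volume), len(volume[0]), len(volume[0][0]))  # 3 8 8
```

The 3D convolution itself is omitted; the sketch only shows how three levels of different resolution become one volume with a scale axis.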

The three-feature encoder TFE corresponds to the Zoom_cat structure. The Zoom_cat module splits large, medium and small features, processes the large-scale feature map and amplifies the features to enhance detail information. Fig. 9 shows the structure of the module, where C represents the number of channels and S represents the feature map size; each Zoom_cat takes three feature maps of different sizes as inputs.
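A Zoom_cat-style fusion of three differently sized maps can be sketched as follows. The specific resizing operators are assumptions made for illustration: the large map is reduced by the sum of 2×2 max pooling and 2×2 average pooling, the small map is enlarged by nearest-neighbour upsampling, and the three results are concatenated along the channel axis. Single-channel nested lists are used, and the helper names are hypothetical.

```python
def pool2x2(fmap):
    """Downsample by 2 using max pooling plus average pooling on each 2x2 window."""
    out = []
    for i in range(0, len(fmap), 2):
        row = []
        for j in range(0, len(fmap[0]), 2):
            win = [fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1]]
            row.append(max(win) + sum(win) / 4)
        out.append(row)
    return out


def upsample2x(fmap):
    """Nearest-neighbour upsampling by a factor of 2."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.extend([list(wide), list(wide)])
    return out


def zoom_cat(large, medium, small):
    """Resize large and small maps to the medium size and stack them as channels."""
    return [pool2x2(large), medium, upsample2x(small)]


large = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
medium = [[0, 0], [0, 0]]
small = [[7]]
fused = zoom_cat(large, medium, small)
print(fused[0])  # [[9.5, 13.5], [25.5, 29.5]]
print(fused[2])  # [[7, 7], [7, 7]]
```

All three output channels share the medium resolution, which is what allows them to be concatenated.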

As shown in fig. 10, the channel and position attention mechanism CPAM corresponds to the asf_attention structure. It extracts representative information by fusing the channel attention part and the position attention part; the position attention network receives the superimposed input formed from the channel attention network output and the SSFF (input 2).

Small objects typically occupy a limited number of pixels in the image and are easily ignored or misclassified. To improve the effectiveness of YOLOv8 in small target detection, a P2 small target detection layer is introduced. The resolution of the P2 detection head is 160×160 pixels, which corresponds to only 2 downsampling operations in the backbone network, so the detection head retains richer low-level feature information about the targets, which benefits the detection of tiny targets.
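The relation between the number of downsampling operations and the 160×160 resolution can be checked with a short calculation. The 640×640 input size used here is an assumption (it is not stated in this passage) chosen because it reproduces the 160×160 figure; the function name is hypothetical.

```python
def pyramid_sizes(input_size=640, levels=(2, 3, 4, 5)):
    """Map each pyramid level P_k to its feature-map side length.

    Level P_k has been halved k times, so its stride is 2**k.
    """
    return {f"P{k}": input_size // (2 ** k) for k in levels}


sizes = pyramid_sizes()
print(sizes)  # {'P2': 160, 'P3': 80, 'P4': 40, 'P5': 20}

# Number of grid cells per level: P2 has four times as many cells as P3.
cells = {name: side * side for name, side in sizes.items()}
print(cells["P2"] // cells["P3"])  # 4
```

Under this assumption a 16×16-pixel target covers 4×4 cells on P2 but only 2×2 cells on P3, which illustrates why the extra layer helps with tiny targets.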

The scale sequence feature fusion SSFF, the three-feature encoder TFE, the channel and position attention mechanism CPAM and the YOLOv8 algorithm structure are fused, and a P2 layer that has passed through fewer convolutions is added; its larger feature map facilitates the recognition of small targets. The result is the ASF-P2 structure. In the reconstructed and optimized Neck, the output feature map of the P2 layer is successively fused with the feature maps of P3, P4 and P5 through downsampling operations, which provides them with more position information and improves the overall prediction accuracy of the network. The improved network structure is shown in FIG. 11.

The asf_attention module can be inserted at two positions in the algorithm, the 25th layer and the 30th layer, so its performance was tested at both positions. The four cases are ASF-P2, ASF-P2 (25), ASF-P2 (30) and ASF-P2 (25, 30). Based on the results in Table 1, the arrangement with the asf_attention module at the 25th layer, namely ASF-P2 (25), is used as the final algorithm structure.

Table 1. Performance change with the position of the attention mechanism in the ASF-P2 structure

6. Combining a dynamic target detection head with three perception modules of scale, space and task, and enhancing the bounding-box regression of small targets through global pooling, deformable convolution and channel modeling;

Small objects may be occluded, partially or completely, by other objects or by the background, which makes them hard to observe and highlights the importance of the perception capability of the detection algorithm. To solve this problem, the present embodiment introduces a dynamic head (Dynamic Head, abbreviated as DyHead), which combines the target detection head with attention mechanisms.

The method integrates multiple self-attention mechanisms across the feature levels (scale awareness), the spatial positions (spatial awareness) and the output channels (task awareness), thereby improving the representation capability of the target detection head. This improves the ability of the model to describe targets and makes the detection of small targets more flexible and accurate.

The key principle of the DyHead algorithm is a unified attention mechanism over the feature map: the scale perception module π_L, the space perception module π_S and the task perception module π_C are applied in series to give the detection head comprehensive attention. π_L performs global pooling, 1×1 convolution, ReLU activation and hard sigmoid activation: AdaptiveAvgPool2d performs global pooling over H×W to obtain the average value of each feature map, and the convolution integrates all channels into a single number. π_S performs a deformable convolution v2 operation (offset + feature-value modulation). π_C uses a two-layer fully connected neural network for channel modeling. A schematic diagram is shown in fig. 12.
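The scale perception step π_L can be sketched in plain Python to make the order of operations concrete. This is a simplified illustration under stated assumptions: each pyramid level is a nested list of shape (C, H, W), the 1×1 convolution is represented by one weight per channel (`conv_w`, a hypothetical parameter), and the hard sigmoid is taken as max(0, min(1, (x + 1) / 2)). π_S and π_C are not covered.

```python
def hard_sigmoid(x):
    """Piecewise-linear sigmoid: max(0, min(1, (x + 1) / 2))."""
    return max(0.0, min(1.0, (x + 1.0) / 2.0))


def scale_attention_weight(level, conv_w, conv_b=0.0):
    """Attention weight for one pyramid level of shape (C, H, W).

    1. global average pooling over H x W -> one mean per channel
    2. 1x1 convolution over the channels -> one scalar
    3. ReLU followed by hard sigmoid     -> weight in [0.5, 1]
       (the ReLU output is non-negative, so the hard sigmoid is at least 0.5)
    """
    means = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in level]
    scalar = sum(w * m for w, m in zip(conv_w, means)) + conv_b
    return hard_sigmoid(max(0.0, scalar))


def apply_scale_attention(levels, conv_w):
    """Rescale every pyramid level by its own attention weight."""
    out = []
    for level in levels:
        a = scale_attention_weight(level, conv_w)
        out.append([[[a * v for v in row] for row in ch] for ch in level])
    return out


# One level with two 2x2 channels whose means are 1.0 and 3.0.
level = [[[1.0, 1.0], [1.0, 1.0]], [[3.0, 3.0], [3.0, 3.0]]]
print(scale_attention_weight(level, [0.5, 0.5]))   # 1.0
print(scale_attention_weight(level, [0.5, -0.5]))  # 0.5
```

Each level receives a single scalar weight, so the module re-weights whole pyramid levels rather than individual pixels or channels.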

7. Designing a WFS-IoU loss function that fuses Wise-IoU, Focaler-IoU and Shape-IoU, and optimizing the positioning of small target bounding boxes by combining a dynamic focusing mechanism, sample difficulty grading and shape sensitivity factors;

In conventional object detection tasks, a commonly used loss function is based on IoU (Intersection over Union), which measures the degree of overlap between the predicted box and the real box. However, IoU has problems such as poor performance on small objects and insensitivity to highly compressed bounding boxes. To solve these problems, the present embodiment designs a new WFS-IoU loss function for small targets by combining the features of Wise-IoU, Focaler-IoU and Shape-IoU.

(1) Wise-IoU. The Complete-IoU (CIoU) loss of YOLOv8 adds aspect ratio similarity to evaluate the quality of an anchor box; Wise-IoU additionally introduces a gradient gain strategy. According to the gradient gain strategy, Wise-IoU is divided into WIoU-v1, WIoU-v2 and WIoU-v3. WIoU-v1 avoids interfering too much with training when the anchor box and the target box coincide well; it builds a distance attention mechanism from a distance metric and, combined with aspect ratio similarity, forms a bounding-box loss function with two layers of attention. The core idea of Wise-IoU is to evaluate the quality of a sample through the outlier degree of its anchor box, i.e. the relative position of the sample and other samples in the feature space, and to dynamically adjust the weights of the parts of the loss function according to that quality.

WIoU-v1 is calculated as follows:

L_WIoUv1＝R_WIoU×L_IoU

R_WIoU＝exp(((x-x_gt)^2+(y-y_gt)^2)/(W_g^2+H_g^2)^*)

L_IoU＝1-IoU＝1-(W_i×H_i)/(w×h+w_gt×h_gt-W_i×H_i)

Wherein L_IoU is the original bounding-box loss function; w, h are the width and height of the predicted box; w_gt, h_gt are the width and height of the real box; W_i, H_i are the width and height of the overlapped part of the predicted box and the real box; x, y are the coordinates of the predicted center point; x_gt, y_gt are the coordinates of the target center point; W_g, H_g are the width and height of the minimum bounding box; R_WIoU is the distance attention function; and the superscript * indicates that W_g and H_g are detached from the gradient computation. Fig. 13 shows a visualization of the bounding-box function.
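The quantities defined above can be combined into a short numerical sketch of WIoU-v1. It follows the published Wise-IoU formulation, L = R_WIoU × (1 − IoU) with R_WIoU = exp(center distance² / enclosing-box diagonal²), which is assumed here to be the formula intended by this embodiment. Boxes are given as (cx, cy, w, h); the function name is hypothetical.

```python
import math


def wiou_v1(pred, gt):
    """Return (WIoU-v1 loss, IoU) for two boxes given as (cx, cy, w, h)."""
    x, y, w, h = pred
    xg, yg, wg, hg = gt
    # Width and height of the overlap (zero when the boxes are disjoint).
    wi = max(0.0, min(x + w / 2, xg + wg / 2) - max(x - w / 2, xg - wg / 2))
    hi = max(0.0, min(y + h / 2, yg + hg / 2) - max(y - h / 2, yg - hg / 2))
    inter = wi * hi
    iou = inter / (w * h + wg * hg - inter)
    # Width and height of the minimum box enclosing both boxes.
    big_w = max(x + w / 2, xg + wg / 2) - min(x - w / 2, xg - wg / 2)
    big_h = max(y + h / 2, yg + hg / 2) - min(y - h / 2, yg - hg / 2)
    # Distance attention; in an autograd framework the denominator would be
    # detached so that it does not contribute gradients.
    r_wiou = math.exp(((x - xg) ** 2 + (y - yg) ** 2) / (big_w ** 2 + big_h ** 2))
    return r_wiou * (1.0 - iou), iou


print(wiou_v1((0, 0, 2, 2), (0, 0, 2, 2)))  # (0.0, 1.0)
loss, iou = wiou_v1((1, 0, 2, 2), (0, 0, 2, 2))
print(round(iou, 4))  # 0.3333
```

Because R_WIoU is at least 1 and grows with the center distance, a poorly centered prediction is penalized more than its IoU alone would suggest.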

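WIoU-v2 and WIoU-v3 build on WIoU-v1 by constructing a gradient-gain calculation and adding a focusing mechanism. WIoU-v2 constructs a monotonic focusing coefficient r to regulate the weight of the anchor box.

WIoU-v2 is calculated as follows:

L_WIoUv2＝r×L_WIoUv1, r＝(L_IoU^*/mean(L_IoU))^γ, γ＞0

where mean(L_IoU) is the running mean of L_IoU, used as a normalizing factor, and the superscript * indicates detachment from the gradient computation.

WIoU-v3 describes the quality of the anchor box by its outlier degree β. A small outlier degree indicates a high-quality anchor box, which is allocated a small gradient gain; an anchor box with a large outlier degree is likewise allocated a small gradient gain so that low-quality samples do not produce harmful gradients. In this way a dynamic non-monotonic focusing coefficient is constructed.

WIoU-v3 is calculated as follows:

β＝L_IoU^*/mean(L_IoU)∈[0,+∞)

L_WIoUv3＝r×L_WIoUv1, r＝β/(δ×α^(β-δ))

where α and δ are hyperparameters.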
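The difference between the monotonic coefficient of WIoU-v2 and the non-monotonic coefficient of WIoU-v3 can be seen numerically. The sketch below assumes the published Wise-IoU forms r = (L_IoU / mean L_IoU)^γ and r = β / (δ·α^(β−δ)) with β = L_IoU / mean L_IoU; the default values γ = 0.5, α = 1.9 and δ = 3 are illustrative assumptions, not values taken from this embodiment, and the running mean is simply passed in as an argument.

```python
def monotonic_gain(l_iou, mean_l_iou, gamma=0.5):
    """WIoU-v2 style gain: grows monotonically with the normalized IoU loss."""
    return (l_iou / mean_l_iou) ** gamma


def non_monotonic_gain(l_iou, mean_l_iou, alpha=1.9, delta=3.0):
    """WIoU-v3 style gain: small for very good and for very poor anchor boxes."""
    beta = l_iou / mean_l_iou  # outlier degree of the anchor box
    return beta / (delta * alpha ** (beta - delta))


# The values below are raw outlier degrees (mean_l_iou = 1.0) chosen for illustration.
for beta in (0.1, 1.5, 3.0, 6.0):
    print(beta, round(monotonic_gain(beta, 1.0), 3), round(non_monotonic_gain(beta, 1.0), 3))
```

With these settings the v2 gain keeps increasing with the outlier degree, while the v3 gain rises to a peak near β ≈ 1.56 (that is, 1/ln α) and then decays, equalling exactly 1 at β = δ.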

(2) Focaler-IoU. Bounding-box regression plays a crucial role in object detection, and localization accuracy depends largely on the bounding-box regression loss function. Prior studies improve regression performance by exploiting the geometric relationship between boxes, but ignore the influence of the distribution of hard and easy samples on bounding-box regression. Focaler-IoU improves the performance of the detector in different detection tasks by focusing on regression samples of different difficulty.

To focus on different regression samples in different detection tasks, the IoU loss is reconstructed using a linear interval mapping method, which improves bounding-box regression. The formula is as follows:

IoU^Focaler＝0, if IoU＜d

IoU^Focaler＝(IoU-d)/(u-d), if d≤IoU≤u

IoU^Focaler＝1, if IoU＞u

Wherein IoU^Focaler is the reconstructed Focaler-IoU, IoU is the original IoU value, and [d,u]∈[0,1]. By adjusting the values of d and u, IoU^Focaler can be made to focus on different regression samples. The loss is defined as follows:

L_Focaler-IoU＝1-IoU^Focaler
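The linear interval mapping and the resulting loss can be written as a small function. The sketch assumes the piecewise form published for Focaler-IoU (0 below d, 1 above u, linear in between); the interval used in the example is an arbitrary illustration, and the function names are hypothetical.

```python
def focaler_iou(iou, d, u):
    """Linear interval mapping of IoU onto [0, 1] over the interval [d, u]."""
    if not 0.0 <= d < u <= 1.0:
        raise ValueError("expected 0 <= d < u <= 1")
    if iou < d:
        return 0.0
    if iou > u:
        return 1.0
    return (iou - d) / (u - d)


def focaler_iou_loss(iou, d, u):
    """L_Focaler-IoU = 1 - IoU^Focaler."""
    return 1.0 - focaler_iou(iou, d, u)


# With [d, u] = [0.3, 0.8], only samples whose IoU lies inside the interval
# produce a loss that still changes with IoU.
for value in (0.2, 0.55, 0.9):
    print(value, focaler_iou_loss(value, 0.3, 0.8))
```

Samples with IoU above u contribute zero loss and samples below d contribute a constant loss of 1, so the regression signal is concentrated on the chosen difficulty band.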

(3) Shape-IoU. Research on the characteristics of bounding-box regression has found that the shape and scale of the boxes themselves influence the regression result. Because small target detection places high demands on box size and box shape, the Shape-IoU method is introduced: it calculates the loss by focusing on the shape and scale of the box itself, which makes bounding-box regression more accurate.

When calculating the loss, the Shape-IoU loss function first calculates the overlap area between the real box and the predicted box (Area of Overlap), and then calculates their combined total area (Area of Union). The ratio of the overlap area to the total area (the IoU value) is a number between 0 and 1 that represents the degree of overlap of the two bounding boxes. The IoU value is finally converted into a loss value (for example 1-IoU) that guides the training and optimization of the model.

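Shape-IoU is defined by the following formulas:

L_ShapeIoU＝1-IoU+distance^shape+0.5×Ω^shape

ww＝2×(w_gt)^scale/((w_gt)^scale+(h_gt)^scale)

hh＝2×(h_gt)^scale/((w_gt)^scale+(h_gt)^scale)

distance^shape＝hh×(x-x_gt)^2/c^2+ww×(y-y_gt)^2/c^2

Ω^shape＝Σ_{t＝w,h}(1-e^(-ω_t))^θ

ω_w＝hh×|w-w_gt|/max(w,w_gt), ω_h＝ww×|h-h_gt|/max(h,h_gt)

Where scale is a scaling factor related to the size of the objects in the dataset; ww and hh are the weight coefficients in the horizontal and vertical directions, respectively; x, y and x_gt, y_gt are the center coordinates of the predicted box and the real box; w, h and w_gt, h_gt are their widths and heights; c is the diagonal length of the minimum bounding box enclosing both boxes; and θ is the exponent applied to each term of the shape loss Ω^shape.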
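A numerical sketch of the Shape-IoU loss is given below. It assumes the published Shape-IoU formulation (shape-weighted center distance plus a shape term raised to the power θ) and uses θ = 4 and scale = 0 as illustrative defaults; these defaults and the function names are assumptions, not values from this embodiment. Boxes are given as (cx, cy, w, h).

```python
import math


def shape_weights(wg, hg, scale):
    """Weights ww and hh derived from the real box shape; they always sum to 2."""
    denom = wg ** scale + hg ** scale
    return 2 * wg ** scale / denom, 2 * hg ** scale / denom


def shape_iou_loss(pred, gt, scale=0.0, theta=4.0):
    """Shape-IoU loss: 1 - IoU + distance^shape + 0.5 * Omega^shape."""
    x, y, w, h = pred
    xg, yg, wg, hg = gt
    wi = max(0.0, min(x + w / 2, xg + wg / 2) - max(x - w / 2, xg - wg / 2))
    hi = max(0.0, min(y + h / 2, yg + hg / 2) - max(y - h / 2, yg - hg / 2))
    inter = wi * hi
    iou = inter / (w * h + wg * hg - inter)
    # Squared diagonal of the minimum box enclosing both boxes.
    big_w = max(x + w / 2, xg + wg / 2) - min(x - w / 2, xg - wg / 2)
    big_h = max(y + h / 2, yg + hg / 2) - min(y - h / 2, yg - hg / 2)
    c2 = big_w ** 2 + big_h ** 2
    ww, hh = shape_weights(wg, hg, scale)
    distance = hh * (x - xg) ** 2 / c2 + ww * (y - yg) ** 2 / c2
    omega_w = hh * abs(w - wg) / max(w, wg)
    omega_h = ww * abs(h - hg) / max(h, hg)
    shape = (1 - math.exp(-omega_w)) ** theta + (1 - math.exp(-omega_h)) ** theta
    return 1 - iou + distance + 0.5 * shape


print(shape_iou_loss((0, 0, 2, 2), (0, 0, 2, 2)))            # 0.0
print(round(shape_iou_loss((0, 0, 2, 2), (0, 0, 4, 2)), 4))  # 0.512
```

In the second call the boxes share a center, so the loss consists of the IoU term (0.5) plus a small shape penalty caused by the width mismatch.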

(4) WFS-IoU. The present embodiment combines the dynamic focusing mechanism of Wise-IoU, the ability of Focaler-IoU to focus on regression samples of different difficulty, and the ability of Shape-IoU to focus on the shape and scale of the box itself; the result is WFS-IoU.

The core formula of WFS-IoU is as follows:

L_{Wise-Focaler-ShapeIoU}＝r×L_ShapeIoU+IoU-IoU^Focaler

Where r is the dynamic focusing coefficient in Wise-IoU, L_ShapeIoU is the loss of Shape-IoU, IoU is the intersection over union between the predicted bounding box and the real bounding box, and IoU^Focaler is the reconstructed Focaler-IoU.
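The core formula can be evaluated directly once its three ingredients are known. In the minimal sketch below, the focusing coefficient r and the Shape-IoU loss are passed in as precomputed numbers, and the interval [d, u] defaults to [0, 0.95] purely for illustration; all function and parameter names are hypothetical.

```python
def focaler_iou(iou, d, u):
    """Linear interval mapping of IoU onto [0, 1] over [d, u]."""
    return min(1.0, max(0.0, (iou - d) / (u - d)))


def wfs_iou_loss(r, l_shape_iou, iou, d=0.0, u=0.95):
    """L = r * L_ShapeIoU + IoU - IoU^Focaler."""
    return r * l_shape_iou + iou - focaler_iou(iou, d, u)


# Example: r = 2.0, L_ShapeIoU = 0.4, IoU = 0.55, [d, u] = [0.3, 0.8]
# IoU^Focaler = (0.55 - 0.3) / (0.8 - 0.3) = 0.5, so L = 0.8 + 0.55 - 0.5
print(round(wfs_iou_loss(2.0, 0.4, 0.55, d=0.3, u=0.8), 4))  # 0.85
```

When r = 1, the term IoU − IoU^Focaler simply replaces the (1 − IoU) part of the Shape-IoU loss with (1 − IoU^Focaler), so the distance and shape penalties are kept while the overlap term follows the Focaler mapping.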

The WIoU-v1, WIoU-v2 and WIoU-v3 versions of WFS-IoU were tested on the YOLOv8s model, and performance was verified on the VisDrone2019 dataset; the comparison of performance indexes is shown in Table 2;

Table 2. Performance change across WFS-IoU versions

As shown in FIG. 14, the WFS-IoU-v2 strategy performs best: its precision, mAP@0.5 and mAP@0.5:0.95 are all optimal, so WFS-IoU-v2 is adopted as the loss function version.

Example 2

As shown in fig. 15, based on embodiment 1, the present embodiment provides a terminal device for the deep learning-based unmanned aerial vehicle small target detection method. The terminal device 200 includes at least one memory 210, at least one processor 220, and a bus 230 connecting the different platform systems.

Memory 210 may include readable media in the form of volatile memory, such as RAM211 and/or cache memory 212, and may further include ROM213.

The memory 210 further stores a computer program that can be executed by the processor 220, so that the processor 220 performs the deep learning-based unmanned aerial vehicle small target detection method according to any one of the embodiments of the present application. The specific implementation is consistent with the implementation and technical effects described in the method embodiment, and the details are not repeated here. Memory 210 may also include a program/utility 214 having a set (at least one) of program modules 215 including, but not limited to, an operating system, one or more application programs, other program modules, and program data; each of these, or some combination of them, may include an implementation of a network environment.

Accordingly, the processor 220 may execute the computer programs described above, as well as the program/utility 214.

Bus 230 may be one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures.

Terminal device 200 can also communicate with one or more external devices 240, such as a keyboard, pointing device, bluetooth device, etc., as well as one or more devices capable of interacting with the terminal device 200, and/or with any device (e.g., router, modem, etc.) that enables the terminal device 200 to communicate with one or more other computing devices. Such communication may occur through the I/O interface 250. Also, terminal device 200 can communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through network adapter 260. Network adapter 260 may communicate with other modules of terminal device 200 via bus 230. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with terminal device 200, including, but not limited to, microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, and data backup storage platforms, among others.

Example 3

The present embodiment provides a computer readable storage medium for the deep learning-based unmanned aerial vehicle small target detection method. The computer readable storage medium stores instructions which, when executed by a processor, implement any of the deep learning-based unmanned aerial vehicle small target detection methods described above. The specific implementation is consistent with the implementation and technical effects described in the method embodiment, and the details are not repeated here.

Fig. 16 shows a program product 300 provided by the present embodiment for implementing the above application, which may employ a portable compact disc read-only memory (CD-ROM) and comprise program code, and may be run on a terminal device, such as a personal computer. However, the program product 300 of the present invention is not limited thereto, and in the present embodiment, the readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. Program product 300 may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of a readable storage medium include an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

The computer readable storage medium may include a data signal propagated in baseband or as part of a carrier wave, with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination of the foregoing. A readable storage medium may also be any readable medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing. Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's computing device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., via the Internet using an Internet service provider).

The foregoing has shown and described the basic principles, main features and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above; the above embodiments and descriptions merely illustrate the principles of the present invention, and various changes and modifications may be made without departing from the spirit and scope of the invention. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. A method for detecting small targets on drones based on deep learning, characterized by comprising the following steps:

S1. Embed a parallelized block-aware attention module in the C2f module of YOLOv8 to enhance the local-global feature fusion capability of small targets through multi-branch feature extraction and adaptive attention processing;

S2. Integrate a large separable kernel attention mechanism into the SPPF module of YOLOv8, and use spatial dilated convolution and separable kernel decomposition to expand the contextual perception range of multi-scale features;

S3. Replace the original nearest neighbor interpolation with CARAFE dynamic upsampling, and reduce the loss of detail information caused by high-rate downsampling through content-aware kernel prediction and feature reorganization;

S4. Replace NMS with Soft-NMS, and suppress redundant detection boxes of dense small targets based on the Gaussian confidence decay strategy to improve the recall rate in occlusion scenarios;

S5. Reconstruct the Neck part of the backbone network, build a multi-scale feature pyramid through the scale sequence feature fusion module, three-feature encoder and channel position attention mechanism, and add a P2 detection layer to capture the low-resolution features of the bottom layer;

S6. Combine the dynamic object detection head with the scale, space, and task perception modules to enhance the bounding box regression of small objects through global pooling, deformable convolution, and channel modeling;

S7. Design the WFS-IoU loss function that integrates Wise-IoU, Focaler-IoU and Shape-IoU, and combine the dynamic focusing mechanism, sample difficulty classification and shape sensitivity factor to optimize the positioning of small object bounding boxes.

2. The deep learning-based UAV small target detection method according to claim 1 is characterized in that in step S1, the parallelized block perception attention module includes a local branch, a global branch, and a serial convolution branch, wherein the local branch extracts local features by non-overlapping patch aggregation, the global branch captures long-range dependencies by displacement operations, and the convolution branch refines features by serial convolution; after the output features of the three branches are added and fused, they are adaptively enhanced by a one-dimensional channel attention mechanism and a two-dimensional spatial attention mechanism in sequence.

3. The deep learning-based UAV small target detection method according to claim 1 is characterized in that in step S2, the large separable kernel attention mechanism decomposes the standard convolution kernel into separable kernels in the horizontal and vertical directions, extracts the horizontal and vertical spatial context information of the feature map respectively, and combines spatial dilated convolution with different dilation rates to expand the receptive field, and adaptively fuses features of different scales through a dynamic weight allocation mechanism.

4. The deep learning-based UAV small target detection method according to claim 1 is characterized in that in step S3, the CARAFE dynamic upsampling performs weighted reorganization on the local feature area by predicting the upsampling kernel related to the input feature content; the generation of the upsampling kernel includes feature map channel compression, kernel prediction and Softmax normalization steps, the upsampling process retains high-frequency detail information, and the feature reorganization module performs a dot product operation on the local area features.

5. The deep learning-based UAV small target detection method according to claim 1 is characterized in that in step S4, the Soft-NMS method performs confidence attenuation on candidate boxes that overlap with high-confidence detection boxes, and the attenuation amplitude is dynamically adjusted by a Gaussian function according to the degree of overlap, retaining detection boxes with confidence levels higher than a preset threshold.

6. The deep learning-based UAV small target detection method according to claim 1 is characterized in that, in step S5, the scale sequence feature fusion module normalizes and splices feature maps with different downsampling ratios, and fuses multi-scale information through three-dimensional convolution; the three-feature encoder separates large, medium, and small size features, re-amplifies and fuses them to supplement the underlying detail information; the channel position attention mechanism jointly modulates the feature response through channel weights and spatial weights; and the resolution of the P2 detection layer is 160×160 pixels.

7. The deep learning-based UAV small target detection method according to claim 1 is characterized in that in step S6, the dynamic target detection head includes a scale perception module, a spatial perception module, and a task perception module, wherein the scale perception module dynamically adjusts channel weights through global pooling and fully connected layers, the spatial perception module captures target deformation features through deformable convolution, and the task perception module optimizes the weight distribution of classification and regression tasks through fully connected layers.

8. The deep learning-based UAV small target detection method according to claim 1 is characterized in that in step S7, the WFS-IoU loss function reduces the gradient influence of low-quality samples through a dynamic focusing mechanism, distinguishes easy-to-detect and difficult-to-detect targets through sample difficulty grading, and constrains the width-to-height ratio deviation between the predicted box and the true box through a shape-sensitive factor, thereby comprehensively improving the small target positioning accuracy.

9. A deep learning-based drone small target detection system, characterized in that it includes a processor and a computer-readable storage medium, wherein the computer-readable storage medium stores machine-executable instructions, and when the machine-executable instructions are executed by the processor, the deep learning-based drone small target detection method described in any one of claims 1 to 8 is implemented.