CN120635750A - Small target detection method for UAV based on deep learning - Google Patents

Small target detection method for UAV based on deep learning

Info

Publication number
CN120635750A
Authority
CN
China
Prior art keywords
feature
module
iou
deep learning
target detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202510621264.7A
Other languages
Chinese (zh)
Inventor
陆艳辉
王一丁
袁鹏辉
于欢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Liangyuan Technology Co ltd
Sichuan Tengdun Technology Co Ltd
Original Assignee
Chengdu Liangyuan Technology Co ltd
Sichuan Tengdun Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Liangyuan Technology Co ltd and Sichuan Tengdun Technology Co Ltd
Priority to CN202510621264.7A
Publication of CN120635750A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G06V20/17 Terrestrial scenes taken from planes or by drones
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/766 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to the technical field of deep learning image recognition, and in particular to a deep-learning-based method for detecting small targets in unmanned aerial vehicle (UAV) imagery. The method comprises: embedding a parallelized block perception attention module in the C2f module of YOLOv8; integrating a large separable kernel attention mechanism into the SPPF module of YOLOv8; replacing the original nearest-neighbor interpolation with CARAFE dynamic upsampling; replacing NMS with Soft-NMS; reconstructing the Neck part of the backbone network; combining a dynamic target detection head with three perception modules (scale, space and task); and designing a WFS-IoU loss function that integrates Wise-IoU, Focaler-IoU and Shape-IoU. The method addresses core bottlenecks in UAV small target detection, such as feature loss, poor scale adaptability and defects in dense target processing. Through multi-module collaborative optimization and lightweight design, it balances accuracy and efficiency and provides an efficient and reliable detection solution for UAV aerial photography applications.

Description

Unmanned aerial vehicle small target detection method based on deep learning
Technical Field
The invention relates to the technical field of deep learning image recognition, in particular to a method for detecting a small target of an unmanned aerial vehicle based on deep learning.
Background
UAV aerial photography is widely used in fields such as disaster relief and city management, but targets in images captured from high altitude are extremely small and densely distributed. Traditional target detection algorithms face significant challenges in such scenes: the information of small targets is severely lost during repeated downsampling, so feature extraction is insufficient and detection accuracy drops significantly.
Specifically, a small target seen from a UAV occupies an extremely small proportion of the image pixels, and its key details become blurred or are lost completely after multi-layer downsampling in the network. Existing approaches try to alleviate this problem by adding detection layers or introducing attention mechanisms, but they lack systematic design:
Traditional attention modules do not jointly model the local details and long-range dependencies of small targets, so feature fusion is insufficient; fixed pooling operations (such as the SPPF module) have difficulty capturing multi-scale context information dynamically, which further aggravates the feature degradation of small targets; and feature maps generated by upsampling methods such as nearest-neighbor interpolation suffer from jagged artifacts and recover detail poorly.
In addition, the onboard computing resources of a UAV are limited, and existing improvement schemes often cause a sharp increase in the number of model parameters, making accuracy and real-time performance difficult to balance. An end-to-end optimization method for UAV small target detection scenes is therefore needed, one that systematically solves the information loss problem through multi-module collaborative design and meets lightweight deployment requirements while improving detection accuracy.
Disclosure of Invention
Aiming at the core bottlenecks such as feature loss, poor scale adaptability, dense target processing defects and the like in unmanned aerial vehicle small target detection, the invention realizes the balance of precision and efficiency through multi-module collaborative optimization and lightweight design, and provides a high-efficiency and reliable detection solution for unmanned aerial vehicle aerial photography application.
The invention is realized by the following technical scheme:
The unmanned aerial vehicle small target detection method based on deep learning comprises the following steps:
S1, embedding a parallelized block perception attention module in the C2f module of YOLOv8, and enhancing the local-global feature fusion capability for small targets through multi-branch feature extraction and adaptive attention processing;
S2, fusing a large separable kernel attention mechanism into the SPPF module of YOLOv8, and expanding the context perception range of multi-scale features by using spatial dilated convolution and separable kernel decomposition;
S3, replacing the original nearest-neighbor interpolation with CARAFE dynamic upsampling, and reducing the loss of detail information caused by high-rate downsampling through content-aware kernel prediction and feature reassembly;
S4, replacing NMS with Soft-NMS, and suppressing redundant detection boxes of dense small targets with a Gaussian confidence decay strategy, thereby improving the recall rate in occluded scenes;
S5, reconstructing the Neck part of the backbone network, constructing a multi-scale feature pyramid through a scale sequence feature fusion module, a three-feature encoder and a channel and position attention mechanism, and adding a P2 detection layer to capture the low-resolution features of the bottom layer;
S6, combining the dynamic target detection head with three perception modules for scale, space and task, and improving the bounding box regression accuracy for small targets through global pooling, deformable convolution and channel modeling;
S7, designing a WFS-IoU loss function that fuses Wise-IoU, Focaler-IoU and Shape-IoU, and optimizing the bounding box localization of small targets by combining a dynamic focusing mechanism, sample difficulty grading and a shape sensitivity factor.
Further, in step S1, the parallelized block perception attention module includes a local branch, a global branch and a serial convolution branch. The local branch extracts local features through non-overlapping patch aggregation, the global branch captures long-range dependencies through a displacement operation, and the serial convolution branch refines features through serial convolution. After the outputs of the three branches are added and fused, adaptive enhancement is performed sequentially by a one-dimensional channel attention mechanism and a two-dimensional spatial attention mechanism.
Further, in step S2, the large separable kernel attention mechanism decomposes the standard convolution kernel into a horizontal separable kernel and a vertical separable kernel, which extract the horizontal and vertical spatial context information of the feature map respectively; it enlarges the receptive field by combining spatial dilated convolutions with different dilation rates, and adaptively fuses features of different scales through a dynamic weight allocation mechanism.
Further, in step S3, the CARAFE dynamic upsampling predicts an upsampling kernel that depends on the input feature content and uses it to perform weighted reassembly of local feature regions. Upsampling kernel generation comprises feature map channel compression, kernel prediction and Softmax normalization; the feature reassembly module performs a dot product between the kernel and the local region features, so that the upsampling process retains high-frequency detail information.
Further, in step S4, the Soft-NMS method attenuates the confidence of candidate boxes that overlap a high-confidence detection box; the attenuation amplitude is adjusted dynamically by a Gaussian function of the degree of overlap, and detection boxes whose confidence remains above a preset threshold are retained.
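The Gaussian confidence decay of step S4 can be illustrated with a minimal sketch. It follows the commonly published Gaussian form of Soft-NMS, in which each overlapping score is multiplied by exp(-IoU²/σ). The function names, the box format (x1, y1, x2, y2) and the default values of `sigma` and `score_threshold` are assumptions made for illustration and are not taken from the patent.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Gaussian Soft-NMS; returns (box_index, decayed_score) in selection order.

    sigma and score_threshold are illustrative defaults (assumptions).
    """
    scores = list(scores)  # work on a copy so the caller's list is untouched
    remaining = list(range(len(boxes)))
    kept = []
    while remaining:
        # Pick the candidate with the highest current confidence.
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        kept.append((best, scores[best]))
        survivors = []
        for i in remaining:
            overlap = iou(boxes[best], boxes[i])
            # Decay instead of discarding: heavier overlap, stronger decay.
            scores[i] *= math.exp(-(overlap ** 2) / sigma)
            if scores[i] >= score_threshold:
                survivors.append(i)
        remaining = survivors
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
for index, score in soft_nms(boxes, [0.9, 0.8, 0.7]):
    print(index, round(score, 4))
```

In this example the second box overlaps the first heavily, so its score is reduced and it is ranked last rather than being deleted outright, which is the behaviour that helps recall in dense or occluded scenes.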
Further, in step S5, the scale sequence feature fusion module normalizes feature maps with different downsampling rates, concatenates them, and fuses the multi-scale information through three-dimensional convolution; the three-feature encoder separates large-, medium- and small-size features and then rescales and fuses them to supplement bottom-layer detail information; the channel and position attention mechanism modulates the feature response jointly through channel weights and spatial weights; and the resolution of the P2 detection layer is 160×160 pixels.
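As a worked example of why an extra P2 layer helps, the sketch below relates detection-layer resolution to target size. It assumes a 640×640 network input and strides of 4, 8, 16 and 32 for P2 to P5; these values are assumptions chosen because they reproduce the 160×160 P2 resolution stated above, and the 12-pixel target is purely illustrative.

```python
def grid_sizes(input_size=640, strides=(4, 8, 16, 32)):
    """Side length of each detection layer's feature map (assumed strides)."""
    names = ("P2", "P3", "P4", "P5")
    return {name: input_size // stride for name, stride in zip(names, strides)}


def cells_covered(target_pixels, stride):
    """Side length of a target measured in feature-map cells."""
    return target_pixels / stride


sizes = grid_sizes()
for name, stride in zip(sizes, (4, 8, 16, 32)):
    # A 12-pixel-wide target spans 3 cells at P2 but under half a cell at P5.
    print(name, sizes[name], cells_covered(12, stride))
```

Under these assumptions a 12-pixel target still covers several cells at P2, whereas at P5 it falls inside a fraction of a single cell, which is where its features are most easily lost.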
Further, in step S6, the dynamic target detection head includes a scale perception module, a space perception module and a task perception module: the scale perception module dynamically adjusts channel weights through global pooling and a fully connected layer, the space perception module captures the deformation characteristics of targets through deformable convolution, and the task perception module optimizes the weight allocation between the classification and regression tasks through a fully connected layer.
Further, in step S7, the WFS-IoU loss function reduces the gradient influence of low-quality samples through a dynamic focusing mechanism, distinguishes easy-to-detect from hard-to-detect targets through sample difficulty grading, and constrains the aspect-ratio deviation between the predicted box and the ground-truth box through a shape sensitivity factor, thereby comprehensively improving the localization accuracy for small targets.
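The sample difficulty grading can be illustrated with one ingredient only. The sketch below shows the interval remapping used by Focaler-IoU as it is usually published: IoU values below a lower bound `d` map to 0, values above an upper bound `u` map to 1, and values in between are rescaled linearly. How the patent weights this term against Wise-IoU and Shape-IoU inside WFS-IoU is not specified in this section, so no combination is shown; the parameter names and the values in the usage lines are assumptions.

```python
def focaler_iou(iou, d=0.0, u=0.95):
    """Linearly remap IoU from the interval [d, u] onto [0, 1].

    d and u are hypothetical interval bounds, not values from the patent.
    """
    if not 0.0 <= d < u <= 1.0:
        raise ValueError("expected 0 <= d < u <= 1")
    if iou < d:
        return 0.0
    if iou > u:
        return 1.0
    return (iou - d) / (u - d)


def focaler_iou_loss(iou, d=0.0, u=0.95):
    """Loss term: one minus the remapped IoU."""
    return 1.0 - focaler_iou(iou, d, u)


# With d=0.25 and u=0.75, only boxes whose IoU lies inside the interval
# contribute a graded loss; the others saturate at 1 or 0.
for value in (0.1, 0.5, 0.9):
    print(value, focaler_iou_loss(value, d=0.25, u=0.75))
```

Shifting the interval towards low IoU values emphasises hard samples, and shifting it towards high values emphasises easy ones, which is one plausible reading of the grading described above.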
Preferably, the deep-learning-based UAV small target detection method is implemented by a processor and a computer-readable storage medium; the computer-readable storage medium stores machine-executable instructions which, when executed by the processor, implement the method.
The invention has the beneficial effects that:
(1) According to the unmanned aerial vehicle small target detection method based on deep learning, local details and global context information of the small target are effectively fused through the collaborative design of the parallelized block perception attention module and the large separable kernel attention mechanism, and the feature expression capability for small targets is remarkably improved;
(2) According to the unmanned aerial vehicle small target detection method based on deep learning provided by the invention, the perception range of the model for targets of different scales is expanded by combining spatial dilated convolution with a dynamic weight allocation mechanism, and the detection capability for targets whose scale varies under the UAV viewing angle is enhanced;
(3) According to the unmanned aerial vehicle small target detection method based on deep learning, a dynamic content perception up-sampling method is adopted, loss of high-frequency detail information is reduced, the problems of blurring and jagging of edges of small targets are effectively solved, and the definition of contours is improved;
(4) According to the unmanned aerial vehicle small target detection method based on deep learning, the weight of the overlapped detection frames is dynamically adjusted through a confidence level attenuation strategy, missing detection and false detection in densely distributed or shielding scenes are reduced, and the target recall rate in a complex environment is remarkably improved;
(5) According to the unmanned aerial vehicle small target detection method based on deep learning, the dynamic focusing mechanism and the loss function design of the shape sensitivity factor are fused, the sensitivity of the model to the small target position deviation is enhanced, and more accurate bounding box regression is realized.
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it will be obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort to a person skilled in the art.
Fig. 1 is an overall flowchart of the deep-learning-based UAV small target detection method provided by the invention;
Fig. 2 is a structure diagram of the parallelized block perception attention (PPA) module;
Fig. 3 is a structure diagram of the Bottleneck and P-Bottleneck blocks;
Fig. 4 is a structure diagram of C2f-PPA;
Fig. 5 is a structure diagram of LSKA;
Fig. 6 is a structure diagram of SPPF-LSKA;
Fig. 7 is a workflow diagram of the CARAFE upsampling method;
Fig. 8 is a diagram of scale sequence feature fusion (SSFF);
Fig. 9 is a diagram of the three-feature encoder (TFE);
Fig. 10 is a diagram of the channel and position attention mechanism (CPAM);
Fig. 11 is a schematic diagram of the Neck network structure after Neck reconstruction and addition of the small target detection layer;
Fig. 12 is a structure diagram of Dyhead;
Fig. 13 is a visualization of the bounding box function;
Fig. 14 is a WFS-IoU performance comparison chart;
Fig. 15 is a schematic diagram of a terminal device;
Fig. 16 is a schematic diagram of a readable storage medium.
In the figures: 200, terminal device; 210, memory; 211, RAM; 212, cache memory; 213, ROM; 214, program/utility; 215, program modules; 220, processor; 230, bus; 240, external device; 250, I/O interface; 260, network adapter; 300, program product.
Detailed Description
For the purpose of making apparent the objects, technical solutions and advantages of the present invention, the present invention will be further described in detail with reference to the following examples and the accompanying drawings, wherein the exemplary embodiments of the present invention and the descriptions thereof are for illustrating the present invention only and are not to be construed as limiting the present invention.
Example 1
The unmanned aerial vehicle small target detection method based on deep learning provided by the embodiment comprises the following steps:
1. Embedding a parallelized block perception attention module in the C2f module of YOLOv8, and enhancing the local-global feature fusion capability for small targets through multi-branch feature extraction and adaptive attention processing;
In small target detection tasks, targets are small and often have blurred contours and few pixels, so there is a significant risk of information loss during repeated downsampling, which greatly weakens feature extraction for small targets. This embodiment therefore adopts PPA, the parallelized block perception attention module from the infrared small target detection network HCF-Net, where it replaces the traditional convolution operations in the basic components of the encoder and decoder and mitigates the loss of key small-target information during downsampling.
The working principle of PPA in this embodiment is as follows:
(1) Multi-branch feature extraction. The main advantage of PPA is its multi-branch feature extraction strategy. As shown in Fig. 2, PPA uses parallel branches, each responsible for extracting features at a different scale and level, which helps capture the multi-scale features of targets and improves small target detection accuracy. The strategy comprises three parallel branches: a local branch, a global branch and a serial convolution branch. Given an input feature tensor $F$, a point-wise convolution first adjusts it to obtain $F'$. The three branches then compute $F_{local}$, $F_{global}$ and $F_{conv}$ respectively, and the three results are added to obtain $\tilde{F}$. The local and global branches are distinguished by the patch size parameter $p$, and both are implemented by aggregating and displacing non-overlapping patches in the spatial dimensions. In addition, this embodiment computes an attention matrix between the non-overlapping patches to achieve local and global feature extraction and interaction.
(2) Feature fusion and attention. After multi-branch feature extraction, an attention mechanism performs adaptive feature enhancement. The attention module consists of an efficient channel attention followed by a spatial attention: $\tilde{F}$ is processed sequentially by a one-dimensional channel attention map $M_c$ and a two-dimensional spatial attention map $M_s$. This process can be summarized as follows:
$$F_c = M_c(\tilde{F}) \otimes \tilde{F}, \qquad F_s = M_s(F_c) \otimes F_c, \qquad F'' = \delta(\mathcal{B}(F_s))$$
where $\otimes$ denotes element-wise multiplication, $F_c$ and $F_s$ are the features after channel and spatial selection, $\delta(\cdot)$ and $\mathcal{B}(\cdot)$ denote the rectified linear unit (ReLU) and batch normalization (BN) respectively, and $F''$ is the final output of PPA. The P-Bottleneck structure is shown in Fig. 3.
The PPA module is fused with the C2f module of YOLOv8 to obtain the C2f-PPA module, which improves the feature extraction and feature fusion capability of the YOLOv8 algorithm for small targets; the specific structure is shown in Fig. 4.
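The role of the patch size parameter p in the local and global branches can be sketched on a single-channel map. The example below is a deliberately simplified illustration: it uses mean pooling as the patch aggregation, which is an assumption, and it omits the displacement operation, the patch-level attention matrix and the learned layers of the actual module. The function name is hypothetical.

```python
def aggregate_patches(feature_map, p):
    """Average every non-overlapping p x p patch of a 2D feature map.

    Mean pooling stands in for the module's patch aggregation (assumption).
    """
    height, width = len(feature_map), len(feature_map[0])
    if height % p or width % p:
        raise ValueError("height and width must be divisible by p")
    pooled = []
    for top in range(0, height, p):
        row = []
        for left in range(0, width, p):
            patch = [feature_map[top + dy][left + dx]
                     for dy in range(p) for dx in range(p)]
            row.append(sum(patch) / (p * p))
        pooled.append(row)
    return pooled


feature_map = [[float(4 * r + c) for c in range(4)] for r in range(4)]
print(aggregate_patches(feature_map, 2))  # small p: four local summaries
print(aggregate_patches(feature_map, 4))  # large p: one global summary
```

A small p keeps several local summaries of the map, while a large p collapses it towards a single global summary, which mirrors how one parameter can separate the local branch from the global branch.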
2. Fusing a large separable kernel attention mechanism into the SPPF module of YOLOv8, and expanding the context perception range of multi-scale features by using spatial dilated convolution and separable kernel decomposition;
The original SPPF in YOLOv8 contains three successive pooling layers, and the outputs of all layers are concatenated to ensure multi-scale fusion. Compared with SPP, it reduces computational complexity and significantly improves speed while still producing adaptive-size output. However, this structure has difficulty integrating small, fine-grained feature information, which is often exactly the information that small targets depend on. The large separable kernel attention (LSKA) mechanism is therefore introduced and combined with SPPF, which significantly enhances the multi-scale feature extraction capability of the model.
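The effect of chaining three pooling layers can be checked with a one-dimensional sketch. It assumes stride-1 max pooling with a kernel of 5 and padding that never wins the maximum, which is how SPPF is commonly implemented but is not stated in this text; the function names are hypothetical.

```python
def max_pool_1d(values, kernel):
    """Stride-1 max pooling with 'same' output length.

    Positions outside the input are ignored, which is equivalent to
    padding with negative infinity.
    """
    radius = kernel // 2
    n = len(values)
    return [max(values[max(0, i - radius):min(n, i + radius + 1)])
            for i in range(n)]


def sppf_1d(values, kernel=5):
    """Return the four branches that SPPF would concatenate (1D analogue)."""
    y1 = max_pool_1d(values, kernel)
    y2 = max_pool_1d(y1, kernel)
    y3 = max_pool_1d(y2, kernel)
    return [values, y1, y2, y3]


signal = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]
branches = sppf_1d(signal)
# Two chained 5-wide pools see a 9-wide window, three see a 13-wide window.
print(branches[2] == max_pool_1d(signal, 9))
print(branches[3] == max_pool_1d(signal, 13))
```

The three chained pools therefore reproduce windows of 5, 9 and 13 at lower cost than three independent pools, but the window sizes are fixed, which is the limitation the attention mechanism is meant to address.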
The structure and function of LSKA in this embodiment comprise the following steps:
(1) Initial convolution layers. These layers extract the horizontal and vertical features of the input feature map respectively and generate a preliminary attention map, which helps the model focus on the important parts of the image.
(2) Spatial dilated convolution layers. After the preliminary attention map is obtained, LSKA applies spatial dilated convolutions with different dilation rates to extract further features. These layers cover a larger receptive field without increasing the computational cost and capture a wider range of context information; by operating in the horizontal and vertical directions separately, they process image features more finely and strengthen the model's understanding of spatial relationships in the image.
(3) Fusion and application of attention. After the convolution operations, LSKA fuses the resulting features through a final convolution layer to generate the final attention map. This attention map is multiplied element-wise with the original input feature map, that is, the attention mechanism is applied, so that every element of the original feature map is weighted by the corresponding attention value: important features are highlighted and unimportant ones are suppressed.
LSKA (large separable kernel attention) is a design that optimizes the attention module in convolutional neural networks. Its core idea is to decompose the 2D convolution kernel of a depthwise convolution layer into cascaded horizontal and vertical 1D kernels, which makes it practical to use a large depthwise kernel directly in the attention module. The larger kernel size strengthens the model's ability to capture long-range dependencies, while the separable form avoids the significant growth in parameter count and computational complexity that a full-size convolution would cause. The attention map is generated by capturing wide-ranging image context with large separable kernels and spatial dilated convolutions, and the original features are weighted by this attention map, which strengthens the network's focus on important features and improves model performance.
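The parameter saving from the separable decomposition can be made concrete with a small count of depthwise weights per channel. The sketch compares a full k×k kernel, a 2D decomposition into a (2d-1)×(2d-1) kernel plus a dilated kernel of side k/d, and the cascaded 1D version of both. Rounding k/d down, ignoring biases and the 1×1 convolution, and the example values k=21 and d=3 are all assumptions for illustration.

```python
def depthwise_weights_per_channel(k, d):
    """Count depthwise kernel weights per channel for three designs.

    k is the target kernel size and d the dilation rate; k // d rounds
    down (assumption). Biases and the 1x1 convolution are not counted.
    """
    small = 2 * d - 1   # side of the undilated depthwise kernel
    dilated = k // d    # side of the dilated depthwise kernel
    return {
        "full": k * k,
        "decomposed_2d": small * small + dilated * dilated,
        "separable_1d": 2 * small + 2 * dilated,
    }


print(depthwise_weights_per_channel(21, 3))
```

For k=21 and d=3 the count drops from 441 weights for the full kernel to 74 for the 2D decomposition and 24 for the cascaded 1D kernels, which is why a large receptive field remains affordable.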
In this embodiment, the LSKA attention mechanism is placed between the Concat and Conv layers of the SPPF module, forming a new SPPF-LSKA module in which LSKA weights the multi-scale feature maps. Specifically, LSKA first calculates spatial weights and channel weights for each feature map and then adjusts the feature map according to these weights, so that the model focuses more on target-related features and detection accuracy improves. The LSKA structure is shown in Fig. 5 and the SPPF-LSKA structure in Fig. 6.
The output of LSKA is given by:
$$A^C = W_{1 \times 1} * Z^C$$
$$T^C = A^C \otimes F^C$$
where $Z^C$ is the output of the depthwise convolution, obtained by convolving a kernel of size $k \times k$ with the input feature map; $2d-1$ and $k/d$ are the sizes of the convolution kernels; $*$ denotes convolution; and $\otimes$ denotes the Hadamard product.
Each channel $C$ of $F$ is convolved with the corresponding channel of the kernel $W$. The output $T^C$ of LSKA is the Hadamard product of the attention map $A^C$ and the input feature map $F^C$.
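The decomposition idea behind LSKA can be illustrated with a small pure-Python sketch. It is not the LSKA module itself: the helper names (`conv2d_valid`, `separable_conv2d_valid`, `weights_per_channel`), the toy 6 × 6 input and the 3-tap kernels are assumptions chosen for illustration. It shows that cascading a horizontal 1D kernel and a vertical 1D kernel gives the same result as the 2D kernel formed by their outer product, while storing 2k rather than k² weights per channel.

```python
def conv2d_valid(img, kernel):
    """Plain 2D cross-correlation with 'valid' padding on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(img) - kh + 1, len(img[0]) - kw + 1
    return [[sum(img[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def separable_conv2d_valid(img, row_kernel, col_kernel):
    """Cascade a horizontal 1 x k pass and a vertical k x 1 pass."""
    horizontal = conv2d_valid(img, [row_kernel])
    return conv2d_valid(horizontal, [[v] for v in col_kernel])


def weights_per_channel(k, separable):
    """Weights stored per channel for one k x k depthwise kernel."""
    return 2 * k if separable else k * k


# Toy single-channel input and hypothetical 1D kernels (illustrative values only).
img = [[(3 * i + 5 * j) % 7 for j in range(6)] for i in range(6)]
row_kernel = [1, 2, 1]
col_kernel = [1, 0, -1]

# The equivalent 2D kernel is the outer product of the two 1D kernels.
full_kernel = [[c * r for r in row_kernel] for c in col_kernel]

same_result = (conv2d_valid(img, full_kernel)
               == separable_conv2d_valid(img, row_kernel, col_kernel))
```

The equivalence holds only for kernels that factor into an outer product, which is the restriction a separable design accepts in exchange for the smaller weight count.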
3. The CARAFE dynamic up-sampling is used for replacing the original nearest neighbor interpolation, and detail information loss caused by high-rate down-sampling is reduced through content perception kernel prediction and feature recombination;
YOLOv8 performs up-sampling with nearest-neighbor interpolation, in which the gray value of each output pixel is set equal to the gray value of the closest pixel in the original feature map. This method considers only the relationship between adjacent pixels and ignores the contextual semantic information of the feature map. It therefore produces discontinuous gray values, so the up-sampled image shows jagged edges and loses detail, fine-grained feature expression is weak, and the feature extraction part of the algorithm lacks detail, all of which hinder small target detection. This embodiment introduces the CARAFE up-sampling method, which dynamically adjusts the up-sampling kernel to suit different up-sampling requirements, so that feature information is better retained and extracted and information loss is reduced while computational efficiency is maintained. In addition, fusing feature maps of different scales provides richer context information, which strengthens the feature expression capability of the network and improves the perception range and positioning accuracy of the algorithm. The specific implementation flow is shown in FIG. 7.
In the up-sampling kernel prediction part, the input feature map of size $H \times W \times C$ is first compressed by a $1 \times 1$ Conv module to $C_m$ channels. A convolution then changes the number of output channels to $\sigma^2 k_{up}^2$, where $\sigma$ is the up-sampling rate and $k_{up} \times k_{up}$ is the up-sampling kernel size. Because a different up-sampling kernel is used for each position of the output feature map, the predicted up-sampling kernel has size $\sigma H \times \sigma W \times k_{up} \times k_{up}$. Finally, Softmax normalization is applied to each up-sampling kernel so that its weights sum to 1.
In the feature reassembly module, each position in the output feature map is mapped back to the input feature map, the $k_{up} \times k_{up}$ region centered on the mapped position is taken out, and a dot product is computed between this region and the up-sampling kernel predicted for that point. Different channels at the same position share the same up-sampling kernel. For a target position $l'$ in the output feature map and the square region $N(\chi_l, k_{up})$ centered at the corresponding position $l = (i, j)$ of the input feature map, the reassembly formula is:
$$\chi'_{l'} = \sum_{n=-r}^{r} \sum_{m=-r}^{r} W_{l'}(n, m) \cdot \chi_{(i+n,\, j+m)}$$
where $r = \lfloor k_{up}/2 \rfloor$ and $W_{l'}$ is the up-sampling kernel predicted for $l'$. That is, $N(\chi_l, k_{up})$ is combined with the up-sampling kernel of the point by a dot product, and every channel at the same position uses the same kernel. Each pixel in $N(\chi_l, k_{up})$ contributes differently to the up-sampled value, with the contribution determined by the content of the feature map. After the up-sampling operation, an output feature map of shape $\sigma H \times \sigma W \times C$ is obtained.
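The reassembly step can be sketched in pure Python for a single channel. This is a minimal illustration under stated assumptions, not the embodiment's implementation: the function names, the zero padding at the borders, the nested-list layout and the way kernel logits are passed in (one flat list of $k_{up}^2$ values per output position, standing in for the output of the kernel prediction convolution) are all hypothetical.

```python
import math


def softmax(logits):
    """Normalize a list of logits so the resulting weights sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def carafe_reassemble(feature, kernel_logits, sigma, k_up):
    """Content-aware reassembly of one H x W channel into sigma*H x sigma*W.

    kernel_logits[oi][oj] holds the k_up * k_up raw kernel values predicted
    for output position (oi, oj); neighbors outside the map count as zero.
    """
    h, w = len(feature), len(feature[0])
    r = k_up // 2
    out = [[0.0] * (sigma * w) for _ in range(sigma * h)]
    for oi in range(sigma * h):
        for oj in range(sigma * w):
            i, j = oi // sigma, oj // sigma  # map output position back to input
            weights = softmax(kernel_logits[oi][oj])
            acc = 0.0
            for n in range(-r, r + 1):
                for m in range(-r, r + 1):
                    y, x = i + n, j + m
                    if 0 <= y < h and 0 <= x < w:
                        acc += weights[(n + r) * k_up + (m + r)] * feature[y][x]
            out[oi][oj] = acc
    return out


# Hypothetical demo: a kernel sharply peaked at its center behaves like
# nearest-neighbor up-sampling; other kernels blend the neighborhood.
feature = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
sigma, k_up = 2, 3
peaked = [0.0] * (k_up * k_up)
peaked[(k_up * k_up) // 2] = 50.0
kernel_logits = [[peaked for _ in range(sigma * 3)] for _ in range(sigma * 3)]
upsampled = carafe_reassemble(feature, kernel_logits, sigma, k_up)
```

In a multi-channel feature map the same normalized kernel would be reused for every channel at a given position, which is why the sketch only needs one channel to show the mechanism.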
4. Adopting Soft-NMS to replace NMS, inhibiting redundant detection frames of dense small targets based on Gaussian confidence attenuation strategy, and improving recall rate in a shielding scene;
YOLOv8 is an anchor-free model: it predicts the center of an object directly rather than the offset from a known anchor box. Because this reduces the number of box predictions, it speeds up non-maximum suppression (NMS), a costly post-processing step in inference. However, accuracy is still not fully guaranteed for dense, occluded small targets.
This embodiment introduces Soft-NMS, an improvement on the traditional NMS algorithm. Instead of discarding overlapping detections, Soft-NMS suppresses them by lowering their confidence, which improves detection accuracy. A hyperparameter controls, through a Gaussian function, how strongly the confidence score of an overlapping detection is discounted.
Soft-NMS differs from NMS in how redundant boxes are handled. NMS deletes redundant boxes outright, whereas Soft-NMS reduces their confidence and then thresholds the candidate list again. Because redundant boxes only have their confidence reduced, any box whose confidence is not 0 remains in the candidate list; when the list is thresholded again, boxes below the threshold are deleted. Under this algorithm, an overlapping box is treated as a separate nearby object only if the model's confidence in it is high enough.
Specifically, Soft-NMS takes the IoU of two anchors and applies a penalty, modifying the confidence of the lower-scoring anchor according to the penalty: when the IoU is 0 there is no penalty, and the penalty grows as the IoU increases. The calculation logic of NMS and Soft-NMS is as follows:
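For reference, the two rescoring rules are commonly written as below in the Soft-NMS literature. The symbols are assumptions that are not defined in this embodiment: $M$ is the current highest-scoring box, $b_i$ is a remaining box with score $s_i$, $N_t$ is the NMS overlap threshold, and $\sigma$ is the Gaussian decay hyperparameter.

```latex
% Hard NMS: boxes that overlap M by at least N_t are removed outright
s_i = \begin{cases}
  s_i, & \mathrm{IoU}(M, b_i) < N_t \\
  0,   & \mathrm{IoU}(M, b_i) \ge N_t
\end{cases}

% Soft-NMS with Gaussian decay: the score shrinks smoothly as the overlap grows
s_i = s_i \, e^{-\mathrm{IoU}(M, b_i)^2 / \sigma}
```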
the specific steps of Soft-NMS are as follows:
(1) Sorting the anchors according to the confidence scores;
(2) Adding an anchor with the highest score into the output list, and deleting the anchor from the bounding box list;
(3) Calculating IoU of the anchor with the highest score and other anchors in the step (2);
(4) If IoU exceeds the set threshold, decreasing the confidence score of that anchor (that is, the anchor other than the highest-scoring one from step (2)), and removing anchors whose confidence falls below the threshold;
(5) The above steps are repeated until the bounding box list is emptied.
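The steps above can be sketched in pure Python as follows. This is a minimal sketch rather than the embodiment's code: it assumes boxes in `(x1, y1, x2, y2)` form, uses the Gaussian decay variant so that every overlapping box is decayed continuously instead of only those above an IoU threshold, and the parameter names `sigma` and `score_threshold` with their default values are hypothetical.

```python
import math


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Gaussian Soft-NMS; returns (box, score) pairs in selection order."""
    candidates = list(zip(boxes, scores))
    kept = []
    while candidates:
        # Steps (1)-(2): move the highest-scoring candidate to the output list.
        best = max(range(len(candidates)), key=lambda idx: candidates[idx][1])
        best_box, best_score = candidates.pop(best)
        kept.append((best_box, best_score))
        # Steps (3)-(4): decay the remaining scores by overlap, then threshold.
        rescored = []
        for box, score in candidates:
            overlap = iou(best_box, box)
            score *= math.exp(-(overlap ** 2) / sigma)
            if score >= score_threshold:
                rescored.append((box, score))
        candidates = rescored  # step (5): repeat until the list is empty
    return kept


# Hypothetical demo: two heavily overlapping boxes and one distant box.
demo_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
demo_scores = [0.9, 0.8, 0.7]
kept = soft_nms(demo_boxes, demo_scores)
```

In this demo the second box is not deleted as it would be under hard NMS with a typical threshold; it is kept with a reduced score and ranked after the distant box.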
5. Reconstructing Neck parts of a backbone network, constructing a multi-scale feature pyramid through a scale sequence feature fusion module, a three-feature encoder and a channel position attention mechanism, and additionally arranging a P2 detection layer to capture the low-resolution features of the bottom layer;
ASF-YOLO, a YOLO framework based on attentional scale sequence fusion, combines spatial and scale features. Built on the YOLO segmentation framework, it uses a scale sequence feature fusion (SSFF) module to enhance the network's multi-scale information extraction and a three-feature encoder (TFE) module to fuse feature maps of different scales and add detail information. Finally, a channel and position attention mechanism (CPAM) integrates the SSFF and TFE modules and focuses on the informative channels and spatial positions related to small objects, improving small object detection performance.
In this embodiment, the SSFF module provides complementary information for small object segmentation. The TFE module captures local fine details of small objects: it separates large, medium and small-size features, adds the larger-size feature maps, and then performs feature amplification to enhance detailed feature information. Combining the local and global feature information yields a more accurate segmentation map. The CPAM consists of two parts. The first part is a channel attention network, which receives input 1 from the TFE and extracts the representative feature information contained in different channels. The second part is a position attention network, which receives the output of the channel attention network together with input 2 from the SSFF and superimposes them to introduce position information. In this way, the CPAM integrates different attention mechanisms, makes full use of channel and position information, and improves target recognition performance.
As shown in FIG. 8, the scale sequence feature fusion SSFF corresponds to the ScalSeq structure: the P3, P4 and P5 feature maps are normalized to the same size by up-sampling and then stacked together as the input to a 3D convolution, which fuses the multi-scale features.
The three-feature encoder TFE corresponds to the Zoom_cat structure. The Zoom_cat module splits large, medium and small features, processes the large-scale feature map, and amplifies the features to enrich detailed feature information. The structure of the module is shown in FIG. 9, where C is the number of channels and S is the feature map size; each Zoom_cat takes three feature maps of different sizes as inputs.
As shown in FIG. 10, the channel and position attention mechanism CPAM corresponds to the asf_ attention structure. It extracts representative information by fusing the channel attention and position attention parts; the position attention network receives as input the superposition of the channel attention network output and SSFF input 2.
Small objects typically occupy only a few pixels in an image and are easily missed or misclassified. To improve YOLOv8's effectiveness in small target detection, a P2 small target detection layer is introduced. The P2 detection head has a resolution of 160 × 160 pixels, which corresponds to only two down-sampling operations in the backbone network, so it retains richer low-level feature information about targets and benefits the detection of tiny targets.
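The 160 × 160 figure follows from simple stride arithmetic, sketched below. The 640 × 640 input resolution and the down-sampling counts for P3 to P5 are assumptions based on common YOLOv8 settings and are not stated in this passage; the helper name is hypothetical.

```python
def feature_map_size(input_size, num_downsamples):
    """Side length of a feature map after repeated stride-2 down-sampling."""
    return input_size // (2 ** num_downsamples)


INPUT_SIZE = 640  # assumed input resolution, not given in the text

# Number of stride-2 down-sampling operations assumed for each detection level.
downsamples = {"P2": 2, "P3": 3, "P4": 4, "P5": 5}
sizes = {name: feature_map_size(INPUT_SIZE, n) for name, n in downsamples.items()}
```

Under these assumptions a 16 × 16 pixel object covers about 4 × 4 cells on the P2 map but only 2 × 2 cells on P3, which is the motivation for adding the extra head.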
The SSFF module, the TFE module and the CPAM are fused with the YOLOv8 structure, and a P2 layer that has passed through fewer convolutions is added; its larger feature map favors the recognition of small targets. The result is the ASF-P2 structure. In the reconstructed and optimized Neck, the output feature map is successively fused with the P3, P4 and P5 feature maps through down-sampling operations, which supplies more position information to those feature maps and improves the overall prediction accuracy of the network. The improved network structure is shown in FIG. 11.
The asf_ attention module can be inserted at two positions in the algorithm, layer 25 and layer 30. Its performance was therefore tested at both positions, giving four configurations: ASF-P2, ASF-P2 (25), ASF-P2 (30) and ASF-P2 (25, 30). Based on the results in Table 1, the configuration with the asf_ attention module at layer 25, ASF-P2 (25), is adopted as the final algorithm structure.
Table 1. Performance of the ASF-P2 structure for different attention mechanism positions
6. Combining a dynamic target detection head with three perception modules of scale, space and task, and enhancing the regression degree of a boundary frame of a small target through global pooling, deformation convolution and channel modeling;
Small objects may be obscured by other objects or by the background, resulting in partial or complete occlusion, which makes the target hard to observe and highlights the importance of the detection algorithm's perception capability. To address this problem, this embodiment introduces the Dynamic Head, abbreviated DyHead, which combines the target detection head with attention mechanisms.
DyHead integrates several self-attention mechanisms across the feature levels (scale-aware), the spatial positions (spatial-aware) and the output channels (task-aware), which improves the representation capability of the target detection head. This strengthens the model's ability to describe targets and makes small target detection more flexible and accurate.
The key principle of the DyHead algorithm is a unified attention mechanism over the feature map: the scale-aware module $\pi_L$, the spatial-aware module $\pi_S$ and the task-aware module $\pi_C$ are applied in series to give the detection head comprehensive attention. $\pi_L$ performs global pooling, $1 \times 1$ convolution, ReLU activation and Hard Sigmoid activation; AdaptiveAvgPool2d performs global pooling over $H \times W$ to obtain the average value of the feature map, and the convolution integrates all channels into a single number. $\pi_S$ performs a deformable convolution v2 operation (offset plus feature amplitude modulation). $\pi_C$ uses a two-layer fully connected neural network for channel modeling. A schematic diagram is shown in FIG. 12.
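The scale-aware branch (global pooling, 1 × 1 convolution, ReLU, Hard Sigmoid) can be sketched in pure Python as below. This is a simplified illustration under stated assumptions: each level is a nested list of shape C × H × W, the 1 × 1 convolution is represented by one hypothetical weight per channel plus a bias, the Hard Sigmoid uses the form max(0, min(1, (x + 1) / 2)), and the cross-level fusion, deformable convolution and task-aware parts of the head are omitted.

```python
def hard_sigmoid(x):
    """Hard sigmoid in the form max(0, min(1, (x + 1) / 2))."""
    return max(0.0, min(1.0, (x + 1.0) / 2.0))


def scale_aware_weight(level, conv_weights, conv_bias=0.0):
    """Attention weight for one pyramid level of shape C x H x W."""
    # Global average pooling over H x W, one mean per channel.
    channel_means = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                     for ch in level]
    # A 1 x 1 convolution with one output channel mixes the channels into one number.
    mixed = sum(w * m for w, m in zip(conv_weights, channel_means)) + conv_bias
    # ReLU followed by hard sigmoid.
    return hard_sigmoid(max(0.0, mixed))


def apply_scale_attention(levels, conv_weights, conv_bias=0.0):
    """Rescale every level by its own scale-aware attention weight."""
    out = []
    for level in levels:
        a = scale_aware_weight(level, conv_weights, conv_bias)
        out.append([[[a * v for v in row] for row in ch] for ch in level])
    return out


# Hypothetical demo with two levels of 2 channels and 2 x 2 positions each.
level_a = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
level_b = [[[2.0, 2.0], [2.0, 2.0]], [[2.0, 2.0], [2.0, 2.0]]]
weights = [0.5, 0.5]
weighted = apply_scale_attention([level_a, level_b], weights)
```

Because the hard sigmoid is applied after ReLU in this form, the weight always lies between 0.5 and 1, so a level is attenuated rather than switched off.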
7. Designing WFS-IoU loss functions fusing Wise-IoU, focaler-IoU and Shape-IoU, and optimizing the positioning of a small target boundary box by combining a dynamic focusing mechanism, sample difficulty grading and Shape sensitivity factors;
In conventional object detection tasks, a commonly used loss is based on IoU (Intersection over Union), which measures the degree of overlap between the predicted box and the ground-truth box. However, IoU has shortcomings, such as poor performance on small objects and insensitivity to highly compressed bounding boxes. To address these problems, this embodiment designs a new loss function for small targets, WFS-IoU, which combines the characteristics of Wise-IoU, Focaler-IoU and Shape-IoU.
(1) Wise-IoU. Wise-IoU (WIoU) adds a gradient gain strategy on top of the Complete-IoU (CIoU) loss of YOLOv8, which uses aspect ratio similarity to evaluate the quality of an anchor box. According to the gradient gain strategy, WIoU is divided into WIoU-v1, WIoU-v2 and WIoU-v3. WIoU-v1 avoids interfering too much with training when the anchor box and the target box coincide well; it builds a distance attention mechanism from a distance metric and, combined with aspect ratio similarity, forms a bounding box loss function with a two-layer attention mechanism. The core idea of Wise-IoU is to evaluate the quality of a sample by the outlier degree of its anchor box, that is, the position of the sample relative to other samples in the feature space, and to dynamically adjust the weights of the terms in the loss function according to that quality.
WIoU-v1 is calculated as follows:
$$L_{WIoUv1} = R_{WIoU} \, L_{IoU}$$
$$R_{WIoU} = \exp\left(\frac{(x - x_{gt})^2 + (y - y_{gt})^2}{W_g^2 + H_g^2}\right)$$
$$L_{IoU} = 1 - IoU = 1 - \frac{W_i H_i}{wh + w_{gt} h_{gt} - W_i H_i}$$
where $L_{IoU}$ is the original bounding box loss function; $w, h$ are the width and height of the predicted box; $w_{gt}, h_{gt}$ are the width and height of the ground-truth box; $W_i, H_i$ are the width and height of the overlap between the predicted box and the ground-truth box; $x, y$ are the coordinates of the predicted center point; $x_{gt}, y_{gt}$ are the coordinates of the target center point; $W_g, H_g$ are the width and height of the minimum enclosing box; and $R_{WIoU}$ is the distance attention function. FIG. 13 shows a visualization of the bounding box function.
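A worked example may help make the roles of the terms concrete. The sketch below follows the commonly published form of WIoU-v1 (an IoU loss multiplied by an exponential distance attention term) and should be read as an illustration under that assumption; the function name, the `(x_center, y_center, width, height)` box layout and the sample boxes are hypothetical.

```python
import math


def wiou_v1(pred, target):
    """Return (L_IoU, R_WIoU, L_WIoUv1) for boxes given as (xc, yc, w, h)."""
    x, y, w, h = pred
    x_gt, y_gt, w_gt, h_gt = target

    # Width and height of the overlap region (zero when the boxes are disjoint).
    w_i = max(0.0, min(x + w / 2, x_gt + w_gt / 2) - max(x - w / 2, x_gt - w_gt / 2))
    h_i = max(0.0, min(y + h / 2, y_gt + h_gt / 2) - max(y - h / 2, y_gt - h_gt / 2))
    inter = w_i * h_i
    union = w * h + w_gt * h_gt - inter
    l_iou = 1.0 - inter / union

    # Width and height of the minimum box enclosing both boxes.
    w_g = max(x + w / 2, x_gt + w_gt / 2) - min(x - w / 2, x_gt - w_gt / 2)
    h_g = max(y + h / 2, y_gt + h_gt / 2) - min(y - h / 2, y_gt - h_gt / 2)

    # Distance attention: grows with the normalized center distance.
    r_wiou = math.exp(((x - x_gt) ** 2 + (y - y_gt) ** 2) / (w_g ** 2 + h_g ** 2))
    return l_iou, r_wiou, r_wiou * l_iou


# Hypothetical demo: a 4 x 4 prediction shifted one unit from its target.
l_iou, r_wiou, l_wiou = wiou_v1((5.0, 5.0, 4.0, 4.0), (6.0, 5.0, 4.0, 4.0))
```

For the shifted pair the attention term is slightly above 1, so the loss is amplified relative to the plain IoU loss; for identical boxes the term is exactly 1 and the loss is 0.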
WIoU-v2 and WIoU-v3 build on WIoU-v1 by adding a gradient-gain-based calculation and a focusing mechanism. WIoU-v2 constructs a monotonic focusing coefficient that regulates the weight of each anchor box.
WIoU-v2 is calculated as follows:
WIoU-v3 describes the quality of an anchor box by its outlier degree. A small outlier degree indicates a high-quality anchor box, which is assigned a small gradient gain; an anchor box with a large outlier degree is also assigned a small gradient gain, so that low-quality samples have less influence on the gradient. This yields a dynamic, non-monotonic focusing coefficient.
WIoU-v3 is calculated as follows:
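For reference, the v2 and v3 losses are commonly written as below in the Wise-IoU literature. The symbols are assumptions with respect to this embodiment: $L_{WIoUv1}$ is the v1 loss, $L_{IoU}^{*}$ is the IoU loss treated as a constant during back-propagation, $\overline{L_{IoU}}$ is its running mean, $\gamma$ is the focusing exponent, $\beta$ is the outlier degree, and $\alpha$ and $\delta$ are hyperparameters.

```latex
% WIoU-v2: monotonic focusing coefficient, normalized by the running mean of L_IoU
L_{WIoUv2} = \left( \frac{L_{IoU}^{*}}{\overline{L_{IoU}}} \right)^{\gamma} L_{WIoUv1},
\qquad \gamma > 0

% WIoU-v3: outlier degree beta and non-monotonic focusing coefficient r
\beta = \frac{L_{IoU}^{*}}{\overline{L_{IoU}}} \in [0, +\infty), \qquad
r = \frac{\beta}{\delta \, \alpha^{\beta - \delta}}, \qquad
L_{WIoUv3} = r \, L_{WIoUv1}
```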
(2) Focaler-IoU. Bounding box regression plays a crucial role in object detection, and localization accuracy depends largely on the bounding box regression loss. Prior work improves regression performance by exploiting the geometric relationship between boxes but ignores the influence of the distribution of easy and hard samples on regression. Focaler-IoU improves detector performance across different detection tasks by focusing on regression samples of different difficulty.
To focus on different regression samples in different detection tasks, the IoU loss is reconstructed with a linear interval mapping, which improves bounding box regression. The formula is as follows:
$$IoU^{Focaler} = \begin{cases} 0, & IoU < d \\ \dfrac{IoU - d}{u - d}, & d \le IoU \le u \\ 1, & IoU > u \end{cases}$$
where $IoU^{Focaler}$ is the reconstructed Focaler-IoU, $IoU$ is the original IoU value, and $[d, u] \in [0, 1]$. Adjusting $d$ and $u$ makes $IoU^{Focaler}$ focus on different regression samples. The loss is defined as follows:
$$L_{Focaler\text{-}IoU} = 1 - IoU^{Focaler}$$
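The linear interval mapping can be written as a short Python function. This is a minimal sketch: the function names and the default values of `d` and `u` are hypothetical, and it assumes `d < u`.

```python
def focaler_iou(iou, d=0.0, u=0.95):
    """Map IoU linearly from the interval [d, u] onto [0, 1], clamping outside it."""
    if iou < d:
        return 0.0
    if iou > u:
        return 1.0
    return (iou - d) / (u - d)


def focaler_iou_loss(iou, d=0.0, u=0.95):
    """Focaler-IoU loss: 1 minus the remapped IoU."""
    return 1.0 - focaler_iou(iou, d, u)


# Hypothetical demo with d = 0.2 and u = 0.8.
low = focaler_iou(0.1, 0.2, 0.8)    # below d: clamped to 0
mid = focaler_iou(0.5, 0.2, 0.8)    # halfway through the interval
high = focaler_iou(0.9, 0.2, 0.8)   # above u: clamped to 1
```

Samples outside $[d, u]$ receive a constant value and therefore no gradient from this term, so the choice of interval decides which difficulty range the regression concentrates on.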
(3) Shape-IoU. Research on bounding box regression has found that the shape and scale of the bounding box itself influence the regression result. Because small target detection places high demands on IoU with respect to box size and shape, the Shape-IoU method is introduced: it computes the loss with attention to the shape and scale of the box itself, making bounding box regression more accurate.
When calculating the loss, Shape-IoU first computes the area of overlap between the ground-truth and predicted boxes and then the area of their union. The ratio of the overlap area to the union area (the IoU value) lies between 0 and 1 and represents the degree of overlap of the two bounding boxes. The IoU value is finally converted into a loss value (for example, 1 - IoU) that guides model training and optimization.
Shape-IoU is defined by the following formula:
$$L_{ShapeIoU} = 1 - IoU + distance^{shape} + 0.5 \times \Omega^{shape}$$
where $scale$ is a scaling factor related to the size of the objects in the dataset, $ww$ and $hh$ are the weight coefficients in the horizontal and vertical directions, respectively, and $\theta$ is the factor that controls the degree of normalization of the shape loss $\Omega^{shape}$.
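For reference, the individual terms are commonly written as below in the Shape-IoU literature. These expressions are taken from that literature and are assumptions with respect to this embodiment; $(x, y)$ and $(x_{gt}, y_{gt})$ denote the centers of the predicted and ground-truth boxes, $w, h$ and $w_{gt}, h_{gt}$ their widths and heights, and $c$ the diagonal length of the minimum enclosing box.

```latex
% Shape-dependent weights derived from the ground-truth box
ww = \frac{2\,(w_{gt})^{scale}}{(w_{gt})^{scale} + (h_{gt})^{scale}}, \qquad
hh = \frac{2\,(h_{gt})^{scale}}{(w_{gt})^{scale} + (h_{gt})^{scale}}

% Weighted, normalized center distance
distance^{shape} = hh \cdot \frac{(x - x_{gt})^2}{c^2} + ww \cdot \frac{(y - y_{gt})^2}{c^2}

% Shape cost built from the relative width and height differences
\Omega^{shape} = \sum_{t \in \{w, h\}} \left( 1 - e^{-\omega_t} \right)^{\theta}, \qquad
\omega_w = hh \cdot \frac{|w - w_{gt}|}{\max(w, w_{gt})}, \quad
\omega_h = ww \cdot \frac{|h - h_{gt}|}{\max(h, h_{gt})}
```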
(4) WFS-IoU. The WFS-IoU of this embodiment combines the dynamic focusing mechanism of Wise-IoU, the ability of Focaler-IoU to focus on regression samples of different difficulty, and the ability of Shape-IoU to focus on the shape and scale of the bounding box itself.
The core formula of WFS-IoU is as follows:
$$L_{Wise\text{-}Focaler\text{-}ShapeIoU} = r \times L_{ShapeIoU} + IoU - IoU^{Focaler}$$
where $r$ is the dynamic focusing coefficient of Wise-IoU, $L_{ShapeIoU}$ is the Shape-IoU loss, $IoU$ is the intersection over union of the predicted bounding box and the ground-truth bounding box, and $IoU^{Focaler}$ is the reconstructed Focaler-IoU.
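The core formula can be evaluated directly once its three ingredients are known. The sketch below is illustrative only: it takes the focusing coefficient `r`, the Shape-IoU loss and the IoU as already computed inputs, and the function names, default interval bounds and sample values are hypothetical.

```python
def focaler_iou(iou, d, u):
    """Linear interval mapping of IoU from [d, u] onto [0, 1] (assumes d < u)."""
    if iou < d:
        return 0.0
    if iou > u:
        return 1.0
    return (iou - d) / (u - d)


def wfs_iou_loss(r, shape_iou_loss, iou, d=0.0, u=0.95):
    """Combine the terms as r * L_ShapeIoU + IoU - IoU_Focaler."""
    return r * shape_iou_loss + iou - focaler_iou(iou, d, u)


# Hypothetical values for one predicted box.
loss_default = wfs_iou_loss(r=1.2, shape_iou_loss=0.5, iou=0.6)
# With d = 0 and u = 1 the Focaler term equals IoU, leaving only r * L_ShapeIoU.
loss_identity = wfs_iou_loss(r=1.2, shape_iou_loss=0.5, iou=0.6, d=0.0, u=1.0)
```

The second call shows that the Focaler correction vanishes when the interval is the full range, so the interval bounds act as the switch for difficulty-based reweighting.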
The WIoU-v1, WIoU-v2 and WIoU-v3 versions of WFS-IoU were tested on the YOLOv8s model and verified on the VisDrone2019 dataset. The performance comparison is shown in Table 2.
Table 2. Performance of the WFS-IoU versions
As shown in FIG. 14, the WFS-IoU-v2 strategy performs best: its precision, mAP@0.5, mAP@0.5:0.95 and other performance indicators are optimal, so WFS-IoU-v2 is adopted as the loss function version.
Example 2
As shown in fig. 15, based on embodiment 1, the present embodiment proposes a terminal device of a deep learning-based unmanned aerial vehicle small target detection method, where the terminal device 200 includes at least one memory 210, at least one processor 220, and a bus 230 connected to different platform systems.
Memory 210 may include readable media in the form of volatile memory, such as RAM211 and/or cache memory 212, and may further include ROM213.
The memory 210 further stores a computer program that can be executed by the processor 220, so that the processor 220 carries out the deep learning-based unmanned aerial vehicle small target detection method of any embodiment of the present application. The specific implementation is consistent with the implementation and technical effects described in the method embodiment and is not repeated here. Memory 210 may also include a program/utility 214 having a set (at least one) of program modules 215 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Accordingly, the processor 220 may execute the computer programs described above, as well as the program/utility 214.
Bus 230 may represent one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures.
Terminal device 200 can also communicate with one or more external devices 240, such as a keyboard, pointing device, bluetooth device, etc., as well as one or more devices capable of interacting with the terminal device 200, and/or with any device (e.g., router, modem, etc.) that enables the terminal device 200 to communicate with one or more other computing devices. Such communication may occur through the I/O interface 250. Also, terminal device 200 can communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through network adapter 260. Network adapter 260 may communicate with other modules of terminal device 200 via bus 230. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with terminal device 200, including, but not limited to, microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, and data backup storage platforms, among others.
Example 3
This embodiment provides a computer-readable storage medium for the deep learning-based unmanned aerial vehicle small target detection method. The computer-readable storage medium stores instructions that, when executed by a processor, implement any of the deep learning-based unmanned aerial vehicle small target detection methods described above. The specific implementation is consistent with the implementation and technical effects described in the method embodiment and is not repeated here.
Fig. 16 shows a program product 300 provided by the present embodiment for implementing the above application, which may employ a portable compact disc read-only memory (CD-ROM) and comprise program code, and may be run on a terminal device, such as a personal computer. However, the program product 300 of the present invention is not limited thereto, and in the present embodiment, the readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. Program product 300 may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of a readable storage medium include an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable storage medium may include a data signal propagated in baseband or as part of a carrier wave, with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable storage medium may also be any readable medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing. Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
The foregoing has shown and described the basic principles and main features of the present invention and the advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and that the above embodiments and descriptions are merely illustrative of the principles of the present invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, which is defined in the appended claims. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (9)

1.基于深度学习的无人机小目标检测方法,其特征在于,包括以下步骤:1. A method for detecting small targets on drones based on deep learning, characterized by comprising the following steps: S1、在YOLOv8的C2f模块中嵌入并行化区块感知注意力模块,通过多分支特征提取与自适应注意力处理,增强小目标的局部-全局特征融合能力;S1. Embed a parallelized block-aware attention module in the C2f module of YOLOv8 to enhance the local-global feature fusion capability of small targets through multi-branch feature extraction and adaptive attention processing; S2、在YOLOv8的SPPF模块中融合大型可分离核注意力机制,利用空间扩张卷积与可分离核分解,扩展多尺度特征的上下文感知范围;S2. Integrate a large separable kernel attention mechanism into the SPPF module of YOLOv8, and use spatial dilated convolution and separable kernel decomposition to expand the contextual perception range of multi-scale features; S3、以CARAFE动态上采样替换原始最近邻插值,通过内容感知核预测与特征重组,减少高倍率下采样导致的细节信息丢失;S3, replace the original nearest neighbor interpolation with CARAFE dynamic upsampling, and reduce the loss of detail information caused by high-rate downsampling through content-aware kernel prediction and feature reorganization; S4、采用Soft-NMS替代NMS,基于高斯置信度衰减策略抑制密集小目标的冗余检测框,提升遮挡场景下的召回率;S4. Soft-NMS is used instead of NMS. Based on the Gaussian confidence decay strategy, it suppresses redundant detection boxes of dense small targets and improves the recall rate in occlusion scenarios. S5、重构主干网络的Neck部分,通过尺度序列特征融合模块、三特征编码器及通道位置注意机制构建多尺度特征金字塔,并增设P2检测层捕获底层分低辨率特征;S5, reconstruct the Neck part of the backbone network, build a multi-scale feature pyramid through the scale sequence feature fusion module, three-feature encoder and channel position attention mechanism, and add a P2 detection layer to capture the low-resolution features of the bottom layer; S6、将动态目标检测头与尺度、空间、任务三种感知模块相结合,通过全局池化、形变卷积、信道建模,增强小目标的边界框回归度;S6: Combine the dynamic object detection head with the scale, space, and task perception modules to enhance the bounding box regression of small objects through global pooling, deformable convolution, and channel modeling. 
S7、设计融合Wise-IoU、Focaler-IoU及Shape-IoU的WFS-IoU损失函数,结合动态聚焦机制、样本难度分级与形状敏感因子,优化小目标边界框的定位。S7. Design the WFS-IoU loss function that integrates Wise-IoU, Focaler-IoU and Shape-IoU, and combine the dynamic focusing mechanism, sample difficulty classification and shape sensitivity factor to optimize the positioning of small object bounding boxes. 2.根据权利要求1所述的基于深度学习的无人机小目标检测方法,其特征在于,步骤S1中,所述并行化区块感知注意力模块包括局部分支、全局分支和串行卷积分支,其中局部分支通过非重叠补丁聚合提取局部特征,全局分支通过位移操作捕捉长距离依赖,卷积分支通过串行卷积细化特征;三路分支的输出特征相加融合后,依次通过一维通道注意力机制和二维空间注意力机制进行自适应增强。2. The deep learning-based UAV small target detection method according to claim 1 is characterized in that in step S1, the parallelized block perception attention module includes a local branch, a global branch, and a serial convolution branch, wherein the local branch extracts local features by non-overlapping patch aggregation, the global branch captures long-range dependencies by displacement operations, and the convolution branch refines features by serial convolution; after the output features of the three branches are added and fused, they are adaptively enhanced by a one-dimensional channel attention mechanism and a two-dimensional spatial attention mechanism in sequence. 3.根据权利要求1所述的基于深度学习的无人机小目标检测方法,其特征在于,步骤S2中,所述大型可分离核注意力机制将标准卷积核分解为水平方向与垂直方向的可分离核,分别提取特征图的水平与垂直空间上下文信息,并结合不同扩张率的空间扩张卷积扩大感受野,通过动态权重分配机制对不同尺度的特征进行自适应融合。3. The deep learning-based UAV small target detection method according to claim 1 is characterized in that in step S2, the large separable kernel attention mechanism decomposes the standard convolution kernel into separable kernels in the horizontal and vertical directions, extracts the horizontal and vertical spatial context information of the feature map respectively, and combines spatial dilated convolution with different dilation rates to expand the receptive field, and adaptively fuses features of different scales through a dynamic weight allocation mechanism. 
4. The deep learning-based UAV small target detection method according to claim 1, characterized in that, in step S3, the CARAFE dynamic upsampling predicts an upsampling kernel that depends on the input feature content and uses it to perform a weighted reassembly of local feature regions; generating the upsampling kernel comprises feature map channel compression, kernel prediction, and Softmax normalization; the upsampling process preserves high-frequency detail information, and a feature reassembly module applies a dot product operation to the local region features.
5. The deep learning-based UAV small target detection method according to claim 1, characterized in that, in step S4, the Soft-NMS method attenuates the confidence of candidate boxes that overlap a high-confidence detection box, with the attenuation amplitude adjusted dynamically by a Gaussian function of the degree of overlap, and retains detection boxes whose confidence is higher than a preset threshold.
6. The deep learning-based UAV small target detection method according to claim 1, characterized in that, in step S5, the scale sequence feature fusion module normalizes feature maps with different downsampling ratios, concatenates them, and fuses multi-scale information through three-dimensional convolution; the three-feature encoder separates large-, medium-, and small-size features and then re-enlarges and fuses them to supplement low-level detail information; the channel position attention mechanism jointly modulates feature responses through channel weights and spatial weights; and the resolution of the P2 detection layer is 160×160 pixels.
7. The deep learning-based UAV small target detection method according to claim 1, characterized in that, in step S6, the dynamic target detection head comprises a scale perception module, a spatial perception module, and a task perception module, wherein the scale perception module dynamically adjusts channel weights through global pooling and fully connected layers, the spatial perception module captures target deformation features through deformable convolution, and the task perception module optimizes the weight allocation between the classification and regression tasks through fully connected layers.
8. The deep learning-based UAV small target detection method according to claim 1, characterized in that, in step S7, the WFS-IoU loss function reduces the gradient influence of low-quality samples through the dynamic focusing mechanism, distinguishes easy-to-detect targets from hard-to-detect targets through sample difficulty grading, and constrains the deviation in width-to-height ratio between the predicted box and the ground-truth box through the shape-sensitive factor, thereby improving small-target localization accuracy overall.
9. A deep learning-based UAV small target detection system, characterized in that it comprises a processor and a computer-readable storage medium, wherein the computer-readable storage medium stores machine-executable instructions which, when executed by the processor, implement the deep learning-based UAV small target detection method according to any one of claims 1 to 8.
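The Gaussian confidence attenuation recited in claim 5 can be illustrated with a minimal pure-Python sketch of Gaussian Soft-NMS. It is not the applicant's implementation: the function names, the `(x1, y1, x2, y2)` box layout, and the values `sigma=0.5` and `score_threshold=0.001` are illustrative assumptions that do not appear in the claims.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def soft_nms_gaussian(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Return (box_index, decayed_score) pairs in selection order.

    sigma and score_threshold are assumed, illustrative values.
    """
    current = list(scores)
    # Only boxes at or above the threshold take part at all.
    remaining = [i for i in range(len(boxes)) if current[i] >= score_threshold]
    kept = []
    while remaining:
        # Select the highest-confidence box that is still a candidate.
        best = max(remaining, key=lambda i: current[i])
        remaining.remove(best)
        kept.append((best, current[best]))
        # Decay the other candidates instead of deleting them:
        # the larger the overlap, the stronger the attenuation.
        for i in remaining:
            overlap = iou(boxes[best], boxes[i])
            current[i] *= math.exp(-(overlap ** 2) / sigma)
        # Drop candidates whose decayed confidence fell below the threshold.
        remaining = [i for i in remaining if current[i] >= score_threshold]
    return kept


if __name__ == "__main__":
    demo_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
    demo_scores = [0.9, 0.8, 0.7]
    for index, score in soft_nms_gaussian(demo_boxes, demo_scores):
        print(index, round(score, 4))
```

In the demo, the second box overlaps the first heavily, so its confidence is reduced rather than discarded outright. This is the behavior that distinguishes Soft-NMS from hard NMS in crowded scenes where small targets sit close together; raising `score_threshold` makes the filter stricter.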
CN202510621264.7A 2025-05-14 2025-05-14 Small target detection method for UAV based on deep learning Pending CN120635750A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510621264.7A CN120635750A (en) 2025-05-14 2025-05-14 Small target detection method for UAV based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202510621264.7A CN120635750A (en) 2025-05-14 2025-05-14 Small target detection method for UAV based on deep learning

Publications (1)

Publication Number Publication Date
CN120635750A (en) 2025-09-12

Family

ID=96968480

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510621264.7A Pending CN120635750A (en) 2025-05-14 2025-05-14 Small target detection method for UAV based on deep learning

Country Status (1)

Country Link
CN (1) CN120635750A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN121121578A (en) * 2025-11-17 2025-12-12 湖南盛鼎科技发展有限责任公司 Methods, equipment and storage media for drone aerial target detection
CN121281104A (en) * 2025-12-10 2026-01-06 国网江西省电力有限公司电力科学研究院 Cross-modal feature learning-based bird identification and monitoring method and system
CN121617078A (en) * 2026-02-03 2026-03-06 南昌市言诺科技有限公司 Methods, devices, equipment, and media for chemical instrument identification based on lightweight models

Similar Documents

Publication Publication Date Title
Hu et al. PolyBuilding: Polygon transformer for building extraction
CN113065558B (en) Lightweight small target detection method combined with attention mechanism
Raghavan et al. Optimized building extraction from high-resolution satellite imagery using deep learning
US12039440B2 (en) Image classification method and apparatus, and image classification model training method and apparatus
CN111210443B (en) Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
EP4404148A1 (en) Image processing method and apparatus, and computer-readable storage medium
JP6798183B2 (en) Image analyzer, image analysis method and program
CN115546500B (en) Infrared image small target detection method
CN114241274B (en) Small target detection method based on super-resolution multi-scale feature fusion
CN120635750A (en) Small target detection method for UAV based on deep learning
CN110807362A (en) Image detection method and device and computer readable storage medium
CN113297959A (en) Target tracking method and system based on corner attention twin network
CN120495641B (en) Unmanned aerial vehicle aerial photography small target detection method, system, medium and equipment based on RT-DETR
Kolluri et al. Intelligent multimodal pedestrian detection using hybrid metaheuristic optimization with deep learning model
CN109977978A (en) A kind of multi-target detection method, device and storage medium
CN114037640A (en) Image generation method and device
CN115810123A (en) Small target pest detection method based on attention mechanism and improved feature fusion
CN120318499A (en) UAV target detection method and electronic equipment based on cross-spatial frequency domain
CN120125618B (en) Unmanned aerial vehicle target tracking method and system based on mamba feature extraction
CN120953950A (en) SOC-YOLOv11 Optimization Computation Method for Complex Vehicle Inspection Scenarios
CN120726513A (en) Target detection method in UAV scenarios based on deep learning
CN120953787A (en) A Feature-Hierarchical Attention Fusion and Dynamic Optimization Method for Single-Stage Object Detection
CN118262258B (en) Ground environment image aberration detection method and system
CN119579514A (en) Power image target detection method and system based on feature fusion and attention
CN118840644A (en) Low-altitude unmanned aerial vehicle target detection model training method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: 610000 Sichuan Province, Chengdu City, Jinpu High-tech Industrial Park, Jinke East Road No. 50, Building 4, 8th Floor, Room 801

Applicant after: Sichuan Tengdun Liangyuan Intelligent Technology Co.,Ltd.

Applicant after: SICHUAN TENGDUN TECHNOLOGY Co.,Ltd.

Address before: 610000 Sichuan Province, Chengdu City, Jinpu High-tech Industrial Park, Jinke East Road No. 50, Building 4, 8th Floor, Room 801

Applicant before: Chengdu Liangyuan Technology Co.,Ltd.

Country or region before: China

Applicant before: SICHUAN TENGDUN TECHNOLOGY Co.,Ltd.