CN118365882A - Optical remote sensing image segmentation method based on VMamba model - Google Patents

Publication number
CN118365882A
Authority
CN
China
Prior art keywords
layer
block
model
remote sensing
optical remote
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410530012.9A
Other languages
Chinese (zh)
Inventor
曹宜策
柳晨辰
吴振华
姚汶昕
黄志祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University
Priority to CN202410530012.9A
Publication of CN118365882A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses an optical remote sensing image segmentation method based on a VMamba model, comprising the following steps: acquiring an optical remote sensing image of a scene to be segmented; constructing a VM-UNet model based on the VMamba model; training the constructed VM-UNet model with the optical remote sensing image to obtain a segmentation model; and using the segmentation model to segment the optical remote sensing image. The invention introduces the Visual State Space (VSS) block as a basic block to capture extensive contextual information and builds an asymmetric encoder-decoder structure that captures strong long-range information while preserving linear computational complexity. This avoids both the weak long-range modeling of CNN models, which are limited by their local receptive fields, and the high computational burden of the Transformer self-attention mechanism, whose complexity is quadratic in image size.

Description

Optical remote sensing image segmentation method based on VMamba model

Technical Field

The invention relates to the technical field of optical remote sensing images, and in particular to an optical remote sensing image segmentation method based on a VMamba model.

Background Art

Optical remote sensing images are remote sensing images that capture and record the earth's surface using electromagnetic waves such as visible, infrared, and ultraviolet light. These images reflect information such as surface features, vegetation coverage, the distribution of ground objects, and the distribution of water bodies. Analyzing and processing them helps people understand conditions on the earth's surface and supports applications such as environmental change monitoring, resource exploration, and disaster monitoring. Optical imagery is the most widely used form of remote sensing and one of the most intuitive means of obtaining information about the earth's surface.

Optical remote sensing image segmentation methods fall into two main categories: traditional image processing methods and deep learning-based methods. Whichever method is used, its essence is to extract feature representations from remote sensing images in order to segment ground objects effectively. In high-resolution remote sensing images, texture and structure features are rich, the internal features of objects of the same class are distinct, and the boundaries between different objects are clear. These characteristics provide rich feature and topological information, which forms the basis for fine-grained ground object classification. However, because imaging scenes vary widely, the same class of ground object may be represented very differently under different conditions (for example, vegetation at different latitudes or in different seasons), while different classes may look similar (for example, grassland and woodland). This phenomenon is described as "same object, different spectra; same spectrum, different objects", and it makes fine-grained segmentation harder.

At present, there are two kinds of deep learning models for optical remote sensing image segmentation: models based on CNNs (Convolutional Neural Networks) and models based on the Transformer structure. Existing studies show that applying CNN models to tasks such as land use classification and segmentation of high-resolution remote sensing images improves accuracy over traditional methods and yields better classification and segmentation results. CNN models are now widely used in remote sensing image segmentation. They extract features with shared convolution kernels, which reduces the number of network parameters and improves model efficiency. Despite these advantages, the limited receptive field of CNN models hinders the capture of global features, so the accuracy of ground object classification remains limited and unsatisfactory.

In recent years, the Transformer architecture has gradually become the main deep learning model in computer vision and has increasingly been applied to remote sensing image segmentation. The attention mechanism is the main component of the Transformer structure, and it is superior to the CNN structure in obtaining spatial information and establishing global relationships. The Shifted Windows Transformer (SwinT) proposed by Liu et al. uses Transformer blocks to obtain multi-scale features and has strong global modeling ability; its classification results are significantly better than those of traditional CNN models, but its structure is complex and computationally expensive. Inspired by this, researchers have applied Transformer design ideas to CNN models to combine the strengths of both. For example, starting from ResNet50 (R50) and borrowing the advantages of SwinT, Liu et al. proposed a new CNN model, ConvNeXt, which achieved high ImageNet Top-1 accuracy relying solely on convolutional structures and surpassed SwinT in both image classification accuracy and model simplicity. Although these results demonstrate the advantages of deep learning for optical remote sensing segmentation, current optical remote sensing image segmentation models have the following problems:

Both CNN-based and Transformer-based models have inherent limitations. CNN-based models are limited by their local receptive fields, which greatly hinders their ability to capture long-range information. This often leads to insufficient feature extraction and therefore suboptimal segmentation results. Transformer-based models show excellent global modeling performance, but the complexity of the self-attention mechanism is quadratic in image size, which results in a high computational burden, especially for tasks that require dense prediction. These shortcomings motivate a new architecture for semantic segmentation of optical remote sensing images that captures strong long-range information while maintaining linear computational complexity.
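To make the complexity argument concrete, the following sketch counts patch tokens for a 512×512 input split into 4×4 patches (the sizes used in the embodiment of this description) and compares the number of pairwise interactions in global self-attention with the number of steps in a single linear scan. The function names are illustrative, and the counts ignore channel width and constant factors.

```python
def num_tokens(height: int, width: int, patch: int = 4) -> int:
    """Number of non-overlapping patch tokens for an image."""
    return (height // patch) * (width // patch)


def attention_pairs(n_tokens: int) -> int:
    """Global self-attention relates every token to every token: O(N^2)."""
    return n_tokens * n_tokens


def scan_steps(n_tokens: int) -> int:
    """A sequential state-space scan visits each token once per pass: O(N)."""
    return n_tokens


n = num_tokens(512, 512)
print(n, attention_pairs(n), scan_steps(n))  # 16384 268435456 16384
```

Doubling the image side length multiplies the token count by 4, the scan steps by 4, and the attention pairs by 16, which is the gap a linear-complexity backbone is meant to close.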

Summary of the Invention

To solve the technical problems described in the background, the present invention introduces the visual state space (VSS) block as a basic block to capture extensive contextual information, and constructs an asymmetric encoder-decoder structure to capture strong long-range information while maintaining linear computational complexity.

To achieve the above object, the present invention provides an optical remote sensing image segmentation method based on the VMamba model, the steps comprising:

acquiring an optical remote sensing image of the scene to be segmented;

constructing a VM-UNet model based on the VMamba model;

training the constructed VM-UNet model with the optical remote sensing image to obtain a segmentation model;

using the segmentation model to segment the optical remote sensing image.

Preferably, the VM-UNet model is an asymmetric U-shaped network model based on the VMamba model, comprising a block embedding layer, an encoder, a decoder, and a final projection layer, wherein:

the block embedding layer embeds the input image to obtain a processed image;

the encoder extracts features from the processed image;

the decoder generates a segmentation map based on the features extracted by the encoder;

the final projection layer projects the segmentation map into the final segmentation result.

Preferably, the embedding performed by the block embedding layer comprises the steps of:

dividing the input optical remote sensing image into non-overlapping blocks of size 4×4;

performing dimension mapping on the divided optical remote sensing image to obtain a mapped image;

normalizing the mapped image to obtain the processed image.

Preferably, the encoder comprises four cascaded VSSLayer layers, wherein each of the first three VSSLayer layers comprises two VSSblock blocks and one PatchMerging2D block, and the fourth VSSLayer layer comprises two VSSblock blocks.

Preferably, the decoder comprises four cascaded VSSLayer_up layers, wherein the first VSSLayer_up layer comprises two VSSblock blocks, and each of the last three VSSLayer_up layers comprises two VSSblock blocks and one PatchExpand2D block.

Preferably, the VSSblock block comprises a layer normalization layer and an SS2D module; the PatchMerging2D block comprises a linear layer and a layer normalization layer; the PatchExpand2D block comprises a linear layer and a layer normalization layer; and the SS2D module comprises two linear layers, a convolutional layer, a SiLU activation layer, and a layer normalization layer.

Preferably, in the encoder, each VSSLayer layer corresponds to one stage, and in the first three stages a block merging operation is applied at the end of each stage to reduce the height and width of the input features while increasing the number of channels.

Preferably, in the decoder, each VSSLayer_up layer corresponds to one stage, and in the last three stages a block expansion operation is applied at the beginning of each stage to reduce the number of feature channels while increasing the height and width.

Compared with the prior art, the present invention has the following beneficial effects:

The present invention introduces the visual state space (VSS) block as a basic block to capture extensive contextual information, and constructs an asymmetric encoder-decoder structure that captures strong long-range information while maintaining linear computational complexity. This avoids both the weak long-range modeling of CNN models, which are limited by their local receptive fields, and the high computational burden of the Transformer self-attention mechanism, whose complexity is quadratic in image size.

Brief Description of the Drawings

To illustrate the technical solution of the present invention more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flow chart of the method of an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.

To make the above objects, features, and advantages of the present invention easier to understand, the invention is described in further detail below with reference to the drawings and specific embodiments.

Embodiment 1

FIG. 1 is a schematic flow chart of the method of this embodiment, the steps of which include:

S1. Acquire an optical remote sensing image of the scene to be segmented.

An optical remote sensing image of the scene to be segmented and its corresponding ground-truth label map are collected. The scene to be segmented may involve ground object extraction in a natural scene or regional planning of a city. In this embodiment, the ground-truth label map serves as the reference for the model's predictions when computing the loss function; it is also compared with the model's predictions to evaluate model performance.
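The description does not name a specific loss function or evaluation metric. As an illustration of how a prediction can be compared with a ground-truth label map, the sketch below computes pixel accuracy and per-class intersection over union (IoU), two commonly used segmentation metrics; the choice of metrics, the function names, and the toy label maps are assumptions made for this example.

```python
def pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted class equals the ground-truth class."""
    total = correct = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            total += 1
            correct += int(p == t)
    return correct / total


def class_iou(pred, truth, cls):
    """Intersection over union for one class; NaN if the class never appears."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += int(p == cls and t == cls)
            union += int(p == cls or t == cls)
    return inter / union if union else float("nan")


# Toy 2x3 label maps with two classes (0 and 1), for illustration only.
truth = [[0, 0, 1],
         [0, 1, 1]]
pred = [[0, 1, 1],
        [0, 1, 1]]

print(pixel_accuracy(pred, truth))  # 5 of 6 pixels agree
print(class_iou(pred, truth, 1))    # 0.75
```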

The optical remote sensing images are then divided into a training set and a validation set in a fixed ratio. The specific steps are as follows:

The optical remote sensing image is preprocessed by cutting it into data slices of size 512×512 (1820 slices in this embodiment), which are divided into a training set and a validation set at a ratio of 8:2. The training set contains training images and training labels, and the validation set contains validation images and validation labels. The result is the dataset required for training and validating the VM-UNet network model.
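The preprocessing step can be sketched as follows. The 512×512 slice size, the count of 1820 slices, and the 8:2 ratio come from the description; dropping partial slices at the image border, shuffling before the split, and the function names are assumptions made for this example.

```python
import random


def tile_origins(height, width, tile=512):
    """Top-left corners of non-overlapping tile x tile slices.

    Partial slices at the right and bottom borders are dropped (an assumption).
    """
    return [(row, col)
            for row in range(0, height - tile + 1, tile)
            for col in range(0, width - tile + 1, tile)]


def split_train_val(items, train_parts=8, total_parts=10, seed=0):
    """Shuffle reproducibly, then split at train_parts/total_parts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = len(items) * train_parts // total_parts
    return items[:cut], items[cut:]


print(len(tile_origins(1024, 1536)))  # 6 slices from a 1024x1536 image

train, val = split_train_val(range(1820))
print(len(train), len(val))  # 1456 364
```

Integer arithmetic is used for the cut point so that the 8:2 split is exact regardless of floating-point rounding.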

S2. Construct the VM-UNet model based on the VMamba model.

The VM-UNet (Vision Mamba UNet) model is an asymmetric U-shaped network model based on the VMamba model. It introduces the visual state space (VSS) block as a basic block to capture extensive contextual information and constructs an asymmetric encoder-decoder structure that captures strong long-range information while maintaining linear computational complexity. The VM-UNet model comprises a block embedding layer, an encoder, a decoder, and a final projection layer.

In this embodiment, the block embedding layer of the network model, i.e., the input layer, takes input sample feature data with 3 feature-map channels, and the resulting feature is x ∈ R^{H×W×3}, where H is the height of the image and W is the width of the image.

The block embedding layer embeds the input image to obtain the processed image. The embedding process is as follows: the input optical remote sensing image is divided into non-overlapping patches of size 4×4, and the divided image is mapped to obtain a mapped image; finally, the mapped image is normalized to obtain the processed image.
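The patch partition step can be illustrated on nested lists. The sketch below cuts an H×W×3 image into non-overlapping 4×4 patches and flattens each patch into a vector of 4·4·3 = 48 values; in the model, each such vector would then be mapped to C channels and normalized. The function name and the toy 8×8 image are assumptions made for this example, and the learned mapping itself is not shown.

```python
def partition_patches(image, patch=4):
    """Split an H x W x 3 nested-list image into a grid of flattened patches.

    Returns a (H/patch) x (W/patch) grid; each cell holds patch*patch*3 values.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    grid = []
    for row in range(0, height, patch):
        grid_row = []
        for col in range(0, width, patch):
            flat = [value
                    for dr in range(patch)
                    for dc in range(patch)
                    for value in image[row + dr][col + dc]]
            grid_row.append(flat)
        grid.append(grid_row)
    return grid


# Toy 8x8 image whose pixel at (r, c) is [r, c, 0].
image = [[[r, c, 0] for c in range(8)] for r in range(8)]
grid = partition_patches(image)
print(len(grid), len(grid[0]), len(grid[0][0]))  # 2 2 48
```

Flattening each patch and applying one shared linear map is equivalent to a convolution whose kernel size and stride both equal the patch size.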

The encoder includes four cascaded VSSLayer layers, referred to in this embodiment as the first, second, third, and fourth VSSLayer layers. Each of the first three VSSLayer layers includes two VSSblock blocks and one PatchMerging2D block; the fourth VSSLayer layer includes two VSSblock blocks. In this embodiment, each VSSLayer layer of the encoder corresponds to one stage, and in the first three stages a block merging operation is applied at the end of each stage to reduce the height and width of the input features while increasing the number of channels.

For convenience in the description that follows, this embodiment distinguishes the blocks individually:

The first VSSLayer layer includes the first VSSblock block, the second VSSblock block, and the first PatchMerging2D block; the second VSSLayer layer includes the third VSSblock block, the fourth VSSblock block, and the second PatchMerging2D block; the third VSSLayer layer includes the fifth VSSblock block, the sixth VSSblock block, and the third PatchMerging2D block; the fourth VSSLayer layer includes the seventh VSSblock block and the eighth VSSblock block.
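The encoder layout can be traced as feature-map shapes. The description states only that each block merging operation reduces the height and width while increasing the channel count; the factor of 2 per merge and the embedding width of 96 used below are assumptions chosen for illustration, as is the function name.

```python
def encoder_shapes(height, width, embed_dim, patch=4, stages=4):
    """Shape (H, W, C) at the end of each encoder stage.

    Assumes each PatchMerging2D halves H and W and doubles C.
    """
    h, w, c = height // patch, width // patch, embed_dim  # after block embedding
    shapes = []
    for stage in range(1, stages + 1):
        # The two VSSblock blocks of a stage leave the shape unchanged.
        if stage < stages:
            # PatchMerging2D closes each of the first three stages.
            h, w, c = h // 2, w // 2, c * 2
        shapes.append((h, w, c))
    return shapes


for stage, shape in enumerate(encoder_shapes(512, 512, 96), start=1):
    print(f"VSSLayer {stage}: {shape}")
```

Under these assumptions a 512×512 slice leaves the fourth stage as a 16×16 map, so the most expensive blocks run on the smallest spatial grid.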

The decoder includes four cascaded VSSLayer_up layers, referred to in this embodiment as the first, second, third, and fourth VSSLayer_up layers. The first VSSLayer_up layer includes two VSSblock blocks; each of the last three VSSLayer_up layers includes two VSSblock blocks and one PatchExpand2D block. In this embodiment, each VSSLayer_up layer of the decoder corresponds to one stage, and in the last three stages a block expansion operation is applied at the beginning of each stage to reduce the number of feature channels while increasing the height and width.

Likewise, this embodiment distinguishes the blocks individually:

The first VSSLayer_up layer includes the ninth VSSblock block and the tenth VSSblock block; the second VSSLayer_up layer includes the first PatchExpand2D block, the eleventh VSSblock block, and the twelfth VSSblock block; the third VSSLayer_up layer includes the second PatchExpand2D block, the thirteenth VSSblock block, and the fourteenth VSSblock block; the fourth VSSLayer_up layer includes the third PatchExpand2D block and the fifteenth VSSblock block.
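The decoder layout can be traced the same way. The description states only that each block expansion operation reduces the channel count while increasing the height and width; the factor of 2 per expansion and the 16×16×768 starting shape below are assumptions chosen for illustration, as is the function name.

```python
def decoder_shapes(bottleneck, stages=4):
    """Shape (H, W, C) at the end of each decoder stage.

    Assumes each PatchExpand2D doubles H and W and halves C.
    """
    h, w, c = bottleneck
    shapes = []
    for stage in range(1, stages + 1):
        if stage > 1:
            # PatchExpand2D opens each of the last three stages.
            h, w, c = h * 2, w * 2, c // 2
        # The VSSblock blocks of a stage leave the shape unchanged.
        shapes.append((h, w, c))
    return shapes


for stage, shape in enumerate(decoder_shapes((16, 16, 768)), start=1):
    print(f"VSSLayer_up {stage}: {shape}")
```

With these factors each decoder stage ends at the spatial size and channel count of one encoder-side feature map, which is what allows features from the two paths to be fused.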

The VSSblock block described above includes a layer normalization layer and an SS2D module connected in sequence; the PatchMerging2D block includes a linear layer and a layer normalization layer connected in sequence; the PatchExpand2D block includes a linear layer and a layer normalization layer connected in sequence; the SS2D module includes two linear layers, a convolutional layer, a SiLU activation layer, and a layer normalization layer, connected in the order linear layer, convolutional layer, SiLU activation layer, layer normalization layer, linear layer.
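Two of the named layers have simple closed forms, shown here on plain Python lists. SiLU is x·sigmoid(x), and layer normalization subtracts the mean and divides by the standard deviation over the feature dimension. The learnable scale and shift of a full layer normalization layer are omitted, and the epsilon value and the `SS2D_ORDER` list name are assumptions made for this example.

```python
import math


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def layer_norm(values, eps=1e-5):
    """Normalize one feature vector to zero mean and (near) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


# Layer order inside the SS2D module as stated in the description.
SS2D_ORDER = ["linear", "conv", "silu", "layer_norm", "linear"]

print([round(silu(x), 4) for x in (-1.0, 0.0, 1.0)])
print([round(v, 4) for v in layer_norm([1.0, 2.0, 3.0])])
```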

Finally, the final projection layer includes a linear layer, a layer normalization layer, and a convolutional layer.

S3. Train the constructed VM-UNet model with the optical remote sensing images to obtain the segmentation model.

The VM-UNet network model is trained on the dataset obtained in S1 to obtain the segmentation model.

The specific steps include:

First, the training set is input in batches to the block embedding layer. The input image is divided into non-overlapping patches of size 4×4, and the dimension of the image is mapped to C (that is, the image has C channels after mapping), which produces the embedded image and a feature matrix that serves as the first skip connection layer. Specifically:

The block embedding layer uses a convolutional layer to divide the input image x ∈ R^{H×W×3} into non-overlapping blocks of size 4×4 and maps the dimension of the image to C. This process yields the embedded image x′ ∈ R^{(H/4)×(W/4)×C}. This embodiment then normalizes x′ with a layer normalization layer, saves the resulting feature matrix as the first skip connection layer, and inputs it to the encoder for feature extraction.

Next, the first and second VSSblock blocks extract features, and the first PatchMerging2D block downsamples them to produce a feature matrix that serves as the second skip connection layer; the third and fourth VSSblock blocks extract features, and the second PatchMerging2D block downsamples them to produce a feature matrix that serves as the third skip connection layer; the fifth and sixth VSSblock blocks extract features, and the third PatchMerging2D block downsamples them to produce a feature matrix that serves as the fourth skip connection layer; the seventh and eighth VSSblock blocks extract features that serve as the input to the ninth VSSblock block. Specifically:

将块嵌入层的产生的输出输入到第一VSSblock块,经过第一VSSblock中的层归一化层产生特征输入到第一VSSblock中SS2D模块,随后依次经过SS2D模块中的第一线性层产生特征卷积层产生特征SiLU激活层,层归一化层产生特征第二线性层产生输出并输入到第二VSSblock块,经过第二VSSblock中的层归一化产生特征输入到第二VSSblock中SS2D模块,随后依次经过SS2D模块中的第一线性层产生特征卷积层产生特征SiLU激活层,层归一化层产生特征第二线性层产生输出并输入到第一PatchMerging2D块,依次经过第一PatchMerging2D块的线性层和层归一化层产生输出并保存为第二个跳跃连接层;将第一PatchMerging2D产生的输出x2输入到第三VSSblock块,经过第三VSSblock中的层归一化层产生特征输入到第三VSSblock中SS2D模块,随后依次经过SS2D模块中的第一线性层产生特征卷积层产生特征SiLU激活层,层归一化层产生特征和第二线性层产生输出并输入到第四VSSblock块,经过第四VSSblock块中的层归一化层产生特征输入到第四VSSblock块中SS2D模块,随后依次经过SS2D模块中的第一线性层产生特征卷积层产生特征SiLU激活层,层归一化层产生特征和第二线性层产生输出并输入到第二PatchMerging2D块,依次经过第二PatchMerging2D块的线性层和层归一化层产生输出并保存为第三个跳跃连接层;将第二PatchMerging2D产生的输出x3输入到第五VSSblock块,经过第五VSSblock中的层归一化层产生特征输入到第五VSSblock中SS2D模块,随后依次经过SS2D模块中的第一线性层产生特征卷积层产生特征SiLU激活层,层归一化层产生特征和第二线性层产生输出并输入到第六VSSblock块,经过第六VSSblock中的层归一化层产生特征输入到第六VSSblock中SS2D模块,随后依次经过SS2D模块中的第一线性层产生特征卷积层产生特征SiLU激活层,层归一化层产生特征和第二线性层产生输出并输入到第三PatchMerging2D块,依次经过第三PatchMerging2D块的线性层和层归一化层产生输出并保存为第三个跳跃连接层;将第三PatchMerging2D产生的输出x4输入到第七VSSblock块,经过第七VSSblock块中的层归一化层产生特征输入到第七VSSblock块中SS2D模块,随后依次经过SS2D模块中的第一线性层产生特征卷积层产生特征SiLU激活层,层归一化层产生特征第二线性层产生输出并输入到第八VSSblock块,经过第八VSSblock块中的层归一化产生特征输入到第八VSSblock块中SS2D模块,随后依次经过SS2D模块中的第一线性层产生特征卷积层产生特征SiLU激活层,层归一化层产生特征第二线性层产生输出并作为第九VSSblock块的输入。The output produced by embedding the block into the layer Input to the first VSSblock block, and generate features through the layer normalization layer in the first VSSblock Input to the SS2D module in the first VSSblock, and then pass through the first linear layer in the SS2D module to generate features Convolutional layers generate features SiLU activation layer, layer normalization layer generates features The second linear layer produces the output And input to the second VSSblock block, after the layer normalization in the second VSSblock to generate features Input to the SS2D module in the second VSSblock, and then pass through the first linear layer in the SS2D module to generate features Convolutional layers generate features SiLU 
activation layer, layer normalization layer generates features, and the second linear layer produces the output, which is input to the first PatchMerging2D block. The linear layer and layer normalization layer of the first PatchMerging2D block then generate the output x2, which is saved as the second skip connection layer.

The output x2 generated by the first PatchMerging2D block is input to the third VSSblock block. The layer normalization layer in the third VSSblock block generates features that are input to the SS2D module of that block, where the first linear layer generates features, the convolutional layer generates features, the SiLU activation layer and layer normalization layer generate features, and the second linear layer produces the output. This output is input to the fourth VSSblock block, which processes it in the same way: its layer normalization layer generates features that are input to its SS2D module, and the first linear layer, convolutional layer, SiLU activation layer, layer normalization layer, and second linear layer are applied in sequence to produce the output. That output is input to the second PatchMerging2D block, whose linear layer and layer normalization layer generate the output x3, which is saved as the third skip connection layer.

The output x3 generated by the second PatchMerging2D block is input to the fifth VSSblock block. The layer normalization layer in the fifth VSSblock block generates features that are input to its SS2D module, where the first linear layer, convolutional layer, SiLU activation layer, layer normalization layer, and second linear layer are applied in sequence to produce the output. This output is input to the sixth VSSblock block, which processes it in the same way. The output of the sixth VSSblock block is input to the third PatchMerging2D block, whose linear layer and layer normalization layer generate the output x4, which is saved as the fourth skip connection layer.

The output x4 generated by the third PatchMerging2D block is input to the seventh VSSblock block. The layer normalization layer in the seventh VSSblock block generates features that are input to its SS2D module, where the first linear layer, convolutional layer, SiLU activation layer, layer normalization layer, and second linear layer are applied in sequence to produce the output. This output is input to the eighth VSSblock block, which processes it in the same way, and the output of the eighth VSSblock block serves as the input of the ninth VSSblock block.
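As a small illustration of two of the element-wise operations named in the SS2D description above, the following sketch implements SiLU and layer normalization for a single feature vector in plain Python. The function names, the epsilon value, and the omission of the learnable scale and shift parameters are assumptions made for illustration; they are not details taken from the patent.

```python
import math

def silu(x):
    """SiLU activation: x * sigmoid(x), applied element-wise."""
    return [v / (1.0 + math.exp(-v)) for v in x]

def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and (nearly) unit variance.

    eps is an assumed small constant for numerical stability; the learnable
    scale and shift used in practice are omitted in this sketch.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

# Hypothetical 4-channel feature vector for one spatial position.
features = [1.0, -2.0, 0.5, 3.0]
normalized = layer_norm(silu(features))
```

In a full model these operations would be applied to every spatial position of the feature map, with the linear and convolutional layers in between.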

Finally, the output of the tenth VSSblock block is fused with the features of the fourth skip connection layer and used as the input of the first PatchExpand2D block, which upsamples the input features. Features are then extracted by the eleventh and twelfth VSSblock blocks and fused with the features of the third skip connection layer as the input of the second PatchExpand2D block, which upsamples the input features. Features are extracted by the thirteenth and fourteenth VSSblock blocks and fused with the features of the second skip connection layer as the input of the third PatchExpand2D block, which upsamples the input features. Features are extracted by the fifteenth VSSblock block and fused with the features of the first skip connection layer as the input of the final projection layer, which produces the final segmentation result. Specifically:

The output z"" generated by the eighth VSSblock block is input to the ninth VSSblock block. The layer normalization layer in the ninth VSSblock block generates features that are input to the SS2D module of that block, where the first linear layer generates features, the SiLU activation layer and convolutional layer generate features, the layer normalization layer generates features, and the second linear layer produces the output. This output is input to the tenth VSSblock block, which processes it in the same way: its layer normalization layer generates features that are input to its SS2D module, and the first linear layer, SiLU activation layer, convolutional layer, layer normalization layer, and second linear layer are applied in sequence to produce the output. That output is fused with the features of the fourth skip connection layer and input to the first PatchExpand2D block, whose linear layer and layer normalization layer generate the output x5.

The output x5 generated by the first PatchExpand2D block is input to the eleventh VSSblock block. The layer normalization layer in the eleventh VSSblock block generates features that are input to its SS2D module, where the first linear layer, SiLU activation layer, convolutional layer, layer normalization layer, and second linear layer are applied in sequence to produce the output. This output is input to the twelfth VSSblock block, which processes it in the same way. The output of the twelfth VSSblock block is fused with the features of the third skip connection layer and input to the second PatchExpand2D block, whose linear layer and layer normalization layer generate the output x6.

The output x6 generated by the second PatchExpand2D block is input to the thirteenth VSSblock block. The layer normalization layer in the thirteenth VSSblock block generates features that are input to its SS2D module, where the first linear layer, SiLU activation layer, convolutional layer, layer normalization layer, and second linear layer are applied in sequence to produce the output. This output is input to the fourteenth VSSblock block, which processes it in the same way. The output of the fourteenth VSSblock block is fused with the features of the second skip connection layer and input to the third PatchExpand2D block, whose linear layer and layer normalization layer generate the output x7.

The output x7 generated by the third PatchExpand2D block is input to the fifteenth VSSblock block. The layer normalization layer in the fifteenth VSSblock block generates features that are input to its SS2D module, where the first linear layer, SiLU activation layer, convolutional layer, layer normalization layer, and second linear layer are applied in sequence to produce the output. This output is fused with the features of the first skip connection layer and input to the final projection layer, where the linear layer generates features, the layer normalization layer generates features, and the convolutional layer produces the final output y ∈ R^{H×W×numclass}, where numclass is the number of classes to be segmented.
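The way feature-map shapes change through the three PatchMerging2D blocks and the three PatchExpand2D blocks can be traced with a short sketch. It assumes that each merging step halves the height and width and doubles the channel count, that each expansion step does the reverse, and that skip-connection fusion leaves the shape unchanged. The factor of 2 and the shape-preserving fusion are assumptions for illustration; the passage above does not state them. The starting shape 128×128×96 is the block embedding output given in the embodiment.

```python
def patch_merge(shape):
    """Assumed PatchMerging2D effect: halve height and width, double channels."""
    h, w, c = shape
    return (h // 2, w // 2, c * 2)

def patch_expand(shape):
    """Assumed PatchExpand2D effect: double height and width, halve channels."""
    h, w, c = shape
    return (h * 2, w * 2, c // 2)

def trace_shapes(start):
    """Return the feature shape at the start and after each merge/expand step."""
    shapes = [start]
    for _ in range(3):  # three PatchMerging2D blocks in the encoder
        shapes.append(patch_merge(shapes[-1]))
    for _ in range(3):  # three PatchExpand2D blocks in the decoder
        shapes.append(patch_expand(shapes[-1]))
    return shapes

shapes = trace_shapes((128, 128, 96))
for step, shape in enumerate(shapes):
    print(step, shape)
```

Under these assumptions the encoder output and each decoder stage have the same shape as the skip connection they are fused with, which is what makes the fusion possible.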

S4. Use the segmentation model to segment the optical remote sensing image.

In this embodiment, the block embedding layer of the network model, i.e., the input layer, is configured with 3 input feature channels and 96 output feature channels (i.e., C = 96), a convolution filter of size 4×4, and a stride of 4, and it produces the output feature x ∈ R^{128×128×96}. This output feature is input to the segmentation model described above. The parameters of the final projection layer are set as follows:
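The 128×128 spatial size follows from the standard convolution output-size formula. The sketch below checks it, assuming a 512×512 input (the slice size used in the simulation experiment) and no padding; the function name is hypothetical.

```python
def conv_output_size(size, kernel, stride, padding=0):
    """Output length of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

# Block embedding layer: 4x4 filter, stride 4, 96 output channels.
side = conv_output_size(512, kernel=4, stride=4)
embedded_shape = (side, side, 96)
```

Because the kernel size equals the stride, each 4×4 block of the input is mapped to exactly one output position, so the blocks do not overlap.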

The fused features are input to the final projection layer, i.e., the output layer, and pass through its layers in sequence. The layer has 192 input feature channels and 6 output feature channels, a convolution filter of size 1×1, and a stride of 1, and it produces the final output y ∈ R^{512×512×6}, where 6 is the number of classes to be segmented.
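A 1×1 convolution with stride 1 acts as the same linear map over channels at every pixel. The sketch below applies such a map to one pixel with 192 input channels and 6 output channels, using random placeholder weights. Taking the arg-max over the 6 outputs to obtain a class index is a common convention assumed here; all names are hypothetical and the weights are not from the patent.

```python
import random

def conv1x1_pixel(pixel, weights, bias):
    """Apply a 1x1 convolution at one pixel: a linear map over channels."""
    return [sum(w * v for w, v in zip(row, pixel)) + b
            for row, b in zip(weights, bias)]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

random.seed(0)
in_channels, num_classes = 192, 6
# Placeholder weights; a trained model would supply learned values.
weights = [[random.uniform(-0.1, 0.1) for _ in range(in_channels)]
           for _ in range(num_classes)]
bias = [0.0] * num_classes
pixel = [random.uniform(-1.0, 1.0) for _ in range(in_channels)]

logits = conv1x1_pixel(pixel, weights, bias)
predicted_class = argmax(logits)  # assumed decision rule: highest score wins
```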

Embodiment 2

To verify the segmentation performance of the model of the present invention, this embodiment sets up a simulation experiment.

1. Simulation conditions

The hardware platform is: Intel(R) Core(TM) i7-7700 @ 3.20 GHz, 64.0 GB RAM;

The software platform is: PyCharm, using the PyTorch framework with TensorFlow as the backend.

2. Simulation method

Optical remote sensing data is obtained using the method of the present invention. The optical remote sensing data comprises optical remote sensing images and their corresponding ground-truth label maps. The data is divided into a training set and a validation set at a fixed ratio, the VM-UNet network model is trained, and the optical remote sensing images are finally segmented.

3. Simulation content and simulation results

The simulation experiment uses ten optical remote sensing images, each with a ground-truth label map. The images cover six land-cover types: background, building, farmland, grassland, forest, and water, with class numbers {0, 1, 2, 3, 4, 5}. Each ground-truth label map has the same size as its optical remote sensing image. In the label map, pixels with a known class take the following values: background is black (0, 0, 0), building is red (255, 0, 0), farmland is green (0, 255, 0), forest is cyan (0, 255, 255), grassland is yellow (255, 255, 0), and water is blue (0, 0, 255).
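For training, a color-coded label map is usually converted to a map of class indices. The sketch below does this with the colors listed above. The assignment of indices 0 to 5 follows the order in which the classes are first listed (background, building, farmland, grassland, forest, water), which is an assumption, and the function name is hypothetical.

```python
# RGB color -> class index (index order is an assumption, see lead-in).
COLOR_TO_CLASS = {
    (0, 0, 0): 0,        # background, black
    (255, 0, 0): 1,      # building, red
    (0, 255, 0): 2,      # farmland, green
    (255, 255, 0): 3,    # grassland, yellow
    (0, 255, 255): 4,    # forest, cyan
    (0, 0, 255): 5,      # water, blue
}

def rgb_label_to_indices(label_rgb):
    """Convert a 2D grid of RGB triples into a 2D grid of class indices."""
    return [[COLOR_TO_CLASS[tuple(px)] for px in row] for row in label_rgb]

# A tiny 2x2 label map as an example.
tiny_label = [
    [(0, 0, 0), (0, 0, 255)],
    [(255, 255, 0), (0, 255, 255)],
]
tiny_indices = rgb_label_to_indices(tiny_label)
```

A color that is not in the table raises a `KeyError`, which makes unexpected label values visible instead of silently mapping them to a class.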

In the simulation experiment, the 10 optical remote sensing images were cut into 1820 slices of size 512×512, which were divided into a training set and a validation set at a ratio of 8:2. The VM-UNet model was trained on the training set to perform optical remote sensing image segmentation. The segmentation accuracy on the validation set is shown in Table 1:
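With 1820 slices and an 8:2 ratio, the split yields 1456 training slices and 364 validation slices. A minimal sketch of such a split is shown below; the random shuffling, the fixed seed, and the function name are assumptions, since the text does not say how slices were assigned to each set.

```python
import random

def split_slices(num_slices, train_fraction=0.8, seed=0):
    """Shuffle slice indices and split them into training and validation sets."""
    indices = list(range(num_slices))
    random.Random(seed).shuffle(indices)  # assumed: random assignment
    cut = round(num_slices * train_fraction)
    return indices[:cut], indices[cut:]

train_ids, val_ids = split_slices(1820)
```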

Table 1

As Table 1 shows, in the optical remote sensing segmentation scenario, the VM-UNet provided by the present invention maintains linear or near-linear computational complexity while preserving segmentation accuracy for each class, and it performs particularly well on water segmentation. It provides a baseline segmentation model for optical remote sensing image segmentation other than CNN and Transformer models, and offers an effective solution to existing problems of CNN and Transformer models.
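The text does not define how the per-class segmentation accuracy is computed. One common choice, assumed here purely for illustration, is per-class pixel accuracy: for each class, the fraction of ground-truth pixels of that class that were predicted correctly. The function name and the toy data are hypothetical.

```python
def per_class_accuracy(y_true, y_pred, num_classes=6):
    """Fraction of correctly predicted pixels for each ground-truth class.

    Returns None for classes that do not occur in y_true.
    """
    correct = [0] * num_classes
    total = [0] * num_classes
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return [c / n if n else None for c, n in zip(correct, total)]

# Toy flattened label maps (class indices per pixel), not experimental data.
y_true = [0, 0, 1, 5, 5, 5]
y_pred = [0, 1, 1, 5, 5, 2]
acc = per_class_accuracy(y_true, y_pred)
```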

The embodiments described above only describe preferred implementations of the present invention and do not limit its scope. Without departing from the design spirit of the present invention, all modifications and improvements made to the technical solutions of the present invention by those of ordinary skill in the art shall fall within the protection scope defined by the claims of the present invention.

Claims (8)

1. An optical remote sensing image segmentation method based on the VMamba model, characterized by comprising the following steps:
Acquiring an optical remote sensing image of a scene to be segmented;
Constructing a VM-UNet model based on the VMamba model;
Training the constructed VM-UNet model with the optical remote sensing image to obtain a segmentation model;
and utilizing the segmentation model to complete the segmentation of the optical remote sensing image.
2. The VMamba model-based optical remote sensing image segmentation method as set forth in claim 1, wherein the VM-UNet model is an asymmetric U-shaped structure network model based on VMamba model, and includes: a block embedding layer, an encoder, a decoder, and a final projection layer; wherein,
The block embedding layer is used for carrying out embedding processing on the input image to obtain a processed image;
the encoder is used for extracting the characteristics of the processed image;
the decoder is used for generating a segmentation map based on the characteristics extracted by the encoder;
the final projection layer is used for projecting the segmentation map into a final segmentation result.
3. The method for optical remote sensing image segmentation based on VMamba model according to claim 2, wherein the step of embedding the block embedding layer includes:
dividing the input optical remote sensing image into non-overlapping blocks with the size of 4 multiplied by 4;
performing dimension mapping on the divided optical remote sensing image to obtain a mapped image;
and carrying out normalization processing on the mapping image to obtain the processed image.
4. The VMamba model-based optical remote sensing image segmentation method of claim 2, wherein the encoder comprises four cascaded VSSLayer layers; wherein the first three VSSLayer layers each include two VSSblock blocks and one PatchMerging2D block; the fourth VSSLayer layer includes two VSSblock blocks.
5. The VMamba model-based optical remote sensing image segmentation method as defined in claim 4, wherein the decoder includes four cascaded VSSLayer_up layers; wherein the first VSSLayer_up layer comprises two VSSblock blocks; the last three VSSLayer_up layers each include two VSSblock blocks and one PatchExpand2D block.
6. The VMamba model-based optical remote sensing image segmentation method of claim 5, wherein the VSSblock block comprises a layer normalization layer and an SS2D module; the PatchMerging2D block includes a linear layer and a layer normalization layer; the PatchExpand2D block includes a linear layer and a layer normalization layer; the SS2D module includes: two linear layers, one convolutional layer, one SiLU activation layer, and one layer normalization layer.
7. The method of claim 4, wherein in the encoder, each VSSLayer layer corresponds to a stage, and in the first three stages, the end of each stage employs a block merging operation to reduce the height and width of the input features while increasing the number of channels.
8. The method of claim 5, wherein in the decoder, each VSSLayer_up layer corresponds to a stage, and in the last three stages, the start of each stage uses a block expansion operation to reduce the number of feature channels and increase the height and width.
CN202410530012.9A 2024-04-29 2024-04-29 Optical remote sensing image segmentation method based on VMamba model Pending CN118365882A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410530012.9A CN118365882A (en) 2024-04-29 2024-04-29 Optical remote sensing image segmentation method based on VMamba model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410530012.9A CN118365882A (en) 2024-04-29 2024-04-29 Optical remote sensing image segmentation method based on VMamba model

Publications (1)

Publication Number Publication Date
CN118365882A true CN118365882A (en) 2024-07-19

Family

ID=91876459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410530012.9A Pending CN118365882A (en) 2024-04-29 2024-04-29 Optical remote sensing image segmentation method based on VMamba model

Country Status (1)

Country Link
CN (1) CN118365882A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115797931A (en) * 2023-02-13 2023-03-14 山东锋士信息技术有限公司 Remote sensing image semantic segmentation method based on double-branch feature fusion
CN116258976A (en) * 2023-03-24 2023-06-13 长沙理工大学 A Hierarchical Transformer Semantic Segmentation Method and System for High Resolution Remote Sensing Images
CN117576402A (en) * 2024-01-15 2024-02-20 临沂大学 A multi-scale aggregation Transformer remote sensing image semantic segmentation method based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
嫖姚: "上海交大提出VM-UNet:将Mamba结构融入UNet的模型", pages 1, Retrieved from the Internet <URL:https://zhuanlan.zhihu.com/p/690754287> *
提着小灯找呀找: "Swin-Transformer详解", pages 1, Retrieved from the Internet <URL:https://blog.csdn.net/xunmizhengzha/article/details/127952866> *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119006813A (en) * 2024-08-06 2024-11-22 青岛科技大学 U-shaped multimode fusion segmentation method combining graph neural network and Mamba model
CN118900366A (en) * 2024-08-13 2024-11-05 临沂大学 Infrared single pixel imaging method and device based on hybrid neural network model
CN119206568A (en) * 2024-09-05 2024-12-27 哈尔滨工业大学(威海) Video sequence segmentation method based on selective scanning visual state space model
CN119314009A (en) * 2024-09-20 2025-01-14 中国人民解放军国防科技大学 Automatic identification method of geological hazards based on multi-source data and deep learning
CN118941585A (en) * 2024-10-12 2024-11-12 四川大学 A 3D oral hard palate image segmentation method based on multi-directional state space model
CN119723078B (en) * 2024-11-30 2025-09-19 西北工业大学 Selective state space model road segmentation method based on frequency domain feature compensation
CN119723078A (en) * 2024-11-30 2025-03-28 西北工业大学 Selective state space model road segmentation method based on frequency domain feature compensation
CN119762499A (en) * 2024-12-04 2025-04-04 华南农业大学 Remote sensing image road extraction method and system based on VMamba and channel attention
CN120147632A (en) * 2025-02-26 2025-06-13 兰州理工大学 A hybrid structure remote sensing image segmentation method based on state space model
CN119850654B (en) * 2025-03-19 2025-06-03 南京工业大学 Remote sensing image segmentation method and system based on Mamba architecture
CN119850654A (en) * 2025-03-19 2025-04-18 南京工业大学 Remote sensing image segmentation method and system based on Mamba architecture
CN120450957A (en) * 2025-04-23 2025-08-08 安徽大学 A super-resolution mapping method for mangroves based on deep extraction of spatiotemporal spectral features
CN120472176A (en) * 2025-05-21 2025-08-12 耕宇牧星(北京)空间科技有限公司 A remote sensing image segmentation method based on intensity grouping Transformer network
CN120689359A (en) * 2025-08-26 2025-09-23 合肥综合性国家科学中心能源研究院(安徽省能源实验室) A medical image segmentation method based on improved SwinUNet
CN120689359B (en) * 2025-08-26 2025-10-28 合肥综合性国家科学中心能源研究院(安徽省能源实验室) A medical image segmentation method based on improved SwinUNet

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20240719)