CN118657948A

CN118657948A - Efficient Visual ViT Semantic Segmentation Method Based on Content Awareness and Token Sharing

Info

Publication number: CN118657948A
Application number: CN202411146624.4A
Authority: CN
Inventors: 孙启玉; 杨公平; 刘玉峰; 孙平
Original assignee: Shandong Fengshi Information Technology Co ltd
Current assignee: Shandong Fengshi Information Technology Co ltd
Priority date: 2024-08-21
Filing date: 2024-08-21
Publication date: 2024-09-17
Anticipated expiration: 2044-08-21
Also published as: CN118657948B

Abstract

The present invention relates to an efficient visual ViT semantic segmentation method based on content awareness and token sharing, and belongs to the technical field of image processing and computer vision. A token-sharing policy network is established and trained until convergence. An image I is input into the policy network to obtain a sharing policy P. The image patches and the sharing policy are input into a token sharing function to obtain a reduced token set T'. T' is input into a Transformer network to obtain a predicted token set L containing semantic information. L and the sharing policy are input into a token distribution function, which distributes and upsamples the predicted tokens according to the sharing policy and reassembles them into a predicted spatial feature; this feature is input into a decoder to obtain the final semantic segmentation prediction. The invention can significantly improve the computational efficiency of a vision Transformer semantic segmentation network while preserving segmentation quality.

Description

Efficient Visual ViT Semantic Segmentation Method Based on Content Awareness and Token Sharing

Technical Field

The invention relates to an efficient visual ViT semantic segmentation method based on content awareness and token sharing, and belongs to the technical field of image processing and computer vision.

Background Art

In recent years, many studies have proposed using the Vision Transformer (ViT) in place of convolutional neural networks (CNNs) for computer vision tasks such as image classification, object detection, and semantic segmentation. ViT applies multiple layers of multi-head self-attention to a set of tokens generated from fixed-size image patches. Existing ViT models have achieved state-of-the-art results on multiple tasks; pre-training on large-scale datasets in particular works well and significantly improves the performance of downstream tasks.

Semantic segmentation requires classifying every pixel in an image, which places high demands on the computational capacity and efficiency of the model. Traditional CNN methods extract image features gradually through layer-by-layer convolution, and usually rely on a large receptive field to capture global information. In contrast, ViT can capture the global dependencies of an image directly through its self-attention mechanism, which in theory helps improve segmentation accuracy.

In practice, ViT has been used for a variety of dense prediction tasks, including semantic segmentation, and has shown excellent performance on many benchmarks. For example, ViT models have achieved state-of-the-art performance on standard datasets such as ADE20K and Cityscapes. A ViT-based semantic segmentation architecture divides the input image into fixed-size image patches, which are fed into the Transformer encoder after linear embedding and position encoding. The encoder processes the patch sequence with multi-head self-attention to capture global context. The decoder then converts the encoder output into a two-dimensional feature map, which is upsampled to the original image size. Finally, a segmentation head (usually a convolutional layer) produces a class prediction for each pixel. This architecture combines the strength of the Transformer in modeling long-range dependencies with the ability of convolutional networks to process local features, and effectively improves semantic segmentation performance. However, despite this performance, the computational complexity and efficiency of ViT remain open challenges. To address them, researchers have proposed a series of improvements, such as:

(1) Hybrid architectures: some studies combine the advantages of CNNs and ViT. These methods usually use convolutional layers to extract local features in the early stages and a Transformer to capture global features in the later stages;

(2) Hierarchical Transformers: a hierarchical structure decomposes the image into patches at different scales to reduce the computational burden. For example, the Swin Transformer computes self-attention within hierarchical windows at different scales, thereby reducing complexity;

(3) Dynamic token pruning: the number of tokens to be processed is adjusted dynamically according to the image content, reducing wasted computation. For example, researchers have proposed dynamic pruning strategies that decide whether to keep each token based on its importance, thereby improving efficiency.

Although these improvements address the efficiency of ViT in semantic segmentation to some extent, several problems remain in this setting, such as:

(1) High computational complexity: the computational complexity of ViT self-attention grows with the square of the number of input tokens, which becomes particularly inefficient for large images. A conventional ViT architecture must process a large number of tokens for semantic segmentation, which greatly increases the computational burden;

(2) Token redundancy: in semantic segmentation, many image patches may contain the same semantic class, so in some cases it is reasonable for these patches to share the same token. Conventional ViT methods do not exploit this redundancy;

(3) Architectural complexity: to improve efficiency, some studies have proposed new ViT architectures that combine global and local attention or introduce CNN-like pyramid structures. However, these architectures are often complex and require additional design and optimization.

Therefore, improving the efficiency of ViT-based methods for semantic segmentation and significantly reducing their computational complexity and resource consumption is a problem that needs to be solved.

Summary of the Invention

The purpose of the present invention is to overcome the above shortcomings and to provide an efficient visual ViT semantic segmentation method based on content awareness and token sharing. By using an efficient policy network to predict which image patches can share tokens, the method reduces the number of tokens that need to be processed and can significantly improve the computational efficiency of a vision Transformer semantic segmentation network while preserving segmentation quality.

The technical solution adopted by the present invention is:

An efficient visual ViT semantic segmentation method based on content awareness and token sharing, comprising the following steps:

S1. Establish a token-sharing policy network p that predicts which superpatches in the input image contain only one semantic category, groups the superpatches that contain a single semantic category, and lets the image patches in each such superpatch share the same token;

S2. Train the token-sharing policy network p until it converges, then input the image I into the policy network p to obtain the sharing policy P;

S3. Divide the input image I into N image patches I_patch;

S4. Based on the ViT semantic segmentation architecture, input the image patches I_patch and the sharing policy P into the token sharing function t_s to obtain a reduced token set T';

S5. Input the reduced token set T' into the Transformer network to obtain a predicted token set L containing semantic information;

S6. Input the predicted token set L and the sharing policy P into the token distribution function t_u, which distributes and upsamples the predicted token set L according to the sharing policy P and reassembles the tokens to obtain the predicted spatial feature L_spat;

S7. Input the predicted spatial feature L_spat into the decoder to obtain the final semantic segmentation prediction.

In the above method, the token-sharing policy network p of step S1 is implemented by training a CNN model that predicts whether a superpatch contains a single semantic category, without predicting the specific category. If a superpatch contains only one category, the image patches within it can share the same token. The model therefore only needs to perform binary classification for each superpatch.

The token-sharing policy network p of step S2 is trained separately from the backbone network and is class-agnostic. The labels required to train it are generated directly from the segmentation labels of the semantic segmentation dataset: for each superpatch in the image, if it contains only one semantic category in the segmentation label, the policy network label is set to true; otherwise it is set to false. Using these binary labels, the policy network p is trained to convergence with a standard cross-entropy loss.

The token sharing function t_s of step S4 maps the original image patches I_patch and their position encodings to a new set I' = {I'_1, ..., I'_M}, where M < N, with corresponding position encodings. The mapping is performed according to the predicted sharing policy P, which specifies the superpatches whose image patches should share a token and the superpatches whose image patches are not shared and keep their original tokens. After obtaining the set I', the token sharing function t_s projects it to the token set T'. This process can be formulated as:

$T' = t_s(I_{patch}, P)$,

where the token sharing function t_s, according to the sharing policy P, downsamples each superpatch that can share a token to the dimensions of a single image patch and linearly projects it into a single token, so that every image patch in the superpatch shares that token.

The Transformer network of step S5 can be any semantic segmentation network based on the Transformer architecture, preferably the ViT-based segmentation network Segmenter.

The token distribution function t_u of step S6 first restores the token dimensions by bilinear upsampling according to the sharing policy P, that is, it undoes the token sharing of the image patches, and then reassembles the tokens spatially. A token is essentially a feature vector learned from an image patch; the tokens are rearranged according to the positions of their image patches in the original image to obtain the predicted spatial feature L_spat.

The decoder of step S7 is preferably the CNN-based UPerNet.

An efficient visual ViT semantic segmentation system based on content awareness and token sharing comprises a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, the efficient visual ViT semantic segmentation method based on content awareness and token sharing described above is implemented.

The beneficial effects of the present invention are:

To address the high computational complexity and token redundancy of ViT, the present invention uses an efficient policy network to predict which image patches can share tokens, thereby reducing the number of tokens that need to be processed. This can significantly improve the computational efficiency of a vision Transformer semantic segmentation network while preserving segmentation quality. Compared with existing token-based techniques for lightweight Transformer networks, the method uses a more dependable binary classification network to decide whether tokens should be shared, which makes the token reduction reliable: segmentation efficiency can be significantly improved without loss of segmentation quality. The proposed method is also plug-and-play: it is applicable to various ViT backbone networks and segmentation decoders, and is compatible with the latest pre-training methods. Its simplicity and generality make it easy to integrate into existing ViT-based models, providing an effective approach for research on and application of efficient architectures for Transformer-based dense prediction tasks.

Brief Description of the Drawings

FIG. 1 is a network framework diagram of the method of the present invention;

FIG. 2 is a schematic diagram of the division of an image into patches according to the present invention.

Detailed Description

The present invention is further described below with reference to specific embodiments and the drawings.

Embodiment 1: An efficient visual ViT semantic segmentation method based on content awareness and token sharing, comprising the following steps (see FIG. 1):

S1. Establish a token-sharing policy network p that predicts which superpatches in the input image contain only one semantic category, groups the superpatches that contain a single semantic category, and lets the image patches in each such superpatch share the same token:

In a ViT-based semantic segmentation architecture, the input image I ∈ R^{3×H×W} is divided into N image patches I_patch = {I_1, ..., I_N}, where H and W are the height and width of image I. These patches are converted by linear projection into a set of tokens I_token = {T_1, ..., T_N}, the corresponding learnable position encodings are added, and the result is input into the Transformer. The total number of tokens N depends on the size of the input image and the predefined patch size. The Transformer-based backbone network maps the input tokens to a set of predicted tokens L = {L_1, ..., L_N} containing semantic information. The Transformer model can be any architecture that uses global self-attention. The output tokens L are then input into the decoder. The decoder first rearranges the tokens by their original positions into a feature $L_{spat} \in \mathbb{R}^{E \times H_L \times W_L}$, where E is the feature dimension and H_L and W_L are fractions of the original H and W. The feature L_spat is then input into the decoder g to obtain the segmentation prediction. The problem solved by the present invention is to reduce the size of the token set, and thus computational complexity and resource consumption, by identifying image patches that can share tokens, without reducing semantic segmentation quality.

The present invention rests on a simple assumption: some image patches in an image contain redundant information. The assumption follows from the fact that the pixels of a single object in an image are contiguous. By analogy with the definition of superpixels, a group of 2×2 adjacent image patches is defined as a superpatch. A 2×2 superpatch that lies within a single object contains only one semantic category, so an image must contain many semantically uniform regions. It is assumed that processing such superpatches jointly improves efficiency without reducing segmentation quality. On this basis, the present invention proposes a content-aware token sharing method, which first predicts which superpatches in the input image I contain only one semantic category and then lets the image patches in each such 2×2 superpatch share the same token.

To make the method applicable to any Transformer-based model, the present invention designs a token sharing function t_s, a token distribution function t_u, and a token-sharing policy model p. All three are implemented in a differentiable manner so that they can be inserted into any current Transformer-based model and trained end to end.

The role of the token-sharing policy network is to predict, for each superpatch, whether its patches can share one token without reducing segmentation quality. The present invention implements this by training an efficient CNN model that predicts whether a superpatch contains a single semantic category, without predicting the specific category. If a superpatch contains only one category, the image patches within it can share the same token. The model is defined as P = p(I), where the policy network p can be implemented with any lightweight network, since it only needs to perform binary classification for each superpatch. In the present invention, EfficientNet-Lite0, the version with the fewest parameters in the EfficientNet family, is used to implement content awareness and token sharing. Based on the output of network p, the S superpatches with the highest scores are selected for token sharing, so that the four image patches in each of these S 2×2 superpatches share one token. This leaves M = N − 3S tokens, which are fed into the Transformer network for learning, where N is the number of image patches into which image I is divided.
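The selection rule and the token count M = N − 3S can be illustrated with a minimal sketch. The function names, the example scores, and the values N = 1024 and S = 100 are hypothetical and chosen only for illustration; the cost ratio covers the pairwise self-attention term only, not the whole network.

```python
def select_shared_superpatches(scores, num_shared):
    """Return the indices of the num_shared highest-scoring superpatches."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:num_shared])


def remaining_tokens(num_patches, num_shared):
    """Each shared 2x2 superpatch replaces four tokens with one: M = N - 3S."""
    return num_patches - 3 * num_shared


def attention_cost_ratio(num_patches, num_shared):
    """Ratio of pairwise token interactions after sharing versus before."""
    reduced = remaining_tokens(num_patches, num_shared)
    return (reduced / num_patches) ** 2


# Hypothetical policy scores for the four superpatches of a 4x4 patch grid.
scores = [0.9, 0.2, 0.75, 0.4]
shared = select_shared_superpatches(scores, num_shared=2)
tokens_left = remaining_tokens(num_patches=16, num_shared=len(shared))

# Hypothetical larger case: N = 1024 patches, S = 100 shared superpatches.
ratio = attention_cost_ratio(num_patches=1024, num_shared=100)
```

In the small case, superpatches 0 and 2 are selected and 10 of the 16 tokens remain. In the larger case, 724 tokens remain and the quadratic attention term drops to roughly half of its original cost.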

S2. Train the token-sharing policy network p until it converges, then input the image I into the policy network p to obtain the sharing policy P:

The token-sharing policy network is trained separately from the backbone network and is class-agnostic. The labels required to train the policy network p can be generated directly from the segmentation labels of the semantic segmentation dataset. Specifically, for each superpatch in the image, if it contains only one semantic category in the segmentation label, the policy network label is set to true; otherwise it is set to false. Using these binary labels, the policy network p is trained with a standard cross-entropy loss. The converged policy model p is used for the subsequent steps.
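The label generation described above can be sketched as follows. This is a minimal illustration, assuming the segmentation label is a 2-D list of class ids whose height and width are multiples of the superpatch side; the function names and the example values are hypothetical, and the loss shown is the usual binary form of cross-entropy.

```python
import math


def superpatch_labels(seg, patch_size, group=2):
    """Return a grid of booleans: True where a superpatch has one class only."""
    side = patch_size * group  # superpatch side length in pixels
    rows, cols = len(seg) // side, len(seg[0]) // side
    labels = []
    for r in range(rows):
        row = []
        for c in range(cols):
            classes = {
                seg[y][x]
                for y in range(r * side, (r + 1) * side)
                for x in range(c * side, (c + 1) * side)
            }
            row.append(len(classes) == 1)
        labels.append(row)
    return labels


def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and labels."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= math.log(p) if t else math.log(1.0 - p)
    return total / len(probs)


# Toy 4x4 label map with 1-pixel patches, so each superpatch is 2x2 pixels.
seg = [
    [0, 0, 1, 1],
    [0, 0, 1, 2],
    [3, 3, 3, 3],
    [3, 3, 3, 3],
]
labels = superpatch_labels(seg, patch_size=1)
targets = [flag for row in labels for flag in row]
loss = binary_cross_entropy([0.9, 0.2, 0.8, 0.7], targets)  # made-up scores
```

Only the top-right superpatch mixes two classes (1 and 2), so it is the only one labeled false.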

The token-sharing policy model p can be trained in this simple and efficient way because semantic segmentation datasets already provide usable semantic labels. Semantic segmentation annotations give a label for every pixel, which allows the present invention to turn the assumption about the redundancy of semantically similar image patches directly into a binary label for each superpatch. In addition, the class-agnostic nature of the policy model p is the key to its efficiency: binary classification is much simpler than multi-class segmentation and can therefore be performed by an efficient model, which is essential if reducing the number of tokens is to yield a net efficiency gain.

The image I is input into the trained token-sharing policy model p to obtain the sharing policy P, which indicates for each 2×2 superpatch whether its image patches can share a token.

S3. Divide the input image I into N image patches I_patch:

As shown in FIG. 2, the Transformer-based backbone network takes image patches as input. The present invention first divides the input image into N patches and defines each group of 2×2 adjacent image patches as a superpatch.
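The patch grid and the grouping of adjacent 2×2 patches into superpatches can be sketched as index arithmetic. This is an illustrative sketch only: the row-major patch numbering, the function names, and the 64×64 image with 16-pixel patches are assumptions, not details given in the source.

```python
def patch_grid(height, width, patch_size):
    """Return (rows, cols) of the patch grid for an image of the given size."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by the patch size")
    return height // patch_size, width // patch_size


def superpatch_members(grid_h, grid_w, group=2):
    """List, for each superpatch, the row-major indices of its patches."""
    members = []
    for r in range(0, grid_h - group + 1, group):
        for c in range(0, grid_w - group + 1, group):
            members.append(
                [(r + dr) * grid_w + (c + dc)
                 for dr in range(group) for dc in range(group)]
            )
    return members


# Hypothetical 64x64 image with 16x16 patches: a 4x4 grid, N = 16 patches.
grid_h, grid_w = patch_grid(64, 64, 16)
members = superpatch_members(grid_h, grid_w)
```

Here the first superpatch covers patches 0, 1, 4, and 5, and the 16 patches form four superpatches in total.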

S4. Based on the ViT semantic segmentation architecture, input the image patches I_patch and the sharing policy P into the token sharing function t_s to obtain a reduced token set T':

The token sharing function t_s (that is, the token sharing module, which is the set of operations below) maps the original image patches I_patch and their position encodings to a new set I' = {I'_1, ..., I'_M}, where M < N, with corresponding position encodings. The mapping is performed according to the predicted sharing policy P, which specifies the superpatches whose image patches should share a token and the superpatches whose image patches are not shared and keep their original tokens. After obtaining the set I', the token sharing function t_s projects it to the token set T' for subsequent processing by the Transformer network. This process can be formulated as:

$T' = t_s(I_{patch}, P)$,

where the token sharing function t_s, according to the sharing policy P, downsamples each superpatch that can share a token to the dimensions of a single image patch and linearly projects it into a single token, so that every image patch in the 2×2 superpatch shares that token.
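The following sketch illustrates one way the token sharing step could work on a single-channel image. It is not the patented implementation: 2×2 average pooling is an assumed choice of downsampling (the source does not name the method), position encodings are omitted, and all function names and the weight matrix are hypothetical.

```python
def crop(image, top, left, size):
    """Cut a size x size block out of a 2-D list of pixel values."""
    return [row[left:left + size] for row in image[top:top + size]]


def average_pool_2x2(block):
    """Halve a square block by averaging each 2x2 neighborhood (assumed)."""
    n = len(block) // 2
    return [
        [(block[2 * y][2 * x] + block[2 * y][2 * x + 1]
          + block[2 * y + 1][2 * x] + block[2 * y + 1][2 * x + 1]) / 4
         for x in range(n)]
        for y in range(n)
    ]


def share_patches(image, patch_size, shared):
    """Build the reduced set I' and record which patches each entry covers."""
    grid_h, grid_w = len(image) // patch_size, len(image[0]) // patch_size
    offsets = ((0, 0), (0, 1), (1, 0), (1, 1))
    reduced, sources = [], []
    for sr in range(grid_h // 2):
        for sc in range(grid_w // 2):
            index = sr * (grid_w // 2) + sc
            top, left = 2 * sr * patch_size, 2 * sc * patch_size
            ids = [(2 * sr + dr) * grid_w + (2 * sc + dc) for dr, dc in offsets]
            if index in shared:
                # Shared: downsample the whole superpatch to one patch.
                block = crop(image, top, left, 2 * patch_size)
                reduced.append(average_pool_2x2(block))
                sources.append(ids)
            else:
                # Not shared: keep the four original patches.
                for (dr, dc), pid in zip(offsets, ids):
                    reduced.append(crop(image, top + dr * patch_size,
                                        left + dc * patch_size, patch_size))
                    sources.append([pid])
    return reduced, sources


def project(patch, weights):
    """Linearly project a flattened patch; weights has one row per output."""
    flat = [v for row in patch for v in row]
    return [sum(w * v for w, v in zip(row, flat)) for row in weights]


# Toy 4x4 image with 1-pixel patches; superpatches 0 and 3 are shared.
image = [[float(4 * y + x) for x in range(4)] for y in range(4)]
reduced, sources = share_patches(image, patch_size=1, shared={0, 3})
token = project(reduced[0], weights=[[2.0], [1.0]])  # made-up weights
```

With N = 16 patches and S = 2 shared superpatches, the reduced set has M = 16 − 3·2 = 10 entries. The `sources` list plays the role of the index information that the sharing policy later provides to the distribution step.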

S5. Input the reduced token set T' into the Transformer network to obtain a predicted token set L containing semantic information:

After the reduced token set T' is obtained, it is input into the Transformer network f for feature learning, yielding a set of predicted tokens L = {L_1, ..., L_M} containing semantic information. This process can be formulated as:

$L = f(T')$,

where the Transformer network f can be any semantic segmentation network based on the Transformer architecture. In the present invention, the classic ViT-based segmentation network Segmenter is used as the network f.

S6. Input the predicted token set L and the sharing policy P into the token distribution function t_u, which distributes and upsamples the predicted token set L according to the sharing policy P and reassembles the tokens to obtain the predicted spatial feature L_spat:

The predicted token set L and the sharing policy P are input into the token distribution function t_u. The token distribution function t_u (that is, the token distribution module, which is the set of operations below) first restores the token dimensions by bilinear upsampling according to the sharing policy P, that is, it undoes the token sharing of the patches, and then reassembles the tokens spatially into the predicted spatial feature L_spat. In this process the sharing policy P provides the indices of the image patches whose tokens were shared in step S4, and the token distribution function t_u upsamples based on these indices. This process can be formulated as:

$L_{spat} = t_u(L, P)$.
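The distribution step can be sketched as a scatter of predicted tokens back onto the patch grid. This is a simplified illustration: it treats each shared token as a 1×1 feature cell, for which upsampling to 2×2 reduces to copying the token to its four positions. The `sources` index lists and all names are hypothetical stand-ins for the index information supplied by the sharing policy.

```python
def distribute_tokens(tokens, sources, grid_h, grid_w):
    """Place each predicted token at every patch position it stands for."""
    flat = [None] * (grid_h * grid_w)
    for token, patch_ids in zip(tokens, sources):
        for pid in patch_ids:
            flat[pid] = list(token)  # shared tokens are copied to all members
    # Reassemble the row-major list into a grid_h x grid_w spatial feature.
    return [flat[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


# Toy 2x4 patch grid: the left superpatch was shared, the right one was not.
tokens = [[9.0], [1.0], [2.0], [3.0], [4.0]]
sources = [[0, 1, 4, 5], [2], [3], [6], [7]]
spatial = distribute_tokens(tokens, sources, grid_h=2, grid_w=4)
```

After distribution, every one of the eight grid positions holds a feature vector again, so a decoder can consume the result as an ordinary spatial feature map.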

S7. Input the predicted spatial feature L_spat into the decoder to obtain the final semantic segmentation prediction:

The predicted spatial feature L_spat is input into the decoder g to obtain the final semantic segmentation prediction O. This process can be formulated as:

$O = g(L_{spat})$,

where the decoder g can adopt a variety of network structures. In the present invention, for efficiency, the CNN-based UPerNet is used as the decoder to decode the features.
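The order in which steps S2 to S7 compose can be shown with a toy sketch. Everything below is a stand-in chosen for brevity: the "image" is a 1-D list, sharing operates on pairs rather than 2×2 superpatches, and the policy, backbone, and decoder are trivial functions rather than EfficientNet-Lite0, Segmenter, or UPerNet.

```python
def run_pipeline(image, policy_net, split, share, backbone, distribute, decoder):
    """Compose the inference steps in the order described in S2 to S7."""
    policy = policy_net(image)               # S2: P = p(I)
    patches = split(image)                   # S3: I_patch
    reduced = share(patches, policy)         # S4: T' = t_s(I_patch, P)
    predicted = backbone(reduced)            # S5: L = f(T')
    spatial = distribute(predicted, policy)  # S6: L_spat = t_u(L, P)
    return decoder(spatial)                  # S7: O = g(L_spat)


def toy_policy(image):
    """Mark a pair as shareable when both of its values are equal."""
    return [image[i] == image[i + 1] for i in range(0, len(image), 2)]


def toy_share(patches, policy):
    """Replace each shareable pair with its mean; keep other pairs as is."""
    reduced = []
    for i, shareable in enumerate(policy):
        pair = patches[2 * i:2 * i + 2]
        reduced.extend([sum(pair) / 2] if shareable else pair)
    return reduced


def toy_distribute(tokens, policy):
    """Copy each shared token back to both positions of its pair."""
    restored, k = [], 0
    for shareable in policy:
        if shareable:
            restored.extend([tokens[k], tokens[k]])
            k += 1
        else:
            restored.extend(tokens[k:k + 2])
            k += 2
    return restored


result = run_pipeline(
    image=[1, 1, 5, 7],
    policy_net=toy_policy,
    split=list,
    share=toy_share,
    backbone=lambda tokens: [10 * t for t in tokens],
    distribute=toy_distribute,
    decoder=lambda feats: [int(f > 30) for f in feats],
)
```

The backbone in this toy run sees three tokens instead of four, yet the output still has one prediction per original position.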

Embodiment 2: An efficient visual ViT semantic segmentation system based on content awareness and token sharing, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, the efficient visual ViT semantic segmentation method based on content awareness and token sharing described in Embodiment 1 is implemented.

The above is a further description of the present invention with reference to the embodiments; the scope of protection of the present invention is not limited thereto.

Claims

1. An efficient visual ViT semantic segmentation method based on content awareness and token sharing, characterized in that it comprises the following steps:

S1. Establish a token-sharing policy network p that predicts which superpatches in the input image contain only one semantic category, groups the superpatches that contain a single semantic category, and lets the image patches in each such superpatch share the same token;

S2. Train the token-sharing policy network p until it converges, then input the image I into the policy network p to obtain the sharing policy P;

S3. Divide the input image I into N image patches I_patch;

S4. Based on the ViT semantic segmentation architecture, input the image patches I_patch and the sharing policy P into the token sharing function t_s to obtain a reduced token set T';

S5. Input the reduced token set T' into the Transformer network to obtain a predicted token set L containing semantic information;

S6. Input the predicted token set L and the sharing policy P into the token distribution function t_u, which distributes and upsamples the predicted token set L according to the sharing policy P and reassembles the tokens to obtain the predicted spatial feature L_spat;

S7. Input the predicted spatial feature L_spat into the decoder to obtain the final semantic segmentation prediction.

2. The efficient visual ViT semantic segmentation method based on content awareness and token sharing according to claim 1, characterized in that the token-sharing policy network p of step S1 is implemented by training a CNN model that predicts whether a superpatch contains a single semantic category, without predicting the specific category; if a superpatch contains only one category, the image patches within it can share the same token, and the model only needs to perform binary classification for each superpatch.

3. The efficient visual ViT semantic segmentation method based on content awareness and token sharing according to claim 1, characterized in that the token-sharing policy network p of step S2 is trained separately from the backbone network and is class-agnostic; the labels required to train the policy network p are generated directly from the segmentation labels of the semantic segmentation dataset; for each superpatch in the image, if it contains only one semantic category in the segmentation label, the policy network label is set to true, and otherwise it is set to false; using these binary labels, the policy network p is trained to convergence with a standard cross-entropy loss.

4. The efficient visual ViT semantic segmentation method based on content awareness and token sharing according to claim 1, characterized in that the token sharing function t_s of step S4 maps the original image patches I_patch and their position encodings to a new set I' = {I'_1, ..., I'_M}, where M < N, with corresponding position encodings; the mapping is performed according to the predicted sharing policy P, which specifies the superpatches whose image patches should share a token and the superpatches whose image patches are not shared and keep their original tokens; after obtaining the set I', the token sharing function t_s projects it to the token set T'; this process can be formulated as:

$T' = t_s(I_{patch}, P)$,

where the token sharing function t_s, according to the sharing policy P, downsamples each superpatch that can share a token to the dimensions of a single image patch and linearly projects it into a single token, so that every image patch in the superpatch shares that token.

5. The efficient visual ViT semantic segmentation method based on content awareness and token sharing according to claim 1, characterized in that the Transformer network of step S5 can be any semantic segmentation network based on the Transformer architecture.

6. The efficient visual ViT semantic segmentation method based on content awareness and token sharing according to claim 1, characterized in that the token distribution function t_u of step S6 first restores the token dimensions by bilinear upsampling according to the sharing policy P, that is, it undoes the token sharing of the image patches, and then rearranges the tokens according to the positions of their image patches in the original image to obtain the predicted spatial feature L_spat.

7. The efficient visual ViT semantic segmentation method based on content awareness and token sharing according to claim 1, characterized in that the decoder of step S7 is the CNN-based UPerNet.

8. An efficient visual ViT semantic segmentation system based on content awareness and token sharing, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the program, the efficient visual ViT semantic segmentation method based on content awareness and token sharing according to any one of claims 1 to 7 is implemented.