CN118657948A - Efficient Visual ViT Semantic Segmentation Method Based on Content Awareness and Token Sharing - Google Patents


Publication number
CN118657948A
Authority
CN
China
Prior art keywords
token
sharing
image
network
vit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202411146624.4A
Other languages
Chinese (zh)
Other versions
CN118657948B (en
Inventor
孙启玉
杨公平
刘玉峰
孙平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Fengshi Information Technology Co ltd
Original Assignee
Shandong Fengshi Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Fengshi Information Technology Co ltd
Priority to CN202411146624.4A
Publication of CN118657948A
Application granted
Publication of CN118657948B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to an efficient visual ViT semantic segmentation method based on content awareness and token sharing, and belongs to the technical field of image processing and computer vision. A token sharing policy network is established and trained until convergence. An image I is input into the policy network to obtain a sharing policy; the image patches and the sharing policy are input into a token sharing function to obtain a reduced token set T'; T' is input into a Transformer network to obtain a set of predicted tokens L containing semantic information; L and the sharing policy are input into a token distribution function, which distributes and upsamples the predicted tokens according to the sharing policy and reassembles them into a predicted spatial feature; the spatial feature is input into a decoder to obtain the final semantic segmentation prediction. The invention can significantly improve the computational efficiency of Vision Transformer based semantic segmentation networks while preserving segmentation quality.

Description

Efficient Visual ViT Semantic Segmentation Method Based on Content Awareness and Token Sharing

Technical Field

The invention relates to an efficient visual ViT semantic segmentation method based on content awareness and token sharing, and belongs to the technical field of image processing and computer vision.

Background Art

In recent years, many studies have proposed replacing convolutional neural networks (CNNs) with the Vision Transformer (ViT) for computer vision tasks such as image classification, object detection, and semantic segmentation. A ViT processes a set of tokens generated from fixed-size image patches through multiple layers of multi-head self-attention. Existing ViT models reach state-of-the-art results on many tasks; in particular, pre-training on large-scale datasets works well and significantly improves downstream performance.

Semantic segmentation requires classifying every pixel in an image, which places high demands on a model's computational capacity and efficiency. Conventional CNN methods extract image features gradually through layer-by-layer convolution and typically rely on large receptive fields to capture global information. In contrast, a ViT can capture global dependencies in the image directly through self-attention, which in theory helps improve segmentation accuracy.

In practice, ViTs have been applied to a variety of dense prediction tasks, including semantic segmentation, and perform well on many benchmarks. For example, ViT models have achieved top results on standard datasets such as ADE20K and Cityscapes. A ViT-based semantic segmentation architecture divides the input image into fixed-size image patches, which are linearly embedded, combined with position encodings, and fed into a Transformer encoder. The encoder processes the patch sequence with multi-head self-attention to capture global context. A decoder then converts the encoder output into a two-dimensional feature map, which is upsampled to the original image size. Finally, a segmentation head (usually a convolutional layer) produces a class prediction for each pixel. This architecture combines the Transformer's strength in modeling long-range dependencies with the ability of convolutional networks to process local features, which improves segmentation performance. Despite this, the computational complexity and efficiency of ViTs remain open challenges. To address them, researchers have proposed a series of improvements, such as:

(1) Hybrid architectures: Some studies combine the strengths of CNNs and ViTs. These methods typically use convolutional layers to extract local features in the early stages and Transformer layers to capture global features in the later stages;

(2) Hierarchical Transformers: A hierarchical structure decomposes the image into patches at different scales to reduce the computational burden. For example, Swin Transformer computes self-attention within hierarchical windows at different scales, which reduces complexity;

(3) Dynamic token pruning: The number of tokens to be processed is adjusted dynamically according to the image content, reducing wasted computation. For example, researchers have proposed dynamic pruning strategies that decide whether to keep each token based on its importance, thereby improving efficiency.

Although these improvements alleviate the efficiency problem of ViTs in semantic segmentation to some extent, several problems remain in this setting, such as:

(1) High computational complexity: The computational complexity of a ViT grows quadratically with the number of input tokens, which is especially inefficient for large images. A conventional ViT architecture must process a large number of tokens for semantic segmentation, which greatly increases the computational burden;

(2) Token redundancy: In semantic segmentation, many image patches contain the same semantic class, so in some cases it is reasonable for these patches to share a single token. Conventional ViT methods do not exploit this redundancy;

(3) Architectural complexity: To improve efficiency, some studies have proposed new ViT architectures that combine global and local attention or introduce CNN-like pyramid structures. However, these architectures are often complex and require additional design and optimization.

Therefore, improving the efficiency of ViT-based methods for semantic segmentation, and significantly reducing their computational complexity and resource consumption, is a problem that urgently needs to be solved.

Summary of the Invention

The purpose of the present invention is to overcome the above shortcomings and provide an efficient visual ViT semantic segmentation method based on content awareness and token sharing. An efficient policy network predicts which image patches can share tokens, reducing the number of tokens to be processed, so that the computational efficiency of Vision Transformer based semantic segmentation networks is significantly improved while segmentation quality is preserved.

The technical solution adopted by the present invention is as follows:

An efficient visual ViT semantic segmentation method based on content awareness and token sharing includes the following steps:

S1. Establish a token sharing policy network p to predict which superpatches in the input image contain only one semantic class, group the superpatches that contain a single semantic class, and let the image patches in each superpatch of that group share one token;

S2. Train the token sharing policy network p until it converges, then input the image I into p to obtain the sharing policy P;

S3. Divide the input image I into N image patches $I_{patch}$;

S4. Based on the ViT semantic segmentation architecture, input the image patches $I_{patch}$ and the sharing policy P into the token sharing function $t_s$ to obtain a reduced token set T';

S5. Input the reduced token set T' into the Transformer network to obtain a set of predicted tokens L containing semantic information;

S6. Input the predicted tokens L and the sharing policy P into the token distribution function $t_u$, which distributes and upsamples L according to P and reassembles the tokens to obtain the predicted spatial feature $L_{spat}$;

S7. Input the predicted spatial feature $L_{spat}$ into the decoder to obtain the final semantic segmentation prediction.

In the above method, the token sharing policy network p described in step S1 is implemented by training a CNN model that predicts whether a superpatch contains a single semantic class, without predicting which class it is. If a superpatch contains only one class, the image patches in it can share one token. The network therefore only needs to perform binary classification on each superpatch.

The token sharing policy network p described in step S2 is trained separately from the backbone network and is class-agnostic. The labels needed to train p are generated directly from the segmentation labels of the semantic segmentation dataset: for each superpatch in an image, the label is set to true if the superpatch contains only one semantic class in the segmentation labels, and to false otherwise. With these binary labels, p is trained to convergence using the standard cross-entropy loss.

The token sharing function $t_s$ described in step S4 maps the original image patches $I_{patch}$ and their position encodings to a new set $I' = \{I'_1, \dots, I'_M\}$, where M < N, with corresponding position encodings. The mapping follows the predicted sharing policy P, which specifies in which superpatches the image patches share a token and in which superpatches the patches keep their original tokens. After obtaining I', the token sharing function $t_s$ projects it to the token set T'. This process can be formulated as:

$$T' = t_s(I_{patch}, P),$$

where $t_s$, according to the sharing policy P, downsamples each superpatch that can share a token to the dimensions of a single image patch and linearly projects it into a single token, so that all image patches in the superpatch share that token.

The Transformer network described in step S5 can be any semantic segmentation network based on the Transformer architecture, preferably the ViT-based segmentation network Segmenter.

The token distribution function $t_u$ described in step S6 first restores the token dimensions by bilinear upsampling according to the sharing policy P, that is, it undoes the token sharing of the image patches, and then reassembles the tokens spatially. A token is essentially a feature vector learned from an image patch; arranging the tokens according to the positions of their image patches in the original image yields the predicted spatial feature $L_{spat}$.

The decoder described in step S7 is preferably the CNN-based UPerNet.

An efficient visual ViT semantic segmentation system based on content awareness and token sharing includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, it implements the efficient visual ViT semantic segmentation method based on content awareness and token sharing described above.

The beneficial effects of the present invention are as follows:

To address the high computational complexity and token redundancy of ViTs, the present invention uses an efficient policy network to predict which image patches can share tokens, reducing the number of tokens to be processed. This significantly improves the computational efficiency of Vision Transformer based semantic segmentation networks while preserving segmentation quality. Compared with existing token-reduction techniques for lightweight Transformer networks, the method uses a binary classification network, whose predictions are more dependable, to decide whether tokens should be shared. This makes the lightweighting process reliable: segmentation efficiency can be significantly improved without loss of segmentation quality. The proposed method is also plug-and-play: it works with various ViT backbones and segmentation decoders and is compatible with recent pre-training methods. Its simplicity and generality make it easy to integrate into existing ViT-based models, providing an effective approach for the study and application of efficient architectures for Transformer-based dense prediction tasks.

Brief Description of the Drawings

FIG. 1 is a network framework diagram of the method of the present invention;

FIG. 2 is a schematic diagram of the image partitioning of the present invention.

Detailed Description

The present invention is further described below with reference to specific embodiments and the drawings.

Embodiment 1. An efficient visual ViT semantic segmentation method based on content awareness and token sharing includes the following steps (see FIG. 1):

S1. Establish a token sharing policy network p to predict which superpatches in the input image contain only one semantic class, group the superpatches that contain a single semantic class, and let the image patches in each superpatch of that group share one token:

In a ViT-based semantic segmentation architecture, the input image $I \in \mathbb{R}^{3 \times H \times W}$ is divided into N image patches $I_{patch} = \{I_1, \dots, I_N\}$, where H and W are the height and width of image I. These patches are converted by linear projection into a set of tokens $I_{token} = \{T_1, \dots, T_N\}$, the corresponding learnable position encodings are added, and the result is input into the Transformer. The total number of tokens N depends on the size of the input image and the predefined patch size. The Transformer-based backbone maps the input tokens $I_{token}$ to a set of predicted tokens $L = \{L_1, \dots, L_N\}$ containing semantic information. The Transformer model can be any architecture that uses global self-attention. The output tokens L are then passed to the decoder stage, which first rearranges them according to their original positions into a spatial feature $L_{spat} \in \mathbb{R}^{E \times H_L \times W_L}$, where E is the feature dimension and $H_L$ and $W_L$ are fractions of the original H and W. The feature $L_{spat}$ is then input into the decoder g to obtain the segmentation prediction. The problem addressed by the present invention is to reduce the size of the token set, and thereby computational complexity and resource consumption, without degrading semantic segmentation quality, by identifying image patches that can share tokens.

The present invention rests on a straightforward assumption: an image necessarily contains patches that carry redundant information. This follows from the spatial continuity of pixels within a single object. By analogy with superpixels, a group of 2×2 adjacent image patches is defined as a superpatch. A 2×2 superpatch that lies entirely within a single object contains only one semantic class, so an image necessarily contains many semantically uniform regions. The hypothesis is that processing the patches of such superpatches jointly improves efficiency without degrading segmentation quality. On this basis, the present invention proposes a content-aware token sharing method, which first predicts which superpatches in the input image I contain only one semantic class and then lets the image patches in each such 2×2 superpatch share one token.

To make the method applicable to any Transformer-based model, the present invention designs a token sharing function $t_s$, a token distribution function $t_u$, and a token sharing policy model p. All three are implemented in a differentiable manner so that they can be inserted into any existing Transformer-based model and trained end to end.

The role of the token sharing policy network is to predict, for each superpatch, whether its patches can share one token without degrading segmentation quality. The present invention does this by training an efficient CNN model that predicts whether a superpatch contains a single semantic class, without predicting which class it is. If a superpatch contains only one class, the image patches in it can share one token. The model is defined as P = p(I). The network p can be implemented with any lightweight network, since it only needs to perform binary classification on each superpatch; in the present invention, Lite0, the version with the fewest parameters in the EfficientNet series, is used. Based on the output of p, the S highest-scoring superpatches are selected for token sharing. The four image patches in each of these S 2×2 superpatches share one token, so M = N − 3S tokens remain, where N is the number of image patches into which image I is divided. These M tokens are sent into the Transformer network for learning.
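The selection rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the patented implementation: the function names `select_shared_superpatches` and `remaining_tokens`, the score values, and the use of a plain top-S selection over a score grid are assumptions made for the example. It also shows how the token count M = N − 3S translates into a rough attention-cost ratio, assuming self-attention cost scales with the square of the token count.

```python
import numpy as np

def select_shared_superpatches(scores, num_shared):
    """Mark the num_shared highest-scoring superpatches (hypothetical helper).

    scores: 2-D array with one policy score per 2x2 superpatch.
    Returns a boolean array of the same shape (True = share one token).
    """
    flat = scores.ravel()
    # Stable sort on negated scores gives indices in descending score order.
    top = np.argsort(-flat, kind="stable")[:num_shared]
    mask = np.zeros(flat.shape, dtype=bool)
    mask[top] = True
    return mask.reshape(scores.shape)

def remaining_tokens(num_patches, num_shared):
    """M = N - 3S: each shared 2x2 superpatch replaces four tokens by one."""
    return num_patches - 3 * num_shared

# A 4x4 grid of patches (N = 16) forms a 2x2 grid of superpatches.
scores = np.array([[0.9, 0.2],
                   [0.7, 0.4]])  # illustrative scores, not real model output
policy = select_shared_superpatches(scores, num_shared=2)
M = remaining_tokens(num_patches=16, num_shared=2)
# Rough relative self-attention cost, assuming cost proportional to tokens^2.
attention_ratio = (M / 16) ** 2
```

With these illustrative numbers, two of the four superpatches are shared, ten tokens remain, and the quadratic attention term drops to roughly 39% of its original value.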

S2. Train the token sharing policy network p until it converges, then input the image I into p to obtain the sharing policy P:

The token sharing policy network is trained separately from the backbone network and is class-agnostic. The labels needed to train p are generated directly from the segmentation labels of the semantic segmentation dataset. Specifically, for each superpatch in an image, the label for p is set to true if the superpatch contains only one semantic class in the segmentation labels, and to false otherwise. With these binary labels, p is trained using the standard cross-entropy loss. The converged policy model p is used in the subsequent steps.
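The label-generation rule described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the function name `superpatch_labels`, the 16-pixel patch size, and the requirement that the label map divides evenly into superpatches are choices made for the example, and special handling of "ignore" labels, which real datasets often need, is left out.

```python
import numpy as np

def superpatch_labels(seg, patch=16, group=2):
    """Binary training labels for the policy network (hypothetical helper).

    seg: (H, W) integer class map from the segmentation dataset.
    Returns a boolean array with one entry per superpatch, True where the
    superpatch contains exactly one semantic class.
    """
    size = patch * group                      # superpatch side length in pixels
    h, w = seg.shape
    assert h % size == 0 and w % size == 0, "label map must tile evenly"
    blocks = seg.reshape(h // size, size, w // size, size)
    # A block holds a single class exactly when its min and max labels agree.
    return blocks.min(axis=(1, 3)) == blocks.max(axis=(1, 3))

# 64x64 label map: class 0 everywhere except a class-1 region that cuts
# through the top-right superpatch (rows 0-31, columns 32-63).
seg = np.zeros((64, 64), dtype=np.int64)
seg[0:32, 40:] = 1
labels = superpatch_labels(seg)
```

Because the labels depend only on whether a superpatch is uniform, not on which class it holds, the same function applies to any dataset without change.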

The token sharing policy model p can be trained in this simple and efficient way because semantic segmentation tasks come with readily available semantic labels. Segmentation annotations provide a label for every pixel, which allows the assumption about the redundancy of semantically similar image patches to be converted directly into a binary label for each superpatch. In addition, the class-agnostic nature of p is key to its efficiency: binary classification is much simpler than multi-class segmentation, so it can be performed by an efficient model, which is essential if reducing the number of tokens is to yield a net efficiency gain.

The image I is input into the trained token sharing policy model p to obtain the sharing policy P, which indicates, for each 2×2 superpatch, whether its image patches can share a token.

S3. Divide the input image I into N image patches $I_{patch}$:

As shown in FIG. 2, the Transformer-based backbone takes image patches as input. The present invention first divides the input image into N patches and defines each group of 2×2 adjacent image patches as a superpatch.
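The partitioning in step S3 can be illustrated with a short NumPy sketch. The helper names `patchify` and `superpatch_index`, the 16-pixel patch size, and the row-major numbering of superpatches are assumptions for the example, not details given in the source.

```python
import numpy as np

def patchify(image, patch=16):
    """Split a (C, H, W) image into a grid of patches (hypothetical helper).

    Returns an array of shape (H/patch, W/patch, C, patch, patch).
    """
    c, h, w = image.shape
    gh, gw = h // patch, w // patch
    x = image.reshape(c, gh, patch, gw, patch)
    return x.transpose(1, 3, 0, 2, 4)

def superpatch_index(gh, gw, group=2):
    """For each patch in a (gh, gw) grid, the index of its 2x2 superpatch."""
    rows = np.arange(gh)[:, None] // group
    cols = np.arange(gw)[None, :] // group
    return rows * (gw // group) + cols

image = np.arange(3 * 64 * 64, dtype=np.float32).reshape(3, 64, 64)
patches = patchify(image)            # 4x4 grid, so N = 16 patches
sp_index = superpatch_index(4, 4)    # 4 superpatches of 2x2 patches each
```

Here a 64×64 image yields N = 16 patches, and `sp_index` assigns each patch to one of four superpatches.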

S4. Based on the ViT semantic segmentation architecture, input the image patches $I_{patch}$ and the sharing policy P into the token sharing function $t_s$ to obtain a reduced token set T':

The token sharing function $t_s$ (that is, the token sharing module, which comprises the operations below) maps the original image patches $I_{patch}$ and their position encodings to a new set $I' = \{I'_1, \dots, I'_M\}$, where M < N, with corresponding position encodings. The mapping follows the predicted sharing policy P, which specifies in which superpatches the image patches share a token and in which superpatches the patches keep their original tokens. After obtaining I', the token sharing function $t_s$ projects it to the token set T' for subsequent processing by the Transformer network. This process can be formulated as:

$$T' = t_s(I_{patch}, P),$$

where $t_s$, according to the sharing policy P, downsamples each superpatch that can share a token to the dimensions of a single image patch and linearly projects it into a single token, so that all image patches in the 2×2 superpatch share that token.
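A minimal NumPy sketch of such a token sharing function is given below. Several details are assumptions made for illustration and are not specified in the source: the name `share_tokens`, 2×2 average pooling as the downsampling operator, a single projection matrix `w_proj` applied to flattened patches, superpatch-by-superpatch token ordering, and the omission of position encodings.

```python
import numpy as np

def share_tokens(image, policy, w_proj, patch=16):
    """Sketch of t_s: build the reduced token set T' (hypothetical helper).

    image:  (C, H, W) float array.
    policy: (H/(2*patch), W/(2*patch)) boolean array, True = share one token.
    w_proj: (C*patch*patch, E) linear projection matrix.
    Returns an (M, E) array of tokens, ordered superpatch by superpatch.
    """
    c, h, w = image.shape
    sp = 2 * patch
    tokens = []
    for i in range(h // sp):
        for j in range(w // sp):
            block = image[:, i * sp:(i + 1) * sp, j * sp:(j + 1) * sp]
            if policy[i, j]:
                # Assumed downsampling: 2x2 average pooling to patch size.
                small = block.reshape(c, patch, 2, patch, 2).mean(axis=(2, 4))
                tokens.append(small.reshape(-1) @ w_proj)
            else:
                # Unshared superpatch: each of its four patches keeps a token.
                for di in range(2):
                    for dj in range(2):
                        p = block[:, di * patch:(di + 1) * patch,
                                  dj * patch:(dj + 1) * patch]
                        tokens.append(p.reshape(-1) @ w_proj)
    return np.stack(tokens)

rng = np.random.default_rng(0)
image = rng.standard_normal((3, 64, 64))
w_proj = rng.standard_normal((3 * 16 * 16, 8))
policy = np.array([[True, False],
                   [False, True]])
tokens = share_tokens(image, policy, w_proj)   # N = 16, S = 2, so M = 10
```

The explicit loops keep the correspondence with the text visible; a practical implementation would gather shared and unshared superpatches in batches instead.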

S5.将精简的令牌集合T’输入Transformer网络,得到一组包含语义信息的预测令牌集合LS5. Input the simplified token set T' into the Transformer network to obtain a set of predicted tokens L containing semantic information:

After the reduced token set T' is obtained, it is input into the Transformer network f for feature learning, yielding a set of predicted tokens L = {L_1, ..., L_M} containing semantic information. This process can be formulated as:

$$L = f(T'),$$

where the Transformer network f can be any semantic segmentation network based on the Transformer architecture. In the present invention, the classic ViT-based segmentation network Segmenter is used as the network f.
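To see why a smaller token set helps the Transformer network f, the following sketch computes the reduced token count and the relative size of the pairwise self-attention term. It relies on two facts: each shared super-block replaces four patch tokens with one, and the attention score matrix grows with the square of the token count. The helper names and the example figures (1024 patches, 128 shared super-blocks) are hypothetical, and the ratio ignores the parts of the network whose cost is linear in the number of tokens.

```python
def reduced_token_count(n_patches, n_shared_superblocks):
    """M = N - 3K: each shared super-block turns four tokens into one."""
    return n_patches - 3 * n_shared_superblocks


def attention_pair_ratio(n_tokens_before, n_tokens_after):
    """Ratio of pairwise attention scores after vs. before sharing (M^2 / N^2)."""
    return (n_tokens_after ** 2) / (n_tokens_before ** 2)


# Hypothetical example: N = 1024 patches form 256 super-blocks, 128 of them shared.
n = 1024
m = reduced_token_count(n, 128)
ratio = attention_pair_ratio(n, m)
print(m, ratio)
```

Under these assumed numbers, sharing half of the super-blocks leaves M = 640 tokens and shrinks the attention score matrix to about 39% of its original size.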

S6. Input the predicted token set L and the sharing strategy P into the token distribution function t_u, which distributes and upsamples the predicted tokens according to the sharing strategy P and reassembles them to obtain the predicted spatial feature L_spat:

The predicted token set L and the sharing strategy P are input into the token distribution function t_u. The token distribution function t_u (i.e., the token distribution module, comprising the operations described below) first restores the token dimensions by bilinear upsampling according to the sharing strategy P, that is, it undoes the token sharing of each image patch, and then spatially reassembles the tokens into the predicted spatial feature L_spat. In this process, the sharing strategy P provides the indices of the image patches whose tokens were shared in step S4, and the token distribution function t_u upsamples based on these indices. This process can be formulated as:

$$L_{spat} = t_u(L, P).$$
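The distribution step can be sketched as a scatter back onto the patch grid. The sketch assumes each shared token is upsampled on its own from a 1×1 map to 2×2; with a single source value, bilinear interpolation reduces to copying that token to all four patch positions. The `index` list, which records the patch coordinates each token represents, is a hypothetical stand-in for the indices that the sharing strategy P provides.

```python
def token_distribute(tokens, index, rows, cols):
    """Sketch of t_u: scatter M predicted tokens onto a rows×cols patch grid.

    A shared token is copied to every patch position it represents; an
    unshared token goes back to its single original position.
    """
    spatial = [[None] * cols for _ in range(rows)]
    for token, coords in zip(tokens, index):
        for r, c in coords:
            spatial[r][c] = list(token)
    return spatial


# Toy 2×4 patch grid: the left super-block is shared, the right one is not.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0], [0.0, 3.0], [0.0, 4.0]]
index = [
    [(0, 0), (0, 1), (1, 0), (1, 1)],  # one token covering four patches
    [(0, 2)], [(0, 3)], [(1, 2)], [(1, 3)],
]
spatial = token_distribute(tokens, index, rows=2, cols=4)
```

After the scatter, every grid position holds a feature vector again, so the result has the regular spatial layout a convolutional decoder expects.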

S7. Input the predicted spatial feature L_spat into the decoder to obtain the final semantic segmentation prediction:

The predicted spatial feature L_spat is input into the decoder g to obtain the final semantic segmentation prediction O. This process can be formulated as:

$$O = g(L_{spat}),$$

where the decoder g can adopt a variety of network structures. In the present invention, for efficiency, the CNN-based UPerNet is used as the decoder to decode the features.
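Read together, steps S2 and S4 through S7 compose into a single mapping from the input image to the segmentation output. The following block restates that composition using the symbols defined in the text (p, t_s, f, t_u, g); it is a summary of the described data flow under that reading, not an additional formula from the original.

```latex
\begin{aligned}
P        &= p(I)                 && \text{(S2: predict the sharing strategy)} \\
T'       &= t_s(I_{patch}, P)    && \text{(S4: share tokens, } M < N\text{)} \\
L        &= f(T')                && \text{(S5: Transformer network)} \\
L_{spat} &= t_u(L, P)            && \text{(S6: distribute and reassemble)} \\
O        &= g(L_{spat})          && \text{(S7: decode)} \\[4pt]
O        &= g\bigl(t_u\bigl(f\bigl(t_s(I_{patch}, P)\bigr), P\bigr)\bigr)
\end{aligned}
```

The sharing strategy P appears twice: once to merge tokens before f and once to undo the merge after it.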

Embodiment 2: An efficient visual ViT semantic segmentation system based on content awareness and token sharing, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, it implements the efficient visual ViT semantic segmentation method based on content awareness and token sharing described in Embodiment 1.

The above further describes the present invention with reference to the embodiments; the scope of protection of the present invention is not limited thereto.

Claims (8)

1. An efficient visual ViT semantic segmentation method based on content awareness and token sharing, characterized by comprising the following steps:
S1. Establish a token sharing strategy network p to predict which super-blocks in the input image contain only one semantic category, group those containing a single semantic category together, and then let the image patches in each group of super-blocks share the same token;
S2. Train the token sharing strategy network p until it converges, then input the image I into the sharing strategy network p to obtain the sharing strategy P;
S3. Divide the input image I into N image patches I_patch;
S4. Based on the ViT semantic segmentation architecture, input the image patches I_patch and the token sharing strategy P into the token sharing function t_s to obtain a reduced token set T';
S5. Input the reduced token set T' into the Transformer network to obtain a set of predicted tokens L containing semantic information;
S6. Input the predicted token set L and the sharing strategy P into the token distribution function t_u, which distributes and upsamples the predicted tokens according to the sharing strategy P and reassembles them to obtain the predicted spatial feature L_spat;
S7. Input the predicted spatial feature L_spat into the decoder to obtain the final semantic segmentation prediction.

2. The efficient visual ViT semantic segmentation method based on content awareness and token sharing according to claim 1, characterized in that the token sharing strategy network p of step S1 is implemented by training a CNN model that predicts whether a super-block contains a single semantic category without predicting the specific category; if a super-block contains only one category, the image patches in it can share the same token, so the model only needs to perform binary classification on each super-block.

3. The efficient visual ViT semantic segmentation method based on content awareness and token sharing according to claim 1, characterized in that the training of the token sharing strategy network p in step S2 is separate from that of the backbone network and is category-agnostic, and the labels required for training the token sharing strategy network p are generated directly from the segmentation labels of the semantic segmentation dataset: for each super-block in the image, if it has only one semantic category in the segmentation label, the label for the token sharing strategy network p is set to true, otherwise it is set to false; using these binary labels, the token sharing strategy network p is trained to convergence with a standard cross-entropy loss.

4. The efficient visual ViT semantic segmentation method based on content awareness and token sharing according to claim 1, characterized in that the token sharing function t_s of step S4 maps the original image patches I_patch and their position encodings to a new set I' = {I'_1, ..., I'_M}, where M < N, with corresponding position encodings; the mapping follows the predicted sharing strategy P, which specifies which super-blocks have their image patches share a token and which super-blocks leave their image patches unshared, each keeping its original token; after obtaining the set I', the token sharing function t_s projects it to the token set T'; this process can be formulated as:

$$T' = t_s(I_{patch}, P),$$

where, according to the sharing strategy P, the token sharing function t_s downsamples each super-block that is eligible for sharing to the dimensions of a single image patch and linearly projects it into a single token, so that the image patches in the super-block share one token.

5. The efficient visual ViT semantic segmentation method based on content awareness and token sharing according to claim 1, characterized in that the Transformer network of step S5 can be any semantic segmentation network based on the Transformer architecture.

6. The efficient visual ViT semantic segmentation method based on content awareness and token sharing according to claim 1, characterized in that the token distribution function t_u of step S6 first restores the token dimensions by bilinear upsampling according to the sharing strategy P, that is, it undoes the token sharing of each image patch, and then reassembles the tokens according to the positions of the image patches in the original image to obtain the predicted spatial feature L_spat.

7. The efficient visual ViT semantic segmentation method based on content awareness and token sharing according to claim 1, characterized in that the decoder of step S7 is the CNN-based UPerNet.

8. An efficient visual ViT semantic segmentation system based on content awareness and token sharing, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that when the processor executes the program, it implements the efficient visual ViT semantic segmentation method based on content awareness and token sharing according to any one of claims 1 to 7.
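The label-generation rule of claim 3 (a super-block is labeled true when its region of the segmentation label contains exactly one semantic category) can be sketched as below. This is an illustrative reading only: the function name `superblock_labels`, the toy 8×8 label map, and the patch size of 2 are assumptions, and the CNN and its cross-entropy training are not shown.

```python
def superblock_labels(seg, patch):
    """Derive binary training labels for the policy network from a
    segmentation label map (list of rows of class ids).

    Each super-block covers 2×2 patches, i.e. a (2*patch)×(2*patch) pixel
    region; its label is True when that region holds a single class.
    """
    size = 2 * patch
    h, w = len(seg), len(seg[0])
    assert h % size == 0 and w % size == 0, "label map must tile evenly"
    return [
        [
            len({seg[r][c] for r in range(i, i + size) for c in range(j, j + size)}) == 1
            for j in range(0, w, size)
        ]
        for i in range(0, h, size)
    ]


# Toy 8×8 label map with hypothetical class ids 0 to 3.
seg = [
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 2, 2],
    [0, 0, 0, 0, 1, 1, 2, 2],
    [3, 3, 3, 3, 3, 3, 3, 3],
    [3, 3, 3, 3, 3, 3, 3, 3],
    [3, 3, 3, 3, 3, 3, 3, 3],
    [3, 3, 3, 3, 3, 3, 3, 3],
]
labels = superblock_labels(seg, patch=2)
```

Only the top-right super-block mixes two categories, so it is the one super-block labeled false; the labels carry no information about which category a super-block contains.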
CN202411146624.4A 2024-08-21 2024-08-21 Efficient Visual ViT Semantic Segmentation Method Based on Content Awareness and Token Sharing Active CN118657948B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411146624.4A CN118657948B (en) 2024-08-21 2024-08-21 Efficient Visual ViT Semantic Segmentation Method Based on Content Awareness and Token Sharing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411146624.4A CN118657948B (en) 2024-08-21 2024-08-21 Efficient Visual ViT Semantic Segmentation Method Based on Content Awareness and Token Sharing

Publications (2)

Publication Number Publication Date
CN118657948A true CN118657948A (en) 2024-09-17
CN118657948B CN118657948B (en) 2024-11-08

Family

ID=92702542

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411146624.4A Active CN118657948B (en) 2024-08-21 2024-08-21 Efficient Visual ViT Semantic Segmentation Method Based on Content Awareness and Token Sharing

Country Status (1)

Country Link
CN (1) CN118657948B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120953425A (en) * 2025-10-20 2025-11-14 平安创科科技(北京)有限公司 Method, apparatus, device and medium for generating visual tokens based on shared index
CN121686114A (en) * 2026-02-07 2026-03-17 交通运输部科学研究院 Road ecological landscape dynamic classification method and system based on landscape visual characteristics

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113887610A (en) * 2021-09-29 2022-01-04 内蒙古工业大学 Pollen Image Classification Method Based on Cross-Attention Distillation Transformer
CN116503595A (en) * 2023-03-31 2023-07-28 华为技术有限公司 Instance segmentation method, device and storage medium based on point supervision
CN116664425A (en) * 2023-05-25 2023-08-29 广东机电职业技术学院 Image restoration method based on pseudo-random mask self-encoder
CN117036714A (en) * 2023-10-09 2023-11-10 安徽大学 Intestinal polyp segmentation method, system and medium integrating hybrid attention mechanism
WO2023225335A1 (en) * 2022-05-19 2023-11-23 Google Llc Performing computer vision tasks by generating sequences of tokens
CN117275040A (en) * 2023-10-07 2023-12-22 浙江理工大学 An efficient human pose estimation method based on decision-making network and refined features
CN117475278A (en) * 2023-10-30 2024-01-30 安徽大学 Guided vehicle-centered multi-modal pre-training system and method based on structural information
WO2024131380A1 (en) * 2022-12-19 2024-06-27 华南理工大学 Deep learning-based intelligent recognition method and system for locomotive signboard information

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘栋;李素;曹志冬;: "深度学习及其在图像物体分类与检测中的应用综述", 计算机科学, no. 12, 15 December 2016 (2016-12-15) *
青晨;禹晶;肖创柏;段娟;: "深度卷积神经网络图像语义分割研究进展", 中国图象图形学报, no. 06, 16 June 2020 (2020-06-16) *

Also Published As

Publication number Publication date
CN118657948B (en) 2024-11-08


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant