CN120953442A - A unified method and system for image generation and understanding based on joint diffusion modeling - Google Patents

A unified method and system for image generation and understanding based on joint diffusion modeling

Info

Publication number
CN120953442A
Authority
CN
China
Prior art keywords
domain
training
model
image
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202511205486.7A
Other languages
Chinese (zh)
Inventor
官子涵
何振梁
徐逸峰
阚美娜
山世光
陈熙霖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Seetatech Beijing Technology Co ltd
Institute of Computing Technology of CAS
Original Assignee
Seetatech Beijing Technology Co ltd
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Seetatech Beijing Technology Co ltd, Institute of Computing Technology of CAS
Priority to CN202511205486.7A
Publication of CN120953442A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00Two-dimensional [2D] image generation
    • G06T11/60Creating or editing images; Combining images with text
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

This invention discloses a unified method and system for image generation and understanding based on joint diffusion modeling, relating to the technical field of image generation and understanding. The invention unifies image generation and understanding tasks through joint diffusion modeling, so that separate models need not be designed for generation and for understanding, which improves efficiency. Improved DINOv2, Segmenter and DETR models enhance the cluster compactness of classification features, the boundary detail of segmentation, and the feature representation of small and medium-sized objects in detection. A random role assignment mechanism and a masked full-attention mechanism flexibly handle multi-domain information, while domain-invariant positional encoding assists cross-domain alignment and improves modeling accuracy. Through optimized training, the model simultaneously supports joint generation, controllable generation and image perception tasks; its performance is superior to existing unified models and, on tasks such as edge detection, even exceeds that of specialized models.

Description

Unified image generation and understanding method and system based on joint diffusion modeling
Technical Field
The invention relates to the technical field of image generation and understanding, in particular to an image generation and understanding unification method and system based on joint diffusion modeling.
Background
Visual generation and understanding are generally regarded as two entirely different classes of tasks and are handled in different ways: generation tasks typically rely on generative models, represented by diffusion models, while understanding tasks are usually cast as classification or regression problems and solved by discriminative models. Completing both generation and understanding within a single model is therefore a valuable and meaningful goal.
Among generation methods, controllable generation takes interpretable labels such as object detection boxes, semantic segmentation maps and human pose maps as control conditions to generate the corresponding images, but lacks the ability to recover the labels from an image in the reverse direction. Methods that mine semantic information from a pre-trained diffusion model do not, in essence, jointly model the generation and understanding tasks; moreover, the discriminative feature modeling capability of a pre-trained diffusion model is limited, and its performance lags behind that of conventional discriminative models. Methods that convert an understanding task into an image-label generation task are mostly used for dense prediction and are not applicable to understanding tasks such as segmentation, detection and classification. For example, image segmentation requires outputting a label map (the semantic category of each pixel); representing it as an RGB map introduces uncertainty in the color mapping, and the labels may be mapped into a concentrated part of RGB space, which makes learning difficult. Object detection requires explicit bounding-box coordinates and class labels, which are sparse, are not in image format, and cannot be represented by a pixel image. Image classification requires class labels, which cannot be modeled in RGB. As for joint modeling, some recent approaches attempt to model the conditional distributions in both directions, image-to-label and label-to-image, at the same time, but do not model the joint distribution of the two. In some unified methods, a single set of model parameters is shared across all tasks, so unification is achieved at the model level, but different tasks are still handled separately, and the image domain and all task output domains are not unified. Existing methods therefore do not jointly model the generation and understanding tasks.
Disclosure of Invention
The invention aims to provide a unified image generation and understanding method and system based on joint diffusion modeling so as to address the above technical problems.
In order to achieve the above object, the embodiment of the present invention provides the following technical solutions:
A unified method of image generation and understanding based on joint diffusion modeling, comprising:
Determining an image domain, a label domain or a feature domain of a task to be generated/understood, and constructing an initial joint diffusion model based on the Flux architecture, wherein the task to be generated/understood comprises a generation task, an understanding task and a joint generation task, the image domain comprises an original image, the label domain comprises a depth map, a normal map, an albedo map, an edge map and a line drawing, and the feature domain comprises image segmentation features, object detection features and image classification features;
Determining a learning target based on an image classification model, an image segmentation model and an object detection model, and performing optimization training on the initial joint diffusion model in combination with an encoder and a random role assignment mechanism to obtain a joint diffusion model;
Based on the task to be generated/understood, outputting a task result through the joint diffusion model, thereby unifying image generation and understanding;
The image classification model adopts an improved DINOv2 self-supervised visual encoder, the image segmentation model adopts an improved Segmenter model, and the object detection model adopts an improved DETR model.
A unified image generation and understanding system based on joint diffusion modeling, comprising:
The data acquisition module, which is used for determining an image domain, a label domain or a feature domain of a task to be generated/understood;
The diffusion model construction and training module, which is used for determining a learning target based on the image classification model, the image segmentation model and the object detection model, and for performing optimization training on the initial joint diffusion model in combination with the encoder and the random role assignment mechanism to obtain the joint diffusion model;
The generation/understanding task completion module, which is used for outputting a task result through the joint diffusion model based on the task to be generated/understood, thereby unifying image generation and understanding.
The beneficial effects of the invention are as follows:
The invention unifies generation and understanding tasks through joint diffusion modeling, so that separate models need not be designed for generation and for understanding, which improves efficiency. The improved DINOv2, Segmenter and DETR models enhance the cluster compactness of classification features, the boundary detail of segmentation, and the feature representation of small and medium-sized objects in detection. The random role assignment mechanism and the masked full-attention mechanism flexibly process multi-domain information, and domain-invariant positional encoding assists cross-domain alignment, improving modeling accuracy. Through optimization training, the model simultaneously supports joint generation, controllable generation and image perception tasks; its performance is superior to that of existing unified models and, on tasks such as edge detection, even exceeds that of specialized models.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the method in embodiment 1 of the present invention;
FIG. 2 is a structural diagram of the initial joint diffusion model in embodiment 1 of the present invention;
FIG. 3 is a structural diagram of the improved DINOv2 self-supervised visual encoder in embodiment 1 of the present invention;
FIG. 4 is a structural diagram of the improved Segmenter model in embodiment 1 of the present invention;
FIG. 5 is a structural diagram of the improved DETR model in embodiment 1 of the present invention;
FIG. 6 is a schematic diagram of the joint diffusion model in embodiment 1 of the present invention;
FIG. 7 is a diagram illustrating the random role assignment mechanism in embodiment 1 of the present invention;
FIG. 8 is a structural diagram of the system in embodiment 2 of the present invention;
FIG. 9 is a diagram showing the joint generation task in embodiment 3 of the present invention;
FIG. 10 is a task comparison chart of the method and other models in embodiment 4 of the present invention;
FIG. 11 is a comparison of results for the domain-invariant positional encoding in embodiment 5 of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present invention.
Example 1:
Referring to FIG. 1, this embodiment provides a unified method for image generation and understanding based on joint diffusion modeling. The method illustrated in FIG. 1 may be executed by a software and/or hardware device. The execution body of the present application may include, but is not limited to, at least one of a user device, a network device, and the like. The user device may include, but is not limited to, a computer, a smartphone, a personal digital assistant (PDA), and similar electronic devices. The network device may include, but is not limited to, a single network server, a server group consisting of multiple network servers, or a cloud consisting of a large number of computers or network servers based on cloud computing, where cloud computing is a form of distributed computing in which a group of loosely coupled computers forms one virtual supercomputer. This embodiment is not limited thereto.
A unified method of image generation and understanding based on joint diffusion modeling, comprising:
S1, determining an image domain, a label domain or a feature domain of a task to be generated/understood, and constructing an initial joint diffusion model based on the Flux architecture, wherein the task to be generated/understood comprises a generation task, an understanding task and a joint generation task, the image domain comprises an original image, the label domain comprises a depth map, a normal map, an albedo map, an edge map and a line drawing, and the feature domain comprises image segmentation features, object detection features and image classification features;
Specifically, the joint generation task means that an image, all of its corresponding labels and its corresponding image features are generated simultaneously by the joint diffusion model. The generation task means that, given a number of labels and features, the joint diffusion model generates the corresponding image. The understanding task means that, given an image, the joint diffusion model simultaneously predicts multiple labels and image features, that is, generates the label domain and the feature domain.
When the task to be generated/understood is a joint generation task, the input of the joint diffusion model is text information, namely a description of the image to be generated, and the corresponding label domain and feature domain are produced together with the required image. When the task to be generated/understood is a generation task, the input of the joint diffusion model is several label domains and several feature domains, and the output is the image determined by those domains. When the task to be generated/understood is an understanding task, the input of the joint diffusion model is the image domain, and the corresponding output is the label domain and the feature domain. If different tasks are present at the same time, the domains corresponding to each task are input into separate joint diffusion models, which process them independently and output the respective results.
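The input/output pairing of the three task types can be summarized in a small lookup table. The sketch below is illustrative only: the role names "condition" and "target", the dictionary `TASK_IO` and the function `domain_roles` are hypothetical and do not appear in the source, which only states which domains are inputs and which are outputs for each task.

```python
# Hypothetical summary of which domains each task takes in and produces.
# Role names ("condition", "target") are assumptions used for illustration.
DOMAINS = ("image", "label", "feature")

TASK_IO = {
    # text description in; image, labels and features out
    "joint_generation": {"inputs": ("text",), "outputs": ("image", "label", "feature")},
    # labels and features in; image out
    "generation": {"inputs": ("label", "feature"), "outputs": ("image",)},
    # image in; labels and features out
    "understanding": {"inputs": ("image",), "outputs": ("label", "feature")},
}


def domain_roles(task):
    """Map every domain to 'condition' (given as input) or 'target' (to be generated)."""
    spec = TASK_IO[task]
    roles = {}
    for domain in DOMAINS:
        if domain in spec["inputs"]:
            roles[domain] = "condition"
        elif domain in spec["outputs"]:
            roles[domain] = "target"
    return roles
```

For example, `domain_roles("understanding")` marks the image domain as the condition and the label and feature domains as targets.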
In this embodiment, the backbone of the Transformer model is based on the open-source FLUX model and is improved for multi-domain joint modeling. Unlike the original FLUX, which is mainly used for image generation, the method extends the model to cover the image domain, the label domain and the feature domain. To accomplish multi-domain joint modeling, the Flux architecture is adapted to process token sequences from M domains and to perform cross-domain feature interaction through a unified masked full-attention mechanism.
Thus, as shown in FIG. 2, the initial joint diffusion model comprises a Transformer model, wherein the Transformer model comprises a domain-invariant positional encoding layer and a Transformer module, and the Transformer module comprises a plurality of stacked Transformer layers;
The domain-invariant positional encoding layer adopts 2D sine-cosine encoding. Specifically, the same 2D sine-cosine code is added to the input tokens at the same spatial position in different domains, which provides explicit position information to the joint diffusion model.
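A minimal NumPy sketch of this idea follows. It assumes the commonly used fixed sine-cosine construction (frequency base 10000, half of the channels encoding the row and half the column) and that every domain is tokenized on the same height × width grid; the source specifies neither detail, and the function names are hypothetical.

```python
import numpy as np


def sincos_1d(positions, dim):
    """1D sine-cosine code for integer positions; returns (len(positions), dim)."""
    assert dim % 2 == 0
    # Assumed frequency schedule (base 10000), as in the standard Transformer encoding.
    omega = 1.0 / (10000 ** (np.arange(dim // 2) / (dim / 2)))
    angles = np.outer(positions, omega)                      # (n, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)


def sincos_2d(height, width, dim):
    """2D code: first half of the channels encodes the row, second half the column."""
    assert dim % 4 == 0
    rows, cols = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    enc_rows = sincos_1d(rows.reshape(-1), dim // 2)
    enc_cols = sincos_1d(cols.reshape(-1), dim // 2)
    return np.concatenate([enc_rows, enc_cols], axis=1)      # (height*width, dim)


def add_domain_invariant_encoding(tokens_by_domain, height, width):
    """Add the SAME positional code to the tokens of every domain.

    tokens_by_domain: dict mapping a domain name to an array of shape (height*width, dim).
    """
    dim = next(iter(tokens_by_domain.values())).shape[-1]
    pe = sincos_2d(height, width, dim)
    return {name: tokens + pe for name, tokens in tokens_by_domain.items()}
```

Because the code depends only on the spatial position and not on the domain, tokens describing the same location in, say, the image and the depth map receive identical position information.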
Each Transformer layer adopts a masked full-attention mechanism, which helps improve modeling accuracy. To enable the joint diffusion model to ignore invalid domains whose role type is X, a mask mechanism is introduced on top of the standard full-attention mechanism: a penalty $\log(m_j)$ is added to the attention score of each token marked as invalid, so that its attention weight is zero after the softmax function. The formula corresponding to the masked full-attention mechanism is:
$O_i = \sum_{j=1}^{N} \mathrm{softmax}_j\left(\frac{Q_i K_j^{\top}}{\sqrt{d}} + \log(m_j)\right) V_j, \quad i = 1, \dots, M$;
where softmax(·) denotes the softmax function, $O_i$ denotes the attention output for the i-th query, $Q_i$, $K_j$ and $V_j$ denote the i-th query vector, the j-th key vector and the j-th value vector, d denotes the dimension of $Q_i$, $K_j^{\top}$ denotes the transpose of the key vector, Σ(·) denotes summation, log(·) denotes the natural logarithm, $m_j$ denotes the j-th mask indicator with values in [0, 1], and M and N denote the number of queries and the number of keys/values, respectively. When $m_j$ is 0, the attention score of the j-th token becomes negative infinity, so that an invalid domain is completely masked and has no influence on the output.
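The masking behavior can be checked numerically. The following NumPy sketch assumes standard scaled dot-product attention with the $\log(m_j)$ penalty added to the scores before the softmax, a single attention head, and at least one valid token per query row; the function name `masked_full_attention` is hypothetical.

```python
import numpy as np


def masked_full_attention(Q, K, V, m):
    """Softmax attention with a log(m_j) penalty on the key/value tokens.

    Q: (M, d) queries; K, V: (N, d) keys and values;
    m: (N,) mask indicators in [0, 1], where 0 marks a token of an invalid domain.
    Assumes at least one entry of m is positive.
    """
    m = np.asarray(m, dtype=float)
    d = Q.shape[-1]
    with np.errstate(divide="ignore"):
        penalty = np.log(m)                                  # log(0) = -inf
    scores = Q @ K.T / np.sqrt(d) + penalty[None, :]         # (M, N)
    scores = scores - scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)                                 # exp(-inf) = 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```

With $m_j = 0$ the corresponding column of `weights` is exactly zero, so the result is the same as running attention over the valid tokens only.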
S2, determining a learning target based on an image classification model, an image segmentation model and an object detection model, and performing optimization training on the initial joint diffusion model in combination with an encoder and a random role assignment mechanism to obtain the joint diffusion model, wherein the image classification model adopts an improved DINOv2 self-supervised visual encoder, the image segmentation model adopts an improved Segmenter model, and the object detection model adopts an improved DETR model.
As shown in FIG. 3, the improved DINOv2 self-supervised visual encoder comprises a geometry-aware patch embedding layer, a relative-position Transformer layer and a channel selection compression module connected in series, wherein the relative-position Transformer layer comprises a plurality of stacked first Transformers, a class-cluster perception module is inserted between every two adjacent first Transformers, the remaining structure is the same as that of an existing Transformer, and the channel selection compression module comprises a global average pooling layer, two MLP layers and a weighted activation layer.
As shown in FIG. 4, the improved Segmenter model comprises an MS-CAP module, a channel attention layer and a segmentation encoder connected in series, wherein the segmentation encoder comprises a one-dimensional flattening layer, a position embedding layer and a plurality of stacked second Transformers connected in series; an SREB module (semantic residual enhancement module) is inserted between every two adjacent second Transformers to introduce cross-layer feature connections, fusing the output of the i-th second Transformer with the output of the (i+k)-th second Transformer; and the MS-CAP module comprises a plurality of parallel convolution layers followed in series by a channel attention layer and a feature fusion layer.
As shown in FIG. 5, the improved DETR model comprises a dual CNN backbone network, a channel attention layer, a Transformer-based encoder, a Transformer-based decoder and a prediction head connected in series, wherein the dual CNN backbone network comprises a global semantic CNN network and a target edge CNN network arranged in parallel, the target edge CNN network comprises a CNN network, a local perception convolution layer and a boundary response enhancement module connected in series, the Transformer-based encoder comprises a plurality of block layers, and a spatial guidance module is inserted between every two adjacent block layers. The global semantic CNN network has the same structure as the CNN network.
The local perception convolution layer comprises a convolution layer with a 3×3 kernel and a convolution layer with a 5×5 kernel. In the boundary response enhancement module, a boundary attention map is first generated through a 1×1 convolution with Sigmoid activation, and is then multiplied element by element with the original features, which enhances the features of the boundary region.
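The boundary response enhancement step is small enough to sketch directly. The NumPy example below assumes the 1×1 convolution has a single output channel, so that one attention map is shared by all feature channels; the source does not state the number of output channels, and the function names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def boundary_response_enhancement(features, weight, bias=0.0):
    """Gate features with a boundary attention map.

    features: (C, H, W) feature map from the preceding convolution layers.
    weight:   (C,) parameters of a 1x1 convolution with ONE output channel (assumption).
    Returns the enhanced features (C, H, W) and the attention map (H, W).
    """
    # A 1x1 convolution is a weighted sum over channels at each spatial location.
    logits = np.tensordot(weight, features, axes=([0], [0])) + bias   # (H, W)
    attention = sigmoid(logits)                                       # values in (0, 1)
    # Element-by-element multiplication, broadcast over the channel axis.
    return features * attention[None, :, :], attention
```

Locations where the learned map is close to 1 keep their features almost unchanged, while locations with a low boundary response are attenuated.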
As shown in FIG. 6, performing optimization training on the initial joint diffusion model to obtain the joint diffusion model comprises:
S2-1, acquiring a training image domain and a training label domain, and, based on the learning target, processing the training image domain through the image classification model, the image segmentation model and the object detection model to obtain a classification task domain, a segmentation task domain and a detection task domain, all of which serve as training feature domains;
Because generation tasks generally rely on generative models represented by diffusion models, while understanding tasks are usually cast as classification or regression problems and solved by discriminative models, an image classification model, an image segmentation model and an object detection model are selected to produce the corresponding training feature domains for training the initial joint diffusion model. As a result, the feature domains generated by the joint diffusion model are better suited to discriminative tasks (classification, detection and segmentation); the generated feature domains are subsequently fed to a decoder to carry out the discriminative task, thereby achieving image classification, image segmentation and object detection.
When the learning target is image classification, the original structure of the conventional DINOv2 self-supervised visual encoder suffers from insufficient inter-class separation, and its high-dimensional intermediate-layer features are difficult for a diffusion model to model and compress, which hinders effective learning by the diffusion model; a class-cluster perception module and a channel selection compression module are therefore introduced. The image classification model thus proceeds as follows:
S2-101, acquiring the training image domain, inputting it into the geometry-aware patch embedding layer, dividing the training image domain by operations such as rotation to obtain fixed-size classification patches, and converting the classification patches into classification tokens;
S2-102, extracting the classification spatial features of each classification token by using a first Transformer in the relative-position Transformer layer;
S2-103, performing class-cluster aggregation on the classification spatial features through the class-cluster perception module, and combining the attention weights and the aggregation vector to obtain the classification class-cluster perception features;
It should be noted that the original class token is used to perform semantic aggregation on the classification class-cluster perception features, guiding features of the same class to form high-density semantic clusters. Specifically, a group of class tokens is reserved for each first Transformer, that is, one classification class-cluster perception feature is selected as the class token, and the remaining features are semantically aggregated. Taking the first of the first Transformers as an example, class-cluster aggregation is performed on the input features, and the attention weight $A_i$ of each feature with respect to the class token is calculated from the query vector $Q_i$ corresponding to the i-th feature $x_i$ and the transpose $K_{cls}^{\top}$ of the key vector corresponding to the class token.
According to the attention weights, the corresponding aggregation vector is calculated and used as a residual to guide each feature toward the center of its class cluster, yielding the classification class-cluster perception feature:
$x_i' = x_i + A_i \cdot V_{cls}$;
where $x_i'$ denotes the classification class-cluster perception feature, $V_{cls}$ denotes the value vector corresponding to the class token, and $A_i \cdot V_{cls}$ denotes the aggregation vector.
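The residual update $x_i' = x_i + A_i \cdot V_{cls}$ can be sketched as follows. The exact form of the attention weight is not recoverable from the text, so this example assumes a scaled dot-product softmax over a small group of class tokens; the projection matrices `Wq`, `Wk`, `Wv` and the function name are hypothetical.

```python
import numpy as np


def class_cluster_aggregate(x, cls_tokens, Wq, Wk, Wv):
    """Pull each feature toward its class-cluster center with a residual update.

    x:          (n, d) classification spatial features.
    cls_tokens: (c, d) group of class tokens.
    Wq, Wk, Wv: (d, d) projection matrices (hypothetical parameters).
    The softmax over class tokens is an ASSUMPTION, not taken from the source.
    """
    d = x.shape[-1]
    Q = x @ Wq                                   # queries from the features
    K_cls = cls_tokens @ Wk                      # keys from the class tokens
    V_cls = cls_tokens @ Wv                      # values from the class tokens
    scores = Q @ K_cls.T / np.sqrt(d)            # (n, c)
    scores -= scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)           # attention weights A_i
    return x + A @ V_cls, A                      # x_i' = x_i + A_i . V_cls
```

Features whose queries align with the same class token receive nearly the same aggregation vector, which is what draws them into a tighter cluster.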
S2-104, repeating the operations of the first Transformer and the class-cluster perception module until the output of the last first Transformer is obtained;
S2-105, pooling the output of the last first Transformer through the channel selection compression module and calculating importance weights, and reducing the dimension of the pooled classification class-cluster perception features based on the importance weights to obtain the classification task domain, that is, the training feature domain.
Specifically, the channel selection compression module compresses the invalid, redundant dimensions in the high-dimensional output features of the last first Transformer, suppressing the activation of invalid channels and compressing the feature dimension, thereby improving the learning efficiency of the diffusion model. The global average S over the channel dimension is calculated by the global average pooling layer:
$S = \mathrm{GAP}(x') \in \mathbb{R}^{1 \times D}$;
where $\mathbb{R}^{1 \times D}$ denotes the space of real-valued 1×D vectors, GAP(·) denotes the global average pooling operation, which averages all elements of each feature map to obtain a single value per channel, and D denotes the feature dimension of the output of the last first Transformer.
The importance weight W is calculated by two MLP layers connected in series:
$W = \sigma(\mathrm{MLP}(S)) \in \mathbb{R}^{1 \times D}$;
where σ(·) denotes the activation function and MLP(·) denotes the multi-layer perceptron operation.
Important channels are selectively activated through the importance weight W and channels with low importance weights are discarded, giving the dimension-reduced feature X:
$X = x' \cdot W$.
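The three formulas above chain into a short computation. The NumPy sketch below assumes that σ is a Sigmoid, that the two MLP layers are linear layers with a ReLU in between, and that $x'$ is a token matrix of shape (n, D); none of these choices is stated in the source. It implements only the reweighting $X = x' \cdot W$, since the text gives no explicit rule for removing channels.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def channel_selection_compression(x, W1, b1, W2, b2):
    """Reweight channels by a learned importance vector.

    x:  (n, D) output tokens of the last first Transformer.
    W1: (D, H), b1: (H,), W2: (H, D), b2: (D,) two-layer MLP (hypothetical shapes).
    Returns the reweighted features X (n, D) and the importance weights W (1, D).
    """
    S = x.mean(axis=0, keepdims=True)            # S = GAP(x'), shape (1, D)
    hidden = np.maximum(S @ W1 + b1, 0.0)        # first MLP layer, ReLU assumed
    W = sigmoid(hidden @ W2 + b2)                # W = sigma(MLP(S)), shape (1, D)
    return x * W, W                              # X = x' . W (element-wise per channel)
```

Channels whose weight is close to zero contribute almost nothing to X, which is the sense in which invalid channel activations are suppressed.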
When the learning target is image segmentation, the image segmentation model proceeds as follows:
S2-111, acquiring the training image domain, dividing it by the MS-CAP module (multi-scale channel-aware embedding module), extracting patch tokens under different receptive fields in parallel, calculating the channel attention weights of the different patch tokens by the channel attention layer, and processing each set of patch tokens based on the channel attention weights to obtain the segmentation-enhanced multi-scale features;
Specifically, the conventional Segmenter model is deficient in modeling boundary detail and in the completeness of the semantic information carried by its intermediate-layer features, which hinders effective learning by the diffusion model; the MS-CAP module is therefore used to replace the original flatten and project modules. In the MS-CAP module, the training image domain is processed by multiple parallel convolution layers, and patch tokens under different receptive fields are extracted using convolution kernels of different sizes. A small convolution kernel (such as 3×3) has a small receptive field and focuses on capturing local details of the image, so it can accurately extract fine textures; a large convolution kernel (such as 7×7) has a large receptive field, captures information from a wider area, and focuses on overall structure and semantics.
Each set of patch tokens is then input into the channel attention layer, which analyzes the feature channels of each branch and adaptively selects the channels that are sensitive to boundaries and textures. By calculating the importance weight of each channel, the response of channels related to texture and semantic boundaries is enhanced and irrelevant channels are suppressed, which highlights the key feature information of the image and makes the joint diffusion model focus more on boundaries and textures.
Finally, the multi-scale features enhanced by channel attention are concatenated, integrating the enhanced features at different scales into a feature set rich in multi-scale information. The concatenated features are then uniformly projected to match the patch token embedding dimension required by the subsequent Transformer encoder. Position codes are added to record the spatial position of each feature in the original image and to preserve the spatial order of the features, so that spatial structure information is not lost during processing.
S2-112, inputting the segmentation-enhanced multi-scale features into a segmentation encoder, extracting semantic residual fusion features, and taking them as the segmentation task domain, thereby obtaining the training feature domain;
In the segmentation encoder, the segmentation-enhanced multi-scale features are divided into segmentation patches, which are input to a one-dimensional flattening layer that flattens each segmentation patch into a one-dimensional vector; each one-dimensional vector is embedded by the position embedding layer to obtain an embedded sequence; the first of the second Transformers performs feature encoding and context aggregation on the embedded sequence to obtain a first semantic residual fusion feature; the SREB module fuses the first semantic residual fusion feature with the output of the second Transformer k layers away, and the fused feature serves as the input of the next second Transformer; the operations of the second Transformers and the SREB module are repeated until the output of the last second Transformer is obtained, yielding the semantic residual fusion feature. In this embodiment, k is 2 or 3.
It should be explained that the SREB module fuses shallow and deep features by introducing cross-layer feature connections. This improves the transfer of semantic information between layers and reduces the loss of intermediate semantic information, so that the extracted features capture local edges, semantic subjects and global structure, and the supervision target for training the diffusion model is a complete semantic feature. The output F_i of the i-th second Transformer is fused with the output F_{i+k} of the (i+k)-th second Transformer according to the formula:
F'_{i+k} = F_{i+k} + gate(F_i);
where F'_{i+k} denotes the fused semantic residual fusion feature, and gate(·) denotes a convolution with a 1×1 kernel followed by a Sigmoid activation function.
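The gated fusion F'_{i+k} = F_{i+k} + gate(F_i) can be illustrated with a short sketch. On a token sequence, a 1×1 convolution reduces to the same linear map applied to every token, which is what `gate` does below; the weight matrix, bias and function names are illustrative assumptions rather than values from the source.

```python
import math

def gate(tokens, weight, bias):
    """1x1 convolution followed by a sigmoid, applied token by token.

    A 1x1 convolution only mixes channels at each position, so for a list of
    tokens it is a per-token linear map (weight: C_out x C_in, bias: C_out).
    """
    gated = []
    for tok in tokens:
        gated.append([
            1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(row, tok)) + b)))
            for row, b in zip(weight, bias)
        ])
    return gated

def sreb_fuse(f_i, f_ik, weight, bias):
    """Return F'_{i+k} = F_{i+k} + gate(F_i) for two equally shaped token lists."""
    g = gate(f_i, weight, bias)
    return [[a + b for a, b in zip(deep, shallow)] for deep, shallow in zip(f_ik, g)]

# One token with two channels; identity weights and zero bias (hypothetical).
shallow = [[0.0, 0.0]]          # F_i
deep = [[1.0, 2.0]]             # F_{i+k}
fused = sreb_fuse(shallow, deep, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

Because the sigmoid output lies in (0, 1), the shallow feature contributes a bounded correction to the deep feature instead of overwriting it.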
In the inference stage, the joint diffusion model generates the semantic segmentation features, which are fed into a two-layer lightweight Mask Transformer decoder that outputs a semantic mask probability map for each category. The final segmentation result map is obtained after a softmax operation and a per-category argmax operation.
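The final softmax and argmax step can be sketched as below. The decoder output is assumed to be a list of per-class logit maps indexed as `mask_logits[class][row][column]`; that layout and the function names are illustrative assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def segment(mask_logits):
    """Turn per-class logit maps into a label map via softmax then argmax."""
    n_cls = len(mask_logits)
    height, width = len(mask_logits[0]), len(mask_logits[0][0])
    labels = []
    for y in range(height):
        row = []
        for x in range(width):
            probs = softmax([mask_logits[c][y][x] for c in range(n_cls)])
            row.append(max(range(n_cls), key=probs.__getitem__))  # argmax class
        labels.append(row)
    return labels

# Two classes on a 1x2 image: class 0 wins the first pixel, class 1 the second.
label_map = segment([[[2.0, 0.0]], [[0.0, 3.0]]])
```

Since softmax is monotonic, the argmax over probabilities equals the argmax over logits; the probabilities are still useful when a confidence map is needed alongside the labels.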
When the learning target is target detection, the processing procedure of the target detection model is as follows:
S2-121, acquiring a training image domain, extracting global semantic information of the training image domain through a global semantic CNN network, extracting initial edge features of the training image domain through the CNN network, extracting detail textures of the initial edge features by utilizing a local perception convolution layer to obtain edge convolution features, and dynamically extracting edge enhancement features of the edge convolution features by utilizing a boundary response enhancement module and combining an edge detection operator and boundary attention gating;
S2-122, fusing the edge enhancement features and the global semantic information through a channel attention layer to obtain global-edge fusion features;
S2-123, encoding the global-edge fusion features through the first block layer to obtain global-fusion encoded information; extracting an edge response map of the training image domain with the spatial guidance module, and fusing the edge response map with the global-fusion encoded information through mask-attention modulation to obtain global-fusion guidance information;
It should be explained that, in the spatial guidance module, the global-fusion encoded information is modulated based on the edge response map extracted from the training image domain to obtain dynamically weighted attention features, which strengthens the response of the joint diffusion model to small targets and edge regions. Specifically, a low-resolution edge response map E is extracted from the input image and used as a positional guidance signal; the output feature of the current block layer is fused with the guidance map E to obtain a mask attention value M, which is then used to modulate the output feature F of the current block layer according to the formula:
F' = F ⊙ (1 + M);
which yields the global-fusion guidance information F'.
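The modulation F' = F ⊙ (1 + M) can be sketched as follows. The source does not specify how M is computed from F and E, so the sketch takes M as given and assumes one scalar mask value per token that is broadcast over the channels; `modulate` is a hypothetical helper name.

```python
def modulate(features, mask):
    """Apply F' = F * (1 + M) with one mask value per token (assumed layout)."""
    return [[v * (1.0 + m) for v in token] for token, m in zip(features, mask)]

# Two tokens with two channels; the second token lies on an edge (M = 1).
block_output = [[1.0, 2.0], [3.0, 4.0]]
mask_attention = [0.0, 1.0]
guided = modulate(block_output, mask_attention)
```

The residual form 1 + M means that tokens with M = 0 pass through unchanged, so the guidance can only amplify edge and small-target responses rather than suppress the rest of the feature map.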
The global-fusion guidance information is then fed into the next block layer. Across the block layers, each encoder layer can thus dynamically adjust where its features focus, and the guidance signal comes from the original input image (the training image domain) or from shallow branches, without introducing additional supervision.
S2-124, repeating the operations of the block layers and the spatial guidance module until the output of the last block layer is obtained; this output is taken as the detection domain, thereby obtaining the training feature domain.
In the training stage, the output of the last block layer is used as the target detection domain and serves as the supervision target for training the joint diffusion model. In the inference stage, the diffusion model can be used to generate the detection features.
S2-2, assigning roles to the training feature domain, the training image domain and the training label domain through an encoder and a random role assignment mechanism, and determining the corresponding role types;
Conventional methods cannot unify generation and understanding tasks, so a random role assignment mechanism is proposed. Assigning roles to the training feature domain, the training image domain and the training label domain through the encoder and the random role assignment mechanism, and determining the corresponding role types, comprises the following steps:
S2-2-1, based on the training generation/understanding task, inputting the training image domain, the training label domain and the training feature domain into the encoder to obtain training domain-encoded information, where the training generation/understanding task is a generation task or an understanding task. For example, when it is a generation task, the inputs to the encoder are the corresponding label domain and feature domain.
S2-2-2, processing the training domain-encoded information through the random role assignment mechanism to assign roles, and determining the role type and the assignment processing method: if the role type is G, the processing method is noise addition; if the role type is C, the data is kept as is; and if the role type is X, the data is set to zero;
Specifically, as shown in Fig. 7, if the role type is G (generate), the processing method is noise addition, that is, noise is added following the standard rectified flow training procedure and the velocity field is predicted; if the role type is C (condition), the data is kept as is; and if the role type is X (ignore), the data is set to zero and its influence is masked out by a masking mechanism. In Fig. 2, the switch represents the random role assignment mechanism.
S2-2-3, processing the training domain-encoded assignment information through the assignment processing method to obtain the joint diffusion model input data.
Specifically, in the formula corresponding to the random role assignment mechanism, the left-hand side denotes the input data of the initial joint diffusion model obtained by processing the domain-encoded information through the random role assignment mechanism, ρ_k denotes the assigned role, k denotes the domain index, τ denotes the time step, the encoded term denotes the domain-encoded assignment information, and ξ_k denotes the noise term;
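One form of the role assignment rule that is consistent with the G/C/X processing described in S2-2-2 is sketched below. It is an assumption, not a transcription of the patent's formula: the notation x^k for the domain-encoded information and x̃^k_τ for the model input is introduced here for illustration, and the noising rule assumes the usual rectified flow linear interpolation in which τ = 1 corresponds to pure noise.

```latex
\tilde{x}^{k}_{\tau} =
\begin{cases}
(1-\tau)\,x^{k} + \tau\,\xi_k, & \rho_k = G \quad \text{(add noise)} \\
x^{k}, & \rho_k = C \quad \text{(keep as is)} \\
0, & \rho_k = X \quad \text{(set to zero)}
\end{cases}
\qquad \xi_k \sim \mathcal{N}(0, I)
```

Under this convention the velocity along the path is ξ_k − x^k, which matches the description of the true velocity as noise minus true value.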
Different tasks are realized through the random role assignment mechanism: when the image domain is C and the other domains are G, understanding is realized; when the image domain is G and a certain label domain is C, controllable generation is realized; and when all domains are G, joint generation is realized. During training, multiple tasks are trained within a single model through the role assignment mechanism, thereby unifying generation and understanding.
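The mapping from task to roles, and the per-role processing, can be sketched as follows. Several details are assumptions made for illustration: the task names, the choice to mark unused domains as X during controllable generation, and the linear rectified flow noising rule used for role G.

```python
import random

def roles_for_task(task, domains, condition=None):
    """Return a role (G, C or X) for every domain name under a given task."""
    if task == "understanding":            # image conditions, everything else generated
        return {d: "C" if d == "image" else "G" for d in domains}
    if task == "controllable_generation":  # one label domain conditions the image
        return {d: "G" if d == "image" else ("C" if d == condition else "X")
                for d in domains}
    if task == "joint_generation":         # every domain is generated
        return {d: "G" for d in domains}
    raise ValueError(f"unknown task: {task}")

def apply_role(x, role, tau, rng):
    """Process one domain's encoded vector according to its role."""
    if role == "G":   # add noise (assumed linear interpolation toward noise)
        return [(1.0 - tau) * v + tau * rng.gauss(0.0, 1.0) for v in x]
    if role == "C":   # keep as is
        return list(x)
    if role == "X":   # set to zero
        return [0.0] * len(x)
    raise ValueError(f"unknown role: {role}")

domains = ["image", "depth", "normal"]
understanding = roles_for_task("understanding", domains)
controllable = roles_for_task("controllable_generation", domains, condition="depth")
rng = random.Random(0)
kept = apply_role([1.0, 2.0], "C", 0.3, rng)
zeroed = apply_role([1.0, 2.0], "X", 0.3, rng)
clean = apply_role([1.0, 2.0], "G", 0.0, rng)   # tau = 0 leaves the data untouched
```

Sampling the task (and therefore the role dictionary) at random for each training batch is one way to train all tasks within a single model.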
S2-3, acquiring training text information and, according to the training generation/understanding task, inputting the training text information and each joint diffusion model input data into the initial joint diffusion model to generate the corresponding training task result;
It should be noted that, before being fed into the initial joint diffusion model, the input data is divided into a number of training tokens. The processing of the initial joint diffusion model is explained below for three cases, according to the training generation/understanding task.
When the training generation/understanding task is a joint generation task, the training tokens corresponding to the training text information are input to the joint diffusion model; the roles of all domains are G and their tokens are randomly initialized noise. The model finally generates an image together with its corresponding label domains and feature domains (all generated simultaneously in a single inference pass, with no ordering among them).
When the training generation/understanding task is a generation task, the training tokens corresponding to the training label domain and the training feature domain are input to the domain-invariant positional encoding layer, which introduces positional information into the training tokens to obtain the corresponding training generation position tokens. The stacked Transformer layers then capture the relations among these position tokens, generate shallow and deep features at different layers, and fuse them to finally obtain the corresponding training image.
When the training generation/understanding task is an understanding task, the process is the same as for a generation task, except that the input data is a training image. In all three tasks, a domain to be generated is initialized with random noise, while a domain serving as a condition keeps its original tokens unchanged.
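How a domain with role G could be recovered from its random-noise initialization is sketched below using plain Euler integration of a velocity field. This is an illustration under assumptions: the path is taken to be the linear rectified flow interpolation with τ = 1 as pure noise, and the trained model is replaced by a toy oracle velocity for a known target so that the example is self-contained.

```python
import random

def euler_sample(velocity, noise, steps):
    """Integrate from tau = 1 (noise) down to tau = 0 (data) with Euler steps.

    velocity(x, tau) returns the predicted velocity (noise minus data) at x.
    """
    x = list(noise)
    for i in range(steps):
        tau = (steps - i) / steps
        v = velocity(x, tau)
        x = [xi - vi / steps for xi, vi in zip(x, v)]   # step size is 1 / steps
    return x

# Toy stand-in for the trained model: on the linear path
# x_tau = (1 - tau) * target + tau * noise, the velocity is (x_tau - target) / tau.
target = [0.5, -1.0]

def oracle_velocity(x, tau):
    return [(xi - ti) / tau for xi, ti in zip(x, target)]

rng = random.Random(0)
start = [rng.gauss(0.0, 1.0) for _ in target]
sample = euler_sample(oracle_velocity, start, steps=8)
```

In a joint setting only the tokens of domains with role G would be updated in this loop; tokens of conditioning domains (role C) stay fixed at their original values throughout.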
S2-4, calculating the diffusion loss function Loss based on each training task result, wherein: the outer operator denotes the expectation function; U(0,1) denotes a random variable sampled from the uniform distribution on [0,1]; ξ_{0:K} and ξ_k denote the noise terms with indices 0 to K and the noise term with index k, respectively; N(0,1) denotes a random variable sampled from the standard normal distribution with mean 0 and variance 1; the target-value terms denote, respectively, the target values with indices 0 to K in the training set and the target value of the k-th domain; D denotes the training set consisting of the training image domain, the training label domain and the training feature domain; Σ(·) denotes the summation function; ρ_k denotes the role; G denotes the role type; ||·|| denotes the norm; θ denotes the parameters of the initial joint diffusion model; τ denotes the time step; the time-indexed terms denote, respectively, the target values at time step τ of the domains with indices 0, 1 and K; and v_θ(·) denotes the velocity prediction model;
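A loss of the following form is consistent with the symbol definitions above and with the description of the true velocity as noise minus true value. It is a hedged reconstruction rather than the patent's exact formula: the notation x^k for the target of the k-th domain, x^k_τ for its value at time step τ, and v^k_θ for the k-th domain output of the velocity model is introduced here, and the squared norm is an assumption borrowed from standard flow matching.

```latex
\mathrm{Loss} =
\mathbb{E}_{\tau \sim U(0,1),\; \xi_{0:K} \sim \mathcal{N}(0,1),\; x^{0:K} \sim D}
\left[
  \sum_{k \,:\, \rho_k = G}
  \left\lVert
    v^{k}_{\theta}\!\left(x^{0}_{\tau}, x^{1}_{\tau}, \ldots, x^{K}_{\tau}, \tau\right)
    - \left(\xi_k - x^{k}\right)
  \right\rVert^{2}
\right]
```

Only domains whose role is G contribute to the sum, so conditioning (C) and ignored (X) domains shape the prediction without being supervised.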
The velocity prediction model is the model of the present invention. Under the diffusion framework, it predicts the velocity at which the data moves from its current state toward the real state, so that the real data is gradually recovered from noise. The loss function optimizes the model by minimizing the difference between the predicted velocity and the true velocity (noise minus true value), thereby improving generation quality.
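The role-masked velocity loss for a single training sample can be sketched as follows. The squared error and the function name `velocity_loss` are assumptions; the only points taken from the text are that the true velocity is noise minus true value and that only domains with role G are supervised.

```python
def velocity_loss(pred_v, targets, noises, roles):
    """Sum of squared velocity errors over the domains whose role is G."""
    total = 0.0
    for k, role in enumerate(roles):
        if role != "G":
            continue                      # C and X domains are not supervised
        true_v = [n - x for n, x in zip(noises[k], targets[k])]
        total += sum((p - t) ** 2 for p, t in zip(pred_v[k], true_v))
    return total

# Two domains: domain 0 is a condition, domain 1 is generated.
targets = [[1.0, 2.0], [0.0, 0.0]]
noises = [[0.0, 0.0], [1.0, 1.0]]
roles = ["C", "G"]
pred_v = [[9.0, 9.0], [1.0, 0.0]]   # the prediction for domain 0 is ignored
loss = velocity_loss(pred_v, targets, noises, roles)
```

In training, this quantity would be averaged over random time steps, noise draws and training samples before the gradient step of S2-5.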
S2-5, iterating the initial joint diffusion model based on the diffusion loss function, and adjusting weight parameters of the initial joint diffusion model to complete optimization training of the initial joint diffusion model.
S3, outputting a task result through the joint diffusion model based on the task to be generated/understood, thereby unifying image generation and understanding.
Example 2:
As shown in Fig. 8, a unified system for image generation and understanding based on joint diffusion modeling includes:
The data acquisition module, used for determining the image domain, label domain or feature domain of the task to be generated/understood;
The role assignment module, used for performing assignment through the encoder and the random role assignment mechanism to obtain the joint diffusion model input data;
The diffusion model construction and training module, used for constructing an initial joint diffusion model in combination with the Flux architecture, determining the learning target in combination with the image classification model, the image segmentation model and the target detection model, and performing optimization training on the initial joint diffusion model to obtain the joint diffusion model;
The generation/understanding task completion module, which inputs the generation/understanding task data into the joint diffusion model and outputs the task result, thereby unifying image generation and understanding.
The diffusion model construction and training module comprises:
the diffusion model construction sub-module, used for constructing an initial joint diffusion model in combination with the Flux architecture;
the classification task domain acquisition sub-module, used for acquiring a training image domain and processing it through the image classification model to obtain the classification task domain;
the segmentation domain acquisition sub-module, used for acquiring a training image domain and processing it through the image segmentation model to obtain the segmentation task domain;
the detection domain acquisition sub-module, used for acquiring a training image domain and processing it through the target detection model to obtain the detection domain;
the role assignment sub-module, used for assigning roles to the classification task domain, the segmentation task domain and the detection domain through the encoder and the random role assignment mechanism to obtain the corresponding role types;
the model optimization training sub-module, used for acquiring the training label domain and the training text information, calculating the diffusion loss function based on the classification task domain, the segmentation task domain and the detection domain, iterating the initial joint diffusion model based on the diffusion loss function, and adjusting the weight parameters of the initial joint diffusion model.
It should be noted that, regarding the system in the above embodiment, the specific manner in which the respective modules perform the operations has been described in detail in the embodiment regarding the method, and will not be described in detail herein.
Example 3:
The joint generation task is carried out with the method of Embodiment 1. As shown in Fig. 9, given an input text, the model simultaneously generates an image and seven kinds of corresponding labels. The generated images cover a wide range of aspect ratios, all at a resolution of around 1024.
Example 4:
The ControlNet, UniControl, EasyControl, OmniGen, PixWizard and OneDiffusion models are selected for comparison with the joint diffusion model of the invention. Table 1 compares performance for generation with label domains; compared with the other models, the joint diffusion model of the present invention achieves better performance under the various conditions. The metrics are LPIPS (learned perceptual image patch similarity) and FID (Fréchet inception distance), the latter measuring the distance between the distributions of generated and real images.
TABLE 1
Fig. 10 shows the results of the generation task obtained with the present method and with the OmniGen, OneDiffusion and PixWizard models. Compared with these models, the present method performs well on the generation task: its generated images conform more closely to the input labels and features, and its performance is far better than that of the OmniGen, OneDiffusion and PixWizard models.
The present method is then evaluated in turn on depth estimation, normal estimation, albedo estimation and edge detection.
TABLE 2 (depth estimation, absolute mean relative error)
Method                    NYUv2   ScanNet   DIODE
OmniGen                   9.2     10.1      30.6
PixWizard                 7.0     7.9       25.4
OneDiffusion              8.9     9.7       25.2
The method                10.1    12.1      25.9
The method (integrated)   8.3     9.9       25.8
TABLE 3 (normal estimation, mean angular error)
Method                    NYUv2   ScanNet   iBims
OmniGen                   28.9    28.9      31.3
PixWizard                 23.5    26.6      22.5
The method                21.1    24.3      20.1
The method (integrated)   18.6    20.3      18.2
TABLE 4 (albedo estimation on the Hypersim test set)
Method                    PSNR    LPIPS
Ordinal Shading           15.6    0.37
Kocsis et al.             11.3    0.49
Careaga and Aksoy         15.7    0.36
RGB2X                     20.6    0.18
The method                15.5    0.31
The method (integrated)   16.5    0.33
TABLE 5 (edge detection on BSDS500, F-score)
Method         ODS     OIS
HED            0.788   0.808
PiDiNet        0.807   0.823
OmniGen        0.767   0.781
PixWizard      0.605   0.633
OneDiffusion   0.682   0.691
The method     0.826   0.851
For depth estimation, the absolute mean relative error was measured on the NYUv2, ScanNet and DIODE datasets for the present method, its integrated model, and the OneDiffusion, OmniGen and PixWizard models. As shown in Table 2, with the errors ranked in ascending order (lowest error first), the present method and the present method (integrated) rank fifth and second on NYUv2, fifth and third on ScanNet, and fourth and third on DIODE.
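For reference, the absolute mean relative error and the ascending ranking used above can be sketched as follows. The metric is written in its common form (mean of |prediction − ground truth| / ground truth over valid pixels); the exact evaluation protocol of the source is not stated, so the validity rule (ground truth greater than zero) is an assumption. The ranking uses the NYUv2 column of Table 2.

```python
def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over pixels with valid (positive) ground truth."""
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(p - g) / g for p, g in valid) / len(valid)

example_error = abs_rel([1.1, 2.0, 4.0], [1.0, 2.0, 5.0])   # (0.1 + 0 + 0.2) / 3

# NYUv2 column of Table 2; lower error is better.
nyuv2 = {
    "OmniGen": 9.2,
    "PixWizard": 7.0,
    "OneDiffusion": 8.9,
    "The method": 10.1,
    "The method (integrated)": 8.3,
}
ascending = sorted(nyuv2, key=nyuv2.get)
rank = {name: position + 1 for position, name in enumerate(ascending)}
```

Applying the same sort to the ScanNet and DIODE columns reproduces the other ranks quoted in the text.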
For normal estimation, the mean angular error was measured on the NYUv2, ScanNet and iBims datasets for the present method, its integrated model, and the OmniGen and PixWizard models. As shown in Table 3, with the errors ranked in descending order (largest error first), the present method and its integrated model rank third and fourth, respectively, on NYUv2, ScanNet and iBims alike; that is, they have the two lowest errors on every dataset. Compared with the other models, the performance of the present method is more stable across the datasets.
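The mean angular error between predicted and ground-truth normals can be computed as in the sketch below. It uses the standard definition (angle between the two vectors, averaged over pixels, in degrees); the function name and the list-of-vectors layout are illustrative.

```python
import math

def mean_angular_error(pred, gt):
    """Average angle in degrees between corresponding 3D normal vectors."""
    total = 0.0
    for p, g in zip(pred, gt):
        dot = sum(a * b for a, b in zip(p, g))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in g))
        cosine = max(-1.0, min(1.0, dot / norm))   # clamp against rounding error
        total += math.degrees(math.acos(cosine))
    return total / len(pred)

# One exact match (0 degrees) and one perpendicular pair (90 degrees).
predicted = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
reference = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
angle_error = mean_angular_error(predicted, reference)
```

Dividing by the product of the norms makes the result independent of vector length, so the predictions do not need to be normalized beforehand.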
For albedo estimation, PSNR and LPIPS were measured on the Hypersim test set for the present method, its integrated model, and the Ordinal Shading, Kocsis et al., Careaga and Aksoy, and RGB2X models. As shown in Table 4, when ranked by value, the PSNR of the present method and its integrated model rank third and second, respectively, and the LPIPS of the present method and its integrated model rank second and third, respectively. The present method therefore outperforms the Ordinal Shading, Kocsis et al., and Careaga and Aksoy models.
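PSNR follows directly from the mean squared error, as the sketch below shows. It assumes pixel values normalized to [0, 1] (so `max_val` defaults to 1.0); LPIPS is not sketched because it depends on a pretrained network.

```python
import math

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB for two equally sized pixel lists."""
    mse = sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(pred)
    if mse == 0:
        return math.inf               # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

# Every pixel is off by 0.1, so MSE = 0.01 and PSNR = 20 dB.
albedo_psnr = psnr([0.5, 0.5], [0.4, 0.6])
```

Higher PSNR is better, whereas lower LPIPS is better, which is why the two columns of Table 4 are ranked in opposite directions.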
For edge detection, the F-scores at the optimal dataset scale (ODS) and at the optimal image scale (OIS) were measured on the BSDS500 dataset for the present method and the HED, PiDiNet, OmniGen, PixWizard and OneDiffusion models. As shown in Table 5, the present method attains the highest F-score at both ODS and OIS, outperforming all the other models.
Example 5:
This example compares the joint diffusion model with a variant from which the domain-invariant positional encoding layer has been removed. As shown in Fig. 11, without domain-invariant positional encoding the image domain and the label domain are clearly misaligned, and the problem is alleviated once the positional encoding is added, which demonstrates its effect. The domain-invariant positional encoding layer proposed by the present method therefore helps the joint diffusion model align spatial positions across domains.
The above describes only preferred embodiments of the present invention and is not intended to limit it; those skilled in the art may make various modifications and variations to the present invention. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within its scope of protection.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto; any variation or substitution that a person skilled in the art can readily conceive shall fall within the scope of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of protection of the claims.

Claims (10)

1.一种基于联合扩散建模的图像生成与理解统一方法,其特征在于,包括:1. A unified method for image generation and understanding based on joint diffusion modeling, characterized in that it includes: 确定待生成/理解任务的图像域或标签域或特征域;结合Flux架构构建初始联合扩散模型;所述待生成/理解任务包括生成任务、理解任务和联合分布任务;所述图像域包括原始图像;所述标签域包括深度图、法向图、反照率、边缘图和线稿;所述特征域包括图像分割特征、目标检测特征和图像分类特征;The image domain, label domain, or feature domain of the task to be generated/understood is determined; an initial joint diffusion model is constructed using the Flux architecture; the task to be generated/understood includes a generation task, an understanding task, and a joint distribution task; the image domain includes the original image; the label domain includes depth map, normal map, albedo, edge map, and line drawing; the feature domain includes image segmentation features, object detection features, and image classification features. 基于图像分类模型、图像分割模型和目标检测模型,确定学习目标,并结合编码器和随机角色分配机制对初始联合扩散模型进行优化训练,得到联合扩散模型;Based on image classification model, image segmentation model and object detection model, the learning objective is determined, and the initial joint diffusion model is optimized and trained by combining encoder and random role assignment mechanism to obtain joint diffusion model; 基于待生成/理解任务通过联合扩散模型,输出任务结果,完成对图像生成与理解的统一;Based on the task to be generated/understood, a joint diffusion model is used to output the task results, thus achieving the unification of image generation and understanding; 其中,图像分类模型采用改进的DINOv2自监督视觉编码器;图像分割模型采用改进的Segmenter模型;目标检测模型采用改进的DETR模型。The image classification model uses an improved DINOv2 self-supervised visual encoder; the image segmentation model uses an improved Segmenter model; and the object detection model uses an improved DETR model. 2.根据权利要求1所述的一种基于联合扩散建模的图像生成与理解统一方法,其特征在于,所述初始联合扩散模型包括Transformer模型;所述Transformer模型包括域不变位置编码层、Transformer模块;Transformer模块包括多个堆叠而成的Transformer层;域不变位置编码层采用2D正余弦编码;每个Transformer层均采用掩码全注意力机制;所述掩码全注意力机制对应的公式为:2. 
The unified method for image generation and understanding based on joint diffusion modeling according to claim 1, characterized in that the initial joint diffusion model includes a Transformer model; the Transformer model includes a domain-invariant positional encoding layer and a Transformer module; the Transformer module includes multiple stacked Transformer layers; the domain-invariant positional encoding layer uses 2D sine and cosine coding; each Transformer layer uses a masked full attention mechanism; the formula corresponding to the masked full attention mechanism is: 其中,softmax(·)表示最大值函数,Oi表示第i个Transformer层对应的注意力分数,Qi、Kj、Vj分别表示第i个token的查询向量、第j个token的键向量和第j个token的值向量,d表示Qi的维度,表示键向量的转置矩阵,∑(·)表示求和函数,log(·)表示以自然常数为底的对数函数,mj表示第j个token的掩码指示符,M、N分别表示查询数量、键或值数量。Where softmax(·) denotes the maximum value function, O <sub>i</sub> represents the attention score corresponding to the i-th Transformer layer, Q <sub>i</sub> , K <sub>j</sub> , and V<sub> j </sub> represent the query vector, key vector, and value vector of the i-th token, respectively, and d represents the dimension of Q <sub>i </sub>. Let represent the transpose of the key vector, ∑(·) represent the summation function, log(·) represent the logarithmic function with the natural constant as the base, mj represent the mask indicator of the j-th token, and M and N represent the number of queries and the number of keys or values, respectively. 3.根据权利要求1所述的一种基于联合扩散建模的图像生成与理解统一方法,其特征在于,所述改进的DINOv2自监督视觉编码器包括串联的几何感知Patch嵌入层、相对位置Transformer层和通道选择压缩模块;相对位置Transformer层包括多个堆叠的第一Transformer,且每两个第一Transformer之间插入一个类簇感知模块;3. 
The unified method for image generation and understanding based on joint diffusion modeling according to claim 1, characterized in that the improved DINOv2 self-supervised visual encoder includes a cascaded geometric perception patch embedding layer, a relative position Transformer layer and a channel selection compression module; the relative position Transformer layer includes multiple stacked first Transformers, and a cluster perception module is inserted between every two first Transformers. 所述改进的Segmenter模型包括串联的MS-CAP模块、通道注意力层、分割编码器;分割编码器包括串联的一维展平层、位置嵌入层和多个堆叠的第二Transformer,且每两个第二Transformer之间插入SREB模块,引入跨层特征连接,将第i个第二Transformer的输出与第i+k个第二Transformer的输出进行融合;The improved Segmenter model includes a cascaded MS-CAP module, a channel attention layer, and a segmentation encoder. The segmentation encoder includes a cascaded one-dimensional flattening layer, a position embedding layer, and multiple stacked second Transformers. An SREB module is inserted between every two second Transformers to introduce cross-layer feature connections and fuse the output of the i-th second Transformer with the output of the i+k-th second Transformer. 所述改进的DETR模型包括串联的双支CNN骨干网络、通道注意力层、基于Transformer的编码器、基于Transformer的解码器、预测头;双支CNN骨干网络包括并行的全局语义CNN网络、目标边缘CNN网络;目标边缘CNN网络包括串联的CNN网络、局部感知卷积层和边界响应增强模块;基于Transformer的编码器包括多个block层,且每两个block层之间插入一个空间引导模块。The improved DETR model includes a cascaded dual-branch CNN backbone network, channel attention layers, a Transformer-based encoder, a Transformer-based decoder, and a prediction head. The dual-branch CNN backbone network includes a parallel global semantic CNN network and a target edge CNN network. The target edge CNN network includes a cascaded CNN network, local perceptual convolutional layers, and a boundary response enhancement module. The Transformer-based encoder includes multiple block layers, with a spatial guidance module inserted between every two block layers. 4.根据权利要求3所述的一种基于联合扩散建模的图像生成与理解统一方法,其特征在于,所述对初始联合扩散模型进行优化训练,得到联合扩散模型,包括:4. 
The unified method for image generation and understanding based on joint diffusion modeling according to claim 3, characterized in that, the step of optimizing and training the initial joint diffusion model to obtain the joint diffusion model includes: 获取训练图像域和训练标签域;基于学习目标,通过图像分类模型、图像分割模型和目标检测模型处理训练图像域,得到分类任务domain、分割domain和检测domain,并均作为训练特征域;Obtain the training image domain and training label domain; based on the learning objective, process the training image domain through image classification model, image segmentation model and object detection model to obtain the classification task domain, segmentation domain and detection domain, all of which are used as training feature domains; 分别通过编码器和随机角色分配机制对训练特征域、训练图像域和训练标签域进行角色分配,确定对应的角色类型;基于各角色类型,分别对训练特征域、训练图像域和训练标签域对应的域-编码信息进行处理,得到对应的联合扩散模型输入数据;Roles are assigned to the training feature domain, training image domain, and training label domain respectively through encoder and random role assignment mechanism to determine the corresponding role type; based on each role type, the domain-encoding information corresponding to the training feature domain, training image domain, and training label domain is processed to obtain the corresponding joint diffusion model input data; 获取训练文字信息;按照训练生成/理解任务,将训练文字信息和各联合扩散模型输入数据输入至初始联合扩散模型,生成对应的训练任务结果;Obtain training text information; according to the training generation/understanding task, input the training text information and the input data of each joint diffusion model into the initial joint diffusion model to generate the corresponding training task results; 基于各训练任务结果,结合公式:Based on the results of each training task, combined with the formula: 计算扩散损失函数Loss;其中,表示期望函数,U(0,1)表示从均匀分布[0,1]中采样的随机变量,ξ0:K、ξk分别表示索引从0到K的噪声项、索引为k的噪声项,N(0,1)表示从均值为0、方差为1的标准正态分布中采样的随机变量,分别表示训练集中索引从0到K的目标值和第k个域的目标值,D表示由训练图像域、训练标签域和训练特征域构成的训练集,∑(·)表示求和函数,ρk表示角色,G表示角色类型,||·||表示范数,θ表示初始联合扩散模型的参数,τ表示时间步,分别表示时间步τ下的索引分别为0、1、K的域对应的目标值,vθ(·)表示速度预测模型;Calculate the diffusion loss function Loss; where, Let U(0,1) 
represent the expectation function, U(0,1) represent a random variable sampled from a uniform distribution [0,1], ξ <sub>0:K</sub> and ξ<sub> k </sub> represent the noise term with indices from 0 to K and k, respectively, and N(0,1) represent a random variable sampled from a standard normal distribution with mean 0 and variance 1. Let represent the target values in the training set from index 0 to K and the target value of the k-th domain, respectively; D represents the training set consisting of the training image domain, training label domain, and training feature domain; ∑(·) represents the summation function; ρk represents the role; G represents the role type; ||·|| represents the norm; θ represents the parameters of the initial joint diffusion model; and τ represents the time step. Let vθ(·) represent the target values corresponding to the fields with indices 0, 1, and K at time step τ, respectively, and (·) represent the velocity prediction model. 基于扩散损失函数对初始联合扩散模型进行迭代并调整初始联合扩散模型的权重参数,完成对初始联合扩散模型的优化训练。The initial joint diffusion model is iterated based on the diffusion loss function, and the weight parameters of the initial joint diffusion model are adjusted to complete the optimization training of the initial joint diffusion model. 5.根据权利要求4所述的一种基于联合扩散建模的图像生成与理解统一方法,其特征在于,所述分别通过编码器和随机角色分配机制对训练特征域、训练图像域和训练标签域进行角色分配,确定对应的角色类型,包括:5. 
The unified method for image generation and understanding based on joint diffusion modeling according to claim 4, characterized in that, the step of assigning roles to the training feature domain, training image domain, and training label domain respectively through an encoder and a random role assignment mechanism to determine the corresponding role type includes: 基于训练生成/理解任务,将训练图像域、训练标签域和训练特征域输入至编码器,得到训练域-编码信息;Based on the training generation/understanding task, the training image domain, training label domain, and training feature domain are input into the encoder to obtain training domain-encoding information; 通过随机角色分配机制处理训练域-编码信息进行角色分配,确定角色类型和分配处理方法;若角色类型为G时,则分配处理方法为加噪;若角色类型为C时,则分配处理方法为保持原样;若角色类型为X时,则分配处理方法为数据置零;A random role assignment mechanism is used to process the training domain-encoding information for role assignment, determining the role type and assignment method. If the role type is G, the assignment method is to add noise; if the role type is C, the assignment method is to keep it as is; if the role type is X, the assignment method is to set the data to zero. 通过分配处理方法对训练域-编码分配信息进行处理,得到联合扩散模型输入数据。The training domain-encoding allocation information is processed using an allocation processing method to obtain the input data for the joint diffusion model. 6.根据权利要求4所述的一种基于联合扩散建模的图像生成与理解统一方法,其特征在于,当学习目标为图像分类时,图像分类模型的处理过程为:6. The unified method for image generation and understanding based on joint diffusion modeling according to claim 4, characterized in that, when the learning objective is image classification, the processing procedure of the image classification model is as follows: 获取训练图像域并输入至几何感知Patch嵌入层,通过旋转等变卷积划分训练图像域,得到固定大小的分类Patch并转换为分类tokens;The training image domain is obtained and input into the geometry-aware Patch embedding layer. The training image domain is divided by rotational equivariant convolution to obtain a fixed-size classification Patch and convert it into classification tokens. 
The classification spatial features of each classification token are extracted using the first of the first Transformers in the relative-position Transformer layer.
The cluster-aware module performs cluster aggregation on the classification spatial features and, combining the attention weights and the aggregation vectors, obtains the classification cluster-aware features.
The operations of the first Transformer and the cluster-aware module are repeated until the output of the last first Transformer is obtained.
The channel selection compression module pools the output of the last first Transformer and computes importance weights; based on the importance weights, the dimensionality of the pooled classification cluster-aware features is reduced to obtain the classification task domain, that is, the training feature domain.

7. The unified method for image generation and understanding based on joint diffusion modeling according to claim 4, characterized in that, when the learning objective is image segmentation, the processing procedure of the image segmentation model is:
The training image domain is acquired and partitioned by the MS-CAP module, and patch tokens under different receptive fields are extracted in parallel; the channel attention weights of the different patch tokens are computed by a channel attention layer; each patch token is processed based on its channel attention weights to obtain the segmentation-enhanced multi-scale features.
The segmentation-enhanced multi-scale features are input into the segmentation encoder, and the semantic residual fusion features are extracted and used as the segmentation task domain, that is, the training feature domain.
In the segmentation encoder, the segmentation-enhanced multi-scale features are partitioned into segmentation patches, which are input into a one-dimensional flattening layer that flattens each segmentation patch into a one-dimensional vector; a positional embedding layer embeds each one-dimensional vector to obtain an embedding sequence; the first of the second Transformers performs feature encoding and context aggregation on the embedding sequence to obtain the first semantic residual fusion feature; the SREB module fuses the first semantic residual fusion feature with the output of the second Transformer that is k second Transformers away, and the result serves as the input of the next second Transformer; the operations of the second Transformer and the SREB module are repeated until the output of the last second Transformer is obtained, yielding the semantic residual fusion features.

8. The unified method for image generation and understanding based on joint diffusion modeling according to claim 4, characterized in that, when the learning objective is object detection, the processing procedure of the object detection model is:
The training image domain is acquired; global semantic information of the training image domain is extracted by a global semantic CNN; initial edge features of the training image domain are extracted by a CNN; a local-perception convolutional layer extracts the detail texture of the initial edge features to obtain edge convolution features; a boundary response enhancement module, combining an edge detection operator with boundary attention gating, dynamically extracts edge-enhanced features from the edge convolution features.
The edge-enhanced features and the global semantic information are fused by a channel attention layer to obtain global-edge fusion features.
The global-edge fusion features are encoded by the first block layer to obtain global-fusion encoding information; the spatial guidance module extracts the edge response map of the training image domain and, in combination with the mask attention mechanism, modulates and fuses the edge response map with the global-fusion encoding information to obtain global-fusion guidance information.
The operations of the block layer and the spatial guidance module are repeated until the output of the last block layer is obtained and used as the detection domain, that is, the training feature domain.

9. A unified system for image generation and understanding based on joint diffusion modeling, used to implement the unified method for image generation and understanding based on joint diffusion modeling according to any one of claims 1 to 8, characterized in that it comprises:
a data acquisition module, configured to determine the image domain, label domain, or feature domain of the generation/understanding task to be performed;
a diffusion model construction and training module, configured to determine the learning objective based on the image classification model, the image segmentation model, and the object detection model, and to optimize and train the initial joint diffusion model in combination with the encoder and the random role assignment mechanism to obtain the joint diffusion model;
a generation/understanding task completion module, configured to output the task result through the joint diffusion model based on the generation/understanding task to be performed, thereby unifying image generation and understanding.

10. The unified system for image generation and understanding based on joint diffusion modeling according to claim 9, characterized in that the diffusion model construction and training module comprises:
a diffusion model construction submodule, configured to build the initial joint diffusion model in conjunction with the Flux architecture;
a classification task domain acquisition submodule, configured to acquire the training image domain and process it through the image classification model to obtain the classification task domain;
a segmentation domain acquisition submodule, configured to acquire the training image domain and process it through the image segmentation model to obtain the segmentation task domain;
a detection domain acquisition submodule, configured to acquire the training image domain and process it through the object detection model to obtain the detection domain;
a role assignment submodule, configured to assign roles to the classification task domain, the segmentation task domain, and the detection domain through the encoder and the random role assignment mechanism, respectively, to obtain the corresponding role types, and, based on each role type, to process the domain-encoding information corresponding to the training feature domain, the training image domain, and the training label domain, respectively, to obtain the corresponding joint diffusion model input data;
a model optimization training submodule, configured to acquire the training label domain and the training text information; to compute the diffusion loss function according to the training generation/understanding task; and to iterate the initial joint diffusion model based on the diffusion loss function, adjusting its weight parameters.
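
The role assignment and loss computation recited in claims 5 and 10 can be illustrated with a minimal sketch. It is not the applicants' implementation. The claims state only that role G means adding noise, role C means keeping the data unchanged, role X means setting the data to zero, and that the loss is taken over domains whose role is G. Everything else here is an assumption made for illustration: the function names, the representation of each encoded domain as a flat list of floats, the linear interpolation between data and noise, the velocity target `noise - data`, the squared L2 norm, and the rule that at least one domain receives role G.

```python
import random

ROLES = ("G", "C", "X")  # G: add noise, C: keep unchanged, X: set to zero


def assign_roles(num_domains, rng):
    """Draw one role per domain at random (hypothetical helper).

    Assumption: at least one domain is forced to role G so the loss is non-trivial.
    """
    roles = [rng.choice(ROLES) for _ in range(num_domains)]
    if "G" not in roles:
        roles[rng.randrange(num_domains)] = "G"
    return roles


def build_inputs(latents, roles, tau, rng):
    """Apply the per-role processing from claim 5 to encoded domains."""
    inputs, targets = [], []
    for x, role in zip(latents, roles):
        if role == "G":
            noise = [rng.gauss(0.0, 1.0) for _ in x]
            # Assumed noising rule: linear interpolation between data and noise.
            inputs.append([(1.0 - tau) * xi + tau * ni for xi, ni in zip(x, noise)])
            # Assumed velocity target for that interpolation.
            targets.append([ni - xi for xi, ni in zip(x, noise)])
        elif role == "C":
            inputs.append(list(x))  # condition: passed through unchanged
            targets.append(None)
        elif role == "X":
            inputs.append([0.0] * len(x))  # excluded: zeroed out
            targets.append(None)
        else:
            raise ValueError(f"unknown role: {role}")
    return inputs, targets


def diffusion_loss(predictions, targets, roles):
    """Sum the squared error over domains with role G only."""
    total = 0.0
    for pred, tgt, role in zip(predictions, targets, roles):
        if role == "G":
            total += sum((p - t) ** 2 for p, t in zip(pred, tgt))
    return total


rng = random.Random(0)
latents = [[0.5, -1.0, 2.0], [1.0, 0.0, 1.0], [3.0, 3.0, 3.0]]
roles = ["G", "C", "X"]  # fixed here for readability; assign_roles draws them randomly
tau = rng.uniform(0.0, 1.0)  # time step sampled from U(0,1)
inputs, targets = build_inputs(latents, roles, tau, rng)
zero_predictions = [[0.0] * len(x) for x in latents]  # stand-in for the model output
loss = diffusion_loss(zero_predictions, targets, roles)
```

In this sketch only the first domain contributes to the loss; the conditioning domain and the zeroed domain shape the model input but are never penalized, which matches the role-dependent summation described for the diffusion loss.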
CN202511205486.7A 2025-08-27 2025-08-27 A unified method and system for image generation and understanding based on joint diffusion modeling Pending CN120953442A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202511205486.7A CN120953442A (en) 2025-08-27 2025-08-27 A unified method and system for image generation and understanding based on joint diffusion modeling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202511205486.7A CN120953442A (en) 2025-08-27 2025-08-27 A unified method and system for image generation and understanding based on joint diffusion modeling

Publications (1)

Publication Number Publication Date
CN120953442A true CN120953442A (en) 2025-11-14

Family

ID=97621608

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202511205486.7A Pending CN120953442A (en) 2025-08-27 2025-08-27 A unified method and system for image generation and understanding based on joint diffusion modeling

Country Status (1)

Country Link
CN (1) CN120953442A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN121388001A (en) * 2025-12-24 2026-01-23 北京啄木鸟云健康科技有限公司 Personalized health management system and method for children project

Similar Documents

Publication Publication Date Title
CN114970517B (en) Multi-modal interaction-based context awareness visual question-answering-oriented method
CN110490946B (en) Text image generation method based on cross-modal similarity and antagonism network generation
CN117974693B (en) Image segmentation method, device, computer equipment and storage medium
CN114612902B (en) Image semantic segmentation method, device, equipment, storage medium and program product
CN115222998A (en) An image classification method
CN110263912A (en) An Image Question Answering Method Based on Multi-object Association Depth Reasoning
CN119107305A (en) Multiscale Transformer for Image Analysis
WO2024114321A1 (en) Image data processing method and apparatus, computer device, computer-readable storage medium, and computer program product
CN107545276B (en) Multi-view learning method combining low-rank representation and sparse regression
CN115984700A (en) Remote sensing image change detection method based on improved Transformer twin network
CN114648535A (en) Food image segmentation method and system based on dynamic transform
CN119206209A (en) Lung image segmentation method, device and storage medium
CN113159053A (en) Image recognition method and device and computing equipment
CN118447520A (en) Classification type government affair document analysis method based on multi-mode large language model
CN118967714A (en) Medical image segmentation model establishment method based on harmonic attention and medical image segmentation method
CN119672340A (en) Semantic detail fusion and context enhancement remote sensing image segmentation method based on DeepLabv3+
CN120953442A (en) A unified method and system for image generation and understanding based on joint diffusion modeling
CN116740078A (en) Image segmentation processing methods, devices, equipment and media
CN120147730A (en) Image classification method and system based on CNN-Transformer staged fusion
Zheng et al. MATNet: Semantic segmentation of 3D point clouds with multiscale adaptive transformer
Ye et al. GFSCompNet: remote sensing image compression network based on global feature-assisted segmentation
CN115965789B (en) A semantic segmentation method for remote sensing images based on scene-aware attention
Zhao et al. Multi-scale learnable key-channel attention network for point cloud classification and segmentation
CN120429833B (en) Multi-modal neural learning network model, method, equipment and medium
CN119762721B (en) Multi-stage Mamba point cloud completion method and device based on semantic and geometric guidance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination