CN120953442A

CN120953442A - A unified method and system for image generation and understanding based on joint diffusion modeling

Info

Publication number: CN120953442A
Application number: CN202511205486.7A
Authority: CN
Inventors: 官子涵; 何振梁; 徐逸峰; 阚美娜; 山世光; 陈熙霖
Original assignee: Seetatech Beijing Technology Co ltd; Institute of Computing Technology of CAS
Current assignee: Seetatech Beijing Technology Co ltd; Institute of Computing Technology of CAS
Priority date: 2025-08-27
Filing date: 2025-08-27
Publication date: 2025-11-14

Abstract

This invention discloses a unified method and system for image generation and understanding based on joint diffusion modeling, belonging to the field of image generation and understanding technology. This invention unifies image generation and understanding tasks through joint diffusion modeling, eliminating the need to design separate models for generation and understanding tasks, thus improving efficiency. Improved DINOv2, Segmenter, and DETR models enhance the aggregation of classification feature clusters, segmentation boundary details, and the representation of small target features in detection. Random role assignment and masked full attention mechanisms flexibly handle multi-domain information, while domain-invariant positional encoding assists cross-domain alignment, improving modeling accuracy. Optimized training enables the model to simultaneously support joint generation, controlled generation, and image perception tasks, outperforming existing unified models and even surpassing proprietary models in tasks such as edge detection.

Description

Unified image generation and understanding method and system based on joint diffusion modeling

Technical Field

The invention relates to the technical field of image generation and understanding, in particular to an image generation and understanding unification method and system based on joint diffusion modeling.

Background

Visual generation and visual understanding are generally regarded as two entirely different classes of tasks that are handled in different ways: generation tasks typically rely on generative models, represented by diffusion models, while understanding tasks are usually cast as classification or regression problems and solved by discriminative models. Accomplishing both generation and understanding within a single model is therefore a valuable and meaningful goal.

Among generation methods, controllable generation takes interpretable labels such as target detection boxes, semantic segmentation maps and human pose maps as control conditions to generate the corresponding images, but lacks the ability to recover the labels from an image in the reverse direction. Methods that mine semantic information from a pre-trained diffusion model do not, in essence, model the generation and understanding tasks jointly; moreover, the discriminative feature modeling capability of a pre-trained diffusion model is limited, and its performance lags behind that of conventional discriminative models. Methods that convert an understanding task into an image-label generation task are mostly used for dense prediction and are not applicable to understanding tasks such as segmentation, detection and classification. For example, image segmentation requires outputting a label map (the semantic category of each pixel); representing it as an RGB map introduces uncertainty in the color mapping, and the labels may be mapped into a small, concentrated part of RGB space, which makes learning difficult. Target detection requires explicit bounding-box coordinates and class labels, which are sparse, are not in image format and cannot be represented by a pixel image. Image classification requires class labels, which cannot be modeled in RGB. As for joint modeling, some recent approaches attempt to model the conditional distributions in both directions (image to label and label to image) at the same time, but do not model the joint distribution of the two. In some unified methods, a single set of model parameters is shared across all tasks, so unification is achieved at the model level, but different tasks are still treated separately, and the image domain and all task output domains are not unified. Existing methods therefore do not jointly model the generation and understanding tasks.

Disclosure of Invention

The invention aims to provide a unified image generation and understanding method and system based on joint diffusion modeling, so as to address the above technical problems.

In order to achieve the above object, the embodiment of the present invention provides the following technical solutions:

a unified method of image generation and understanding based on joint diffusion modeling, comprising:

Determining an image domain, a label domain or a feature domain of a task to be generated/understood, and constructing an initial joint diffusion model based on the Flux architecture, wherein the task to be generated/understood comprises a generation task, an understanding task and a joint generation task, the image domain comprises an original image, the label domain comprises a depth map, a normal map, an albedo map, an edge map and a line drawing, and the feature domain comprises image segmentation features, target detection features and image classification features;

Determining a learning target based on the image classification model, the image segmentation model and the target detection model, and carrying out optimization training on the initial joint diffusion model by combining an encoder and a random role allocation mechanism to obtain a joint diffusion model;

Based on the task to be generated/understood, outputting a task result through a joint diffusion model, and completing unification of image generation and understanding;

The image classification model adopts an improved DINOv2 self-supervised visual encoder, the image segmentation model adopts an improved Segmenter model, and the target detection model adopts an improved DETR model.

A unified image generation and understanding system based on joint diffusion modeling, comprising:

The data acquisition module is used for determining an image domain or a label domain or a feature domain of a task to be generated/understood;

The diffusion model building training module is used for determining a learning target based on the image classification model, the image segmentation model and the target detection model, and carrying out optimization training on the initial joint diffusion model by combining the encoder and the random role allocation mechanism to obtain the joint diffusion model;

and the generating/understanding task completion module is used for outputting a task result based on the task to be generated/understood through the joint diffusion model to complete the unification of image generation and understanding.

The beneficial effects of the invention are as follows:

Through joint diffusion modeling, the invention unifies the generation and understanding tasks, so that separate models need not be designed for generation and for understanding, which improves efficiency. The improved DINOv2, Segmenter and DETR models enhance the clustering of classification feature clusters, the detail of segmentation boundaries, and the representation of small and medium-sized target features in detection. The random role allocation mechanism and the masked full-attention mechanism flexibly process multi-domain information, and the domain-invariant positional encoding assists cross-domain alignment, which improves modeling accuracy. Through optimization training, the model simultaneously supports joint generation, controllable generation and image perception tasks; its performance is superior to that of existing unified models and even exceeds that of proprietary models in tasks such as edge detection.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of the method in embodiment 1 of the present invention;

FIG. 2 is a diagram showing the structure of an initial joint diffusion model in embodiment 1 of the present invention;

FIG. 3 is a structural diagram of the improved DINOv2 self-supervised visual encoder in embodiment 1 of the present invention;

FIG. 4 is a diagram showing the structure of the improved Segmenter model in embodiment 1 of the present invention;

FIG. 5 is a diagram showing the structure of the improved DETR model in embodiment 1 of the present invention;

FIG. 6 is a schematic diagram of the joint diffusion model in embodiment 1 of the present invention;

FIG. 7 is a diagram illustrating the random role allocation mechanism in embodiment 1 of the present invention;

FIG. 8 is a diagram showing the construction of a system in embodiment 2 of the present invention;

FIG. 9 is a diagram showing the joint generation task in embodiment 3 of the present invention;

FIG. 10 is a task comparison chart of the method and other models in embodiment 4 of the present invention;

FIG. 11 is a graph comparing the results of domain-invariant positional encoding in embodiment 5 of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present invention.

Example 1:

Referring to fig. 1, this embodiment provides a unified image generation and understanding method based on joint diffusion modeling. The method illustrated in fig. 1 may be executed by a software and/or hardware device. The execution body of the present application may include, but is not limited to, at least one of a user device, a network device, etc. The user device may include, but is not limited to, a computer, a smart phone, a Personal Digital Assistant (PDA) and other such electronic devices. The network device may include, but is not limited to, a single network server, a server group consisting of multiple network servers, or a cloud-computing-based cloud consisting of a large number of computers or network servers, where cloud computing is a form of distributed computing in which a group of loosely coupled computers forms one super virtual computer. This embodiment is not limited thereto.

S1, determining an image domain, a label domain or a feature domain of a task to be generated/understood, and constructing an initial joint diffusion model based on the Flux architecture, wherein the task to be generated/understood comprises a generation task, an understanding task and a joint generation task, the image domain comprises an original image, the label domain comprises a depth map, a normal map, an albedo map, an edge map and a line drawing, and the feature domain comprises image segmentation features, target detection features and image classification features;

Specifically, the joint generation task means that an image, all of its corresponding labels and its corresponding image features are generated simultaneously by the joint diffusion model. The generation task means that the joint diffusion model generates the corresponding image given a number of labels and features. The understanding task means that the joint diffusion model simultaneously predicts multiple labels and image features given an image, i.e., generates the label domain and the feature domain.

When the task to be generated/understood is a joint generation task, the input of the joint diffusion model is text information, namely a description of the image that the task requires, and the corresponding label domain and feature domain are produced at the same time as the required image. When the task is a generation task, the input of the joint diffusion model is a plurality of label domains and feature domains, and the output is the image determined by these domains. When the task is an understanding task, the input of the joint diffusion model is an image domain, and the corresponding output is the label domain and the feature domain. If different tasks exist at the same time, the domains corresponding to each task are input into separate joint diffusion models, each task is processed independently, and the required results are output separately.

In this embodiment, the backbone of the Transformer model is based on the open-source FLUX model and is improved for multi-domain joint modeling tasks. Unlike the original FLUX, which is mainly used for image generation, the method is extended to cover the image domain, the label domain and the feature domain. To accomplish multi-domain joint modeling, the Flux architecture is adapted to process token sequences from M domains and to perform cross-domain feature interaction through a unified masked full-attention mechanism.

Thus, as shown in FIG. 2, the initial joint diffusion model comprises a Transformer model, wherein the Transformer model comprises a domain-invariant positional encoding layer and a Transformer module, and the Transformer module comprises a plurality of stacked Transformer layers;

The domain-invariant positional encoding layer adopts 2D sine-cosine encoding. Specifically, the same 2D sine-cosine code is added to the input tokens located at the same spatial position in different domains, which provides the joint diffusion model with explicit position information.
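As a non-limiting illustration, the domain-invariant positional encoding described above can be sketched as follows. The function names, the base frequency of 10000 and the split of channels between rows and columns are assumptions made for this sketch and are not specified in this embodiment; the only property taken from the text is that tokens at the same spatial position in different domains receive the same 2D sine-cosine code.

```python
import math

def sincos_1d(pos, dim):
    """Standard 1D sine/cosine encoding of a scalar position into `dim` values."""
    half = dim // 2
    freqs = [1.0 / (10000 ** (k / half)) for k in range(half)]  # assumed base frequency
    return [math.sin(pos * f) for f in freqs] + [math.cos(pos * f) for f in freqs]

def sincos_2d(height, width, dim):
    """One code per grid cell: half the channels encode the row, half the column."""
    assert dim % 4 == 0
    return [sincos_1d(r, dim // 2) + sincos_1d(c, dim // 2)
            for r in range(height) for c in range(width)]

def add_domain_invariant_pe(domain_tokens, height, width):
    """Add the SAME positional table to the token grid of every domain."""
    dim = len(next(iter(domain_tokens.values()))[0])
    table = sincos_2d(height, width, dim)
    return {name: [[t + p for t, p in zip(token, code)]
                   for token, code in zip(tokens, table)]
            for name, tokens in domain_tokens.items()}

# Two hypothetical domains, each a 2x2 grid of 8-dimensional tokens.
domains = {"image": [[0.0] * 8 for _ in range(4)],
           "depth": [[0.0] * 8 for _ in range(4)]}
encoded = add_domain_invariant_pe(domains, 2, 2)
```

Because the table depends only on the grid position, the image token and the depth token at a given position carry an identical positional component, which is what allows the attention layers to align them across domains.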

Each Transformer layer adopts a masked full-attention mechanism, which helps improve modeling accuracy. To enable the joint diffusion model to ignore invalid domains whose role type is X, a mask is introduced on top of the standard full-attention mechanism: a log(m_j) penalty is added to the attention logit of every token marked as invalid, so that its attention weight becomes zero after the softmax function. The formula corresponding to the masked full-attention mechanism is:

O_i = Σ_{j=1}^{N} softmax_j(Q_i·K_j^T/√d + log(m_j))·V_j, i = 1, …, M;

Wherein softmax_j(·) represents the softmax function taken over the key index j, O_i represents the attention output corresponding to the i-th query, Q_i, K_j and V_j represent the i-th query vector, the j-th key vector and the j-th value vector respectively, d represents the dimension of Q_i, K_j^T represents the transpose of the key vector, Σ(·) represents the summation function, log(·) represents the natural logarithm, m_j represents the j-th mask indicator with a value in the range [0,1], and M and N represent the number of queries and the number of keys (or values), respectively. When m_j is 0, the attention logit of the j-th token is negative infinity, so the invalid domain is completely masked and has no influence on the corresponding output.
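The effect of the log(m_j) penalty can be checked with a small sketch. The code below is a single-head illustration written with plain lists; the function name and the toy vectors are assumptions for this sketch, and it further assumes that at least one key has a non-zero mask indicator.

```python
import math

def masked_full_attention(Q, K, V, m):
    """Single-head attention where log(m_j) is added to the logit of key j."""
    d = len(Q[0])
    outputs = []
    for q in Q:
        logits = []
        for k, mj in zip(K, m):
            score = sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            # m_j = 0 gives a logit of minus infinity, i.e. zero weight after softmax.
            logits.append(score + (math.log(mj) if mj > 0 else -math.inf))
        top = max(logits)  # assumes at least one valid (m_j > 0) token
        weights = [math.exp(l - top) for l in logits]
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append([sum(w * v[c] for w, v in zip(weights, V))
                        for c in range(len(V[0]))])
    return outputs

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
V = [[1.0, 0.0], [0.0, 1.0], [100.0, 100.0]]
# The third token is marked invalid, so its large value vector must not leak into the output.
masked = masked_full_attention(Q, K, V, [1.0, 1.0, 0.0])
```

With the third mask indicator set to 0, the result matches attention computed over the first two tokens only, which is the behavior described for domains of role type X.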

S2, determining a learning target based on an image classification model, an image segmentation model and a target detection model, and performing optimization training on the initial joint diffusion model in combination with an encoder and a random role allocation mechanism to obtain the joint diffusion model, wherein the image classification model adopts an improved DINOv2 self-supervised visual encoder, the image segmentation model adopts an improved Segmenter model, and the target detection model adopts an improved DETR model.

As shown in fig. 3, the improved DINOv2 self-supervised visual encoder comprises a geometry-aware patch embedding layer, a relative-position Transformer layer and a channel selection compression module connected in series, wherein the relative-position Transformer layer comprises a plurality of stacked first Transformers with one class-cluster perception module inserted between every two first Transformers, the remaining structure being the same as that of an existing Transformer, and the channel selection compression module comprises a global average pooling layer, two MLP layers and a weighted activation layer.

As shown in FIG. 4, the improved Segmenter model comprises an MS-CAP module, a channel attention layer and a segmentation encoder connected in series, wherein the segmentation encoder comprises a one-dimensional flattening layer, a position embedding layer and a plurality of stacked second Transformers connected in series, an SREB module (semantic residual enhancement module) is inserted between every two second Transformers to introduce cross-layer feature connections that fuse the output of the i-th second Transformer with the output of the (i+k)-th second Transformer, and the MS-CAP module comprises a plurality of parallel convolution layers followed in series by a channel attention layer and a feature fusion layer.

As shown in fig. 5, the improved DETR model comprises a dual CNN backbone network, a channel attention layer, a Transformer-based encoder, a Transformer-based decoder and a prediction head connected in series, wherein the dual CNN backbone network comprises a global semantic CNN network and a target edge CNN network arranged in parallel, the target edge CNN network comprises a CNN network, a local perception convolution layer and a boundary response enhancement module connected in series, the Transformer-based encoder comprises a plurality of block layers, and a spatial guidance module is inserted between every two block layers. The global semantic CNN network has the same structure as the CNN network.

The local perception convolution layer includes a convolution layer with a 3×3 kernel and a convolution layer with a 5×5 kernel. In the boundary response enhancement module, a boundary attention map is first generated by a 1×1 convolution followed by sigmoid activation, and the boundary attention map is then multiplied element-wise with the original features, thereby enhancing the features of the boundary region.
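A minimal sketch of the boundary response enhancement step is given below for illustration. It assumes that the 1×1 convolution has a single output channel, so that one attention value is produced per spatial position and shared by all channels; the embodiment does not state the number of output channels, and the function name and weights are hypothetical.

```python
import math

def boundary_response_enhance(feat, weight, bias=0.0):
    """feat: C x H x W nested lists; weight: C coefficients of a 1x1 convolution
    with one output channel (an assumption of this sketch)."""
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    # Boundary attention map: 1x1 convolution followed by sigmoid.
    attn = [[1.0 / (1.0 + math.exp(-(bias + sum(weight[c] * feat[c][y][x]
                                                for c in range(C)))))
             for x in range(W)] for y in range(H)]
    # Element-wise multiplication of the attention map with the original features.
    enhanced = [[[feat[c][y][x] * attn[y][x] for x in range(W)]
                 for y in range(H)] for c in range(C)]
    return enhanced, attn

# One channel, one row, two pixels; zero weights give an attention value of 0.5 everywhere.
enhanced, attn = boundary_response_enhance([[[2.0, 4.0]]], [0.0])
```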

As shown in fig. 6, performing optimization training on the initial joint diffusion model to obtain the joint diffusion model comprises:

S2-1, acquiring a training image domain and a training label domain, and, based on the learning target, processing the training image domain through the image classification model, the image segmentation model and the target detection model to obtain a classification task domain, a segmentation task domain and a detection domain, all of which serve as training feature domains;

Generation tasks generally rely on generative models represented by diffusion models, whereas understanding tasks are usually cast as classification or regression problems and solved by discriminative models. The image classification model, the image segmentation model and the target detection model are therefore used to obtain the corresponding training feature domains for training the initial joint diffusion model, so that the feature domains generated by the joint diffusion model are better suited to discriminative tasks (classification, detection and segmentation). The generated feature domains are subsequently input to a decoder to carry out the discriminative task, thereby achieving image classification, image segmentation and target detection.

When the learning target is image classification, the original structure of the conventional DINOv2 self-supervised visual encoder suffers from insufficient inter-class separation and from high-dimensional intermediate-layer features that are difficult for a diffusion model to model and compress, which hinders effective learning by the diffusion model; a class-cluster perception module and a channel selection compression module are therefore introduced. The processing procedure of the image classification model is as follows:

S2-101, acquiring a training image domain, inputting it into the geometry-aware patch embedding layer, dividing the training image domain by operations such as rotation to obtain classification patches of a fixed size, and converting the classification patches into classification tokens;

S2-102, extracting the classification spatial features of each classification token by using a first Transformer in the relative-position Transformer layer;

S2-103, performing class-cluster aggregation on the classification spatial features through the class-cluster perception module, and combining the attention weights and the aggregation vectors to obtain classification class-cluster perception features;

It should be noted that the original class token is used to perform semantic aggregation on the classification class-cluster perception features, guiding features of the same class to form high-density semantic clusters. Specifically, a group of class tokens is reserved for each first Transformer, that is, one classification class-cluster perception feature is selected as the class token and the remaining classification class-cluster perception features are semantically aggregated toward it. Taking the first of the first Transformers as an example, class-cluster aggregation is performed on the input features, and for each classification class-cluster perception feature x_i the attention weight A_i with respect to the class token is calculated from Q_i and K_cls^T, wherein Q_i represents the query vector corresponding to the i-th classification class-cluster perception feature x_i, and K_cls^T represents the transpose of the key vector corresponding to the class token.

Based on the attention weights, the corresponding aggregation vector is calculated and added as a residual, which guides each classification class-cluster perception feature to gather toward its cluster center and yields the updated classification class-cluster perception feature. The corresponding formula is:

x_i' = x_i + A_i·V_cls;

where x_i' represents the updated classification class-cluster perception feature, V_cls represents the value vector corresponding to the class token, and A_i·V_cls represents the aggregation vector.
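For illustration only, the residual update x_i' = x_i + A_i·V_cls can be sketched as below. How A_i is normalized is not reproduced in this text, so the sketch assumes a softmax over the reserved class tokens and uses the feature itself as the query (identity projection); both choices, and the function name, are assumptions rather than the method of this embodiment.

```python
import math

def cluster_aggregate(x, k_cls, v_cls):
    """x: N features; k_cls, v_cls: keys and values of one or more class tokens."""
    out = []
    for xi in x:
        # Assumed form of A_i: softmax over class tokens of Q_i . K_cls^T, with Q_i = x_i.
        logits = [sum(a * b for a, b in zip(xi, k)) for k in k_cls]
        top = max(logits)
        e = [math.exp(l - top) for l in logits]
        a = [w / sum(e) for w in e]
        # Aggregation vector A_i . V_cls, added to x_i as a residual.
        agg = [sum(w * v[c] for w, v in zip(a, v_cls)) for c in range(len(xi))]
        out.append([p + q for p, q in zip(xi, agg)])
    return out

# With a single class token the weight is 1, so every feature is shifted by V_cls.
shifted = cluster_aggregate([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0]], [[10.0, 20.0]])
```

The residual form leaves the original feature intact and only adds a pull toward the class token's value vector, which is consistent with the stated goal of tightening clusters without discarding spatial information.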

S2-104, repeating the operations of the first Transformer and the class-cluster perception module until the output of the last first Transformer is obtained;

S2-105, pooling the output of the last first Transformer and calculating importance weights through the channel selection compression module, and reducing the dimension of the pooled classification class-cluster perception features based on the importance weights to obtain the classification task domain, i.e., the training feature domain.

Specifically, the channel selection compression module compresses the invalid, redundant dimensions in the high-dimensional output features of the last first Transformer; by suppressing invalid channel activations and compressing the feature dimension, it improves the efficiency with which the diffusion model learns to generate these features. The global average S over the channel dimension is calculated by the global average pooling layer, and the corresponding formula is:

S = GAP(x') ∈ R^{1×D};

Where R^{1×D} represents the set of real-valued vectors of size 1×D, GAP(·) represents the global average pooling operation, which obtains a single value per channel by averaging all elements of each feature map, and D represents the feature dimension of the output of the last first Transformer.

The importance weight W is calculated by two MLP layers connected in series, and the corresponding formula is:

W = σ(MLP(S)) ∈ R^{1×D};

Where σ(·) represents the activation function and MLP(·) represents the multi-layer perceptron operation.

Important channels are selectively activated through the importance weight W and channels with low importance weights are discarded, giving the dimension-reduced feature X. The corresponding formula is:

X = x'·W.
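The three formulas above can be traced with a small sketch. The ReLU between the two MLP layers, the sigmoid used for σ, and the top-k rule used to discard low-weight channels are assumptions of this sketch; the weight matrices `w1` and `w2` and the parameter `keep` are hypothetical.

```python
import math

def channel_select_compress(x, w1, w2, keep):
    """x: N tokens x D channels; w1: D x H and w2: H x D weights of a two-layer MLP."""
    n, d, h = len(x), len(x[0]), len(w2)
    # S = GAP(x'): average every channel over all tokens.
    s = [sum(tok[c] for tok in x) / n for c in range(d)]
    # W = sigma(MLP(S)): two layers with an assumed ReLU in between and a sigmoid output.
    hidden = [max(0.0, sum(s[c] * w1[c][j] for c in range(d))) for j in range(h)]
    w = [1.0 / (1.0 + math.exp(-sum(hidden[j] * w2[j][c] for j in range(h))))
         for c in range(d)]
    # X = x' . W: reweight channels, then keep the `keep` highest-weight channels (assumed rule).
    weighted = [[tok[c] * w[c] for c in range(d)] for tok in x]
    kept = sorted(sorted(range(d), key=lambda c: w[c], reverse=True)[:keep])
    return [[tok[c] for c in kept] for tok in weighted], w

x = [[1.0, 4.0], [3.0, 8.0]]
compressed, weights = channel_select_compress(x, [[1.0], [0.0]], [[2.0, -2.0]], keep=1)
```

In this toy case the first channel receives a weight above 0.5 and the second a weight below 0.5, so only the first channel survives the reduction.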

When the learning target is image segmentation, the processing procedure of the image segmentation model is as follows:

S2-111, acquiring a training image domain, dividing the training image domain by the MS-CAP module (multi-scale channel perception embedding module) and extracting patch tokens under different receptive fields in parallel, calculating the channel attention weights of the different patch tokens by the channel attention layer, and processing each set of patch tokens based on the channel attention weights to obtain segmentation-enhanced multi-scale features;

Specifically, the conventional Segmenter model is deficient in modeling boundary details and in the completeness of the semantic information carried by its intermediate-layer features, which hinders effective learning by the diffusion model; the MS-CAP module is therefore used to replace the original flatten and project modules. In the MS-CAP module, the training image domain is processed by multiple parallel convolution layers, and patch tokens under different receptive fields are extracted by convolution kernels of different sizes. A small convolution kernel (such as 3×3) has a small receptive field, focuses on capturing local image details and can accurately extract fine textures; a large convolution kernel (such as 7×7) has a large receptive field, can acquire information from a wider area and focuses on overall structure and semantics.

Each set of patch tokens is then input to the channel attention layer, which analyzes the feature channels of each branch and adaptively selects the feature channels that are sensitive to boundaries and textures. By calculating the importance weight of each channel, the responses of channels related to texture and semantic boundaries are enhanced and irrelevant channels are suppressed, so that the key feature information of the image is highlighted and the joint diffusion model focuses more on boundaries and textures.

Finally, the multi-scale features enhanced by channel attention are concatenated, integrating the enhanced features at different scales into a feature set rich in multi-scale information. The concatenated features are then uniformly projected to the patch token embedding dimension required by the Transformers of the encoder, so that the features meet the processing requirements of the subsequent Transformers. Position codes are then added; they record the spatial position of each feature in the original image and preserve the spatial order of the features, so that spatial structure information is not lost during processing.

S2-112, inputting the segmentation-enhanced multi-scale features into the segmentation encoder, extracting semantic residual fusion features and taking them as the segmentation task domain, i.e., the training feature domain;

In the segmentation encoder, the segmentation-enhanced multi-scale features are divided into segmentation patches; the patches are flattened into one-dimensional vectors by the one-dimensional flattening layer; each vector is embedded by the position embedding layer to obtain an embedded sequence; the first of the second Transformers performs feature encoding and context aggregation on the embedded sequence to obtain a first semantic residual fusion feature; the SREB module fuses this feature with the output of the second Transformer located k layers later, and the fused feature is used as the input of the next second Transformer; the operations of the second Transformers and the SREB module are repeated until the output of the last second Transformer is obtained, which is the semantic residual fusion feature. In this embodiment, k has a value of 2 or 3.

It should be explained that the SREB module fuses shallow features with deep features by introducing cross-layer feature connections, which improves the transfer of semantic information between layers and reduces the loss of intermediate semantic information, so that the extracted features capture local edges, semantic subjects and global structure, and the supervision target used to train the diffusion model is a complete semantic feature. The output F_i of the i-th second Transformer is fused with the output F_{i+k} of the (i+k)-th second Transformer, and the corresponding formula is:

F'_{i+k} = F_{i+k} + gate(F_i);

wherein F'_{i+k} represents the fused semantic residual fusion feature, and gate(·) represents a convolution operation with a 1×1 convolution kernel followed by a Sigmoid activation function.
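The fusion formula can be illustrated with a short sketch in which the 1×1 convolution is written as a per-token linear map, which is equivalent for a token sequence. The function names and the weight matrix are hypothetical, and the bias term is omitted as a simplifying assumption.

```python
import math

def gate(tokens, weight):
    """1x1 convolution (per-token linear map with a D x D weight) followed by Sigmoid."""
    return [[1.0 / (1.0 + math.exp(-sum(t[a] * weight[a][b] for a in range(len(t)))))
             for b in range(len(weight[0]))] for t in tokens]

def sreb_fuse(f_i, f_ik, weight):
    """F'_{i+k} = F_{i+k} + gate(F_i) for two token sequences of equal shape."""
    g = gate(f_i, weight)
    return [[deep + shallow for deep, shallow in zip(row, g_row)]
            for row, g_row in zip(f_ik, g)]

identity = [[1.0, 0.0], [0.0, 1.0]]
# A zero shallow feature passes through the gate as 0.5 and is added to the deep feature.
fused = sreb_fuse([[0.0, 0.0]], [[1.0, 2.0]], identity)
```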

In the inference stage, the joint diffusion model is used to generate semantic segmentation features, the generated features are input into a two-layer lightweight Mask Transformer decoder, and a semantic mask probability map is output for each category. The final segmentation result map is obtained after a softmax operation and a per-pixel argmax over categories.
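The final softmax and argmax step can be sketched as follows, assuming the decoder has already produced one score map per category; the decoder itself is not modeled here, and the function name and input layout are assumptions of this sketch.

```python
import math

def segment_from_logits(logits):
    """logits: K x H x W category scores -> (H x W label map, K x H x W probabilities)."""
    K, H, W = len(logits), len(logits[0]), len(logits[0][0])
    probs = [[[0.0] * W for _ in range(H)] for _ in range(K)]
    labels = [[0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            # Softmax over the K categories at this pixel.
            top = max(logits[k][y][x] for k in range(K))
            e = [math.exp(logits[k][y][x] - top) for k in range(K)]
            total = sum(e)
            for k in range(K):
                probs[k][y][x] = e[k] / total
            # Argmax over categories gives the pixel's label.
            labels[y][x] = max(range(K), key=lambda k: probs[k][y][x])
    return labels, probs

# Two categories on a 1x2 image: the first pixel favors category 0, the second category 1.
labels, probs = segment_from_logits([[[2.0, 0.0]], [[1.0, 3.0]]])
```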

When the learning target is target detection, the processing procedure of the target detection model is as follows:

S2-121, acquiring a training image domain, extracting global semantic information of the training image domain through the global semantic CNN network, extracting initial edge features of the training image domain through the CNN network, extracting the detail textures of the initial edge features with the local perception convolution layer to obtain edge convolution features, and dynamically extracting edge enhancement features from the edge convolution features with the boundary response enhancement module, which combines an edge detection operator with boundary attention gating;

S2-122, fusing the edge enhancement features and the global semantic information through the channel attention layer to obtain global-edge fusion features;

S2-123, encoding the global-edge fusion features through the first block layer to obtain global-fusion coding information, extracting an edge response map of the training image domain with the spatial guidance module, and modulating the global-fusion coding information with the edge response map in combination with a mask attention mechanism to obtain global-fusion guidance information;

It should be explained that, in the spatial guidance module, the global-fusion coding information is modulated based on the edge response map extracted from the training image domain to obtain dynamically weighted attention features, thereby enhancing the response of the joint diffusion model to small targets and edge regions. Specifically, a low-resolution edge response map E is extracted from the input image and used as a position guidance signal; the output feature F of the current block layer is fused with the guidance map E to obtain a mask attention value M; and M is used to modulate F. The corresponding formula is:

F'=F⊙(1+M);

where ⊙ denotes element-wise multiplication and F' is the resulting global-fusion guidance information.

The global-fusion guidance information is then fed to the next block layer. In this way, each encoder block layer dynamically adjusts where its features focus. The guidance signal comes from the original input image (the training image domain) or from shallow branches, so no additional supervision is introduced.
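The modulation F' = F⊙(1+M) can be illustrated with a minimal pure-Python sketch. The text only states that M is obtained by fusing F with E; the fusion rule used here (E multiplied by a sigmoid of the channel-averaged feature) and the function name `modulate` are assumptions for illustration, not the patented module.

```python
import math

def modulate(F, E):
    """Apply F' = F * (1 + M) to a feature map.

    F: block-layer output indexed as F[c][y][x].
    E: low-resolution edge response map E[y][x] with values in [0, 1]
       and the same spatial size as F.
    """
    C, H, W = len(F), len(F[0]), len(F[0][0])
    # Assumption: M = E * sigmoid(channel mean of F). This fusion rule is
    # hypothetical; the source does not specify how F and E are fused.
    M = [[E[y][x] / (1.0 + math.exp(-sum(F[c][y][x] for c in range(C)) / C))
          for x in range(W)] for y in range(H)]
    # Element-wise modulation, broadcast over channels.
    F_out = [[[F[c][y][x] * (1.0 + M[y][x]) for x in range(W)]
              for y in range(H)] for c in range(C)]
    return F_out, M

# Toy example: 2 channels, 2x3 map, edge response only in the middle column.
F = [[[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]],
     [[3.0, 3.0, 3.0], [3.0, 3.0, 3.0]]]
E = [[0.0, 1.0, 0.0], [0.0, 0.5, 0.0]]
F_out, M = modulate(F, E)
```

With this rule, positions where E is zero pass through unchanged, while positions on an edge are amplified by at most a factor of two.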

S2-124, repeating the operations of the block layers and the spatial guidance module until the output of the last block layer is obtained and used as the detection domain, thereby obtaining the training feature domain.

In the training stage, the output of the last block layer serves as the detection domain and as the supervision target for training the joint diffusion model. In the inference stage, the diffusion model may be used to generate the detection features.

S2-2, assigning roles to the training feature domain, the training image domain and the training label domain respectively through an encoder and a random role assignment mechanism to determine the corresponding role types;

Conventional methods cannot unify generation and understanding tasks, so a random role assignment mechanism is provided. Assigning roles to the training feature domain, the training image domain and the training label domain through the encoder and the random role assignment mechanism, and determining the corresponding role types, comprises the following steps:

S2-2-1, inputting the training image domain, the training label domain and the training feature domain into the encoder according to the training generation/understanding task to obtain training domain-encoding information, wherein the generation/understanding task is either a generation task or an understanding task. For example, when the task is a generation task, the input to the encoder is the corresponding label domain and feature domain.

S2-2-2, processing the training domain-encoding information through the random role assignment mechanism to assign roles, and determining the role type and the allocation processing method: if the role type is G, the allocation processing method is noise adding; if the role type is C, the data are kept as they are; and if the role type is X, the allocation processing method is data zeroing;

Specifically, as shown in fig. 7, if the role type is G (generate), the allocation processing method is noise adding, that is, noise is added following the standard Rectified Flow training method and velocity-field prediction is performed; if the role type is C (condition), the data are kept as they are; and if the role type is X (ignore), the allocation processing method is data zeroing, that is, the data are set to zero and their influence is masked out by a masking mechanism. In fig. 2, the switch represents the random role assignment mechanism.

S2-2-3, processing the training domain-encoding allocation information with the allocation processing method to obtain the input data of the joint diffusion model.

Specifically, the formula corresponding to the random role assignment mechanism is:

wherein the left-hand side denotes the input data of the initial joint diffusion model obtained by processing the domain-encoding information through the random role assignment mechanism, ρ^k denotes the assigned role, k denotes the domain index, τ denotes the time step, the encoded term denotes the domain-encoding allocation information, and ξ^k denotes the noise term.
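The role-dependent processing can be written compactly as follows. This is a sketch consistent with the definitions above, not the verbatim formula of the application: it assumes the standard Rectified Flow linear interpolation between data and noise, and the symbols x^k (domain-encoding information of the k-th domain) and x̃^k_τ (resulting model input) are introduced here for illustration only.

```latex
\tilde{x}^{k}_{\tau} =
\begin{cases}
(1-\tau)\,x^{k} + \tau\,\xi^{k}, & \rho^{k} = G \quad \text{(noise adding)} \\
x^{k}, & \rho^{k} = C \quad \text{(kept as is)} \\
0, & \rho^{k} = X \quad \text{(data zeroing)}
\end{cases}
\qquad \xi^{k} \sim \mathcal{N}(0, 1)
```

Under this interpolation the velocity along the path is ξ^k − x^k, which matches the "noise minus true value" target that the text gives for the velocity prediction.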

Different tasks are realized through the random role assignment mechanism: when the image domain is C and the other domains are G, understanding is realized; when the image domain is G and a certain label domain is C, controllable generation is realized; and when all domains are G, joint generation is realized. During training, multiple tasks are trained in one model through the role assignment mechanism, so that generation and understanding are unified.
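The three role patterns and the per-role processing can be sketched in a few lines of Python. The function names, the task labels, the choice of role X for the remaining domains in controllable generation, and the linear data/noise mix for role G are all assumptions made for illustration; the domains are represented as short lists of floats rather than real encoder outputs.

```python
import random

def assign_roles(domain_names, task, control=None):
    """Return {domain: role} for the three task patterns named in the text."""
    if task == "understanding":            # image is C, all other domains are G
        return {k: ("C" if k == "image" else "G") for k in domain_names}
    if task == "controllable_generation":  # image is G, one label domain is C
        # Assumption: domains that are neither image nor control are ignored (X).
        return {k: ("G" if k == "image" else "C" if k == control else "X")
                for k in domain_names}
    if task == "joint_generation":         # every domain is G
        return {k: "G" for k in domain_names}
    raise ValueError(f"unknown task {task!r}")

def apply_role(x, role, tau, rng):
    """Process one domain's encoded vector according to its role."""
    if role == "G":   # generate: mix data with Gaussian noise at time step tau
        return [(1.0 - tau) * v + tau * rng.gauss(0.0, 1.0) for v in x]
    if role == "C":   # condition: keep as is
        return list(x)
    if role == "X":   # ignore: zero the data
        return [0.0] * len(x)
    raise ValueError(f"unknown role {role!r}")

# One training sample: pick a task at random, then build the model input.
rng = random.Random(0)
domains = {"image": [1.0, 2.0], "depth": [0.5, 0.5], "edge": [0.1, 0.9]}
task = rng.choice(["understanding", "controllable_generation", "joint_generation"])
roles = assign_roles(domains, task, control="depth")
tau = rng.random()
model_input = {k: apply_role(v, roles[k], tau, rng) for k, v in domains.items()}
```

Sampling the task per training sample is one simple way to expose a single model to all role patterns; the actual sampling distribution is not specified in the text.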

S2-3, acquiring training text information, and inputting the training text information and each joint diffusion model input data into the initial joint diffusion model according to the training generation/understanding task to generate the corresponding training task result;

It should be noted that, before being fed to the initial joint diffusion model, the input data must be split into a plurality of training tokens. The processing of the initial joint diffusion model is explained below for three cases, according to the training generation/understanding task.

When the training generation/understanding task is a joint generation task, the training tokens corresponding to the training text information are input to the joint diffusion model, the roles of all domains are G, and the tokens are randomly initialized noise. An image and its corresponding label domains and feature domains are finally generated; they are produced simultaneously in a single inference pass, with no ordering among them.

When the training generation/understanding task is a generation task, the training tokens corresponding to the training label domain and the training feature domain are input to the domain-invariant positional encoding layer, which introduces position information into the training tokens to obtain the corresponding training generation position tokens. The stacked Transformer layers capture the relations among the training generation position tokens, produce shallow and deep features at different levels, and fuse them to finally obtain the corresponding training image.

When the training generation/understanding task is an understanding task, the process is the same as for the generation task, except that the input data is the training image. In all three tasks, the domains to be generated are initialized with random noise, while the condition domains keep their original tokens unchanged.
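The idea behind a domain-invariant positional encoding can be sketched as follows: the code added to a token depends only on its (row, column) position, so tokens from different domains at the same spatial location receive the same code. The sketch uses a 2D sine-cosine scheme; the frequency base of 10000, the channel split between row and column, and all function names are assumptions for illustration.

```python
import math

def sincos_1d(pos, dim):
    """Standard 1-D sinusoidal code of length dim (dim must be even)."""
    out = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        out += [math.sin(pos * freq), math.cos(pos * freq)]
    return out

def sincos_2d(row, col, dim):
    """2-D code: half of the channels encode the row, half the column."""
    assert dim % 4 == 0, "dim must be divisible by 4"
    return sincos_1d(row, dim // 2) + sincos_1d(col, dim // 2)

def add_domain_invariant_pe(tokens_by_domain, grid_w, dim):
    """Add the same (row, col) code to the tokens of every domain."""
    out = {}
    for name, tokens in tokens_by_domain.items():
        out[name] = [
            [t + p for t, p in zip(tok, sincos_2d(i // grid_w, i % grid_w, dim))]
            for i, tok in enumerate(tokens)
        ]
    return out

# Two domains, each a 2x2 grid of 4-dimensional tokens (all zeros for clarity).
zeros = [[0.0] * 4 for _ in range(4)]
out = add_domain_invariant_pe({"image": zeros, "depth": zeros}, grid_w=2, dim=4)
```

Because no domain-specific term enters the code, the attention layers can match image tokens and label tokens by position alone.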

S2-4, calculating a diffusion loss function Loss based on each training task result, wherein E denotes the expectation function; U(0,1) denotes a random variable sampled from the uniform distribution on [0,1]; ξ^{0:K} and ξ^k denote the noise terms of the domains with indices 0 to K and the noise term of the k-th domain, respectively; N(0,1) denotes a random variable sampled from the standard normal distribution with mean 0 and variance 1; the target-value terms denote, respectively, the target values of the domains with indices 0 to K in the training set and the target value of the k-th domain; D denotes the training set consisting of the training image domain, the training label domain and the training feature domain; Σ(·) denotes the summation function; ρ^k denotes the role; G denotes the role type; ||·|| denotes the norm; θ denotes the parameters of the initial joint diffusion model; τ denotes the time step; the time-indexed terms denote the values at time step τ of the domains with indices 0, 1 and K, respectively; and v_θ(·) denotes the velocity prediction model;

The velocity prediction model is the model of the invention. Under the diffusion framework, it predicts the velocity at which the data change from the current state to the real state, so that real data are gradually recovered from noise. The loss function optimizes the model by minimizing the difference between the predicted velocity and the true velocity (noise minus true value), thereby improving generation quality.
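Putting the listed definitions together, the loss plausibly takes the following form. This is a reconstruction under stated assumptions rather than the verbatim formula of the application: x^k denotes the target value of the k-th domain, x̃^k_τ its role-processed version at time step τ, the superscript k on v_θ selects the prediction for domain k, and the squared norm is assumed.

```latex
\mathrm{Loss}(\theta) =
\mathbb{E}_{\tau \sim U(0,1),\; \xi^{0:K} \sim \mathcal{N}(0,1),\; x^{0:K} \sim D}
\left[
  \sum_{k=0}^{K} \mathbb{1}\!\left[\rho^{k} = G\right]
  \left\|
    v_{\theta}\!\left(\tilde{x}^{0}_{\tau}, \tilde{x}^{1}_{\tau}, \dots, \tilde{x}^{K}_{\tau}, \tau\right)^{k}
    - \left(\xi^{k} - x^{k}\right)
  \right\|^{2}
\right]
```

Only domains whose role is G contribute to the sum, which is what lets one set of weights serve understanding, controllable generation and joint generation.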

S2-5, iterating the initial joint diffusion model based on the diffusion loss function, and adjusting weight parameters of the initial joint diffusion model to complete optimization training of the initial joint diffusion model.
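The quantity minimized during this iteration can be sketched for a single training sample. The sketch assumes a squared-error velocity loss restricted to domains with role G and a linear data/noise mix for those domains; `diffusion_loss`, `zero_model` and `oracle_model` are hypothetical names, and the two toy predictors stand in for the real network.

```python
def diffusion_loss(x, roles, tau, noise, velocity_model):
    """Velocity-matching loss summed over the domains whose role is G.

    x, noise: {domain: list of floats}; roles: {domain: "G" | "C" | "X"}.
    velocity_model(inputs, tau) -> {domain: predicted velocity}.
    """
    inputs = {}
    for k, xk in x.items():
        if roles[k] == "G":    # noised input for generated domains
            inputs[k] = [(1 - tau) * v + tau * n for v, n in zip(xk, noise[k])]
        elif roles[k] == "C":  # conditions are passed through unchanged
            inputs[k] = list(xk)
        else:                  # "X": ignored domains are zeroed
            inputs[k] = [0.0] * len(xk)
    pred = velocity_model(inputs, tau)
    loss = 0.0
    for k, xk in x.items():
        if roles[k] != "G":
            continue           # only generated domains contribute
        target = [n - v for v, n in zip(xk, noise[k])]  # noise minus true value
        loss += sum((p - t) ** 2 for p, t in zip(pred[k], target))
    return loss

# Understanding-style sample: image is the condition, depth is generated.
x = {"image": [1.0, 2.0], "depth": [0.5, 0.5]}
roles = {"image": "C", "depth": "G"}
noise = {"image": [0.0, 0.0], "depth": [1.5, -0.5]}

def zero_model(inputs, tau):
    """Toy predictor that always outputs zero velocity."""
    return {k: [0.0] * len(v) for k, v in inputs.items()}

def oracle_model(inputs, tau):
    """Toy predictor that returns the exact target for the depth domain."""
    return {"image": [0.0, 0.0], "depth": [1.0, -1.0]}

loss_zero = diffusion_loss(x, roles, 0.3, noise, zero_model)
loss_oracle = diffusion_loss(x, roles, 0.3, noise, oracle_model)
```

In actual training the weight parameters θ would be updated by gradient descent on this loss averaged over many samples, time steps and role patterns.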

S3, outputting a task result through the joint diffusion model based on the task to be generated/understood, completing the unification of image generation and understanding.

Example 2:

As shown in fig. 8, an image generation and understanding unified system based on joint diffusion modeling includes:

The role assignment module is used for assigning roles through the encoder and the random role assignment mechanism to obtain the input data of the joint diffusion model;

The diffusion model building training module is used for building an initial joint diffusion model by combining a Flux architecture, determining a learning target by combining an image classification model, an image segmentation model and a target detection model, and carrying out optimization training on the initial joint diffusion model to obtain a joint diffusion model;

And the generating/understanding task completion module inputs the generating/understanding task data into the joint diffusion model, outputs a task result and completes the unification of image generation and understanding.

The diffusion model construction training module comprises:

The diffusion model construction submodule is used for constructing an initial joint diffusion model in combination with the Flux architecture;

The classification task domain acquisition submodule is used for acquiring a training image domain and processing it through the image classification model to obtain a classification task domain;

The segmentation domain acquisition submodule is used for acquiring a training image domain and processing it through the image segmentation model to obtain a segmentation task domain;

The detection domain acquisition submodule is used for acquiring a training image domain and processing it through the target detection model to obtain a detection domain;

The role assignment submodule is used for assigning roles to the classification task domain, the segmentation task domain and the detection domain respectively through the encoder and the random role assignment mechanism to obtain the corresponding role types;

The model optimization training submodule is used for acquiring the training label domain and the training text information, calculating the diffusion loss function based on the classification task domain, the segmentation task domain and the detection domain, iterating the initial joint diffusion model based on the diffusion loss function, and adjusting the weight parameters of the initial joint diffusion model.

It should be noted that, regarding the system in the above embodiment, the specific manner in which the respective modules perform the operations has been described in detail in the embodiment regarding the method, and will not be described in detail herein.

Example 3:

The joint generation task is processed with the method of embodiment 1. As shown in fig. 9, given the input text, the model simultaneously generates an image and 7 kinds of corresponding labels. The generated images cover a wide range of aspect ratios, all at a resolution of around 1024.

Example 4:

The ControlNet, UniControl, EasyControl, OmniGen, PixWizard and OneDiffusion models are selected for comparison with the joint diffusion model of the invention. As shown in table 1, performance is compared for generation from the label domains; the joint diffusion model of the invention achieves better performance than the other models under the various conditions. The metrics are LPIPS (Learned Perceptual Image Patch Similarity) and FID (Fréchet Inception Distance), an indicator that measures the distance between the distributions of generated images and real images.

TABLE 1

The results of the generation task obtained with the present method and with the OmniGen, OneDiffusion and PixWizard models are shown in fig. 10. Compared with these three models, the present method performs well on the generation task: its generated images conform more closely to the input labels and features, and it clearly outperforms them.

The understanding tasks are evaluated in turn: depth estimation (table 2), normal estimation (table 3), albedo estimation (table 4) and edge detection (table 5).

TABLE 2 (depth estimation: absolute mean relative error, lower is better)

Method	NYUv2	ScanNet	DIODE
OmniGen	9.2	10.1	30.6
PixWizard	7.0	7.9	25.4
OneDiffusion	8.9	9.7	25.2
The method	10.1	12.1	25.9
The method (integrated)	8.3	9.9	25.8

TABLE 3 (normal estimation: mean angular error, lower is better)

Method	NYUv2	ScanNet	iBims
OmniGen	28.9	28.9	31.3
PixWizard	23.5	26.6	22.5
The method	21.1	24.3	20.1
The method (integrated)	18.6	20.3	18.2

TABLE 4 (albedo estimation on Hypersim: PSNR, higher is better; LPIPS, lower is better)

Method	PSNR	LPIPS
Ordinal Shading	15.6	0.37
Kocsis et al.	11.3	0.49
Careaga and Aksoy	15.7	0.36
RGB2X	20.6	0.18
The method	15.5	0.31
The method (integrated)	16.5	0.33

TABLE 5 (edge detection on BSDS500: F-score at ODS and OIS, higher is better)

Method	ODS	OIS
HED	0.788	0.808
PiDiNet	0.807	0.823
OmniGen	0.767	0.781
PixWizard	0.605	0.633
OneDiffusion	0.682	0.691
The method	0.826	0.851

For depth estimation, the absolute mean relative error was measured on the NYUv2, ScanNet and DIODE datasets for the present method, its integrated model, and the OneDiffusion, OmniGen and PixWizard models. As shown in table 2, ranking the errors from smallest to largest, the method and the method (integrated) are fifth and second on NYUv2, fifth and third on ScanNet, and fourth and third on DIODE.

For normal estimation, the mean angular error was measured on the NYUv2, ScanNet and iBims datasets for the present method, its integrated model, and the OmniGen and PixWizard models. As shown in table 3, ranking the errors from largest to smallest, the method and its integrated model are third and fourth on all three datasets; that is, they have the two lowest errors on every dataset. Compared with the other models, their performance is more stable across the datasets.

For albedo estimation, PSNR and LPIPS were measured on the Hypersim test set for the present method, its integrated model, and the Ordinal Shading, Kocsis et al., Careaga and Aksoy, and RGB2X models. As shown in table 4, the PSNR of the method and its integrated model are ranked third and second, respectively, and the LPIPS of the method and its integrated model are ranked second and third, respectively. It follows that the method outperforms the Ordinal Shading, Kocsis et al., and Careaga and Aksoy models.

For edge detection, the F-score at the Optimal Dataset Scale (ODS) and at the Optimal Image Scale (OIS) was measured on the BSDS500 dataset for the present method and for the HED, PiDiNet, OmniGen, PixWizard and OneDiffusion models. As shown in table 5, the present method attains the highest F-score at both ODS and OIS, which shows that it clearly outperforms the other models.
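For reference, the three scalar metrics used for depth, normal and albedo estimation can be sketched as below, following their usual definitions. The function names and the toy inputs are illustrative, flat Python lists stand in for images, and the evaluation protocol behind the tables (masking, scaling, averaging over images) is not specified here.

```python
import math

def abs_rel(pred, gt):
    """Absolute mean relative depth error: mean(|pred - gt| / gt)."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)

def mean_angular_error(pred, gt):
    """Mean angle in degrees between predicted and ground-truth normals."""
    total = 0.0
    for p, g in zip(pred, gt):
        dot = sum(a * b for a, b in zip(p, g))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in g))
        # Clamp to [-1, 1] to guard against rounding before acos.
        total += math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))
    return total / len(gt)

def psnr(pred, gt, peak=1.0):
    """Peak signal-to-noise ratio in dB for values in [0, peak]."""
    mse = sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

depth_err = abs_rel([1.1, 1.8], [1.0, 2.0])
normal_err = mean_angular_error([[1.0, 0.0, 0.0]], [[0.0, 1.0, 0.0]])
albedo_psnr = psnr([0.5], [0.6])
```

Lower is better for the two error metrics and higher is better for PSNR, which is why the rankings in the text run in different directions.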

Example 5:

This example compares the joint diffusion model with a variant from which the domain-invariant positional encoding layer has been removed. As shown in fig. 11, without domain-invariant positional encoding there is a significant misalignment between the image domain and the label domain, and this problem is alleviated once the positional encoding is added, which demonstrates the role of the positional encoding. Therefore, the domain-invariant positional encoding layer proposed by the method helps the joint diffusion model achieve cross-domain spatial alignment.

The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A unified method for image generation and understanding based on joint diffusion modeling, characterized in that it includes:

The image domain, label domain, or feature domain of the task to be generated/understood is determined; an initial joint diffusion model is constructed using the Flux architecture; the task to be generated/understood includes a generation task, an understanding task, and a joint generation task; the image domain includes the original image; the label domain includes depth map, normal map, albedo, edge map, and line drawing; the feature domain includes image segmentation features, object detection features, and image classification features.

Based on image classification model, image segmentation model and object detection model, the learning objective is determined, and the initial joint diffusion model is optimized and trained by combining encoder and random role assignment mechanism to obtain joint diffusion model;

Based on the task to be generated/understood, a joint diffusion model is used to output the task results, thus achieving the unification of image generation and understanding;

The image classification model uses an improved DINOv2 self-supervised visual encoder; the image segmentation model uses an improved Segmenter model; and the object detection model uses an improved DETR model.

2. The unified method for image generation and understanding based on joint diffusion modeling according to claim 1, characterized in that the initial joint diffusion model includes a Transformer model; the Transformer model includes a domain-invariant positional encoding layer and a Transformer module; the Transformer module includes multiple stacked Transformer layers; the domain-invariant positional encoding layer uses 2D sine and cosine coding; each Transformer layer uses a masked full attention mechanism; the formula corresponding to the masked full attention mechanism is:

Where softmax(·) denotes the softmax function; O_i represents the attention score corresponding to the i-th Transformer layer; Q_i represents the query vector of the i-th token, and K_j and V_j represent the key vector and value vector of the j-th token, respectively; d represents the dimension of Q_i; K_j^T represents the transpose of the key vector; ∑(·) represents the summation function; log(·) represents the logarithmic function with the natural constant as the base; m_j represents the mask indicator of the j-th token; and M and N represent the number of queries and the number of keys or values, respectively.

3. The unified method for image generation and understanding based on joint diffusion modeling according to claim 1, characterized in that the improved DINOv2 self-supervised visual encoder includes a cascaded geometric perception patch embedding layer, a relative position Transformer layer and a channel selection compression module; the relative position Transformer layer includes multiple stacked first Transformers, and a cluster perception module is inserted between every two first Transformers.

The improved Segmenter model includes a cascaded MS-CAP module, a channel attention layer, and a segmentation encoder. The segmentation encoder includes a cascaded one-dimensional flattening layer, a position embedding layer, and multiple stacked second Transformers. An SREB module is inserted between every two second Transformers to introduce cross-layer feature connections and fuse the output of the i-th second Transformer with the output of the i+k-th second Transformer.

The improved DETR model includes a cascaded dual-branch CNN backbone network, channel attention layers, a Transformer-based encoder, a Transformer-based decoder, and a prediction head. The dual-branch CNN backbone network includes a parallel global semantic CNN network and a target edge CNN network. The target edge CNN network includes a cascaded CNN network, local perceptual convolutional layers, and a boundary response enhancement module. The Transformer-based encoder includes multiple block layers, with a spatial guidance module inserted between every two block layers.

4. The unified method for image generation and understanding based on joint diffusion modeling according to claim 3, characterized in that, the step of optimizing and training the initial joint diffusion model to obtain the joint diffusion model includes:

Obtain the training image domain and training label domain; based on the learning objective, process the training image domain through image classification model, image segmentation model and object detection model to obtain the classification task domain, segmentation domain and detection domain, all of which are used as training feature domains;

Roles are assigned to the training feature domain, training image domain, and training label domain respectively through encoder and random role assignment mechanism to determine the corresponding role type; based on each role type, the domain-encoding information corresponding to the training feature domain, training image domain, and training label domain is processed to obtain the corresponding joint diffusion model input data;

Obtain training text information; according to the training generation/understanding task, input the training text information and the input data of each joint diffusion model into the initial joint diffusion model to generate the corresponding training task results;

Based on the results of each training task, combined with the formula:

Calculate the diffusion loss function Loss; where E represents the expectation function; U(0,1) represents a random variable sampled from a uniform distribution [0,1]; ξ^{0:K} and ξ^k represent the noise terms with indices from 0 to K and the noise term with index k, respectively; N(0,1) represents a random variable sampled from a standard normal distribution with mean 0 and variance 1; the target-value terms represent the target values in the training set from index 0 to K and the target value of the k-th domain, respectively; D represents the training set consisting of the training image domain, training label domain, and training feature domain; ∑(·) represents the summation function; ρ^k represents the role; G represents the role type; ||·|| represents the norm; θ represents the parameters of the initial joint diffusion model; τ represents the time step; the time-indexed terms represent the target values corresponding to the domains with indices 0, 1, and K at time step τ, respectively; and v_θ(·) represents the velocity prediction model.

The initial joint diffusion model is iterated based on the diffusion loss function, and the weight parameters of the initial joint diffusion model are adjusted to complete the optimization training of the initial joint diffusion model.

5. The unified method for image generation and understanding based on joint diffusion modeling according to claim 4, characterized in that, the step of assigning roles to the training feature domain, training image domain, and training label domain respectively through an encoder and a random role assignment mechanism to determine the corresponding role type includes:

Based on the training generation/understanding task, the training image domain, training label domain, and training feature domain are input into the encoder to obtain training domain-encoding information;

A random role assignment mechanism is used to process the training domain-encoding information for role assignment, determining the role type and assignment method. If the role type is G, the assignment method is to add noise; if the role type is C, the assignment method is to keep it as is; if the role type is X, the assignment method is to set the data to zero.

The training domain-encoding allocation information is processed using an allocation processing method to obtain the input data for the joint diffusion model.

6. The unified method for image generation and understanding based on joint diffusion modeling according to claim 4, characterized in that, when the learning objective is image classification, the processing procedure of the image classification model is as follows:

The training image domain is obtained and input into the geometric perception patch embedding layer. The training image domain is divided by rotation-equivariant convolution to obtain fixed-size classification patches, which are converted into classification tokens.

The classification space features of each classification token are extracted using the first Transformer in the relative position Transformer layer;

The cluster perception module performs cluster aggregation on the classification space features, and combines the attention weights and aggregation vectors to obtain the classification cluster perception features;

Repeat the same operations for the first Transformer and the cluster perception module until the output of the last first Transformer is obtained;

The channel selection compression module pools the output of the last first Transformer and calculates the importance weights. Based on the importance weights, the dimensionality of the pooled classification cluster perception features is reduced to obtain the classification task domain, which is the training feature domain.

7. The unified method for image generation and understanding based on joint diffusion modeling according to claim 4, characterized in that, when the learning objective is image segmentation, the processing procedure of the image segmentation model is as follows:

The training image domain is acquired and divided using the MS-CAP module, and patch tokens under different receptive fields are extracted in parallel. Channel attention weights for different patch tokens are calculated using a channel attention layer. Each patch token is processed based on its channel attention weights to obtain segmentation enhancement multi-scale features.

The segmentation enhancement multi-scale features are input into the segmentation encoder, and the semantic residual fusion features are extracted and used as the segmentation task domain, thus obtaining the training feature domain.

In the segmentation encoder, segmentation enhancement multi-scale features are divided to obtain segmentation patches, which are then input into a one-dimensional flattening layer to flatten each segmentation patch into a one-dimensional vector. A positional embedding layer is used to embed each one-dimensional vector to obtain an embedding sequence. The first second Transformer is used to encode features and aggregate context in the embedding sequence to obtain the first semantic residual fusion feature. The SREB module fuses the first semantic residual fusion feature with the outputs of the second Transformers k times apart, and uses this as the input to the next second Transformer. The operations of the second Transformer and the SREB module are repeated until the output of the last second Transformer is obtained, resulting in the semantic residual fusion feature.

8. The unified image generation and understanding method based on joint diffusion modeling according to claim 4, characterized in that, when the learning target is target detection, the processing procedure of the target detection model is as follows:

Obtain the training image domain; extract global semantic information of the training image domain through a global semantic CNN network; extract initial edge features of the training image domain through a CNN network; extract detailed textures of the initial edge features using local perceptual convolutional layers to obtain edge convolutional features; and dynamically extract edge enhancement features of the edge convolutional features using a boundary response enhancement module, combined with edge detection operators and boundary attention gating.

By fusing edge enhancement features and global semantic information through a channel attention layer, global-edge fusion features are obtained.

The global-edge fusion features are encoded in the first block layer to obtain global-fusion encoding information; the edge response map of the training image domain is extracted using the spatial guidance module, and modulation by the mask attention mechanism fuses the edge response map with the global-fusion encoding information to obtain global-fusion guidance information.

Repeat the operations of the block layer and the spatial guidance module until the output of the last block layer is obtained and used as the detection domain, that is, the training feature domain is obtained.

9. A unified system for image generation and understanding based on joint diffusion modeling, used to implement the unified method for image generation and understanding based on joint diffusion modeling as described in any one of claims 1 to 8, characterized in that it comprises:

The data acquisition module is used to determine the image domain, label domain, or feature domain of the task to be generated/understood;

The diffusion model construction training module is used to determine the learning objective based on the image classification model, image segmentation model and object detection model, and to optimize and train the initial joint diffusion model by combining the encoder and random role assignment mechanism to obtain the joint diffusion model;

The generation/understanding task completion module is used to output task results based on the task to be generated/understood through a joint diffusion model, thereby achieving the unification of image generation and understanding.

10. The unified image generation and understanding system based on joint diffusion modeling according to claim 9, characterized in that the diffusion model construction and training module includes:

The diffusion model construction submodule is used to build an initial joint diffusion model in conjunction with the Flux architecture;

The classification task domain acquisition submodule is used to acquire the training image domain and process it through the image classification model to obtain the classification task domain.

The segmentation domain acquisition submodule is used to acquire the training image domain and process it through the image segmentation model to obtain the segmentation task domain;

The detection domain acquisition submodule is used to acquire the training image domain and process it through the object detection model to obtain the detection domain.

The role assignment submodule is used to assign roles to the classification task domain, segmentation task domain, and detection domain respectively through the encoder and random role assignment mechanism to obtain the corresponding role types; based on each role type, the domain-encoding information corresponding to the training feature domain, training image domain, and training label domain is processed to obtain the corresponding joint diffusion model input data.

The model optimization training submodule is used to obtain the training label domain and training text information; calculate the diffusion loss function according to the training generation/understanding task; and iterate the initial joint diffusion model based on the diffusion loss function and adjust the weight parameters of the initial joint diffusion model.