Detailed Description
The embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the present invention. The components of the embodiments of the present invention, as generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention as claimed, but is merely representative of selected embodiments of the invention. All other embodiments that a person skilled in the art can obtain without inventive effort fall within the scope of the present invention.
Example 1:
Referring to FIG. 1, this embodiment provides a unified method for image generation and understanding based on joint diffusion modeling. The method illustrated in FIG. 1 may be executed by a software and/or hardware device. The execution body of the present application may include, but is not limited to, at least one of a user device, a network device, and the like. The user device may include, but is not limited to, a computer, a smartphone, a personal digital assistant (PDA), and similar electronic devices. The network device may include, but is not limited to, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of computers or network servers based on cloud computing, where cloud computing is a form of distributed computing in which a group of loosely coupled computers forms a super virtual computer. This embodiment is not limited thereto.
A unified method of image generation and understanding based on joint diffusion modeling, comprising:
S1, determining an image domain, a label domain or a feature domain of a task to be generated/understood, and constructing an initial joint diffusion model based on the Flux architecture, wherein the task to be generated/understood comprises a generation task, an understanding task and a joint generation task, the image domain comprises an original image, the label domain comprises a depth map, a normal map, an albedo map, an edge map and a line drawing, and the feature domain comprises image segmentation features, target detection features and image classification features;
Specifically, the joint generation task means that an image, all of its corresponding labels and its corresponding image features are generated simultaneously by the joint diffusion model. The generation task means that the joint diffusion model generates the corresponding image given several labels and features. The understanding task means that, given an image, the joint diffusion model simultaneously predicts multiple labels and image features, that is, it generates the label domain and the feature domain.
When the task to be generated/understood is a joint generation task, the input of the joint diffusion model is text information, namely a description of the image to be generated, and the corresponding label domain and feature domain are generated together with the required image. When the task to be generated/understood is a generation task, the input of the joint diffusion model is several label domains and several feature domains, and the output is the image determined by those domains. When the task to be generated/understood is an understanding task, the input of the joint diffusion model is the image domain, and the output is the corresponding label domain and feature domain. If different tasks exist at the same time, the domains corresponding to each task are input into separate joint diffusion models, and the required results are output separately after independent processing.
In this embodiment, the backbone of the Transformer model is based on the open-source FLUX model and is improved for multi-domain joint modeling. Unlike the original FLUX, which is mainly used for image generation, the method extends the model to cover the image domain, the label domain and the feature domain. To accomplish multi-domain joint modeling, the Flux architecture is adapted to process token sequences from M domains and to perform cross-domain feature interaction through a unified masked full-attention mechanism.
Thus, as shown in FIG. 2, the initial joint diffusion model comprises a Transformer model, wherein the Transformer model comprises a domain-invariant positional encoding layer and a Transformer module, and the Transformer module comprises a plurality of stacked Transformer layers;
The domain-invariant positional encoding layer uses 2D sine-cosine encoding. Specifically, the same 2D sine-cosine code is added to the input tokens located at the same spatial position in different domains, which provides explicit position information to the joint diffusion model.
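The domain-invariant encoding described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the implementation of the embodiment: the function names, the frequency base of 10000 and the split of channels between rows and columns are all hypothetical choices.

```python
import numpy as np

def sincos_1d(positions, dim):
    """1D sine-cosine code of width `dim` (assumed even) for integer positions."""
    half = dim // 2
    freqs = 1.0 / (10000.0 ** (np.arange(half) / half))  # assumed frequency base
    angles = np.outer(positions, freqs)                   # (P, half)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def sincos_2d(height, width, dim):
    """2D code: the first half of the channels encodes the row, the second the column."""
    rows, cols = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    row_code = sincos_1d(rows.reshape(-1), dim // 2)
    col_code = sincos_1d(cols.reshape(-1), dim // 2)
    return np.concatenate([row_code, col_code], axis=1)   # (H*W, dim)

def add_domain_invariant_code(domain_tokens, height, width):
    """domain_tokens: (num_domains, H*W, dim); every domain receives the same code."""
    code = sincos_2d(height, width, domain_tokens.shape[-1])
    return domain_tokens + code[None, :, :]

# Three domains (for example image, depth map, segmentation feature) on a 4x4 grid.
tokens = np.zeros((3, 16, 16))
encoded = add_domain_invariant_code(tokens, 4, 4)
```

Because the code depends only on the spatial position, tokens at the same grid location in different domains receive identical position information, which is the property the layer relies on.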
Each Transformer layer uses a masked full-attention mechanism, which helps improve modeling accuracy. To enable the joint diffusion model to ignore invalid domains whose role type is X, a masking mechanism is introduced on top of the standard full-attention mechanism: a penalty of $\log(m_j)$ is added to the attention logit of each token marked as invalid, so that its attention weight becomes zero after the softmax function. The formula corresponding to the masked full-attention mechanism is:
$$O_i = \sum_{j=1}^{N} \operatorname{softmax}_j\!\left(\frac{Q_i K_j^{\top}}{\sqrt{d}} + \log(m_j)\right) V_j, \quad i = 1, \dots, M$$
where softmax(·) denotes the softmax function, taken over the key index $j$; $O_i$ denotes the attention output for the i-th query; $Q_i$, $K_j$ and $V_j$ denote the i-th query vector, the j-th key vector and the j-th value vector; $d$ denotes the dimension of $Q_i$; $K_j^{\top}$ denotes the transpose of the key vector; Σ(·) denotes summation; log(·) denotes the natural logarithm; $m_j$ denotes the j-th mask indicator, with values in [0,1]; and $M$ and $N$ denote the number of queries and the number of keys or values, respectively. When $m_j$ is 0, the attention logit of the j-th token is negative infinity, so the invalid domain is completely masked and has no influence on the output.
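The effect of the $\log(m_j)$ penalty can be checked numerically. The sketch below is a single-head NumPy illustration; the function name and the toy shapes are assumptions, and it presumes that at least one token is valid (otherwise every logit would be negative infinity).

```python
import numpy as np

def masked_full_attention(Q, K, V, m):
    """Q: (M, d); K and V: (N, d); m: (N,) mask indicators in [0, 1]."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)                         # (M, N) scaled dot products
    with np.errstate(divide="ignore"):
        logits = logits + np.log(m)[None, :]              # log(0) = -inf for invalid tokens
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)                              # exp(-inf) = 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
m = np.array([1.0, 0.0, 1.0])        # the second token belongs to an ignored domain
out, weights = masked_full_attention(Q, K, V, m)
```

The masked token receives exactly zero weight, so the output equals the attention computed over the valid tokens alone.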
S2, determining a learning target based on an image classification model, an image segmentation model and a target detection model, and performing optimization training on the initial joint diffusion model in combination with an encoder and a random role assignment mechanism to obtain the joint diffusion model, wherein the image classification model is an improved DINOv self-supervised visual encoder, the image segmentation model is an improved Segmenter model, and the target detection model is an improved DETR model.
As shown in FIG. 3, the improved DINOv self-supervised visual encoder comprises a geometry-aware patch embedding layer, a relative-position Transformer layer and a channel selection compression module connected in series, wherein the relative-position Transformer layer comprises a plurality of stacked first Transformers with one class-cluster perception module inserted between every two first Transformers, the rest of the structure being the same as an existing Transformer, and the channel selection compression module comprises a global average pooling layer, two MLP layers and a weighted activation layer.
As shown in FIG. 4, the improved Segmenter model comprises an MS-CAP module, a channel attention layer and a segmentation encoder connected in series, wherein the segmentation encoder comprises a one-dimensional flattening layer, a position embedding layer and a plurality of stacked second Transformers connected in series; an SREB module (semantic residual enhancement module) is inserted between every two second Transformers to introduce cross-layer feature connections, so that the output of the i-th second Transformer is fused with the output of the (i+k)-th second Transformer; and the MS-CAP module comprises a plurality of parallel convolution layers, a channel attention layer and a feature fusion layer connected in series.
As shown in FIG. 5, the improved DETR model comprises a dual CNN backbone network, a channel attention layer, a Transformer-based encoder, a Transformer-based decoder and a prediction head connected in series, wherein the dual CNN backbone network comprises a global semantic CNN network and a target edge CNN network arranged in parallel, the target edge CNN network comprises a CNN network, a local perception convolution layer and a boundary response enhancement module connected in series, the Transformer-based encoder comprises a plurality of block layers, and a spatial guidance module is inserted between every two block layers. The global semantic CNN network has the same structure as the CNN network.
The local perception convolution layer includes a convolution layer with a 3×3 kernel and a convolution layer with a 5×5 kernel. In the boundary response enhancement module, a boundary attention map is first generated by a 1×1 convolution followed by sigmoid activation, and the boundary attention map is then multiplied element-wise with the original features, thereby enhancing the features of boundary regions.
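The gating performed by the boundary response enhancement module can be illustrated as follows. This is a sketch under assumptions: the 1×1 convolution is taken to have a single output channel (so it reduces to a weighted sum over channels at each pixel), and the function and parameter names are hypothetical.

```python
import numpy as np

def boundary_response_enhancement(features, weight, bias=0.0):
    """features: (C, H, W); weight: (C,) kernel of a 1x1 convolution with one output channel."""
    logits = np.tensordot(weight, features, axes=([0], [0])) + bias  # (H, W)
    attention = 1.0 / (1.0 + np.exp(-logits))                        # sigmoid, values in (0, 1)
    enhanced = features * attention[None, :, :]                      # element-wise gating
    return enhanced, attention

rng = np.random.default_rng(0)
features = rng.standard_normal((8, 5, 5))
weight = rng.standard_normal(8)
enhanced, attention = boundary_response_enhancement(features, weight)
```

Pixels with a high boundary attention value keep most of their feature magnitude, while pixels with a low value are attenuated.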
As shown in FIG. 6, performing optimization training on the initial joint diffusion model to obtain the joint diffusion model includes:
S2-1, acquiring a training image domain and a training label domain, and, based on the learning target, processing the training image domain through the image classification model, the image segmentation model and the target detection model to obtain a classification task domain, a segmentation task domain and a detection domain, all of which are used as training feature domains;
Generation tasks generally rely on generative models represented by diffusion models, whereas understanding tasks are usually converted into classification or regression problems and solved by discriminative models. Therefore, an image classification model, an image segmentation model and a target detection model are selected to obtain the corresponding training feature domains for training the initial joint diffusion model, so that the feature domains generated by the model are better suited to discriminative tasks (classification, detection and segmentation). The generated feature domains are subsequently input to a decoder to perform the discriminative task, thereby achieving image classification, image segmentation and target detection.
When the learning target is image classification, the original structure of the conventional DINOv self-supervised visual encoder suffers from insufficient inter-class separation, and its high-dimensional intermediate-layer features are difficult for a diffusion model to model and compress, which hinders effective learning by the diffusion model. A class-cluster perception module and a channel selection compression module are therefore introduced. The image classification model is processed as follows:
S2-101, acquiring a training image domain, inputting it into the geometry-aware patch embedding layer, dividing the training image domain through operations such as rotation to obtain classification patches of fixed size, and converting the classification patches into classification tokens;
S2-102, extracting the classification spatial features of each classification token using a first Transformer in the relative-position Transformer layer;
S2-103, performing class-cluster aggregation on the classification spatial features through the class-cluster perception module, and combining the attention weights and the aggregation vectors to obtain classification class-cluster perception features;
It should be noted that the original class token is used to semantically aggregate the classification class-cluster perception features, guiding features of the same class to form high-density semantic clusters. Specifically, a group of class tokens is reserved for each first Transformer, that is, one classification class-cluster perception feature is selected as the class token and the remaining classification class-cluster perception features are semantically aggregated toward it. Taking the first of the first Transformers as an example, class-cluster aggregation is performed on the input classification class-cluster perception features, and the attention weight $A_i$ between each classification class-cluster perception feature and the class token is calculated from $Q_i$ and $K_{cls}^{\top}$,
where $Q_i$ denotes the query vector corresponding to the i-th classification class-cluster perception feature $x_i$, and $K_{cls}^{\top}$ denotes the transpose of the key vector corresponding to the class token.
The corresponding aggregation vector is then calculated from the attention weight and used as a residual term, which guides each classification class-cluster perception feature toward the cluster center and yields the updated classification class-cluster perception feature. The corresponding formula is:
$$x_i' = x_i + A_i \cdot V_{cls}$$
where $x_i'$ denotes the updated classification class-cluster perception feature, $V_{cls}$ denotes the value vector corresponding to the class token, and $A_i \cdot V_{cls}$ denotes the aggregation vector.
S2-104, repeating the operations of the first Transformer and the class-cluster perception module until the output of the last first Transformer is obtained;
S2-105, pooling the output of the last first Transformer through the channel selection compression module and calculating importance weights, and reducing the dimension of the pooled classification class-cluster perception features based on the importance weights to obtain the classification task domain, that is, the training feature domain.
Specifically, the channel selection compression module compresses invalid, redundant dimensions in the high-dimensional output features of the last first Transformer, suppressing the activation of invalid channels and compressing the feature dimension, which improves the learning efficiency of the diffusion model. The global average $S$ along the channel dimension is calculated by the global average pooling layer according to:
$$S = \mathrm{GAP}(x') \in \mathbb{R}^{1 \times D}$$
where $\mathbb{R}$ denotes the set of real numbers, GAP(·) denotes the global average pooling operation, which averages all elements of each feature channel into a single value, and $D$ denotes the feature dimension of the output of the last first Transformer.
The importance weight $W$ is calculated using two MLP layers connected in series:
$$W = \sigma(\mathrm{MLP}(S)) \in \mathbb{R}^{1 \times D}$$
where σ(·) denotes the activation function and MLP(·) denotes the multi-layer perceptron operation.
Important channels are selectively activated by the importance weights and channels with low importance weights are suppressed, giving the dimension-reduced feature $X$:
$$X = x' \cdot W$$
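The three formulas of the channel selection compression module can be traced in a few lines of NumPy. The sketch below is illustrative only: the hidden width, the ReLU between the two MLP layers and the choice of sigmoid for σ are assumptions, since the text does not specify them.

```python
import numpy as np

def channel_selection_gate(x, w1, b1, w2, b2):
    """x: (T, D) token features from the last first Transformer."""
    s = x.mean(axis=0, keepdims=True)                 # S = GAP(x'), shape (1, D)
    hidden = np.maximum(s @ w1 + b1, 0.0)             # first MLP layer, ReLU assumed
    w = 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))     # W = sigma(MLP(S)), sigmoid assumed
    return x * w, w                                   # X = x' * W, broadcast over tokens

rng = np.random.default_rng(0)
T, D, H = 10, 8, 4                                    # tokens, channels, hidden width
x = rng.standard_normal((T, D))
w1, b1 = rng.standard_normal((D, H)), np.zeros(H)
w2, b2 = rng.standard_normal((H, D)), np.zeros(D)
gated, importance = channel_selection_gate(x, w1, b1, w2, b2)
```

In this sketch the multiplication only rescales channels; physically removing the channels with low weights would need an extra selection step that the description does not detail.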
When the learning target is image segmentation, the image segmentation model is processed as follows:
S2-111, acquiring a training image domain, dividing it through the MS-CAP module (multi-scale channel perception embedding module), extracting patch tokens under different receptive fields in parallel, calculating the channel attention weights of the different patch tokens through the channel attention layer, and processing each set of patch tokens based on the channel attention weights to obtain segmentation-enhanced multi-scale features;
Specifically, the conventional Segmenter model is deficient in modeling boundary details and in preserving complete semantic information in its intermediate-layer features, which hinders effective learning by the diffusion model. The MS-CAP module is therefore used to replace the original flatten and project modules. In the MS-CAP module, the training image domain is processed by multiple parallel convolution layers, and patch tokens under different receptive fields are extracted using convolution kernels of different sizes. A small kernel (such as 3×3) has a small receptive field and focuses on local image details, so it can accurately extract fine textures; a large kernel (such as 7×7) has a large receptive field, captures information from a wider area, and focuses on overall structure and semantics.
Each set of patch tokens is input into the channel attention layer, which analyzes the feature channels of each branch and adaptively selects the channels that are sensitive to boundaries and textures. By calculating an importance weight for each channel, the responses of channels related to texture and semantic boundaries are enhanced and irrelevant channels are suppressed, which highlights the key feature information of the image and makes the joint diffusion model focus more on boundaries and textures.
Finally, the multi-scale features enhanced by channel attention are concatenated, integrating the enhanced features at different scales into a feature set rich in multi-scale information. The concatenated features are then uniformly projected to the patch token embedding dimension required by the Transformer encoder, so that they meet the processing requirements of the subsequent Transformers. Position codes are added to record the spatial position of each feature in the original image and preserve the spatial order of the features, which avoids losing spatial structure information during processing.
S2-112, inputting the segmentation-enhanced multi-scale features into the segmentation encoder, extracting semantic residual fusion features, and using them as the segmentation task domain, that is, the training feature domain;
In the segmentation encoder, the segmentation-enhanced multi-scale features are divided into segmentation patches, and each segmentation patch is converted into a one-dimensional vector by the one-dimensional flattening layer. Each one-dimensional vector is embedded by the position embedding layer to obtain an embedded sequence. The first of the second Transformers performs feature encoding and context aggregation on the embedded sequence to obtain a first semantic residual fusion feature. Through the SREB module, the output of each second Transformer is fused with the output of the second Transformer k layers away, and the fused feature is used as the input of the next second Transformer. The operations of the second Transformers and the SREB module are repeated until the output of the last second Transformer is obtained, which gives the semantic residual fusion feature. In this embodiment, k is 2 or 3.
It should be explained that the SREB module fuses shallow features with deep features by introducing cross-layer feature connections. This improves the transfer of semantic information between layers and reduces the loss of intermediate semantic information, so that the extracted features capture local edge information, the semantic subject and the global structure, and the supervision target for training the diffusion model is a complete semantic feature. The output $F_i$ of the i-th second Transformer is fused with the output $F_{i+k}$ of the (i+k)-th second Transformer according to:
$$F'_{i+k} = F_{i+k} + \mathrm{gate}(F_i)$$
where $F'_{i+k}$ denotes the fused semantic residual fusion feature, and gate(·) denotes a convolution with a 1×1 kernel followed by a Sigmoid activation function.
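A small sketch of the gated cross-layer fusion follows. It assumes the features are token sequences of shape (T, D), on which a 1×1 convolution is equivalent to a per-token linear map; the names `gate` and `sreb_fuse` and the toy shapes are hypothetical.

```python
import numpy as np

def gate(features, weight, bias):
    """1x1 convolution (a per-token linear map on a token sequence) followed by sigmoid."""
    return 1.0 / (1.0 + np.exp(-(features @ weight + bias)))

def sreb_fuse(f_i, f_i_plus_k, weight, bias):
    """F'_{i+k} = F_{i+k} + gate(F_i); both inputs have shape (T, D)."""
    return f_i_plus_k + gate(f_i, weight, bias)

rng = np.random.default_rng(0)
f_shallow = rng.standard_normal((6, 4))   # output of the i-th second Transformer
f_deep = rng.standard_normal((6, 4))      # output of the (i+k)-th second Transformer
weight, bias = rng.standard_normal((4, 4)), np.zeros(4)
fused = sreb_fuse(f_shallow, f_deep, weight, bias)
```

Since the gate output lies between 0 and 1, the shallow layer contributes a bounded residual and cannot overwhelm the deep feature.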
In the inference stage, the joint diffusion model is used to generate semantic segmentation features, the generated features are input into a two-layer lightweight Mask Transformer decoder, and a semantic mask probability map is output for each category. The final segmentation result map is obtained after a softmax operation and a per-pixel argmax over categories.
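The final softmax and argmax step can be illustrated on a toy decoder output. The function name and the 3-class, 2×2 example are assumptions made for illustration; the decoder itself is not modeled here.

```python
import numpy as np

def segmentation_map(mask_logits):
    """mask_logits: (num_classes, H, W) decoder outputs -> (probabilities, label map)."""
    shifted = mask_logits - mask_logits.max(axis=0, keepdims=True)  # numerical stability
    probs = np.exp(shifted)
    probs = probs / probs.sum(axis=0, keepdims=True)                # softmax over classes
    return probs, probs.argmax(axis=0)                              # per-pixel class index

logits = np.array([[[2.0, 0.0], [0.0, 0.0]],
                   [[0.0, 3.0], [0.0, 1.0]],
                   [[0.0, 0.0], [4.0, 0.0]]])  # 3 classes on a 2x2 image
probs, labels = segmentation_map(logits)
```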
When the learning target is target detection, the processing procedure of the target detection model is as follows:
S2-121, acquiring a training image domain, extracting the global semantic information of the training image domain through the global semantic CNN network, extracting the initial edge features of the training image domain through the CNN network, extracting detail textures from the initial edge features using the local perception convolution layer to obtain edge convolution features, and dynamically extracting edge-enhanced features from the edge convolution features using the boundary response enhancement module in combination with an edge detection operator and boundary attention gating;
S2-122, fusing the edge-enhanced features and the global semantic information through the channel attention layer to obtain global-edge fusion features;
S2-123, encoding the global-edge fusion features through the first block layer to obtain global-fusion encoded information, extracting an edge response map of the training image domain using the spatial guidance module, and modulating the global-fusion encoded information with the edge response map through a mask attention mechanism to obtain global-fusion guidance information;
It should be explained that, in the spatial guidance module, the global-fusion encoded information is modulated based on the edge response map extracted from the training image domain to obtain dynamically weighted attention features, which enhances the responsiveness of the joint diffusion model to small targets and edge regions. Specifically, a low-resolution edge response map $E$ is extracted from the input image and used as a position guidance signal; the output feature $F$ of the current block layer is fused with the guidance map $E$ to obtain a mask attention value $M$; and $M$ is used to modulate $F$ according to:
$$F' = F \odot (1 + M)$$
where ⊙ denotes element-wise multiplication and $F'$ is the global-fusion guidance information.
The global-fusion guidance information is then input to the next block layer. In this way, each encoder block layer dynamically adjusts where its features focus, and the guidance signal comes from the original input image (the training image domain) or from shallow branches, so no additional supervision is introduced.
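The modulation $F' = F \odot (1 + M)$ can be sketched as below. How $F$ and $E$ are fused into $M$ is not specified in the description, so the fusion used here (a hypothetical learned projection of $F$ added to $E$, followed by a sigmoid) is purely an assumption for illustration.

```python
import numpy as np

def spatial_guidance(features, edge_map, proj):
    """features F: (H*W, D); edge_map E: (H*W,) in [0, 1]; proj: (D,) hypothetical projection."""
    score = features @ proj + edge_map              # hypothetical fusion of F with E
    mask = 1.0 / (1.0 + np.exp(-score))             # mask attention value M in (0, 1)
    return features * (1.0 + mask[:, None]), mask   # F' = F * (1 + M), element-wise

rng = np.random.default_rng(0)
features = rng.standard_normal((16, 8))             # 4x4 grid of 8-dimensional tokens
edge_map = rng.uniform(0.0, 1.0, size=16)           # low-resolution edge response map
proj = rng.standard_normal(8)
guided, mask = spatial_guidance(features, edge_map, proj)
```

Because $M$ is non-negative, the modulation can only amplify features (by at most a factor of two in this sketch), so positions flagged by the edge map are emphasized without suppressing the rest.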
S2-124, repeating the operations of the block layers and the spatial guidance module until the output of the last block layer is obtained and used as the detection domain, that is, the training feature domain.
In the training stage, the output of the last block layer is used as the detection domain and serves as a supervision target for training the joint diffusion model. In the inference stage, the joint diffusion model may be used to generate the detection features.
S2-2, performing role assignment on the training feature domain, the training image domain and the training label domain through the encoder and the random role assignment mechanism to determine the corresponding role types;
Conventional methods cannot unify generation and understanding tasks, so a random role assignment mechanism is proposed. Performing role assignment on the training feature domain, the training image domain and the training label domain through the encoder and the random role assignment mechanism to determine the corresponding role types includes:
S2-2-1, inputting the training image domain, the training label domain and the training feature domain into the encoder according to the training generation/understanding task to obtain training domain-encoded information, wherein the training generation/understanding task includes the generation task and the understanding task. For example, when the task is a generation task, the input to the encoder is the corresponding label domain and feature domain.
S2-2-2, processing the training domain-encoded information through the random role assignment mechanism to assign roles and determine the role type and the corresponding processing method: if the role type is G, the processing method is noise adding; if the role type is C, the data is kept as it is; and if the role type is X, the data is set to zero;
Specifically, as shown in FIG. 7, if the role type is G (generated), noise is added using the standard rectified flow training method and the velocity field is predicted; if the role type is C (condition), the data is kept as it is; and if the role type is X (ignored), the data is set to zero and its influence is masked out by the masking mechanism. In FIG. 2, the switch represents the random role assignment mechanism.
S2-2-3, processing the training domain-encoded information according to the determined processing method to obtain the input data of the joint diffusion model.
Specifically, the formula corresponding to the random role assignment mechanism is:
$$x_k^{\tau} = \begin{cases} (1-\tau)\,x_k + \tau\,\xi_k, & \rho_k = G \\ x_k, & \rho_k = C \\ 0, & \rho_k = X \end{cases}$$
where $x_k^{\tau}$ denotes the data obtained by processing the domain-encoded information through the random role assignment mechanism, which serves as input to the initial joint diffusion model; $\rho_k$ denotes the assigned role; $k$ denotes the domain index; $\tau$ denotes the time step; $x_k$ denotes the domain-encoded information; and $\xi_k$ denotes the noise term;
Through the random role assignment mechanism, different tasks are realized: when the image domain is C and the other domains are G, understanding is realized; when the image domain is G and a certain label domain is C, controllable generation is realized; and when all domains are G, joint generation is realized. During training, multiple tasks are trained in a single model through the role assignment mechanism, which unifies generation and understanding.
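The role-dependent processing (noise for G, unchanged for C, zero for X) can be sketched as follows. Several details are assumptions rather than statements of the embodiment: roles are drawn uniformly, at least one domain is forced to G, and the noising uses the linear interpolation $(1-\tau)\,x_k + \tau\,\xi_k$, which is one common rectified flow convention consistent with a target velocity of noise minus true value.

```python
import numpy as np

def sample_roles(num_domains, rng):
    """Draw one role per domain uniformly at random (assumed), keeping at least one G."""
    roles = [str(r) for r in rng.choice(["G", "C", "X"], size=num_domains)]
    if "G" not in roles:
        roles[int(rng.integers(num_domains))] = "G"
    return roles

def apply_roles(domains, roles, tau, rng):
    """Return model inputs, the sampled noises, and the validity mask m_k per domain."""
    inputs, noises, valid = [], [], []
    for x, role in zip(domains, roles):
        xi = rng.standard_normal(x.shape)
        if role == "G":                       # generated: interpolate toward noise
            inputs.append((1.0 - tau) * x + tau * xi)
        elif role == "C":                     # condition: keep as is
            inputs.append(x.copy())
        else:                                 # "X", ignored: zero the data
            inputs.append(np.zeros_like(x))
        noises.append(xi)
        valid.append(0.0 if role == "X" else 1.0)
    return inputs, noises, np.array(valid)

rng = np.random.default_rng(0)
domains = [rng.standard_normal((4, 8)) for _ in range(3)]  # image, label, feature
roles = ["C", "G", "X"]   # condition on the image, generate the label, ignore the feature
inputs, noises, valid = apply_roles(domains, roles, tau=0.3, rng=rng)
sampled = sample_roles(5, rng)
```

The validity mask is what the masked full-attention mechanism would consume as $m_j$ for the tokens of each domain.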
S2-3, acquiring training text information, inputting the training text information and the input data obtained for each domain into the initial joint diffusion model according to the training generation/understanding task, and generating the corresponding training task result;
It should be noted that, before being fed to the initial joint diffusion model, the input data is divided into a plurality of training tokens. The processing of the initial joint diffusion model is explained below for three cases, according to the training generation/understanding task.
When the training generation/understanding task is a joint generation task, the training tokens corresponding to the training text information are input to the joint diffusion model, the roles of all domains are G, and their tokens are randomly initialized noise. The model finally generates an image together with its corresponding label domain and feature domain; these are generated simultaneously in a single inference pass, with no order among them.
When the training generation/understanding task is a generation task, the training tokens corresponding to the training label domain and the training feature domain are input to the domain-invariant positional encoding layer, which introduces position information into the training tokens to obtain the corresponding training generation position tokens. The stacked Transformer layers capture the relationships among the training generation position tokens, generate shallow and deep features at different layers, and fuse them to finally obtain the corresponding training image.
When the training generation/understanding task is an understanding task, the process is the same as for the generation task, except that the input data is the training image. In all three tasks, the domains to be generated are initialized with random noise, while the conditional domains keep their original tokens unchanged.
S2-4, calculating a diffusion loss function Loss based on each training task result:
$$\mathrm{Loss} = \mathbb{E}_{\tau \sim U(0,1),\ \xi_{0:K} \sim N(0,1),\ x_{0:K} \sim D} \left[ \sum_{k:\ \rho_k = G} \left\| v_\theta\!\left(x_0^{\tau}, x_1^{\tau}, \dots, x_K^{\tau}, \tau\right)_k - (\xi_k - x_k) \right\|^2 \right]$$
where $\mathbb{E}$ denotes the expectation; $U(0,1)$ denotes the uniform distribution on [0,1], from which $\tau$ is sampled; $\xi_{0:K}$ and $\xi_k$ denote the noise terms of domains 0 to K and of the k-th domain; $N(0,1)$ denotes the standard normal distribution with mean 0 and variance 1, from which the noise is sampled; $x_{0:K}$ and $x_k$ denote the target values of domains 0 to K and of the k-th domain in the training set; $D$ denotes the training set consisting of the training image domain, the training label domain and the training feature domain; Σ(·) denotes summation; $\rho_k$ denotes the role of the k-th domain; $G$ denotes the role type; $\|\cdot\|$ denotes the norm; $\theta$ denotes the parameters of the initial joint diffusion model; $\tau$ denotes the time step; $x_0^{\tau}$, $x_1^{\tau}$ and $x_K^{\tau}$ denote the values at time step $\tau$ of the domains with indexes 0, 1 and K; and $v_\theta(\cdot)$ denotes the velocity prediction model, with the subscript $k$ selecting its output for the k-th domain;
The velocity prediction model is the model of the present invention. Under the diffusion framework, the model predicts the velocity at which the data changes between the current state and the real state, so that the real data is gradually restored from noise. The loss function optimizes the model by minimizing the difference between the velocity predicted by the model and the true velocity (noise minus true value), thereby improving generation quality.
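The loss restricted to domains with role G can be sketched as below. This assumes a squared L2 norm and takes the model predictions as given arrays; the function name and the toy data are hypothetical, and the velocity model itself is not implemented.

```python
import numpy as np

def diffusion_loss(pred_velocity, clean, noise, roles):
    """Squared error between predicted and target velocity, summed over domains with role G."""
    total = 0.0
    for v, x, xi, role in zip(pred_velocity, clean, noise, roles):
        if role == "G":                                  # C and X domains are not supervised
            total += float(np.sum((v - (xi - x)) ** 2))  # target velocity: noise minus true value
    return total

rng = np.random.default_rng(0)
clean = [rng.standard_normal((2, 3)) for _ in range(3)]
noise = [rng.standard_normal((2, 3)) for _ in range(3)]
roles = ["C", "G", "G"]
perfect = [xi - x for x, xi in zip(clean, noise)]        # an ideal velocity prediction
loss_perfect = diffusion_loss(perfect, clean, noise, roles)
loss_shifted = diffusion_loss([v + 1.0 for v in perfect], clean, noise, roles)
```

A prediction that is off by 1 in every element contributes one unit per element of each G domain, and nothing for the conditional domain.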
S2-5, iterating the initial joint diffusion model based on the diffusion loss function, and adjusting weight parameters of the initial joint diffusion model to complete optimization training of the initial joint diffusion model.
S3, outputting the task result for the task to be generated/understood through the joint diffusion model, thereby unifying image generation and understanding.
Example 2:
As shown in FIG. 8, a unified system for image generation and understanding based on joint diffusion modeling includes:
The data acquisition module is used for determining an image domain or a label domain or a feature domain of a task to be generated/understood;
The role assignment module is used for performing role assignment through the encoder and the random role assignment mechanism to obtain the input data of the joint diffusion model;
The diffusion model building training module is used for building an initial joint diffusion model by combining a Flux architecture, determining a learning target by combining an image classification model, an image segmentation model and a target detection model, and carrying out optimization training on the initial joint diffusion model to obtain a joint diffusion model;
The generation/understanding task completion module is used for inputting the data of the task to be generated/understood into the joint diffusion model and outputting the task result, thereby unifying image generation and understanding.
The diffusion model construction training module comprises:
the diffusion model construction submodule is used for constructing an initial joint diffusion model by combining with a Flux architecture;
the classification task domain acquisition sub-module is used for acquiring a training image domain and processing the training image domain through an image classification model row to obtain a classification task domain;
The segmentation domain acquisition sub-module is used for acquiring a training image domain and processing the training image domain through an image segmentation model to obtain a segmentation task domain;
the detection domain acquisition sub-module is used for acquiring a training image domain and processing the training image domain through a target detection model row to obtain detection domain;
The character allocation sub-module is used for allocating the classification task domain, the segmentation task domain and the detection domain respectively through an encoder and a random character allocation mechanism to obtain corresponding character types;
the model optimization training sub-module is used for acquiring training label fields and training text information, calculating a diffusion loss function based on classification tasks domain, segmentation tasks domain and detection domains, iterating an initial joint diffusion model based on the diffusion loss function, and adjusting weight parameters of the initial joint diffusion model.
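As an illustration of the random role assignment performed before training, the following minimal sketch draws a role for each task domain at random. The role names (`generate`, `condition`, `ignore`), the function `assign_roles`, and the rule that at least one domain must be a generation target are assumptions made for this example only; the encoder is omitted.

```python
import random

# Hypothetical role names, assumed for illustration only.
ROLES = ("generate", "condition", "ignore")


def assign_roles(domains, rng):
    """Draw one role per domain uniformly at random."""
    roles = {name: rng.choice(ROLES) for name in domains}
    # Assumption: a training sample needs at least one generation target.
    if "generate" not in roles.values():
        roles[rng.choice(domains)] = "generate"
    return roles


domains = ["image", "classification", "segmentation", "detection"]
roles = assign_roles(domains, random.Random(0))
for name, role in roles.items():
    print(f"{name}: {role}")
```

Passing an explicit `random.Random` instance keeps the draw reproducible, which is convenient when debugging a training run.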
It should be noted that, regarding the system in the above embodiment, the specific manner in which the respective modules perform the operations has been described in detail in the embodiment regarding the method, and will not be described in detail herein.
Example 3:
The joint generation task is processed by the method of Example 1. As shown in fig. 9, given an input text, the model simultaneously generates an image and the 7 corresponding kinds of labels. The generated images cover a wide range of aspect ratios, all at a resolution of around 1024.
Example 4:
The ControlNet, UniControl, EasyControl, OmniGen, PixWizard and OneDiffusion models are selected as baselines for comparison with the joint diffusion model of the invention. As shown in Table 1, performance is compared across the label domains for the generation task; compared with the other models, the joint diffusion model of the present invention achieves better performance under the various conditions. The performance metrics are LPIPS (Learned Perceptual Image Patch Similarity) and FID (Fréchet Inception Distance), where FID measures the distance between the distribution of generated images and the distribution of real images.
TABLE 1
The results of the generation task obtained with the present method and with the OmniGen, OneDiffusion and PixWizard models are shown in fig. 10. Compared with the OmniGen, OneDiffusion and PixWizard models, the present method performs better on the generation task: its generated images conform more closely to the input labels and features.
The method is then evaluated in turn on depth estimation, normal estimation, albedo estimation and edge detection; the results are listed in Tables 2 to 5, respectively.
TABLE 2
| Method | NYUv2 | ScanNet | DIODE |
| --- | --- | --- | --- |
| OmniGen | 9.2 | 10.1 | 30.6 |
| PixWizard | 7.0 | 7.9 | 25.4 |
| OneDiffusion | 8.9 | 9.7 | 25.2 |
| The method | 10.1 | 12.1 | 25.9 |
| The method (integrated) | 8.3 | 9.9 | 25.8 |
TABLE 3
| Method | NYUv2 | ScanNet | iBims |
| --- | --- | --- | --- |
| OmniGen | 28.9 | 28.9 | 31.3 |
| PixWizard | 23.5 | 26.6 | 22.5 |
| The method | 21.1 | 24.3 | 20.1 |
| The method (integrated) | 18.6 | 20.3 | 18.2 |
TABLE 4
| Method | PSNR | LPIPS |
| --- | --- | --- |
| Ordinal Shading | 15.6 | 0.37 |
| Kocsis et al. | 11.3 | 0.49 |
| Careaga and Aksoy | 15.7 | 0.36 |
| RGB2X | 20.6 | 0.18 |
| The method | 15.5 | 0.31 |
| The method (integrated) | 16.5 | 0.33 |
TABLE 5
| Method | ODS | OIS |
| --- | --- | --- |
| HED | 0.788 | 0.808 |
| PiDiNet | 0.807 | 0.823 |
| OmniGen | 0.767 | 0.781 |
| PixWizard | 0.605 | 0.633 |
| OneDiffusion | 0.682 | 0.691 |
| The method | 0.826 | 0.851 |
For depth estimation, the absolute mean relative error was measured on the NYUv2, ScanNet and DIODE datasets for the present method, its integrated model, and the OneDiffusion, OmniGen and PixWizard models. As shown in Table 2, when the five models are ranked from lowest to highest error, the present method and its integrated model rank fifth and second on the NYUv2 dataset, fifth and third on the ScanNet dataset, and fourth and third on the DIODE dataset, respectively.
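The absolute mean relative error used for depth estimation can be sketched as follows. This is a minimal illustration on made-up values, not the evaluation code behind Table 2; the function name `abs_rel` and the rule of skipping pixels with non-positive ground-truth depth are assumptions.

```python
def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over pixels with valid ground truth.

    Pixels whose ground-truth depth is not positive are skipped
    (an assumption; each benchmark defines its own validity mask).
    """
    errors = [abs(p - g) / g for p, g in zip(pred, gt) if g > 0]
    if not errors:
        raise ValueError("no valid ground-truth depth values")
    return sum(errors) / len(errors)


# Toy flattened depth maps (made-up values, not benchmark data).
predicted = [1.0, 2.2, 2.7, 4.0]
ground_truth = [1.0, 2.0, 3.0, 0.0]  # last pixel has no valid depth
score = abs_rel(predicted, ground_truth)
print(f"AbsRel = {score:.4f}")
```

A lower value means the predicted depth is closer to the ground truth.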
For normal estimation, the mean angular error was measured on the NYUv2, ScanNet and iBims datasets for the present method, its integrated model, and the OmniGen and PixWizard models. As shown in Table 3, when the four models are ranked from highest to lowest error, the present method and its integrated model rank third and fourth, respectively, on all three datasets; that is, they achieve the two lowest errors in every case. Compared with the other models, the performance of the present method is therefore more stable across the various datasets.
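The mean angular error used for normal estimation can be sketched as follows. This is a minimal illustration on made-up vectors, not the evaluation code behind Table 3; the function name `mean_angular_error` and the handling of zero-length normals are assumptions.

```python
import math


def mean_angular_error(pred_normals, gt_normals):
    """Mean angle in degrees between paired 3-D normal vectors."""
    angles = []
    for p, g in zip(pred_normals, gt_normals):
        dot = sum(a * b for a, b in zip(p, g))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in g))
        if norm == 0.0:
            continue  # assumption: skip degenerate (zero-length) normals
        # Clamp to [-1, 1] so rounding error cannot break acos.
        cosine = max(-1.0, min(1.0, dot / norm))
        angles.append(math.degrees(math.acos(cosine)))
    if not angles:
        raise ValueError("no valid normal pairs")
    return sum(angles) / len(angles)


# Toy normals (made-up values): angles of 0, 90 and 45 degrees.
pred = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 1.0)]
gt = [(0.0, 0.0, 1.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
error = mean_angular_error(pred, gt)
print(f"mean angular error = {error:.2f} degrees")
```

Because the vectors are normalized inside the function, the inputs do not need to be unit length.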
For albedo estimation, PSNR and LPIPS were measured on the test set of the Hypersim dataset for the present method, its integrated model, and the Ordinal Shading, Kocsis et al., Careaga and Aksoy, and RGB2X models. As shown in Table 4, when the models are ranked from highest to lowest PSNR, the integrated model ranks second (16.5), behind only the RGB2X model, while the present method (15.5) is within 0.2 dB of the Ordinal Shading model (15.6) and the Careaga and Aksoy model (15.7). When the models are ranked from lowest to highest LPIPS, the present method and its integrated model rank second and third, respectively. It follows that the method outperforms the Ordinal Shading, Kocsis et al., and Careaga and Aksoy models in LPIPS, and that its integrated model also outperforms them in PSNR.
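The PSNR metric used for albedo estimation can be sketched as follows. This is a minimal illustration on made-up values, not the evaluation code behind Table 4; the function name `psnr` and the assumption that pixel values lie in the range 0 to 1 are choices made for this example. LPIPS relies on a pretrained network and is not shown.

```python
import math


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy flattened albedo maps (made-up values in [0, 1]).
predicted = [0.5, 0.5, 0.5, 0.5]
reference = [0.4, 0.6, 0.5, 0.5]
value = psnr(predicted, reference)
print(f"PSNR = {value:.2f} dB")
```

A higher PSNR means a smaller mean squared error relative to the peak pixel value.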
For edge detection, the F-scores at the Optimal Dataset Scale (ODS) and at the Optimal Image Scale (OIS) were measured on the BSDS500 dataset for the present method and the HED, PiDiNet, OmniGen, PixWizard and OneDiffusion models. As shown in Table 5, the present method achieves the highest F-score at both ODS (0.826) and OIS (0.851), outperforming all of the other models.
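The difference between ODS and OIS can be sketched as follows: ODS uses one threshold shared by the whole dataset, whereas OIS picks the best threshold separately for each image. This is a simplified illustration, not the BSDS500 benchmark code; it assumes that the per-image, per-threshold counts of matched edge pixels are already available, and the function names and toy counts are made up.

```python
def f_score(tp, n_pred, n_gt):
    """F-score from true positives, predicted positives and ground-truth positives."""
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gt if n_gt else 0.0
    total = precision + recall
    return 2.0 * precision * recall / total if total else 0.0


def ods_ois(counts):
    """counts[i][t] = (tp, n_pred, n_gt) for image i at threshold t."""
    n_thresholds = len(counts[0])
    # ODS: pool counts over all images per threshold, keep the best threshold.
    ods = max(
        f_score(*[sum(image[t][k] for image in counts) for k in range(3)])
        for t in range(n_thresholds)
    )
    # OIS: pick each image's best threshold, then pool those counts.
    best = [max(image, key=lambda c: f_score(*c)) for image in counts]
    ois = f_score(*[sum(c[k] for c in best) for k in range(3)])
    return ods, ois


# Toy counts for two images at two thresholds (made-up values).
counts = [
    [(8, 10, 10), (5, 5, 10)],
    [(4, 10, 10), (6, 8, 10)],
]
ods, ois = ods_ois(counts)
print(f"ODS = {ods:.3f}, OIS = {ois:.3f}")
```

In this toy case OIS is higher than ODS because the two images prefer different thresholds, which a single dataset-wide threshold cannot satisfy.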
Example 5:
This example compares the joint diffusion model with a variant from which the domain-invariant position encoding layer has been removed. As shown in fig. 11, without domain-invariant position encoding there is a significant misalignment between the image domain and the label domain; this problem is alleviated once the position encoding is added, which demonstrates the role of the position encoding. Therefore, the domain-invariant position encoding layer proposed by the method helps the joint diffusion model align spatial positions across domains.
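One possible reading of domain-invariant position encoding is that a token's position depends only on its spatial location and not on the domain it belongs to. The following minimal sketch illustrates that idea only; the function `build_tokens`, the token layout and the 2 x 2 patch grid are assumptions made for this example and do not reproduce the actual encoding layer.

```python
def build_tokens(domains, height, width):
    """List one token per (domain, row, col) patch.

    The position depends only on (row, col), never on the domain, so
    co-located patches from different domains share the same position.
    """
    tokens = []
    for domain in domains:
        for row in range(height):
            for col in range(width):
                tokens.append({"domain": domain, "position": (row, col)})
    return tokens


tokens = build_tokens(["image", "depth"], height=2, width=2)
image_positions = [t["position"] for t in tokens if t["domain"] == "image"]
depth_positions = [t["position"] for t in tokens if t["domain"] == "depth"]
# Both domains expose identical position sequences.
print(image_positions == depth_positions)
```

Sharing positions in this way gives the model a direct cue that patch (row, col) of the image and patch (row, col) of a label map describe the same region.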
The above description covers only the preferred embodiments of the present invention and is not intended to limit the present invention; various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.