Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a semantic image analogy method based on a single-image generative adversarial network. The task is named "semantic image analogy", as a variant of "image analogy" (Aaron Hertzmann, Charles E. Jacobs, Nuria Oliver, Brian Curless, and David Salesin. 2001. Image Analogies. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques. 327-340), and is defined below.
Given a source image I and its corresponding semantic segmentation map P, and some other semantic segmentation map P', a new target image I' is synthesized such that:
$$I : I' :: P : P'$$
where "::" denotes the analogy relation.
As shown in FIG. 1, the target image I' (the four images in the dashed box) should match both the appearance of the source image I and the layout of the target segmentation map P' (the four segmentation maps in the dashed box). The task setting aims to find a transition from I to I' that is analogous to the transition from P to P'. In addition, two metrics are used to evaluate the quality of the images generated by the semantic image analogy model: the patch-level distance and the semantic alignment score. The former requires the source image I to be the only source of the patches that make up the generated image I', while the latter requires the generated image I' to have a semantic layout aligned with the target segmentation map P'.
In practice, the source segmentation map P or another image with similar context may be edited to obtain the target segmentation map P'. The generator may then generate a semantically aligned target image I 'from the source image I, similar to the way P' is obtained from P. Comparison with existing methods shows that the method provided by the invention has advantages in both quantitative and qualitative assessment. Due to the flexible task setup, the proposed method can easily be extended to various applications including object removal, face editing and sketch-to-image generation of natural images.
As shown in FIG. 2, the main principle of the semantic image analogy method based on a single-image generative adversarial network provided by the present invention is implemented by a network model composed of an encoder, a generator, an auxiliary classifier and a discriminator; wherein:
Training stage: a self-supervised framework is designed for training a conditional GAN from a single image. In each training iteration, the same random augmentation operation is applied to a given source image I_source and the corresponding source semantic segmentation image P_source to obtain an augmented image I_aug and an augmented semantic segmentation image P_aug. The feature tensors of the source semantic segmentation image and the augmented semantic segmentation image are extracted by the same encoder, and the semantic feature transformation module in the generator predicts transformation parameters in the image domain from the two feature tensors, so that the generator produces a target image I_target from the source image under the guidance of these parameters. The target image is input into the discriminator and the auxiliary classifier, which respectively predict score maps of the target image and the augmented image, and the target semantic segmentation image corresponding to the target image. The total loss function used for training is constructed from the appearance similarity loss between the target image and the source image, the feature matching loss between the target image and the augmented image obtained on the basis of the score maps, and the semantic alignment loss between the target semantic segmentation image and the augmented semantic segmentation image.
During the training process, the randomness of the augmentation is gradually increased. Since the generator is homomorphic, the source image can be reconstructed well when P_target and P_source are the same. Here P_target is a general notation; during training P_target = P_aug, so the two are used interchangeably in the description of the training process below.
In the embodiment of the invention, a sampling mode and a reconstruction mode are used alternately for training. In the sampling mode, i.e., the procedure described above, the generator takes the augmented semantic segmentation image as guidance and generates a target image whose appearance matches the augmented image I_aug and whose semantic layout matches the augmented semantic segmentation image P_aug. The reconstruction mode follows the same procedure as the sampling mode, except that the given source image and the corresponding source semantic segmentation image are input directly, and the source image is reconstructed using the source semantic segmentation image.
Inference stage: a source image, the corresponding source semantic segmentation image and a specified semantic segmentation image are input into the semantic image analogy network, which outputs an image with the same semantic layout as the specified semantic segmentation image.
After training is completed, the network model can, given a semantic segmentation map with an arbitrary shape layout, generate a target image that matches it, so that the content information of the source image is retained while the target image matches the target semantic layout. As shown in FIG. 1, the trained network model can change the shape of the horse in the source image according to a given shape.
To facilitate understanding, the principles and procedures of the present invention are described in detail below.
The technical principle of the invention is a single-image generative adversarial network. For a single image, a generative adversarial network (i.e., a generative model) conditioned on the semantic segmentation map is trained, which mainly comprises the generator, the auxiliary classifier and the discriminator. A series of novel designs establish a semantic association between the semantic segmentation map and the image pixels, and this association is then used to recombine the image through the semantic segmentation map.
The semantic image analogy task is converted into a patch-level layout matching problem, with the transformation guided in the semantic segmentation domain. To this end, three main challenges need to be addressed: the source of paired data for training a generative model from a single image, the conditioning scheme that passes guidance from the segmentation domain to the image domain, and appropriate supervision of the generated samples (i.e., the output of the generator).
To accomplish this task, a novel method is proposed that integrates the following three basic components:
1) A self-supervised training framework with a progressive data augmentation strategy is designed. By alternately optimizing with the augmented segmentation and the original segmentation, the conditional GAN is successfully trained from a single image and generalizes well to unseen transformations.
2) A semantic feature transformation module is designed, which converts the transformation parameters from the segmentation domain to the image domain.
3) A semantics-aware patch coherence loss is designed, which encourages the transformed image to contain only patches from the source image. Together with the semantic alignment constraint, it allows the generator to produce a realistic image with the target semantic layout.
As shown in fig. 1, the training phase mainly comprises the following steps:
Step 1: given a source image I_source and the corresponding source semantic segmentation image P_source, random augmentation is first performed to obtain an augmented image I_aug and an augmented semantic segmentation image P_aug; the source semantic segmentation image P_source and the augmented semantic segmentation image P_aug are then input into the same encoder E_seg (i.e., E_seg in FIG. 1) to extract their features separately.
In an embodiment of the present invention, the random augmentation operation includes one or more of the following operations: random flipping, resizing, rotation and cropping. The randomness of these operations increases linearly with the training step; this progressive strategy helps the encoder learn the appearance of the source image in the early iterations of training.
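The progressive strategy can be illustrated with a minimal Python sketch. It is not the implementation of the embodiment: the function names, the 25% crop limit, the flip probability and the restriction to flipping and cropping (resizing and rotation are omitted) are all assumptions made for illustration. The point shown is that one set of random parameters is applied to both the image and its segmentation map, and that a single strength value grows linearly with the training step.

```python
import numpy as np

def augmentation_strength(step, total_steps, max_strength=1.0):
    """Linearly increase the augmentation randomness with the training step."""
    return max_strength * min(1.0, step / float(total_steps))

def random_augment(image, seg, strength, rng):
    """Apply the same random flip and crop to an image and its segmentation map.

    image: (H, W, C) array; seg: (H, W) integer label array.
    strength in [0, 1] controls how random the operations are (hypothetical scale).
    """
    h, w = seg.shape
    # Horizontal flip with a probability that grows with strength (at most 0.5).
    if rng.random() < 0.5 * strength:
        image, seg = image[:, ::-1], seg[:, ::-1]
    # Crop away at most strength * 25% of each side length (assumed limit).
    cut_h = int(rng.integers(0, int(0.25 * strength * h) + 1))
    cut_w = int(rng.integers(0, int(0.25 * strength * w) + 1))
    top = int(rng.integers(0, cut_h + 1))
    left = int(rng.integers(0, cut_w + 1))
    new_h, new_w = h - cut_h, w - cut_w
    # The same window is cut from both inputs so they stay pixel-aligned.
    image = image[top:top + new_h, left:left + new_w]
    seg = seg[top:top + new_h, left:left + new_w]
    return image, seg
```

With strength 0 the pair is returned unchanged, which corresponds to the earliest iterations in which the network mainly learns the appearance of the source image.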
Step 2: a Semantic Feature Transformation (SFT) module is designed to predict the transformation parameters in the image domain from the feature tensors.
The SFT module explicitly converts the transformation parameters from the segmentation domain to the image domain, as shown in FIG. 3. The transformation from the source semantic segmentation image P_source to the augmented semantic segmentation image P_aug is modeled as a linear transformation at the feature level. Accordingly, element-wise division and subtraction are applied to the feature tensor F_source of the source semantic segmentation image and the feature tensor F_aug of the augmented semantic segmentation image to obtain a feature scaling tensor F_scale and a feature shift tensor F_shift for the subsequent down-sampling stages. For the l-th down-sampling stage, compute:
$$F_{scale}^{l} = F_{aug}^{l} / F_{source}^{l}, \qquad F_{shift}^{l} = F_{aug}^{l} - F_{source}^{l}$$
where F_aug^l and F_source^l are the feature tensors extracted from F_aug and F_source, respectively, for the l-th down-sampling stage. For example, if the number of down-sampling stages is K, each of the two feature tensors is divided into K parts, and each down-sampling stage takes the corresponding part and performs the above calculation.
The feature scaling tensor F_scale^l and the feature shift tensor F_shift^l are used to approximate the scaling factor and the shifting factor of the segmentation-map transformation. As shown in FIG. 3, two SFT units model the conversion from the segmentation domain to the image domain: they process F_scale^l and F_shift^l separately to obtain the scaling factor and the shifting factor of the image domain, (γ_img^l, β_img^l). The parameters of the SFT units are learned through the training process.
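The per-stage computation of the feature scaling and shift tensors can be sketched as follows. This is a hedged illustration only: it reads the element-by-element comparison as an element-wise ratio, assumes that the feature tensors are split into K parts along the channel axis, and adds a small constant eps to avoid division by zero. The function name is hypothetical, and the learned SFT units that map these tensors to the image domain are not modeled.

```python
import numpy as np

def scale_shift_per_stage(f_source, f_aug, num_stages, eps=1e-6):
    """Split two (C, H, W) feature tensors into num_stages channel groups and
    return, for each stage, the element-wise ratio and difference."""
    src_parts = np.array_split(f_source, num_stages, axis=0)  # assumed: split by channel
    aug_parts = np.array_split(f_aug, num_stages, axis=0)
    stages = []
    for src, aug in zip(src_parts, aug_parts):
        f_scale = aug / (src + eps)  # eps is an assumption, not part of the description
        f_shift = aug - src
        stages.append((f_scale, f_shift))
    return stages
```

When the augmented segmentation equals the source segmentation, the ratio is approximately 1 and the difference is 0, which is the identity case used in the reconstruction mode.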
Step 3: under the guidance of the image-domain scaling factor and shifting factor (γ_img, β_img) obtained from the SFT module, the encoder-decoder part of the generator G maps the source image I_source to the target image I_target.
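For the (l+1)-th down-sampling stage in the generator, the output feature tensor is given by:
$$F_{img}^{l+1} = \gamma_{img}^{l} \cdot \frac{\mathrm{DS}(F_{img}^{l}) - \mathrm{mean}(\mathrm{DS}(F_{img}^{l}))}{\mathrm{std}(\mathrm{DS}(F_{img}^{l}))} + \beta_{img}^{l}$$
where F_img^l denotes the image feature tensor input to the stage, DS represents the down-sampling module (i.e., encoder), and mean and std represent the mean and standard deviation, respectively.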
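As a rough illustration of how a scaling factor and a shifting factor can guide one down-sampling stage, the sketch below normalizes the down-sampled features with their mean and standard deviation and then applies the scale and shift. The exact form of the modulation, the per-channel statistics, the constant eps, and the use of 2 x 2 average pooling as a stand-in for the learned module DS are all assumptions of this sketch.

```python
import numpy as np

def average_pool_2x(features):
    """Stand-in for the learned down-sampling module DS: 2 x 2 average pooling."""
    c, h, w = features.shape
    trimmed = features[:, :h - h % 2, :w - w % 2]
    return trimmed.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def modulate(features, gamma, beta, eps=1e-5):
    """Normalize a (C, H, W) tensor per channel, then scale by gamma and shift by beta."""
    mean = features.mean(axis=(1, 2), keepdims=True)
    std = features.std(axis=(1, 2), keepdims=True)
    normalized = (features - mean) / (std + eps)
    return gamma * normalized + beta

def downsample_stage(features, gamma, beta):
    """One guided down-sampling stage: DS followed by normalization and modulation."""
    return modulate(average_pool_2x(features), gamma, beta)
```

Here gamma and beta may be scalars or arrays that broadcast against the (C, H/2, W/2) output, playing the role of the image-domain scaling and shifting factors.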
The up-sampling module (i.e., decoder) of the generator then maps the image feature tensor output by the down-sampling stages to the image domain, thereby generating the target image. In an embodiment of the present invention, the generator is an encoder-decoder structure having K down-sampling blocks and K up-sampling blocks; each block contains a 3 x 3 convolutional layer with a stride of 3 and a 4 x 4 convolutional or transposed convolutional layer with a stride of 2 for down-sampling or up-sampling, and each block also uses spectral normalization, batch normalization and leaky ReLU activation. Illustratively, the initial number of channels is 32, which is doubled at each down-sampling.
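For example, K may be set to 3. Each of the three down-sampling blocks receives the image feature tensor of the previous stage and outputs the image feature tensor of the next stage according to the above formula; the up-sampling blocks take the output of the last down-sampling stage as input, and their output is the target image I_target.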
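The effect of K = 3 down-sampling blocks on tensor shapes can be checked with a short sketch. It uses the standard output-size formula for a strided convolution and considers only the 4 x 4 stride-2 layer of each block. The padding of 1 and the choice that the first block already outputs 32 channels are assumptions, as are the function names.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1

def encoder_shapes(input_size, num_blocks=3, start_channels=32):
    """List (channels, spatial size) after each down-sampling block.

    Assumes the 4 x 4 stride-2 convolution uses padding 1 and that the
    channel count starts at start_channels and doubles per block.
    """
    shapes, size = [], input_size
    for i in range(num_blocks):
        size = conv_output_size(size, kernel=4, stride=2, padding=1)
        shapes.append((start_channels * 2 ** i, size))
    return shapes
```

Under these assumptions, a 256 x 256 input gives feature maps of 128, 64 and 32 pixels per side with 32, 64 and 128 channels.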
Step 4: the discriminator D takes the augmented image I_aug as a real sample and the generated target image I_target as a fake sample. At the same time, the generated image is also input into the auxiliary classifier S to predict its segmentation map.
In an embodiment of the present invention, the discriminator is a fully convolutional PatchGAN (Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2017. Image-to-Image Translation with Conditional Adversarial Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 5967-).
In an embodiment of the invention, semantic segmentation is performed in the auxiliary classifier (Segmentation Network in FIG. 2) using a simplified version of the DeepLab V3 architecture (Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. 2018. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Vol. 11211. 833-851).
Step 5: the loss function is constructed and the designed self-supervised network is trained.
According to the task setting of semantic image analogy, the generated image should meet the following requirements: 1) its content is consistent with the source image; 2) its semantic layout is aligned with the target segmentation map. Therefore, a patch coherence loss (Patch Coherence Loss) is proposed to measure the appearance similarity between the generated image and the source image, and a semantic alignment loss (Semantic Alignment Loss) is proposed to measure the consistency between the target semantic segmentation image predicted from the target image by the auxiliary classifier and the augmented semantic segmentation image P_aug. Specifically:
1) The appearance similarity between the generated image and the source image is measured by the patch coherence loss. If the generator produces an image patch for which no corresponding patch can be found in the source image, this loss penalizes the generator G. It is defined as the average, over the patches of the target image, of the minimum patch distance to the source image:
$$\mathcal{L}_{patch} = \frac{1}{N_{target}} \sum_{U \subset I_{target}} \; \min_{V \subset I_{source},\; V_{class} = U_{class}} d(U, V)$$
where N_target is the number of image patches in the target image I_target; I_source represents the source image, and G(I_source) = I_target; U_class and V_class represent the segmentation labels of image patches U and V; and d(·) is a distance metric function. This loss relaxes the position dependence of a pixel-wise distance. Instead, the image is treated as a bag of visual feature words. For each patch in the target image, a nearest-neighbor search is run to find the most similar patch with the same class label in the source image, and the resulting distances are then averaged. It was found empirically that features from a pre-trained VGG network (Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR)) produce good results, although other feature descriptors are also applicable.
2) The auxiliary classifier is used to predict a segmentation map of the target image (i.e., the target semantic segmentation image). The cross-entropy (CE) loss between the predicted segmentation map and the augmented segmentation map is then calculated. The semantic alignment loss of the generator is defined as:
$$\mathcal{L}_{seg} = \mathrm{CE}(P_{predict}, P_{aug})$$
where CE represents the cross-entropy loss, and P_predict = S(G(I_source)) = S(I_target) is the target semantic segmentation image (the output of the auxiliary classifier S).
3) The least-squares GAN loss L_GAN is used as the adversarial constraint, and features are extracted from the discriminator to compute a feature matching loss L_fm between the augmented image and the generated image.
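The patch coherence loss and the semantic alignment loss described in items 1) and 2) can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: patch descriptors are given as plain vectors (in the embodiment they would come from a pre-trained network), the distance d is taken to be the squared Euclidean distance, target patches whose class does not occur in the source are skipped, and the classifier output is given as raw per-class scores. The function names are hypothetical.

```python
import numpy as np

def patch_coherence_loss(target_feats, target_labels, source_feats, source_labels):
    """Mean over target patches of the distance to the nearest same-class source patch.

    *_feats: (N, D) arrays of patch descriptors; *_labels: (N,) integer class labels.
    """
    distances = []
    for feat, label in zip(target_feats, target_labels):
        candidates = source_feats[source_labels == label]
        if len(candidates) == 0:
            continue  # assumption: ignore classes that are absent from the source
        d = np.sum((candidates - feat) ** 2, axis=1)  # squared Euclidean (assumed metric)
        distances.append(d.min())
    return float(np.mean(distances)) if distances else 0.0

def semantic_alignment_loss(logits, target_labels):
    """Pixel-wise cross entropy between predicted class scores and target labels.

    logits: (C, H, W) raw scores from the auxiliary classifier.
    target_labels: (H, W) integer class indices of the guiding segmentation map.
    """
    shifted = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    h, w = target_labels.shape
    rows, cols = np.arange(h)[:, None], np.arange(w)[None, :]
    picked = log_probs[target_labels, rows, cols]  # log-probability of the true class
    return float(-picked.mean())
```

The class restriction in the nearest-neighbor search is what makes the patch loss semantics-aware: a target patch labeled as one class cannot be explained by a visually similar source patch of another class.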
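The total loss function is:
$$\mathcal{L} = \mathcal{L}_{patch} + \lambda_{seg}\,\mathcal{L}_{seg} + \lambda_{GAN}\,\mathcal{L}_{GAN} + \lambda_{fm}\,\mathcal{L}_{fm}$$
where L_patch indicates the appearance similarity loss, L_seg represents the semantic alignment loss, L_GAN is the least-squares GAN loss and L_fm is the feature matching loss; λ_seg, λ_GAN and λ_fm are hyperparameters, all of which are set to 1.0 in the experiments.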
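A minimal sketch of the adversarial terms and the weighted total follows. It assumes the common 0/1 target coding of the least-squares GAN loss, a mean L1 distance for feature matching, and a weight of 1 on the appearance similarity term; none of these details is stated in the description, and the function names are hypothetical.

```python
import numpy as np

def lsgan_discriminator_loss(real_scores, fake_scores):
    """Least-squares GAN loss for D: push real scores to 1 and fake scores to 0."""
    return 0.5 * (np.mean((real_scores - 1.0) ** 2) + np.mean(fake_scores ** 2))

def lsgan_generator_loss(fake_scores):
    """Least-squares GAN loss for G: push the scores of generated images to 1."""
    return 0.5 * np.mean((fake_scores - 1.0) ** 2)

def feature_matching_loss(real_feats, fake_feats):
    """Mean L1 distance between discriminator features, averaged over layers."""
    return float(np.mean([np.mean(np.abs(r - f)) for r, f in zip(real_feats, fake_feats)]))

def total_generator_loss(l_patch, l_seg, l_gan, l_fm,
                         lam_seg=1.0, lam_gan=1.0, lam_fm=1.0):
    """Weighted sum of the four generator loss terms (all weights default to 1.0)."""
    return l_patch + lam_seg * l_seg + lam_gan * l_gan + lam_fm * l_fm
```

Because the discriminator outputs a score map rather than a single number, the means above run over all spatial positions of the map.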
In the embodiment of the invention, two modes, sampling and reconstruction, are used alternately for training.
In the sampling mode, i.e., steps 1 to 5 described above, the generator takes the augmented semantic segmentation image as guidance and generates a target image whose appearance matches the augmented image I_aug and whose semantic layout matches the augmented semantic segmentation image P_aug.
The reconstruction mode follows the same procedure as the sampling mode, except that the random augmentation in step 1 is not performed: the given source image and the corresponding source semantic segmentation image are input directly, and the source image is reconstructed using the source semantic segmentation image. The total loss function also differs slightly: the appearance similarity loss is replaced by an L1 reconstruction loss between the output reconstructed image and the source image, and the feature matching loss and the semantic alignment loss are computed between the target image and the source image and between the target semantic segmentation image and the source semantic segmentation image, respectively.
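The alternation between the two modes can be sketched as a small helper that decides, for each iteration, which image and segmentation map guide the generator. The strict even/odd alternation and the names prepare_iteration and augment are assumptions of this sketch; the description only states that the two modes are used alternately.

```python
import numpy as np

def l1_reconstruction_loss(reconstructed, source):
    """Mean absolute difference used in place of the patch loss in reconstruction mode."""
    return float(np.mean(np.abs(reconstructed - source)))

def prepare_iteration(iteration, i_source, p_source, augment):
    """Return (mode, guiding image, guiding segmentation) for one training iteration.

    augment is any callable that applies the same random operation to both inputs.
    """
    if iteration % 2 == 0:  # assumed schedule: even iterations sample
        i_aug, p_aug = augment(i_source, p_source)
        return "sampling", i_aug, p_aug
    # Reconstruction mode: no augmentation, the source pair guides the generator.
    return "reconstruction", i_source, p_source
```

In sampling mode the returned pair plays the role of (I_aug, P_aug); in reconstruction mode it is the unmodified (I_source, P_source), and the appearance term becomes l1_reconstruction_loss.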
In the inference stage, the network model parameters are fixed. Given a source image I_source, the corresponding source semantic segmentation image P_source, and a specified semantic segmentation image (which can be obtained by editing P_source or by other means), the two semantic segmentation images are input into the encoder E, and steps 2 to 3 are then executed; the resulting image matches the specified semantic segmentation image while retaining the content information of the source image.
To verify the effectiveness of the present invention, the performance of the above-described method of the present invention was evaluated in terms of numerical indicators and visual effects, respectively.
The semantic image analogy task was applied to images from different datasets, including COCO-Stuff, ADE20K, CelebAMask-HQ, and the web (i.e., randomly selected natural pictures in the web). The results of the above-described method of the invention and the comparative method were evaluated in two respects: 1) appearance similarity between the source image and the target image; 2) semantic consistency of the target image and the target segmentation graph.
To evaluate the appearance similarity between the generated image and the source image, a user study was performed as follows. Ten pairs of images with the same class labels were randomly selected from the COCO-Stuff dataset. For each pair, one image is used as the source image and the other provides the target layout. Image analogy (Aaron Hertzmann, Charles E. Jacobs, Nuria Oliver, Brian Curless, and David Salesin. 2001. Image Analogies. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques. 327-340; IA) and deep image analogy (Jing Liao, Yuan Yao, Lu Yuan, Gang Hua, and Sing Bing Kang. 2017. Visual Attribute Transfer through Deep Image Analogy. ACM Trans. Graph. 36, 4 (2017), 120:1-120:15; DIA) are used to transfer the source image into the layout of the other image. IA and DIA are the two works most closely related to the above method of the present invention. DIA requires a pair of images as source and target, whereas IA and the above method of the present invention require only one source image and two segmentation maps. The results are displayed in random order, and 20 users were asked to rank them by appearance similarity, with the source image as the reference. The average ranking (Avg. User Ranking) of each method over all images and users was then calculated. Table 1 shows the superiority of the above method of the present invention (Ours) over the two competitors.
Table 1. Performance under the semantic alignment metrics and the subjective user evaluation
To evaluate the semantic consistency between the generated image and the target segmentation map, the segmentation map of the generated image was predicted using the panoptic segmentation model of Detectron2, and the pixel-wise accuracy (Pixel Accuracy) and the mean intersection over union (mIoU) with respect to the target segmentation map were then calculated. The images used for this evaluation are the same as those in the user study. As shown in Table 1, the method achieves the highest accuracy.
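The two semantic alignment metrics can be computed from a predicted label map and the target label map as in the following sketch. The convention of averaging the IoU only over classes that appear in at least one of the two maps is an assumption, and the function names are illustrative.

```python
import numpy as np

def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted label equals the target label."""
    return float(np.mean(pred == target))

def mean_iou(pred, target, num_classes):
    """Mean intersection over union, averaged over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        pred_c, target_c = pred == c, target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both maps: skip it (assumed convention)
        ious.append(np.logical_and(pred_c, target_c).sum() / union)
    return float(np.mean(ious))
```

For a 2 x 2 prediction [[0, 0], [1, 1]] against the target [[0, 1], [1, 1]], three of four pixels agree, the IoU of class 0 is 1/2 and that of class 1 is 2/3.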
In FIGS. 4, 5 and 6, the above method of the present invention is compared with the current best image analogy algorithms IA and DIA, the single-image generative adversarial network model SinGAN (Tamar Rott Shaham, Tali Dekel, and Tomer Michaeli. 2019. SinGAN: Learning a Generative Model From a Single Natural Image. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 4569-4579), and the segmentation-map-to-image translation model SPADE (Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. 2019. Semantic Image Synthesis with Spatially-Adaptive Normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)). DIA may produce unrealistic results. Because it does not consider the semantic structure, SinGAN often alters the unedited areas and produces undesirable textures, or simply blurs the pasted objects, which results in very similar versions of the edited result. Although the output of SPADE is semantically consistent with the target layout, its content is limited to the training data set and the appearance of the source image is lost. The above method of the present invention produces an image that is faithful in appearance to the source image and semantically consistent with the target layout.
The method can semantically manipulate an image through its segmentation map. Instances may be moved, resized, or deleted in the source semantic segmentation map to obtain the target layout. As shown in FIG. 7, the above method of the present invention produces high-quality results under arbitrary semantic modifications while preserving the local appearance of the modified instances well.
The flexible task setting of semantic image analogy enables various applications. Because the conditional input is dense, pixel-level control can be used to recombine the image patches of an image. FIGS. 8, 9 and 10 illustrate three applications of the above method of the present invention: 1) object removal, where unwanted objects can be easily removed by changing their class labels in the semantic segmentation map to a background class; 2) face editing, where a face image can be edited by altering the shape of the face in the segmentation map; and 3) edge-to-image generation, where other spatial conditions (e.g., edge maps) can be used as the conditional input.
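Edits of this kind operate only on the label map, as the following sketch shows for object removal and for moving an instance. The class indices and function names are hypothetical, and the sketch moves all pixels of a class rather than a single instance; the edited map would then be supplied to the trained model as the specified semantic segmentation image.

```python
import numpy as np

def remove_class(seg, object_class, background_class):
    """Relabel every pixel of object_class as background_class (object removal)."""
    edited = seg.copy()
    edited[seg == object_class] = background_class
    return edited

def shift_class(seg, object_class, background_class, dy, dx):
    """Move all pixels of one class by (dy, dx); vacated pixels become background."""
    h, w = seg.shape
    ys, xs = np.nonzero(seg == object_class)
    ys, xs = ys + dy, xs + dx
    # Discard pixels that would fall outside the map after the move.
    keep = (ys >= 0) & (ys < h) & (xs >= 0) & (xs < w)
    edited = remove_class(seg, object_class, background_class)
    edited[ys[keep], xs[keep]] = object_class
    return edited
```

Both functions return a new array and leave the source segmentation map untouched, so the source and target layouts can be passed to the encoder side by side.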
Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, and can also be implemented by software plus a necessary general hardware platform. Based on such understanding, the technical solutions of the embodiments may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments of the present invention.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.