CN112102303B

CN112102303B - Semantic image analogy method based on a single-image generative adversarial network

Info

Publication number: CN112102303B
Application number: CN202011001562.XA
Authority: CN
Inventors: 熊志伟; 李家丞; 刘东
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2020-09-22
Filing date: 2020-09-22
Publication date: 2022-09-06
Anticipated expiration: 2040-09-22
Also published as: CN112102303A

Abstract

The present invention discloses a semantic image analogy method based on a single-image generative adversarial network. According to the technical solution provided by the present invention, given any image and its semantic segmentation map, an image generation model specific to that image can be trained. The model can recombine the source image according to different desired semantic layouts and generate images that conform to the target semantic layout, achieving the effect of semantic image analogy. The results produced by this method are optimal in both visual quality and alignment accuracy.

Description

Semantic image analogy method based on a single-image generative adversarial network

Technical Field

The invention relates to the technical field of image processing, and in particular to a semantic image analogy method based on a single-image generative adversarial network.

Background

Generative models such as Variational Auto-Encoders (VAEs) and Generative Adversarial Networks (GANs) have made significant progress in modeling the distribution of natural images in a generative manner. By taking additional signals such as class labels, text, edges or segmentation maps as input, conditional generative models can generate photorealistic samples in a controllable manner, which is useful in many multimedia applications such as interactive design and artistic style transfer.

In particular, segmentation maps provide dense pixel-level guidance for generating models and enable users to spatially control desired instances, which is much more flexible than image-level guidance like class labels or styles.

Isola et al. proposed the Pix2Pix model, which demonstrates the ability of a conditional GAN to generate controllable images given dense conditional signals, including sketches and segmentation maps (Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2017. Image-to-Image Translation with Conditional Adversarial Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 5967-). Wang et al. extended this framework with a coarse-to-fine generator and a multi-scale discriminator to generate images with high-resolution details (Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2018. High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 8798-). Park et al. proposed a spatially-adaptive normalization technique (SPADE) that uses a semantic map to predict affine transformation parameters for modulating the activations in normalization layers (Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. 2019. Semantic Image Synthesis with Spatially-Adaptive Normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2337-2346). Typically, these methods require a large training dataset to learn the mapping from segmentation class labels to image patch appearances across the entire dataset. As a result, the appearance of an instance of a given label in the generated image is limited to the appearances of that label in the training dataset, which limits the generalization ability of these models on arbitrary natural images.

On the other hand, recent studies on single-image GANs have shown that it is possible to learn a generative model from the internal patch distribution of a single image. InGAN defines resizing transformations and trains a generative model to capture the internal patch statistics for retargeting (Assaf Shocher, Shai Bagon, Phillip Isola, and Michal Irani. 2019. InGAN: Capturing and Retargeting the "DNA" of a Natural Image. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 4491-4500). SinGAN performs unconditional image generation using a multi-stage training scheme and can generate images of arbitrary size from noise (Tamar Rott Shaham, Tali Dekel, and Tomer Michaeli. 2019. SinGAN: Learning a Generative Model from a Single Natural Image. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 4569-4579). KernelGAN uses a deep linear generator and constrains it to learn an image-specific degradation kernel for blind super-resolution (Sefi Bell-Kligler, Assaf Shocher, and Michal Irani. 2019. Blind Super-Resolution Kernel Estimation using an Internal-GAN. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems (NeurIPS). 284-293). Although these image-specific GANs are independent of any dataset and produce favorable results, the semantic meaning of patches within a single image remains largely unexplored.

Disclosure of Invention

The invention aims to provide a semantic image analogy method based on a single-image generative adversarial network, whose generated results are optimal in both visual quality and alignment accuracy.

The purpose of the invention is realized by the following technical scheme:

A semantic image analogy method based on a single-image generative adversarial network is implemented by a network model consisting of an encoder, a generator, an auxiliary classifier and a discriminator, wherein:

a training stage: during each training iteration, the same random augmentation operation is applied to a given source image and the corresponding source semantic segmentation image to obtain an enhanced (i.e., augmented) image and an enhanced semantic segmentation image; the feature tensors of the source semantic segmentation image and the enhanced semantic segmentation image are extracted by the same encoder, and a semantic feature transformation module in the generator predicts transformation parameters in the image domain from the two feature tensors, so that the generator produces a target image from the source image under the guidance of the transformation parameters; the target image is input into the discriminator and the auxiliary classifier, which respectively predict score maps of the target image and the enhanced image and a target semantic segmentation image corresponding to the target image; a total loss function for training is constructed from the appearance similarity loss between the target image and the source image, the feature matching loss between the target image and the enhanced image obtained from the score maps, and the semantic alignment loss between the target semantic segmentation image and the enhanced semantic segmentation image;

an inference stage: the source image, the corresponding source semantic segmentation image and a specified semantic segmentation image are input into the semantic image analogy network, which outputs an image with the same semantic layout as the specified semantic segmentation image.

It can be seen from the above technical solution that, given any image and its semantic segmentation image, a generative model specific to that image can be trained; the model can recombine the source image according to different desired semantic layouts to generate an image conforming to the target semantic layout, thereby achieving semantic image analogy.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a conceptual diagram of semantic image analogy provided by an embodiment of the present invention;

FIG. 2 is a schematic diagram of a semantic image analogy method based on a single-image generative adversarial network provided by an embodiment of the present invention;

FIG. 3 is a calculation flowchart of the SFT module provided by an embodiment of the present invention;

FIG. 4 is a visual comparison between the images generated by the present invention and those of existing image analogy methods, provided by an embodiment of the present invention;

FIG. 5 is a visual comparison between the images generated by the present invention and those of existing single-image GAN methods, provided by an embodiment of the present invention;

FIG. 6 is a visual comparison between the images generated by the present invention and those of existing semantic image translation methods, provided by an embodiment of the present invention;

FIG. 7 shows the visual effect of the present invention on the semantic image analogy task, provided by an embodiment of the present invention;

FIG. 8 shows the visual effect of the present invention on the image object removal task, provided by an embodiment of the present invention;

FIG. 9 shows the visual effect of the present invention on the face editing task, provided by an embodiment of the present invention;

FIG. 10 shows the visual effect of the present invention on the edge-to-image translation task, provided by an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.

The embodiment of the invention provides a semantic image analogy method based on a single-image generative adversarial network. The task is named "semantic image analogy", as a variant of "image analogies" (Aaron Hertzmann, Charles E. Jacobs, Nuria Oliver, Brian Curless, and David Salesin. 2001. Image Analogies. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques. 327-340), and is defined below.

Given a source image I and its corresponding semantic segmentation map P, together with other semantic segmentation maps P', a new target image I' is synthesized such that:

$$P : P' :: I : I'$$

where "::" denotes the analogy relationship.

As shown in FIG. 1, the target image I' (the four images in the dashed box) should match both the appearance of the source image I and the layout of the target segmentation map P' (the four segmentation maps in the dashed box). The task setting aims to find a transformation from I to I' that is analogous to the transformation from P to P'. In addition, two metrics are used to evaluate the quality of images generated by a semantic image analogy model: patch-level distance and semantic alignment score. The former restricts the source image I to be the only source of patches for the generated image I', while the latter forces the generated image I' to have a semantic layout aligned with the target segmentation map P'.

In practice, the source segmentation map P or another image with similar context may be edited to obtain the target segmentation map P'. The generator may then generate a semantically aligned target image I 'from the source image I, similar to the way P' is obtained from P. Comparison with existing methods shows that the method provided by the invention has advantages in both quantitative and qualitative assessment. Due to the flexible task setup, the proposed method can easily be extended to various applications including object removal, face editing and sketch-to-image generation of natural images.

FIG. 2 shows the main principle of the semantic image analogy method based on a single-image generative adversarial network provided by the present invention. The method is implemented by a network model composed of an encoder, a generator, an auxiliary classifier and a discriminator, wherein:

A training stage: a self-supervised framework is designed for training a conditional GAN from a single image. During each training iteration, the same random augmentation operation is applied to a given source image I_source and the corresponding source semantic segmentation image P_source to obtain an enhanced (i.e., augmented) image I_aug and an enhanced semantic segmentation image P_aug; the feature tensors of the source semantic segmentation image and the enhanced semantic segmentation image are extracted by the same encoder, and a semantic feature transformation module in the generator predicts transformation parameters in the image domain from the two feature tensors, so that the generator produces a target image I_target from the source image under the guidance of the transformation parameters; the target image is input into the discriminator and the auxiliary classifier, which respectively predict score maps of the target image and the enhanced image and a target semantic segmentation image corresponding to the target image; a total loss function for training is constructed from the appearance similarity loss between the target image and the source image, the feature matching loss between the target image and the enhanced image obtained from the score maps, and the semantic alignment loss between the target semantic segmentation image and the enhanced semantic segmentation image.

During the training process, the randomness of the augmentation is gradually increased. Since the generator is homomorphic, the source image can be reconstructed well when P_target and P_source are the same. Here P_target is a general notation; in training, P_target = P_aug, so the two are used interchangeably in the following description of training.

In the embodiment of the invention, the sampling and reconstruction modes are used alternately for training. In the sampling mode, i.e. the manner described above, the generator takes the enhanced semantic segmentation image as guidance and generates a target image whose appearance matches the enhanced image I_aug and whose semantic layout matches the enhanced semantic segmentation image P_aug. The reconstruction mode works in the same way as the sampling mode, except that the given source image and the corresponding source semantic segmentation image are input directly and the source image is reconstructed using the source semantic segmentation image.

An inference stage: a source image, the corresponding source semantic segmentation image and a specified semantic segmentation image are input into the semantic image analogy network, which outputs an image with the same semantic layout as the specified semantic segmentation image.

After training is completed, given a semantic segmentation map with any shape and layout, the network model can generate a target image that matches it: the content information of the source image is preserved while the target semantic layout is matched. As shown in FIG. 1, the trained network model can change the shape of the horse in the source image according to a given shape.

For ease of understanding, the principles and procedures of the present invention are described in detail below.

The technical basis of the invention is a single-image generative adversarial network. For a single image, a generative adversarial network (i.e., a generative model) conditioned on the semantic segmentation map is trained, mainly comprising the generator, the auxiliary classifier and the discriminator. A series of novel designs establish a semantic association between the semantic segmentation map and the image pixels, and this association is then used to recombine the image via the semantic segmentation map.

The semantic image analogy task is converted into a patch-level distribution matching problem, with the transformation guided in the semantic segmentation domain. To this end, three main challenges need to be addressed: the source of paired data for training a generative model from a single image, the conditioning scheme that passes guidance from the segmentation domain to the image domain, and appropriate supervision of the generated samples (i.e., the output of the generator).

To accomplish this task, a novel method is proposed that integrates the following three basic components:

1) A self-supervised training framework with a progressive data augmentation strategy is designed. By alternating optimization between the enhanced segmentation and the original segmentation, the conditional GAN is successfully trained from a single image and generalizes well to unseen transformations.

2) A semantic feature transformation module is designed, which translates the transformation parameters from the segmentation domain to the image domain.

3) A semantics-aware patch coherence loss is designed, which encourages the transformed image to contain only patches from the source image. Together with the semantic alignment constraint, it allows the generator to generate a realistic image with the target semantic layout.

As shown in FIG. 1, the training phase mainly comprises the following steps:

Step 1, given a source image I_source and the corresponding source semantic segmentation image P_source, random augmentation is first performed to obtain an enhanced image I_aug and an enhanced semantic segmentation image P_aug; the source semantic segmentation image P_source and the enhanced semantic segmentation image P_aug are then input into the same encoder E (i.e., E_seg in FIG. 1) to extract their features separately.

In an embodiment of the present invention, the random augmentation operation includes one or more of the following: random flipping, resizing, rotation and cropping. The randomness of these operations increases linearly with the training step; this progressive strategy may help the encoder learn the appearance of the source image in the early iterations of training.
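The paired, progressively stronger augmentation described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: plain Python lists stand in for images, only flipping and cropping are shown, and the names `augment_pair` and `augmentation_strength` as well as the 50% caps on flip probability and crop size are assumptions made for the example.

```python
import random

def flip_horizontal(grid):
    """Mirror a 2-D grid (list of rows) left to right."""
    return [row[::-1] for row in grid]

def crop(grid, top, left, height, width):
    """Cut a height x width window out of a 2-D grid."""
    return [row[left:left + width] for row in grid[top:top + height]]

def augmentation_strength(step, total_steps, max_strength=1.0):
    """Randomness grows linearly with the training step, capped at max_strength."""
    return max_strength * min(1.0, step / total_steps)

def augment_pair(image, seg, step, total_steps, rng):
    """Apply the SAME random flip and crop to an image and its segmentation map."""
    strength = augmentation_strength(step, total_steps)
    height, width = len(image), len(image[0])
    # Assumed schedule: flip probability rises from 0 to 0.5 over training.
    do_flip = rng.random() < 0.5 * strength
    # Assumed schedule: the crop may remove up to strength * 50% of each side.
    crop_h = height - rng.randint(0, int(0.5 * strength * height))
    crop_w = width - rng.randint(0, int(0.5 * strength * width))
    top = rng.randint(0, height - crop_h)
    left = rng.randint(0, width - crop_w)
    results = []
    for grid in (image, seg):
        if do_flip:
            grid = flip_horizontal(grid)
        results.append(crop(grid, top, left, crop_h, crop_w))
    return results[0], results[1]
```

Drawing the random parameters once and reusing them for both inputs is what keeps every pixel of the enhanced image aligned with its label in the enhanced segmentation image.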

Step 2, a Semantic Feature Transformation (SFT) module is designed to predict the transformation parameters in the image domain from the feature tensors.

The transformation parameters are explicitly translated from the segmentation domain to the image domain by the SFT module, as shown in FIG. 3. The transformation from the source semantic segmentation image P_source to the enhanced semantic segmentation image P_aug is modeled as a linear transformation at the feature level. Thus, element-wise division and subtraction are performed on the feature tensor F_source of the source semantic segmentation image and the feature tensor F_aug of the enhanced semantic segmentation image to obtain a feature scaling tensor F_scale and a feature shift tensor F_shift for the subsequent downsampling stages. For the l-th downsampling stage, compute:

$$F_{scale}^{l} = F_{aug}^{l} / F_{source}^{l}, \qquad F_{shift}^{l} = F_{aug}^{l} - F_{source}^{l}$$

where $F_{aug}^{l}$ and $F_{source}^{l}$ are the feature tensors extracted from F_aug and F_source, respectively, for the l-th downsampling stage. For example, if the number of downsampling stages is K, each of the two feature tensors is divided into K parts, and each downsampling stage takes the corresponding part and performs the above calculation.

The feature scaling tensor $F_{scale}^{l}$ and the feature shift tensor $F_{shift}^{l}$ approximate the scaling factor and the shifting factor of the segmentation-map transformation. As shown in FIG. 3, two SFT units are used to model the translation from the segmentation domain to the image domain; they process $F_{scale}^{l}$ and $F_{shift}^{l}$ separately to obtain the scaling factor and the shifting factor of the image domain, $(\gamma_{img}^{l}, \beta_{img}^{l})$.

The parameters of the SFT units are learned during training.
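As a rough illustration of how the per-stage feature scaling and shift tensors could be computed, the sketch below splits two flat feature lists into K parts and takes the element-wise ratio and difference of each part. Reading the element-wise operations as division and subtraction, the small `eps` guard, and the function names are assumptions; the learned SFT units that turn these tensors into image-domain factors are not modeled here.

```python
def split_stages(features, num_stages):
    """Split a flat feature list into num_stages equal consecutive parts."""
    size = len(features) // num_stages
    return [features[i * size:(i + 1) * size] for i in range(num_stages)]

def scale_and_shift(f_source, f_aug, num_stages, eps=1e-8):
    """Return one (F_scale, F_shift) pair per downsampling stage."""
    stages = []
    source_parts = split_stages(f_source, num_stages)
    aug_parts = split_stages(f_aug, num_stages)
    for src, aug in zip(source_parts, aug_parts):
        # eps is an assumption that guards against division by zero.
        f_scale = [a / (s + eps) for a, s in zip(aug, src)]
        f_shift = [a - s for a, s in zip(aug, src)]
        stages.append((f_scale, f_shift))
    return stages
```

In a full model each pair would be fed to the two SFT units of the corresponding stage, whose parameters are learned during training.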

Step 3, under the guidance of the image-domain scaling and shifting factors (γ_img, β_img) obtained from the SFT module, the encoder-decoder part of the generator G maps the source image I_source to the target image I_target.

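For the (l+1)-th downsampling stage in the generator, the output feature tensor is given by:

$$F_{img}^{l+1} = \gamma_{img}^{l} \cdot \frac{DS(F_{img}^{l}) - mean(DS(F_{img}^{l}))}{std(DS(F_{img}^{l}))} + \beta_{img}^{l}$$

where $F_{img}^{l}$ is the image feature tensor at the l-th stage, DS denotes the downsampling module (i.e., the encoder), and mean and std denote the mean and the standard deviation, respectively.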
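One way to read the modulation at a downsampling stage is: downsample, normalize by mean and standard deviation, then scale by the image-domain scaling factor and add the shifting factor. The sketch below follows that reading on a one-dimensional feature list. The average-pooling `downsample` is only a stand-in for the learned DS block, and computing the statistics over the whole list (rather than per channel) and the `eps` term are assumptions.

```python
import statistics

def downsample(features):
    """Stand-in for the learned DS block: average adjacent pairs (stride 2)."""
    return [(features[i] + features[i + 1]) / 2 for i in range(0, len(features) - 1, 2)]

def modulate(features, gamma, beta, eps=1e-8):
    """Normalize to zero mean and unit std, then scale by gamma and shift by beta."""
    mean = statistics.fmean(features)
    std = statistics.pstdev(features)
    return [g * (x - mean) / (std + eps) + b for x, g, b in zip(features, gamma, beta)]

def modulated_stage(features, gamma, beta):
    """One downsampling stage guided by the image-domain factors (gamma, beta)."""
    return modulate(downsample(features), gamma, beta)
```

For instance, `modulated_stage([1.0, 3.0, 5.0, 7.0], [2.0, 2.0], [1.0, 1.0])` downsamples to `[2.0, 6.0]`, normalizes to roughly `[-1.0, 1.0]`, and returns values close to `[-1.0, 3.0]`.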

The upsampling module (i.e., the decoder) of the generator then maps the image feature tensor output by the downsampling stages to the image domain, thereby generating the target image. In an embodiment of the present invention, the generator has an encoder-decoder structure with K downsampling blocks and K upsampling blocks. Each block contains a 3 x 3 convolutional layer with a stride of 3 and a 4 x 4 convolutional or transposed convolutional layer with a stride of 2 for downsampling or upsampling, and each block also uses spectral normalization, batch normalization and leaky ReLU activation. Illustratively, the initial number of channels is 32, which is doubled at each downsampling.

For example, K may be set to 3. Each of the three downsampling blocks receives the scaling and shifting factors of its stage and produces its output according to the above formula; the upsampling blocks take the feature tensor output by the last downsampling block as input and output the target image I_target.

Step 4, the discriminator D takes the enhanced image I_aug as a real sample and the generated target image I_target as a fake sample. At the same time, the generated image is also input into the auxiliary classifier S to predict its segmentation map.

In an embodiment of the present invention, the discriminator is a fully convolutional PatchGAN (Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2017. Image-to-Image Translation with Conditional Adversarial Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 5967-).

In an embodiment of the invention, semantic segmentation is performed in the auxiliary classifier (the Segmentation Network in FIG. 2) using a simplified version of the DeepLab V3 architecture (Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. 2018. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Vol. 11211. 833-851).

Step 5, a loss function is constructed to train the designed self-supervised network.

According to the task setting of semantic image analogy, the generated image should meet the following requirements: 1) its content is consistent with the source image; 2) its semantic layout is aligned with the target segmentation map. Therefore, a patch coherence loss is proposed to measure the appearance similarity between the generated image and the source image, and a semantic alignment loss is proposed to measure the consistency between the target semantic segmentation image, predicted from the target image by the auxiliary classifier, and the source semantic segmentation image P_source. Specifically:

1) The appearance similarity between the generated image and the source image is measured by the patch coherence loss. If the generator produces an image patch for which no corresponding patch can be found in the source image, this constraint penalizes the generator G. The loss is defined as the average of the lower bound of the patch distance between the source image and the target image:

$$\mathcal{L}_{patch}(G(I_{source}), I_{source}) = \frac{1}{N_{target}} \sum_{U \in I_{target}} \min_{V \in I_{source},\, V_{class} = U_{class}} d(U, V)$$

where N_target is the number of image patches in the target image I_target, I_source is the source image, and G(I_source) = I_target; U_class and V_class are the segmentation labels of image patches U and V, and d(·) is a distance metric function. This loss relaxes the positional dependence of a pixel-wise distance; instead, the image is treated as a bag of visual words. For each patch in the target image, a nearest neighbor search is run to find the most similar patch with the same class label in the source image, and the average of these distances is then taken. It was found empirically that features from a pre-trained VGG network (Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR)) produce good results, although other feature descriptors are also applicable.
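The class-restricted nearest neighbor search described above can be sketched in a few lines. In this illustration each patch is a `(class_label, feature_vector)` pair; the squared Euclidean distance stands in for d(·) and the short feature lists stand in for VGG features, both of which are assumptions made for the example.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def patch_coherence_loss(target_patches, source_patches, distance=squared_distance):
    """Mean, over target patches, of the distance to the nearest same-class source patch."""
    total = 0.0
    for label, feature in target_patches:
        # Only source patches that share the target patch's class label are candidates.
        candidates = [src for src_label, src in source_patches if src_label == label]
        if not candidates:
            raise ValueError(f"no source patch with class {label!r}")
        total += min(distance(feature, src) for src in candidates)
    return total / len(target_patches)
```

Restricting the candidates by class matters: a target patch labeled as class 0 that happens to look like a class 1 source patch is still penalized, because only class 0 source patches may explain it.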

2) An auxiliary classifier is used to predict the segmentation map of the target image (i.e., the target semantic segmentation image). The cross-entropy (CE) loss between the predicted segmentation map and the enhanced segmentation map is then calculated. The semantic alignment loss of the generator is defined as:

$$\mathcal{L}_{seg} = CE(P_{predict}, P_{aug})$$

where CE denotes the cross-entropy loss, and P_predict = S(G(I_source)) = S(I_target) is the target semantic segmentation image (the output of the auxiliary classifier S).
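A pixel-wise cross-entropy of this kind can be written compactly. The sketch below assumes the auxiliary classifier outputs a probability vector per pixel and that the reference segmentation is given as class indices; the flattened pixel list, the averaging over pixels and the `eps` term are assumptions.

```python
import math

def semantic_alignment_loss(pred_probs, target_labels, eps=1e-12):
    """Mean pixel-wise cross entropy.

    pred_probs[i] holds the predicted class probabilities at pixel i;
    target_labels[i] is the reference class index at that pixel.
    """
    log_likelihood = sum(math.log(p[t] + eps) for p, t in zip(pred_probs, target_labels))
    return -log_likelihood / len(target_labels)
```

The loss approaches zero when the classifier assigns probability close to 1 to the reference class at every pixel, and grows as the generated image drifts away from the requested layout.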

3) The least squares GAN loss $\mathcal{L}_{GAN}$ is used as the adversarial constraint, and features are extracted from the discriminator to calculate a feature matching loss $\mathcal{L}_{fm}$ between the enhanced image and the generated image.
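For orientation, the least squares GAN objectives and a feature matching term can be sketched as below. The target labels 1 (real) and 0 (fake), the 0.5 factors, and the use of a mean absolute difference averaged over discriminator layers are common choices assumed here, not details stated in the text.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

def lsgan_discriminator_loss(real_scores, fake_scores):
    """Push score-map entries toward 1 for real samples and toward 0 for fakes."""
    real_term = mean([(s - 1.0) ** 2 for s in real_scores])
    fake_term = mean([s ** 2 for s in fake_scores])
    return 0.5 * real_term + 0.5 * fake_term

def lsgan_generator_loss(fake_scores):
    """Push the discriminator's scores on generated samples toward 1."""
    return 0.5 * mean([(s - 1.0) ** 2 for s in fake_scores])

def feature_matching_loss(real_features, fake_features):
    """Average over layers of the mean absolute difference of discriminator features."""
    per_layer = [
        mean([abs(r - f) for r, f in zip(real_layer, fake_layer)])
        for real_layer, fake_layer in zip(real_features, fake_features)
    ]
    return mean(per_layer)
```

Here `real_scores` and `fake_scores` are flattened score maps from the PatchGAN discriminator, and each entry of `real_features` or `fake_features` is the flattened activation of one discriminator layer.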

The total loss function is:

$$\mathcal{L} = \mathcal{L}_{patch} + \lambda_{seg} \mathcal{L}_{seg} + \lambda_{GAN} \mathcal{L}_{GAN} + \lambda_{fm} \mathcal{L}_{fm}$$

where $\mathcal{L}_{patch}$ denotes the appearance similarity loss and $\mathcal{L}_{seg}$ denotes the semantic alignment loss; λ_seg, λ_GAN and λ_fm are hyperparameters, all set to 1.0 in the experiments.

In the embodiment of the invention, the sampling and reconstruction modes are used alternately for training.

In the sampling mode, i.e. steps 1 to 5 described above, the generator takes the enhanced semantic segmentation image as guidance and generates a target image whose appearance matches the enhanced image I_aug and whose semantic layout matches the enhanced semantic segmentation image P_aug.

The reconstruction mode follows the same procedure as the sampling mode, except that the random operation in step 1 is not performed: the given source image and the corresponding source semantic segmentation image are input directly, and the source image is reconstructed using the source semantic segmentation image. The total loss function is also slightly different: the appearance similarity loss is replaced by an L1 reconstruction loss between the output reconstructed image and the source image, and the feature matching loss and the semantic alignment loss are computed between the target image and the source image and between the target semantic segmentation image and the source semantic segmentation image, respectively.
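The two training modes differ only in their inputs and in the appearance term of the loss, which the following sketch makes explicit. The strict one-to-one alternation in `training_mode`, the weighted-sum form of `total_loss`, and all function names are assumptions for illustration; the text only states that the modes alternate and that the three weights are set to 1.0.

```python
def training_mode(iteration):
    """Assumed schedule: even iterations sample, odd iterations reconstruct."""
    return "sampling" if iteration % 2 == 0 else "reconstruction"

def l1_reconstruction_loss(output, source):
    """Mean absolute difference between the reconstructed and the source image."""
    return sum(abs(o - s) for o, s in zip(output, source)) / len(source)

def appearance_term(mode, patch_loss, l1_loss):
    """Patch coherence loss when sampling, L1 reconstruction loss when reconstructing."""
    return patch_loss if mode == "sampling" else l1_loss

def total_loss(appearance, seg, gan, fm, lambda_seg=1.0, lambda_gan=1.0, lambda_fm=1.0):
    """Weighted sum of the appearance, semantic alignment, adversarial and feature matching terms."""
    return appearance + lambda_seg * seg + lambda_gan * gan + lambda_fm * fm
```

In reconstruction mode the semantic alignment and feature matching terms would be evaluated against the source segmentation and the source image instead of their enhanced counterparts.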

During the inference stage, the network model parameters are fixed. A source image I_source, the corresponding source semantic segmentation image P_source and a specified semantic segmentation image (which can be obtained by editing P_source or by other means) are given. The two semantic segmentation images are input into the encoder E, and steps 2 to 3 are then executed; the resulting image matches the specified semantic segmentation image while retaining the content information of the source image.

To verify the effectiveness of the present invention, the performance of the above-described method of the present invention was evaluated in terms of numerical indicators and visual effects, respectively.

The semantic image analogy task was applied to images from different datasets, including COCO-Stuff, ADE20K, CelebAMask-HQ, and the web (i.e., natural pictures randomly selected from the web). The results of the above method of the invention and of the compared methods were evaluated in two respects: 1) the appearance similarity between the source image and the target image; 2) the semantic consistency between the target image and the target segmentation map.

To evaluate the appearance similarity of the generated image to the source image, a user study was performed as follows. Ten pairs of images with the same class labels were randomly selected from the COCO-Stuff dataset. For each pair, one image is used as the source image and the other provides the target layout. The image analogy method (IA; Aaron Hertzmann, Charles E. Jacobs, Nuria Oliver, Brian Curless, and David Salesin. 2001. Image Analogies. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques. 327-340) and the deep image analogy method (DIA; Jing Liao, Yuan Yao, Lu Yuan, Gang Hua, and Sing Bing Kang. 2017. Visual Attribute Transfer through Deep Image Analogy. ACM Trans. Graph. 36, 4 (2017), 120:1-120:15) are used to transfer the source image into the layout of the other image. IA and DIA are the two works most relevant to the above method of the present invention. DIA requires a pair of images as source and target, whereas the above method of the present invention and IA require only one source image and two segmentation maps. The results were displayed in random order, and 20 users were asked to rank them by appearance similarity, with the source image as reference. The average ranking (Avg. User Ranking) of each method over all images and users was then calculated. Table 1 shows the superiority of the above method of the present invention (Ours) over the two competitors.
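The average ranking used in the user study is a plain mean of ranks over all (user, image) responses. A small sketch, with a hypothetical data layout in which each response maps a method name to its rank (1 being the most similar to the source image):

```python
def average_rankings(responses):
    """Average the rank each method received over all (user, image) responses."""
    ranks_by_method = {}
    for response in responses:
        for method, rank in response.items():
            ranks_by_method.setdefault(method, []).append(rank)
    return {method: sum(ranks) / len(ranks) for method, ranks in ranks_by_method.items()}
```

For example, two toy responses `{"method_a": 1, "method_b": 2}` and `{"method_a": 2, "method_b": 1}` give an average rank of 1.5 for both methods. These values are purely illustrative and are not results from the study.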

Table 1. Performance under the semantic alignment metrics and the subjective user evaluation

To evaluate the semantic consistency between the generated image and the target segmentation map, the segmentation map of the generated image was predicted using the panoptic segmentation model of Detectron2, and the pixel-wise accuracy (Pixel Accuracy) and mean intersection-over-union (mIoU) with respect to the target segmentation were then calculated. The images used for evaluation are the same as those in the user study. As shown in Table 1, the method achieves the highest accuracy.
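The two alignment metrics can be computed from a predicted and a target label map as sketched below. The label maps are flattened lists of class indices; averaging the IoU over every class that appears in either map is one common convention and is an assumption here, since the text does not specify how absent classes are handled.

```python
def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted label equals the target label."""
    correct = sum(p == t for p, t in zip(pred, target))
    return correct / len(target)

def mean_iou(pred, target):
    """Mean intersection-over-union over classes present in the prediction or the target."""
    ious = []
    for cls in sorted(set(pred) | set(target)):
        intersection = sum(p == cls and t == cls for p, t in zip(pred, target))
        union = sum(p == cls or t == cls for p, t in zip(pred, target))
        ious.append(intersection / union)
    return sum(ious) / len(ious)
```

With `pred = [0, 0, 1, 1]` and `target = [0, 1, 1, 1]`, three of four pixels agree, so the pixel accuracy is 0.75; the per-class IoUs are 1/2 and 2/3, giving an mIoU of 7/12.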

In FIGS. 4, 5 and 6, the results of the above method of the present invention are visually compared with the currently best image analogy algorithms IA and DIA, the single-image generative adversarial network model SinGAN (Tamar Rott Shaham, Tali Dekel, and Tomer Michaeli. 2019. SinGAN: Learning a Generative Model from a Single Natural Image. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 4569-4579), and the segmentation-map-to-image translation model SPADE (Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. 2019. Semantic Image Synthesis with Spatially-Adaptive Normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2337-2346). DIA may produce unrealistic results. Without considering the semantic structure, SinGAN often alters unedited areas and produces undesirable textures when editing, or simply blurs the pasted objects, yielding results that are very similar to the edited input. Although the output of SPADE is semantically consistent with the target layout, its content is limited to the training dataset and the appearance of the source image is lost. The method of the present invention produces images that are faithful in appearance to the source image and semantically consistent with the target layout.

The method can semantically manipulate an image through its segmentation map. Instances may be moved, resized, or deleted in the source semantic segmentation map to obtain the target layout. As shown in FIG. 7, the above method of the present invention produces high-quality results under arbitrary semantic modifications while preserving the local appearance of the modified instances well.
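As an illustration of how a target layout might be derived from a source segmentation map, the following minimal sketch moves one instance inside a label map. It is not part of the disclosed method: the function name `move_instance`, the list-of-lists representation, and the simplification that all pixels of one class form a single instance are assumptions made for this example.

```python
from typing import List

LabelMap = List[List[int]]  # one integer class label per pixel (assumed format)


def move_instance(seg: LabelMap, label: int, dy: int, dx: int,
                  background: int = 0) -> LabelMap:
    """Return a copy of `seg` with every pixel of class `label` shifted by (dy, dx).

    Vacated pixels are filled with `background`; pixels shifted outside
    the map are dropped. The input map is left unchanged.
    """
    height, width = len(seg), len(seg[0])
    # Start from the source layout with the instance erased.
    target = [[background if v == label else v for v in row] for row in seg]
    for y in range(height):
        for x in range(width):
            if seg[y][x] == label:
                ny, nx = y + dy, x + dx
                if 0 <= ny < height and 0 <= nx < width:
                    target[ny][nx] = label
    return target


# Hypothetical 3 x 4 layout: class 0 is background, class 2 is the instance.
source_layout = [
    [0, 0, 0, 0],
    [0, 2, 2, 0],
    [0, 0, 0, 0],
]
target_layout = move_instance(source_layout, label=2, dy=1, dx=1)
```

The resulting `target_layout` would then serve as the designated semantic segmentation image supplied at inference.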

The flexible setup of the semantic image analogy task enables various applications. Because the conditional input is dense, pixel-level control can be used to recombine image blocks within an image. FIGS. 8, 9 and 10 illustrate three applications of the above method of the present invention: 1) object removal, where unwanted objects are removed by changing their class labels in the semantic segmentation map to a background class; 2) face editing, where a face image is edited by altering the shape of the face in the segmentation map; and 3) edge-to-image generation, where other spatial conditions (e.g., edge maps) are used as the conditional input.
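The object-removal application amounts to relabeling pixels in the segmentation map. A minimal sketch of that relabeling step is given below; the function name `remove_objects` and the class indices (0 for sky, 1 for grass, 5 for bird) are hypothetical and chosen only for illustration.

```python
from typing import Iterable, List

LabelMap = List[List[int]]  # one integer class label per pixel (assumed format)


def remove_objects(seg: LabelMap, unwanted: Iterable[int],
                   background: int) -> LabelMap:
    """Return a copy of `seg` in which every unwanted class becomes `background`."""
    unwanted_set = set(unwanted)
    return [[background if v in unwanted_set else v for v in row] for row in seg]


# Hypothetical labels: 0 = sky, 1 = grass, 5 = bird.
layout_with_bird = [
    [0, 5, 0],
    [0, 5, 0],
    [1, 1, 1],
]
layout_without_bird = remove_objects(layout_with_bird, unwanted=[5], background=0)
```

Feeding `layout_without_bird` as the designated semantic segmentation image would ask the generator to fill the former object region with content of the background class.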

From the above description of the embodiments, it will be clear to those skilled in the art that the above embodiments can be implemented in software, or in software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the embodiments may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) and which includes several instructions for causing a computer device (such as a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.

The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A semantic image analogy method based on a single-image generative adversarial network, characterized in that the method is implemented by a network model composed of an encoder, a generator G, an auxiliary classifier S, and a discriminator D; wherein:

a training stage: in each training iteration, the same random augmentation operation is applied to a given source image and its corresponding source semantic segmentation image to obtain a corresponding enhanced image and enhanced semantic segmentation image; feature tensors of the source semantic segmentation image and the enhanced semantic segmentation image are extracted by the same encoder, and a semantic feature conversion module in the generator predicts transformation parameters in the image domain from the two feature tensors, so that a target image is generated from the source image under the guidance of the transformation parameters; the target image is input into the discriminator and the auxiliary classifier, which respectively predict score maps of the target image and the enhanced image, and a target semantic segmentation image corresponding to the target image; a total loss function for training is constructed from an appearance similarity loss between the target image and the source image, a feature matching loss between the target image and the enhanced image obtained from the score maps, and a semantic alignment loss between the target semantic segmentation image and the enhanced semantic segmentation image;

an inference stage: a source image, its corresponding source semantic segmentation image, and a designated semantic segmentation image are input into the semantic image analogy network, which outputs an image having the same semantic layout as the designated semantic segmentation image;

wherein predicting, by the semantic feature conversion module in the generator, the transformation parameters in the image domain based on the two feature tensors comprises: performing element-by-element division and subtraction on the feature tensor $F_{source}$ of the source semantic segmentation image and the feature tensor $F_{aug}$ of the enhanced semantic segmentation image to obtain a feature scaling tensor $F_{scale}$ and a feature shift tensor $F_{shift}$ for the subsequent down-sampling stages; for the l-th down-sampling stage, calculating:

$$F^{l}_{scale} = F^{l}_{aug} / F^{l}_{source}, \qquad F^{l}_{shift} = F^{l}_{aug} - F^{l}_{source}$$

wherein $F^{l}_{aug}$ and $F^{l}_{source}$ are the feature tensors extracted at the l-th down-sampling stage from $F_{aug}$ and $F_{source}$, respectively;

using the feature scaling tensor $F^{l}_{scale}$ and the feature shift tensor $F^{l}_{shift}$ as the scaling factor $\gamma^{l}_{seg}$ and the shifting factor $\beta^{l}_{seg}$ of the segmentation map transformation; and modeling the transformation from the segmentation domain to the image domain with two semantic feature conversion modules, which respectively process $\gamma^{l}_{seg}$ and $\beta^{l}_{seg}$ to obtain the scaling factor $\gamma^{l}_{img}$ and the shifting factor $\beta^{l}_{img}$ of the image domain.
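The element-by-element comparison and subtraction recited in claim 1 might be sketched as follows. This is only an illustrative reading, not the claimed implementation: it assumes that "comparison" denotes an element-wise ratio, represents each feature tensor as a flat list of floats, and adds a small constant `eps` to guard against division by zero, which the source does not mention.

```python
from typing import List, Tuple


def scale_and_shift(f_source: List[float], f_aug: List[float],
                    eps: float = 1e-8) -> Tuple[List[float], List[float]]:
    """Compute element-wise scaling and shift tensors between two feature tensors.

    f_scale[i] = f_aug[i] / (f_source[i] + eps)   (eps is an assumption)
    f_shift[i] = f_aug[i] - f_source[i]
    """
    if len(f_source) != len(f_aug):
        raise ValueError("feature tensors must have the same number of elements")
    f_scale = [a / (s + eps) for s, a in zip(f_source, f_aug)]
    f_shift = [a - s for s, a in zip(f_source, f_aug)]
    return f_scale, f_shift


# Toy feature values for one down-sampling stage.
f_scale, f_shift = scale_and_shift([1.0, 2.0, 4.0], [2.0, 2.0, 1.0])
```

When the enhanced segmentation equals the source segmentation, the ratio is approximately one and the difference is zero, so the two tensors encode only how the layout has changed.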

2. The method of claim 1, wherein the random augmentation operation comprises one or more of the following operations: random flipping, resizing, rotating, and cropping.
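The requirement that the source image and its segmentation map receive the same random operation can be illustrated with a short sketch. It is deliberately restricted to horizontal flips and 90-degree rotations on nested lists; resizing, cropping, and arbitrary rotation angles are omitted, and the function names are hypothetical.

```python
import random
from typing import List, Tuple

Grid = List[list]  # rows of pixel values or class labels (assumed format)


def hflip(grid: Grid) -> Grid:
    """Mirror the grid left to right."""
    return [list(reversed(row)) for row in grid]


def rot90(grid: Grid) -> Grid:
    """Rotate the grid by 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def paired_augment(image: Grid, seg: Grid, rng: random.Random) -> Tuple[Grid, Grid]:
    """Draw one random transform and apply it to both the image and the label map."""
    do_flip = rng.random() < 0.5
    quarter_turns = rng.randrange(4)

    def apply(grid: Grid) -> Grid:
        out = [list(row) for row in grid]
        if do_flip:
            out = hflip(out)
        for _ in range(quarter_turns):
            out = rot90(out)
        return out

    return apply(image), apply(seg)


seg = [[0, 1, 1], [2, 2, 3]]
image = [[10 * v for v in row] for row in seg]  # toy image aligned with the labels
aug_image, aug_seg = paired_augment(image, seg, random.Random(0))
```

Drawing the random parameters once and reusing them for both inputs keeps every pixel of the enhanced image aligned with its label in the enhanced segmentation map.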

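3. The method of claim 1, wherein in the (l+1)-th down-sampling stage of the generator, the output feature tensor is obtained by the following formula:

$$F^{l+1}_{img} = DS\left(\gamma^{l}_{img} \cdot \frac{F^{l}_{img} - \mathrm{mean}(F^{l}_{img})}{\mathrm{std}(F^{l}_{img})} + \beta^{l}_{img}\right)$$

wherein DS denotes the down-sampling module, mean and std denote the mean and the standard deviation, respectively, $F^{l}_{img}$ is the image feature tensor of the l-th down-sampling stage, and $\gamma^{l}_{img}$ and $\beta^{l}_{img}$ are the scaling factor and the shifting factor of the image domain;

the up-sampling modules of the generator map the image feature tensor output by the down-sampling stages to the image domain so as to generate the target image;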

the generator has an encoder-decoder structure with K down-sampling blocks and K up-sampling blocks; each block contains a 3 x 3 convolutional layer with a stride of 3 and a 4 x 4 convolutional or transposed convolutional layer with a stride of 2 for down-sampling or up-sampling, and each block also uses spectral normalization, batch normalization, and leaky ReLU activation.

4. The method of claim 1, wherein the appearance similarity between the generated image and the source image is measured by an image block coherence loss, which is defined as the average of the lower bounds of the image block distances between the source image and the target image:

$$\mathcal{L}_{app} = \frac{1}{N_{target}} \sum_{U \subset I_{target}} \; \min_{V \subset I_{source},\; V_{class} = U_{class}} d(U, V)$$

wherein $N_{target}$ is the number of image blocks in the target image $I_{target}$; $I_{source}$ denotes the source image, and $G(I_{source}) = I_{target}$; $U_{class}$ and $V_{class}$ denote the segmentation labels of the image blocks U and V; and $d(\cdot)$ is a distance metric function.
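A non-differentiable reference computation of the image block coherence loss of claim 4 could look like the sketch below. Several points are assumptions rather than statements of the claimed method: each image block is given as a flattened list of pixel values with one class label, the distance metric d is taken to be the squared Euclidean distance, and the minimum is searched only among source blocks with the same class label.

```python
from typing import List, Sequence, Tuple

Patch = Tuple[Sequence[float], int]  # (flattened pixel values, class label)


def squared_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two flattened image blocks."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def patch_coherence_loss(target_patches: List[Patch],
                         source_patches: List[Patch]) -> float:
    """Average, over target blocks, of the distance to the nearest same-class source block."""
    if not target_patches:
        raise ValueError("at least one target image block is required")
    total = 0.0
    for u_pixels, u_class in target_patches:
        candidates = [squared_distance(u_pixels, v_pixels)
                      for v_pixels, v_class in source_patches
                      if v_class == u_class]
        if not candidates:
            raise ValueError(f"no source image block with class {u_class}")
        total += min(candidates)  # lower bound of the distances for block U
    return total / len(target_patches)


source_patches = [([0.0, 0.0], 1), ([1.0, 1.0], 1), ([5.0, 5.0], 2)]
target_patches = [([1.0, 0.0], 1), ([5.0, 4.0], 2)]
loss = patch_coherence_loss(target_patches, source_patches)
```

A loss of zero would mean that every block of the target image already appears, with the same class label, somewhere in the source image.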

5. The method of claim 1, wherein the semantic alignment loss is expressed as:

$$\mathcal{L}_{seg} = CE\left(S(G(I_{source})),\; P_{aug}\right)$$

wherein CE denotes the cross-entropy loss; S denotes the auxiliary classifier; $I_{source}$ denotes the source image; and $P_{aug}$ denotes the enhanced semantic segmentation image.

6. The method of claim 1, wherein the total loss function is expressed as:

$$\mathcal{L} = \mathcal{L}_{app} + \lambda_{seg}\,\mathcal{L}_{seg} + \lambda_{GAN}\,\mathcal{L}_{GAN} + \lambda_{fm}\,\mathcal{L}_{fm}$$

wherein $\mathcal{L}_{app}$ denotes the appearance similarity loss, $\mathcal{L}_{seg}$ denotes the semantic alignment loss, $\mathcal{L}_{fm}$ denotes the feature matching loss, and $\mathcal{L}_{GAN}$ denotes the least-squares GAN loss used as the adversarial constraint; $\lambda_{seg}$, $\lambda_{GAN}$ and $\lambda_{fm}$ are all hyper-parameters.

7. The method of claim 1, wherein two modes, a sampling mode and a reconstruction mode, are used alternately for training;

in the sampling mode, the generator takes the enhanced semantic segmentation image as guidance and generates a target image whose appearance is the same as that of the enhanced image $I_{aug}$ and whose semantic layout is the same as that of the enhanced semantic segmentation image $P_{aug}$;

the reconstruction mode works in the same way as the sampling mode, except that the given source image and its corresponding source semantic segmentation image are input directly and the source image is reconstructed using the source semantic segmentation image; in the total loss function, the appearance similarity loss is replaced by an L1 reconstruction loss between the output reconstructed image and the source image, and the feature matching loss and the semantic alignment loss are computed between the target image and the source image and between the target semantic segmentation image and the source semantic segmentation image, respectively.