CN109218706A

CN109218706A - Method for generating a stereoscopic vision image from a single image

Info

Publication number: CN109218706A
Application number: CN201811313847.XA
Authority: CN
Inventors: 许威威; 张�荣; 于金辉; 黄翔; 鲍虎军
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2018-11-06
Filing date: 2018-11-06
Publication date: 2019-01-15
Anticipated expiration: 2038-11-06
Also published as: CN109218706B

Abstract

The invention discloses a method for generating a stereoscopic vision image from a single image, belonging to the field of stereoscopic vision. The method comprises the following steps: (1) training a depth estimation model; (2) estimating the depth information of a single color image; (3) correcting erroneous parts of the estimated depth through interaction and fine-tuning of the model parameters; (4) applying foreground protection to the estimated depth map to align depth edges with color image edges; (5) calculating the disparity from the image and the depth information to obtain the image at a new viewpoint; (6) generating, from the depth maps in the dataset, hole regions similar to those in the new-viewpoint image; (7) training a generative adversarial network model for image inpainting with the newly generated data; (8) fine-tuning the parameters of the inpainting model for the test picture; (9) inpainting the hole regions of the new-viewpoint image to obtain the stereoscopic vision image. The invention has the advantages that the input image is easy to acquire, operation is flexible, adjustment is convenient, and the generated images show an obvious stereoscopic effect.

Description

Method for generating a stereoscopic vision image from a single image

Technical field

The present invention relates to the field of stereoscopic vision, and in particular to a method for generating a stereoscopic vision image from a single image.

Background art

Stereoscopic vision is a technology that simulates the binocular vision of the human eye. It simulates the perception of depth by displaying two images with disparity, and is widely used in stereoscopic display systems such as virtual reality glasses and naked-eye 3D displays. Methods for generating a binocular stereoscopic image from a single image fall into two main classes: the first computes the image at another viewpoint from the depth information of the image; the second generates the image at another viewpoint directly from the image at the current viewpoint.

Acquiring the depth image is a vital part of the first class of methods. Capturing depth images directly requires special instruments, so depth estimation is more generally applicable. Early depth estimation methods always assume that the scene is planar or cylindrical, or handle only certain specific objects and scenes with data-driven methods; such methods are very limited in their application scenarios. Research in recent years has concentrated on estimating the depth of a single image with convolutional neural networks, improving the accuracy of depth estimation by adjusting the model structure, improving the loss function, or combining the network with conditional random fields. Such methods can only handle images similar to the training data.

From the depth map, the disparity can be calculated to obtain the image at another viewpoint, and the holes in that image must then be filled. Existing approaches include interpolating the holes along the isophote direction, filling holes with patch-based methods, predicting the missing parts directly with a convolutional neural network, and inpainting with generative adversarial networks. No inpainting method designed specifically for stereoscopic vision images has been seen so far.

The second class of methods currently takes the image at one viewpoint and generates the image at the new viewpoint directly by training a convolutional neural network. This requires a large amount of binocular image data for training and is likewise limited in its application scenarios.

Summary of the invention

In view of the above deficiencies, the present invention provides a method for generating a stereoscopic vision image from a single image. The method requires only one color image as input and can generate new images at different viewpoints for virtual reality glasses, naked-eye 3D displays and the like, helping users understand three-dimensional scenes and solving the problem that stereoscopic vision is difficult to construct from a single image.

To achieve the above object, the technical solution of the present invention is a method for generating a stereoscopic vision image from a single image, mainly comprising the following steps:

(1) training a depth estimation model on an RGBD image dataset;

(2) estimating the depth information of the single input color image;

(3) correcting erroneous parts of the estimated depth through interaction and fine-tuning of the model parameters;

(4) applying a foreground protection operation to the estimated depth map so that depth edges are better aligned with color image edges;

(5) calculating the disparity from the image and the depth information to obtain the image at the new viewpoint;

(6) generating, from the depth maps in the dataset, hole regions similar to those in the new-viewpoint image;

(7) training a generative adversarial network model for image inpainting with the newly generated data;

(8) fine-tuning the parameters of the inpainting model for the test picture;

(9) inpainting the hole regions of the new-viewpoint image to obtain the generated stereoscopic vision image.

Further, step (1) comprises the following steps:

(1.1) data processing: sampling the dataset at equal intervals, applying data augmentation by random cropping, horizontal flipping and color jittering, and measuring relative depth from points sampled on the depth map;

(1.2) the model has an encoder-decoder structure, and features at three different scales are obtained from the color image through three convolutional layers and used as side inputs of the model to recover detail information;

(1.3) the model loss function consists of an L1 loss, an L2 loss and a rank loss;

(1.4) optimizing with stochastic gradient descent, adjusting the hyperparameters learning rate, batch size and weight decay, and starting training once the parameters are set.

Further, step (3) comprises the following steps:

(3.1) for an erroneous region in the estimated depth map, picking on the image the color whose gray value is closest to the target color of the region to be corrected, and painting it over the erroneous region to obtain an interaction image;

(3.2) applying data augmentation by random cropping, random flipping and color jittering to the input picture;

(3.3) using the data from step (3.1) as the ground-truth depth and the data from step (3.2) as the input picture, further fine-tuning the model of step (1);

(3.4) predicting the depth of the input picture again with the model fine-tuned in step (3.3), producing the corrected depth prediction.

Further, step (6) comprises the following steps:

(6.1) initializing a mask matrix of the same size as the image with all values 0, scanning the pixels of the image row by row, and setting the corresponding position in the mask matrix to 1 if the difference between a pixel and its adjacent pixel exceeds a set threshold;

(6.2) for each pixel whose value in the mask matrix is 1, calculating the difference between its disparity and that of the adjacent pixel, setting to 0 all pixels of the image in the span given by that difference, and setting the corresponding positions of the mask matrix to 1;

(6.3) the pixels whose value in the mask matrix is 0 are the hole region.

Further, step (7) trains the image inpainting model using the Adadelta algorithm.

Further, step (8) comprises the following steps:

(8.1) for an image outside the dataset, generating hole regions and a mask matrix using the method of step (6);

(8.2) applying data augmentation by random cropping, random flipping and color jittering to the input picture;

(8.3) using the data from step (8.2) as the input picture and the hole-free data as the ground-truth image, further fine-tuning the model of step (7).

The beneficial effects of the present invention are as follows: only a single color image is needed to generate images at viewpoints within a certain range; regions of erroneous prediction can be corrected by editing the estimated depth map; through fine-tuning of the model parameters, the method also performs well on images outside the dataset; and the generated stereoscopic images convey a pronounced stereoscopic effect, helping users better understand three-dimensional scenes.

Brief description of the drawings

Fig. 1 is a flow chart of the invention;

Fig. 2 is a schematic structural diagram of the depth estimation model;

Fig. 3 shows examples of depth prediction results;

Fig. 4a is the input color image, Fig. 4b is the ground-truth depth map, Fig. 4c is the prediction before model fine-tuning, Fig. 4d is the result obtained by interaction, and Fig. 4e is the prediction after the error has been corrected by fine-tuning the model parameters;

Figs. 5a-5f are examples of images with simulated holes;

Fig. 6 is a schematic diagram of results before and after fine-tuning;

Fig. 7 is a schematic diagram of stereoscopic image generation results.

Detailed description of the embodiments

The present invention is further described below with reference to the accompanying drawings.

As shown in Fig. 1, a method for generating a stereoscopic vision image from a single image mainly comprises the following steps: training a depth estimation model on an RGBD image dataset; estimating the depth information of the single input color image; correcting erroneous parts of the estimated depth through interaction and fine-tuning of the model parameters; applying a foreground protection operation to the estimated depth map so that depth edges are better aligned with color image edges; calculating the disparity from the image and the depth information to obtain the image at the new viewpoint; generating, from the depth maps in the dataset, hole regions similar to those in the new-viewpoint image; training a generative adversarial network model for image inpainting with the newly generated data; fine-tuning the parameters of the inpainting model for the test picture; and inpainting the hole regions of the new-viewpoint image to produce the final stereoscopic vision image.

Each step is described in detail below.

(1) Training the depth estimation model on an RGBD image dataset: this step proposes a depth estimation model, an encoder-decoder structure built on ResNet50, whose input is a color image and whose output is a single-channel depth map. To supplement the detail in the predicted depth map, the model includes a side-input structure that adds features at three different scales from the input color image in three stages. A side-output structure is also incorporated to aid optimization. The loss function of the model combines an L1 loss, an L2 loss and a rank loss, where the rank loss helps capture non-local information. The specific steps are as follows:

Model structure design: based on the ResNet50 model proposed in (Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In CVPR. IEEE Computer Society, 770-778.), the present invention constructs an encoder-decoder structure, as shown in Fig. 2. To recover the detail in the predicted depth map, the invention constructs a side-input structure: features are extracted from the input color image by a new convolutional layer and concatenated with the feature maps of the decoder part as the input of the next convolutional layer. Features at three different scales are added to the model in three stages. In addition, the side-output structure proposed in (Saining Xie and Zhuowen Tu. 2017. Holistically-Nested Edge Detection. International Journal of Computer Vision 125, 1-3 (2017), 3-18.) is incorporated to help optimize the model.

Loss function design: the loss function of the model combines an L1 loss, an L2 loss and a rank loss, where the rank loss helps capture non-local information. Specifically, for an RGBD image I in the dataset, let the predicted depth image be Z. For a pixel i with given RGB information and ground-truth depth value, and with predicted depth value Z_i, the difference D(I, i) between the predicted value and the true value can be expressed as:

Similarly, the differences G_x and G_y of the depth gradients can be expressed as:

The L1 loss can be defined as:

The L2 loss can be defined as:

In addition, following (Weifeng Chen, Zhao Fu, Dawei Yang, and Jia Deng. 2016. Single-Image Depth Perception in the Wild. In NIPS. 730-738.), for a randomly selected point pair (i, j), a label r_ij ∈ {+1, -1, 0} is defined according to the depth values, where +1 indicates Z_i > Z_j, -1 indicates Z_i < Z_j, and 0 indicates Z_i = Z_j. The rank loss of this point pair can be expressed as:

Finally, the loss function of the depth estimation model is defined as:

where (k, l) denotes a point pair on the j-th training picture and K is the total number of point pairs.
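The formulas for this loss are not reproduced in the text, so the following is only a minimal NumPy sketch of how an L1 term, an L2 term and a pairwise rank term might be combined. The rank term follows the general form used by Chen et al. (2016); including the gradient differences in the L1 and L2 terms, the equal weights `w_l1`, `w_l2`, `w_rank`, and the function names are all assumptions made for illustration.

```python
import numpy as np

def rank_loss(z, pairs, labels):
    """Pairwise ranking loss in the style of Chen et al. (2016).

    z      : 1-D array of predicted depths, one per pixel
    pairs  : (K, 2) integer array of pixel index pairs (k, l)
    labels : (K,) array with values in {+1, -1, 0}
    """
    diff = z[pairs[:, 0]] - z[pairs[:, 1]]
    # Ordered pairs: softplus penalty, large when the predicted order is wrong.
    ordered = np.logaddexp(0.0, -labels * diff)
    # Equal-depth pairs: squared difference of the two predictions.
    equal = diff ** 2
    return float(np.mean(np.where(labels == 0, equal, ordered)))

def depth_loss(z_pred, z_true, pairs, labels, w_l1=1.0, w_l2=1.0, w_rank=1.0):
    """Weighted sum of L1, L2 and rank terms (weights are assumptions)."""
    d = z_pred - z_true                  # per-pixel depth difference
    gx = np.gradient(d, axis=1)          # difference of the X gradients
    gy = np.gradient(d, axis=0)          # difference of the Y gradients
    l1 = np.mean(np.abs(d) + np.abs(gx) + np.abs(gy))
    l2 = np.mean(d ** 2 + gx ** 2 + gy ** 2)
    r = rank_loss(z_pred.ravel(), pairs, labels)
    return float(w_l1 * l1 + w_l2 * l2 + w_rank * r)
```

Because the rank term compares pixels that may be far apart in the image, it is the part of this sketch that carries the non-local information mentioned above.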

Model training process: the present invention is trained on the indoor dataset NYU Depth v2 (Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. 2012. Indoor Segmentation and Support Inference from RGBD Images. In ECCV.) and the outdoor dataset SYNTHIA (German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M. Lopez. 2016. The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)).

For the NYU dataset, the original Kinect frame sequences are sampled at equal intervals to obtain 40,000 RGBD images; random horizontal flipping and rotation expand these to a training set of 120,000 images, which are uniformly downscaled by bilinear interpolation to a resolution of 320*240. During training, images are randomly cropped to 304*228. In addition, 3,000 point pairs are randomly sampled on the depth image and their relative depth relations are labeled.
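The point-pair sampling can be sketched as below. This is an illustrative NumPy version, not the inventors' code: the function name `sample_ordinal_pairs` and the relative tolerance `tol` (used to decide when two depths count as equal) are assumptions, since the text only states that 3,000 pairs are sampled and their relation labeled.

```python
import numpy as np

def sample_ordinal_pairs(depth, n_pairs=3000, tol=0.02, seed=0):
    """Sample random pixel pairs and label their relative depth.

    Returns flat pixel indices of shape (n_pairs, 2) and labels in
    {+1, -1, 0}: +1 if the first point is deeper, -1 if the second is,
    0 if the two depths are within the (assumed) tolerance of each other.
    """
    rng = np.random.default_rng(seed)
    flat = depth.ravel()
    pairs = rng.integers(0, flat.size, size=(n_pairs, 2))
    zi, zj = flat[pairs[:, 0]], flat[pairs[:, 1]]
    labels = np.zeros(n_pairs, dtype=np.int64)
    labels[zi > zj * (1.0 + tol)] = 1
    labels[zj > zi * (1.0 + tol)] = -1
    return pairs, labels
```

The returned `pairs` and `labels` have the shape a pairwise rank loss would consume, with one label per sampled pair.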

For the SYNTHIA dataset, the two scenes SEQS-04 and SEQS-06 are used, split into training and test sets at a ratio of 9:1 to obtain 54,000 training images and 6,000 test images, which are finally downscaled to a resolution of 320*190 for training.

The model is trained with stochastic gradient descent with a batch size of 8, and the whole training process lasts 13 epochs. Training begins with a warm-up of several thousand iterations at a learning rate of 0.001; formal training then starts at a learning rate of 0.01, and the learning rate is reduced to 1/10 of its value every 6 epochs.
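The schedule just described can be written as a small function. This is a sketch under stated assumptions: the text says only "several thousand" warm-up iterations, so `warmup_steps=2000` is a placeholder, and counting epochs from the end of the warm-up is likewise an assumption.

```python
def learning_rate(step, steps_per_epoch, warmup_steps=2000,
                  warmup_lr=0.001, base_lr=0.01, decay_every=6, decay=0.1):
    """Learning rate at a given training step.

    A constant warm-up rate is used for the first `warmup_steps` steps
    (assumed value); afterwards the rate starts at `base_lr` and is
    multiplied by `decay` once every `decay_every` epochs.
    """
    if step < warmup_steps:
        return warmup_lr
    epoch = (step - warmup_steps) // steps_per_epoch
    return base_lr * decay ** (epoch // decay_every)
```

With 13 epochs in total, this schedule yields three plateaus after warm-up: 0.01 for epochs 0-5, 0.001 for epochs 6-11, and 0.0001 for the final epoch.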

(2) Estimating the depth information of the single input color image: at test time, the image is center-cropped to 304*228, and the output value of the model is mapped back to a 10-meter range by an exponential function. Examples of prediction results are shown in Fig. 3.

(3) Correcting erroneous parts of the estimated depth through interaction and fine-tuning of the model parameters: when the model is used directly to estimate image depth, some regions may be wrong, as shown in Fig. 4c. The present invention provides an interactive tool to help correct erroneous regions. The user picks on the image a color close to the target color of the region to be corrected and paints it over the region, obtaining the interaction image of Fig. 4d. This interaction image is then used in place of the ground-truth depth map (Fig. 4b), and the model trained in step (1) is fine-tuned with the color image of Fig. 4a; during training, random cropping, random flipping and color jittering are applied to the input picture for data augmentation. Finally, predicting the image again with the fine-tuned model yields the corrected prediction of Fig. 4e. Here, model parameter fine-tuning means taking one input picture, augmenting it into a dataset by random cropping, random mirroring, random color jittering and so on, and further training the model trained in the previous step on this dataset to improve its ability to handle the current picture.
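Turning one picture into a small fine-tuning dataset can be sketched as follows. This is an illustrative NumPy version only: the function names, the per-channel multiplicative jitter, the jitter strength and the number of samples `n` are assumptions, and the image is assumed to be a float array in [0, 1].

```python
import numpy as np

def augment_pair(image, depth, rng, crop_h=228, crop_w=304, jitter=0.1):
    """Return one randomly cropped, flipped and color-jittered sample.

    image : (H, W, 3) float array in [0, 1]
    depth : (H, W) float array, here the user-corrected depth map
    """
    h, w = depth.shape
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    img = image[top:top + crop_h, left:left + crop_w]
    dep = depth[top:top + crop_h, left:left + crop_w]
    if rng.random() < 0.5:
        # Mirror image and depth together so they stay aligned.
        img = img[:, ::-1]
        dep = dep[:, ::-1]
    # Color jitter: scale each channel by a random factor near 1.
    scale = 1.0 + rng.uniform(-jitter, jitter, size=3)
    img = np.clip(img * scale, 0.0, 1.0)
    return img, dep.copy()

def build_finetune_set(image, depth, n=64, seed=0):
    """Augment a single (image, depth) pair into a list of n samples."""
    rng = np.random.default_rng(seed)
    return [augment_pair(image, depth, rng) for _ in range(n)]
```

The crop and flip are applied identically to the image and the depth map, while the color jitter touches only the image, since depth values should not change with color.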

(4) Applying a foreground protection operation to the estimated depth map so that depth edges are better aligned with color image edges: the edges of the depth map predicted by the model may not be fully aligned with the edges of the color image. The foreground object protection proposed in (Xiaohan Lu, Fang Wei, and Fangmin Chen. 2012. Foreground-Object-Protected Depth Map Smoothing for DIBR. In 2012 IEEE International Conference on Multimedia and Expo. 339-343.) can be used to better align depth edges with color image edges.

(5) Calculating the disparity from the image and the depth information to obtain the image at the new viewpoint: the stereoscopic vision images generated in the present invention have disparity along the X direction, which can be calculated by the following formula:

where x' denotes the pixel position in the input image and x the new pixel position; B denotes the distance of the viewpoint change, i.e. the baseline; f denotes the focal length of the camera, which is determined by the dataset; and Z denotes the depth at x'. With this formula the image can be transformed to the new viewpoint. During the viewpoint change, hole regions form where previously occluded regions become visible. Holes 1 pixel wide can be filled directly by interpolation, while larger holes are filled in subsequent steps by image inpainting with a generative adversarial network.
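The viewpoint change can be sketched as a forward warp along X. The formula itself is not reproduced in the text, so this sketch assumes the standard pinhole relation, disparity = B * f / Z, with the new position x = x' + direction * disparity; the sign convention (`direction`), the nearest-pixel rounding and the depth test that lets nearer pixels win are all assumptions.

```python
import numpy as np

def warp_to_new_view(image, depth, baseline, focal, direction=-1):
    """Forward-warp an image along X using disparity = baseline * focal / depth.

    Returns the warped image and a boolean hole mask that is True where no
    source pixel landed. direction=-1 shifts pixels left, +1 shifts right.
    """
    h, w = depth.shape
    disparity = baseline * focal / depth
    new_x = np.rint(np.arange(w)[None, :] + direction * disparity).astype(int)
    out = np.zeros_like(image)
    zbuf = np.full((h, w), np.inf)       # nearest depth written so far
    for y in range(h):
        for x in range(w):
            nx = new_x[y, x]
            # Keep the pixel only if it lands inside the image and is
            # nearer than whatever was already written at that position.
            if 0 <= nx < w and depth[y, x] < zbuf[y, nx]:
                zbuf[y, nx] = depth[y, x]
                out[y, nx] = image[y, x]
    holes = np.isinf(zbuf)
    return out, holes
```

Because near pixels move farther than distant ones, a gap opens between a foreground object and the background behind it; those gaps are the hole regions that the later inpainting steps fill.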

(6) Generating, from the depth maps in the dataset, hole regions similar to those in the new-viewpoint image: hole regions produced by a viewpoint change usually lie close to object edges and are elongated in shape. To improve the inpainting quality of the generative adversarial network, the present invention proposes an algorithm that generates images with hole regions from depth maps. The algorithm scans the depth difference between each pixel and its adjacent pixel to determine whether a hole region may occur at that position, and calculates the disparity difference between the pixel and its adjacent pixel to determine the size of the hole region at that position. The algorithm can generate hole regions similar in location and shape to those in stereoscopic vision images. The specific steps are as follows:

(6.1) Simulating the positions where hole regions occur: assuming the right view of the input image is generated, hole regions appear on the right side of foreground objects. A mask image is initialized by the following formula:

where Z(x, y) denotes the depth value at point (x, y) and θ is a threshold set by the system. M(x, y) = 1 indicates that a hole may occur at that position when the viewpoint changes.

(6.2) Simulating the size of hole regions: along the X direction, the size of the hole near point (x, y) is determined by the disparities at the two points (x, y) and (x+1, y), by the following formula:

All pixels from pixel (x+1, y) to pixel (x+1+s(x, y), y) are regarded as hole region, and the corresponding positions of the mask image are set to 1. The generated images with holes are shown in Figs. 5a-5f, where the white parts indicate hole regions.
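Steps (6.1) and (6.2) can be sketched together as a row scan. Since the two formulas are not reproduced in the text, this sketch assumes that a hole starts where the depth jumps by more than θ from (x, y) to (x+1, y), that the hole width s(x, y) is the rounded disparity difference B * f * (1/Z(x, y) - 1/Z(x+1, y)), and that the span is half-open, covering s pixels starting at x+1. The function name and these conventions are illustrative.

```python
import numpy as np

def simulate_holes(depth, baseline, focal, theta):
    """Simulate right-view hole regions from a depth map.

    Returns a uint8 mask of the same size as `depth`, with 1 marking
    simulated hole pixels to the right of foreground edges.
    """
    h, w = depth.shape
    disparity = baseline * focal / depth
    mask = np.zeros((h, w), dtype=np.uint8)
    for y in range(h):
        for x in range(w - 1):
            # (6.1) a hole may start where depth jumps from near to far.
            if depth[y, x + 1] - depth[y, x] > theta:
                # (6.2) hole width from the disparity difference.
                s = int(np.rint(disparity[y, x] - disparity[y, x + 1]))
                mask[y, x + 1:min(x + 1 + s, w)] = 1
    return mask
```

A training input for the inpainting network could then be formed by blanking the masked pixels, for example `image * (mask == 0)[..., None]` for an (H, W, 3) image, with the original image kept as the target.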

(7) Training a generative adversarial network model for image inpainting with the newly generated data: the model uses the structure in (Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. 2017. Globally and Locally Consistent Image Completion. ACM Trans. Graph. 36, 4 (2017), 107:1-107:14.) and is trained separately on the NYU dataset at a resolution of 560*426 and on the SYNTHIA dataset at 640*380. The input is the image with the hole regions generated in step (6) together with the mask image, and the ground-truth image is the original complete image. Training uses the Adadelta algorithm and lasts 10 epochs with a batch size of 24.

(8) Fine-tuning the parameters of the inpainting model for the test picture: for a picture not in the dataset, the present invention generates hole regions from the predicted depth map processed in step (4), applies data augmentation by random cropping and horizontal flipping to the input picture, and fine-tunes the parameters of the model from step (7) with the augmented data. Examples of results before and after fine-tuning are shown in Fig. 6.

(9) Inpainting the hole regions of the new-viewpoint image: the generated results are shown in Fig. 7, where the white boxes mark the regions in which the stereoscopic effect is most evident.

Embodiment:

The inventors implemented an example of the present invention on a machine equipped with two NVIDIA 1080Ti GPUs and 12 GB of memory. The ResNet50 part of the depth estimation model is initialized from a model pre-trained on the ImageNet dataset, and the parameters of the other layers are initialized from a Gaussian distribution with mean 0 and variance 0.01. Fine-tuning the depth estimation model on an input picture lasts 1,000 iterations, and fine-tuning the image inpainting model lasts 400 iterations. The inventors evaluated the method on the NYU test set: for depth estimation, the accuracy under the 1.25 threshold criterion reached 82.2% and the SSIM (structural similarity) reached 0.9463; for image inpainting, the PSNR (peak signal-to-noise ratio) reached 30.98 and the SSIM reached 0.953. The method can generate stereoscopic vision images that are visually plausible and have an evident stereoscopic effect.
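Two of the reported metrics can be sketched in a few lines. The text does not define them, so these follow the definitions commonly used in the depth estimation and image restoration literature: threshold accuracy as the fraction of pixels with max(pred/gt, gt/pred) below 1.25, and PSNR computed from the mean squared error against a peak value (`max_val=255.0` assumes 8-bit images).

```python
import numpy as np

def threshold_accuracy(pred, gt, thresh=1.25):
    """Fraction of pixels whose depth ratio max(pred/gt, gt/pred) < thresh."""
    ratio = np.maximum(pred / gt, gt / pred)
    return float(np.mean(ratio < thresh))

def psnr(a, b, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

Both depth arrays passed to `threshold_accuracy` must be strictly positive, since the metric is based on ratios.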

In addition to the above embodiment, the present invention may have other embodiments. All technical solutions formed by equivalent substitution or equivalent transformation fall within the scope of protection claimed by the present invention.

Claims

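1. A method for generating a stereoscopic vision image from a single image, characterized in that it mainly comprises the following steps:

(1) training a depth estimation model on an RGBD image dataset;

(2) estimating the depth information of the single input color image;

(3) correcting erroneous parts of the estimated depth through interaction and fine-tuning of the model parameters;

(4) applying a foreground protection operation to the estimated depth map so that depth edges are better aligned with color image edges;

(5) calculating the disparity from the image and the depth information to obtain the image at the new viewpoint;

(6) generating, from the depth maps in the dataset, hole regions similar to those in the new-viewpoint image;

(7) training a generative adversarial network model for image inpainting with the newly generated data;

(8) fine-tuning the parameters of the inpainting model for the test picture;

(9) inpainting the hole regions of the new-viewpoint image to obtain the generated stereoscopic vision image.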

2. The method for generating a stereoscopic vision image from a single image according to claim 1, characterized in that step (1) comprises the following steps:

(1.1) data processing: sampling the dataset at equal intervals, applying data augmentation by random cropping, horizontal flipping and color jittering, and measuring relative depth from points sampled on the depth map;

(1.2) the model has an encoder-decoder structure, and features at three different scales are obtained from the color image through three convolutional layers and used as side inputs of the model to recover detail information;

(1.3) the model loss function consists of an L1 loss, an L2 loss and a rank loss;

(1.4) optimizing with stochastic gradient descent, adjusting the hyperparameters learning rate, batch size and weight decay, and starting training once the parameters are set.

3. The method for generating a stereoscopic vision image from a single image according to claim 2, characterized in that step (3) comprises the following steps:

(3.1) for an erroneous region in the estimated depth map, picking on the image the color whose gray value is closest to the target color of the region to be corrected, and painting it over the erroneous region to obtain an interaction image;

(3.2) applying data augmentation by random cropping, random flipping and color jittering to the input picture;

(3.3) using the data from step (3.1) as the ground-truth depth and the data from step (3.2) as the input picture, further fine-tuning the model of step (1);

(3.4) predicting the depth of the input picture again with the model fine-tuned in step (3.3), producing the corrected depth prediction.

4. The method for generating a stereoscopic vision image from a single image according to claim 3, characterized in that step (6) comprises the following steps:

(6.1) initializing a mask matrix of the same size as the image with all values 0, scanning the pixels of the image row by row, and setting the corresponding position in the mask matrix to 1 if the difference between a pixel and its adjacent pixel exceeds a set threshold;

(6.2) for each pixel whose value in the mask matrix is 1, calculating the difference between its disparity and that of the adjacent pixel, setting to 0 all pixels of the image in the span given by that difference, and setting the corresponding positions of the mask matrix to 1;

(6.3) the pixels whose value in the mask matrix is 0 are the hole region.

5. The method for generating a stereoscopic vision image from a single image according to claim 1, characterized in that step (7) trains the image inpainting model using the Adadelta algorithm.

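6. The method for generating a stereoscopic vision image from a single image according to claim 4, characterized in that step (8) comprises the following steps:

(8.1) for an image outside the dataset, generating hole regions and a mask matrix using the method of step (6);

(8.2) applying data augmentation by random cropping, random flipping and color jittering to the input picture;

(8.3) using the data from step (8.2) as the input picture and the hole-free data as the ground-truth image, further fine-tuning the model of step (7).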