CN117437274A

CN117437274A - A monocular image depth estimation method and system

Info

Publication number: CN117437274A
Application number: CN202311521956.1A
Authority: CN
Inventors: 俞奕宣; 韩志敏; 王博; 林志赟
Original assignee: Hangzhou Dianzi University
Current assignee: Hangzhou Dianzi University
Priority date: 2023-11-15
Filing date: 2023-11-15
Publication date: 2024-01-23

Abstract

The invention discloses a monocular image depth estimation method and system. The method comprises: constructing a depth estimation network for images and a pose estimation network for the camera from a depth feature encoder, a depth decoder and a pose encoder; concatenating adjacent frames along the channel dimension again, recalculating the pose between the two frames, recalculating a pixel-level geometric consistency loss function, and marking the unstable regions of the depth estimation network; back-projecting the output depth map into a pseudo point cloud using the camera intrinsic parameters, and segmenting the ground point cloud with an SVM classifier pre-trained on a road data set to obtain ground point cloud information; and recovering the scale information of the image depth map using the camera height, the camera intrinsic parameters and the ground point cloud information. The method and system address the instability and scale ambiguity of monocular image depth estimation while improving its accuracy.

Description

Monocular image depth estimation method and system

Technical Field

The invention relates to the field of monocular image depth estimation, in particular to a monocular image depth estimation method and a monocular image depth estimation system considering depth estimation errors and scale uncertainty.

Background

In recent years, the main application fields of autonomous mobile robots can be divided into commercial and research fields, chiefly warehouse logistics and autonomous driving, and civil fields, chiefly small sweeping robots; such robots are also widely used in smart agriculture, unmanned delivery and similar fields. Tasks from these different fields place higher demands on autonomous mobile robot technology, such as reducing dependence on sensors, improving the real-time performance of the system and increasing its stability. Applying a depth estimation algorithm to the obstacle detection stage of an autonomous mobile robot can effectively promote the development and application of its obstacle avoidance technology.

Training methods for monocular depth estimation generally fall into three types. The first collects depth data directly with a high-precision radar or a depth camera and trains a depth estimation network to regress the depth of the scene; in the real world, however, depth labels are relatively difficult to acquire, which limits the use of the algorithm. The second trains from binocular images and the camera pose transformation between them; because of the baseline of a binocular camera, however, the result cannot be guaranteed to apply to all cameras. The third uses consecutive video frames and estimates the camera pose and the depth jointly to complete training; this method is simple to implement and depends little on the data set, but errors in the camera pose estimation network prevent it from achieving better results. Monocular depth estimation algorithms with an encoder-decoder architecture remain the dominant approach. Such networks are trained through inter-frame transformation, photometric re-projection, depth smoothing losses and the like, but repeated downsampling loses detail information and degrades the accuracy of depth estimation. In addition, the true scale of a monocular depth estimate has to be recovered manually, which is difficult, and the instability of depth estimation algorithms needs further consideration.

Disclosure of Invention

Aiming at the defects in the prior art, the invention provides a monocular image depth estimation method and a monocular image depth estimation system which consider depth estimation errors and scale uncertainty, and the method and the system consider the instability of monocular image depth estimation and the scale problem of the depth estimation while improving monocular depth estimation precision.

In order to achieve the purpose of the invention, the invention adopts the following technical scheme:

adding a dilated (hole) convolution downsampling module to the ResNet18 residual network encoder, fusing the coarse-grained edge information of the upper encoding layers through short skip connections, enhancing the details of monocular image depth estimation and reducing the information lost in downsampling, to obtain a depth feature encoder;

constructing a decoder neural network using a convolution upsampling module, and fusing, in each layer of the decoder neural network, the semantic features of the corresponding encoder layer through long skip connections, to obtain a depth decoder; obtaining depth maps of different scales using a depth convolution module;

utilizing a ResNet18 feature encoder as a pose encoder;

constructing a depth estimation network of an image and a pose estimation network of a camera by using the depth feature encoder, the depth decoder and the pose encoder;

establishing a minimized luminosity reprojection error loss function, a depth smoothing loss function, a geometric consistency loss function and a characteristic point reprojection loss function, and training a depth estimation network and a pose estimation network;

concatenating adjacent frames along the channel dimension again, recalculating the pose between the two frames, recalculating a pixel-level geometric consistency loss function, and marking the unstable regions of the depth estimation network;

for the output depth map, back-projecting the depth map into a pseudo point cloud using the camera intrinsic parameters, and segmenting the ground point cloud with an SVM classifier pre-trained on the road data set to obtain ground point cloud information;

and recovering the scale information of the image depth map using the camera height, the camera intrinsic parameters and the ground point cloud information.

Further, the steps of building a minimized luminosity reprojection error loss function, a depth smoothing loss function, a geometric consistency loss function and a characteristic point reprojection loss function, training a depth estimation network and a pose estimation network are as follows:

applying normalization, random cropping, horizontal flipping and photometric changes to consecutive video frames of the KITTI public data set, and at the same time applying the corresponding changes to the intrinsic matrix of the camera so that it remains valid for the re-projection process;

supposing there are three consecutive frames, frame −1 $f_{-1}$, frame 0 $f_0$ and frame 1 $f_1$: frame $f_0$ is fed into the depth estimation network to obtain the depth map $D_0$; the camera pose estimation network is used to calculate the camera pose $P_{0\to-1}$ from $f_0$ to $f_{-1}$ and the camera pose $P_{0\to1}$ from $f_0$ to $f_1$; using $P_{0\to-1}$, the depth map $D_0$ and frame $f_{-1}$, frame $f_0$ is reconstructed to obtain the reconstructed frame $f'_{-1\to0}$, and similarly the reconstruction $f'_{1\to0}$ of frame $f_0$ from frame $f_1$ is obtained; the photometric re-projection errors between $f'_{-1\to0}$ and $f_0$ and between $f'_{1\to0}$ and $f_0$ are calculated, and the minimum of the two errors is taken at each pixel, thereby constructing the minimized photometric re-projection error loss function;

constructing a depth smoothing loss function by utilizing a depth map output by a depth estimation network and an RGB image corresponding to the input;

reconstructing the source frame depth map to the target frame depth map by utilizing the depth map of the source frame, the target frame depth map and the camera pose, and constructing a geometric consistency loss function by utilizing geometric information of the camera pose;

obtaining feature points matched with a source image and a target image by using a LoFTR feature point matching network, sampling depth values on a target image depth image, calculating projection points of the feature points projected on the source image by the depth values and the camera pose, and constructing a feature point re-projection loss function;

after training is completed, RGB images are normalized and input into the depth estimation network, which outputs the depth map of the monocular image; two consecutive input frames are fed into the pose estimation network to obtain the pose relation between the two frames.

Further, the dilated convolution downsampling module consists of a dilated convolution operation, a batch normalization operation, a pointwise convolution operation and a GELU activation function; that is, downsampling is performed by a spatially grouped dilated convolution followed by batch normalization, a pointwise convolution across channels and the GELU activation function;

the dilated convolution has a dilation rate of 2, a stride of 2, a kernel size of 3×3 and a number of groups equal to the number of channels of the input feature; the pointwise convolution has a stride of 1, a kernel size of 1×1 and a group number of 1.
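As a rough check of these settings, the sketch below computes the output size along one spatial axis and the weight count of the two convolutions. The padding of 2 and the example channel counts are assumptions chosen for illustration; the text does not state them.

```python
def conv_out_size(size, kernel=3, stride=2, dilation=2, padding=2):
    """Output length of a convolution along one spatial axis.

    padding=2 is an assumed value (not given in the text); it makes the
    stride-2 dilated convolution halve an even input size exactly.
    """
    effective_kernel = dilation * (kernel - 1) + 1  # 3x3 with dilation 2 spans 5 pixels
    return (size + 2 * padding - effective_kernel) // stride + 1


def downsample_weight_count(channels_in, channels_out, kernel=3):
    """Weights (biases ignored) of the grouped dilated conv plus the pointwise conv."""
    grouped = channels_in * kernel * kernel    # groups == channels_in: one filter per channel
    pointwise = channels_in * channels_out     # 1x1 kernel, groups == 1
    return grouped + pointwise


if __name__ == "__main__":
    print(conv_out_size(192), conv_out_size(640))
    print(downsample_weight_count(64, 128))
```

Under these assumptions a 192×640 feature map becomes 96×320, and a 64-to-128 channel module needs 8768 weights, far fewer than the 73728 of a dense 3×3 convolution with the same channel counts.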

Further, a multi-scale feature fusion module is added at the end of the depth feature encoder; the module consists of an atrous spatial pyramid pooling (ASPP) module and a squeeze-and-excitation (SE) module and is used to further extract features and details at different scales. The specific extraction process is as follows:

in the ASPP module, multi-scale features are extracted with dilated convolutions having dilation rates of 1, 3, 6, 9 and 12 and concatenated along the channel dimension; after concatenation, the features are fused and compressed across channels with a 1×1 pointwise convolution; the SE module then calculates a weight for each channel of the feature block, and finally the features are weighted by these weights along the channel dimension.

Further, the minimized photometric re-projection error loss function is:

wherein $L_p$ is the minimized photometric re-projection error, and $I_t$ and $I'_t$ are the current frame and the reconstructed frame respectively; $pe$ is the photometric re-projection error, calculated as follows:

wherein $\lambda$ is a selectable hyperparameter;

wherein $x$ and $y$ are the windows at the same position in the two images, $\mu_x$ and $\mu_y$ are the mean values in the corresponding windows, $\sigma_x^2$ is the variance of all data in window $x$, $\sigma_y^2$ is the variance of all data in window $y$, $\sigma_{xy}$ is the covariance of the data in windows $x$ and $y$, $c_1 = (k_1 L)^2$, $c_2 = (k_2 L)^2$, $k_1 = 0.01$, $k_2 = 0.03$, $L = 1.0$, and the window size is 3×3.
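For orientation, one reading consistent with the symbols defined above is sketched here. The SSIM expression is the standard definition of the structural similarity index; the per-pixel minimum over reconstructed frames follows the description of the loss, while the exact weighting of the SSIM and L1 terms by $\lambda$ is an assumption borrowed from common self-supervised depth pipelines and is not confirmed by the text.

```latex
% Minimum over the reconstructed frames t' (here the previous and next frame)
L_p = \min_{t'} \; pe\!\left(I_t,\, I'_{t' \to t}\right)

% Assumed weighting of the SSIM and L1 terms
pe(I_t, I'_t) = \frac{\lambda}{2}\bigl(1 - \mathrm{SSIM}(I_t, I'_t)\bigr)
              + (1 - \lambda)\,\lVert I_t - I'_t \rVert_1

% Standard SSIM over a pair of windows x, y
\mathrm{SSIM}(x, y) =
  \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}
       {(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}
```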

Further, the depth smoothing loss function is:

wherein $f_p$ is the input frame image, $f_p^x$ is the mean gradient value of the pixel in the horizontal direction of the input frame image, $f_p^y$ is the mean gradient value of the pixel in the vertical direction of the input frame image, $d_p$ is the depth map estimated by the depth network for the input frame, $d_p^x$ is the gradient value of the pixel in the horizontal direction of the output depth map, and $d_p^y$ is the gradient value of the pixel in the vertical direction of the output depth map; $V$ is the total number of pixels in the image;

the geometric consistency loss function is:

wherein,

wherein $D_{diff}(p)$ is the depth consistency error at each point, $D'_{f_0}(p)$ is the estimated depth map, and the remaining term is the depth map reconstructed from the previous frame's depth map.

Further, the characteristic point re-projection loss function is as follows:

wherein $x_r$ is the abscissa of the re-projected coordinate of a target-image feature point on the source image, and $y_r$ is the ordinate of the re-projected coordinate of a target-image feature point on the source image; $V$ is the total number of feature points in the image.

Further, the pre-training process of the SVM classifier is as follows: classes 40, 44, 48 and 49 of the Kitti_Semantic road point cloud classification data set are taken as the positive (ground) class and the remaining classes as the negative class; after the data set is divided, the point cloud classification data are input into the SVM classifier, a binary cross-entropy loss function is constructed, and the SVM classifier is trained with an SGD (stochastic gradient descent) optimizer to obtain an SVM classifier capable of segmenting the ground point cloud;

the method for recovering the scale information of the image depth map is as follows: the segmented ground point cloud $Point_{ground}$ is obtained and re-projected back to the pixel plane to obtain the coordinate $y_{ground}$ of each point in the pixel coordinate system, and the relative height $h_{relative}$ of the camera is obtained from similar triangles:

wherein $f_y$ is the vertical focal length of the camera and $D_{relative}$ is the depth output by the depth estimation network;

each $Point_{ground}$ yields one $h_{relative}$ in this way; the median $h_{relative\_m}$ of all $h_{relative}$ values is then taken; with $h_{absolute}$ denoting the absolute height of the camera, the scaling factor is determined as $\alpha = h_{absolute} / h_{relative\_m}$, and the absolute depth is $D_{absolute} = \alpha D_{relative}$.

Further, following the method used to construct the geometric consistency error, the pixel-level geometric consistency error is recalculated from the current-frame depth map, the previous-frame depth map, the current-frame and previous-frame RGB image pair and the camera pose between the two frames; the 20% of pixels with the largest error are set to 1 and the remaining pixels are set to 0, thereby binarizing the error.

The present invention also provides a monocular image depth estimation system taking depth estimation errors and scale uncertainties into account, comprising:

an encoder module: adding a dilated (hole) convolution downsampling module to the ResNet18 residual network encoder, fusing the coarse-grained edge information of the upper encoding layers through short skip connections, and enhancing depth estimation details, to obtain a depth feature encoder;

a decoder module: constructing a decoder neural network using a convolution upsampling module, and fusing, in each layer of the decoder neural network, the semantic features of the corresponding encoder layer through long skip connections, to obtain a depth decoder; obtaining depth maps of different scales using a depth convolution module;

an estimation network construction module: constructing a depth estimation network using the depth feature encoder and the depth decoder; constructing a pose estimation network of a camera by using the ResNet18 feature encoder as a pose encoder;

a loss function module: establishing a minimized luminosity reprojection error loss function, a depth smoothing loss function, a geometric consistency loss function and a characteristic point reprojection loss function, and training a depth estimation network and a pose estimation network;

an unsupervised monocular depth estimation training module: applying normalization, random cropping, horizontal flipping and photometric changes to consecutive video frames of the KITTI public data set, and at the same time applying the corresponding changes to the intrinsic matrix of the camera so that it remains valid for the re-projection process; constructing the minimized re-projection error from the consecutive RGB frames, the depth obtained by the depth estimation network and the camera pose; constructing the depth smoothing loss function from the depth map output by the depth estimation network and the corresponding input RGB image; reconstructing the source-frame depth map into the target frame using the source-frame depth map, the target-frame depth map and the camera pose, and constructing the geometric consistency loss function from the geometric information of the camera pose; obtaining matched feature points of the source image and the target image with the LoFTR feature point matching network, sampling depth values on the target-image depth map, calculating from the depth values and the camera pose the projections of the feature points onto the source image, and constructing the feature point re-projection loss function;

and an application module: after training is completed, normalizing RGB images and inputting them into the depth estimation network, which outputs the depth map of the monocular image, and feeding two consecutive input frames into the pose estimation network to obtain the pose relation between the two frames; concatenating adjacent frames along the channel dimension again, recalculating the pose between the two frames, recalculating a pixel-level geometric consistency loss function, and marking the unstable regions of depth estimation; for the output depth map, back-projecting the depth map into a pseudo point cloud using the camera intrinsic parameters, and segmenting the ground point cloud with an SVM pre-trained on the road data set to obtain ground point cloud information; and recovering the scale information of the image depth map using the camera height, the camera intrinsic parameters and the ground point cloud information.

The invention has the following beneficial effects: the method can be used for obstacle detection, labeling of the unstable region of the depth estimation and scale recovery in the automatic driving process, and solves the problems of instability and scale of the depth estimation of the monocular image while improving the accuracy of the monocular depth estimation by considering the depth estimation error and scale uncertainty.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a block diagram of a depth estimation network in accordance with an embodiment of the present invention;

FIG. 2 is a perspective view of a point in three-dimensional space of a pinhole camera in accordance with an embodiment of the present invention;

FIG. 3 is a block diagram of a hole convolution downsampling module in accordance with an embodiment of the present invention;

FIG. 4 is a block diagram of a convolution upsampling module in accordance with an embodiment of the present invention;

FIG. 5 is a block diagram of a depth convolution module in accordance with an embodiment of the present invention;

FIG. 6 is a block diagram of a multi-scale feature fusion module in accordance with an embodiment of the present invention;

FIG. 7 is a graph of the segmentation result of the SVM ground point cloud tested on the standard dataset in an embodiment of the present invention;

FIG. 8 is a flowchart of a binarized depth uncertainty mask calculation in accordance with an embodiment of the present invention;

FIG. 9 is a flow chart of depth scale recovery using the ground point cloud and the camera height in an embodiment of the invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below in conjunction with the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments, and all other embodiments obtained by persons of ordinary skill in the art without making creative efforts based on the embodiments of the present invention are all within the scope of protection of the present invention.

The embodiment provides a monocular image depth estimation method considering depth estimation errors and scale uncertainty, which comprises the following steps:

(1) A dilated (hole) convolution downsampling module is added to the ResNet18 residual network encoder; through short skip connections it fuses the coarse-grained edge information of the upper encoding layers and enhances the details of monocular image depth estimation, yielding the depth feature encoder; in addition, a multi-scale feature fusion module consisting of an atrous spatial pyramid pooling module and a squeeze-and-excitation module is used at the end of the encoder to extract feature information at different scales and fuse it further;

(2) A decoder neural network is constructed using the convolution upsampling module shown in fig. 4; in each layer of the decoder neural network, long skip connections fuse the semantic features of the corresponding encoder layer, yielding the depth decoder; finally, depth maps of different scales are obtained using the depth convolution module shown in fig. 5;

(3) Utilizing a ResNet18 feature encoder as a pose encoder;

(4) Constructing a depth estimation network of an image and a pose estimation network of a camera by using the depth feature encoder, the depth decoder and the pose encoder; a depth estimation network constructed with a depth feature encoder and a depth decoder, the structure of which is shown in fig. 1;

(5) Establishing a minimized luminosity reprojection error loss function, a depth smoothing loss function, a geometric consistency loss function and a characteristic point reprojection loss function, and training a depth estimation network and a pose estimation network;

(6) Concatenating adjacent frames along the channel dimension again, recalculating the pose between the two frames, recalculating a pixel-level geometric consistency loss function, and marking the unstable regions of the depth estimation network;

(7) For the output depth map, back-projecting the depth map into a pseudo point cloud using the camera intrinsic parameters, and segmenting the ground point cloud with an SVM classifier pre-trained on the road data set to obtain ground point cloud information;

(8) And restoring the image depth map scale information by using the camera height, the camera internal parameters and the ground point cloud information.

In step (1), the dilated convolution downsampling module consists of a dilated convolution operation, a batch normalization operation, a pointwise convolution operation and a GELU activation function; downsampling is performed by a spatially grouped dilated convolution followed by batch normalization, a pointwise convolution acting on the channel space and the GELU activation function, which reduces the loss of detail features during downsampling. The specific process is as follows:

features are extracted by a convolution with a stride of 2, a dilation rate of 2 and a kernel size of 3×3; batch normalization is applied to the extracted features; after normalization, a 1×1 pointwise convolution fuses the features across channels; and finally a GELU activation function applies a nonlinear transformation to the output features, as shown in fig. 3;

in step (1), a multi-scale feature fusion module is added at the end of the depth feature encoder; the module consists of an atrous spatial pyramid pooling (ASPP) module and a squeeze-and-excitation (SE) module and is used to further extract features and details at different scales. The specific extraction process is as follows:

in the ASPP module, multi-scale features are extracted with dilated convolutions having dilation rates of 1, 3, 6, 9 and 12 and concatenated along the channel dimension; after concatenation, the features are fused and compressed across channels with a 1×1 pointwise convolution; the SE module then calculates a weight for each channel of the feature block, and finally the features are weighted by these weights along the channel dimension, as shown in fig. 6.
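To give a feel for what the five dilation rates cover, the following sketch computes the spatial extent of each pyramid branch. A 3×3 kernel in every branch is an assumption made for illustration; the text gives only the dilation rates.

```python
def effective_kernel(kernel, dilation):
    """Spatial extent (in pixels) covered by a dilated convolution kernel."""
    return kernel + (kernel - 1) * (dilation - 1)


# Dilation rates taken from the text; kernel size 3 is an assumed value.
DILATION_RATES = (1, 3, 6, 9, 12)


def branch_extents(kernel=3, rates=DILATION_RATES):
    """Extent of each pyramid branch, from the finest to the coarsest scale."""
    return [effective_kernel(kernel, rate) for rate in rates]


if __name__ == "__main__":
    print(branch_extents())
```

With a 3×3 kernel the branches span 3, 7, 13, 19 and 25 pixels of the feature map, which is why concatenating them gives the subsequent 1×1 convolution access to context at several scales at once.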

The specific content of the step (5) is as follows:

carrying out normalization, random cutting, horizontal overturning and luminosity change on the images by utilizing continuous frame video stream images of the Kitti public data set, and simultaneously carrying out corresponding change on an internal reference matrix of a camera so as to adapt to a re-projection process;

after training, RGB images are normalized and input into the depth estimation network, which outputs the depth map of the monocular image; two consecutive input frames are fed into the pose estimation network to obtain the pose relation between the two frames.
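The intrinsic-matrix adjustment mentioned for the augmentation step can be sketched as follows. The function names, the tuple layout `(fx, fy, px, py)` and the convention that pixel centers sit at integer coordinates are all assumptions for illustration, not details given in the text.

```python
def crop_intrinsics(intrinsics, left, top):
    """Intrinsics after cropping `left` columns and `top` rows from the image.

    Cropping moves the image origin, so only the principal point shifts.
    """
    fx, fy, px, py = intrinsics
    return (fx, fy, px - left, py - top)


def flip_intrinsics(intrinsics, width):
    """Intrinsics after a horizontal flip of an image `width` pixels wide.

    Assumes pixel x maps to width - 1 - x, so the principal point is mirrored.
    """
    fx, fy, px, py = intrinsics
    return (fx, fy, width - 1 - px, py)


def project_x(x3d, z3d, intrinsics):
    """Horizontal pixel coordinate of a 3D point under the pinhole model."""
    fx, _, px, _ = intrinsics
    return fx * x3d / z3d + px


if __name__ == "__main__":
    k = crop_intrinsics((720.0, 720.0, 610.0, 170.0), left=100, top=20)
    print(k, flip_intrinsics(k, width=1000))
```

A quick sanity check on the flip is that a point projected with the original intrinsics and then mirrored in the image lands on the same pixel as the mirrored 3D point projected with the flipped intrinsics.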

Consider the simplified geometric model of a basic pinhole camera shown in fig. 2, where point $E = (X, Y, Z)$ is a point in three-dimensional space, $e = (x, y)$ is a point on the image plane, and the line between point $E$ and the optical center of the camera passes through point $e$. With focal length $f$, similar triangles give:

$$x = \frac{fX}{Z}, \qquad y = \frac{fY}{Z}$$

where $x$, $y$ are the coordinates of $e$ on the image plane.

If the point in space is written in homogeneous coordinates as $E = (X, Y, Z, 1)$ and the principal point offset is taken into account, the formula can be rewritten as:

$$Z \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & p_x & 0 \\ 0 & f_y & p_y & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}$$

The above formula expresses how a point given in homogeneous coordinates in three-dimensional space is projected into the homogeneous pixel coordinate system, where $f_x$, $f_y$ are the focal lengths and $p_x$, $p_y$ are the principal point offsets of the camera intrinsics.

The camera pose estimation network outputs a 6×1 vector: the first three components represent the camera displacement and the last three represent the rotation of the camera. In the invention, rotation in three-dimensional space is expressed with Euler angles $r_x$, $r_y$ and $r_z$, denoting rotation about the x-axis, the y-axis and the z-axis respectively; the obtained rotation and displacement are then expressed as a pose transformation matrix $(R|T)$, with the following specific formulas:

rotation matrix in X-axis direction:

rotation matrix in Y-axis direction:

rotation matrix in Z-axis direction:

the invention assumes the rotation order Z→Y→X, so that the overall rotation matrix of the camera is:

$R_{zyx}(r_z, r_y, r_x) = R_z(r_z) R_y(r_y) R_x(r_x)$,

for the displacement $T_{xyz}$, the following is used directly:

$T_{xyz} = [T_x \; T_y \; T_z]^T$,

the complete camera pose transformation matrix is expressed as:

$(R|T) = (R_{zyx}|T_{xyz})$.
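The conversion from the 6×1 network output to the 3×4 matrix $(R|T)$ can be sketched as below. The per-axis matrices use the standard right-handed form with angles in radians, which is an assumption here; the function names are illustrative only.

```python
import math


def matmul3(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rot_x(r):
    c, s = math.cos(r), math.sin(r)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def rot_y(r):
    c, s = math.cos(r), math.sin(r)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def rot_z(r):
    c, s = math.cos(r), math.sin(r)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def pose_vector_to_matrix(vec):
    """Build the 3x4 matrix (R|T) from (Tx, Ty, Tz, rx, ry, rz).

    The first three entries are the displacement and the last three the
    Euler angles, as described in the text; R = Rz(rz) Ry(ry) Rx(rx).
    """
    tx, ty, tz, rx, ry, rz = vec
    rotation = matmul3(matmul3(rot_z(rz), rot_y(ry)), rot_x(rx))
    return [rotation[0] + [tx], rotation[1] + [ty], rotation[2] + [tz]]


if __name__ == "__main__":
    for row in pose_vector_to_matrix([0.1, 0.0, 0.5, 0.0, 0.0, math.pi / 2]):
        print(row)
```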

the change of a point in three-dimensional space can be expressed as:

$$[X' \; Y' \; Z']^T = (R|T)\,[X \; Y \; Z \; 1]^T$$

wherein $X'$, $Y'$, $Z'$ are the coordinates of the point in three-dimensional space after the camera pose transformation.

The reprojection process is as follows:

each pixel of the depth map $D$ can be expressed as $(x, y, Z)$, wherein $x$, $y$ are the pixel coordinates of a point and $Z$ is the depth corresponding to the point; inverting the camera projection equation,

$$X = \frac{(x - p_x) Z}{f_x}, \qquad Y = \frac{(y - p_y) Z}{f_y}$$

the point $(X, Y, Z)$ in three-dimensional space corresponding to the pixel can be solved.

Applying the pose transformation of the camera between the two frames to this point gives the point $(X', Y', Z')$ in three-dimensional space;

applying the camera projection equation again, $(X', Y', Z')$ can be re-projected to the homogeneous pixel coordinate $(x', y', 1)$:

$$x' = \frac{f_x X'}{Z'} + p_x, \qquad y' = \frac{f_y Y'}{Z'} + p_y$$

$(x', y')$ is the pixel coordinate in the second frame corresponding to the point whose pixel coordinate in the first frame is $(x, y)$.
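The three steps of the re-projection process (back-project, transform, project) can be sketched for a single pixel as follows. The function names and the 3×4 nested-list layout of the pose are illustrative assumptions; a training pipeline would apply the same arithmetic to whole images at once.

```python
def backproject(x, y, depth, fx, fy, px, py):
    """Pixel (x, y) with depth Z -> 3D point (X, Y, Z) in the camera frame."""
    return ((x - px) * depth / fx, (y - py) * depth / fy, depth)


def transform(point, pose):
    """Apply a 3x4 pose matrix (R|T) to a 3D point."""
    X, Y, Z = point
    return tuple(row[0] * X + row[1] * Y + row[2] * Z + row[3] for row in pose)


def project(point, fx, fy, px, py):
    """3D point in the camera frame -> pixel coordinates (x', y')."""
    X, Y, Z = point
    return (fx * X / Z + px, fy * Y / Z + py)


def reproject_pixel(x, y, depth, pose, fx, fy, px, py):
    """Pixel of the first frame -> corresponding pixel in the second frame."""
    moved = transform(backproject(x, y, depth, fx, fy, px, py), pose)
    return project(moved, fx, fy, px, py)


if __name__ == "__main__":
    shift_x = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
    print(reproject_pixel(50.0, 40.0, 10.0, shift_x, 100.0, 100.0, 60.0, 45.0))
```

With an identity pose the pixel maps to itself; with a pure translation the shift in pixels is inversely proportional to depth, which is the parallax cue the photometric loss exploits.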

The loss function is composed of 4 parts, namely a minimized re-projection loss function, a depth smoothing loss function, a geometric consistency loss function and a characteristic point re-projection loss function. The loss function is used for guaranteeing geometric consistency between adjacent frames by constructing sparse and dense luminosity and geometric loss functions, and restraining the pose estimation network and the depth estimation network to learn more accurate depth and pose information.

1. Minimizing re-projection errors

The minimized photometric re-projection error loss function is:

wherein $\lambda$ is a selectable hyperparameter;

By taking the minimum of the photometric re-projection losses over multiple frames as the optimization target, this loss function further improves robustness to occlusion between adjacent frames during training.
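A minimal sketch of the per-pixel minimum is given below, working on flattened 3×3 windows. The combination `lam / 2 * (1 - SSIM) + (1 - lam) * L1` and the default `lam = 0.85` are assumptions taken from common practice, since the text only states that λ is a hyperparameter; the SSIM constants are those given in the text.

```python
def ssim_window(x, y, k1=0.01, k2=0.03, L=1.0):
    """SSIM of two equally sized windows given as flat lists of intensities."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))


def photometric_error(x, y, lam=0.85):
    """Assumed form: weighted sum of an SSIM term and a mean L1 term."""
    l1 = sum(abs(a - b) for a, b in zip(x, y)) / len(x)
    return lam / 2 * (1 - ssim_window(x, y)) + (1 - lam) * l1


def min_reprojection_loss(target, recon_prev, recon_next, lam=0.85):
    """Mean over pixels of the smaller error among the two reconstructions.

    Each argument is a list of windows, one window per pixel.
    """
    per_pixel = [min(photometric_error(t, p, lam), photometric_error(t, n, lam))
                 for t, p, n in zip(target, recon_prev, recon_next)]
    return sum(per_pixel) / len(per_pixel)


if __name__ == "__main__":
    window = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
    occluded = list(reversed(window))
    print(min_reprojection_loss([window], [window], [occluded]))
```

In the usage above the pixel is reconstructed well from one neighbor and badly from the other, as happens under occlusion; taking the minimum keeps the badly reconstructed frame from dominating the loss.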

2. Depth smoothing loss function:

wherein $f_p$ is the input frame image, $f_p^x$ is the mean gradient value of the pixel in the horizontal direction of the input frame image, $f_p^y$ is the mean gradient value of the pixel in the vertical direction of the input frame image, $d_p$ is the depth map estimated by the depth network for the input frame, $d_p^x$ is the gradient value of the pixel in the horizontal direction of the output depth map, and $d_p^y$ is the gradient value of the pixel in the vertical direction of the output depth map; $V$ is the total number of pixels in the image.

The depth smoothing loss function encourages smoothness of depth estimation inside objects by constraining the consistency of depth changes with color changes of the RGB image.
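One common way to realize this constraint is an edge-aware smoothness term in which depth gradients are down-weighted where the image itself has strong gradients. The exponential weighting and the use of a single-channel image below are assumptions for illustration; the exact formula of the loss is not reproduced in this text.

```python
import math


def smoothness_loss(depth, image):
    """Edge-aware smoothness over 2D lists of equal size.

    depth: estimated depth map; image: single-channel intensities (assumed;
    for RGB input the image gradient would be averaged over channels).
    Forward differences are used; border pixels without a neighbor are skipped.
    """
    height, width = len(depth), len(depth[0])
    total = 0.0
    for r in range(height):
        for c in range(width):
            if c + 1 < width:   # horizontal gradient
                d_dx = abs(depth[r][c + 1] - depth[r][c])
                f_dx = abs(image[r][c + 1] - image[r][c])
                total += d_dx * math.exp(-f_dx)
            if r + 1 < height:  # vertical gradient
                d_dy = abs(depth[r + 1][c] - depth[r][c])
                f_dy = abs(image[r + 1][c] - image[r][c])
                total += d_dy * math.exp(-f_dy)
    return total / (height * width)


if __name__ == "__main__":
    depth = [[1.0, 1.0, 2.0, 2.0], [1.0, 1.0, 2.0, 2.0]]
    with_edge = [[0.0, 0.0, 1.0, 1.0], [0.0, 0.0, 1.0, 1.0]]
    flat = [[0.0] * 4, [0.0] * 4]
    print(smoothness_loss(depth, with_edge), smoothness_loss(depth, flat))
```

The same depth step costs less when it coincides with an image edge than when the image is flat, which matches the stated aim of keeping depth smooth inside objects while allowing discontinuities at their boundaries.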

3. Geometric consistency loss:

wherein,

This loss forces the output depths of adjacent frames to satisfy the solid-geometry relationship in three-dimensional space, thereby ensuring the scale consistency of the network output depth.
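A sketch of a per-pixel depth consistency error and its mean is shown below. The normalized difference `|a - b| / (a + b)` is an assumed form often used for scale-consistent depth training; the text names the two depth maps involved but its formula is not reproduced here.

```python
def depth_inconsistency(estimated, reconstructed):
    """Per-pixel error between the estimated depth map and the depth map
    reconstructed from the other frame (both 2D lists of positive depths).

    Assumed form: |a - b| / (a + b), which lies in [0, 1) and is unaffected
    by a common scaling of both maps.
    """
    return [[abs(a - b) / (a + b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(estimated, reconstructed)]


def geometric_consistency_loss(estimated, reconstructed):
    """Mean of the per-pixel inconsistency over all pixels."""
    values = [v for row in depth_inconsistency(estimated, reconstructed)
              for v in row]
    return sum(values) / len(values)


if __name__ == "__main__":
    print(geometric_consistency_loss([[1.0, 3.0]], [[1.0, 1.0]]))
```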

4. Feature point re-projection loss function:

wherein $x_r$ is the abscissa of the re-projected coordinate of a target-image feature point on the source image, and $y_r$ is the ordinate of the re-projected coordinate of a target-image feature point on the source image; $V$ is the total number of feature points in the image.

Compared with dense luminosity loss, sparse characteristic point reprojection loss can improve the estimated accuracy of the pose estimation network;
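As an illustration of such a sparse loss, the sketch below averages the distance between each re-projected feature point and its matched point in the source image. Using the Euclidean distance is an assumption, and the matched points are taken as given (in the described method they would come from the LoFTR matcher).

```python
import math


def feature_reprojection_loss(reprojected, matched):
    """Mean distance between re-projected and matched feature points.

    reprojected: list of (x_r, y_r), target-image feature points re-projected
                 onto the source image via depth and camera pose.
    matched:     list of (x_s, y_s), the matching points found in the source
                 image (hypothetical input for this sketch).
    """
    distances = [math.hypot(xr - xs, yr - ys)
                 for (xr, yr), (xs, ys) in zip(reprojected, matched)]
    return sum(distances) / len(distances)


if __name__ == "__main__":
    print(feature_reprojection_loss([(13.0, 24.0), (5.0, 5.0)],
                                    [(10.0, 20.0), (5.0, 5.0)]))
```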

the absolute scale recovery algorithm is used for recovering the absolute scale information.

The pre-training process of the SVM classifier is as follows: classes 40, 44, 48 and 49 of the Kitti_Semantic road point cloud classification data set are taken as the positive (ground) class and the remaining classes as the negative class; after the data set is divided, the point cloud classification data are input into the SVM classifier, a binary cross-entropy loss function is constructed, and the SVM classifier is trained with an SGD (stochastic gradient descent) optimizer to obtain an SVM classifier capable of segmenting the ground point cloud.
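The label split and the training loop can be sketched on toy data as follows. This is a linear classifier trained by SGD on the binary cross-entropy loss, as the text describes (a conventional SVM would use the hinge loss instead); the single height feature, the toy points and the non-ground label ids are hypothetical.

```python
import math
import random

GROUND_LABELS = {40, 44, 48, 49}  # positive (ground) classes named in the text


def binarize(labels):
    """Map semantic class ids to 1 (ground) or 0 (everything else)."""
    return [1 if label in GROUND_LABELS else 0 for label in labels]


def train_linear_classifier(features, targets, epochs=200, lr=0.1, seed=0):
    """SGD on the binary cross-entropy of a linear model w.x + b."""
    rng = random.Random(seed)
    w, b = [0.0] * len(features[0]), 0.0
    order = list(range(len(features)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            z = sum(wj * xj for wj, xj in zip(w, features[i])) + b
            grad = 1.0 / (1.0 + math.exp(-z)) - targets[i]  # dLoss/dz
            w = [wj - lr * grad * xj for wj, xj in zip(w, features[i])]
            b -= lr * grad
    return w, b


def predict(features, w, b):
    return [1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0
            for x in features]


if __name__ == "__main__":
    # Hypothetical feature: point height relative to the sensor, in meters.
    heights = [(-1.7,), (-1.6,), (-1.8,), (-1.7,), (0.2,), (0.8,), (0.5,), (1.0,)]
    labels = [40, 44, 48, 49, 10, 50, 70, 10]  # non-ground ids are illustrative
    w, b = train_linear_classifier(heights, binarize(labels))
    print(predict(heights, w, b))
```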

The method for recovering the scale information of the image depth map is as follows: the segmented ground point cloud Point_ground is obtained and re-projected onto the pixel plane to obtain the coordinate y_ground of each point in the pixel coordinate system, and the relative height h_relative of the camera is obtained from the similar-triangle relation:

Each point of Point_ground yields one h_relative in this way, and the median h_{relative_m} of these h_relative values is then taken. With h_absolute denoting the absolute height of the camera, the scaling factor is α = h_absolute / h_{relative_m}, and the absolute depth is D_absolute = α·D_relative. The flow of depth scale recovery from the ground point cloud and the camera height is shown in fig. 9.
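As a non-limiting illustration, the scale recovery can be sketched as follows. The similar-triangle formula is not reproduced in this text, so the sketch assumes a pinhole camera whose optical axis is parallel to the ground and whose image rows increase downward, giving h_relative = z·(y_ground - c_y) / f_y for a ground pixel of relative depth z. The scaling factor is assumed to be the ratio of the known camera height to the median relative height. The function name `recover_scale` and the parameters `fy` and `cy` (vertical focal length and principal point row, in pixels) are hypothetical.

```python
import numpy as np

def recover_scale(depth_rel, ground_mask, fy, cy, h_absolute):
    """Return (alpha, absolute depth map) from a relative depth map.

    depth_rel: (H, W) relative depth D_relative.
    ground_mask: (H, W) boolean mask of pixels classified as ground.
    h_absolute: known camera height above the ground.
    """
    ground_mask = np.asarray(ground_mask, dtype=bool)
    rows, _ = np.nonzero(ground_mask)      # y_ground of every ground pixel
    z = depth_rel[ground_mask]             # relative depth, same row-major order
    # Assumed similar-triangle relation: (y_ground - cy) / fy = h / z.
    h_relative = z * (rows - cy) / fy
    h_relative_m = np.median(h_relative)   # the median is robust to outliers
    alpha = h_absolute / h_relative_m
    return alpha, alpha * depth_rel
```

Taking the median rather than the mean keeps a few misclassified ground points from distorting the recovered scale.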

Monocular depth estimation algorithms are generally unstable and easily affected by the environment; the invention therefore provides a binary mask for marking uncertain regions, which is implemented as follows:

After training, the RGB images are normalized and input into the depth estimation network, which outputs the depth map of the monocular image. The images of adjacent frames are then concatenated along the channel dimension again, the pose between the two frames is recalculated, and the geometric consistency loss function is recalculated. The per-pixel geometric consistency losses are sorted; the 20% of pixels with the largest loss are set to 1 and the remaining pixels are set to 0, and the result is used as a pixel uncertainty mask marking depth inconsistency. The specific flow is shown in fig. 8.
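As a non-limiting illustration, the binarization step can be sketched as follows, assuming the per-pixel geometric consistency error is already available as a two-dimensional array. The function name `uncertainty_mask` and the parameter `ratio` are hypothetical; ties at the threshold are broken arbitrarily.

```python
import numpy as np

def uncertainty_mask(error_map, ratio=0.2):
    """Set the `ratio` fraction of pixels with the largest error to 1, the rest to 0."""
    flat = error_map.ravel()
    k = int(round(ratio * flat.size))          # number of pixels to mark
    mask = np.zeros(flat.size, dtype=np.uint8)
    if k > 0:
        # After partitioning, the last k indices point at the k largest errors.
        top = np.argpartition(flat, flat.size - k)[flat.size - k:]
        mask[top] = 1
    return mask.reshape(error_map.shape)
```

A partial sort is sufficient here because only membership in the top 20% matters, not the full ordering of the losses.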

Ground point cloud acquisition algorithm:

In order to recover the absolute depth scale, this embodiment needs to acquire the ground point cloud, and uses an SVM classifier with a linear kernel to segment it. The specific training process is as follows: classes 40, 44, 48 and 49 of the Kitti_Semantic road point cloud classification dataset are taken as the positive (ground point cloud) class, and the remaining classes as the negative class. After the data set is divided, the point cloud data are input into the SVM classifier, a binary cross-entropy loss function is constructed, and the linear-kernel SVM is trained with an SGD optimizer to obtain an SVM classifier capable of segmenting the ground point cloud. Tests are performed on the corresponding test set, and the final classification result is shown in fig. 7. The time and accuracy of the classification test are shown in the following table.

	Highest	Lowest	Average
Time	16 ms	0.9 ms	2 ms
Accuracy	97%	67%	86%

Specifically, the binary cross-entropy loss function is:

L(y, t) = -t*log(y) - (1-t)*log(1-y)

where L is the loss function, y is the output of the model, and t is the corresponding ground-truth label.
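As a non-limiting illustration, the loss above and one possible SGD training loop for a linear classifier can be sketched as follows. The sketch assumes that the model output y is a sigmoid of a linear score, so that y lies in (0, 1); the source does not state how the classifier output is mapped to a probability. The function names, the learning rate and the number of epochs are hypothetical.

```python
import numpy as np

def bce(y, t, eps=1e-12):
    """L(y, t) = -t*log(y) - (1-t)*log(1-y), with y clipped for numerical safety."""
    y = np.clip(y, eps, 1.0 - eps)
    return -t * np.log(y) - (1.0 - t) * np.log(1.0 - y)

def train_linear_classifier(x, t, lr=0.1, epochs=200, seed=0):
    """Fit weights w and bias b by per-sample SGD on the loss above.

    x: (N, D) point features; t: (N,) labels, 1 for ground and 0 otherwise.
    """
    rng = np.random.default_rng(seed)
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(x)):
            y = 1.0 / (1.0 + np.exp(-(x[i] @ w + b)))  # assumed sigmoid output
            grad = y - t[i]        # derivative of the loss w.r.t. the linear score
            w -= lr * grad * x[i]
            b -= lr * grad
    return w, b
```

A point is then classified as ground when its linear score `x @ w + b` is positive.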

The present embodiment also provides a monocular image depth estimation system, which includes:

an encoder module: dilated (hole) convolution is added to the ResNet18 residual network encoder, and the coarse-grained edge information of the upper encoding layer is fused through short skip connections to enhance depth estimation details, yielding a depth feature encoder;

a decoder module: a decoder neural network is constructed using convolution and up-sampling, and the semantic features of the corresponding encoder layers are fused into each layer of the decoder neural network through long skip connections, yielding a depth decoder;

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A monocular image depth estimation method, comprising:

adding a dilated (hole) convolution downsampling module to the ResNet18 residual network encoder, and fusing the coarse-grained edge information of the upper encoding layer through short skip connections to enhance monocular image depth estimation details, thereby obtaining a depth feature encoder;

utilizing a ResNet18 feature encoder as a pose encoder;

2. The monocular image depth estimation method of claim 1, wherein establishing a minimized photometric reprojection error loss function, a depth smoothing loss function, a geometric consistency loss function and a feature point reprojection loss function to train the depth estimation network and the pose estimation network comprises the following specific steps:

suppose there are three consecutive frames of images: frame -1 f_{-1}, frame 0 f_0 and frame 1 f_1; the frame 0 image f_0 is fed into the depth estimation network to obtain the depth map D_0; at the same time, the camera pose estimation network is used to calculate the camera pose P_{0→-1} between f_0 and f_{-1} and the camera pose P_{0→1} between f_0 and f_1; P_{0→-1}, the depth map D_0 and frame -1 f_{-1} are used to reconstruct frame 0 f_0, obtaining the reconstructed frame f'_{-1→0}; similarly, the reconstruction f'_{1→0} of frame 0 f_0 from frame 1 f_1 is obtained; the photometric reprojection error between f'_{-1→0} and f_0 and the photometric reprojection error between f'_{1→0} and f_0 are calculated, and the minimum reprojection error is taken at each pixel, thereby constructing the minimized photometric reprojection error loss function;
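As a non-limiting illustration, the per-pixel minimum over the two reconstructions can be sketched as follows. For brevity the sketch uses only an absolute color difference as the photometric error and omits any structural similarity term; the reconstructed frames are assumed to have been produced beforehand. The function name `min_reprojection_loss` is hypothetical.

```python
import numpy as np

def min_reprojection_loss(f0, recon_prev, recon_next):
    """Return (per-pixel minimum error map, mean loss).

    f0: (H, W, 3) frame 0; recon_prev, recon_next: (H, W, 3) reconstructions of
    frame 0 from frame -1 and from frame 1 respectively.
    """
    # Simplified photometric error: mean absolute color difference per pixel.
    err_prev = np.abs(recon_prev - f0).mean(axis=-1)
    err_next = np.abs(recon_next - f0).mean(axis=-1)
    # Keep, at each pixel, the smaller of the two errors.
    per_pixel_min = np.minimum(err_prev, err_next)
    return per_pixel_min, per_pixel_min.mean()
```

Taking the minimum at each pixel means a pixel that is occluded in one adjacent frame but visible in the other is scored by the reconstruction in which it is visible.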

3. The monocular image depth estimation method according to claim 1 or 2, wherein the depth convolution downsampling module is composed of a dilated (hole) convolution operation, a batch normalization operation, a pointwise convolution operation and a GELU activation function, and the downsampling using the dilated convolution consists of a pointwise convolution over the channel dimension, a batch normalization operation, a spatial grouped dilated convolution and a GELU activation function;

4. The monocular image depth estimation method of claim 1, wherein a multi-scale feature fusion module is added at the end of the depth feature encoder, the multi-scale feature fusion module is composed of an atrous spatial pyramid pooling module and a squeeze-and-excitation module, and is used for further extracting features and details of different scales, and the specific extraction process comprises:

5. The monocular image depth estimation method of claim 1, wherein the minimized photometric re-projection error loss function is:

wherein λ is a selectable hyperparameter;

wherein x and y are windows at the same position in the two images; μ_x and μ_y are the mean values within the corresponding windows; σ_x^2 is the variance of all data in window x; σ_y^2 is the variance of all data in window y; σ_xy is the covariance of the data in windows x and y; c_1 = (k_1·L)^2, c_2 = (k_2·L)^2, k_1 = 0.01, k_2 = 0.03, L = 1.0, and the window size is 3×3.
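As a non-limiting illustration, the structural similarity of one pair of 3×3 windows can be computed from these quantities as follows. The formula is not reproduced in this text, so the sketch assumes the standard structural similarity (SSIM) expression, with which the listed symbols and constants agree. The function name `ssim_window` is hypothetical.

```python
import numpy as np

K1, K2, L = 0.01, 0.03, 1.0
C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2

def ssim_window(x, y):
    """Standard SSIM (assumed form) between two windows of equal shape."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()                 # sigma_x^2 and sigma_y^2
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()       # sigma_xy
    numerator = (2.0 * mu_x * mu_y + C1) * (2.0 * cov_xy + C2)
    denominator = (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2)
    return numerator / denominator
```

Identical windows give a similarity of 1, and the value decreases as the windows differ in mean, contrast or structure.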

6. The monocular image depth estimation method of claim 1, wherein the depth smoothing loss function is:

wherein f_p is the input frame image; f_p^x is the average of the gradient values of the pixels of the input frame image in the horizontal direction, and f_p^y is the average of the gradient values of the pixels of the input frame image in the vertical direction; d_p is the depth map estimated by the depth network for the input frame; d_p^x is the gradient value of the pixels of the output depth map in the horizontal direction, and d_p^y is the gradient value of the pixels of the output depth map in the vertical direction; v is the total number of pixels in the image;

the geometric consistency loss function is:

wherein,

wherein D_diff(p) is the depth consistency error at each point p, D'_{f_0}(p) is the estimated depth map, and the other depth term is the depth map reconstructed from the depth map of the previous frame.

7. The monocular image depth estimation method of claim 1, wherein the feature point re-projection loss function is:

8. The monocular image depth estimation method of claim 1, wherein the pre-training process of the SVM classifier is as follows: classes 40, 44, 48 and 49 of the Kitti_Semantic road point cloud classification data set are taken as the positive (ground point cloud) class, and the remaining classes as the negative class; after the data set is divided, the point cloud classification data are input into the SVM classifier, a binary cross-entropy loss function is constructed, and the SVM classifier is trained with an SGD (stochastic gradient descent) optimizer to obtain an SVM classifier capable of segmenting the ground point cloud;

the method for recovering the scale information of the image depth map comprises the following steps: the segmented ground point cloud Point_ground is obtained and re-projected onto the pixel plane to obtain the coordinate y_ground of each point in the pixel coordinate system, and the relative height h_relative of the camera is obtained from the similar-triangle relation:

9. The monocular image depth estimation method of claim 1, wherein the geometric consistency error construction method is used to recalculate the pixel-level geometric consistency error from the current frame depth map, the previous frame depth map, the pair of current frame and previous frame RGB images, and the camera pose between the two frames; the 20% of pixels with the largest error are set to 1 and the remaining pixels are set to 0, thereby binarizing the error.

10. A monocular image depth estimation system, comprising: