Automatic driving borderless scene implicit reconstruction method considering geometric information enhancement
Technical Field
The invention relates to the field of automatic driving scene reconstruction, in particular to an automatic driving borderless scene implicit reconstruction method and system considering geometric information enhancement.
Background
Neural radiance fields (NeRF) are a novel three-dimensional reconstruction technique. Three-dimensional reconstruction is an indispensable key technology in automatic driving. Automatic driving systems require accurate environment perception capabilities, such as recognition and tracking of roadways, static scenes, and dynamic objects, as well as path planning and high-quality three-dimensional modeling of scenes. Three-dimensional reconstruction assists the automatic driving system in accomplishing these tasks, thereby improving its safety and reliability. First, NeRF can reconstruct 2D images into a 3D scene, from which high-precision maps can be produced, enabling high-precision vehicle positioning and map matching and promoting research and development of downstream automatic driving tasks. Second, NeRF can synthesize complex automatic driving scenes, further enriching the training data and helping the automatic driving system perform efficient data augmentation. Third, NeRF can simulate severe scenes such as extreme weather and serious traffic accidents, so that real severe scenes can be reproduced with simulated data, improving the safety of automatic driving. In short, NeRF-based three-dimensional reconstruction has wide application in automatic driving, and combining it with automatic driving scenes helps promote the development and application of automatic driving technology.
Applying NeRF to automatic driving scenes raises several problems. First, because the viewing angles available in an automatic driving scene are limited, NeRF reconstruction quality degrades, and synthesizing high-quality views under limited viewing angles remains an open problem. Second, because automatic driving scenes involve large volumes of data while the storage capacity of a NeRF model is limited, reconstructing large-scale scenes within limited model storage is a challenge. Third, automatic driving scenes exhibit changes in illumination and appearance as well as moving dynamic objects, which violates the assumptions of the original NeRF model, so the handling of dynamic objects in the scene urgently needs to be solved. Finally, the training and rendering speed of the model must be improved to accelerate the rendering of automatic driving scenes. High-quality automatic driving scene reconstruction is a precondition for practical deployment, such as the large-scale commercialization of automatic driving.
Disclosure of Invention
The invention aims to overcome the deficiencies of the prior art and provides an automatic driving borderless scene implicit reconstruction method and system considering geometric information enhancement. The method and system use a geometry-aware, grid-based neural rendering system for automatic driving, which reconstructs and renders an accurate driving environment from sparse sensor data by using perspective-deformed hash grids and signed distance functions (SDFs), improves scene reconstruction quality, handles borderless scenes effectively, and improves performance in environments with sparse viewpoints.
On one hand, the automatic driving borderless scene implicit reconstruction method considering geometric information enhancement provided by the invention comprises the following steps:
S1, performing space division on a borderless open scene through an octree structure to generate a plurality of local small scenes, mapping the space at infinity to a finite distance by applying a perspective deformation function to each local small scene, performing spatial encoding on the local small scenes, storing the encoded features in a multi-resolution hash grid, indexing, through hash coding, the local grid cell corresponding to any point in space, and performing trilinear interpolation over the vertices of that local grid cell to obtain the feature of the spatial point;
S2, inputting the features of the spatial points into a neural implicit rendering network, outputting a signed distance function (SDF) field and a color field, and rendering with the differentiable geometric enhancement features to obtain predicted values of the scene color, scene depth, and scene normal vector;
S3, extracting scene colors, scene depths and scene normal vectors through a monocular depth model, and applying additional constraint to a Signed Distance Function (SDF) field by introducing a geometric prior generated by a pre-trained monocular estimation model to obtain true values of the scene colors, the scene depths and the scene normal vectors;
S4, calculating a loss function from the predicted values in step S2 and the true values in step S3, and performing joint optimization according to the loss function to realize scene reconstruction.
In step S1, generating a plurality of local small scenes by performing spatial division on the borderless open scene through the octree structure further includes:
initializing the root node size of the octree to 32 times the size of the bounding box enclosing all input camera trajectories;
for each tree node of the octree, determining whether to further subdivide according to the visibility of the camera and the distance from the node center, forming a leaf node set.
Further, in step S1, constructing the perspective deformation function by a principal component analysis PCA method, and mapping the space at infinity to a finite distance by using the perspective deformation function on each of the local small scenes specifically includes:
First, point cloud data or mesh vertices of a three-dimensional model are taken as a data set, in which the three-dimensional coordinates of each point form an original feature vector. The principal directions of variation of the data set are found by the PCA method, and a new two-dimensional coordinate system is defined by these directions. The perspective deformation function is $F(x) = M\,W(x)$, where $W(x)$ denotes the two-dimensional coordinates obtained by projecting the point $x$ onto all visible cameras, and $M$ is a projection matrix constructed by the PCA method.
Further, in step S1, storing the encoded features using the multi-resolution hash grid further includes:
The feature corresponding to each leaf node is mapped into a hash pool through a hash function. The hash pool comprises 16 levels, each level holding $2^{19}$ two-dimensional feature vectors, and the feature vectors are obtained through the following spatial interpolation calculation:
where $o$ and $d$ denote the camera center and the ray direction, respectively, $J_i$ is the Jacobian matrix of the perspective deformation function at $x_i$, and $l$ is the hyperparameter controlling the sampling interval.
Further, in step S2, inputting the features of the spatial points into a neural implicit rendering network, and outputting the signed distance function SDF field and the color field further includes:
The SDF is learned by a multi-layer perceptron (MLP) network, and density is modeled with the SDF as an intermediate variable. The object surface is defined as the zero level set of the SDF, and the SDF is converted into volume density through the cumulative distribution function of the Laplace distribution. The density is expressed as:
$\sigma(x) = \alpha\,\Psi_\beta(-d_\Omega(x))$,
where $\alpha$ and $\beta$ are learnable parameters, $\Psi_\beta$ denotes the cumulative distribution function of the Laplace distribution, and $d_\Omega(x)$ is the value of the SDF field at point $x$.
Preferably, in step S2, rendering the predicted values of the scene color, the scene depth, and the scene normal vector by using the differentiable geometric enhancement features further includes:
calculating the gradient of the SDF by numerical differentiation to obtain the surface normal, wherein the step size is initialized to the leaf node size and is gradually reduced to capture local details, the gradient calculation formula of the SDF being expressed as:
where $\epsilon$ is a small offset vector used to perturb $x_i$;
and finally calculating the predicted values of the scene colors, the scene depths and the scene normal vectors through a volume rendering technology.
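The numerical differentiation described above can be sketched as follows. This is a minimal illustration that assumes central differences along each axis and a geometric decay of the step size; the function names `sdf_normal` and `step_schedule`, and the decay rule itself, are hypothetical choices and are not specified by the text.

```python
import numpy as np

def sdf_normal(sdf, x, eps):
    """Central-difference gradient of `sdf` at points x of shape (N, 3).

    Returns (unit normals, raw gradients). `sdf` maps (N, 3) -> (N,).
    """
    grad = np.zeros_like(x, dtype=float)
    for axis in range(3):
        e = np.zeros(3)
        e[axis] = eps                       # perturb one coordinate at a time
        grad[:, axis] = (sdf(x + e) - sdf(x - e)) / (2.0 * eps)
    norm = np.linalg.norm(grad, axis=1, keepdims=True)
    return grad / np.maximum(norm, 1e-12), grad

def step_schedule(leaf_size, final_eps, n_steps):
    """Assumed schedule: start at the leaf-node size and shrink geometrically."""
    return np.geomspace(leaf_size, final_eps, n_steps)
```

A large initial step smooths the gradient over a whole leaf cell, while the later, smaller steps let the normals follow fine surface detail.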
Further, in step S3, applying additional constraints to the signed distance function SDF field by introducing a geometric prior generated by the pre-trained monocular estimation model, obtaining true values of the scene color, the scene depth, and the scene normal vector further includes:
generating relative depth values with the pre-trained monocular estimation model, learning the scale and shift of each batch by a least-squares criterion, and optimizing with a depth consistency loss function to align the relative depth values with the actual depth values, the depth consistency loss function being expressed as:
ensuring that the rendered normals and the predicted monocular normals are consistent in the same coordinate space, and optimizing with a normal consistency loss function composed of an L1 normal loss and an angular loss, the normal consistency loss function being expressed as:
where $k$ and $b$ are the learnable parameters that align the depth values, $R$ is the set of rays in the training batch, $r$ is an element of $R$, $\hat{D}(r)$ is the predicted scene depth for ray $r$, $\bar{D}(r)$ is the true scene depth for ray $r$, $\hat{N}(r)$ is the predicted scene normal vector for ray $r$, and $\bar{N}(r)$ is the true scene normal vector for ray $r$.
Preferably, step S4 further comprises:
setting a scene color reconstruction loss function which, by minimizing the color difference between the input image and the rendered image, links the 3D environment with its corresponding 2D observations, the scene color reconstruction loss function being defined as:
where $R$ is the set of rays in the training batch, $r$ is an element of $R$, $\hat{C}(r)$ is the predicted scene color for ray $r$, and $C(r)$ is the true scene color for ray $r$;
setting a regularization loss function, in which an Eikonal term is introduced to regularize the SDF field and a disparity loss constrains the disparity to reduce floating artifacts, the regularization loss function being defined as:
where $\nabla d_\Omega(x)$ is the gradient of the SDF at point $x$.
More preferably, in step S4, performing joint optimization according to the loss function to implement scene reconstruction further includes:
All loss functions are jointly optimized by an Adam optimizer until a predetermined reconstruction quality and accuracy are reached, and the final loss function is expressed as:
$L = L_{rgb} + \lambda_{depth} L_{depth} + \lambda_{normal} L_{normal} + \lambda_{eikonal} L_{eikonal} + \lambda_{disp} L_{disp}$,
where $\lambda_{depth}$ is the depth consistency loss weight, $\lambda_{normal}$ is the normal consistency loss weight, $\lambda_{eikonal}$ is the regularization loss weight, $L_{disp}$ is the disparity map loss, and $\lambda_{disp}$ is the disparity map loss weight.
In another aspect, the invention provides an automatic driving borderless scene implicit reconstruction system considering geometric information enhancement, comprising:
a space division characterization module, used for performing space division on a borderless open scene through an octree structure to generate a plurality of local small scenes, mapping the space at infinity to a finite distance by applying a perspective deformation function to each local small scene, performing spatial encoding on the local small scenes, storing the encoded features in a multi-resolution hash grid, indexing, through hash coding, the local grid cell corresponding to any point in space, and performing trilinear interpolation over the vertices of that local grid cell to obtain the feature of the spatial point;
a spatial rendering module, used for inputting the features of the spatial points into a neural implicit rendering network, outputting a signed distance function (SDF) field and a color field, and rendering with the differentiable geometric enhancement features to obtain predicted values of the scene color, scene depth, and scene normal vector;
The multi-consistency loss function scene reconstruction module is used for extracting scene colors, scene depths and scene normal vectors through a monocular depth model, applying additional constraint to a signed distance function SDF field through introducing a geometric prior generated by a pre-trained monocular estimation model to obtain true values of the scene colors, the scene depths and the scene normal vectors, calculating a loss function according to the predicted values and the true values, and carrying out joint optimization according to the loss function to realize scene reconstruction.
Compared with the prior art, the invention has the beneficial effects that:
(1) By introducing geometric enhancement features and perspective-deformed hash grids, the invention markedly improves the geometric detail and global consistency of the reconstructed driving scene, and its reconstruction quality on the KITTI, Free-dataset, and self-acquired urban road data sets is superior to that of existing state-of-the-art methods;
(2) The invention utilizes SDF fields and geometric prior enhancement characteristics, so that the system can still maintain high precision and high stability when processing complex open scenes with sparse view points, and particularly under the conditions of a low texture area and the sparse view points, the reconstruction effect of the system is still excellent, and mismatch and blurring phenomena in reconstruction are obviously reduced;
(3) According to the invention, through space division and feature storage of perspective deformed hash grids, efficient sampling and calculation are realized on data processing and feature extraction, the calculation complexity and time cost of a system are remarkably reduced, and under the same training turn, compared with the traditional multi-view three-dimensional reconstruction method, the method can obtain a reconstruction result with higher quality in a shorter time;
(4) By using the pre-trained monocular depth and normal estimation model, the system can extract geometric priors from relatively simple and low-cost image data, which improves the feasibility and economy of reconstruction.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention.
In the drawings:
FIG. 1 is a flow chart of an automatic driving borderless scene implicit reconstruction method considering geometric information enhancement;
FIG. 2 is a schematic view of a scene color, normal vector, and depth rendering effect according to the present invention;
FIG. 3 is a block diagram of an automatic driving borderless scene implicit reconstruction system considering geometric information enhancement according to the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The method and system of the present invention use a hash pool comprising 16 levels, each level holding $2^{19}$ two-dimensional feature vectors. The feature vectors are obtained through spatial interpolation and input to the MLP network, which extracts scene features and the SDF. Finally, high-quality scene reconstruction is achieved by joint optimization over RGB images, depth, and normals.
The following describes specific embodiments of the present invention with reference to the drawings and examples.
Example 1
As shown in fig. 1, the technical scheme of the implicit reconstruction method of the automatic driving borderless scene considering geometric information enhancement provided in the embodiment includes the following steps:
S1, performing space division on a borderless open scene through an octree structure to generate a plurality of local small scenes, mapping the space at infinity to a finite distance by applying a perspective deformation function to each local small scene, performing spatial encoding on the local small scenes, storing the encoded features in a multi-resolution hash grid, indexing, through hash coding, the local grid cell corresponding to any point in space, and performing trilinear interpolation over the vertices of that local grid cell to obtain the feature of the spatial point;
S2, inputting the features of the spatial points into a neural implicit rendering network, outputting a signed distance function (SDF) field and a color field, and rendering with the differentiable geometric enhancement features to obtain predicted values of the scene color, scene depth, and scene normal vector;
S3, extracting scene colors, scene depths and scene normal vectors through a monocular depth model, and applying additional constraint to a Signed Distance Function (SDF) field by introducing a geometric prior generated by a pre-trained monocular estimation model to obtain true values of the scene colors, the scene depths and the scene normal vectors;
S4, calculating a loss function from the predicted values in step S2 and the true values in step S3, and performing joint optimization according to the loss function to realize scene reconstruction.
In step S1, the sparse sensor data is spatially transformed by the perspective deformation function, and the hash grid stores the encoded features, which ensures efficient data sampling and processing. For the space division, generating a plurality of local small scenes by performing space division on the borderless open scene through the octree structure further comprises:
initializing the root node size of the octree to 32 times the size of the bounding box enclosing all input camera trajectories;
for each tree node of the octree, determining whether to further subdivide according to the visibility of the camera and the distance from the node center, forming a leaf node set.
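The subdivision rule above can be sketched as follows. This is a simplified illustration under stated assumptions: camera visibility is approximated by a distance test only (a node is split while some camera center lies within `ratio` times its side length of the node center), and `ratio` and `min_side` are hypothetical parameters added so that the recursion terminates. Only the factor of 32 comes from the text.

```python
import numpy as np

def subdivide(center, side, cams, ratio=3.0, min_side=1.0):
    """Return the leaf cells (center, side) below one octree node."""
    dist = np.linalg.norm(cams - center, axis=1)
    near = dist <= ratio * side          # cameras close enough to need finer cells
    if side <= min_side or not near.any():
        return [(center, side)]          # keep this node as a leaf
    leaves = []
    for dx in (-0.25, 0.25):
        for dy in (-0.25, 0.25):
            for dz in (-0.25, 0.25):
                child = center + side * np.array([dx, dy, dz])
                leaves += subdivide(child, side / 2.0, cams, ratio, min_side)
    return leaves

def build_octree(cams, scale=32.0, **kwargs):
    """Root node is `scale` times the bounding box of the camera centers."""
    lo, hi = cams.min(axis=0), cams.max(axis=0)
    center = (lo + hi) / 2.0
    side = scale * max((hi - lo).max(), 1e-6)
    return subdivide(center, side, cams, **kwargs)
```

Cells far from every camera stay coarse, while cells along the trajectory are refined, so each leaf covers a local small scene of roughly comparable apparent size.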
In addition, for the perspective deformation, the perspective deformation function is constructed by the principal component analysis (PCA) method to preserve information fidelity after dimension reduction. Specifically, mapping the space at infinity to a finite distance by applying the perspective deformation function to each local small scene comprises:
First, point cloud data or mesh vertices of a three-dimensional model are taken as a data set, in which the three-dimensional coordinates of each point form an original feature vector. The principal directions of variation of the data set are found by the PCA method, and a new two-dimensional coordinate system is defined by these directions. The perspective deformation function is $F(x) = M\,W(x)$, where $W(x)$ denotes the two-dimensional coordinates obtained by projecting the point $x$ onto all visible cameras, and $M$ is a projection matrix constructed by the PCA method. This deformation ensures uniform sampling in the deformed space and improves the accuracy and efficiency of sampling.
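A minimal sketch of how such a deformation could be fitted is given below. It assumes pinhole cameras described by hypothetical tuples `(K, R, t)`, computes the PCA through a singular value decomposition, and subtracts the mean of W before projecting (a common PCA convention that the text does not state). The number of retained directions `out_dim` is left as an explicit argument.

```python
import numpy as np

def project(points, K, R, t):
    """Pinhole projection of (N, 3) world points to (N, 2) image coordinates."""
    cam = points @ R.T + t               # world frame -> camera frame
    uv = cam @ K.T
    return uv[:, :2] / uv[:, 2:3]        # perspective division

def stack_projections(points, cameras):
    """W(x): the 2D projections onto all cameras, concatenated to (N, 2n)."""
    return np.hstack([project(points, *c) for c in cameras])

def fit_warp(points, cameras, out_dim):
    """Fit M by PCA so that F(x) = M (W(x) - mean)."""
    W = stack_projections(points, cameras)
    mean = W.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(W - mean, full_matrices=False)
    return Vt[:out_dim], mean

def warp(points, cameras, M, mean):
    return (stack_projections(points, cameras) - mean) @ M.T
```

Because the perspective division shrinks distant points toward a bounded image region, equal steps in the warped coordinates correspond to progressively larger steps in world space as the distance grows.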
Further, in step S1, storing the encoded features using the multi-resolution hash grid further includes:
The feature corresponding to each leaf node is mapped into a hash pool through a hash function. The hash pool comprises 16 levels, each level holding $2^{19}$ two-dimensional feature vectors, and the feature vectors are obtained through the following spatial interpolation calculation:
where $o$ and $d$ denote the camera center and the ray direction, respectively, $J_i$ is the Jacobian matrix of the perspective deformation function at $x_i$, and $l$ is the hyperparameter controlling the sampling interval.
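The hash-grid lookup with trilinear interpolation can be sketched as follows. This is an illustration only: it assumes the widely used XOR-of-primes spatial hash, points already mapped into the unit cube, and one feature table per level. The table size and number of levels are ordinary parameters here, and the names `hash_index`, `encode_level`, and `encode` are hypothetical.

```python
import numpy as np

# Primes commonly used for spatial hashing of 3D grid coordinates.
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

def hash_index(corner, table_size):
    """Hash integer grid corners of shape (..., 3) into [0, table_size)."""
    c = corner.astype(np.uint64)
    h = (c[..., 0] * PRIMES[0]) ^ (c[..., 1] * PRIMES[1]) ^ (c[..., 2] * PRIMES[2])
    return (h % np.uint64(table_size)).astype(np.int64)

def encode_level(x, table, resolution):
    """Trilinearly interpolate one level's features at points x in [0, 1)^3."""
    pos = x * resolution
    base = np.floor(pos).astype(np.int64)    # lower corner of the enclosing cell
    frac = pos - base                        # position inside the cell
    out = np.zeros((x.shape[0], table.shape[1]))
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                offset = np.array([dx, dy, dz])
                # Weight of this corner: product of per-axis linear weights.
                w = np.prod(np.where(offset == 1, frac, 1.0 - frac), axis=1)
                idx = hash_index(base + offset, table.shape[0])
                out += w[:, None] * table[idx]
    return out

def encode(x, tables, resolutions):
    """Concatenate the interpolated features of all levels."""
    return np.concatenate(
        [encode_level(x, t, r) for t, r in zip(tables, resolutions)], axis=1)
```

The eight corner weights sum to one, so the encoding varies continuously as a point moves across cell boundaries, which keeps the downstream SDF differentiable with respect to position.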
Further, in step S2, inputting the features of the spatial points into a neural implicit rendering network, and outputting the signed distance function SDF field and the color field further includes:
The SDF is learned by a multi-layer perceptron (MLP) network, and density is modeled with the SDF as an intermediate variable. The object surface is defined as the zero level set of the SDF, and the SDF is converted into volume density through the cumulative distribution function of the Laplace distribution. The density is expressed as:
$\sigma(x) = \alpha\,\Psi_\beta(-d_\Omega(x))$,
where $\alpha$ and $\beta$ are learnable parameters, $\Psi_\beta$ denotes the cumulative distribution function of the Laplace distribution, and $d_\Omega(x)$ is the value of the SDF field at point $x$.
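The density formula can be evaluated directly, as in the sketch below. It assumes a zero-mean Laplace distribution with scale β and an SDF that is positive outside the object; the function names are hypothetical, and in training α and β would be learned rather than fixed.

```python
import numpy as np

def laplace_cdf(s, beta):
    """CDF of a zero-mean Laplace distribution with scale beta."""
    half_tail = 0.5 * np.exp(-np.abs(s) / beta)   # never overflows
    return np.where(s <= 0, half_tail, 1.0 - half_tail)

def sdf_to_density(sdf, alpha, beta):
    """sigma(x) = alpha * Psi_beta(-d(x)), with d > 0 outside the surface."""
    return alpha * laplace_cdf(-np.asarray(sdf, dtype=float), beta)
```

With this mapping the density approaches α inside the object, equals α/2 on the surface, and decays exponentially outside it. A smaller β makes the transition sharper.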
In addition, in step S2, rendering the predicted values of the scene color, the scene depth, and the scene normal vector by using the differentiable geometric enhancement features further includes:
calculating the gradient of the SDF by numerical differentiation to obtain the surface normal, wherein the step size is initialized to the leaf node size and is gradually reduced to capture local details, the gradient calculation formula of the SDF being expressed as:
where $\epsilon$ is a small offset vector used to perturb $x_i$;
and finally calculating the predicted values of the scene colors, the scene depths and the scene normal vectors through a volume rendering technology.
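The volume rendering step can be sketched as below for a single ray. This is the standard alpha-compositing quadrature. The array layout and the function name `volume_render` are assumptions, and the last sampling interval is assumed to have the same length as the one before it.

```python
import numpy as np

def volume_render(sigma, t, color, normal):
    """Composite per-sample values along one ray.

    sigma:  (N,) volume densities at the samples
    t:      (N,) sample distances along the ray, increasing, N >= 2
    color:  (N, 3) per-sample colors
    normal: (N, 3) per-sample SDF normals
    """
    # Interval lengths; the final interval repeats the previous one.
    delta = np.diff(t, append=t[-1] + (t[-1] - t[-2]))
    alpha = 1.0 - np.exp(-sigma * delta)            # opacity of each interval
    # Transmittance: fraction of light reaching sample i unoccluded.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    w = trans * alpha                               # compositing weights
    rgb = (w[:, None] * color).sum(axis=0)
    depth = (w * t).sum()
    n = (w[:, None] * normal).sum(axis=0)
    return rgb, depth, n, w
```

The same weights produce the color, the depth, and the normal, so a loss on any of the three outputs back-propagates into the shared SDF field.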
Further, in step S3, by introducing a geometric prior generated by the pre-trained monocular estimation model, the geometric prior including depth and normal estimation, applying additional constraints to the signed distance function SDF field, obtaining true values of scene colors, scene depths, and scene normal vectors further includes:
generating relative depth values with the pre-trained monocular estimation model, learning the scale and shift of each batch by a least-squares criterion, and optimizing with a depth consistency loss function to align the relative depth values with the actual depth values, the depth consistency loss function being expressed as:
ensuring that the rendered normals and the predicted monocular normals are consistent in the same coordinate space, and optimizing with a normal consistency loss function composed of an L1 normal loss and an angular loss, the normal consistency loss function being expressed as:
where $k$ and $b$ are the learnable parameters that align the depth values, $R$ is the set of rays in the training batch, $r$ is an element of $R$, $\hat{D}(r)$ is the predicted scene depth for ray $r$, $\bar{D}(r)$ is the true scene depth for ray $r$, $\hat{N}(r)$ is the predicted scene normal vector for ray $r$, and $\bar{N}(r)$ is the true scene normal vector for ray $r$.
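One plausible form of the two consistency terms is sketched below. It assumes that the scale k and shift b are solved in closed form by least squares for each batch and applied to the predicted depth, that the depth term is a squared error, and that the normal term adds an L1 difference to an angular term of the form |1 − n̂·n̄|. Averaging over the batch is also an assumption, as are all function names.

```python
import numpy as np

def align_scale_shift(d_pred, d_true):
    """Least-squares k, b minimizing sum (k * d_pred + b - d_true)^2."""
    A = np.stack([d_pred, np.ones_like(d_pred)], axis=1)
    (k, b), *_ = np.linalg.lstsq(A, d_true, rcond=None)
    return k, b

def depth_consistency_loss(d_pred, d_true):
    """Squared depth error after per-batch scale and shift alignment."""
    k, b = align_scale_shift(d_pred, d_true)
    return np.mean((k * d_pred + b - d_true) ** 2)

def normal_consistency_loss(n_pred, n_true):
    """L1 difference plus angular term for unit normals of shape (N, 3)."""
    l1 = np.abs(n_pred - n_true).sum(axis=1)
    angular = np.abs(1.0 - (n_pred * n_true).sum(axis=1))
    return np.mean(l1 + angular)
```

Solving for the scale and shift per batch is what makes the depth term usable with monocular estimates, which are only defined up to an unknown affine transformation.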
In step S4, joint optimization combining the RGB input with the monocular depth and normal observations further comprises:
setting a scene color reconstruction loss function which, by minimizing the color difference between the input image and the rendered image, links the 3D environment with its corresponding 2D observations, the scene color reconstruction loss function being defined as:
where $R$ is the set of rays in the training batch, $r$ is an element of $R$, $\hat{C}(r)$ is the predicted scene color for ray $r$, and $C(r)$ is the true scene color for ray $r$;
setting a regularization loss function, in which an Eikonal term is introduced to regularize the SDF field and a disparity loss constrains the disparity to reduce floating artifacts, the regularization loss function being defined as:
where $\nabla d_\Omega(x)$ is the gradient of the SDF at point $x$.
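The color and Eikonal terms can be sketched as below. The sketch assumes an L1 color difference averaged over the rays of a batch and the usual Eikonal penalty, which drives the norm of the SDF gradient toward one. The choice of norm and the averaging are assumptions, and the disparity term is omitted because its form is not given in the text.

```python
import numpy as np

def color_loss(c_pred, c_true):
    """Mean L1 difference between rendered and observed colors, shape (N, 3)."""
    return np.mean(np.abs(c_pred - c_true).sum(axis=1))

def eikonal_loss(grad):
    """Penalize deviation of the SDF gradient norm from 1; grad has shape (N, 3)."""
    return np.mean((np.linalg.norm(grad, axis=1) - 1.0) ** 2)
```

A true signed distance function has a unit-norm gradient almost everywhere, so the Eikonal term keeps the learned field a valid distance field rather than an arbitrary function with the right zero level set.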
Preferably, in step S4, performing joint optimization according to the loss function to implement scene reconstruction further includes:
All loss functions are jointly optimized by an Adam optimizer until a predetermined reconstruction quality and accuracy are reached, and the final loss function is expressed as:
$L = L_{rgb} + \lambda_{depth} L_{depth} + \lambda_{normal} L_{normal} + \lambda_{eikonal} L_{eikonal} + \lambda_{disp} L_{disp}$,
where $\lambda_{depth}$ is the depth consistency loss weight, $\lambda_{normal}$ is the normal consistency loss weight, $\lambda_{eikonal}$ is the regularization loss weight, $L_{disp}$ is the disparity map loss, and $\lambda_{disp}$ is the disparity map loss weight.
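The weighted sum can be written compactly as in the sketch below. The dictionary keys and the function name `total_loss` are hypothetical, and no particular weight values are implied by the text.

```python
def total_loss(losses, weights):
    """L = L_rgb + sum of lambda_name * L_name over the weighted terms.

    losses:  dict with keys "rgb", "depth", "normal", "eikonal", "disp"
    weights: dict with keys "depth", "normal", "eikonal", "disp"
    """
    return losses["rgb"] + sum(weights[name] * losses[name] for name in weights)
```

In a training loop this scalar would be the quantity handed to the Adam optimizer at every iteration.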
Through the specific implementation manner, the method and the device can realize efficient and accurate automatic driving scene reconstruction, and provide reliable technical support for efficient training and testing of an automatic driving system.
The performance of the present invention was compared with that of existing methods on multiple data sets through experimental verification. The results show that, in the sky and stair scenes of Free-dataset, the invention achieves better performance on metrics such as PSNR, SSIM, and LPIPS. On KITTI and the self-acquired FMAD data set, the invention shows higher robustness and accuracy when processing complex scenes in real driving environments.
The specific experimental data are as follows:
In the "sky" scene of Free-dataset, the method reaches a PSNR of 25.93, an SSIM of 0.827, and an LPIPS of 0.336, which are superior to the existing methods.
In the "highway" scene of the self-acquired FMAD data set, the method reaches a PSNR of 24.13, an SSIM of 0.825, and an LPIPS of 0.347, significantly better than other methods.
In conclusion, theoretical analysis and experimental data prove that compared with the prior art, the method has the advantages that the reconstruction quality and robustness of the automatic driving scene are remarkably improved, the cost is reduced, the efficiency is improved, and the method has important practical application value.
Example 2
The automatic driving borderless scene implicit reconstruction system considering geometric information enhancement provided by the embodiment comprises:
a space division characterization module, used for performing space division on a borderless open scene through an octree structure to generate a plurality of local small scenes, mapping the space at infinity to a finite distance by applying a perspective deformation function to each local small scene, performing spatial encoding on the local small scenes, storing the encoded features in a multi-resolution hash grid, indexing, through hash coding, the local grid cell corresponding to any point in space, and performing trilinear interpolation over the vertices of that local grid cell to obtain the feature of the spatial point;
a spatial rendering module, used for inputting the features of the spatial points into a neural implicit rendering network, outputting a signed distance function (SDF) field and a color field, and rendering with the differentiable geometric enhancement features to obtain predicted values of the scene color, scene depth, and scene normal vector;
The multi-consistency loss function scene reconstruction module is used for extracting scene colors, scene depths and scene normal vectors through a monocular depth model, applying additional constraint to a signed distance function SDF field through introducing a geometric prior generated by a pre-trained monocular estimation model to obtain true values of the scene colors, the scene depths and the scene normal vectors, calculating a loss function according to the predicted values and the true values, and carrying out joint optimization according to the loss function to realize scene reconstruction.
Finally, it should be noted that the above description is only a preferred embodiment of the present invention, and the scope of the present invention is not limited to the above examples, but all technical solutions belonging to the concept of the present invention belong to the scope of the present invention. It should be noted that modifications and adaptations to the present invention may occur to one skilled in the art without departing from the principles of the present invention and are intended to be within the scope of the present invention.
The technical features of the above-described embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of this description.