Dynamic scene autonomous positioning and on-line high-precision three-dimensional reconstruction method
Technical Field
The invention belongs to the technical field of synchronous positioning and mapping (Simultaneous Localization and Mapping, SLAM) of robots, and particularly relates to an autonomous positioning and on-line high-precision three-dimensional reconstruction method of a dynamic scene.
Background
With the advent of the artificial intelligence era, intelligent unmanned systems such as unmanned aerial vehicles, logistics robots and automatic driving have been widely focused and rapidly developed, and high-quality pose and environment information are the basis for effective environment perception and decision control of intelligent carriers in complex application scenes, so that SLAM technology is one of key technologies for realizing autonomy of intelligent unmanned systems. Conventional SLAM schemes require correlation of current observation information with established environmental characteristic information to solve for carrier current position and attitude data, and thus tend to assume static (or low dynamic) operating scenarios. With the increasing diversification and complexity of related applications, static scene assumptions often cannot be satisfied, such as people and traffic on city streets, and these dynamic features may introduce unreliable data association information, reducing the accuracy and robustness of the SLAM system.
In recent years, SLAM problems in dynamic environments have been widely studied, and solutions can be roughly divided into two categories: semantic extraction schemes based on traditional multi-view geometry or other outlier detection methods (e.g., RANSAC), and on deep learning. Compared to traditional methods, semantic-based schemes can output more robust and accurate segmentation detection results, but they typically rely on a priori training knowledge and tend to impact the time-running efficiency of SLAM systems. In addition, most SLAM solutions pay more attention to positioning accuracy, and the back end generally only maintains a sparse environment feature point map or a local semi-dense point cloud map, so that the method is difficult to directly apply to higher-level tasks, such as environment understanding, autonomous path planning and the like.
On the other hand, a real-time dense three-dimensional reconstruction technology in a dynamic environment has been developed in recent years, and the technical field has strong relevance to SLAM, but focuses on three-dimensional reconstruction quality of an object or a scene, such as precision, spatial resolution, model integrity, and the like. Current RGB-D camera-based real-time three-dimensional reconstruction techniques are mostly re-developed on the classical KinectFusion framework, whose model reconstruction quality is largely limited by the original depth data (noisy and systematic bias), whereas voxel fusion strategies often used to suppress noise tend to produce a low-score result (i.e. texture details are not restored). Classical Multi-View stereovision (MVS) reconstruction algorithms can optimize the original depth data using RGB image information in a neighborhood multiframe, thereby improving the final model quality, but due to the high time complexity of this optimization problem, conventional MVS algorithms are mostly applied to offline three-dimensional reconstruction tasks.
The current academy has limited research on SLAM schemes which simultaneously compromise front end positioning accuracy and back end on-line dense three-dimensional reconstruction quality. The recently proposed schemes such as DS-SLAM and Detect-SLAM maintain a dense semantic map at the back end, but the generation of the map still depends on the original depth data or the result after voxel fusion, and the geometric accuracy is still relatively limited, so that the scheme is not suitable for some applications requiring a higher reconstruction quality model.
Disclosure of Invention
First, the technical problem to be solved
The invention aims to solve the technical problems that: how to propose a dynamic scene-oriented RGB-D SLAM technical scheme, continuous and steady positioning of the front end and high-precision on-line three-dimensional dense reconstruction of the rear end can be realized to a certain extent at the same time, so that application requirements of high positioning precision and environment modeling quality are met, such as environment models with high fidelity are needed in Augmented Reality (AR) games to obtain better immersive experience, unmanned platform autonomous navigation in urban environments and the like.
(II) technical scheme
In order to solve the technical problems, the invention provides a dynamic scene autonomous positioning and on-line high-precision three-dimensional reconstruction method, which is implemented based on a three-dimensional reconstruction system, and comprises the following steps: a pose tracking thread, a target detection thread, a local map building thread, a dense map building thread and a loop detection optimizing thread;
the three-dimensional reconstruction method comprises the following steps:
step 1: color and depth image data of a scene are obtained through an RGB-D sensor, and are sent to a pose tracking thread for ORB feature point extraction and matching after preprocessing;
step 2: the pose tracking thread eliminates the associated point pairs possibly belonging to the dynamic object by utilizing the motion probability of the local map feature points, and then combines pixel point columns with higher gradient values extracted from the reference key frame to construct the following joint cost function based on geometric reprojection deviation and pixel luminosity deviation:
wherein, xi ji Representing a six degree-of-freedom rigid transformation T from a reference key frame i to a current frame j ji Lie algebraic representation of e r (x) For the reprojection residual error of the three-dimensional map feature point x on the current frame, the sigma is the covariance matrix of the reprojection residual error, e d (u) is the luminance residual of the pixel u' on the reference frame projected onto the current frame, Λ is the covariance matrix of the luminance residual, ρ is the Huber kernel function to suppress outliers, O j And S is i Respectively representing a set of local map feature points and reference frame pixel points after the dynamic region is removed;
in order to balance the influence of two types of residual terms on the optimization process, the respective residual term numbers (n r ,n d ) And standard deviation (sigma) r ,σ d ) Normalizing the two types of residual errors;
the pose tracking thread tracks the preprocessed RGB-D image in real time, and outputs a key frame with high-precision pose estimation information by combining a key frame screening strategy;
step 3: the target detection thread receives a key frame from the pose tracking thread, operates an SSD target detection network to extract semantic information of an image, and updates the motion probability of the map feature points by combining depth consistency test;
the detection result of the SSD destination detection network depends on priori training knowledge, the accuracy of the output result of a complex scene is limited, and objects which do not exist in a training set cannot be identified, so that the update of the motion probability of the feature points is assisted by means of an input depth image, and the method comprises the following specific processes:
step 31: referring to the operation of Detect-SLAM, the motion probability of the three-dimensional map feature point x is updated using the following formula,
P k (x)=(1-α)P k-1 (x)+αZ k (u),
wherein P is k (x) Representing the motion probability of the map point after the semantic information of the kth key frame is updated, P k-1 (x) For the motion probability after the update of the previous key frame, u is the associated pixel characteristic point of the three-dimensional map characteristic point x on the kth key frame, Z k (u) represents the observation of the motion state of map point x on the keyframe; alpha is a weight parameter;
step 32: z is based on semantic output result and depth consistency test k The value of (u) is expressed as:
when the pixel characteristic point u falls in the region omega determined as a dynamic object k,dyn When in the interior, Z k (u) is set to 1; if falling in the static object region omega k,sta Internally, set to 0; if falling in other region omega k,unk Then use its neighborhoodPixel point array in the pixel arrayIs subjected to a geometric consistency check of the depth data of (1), wherein +.>Representing the difference between the depth projected at the pixel point and the depth measurement of its neighborhood pixels, T k Is a six-degree-of-freedom transformation matrix of the key frame k under a global coordinate system, [.] Z Operator takes transformed spatial feature points T k Depth value of x>Representing a domain pixel point u i Depth values of (2); sigma (sigma) u Representing standard deviation of corresponding depth values;
step 4: the dense mapping thread receives key frames from tracking threads at the same time, adopts a multi-view stereoscopic vision reconstruction algorithm, and utilizes RGB image information of adjacent key frames to carry out MVS iterative optimization on depth data in the current key frames, and a cost function of MVS optimization problem constructed in the process consists of three parts:
E mvs (Θ)=ω α E photo (Θ)+ω β E reg (θ)+ω γ E prior (Θ)
wherein,the method comprises the steps of obtaining a parameter vector to be optimized, namely a pixel depth value set; e (E) photo The gray gradient residual error term between multi-view stereo matching pixels is used for evaluating the luminosity consistency in the multi-view image under the current parameter Θ; e (E) reg The method is a regular term expressed by a first derivative of a normal vector of the curved surface and is used for representing the smoothness of the current reconstructed curved surface; e (E) prior The prior constraint term based on the original depth measurement is used for constraining the iteration direction in the nonlinear optimization process, so that serious drift of parameters to be estimated is avoided; { omega α ,ω β ,ω γ -weight coefficients for each residual term;
the MVS nonlinear optimization problem expressed above has strong non-convexity and high parameter space dimension, and is generally difficult to perform online optimization solution;
to solve this problem, a nonlinear optimization solver capable of solving in parallel on a GPU is provided, and the working process includes:
step 41: firstly, uniformly dividing an image area to be reconstructed into a plurality of sub-square areas, and decomposing the pixel depth optimization problem of the whole area into the depth optimization problems of a plurality of sub-areas, thereby reducing the space dimension of parameters to be estimated;
step 42: starting a thread block on the GPU for carrying out Gauss-Newton iterative solution for each sub-square region to be optimized, wherein the steps are as follows:
step 421: each thread in the thread block independently processes a pixel in the sub-block, calculates residual errors and first-order Jacobian vectors, and finally constructs a linear equation with a sparse coefficient matrix; step 422: the linear equation of the last step is solved by utilizing the shared memory of the thread blocks to efficiently run the conjugate gradient method; step 423: repeating step 421 until the maximum iteration number is reached or the suspension condition is satisfied;
step 43: in the actual running process, simultaneously starting a plurality of thread blocks to optimize the pixel depths of a plurality of sub-square areas in parallel, and traversing all the sub-squares for a plurality of times to obtain a final optimization result;
step 5: the local mapping thread receives key frames from the pose tracking thread, maintains and manages the feature point map, and comprises the steps of creating new map points, updating the co-view relation among the key frames and performing local BA optimization to adjust map state parameters;
step 6: the loop detection thread receives the map state parameters which are updated by the local map building thread processing, wherein the map state parameters comprise map points, key frames and common view relations; matching a word bag model of the latest key frame with the historical key frames to check whether a loop is formed, and if the loop is detected, carrying out loop correction and optimization on the associated frames at the two ends of the loop so as to reduce the accumulated deviation in the system as much as possible;
step 7: the SLAM system map-building back end composed of the dense map-building thread and the local map-building thread is combined with the target detection thread to provide environment models with different updating frequencies and different geometric accuracies.
Wherein, the preprocessing in the step 1 comprises the following steps: RGB-depth registration processing, bilateral filtering noise suppression processing.
Wherein, in the step 32, the dynamic object includes a human being and a pet.
Wherein, in the step 32, the static object includes a sofa and a table and a chair.
Wherein in the step 4, the weight coefficient { ω ] of each residual term α ,ω β ,ω γ 1, 0.025 and 0.01, respectively.
In the step 4, the nonlinear optimization solver is a Gauss-Newton solver.
In step 41, the sub-block size is set to be 16×16 in order to improve the GPU data access efficiency.
In step 41, to improve the GPU data access efficiency, the sub-block size is set to 32×32.
In step 43, all the sub-blocks are traversed 10 to 15 times.
In the step 7, the dense map corresponds to low frequency and high precision, and the local map corresponds to high frequency and low precision; the high-precision environment model can better support task requirements of environment understanding and passable area analysis in a complex scene, and the extracted semantic information is combined with the constructed local map model to effectively improve the positioning precision and robustness of the system in a dynamic scene.
(III) beneficial effects
Compared with the prior art, the invention has the following beneficial effects:
the first, the invention provides a semantic SLAM scheme facing dynamic scene, which combines a target detection technology based on deep learning and a classical MVS three-dimensional reconstruction frame, and can output robust pose estimation and simultaneously carry out high-precision online dense reconstruction;
secondly, the invention designs and realizes a nonlinear optimization solver which can be parallel on the GPU, and can efficiently carry out iterative solution on the complex nonlinear optimization problem of the core in the MVS framework, thereby meeting the real-time requirement of on-line dense reconstruction.
The invention is redeveloped under the framework of ORB-SLAM2, mainly increases the target detection thread and dense reconstruction thread, and adds photometric residual error constraint in the tracking thread. In addition, the original local mapping thread and loop detection thread in ORB-SLAM2 are reserved in the invention.
Drawings
Fig. 1 is a block diagram of a semantic SLAM system according to the present invention.
FIGS. 2 a-2 c are schematic diagrams of cost functions of MVS reconstruction algorithms;
FIG. 2a is a schematic diagram of the photometric consistency cost in the MVS reconstruction algorithm;
FIG. 2b is a schematic diagram of a regularized term cost in the MVS reconstruction algorithm;
FIG. 2c is a diagram of a priori geometry constraint costs in the MVS reconstruction algorithm;
wherein R is a reference frame to be rebuilt; n (N) 1 Is a domain key frame 1; n (N) 2 Is domain key frame 2; p is the 3D point to be reconstructed, and is determined by the corresponding pixel depth.
FIG. 3 is a flow diagram of a GPU-based parallel optimization solution strategy.
FIG. 4 is a schematic diagram of depth optimization results on a test dataset.
Detailed Description
For the purposes of clarity, content, and advantages of the present invention, a detailed description of the embodiments of the present invention will be described in detail below with reference to the drawings and examples.
Example 1
The embodiment of the method for autonomous dynamic scene positioning and on-line high-precision three-dimensional reconstruction is characterized in that the method for autonomous dynamic scene positioning and on-line high-precision three-dimensional reconstruction is implemented based on a three-dimensional reconstruction system, and the overall architecture of the three-dimensional reconstruction system is shown in fig. 1, and comprises the following steps: a pose tracking thread, a target detection thread, a local map building thread, a dense map building thread and a loop detection optimizing thread;
the three-dimensional reconstruction method comprises the following steps:
step 1: color and depth image data of a scene are obtained through an RGB-D sensor, and are sent to a pose tracking thread for ORB feature point extraction and matching after preprocessing;
step 2: the pose tracking thread eliminates the associated point pairs possibly belonging to the dynamic object by utilizing the motion probability of the local map feature points, and then combines pixel point columns with higher gradient values extracted from the reference key frame to construct the following joint cost function based on geometric reprojection deviation and pixel luminosity deviation:
wherein, xi ji Representing a six degree-of-freedom rigid transformation T from a reference key frame i to a current frame j ji Lie algebra (Lie-algebra) representation, e r (x) For the reprojection residual error of the three-dimensional map feature point x on the current frame, the sigma is the covariance matrix of the reprojection residual error, e d (u) is the luminance residual of the pixel u' on the reference frame projected onto the current frame, Λ is the covariance matrix of the luminance residual, ρ is the Huber kernel function to suppress outliers, O j And S is i Respectively representing a set of local map feature points and reference frame pixel points after the dynamic region is removed;
in order to balance the influence of two types of residual terms on the optimization process, the respective residual term numbers (n r ,n d ) And standard deviation (sigma) r ,σ d ) Normalizing the two types of residual errors;
the pose tracking thread tracks the preprocessed RGB-D image in real time, and outputs a key frame with high-precision pose estimation information by combining a key frame screening strategy;
compared with the original tracking mode which only depends on feature point matching of ORB-SLAM2, the method adds semi-dense pixel luminosity constraint to enhance the tracking robustness of the system in a weak texture area;
step 3: the target detection thread receives a key frame from the pose tracking thread, operates an SSD target detection network to extract semantic information of an image, and updates the motion probability of the map feature points by combining depth consistency test;
the detection result of the SSD destination detection network depends on priori training knowledge, the accuracy of the output result of a complex scene is limited, and objects which do not exist in a training set cannot be identified, so that the update of the motion probability of the feature points is assisted by means of an input depth image, and the method comprises the following specific processes:
step 31: referring to the operation of Detect-SLAM, the motion probability of the three-dimensional map feature point x is updated using the following formula,
P k (x)=(1-α)P k-1 (x)+αZ k (u),
wherein P is k (x) Representing the motion probability of the map point after the semantic information of the kth key frame is updated, P k-1 (x) For the motion probability after the update of the previous key frame, u is the associated pixel characteristic point of the three-dimensional map characteristic point x on the kth key frame, Z k (u) represents the observation of the motion state of map point x on the keyframe; alpha is a weight parameter;
step 32: z is based on semantic output result and depth consistency test k The value of (u) is expressed as:
when the pixel characteristic point u falls in the region omega determined as a dynamic object k,dyn When in the interior, Z k (u) is set to 1; if falling in the static object region omega k,sta Internally, set to 0; if falling in other region omega k,unk Then use its neighborhoodPixel point array in the pixel arrayIs subjected to a geometric consistency check of the depth data of (1), wherein +.>Representing the difference between the depth projected at the pixel point and the depth measurement of its neighborhood pixels, T k Is a six-degree-of-freedom transformation matrix of the key frame k under a global coordinate system, [.] Z Operator takes transformed spatial feature points T k Depth value of x>Representing a domain pixel point u i Depth values of (2); sigma (sigma) u Representing standard deviation of corresponding depth values;
in this way, the false correlation caused by shielding can be determined to be a dynamic characteristic to a certain extent and filtered in the follow-up tracking process;
step 4: the dense mapping thread receives key frames from tracking threads at the same time, adopts a multi-view stereoscopic vision (MVS) reconstruction algorithm, and utilizes RGB image information of adjacent key frames to carry out MVS iterative optimization on depth data in the current key frames, wherein a cost function of MVS optimization problems constructed in the process consists of three parts:
E mvs (Θ)=ω α E photo (Θ)+ω β E reg (Θ)+ω γ E prior (Θ)
wherein,the method comprises the steps of obtaining a parameter vector to be optimized, namely a pixel depth value set; e (E) photo Gray gradient residual terms (shown in fig. 2 a) between pixels of the multi-view stereo matching are used for evaluating luminosity consistency in the multi-view image under the current parameter Θ; e (E) reg A regular term (as shown in fig. 2 b) represented by the first derivative of the normal vector of the surface, which is used to characterize the smoothness of the current reconstructed surface; e (E) prior For an a priori constraint term (as shown in fig. 2 c) based on the original depth measurement, the term is used for constraining the iteration direction in the nonlinear optimization process, so that serious drift of parameters to be estimated is avoided; { omega α ,ω β ,ω γ -weight coefficients for each residual term;
the MVS nonlinear optimization problem expressed above has strong non-convexity and high parameter space dimension, and is generally difficult to perform online optimization solution;
to solve this problem, a nonlinear optimization solver capable of efficiently solving in parallel on a GPU is provided, and as shown in fig. 3, the working process includes:
step 41: firstly, uniformly dividing an image area to be reconstructed (which can be determined by semantic information assistance) into a plurality of sub-square areas, thereby decomposing the pixel depth optimization problem of the whole area into the depth optimization problems of a plurality of sub-areas, and further reducing the space dimension of parameters to be estimated;
step 42: for each sub-Block area to be optimized, starting a Thread Block (Thread Block) on the GPU to perform Gauss-Newton iterative solution, wherein the method comprises the following steps:
step 421: each thread in the thread block independently processes a pixel in the sub-block, calculates residual errors and first-order Jacobian vectors, and finally constructs a linear equation with a sparse coefficient matrix; step 422: the linear equation of the last step is solved by utilizing the shared memory of the thread blocks to efficiently run the conjugate gradient method; step 423: repeating step 421 until the maximum iteration number is reached or the suspension condition is satisfied;
step 43: in the actual running process, simultaneously starting a plurality of thread blocks to optimize the pixel depths of a plurality of sub-square areas in parallel, and traversing all the sub-squares for a plurality of times to obtain a final optimization result;
FIG. 4 shows the test results obtained after running the parallel optimization algorithm described above on a certain RGB-D dataset, the test machine being a desktop (Intel Core i5-6600K CPU,16GB RAM,Nvidia GeForce GTX 1080GPU) and the programming language being CUDA C++ for a CUDA 9.1 development kit. It can be obviously observed that compared with the original depth image and the 3D model fused based on the original depth data, the optimized depth image and the 3D model better reflect some detail textures of the actual object. The thread then uses the keyframe pose to stitch the optimized point cloud (octree maps may also be constructed for more efficient storage). Because the image area to be reconstructed is divided by semantic information, static (or low dynamic) objects to be reconstructed in the scene can be well separated in the reconstruction process, so that the computing resources are saved and the occurrence of ghost phenomenon is avoided; on the other hand, semantic information (i.e. object labels) can be naturally mapped from the image into a dense point cloud or octree map, thereby serving higher-order application scenes.
Step 5: the local mapping thread receives key frames from the pose tracking thread, maintains and manages the feature point map, and comprises the steps of creating new map points, updating the co-view relation among the key frames and performing local BA optimization to adjust map state parameters;
step 6: the loop detection thread receives the map state parameters which are updated by the local map building thread processing, wherein the map state parameters comprise map points, key frames and common view relations; matching a word bag model of the latest key frame with the historical key frames to check whether a loop is formed, and if the loop is detected, carrying out loop correction and optimization on the associated frames at the two ends of the loop so as to reduce the accumulated deviation in the system as much as possible;
step 7: the SLAM system map-building back end composed of the dense map-building thread and the local map-building thread is combined with the target detection thread to provide environment models with different updating frequencies and different geometric accuracies. The dense map is corresponding to low frequency and high precision, and the local map is corresponding to high frequency and low precision; the high-precision environment model can better support task requirements of environment understanding and passable area analysis in a complex scene, and the extracted semantic information is combined with the constructed local map model to effectively improve the positioning precision and robustness of the system in a dynamic scene.
2. The method for autonomous positioning and on-line high-precision three-dimensional reconstruction of a dynamic scene according to claim 1, wherein the preprocessing in step 1 comprises: RGB-depth registration processing, bilateral filtering noise suppression processing.
Wherein, in the step 32, the dynamic object includes a human being and a pet.
Wherein, in the step 32, the static object includes a sofa and a table and a chair.
Wherein in the step 4, the weight coefficient { ω ] of each residual term α ,ω β ,ω γ 1, 0.025 and 0.01, respectively.
In the step 4, the nonlinear optimization solver is a Gauss-Newton solver.
In step 41, to increase the GPU data access efficiency, the sub-block size is set to 16×16 or 32×32.
In step 43, all the sub-blocks are traversed 10 to 15 times.
Example 2
In the embodiment, under the open source framework of ORB-SLAM2, a robust tracking module capable of processing a high dynamic scene is firstly constructed based on the research ideas of detection-SLAM and RDS-SLAM, and mainly comprises a target detection thread for deep learning, which is used for outputting key frame semantic information and updating map feature point motion probability, and a traditional pose tracking thread based on geometric reprojection deviation and direct luminosity deviation, which is used for solving pose information. On the basis, the invention additionally creates a dense mapping thread, a set of multi-view stereoscopic vision (MVS) reconstruction algorithm based on RGB image information, surface curvature constraint and prior constraint of adjacent key frames is operated on the thread to combine semantic information to improve the depth data quality of the current frame, and a nonlinear optimization solver of the algorithm core, namely a Gauss-Newton solver, is parallelly deployed on the GPU in an efficient mode to meet the real-time requirement. The technical scheme is a five-thread architecture (comprising an original local mapping thread and a loop detection thread in an ORB-SLAM2 framework), and the implementation steps are as follows:
1) Scene data are collected, and the registered color map and depth map are sent to a pose tracking thread to be subjected to ORB feature point extraction matching;
2) The pose tracking thread filters out the characteristic point pair association belonging to the dynamic object based on the motion probability of the local map characteristic points, and constructs a joint pose optimization cost function by combining the semi-dense pixel points with higher gray gradient values extracted from the current reference frame, so as to solve the pose of the current frame;
3) The pose tracking thread selects a key frame and transmits the key frame to the target detection thread, the local mapping thread and the dense mapping thread respectively;
4) The target detection thread extracts semantic information of the key frames and updates the motion probability of the matched feature points by combining a depth consistency checking method;
5) The local mapping process is responsible for maintaining a rear-end sparse feature point map and mainly comprises management of key frames and map points and local BA optimization;
6) The dense mapping thread performs MVS iterative optimization on the pixel depth value of the static object on the current key frame by utilizing the color images of the adjacent key frames in the key frame queue and combining semantic information provided by the target detection thread, and performs point cloud splicing according to the pose estimation of the key frame;
7) And the loop detection thread mainly carries out loop detection on the current key frame, and if loop detection is carried out, loop correction optimization is further carried out.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.