CN117115343A - Dynamic scene autonomous positioning and online high-precision 3D reconstruction method - Google Patents

Dynamic scene autonomous positioning and online high-precision 3D reconstruction method Download PDF

Info

Publication number
CN117115343A
CN117115343A CN202311023428.3A CN202311023428A CN117115343A CN 117115343 A CN117115343 A CN 117115343A CN 202311023428 A CN202311023428 A CN 202311023428A CN 117115343 A CN117115343 A CN 117115343A
Authority
CN
China
Prior art keywords
thread
map
precision
depth
pixel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311023428.3A
Other languages
Chinese (zh)
Inventor
罗航
赵熙俊
光星星
刘萌
崔星
李骁逸
刘雪妍
陈佳琪
卢彩霞
杨雨
王旭
于华超
梁震烁
王一全
李兆冬
程文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongbing Intelligent Innovation Research Institute Co ltd
Original Assignee
Zhongbing Intelligent Innovation Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongbing Intelligent Innovation Research Institute Co ltd filed Critical Zhongbing Intelligent Innovation Research Institute Co ltd
Priority to CN202311023428.3A priority Critical patent/CN117115343A/en
Publication of CN117115343A publication Critical patent/CN117115343A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three-dimensional [3D] modelling for computer graphics
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2200/00Indexing scheme for image data processing or generation, in general
    • G06T2200/28Indexing scheme for image data processing or generation, in general involving image processing hardware
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Geometry (AREA)
  • Computer Graphics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

本发明属于机器人同步定位与建图技术领域,具体涉及一种动态场景自主定位与线上高精度三维重建方法,所述动态场景自主定位与线上高精度三维重建方法的实施依据位姿跟踪线程、目标检测线程、局部建图线程、稠密建图线程,以及回环检测优化线程;该方案结合了基于深度学习的目标检测技术和经典的MVS三维重建框架,能在输出鲁棒的位姿估计的同时进行高精度的线上稠密重建;本发明提供一种可在GPU上并行的非线性优化求解器,可高效地对MVS框架中核心的复杂非线性优化问题进行迭代求解,从而满足线上稠密重建的实时性要求。本发明在ORB‑SLAM2的框架下进行再开发而来,增加了目标检测线程和稠密重建线程,并在其跟踪线程中添加了光度残差约束。

The invention belongs to the technical field of robot synchronous positioning and mapping, and specifically relates to a dynamic scene autonomous positioning and online high-precision three-dimensional reconstruction method. The implementation of the dynamic scene autonomous positioning and online high-precision three-dimensional reconstruction method is based on the pose tracking thread. , target detection thread, local mapping thread, dense mapping thread, and loop detection optimization thread; this solution combines deep learning-based target detection technology and the classic MVS three-dimensional reconstruction framework, and can output robust pose estimation. High-precision online dense reconstruction is performed at the same time; the present invention provides a nonlinear optimization solver that can be parallelized on the GPU, which can efficiently iteratively solve the core complex nonlinear optimization problem in the MVS framework, thereby satisfying the online dense Real-time requirements for reconstruction. This invention is redeveloped under the framework of ORB-SLAM2, adds a target detection thread and a dense reconstruction thread, and adds photometric residual constraints to its tracking thread.

Description

Dynamic scene autonomous positioning and on-line high-precision three-dimensional reconstruction method
Technical Field
The invention belongs to the technical field of synchronous positioning and mapping (Simultaneous Localization and Mapping, SLAM) of robots, and particularly relates to an autonomous positioning and on-line high-precision three-dimensional reconstruction method of a dynamic scene.
Background
With the advent of the artificial intelligence era, intelligent unmanned systems such as unmanned aerial vehicles, logistics robots and automatic driving have been widely focused and rapidly developed, and high-quality pose and environment information are the basis for effective environment perception and decision control of intelligent carriers in complex application scenes, so that SLAM technology is one of key technologies for realizing autonomy of intelligent unmanned systems. Conventional SLAM schemes require correlation of current observation information with established environmental characteristic information to solve for carrier current position and attitude data, and thus tend to assume static (or low dynamic) operating scenarios. With the increasing diversification and complexity of related applications, static scene assumptions often cannot be satisfied, such as people and traffic on city streets, and these dynamic features may introduce unreliable data association information, reducing the accuracy and robustness of the SLAM system.
In recent years, SLAM problems in dynamic environments have been widely studied, and solutions can be roughly divided into two categories: semantic extraction schemes based on traditional multi-view geometry or other outlier detection methods (e.g., RANSAC), and on deep learning. Compared to traditional methods, semantic-based schemes can output more robust and accurate segmentation detection results, but they typically rely on a priori training knowledge and tend to impact the time-running efficiency of SLAM systems. In addition, most SLAM solutions pay more attention to positioning accuracy, and the back end generally only maintains a sparse environment feature point map or a local semi-dense point cloud map, so that the method is difficult to directly apply to higher-level tasks, such as environment understanding, autonomous path planning and the like.
On the other hand, a real-time dense three-dimensional reconstruction technology in a dynamic environment has been developed in recent years, and the technical field has strong relevance to SLAM, but focuses on three-dimensional reconstruction quality of an object or a scene, such as precision, spatial resolution, model integrity, and the like. Current RGB-D camera-based real-time three-dimensional reconstruction techniques are mostly re-developed on the classical KinectFusion framework, whose model reconstruction quality is largely limited by the original depth data (noisy and systematic bias), whereas voxel fusion strategies often used to suppress noise tend to produce a low-score result (i.e. texture details are not restored). Classical Multi-View stereovision (MVS) reconstruction algorithms can optimize the original depth data using RGB image information in a neighborhood multiframe, thereby improving the final model quality, but due to the high time complexity of this optimization problem, conventional MVS algorithms are mostly applied to offline three-dimensional reconstruction tasks.
The current academy has limited research on SLAM schemes which simultaneously compromise front end positioning accuracy and back end on-line dense three-dimensional reconstruction quality. The recently proposed schemes such as DS-SLAM and Detect-SLAM maintain a dense semantic map at the back end, but the generation of the map still depends on the original depth data or the result after voxel fusion, and the geometric accuracy is still relatively limited, so that the scheme is not suitable for some applications requiring a higher reconstruction quality model.
Disclosure of Invention
First, the technical problem to be solved
The invention aims to solve the technical problems that: how to propose a dynamic scene-oriented RGB-D SLAM technical scheme, continuous and steady positioning of the front end and high-precision on-line three-dimensional dense reconstruction of the rear end can be realized to a certain extent at the same time, so that application requirements of high positioning precision and environment modeling quality are met, such as environment models with high fidelity are needed in Augmented Reality (AR) games to obtain better immersive experience, unmanned platform autonomous navigation in urban environments and the like.
(II) technical scheme
In order to solve the technical problems, the invention provides a dynamic scene autonomous positioning and on-line high-precision three-dimensional reconstruction method, which is implemented based on a three-dimensional reconstruction system, and comprises the following steps: a pose tracking thread, a target detection thread, a local map building thread, a dense map building thread and a loop detection optimizing thread;
the three-dimensional reconstruction method comprises the following steps:
step 1: color and depth image data of a scene are obtained through an RGB-D sensor, and are sent to a pose tracking thread for ORB feature point extraction and matching after preprocessing;
step 2: the pose tracking thread eliminates the associated point pairs possibly belonging to the dynamic object by utilizing the motion probability of the local map feature points, and then combines pixel point columns with higher gradient values extracted from the reference key frame to construct the following joint cost function based on geometric reprojection deviation and pixel luminosity deviation:
wherein, xi ji Representing a six degree-of-freedom rigid transformation T from a reference key frame i to a current frame j ji Lie algebraic representation of e r (x) For the reprojection residual error of the three-dimensional map feature point x on the current frame, the sigma is the covariance matrix of the reprojection residual error, e d (u) is the luminance residual of the pixel u' on the reference frame projected onto the current frame, Λ is the covariance matrix of the luminance residual, ρ is the Huber kernel function to suppress outliers, O j And S is i Respectively representing a set of local map feature points and reference frame pixel points after the dynamic region is removed;
in order to balance the influence of two types of residual terms on the optimization process, the respective residual term numbers (n r ,n d ) And standard deviation (sigma) rd ) Normalizing the two types of residual errors;
the pose tracking thread tracks the preprocessed RGB-D image in real time, and outputs a key frame with high-precision pose estimation information by combining a key frame screening strategy;
step 3: the target detection thread receives a key frame from the pose tracking thread, operates an SSD target detection network to extract semantic information of an image, and updates the motion probability of the map feature points by combining depth consistency test;
the detection result of the SSD destination detection network depends on priori training knowledge, the accuracy of the output result of a complex scene is limited, and objects which do not exist in a training set cannot be identified, so that the update of the motion probability of the feature points is assisted by means of an input depth image, and the method comprises the following specific processes:
step 31: referring to the operation of Detect-SLAM, the motion probability of the three-dimensional map feature point x is updated using the following formula,
P k (x)=(1-α)P k-1 (x)+αZ k (u),
wherein P is k (x) Representing the motion probability of the map point after the semantic information of the kth key frame is updated, P k-1 (x) For the motion probability after the update of the previous key frame, u is the associated pixel characteristic point of the three-dimensional map characteristic point x on the kth key frame, Z k (u) represents the observation of the motion state of map point x on the keyframe; alpha is a weight parameter;
step 32: z is based on semantic output result and depth consistency test k The value of (u) is expressed as:
when the pixel characteristic point u falls in the region omega determined as a dynamic object k,dyn When in the interior, Z k (u) is set to 1; if falling in the static object region omega k,sta Internally, set to 0; if falling in other region omega k,unk Then use its neighborhoodPixel point array in the pixel arrayIs subjected to a geometric consistency check of the depth data of (1), wherein +.>Representing the difference between the depth projected at the pixel point and the depth measurement of its neighborhood pixels, T k Is a six-degree-of-freedom transformation matrix of the key frame k under a global coordinate system, [.] Z Operator takes transformed spatial feature points T k Depth value of x>Representing a domain pixel point u i Depth values of (2); sigma (sigma) u Representing standard deviation of corresponding depth values;
step 4: the dense mapping thread receives key frames from tracking threads at the same time, adopts a multi-view stereoscopic vision reconstruction algorithm, and utilizes RGB image information of adjacent key frames to carry out MVS iterative optimization on depth data in the current key frames, and a cost function of MVS optimization problem constructed in the process consists of three parts:
E mvs (Θ)=ω α E photo (Θ)+ω β E reg (θ)+ω γ E prior (Θ)
wherein,the method comprises the steps of obtaining a parameter vector to be optimized, namely a pixel depth value set; e (E) photo The gray gradient residual error term between multi-view stereo matching pixels is used for evaluating the luminosity consistency in the multi-view image under the current parameter Θ; e (E) reg The method is a regular term expressed by a first derivative of a normal vector of the curved surface and is used for representing the smoothness of the current reconstructed curved surface; e (E) prior The prior constraint term based on the original depth measurement is used for constraining the iteration direction in the nonlinear optimization process, so that serious drift of parameters to be estimated is avoided; { omega αβγ -weight coefficients for each residual term;
the MVS nonlinear optimization problem expressed above has strong non-convexity and high parameter space dimension, and is generally difficult to perform online optimization solution;
to solve this problem, a nonlinear optimization solver capable of solving in parallel on a GPU is provided, and the working process includes:
step 41: firstly, uniformly dividing an image area to be reconstructed into a plurality of sub-square areas, and decomposing the pixel depth optimization problem of the whole area into the depth optimization problems of a plurality of sub-areas, thereby reducing the space dimension of parameters to be estimated;
step 42: starting a thread block on the GPU for carrying out Gauss-Newton iterative solution for each sub-square region to be optimized, wherein the steps are as follows:
step 421: each thread in the thread block independently processes a pixel in the sub-block, calculates residual errors and first-order Jacobian vectors, and finally constructs a linear equation with a sparse coefficient matrix; step 422: the linear equation of the last step is solved by utilizing the shared memory of the thread blocks to efficiently run the conjugate gradient method; step 423: repeating step 421 until the maximum iteration number is reached or the suspension condition is satisfied;
step 43: in the actual running process, simultaneously starting a plurality of thread blocks to optimize the pixel depths of a plurality of sub-square areas in parallel, and traversing all the sub-squares for a plurality of times to obtain a final optimization result;
step 5: the local mapping thread receives key frames from the pose tracking thread, maintains and manages the feature point map, and comprises the steps of creating new map points, updating the co-view relation among the key frames and performing local BA optimization to adjust map state parameters;
step 6: the loop detection thread receives the map state parameters which are updated by the local map building thread processing, wherein the map state parameters comprise map points, key frames and common view relations; matching a word bag model of the latest key frame with the historical key frames to check whether a loop is formed, and if the loop is detected, carrying out loop correction and optimization on the associated frames at the two ends of the loop so as to reduce the accumulated deviation in the system as much as possible;
step 7: the SLAM system map-building back end composed of the dense map-building thread and the local map-building thread is combined with the target detection thread to provide environment models with different updating frequencies and different geometric accuracies.
Wherein, the preprocessing in the step 1 comprises the following steps: RGB-depth registration processing, bilateral filtering noise suppression processing.
Wherein, in the step 32, the dynamic object includes a human being and a pet.
Wherein, in the step 32, the static object includes a sofa and a table and a chair.
Wherein in the step 4, the weight coefficient { ω ] of each residual term αβγ 1, 0.025 and 0.01, respectively.
In the step 4, the nonlinear optimization solver is a Gauss-Newton solver.
In step 41, the sub-block size is set to be 16×16 in order to improve the GPU data access efficiency.
In step 41, to improve the GPU data access efficiency, the sub-block size is set to 32×32.
In step 43, all the sub-blocks are traversed 10 to 15 times.
In the step 7, the dense map corresponds to low frequency and high precision, and the local map corresponds to high frequency and low precision; the high-precision environment model can better support task requirements of environment understanding and passable area analysis in a complex scene, and the extracted semantic information is combined with the constructed local map model to effectively improve the positioning precision and robustness of the system in a dynamic scene.
(III) beneficial effects
Compared with the prior art, the invention has the following beneficial effects:
the first, the invention provides a semantic SLAM scheme facing dynamic scene, which combines a target detection technology based on deep learning and a classical MVS three-dimensional reconstruction frame, and can output robust pose estimation and simultaneously carry out high-precision online dense reconstruction;
secondly, the invention designs and realizes a nonlinear optimization solver which can be parallel on the GPU, and can efficiently carry out iterative solution on the complex nonlinear optimization problem of the core in the MVS framework, thereby meeting the real-time requirement of on-line dense reconstruction.
The invention is redeveloped under the framework of ORB-SLAM2, mainly increases the target detection thread and dense reconstruction thread, and adds photometric residual error constraint in the tracking thread. In addition, the original local mapping thread and loop detection thread in ORB-SLAM2 are reserved in the invention.
Drawings
Fig. 1 is a block diagram of a semantic SLAM system according to the present invention.
FIGS. 2 a-2 c are schematic diagrams of cost functions of MVS reconstruction algorithms;
FIG. 2a is a schematic diagram of the photometric consistency cost in the MVS reconstruction algorithm;
FIG. 2b is a schematic diagram of a regularized term cost in the MVS reconstruction algorithm;
FIG. 2c is a diagram of a priori geometry constraint costs in the MVS reconstruction algorithm;
wherein R is a reference frame to be rebuilt; n (N) 1 Is a domain key frame 1; n (N) 2 Is domain key frame 2; p is the 3D point to be reconstructed, and is determined by the corresponding pixel depth.
FIG. 3 is a flow diagram of a GPU-based parallel optimization solution strategy.
FIG. 4 is a schematic diagram of depth optimization results on a test dataset.
Detailed Description
For the purposes of clarity, content, and advantages of the present invention, a detailed description of the embodiments of the present invention will be described in detail below with reference to the drawings and examples.
Example 1
The embodiment of the method for autonomous dynamic scene positioning and on-line high-precision three-dimensional reconstruction is characterized in that the method for autonomous dynamic scene positioning and on-line high-precision three-dimensional reconstruction is implemented based on a three-dimensional reconstruction system, and the overall architecture of the three-dimensional reconstruction system is shown in fig. 1, and comprises the following steps: a pose tracking thread, a target detection thread, a local map building thread, a dense map building thread and a loop detection optimizing thread;
the three-dimensional reconstruction method comprises the following steps:
step 1: color and depth image data of a scene are obtained through an RGB-D sensor, and are sent to a pose tracking thread for ORB feature point extraction and matching after preprocessing;
step 2: the pose tracking thread eliminates the associated point pairs possibly belonging to the dynamic object by utilizing the motion probability of the local map feature points, and then combines pixel point columns with higher gradient values extracted from the reference key frame to construct the following joint cost function based on geometric reprojection deviation and pixel luminosity deviation:
wherein, xi ji Representing a six degree-of-freedom rigid transformation T from a reference key frame i to a current frame j ji Lie algebra (Lie-algebra) representation, e r (x) For the reprojection residual error of the three-dimensional map feature point x on the current frame, the sigma is the covariance matrix of the reprojection residual error, e d (u) is the luminance residual of the pixel u' on the reference frame projected onto the current frame, Λ is the covariance matrix of the luminance residual, ρ is the Huber kernel function to suppress outliers, O j And S is i Respectively representing a set of local map feature points and reference frame pixel points after the dynamic region is removed;
in order to balance the influence of two types of residual terms on the optimization process, the respective residual term numbers (n r ,n d ) And standard deviation (sigma) rd ) Normalizing the two types of residual errors;
the pose tracking thread tracks the preprocessed RGB-D image in real time, and outputs a key frame with high-precision pose estimation information by combining a key frame screening strategy;
compared with the original tracking mode which only depends on feature point matching of ORB-SLAM2, the method adds semi-dense pixel luminosity constraint to enhance the tracking robustness of the system in a weak texture area;
step 3: the target detection thread receives a key frame from the pose tracking thread, operates an SSD target detection network to extract semantic information of an image, and updates the motion probability of the map feature points by combining depth consistency test;
the detection result of the SSD destination detection network depends on priori training knowledge, the accuracy of the output result of a complex scene is limited, and objects which do not exist in a training set cannot be identified, so that the update of the motion probability of the feature points is assisted by means of an input depth image, and the method comprises the following specific processes:
step 31: referring to the operation of Detect-SLAM, the motion probability of the three-dimensional map feature point x is updated using the following formula,
P k (x)=(1-α)P k-1 (x)+αZ k (u),
wherein P is k (x) Representing the motion probability of the map point after the semantic information of the kth key frame is updated, P k-1 (x) For the motion probability after the update of the previous key frame, u is the associated pixel characteristic point of the three-dimensional map characteristic point x on the kth key frame, Z k (u) represents the observation of the motion state of map point x on the keyframe; alpha is a weight parameter;
step 32: z is based on semantic output result and depth consistency test k The value of (u) is expressed as:
when the pixel characteristic point u falls in the region omega determined as a dynamic object k,dyn When in the interior, Z k (u) is set to 1; if falling in the static object region omega k,sta Internally, set to 0; if falling in other region omega k,unk Then use its neighborhoodPixel point array in the pixel arrayIs subjected to a geometric consistency check of the depth data of (1), wherein +.>Representing the difference between the depth projected at the pixel point and the depth measurement of its neighborhood pixels, T k Is a six-degree-of-freedom transformation matrix of the key frame k under a global coordinate system, [.] Z Operator takes transformed spatial feature points T k Depth value of x>Representing a domain pixel point u i Depth values of (2); sigma (sigma) u Representing standard deviation of corresponding depth values;
in this way, the false correlation caused by shielding can be determined to be a dynamic characteristic to a certain extent and filtered in the follow-up tracking process;
step 4: the dense mapping thread receives key frames from tracking threads at the same time, adopts a multi-view stereoscopic vision (MVS) reconstruction algorithm, and utilizes RGB image information of adjacent key frames to carry out MVS iterative optimization on depth data in the current key frames, wherein a cost function of MVS optimization problems constructed in the process consists of three parts:
E mvs (Θ)=ω α E photo (Θ)+ω β E reg (Θ)+ω γ E prior (Θ)
wherein,the method comprises the steps of obtaining a parameter vector to be optimized, namely a pixel depth value set; e (E) photo Gray gradient residual terms (shown in fig. 2 a) between pixels of the multi-view stereo matching are used for evaluating luminosity consistency in the multi-view image under the current parameter Θ; e (E) reg A regular term (as shown in fig. 2 b) represented by the first derivative of the normal vector of the surface, which is used to characterize the smoothness of the current reconstructed surface; e (E) prior For an a priori constraint term (as shown in fig. 2 c) based on the original depth measurement, the term is used for constraining the iteration direction in the nonlinear optimization process, so that serious drift of parameters to be estimated is avoided; { omega αβγ -weight coefficients for each residual term;
the MVS nonlinear optimization problem expressed above has strong non-convexity and high parameter space dimension, and is generally difficult to perform online optimization solution;
to solve this problem, a nonlinear optimization solver capable of efficiently solving in parallel on a GPU is provided, and as shown in fig. 3, the working process includes:
step 41: firstly, uniformly dividing an image area to be reconstructed (which can be determined by semantic information assistance) into a plurality of sub-square areas, thereby decomposing the pixel depth optimization problem of the whole area into the depth optimization problems of a plurality of sub-areas, and further reducing the space dimension of parameters to be estimated;
step 42: for each sub-Block area to be optimized, starting a Thread Block (Thread Block) on the GPU to perform Gauss-Newton iterative solution, wherein the method comprises the following steps:
step 421: each thread in the thread block independently processes a pixel in the sub-block, calculates residual errors and first-order Jacobian vectors, and finally constructs a linear equation with a sparse coefficient matrix; step 422: the linear equation of the last step is solved by utilizing the shared memory of the thread blocks to efficiently run the conjugate gradient method; step 423: repeating step 421 until the maximum iteration number is reached or the suspension condition is satisfied;
step 43: in the actual running process, simultaneously starting a plurality of thread blocks to optimize the pixel depths of a plurality of sub-square areas in parallel, and traversing all the sub-squares for a plurality of times to obtain a final optimization result;
FIG. 4 shows the test results obtained after running the parallel optimization algorithm described above on a certain RGB-D dataset, the test machine being a desktop (Intel Core i5-6600K CPU,16GB RAM,Nvidia GeForce GTX 1080GPU) and the programming language being CUDA C++ for a CUDA 9.1 development kit. It can be obviously observed that compared with the original depth image and the 3D model fused based on the original depth data, the optimized depth image and the 3D model better reflect some detail textures of the actual object. The thread then uses the keyframe pose to stitch the optimized point cloud (octree maps may also be constructed for more efficient storage). Because the image area to be reconstructed is divided by semantic information, static (or low dynamic) objects to be reconstructed in the scene can be well separated in the reconstruction process, so that the computing resources are saved and the occurrence of ghost phenomenon is avoided; on the other hand, semantic information (i.e. object labels) can be naturally mapped from the image into a dense point cloud or octree map, thereby serving higher-order application scenes.
Step 5: the local mapping thread receives key frames from the pose tracking thread, maintains and manages the feature point map, and comprises the steps of creating new map points, updating the co-view relation among the key frames and performing local BA optimization to adjust map state parameters;
step 6: the loop detection thread receives the map state parameters which are updated by the local map building thread processing, wherein the map state parameters comprise map points, key frames and common view relations; matching a word bag model of the latest key frame with the historical key frames to check whether a loop is formed, and if the loop is detected, carrying out loop correction and optimization on the associated frames at the two ends of the loop so as to reduce the accumulated deviation in the system as much as possible;
step 7: the SLAM system map-building back end composed of the dense map-building thread and the local map-building thread is combined with the target detection thread to provide environment models with different updating frequencies and different geometric accuracies. The dense map is corresponding to low frequency and high precision, and the local map is corresponding to high frequency and low precision; the high-precision environment model can better support task requirements of environment understanding and passable area analysis in a complex scene, and the extracted semantic information is combined with the constructed local map model to effectively improve the positioning precision and robustness of the system in a dynamic scene.
2. The method for autonomous positioning and on-line high-precision three-dimensional reconstruction of a dynamic scene according to claim 1, wherein the preprocessing in step 1 comprises: RGB-depth registration processing, bilateral filtering noise suppression processing.
Wherein, in the step 32, the dynamic object includes a human being and a pet.
Wherein, in the step 32, the static object includes a sofa and a table and a chair.
Wherein in the step 4, the weight coefficient { ω ] of each residual term αβγ 1, 0.025 and 0.01, respectively.
In the step 4, the nonlinear optimization solver is a Gauss-Newton solver.
In step 41, to increase the GPU data access efficiency, the sub-block size is set to 16×16 or 32×32.
In step 43, all the sub-blocks are traversed 10 to 15 times.
Example 2
In the embodiment, under the open source framework of ORB-SLAM2, a robust tracking module capable of processing a high dynamic scene is firstly constructed based on the research ideas of detection-SLAM and RDS-SLAM, and mainly comprises a target detection thread for deep learning, which is used for outputting key frame semantic information and updating map feature point motion probability, and a traditional pose tracking thread based on geometric reprojection deviation and direct luminosity deviation, which is used for solving pose information. On the basis, the invention additionally creates a dense mapping thread, a set of multi-view stereoscopic vision (MVS) reconstruction algorithm based on RGB image information, surface curvature constraint and prior constraint of adjacent key frames is operated on the thread to combine semantic information to improve the depth data quality of the current frame, and a nonlinear optimization solver of the algorithm core, namely a Gauss-Newton solver, is parallelly deployed on the GPU in an efficient mode to meet the real-time requirement. The technical scheme is a five-thread architecture (comprising an original local mapping thread and a loop detection thread in an ORB-SLAM2 framework), and the implementation steps are as follows:
1) Scene data are collected, and the registered color map and depth map are sent to a pose tracking thread to be subjected to ORB feature point extraction matching;
2) The pose tracking thread filters out the characteristic point pair association belonging to the dynamic object based on the motion probability of the local map characteristic points, and constructs a joint pose optimization cost function by combining the semi-dense pixel points with higher gray gradient values extracted from the current reference frame, so as to solve the pose of the current frame;
3) The pose tracking thread selects a key frame and transmits the key frame to the target detection thread, the local mapping thread and the dense mapping thread respectively;
4) The target detection thread extracts semantic information of the key frames and updates the motion probability of the matched feature points by combining a depth consistency checking method;
5) The local mapping process is responsible for maintaining a rear-end sparse feature point map and mainly comprises management of key frames and map points and local BA optimization;
6) The dense mapping thread performs MVS iterative optimization on the pixel depth value of the static object on the current key frame by utilizing the color images of the adjacent key frames in the key frame queue and combining semantic information provided by the target detection thread, and performs point cloud splicing according to the pose estimation of the key frame;
7) And the loop detection thread mainly carries out loop detection on the current key frame, and if loop detection is carried out, loop correction optimization is further carried out.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.

Claims (10)

1. The dynamic scene autonomous positioning and on-line high-precision three-dimensional reconstruction method is characterized by being implemented based on a three-dimensional reconstruction system, and the three-dimensional reconstruction system comprises: a pose tracking thread, a target detection thread, a local map building thread, a dense map building thread and a loop detection optimizing thread;
the three-dimensional reconstruction method comprises the following steps:
step 1: color and depth image data of a scene are obtained through an RGB-D sensor, and are sent to a pose tracking thread for ORB feature point extraction and matching after preprocessing;
step 2: the pose tracking thread eliminates the associated point pairs possibly belonging to the dynamic object by utilizing the motion probability of the local map feature points, and then combines pixel point columns with higher gradient values extracted from the reference key frame to construct the following joint cost function based on geometric reprojection deviation and pixel luminosity deviation:
wherein, xi ji Representing a six degree-of-freedom rigid transformation T from a reference key frame i to a current frame j ji Lie algebraic representation of e r (x) For the reprojection residual error of the three-dimensional map feature point x on the current frame, the sigma is the covariance matrix of the reprojection residual error, e d (u) is the luminance residual of the pixel u' on the reference frame projected onto the current frame, Λ is the covariance matrix of the luminance residual, ρ is the Huber kernel function to suppress outliers, O j And S is i Respectively representing a set of local map feature points and reference frame pixel points after the dynamic region is removed;
in order to balance the influence of two types of residual terms on the optimization process, the respective residual term numbers (n r ,n d ) And standard deviation (sigma) rd ) Normalizing the two types of residual errors;
the pose tracking thread tracks the preprocessed RGB-D image in real time, and outputs a key frame with high-precision pose estimation information by combining a key frame screening strategy;
step 3: the target detection thread receives a key frame from the pose tracking thread, operates an SSD target detection network to extract semantic information of an image, and updates the motion probability of the map feature points by combining depth consistency test;
the detection result of the SSD destination detection network depends on priori training knowledge, the accuracy of the output result of a complex scene is limited, and objects which do not exist in a training set cannot be identified, so that the update of the motion probability of the feature points is assisted by means of an input depth image, and the method comprises the following specific processes:
step 31: referring to the operation of Detect-SLAM, the motion probability of the three-dimensional map feature point x is updated using the following formula,
P k (x)=(1-α)P k-1 (x)+αZ k (u),
wherein P is k (x) Indicating that the map point passes through the firstMotion probability, P, after semantic information update of k key frames k-1 (x) For the motion probability after the update of the previous key frame, u is the associated pixel characteristic point of the three-dimensional map characteristic point x on the kth key frame, Z k (u) represents the observation of the motion state of map point x on the keyframe; alpha is a weight parameter;
step 32: z is based on semantic output result and depth consistency test k The value of (u) is expressed as:
when the pixel characteristic point u falls in the region omega determined as a dynamic object k,dyn When in the interior, Z k (u) is set to 1; if falling in the static object region omega k,sta Internally, set to 0; if falling in other region omega k,unk Then use its neighborhoodPixel columns in ∈>Is subjected to a geometric consistency check of the depth data of (1), wherein +.>Representing the difference between the depth projected at the pixel point and the depth measurement of its neighborhood pixels, T k Is a six-degree-of-freedom transformation matrix of the key frame k under a global coordinate system, [.] Z Operator takes transformed spatial feature points T k Depth value of x>Representing a domain pixel point u i Depth values of (2); sigma (sigma) u Representing standard deviation of corresponding depth values;
step 4: the dense mapping thread receives key frames from tracking threads at the same time, adopts a multi-view stereoscopic vision reconstruction algorithm, and utilizes RGB image information of adjacent key frames to carry out MVS iterative optimization on depth data in the current key frames, and a cost function of MVS optimization problem constructed in the process consists of three parts:
E mvs (Θ)=ω α E photo (Θ)+ω β E reg (Θ)+ω γ E prior (Θ)
wherein,the method comprises the steps of obtaining a parameter vector to be optimized, namely a pixel depth value set; e (E) photo The gray gradient residual error term between multi-view stereo matching pixels is used for evaluating the luminosity consistency in the multi-view image under the current parameter Θ; e (E) reg The method is a regular term expressed by a first derivative of a normal vector of the curved surface and is used for representing the smoothness of the current reconstructed curved surface; e (E) prior The prior constraint term based on the original depth measurement is used for constraining the iteration direction in the nonlinear optimization process, so that serious drift of parameters to be estimated is avoided; [ omega ] αβγ -weight coefficients for each residual term;
the MVS nonlinear optimization problem expressed above has strong non-convexity and high parameter space dimension, and is generally difficult to perform online optimization solution;
to solve this problem, a nonlinear optimization solver capable of solving in parallel on a GPU is provided, and the working process includes:
step 41: firstly, uniformly dividing an image area to be reconstructed into a plurality of sub-square areas, and decomposing the pixel depth optimization problem of the whole area into the depth optimization problems of a plurality of sub-areas, thereby reducing the space dimension of parameters to be estimated;
step 42: starting a thread block on the GPU for carrying out Gauss-Newton iterative solution for each sub-square region to be optimized, wherein the steps are as follows:
step 421: each thread in the thread block independently processes a pixel in the sub-block, calculates residual errors and first-order Jacobian vectors, and finally constructs a linear equation with a sparse coefficient matrix; step 422: the linear equation of the last step is solved by utilizing the shared memory of the thread blocks to efficiently run the conjugate gradient method; step 423: repeating step 421 until the maximum iteration number is reached or the suspension condition is satisfied;
step 43: in the actual running process, simultaneously starting a plurality of thread blocks to optimize the pixel depths of a plurality of sub-square areas in parallel, and traversing all the sub-squares for a plurality of times to obtain a final optimization result;
step 5: the local mapping thread receives key frames from the pose tracking thread, maintains and manages the feature point map, and comprises the steps of creating new map points, updating the co-view relation among the key frames and performing local BA optimization to adjust map state parameters;
step 6: the loop detection thread receives the map state parameters which are updated by the local map building thread processing, wherein the map state parameters comprise map points, key frames and common view relations; matching a word bag model of the latest key frame with the historical key frames to check whether a loop is formed, and if the loop is detected, carrying out loop correction and optimization on the associated frames at the two ends of the loop so as to reduce the accumulated deviation in the system as much as possible;
step 7: the SLAM system map-building back end composed of the dense map-building thread and the local map-building thread is combined with the target detection thread to provide environment models with different updating frequencies and different geometric accuracies.
2. The method for autonomous positioning and on-line high-precision three-dimensional reconstruction of a dynamic scene according to claim 1, wherein the preprocessing in step 1 comprises: RGB-depth registration processing, bilateral filtering noise suppression processing.
3. The method for autonomous positioning and online high-precision three-dimensional reconstruction of a dynamic scene according to claim 1, wherein in the step 32, the dynamic object comprises a human being or a pet.
4. The method for autonomous dynamic scene positioning and on-line high-precision three-dimensional reconstruction according to claim 1, wherein in the step 32, the static object comprises a sofa or a table.
5. The method for autonomous dynamic scene positioning and online high-precision three-dimensional reconstruction according to claim 1, wherein in the step 4, the weight coefficient { ω ] of each residual term αβγ 1, 0.025 and 0.01, respectively.
6. The method for autonomous positioning and online high-precision three-dimensional reconstruction of a dynamic scene according to claim 1, wherein in the step 4, the nonlinear optimization solver is a Gauss-Newton solver.
7. The method of dynamic scene autonomous positioning and on-line high-precision three-dimensional reconstruction according to claim 1, wherein in step 41, the sub-block size is set to 16×16 for improving the GPU data access efficiency.
8. The method of dynamic scene autonomous positioning and on-line high-precision three-dimensional reconstruction according to claim 1, wherein in step 41, the sub-block size is set to 32×32 in order to increase the GPU data access efficiency.
9. The method for autonomous positioning and on-line high-precision three-dimensional reconstruction of a dynamic scene according to claim 1, wherein in step 43, all sub-blocks are traversed 10 to 15 times.
10. The method for autonomous positioning and on-line high-precision three-dimensional reconstruction of dynamic scene according to claim 1, wherein in step 7, the dense map corresponds to low frequency and high precision, and the local map corresponds to high frequency and low precision; the high-precision environment model can better support task requirements of environment understanding and passable area analysis in a complex scene, and the extracted semantic information is combined with the constructed local map model to effectively improve the positioning precision and robustness of the system in a dynamic scene.
CN202311023428.3A 2023-08-15 2023-08-15 Dynamic scene autonomous positioning and online high-precision 3D reconstruction method Pending CN117115343A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311023428.3A CN117115343A (en) 2023-08-15 2023-08-15 Dynamic scene autonomous positioning and online high-precision 3D reconstruction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311023428.3A CN117115343A (en) 2023-08-15 2023-08-15 Dynamic scene autonomous positioning and online high-precision 3D reconstruction method

Publications (1)

Publication Number Publication Date
CN117115343A true CN117115343A (en) 2023-11-24

Family

ID=88799389

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311023428.3A Pending CN117115343A (en) 2023-08-15 2023-08-15 Dynamic scene autonomous positioning and online high-precision 3D reconstruction method

Country Status (1)

Country Link
CN (1) CN117115343A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118674786A (en) * 2024-08-22 2024-09-20 浙江吉利控股集团有限公司 Method, device, equipment, medium and program product for determining image pose data
CN118736162A (en) * 2024-09-03 2024-10-01 四川万豪恒达信息科技有限公司 A voxel-based dynamic three-dimensional reconstruction method and computer equipment for intelligent workshops

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118674786A (en) * 2024-08-22 2024-09-20 浙江吉利控股集团有限公司 Method, device, equipment, medium and program product for determining image pose data
CN118674786B (en) * 2024-08-22 2024-11-08 浙江吉利控股集团有限公司 Method, device, equipment, medium and program product for determining image pose data
CN118736162A (en) * 2024-09-03 2024-10-01 四川万豪恒达信息科技有限公司 A voxel-based dynamic three-dimensional reconstruction method and computer equipment for intelligent workshops

Similar Documents

Publication Publication Date Title
CN111968129B (en) Semantic-aware real-time positioning and map construction system and method
CN108416840B (en) A 3D scene dense reconstruction method based on monocular camera
CN111340922B (en) Positioning and mapping method and electronic device
CN117456136A (en) Digital twin scene intelligent generation method based on multi-mode visual recognition
CN117252892B (en) Automatic double-branch portrait matting device based on light visual self-attention network
CN111311685A (en) An unsupervised method for motion scene reconstruction based on IMU/monocular image
CN117315169A (en) Real-life three-dimensional model reconstruction method and system based on deep learning multi-view dense matching
CN110009674A (en) A real-time calculation method of monocular image depth of field based on unsupervised deep learning
CN111899280A (en) Monocular vision odometer method adopting deep learning and mixed pose estimation
CN114677479A (en) A deep learning-based multi-view 3D reconstruction method for natural landscapes
CN116772820A (en) Local refinement mapping system and method based on SLAM and semantic segmentation
Niu et al. Overview of image-based 3D reconstruction technology
Yang et al. [Retracted] A Method of Image Semantic Segmentation Based on PSPNet
CN117115343A (en) Dynamic scene autonomous positioning and online high-precision 3D reconstruction method
CN118071932A (en) A three-dimensional static scene image reconstruction method and system
CN120236273A (en) A lightweight real-time semantic segmentation method for 3D Gaussian scenes
CN117437274A (en) A monocular image depth estimation method and system
Yue et al. Semi-supervised monocular depth estimation based on semantic supervision
CN114677380B (en) A video object segmentation method and system based on diversified interaction
CN118521702A (en) A point cloud rendering method and system based on neural radiation field
CN120976449A (en) A method and system for cross-source data 3D reconstruction based on improved Gaussian sputtering
CN118447536A (en) Human body posture estimation method, system and device based on transducer and diffusion model
Zhou et al. PADENet: An efficient and robust panoramic monocular depth estimation network for outdoor scenes
CN117392228A (en) Visual odometry calculation method, device, electronic equipment and storage medium
CN117911480A (en) An attention-guided multi-view depth estimation method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination