CN120672942B

CN120672942B - A method, system, device, and medium for 3D reconstruction based on monocular video.

Info

Publication number: CN120672942B
Application number: CN202510722081.4A
Authority: CN
Inventors: 陈天戈; 吴卉; 黄志青
Original assignee: Guangzhou Zhongyiyong Intelligent Technology Co ltd
Current assignee: Guangzhou Zhongyiyong Intelligent Technology Co ltd
Priority date: 2025-05-30
Filing date: 2025-05-30
Publication date: 2026-04-03
Anticipated expiration: 2045-05-30
Also published as: CN120672942A

Abstract

The application discloses a three-dimensional reconstruction method, system, device and medium based on monocular video. The method comprises: performing data acquisition on an indoor space with a monocular camera to obtain a monocular video; performing sliding window segmentation on the monocular video according to the spatial volume of the indoor space to obtain video segments; performing local reconstruction on the video segments to obtain local reconstruction point clouds; performing key frame joint registration on the local reconstruction point clouds to obtain registered scene frames; and performing global scene optimization on the registered scene frames according to spatial constraints to obtain a three-dimensional reconstruction result. The embodiments of the application can improve the accuracy of three-dimensional reconstruction and can be widely applied in the technical field of computer vision.

Description

Three-dimensional reconstruction method, system, equipment and medium based on monocular video

Technical Field

The application relates to the technical field of computer vision, in particular to a three-dimensional reconstruction method, system, equipment and medium based on monocular video.

Background

In the related art, three-dimensional reconstruction methods generally collect a large amount of three-dimensional point cloud or image data with multiple stereo cameras and similar devices, and convert the data into a three-dimensional model through a three-dimensional reconstruction algorithm. In practical applications, however, these methods run into problems in small-space environments: the narrow field of view increases the number of regions with repeated features, which causes registration ambiguity, and the limited range of movement reduces the effectiveness of motion parallax, which aggravates monocular drift. Both effects degrade the efficiency of three-dimensional reconstruction. In summary, the technical problems in the related art remain to be solved.

Disclosure of Invention

The embodiment of the application mainly aims to provide a three-dimensional reconstruction method, a system, equipment and a medium based on monocular video, which can improve the accuracy of three-dimensional reconstruction.

To achieve the above object, an aspect of an embodiment of the present application provides a three-dimensional reconstruction method based on monocular video, the method including:

performing data acquisition on an indoor space with a monocular camera to obtain a monocular video;

performing sliding window segmentation on the monocular video according to the spatial volume of the indoor space to obtain a video segment;

performing local reconstruction on the video segment to obtain a local reconstruction point cloud;

performing key frame joint registration on the local reconstruction point cloud to obtain a registered scene frame; and

performing global scene optimization on the registered scene frame according to a spatial constraint to obtain a three-dimensional reconstruction result.

In some embodiments, performing sliding window segmentation on the monocular video according to the spatial volume of the indoor space to obtain a video segment includes the following steps:

performing volume calculation on the indoor space according to the monocular video to obtain the spatial volume;

initializing a sliding window according to the spatial volume, and adjusting the length of the sliding window based on motion blur detection to obtain a target sliding window; and

segmenting the monocular video according to the target sliding window to obtain the video segment.

In some embodiments, performing local reconstruction on the video segment to obtain a local reconstruction point cloud includes the following steps:

performing feature extraction on the video segment through an image encoder, and performing temporal feature fusion on the extracted features with a gated recurrent unit (GRU) to obtain spatial scene features;

performing multi-view information fusion on the spatial scene features through a key frame decoder to obtain multi-view information;

performing key frame information supplementation on the spatial scene features through a support frame decoder to obtain key frame information;

performing bidirectional cross-attention calculation on the multi-view information and the key frame information according to a spatial locality constraint to obtain fused features; and

performing regression prediction on the fused features with a point cloud regression module based on deformable convolution to obtain the local reconstruction point cloud.

In some embodiments, performing key frame joint registration on the local reconstruction point cloud to obtain a registered scene frame includes the following steps:

acquiring a scene frame buffer pool, wherein the scene frame buffer pool comprises historical scene frames;

performing coordinate transformation on the local reconstruction point cloud to obtain global point cloud data; and

performing registration retrieval on the global point cloud data according to the scene frame buffer pool to obtain the registered scene frame.

In some embodiments, performing registration retrieval on the global point cloud data according to the scene frame buffer pool to obtain the registered scene frame includes the following steps:

performing cosine similarity retrieval on each historical scene frame in the scene frame buffer pool according to the global point cloud data to generate a key frame set;

performing spatio-temporal feature alignment on the key frame set to obtain cross-key-frame spatio-temporal features;

performing three-dimensional point cloud registration on the key frame set according to the cross-key-frame spatio-temporal features to obtain a registered point cloud; and

performing point cloud fusion on the registered point cloud to obtain the registered scene frame.

In some embodiments, performing global scene optimization on the registered scene frame according to a spatial constraint to obtain a three-dimensional reconstruction result includes the following steps:

performing point cloud optimization on the registered scene frame to obtain point cloud optimization data;

performing plane constraint optimization on the point cloud optimization data to obtain plane optimization data; and

performing spatial topology optimization on the plane optimization data to obtain the three-dimensional reconstruction result.

In some embodiments, performing spatial topology optimization on the plane optimization data to obtain the three-dimensional reconstruction result includes the following steps:

performing topology construction on the plane optimization data to obtain a scene topology graph; and

performing iterative optimization on the scene topology graph with a graph convolutional network to obtain the three-dimensional reconstruction result.

To achieve the above object, another aspect of an embodiment of the present application provides a three-dimensional reconstruction system based on monocular video, the system including:

a first module, configured to perform data acquisition on an indoor space with a monocular camera to obtain a monocular video;

a second module, configured to perform sliding window segmentation on the monocular video according to the spatial volume of the indoor space to obtain a video segment;

a third module, configured to perform local reconstruction on the video segment to obtain a local reconstruction point cloud;

a fourth module, configured to perform key frame joint registration on the local reconstruction point cloud to obtain a registered scene frame; and

a fifth module, configured to perform global scene optimization on the registered scene frame according to a spatial constraint to obtain a three-dimensional reconstruction result.

To achieve the above object, another aspect of the embodiments of the present application provides an electronic device, including a memory and a processor, where the memory stores a computer program, and the processor implements the method described above when executing the computer program.

To achieve the above object, another aspect of the embodiments of the present application proposes a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method described above.

The embodiments of the application provide at least the following beneficial effects. In the three-dimensional reconstruction method, system, device and medium based on monocular video, a monocular video is obtained by performing data acquisition on an indoor space with a monocular camera, and video segments are obtained by performing sliding window segmentation on the monocular video according to the spatial volume of the indoor space. Because the window length is adaptively adjusted according to the spatial volume, the overlap rate between local windows can be increased, monocular scale drift is reduced, and a data basis is provided for the subsequent local reconstruction. In addition, local reconstruction point clouds are obtained by performing local reconstruction on the video segments, registered scene frames are obtained by performing key frame joint registration on the local reconstruction point clouds, and a three-dimensional reconstruction result is obtained by performing global scene optimization on the registered scene frames according to spatial constraints. The spatial constraints make it possible to check reconstruction completeness, reduce registration errors and improve the accuracy of three-dimensional reconstruction.

Drawings

Fig. 1 is a flowchart of a three-dimensional reconstruction method based on monocular video according to an embodiment of the present application;

Fig. 2 is a schematic structural diagram of a three-dimensional reconstruction system based on monocular video according to an embodiment of the present application;

Fig. 3 is a schematic diagram of the hardware structure of an electronic device according to an embodiment of the present application.

Detailed Description

The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with embodiments of the application, but are merely examples of systems and methods consistent with aspects of embodiments of the application as detailed in the accompanying claims.

It is to be understood that the terms "first," "second," and the like, as used herein, may be used to describe various concepts, but these concepts are not limited by these terms unless otherwise specified. These terms are only used to distinguish one concept from another. For example, the first information may also be referred to as second information, and similarly, the second information may also be referred to as first information, without departing from the scope of embodiments of the present application. The word "if", as used herein, may be interpreted as "when" or "in response to a determination", depending on the context.

As used herein, "at least one" means one, two or more; "a plurality" means two or more; "each" means every one of the corresponding plurality; and "any" means any one of the plurality.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.

In the related art, three-dimensional reconstruction methods generally collect a large amount of three-dimensional point cloud or image data with multiple stereo cameras and similar devices, and convert the data into a three-dimensional model through a three-dimensional reconstruction algorithm. In practical applications, however, these methods run into problems in small-space environments: the narrow field of view increases the number of regions with repeated features, which causes registration ambiguity, and the limited range of movement reduces the effectiveness of motion parallax, which aggravates monocular drift. Both effects degrade the efficiency of three-dimensional reconstruction. In summary, the technical problems in the related art remain to be solved. For example, some related methods use simultaneous localization and mapping (SLAM) for three-dimensional reconstruction, but such methods require offline processing and therefore cannot meet real-time requirements, while real-time dense SLAM systems fall short in reconstruction accuracy and completeness. Solutions based on depth sensors, in turn, are costly and limited by the environment.

In view of the above, embodiments of the present application provide a three-dimensional reconstruction method, system, device, and medium based on monocular video. A monocular video is obtained by performing data acquisition on an indoor space with a monocular camera, and video segments are obtained by performing sliding window segmentation on the monocular video according to the spatial volume of the indoor space. Because the window length is adaptively adjusted according to the spatial volume, the overlap rate between local windows can be increased, monocular scale drift is reduced, and a data basis is provided for the subsequent local reconstruction. In addition, local reconstruction point clouds are obtained by performing local reconstruction on the video segments, registered scene frames are obtained by performing key frame joint registration on the local reconstruction point clouds, and a three-dimensional reconstruction result is obtained by performing global scene optimization on the registered scene frames according to spatial constraints. The spatial constraints make it possible to check reconstruction completeness, reduce registration errors and improve the accuracy of three-dimensional reconstruction.

The embodiment of the application provides a three-dimensional reconstruction method based on monocular video, and relates to the technical field of computer vision. The method can be applied to a terminal, to a server, or to software running in the terminal or the server. In some embodiments, the terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a vehicle-mounted terminal, or the like. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms; the server may also be a node server in a blockchain network. The software may be an application that implements the three-dimensional reconstruction method based on monocular video, or the like, but is not limited to the above forms.

The application is operational with numerous general purpose or special purpose computer system environments or configurations, such as personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

Fig. 1 is an optional flowchart of a three-dimensional reconstruction method based on monocular video according to an embodiment of the present application, where the method in fig. 1 may include, but is not limited to, steps S101 to S105.

Step S101, performing data acquisition on an indoor space with a monocular camera to obtain a monocular video;

Step S102, performing sliding window segmentation on the monocular video according to the spatial volume of the indoor space to obtain a video segment;

Step S103, performing local reconstruction on the video segment to obtain a local reconstruction point cloud;

Step S104, performing key frame joint registration on the local reconstruction point cloud to obtain a registered scene frame;

Step S105, performing global scene optimization on the registered scene frame according to a spatial constraint to obtain a three-dimensional reconstruction result.

In a small-space environment, steps S101 to S105 achieve real-time three-dimensional reconstruction with high precision and high completeness using only an ordinary monocular camera. In the embodiments of the application, a small space refers to an enclosed indoor space with an area of 10-50 square meters and a height of no more than 5 meters. Specifically, a monocular video is first obtained by performing data acquisition on the indoor space with the monocular camera. The length of the sliding window is then adaptively adjusted according to the spatial volume, and the input video stream is segmented by the sliding window into overlapping short segments. Next, local reconstruction is performed on each video segment: an improved multi-branch neural network model directly predicts a dense 3D point cloud for the frames in the window, and a local coordinate system is established with the middle frame as the key frame. The local reconstruction point cloud is then subjected to key frame joint registration, in which it is incrementally registered to the global coordinate system; historical scene frames related to the current frame are retrieved from the buffer pool based on visual similarity and a baseline suitability score, and the registered scene frame is obtained through joint registration. Finally, global scene optimization is performed on the registered scene frame according to the spatial constraints to obtain the three-dimensional reconstruction result.

For the characteristics of the small-space environment, the embodiment of the application specifically optimizes the following parameters: the window length is set to 11 frames to balance reconstruction quality and efficiency; the scene frame buffer pool size is set to 30-50 frames; the 5-10 most relevant scene frames are retrieved for each registration; and a multi-key-frame joint registration strategy is used for registration.
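
As an illustration of how a video stream can be cut into overlapping short segments with the middle frame as the key frame, the following is a minimal Python sketch. The function name `split_into_windows` and the stride of 5 frames are assumptions made for this example; the text specifies only the 11-frame window length and that the segments overlap.

```python
from typing import List, Tuple


def split_into_windows(num_frames: int, window: int = 11,
                       stride: int = 5) -> List[Tuple[int, int, int]]:
    """Return (start, end_exclusive, key_frame_index) for each window.

    The stride is a hypothetical parameter: any stride smaller than the
    window length makes consecutive windows overlap.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if stride >= window:
        raise ValueError("stride must be smaller than window to get overlap")
    segments = []
    start = 0
    # Trailing frames that do not fill a whole window are left out here.
    while start + window <= num_frames:
        key = start + window // 2  # middle frame of the window
        segments.append((start, start + window, key))
        start += stride
    return segments


if __name__ == "__main__":
    for seg in split_into_windows(21):
        print(seg)
```

With a stride of 5 and a window of 11, neighbouring windows share 6 frames, which is the kind of overlap the later registration step relies on.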

This technical solution has the advantage that the dynamic window segmentation strategy and the spatial topology constraints reduce registration ambiguity and monocular scale drift, thereby improving the accuracy of three-dimensional reconstruction.

In the embodiment of the application, the spatial volume can be calculated from the acquired monocular video, or obtained by actually measuring the indoor space. In the former case, a deep learning model performs space recognition on the monocular video to detect the walls or floor of the space, and a depth prediction model predicts the depth of the space, so that the volume is calculated from the detected area and depth. The sliding window is then initialized according to the spatial volume; for example, the number of frames in the window is determined from V, where V denotes the spatial volume. Next, the length of the sliding window is adjusted based on motion blur detection: the degree of blur can be detected from the sharpness or edge information of the image, and the window length is adaptively adjusted according to the degree of blur. The monocular video is then segmented according to the adaptively adjusted target sliding window to obtain the video segments.
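
The blur-based adjustment can be sketched as follows. This is a minimal illustration, not the method of the embodiment: the variance of a discrete Laplacian is used as one common sharpness score, and the rule that lengthens the window when more frames are blurry, the threshold, and the length bounds are all assumptions, since the text does not give the mapping from spatial volume or blur degree to window length.

```python
import numpy as np


def laplacian_variance(gray: np.ndarray) -> float:
    """Sharpness score: variance of a 4-neighbour Laplacian (higher = sharper)."""
    g = gray.astype(np.float64)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return float(lap.var())


def adjust_window_length(base_length: int, sharpness_scores, blur_threshold: float,
                         min_length: int = 5, max_length: int = 21) -> int:
    """Hypothetical rule: grow the window in proportion to the share of blurry frames.

    min_length and max_length are assumed to be odd so that a middle frame exists.
    """
    scores = np.asarray(sharpness_scores, dtype=np.float64)
    blurry_ratio = float(np.mean(scores < blur_threshold)) if scores.size else 0.0
    length = int(round(base_length * (1.0 + blurry_ratio)))
    length = max(min_length, min(max_length, length))
    if length % 2 == 0:
        # Keep the length odd so the window has a well-defined middle frame.
        length = length + 1 if length < max_length else length - 1
    return length


if __name__ == "__main__":
    frames = [np.random.default_rng(i).random((32, 32)) for i in range(4)]
    scores = [laplacian_variance(f) for f in frames]
    print(adjust_window_length(11, scores, blur_threshold=0.1))
```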

This technical solution has the advantage that dynamically adjusting the sliding window adapts the window size to the needs of the scene, balances resource use against performance, and provides a data basis for the subsequent three-dimensional reconstruction.

In the embodiment of the application, feature extraction is performed on the video segment through an image encoder. Because the features of a small-space scene are highly repetitive, a cross-window feature transfer mechanism is introduced, and a gated recurrent unit (GRU) is used to perform temporal feature fusion on the extracted features, wherein the formula of the gated recurrent unit is as follows:

z_t = σ(W_z · [h_{t-1}, x_t])

where z_t denotes the update gate output vector at the current time step, σ denotes the sigmoid activation function, W_z denotes the learnable weight matrix of the update gate, h_{t-1} denotes the hidden state vector at the previous time step, and x_t denotes the input feature vector at the current time step. Multi-view information fusion is then performed on the spatial scene features through a key frame decoder to obtain multi-view information, and key frame information supplementation is performed on the spatial scene features through a support frame decoder to obtain key frame information. The embodiment of the application also designs a spatial locality constraint for the bidirectional cross-attention mechanism, and defines an attention radius according to the characteristics of the small-space environment, wherein the calculation formula of the attention radius is as follows:

where r is the attention radius, and W and H are the image width and height, respectively. Through the attention radius, the embodiment of the application forces attention onto local geometric associations in the small-space scene. It should be noted that the embodiment of the application uses the middle frame as the key frame to establish the local coordinate system. Finally, regression prediction is performed on the fused features by the point cloud regression module based on deformable convolution to obtain the local reconstruction point cloud. When a refinement module based on deformable convolution is added to the point cloud regression module, the embodiment of the application limits the convolution kernel deformation offset according to the small-space characteristics, wherein the calculation formula of the convolution kernel deformation offset is as follows:

where Δp is the convolution kernel deformation offset, D is the estimated scene depth, and f is the focal length.
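
The update gate formula given earlier for the gated recurrent unit can be evaluated directly, as in this minimal NumPy sketch. Only the update gate appears in the text; a complete GRU also has a reset gate and a candidate hidden state, which are omitted here. The function names and array shapes are assumptions for illustration.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def update_gate(W_z: np.ndarray, h_prev: np.ndarray, x_t: np.ndarray) -> np.ndarray:
    """z_t = sigmoid(W_z @ [h_{t-1}, x_t]).

    W_z has shape (hidden, hidden + input); h_prev has shape (hidden,);
    x_t has shape (input,). The brackets in the formula are a concatenation.
    """
    concat = np.concatenate([h_prev, x_t])
    return sigmoid(W_z @ concat)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W_z = rng.normal(size=(4, 4 + 3))
    z_t = update_gate(W_z, h_prev=np.zeros(4), x_t=rng.normal(size=3))
    print(z_t)  # each entry lies strictly between 0 and 1
```

Each entry of z_t controls how much of the corresponding hidden-state component is refreshed at the current frame, which is what lets features carry over between windows.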

This technical solution has the advantage that, by locally reconstructing the video segment and adding spatial scale constraints suited to the small-space environment, features under a narrow field of view can be detected more reliably, which improves the accuracy of feature extraction and reconstruction.
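
The spatial locality constraint on cross attention can be illustrated with a distance mask: a query token attends only to key tokens whose grid position lies within the attention radius. In this sketch the radius is passed in as a plain parameter, because its formula is not reproduced in this text; the function names and the single-head form without learned projections are assumptions. Calling the function twice with the two feature sets swapped gives a bidirectional variant.

```python
import numpy as np


def grid_coords(h: int, w: int) -> np.ndarray:
    """(row, col) position of every token on an h x w feature grid."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return np.stack([ys.ravel(), xs.ravel()], axis=1).astype(np.float64)


def local_cross_attention(q, k, v, coords_q, coords_k, radius: float):
    """Scaled dot-product attention restricted to keys within `radius`.

    Assumes every query has at least one key within the radius (true when
    both token sets share the same grid and radius >= 0).
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    dist = np.linalg.norm(coords_q[:, None, :] - coords_k[None, :, :], axis=-1)
    scores = np.where(dist <= radius, scores, -np.inf)  # mask distant pairs
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    coords = grid_coords(4, 4)
    q, k, v = (rng.normal(size=(16, 8)) for _ in range(3))
    fused, weights = local_cross_attention(q, k, v, coords, coords, radius=1.5)
    print(fused.shape, int((weights > 0).sum(axis=1).max()))
```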

In the embodiment of the application, the scene frame buffer pool is a pre-constructed buffer pool that contains multiple historical scene frames, which can be obtained from a database or by acquiring and processing the space in advance. The local reconstruction point cloud is incrementally registered to the global coordinate system to obtain global point cloud data, and registration retrieval is then performed on the global point cloud data according to the scene frame buffer pool. The pre-constructed scene frame buffer pool enables cross-window feature reuse, and the most similar historical scene frames, retrieved by cosine similarity, are used as key frames for registration to obtain the registered scene frame.

This technical solution has the advantage that performing registration retrieval on the global point cloud data allows multi-key-frame joint registration to be carried out in batches, which improves registration efficiency.
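
Retrieval from the scene frame buffer pool by cosine similarity can be sketched as below. The sketch assumes that each scene frame is summarized by one fixed-length descriptor vector; how that descriptor is computed is not stated in the text, and the function name, `top_k` and `min_similarity` are hypothetical.

```python
import numpy as np


def retrieve_key_frames(query, pool, top_k: int = 5, min_similarity: float = 0.0):
    """Return [(pool_index, similarity)] for the most similar pool entries.

    query: (d,) descriptor of the current frame.
    pool:  (n, d) descriptors of the historical scene frames.
    """
    query = np.asarray(query, dtype=np.float64)
    pool = np.asarray(pool, dtype=np.float64)
    eps = 1e-12  # avoids division by zero for all-zero descriptors
    sims = pool @ query / (np.linalg.norm(pool, axis=1) * np.linalg.norm(query) + eps)
    order = np.argsort(-sims)[:top_k]  # highest similarity first
    return [(int(i), float(sims[i])) for i in order if sims[i] >= min_similarity]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pool = rng.normal(size=(40, 16))   # e.g. a pool of 40 historical frames
    query = pool[7] + 0.05 * rng.normal(size=16)
    print(retrieve_key_frames(query, pool, top_k=5, min_similarity=0.2))
```

The similarity threshold plays the role of the screening step, and `top_k` corresponds to retrieving a handful of most relevant scene frames per registration.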

In the embodiment of the application, the cosine similarity between the global point cloud data and each historical scene frame in the scene frame buffer pool is calculated, and a key frame set is generated by filtering with a similarity threshold. Spatio-temporal feature alignment is then performed on the key frame set: a spatio-temporal feature cube is constructed, and features are extracted from it with three-dimensional convolution kernels to obtain the cross-key-frame spatio-temporal features. Three-dimensional point cloud registration is then performed on the key frame set according to the cross-key-frame spatio-temporal features, with multi-key-frame joint registration carried out by an improved iterative closest point (ICP) algorithm, wherein the objective function is as follows:

where the weight w_k is dynamically calculated from the key frame confidence, R denotes the rotation matrix, t denotes the translation vector, and the two remaining terms denote the i-th three-dimensional point of the k-th key frame in the source point cloud and its corresponding three-dimensional point in the target point cloud. Finally, point cloud fusion is performed on the registered point cloud to obtain the registered scene frame. The point cloud fusion can be carried out by establishing a probabilistic fusion model, wherein the expression of the fusion model is as follows:

where p(x) denotes the fused probability density at a three-dimensional spatial point x, α_k denotes the mixture weight of the k-th Gaussian component, the Gaussian term denotes a three-dimensional Gaussian probability density function, μ_k denotes the mean vector of the k-th Gaussian component, and Σ_k denotes the covariance matrix of the k-th Gaussian component. The optimal fusion parameters can be solved by the expectation-maximization algorithm, and the point cloud confidence threshold is set to 3 to filter out low-quality reconstructed point cloud data.
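
Given the symbol definitions in the two explanatory paragraphs above, the objective function and the fusion model plausibly take the standard weighted least-squares and Gaussian mixture forms shown here. This is a reconstruction under that assumption, not the text of the application; the names E, p_i^k and q_i^k are introduced only for illustration.

```latex
% Assumed form of the multi-key-frame registration objective
E(R, t) = \sum_{k} w_k \sum_{i} \left\| R\, p_i^{k} + t - q_i^{k} \right\|^{2}

% Assumed form of the probabilistic fusion model (Gaussian mixture)
p(x) = \sum_{k} \alpha_k \, \mathcal{N}\!\left(x \mid \mu_k, \Sigma_k\right),
\qquad \sum_{k} \alpha_k = 1
```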

This technical solution has the advantage that the multi-key-frame joint registration strategy registers multiple key frames simultaneously, which improves registration efficiency and accuracy.
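
One way to see how several key frames can be registered jointly is the closed-form weighted rigid fit below, which solves for a single rotation and translation over all key frames at once when point correspondences are already known. It is a sketch of one inner step of an ICP-style loop using the well-known SVD (Kabsch) solution, not the improved algorithm of the embodiment; the function name and the per-key-frame weights are assumptions.

```python
import numpy as np


def joint_rigid_fit(sources, targets, weights):
    """Find R, t minimizing sum_k w_k * sum_i ||R p + t - q||^2.

    sources, targets: lists of (N_k, 3) arrays with row-wise correspondences.
    weights: one non-negative weight per key frame (e.g. a confidence).
    """
    P = np.concatenate(sources)
    Q = np.concatenate(targets)
    w = np.concatenate([np.full(len(s), wk, dtype=np.float64)
                        for s, wk in zip(sources, weights)])
    w = w / w.sum()
    p_bar = (w[:, None] * P).sum(axis=0)      # weighted centroids
    q_bar = (w[:, None] * Q).sum(axis=0)
    H = (w[:, None] * (P - p_bar)).T @ (Q - q_bar)  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q_bar - R @ p_bar
    return R, t


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = np.deg2rad(20.0)
    R_true = np.array([[np.cos(a), -np.sin(a), 0.0],
                       [np.sin(a), np.cos(a), 0.0],
                       [0.0, 0.0, 1.0]])
    t_true = np.array([0.3, -0.1, 0.8])
    srcs = [rng.normal(size=(30, 3)) for _ in range(3)]
    tgts = [s @ R_true.T + t_true for s in srcs]
    R, t = joint_rigid_fit(srcs, tgts, weights=[0.9, 0.6, 0.3])
    print(np.round(R, 3), np.round(t, 3))
```

In a full ICP loop this fit would alternate with a nearest-neighbour search that refreshes the correspondences.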

In the embodiment of the application, a small-space optimization strategy is set according to the characteristics of the small-space scene. Point cloud distribution optimization is applied to the registered scene frame to obtain point cloud optimization data; plane constraint optimization is applied to the point cloud optimization data to obtain plane optimization data, for example by adding plane constraints to optimize large planes such as walls; and spatial topology optimization is applied to the plane optimization data to obtain the three-dimensional reconstruction result. Specifically, the point cloud distribution optimization uses a density-aware clustering algorithm that optimizes the point cloud data by defining a density measure, wherein the expression of the density measure is as follows:

where ρ(x) denotes the density measure, x denotes the coordinates of the target three-dimensional point whose density is to be calculated, and x_i denotes the coordinates of the i-th neighboring point within the neighborhood N(x). An adaptive neighborhood radius r = μ_d + α·σ_d is then set, where μ_d is the average neighbor distance and α weights the term σ_d, and the distribution of the point cloud data is optimized according to the density measure and the adaptive neighborhood radius. The embodiment of the application also uses a multi-plane detection algorithm for plane constraint optimization, wherein the expression of the plane detection algorithm is as follows:

where n denotes the normal vector of the plane, d denotes the distance from the plane to the origin, the gradient term denotes the gradient of the normal vector in space, and λ denotes the regularization coefficient. The embodiment of the application builds a plane relationship graph and enforces orthogonality constraints, for example that the included angle between wall surfaces is 90 ± 5 degrees, so that plane constraints are applied to the point cloud data to obtain the plane optimization data.
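
The orthogonality check between detected planes can be sketched as follows. The plane fit uses an ordinary least-squares fit via SVD rather than the regularized multi-plane detection described above, whose expression is not reproduced in this text; the function names are assumptions, and the 5-degree tolerance follows the 90 ± 5 degree example.

```python
import numpy as np


def fit_plane(points: np.ndarray):
    """Least-squares plane n.x + d = 0 with unit normal n."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid, full_matrices=False)
    n = vt[-1]                      # direction of least variance
    d = -float(n @ centroid)        # |d| is the plane's distance to the origin
    return n, d


def angle_between_planes_deg(n1: np.ndarray, n2: np.ndarray) -> float:
    """Angle between two planes in [0, 90] degrees (sign of normals ignored)."""
    c = abs(float(n1 @ n2)) / (np.linalg.norm(n1) * np.linalg.norm(n2))
    return float(np.degrees(np.arccos(np.clip(c, 0.0, 1.0))))


def is_orthogonal(n1: np.ndarray, n2: np.ndarray, tol_deg: float = 5.0) -> bool:
    return abs(angle_between_planes_deg(n1, n2) - 90.0) <= tol_deg


if __name__ == "__main__":
    u, v = np.meshgrid(np.linspace(0, 3, 6), np.linspace(0, 2.5, 6))
    wall = np.stack([np.zeros(u.size), u.ravel(), v.ravel()], axis=1)   # x = 0
    floor = np.stack([u.ravel(), v.ravel(), np.zeros(u.size)], axis=1)  # z = 0
    n_wall, _ = fit_plane(wall)
    n_floor, _ = fit_plane(floor)
    print(angle_between_planes_deg(n_wall, n_floor), is_orthogonal(n_wall, n_floor))
```

Plane pairs that pass this check could then be snapped to exactly 90 degrees during optimization, which is one way to remove small wall registration errors.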

The technical scheme has the advantages that the accuracy of three-dimensional reconstruction can be improved by introducing planar orthogonality constraint optimization and density self-adaptive topology optimization.
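The density-adaptive step can be sketched as follows, using the adaptive neighborhood radius r = μ_d + ασ_d described above. Because the exact density expression is not reproduced in the text, the sketch substitutes a simple neighbor count within r as the density; that count, the use of k nearest neighbors to estimate μ_d and σ_d, the reading of σ_d as a standard deviation, and all function names are assumptions made for illustration only.

```python
import numpy as np

def adaptive_radius(points, k=8, alpha=1.0):
    """r = mu_d + alpha * sigma_d over the k-nearest-neighbor distances."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Sort each row; column 0 is the zero self-distance, so skip it.
    knn = np.sort(dist, axis=1)[:, 1:k + 1]
    mu_d = knn.mean()      # average neighbor distance
    sigma_d = knn.std()    # assumed: standard deviation of neighbor distance
    return mu_d + alpha * sigma_d, dist

def neighbor_count_density(dist, r):
    """Stand-in density: number of other points within radius r."""
    return (dist <= r).sum(axis=1) - 1  # subtract the point itself

def filter_sparse(points, k=8, alpha=1.0, min_density=3):
    """Keep only points whose stand-in density reaches min_density."""
    r, dist = adaptive_radius(points, k, alpha)
    rho = neighbor_count_density(dist, r)
    return points[rho >= min_density], r, rho
```

The pairwise distance matrix keeps the sketch short but scales quadratically with the number of points; a spatial index would be the usual choice for real point clouds.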

In the embodiment of the application, a scene topology graph G = (V, E, W) is constructed from the plane optimization data, in which the vertices represent spatial units. The distribution of the point cloud is then optimized through a graph convolutional network (GCN), and the optimization formula is as follows:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2}\,\tilde{A}\,\tilde{D}^{-1/2}\,H^{(l)}\,W^{(l)}\right)$$

Wherein $H^{(l+1)}$ represents the node feature matrix of the (l+1)-th layer; $\sigma$ represents the nonlinear activation function, for which the embodiment of the application adopts the ReLU activation function; $\tilde{D}^{-1/2}$ represents the normalized form of the degree matrix; $\tilde{A}$ represents the normalized form of the adjacency matrix; $H^{(l)}$ represents the input node features of the l-th layer; and $W^{(l)}$ represents the trainable weight matrix of the l-th layer.
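A single graph convolution layer of this kind can be sketched in NumPy. The sketch follows the commonly used GCN propagation rule, in which the adjacency matrix is augmented with self-loops before symmetric degree normalization; the self-loop step and the function name `gcn_layer` are assumptions, since the text only states that normalized degree and adjacency matrices are used.

```python
import numpy as np

def gcn_layer(adjacency, features, weights):
    """One propagation step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    # Assumed: add self-loops so each node keeps its own features.
    a_tilde = adjacency + np.eye(adjacency.shape[0])
    degree = a_tilde.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    # Symmetric normalization, applied row-wise and column-wise.
    a_hat = d_inv_sqrt[:, None] * a_tilde * d_inv_sqrt[None, :]
    # ReLU activation, as adopted in the embodiment.
    return np.maximum(a_hat @ features @ weights, 0.0)
```

Here `adjacency` would hold the edge weights of the scene topology graph, `features` the per-node features of the spatial units, and `weights` the trainable layer matrix; stacking several such layers gives the iterative optimization.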

The technical scheme has the advantages that the space topology optimization is carried out on the point cloud data through the graph convolution network, the reconstructed data can be more in line with the small space environment, and the accuracy of three-dimensional reconstruction is improved.

The following describes and illustrates the embodiments of the present application in detail with reference to specific application examples:

In this application example, a monocular RGB video is received as input and the first window is initialized: every frame is tried as a key frame candidate, and the reconstruction result with the highest total confidence is selected to initialize the global scene. Optimization is then carried out under small-space prior constraints to obtain the three-dimensional reconstruction result. Video segments are obtained through the sliding window; the features of each frame are extracted by the image encoder, multi-view information is fused by the key frame decoder, key frame information is supplemented by the support frame decoder, and the 3D point cloud and its confidence are predicted by the regression head. The reconstructed point cloud data is then registered to the global coordinate system: related historical scene frames are retrieved from the scene frame buffer pool, the coordinate system is transformed by the registration decoder in combination with the encoded image and geometric features, the global scene is optimized by the scene decoder, and the scene frame buffer pool is updated. Finally, the small-space optimization strategy is applied, introducing the plane orthogonality constraint to eliminate wall registration errors. Specifically, the embodiment of the application adopts a two-stage neural network framework: the video is divided into short segments by the sliding window mechanism, a first-stage network directly predicts local 3D point clouds, and a second-stage network incrementally registers them to the global coordinate system. The window size, scene frame management strategy and spatial constraints are optimized for small-space environments, achieving high-quality real-time reconstruction without explicit camera parameter estimation.
By introducing spatial-volume-aware window segmentation, plane orthogonality constraint optimization and density-adaptive topology optimization, the reconstruction completeness is improved by 42.7% compared with related three-dimensional reconstruction methods in a 5 m × 5 m standard test scene, the registration error is reduced to 0.11 m, and real-time performance of 23 FPS is maintained.
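The sliding window mechanism that divides the video into short overlapping segments can be sketched as follows. The mapping from room volume to an initial window length is a purely hypothetical rule (a linear scale with clamping), as are the parameter names and default values; the application does not state the actual relation, and the motion-blur adjustment of the window length is not modeled here.

```python
def initial_window_length(volume_m3, frames_per_m3=0.5, min_len=4, max_len=32):
    """Hypothetical rule: window length grows with room volume, clamped."""
    return max(min_len, min(max_len, round(volume_m3 * frames_per_m3)))

def sliding_windows(num_frames, window, overlap):
    """Return [start, end) frame index pairs with the given overlap ratio."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    # A higher overlap ratio gives a smaller stride between windows.
    stride = max(1, int(window * (1.0 - overlap)))
    segments = []
    start = 0
    while start < num_frames:
        end = min(start + window, num_frames)
        segments.append((start, end))
        if end == num_frames:
            break
        start += stride
    return segments
```

With 10 frames, a window of 4 and an overlap ratio of 0.5, the segments are (0, 4), (2, 6), (4, 8) and (6, 10); raising the overlap ratio shortens the stride, so adjacent local reconstructions share more frames.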

Referring to fig. 2, the embodiment of the present application further provides a three-dimensional reconstruction system based on a monocular video, which can implement the three-dimensional reconstruction method based on a monocular video, where the system includes:

a first module 201, configured to perform video acquisition processing on an indoor space through a monocular camera to obtain a monocular video;

a second module 202, configured to perform sliding window segmentation processing on the monocular video according to the spatial volume of the indoor space, so as to obtain a video segment;

a third module 203, configured to perform local reconstruction processing on the video segment to obtain a local reconstruction point cloud;

a fourth module 204, configured to perform a keyframe joint registration process on the local reconstruction point cloud to obtain a registration scene frame;

and a fifth module 205, configured to perform global scene optimization processing on the registered scene frame according to spatial constraint, so as to obtain a three-dimensional reconstruction result.
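As one illustration of the fourth module, the retrieval of related historical scene frames from the scene frame buffer pool can be sketched with cosine similarity, the measure named in claim 4. The sketch assumes each scene frame is summarized by a fixed-length descriptor vector; that representation, the `top_k` parameter and the function name `retrieve_keyframes` are hypothetical.

```python
import numpy as np

def retrieve_keyframes(query, pool, top_k=3):
    """Return indices and scores of the top_k pool descriptors that are
    most similar to the query descriptor under cosine similarity."""
    q = query / np.linalg.norm(query)
    p = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    scores = p @ q                       # cosine similarity per pool entry
    order = np.argsort(-scores)[:top_k]  # highest similarity first
    return order, scores[order]
```

The returned indices would form the key frame set that is passed on to spatiotemporal feature alignment and point cloud registration.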

It can be understood that the content in the above method embodiment is applicable to the system embodiment, and the functions specifically implemented by the system embodiment are the same as those of the above method embodiment, and the achieved beneficial effects are the same as those of the above method embodiment.

The embodiment of the application also provides electronic equipment, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the three-dimensional reconstruction method based on the monocular video when executing the computer program. The electronic equipment can be any intelligent terminal including a tablet personal computer, a vehicle-mounted computer and the like.

It can be understood that the content in the above method embodiment is applicable to the embodiment of the present apparatus, and the specific functions implemented by the embodiment of the present apparatus are the same as those of the embodiment of the above method, and the achieved beneficial effects are the same as those of the embodiment of the above method.

Referring to fig. 3, fig. 3 illustrates a hardware structure of an electronic device according to another embodiment, where the electronic device includes:

a processor 301, which may be implemented by a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs so as to implement the technical solutions provided by the embodiments of the present application;

a memory 302, which may be implemented in the form of a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 302 may store an operating system and other application programs; when the technical solution provided in the embodiments of the present application is implemented by software or firmware, the relevant program code is stored in the memory 302 and invoked by the processor 301 to execute the three-dimensional reconstruction method based on monocular video of the embodiments of the present application;

an input/output interface 303 for implementing information input and output;

a communication interface 304, configured to implement communication between the device and other devices, either in a wired manner (e.g. USB, network cable, etc.) or in a wireless manner (e.g. mobile network, Wi-Fi, Bluetooth, etc.);

a bus 305 for transferring information between various components of the device (e.g., processor 301, memory 302, input/output interface 303, and communication interface 304);

Wherein the processor 301, the memory 302, the input/output interface 303 and the communication interface 304 are communicatively coupled to each other within the device via a bus 305.

The embodiment of the application also provides a computer readable storage medium, which stores a computer program, and the computer program realizes the three-dimensional reconstruction method based on monocular video when being executed by a processor.

It can be understood that the content of the above method embodiment is applicable to the present storage medium embodiment, and the functions of the present storage medium embodiment are the same as those of the above method embodiment, and the achieved beneficial effects are the same as those of the above method embodiment.

The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

According to the three-dimensional reconstruction method, system, equipment and medium based on monocular video, a monocular video is obtained by acquiring data of the indoor space through a monocular camera, and video segments are obtained by sliding window segmentation of the monocular video according to the spatial volume of the indoor space. Because the window length is adaptively adjusted according to the spatial volume, the overlap rate of local windows can be increased and monocular scale drift reduced, which provides a data basis for the subsequent local reconstruction. In addition, local reconstruction point clouds are obtained by local reconstruction of the video segments, registration scene frames are obtained by key frame joint registration of the local reconstruction point clouds, and the three-dimensional reconstruction result is obtained by global scene optimization of the registration scene frames according to spatial constraints. The spatial constraints allow reconstruction completeness to be checked, reduce registration errors and improve the accuracy of three-dimensional reconstruction.

The embodiments described in the embodiments of the present application are for more clearly describing the technical solutions of the embodiments of the present application, and do not constitute a limitation on the technical solutions provided by the embodiments of the present application, and those skilled in the art can know that, with the evolution of technology and the appearance of new application scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.

It will be appreciated by persons skilled in the art that the embodiments of the application are not limited by the illustrations, and that more or fewer steps than those shown may be included, or certain steps may be combined, or different steps may be included.

The system embodiments described above are merely illustrative, in that the units illustrated as separate components may or may not be physically separate, i.e., may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.

The terms "first," "second," "third," "fourth," and the like in the description of the application and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

It should be understood that in the present application, "at least one (item)" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate three cases: only A exists, only B exists, and both A and B exist, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the objects before and after it. "At least one of" or a similar expression means any combination of the listed items, including any combination of single items or plural items. For example, at least one of a, b or c may represent a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b and c may each be single or plural.

In the several embodiments provided by the present application, it should be understood that the disclosed systems and methods may be implemented in other ways. For example, the system embodiments described above are merely illustrative, e.g., the division of the above elements is merely a logical functional division, and there may be additional divisions in actual implementation, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some interface, system or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.

The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium and including multiple instructions that cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods of the various embodiments of the present application. The storage medium includes various media capable of storing programs, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

The preferred embodiments of the present application have been described above with reference to the accompanying drawings, and are not thereby limiting the scope of the claims of the embodiments of the present application. Any modifications, equivalent substitutions and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present application shall fall within the scope of the claims of the embodiments of the present application.

Claims

1. A three-dimensional reconstruction method based on monocular video, characterized in that the method includes the following steps:

The data of the indoor space is collected and processed by a monocular camera to obtain monocular video;

The monocular video is segmented using a sliding window based on the spatial volume of the indoor space to obtain video segments;

The video segment is subjected to local reconstruction processing to obtain a locally reconstructed point cloud;

The locally reconstructed point cloud is subjected to keyframe joint registration processing to obtain the registered scene frame;

Global scene optimization processing is performed on the registered scene frames based on spatial constraints to obtain the 3D reconstruction results;

The process of performing sliding window segmentation on the monocular video based on the spatial volume of the indoor space to obtain video segments includes the following steps:

The volume of the indoor space is calculated based on the monocular video;

The sliding window is initialized based on the spatial volume, and the length of the sliding window is adjusted based on motion blur detection to obtain the target sliding window;

The monocular video is segmented according to the target sliding window to obtain the video segment.

2. The method according to claim 1, characterized in that, the step of performing local reconstruction processing on the video segment to obtain a locally reconstructed point cloud includes the following steps:

The video segment is processed by an image encoder to extract features, and the extracted features are fused by a gated recurrent unit to obtain spatial scene features;

The spatial scene features are fused using a keyframe decoder to obtain multi-view information;

Keyframe information of the spatial scene features is supplemented by a support frame decoder to obtain keyframe information;

Based on spatial locality constraints, bidirectional cross-attention calculation is performed on the multi-view information and the keyframe information to obtain fused features;

The point cloud regression module based on deformable convolution performs regression prediction processing on the fused features to obtain the local reconstructed point cloud.

3. The method according to claim 1, characterized in that, the step of performing keyframe joint registration processing on the locally reconstructed point cloud to obtain a registered scene frame includes the following steps:

Obtain the scene frame buffer pool, which includes historical scene frames;

The local reconstructed point cloud is subjected to coordinate transformation to obtain global point cloud data;

The global point cloud data is registered and retrieved based on the scene frame buffer pool to obtain the registered scene frame.

4. The method according to claim 3, characterized in that, the step of performing registration retrieval processing on the scene frame buffer pool based on the global point cloud data to obtain the registered scene frame includes the following steps:

Based on the global point cloud data, cosine similarity retrieval is performed on each historical scene frame in the scene frame buffer pool to generate a keyframe set;

Spatiotemporal feature alignment processing is performed on the keyframe set to obtain cross-keyframe spatiotemporal features;

Based on the cross-keyframe spatiotemporal features, the keyframe set is subjected to 3D point cloud registration processing to obtain a registered point cloud;

The registered point cloud is subjected to point cloud fusion processing to obtain the registered scene frame.

5. The method according to claim 1, characterized in that, the step of performing global scene optimization processing on the registered scene frame according to spatial constraints to obtain the three-dimensional reconstruction result includes the following steps:

The registered scene frame is subjected to point cloud optimization processing to obtain point cloud optimization data;

The point cloud optimization data is subjected to planar constraint optimization processing to obtain planar optimization data;

The planar optimization data is subjected to spatial topology optimization processing to obtain the three-dimensional reconstruction result.

6. The method according to claim 5, characterized in that, the step of performing spatial topology optimization processing on the planar optimization data to obtain the three-dimensional reconstruction result includes the following steps:

The planar optimization data is subjected to topology construction processing to obtain a scene topology map;

The scene topology graph is iteratively optimized using a graph convolutional network to obtain the 3D reconstruction result.

7. A 3D reconstruction system based on monocular video, characterized in that the system comprises:

The first module is used to acquire and process video of the indoor space through a monocular camera to obtain monocular video;

The second module is used to perform sliding window segmentation on the monocular video according to the spatial volume of the indoor space to obtain video segments;

The third module is used to perform local reconstruction processing on the video segment to obtain a local reconstructed point cloud;

The fourth module is used to perform keyframe joint registration processing on the local reconstructed point cloud to obtain the registered scene frame;

The fifth module is used to perform global scene optimization processing on the registered scene frame according to spatial constraints to obtain the three-dimensional reconstruction result;

The second module is used to perform sliding window segmentation on the monocular video based on the spatial volume of the indoor space to obtain video segments, including:

The volume of the indoor space is calculated based on the monocular video.

8. An electronic device, characterized in that the electronic device includes a memory and a processor, the memory storing a computer program, and the processor executing the computer program to implement the method according to any one of claims 1 to 6.

9. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 6.