CN118505878A - Three-dimensional reconstruction method and system for single-view repetitive object scene

Publication number
CN118505878A
CN118505878A CN202410581762.9A CN202410581762A CN118505878A CN 118505878 A CN118505878 A CN 118505878A CN 202410581762 A CN202410581762 A CN 202410581762A CN 118505878 A CN118505878 A CN 118505878A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410581762.9A
Other languages
Chinese (zh)
Inventor
吕佳俊
丁缘媛
刘勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Application filed by Zhejiang University ZJU
Priority to CN202410581762.9A
Publication of CN118505878A

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 Three-dimensional [3D] image rendering
    • G06T15/04 Texture mapping
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three-dimensional [3D] modelling for computer graphics
    • G06T17/20 Finite element generation, e.g. wire-frame surface description, tesselation
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/40 Analysis of texture
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/90 Determination of colour characteristics
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Graphics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a three-dimensional reconstruction method and system for single-view scenes containing repetitive objects. The method comprises: collecting a single-view scene picture; processing the picture to build a single-instance multi-view framework through semantic segmentation, pose estimation, image encoding and related steps, so that the useful information in the data is fully extracted; constructing a neural radiance field reconstruction network model based on a signed distance function (SDF); and supervising the training of the neural radiance field with several geometric losses and rendering losses. The invention addresses shortcomings of existing neural radiance field techniques and substantially improves both the training efficiency of the reconstruction algorithm and the overall quality of scene reconstruction. It achieves high-fidelity scene reconstruction from a single input image, including accurate geometry and richly detailed textures, removes the dependence on dense views, and opens up new possibilities for applying three-dimensional reconstruction in many fields.

Description

Three-dimensional reconstruction method and system for single-view repetitive object scene
Technical Field
The invention relates to the technical field of three-dimensional reconstruction of single-view images, in particular to a three-dimensional reconstruction method and a three-dimensional reconstruction system of a single-view repeated object scene.
Background
Image-based three-dimensional reconstruction plays a vital role in understanding and interacting with the real three-dimensional world, and opens new possibilities for applications such as Virtual Reality (VR), Augmented Reality (AR), and robotics. Neural Radiance Fields (NeRF) brought a breakthrough to the three-dimensional reconstruction task by adopting volume rendering, markedly improving the geometric detail of reconstructed scenes. In simple scenes captured from dense viewpoints, NeRF can achieve highly accurate reconstructions. However, it struggles with sparse-view input, which is common in real-world applications. In practice, acquiring images from dense viewpoints is often impractical, which makes three-dimensional reconstruction from a single viewpoint an important and challenging task in computer vision. This task aims to recover the three-dimensional geometry and appearance of a scene from the image captured by a single camera alone; the difficulty is that a single view provides limited information, and its inherent uncertainty and ambiguity make accurate reconstruction particularly hard.
To overcome the heavy dependence of neural radiance fields on multi-view information when only a single-view image is available, and to achieve novel view synthesis and high-quality three-dimensional reconstruction of the scene, current research explores ways of incorporating various structural priors into the network model. These methods aim to improve the accuracy of three-dimensional scene reconstruction by using the priors to supervise the network. Although such priors constrain the reconstruction to some extent, they may be biased, and for single-view pictures the ground-truth values of these priors tend to be sparse.
In addition, existing methods tend to ignore the importance of object texture information, and the rich geometric and semantic details contained in textures are critical to reconstruction. Most studies rely solely on camera pose inputs corresponding to images to extract features and take the images as a supervisory source of color truth values, failing to directly extract texture information from the images, resulting in network models failing to fully utilize available information resources. Real world scenes are enriched with repetitive objects and structures, but current approaches fail to take advantage of this to enhance the accuracy and efficiency of scene reconstruction. Therefore, for the scene containing multiple objects widely existing in the real world, under the challenge that dense data is difficult to acquire, development of a strategy capable of rapidly realizing accurate three-dimensional reconstruction with minimum calculation cost is urgently required. The method has important significance for promoting the wide application of NeRF technology in the field of three-dimensional reconstruction and solving the actual application challenges faced by the method.
Disclosure of Invention
The invention aims to provide a neural radiance field three-dimensional reconstruction technique and system designed specifically for scenes that contain repeated objects in a single view. It adopts a single-view multi-instance framework and aims to reconstruct the three-dimensional geometry and texture details of the scene with high fidelity from a single-view scene image alone. The technique not only addresses accurate three-dimensional reconstruction under sparse view information, but also exploits the repeated objects in the scene to improve the quality and efficiency of reconstruction. To achieve the above purpose, the invention adopts the following technical scheme:
in one aspect, the invention provides a three-dimensional reconstruction method of a single-view repetitive object scene, comprising the following steps:
S101, acquiring a single-view scene image I of a scene containing multiple repeated objects through an image acquisition device;
S102, performing semantic segmentation on the single-view scene image I with a pre-trained semantic segmentation model, identifying all object instances in the image, generating a semantic mask image, and extracting multiple view patches of the same object from the image based on the semantic labels to form a picture group;
S103, obtaining the camera pose corresponding to each view patch from the multi-view patch picture group, estimating the camera poses and object poses corresponding to the multi-view images with a structure-from-motion (SfM) method, and thereby obtaining the sampling points along the rays cast through the pixels;
S104, extracting image features with an image encoder, obtaining instance-aligned features from the two-dimensional bounding box of each instance object, and obtaining the pixel-aligned features corresponding to the three-dimensional points;
S105, constructing a neural radiance field network model that predicts a signed distance function (SDF), and representing the three-dimensional geometry as a neural implicit surface defined by the SDF values;
S106, training the neural radiance field network model with the position of each sampling point, the viewing direction of the sampling point, and the pixel and instance features in the multi-view training samples as input, and outputting geometry and appearance predictions through volume rendering;
S107, supervising the network optimization with the geometric loss terms and the photometric rendering loss term, and updating the parameters and weights through back propagation;
S108, based on the trained neural implicit representation, composing the implicit representations using the given two-dimensional and three-dimensional bounding boxes of each object, obtaining a photometrically rendered picture by volume rendering, and obtaining a three-dimensional mesh model for display by converting the SDF values.
Preferably, the pre-trained semantic segmentation model is a Transformer-based pre-trained network that segments all object instances in the scene to obtain semantic segmentation information. Object category information is obtained from the segmentation, N repeated objects of the same category are cropped from the single-view picture, and the single scene image I is thereby converted into a set of object images that show the same object from different viewing angles, from which a single-instance multi-view patch training set is constructed.
Preferably, the step S103 specifically includes: applying the SfM algorithm to the N cropped multi-view pictures to obtain the N corresponding camera poses; and aligning the N virtual cameras, that is, aligning all cameras to a reference coordinate system {P_ref}, to obtain the 6DoF pose of each repeated object.
Preferably, the step S104 specifically includes:
Extracting an image feature F using a CNN-based encoder Enc;
Obtaining the two-dimensional bounding box of the object from the semantic segmentation information, and obtaining the instance-aligned feature F_ins from the corresponding cropped region of the image feature;
For a three-dimensional point x sampled on a ray, projecting x onto the image plane and obtaining the pixel-aligned feature F_pix by linear interpolation of the feature map.
Preferably, the step S105 specifically includes:
Using the signed distance function to obtain the distance from a given point x to the nearest surface, where the sign indicates whether the point lies inside or outside the surface: a negative value means the point is inside the object surface, and a positive value means it is outside;
Implicitly representing the three-dimensional geometric surface by the zero level set of the SDF;
Converting the SDF value of the neural implicit surface into a volume density through a learnable parameter β, so that the neural radiance field can be volume rendered;
The neural implicit radiance field network model consists of two multi-layer perceptrons (MLPs), namely the SDF implicit network f_θ and the rendering network.
Preferably, the step S106 specifically includes:
The SDF implicit network f_θ receives the encoded sampling point positions and the corresponding instance features and pixel features as input, outputs the predicted SDF value and a feature vector for each sampling point, and derives the volume density and the predicted mask value from the SDF value through the learnable parameter ζ;
The rendering network receives the sampling point position, the viewing direction of the sampling point, and the feature vector and normal vector output by the f_θ network as input, and outputs a predicted color value;
By volume rendering, the volume density σ and the color of each sampling point along a ray are integrated to output the network-predicted normal vector, the predicted mask value, and the predicted color value,
where α_i = 1 − exp(−σ_i δ_i) and δ_i denotes the distance between two adjacent sampling points on the ray.
Preferably, the step S107 specifically includes:
Obtaining normal ground-truth values from a pre-trained Omnidata model, and computing a normal consistency loss term L_n against the normals predicted by the network, where the loss consists of the absolute value of the difference between 1 and the dot product of the two normal vectors, plus the L1 norm of the difference between the predicted normal vector and the ground-truth normal vector;
Adding the Eikonal regularization term L_eikonal, which requires the norm of the gradient of the SDF field to remain 1 throughout the field;
Rendering supervision consists of a photometric reconstruction consistency loss and a mask consistency loss;
The photometric reconstruction consistency loss, that is, the color consistency L2 loss term L_rgb, minimizes the squared difference, over all pixel rays, between the color rendered from the network prediction and the color of the corresponding image pixel;
The mask consistency term L_mask measures the difference between the rendered mask and the mask observed by the segmentation network with binary cross entropy (BCE);
The overall loss for jointly optimizing the neural implicit radiance field networks SDF MLP and C MLP consists of the color consistency loss L_rgb, the mask consistency term L_mask, the normal consistency loss term L_n, and the Eikonal regularization term L_eikonal; the weight of each term can be adjusted, with λ_1 = 0.5, λ_2 = 0.5, λ_3 = 0.1:
$L = L_{rgb} + \lambda_1 L_{mask} + \lambda_2 L_n + \lambda_3 L_{eikonal}$
preferably, the step S108 specifically includes:
Determining a three-dimensional boundary box of the object by utilizing the 6DoF pose information of the object, and forming implicit representation of the whole scene by combining the implicit representations;
The scene composed of the implicit representations of the N repeated instances is converted into an explicit mesh by extracting the iso-surface with the Marching Cubes algorithm.
In another aspect, the present invention provides a three-dimensional reconstruction system of a single-view repetitive object scene, comprising:
An image acquisition module, used for capturing an RGB image containing multiple objects from a single view angle, performing the necessary preprocessing and storage, and uploading the image data to the cloud for further processing; this module is deployed at the terminal;
An instance segmentation module, used for performing object semantic segmentation on the image data uploaded by the image acquisition module and generating masks to obtain a multi-view image set of the object; this module is deployed in the cloud;
A pose estimation module, used for executing a structure-from-motion algorithm to recover the camera pose corresponding to each object patch picture and the pose of the object;
An image encoding module, used for extracting image encoding features and obtaining the instance-aligned features and pixel-aligned features;
A neural radiance field reconstruction module, used for training the neural radiance field network model to obtain three-dimensional implicit models of the objects and the scene; this module is deployed in the cloud;
A visualization module, used for displaying the reconstructed model in three dimensions and visualizing the rendered pictures; this module is deployed at the terminal.
Compared with the prior art, the invention has the beneficial effects that:
1. Breaking the dependence on dense data: the invention overcomes the dependence of neural radiance field networks on dense input data, which matters especially when high-quality visible-light pictures are difficult to obtain. By exploiting the repeated instances found in real scenes, the invention builds a multi-view single-instance framework from the repeated patterns within a single picture. This allows a neural radiance field with single-view input to achieve high-fidelity scene reconstruction, with results that can match or even exceed those of conventional multi-view input.
2. Accurate pose estimation: the invention can accurately compute the poses of the multi-view cameras and of the objects even when no camera pose is provided as input. The method is also applicable to multi-view scene picture input, which improves the accuracy and the range of application of pose estimation.
3. Flexible object composition: by introducing an object composition operation, the invention adds flexibility to the three-dimensional reconstruction and rendering process. It supports the synthesis of novel views, such as rotating and translating objects and combining different scenes, which widens the possibilities for scene editing.
4. Improved computational efficiency: compared with conventional neural radiance field reconstruction that requires multi-view pictures, the single-view input method of the invention preserves three-dimensional reconstruction accuracy while markedly shortening network training time, saving computation cost and improving overall computational efficiency.
5. High-fidelity restoration of texture features: by fully mining the texture features of the image and of the instances, and combining several prior regularization techniques, the invention reconstructs texture features with high fidelity while recovering the geometry of objects and scenes.
6. Optimized system architecture: the invention also provides a neural radiance field three-dimensional reconstruction system for single-view repetitive object scenes, so that the proposed reconstruction method can be implemented flexibly. The modules of the system can be combined or used individually. The system makes full use of cloud computing resources while allowing complex three-dimensional reconstruction tasks to be completed efficiently from terminal devices, and provides users with real-time, high-quality three-dimensional scene reconstruction and visualization.
Drawings
Fig. 1 is a flow chart of a neural radiance field three-dimensional reconstruction method for a single-view repetitive object scene according to a preferred embodiment of the present invention.
Fig. 2 is a schematic diagram of a neural radiance field three-dimensional reconstruction system for a single-view repetitive object scene according to a preferred embodiment of the present invention.
Fig. 3 is a flow chart of the preprocessing of network training data provided in a preferred embodiment of the present invention.
FIG. 4 is a diagram of a network model training pipeline provided in a preferred embodiment of the present invention.
Fig. 5 is a diagram showing the effect of an application scenario terminal of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1.
Fig. 1 is a flow chart of a neural radiance field three-dimensional reconstruction method for a single-view repetitive object scene according to a preferred embodiment of the present invention. As shown in fig. 1, the three-dimensional reconstruction method for a single-view repetitive object scene provided by the invention comprises the following steps:
Step S101, obtaining a single-view RGB picture I of a scene with multiple repeated instances through an image acquisition device.
Step S102, performing semantic segmentation on the single-view scene picture with a pre-trained semantic segmentation model, identifying all object instances in the image, and generating a semantic mask image. Multiple view patches of the same object are then extracted from the image based on the semantic labels to form a picture group.
The pre-trained semantic segmentation model is a Transformer-based pre-trained panoptic segmentation network. It takes the single-view scene picture I as input and segments all object instances in the picture to obtain mask information.
Object category information is obtained from the semantic information, N repeated objects of the same category are identified and cropped from the single-view picture, and the single scene image I is thereby converted into a multi-view object image set that shows the same object from different viewing angles.
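The cropping described in step S102 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the segmentation network has already produced an instance-id mask (0 for background), and the names `instance_boxes` and `crop_patches` are hypothetical. A grayscale grid stands in for the RGB picture.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]  # H x W pixel values (stand-in for an RGB picture)
Mask = List[List[int]]   # H x W instance ids, 0 = background (assumed format)
Box = Tuple[int, int, int, int]  # (row_min, col_min, row_max, col_max), inclusive


def instance_boxes(mask: Mask) -> Dict[int, Box]:
    """Compute the tight 2D bounding box of every instance id in the mask."""
    boxes: Dict[int, Box] = {}
    for r, row in enumerate(mask):
        for c, inst in enumerate(row):
            if inst == 0:
                continue
            if inst not in boxes:
                boxes[inst] = (r, c, r, c)
            else:
                r0, c0, r1, c1 = boxes[inst]
                boxes[inst] = (min(r0, r), min(c0, c), max(r1, r), max(c1, c))
    return boxes


def crop_patches(image: Image, mask: Mask) -> Dict[int, Image]:
    """Crop one patch per instance; pixels outside the instance are zeroed."""
    patches: Dict[int, Image] = {}
    for inst, (r0, c0, r1, c1) in instance_boxes(mask).items():
        patches[inst] = [
            [image[r][c] if mask[r][c] == inst else 0 for c in range(c0, c1 + 1)]
            for r in range(r0, r1 + 1)
        ]
    return patches
```

Each returned patch then plays the role of one "view" of the shared object in the single-instance multi-view training set.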
Step S103, obtaining the camera pose corresponding to each view patch from the multi-view patch picture group, and estimating the camera poses and object poses corresponding to the multi-view images with a structure-from-motion (SfM) method, thereby obtaining the sampling points along the rays cast through the pixels.
Feature points are detected in the single-instance multi-view picture set with the Scale-Invariant Feature Transform (SIFT) algorithm, and matching feature points are found across the different images.
The feature matching information is used to obtain the N camera poses corresponding to the N cropped multi-view pictures.
Since there is in fact only one real camera, the N virtual cameras are aligned by bringing all cameras into a reference coordinate system {P_ref}, which yields the 6DoF pose of each repeated object. The 6DoF pose is obtained by the matrix multiplication of pose composition.
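The pose composition can be illustrated with 4x4 homogeneous transforms. The exact formula is not legible in this text, so the sketch below rests on an assumption: each SfM pose P_i is a camera-to-object transform expressed in the canonical frame of the shared object, and the pose of object i in the reference frame is taken as T_i = P_ref · P_i^{-1}. The function names are hypothetical.

```python
from typing import List

Mat4 = List[List[float]]  # 4x4 homogeneous rigid transform [R | t; 0 0 0 1]


def matmul(a: Mat4, b: Mat4) -> Mat4:
    """Plain 4x4 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def rigid_inverse(p: Mat4) -> Mat4:
    """Invert a rigid transform: [R | t]^-1 = [R^T | -R^T t]."""
    rt = [[p[j][i] for j in range(3)] for i in range(3)]
    t = [p[i][3] for i in range(3)]
    new_t = [-sum(rt[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [rt[i] + [new_t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def object_poses(cam_poses: List[Mat4], ref: int = 0) -> List[Mat4]:
    """Assumed composition T_i = P_ref * P_i^-1 for every virtual camera i."""
    p_ref = cam_poses[ref]
    return [matmul(p_ref, rigid_inverse(p)) for p in cam_poses]
```

Under this convention the reference instance receives the identity pose, and every other instance receives the rigid transform that places it relative to the reference.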
After the object poses are obtained, the sampling points along the rays cast through the pixels are computed, giving a set of point positions {X_i} and a set of viewing directions {V_i}.
Step S104, extracting image features with an image encoder, obtaining instance-aligned features from the two-dimensional bounding box of each instance object, and obtaining the pixel-aligned features corresponding to the three-dimensional points.
The feature F of the image I is extracted using a CNN-based encoder Enc;
The two-dimensional bounding box of the object is obtained from the semantic segmentation information, and the instance-aligned feature F_ins is obtained from the corresponding cropped region of the image feature;
For a three-dimensional point x sampled on a ray, x is projected onto the image plane and the pixel-aligned feature F_pix is obtained by linear interpolation of the feature map.
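The pixel-aligned feature lookup can be sketched as a projection followed by bilinear interpolation. This is an illustrative sketch only: the pinhole model with a single focal length, the H x W x C nested-list feature map, and the names `project` and `sample_bilinear` are assumptions, not details taken from the patent.

```python
import math
from typing import List, Sequence, Tuple

FeatureMap = List[List[List[float]]]  # H x W x C feature map from the encoder


def project(x: Sequence[float], focal: float, cx: float, cy: float) -> Tuple[float, float]:
    """Pinhole projection of a camera-frame point (X, Y, Z), Z > 0, to pixel (u, v)."""
    X, Y, Z = x
    return focal * X / Z + cx, focal * Y / Z + cy


def sample_bilinear(fmap: FeatureMap, u: float, v: float) -> List[float]:
    """Bilinearly interpolate the feature map at continuous pixel (u, v)."""
    h, w = len(fmap), len(fmap[0])
    u = min(max(u, 0.0), w - 1.0)  # clamp to the valid pixel range
    v = min(max(v, 0.0), h - 1.0)
    x0, y0 = int(math.floor(u)), int(math.floor(v))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    du, dv = u - x0, v - y0
    return [
        (1 - du) * (1 - dv) * fmap[y0][x0][c]
        + du * (1 - dv) * fmap[y0][x1][c]
        + (1 - du) * dv * fmap[y1][x0][c]
        + du * dv * fmap[y1][x1][c]
        for c in range(len(fmap[0][0]))
    ]
```

For a sampled point x, `sample_bilinear(F, *project(x, focal, cx, cy))` would play the role of F_pix under these assumptions.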
Step S105, constructing a neural radiance field network model that predicts a signed distance function (SDF).
The SDF is a continuous function that, for a given point x, returns the distance to the nearest surface. Its sign indicates whether the point is inside or outside the surface: negative values lie inside the object surface and positive values lie outside.
The three-dimensional geometric surface is implicitly represented by the zero level set of the SDF.
The SDF value of the neural implicit surface is converted into a volume density through a learnable parameter β, so that the neural radiance field can be volume rendered.
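The conversion formula itself is not legible in this text. One common choice in SDF-based radiance fields, used here purely as an assumption, is the Laplace-CDF mapping popularized by VolSDF, in which β controls how sharply the density falls off around the surface. The function name `sdf_to_density` is hypothetical.

```python
import math


def sdf_to_density(sdf: float, beta: float) -> float:
    """Assumed VolSDF-style mapping: density = (1/beta) * LaplaceCDF_beta(-sdf).

    Outside the surface (sdf >= 0) the density decays exponentially;
    inside (sdf < 0) it saturates towards 1/beta.
    """
    if beta <= 0.0:
        raise ValueError("beta must be positive")
    if sdf >= 0.0:
        return 0.5 * math.exp(-sdf / beta) / beta
    return (1.0 - 0.5 * math.exp(sdf / beta)) / beta
```

With the sign convention of the text (negative inside, positive outside), a smaller β concentrates the density in a thinner shell around the zero level set.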
As shown in fig. 4, the neural implicit radiance field network model consists of two multi-layer perceptrons (MLPs), namely the SDF implicit network f_θ and the rendering network. Each MLP is composed of an input layer, a hidden layer and an output layer, and a coarse-to-fine sampling strategy is used in which details and texture are restored after the complete geometry has been obtained.
Step S106, training the neural radiance field network model with the position of each sampling point, the viewing direction of the sampling point, and the pixel and instance features in the multi-view training samples as input, and outputting geometry and appearance predictions through volume rendering.
The SDF implicit network f_θ receives the encoded sampling point positions and the corresponding instance features and pixel features as input, outputs the predicted SDF value and a feature vector for each sampling point, and derives the volume density and the predicted mask value from the SDF value through the learnable parameter ζ.
The rendering network receives the sampling point position, the viewing direction of the sampling point, and the feature vector and normal vector output by the f_θ network as input, and outputs a predicted color value.
By volume rendering, the volume density σ and the color of each sampling point along a ray are integrated to output the network-predicted normal vector, the predicted mask value, and the predicted color value,
where α_i = 1 − exp(−σ_i δ_i) and δ_i denotes the distance between two adjacent sampling points on the ray.
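The integration along a ray can be sketched with the standard alpha-compositing quadrature used by NeRF-style methods. Only the definition of α_i comes from the text; the accumulated transmittance T_i = ∏_{j<i} (1 − α_j) and the use of the summed weights as the mask value are assumptions, and `render_ray` is a hypothetical name.

```python
import math
from typing import List, Sequence, Tuple


def render_ray(
    sigmas: Sequence[float],
    colors: Sequence[Sequence[float]],
    deltas: Sequence[float],
) -> Tuple[List[float], float]:
    """Composite per-sample densities and colors into a pixel color and opacity.

    alpha_i = 1 - exp(-sigma_i * delta_i), weight_i = T_i * alpha_i,
    where T_i is the transmittance accumulated before sample i.
    """
    transmittance = 1.0
    rgb = [0.0, 0.0, 0.0]
    opacity = 0.0  # sum of weights, used here as the rendered mask value
    for sigma, color, delta in zip(sigmas, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        for k in range(3):
            rgb[k] += weight * color[k]
        opacity += weight
        transmittance *= 1.0 - alpha
    return rgb, opacity
```

Per-sample normals could be accumulated with the same weights to obtain the rendered normal vector.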
Step S107, supervising the network optimization with the geometric loss terms and the photometric rendering loss term, and updating the parameters and weights through back propagation.
Volume rendering yields the geometry and appearance predictions output by the network. To remove ambiguity when recovering the three-dimensional shape, monocular normal cues are used to guide the geometric optimization: normal ground-truth values are obtained from a pre-trained Omnidata model, and a normal consistency loss term L_n is computed against the normals predicted by the network. The loss consists of the absolute value of the difference between 1 and the dot product of the two normal vectors, plus the L1 norm of the difference between the predicted normal vector and the ground-truth normal vector.
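Written out from the verbal description above, the normal consistency loss would take roughly the following form. The symbols are assumptions introduced for illustration: $\hat{N}(r)$ is the rendered normal for ray $r$, $\bar{N}(r)$ is the Omnidata normal for the same pixel, and $\mathcal{R}$ is the set of sampled rays.

```latex
L_n = \sum_{r \in \mathcal{R}}
      \left\| \hat{N}(r) - \bar{N}(r) \right\|_1
      + \left| 1 - \hat{N}(r)^{\top} \bar{N}(r) \right|
```

The first term penalizes component-wise deviation, while the second penalizes angular deviation and vanishes when the two unit normals are parallel.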
To ensure that the neural radiance field defines a valid signed distance field and that the normal vectors remain correct, an Eikonal regularization term L_eikonal is added, which requires the norm of the gradient of the SDF field to remain 1 throughout the field.
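A common way to express this constraint, given here as an assumed form rather than the patent's exact formula, penalizes the squared deviation of the gradient norm from 1 over a set $\mathcal{X}$ of sampled points:

```latex
L_{eikonal} = \sum_{x \in \mathcal{X}}
              \left( \left\| \nabla f_{\theta}(x) \right\|_2 - 1 \right)^2
```

A true signed distance function satisfies $\| \nabla f(x) \|_2 = 1$ almost everywhere, so the term is zero exactly when $f_\theta$ behaves like a distance field at the sampled points.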
Rendering supervision complements the three-dimensional supervision: the implicit network and the rendering network are optimized jointly, so that the three-dimensional shape prior of the object is learned effectively and pixel-level texture detail is captured. Rendering supervision consists of a photometric reconstruction consistency loss and a mask consistency loss.
The photometric reconstruction consistency loss, that is, the color consistency L2 loss term L_rgb, minimizes the squared difference, over all pixel rays, between the color rendered from the network prediction and the color of the corresponding image pixel.
The mask consistency term L_mask measures the difference between the rendered mask and the mask observed by the segmentation network with binary cross entropy (BCE).
The total loss for jointly optimizing the neural implicit radiance field networks SDF MLP and C MLP consists of the color consistency loss L_rgb, the mask consistency term L_mask, the normal consistency loss term L_n, and the Eikonal regularization term L_eikonal. The weight of each term can be adjusted; the technical scheme of the invention sets λ_1 = 0.5, λ_2 = 0.5, λ_3 = 0.1:
$L = L_{rgb} + \lambda_1 L_{mask} + \lambda_2 L_n + \lambda_3 L_{eikonal}$
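The rendering losses and the weighted sum can be sketched as below. The weights 0.5, 0.5 and 0.1 come from the text; the averaging over rays, the clamping constant `eps`, and the function names are assumptions made for this illustration.

```python
import math
from typing import Sequence


def rgb_loss(rendered: Sequence[Sequence[float]], observed: Sequence[Sequence[float]]) -> float:
    """Mean over rays of the squared L2 distance between rendered and observed colors."""
    total = 0.0
    for c_hat, c in zip(rendered, observed):
        total += sum((a - b) ** 2 for a, b in zip(c_hat, c))
    return total / len(rendered)


def mask_loss(rendered: Sequence[float], observed: Sequence[float], eps: float = 1e-7) -> float:
    """Mean binary cross entropy between rendered opacities and observed masks."""
    total = 0.0
    for p, m in zip(rendered, observed):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= m * math.log(p) + (1.0 - m) * math.log(1.0 - p)
    return total / len(rendered)


def total_loss(l_rgb: float, l_mask: float, l_n: float, l_eikonal: float,
               lam1: float = 0.5, lam2: float = 0.5, lam3: float = 0.1) -> float:
    """L = L_rgb + lam1 * L_mask + lam2 * L_n + lam3 * L_eikonal."""
    return l_rgb + lam1 * l_mask + lam2 * l_n + lam3 * l_eikonal
```

In an actual training loop these scalars would be differentiable tensors so that back propagation can update both MLPs; plain floats are used here only to show the arithmetic.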
Steps S106 to S107 are executed in a loop until training is finished, after which step S108 visualizes the result.
Step S108, according to the trained neural implicit representation, the implicit representation is combined by giving a two-dimensional boundary box and a three-dimensional boundary box of each object, the luminosity rendering picture visualization is obtained by volume rendering, and the three-dimensional grid model is displayed by SDF value conversion.
After training of the neural implicit network has converged, the network has built an implicit representation for each individual instance. The 6DoF pose of each object is used to determine its three-dimensional bounding box, and the per-instance implicit representations are combined into an implicit representation of the entire scene. Photometric rendering of the scene is achieved by volume rendering, which further enhances the realism of the visual presentation.
In addition, for a scene composed of the implicit representations of N repeated instances, the implicit surface of the scene is converted into an explicit mesh by extracting an iso-surface with the Marching Cubes algorithm.
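One way to picture the combination of per-instance implicit representations into a scene is sketched below. The text does not state how the representations are merged, so the pointwise minimum (a union of signed distance fields) is an assumption, and `sphere_sdf` is only a hypothetical stand-in for a trained instance network.

```python
import numpy as np

def sphere_sdf(p, radius=0.5):
    """Stand-in for a trained per-instance SDF network; negative inside."""
    return np.linalg.norm(p, axis=-1) - radius

def scene_sdf(points, object_poses, instance_sdf=sphere_sdf):
    """Union of N posed copies of one instance SDF.

    points: (M, 3) query points in scene coordinates.
    object_poses: (N, 4, 4) rigid object-to-scene transforms (6DoF poses).
    """
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)  # (M, 4)
    per_instance = []
    for pose in object_poses:
        local = (np.linalg.inv(pose) @ homo.T).T[:, :3]  # scene -> object frame
        per_instance.append(instance_sdf(local))
    return np.min(np.stack(per_instance, axis=0), axis=0)  # union = pointwise min

poses = np.stack([np.eye(4), np.eye(4)])
poses[1, :3, 3] = [2.0, 0.0, 0.0]  # second copy shifted along x
queries = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
values = scene_sdf(queries, poses)
```

Sampling `scene_sdf` on a regular grid would give the volume from which a Marching Cubes implementation extracts the zero level set as a mesh.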
Example 2.
Fig. 2 is a schematic diagram of a neural radiance field three-dimensional reconstruction system for a single-view repetitive object scene according to a preferred embodiment of the present invention. As shown in Fig. 2, the three-dimensional reconstruction system for a single-view repetitive object scene provided by the present invention specifically includes:
Image acquisition module 201: deployed at the terminal. Single-view pictures of a multi-instance scene are captured and stored by picture acquisition equipment such as a handheld camera, and are subsequently uploaded to the instance segmentation module.
Scenes filled with identical objects are common in the real world, for example automobiles of the same model, tables and chairs in a classroom, or the same beverage on a supermarket shelf. The image acquisition equipment may be a handheld portable device with a shooting function, such as a mobile phone or a camera, or a hardware platform with an optical image acquisition function, such as an unmanned aerial vehicle or a robot.
Three-dimensional reconstruction total module 202: deployed on a cloud server. It receives the pictures uploaded by the image acquisition module and, after training of the neural radiance field is complete, outputs the rendered pictures and the three-dimensional model to the visualization module. The three-dimensional reconstruction total module comprises an instance segmentation module, a pose estimation module, an image coding module and a neural radiance field reconstruction module. The cloud server is specifically an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor; the processor includes a GPU and a CPU and implements the three-dimensional reconstruction method of steps S102 to S107 when executing the program.
The data flow for preprocessing the network training data is shown in Fig. 3. The instance segmentation module obtains the masks of the multiple identical instances in the single view through a pre-trained segmentation model, crops out a multi-view picture set of the single instance, and uploads it to the pose estimation module and the image coding module respectively. The pose estimation module receives the multi-view picture set uploaded by the instance segmentation module and recovers the camera pose and object pose of each patch image with a structure-from-motion algorithm, yielding the positions and view directions of the sampling points on the pixel rays. The image coding module receives the single-view scene picture and the two-dimensional bounding boxes of the objects in the scene from the instance segmentation module, obtains picture feature encodings through a pre-trained image encoder, and supplies instance alignment features and pixel alignment features according to the region of each sampling point.
The network model training pipeline is shown in Fig. 4. The neural radiance field reconstruction module consists of a signed distance field implicit network MLP and a rendering network MLP. The signed distance field implicit network MLP takes the position information, the instance alignment feature and the pixel alignment feature as inputs and outputs a predicted SDF value; the rendering network MLP takes the position information, the view direction and the feature vector output by the signed distance field implicit network MLP as inputs and outputs a predicted color value. The loss function is evaluated through volume rendering, and the network weights and parameters are adjusted by back propagation.
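The input and output wiring of the two networks can be illustrated with a shape-level sketch. Everything numeric here is hypothetical: the layer widths, the feature sizes `F_INS`, `F_PIX` and `F_GEO`, the random initialization and the sigmoid on the color output are assumptions, and positional encoding of the inputs is omitted for brevity.

```python
import numpy as np

def mlp(x, layers):
    """Plain ReLU MLP; the last layer is linear."""
    for i, (w, b) in enumerate(layers):
        x = x @ w + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x

def init(sizes, rng):
    """Random weights for consecutive layer sizes (illustration only)."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

rng = np.random.default_rng(0)
F_INS, F_PIX, F_GEO = 8, 8, 16  # hypothetical feature widths
sdf_net = init([3 + F_INS + F_PIX, 32, 1 + F_GEO], rng)   # position + features
color_net = init([3 + 3 + F_GEO, 32, 3], rng)             # position + view + feature

def forward(x, view_dir, f_ins, f_pix):
    """Return the predicted SDF (M, 1) and color (M, 3) for M sample points."""
    out = mlp(np.concatenate([x, f_ins, f_pix], axis=-1), sdf_net)
    sdf, feat = out[..., :1], out[..., 1:]
    logits = mlp(np.concatenate([x, view_dir, feat], axis=-1), color_net)
    rgb = 1.0 / (1.0 + np.exp(-logits))  # squash colors into (0, 1)
    return sdf, rgb
```

The point of the sketch is the data dependency: the rendering network consumes the feature vector produced by the signed distance field network, so both are trained by the same back-propagated loss.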
Visualization module 203: deployed at the terminal. After the network has converged, it visualizes the explicit three-dimensional mesh and the photometric rendering of the picture.
The client terminal is an operable electronic display device. It provides a user interface for object composition operations, gives flexibility in reconstruction and rendering, and is suitable for novel view synthesis for holistic scene understanding and for three-dimensional scene editing, such as object rotation, translation and composition into different scenes, offering users greater scene editability. Fig. 5 shows the final effect in an application scenario.
Although the invention herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present invention. It is therefore to be understood that numerous modifications may be made to the illustrative embodiments and that other arrangements may be devised without departing from the spirit and scope of the present invention as defined by the appended claims. It should be understood that the different dependent claims and the features described herein may be combined in ways other than as described in the original claims. It is also to be understood that features described in connection with separate embodiments may be used in other described embodiments.

Claims (10)

1. A three-dimensional reconstruction method for a single-view repetitive object scene, characterized by comprising the following steps:
S101, acquiring a single-view scene image I of a scene containing multiple repeated objects through image acquisition equipment;
S102, performing semantic segmentation on the single-view scene image I through a pre-trained semantic segmentation model, identifying all object instances in the image, generating a semantic mask image, and extracting a plurality of view patches of the same object in the image based on semantic labels to form a group of pictures;
S103, obtaining, from the multi-view patch picture group, the camera pose information corresponding to each view patch picture, and estimating the camera pose and object pose information corresponding to the multi-view images by a structure-from-motion (SfM) method, so as to obtain sampling point information along the pixel projection rays;
S104, extracting image features through an image encoder, obtaining instance alignment features according to the two-dimensional bounding box of the instance object, and simultaneously obtaining pixel alignment features corresponding to the three-dimensional points;
S105, constructing a neural radiance field network model capable of predicting a signed distance function (SDF), and representing three-dimensional geometry by a neural implicit surface defined by SDF values;
S106, training the neural radiance field network model by taking the position of each sampling point, the view direction of the sampling point, and the pixel and instance features in the multi-view training samples as input, and outputting a geometric prediction and an appearance prediction through volume rendering;
S107, supervising the network optimization training through the geometric loss terms and the photometric rendering loss terms, and updating parameters and weights through back propagation;
S108, according to the trained neural implicit representation, combining the implicit representations using the given two-dimensional and three-dimensional bounding boxes of each object, obtaining a photometric rendering for visualization by volume rendering, and displaying a three-dimensional mesh model by converting the SDF values.
2. The three-dimensional reconstruction method of a single-view repetitive object scene according to claim 1, wherein the pre-trained semantic segmentation model is a Transformer-based pre-trained network that segments all object instances in the scene to obtain semantic segmentation information; object category information is obtained from the semantic segmentation information, the N repeated objects of the same category are cropped from the single-view picture, and the single scene image I is converted into a set of object images that capture the same object from different views, thereby constructing a single-instance multi-view picture patch training set.
3. The three-dimensional reconstruction method of a single-view repetitive object scene according to claim 2, wherein the step S103 is specifically: using the SfM algorithm, obtaining the N camera poses corresponding to the N multi-view cropped pictures; aligning the N virtual cameras so that all cameras are aligned to a reference coordinate system {P_ref}; and obtaining the 6DoF poses of the repeated objects:
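The alignment idea can be sketched under stated assumptions, since the pose formula itself is not reproduced in this text. The sketch assumes that SfM returns one camera-to-object transform per cropped patch, as if N virtual cameras observed a single canonical object, and that the scene frame is chosen as the object frame of the reference instance. The function names and the convention are hypothetical.

```python
import numpy as np

def rigid(angle_z, translation):
    """4x4 rigid transform: rotation about z followed by a translation."""
    c, s = np.cos(angle_z), np.sin(angle_z)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

def object_poses_from_virtual_cameras(cam_to_obj, ref_index=0):
    """Turn N virtual camera poses into N object poses in one scene frame.

    cam_to_obj: (N, 4, 4) camera-to-object transforms, one per cropped patch.
    Returns (N, 4, 4) object-to-scene transforms.
    """
    ref = cam_to_obj[ref_index]
    # The N virtual cameras are really one physical camera, so a point X on
    # instance i sits at inv(P_i) @ X in camera coordinates; mapping that
    # through the reference camera pose gives scene coordinates.
    return np.stack([ref @ np.linalg.inv(p) for p in cam_to_obj])

virtual_cams = np.stack([rigid(0.0, [0.0, 0.0, 3.0]),
                         rigid(0.4, [1.0, 0.0, 3.0]),
                         rigid(-0.7, [-1.0, 0.5, 2.5])])
object_poses = object_poses_from_virtual_cameras(virtual_cams)
```

Under this convention the reference instance receives the identity pose, and every other instance is expressed relative to it.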
4. The three-dimensional reconstruction method of a single-view repetitive object scene according to claim 1, wherein the step S104 is specifically:
extracting an image feature F using a CNN-based encoder Enc;
obtaining the two-dimensional bounding box of the object according to the semantic segmentation information, and obtaining the instance alignment feature F_ins from the corresponding cropped region of the image feature;
for a three-dimensional point x sampled on a ray, projecting x onto the image plane and obtaining the pixel alignment feature F_pix by linear interpolation of the feature map.
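The pixel alignment step of claim 4 can be sketched as a projection followed by interpolation. The sketch assumes a pinhole camera with intrinsic matrix `K`, a feature map at the same resolution as the image, and bilinear interpolation as the concrete form of the linear interpolation named above; all function names are hypothetical.

```python
import numpy as np

def project(x_cam, K):
    """Pinhole projection of a 3D point in camera coordinates to pixel (u, v)."""
    uvw = K @ x_cam
    return uvw[:2] / uvw[2]

def bilinear_sample(feature_map, u, v):
    """Bilinearly interpolate an (H, W, C) feature map at continuous pixel (u, v)."""
    h, w, _ = feature_map.shape
    u = float(np.clip(u, 0, w - 1))  # clamp to the map borders
    v = float(np.clip(v, 0, h - 1))
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    u1, v1 = min(u0 + 1, w - 1), min(v0 + 1, h - 1)
    du, dv = u - u0, v - v0
    top = (1 - du) * feature_map[v0, u0] + du * feature_map[v0, u1]
    bottom = (1 - du) * feature_map[v1, u0] + du * feature_map[v1, u1]
    return (1 - dv) * top + dv * bottom

K = np.eye(3)                                             # toy intrinsics
feature_map = np.array([[[0.0], [1.0]], [[2.0], [3.0]]])  # (2, 2, 1)
u, v = project(np.array([1.0, 1.0, 2.0]), K)              # -> (0.5, 0.5)
f_pix = bilinear_sample(feature_map, u, v)
```

Here the projected point falls midway between the four feature cells, so `f_pix` is their average.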
5. The three-dimensional reconstruction method of a single-view repetitive object scene according to claim 1, wherein the step S105 is specifically:
obtaining the distance from a given point x to the nearest surface by the signed distance function, wherein the sign indicates whether the point lies inside or outside the surface: a negative value indicates that the point lies inside the object surface, and a positive value indicates that it lies outside;
implicitly representing the three-dimensional geometric surface by the zero level set of the SDF;
converting the SDF values of the neural implicit surface representation into volume density through a learnable parameter β, so that the neural radiance field can be volume rendered, the conversion relation being as follows:
The neural implicit radiance field network model consists of two multi-layer perceptrons (MLPs), namely an SDF implicit network f_θ and a rendering network.
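The conversion relation of claim 5 is not reproduced in this text. One widely used choice that matches the description (a single learnable parameter β, negative SDF inside the object) is a Laplace cumulative distribution scaled by 1/β; the sketch below uses that form purely as an assumption and does not claim it is the relation used by the invention.

```python
import numpy as np

def sdf_to_density(sdf, beta):
    """Map signed distance to volume density with a Laplace CDF (assumed form).

    Density tends to 1/beta deep inside the object (sdf < 0), equals
    0.5/beta on the surface, and decays to 0 outside.
    """
    s = -np.asarray(sdf, dtype=float)  # positive inside the object
    cdf = np.where(
        s <= 0,
        0.5 * np.exp(np.minimum(s, 0.0) / beta),         # outside branch
        1.0 - 0.5 * np.exp(-np.maximum(s, 0.0) / beta),  # inside branch
    )
    return cdf / beta

densities = sdf_to_density(np.array([-1.0, 0.0, 1.0]), beta=0.1)
```

A smaller β concentrates the density in a thinner shell around the zero level set, which is why it is natural to treat β as a learnable parameter.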
6. The method for three-dimensional reconstruction of a single-view repetitive object scene according to claim 5, wherein the step S106 is specifically:
The SDF implicit network f_θ receives the encoded sampling point positions and the corresponding instance features and pixel features as input, outputs the predicted SDF values and feature vectors of the sampling points, and obtains the volume density and mask prediction values through the learnable parameter ζ:
The rendering network receives the sampling point position, the sampling point view direction, and the feature vector and normal vector output by the f_θ network as input, and outputs a predicted color value.
Through volume rendering, the volume density σ and the color of each sampling point on the ray are integrated to output the network-predicted normal vector value, the predicted mask value and the predicted color value:
where $\alpha_i = 1 - \exp(-\sigma_i \delta_i)$, and $\delta_i$ denotes the distance between two adjacent sampling points on the ray.
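The integration along a ray can be sketched with the usual discrete alpha compositing. Only the definition of α_i comes from the claim; the transmittance product, the use of the same weights for color, normal and mask, and the function name are assumptions, since the rendering formulas are not reproduced in this text.

```python
import numpy as np

def render_ray(sigma, deltas, colors, normals):
    """Composite per-sample values along one ray.

    sigma: (S,) densities; deltas: (S,) distances between adjacent samples;
    colors, normals: (S, 3) per-sample predictions.
    """
    alpha = 1.0 - np.exp(-sigma * deltas)  # alpha_i = 1 - exp(-sigma_i * delta_i)
    # Transmittance T_i = prod_{j < i} (1 - alpha_j): light surviving to sample i.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    color = weights @ colors
    normal = weights @ normals
    mask = weights.sum()  # accumulated opacity of the ray
    return color, normal, mask

sigma = np.array([0.0, 1e9, 5.0])  # empty space, opaque surface, hidden sample
deltas = np.ones(3)
colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
normals = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
color, normal, mask = render_ray(sigma, deltas, colors, normals)
```

In this toy ray the second sample is fully opaque, so it alone determines the rendered color and normal, and the sample behind it contributes nothing.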
7. The three-dimensional reconstruction method of a single-view repetitive object scene according to claim 1, wherein the step S107 is specifically:
obtaining normal ground-truth values through a pre-trained Omnidata model, and computing a normal consistency loss term L_n against the normals predicted by the network, wherein the loss term consists of the absolute value of the difference between 1 and the dot product of the two normal vectors, plus the L1 norm of the difference between the predicted normal vector and the ground-truth normal vector:
adding the Eikonal regularization term L_eikonal, which requires the norm of the gradient of the SDF field to remain 1 throughout the field:
rendering supervision consists of a photometric reconstruction consistency loss and a mask consistency loss;
the photometric reconstruction consistency loss, i.e. the color consistency L2 loss term L_rgb, is defined as the squared difference between the rendered color and the network-predicted color; minimizing it supervises the consistency between the two for all pixel rays:
the difference between the rendered mask and the observed mask is measured with binary cross entropy (BCE) and used as the mask consistency term L_mask:
the total loss for jointly optimizing the SDF MLP and the C MLP of the neural implicit radiance field consists of the color consistency loss L_rgb, the mask consistency term L_mask, the normal consistency loss term L_n and the Eikonal regularization term L_eikonal, the weight of each term being adjustable, where λ1 = 0.5, λ2 = 0.5 and λ3 = 0.1:
$$L = L_{rgb} + \lambda_1 L_{mask} + \lambda_2 L_n + \lambda_3 L_{eikonal}$$
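The two geometric terms of claim 7 can be sketched directly from their verbal definitions. Averaging over rays or points and squaring the Eikonal deviation are assumptions, because the formulas are not reproduced in this text; the function names are hypothetical.

```python
import numpy as np

def normal_consistency_loss(pred_normals, ref_normals):
    """L1 difference plus |1 - dot product| per ray, averaged; inputs (R, 3) unit vectors."""
    l1 = np.sum(np.abs(pred_normals - ref_normals), axis=-1)
    angular = np.abs(1.0 - np.sum(pred_normals * ref_normals, axis=-1))
    return float(np.mean(l1 + angular))

def eikonal_loss(sdf_gradients):
    """Penalize deviation of the SDF gradient norm from 1; input (M, 3)."""
    norms = np.linalg.norm(sdf_gradients, axis=-1)
    return float(np.mean((norms - 1.0) ** 2))
```

For a predicted normal pointing exactly opposite to the reference, both parts of the normal term reach their maximum of 2, so the loss is 4; for identical normals it is 0.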
8. The method for three-dimensional reconstruction of a single-view repetitive object scene according to claim 7, wherein the step S108 is specifically:
determining the three-dimensional bounding box of each object using its 6DoF pose information, and forming an implicit representation of the whole scene by combining the per-instance implicit representations;
for a scene composed of the implicit representations of N repeated instances, converting the implicit surface of the scene into an explicit mesh by extracting an iso-surface with the Marching Cubes algorithm.
9. A three-dimensional reconstruction system for a single view repetitive object scene, comprising:
an image acquisition module, configured to capture an RGB image containing multiple objects from a single view angle, perform the necessary preprocessing and storage, and upload the image data to the cloud modules for further processing, the module being deployed at the terminal;
an instance segmentation module, configured to perform object semantic segmentation on the image data uploaded by the image acquisition module and generate masks to obtain a multi-view image set of the object, the module being deployed in the cloud;
a pose estimation module, configured to execute a structure-from-motion algorithm to recover the camera pose and the object pose corresponding to each object patch picture;
an image coding module, configured to obtain image encoding features and derive instance alignment features and pixel alignment features;
a neural radiance field reconstruction module, configured to train the neural radiance field network model to obtain three-dimensional implicit models of the objects and the scene, the module being deployed in the cloud;
a visualization module, configured to display the reconstructed model in three dimensions and visualize rendered pictures, the module being deployed at the terminal.
10. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, performs three-dimensional scene reconstruction according to the method of any one of claims 1 to 8.
CN202410581762.9A 2024-05-11 2024-05-11 Three-dimensional reconstruction method and system for single-view repetitive object scene Pending CN118505878A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410581762.9A CN118505878A (en) 2024-05-11 2024-05-11 Three-dimensional reconstruction method and system for single-view repetitive object scene

Publications (1)

Publication Number Publication Date
CN118505878A (en) 2024-08-16

Family

ID=92244269

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410581762.9A Pending CN118505878A (en) 2024-05-11 2024-05-11 Three-dimensional reconstruction method and system for single-view repetitive object scene

Country Status (1)

Country Link
CN (1) CN118505878A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118691756A (en) * 2024-08-26 2024-09-24 中国人民解放军国防科技大学 Monocular semantic reconstruction method based on implicit neural representation
CN119399116A (en) * 2024-09-29 2025-02-07 武汉大学 A method and device for inverse rendering based on fusion semantic segmentation considering shadows
CN119444961A (en) * 2024-10-17 2025-02-14 中国科学院自动化研究所 Method and device for generating texture of three-dimensional scene, readable storage medium, and computer program product
CN119478172A (en) * 2024-11-12 2025-02-18 电子科技大学 A generalizable neural radiance field reconstruction method based on patch extraction
CN119478172B (en) * 2024-11-12 2025-11-18 电子科技大学 A generalizable neural radiation field reconstruction method based on patch extraction
CN119559329A (en) * 2024-11-18 2025-03-04 杭州电子科技大学 An implicit surface human body modeling method based on weighted masking strategy
CN119669601A (en) * 2024-12-13 2025-03-21 中国科学技术大学先进技术研究院 VOCs fluid three-dimensional reconstruction method and device based on neural fluid field
CN120354472A (en) * 2025-06-23 2025-07-22 虹软科技股份有限公司 Two-dimensional clothing platemaking diagram generation method and device and computer equipment
CN121074126A (en) * 2025-07-08 2025-12-05 北京智源人工智能研究院 Six-dimensional pose detection method and device for object and electronic equipment
CN121366256A (en) * 2025-12-23 2026-01-20 北京土小豆在线科技有限公司 Three-dimensional modeling method, equipment and storage medium based on neural implicit surface reconstruction
CN121366256B (en) * 2025-12-23 2026-04-21 北京土小豆在线科技有限公司 Three-dimensional modeling method, equipment and storage medium based on neural implicit surface reconstruction
CN121459341A (en) * 2026-01-06 2026-02-03 深圳数生科技有限公司 Scene semantic understanding system based on 3D Gaussian splatter model

Similar Documents

Publication Publication Date Title
CN118505878A (en) Three-dimensional reconstruction method and system for single-view repetitive object scene
EP4036863B1 (en) Human body model reconstruction method and reconstruction system, and storage medium
Zhu et al. Ponderv2: Pave the way for 3d foundation model with a universal pre-training paradigm
US11961266B2 (en) Multiview neural human prediction using implicit differentiable renderer for facial expression, body pose shape and clothes performance capture
CN111950477B (en) A single-image three-dimensional face reconstruction method based on video supervision
CN113628348B (en) A method and device for determining viewpoint paths in three-dimensional scenes
WO2023159517A1 (en) System and method of capturing three-dimensional human motion capture with lidar
CN116051740A (en) A 3D reconstruction method and system for an outdoor unbounded scene based on a neural radiation field
CN110458939A (en) Indoor scene modeling method based on perspective generation
CN117994480A (en) A lightweight hand reconstruction and driving method
CN120976449B (en) A method and system for cross-source data 3D reconstruction based on improved Gaussian sputtering
KR20230150867A (en) Multi-view neural person prediction using implicit discriminative renderer to capture facial expressions, body posture geometry, and clothing performance
Kang et al. Competitive learning of facial fitting and synthesis using uv energy
Han et al. Ro-map: Real-time multi-object mapping with neural radiance fields
Jeon et al. Struct-mdc: Mesh-refined unsupervised depth completion leveraging structural regularities from visual slam
US12051168B2 (en) Avatar generation based on driving views
US20250265781A1 (en) Method and system for recovering a three-dimensional human mesh in camera space
US20220392099A1 (en) Stable pose estimation with analysis by synthesis
CN120707630A (en) A posture recognition algorithm and application system for arbitrary objects under a monocular camera
CN117218246A (en) Training method, device, electronic equipment and storage medium for image generation model
Shin et al. Canonicalfusion: Generating drivable 3d human avatars from multiple images
CN117392228A (en) Visual odometry calculation method, device, electronic equipment and storage medium
CN121353500A (en) A method and related equipment for 3D reconstruction of underwater scenes based on 3D Gaussian sputtering
CN119991937B (en) A single-view 3D human body reconstruction method based on Gaussian surface elements
CN115497029A (en) Video processing method, device and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination