CN118505878A

CN118505878A - Three-dimensional reconstruction method and system for single-view repetitive object scene

Info

Publication number: CN118505878A
Application number: CN202410581762.9A
Authority: CN
Inventors: 吕佳俊; 丁缘媛; 刘勇
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2024-05-11
Filing date: 2024-05-11
Publication date: 2024-08-16

Abstract

The invention discloses a three-dimensional reconstruction method and system for a single-view repetitive object scene. The method comprises the following steps: collecting a single-view scene picture; processing the single-view scene picture to realize a single-instance multi-view framework, including semantic segmentation, pose estimation, image encoding, and other processing steps, so that the effective information in the data is fully extracted; constructing a neural radiance field reconstruction network model based on a signed distance function (SDF); and supervising the training of the neural radiance field through multiple geometric losses and rendering losses. The invention addresses shortcomings of existing neural radiance field techniques and improves both the training efficiency of the reconstruction algorithm and the overall quality of scene reconstruction. It achieves high-fidelity scene reconstruction from a single-view image input, including accurate geometry and richly detailed textures, and removes the dependence on dense views, opening up new possibilities for applying three-dimensional reconstruction in many fields.

Description

Three-dimensional reconstruction method and system for single-view repetitive object scene

Technical Field

The invention relates to the technical field of three-dimensional reconstruction of single-view images, in particular to a three-dimensional reconstruction method and a three-dimensional reconstruction system of a single-view repeated object scene.

Background

Image-based three-dimensional reconstruction plays a vital role in understanding and interacting with the real three-dimensional world, and enables applications in fields such as virtual reality (VR), augmented reality (AR), and robotics. Neural Radiance Fields (NeRF) brought a breakthrough to the three-dimensional reconstruction task by adopting volume rendering, markedly improving the geometric detail of reconstructed scenes. In simple scenes captured from dense viewpoints, NeRF achieves highly accurate reconstructions. However, it struggles with sparse-view input, which is common in real-world applications. In practice, acquiring image data from dense viewpoints is often infeasible, which makes three-dimensional reconstruction from a single viewpoint a challenging task in computer vision. This task aims to recover the three-dimensional geometry and appearance of a scene from images captured by a single camera alone; the difficulty is that a single view provides limited information, and the inherent uncertainties and ambiguities make accurate reconstruction particularly hard.

To reduce the heavy dependence of neural radiance fields on multi-view information when only a single-view image is available, and to achieve novel view synthesis and high-quality three-dimensional reconstruction of the scene, current research explores incorporating various structural priors into the network model. These methods aim to guide the three-dimensional scene reconstruction process toward greater accuracy by supervising the network with such priors. Although the priors can constrain the reconstruction process to some extent, they may be biased, and for single-view pictures the ground-truth values of these priors tend to be sparse.

In addition, existing methods tend to overlook object texture information, although the rich geometric and semantic details contained in textures are critical to reconstruction. Most studies rely solely on the camera poses corresponding to the images to extract features and use the images only as a supervisory source of ground-truth color; they do not extract texture information directly from the images, so the network model fails to fully use the available information. Real-world scenes are rich in repetitive objects and structures, but current approaches do not exploit this to improve the accuracy and efficiency of scene reconstruction. Therefore, for the multi-object scenes that are common in the real world, and given that dense data is difficult to acquire, a strategy that achieves accurate three-dimensional reconstruction quickly and at minimal computational cost is urgently needed. Such a strategy is important for promoting the wide application of NeRF in three-dimensional reconstruction and for solving the practical challenges it faces.

Disclosure of Invention

The invention aims to provide a neural radiance field three-dimensional reconstruction technique and system designed specifically for scenes that contain repeated objects in a single view. It adopts a single-view multi-instance framework and aims to reconstruct the three-dimensional geometry and texture details of the scene with high fidelity, relying only on a single-view scene image. The technique addresses accurate three-dimensional reconstruction under sparse view information and exploits the repeated objects in a scene to improve the quality and efficiency of reconstruction. To achieve the above purpose, the technical scheme of the invention is as follows:

In one aspect, the invention provides a three-dimensional reconstruction method for a single-view repetitive object scene, comprising the following steps:

S101, acquiring, through an image acquisition device, a single-view scene image I of a scene containing multiple repeated objects;

S102, performing semantic segmentation on the single-view scene image I with a pre-trained semantic segmentation model, identifying all object instances in the image, generating a semantic mask image, and extracting multiple view patches of the same object from the image based on the semantic labels to form a group of pictures;

S103, obtaining the camera pose corresponding to each view patch picture from the multi-view patch picture group, estimating the camera poses and object poses corresponding to the multi-view images with a structure-from-motion (SfM) method, and thereby obtaining the sampling point information of the rays cast through the pixels;

S104, extracting image features with an image encoder, obtaining instance-aligned features according to the two-dimensional bounding box of each instance object, and obtaining the pixel-aligned features corresponding to the three-dimensional points;

S105, constructing a neural radiance field network model that predicts a signed distance function (SDF), and representing the three-dimensional geometry as a neural implicit surface of SDF values;

S106, training the neural radiance field network model with the position of each sampling point, the viewing direction of the sampling point, and the pixel and instance features in the multi-view training samples as input, and outputting the geometry prediction and the appearance prediction through volume rendering;

S107, supervising the network optimization with the geometric loss terms and the photometric rendering loss terms, and updating the parameters and weights through back propagation;

S108, based on the trained neural implicit representation, combining the implicit representations according to the given two-dimensional and three-dimensional bounding boxes of each object, obtaining photometric rendered pictures through volume rendering for visualization, and converting the SDF values into an explicit three-dimensional mesh model.

Preferably, the pre-trained semantic segmentation model is a Transformer-based pre-trained network that segments all object instances in the scene to obtain semantic segmentation information. Object category information is obtained from the semantic segmentation information, the N repeated objects of the same category are cropped from the single-view picture, and the single scene image I is converted into a set of object images that show the same object from different viewing angles, thereby constructing a single-instance multi-view picture patch training set.

Preferably, the step S103 specifically includes: using the SfM algorithm to obtain the N camera poses corresponding to the N multi-view cropped pictures; aligning the N virtual cameras so that all cameras are expressed in a reference coordinate system $\{P^{ref}\}$; and obtaining the 6DoF pose of each repeated object.

Preferably, the step S104 specifically includes:

Extracting an image feature F using a CNN-based encoder Enc;

Obtaining the two-dimensional bounding box of the object from the semantic segmentation information, and obtaining the instance-aligned feature $F_{ins}$ from the corresponding cropped region of the image feature;

For a three-dimensional point x sampled on a ray, projecting x onto the image plane and obtaining the pixel-aligned feature $F_{pix}$ by linear interpolation of the feature map.
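The pixel-aligned feature lookup can be illustrated with a minimal sketch. It assumes a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, a feature map with the same resolution as the image, and bilinear interpolation as the concrete form of the "linear interpolation" mentioned above; none of these details are specified in the text.

```python
def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame 3D point (x, y, z) to pixel (u, v)."""
    x, y, z = point
    return fx * x / z + cx, fy * y / z + cy


def bilinear_sample(feature_map, u, v):
    """Interpolate feature_map[row][col] (a list of channels) at pixel (u, v)."""
    h, w = len(feature_map), len(feature_map[0])
    # Clamp so that the 2x2 neighbourhood stays inside the map.
    u = min(max(u, 0.0), w - 1.0)
    v = min(max(v, 0.0), h - 1.0)
    c0, r0 = int(u), int(v)
    c1, r1 = min(c0 + 1, w - 1), min(r0 + 1, h - 1)
    du, dv = u - c0, v - r0
    channels = len(feature_map[0][0])
    return [
        (1 - dv) * ((1 - du) * feature_map[r0][c0][k] + du * feature_map[r0][c1][k])
        + dv * ((1 - du) * feature_map[r1][c0][k] + du * feature_map[r1][c1][k])
        for k in range(channels)
    ]
```

A sampled point would first be passed through `project`, and the resulting `(u, v)` would then be used to read the feature with `bilinear_sample`.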

Preferably, the step S105 specifically includes:

Using the signed distance function to obtain the distance from a given point x to the nearest surface, where the sign indicates whether the point lies inside or outside the surface: a negative value indicates that the point is inside the object surface, and a positive value indicates that it is outside;

implicitly representing the three-dimensional geometric surface by using the zero level set of SDF values;

converting the SDF values of the neural implicit surface representation into volume density through a learnable parameter β, so that the neural radiance field can be volume rendered.

The neural implicit radiance field network model consists of two multi-layer perceptrons (MLPs), namely the SDF implicit network $f_\theta$ and a rendering network.

Preferably, the step S106 specifically includes:

The SDF implicit network $f_\theta$ receives the encoded sampling point positions and the corresponding instance features and pixel features as input, and outputs the predicted SDF value and a feature vector for each sampling point; the volume density and the mask prediction are obtained from the SDF value through the learnable parameter ζ.

The rendering network receives the sampling point position, the viewing direction of the sampling point, and the feature vector and normal vector output by the $f_\theta$ network as input, and outputs a predicted color value.

Through volume rendering, the volume density $\sigma$ and the color of each sampling point on a ray are integrated to output the network's predicted normal vector, predicted mask value, and predicted color value,

where $\alpha_i = 1 - \exp(-\sigma_i \delta_i)$ and $\delta_i$ denotes the distance between two adjacent sampling points on the ray.
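As an illustration of how the per-sample opacity $\alpha_i$ enters the integration, the sketch below uses the standard NeRF-style discrete compositing rule, in which each sample is weighted by $T_i \alpha_i$ with accumulated transmittance $T_i = \prod_{j<i}(1 - \alpha_j)$. The weighting rule and the function name `composite` are assumptions; the text only gives the definition of $\alpha_i$.

```python
import math


def composite(sigmas, deltas, colors):
    """Composite per-sample densities and RGB colors along one ray."""
    transmittance = 1.0  # fraction of light that reaches the current sample
    weights = []
    for sigma, delta in zip(sigmas, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this sample
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    rgb = [sum(w * c[k] for w, c in zip(weights, colors)) for k in range(3)]
    opacity = sum(weights)  # accumulated weight, usable as a mask value
    return rgb, opacity, weights
```

The same weights could be reused to accumulate per-sample normal vectors into a rendered normal for the ray.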

Preferably, the step S107 specifically includes:

Obtaining ground-truth normals from a pre-trained Omnidata model and computing a normal consistency loss term $L_n$ against the normals predicted by the network, where the loss term consists of the absolute value of the difference between the dot product of the two normal vectors and 1, plus the L1 norm of the difference between the predicted normal vector and the ground-truth normal vector;

Adding the Eikonal regularization term $L_{eikonal}$, which requires the norm of the gradient of the SDF field to remain 1 throughout the field;

Rendering supervision consists of a photometric reconstruction consistency loss and a mask consistency loss;

The photometric reconstruction consistency loss, i.e. the color consistency L2 loss term $L_{rgb}$, is defined as the squared difference between the rendered color and the predicted color, and supervises the consistency between the rendered color and the network-predicted color over all pixel rays;

The difference between the rendered mask and the observed mask is measured by binary cross entropy (BCE) and serves as the mask consistency term $L_{mask}$;

The total loss for jointly optimizing the SDF MLP and the C MLP of the neural implicit radiance field consists of the color consistency loss $L_{rgb}$, the mask consistency term $L_{mask}$, the normal consistency loss term $L_n$, and the Eikonal regularization term $L_{eikonal}$. The weight of each term can be adjusted; here $\lambda_1 = 0.5$, $\lambda_2 = 0.5$, $\lambda_3 = 0.1$:

$$L = L_{rgb} + \lambda_1 L_{mask} + \lambda_2 L_n + \lambda_3 L_{eikonal}$$
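The individual loss terms are described in words only. One plausible way to write them, following those verbal descriptions, is given below. The symbols $\hat{N}$, $\bar{N}$, $\hat{C}$, $C$, $\hat{M}$, $M$, the ray set $\mathcal{R}$, and the point set $\mathcal{X}$ are notation introduced here for illustration, and the choice of summing over rays and points is an assumption.

```latex
\begin{aligned}
L_n &= \sum_{r \in \mathcal{R}} \left( \left\lVert \hat{N}(r) - \bar{N}(r) \right\rVert_1
      + \left\lvert 1 - \hat{N}(r)^{\top} \bar{N}(r) \right\rvert \right) \\
L_{eikonal} &= \sum_{x \in \mathcal{X}} \left( \left\lVert \nabla f_\theta(x) \right\rVert_2 - 1 \right)^2 \\
L_{rgb} &= \sum_{r \in \mathcal{R}} \left\lVert \hat{C}(r) - C(r) \right\rVert_2^2 \\
L_{mask} &= \sum_{r \in \mathcal{R}} \mathrm{BCE}\!\left( \hat{M}(r),\, M(r) \right)
\end{aligned}
```

Here $\hat{N}$, $\hat{C}$, and $\hat{M}$ stand for the volume-rendered normal, color, and mask of a ray, $\bar{N}$ for the normal supplied by the Omnidata model, and $C$ and $M$ for the reference color and mask.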

Preferably, the step S108 specifically includes:

Determining the three-dimensional bounding box of each object from its 6DoF pose information, and forming an implicit representation of the whole scene by combining the per-instance implicit representations;

Converting the scene composed of the implicit representations of the N repeated instances into an explicit mesh by extracting an iso-surface with the Marching Cubes algorithm.

In another aspect, the present invention provides a three-dimensional reconstruction system of a single-view repetitive object scene, comprising:

An image acquisition module, configured to capture RGB images containing multiple objects from a single view angle, perform the necessary preprocessing and storage, and upload the image data to the cloud for further processing; this module is deployed at the terminal;

An instance segmentation module, configured to perform object semantic segmentation on the image data uploaded by the image acquisition module and generate masks to obtain a multi-view image set of the object; this module is deployed in the cloud;

A pose estimation module, configured to run a structure-from-motion algorithm to recover the camera pose corresponding to each object patch picture and the pose of the object;

An image encoding module, configured to obtain the image encoding features and derive the instance-aligned features and pixel-aligned features;

A neural radiance field reconstruction module, configured to train the neural radiance field network model and obtain three-dimensional implicit models of the object and the scene; this module is deployed in the cloud;

A visualization module, configured to display the reconstructed model in three dimensions and visualize the rendered pictures; this module is deployed at the terminal.

Compared with the prior art, the invention has the beneficial effects that:

1. Breaking the dependence on dense data: the invention overcomes the dependence of neural radiance field networks on dense data input, which matters especially when high-quality visible-light pictures are difficult to obtain. By exploiting the repeated instances found in real scenes and the repeated patterns within a single picture, the invention constructs a multi-view single-instance framework. This enables a neural radiance field with single-view input to achieve high-fidelity scene reconstruction, with results that can even surpass those of conventional multi-view input.

2. Accurate pose estimation: the invention can accurately calculate the pose of the multi-view camera and the object even under the condition of unknown camera pose input. In addition, the method is suitable for processing multi-view scene picture input, and the accuracy and the application range of pose estimation are remarkably improved.

3. Flexible object composition operation: by introducing object synthesis operation, the invention provides unprecedented flexibility for three-dimensional reconstruction and rendering processes, supports the synthesis of innovative views, such as rotation and translation of objects and the combination of different scenes, and greatly widens the possibility of scene editing.

4. Improved computational efficiency: compared with conventional neural radiance field reconstruction, which requires multi-view pictures, the single-view input method of the invention maintains three-dimensional reconstruction accuracy while markedly shortening network training time, saving computational cost and improving overall efficiency.

5. High fidelity restoration of texture features: by fully mining the texture features of images and examples and combining various prior regularization techniques, the invention reconstructs the texture features with high fidelity while recovering the geometric shapes of objects and scenes.

6. Optimized system architecture: the invention also provides a neural radiance field three-dimensional reconstruction system for single-view repetitive object scenes, so that the proposed reconstruction method can be implemented flexibly. The modules of the system can be combined or used individually. The system makes full use of the computing resources of the cloud while allowing complex three-dimensional reconstruction tasks to be completed efficiently from terminal devices, and provides users with real-time, high-quality three-dimensional scene reconstruction and visualization.

Drawings

Fig. 1 is a flow chart of the neural radiance field three-dimensional reconstruction method for a single-view repetitive object scene according to a preferred embodiment of the present invention.

Fig. 2 is a schematic diagram of the neural radiance field three-dimensional reconstruction system for a single-view repetitive object scene according to a preferred embodiment of the present invention.

Fig. 3 is a flow chart of the network's training data preprocessing according to a preferred embodiment of the present invention.

Fig. 4 is a diagram of the network model training pipeline according to a preferred embodiment of the present invention.

Fig. 5 is a diagram showing the effect at the application scenario terminal of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Example 1.

Fig. 1 is a flow chart of the neural radiance field three-dimensional reconstruction method for a single-view repetitive object scene according to a preferred embodiment of the present invention. As shown in Fig. 1, the three-dimensional reconstruction method for a single-view repetitive object scene provided by the invention comprises the following steps:

Step S101, acquiring, through an image acquisition device, a single-view RGB picture I of a scene containing multiple repeated instances.

Step S102, performing semantic segmentation on the single-view scene picture with a pre-trained semantic segmentation model, identifying all object instances in the image, and generating a semantic mask image. Multiple view patches of the same object are extracted from the image based on the semantic labels to form a group of pictures.

The pre-trained semantic segmentation model is a Transformer-based pre-trained panoptic segmentation network. The single-view scene picture I is input, and the network segments all object instances in the picture to obtain mask information.

Object category information is obtained from the semantic information, the N repeated objects of the same category are identified and cropped from the single-view picture, and the single scene image I is converted into a multi-view object image set that shows the same object from different viewing angles.
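The cropping step can be sketched as follows. The sketch assumes the segmentation result is available as a per-pixel instance-id map in which 0 marks the background, and that each patch is the axis-aligned bounding box of one instance; both the data layout and the function names are hypothetical.

```python
def instance_bboxes(mask):
    """Return {instance_id: (row_min, col_min, row_max, col_max)} for ids != 0."""
    boxes = {}
    for r, row in enumerate(mask):
        for c, inst in enumerate(row):
            if inst == 0:  # 0 is assumed to mean background
                continue
            r0, c0, r1, c1 = boxes.get(inst, (r, c, r, c))
            boxes[inst] = (min(r0, r), min(c0, c), max(r1, r), max(c1, c))
    return boxes


def crop_patches(image, mask):
    """Crop one patch per instance from an image stored as image[row][col]."""
    patches = {}
    for inst, (r0, c0, r1, c1) in instance_bboxes(mask).items():
        patches[inst] = [row[c0:c1 + 1] for row in image[r0:r1 + 1]]
    return patches
```

The resulting patches of the N instances of one category then play the role of N views of a single object.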

Step S103, the camera pose corresponding to each view patch picture is obtained from the multi-view patch picture group, and the camera poses and object poses corresponding to the multi-view images are estimated with a structure-from-motion (SfM) method, thereby obtaining the sampling point information of the rays cast through the pixels.

Feature points are detected in the single-instance multi-view picture set with the Scale-Invariant Feature Transform (SIFT) algorithm, and matching feature points are found across the different images.

The feature point matching information is used to obtain the N camera poses corresponding to the N multi-view cropped pictures.

Since there is in fact only one real camera, the N virtual cameras are aligned by expressing all cameras in a reference coordinate system $\{P^{ref}\}$, which yields the 6DoF pose of each repeated object; specifically, the 6DoF pose is obtained by the matrix multiplication of pose composition.
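The pose composition formula itself is not reproduced in the text. One convention that is consistent with the description is sketched below: if each virtual camera pose $P_i$ is a 4x4 camera-to-world matrix expressed in the object's canonical frame, then the pose of object i in the reference frame can be written as $T_i = P^{ref} P_i^{-1}$. This convention and the helper names are assumptions, not the patent's stated formula.

```python
def matmul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def rigid_inverse(t):
    """Inverse of a rigid transform [R | t], computed as [R^T | -R^T t]."""
    rt = [[t[j][i] for j in range(3)] for i in range(3)]
    trans = [-sum(rt[i][k] * t[k][3] for k in range(3)) for i in range(3)]
    return [rt[i] + [trans[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def object_poses(cam_to_world, ref=0):
    """Object pose T_i = P_ref @ inv(P_i) for every virtual camera pose P_i."""
    p_ref = cam_to_world[ref]
    return [matmul(p_ref, rigid_inverse(p)) for p in cam_to_world]
```

Under this convention the reference instance receives the identity pose, and every other instance receives the rigid motion that carries the canonical object onto it.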

After the object pose information is obtained, the sampling point information of the rays cast through the pixels is obtained, namely the set of point positions $\{X_i\}$ and the set of viewing directions $\{V_i\}$.
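A minimal sketch of how sampling point positions and viewing directions could be generated for one pixel is given below. It assumes a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, a 4x4 camera-to-world pose, and uniform sampling between assumed `near` and `far` bounds; the text does not specify the sampling scheme.

```python
import math


def pixel_ray(u, v, fx, fy, cx, cy, cam_to_world):
    """Return (origin, unit direction) of the ray through pixel (u, v)."""
    d_cam = [(u - cx) / fx, (v - cy) / fy, 1.0]  # direction in the camera frame
    d = [sum(cam_to_world[i][k] * d_cam[k] for k in range(3)) for i in range(3)]
    norm = math.sqrt(sum(x * x for x in d))
    origin = [cam_to_world[i][3] for i in range(3)]
    return origin, [x / norm for x in d]


def sample_along_ray(origin, direction, near, far, n_samples):
    """Place n_samples points at the centres of equal depth bins in [near, far]."""
    step = (far - near) / n_samples
    depths = [near + (i + 0.5) * step for i in range(n_samples)]
    return [[o + t * d for o, d in zip(origin, direction)] for t in depths]
```

The returned points correspond to the positions $X_i$, and the ray direction to the viewing direction $V_i$ shared by all samples on that ray.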

Step S104, extracting image features with an image encoder, obtaining instance-aligned features according to the two-dimensional bounding box of each instance object, and obtaining the pixel-aligned features corresponding to the three-dimensional points.

The feature F of the image I is extracted using a CNN-based encoder Enc.

Step S105, constructing a neural radiance field network model that predicts signed distance function values.

The SDF is a continuous function that, for a given point x, yields the distance to the nearest surface; its sign indicates whether the point is inside or outside the surface, where negative values indicate the inside of the object surface and positive values the outside.

The three-dimensional geometric surface is implicitly represented by the zero level set of the SDF.

The SDF values of the neural implicit surface representation are converted into volume density through a learnable parameter β, so that the neural radiance field can be volume rendered.
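The conversion formula is not reproduced in the text. A common choice in SDF-based neural radiance fields is the Laplace-CDF mapping used by VolSDF, which is sketched below as an assumption: density approaches $1/\beta$ deep inside the object, equals $1/(2\beta)$ on the surface, and decays to zero outside. The analytic sphere SDF is only a stand-in for the learned network.

```python
import math


def sphere_sdf(p, radius=1.0):
    """Analytic SDF of a sphere: negative inside, zero on the surface, positive outside."""
    return math.sqrt(sum(x * x for x in p)) - radius


def sdf_to_density(s, beta):
    """Laplace-CDF mapping from a signed distance s to a volume density."""
    if s <= 0:  # inside or on the surface
        return (1.0 - 0.5 * math.exp(s / beta)) / beta
    return 0.5 * math.exp(-s / beta) / beta  # outside
```

A smaller β concentrates the density more tightly around the zero level set, which is why it is typically treated as a learnable parameter.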

As shown in Fig. 4, the neural implicit radiance field network model consists of two multi-layer perceptrons (MLPs), namely the SDF implicit network $f_\theta$ and a rendering network. Each MLP consists of an input layer, hidden layers, and an output layer. This realizes a coarse-to-fine sampling strategy, in which details and textures are restored after the complete geometry has been obtained.

Step S106, training the neural radiance field network model with the position of each sampling point, the viewing direction of the sampling point, and the pixel and instance features in the multi-view training samples as input, and outputting the geometry prediction and the appearance prediction through volume rendering.

The SDF implicit network $f_\theta$ receives the encoded sampling point positions and the corresponding instance features and pixel features as input, and outputs the predicted SDF value and a feature vector for each sampling point; the volume density and the mask prediction are obtained from the SDF value through the learnable parameter ζ.
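The text does not say how the sampling point positions are encoded. The sketch below assumes the sinusoidal frequency encoding commonly used with NeRF-style networks and shows how the encoded position could be concatenated with the instance and pixel features to form the network input; `num_freqs` and both function names are hypothetical.

```python
import math


def positional_encoding(p, num_freqs=4):
    """Append sin/cos of the coordinates at frequencies 2^k * pi to the raw position."""
    out = list(p)
    for k in range(num_freqs):
        freq = (2.0 ** k) * math.pi
        out += [math.sin(freq * x) for x in p]
        out += [math.cos(freq * x) for x in p]
    return out


def build_sdf_input(position, instance_feature, pixel_feature, num_freqs=4):
    """Concatenate the encoded position with the instance and pixel features."""
    return (positional_encoding(position, num_freqs)
            + list(instance_feature) + list(pixel_feature))
```

For a three-dimensional position and four frequencies, the encoded position has 3 + 3 x 2 x 4 = 27 entries before the features are appended.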

Step S107, supervising the network optimization with the geometric loss terms and the photometric rendering loss terms, and updating the parameters and weights through back propagation.

Volume rendering yields the geometry and appearance predictions output by the network. To eliminate ambiguity in recovering the three-dimensional shape, monocular normal cues are used to aid geometric optimization: ground-truth normals are obtained from a pre-trained Omnidata model, and a normal consistency loss term $L_n$ is computed against the normals predicted by the network. The loss term consists of the absolute value of the difference between the dot product of the two normal vectors and 1, plus the L1 norm of the difference between the predicted normal vector and the ground-truth normal vector.
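The two parts of the normal consistency term can be sketched directly from the verbal description. Averaging over rays, rather than summing, and the function name are assumptions; both normal sets are taken to be unit vectors.

```python
def normal_consistency_loss(pred_normals, ref_normals):
    """Mean over rays of the L1 difference plus |1 - dot product| of unit normals."""
    total = 0.0
    for n_hat, n_ref in zip(pred_normals, ref_normals):
        l1 = sum(abs(a - b) for a, b in zip(n_hat, n_ref))
        angular = abs(1.0 - sum(a * b for a, b in zip(n_hat, n_ref)))
        total += l1 + angular
    return total / len(pred_normals)
```

The L1 part penalizes component-wise deviation, while the dot-product part penalizes angular deviation, so the term is zero only when the predicted and reference normals coincide.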

To ensure that the neural radiance field is a valid signed distance field and that the normal vectors remain correct, an Eikonal regularization term $L_{eikonal}$ is added, which requires the norm of the gradient of the SDF field to remain 1 throughout the field.
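The Eikonal constraint can be checked numerically on any scalar field. The sketch below uses central finite differences in place of the automatic differentiation that a training framework would normally provide, and an analytic sphere SDF in place of the learned network; both substitutions, and the averaging over points, are assumptions made for illustration.

```python
import math


def sphere_sdf(p, radius=1.0):
    """Analytic sphere SDF, used here as a stand-in for the learned network."""
    return math.sqrt(sum(x * x for x in p)) - radius


def numerical_gradient(f, p, eps=1e-4):
    """Central finite-difference gradient of a scalar field f at point p."""
    grad = []
    for i in range(len(p)):
        hi, lo = list(p), list(p)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


def eikonal_loss(f, points):
    """Mean squared deviation of the gradient norm from 1 over the sample points."""
    total = 0.0
    for p in points:
        g = numerical_gradient(f, p)
        total += (math.sqrt(sum(x * x for x in g)) - 1.0) ** 2
    return total / len(points)
```

A true distance field such as `sphere_sdf` gives a loss close to zero, whereas a scaled field such as `2 * sphere_sdf` has gradient norm 2 and is penalized.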

Rendering supervision complements the three-dimensional supervision. The implicit network and the rendering network are optimized jointly, so that the three-dimensional shape prior of the object is learned effectively and pixel-level detail textures are captured. Rendering supervision consists of a photometric reconstruction consistency loss and a mask consistency loss.

The photometric reconstruction consistency loss, i.e. the color consistency L2 loss term $L_{rgb}$, is defined as the squared difference between the rendered color and the predicted color, and supervises the consistency between the rendered color and the network-predicted color over all pixel rays.

The total loss for jointly optimizing the SDF MLP and the C MLP of the implicit neural radiance field consists of the color consistency loss $L_{rgb}$, the mask consistency term $L_{mask}$, the normal consistency loss term $L_n$, and the Eikonal regularization term $L_{eikonal}$. The weight of each term can be adjusted; in the technical scheme of the invention, $\lambda_1 = 0.5$, $\lambda_2 = 0.5$, $\lambda_3 = 0.1$:

$$L = L_{rgb} + \lambda_1 L_{mask} + \lambda_2 L_n + \lambda_3 L_{eikonal}$$
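The weighted sum can be sketched together with simple versions of the two rendering losses. The weights 0.5, 0.5, and 0.1 come from the text; the averaging over rays, the clamping constant `eps`, and the function names are assumptions.

```python
import math


def rgb_l2(pred, ref):
    """Mean over rays of the squared L2 distance between two RGB colors."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, r))
               for p, r in zip(pred, ref)) / len(pred)


def mask_bce(pred, target, eps=1e-7):
    """Mean binary cross entropy between rendered and reference mask values."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(pred)


def total_loss(l_rgb, l_mask, l_n, l_eik, lam1=0.5, lam2=0.5, lam3=0.1):
    """L = L_rgb + lam1 * L_mask + lam2 * L_n + lam3 * L_eikonal."""
    return l_rgb + lam1 * l_mask + lam2 * l_n + lam3 * l_eik
```

With the stated weights, the Eikonal term contributes one fifth as strongly as the mask and normal terms, so it acts as a mild regularizer rather than a dominant objective.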

Steps S106 to S107 are executed in a loop until training is finished, after which step S108 is entered to visualize the results.

Step S108, based on the trained neural implicit representation, the implicit representations are combined according to the given two-dimensional and three-dimensional bounding boxes of each object, photometric rendered pictures are obtained through volume rendering for visualization, and the SDF values are converted into an explicit three-dimensional mesh model.

After neural implicit network training is completed and converged, the network builds an implicit representation for each individual instance. The 6DoF pose information of the object is utilized to determine its three-dimensional bounding box and by combining these implicit representations, an implicit representation of the entire scene is formed. The photometric rendering of the scene is realized through the volume rendering technology, and the sense of reality of visual presentation is further enhanced.
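One simple way to combine per-instance implicit representations is sketched below: a scene point is mapped into each instance's canonical frame with the inverse of its 6DoF pose, the shared instance SDF is evaluated there, and the minimum over instances is taken as the scene SDF (a union of shapes). The use of a pointwise minimum is an assumption, since the text only states that the representations are combined, and the analytic sphere stands in for the trained network.

```python
import math


def sphere_sdf(p, radius=0.5):
    """Stand-in for the trained instance SDF, defined in the canonical object frame."""
    return math.sqrt(sum(x * x for x in p)) - radius


def to_object_frame(p, object_to_scene):
    """Apply the inverse rigid transform R^T (p - t) of a 4x4 object pose."""
    d = [p[i] - object_to_scene[i][3] for i in range(3)]
    return [sum(object_to_scene[k][i] * d[k] for k in range(3)) for i in range(3)]


def scene_sdf(p, instance_sdf, object_poses):
    """Union of all placed instances: the minimum SDF over the instance poses."""
    return min(instance_sdf(to_object_frame(p, pose)) for pose in object_poses)
```

Evaluating `scene_sdf` on a regular grid would give the scalar volume from which an iso-surface at level zero can be extracted. Editing a pose in `object_poses` moves the corresponding instance without retraining.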

In addition, for a scene composed of the implicit representations of the N repeated instances, the implicit surface of the scene is converted into an explicit mesh by extracting an iso-surface with the Marching Cubes algorithm.

Example 2.

Fig. 2 is a schematic diagram of the neural radiance field three-dimensional reconstruction system for a single-view repetitive object scene according to a preferred embodiment of the present invention. As shown in Fig. 2, the three-dimensional reconstruction system for a single-view repetitive object scene provided by the present invention specifically includes:

The image acquisition module 201: deployed at the terminal. Single-view pictures of multi-instance scenes are acquired and stored through a picture acquisition device such as a handheld camera, and are subsequently uploaded to the instance segmentation module.

The real world is full of scenes that contain identical objects, such as cars of the same model, the tables and chairs in a classroom, or identical beverages on a supermarket shelf. The image acquisition device can be a handheld portable device with a camera function, such as a mobile phone or a camera, or a hardware platform capable of acquiring optical image information, such as an unmanned aerial vehicle or a robot.

The three-dimensional reconstruction master module 202: deployed in the server cloud. It receives the pictures uploaded by the image acquisition module and, after training of the neural radiance field is complete, outputs the rendered pictures and the explicit three-dimensional model to the visualization module. The master module comprises the instance segmentation module, the pose estimation module, the image encoding module, and the neural radiance field reconstruction module. The server cloud is specifically an electronic device comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, where the processor includes a GPU and a CPU that implement steps S102 to S107 of the three-dimensional reconstruction method when the program is executed.

A flow chart of the network's training data preprocessing is shown in Fig. 3. The instance segmentation module obtains the masks of multiple identical instances in the single view through a pre-trained segmentation model, crops out the multi-view picture set of the single instance, and uploads it to the pose estimation module and the image encoding module respectively. The pose estimation module receives the multi-view picture set uploaded by the instance segmentation module and recovers the camera pose and object pose of each patch picture with a structure-from-motion algorithm, obtaining the positions and viewing directions of the sampling points on the pixel rays. The image encoding module receives the single-view scene picture and the two-dimensional bounding boxes of the objects in the scene uploaded by the instance segmentation module, obtains the picture feature encoding through the pre-trained image encoder, and provides the instance-aligned features and pixel-aligned features for the sampling point regions.

The network model training pipeline is shown in Fig. 4. The neural radiance field reconstruction module consists of a signed distance field implicit network MLP and a rendering network MLP. The signed distance field implicit network MLP takes the position information, the instance-aligned features, and the pixel-aligned features as input and produces the predicted SDF value; the rendering network MLP takes the position information, the viewing direction, and the feature vector output by the signed distance field implicit network MLP as input and outputs the predicted color value. The loss function is evaluated through volume rendering, and the network weights and parameters are adjusted by back propagation.

The visualization module 203: deployed at the terminal. After the network has converged, it visualizes the explicit three-dimensional mesh and the photometric rendered pictures.

The client terminal is an operable electronic display device. It provides object composition operations through a user interface, which gives flexibility to reconstruction and rendering and suits novel view synthesis for holistic scene understanding and three-dimensional scene editing, such as rotating and translating objects and compositing different scenes, offering users more scene editability. Fig. 5 shows the effect at the application scenario terminal.

Although the invention herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present invention. It is therefore to be understood that numerous modifications may be made to the illustrative embodiments and that other arrangements may be devised without departing from the spirit and scope of the present invention as defined by the appended claims. It should be understood that the different dependent claims and the features described herein may be combined in ways other than as described in the original claims. It is also to be understood that features described in connection with separate embodiments may be used in other described embodiments.

Claims

1. The three-dimensional reconstruction method of the single-view repetitive object scene is characterized by comprising the following steps of:

2. The three-dimensional reconstruction method of a single-view repetitive object scene according to claim 1, wherein the pre-trained semantic segmentation model is a Transformer-based pre-trained network that segments all object instances in the scene to obtain semantic segmentation information, object category information is obtained from the semantic segmentation information, the N repeated objects of the same category are cropped from the single-view picture, and the single scene image I is converted into a set of object images that show the same object from different viewing angles, thereby constructing a single-instance multi-view picture patch training set.

3. The three-dimensional reconstruction method of a single-view repetitive object scene according to claim 2, wherein the step S103 is specifically: using the SfM algorithm to obtain the N camera poses corresponding to the N multi-view cropped pictures; aligning the N virtual cameras so that all cameras are expressed in a reference coordinate system $\{P^{ref}\}$; and obtaining the 6DoF pose of each repeated object.

4. The three-dimensional reconstruction method of a single-view repetitive object scene according to claim 1, wherein the step S104 is specifically:

Extracting an image feature F using a CNN-based encoder Enc;

5. The three-dimensional reconstruction method of a single-view repetitive object scene according to claim 1, wherein the step S105 is specifically:

The neural implicit radiance field network model consists of two multi-layer perceptrons (MLPs), namely an SDF implicit network $f_\theta$ and a rendering network.

6. The method for three-dimensional reconstruction of a single-view repetitive object scene according to claim 5, wherein the step S106 is specifically:

7. The three-dimensional reconstruction method of a single-view repetitive object scene according to claim 1, wherein the step S107 is specifically:

$$L = L_{rgb} + \lambda_1 L_{mask} + \lambda_2 L_n + \lambda_3 L_{eikonal}$$

8. The method for three-dimensional reconstruction of a single-view repetitive object scene according to claim 7, wherein the step S108 is specifically:

9. A three-dimensional reconstruction system for a single view repetitive object scene, comprising:

10. An electronic device comprising a memory, a processor and a computer program stored on the memory, the program being executable on the processor for performing a method of three-dimensional reconstruction, characterized in that the reconstruction of a three-dimensional scene can be performed according to any of the methods described in claims 1 to 8 when the program is run by the processor.