CN115761133A - Training and three-dimensional reconstruction method and device of three-dimensional reconstruction model and storage medium

Info

Publication number: CN115761133A
Application number: CN202211471018.0A
Authority: CN
Inventors: 胡伊凡; 陈赟; 周则儒; 周鹏威
Original assignee: Guangdong Power Grid Co Ltd; Dongguan Power Supply Bureau of Guangdong Power Grid Co Ltd
Current assignee: Guangdong Power Grid Co Ltd; Dongguan Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority date: 2022-11-23
Filing date: 2022-11-23
Publication date: 2023-03-07

Abstract

The invention discloses a training method for a three-dimensional reconstruction model, a three-dimensional reconstruction method, a device and a storage medium. The training method comprises: acquiring a multi-view image set of the same object, and determining an input image set corresponding to each image in the multi-view image set; inputting the input image set corresponding to each image into an initial three-dimensional reconstruction model; fusing the position information of each input image set through a multi-view position information fusion network to obtain a fusion feature image set; inputting the fusion feature image set into a hybrid network of CNN and Transformer to obtain a first predicted reconstructed image and a second predicted reconstructed image; and determining a target loss function value according to the first predicted reconstructed image, the second predicted reconstructed image and the reconstructed image label, and adjusting network parameters in the initial three-dimensional reconstruction model. Through the fusion of CNN and Transformer, the extracted features are both global and local, which enhances the accuracy of the model prediction results and allows the three-dimensional image of the object to be restored effectively and accurately.

Description

Training and three-dimensional reconstruction method and device of three-dimensional reconstruction model and storage medium

Technical Field

The invention relates to the technical field of computer vision processing, in particular to a method, equipment and a storage medium for training and three-dimensional reconstruction of a three-dimensional reconstruction model.

Background

Image-based computer vision algorithms inherently lack depth information, so their results can be seriously inconsistent with the three-dimensional information of the actual scene, whereas many applications require an in-depth and comprehensive understanding of the surrounding scene. In scenarios such as autonomous driving and virtual reality, three-dimensional reconstruction is currently the mainstream method for addressing this shortcoming of computer vision algorithms.

Three-dimensional reconstruction algorithms based on deep learning generally use a single feature extraction network, and the features extracted in this way are limited, so that these algorithms have low precision and cannot restore three-dimensional images of objects well.

Disclosure of Invention

The invention provides a method, equipment and a storage medium for training and three-dimensional reconstruction of a three-dimensional reconstruction model, which aim to solve the problem that the features obtained by feature extraction of a single feature extraction network are limited.

In a first aspect, an embodiment of the present disclosure provides a training method for a three-dimensional reconstruction model, including:

acquiring a multi-view image set of the same object, and determining an input image set corresponding to each image in the multi-view image set according to the multi-view image set;

inputting the input image set corresponding to each image into an initial three-dimensional reconstruction model; wherein the initial three-dimensional reconstruction model comprises: a multi-view position information fusion network and a mixed network of CNN and Transformer; wherein the mixed network of CNN and Transformer comprises: a CNN branch network and a Transformer branch network, the CNN branch network being connected with the Transformer branch network through a feature fusion network;

performing position information fusion on each input image set through the multi-view position information fusion network to obtain a fusion characteristic image set;

inputting the fusion characteristic image set into a CNN branch network and a Transformer branch network in a mixed network of the CNN and the Transformer to obtain a first prediction reconstruction image output by the CNN branch network and a second prediction reconstruction image output by the Transformer branch network;

calculating a first loss function value according to the first prediction reconstruction image and a reconstruction image label, and calculating a second loss function value according to the second prediction reconstruction image and the reconstruction image label; and adjusting the network parameters in the initial three-dimensional reconstruction model according to a target loss function value determined by the sum of the first loss function value and the second loss function value.

In a second aspect, an embodiment of the present disclosure provides a three-dimensional reconstruction method, including:

acquiring a multi-view image set of an object to be reconstructed;

determining an input image set corresponding to each image in the multi-view image set according to the multi-view image set;

inputting an input image set corresponding to each image into a target three-dimensional reconstruction model obtained by training by adopting the training method of the three-dimensional reconstruction model according to any embodiment;

and acquiring a three-dimensional reconstruction image of the object to be reconstructed output by the target three-dimensional reconstruction model.

In a third aspect, an embodiment of the present disclosure provides an electronic device, including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the method for training a three-dimensional reconstruction model provided in the above-mentioned first aspect embodiment, or the method for three-dimensional reconstruction provided in the above-mentioned second aspect embodiment.

In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium, where computer instructions are stored, and the computer instructions are configured to, when executed, enable a processor to implement the training method for the three-dimensional reconstruction model provided in the embodiment of the first aspect, or the three-dimensional reconstruction method provided in the embodiment of the second aspect.

According to the training method for a three-dimensional reconstruction model, the three-dimensional reconstruction method, the device and the storage medium, a multi-view image set of the same object is acquired, and an input image set corresponding to each image in the multi-view image set is determined according to the multi-view image set; the input image set corresponding to each image is input into an initial three-dimensional reconstruction model; position information fusion is performed on each input image set through the multi-view position information fusion network to obtain a fusion feature image set; the fusion feature image set is input into the CNN branch network and the Transformer branch network in the mixed network of CNN and Transformer to obtain a first prediction reconstruction image output by the CNN branch network and a second prediction reconstruction image output by the Transformer branch network; a first loss function value is calculated according to the first prediction reconstruction image and a reconstruction image label, and a second loss function value is calculated according to the second prediction reconstruction image and the reconstruction image label; and the network parameters in the initial three-dimensional reconstruction model are adjusted according to a target loss function value determined by the sum of the first loss function value and the second loss function value. This technical scheme designs a mixed network structure of CNN and Transformer that combines the strength of the convolutional neural network (CNN) in extracting local features with the Transformer's ability to capture global features more easily, so that the extracted features are both global and local, the accuracy of the model prediction result is enhanced, and the three-dimensional image of the object can be restored effectively and accurately.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present invention, nor do they necessarily limit the scope of the invention. Other features of the present invention will become apparent from the following description.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on these drawings without creative effort.

Fig. 1 is a flowchart of a training method for a three-dimensional reconstruction model according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of a hybrid network of CNN and Transformer;

Fig. 3 is an interactive schematic diagram of feature image fusion processing based on a feature fusion network in a three-dimensional reconstruction model;

Fig. 4 is a flowchart of a training method for a three-dimensional reconstruction model according to a second embodiment of the present invention;

Fig. 5 is an interaction diagram of a hybrid network of CNN and Transformer;

Fig. 6 is a flowchart of a three-dimensional reconstruction method according to a third embodiment of the present invention;

Fig. 7 is a schematic structural diagram of a training apparatus for a three-dimensional reconstruction model according to a fourth embodiment of the present invention;

Fig. 8 is a schematic structural diagram of a three-dimensional reconstruction apparatus according to a fifth embodiment of the present invention;

Fig. 9 is a schematic structural diagram of an electronic device according to a sixth embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and "target" and the like in the description and claims of the invention and the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Embodiment One

Fig. 1 is a flowchart of a training method for a three-dimensional reconstruction model according to an embodiment of the present invention, where the embodiment is applicable to the case of training the three-dimensional reconstruction model, and the method may be performed by a training apparatus for the three-dimensional reconstruction model, and the apparatus may be implemented in a form of hardware and/or software.

As shown in fig. 1, the method includes:

S101, acquiring a multi-view image set of the same object, and determining an input image set corresponding to each image in the multi-view image set according to the multi-view image set.

In this embodiment, a multi-view image set may be understood as a set of images from a plurality of different views of the same object. The input image set may be a set of input images for input to the three-dimensional reconstruction model, and the input images may be images combined from at least two images in the multi-view image set. Wherein, both the images in the multi-view image set and the input images in the input image set may be RGB images.

Specifically, for the same object, images of the object are acquired from a plurality of different viewing angles to form a multi-view image set. For each image in the multi-view image set, combining the image with different numbers of images of other views respectively to obtain an input image set consisting of a plurality of input images. It will be appreciated that each image in the multi-view image set has a corresponding input image set.

For example, the input image may be an image obtained by fusing image information of the multi-view image or an image obtained by fusing position information. For example, for each image in the multi-view image set, images from at least one other view angle are acquired from the multi-view image set and are fused to obtain an input image, so that the fused input image has multi-view fusion information. Different input images can be obtained by respectively carrying out image information fusion on each image and other images with different numbers of visual angles, so that each image in the multi-visual-angle image set corresponds to an input image set formed by a plurality of input images.

S102, inputting the input image set corresponding to each image into the initial three-dimensional reconstruction model.

Wherein the initial three-dimensional reconstruction model comprises: a multi-view position information fusion network and a mixed network of CNN and Transformer; wherein the mixed network of CNN and Transformer comprises: a CNN branch network and a Transformer branch network, the CNN branch network and the Transformer branch network being connected through a feature fusion network.

The initial three-dimensional reconstruction model can be understood as a three-dimensional reconstruction model which is not trained or is not trained completely. The multi-view position information fusion network can be understood as a fusion network for performing position information fusion on multi-view images of the same object.

The hybrid network of CNN (Convolutional Neural Network) and Transformer may be a hybrid network that realizes fusion between the CNN branch network and the Transformer branch network. The CNN branch network is a convolutional neural network used for performing convolution operations on the image to extract features; the Transformer branch network is a network that uses an attention mechanism to improve the image processing speed and the model training speed, and it outputs the feature image after the corresponding processing. The feature fusion network may be a network for performing image feature fusion on the feature image output by the CNN branch network and the feature image output by the Transformer branch network.

The CNN branch network has better characteristic extraction effect on local characteristics; the Transformer branch network has a good feature extraction effect on the global features. Therefore, in the embodiment of the invention, the CNN branch network and the Transformer branch network are connected through the feature fusion network, so that feature fusion can be performed between the local features extracted by the CNN branch network and the global features extracted by the Transformer branch network. For example, fig. 2 is a schematic structural diagram of a mixed network of a CNN and a Transformer, and as shown in fig. 2, the CNN branch network and the Transformer branch network are connected through a feature fusion network.

S103, carrying out position information fusion on each input image set through the multi-view position information fusion network to obtain a fusion characteristic image set.

The fusion feature image set may be a set formed by a plurality of fusion feature images, and the fusion feature images in the fusion feature image set may be feature images obtained by fusing position information of input image sets corresponding to each image.

Specifically, an input image set corresponding to each image in the multi-view image set is input into a multi-view position information fusion network in the initial three-dimensional reconstruction model, position information of the input images in the input image set is determined and fused according to the multi-view position information fusion network, a fusion feature image is generated, and a fusion feature image set is constructed.

S104, inputting the fusion characteristic image set into a CNN branch network and a Transformer branch network in the mixed network of the CNN and the Transformer to obtain a first prediction reconstruction image output by the CNN branch network and a second prediction reconstruction image output by the Transformer branch network.

The first prediction reconstruction image may be an image obtained by performing feature processing on the fusion feature image set through the CNN branch network. The second prediction reconstruction image may be an image obtained by performing feature processing on the fusion feature image set through the Transformer branch network.

The fusion feature image set output by the multi-view position information fusion network is taken as the input image set of the mixed network of CNN and Transformer and is input to the CNN branch network and the Transformer branch network respectively. In the CNN branch network, a corresponding feature image is obtained as the first prediction reconstruction image through the feature processing of each module in the CNN branch network; likewise, each module in the Transformer branch network performs the corresponding feature processing to obtain a feature image as the second prediction reconstruction image.

It should be noted that, since the CNN branch network and the Transformer branch network are connected by the feature fusion network, the feature image input into the CNN branch network fuses the features extracted by the Transformer branch network, and similarly, the feature image input into the Transformer branch network fuses the features extracted by the CNN branch network.

S105, calculating a first loss function value according to the first prediction reconstruction image and the reconstruction image label, and calculating a second loss function value according to the second prediction reconstruction image and the reconstruction image label; and adjusting the network parameters in the initial three-dimensional reconstruction model according to the target loss function value determined by the sum of the first loss function value and the second loss function value.

The reconstructed image label may be understood as a label image annotated for the acquired multi-view image set.

The first loss function value may be a loss value between the first predicted reconstructed image and the reconstructed image label. The second loss function value may be a loss value between the second predicted reconstructed image and the reconstructed image label. The target loss function value is the sum of the first loss function value and the second loss function value.

The network parameters in the initial three-dimensional reconstruction model may be network parameter information of each network participating in the construction and outputting of the three-dimensional reconstruction image, and may at least include network parameters of a multi-view position information fusion network and network parameters of a mixed network of the CNN and the Transformer.

Specifically, a first loss function value is calculated from the loss between the first predicted reconstructed image and the reconstructed image label, and similarly, a second loss function value is calculated from the loss between the second predicted reconstructed image and the reconstructed image label. The sum of the first loss function value and the second loss function value is determined as the target loss function value, the network parameters of each network in the initial three-dimensional reconstruction model are iteratively adjusted according to the target loss function value, and iterative training stops when the target loss function value meets a preset requirement, yielding the target three-dimensional reconstruction model. The embodiments of the invention do not limit the preset requirement on the target loss function value; it may be, for example, that the value reaches a minimum or falls below a preset value.
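
The loss computation of S105 can be sketched as follows. This is only an illustration under stated assumptions: the text does not specify the per-branch loss, so mean squared error is used here as a stand-in; images are represented as flat lists of numbers; and the names `mse`, `target_loss` and `should_stop`, as well as the threshold-based stopping rule, are hypothetical.

```python
def mse(pred, label):
    """Mean squared error between two equally sized flat lists (assumed loss)."""
    if len(pred) != len(label):
        raise ValueError("prediction and label must have the same size")
    return sum((p - y) ** 2 for p, y in zip(pred, label)) / len(pred)


def target_loss(first_pred, second_pred, label, loss_fn=mse):
    """Target loss = first loss (CNN branch) + second loss (Transformer branch)."""
    first_loss = loss_fn(first_pred, label)
    second_loss = loss_fn(second_pred, label)
    return first_loss + second_loss


def should_stop(loss_value, threshold):
    """One possible preset requirement: stop once the loss is below a threshold."""
    return loss_value < threshold


label = [0.0, 1.0, 1.0, 0.0]
first_pred = [0.0, 1.0, 0.0, 0.0]   # hypothetical CNN-branch output
second_pred = [0.5, 1.0, 1.0, 0.0]  # hypothetical Transformer-branch output
total = target_loss(first_pred, second_pred, label)
print(total)  # 0.25 + 0.0625 = 0.3125
```

Because both branches are compared against the same label and their losses are summed, a parameter update driven by this value penalizes errors in either branch.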

In this embodiment, a multi-view image set of the same object is acquired, and an input image set corresponding to each image in the multi-view image set is determined according to the multi-view image set; the input image set corresponding to each image is input into an initial three-dimensional reconstruction model; position information fusion is performed on each input image set through the multi-view position information fusion network to obtain a fusion feature image set; the fusion feature image set is input into the CNN branch network and the Transformer branch network in the mixed network of CNN and Transformer to obtain a first prediction reconstruction image output by the CNN branch network and a second prediction reconstruction image output by the Transformer branch network; a first loss function value is calculated according to the first prediction reconstruction image and a reconstruction image label, and a second loss function value is calculated according to the second prediction reconstruction image and the reconstruction image label; and the network parameters in the initial three-dimensional reconstruction model are adjusted according to a target loss function value determined by the sum of the first loss function value and the second loss function value. This technical scheme designs a mixed network structure of CNN and Transformer that combines the strength of the convolutional neural network (CNN) in extracting local features with the Transformer's ability to capture global features more easily, so that the extracted features are both global and local, the accuracy of the model prediction result is enhanced, and the three-dimensional image of the object can be restored effectively.

As a first optional implementation of this embodiment, on the basis of the above embodiment, the step of determining the input image set corresponding to each image in the multi-view image set according to the multi-view image set is further specified as including:

(1) From the multi-view image set $S = \{s_i\}$, sequentially acquire a first image $s_i$, where $s_i \in S$, $i \in [1, n]$, $n$ is an integer greater than 1, and $n$ is the number of images included in the multi-view image set.

The first image may be the $i$-th image $s_i$ of the multi-view image set $S$, where $s_i \in S$, $i \in [1, n]$, and $n$ is the number of images in the multi-view image set. Illustratively, the dimensions of an image in the multi-view image set may be represented as $(3, h, w)$.

Specifically, the multi-view image set $S$ includes $n$ images, and the $i$-th image is sequentially acquired from the $n$ images and determined as the first image $s_i$.

(2) For each first image $s_i$, randomly acquire second images $s_j$ from the multi-view image set $S$, where $s_j \in S$, $j \in [1, m]$, $j \neq i$, and $m \le n - 1$.

The second image may be an image randomly acquired from the multi-view image set $S$ other than the first image $s_i$.

Specifically, $s_i$ and $s_j$ in the multi-view image set satisfy $s_i \in S$, $i \in [1, n]$, $s_j \in S$, $j \in [1, m]$, $j \neq i$, and $m \le n - 1$.

(3) From the first image $s_i$ and the second images $s_j$, form the input image set corresponding to the first image $s_i$: $P = \{\{s_i, s_1\}, \{s_i, s_1, s_2\}, \ldots, \{s_i, s_1, s_2, \ldots, s_m\}\}$.

Specifically, a first image $s_i$ is selected in turn from the multi-view image set $S$ and combined with $m$ random second images $s_j$ to form input images, and the input images corresponding to the first image $s_i$ constitute its input image set $P = \{\{s_i, s_1\}, \{s_i, s_1, s_2\}, \ldots, \{s_i, s_1, s_2, \ldots, s_m\}\}$.

It will be appreciated that a corresponding input image set $P$ may be constructed for each first image $s_i$. In this optional implementation, $n$ input image sets $P$ are constructed in total, one for each first image $s_i$, and each input image set $P$ includes $m$ groups of input images.

Illustratively, the multi-view image set $S = \{s_1, s_2, s_3, s_4\}$ includes 4 images. Selecting $s_1$ as the first image, the second images may be determined to be $s_2$, $s_3$ and $s_4$. The input image set constructed for the first image $s_1$ is $P = \{\{s_1, s_2\}, \{s_1, s_2, s_3\}, \{s_1, s_2, s_3, s_4\}\}$.
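
The construction in steps (1) to (3) can be sketched in Python as follows. This is a minimal illustration rather than the implementation of the invention: images are represented by string identifiers instead of RGB arrays, the function name `build_input_image_set` and its `rng` parameter are hypothetical, and drawing the second images with `random.Random.sample` is only one possible way to realize the random acquisition described in step (2).

```python
import random


def build_input_image_set(images, i, m, rng=None):
    """Build the nested input image set P for the first image images[i].

    images: list of image identifiers (stand-ins for arrays of shape (3, h, w)).
    i:      0-based index of the first image s_i.
    m:      number of second images to draw, with 1 <= m <= n - 1.
    rng:    optional random.Random; if None, the other images are taken in
            their original order, which reproduces the worked example.
    """
    n = len(images)
    if not 1 <= m <= n - 1:
        raise ValueError("m must satisfy 1 <= m <= n - 1")
    others = [img for k, img in enumerate(images) if k != i]
    second = rng.sample(others, m) if rng is not None else others[:m]
    first = images[i]
    # The k-th input image combines s_i with the first k second images.
    return [[first] + second[:k] for k in range(1, m + 1)]


S = ["s1", "s2", "s3", "s4"]
P = build_input_image_set(S, i=0, m=3)
print(P)  # [['s1', 's2'], ['s1', 's2', 's3'], ['s1', 's2', 's3', 's4']]
```

Passing a seeded `random.Random` instance as `rng` gives a randomized but reproducible choice of second images; each group in the result still starts with the first image and extends the previous group by one image.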

At present, a common sampling method for inputting a three-dimensional reconstruction model is to randomly select a fixed number of RGB images from an RGB data set corresponding to the three-dimensional model, perform preprocessing such as normalization, and then input the RGB images into a deep learning network model. In the optional embodiment, for each image in the multi-view image set, different numbers of images with different views are sequentially selected from the multi-view image set to be fused and an input image set is constructed, so that the position information of the RGB images can be better fused, the association degree between the multi-view RGB images is improved, the fusion between the position information of the input RGB images is promoted, the optimization processing of the input dimension of the three-dimensional reconstruction model is realized, and a foundation is laid for the accuracy of the output result of the three-dimensional model.

Embodiment Two

Fig. 4 is a flowchart of a training method for a three-dimensional reconstruction model according to a second embodiment of the present invention, which is a further optimization of any of the above embodiments, and this embodiment is applicable to the case of training a three-dimensional reconstruction model, and this method can be executed by a training apparatus for a three-dimensional reconstruction model, and this apparatus can be implemented in the form of hardware and/or software.

As shown in fig. 4, the method includes:

S201, acquiring a multi-view image set of the same object, and determining an input image set corresponding to each image in the multi-view image set according to the multi-view image set.

S202, inputting the input image set corresponding to each image into the initial three-dimensional reconstruction model.

S203, carrying out position information fusion on each input image set through the multi-view position information fusion network to obtain a fusion characteristic image set.

S204, inputting the fusion feature image set into a CNN branch network and a Transformer branch network in the CNN and Transformer mixed network.

Specifically, the multi-view position information fusion network performs position information fusion on the input image set to obtain a fusion feature image set, and inputs the fusion feature image set to a mixed network of a CNN and a Transformer. And the CNN branch network and the Transformer branch network are connected through a feature fusion network.

Optionally, after the fused feature image set is input to the CNN branch network and the Transformer branch network in the CNN and Transformer hybrid network, the CNN branch network and the Transformer branch network perform feature extraction and fusion processing through the feature fusion network.

Exemplarily, Fig. 3 is an interactive schematic diagram of feature image fusion processing based on the feature fusion network in the three-dimensional reconstruction model. As shown in Fig. 3, the CNN branch network 10 includes a downsampling stem module 110 and a residual module 120, and the residual module 120 includes T residual sub-modules 1210; the Transformer branch network 20 includes a blocking patch module 210 and a Transformer module 220, and the Transformer module 220 includes T Transformer sub-modules 2210. The CNN branch network 10 and the Transformer branch network 20 are connected by a feature fusion network 30, and the feature fusion network 30 includes a first feature fusion sub-module 310 and a second feature fusion sub-module 320.

The downsampling stem module 110 may be a convolution module for downsampling in the CNN branch network; the residual module 120 is configured to perform multi-layer residual processing on the input image; the residual sub-module 1210 of each layer in the residual module 120 performs residual processing on the feature image input to that layer. The blocking patch module 210 can be understood as a module in the Transformer branch network for dividing the input fusion feature image set into blocks (patches); the Transformer module 220 is used for performing multi-layer Transformer processing on the input image; the Transformer sub-module 2210 of each layer in the Transformer module performs Transformer processing on the feature image input to that layer.

The feature fusion network 30 is respectively connected with the CNN branch network 10 and the Transformer branch network 20, and is configured to perform feature fusion on the feature image output by the CNN branch network 10 and the feature image output by the Transformer branch network 20; the feature fusion network 30 includes a first feature fusion sub-module 310 and a second feature fusion sub-module 320; the first feature fusion sub-module 310 is a fusion module for performing feature fusion based on the feature dimension of the Transformer feature image; the second feature fusion sub-module 320 is a fusion module that performs feature fusion based on feature dimensions of the residual feature image.

After the fused feature image set is input to the CNN branch network and the Transformer branch network in the CNN and Transformer hybrid network, the CNN branch network and the Transformer branch network perform feature extraction and fusion processing through the feature fusion network, which may be specifically implemented as follows:

(1) The (t-1)th residual sub-module is connected with the (t-1)th Transformer sub-module through the (t-1)th first feature fusion sub-module, where 2 ≤ t ≤ T. The first feature fusion sub-module is used for fusing the residual feature image output by the (t-1)th residual sub-module and the Transformer feature image output by the (t-1)th Transformer sub-module to obtain the (t-1)th first feature fusion image, and the first feature fusion image has the same feature dimensions as the residual feature image output by the residual sub-module.

In the CNN branch network, there are T residual sub-modules, and each residual sub-module is connected with a Transformer sub-module through a first feature fusion sub-module. In the (t-1)th first feature fusion sub-module, feature fusion processing is performed on the feature images output by the residual sub-module and the Transformer sub-module of the same layer, based on the feature dimensions of the Transformer feature image, and a first feature fusion image is output.

Illustratively, when t is 3, (t-1) = 2: the 2nd residual sub-module outputs its feature image to the 2nd first feature fusion sub-module, the 2nd Transformer sub-module likewise outputs its feature image to the 2nd first feature fusion sub-module, and the 2nd first feature fusion sub-module fuses the two input feature images and outputs the 2nd first feature fusion image.

(2) The (t-1)-th Transformer sub-module is connected with the (t-1)-th residual sub-module through the (t-1)-th second feature fusion sub-module. The second feature fusion sub-module is used for fusing the residual feature image output by the (t-1)-th residual sub-module and the Transformer feature image output by the (t-1)-th Transformer sub-module to obtain the (t-1)-th second feature fusion image, and the second feature fusion image has the same feature dimensions as the Transformer feature image output by the Transformer sub-module.

In the Transformer branch network, there are T Transformer sub-modules, and each Transformer sub-module is connected with the residual sub-module of the same layer through a second feature fusion sub-module. The (t-1)-th Transformer sub-module outputs a Transformer feature image to the (t-1)-th second feature fusion sub-module, and the (t-1)-th residual sub-module likewise outputs a residual feature image to the (t-1)-th second feature fusion sub-module. In the (t-1)-th second feature fusion sub-module, feature fusion processing is performed, based on the feature dimensions of the residual feature image, on the feature images output by the Transformer sub-module and the residual sub-module of the same layer, and a second feature fusion image is output.

For example, when t = 3, (t-1) = 2: the 2nd Transformer sub-module outputs its feature image to the 2nd second feature fusion sub-module, the 2nd residual sub-module likewise inputs its output feature image to the 2nd second feature fusion sub-module, and the 2nd second feature fusion sub-module fuses the two input feature images and outputs the 2nd second feature fusion image.

Further, the second feature fusion sub-module includes: a first convolution layer whose activation function is the ReLU function, a first conversion layer, a fully connected layer and a first fusion layer; the first feature fusion sub-module includes: a second conversion layer, a second convolution layer whose activation function is the sigmoid function, and a second fusion layer.

Specifically, in the second feature fusion sub-module, the first convolution layer whose activation function is the ReLU function, the first conversion layer, the fully connected layer and the first fusion layer are arranged in sequence.

The inputs of the second feature fusion sub-module are a residual feature image with feature dimensions (B, C, H, W) and a Transformer feature image with feature dimensions (B, C, N). Its output is a second feature fusion image with feature dimensions (B, C, N). For example, the residual feature image output by the residual sub-module, with feature dimensions (B, C, H, W), is first input to the first convolution layer, which has a 3*3 convolution kernel, a padding of 1 and a ReLU activation function; after the first convolution layer, the feature dimensions of the residual feature image are still (B, C, H, W), that is, its size is unchanged. The features extracted in the CNN branch network are four-dimensional, (B, C, H, W), whereas the second feature fusion image output by the second feature fusion sub-module needs to have the same feature dimensions as the Transformer feature image output by the Transformer sub-module, that is, it should be a three-dimensional feature image of (B, C, N). Therefore, the first conversion layer converts the four-dimensional features (B, C, H, W) of the residual feature image into three-dimensional features (B, C, N), which are then processed by the fully connected layer FC to obtain the output feature image of the fully connected layer. In the first fusion layer, the output feature image of the fully connected layer is spliced and fused with the Transformer feature image with feature dimensions (B, C, N) obtained from the Transformer sub-module to generate the second feature fusion image, which then enters the next module for the forward-propagation and back-propagation calculation.
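The data flow of the second feature fusion sub-module can be illustrated with a minimal NumPy sketch. It is not the implementation of this embodiment: the names `conv3x3_same` and `second_feature_fusion` are hypothetical, N is assumed to equal H*W, the fully connected layer is assumed to act along the N axis, and "splicing and fusion" is replaced by element-wise addition so that the output keeps the dimensions (B, C, N).

```python
import numpy as np

def conv3x3_same(x, weight):
    """3x3 convolution with padding 1. x: (B, C_in, H, W); weight: (C_out, C_in, 3, 3)."""
    B, _, H, W = x.shape
    padded = np.pad(x, ((0, 0), (0, 0), (1, 1), (1, 1)))
    out = np.zeros((B, weight.shape[0], H, W))
    for dy in range(3):
        for dx in range(3):
            window = padded[:, :, dy:dy + H, dx:dx + W]
            out += np.einsum("bchw,oc->bohw", window, weight[:, :, dy, dx])
    return out

def second_feature_fusion(residual, transformer, conv_w, fc_w):
    """residual: (B, C, H, W); transformer: (B, C, N) with N = H*W (assumed)."""
    B, C, H, W = residual.shape
    x = np.maximum(conv3x3_same(residual, conv_w), 0.0)  # first convolution layer + ReLU
    x = x.reshape(B, C, H * W)                           # first conversion layer: 4-D -> 3-D
    x = x @ fc_w                                         # fully connected layer over N (assumed)
    return x + transformer                               # first fusion layer (assumed: addition)

rng = np.random.default_rng(0)
residual = rng.normal(size=(2, 4, 3, 3))
transformer = rng.normal(size=(2, 4, 9))
fused = second_feature_fusion(residual, transformer,
                              rng.normal(size=(4, 4, 3, 3)), rng.normal(size=(9, 9)))
print(fused.shape)  # (2, 4, 9)
```

The output has the same feature dimensions as the Transformer feature image, so it can be passed to the next Transformer sub-module without further conversion.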

Specifically, in the first feature fusion submodule, a second conversion layer, a second convolution layer whose activation function is a sigmoid function, and a second fusion layer are sequentially provided. The input images of the first feature fusion submodule are a Transformer feature image with feature dimensions (B, C, N) and a residual feature image with feature dimensions (B, C, H, W), respectively. The output image of the first feature fusion sub-module is a first fused feature image with feature dimensions (B, C, H, W).

For example, in the Transformer branch network, the features extracted by the Transformer are three-dimensional, (B, C, N), whereas the first feature fusion image output by the first feature fusion sub-module needs to have the same feature dimensions as the residual feature image output by the residual sub-module, that is, it should be a four-dimensional feature image of (B, C, H, W). Therefore, the Transformer feature image is first input into the second conversion layer, where a reshape operation converts it from feature dimensions (B, C, N) to feature dimensions (B, C, H, W); the Transformer feature image with feature dimensions (B, C, H, W) is then input to the second convolution layer, which has a 3*3 convolution kernel and a sigmoid activation function, to obtain the Transformer feature image processed by the second convolution layer; in the second fusion layer, the output features of the second convolution layer are spliced and fused with the residual feature image with feature dimensions (B, C, H, W) obtained from the residual sub-module to generate the first feature fusion image, which then enters the next module for the forward-propagation and back-propagation calculation.
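The first feature fusion sub-module can be sketched in the same way. This is only an illustration under assumptions not stated in the text: N = H*W, a padding of 1 for the 3*3 convolution so that H and W are preserved, and element-wise addition standing in for "splicing and fusion". The function names are hypothetical.

```python
import numpy as np

def conv3x3_same(x, weight):
    """3x3 convolution with padding 1. x: (B, C_in, H, W); weight: (C_out, C_in, 3, 3)."""
    B, _, H, W = x.shape
    padded = np.pad(x, ((0, 0), (0, 0), (1, 1), (1, 1)))
    out = np.zeros((B, weight.shape[0], H, W))
    for dy in range(3):
        for dx in range(3):
            window = padded[:, :, dy:dy + H, dx:dx + W]
            out += np.einsum("bchw,oc->bohw", window, weight[:, :, dy, dx])
    return out

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def first_feature_fusion(transformer, residual, conv_w):
    """transformer: (B, C, N); residual: (B, C, H, W) with N = H*W (assumed)."""
    B, C, H, W = residual.shape
    assert transformer.shape == (B, C, H * W)
    x = transformer.reshape(B, C, H, W)      # second conversion layer: 3-D -> 4-D
    x = sigmoid(conv3x3_same(x, conv_w))     # second convolution layer + sigmoid
    return x + residual                      # second fusion layer (assumed: addition)

rng = np.random.default_rng(0)
transformer = rng.normal(size=(2, 4, 9))
residual = rng.normal(size=(2, 4, 3, 3))
fused = first_feature_fusion(transformer, residual, rng.normal(size=(4, 4, 3, 3)))
print(fused.shape)  # (2, 4, 3, 3)
```

The output keeps the four-dimensional layout of the residual feature image, so it can be fed directly to the next residual sub-module.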

In the CNN and Transformer hybrid network, each layer of residual sub-module in the CNN branch network is connected with the Transformer sub-module of the same layer through a first feature fusion sub-module and a second feature fusion sub-module, so that the first feature fusion sub-module performs feature fusion on the residual feature image output by the residual sub-module and the Transformer feature image output by the Transformer sub-module to obtain a first feature fusion image, and the second feature fusion sub-module performs feature fusion on the residual feature image output by the residual sub-module and the Transformer feature image output by the Transformer sub-module to obtain a second feature fusion image. With this technical scheme, the feature image extracted by the CNN branch network is fused with the features extracted by the Transformer branch network, and the feature image extracted by the Transformer branch network is fused with the features extracted by the CNN branch network, so that the feature images extracted by both branch networks have global as well as local characteristics, and the respective advantages of the CNN and the Transformer network in feature extraction are effectively combined.

S205, performing a convolution operation and a maximum pooling operation on the fusion feature set through the stem module in the CNN branch network, and inputting the resulting pooled feature image into the residual module; and performing residual processing on the input pooled feature image through the T residual sub-modules in the residual module to obtain a first predicted reconstructed image.

The stem module includes a convolution operation with a kernel size of 3*3 and 64 output channels, and a maximum pooling module (Maxpool). The residual module contains T layers of residual sub-modules. Each residual sub-module includes: a convolution module with a kernel size of 1*1 and a ReLU activation function; a convolution module with a kernel size of 3*3 and a sigmoid activation function; and a convolution module with a kernel size of 1*1 and a linear activation function (Linear). The first 1*1 convolution can be used to increase the hidden space dimension, the 3*3 convolution can be used to increase the fitting capability of the model, and the second 1*1 convolution is used to perform identity mapping. The T residual sub-modules in the residual module are stacked, and the value of T is set according to the data size, which is not limited in this embodiment.
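As an illustration of one residual sub-module, the following NumPy sketch chains the three convolutions described above. It is a sketch under assumptions: the skip connection (adding the input back) is implied by the term "residual" but is not spelled out in the text, the second 1*1 convolution is assumed to project back to the input channel count, and the 3*3 convolution is assumed to use a padding of 1. All function and parameter names are hypothetical.

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution. x: (B, C_in, H, W); weight: (C_out, C_in)."""
    return np.einsum("bchw,oc->bohw", x, weight)

def conv3x3_same(x, weight):
    """3x3 convolution with padding 1. weight: (C_out, C_in, 3, 3)."""
    B, _, H, W = x.shape
    padded = np.pad(x, ((0, 0), (0, 0), (1, 1), (1, 1)))
    out = np.zeros((B, weight.shape[0], H, W))
    for dy in range(3):
        for dx in range(3):
            window = padded[:, :, dy:dy + H, dx:dx + W]
            out += np.einsum("bchw,oc->bohw", window, weight[:, :, dy, dx])
    return out

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def residual_sub_module(x, w_expand, w_mid, w_project):
    h = np.maximum(conv1x1(x, w_expand), 0.0)  # 1*1 conv + ReLU: raise hidden dimension
    h = sigmoid(conv3x3_same(h, w_mid))        # 3*3 conv + sigmoid: fitting capability
    h = conv1x1(h, w_project)                  # 1*1 conv + linear: back to C channels (assumed)
    return x + h                               # skip connection (assumed)

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4, 5, 5))
y = residual_sub_module(x, rng.normal(size=(8, 4)),
                        rng.normal(size=(8, 8, 3, 3)), rng.normal(size=(4, 8)))
print(y.shape)  # (1, 4, 5, 5)
```

Because the output has the same shape as the input, T such sub-modules can be stacked, with a feature fusion sub-module between consecutive layers.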

Specifically, after the downsampling stem module performs the convolution operation and the maximum pooling operation on the fusion feature set, it outputs a pooled feature image to the residual module. The T residual sub-modules in the residual module perform residual processing on the input pooled feature image layer by layer and obtain corresponding predicted images, and the predicted image output by the last residual sub-module, that is, the T-th residual sub-module, can be determined as the first predicted reconstructed image.

S206, performing a blocking operation on the fusion feature set through the blocking patch module in the Transformer branch network, and inputting the resulting block feature image into the Transformer module; and performing feature extraction on the input block feature image through the T Transformer sub-modules in the Transformer module to obtain a second predicted reconstructed image.

Specifically, in the blocking patch module, a block feature image is obtained after the blocking operation is performed on the fusion feature set. The block feature image is input into the Transformer module, and layer-by-layer feature extraction is performed through the T layers of Transformer sub-modules in the Transformer module to obtain the second predicted reconstructed image.
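The text does not specify how the blocking operation is carried out. One common way, shown here purely as an assumption, is to cut each feature map into non-overlapping p*p patches and flatten each patch into one token; the function name `patchify`, the patch size `p` and the (B, N, C*p*p) output layout are all hypothetical.

```python
import numpy as np

def patchify(x, p):
    """Split x of shape (B, C, H, W) into non-overlapping p*p patches.

    Returns tokens of shape (B, N, C*p*p), where N = (H // p) * (W // p).
    """
    B, C, H, W = x.shape
    if H % p or W % p:
        raise ValueError("H and W must be divisible by the patch size")
    x = x.reshape(B, C, H // p, p, W // p, p)
    x = x.transpose(0, 2, 4, 1, 3, 5)          # (B, H/p, W/p, C, p, p)
    return x.reshape(B, (H // p) * (W // p), C * p * p)

x = np.arange(2 * 3 * 4 * 4).reshape(2, 3, 4, 4)
tokens = patchify(x, 2)
print(tokens.shape)  # (2, 4, 12)
```

In practice a learned linear projection usually follows the flattening step; it is omitted here because the text does not describe it.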

S207, calculating a first loss function value according to the first prediction reconstruction image and a reconstruction image label, and calculating a second loss function value according to the second prediction reconstruction image and the reconstruction image label; and adjusting the network parameters in the initial three-dimensional reconstruction model according to a target loss function value determined by the sum of the first loss function value and the second loss function value.
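The combination of the two loss values in S207 can be sketched as follows. The form of each individual loss is not specified in the text, so mean squared error is used here only as a placeholder assumption, and the function names are hypothetical.

```python
def mse(pred, label):
    """Mean squared error between two equally long sequences (placeholder loss)."""
    return sum((p - q) ** 2 for p, q in zip(pred, label)) / len(label)

def target_loss(first_pred, second_pred, label):
    first_loss = mse(first_pred, label)    # CNN branch vs. reconstructed image label
    second_loss = mse(second_pred, label)  # Transformer branch vs. the same label
    return first_loss + second_loss        # target loss = sum of both values

print(target_loss([1.0, 2.0], [0.0, 0.0], [1.0, 2.0]))  # 2.5
```

Since both branches are compared with the same label and the two values are summed, a gradient step on the target loss updates the parameters of both branches and of the fusion sub-modules that connect them.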

In this embodiment, a multi-view image set of the same object is acquired, and an input image set corresponding to each image in the multi-view image set is determined according to the multi-view image set; the input image set corresponding to each image is input into the initial three-dimensional reconstruction model; position information fusion is performed on each input image set through the multi-view position information fusion network to obtain a fusion feature image set; the fusion feature image set is input into the CNN branch network and the Transformer branch network in the CNN and Transformer hybrid network; a convolution operation and a maximum pooling operation are performed on the fusion feature set through the stem module in the CNN branch network, and the result is input into the residual module; residual processing is performed on the input pooled feature image through the T residual sub-modules in the residual module to obtain a first predicted reconstructed image; a blocking operation is performed on the fusion feature set through the blocking patch module in the Transformer branch network, and the result is input into the Transformer module; feature extraction is performed on the input block feature image through the T Transformer sub-modules in the Transformer module to obtain a second predicted reconstructed image; a first loss function value is calculated according to the first predicted reconstructed image and the reconstructed image label, and a second loss function value is calculated according to the second predicted reconstructed image and the reconstructed image label; and the network parameters in the initial three-dimensional reconstruction model are adjusted according to a target loss function value determined by the sum of the first loss function value and the second loss function value. With this technical scheme, the hybrid network structure of the convolutional neural network CNN and the Transformer is used to perform feature fusion, so that the features extracted by the hybrid network are both global and local, the accuracy of the model prediction result is enhanced, and the three-dimensional image of the object can be effectively and accurately restored.

As a first optional embodiment of this embodiment, on the basis of the above embodiment, the first optional embodiment further describes the step of performing residual processing on the input pooled feature image through the T residual sub-modules in the residual module to obtain a first predicted reconstructed image. Fig. 5 is an interaction diagram of a CNN and Transformer hybrid network; as shown in fig. 5, the step specifically includes:

(1) For the 1st residual sub-module in the residual module, performing residual processing on the pooled feature image output by the stem module to obtain the 1st residual feature image; and inputting the 1st residual feature image into the 1st first feature fusion sub-module.

Specifically, the CNN branch network includes a downsampling stem module and a residual module, and the residual module includes a plurality of residual sub-modules. The fusion feature image set is input into the downsampling stem module, which performs a convolution operation and a maximum pooling operation and outputs a pooled feature image. The pooled feature image is input into the 1st residual sub-module in the residual module, which performs the corresponding residual processing in sequence according to its convolution kernels and activation functions; after the residual processing, the 1st residual sub-module outputs the 1st residual feature image to the first feature fusion sub-module of the same (1st) layer.

(2) When 2 ≤ t ≤ T-1, for the t-th residual sub-module in the residual module, performing residual processing on the (t-1)-th first feature fusion image output by the (t-1)-th first feature fusion sub-module to obtain the t-th residual feature image, and inputting the t-th residual feature image into the t-th first feature fusion sub-module.

The input of each residual sub-module from layer 2 to layer T-1 is the first feature fusion image output by the first feature fusion sub-module of the previous layer, which fuses the output of the previous layer's residual sub-module; its output is the residual feature image of that layer's residual sub-module.

It can be understood that the feature fusion network includes (T-1) first feature fusion sub-modules; the input of the first feature fusion sub-module of each layer includes the residual feature image output by the residual sub-module of the same layer, and its output is the first feature fusion image of that layer.

Specifically, when 2 ≤ t ≤ T-1, the residual sub-module of the t-th layer receives the first feature fusion image of the previous layer, generates a residual feature image through residual processing, and inputs the residual feature image to the first feature fusion sub-module of the t-th layer; the first feature fusion sub-module generates a first feature fusion image through fusion processing and inputs it to the residual sub-module of the next layer.

For example, when t = 2, (t-1) = 1: the 2nd residual sub-module in the residual module receives the 1st first feature fusion image output by the 1st first feature fusion sub-module and performs residual processing to obtain the 2nd residual feature image; the 2nd residual feature image is input into the 2nd first feature fusion sub-module to obtain the 2nd first feature fusion image, which is input into the 3rd residual sub-module.

(3) When t = T, for the T-th residual sub-module in the residual module, performing residual processing on the (T-1)-th first feature fusion image output by the (T-1)-th first feature fusion sub-module to obtain the T-th residual feature image; and determining the T-th residual feature image output by the T-th residual sub-module as the first predicted reconstructed image.

The input image of the residual sub-module of the T-th layer is the first feature fusion image output by the first feature fusion sub-module of the previous layer, which fuses the output of the previous layer's residual sub-module; the output image is the first predicted reconstructed image.

Specifically, when t = T, the residual sub-module of the T-th layer receives the first feature fusion image of the previous layer (i.e., layer T-1), and generates a residual feature image through residual processing. It can be understood that the residual feature image output by the residual sub-module of the T-th layer is the first predicted reconstructed image of the CNN and Transformer hybrid network.

As a second optional embodiment of this embodiment, on the basis of the above embodiment, the second optional embodiment further describes the step of performing feature extraction on the input block feature image through the T Transformer sub-modules in the Transformer module to obtain a second predicted reconstructed image. Fig. 5 is an interaction diagram of a CNN and Transformer hybrid network; as shown in fig. 5, the step specifically includes:

(1) For the 1st Transformer sub-module in the Transformer module, performing Transformer processing on the block feature image output by the patch module to obtain the 1st Transformer feature image; and inputting the 1st Transformer feature image into the 1st second feature fusion sub-module.

Specifically, the Transformer branch network includes a blocking patch module and a Transformer module, and the Transformer module includes a plurality of Transformer sub-modules. After the fusion feature image set is input to the patch module, the patch module performs blocking processing and outputs a block feature image. The block feature image is input into the 1st Transformer sub-module in the Transformer module and subjected to Transformer processing; after the Transformer processing, the 1st Transformer sub-module outputs the 1st Transformer feature image to the second feature fusion sub-module of the same (1st) layer.

(2) When 2 ≤ t ≤ T-1, for the t-th Transformer sub-module in the Transformer module, performing Transformer processing on the (t-1)-th second feature fusion image output by the (t-1)-th second feature fusion sub-module to obtain the t-th Transformer feature image, and inputting the t-th Transformer feature image into the t-th second feature fusion sub-module.

The input of each Transformer sub-module from layer 2 to layer T-1 is the second feature fusion image output by the second feature fusion sub-module of the previous layer, which fuses the output of the previous layer's Transformer sub-module; its output is the Transformer feature image of that layer's Transformer sub-module.

It can be understood that the feature fusion network includes (T-1) second feature fusion sub-modules; the input of the second feature fusion sub-module of each layer includes the Transformer feature image output by the Transformer sub-module of the same layer, and its output is the second feature fusion image of that layer.

Specifically, when 2 ≤ t ≤ T-1, the Transformer sub-module of the t-th layer receives the second feature fusion image of the previous layer, generates a Transformer feature image through Transformer processing, and inputs the Transformer feature image into the second feature fusion sub-module of the t-th layer; the second feature fusion sub-module generates a second feature fusion image through fusion processing and inputs it into the Transformer sub-module of the next layer.

For example, when t = 2, (t-1) = 1: the 2nd Transformer sub-module in the Transformer module receives the 1st second feature fusion image output by the 1st second feature fusion sub-module and performs Transformer processing to obtain the 2nd Transformer feature image; the 2nd Transformer feature image is input into the 2nd second feature fusion sub-module to obtain the 2nd second feature fusion image, which is input into the 3rd Transformer sub-module.

(3) When t = T, for the T-th Transformer sub-module in the Transformer module, performing Transformer processing on the (T-1)-th second feature fusion image output by the (T-1)-th second feature fusion sub-module to obtain the T-th Transformer feature image; and determining the T-th Transformer feature image output by the T-th Transformer sub-module as the second predicted reconstructed image.

The input image of the Transformer sub-module of the T-th layer is the second feature fusion image output by the second feature fusion sub-module of the previous layer, which fuses the output of the previous layer's Transformer sub-module; the output image is the second predicted reconstructed image.

Specifically, when t = T, the Transformer sub-module of the T-th layer receives the second feature fusion image of the previous layer, and generates a Transformer feature image through Transformer processing. It can be understood that the Transformer feature image output by the Transformer sub-module of the T-th layer is the second predicted reconstructed image of the CNN and Transformer hybrid network. Optionally, each Transformer sub-module includes an attention module and a multi-layer perceptron MLP.

In the attention module, the input block feature image has feature dimensions (B, N, C). The input features (B, N, C) are reshaped to obtain (B, C, H, W); a convolution with a kernel size of 3 is applied to these features to obtain features (B, C, h, w); a reshape operation is performed on (B, C, h, w) to obtain (B, n, C), where n is far smaller than N; and the attention mechanism operation is performed on the features (B, n, C). Because n is far smaller than N, the number of parameters is greatly reduced and the structure is optimized, without losing three-dimensional reconstruction accuracy.
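The token reduction described above can be sketched in NumPy as follows. Several details are assumptions, because the text only gives the kernel size: N is taken to equal H*W, the convolution is given a stride of 2 and a padding of 1 so that the spatial size actually shrinks, and the attention step is a plain scaled dot-product self-attention without learned query, key and value projections. The function names are hypothetical.

```python
import numpy as np

def conv3x3_same(x, weight):
    """3x3 convolution with padding 1. x: (B, C_in, H, W); weight: (C_out, C_in, 3, 3)."""
    B, _, H, W = x.shape
    padded = np.pad(x, ((0, 0), (0, 0), (1, 1), (1, 1)))
    out = np.zeros((B, weight.shape[0], H, W))
    for dy in range(3):
        for dx in range(3):
            window = padded[:, :, dy:dy + H, dx:dx + W]
            out += np.einsum("bchw,oc->bohw", window, weight[:, :, dy, dx])
    return out

def reduce_tokens(tokens, H, W, conv_w, stride=2):
    """(B, N, C) -> (B, n, C) with n << N, via reshape, strided 3x3 conv, reshape."""
    B, N, C = tokens.shape
    assert N == H * W
    x = tokens.transpose(0, 2, 1).reshape(B, C, H, W)         # (B, C, H, W)
    x = conv3x3_same(x, conv_w)[:, :, ::stride, ::stride]     # (B, C, h, w), stride assumed
    h, w = x.shape[2:]
    return x.reshape(B, C, h * w).transpose(0, 2, 1)          # (B, n, C)

def self_attention(tokens):
    """Scaled dot-product self-attention over the token axis of (B, n, C)."""
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(tokens.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ tokens

rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 16, 4))                           # N = 16 (H = W = 4)
reduced = reduce_tokens(tokens, 4, 4, rng.normal(size=(4, 4, 3, 3)))
attended = self_attention(reduced)
print(reduced.shape, attended.shape)  # (2, 4, 4) (2, 4, 4)
```

With N = 16 and n = 4, the attention score matrix shrinks from 16*16 to 4*4 entries per sample, which illustrates where the saving comes from.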

Specifically, the Transformer sub-module inputs the input block feature image or Transformer feature image to the attention module, which performs the corresponding convolution operation and attention mechanism operation to obtain an attention output image; the attention output image is input into the multi-layer perceptron MLP, which performs the corresponding perception processing and outputs a Transformer feature image.

It can be understood that the feature image input to the Transformer sub-module of layer 1 is the block feature image; the feature image input to each Transformer sub-module from layer 2 to layer T may be the second feature fusion image corresponding to the Transformer feature image output by the previous Transformer sub-module.

Specifically, when t = 1, the blocking patch module outputs a block feature image and inputs it into the corresponding Transformer sub-module, and the attention module in that Transformer sub-module receives the block feature image and obtains an attention output image after the corresponding processing. The attention output image is input into the multi-layer perceptron MLP, which outputs the corresponding Transformer feature image after perception processing; the second feature fusion image corresponding to the output Transformer feature image can be used as the input image of the attention module in the Transformer sub-module of the next layer.
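The layer-by-layer interaction between the two branches described in the first and second optional embodiments can be summarized in a short Python sketch. The function `hybrid_forward` and its parameters are hypothetical; every module is passed in as a callable, and the demonstration uses string-building stubs only to make the order of the calls visible.

```python
def hybrid_forward(fused_input, stem, patch, res_blocks, trans_blocks,
                   fuse_first, fuse_second):
    """Run T residual and T Transformer sub-modules with (T-1) fusion pairs."""
    T = len(res_blocks)
    assert T >= 1 and len(trans_blocks) == T
    assert len(fuse_first) == len(fuse_second) == T - 1
    cnn_in = stem(fused_input)       # pooled feature image
    trans_in = patch(fused_input)    # block feature image
    for t in range(T - 1):
        res_feat = res_blocks[t](cnn_in)
        trans_feat = trans_blocks[t](trans_in)
        # Both fusion sub-modules of a layer see both feature images.
        cnn_in = fuse_first[t](res_feat, trans_feat)     # keeps the CNN layout
        trans_in = fuse_second[t](res_feat, trans_feat)  # keeps the Transformer layout
    # The T-th sub-modules output the two predicted reconstructed images.
    return res_blocks[-1](cnn_in), trans_blocks[-1](trans_in)

# Stubs that record the call order as a string (T = 2).
res_blocks = [lambda x, i=i: f"R{i}({x})" for i in (1, 2)]
trans_blocks = [lambda x, i=i: f"X{i}({x})" for i in (1, 2)]
first, second = hybrid_forward(
    "in",
    stem=lambda x: f"stem({x})",
    patch=lambda x: f"patch({x})",
    res_blocks=res_blocks,
    trans_blocks=trans_blocks,
    fuse_first=[lambda r, s: f"F1({r},{s})"],
    fuse_second=[lambda r, s: f"F2({r},{s})"],
)
print(first)   # R2(F1(R1(stem(in)),X1(patch(in))))
print(second)  # X2(F2(R1(stem(in)),X1(patch(in))))
```

The printed strings show that each predicted image depends on the outputs of both branches at every earlier layer, which is how global and local features are mixed.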

Example three

Fig. 6 is a flowchart of a three-dimensional reconstruction method according to an embodiment of the present invention, which is applicable to a case of performing three-dimensional reconstruction on an object to be reconstructed, where the method may be performed by a three-dimensional reconstruction apparatus, and the three-dimensional reconstruction apparatus may be implemented in a form of hardware and/or software.

As shown in fig. 6, the method includes:

S301, acquiring a multi-view image set of the object to be reconstructed.

The object to be reconstructed is an object to be three-dimensionally reconstructed.

Specifically, a plurality of different images of the same object to be reconstructed can be acquired from a plurality of viewing angles and positions, and the multi-view image set of the object to be reconstructed is constructed from the acquired images.

S302, determining an input image set corresponding to each image in the multi-view image set according to the multi-view image set.

Specifically, from the plurality of images in the multi-view image set, a selected image is combined with one or more images other than the selected image to form an input image group, and the set of such groups is determined as the input image set corresponding to the selected image. Each selected image has its own corresponding input image set.

S303, inputting the input image set corresponding to each image into a target three-dimensional reconstruction model obtained by training through the three-dimensional reconstruction model training method.

The target three-dimensional reconstruction model is a fully trained three-dimensional reconstruction model obtained by continuously training the initial three-dimensional reconstruction model with the three-dimensional reconstruction model training method. The target three-dimensional reconstruction model specifically comprises: a multi-view position information fusion network and a CNN and Transformer hybrid network; the CNN and Transformer hybrid network comprises a CNN branch network and a Transformer branch network, which are connected through a feature fusion network.

S304, acquiring a three-dimensional reconstruction image of the object to be reconstructed output by the target three-dimensional reconstruction model.

Specifically, after the input image set corresponding to each image in the multi-view image set of the object to be reconstructed is input into the target three-dimensional reconstruction model, the three-dimensional reconstruction image output by the target three-dimensional reconstruction model is obtained, so that the three-dimensional reconstruction of the object to be reconstructed is realized.

In the embodiment, a multi-view image set of an object to be reconstructed is acquired; determining an input image set corresponding to each image in the multi-view image set according to the multi-view image set; inputting an input image set corresponding to each image into a target three-dimensional reconstruction model obtained by training through a three-dimensional reconstruction model training method; and acquiring a three-dimensional reconstruction image of the object to be reconstructed output by the target three-dimensional reconstruction model. By adopting the technical scheme, the target three-dimensional reconstruction model based on the CNN and Transformer mixed network can accurately and effectively restore the three-dimensional image of the object to be reconstructed, and the accuracy of three-dimensional reconstruction is improved.

Example four

Fig. 7 is a schematic structural diagram of a training apparatus for a three-dimensional reconstruction model according to a fourth embodiment of the present invention. The embodiment may be applicable to the case of training a three-dimensional reconstruction model, the apparatus may be implemented in a software and/or hardware manner, and the apparatus may be integrated in any device that provides a function of training a three-dimensional reconstruction model, as shown in fig. 7, the training apparatus for a three-dimensional reconstruction model specifically includes: an image set determination module 41, an image set input module 42, a position information fusion module 43, a predicted image acquisition module 44, and a network parameter adjustment module 45.

An image set determining module 41, configured to obtain a multi-view image set of the same object, and determine, according to the multi-view image set, an input image set corresponding to each image in the multi-view image set;

an image set input module 42, configured to input the input image set corresponding to each image into the initial three-dimensional reconstruction model; wherein the initial three-dimensional reconstruction model comprises: a multi-view position information fusion network and a CNN and Transformer hybrid network; wherein the CNN and Transformer hybrid network comprises: a CNN branch network and a Transformer branch network, the CNN branch network being connected with the Transformer branch network through a feature fusion network;

a position information fusion module 43, configured to perform position information fusion on each input image set through the multi-view position information fusion network to obtain a fusion feature image set;

a predicted image obtaining module 44, configured to input the fused feature image set to a CNN branch network and a Transformer branch network in the CNN and Transformer mixed network, and obtain a first predicted reconstructed image output by the CNN branch network and a second predicted reconstructed image output by the Transformer branch network;

a network parameter adjusting module 45, configured to calculate a first loss function value according to the first predicted reconstructed image and a reconstructed image label, and calculate a second loss function value according to the second predicted reconstructed image and the reconstructed image label; and to adjust the network parameters in the initial three-dimensional reconstruction model according to a target loss function value determined by the sum of the first loss function value and the second loss function value.

With this technical solution, a hybrid network structure of CNN and Transformer is designed based on the strength of the convolutional neural network (CNN) in extracting local features and the tendency of the Transformer to capture global features more easily. The extracted features are therefore both global and local, the accuracy of the model prediction result is improved, and the three-dimensional image of the object can be effectively restored.
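As a non-limiting illustration of how the network parameter adjusting module 45 may combine the two loss function values, the following minimal sketch sums a first and a second loss into the target loss. The embodiment does not specify the form of the loss function; the mean squared error used here, the function names `mse` and `target_loss`, and the flattened list representation of images are assumptions made only for this example.

```python
from typing import Sequence


def mse(pred: Sequence[float], label: Sequence[float]) -> float:
    """Mean squared error; an assumed stand-in for the unspecified loss function."""
    if len(pred) != len(label):
        raise ValueError("prediction and label must have the same length")
    return sum((p - y) ** 2 for p, y in zip(pred, label)) / len(pred)


def target_loss(first_pred: Sequence[float],
                second_pred: Sequence[float],
                label: Sequence[float]) -> float:
    """Target loss = first loss (CNN branch) + second loss (Transformer branch)."""
    first_loss = mse(first_pred, label)    # first predicted reconstructed image vs. label
    second_loss = mse(second_pred, label)  # second predicted reconstructed image vs. label
    return first_loss + second_loss


# Hypothetical flattened images, used only to exercise the functions.
label = [0.0, 1.0, 1.0, 0.0]
first_pred = [0.0, 1.0, 0.0, 0.0]
second_pred = [0.5, 1.0, 1.0, 0.5]
loss = target_loss(first_pred, second_pred, label)
```

In a real training loop, the resulting value would be back-propagated by whichever framework is used to adjust the network parameters; that step is outside the scope of this sketch.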

Optionally, the image set determining module 41 is configured to: sequentially acquire a first image $s_i$ from the multi-view image set $S = \{s_i\}$, where $s_i \in S$, $i \in [1, n]$, and $n$ is an integer greater than 1 denoting the number of images included in the multi-view image set;

for each first image $s_i$, randomly acquire a second image $s_j$ from the multi-view image set $S$, where $s_j \in S$, $j \in [1, m]$, $j \neq i$, and $m \le n - 1$;

and form, according to the first image $s_i$ and the second image $s_j$, the input image set corresponding to the first image $s_i$ as $P = \{\{s_i, s_1\}, \{s_i, s_1, s_2\}, \ldots, \{s_i, s_1, s_2, \ldots, s_m\}\}$.
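The construction of the input image set P may be sketched as follows. This is an illustrative sketch only: the function name `build_input_sets`, the 0-based index `i`, the explicit random generator argument, and the choice of drawing the m second images without repetition are assumptions not stated in the embodiment, which also does not say how m is chosen.

```python
import random
from typing import List, Sequence, TypeVar

ImageT = TypeVar("ImageT")


def build_input_sets(images: Sequence[ImageT], i: int, m: int,
                     rng: random.Random) -> List[List[ImageT]]:
    """Build P for the first image images[i] (i is 0-based in this sketch).

    Returns [[s_i, s_1], [s_i, s_1, s_2], ..., [s_i, s_1, ..., s_m]],
    where s_1..s_m are randomly drawn second images different from s_i.
    """
    n = len(images)
    if n < 2:
        raise ValueError("the multi-view image set needs more than one image")
    if not 0 <= i < n:
        raise ValueError("i must index an image in the set")
    if not 1 <= m <= n - 1:
        raise ValueError("m must satisfy 1 <= m <= n - 1")
    candidates = [img for k, img in enumerate(images) if k != i]
    second_images = rng.sample(candidates, m)  # assumed: drawn without repetition
    return [[images[i]] + second_images[:k] for k in range(1, m + 1)]


# Hypothetical view identifiers standing in for actual images.
views = ["v0", "v1", "v2", "v3"]
P = build_input_sets(views, i=0, m=3, rng=random.Random(0))
```

Each element of `P` extends the previous one by one further second image, so the model sees the first image together with a growing number of other views.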

Optionally, the CNN branch network includes: a downsampling stem module and a residual module; the Transformer branch network includes: a blocking patch module and a Transformer module; the residual module includes T residual sub-modules; and the Transformer module includes T Transformer sub-modules;

the (t-1)th residual sub-module is connected with the (t-1)th Transformer sub-module through the (t-1)th first feature fusion sub-module, where 2 ≤ t ≤ T; the first feature fusion sub-module is used for fusing the residual feature image output by the (t-1)th residual sub-module and the Transformer feature image output by the (t-1)th Transformer sub-module to obtain the (t-1)th first feature fusion image, and the first feature fusion image has the same feature dimension as the residual feature image output by the residual sub-module;

the (t-1)th Transformer sub-module is connected with the (t-1)th residual sub-module through the (t-1)th second feature fusion sub-module; the second feature fusion sub-module is used for fusing the residual feature image output by the (t-1)th residual sub-module and the Transformer feature image output by the (t-1)th Transformer sub-module to obtain the (t-1)th second feature fusion image, and the second feature fusion image has the same feature dimension as the Transformer feature image output by the Transformer sub-module;

the second feature fusion sub-module includes: a first convolution layer whose activation function is the ReLU function, a first conversion layer, a fully connected layer, and a first fusion layer; the first feature fusion sub-module includes: a second conversion layer, a second convolution layer whose activation function is the sigmoid function, and a second fusion layer.

Optionally, the predicted image obtaining module 44 includes:

an image input unit, configured to input the fused feature image set to a CNN branch network and a Transformer branch network in the CNN and Transformer hybrid network;

a first image obtaining unit, configured to perform a convolution operation and a max pooling operation on the fused feature image set through the stem module in the CNN branch network, and input the resulting pooled feature image into the residual module; and to perform residual processing on the input pooled feature image through the T residual sub-modules in the residual module to obtain the first predicted reconstructed image;

a second image obtaining unit, configured to perform a blocking operation on the fused feature image set through the blocking patch module in the Transformer branch network, and input the resulting block feature image into the Transformer module; and to perform feature extraction on the input block feature image through the T Transformer sub-modules in the Transformer module to obtain the second predicted reconstructed image.
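The max pooling performed by the stem module and the blocking operation performed by the patch module may be illustrated on a single-channel feature image as follows. This is a simplified sketch: the function names, the non-overlapping windows (stride equal to window size), and the row-major flattening of each block are assumptions, and the convolution of the stem module is omitted.

```python
from typing import List

Image = List[List[float]]


def max_pool(image: Image, size: int) -> Image:
    """Non-overlapping max pooling with a size x size window (stride = size, assumed)."""
    h, w = len(image), len(image[0])
    if h % size or w % size:
        raise ValueError("image height and width must be divisible by the window size")
    return [[max(image[r][c]
                 for r in range(top, top + size)
                 for c in range(left, left + size))
             for left in range(0, w, size)]
            for top in range(0, h, size)]


def split_into_blocks(image: Image, patch: int) -> List[List[float]]:
    """Split an image into non-overlapping patch x patch blocks, each flattened row-major."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image height and width must be divisible by the patch size")
    blocks = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            blocks.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return blocks


# Hypothetical 4 x 4 fused feature image with values 0..15.
feature = [[float(4 * r + c) for c in range(4)] for r in range(4)]
pooled = max_pool(feature, 2)            # input to the residual module
blocks = split_into_blocks(feature, 2)   # input to the Transformer module
```

Here the pooled feature image feeds the CNN branch and the list of flattened blocks feeds the Transformer branch, mirroring the two units described above.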

Optionally, the first image obtaining unit is specifically configured to:

for the 1st residual sub-module in the residual module, perform residual processing on the pooled feature image output by the stem module to obtain the 1st residual feature image, and input the 1st residual feature image into the 1st first feature fusion sub-module;

for the t-th residual sub-module in the residual module, perform residual processing on the (t-1)th first feature fusion image output by the (t-1)th first feature fusion sub-module to obtain the t-th residual feature image, where 2 ≤ t ≤ T;

when 2 ≤ t ≤ T-1, input the t-th residual feature image into the t-th first feature fusion sub-module; and when t = T, determine the T-th residual feature image output by the T-th residual sub-module as the first predicted reconstructed image.

Optionally, the second image obtaining unit is specifically configured to:

for the 1st Transformer sub-module in the Transformer module, perform Transformer processing on the block feature image output by the patch module to obtain the 1st Transformer feature image, and input the 1st Transformer feature image into the 1st second feature fusion sub-module;

for the t-th Transformer sub-module in the Transformer module, perform Transformer processing on the (t-1)th second feature fusion image output by the (t-1)th second feature fusion sub-module to obtain the t-th Transformer feature image, where 2 ≤ t ≤ T;

when 2 ≤ t ≤ T-1, input the t-th Transformer feature image into the t-th second feature fusion sub-module; and when t = T, determine the T-th Transformer feature image output by the T-th Transformer sub-module as the second predicted reconstructed image.
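The interleaving of the two branches described for the first and second image obtaining units may be sketched as a simple loop. The sketch shows only the wiring between sub-modules: the function name `run_hybrid`, the representation of features as flat lists, and the toy sub-modules in the usage example are hypothetical, and the internal layers of the fusion sub-modules are not modelled.

```python
from typing import Callable, List, Sequence, Tuple

Feature = List[float]
Block = Callable[[Feature], Feature]
Fusion = Callable[[Feature, Feature], Feature]


def run_hybrid(pooled: Feature, blocks: Feature,
               residual_subs: Sequence[Block],
               transformer_subs: Sequence[Block],
               first_fusions: Sequence[Fusion],
               second_fusions: Sequence[Fusion]) -> Tuple[Feature, Feature]:
    """Run T residual and T Transformer sub-modules with T-1 fusion pairs between them."""
    T = len(residual_subs)
    if len(transformer_subs) != T:
        raise ValueError("both branches need the same number of sub-modules")
    if len(first_fusions) != T - 1 or len(second_fusions) != T - 1:
        raise ValueError("T - 1 fusion sub-modules are needed per direction")
    res = residual_subs[0](pooled)       # 1st residual feature image
    trans = transformer_subs[0](blocks)  # 1st Transformer feature image
    for t in range(1, T):                # corresponds to t = 2..T in the text
        fused_cnn = first_fusions[t - 1](res, trans)     # (t-1)th first feature fusion image
        fused_trans = second_fusions[t - 1](res, trans)  # (t-1)th second feature fusion image
        res = residual_subs[t](fused_cnn)                # t-th residual feature image
        trans = transformer_subs[t](fused_trans)         # t-th Transformer feature image
    return res, trans  # first and second predicted reconstructed images


# Toy stand-ins, used only to make the wiring observable.
add_one: Block = lambda x: [v + 1.0 for v in x]
double: Block = lambda x: [v * 2.0 for v in x]
add_features: Fusion = lambda a, b: [u + v for u, v in zip(a, b)]

first_pred, second_pred = run_hybrid(
    [1.0], [1.0],
    residual_subs=[add_one, add_one],
    transformer_subs=[double, double],
    first_fusions=[add_features],
    second_fusions=[add_features],
)
```

Both fusion images at step t-1 are computed from the same pair of outputs before either branch advances, so neither branch sees a feature of the other branch from a later step.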

Further, each Transformer sub-module comprises: an attention module and a multi-layer perceptron (MLP);

the attention module obtains an attention output image based on the input block feature image or Transformer feature image;

and the multi-layer perceptron MLP performs perception processing on the input attention output image to obtain the corresponding Transformer feature image.
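A Transformer sub-module of this form, an attention module followed by an MLP, may be sketched as follows. The embodiment does not detail the attention mechanism or the MLP, so single-head scaled dot-product self-attention, a two-layer MLP with ReLU activation, and the absence of biases, residual connections and normalization are all assumptions of this sketch, as are the function and parameter names.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (p x q) and b (q x r)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens: Matrix, wq: Matrix, wk: Matrix, wv: Matrix) -> Matrix:
    """Assumed single-head scaled dot-product self-attention over the input tokens."""
    q, k, v = matmul(tokens, wq), matmul(tokens, wk), matmul(tokens, wv)
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(r) for r in scores]
    return matmul(weights, v)  # attention output image


def mlp(tokens: Matrix, w1: Matrix, w2: Matrix) -> Matrix:
    """Assumed two-layer perceptron with ReLU between the layers."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(tokens, w1)]
    return matmul(hidden, w2)


def transformer_submodule(tokens: Matrix, wq: Matrix, wk: Matrix, wv: Matrix,
                          w1: Matrix, w2: Matrix) -> Matrix:
    """Attention module followed by the MLP, yielding a Transformer feature image."""
    return mlp(self_attention(tokens, wq, wk, wv), w1, w2)


# Hypothetical two-token input with identity weights, used only as a smoke test.
identity = [[1.0, 0.0], [0.0, 1.0]]
out = transformer_submodule(identity, identity, identity, identity, identity, identity)
```

With identity weights the output equals the attention weight matrix, so each output row sums to one and each token attends most strongly to itself.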

The training device for the three-dimensional reconstruction model provided by the embodiment of the invention can execute the training method for the three-dimensional reconstruction model provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.

Example five

Fig. 8 is a schematic structural diagram of a three-dimensional reconstruction apparatus according to a fifth embodiment of the present invention. This embodiment is applicable to performing three-dimensional reconstruction on an object to be reconstructed. The apparatus may be implemented in software and/or hardware, and may be integrated in any device providing a three-dimensional reconstruction function. As shown in Fig. 8, the three-dimensional reconstruction apparatus specifically includes:

an image set obtaining module 51, configured to obtain a multi-view image set of an object to be reconstructed;

an image set correspondence module 52, configured to determine, according to the multi-view image set, an input image set corresponding to each image in the multi-view image set;

a model input module 53, configured to input the input image set corresponding to each image into a target three-dimensional reconstruction model obtained by training with the training method of the three-dimensional reconstruction model;

a reconstructed image obtaining module 54, configured to obtain a three-dimensional reconstructed image of the object to be reconstructed output by the target three-dimensional reconstructed model.

By adopting the technical scheme, the target three-dimensional reconstruction model based on the CNN and Transformer mixed network can accurately and effectively restore the three-dimensional image of the object to be reconstructed, and the accuracy of three-dimensional reconstruction is improved.

The three-dimensional reconstruction device provided by the embodiment of the invention can execute the three-dimensional reconstruction method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.

Example six

FIG. 9 shows a schematic block diagram of an electronic device 60 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.

As shown in fig. 9, the electronic device 60 includes at least one processor 61, and a memory communicatively connected to the at least one processor 61, such as a Read Only Memory (ROM) 62, a Random Access Memory (RAM) 63, and the like, wherein the memory stores computer programs executable by the at least one processor, and the processor 61 may perform various suitable actions and processes according to the computer programs stored in the Read Only Memory (ROM) 62 or the computer programs loaded from the storage unit 68 into the Random Access Memory (RAM) 63. In the RAM 63, various programs and data necessary for the operation of the electronic apparatus 60 can also be stored. The processor 61, the ROM 62, and the RAM 63 are connected to each other by a bus 64. An input/output (I/O) interface 65 is also connected to bus 64.

A number of components in the electronic device 60 are connected to the I/O interface 65, including: an input unit 66 such as a keyboard, a mouse, or the like; an output unit 67 such as various types of displays, speakers, and the like; a storage unit 68 such as a magnetic disk, optical disk, or the like; and a communication unit 69 such as a network card, modem, wireless communication transceiver, etc. The communication unit 69 allows the electronic device 60 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.

Processor 61 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 61 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. The processor 61 performs the various methods and data processing described above, such as a training method of a three-dimensional reconstruction model or a three-dimensional reconstruction method.

In some embodiments, the training method of the three-dimensional reconstruction model or the three-dimensional reconstruction method may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as the storage unit 68. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 60 via the ROM 62 and/or the communication unit 69. When the computer program is loaded into the RAM 63 and executed by the processor 61, one or more steps of the training method of the three-dimensional reconstruction model or the three-dimensional reconstruction method described above may be performed. Alternatively, in other embodiments, the processor 61 may be configured by any other suitable means (e.g., by means of firmware) to perform the training method of the three-dimensional reconstruction model or the three-dimensional reconstruction method.

Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

A computer program for implementing the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine or entirely on a remote machine or server.

In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), blockchain networks, and the internet.

The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical host and VPS service are overcome.

It should be understood that the flows shown above may be used in various forms, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in a different order, which is not limited herein as long as the desired results of the technical solution of the present invention can be achieved.

The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

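1. A training method for a three-dimensional reconstruction model, comprising:

acquiring a multi-view image set of the same object, and determining, according to the multi-view image set, an input image set corresponding to each image in the multi-view image set;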

inputting the input image set corresponding to each image into an initial three-dimensional reconstruction model; wherein the initial three-dimensional reconstruction model comprises: a multi-view position information fusion network, and a hybrid network of CNN and Transformer; wherein the hybrid network of CNN and Transformer comprises: a CNN branch network and a Transformer branch network, the CNN branch network and the Transformer branch network being connected through a feature fusion network;

performing position information fusion on each of the input image sets through the multi-view position information fusion network to obtain a fusion feature image set;

inputting the fusion feature image set to the CNN branch network and the Transformer branch network in the hybrid network of CNN and Transformer, to obtain a first predicted reconstructed image output by the CNN branch network and a second predicted reconstructed image output by the Transformer branch network;

calculating a first loss function value according to the first predicted reconstructed image and a reconstructed image label, and calculating a second loss function value according to the second predicted reconstructed image and the reconstructed image label; and adjusting network parameters in the initial three-dimensional reconstruction model according to a target loss function value determined by the sum of the first loss function value and the second loss function value.

2. The method according to claim 1, wherein the determining, according to the multi-view image set, the input image set corresponding to each image in the multi-view image set comprises: sequentially acquiring a first image $s_i$ from the multi-view image set $S = \{s_i\}$, where $s_i \in S$, $i \in [1, n]$, and $n$ is an integer greater than 1 denoting the number of images contained in the multi-view image set;

for each first image $s_i$, randomly acquiring a second image $s_j$ from the multi-view image set $S$, where $s_j \in S$, $j \in [1, m]$, $j \neq i$, and $m \le n - 1$;

and forming, according to the first image $s_i$ and the second image $s_j$, the input image set corresponding to the first image $s_i$ as $P = \{\{s_i, s_1\}, \{s_i, s_1, s_2\}, \ldots, \{s_i, s_1, s_2, \ldots, s_m\}\}$.

3. The method according to claim 1, wherein the CNN branch network comprises: a downsampling stem module and a residual module; the Transformer branch network comprises: a blocking patch module and a Transformer module; the residual module comprises T residual sub-modules; and the Transformer module comprises T Transformer sub-modules;

wherein the (t-1)th residual sub-module is connected with the (t-1)th Transformer sub-module through the (t-1)th first feature fusion sub-module, where 2 ≤ t ≤ T; the first feature fusion sub-module is used to fuse the residual feature image output by the (t-1)th residual sub-module and the Transformer feature image output by the (t-1)th Transformer sub-module to obtain the (t-1)th first feature fusion image, and the feature dimension of the first feature fusion image is the same as that of the residual feature image output by the residual sub-module;

the (t-1)th Transformer sub-module is connected with the (t-1)th residual sub-module through the (t-1)th second feature fusion sub-module; the second feature fusion sub-module is used to fuse the residual feature image output by the (t-1)th residual sub-module and the Transformer feature image output by the (t-1)th Transformer sub-module to obtain the (t-1)th second feature fusion image, and the feature dimension of the second feature fusion image is the same as that of the Transformer feature image output by the Transformer sub-module;

the second feature fusion sub-module comprises: a first convolution layer whose activation function is the ReLU function, a first conversion layer, a fully connected layer, and a first fusion layer; the first feature fusion sub-module comprises: a second conversion layer, a second convolution layer whose activation function is the sigmoid function, and a second fusion layer.

4. The method according to claim 3, wherein the inputting the fusion feature image set to the CNN branch network and the Transformer branch network in the hybrid network of CNN and Transformer, to obtain the first predicted reconstructed image output by the CNN branch network and the second predicted reconstructed image output by the Transformer branch network, comprises:

inputting the fusion feature image set to the CNN branch network and the Transformer branch network in the hybrid network of CNN and Transformer;

performing, through the stem module in the CNN branch network, a convolution operation and a max pooling operation on the fusion feature image set, and inputting the result into the residual module; and performing residual processing on the input pooled feature image through the T residual sub-modules in the residual module to obtain the first predicted reconstructed image;

performing, through the blocking patch module in the Transformer branch network, a blocking operation on the fusion feature image set, and inputting the result into the Transformer module; and performing feature extraction on the input block feature image through the T Transformer sub-modules in the Transformer module to obtain the second predicted reconstructed image.

5. The method according to claim 4, wherein the performing residual processing on the input pooled feature image through the T residual sub-modules in the residual module to obtain the first predicted reconstructed image comprises:

for the 1st residual sub-module in the residual module, performing residual processing on the pooled feature image output by the stem module to obtain the 1st residual feature image, and inputting the 1st residual feature image to the 1st first feature fusion sub-module;

for the t-th residual sub-module in the residual module, performing residual processing on the (t-1)th first feature fusion image output by the (t-1)th first feature fusion sub-module to obtain the t-th residual feature image, where 2 ≤ t ≤ T;

when 2 ≤ t ≤ T-1, inputting the t-th residual feature image to the t-th first feature fusion sub-module; and when t = T, determining the T-th residual feature image output by the T-th residual sub-module as the first predicted reconstructed image.

6. The method according to claim 4, wherein the performing feature extraction on the input block feature image through the T Transformer sub-modules in the Transformer module to obtain the second predicted reconstructed image comprises:

for the 1st Transformer sub-module in the Transformer module, performing Transformer processing on the block feature image output by the patch module to obtain the 1st Transformer feature image, and inputting the 1st Transformer feature image to the 1st second feature fusion sub-module;

for the t-th Transformer sub-module in the Transformer module, performing Transformer processing on the (t-1)th second feature fusion image output by the (t-1)th second feature fusion sub-module to obtain the t-th Transformer feature image, where 2 ≤ t ≤ T;

when 2 ≤ t ≤ T-1, inputting the t-th Transformer feature image to the t-th second feature fusion sub-module; and when t = T, determining the T-th Transformer feature image output by the T-th Transformer sub-module as the second predicted reconstructed image.

7. The method according to claim 4, wherein each Transformer sub-module comprises: an attention module and a multi-layer perceptron (MLP);

the attention module obtains an attention output image based on the input block feature image or Transformer feature image;

and the multi-layer perceptron MLP performs perception processing on the input attention output image to obtain the corresponding Transformer feature image.

8. A three-dimensional reconstruction method, comprising:

obtaining a multi-view image set of an object to be reconstructed;

determining, according to the multi-view image set, an input image set corresponding to each image in the multi-view image set;

inputting the input image set corresponding to each image into a target three-dimensional reconstruction model obtained by training with the training method of the three-dimensional reconstruction model according to any one of claims 1-7;

acquiring a three-dimensional reconstructed image of the object to be reconstructed output by the target three-dimensional reconstruction model.

9. An electronic device, characterized in that the electronic device comprises:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor to enable the at least one processor to perform the training method of the three-dimensional reconstruction model according to any one of claims 1-7, or the three-dimensional reconstruction method according to claim 8.

10. A computer-readable storage medium, wherein the computer-readable storage medium stores computer instructions which, when executed, cause a processor to implement the training method of the three-dimensional reconstruction model according to any one of claims 1-7, or the three-dimensional reconstruction method according to claim 8.