CN117994421A

CN117994421A - A dense 3D reconstruction method for the earth's surface based on multi-view optical satellite images

Info

Publication number: CN117994421A
Application number: CN202410025275.4A
Authority: CN
Inventors: 洪中华; 杨沛鑫; 潘海燕; 周汝雁; 张云; 韩彦岭; 王静; 杨树瑚; 马振玲; 徐利军
Original assignee: Shanghai Ocean University
Current assignee: Shanghai Ocean University
Priority date: 2024-01-08
Filing date: 2024-01-08
Publication date: 2024-05-07
Anticipated expiration: 2044-01-08
Also published as: CN117994421B

Abstract

The invention belongs to the technical field of three-dimensional reconstruction and discloses a dense three-dimensional surface reconstruction method based on multi-view optical satellite images. A reconstruction network model is trained on a multi-view satellite image dataset, and the trained model is then used for three-dimensional reconstruction. The reconstruction network model comprises a feature extraction module and an elevation prediction module. The feature extraction module applies, in sequence, a feature pyramid network (FPN), deformable convolution (DCN) operations, and an attention mechanism to extract local and global features at different scales, yielding feature volumes at different scales. The elevation prediction module applies differentiable RPC mapping to the feature volumes in order from small scale to large scale and regularizes the result in combination with the context features of the reference image; the regularization result for a smaller-scale feature volume feeds into the differentiable RPC mapping of the next larger-scale feature volume, and the regularization result for the largest-scale feature volume is taken as the final result.

Description

Dense three-dimensional surface reconstruction method based on multi-view optical satellite images

Technical Field

The invention relates to the technical field of three-dimensional reconstruction, and in particular to a dense three-dimensional surface reconstruction method based on multi-view optical satellite images.

Background

Surface information is one of the most important inputs to land planning and management. Applications such as urban three-dimensional modeling and planning, land change detection, and forest management all depend on good three-dimensional terrain data.

In recent years, multi-view three-dimensional reconstruction based on deep learning has achieved remarkable results and shown great potential in close-range scenes, where it is applied mainly to ground-based benchmark datasets and multi-view aerial images. In 2021, Gao proposed the SatMVS neural network model, which for the first time used differentiable rational polynomial camera (RPC) model mapping in place of the conventional practice of fitting RPC parameters to a pinhole camera, thereby adapting the network to the multi-view satellite imaging model. This reduces projection error and improves the reconstruction accuracy of satellite images. Compared with near-ground data, however, satellite images suffer from poor illumination conditions, high scene complexity, and large elevation ranges. Applying existing networks directly to three-dimensional reconstruction of large-area satellite images therefore yields results of low accuracy and low completeness, and cannot match the performance achieved on near-ground data. Such networks reconstruct complex terrain in satellite data reasonably well, but their accuracy and completeness still need improvement, particularly in areas with large elevation differences.

Disclosure of Invention

The invention provides a dense three-dimensional surface reconstruction method based on multi-view optical satellite images that extracts effective feature information and iteratively optimizes elevation, thereby achieving dense DSM reconstruction of areas with complex ground cover and areas with large elevation differences.

The invention can be realized by the following technical scheme:

A dense three-dimensional surface reconstruction method based on multi-view optical satellite images comprises the following steps:

Step one, constructing a multi-view satellite image dataset;

Step two, establishing a reconstruction network model.

The reconstruction network model comprises a feature extraction module and an elevation prediction module. The feature extraction module applies, in sequence, a feature pyramid network (FPN), deformable convolution (DCN) operations, and an attention mechanism to extract local and global features at different scales, yielding feature volumes at different scales;

The elevation prediction module applies differentiable RPC mapping to the feature volumes in order from small scale to large scale and regularizes the result in combination with the context features of the reference image; the regularization result for a smaller-scale feature volume feeds into the differentiable RPC mapping of the next larger-scale feature volume, and the regularization result for the largest-scale feature volume is taken as the final result;

Step three, training the reconstruction network model on the multi-view satellite image dataset, and then using the trained model to perform three-dimensional reconstruction of new multi-view optical satellite images.

Further, the elevation prediction module comprises a differentiable RPC mapping module, a sequential regularization module, and an iterative regularization module. The differentiable RPC mapping module first processes the feature volumes in order from small scale to large scale; the sequential regularization module then regularizes each result; and the iterative regularization module then performs several iterations of regularization in combination with the context features of the reference image. The result of the last iteration for a smaller-scale feature volume feeds into the differentiable RPC mapping of the next larger-scale feature volume. The largest-scale feature volume, however, is processed only by the differentiable RPC mapping module and the sequential regularization module.

The output of every regularization takes part in the loss calculation, which is back-propagated to optimize the parameters of the reconstruction network model.

Further, the differentiable RPC mapping module maps the current-scale feature volumes of the source images, over a given range of elevation values, to the current-scale reference view of the reference image to construct several current-scale cost volumes, and then aggregates these by variance into a single current-scale cost volume;

the sequential regularization module regularizes the current-scale cost volume to obtain a current-scale initial elevation map;

the iterative regularization module first applies simplified differentiable RPC mapping to the current-scale feature volume to obtain a current-scale small cost volume, then uses the current-scale initial elevation map, with linear interpolation, to index into the small cost volume and construct a current-scale high-correlation cost volume G_f, and then iteratively regularizes G_f together with the context features of the reference image several times to obtain several current-scale elevation maps. The current-scale elevation map from the last iteration is used in the differentiable RPC mapping of the next-scale feature volume.

Further, the loss is calculated as a weighted sum over scales and iterations, and the parameters of the reconstruction network model are adjusted by back-propagation.

In the formula, the predicted term is the elevation value from the initial elevation map or a predicted elevation map, H is the elevation value from the ground-truth elevation map, N is the total number of scales, T is the total number of iterations at each scale, w_s is the weight assigned to each stage, and a further weight is assigned to each iteration within a scale.
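The formula itself does not survive in this text. One form consistent with the definitions above, offered only as a hedged reconstruction, is a doubly weighted L1 loss (the L1 form follows the detailed description). The symbols \hat{H}_{s,t} (elevation output t at scale s, with t = 0 the initial elevation map) and \lambda_t (per-iteration weight) are placeholder names introduced here, and the index ranges and normalisation are assumptions:

```latex
L \;=\; \sum_{s=1}^{N} w_s \sum_{t=0}^{T} \lambda_t \,\bigl\lVert \hat{H}_{s,t} - H \bigr\rVert_1
```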

Further, in the differentiable RPC mapping module, the given range of elevation values is selected manually when the smallest-scale feature volume is processed; when feature volumes of the other scales are processed, the range is set from the elevation values contained in the elevation map of the previous scale.

Further, the context features of the reference image are extracted by a feature pyramid network (FPN) and deformable convolution (DCN) operations to obtain several reference feature maps at different scales, and each reference feature map is passed through two different activation function layers to obtain, correspondingly, several reference feature volumes at different scales;

The two reference feature volumes of the corresponding scale are selected each time to take part in the processing of the iterative regularization module.

Further, the feature extraction module first uses the feature pyramid network (FPN) to perform multi-scale feature extraction on the reference image and on each source image, obtaining several feature maps at different scales; it then applies deformable convolution (DCN) to the feature maps at each scale to obtain features corresponding to targets of different sizes; and it finally uses the attention mechanism module to capture global features, yielding feature volumes at different scales.

Further, the feature maps are labeled, from small scale to large scale, as the first-scale feature map, the second-scale feature map, and so on up to the Mth-scale feature map.

The attention mechanism module first applies the positional encoding module and reduces the dimensionality of the first-scale feature map, and then applies the self-attention module and the cross-attention module in turn to extract global features and output the first-scale feature volume;

The first-scale feature volume is then up-sampled to the scale of the second-scale feature map and fused with the second-scale feature map by addition to output the second-scale feature volume; the second-scale feature volume is up-sampled to the scale of the third-scale feature map and fused with the third-scale feature map by addition; and so on, until the Mth-scale feature volume is output.

Further, each deformable convolution (DCN) operation comprises a convolution layer with a 3×3 kernel and a DCN convolution layer with a 3×3 kernel.

Further, the reference image is the nadir-view image and the source images are the forward-view and backward-view images; one reference image and two source images are input into the reconstruction network model each time for subsequent processing.

The beneficial technical effects of the invention are as follows:

1. A new neural network model is designed for dense reconstruction of different complex terrain across different multi-view satellite image datasets, and it generalizes well.

2. Based on Transformer theory, for the feature maps extracted by the FPN and DCN, setting Q, K, and V to different image regions effectively strengthens feature association between different images. Experimental results show that this effectively improves the accuracy of the reconstruction result.

3. For multi-layer cascaded elevation prediction, a strategy of iteratively optimizing elevation is proposed. Using mask up-sampling, the initial elevation result of each layer is refined while the reliability of the elevation passed between layers is ensured, improving the completeness of the overall result. Experimental results show that this effectively improves the completeness of the reconstruction result.

4. A simple and effective depth estimation strategy is provided: with a reasonable weight allocation, the elevation results predicted by each module are used effectively to supervise training of the model, further improving the reliability of the reconstruction result.

Drawings

FIG. 1 is a schematic general flow diagram of the present invention;

FIG. 2 is a schematic diagram of the attention module of the present invention;

FIG. 3 is a schematic diagram of an iterative regularization module of the present invention;

FIG. 4 is a graph comparing the reconstruction of different terrain in the TLC SATMVS dataset by different methods in an embodiment of the present invention;

FIG. 5 is a graph comparing the reconstruction of different terrain in the Taihu Lake dataset by different methods in an embodiment of the present invention.

Detailed Description

The following detailed description of the invention refers to the accompanying drawings and preferred embodiments.

As shown in FIG. 1, the invention provides a dense three-dimensional surface reconstruction method based on multi-view optical satellite images. First, an FPN network extracts multi-scale feature information from each of the multi-view images; this information serves as the coarse-to-fine, multi-stage feature input. Because multi-view satellite images differ in resolution, multiple DCN layers are introduced into the up-sampling module of the FPN so that, at each stage, the network attends to the same features appearing at markedly different scales in different images. A multi-stage Transformer then extracts and associates high-dimensional features across images, so that the same feature with an inconsistent extent under different viewing angles is better captured. Next, differentiable RPC mapping maps the features of the several source images to the reference view, constructing several cost volumes that are aggregated by variance into a single cost volume. Sequential regularization (2D Unet ConvGRU) regularizes this cost volume and predicts an initial elevation map together with a feature volume containing effective information. The context features extracted from the reference image are used as the initial state of iterative regularization (Iter ConvGRU), which corrects errors in the initial elevation map to give a refined elevation map; the corrected predicted elevation map is the input to the next stage. Finally, the elevation maps from all stages enter the loss function to supervise training, achieving dense reconstruction of surface information.

The method comprises the following steps:

Step one, constructing a multi-view satellite image dataset to serve as the training dataset. The data may come from the public TLC SATMVS dataset, which consists of multi-view optical images acquired by the Ziyuan-3 02 (ZY-3 02) satellite.

Note that the dataset contains two groups of data. The first group consists of large satellite images, each 5120×5120 in size; the second group consists of cropped small images, each 768×384 in size, which can be used directly for training and testing the reconstruction network model. Because data of size 5120×5120 cannot be predicted directly with the available computing resources, the first group's large-image test set is cropped to 2048×1472 following the cropping scheme used for the second group in TLC SATMVS, and is used to examine the performance of the network model on different terrain. The nadir-view spatial resolution is 2.1 m, and the forward-view and backward-view spatial resolution is 2.5 m. The nadir-view image serves as the reference image and the forward-view and backward-view images serve as source images; one reference image and two source images take part in each training and testing pass of the reconstruction network model.

Similarly, multi-view image data of the Taihu Lake area of China acquired by the Ziyuan-3 01 (ZY-3 01) satellite is preprocessed, and several groups of images covering plains, mountains, and urban areas are selected to verify the performance of the network model on different terrain at lower resolution. The original image of the area is 29291×28323 in size and is cropped into 2048×1472 blocks. The nadir-view spatial resolution is 2.1 m and the forward-view and backward-view spatial resolution is 3.5 m, which is lower than that of the forward and backward views of the 02 satellite.

Step two, establishing the reconstruction network model.

The reconstruction network model comprises a feature extraction module and an elevation prediction module. The feature extraction module applies, in sequence, a feature pyramid network (FPN), deformable convolution (DCN) operations, and an attention mechanism to extract local and global features at different scales, yielding feature volumes at different scales;

The elevation prediction module applies differentiable RPC mapping to the feature volumes in order from small scale to large scale and regularizes the result in combination with the context features of the reference image; the regularization result for a smaller-scale feature volume feeds into the differentiable RPC mapping of the next larger-scale feature volume, and the regularization result for the largest-scale feature volume is taken as the final result.

1. Feature extraction module

The feature extraction module first uses a feature pyramid network (FPN) to perform multi-scale feature extraction on the reference image and on each source image, obtaining several feature maps at different scales. It then applies deformable convolution (DCN) to the feature maps at each scale to obtain features corresponding to targets of different sizes, and finally uses an attention mechanism module to capture global features, yielding feature volumes at different scales. Each DCN operation comprises one convolution layer with a 3×3 kernel and four DCN convolution layers with 3×3 kernels.

The feature maps are labeled, from small scale to large scale, as the first-scale feature map, the second-scale feature map, and so on up to the Mth-scale feature map. The attention mechanism module processes the first-scale feature map and outputs the first-scale feature volume. The first-scale feature volume is then up-sampled to the scale of the second-scale feature map and fused with the second-scale feature map by addition to output the second-scale feature volume; the second-scale feature volume is up-sampled to the scale of the third-scale feature map and fused with the third-scale feature map by addition; and so on, until the Mth-scale feature volume is output.

Specifically, a three-layer feature pyramid network (FPN) first processes one reference image and two source images and outputs feature maps at three resolutions: 512×368×32 (first layer), 1024×736×16 (second layer), and 2048×1472×8 (third layer). Each layer contains three feature maps: the feature map F₀ of the reference image and the feature maps F₁ and F₂ of the source images. Because multi-view satellite images differ in resolution, a deformable convolution (DCN) operation is then applied to the output features at each of the three scales to provide features for objects of different sizes in the images. Each DCN operation comprises 1 convolution layer with a 3×3 kernel and 4 DCN convolution layers with 3×3 kernels, and leaves the feature scale unchanged. At each stage it attends to the same features appearing at markedly different scales in different images. These features, however, do not account for correlation between views, so an attention mechanism is finally introduced to capture the correlation between images and the global context.

Referring to FIG. 2, positional encoding modules are first applied to the three first-layer feature maps with a resolution of 512×368×32, namely the reference image feature map F₀ and the source image feature maps F₁ and F₂, to record the relative positions of different features. Each positional encoding module alternates sine (sin) and cosine (cos) functions, and the high-dimensional image features are flattened to 188416 (512×368)×32. A self-attention module is then applied to each feature map to capture global relations among its features; here the Q, K, and V vectors come from the same view. The features pass through 4 encoder blocks, each comprising 1 normalization layer (LayerNorm), 1 linear layer (Linear), 1 activation function layer (ReLU), and 1 dropout layer (Dropout). To better capture feature associations between the multi-view images, a cross-attention module follows the self-attention module and attends to the association between reference image features and source image features: in these encoder blocks Q comes from the reference image feature map F₀, K and V come from the paired source image feature maps F₁ and F₂, and the source image features are updated. Self-attention and cross-attention encoder blocks are interleaved over 8 encoder blocks, producing feature-enhanced 512×368×32 feature volumes. These are up-sampled and fused, through 1 Add layer, with the second-layer 1024×736×16 feature maps output by the DCN operation, giving new feature volumes of the same size, 1024×736×16. The same operation is applied to the third-layer feature maps, so that feature volumes of different sizes are obtained and feature association is enhanced at every scale.
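The self-attention and cross-attention blocks described above differ only in where Q, K, and V come from. The following is a minimal sketch in plain Python, assuming standard scaled dot-product attention and omitting the learned linear projections, LayerNorm, and Dropout; the function name `attention` and the toy token lists are hypothetical and not taken from the patent.

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs

# Toy tokens (assumed values): two positions, two channels per view.
f0 = [[1.0, 0.0], [0.0, 1.0]]   # reference-view tokens
f1 = [[0.5, 0.5], [0.5, 0.5]]   # source-view tokens

self_out = attention(f0, f0, f0)    # Q, K, V from the same view
cross_out = attention(f0, f1, f1)   # Q from the reference, K and V from a source view
```

On the first-layer feature maps the token count would be 512 × 368 = 188416 per view, which is why the number of attention layers and the channel width matter for memory.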

2. Elevation prediction module

The elevation prediction module comprises a differentiable RPC mapping module, a sequential regularization module, and an iterative regularization module. The differentiable RPC mapping module first processes the feature volumes in order from small scale to large scale; the sequential regularization module then regularizes each result; and the iterative regularization module then performs several iterations of regularization in combination with the context features of the reference image. The result of the last iteration for a smaller-scale feature volume feeds into the differentiable RPC mapping of the next larger-scale feature volume. The largest-scale feature volume, however, is processed only by the differentiable RPC mapping module and the sequential regularization module. The output of every regularization takes part in the loss calculation, which is back-propagated to optimize the parameters of the reconstruction network model.

Specifically, the differentiable RPC mapping module performs feature projection between the multi-view satellite images. Over a given range of elevation values, it maps the current-scale feature volumes of the source images to the current-scale reference view of the reference image to construct several current-scale cost volumes, and then aggregates these by variance into a single current-scale cost volume. The given range of elevation values is selected manually when the smallest-scale feature volume is processed; when feature volumes of the other scales are processed, it is set from the elevation values contained in the elevation map of the previous scale.
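How the elevation hypotheses could be laid out for the two cases above can be sketched as follows. This is an illustration only: uniform spacing, the function names, the 0 to 630 m range, and the ±15.5 m window around the previous estimate are all assumptions; the plane counts 64 and 32 are those given for the first two layers of the embodiment.

```python
def elevation_planes(h_min, h_max, num_planes):
    """Uniformly spaced elevation hypotheses covering [h_min, h_max]."""
    if num_planes == 1:
        return [(h_min + h_max) / 2.0]
    step = (h_max - h_min) / (num_planes - 1)
    return [h_min + i * step for i in range(num_planes)]

def refined_planes(previous_elevation, half_width, num_planes):
    """Hypotheses centred on the elevation predicted at the previous scale."""
    return elevation_planes(previous_elevation - half_width,
                            previous_elevation + half_width,
                            num_planes)

# Smallest scale: one manually chosen range shared by every pixel (assumed values).
coarse = elevation_planes(0.0, 630.0, 64)
# Next scale: a narrower per-pixel range around the up-sampled estimate (assumed values).
finer = refined_planes(120.0, 15.5, 32)
```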

For the three layers of feature volumes, the elevation range for the first layer must be given manually: the features of F₁ and F₂ are mapped into the coordinate system of F₀, and a cost volume is then obtained by variance aggregation. For the remaining two layers, the corresponding features of F₁ and F₂ are mapped into the coordinate system of F₀ over the elevation range defined by the predicted elevation map of the previous layer, and a cost volume is again obtained by variance aggregation.
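Variance aggregation reduces the per-view volumes to one cost volume by measuring how much the views disagree at each entry. The sketch below assumes population variance over three views (the reference features plus the two warped source features) and treats each volume as a flat list; both choices, and the function name, are assumptions made for illustration.

```python
def variance_aggregate(view_volumes):
    """Element-wise variance across views; each volume is a flat list."""
    num_views = len(view_volumes)
    aggregated = []
    for samples in zip(*view_volumes):
        mean = sum(samples) / num_views
        aggregated.append(sum((x - mean) ** 2 for x in samples) / num_views)
    return aggregated

# Three views, two feature entries each: the views agree on the first
# entry and disagree on the second.
cost = variance_aggregate([[1.0, 2.0], [1.0, 4.0], [1.0, 6.0]])
```

A low value means the views agree at that elevation hypothesis, which is the cue the later regularization uses to pick the elevation.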

For example, for the first-layer 512×368×32 feature volume, each feature position (line, samp), together with the elevation value (hei) assumed at that position, is projected into three-dimensional space to compute (lat, lon); the computed (lat, lon) and the assumed elevation value (hei) are then back-projected into the coordinate system of the other feature volume to obtain the position coordinates (line, samp) there. For the first-layer 512×368×32 feature volumes, an extra dimension of 64 elevation planes is added, giving several 512×368×64×32 cost volumes, which are reduced to a single cost volume by the usual variance aggregation for use in the subsequent elevation prediction. Cost volume construction at the remaining stages differs in two respects: the number of assumed elevation planes (32 for the second layer and 8 for the third) and the feature scale (1024×736×16 for the second layer and 2048×1472×8 for the third).

The sequential regularization module regularizes the current-scale cost volume to obtain the current-scale initial elevation map. A common sequential regularization module with a multi-layer encoder-decoder structure (2D Unet ConvGRU) can regularize the cost volume plane by plane: the GRU processes the cost volume in elevation-plane order, so that only one 512×368×1×32 slice is processed at a time, producing a 512×368×1 output, and the outputs are then stacked in order into a 512×368×64 volume. Softmax is applied to the regularized cost volume to obtain a 512×368×64 probability volume giving, for each pixel of the reference image, its probability at the different depths, and soft argmin regresses the 512×368×1 initial elevation map.
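The softmax and soft argmin step can be illustrated for a single pixel. This is a sketch under stated assumptions: the regularized values are treated as scores where a higher score means a more likely plane (the sign convention is not stated in the text), and the function names and the three example planes are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def regress_elevation(scores, planes):
    """Expected elevation under the softmax probabilities (soft argmin)."""
    probs = softmax(scores)
    return sum(p * h for p, h in zip(probs, planes))

planes = [100.0, 110.0, 120.0]                          # assumed elevation planes, metres
flat = regress_elevation([0.0, 0.0, 0.0], planes)       # no preference among planes
peaked = regress_elevation([0.0, 0.0, 30.0], planes)    # strong vote for the last plane
```

Because the result is an expectation rather than a hard index, the regressed elevation can fall between planes and remains differentiable for training.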

The context features of the reference image are extracted by a feature pyramid network (FPN) and deformable convolution (DCN) operations to obtain several reference feature maps at different scales, and each reference feature map is passed through two different activation function layers to obtain, correspondingly, several reference feature volumes at different scales. The two reference feature volumes of the corresponding scale are selected each time to take part in the processing of the iterative regularization module.

That is, an additional context feature extraction network and DCN operation are set up to extract features from the reference image only. The context feature extraction network is also a three-layer feature pyramid network (FPN), but its three output feature maps are 512×368×64, 1024×736×32, and 2048×1472×16 respectively. One DCN operation is applied to the feature map at each scale, and the extracted features are split into two parts by channel. One part passes through 1 activation function layer (Tanh), giving three layers of feature volumes with resolutions of 512×368×32, 1024×736×16, and 2048×1472×8; the other part passes through 1 activation function layer (ReLU), giving three layers of feature volumes with the same resolutions of 512×368×32, 1024×736×16, and 2048×1472×8. Both parts take part in the subsequent cost volume optimization and provide hidden-state initialization and updating for the iterative regularization module.
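The channel split described above can be sketched for one pixel's channel vector. Which half of the channels goes to Tanh and which to ReLU is not stated in the text, so the first-half/second-half assignment below is an assumption, as are the function name and the sample values.

```python
import math

def split_context(channels):
    """Split one pixel's channel vector into a Tanh half and a ReLU half."""
    half = len(channels) // 2
    tanh_part = [math.tanh(x) for x in channels[:half]]   # bounded to (-1, 1)
    relu_part = [max(0.0, x) for x in channels[half:]]    # non-negative
    return tanh_part, relu_part

# A 64-channel first-layer pixel yields two 32-channel vectors.
tanh_part, relu_part = split_context([0.1 * (i - 32) for i in range(64)])
```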

Referring to FIG. 3, a small cost volume is first constructed for the first-layer 512×368×32 feature volume using a simplified differentiable RPC mapping module, for example with a small number of elevation planes, such as 3. To make better use of the association between this small cost volume and the current-scale initial elevation map, the 512×368×1 initial elevation map is used, with linear interpolation, to index into the variance-aggregated small cost volume, generating a set of high-correlation cost volumes G_f of size 512×368×9.

In the corresponding formula, r is the index radius, C_G is the cost volume, and H_t is the initial elevation map.
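The indexing formula is not reproduced in this text. One common way to realize such a lookup, shown here only as a hedged sketch for a single pixel, is to convert the current elevation into a fractional plane index and sample the cost profile at integer offsets within the radius, interpolating linearly. The function names, the zero padding outside the profile, and the example numbers are assumptions, and the sketch does not try to reproduce the 9-channel output size of the embodiment.

```python
import math

def lerp(costs, position):
    """Linearly interpolate a 1-D cost profile; samples outside count as 0."""
    lower = math.floor(position)
    frac = position - lower

    def at(i):
        return costs[i] if 0 <= i < len(costs) else 0.0

    return (1.0 - frac) * at(lower) + frac * at(lower + 1)

def lookup(costs, h_min, step, elevation, radius):
    """Sample the profile at 2*radius+1 plane offsets around `elevation`."""
    centre = (elevation - h_min) / step   # fractional plane index of the estimate
    return [lerp(costs, centre + i) for i in range(-radius, radius + 1)]

# One pixel, three planes at 100 m, 110 m, 120 m; current estimate 105 m.
g_f = lookup([1.0, 3.0, 5.0], h_min=100.0, step=10.0, elevation=105.0, radius=1)
```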

The indexed high-correlation cost volume G_f and the elevation prediction H_t are then passed to the iterative regularization module (Iter ConvGRU) to predict an elevation residual. G_f is processed by 1 convolution layer with a 1×1 kernel, 1 activation function layer (ReLU), 1 convolution layer with a 3×3 kernel, and 1 activation function layer (ReLU) to extract features of higher similarity. H_t is processed by 1 convolution layer with a 7×7 kernel, 1 activation function layer (ReLU), 1 convolution layer with a 3×3 kernel, and 1 activation function layer (ReLU) to extract feature information from the elevation map. The two results pass through 1 concatenation layer (Concat), which outputs a 512×368×32 feature volume associating the effective feature information of the small cost volume with that of the elevation map.

To make full use of the effective information in the network, this output is then joined (Add) with the feature volume (context) produced by the Tanh activation in the context feature extraction network, outputting a 512×368×64 feature volume. The other part of the context features (net) is introduced as the initial hidden state h₀ to be optimized, and is used to correct the elevation deviation of the current stage. Two recurrent convolution modules are then applied to the 512×368×64 feature volume: one uses a convolution layer with a 1×5 kernel and the other a convolution layer with a 5×1 kernel, so that horizontal and vertical correlation features are computed separately with fewer parameters. The iterated feature volume (net) is updated and serves as the hidden state of the next iteration.

Next, 1 convolution layer with a 3×3 kernel, 1 activation function layer (ReLU), 1 convolution layer with a 3×3 kernel, and 1 activation function layer (Tanh) predict a residual elevation map of size 512×368×1. The residual elevation map is added to the current elevation to give the elevation predicted by the current iteration, which keeps the refined elevation map within a reasonable range as far as possible. On this basis, iterative regularization (Iter ConvGRU) is performed several times at the current stage, for example 3 times, so that the elevation map passed to the next stage is as accurate as possible and errors are not propagated excessively. The first stage thus yields 4 elevation maps of size 512×368×1 in total: 1 initial elevation map and 3 predicted elevation maps, one from each iteration. The predicted elevation map from the last iteration takes part in the differentiable RPC mapping of the next layer.
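The control flow of the iterations, that is, one Tanh-bounded residual added per iteration and every intermediate map kept for the loss, can be sketched for a single elevation value. The ConvGRU itself is replaced by a stand-in callable, so `predict_residual`, the target value, and the function name are hypothetical and only illustrate the update rule.

```python
import math

def iterative_refine(initial, predict_residual, iterations=3):
    """Return the initial elevation followed by one estimate per iteration."""
    estimates = [initial]
    current = initial
    for _ in range(iterations):
        # Tanh keeps each correction within (-1, 1) before it is added.
        residual = math.tanh(predict_residual(current))
        current = current + residual
        estimates.append(current)
    return estimates

# Stand-in for the ConvGRU branch: pulls the estimate toward a known target.
target = 10.0
maps = iterative_refine(8.0, lambda h: 0.5 * (target - h), iterations=3)
```

With 3 iterations the list holds 4 entries, matching the 1 initial map plus 3 predicted maps described for the first stage; only the last entry would be passed to the next stage.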

Finally, a neighborhood mask up-sampling module up-samples the result of the current stage to a specified resolution. From the final iterative feature body (net) of the current stage, the module, which comprises 1 convolution layer with a 3×3 kernel, 1 activation function layer (ReLU) and 1 convolution layer with a 1×1 kernel, predicts the mask weights of each adjacent pixel. A weighted combination is then performed with these weights to up-sample the elevation map to the specified resolution, giving a more accurate elevation prediction map. For example, the 512×368×1 final elevation map of the first layer is up-sampled to 1024×736×1 to define the elevation range of the second layer, and the 1024×736×1 final predicted elevation map of the second layer is up-sampled to 2048×1472×1 to define the elevation range of the third layer.
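The weighted combination over adjacent pixels can be illustrated with a small sketch. It assumes a 3×3 neighbourhood, softmax-normalised weights and replicated borders; none of these details are stated in the text, and in the method the weights would come from the small convolutional head rather than being supplied by hand. The function names are hypothetical.

```python
import math

def softmax(logits):
    """Normalise raw weights so that they are positive and sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mask_upsample(coarse, logits, factor=2):
    """Up-sample a coarse elevation map by a weighted 3x3 combination.

    coarse: H x W list of elevations.
    logits[y][x][sy][sx]: 9 raw mask weights for the fine pixel
    (y * factor + sy, x * factor + sx), one per 3x3 coarse neighbour.
    """
    h, w = len(coarse), len(coarse[0])
    fine = [[0.0] * (w * factor) for _ in range(h * factor)]
    for y in range(h):
        for x in range(w):
            # 3x3 neighbourhood of the coarse pixel, borders replicated
            neigh = [coarse[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                     for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            for sy in range(factor):
                for sx in range(factor):
                    weights = softmax(logits[y][x][sy][sx])
                    fine[y * factor + sy][x * factor + sx] = sum(
                        wt * v for wt, v in zip(weights, neigh))
    return fine

coarse = [[1.0, 2.0], [3.0, 4.0]]
sharp = [0.0] * 9
sharp[4] = 50.0  # nearly all weight on the centre neighbour
logits = [[[[list(sharp) for _ in range(2)] for _ in range(2)]
           for _ in range(2)] for _ in range(2)]
fine = mask_upsample(coarse, logits)
print(len(fine), len(fine[0]))  # resolution doubled in both directions
```

With weights concentrated on the centre neighbour the result approaches nearest-neighbour interpolation; learned weights let each fine pixel borrow from whichever neighbours lie on the same surface.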

Finally, a loss function is used to supervise the training of the multiple predicted maps. When iterative regularization is performed with the iterative regularization module, every iteration on the initial elevation map of each scale outputs a corresponding predicted elevation map, and the L1 loss of all elevation maps output at each stage is computed against the ground-truth elevation map of the corresponding resolution. The different shares of the stages in the total loss are then taken into account, and the elevation result refined in each iteration is given its own weight. The final loss is the weighted sum of the losses of all stages:

Wherein the predicted term denotes the elevation values of the initial elevation map and of the predicted elevation maps, H is the elevation value of the ground-truth elevation map, N is the total number of scales, T is the total number of iterations at each scale, w_s is the weight of each stage, and each iteration at each scale has its own corresponding weight. This guarantees the basic accuracy of the initial elevation map, so that the iterative predictions can build on it and achieve a better result.
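A weighted sum consistent with these definitions can be sketched as below. The exact form is an assumption: each stage s contributes w_s times the sum, over its initial map and its T iterative predictions, of an iteration weight times the mean L1 error against the ground truth of that stage. The names `total_loss`, `stage_weights` and `iter_weights` and all numeric weights are hypothetical.

```python
def l1_mean(pred, truth):
    """Mean absolute elevation error between two maps of equal size."""
    errors = [abs(p - t)
              for p_row, t_row in zip(pred, truth)
              for p, t in zip(p_row, t_row)]
    return sum(errors) / len(errors)

def total_loss(stage_preds, stage_truths, stage_weights, iter_weights):
    """Weighted L1 sum over all stages and all maps within each stage.

    stage_preds[s] lists the maps of stage s: the initial elevation map
    followed by one predicted map per iteration.
    """
    loss = 0.0
    for preds, truth, w_s in zip(stage_preds, stage_truths, stage_weights):
        for pred, lam in zip(preds, iter_weights):
            loss += w_s * lam * l1_mean(pred, truth)
    return loss

# Two stages, each with an initial map and one iterative prediction.
stage_preds = [
    [[[8.0]], [[9.0]]],                # coarse stage
    [[[10.0, 12.0]], [[10.0, 10.0]]],  # fine stage
]
stage_truths = [[[10.0]], [[10.0, 10.0]]]
loss = total_loss(stage_preds, stage_truths,
                  stage_weights=[0.5, 1.0],  # assumed: finer stage weighted more
                  iter_weights=[0.5, 1.0])   # assumed: later iteration weighted more
print(loss)
```

Giving the last iteration the largest weight follows the remark in the ablation discussion that the final iteration is the most accurate result of its stage.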

In summary, the three-dimensional reconstruction method of the invention first uses an FPN network to extract multi-scale feature information from each of the multi-view images. To accommodate multi-view satellite images of different resolutions, multi-layer DCN is introduced into the up-sampling module of the FPN network, so that the same features appearing at markedly different scales in different images are attended to at different stages. A multi-stage Transformer then extracts high-dimensional features between the images and correlates the same features effectively, paying better attention to identical features whose size ranges differ between viewing angles; the results are input as multi-stage features from coarse to fine. Next, micro RPC mapping maps the features of the multiple resource images to the reference view, multiple cost volumes are constructed and aggregated by variance into a single cost volume, and the cost volume is regularized by sequential regularization (2D Unet ConvGRU) to predict an initial elevation map and a feature body containing effective information. Together with the context features extracted from the reference image, these serve as the initial state of the iterative regularization (Iter ConvGRU), which corrects the errors of the initial elevation map; the corrected predicted elevation map is used as the input of the next stage. Finally, the elevation maps of the different stages are all brought into the loss function for supervised training, thereby achieving dense reconstruction of surface information.
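The variance aggregation step mentioned above can be illustrated for a single reference pixel. The sketch assumes the features of every view have already been mapped to the reference view for each elevation hypothesis (the RPC mapping itself is not shown), averages the per-channel variance into one scalar for readability, and uses a plain minimum in place of the learned regularization. All names and numbers are hypothetical.

```python
def variance_cost(warped_views):
    """Variance across views for each elevation hypothesis of one pixel.

    warped_views[v][d] is the feature vector of view v after mapping to the
    reference view at elevation hypothesis d. A low variance means the views
    agree, i.e. the hypothesis is plausible.
    """
    n_views = len(warped_views)
    costs = []
    for d in range(len(warped_views[0])):
        feats = [view[d] for view in warped_views]
        n_channels = len(feats[0])
        var_sum = 0.0
        for c in range(n_channels):
            vals = [f[c] for f in feats]
            mean = sum(vals) / n_views
            var_sum += sum((v - mean) ** 2 for v in vals) / n_views
        # Simplification: average the channels into a single scalar cost
        costs.append(var_sum / n_channels)
    return costs

elevation_hypotheses = [100.0, 110.0, 120.0]
reference = [[0.25, 0.75], [0.25, 0.75], [0.25, 0.75]]
resource_a = [[0.75, 0.125], [0.25, 0.75], [0.5, 0.5]]
resource_b = [[0.125, 0.25], [0.25, 0.75], [1.0, 0.25]]

costs = variance_cost([reference, resource_a, resource_b])
best = elevation_hypotheses[costs.index(min(costs))]
print(best)  # hypothesis at which the three views agree
```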

To verify the feasibility of the three-dimensional reconstruction method of the present invention, we performed the following experiments:

1. Prediction and evaluation were carried out on the TLC SatMVS data. Table 1 and FIG. 4 show the reconstruction results for different terrain types in the TLC SatMVS data.

First, plain areas are considered, where the surface texture is highly repetitive, the terrain changes little and the elevation variation is small. The quantitative comparison with the CasMVSNet, UCSNet and SatMVS methods is given in the plain section of Table 1: MAE drops by 0.17, 0.13 and 0.21 and RMSE drops by 0.27, 0.25 and 0.35, respectively, so our method further improves the accuracy of the reconstructed DSM. At the same time, integrity improves by 12.73%, 12.00% and 3.79%, respectively, and FIG. 4 (rows 1 to 3) shows more intuitively that our method reconstructs a larger effective area of the DSM. CasMVSNet and UCSNet reconstruct few effective areas in the edge regions; SatMVS performs better, but some invalid areas remain; our method reconstructs 99.28% of the plain area, which demonstrates its effectiveness. Based on the reconstruction results, the proportion of grid cells with an error below 2.5 m increases by 3.16%, 2.44% and 2.33%, respectively, while the proportion below 7.5 m decreases by 0.09%, 0.14% and 0.02%, respectively. Although some proportions decrease, the overall results show that our method gives the best reconstruction in plain areas.

Next, mountain areas with large, continuous elevation changes are considered. The quantitative comparison with the CasMVSNet, UCSNet and SatMVS methods is given in the mountain section of Table 1: MAE drops by 0.23, 0.11 and 0.22 and RMSE drops by 0.94, 0.83 and 0.98, respectively, so the method still shows good reconstruction accuracy. As can be seen from FIG. 4 (rows 4 to 6), large invalid regions appear in the CasMVSNet and UCSNet results where the elevation changes strongly, and SatMVS gives unsatisfactory results at the edges and in some areas of large elevation change. Our method reconstructs a large amount of effective terrain in these complex areas, with an integrity of 96.72%, including reconstruction around the invalid small-lake area in the image in row 6 of FIG. 4. Based on the reconstruction results, the proportion of grid cells with an error below 2.5 m increases by 3.11%, 0.41% and 1.92%, respectively, and the proportion below 7.5 m increases by 2.05%, 1.64% and 2.16%, respectively. The method thus markedly improves the accuracy indexes for reconstructed mountain areas.

Then, urban areas with abrupt elevation changes and highly repetitive structures are considered. Compared again with the CasMVSNet, UCSNet and SatMVS methods (urban section of Table 1), MAE drops by 0.17, 0.19 and 0.25 and RMSE drops by 0.56, 0.72 and 0.72, respectively, so the method also shows better reconstruction accuracy. Integrity improves by 12.63%, 14.21% and 4.56%, respectively. As is evident from FIG. 4 (rows 7 to 9), CasMVSNet and UCSNet perform poorly in areas with many buildings, and SatMVS loses considerable elevation information. In building areas with small feature points, high repeatability and abrupt elevation changes, our method reconstructs more effective elevation information, and the integrity reaches 98.55%. Meanwhile, the proportion of grid cells with an error below 2.5 m decreases by 0.37%, decreases by 0.58% and increases by 0.33%, respectively, and the proportion below 7.5 m increases by 2.21%, 2.45% and 2.98%, respectively. Some indexes decrease, but overall our method still reconstructs good results.

Finally, representative complex terrain areas of various types are considered together. The method provided by the invention (Appli-SatNet) reconstructs rich effective terrain information while maintaining good reconstruction accuracy. For terrain areas with highly repetitive surface texture, the self-attention and cross-attention mechanisms in the Transformer focus on the relative positions of features within an image and on feature correlations between images. For terrain areas with pronounced elevation changes, the iterative regularization (Iter ConvGRU) iteratively corrects the elevation errors, which ensures the accuracy of the elevation passed between stages. With these modules fused, various complex terrain areas can be processed at the same time and a good DSM can be reconstructed. Compared with the ground truth shown in FIG. 4, the DSM reconstructed by the network also reveals rich and reliable terrain information. Appli-SatNet is therefore better suited to DSM reconstruction over different terrains.
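The indexes used in this comparison (MAE, RMSE, the proportions of grid cells with an error below 2.5 m and below 7.5 m, and integrity) can be computed along the following lines. The definitions are assumptions made for illustration: invalid cells are marked with `None`, integrity is the share of cells with a valid predicted elevation, and the error statistics are taken over valid cells only. The function name `evaluate_dsm` is hypothetical.

```python
import math

def evaluate_dsm(pred, truth, thresholds=(2.5, 7.5)):
    """Compute DSM accuracy indexes for one tile.

    pred, truth: 2D lists of elevations in metres; None in pred marks a
    grid cell for which no valid elevation was reconstructed. The tile is
    assumed to contain at least one valid cell.
    """
    total = sum(len(row) for row in truth)
    errors = [abs(p - t)
              for p_row, t_row in zip(pred, truth)
              for p, t in zip(p_row, t_row) if p is not None]
    n = len(errors)
    metrics = {
        "MAE": sum(errors) / n,
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "integrity": 100.0 * n / total,  # percent of cells with a valid value
    }
    for th in thresholds:
        # Assumed denominator: valid cells only
        metrics[f"<{th}m"] = 100.0 * sum(e < th for e in errors) / n
    return metrics

pred = [[10.0, 12.0], [None, 20.0]]
truth = [[10.0, 15.0], [30.0, 30.0]]
metrics = evaluate_dsm(pred, truth)
print(metrics)
```

A method can lower MAE while also raising integrity only if the newly reconstructed cells are themselves accurate, which is why the two are reported side by side.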

TABLE 1 DSM accuracy assessment of different methods for different terrain types on the TLC SatMVS data

2. Prediction and evaluation were carried out on the Taihu data. Table 2 and FIG. 5 show the reconstruction results for different terrain types in the Taihu data, which consist of lower-resolution multi-view images.

As can be seen from Table 2, the method improves on CasMVSNet, UCSNet and SatMVS in all indexes for plain, mountain and urban areas. MAE decreases by 0.07, 0.14 and 0.09 in plain areas, by 1.18, 1.04 and 1.26 in mountain areas, and by 0.81, 0.90 and 0.71 in urban areas, respectively. RMSE decreases by 0.18, 0.22 and 0.20 in plain areas, by 1.06, 1.39 and 1.21 in mountain areas, and by 1.52, 1.82 and 1.30 in urban areas, respectively. Meanwhile, integrity improves by 15.83%, 16.28% and 7.32% in plain areas, by 36.80%, 28.75% and 13.95% in mountain areas, and by 19.69%, 22.81% and 19.52% in urban areas, respectively. It can be seen that the proposed network provides significant improvements in both accuracy and detail for these low-resolution multi-view images. The DSM results are shown in FIG. 5, where three different types of complex terrain are again reconstructed, and it can be seen that finer surface information is generated. First, for plain areas (rows 1 to 3 of FIG. 5) with highly repetitive surface texture, little terrain variation and small elevation variation, more repeated-texture regions and more effective elevation information are reconstructed. Then, in mountain areas (rows 4 to 6 of FIG. 5), where the elevation changes continuously over a large range, the method corrects the elevation information in time and reconstructs richer texture. Finally, for urban areas (rows 7 to 9 of FIG. 5) with abrupt elevation changes and high repeatability, the feature enhancement and iterative regularization (Iter ConvGRU) modules make full use of the effective feature information and correct deviations in the predicted elevation, and the method clearly reconstructs denser and finer building-cluster information. It follows that the method is well suited to lower-resolution multi-view images and can reconstruct a reliable and complete DSM.

TABLE 2 DSM accuracy assessment of different methods for different terrain types on the Taihu data

3. Ablation experiments were performed on the methods of the invention.

Table 3 shows the evaluation indexes and prediction results obtained with the module configuration of each model, where:
model 1: SatMVS;
model 2: SatMVS + Transformer;
model 3: SatMVS + Iter ConvGRU + Mask Upsample + Iter_loss;
model 4: SatMVS + Transformer + Iter ConvGRU + Mask Upsample + Iter_loss;
model 5: SatMVS + Iter ConvGRU + Correlation + Mask Upsample + Iter_loss;
model 6: SatMVS + Transformer + Iter ConvGRU + Correlation + Iter_loss;
model 7: SatMVS + Transformer + Iter ConvGRU + Correlation + Mask Upsample;
model 8: SatMVS + Transformer + Iter ConvGRU + Correlation + Mask Upsample + Iter_loss (method of the invention).
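The ablation design can be read as a set of module flags per model, so that each pairwise comparison isolates the modules added on top of the baseline. The sketch below encodes the eight configurations listed above as sets, with the SatMVS baseline left implicit; `MODELS` and `added_modules` are hypothetical names used only for illustration.

```python
# Modules added on top of the SatMVS baseline in each ablation model
MODELS = {
    1: set(),
    2: {"Transformer"},
    3: {"Iter ConvGRU", "Mask Upsample", "Iter_loss"},
    4: {"Transformer", "Iter ConvGRU", "Mask Upsample", "Iter_loss"},
    5: {"Iter ConvGRU", "Correlation", "Mask Upsample", "Iter_loss"},
    6: {"Transformer", "Iter ConvGRU", "Correlation", "Iter_loss"},
    7: {"Transformer", "Iter ConvGRU", "Correlation", "Mask Upsample"},
    8: {"Transformer", "Iter ConvGRU", "Correlation", "Mask Upsample",
        "Iter_loss"},
}

def added_modules(base, variant):
    """Modules present in `variant` but not in `base`."""
    return MODELS[variant] - MODELS[base]

# Pairs of models that are compared against each other
for base, variant in [(1, 2), (1, 3), (1, 4), (3, 5), (6, 8), (7, 8)]:
    print(f"model {base} -> model {variant}:",
          sorted(added_modules(base, variant)))
```

Pairs such as model 6 versus model 8 differ in exactly one module, so the change in the indexes can be attributed to that module alone; pairs such as model 1 versus model 3 add several modules at once.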

Comparing model 1 and model 2, adding the multi-stage Transformer module to enhance the initial multi-stage features extracted by the FPN gives a large improvement in accuracy: MAE and RMSE are reduced by 0.23 and 0.69, respectively, integrity improves by 0.49%, and the grid accuracy ratios below 2.5 m and below 7.5 m improve by 0.26% and 0.17%, respectively. Comparing model 1 and model 3, the iterative regularization (Iter ConvGRU) module is added after the initial regularization of the cost volume so that the extracted effective cost-volume features are used properly, and it is combined with mask up-sampling (Mask Upsample) interpolation and Iter_loss. Table 3 shows that accuracy improves over the baseline indexes, which verifies the effectiveness of the module: MAE is reduced by 0.02, RMSE is reduced by 0.05, integrity is reduced by 0.05%, and the grid accuracy ratios below 2.5 m and below 7.5 m improve by 2.15% and 0.07%, respectively.

Comparing model 1 and model 4, where both modules are added to the network at the same time, MAE and RMSE are raised by 0.02 and 0.10, respectively, but integrity improves more markedly, by 2.77% over the initial value.

Comparing model 3 and model 5: when the effective cost bodies are simply aggregated, their effective information cannot be fully used by the sequential regularization (2D Unet ConvGRU) and iterative regularization (Iter ConvGRU) modules. The aggregated cost body is therefore indexed with the current elevation map to obtain cost-body features of higher correlation, so that the effective information can be attended to better. Table 3 shows that, after the Correlation module is added, MAE and RMSE increase by 0.29 and 0.74, respectively, and integrity decreases by 7.25%, while the grid accuracy ratios below 2.5 m and below 7.5 m increase by 0.49% and 0.13%, respectively.

Comparing model 6 and model 8 shows that the more suitable mask up-sampling (Mask Upsample) interpolation can replace bilinear interpolation: MAE and RMSE drop by 0.02 and 0.04, respectively, integrity improves by 0.04%, and the grid accuracy ratios below 2.5 m and below 7.5 m improve by 0.52% and 0.84%, respectively. The mask up-sampling (Mask Upsample) interpolation module is thus better suited to passing elevations effectively between stages.

Comparing model 7 and model 8: to make full use of the elevation map results of every stage, a new estimation strategy is designed in which, starting from the initial elevation map of each stage, the elevation result of every iteration takes part in supervised training. This ensures the validity of each iteration's result as far as possible, and the result of the last iteration receives the largest share in the current stage because it is the most accurate. MAE and RMSE are reduced by 0.04 and integrity is improved by 0.11%, while the grid accuracy ratios below 2.5 m and below 7.5 m are improved by 1.35% and 0.63%, respectively, which verifies the effectiveness of weighting each elevation map.

TABLE 3 Quantitative analysis of different modules on the TLC SatMVS prediction dataset

While particular embodiments of the present invention have been described above, it will be appreciated by those skilled in the art that these are merely illustrative, and that many changes and modifications may be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims.

Claims

1. A surface dense three-dimensional reconstruction method based on multi-view optical satellite images is characterized by comprising the following steps:

step one, constructing a multi-view satellite image data set;

Step two, establishing a reconstructed network model

2. The method for three-dimensional reconstruction of earth's surface based on multi-view optical satellite images according to claim 1, wherein the method comprises the following steps: the elevation prediction module comprises a micro RPC mapping module, a sequence regularization module and an iteration regularization module, wherein the micro RPC mapping module is firstly utilized to respectively carry out micro RPC mapping processing on different feature bodies according to the sequence from small to large, then the sequence regularization module is utilized to carry out regularization processing, then the iteration regularization module is utilized to carry out repeated iteration regularization processing in combination with the context feature of the reference image, the last iteration regularization processing result of the small-scale feature body participates in the micro RPC mapping processing of the large-scale feature body, but for the largest-scale feature body, the largest-scale feature body is processed only by the micro RPC mapping module and the sequence regularization module,

3. The method for three-dimensional reconstruction of earth's surface based on multi-view optical satellite images according to claim 2, wherein the method comprises the following steps: the micro RPC mapping module is used for mapping the current scale feature body corresponding to the resource image to the current scale reference view angle corresponding to the reference image in a given elevation value range to construct different current scale cost bodies, and then carrying out variance aggregation treatment on the current scale cost bodies to obtain a current scale cost body;

4. A method for dense three-dimensional earth surface reconstruction based on multi-view optical satellite images according to claim 3, wherein: the loss is calculated using the following formula, parameters of the reconstructed network model are adjusted with back propagation,

Wherein the predicted term denotes the elevation values of the initial elevation map and of the predicted elevation maps, H is the elevation value of the real elevation map, N is the total number of scales, T is the total number of iterations at each scale, w_s is the weight of each stage, and each iteration at each scale has its own corresponding weight.

5. The method for three-dimensional reconstruction of earth's surface based on multi-view optical satellite images according to claim 2, wherein the method comprises the following steps: for the micro RPC mapping module, when the minimum-scale feature body is processed, the given elevation value range is manually selected; when feature bodies of other scales are processed, the given elevation value range is set based on the elevation values contained in the initial elevation map of the previous scale.

6. The method for three-dimensional reconstruction of earth's surface based on multi-view optical satellite images according to claim 2, wherein the method comprises the following steps: the context features of the reference image are extracted through feature pyramid network FPN and DCN variable convolution operation to obtain a plurality of reference feature images with different scales, and the reference feature images are processed through two different activation function layers respectively to correspondingly obtain a plurality of reference feature bodies with different scales;

7. The method for three-dimensional reconstruction of earth's surface based on multi-view optical satellite images according to claim 1, wherein the method comprises the following steps: the feature extraction module firstly utilizes a feature pyramid network FPN to respectively conduct multi-scale feature extraction on a reference image and a resource image to obtain a plurality of feature images with different scales, then respectively conducts DCN variable convolution processing on the feature images with each scale to obtain features corresponding to different targets in size, and finally utilizes the attention mechanism module to obtain global features to obtain feature bodies with different scales.

8. The method for dense three-dimensional reconstruction of earth's surface based on multi-view optical satellite images according to claim 7, wherein: sequentially marking the feature images as first-scale feature images and second-scale feature images from small scale to large scale,

9. The method for dense three-dimensional reconstruction of earth's surface based on multi-view optical satellite images according to claim 7, wherein: each of the DCN variable convolutions includes a convolution layer of a convolution kernel 3 x 3 size, and four convolution kernel 3 x 3 size DCN convolution layers.

10. The method for three-dimensional reconstruction of earth's surface based on multi-view optical satellite images according to claim 1, wherein the method comprises the following steps: the reference image is set as a lower view image, the resource images are set as a front view image and a rear view image, and one reference image and two resource images are input into the reconstruction network model each time for subsequent processing.