Disclosure of Invention
In view of the foregoing, it is a primary object of the present application to provide a technique for accurately locating a device based on sensor data provided by the sensor device, which technique may be suitably implemented in an embedded computing system having limited computing resources.
The above and other objects are achieved by the subject matter of the independent claims. Other implementations are apparent in the dependent claims, the description and the drawings.
According to a first aspect, there is provided a method of determining a location of a device comprising a sensor device. The method comprises the steps of: the sensor device obtaining sensor data representative of an environment of the device; a first neural network generating a first descriptor map based on the sensor data; inputting input data, which is based on the sensor data, to a second neural network different from the first neural network; the second neural network outputting descriptors based on the input data; volumetrically rendering the descriptors to obtain a second descriptor map; matching the first descriptor map with the second descriptor map (comparing them to find a match); and determining a pose of the sensor device based on the match.
The term "neural network" herein refers to an artificial neural network. The device may be a vehicle, such as a fully or partially autonomous car, an autonomous mobile robot, or an automated guided vehicle (AGV). The sensor device may be a camera device or a LIDAR device. The camera device may be, for example, a time-of-flight camera, a depth camera, etc., while the LIDAR device may be, for example, a microelectromechanical system (MEMS) LIDAR device, a solid-state LIDAR device, etc.
The first neural network is trained for extracting descriptors/features from query sensor data acquired by the sensor device (e.g., data acquired during movement of the device in the environment). The second neural network is trained to process data based on the sensor data to obtain local descriptors, e.g., local descriptors for each pixel in an image captured by the camera or local descriptors for each point in a three-dimensional point cloud captured by the LIDAR device.
The first neural network may be a (deep) convolutional neural network (CNN), which may be based on one of the neural network architectures known in the art for learned feature extraction (see the examples given in M. Jahrer et al., "Learned local descriptors for recognition and matching", Computer Vision Winter Workshop, Moravske Toplice, Slovenia, February 4-6; P. Napoletano, "Visual descriptors for content-based retrieval of remote-sensing images", International Journal of Remote Sensing, 2018, 39:5, pages 1343-1376; A. Moreau et al., "ImPosing: Implicit Pose Encoding for Efficient Visual Localization", IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2023), January 2023, Waikoloa, USA, pages 2893-2902).
The second neural network may comprise a (deep) multilayer perceptron (MLP), i.e., a fully connected feed-forward neural network. In particular, the second neural network may be trained based on the neural radiance field (NeRF) technique set forth in the paper by B. Mildenhall et al. entitled "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis" (Computer Vision - ECCV 2020, 16th European Conference, Glasgow, UK, August 23-28, 2020, Springer, Cham, 2020), which has become a popular view synthesis tool, or on any improvement thereof.
Input data of the vision NeRF neural network represents the 3D position (x, y, z) and the viewing direction/angle $(\theta, \phi)$ of the camera device. The NeRF-trained neural network outputs a neural field that includes view-dependent color values (e.g., RGB) and a volume density value σ. Thus, the MLP implements $F_\Theta: (x, y, z, \theta, \phi) \rightarrow (R, G, B, \sigma)$, and the optimized weights Θ are obtained in the training process. The neural field may be queried at a plurality of locations along a ray for volumetric rendering (see detailed description below). The neural network representation of the environment captured by the sensor device is given by a neural field that is subsequently rendered volumetrically, which may produce a rendered image, a rendered point cloud, etc. The differences between the rendered image and the reference sensor data used for mapping are minimized during the training process. If 3D point cloud data is processed instead of images, the NeRF neural network may be used to render the point cloud. Whereas in a view synthesis application the camera poses of the multiple images input to the NeRF neural network are known, in a positioning application the pose of the camera device (or LIDAR device) may be iteratively determined starting from an initial estimate (pose prior) (see detailed description below).
By including additional descriptors in the implicit function (see detailed description below), such a NeRF-trained neural network, or a variant thereof, may be used as, or included in, the second neural network of the method according to the first aspect. In this case, the input data based on the sensor data includes a three-dimensional position and a viewing direction (together forming a pose), and the output data includes a descriptor and a volume density.
For example, the second neural network may be trained based on the more advanced NeRF-W technique, which uses appearance codes to reliably account for illumination dynamics (see R. Martin-Brualla et al., "NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 19-25, 2021, pages 7210-7219, Computer Vision Foundation, IEEE, 2021). In an implementation using such a NeRF-W-trained neural network, the input data of the second neural network includes an appearance embedding (and, optionally, a transient embedding); see the further details below.
The method of determining the location of a device by means of a first neural network and a second neural network according to the first aspect can very accurately locate the device without the need for powerful and expensive computing resources and memory resources. In contrast to the structure-based visual localization techniques of the art, there is no need to match a three-dimensional map generated from reference sensor data requiring a large amount of memory space, but rather the matching process is based on the output of a trained neural network that can be implemented with relatively low memory requirements.
According to an implementation of the method of the first aspect, the descriptor is independent of a viewing direction (or viewpoint) of the sensor device. In this respect the descriptor is similar to the volume density, but differs from the color output by a NeRF-trained neural network. Since the descriptor is independent of the viewing direction, positioning failures caused by texture-less content, or by reference data that differ significantly from the query sensor data acquired during positioning because of a different viewing direction, can be significantly reduced.
According to one implementation, the descriptor represents the local content of the input data and the three-dimensional position of a data point (pixel in an image or point in a point cloud) of the input data. Such implementations are not based on features including keypoints and associated descriptors, but rather may use each point of input data, which may improve the accuracy of the matching results.
According to an implementation form, the method according to the first aspect as such or any implementation form thereof further comprises the second neural network obtaining a depth map, wherein the matching of the first descriptor map and the second descriptor map is based on the obtained depth map. The depth map is obtained by volumetrically rendering the volume density values output by the second neural network. The information of the depth map may be used to avoid matching descriptor structures of the descriptor maps that are geometrically far from each other (by at least some predetermined distance threshold), i.e., to ensure that a descriptor structure of one of the descriptor maps is considered dissimilar to a geometrically distant descriptor structure of the other descriptor map.
According to one implementation, the pose of the sensor device is iteratively determined starting from a pose prior (an initial estimate of the pose). The iterations may be performed based on a Perspective-n-Point (PnP) method in combination with a random sample consensus (RANSAC) algorithm to obtain a robust estimate of the pose by rejecting outlier matches. The pose prior may be obtained by matching a global image descriptor against an image retrieval database or an implicit map, as known in the art.
According to one implementation, the first neural network and the second neural network are jointly trained for the environment based on matching a first training descriptor map acquired through the first neural network with a second training descriptor map acquired through volumetric rendering of training descriptors output by the second neural network. When the neural networks are jointly trained in this way, it can be ensured that, during the matching process in actual positioning, corresponding descriptor structures are identified as matching each other.
In particular, according to such an implementation, descriptor/feature extraction is learned for the environment/scene in which positioning is to be performed. In this implementation, the first neural network does not use an off-the-shelf feature extractor, but rather is trained for a specific scenario, which may further improve the accuracy of the pose estimation, and thus the accuracy of the positioning results.
Implementations of the method of the first aspect may include training the first neural network and the second neural network. According to one implementation, the sensor device is a camera device, and the method further comprises jointly training the first neural network and the second neural network for the environment. The joint training comprises: a training camera device obtaining training image data at different training poses of the training camera device; inputting training input data based on the training image data into the first neural network, and inputting training pose data, corresponding to the different training poses of the training camera device and based on the training image data, into the second neural network; the first neural network outputting first training descriptor maps based on the training input data; the second neural network outputting training color data, training volume density data and training descriptor data; and rendering the training color data, the training volume density data and the training descriptor data to obtain rendered training images, rendered training depth maps and rendered second training descriptor maps, respectively. Furthermore, the method according to this implementation comprises minimizing a first objective function representing a difference between the rendered training images and respective pre-stored reference images, or maximizing a first objective function representing a similarity between the rendered training images and the pre-stored reference images, and minimizing a second objective function representing a difference between the first training descriptor maps and the respective rendered second training descriptor maps, or maximizing a second objective function representing a similarity between the first training descriptor maps and the respective rendered second training descriptor maps.
The process may efficiently co-train the first and second neural networks for accurate positioning based on the pose of the camera device estimated by matching the descriptor maps to each other (i.e., based on the descriptor maps that match each other).
A similar training process may be performed for the case where the sensor device is a LIDAR device. In this implementation, the method includes jointly training the first neural network and the second neural network for the environment. The joint training comprises: a training LIDAR device obtaining training three-dimensional point cloud data at different training poses of the training LIDAR device; inputting training input data based on the training three-dimensional point cloud data into the first neural network, and inputting training pose data, corresponding to the different training poses of the training LIDAR device and based on the training three-dimensional point cloud data, into the second neural network; the first neural network outputting first training descriptor maps based on the training input data; the second neural network outputting training volume density data and training descriptor data; and rendering the training volume density data and the training descriptor data to obtain rendered training depth maps and rendered second training descriptor maps, respectively. Furthermore, the method according to this implementation comprises minimizing a first objective function representing a difference between the rendered training depth maps and respective pre-stored reference depth maps, or maximizing a first objective function representing a similarity between the rendered training depth maps and the pre-stored reference depth maps, and minimizing a second objective function representing a difference between the first training descriptor maps and the respective rendered second training descriptor maps, or maximizing a second objective function representing a similarity between the first training descriptor maps and the respective rendered second training descriptor maps.
According to another implementation, the training process further includes applying a loss function based on the rendered training depth maps to suppress the minimization or maximization of the second objective function for those data points of one of the first training descriptor maps whose geometric distance from the data points of a corresponding one of the rendered second training descriptor maps is greater than a predetermined threshold. Applying this loss function avoids comparing descriptor structures of the descriptor maps that do not actually represent common features of the environment.
At least one step of the method according to the first aspect and implementations thereof may be performed at the site of the device by an embedded computing system with limited computing resources, or at a remote site that is provided with the data needed for the processing, i.e., for locating the device.
According to a second aspect, there is provided a computer program product comprising computer readable instructions for performing or controlling the steps of the method according to the first aspect or any implementation thereof, when the computer readable instructions are run on a computer. The computer may be installed in a device (e.g., a vehicle).
According to a third aspect, a positioning device is provided. The method according to the first aspect and any implementation thereof may be implemented in a positioning device according to the third aspect. The positioning device according to the third aspect and any implementation thereof may provide the same advantages as described above.
The positioning device according to the third aspect comprises a sensor device (e.g., a camera device or a LIDAR device) for acquiring sensor data representing an environment of the sensor device, a first neural network for generating a first descriptor map based on the sensor data, and a second neural network, different from the first neural network, for outputting descriptors based on input data, wherein the input data is based on the sensor data. The positioning device according to the third aspect further comprises a processing unit for volumetrically rendering the descriptors to obtain a second descriptor map, matching the first descriptor map with the second descriptor map, and determining the pose of the sensor device based on the matching.
According to one implementation, the descriptor is independent of the viewing direction of the sensor device. The descriptor may represent the local content of the input data and the three-dimensional location of the data point of the input data.
According to one implementation, the second neural network is further configured to obtain a depth map, and the processing unit is further configured to match the first descriptor map with the second descriptor map based on the depth map.
According to one implementation, the processing unit is further configured to iteratively determine the pose of the sensor device starting from a pose a priori.
According to one implementation, the first neural network and the second neural network are jointly trained for the environment based on matching a first training descriptor map acquired through the first neural network with a second training descriptor map acquired through volumetric rendering of training descriptors output by the second neural network.
According to a fourth aspect, there is provided a vehicle comprising a positioning device according to the third aspect or any implementation thereof. For example, the vehicle is a car (specifically, a fully or partially autonomous car), an autonomous mobile robot, or an automated guided vehicle (AGV).
The details of one or more embodiments are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description and drawings, and from the claims.
Detailed Description
A method of locating a device (e.g., a vehicle) equipped with a sensor device (e.g., a camera device or a LIDAR device) is provided herein. In particular, the method may be based on a neural radiance field (NeRF) scene representation. Although the following description of embodiments refers to NeRF techniques, other techniques for representing environments/scenes implicitly, based on a neural field and volumetric rendering, may be suitably employed in alternative embodiments. Highly accurate positioning results can be obtained at relatively low computational cost.
FIG. 1 illustrates locating a vehicle according to an embodiment. The vehicle navigates 11 in a known environment. The vehicle is equipped with a sensor device (e.g., a camera device or a LIDAR device) and captures sensor data (e.g., an image or a 3D point cloud) representing the environment. Specifically, a query sensor dataset (e.g., a query image or a query 3D point cloud) is captured 12 and used to locate the vehicle in the known environment. The query sensor dataset is input to a neural network (e.g., a deep convolutional neural network (CNN)) trained for descriptor extraction, and a query descriptor map is generated 13 based on the query sensor dataset and the extracted descriptors. Because the descriptors are independent of the viewing direction, positioning failures caused by texture-less content, or by reference data that differ significantly from the query sensor data acquired during positioning because of a different viewing direction, can be significantly reduced.
In the art, the extracted descriptors or features are compared to a three-dimensional reference descriptor map generated from a training sensor dataset. Such a three-dimensional reference descriptor map has large memory requirements. In contrast, according to the embodiment shown in fig. 1, another neural network is used 14 to provide a rendered descriptor map to be matched with the query descriptor map provided by the neural network trained for descriptor extraction. Input data based on the query sensor data captured by the sensor device of the vehicle is input into the other neural network. For example, estimated three-dimensional position data and viewing direction data of the sensor device associated with the captured query image or 3D point cloud are input into the other neural network, which outputs descriptors of local positions (features). These descriptors are part of a neural field, which may also include color values and volume density values. A neural field that includes local descriptors may be referred to as a neural location feature field.
According to a specific implementation, the other neural network comprises or consists of one or more (deep) multilayer perceptrons (MLPs), i.e., the other neural network comprises or is a fully connected feed-forward neural network trained based on the neural radiance field (NeRF) technique proposed in the paper by B. Mildenhall et al. entitled "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis" (Computer Vision - ECCV 2020, 16th European Conference, Glasgow, UK, August 23-28, 2020, Springer, Cham, 2020) or any improvement thereof (e.g., the NeRF-W technique). The NeRF-W technique is described in R. Martin-Brualla et al., "NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 19-25, 2021, pages 7210-7219, Computer Vision Foundation, IEEE, 2021. NeRF, as originally introduced by B. Mildenhall et al., obtains a neural network representation of the environment (a neural field) based on color values and spatially dependent volume density values.
The input data of the neural network represents the 3D position (x, y, z) and the viewing direction $(\theta, \phi)$. The NeRF-trained neural network outputs view-dependent color values (e.g., RGB) and a volume density value σ. Thus, the MLP implements $F_\Theta: (x, y, z, \theta, \phi) \rightarrow (R, G, B, \sigma)$, and the optimized weights Θ are obtained in the training process.
Volumetric rendering is based on rays (cast from all pixels of an image) passing through the scene. The volume density σ(x, y, z) can be interpreted as the differential probability that a ray terminates at an infinitesimal particle at (x, y, z). By accumulating all the volume density values along the ray direction, the cumulative transmittance T(s) along the ray can be calculated: the cumulative transmittance T(s) along the ray from origin 0 to s represents the probability that the ray travels along its path to s without hitting any particle.
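For reference, the cumulative transmittance and the rendered quantities can be written as follows. This is the standard formulation from the NeRF literature rather than a formula taken from the present text; the ray parametrization r(t) and the per-point descriptor f are notational assumptions introduced here for illustration.

\[
T(s) = \exp\!\left(-\int_{0}^{s} \sigma\big(\mathbf{r}(t)\big)\,dt\right),
\qquad
\hat{F}(\mathbf{r}) = \int_{0}^{\infty} T(t)\,\sigma\big(\mathbf{r}(t)\big)\,\mathbf{f}\big(\mathbf{r}(t)\big)\,dt
\]

Here f may stand for the color or for the local descriptor at the sampled point, and replacing f by the distance t yields an expected depth along the ray.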
The implicit representation (neural field) is queried at multiple locations along the ray, and the resulting samples are then combined into an image.
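The querying and combining of samples along a ray can be sketched as follows. This is a minimal NumPy illustration of the usual discrete NeRF quadrature, not the implementation of the embodiment; the function name `render_ray` and its arguments are hypothetical.

```python
import numpy as np

def render_ray(sigmas, values, t_vals):
    """Composite per-sample values along one ray (illustrative sketch).

    sigmas: (N,) volume densities at the sample points
    values: (N, C) per-sample quantities (colors or descriptors)
    t_vals: (N,) increasing sample distances along the ray
    """
    sigmas = np.asarray(sigmas, dtype=float)
    values = np.asarray(values, dtype=float)
    t_vals = np.asarray(t_vals, dtype=float)
    # Spacing between consecutive samples; the last interval is treated as very long.
    deltas = np.append(np.diff(t_vals), 1e10)
    # Opacity contributed by each interval.
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance: probability of reaching sample i without terminating earlier.
    trans = np.cumprod(np.append(1.0, 1.0 - alphas[:-1]))
    weights = trans * alphas
    rendered = weights @ values   # (C,) rendered color or descriptor
    depth = weights @ t_vals      # expected termination distance
    return rendered, depth, weights
```

Applying the same weights to colors, descriptors and sample distances gives a rendered pixel, a rendered descriptor and a depth value for that ray.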
According to the embodiment shown in fig. 1, a neural network of this type may be employed. However, according to this embodiment, the neural network is trained not only for acquiring volume density values and color values (if a camera device is used as the sensor device), but also for acquiring descriptors (modified NeRF-trained neural network). Volumetrically rendering the descriptors output by the neural network may produce a rendered (reference) descriptor map that is to be matched (compared) to the query descriptor map. Using the other neural network instead of the reference three-dimensional map used in the art saves memory space (typically by a factor of about 1000) while still achieving very accurate pose estimation and, thus, accurate positioning results.
For example, the query descriptor map is matched 15 with the reference descriptor map using cosine similarity. For example, two descriptors are considered a match (i.e., similar to each other) if their similarity is above a predetermined threshold and each of them is the best candidate for the other in the respective descriptor map, i.e., they match each other in both directions.
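The matching rule above can be sketched as a mutual nearest-neighbor search. The following is a minimal NumPy illustration that assumes the descriptor maps have been flattened to arrays of shape (number of points, descriptor dimension); the function name and the threshold value 0.8 are hypothetical.

```python
import numpy as np

def mutual_matches(query_desc, ref_desc, min_similarity=0.8):
    """Return (i, j) index pairs that are mutual nearest neighbors.

    query_desc: (N, D) descriptors from the query descriptor map
    ref_desc:   (M, D) descriptors from the rendered descriptor map
    min_similarity: hypothetical threshold on the cosine similarity
    """
    query_desc = np.asarray(query_desc, dtype=float)
    ref_desc = np.asarray(ref_desc, dtype=float)
    q = query_desc / np.linalg.norm(query_desc, axis=1, keepdims=True)
    r = ref_desc / np.linalg.norm(ref_desc, axis=1, keepdims=True)
    sim = q @ r.T                    # (N, M) cosine similarities
    best_ref = sim.argmax(axis=1)    # best reference for each query descriptor
    best_query = sim.argmax(axis=0)  # best query for each reference descriptor
    matches = []
    for i, j in enumerate(best_ref):
        # Keep the pair only if it is the best candidate in both directions.
        if best_query[j] == i and sim[i, j] > min_similarity:
            matches.append((i, int(j)))
    return matches
```

The mutual check discards ambiguous candidates, such as a query descriptor whose best reference descriptor is itself better explained by a different query descriptor.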
From the correspondences between the query descriptor map and the rendered descriptor map generated by volumetric rendering of the descriptors output by the other (modified NeRF-trained) neural network, the pose (position and viewing direction) of the sensor device can be calculated, as is known in the art and as shown in fig. 1. For example, starting from a pose prior (an initial estimate of the pose of the sensor device), the actual pose of the query sensor data may be iteratively estimated 16 based on a Perspective-n-Point (PnP) method in combination with a random sample consensus (RANSAC) algorithm. The view observed from the pose prior should have content overlapping with the query sensor data to make the matching process viable. The pose prior may be obtained by matching a global image descriptor against an image retrieval database or an implicit map, as known in the art. Similar to the procedure described in A. Moreau et al., "ImPosing: Implicit Pose Encoding for Efficient Visual Localization" (IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2023), January 2023, Waikoloa, USA, pages 2893-2902), the acquired pose estimate of the sensor device may be used as a new pose prior, and the matching and the PnP processing in combination with RANSAC may be iterated to refine the pose estimate of the sensor device, thereby refining the positioning result. Whereas classical 3D reference models provide only a limited set of reference descriptors, the other neural network can compute reference descriptors for any camera pose, which can improve the accuracy of the positioning results.
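The iterative refinement can be summarized by the following control-flow sketch. All three callables (`render_fn`, `match_fn`, `solve_fn`) are hypothetical placeholders for the rendering by the other neural network, the descriptor matching, and the PnP-with-RANSAC solver; none of these names comes from the text.

```python
def refine_pose(pose_prior, query_map, render_fn, match_fn, solve_fn, num_iters=3):
    """Iteratively refine a pose estimate (illustrative control flow only).

    render_fn(pose) -> (reference descriptor map, depth map) rendered at that pose
    match_fn(query_map, ref_map) -> list of correspondences
    solve_fn(matches, depth_map, pose) -> new pose estimate (e.g., PnP with RANSAC)
    """
    pose = pose_prior
    for _ in range(num_iters):
        ref_map, depth_map = render_fn(pose)
        matches = match_fn(query_map, ref_map)
        if not matches:
            break  # no overlap with the query: keep the last estimate
        # The new estimate becomes the pose prior of the next iteration.
        pose = solve_fn(matches, depth_map, pose)
    return pose
```

In practice the solver step could be an existing PnP-with-RANSAC routine, for example OpenCV's `solvePnPRansac`, and the number of iterations is a design choice trading accuracy against computation.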
Fig. 2 illustrates a neural network architecture for rendering a descriptor map for a positioning device equipped with a camera device, in accordance with a particular embodiment. The neural network shown in the figure is trained based on the NeRF-W technique (R. Martin-Brualla et al., "NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 19-25, 2021, pages 7210-7219, Computer Vision Foundation, IEEE, 2021). Such a NeRF neural network is able to adapt to dynamic (illumination) changes in outdoor scenes owing to the appearance embedding included in the implicit function. As shown in fig. 2, similar to the originally introduced NeRF neural network, the NeRF-W neural network is fed with input data based on sensor data in the form of a 3D position (x, y, z) 21 and a viewing direction d. The NeRF-W neural network can be considered to include a first multilayer perceptron (MLP) 22 and a second MLP 23 that includes a first logical portion 23a and a second logical portion 23b. The first MLP 22 receives the position (x, y, z) as input data and outputs data containing information on the position (x, y, z), which is input to the second MLP 23. Furthermore, the first MLP 22 outputs a volume density value σ that is independent of the viewing direction d. The first portion 23a of the second MLP 23 is trained to output color values RGB for the position (x, y, z) and the viewing direction d input to the second MLP 23. The first portion 23a of the second MLP 23 also receives an appearance embedding (app) to account for dynamic (illumination) changes of the captured scene. The first portion 23a of the second MLP 23 may or may not receive a transient embedding.
Based on the (RGB) color values and the volume density values σ, an image may be generated by volumetric rendering. According to the embodiment shown in FIG. 2, and unlike the teachings of R. Martin-Brualla et al. in "NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections" (IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 19-25, 2021, pages 7210-7219, Computer Vision Foundation, IEEE, 2021), the second portion 23b of the second MLP 23 is trained to output a local descriptor for the position (x, y, z) that depends neither on the viewing direction (angle) nor on the appearance embedding. The rendered (reference) descriptor maps, which are provided for matching with the query descriptor map (see description of fig. 1 above) for positioning purposes, may be generated by volumetric rendering of the local descriptors.
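The data flow of fig. 2 can be illustrated with a toy forward pass. This is only a sketch under strong simplifications: each MLP is replaced by a single randomly initialized linear layer, positional encoding and the transient embedding are omitted, and the class name, layer sizes and weights are hypothetical. It merely shows which inputs each output depends on.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(n_in, n_out):
    # Random weights stand in for trained parameters (assumption for illustration).
    return rng.normal(size=(n_in, n_out)) * 0.1, np.zeros(n_out)

def relu(x):
    return np.maximum(x, 0.0)

class DescriptorField:
    def __init__(self, hidden=32, app_dim=4, desc_dim=8):
        self.trunk = dense(3, hidden)                      # stands in for MLP 22
        self.sigma_head = dense(hidden, 1)                 # volume density output
        self.color_head = dense(hidden + 3 + app_dim, 3)   # stands in for portion 23a
        self.desc_head = dense(hidden, desc_dim)           # stands in for portion 23b

    def __call__(self, xyz, view_dir, app):
        w, b = self.trunk
        h = relu(xyz @ w + b)                  # position-only features
        w, b = self.sigma_head
        sigma = relu(h @ w + b)[0]             # independent of view_dir
        w, b = self.color_head
        logits = np.concatenate([h, view_dir, app]) @ w + b
        rgb = 1.0 / (1.0 + np.exp(-logits))    # depends on view_dir and appearance
        w, b = self.desc_head
        desc = h @ w + b                       # depends on the position only
        return sigma, rgb, desc
```

Because the descriptor head receives only the position features, the same 3D point yields the same descriptor for every viewing direction and appearance embedding, while the color may change.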
Fig. 3 shows a general training process of a neural network (feature extractor) 31 for extracting descriptors/features from an image and a neural network (neural renderer) 32 for rendering a (reference) descriptor map. For example, the first neural network 31 may be a fully convolutional neural network having 8 layers with ReLU activations and max pooling layers, and the second neural network 32 may be the neural network shown in FIG. 2. The two neural networks are jointly trained in a self-supervised manner using scene geometry information by defining an optimization objective (total loss function). Thus, descriptors specific to the target scene are obtained, which describe not only the visual content of a point but also its 3D position, enabling better discrimination than the generic descriptors provided by an off-the-shelf feature extractor. The generated descriptors are independent of the viewing direction and of the appearance embedding.
The training (e.g., RGB) image I captured by the camera device is input to the neural network 31, which outputs a first descriptor map DesM1 based on the input training image. For the same training image, input data based on the training image is input to the neural network 32. In the embodiment shown in fig. 3, the input data is the pose of the camera device capturing the training image I (i.e., the 3D position (x, y, z) and the viewing direction d described with reference to fig. 2). For example, the pose may be obtained from the training image I by a structure-from-motion (SfM) technique. The neural network 32 may also use an appearance embedding. The neural network 32 outputs training descriptors, volume density values σ (i.e., depth information), and colors. Volumetric rendering may generate a second (rendered training) descriptor map DesM2, a rendered training depth map DepM, and a rendered training image IM, respectively. The volumetric rendering includes an aggregation of the descriptors, volume density values, and color values along virtual camera rays, as is known in the art.
A photometric loss (mean squared error loss) L_MSE is applied to the rendered training image IM, supervised by the training image I, to train the (radiance field) neural network 32, as is known in the art. In addition, a structural dissimilarity loss L_SSIM can be minimized. Descriptor losses L_pos and L_neg are applied to the first training descriptor map DesM1 and the second training descriptor map DesM2 to jointly train the two neural networks 31 and 32. L_pos may be applied to maximize the similarity of the first training descriptor map DesM1 and the second training descriptor map DesM2, and L_neg may be applied to ensure that, for a pair of pixels (p1, p2) with pixel p1 in the first training descriptor map DesM1 and pixel p2 in the second training descriptor map DesM2, the descriptors differ from each other when the pixels are separated by a large geometric 3D distance.
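One possible form of the two descriptor losses is sketched below in NumPy. The exact loss definitions are not given in this passage, so the cosine-based positive term, the hinge on positive similarity for the negative term, the pairing index `perm` and the threshold `min_dist` are all assumptions made for illustration only.

```python
import numpy as np

def cosine(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return np.sum(a * b, axis=-1)

def descriptor_losses(des1, des2, xyz, perm, min_dist=1.0):
    """Hypothetical positive/negative descriptor losses.

    des1: (N, D) extractor descriptors at N sampled pixels (from DesM1)
    des2: (N, D) rendered descriptors at the same pixels (from DesM2)
    xyz:  (N, 3) 3D points of those pixels, e.g. from the rendered depth map
    perm: (N,) index of the pixel p2 paired with each pixel p1 as a negative
    """
    # Positive term: the same pixel should have similar descriptors in both maps.
    l_pos = np.mean(1.0 - cosine(des1, des2))
    # Negative term: only pairs that are far apart in 3D are penalized.
    far = np.linalg.norm(xyz - xyz[perm], axis=1) > min_dist
    sim_neg = cosine(des1, des2[perm])
    l_neg = np.sum(np.maximum(sim_neg, 0.0) * far) / max(far.sum(), 1)
    return l_pos, l_neg
```

The distance mask keeps the negative term from pushing apart descriptors of pixels that actually observe nearly the same 3D location.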
A regularization loss L_TV may be applied to the rendered training depth map DepM to further improve the quality of the geometry learned by the second neural network 32 by smoothing and limiting artifacts. Minimizing the structural dissimilarity loss L_SSIM and the regularization loss L_TV may improve positioning accuracy.
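A smoothing regularizer on a depth map can be sketched as follows. Interpreting the subscript TV as a total variation penalty is an assumption made here, and the function name is hypothetical.

```python
import numpy as np

def total_variation(depth_map):
    """Mean absolute difference between neighboring depth values."""
    depth_map = np.asarray(depth_map, dtype=float)
    dy = np.abs(np.diff(depth_map, axis=0))  # vertical neighbors
    dx = np.abs(np.diff(depth_map, axis=1))  # horizontal neighbors
    return dy.mean() + dx.mean()
```

A penalty of this kind is zero for a constant depth map and grows with isolated depth spikes, which is why it tends to suppress floating artifacts in the learned geometry.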
These losses form the overall objective function that needs to be optimized during the training process. Further detailed information about the loss is shown below.
In the embodiments shown in figs. 2 and 3, the sensor devices used for the positioning process and the training process, respectively, are camera devices. In an alternative embodiment, the sensor device is a LIDAR device that provides 3D point cloud data instead of images, and the configurations shown in figs. 2 and 3 require only straightforward modification.
Fig. 4 shows an embodiment of a method of locating a device comprising a sensor device by means of a first neural network 41 for descriptor/feature extraction and a second neural network 42 for providing a neural field comprising descriptors. For example, the second neural network 42 may be the neural network shown in fig. 2. For example, the first neural network 41 and the second neural network 42 are jointly trained in a manner similar to the joint training of the neural network 31 and the neural network 32 shown in fig. 3.
During actual navigation of a mobile device equipped with a sensor device (e.g. a vehicle equipped with a camera device or a LIDAR device), a query sensor dataset SD (e.g. a query image or a query 3D point cloud) is input into the first neural network 41 for descriptor extraction to generate a first descriptor map DesM1. Starting from a pose prior (an initial value of the pose, i.e. the position and the viewing direction), the second neural network 42 outputs descriptors, volume density values and color values, which are rendered to generate a second descriptor map DesM2, a depth map DepM and an (RGB) image IM, respectively. The two 2D descriptor maps DesM1 and DesM2 are matched to each other to establish 2D-2D local correspondences. Using the depth map DepM associated with the second descriptor map DesM2, 2D-3D correspondences are then established by means of the depth information in the third dimension. Based on the matched descriptors, a pose estimate may be calculated 43 by a Perspective-n-Point (PnP) method in combination with a random sample consensus (RANSAC) algorithm, as is known in the art. The estimated sensor pose may be used as a new pose prior, and, based on the new pose prior, a new descriptor map may be rendered and matched with the first descriptor map DesM1. The pose estimation process may be iterated multiple times in order to improve the accuracy of the estimated pose and thus the positioning accuracy of the device.
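The iteration described above can be illustrated by the following minimal sketch. It is not the claimed implementation: the function name refine_pose and the three callables render, match and solve_pnp are hypothetical placeholders for the second neural network with its renderer, the descriptor matcher and the PnP/RANSAC solver, and the minimum of four correspondences is an assumption.

```python
def refine_pose(query_descriptors, pose_prior, render, match, solve_pnp,
                num_iterations=3):
    """Iteratively refine a pose prior (illustrative sketch only).

    render(pose)    -> (reference_descriptors, depth)   # hypothetical callable
    match(q, r, depth, pose) -> list of 2D-3D matches   # hypothetical callable
    solve_pnp(matches) -> new pose estimate             # hypothetical callable
    """
    pose = pose_prior
    for _ in range(num_iterations):
        # Render the second descriptor map and depth map from the current pose.
        reference_descriptors, depth = render(pose)
        # Match it against the first descriptor map to get 2D-3D matches.
        matches = match(query_descriptors, reference_descriptors, depth, pose)
        # Assumption: stop if there are too few matches for a PnP solve.
        if len(matches) < 4:
            break
        # The new estimate becomes the pose prior of the next iteration.
        pose = solve_pnp(matches)
    return pose
```

Passing the three stages as callables keeps the loop independent of any particular network or solver; in practice the number of iterations would be chosen according to the available computing resources.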
Fig. 5 is a flow chart illustrating an embodiment of a method 50 of determining a location of a device including a sensor device. The method 50 includes the sensor device acquiring S51 sensor data representing an environment of the device. For example, the sensor device is a camera that captures a 2D image of the environment, or a LIDAR device that captures a 3D point cloud representing the environment. The method 50 further includes the first neural network generating S52 a first descriptor map based on the sensor data. The first neural network may be one of the examples of the first neural network for descriptor/feature extraction described above, for example, the neural network 31 shown in fig. 3 or the neural network 41 shown in fig. 4.
The method 50 further comprises inputting S53 input data based on the sensor data to a second neural network different from the first neural network, and the second neural network outputting S54 descriptors (in particular, the descriptors may be independent of the viewing direction) based on the input data. For example, the input data represents an initial or refined estimate of the pose of the sensor device acquired based on the sensor data. The second neural network may be one of the examples of the second neural network described above, for example, the neural network 32 shown in fig. 3 or the neural network 42 shown in fig. 4, and may include the MLP 32a and the MLP 32b shown in fig. 2. The first neural network and the second neural network may be jointly trained to maximize the similarity of the respective training descriptor maps based on the descriptor loss applied to the training descriptor map provided by the first neural network and the training descriptor map rendered based on the descriptors output by the second neural network.
The method 50 further comprises volume rendering S55 the descriptors to obtain a second descriptor map, and matching S56 the first descriptor map with the second descriptor map. The pose of the sensor device is determined S57 based on the matching S56 (the descriptor maps match each other to some predetermined extent). Details of specific steps of the method 50 may be similar to those described above.
Fig. 6 shows a positioning device 60 according to an embodiment. The positioning device 60 may be mounted on a vehicle (e.g., an automobile). The positioning device 60 may be used to perform the method steps of the method 50 described above in connection with fig. 5. The positioning device comprises a sensor device 61, for example a camera capturing a 2D image of the environment of the positioning device, or a LIDAR device capturing a 3D point cloud representing the environment. Sensor data captured by the sensor device 61 is processed by a first neural network 62 and a second neural network 63 that is different from the first neural network 62. The first neural network is used to generate a first descriptor map (in particular, including descriptors that may be independent of the viewing direction) based on the sensor data. The first neural network may be one of the examples of the first neural network for descriptor/feature extraction described above, for example, the neural network 31 shown in fig. 3 or the neural network 41 shown in fig. 4.
The second neural network is used to output descriptors (which may be independent of the viewing direction) based on input data (e.g., a pose prior or a refined pose estimate) that are based on the sensor data. The second neural network may be one of the examples of the second neural network described above, for example, the neural network 32 shown in fig. 3 or the neural network 42 shown in fig. 4, and may include the MLP 32a and the MLP 32b shown in fig. 2. The first neural network 62 and the second neural network 63 may be jointly trained to maximize the similarity of the respective training descriptor maps based on the descriptor loss applied to the training descriptor map provided by the first neural network 62 and the training descriptor map rendered based on the descriptors output by the second neural network 63.
Furthermore, the positioning device comprises a processing unit 64, which is arranged to process the output provided by the first neural network 62 and the second neural network 63. The processing unit 64 is arranged for volume rendering the descriptors to obtain a second descriptor map, matching the first descriptor map with the second descriptor map, and determining the pose of the sensor device based on the matching (the descriptor maps match each other to some predetermined extent).
It should be noted that in the embodiments shown in fig. 1 to 6, the data processing may be performed entirely on the device (e.g., vehicle) to be located, thereby ensuring the privacy of the user data. In cases where computational power is limited, the data processing may be performed partly or wholly on an external processing unit (server). For example, the query sensor dataset is transmitted to the external processing unit, which performs the localization and informs the device of its location. In this case, since only its location information is provided to the device, the privacy of the server-side data is ensured. According to another embodiment, the extraction of features/descriptors from the query sensor data may be performed on the device, and the extracted features/descriptors may be transmitted to the external processing unit, which performs the remaining steps of the positioning process.
Embodiments of the above-described methods and apparatus may be suitably integrated in vehicles, automated guided vehicles (AGVs), autonomous mobile robots, and the like, to facilitate navigation, positioning, and obstacle avoidance. In an automobile, embodiments of the above-described methods and apparatus may be comprised in an advanced driver-assistance system (ADAS). Other embodiments of the above-described methods and apparatus may be suitably implemented in augmented reality applications.
Description of further details of the above-described specific embodiments
Camera relocalization methods estimate the position and orientation of an image captured in a given environment. One solution to this problem is to match the image with a pre-computed map constructed from data previously collected in the target area.
Embodiments are described in which the reference map is represented by a neural field, from which consistent local descriptors with 3D coordinates can be rendered from any viewpoint. This gives rise to several advantages over classical sparse 3D models: dense feature matching can be performed, the pose estimate can be iteratively refined, and storage requirements are reduced. The proposed system learns local features of the scene in a self-supervised manner, outperforms related methods, and can establish accurate matches even when the pose prior is far from the actual camera pose.
Visual localization (i.e., the problem of camera pose estimation in a known environment) makes it possible to build camera-based localization systems for various applications such as autonomous driving, robotics, or augmented reality. The goal is to predict the 6-DoF camera pose (translation and orientation) from 2D camera sensor measurements, which is a projective geometry problem. The best-performing methods in this field are the so-called structure-based methods, which operate by matching image features to a pre-computed 3D model representing the environment, i.e. the map. Typically, these maps are obtained by Structure-from-Motion (SfM), which triangulates key points of the reference images to construct a sparse 3D point cloud. The storage requirement of such a 3D model is high, yet the stored information remains sparse.
Recently, a new approach to representing a scene implicitly has emerged, namely the neural radiance field (NeRF). Rather than being represented explicitly by a point cloud, mesh, voxel grid, or the like, the scene is represented implicitly by a neural network that learns the mapping from 3D point coordinates to density and radiance. A NeRF is trained using a sparse set of posed images representing a given scene, and learns the underlying 3D geometry without supervision. The resulting model is continuous, i.e. the radiance of any 3D point in the scene can be calculated, so that a photorealistic view can be rendered from any viewpoint. Subsequent work has shown that additional modalities (e.g., semantics) can be incorporated into the radiance field and accurately rendered. By means of neural fields, dense information about the scene can be stored in a compact way, with the neural network weights occupying only a few megabytes.
We propose to introduce local descriptors into the NeRF implicit formulation and to use the resulting model as a localization map. We jointly train a CNN feature extractor and the neural renderer to provide consistent scene-specific descriptors in a self-supervised manner. Unlike radiance, the rendered descriptors depend neither on the viewing direction nor on the image appearance. The advantage of this approach is that repeatable features are learned, which enables accurate matching under extreme viewpoint changes. Drawing on different advanced neural rendering techniques, we make the model computationally cheaper by means of the multi-resolution hash encoding of Instant-NGP, and adapt the model to dynamic outdoor scenes by means of the appearance embedding of NeRF-W. During training, we use the 3D information learned by the radiance field in a metric learning optimization objective that requires neither image pairs nor pixel correspondences pre-computed from a 3D model as supervision.
Finally, we demonstrate that these features can solve the visual relocalization task by a simple structure-based approach based on sparse feature matching and/or dense feature alignment. The commonly used sparse 3D model obtained from structure from motion is replaced by a neural field from which dense reference features can be queried for any camera pose, while the neural field has very compact storage requirements compared to the prior art.
Camera relocalization. Estimating the 6-DoF camera pose of a query image in a known environment is widely discussed in the literature. Structure-based approaches compare local image features to an existing 3D model. The classical pipeline consists mainly of two steps. First, the nearest reference images are retrieved for the query image using a global image descriptor. These priors are then refined by geometric reasoning based on the extracted local features. Such pose refinement may be based on sparse feature matching, direct feature alignment, or relative pose regression. Structure-based relocalization methods are the most accurate, but require storing and exploiting a large amount of map information, which means high computational cost and memory footprint. Even though compression methods have been developed, handling memory-intensive maps remains a challenging task. An efficient alternative is absolute pose regression, which maps the query image to the associated camera pose in a single forward pass of a neural network, but with lower accuracy. Scene coordinate regression learns the mapping between image pixels and 3D coordinates, so that accurate poses can be calculated by a Perspective-n-Point method, but it scales poorly to large environments. Our approach refines camera pose priors in a structure-based manner, but replaces the traditional 3D model with a compact implicit representation.
Localization using neural scene representations. Recently, localization methods have been improved by using NeRF models and related models in a variety of ways. iNeRF iteratively optimizes the camera pose by minimizing the NeRF photometric error. LENS improves the accuracy of absolute pose regression methods by using new views rendered by a NeRF, evenly distributed over the map, as additional training data. iMAP and NICE-SLAM obtain results competitive with state-of-the-art methods by using a neural implicit map to solve the RGB-D SLAM problem. ImPosing solves the kilometer-scale localization problem by measuring the similarity between a global image descriptor and an implicit camera pose representation. Closely related to our work, Feature Query Networks (FQN) learn descriptors in an implicit function for relocalization. That model is trained only on a pre-computed sparse 3D point cloud, uses an existing feature extractor as supervision, and learns how descriptors change with the viewpoint, which enables iterative pose refinement. We take the opposite approach and let the radiance field learn the 3D geometry and view-invariant descriptors at the same time. The latter are learned in a self-supervised manner without any additional training data. To our knowledge, no one has previously proposed learning visual localization descriptors in neural radiance fields without supervision.
Learning-based local descriptors. Local descriptors aim to describe regions of interest efficiently, so that accurate correspondences can be established between pairs of images depicting the same scene. While hand-crafted descriptors such as SIFT and SURF have met with great success, the emphasis in recent years has shifted to learning feature extraction from large amounts of visual data. Many learning-based schemes rely on a Siamese convolutional network trained using image/patch pairs or triplets supervised by correspondences. By using two augmented versions of the same image, a feature extractor can be trained without annotated correspondences. SuperPoint uses homographies, while Novotny et al. utilize image warping. We propose a method that follows a different path to learn repeatable descriptors: we constrain the feature extractor to provide the same descriptor map as the volumetric neural renderer. Since the neural renderer adopts a ray marching scheme and is geometrically consistent by design, this method enables us to learn dense scene-specific descriptors without annotated correspondences.
Our method estimates the 6-DoF camera pose (i.e., 3D translation and 3D orientation) of a query image in a previously visited environment. In an offline step, we first train our modules using a set of reference images with corresponding poses, previously captured in the region of interest. Since we learn the scene geometry during the training process, a 3D model of the scene is not necessary.
1. Neural rendering of local descriptors:
NeRF is capable of rendering views from any camera pose in a given scene while being trained using only a sparse set of observations. Given a camera pose with known intrinsics, the 2D pixels are back-projected into the 3D scene by ray marching. The density σ and RGB color c of each point p = (x, y, z) along the ray are estimated by an MLP. The final color of a pixel is calculated by volume rendering along the ray, which is differentiable and makes it possible to train the entire system by minimizing the photometric error of the rendered image.
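As an illustration of volume rendering along a single ray, the following minimal sketch uses the standard NeRF quadrature: each sample contributes with weight T_i · (1 − exp(−σ_i · δ_i)), where T_i is the transmittance accumulated in front of it. The function name render_ray and the argument layout are assumptions made for this example and are not taken from the embodiments.

```python
import math

def render_ray(sigmas, values, deltas):
    """Composite per-sample values (e.g. RGB colors) along one ray.

    sigmas: densities of the samples, ordered from near to far
    values: one vector per sample (color, descriptor, or [distance] for depth)
    deltas: distances between consecutive samples
    Returns the rendered vector and the per-sample weights.
    """
    dim = len(values[0])
    rendered = [0.0] * dim
    weights = []
    transmittance = 1.0  # probability that the ray reaches the current sample
    for sigma, value, delta in zip(sigmas, values, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        weights.append(weight)
        for k in range(dim):
            rendered[k] += weight * value[k]
        transmittance *= 1.0 - alpha
    return rendered, weights
```

The same weights can be reused for every quantity predicted at the samples, so that colors, descriptors and depth (by passing the sample distances as values) are rendered consistently from a single density estimate.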
NeRF assumes that the illumination in the scene remains constant over time, but this is not true in many real-world scenes. NeRF-W overcomes this limitation by modeling the appearance using an appearance embedding (see also fig. 2) that controls the appearance of each rendered view. Another limitation of this neural scene representation is the computation time: rendering an image requires H × W × N evaluations of an 8-layer MLP, where N is the number of points sampled per ray. Instant-NGP proposes to use a multi-resolution hash encoding to speed up the process by storing local features in hash tables, which are then processed by smaller MLPs (compared to those in NeRF), thus significantly shortening training time and inference time.
Descriptor-NeRF. Our neural renderer combines the three techniques described above to render dynamic scenes efficiently. However, our primary goal is not photorealistic rendering, but rendering features that can be matched with new observations. While a query image can be aligned to a NeRF model by minimizing the photometric error, this approach lacks robustness to changes in illumination. Instead, we propose to add local descriptors, i.e. D-dimensional latent vectors describing the visual content of a region of interest in the scene, as an additional output of the radiance field function. Unlike the rendered colors, we model these descriptors as independent of the viewing direction d and of the appearance vector, and demonstrate below that this makes the matching process more robust. As for color, the 2D descriptor of a camera ray is aggregated by the usual volume rendering formula applied to the descriptors at each point along the ray. The neural renderer architecture is shown in fig. 2, and the training process is explained in the next section.
2. Training self-supervised feature extraction in neural fields
Motivation. Although the neural renderer described above can represent the map in our relocalization method, we also need to extract features from the query image. The simple approach proposed by FQN is to use an off-the-shelf pre-trained feature extractor (e.g., SuperPoint or D2-Net) and train the neural renderer to memorize the observed descriptors as a function of the viewing direction. Instead, we propose to jointly train the feature extractor and the neural renderer by defining an optimization objective that exploits the scene geometry. We thereby obtain descriptors specialized for the target scene that describe not only the visual content but also the 3D position of the observed point, which makes them more discriminative than generic descriptors.
The training process is shown in fig. 3. One training sample is a reference image with its corresponding camera pose. On the one hand, the image is processed by the feature extractor to obtain a descriptor map F_I. On the other hand, we cast a ray through each pixel using the camera intrinsics, sample points along it, calculate the density, color and descriptor of each 3D point, and finally perform volume rendering to obtain an RGB view c_R, a descriptor map F_R and a depth map d_R.
Feature extraction. Our feature extractor is a simple fully convolutional neural network with 8 layers, ReLU activations and max pooling. The input is an RGB image I of size H × W, and a dense descriptor map F_I of size H/4 × W/4 × D is generated.
Learning the radiance field. Similarly to NeRF, we use the mean squared error loss L_MSE between c_R and the real image to learn the radiance field. Since we render the whole (downscaled) image in a single training step, we can exploit the local 2D image structure and also optimize SSIM (denoted as the loss L_SSIM); we observe that this produces sharper images and better results. The positioning process uses the depth map to calculate the camera pose, and a better depth map results in a more accurate pose. A NeRF model trained using a limited number of training views may produce incorrect depths due to the shape-radiance ambiguity. We therefore add a regularization loss L_TV that minimizes the total variation of the depth over randomly sampled 5 × 5 image patches, thereby improving the smoothness of the rendered depth map and limiting artifacts in it. L_SSIM and L_TV improve positioning accuracy and image reconstruction quality.
Learning descriptors. Our goal is to make the descriptors from the feature extractor and those from the neural renderer match. Our self-supervised objective encourages both models to generate the same features for a given pixel, while preventing high matching scores between points that are far apart from each other in the 3D scene. We define a loss function with two terms, L_pos and L_neg, applied to a pair of descriptor maps, each containing n pixels. We use cosine similarity to measure the similarity between descriptors.
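The depth regularization can be sketched as follows. The text only states that the total variation of the depth over randomly sampled 5 × 5 patches is minimized; the anisotropic (absolute difference) form used here and the helper names sample_patch and total_variation are assumptions made for illustration.

```python
import random

def sample_patch(depth_map, size=5, rng=random):
    """Cut a random size x size patch out of a depth map (list of rows)."""
    top = rng.randrange(len(depth_map) - size + 1)
    left = rng.randrange(len(depth_map[0]) - size + 1)
    return [row[left:left + size] for row in depth_map[top:top + size]]

def total_variation(patch):
    """Sum of absolute depth differences between neighbouring pixels.

    Assumption: anisotropic total variation over horizontal and vertical
    neighbours; the exact form of the regularizer is not given in the text.
    """
    rows, cols = len(patch), len(patch[0])
    tv = 0.0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                tv += abs(patch[r][c + 1] - patch[r][c])
            if r + 1 < rows:
                tv += abs(patch[r + 1][c] - patch[r][c])
    return tv
```

A perfectly flat patch yields a value of zero, so minimizing this quantity penalizes isolated depth spikes, at the cost of also penalizing genuine depth discontinuities that fall inside a patch.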
The first term, L_pos, maximizes the similarity between the descriptor maps F_I and F_R produced by the two models;
the second term, L_neg, samples random pixel pairs and ensures that pixel pairs separated by a large 3D distance have different descriptors,
where $t_\lambda(i, j) = \max(0,\, 1 - \lambda \lVert xyz(i) - xyz(j) \rVert)$.
Here, xyz(i) denotes the 3D coordinates of the point represented by the i-th pixel of the descriptor map. We calculate these coordinates from the camera parameters and the predicted depth of the rendered view. Note that we do not back-propagate the gradient of this loss to the depth map. λ is a hyper-parameter controlling the linear relationship between descriptor similarity and 3D distance. P is a random permutation of the pixel indices from 1 to n.
The proposed self-supervision objective is close to classical triplet-loss functions, but we show in section 4.3 that the injection of 3D coordinates in the formula is crucial for learning meaningful descriptors.
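The role of the threshold t_λ can be illustrated with the following sketch. Only t_λ(i, j) = max(0, 1 − λ‖xyz(i) − xyz(j)‖) is taken from the description; the concrete forms of the two terms shown here (mean of 1 − similarity for the positive term, and a hinge on the similarity exceeding t_λ for the negative term) are assumptions chosen to be consistent with the verbal description, and the function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two descriptors (0.0 if either has zero norm)."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm > 0.0 else 0.0

def t_lambda(xyz_i, xyz_j, lam):
    """Maximum similarity tolerated for two points, as given in the text."""
    return max(0.0, 1.0 - lam * math.dist(xyz_i, xyz_j))

def descriptor_losses(f_i, f_r, xyz, lam, perm):
    """Return (l_pos, l_neg) for two flattened descriptor maps of n pixels.

    f_i, f_r: descriptors from the feature extractor and the renderer
    xyz:      3D point of each pixel; perm: permutation P of range(n)
    Assumed forms, not the exact expressions of the description.
    """
    n = len(f_i)
    # Positive term: the same pixel should get the same descriptor.
    l_pos = sum(1.0 - cosine_similarity(f_i[i], f_r[i]) for i in range(n)) / n
    # Negative term: penalize similarity above what the 3D distance allows.
    l_neg = sum(
        max(0.0, cosine_similarity(f_i[i], f_r[perm[i]])
            - t_lambda(xyz[i], xyz[perm[i]], lam))
        for i in range(n)
    ) / n
    return l_pos, l_neg
```

With this form, two pixels that observe nearby 3D points are allowed to have similar descriptors (t_λ close to 1), whereas pixels further apart than 1/λ are penalized for any positive similarity.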
Finally, we optimize the following loss function at each training step:
3. Visual localization by iterative dense feature matching
This section describes a positioning method that estimates the camera pose of a query image using our learned modules. An overview of this process is shown in fig. 4. The proposed solution combines simple and common techniques in a new way. We demonstrate that our learned high-quality features enable accurate positioning even when simple feature matching and pose estimation strategies are used.
I. Positioning prior. Similar to related feature matching methods, we assume that a positioning prior is available, i.e., a camera pose relatively close to the query pose. The view observed from the prior should have visual content overlapping with the query image to make the matching process viable. Such a prior may be obtained by matching a global image descriptor against an image retrieval database or an implicit map.
II. Feature extraction. First, we extract dense descriptors from the query image via the CNN and use the neural renderer to render descriptors and depth from the positioning prior.
III. Dense feature matching. Query descriptors are matched with reference descriptors using cosine similarity. We consider two descriptors to be a match if their similarity is above a threshold θ and if each is the best candidate for the other in the other map (a mutual match). We then calculate the predicted 3D coordinates (from the camera parameters and depth) of the matched rendered pixels and obtain a set of 2D-3D matches.
IV. Camera pose estimation. A Perspective-n-Point method is combined with RANSAC to obtain a robust estimate by removing outlier matches.
V. Iterative pose refinement. Whereas classical 3D models only have access to a limited set of reference descriptors, our neural renderer can compute reference descriptors from any camera pose. Similar to FQN and ImPosing, we can use the camera pose estimate as a new positioning prior and iterate the aforementioned steps multiple times to refine the camera pose.
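The dense feature matching step can be sketched as follows for small, flattened descriptor maps. The function names, the pinhole intrinsics fx, fy, cx, cy and the interpretation of the rendered depth as a distance along the optical axis are assumptions made for this example; the back-projected point is expressed in the camera frame of the rendered view and would still have to be transformed by the pose prior to obtain scene coordinates.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two descriptors (0.0 if either has zero norm)."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm > 0.0 else 0.0

def mutual_matches(query, reference, theta):
    """Return index pairs (i, j) that are mutual best matches above theta."""
    sim = [[cosine_similarity(q, r) for r in reference] for q in query]
    # Best reference candidate for every query descriptor ...
    best_ref = [max(range(len(reference)), key=lambda j: sim[i][j])
                for i in range(len(query))]
    # ... and best query candidate for every reference descriptor.
    best_query = [max(range(len(query)), key=lambda i: sim[i][j])
                  for j in range(len(reference))]
    return [(i, j) for i, j in enumerate(best_ref)
            if best_query[j] == i and sim[i][j] > theta]

def backproject(u, v, depth, fx, fy, cx, cy):
    """3D point (camera frame) of rendered pixel (u, v) under a pinhole model."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)
```

The resulting pairs of query pixel coordinates and back-projected 3D points form the 2D-3D matches passed to the pose estimation step; the exhaustive similarity matrix used here grows quadratically with the number of pixels, so a practical system would rely on a vectorized or approximate search.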
We compared our results with state-of-the-art visual relocalization methods. As the positioning prior, we use the top-1 reference pose retrieved by DenseVLAD (whereas other methods use the top-10 reference images).
Table 1: Results on the 7-Scenes dataset
| cm/° | Fire | Chess | Stairs | Red Kitchen | Office | Pumpkin | Heads |
|---|---|---|---|---|---|---|---|
| Prior | 0.34/13.2 | 0.22/12.1 | 0.26/15.8 | 0.29/12.0 | | 0.31/10.8 | 0.16/15.0 |
| Iteration 1 | 0.08/2.8 | 0.02/0.7 | 0.21/2.9 | 0.04/1.1 | | 0.04/1.0 | 0.05/2.8 |
| Iteration 2 | 0.06/2.2 | 0.01/0.5 | 0.28/4.1 | 0.03/0.8 | | 0.03/0.8 | 0.03/2.5 |
| Iteration 3 | 0.05/1.9 | 0.01/0.4 | 0.29/4.4 | 0.02/0.8 | | 0.03/0.8 | 0.03/2.3 |
Table 2: Results on the Cambridge Landmarks dataset
We propose to represent the visual localization map by means of a neural field. This makes it possible to represent the scene densely with a smaller memory footprint, to learn local features for the target area without supervision, and to perform better than related visual localization methods. The proposed pipeline should be compatible with future improvements in the field of neural rendering that extend these models to larger scenes.
All of the embodiments discussed above are not intended to be limiting, but rather to illustrate the features and advantages of the present invention. It should be understood that some or all of the features described above may also be combined in different ways.