Disclosure of Invention
To remedy the shortcomings of the prior art, the invention provides a method for grabbing test tubes with a mechanical arm based on depth completion of transparent objects. A depth camera collects depth information and color information of the test tube surface in real time; a depth completion network processes the transparent test tube data to obtain a complete depth map; the position of the test tube is derived from the obtained depth data; the grabbing pose of the test tube is calculated according to the principle of 3D spatial rotation; the end trajectory of the mechanical arm is planned by an interpolation algorithm; the motion information in joint space is obtained through inverse kinematics; and finally the mechanical arm is controlled to complete the grabbing of the test tube.
The method for grabbing test tubes with a mechanical arm based on transparent object depth completion is characterized by comprising the following steps:
(1) Constructing a transparent object depth completion model;
(2) Projecting the completed point cloud back to the depth image as the input of the depth completion module, and training the model;
(3) Acquiring and aligning a color image and a depth image of the transparent test tube with a depth camera fixed at the end of the mechanical arm;
(4) Porting the depth completion model constructed in step (1) to the ROS development platform, loading the network weight parameters saved in step (2), and processing the test tube images acquired in real time in ROS to obtain a complete depth image;
(5) The mechanical arm subscribes to the topics published in step (4) to obtain the position information of the test tube, and calculates the grabbing position of the mechanical arm end effector in combination with the center coordinate system $(x_t, y_t, z_t)$ at the top of the test tube, to obtain the pose information of the end effector in Cartesian space;
(6) Obtaining the trajectory information of the clamping jaw at the end of the mechanical arm from the pose information in the world coordinate system obtained in step (5), in combination with a cubic B-spline interpolation algorithm.
Further, the transparent object depth completion model comprises a point cloud completion module and a depth completion module, wherein the point cloud completion module preprocesses the depth map, and the depth completion module takes the more complete depth data converted from the point cloud as input and further refines the depth information.
Further, the step (1) includes:
(11) Constructing a point cloud completion module, back-projecting the depth of the transparent object to a point cloud, and estimating the correct depth image by predicting the shape of the complete point cloud;
(12) Taking the back-projected sparse point cloud as the input of the point cloud completion module;
(121) For the unordered sparse point cloud in step (12), constructing a 3D grid to convert the unordered point cloud information into regular data capable of representing the local information and structure of the point cloud;
(122) Defining a grid $G = \langle V, W \rangle$, where $V$ and $W$ represent the set of vertices and the set of values of the grid, respectively;
(123) Determining the coordinate position of each point of the original point cloud, and calculating the weight corresponding to each vertex;
(124) Learning the feature data required to complete the point cloud using a 3D CNN encoder-decoder structure, and converting the grid into an unordered point cloud through Gridding Reverse;
(125) Completing the point cloud by creating an MLP that further refines the coarse point cloud.
Further, in step (11), the equation that maps a pixel of the depth image to its 3D space coordinates is:
where $(x, y, z)$ is a three-dimensional coordinate point in space, $(x', y')$ is the pixel coordinate in the depth map, $D$ is the depth value, and $f_x$, $f_y$ are camera intrinsic parameters;
in step (122), $V = \{v_i\}$ and $W = \{w_i\}$, where $v_i$ denotes each vertex generated in the grid and $w_i$ denotes the weight of each vertex;
in step (123), if the coordinate position of a point $p = (x, y, z)$ in the original point cloud satisfies:
then the point is taken as a point in the neighborhood of vertex $v_i$. The weight corresponding to each vertex can be expressed as:
Furthermore, the depth completion module adopts an encoder-decoder architecture. The encoder uses EfficientNet as its backbone and constructs a CECD module and a CEC module; the RGB and depth images are concatenated as the network input, the depth data is fed into a single-channel sampling module for feature extraction, and the depth features of the corresponding resolution are used as the input of each CECD block and each CEC block;
the decoder is a lightweight RefineNet decoder consisting mainly of a CRP module and a FUSE module, wherein the CRP module consists mainly of a 1x1 convolution layer and a 5x5 max pooling layer, and the FUSE module consists mainly of two 1x1 convolution layers and an up-sampling module; the feature information output by the CEC module is combined with the original depth information as the input of the decoder, and the complete depth information is recovered.
Further, step (2) includes:
(21) Projecting the completed point cloud back to the depth image, a process that can be expressed as:
(22) Inputting the noisy depth map generated after point cloud completion, together with the original RGB color image, into the depth completion module;
(23) Training the point cloud completion network using the Gridding Loss, whose loss function can be expressed as:
where the loss is taken over the weight of each vertex in the grid;
The loss function of the depth completion network can be expressed as:
where $\beta$ is a weight parameter, the predicted depth is compared with the true depth $D_{gt}$, and $D_h$ and $D_w$ are the gradient vectors of the depth map $D$ along the height and width coordinate axes, respectively.
Further, in step (3), an "eye-in-hand" calibration configuration is adopted;
ClearGrasp, TODD and TransCG are used as training datasets.
Further, in step (4), after the complete depth image is acquired, it is mapped to coordinate points in 3D space, and the resulting coordinate points are published through the ROS message mechanism.
Further, step (5) includes:
(51) Defining the center point of the opening and closing of the clamping jaw as the origin of the end coordinate system, taking the downward direction perpendicular to the clamping jaw plane as the Z-axis direction and the forward direction parallel to the grabbing direction of the clamping jaw as the X-axis direction, and determining the Y-axis direction according to the right-hand rule;
(52) Back-projecting the completed depth map to 3D space again to obtain the complete point cloud. The transparent test tube is placed on a test tube rack, and the grabbing pose is located by traversing all points of the point cloud. The highest point at the top of the test tube is searched first:
For the vertex $p_{highest} = (x_h, y_h, z_h)$, all points within its neighborhood $N(p_{highest})$ that meet the following requirement are searched:
where $r_T$ denotes the radius of the test tube orifice; the center point $(x_c, y_c, z_c)$ of the neighborhood $N(p_{highest})$ is then taken as the spatial position at which the clamping jaw grips;
(53) After the grabbing position of the test tube is obtained, the orientation of the clamping jaw at the grabbing point is calculated. According to the coordinate system set in (51), the positive direction of the x-axis is the direction in which the clamping jaw advances to grab, and the orientation of the clamping jaw coordinate system during grabbing is its orientation at the time of image acquisition;
(54) From the obtained test tube position in the camera coordinate system and the transformation between the camera coordinate system and the world coordinate system, the test tube coordinate position in the world coordinate system is obtained, and the process can be expressed as follows:
Further, in step (6), the cubic B-spline interpolation is defined as follows:
$$C_3(u) = \sum_{i} P_i N_{i,3}(u)$$
where $P_i$ are the control points of the spline curve, set here as the initial point and the grabbing point of the test tube, and $N_{i,3}$ is the basis function of the cubic spline curve, which can be obtained from the recurrence formula:
$$N_{i,0}(u) = \begin{cases} 1, & u_i \le u < u_{i+1} \\ 0, & \text{otherwise} \end{cases} \qquad N_{i,k}(u) = \frac{u - u_i}{u_{i+k} - u_i} N_{i,k-1}(u) + \frac{u_{i+k+1} - u}{u_{i+k+1} - u_{i+1}} N_{i+1,k-1}(u)$$
where $k$ is the degree of the curve and $u_i$ are the knots of the spline;
A time stamp, velocity and acceleration values are added to the obtained trajectory information to obtain a complete motion trajectory, which is converted into motion information in joint space through inverse kinematics and sent to the mechanical arm control module to complete the test tube grabbing task.
Compared with the prior art, the invention has the following advantages:
According to the invention, an RGB-D sensor collects depth information and RGB information of the surface of the transparent test tube; the depth of the transparent test tube is completed by a depth completion model with an improved U-Net architecture; the grabbing pose of the test tube is then calculated from the complete depth information; and finally the grabbing trajectory of the mechanical arm is obtained by a trajectory planning algorithm. The method can accurately estimate the position of the transparent test tube and rapidly plan the corresponding grabbing trajectory to complete the grab.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in Fig. 1, a method for grabbing test tubes with a mechanical arm based on transparent object depth completion includes:
Step one, constructing a transparent object depth completion model as shown in the figure. A point cloud completion module is constructed first: the depth of the transparent object is back-projected to a point cloud, and the correct depth image is estimated by predicting the shape of the complete point cloud. The process of converting depth map pixels into 3D space coordinates can be expressed as:
where $(x, y, z)$ is a three-dimensional coordinate point in space, $(x', y')$ is the pixel coordinate in the depth map, $D$ is the depth value, and $f_x$, $f_y$ are camera intrinsic parameters.
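As an illustration of this back-projection, the following is a minimal Python sketch that assumes the standard pinhole camera model. The principal point `cx`, `cy` and the convention that a depth of 0 marks a missing measurement are assumptions not stated in the text above, and the function name `backproject_depth` is hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def backproject_depth(depth: List[List[float]], fx: float, fy: float,
                      cx: float, cy: float) -> List[Point]:
    """Convert every valid depth pixel into a 3D point in the camera frame."""
    points: List[Point] = []
    for v, row in enumerate(depth):        # v is the pixel row (y')
        for u, d in enumerate(row):        # u is the pixel column (x')
            if d <= 0:                     # assumption: 0 marks a missing depth value
                continue
            x = (u - cx) * d / fx          # cx, cy: assumed principal point
            y = (v - cy) * d / fy
            points.append((x, y, d))       # z equals the measured depth D
    return points


# Tiny 2x2 depth map with two holes, as a transparent surface might produce.
depth_map = [[0.0, 1.0],
             [2.0, 0.0]]
cloud = backproject_depth(depth_map, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
print(cloud)
```

Pixels without a valid depth are skipped, so the resulting point cloud is sparse exactly where the sensor failed on the transparent surface.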
The back-projected sparse point cloud is used as the input of the point cloud completion module. Inspired by the point cloud processing module proposed in GRNet, a 3D grid is first constructed for the input sparse, unordered point cloud, converting the unordered point cloud information into regular data capable of representing the local information and structure of the point cloud. Define a grid $G = \langle V, W \rangle$, where $V = \{v_i\}$ and $W = \{w_i\}$ represent the set of vertices and the set of values of the grid, respectively; $v_i$ denotes each vertex generated in the grid and $w_i$ denotes the weight of each vertex. If the coordinate position of a point $p = (x, y, z)$ in the original point cloud satisfies:
then the point is taken as a point in the neighborhood of vertex $v_i$. The weight corresponding to each vertex can be expressed as:
A 3D CNN encoder-decoder structure is then used to learn the feature data needed to complete the point cloud, and the grid is converted into an unordered point cloud by Gridding Reverse, in which the coordinates of each output point are a weighted sum of the eight vertices. Finally, an MLP is created to complete the point cloud by further refining the coarse point cloud.
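The gridding and Gridding Reverse steps can be illustrated with the minimal sketch below. It follows the trilinear formulation published for GRNet, which is an assumption here since the corresponding formulas are not reproduced in this text. It also assumes that the point cloud has been scaled so that grid vertices sit at integer coordinates from 0 to `resolution - 1`. The function names are hypothetical, and the 3D CNN that operates between the two steps is omitted.

```python
import math
from collections import defaultdict
from itertools import product


def gridding(points, resolution):
    """Map an unordered point cloud to per-vertex weights on a regular 3D grid."""
    sums = defaultdict(float)   # accumulated interpolation weights per vertex
    counts = defaultdict(int)   # number of neighbouring points per vertex
    for x, y, z in points:
        bx, by, bz = math.floor(x), math.floor(y), math.floor(z)
        for dx, dy, dz in product((0, 1), repeat=3):    # eight cell corners
            vx, vy, vz = bx + dx, by + dy, bz + dz
            if not all(0 <= c < resolution for c in (vx, vy, vz)):
                continue
            w = (1 - abs(vx - x)) * (1 - abs(vy - y)) * (1 - abs(vz - z))
            if w <= 0.0:        # point is not strictly inside this vertex's neighbourhood
                continue
            sums[(vx, vy, vz)] += w
            counts[(vx, vy, vz)] += 1
    # Vertex value: mean interpolation weight over its neighbouring points.
    return {v: sums[v] / counts[v] for v in sums}


def gridding_reverse(grid, resolution):
    """Emit one point per non-empty cell as the weighted sum of its eight vertices."""
    points = []
    for cell in product(range(resolution - 1), repeat=3):
        corners = [tuple(c + d for c, d in zip(cell, off))
                   for off in product((0, 1), repeat=3)]
        total = sum(grid.get(c, 0.0) for c in corners)
        if total == 0:
            continue
        points.append(tuple(
            sum(grid.get(c, 0.0) * c[k] for c in corners) / total
            for k in range(3)))
    return points


grid = gridding([(0.25, 0.0, 0.0), (0.75, 0.0, 0.0)], resolution=2)
restored = gridding_reverse(grid, resolution=2)
print(grid, restored)
```

In this toy case the two input points contribute equally to the vertices (0, 0, 0) and (1, 0, 0), and the reverse step returns a single point midway between them.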
The transparent object depth completion model is divided into two main parts: a point cloud completion (GRNet) module and a depth completion module. The first part preprocesses the depth map: the input original depth map is back-projected to 3D space and fed to the GRNet module, which predicts the complete point cloud in order to estimate the correct depth information. The second part inputs the more complete depth data converted from the point cloud into the depth completion module to further refine the depth information.
The depth completion module adopts an encoder-decoder architecture. The encoder uses EfficientNet as its backbone and constructs Conv-effect-Conv-Downsample (CECD) modules and Conv-effect-Conv (CEC) modules. The RGB and depth images are concatenated as the network input. In addition, to preserve the original depth information, the depth data is fed into a single-channel sampling module for feature extraction, and the depth features of the corresponding resolution are used as the input of each CECD block and each CEC block.
The decoder is a lightweight RefineNet decoder consisting mainly of a CRP module and a FUSE module, wherein the CRP module consists mainly of a 1x1 convolution layer and a 5x5 max pooling layer, and the FUSE module consists mainly of two 1x1 convolution layers and an up-sampling module. The feature information output by the CEC module is combined with the original depth information as the input of the decoder, and the complete depth information is recovered.
Step two, projecting the completed point cloud back to the depth image, a process that can be expressed as follows:
The noisy depth map generated after point cloud completion and the original RGB color image are then input into the depth completion module. As described above, this module extracts features with the EfficientNet-based CECD and CEC blocks and up-samples with the lightweight RefineNet decoder built from the CRP and FUSE modules, which improves the performance of the algorithm while preserving accuracy.
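The projection of the completed point cloud back to a depth image can be sketched as follows. This is a minimal illustration under assumptions not stated in the text: a pinhole model with principal point `cx`, `cy`, rounding to the nearest pixel, and keeping the nearest point when several points fall on the same pixel. The function name `project_to_depth` is hypothetical.

```python
def project_to_depth(points, fx, fy, cx, cy, height, width):
    """Rasterise camera-frame 3D points into a depth image (0 = no data)."""
    depth = [[0.0] * width for _ in range(height)]
    for x, y, z in points:
        if z <= 0:                         # points behind the camera are ignored
            continue
        u = round(fx * x / z + cx)         # pixel column (x')
        v = round(fy * y / z + cy)         # pixel row (y')
        if 0 <= u < width and 0 <= v < height:
            # Assumption: when points collide, the one closest to the camera wins.
            if depth[v][u] == 0.0 or z < depth[v][u]:
                depth[v][u] = z
    return depth


completed_cloud = [(0.25, -0.25, 1.0),     # lands on pixel (row 0, col 1)
                   (-0.5, 0.5, 2.0),       # lands on pixel (row 1, col 0)
                   (0.5, -0.5, 2.0)]       # same pixel as the first point, but farther
depth_image = project_to_depth(completed_cloud, fx=2.0, fy=2.0,
                               cx=0.5, cy=0.5, height=2, width=2)
print(depth_image)
```

The resulting image can still contain holes and quantisation noise, which is consistent with the text describing it as a noisy depth map that the depth completion module refines.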
The point cloud completion network is trained using the Gridding Loss, whose loss function can be expressed as:
where the loss is taken over the weight of each vertex in the grid. The loss function of the depth completion network can be expressed as:
where $\beta$ is a weight parameter, the predicted depth is compared with the true depth $D_{gt}$, and $D_h$ and $D_w$ are the gradient vectors of the depth map $D$ along the height and width coordinate axes, respectively.
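As a hedged illustration of the Gridding Loss, the sketch below uses the form published for GRNet, namely the mean absolute difference between predicted and ground-truth vertex weights. Whether the method uses exactly this form is an assumption, since the formula is not reproduced in this text. The function name and the flattened-list representation of the grid are hypothetical. The depth completion loss is not sketched because its terms cannot be recovered from the description alone.

```python
def gridding_loss(pred_weights, gt_weights):
    """Mean absolute difference between two equally sized lists of vertex weights."""
    if len(pred_weights) != len(gt_weights):
        raise ValueError("both grids must have the same number of vertices")
    total = sum(abs(p - g) for p, g in zip(pred_weights, gt_weights))
    return total / len(pred_weights)


# Flattened weights of a toy grid with three vertices.
loss = gridding_loss([0.5, 0.0, 1.0], [0.0, 0.0, 0.5])
print(loss)
```

Because both point clouds are first converted to weights on the same regular grid, the comparison does not depend on the order of the points.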
Step three, a depth camera is used to acquire and align a color image and a depth image of the transparent test tube. The camera is fixed at the end of the mechanical arm in an "eye-in-hand" configuration, so hand-eye calibration is required to obtain the transformation between the camera coordinate system and the mechanical arm base coordinate system. With the calibration board kept in the camera's field of view, the mechanical arm is moved to several different poses to collect multiple motion samples, from which the transformation matrix is calculated. The model is trained jointly on three datasets: ClearGrasp, TODD and TransCG.
Step four, the depth completion model built in step one is ported to the ROS development platform, the network weight parameters saved in step two are loaded, and the test tube images acquired in real time are processed in ROS to obtain a complete depth image. The depth image is then mapped to coordinate points in 3D space, and the resulting coordinate points are published through the ROS message mechanism.
Step five, the mechanical arm subscribes to the topics published in step four to obtain the position information of the test tube, and calculates the grabbing position of the mechanical arm end effector in combination with the center coordinate system $(x_t, y_t, z_t)$ at the top of the test tube, to obtain the pose information of the end effector in Cartesian space. The specific calculation process is as follows:
a) The mechanical arm end effector is a PGI parallel electric gripper. The center point of the opening and closing of the clamping jaw is defined as the origin of the end coordinate system, the downward direction perpendicular to the clamping jaw plane is the Z-axis direction, the forward direction parallel to the grabbing direction of the clamping jaw is the X-axis direction, and the Y-axis direction is determined according to the right-hand rule.
b) The completed depth map is back-projected to 3D space again to obtain the complete point cloud. The transparent test tube is placed on a test tube rack, and the grabbing pose is located by traversing all points of the point cloud. The highest point at the top of the test tube is searched first:
For the vertex $p_{highest} = (x_h, y_h, z_h)$, all points within its neighborhood $N(p_{highest})$ that meet the following requirement are searched:
where $r_T$ denotes the radius of the test tube orifice. The center point $(x_c, y_c, z_c)$ of the neighborhood $N(p_{highest})$ is then taken as the spatial position at which the clamping jaw grips.
c) After the grabbing position of the test tube is obtained, the orientation of the clamping jaw at the grabbing point needs to be calculated. According to the coordinate system set in a), the positive direction of the x-axis is the direction in which the clamping jaw advances to grab. Because the z-axis of the camera coordinate system and the x-axis of the clamping jaw point in the same direction, the orientation of the clamping jaw coordinate system at the time of grabbing is equal to its orientation at the time the image was taken.
d) From the obtained test tube position in the camera coordinate system and the transformation between the camera coordinate system and the world coordinate system, the test tube coordinate position in the world coordinate system can be obtained, and the process can be expressed as follows:
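Steps b) and d) can be illustrated with the following minimal sketch. Several assumptions are made that the text does not confirm: the camera looks down at the rack, so the highest point is the one with the smallest depth; rim points are collected within one tube diameter horizontally and within a tolerance `height_tol` in depth; and the camera-to-world transformation is the product of a hypothetical base-to-end-effector pose `t_base_ee` and a hand-eye calibration result `t_ee_cam`, both given as 4x4 homogeneous matrices. All function and variable names are illustrative.

```python
import math


def find_grasp_point(points, tube_radius, height_tol):
    """Return the centre of the tube rim in the camera frame."""
    # Assumption: the camera looks down on the rack, so the highest point
    # of the tube is the one with the smallest depth z.
    xh, yh, zh = min(points, key=lambda p: p[2])
    # The highest point lies on the rim, so every other rim point is at most
    # one tube diameter away horizontally and at nearly the same depth.
    rim = [p for p in points
           if math.hypot(p[0] - xh, p[1] - yh) <= 2 * tube_radius
           and p[2] - zh <= height_tol]
    return tuple(sum(p[k] for p in rim) / len(rim) for k in range(3))


def matmul4(a, b):
    """Multiply two 4x4 homogeneous transforms."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transform_point(t, p):
    """Apply a 4x4 homogeneous transform to a 3D point."""
    return tuple(t[r][0] * p[0] + t[r][1] * p[1] + t[r][2] * p[2] + t[r][3]
                 for r in range(3))


def translation(tx, ty, tz):
    """Pure translation; real calibration results also contain a rotation."""
    return [[1.0, 0.0, 0.0, tx], [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz], [0.0, 0.0, 0.0, 1.0]]


# Four rim points of a tube of radius 0.5 centred at (1, 1), plus one rack point.
cloud = [(1.5, 1.0, 30.0), (0.5, 1.0, 30.0), (1.0, 1.5, 30.0),
         (1.0, 0.5, 30.0), (3.0, 3.0, 45.0)]
grasp_cam = find_grasp_point(cloud, tube_radius=0.5, height_tol=1.0)

t_base_ee = translation(0.0, 0.0, 50.0)    # hypothetical end-effector pose
t_ee_cam = translation(2.0, 0.0, 0.0)      # hypothetical hand-eye calibration result
t_base_cam = matmul4(t_base_ee, t_ee_cam)  # camera pose in the base (world) frame
grasp_world = transform_point(t_base_cam, grasp_cam)
print(grasp_cam, grasp_world)
```

Pure translations are used only to keep the arithmetic easy to follow. With an eye-in-hand camera the end-effector pose changes with every arm motion, so `t_base_ee` has to be read at the moment the image is captured.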
Step six, the trajectory information of the clamping jaw at the end of the mechanical arm is obtained from the pose information in the world coordinate system obtained in step five, in combination with a cubic B-spline interpolation algorithm. The cubic B-spline interpolation is defined as follows:
$$C_3(u) = \sum_{i} P_i N_{i,3}(u)$$
where $P_i$ are the control points of the spline curve, set here as the initial point and the grabbing point of the test tube, and $N_{i,3}$ is the basis function of the cubic spline curve, which can be obtained from the recurrence formula:
$$N_{i,0}(u) = \begin{cases} 1, & u_i \le u < u_{i+1} \\ 0, & \text{otherwise} \end{cases} \qquad N_{i,k}(u) = \frac{u - u_i}{u_{i+k} - u_i} N_{i,k-1}(u) + \frac{u_{i+k+1} - u}{u_{i+k+1} - u_{i+1}} N_{i+1,k-1}(u)$$
where $k$ is the degree of the curve and $u_i$ are the knots of the spline.
Finally, a time stamp, velocity and acceleration values are added to the obtained trajectory information to obtain a complete motion trajectory, which is converted into motion information in joint space through inverse kinematics and sent to the mechanical arm control module to complete the test tube grabbing task.
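The evaluation of a cubic B-spline through the recurrence for the basis functions can be sketched as below. The sketch uses the standard Cox-de Boor recursion with a clamped uniform knot vector on [0, 1]; the knot vector, the intermediate control points between the initial point and the grabbing point, and all function names are assumptions made for illustration.

```python
def basis(i, k, u, knots):
    """Cox-de Boor recursion for the B-spline basis function N_{i,k}(u)."""
    if k == 0:
        return 1.0 if knots[i] <= u < knots[i + 1] else 0.0
    left = right = 0.0
    d1 = knots[i + k] - knots[i]
    if d1 > 0:                             # skip terms with a zero-length knot span
        left = (u - knots[i]) / d1 * basis(i, k - 1, u, knots)
    d2 = knots[i + k + 1] - knots[i + 1]
    if d2 > 0:
        right = (knots[i + k + 1] - u) / d2 * basis(i + 1, k - 1, u, knots)
    return left + right


def clamped_knots(n_ctrl, degree=3):
    """Clamped uniform knot vector on [0, 1] (an assumed choice)."""
    inner = n_ctrl - degree - 1
    return ([0.0] * (degree + 1)
            + [j / (inner + 1) for j in range(1, inner + 1)]
            + [1.0] * (degree + 1))


def bspline_point(ctrl, u, degree=3):
    """Evaluate C(u) = sum_i P_i N_{i,degree}(u) for 3D control points."""
    if len(ctrl) < degree + 1:
        raise ValueError("a cubic B-spline needs at least four control points")
    knots = clamped_knots(len(ctrl), degree)
    u = min(max(u, 0.0), 1.0 - 1e-12)      # keep u inside the half-open last span
    return tuple(sum(basis(i, degree, u, knots) * p[d] for i, p in enumerate(ctrl))
                 for d in range(3))


# Initial point, two hypothetical intermediate waypoints, and the grabbing point.
control_points = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
trajectory = [bspline_point(control_points, s / 10) for s in range(11)]
print(trajectory[0], trajectory[5], trajectory[10])
```

With a clamped knot vector the curve starts at the first control point and ends at the last, which suits a trajectory that must begin at the initial pose and finish at the grabbing pose. Sampling `u` at fixed steps yields the waypoints to which time stamps, velocities and accelerations are then attached.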
According to the invention, an RGB-D sensor collects depth information and RGB information of the surface of the transparent test tube; the depth of the transparent test tube is completed by a depth completion model with an improved U-Net architecture; the grabbing pose of the test tube is then calculated from the complete depth information; and finally the grabbing trajectory of the mechanical arm is obtained by a trajectory planning algorithm. The method can accurately estimate the position of the transparent test tube and rapidly plan the corresponding grabbing trajectory to complete the grab.