CN116385518B

CN116385518B - A method for robotic arm to grasp test tubes based on depth completion of transparent objects

Info

Publication number: CN116385518B
Application number: CN202310085582.7A
Authority: CN
Inventors: 陈洪波; 顾赵键; 朱萍; 吕斌; 董哲康; 郑军科; 邢峰; 林冰晶
Original assignee: Zhejiang Fangyuan Detection Group Stock Co ltd
Current assignee: Zhejiang Fangyuan Detection Group Stock Co ltd
Priority date: 2023-01-12
Filing date: 2023-01-12
Publication date: 2026-03-17
Anticipated expiration: 2043-01-12
Also published as: CN116385518A

Abstract

This invention discloses a method for a robotic arm to grasp a test tube based on depth completion of a transparent object. The method includes the following steps: constructing a transparent depth completion model; projecting the completed point cloud back into a depth image; acquiring and aligning the color and depth images of the transparent test tube using a depth camera; porting the depth completion model to the ROS development platform and loading network weight parameters to process the test tube image acquired in real time in ROS to obtain a complete depth image; obtaining the position information of the test tube using the robotic arm, and calculating the grasping position of the end effector based on the center coordinate system (x_t, y_t, z_t) at the top of the test tube, thus obtaining the pose information of the end effector in Cartesian space; and obtaining the trajectory information of the robotic arm's end gripper based on the obtained pose information in the world coordinate system using a cubic B-spline interpolation algorithm. This method can accurately estimate the position information of the transparent test tube and quickly plan the corresponding grasping trajectory to complete the grasping process.

Description

Method for grabbing test tube by mechanical arm based on transparent object depth completion

Technical Field

The invention belongs to the technical field of depth camera applications, and particularly relates to a method for a mechanical arm to grasp test tubes based on depth completion of transparent objects.

Background

In chemical analysis laboratories, robots are used mainly as mechanical arms that grip and move objects. Complex and changeable conditions, such as varied laboratory instruments and equipment and shifting placement positions, seriously reduce the grasp success rate of the mechanical arm. Test tubes are common in every chemical laboratory, and many experimental steps require them. However, a test tube is a transparent object with unique visual characteristics, which make it difficult for a general-purpose RGB-D camera to capture its complete depth information.

The unique visual characteristics of transparent objects, such as refraction and reflection, prevent a conventional RGB-D camera from acquiring accurate depth information. The depth a camera reports for a transparent object often corresponds to the background surface seen through it, or is missing altogether because of specular highlights. An automatic grasping robot in a chemical laboratory generally locates the grasp position by analyzing depth information of object surfaces. The visual characteristics of a transparent test tube make it difficult for traditional algorithms to obtain usable depth values, so such algorithms lack flexibility and stability. With the continued application of deep learning to image processing, the depth of a transparent object can be completed and complex, changeable laboratory environments can be handled. Processing transparent test tube data with a deep learning method can therefore greatly increase the grasp success rate of the mechanical arm.

Disclosure of Invention

To remedy the defects of the prior art, the invention provides a method for a mechanical arm to grasp test tubes based on depth completion of transparent objects. A depth camera collects depth and color information of the test tube surface in real time. A depth completion network processes the transparent test tube data to obtain a complete depth map, and the position of the test tube is derived from the resulting depth data. The grasping pose of the test tube is then calculated according to the principle of 3D spatial rotation, the end trajectory of the mechanical arm is planned by an interpolation algorithm, motion information in joint space is obtained through inverse kinematics, and the mechanical arm is finally controlled to grasp the test tube.

The method for a mechanical arm to grasp test tubes based on depth completion of transparent objects is characterized by comprising the following steps:

(1) Constructing a transparent depth completion model;

(2) Projecting the completed point cloud back to the depth image to be used as an input of a depth completion module for training;

(3) Acquiring and aligning a color image and a depth image of a transparent test tube using a depth camera, the depth camera being fixed at the end of the mechanical arm;

(4) Porting the depth completion model constructed in step (1) to the ROS development platform, loading the network weight parameters stored in step (2), and processing the test tube image acquired in real time in ROS to obtain a complete depth image;

(5) The mechanical arm subscribes to the topics published in step (3) to obtain position information of the test tube, and calculates the grasping position of the mechanical arm end effector by combining the central coordinate system (x_t, y_t, z_t) at the top of the test tube, to obtain pose information of the mechanical arm end effector in Cartesian space;

(6) Obtaining trajectory information of the end clamping jaw of the mechanical arm from the pose information in the world coordinate system obtained in step (4), combined with a cubic B-spline interpolation algorithm.

Further, the transparent depth completion model comprises a point cloud completion module and a depth completion module, wherein the point cloud completion module preprocesses the depth map, and the depth completion module takes as input the more complete depth data converted from the point cloud and further refines the depth information.

Further, the step (1) includes:

(11) Constructing a point cloud completion module, back-projecting the depth of the transparent object to a point cloud, and estimating a correct depth image by predicting the shape of the complete point cloud;

(12) Taking the projected sparse point cloud as the input of the point cloud completion module;

(121) For the unordered sparse point cloud in step (12), constructing a 3D grid to convert the unordered point cloud information into regular data capable of representing the local information and structure of the point cloud;

(122) Defining a grid G = <V, W>, wherein V and W represent the vertex set and the value set of the grid, respectively;

(123) Determining the coordinate position of each point of the original point cloud, and calculating the weight corresponding to each vertex;

(124) Learning the feature data required to complete the point cloud using a 3D CNN encoder-decoder structure, and converting the grid into an unordered point cloud through Gridding Reverse;

(125) Completing the point cloud by constructing an MLP, and further refining the coarse point cloud.

Further, in step (11), the equation mapping a pixel of the depth image to 3D space coordinates is:

where (x, y, z) is a three-dimensional coordinate point in space, (x', y') is the pixel coordinate in the depth map, D is the depth value, and f_x, f_y are camera intrinsic parameters;

in step (122), v_i denotes each vertex generated in the grid and w_i denotes the weight of each vertex;

in step (123), for the original point cloud, if the coordinate position of a point p = (x, y, z) in the point cloud satisfies:

then the point is taken as a point in the neighborhood of vertex v_i, and the weight corresponding to each vertex can be expressed as:
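The equations referenced for steps (11), (122) and (123) are not reproduced in this text. As a hedged reconstruction only, a standard pinhole back-projection together with the gridding formulation from the GRNet paper (which the detailed description names as its inspiration) would read as below. The principal point (c_x, c_y), the grid resolution N, and the vertex notation v_i = (x_i^v, y_i^v, z_i^v) are assumptions that do not appear in the text, so the patent's own equations may differ.

```latex
% Pinhole back-projection of pixel (x', y') with depth D; c_x, c_y are assumed
x = \frac{(x' - c_x)\, D}{f_x}, \qquad
y = \frac{(y' - c_y)\, D}{f_y}, \qquad
z = D

% Grid definition G = <V, W> (GRNet notation, assumed)
V = \{ v_i \}_{i=1}^{N^3}, \qquad
W = \{ w_i \}_{i=1}^{N^3}, \qquad
v_i = (x_i^v,\, y_i^v,\, z_i^v)

% A point p = (x, y, z) lies in the neighborhood N(v_i) if
x_i^v - 1 < x < x_i^v + 1, \quad
y_i^v - 1 < y < y_i^v + 1, \quad
z_i^v - 1 < z < z_i^v + 1

% Vertex weight: mean trilinear weight over the neighborhood
w_i = \sum_{p \in \mathcal{N}(v_i)} \frac{w(v_i, p)}{|\mathcal{N}(v_i)|}, \qquad
w(v_i, p) = (1 - |x_i^v - x|)\,(1 - |y_i^v - y|)\,(1 - |z_i^v - z|)
```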

Furthermore, the depth completion module adopts an encoder-decoder architecture. The encoder uses EfficientNet as its backbone and constructs a CECD module and a CEC module. The RGB and depth images are concatenated as the network input, the depth data is also fed into a single-channel sampling module for feature extraction, and depth features of the corresponding resolution are used as the input of each CECD block and each CEC block;

the decoder is a lightweight RefineNet decoder consisting mainly of a CRP module and a FUSE module, wherein the CRP module consists mainly of a 1x1 convolution layer and a 5x5 max-pooling layer, and the FUSE module consists mainly of two 1x1 convolution layers and an up-sampling module. The feature information output by the CEC module is combined with the original depth information as the input of the decoder, which recovers the complete depth information.

Further, step (2) includes:

(21) Expressing the process of projecting the point cloud back to the depth image as a projection equation;

(22) Inputting the noisy depth map generated after point cloud completion, together with the original RGB color image, into the depth completion module;

(23) Training the point cloud completion network using the Gridding Loss, whose loss function can be expressed as:

where the weight term denotes the weight of each vertex in the grid;

the loss function of the depth completion network can be expressed as:

where β is a weight parameter, the two depth terms denote the predicted depth and the true depth D^gt, respectively, and D_h and D_w are the gradient vectors of the depth map D along the width and height coordinate axes, respectively.

Further, in step (3), an eye-in-hand calibration mode is adopted;

ClearGrasp, TODD and TransCG are used as training datasets.

Further, in step (4), after the complete depth image is acquired, the depth image is mapped to coordinate points in 3D space, and the resulting coordinate points are published through the ROS message mechanism.

Further, step (5) includes:

(51) Defining the center point about which the clamping jaw opens and closes as the origin of the end coordinate system, taking the direction perpendicular to the clamping jaw plane and pointing downward as the Z-axis direction, taking the direction parallel to the grasping direction of the clamping jaw and pointing forward as the X-axis direction, and determining the Y-axis direction according to the right-hand rule;

(52) Back-projecting the completed depth map to 3D space again to obtain the complete point cloud. The transparent test tube is placed on a test tube rack, and the grasping pose is located by traversing all points of the cloud. The highest point at the top of the test tube is searched first:

For the vertex p_highest = (x_h, y_h, z_h), all points within its neighborhood N(p_highest) that meet the following requirement are searched:

where r_T denotes the radius of the tube orifice; the center point (x_c, y_c, z_c) of the neighborhood N(p_highest) is then taken as the spatial position at which the clamping jaw grasps;

(53) After the grasping position of the test tube is obtained, the posture of the clamping jaw at the grasping point is calculated. According to the coordinate system set in (51), the positive direction of the X axis is the advancing direction of the clamping jaw when grasping, and the posture of the clamping jaw coordinate system during grasping is the posture at the time the image was acquired;

(54) According to the obtained test tube position in the camera coordinate system and the transformation between the camera coordinate system and the world coordinate system, the test tube coordinate position in the world coordinate system is obtained, and the process can be expressed as follows:

Further, in step (6), the cubic B-spline interpolation is defined as:

C_3(u) = ∑_i P_i N_{i,3}(u)

where P_i are the control points of the spline curve, set as the initial point and the grasping point of the test tube, and N_{i,3} is the basis function of the cubic spline curve, which can be computed by a recurrence formula:

where k is the degree of the curve;

a time stamp, velocity and acceleration values are added to the obtained trajectory information to obtain a complete motion trajectory, which is converted into motion information in joint space through inverse kinematics and sent to the mechanical arm control module to complete the test tube grasping task.
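The recurrence for the basis function is not reproduced in this text. The standard recurrence for B-spline basis functions is the Cox-de Boor formula, shown here as a likely reading rather than a confirmed transcription. The knot vector symbols u_i are an assumption, since the text does not name them, and terms with a zero denominator are conventionally taken as zero.

```latex
% Degree-0 basis: indicator of the knot span [u_i, u_{i+1})
N_{i,0}(u) =
\begin{cases}
1, & u_i \le u < u_{i+1} \\
0, & \text{otherwise}
\end{cases}

% Degree-k basis from two degree-(k-1) bases
N_{i,k}(u) = \frac{u - u_i}{u_{i+k} - u_i}\, N_{i,k-1}(u)
           + \frac{u_{i+k+1} - u}{u_{i+k+1} - u_{i+1}}\, N_{i+1,k-1}(u)
```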

Compared with the prior art, the invention has the following advantages:

According to the invention, an RGB-D sensor collects depth information and RGB information of the transparent test tube surface, the depth of the transparent test tube is completed by a depth completion model with an improved U-Net architecture, the grasping pose of the test tube is then calculated from the complete depth information, and the grasping trajectory of the mechanical arm is finally obtained by a trajectory planning algorithm. The method can accurately estimate the position of the transparent test tube and rapidly plan the corresponding grasping trajectory to complete the grasp.

Drawings

FIG. 1 is a schematic diagram of the transparent depth completion model.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

As shown in FIG. 1, a method for a mechanical arm to grasp test tubes based on depth completion of transparent objects includes:

Step one, constructing the transparent depth completion model shown in the figure. A point cloud completion module is constructed first: the depth of the transparent object is back-projected to a point cloud, and a correct depth image is estimated by predicting the shape of the complete point cloud. The process of converting depth map pixels into 3D spatial coordinates can be expressed as:

where (x, y, z) is a three-dimensional coordinate point in space, (x', y') is the pixel coordinate in the depth map, D is the depth value, and f_x, f_y are camera intrinsic parameters.
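The back-projection described here can be sketched in a few lines of Python. This is a minimal illustration under the standard pinhole camera model, not the patent's own implementation: the principal point `cx`, `cy` is an assumption (the text only names f_x and f_y), and the function name `backproject` and the sample values are hypothetical.

```python
def backproject(depth, fx, fy, cx, cy):
    """Convert a depth image (list of rows) into a list of 3D points.

    Assumes a pinhole model; pixels with depth <= 0 are treated as
    missing readings and skipped.
    """
    points = []
    for v, row in enumerate(depth):      # v is the pixel row (y')
        for u, d in enumerate(row):      # u is the pixel column (x')
            if d <= 0:
                continue
            x = (u - cx) * d / fx
            y = (v - cy) * d / fy
            points.append((x, y, d))     # z equals the depth value D
    return points


# Hypothetical 2x2 depth image with two valid pixels.
depth = [[0.0, 1.0],
         [2.0, 0.0]]
cloud = backproject(depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

Missing depth on transparent surfaces is exactly why the resulting cloud is sparse, which is what the point cloud completion module is meant to repair.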

The projected sparse point cloud is used as the input of the point cloud completion module. Inspired by the point cloud processing module proposed in GRNet, a 3D grid is first constructed for the input sparse, unordered point cloud to convert the unordered point cloud information into regular data capable of representing the local information and structure of the point cloud. Define a grid G = <V, W>, where V and W represent the vertex set and the value set of the grid, respectively, v_i denotes each vertex generated in the grid, and w_i denotes the weight of each vertex. For the original point cloud, if the coordinate position of a point p = (x, y, z) in the point cloud satisfies:

then the point is taken as a point in the neighborhood of vertex v_i. The weight corresponding to each vertex can be expressed as:

A 3D CNN encoder-decoder structure is then used to learn the feature data needed to complete the point cloud, and the grid is converted into an unordered point cloud by Gridding Reverse, in which the coordinates of each output point are a weighted sum of the eight vertices of a grid cell. Finally, the point cloud is completed by constructing an MLP, which further refines the coarse point cloud.
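The gridding step can be illustrated with a short sketch. It follows the formulation published for GRNet (each vertex weight is the mean trilinear weight of the points in its neighborhood), which is an assumption here because the patent's own equations are missing from this text. The function name `gridding` and the requirement that points are pre-scaled to grid units are likewise hypothetical.

```python
import math


def gridding(points, n):
    """Return {vertex: weight} for an n x n x n grid with integer vertices.

    Assumes the points are already scaled into grid coordinates.
    A point contributes to a vertex only if it lies strictly within one
    grid unit of it on every axis.
    """
    sums, counts = {}, {}
    for x, y, z in points:
        x0, y0, z0 = math.floor(x), math.floor(y), math.floor(z)
        # Visit the eight corners of the cell that contains the point.
        for vx in (x0, x0 + 1):
            for vy in (y0, y0 + 1):
                for vz in (z0, z0 + 1):
                    if not (0 <= vx < n and 0 <= vy < n and 0 <= vz < n):
                        continue
                    dx, dy, dz = abs(vx - x), abs(vy - y), abs(vz - z)
                    if dx >= 1 or dy >= 1 or dz >= 1:
                        continue  # outside the open neighborhood
                    key = (vx, vy, vz)
                    sums[key] = sums.get(key, 0.0) + (1 - dx) * (1 - dy) * (1 - dz)
                    counts[key] = counts.get(key, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}


# A point at the center of a cell spreads equally over its eight corners.
center_weights = gridding([(0.5, 0.5, 0.5)], n=2)
# A point on a grid edge only touches the two vertices of that edge.
edge_weights = gridding([(0.25, 0.0, 0.0)], n=2)
```

Because the weights are a differentiable function of the point coordinates, this representation lets a 3D CNN operate on regular data while gradients still flow back to the points.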

The transparent depth completion model is divided into two parts: a point cloud completion (GRNet) module and a depth completion module. The first part preprocesses the depth map: the input original depth map is back-projected to 3D space and used as the input of the GRNet module, which predicts the complete point cloud in order to estimate the correct depth information. The second part inputs the more complete depth data converted from the point cloud into the depth completion module to further refine the depth information.

The depth completion module adopts an encoder-decoder architecture. The encoder uses EfficientNet as its backbone and constructs Conv-Efficient-Conv-Downsample (CECD) modules and Conv-Efficient-Conv (CEC) modules. The RGB and depth images are concatenated as the network input. To preserve the original depth information, the depth data is also fed into a single-channel sampling module for feature extraction, and depth features of the corresponding resolution are used as the input of each CECD block and CEC block.

The decoder is a lightweight RefineNet decoder consisting mainly of a CRP module and a FUSE module. The CRP module consists mainly of a 1x1 convolution layer and a 5x5 max-pooling layer, and the FUSE module consists mainly of two 1x1 convolution layers and an up-sampling module. The feature information output by the CEC module is combined with the original depth information as the input of the decoder, which recovers the complete depth information.

The completed point cloud is projected back to the depth image; the process can be expressed as follows:

The noisy depth map generated after point cloud completion and the original RGB color image are then input into the depth completion module. As described above, the depth completion module uses an encoder-decoder architecture: EfficientNet serves as the backbone, custom Conv-Efficient-Conv-Downsample (CECD) and Conv-Efficient-Conv (CEC) blocks extract features, and a lightweight RefineNet decoder performs upsampling. The decoder consists mainly of a CRP module, made up of a 1x1 convolution layer and a 5x5 max-pooling layer, and a FUSE module, made up of two 1x1 convolution layers. This design improves the efficiency of the algorithm while maintaining accuracy.
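Projecting the completed point cloud back to a depth image is the inverse of the pinhole back-projection. The sketch below is one plausible way to do it, assuming a principal point `cx`, `cy` and a nearest-point rule when several points fall on the same pixel; neither detail is stated in the text, and `project_to_depth` is a hypothetical name.

```python
def project_to_depth(points, fx, fy, cx, cy, width, height):
    """Rasterize 3D points into a depth image (0.0 marks an empty pixel).

    Assumes a pinhole model. When several points land on one pixel,
    the closest one (smallest z) is kept.
    """
    depth = [[0.0] * width for _ in range(height)]
    for x, y, z in points:
        if z <= 0:
            continue  # behind the camera or invalid
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if 0 <= u < width and 0 <= v < height:
            if depth[v][u] == 0.0 or z < depth[v][u]:
                depth[v][u] = z
    return depth


# Hypothetical cloud: the third point shares a pixel with the first
# but lies farther away, so it is discarded.
cloud = [(0.25, -0.25, 1.0), (-0.5, 0.5, 2.0), (0.5, -0.5, 2.0)]
image = project_to_depth(cloud, fx=2.0, fy=2.0, cx=0.5, cy=0.5, width=2, height=2)
```

The resulting map is still noisy and may contain holes, which is why it is passed to the depth completion module together with the RGB image rather than used directly.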

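The point cloud completion network is trained using the Gridding Loss, whose loss function can be expressed as:

where the weight term denotes the weight of each vertex in the grid. The loss function of the depth completion network can be expressed as:

where β is a weight parameter, the two depth terms denote the predicted depth and the true depth D^gt, respectively, and D_h and D_w are the gradient vectors of the depth map D along the width and height coordinate axes, respectively.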
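Neither loss equation is reproduced in this text. For the Gridding Loss only, the definition published for GRNet is an L1 distance between the predicted and ground-truth grid weights, and it is given below as a plausible reading. The symbols W^pred, W^gt and the grid resolution N_G are GRNet's notation and are assumptions here. The depth completion loss is not reconstructed, because its exact form cannot be recovered from the surviving definitions.

```latex
% Gridding Loss (GRNet formulation, assumed): mean L1 distance over the N_G^3 grid vertices
\mathcal{L}_{\mathrm{Gridding}}\left(W^{pred}, W^{gt}\right)
  = \frac{1}{N_G^{3}} \sum \left\lVert W^{pred} - W^{gt} \right\rVert_{1}
```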

Step three, acquiring and aligning a color image and a depth image of the transparent test tube using a depth camera. The camera is fixed at the end of the mechanical arm, and an eye-in-hand calibration mode is adopted: hand-eye calibration is required to obtain the transformation between the camera coordinate system and the mechanical arm base coordinate system. With the calibration board kept within the camera's field of view, the mechanical arm is moved to several different poses to collect multiple motion samples, which are used to calculate the transformation matrix. The model is trained jointly on three datasets: ClearGrasp, TODD, and TransCG.

Step four, porting the depth completion model built in step one to the ROS development platform, loading the network weight parameters stored in step two, and processing the test tube image acquired in real time in ROS to obtain a complete depth image. The depth image is then mapped to coordinate points in 3D space, and the resulting coordinate points are published through the ROS message mechanism.

Step five, the mechanical arm subscribes to the topics published in step three to obtain the position information of the test tube, and calculates the grasping position of the mechanical arm end effector by combining the central coordinate system (x_t, y_t, z_t) at the top of the test tube, obtaining the pose information of the end effector in Cartesian space. The specific calculation process is as follows:

a) The mechanical arm end effector is a PGI parallel electric gripper. The center point about which the clamping jaw opens and closes is defined as the origin of the end coordinate system, the direction perpendicular to the clamping jaw plane and pointing downward is the Z-axis direction, the direction parallel to the grasping direction of the clamping jaw and pointing forward is the X-axis direction, and the Y-axis direction is determined according to the right-hand rule.

b) The completed depth map is back-projected to 3D space again to obtain the complete point cloud. The transparent test tube is placed on a test tube rack, and the grasping pose is located by traversing all points of the cloud. The highest point at the top of the test tube is searched first:

For the vertex p_highest = (x_h, y_h, z_h), all points within its neighborhood N(p_highest) that meet the following requirement are searched:

where r_T denotes the radius of the tube orifice. The center point (x_c, y_c, z_c) of the neighborhood N(p_highest) is then taken as the spatial position at which the clamping jaw grasps.
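The search for the grasp position can be sketched as follows. Two details are assumptions, because the corresponding equations are missing from this text: the cloud is taken to be in a frame where larger z means higher, and the neighborhood is taken to be all points within Euclidean distance 2·r_T of the highest point (one tube diameter, so that the whole rim is covered). The function name `grasp_point` and the sample values are hypothetical.

```python
import math


def grasp_point(cloud, r_t):
    """Return (highest point, center of its neighborhood).

    Assumes z points upward and that the neighborhood radius is one
    tube diameter (2 * r_t); both are illustrative choices.
    """
    p_highest = max(cloud, key=lambda p: p[2])
    neighbours = [p for p in cloud if math.dist(p, p_highest) <= 2 * r_t]
    n = len(neighbours)
    center = tuple(sum(p[i] for p in neighbours) / n for i in range(3))
    return p_highest, center


# Hypothetical cloud: four rim points of a tube of radius 1.0 at height 5.0,
# plus one unrelated point on the rack far from the tube.
cloud = [(1.0, 0.0, 5.0), (-1.0, 0.0, 5.0), (0.0, 1.0, 5.0),
         (0.0, -1.0, 5.0), (10.0, 10.0, 0.0)]
p_highest, center = grasp_point(cloud, r_t=1.0)
```

Averaging the rim points places the grasp target on the tube axis even though the highest point itself sits on the rim.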

c) After the grasping position of the test tube is obtained, the posture of the clamping jaw at the grasping point must be calculated. According to the coordinate system set in a), the positive direction of the X axis is the advancing direction of the clamping jaw when grasping. Because the Z axis of the camera coordinate system and the X axis of the clamping jaw point in the same direction, the posture of the clamping jaw coordinate system at the time of grasping equals its posture at the time the image was taken.

d) According to the obtained test tube position in the camera coordinate system and the transformation between the camera coordinate system and the world coordinate system, the test tube coordinate position in the world coordinate system can be obtained, and the process can be expressed as follows:
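For an eye-in-hand setup, a common way to express this conversion is to chain two homogeneous transforms: the end-effector pose in the base (world) frame, and the camera pose in the end-effector frame obtained from hand-eye calibration. The sketch below assumes that chain; the matrix names and numeric values are hypothetical, and the patent's own equation is not reproduced in this text.

```python
def matmul(a, b):
    """Multiply two 4x4 homogeneous transforms."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def apply(t, p):
    """Apply a 4x4 homogeneous transform to a 3D point."""
    v = (p[0], p[1], p[2], 1.0)
    return tuple(sum(t[i][k] * v[k] for k in range(4)) for i in range(3))


# Hypothetical end-effector pose in the base frame: rotated 180 degrees
# about X (tool pointing down), positioned at (0.5, 0.0, 1.0).
T_BASE_END = [[1.0, 0.0, 0.0, 0.5],
              [0.0, -1.0, 0.0, 0.0],
              [0.0, 0.0, -1.0, 1.0],
              [0.0, 0.0, 0.0, 1.0]]

# Hypothetical hand-eye result: camera offset 0.25 along the tool Z axis.
T_END_CAM = [[1.0, 0.0, 0.0, 0.0],
             [0.0, 1.0, 0.0, 0.0],
             [0.0, 0.0, 1.0, 0.25],
             [0.0, 0.0, 0.0, 1.0]]

p_cam = (0.0, 0.0, 0.25)                      # tube position seen by the camera
p_world = apply(matmul(T_BASE_END, T_END_CAM), p_cam)
```

Because the camera moves with the arm, `T_BASE_END` must be the arm pose at the moment the image was captured, while `T_END_CAM` stays fixed after calibration.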

Step six, obtaining the trajectory information of the end clamping jaw of the mechanical arm from the pose information in the world coordinate system obtained in step four, combined with a cubic B-spline interpolation algorithm. The cubic B-spline interpolation is defined as:

C_3(u) = ∑_i P_i N_{i,3}(u)

where P_i are the control points of the spline curve, here set as the initial point and the grasping point of the test tube, and N_{i,3} is the basis function of the cubic spline, which can be computed by a recurrence formula:

where k is the degree of the curve.
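A cubic B-spline point can be evaluated with the Cox-de Boor recurrence, which is the standard recurrence for B-spline basis functions and is assumed here to be the one the text refers to. In this sketch the clamped knot vector and the two intermediate control points are hypothetical choices made only to obtain a well-defined cubic between a start point and a grasp point.

```python
def basis(i, k, u, knots):
    """Cox-de Boor recurrence for the basis function N_{i,k}(u)."""
    if k == 0:
        return 1.0 if knots[i] <= u < knots[i + 1] else 0.0
    left = right = 0.0
    d1 = knots[i + k] - knots[i]
    if d1 > 0:  # terms with a zero denominator are taken as zero
        left = (u - knots[i]) / d1 * basis(i, k - 1, u, knots)
    d2 = knots[i + k + 1] - knots[i + 1]
    if d2 > 0:
        right = (knots[i + k + 1] - u) / d2 * basis(i + 1, k - 1, u, knots)
    return left + right


def bspline_point(ctrl, u, knots, k=3):
    """Evaluate C_k(u) = sum_i P_i * N_{i,k}(u) for u in [knots[k], knots[-k-1])."""
    dims = len(ctrl[0])
    return tuple(sum(basis(i, k, u, knots) * p[d] for i, p in enumerate(ctrl))
                 for d in range(dims))


# Hypothetical control points: start, two lift-over points, grasp point.
ctrl = [(0.0, 0.0, 0.0), (0.0, 0.0, 2.0), (2.0, 0.0, 2.0), (2.0, 0.0, 0.0)]
knots = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]  # clamped: curve starts at ctrl[0]

start = bspline_point(ctrl, 0.0, knots)
middle = bspline_point(ctrl, 0.5, knots)
```

The half-open interval in the degree-0 case means the final parameter value u = 1 must be handled separately (for example, by returning the last control point) when sampling the full trajectory.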

Finally, a time stamp, velocity and acceleration values are added to the obtained trajectory information to obtain a complete motion trajectory, which is converted into motion information in joint space through inverse kinematics and sent to the mechanical arm control module to complete the test tube grasping task.
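One simple way to attach time stamps, velocities and accelerations to sampled trajectory points is uniform time spacing with finite differences. The text does not say how these values are computed, so the scheme below, the function names, and the sample values are all illustrative assumptions.

```python
def finite_diff(values, dt):
    """Central differences in the interior, one-sided at both ends.

    Requires at least two samples.
    """
    n = len(values)
    out = []
    for i in range(n):
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        span = (hi - lo) * dt
        out.append(tuple((b - a) / span for a, b in zip(values[lo], values[hi])))
    return out


def time_parameterize(waypoints, dt):
    """Attach a time stamp, velocity and acceleration to each waypoint."""
    vel = finite_diff(waypoints, dt)
    acc = finite_diff(vel, dt)
    return [{"t": i * dt, "pos": p, "vel": v, "acc": a}
            for i, (p, v, a) in enumerate(zip(waypoints, vel, acc))]


# Hypothetical waypoints sampled from the spline, 0.5 s apart.
waypoints = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
trajectory = time_parameterize(waypoints, dt=0.5)
```

Each entry could then be passed to an inverse kinematics solver to obtain the corresponding joint-space set point.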

Claims

1. A method for a robotic arm to grasp test tubes based on depth completion of transparent objects, characterized by comprising the following steps:

(1) Construct a transparent depth completion model;

(2) The completed point cloud is projected back into the depth image and used as the input of the depth completion module for training. The depth completion module adopts an encoder-decoder architecture. The encoder part uses EfficientNet as the backbone and constructs CECD and CEC modules. The RGB and depth images are stitched together and used as the input of the network. The depth data is input into a single-channel sampling module for feature extraction, and the depth features of the corresponding resolution are used as the input of each CECD block and CEC block.

The decoder is a lightweight RefineNet decoder, mainly composed of the CRP module and the FUSE module. The CRP block mainly consists of a 1x1 convolutional layer and a 5x5 max pooling layer, while the FUSE block mainly consists of two 1x1 convolutional layers and an upsampling module. The feature information output by the CEC module is combined with the original depth information as the input of the decoder to recover the complete depth information.

(21) The process of projecting the point cloud back into the depth image is represented by a projection equation;

(22) Input the noisy depth map generated after point cloud completion and the original RGB color image into the depth completion module;

(23) The point cloud completion network is trained using Gridding Loss, and its loss function can be expressed as:

where the weight term represents the weight of each vertex in the grid; the loss function of the depth completion network can be expressed as:

where β is the weighting parameter, the two depth terms represent the predicted depth and the true depth D^{gt}, respectively, while D_h and D_w are the gradient vectors of the depth map D along the width and height axes, respectively.

(3) Use a depth camera to acquire color and depth images of the transparent test tube and align them. The depth camera is fixed at the end of the robotic arm.

(4) The depth completion model built in step (1) is ported to the ROS development platform, and the network weight parameters saved in step (2) are loaded to process the test tube image obtained in real time in ROS to obtain a complete depth image.

(5) The robotic arm subscribes to the topic published in step (3) to obtain the position information of the test tube. Combined with the center coordinate system (x_t, y_t, z_t) at the top of the test tube, the position of the robotic arm end effector for grasping is calculated, and the pose information of the robotic arm end effector in Cartesian space is obtained.

(6) Based on the pose information in the world coordinate system obtained in step (4), the trajectory information of the end effector gripper of the robotic arm is obtained by combining the cubic B-spline interpolation algorithm; the cubic B-spline interpolation formula is defined as follows:

C_3(u) = ∑_i P_i N_{i,3}(u)

where P_i is the control point of the spline curve, set as the initial point and the gripping point of the test tube, and N_{i,3} is the basis function of the cubic spline curve, whose equation can be solved by a recursion formula:

where k is the degree of the curve;

The obtained trajectory information is added with timestamps, velocity and acceleration values to obtain a complete motion trajectory, and then converted into motion information in joint space through inverse kinematics, which is sent to the robotic arm control module to complete the test tube grasping task.

2. The method for a robotic arm to grasp test tubes based on depth completion of a transparent object according to claim 1, characterized in that the transparent depth completion model includes a point cloud completion module and a depth completion module, the point cloud completion module is used for preprocessing of the depth map, and the depth completion module inputs the relatively complete depth data converted from the point cloud into the depth completion module to further refine the depth information.

3. The method for a robotic arm to grasp a test tube based on depth completion of a transparent object according to claim 2, characterized in that step (1) includes:

(11) Construct a point cloud completion module to back-project the depth of the transparent object onto the point cloud and estimate the correct depth image by predicting the complete point cloud shape;

(12) Use the projected sparse point cloud as the input to the point cloud completion module;

(121) For the unordered sparse point cloud in step (12), a 3D grid is constructed to convert the unordered point cloud information into regular data that can represent the local information and structure of the point cloud.

(122) Define a Grid G = <V, W>, where V and W represent the vertex set and value set of the Grid, respectively;

(123) Determine the coordinate position of the original point cloud and calculate the weight of each vertex;

(124) Use a 3D CNN encoder-decoder architecture to learn the feature data required to complete the point cloud, and convert the grid into an unordered point cloud through Gridding Reverse;

(125) The point cloud is completed by creating an MLP and the coarse point cloud is further refined.

4. The method for a robotic arm to grasp a test tube based on depth completion of a transparent object according to claim 3, characterized in that, in step (11), the representation equation of the pixel points of the depth image in 3D space coordinates is:

where (x, y, z) is a three-dimensional coordinate point in space, (x', y') is the coordinate of a pixel in the depth map, D is the depth value, and f_x and f_y are camera intrinsic parameters;

in step (122), v_i represents each vertex generated in the grid, and w_i represents the weight of each vertex;

in step (123), for the original point cloud, if the coordinates of a point p = (x, y, z) in the point cloud satisfy:

then this point is taken as a point in the neighborhood N(v_i) of vertex v_i, and the weight corresponding to each vertex is represented as:

5. The method for a robotic arm to grasp test tubes based on depth completion of transparent objects according to claim 1, characterized in that, in step (3), the eye-in-hand calibration method is adopted;

ClearGrasp, TODD, and TransCG are used as training datasets.

6. The method for a robotic arm to grasp test tubes based on depth completion of transparent objects according to claim 1, characterized in that, in step (4), after the complete depth image is acquired, the depth image is mapped to coordinate points in 3D space, and the finally obtained coordinate points are published through the ROS message mechanism.

7. The method for a robotic arm to grasp a test tube based on depth completion of a transparent object according to claim 1, characterized in that step (5) includes:

(51) The center point of the opening and closing of the gripper is defined as the origin of the end coordinate system. The direction of the Z-axis is perpendicular to the gripper plane and downward. The direction of the X-axis is parallel to the gripper gripping direction and forward. The direction of the Y-axis is determined according to the right-hand rule.

(52) The completed depth map is then back-projected into 3D space to obtain the complete point cloud. The transparent test tubes are placed on a test tube rack. The grasping pose is determined by traversing all points of the cloud, starting with a search for the highest point at the top of the test tube:

For the vertex p_highest = (x_h, y_h, z_h), search its neighborhood N(p_highest) for all points that satisfy the following requirement:

where r_T represents the radius of the test tube opening; the center point (x_c, y_c, z_c) of the neighborhood N(p_highest) is then taken as the spatial position for the gripper to grasp;

(53) After obtaining the gripping position of the test tube, calculate the posture of the gripping point. According to the coordinate system setting in (51), the positive direction of the x-axis is the forward direction of the gripping, and the posture of the gripping coordinate system during gripping is the posture when the image is acquired.

(54) Based on the obtained test tube position information in the camera coordinate system and the transformation relationship between the camera coordinate system and the world coordinate system, the test tube coordinate position in the world coordinate system is obtained. The process can be expressed as follows: