Abstract—This paper proposes a robot arm gripping pose recognition system with a human-like fetching strategy for arbitrary objects in complex environments. The system is divided into an arbitrary object type classification system, a gripping area identification system, and a gripping posture generation system. The object type classification mimics human visual perception to classify different objects. Deep learning matting extracts the object from any background environment, and morphological image processing preserves the complete outline of the object. The gripping area identification takes this complete outline, identifies different gripping positions according to the object type classifier, and generates planar gripping area coordinates; these coordinates are then converted to robot arm coordinates to generate the gripping posture and sent to the robot arm for the actual gripping action.

This paper uses the system to conduct coordinate conversion and positioning experiments, verifying the error relationship between the coordinate conversion and the object gripping position to determine whether the object coordinates can be handed to the arm for gripping tasks, and supports the reliability of this work with the complete operation of the robot arm.

Index Terms—object type classifier, deep learning foreground and background separation, coordinate transformation, gripping pose detection, object gripping.

I. INTRODUCTION

A. Motivation and Purpose

In recent years, with the vigorous development of Industry 4.0, factory automation has become a widely studied topic[1]. Robots are widely used in factories and home environments and occupy a significant position in daily life. In industrial production, robots handle factory automation tasks such as packaging, distribution, and sorting[2]. Most traditional industrial robot grasping systems target a structured operating environment and rely on obtaining all relevant information about the grasped object in advance, such as its shape, color, posture, position, and grasping scene[3]. Such a single structured system lacks flexibility and robustness, and its scope of application is quite limited. Grasping tasks are mainly divided into three directions: object detection, grasp planning, and robot control[4]. Detecting objects and generating grasping poses is the primary key to the success of robot grasping tasks; accurate grasping poses help to plan grasping paths and realize complete grasping tasks.

Robotic grasping detection is of great significance to intelligent manufacturing and factory automation, and it is also a challenging and still-developing technology. Early grasp detection methods assume that the object is placed in a clean, simple environment and perform the grasping task simply by locating the object at a fixed placement[4]. The grasping task becomes more complicated when the object is placed in a complex environment or in an arbitrary pose. In recent years, with the vigorous development of visual imaging technology, deep learning methods have been applied to robot vision through a variety of visual perception techniques[6]. Since humans cannot analyze images quickly and efficiently by eye alone, computer-aided image processing is used to identify environmental objects. With visual imaging technology, grasping detection can lock onto grasp targets faster while obtaining good grasp positions and postures[5].

The central aim of this paper is to mimic, on a humanoid robot, the human strategy for finding the gripping pose and position of any object in a complex background environment. By classifying the object type[7] and applying deep learning matting[8], the complete contour of any object is obtained as essential data for determining the grasping position. The gripping position is derived from this contour, and the gripping point, gripping position, and gripping parameters are obtained from the shape classifier assigned to each object type. After the parameters are obtained, they are converted into arm coordinates through the relationship between pixel coordinates, world coordinates, and arm coordinates[9]. The system is verified with validation experiments on an actual arm.

B. Related Works

This paper first investigates robot grasping detection technology. As grasping detection has developed and evolved, the grasping task has come to mean that the robot obtains the position of the target object and performs the grasping action.

Traditional detection methods require humans to analyze the geometric structure of the detected object directly, form a task-specific algorithm, determine the appropriate grasping point according to the object's shape and size, and match an appropriate gripper to perform the gripping task[10]. However, this approach requires a great deal of computation and analysis time, and its applicability scales poorly. In the past five years, the application of deep learning to robot vision has made significant progress.
Data augmentation simulates environmental changes without changing the object characteristics during training. Fig. 5. and Fig. 6. show the results of data augmentation and deep image matting for the original photos, respectively. The data set is divided in the number ratio of 8:2:2, and Tab. 1. shows the results after training.

Tab. 1. YOLOv4 object detection accuracy after training with the original data, the augmented data, and the background-replacement data. Compared with the original data, training with the augmented and background-replacement data sets increased mAP by 32.3% and 36.8%, respectively.

Method                    Backbone       Data Improvement (%)   mAP (%)   IOU (%)
YOLOv4 w/o augmentation   CSPDarknet53   n/a                    60        63.8
$F = \max(\min(F_R + I, 1), 0)$  (3)
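For context around (3): with an estimated alpha matte, background replacement for the training set can be sketched as standard alpha compositing, I = αF + (1 − α)B, clamped to the valid range. This is a generic illustration of the matting-based background replacement named in the text, not the authors' pipeline; the file names and the assumption that all three images share one size are placeholders.

```python
import cv2
import numpy as np

# Foreground image, alpha matte from deep image matting, and a new
# background ("fg.png", "alpha.png", "bg.png" are placeholder names;
# all three are assumed to have the same width and height).
fg = cv2.imread("fg.png").astype(np.float32) / 255.0
bg = cv2.imread("bg.png").astype(np.float32) / 255.0
alpha = cv2.imread("alpha.png", cv2.IMREAD_GRAYSCALE).astype(np.float32) / 255.0
alpha = alpha[..., None]  # broadcast the matte over the color channels

# Standard compositing, clamped to [0, 1] in the spirit of the clipping in (3).
composite = np.clip(alpha * fg + (1.0 - alpha) * bg, 0.0, 1.0)
cv2.imwrite("composited.png", (composite * 255).astype(np.uint8))
```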
◼ Open operation

The opening operation is composed of erosion followed by dilation. The erosion first filters out the smaller noise, and the dilation then smooths the edge contour. Hence, the opening operation aims to smooth the contour and eliminate thin connections along the edges, as shown in (6).

$A \circ B = (A \ominus B) \oplus B$  (6)

◼ Closed operation

The closing operation first dilates the set A by the structuring element B and then erodes it. The dilation fuses the smaller noise in the image, while the larger noise that dilation cannot remove is eliminated by the subsequent erosion. The purpose of the operation is to smooth the contour and fill small gaps in it, as shown in (7).

$A \bullet B = (A \oplus B) \ominus B$  (7)

The object is first assigned to a type by the type classifier, and its foreground is then obtained by deep image matting. The complete contour of the object is generated after binarization, erosion, and dilation. The result is presented in Fig. 9.
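The opening and closing in (6) and (7) map directly onto OpenCV's morphology primitives. The following is a minimal sketch, not the authors' code; the kernel size and file names are assumed placeholders.

```python
import cv2

# Load the binarized matting result (white object on black background);
# "mask.png" is a placeholder file name.
mask = cv2.imread("mask.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(mask, 127, 255, cv2.THRESH_BINARY)

# Structuring element B; a 5x5 ellipse is an assumed choice.
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))

# Opening (6): erosion then dilation, removes small noise and thin edge links.
opened = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)

# Closing (7): dilation then erosion, fills small gaps in the contour.
closed = cv2.morphologyEx(opened, cv2.MORPH_CLOSE, kernel)

cv2.imwrite("contour_cleaned.png", closed)
```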
◼ Long Object

Fig. 10. Long object flowchart

In order to provide a good gripping position for the robot arm, we take the center of the object as the best gripping position for a long object and use the LSD (Line Segment Detection) method to find the intersection of the two longest straight lines on the edge of the long object as the gripping center point coordinates; the process architecture of the long object is shown in Fig. 10. To detect straight lines in an image, the most basic method is to detect pixels with significant gradient changes. LSD analyzes the regional gradient changes of the image and clusters pixels with similar gradients as line candidates. After the line candidates are verified by hypothesis testing, the line pixel set and the error control set are merged to obtain the final accurate line set. We extract the intersection of the two longest lines in this set as the coordinates of the gripping center point, compute the slopes of the two line segments, convert them into angles against the horizontal line, and take the average of the two angles as the gripper rotation angle. A rectangular gripping area is then generated from the gripping center coordinates and the rotation angle and displayed on the original picture. The plane parameters of the gripping area of the long object are obtained by the above method.
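This step can be sketched with OpenCV's LSD implementation, as below. This is an assumed reconstruction rather than the paper's code: it presumes an OpenCV build that ships `cv2.createLineSegmentDetector`, and the intersection formula assumes the two longest segments are not parallel.

```python
import cv2
import numpy as np

def long_object_grip(gray):
    """Grip center and gripper angle for a long object via LSD.

    A sketch of the method described above, assuming the two longest
    detected segments bound the object and are not parallel.
    """
    lsd = cv2.createLineSegmentDetector()
    lines = lsd.detect(gray)[0]           # shape (N, 1, 4): x1, y1, x2, y2
    segs = lines.reshape(-1, 4)

    # Keep the two longest segments.
    lengths = np.hypot(segs[:, 2] - segs[:, 0], segs[:, 3] - segs[:, 1])
    (x1, y1, x2, y2), (x3, y3, x4, y4) = segs[np.argsort(lengths)[-2:]]

    # Intersection of the two infinite lines (homogeneous cross product).
    l1 = np.cross([x1, y1, 1.0], [x2, y2, 1.0])
    l2 = np.cross([x3, y3, 1.0], [x4, y4, 1.0])
    px, py, pw = np.cross(l1, l2)
    center = (px / pw, py / pw)

    # Average of the two segment angles against the horizontal axis.
    a1 = np.degrees(np.arctan2(y2 - y1, x2 - x1))
    a2 = np.degrees(np.arctan2(y4 - y3, x4 - x3))
    return center, (a1 + a2) / 2.0
```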
◼ Circle Object

Fig. 11. Circle object flowchart

The circle object also takes the center of the object as the best gripping area; Fig. 11. shows the process structure of the circle object. We divide circle objects into two shapes, namely the hollow shape and the solid shape. The contour map of the object after secondary processing uses Canny edge detection to find the inner contour at the center of the object, generates a circumscribed rectangle from the inner contour, and uses the center of this rectangle as the gripping coordinate. Because the center of the contour map can be incomplete after the secondary processing, a category is added to the type classifier to determine whether the circle object is a hollow shape, serving as a secondary verification; the center of the circumscribed rectangle is again used as the gripping coordinate. The width of the gripper used in this paper is limited to 80 mm, so the actual grasped object is limited to this width. The gripping rotation angle is obtained by fitting a straight line to the object and taking the angle between the fitted line and the horizontal axis as the rotation angle of the gripper.
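A minimal sketch of the circle-object step follows: Canny edges, the inner contour, and the circumscribed rectangle center. The Canny thresholds and the smallest-area heuristic for picking the inner contour are assumptions, not the paper's parameters.

```python
import cv2

def circle_object_grip(binary):
    """Grip coordinate for a hollow circle object.

    Sketch: Canny finds edges, an inner contour is selected (here,
    assumed to be the smallest-area contour), and the center of its
    circumscribed rotated rectangle is the gripping coordinate.
    """
    edges = cv2.Canny(binary, 50, 150)    # thresholds are assumed values
    contours, _ = cv2.findContours(edges, cv2.RETR_CCOMP,
                                   cv2.CHAIN_APPROX_SIMPLE)
    inner = min(contours, key=cv2.contourArea)

    (cx, cy), (w, h), angle = cv2.minAreaRect(inner)  # circumscribed rectangle
    return (cx, cy), angle
```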
◼ Columnar Object

Fig. 12. Columnar object flowchart

The columnar object is divided into the general cylindrical bottle and the mug; the flowchart of the columnar object is shown in Fig. 12. The mug handle is classified by the type classifier. For the general cylindrical bottle, the LSD line detection method finds the intersection of the two longest line segments on the edge of the cup as the gripping coordinate, and the angle between the fitted line and the horizontal axis serves as the basis for the rotation angle. For the mug, since the frame selection coordinates of the grip area are known from the classifier, we set the optimal gripping position at the center of the mug handle based on the frame selection position coordinates. Differently from the bottle, we calculate the rotation angle between the center point of the mug body and the center point of the grip, use this angle to determine the gripping area and generate a gripping rectangle, and use the center point of the rectangle as the gripping center coordinate, as shown in Fig. 13. The rotation angle of the jaws for the mug is 90 degrees from the horizontal axis.

Fig. 13. Take the center line of the object's bounding box and the grip as the primary connection line, project the object to the center of the coordinates, divide the coordinates into several areas, and calculate the angle of each area according to the center connection line and the horizontal axis.

◼ Blade Object

Fig. 14. Blade object flowchart

We divide the blade object into a grasp shape and a blade shape, and in the type classifier we further divide the grasp into grasp and round grasp. The architecture flowchart of the blade object is shown in Fig. 14. For the grasp shape, we select the center between the object's grasp and blade as the gripping coordinate, while for the round grasp we select the gripping position closest to the camera as the gripping coordinate. Since a blade object can be placed in an arbitrary position, we take the center between the blade and the grasp as the rotation center, as shown in Fig. 15. The placement is divided into four quadrants; the slope of the object's tangent in each quadrant is converted into the rotation angle between the object and the horizontal line, which serves as the basis for the jaw angle.

Fig. 15. The rotation angle is based on the angle between the center points of the blade and grasp bounding boxes and the horizontal axis. The object is projected to the center of the coordinates, the four quadrants are divided into three angle zones, and the calculated angle positions are compared.
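The mug and blade rotation angles both reduce to the angle of the line joining two center points against the horizontal axis. A minimal sketch of that computation follows; the function name and the example coordinates are illustrative, not from the paper.

```python
import math

def rotation_angle(body_center, grip_center):
    """Angle (degrees) between the line joining two centers and the
    horizontal axis, e.g. mug body center vs. grip center, or blade
    center vs. grasp center. Image y grows downward, hence the -dy."""
    dx = grip_center[0] - body_center[0]
    dy = grip_center[1] - body_center[1]
    return math.degrees(math.atan2(-dy, dx))

# Example with hypothetical pixel centers: a handle directly to the right
# of the mug body gives a 0-degree rotation.
print(rotation_angle((320, 240), (400, 240)))  # 0.0
```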
C. Grip Pose Generation System

The camera used in this paper is a ZED 2i depth camera that can output color image information and depth information at the same time. The color image information is mainly used to identify the gripping plane coordinates mentioned above, from which we have obtained the plane gripping parameters. These parameters are output in the pixel coordinate system. However, the spatial object coordinates input to the robot arm must be based on the three-dimensional world coordinates of the geodetic coordinate system, so the following introduces how to convert the two-dimensional pixel coordinates to three-dimensional world coordinates and then convert the three-dimensional coordinates to the arm coordinates so that the robot arm can achieve the gripping task.
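Capturing the aligned color and depth streams can be sketched with the ZED Python SDK (pyzed). This is an assumed minimal setup rather than the paper's configuration, apart from expressing depth in meters as stated below.

```python
import pyzed.sl as sl

# Minimal ZED 2i capture sketch; depth is expressed in meters as in the text.
zed = sl.Camera()
init = sl.InitParameters()
init.coordinate_units = sl.UNIT.METER

if zed.open(init) == sl.ERROR_CODE.SUCCESS:
    image, depth = sl.Mat(), sl.Mat()
    if zed.grab(sl.RuntimeParameters()) == sl.ERROR_CODE.SUCCESS:
        zed.retrieve_image(image, sl.VIEW.LEFT)        # left color image
        zed.retrieve_measure(depth, sl.MEASURE.DEPTH)  # aligned depth map
    zed.close()
```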
⚫ Conversion relationship between grip area plane coordinates and camera point cloud coordinates

First, the color image information is aligned with the depth information, and the depth is expressed in meters. The pixel coordinates on the 2D plane are then converted into the camera point cloud coordinate system centered on the left camera of the depth camera, using the internal parameters of the camera obtained through camera calibration, as shown in Fig. 16. In the figure, we assume that the camera coordinate P(X, Y, Z) is projected onto the imaging plane coordinate (u, v), where f is the focal length, i.e., the distance from the center of projection O to the imaging plane. According to the similar triangle theorem, we can obtain the relationship between the imaging plane and the camera point cloud, as in (7) and (8).

Fig. 16. Relationship between pixel coordinates and point cloud coordinates

$\frac{v}{f} = \frac{Y}{Z} \Rightarrow v = \frac{fY}{Z}$  (7)

$\frac{u}{f} = \frac{X}{Z} \Rightarrow u = \frac{fX}{Z}$  (8)

Expressing (7) and (8) in vector form gives (9).

$\begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} fX/Z \\ fY/Z \end{bmatrix}$  (9)

In general, the origin of the pixel coordinate system is at the upper left corner of the screen, whereas in the equations above the origin is set at the center of the imaging plane. The origin is therefore shifted to the upper left of the screen: assuming the principal point coordinates $o(u_0, v_0)$, the translated coordinates are shown in (10) and (11).

$u = \frac{fX}{Z} + u_0$  (10)

$v = \frac{fY}{Z} + v_0$  (11)

$Z \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \end{bmatrix}$  (12)

$K = \begin{bmatrix} f_x & 0 & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix}$  (13)

In this paper, the left camera of the ZED 2i depth camera is used as the main camera, where $(f_x, f_y)$ is the focal length of the left camera, $(u_0, v_0)$ is the offset of the principal point from the image origin, Z is the depth distance from the object to the camera, $[u, v, 1]^T$ is the pixel plane coordinate, and $[X, Y, Z]^T$ is the camera point cloud coordinate. Therefore, the depth information in this thesis can be converted through the above matrix into (14)-(16), where $z_{depth}$ is the measured depth, to obtain the coordinates in the point cloud coordinate system centered on the camera.

$X = (u - u_0) \cdot \frac{Z}{f_x}$  (14)

$Y = (v - v_0) \cdot \frac{Z}{f_y}$  (15)

$Z = z_{depth}$  (16)
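Equations (14)-(16) translate directly into a back-projection routine. The sketch below assumes the intrinsics of (13) and a depth map in meters; the variable names and example values are illustrative, not the calibrated ZED 2i parameters.

```python
import numpy as np

def pixel_to_point(u, v, z, fx, fy, u0, v0):
    """Back-project pixel (u, v) with depth z (meters) into the camera
    point cloud frame, per (14)-(16)."""
    X = (u - u0) * z / fx
    Y = (v - v0) * z / fy
    return np.array([X, Y, z])

# Example with assumed left-camera intrinsics (placeholder values).
print(pixel_to_point(700, 400, 0.85, fx=700.0, fy=700.0, u0=640.0, v0=360.0))
```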
⚫ Conversion of Camera Point Cloud Coordinate System to World Coordinate System

In the previous section, we obtained the space object's point cloud coordinates; these are now converted to world coordinates. We use the PnP (Perspective-n-Point) method[27], which takes n pixel coordinates on the pixel plane and the n corresponding world coordinates and computes the projection relationship for each feature point, as represented by Fig. 17. The corresponding 3D-2D points are known: the 3D world coordinates are marked as A, B, and C, and the corresponding coordinates on the 2D image plane are marked as a, b, and c. The coordinates in the camera coordinate system, with the camera as the origin O, are unknown, and the distances OA, OB, and OC represent these unknowns.
If the camera coordinates can be obtained, the corresponding 3D-3D points can be obtained, and the rotation and translation matrix between the camera coordinates and the world coordinates can be calculated. The corresponding relationship is introduced below. Using the triangle law of cosines, we obtain (17)-(19).

$\overline{OA}^2 + \overline{OB}^2 - 2\,\overline{OA}\cdot\overline{OB}\cdot\cos\langle a,b\rangle = \overline{AB}^2$  (17)

$\overline{OB}^2 + \overline{OC}^2 - 2\,\overline{OB}\cdot\overline{OC}\cdot\cos\langle b,c\rangle = \overline{BC}^2$  (18)

$\overline{OA}^2 + \overline{OC}^2 - 2\,\overline{OA}\cdot\overline{OC}\cdot\cos\langle a,c\rangle = \overline{AC}^2$  (19)

Dividing all three equations by $\overline{OC}^2$ and letting $x = \overline{OA}/\overline{OC}$, $y = \overline{OB}/\overline{OC}$ gives (20)-(22).

$x^2 + y^2 - 2xy\cos\langle a,b\rangle = \overline{AB}^2/\overline{OC}^2$  (20)

$y^2 + 1 - 2y\cos\langle b,c\rangle = \overline{BC}^2/\overline{OC}^2$  (21)

$x^2 + 1 - 2x\cos\langle a,c\rangle = \overline{AC}^2/\overline{OC}^2$  (22)

Letting $v = \overline{AB}^2/\overline{OC}^2$, $uv = \overline{BC}^2/\overline{OC}^2$, and $wv = \overline{AC}^2/\overline{OC}^2$, substituting into (20)-(22) above, and rearranging, we obtain (23)-(25).

⚫ Conversion of world coordinate system to robot arm coordinate system

With the above method, we can convert the plane parameters of the gripping pose generated by the ZED 2i into the world coordinate system based on the test platform. Fig. 18. shows the experimental position diagram of this paper and the rotation axes of each coordinate system. This paper uses the Eye-to-Hand method to fix the camera beside the test platform. The pixel coordinates are first converted to the world coordinates, and the rotation and translation matrices then convert these into the arm coordinate system; the conversion relationship is given by (28). The gripping task can then be performed by using UDP to transmit the converted coordinate parameters to the robot arm.
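The chain from 2D-3D correspondences to arm-ready coordinates can be sketched with OpenCV's PnP solver, as below. This is a minimal illustration under assumed calibration points, an assumed rigid world-to-arm transform standing in for (28), and a placeholder UDP endpoint; it is not the paper's implementation.

```python
import socket
import cv2
import numpy as np

# Assumed inputs: four world points (meters) on the test platform, their
# pixel coordinates, and the intrinsic matrix K of (13). All values are
# illustrative placeholders.
world_pts = np.array([[0.0, 0.0, 0.0], [0.2, 0.0, 0.0],
                      [0.2, 0.2, 0.0], [0.0, 0.2, 0.0]], dtype=np.float64)
pixel_pts = np.array([[320.0, 240.0], [420.0, 238.0],
                      [424.0, 140.0], [322.0, 138.0]], dtype=np.float64)
K = np.array([[700.0, 0.0, 640.0],
              [0.0, 700.0, 360.0],
              [0.0, 0.0, 1.0]])

# PnP: rotation and translation from the world frame to the camera frame.
ok, rvec, tvec = cv2.solvePnP(world_pts, pixel_pts, K, None)
R, _ = cv2.Rodrigues(rvec)

# Camera point cloud coordinate -> world coordinate (invert the pose).
p_cam = np.array([[0.1], [0.05], [0.6]])    # a point from (14)-(16)
p_world = R.T @ (p_cam - tvec)

# World -> arm coordinates via a fixed, pre-measured rigid transform;
# R_arm and t_arm here are assumptions in place of the paper's (28).
R_arm, t_arm = np.eye(3), np.array([[0.3], [0.0], [0.0]])
p_arm = R_arm @ p_world + t_arm

# Transmit the converted coordinates to the robot arm over UDP.
msg = ",".join(f"{v:.4f}" for v in p_arm.ravel())
socket.socket(socket.AF_INET, socket.SOCK_DGRAM).sendto(
    msg.encode(), ("192.168.0.10", 5005))   # placeholder address and port
```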
We calculate the error value between the real-world coordinates and the world coordinates obtained by PnP, convert the solved world coordinates to the arm coordinate system, and validate them with the actual arm. The results are shown in the following charts.

After the transformation matrix is obtained through the PnP solution, the squares are placed arbitrarily at four positions, three points are selected arbitrarily at each position for arm coordinate transformation, and the error comparison is performed against the position actually touched by the arm.
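The error comparison described above amounts to a per-point Euclidean distance between the arm's actual touch position and the transformed target. A minimal sketch follows; all coordinate values are illustrative placeholders, not the paper's measurements.

```python
import numpy as np

# Transformed grip coordinates vs. positions actually touched by the arm,
# three points per placement (placeholder values, in millimeters).
predicted = np.array([[310.2, 120.5, 40.1],
                      [355.7, 98.3, 40.0],
                      [290.4, 150.9, 39.8]])
touched = np.array([[312.0, 121.1, 40.5],
                    [354.9, 99.0, 40.3],
                    [291.5, 149.8, 40.1]])

errors = np.linalg.norm(predicted - touched, axis=1)  # per-point error
print(errors, errors.mean())                          # and the mean error
```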
⚫ Place 1
⚫ Place 2
⚫ Place 3

Fig. 22. Place 3
Tab. 3. Place 3 Data Analysis

⚫ Place 4
⚫ Circle Object
⚫ Blade Object

Fig. 31. Blade Object 2
Tab. 12. Blade Object 2 Data Analysis
Fig. 33. Circle object grip pose generation
Tab. 14. Circle object Data Analysis

⚫ Columnar Object

Fig. 34. Columnar object grip pose generation
Tab. 15. Columnar object Data Analysis

⚫ Blade Object

Fig. 35. Blade object grip pose generation
Tab. 16. Blade object Data Analysis

IV. CONCLUSIONS AND FUTURE WORKS

This paper proposes a vision-based design for contour capture of multi-target objects in a complex environment, using data augmentation and deep image matting with background replacement to obtain a good training sample set for the type classifier, with images captured by the ZED 2i depth camera. The captured images are classified by the shape classifier, the gripping area identification then selects the appropriate gripping position for each category, and the best planar gripping area coordinate parameters are obtained and converted into the three-dimensional robot arm coordinate system. To verify the feasibility of this system, a three-dimensional object gripping position experiment is carried out, a two-dimensional plane coordinate to three-dimensional robot arm coordinate conversion experiment is carried out using the verification block, and the generated gripping pose is used to verify the gripping coordinate error. Finally, the actual arm gripping experiment is carried out. The experimental results show that the system can identify and grip objects: the specified coordinate parameters obtained through this system complete the gripping task of the robotic arm.

In future work, in addition to collecting more object types with more unique appearances for gripping identification, the ability to identify the gripping positions of multiple objects can be added, making gripping more convenient and feasible.
V. REFERENCES

[1] E. Oztemel and S. Gursev, "Literature review of Industry 4.0 and related technologies," Journal of Intelligent Manufacturing, vol. 31, no. 1, pp. 127-182, 2020.
[2] C. Rose, J. Britt, J. Allen, and D. Bevly, "An Integrated Vehicle Navigation System Utilizing Lane-Detection and Lateral Position Estimation Systems in Difficult Environments for GPS," IEEE Transactions on Intelligent Transportation Systems, vol. 15, no. 6, pp. 2615-2629, 2014.