

Contour capture generation of multi-target objects in complex environment based on the vision system

Yu-Yin Huang and Chung-Hsien Kuo, Senior Member, IEEE

Abstract—This paper proposes a robot arm gripping pose recognition system with a human-like fetching strategy for arbitrary objects in complex environments. The system is divided into an arbitrary object type classification system, a clamping area identification system, and a clamping posture generation system. The arbitrary object type classification uses human visual senses to determine the object type and classify different objects. With deep learning keying (matting) technology, the object outline is extracted from any background environment, and the complete outline of the object is preserved through morphological image processing. The object clamping area identification uses the aforementioned complete outline of the object, identifies different clamping positions according to the different object type classifiers, generates the plane clamping area coordinates of the object, converts these coordinates into robot arm coordinates to generate clamping postures, and inputs the converted coordinates to the robot arm for the actual clamping action.

This paper uses the proposed system to conduct coordinate conversion and positioning experiments to verify the error relationship between the coordinate conversion and the object clamping position positioning, to determine whether the object coordinates can be converted to the arm for clamping tasks, and to support the reliability of this paper with the complete operation of the robot arm.

Index Terms—object type classifier, deep learning foreground and background separation, coordinate transformation relationship, gripping pose detection, object gripping.

I. INTRODUCTION

A. Motivation and Purpose

In recent years, with the vigorous development of Industry 4.0, factory automation has become a hot topic that is being studied [1]. Robots are widely used in factories and home environments and occupy a significant position in daily life. In industrial production technology, industrial robots are used in factory automation tasks such as packaging, distribution, and sorting [2]. Most traditional industrial robot grasping systems are aimed at a structured operating environment and rely on obtaining all relevant information about the grasped object in advance, such as the shape, color, posture, position, grasping scene, and other related features of the grasped object [3]. Such a single structured system lacks flexibility and robustness, and its scope of application is quite limited. Grasping tasks are mainly divided into three directions, namely object detection, grasping planning, and robot control [4]. Detecting objects and generating grasping poses is the primary key to the success of robot grasping tasks, and the accuracy of grasping poses helps to plan grasping paths and realize complete grasping tasks.

Robotic grasping detection is of great significance to intelligent manufacturing and factory automation, and it is also a challenging and developing technology in grasping tasks. Early grasp detection methods first assume that the object is placed in a clean and simple environment and perform the grasping task by simply positioning the object and placing it in a fixed pose [4]. The grasping task is more complicated when the object is placed in a complex environment or in an arbitrary pose. In recent years, with the vigorous development of visual imaging technology, deep learning methods have been applied to robot vision, and the technology uses various techniques through visual perception [6]. Since humans cannot analyze images quickly and efficiently with their eyes alone, computer-aided image processing is used to identify environmental objects. With visual imaging technology, grasping detection can lock grasp targets faster while obtaining good grasp positions and postures [5].

The central axis of this paper is to mimic the gripping pose and position tracking of any object in a complex background environment with a humanoid robot. By classifying any object type [7] and applying deep learning matting [8], the complete contour information of any object is found as one of the essential data for the grasping position. The gripping position is determined based on the object's contour described above. The pose of the gripping point, the gripping position, and the gripping parameters are obtained according to the different shape classifiers of the classified objects. After obtaining the parameters, the grasping parameters are converted into arm coordinates through the relationship between pixel coordinates, world coordinates, and arm coordinates [9]. The system is verified by using the actual arm for validation experiments.

B. Related Works

This paper firstly investigates robot grasping detection technology. With the development and evolution of grasping detection technology, the grasping task is that the robot obtains the position of the target object to perform the grasping action.

Traditional detection methods require humans to analyze the geometric structure of the detected object directly, form an algorithm specific to the task, determine the appropriate grasping point according to its shape and size, and match the appropriate gripper to perform the gripping task [10]. However, this method requires a lot of computation and analysis time, and its application scalability is relatively limited. In the past five years, the application of deep learning to robot vision has made significant progress.

The overall performance of robot grasp detection has been improved through deep learning methods [11]. In addition to being widely used in various scenarios, deep learning can also improve the accuracy of grasping [12][13].

Fig. 1. Object grab point representation

C. Robotic Grasp Detection

The method of grasping detection is to capture and identify the object in the picture to obtain the grasping point or grasping posture and the grasping positioning information [14], so that the endpoint of the robot gripper is positioned at the grasping point of the object; the gripper jaws then firmly grip the object between the fingers and pick it up. The relevant gripping information is generated through the vision sensor in a similar way, and according to this gripping information the gripping system is analyzed and planned to find a more suitable gripping posture. Ian Lenz et al. proposed the sliding window method in 2015 [15]: the image is first divided into many small blocks, in each iteration the classification system predicts whether a position is a clamping position, and a more reliable clamping position is finally output. The accuracy of object detection with this method can reach 75%, and the processing time of each image is 13.5 seconds. Redmon et al. proposed a real-time robot grasping detection network based on a convolutional neural network [16]. This network is a single-stage regression of the grasping bounding box, and its speed running on the GPU can reach 13 frames per second. This method can generate multiple grasping points on a single object and is not limited to a single grasping pose.

D. Deep Image Matting

Image matting technology is a prevalent feature in online meeting software. The simple goal is to remove the background and leave the foreground information. Image matting is similar to instance segmentation, but the most significant difference between the two is that the edges of objects separated by image matting are more delicate, while instance segmentation focuses on extracting the correctness of objects rather than the fineness of the object contour edges [17]. Segmentation therefore cannot achieve the same effect as matting in object extraction.

In recent years, with the rise of deep learning, F, B, Alpha Matting [18], proposed by Marco Forte et al., adopts a ResNet-50 + U-Net two-model architecture, which claims to be able to extract foregrounds at the hair level. Although the effect is good, it needs to process details and edges based on grayscale images and spends more time sorting out training data. Shanchuan Lin et al. published Real-Time High-Resolution Background Matting [19] in 2020 to solve the inconvenience that such images need to go through essential edge details; at the same time, it achieves both high speed and high image quality, and the effect is excellent.

E. Object Type Classifier

This paper designs an object classifier based on human visual perception to determine the object type. The design of the classifier refers to the problem of object detection. Object detection is mainly defined as a visual sensor shooting an object and, through object detection, determining the position of the object in the image and the category it belongs to [20]; this is the most direct method to discriminate the type of objects. Posheng Huang et al. [22] proposed a method based on RGB-D image learning to detect objects. Through a short-connected FCN network RGB-D object detection model, the deep learning network can be more coherent in space, and the new loss function enhances spatial consistency. Decheng Wang et al. proposed an RGB-D object detection network model based on the YOLOv3 framework [21]. They improved the NMS algorithm fused with depth features, and the mean average precision of their Depth Fusion NMS algorithm is higher than those of the Greedy-NMS and Soft-NMS baselines by 0.8%, 0.5%, and 0.3%.

F. Grasp pose generation

In this paper, the contour and classifier results are obtained based on the deep learning matting described above, and comparative verification methods are selected for different types of objects. The LSD fast line detection algorithm proposed by Rafael Grompone von Gioi et al. [24] calculates the gradient magnitude and direction of all points in the image, connects adjacent points with small gradient direction changes into the same region, and discriminates line candidates through multiple validations; compared with the Hough line transform, the detection speed and accuracy are greatly improved. Canny edge detection is a composite edge detection algorithm [25] that combines four steps, Gaussian filtering, gradient detection, non-maximum suppression, and boundary judgment, to perform edge detection with a low error rate and accurate, high-resolution localization.

The rest of this paper is organized as follows. Section II introduces the overall operation flow of multi-target object contour capture generation in a complex environment based on the vision system, and discusses the design of the object edge extraction system and the object grasping position discrimination system. In Section III, the proposed system is operated on the actual robot; the coordinate transformation and positioning test and the actual object gripping task on the robot are used to verify the system. Finally, Section IV presents the conclusions and future work.

II. METHOD

A. System Flowchart

The system architecture designed in this paper is to identify the gripping area of any object, generate grip coordinates, and position the grip coordinates on the robot arm for the gripping action. The whole system is composed of an object edge extraction system, a grasping position discrimination system, and a grip pose generation system. In this paper, the grip pose generation and recognition program is developed in Python under the Linux system, and the ROS system controls the robot arm. The first part is the object edge extraction system design, which uses the YOLOv4 network architecture as the main model training framework for the type classifier, extracting image features of the same class and predicting the position of the object in the image and the probability value of its class.

Fig. 2. Thesis Structure

The human visual sensory object type is used as the classifier class selection, and the objects are classified into four main categories, with subdivisions for each category. The object type data set is collected to maximize the deep learning training samples. The data augmentation method and deep image matting are added to replace the need for manual photographing and marking of objects, which helps enhance the training model. The object outline is then separated by using the two-layer deep image matting neural network; together with traditional image processing to refine the object outline, the complete acquisition of the object edge outline helps identify the following clamping area. The second part is the object grasping position discrimination system, which uses the object contour and the category identified by the type classifier as the basis for the clamping position discrimination. It selects the suitable clamping area using straight-line detection, object contour detection, straight-line fitting, and object rotation angle methods, and the plane coordinates of the grip area are obtained. The third part is the grip pose generation system, where the known grip plane coordinates are converted to the robot arm coordinates through two coordinate conversions. Finally, the robot arm is used for grip verification to check the accuracy of the grip area, the grip poses generated by the system in this paper, and whether the robot arm can grip the objects. The overall operation flow chart of this study is shown in Fig. 2.

B. Object Edge Extraction System

The first part is the object type classifier design, which uses human visual senses to judge different types of objects. The YOLOv4 neural network framework trains the model to classify similarly shaped objects. In preparing the training data set, we hope to effectively recognize the shape in any environment, so we add the technique of deep image matting and collage the separated objects onto different backgrounds to enhance the model training and strengthen the recognition ability in any environment. In order to reduce the work of photographing object data sets and object labeling, the data augmentation method is added to obtain a full deep learning training set from a small number of data samples and to enhance the learning effect.

The second part is the deep learning keying (matting) technique. After the object is classified and trained, a complete image with foreground information and a background image are taken, and the complete object outline is obtained by extracting the foreground profile of the object for the subsequent analysis of arbitrary object pinch points. The structure of the object edge contour extraction system is shown in Fig. 3.

Fig. 3. Object edge contour extraction system architecture diagram

This thesis uses the objects on an office desk as the primary research target. In order to obtain the best gripping point posture of various objects for analysis and discussion, the objects on the office desk are classified into four main categories, namely long objects, round objects, cup-shaped objects, and grasping objects, and the four categories are used as the reference standard to distinguish the external shape of objects through human eyesight. The structure of the type classifier designed in this thesis is shown in Fig. 4.

Fig. 4. The classification diagram of the type classifier, using human vision to identify the type of objects. The four main categories are blade, circle, long, and columnar, and further subdivisions are made for each category.

During the preparation of the dataset of the type classifier, to reduce the manual annotation time, enhance the model training effect, and ensure that over-fitting does not occur during training, 100 color photos of various objects were taken on the workbench. The 100 RGB images were first annotated manually. Data augmentation was then applied to the original pictures, with rotation, scale adjustment, brightness change, sharpening, noise, and flip effects, to generate more data for the system to learn from the existing data. At the same time, to make the model adaptable to various environmental changes, we use deep image matting to separate the photographed objects and replace the backgrounds with various scenes, so that the model can adapt to various environmental changes without changing the object characteristics during training.
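As a concrete illustration of the augmentation step described above, the following is a minimal sketch of how rotation, scaling, brightness, noise, and flip transforms could be generated with OpenCV and NumPy. The function name, file name, and parameter ranges are illustrative assumptions, not the exact pipeline used in this paper, and the sharpening effect mentioned in the text is omitted for brevity.

```python
import cv2
import numpy as np

def augment(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply one random combination of the augmentations described in the text."""
    h, w = image.shape[:2]

    # Random rotation and scale around the image center.
    angle = rng.uniform(-30, 30)          # degrees, illustrative range
    scale = rng.uniform(0.8, 1.2)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    out = cv2.warpAffine(image, M, (w, h), borderMode=cv2.BORDER_REFLECT)

    # Random brightness shift.
    out = cv2.convertScaleAbs(out, alpha=1.0, beta=rng.uniform(-40, 40))

    # Additive Gaussian noise.
    noise = rng.normal(0, 8, out.shape).astype(np.float32)
    out = np.clip(out.astype(np.float32) + noise, 0, 255).astype(np.uint8)

    # Random horizontal flip.
    if rng.random() < 0.5:
        out = cv2.flip(out, 1)
    return out

rng = np.random.default_rng(0)
img = cv2.imread("object_photo.jpg")          # hypothetical input file
samples = [augment(img, rng) for _ in range(5)]
```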

Fig. 5 and Fig. 6 show the results of data augmentation and deep image matting applied to the original photos, respectively.

Fig. 5. The original image is in the upper left corner; the remaining five images are new images whose backgrounds have been replaced after deep image matting.

Fig. 6. The new images generated randomly after data augmentation.

The aforementioned generated dataset is used for deep neural network training, and the YOLOv4 network architecture is chosen as the primary model training method for the type classifier. YOLOv4 is a target detection system based on convolutional neural networks, published by Alexey Bochkovskiy et al. in 2020 [26]; it is a widely used system in the field of object detection that extracts image feature maps through filters. YOLO is an end-to-end deep network architecture that uses regression methods to predict the bounding box, class, and feature extraction within the same convolutional neural network. The core objective is to use the complete image as the input to the neural network and regress the bounding box location and the target category in the output layer. Fig. 7 shows the flowchart of the YOLOv4 target recognition. The YOLOv4 used in this paper is an innovative algorithm that combines many previous research techniques to achieve a good balance of speed and accuracy while reducing hardware requirements.

Fig. 7. The flowchart of the YOLOv4 target recognition.

In addition to the four main categories (long, circle, blade, and columnar), the object type classifier in this paper adds the identification of whether a circle is hollow or filled, whether a columnar object contains a grip shape, and whether a blade object is divided into a blade part and a grasp part; there are eight categories in total. The network architecture uses the CSPDarknet53 framework designed by AlexeyAB, and the complete training, validation, and test sets are split in the ratio of 8:2:2. Tab. 1 shows the results after training.

Tab. 1. YOLOv4 results after training with the original data, the augmented data, and the background-replacement data. Compared with the original data, the mAP after training with the augmented and background-replacement data sets increased by 32.3% and 36.8%, respectively.

Method    Data improvement                Backbone        mAP (%)   IOU (%)
YOLOv4    w/o augmentation                CSPDarknet53    60        63.8
YOLOv4    w/ augmentation                 CSPDarknet53    92.3      76.5
YOLOv4    w/ augmentation + background    CSPDarknet53    96.8      87.3
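To make the classifier's role concrete, the sketch below shows one common way to run a trained Darknet-format YOLOv4 model with OpenCV's DNN module. The file names, class list, input size, and thresholds are assumptions for illustration, not the exact configuration used in this work.

```python
import cv2
import numpy as np

# Hypothetical paths to a trained Darknet model (cfg/weights) and an assumed 8-class list.
net = cv2.dnn.readNetFromDarknet("yolov4-custom.cfg", "yolov4-custom.weights")
classes = ["long", "circle", "columnar", "blade",
           "hollow", "filled", "grip", "round-grasp"]

img = cv2.imread("desk_scene.jpg")
h, w = img.shape[:2]

# Standard YOLO preprocessing: scale to [0, 1], resize, swap BGR->RGB.
blob = cv2.dnn.blobFromImage(img, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())

boxes, confidences, class_ids = [], [], []
for out in outputs:
    for det in out:                      # det = [cx, cy, bw, bh, objectness, class scores...]
        scores = det[5:]
        cid = int(np.argmax(scores))
        conf = float(scores[cid] * det[4])
        if conf > 0.5:
            cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
            boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
            confidences.append(conf)
            class_ids.append(cid)

# Non-maximum suppression keeps one box per detected object.
keep = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)
for i in np.array(keep).flatten():
    print(classes[class_ids[i]], confidences[i], boxes[i])
```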
After the type classifier classifies the objects, the edge contours need to be extracted before they can be used for the following clamping area discrimination. This paper uses the deep learning keying (matting) technique, published by Shanchuan Lin et al. in 2021 [19], as the primary method for extracting contours. The network architecture is shown in Fig. 8. The conventional way of acquiring foreground images is given by (1).

Fig. 8. Deep image matting architecture, consisting of the Gbase network and the Grefine network. The former initially outputs the partial contour of the object at low resolution; the latter performs high-resolution analysis for the parts with larger error and finally outputs the complete object contour.

$I_i = \alpha_i F_i + (1 - \alpha_i) B_i$  (1)

where $I$ is the known input original image, $B$ is the background image, $\alpha$ is a value between $[0, 1]$ (which can also be interpreted as transparency), and $i$ is the index of each pixel in the image. The general matting method finds the $\alpha$ value of each pixel in image $I$ from the error comparison between the background image and the foreground image after subtraction. Unlike the general method, however, this deep image matting approach generates foreground images by predicting the difference between the actual foreground value and the estimated value (collectively referred to as the residual value in this paper), as in (2), so that the foreground can be predicted by a lower-resolution network, as shown in (3).

$F^R = F - I$  (2)

$F = \max(\min(F^R + I, 1), 0)$  (3)

The deep image matting network is divided into two network layers. Firstly, an original image and a background image are taken as input, and the input image is fed to the base network layer after downsampling. The first base network layer quickly obtains a low-resolution result on the downsampled image and predicts the foreground residual $F_c^R$, the alpha prediction error map $E_c$, the alpha prediction $\alpha_c$, and the network feature $H_c$. The second, Grefine, network layer optimizes the edges produced by the base network layer: it refines the details of $\alpha_c$ and $F_c^R$ only in the parts of the predicted $E_c$ with larger error values and finally generates the foreground contours, foreground objects, and foreground residuals.
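The relations in (1)-(3) are simple per-pixel operations; the short NumPy sketch below illustrates them, assuming images normalized to [0, 1]. It is an illustration of the equations only, not of the network described in [19].

```python
import numpy as np

def composite(F: np.ndarray, B: np.ndarray, alpha: np.ndarray) -> np.ndarray:
    """Eq. (1): blend foreground F and background B with per-pixel alpha in [0, 1]."""
    return alpha * F + (1.0 - alpha) * B

def foreground_from_residual(F_residual: np.ndarray, I: np.ndarray) -> np.ndarray:
    """Eqs. (2)-(3): recover the foreground from the predicted residual F^R = F - I,
    clamping the result back into the valid intensity range [0, 1]."""
    return np.clip(F_residual + I, 0.0, 1.0)

# Toy example with a 2x2 RGB image, values in [0, 1].
rng = np.random.default_rng(0)
F = rng.random((2, 2, 3))
B = rng.random((2, 2, 3))
alpha = rng.random((2, 2, 1))
I = composite(F, B, alpha)
F_rec = foreground_from_residual(F - I, I)   # equals F up to clipping
```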

In this paper, we use deep image matting as the object edge contour extraction method, in which ResNet-101 is chosen as the depth residual network structure for high-resolution output, and the refine-mode optimization area is set to sampling, which optimizes only the places with large $E_c$ error values.

After the deep image matting, the generated foreground contour map (pha image) is our main basis for distinguishing the clamping area. However, after actually photographing the objects, we found that the lighting of the object tends to produce shadows and noise interference when using deep image matting, so that the generated foreground contour map is not complete. Therefore, we add a post-processor to the pha image for secondary processing, using image morphology to remove edge noise and connect broken foregrounds; the binarized image is mainly processed with erosion and dilation to eliminate noise. Each operation is described below.

◼ Erosion
Consider two sets, A and B, in the image space. The erosion of set A by set B is expressed as A ⊖ B, where A is the target of erosion and B is the structuring element. By using image erosion, slight white noise can be removed from the image, and the image can be refined to eliminate burrs, as shown in (4).

$A \ominus B = \{\, z \mid (B)_z \subseteq A \,\}$  (4)

◼ Dilation
Consider two sets, A and B, in the image space. The dilation of set A by set B is expressed as A ⊕ B, where A is the dilation target and B is the structuring element. Image dilation is mostly used together with image erosion: the edge lines are narrowed by erosion to remove noise, and the image is then restored by dilation, as shown in (5).

$A \oplus B = \{\, z \mid (\hat{B})_z \cap A \neq \emptyset \,\}$  (5)

◼ Opening operation
The opening operation is composed of erosion followed by dilation. The erosion first filters out the smaller noise, and the dilation then smooths the edge contour. Hence, the opening operation aims to smooth the contour and eliminate thin connections at the edges, as shown in (6).

$A \circ B = (A \ominus B) \oplus B$  (6)

◼ Closing operation
The closing operation first performs dilation and then erosion, fusing the smaller noise regions in the image; the larger noise that cannot be eliminated by the opening is handled by the subsequent erosion. The purpose of the closing operation is to make the contour smooth and fill small gaps in the contour, as shown in (7).

$A \bullet B = (A \oplus B) \ominus B$  (7)
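The post-processing described above maps directly onto OpenCV's morphology primitives; the following is a minimal sketch of one plausible pipeline applied to the pha image produced by the matting network. The threshold value and kernel size are illustrative assumptions, not the exact settings of this paper.

```python
import cv2
import numpy as np

def clean_contour_mask(pha: np.ndarray) -> np.ndarray:
    """Binarize the matting output and apply opening/closing (Eqs. (4)-(7))
    to remove speckle noise and close small gaps in the foreground."""
    # pha is assumed to be a single-channel alpha map in [0, 255].
    _, binary = cv2.threshold(pha, 127, 255, cv2.THRESH_BINARY)

    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))   # structuring element B

    # Opening = erosion followed by dilation: removes small white noise.
    opened = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)

    # Closing = dilation followed by erosion: fills small holes and gaps.
    closed = cv2.morphologyEx(opened, cv2.MORPH_CLOSE, kernel)
    return closed

mask = clean_contour_mask(cv2.imread("pha_image.png", cv2.IMREAD_GRAYSCALE))
```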
analyzes the regional gradient changes of the image and
The object is divided into different types of objects by the classifies the clusters of pixels with similar gradient changes
type classifier, and then the object foreground is obtained by as line candidates. After verifying the line candidates through
deep image matting. The complete contour of the object is hypothesis citation, the line pixel set and the error control set
generated after binarized erosion and expansion. The result is are merged and finally obtained. For the accurate line set, we
presented in Fig. 9. extract the intersection center point of the most extended two
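As an illustration of this step, the sketch below uses OpenCV's LSD implementation to find the two longest segments in the contour image and derive a gripping center and rotation angle from them. It is a simplified approximation of the procedure described above: the exact intersection point is replaced here by the midpoint between the two segment centers, and LSD availability depends on the OpenCV build.

```python
import cv2
import numpy as np

def long_object_grip(mask: np.ndarray):
    """Estimate a grip center and gripper rotation angle for a long object
    from the two longest line segments found by LSD on its contour mask."""
    lsd = cv2.createLineSegmentDetector()            # available in recent OpenCV 4.x builds
    lines, _, _, _ = lsd.detect(mask)
    if lines is None or len(lines) < 2:
        return None

    segs = lines.reshape(-1, 4)                      # each row: x1, y1, x2, y2
    lengths = np.hypot(segs[:, 2] - segs[:, 0], segs[:, 3] - segs[:, 1])
    a, b = segs[np.argsort(lengths)[-2:]]            # two longest segments

    # Simplification: midpoint between the two segment midpoints instead of the
    # exact line intersection described in the text.
    mid_a = (a[:2] + a[2:]) / 2
    mid_b = (b[:2] + b[2:]) / 2
    center = (mid_a + mid_b) / 2

    # Average of the two segment angles w.r.t. the horizontal axis, in degrees.
    ang_a = np.degrees(np.arctan2(a[3] - a[1], a[2] - a[0]))
    ang_b = np.degrees(np.arctan2(b[3] - b[1], b[2] - b[0]))
    theta = (ang_a + ang_b) / 2
    return (float(center[0]), float(center[1])), float(theta)

mask = cv2.imread("long_object_mask.png", cv2.IMREAD_GRAYSCALE)   # hypothetical input
print(long_object_grip(mask))
```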

◼ Circle Object

Fig. 11. Circle object flowchart

The circle object also takes the center of the object as the best clamping area; Fig. 11 shows the process structure of the circle object. Circle objects are further divided into two shapes, namely the hollow shape and the solid shape. On the contour map of the object after secondary processing, Canny edge detection is used to find the inner contour of the center of the object; a circumscribed rectangle is generated from the inner contour, and the center of the circumscribed rectangle is used as the clamping coordinate. Because the center of the contour map can be incomplete after the secondary processing, a category is added to the type classifier to determine whether the circle object is a hollow shape, as a secondary verification, and the center of the circumscribed rectangle is used as the gripping coordinate. The width of the gripper used in this paper is limited to 80 mm, so the actual graspable objects are limited to this width. The gripping rotation angle is based on the angle between the straight line fitted to the object and the horizontal axis.
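A minimal sketch of this circle-object step with OpenCV is shown below, assuming the binarized contour map from the previous section as input; the Canny thresholds and file name are illustrative.

```python
import cv2
import numpy as np

def circle_object_grip(mask: np.ndarray):
    """Find the inner contour of a (possibly hollow) circle object with Canny edge
    detection and use the center of its bounding rectangle as the grip coordinate."""
    edges = cv2.Canny(mask, 50, 150)                       # illustrative thresholds
    contours, hierarchy = cv2.findContours(edges, cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None

    # Prefer an inner contour (one that has a parent in the hierarchy) if present,
    # otherwise fall back to the largest contour found.
    inner = [c for c, h in zip(contours, hierarchy[0]) if h[3] != -1]
    target = max(inner if inner else contours, key=cv2.contourArea)

    x, y, w, h = cv2.boundingRect(target)                  # circumscribed rectangle
    center = (x + w / 2, y + h / 2)

    # Rotation angle from a straight-line fit to the contour points.
    vx, vy, _, _ = cv2.fitLine(target, cv2.DIST_L2, 0, 0.01, 0.01).flatten()
    theta = float(np.degrees(np.arctan2(vy, vx)))
    return center, theta

mask = cv2.imread("circle_object_mask.png", cv2.IMREAD_GRAYSCALE)   # hypothetical input
print(circle_object_grip(mask))
```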
◼ Columnar Object

Fig. 12. Columnar object flowchart

Columnar objects are divided into general cylindrical bottles and mugs. The flow chart of the columnar object is shown in Fig. 12. The mug handle is classified by the type classifier. For a general cylindrical bottle, the LSD line detection method is used to find the intersection of the two most extended line segments on the edge of the cup as the clamping coordinate, and the angle between the fitted straight line and the horizontal axis is used as the basis for the rotation angle. For a mug, since we know the frame selection coordinates of the grip area from the classifier, we set the optimal clamping position at the center of the mug handle based on the frame-selected position coordinates. Differently from the bottle, we calculate the rotation angle between the center point of the mug body and the center point of the grip, use this angle to determine the gripping area and generate a gripping rectangle, and use the center point of the rectangle as the gripping center coordinate, as shown in Fig. 13. The rotation angle of the jaws is taken as 90 degrees from the horizontal axis for the mug.

Fig. 13. Take the center line between the object's bounding box and the grip as the primary connection line, project the object to the center of the coordinates, divide the coordinates into several areas, and calculate the angle of each area according to the center connection line and the horizontal axis.

◼ Blade Object

Fig. 14. Blade object flowchart

Blade objects are divided into a grasp shape and a blade shape, and in the type classifier the grasp is further divided into grasp and round grasp. The architecture flowchart of the blade object is shown in Fig. 14. For the grasp shape, we select the center between the object's grasp and blade as the gripping coordinate, while for the round grasp, the gripping position closest to the camera direction is selected as the gripping coordinate. Similarly, since the blade object can be placed in an arbitrary position, we take the center between the blade and the grasp as the rotation center, as shown in Fig. 15. The object placement is divided into four quadrants, the slope of the tangent of the object in each quadrant is converted, and the rotation angle between the object and the horizontal line is obtained as the basis for the jaw angle.

Fig. 15. The rotation angle is based on the angle between the center point of the blade-and-grasp bounding box and the horizontal axis. The object is projected to the center of the coordinates, the four quadrants are divided into three-angle zones, and the calculated angle positions are compared.
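The rotation angle used for the mug and blade cases above reduces to the angle of the line joining two region centers with respect to the horizontal axis. The small helper below is an assumed formulation of that computation, using image coordinates with the y-axis pointing down.

```python
import math

def rotation_angle(body_center: tuple, grip_center: tuple) -> float:
    """Angle (degrees) of the line from the body center to the grip center,
    measured against the horizontal image axis. Image y grows downward, so the
    sign is flipped to report a conventional counter-clockwise angle."""
    dx = grip_center[0] - body_center[0]
    dy = grip_center[1] - body_center[1]
    return -math.degrees(math.atan2(dy, dx))

# Example: mug body center and handle center in pixel coordinates.
print(rotation_angle((320.0, 240.0), (400.0, 200.0)))   # ~26.6 degrees
```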
D. Grip Pose Generation System

The camera used in this paper is a ZED 2i depth camera that can output color image information and depth information at the same time. The color image information is mainly used to identify the clamping plane coordinates mentioned above, and the plane clamping parameters have been obtained by the above method. However, these parameters are expressed in the pixel coordinate system, whereas the space object coordinates input to the robot arm need to be based on the three-dimensional world coordinates of the geodetic coordinate system. The following therefore introduces how to convert the pixel coordinates of the two-dimensional plane to the three-dimensional world coordinates, and then convert the three-dimensional coordinate system to the arm coordinates so that the robot arm can achieve the gripping task.
⚫ Conversion relationship between grip area plane coordinates and camera point cloud coordinates

First, the color image information is aligned with the depth information, and the depth information is expressed in meters. The pixel coordinates in the 2D plane are then converted to the camera point cloud coordinate system centered on the left camera of the depth camera by obtaining the internal parameters of the camera through camera calibration, as shown in Fig. 16. In the figure, we assume that P(X, Y, Z) is the camera coordinate projected onto the imaging plane coordinate (u, v), where f is the distance from the vertical axis of P to the center of projection O on the imaging plane. According to the similar triangle theorem, we obtain the relationship between the imaging plane and the camera point cloud as (7) and (8):

Fig. 16. Relationship between pixel coordinates and point cloud coordinates

$\dfrac{v}{f} = \dfrac{Y}{Z} \;\Rightarrow\; v = \dfrac{fY}{Z}$  (7)

$\dfrac{u}{f} = \dfrac{X}{Z} \;\Rightarrow\; u = \dfrac{fX}{Z}$  (8)

Expressing (7) and (8) in vector form gives (9):

$\begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} fX/Z \\ fY/Z \end{bmatrix}$  (9)

In general, the origin of the pixel coordinate system is at the upper left corner of the image. In the above equations, the origin of the pixel coordinate system is set at the center of the imaging plane, so the origin should be shifted to the upper left of the image. Assuming the center coordinates are o(u0, v0), the coordinates after translation are shown in (10) and (11).

$u = \dfrac{fX}{Z} + u_0$  (10)

$v = \dfrac{fY}{Z} + v_0$  (11)

Equations (10) and (11) can be expressed in homogeneous coordinates as (12). The 3×3 matrix K in (13) is the internal orientation parameter of the camera, and its coefficients are related only to the camera itself.

$Z \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \end{bmatrix}$  (12)

$K = \begin{bmatrix} f_x & 0 & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix}$  (13)

In this paper, the left camera of the ZED 2i depth camera is used as the main camera, where (f_x, f_y) is the focal length of the left camera, (u_0, v_0) is the offset of the principal point from the image origin, Z is the depth distance from the object to the camera, [u, v, 1]^T are the pixel plane coordinates, and [X, Y, Z]^T are the camera point cloud coordinates. Therefore, the depth information in this thesis can be converted through the above matrix into the representations (14)-(16) to obtain the coordinates in the point cloud coordinate system centered on the camera, where z is the depth value measured by the camera.

$X = (u - u_0)\,\dfrac{Z}{f_x}$  (14)

$Y = (v - v_0)\,\dfrac{Z}{f_y}$  (15)

$Z = z$  (16)

Fig. 17. Projection of pixel plane coordinates to world coordinates
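Equations (14)-(16) amount to back-projecting a pixel through the intrinsic matrix K using the measured depth. A small sketch of that computation is given below, with illustrative intrinsic values rather than the calibrated parameters of the ZED 2i used in this work.

```python
import numpy as np

def pixel_to_camera_point(u: float, v: float, depth: float, K: np.ndarray) -> np.ndarray:
    """Back-project pixel (u, v) with measured depth into the camera point cloud
    frame using Eqs. (14)-(16): X = (u-u0)Z/fx, Y = (v-v0)Z/fy, Z = depth."""
    fx, fy = K[0, 0], K[1, 1]
    u0, v0 = K[0, 2], K[1, 2]
    X = (u - u0) * depth / fx
    Y = (v - v0) * depth / fy
    return np.array([X, Y, depth])

# Illustrative intrinsics (not the calibrated ZED 2i values).
K = np.array([[700.0, 0.0, 640.0],
              [0.0, 700.0, 360.0],
              [0.0, 0.0, 1.0]])
print(pixel_to_camera_point(800.0, 400.0, 0.75, K))   # meters
```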
⚫ Conversion of camera point cloud coordinate system to world coordinate system

In the previous section, we obtained the point cloud coordinates of the space object; these are now converted to world coordinates. We use the PnP (Perspective-n-Point) method [27], taking n pixel coordinates on the pixel plane and the n corresponding points in world coordinates, and the projection relationship is calculated for each pair of feature points, as represented by Fig. 17. The corresponding 3D-2D points are known: the 3D world coordinates are marked here as A, B, and C, and the coordinates on the 2D image plane are marked as a, b, and c, while the distances OA, OB, and OC from the camera origin O in the camera coordinate system are unknown. If the camera coordinates of these points can be obtained, the corresponding 3D-3D point pairs are known, and the rotation and translation matrices between the camera coordinates and the world coordinates can be calculated. The corresponding relationship is introduced below.

Using the triangle law of cosines, we obtain (17)-(19), where $\langle a,b \rangle$ denotes the angle between the viewing rays toward a and b.

$\overline{OA}^2 + \overline{OB}^2 - 2\,\overline{OA}\cdot\overline{OB}\cos\langle a,b\rangle = \overline{AB}^2$  (17)

$\overline{OB}^2 + \overline{OC}^2 - 2\,\overline{OB}\cdot\overline{OC}\cos\langle b,c\rangle = \overline{BC}^2$  (18)

$\overline{OA}^2 + \overline{OC}^2 - 2\,\overline{OA}\cdot\overline{OC}\cos\langle a,c\rangle = \overline{AC}^2$  (19)

Dividing all three equations by $\overline{OC}^2$ and letting $x = \overline{OA}/\overline{OC}$ and $y = \overline{OB}/\overline{OC}$ gives:

$x^2 + y^2 - 2xy\cos\langle a,b\rangle = \overline{AB}^2/\overline{OC}^2$  (20)

$y^2 + 1 - 2y\cos\langle b,c\rangle = \overline{BC}^2/\overline{OC}^2$  (21)

$x^2 + 1 - 2x\cos\langle a,c\rangle = \overline{AC}^2/\overline{OC}^2$  (22)

Letting $v = \overline{AB}^2/\overline{OC}^2$, $uv = \overline{BC}^2/\overline{OC}^2$, and $wv = \overline{AC}^2/\overline{OC}^2$ and substituting into (20)-(22), after rearranging we have:

$x^2 + y^2 - 2xy\cos\langle a,b\rangle - v = 0$

$y^2 + 1 - 2y\cos\langle b,c\rangle - uv = 0$

$x^2 + 1 - 2x\cos\langle a,c\rangle - wv = 0$

Substituting (20) into (21) and (22) yields (23) and (24):

$(1-u)y^2 - ux^2 - 2y\cos\langle b,c\rangle + 2uxy\cos\langle a,b\rangle + 1 = 0$  (23)

$(1-w)x^2 - wy^2 - 2x\cos\langle a,c\rangle + 2wxy\cos\langle a,b\rangle + 1 = 0$  (24)

Using the above known parameters, x and y can be obtained; substituting back into (20) gives v, and then $\overline{OA}$, $\overline{OB}$, and $\overline{OC}$ are obtained. Through this method, the three world-coordinate points A, B, and C can be recovered. This paper verifies the error by bringing in six sets of points, finds the A, B, C solution with the smallest error among them, and obtains the rotation and translation matrices from the 3D-3D point pairs, as represented by (25).

$Z\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}$  (25)

The above pixel plane coordinates are converted to world coordinates and simplified as (26) and (27):

$A = R \cdot A' + T$  (26)

$R^{-1}(A - T) = A'$  (27)

where A is a point in the point cloud coordinate system, A' is the corresponding point in the world coordinate system, R is the rotation matrix, and T is the translation matrix.
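In practice, this 3D-2D correspondence problem is commonly solved with OpenCV's solvePnP, as sketched below; the six object and image points and the intrinsics are placeholders standing in for the calibration-square measurements used in this paper, not the measured data.

```python
import cv2
import numpy as np

# Six world-coordinate points on the test platform (meters) and their observed
# pixel coordinates; the values here are placeholders, not the measured data.
object_points = np.array([[0.00, 0.00, 0.0], [0.06, 0.00, 0.0], [0.00, 0.06, 0.0],
                          [0.06, 0.06, 0.0], [0.12, 0.00, 0.0], [0.00, 0.12, 0.0]],
                         dtype=np.float64)
image_points = np.array([[610, 420], [655, 418], [612, 376],
                         [657, 374], [700, 416], [614, 332]], dtype=np.float64)

K = np.array([[700.0, 0.0, 640.0],      # illustrative intrinsics (see Eq. (13))
              [0.0, 700.0, 360.0],
              [0.0, 0.0, 1.0]])
dist = np.zeros(5)                       # assume negligible lens distortion

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist)
R, _ = cv2.Rodrigues(rvec)               # rotation matrix R and translation T (Eq. (26))

# World coordinates of a camera-frame point A via Eq. (27): A' = R^-1 (A - T).
A_camera = np.array([[0.03], [0.02], [0.60]])
A_world = R.T @ (A_camera - tvec)
print(A_world.ravel())
```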
⚫ Conversion of world coordinate system to robot arm coordinate system

With the above method, we can convert the plane parameters of the clamping pose generated from the ZED 2i into the world coordinate system based on the test platform. Fig. 18 shows the experimental position diagram of this paper and the representation of the rotation axes of each coordinate system. This paper uses the eye-to-hand method and fixes the camera beside the test platform. The pixel coordinates are first converted to world coordinates, and the rotation and translation matrices then convert them into the arm coordinate system; the conversion relationship is given by (28). The gripping task can be performed by using UDP to transmit the converted coordinate parameters to the robot arm.

Fig. 18. Test platform and coordinate axes

$\begin{bmatrix} x_{robot} \\ y_{robot} \end{bmatrix} = [R]\begin{bmatrix} u \\ v \end{bmatrix} + [T]$  (28)
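A compact sketch of this last step is given below: the planar grip coordinates are mapped through an assumed 2D rotation-plus-translation of the form of (28) and sent to the arm controller over UDP. The IP address, port, and message format are placeholders; the actual Techman arm interface used in this paper is not reproduced here.

```python
import json
import socket
import numpy as np

def to_robot_frame(u: float, v: float, R: np.ndarray, T: np.ndarray) -> np.ndarray:
    """Eq. (28): map a planar grip coordinate (u, v) into robot arm coordinates."""
    return R @ np.array([u, v]) + T

# Placeholder 2D rotation/translation obtained from the calibration step.
R = np.array([[0.0, -1.0],
              [1.0, 0.0]])
T = np.array([0.350, -0.120])            # meters, illustrative

x_robot, y_robot = to_robot_frame(0.215, 0.080, R, T)

# Send the grip command over UDP; address, port, and payload format are assumptions.
msg = json.dumps({"x": float(x_robot), "y": float(y_robot),
                  "theta": 35.0, "depth": 0.04}).encode()
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(msg, ("192.168.0.10", 5005))
sock.close()
```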
III. EXPERIMENTAL RESULTS

A. Coordinate conversion and positioning experiment

Fig. 19. The left diagram shows the calibration block placement, the coordinate axes, and the position of the origin of the calibration world coordinates; the right diagram shows the six pixel-to-world coordinate correspondences of the PnP solution.

This paper uses PnP to solve the rotation and translation matrices between the pixel plane coordinate system and the world coordinate system; the transformation matrices are then used to convert the world coordinates to the arm coordinate system. In order to verify the accuracy of the conversions among the three coordinate systems, this experiment is designed to analyze their error relationships. A 6 cm × 6 cm correction square was designed for this purpose, as shown in Fig. 19 (left). The lower-left corner of the test platform is set as the origin of the world coordinates, and the square is aligned with this origin, as shown in Fig. 19 (right). Six sets of pixel points and the camera's internal orientation parameters are brought in to solve the PnP problem, the error value between the real world coordinates and the world coordinates obtained by PnP is calculated, the solved world coordinates are converted to the arm coordinate system, and the arm is used for validation. The results are shown in the following charts.

After the transformation matrix is obtained through the PnP solution, the square is arbitrarily placed at four positions, and three points are arbitrarily selected at each position for arm coordinate transformation; the error comparison is performed against the actual touch position of the arm.
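The per-point error reported in the following tables is essentially the difference between the commanded arm coordinate and the actually touched position; a trivial helper such as the one below (an assumed formulation) computes the per-axis and Euclidean errors.

```python
import numpy as np

def positioning_error(predicted: np.ndarray, touched: np.ndarray) -> dict:
    """Per-axis and Euclidean error (in millimeters) between the arm coordinate
    computed by the system and the position actually touched by the arm."""
    diff = predicted - touched
    return {"dx": float(diff[0]), "dy": float(diff[1]), "dz": float(diff[2]),
            "euclidean": float(np.linalg.norm(diff))}

# Placeholder values in millimeters, not measured data.
print(positioning_error(np.array([352.1, 118.4, 25.0]),
                        np.array([350.0, 120.0, 25.5])))
```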
⚫ Place 1

Fig. 20. Place 1

Tab. 2. Place 1 Data Analysis

⚫ Place 2

Fig. 21. Place 2

Tab. 3. Place 2 Data Analysis

⚫ Place 3

Fig. 22. Place 3

Tab. 3. Place 3 Data Analysis

⚫ Place 4

Fig. 23. Place 4

Tab. 4. Place 4 Data Analysis

B. Object touch point positioning experiment

From the coordinate transformation and positioning experiment, we can accurately transform the pixel plane coordinate system to the arm coordinate system through the coordinate transformations. This experiment is designed to verify whether the gripping pose coordinates generated in this paper are accurate. In this experiment, two objects from each of the four categories in the type classifier (eight objects in total) are used. The objects are arbitrarily placed on the test platform in three positions, and the calculated real-world coordinates are compared with the ground truth, where the ground truth uses our human senses to determine the best clamping position for the current object placement. The results are shown in the following list.

position for the current object placement. The results are


shown in the following list.
⚫ Long Object

Fig. 29. Columnar Object 2


Tab. 10. Columnar Object 2 Data Analysis

Fig. 24. Long Object 1


Tab. 5. Long Object 1 Data Analysis

⚫ Blade Object

Fig. 30. Blade Object 1


Tab. 11. Blade Object 1 Data Analysis

Fig. 25. Long Object 2


Tab. 6. Long Object 2 Data Analysis

⚫ Circle Object
Fig. 31. Blade Object 2
Tab. 12. Blade Object 2 Data Analysis

Fig. 26. Circle Object 1


Tab. 7. Circle Object 1 Data Analysis

C. Object Grip experiment


After the above experiment converts the generated
coordinates of the obtained gripping posture into the robotic
arm and touches it by the robotic arm, we will use the robotic
arm to verify the gripping position and whether the object can
be gripped. The experimental design is as follows. The
robotic arm uses Techman Robot Arm(TM5). The gripping
Fig. 27. Circle Object 2 task was dispatched, and a simple gripper was designed using
Tab. 8. Circle Object 2 Data Analysis 3Dprint as a verification tool, as shown in Fig. 32. Select two
objects from each of the four categories of the type classifier
to perform the gripping task (8 objects in total), and the
system calculates the moving coordinates of the gripping and
uses UDP to transmit the coordinate points to the robot arm
to perform the gripping action. They are indicated in the table.
⚫ Columnar Object ⚫ Long Object
Tab. 8. Circle Object 2 Data Analysis

Fig. 28. Columnar Object 1


Tab. 9. Columnar Object 1 Data Analysis
Fig. 32. Long object grip pose generation
Tab. 13. Long object Data Analysis

Fig. 28. Columnar object 1 grip pose generation

Fig. 28. Object grip process

⚫ Circle Object

Fig. 33. Circle object grip pose generation

Tab. 14. Circle object Data Analysis

⚫ Columnar Object

Fig. 34. Columnar object grip pose generation

Tab. 15. Columnar object Data Analysis

⚫ Blade Object

Fig. 35. Blade object grip pose generation

Tab. 16. Blade object Data Analysis

IV. CONCLUSIONS AND FUTURE WORKS

This paper proposes a contour capture generation design for multi-target objects in a complex environment based on a vision system. Data augmentation and deep image matting with background replacement are used to obtain a good training sample set for the model training of the type classifier, and the ZED 2i depth camera is used to capture the images classified by the shape classifier. The gripping area identification then selects the appropriate gripping position for the different categories and obtains the plane coordinate parameters of the best gripping area, which are converted into the three-dimensional robot arm coordinate system. In order to verify the feasibility of this system, the three-dimensional object clamping position positioning experiment is carried out, the two-dimensional plane coordinate to three-dimensional robot arm coordinate conversion experiment is carried out using the verification block, and the generated gripping poses are used to verify the gripping coordinate error. Finally, the actual arm gripping experiment is carried out. The above experimental results show that the system can identify and grip the objects, and the specified coordinate parameters can be obtained through this system to complete the gripping task of the robotic arm.

In future work, in addition to collecting more types of objects with more unique appearances for gripping identification, the ability to identify the gripping positions of multiple objects simultaneously can also be added, providing more convenience and feasibility for gripping.

V. REFERENCES

[1] Oztemel, E., & Gursev, S. (2020). Literature review of Industry 4.0 and related technologies. Journal of Intelligent Manufacturing, 31(1), 127–182.
[2] C. Rose, J. Britt, J. Allen and D. Bevly, "An Integrated Vehicle Navigation System Utilizing Lane-Detection and Lateral Position Estimation Systems in Difficult Environments for GPS," IEEE Transactions on Intelligent Transportation Systems, vol. 15, no. 6, pp. 2615-2629, 2014.

[3] Birglen, L., & Schlicht, T. (2018). A statistical review of industrial


robotic grippers. Robotics and Computer-Integrated Manufacturing,
49, 88–97.
[4] Zhang, H., Tang, J., Sun, S., & Lan, X. (2022). Robotic grasping from
classical to modern: A survey. In arXiv [cs.RO].
[5] Zhao, Y., Gong, L., Huang, Y., & Liu, C. (2016). A review of key
techniques of vision-based control for harvesting robot. Computers
and Electronics in Agriculture, 127, 311–323.
[6] Yin, Z., & Li, Y. (2022). Overview of robotic grasp detection from 2D to 3D. Cognitive Robotics, 2, 73–82.
[7] Joshi, R. C., Joshi, M., Singh, A. G., & Mathur, S. (2018). Object
detection, classification and tracking methods for video surveillance:
A review. 2018 4th International Conference on Computing
Communication and Automation (ICCCA), 1–7.
[8] S. Lin, A. Ryabtsev, S. Sengupta, B. Curless, S. Seitz, and I.
Kemelmacher-Shlizerman, "Real-Time High-Resolution
Background Matting," 2021 IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), 2021, pp. 8758-8767
[9] 林妤璟 (Y.-C. Lin), "Depth-image gripping pose and position recognition of a robot arm imitating the human object-fetching strategy," Master's thesis, Department of Electrical Engineering, National Taiwan University of Science and Technology, 2020.
[10] Du, G., Wang, K., Lian, S., & Zhao, K. (2021). Vision-based robotic
grasping from object localization, object pose estimation to grasp
estimation for parallel grippers: a review. Artificial Intelligence
Review, 54(3), 1677–1734.
[11] Caldera, S., Rassau, A., & Chai, D. (2018). Review of deep learning
methods in robotic grasp detection. Multimodal Technologies and
Interaction, 2(3), 57.
[12] Kumra, S., & Kanan, C. (2017). Robotic grasp detection using deep
convolutional neural networks. 2017 IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS), 769–776.
[13] S. Takada, S. Matsumoto, and T. Matsushita, “Estimation of whole-
body thermal sensation in the non-steady state based on skin
temperature,” Build. Environ., vol. 68, pp. 123-133, 2013.
[14] Joshi, R. C., Joshi, M., Singh, A. G., & Mathur, S. (2018). Object
detection, classification and tracking methods for video surveillance:
A review. 2018 4th International Conference on Computing
Communication and Automation (ICCCA), 1–7.
[15] Lenz, I., Lee, H., & Saxena, A. (2015). Deep learning for detecting
robotic grasps. The International Journal of Robotics Research, 34(4–
5), 705–724.
[16] Redmon, Joseph, and Anelia Angelova., “Real-time grasp detection
using convolutional neural networks,” 2015 IEEE International
Conference on Robotics and Automation (ICRA), pp. 1316-1322,
2015.
[17] He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-
CNN. ArXiv [Cs.CV]
[18] Forte, Marco, and François Pitié. "F, B, Alpha Matting." arXiv preprint arXiv:2003.07711, 2020.
[19] S. Lin, A. Ryabtsev, S. Sengupta, B. Curless, S. Seitz and I.
Kemelmacher-Shlizerman, "Real-Time High-Resolution
Background Matting," 2021 IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), 2021, pp. 8758-8767
[20] Zhao, Z.-Q., Zheng, P., Xu, S.-T., & Wu, X. (2018). Object
detection with deep learning: A review. In arXiv [cs.CV].
[21] Decheng Wang, Xiangning Chen, Hui Yi, and Feng Zhao,
“Improvement of Non-Maximum Suppression in RGB-D Object
Detection,” IEEE Access, vol. 7, pp. 144134-144143, 2019
[22] Posheng Huang, Chin-Han Shen, and Hsu-Feng Hsiao., “Rgbd salient
object detection using spatially coherent deep learning framework,”
2018 IEEE 23rd International Conference on Digital Signal
Processing (DSP), pp. 1-5 , 2018.
[23] Tan, M., Pang, R., & Le, Q. V. (2019). EfficientDet: Scalable and
Efficient Object Detection. ArXiv [Cs.CV].
https://doi.org/10.48550/ARXIV.1911.09070
[24] Grompone von Gioi, R., Jakubowicz, J., Morel, J.-M., & Randall,
G. (2012). LSD: a Line Segment Detector. Image Processing on
Line, 2, 35–55.
[25] Bao, P., Zhang, L., & Wu, X. (2005). Canny edge detection
enhancement by scale multiplication. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 27(9), 1485–1490.
[26] Bochkovskiy, A., Wang, C.-Y., & Liao, H.-Y. M. (2020). YOLOv4:
Optimal speed and accuracy of object detection. ArXiv [Cs.CV].
[27] OpenCV: Perspective-n-Point (PnP) pose computation. (n.d.). Opencv.org. Retrieved June 18, 2022.
