
Available online at www.sciencedirect.com
Procedia Manufacturing 11 (2017) 423 – 430

27th International Conference on Flexible Automation and Intelligent Manufacturing, FAIM2017, 27-30 June 2017, Modena, Italy

A machine learning approach for visual recognition of complex parts in robotic manipulation

P. Aivaliotis, A. Zampetis, G. Michalos, S. Makris*

Laboratory for Manufacturing Systems & Automation, University of Patras, Rion Patras, 26504, Greece

Abstract

This research presents a method for the visual recognition of parts using machine learning services to enable the manipulation of
complex parts. Robotic manipulation of complex parts is a challenging application due to the uncertainty of the parts’ positioning
as well as the gripper’s grasping instability. This instability is caused by the non-symmetrical and complex geometries that may
result in a slightly variable orientation of the part after being grasped, which is outside the handling/assembly process tolerance.
To compensate for this, a visual recognition approach is implemented via classifiers. Finally, a case study focusing on the
manipulation of consumer goods is demonstrated and evaluated.

© 2017 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/4.0/).
Peer-review under responsibility of the scientific committee of the 27th International Conference on Flexible Automation and Intelligent Manufacturing.

Keywords: visual recognition, grasping instability, manipulation of complex parts

1. Introduction

Robotic manipulation is widely applied in manufacturing to simplify and speed up the production process. As the accuracy of robots increases, so does the range of fields in which they can be applied. For instance, surgical operations can now be performed via robotic systems that are more precise and steady than human hands [1]. Particularly in manufacturing, it is critical that any component be manipulated with high accuracy [2].

* Corresponding author. Tel.: +30-2610-910-160; fax: +30-2610-997-774. E-mail address: makris@lms.mech.upatras.gr

doi:10.1016/j.promfg.2017.07.130

Robotic grippers are used in the majority of grasping operations; however, despite their technological advancement in terms of actuation and sensing, they are still unable to steadily grasp objects with complex geometries.
In the past, a considerable amount of research has addressed the improvement of grippers' grasping ability and sensitivity. To this end, methodologies for evaluating the influence of perception capabilities on grasping and on gripper design, using graspability maps, have been developed [3]. Extreme learning machine (ELM) and support vector regression (SVR) methods have also been used to achieve higher manipulation accuracy [4]. Other attempts have focused on the development of geometry-based grasp quality measures, which can be determined directly from a 3D point cloud [5].
In contrast with the above research, which focuses on a part's pre-grasping phase, the presented approach focuses on identifying the position and orientation errors of the manipulated part after it has been grasped. In this way, the uncertainty caused by the gripper's instability is overcome and the robot's behavior can be adjusted to execute the manipulation task successfully. A method for visual recognition via machine learning services is used for this purpose, since such services enable fast and accurate identification of an object's orientation and position in 3D space.
A large number of approaches for visual recognition have been developed in the past. A critical step in attribute recognition is the extraction of low-level features, which encode the local visual characteristics in images and provide the representation used in the attribute prediction step [6, 7]. DeCAF is an open-source implementation of deep convolutional activation features, along with all associated network parameters, that enables experimentation with deep representations across a range of visual concept learning paradigms [8]. The use of Best Code Entries (BCE) and Mid-Level Classification Weights (MLCW) allows for the improvement of a representation's quality while maintaining real-time performance. BCE focuses entirely on sparsity and improves the stability and computational efficiency of the coding phase, while MLCW increases the discriminability of the visual representation and, therefore, the system's overall recognition accuracy, by exploiting data supervision [9].
Based on the above approaches, this paper focuses on the integration and implementation of a visual recognition service in industrial robotic manipulation, as described below. In Section 2, the approach is described in detail. In Section 3, the software and hardware implementation of this approach is presented on the basis of a real industrial pilot case. The same scenario has been implemented multiple times, changing the parameters of the visual recognition service each time; the resulting evaluation of the method is presented in Section 4.

2. Approach

The objective of this study is the robotic manipulation of complex parts, using a visual recognition service as an external tool for estimating a part's position and orientation after it has been grasped. The grasping phase is crucial for a successful manipulation, due to the part's possible pose deviations caused by the gripper's instability.

Fig. 1. Pose deviations of the part after grasping.



As depicted in Fig. 1 (a), the pose of the complex part before being grasped is already known and fixed, either because it is oriented by the production process or by means of another vision technique (e.g. a stereo camera). All of the part's dimensions are also considered known, so, given an ideal grasp, the robot should be aware of the part's exact pose. In reality though, as depicted in Fig. 1 (b) and (c), the gripper's instability, in combination with the part's complex geometry, results in a slightly variable orientation of the part after it has been grasped. Thus, the robot does not know the actual pose of the grasped part and cannot achieve the accuracy required to carry out the task. To overcome this issue, a vision system, in combination with a visual recognition service, was integrated into the robotic manipulation process. The procedure for the execution of the task using these technologies is depicted in Fig. 2.

Fig. 2. Robotic manipulation using visual recognition service.

The main machine learning services used for the implementation of the visual recognition are Convolutional Neural Networks (CNNs/ConvNets). ConvNets, similar to ordinary Neural Networks, consist of multiple layers of neurons with receptive fields. Initially, these neurons are trained on data labelled by a human. After the training phase, each neuron is ready to execute its task, which is the processing of a portion of unlabeled data, aiming to provide information about it as output. Eventually, all the outputs are overlapped based on their input regions in order to obtain a better representation of the unlabeled data. The same procedure is reiterated for every neuron and, finally, multiple labels are used for the representation of the input unlabeled data. These labels are also characterized by a score indicating the confidence of the answer. ConvNets combine the outputs of multiple neuron clusters, making them faster and more accurate than ordinary Neural Networks [10, 11, 12].
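The workflow above is framework-agnostic. Purely as an illustration of the training/classification pattern described (the study itself relies on a hosted service rather than a self-trained network, and the layer sizes, input resolution and class count below are assumptions), a small ConvNet classifier could be sketched as follows:

```python
# Minimal ConvNet sketch of the training/classification pattern described above.
# Layer sizes, input resolution and the 60-class setup are illustrative assumptions,
# not the configuration of the hosted service used in the paper.
import tensorflow as tf

NUM_CLASSES = 60            # e.g. 15 poses x 4 degrees of freedom, as in Section 4
IMG_SHAPE = (128, 128, 3)   # assumed input resolution

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=IMG_SHAPE),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax'),  # one confidence score per class
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Training would feed labelled images grouped per pose class (one folder per class);
# classifying an unlabeled image then returns one score per class via model.predict().
```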
The procedure of the visual recognition, based on ConvNets, is similar to that described above. Firstly, in the training phase, a set of images with the same characteristics is used to create a class, which is labeled by the user according to its common characteristic. Multiple classes are used for the training, resulting in a classifier. Secondly, using this classifier, an unlabeled image can be analyzed. Based on the confidence of the ConvNets, a score that depicts the matching rate of the unlabeled image to each class is calculated. The scores are calculated for each class independently, so an unlabeled image can match more than one class. Finally, assuming the service is accurate, the class with the highest score is the one that represents the correct label of the input image. The entire system's approach is depicted in Fig. 3.
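Assuming the classifier returns one confidence score per class (the exact response format depends on the service), selecting the final label reduces to taking the class with the maximum score, e.g.:

```python
# Hypothetical per-class confidence scores for one unlabeled image (values illustrative only).
scores = {'Rx=-1': 0.14, 'Rx=0': 0.21, 'Rx=1': 0.35, 'Rx=2': 0.98}

best_class = max(scores, key=scores.get)   # the class with the highest confidence
print('Predicted class: %s (score %.2f)' % (best_class, scores[best_class]))
```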
For the creation of the classes, sets of images should be captured by a camera. The configuration of the camera is defined in this phase and remains the same during the full manipulation process. In addition, the camera should be fixed, while the background of the objects should be as clear as possible to achieve high accuracy and matching scores. As regards the specifications of the camera, the shooting rate, high contrast, saturation and sharpness are the parameters that primarily increase the performance of the visual recognition. Each class requires at least 10 images, and a representative label should characterize them.
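As a sketch of how such class image sets could be collected with a fixed camera, each pose class could be stored in its own labelled folder. OpenCV, the folder layout and the class naming below are assumptions made for illustration; the paper does not name the capture tooling.

```python
# Sketch: capture N images of the part in a given pose into a labelled class folder.
# OpenCV, the folder layout and the class naming are illustrative assumptions.
import os
import cv2

def capture_class_images(class_label, num_images=30, camera_index=0, out_dir='training_images'):
    class_dir = os.path.join(out_dir, class_label)   # e.g. training_images/Rx_2
    if not os.path.isdir(class_dir):
        os.makedirs(class_dir)
    cap = cv2.VideoCapture(camera_index)             # fixed camera, fixed configuration
    try:
        for i in range(num_images):
            ok, frame = cap.read()
            if not ok:
                raise RuntimeError('Camera read failed')
            cv2.imwrite(os.path.join(class_dir, '%s_%02d.jpg' % (class_label, i)), frame)
    finally:
        cap.release()

# At least 10 images per class are required; 30 per class were used in the experiments.
# capture_class_images('Rx_2', num_images=30)
```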

Fig. 3. System approach.

3. System implementation

3.1. Software implementation

In this particular implementation of machine learning and visual recognition, IBM's Watson Visual Recognition service is utilized. IBM's Watson is a question-answering computing system, based on the ConvNets architecture, built to apply a wide range of machine learning technologies. It packs a large amount of processing power and memory, making it capable of 80 TeraFLOPs. In combination with its visual recognition service, it can provide very accurate and fast classification of the provided images. In addition, it offers the option of creating custom classes and classifiers; therefore, it can be easily integrated into the industrial robotic manipulation process presented here. The service is accessible through IBM's Bluemix platform, which also provides many other machine learning services, cloud services and virtual servers.
The training and classification phases of Watson's Visual Recognition service are similar to those of the procedure described in the approach. Initially, the required sets of images were created. Using a simple script (Python 2.7.12 in the experiments below), the sets of images were grouped into classes and uploaded to the IBM Bluemix platform as input to the training phase. As output of the training phase, the Watson Visual Recognition service created a classifier with a unique id. Then, a simple application, which can be executed either locally or on a Bluemix Virtual Server, allowed the classification of an unlabeled image using the created classifier. The Watson Visual Recognition service returned the matching scores of the unlabeled input image for each class. The label of the class with the highest score was the final result of the classification.
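A minimal sketch of these two phases using the Python SDK available for the Watson Visual Recognition v3 service at the time (watson-developer-cloud) is given below. The class names, archive names and API key are placeholders, and the exact method signatures varied between SDK releases, so this should be read as a sketch under those assumptions rather than the authors' actual code.

```python
# Sketch of training and classification with the Watson Visual Recognition v3 service.
# File names, class names and the API key are placeholders; SDK signatures varied by release.
import json
from watson_developer_cloud import VisualRecognitionV3

visual_recognition = VisualRecognitionV3('2016-05-20', api_key='YOUR_API_KEY')

# Training phase: each class is a zip archive of labelled images (e.g. 30 images per class).
with open('Rx_2.zip', 'rb') as rx2, open('Rx_1.zip', 'rb') as rx1:
    classifier = visual_recognition.create_classifier(
        'handle_pose',
        Rx_2_positive_examples=rx2,
        Rx_1_positive_examples=rx1)
classifier_id = classifier['classifier_id']   # unique id returned by the service

# Classification phase: upload one unlabeled image and read the per-class matching scores.
with open('grasped_handle.jpg', 'rb') as image_file:
    result = visual_recognition.classify(images_file=image_file,
                                         classifier_ids=[classifier_id],
                                         threshold=0.0)
print(json.dumps(result, indent=2))
```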

3.2. Hardware implementation

An industrial pilot case of robotic manipulation was set up for the demonstration of the specific approach. A robotic arm, equipped with a parallel gripper, was used for grasping a shaver's curved handle. The application is oriented towards the feeding of these complex parts. The grasping of the handle resulted in its slight
rotation due to its curved geometry. The manipulation of complex geometry parts is a challenging robotic application due to the variety of displacements and rotations, even of a few millimeters or degrees from the nominal position, that may arise during the grasping phase. Accordingly, the handle's position and orientation were changed and the robot controller was unable to acquire correct information about the object's pose in 3D space. For the calculation of the handle's new pose, the following procedure was executed. Firstly, the robot moved the handle to an exact point, into the vision area of a fixed camera, as explained above. An LED light source was mounted at an angle next to the camera, in order to provide the complex part with more powerful, uniform lighting. Secondly, an image was captured and uploaded to the Bluemix platform in order to be processed by the visual recognition service. Finally, the results of the classification were fed back to the robot in order to inform it of the handle's new pose.

Fig. 4. Demonstration of the presented approach.

The entire procedure of the complex part's manipulation was controlled by a master PC, which had the role of a communication node among the devices/services. TCP/IP was the communication protocol used for the implementation of this method. The software sequence diagram of the whole demonstration is presented in Fig. 5.

Fig. 5. Implementation procedure of the industrial scenario.
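The message format exchanged among the master PC, the recognition service and the robot controller is not specified in the paper; the sketch below only illustrates the relay role of the master PC over TCP/IP, with a hypothetical controller address and a plain-text pose message.

```python
# Sketch of the master PC acting as a TCP/IP communication node.
# The robot controller address and the message format are hypothetical.
import socket

ROBOT_ADDRESS = ('192.168.0.10', 30002)   # assumed robot controller endpoint

def send_pose_correction(pose_label):
    """Forward the classification result (e.g. 'Rx=2') to the robot controller."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.connect(ROBOT_ADDRESS)
        sock.sendall((pose_label + '\n').encode('ascii'))
        return sock.recv(64)               # short acknowledgement from the controller
    finally:
        sock.close()

# Example: after classification, the highest-scoring label is relayed to the robot.
# send_pose_correction('Rx=2')
```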



4. Experiments

For the evaluation of this approach, the above implementation was reiterated for different parameters of the visual recognition service, e.g. image resolution and number of photos in each class used for training. The Answer Scores and the Service Response Time of each experiment were calculated and are presented in this section.
In the first set of experiments, the number of classes and the number of images per class were kept constant, while the resolution of the vision system was varied. A photo of the handle after being grasped was captured and uploaded for classification. The experiments took place with 60 classes (15 classes for the rotation about the X-axis, 15 classes for the rotation about the Y-axis, 15 classes for the translation along the Z-axis and 15 classes for the translation along the X-axis) and 30 images per class. Due to the large number of classes, it is not feasible to present the scores for all of them. However, a representative example of the handle's classification with regard to its rotation about the X-axis (Rx) after grasping is presented in Table 1. Initially, before grasping, the handle was at its nominal position (Rx = Ry = Rz = 0°, Dx = Dy = Dz = 0).
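Since each class encodes a single pose offset (e.g. a 2-degree rotation about the X-axis), the winning label can be mapped back to a correction for the robot. A minimal sketch follows, assuming a hypothetical 'axis=value' label convention that is not prescribed by the paper.

```python
# Sketch: translate the winning class label into a pose correction for the robot.
# The 'axis=value' label convention is an assumption made for illustration.
def label_to_correction(label):
    axis, value = label.split('=')           # e.g. 'Rx=2' -> ('Rx', '2')
    correction = {'Rx': 0.0, 'Ry': 0.0, 'Dx': 0.0, 'Dz': 0.0}
    correction[axis] = float(value)          # degrees for rotations, millimeters for translations
    return correction

print(label_to_correction('Rx=2'))   # {'Rx': 2.0, 'Ry': 0.0, 'Dx': 0.0, 'Dz': 0.0}
```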

Fig. 6. Captured image for classification.

Table 1. Experiments with different image resolution.


Image Resolution | Answer Score, Class 1 (Rx = -1) | Answer Score, Class 2 (Rx = 0) | Answer Score, Class 3 (Rx = 1) | Answer Score, Class 4 (Rx = 2) | Service Response Time
100% | 0.14 | 0.21 | 0.35 | 0.98 | 4 sec
85% | 0.14 | 0.22 | 0.38 | 0.96 | 3.7 sec
63% | 0.19 | 0.24 | 0.40 | 0.89 | 3.1 sec
41% | 0.26 | 0.30 | 0.42 | 0.81 | 2.4 sec
15% | 0.35 | 0.38 | 0.47 | 0.72 | 1.6 sec
10% | 0.40 | 0.42 | 0.48 | 0.67 | 1.3 sec
5% | 0.47 | 0.49 | 0.51 | 0.55 | 1.5 sec

Assuming that only these classes are taken into consideration, the class with the highest Answer Score was Class 4 in all cases; the rotation of the handle after grasping was therefore 2 degrees (Rx = 2). The Answer Score depended on the image resolution: as the resolution decreased, the Answer Score decreased as well. However, once the Service Response Time is also taken into account, the highest resolution is not necessarily the best choice for classifying an image. If there is no time limit for the manipulation task, the highest image resolution can be used. In addition, at the lowest image
resolution, the classification was still successful, but the Answer Score values for all classes were very close, creating a risk of misclassification.
In fact, the image resolution is not the only parameter that affects the Answer Score and the Service Response Time. In the second set of experiments, the above procedure was repeated, this time keeping the number of classes and the image resolution constant while varying the number of images per class. The experiments took place with 60 classes and 90% image resolution; the effect of the different number of images per class is depicted in Table 2.

Table 2. Experiments with different number of images in each class.


Number of images per class | Answer Score, Class 1 (Rx = -1) | Answer Score, Class 2 (Rx = 0) | Answer Score, Class 3 (Rx = 1) | Answer Score, Class 4 (Rx = 2) | Service Response Time
30 | 0.14 | 0.21 | 0.35 | 0.98 | 1.4 sec
25 | 0.14 | 0.22 | 0.37 | 0.95 | 1.2 sec
20 | 0.16 | 0.24 | 0.37 | 0.93 | 1.12 sec
15 | 0.16 | 0.25 | 0.39 | 0.92 | 0.95 sec
10 | 0.17 | 0.25 | 0.39 | 0.92 | 1.5 sec

In the same way, assuming that only these classes are taken into consideration, it appears that the number of images in each class mainly affects the Service Response Time, while the Answer Scores are very close across all experiments. More specifically, keeping the same image resolution, the response time and the score of the correct answer depend on the number of images in each class: the more images used per class, the higher the response time of the service and the score of the correct answer. However, 15 images per class appears to be a lower limit for the response time of the service; if fewer than 15 images per class are used, the response time increases again. Taking the above experiments into consideration, the best selection of the service's parameters can be made. In all cases, the classification of the image was successful, while the different experiments show the effect of the visual recognition service's parameters on the Answer Scores and the Service Response Time.

5. Conclusions

The approach presented can be used as a good solution for overcoming the possible uncertainties caused by the gripper's instability during the execution of a grasping task. The direct benefits of the discussed approach for the robotic manipulation of a complex part are briefly discussed. First, the manipulation of a complex part can be managed by using the visual recognition service as a tool to compensate for the grasping position error. The possible slight rotation of a complex part, due to its non-symmetrical and complex geometry, can be managed. The flexibility of changing the visual recognition service's parameters is another benefit, since the user can adjust them depending on the available time and the desired matching score.
Future work will focus on increasing the Answer Scores of the classified images while simultaneously reducing the service response time. It is expected that the application will become more accurate and capable of recognizing images with a complex background, too. Finally, future work will aim at increasing the number of classifiers, as well as at combining them in order to enable the simultaneous classification of more complex parts with multiple vision systems.

Acknowledgements

The work reported in this paper has been partially funded by the EC research project “VERSATILE – Innovative
robotic applications for highly reconfigurable production lines” (Grant Agreement: 731330) [13].

References

[1] Huu M. L., Thanh N. D., Soo J. Ph. A survey on actuators-driven surgical robots. Sensors and Actuators A: Physical 2016;247:323-354.
[2] Chen F., Cannella F., Canali C., Hauptman T., Sofia G., Caldwell D. In-Hand Precise Twisting and Positioning by a Novel Dexterous Robotic Gripper for Industrial High-Speed Assembly. IEEE International Conference on Robotics and Automation 2013:270-275.
[3] Eizicovits D., Tuijl B., Berman S., Edan Y. Integration of perception capabilities in gripper design using graspability maps. Biosystems Engineering 2016;146:98-113.
[4] Petković D., Danesh A. S., Dadkhah M., Misaghian N., Shamshirband S., Zalnezhad E., Pavlović N. D. Adaptive control algorithm of flexible robotic gripper by extreme learning machine. Robotics and Computer-Integrated Manufacturing 2016;37:170-178.
[5] Eizicovits D., Berman S. Efficient sensory-grounded grasp pose quality mapping for gripper design and online grasp planning. Robotics and Autonomous Systems 2014;62:1208-1219.
[6] Danaci E. G., Ikizler-Cinbis N. Low-level features for visual attribute recognition: An evaluation. Pattern Recognition Letters 2016;84:185-191.
[7] Zhang C., Xue Z., Zhu X., Wang H., Huang Q., Tian Q. Boosted random contextual semantic space based representation for visual recognition. Information Sciences 2016;369:160-170.
[8] Donahue J., Jia Y., Vinyals O., Hoffman J., Zhang N., Tzeng E., Darrell T. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. International Conference on Machine Learning (ICML) 2014.
[9] Fanello S. R., Ciliberto C., Noceti N., Metta G., Odone F. Visual recognition for humanoid robots. Robotics and Autonomous Systems 2016.
[10] Krizhevsky A., Sutskever I., Hinton G. E. ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems 2012;25:1097-1105.
[11] Zeiler M. D., Fergus R. Visualizing and Understanding Convolutional Networks. Computer Vision – ECCV 2014;8689:818-833.
[12] Shelhamer E., Long J., Darrell T. Fully Convolutional Networks for Semantic Segmentation. Computer Vision and Pattern Recognition 2016:3431-3440.
[13] EU VERSATILE Project. http://versatile-project.eu/
