Object gripping algorithm for robotic assistance by means of deep learning

ABSTRACT

Another example is presented in [15], where CNN regression is used to estimate the joint velocities that a robot needs in order to throw and catch objects. DL techniques are also integrated into robot control through object recognition [16], which enables applications oriented to object gripping, such as the garment-gripping application presented in [17]. More specialized applications involve human-robot interaction [18], where DL is used to identify the intention of movement. For applications in which robots act as assistants in integrated environments, it is necessary to identify the object of interest and grasp it, a task for which CNNs have already shown their versatility [19,20]. This work presents an advance in the use of DL for the recognition and gripping of objects in multi-object environments oriented to assistive robots, using recent variations of conventional CNN architectures such as Fast R-CNN and CNN regression [21].
CNNs are usually used to detect objects to be grasped by tweezer-shaped robotic end effectors, as shown in [22,23]. However, this type of effector produces an unstable grip and requires additional time to find the best grip, close to the object's center of gravity. This article presents the development of a new algorithm based on Faster R-CNN and CNN regression that gives robotic agents equipped with three-finger grippers the ability to grasp objects in their environment. The three-finger gripper increases grip stability, but training for it involves additional complexity, which is the reason for using both kinds of CNN. This problem arises from the need to integrate an efficient gripping method with previous works based on human-robot interaction [24] with assistant robots [25].
This article is divided into three sections. The first section presents the application environment, focusing on the acquisition and adaptation of the databases and on the neural architecture implemented to detect the elements to be grasped. The second section presents the graphical user interface that facilitates database acquisition, the training of the networks and the results obtained in a virtual environment. Finally, the conclusions of the developed system and possible improvements for future work are presented.

RESEARCH METHOD
To evaluate a DL-based grip algorithm using CNNs, three objects are placed in a virtual environment. Since the aim is for a robotic agent to use a gripper, the characteristics of the objects to be gripped must be defined. To generalize the geometries, two types of objects are proposed. The first type has infinitely many axes of symmetry and is represented by two objects: a cylinder and a toroid. The second type has a finite number of axes of symmetry and is represented by a parallelepiped. For an object with a finite number of axes of symmetry, a change in its orientation about the Z axis could affect the way it is grabbed, so this case must be tested. With the objects defined, two databases are established: the first is used to train a network to detect the objects, and the second to estimate the angle by which the parallelepiped is rotated.

Database for networks training
Figure 1 shows some examples from the database for detection and localization, in which the position and orientation of the elements in the environment are varied. The database consists of 2195 RGB images with a resolution of 224x224 pixels, of which 200 are set aside for evaluation after training and the remaining 1995 are used to train the network. For the network trained to estimate the angle of rotation, two databases were established, one RGB and one binary, in order to evaluate which of the two leads to better learning. From the previous database, 2000 images were taken, and the regions containing the parallelepiped were extracted from them. Once the regions are obtained, borders are added to the images of both databases to make each image square. This is done because the network input size cannot vary, so all images must be resized to the same size. A size of 50x50 pixels is set for the images, which encompasses the size of the parallelepiped, as shown in Figure 2. Of the 2000 images, 10% (200 images) are used for testing after training.
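The preprocessing of the extracted regions can be sketched as follows. This is only a minimal illustration and not the implementation used in this work: it assumes that a region is stored as a list of pixel rows, that the added borders are black and centered, and that nearest-neighbor resampling is used, none of which is specified in the text. The function names are hypothetical.

```python
def pad_to_square(image, fill=0):
    """Add borders so that height == width, keeping the region centered.

    `image` is a list of rows; each pixel may be a number (binary image)
    or a tuple such as (r, g, b). The fill value is an assumption.
    """
    height = len(image)
    width = len(image[0])
    side = max(height, width)
    top = (side - height) // 2
    left = (side - width) // 2
    square = [[fill] * side for _ in range(side)]
    for r in range(height):
        for c in range(width):
            square[top + r][left + c] = image[r][c]
    return square


def resize_nearest(image, size=50):
    """Resize a square image to size x size with nearest-neighbor sampling."""
    side = len(image)
    return [
        [image[r * side // size][c * side // size] for c in range(size)]
        for r in range(size)
    ]


def prepare_crop(image, size=50, fill=0):
    """Pad a detected region to a square and resize it to the network input."""
    return resize_nearest(pad_to_square(image, fill), size)
```

For an RGB region, the same functions can be called with `fill=(0, 0, 0)`; the result is always a 50x50 image regardless of the shape of the extracted region.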

Figure 2. (a) RGB database. (b) Binary database.

DL architectures
The proposed architecture is divided into two parts. In the first part, a Faster R-CNN is implemented to detect and locate the elements in the environment. In the second part, a CNN with a regression layer is proposed to estimate the angle by which the parallelepiped is rotated, as shown in Figure 3. Unlike a conventional CNN, the Faster R-CNN has an RPN (region proposal network), which generates boxes in the image, called anchors, and identifies those in which an object could exist. The learned features are pooled by a RoI pooling layer and then passed through fully connected layers, which learn from the extracted features and finally identify the category to which each detected object belongs. The VGG16 network [26] is used as the base architecture of the Faster R-CNN. Then, of the regions detected by the Faster R-CNN, only the one corresponding to the parallelepiped is extracted and adjusted as described for the CNN regression database. The regression architecture consists of four convolutional layers that learn the features of the parallelepiped, followed by fully connected layers and, finally, the regression layer that estimates the angle. Given the geometry of the object, the angle of rotation lies between 0° and 180°. The maximum angle is therefore set at 175°, and the range between 175° and 180° is taken as equal to zero, which amounts to accepting an error of up to 5° in that range.
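One possible reading of the angle convention above is sketched below. It is an assumption-laden illustration rather than the code used in this work: the rotation is wrapped to the 180° period of the parallelepiped, and any value above the 175° maximum is mapped to zero. The helper `angular_error`, which compares two angles modulo 180°, is not described in the text and is included only to show why 178° and 2° should count as a 4° difference.

```python
def regression_target(angle_deg, max_angle=175.0, period=180.0):
    """Map a rotation about the Z axis to the label used for regression.

    Assumed convention: the angle is wrapped to [0, 180) and values above
    the 175 degree maximum are treated as 0.
    """
    wrapped = angle_deg % period
    return 0.0 if wrapped > max_angle else wrapped


def angular_error(predicted_deg, true_deg, period=180.0):
    """Smallest difference between two angles, modulo the 180 degree symmetry.

    Hypothetical helper, not part of the described method.
    """
    diff = abs(predicted_deg - true_deg) % period
    return min(diff, period - diff)
```

Under these assumptions, a parallelepiped rotated by 190° receives the same label as one rotated by 10°, and a rotation of 178° is labeled as 0°.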

RESULTS AND ANALYSIS

Networks results
The Faster R-CNN is trained with the training images, and the confusion matrix is calculated to assess its learning performance. Figures 4(a) and 4(b) present the confusion matrices for the training and test databases. The green diagonal represents the images that were correctly classified; the cells outside it represent those assigned to an incorrect category. Both the training database and the test database showed 100% accuracy in the classification of the images. Another factor to take into account when building a system for detecting elements in images is the Average Precision, since it indicates how well the boxes generated by the network overlap those set in the ground truth. For both databases (training and test), Average Precision values greater than 97% are obtained, i.e. the boxes generated by the network are reliable and will allow the robotic agent to move correctly to grab the objects.
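The two measures discussed above can be illustrated with a short sketch. The accuracy is the share of the confusion matrix that lies on its diagonal, and the overlap between a predicted box and a ground-truth box is commonly measured by the intersection over union (IoU), on which Average Precision is usually based. The box format (x_min, y_min, x_max, y_max) and the function names are assumptions, and the text does not state which overlap threshold was used.

```python
def accuracy(confusion):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Width and height of the overlapping rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0
```

A confusion matrix with all samples on the diagonal gives an accuracy of 1.0, which corresponds to the 100% reported for both databases.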
Figure 5 shows some examples of classification and detection by the Faster R-CNN. In some cases the elements to be identified are partially occluded, and the network is still able to detect them, although with lower confidence. The network was also tested by placing unknown objects in the scene; although some of them are similar to the trained elements, no false positives occurred.
For the CNN regression, the training parameters of the network were established through an iterative process, resulting in a batch size of 100, a learning rate of 1×10⁻⁶ and 100 training epochs, which allowed the network to estimate an angle close to the real one without overfitting. Figure 6 presents the box plots for the RGB and binary databases. The binary database obtained  10. Another factor to highlight is the mean error of both databases: 1.049° for the binary database and 0.769° for the RGB database. The binary database also showed a wider range of non-atypical error values, between -16.54° and 20.23°, compared with -11.76° to 8.49° for the RGB images. Therefore, it can be concluded that the RGB database performed better than the binary one.
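The quantities read from the box plots (mean error, range of non-atypical values and atypical values) can be computed as sketched below. The sketch assumes the common Tukey convention, in which values farther than 1.5 times the interquartile range from the quartiles are atypical; the text does not state which convention was used, and the function name is hypothetical.

```python
import statistics


def boxplot_summary(errors):
    """Summarize signed angle errors (in degrees) as a box plot would."""
    q1, median, q3 = statistics.quantiles(errors, n=4)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr   # assumed Tukey fences
    high_fence = q3 + 1.5 * iqr
    inliers = [e for e in errors if low_fence <= e <= high_fence]
    return {
        "mean": statistics.fmean(errors),
        "median": median,
        # Ends of the whiskers: extreme values that are not atypical.
        "whisker_low": min(inliers),
        "whisker_high": max(inliers),
        "outliers": [e for e in errors if e < low_fence or e > high_fence],
    }
```

Applied to the signed errors of each test set, `whisker_low` and `whisker_high` correspond to the range of non-atypical values compared above for the two databases.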

Tests in virtual environment
Several virtual environments are currently used for the simulation of robotic systems; the main ones include Gazebo and V-REP, whose main advantages and disadvantages are presented in [27]. V-REP is selected for its variety of sensors and its interfaces to different programming languages. Figure 9 shows an example of the virtual environment used, with an RRP robot that has a gripper as its end effector and the three types of objects in the environment (cylinder, parallelepiped and toroid), which are viewed through a camera mounted on the gripper. In [28], it is mentioned that using a virtual environment facilitates testing DL architectures for the detection of elements. In this way, when they are implemented in a real environment, it will not be necessary to train the network from scratch; only fine-tuning will be needed to adapt the networks to the real environment.
In the virtual environment, the images from the gripper camera are processed frame by frame by the Faster R-CNN, reaching up to 5 fps on a computer with a seventh-generation i7 processor, an NVIDIA GTX 960M GPU and 16 GB of RAM. Once the system is running, the user can select among the options "Cylinder", "Parallelepiped" and "Toroid" to tell the robotic agent which item to collect. Under the "Parallelepiped" option, the angle by which the element is rotated is shown. In case of significant errors in this angle, the "Acquisition" option builds a new database from the regions detected by the Faster R-CNN in which the parallelepiped is located, with which a new CNN can be trained to regress the orientation angle of the object, as shown in Figure 10. In Figure 11, some grip tests are shown for each object with two- and three-finger grippers. In general, in the tests carried out, the three-finger gripper allowed a greater margin of error than the two-finger gripper, but in some cases one of the fingers of the three-finger gripper may not perform a useful function during the grip, as shown in Figure 11, when the three-finger gripper grips the parallelepiped.

Figure 11. Examples of tests with two- and three-finger grippers
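The selection of the item to collect can be illustrated with a short sketch. It is hypothetical and not taken from the developed interface: detections are assumed to be dictionaries with the keys `label`, `score` and `box`, where the box is (x_min, y_min, x_max, y_max) in pixels, and the most confident detection of the requested class is assumed to be the target.

```python
def select_target(detections, requested_label):
    """Return the best detection of the requested class and its box center.

    Returns None when the requested object is not detected in the frame.
    """
    candidates = [d for d in detections if d["label"] == requested_label]
    if not candidates:
        return None
    best = max(candidates, key=lambda d: d["score"])
    x_min, y_min, x_max, y_max = best["box"]
    # Center of the box in image coordinates, a natural reference point
    # for moving the gripper towards the object.
    center = ((x_min + x_max) / 2, (y_min + y_max) / 2)
    return best, center
```

When the selected class is the parallelepiped, the region given by `best["box"]` would additionally be passed to the regression network to obtain the orientation angle.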

CONCLUSION
The Faster R-CNN implemented to detect the elements of interest in the environment achieved 100% accuracy in classifying the test database and an average precision above 97% in locating the boxes generated for each element, giving the robotic agent greater autonomy in executing its trajectories towards the objects to be collected. The CNN designed for regression estimated an angle close to the real one in most cases, but the box plot results show that, for a system requiring high precision and accuracy, errors in the gripping tasks may appear. For this reason, other possible databases will be evaluated in future developments to reduce the error obtained.
Taking into account the trajectories and grips executed by the robotic agent, it is proposed for future developments to calculate, by means of CNN regression, the coordinates of the approximate grip points for each of the gripper's fingers, so that all of them provide support during the task. The simulated comparison between two- and three-finger gripping showed that three fingers hold the object securely, whereas two fingers grip close to, but not at, the center of gravity, so the object is exposed to a possible fall. In every case, the object was successfully grabbed by the proposed algorithm.