Human-machine interactions based on hand gesture recognition using deep learning methods
Int J Elec & Comp Eng, Vol. 14, No. 1, February 2024: 741-748, ISSN: 2088-8708

ABSTRACT


INTRODUCTION
Human-machine interaction is an important aspect of modern information technology. Hand gesture recognition offers a convenient and natural way for users to interact with computers and electronic devices, and deep learning is among the most effective and promising approaches to it. In recent years, deep learning has led to significant breakthroughs in pattern recognition [1], computer vision [2]-[4], and natural language processing [5]-[8]. These methods automatically extract high-level features from complex data, which makes them especially attractive for hand gesture recognition problems [9]-[11]. This study is devoted to the research and development of a human-machine interaction technology based on hand gesture recognition using deep learning methods. The main goal is to create an efficient and accurate hand gesture recognition system that allows users to interact with computers and other devices in a natural and intuitive way.
To evaluate the performance of the developed models, experiments were carried out on various data sets with real hand gestures. The system's accuracy, speed, and stability are evaluated to determine its effectiveness and applicability in various scenarios. The tasks that can be addressed with hand gesture recognition based on deep learning are diverse: gesture recognition itself, mouse cursor control, interaction with virtual objects, and control of robots and drones. This research is relevant to many areas of human activity, including natural interaction and real-world applications, and it examines how deep learning can improve the accuracy and efficiency of such systems. In general, hand gesture recognition technology [12]-[15] has significant potential and relevance in the modern world. Its development and application can lead to innovative products and an improved user experience, which makes this topic important for research and development in the field of computer technology.
Tussupov et al. [16] present a learning-based approach to denoising that does not require clean data. The authors showed that deep neural networks can be trained on pairs of noisy images without clean training targets. Lehtinen et al. [17] propose a method for blind deblurring (blur removal) of images using conditional generative adversarial networks (cGANs). The authors demonstrated that deep learning with cGANs can successfully recover images with different blur levels. Zhang et al. [18] proposed the residual dense network (RDN) for the problem of image super-resolution. RDN combines densely connected blocks and residual blocks to achieve strong results in super-resolution problems. Huang et al. [19] present a method called Noise2Void, which allows a deep neural network to be trained for denoising from single noisy images without clean data. This provides an efficient solution for object recognition in noisy images. Zhang and Oh [20] present a deep convolutional network that denoises images rendered by Monte Carlo path tracing. This method is mainly applied in computer graphics, but it can be adapted to other problems of object recognition in noisy images.
Zhang et al. [21] propose a deep learning model for processing noisy and dark images, which is an important aspect of object recognition in low-light conditions. Zhang et al. [22] explore the application of networks with deep residual blocks and an attention mechanism to the problem of image super-resolution. Improving image resolution can also be an important step in object recognition in noisy images. Izadi et al. [23] present a deep convolutional network capable of denoising images rendered by Monte Carlo path tracing. This method is mainly used in computer graphics, but it can be adapted to other problems of object recognition in noisy images. Lee and Jeong [24] present the Noise2Self method, which uses self-supervision to train a deep neural network on noisy images without the need for clean data. This approach also demonstrates good results in denoising and object recognition problems. Baguer et al. [25] present the deep image prior method, which makes it possible to recover clean images from noisy data without training. This approach is based on the properties of the deep neural network architecture. In study [26], a deep learning model was proposed for processing noisy and dark images, which is an important aspect of object recognition in low-light conditions.

METHODS
The use of MediaPipe hand recognition for human-machine interaction based on hand gesture recognition provides new opportunities for creating intuitive, interactive, and user-friendly interfaces, which contributes to the development of more advanced and smart systems. MediaPipe hand recognition is powered by deep neural networks trained on large datasets containing many real-life images and videos with a variety of hand gestures and finger positions. The trained models are used to locate and identify key points on the hand, such as the fingertips, knuckles, and wrist. MediaPipe hand recognition can process both static images and real-time video streams, which makes it useful for applications such as hand gesture recognition in interactive systems, virtual and augmented reality, mouse cursor and interface control using hand movements, and collaborative applications where users draw, write, or interact with content using hand gestures. In the presented work, the MediaPipe Hands library was used to detect key points on the hands in a webcam video stream. Specifically, the coordinates of the tip of the index finger in the image are determined and then used to control the position of the mouse cursor on the computer screen.
Human-machine interaction based on hand gesture recognition using deep learning methods such as convolutional neural networks (CNN) and long short-term memory (LSTM) recurrent neural networks is a promising area of research and development. The considered CNN and LSTM methods make it possible to create more efficient models for processing video streams with hand gestures and analyzing gesture sequences, which improves the accuracy and reliability of the recognition system. In this context, CNNs are used to process images of hand gestures and extract meaningful features. CNNs are particularly efficient at processing visual information and automatically extract characteristic patterns and structures in images. For successful gesture recognition, CNNs can be used to classify hand images into different categories of gestures.
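The cursor-control step mentioned above, in which the index fingertip position drives the mouse cursor, can be sketched as follows. MediaPipe Hands reports each landmark with x and y normalized to [0, 1] relative to the image width and height, and the index fingertip is landmark 8. The function name `landmark_to_screen` and the `margin` parameter are illustrative assumptions and are not taken from the original implementation.

```python
INDEX_FINGER_TIP = 8  # MediaPipe Hands landmark index of the index fingertip


def landmark_to_screen(x_norm, y_norm, screen_w, screen_h, margin=0.1):
    """Map a normalized landmark position to integer screen pixels.

    `margin` (hypothetical parameter, must be below 0.5) trims the outer
    part of the camera frame so the cursor can reach the screen edges
    without the hand leaving the field of view.
    """
    span = 1.0 - 2.0 * margin
    # Rescale the inner region [margin, 1 - margin] to [0, 1] and clamp.
    u = min(max((x_norm - margin) / span, 0.0), 1.0)
    v = min(max((y_norm - margin) / span, 0.0), 1.0)
    # Keep the result inside the valid pixel range [0, size - 1].
    return int(round(u * (screen_w - 1))), int(round(v * (screen_h - 1)))
```

With `margin=0.0` the mapping reduces to a plain rescaling of the normalized coordinates to the screen size. A non-zero margin is one possible design choice for making the screen borders easier to reach.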
Long short-term memory recurrent neural networks, in turn, are applied to analyze sequences of hand gestures. LSTMs account for context and dependencies between successive video frames, which makes them suitable for analyzing temporal data such as hand gestures. This makes it possible to create models that capture the dynamics of gestures and understand the sequence of actions. The approach finds application in fields such as smart device control, medical applications, virtual and augmented reality, and interactive systems. The use of CNNs and LSTMs in hand gesture recognition helps create more accurate and reliable human-machine interaction systems, making the interaction more natural and comfortable for users.
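To make the idea of analyzing successive frames concrete, the sketch below groups per-frame features into overlapping windows, the form in which a sequence model such as an LSTM typically receives its input (samples, time steps, features). The function name `make_sequences` and the window length are assumptions chosen for illustration; the source does not state how sequences were formed.

```python
def make_sequences(frames, targets, window):
    """Build overlapping windows of consecutive frames for a sequence model.

    frames  : list of per-frame feature lists, e.g. [x_tip, y_tip]
    targets : list of per-frame labels, aligned with `frames`
    window  : number of consecutive frames per sample (assumed value)

    Returns (X, y): X[i] holds `window` consecutive frames and y[i] is the
    target of the last frame in that window.
    """
    if len(frames) != len(targets):
        raise ValueError("frames and targets must have the same length")
    X, y = [], []
    for end in range(window, len(frames) + 1):
        X.append(frames[end - window:end])
        y.append(targets[end - 1])
    return X, y
```

For example, four frames with a window of three yield two samples, each covering three consecutive frames. Fewer frames than the window length yield no samples.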
The process of collecting data for training the CNN and LSTM models begins with initializing the window and camera: the size of the video stream display window is set and the camera is configured. Hand recognition is then initialized using the MediaPipe library, with parameters such as the maximum number of hands to detect and the detection confidence level. In the main video stream processing loop, each frame is captured, flipped, converted to RGB format, and passed to the hand recognition model. If hands are detected, the coordinates of the index finger and thumb tips are extracted, and the distance between them is calculated to determine gestures. The index fingertip coordinates are converted to screen coordinates, and the PyAutoGUI library is used to control the mouse cursor and to perform the gesture actions. The data are collected and stored in a comma-separated values (CSV) file, "hand_movement_data.csv", for subsequent training of the CNN and LSTM models. The current action is displayed on the frame, and the process continues until the "q" key is pressed, at which point the camera is released, the windows are closed, and the data are saved.
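The collection loop described above can be sketched roughly as follows, using the `mp.solutions.hands` interface of MediaPipe together with OpenCV and PyAutoGUI. This is a minimal sketch under stated assumptions, not the authors' code: the pinch threshold, the confidence value, the action labels "click" and "move", and the CSV column names are all hypothetical.

```python
import csv
import math

THUMB_TIP, INDEX_FINGER_TIP = 4, 8  # MediaPipe Hands landmark indices
PINCH_THRESHOLD = 0.05              # assumed value, in normalized image units


def classify_gesture(thumb, index, threshold=PINCH_THRESHOLD):
    """Return 'click' when thumb and index fingertips are close, else 'move'."""
    distance = math.hypot(index[0] - thumb[0], index[1] - thumb[1])
    return "click" if distance < threshold else "move"


def collect(csv_path="hand_movement_data.csv", camera_index=0):
    """Capture frames, move the cursor, and log samples until 'q' is pressed."""
    # Imported lazily so classify_gesture works without a camera or display.
    import cv2
    import mediapipe as mp
    import pyautogui

    screen_w, screen_h = pyautogui.size()
    cap = cv2.VideoCapture(camera_index)
    hands = mp.solutions.hands.Hands(max_num_hands=1,
                                     min_detection_confidence=0.7)  # assumed
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        # Column names are an assumption about the file layout.
        writer.writerow(["index_x", "index_y", "cursor_x", "cursor_y", "action"])
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                break
            frame = cv2.flip(frame, 1)  # mirror so motion matches the user's view
            results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if results.multi_hand_landmarks:
                lm = results.multi_hand_landmarks[0].landmark
                thumb = (lm[THUMB_TIP].x, lm[THUMB_TIP].y)
                index = (lm[INDEX_FINGER_TIP].x, lm[INDEX_FINGER_TIP].y)
                action = classify_gesture(thumb, index)
                # Stay one pixel inside the screen: PyAutoGUI's fail-safe may
                # abort the program when the cursor reaches a screen corner.
                cx = min(max(int(index[0] * screen_w), 1), screen_w - 2)
                cy = min(max(int(index[1] * screen_h), 1), screen_h - 2)
                pyautogui.moveTo(cx, cy)
                writer.writerow([index[0], index[1], cx, cy, action])
                cv2.putText(frame, action, (10, 30),
                            cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
            cv2.imshow("Hand gesture capture", frame)
            if cv2.waitKey(1) & 0xFF == ord("q"):
                break
    hands.close()
    cap.release()
    cv2.destroyAllWindows()
```

Calling `collect()` starts the capture. The gesture rule is kept in a separate pure function so that the threshold can be tuned and checked without a camera.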
Thus, the whole process makes it possible to collect data for training CNN and LSTM models for hand gesture recognition and mouse cursor control based on hand position. Once the data are collected, these models can be trained and used to create an interactive application for controlling the mouse cursor using hand gestures. Training the neural networks includes several stages, beginning with data preparation. First, the data are loaded from the "hand_movement_data.csv" file using the Pandas library. The data are then divided into features (X) and labels (y), which are the coordinates of the fingertips and the coordinates of the mouse cursor, respectively. Finally, the data are scaled using MinMaxScaler so that feature and label values lie in the range 0 to 1.
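The data preparation stage can be illustrated with a dependency-free sketch. The source uses Pandas and scikit-learn's MinMaxScaler; the version below swaps in the standard `csv` module and a hand-written scaler that applies the same formula, x' = (x - min) / (max - min), per column. The function names and the column names passed to `load_xy` are assumptions.

```python
import csv


def load_xy(csv_path, feature_cols, label_cols):
    """Read a CSV file and split it into feature rows X and label rows y."""
    X, y = [], []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            X.append([float(row[c]) for c in feature_cols])
            y.append([float(row[c]) for c in label_cols])
    return X, y


def min_max_scale(rows):
    """Scale every column to [0, 1] and return the per-column min and max.

    A constant column is mapped to 0.0 to avoid division by zero.
    """
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    scaled = [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
        for row in rows
    ]
    return scaled, lo, hi
```

Returning the column minima and maxima allows predictions made in the scaled range to be converted back to pixel coordinates later. With Pandas and scikit-learn, the corresponding calls are `pandas.read_csv` and `MinMaxScaler().fit_transform`.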

RESULTS AND DISCUSSION
The deep learning methods discussed in this paper, CNN and LSTM, allow the development of efficient gesture recognition systems that approximate natural interaction between a person and a computer, which is promising for various fields. One of the main advantages of using CNNs in hand gesture recognition is the ability to automatically extract important features from visual data. CNNs trained on large datasets can recognize hand and gesture features, which makes them effective tools for classifying various gestures. The use of pre-trained CNN models can also significantly reduce training costs and improve system speed.
Using CNN and LSTM recurrent neural networks to control the mouse cursor on a computer or device with hand gestures is an innovative technology that can significantly improve user experience and make interaction with a computer more natural and convenient. One of the main advantages of using CNNs and LSTMs is their ability to process visual data and analyze sequences of hand movements. CNNs can automatically extract features from hand images, which allows the system to recognize gestures such as cursor movement, clicks, and scrolling with high accuracy. LSTM, in turn, provides analysis of sequences of gestures. This is especially useful when moving the cursor, as users can perform complex hand movements to position the cursor accurately. LSTM allows the system to take into account the dynamics of movements and adapt to various gestures, which makes cursor control more precise and intuitive. One application of such technology could be the use of hand gestures to control the cursor in virtual or augmented reality, providing a convenient and interactive way to work with virtual objects and environments.
However, there are challenges and limitations that need to be taken into account. First, training efficient CNN and LSTM models requires large amounts of data, and collecting and labeling such data can be a time-consuming process. Second, the system must be able to process gestures in real time for a comfortable user experience, which requires optimization and efficient processing of the video stream. Consideration should also be given to the system's resilience to varying conditions, such as changing lighting, background noise, or varying hand postures. The system must work reliably and accurately in a variety of scenarios. In general, the use of CNN and LSTM techniques to control the mouse cursor with hand gestures represents a promising technology that can make computer interaction more natural and convenient. It allows users to control the cursor and perform actions on the computer or device more efficiently and intuitively; see Figure 1. For the successful implementation of such a system, however, it is necessary to address the challenges associated with training models, processing data in real time, and ensuring stable operation in various conditions.
The trained model is saved to the "hand_movement_lstm_model.h5" file using the save function. Finally, a graph is plotted of the loss function and the mean absolute error on the training and test samples during the training process. Executing the code yields a trained LSTM model that can predict the coordinates of the mouse cursor from the coordinates of the fingertips. If the accuracy of the model on the test set satisfies the requirements, it can be used to control the cursor with hand gestures in real time, as shown in Figure 3. Training the considered CNN and LSTM models showed that the CNN does a good job of extracting features from images, which makes it possible to accurately determine the coordinates of the index fingertip in the video stream. Figure 4 shows the result of an experiment using the CNN model to select a folder on the desktop using hand gestures. The model successfully recognized the hand gestures associated with highlighting a folder on the screen. This includes correctly determining the start and end points of the selection, as well as the shape and boundaries of the folder.
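The training, saving, and plotting steps described above might look roughly like the following Keras sketch. The layer size, optimizer, number of epochs, and the choice of mean squared error as the loss are assumptions, since the source does not specify them; only the output file name and the use of mean absolute error as a tracked metric come from the text. The helper `mean_absolute_error` spells out the metric itself.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference over all coordinates of all samples."""
    total, count = 0.0, 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            total += abs(t - p)
            count += 1
    return total / count


def train_lstm(X_train, y_train, X_test, y_test, epochs=50):
    """Train a small LSTM regressor, save it, and plot the training curves."""
    # Lazy imports: TensorFlow and Matplotlib are only needed for training.
    import numpy as np
    import tensorflow as tf
    import matplotlib.pyplot as plt

    X_train = np.asarray(X_train, dtype="float32")  # (samples, steps, features)
    y_train = np.asarray(y_train, dtype="float32")  # (samples, 2) cursor x, y
    X_test = np.asarray(X_test, dtype="float32")
    y_test = np.asarray(y_test, dtype="float32")

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=X_train.shape[1:]),
        tf.keras.layers.LSTM(64),                 # assumed layer size
        tf.keras.layers.Dense(y_train.shape[1]),  # one output per coordinate
    ])
    # MSE as the loss is an assumption; MAE is tracked as a metric.
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    history = model.fit(X_train, y_train,
                        validation_data=(X_test, y_test),
                        epochs=epochs, verbose=0)
    model.save("hand_movement_lstm_model.h5")

    # Loss and MAE on the training and test samples over the epochs.
    for key in ("loss", "val_loss", "mae", "val_mae"):
        plt.plot(history.history[key], label=key)
    plt.xlabel("epoch")
    plt.legend()
    plt.show()
    return model, history
```

Comparing the training and validation curves is a simple way to judge whether the model meets the accuracy requirement before it is used for real-time cursor control.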
Figure 5 shows how a user can click on a folder with a squeezing hand gesture. In the future, the model can be applied in real use cases, such as managing files and folders on a computer using hand gestures. It is important to train the model well and test it thoroughly to achieve the desired results and ensure high performance of the system.

CONCLUSION
Using CNN and LSTM recurrent neural networks to control the mouse cursor on a computer or device with hand gestures has improved human-machine interaction, making it more natural and user-friendly. The advantages of using CNNs and LSTMs in this context lie in their ability to efficiently process visual data and analyze hand motion sequences. CNNs automatically extract important features from hand images, which allows the system to recognize various gestures with high accuracy. LSTM, in turn, made it possible to take into account the dynamics of movements and improve the accuracy of cursor control when performing complex gestures.
Both architectures have their advantages and limitations, and the choice of the appropriate model may depend on the specific needs and characteristics of the system. For the task of controlling the mouse cursor with hand gestures, the CNN model showed several advantages: it processes visual data such as images or videos, it automatically extracts important features from hand images, which allows various gestures to be recognized with high accuracy, and it is well suited to classifying and recognizing objects, which makes it a good fit for identifying hand gestures. The limitation of this model is that it does not take into account sequences of gestures, which can be important for some cursor control scenarios. It also required more training data, since a CNN has many parameters and large amounts of diverse data are needed for successful training.

Figure 1. Algorithm for performing human-machine interaction

Figure 4. Result of the CNN model for highlighting a folder on the desktop using hand gestures