Comparison between handwritten word and speech record in real-time using CNN architectures

Received Oct 10, 2019 Revised Feb 5, 2020 Accepted Feb 25, 2020 This paper presents the development of a system that compares spoken and handwritten words using deep learning techniques. Ten words are acquired through an audio function and, in parallel, the same words are written by hand and captured by a webcam, so that the system can verify whether the two inputs match and indicate whether or not the required word was produced. For this, two different CNN architectures were used, one per function: for speech recognition, a CNN suited to identifying complete words from features obtained with mel frequency cepstral coefficients; for handwriting, a Faster R-CNN that both locates and identifies the captured word. To implement the system, an easy-to-use graphical interface was developed that unites the two neural networks. Real-time tests yielded an overall accuracy of 95.24%, showing the good performance of the implemented system, with the added factor of response speed: the comparison takes less than 200 ms.


INTRODUCTION
The implementation of pattern recognition applications on different types of data, such as signals from sensors [1], audio [2, 3], or even images [4, 5], has been growing exponentially, giving way to the creation of a great variety of techniques to cover each type of pattern. Among these techniques is an area called deep learning [6], which contains robust pattern recognition methods such as recurrent neural networks [7, 8], mainly used for speech and handwriting recognition, as well as deep belief networks [9], used for image recognition and natural language processing.
Another deep learning method that has been evolving, mainly since early 2012 [6], is the convolutional neural network (CNN) [10], which was originally used only to recognize patterns in images. However, after it demonstrated the capability to recognize up to 1000 different categories [11] and to support a large number of hidden layers, which even improved its performance [12, 13], its application has been extended to different fields, including speech recognition [14, 15], electromyographic signals [16, 17], and even human-machine interaction [18, 19]. With respect to speech recognition, CNNs have been applied successfully in several developments, such as the one shown in [20], where deep CNN architectures are compared against a deep neural network on large-scale speech tasks, yielding a relative improvement of 12% to 14%. In another development, this technique is used to recognize 6 different languages, obtaining a word error rate of 11.8% [21]. These works usually use phonemes to recognize the words; however, for an application that uses only a few words of a language rather than its whole vocabulary, extracting phonemes would be overly complex for the simplicity of the task. For this reason, and as part of the contribution, this paper proposes a CNN based on complete-word recognition, so that the input to the network is a whole processed audio clip, making its implementation much easier for recognizing the required words.
On the other hand, regarding the location of objects in images, because a CNN recognizes a single object in a whole image, a variant was developed to allow it to identify several objects within the same image, producing the region-based CNN (R-CNN) [22]. In combination with a technique for identifying regions of interest (RoIs), this network is able to locate and identify different objects, but with a processing time of more than 10 seconds, making it too slow for any real-time application. For this reason, other variations were implemented, leading to the Faster R-CNN [23] which, instead of using a separate algorithm to detect RoIs, has a region proposal network (RPN) with which it shares weights and learning; this not only increases its robustness but also reduces the time the network takes to locate all the objects of interest in an image. However, although there are works that have used CNNs to recognize handwritten words, such as those presented in [24, 25], the variant that locates them without the need for an additional algorithm has not been used. For that reason, this paper explores the use of a Faster R-CNN to locate 10 different handwritten words and verify its operation.
In addition to the aforementioned, this work seeks to implement a versatile application that checks the agreement between what a user says and what they write. It therefore proposes to compare 10 different words, using a speech recognition system and a handwriting location and identification system, united in a simple interface that allows the task to be performed in real-time. This paper is divided into 4 sections: section 2 describes the architectures implemented for speech recognition and for handwritten word location and identification, together with the interface developed for the task; section 3 presents the results of the real-time tests performed; finally, section 4 presents the conclusions reached.

METHODS AND MATERIALS
The development of this work is based on the recognition of 10 different words in the Spanish language that a user can write on paper: "Abajo" (Down), "Abra" (Open), "Arriba" (Up), "Avance" (Advance), "Cierre" (Close), "Dere" (a shortened form of Right), "Izqui" (a shortened form of Left), "No", "Pare" (Stop), and "Si" (Yes). Additionally, the recognition of these words is checked against what the user says with their voice. Taking this into account, the development depends on the implementation of two main functions: one responsible for recognizing what the person has written and another responsible for processing the input audio to identify the word the user says. For this, two different CNN architectures are built and trained to execute said functions, and they are integrated into the graphical interface that the user operates. Their development is explained below.

CNN for handwriting recognition
The function oriented to recognizing what the user wrote has two parts: locating the text and subsequently classifying it. The objective was to implement a single neural network capable of carrying out both parts; for this reason, a region-based network of the Faster R-CNN type was proposed, whose high response speed allows the development of algorithms capable of locating and recognizing words in real-time. Based on this, the architecture shown in Table 1 is proposed, which consists of small square filters of 5x5, 3x3, and 2x2, since the words are not large in terms of resolution, giving the network the possibility of learning both generalities and more specific details of each word. Likewise, zeros are added to the edges of the images that pass through the network by means of padding, so that if a stroke is very close to the edge, the network can still learn and take into account its characteristics. Additionally, coupled to the CNN there is a second path, a region proposal network (RPN), in charge of learning the regions or locations of the words. The two paths are joined by a region of interest pooling layer (RoI max-pooling), which accepts dynamic sizes of incoming feature maps, i.e., depending on the box located by the RPN, it performs the corresponding downsampling to obtain the input size required by the next layer. A database containing a total of 1200 grayscale images was built, of which 1000 are used for training the network and 200 for validating it. Each image was taken directly by a webcam and contains the 10 words to be identified, which were manually labeled, as shown in Figure 1.
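The RoI max-pooling step described above can be sketched as follows. This is a minimal NumPy illustration of the idea, not the paper's implementation: the 2x2 output size, the toy feature map, and the example box are illustrative assumptions.

```python
import numpy as np

def roi_maxpool(feature_map, roi, out_size=(2, 2)):
    """Max-pool an arbitrary RoI of a 2-D feature map to a fixed size.

    feature_map : 2-D array (H x W) of activations
    roi         : (row0, col0, row1, col1), half-open box from the RPN
    out_size    : fixed spatial size expected by the next layer
    """
    r0, c0, r1, c1 = roi
    crop = feature_map[r0:r1, c0:c1]
    oh, ow = out_size
    # Split rows and columns into nearly equal bins and take the max per
    # bin, so a box of any size maps to the same fixed output shape.
    row_bins = np.array_split(np.arange(crop.shape[0]), oh)
    col_bins = np.array_split(np.arange(crop.shape[1]), ow)
    out = np.empty(out_size)
    for i, rb in enumerate(row_bins):
        for j, cb in enumerate(col_bins):
            out[i, j] = crop[np.ix_(rb, cb)].max()
    return out

fm = np.arange(36, dtype=float).reshape(6, 6)  # toy 6x6 feature map
pooled = roi_maxpool(fm, (1, 1, 5, 6))         # a 4x5 box -> 2x2 output
print(pooled.shape)  # (2, 2)
```

This dynamic binning is what lets boxes of different sizes proposed by the RPN all feed the same fixed-size fully connected layers downstream.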
With this, the training of the neural network is performed and later its validation. In this last step, the network obtains 98.9% average accuracy in identifying the words within the image, and a mean average precision (mAP) of 99.23% in locating the bounding boxes. The average precision of each word is shown in Figure 2, with "Si" (Yes) being the least accurate, at 98%, mainly because the estimated box tended to be a little larger than the ground truth.
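The box-location precision above depends on the overlap between predicted and ground-truth boxes, usually measured as intersection-over-union (IoU): an oversized prediction, as with "Si", lowers the IoU even when the word itself is correct. A small sketch (the box coordinates are made-up illustrative values):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ix0 = max(box_a[0], box_b[0])
    iy0 = max(box_a[1], box_b[1])
    ix1 = min(box_a[2], box_b[2])
    iy1 = min(box_a[3], box_b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

gt = (10, 10, 60, 30)    # ground-truth box (illustrative)
pred = (8, 8, 64, 34)    # slightly oversized prediction
print(round(iou(gt, pred), 3))  # below 1.0 despite full coverage of gt
```

A prediction that fully covers the ground truth but is larger still pays an IoU penalty, which is consistent with "Si" having the lowest average precision.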

CNN for speech recognition
To implement the user's speech recognition function, the database for training is first prepared, from which the dimensions of the network input are obtained. For this, 115 audio recordings of each word, each 2 seconds long and sampled at 16 kHz, are obtained from different people in varied environments, i.e., both controlled and uncontrolled (with noise), making a total of 1150 samples in the database. However, since audio is to be acquired continuously, an additional category is needed, in this case called "Otros" (Others), which contains different environmental sounds (people speaking, knocks, etc.) and sounds from the users themselves (heavy breathing, coughing, etc.), to prevent the network from confusing these sounds with the words. For this category, 240 audio recordings are made.
In order for the audio to be used in a CNN focused on speech recognition, each recording is processed by means of mel frequency cepstral coefficients (MFCC), so that a feature map is obtained to be fed into the CNN. For the processing, a window of 20 ms is used with a shift of 10 ms, obtaining 12 coefficients per frame and a total of 199 frames. To obtain a clearer feature map, a floor filter is applied to the map, and then the first (∆) and second (∆∆) derivatives of the MFCCs are obtained, to capture better characteristics of the sound and give the network a greater possibility of learning the temporal variations of each word. An example of these maps is presented in Figure 3. Finally, the database of feature maps is divided, taking 100 maps per word and 200 from the category "Otros" for training, and 15 per word and 40 from "Otros" for validation. Once the database is ready, the CNN proposed for the application is shown in Table 2.
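The frame count stated above follows directly from the window parameters, and the delta features can be derived from the static MFCC matrix. The sketch below verifies the arithmetic and uses a simple padded first difference for the deltas; real implementations often fit a regression over several neighboring frames instead, so this exact formula is an assumption:

```python
import numpy as np

# Frame count for a 2 s clip at 16 kHz, 20 ms window, 10 ms shift:
sr, clip_s = 16000, 2.0
win, hop = int(0.020 * sr), int(0.010 * sr)   # 320 and 160 samples
n_frames = 1 + (int(clip_s * sr) - win) // hop
print(n_frames)  # 199, as stated in the text

def deltas(mfcc):
    """First-order difference along the time axis (frames x coeffs).

    Edge-padded central difference; a simplification of the regression
    formula used by most MFCC libraries.
    """
    padded = np.pad(mfcc, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0

mfcc = np.random.randn(n_frames, 12)  # placeholder 199 x 12 feature map
d1 = deltas(mfcc)                     # delta
d2 = deltas(d1)                       # delta-delta
print(d1.shape, d2.shape)             # both 199 x 12, same grid as MFCC
```

The deltas keep the 199 x 12 shape, so the static, ∆, and ∆∆ maps can be combined on the same time-coefficient grid before entering the CNN.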

Since the recognition to be made is of complete words, square filters are proposed, so that the network learns characteristics both across the coefficients and across the temporal variations. The network consists of 3 blocks of 2 convolutions plus a max-pooling layer, with the difference that in the first block, a normalization layer is added after the first convolution and downsampling is performed only along the coefficients, so that in the following convolutions the temporal features are maintained. Likewise, padding is added to the volumes, so that the characteristics found in the first and last coefficients are taken into account when the learning filters pass over them.
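The shape bookkeeping implied by this design can be traced explicitly. The sketch below is an assumption-laden illustration, not a reproduction of Table 2: it assumes the static, ∆, and ∆∆ maps are stacked along the coefficient axis (199 x 36), that 'same'-padded convolutions preserve spatial size, and that the later blocks pool both axes.

```python
def maxpool(shape, pool):
    """Non-overlapping pooling divides each axis by its pool factor."""
    return (shape[0] // pool[0], shape[1] // pool[1])

# Assumed input layout: 199 frames x 36 stacked coefficients.
shape = (199, 36)

# Block 1: conv + normalization + conv with 'same' padding keep the
# size; pooling only along the coefficient axis preserves the 199
# temporal frames for the following convolutions.
shape = maxpool(shape, (1, 2))
print(shape)  # (199, 18): frames kept, coefficients halved

# Blocks 2 and 3 (assumed to pool both axes):
shape = maxpool(shape, (2, 2))
shape = maxpool(shape, (2, 2))
print(shape)
```

Keeping the temporal axis intact through the first block is the design choice the text emphasizes: word identity lives largely in the temporal evolution of the coefficients, so it is discarded last.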
Finally, the training of the network is carried out, obtaining an accuracy of 100% in training and 90% in validation, as shown in Table 3. There, it can be observed that the words with the most erroneous classifications were "Arriba", "Avance", "Cierre", and "Izqui", mainly when there was a lot of ambient noise with people talking, which changes the feature map to some extent and causes these words to be confused with others.

Graphic user interface
To concatenate the operation of the two trained CNNs, a basic graphical user interface (GUI) is built so that it is easy for any user to operate. This interface is composed of a window that allows visualizing in real-time the written word through a webcam. Additionally, there are 4 buttons: 2 to activate and deactivate the webcam, and 2 to activate or deactivate data collection through a microphone. Finally, there is a box that shows the word said by the user, which changes color depending on 2 cases: green, if the word mentioned by the user matches the one recognized in the image, or red, if it does not. For the interface to start working, both the camera and the audio input must be activated. An example of the interface is shown in Figure 4.

RESULTS AND ANALYSIS
Various real-time tests of the GUI are performed to evaluate the performance of both the proposed speech recognition network and the network specialized in finding the words. For this, a continuous recording algorithm is added, so that once the audio is started, data is captured continuously. However, to prevent environmental noise from being recognized, an amplitude threshold on the audio signal is applied with respect to its average over a time of 200 ms. If the average amplitude exceeds the value of 0.1, it is determined that voice is present, and the data is taken from a time ti = t - 700 ms to tf = ti + 2000 ms, where t is the current time, ti is the start of data collection, and tf is the end of data collection, yielding an audio clip of 2 seconds. If the threshold is not surpassed, the audio data is not saved. With this, the tests are performed: users write the words individually, placing them in any position in the image and even with a certain degree of inclination to increase the difficulty of recognition. After a word is located, the users say one of the words, either the one they wrote or another, so that the interface shows whether what they said matches what they wrote. These tests are carried out with 6 users, each with 7 tries, i.e., 42 tests in total, obtaining the results shown in Table 4.
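The triggering rule above can be sketched as follows; this is an illustration of the stated thresholding logic on a synthetic signal, and in real use the capture simply continues for the 1300 ms after the trigger instant.

```python
import numpy as np

def detect_voice(signal, sr=16000, win_ms=200, threshold=0.1):
    """Apply the described trigger: if the mean absolute amplitude of
    the last 200 ms exceeds 0.1, return the capture window (ti, tf) in
    samples, spanning 700 ms before the trigger to 1300 ms after it
    (2 s total); otherwise return None and discard the audio.
    """
    win = int(sr * win_ms / 1000)
    recent = signal[-win:]
    if np.mean(np.abs(recent)) <= threshold:
        return None
    t = len(signal)                # "current time" t, in samples
    ti = t - int(0.700 * sr)       # ti = t - 700 ms
    tf = ti + int(2.000 * sr)      # tf = ti + 2000 ms
    return ti, tf

sr = 16000
quiet = 0.01 * np.ones(3 * sr)                       # background noise
loud = np.concatenate([quiet, 0.5 * np.ones(sr)])    # voice begins
print(detect_voice(quiet))   # None: threshold not exceeded
print(detect_voice(loud))    # a 2 s window anchored 700 ms back
```

Starting the window 700 ms before the trigger means the onset of the word, which usually precedes the moment the threshold is crossed, is not clipped from the 2 s input the network expects.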
In each of the tests, the speech recognition network was able to correctly identify every word mentioned by the users. Regarding the identification of the words in the images, 95.24% overall accuracy was obtained: the system recognized 20 words correctly and indicated a mismatch the 20 times the users named a different word. Eight of these tests can be observed in Figure 5, where, when the word was correctly classified, the identification confidence was greater than 90%; in other words, the network was "sure" it was that word. On the other hand, the network identifying words in the image made 2 erroneous classifications, that is, when the user mentioned one word and another was on the webcam, the match was reported as true. However, as shown in Figures 5.g and 5.h, the confidence in these cases was even lower than 80%, which can be solved with a confidence threshold, so that when a word is misclassified and the threshold is not exceeded, the result is taken as a negative.
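The proposed fix amounts to gating the match decision on detector confidence. A minimal sketch; the 0.85 cutoff is an illustrative assumption sitting between the observed bands (correct detections above 0.90, the two errors below 0.80), and `verify_match` is a hypothetical helper, not a function from the system:

```python
def verify_match(spoken_word, detection, conf_threshold=0.85):
    """Accept a match only when the detected word equals the spoken one
    AND the detection confidence clears the threshold, so low-confidence
    misclassifications are reported as negatives.
    """
    word, confidence = detection
    return word == spoken_word and confidence >= conf_threshold

print(verify_match("Pare", ("Pare", 0.93)))  # True: confident match
print(verify_match("Pare", ("Pare", 0.78)))  # False: below threshold
print(verify_match("Pare", ("Abra", 0.95)))  # False: words differ
```

With this gate, the two false positives observed in the tests (both under 80% confidence) would have been rejected without affecting the correct detections, which all exceeded 90%.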

To better understand how the Faster R-CNN behaves, the activations produced by all the words in the last layer of the network are shown in Figure 6. As can be seen, the network activates on the most relevant regions of each word, where red means greater activation; in the word "Cierre", it even focuses on the whole word. On the other hand, in certain categories it activates on the edges of the sheet on which the word was written, mainly for "Si", where it possibly confuses the shadow with part of the letter "i"; however, finding no further features, the network does not identify it as a word. As for the processing time of the algorithm, once the audio capture finishes, the process of obtaining the MFCCs, running the speech recognition network, running the Faster R-CNN, and displaying the result in the graphical interface takes an average of 151 ms, i.e., once the user finishes saying the word, the algorithm identifies almost instantly whether or not it matches what is written.

CONCLUSION
This paper described the implementation of a real-time comparison system between words handwritten and spoken by a user, where each of the functions developed was implemented with a different CNN architecture, obtaining validation accuracies of 98.9% and 90% for handwriting and speech recognition, respectively. When performing the real-time tests, the speech recognition function made no errors in recognizing the words the users said, i.e., it identified 100% of the 42 tests performed. Additionally, for the handwritten word recognition and localization system, the difficulty of the tests was increased, not only by not controlling the ambient light but also by placing the words with a certain degree of inclination, in order to test its performance; the general system obtained an accuracy of 95.24%, with only 2 cases of false positives, i.e., when the user said one of the words, the system erroneously reported that the written one matched, and no false negatives were obtained during the tests performed. With this, the very good performance of the implemented system can be observed, along with the great capacity of CNNs to be used both in speech recognition with complete words and in the location and recognition of handwriting, in addition to