A multi-microcontroller-based hardware for deploying Tiny machine learning model

ABSTRACT


INTRODUCTION
The demand for deploying machine learning models on embedded computers and resource-constrained microcontrollers, known as tiny machine learning (TinyML) [1]-[3], has increased. With the development of internet of things (IoT) technology, data analysis and decision making must increasingly be performed on edge devices; therefore, deploying machine learning models at the edge is crucial to prevent overloading cloud server systems. Embedded computing platforms such as the Raspberry Pi and Jetson Nano have been used in deep learning network (DLN) applications [4]-[6]. These single-board computers have processors powerful enough to execute some DLN tasks; nevertheless, their power consumption is the main issue, which makes them hard to apply to edge devices powered by batteries or renewable energy sources. Therefore, TensorFlow Lite [7] has been introduced to enable trained DLNs to be deployed effectively on tiny microcontrollers such as the nRF52840 and ESP32. Based on this, Hussein et al. [8] proposed a customized multi-layer perceptron neural network that was built and trained on a personal computer and implemented on a resource-limited microcontroller using TensorFlow Lite. Similarly, Choi and Kim [9] proposed a TinyML-oriented mosquito sound classification model operating on a 32-bit advanced reduced instruction set computer (RISC) machine (ARM) Cortex-M4F processor. TinyML has also been deployed on low-power microcontroller units (MCUs) in autonomous mini-vehicles and voice-recognition applications [10], [11]. However, these implementations have generally not focused on techniques to enhance the performance of the proposed networks, except for Kwon and Park [11], who proposed a field programmable gate array (FPGA)-based hardware solution.
ISSN: 2088-8708, Int J Elec & Comp Eng, Vol. 13, No. 5, October 2023: 5727-5736
Online training and inference of customized TinyML models on power-saving microcontrollers has become a research topic of interest. Some customized TinyML frameworks, such as TinyFedTL [12], machine learning-microcontroller unit (ML-MCU) [13], Train++ [14], and Globe2Train [15], have been proposed that can operate directly on resource-limited microcontrollers. These platforms have demonstrated the feasibility of building TinyML models directly on small MCUs. However, large-scale models remain an issue due to the limited memory resources and processing speed of embedded machines. Further research should be conducted to find suitable solutions for increasing the scale of these kinds of TinyML models.
A real-time operating system (RTOS) is an operating system that controls hardware and must operate within specified time constraints [16], [17]. FreeRTOS [18] is one of the well-known RTOS kernels and is designed for many kinds of MCUs. Because it supports multi-tasking, an RTOS is a promising solution for improving the performance of artificial intelligence networks (AINs). In fact, Zim [19] applied FreeRTOS to run two parts of an AIN in parallel on the two cores of an ESP32 MCU. That work clearly took advantage of multi-tasking on a multi-core MCU to improve the inference time of the AIN. However, for a TinyML model built with TensorFlow Lite, such a solution cannot be applied because the structure of the trained model is unknown. Furthermore, besides the difficulty of implementing DLNs on MCUs, the pre-processing of new data is another issue that needs to be overcome. Therefore, a suitable multi-tasking mechanism combined with an effective hardware platform should be proposed to improve the execution time of DLN-based applications.
The objective of this study is to propose a multiple-MCU platform combined with multi-tasking programming based on the FreeRTOS kernel to enhance the performance of a TinyML model. Our proposed hardware platform consists of two dual-core ESP32 MCUs that execute the data pre-processing and recognition tasks in parallel on their cores. The demonstrated TinyML model is a three-phase alternating current (AC) motor fault classifier based on operation noise, proposed by Nguyen et al. [20]. This DLN is modified to classify the faults of four motors instead of one as in the original version; its input is a 64×64 pixel grayscale Mel-spectrogram image [21] of a one-second noise data segment. The trained model is implemented on Core 1 of the second ESP32 of the proposed hardware, while Core 0 is reserved. The data pre-processing procedures are multi-tasked on the first ESP32. All tasks are scheduled to run in parallel to optimize the total inference time of the motor fault identification application.

DATA COLLECTION AND METHOD
Data description and analysis
The operation noise of three-phase AC motors corresponding to their common faults is acquired. In this study, four datasets are logged, corresponding to four motor operation cases: normal operation and three common faults, namely phase shift, phase loss, and bearing failure. The target faults are generated deliberately, and their noise is recorded by smartphone software at a sampling rate of 16 kHz. Figure 1 shows the experimental layout conducted by Tung et al. [22]. The experiments are carried out on four different motors with the parameters listed in Table 1.
The acquired datasets are processed to create data for implementing the TinyML DLN. They are converted to grayscale Mel-spectrogram images for training and testing the proposed TinyML model before the trained model is deployed on the proposed hardware. These images are used as the input of the TinyML model; the core algorithm can be found in [21] and can be modified for new projects. A spectrogram image is created in three consecutive steps: denoising with a wavelet filter using a soft threshold as described in (1) [23], splitting the denoised data into one-second segments (i.e., 16,000 samples each), and converting each segment into a 64×64 pixel spectrogram image using the algorithm proposed in [21]. All created image sets are randomly shuffled and labelled as presented in Table 2 to train and test the proposed TinyML model on the computer. During the training process on the computer, the training data were split into two separate sets: 80% for training the model and 20% for validating it.
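$$\hat{w} = \begin{cases} \operatorname{sgn}(w)\,(|w| - T), & |w| \geq T \\ 0, & |w| < T \end{cases} \tag{1}$$
where $\hat{w}$ is the wavelet coefficient after applying the soft threshold, $w$ is the wavelet coefficient of the original signal, $T$ is the soft threshold value, and sgn is the sign function.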
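The soft threshold in (1) and the one-second segmentation step can be illustrated with a short sketch. This is a minimal pure-Python illustration, not the wavelet library used in the study: the function names `soft_threshold` and `split_into_segments`, the example coefficients, and the choice to drop a trailing partial segment are all assumptions made here for clarity.

```python
from typing import List, Sequence


def soft_threshold(coeffs: Sequence[float], threshold: float) -> List[float]:
    """Apply the soft threshold of (1) to a sequence of wavelet coefficients."""
    if threshold < 0:
        raise ValueError("threshold must be non-negative")
    shrunk = []
    for w in coeffs:
        magnitude = abs(w) - threshold
        if magnitude <= 0:
            # |w| <= T: the coefficient is treated as noise and set to zero.
            shrunk.append(0.0)
        else:
            # |w| > T: keep the sign and shrink the magnitude by T.
            shrunk.append(magnitude if w > 0 else -magnitude)
    return shrunk


def split_into_segments(samples: Sequence[float], segment_len: int = 16_000) -> List[Sequence[float]]:
    """Split a signal into full one-second segments (16,000 samples at 16 kHz).

    Assumption: a trailing partial segment is discarded.
    """
    n_full = len(samples) // segment_len
    return [samples[i * segment_len:(i + 1) * segment_len] for i in range(n_full)]


# Hypothetical detail coefficients and threshold, for illustration only.
example_coeffs = [-3.0, -0.5, 0.0, 0.4, 2.5]
print(soft_threshold(example_coeffs, threshold=1.0))  # [-2.0, 0.0, 0.0, 0.0, 1.5]
```

In practice the threshold is applied to the detail coefficients of a wavelet decomposition, and the signal is reconstructed afterwards; those steps are omitted here.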

Overview of the process to build the TinyML-based AC motor fault classification
Figure 2 illustrates all procedures for building the motor fault identification application. First, the noise data of intentionally induced motor failures are acquired. After that, several pre-processing steps are applied to convert them into grayscale Mel-spectrogram image sets. Then, these image sets are shuffled, labelled, and used to train and test the proposed TinyML model to classify them, i.e., to identify the motor faults.
Finally, the trained model is converted to a TensorFlow Lite model, quantized to reduce the network size, and converted to a static C array before being integrated into a supported embedded platform. In this paper, the trained model was deployed on both our proposed hardware and an nRF52840 MCU for validation. On both platforms, the proposed model is tested with live spectrogram images created directly from the noise data recorded by a microphone.
A DLN with four labels $L = \{l_1, l_2, l_3, l_4\}$, representing the set of motor operation noises (i.e., normal operation, phase shift, phase loss, and bearing failure), is proposed to classify them. This DLN is trained on a training set $D = \{(x_i, y_i) \mid 1 \leq i \leq N\}$, where $y_i \subset L$, to learn a multi-label classifier that predicts the labels of new noise samples. As mentioned above, the inputs $x_i$ form a set of two-dimensional (2D) grayscale Mel-spectrogram images of the operation noise. The proposed DLN with parameters $\Theta$ can be described as (2):
$$F(x \mid \Theta) = f_n(\cdots f_2(f_1(x \mid \theta_1) \mid \theta_2) \cdots \mid \theta_n) \tag{2}$$
where $f_i(\cdot \mid \theta_i)$ ($\theta_i \in \Theta$) represents the $i$-th layer of the network and $n$ is the total number of layers. The proposed DLN structure in this study is illustrated in Figure 3. It is a 2D DLN with a 64×64 input and four outputs corresponding to the four predicted motor operation states: normal, phase shift, phase loss, and bearing failure. The feature extraction stage comprises four 2D convolutional layers, each followed by a 2D max pooling layer. The convolutional layer can be formulated as (3):
$$y = h(W * x + b) \tag{3}$$
where $W$ is the set of kernels or filters, $*$ denotes convolution, and $h(\cdot)$ and $b$ stand for the activation function and the bias value, respectively.
The convolutional layers and the first fully connected (Dense) layer apply the rectified linear unit (ReLU) function, which is formulated as (4):
$$h(x) = \max(0, x) \tag{4}$$
The last dense layer uses the softmax activation function for multiclass classification with a posterior probability output, which can be described as (5):
$$f(s)_i = \frac{e^{s_i}}{\sum_{j=1}^{|L|} e^{s_j}} \tag{5}$$
where $s_j$ is the score inferred by the DLN for each class in $L$. The loss function takes the form of the cross-entropy loss, formulated as (6):
$$CE = -\log\left(\frac{e^{s_p}}{\sum_{j=1}^{|L|} e^{s_j}}\right) \tag{6}$$
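The ReLU, softmax, and cross-entropy functions referred to in (4) to (6) can be tried out numerically with a small sketch. It uses only the Python standard library; the function names and the example scores are hypothetical, and the max-subtraction inside `softmax` is a common numerical-stability measure assumed here rather than something stated in the paper.

```python
import math
from typing import List, Sequence


def relu(x: float) -> float:
    """Rectified linear unit: negative inputs map to zero."""
    return max(0.0, x)


def softmax(scores: Sequence[float]) -> List[float]:
    """Softmax over class scores.

    Subtracting the maximum score leaves the result unchanged but avoids
    overflow in exp() for large scores.
    """
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores: Sequence[float], positive_index: int) -> float:
    """Cross-entropy loss for one sample whose true class is positive_index."""
    return -math.log(softmax(scores)[positive_index])


# Hypothetical scores for the four classes:
# normal, phase shift, phase loss, bearing failure.
scores = [2.0, 0.5, -1.0, 0.1]
probs = softmax(scores)
print([round(p, 3) for p in probs])
print(round(cross_entropy(scores, positive_index=0), 3))
```

The loss is small when the score of the true (positive) class dominates the others and grows as probability mass shifts to the wrong classes.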

Proposed multiple MCU embedded platform
Because the inference time of the trained DLN can hardly be optimized, our solution is to apply the FreeRTOS kernel to multi-task the pre-processing of new data on a dual-core MCU. Simultaneously, the other MCU runs inference on the previously processed data. This means that the best achievable total inference time equals the DLN inference time plus the data communication time. Figure 5 depicts the block diagram of our proposed hardware. ESP32-1 is responsible for multi-tasking all pre-processing tasks and transmitting the result, i.e., the spectrogram image, to ESP32-2, while ESP32-2 simultaneously classifies the previous image and shows the prediction on light-emitting diodes (LEDs). The motor operation noise is recorded by an INMP441 omnidirectional micro-electro-mechanical systems (MEMS) microphone manufactured by InvenSense Inc. [24].
The fault detection is tested with two scheduling schemes on ESP32-1, single-task and multi-task, to clarify the advantage of our proposed platform. ESP32-2 mainly waits for the spectrogram image from ESP32-1 to classify; therefore, multi-tasking is unnecessary on it. The timing diagram of the tasks executed on the two ESP32s is shown in Figure 6. In Case a, ESP32-1 runs on a single core with sequential tasks, but it can take advantage of the direct memory access (DMA) controller to buffer one second of sound data. In the first recognition cycle, the recognition time is longer, since ESP32-2 does not yet have a spectrogram image to classify. However, from the second cycle onwards, the time between two printed results ($t_a$) is shorter because ESP32-2 and ESP32-1 execute simultaneously. In this case, it would be possible to use the two cores of ESP32-1 to run the whole application concurrently; unfortunately, the ESP32 MCU has insufficient memory in most cases. In Case b, ESP32-1 executes two tasks in parallel on Core 0 and Core 1. One task is responsible for periodically reading data from the buffer of the DMA controller, while the other task denoises the new audio data, generates the Mel-spectrogram, and transmits it to ESP32-2 for model inference. In the first identification cycle, the inference result should appear at the same time as in Case a. However, from the second cycle onwards, the time between two printed recognition outcomes ($t_b$) equals the sum of the data transmission time and the inference time.
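The task split of Case b can be pictured as a three-stage producer-consumer pipeline. The sketch below is an illustration in Python, not the FreeRTOS firmware of the study: threads stand in for tasks, bounded queues stand in for the DMA buffer and for the link between the two MCUs, and all names (`capture_task`, `preprocess_task`, `inference_task`, `N_SEGMENTS`) are hypothetical. It shows the data flow only; Python threads do not reproduce the timing of two cores running in parallel.

```python
import queue
import threading
from typing import List

N_SEGMENTS = 5   # hypothetical number of one-second audio segments
STOP = None      # sentinel that tells a downstream stage to finish


def capture_task(audio_q: queue.Queue) -> None:
    """Stand-in for the task that reads one-second buffers from the DMA controller."""
    for seg_id in range(N_SEGMENTS):
        audio_q.put(seg_id)
    audio_q.put(STOP)


def preprocess_task(audio_q: queue.Queue, image_q: queue.Queue) -> None:
    """Stand-in for denoising, Mel-spectrogram generation, and transmission."""
    while (seg_id := audio_q.get()) is not STOP:
        image_q.put(f"spectrogram-{seg_id}")
    image_q.put(STOP)


def inference_task(image_q: queue.Queue, results: List[str]) -> None:
    """Stand-in for the second MCU, which classifies each received image."""
    while (image := image_q.get()) is not STOP:
        results.append(f"prediction for {image}")


def run_pipeline() -> List[str]:
    audio_q: queue.Queue = queue.Queue(maxsize=1)  # one buffered segment
    image_q: queue.Queue = queue.Queue(maxsize=1)  # link between the two MCUs
    results: List[str] = []
    tasks = [
        threading.Thread(target=capture_task, args=(audio_q,)),
        threading.Thread(target=preprocess_task, args=(audio_q, image_q)),
        threading.Thread(target=inference_task, args=(image_q, results)),
    ]
    for task in tasks:
        task.start()
    for task in tasks:
        task.join()
    return results


print(run_pipeline())
```

Because each queue holds at most one item, a stage blocks until the next stage has taken the previous item, which mirrors the way the second MCU classifies the previous image while the first MCU prepares the next one.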

System evaluation
To evaluate the TinyML DLN on the proposed hardware platform, 400 segments of live audio recordings were tested for each fault (i.e., phase shift, phase loss, and bearing failure) and for the normal operation case. In total, 1,600 noise segments were tested to evaluate the DLN performance. The average inference time over all cases is used to evaluate the total inference time. In addition, the TinyML DLN is evaluated using several performance metrics, namely accuracy, precision, recall, and F1 value, defined in (7) to (10), respectively:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{7}$$
$$\text{Precision} = \frac{TP}{TP + FP} \tag{8}$$
$$\text{Recall} = \frac{TP}{TP + FN} \tag{9}$$
$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{10}$$
where true positive (TP) and true negative (TN) are samples correctly predicted as the positive class and the negative class, respectively, whereas false positive (FP) and false negative (FN) are samples incorrectly predicted as the positive class and the negative class, respectively. In this study, macro-averaged values of accuracy, precision, recall, and F1 are used to evaluate the proposed DLN based on the model's inference results with real-time data.
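As an illustration of how such metrics can be obtained from a multi-class confusion matrix, the following sketch computes them in plain Python. The function name `macro_metrics` and the 2×2 example matrix are hypothetical and are not data from this study. Two aggregation choices are assumptions made here: accuracy is taken as the overall fraction of correct predictions, and F1 is obtained by applying the F1 formula to the macro-averaged precision and recall.

```python
from typing import Dict, List


def macro_metrics(cm: List[List[int]]) -> Dict[str, float]:
    """Compute accuracy and macro-averaged precision, recall, and F1.

    cm[i][j] counts samples whose true class is i and predicted class is j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    precisions, recalls = [], []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, true class differs
        fn = sum(cm[k]) - tp                       # true k, predicted class differs
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    precision = sum(precisions) / n
    recall = sum(recalls) / n
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(cm[k][k] for k in range(n)) / total
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical two-class confusion matrix, for illustration only.
example_cm = [[8, 2],
              [1, 9]]
print(macro_metrics(example_cm))
```

When every class has the same number of test samples, as with 400 segments per class, the overall accuracy and the macro-averaged recall coincide.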

The efficiency of the proposed platform
Figure 7 shows the experimental layout for real-time testing of the trained DLN on our proposed hardware. Since no motors were available for testing, the recorded noise data were played through a high-definition speaker to imitate the real motor operation noise. The microphone is placed in front of the speaker to record the sound, and the system identifies the type of noise, i.e., the motor fault. Four LEDs indicate the predicted fault. Furthermore, data samples such as the recorded noise, filtered noise, Mel-spectrogram image, and identification outputs were printed to the Serial Monitor window for later collection and analysis. The execution time of a particular task is measured as the difference between the system timer values at the beginning and the end of the task. Table 3 lists the execution times of the tasks on the experimental system. One-second audio data is sampled by the INMP441 at a sampling rate of 16 kHz and sent back as a data stream on the Inter-IC Sound (I2S) bus. For the performance evaluation, the same wavelet filter parameters are applied on both the computer and the embedded side, namely the sym4 wavelet function and a soft threshold determined by the SureShrink method [25]. In addition, the wavelet filter library developed by Hussain [26] has been modified to run on the embedded system. Table 3 also shows the total inference time of the TinyML DLN on our proposed hardware and on a single-core MCU. The average time between two consecutive identifications is 1.87 seconds when ESP32-1 runs on a single core, compared with 1.22 seconds when it runs on both cores, an improvement of 34.8%. In addition, the program was modified to run on the single-core nRF52840 MCU: the DLN inference time is 0.90 seconds compared with 1.07 seconds on the ESP32; nevertheless, the total inference time including the pre-processing of new data is up to 4.93 seconds.
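The reported improvement can be checked directly from the two average periods quoted above, writing $t_a$ for the single-core period and $t_b$ for the multi-core period:

$$\frac{t_a - t_b}{t_a} = \frac{1.87 - 1.22}{1.87} = \frac{0.65}{1.87} \approx 0.348 = 34.8\%$$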
Consequently, our proposed hardware combined with the multi-tasking programming technique remarkably improves the speed of the AC motor fault classifier. However, Figure 6 also shows that the cores of ESP32-1 and ESP32-2 are not fully utilized, which leaves room to scale up the application. The idle time of these cores could be exploited for additional tasks such as communicating with a cloud server or other edge devices in the IoT system. In addition, the average current consumption during inference is only 165.9 mA and could be lower if the low-power modes of the ESP32 MCUs were applied. Therefore, it is feasible to apply DLNs to analyze data and make decisions directly on IoT devices.

Evaluation of real-time fault classification
The proposed DLN in Figure 3 was trained on the computer using the training data in Table 2. The implementation and training of the DLN were done on Google Colab. Figure 8 shows the accuracy and loss of the training process. The training accuracy reaches 97.7% after 300 epochs, and the total training time is 3 hours and 35 minutes on Colab. The testing accuracy is 95.6% using the testing data in Table 2.
The accuracy of the trained TinyML DLN on the proposed embedded system was also evaluated using the same experimental layout illustrated in Figure 7. The computer randomly plays about 400 seconds of noise data from each dataset through the speaker for real-time classification. Each sound type was identified 400 times. The experimental results show that the macro accuracy, precision, recall, and F1 values are 83.7%, 85.2%, 83.7%, and 84.4%, respectively. This accuracy is significant in comparison with the similar DLN model and pre-processing procedures, which were applied to only one motor in [20]. Figure 9 presents the confusion matrix of the real-time inference for all cases.
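As a consistency check on the reported values, and assuming the F1 value is the harmonic mean of the macro-averaged precision and recall, the figures above give:

$$F1 = \frac{2 \times 0.852 \times 0.837}{0.852 + 0.837} = \frac{1.426}{1.689} \approx 0.844$$

which agrees with the reported 84.4%.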

Discussion
This work has proposed an embedded hardware platform that applies the multi-tasking programming technique to improve the speed of TinyML-oriented applications. Furthermore, several data pre-processing procedures, such as wavelet denoising and grayscale Mel-spectrogram imaging, have been implemented on the proposed hardware with optimized execution times. The preliminary results are promising for many future applications, especially for integrating TinyML into IoT edge devices. In addition, this study takes an initial step toward solving the energy problem of IoT devices with integrated artificial intelligence that mainly operate on renewable energy sources. Indeed, tiny MCUs such as the ESP32 or nRF52840 consume very little power and support power-saving modes, whereas many embedded computers running popular DLN platforms have high energy consumption, which is impractical for renewable energy sources.

CONCLUSION
A multiple-MCU-based hardware platform has been proposed to take advantage of multi-tasking programming techniques to speed up TinyML-oriented applications. The total inference time of a TinyML network on our platform is decreased by 34.8% in comparison with a single-processor platform. In addition, the proposed DLN has been modified to classify the faults of four three-phase motors, with a best real-time inference accuracy (classifying live audio samples recorded directly by the MCU) of 83.7%. In future work, our hardware platform can be used to implement artificial intelligence directly on IoT devices without offloading computation to a computer or cloud server.

Figure 1. Motor's noise data logging experimental setup

Figure 2. System architecture for the fault identification system
In the cross-entropy loss (6), $s_p$ is the score of the positive class. To prevent overfitting of the DLN, a Dropout layer is added right after the Flatten layer with a default dropout rate of 50%. The proposed DLN has a total of 163,588 trainable parameters after quantization.

Figure 5. Block diagram of the proposed embedded hardware

Figure 8. Training and validation loss of the proposed DLN

Table 1. Parameters of the three-phase motors used

Table 2. Data summary

Table 3. Execution time (seconds) of the implemented tasks on the proposed hardware