Design and implementation of speech recognition system integrated with internet of things

Received Jul 14, 2020; Revised Aug 17, 2020; Accepted Nov 6, 2020

In speech recognition, a speech signal from a client or user is received by the system through a microphone; the system then analyses this signal and extracts useful information from it, which is converted to text. This study focuses on the design and implementation of a speech recognition system integrated with the internet of things (IoT) to control electrical appliances and a door, with a Raspberry Pi as the core element. To design the speech recognition system, digital signal processing (DSP) techniques and the hidden Markov model were considered for processing, feature extraction and high predictive accuracy. The Google application programming interface (API) was used as a cloud server to store commands and to give the system access to the internet. With 150 speech samples on the system, a high level of accuracy of over 80% was obtained.


INTRODUCTION
Speaking is the major means of human communication. Many processes are involved in the production of speech, and several body parts aid it apart from the commonly known ones such as the tongue, mouth and lips. The lungs, trachea, larynx, vocal cords, oral cavity and nasal cavity are all highly involved [1,2]. Human speech is produced by the flow of air from the lungs through the larynx, with inhaling and exhaling through the nasal and oral cavities. Vowel sounds are produced by the flow of air from the lungs through the vocal cords, making them vibrate [3]. Consonants can be produced when the air is pressed through the closed vocal tract, resulting in turbulent airflow. Sounds are produced by the vibration of the vocal cords [4]. Each sound, word or utterance vibrates differently, and the frequency of the vibration is called pitch. The source-filter theory of speech production, introduced in [5], explains how speech is produced. According to [5], speech production has two stages. In the first stage, air flows through the vocal cords to produce a basic signal, known as the signal source.
Speaker recognition is the process of recognising a speaker from the unique information present in the speech wave. This technique uses the speaker's voice to verify the speaker's identity and to control access to services such as voice dialling, security, information services, remote access to a computer, purchases, etc. Many disabled (blind, lame) and aged persons in society have a limited capacity to perform certain tasks due to their physical and environmental conditions [6,7]. Most often they require human help in several of their activities, which usually costs a huge sum if the helper is not a family member, and the persons who render such services are very few [8,9]. This work seeks to help: it acts like a telecommunication service that attends to the needs of the disabled via automation [10,11].
With the recent trend of automation as a means of control in different areas [12][13][14][15][16], this work deems it fit to integrate automation to meet some of the needs of disabled individuals. The proposed model in this study is limited to the sound or speech recognition mode of authentication. Although there are other authentication modes for gaining access, such as RFID [17][18][19], biometrics [20][21][22], PIN [23,24] or a combination of these [25][26][27], this study focuses on voice recognition.

MATERIALS AND METHODS
The bulk of this work lies in the software, although the hardware is also important. Most of the hardware components used were ready-made; the technical aspect of the hardware lies in the correct ratings of all the components and their correct connection. In this work, the Raspberry Pi single-board computer (SBC) software is deployed for the design of the speech recognition system and is used to configure the hardware for the required action. As this work is based on the internet of things (IoT), internet connectivity must be set up; here, it is obtained through a USB Wi-Fi adapter. For ease of identification, Figure 1 shows the block diagram of the proposed system.
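The software step that turns recognised text into a hardware action can be sketched as a simple lookup. The following is a minimal sketch only: the command phrases, device names and function names are hypothetical illustrations, not the commands or code used in this study, and the actual GPIO switching on the Raspberry Pi is deliberately left out.

```python
# Hypothetical command table: phrases and device names are illustrative
# assumptions, not the commands used in the study.
COMMANDS = {
    "open door": ("door", True),
    "close door": ("door", False),
    "light on": ("light", True),
    "light off": ("light", False),
}


def parse_command(transcript):
    """Return (device, state) for a recognised phrase, or None if unknown."""
    # Normalise case and collapse repeated whitespace before the lookup.
    text = " ".join(transcript.lower().split())
    return COMMANDS.get(text)


def apply_command(states, transcript):
    """Update the device-state dictionary in place; return True on a match."""
    action = parse_command(transcript)
    if action is None:
        return False
    device, state = action
    states[device] = state
    return True


states = {"door": False, "light": False}
apply_command(states, "Open  Door")
print(states)
```

Keeping the phrase table separate from the switching logic means new appliances could be added by extending the table alone.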

DESIGN SPECIFICATIONS
The design specifications of the speech recognition access control module deal with the conditions necessary for the module to function optimally. For this work, two types of design specifications are considered, namely hardware and software specifications.

Hardware specifications
The hardware specification deals with the optimal conditions necessary for the module to function. These conditions include:
a. Operating current: The current source needed to power up the Raspberry Pi and all the peripheral devices attached to it should provide at least 2 A of current.
b. Operating voltage: The Raspberry Pi operates at 5 V. Together with the operating current, this gives a required power of about 10 W (5 V x 2 A = 10 W), so this is a very low power device. Assuming a 7500 mAh battery is used to power the device at the full 2 A, the module can last for almost 4 hours (7500 mAh / 2000 mA = 3.75 h) before a battery recharge is required.
c. Operating temperature: The official operating temperature range for the Raspberry Pi is from -40℃ to 85℃. As the temperature begins to approach 82℃, the performance is thermally throttled.
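The power and battery-life figures above follow from two one-line formulas. The sketch below reproduces them under idealised assumptions (constant 2 A draw, no conversion losses, battery capacity fully usable); the function names are illustrative only.

```python
def power_watts(voltage_v, current_a):
    """Electrical power P = V * I."""
    return voltage_v * current_a


def runtime_hours(capacity_mah, load_ma):
    """Ideal runtime: battery capacity divided by a constant load current."""
    return capacity_mah / load_ma


print(power_watts(5.0, 2.0))      # 10.0 W
print(runtime_hours(7500, 2000))  # 3.75 h
```

In practice the board rarely draws the full 2 A continuously, so the real runtime would likely differ from this estimate.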

Software specifications
The software specifications necessary for the module to run optimally include:
a. The RAM size: The RAM size required for this project is 512 MB, and the Raspberry Pi Zero meets this specification.
b. The ROM size: The minimum ROM size (storage space of the microSD card) is 8 GB. In this project, an 8 GB class 10 microSD card is used; a class 10 card sustains a write speed of at least 10 MB/s.
c. The processor: The processor necessary for running the software of this project is the Broadcom BCM2835 system on chip (SoC) with an ARM11 CPU running at 1 GHz.
d. The operating system: The operating system that the Raspberry Pi (the control unit of the module) runs is a Debian-based distribution known as Raspbian. It is installed on the SD card and takes at least 4 GB of space.
Figure 2 shows the circuit diagram of the proposed system, drawn with the EasyEDA application.

Fourier methods
The short-time Fourier transform of the speech signal $x(n)$ can be calculated using (1), as given in [28]:

$$X_n(\omega) = \sum_{m=-\infty}^{\infty} x(m)\, w(n-m)\, e^{-j\omega m} \tag{1}$$

where the index $n$ refers to the time $nT$, which means that $X_n(\omega) = X(nT, \omega)$. By the inverse Fourier transform, the speech signal $x(n)$ is recovered as shown in (2):

$$x(n) = \frac{1}{2\pi\, w(0)} \int_{-\pi}^{\pi} X_n(\omega)\, e^{j\omega n}\, d\omega \tag{2}$$

Since $X_n(\omega)$ is a function of time and changes as time changes, it is sampled at a rate that allows the speech $x(n)$ to be reconstructed. With the bandwidth $B_x$ of the speech signal $x(n)$ approximately equal to 5 kHz, the sampling frequency $F_s$ equals 10 kHz. For the Hamming window $w(n)$ of length $N = 100$, using (3), the bandwidth $B$ is found to be

$$B = \frac{2F_s}{N} = \frac{2 \times 10\,000}{100} = 200\ \text{Hz} \tag{3}$$

The Nyquist rate for the short-time Fourier transform is twice the bandwidth $B$; therefore, the Nyquist rate equals 400 Hz. Hence, at $F_s$ = 10,000 Hz, a value of $X_n(\omega)$ is required every 25 samples. Since $N = 100$, the windows should overlap by 75% [28].
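The window arithmetic in this section (bandwidth, Nyquist rate, hop size and overlap) can be checked numerically. The sketch below is an illustration, not the authors' implementation: it uses a frame-based discrete approximation of the short-time Fourier transform built on NumPy, and the 1 kHz test tone is a synthetic stand-in for real speech.

```python
import numpy as np

FS = 10_000  # sampling frequency in Hz (from the text)
N = 100      # Hamming window length in samples (from the text)

bandwidth_hz = 2 * FS / N        # Hamming window bandwidth: 200 Hz
nyquist_hz = 2 * bandwidth_hz    # required sampling rate of the STFT: 400 Hz
hop = int(FS / nyquist_hz)       # one STFT value every 25 samples
overlap = 1 - hop / N            # fraction of each window shared with the next


def stft(x, n=N, step=hop):
    """Frame the signal, apply a Hamming window, and take the FFT of each frame."""
    window = np.hamming(n)
    starts = range(0, len(x) - n + 1, step)
    frames = np.array([x[s:s + n] * window for s in starts])
    return np.fft.rfft(frames, axis=1)


# Synthetic 1 kHz tone, one second long (an assumption for demonstration).
t = np.arange(FS) / FS
tone = np.sin(2 * np.pi * 1000 * t)
spectrum = stft(tone)

print(bandwidth_hz, nyquist_hz, hop, overlap)
print(spectrum.shape)
```

With a 100-sample window at 10 kHz, each frequency bin is 100 Hz wide, so the 1 kHz tone should peak in bin 10 of every frame.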

Gaussian mixture model (GMM)
For the Gaussian mixture model, (4) gives the non-singular multivariate normal distribution of a $d$-dimensional random variable $x$ [29]:

$$f(x) = \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^{T}\, \Sigma^{-1}\, (x-\mu)\right) \tag{4}$$

Multivariate data are observations made on more than one variable. In (4), $f(x)$ is the probability density function, $\mu$ is the mean vector ($d \times 1$) and $\Sigma$ is the covariance matrix ($d \times d$) of the normally distributed random variable $X$. The mean vector (expected vector) in (4) is as shown in (5):

$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i \tag{5}$$

where $n$ is the number of samples and $x_i$ are the mel-cepstral feature vectors. The variance-covariance matrix of a multi-dimensional random variable is described in (7), where the sample mean is obtained from (5) and the second-order sum matrix is as shown in (8) [28].
Once the training data are prepared and the basis for a speaker-independent model, saved as the prior statistics, is built, many speakers are used to optimise the Gaussian parameters and coefficients using standard procedures such as maximum likelihood estimation (MLE), maximum a posteriori (MAP) adaptation and maximum likelihood linear regression (MLLR). The system is then ready for enrollment. Enrollment is completed by taking a sample of the target speaker's voice and adapting the model so that it best fits this sample. This ensures that the probability returned when the same sample is matched against the adapted model is maximal.
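To make the multivariate normal density and the sample statistics concrete, the following sketch evaluates the log of the density for a single Gaussian component. It is illustrative only: the random vectors stand in for mel-cepstral features, the function name is hypothetical, and the covariance uses NumPy's default unbiased (n - 1) normalisation, which is an assumption rather than something stated in this paper.

```python
import numpy as np


def gaussian_log_pdf(x, mean, cov):
    """Log of the d-dimensional normal density evaluated at the vector x."""
    d = mean.shape[0]
    diff = x - mean
    # slogdet is numerically safer than log(det(cov)).
    _, logdet = np.linalg.slogdet(cov)
    # Mahalanobis term (x - mu)^T Sigma^{-1} (x - mu), without inverting Sigma.
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)


# Synthetic stand-in for mel-cepstral feature vectors (200 frames, 3 dimensions).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 3))

mean = features.mean(axis=0)           # sample mean vector
cov = np.cov(features, rowvar=False)   # assumption: unbiased (n - 1) estimate

print(gaussian_log_pdf(features[0], mean, cov))
```

A full mixture model would combine several such components with mixture weights; during recognition, the speaker model returning the highest total log-likelihood over all frames would be selected.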

Hidden Markov model (HMM)
The hidden Markov model is built by augmenting the Markov chain. A Markov chain is a nonlinear model that is often used to represent a sequence of possible events such that the probability of each event depends solely on the previous one. Figure 4 shows a Markov chain for assigning probabilities to a sequence of words $w_1, w_2, \ldots, w_n$; for example, the probability that a user says "TO" after saying the word "GO" is 0.4. The speech recogniser makes its decision based on the most probable next state, which means that, from the figure, the system is most likely to move to "BED" with "GO" as the initial state. The transition probability matrix is given as follows:

$$A = \begin{bmatrix} a_{11} & \cdots & a_{1N} \\ \vdots & \ddots & \vdots \\ a_{N1} & \cdots & a_{NN} \end{bmatrix}, \qquad \sum_{j=1}^{N} a_{ij} = 1$$

where $a_{ij}$ is the probability of moving from state $i$ to state $j$. The following components are required for a Markov chain: a set of $N$ states $Q = \{q_1, q_2, \ldots, q_N\}$, the transition probability matrix $A$, and an initial probability distribution $\pi$, where the probability that the Markov chain will begin in state $i$ is $\pi_i$. The flow chart for the process is given in Figure 5.
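The next-state decision described above can be sketched with a small transition table. Only the value P("TO" | "GO") = 0.4 and the fact that "BED" is the most likely successor of "GO" come from the text; every other probability, the initial distribution and the function names below are hypothetical values chosen so that each row sums to 1.

```python
# Transition probabilities. Only GO -> TO = 0.4 is from the text; the rest
# are hypothetical placeholders chosen so that every row sums to 1.
TRANSITIONS = {
    "GO": {"TO": 0.4, "BED": 0.6},
    "TO": {"BED": 0.7, "GO": 0.3},
    "BED": {"GO": 1.0},
}

# Hypothetical initial distribution: the chain always starts in "GO".
INITIAL = {"GO": 1.0, "TO": 0.0, "BED": 0.0}


def most_likely_next(state):
    """Return the successor state with the highest transition probability."""
    row = TRANSITIONS[state]
    return max(row, key=row.get)


def sequence_probability(words):
    """Probability of a word sequence: initial probability times each transition."""
    prob = INITIAL.get(words[0], 0.0)
    for prev, cur in zip(words, words[1:]):
        prob *= TRANSITIONS.get(prev, {}).get(cur, 0.0)
    return prob


print(most_likely_next("GO"))
print(sequence_probability(["GO", "TO", "BED"]))
```

A hidden Markov model extends this by adding observation probabilities for each state, so that the word sequence is inferred from acoustic features rather than observed directly.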

RESULTS AND DISCUSSIONS
The proposed system was tested several times to measure and ascertain its accuracy. The speaker recogniser was trained with various speech samples. Afterwards, a candidate was enrolled in the system in order to confirm its accuracy. The speech of the candidate was taken under different conditions to show the performance of the module, with the test samples of the speaker's voice already in the database. The conditions under which the candidate's voice was taken include:
A. Crowded place with background noise;
B. Silent place with little or no background noise;
C. A condition such that the speaker's voice was low; and
D. A condition such that the speaker's voice was loud.
Under these four different conditions, the recognition accuracy of the system was obtained as shown in Figure 6. Table 1 shows the accuracy of samples taken.

| Condition | Samples recognised | Accuracy (%) |
|---|---|---|
| B. A condition of silent place with little or no background noise | 123 | 82.00 |
| C. A condition such that the speaker's voice was low | 137 | 91.33 |
| D. A condition such that the speaker's voice was loud | 110 | 73.33 |
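The accuracy percentages in Table 1 are consistent with each count being divided by 150 samples, the figure quoted in the abstract. The short check below makes that arithmetic explicit; treating 150 as the per-condition total is an inference from the ratios, not something the table itself states.

```python
TOTAL_SAMPLES = 150  # assumption: 150 test samples per condition

# Correctly recognised samples per condition, as listed in Table 1.
recognised = {"B": 123, "C": 137, "D": 110}


def accuracy_percent(correct, total=TOTAL_SAMPLES):
    """Recognition accuracy as a percentage, rounded to two decimals."""
    return round(100 * correct / total, 2)


for condition, count in recognised.items():
    print(condition, accuracy_percent(count))
```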

CONCLUSION AND RECOMMENDATION
This work has successfully constructed a prototype speech recognition system for home automation using a Raspberry Pi single-board computer. The prototype worked well and gave a very promising result for real production. With noise interference (i.e. the worst-case environmental condition in terms of noise), a good result was still obtained, with approximately 65% accuracy, and the highest accuracy recorded was 91%. The work can find application in different places and fields, depending on the task to be carried out. The response time of the module is relatively fast. Further work can be done by adding more appliances, by adding a voice recognition module to ensure security for home automation, and by training the voice recognition module to adjust to the diverse voice conditions of the user. Moreover, other means of authentication can be added.