A novel text-independent automatic voice recognition system for noisy environments

ABSTRACT


INTRODUCTION
The process of voice recognition starts with recording the user's voice through a microphone. By default, noise from the environment is recorded together with the voice, so a de-noising method is then applied to the recorded signal. A number of feature-extraction methods can be applied to the de-noised speech signal; usually, Mel Frequency Cepstral Coefficients (MFCCs) are used. The extracted features characterise the frequency content of the voice, the shape of the vocal tract, and the intonation, or prosody. For each voice frame, a vector of 20 to 30 characteristics, termed cepstral coefficients, is calculated [1]. Finally, the classification of the users' voices is done using the Log-Likelihood Ratio (LLR) of the UBM-GMM model. Before it can be used for classification, however, the UBM-GMM model is created during the training process from training data that contains speech signals from known users [2]. The performance of recognising a speaker through voice is subject to signal quality, and it also depends on the variability of the speaker's voice over time, as in the case of illness (colds), emotional states (anxiety, joy, etc.), age, and voice acquisition conditions (such as noise, reverberation, and the quality of equipment such as the microphone) [3].
The MSR identity toolbox is a very accurate tool for voice recognition when tested in a quiet (noise-free) environment [4]. However, its performance is greatly reduced in a noisy environment, according to the study by [5], which investigates the relation between speaker recognition accuracy and adverse acoustic conditions. The author established that the accuracy of the MSR identity toolbox drops significantly in a noisy environment and documented the implications of noise and reverberation; in particular, a regression formula was established to predict the recognition accuracy of a typical speech recognition system. Similarly, Dirceu presents an evaluation of the Microsoft Research Identity Toolbox under conditions reflecting real-world use, i.e. a noisy environment, and likewise found that its accuracy drops when tested under noise [6][7][8][9]. However, neither researcher provided a solution to this problem. Hence, this research is conducted to improve the accuracy of the MSR identity toolbox in a noisy environment. It attempts to address the issues in the domain of voice recognition in a noisy environment by: (a) using several noise removal (de-noising) methods to improve the MSR identity toolbox; and (b) determining the MFCC configuration that best suits both voice recognition and noise removal, since multiple types of MFCC voice features may be used.

RESEARCH OBJECTIVES
The objectives of this research are: (a) To improve the Microsoft research team's MSR identity toolbox and make it robust in a noisy environment. A method based on noise removal filtering and the choice of the MFCC speech signal feature that gives the best accuracy for voice recognition in a noisy environment is proposed. Firstly, we implemented noise removal filters such as the Filter bank, least mean squares (LMS) filter, Wavelet filter, and hearing aid filter in order to improve the accuracy of the MSR identity toolbox in real-life (noisy) conditions. Subsequently, we investigated several types of MFCC voice features to show their effect on the accuracy of the voice recognition system, and to determine the MFCC type that best suits both the UBM-GMM voice recognition model and the noise removal methods. (b) To develop the modules of the proposed voice recognition system using the MSR identity toolbox, which implements the GMM-UBM model. Several modules will be implemented in MATLAB, such as user enrolment and management, users' voice recording, training and testing, noise application and removal, MFCC feature extraction, and WAV file conversion in the case of usage of the TIMIT voice database. (c) To evaluate the feasibility of the proposed method in improving the voice detection accuracy of the MSR identity toolbox in a noisy environment.

RELATED WORKS OF VOICE RECOGNITION
The first research work on AVR was done in 1985 using Vector Quantization (VQ) [10][11][12]: from the enrolment data of a given voice, the acoustic space is partitioned into a finite number of regions, each represented by a centroid vector. The set of these vectors forms what is called a quantization dictionary. A proximity measure between this dictionary and the frames of a test segment, calculated as the average of the minimum distances between each frame and the centroids, allows the comparison of voice segments. This method can be described as deterministic, vectorial, and non-parametric; its results depend strongly on the fixed size of the dictionary. Kernel approaches, such as the Support Vector Machine (SVM) [13][14][15][16], apply a nonlinear transformation that maps each frame into a large superspace. The method maximizes the linear margin between classes. The performance of SVM-based systems now competes with that of the state-of-the-art Gaussian mixture models that we will present later. This method is qualified as semi-deterministic (the transformation may or may not derive from theoretical laws), vectorial, and discriminant. Anchor models [17] represent a voice relative to a space of models, in the form of a vector of similarities with respect to them. A similarity can be calculated by a comparison score (anchor model vs. voice learning data).
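The VQ proximity measure described above, the average of the minimum distances between each test frame and the codebook centroids, can be sketched as follows (a minimal Python illustration with hypothetical toy vectors; real systems operate on acoustic feature frames):

```python
import numpy as np

def vq_distortion(frames, codebook):
    """Average of the minimum Euclidean distances between each test
    frame and the speaker's codebook centroids (lower = closer match)."""
    # pairwise distances: shape (n_frames, n_centroids)
    d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
    # nearest-centroid distance per frame, averaged over the segment
    return d.min(axis=1).mean()

# toy example: 4 test frames against a 2-centroid codebook (hypothetical values)
frames = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
codebook = np.array([[0.0, 0.0], [3.0, 3.0]])
print(vq_distortion(frames, codebook))  # → ~0.7071 (i.e. sqrt(2)/2)
```

The test segment is assigned to the speaker whose codebook yields the smallest average distortion, which is why the result depends so strongly on the codebook (dictionary) size.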
This method depends strongly on the robustness of the anchoring models, i.e. on their ability to synthesize the structural information of the acoustic space through a discrete covering. Other methods, as well as variants of the previous ones, have been presented over the past twenty years. They are generative or discriminative "oriented": Dynamic Bayesian Networks (DBN) [18][19]; Segment Mixture Models (SMM) [20], combining GMM approaches (which we present below) with Dynamic Time Warping (DTW) [21] to take the temporal structure of the voice signal into account; and the Gaussian Dynamic Warping (GDW) approach of [22], a hybrid exploiting the generalization ability of generative approaches and the structural modeling of DTW. A VQ codebook model is developed based on a K-means clustering procedure over perceptual features: L training vectors are mapped into M clusters, each block being normalized to unit magnitude before being given as input to the model, and one model is created for each human voice [23]. In the family of discriminating linear classifiers, [24] introduced the logistic regression classifier, which has the advantage of relying on probabilistic and Bayesian hypotheses; however, its performance does not match that of the SVM in the experiments of [24]. Also noteworthy is [25], which goes beyond the limits of supervector-based SVMs. GMM-based modeling is part of the more general Hidden Markov Model (HMM) family. Hidden Markov models apply perfectly to the text-dependent mode, obtaining excellent results. On the other hand, the use of HMM models in text-independent mode does not improve on the performance achieved by simpler GMM-based models [26]. These use the notion of state introduced by the Russian mathematician Markov; the succession of states, governed by transition probabilities, makes it possible to elaborate a stochastic model of the voice.
This approach, initially used for speech recognition, is also very effective for voice recognition. According to the literature review by [2], GMM-based AVR is the best method because it is excellent for the voice pattern recognition problem. In addition, [27] developed a GMM-based automatic voice verification system for forensics in Bahasa Indonesia. Furthermore, [28, 29] showed in their research that GMM performs best for voice recognition, followed by DTW, and finally the least performing algorithm, SVM. As shown in Figure 1, the researchers compared GMM, DTW, and SVM using the TIMIT voice corpus, increasing the number of voices from 10 to 50.

RESEARCH METHOD
Voice recognition systems aim to verify or identify an unknown voice signal against specific voice signals in a database. They are classified into two types of service [30]: a. Voice verification, which confirms a claimed identity; b. Voice identification, which determines the identity of a speaker among the registered voices. Our proposed system is a Voice Verification System (VVS) and is divided into three key components: feature extraction, voice modeling, and decision.

Feature extraction of voice signals
Mel Frequency Cepstral Coefficients (MFCCs) are a type of feature that describes the characteristics of a voice signal. Their operation is similar to a simulation of the human auditory system, which has proved effective in noisy environments, and they operate more accurately in the frequency domain than in the time domain. MFCC extraction consists of six stages, as shown in Figure 2. The first stage is pre-emphasis, which boosts the higher frequencies of the voice signal to restore components attenuated relative to the noise. The second stage is frame blocking, which divides the voice signal into many short frames so that it can be treated as quasi-stationary over each frame. The third stage is the Hamming window, by which each frame is multiplied to maintain continuity and reduce truncation effects at the first and last points of the frame. The fourth stage is the Fourier transform, used to convert each frame from the time domain to the frequency domain. The fifth stage is the Mel-scale filter bank, used to emphasize perceptually important frequencies and minimize unwanted ones. The last stage is the log energy computation, followed by the discrete cosine transform, which is applied to return to the time domain [1].
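The six stages above can be sketched in Python as follows. This is a minimal illustration only: the frame length, hop size, FFT size, pre-emphasis coefficient, and filter counts are assumed typical values, not necessarily the HCopy configuration used in this work.

```python
import numpy as np

def mfcc(signal, fs=16000, n_mels=26, n_ceps=13, frame_len=400, hop=160):
    """Minimal sketch of the six MFCC stages (illustrative parameters)."""
    # 1. pre-emphasis: boost high-frequency content
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2. frame blocking: short overlapping frames (quasi-stationary speech)
    n_frames = 1 + (len(sig) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = sig[idx]
    # 3. Hamming window: reduce truncation effects at frame edges
    frames = frames * np.hamming(frame_len)
    # 4. Fourier transform: power spectrum of each frame
    spec = np.abs(np.fft.rfft(frames, n=512)) ** 2
    # 5. Mel-scale triangular filter bank
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = imel(np.linspace(mel(0), mel(fs / 2), n_mels + 2))
    bins = np.floor((512 + 1) * pts / fs).astype(int)
    fbank = np.zeros((n_mels, spec.shape[1]))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # 6. log energy, then DCT back toward the (cepstral) time-like domain
    logmel = np.log(spec @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.arange(n_ceps)[:, None] * (n + 0.5) / n_mels)
    return logmel @ dct.T  # shape: (n_frames, n_ceps)

# one second of dummy audio at 16 kHz yields 98 frames of 13 coefficients
x = np.random.default_rng(0).standard_normal(16000)
print(mfcc(x).shape)  # → (98, 13)
```

Appending delta and acceleration coefficients to such vectors is what HCopy's `MFCC_0_D_A` target kind refers to.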

Voice modeling
The voice modeling step exploits the data provided by the feature extraction step in order to create a representation of an individual that will subsequently be used to authenticate them. The model is usually a statistical representation of the acquired data. The voice model also depends on the quality of the microphones, the expected channels of the voice signal, and the amount of voice data available for enrollment and detection. In this paper, we have chosen a stochastic model, the GMM-UBM, which is simple, easy to evaluate, and fast to compute, and is used for text-independent human voice verification in a noisy environment. As shown in Figures 3 and 4, the UBM is first created using the Expectation Maximization (EM) algorithm, and the speaker GMM models are then derived from it through Maximum A Posteriori (MAP) adaptation. The UBM-GMM models are then used to recognize the unknown voice during the testing process, using the log-likelihood ratio and the equal error rate.
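The scoring side of this model can be illustrated numerically: a test segment is scored by its average per-frame log-likelihood ratio between the claimed speaker's GMM and the UBM. The sketch below uses hypothetical single-component diagonal-covariance models, not the toolbox's implementation:

```python
import numpy as np

def gmm_loglike(X, weights, means, variances):
    """Per-frame log-likelihood under a diagonal-covariance GMM."""
    d = X.shape[1]
    # log N(x | mu_k, diag(var_k)) for every (frame, component) pair
    lg = -0.5 * (np.log(2 * np.pi) * d
                 + np.log(variances).sum(axis=1)[None, :]
                 + (((X[:, None, :] - means[None, :, :]) ** 2)
                    / variances[None, :, :]).sum(axis=2))
    # log sum_k w_k N(...), computed stably via log-sum-exp
    a = lg + np.log(weights)[None, :]
    m = a.max(axis=1, keepdims=True)
    return m.squeeze(1) + np.log(np.exp(a - m).sum(axis=1))

def llr_score(X, speaker_gmm, ubm):
    """Average per-frame LLR: accept the claim if above a threshold."""
    return (gmm_loglike(X, *speaker_gmm) - gmm_loglike(X, *ubm)).mean()

# hypothetical 2-D models: speaker GMM centred at (2, 2), UBM at the origin
speaker = (np.array([1.0]), np.array([[2.0, 2.0]]), np.array([[1.0, 1.0]]))
ubm = (np.array([1.0]), np.array([[0.0, 0.0]]), np.array([[1.0, 1.0]]))
X = np.full((5, 2), 2.0)  # test frames lying on the speaker's mean
print(llr_score(X, speaker, ubm))  # → 4.0 (positive: claim accepted)
```

A positive score means the frames fit the claimed speaker's adapted model better than the background population, which is the decision quantity thresholded at test time.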

Decision and performance measures
The decision step checks the adequacy of the vocal message against the acoustic reference of the claimed identity; it is an all-or-nothing decision. The performance of voice verification is given in terms of the accuracy rate, defined as the percentage of true voices accepted during the system test.
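As a worked illustration of this measure (the trial counts below are hypothetical, not results from this paper):

```python
def accuracy_rate(accepted_true, total_true_trials):
    """Percentage of true voices accepted during the system test."""
    return 100.0 * accepted_true / total_true_trials

# e.g. 47 of 50 genuine speakers accepted by the verifier
print(accuracy_rate(47, 50))  # → 94.0
```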

RESULTS AND ANALYSIS

Data collection
In the data collection stage, we used the TIMIT database, which contains 630 human voice records (192 female and 438 male) at a sampling frequency of 16 kHz. Each user recorded 10 sentences; the duration of each sentence ranged between two and five seconds. We randomly selected 530 human voices as background data and kept the remaining 100 human voices as users' enrollment data. We also randomly selected 9 of the 10 sentences per human voice, keeping the remaining sentence as target human voice data. We then randomly collected 50 human voices from the users' enrollment data for system testing.

The voice verification system consists of two phases: training and testing
a. Train phase: After data collection, we created MFCCs for each human voice using HCopy, an HTK tool that converts the voice signal into different types of MFCCs. To calculate the MFCCs, we used a configuration file taken from the HTK-MFCC-MATLAB toolbox with the HCopy tool. We then trained on the background data to calculate the UBM using the Expectation Maximization (EM) algorithm of the MSR identity toolbox, and applied the user enrollment training data to calculate the GMM models using the Maximum A Posteriori (MAP) algorithm of the MSR identity toolbox. b. Test phase: To verify a voice, we added background noise and then reduced it with a noise filter in order to obtain an enhanced signal. After that, we extracted the MFCC features of the enhanced signal using the HCopy tool. Finally, we matched the MFCC features of the enhanced signal against the features of the human voices in the users' enrollment data, using the trained GMM-UBM models and the MSR identity toolbox algorithms, namely the Log-Likelihood Ratio (LLR) and Equal Error Rate (EER). All of our experiments were run in MATLAB v8.3. The results show that under perfect conditions (no noise) the system's accuracy was nearly 100% when tested with 50 human voices. Noise and de-noising methods were then applied in order to determine the MFCC feature configuration that best suits the MSR identity toolbox and achieves the best possible accuracy. To do this, we changed the MFCC configurations of HCopy, which is configured to automatically convert its wave-file input into other types of MFCC. Subsequently, we applied two types of noise, namely background noise and white Gaussian noise, at different signal-to-noise ratios (SNR, -30 to 60 dB), followed by four different de-noising methods: the Filter bank, least mean squares (LMS) filter, Wavelet filter, and Digital hearing aid filter.
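The noise-addition step of the test phase can be sketched as follows: white Gaussian noise is scaled so that the corrupted signal has a requested SNR in dB, as in the -30 to 60 dB sweep above (a minimal Python illustration; the test tone and sampling rate are assumed for the example):

```python
import numpy as np

def add_awgn(signal, snr_db, rng=None):
    """Add white Gaussian noise scaled to a target SNR in dB."""
    rng = np.random.default_rng(rng)
    p_signal = np.mean(signal ** 2)
    # SNR(dB) = 10 log10(P_signal / P_noise)  =>  solve for P_noise
    p_noise = p_signal / (10 ** (snr_db / 10))
    noise = rng.standard_normal(len(signal)) * np.sqrt(p_noise)
    return signal + noise

# hypothetical example: a 1 kHz tone sampled at 16 kHz, corrupted at 10 dB SNR
t = np.arange(16000) / 16000
clean = np.sin(2 * np.pi * 1000 * t)
noisy = add_awgn(clean, snr_db=10, rng=0)
measured = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
print(round(measured, 1))  # within a fraction of a dB of 10
```

Sweeping `snr_db` over the range used in the experiments, then passing the result through a de-noising filter before feature extraction, reproduces the structure of the test pipeline described above.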
The implementation of the Digital hearing aid filter was the most complicated, with many filter types and different steps; it appears unsuitable for our system, as all of its recognition results were zero. From the results listed in Tables 1 and 2, we notice that for the Filter bank with MFCC_0_D_A, the average accuracy over SNRs from -30 to 60 dB was 54.875% for background noise and 14.750% for white Gaussian noise. The results of the other filters are listed in the following tables, followed by a comparison of our proposed system with the results of [23] in clean and noisy environments. Figures 5 and 6 show the performance of the de-noising algorithms under different MFCC configurations using background noise and white Gaussian noise, respectively. From Table 3, we notice that the result of our proposed GMM-UBM system in a clean environment was 100% for 50 voices, which is better than the VQ codebook system. Noise degrades the information in the voice signal; without noise removal, system accuracy automatically drops. Accordingly, the accuracy of the VQ system [23] drops in a noisy environment at an SNR of 15 dB, as shown in Table 4. From Table 5, the achievement of our proposed system's objectives can be summarised as follows. To achieve the first objective, which is to design a voice recognition system using the MSR identity toolbox with the GMM-UBM models and enhanced MATLAB modules such as user voice recording, we used MATLAB, our implemented methods, and a rapid application development methodology. To achieve the second objective, we used the HTK MFCC configuration to investigate the feature extraction process and to determine which MFCC feature configuration best suits the MSR identity toolbox.

To achieve the third objective, we implemented our voice recognition system in a noisy environment based on the MSR identity toolbox and HTK, together with our noise removal method (the Filter bank), in order to achieve the best possible accuracy compared to the VQ codebook system.

CONCLUSION
In this paper, we presented a robust voice recognition system by implementing the code necessary to run the Microsoft Research identity toolbox. Most importantly, we improved the accuracy of the toolbox: in the clean environment, the result of GMM-UBM was better than the VQ codebook by 9%, and applying the Filter bank with MFCC_0_D_A increased the accuracy of the MSR identity toolbox by 3% under background noise as well as white Gaussian noise. In conclusion, our proposed system will contribute to making biometric access through voice a real success in real-world applications, as well as playing an important role in research involving voice recognition technology.