Comparative study to realize an automatic speaker recognition system

ABSTRACT


INTRODUCTION
Automatic speech recognition is a computer technique that is commonly used in several systems. In car systems, speech recognition identifies the driver's voice in order to respond to commands such as playing music, activating the global positioning system (GPS), launching phone calls, and selecting radio stations. In the field of language education, speech recognition can teach proper pronunciation, help people develop their oral expression, and facilitate education for blind students. In this context, this research proposes a method to identify the speaker's voice using adaptive orthogonal transformations [1] and compares it with the method of mel-frequency cepstral coefficients (MFCCs) [2]-[4].
In order to identify the speaker's voice, several methods are used to extract the distinctive features of each voice, among them mel-frequency cepstral coefficients. Although numerous researchers have chosen MFCCs as their feature extraction method because of their several advantages [5], [6], the technique reaches its limit in the improvement of automatic speaker recognition systems, as described by references [7]-[9]. It needs a large voice training dataset and a long execution time to identify the voice of each speaker [10], and the same goes for other approaches such as principal component analysis (PCA), discrete wavelet transform (DWT), and empirical mode decomposition (EMD), as revealed by reference [11].
Janse et al. [12] presented a comparative study between mel-frequency cepstral coefficients and discrete wavelet transform, in which they mentioned that MFCC values are not very robust in the presence of additive noise and that DWT requires a longer compression time. Winursito et al. [13] combined MFCCs with data reduction methods with the aim of improving the accuracy and increasing the computational speed of the classification process by decreasing the dimensions of the feature data. The data reduction process was designed in two versions: MFCC+SVD (version 1) and MFCC+PCA (version 2). The results showed a performance improvement for the proposed approach. Wang and Lawlor [14] proposed a method for a speaker recognition system by combining MFCCs with back-propagation neural networks. They revealed that this approach works successfully only when the number of unfamiliar speakers is not too large. From these research studies, we deduced that many authors have used MFCCs as a feature extraction approach and, to strengthen their methodologies, have used other approaches in the classification process to obtain the desired results. In addition, other authors have developed new methods that address the limitations of MFCCs in order to obtain an improved algorithm that is not sensitive to noise and has a fast execution time. The goal of this study is to solve the problems mentioned above by developing a fast algorithm based on adaptive orthogonal transformations for extracting the informative features from the voice signal using the smallest possible training dataset, inspired by references [1], [15]. This paper is organized as follows: Section 2 describes the new approach of orthogonal operators, then the comparison results obtained between MFCCs and the proposed method are discussed in section 3, and finally section 4 concludes the paper.

RESEARCH METHOD
2.1. Pre-processing
Before applying the proposed approach, it is first necessary to pre-process the signals, as shown in Figure 1. This involves firstly removing silence, secondly detecting the beginning and the end of the speech by using the zero-crossing rate (ZCR) [16]-[19], thirdly making the signals equal in length by using zero padding [20], [21], because the training dataset can contain several signals that do not have the same length, and finally compressing their size without losing quality, to avoid the problem of system slowness, by using the Fourier transform [22], [23] or correlation [24], [25]. Figure 2 shows the input signal before and after removing silence, with detection of the beginning and the end of speech. Figure 3 shows the speech signal after applying the Fourier transform method to detect the informative intervals. Figure 4 shows the speech signal after applying correlation to detect the informative intervals.
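To make the pre-processing pipeline concrete, the sketch below implements a minimal energy/ZCR-based silence trimmer and the zero-padding step in Python with NumPy. This is an illustrative sketch, not the authors' implementation: the frame length, thresholds, and the helper names `trim_silence` and `zero_pad` are our own assumptions.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of successive samples that change sign within a frame."""
    signs = np.sign(frame)
    signs[signs == 0] = 1
    return np.mean(signs[:-1] != signs[1:])

def trim_silence(signal, frame_len=256, zcr_thresh=0.1, energy_thresh=1e-4):
    """Crude speech-region detector: keep frames with high short-term
    energy and low ZCR (a simple voiced-speech heuristic)."""
    n_frames = len(signal) // frame_len
    voiced = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        energy = np.mean(frame ** 2)
        if energy > energy_thresh and zero_crossing_rate(frame) < zcr_thresh:
            voiced.append(frame)
    return np.concatenate(voiced) if voiced else signal

def zero_pad(signal, target_len):
    """Pad with trailing zeros (or truncate) so all signals share one length."""
    out = np.zeros(target_len)
    n = min(len(signal), target_len)
    out[:n] = signal[:n]
    return out
```

In practice the frame length and thresholds would be tuned on the training data; the point is only that every signal leaves this stage trimmed of silence and with a common length.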

2.2. Theoretical background
Our approach consists in searching for the informative features of the signal by using the operator H, a matrix operator of the transform (dimension N×N) whose number of rows corresponds to the number of basis functions. To decompose the vector X, the calculation of the discrete spectrum Y with numerical methods can be represented by the following matrix equation [1], [15], [26]:

Y = HX (1)

where X = [x_1, x_2, …, x_N]^T is the initial signal to be transformed (of size N = 2^n) and Y = [y_1, y_2, …, y_N]^T is the vector of the spectral coefficients, calculated by the orthogonal operator H. The calculation of the spectrum Y using (1) requires N^2 multiplication and addition operations. The most efficient way to reduce the number of operations is to use a sparse matrix [27], in which most of the elements are zero, which makes the calculation and execution of the algorithm faster. The method of Good [28], used in the construction of fast transformation algorithms, consists in expressing the orthogonal spectral operator H as a product of sparse matrices H_k composed of minimum-dimensional matrices called spectral kernels, which take the following form:

V_{j,k}(φ_{j,k}) = [cos(φ_{j,k}) sin(φ_{j,k}); sin(φ_{j,k}) −cos(φ_{j,k})] (2)

with φ_{j,k} ∈ [0, 2π].
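As a quick sanity check on the kernel in (2), the NumPy snippet below (the function name `spectral_kernel` is ours) builds V(φ) and verifies that it is orthogonal; this is why an operator assembled as a product of sparse matrices of such kernels is itself orthogonal and energy-preserving.

```python
import numpy as np

def spectral_kernel(phi):
    """2x2 spectral kernel from (2)."""
    return np.array([[np.cos(phi),  np.sin(phi)],
                     [np.sin(phi), -np.cos(phi)]])

V = spectral_kernel(0.7)                 # any angle in [0, 2*pi]
print(np.allclose(V @ V.T, np.eye(2)))   # True: V is orthogonal
```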
The operator H is then written as (3):

H = ∏_{k=1}^{n} H_k (3)

with k = 1…n and n = log2(N), which is the number of sparse matrices H_k, each matrix H_k being built from the spectral kernels (2). The formula for Y then becomes:

Y = (∏_{k=1}^{n} H_k) X (4)

The algorithm goes through a procedure of adaptation of the operator H to a class of input signals. It consists in calculating the average of the statistical features obtained at the pre-processing part (section 2.1) to form the standard vector X̂. We can say that the operator H is adapted to a class of signals represented by the standard vector X̂ if it verifies the following condition:

H X̂ = Y_T (5)

where Y_T is the target vector that constructs the adaptation criterion of the operator H to X̂. The target vector Y_T is calculated as (6):

Y_k = H_k Y_{k−1} (6)

with k = 1…log2(N) and Y_0 = X̂. In a simplified way, the synthesis procedure of the operator of orthogonal transformation is as follows:
− For k = 1: Y_1 = H_1 X̂, where Y_1 contains N/2^1 non-zero elements.
− For k = 2: Y_2 = H_2 Y_1, where Y_2 contains N/2^2 non-zero elements.
− For k = n: Y_n = Y_T = H_n Y_{n−1}, where Y_n contains N/2^n non-zero elements.
Then the orthogonal spectral operator is calculated as:

H = H_n H_{n−1} … H_1 (7)

Figure 5 shows the overall process used to extract the informative features from the input signals with our approach. As shown in Figure 5, the extraction of the informative features consists of 7 steps:
− Step 1: The input signals go through a pre-processing process (section 2.1).
− Step 2: We calculate the average of the statistical features obtained during the pre-processing part (using the Fourier transform and correlation).
− Step 3: The operator synthesis algorithm is applied to the average of the statistical features obtained from the previous step.
− Step 4: The output of the algorithm is the adaptive operator H.

− Step 5: The projection multiplication is applied between the operator H and the rest of the statistical features.
− Step 6: The result of the previous operation is a set of informative features that characterize each signal of the class.
− Step 7: The average of the feature vectors is calculated. The result is an informative feature vector with a minimum dimension that characterizes the whole class.

The test dataset contains 6000 voice recordings of speakers A, B, …, J and other unfamiliar speakers. To test the similarity between the speaker's voice in the training dataset and the speaker's voice in the test dataset, dynamic time warping (DTW) is used. Dynamic time warping [29]-[32] consists in comparing two voice signals by considering the Euclidean distance between the two vectors obtained by the applied method, which is defined by (8):

d_j = sqrt(∑_{i=1}^{N} (y_i − z_i)^2) (8)

where d_j is the distance between the j-th vector of the spectrum Y and the j-th vector of the spectrum Z, and N is the dimension of the Y and Z spectra. Therefore, the vector X will correspond to class i if d_i = min(d_j, j = 1…M), where M is the number of classes.
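The synthesis and matching steps above can be sketched in NumPy as follows. This is an illustrative implementation under our own reading of the method, not the authors' code: at each level k, the surviving coefficient pairs are combined by a spectral kernel of the form (2) whose angle is chosen so that a pair (a, b) maps to (sqrt(a²+b²), 0), which halves the number of non-zero coefficients at each level; the names `synthesize_operator` and `classify` are ours.

```python
import numpy as np

def synthesize_operator(x_hat):
    """Build H = H_n ... H_1 adapted to the standard vector x_hat
    (length N = 2**n), so that H @ x_hat has a single non-zero entry."""
    N = len(x_hat)
    n = int(np.log2(N))
    H = np.eye(N)
    y = x_hat.astype(float).copy()
    for k in range(1, n + 1):
        Hk = np.eye(N)
        step = 2 ** k            # spacing of surviving coefficients after level k
        half = step // 2
        for j in range(0, N, step):
            a, b = y[j], y[j + half]
            r = np.hypot(a, b)
            phi = np.arctan2(b, a) if r > 0 else 0.0
            c, s = np.cos(phi), np.sin(phi)
            # embed the 2x2 kernel (2) on rows/columns j and j+half:
            # it sends (a, b) to (r, 0), concentrating the pair's energy
            Hk[j, j], Hk[j, j + half] = c, s
            Hk[j + half, j], Hk[j + half, j + half] = s, -c
        y = Hk @ y
        H = Hk @ H
    return H

def classify(x, class_features):
    """Assign x to the class whose feature vector minimizes the
    Euclidean distance of (8)."""
    dists = [np.sqrt(np.sum((x - z) ** 2)) for z in class_features]
    return int(np.argmin(dists))
```

Because every H_k is orthogonal, so is H; applying it to other signals of the same class concentrates most of their energy in the first coefficients, which is what makes the resulting spectra usable as compact feature vectors.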

RESULTS AND DISCUSSION
The quality of recognition is measured by calculating the recognition rate, which is defined as (9):

Rate = (number of recognized voice recordings / total number of voice recordings) × 100 (9)

Tables 1 and 2 show the voice recognition rate according to the size of the interval of the analysis. From these tables, we observe that the adaptive orthogonal transform method gives good results compared to the MFCCs approach. As mentioned in section 2.1, we used correlation and the Fourier transform to work only with the informative intervals of the signal instead of the whole signal. There is a 47.5% difference in voice identification rates with Fourier transform intervals between our approach (96.8%) and MFCCs (49.3%). On the other hand, we found a 45.0% difference in voice identification rates with correlation intervals between our approach (98.1%) and MFCCs (53.1%). The correlation intervals thus give better results than the Fourier transform intervals, both for our approach and for the MFCCs. The proposed method succeeded in identifying 5886 of the 6000 voice recordings in the test dataset (a rate of 98.1%), compared to MFCCs, which identified only 3186 voice recordings (a rate of 53.1%); these results show the efficiency of our algorithm.
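The reported rates follow directly from (9); the one-line Python check below reproduces them from the raw counts given for the correlation intervals (the helper name `recognition_rate` is ours).

```python
def recognition_rate(n_recognized, n_total):
    """Recognition rate from (9): recognized recordings over total, in percent."""
    return 100.0 * n_recognized / n_total

print(round(recognition_rate(5886, 6000), 1))  # proposed method: 98.1
print(round(recognition_rate(3186, 6000), 1))  # MFCCs: 53.1
```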

CONCLUSION
MFCCs are among the most widely used and well-known methods in the field of signal processing. However, they need a large training dataset and a long execution time to extract the important features when the number of unfamiliar speakers in the test dataset is large. For these reasons, we developed a new method based on the creation of the operator H, which is adaptable to any input signal. Even though its creation goes through log2(N) iterations, where N is the length of the signal, working with a sparse matrix in which most of the elements are zero makes the calculation and execution of the algorithm faster. Our future goal is to increase the number of voice recordings in the test dataset and to decrease the number of voice recordings in the training dataset, to see whether the method continues to give successful results. In addition, we will combine it with other methods that are commonly used for classification, such as hidden Markov models (HMM) and artificial neural networks (ANN).

Figure 5. The overall process to extract the informative features from the input signals by using the adaptive orthogonal transform method

Table 1. The voice recognition rate according to the size of the interval using the Fourier transform with MFCCs and the adaptive orthogonal transform method

Table 2. The voice recognition rate according to the size of the interval using correlation with MFCCs and the adaptive orthogonal transform method