Adaptive wavelet thresholding with robust hybrid features for text-independent speaker identification system

ABSTRACT


INTRODUCTION
Speaker recognition is the task of identifying a speaker from the voice information extracted from the speaker's speech [1]. It is divided into speaker identification, which determines the speaker from a group of speakers, and speaker verification, which determines whether the voice really belongs to the claimed speaker [2]. Furthermore, it falls into two classes: text-dependent, in which the speaker must utter a password, and text-independent, which leaves the speaker free to say any words [3]. Speaker identification systems have many applications, such as remote access to services, banking operations over a telephone line, authentication, and forensic applications [1].
In speaker identification (SID) systems, the features extracted from each speech frame are crucial for building a reliable identification system. In clean environments the identification system performs well, but in noisy environments the distribution of the features extracted from noisy speech does not match the clean feature distribution model built in the training phase [4]. Researchers have applied many approaches to overcome this problem; this section presents some related work.
Speech enhancement is one of these approaches: the noisy speech signal is pre-processed first to suppress the noise. The spectral subtraction (SS) speech enhancement technique [5] relies on the absence of correlation between the clean speech and the noise, which are assumed to be additive in the time domain. The noise is assumed to change very slowly compared with speech, so the noise spectrum can be estimated during silence periods, and the clean speech spectrum can be estimated by subtracting the estimated noise spectrum from the noisy speech spectrum. Improved spectral subtraction [6] is based on two steps, speech activity detection (SAD) and noise amplitude spectral estimation; it uses frequency band variance to detect speech endpoints and calculate the noise power spectrum. Brajevic et al. [7] proposed using Ephraim-Malah estimation and the short-time Fourier transform to suppress stationary noise by reducing spectral coefficients. Abd El-Fattah et al. [8] used an adaptive Wiener filter in the time domain to estimate noise from the speech signal. Ahmet M. and Aydin A. [9] used empirical mode decomposition (EMD) for speech signal decomposition with the detrended fluctuation analysis (DFA) technique to threshold the noisy intrinsic mode functions (IMFs) and drop them; the experiments showed good results on Gaussian noise at 0 dB. S. Abd El-Moniem et al. [10] proposed using EMD and SS as a pre-processing stage to enhance the noisy speech; SS was used to estimate and suppress the noise spectrum on each IMF before reconstructing the enhanced signal. S. M. Govidan et al. [4] used adaptive bionic wavelet shrinkage (ABWS), a speech enhancement technique that suppresses additive noise and increases the accuracy of the speaker recognition system; a double threshold is computed from the estimated noise and applied to each sub-band of the adaptive bionic wavelet decomposition, and good results were reported for a variety of noise types and levels. Y. Xu et al. [11] obtained a clean speech signal from a noisy one with a deep neural network (DNN) by calculating the log-power spectra of the noisy speech and mapping noisy to clean data with a well-trained DNN; the mapping function was trained over 104 noise types with 2500 hours of training.
Another approach is to extract noise-robust features that achieve a high identification rate without suppressing noise. H. Hermansky and N. Morgan [12] proposed relative spectral perceptual linear prediction (RASTA-PLP), built on the assumption that the human auditory system is relatively insensitive to slowly varying stimuli, so performance can be improved by eliminating the components that change very slowly compared with the speech signal. RASTA filtering makes the output much less sensitive to stimuli that vary very slowly. Kim and Stern [13] proposed new features called power normalized cepstral coefficients (PNCC), in which a power-law nonlinearity replaces the log nonlinearity of MFCC features and a power-bias subtraction technique suppresses the additive noise. Wang et al. [14] proposed wavelet octave coefficients residues (WOCOR), which complement the information in MFCC features; the results state that this technique enhanced system accuracy for mismatched spoken content. Zhao et al. [15] introduced a new feature extraction method called gammatone frequency cepstral coefficients (GFCC); the work was based on the human auditory peripheral model, with a gammatone filter bank replacing the Mel-frequency filter bank, which made it perform better than MFCC features. Mean Hilbert envelope coefficients (MHEC) were proposed by Sadjadi and Hansen [16] to extract features from the smoothed Hilbert envelope of a gammatone filter bank; the results showed that MHEC features are less prone to noise than MFCC features. Satyanand Singh and Pragya Singh [17] proposed extracting speaker-specific features based on statistical modeling techniques of the speaker; the authors used the TIMIT dataset with 1000 utterances, and the results showed that GMM gives the best recognition accuracy, 99.1%. Kobra et al. [18] proposed applying mean and variance normalization followed by an auto-regression moving-average filter (MVA) to MFCC features; the new features give a 28% accuracy improvement compared with MFCC features at a 5 dB SNR level.
To achieve high accuracy in speaker identification despite additive noise, a feature extraction algorithm based on speech enhancement and robust combined features is proposed. Speech enhancement based on wavelet thresholding is used as a pre-processing stage to remove the noise from the input speech signal. Two cepstral features (PNCC and GFCC) are then extracted from the estimated clean speech signal, feature warping is applied to the extracted features, and finally the resulting features are concatenated to produce the final proposed robust features.
The rest of the paper is organized as follows. Section 2 describes the proposed feature extraction algorithm. Section 3 presents the experimental methodology. Section 4 presents simulation results and discussion. Finally, the conclusion is in Section 5.
Figure 1 shows the proposed feature extraction algorithm: the input speech signal is denoised with discrete wavelet transform (DWT) semisoft thresholding, PNCC and GFCC features are extracted, feature warping is applied to each, and finally the warped features are concatenated. The steps of the extraction algorithm are described in the following subsections.

Speech enhancement
The wavelet transform is used to analyze speech signals. The DWT is a type of wavelet transform in which the speech signal is decomposed into detail coefficients (CD) and approximation coefficients (CA) at several frequency sub-band levels with finite impulse response (FIR) filters [19], as shown in Figure 2. The CA are produced by convolving the speech signal with a low-pass filter, and the CD by convolving it with a high-pass filter. Each further decomposition level is obtained by applying the DWT to the approximation coefficients [20].
Figure 2. Block diagram of speech enhancement using DWT thresholding (h is a low-pass filter, g is a high-pass filter, ↓2 is downsampling, which discards half of the signal data, and ↑2 is upsampling, which doubles the signal data)
After the speech signal is decomposed, adaptive thresholding is applied to each resulting sub-band except the last approximation sub-band. The semisoft thresholding function is given by [21]:
$$\delta(x, T_1, T_2) = \begin{cases} 0, & |x| \le T_2 \\ \operatorname{sgn}(x)\,\dfrac{T_1\,(|x| - T_2)}{T_1 - T_2}, & T_2 < |x| \le T_1 \\ x, & |x| > T_1 \end{cases}$$
where $\delta(x, T_1, T_2)$ is the output value after thresholding, $x$ is the DWT sub-band frame, and $T_1$, $T_2$ are the upper and lower thresholds respectively. The threshold value $T_1$ is very important to the denoising performance: if it is too low, the noise will not be removed, and if it is too high, part of the speech signal will be lost [22]. Donoho [23] suggested the following estimate for the $T_1$ threshold value:
$$T_1 = \sigma_k \sqrt{2 \log N_k}$$
where $N_k$ is the signal length at sub-band level $k$ and $\sigma_k$ is the standard deviation at sub-band level $k$, given by [23]:
$$\sigma_k = \frac{\mathrm{MAD}(|x_k|)}{0.6745}$$
where $\mathrm{MAD}(|x_k|)$ is the median absolute deviation of the detail coefficients at sub-band level $k$. The threshold $T_2$ is calculated as described in [20]. To recover the enhanced speech signal, the inverse DWT (IDWT) is applied; the denoising procedure is repeated for each frame.
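The thresholding step can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the noise estimate uses the median of the absolute coefficients divided by 0.6745, and because the formula for the second threshold is taken from [20] and not reproduced here, the `lower_ratio` parameter is an arbitrary placeholder chosen only so the example runs.

```python
import math
from statistics import median

def universal_threshold(coeffs):
    """Donoho-style threshold: sigma * sqrt(2 * ln(N)), sigma = median(|c|) / 0.6745."""
    sigma = median(abs(c) for c in coeffs) / 0.6745
    return sigma * math.sqrt(2.0 * math.log(len(coeffs)))

def semisoft(x, lower, upper):
    """Semisoft shrinkage of one coefficient; expects 0 <= lower < upper."""
    a = abs(x)
    if a <= lower:
        return 0.0   # small coefficients are treated as noise
    if a > upper:
        return x     # large coefficients are kept unchanged
    # Linear ramp between the thresholds, continuous at both ends.
    return math.copysign(upper * (a - lower) / (upper - lower), x)

def denoise_subband(coeffs, lower_ratio=0.5):
    """Threshold one detail sub-band (a list of floats).

    lower_ratio is a hypothetical placeholder, not a value from the paper.
    """
    upper = universal_threshold(coeffs)
    lower = lower_ratio * upper
    return [semisoft(c, lower, upper) for c in coeffs]
```

In a full pipeline, a function like `denoise_subband` would be applied to every detail sub-band of each frame before the inverse DWT. The decomposition itself is left out here to keep the sketch dependency-free.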

Robust features extraction

Pre-processing
After the speech enhancement stage, the enhanced speech is used to extract the proposed features. The second pre-processing step is a pre-emphasis filter, applied to the speech signal to boost high frequencies [24]. Pre-emphasis is applied to PNCC features only, as in [25]; it is not applied to GFCC features because it degrades performance [26]. Framing is the third pre-processing step: the enhanced speech signal is cut into short overlapping frames of 20-30 ms to overcome the discontinuity problem of the speech signal, which may lead to wrongly extracted features and degraded performance [27]. The fourth step is to apply a Hamming window to every frame to increase signal continuity at the start and end of the frame [28].
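The pre-emphasis, framing, and windowing steps can be sketched as below. This is an illustrative outline only: the function names are hypothetical, the 16 kHz sample rate is an assumption, and the defaults (25 ms frames, 10 ms shift, pre-emphasis coefficient 0.97) are the values reported in the experimental methodology. Pre-emphasis is optional so that it can be enabled for PNCC and skipped for GFCC.

```python
import math

def pre_emphasis(signal, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1]; the first sample is passed through."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def frame_signal(signal, frame_len, hop):
    """Cut the signal into overlapping frames; a trailing partial frame is dropped."""
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, hop)]

def hamming(n):
    """Hamming window of length n (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def preprocess(signal, sample_rate=16000, frame_ms=25, hop_ms=10, alpha=None):
    """Optional pre-emphasis, then framing and Hamming windowing.

    sample_rate is an assumed value; pass alpha=0.97 for the PNCC branch
    and leave alpha=None for the GFCC branch.
    """
    if alpha is not None:
        signal = pre_emphasis(signal, alpha)
    frame_len = sample_rate * frame_ms // 1000
    hop = sample_rate * hop_ms // 1000
    window = hamming(frame_len)
    return [[x * w for x, w in zip(frame, window)]
            for frame in frame_signal(signal, frame_len, hop)]
```

With these defaults, one second of audio at the assumed 16 kHz rate gives frames of 400 samples taken every 160 samples.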

Power normalized cepstral coefficients (PNCC)
PNCC features are powerful features that outperform conventional features in both noisy and clean environments [25]. Their high identification accuracy results from the power-law nonlinearity, which closely approximates the human auditory system [29]. Figure 3 illustrates the block diagram of the PNCC processing stages as described in [13].
After the pre-processing stage, the signal is transformed to the frequency domain by the short-time Fourier transform (STFT). Frame power is calculated, and then a gammatone filter bank with equivalent rectangular bandwidth (ERB) is applied. To suppress the channel noise, asymmetric noise suppression (ANS), temporal masking, and weight smoothing are used. A power-function nonlinearity is used because the output behavior does not critically rely on the amplitude of the input [30]. The discrete cosine transform (DCT) is then applied to de-correlate the highly correlated spectral features [31]. Finally, cepstral mean normalization is applied to produce the normalized cepstral vector, which removes channel distortion and improves the recognition rate in noisy environments [32].
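Two of the stages above, the power-function nonlinearity and cepstral mean normalization, are simple enough to sketch directly. The snippet is a hedged illustration, not the PNCC reference implementation: the function names are hypothetical, and the exponent 1/15 is the value commonly associated with PNCC rather than a value stated in this paper, so treat it as an assumption. The filter bank, ANS, temporal masking, and DCT stages are omitted.

```python
def power_law(band_powers, exponent=1.0 / 15.0):
    """Compress non-negative filter-bank powers with a power function.

    The default exponent is an assumption, not a value from this paper.
    """
    return [p ** exponent for p in band_powers]

def cepstral_mean_normalize(frames):
    """Subtract the per-coefficient mean over all frames of an utterance."""
    if not frames:
        return []
    means = [sum(col) / len(frames) for col in zip(*frames)]
    return [[v - m for v, m in zip(frame, means)] for frame in frames]
```

Subtracting the utterance-level mean removes any offset that is constant across frames, which is how the normalization counters a fixed channel effect in the cepstral domain.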

Gammatone frequency cepstral coefficients (GFCC)
The gammatone filter bank is a series of overlapping band-pass filters that models the human auditory system [33]. The combination of the gammatone filter bank (GF), cubic root, and equivalent rectangular bandwidth (ERB) gives GFCC features their robustness in noisy environments [34]. The block diagram of the GFCC processing stages is depicted in Figure 4, as described in [15]. The pre-processed speech signal is passed through a 64-channel gammatone filter bank whose center frequencies range from 50 to 8000 Hz. The filter response at each channel is fully rectified (i.e., its absolute value is taken) and then decimated to 100 Hz, which yields one frame every 10 ms. The absolute value is taken to create a time-frequency (T-F) representation that is a variant of the cochleagram. After that, the cubic root of the decimated output magnitudes is taken. Finally, the DCT is applied to de-correlate the cepstral coefficients and reduce dimensionality [15].
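The last two GFCC stages, cubic-root compression and the DCT, can be sketched for a single frame as follows. This is a minimal sketch under assumptions: the input is taken to be the 64 rectified, decimated channel magnitudes for one frame, the DCT is an unnormalized DCT-II written out directly, and keeping coefficients 1 to 21 (dropping the 0th) mirrors the feature count used later in the experiments. The function names are hypothetical.

```python
import math

def dct_ii(x):
    """Unnormalized DCT-II of a list of floats."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i in range(n))
            for k in range(n)]

def gfcc_frame(channel_magnitudes, n_keep=21):
    """Cubic-root compression followed by DCT; the 0th coefficient is dropped."""
    compressed = [m ** (1.0 / 3.0) for m in channel_magnitudes]
    cepstrum = dct_ii(compressed)
    return cepstrum[1:n_keep + 1]
```

Because the 0th DCT coefficient carries the overall level of the compressed spectrum, a flat input frame produces coefficients that are numerically zero after it is dropped.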

Feature warping (FW)
Feature warping maps the cepstral features to a target distribution to increase the robustness of the resulting features. The FW processing steps can be summarized as follows [...]. If a normal distribution is chosen, then:
$$\frac{N + \frac{1}{2} - R}{N} = \int_{-\infty}^{m} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right) dz$$
where $m$ is the feature warped component, $N$ is the window length, and $R$ is the rank.
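A sliding-window version of feature warping to a standard normal target can be sketched as below. Several details are assumptions rather than statements from this paper: the rank is counted from 1 for the largest value in the window, the rank is mapped to a quantile as (N + 1/2 - rank) / N, which is the commonly cited form of feature warping, and the window is simply shortened at the utterance edges. The function name is hypothetical.

```python
from statistics import NormalDist

def warp_feature_stream(values, window=301):
    """Warp one cepstral dimension (a list of floats over time) to N(0, 1)."""
    std_normal = NormalDist()
    half = window // 2
    warped = []
    for t, v in enumerate(values):
        # Window centered on frame t, shortened at the edges (assumption).
        lo = max(0, t - half)
        hi = min(len(values), t + half + 1)
        win = values[lo:hi]
        n = len(win)
        # Rank 1 means v is the largest value in the window.
        rank = 1 + sum(1 for w in win if w > v)
        # Map the rank to a quantile and invert the normal CDF.
        warped.append(std_normal.inv_cdf((n + 0.5 - rank) / n))
    return warped
```

The function would be called once per cepstral dimension, so a matrix of frames has to be processed column by column. The quantile always lies strictly between 0 and 1, so the inverse CDF is well defined for every frame.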

EXPERIMENTAL METHODOLOGY
Experiments are carried out on the TIMIT [36] dataset, which consists of 630 speakers with 10 utterances each. For the UBM-GMM classifier, 530 speakers (i.e., 5300 utterances) are chosen randomly to train the UBM, and the remaining 100 speakers are used for testing. The GMM is trained with 9 utterances from each speaker, and the last utterance is left for testing. To test the robustness of the proposed algorithm, 4 noise types are chosen from the Noisex-92 [37] noise dataset and artificially added to the test utterances at signal-to-noise ratio (SNR) levels of 0, 5, 10, and 15 dB. In the speech enhancement stage, all utterances are split into non-overlapping frames of 16 ms, decomposed into 4 levels by the DWT with the Daubechies 8 (db8) wavelet, and pruned with semisoft thresholding; each frame is then reconstructed, and the frames are recombined to produce the enhanced speech signal. In the feature extraction stage, the speech signal is framed with an overlapping Hamming window of 25 ms frame length and 10 ms window shift. From each frame, 42 GFCC features (21 GFCC and 21 ∆GFCC) are extracted with 64 gammatone filters, dropping the 0th coefficient, and 42 PNCC features (21 PNCC and 21 ∆PNCC) are extracted with 40 filters, applying a pre-emphasis filter with coefficient 0.97 and dropping the 0th coefficient. Feature warping with a window length of 301 frames (3 s) is then applied to each set of cepstral features (GFCC and PNCC) to produce GFCC-FW and PNCC-FW. After that, the resulting features are concatenated to obtain the final proposed features. A UBM-GMM with 256 Gaussian mixtures and 10 expectation-maximization iterations is used to evaluate the results.
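Two steps of this setup lend themselves to a short sketch: adding noise to a test utterance at a target SNR, and concatenating the two warped feature streams frame by frame. The code is an illustration under assumptions, not the authors' scripts: the function names are hypothetical, signals are plain lists of floats, the noise is scaled from its average power over the utterance, and the noise must be non-silent and at least as long as the speech.

```python
import math

def mean_power(x):
    """Average power of a signal given as a list of floats."""
    return sum(v * v for v in x) / len(x)

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that 10*log10(P_speech / P_noise) equals snr_db."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    gain = math.sqrt(mean_power(speech)
                     / (mean_power(noise) * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]

def concatenate_features(gfcc_fw, pncc_fw):
    """Join two per-frame feature lists, e.g. 42 + 42 values per frame."""
    return [g + p for g, p in zip(gfcc_fw, pncc_fw)]
```

With 42 GFCC-FW and 42 PNCC-FW values per frame, the concatenated vector has 84 dimensions. Both streams need the same frame count, which follows from the shared 25 ms frame length and 10 ms shift.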

SIMULATION RESULTS AND DISCUSSION
In this section, the robustness of the proposed features is tested in both clean and noisy environments and then compared with similar studies.

Speech enhancement technique analysis
To select the best parameter settings for the speech enhancement pre-processing stage, a number of factors are varied and their effect on the average identification accuracy is tested: frame length, number of decomposition levels, filter function, and number of filters. The average results are listed in Table 1; they indicate that 4 levels of DWT decomposition with db8 and a frame length of 16 ms give the best identification accuracy. Table 2 compares the baseline and proposed features with and without the DWT speech enhancement technique and shows its effect on the identification accuracy rate. The results in Table 2 show that DWT with semisoft thresholding and the proposed features give a noticeable improvement in identification rate, except for the clean speech signal, where PNCC features give the top identification rate.

Comparison with similar studies
The proposed feature extraction algorithm is compared with similar studies to show its effectiveness. Table 3 briefly describes the systems of the studies used in the comparison. The same parameters are used in the comparison: frame length, frame shift, number of Gaussian mixtures, noise types, and SNR levels. Figure 5 shows the comparison results: the proposed feature extraction algorithm outperforms all the compared studies in identification accuracy.
Figure 5. Comparison with other studies: (a) with work proposed by [1], (b) with work proposed by [18], (c) with work proposed by [38], and (d) with work proposed by [39]
ISSN: 2088-8708  Adaptive wavelet thresholding with robust hybrid features for text-independent... (Hesham A. Alabbasi) 5215

CONCLUSION
In this work, a new feature extraction algorithm is presented. It consists of two stages. The first stage is speech enhancement with DWT semisoft thresholding. The second stage concatenates two extracted features, power normalized cepstral coefficients (PNCC) with feature warping (FW) and gammatone frequency cepstral coefficients (GFCC) with FW, which are studied for a robust speaker identification system over noisy channels. UBM-GMM is used as the classifier. Experiments are carried out on the TIMIT dataset, where 100 speakers are used for testing. Testing is done in clean and noisy conditions to assess the robustness of the proposed feature extraction algorithm; 4 noise types are chosen from the Noisex-92 noise dataset (babble, factory 1, pink and white) and added to the test utterances with SNR levels 0