Purging of silence for robust speaker identification in colossal database

Received Aug 18, 2020; Revised Dec 4, 2020; Accepted Jan 19, 2021

The aim of this work is to develop an effective speaker recognition system for large data sets under noisy environments. The important phases of a typical identification system are feature extraction, training and testing. During the feature extraction phase, speaker-specific information is derived from the characteristics of the voice signal. In this work, effective silence removal methods are proposed in order to achieve accurate recognition under noisy environments. Pitch and pitch-strength parameters are extracted as distinct features from the input speech spectrum. Multilinear principal component analysis (MPCA) is utilized to minimize the complexity of the parameter matrix. Silence removal using zero crossing rate (ZCR) and endpoint detection algorithm (EDA) methods is applied to the source utterance during the feature extraction phase. These features are used in the later classification phase, where identification is made on the basis of support vector machine (SVM) algorithms. Forward looking stochastic (FOLOS), an efficient large-scale SVM algorithm, is employed for effective classification among speakers. The evaluation findings indicate that the suggested methods increase performance for large amounts of data in noisy environments.


INTRODUCTION
The ever-increasing database size in real-world speaker recognition systems poses challenges such as long training time, poor response time and vast memory requirements [1][2][3]. Robustness and adaptability are the major aspects of real-world speaker recognition systems. Previous work shows that good results are achieved for clean, high-quality speech under matched conditions. However, under noisy environments and mismatched conditions, the performance of the recognition system degrades significantly, falling well short of acceptable levels. Therefore, sophistication is also a critical consideration in the identification of speakers. This motivates us to investigate new methods for the different stages of the speaker recognition process. A typical speaker recognition system consists mainly of two stages: the enrollment phase and the classification phase [4,5]. During the enrollment stage, speaker-specific information is extracted from the speech database in chronological order. A collection of such models forms the speaker database. In the classification stage, an input speaker model is compared with the existing models in the database and the results are produced. In addition, features are extracted from input speech and transferred into a compact

THE PROPOSED METHOD
In this research work, silence removal algorithms are employed to eliminate the background noise from the source utterance for robust speaker recognition. The FOLOS SVM algorithm, along with the MPCA dimensionality reduction technique, is adopted for the classification of voices. Hence, the MPCA-FOLOS combination is employed in this work to improve the recognition accuracy under noisy environments for large data sets, as illustrated in Figure 1.

Feature extraction
The feature extraction module collects a group of parameters which indicate the speaker-specific data from the input signal. This data is the result of complex processes at various phases of speech signal processing. The characteristics of speech signals are mainly differentiated by the learning habits, vocal tract and vocal excitation of speakers. It is observed that the physiologically based vocal tract features are more robust and less prone to mimicry [8][9][10]. Accordingly, vocal tract related parameters like "mel frequency cepstral coefficients (MFCC)", "linear predictive cepstral coding (LPCC)" and other non-conventional features exhibit good results under clean and matched conditions [11]. These methods are developed by using linear predictive (LP) residual signals. However, the performance of these systems degrades severely under mismatched and noisy environments [12]. Therefore, it is necessary to derive a set of new features from the excitation source which are less prone to environmental noise. Excitation source characteristics such as pitch and pitch strength reflect the physiological and behavioral aspects of the speaker. This work mainly focuses on feature extraction by accumulating characteristics from pitch and pitch strength parameters.

Noise removal
Voice activity detection algorithms are utilized to eliminate the environmental noise from the source utterance for high-quality speaker identification. The noise removal comprises two main techniques: the zero crossing rate and the energy-based speech utterance detector. In order to separate the pitched part of the speech signal, the statistical nature of the background is significant.
In the speaker identification process, elimination of the silent part is crucial for robustness. Therefore, a preprocessing step that isolates redundant information is important. A speech segment mainly comprises unvoiced (U), silence (S) and voiced (V) events [13][14][15]. There is no voice in a silent zone, where the sound wave does not even pulsate as it does in the adjacent regions. This work utilizes well-known silence removal methods.

Dimensionality reduction
This procedure minimizes the dimensions of the feature matrix while retaining the speaker-discriminative data needed for identification. The amount of information gathered from the input speech segments is very large, since the fundamental qualities change gradually. Further, the identification procedure requires relatively little information to specify the attributes of speech. In the speaker recognition process, a large feature set demands vast memory and high processing time. Hence, it is necessary to minimize the complexity of the parameter matrices by mapping a high-dimensional space into a space of fewer dimensions. Apart from the computational benefit, an improvement in accuracy can be achieved under noisy environments [16]. The more effective and structured analysis used for this reduction in the experimental work is "multilinear principal component analysis" (MPCA) [17]. We have conducted several investigations and performed a comparative analysis with respect to population size. Since these methods are successful in large-scale recognition tasks, the original contributions of this work are made with respect to large-scale speaker recognition.

Classification
Signal classification is a dynamic procedure for verifying a given unknown individual in the speaker identification process. The classification is commonly made by comparison with the existing database. This process is generally partitioned into two steps, namely training and validation [18][19][20]. In the training step, the artificial neural network is trained with the speaker-specific information. Validation is the calculation of a matching score, which is a measure of the closeness between the applied and existing speaker models.
Verification is tested by comparing the estimated features to the speaker models. The support vector machine (SVM) is a successful discriminator and a very effective method in recognition systems [21]. Specifically, the SVM operates within statistical learning theory to implement the risk minimization function. However, the main constraints of the SVM are its computational complexity and its poor performance with large data sets. Therefore, investigations have been carried out on the optimization of large-scale SVM algorithms. An efficient SVM algorithm, namely FOLOS, has been derived and tested [22,23].
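As an illustration of how a large-scale linear SVM can be trained one sample at a time, the following is a minimal sketch, assuming that FOLOS here refers to the forward-backward splitting scheme known from the online learning literature: a subgradient step on the hinge loss followed by a closed-form proximal step for the regularizer. The function names, the L2 regularizer, the step-size schedule and the parameter values are assumptions for illustration and are not taken from this work.

```python
import random


def folos_train(samples, labels, lam=0.01, epochs=50, seed=0):
    """Train a linear two-class SVM with forward-backward style updates.

    Assumed setup: hinge loss, regularizer (lam / 2) * ||w||^2, labels in {-1, +1}.
    """
    rng = random.Random(seed)
    w = [0.0] * len(samples[0])
    b = 0.0
    order = list(range(len(samples)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (t ** 0.5)  # assumed step-size schedule
            x, y = samples[i], labels[i]
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            # Forward step: the hinge-loss subgradient is -y * x when margin < 1.
            if margin < 1.0:
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
                b += eta * y
            # Backward step: closed-form minimizer of
            # 0.5 * ||w - w_half||^2 + eta * (lam / 2) * ||w||^2.
            shrink = 1.0 / (1.0 + eta * lam)
            w = [wj * shrink for wj in w]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 according to the side of the separating hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

For speaker identification with many speakers, a two-class learner of this kind would typically be wrapped in a one-versus-rest scheme; that wrapping is not shown here. Each update touches a single sample, which is why this style of training scales to large data sets.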

SILENCE REMOVAL
In speech-based applications, two main techniques are widely adopted for the removal of silence: the zero crossing rate and the energy-based speech utterance detector. In order to separate the pitched part of the speech signal, the statistical nature of the background is significant. The following sections describe the important silence removal procedures.

STE algorithm
The short time energy (STE) algorithm comprises three steps for eliminating the silent part and detecting speech in the recorded utterance.

Pre-process
The extracted input signals are partitioned into frames of 18 ms using a Hanning window.

Estimation of segments
The signal peak of every block is determined analytically from its samples. Here, i = 0, 1, 2, 3, …, N, where the parameter N denotes the sample count in every segment, and the function x_i(n) represents the magnitude of the n-th sample in the i-th segment. The energy thresholds T1 and T2 are derived from the frames of the signal. These levels are calculated from the following typical parameters.
Here, k runs over the segments with P > T1. In these expressions, Enrg_Max denotes the maximum energy level, Enrg_Min the minimum energy level, and SL the mean energy level of the segments above T1.

Silence elimination
The frames which have not been recognized as "voice" by the above steps are marked as "silence" and are removed from the signal utterance.
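The three STE steps can be sketched as follows. This is a minimal illustration and not the authors' implementation: the short-time energy is taken as the sum of squared windowed samples (the usual definition), and the single threshold placed between the minimum and maximum frame energy is a hypothetical stand-in for the thresholds T1 and T2, whose expressions are not reproduced here. The 16 kHz sampling rate is the one stated in the experimental setup.

```python
import math


def hann(length):
    """Hanning window of the given length."""
    if length == 1:
        return [1.0]
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (length - 1)) for n in range(length)]


def frame_energies(signal, sample_rate=16000, frame_ms=18):
    """Split the signal into non-overlapping frames and return their energies."""
    frame_len = int(sample_rate * frame_ms / 1000)
    window = hann(frame_len)
    energies = []
    # Trailing samples that do not fill a whole frame are ignored.
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        energies.append(sum((w * x) ** 2 for w, x in zip(window, frame)))
    return energies, frame_len


def remove_silence_ste(signal, sample_rate=16000, frame_ms=18, ratio=0.1):
    """Keep only frames whose energy exceeds an assumed threshold."""
    energies, frame_len = frame_energies(signal, sample_rate, frame_ms)
    if not energies:
        return []
    e_min, e_max = min(energies), max(energies)
    # Hypothetical threshold: a fixed fraction of the way from e_min to e_max.
    threshold = e_min + ratio * (e_max - e_min)
    kept = []
    for i, energy in enumerate(energies):
        if energy > threshold:
            kept.extend(signal[i * frame_len:(i + 1) * frame_len])
    return kept
```

The `ratio` parameter controls how aggressive the removal is; a larger value discards more low-energy frames.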

Zero crossing rate (ZCR) for silence elimination
It is very difficult to record perfectly clean speech. Some level of noise therefore inevitably interferes with the clean speech. As a result, there is a higher zero-crossing rate in silent speech segments. The following three steps are employed for silence elimination based on zero crossings.

Basic segmentation
The speech signal is divided into 10 ms segments with the Hanning window.

Zero cross rate estimation
In each frame, the threshold point for the estimation is calculated, where m stands for the segment number and x_m(n) represents the n-th sample in the m-th segment.

Silence elimination
Unvoiced parts of speech are eliminated by identifying the segments whose zero crossing rate is higher than a threshold.
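The zero-crossing steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the rate is computed as the fraction of adjacent sample pairs with differing signs, the threshold value of 0.3 is hypothetical, and the Hanning window is omitted because a non-negative window leaves the signs of the samples essentially unchanged.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def remove_high_zcr_frames(signal, sample_rate=16000, frame_ms=10, threshold=0.3):
    """Drop frames whose zero crossing rate exceeds the assumed threshold."""
    frame_len = int(sample_rate * frame_ms / 1000)
    kept = []
    # Trailing samples that do not fill a whole frame are ignored.
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        if zero_crossing_rate(frame) <= threshold:
            kept.extend(frame)
    return kept
```

In practice the threshold would be tuned on recordings of the target environment, since the zero crossing rate of background noise depends on its spectrum.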

Detection and silence removal algorithm
The mean µ and the standard deviation σ are calculated over the first 3100 samples of the voice signal, which represent the background noise.
For every sample x of the stored signal, the deviation is calculated, and the sample is categorized as voiced if $|x - \mu| / \sigma > 3$.
A voiced sample is denoted as one and a non-voiced sample as zero. The whole signal is then divided into segments, each 10 ms long, so that the entire speech is transformed into a series of 0 and 1 patterns. The first 100 ms to 200 ms of any speech signal represents background noise or silence. The background noise and silence portions are modeled as white Gaussian noise.
The first 3100 samples of the signal are used to calculate the parameters µ and σ, as these samples are known to be white noise. For Gaussian noise, the probability that a sample x satisfies $|x - \mu| \le 3\sigma$ is approximately 0.997. Hence, if a sample x satisfies $|x - \mu| \le 3\sigma$, it belongs to the background noise and the corresponding part can be eliminated.
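The detection procedure can be sketched as follows. This is a minimal illustration, assuming that the first 3100 values are individual samples of background noise and that a 10 ms window is kept when the majority of its samples are labeled as voiced; the majority rule and the function names are assumptions, not details given in the text.

```python
import math


def label_samples(signal, noise_len=3100, k=3.0):
    """Label each sample 1 (voiced) or 0 (background) with a k-sigma test."""
    noise = signal[:noise_len]
    if not noise:
        return []
    mu = sum(noise) / len(noise)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in noise) / len(noise))
    if sigma == 0.0:
        # Degenerate case: a perfectly constant background.
        return [1 if x != mu else 0 for x in signal]
    return [1 if abs(x - mu) / sigma > k else 0 for x in signal]


def remove_background(signal, sample_rate=16000, window_ms=10, noise_len=3100):
    """Keep the 10 ms windows in which most samples are labeled as voiced."""
    labels = label_samples(signal, noise_len)
    win = int(sample_rate * window_ms / 1000)
    kept = []
    for start in range(0, len(signal), win):
        block = labels[start:start + win]
        if 2 * sum(block) > len(block):  # assumed majority vote per window
            kept.extend(signal[start:start + win])
    return kept
```

Because about 99.7% of Gaussian noise samples fall within three standard deviations of the mean, a window of pure background noise almost never reaches a voiced majority.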

DIMENSIONALITY REDUCTION
The dimensionality reduction technique MPCA performs the reduction over all tensor modes. Specifically, the projected tensors capture the variation present in the given feature set. More precisely, a selected set of features that varies in characteristics is identified [24]. It is obtained by selecting elements from each column of the feature matrix, where the number of selected elements is taken to be a perfect square. Therefore, column vectors of the corresponding size are selected. Each column vector is indexed by b, and the respective matrix M is formed from it. From the matrix modified in this way, each b-th column vector is subjected to dimensionality reduction using MPCA. Y is an arbitrary constant. In this process, the distance matrix D is derived for each b-th matrix as in (16), using the mean matrix computed over the matrices M. To obtain the projection matrix Ψ, two tensors, one per mode, are applied to the distance matrix, as given in (17), and these two tensors must be determined. The projection matrix is obtained by solving a generalized eigenvector problem.
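For orientation, the commonly used form of MPCA for second-order tensors (matrices) is sketched below. The notation is an assumption chosen to echo the description above: $M_b$ for the b-th matrix, $\bar{M}$ for the mean matrix, $D_b$ for the centered (distance) matrix, $U_1$ and $U_2$ for the two mode-wise projection matrices, and $B$ for the number of matrices. It is not a reconstruction of equations (16) and (17) of this work.

```latex
\begin{align}
\bar{M} &= \frac{1}{B}\sum_{b=0}^{B-1} M_b, &
D_b &= M_b - \bar{M}, \\
\Psi_b &= U_1^{\top} D_b \, U_2, \\
\Phi_1 &= \sum_{b=0}^{B-1} D_b \, U_2 U_2^{\top} D_b^{\top}, &
\Phi_2 &= \sum_{b=0}^{B-1} D_b^{\top} U_1 U_1^{\top} D_b .
\end{align}
```

In this form, the columns of $U_1$ are the leading eigenvectors of $\Phi_1$ and the columns of $U_2$ are the leading eigenvectors of $\Phi_2$. Since each depends on the other, the two are updated alternately until the projections stop changing.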

Experimentation setup

Speaker database
The proposed speaker recognition techniques are implemented in the signal processing research tool MATLAB. These techniques are tested by using the speech database taken from Neurosciences Institute. In the experimental setup, all 640 speakers (446 males and 194 females) are used for training and testing [25]. During the second stage, a set of 240 speakers (142 males and 98 females) was chosen arbitrarily from the datasets. In the learning stage, every speaker is trained on clean speech, whereas testing is done separately on speech data from the NIST database with channel noise and white noise [26]. The experiments are conducted on a database prepared by adding white noise to the clean speech at different SNRs. The noise is generated dynamically with a MATLAB toolbox.
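Adding white noise at a target SNR amounts to scaling the noise power to the signal power divided by $10^{\mathrm{SNR}/10}$. The following minimal sketch shows that arithmetic in Python rather than MATLAB; it assumes a non-silent input signal, and the function names and the fixed random seed are illustrative choices, not part of the described setup.

```python
import math
import random


def add_white_noise(clean, snr_db, seed=0):
    """Return the clean signal plus white Gaussian noise at the requested SNR."""
    rng = random.Random(seed)
    signal_power = sum(x * x for x in clean) / len(clean)
    # SNR(dB) = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in clean]


def measured_snr_db(clean, noisy):
    """SNR in dB actually realized between a clean and a noisy signal."""
    signal_power = sum(x * x for x in clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy))
    return 10 * math.log10(signal_power / noise_power)
```

Calling `add_white_noise(clean, 10)`, `add_white_noise(clean, 20)` and `add_white_noise(clean, 30)` would produce the three noise levels used in the experiments; the realized SNR fluctuates slightly around the target because the noise is random.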

Feature extraction
In the feature extraction stage, twenty-five speech samples are collected from every user to prepare speaker-specific features. Twenty speech samples are employed for the training process, and the remaining samples are used for testing the recognition. The speech signals are sampled at 16 kHz and divided into windows of size 25 ms. Each frame comprises 256 FFT-based power spectrum vectors used to develop the feature vectors. In the next stage, these feature vectors are passed to dimensionality reduction to develop an optimized feature set.

Results
The performance of the proposed technique is evaluated on sentences taken from the NIST database. Robustness is measured using a noisy database, deliberately created by adding white Gaussian noise to the original database at signal to noise ratios of 10, 20 and 30 dB. The blue areas in the output waveform represent the voiced parts, whereas the green areas represent the discarded silence zones, as seen in Figures 2-4. The experimental results show that a higher SNR improves the performance significantly, and the performance of our technique then approaches that obtained with sentences from the clean, noise-free database. At an SNR of 10 dB, the standard techniques eliminate a major portion of the voice signal as silence, but the proposed method removes only the non-voiced parts. The results obtained from the experiments conducted after the silence removal step show that the new feature vectors exhibit a reasonable improvement over the original values, as given in Tables 1 and 2. Specifically, with background noise at 30 dB SNR, the accuracy is improved by 2.68% by using the MPCA method, as compared to the silence removal step alone. Similarly, for "channel noise" at a signal to noise ratio of 30 decibels, the improvement is up to 1.82%. These results show that MPCA generates more effective results in the presence of background noise. The evaluation of the silence removal is summarized in Tables 1 and 2.

Population vs accuracy
This experimental setup uses different population sizes (150, 300, 450, 600, 750, 900, 1050, and 1200). Accuracy is calculated by performing speaker identification tests on voices picked at random from a 1500-voice database. Figures 5 and 6 show the effect of population size on accuracy under noisy conditions. The accuracy is improved with the application of MPCA feature extraction.
Further investigations are carried out in order to evaluate the efficiency of MPCA with the FOLOS SVM algorithm. The recognition system is tested on the noisy database provided by NIST, containing Gaussian and channel noise. Tests have been performed using 1200 voices (712 males and 488 females), mostly from the NIST dataset. Every speaker's classification is performed on about 6 sentences of 28 seconds of clean voice.

Figure 5. Speaker detection levels for additive white Gaussian noise and channel noise at a signal to noise ratio of 10 decibels

Figure 6. Speaker detection levels for additive white Gaussian noise and channel noise at a signal to noise ratio of 30 decibels

With Gaussian noise at a signal to noise ratio of 30 decibels, the results improved by 2.76 percent (without application of the algorithms) and by 3.26 percent (with application of the algorithms) after employing the MPCA transformation and the forward looking stochastic SVM algorithm. Similarly, the increase for channel noise at 30 decibels is 1.88% and 1.95%, respectively. According to the above data, the presence of noise influences the output of the method, but the improvement remains evident.

CONCLUSION
The results achieved from the experiments indicate that the MPCA transformation improves the accuracy when compared to the conventional features. Although the computational complexity is reduced by 50%, MPCA-FOLOS outperforms the state-of-the-art speaker recognition techniques. Compared to other techniques, the suggested methodologies give a larger improvement in the overall performance of the robust speaker recognition system under noisy conditions. The improvement is significant even though different noise contents are present in the system. As explained in the above sections, there are considerable constraints during the learning and testing stages, but they do not contribute a significant increase. Considering the overall response of the recognition system, the candidate selection procedure represented by FOLOS yields the optimal set of feature vectors. The proposed technique is experimentally verified, and the tests show that its performance is better than that of existing technologies. It is also found that the complexity of the SVM is reduced significantly for a large number of speakers in the candidate selection step.