Convolutional Neural Network and Feature Transformation for Distant Speech Recognition

ABSTRACT
Distant speech recognition (DSR) systems must cope with noise and reverberation, which severely degrade the accuracy of current DNN-HMM acoustic models. In this paper, we propose combining a convolutional neural network (CNN) acoustic model with feature-space transformations, namely linear discriminant analysis (LDA) and the Maximum Likelihood Linear Transform (MLLT), to improve robustness against reverberation. Spliced MFCC features are reduced with LDA, decorrelated with MLLT, and used as the input of a CNN-HMM hybrid. Trained on clean TIDigits and evaluated on the reverberant Meeting Recorder Digits subset of the Aurora-5 corpus, the proposed systems achieve up to 38.69% and 28.94% relative word error rate reductions over HMM-GMM and DNN-HMM baselines, respectively.

INTRODUCTION
Deep learning technologies have recently achieved huge success in acoustic modelling for automatic speech recognition (ASR) tasks [1]-[4], replacing conventional Hidden Markov Model-Gaussian Mixture Model (HMM-GMM) systems [5], [6]. Currently, the Deep Neural Network (DNN) is the state-of-the-art architecture for speech recognition. The DNN provides posterior probabilities over HMM states based on a set of learned features, and hybrid HMM-DNN systems have been shown to outperform HMM-GMM models for ASR.
More and more ASR applications are now found in our daily activities. They have been deployed as virtual assistants in smartphones, in home automation, in meeting diarisation, and so on. For such applications, ASR must operate when there is some distance between the speakers and the microphones. This is called distant speech recognition (DSR). In such conditions, ASR systems are expected to be robust against noise and reverberation. However, the performance of DNN-HMM systems is still unsatisfactory in these conditions [7]. Noise and reverberation distort the speech signal and cause large degradations in the performance of ASR systems, which may hold users back from adopting ASR applications.
Many studies have proposed techniques to improve the accuracy of ASR in noisy and reverberant conditions. One approach is to enhance the noisy features by applying noise removal techniques [8]. Others designed discriminative, handcrafted features that are more robust against noise and reverberation [9]. Many works also propose adapting the acoustic models to the noisy condition [10]. In DNN frameworks, however, many methods proposed for HMM-GMM systems may not work as well [7]. For deep learning frameworks, various architectures have been investigated in search of better systems, such as the recurrent neural network (RNN) [11] and the convolutional neural network (CNN) [12]. In these approaches, the number of hidden layers is increased to produce more discriminative features before fine-tuning in the last layers. However, this may significantly increase the training time.
CNNs are currently gaining interest among researchers. Originally, they were used in computer vision [13]. Some studies [14]-[17] indicate that they outperform DNNs on large-vocabulary tasks. We argue that properties of the CNN, such as pooling, could be beneficial in reverberant conditions. However, these studies mostly deal with noise only; their use for handling reverberation has not yet been explored.
One advantage of deep learning frameworks is the ability of the network to learn discriminative features from the input data [18]. Studies show that transforming the features before feeding them to the network may benefit the performance of deep learning systems. Numerous transformations can be applied in the feature domain to improve the performance of deep architectures for large-vocabulary systems, among them linear discriminant analysis (LDA) [19], heteroscedastic linear discriminant analysis (HLDA) [20], the Maximum Likelihood Linear Transform (MLLT) [21], feature-based minimum phone error (fMPE) [22], or combinations of these transformations.
In this study, we propose a CNN with feature transformations to improve the robustness of ASR against reverberation. We apply LDA and MLLT to the features before feeding them to the CNN. We argue that applying these transformations can improve the robustness of speech recognition in reverberant conditions while still using a relatively small number of hidden layers. We evaluate the use of feature transformations (i.e. LDA and MLLT) on Mel-frequency cepstral coefficients (MFCC). We capture the context information of speech by splicing the features with several preceding and succeeding frames, then apply LDA to reduce the dimensionality, and finally apply MLLT to the reduced features. The transformed features are used as the acoustic input of the CNN.
The rest of the paper is organized as follows. Section 2 provides the theoretical background for our system: we briefly describe the features we use, the feature transformations, and the CNN. In Section 3, we explain our proposed system. In Section 4, we describe the experimental setup used to evaluate our method and discuss the results. We conclude the paper in Section 5.

THEORETICAL BACKGROUND

Speech Features
Many features have been proposed for ASR; MFCC is arguably the most popular one. MFCC is a handcrafted feature that is extracted using a two-stage Fourier transform. The aim is to decorrelate speech components in the time and frequency domains. By doing so, speech units such as phonemes can be modelled with mixtures of Gaussians using only their diagonal covariances. In the MFCC extraction process, the speech signal is chunked into a sequence of frames of fixed duration, usually around 25-50 ms each. Speech is assumed to be stationary within each frame, and the Fourier transform is applied to obtain its spectral components. Usually, the power spectrum is used by taking the square of the magnitude. The spectrum is then mapped onto a mel-scaled filter-bank to give more emphasis to the lower frequencies. After that, a log operation is applied to the filter-bank outputs, and a second Fourier transform (in practice only its real part, i.e. the discrete cosine transform) is applied to decorrelate the components in the frequency domain.
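The extraction steps above can be summarised in a short sketch. The following Python code is illustrative only: the frame length, hop size, number of mel filters, and number of cepstra are assumed values, not the exact configuration used in this paper.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_mels=23, n_ceps=13):
    """Illustrative MFCC extraction: framing -> power spectrum -> mel filter-bank
    -> log -> DCT (the 'real part' of the second Fourier transform)."""
    # 1. Chunk the signal into overlapping frames and window them.
    frames = np.stack([signal[i:i + frame_len] * np.hamming(frame_len)
                       for i in range(0, len(signal) - frame_len, hop)])
    # 2. Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # 3. Mel-scaled triangular filter-bank emphasising the lower frequencies.
    mel_pts = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_mels + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # 4. Log filter-bank energies (FBANK), then DCT to decorrelate; keep n_ceps coefficients.
    log_energies = np.log(power @ fbank.T + 1e-10)
    return dct(log_energies, type=2, axis=1, norm='ortho')[:, :n_ceps]
```

Dropping the final DCT in this sketch would yield log filter-bank (FBANK) features, which are discussed further below.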
While MFCC gives good results when the training and testing conditions match, it suffers when there is high variability in the data. Speech is highly variable due to intra-speaker variability, inter-speaker variability, and environmental variability, e.g. when the speech is noisy or otherwise contaminated. Many studies have proposed alternative features to improve the robustness of ASR; perceptual linear prediction (PLP) is one example. The main difference between PLP and MFCC is that PLP applies a cube-root function instead of the log, the objective being to reduce the sensitivity of the features in the low-energy region, which is the most sensitive to noise. Another difference is the use of the Bark scale instead of the mel scale used in MFCC.
MFCC performs fairly well in HMM-GMM systems. Since its components are quite uncorrelated, it is adequate to model each HMM state with a mixture of Gaussians using only diagonal covariances. However, the two-stage Fourier transform removes the correlation between speech components in the time-frequency domain, and these correlations may still be useful for recognition. This may be one of the reasons why ASR systems are not robust. In DNN-HMM systems, some studies show that more "raw" features are more beneficial. One of them is FBANK [23]; FBANK follows the same extraction process as MFCC but without the second-stage Fourier transform, so some correlations in frequency still remain.

Feature Transformation
Transforming features into another space is often found to be effective in many machine learning classification tasks. Using high-dimensional features directly is often ineffective because it may lead to overfitting, so reducing the dimensionality of the features is commonly applied. LDA and PCA are examples of dimensionality reduction techniques: LDA [24] is applied in a supervised manner, while PCA is an unsupervised technique.
LDA is usually applied as a preprocessing stage that reduces n-dimensional features to an m-dimensional space (m < n). The objective is to project the feature space into a lower-dimensional space while making the features more discriminative. The lower-dimensional space is chosen such that it emphasizes the distances between classes more than the distances within a class. Mathematically, this can be written as

J(θ) = (θ^T Σ_b θ) / (θ^T Σ_w θ),

where Σ_b is the between-class covariance, Σ_w is the within-class covariance, θ is a projection direction applied to the features, and J(θ) is the cost function to be maximized. The solution is obtained by taking the first m eigenvectors of the matrix Σ_w^(-1) Σ_b after sorting the eigenvalues from the largest. For more information on applying LDA to speech features, refer to [19].

Meanwhile, MLLT [25] is applied in HMM-GMM systems to loosen one of their assumptions: that the feature components are independent of each other, so that Gaussians with only diagonal covariances can be used. While this speeds up training, the assumption does not necessarily hold, because speech components may still be correlated in the feature space. MLLT [25], also known as the semi-tied covariance (STC) transform [26], linearly transforms the data into a new space in which the diagonal-covariance Gaussian assumption is better satisfied. In this way, MLLT implicitly captures the correlations between feature elements using a constrained covariance model.
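As a concrete illustration, the following Python sketch estimates an LDA projection from labelled feature vectors; the class labels would typically be frame-level HMM-state or phone alignments, and the function and variable names are ours rather than taken from the paper.

```python
import numpy as np

def lda_projection(features, labels, m):
    """Estimate an LDA projection matrix that maps features to m dimensions.

    features : (N, n) matrix of feature vectors
    labels   : (N,) class label (e.g. aligned HMM state) of each vector
    m        : target dimensionality (m < n)
    """
    mean_all = features.mean(axis=0)
    n = features.shape[1]
    sigma_w = np.zeros((n, n))   # within-class scatter
    sigma_b = np.zeros((n, n))   # between-class scatter
    for c in np.unique(labels):
        x = features[labels == c]
        mean_c = x.mean(axis=0)
        sigma_w += (x - mean_c).T @ (x - mean_c)
        diff = (mean_c - mean_all)[:, None]
        sigma_b += len(x) * (diff @ diff.T)
    # Eigenvectors of Sigma_w^{-1} Sigma_b, sorted by decreasing eigenvalue.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(sigma_w, sigma_b))
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs[:, order[:m]].real   # (n, m) projection matrix

# Usage: project 117-dim spliced MFCC frames down to 40 dimensions.
# reduced = spliced_features @ lda_projection(spliced_features, state_labels, 40)
```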
MLLT works as follows. It decomposes the full covariance matrix of each Gaussian component such that each component keeps its "diagonal" characteristics. A full covariance matrix is decomposed as

Σ^(m) = H^(r) Σ_diag^(m) (H^(r))^T,

where m is the index of the Gaussian component and r is the index of the class sharing the transform. Each component m has three parameters: a weight, a mean μ^(m), and the diagonal covariance Σ_diag^(m). Thus, each covariance matrix Σ^(m) is decomposed into two parts: the component-specific diagonal covariance Σ_diag^(m), and a full matrix H^(r) shared by the Gaussian components of class r (called the semi-tied transform). We denote the inverse of H^(r) by A^(r). In ASR, these parameters are trained on the training data in the Maximum Likelihood sense, optimizing with respect to the transform A^(r), the means μ^(m), and the diagonal covariances Σ_diag^(m). Up to a constant, the cost function J to be maximized can be written as

J = (1/2) ∑_m γ_m [ log |A^(r)|^2 − log |Σ_diag^(m)| − tr( (Σ_diag^(m))^(-1) A^(r) W^(m) (A^(r))^T ) ],

where γ_m is the occupancy count of component m and W^(m) is its full sample covariance, and the covariance matrix estimate given A^(r) is

Σ_diag^(m) = diag( A^(r) W^(m) (A^(r))^T ).

Calculating A^(r) itself is nontrivial: it is initialized with an identity matrix, Σ_diag^(m) is estimated with the formula above, and A^(r) is then updated row by row; these two steps are iterated until convergence [26].
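A rough Python sketch of this alternating optimization, following the semi-tied covariance update of [26], is shown below. It assumes a single shared transform, omits numerical safeguards, and uses our own variable names; it is not the implementation used in this paper.

```python
import numpy as np

def estimate_mllt(W, gamma, n_iters=10, n_row_iters=5):
    """Estimate a global semi-tied (MLLT) transform A = H^{-1}.

    W     : (M, d, d) full sample covariance of each Gaussian component
    gamma : (M,) occupancy count of each component
    Returns the transform A and the per-component diagonal variances.
    """
    M, d, _ = W.shape
    beta = gamma.sum()                      # total occupancy
    A = np.eye(d)                           # initialize with the identity matrix
    for _ in range(n_iters):
        # Given A, the ML diagonal covariances are diag(A W_m A^T).
        var = np.array([np.diag(A @ Wm @ A.T) for Wm in W])     # (M, d)
        # Given the diagonal covariances, update A row by row.
        for _ in range(n_row_iters):
            for i in range(d):
                G = np.einsum('m,mjk->jk', gamma / var[:, i], W)
                c = np.linalg.det(A) * np.linalg.inv(A)[:, i]   # cofactor row i of A
                cG = c @ np.linalg.inv(G)
                A[i] = cG * np.sqrt(beta / (cG @ c))
    return A, var
```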

Convolutional Neural Network
The Convolutional Neural Network (CNN) is a special kind of deep neural network; a typical convolutional network structure is illustrated in Figure 1. It differs from the DNN, in which every neuron of a layer is connected to all neurons of the next layer, which may not be effective when the features have large dimensions. The CNN introduces two special types of network layers, called the convolutional layer and the pooling layer. Each neuron of the convolutional layer receives input from a small local part of the input multiplied by a weight matrix (filter), and the filters are replicated across the whole input space. Localized filters that share the same weights form feature maps. After the convolution, a pooling layer takes input from a local region of the convolutional layer and generates a lower-resolution version of the filter activations.
In implementations for speech recognition, after a few CNN layers, fully connected layers of a generative deep neural network model (DBN-DNN) are stacked on top to combine the local patterns extracted from all positions in the lower layers for the final recognition [27]. In this paper, we use two CNN layers followed by four DBN layers, for a total of six hidden layers in pre-training; a DNN is then trained on top of them in a supervised way. Figure 2 shows the currently most commonly used ASR architecture, the DBN-DNN, which we use as a baseline; it is based on Karel's DNN implementation in KALDI [28]. The DBN in this experiment uses six hidden layers of 2048 neurons each, i.e. a Gaussian-Bernoulli RBM as the first layer connected to the Gaussian acoustic inputs and Bernoulli-Bernoulli RBM layers afterward. The stack of pre-trained layers is followed by a DNN hidden layer (1024 neurons) and a softmax output layer. We denote this system BASELINE2 in this paper.
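To make the architecture concrete, the following PyTorch sketch shows one possible layout of such a CNN-DNN hybrid acoustic model. It is only an illustration: the input is assumed to be a single-channel time x feature map of spliced frames, and the filter sizes, pooling sizes, layer widths, and number of output states are assumed values, not the exact configuration of this paper or of Karel's KALDI setup.

```python
import torch
import torch.nn as nn

class CnnDnnAcousticModel(nn.Module):
    """Two convolution/pooling stages followed by fully connected layers and
    a softmax over HMM states (illustrative configuration)."""

    def __init__(self, context=9, feat_dim=40, n_states=2000):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=(3, 5)), nn.ReLU(),   # local filters, shared weights
            nn.MaxPool2d(kernel_size=(1, 3)),                  # pooling along the feature axis
            nn.Conv2d(64, 64, kernel_size=(3, 3)), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 2)),
        )
        # Infer the flattened size of the convolutional output for the given input shape.
        with torch.no_grad():
            flat = self.conv(torch.zeros(1, 1, context, feat_dim)).numel()
        self.fc = nn.Sequential(
            nn.Linear(flat, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, n_states),          # per-state scores (softmax applied in the loss)
        )

    def forward(self, x):                        # x: (batch, 1, context, feat_dim)
        h = self.conv(x)
        return self.fc(h.flatten(1))

# state_scores = CnnDnnAcousticModel()(torch.randn(8, 1, 9, 40)).log_softmax(dim=-1)
```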

THE PROPOSED SYSTEM
We also train a conventional HMM-GMM system, which we denote BASELINE1. For this, we model each digit with a 16-state left-to-right HMM, where each state is modelled with a mixture of three Gaussians. For the pause model, sil, we use a 3-state HMM with 6 Gaussian components. Figure 3 shows the system proposed in this study. We use MFCC as the basis for the static features: 13 static dimensions are spliced with 4 preceding and 4 succeeding frames to capture the speech context, producing 117 dimensions. We then apply LDA to reduce the features to 40 dimensions, and apply MLLT on the LDA output before feeding the result into the CNN. The system using only LDA is denoted PROPOSED1, and the system with both LDA and MLLT is denoted PROPOSED2. The CNN configuration follows [15], as it has been found to be a good setting for speech recognition. The output of the network is used to estimate the posterior probabilities of the HMM states in the hybrid deep learning-HMM system.
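A minimal sketch of this feature pipeline (splicing, then LDA, then MLLT), assuming the LDA and MLLT matrices have already been estimated as described in Section 2, might look as follows; the function and variable names are ours.

```python
import numpy as np

def splice(feats, context=4):
    """Splice each 13-dim MFCC frame with `context` preceding and succeeding
    frames, giving (2 * context + 1) * 13 = 117 dimensions per frame."""
    padded = np.pad(feats, ((context, context), (0, 0)), mode='edge')
    return np.hstack([padded[i:i + len(feats)] for i in range(2 * context + 1)])

def transform(mfcc_feats, lda_matrix, mllt_matrix):
    """Apply LDA (117 -> 40 dims) and then MLLT to spliced MFCC frames."""
    spliced = splice(mfcc_feats)             # (T, 117)
    reduced = spliced @ lda_matrix           # (T, 40), LDA estimated on spliced features
    return reduced @ mllt_matrix.T           # (T, 40), MLLT/STC transform A applied on top

# cnn_input = transform(mfcc_feats, lda_matrix, mllt_matrix)  # fed to the CNN
```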

EXPERIMENTS

The Setup
The experiments are evaluated on an isolated digit recognition task. We use the TIDigits corpus to train the acoustic models on clean conditions; it consists of 8623 utterances pronounced by 111 male and 114 female adult speakers. For the test data, we use a reverberant version of TIDigits, namely the Meeting Recorder Digits (MRD) subset of the Aurora-5 corpus [29]. The corpus comprises real recordings made in hands-free mode in a meeting room. The speech data were collected from 24 speakers at the International Computer Science Institute in Berkeley, resulting in 2400 utterances per microphone. The recording was performed using four microphones (labelled 6, 7, E, and F) placed in the middle of the table, so the recordings contain reverberant acoustic conditions caused by hands-free recording in a meeting room. The performance is measured using the word error rate (WER).

Results and Discussion
Table 1 shows the evaluation of the proposed methods (PROPOSED1 and PROPOSED2), together with the performance of BASELINE1 and BASELINE2 for comparison. The table clearly indicates a consistent reduction of the word error rate in reverberant conditions. PROPOSED1 achieves 37.67% and 27.78% relative improvement over BASELINE1 and BASELINE2 respectively, while PROPOSED2 achieves 38.69% and 28.94% relative improvement over BASELINE1 and BASELINE2 respectively. Applying MLLT after LDA (PROPOSED2) is slightly better than applying LDA alone.

These results might be analysed as follows. When reverberation exists, the reverberant speech is the sum of the signal and delayed versions of the same signal, so reverberant frames contain information from previous frames. This increases the correlation between neighbouring frames, and hence the local correlation among nearby frames. In the CNN architecture, the properties of locality, convolution, and pooling may be beneficial in such conditions. Since the emphasis is on local neurons first, the network can learn from local information and produce good features based on the cleaner part of the speech in the early frames (which are relatively clean compared to the later part of the speech). Reverberation can also cause some frequency shifts (delays in the time-frequency domain).
These delays are difficult to handle in other models such as GMMs and DNNs, where many Gaussians or hidden units are needed to cover all possible pattern shifts [27]. With the pooling property of the CNN, the same feature computed at different locations is collected together and represented by a single value, which may come from the cleaner part of the speech. Therefore, the pooling layer may minimize the differences between features caused by the delays introduced by reverberation. LDA finds the projections that maximize the separation between class means relative to the within-class variance, so when it is applied to the features it is very likely to select the most distinguishable (dominant) spectral components, which carry the phoneme information. When the CNN is applied, this information is maintained up to the top layers thanks to max-pooling, producing more discriminative features and hence improving the performance.
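As a small illustration of this pooling argument, the toy example below shows that a max-pooled vector of filter activations can remain unchanged when the underlying pattern is delayed by one position, whereas the raw activations change everywhere; the values and pooling size are arbitrary.

```python
import numpy as np

def max_pool_1d(x, size=3):
    """Non-overlapping 1-D max pooling over a vector of filter activations."""
    return x[:len(x) - len(x) % size].reshape(-1, size).max(axis=1)

activations = np.array([0.1, 0.9, 0.05, 0.0, 0.1, 0.0, 0.3, 0.1, 0.0])
shifted = np.roll(activations, 1)        # the same pattern delayed by one position

print(max_pool_1d(activations))          # [0.9 0.1 0.3]
print(max_pool_1d(shifted))              # [0.9 0.1 0.3] -- pooled values are unchanged
```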

CONCLUSION
In this study, we evaluated the use of LDA and MLLT with CNN-based speech recognition to improve the robustness of speech recognition against reverberation. Our experiments confirm that the proposed method is more robust than standard DNN-HMM and HMM-GMM systems. The weight-sharing, pooling, and locality properties of the CNN improve the recognition accuracy on the transformed features compared to a standard fully connected DNN.
We note that the evaluated task is a digit recognition task, so the long-term dependencies that exist in speech may not be as significant as in continuous speech. It would therefore be interesting to see how each architecture fares on continuous tasks. Since the reverberation time is also heavily influenced by the size of the room, it would also be interesting to see how deep architectures perform in different room settings. This is our future plan.