Continuous Kannada speech segmentation and speech recognition based on threshold using MFCC and VQ

ABSTRACT


INTRODUCTION
Speech segmentation combined with recognition remains a challenging task. The speech segmentation proposed in this paper is the process of isolating parts of the speech signal using the context-based information of a particular scenario. Context-based segmentation identifies the end points of the context information at the level of words, syllables, or phonemes in natural languages. Speech segmentation is a subpart of a speech recognition system. In natural language processing, context, semantics, and grammar are very important for recognizing speech. Applications of automatic speech recognition include broadcast news transcription, information extraction and retrieval, and identifying a speaker's voice so that only authenticated users can use the services, among many others.
A higher-lag method is used to extract the features of the speech signal with linear prediction [1]. The method gives two prediction errors: one from ordinary conventional linear prediction and one from a version of linear prediction delayed by k samples. The Combined Higher Lag Linear Prediction (CHLLP) model further applies zero-lag and higher-lag prediction simultaneously. The cost function of the CHLLP model equals the cost function of conventional linear prediction if the signal is completely periodic with a fundamental period of P samples and if m samples are selected from P. In CHLLP, zero-lag prediction corresponds to m = 0, and the terms are simultaneously weighted by a scalar. This new spectral modeling method can therefore be used in a modern speaker recognition system under different noisy environments.
For speaker-independent recognition of continuous words in the Telugu language [2], MFCC (mel frequency cepstral coefficients) and DWT are used to extract the features of continuous speech, and the extracted features are then classified using an HMM-based neural network. The combined MFCC and DWT methods are useful for extracting the features of the speech signal. Wavelet packet decomposition is used to remove the noise present in the features of the speech signal. The DWT isolates the high and low frequency bands present in the speech signal, and the high frequency bands are considered useful features. This process gives a low word error rate in the speech recognition system.
The method in [3] enhances whisper recognition by extracting new robust cepstral features and by preprocessing based on a denoising autoencoder. Teager energy based cepstral features are more robust than MFCC for describing whispered speech, and feature extraction combining the deep denoising autoencoder (DDAE) with Teager energy cepstral coefficients (TECC) significantly improves the recognition rate compared with MFCC, GMM, and HMM. The method improves on traditional methods by more than 31%. Typically, an autoencoder has an input layer holding the original feature vector, one or more hidden layers holding the transformed features, and an output layer that matches the input layer for reconstruction. TECC-based features exploit the Teager energy operator and a gammatone filter bank to describe whisper characteristics. As a result, the achieved word recognition rate is 93%.
Word boundary detection is used to separate words in Gujarati speech. The work in [4] achieves end point detection in a Gujarati speech recognition system in the presence of background noise. It separates out the silent portions of the speech so that noise is reduced. This word boundary detection uses two algorithms to detect end points explicitly and implicitly. Explicit end point detection is used before recognition, and implicit end point detection is used after speech processing. The method is able to detect weak fricatives under the given signal-to-noise ratio conditions.
A hybrid model based on maximum Gaussian mixtures is used for continuous Tamil speech recognition [5]. This method improves accuracy by up to 3% and error rate by up to 4% compared with the existing system. The model is used for speech-to-text conversion in various applications. LPC, MFCC, and LP are used to extract the features. The method is an unsupervised approach to data analysis and model construction. It partitions the data points with values between zero and one, and these values are assigned based on the clusters, their centres, and the data points.
To recognize isolated Kannada words, an HMM model is trained and the Viterbi algorithm is used for decoding [6]. MFCCs are computed in front-end processing. The work compares the performance of phone-level and syllable-level acoustic models for a small to medium sized Kannada vocabulary. The average word recognition accuracy is 97% for syllable-level modeling and 98.6% for phone-level modeling. The speech coding setup was built using the HTK tool. The training and testing samples of the entire database are used to build the HMMs. A confusion matrix is used to analyse and interpret the results at the word level.
An HMM and Normal Fit method [7] is used for continuous speech recognition. Voice detection is based on computing a dynamic threshold, and cepstrum coefficients are extracted as the voice features. The Baum-Welch algorithm is used to train the database and the Normal Fit technique is used to label the speech. The method was tested on five languages, with an average accuracy rate of 95%. The experimental results show that the memory size is reduced because of the Normal Fit values.
MFCC is used in [8] for feature extraction from the training data, and vector quantization is used for clustering in speaker-independent Kannada speech recognition. Two clustering techniques are used: VQ1, based on the binary splitting algorithm, and VQ2, based on the largest average distortion. The speech recognition error decreases from 2.5 to 1.5. In [9], linear discriminant analysis (LDA) and a maximum likelihood transformation are applied to MFCC to extract the speech features, which are then input to a convolutional neural network (CNN) to improve the robustness of speech recognition. This improves the recognition accuracy.
The proposed method is organized into two parts: 1) speech segmentation and 2) recognition. The first part is context-based segmentation of continuous Kannada speech, together with the isolation of Kannada letters from continuous Kannada speech that contains only the utterance of Kannada letters. This is achieved by detecting voiced and unvoiced speech from the average energy and spectral centroid of each frame of the Kannada speech signal. The average energy and spectral centroid coefficients are further subjected to a median filter, and the filter output coefficients are used to set the thresholds. These thresholds are used to segment the continuous Kannada speech signal based on context. The second part extracts the features of the segmented speech signal using threshold-based MFCC and VQ in an automatic speech recognition (ASR) system. The threshold-based MFCC and VQ are used to train the speech data set. The method uses fewer MFCC coefficients and a smaller VQ codebook, yet gives better results than the existing methods.

PROPOSED METHODOLOGY
The first part of the proposed method covers context-based segmentation of continuous Kannada speech and the isolation of Kannada aksharas (letters) from continuous Kannada speech that contains only the utterance of Kannada aksharas. The second part describes the recognition system for the context-based Kannada speech segments and the isolated Kannada aksharas, using threshold-based MFCC and VQ methods.
Speech segmentation in this paper is carried out by detecting the presence of voiced and unvoiced speech using the average short time energy and the spectral centroid. A median filter is used to smooth the average short time energy and spectral centroid coefficients of the speech signal, and the thresholds are set based on the probability density function (pdf) of the median filter output coefficients. Context-based speech segmentation is then performed based on the threshold levels.

Average short time energy (STE)
The Kannada speech signal is decomposed into a number of frames by multiplying it with a window function of length L using (1). The average short time energy [10][11] of each frame is then computed using (2),
where $s(k)$ is the Kannada speech signal, $w(i)$ is the Hamming window function, which is shifted across the speech signal to obtain the frames, and $s_w(i)$ is the windowed speech signal.
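As an illustration of the framing in (1) and the energy in (2), the following minimal NumPy sketch assumes non-overlapping Hamming-windowed frames and takes the average energy of a frame as the mean of its squared windowed samples. The function names `frame_signal` and `short_time_energy` and their parameters are hypothetical and are not taken from the authors' implementation.

```python
import numpy as np

def frame_signal(s, frame_len, hop):
    """Split s into Hamming-windowed frames, one frame per row."""
    s = np.asarray(s, dtype=float)
    if len(s) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(s) - frame_len) // hop
    window = np.hamming(frame_len)
    # Row r holds samples r*hop ... r*hop + frame_len - 1.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return s[idx] * window

def short_time_energy(frames):
    """Average energy of each windowed frame (assumed form of (2))."""
    return np.mean(frames ** 2, axis=1)

if __name__ == "__main__":
    fs = 16000                          # assumed sampling frequency in Hz
    frame_len = int(round(0.050 * fs))  # 50 ms window, step size = window length
    rng = np.random.default_rng(0)
    speech = rng.standard_normal(fs)    # stand-in for a real recording
    frames = frame_signal(speech, frame_len, frame_len)
    print(frames.shape, short_time_energy(frames)[:3])
```

Setting the hop equal to the frame length mirrors the "window length = step size" choice stated in the algorithm; an overlapping hop would work with the same functions.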

Spectral centroid features
The spectral centroid (SC) combines the frequency and the magnitude of each spectral bin obtained with the Discrete Fourier Transform. The spectral centroid reflects the energy above and below the fundamental frequency, and is approximately the average energy of the spectral bins. A speech signal is usually asymmetric about the pitch range. Perception of the speech signal is modelled as a ramp function, which gives more accurate perception at both the lower and the higher frequencies of the spectral bins. For each frame, the spectral centroid is computed from $S_m$, the FFT of the windowed speech sequence of N samples, together with the width of each spectral bin and the sampling frequency $f_s$ of the speech signal.
The multiplication factor j in (3) refers to the perception of speech signal as a ramp function.
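A hedged sketch of the spectral centroid computation follows. Since (3) is not reproduced here, the code assumes the common magnitude-weighted form, in which the ramp factor j weights the magnitude of bin j and the result is normalised by the total magnitude. It also follows the algorithm's choice of a 2N-point FFT of which only the first N coefficients are kept. The function name `spectral_centroid` is hypothetical.

```python
import numpy as np

def spectral_centroid(frames, fs):
    """Magnitude-weighted mean frequency (Hz) of each windowed frame."""
    frames = np.asarray(frames, dtype=float)
    n = frames.shape[1]
    # 2N-point FFT, keeping only the first N coefficients (0 .. fs/2).
    magnitude = np.abs(np.fft.fft(frames, n=2 * n, axis=1))[:, :n]
    bin_width = fs / (2.0 * n)             # width of one spectral bin in Hz
    ramp = np.arange(n)                    # ramp factor j for bin j
    total = magnitude.sum(axis=1)
    safe_total = np.where(total > 0, total, 1.0)  # silent frames give 0
    return (magnitude * ramp).sum(axis=1) / safe_total * bin_width

if __name__ == "__main__":
    fs = 8000
    t = np.arange(400) / fs
    tone = np.sin(2 * np.pi * 1000 * t) * np.hamming(400)
    print(spectral_centroid(tone[None, :], fs))
```

Frames dominated by high-frequency content yield a larger centroid than frames dominated by low-frequency content, which is the property the segmentation stage relies on.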

Median filter and threshold setting
The median filter is then used to smooth the average energy and spectral centroid coefficients while retaining abrupt changes within the filter window, where L is the length of the filter. In this paper the filter length is 5. Because the median filter is nonlinear, it removes isolated outliers without blurring genuine transitions in the average energy and spectral centroid coefficients. The median filter outputs are used to set the thresholds based on the probability density function (pdf) of the filter output coefficients. These thresholds are used to identify the context of the speech signal. An energy threshold (ET) and a spectral centroid threshold (ST) are required to segment the continuous Kannada speech signal. Both thresholds are computed from the histograms of the STE and the spectral centroid of each frame. Two flags, f1 and f2, are set by comparing the energy with ET and the centroid with ST. Depending on the final flag, speech segmentation is achieved based on the context of the scenario. Finally, each frame of the speech is classified as voiced or unvoiced based on context.
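The thresholding stage can be sketched as follows. The text does not state how the thresholds are derived from the histograms, so this example assumes one common heuristic: a weighted average of the two most prominent histogram peaks. It also assumes that the final flag is the logical AND of f1 and f2. The names `median_filter`, `histogram_threshold`, `voiced_mask` and the parameters `bins` and `weight` are hypothetical.

```python
import numpy as np

def median_filter(x, length=5):
    """Sliding median of odd length with edge padding."""
    x = np.asarray(x, dtype=float)
    half = length // 2
    padded = np.pad(x, half, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, length)
    return np.median(windows, axis=1)

def histogram_threshold(values, bins=20, weight=5.0):
    """Assumed rule: weighted mean of the two largest histogram peaks."""
    hist, edges = np.histogram(values, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    padded = np.concatenate(([0], hist, [0]))
    is_peak = (padded[1:-1] > padded[:-2]) & (padded[1:-1] >= padded[2:])
    peaks = np.flatnonzero(is_peak)
    if len(peaks) < 2:
        return float(np.mean(values)) / 2.0   # fallback, also an assumption
    top_two = peaks[np.argsort(hist[peaks])[-2:]]
    m1, m2 = np.sort(centers[top_two])
    return float((weight * m1 + m2) / (weight + 1.0))

def voiced_mask(energy, centroid, length=5):
    """Flag frames whose filtered energy and centroid both exceed ET and ST."""
    e = median_filter(energy, length)
    c = median_filter(centroid, length)
    f1 = e >= histogram_threshold(e)   # energy flag against ET
    f2 = c >= histogram_threshold(c)   # centroid flag against ST
    return f1 & f2

if __name__ == "__main__":
    energy = np.array([0.1] * 10 + [0.9] * 10)
    print(voiced_mask(energy, energy).astype(int))
```

Weighting the lower peak more heavily places the threshold closer to the background level, so weak voiced frames are less likely to be discarded; other weightings are equally plausible.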

Zero crossing rate and end point detection
The zero crossing rate indicates how rapidly the speech signal changes sign between positive and negative. A large number of zero crossings means that the speech signal contains high frequency information [12]; a small number means that it contains low frequency information. Zero crossings are therefore used in this paper to distinguish voiced from unvoiced speech, which helps to segment the given signal. The speech signal is non-stationary, so the zero crossing rate is computed for each frame as defined in (4).
The zero crossing rate (ZCR) is used to detect voice activity in the speech signal, that is, whether a portion of the signal contains spoken voice or silence. In this paper the ZCR is used to detect the end points of the speech signal within the context and to isolate each letter exactly from continuous speech, so it plays an important role in separating individual letters. By masking, unvoiced speech is set to zero and voiced speech is kept as it is in the original speech signal. Each letter is then isolated with its end points using the short time energy and the zero crossing rate.
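The following sketch shows one way to compute a per-frame zero crossing rate and to turn a voiced/unvoiced frame mask into end points. Because (4) is not reproduced here, the code assumes the usual definition of the ZCR as the fraction of adjacent sample pairs whose signs differ. The names `zero_crossing_rate` and `segment_endpoints` are hypothetical.

```python
import numpy as np

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs with a sign change, per frame."""
    signs = np.where(np.asarray(frames) >= 0, 1, -1)
    # |diff| is 2 at a sign change and 0 otherwise.
    return np.mean(np.abs(np.diff(signs, axis=1)), axis=1) / 2.0

def segment_endpoints(voiced):
    """Return (start, end) frame indices of each voiced run, end exclusive."""
    v = np.concatenate(([0], np.asarray(voiced, dtype=int), [0]))
    edges = np.diff(v)
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)
    return list(zip(starts.tolist(), ends.tolist()))

if __name__ == "__main__":
    frames = np.array([[1.0, -1.0, 1.0, -1.0], [1.0, 1.0, 1.0, 1.0]])
    print(zero_crossing_rate(frames))
    print(segment_endpoints([0, 1, 1, 0, 1]))
```

Multiplying each frame by its mask value reproduces the masking step described above, in which unvoiced frames become zero and voiced frames are left unchanged.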

SPEECH RECOGNITION
Context-based recognition and recognition of the Kannada Varnamala and Kannada alphabet are proposed for continuous Kannada speech signals. Feature extraction based on Mel Frequency Cepstral Coefficients (MFCC) and Vector Quantisation (VQ) is proposed.

Mel frequency cepstral coefficients
Mel Frequency Cepstral Coefficients (MFCC) are among the most efficient and effective feature extraction methods [13][14][15]. The frequency analysis of the windowed sequence is computed using the discrete Fourier transform (DFT) in (6).
The triangular frequency bands are obtained using the Mel filter banks in (7).

$$\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \tag{7}$$
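To make the Mel mapping concrete, the sketch below implements the conversion Mel(f) = 2595 log10(1 + f/700) and a conventional bank of triangular filters spaced evenly on the Mel scale. The filter construction is a standard textbook layout assumed for illustration, and the names `hz_to_mel`, `mel_to_hz`, and `mel_filterbank` are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    """Mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters with centers equally spaced on the Mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):       # rising edge of the triangle
            fbank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):      # falling edge of the triangle
            fbank[m - 1, k] = (right - k) / (right - center)
    return fbank

if __name__ == "__main__":
    fb = mel_filterbank(n_filters=10, n_fft=512, fs=16000)
    print(fb.shape, float(hz_to_mel(1000.0)))
```

Each row of the returned matrix is one triangular filter over the non-negative FFT bins, so multiplying a power spectrum by the transposed matrix gives the Mel band energies.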

Vector quantization (VQ)
Vector quantization is one of the most important methods for measuring the distance between the test data and the trained data set in automatic speech recognition. Based on the minimum distance measurement, it is easy to recognise test data that is present in the trained data set. VQ is also a method for reducing the number of significant dimensions of the input data, so that unknown models can be matched in a very simple manner on the reduced data. In this paper the VQ algorithm uses 8 dimensions and produces a set of cluster centers spread over the distance space according to the speech signal features.
Any feature vector is then categorised into one of these clusters, and the cluster number is used in place of the input feature vector. Comparing two sequences of integers is simpler than comparing the entire original vectors [16,17]. An additional advantage is that the distance between a pair of clusters can be computed as the Euclidean distance between their corresponding centers. These distances can simply be read from a lookup table, so no additional computation is required to measure the distance between clusters.
To keep the VQ method simple, the set of cluster centers is defined as a codebook, because it maps each feature vector to a single value. The codebook size is the number of clusters in the codebook. Some information is lost when the VQ method [18][19][20] is used to encode an input vector sequence, because points that are not identical are grouped together and every cluster member is represented by the cluster mean. The loss of information is the difference between the original input vector sequence and the quantised vector sequence, and the mean value of this difference is the VQ distortion. Increasing the size of the codebook decreases the VQ distortion.
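The codebook and distortion ideas can be illustrated with a short sketch. The text does not specify the codebook training procedure, so plain k-means (Lloyd iterations) is swapped in here as an assumption; it is not claimed to be the authors' algorithm. The names `train_codebook` and `vq_distortion` and the parameters `n_iter` and `seed` are hypothetical.

```python
import numpy as np

def train_codebook(features, size=8, n_iter=20, seed=0):
    """Build a codebook of `size` cluster centers with k-means (assumed)."""
    rng = np.random.default_rng(seed)
    features = np.asarray(features, dtype=float)
    start = rng.choice(len(features), size, replace=False)
    codebook = features[start].copy()
    for _ in range(n_iter):
        # Euclidean distance from every feature vector to every center.
        d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(size):
            members = features[labels == k]
            if len(members):                 # keep empty clusters unchanged
                codebook[k] = members.mean(axis=0)
    return codebook

def vq_distortion(features, codebook):
    """Mean distance between each vector and its nearest codeword."""
    features = np.asarray(features, dtype=float)
    d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return float(d.min(axis=1).mean())

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    mfcc_like = rng.standard_normal((200, 12))   # stand-in feature vectors
    for size in (1, 8):
        cb = train_codebook(mfcc_like, size=size)
        print(size, vq_distortion(mfcc_like, cb))
```

Running the example with a larger codebook typically shows a lower distortion, which matches the trade-off between codebook size and information loss described above.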

Threshold setting
The output of the codebook is used to set the threshold value that decides whether the test speech signal is present in the training data set. The minimum value in the codebook is used as the threshold, and this threshold must be less than half of the average value in the codebook. Only then is the test speech signal compared against the training data set; otherwise the test speech signal is treated as not present in the training data set. Once the test speech signal is admitted, only the minimum distance vector is considered, and the training speech with the minimum distance vector is recognized as the test speech.
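The decision rule can be sketched as below. This is one possible reading of the text, stated as an assumption: the "minimum value" and "average value" are taken to be the minimum and the mean of the distances between the test codebook and each training codebook. The names `codebook_distance` and `recognize`, and the labels used in the example, are hypothetical.

```python
import numpy as np

def codebook_distance(a, b):
    """Mean distance from each codeword in a to its nearest codeword in b."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return float(d.min(axis=1).mean())

def recognize(test_codebook, training_codebooks):
    """Return the best matching label, or None if the threshold test fails."""
    labels = list(training_codebooks)
    dists = np.array([codebook_distance(test_codebook, training_codebooks[l])
                      for l in labels])
    threshold = dists.min()                 # minimum value used as threshold
    if threshold < 0.5 * dists.mean():      # must be below half the average
        return labels[int(dists.argmin())]  # minimum distance entry wins
    return None                             # treated as not in the training set

if __name__ == "__main__":
    train = {
        "a": np.zeros((2, 2)),
        "b": np.full((2, 2), 10.0),
        "c": np.full((2, 2), 20.0),
    }
    print(recognize(np.full((2, 2), 0.1), train))
    print(recognize(np.full((2, 2), 100.0), train))
```

The gate rejects a test signal whose best match is not clearly closer than the typical match, which is how an utterance absent from the training set would be filtered out before the minimum distance search.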

Proposed model algorithm
Context based voice detection:
Step 1: Input the continuous Kannada speech signal.
1. The speech signal has sampling frequency $f_s$ Hz and is framed with a Hamming window.
2. Window length (N) = step size = 0.050 × $f_s$.
3. Compute the number of frames of speech using (1) by shifting the window across the entire speech signal.
Step 2: Compute the average energy of each frame using equation (2).
Step 3: Compute the 2 × N point FFT of the windowed sequence of each frame.
1. Consider only N point FFT coefficients to reduce higher spectral components.
2. Finally, obtain the spectral centroid 'C' of each windowed sequence using equation (3).
Final step: Plot the isolated Kannada letter speech signal in red and play the isolated Kannada speech. Finally, write the speech of each letter to a .wav file.

Speech Recognition
Step 1: Take the segmented speech signal as the training data set.
Step 2: Apply the Fourier transform to the segmented speech signal.
Step 3: Map the log amplitudes of the spectrum obtained above onto the Mel scale, using triangular overlapping windows.
Step 4: Take the Discrete Cosine Transform of the list of Mel log-amplitudes, as if it were a signal.
Step 5: The MFCCs are the amplitudes of the resulting spectrum.
Step 6: Calculate the MFCC coefficients for the training data set with frequency rate 10.
Step 7: Generate a codebook for the MFCC coefficients of each segment using vector quantization (VQ) with a size of 8 using equidistance, and keep these codebooks as the training data set.
Step 8: Repeat Step 6 and Step 7 for the test speech signal.
Step 9: Set the threshold by computing the minimum value of the codebooks in the training data set.
Step 10: Compute the average value of the codebooks in the training data set.
Step 11: If the threshold value is less than half of the average value, check the test signal against the training data set; otherwise the test speech signal is not recognized.
Step 12: Compute the distance between the test data and the training data.
Step 13: The speech with the minimum distance vector in the training data set is considered as recognized.
Step 14: Stop
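Steps 2 to 5 above can be sketched as follows. The example takes the triangular Mel filter bank as an input matrix (one filter per row over the non-negative FFT bins), applies the logarithm to the Mel band energies, and uses a direct DCT-II. The ordering of the log and Mel mapping follows the common MFCC convention and is an assumption, as are the names `dct_ii` and `mfcc_from_frames` and the default `n_coeffs`.

```python
import numpy as np

def dct_ii(x, n_out):
    """Unnormalised DCT-II along the last axis, keeping n_out coefficients."""
    m = x.shape[-1]
    n = np.arange(m)
    k = np.arange(n_out)[:, None]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2.0 * m))
    return x @ basis.T

def mfcc_from_frames(frames, fbank, n_coeffs=10):
    """MFCCs per frame: FFT power, Mel filter bank, log, then DCT."""
    frames = np.asarray(frames, dtype=float)
    n_fft = 2 * (fbank.shape[1] - 1)          # FFT size implied by the bank
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    mel_energy = power @ fbank.T              # energy in each Mel band
    log_mel = np.log(mel_energy + 1e-10)      # small offset avoids log(0)
    return dct_ii(log_mel, n_coeffs)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.standard_normal((3, 64))     # stand-in windowed frames
    fbank = np.abs(rng.standard_normal((6, 33)))  # stand-in filter bank
    print(mfcc_from_frames(frames, fbank, n_coeffs=4).shape)
```

The resulting coefficient vectors are what Steps 7 and 8 would quantise into codebooks for the training and test signals.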

RESULTS AND DISCUSSIONS

Segmentation
Figure 1 shows the original continuous Kannada speech signal segmented into four parts. The short time energy of the speech signal and the corresponding spectral centroid of each segment are shown in green, and their filtered outputs in blue. The median filter output is smooth, so the distortion present in the speech signal is eliminated. The segmented speech signal is completely separated into voiced and unvoiced parts with respect to the particular scenario of that context. Figure 2 shows the text corresponding to the Kannada speech signal. Each segmented output is completely meaningful with respect to the syntactic, semantic, and grammatical rules of Kannada, as shown in Figure 3 (a-d) using the Unicode of the Kannada language. Figure 4 illustrates some of the words present in the segmented speech signal text, written using the Unicode of the Kannada language with MATLAB R2014a.
Table 1 gives the context-based segmentation of continuous Kannada speech signals of different sizes, spoken by male and female speakers with different accents. The segmentation algorithm was tested on nearly 100 different Kannada speech signals from female and male speakers with different accents, and gives a segmentation error rate of 0.01%. Only three speech signals of different durations are listed in Table 1. The number of context-based segments in the original speech is nearly equal to the number of correct segments obtained in practice. The numbers of missed and extra segments are almost nil, and the error rate is almost negligible. The algorithm described in this paper therefore gives correct segmentation with a low error rate. Table 2 shows context-based segmentation with different vocabulary sizes for male and female speakers with different accents. The segmentation accuracy decreases as the vocabulary size increases, but the proposed algorithm still gives good segmentation accuracy even for a large vocabulary.
Figure 5 shows the continuous Kannada letters speech signal, which contains all 52 letters of the Kannada language, together with its segmentation. The letters are isolated in Figure 5 (a-h), in which only seven letters are marked in red in the speech signal.
Table 3 contains the segmentation of Kannada aksharas from the continuous Kannada akshara speech signals of male and female speakers. Segmentation in this case means the isolation of each Kannada akshara. The continuous speech signal contains a total of 52 Kannada aksharas, so the original number of segments must be 52. The algorithm, however, returns more than 52 segments: the 52 correct akshara segments plus extra segments that are not required. The error is due to these redundant extra segments, while the number of missing segments is nil for both male and female speech signals.

Recognition
Table 4 illustrates the recognition of the segmented Kannada letter speech for male and female speakers. The isolated Kannada akshara speech signals are used as the training data set for the recognition system. The recognition system outputs '1' when the test segment is recognised as present in the trained data set, and '0' otherwise. Table 5 contains the context-based continuous Kannada speech recognition results. The segmented speech signals are used as the training data set for the recognition system, which again outputs '1' when the test segment is recognised as present in the trained data set and '0' otherwise. This recognition system gives a better recognition rate.
Table 6 compares the speech recognition performance of different feature extraction methods for different languages, as reported in [21,22]. The table mainly compares MFCC combined with other feature extraction techniques, applied to the test and training data sets that are used as input to the recognition system. The speech signal of each language depends on the syntactic and semantic analysis of that language. Indian languages in particular have large vocabularies. The recognition of Indian language speech signals is shown in Table 5 with different recognition rates. The proposed method gives a better recognition rate with fewer MFCC coefficients than the methods used for the other languages. Figure 6 shows a graphical representation of the recognition accuracy for different languages [23][24][25]. The method used in this paper for Kannada speech recognition shows a better accuracy rate than the existing methods.

CONCLUSION
The segmentation and recognition of Kannada speech signals give better results in this paper because the proposed methods are efficient and effective for speech signals from different scenarios. The method provides good segmentation with respect to the context of the Kannada speech signal, with very few segmentation errors, although the error depends on the vocabulary size. For a large vocabulary, the proposed method still gives good segmentation with few missed segments. The speech recognition system, which is based on a threshold with a minimum number of MFCC coefficients and a minimum VQ codebook size, gives a better recognition rate for Kannada speech signals from different scenarios. The recognition system produces a good recognition rate and also works for different accents of male and female speakers. Future work faces further challenges in reducing errors by using different segmentation and recognition techniques.