Machine learning for Arabic phonemes recognition using electrolarynx speech

Automatic speech recognition is one of the essential ways of interacting with machines. Interest in speech-based intelligent systems has grown in the past few decades. Therefore, there is a need to develop more efficient methods for human speech recognition to ensure the reliability of communication between individuals and machines. This paper is concerned with Arabic phoneme recognition of electrolarynx speech. The electrolarynx is a device used by laryngeal cancer patients whose vocal cords have been removed. Speech recognition is considered here to find the preferred machine learning model for classifying phonemes produced by the electrolarynx device. The phoneme recognition employs different machine learning schemes, including convolutional neural network (CNN), recurrent neural network (RNN), artificial neural network (ANN), random forest, extreme gradient boosting (XGBoost), and long short-term memory (LSTM). Modern standard Arabic is utilized for the testing and training phases of the recognition system. The dataset covers both ordinary speech and electrolarynx speech recorded by the same person. Mel frequency cepstral coefficients (MFCCs) are considered as speech features. The results show that the ANN method outperformed the other methods with an accuracy rate of 75%, a precision value of 77%, and a phoneme error rate (PER) of 21.85%.


INTRODUCTION
Speech recognition is a technique for converting spoken words into text. Every spoken word is a composition of the most basic symbols of a particular language: phonemes, the smallest units of each language. Recognizing these basic phoneme units is critical for developing speech recognition systems [1]. The language model and the acoustic model are the two main components of any speech recognition system, and a phoneme recognizer's accuracy is crucial to the acoustic model's accuracy [2]. Phoneme-based speech recognition is used to avoid a large vocabulary size, because words can be generated by combining the phonemes of the language. Due to the finite number of phonemes in each language, the process requires a smaller amount of training data compared to word-based models. The reduced system complexity allows the use of neural networks (NNs), which are often employed in speech recognition systems. A neural network is a kind of machine learning technique based on the human nervous system and brain structure [3]. Machine learning methods have recently attracted increasing attention due to their ability to extract robust latent characteristics that allow various recognition algorithms to generalize across a variety of applications.

RELATED MACHINE LEARNING ALGORITHMS
The term machine learning (ML) refers to a collection of methodologies or algorithms that enable computers to automate data-driven model building by systematically detecting patterns in statistically significant data [15]. There are three categories of ML: supervised, unsupervised, and reinforcement learning. The basic concepts of the supervised ML algorithms used in this paper for training and implementing an Arabic phoneme recognition system are described briefly in the next subsections.

Artificial neural network
The artificial neural network (ANN) is a non-parametric prediction tool for pattern classification applications, including speech recognition [16]. ANNs extract complex patterns from data and detect trends too complicated for humans or other computational approaches to notice [17]. As a result, ANN is capable of modeling both complicated and multi-complex problems [18]. Initially, the number of layers and the activation functions are determined according to the complexity of the problem to be solved [19]. The multilayer perceptron (MLP) is one of the most widely utilized ANN architectures for pattern classification and has been employed in a variety of voice synthesis and recognition schemes [17]. Figure 1 shows the structure of an ANN, which consists of two layers: the hidden layer and the output layer. The output layer value is expressed via (1) [17]:

y_k = φ_2( Σ_{j=1}^{h} w_{kj}^{(2)} φ_1( Σ_{i=1}^{n} w_{ji}^{(1)} x_i ) ), k = 1, ..., m    (1)

where y_k is the ANN output, m is the number of output elements, x_i is the input data, n is the number of input attributes, w^{(1)} is the first-layer weight, w^{(2)} is the second-layer weight, and φ_1 and φ_2 represent the applied activation functions for each layer.
One of the weight adjustment methods applied in ANN is back propagation, which reduces the error using the gradient descent algorithm [17] by adjusting the weights according to the partial derivative of the error with respect to each weight [3]. Thus, the actual output becomes closer to the target output due to error minimization for each output neuron and for the whole network [15]. The main families of ANN are:
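As a minimal illustration of the gradient-descent weight update behind back propagation, the following sketch trains a single linear neuron on a squared-error loss; all values are illustrative and not from the paper.

```python
import numpy as np

# A minimal sketch of the gradient-descent weight update behind back
# propagation: one linear neuron trained on a squared-error loss.
# All values are illustrative, not taken from the paper.
x = np.array([0.5, -0.2, 0.8, 0.3])   # input attributes
w = np.zeros(4)                       # weights to be adjusted
target = 1.0                          # desired output
lr = 0.1                              # learning rate

for _ in range(200):
    y = w @ x                 # forward pass
    error = y - target        # output error
    grad = error * x          # dE/dw for E = 0.5 * error**2
    w -= lr * grad            # move weights against the gradient
```

Each update moves the weights along the negative gradient, so the neuron's output converges toward the target, mirroring the error minimization described above.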

Convolutional neural network
Convolutional neural network (CNN) is a powerful family of neural networks designed specifically for image processing applications and containing convolutional layers. CNNs are regularized versions of fully connected networks, in which each neuron in one layer is connected to all neurons in the following layer [20]. Convolution, pooling, and fully connected layers are the three primary layers of a CNN. The first is the convolution layer, utilized to extract features from the input. The pooling layer is the second, used to reduce the number of parameters. Different pooling techniques are available, usually chosen based on the requirement; the most commonly used one is max pooling, which keeps only the largest element of each region of the obtained feature map [21]. Sub-sampling is frequently achieved using max/mean pooling or local averaging filters [3]. After flattening the generated feature map, the 1D array is sent into a fully connected network [21]. The final layer of the CNN handles the actual classification. Multiple series of sub-sampling and weight-sharing convolution layers can be used to build a deep CNN [3]. The basic CNN architecture with its main layers is displayed in Figure 2.
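The max-pooling step described above can be sketched in a few lines of numpy; the 4x4 feature map below is a toy example.

```python
import numpy as np

# A minimal sketch of 2x2 max pooling with stride 2: each non-overlapping
# 2x2 block of the feature map is reduced to its largest element.
def max_pool_2x2(fmap):
    h, w = fmap.shape
    trimmed = fmap[:h - h % 2, :w - w % 2]          # drop odd edges
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1., 3., 2., 0.],
                 [4., 2., 1., 5.],
                 [0., 1., 8., 2.],
                 [3., 1., 2., 6.]])
pooled = max_pool_2x2(fmap)   # 4x4 map reduced to 2x2
```

Only the maximum of each block survives, which is how pooling reduces the number of parameters passed to the following layers.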

Recurrent neural network
Recurrent neural network (RNN) is an ANN with cyclic connections, which makes it a more powerful tool for modeling sequences of data than an ordinary ANN [17]. In a regular MLP, each layer has its own weights and biases; in an RNN, the cyclic connections ensure that a neuron remembers its existing state, and the following output is generated based on this state [21]. The RNN thus has a memory that informs future predictions; as a result, anticipating any word in a sentence necessitates knowledge of the previously processed words [22]. The RNN algorithm is depicted in Figure 3. The hidden state h_t, which stores the data of the previous steps, is calculated as in (2):

h_t = f(W_hh h_{t-1} + W_xh x_t)    (2)

where h_{t-1} is the previous hidden state, x_t is the present input at time t, W_hh and W_xh are the hidden layer weights, and W_hy is the output layer weight. Based on the memory at time t, the output is calculated as in (3) [23]:

y_t = W_hy h_t    (3)

The long short-term memory (LSTM) network is a type of RNN that uses a combination of specialized and standard units. LSTM can remember prior states and can be trained to do tasks requiring memory or awareness of current states; it thereby partially overcomes the significant weakness of RNN, namely the problem of vanishing gradients [23]. As depicted in Figure 4, a memory cell is a component of LSTM units that is able to store information for a long time [3]. The memory state can add or remove information as needed using gates: the input, output, and forget gates, which are able to protect or control the memory state. The LSTM network makes use of the sigmoid activation function; this squashing function restricts the output to a range between zero and one, making it useful for predicting probabilities [23]. In Figure 4, the time step is denoted by the subscript t, and the memory cell, input, and output values are represented by c_t, x_t, and h_t, respectively. Moreover, σ(·) represents the sigmoid function, while the hyperbolic tangent function is denoted by tanh.
The gates' computations are given in (4) to (8) [3]:

f_t = σ(W_f x_t + U_f h_{t-1} + b_f)    (4)
i_t = σ(W_i x_t + U_i h_{t-1} + b_i)    (5)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)    (6)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_c x_t + U_c h_{t-1} + b_c)    (7)
h_t = o_t ⊙ tanh(c_t)    (8)

where W, U, and b stand for the input weights, recurrent weights, and biases, and the forget, input, and output gates are denoted by f_t, i_t, and o_t, respectively. The ANN algorithm described above is closely connected to deep learning [17]. A deep neural network is made of multiple layers of nodes, and many designs have been devised to address problems in a variety of areas or use cases: CNN is extensively employed in image recognition and computer vision, while RNN is commonly used in forecasting and time series problems [3].
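The gate computations in (4) to (8) can be sketched directly in numpy; the layer sizes and randomly initialized gate parameters below are illustrative assumptions, not the paper's trained LSTM.

```python
import numpy as np

# A numpy sketch of one LSTM step implementing gates (4)-(8). The layer
# sizes and random parameters are illustrative assumptions only.
rng = np.random.default_rng(0)
n_in, n_hid = 4, 8

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# W: input weights, U: recurrent weights, b: biases, one set per gate
W = {g: rng.normal(0, 0.1, (n_hid, n_in)) for g in "fioc"}
U = {g: rng.normal(0, 0.1, (n_hid, n_hid)) for g in "fioc"}
b = {g: np.zeros(n_hid) for g in "fioc"}

def lstm_step(x_t, h_prev, c_prev):
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])  # forget gate (4)
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])  # input gate (5)
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])  # output gate (6)
    c_t = f_t * c_prev + i_t * np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # (7)
    h_t = o_t * np.tanh(c_t)                                # hidden output (8)
    return h_t, c_t

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):   # run a short 5-step sequence
    h, c = lstm_step(x, h, c)
```

Because the output is squashed by the sigmoid gate and tanh, each hidden value stays strictly inside (-1, 1), while the cell state c carries information forward across steps.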

Random forest classifier
Decision tree algorithms are among the most widely used ML approaches. They are employed to represent a wide range of data classification problems [17]. Random forest classification (RFC) is a supervised classification technique in ML that uses decision trees. A decision tree method creates a tree-like model of the dataset, with each node being further divided. An RFC is a collection of de-correlated, unbiased, and unrelated decision trees and hence is called a random forest. It is based on integrating two basic principles: each decision tree is trained using only a portion of the training samples and must make its prediction using only a random subset of the whole feature set [22]. The two most significant concepts widely used for this task are the Gini and entropy values. Gini is utilized to calculate the impurity, while entropy is used to calculate the node's information gain. The formula for calculating the impurity from the Gini value is defined as (9) [24]:

Gini = 1 − Σ_{i=1}^{c} p_i²    (9)

where p_i represents the relative frequency of class i in the dataset and c represents the number of classes. To split a node, the split with the least Gini impurity is chosen; at the leaf nodes, a perfect split would result in a Gini score of 0. Similarly, the goal of entropy is to split at the node that gives the maximum information gain [24]. The RFC method also has advantages over other methods due to its suitability for classifying high-dimensional data [25].
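The Gini impurity in (9) and the entropy used for information gain can be computed as follows; the label lists are toy examples.

```python
import numpy as np
from collections import Counter

# Toy illustration of the Gini impurity (9) and the entropy used for
# information gain; the label lists below are invented examples.
def gini_impurity(labels):
    # Gini = 1 - sum(p_i^2) over the relative class frequencies p_i
    p = np.array(list(Counter(labels).values()), dtype=float)
    p /= p.sum()
    return float(1.0 - np.sum(p ** 2))

def entropy(labels):
    # Entropy = -sum(p_i * log2(p_i)), maximized by an even class mix
    p = np.array(list(Counter(labels).values()), dtype=float)
    p /= p.sum()
    return float(-np.sum(p * np.log2(p)))

pure = ["a"] * 6               # a perfect split: Gini = 0
mixed = ["a"] * 3 + ["b"] * 3  # a 50/50 split: Gini = 0.5, entropy = 1 bit
```

The pure node scores a Gini of 0, matching the "perfect split" case described above, while the 50/50 mix gives the maximum two-class impurity of 0.5.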

Extreme gradient boosting classifier
Extreme gradient boosting (XGBoost) is a sequential tree-building algorithm implemented with parallelization [26]. It is known for running faster than many comparable models and for its scalability across scenarios. There are many different boosting algorithms, such as parallel boosting, regression tree boosting, and stochastic boosting [23]. In XGBoost, the interchangeable nature of its loops forms the foundation of the tree-building algorithm: the outer loop keeps track of the tree count, while the inner loop calculates the features. Swapping these loops improves run-time performance [27]. Parallel threads are used to scan, initialize, and sort all the instances globally, and the overheads of parallelization in the computation are thereby offset. The negative loss criterion at the split point determines whether or not the tree will split at a node [26].

The considered Arabic phoneme model
Arabic is one of the oldest languages in the world, and it is the fourth most widely spoken language worldwide [28]. Several researchers have addressed the development of an Arabic speech recognition system. The Arabic phoneme set used in the present work is the small modern standard Arabic (MSA) speech corpus shown in Table 1. In the table, each phoneme is listed with its corresponding English symbol. This corpus is considered the primary reference for Arabic speech recognition systems [29]. In comparison to English, Arabic has fewer vowels, possessing only three vowels, each with a short and a long form. Emphatic articulation refers to the influence of emphatic phonemes on adjacent phonemes, particularly vowels. In nearby segments, emphatic consonants cause a significant backing gesture (i.e., sliding the tongue back during articulation), which happens mainly for adjacent vowels. This effect can be felt throughout full syllables as well as across syllable boundaries [28]. Using the symbol V for short or long vowels and C for consonants, CV, CVC, and CVCC are the syllable types allowed in Arabic. In the third type of Arabic syllable, the indicated vowel can only be short. All Arabic syllables must have at least one vowel, and all Arabic utterances must begin with a consonant [29].
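The syllable constraints above can be illustrated with a small pattern checker; the C/V pattern notation (C for a consonant, V for a short vowel, VV for a long vowel) and the helper function are assumptions introduced for illustration, not part of the paper.

```python
import re

# Illustrative checker for the allowed Arabic syllable types. The C/V
# pattern notation (C = consonant, V = short vowel, VV = long vowel) and
# this helper are assumptions for illustration only.
ALLOWED = {
    "CV": re.compile(r"^C(V|VV)$"),    # vowel may be short or long
    "CVC": re.compile(r"^C(V|VV)C$"),  # vowel may be short or long
    "CVCC": re.compile(r"^CVCC$"),     # the vowel can only be short
}

def syllable_type(pattern):
    """Return the Arabic syllable type of a C/V pattern, or None if not allowed."""
    for name, rx in ALLOWED.items():
        if rx.match(pattern):
            return name
    return None   # e.g. "V" fails: every syllable must start with a consonant
```

Note that the CVCC pattern deliberately admits no VV form, encoding the rule that the vowel of the third syllable type can only be short.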

The proposed Arabic phoneme recognition model
In this section, the proposed Arabic phoneme model is discussed in detail. The model comprises four parts: first, the dataset is prepared; then, MFCC features are extracted; this is followed by the training and testing phases of the related ML algorithms, which are then evaluated with the proposed performance measures. The structure of the proposed Arabic phoneme model is shown in Figure 5 and involves the following processing steps.

The preparation of data set
The recognition system uses 27 Arabic phoneme classes, as shown in Table 1. The dataset was gathered for two cases: i) normal speech and ii) EL-produced speech. The dataset consists of 63 instances for each phoneme class. The tests involved eight individuals (three males and five females). The recorded speech is sampled at a 48 kHz sampling frequency, covering a one-second utterance.
To have an idea about the quality of the speech produced by EL, Figures 6 and 7 show the time waveform of normal phoneme example ‫/ع/‬ and its EL produced version.
The processed dataset used in the NN is divided into three subsets. The first is the training set, which is used to update the weights and biases according to the output values of the NN and the target output. The second set is utilized for validation by measuring the NN's generalization in order to stop training before overfitting occurs. The last set is the testing set, which serves as an independent measure of NN performance, using random indices for future prediction. In general, the training set makes up approximately 70% of the entire dataset, while the validation and testing sets together make up the remaining 30%. The experiments are conducted on the prepared dataset of 63 instances for each of the 27 phoneme classes.
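A minimal sketch of such a split follows, assuming a 70/15/15 partition; the paper only states roughly 70% training with the remaining 30% shared between validation and testing.

```python
import numpy as np

# A minimal sketch of the dataset split described above. The 15/15
# division of the non-training 30% is an assumption.
rng = np.random.default_rng(42)

n_samples = 27 * 63                  # 27 phoneme classes x 63 instances
indices = rng.permutation(n_samples)

n_train = int(0.70 * n_samples)
n_val = int(0.15 * n_samples)

train_idx = indices[:n_train]                      # weight/bias updates
val_idx = indices[n_train:n_train + n_val]         # early-stopping check
test_idx = indices[n_train + n_val:]               # independent evaluation
```

Shuffling before slicing gives the random indices mentioned above and guarantees the three subsets are disjoint.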

Extraction of features
Because the input data provided to the system is too large to be processed directly and is highly redundant, feature extraction is applied as a dimensionality reduction procedure. Several methods have been designed to extract features for speech recognition [30]. Experiments revealed that MFCC is a commonly utilized technique, especially for noisy datasets like the collected speech dataset produced by the EL device. In this work, a vector of 40 MFCC features is created for each phoneme; this value was found to give better results than using 10, 20, 80, 120, or 200 MFCC features. Figure 8 [31] shows the main steps for calculating the MFCC coefficients. These are found by taking the fast Fourier transform of a windowed portion of the speech signal. The power spectrum obtained is then mapped onto the Mel scale using overlapping triangular windows. This is followed by taking the logarithm of the power at each of the Mel frequencies. Then, the discrete cosine transform is applied to the sequence of Mel log powers, and the amplitudes of the resulting spectrum constitute the MFCC feature vector.
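The MFCC steps above (windowing, FFT power spectrum, Mel triangular filterbank, log, and DCT) can be sketched for a single frame in plain numpy; the frame length, filter count, and other parameters below are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

# A single-frame sketch of the MFCC pipeline; parameters are illustrative.
def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the Mel scale
    hz_pts = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fbank[i - 1, k] = (k - l) / max(c - l, 1)   # rising edge
        for k in range(c, r):
            fbank[i - 1, k] = (r - k) / max(r - c, 1)   # falling edge
    return fbank

def mfcc_frame(signal, sr=48000, n_fft=2048, n_filters=40, n_mfcc=40):
    frame = signal[:n_fft] * np.hamming(n_fft)                 # 1) window
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2 / n_fft     # 2) FFT power
    mel_energy = mel_filterbank(n_filters, n_fft, sr) @ power  # 3) Mel map
    log_mel = np.log(mel_energy + 1e-10)                       # 4) logarithm
    n = np.arange(n_filters)                                   # 5) DCT-II
    basis = np.cos(np.pi * np.outer(np.arange(n_mfcc), 2 * n + 1) / (2 * n_filters))
    return basis @ log_mel

tone = np.sin(2 * np.pi * 440 * np.arange(48000) / 48000)  # 1 s test tone
features = mfcc_frame(tone)                                # 40 MFCC features
```

In practice a library such as librosa would be used over many overlapping frames; this sketch only makes the five steps of Figure 8 concrete for one frame.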

Data set training and testing
Like any other pattern recognition system, the process of phoneme recognition consists of two phases, namely training and testing. In the training phase, a database of the features extracted from the whole dataset is created. These features are used to train the machine learning algorithms applied in this paper. The applied machine learning models are ANN, CNN, RNN, RF, XGB, and LSTM, used either independently or in a hybrid model for the classification task; the applied hybrid models are CNN-LSTM, CNN-RF, and CNN-XGB. In the testing phase, features are extracted from every input signal, and a feature matching process is performed to decide whether these features belong to the previously created database or not. The behavior of each machine learning method is evaluated based on specific parameters for its performance measurement. The parameters of each model were varied many times to ensure a fair comparison and to obtain the best results for the given method. The ANN model is trained with 40 neurons in the input layer, 256 neurons in the hidden layer, and 27 neurons in the output layer. Exponential linear unit (ELU) and SoftMax activation functions [32] are used with a learning rate of 0.001 and 1,500 epochs. The batch size is 32, while the dropout value is 0.5 for the input layer and 0.75 for the output layer. A regularizer value of 0.2 is used to address the overfitting issue.
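A forward pass through the 40-256-27 topology described above, with ELU and SoftMax activations, can be sketched in numpy; the weights are random placeholders standing in for the trained parameters.

```python
import numpy as np

# Sketch of the 40-256-27 ANN forward pass described above. The random
# weights are placeholders, not the paper's trained model.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.1, (256, 40)), np.zeros(256)
W2, b2 = rng.normal(0, 0.1, (27, 256)), np.zeros(27)

def elu(z, alpha=1.0):
    # Exponential linear unit: identity for z > 0, smooth saturation below
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

def ann_forward(mfcc_vector):
    hidden = elu(W1 @ mfcc_vector + b1)   # 256-neuron hidden layer
    return softmax(W2 @ hidden + b2)      # probabilities over 27 phonemes

probs = ann_forward(rng.normal(size=40))  # one 40-dim MFCC vector in
```

The SoftMax output yields a probability distribution over the 27 phoneme classes, and the predicted phoneme is simply the class with the largest probability.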
For the XGB classifier, the applied parameters are as follows: the number of estimators is 102, the maximum depth is 5, and the learning rate is 0.7. For RFC, the parameters are: 73 estimators, a maximum depth of 90, a minimum samples leaf of 1, and a minimum samples split of 2. For the CNN model, the system is trained using one convolution layer, one maximum pooling layer, and one flatten layer, with an input layer of 40 neurons, 2 hidden layers of 256 neurons each, and an output layer of 27 neurons. The applied activation functions are rectified linear unit (ReLU), ELU, and SoftMax [33], while the batch size is 64. The dropout value is 0.6, and the regularizer is 0.02. In the LSTM model, the system is trained with one input layer of 40 neurons, one hidden layer of 256 neurons with one flatten layer, and an output layer of 27 neurons. The dropout values are 0.2, 0.25, and 0.6, and the applied activation functions are ELU and SoftMax [32].
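Assuming scikit-learn's RandomForestClassifier, the stated RFC hyperparameters could be set up as follows; the toy two-class data is illustrative only, standing in for the real 40-dimensional MFCC vectors of the 27 phoneme classes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# A hedged sketch of the stated RFC configuration using scikit-learn.
# Two well-separated toy clusters stand in for the real MFCC features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 40)), rng.normal(5, 1, (50, 40))])
y = np.array([0] * 50 + [1] * 50)

clf = RandomForestClassifier(
    n_estimators=73,       # 73 estimators, as stated above
    max_depth=90,
    min_samples_leaf=1,
    min_samples_split=2,
    random_state=0,
)
clf.fit(X, y)
train_acc = clf.score(X, y)   # accuracy on the training data
```

With min_samples_leaf of 1 and a deep maximum depth, each tree can fit the training data closely, which is why ensembling de-correlated trees is needed for generalization.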
For the hybrid CNN-XGB model, the system is trained using 2 convolution layers, 2 maximum pooling layers, and 1 flatten layer. The input layer has 40 neurons, while the number of estimators is 40, the maximum depth is 40, and the learning rate is 0.8. The CNN-RFC model is trained using two convolutional layers, two maximum pooling layers, and one flatten layer. The input layer has 40 neurons, while the parameters of RFC classifier are: the number of estimators is 31, maximum depth is 20, minimum samples leaf is 1, and the minimum samples split is 2.

Performance measures
The performance of the Arabic phoneme recognition system is evaluated for each machine learning method considered in this work. After feature extraction, the model is implemented to produce an output in the form of a class. The accuracy rate is considered the main classification performance measure here. Further, the receiver operating characteristic (ROC) curve, phoneme error rate (PER), confusion matrix, and precision are also considered as performance measures.
The accuracy or recognition rate represents the ratio of correctly predicted samples to the total number of instances present in the dataset. The following relation determines the accuracy rate (ACC) [25]:

ACC = (TP + TN) / (TP + TN + FP + FN)    (10)

where TP is true positive, TN is true negative, FP is false positive, and FN is false negative. The ROC metric is used to assess a classifier's output quality. The ROC curve plots the false positive rate (the fraction of negative instances incorrectly predicted as positive) on the x-axis versus the true positive rate (the fraction of positive instances correctly predicted) on the y-axis [34]. Each classifier uses a single point on the ROC to represent its false positive (FP) and true positive (TP) pair, and the upper left corner of the ROC plot represents perfect or ideal classification [25].
The ROC curve is insensitive to class distribution: if the proportion of positive to negative instances changes, the ROC curve will not change. There is a simple relationship to determine the accuracy based on the ROC:

ACC = p · TPR + n · (1 − FPR)    (11)

where TPR is the true positive rate, FPR is the false positive rate taken from the ROC, p is the fraction of positive instances, and n is the fraction of negative instances. The confusion matrix gives valuable information for comparing actual and predicted classes to evaluate the classifier's performance; the four categories covered by this matrix are shown in Figure 9. TPs and TNs occur when the predicted phoneme class matches the actual one, while FPs and FNs occur when the prediction does not match the actual phoneme class: the predicted class in an FN is negative but the actual class is positive, while the predicted class in an FP is positive but the actual class is negative [25]. Two main values used for performance evaluation, as depicted in Table 4, can be calculated from the confusion matrix: accuracy as in (10) and precision as in (12):

Precision = TP / (TP + FP)    (12)

In the field of speech recognition and processing, the error rate is a widely used measure. The error rate here is the PER of a phoneme-level recognition system, calculated using (13) and (14):

E = S + D + I    (13)
PER = (E / N) × 100%    (14)

where S, D, and I are the numbers of substituted, deleted, and inserted phonemes and N represents the total number of reference labels or classes.
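The accuracy, precision, and PER measures above can be illustrated with a small numpy sketch; the 3x3 confusion matrix and phoneme strings are toy data, not the paper's 27-class results.

```python
import numpy as np

# Toy illustration of accuracy, per-class precision, and PER. The 3x3
# confusion matrix and phoneme strings are invented examples.
cm = np.array([[8, 1, 1],     # rows: actual class, columns: predicted class
               [0, 9, 1],
               [2, 0, 8]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all instances
precision = np.diag(cm) / cm.sum(axis=0)  # TP / (TP + FP), per class

def phoneme_error_rate(reference, hypothesis):
    """PER = 100 * (S + D + I) / N, via the edit distance between the
    reference and hypothesis phoneme sequences."""
    n, m = len(reference), len(hypothesis)
    d = np.zeros((n + 1, m + 1), dtype=int)
    d[:, 0], d[0, :] = np.arange(n + 1), np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,         # deletion
                          d[i, j - 1] + 1,         # insertion
                          d[i - 1, j - 1] + cost)  # substitution
    return 100.0 * d[n, m] / n

per = phoneme_error_rate(list("badr"), list("bidr"))  # one substitution in four
```

The diagonal of the confusion matrix holds the TPs, so precision for each class divides its diagonal entry by the corresponding column sum.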

RESULTS AND DISCUSSION
This section summarizes the tests that were carried out to find the most suitable ML algorithm among the tested methods for the phoneme classification task and discusses the results. The results confirm that the ANN model is the most appropriate choice for Arabic phoneme recognition. Furthermore, the applied models were evaluated on heavily corrupted noisy signals to assess the capability of the proposed model in dealing with such signals. Figure 9 shows the accuracy results for the training and testing phases obtained by the proposed methods for Arabic phoneme recognition on both normal and EL-produced speech. As indicated in the figure, the ANN model outperforms the other models with a 73.23% testing accuracy rate. ANN is a comparatively lightweight way of solving data classification problems, especially under limited-dataset conditions.

Recognition results
The training learning curve, calculated from the training dataset, gives an idea of how successfully the method is learning, while the validation learning curve, calculated from a hold-out validation dataset, gives an idea of how well the model generalizes. It is also common to create learning curves for optimization in terms of cross-entropy loss, with model performance evaluated using classification accuracy. Figure 10 shows the ANN model learning curves as the training accuracy grows. The validation accuracy increases at first and then begins to fall after a given number of epochs due to the overfitting effect. This has been dealt with by using regularization during training and adding dropout layers. The loss measures the model error; it decreases as the training progresses, indicating better model performance.

Figure 9. Accuracy results of different methods
Figure 10. Learning curves for ANN model

The higher the ROC value, the better the model. Figure 11 shows the ROC results for each class of the ANN network. In the confusion matrix of the ANN network, shown in Figure 12, the TPs are represented on the matrix's diagonal. It confirms that most of the phonemes were classified correctly, and the letter ‫/ش/‬ of class 20 achieved the highest value among the letters. The precision of the ANN model is 77%, while the PER is 21.85%, both calculated from the confusion matrix in Figure 12.

CONCLUSION
This work aims to adopt the most promising machine learning model to recognize the speech produced by the electrolarynx device. Automatic speech recognition has recently emerged as a significant research subject in human-computer interaction. The work in this paper investigates the most useful machine learning techniques for Arabic phoneme recognition. The applied machine learning models are ANN, CNN, RNN, RF, XGB, and LSTM, used either independently or in a hybrid model for the classification task; the applied hybrid models are CNN-LSTM, CNN-RF, and CNN-XGB. According to Table 1, the experiments have been conducted on the prepared dataset of 63 instances for each of the 27 Arabic phoneme classes. The dataset has been gathered for two cases: normal speech and EL-produced speech. The performance results show that ANN outperforms the other applied models, with a 97.49% training accuracy rate, a 73.23% testing accuracy rate, an achieved PER of 21.85%, and a precision value of 77%. Due to their robust pattern recognition and classification skills, ANNs have delivered remarkable results. The behaviors and performances of the ANN training algorithms have been compared in terms of the performance metrics.