English speaking proficiency assessment using speech and electroencephalography signals

In this paper, the English speaking proficiency level of non-native English speakers was automatically estimated as high, medium, or low. For this purpose, the speech of 142 non-native English speakers was recorded, and electroencephalography (EEG) signals of 58 of them were recorded while they were speaking in English. Two systems were proposed for estimating the English proficiency level of the speaker: one used 72 audio features extracted from the speech signals, and the other used 112 features extracted from the EEG signals. Multi-class support vector machines (SVMs) were used for training and testing both systems with a cross-validation strategy. The speech-based system outperformed the EEG-based system, with 68% accuracy on 60 test audio recordings compared with 56% accuracy on 30 test EEG recordings.

1. INTRODUCTION
In addition to being the language of science, English has become the medium of instruction and communication across the world. Proficiency assessment is very important for English learners, as its outcome influences their academic careers. Therefore, providing reliable and instant English proficiency assessment is crucial for determining their academic progress. One of the most common problems facing non-native speakers when learning English is the inability to pronounce the sounds of English words properly. Pronunciation quality is the most noticeable difference between native and non-native English speakers, and learning proper English pronunciation helps learners communicate effectively with native English speakers. Hence, an automatic system that provides immediate assessment and feedback would help learners improve their English. Currently, there are systems which can automatically measure the English pronunciation quality of a speaker and give instant feedback at the word and phoneme levels. These systems depend on the speech signal: they extract acoustic features of each phone and compare them with standard pronunciation models (mono-phones or tri-phones), which are usually trained on native English speech.
Computer assisted language learning (CALL) systems [1], [2] have proven to be helpful and effective for teaching non-natives the pronunciation details of a new language, especially in the starting phase of learning and for pronunciation training. Computer assisted pronunciation training (CAPT) systems [3] are another kind of learning system; they are always available, can be used everywhere, and allow the learner to make mistakes without loss of self-confidence because the teaching process is one-to-one. This gives the learner a positive experience in the learning process [3]. Most computer-assisted pronunciation training systems are based on automatic speech recognition (ASR) techniques. CALL systems use ASR to offer new perspectives for language learning [4]. Language learners "are known to perform best in one-to-one interactive situations in which they receive optimal corrective feedback" [3]. Due to lack of time, it is not always possible to provide individual corrective feedback. ASR-based CALL systems, however, can recognize what a person actually uttered, detect pronunciation and language errors, and provide feedback spontaneously. According to [4], the system's accuracy is higher for native speech than for non-native speech, and speech recognition technology is still at an early stage of development in terms of accuracy, although there are some commercial systems for non-native English speakers across the world. As a consequence, a general ASR application for English learning may not work satisfactorily for Arab pronunciation learners, because "the former requires the ASR in general to be forgiving to allophonic variation due to accent" [5]. Most available English proficiency assessment systems depend on the speech signal: they extract acoustic features of each sound and compare them with standard pronunciation models for each phone, or tri-phone, which are usually trained on native English speech.
The accuracy of such systems is acceptable, but since they depend on the speech signal only, their performance is strongly affected by the quality of the recorded signal. In other words, background noise, which is inevitable, degrades the efficiency of these systems dramatically. To overcome this limitation and improve accuracy, in this framework we use EEG signals, which reflect the firing of neurons inside the learner's brain, alongside the speech signal to measure the proficiency and confidence level of the English learner. EEG signals have been successfully used in many applications in various fields, such as emotion detection, controlling robots and other appliances by thinking, typing characters by thinking, and many other interesting applications. We believe that by combining speech and EEG signals, the performance of automatic systems for estimating English proficiency and confidence level can be improved significantly, and the problems of depending only on the speech signal can be partially or completely solved. To our knowledge, this is the first attempt to use multimodality (voice and EEG) for assessing the spoken English quality and confidence level of non-native speakers. Such a system is very important for giving immediate feedback to English learners, especially when assessing speaking quality. The traditional way of assessing speaking quality and level of confidence is to set up an interview with an English expert; developing an automatic system for this task will therefore bring many benefits for both language learners and assessors. The rest of the paper is organized as follows: the literature review is presented in section 2, and the dataset collection and description in section 3.1. Sections 3.2 and 3.3 describe the audio- and EEG-based systems, respectively. Experiments and results are presented and discussed in section 4.
Conclusion and future work are presented in section 5.

2. LITERATURE REVIEW
Automatic speech processing technology has been applied to many different fields over the past two decades. For example, preparation for the English proficiency tests required by higher education institutions [6], evaluation of foreign-based English-speaking call center agents [7], and aviation English evaluation [8] are heavily dependent on speech technology. For more examples and details, see [9].
In most of these systems, the participants need to speak to the system for language proficiency evaluation. Reading aloud is the most common task type, where the participant reads one sentence or a set of sentences out loud. To make these systems more interactive and able to communicate with the participants by speech, automatic speech recognition (ASR) systems, which convert speech into text, are used, even with heavily accented non-native speech. Different feature types representing how non-native speakers produce English sounds and speech patterns are extracted from the participants' responses and used in the English proficiency evaluation. Some of the most successfully used features include a phone's spectral match to native speaker acoustic models [10], a phone's duration compared to native speaker models [11], fluency features such as the rate of speech, average pause length, and number of disfluencies [12], and prosody features such as pitch and intensity slope [13].
Although most of these applications elicit restricted speech, some have applied automated scoring to non-native spontaneous speech, so that the speaker's communicative competence is fully evaluated (e.g., [6] and [14]). In such systems, the same types of prosody, fluency, and pronunciation features are extracted. Furthermore, features related to additional aspects of a speaker's proficiency in the non-native language can be extracted, such as vocabulary usage [15], syntactic complexity [16], [17], and topical content [18].
In other related studies, the fundamental frequency (F0) and pitch contours were used to automatically assess the oral reading proficiency of non-native speakers. For example, Tepperman et al. [19] developed canonical contour models of F0 and compared the non-native speakers' models with the native speakers'. The contours were modeled at the word level, and prosody scores were then computed from a combination of the contour features, energy features, and duration features. A correlation of 0.80 between these scores and human ratings was reported. Moreover, based on an autocorrelation maximum a posteriori approach, the authors of [19] took advantage of pitch floor and ceiling values to capture aspects of pronunciation quality not seen at the segment level, and achieved a maximum accuracy of 89.8% in a classification task discriminating between native and non-native speakers.
Electroencephalography (EEG) is an electrophysiological monitoring method that measures and records the electrical activity of the brain. It is a readily available test that provides evidence of how the brain functions over time. It is typically noninvasive, with the electrodes placed along the scalp using the standardized electrode placement scheme known as the international 10-20 system [20]. Wires attach these electrodes to a machine, which records the electrical impulses. It has been proven that EEG signals can be used for new methods of communication besides the well-known clinical applications; in particular, it is feasible to link speech recognition and EEG, that is, to use EEG for the recognition of normal speech [21].

3. SYSTEM DESIGN
3.1. Database description
To our knowledge, there is no available dataset which includes both speech and EEG for the purpose of estimating the English proficiency level of non-native English speakers. Therefore, we have collected our own dataset. For this purpose, a short interview was designed which consists of two parts. First, each participant is asked to talk about himself/herself for two minutes; then he/she is asked to describe a picture presented on a paper in front of him/her, also for around two minutes. With the help of an English expert from the English department at Birzeit University, interviews in English were recorded with 142 participants (university students learning in English and instructors teaching in English at Birzeit University) with different levels of English proficiency. A high quality close-talking microphone was used for recording the audio signals of all 142 participants in a quiet environment. A sampling frequency of 44.1 kHz was used, and recordings were saved in wav format. The average length of the speech files is 2.5 minutes. During speech recording, an Emotiv EPOC headset [22] was used to record EEG signals for all 58 male participants; these were saved in csv format. Fourteen soft electrodes were attached to the participant's scalp with special adhesive electrode gel, located according to the international 10-20 system [20].
Because of the long hair and covered heads of most of the female participants, EEG signals could not be recorded for the 84 females; therefore, EEG signals were recorded for males only. With the help of the English expert from the English department at Birzeit University, evaluation criteria were defined for assessing the English proficiency level of each participant. The English expert listened to all recorded interviews and performed the assessment by assigning each participant a score from 1 to 10 using the predefined assessment criteria.
Based on the results of the expert evaluation, all participants were divided into three skill levels: participants with average scores of 8-10 are classified as high proficiency (HP), participants with average scores of 5-7 as medium proficiency (MP), and participants with average scores of 1-4 as low proficiency (LP). According to these criteria, 20 participants were classified as HP, 47 as MP, and the remaining 75 as LP. More details about the participants and the result of the human expert evaluation are found in Table 1.
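The score-to-class mapping described above can be expressed as a small helper (the function name and its Python realization are ours, for illustration only):

```python
def proficiency_class(score):
    """Map the expert's 1-10 score to one of the three proficiency levels."""
    if score >= 8:
        return "HP"   # high proficiency (8-10)
    if score >= 5:
        return "MP"   # medium proficiency (5-7)
    return "LP"       # low proficiency (1-4)
```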

3.2. Speech features
To build an automatic assessment system based on speech, the speech recordings of each skill level (HP, MP and LP) were used to train a classification system which can classify the skill level of an English speaker into one of the three classes: HP, MP or LP. The following audio features were extracted.
− Short frame energy: The speech signal is divided into short frames of 20 ms length at a rate of 10 ms (i.e. the frame overlap is 50%). Each frame is multiplied by a Hamming window, and the short frame energy is then computed in decibels for each frame, as shown in (1), and used as an audio feature for our proposed system:

E = 10 log10 Σ_{n=0}^{N−1} [s(n) w(n)]^2 (1)

where s(n) is the speech signal and w(n) is the Hamming window of length N, with the formula shown in (2):

w(n) = 0.54 − 0.46 cos(2πn / (N − 1)), 0 ≤ n ≤ N − 1 (2)
− Short frame zero-crossing rate (ZCR): To make the speech signal unbiased, the signal average (DC component) is computed and subtracted from the signal. Then, the number of time-axis crossings is computed for each short frame, as shown in (3), and these counts are divided by the total number of zero-crossings of the whole utterance:

Z = (1/2) Σ_{n=1}^{N−1} |sgn(s(n)) − sgn(s(n−1))| (3)
where sgn() is the sign function, which gives 1 for positive values and −1 for negative values.
− Mel frequency cepstral coefficients (MFCCs): MFCC features are the most commonly used features in speech processing applications. They represent the general shape of the power spectrum of each frame with a low-dimensional feature vector (12 coefficients). More details about the MFCC technique can be found in [23]. The first 12 MFCCs of each frame are appended to the audio feature vectors of our audio-based system.
− Short-time frame pitch: Pitch refers to the fundamental frequency of voiced speech. It is an important feature that contains speaker-specific information; it is a property of the vocal folds in the larynx and is independent of the vocal tract. A single pitch value is determined from every windowed frame of speech.
There are a number of algorithms for estimating pitch from the speech signal. Among these, one of the most popular is the robust algorithm for pitch tracking (RAPT) proposed by Talkin [24]. This algorithm was used to extract pitch in all experiments reported in this paper.
− Formant frequencies: The general shape of the vocal tract is characterized by the first few formant frequencies. The Praat toolkit [25] has been used to estimate the first three formant frequencies and their gains, which are appended to the acoustic feature vectors.
− Phoneme rate: Speaking rate has been used as a feature in numerous speech processing applications. In this work, the speaking rate was estimated from the number of phonemes in each 0.5 s window. The publicly available English phone recognizer developed in [26] was used to generate the phonetic labels.
− Pauses: Pauses in speech have a meaning. There are two types of pauses: empty pauses, which are silent intervals in the speech signal, during which the speaker is usually thinking of the next utterance; and filled pauses, vocalizations which do not have a lexical meaning. Usually, non-native speakers need more time (pauses) to think of the proper and suitable words and to produce meaningful sentences while speaking. These pauses are relatively longer than the natural pauses made by native speakers. Therefore, the length and frequency of pauses may carry important information about the speaker's proficiency in the foreign language. Based on the short frame energy and zero-crossing rate, a simple algorithm has been developed for estimating the length and the number of pauses occurring in each utterance. Frames with low energy and high zero-crossing rate are usually silence frames and, hence, are classified as pause frames.
In contrast, frames with high energy and relatively low zero-crossing rate correspond to normal speech and are classified as speech frames. If the number of successive pause frames exceeds a practically specified threshold, they are considered a pause. Therefore, if a pause exceeds a certain duration, or if pauses occur many times while speaking, this may indicate low language proficiency.
Each of the above 24 audio features (energy, ZCR, 12 MFCCs, pitch, 3 formant frequencies and their 3 gains, phoneme rate, average pause length, number of pauses) is represented by its average, minimum and maximum values over each utterance. This results in a 72-dimensional vector for each utterance. Using the cross-validation method (as described earlier), these feature vectors are used to train and test a three-class support vector machine (SVM) system. The results are presented in section 4.
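As a minimal illustration of the front-end described above, the framing, energy, ZCR and pause-detection steps can be sketched as follows (the function names are ours, and the thresholds are illustrative placeholders, not the tuned values used in the experiments):

```python
import numpy as np

def frame_signal(s, fs, frame_ms=20, hop_ms=10):
    """Split a signal into overlapping frames (20 ms frames, 10 ms hop),
    after removing the DC component."""
    s = s - np.mean(s)
    n = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    return np.array([s[i:i + n] for i in range(0, len(s) - n + 1, hop)])

def frame_energy_db(frames):
    """Short frame energy in dB after Hamming windowing, as in (1)-(2)."""
    w = np.hamming(frames.shape[1])
    e = np.sum((frames * w) ** 2, axis=1)
    return 10.0 * np.log10(e + 1e-12)          # offset guards against log(0)

def frame_zcr(frames):
    """Per-frame zero-crossing counts as in (3), normalized by the
    total number of crossings in the utterance."""
    signs = np.where(frames >= 0, 1.0, -1.0)   # sgn(), with sgn(0) taken as 1
    counts = 0.5 * np.sum(np.abs(np.diff(signs, axis=1)), axis=1)
    total = counts.sum()
    return counts / total if total > 0 else counts

def detect_pauses(energy_db, zcr, energy_thresh, zcr_thresh, min_frames):
    """Group runs of low-energy, high-ZCR frames into pauses; return the
    number of pauses and their average length in frames."""
    is_pause = (energy_db < energy_thresh) & (zcr > zcr_thresh)
    lengths, run = [], 0
    for p in np.append(is_pause, False):       # sentinel flushes the last run
        if p:
            run += 1
        else:
            if run >= min_frames:
                lengths.append(run)
            run = 0
    return len(lengths), (float(np.mean(lengths)) if lengths else 0.0)
```

The utterance-level representation is then obtained by taking the minimum, mean and maximum of each per-frame feature and concatenating them with the pitch, formant, phoneme-rate and pause statistics into the 72-dimensional vector.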

3.3. EEG-based system
3.3.1. EEG pre-processing
In addition to the language thinking that occurs while speaking, recorded EEG signals include multiple other sources of activity, such as eye blinking, eye movement, head movement, and muscle movement, which are known as artifacts. In our case, we are interested in the second-language thinking information; therefore, the other activities are considered noise for our system and need to be removed. Numerous techniques have been proposed for avoiding, rejecting and removing artifacts from EEG signals [27]. More recently, the independent component analysis (ICA) technique has been used successfully to remove EEG artifacts [28], and it has been demonstrated to be more reliable than other artifact-removal methods.
There are many implementations of ICA artifact removal, which differ in how they measure the independence of the components and estimate the mixing matrix. However, a recent study showed that there is no significant difference in the performance of these algorithms. Moreover, it showed that ICA reliability depends more on pre-processing, such as raw data filtering, than on the algorithm type [29].
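Using the FastICA implementation in scikit-learn, the ICA artifact-removal step can be sketched as follows (the function name is ours, and deciding which components are artifacts is assumed to be done separately, e.g. by visual inspection of the components):

```python
import numpy as np
from sklearn.decomposition import FastICA

def remove_artifact_components(eeg, artifact_idx, random_state=0):
    """Decompose multi-channel EEG (n_samples x n_channels) with FastICA,
    zero out the components listed in `artifact_idx` (e.g. eye blinks),
    and reconstruct the cleaned channel signals."""
    ica = FastICA(n_components=eeg.shape[1], random_state=random_state)
    sources = ica.fit_transform(eeg)       # estimated independent components
    sources[:, artifact_idx] = 0.0         # drop the artifact components
    return ica.inverse_transform(sources)  # back to channel space
```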

3.3.2. EEG feature extraction
Feature extraction is the process of finding appropriate representative features from the raw EEG signals which can be used to classify different brain activity patterns. The most common features for raw EEG signal processing are temporal, spectral and time-frequency features [30].

3.3.3. EEG classification
There are many classification techniques applied in EEG processing. The most successfully used include SVMs, k-nearest neighbour (KNN) and naïve Bayes (NB) [31]. To remove low- and high-frequency noise from the recorded data, the signals were band-pass filtered (bandwidth 1 to 40 Hz). To separate and remove artifact sources from the EEG, the ICA algorithm (referred to as FastICA) [28] was applied to the filtered data. Features are then extracted from the pre-processed EEG channels. The EEG signal of each channel is divided into frames of 1.5 s length with 0.5 s overlap. A set of spectral features is obtained with a 128-point fast Fourier transform (FFT): the sum of the spectral power lying in the delta (1 to 4 Hz), theta (4 to 8 Hz), alpha (8 to 13 Hz) and beta (13 to 20 Hz) bands, and the relative intensity ratio of each band, are used as features. The eight features extracted from each of the fourteen channels are concatenated to form the final feature vector of length 112.
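The eight per-channel features can be sketched as follows (assuming a 128 Hz sampling rate for the headset; the function and constant names are ours):

```python
import numpy as np

# Frequency bands in Hz: delta, theta, alpha, beta
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13), "beta": (13, 20)}

def band_features(frame, fs=128):
    """Compute the four band powers and their relative intensity ratios
    for one EEG frame of a single channel, via a 128-point FFT."""
    spec = np.abs(np.fft.rfft(frame, n=128)) ** 2
    freqs = np.fft.rfftfreq(128, d=1.0 / fs)
    powers = []
    for lo, hi in BANDS.values():
        powers.append(spec[(freqs >= lo) & (freqs < hi)].sum())
    total = sum(powers)
    ratios = [p / total if total > 0 else 0.0 for p in powers]
    return powers + ratios   # 4 absolute powers + 4 ratios = 8 features
```

Concatenating the eight features from all fourteen channels then yields the 112-dimensional vector.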

4. RESULTS AND DISCUSSION
4.1. Speech-based system
As mentioned earlier, the leave-one-out cross-validation technique was used for training and testing our two sub-systems. For the audio system, the speech of 60 participants (20 from each group) was used for training and testing the SVM classifier. Over the 60 tests, the system accuracy is 68%. The confusion matrix of the system is shown in Table 2. It is clear from the confusion matrix that there is large confusion between the high and medium performance groups and very small confusion with the low performance group. A possible explanation is that the average proficiency score of the participants classified as LP is near the lowest border of the low range, whereas the average score of the participants classified as HP is near the lower border of the high range. This makes it difficult to distinguish between the high and medium performance skill levels.
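The leave-one-out protocol can be sketched with scikit-learn as follows (the RBF kernel and default hyper-parameters are our assumptions; the paper does not specify them):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

def loo_accuracy(X, y):
    """Leave-one-out evaluation of a multi-class SVM, as applied to both
    the 72-dim speech vectors and the 112-dim EEG vectors."""
    correct = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        clf = SVC(kernel="rbf")            # kernel choice is an assumption
        clf.fit(X[train_idx], y[train_idx])
        correct += int(clf.predict(X[test_idx])[0] == y[test_idx][0])
    return correct / len(y)
```

The same routine applies unchanged to the two-class (high vs. low) setting, after relabeling HP and MP as a single class.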

4.2. EEG-based system
Recall that the amount of EEG data is much smaller than the amount of speech data. According to the English expert evaluation, the numbers of participants with EEG recordings in the three skill levels are as follows: 10 in the HP group, 11 in the MP group and 21 in the LP group. To keep the data size of each class balanced, the EEG recordings of 10 subjects from each group were used for training and testing, using the leave-one-out strategy and a multi-class SVM. The EEG system accuracy is 56%. The confusion matrix is shown in Table 3. Similar to the audio-based system, the EEG-based system shows high confusion between the high and medium performance groups. This motivated us to combine these two groups into one group (high) and re-train the two systems as two-class SVMs (high vs. low). The audio-based system performance increased to 83% and the EEG-based system performance to 78%.

5. CONCLUSION
In this paper, two English proficiency level estimation systems were proposed. One system uses audio features extracted from speech recordings and the other uses features extracted from EEG signals. In the audio system, each utterance is represented by 72 audio features, whereas in the EEG-based system, each EEG recording is represented by 112 features. For this purpose, the English speech of 142 volunteers was recorded, and EEG signals of 58 of them were recorded during English speech using an Emotiv EPOC headset. With the help of an English expert, each participant was evaluated and categorized into one of three English skill levels. With leave-one-out cross-validation, 20 audio recordings from each level were used for training and testing the audio-based system. Similarly, 10 EEG recordings from each skill level were used for training and testing the EEG-based system. The audio-based system outperformed the EEG-based system, with 68% accuracy compared with 56% for the EEG system.
In future work, more data will be recorded from more participants, specifically EEG data. The two sub-systems will be combined at the feature level, i.e., by concatenating the audio and EEG features into one feature vector. It will also be interesting to combine the sub-systems at the model level, i.e., to train a back-end classifier which combines the scores of the two sub-systems, and to compare its result with the feature-level concatenation and with each individual sub-system.