AlertNet: Deep convolutional-recurrent neural network model for driving alertness detection

ABSTRACT


INTRODUCTION
Physiological signals such as EEG, ECG, EMG, and EOG are proven measures for analyzing and detecting abnormalities in clinical diagnosis, and they have recently found use in many other applications. Driver alertness detection is one such application. Among the various physiological signals, the electroencephalogram (EEG), which varies in its frequency and time-invariant features, is found to be a direct indicator of the driver's alertness level. EEG signals are collected from different locations of the scalp with electrodes placed according to the standard 10-20 system [1]. The different frequency components, together with the related amplitude levels observed over time, represent the condition of the brain [2]. There are five stages of sleep, in each of which the brain produces distinguishable electrical patterns that help in classifying the stages. Polysomnography (PSG) signals are collected from a subject during an entire night of sleep and are manually scored into sleep stages by sleep experts, who visually analyze the signals over a specific time frame [3]. The criteria for sleep stage scoring were proposed in the Rechtschaffen and Kales (RK) manual [4], which was further developed by the [...]
[...] LSTMs. The encoder captures the long- and short-term context dependencies between the inputs and the target classes. The input to the encoder is the time-series features obtained from the ResNets. The encoder captures the non-linear time-series dependencies that are used to detect the targets. The encoded sequence is fed to the attention network and then decoded to detect the category. Next, we discuss each module in detail.

The ResNets
ResNet, a residual neural network, is a deep neural network that uses shortcuts, called skip connections, to jump over some layers [18]. The basic ResNet block is shown in Figure 2. ResNet was developed to avoid the degradation problem encountered in deeper neural networks: as the depth of a network increases, accuracy saturates and then degrades rapidly. The ResNet model is implemented with skips over two or three layers, with ReLU activations and BatchNorm in between. This helps to avoid vanishing gradients, because the network reuses activations from the previous layer until the adjacent layer learns its weights. In the simplest case only the weights of the adjacent layer are adapted, which works best when a single nonlinear layer is skipped or when the intermediate layers are linear. In the early training stages the skip connections effectively use fewer layers, which simplifies the network. Learning is thus faster, because the impact of vanishing gradients is reduced. In the later part of training, the network restores the skipped layers as it learns the feature space. Towards the end, when all layers are expanded, the network stays close to the manifold and learns faster. We use this functionality of ResNets to capture frequency information; the residual connections help to maintain features from the previous layer. These features are then fed to the RNN model for classification. The ResNets learn complex features that help in classification. In Figure 2, the shortcut can be used directly if the dimensions of the input and the output are the same, as given by expression (1):
$$y = F(x, \{W_i\}) + x \tag{1}$$
where $x$ and $y$ are the input and output vectors of the layers considered for the skip connection, and $F(x, \{W_i\})$ is the function representing the residual mapping. If the residual block has two layers, $F$ can be written as $F = W_2\,\sigma(W_1 x)$, with $\sigma$ as ReLU and biases neglected.
The operation $F + x$ is performed by the shortcut connection and element-wise addition. If the dimensions of the input and output are not the same, zero padding is applied and the shortcut is used to match the dimensions using the following formula:
$$y = F(x, \{W_i\}) + W_s x \tag{2}$$
If the dimensions of $x$ and $F$ match, (1) is used; when the dimensions change, (2) is used. The term $W_s$ in (2) represents the linear projection performed by the shortcut or skip connection. The ResNet is used for feature extraction in our model. The identity shortcut connection of the residual network is important for retaining features across consecutive layers. The features required to identify the three classes can be obtained by feature extraction with the ResNet. The ResNets do not require two separate filters to extract temporal and frequency-based features; instead, the ResNet helps to retain features across consecutive layers. A simple notion is that learning more features requires more layers, obtained by simply stacking them. This can cause the vanishing gradient problem, because back-propagation then multiplies the gradient repeatedly across the added layers. Due to this repeated multiplication, the gradient becomes vanishingly small and saturates.
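The two shortcut cases referred to as (1) and (2) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: dense matrices stand in for the convolutional layers, BatchNorm is omitted, and the names `residual_block`, `w1`, `w2`, and `ws` are hypothetical rather than taken from the authors' implementation.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, w1, w2, ws=None):
    """Two-layer residual block: y = F(x) + shortcut(x), with F = W2 relu(W1 x).

    ws=None uses the identity shortcut of (1); a matrix ws applies the
    linear projection W_s of (2) when input and output sizes differ.
    """
    f = w2 @ relu(w1 @ x)                    # residual mapping, biases neglected
    shortcut = x if ws is None else ws @ x
    if shortcut.shape != f.shape:
        raise ValueError("shortcut and residual shapes differ; pass ws")
    return f + shortcut                      # element-wise addition

rng = np.random.default_rng(0)
x = rng.normal(size=4)

# Same input and output size: identity shortcut, as in (1).
y_same = residual_block(x, rng.normal(size=(8, 4)), rng.normal(size=(4, 8)))

# Output size 6 differs from input size 4: projection shortcut, as in (2).
y_proj = residual_block(x, rng.normal(size=(8, 4)), rng.normal(size=(6, 8)),
                        ws=rng.normal(size=(6, 4)))
```

When both weight matrices are zero the block returns its input unchanged, which is the sense in which the shortcut retains features from the previous layer.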
The EEG signal is fed to the neural network as an array. 1 × 1 (1D) convolutions are carried out throughout the network for feature extraction. The filter sizes start from 64 and go up to 512. Because the skip connections in the ResNet model retain features from previous layers, zero padding is applied whenever the input filter_size and the output filter_size differ; this is represented as filter size/2, indicating the change in filter_size. ResNets help to retain the features and reduce the overfitting caused by the use of a fully connected layer. Also, no max-pooling layer is used, because a global average pooling (GAP) layer is used instead. The details of the ResNet architecture are shown in Figure 3.

Bi-LSTM based sequence to sequence models with attention
The sequence-to-sequence method is a deep learning approach that uses an encoder-decoder machine translation technique to translate a given input sequence into an alternative output sequence with a tag and attention weights. It uses two recurrent neural networks (RNNs) that work together to predict the next output sequence from the previous input sequence with a special token. Sequence models suit cases in which the next state must be predicted from the previous state, such as predicting that a driver is becoming drowsy, which depends on the past behavior of the EEG signal. In conventional neural networks all inputs and the corresponding outputs are independent; we instead choose bi-directional LSTMs, in which the next state or output is predicted from the current and past inputs. The Bi-LSTMs, shown in Figure 4, consist of two RNNs that remember the previous state and use that information to predict the next state. LSTM units have a rich internal structure. Their various "gates" determine the propagation of information and can choose to "remember" or "forget" information. Compared to traditional unidirectional RNNs, whose operation depends only on the previous input state, bi-directional LSTMs process data in the forward and backward directions simultaneously. Hence, bi-directional LSTMs remember both past and future data points: the inputs run in both directions, one from past to future and one from future to past, using two hidden states. In a bidirectional LSTM, a replica of the first recurrent layer is created; the input is given to the first layer in the normal time order, t = 1, ..., T, while the input reversed in time, t = T, ..., 1, is provided to the second or backward layer [19]. The output is computed as the weighted sum of the two layers. This is represented as (3)-(5).
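Equations (3)-(5) are not reproduced in this text. As a reference point only, a commonly used formulation of a bidirectional recurrent layer that matches the symbols defined in the next paragraph is written out below. The weight matrices $W$, the output bias $b_y$, and the hidden-layer function $\mathcal{H}$ (here standing for the LSTM cell update) are assumed notation, not necessarily the authors' exact form.

```latex
\begin{align}
\overrightarrow{h}_t &= \mathcal{H}\left(W_{x\overrightarrow{h}}\,x_t
    + W_{\overrightarrow{h}\overrightarrow{h}}\,\overrightarrow{h}_{t-1}
    + \overrightarrow{b}\right) \\
\overleftarrow{h}_t &= \mathcal{H}\left(W_{x\overleftarrow{h}}\,x_t
    + W_{\overleftarrow{h}\overleftarrow{h}}\,\overleftarrow{h}_{t+1}
    + \overleftarrow{b}\right) \\
y_t &= W_{\overrightarrow{h}y}\,\overrightarrow{h}_t
    + W_{\overleftarrow{h}y}\,\overleftarrow{h}_t + b_y
\end{align}
```

The forward state depends on the state at $t-1$ and the backward state on the state at $t+1$, so $y_t$ combines past and future context as a weighted sum of the two layers.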
The hidden state and bias of the forward network are represented as ($\overrightarrow{h}$, $\overrightarrow{b}$); the hidden state and bias of the backward network are represented as ($\overleftarrow{h}$, $\overleftarrow{b}$); $x_t$ and $y_t$ are the input and the output of the Bi-LSTM, respectively. The sequence-to-sequence model used in our work consists of an encoder and a decoder built with LSTMs. The encoder takes one input sequence at a time in the form of a vector representation, and the decoder estimates the class for each 30-s input sequence. The long short-term memory units of the encoder capture the context dependencies between the input and the output target. The decoder then uses the information in the hidden states and predicts the output with the help of softmax [20]. Since there are three classes to be classified, the length of the encoded vector is three: e1, e2, and e3.

The attention network
The encoded sequence of every epoch is further used to obtain the target sequence with the attention network, which is part of the decoder. The decoder is also built using LSTMs. In a standard decoder, for every sequence of inputs, the decoder generates a new representation of the input sequence along with a target input element. The last input to arrive has the most recent effect on the encoder's hidden state, so the model tends to be biased towards the last element. The attention mechanism addresses this problem. At each decoding step, the attention network learns to attend to different portions of the encoder's output sequence while still considering the entire encoder representation. Hence the decoder attends only to the significant parts of the input sequence during decoding. Without the attention mechanism, the decoder relies only on the hidden vector of the decoder's Bi-LSTM. The sequence-to-sequence model with attention is more effective because it combines the encoder's representation with the decoder's hidden vector through the context or attention vector, represented as $c_t$. Attention weights are computed from a function f(.) before the attention vector $c_t$ is computed. The context or attention vector $c_t$ is the sum of the probabilities $\alpha_i$, each relating to the significance of a hidden state, multiplied by the corresponding hidden state $e_i$:
$$c_t = \sum_{i=1}^{n} \alpha_i\, e_i$$
where $\alpha_i$ is the significance of part $i$ of the hidden state. The function f(.) is a combination of the encoder's hidden state ($e_i$) and the decoder's hidden state ($h_{t-1}$), followed by a tanh layer. The output of f(.) is then given to the softmax module to calculate $\alpha_i$ for the $n$ parts. Finally, the attention module computes $c_t$ as the weighted sum of all $e_i$ vectors with weights $\alpha_i$. Hence, while decoding, the model can focus only on the important regions of the input vector sequence.
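One decoding step of the attention computation described above can be sketched as follows. The exact form of f(.) is not given in the text, so this sketch assumes an additive score: both states are projected, passed through tanh, and reduced to one score per encoder state by a vector `v`. The parameter names `w_e`, `w_h`, and `v` and all dimensions are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_context(enc_states, h_prev, w_e, w_h, v):
    """Return (alpha, c_t) for one decoding step.

    enc_states: (n, d_e) encoder hidden states e_1..e_n
    h_prev:     (d_h,)   previous decoder hidden state h_{t-1}
    w_e, w_h, v: assumed learned parameters of the scoring function f
    """
    # f(h_{t-1}, e_i): combine both states, apply tanh, reduce to one score each
    scores = np.tanh(enc_states @ w_e.T + h_prev @ w_h.T) @ v   # shape (n,)
    alpha = softmax(scores)          # significance of each encoder state
    c_t = alpha @ enc_states         # weighted sum of the e_i
    return alpha, c_t

rng = np.random.default_rng(0)
n, d_e, d_h, d_a = 3, 5, 4, 6
enc = rng.normal(size=(n, d_e))
alpha, c_t = attention_context(enc, rng.normal(size=d_h),
                               rng.normal(size=(d_a, d_e)),
                               rng.normal(size=(d_a, d_h)),
                               rng.normal(size=d_a))
```

The weights `alpha` are positive and sum to one, so `c_t` is a convex combination of the encoder states in which the larger weights mark the regions the decoder attends to.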

IMPLEMENTATION
This section describes the implementation of the algorithm: the data set preparation, the training procedure, the loss calculation, and the evaluation of the model using various metrics.

The data set preparation
The datasets used for this study are the public Sleep-EDF datasets, versions 2013 and 2018, which consist of 61 and 197 polysomnograms (PSGs) respectively. Table 1 shows the data corresponding to the different sleep classes. We consider the data from the Fpz-Cz and Pz-Oz EEG channels for all our analysis. The data set does not have an equal distribution of sleep classes: stage W and the other sleep stages have more samples than the N1 stage. Such a class imbalance problem is better addressed with deep learning methods than with conventional machine learning techniques. The loss calculation method used in this paper also helps in dealing with the class imbalance problem. In addition, the data set is oversampled wherever required to balance the number of samples across the sleep stage classes.
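The text does not say which oversampling method is used, so the following is only a minimal sketch of one simple option, random oversampling with replacement, under that assumption. The function name `random_oversample` and the toy labels are hypothetical.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of the smaller classes until every
    class has as many samples as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for y, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_samples.extend(items + extra)     # originals plus duplicates
        out_labels.extend([y] * target)
    return out_samples, out_labels

# Toy epoch ids with an under-represented N1 class.
labels = ["W"] * 6 + ["N1"] * 2 + ["other"] * 4
samples = list(range(len(labels)))
bal_samples, bal_labels = random_oversample(samples, labels)
counts = Counter(bal_labels)
```

Oversampling of this kind is normally applied to the training portion only, so that duplicated epochs do not leak into the test data.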

Training procedure with optimizing parameters and hyper-parameters
We feed 30-s epochs to the ResNet to extract the frequency components related to the sleep stages; the ResNet is in turn connected to the sequence-to-sequence model.
The model is evaluated using k-fold cross-validation. The Sleep-EDF 2013 dataset is trained with k set to 20, and the Sleep-EDF 2018 dataset with k set to 10. Cross-validation is a resampling procedure for evaluating machine learning models. Its single parameter, k, denotes the number of groups into which the available dataset is split. For each fold, one group is held out for testing and the rest are used for training. Finally, the evaluation results of all folds are combined. Cross-validation is applied to check the behavior of the model on unseen data: it verifies whether an increase in accuracy on the training data also leads to an increase in accuracy on data the network has not seen before, which guards against overfitting. This method gives a less biased and less optimistic estimate than a simple train/test split.
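The fold construction can be sketched as follows. Splitting by subject rather than by epoch is an assumption made here so that training and test epochs never come from the same person; the subject identifiers and the helper name `kfold_splits` are hypothetical.

```python
import random

def kfold_splits(items, k, seed=0):
    """Yield (train, test) lists; each item lands in exactly one test fold."""
    items = list(items)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]      # k nearly equal groups
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

subjects = [f"subject_{i:02d}" for i in range(20)]   # hypothetical ids
splits = list(kfold_splits(subjects, k=10))
```

Each of the ten splits holds out two subjects for testing and trains on the remaining eighteen; per-fold results would then be pooled into one confusion matrix.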
The network is trained for 120 epochs with RMSProp as the optimizer. RMSProp is similar to gradient descent, but it restricts oscillations in the vertical direction, which permits a larger learning rate and lets the model move faster in the horizontal direction towards convergence. Mini-batches of size 20 are used, with a learning rate of 0.00001 and an L2 regularization coefficient of 0.001 to minimize overfitting.
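For reference, a single RMSProp update with L2 regularization can be sketched as below. The learning rate and L2 coefficient defaults follow the values stated above; the decay rate 0.9 and the small constant `eps` are commonly used defaults assumed here, not values reported in the text.

```python
import math

def rmsprop_step(w, grad, cache, lr=1e-5, l2=1e-3, decay=0.9, eps=1e-8):
    """One RMSProp update on a list of weights with an L2 penalty.

    decay and eps are assumed defaults, not taken from the paper.
    """
    new_w, new_cache = [], []
    for wi, gi, ci in zip(w, grad, cache):
        gi = gi + l2 * wi                            # gradient of (l2/2) * w^2
        ci = decay * ci + (1.0 - decay) * gi * gi    # running mean of squared gradient
        new_w.append(wi - lr * gi / (math.sqrt(ci) + eps))
        new_cache.append(ci)
    return new_w, new_cache

# Toy problem: minimise f(w) = sum(w_i^2), whose gradient is 2 * w.
# A larger learning rate is used here only so the toy run moves visibly.
w, cache = [1.0, -2.0], [0.0, 0.0]
for _ in range(200):
    w, cache = rmsprop_step(w, [2 * wi for wi in w], cache, lr=1e-2)
```

Dividing by the running root-mean-square of the gradient damps directions with large, oscillating gradients and is what allows the step size to stay stable.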

Loss calculation
The Sleep-EDF dataset suffers from data imbalance; to reduce its effect we use MSE and MSFE for multiclass classification. The mean squared error (MSE) is a widely used loss function in deep learning models. It performs well when the classes are balanced but fails on an imbalanced dataset, because it averages the loss by summing the errors over the whole dataset. This estimates the error well only if the minority and majority classes have the same number of samples. When the dataset is imbalanced, the loss becomes biased towards the majority class, which contributes more to the loss than the minority class. The resulting loss captures the error of the majority class only. MSE can therefore be combined with the mean squared false error (MSFE), which first averages the error separately within each class and then adds the per-class averages.
Here, ci is the class label, Ci is the number of samples in class ci, N is the number of available classes, and l(ci) is the error calculated for class ci. With the help of MSE and MSFE, the loss of both the minority and the majority classes is taken into account [21,22].
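The effect of averaging per class can be illustrated with a small sketch. It implements only what the text states (average the error inside each class, then add the averages) using squared error against one-hot targets; the exact loss formula of the paper is not reproduced here, and the function names and toy predictions are hypothetical.

```python
def squared_error(label, probs, classes):
    """Squared error between predicted probabilities and the one-hot target."""
    return sum((p - (1.0 if c == label else 0.0)) ** 2
               for c, p in zip(classes, probs))

def per_class_loss(y_true, y_pred, classes):
    """Average the error inside each class, l(c_i), then add the averages."""
    total = 0.0
    for c in classes:
        errs = [squared_error(t, p, classes)
                for t, p in zip(y_true, y_pred) if t == c]
        if errs:                                  # skip classes with C_i = 0
            total += sum(errs) / len(errs)
    return total

def plain_mse(y_true, y_pred, classes):
    """Error averaged over all samples, regardless of class."""
    errs = [squared_error(t, p, classes) for t, p in zip(y_true, y_pred)]
    return sum(errs) / len(errs)

classes = ["W", "N1", "other"]
y_true = ["W", "W", "W", "N1"]
y_pred = [[1.0, 0.0, 0.0]] * 4        # a model that always predicts W

mse = plain_mse(y_true, y_pred, classes)            # 0.5
balanced = per_class_loss(y_true, y_pred, classes)  # 2.0
```

In this toy case the single misclassified N1 sample is diluted by the three correct W samples under plain MSE, while the per-class loss keeps its full error visible.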

Evaluation metrics
We evaluate the model using overall accuracy, recall, precision, Cohen's kappa, and F1-score. Overall accuracy is the ratio of the number of correct predictions to the total number of input data samples. Precision is the ratio of correctly classified samples (true positives) to the sum of true positives and samples wrongly classified as positive (false positives). It measures how many of the samples that the model predicts as positive are actually positive. Recall is the ratio of correctly classified samples to the sum of true positives and false negatives (samples wrongly classified as negative). It measures how many of the samples that are actually positive the model predicts as positive. Cohen's kappa is a statistic that measures inter-rater (or intra-rater) reliability for categorical (qualitative) items. It is preferred over a direct percent-agreement calculation because it accounts for the probability of agreement occurring by chance, and it is therefore expected to be a more robust measure [23,24]. The F1-score is the harmonic mean of precision and recall.
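All of these metrics can be derived from a confusion matrix, as the following sketch shows. The matrix values are made-up toy numbers, not results from this study, and the row/column orientation (rows are true classes, columns are predictions) is an assumption of the sketch.

```python
def metrics_from_confusion(cm):
    """cm[i][j] = number of samples of true class i predicted as class j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    row_sums = [sum(cm[i]) for i in range(n)]                        # actual counts
    col_sums = [sum(cm[i][j] for i in range(n)) for j in range(n)]   # predicted counts
    accuracy = sum(cm[i][i] for i in range(n)) / total
    per_class = []
    for i in range(n):
        tp = cm[i][i]
        precision = tp / col_sums[i] if col_sums[i] else 0.0
        recall = tp / row_sums[i] if row_sums[i] else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class.append({"precision": precision, "recall": recall, "f1": f1})
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_chance = sum(row_sums[i] * col_sums[i] for i in range(n)) / total ** 2
    kappa = (accuracy - p_chance) / (1 - p_chance)
    return accuracy, kappa, per_class

cm = [[8, 2, 0],
      [1, 6, 3],
      [0, 2, 8]]
accuracy, kappa, per_class = metrics_from_confusion(cm)
```

For this toy matrix the accuracy is 22/30 and the chance agreement is 1/3, giving a kappa of 0.6, which shows how kappa discounts the agreement expected by chance.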

RESULTS AND DISCUSSION
The data sets used are the 2013 and 2018 versions of Sleep-EDF described in the previous sections and summarized in Table 1. Tables 2 and 3 present the confusion matrices and the per-class performance for the 2013 and 2018 data respectively; each table includes the results for both the Fpz-Cz and the Pz-Oz EEG channels. According to the literature, a model can be evaluated in two ways. In the intra-subject paradigm, epochs from the same subject are used for both training and validation; in the inter-subject paradigm, epochs from different subjects are used for training and testing. Our study uses the second approach, so the training and testing epochs come from different subjects. True positives are on the main diagonal of each confusion matrix. In every column, the true positive count is higher than the other counts. The prediction performance parameters precision, F1-score, recall, and specificity are also shown in the tables. Performance is slightly lower for the N1 class than for the other classes, but its recall is convincing compared with the existing literature. The available literature mostly reports analyses of sleep stages that include the N1 stage, and in those studies the N1 results are not convincing compared with the other sleep classes. Our results show that the proposed model also works considerably better for the N1 sleep stage.
The performance is verified for both EEG channels and for both data sets. The performance of the model is improved for the following reasons: i) The stages alert, drowsy, and sleep are sequential in nature; every stage is a transition from, and is related to, the previous stage. Hence a sequence-to-sequence learning approach is a preferred choice; ii) The use of an attention decoder and Bi-LSTM has improved the performance; iii) The use of ResNets allows a deeper network without compromising on training error, and also allows temporal and frequency-domain features to be learned without extra layers; iv) The loss calculation procedure helps to deal with the class imbalance present in the data sets; v) A similar approach, with minimal changes, can be adapted to other classification problems that have sequential behavior and class imbalance.

CONCLUSION
The proposed deep neural network architecture for EEG-based driver alertness detection uses ResNets and a Bi-LSTM-based sequence-to-sequence learning approach. The ResNets are used for feature extraction, and their skip connections help to retain information from the alternate layers. This retains features without adding extra layers and lets the network go deeper with no increase in training error. The sequence-to-sequence model helps in learning the complex dependencies present in the EEG signal. The model's performance for the N1 sleep stage is better than that of existing models. Hence, the model can be used in future for automatic classification of sleep from raw EEG signals.