Feature extraction of electrocardiogram signal using machine learning classification

Received Oct 28, 2019. Revised Jun 4, 2020. Accepted Jun 16, 2020.

Person identification is an essential task in many fields of life. It supports the investigation of criminal activities and is used in various forensic applications such as surveillance. For biometric recognition, iris, face, voice and fingerprint are subject to limited fabrication, and from them a decision regarding the liveness of the subject can be drawn. The aim of this approach is to construct an ECG-based biometric recognition system that processes the raw ECG signal. The process is supported by different filters for noise elimination, and the characteristic ECG waves undergo time-domain analysis. Based on this analysis, an efficient feature extraction model is developed in which the best P-QRS-T signal fragments are selected and the positions of the fragmented signals are normalized according to the priorities of their positions. The domain features are calculated 72 times. The system checks the train and test data sets and matches the feature vectors against each individual signal separately. The performance and utility of the system are analyzed, and the feature vectors are examined by different machine learning classification algorithms. Leading algorithms such as K-nearest neighbor, artificial neural network and support vector machine are used to classify the ECG features, and the system is tested on a standard cardiac database, the MIT-BIH ECG-ID database.


INTRODUCTION
The ECG signal is used as a reliable and efficient record in medical applications. ECG is a biometric tool that has long been used to identify different cardiac diseases. The geometric and physiological characteristics of the heart produce signals that are unique to each subject [1]. The ECG signal of every individual shows unique patterns and characteristics: the heart condition of a patient is judged from the outline of the waveform, and the function and rhythm of the heart provide useful information for the detection of heart disease [2,3]. Each cardiac cycle of the ECG signal generates the characteristic P-QRS-T wave. The features of the signal are categorized as fiducial and non-fiducial. In the study of ECG, the most distinctive element of the wave is the R-peak, so the other fiducial points are identified by taking the location of the R-peak as the reference [4]. Figure 1 shows a standard ECG signal, in which different features are defined in terms of amplitude, distance, time, angle, slope and related quantities that support a fiducial-based feature representation. The fiducial-based approach is important for the design of a biometric template because it uses the amplitude or temporal differences between successive fiducial points [5,6].

FEATURE EXTRACTION ON ECG DATA
Feature extraction from the ECG signal follows several steps. The methodology comprises pre-processing of the data, logical and effective feature extraction, and classification of the ECG data from the extracted features. The ECG data must first be collected for pre-processing, and classification requires P-QRS-T detection and feature extraction. Figure 2 illustrates the whole process [7]. The first steps are used for pre-processing and cleaning of the data.

ECG signal loading and data acquisition
The ECG data is available from PhysioBank, which contains more than 90 thousand ECG recordings of individual patients. In this paper, the raw ECG data used for feature extraction and classification is the MIT-BIH Arrhythmia database, which is available in PhysioNet [8]. The ECG recordings contain beat-by-beat annotations and amount to more than 200 hours of recording, publicly available in PhysioNet for research and commercial purposes [9,10].

Pre-processing of data
Raw ECG is noisy and carries distortion from different signal sources. During acquisition of the ECG recording, interference is generated by various frequency components. This interference adds noise to the ECG signal, and the unwanted data may corrupt the original signal and lead to false results [11,12].
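As an illustration of one such noise-removal step, the following minimal sketch subtracts a slowly varying baseline estimated with a moving average. The paper does not name the filters it uses, so the choice of filter, the function names `moving_average` and `remove_baseline`, and the 0.6 s window are all assumptions made for this example.

```python
def moving_average(signal, width):
    """Centered moving average; the window shrinks near the signal edges."""
    half = width // 2
    smoothed = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        smoothed.append(sum(signal[lo:hi]) / (hi - lo))
    return smoothed


def remove_baseline(signal, fs, window_s=0.6):
    """Subtract a moving-average baseline estimate from the signal.

    fs is the sampling frequency in Hz; window_s (assumed value) is the
    averaging window in seconds.
    """
    width = max(1, int(window_s * fs))
    baseline = moving_average(signal, width)
    return [x - b for x, b in zip(signal, baseline)]


# Usage: a slow linear drift is removed away from the signal edges.
drifting = [0.01 * i for i in range(200)]
cleaned = remove_baseline(drifting, fs=100)
```

A window longer than one QRS complex but shorter than the drift period keeps the sharp peaks largely intact while flattening slow baseline movement.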

Detection of wave characteristics (wavelet transform approach)
The characteristic waves (P, Q, R, S and T) are analyzed here using frequency-based and time-based approaches. The method primarily works with detection of the R-peak using wavelet decomposition. The wavelet transform is used to obtain frequency and time information simultaneously and at high resolution. It operates by compressing and dilating the wavelet: the high-frequency components are extracted with a fine wavelet and the lower-frequency components with a stretched wavelet [13]. The wavelet transform correlates the signal $s(t)$ with a family of time-frequency components derived from $\psi(t)$. This generates a coefficient set $W(p,q)$ given by

$$W(p,q) = \frac{1}{\sqrt{p}} \int_{-\infty}^{\infty} s(t)\, \psi^{*}\!\left(\frac{t-q}{p}\right) dt$$

where $q$ is the translation parameter (time location) and $p$ is the scale factor. The scale factor is inversely proportional to the frequency. The complex conjugate is denoted by $*$ and the mother wavelet, which acts as the analyzing wavelet, by $\psi(t)$. The discrete wavelet transform (DWT) gives a time-scale representation of a digital signal and is implemented with digital filtering techniques. The DWT provides fast computation of the wavelet transform and is relatively easy to implement, using translation techniques that minimize the computation time. With scale index $j$ and translation index $k$, the DWT can be represented as

$$W(j,k) = \sum_{n} s(n)\, \psi^{*}_{j,k}(n)$$

where the translated wavelet on the dyadic scale is

$$\psi_{j,k}(n) = 2^{-j/2}\, \psi\!\left(2^{-j} n - k\right)$$

Popular orthogonal wavelet families such as Symlets, Daubechies, discrete Meyer and Coiflets are used for reconstruction of the wave.
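To make the decomposition and reconstruction idea concrete, here is a minimal sketch of a single DWT level using the Haar wavelet, the simplest member of the Daubechies family. The paper does not state which wavelet or how many levels it uses for R-peak detection, so the Haar choice and the function names `haar_dwt` and `haar_idwt` are assumptions for illustration only.

```python
import math


def haar_dwt(signal):
    """One level of the Haar DWT for an even-length signal.

    Returns (approximation, detail) coefficient lists, each half as long
    as the input.
    """
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    scale = 1 / math.sqrt(2)
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approx = [(a + b) * scale for a, b in pairs]   # low-frequency part
    detail = [(a - b) * scale for a, b in pairs]   # high-frequency part
    return approx, detail


def haar_idwt(approx, detail):
    """Invert haar_dwt, rebuilding the original samples."""
    scale = 1 / math.sqrt(2)
    signal = []
    for a, d in zip(approx, detail):
        signal.extend([(a + d) * scale, (a - d) * scale])
    return signal


# Usage: decompose a short segment and rebuild it.
segment = [0.0, 0.1, 0.9, 0.2, 0.0, 0.0]
approx, detail = haar_dwt(segment)
rebuilt = haar_idwt(approx, detail)
```

The detail coefficients are large where neighbouring samples differ sharply, which is why wavelet detail bands are a natural place to look for the steep slopes around the R-peak.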

Analysis of time domain
The frequency-domain approach is used for detection of the R-peak, and the time-domain approach is used for the other characteristic waves.
Here $f_s$ represents the sampling frequency and $R_{location}(j)$ is the location of the $j$-th R-peak.
For recognizing the P-wave, the time-domain window is limited to the span between 65% and 95% of the R-R interval, and the P-wave is the maximum value within that window:

$$P = \max_{u \le n \le v} s(n)$$

where $s(n)$ is the sampled ECG signal and $u$ and $v$ are the bounds of the time-domain window.
To identify the Q-wave, the maximum value of the time-based window is chosen; in the ideal case it lies on average 20 ms from the related R-peak. In the same way, the S-peak is recognized by choosing the minimum value of a time-based window immediately after the R-peak.
So for the S-wave, $u = R_{location}(j)$ and $v = R_{location}(j) - \lfloor 0.07 \times f_s \rfloor$. To identify the T-wave, a time-domain window is created whose limits lie between 15% and 55% of the R-R interval.
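The percentage-based windows for the P-wave and T-wave can be sketched as follows. This is only an illustration: it assumes that the percentages are measured forward from R-peak $j$ towards the next R-peak, that both waves are taken as the largest sample inside their window, and that R-peak locations are given as sample indices. The names `window_peak` and `find_p_and_t` are hypothetical.

```python
def window_peak(signal, r_locations, j, lower, upper):
    """Index of the largest sample between the lower and upper fractions
    of the R-R interval that starts at R-peak j."""
    rr = r_locations[j + 1] - r_locations[j]   # R-R interval in samples
    u = r_locations[j] + round(lower * rr)     # window start
    v = r_locations[j] + round(upper * rr)     # window end (inclusive)
    return max(range(u, v + 1), key=lambda n: signal[n])


def find_p_and_t(signal, r_locations, j):
    """Return (P index, T index) for the interval after R-peak j."""
    p_index = window_peak(signal, r_locations, j, 0.65, 0.95)
    t_index = window_peak(signal, r_locations, j, 0.15, 0.55)
    return p_index, t_index


# Usage on a synthetic beat: R-peaks at samples 10 and 110,
# a T-like bump at sample 40 and a P-like bump at sample 90.
beat = [0.0] * 130
beat[10] = beat[110] = 1.0
beat[40] = 0.3
beat[90] = 0.2
p_index, t_index = find_p_and_t(beat, [10, 110], 0)
```

Because both window bounds scale with the R-R interval, the search region stretches or shrinks with the heart rate instead of using a fixed number of samples.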
The time-domain windows are adaptive, as they depend entirely on the R-R interval values [14,15]. Figure 3 shows the detected peaks. From the PQRST fragments obtained, the best ones are chosen: the Euclidean distance of each fragment's peaks from the mean is computed, and the fragments with the minimum distance are selected. The best fragments that satisfy this homogeneity criterion are eligible for data extraction [16].
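The selection by minimum Euclidean distance from the mean can be sketched as below. The sketch assumes each fragment is represented as an equal-length list of numbers (for example, its peak amplitudes), which the paper does not specify; the function name `select_best_fragments` and the `keep` parameter are hypothetical.

```python
import math


def select_best_fragments(fragments, keep=6):
    """Return the indices of the `keep` fragments closest to the mean
    fragment, together with the mean itself."""
    length = len(fragments[0])
    mean = [sum(f[i] for f in fragments) / len(fragments)
            for i in range(length)]

    def distance(fragment):
        # Euclidean distance between one fragment and the mean fragment
        return math.sqrt(sum((x - m) ** 2 for x, m in zip(fragment, mean)))

    ranked = sorted(range(len(fragments)),
                    key=lambda k: distance(fragments[k]))
    return sorted(ranked[:keep]), mean


# Usage: the outlying third fragment is dropped when keeping three of four.
candidates = [[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [0.9, 1.1]]
chosen, mean_fragment = select_best_fragments(candidates, keep=3)
```

Ranking against the mean discards fragments distorted by noise or ectopic beats, so the retained fragments are mutually similar.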

Efficient feature extraction
The features are extracted from the data set generated from the time and amplitude of the ECG signal. The nature of the features describes the cardiac condition of a patient. The features are computed after the peaks (P, Q, R, S) are detected, and the QRS, ST and QT intervals are calculated as well. Of all the PQRST fragments studied in Figure 3, the best six are chosen. The selection is based on the variation of the peaks, and the fragments with lower variation are taken for further analysis. The six selected PQRST fragments of the electrocardiogram signal are divided into standalone pulses so that the information of the individual pulses can be compared. Every pulse is split over the range from the P peak to the T peak. Figure 3 shows the PQRST beats and a graphical representation of the data from which the signals can be compared. Before the data is extracted, all pulses are aligned to a common point so that the position is normalized across beats. The features of the data can then be analyzed by comparing the signals. The time durations and amplitudes are calculated taking the R-peak as the origin of the analysis [17,18].
Table 1 describes the various features of the ECG signal. In this work a total of 72 features are extracted from 7 separate pulses, using the best six PQRST fragments. The features cover time, amplitude, angle, slope, time difference and amplitude difference.
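The kinds of feature listed above can be illustrated with a small sketch that measures each peak relative to the R-peak. It is not the feature set of Table 1: the feature names, the input layout (a dictionary of peak name to sample index and amplitude) and the function name `fiducial_features` are assumptions chosen for the example.

```python
import math


def fiducial_features(peaks, fs):
    """Compute time, amplitude, slope and angle of each peak relative to R.

    peaks maps a peak name to (sample_index, amplitude) and must contain "R";
    fs is the sampling frequency in Hz.
    """
    r_index, r_amplitude = peaks["R"]
    features = {}
    for name, (index, amplitude) in peaks.items():
        if name == "R":
            continue
        dt = (index - r_index) / fs        # time gap from R in seconds
        da = amplitude - r_amplitude       # amplitude gap from R
        features["time_R" + name] = dt
        features["amp_R" + name] = da
        features["slope_R" + name] = da / dt
        features["angle_R" + name] = math.degrees(math.atan2(da, dt))
    return features


# Usage with made-up peak positions for one pulse sampled at 500 Hz.
pulse_peaks = {"P": (20, 0.15), "Q": (90, -0.10), "R": (100, 1.00),
               "S": (110, -0.25), "T": (200, 0.30)}
features = fiducial_features(pulse_peaks, fs=500)
```

Taking the R-peak as the origin makes the features comparable across pulses, since every time and amplitude value is expressed relative to the same reference point.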

CLASSIFICATION OF THE EXTRACTED FEATURES
The classification is based on the extracted features. The artificial neural network (ANN) plays an important role in classifying the extracted features. The classifiers are built from the training set data. For pattern recognition, artificial neural network (ANN), K-nearest neighbor (KNN) and support vector machine (SVM) classifiers are used to implement the proposed methodology [19].
In this work, the standard MIT-BIH ECG-ID database has been used. The Massachusetts Institute of Technology-Beth Israel Hospital database consists of 48 records of about 30 minutes each, sampled at 360 Hz. Each record comprises two signals: the first is the modified lead II (MLII) and the second is V1, V2, V3, V4 or V5, depending on the record. Following the inter-patient paradigm, the database was divided into 2 datasets, each containing 22 records [20].
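Of the three classifiers named above, K-nearest neighbor is the simplest to sketch. The example below is a generic KNN with Euclidean distance and majority voting; the value of `k`, the distance measure and the function name `knn_predict` are assumptions, since the paper does not report its KNN settings.

```python
import math
from collections import Counter


def knn_predict(train_vectors, train_labels, query, k=3):
    """Label a query feature vector by majority vote of its k nearest
    training vectors (Euclidean distance)."""
    neighbours = sorted(
        (math.dist(vector, query), label)
        for vector, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


# Usage with toy two-dimensional feature vectors for two subjects.
train = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
labels = ["H1", "H1", "H2", "H2"]
predicted = knn_predict(train, labels, [0.2, 0.1])
```

In practice each vector would hold the extracted ECG features of one pulse, and the label would be the subject identifier.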

Classification based on ANN
The ANN can be described as a statistical learning model in the field of machine learning that is directly inspired by biological neural networks. The neural network first has to be trained with the electrocardiogram data of each individual, after which biometric identification is performed by the network built from the training data set [21,22]. In this work the MIT-BIH ECG-ID database is used, and its records serve for both training and testing. An advantage of the database is that it contains more than one record per subject. In this experiment 5 records are taken for every individual patient; the first 2 records are used as the training set and the remaining 3 records as the test set. The experiment uses 16 subjects, each with 5 individual records. The records are taken from the MIT-BIH ECG-ID database and, for convenience, a record such as Person_09/rec_2 is renamed H9. Here 70 features are used with a two-layer feed-forward network. With sigmoid hidden neurons and softmax output neurons, the network can classify the vectors provided the hidden layer has enough neurons. The classification works with 25 hidden neurons, from which 20 output targets are chosen. Table 2 shows the confusion matrix for true classification of the ECG pulse data. For each ECG record, 6 PQRST fragments and one mean value are taken to prepare the confusion matrix, so that a total of 7 sets of pulse data represent the ECG record. In the recognition phase, the pattern recognition network built from the training set is applied to the test data, and the features of a person are matched against the sequence of the training data.

Int J Elec & Comp Eng, ISSN: 2088-8708. Feature extraction of electrocardiogram signal using machine learning classification (Sumanta Kuila). 6603
The confusion matrix, also known as the probabilistic matrix, is generated by the entire simulation. For a particular subject (the person), the class with the highest probability produced by the simulation is taken, and the features of that person are reported in the confusion matrix [23,24]. The true positive rate (TPR) of the overall true classification is found to be 71.18%. True identification of the ECG pulse data is presented in Table 3. The misclassified pulses account for the difference between true classification and true identification [25,26]. At the classification stage, several pulses of a specific set may be misclassified. For each subject these misclassified pulses are analyzed, and the class that receives the largest number of the 7 pulses is taken as the identification [27]. From the confusion matrix generated by testing, the TPR of true identification is 81.75%, which is 9.95% higher than that of true classification.
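The difference between true classification (per pulse) and true identification (per record) can be illustrated with a short sketch. It assumes that the overall TPR is the share of correctly classified pulses in the confusion matrix and that a record is identified by the most frequent label among its 7 pulses; the function names and the example labels are hypothetical.

```python
from collections import Counter


def true_positive_rate(confusion):
    """Overall rate of correct decisions in a square confusion matrix,
    where confusion[i][j] counts pulses of subject i labelled as j."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


def identify_record(pulse_predictions):
    """Identify a record by the most frequent label among its pulses."""
    return Counter(pulse_predictions).most_common(1)[0][0]


# Usage: 3 of the 7 pulses are misclassified, yet the record is still
# attributed to the subject that receives the most pulses.
pulses = ["H9", "H9", "H3", "H9", "H5", "H9", "H3"]
identified = identify_record(pulses)
```

Because a record can tolerate a few misclassified pulses and still be assigned to the right subject, the identification rate can exceed the per-pulse classification rate.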

Overall result of the system
The experimental results are shown in Table 4, which lists the results for the performance parameters. The values of the performance parameters reflect the characteristics of the ECG data samples used.

CONCLUSION
The paper proposes a systematic methodology for an ECG-based biometric recognition system. The proposed methodology comprises ECG data acquisition, pre-processing of the data, detection of P-QRS-T, cardiac cycle classification and feature extraction. The ECG signal is universal in humans, and analyzing it makes unique identification of a person possible. Different machine learning classifiers, namely ANN, SVM and KNN, are used to test the performance of the proposed system. For every ECG signal, the 6 best P-QRS-T portions are chosen by their minimum Euclidean distance from the mean, with the relevant peaks and their positions normalized according to the R-peak position of the signal. Only these 6 fragments were considered for data extraction in each record of the MIT-BIH ECG-ID database. The obtained results strongly support the use of ECG as an important biometric feature.