Electrocardiogram Classification Deep Learning Approach

Source: Deep Learning on Medium

Electrocardiogram Classification Deep Learning Approach

Overview: an example of the ECG classification system design (sorry for the low resolution)

In this paper, I will talk about an electrocardiogram (ECG) classification system I have built. The classification system takes a record and map it into a label. There are seven sections in this paper, to help you read I have marked some paragraphs that you can skip. My ECG classification system did a decent job on classifying the records, it gives an average F1 score of 81.2%. In the last section, I will also compare my work with others.


Section 1: data description

Section 2: data visualization and morphology

Section 3: evaluation metric

Section 4: ECG classification system

Section 5: result and data analysis

Section 6: system pros and cons

Section 7: models from the competition (compare with them)

Section 1: data description

The ECG dataset is from the 2018 China Psychology Competition, and it has nine categories (see below). There are 6877 records in this dataset. Each record is sampled at 500 Hz and has 12 channels/leads. In addition, each record also contains that individual’s gender and age, but I did not use them as inputs.

Nine categories:

  1. Normal (N)
  2. Atrial fabrillation (AV)
  3. First-degree AV block (I-AVB)
  4. Left bundle branch block (LBBB)
  5. Right bundle branch block (RBBB)
  6. Premature atrial contraction (PAC)
  7. Premature ventricular contraction (PVC)
  8. ST-segment elevation (STE)
  9. ST-segment depression (STD)

Data removal: few records belong to more than one category, and to make the study simpler, I removed them. There are 6397 records left.

After data removal, the data distribution is plotted below.

ECG data distribution

The table below shows the number of records in each category.

Number of records in each category

Section 2: data visualization and morphology

Before we go into our data set, I’d like to talk about PQRST-waves. If you are not interested in this part, you can skip the following paragraph and jump into “morphology of each category”.

heart’s electrical system. source from: www.healthlinkbc.ca
a cycle of heartbeat signal. source from: aclsmedicaltraining.com

(Skip). The starting point of a heartbeat is the sinoatrial (SA) node. The SA node is a natural peacemaker which initiates the heartbeat and determines the heart rate. The firing of the SA node spreads throughout both atriums and causes them to depolarize, the P-wave. The PR-segment is the duration between the end of the P-wave and the beginning of the Q-wave, it represents the time the signal travels from the SA node to the atrioventricular (AV) node. The AV node is the electrical gateway to the ventricles. It receives the signals from the SA node and passes them onto the AV bundle (bundle of HIS). AV bundle divides into right and left bundle branches. The depolarization of the interventricular septum initiates from the left bundle to the right which causes a negative deflection of the Q-wave. Then the electrical signal moves to both left and right ventricles at the same time but the signal to the left ventricle is dominate, so the ECG displays a big positive deflection at the R-wave. S-wave represents the last phase of ventricular depolarization. The atrial repolarization is obscured by the ventricular depolarization, so we cannot see it in the QRS-complex. The ST-segment represents the plateau in the myocardial action potential. Finally, the T-wave reflects the repolarization of the ventricles. Based on the morphologies of these waves, intervals, and segments, we can analyze ECGs.

Morphology in each category

For visualization, I will use Lead I (see figure below). Each ECG signal I will show in the following is from my dataset. For fast reading, you can skip the detailed explanation about the abnormality formations, you can directly read the bold words in each category section. For example, in “AF type section”, you can just read “the oscillations before the QRS-complexes”.

non-invasive 12-lead ECG. source from firstaidforfree.com

N type

Plot 1

In N type, the RR-intervals are consistent which indicates that the patient has a regular heart rate. There is no abnormal morphology of PQRST-waves.

AF type

Plot 2

AF is the most common type of cardiac arrhythmias. The start of the heartbeat is not SA node, instead of an ectopic site. The electrical impulses are generated randomly from many ectopic sites which causes the fibrillation of atriums. In Plot 2, the oscillations before the QRS-complexes reflect the electrical impulses of many ectopic sites. Sometimes those ectopic sites activate the AV node, so the ventricles depolarize (the QRS-complex). Therefore, the P-wave is missing. AF can cause inefficient contractions of the atriums which in turn forms blood clots. The blood clots pass into blood stream and get stuck in small arteries of the brain can cause a stroke.

IAVB type

Plot 3: time = (105 samples)x(1 second)/(500 samples)=0.21 seconds

The symptom of IAVB is the delayed AV node conduction which results prolonged PR-interval (>0.2s) or prolonged PR-segment. The majority IAVBs are benign and require no treatment. Plot 3 shows an example of IAVB heartbeats, the PR-interval has 105 samples. By calculating the time in the green rectangle, the PR-interval is about 0.21 seconds, so the PR-interval is greater than 0.2 seconds.

LBBB type

Normally the two ventricles depolarize simultaneously, then contract at the same time. In bundle branch blocks (BBB), the unaffected ventricle depolarizes first, then the other one depolarizes later. This results the delayed depolarization of the affected ventricle, so the morphologies of the QRS-complexes are abnormal.

Plot 4

For LBBB, we use Plot 4 to demonstrate the formation of the abnormal QRS-complexes. Abnormally, the depolarization of the septum initiates from the right to the left, therefore a positive deflection is showing in the beginning of the QRS complex. The right ventricle activates first which will cause a small downwards deflection. Then the electrical signal spreads to the large left ventricle which will cause another positive deflection, so a wide characteristic “bunny ears” QRS-complex. Sometimes, the small downwards deflection which is caused by the depolarization of the right ventricle is not visible, or the left ventricular depolarization has less amplitude than the septum depolarization.

RBBB type

Plot 5

Plot 5 is an example of RBBB. The depolarization of the septum initiates from the left to the right as normal. The left ventricle depolarizes normally which results a normal R-wave. Then the electrical impulse spreads to the right ventricle which results a large negative deflection and broader S-wave.

PAC type

Plot 6

PAC happens when there is an ectopic site firing in one of the atriums. There are several morphologies of PAC. Firstly, the P-wave can merge to the preceding T-wave due to the activation of the ectopic site. Secondly, the P-wave can be inverted, a negative deflection. Thirdly, the ectopic atrial activation can enter the SA node, depolarize it and reset its timing which changes the PP-interval. Plot 6 shows a merged PT-wave.

PVC type

Plot 7

PVC happens when there is an ectopic site firing prematurely in one of the ventricles. Because PVC does not affect the SA node, so the PP-interval remains unchanged. The morphology of QRS-complex is depending on the location of the ectopic site. The QRS complex can be broader, higher, or deeper. Plot 7 shows two negative deflections of the QRS complexes.

STE type

Plot 8

The morphology of STE is that the ST-segment is elevated by referring to the baseline. Plot 8 shows each ST-segment has an upwards slope and larger amplitude compared to the PR-segment.

STD type

Plot 9

The morphology of STD is that the ST-segment is depressed by referring to the baseline. Plot 9 shows each ST-segment has negative amplitudes, and they are below the baseline.

According to those nine morphologies, extracting the heartbeats from each record is necessary in the data preprocess, because each abnormal part is appearing in each heartbeat signal, except PAC (it involves PP-intervals, but in this study I will ignore this fact).

Section 3: evaluation metric

Because the data set is highly imbalanced, I will choose F1 score for the evaluation metric.

Equation 1

Equation 1 corresponds the F1 score for each category, where N_ij is the number of testing examples of category i predicted as category j. The numbers from 1 ~ 9 are corresponding to N, AF, I-AVB, LBBB, RBBB, PAC, PVC, STE, and STD.

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Equation 2

Where F_1 is the average F1 score.

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Equation 3

F_AF measures AF.

F_Block measures LBBB, RBBB and IAVB.

F_PC measures PAC and PVC.

F_ST measures STD and STE.

Section 4: ECG classification system

In this section, there are four subsections.

In section 4.1: dataset splitting based on the number of r-peaks

In section 4.2: data preprocess and data augmentation

In section 4.3: neural network design

In section 4.4: model training and testing

Section 4.1: dataset splitting based on the number of r-peaks

histogram: “the lengths of the records” versus “the number of the records”

The histogram above shows “the lengths of the records” versus “the number of the records”. We can see most records fall into the short length, which is around 5000, but there are still a good amount records distributed in different long lengths. For batch training, it is important to preprocess all the training data into a unified length. First of all, I will arrange my dataset.

I will split the dataset into training and testing based on the number of r-peaks. Thanks to my friend, Ahmed, who helped me develop the r-peak detection algorithm. I will not go into detail about this algorithm, but I will briefly talk about it. The r-peak detection algorithm is developed based on the wavelet transform. After reconstructing the signal, most noise is removed. By calculating the local energy with a slide window fashion, the peaks are enhanced. Because there are 12 channels, so we used voting method to find the most reliable peaks. Then there are some refinement work to remove the false peaks. An example is showing in Plot 10. In my study, I remove the first and last two r-peaks due to the segmentation process (those four peaks sometimes cause program error because the samples are not enough).

Plot 10

In each category, I want to have 80% data for training and 20% data for testing. By using r-peak detection algorithm, most records have less than 20 r-peaks, I decide to use 20 as the boundary to split the dataset. If a record has less than or equal to 20 r-peaks, the record is moved to the training dataset, otherwise it is moved to the testing dataset.

The above process cannot promise 80% training and 20% testing. After splitting the dataset, one category can have 87% training data and 13% testing data, or 70% training data and 30% testing data. The first case is easy to deal with where the 7% training data can just move to the 13% testing data, so it causes no problem during the data preprocess. However, the second case can cause a problem. If the 10% testing data moves to the 70% training data, I no longer know the number of the heartbeats I want to duplicate into a unified one. Therefore, I need to find the number of r-peaks which can be used as a reference to select some records from the testing dataset and move them to the training dataset. I go through the testing dataset in each category, and find out 40 r-peaks and 60 r-peaks are the best options. Therefore, I further move the records who have “less than or equal to 40 r-peaks” and “less than 60 r-peaks and greater than 40 r-peaks” from the testing dataset into the training dataset. Finally, for each category, the dataset split into 80% training and 20% testing. Table below shows the numbers of training and testing records.


Section 4.2: data preprocess and data augmentation

The key in this data preprocess is to extract heartbeats. It is easier to manipulate the dataset once heartbeats of each record are extracted. Figure below shows four steps of the data preprocess.

Step 1, by using the r-peak detection algorithm, the locations of the r-peaks are found.

Step 2, each heartbeat can be segmented or extracted by picking samples before and after the corresponding r-peak. In this paper, I will pick 119 samples before the r-peak and 180 samples after the r-peak, so there are 300 samples for each heartbeat. Because there are 12 leads, the shape of each extracted heartbeat is 300 by 12.

Step 3, the duplication process allows the neural network to train in batch manner. Here, I will skip this demonstration.

Step 4, we concatenate all the extracted heartbeats from each record. For each preprocessed data, it has shape 300n by 12 where n is the number of the heartbeats or R-peaks. The plot below refers to the one in step 2.

Data augmentation

After heartbeat extraction, it is much easier to manipulate the data. I can take advantages from the data preprocess to do data augmentation. Because the detected r-peak locations are always in order. For data augmentation, I can randomly change the orders of the r-peaks, this can help the neural network enhance learning the shapes of the heartbeats, instead of memorizing the locations of the abnormal heartbeats. The duplication step in the data preprocess also contributes to the data augmentation. This is because it randomly duplicates the existing heartbeats with respect to the record. Then the duplicated heartbeats also change their orders randomly along with all the heartbeats before the concatenation. A illustration of the data augmentation is showing below.

original preprocessed record
augment the original preprocessed record

Section 4.3: neural network design

For real-life application, we always want the algorithm to perform fast and accurately. When I was designing the neural network, I kept the speed and the architecture in my mind which you will see why I designed in such way. The neural network has three parts: a feature extractor, a heartbeat analyzer and a classifier. I will talk about how to choose hyperparameters for the neural network, especially for the feature extractor.

Feature extractor

The feature extractor is a stack of the convolutional layers connected serially. The preprocessed record is a concatenation of the heartbeats, so between each two adjacent heartbeats the signal is not naturally connected. In the other words, I should avoid convolutional operation between each two adjacent heartbeats. To achieve this, the kernel size should be equal to the stride size so that during convolution there is no overlapping. In addition, there is a rule to pick the kernel size in each convolutional layer. I want the feature extractor to extract the characteristics of each heartbeat without involving other heartbeats so that the heartbeat analyzer can process features of each heartbeat at a time step. Now let’s start.

in each convolutional layer, the kernel size always equals to the stride size

Because the sample number of each heartbeat is 300, the kernel size must be a divisor of 300. Before choosing the kernel size of each convolutional layer the number of the convolutional layers needs to be decided first, this is because the number of the convolutional layers is also the number of the divisors of 300. For example, if there are two convolutional layers, then the kernel sizes can be 10 and 30, or 12 and 25, etc. If there are three convolutional layers, then the kernel sizes can be 5 and 6 and 10, or 5 and 5 and 12, etc. The product of each set of numbers is 300. An illustration is showing next to this paragraph (sorry for the low resolution).

The way I choose the hyperparameters in the feature extractor cannot capture all the underlying characteristics of the heartbeats, this is because there is no overlapping during convolution. The various abnormalities of the heartbeats require different scales of the filters to detect. To capture the features from the heartbeats better, I need to add convolutional layers with different kernel sizes parallelly along the other branches of the convolutional layers. Figure below shows two branches of convolutional layers with different kernel sizes. The output size from each branch is the same, so those outputs can be concatenated into one tensor.

two branches of convolutional layers assigned with different kernel sizes where n is the number of heartbeats, N is the number of filters, F is the kernel size, and S is the stride size. The outputs from both branches can be concatenated into one tensor. In this case, the concatenated output has size n by 24.

Heartbeat analyzer

Heartbeat analyzer is a stack of the long-short-term-memory (LSTM) layers. The output from the feature extractor has two dimensions: the first dimension represents the number of heartbeats, and the second dimension represents a set of features in different scales with respect to each heartbeat. Therefore, the first dimension decides the number of time steps in the process of the LSTM layers. In each time step the heartbeat analyzer takes a set of features of one heartbeat as an input. It produces a sequence of the outputs from each LSTM layer until it reaches the last one. The last LSTM layer outputs a fixed size vector. The fixed size vector is a set of aggregative features which summarizes the entire record.

process of heartbeat analyzer where “HB” means “heartbeat”

The figure on the left is an indication of how the heartbeat analyzer processes the output of the feature extractor. There are 20 heartbeats, so 20 time-steps for each LSTM layer. Each heartbeat has a set of features where the size of features is D. The third LSTM layer is the last layer of the heartbeat analyzer, so the output size is fixed. Furthermore, I replace the LSTM layer with bidirectional LSTM layer, which helps the sequential learning.


I can use the fully connected layers (FCLs) as the classifier because the input size is fixed. FCLs are also called multi-layer perceptron. This layer does not share weights. Each neuron in a layer connects every single neuron in the preceding and following layers which can cause a tremendous number of parameters if the widths of the layers are large. It is impractical to use a large size classifier in our small dataset. However, the feature extractor and heartbeat analyzer vastly reduce the dimensions of the input data. Firstly, the feature extractor extracts the useful information of the heartbeats and abandons the redundant ones where the useful information is sets of low-dimensional features which represents the original heartbeats. Then the heartbeat analyzer analyzes each set of features and further reduces the dimensions of the data. Finally, the output of the heartbeat analyzer is a compact feature which represents all heartbeats from a recording. By using the compact feature for classification, the size of the classifier can be very small.

An instantiation of the neural network design

Figure below shows an instantiation of our neural network. The feature extractor has five branches of the convolutional layers, the number of parameters is 23,184. The heartbeat analyzer has 74,240 parameters, and the classifier has 3,689 parameters. Except the last FCL, each layer in the figure is followed by a batch-normalization layer and ReLu activation function. The last layer is only followed by Softmax activation function. The loss function is categorical cross-entropy loss function.

Section 4.4: model training and testing

Now it is the time to talk about how to train the neural network. I have training records in three different folders: the 20 r-peaks, the 40 r-peaks, and the 60 r-peaks. So, there are three training phases in each training iteration. To avoid neural network bias to one category than another, we want to evenly train each category. The detail of the training process is shown below:

For each iteration:

==> In 20-heartbeats recordings (1st phase):

======>Randomly pick 128 recordings in each category

==========>Data preprocessing and augmenting

==========>Train 128*9 recordings all together

==>In 40-heartbeats recordings (2nd phase):

======>Randomly select 32 recordings in each category

==========>Data preprocessing and augmenting

==========>Train 32*9 recordings all together

==>In 60-heartbeats recordings (3rd phase):

======>Randomly select 16 recordings in each category

==========>Data preprocessing and augmenting

==========>Train 16*9 recordings all together

Reducing learning rate by multiplying by 0.1 if the average F1 score saturates

For testing dataset, we do not use duplication step and data augmentation.

Section 5: result and analysis

table 5.1: results of each model and ensemble model

Ensemble model is a combination of multiple learning algorithms which has better predictive performance than a single learning algorithm (see table 5.1). I independently train four neural networks with the same design concepts. Each neural network has a different number of parameters (see table 5.1). For example, a feature extractor of the neural network can have more or less branches of convolutional layers. The size of bidirectional-LSTM layer can be 32, or 50. The point is to reduce the loss function of the neural network to different local minima so that each neural network can learn the dataset in different directions. The training dataset and testing dataset for all the models are the same, I do not use cross validation here.

figure 5.1: ensemble model

The output of the last fully connected layer is a vector of nine probabilities. In figure 5.1, I combine four vectors by piece-wise summation which results in a vector with nine values. The index of the largest value decides the category of the input data. “9” in figure 5.1 is the size of the output from each neural network. The size of my ensemble model is 703,153 parameters which is the sum of four neural networks. In addition, the ensemble model is relatively small by comparing with other models in the competition (I will show other models result later).

Table below is a detailed result of the ensemble model. Now, let’s analyze the results.

table 5.2: the result of testing dataset (average F1 score = 0.812)

PAC record analysis

Table 5.2 shows the worst F1 score is in PAC where most data fall into N. I mentioned the morphologies of PAC includes the changes of PP-intervals. Our data preprocess is to extract heartbeats which removes the correlations between heartbeat to heartbeat. Thus, the system cannot learn the time between P-wave to P-wave. To understand why most PAC data fall into N much better, we need visualize the PAC data which falls into N (plot 5.1) and the N data which is classified correctly (plot 5.2).

plot 5.1
plot 5.2
plot 5.3

By comparing both plots (5.1 and 5.2), we can see they look alike. If we take a look at the original signal of this PAC data, we will see the essential problem is from data preprocess. Plot 5.3 is the PAC data from plot 5.1 without data preprocess. In plot 5.3, the green rectangle is the abnormal part where the P-wave is merging to the preceding T-wave. At segmentation step, I am using the fixed threshold values to extract heartbeats, so the abnormality disappears after the segmentation step. Here is another case, in plot 5.4, we can see the PP-intervals are changing. Because our data preprocess, the segmented part looks just like N data. Thus, this explains why most PAC records are misclassified as N.

plot 5.4: PAC data which demonstrates the problem of the data preprocess

STE and STD record analysis

plot 5.5
plot 5.6
plot 5.7
figure 5.2: ECG data distribution

According to table 5.2, STE and STD also have relatively low F1 scores which are below 80%. Most of their data are misclassified as N. We will use lead V5 to visualize the data in this case. We compare the N data which is classified correctly (plot 5.5) and the STE data which falls into N (plot 5.6, 5.7). There are upwards slopes at ST-segments in all the plots. Plot 5.5 and plot 5.7 are very similar in terms of the morphology of each heartbeat, which explains why this STE record is misclassified as N. However, plot 5.6 shows the steepest upwards slopes at ST-segments, it is easy to be recognized as STE, but the model classifies it as N. The only reason I can think of is that the STE category does not have enough data (figure 5.2) to train, which causes learning bias of our system. The F1 score of STD is 6% higher than STE’s because STD has much more records to train, but STD is still difficult for the system to learn.

LBBB record analysis

LBBB has the least data, but the F1 score is very high, 0.896. This is because LBBB has the most unique morphology among all nine categories.


Furthermore, training neural network requires big dataset and sophisticated data augmenting techniques so that the neural network can capture some types of invariances from the data. The dataset I have is small, and it is hard to apply data augmenting techniques. Although I use data preprocess to ease the training of the system and data augmenting technique to enhance the learning of the system, the system cannot fully capture invariances, such as signal distortion invariances. The signal distortions are noises on the original ECGs which cause corruptions of the information. Those noises include baseline wander, motion artifacts, powerline interference, muscle noise, etc. A good performance of the model also requires good qualities of the dataset. The dataset I have contains a lot of noises and some leads do not have useful information (plot 5.8, 5.9). In addition, it is possible that the label can be wrong.

plot 5.8: the lead does not contain useful information
plot 5.9: the lead is useless

Section 6: system pros and cons


My ECG classification system design is quite special. It maps each heartbeat into a single time step case which reduces the processing time in the bi-directional LSTM layers. The way I choose the hyperparameters in feature extraction also reduces processing time because the kernel size equals to the stride size in each layer. More importantly, It can run four neural networks parallelly which cost the same amount time as a single neural network.


I have mentioned the shortcoming of my data preprocess. Since I am extracting individual heartbeats, the preprocessed data does not provide information of the relationship between heartbeats, such as PP-intervals. Therefore, my neural network cannot learn the statistics of PP-intervals which cause poor performance in PAC. To improve the result of PAC, I need to segment two to three consecutive heartbeats.

There are also flaws in my neural network design. In each convolutional layer, the kernel size equals the stride size, so there is a limitation to capture all the underlying characteristics of the heartbeats. The reason for same sized kernel and stride is that the convolutional operation should not convolve between two adjacent segmented heartbeats (they are not naturally connected). However, if I apply the convolutional operation in the heartbeat analyzer, we can use any stride size. This type of design is called recurrent convolutional neural network where the convolutional operation is used in the recurrent layers. However, the trade-off is of course the processing speed (it can be very slow).

Section 7: models from the competition

Now my project is done, and I’d like to compare my work with others’ pioneer studies. The results of others are from the 2018 China Psychology Competition. Here I need to declare that there are two datasets from the 2018 China Psychology Competition. I only used dataset 1 because the dataset 2 is hidden which no one can access it, but the people who participated this competition have their models evaluated by the hidden dataset. Unfortunately, I was not allowed to use the hidden dataset after the competition, so my ensemble model is evaluated with the testing dataset in section 4.1. Anyways, let’s start.

table 7.1

Table 7.1 is a summary of competition results by using the hidden dataset. The Serial No. indicates the rank of each model. There are 34 models in the competition, but I only take the top 15 models. F1 in table 7.1 is the average F1 score which determines the placement of the model.

plot 7.1

Plot 7.1 is made according to table 7.1; it shows the ranks of models versus the types of F1 scores. The horizontal dash lines are the average values of certain types of F1 scores. As we can see all the models have difficulties to learn the PC type and ST type recordings. The performances of the PC type and the ST type have a strong influence on the rank of the model. In plot 7.1, the star markers are the results of my ensemble model by using the testing dataset in section 4.1. The star markers are only for the y-axis visualization, the x-axis is not applicable to the star markers. Although the results of my model are not comparable with others due to different testing dataset, it gives a general idea of how the performance of my model is. My F1 score of ST type is quite high, but the F1 score of PA type is lower than the average. The F1 scores of both AF type and Block type are above average.

Let me know if you have any question. Thank you for reading my work!

Let me know if you have any question. Thank you for reading my work!