Reducing Commercial Aviation Fatalities

Original article was published by Vinithavn on Artificial Intelligence on Medium

Table of Contents

  1. Overview
  2. Business Problem
  3. Dataset Analysis
  4. Mapping the Real Problem into ML problem
  5. Performance Metric
  6. Exploratory Data Analysis
  7. Feature Engineering
  8. Data preprocessing
  9. Modeling
  10. Future improvements
  11. Results
  12. References



Overview

This was a Kaggle competition in which participants build a model to detect troubling events from aircrew physiological data.

An aviation fatality is the death of one or more persons inside or outside an aircraft, spacecraft, or other aerospace vehicle during a flight operation or any other operation involving that vehicle. Events that cause such fatalities and are directly associated with the operation of the aircraft are called aviation accidents.

The most frequent causes of these aviation accidents include:

  1. Pilot error
  2. Mechanical failure
  3. Design defect
  4. Air traffic failure
  5. Defective runways

In this competition, the main focus is on the first cause, i.e., aviation accidents caused by pilot error, and on ways to avoid them.

Business Problem

A large part of pilot training covers the physiological demands of flying an airplane. This matters because a pilot must be able to multitask, concentrate, and pay attention to all of these tasks at once. Strengthening these abilities may help reduce pilot-induced flight fatalities.

Most flight accidents attributed to pilot error result from the loss of airplane state awareness. Airplane state awareness (ASA) is a pilot performance attribute: the pilot should be able to recognize and respond quickly to any change in the state of the airplane. Loss of ASA can lead to dangerous situations, including loss of airplane control, in which an extreme deviation from the intended flight path may occur. It is mainly caused by a lapse of attention on the part of pilots who may be distracted, sleepy, or in other dangerous cognitive states. Because flying is a stressful environment, such lapses of awareness are common.

In this competition, we are provided with real physiological data from pilots who were subjected to various distracting events. Each distraction was intended to induce one of the following three cognitive states:

  1. Channelized Attention (CA): This occurs when the pilot is focusing only on one task without giving any attention to other tasks.
  2. Diverted Attention (DA): The state of having one’s attention diverted by actions or thought processes associated with a decision. This is induced by having the subjects perform a display monitoring task.
  3. Startle/Surprise (SS): This is the response to a sudden unexpected stimulus. In aviation, this can be defined as an uncontrollable automatic reflex or reaction caused by exposure to a sudden intense event that violates a pilot’s expectations.

The aim is to build a model that can estimate the pilot’s state of mind in real time from the physiological data given. When the pilot enters any one of the above-mentioned dangerous cognitive states, the pilot should be alerted, thereby preventing a possible accident.

Dataset Analysis

Three CSV files are provided for this competition. train.csv contains all the data to be used for training, test.csv is used to test the model, and sample_submission.csv shows the CSV format in which the final output is to be submitted.

The training data consist of three experiments: CA, DA, and SS. The output is one of four labels: Baseline (no event), CA, DA, or SS. For example, if the experiment is CA, the output is either CA or Baseline (no event). The test data is taken from a full flight simulator. Here the experiment is called LOFT, or Line Oriented Flight Training, in which the pilot is trained in a flight simulator that artificially recreates the environment of a real flight. In the test data, the experiment is given as LOFT and the output can be any of the four states at a given time. Predicting the state of a pilot requires physiological data, and we have data from four sensors: EEG, ECG, respiration, and galvanic skin response. Now, let’s analyze each attribute of the dataset.

  • Id: Unique identifier for a crew + time combination. Each id represents a pilot at a particular time into the experiment, and we need to predict the state for each id
  • Crew: Unique id for a pair of pilots
  • Experiment: For training, it will be either CA or DA or SS. For testing, it will be LOFT
  • Time: Seconds into the experiment
  • Seat: Seat of the pilot (0 means left, 1 means right)

EEG (Electroencephalogram): This records the summed electrical activity at the surface of the brain. Data from 20 electrodes are given to us. Each electrode lead is placed near a particular part of the brain: prefrontal (fp), temporal (t), frontal (f), parietal (p), occipital (o), or central (c). An odd number in the name indicates that the electrode is placed on the left side of the brain, an even number indicates the right side, and z indicates the middle region.

The figure below gives an idea of the position of each electrode.

Figure 1: Position of electrodes on the scalp

  • Eeg_fp1: Data from the electrode near the prefrontal area, left side
  • Eeg_f7: Data from the electrode near the frontal area, left side
  • Eeg_f8: Data from the electrode near the frontal area, right side
  • Eeg_t4: Data from the electrode near the temporal area, right side
  • Eeg_t6: Data from the electrode near the temporal area, right side
  • Eeg_t5: Data from the electrode near the temporal area, left side
  • Eeg_t3: Data from the electrode near the temporal area, left side
  • Eeg_fp2: Data from the electrode near the prefrontal area, right side
  • Eeg_o1: Data from the electrode near the occipital area, left side
  • Eeg_p3: Data from the electrode near the parietal area, left side
  • Eeg_pz: Data from the electrode near the parietal area, middle region
  • Eeg_f3: Data from the electrode near the frontal area, left side
  • Eeg_fz: Data from the electrode near the frontal area, middle region
  • Eeg_f4: Data from the electrode near the frontal area, right side
  • Eeg_c4: Data from the electrode near the central area, right side
  • Eeg_p4: Data from the electrode near the parietal area, right side
  • Eeg_poz: Data from the electrode near the parietal-occipital junction, middle region
  • Eeg_c3: Data from the electrode near the central area, left side
  • Eeg_cz: Data from the electrode near the central area, middle region
  • Eeg_o2: Data from the electrode near the occipital area, right side
  • Ecg: Three-point electrocardiogram (ECG) signal. It measures the electrical activity of the heart (sensor output is in microvolts)
  • R: Respiration sensor. It measures the rise and fall of the chest (sensor output is in microvolts)
  • Gsr: Galvanic skin response. A measure of electrodermal activity (sensor output is in microvolts)
  • Event: The output to be predicted, i.e., the state of the pilot at a given time. It is one of A (baseline, no event), B (SS), C (CA), or D (DA)

Mapping the Real Problem into ML problem

This is a multiclass classification problem wherein, for each id (for a particular crew at a particular time), we need to predict the state of the pilot as belonging to one of the four given classes. Given all the attributes, we need to predict the probability of occurrence of each event.

Performance Metric

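The problem we are handling is a multiclass classification problem with 4 classes, and the model outputs a probability for each class. The metric used is therefore the multiclass log loss:

$$\text{log loss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\,\log(p_{ij})$$

where N is the total number of data points and M is the number of classes.

y_ij is 1 if data point i actually belongs to class j, and 0 otherwise.

p_ij is the predicted probability of data point i belonging to class j.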

  • We can also use precision and recall matrices to evaluate performance, checking how well we predict and recall each state. That is, for each of the “dangerous state” classes, we should correctly classify as many data points as possible and should not misclassify them as the normal state.
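The metric described above, built from N data points, M classes, the true labels and the predicted probabilities, is the multiclass log loss. The following is a minimal sketch of how it could be computed with NumPy; the function name `multiclass_log_loss`, the clipping constant `eps`, the class-index ordering and the toy predictions are all assumptions made for illustration.

```python
import numpy as np

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: integer class indices, shape (N,)
    probs:  predicted class probabilities, shape (N, M)
    """
    y_true = np.asarray(y_true)
    probs = np.asarray(probs, dtype=float)
    # Clip to avoid log(0), then renormalize so each row still sums to 1.
    probs = np.clip(probs, eps, 1 - eps)
    probs = probs / probs.sum(axis=1, keepdims=True)
    # The true labels are one-hot, so only the probability given to the
    # true class of each row contributes to the double sum.
    true_class_probs = probs[np.arange(len(y_true)), y_true]
    return float(-np.mean(np.log(true_class_probs)))

# Assumed class order: 0 = A (baseline), 1 = B (SS), 2 = C (CA), 3 = D (DA).
y_true = [0, 2, 3]
probs = [
    [0.70, 0.10, 0.10, 0.10],   # confident and correct
    [0.20, 0.10, 0.60, 0.10],   # fairly confident and correct
    [0.25, 0.25, 0.25, 0.25],   # uninformative guess
]
loss = multiclass_log_loss(y_true, probs)
print(f"log loss: {loss:.4f}")
```

A lower value is better: a model that always outputs uniform probabilities over the four classes scores log(4), so any useful model should fall below that.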

Exploratory Data Analysis (EDA)

Analyze the events

First and foremost, let’s analyze the frequency of occurrence of each of the events. We can use a count plot for this.

Figure 2: The frequency of events

Here A indicates Baseline (no event), B indicates Startle/Surprise (SS), C indicates CA, and D indicates DA. From this plot, it is clear that the data is imbalanced: the events occur with very different frequencies. Let’s go one level deeper and analyze each event.
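The numbers behind such a count plot can be tabulated directly. Below is a small sketch using pandas; the column name `event`, the helper `event_distribution` and the tiny made-up sample are assumptions, and the real training file has far more rows.

```python
import pandas as pd

def event_distribution(df, event_col="event"):
    """Return the count and share of each event label, sorted by label."""
    counts = df[event_col].value_counts().sort_index()
    shares = counts / counts.sum()
    return pd.DataFrame({"count": counts, "share": shares})

# Made-up sample standing in for the event column of the training data.
sample = pd.DataFrame({"event": list("AAAAAACCCDB")})
dist = event_distribution(sample)
print(dist)
```

Looking at shares rather than raw counts makes the imbalance easy to compare across crews or experiments of different lengths.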

Let’s consider each experiment separately, pick a randomly selected crew, and analyze the frequency of each event.

Figure 3: Analyzing the frequency of CA event
Figure 4: Analyzing the frequency of DA event
Figure 5: Analyzing the frequency of SS event

From these figures, we can see that although all the experiments are conducted over the same interval, the SS event occurs far less often than the others. The same imbalance may be present in the test dataset, so for the time being we keep it and check the performance without balancing the dataset.

Univariate analysis for each feature

Here we use box plots to analyze how useful each feature is for predicting the event.

Figure 6: Box plot of ECG data

From the above figure, it can be seen that the ECG data has some outliers. But we cannot simply remove them, because these extreme values might be useful in predicting the event. When the ECG value is high (more than 10000 microvolts), the pilot is more likely to be in the DA state. Similarly, when the value is strongly negative, the pilot is likely to be in the CA state. ECG alone cannot predict the events, but it does play some role in the prediction.
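The quantities a box plot draws can also be computed per event, which helps when checking how many points fall outside the whiskers. This is only a sketch: the column names `event` and `ecg`, the helper `boxplot_summary`, the conventional 1.5 × IQR whisker rule and the sample values are assumptions, not taken from the competition data.

```python
import pandas as pd

def boxplot_summary(df, feature, event_col="event"):
    """Per-event quartiles plus the share of points outside the 1.5*IQR whiskers."""
    rows = {}
    for event, values in df.groupby(event_col)[feature]:
        q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        outliers = (values < lower) | (values > upper)
        rows[event] = {
            "q1": q1,
            "median": median,
            "q3": q3,
            "outlier_share": outliers.mean(),
        }
    return pd.DataFrame.from_dict(rows, orient="index")

# Made-up readings: event D contains one extreme value.
sample = pd.DataFrame({
    "event": ["A"] * 5 + ["D"] * 5,
    "ecg": [1, 2, 3, 4, 5, 1, 2, 3, 4, 100],
})
summary = boxplot_summary(sample, "ecg")
print(summary)
```

If the outlier share differs strongly between events, that is a hint the extreme values carry signal and should be kept, as argued above.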

Figure 7: Box plot of respiration sensor output data

Like the ECG data, the respiration data also has some outliers, and again we cannot simply remove them because these extreme values might be useful in predicting the event. The respiration signal should have some effect on the event, but the box plot above shows that this sensor output does not separate the events at all. This might be because of noise in the data.

Figure 8: Box plot of GSR sensor output

From the box plot of GSR, we can say that GSR plays some role in predicting the output. The GSR data separates the events to some extent.

The next step is to check for noise in the data. Biological sensors are easily affected by noise, and since the data comes from physiological measurements of real people, the sensor outputs will be noisy. Let’s check the noise in the ECG and R data. The signals are examined for one experiment and one particular crew.

Figure 9: ECG output data for 10 seconds
Figure 10: Respiration sensor data for 10 seconds

These signals are clearly noisy, so we need to remove the high-frequency noise. For that purpose, we use a low-pass Butterworth filter.

For filtering the ECG signal, the cutoff frequency (w) was set to 100, and for filtering the respiration signal, w was set to 0.7. The filtered ECG and R signals are shown below.
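A low-pass Butterworth filter of this kind can be sketched with SciPy as follows. The cutoff of 0.7 comes from the text; the 256 Hz sampling rate, the filter order of 4, the zero-phase `sosfiltfilt` call and the helper name `lowpass` are assumptions made for illustration and may differ from what was actually used. The signal is synthetic.

```python
import numpy as np
from scipy import signal

def lowpass(x, cutoff_hz, fs=256.0, order=4):
    """Zero-phase low-pass Butterworth filter (order and fs are assumed values)."""
    # Second-order sections are numerically safer than (b, a) coefficients,
    # especially when the cutoff is tiny relative to the sampling rate.
    sos = signal.butter(order, cutoff_hz, btype="low", fs=fs, output="sos")
    # Filtering forward and backward cancels the phase delay.
    return signal.sosfiltfilt(sos, x)

# Synthetic stand-in for a respiration trace: a slow 0.25 Hz wave
# contaminated with 50 Hz interference.
fs = 256.0
t = np.arange(0, 20, 1 / fs)
clean = np.sin(2 * np.pi * 0.25 * t)
noisy = clean + 0.5 * np.sin(2 * np.pi * 50 * t)

filtered = lowpass(noisy, cutoff_hz=0.7, fs=fs)

noisy_mse = np.mean((noisy - clean) ** 2)
filtered_mse = np.mean((filtered - clean) ** 2)
print(f"MSE before: {noisy_mse:.4f}, after: {filtered_mse:.6f}")
```

The same helper could be called with `cutoff_hz=100` for the ECG signal, provided the cutoff stays below half the sampling rate.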

Figure 11: Filtered ECG data
Figure 12: Filtered respiration data

The filtered data are much cleaner and more meaningful than the original data, so we replace the ECG and R data with the corresponding filtered signals.

Feature Engineering

Now let’s try to derive some additional features from the existing ones.

Heart Beat information from ECG

What if we could get the heartbeat from the ECG signal?

ECG is a graph of the electrical activity of the heart against time, and its output is in microvolts. Apart from this voltage, we can also derive the heart rate, i.e., the number of heartbeats per minute, from this data. A Python library called BioSPPy provides tools for this kind of biosignal processing.
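To illustrate the idea behind deriving heart rate from ECG, here is a rough sketch that does not use BioSPPy: it finds R-peaks with SciPy's `find_peaks` and converts the spacing between consecutive peaks into beats per minute. The 256 Hz sampling rate, the half-of-maximum height threshold, the minimum beat spacing and the function name `heart_rate_bpm` are all assumptions, and the ECG trace is synthetic. A dedicated ECG library would use a far more robust peak detector.

```python
import numpy as np
from scipy.signal import find_peaks

def heart_rate_bpm(ecg, fs=256.0, min_rr_s=0.3):
    """Estimate heart rate (beats per minute) from the spacing of R-peaks."""
    ecg = np.asarray(ecg, dtype=float)
    # Crude R-peak detection: peaks above half the maximum amplitude,
    # at least min_rr_s seconds apart so one beat is not counted twice.
    peaks, _ = find_peaks(ecg, height=0.5 * ecg.max(),
                          distance=int(min_rr_s * fs))
    rr_intervals = np.diff(peaks) / fs   # seconds between consecutive beats
    return peaks, 60.0 / rr_intervals    # one rate value per interval

# Synthetic ECG-like trace: one narrow spike every 0.8 s (75 beats per minute).
fs = 256.0
t = np.arange(0, 10, 1 / fs)
beat_times = np.arange(0.5, 10, 0.8)
ecg = sum(np.exp(-((t - b) ** 2) / (2 * 0.02 ** 2)) for b in beat_times)

peaks, bpm = heart_rate_bpm(ecg, fs=fs)
print(f"{len(peaks)} beats detected, mean rate {bpm.mean():.1f} bpm")
```

Each interval between beats yields one heart-rate value, so the result is a time series that can be aligned back to the original samples as an extra feature.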