A Practical Guide for Exploratory Data Analysis: Flight Delays

Original article was published on Artificial Intelligence on Medium

Origins and destinations

ORIGIN column contains departure locations and DEST column contains destination locations. Let’s see how many origins and destinations we have in the dataset.

print("There are {} unique origin city".format(df['ORIGIN'].nunique()))print("There are {} unique destination city".format(df['DEST'].nunique()))

We can sort the origins and destinations in terms of delay rate.

df[['ORIGIN','DEP_DEL15']].groupby('ORIGIN').agg(['mean','count']).sort_values(by=('DEP_DEL15','mean'), ascending=False)[:10]
df[['DEST','ARR_DEL15']].groupby('DEST').agg(['mean','count']).sort_values(by=('ARR_DEL15','mean'), ascending=False)[:10]

We sorted the origins in terms of departure delay ratio and destinations in terms of arrival delay ratio. The locations with high delay rates have very few flights.

Average number of flights per locations is 1635. The ones with less than 100 flights may be outliers and do not give us an accurate estimation. Let’s sort the origins that have more than 500 flights.

origin_delays = df[['ORIGIN','DEP_DEL15']].groupby('ORIGIN').agg(['mean','count'])origin_delays[origin_delays[('DEP_DEL15','count')] > 500].sort_values(by=('DEP_DEL15','mean'), ascending=False)[:10]

Whole list changed except for “ASE” which has the highest delay rate by far. Please note that when we filter a dataframe with multilevel index, we need to use all levels. For instance, in order the filter the above dataframe based on “count”, we use origin_delays[(‘DEP_DEL15’,’count’)], not origin_delays[‘count’].