2023/9/15

In this project, the 2009 ASA Statistical Computing and Graphics dataset is used.
Because the dataset is very large (it covers the years 1987 to 2008),
I chose the 2006 and 2007 data as a sub-dataset for the tasks.
This project mainly focuses on 5 problems:
1. When is the best time of day, day of the week, and time of year to fly to minimise delays?
2. Do older planes suffer more delays?
3. How does the number of people flying between different locations change over time?
4. Can you detect cascading failures as delays in one airport create delays in others?
5. Use the available variables to construct a model that predicts delays.
Library introduction
pandas: process CSV data
numpy: process matrices
matplotlib, seaborn: data visualization
sklearn: train the prediction model

Data wrangling
In [80]:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

Load flight, plane, airport data

In [11]:

flight_2007 = pd.read_csv('./dataverse_files/2007.csv')
flight_2006 = pd.read_csv('./dataverse_files/2006.csv')
flight = pd.concat([flight_2006, flight_2007])

In [12]:

airport = pd.read_csv('./dataverse_files/airports.csv')

In [13]:

plane = pd.read_csv('./dataverse_files/plane-data.csv')
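
Since the yearly files are large, one option is to read only the columns the analysis needs. This is a minimal sketch on an in-memory stand-in for the real CSV files; the `load_flights` helper and the column selection are assumptions for illustration, not part of the original notebook.

```python
import io
import pandas as pd

# Columns used by the analysis (an assumed selection, adjust as needed).
USECOLS = ['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'CRSDepTime',
           'DepDelay', 'TailNum', 'Origin', 'Dest']

def load_flights(sources):
    """Read several yearly CSV sources, keeping only USECOLS, and stack them."""
    frames = [pd.read_csv(src, usecols=USECOLS) for src in sources]
    # ignore_index avoids duplicated row labels after concatenation
    return pd.concat(frames, ignore_index=True)

# Tiny made-up stand-ins for the 2006 and 2007 files.
header = 'Year,Month,DayofMonth,DayOfWeek,CRSDepTime,DepDelay,TailNum,Origin,Dest,Distance\n'
csv_2006 = io.StringIO(header + '2006,1,11,3,745,-2,N657AW,ATL,PHX,1587\n')
csv_2007 = io.StringIO(header + '2007,1,1,1,1225,7,N351,SMF,ONT,389\n')

sample = load_flights([csv_2006, csv_2007])
print(sample.shape)
```

With the real data, the same helper could be called with the two file paths used above.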

flight data

In [14]:


(14595137, 29)
Index(['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'DepTime', 'CRSDepTime',
'ArrTime', 'CRSArrTime', 'UniqueCarrier', 'FlightNum', 'TailNum',
'ActualElapsedTime', 'CRSElapsedTime', 'AirTime', 'ArrDelay',
'DepDelay', 'Origin', 'Dest', 'Distance', 'TaxiIn', 'TaxiOut',
'Cancelled', 'CancellationCode', 'Diverted', 'CarrierDelay',
'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay'],

plane data

In [15]:


(5029, 9)
Index(['tailnum', 'type', 'manufacturer', 'issue_date', 'model', 'status',
'aircraft_type', 'engine_type', 'year'],

airport data

In [16]:


(3376, 7)
Index(['iata', 'airport', 'city', 'state', 'country', 'lat', 'long'], dtype='obj

Question 1
When is the best time of day, day of the week, and time of year to fly to minimise delays?

Join the flight data with the plane data on tailnum.

Get the hour of the day by integer-dividing CRSDepTime by 100, because the original format is HHMM.
Group the delay according to 'day of month', 'day of week' and 'hour of day'. Calculate the sum, mean
and median of the delay.
Plot the delay and find the best time.
Use line plots and box plots to display the median and mean delay at different times.
We mainly focus on the median delay at different times.

In [141]:

flight['CRSDepHourOfDay'] = flight['CRSDepTime'] // 100

# DepDelay is recorded in minutes, so this column holds hours despite its name
flight['DelayMin'] = flight['DepDelay'] / 60

In [135]:

dom_delay = flight.groupby('DayofMonth').agg({'DepDelay': ['sum', 'mean', 'median']}).reset_index()

dow_delay = flight.groupby('DayOfWeek').agg({'DepDelay': ['sum', 'mean', 'median']}).reset_index()
hod_delay = flight.groupby('CRSDepHourOfDay').agg({'DepDelay': ['sum', 'mean', 'median']}).reset_index()

In [136]:

dom_delay_np = dom_delay.to_numpy()
dow_delay_np = dow_delay.to_numpy()
hod_delay_np = hod_delay.to_numpy()
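
The positional indices into the NumPy arrays (0 = group key, 1 = sum, 2 = mean, 3 = median) are easy to mix up. A possible alternative, sketched here on made-up data, is pandas named aggregation, which keeps flat, descriptive column names. The `toy` frame and the `total` column name are assumptions for illustration.

```python
import pandas as pd

# Small synthetic stand-in for the flight table (values are made up).
toy = pd.DataFrame({
    'CRSDepTime': [545, 559, 1730, 1745, 1759],
    'DepDelay': [-3.0, 1.0, 20.0, 45.0, 10.0],
})
toy['CRSDepHourOfDay'] = toy['CRSDepTime'] // 100  # HHMM -> hour

# Named aggregation gives flat, self-describing column names.
hod = (toy.groupby('CRSDepHourOfDay')['DepDelay']
          .agg(total='sum', mean='mean', median='median')
          .reset_index())

# Hour with the smallest median delay.
best_hour = hod.loc[hod['median'].idxmin(), 'CRSDepHourOfDay']
print(hod)
print(best_hour)
```

The same pattern would apply to 'DayofMonth' and 'DayOfWeek' by changing the grouping column.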

In [26]:

ax1 = plt.subplot(3, 1, 1)
l1, = ax1.plot(dom_delay_np[:,0], dom_delay_np[:,3], label='median')
ax11 = ax1.twinx()
l11, = ax11.plot(dom_delay_np[:,0], dom_delay_np[:,2], color='orange', label='mean')
# ax111 = ax1.twinx()
# l111, = ax111.plot(dom_delay_np[:,0], dom_delay_np[:,3], color='red', label='median')
plt.legend([l1, l11], ['median', 'mean'])

ax2 = plt.subplot(3, 1, 2)
l2, = ax2.plot(dow_delay_np[:,0], dow_delay_np[:,3], label='median')
ax21 = ax2.twinx()
l21, = ax21.plot(dow_delay_np[:,0], dow_delay_np[:,2], color='orange', label='mean')
# ax211 = ax2.twinx()
# l211, = ax211.plot(dow_delay_np[:,0], dow_delay_np[:,3], color='red', label='median')
plt.legend([l2, l21], ['median', 'mean'])

ax3 = plt.subplot(3, 1, 3)
l3, = ax3.plot(hod_delay_np[:,0], hod_delay_np[:,3], label='median')
ax31 = ax3.twinx()
l31, = ax31.plot(hod_delay_np[:,0], hod_delay_np[:,2], color='orange', label='mean')
# ax311 = ax3.twinx()
# l311, = ax311.plot(hod_delay_np[:,0], hod_delay_np[:,3], color='red', label='median')
plt.legend([l3, l31], ['median', 'mean'])

<matplotlib.legend.Legend at 0x17fa320d0>

In [153]:

ax = sns.boxplot(x='DayofMonth', y='DelayMin', data=flight, showfliers=False)

In [152]:

ax = sns.boxplot(x='CRSDepHourOfDay', y='DelayMin', data=flight, showfliers=False)

In [156]:

ax = sns.boxplot(x='DayOfWeek', y='DelayMin', data=flight, showfliers=False)

As shown in the figures, we mainly focus on the median departure delay grouped by time.
Best time of day: 5 o'clock, because it has the smallest maximum delay and a lower average delay.
Best day of week: 2 (Tuesday, since 1 = Monday), because it has a lower median and average delay.
Best days of month: 6, 8 and 9. All of them have the smallest maximum delay and lower average and
median delays, but the difference between them cannot be seen from the box plot.

Question 2
Do older planes suffer more delays?

Import the plane data and drop NA values.

Join the plane data and the flight data on the 'tailnum' column.
Group the delay by the plane's 'year' column and calculate the delay's sum, mean and median.
Plot the delay against the year with a scatter plot to see whether older planes suffer more delays.

In [27]:

plane = pd.read_csv('./dataverse_files/plane-data.csv')
plane = plane.dropna()
flight_plane = flight.join(plane.set_index('tailnum'), on='TailNum')

In [28]:

plane_delay = flight_plane.groupby('year').agg({'DepDelay': ['sum', 'mean', 'median']})

plane_delay_np = plane_delay.reset_index().to_numpy()

In [171]:

ax1 = plt.subplot(1, 1, 1)
l1 = ax1.scatter(plane_delay_np[:,0], plane_delay_np[:,1], label='sum')
ax1.tick_params(axis='x', rotation=90)
ax11 = ax1.twinx()
l11 = ax11.scatter(plane_delay_np[:,0], plane_delay_np[:,2], color='orange', label='mean')
plt.legend([l1, l11], ['sum', 'mean'])

<matplotlib.legend.Legend at 0x109dfbf10>

In [194]:

from scipy.stats import pearsonr

plane_delay_np_noNone = plane_delay_np[(plane_delay_np[:, 0] != 'None'), :]
plane_delay_np_noNone = plane_delay_np_noNone[(plane_delay_np_noNone[:, 0] != '0000'), :]
# Cast both columns explicitly: the array has object dtype after to_numpy()
years = plane_delay_np_noNone[:, 0].astype(np.int32)
mean_delay = plane_delay_np_noNone[:, 2].astype(np.float64)
r, p = pearsonr(years, mean_delay)
print(f'Pearson correlation coefficient: {r}, Two-tailed p-value: {p}')

Pearson correlation coefficient: 0.1554534236860175, Two-tailed p-value: 0.29140

As shown in the figure, we mainly focus on the mean departure delay grouped by the plane's manufacturing year.
There is no clear relationship between manufacturing year and mean departure delay, so there is no evidence
that older planes suffer more delays.
The Pearson correlation coefficient is 0.155 with a two-tailed p-value of about 0.29, so there is no
statistically significant linear correlation between manufacturing year and delay.
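
The cleaning and correlation steps can also be done directly in pandas. The following is a minimal sketch on made-up data, assuming (as in the notebook) that invalid years appear as the strings 'None' and '0000'; the `toy` frame and the `year_num` column are hypothetical names.

```python
import pandas as pd
from scipy.stats import pearsonr

# Synthetic stand-in for the joined flight/plane table (values are made up).
toy = pd.DataFrame({
    'year': ['1990', '1990', '2000', '2000', '2005', 'None', '0000'],
    'DepDelay': [12.0, 8.0, 9.0, 11.0, 7.0, 30.0, 5.0],
})

# Coerce non-numeric years to NaN, then drop them and the '0000' placeholder.
toy['year_num'] = pd.to_numeric(toy['year'], errors='coerce')
clean = toy[toy['year_num'] > 0]

# Mean delay per manufacturing year, then Pearson's r on those means.
per_year = clean.groupby('year_num')['DepDelay'].mean()
r, p = pearsonr(per_year.index.to_numpy(dtype=float), per_year.to_numpy())
print(f'r = {r:.3f}, p = {p:.3f}')
```

Note that correlating per-year means, as the notebook does, gives each manufacturing year equal weight regardless of how many flights it contributes.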

Question 3
How does the number of people flying between different locations change over time?

In this question, I mainly use two approaches to analyze the problem. The dataset records flights rather
than passengers, so the number of flights is used as a proxy for the number of people flying.

1. Find the top-5 location pairs by number of flights and track how their flight counts change by
month (line plot).
2. Show the change in flight counts with a heat map based on the geolocation of the airports.
Because there are too many airports to plot individually, grouping them into latitude and longitude
regions works better (heat map).
For approach 1
First combine the origin and destination in a fixed order, for example, treat (BOS, LGA) and
(LGA, BOS) as the same key.
Group the data by the origin-destination pair and sort by the count.
Get the top-5 origin-destination pairs.
Plot the monthly flight counts for those pairs and look for a pattern.
For approach 2
First join the flight data and the airport data on the origin airport and divide the data into latitude and
longitude slots of width 5 degrees according to the airport's location.
Count the number of flights in each slot and plot them in a heat map.

In [30]:

flight['SortedOriginDest'] = flight.apply(lambda row : ','.join(sorted([row['Origin'], row['Des

N = 5
max_n = flight.groupby(['SortedOriginDest'], as_index=False).size().sort_values(by='size', asce
max_n_OD = list(max_n['SortedOriginDest'])
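
A row-wise `apply` over millions of flights is slow. As a hedged alternative, the direction-independent route key can be built with vectorized comparisons. This sketch runs on made-up routes, and the `first`, `second` and `top_routes` names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the flight table (routes are made up).
toy = pd.DataFrame({
    'Origin': ['BOS', 'LGA', 'HNL', 'OGG', 'HNL', 'SFO'],
    'Dest':   ['LGA', 'BOS', 'OGG', 'HNL', 'OGG', 'LAX'],
})

# Put the alphabetically smaller code first so both directions share one key.
origin_first = toy['Origin'] < toy['Dest']
first = pd.Series(np.where(origin_first, toy['Origin'], toy['Dest']), index=toy.index)
second = pd.Series(np.where(origin_first, toy['Dest'], toy['Origin']), index=toy.index)
toy['SortedOriginDest'] = first + ',' + second

# The N busiest undirected routes by number of flights.
N = 2
top_routes = toy['SortedOriginDest'].value_counts().head(N)
print(top_routes)
```

`list(top_routes.index)` would then play the role of `max_n_OD` in the cells that follow.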

In [31]:

ODYear = flight[flight['SortedOriginDest'].isin(max_n_OD)]
ODYear = ODYear[ODYear['Year'].isin([2006, 2007])]
ODYear = ODYear.groupby(['SortedOriginDest', 'Month'], as_index=False)
ODYear_cnt = ODYear['SortedOriginDest'].size()

In [32]:

plot_dic = {}
for od in max_n_OD:
    plot_dic[od] = ODYear_cnt[ODYear_cnt['SortedOriginDest'] == od]
    plt.plot(plot_dic[od]['Month'], plot_dic[od]['size'], label=od)
plt.legend()

<matplotlib.legend.Legend at 0x17fc5d4f0>

In this approach, I chose the top-5 routes, grouping both directions of each origin-destination pair together.
For example, flights in the directions (HNL, OGG) and (OGG, HNL) are counted as the same 'location'.
Flight counts stand in for the number of people flying.
As shown in the figure, for the routes with HNL as origin or destination, February has the fewest flights and
July has the most.
All routes show a decrease in February and an increase in March.
More flights operate in summer than in winter.

In [95]:

flight['my'] = flight['Year'].astype(str) + '-' + flight['Month'].astype(str).str.zfill(2)

flight_airport = pd.merge(flight[['my', 'Origin']], airport[['iata', 'lat', 'long']], left_on='O
slot = 5
flight_airport['round_lat'] = flight_airport['lat'] // slot * slot
flight_airport['round_long'] = flight_airport['long'] // slot * slot

lat_range = np.arange(flight_airport['lat'].min() // slot * slot, flight_airport['lat'].max() /
long_range = np.arange(flight_airport['long'].min() // slot * slot, flight_airport['long'].max(

dates = flight_airport['my'].unique()
dates = sorted(dates)

lats = np.zeros((len(lat_range), len(dates)))
longs = np.zeros((len(long_range), len(dates)))

flight_airport = flight_airport[['my', 'round_lat', 'round_long', 'Origin']]

lats_g = flight_airport.groupby(['my', 'round_lat']).count().reset_index()
longs_g = flight_airport.groupby(['my', 'round_long']).count().reset_index()

In [96]:

for _, row in lats_g.iterrows():
    lats[(int)(row['round_lat'] // slot - flight_airport['round_lat'].min() // slot), dates.ind
for _, row in longs_g.iterrows():
    longs[(int)(row['round_long'] // slot - flight_airport['round_long'].min() // slot), dates.
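
The band-by-month count matrix can also be produced without explicit loops. This is a minimal sketch on made-up coordinates, assuming the same 5-degree slots; the `toy` frame and the `lat_counts` name are hypothetical.

```python
import pandas as pd

# Synthetic stand-in for the flight/airport join (values are made up).
toy = pd.DataFrame({
    'my':  ['2006-01', '2006-01', '2006-02', '2006-02', '2006-02'],
    'lat': [21.3, 42.4, 21.3, 40.8, 42.4],
})

slot = 5
# Floor each latitude to the lower edge of its 5-degree band.
toy['round_lat'] = toy['lat'] // slot * slot

# Rows are latitude bands, columns are months, cells are flight counts.
lat_counts = (toy.groupby(['round_lat', 'my']).size()
                 .unstack(fill_value=0))
print(lat_counts)
```

Unlike the preallocated matrices above, bands with no flights are simply absent from the result unless it is reindexed. The resulting DataFrame could be passed to `sns.heatmap`, which uses its index and columns as tick labels.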

In [98]:

ax = sns.heatmap(lats, linewidths=0.5, cmap='coolwarm')

ax.set_xticks(ticks = range(len(dates)), labels=dates)
ax.set_yticks(ticks = range(len(lat_range)), labels=lat_range)

(12, 24)

In [99]:

ax = sns.heatmap(longs, linewidths=0.5, cmap='coolwarm')

ax.set_xticks(ticks = range(len(dates)), labels=dates)
ax.set_yticks(ticks = range(len(long_range)), labels=long_range)

In this approach, as shown in the figures, the number of flights per origin region changes little over time.
The heat map colors show the flight counts for each latitude and longitude slot.
The color of each slot does not change much over time.

Question 4
Can you detect cascading failures as delays in one airport create delays in others?

Group the flight data by date and destination, and separately by date and origin.

Aggregate the mean and median value of the departure delay attribute.
Join the two groupings on the destination and origin attributes within the same day.
For example, a flight from LA to DC is joined with another record from DC to another place on
the same day.
Because the data is too large, I only use the data from January 2006.
Calculate the ratio between the destination's and origin's departure delay. If the ratio is steady and close
to 1, delays at one airport are likely propagating to the next.
Use a scatter plot to plot the ratio over time for different airports.
In [127]:

flight_delay = flight[flight['DepDelay'] > 0]

flight_sub_dest_grouped = flight_delay.groupby(['Year', 'Month', 'DayofMonth', 'Dest'], as_index
flight_sub_orig_grouped = flight_delay.groupby(['Year', 'Month', 'DayofMonth', 'Origin'], as_ind
flight_sub_joined = pd.merge(flight_sub_dest_grouped, flight_sub_orig_grouped, how='inner', left_on=['Year', 'Month', 'DayofMonth', 'Dest'], right_on=['Year', 'Month', 'DayofMonth', 'Origin'])

PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance.
flight_sub_joined = pd.merge(flight_sub_dest_grouped, flight_sub_orig_grouped,
how='inner', left_on=['Year', 'Month', 'DayofMonth', 'Dest'], right_on=['Year',
'Month', 'DayofMonth', 'Origin'])

In [131]:

flight_sub_joined = flight_sub_joined[flight_sub_joined['Year'] == 2006]

flight_sub_joined = flight_sub_joined[flight_sub_joined['Month'] == 1]
# flight_sub_joined = flight_sub_joined[flight_sub_joined['DayofMonth'] <= 10]

flight_sub_joined['ymd'] = flight_sub_joined['Year'].astype(str) + flight_sub_joined['Month'].as
flight_sub_joined['ratio'] = flight_sub_joined['DepDelay_x']['median'] / flight_sub_joined['DepD
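
The inbound/outbound comparison can be sketched with single-level column names, which may also sidestep the multi-index warning above. This example uses made-up flights; the `inbound_median`, `outbound_median` and `Airport` names are assumptions, and the direction of the ratio (inbound over outbound) is inferred from the truncated cell rather than confirmed.

```python
import pandas as pd

# Synthetic stand-in for delayed flights on one day (values are made up).
toy = pd.DataFrame({
    'Year': [2006] * 4, 'Month': [1] * 4, 'DayofMonth': [1] * 4,
    'Origin':   ['LAX', 'SFO', 'IAD', 'IAD'],
    'Dest':     ['IAD', 'IAD', 'BOS', 'JFK'],
    'DepDelay': [30.0, 50.0, 20.0, 60.0],
})
keys = ['Year', 'Month', 'DayofMonth']

# Median departure delay of flights heading into each airport, per day.
inbound = (toy.groupby(keys + ['Dest'])['DepDelay'].median()
              .rename('inbound_median').reset_index()
              .rename(columns={'Dest': 'Airport'}))
# Median departure delay of flights leaving each airport, per day.
outbound = (toy.groupby(keys + ['Origin'])['DepDelay'].median()
               .rename('outbound_median').reset_index()
               .rename(columns={'Origin': 'Airport'}))

# Match each airport's inbound and outbound delays on the same day.
joined = inbound.merge(outbound, on=keys + ['Airport'], how='inner')
joined['ratio'] = joined['inbound_median'] / joined['outbound_median']
print(joined)
```

Only airports that appear as both a destination and an origin on the same day survive the inner join, so the ratio is defined for every remaining row.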

In [132]:

ax = sns.scatterplot(x='ymd', y='ratio', hue='Dest', data=flight_sub_joined, legend=None)

In [133]:

ax = sns.scatterplot(x='ymd', y='ratio', hue='Dest', data=flight_sub_joined[flight_sub_joined['r

As shown in the figures, the ratio between origin and destination delays shows no clear pattern,
and it is not close to 1.
So this analysis finds no evidence of cascading failures in which delays in one airport create delays in others.

Question 5
Use the available variables to construct a model that predicts delays.

Divide the dataset into two groups: delay and no delay.

Join the flight data with the airport data to get both origin and destination airport info.
Choose time, distance and location as attributes:
"Year", "Month", "DayOfWeek", "CRSDepTime", "CRSArrTime", "CRSElapsedTime", "Distance",
"lat_x", "long_x", "lat_y", "long_y"
Downsample the data to 20000 items, because the original dataset is too large.
Divide the dataset into train and test sets with a 7 : 3 ratio.
Train a random forest model.
Get the test result.

In [157]:

airport = pd.read_csv('./dataverse_files/airports.csv')
sampled_flight = flight.sample(20000)
flight_ori_airport = pd.merge(sampled_flight, airport, left_on="Origin", right_on="iata")
flight_src_airport = pd.merge(flight_ori_airport, airport, left_on="Dest", right_on="iata")
flight_src_airport['delay'] = flight_src_airport['ActualElapsedTime'] - flight_src_airport['CRSE

In [158]:

flightds = flight_src_airport[["Year", "Month", "DayOfWeek", "CRSDepTime", "CRSArrTime", "CRSEl

In [159]:

flightds['delay'] = np.where(flight_src_airport['delay'] > 0, 1, 0)


In [167]:

from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score, confusion_matrix, roc_curve, auc

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

In [162]:

X_train, X_test, Y_train, Y_test = train_test_split(flightds[["Year", "Month", "DayOfWeek", "CRS

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [163]:

clf = RandomForestClassifier()
model = clf.fit(X_train, Y_train)

In [164]:

Y_pred = model.predict(X_test)
cm = confusion_matrix(Y_test, Y_pred, labels = model.classes_)
cd = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels=model.classes_)
print(f'Acc: {accuracy_score(Y_test, Y_pred)}')

Acc: 0.6371666666666667
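
Accuracy alone hides which kind of error dominates. As a small illustration on made-up labels (not the notebook's results), the confusion matrix can be unpacked into its four cells to compute recall and precision for the delayed class.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = delayed, 0 = not delayed.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

# For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
recall = tp / (tp + fn)        # share of delayed flights that were caught
precision = tp / (tp + fp)     # share of predicted delays that were real
print(tn, fp, fn, tp, recall, precision)
```

Applied to `Y_test` and `Y_pred`, a low recall would correspond to a large number of false negatives.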

In [169]:

fpr, tpr, thre = roc_curve(Y_test, Y_pred)

roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, 'b', label = f'AUC = {roc_auc:0.2}')
plt.ylabel('TP rate')
plt.xlabel('FP rate')

Text(0.5, 0, 'FP rate')
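
An ROC curve built from hard 0/1 predictions has only a single interior point. A common alternative is to score with the predicted probability of the positive class. This sketch uses synthetic data from `make_classification` as a stand-in for the flight features; the variable names and parameters are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the flight features.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Probability of the positive class (column 1) instead of hard 0/1 labels.
scores = rf.predict_proba(X_te)[:, 1]
fpr_s, tpr_s, thresholds = roc_curve(y_te, scores)
roc_auc_s = auc(fpr_s, tpr_s)
print(f'AUC = {roc_auc_s:.2f}')
```

In the notebook, the equivalent change would be to pass `model.predict_proba(X_test)[:, 1]` to `roc_curve` in place of `Y_pred`.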

The model has an accuracy of about 64% (0.637).

There are a lot of false negatives (FNs), i.e. delayed flights predicted as not delayed.

