ML Project

PREDICTING FLIGHT PRICES

In [ ]: from PIL import Image


Image.open("C:/Users/Aditya Saxena/Desktop/Flightdescrip.jpg")

Out[ ]:

IMPORTING LIBRARIES
In [ ]: import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

LOADING THE DATASET


In [ ]: df_flight = pd.read_csv("C:/Users/Aditya Saxena/Downloads/Clean_Dataset.csv/Clean_Dataset.csv")

In [ ]: df_flight.head()

Out[ ]: Unnamed: 0 airline flight source_city departure_time stops arrival_time destination_city class duration days_left price

0 0 SpiceJet SG-8709 Delhi Evening zero Night Mumbai Economy 2.17 1 5953

1 1 SpiceJet SG-8157 Delhi Early_Morning zero Morning Mumbai Economy 2.33 1 5953

2 2 AirAsia I5-764 Delhi Early_Morning zero Early_Morning Mumbai Economy 2.17 1 5956

3 3 Vistara UK-995 Delhi Morning zero Afternoon Mumbai Economy 2.25 1 5955

4 4 Vistara UK-963 Delhi Morning zero Morning Mumbai Economy 2.33 1 5955

Using .sample() to view random rows gives a more representative look at the data than head() or tail(), where neighbouring rows tend to share the same values.

In [ ]: df_flight.sample(5)

Out[ ]: Unnamed: 0 airline flight source_city departure_time stops arrival_time destination_city class duration days_left price

218250 218250 Air_India AI-540 Delhi Night one Morning Kolkata Business 14.17 37 47545

138273 138273 Vistara UK-720 Kolkata Early_Morning two_or_more Evening Bangalore Economy 11.67 21 8111

227634 227634 Vistara UK-845 Mumbai Early_Morning one Afternoon Delhi Business 8.17 27 53152

262119 262119 Vistara UK-776 Kolkata Evening one Late_Night Delhi Business 6.58 16 66063

73971 73971 Air_India AI-687 Mumbai Afternoon one Night Hyderabad Economy 7.58 27 4173

Dropping the unnecessary column "Unnamed: 0" from the dataset.

In [ ]: df_flight.drop(columns='Unnamed: 0', inplace=True)

df_flight

Out[ ]: airline flight source_city departure_time stops arrival_time destination_city class duration days_left price

0 SpiceJet SG-8709 Delhi Evening zero Night Mumbai Economy 2.17 1 5953

1 SpiceJet SG-8157 Delhi Early_Morning zero Morning Mumbai Economy 2.33 1 5953

2 AirAsia I5-764 Delhi Early_Morning zero Early_Morning Mumbai Economy 2.17 1 5956

3 Vistara UK-995 Delhi Morning zero Afternoon Mumbai Economy 2.25 1 5955

4 Vistara UK-963 Delhi Morning zero Morning Mumbai Economy 2.33 1 5955

... ... ... ... ... ... ... ... ... ... ... ...

300148 Vistara UK-822 Chennai Morning one Evening Hyderabad Business 10.08 49 69265

300149 Vistara UK-826 Chennai Afternoon one Night Hyderabad Business 10.42 49 77105

300150 Vistara UK-832 Chennai Early_Morning one Night Hyderabad Business 13.83 49 79099

300151 Vistara UK-828 Chennai Early_Morning one Evening Hyderabad Business 10.00 49 81585

300152 Vistara UK-822 Chennai Morning one Evening Hyderabad Business 10.08 49 81585

300153 rows × 11 columns

In [ ]: df_flight.shape

Out[ ]: (300153, 11)

In [ ]: df_flight.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300153 entries, 0 to 300152
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 airline 300153 non-null object
1 flight 300153 non-null object
2 source_city 300153 non-null object
3 departure_time 300153 non-null object
4 stops 300153 non-null object
5 arrival_time 300153 non-null object
6 destination_city 300153 non-null object
7 class 300153 non-null object
8 duration 300153 non-null float64
9 days_left 300153 non-null int64
10 price 300153 non-null int64
dtypes: float64(1), int64(2), object(8)
memory usage: 25.2+ MB

The dataset consists of 11 columns, with 8 being categorical and 3 being numerical.
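
The split between categorical and numerical columns can also be checked programmatically instead of being read off the info() output. A minimal sketch on a small stand-in frame; the name `toy` and its rows are illustrative, not the full dataset:

```python
import pandas as pd

# Toy stand-in for df_flight with the same kinds of columns (illustrative rows)
toy = pd.DataFrame({
    "airline": ["SpiceJet", "Vistara"],
    "class": ["Economy", "Business"],
    "duration": [2.17, 10.08],
    "days_left": [1, 49],
    "price": [5953, 69265],
})

# select_dtypes splits the columns by dtype family
num_cols = toy.select_dtypes(include="number").columns.tolist()
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numerical:", num_cols)
print("categorical:", cat_cols)
```

Running the same two select_dtypes calls on df_flight should reproduce the 8 / 3 split reported by info().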

In [ ]: df_flight.isnull().sum()

Out[ ]: airline 0
flight 0
source_city 0
departure_time 0
stops 0
arrival_time 0
destination_city 0
class 0
duration 0
days_left 0
price 0
dtype: int64

No missing values are found

In [ ]: df_flight.duplicated().sum()

Out[ ]: 0

No duplicated rows are found

In [ ]: df_flight.describe()

Out[ ]:
            duration      days_left          price
count  300153.000000  300153.000000  300153.000000
mean       12.221021      26.004751   20889.660523
std         7.191997      13.561004   22697.767366
min         0.830000       1.000000    1105.000000
25%         6.830000      15.000000    4783.000000
50%        11.250000      26.000000    7425.000000
75%        16.170000      38.000000   42521.000000
max        49.830000      49.000000  123071.000000

The average ticket price is approximately 21,000 INR, while the median is only 7,425 INR. A mean this far above the median indicates a right-skewed distribution with a tail of expensive tickets.
The days_left column has a roughly symmetrical distribution (its mean and median are almost equal).
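
The mean-versus-median comparison used above can be illustrated on a handful of numbers. A minimal sketch; the prices below are made up for illustration and are not taken from the dataset:

```python
import pandas as pd

# Illustrative prices: mostly cheap fares plus a few expensive ones (made-up values)
prices = pd.Series([4000, 5000, 6000, 7000, 8000, 50000, 60000])

mean_price = prices.mean()
median_price = prices.median()

# A mean far above the median is a quick sign of a right-skewed distribution
print(mean_price, median_price, mean_price > median_price)
```

The few expensive values pull the mean well above the median, which is the same pattern seen in the price column.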

In [ ]: corr = df_flight.corr(numeric_only=True)
corr

Out[ ]:
            duration  days_left     price
duration    1.000000  -0.039157  0.204222
days_left  -0.039157   1.000000 -0.091949
price       0.204222  -0.091949  1.000000

Flight duration and ticket price are positively but weakly correlated (r ≈ 0.20), while days_left has a slight negative correlation with price (r ≈ -0.09).

In [ ]: sns.heatmap(corr, cmap='RdBu', vmin=-1, vmax= 1, annot=True)

Out[ ]: <Axes: >

EXPLORATORY DATA ANALYSIS

1) Class

In [ ]: df_flight.value_counts(['class'])

Out[ ]: class
Economy 206666
Business 93487
Name: count, dtype: int64

In [ ]: df_flight.value_counts(['class']).plot(kind='pie', autopct='%.2f%%')

Out[ ]: <Axes: ylabel='count'>

About 69 percent of the tickets in the dataset are Economy class, while the remaining 31 percent are Business class.
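
The class shares can be computed directly from the counts instead of being read off the pie chart. A minimal sketch using the two counts from the value_counts output above; the variable names are illustrative:

```python
import pandas as pd

# Counts copied from the value_counts output above
counts = pd.Series({"Economy": 206666, "Business": 93487})

# Convert the counts to percentages of the total
shares = (counts / counts.sum() * 100).round(2)
print(shares)
```

On the full frame, df_flight['class'].value_counts(normalize=True) gives the same proportions as fractions.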

In [ ]: fig, ax = plt.subplots(figsize = (8, 5))

sns.barplot(df_flight, x='airline',y = 'price',hue = 'class', palette= 'rocket', estimator='mean')

ax.set_title("Multivariate BarPlot", loc = 'center', fontsize = 15)

Out[ ]: Text(0.5, 1.0, 'Multivariate BarPlot')

Among the airlines, only Vistara and Air_India offer Business class, which is why their prices are high.
A Business class fare is two to five times higher than an Economy class fare.
Vistara has the most expensive tickets, followed by Air_India.
AirAsia has the cheapest tickets.

2) Days_Left

In [ ]: plt.figure(figsize=(15,6))
sns.barplot( data=df_flight, x='days_left', y='price', palette='magma')
plt.title('Price Variation with Days Left Before Departure')
plt.xlabel('Days Left Before Departure')
plt.ylabel('Price')
plt.grid(True)
plt.show()

Ticket prices increase as the number of days left before departure decreases, but they drop considerably when only one day is left before travelling.
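
The bar heights in the plot above are mean prices per value of days_left, so the same numbers can be inspected with a groupby. A minimal sketch on a toy frame; the rows are made up for illustration:

```python
import pandas as pd

# Toy frame with the two relevant columns (made-up values)
toy = pd.DataFrame({
    "days_left": [1, 1, 2, 2, 20, 20],
    "price": [20000, 22000, 30000, 28000, 5000, 7000],
})

# Mean price for each number of days left, which is what the bar plot displays
mean_by_day = toy.groupby("days_left")["price"].mean()
print(mean_by_day)
```

Applied to df_flight, the resulting series can be sorted or sliced to check exactly where the price peaks before departure.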

3) Source_City - Destination_City

In [ ]: fig, axes = plt.subplots(1, 2, figsize = (15, 5))

sns.barplot(ax = axes[0], data = df_flight, x = 'source_city', y = 'price', palette='rocket')


axes[0].set_title("Price Vs Source_City", loc = 'center', fontsize = 15)

sns.barplot(ax = axes[1], data = df_flight, x = 'destination_city', y = 'price', palette= 'rocket')


axes[1].set_title("Price Vs Destination_City", loc = 'center', fontsize = 15)

plt.tight_layout()

plt.show()

Delhi is the cheapest source and destination city, followed by Hyderabad.

Mumbai, Bangalore, Kolkata and Chennai have similar prices.

4) Arrival_Time - Departure_Time

In [ ]: fig, axes = plt.subplots(1, 2, figsize = (15, 5))

sns.barplot(ax = axes[0], data = df_flight, x = 'arrival_time', y = 'price', palette='rocket')


axes[0].set_title("Price Vs ArrivalTime", loc = 'center', fontsize = 15)

sns.barplot(ax = axes[1], data = df_flight, x = 'departure_time', y = 'price', palette='rocket')


axes[1].set_title("Price Vs DepartureTime", loc = 'center', fontsize = 15)

plt.tight_layout()

plt.show()

Flights that depart or arrive late at night are the cheapest, followed by afternoon and early-morning flights.

DATA CLEANING

Detecting and Removing Outliers

1) Price

In [ ]: fig, ax = plt.subplots(figsize = (8, 5))

sns.boxplot(x =df_flight['price'])
ax.set_title("'Price' Boxplot Before Outliers Removal", loc = 'center', fontsize = 15)
ax.set_xlabel("INR", fontsize = 10)

plt.show()

In [ ]: df_flight['price'].skew()

Out[ ]: 1.0613772532064343

Since the value is above 1, the distribution is strongly right-skewed.

Most data points are concentrated on the left side, and the outliers lie in the long right tail, as shown in the boxplot above.

In [ ]: # Finding percentile25, percentile75, IQR, upper_limit & lower_limit


percentile25 = df_flight['price'].quantile(0.25)
percentile75 = df_flight['price'].quantile(0.75)
print("Percentile25: ",percentile25)
print("Percentile75: ",percentile75)

iqr = percentile75-percentile25
print("IQR: ",iqr)

upper_limit = percentile75 + 1.5 * iqr


lower_limit = percentile25 - 1.5 * iqr
print("Upper_limit: ",upper_limit)
print("Lower_limit: ",lower_limit)

Percentile25: 4783.0
Percentile75: 42521.0
IQR: 37738.0
Upper_limit: 99128.0
Lower_limit: -51824.0

In [ ]: df_flight[df_flight['price']> upper_limit]

Out[ ]: airline flight source_city departure_time stops arrival_time destination_city class duration days_left price

215858 Vistara UK-809 Delhi Evening two_or_more Evening Kolkata Business 21.08 1 114434

215859 Vistara UK-809 Delhi Evening two_or_more Evening Kolkata Business 21.08 1 116562

216025 Vistara UK-817 Delhi Evening two_or_more Morning Kolkata Business 17.58 4 100395

216094 Vistara UK-995 Delhi Morning one Evening Kolkata Business 6.50 5 99129

216095 Vistara UK-963 Delhi Morning one Evening Kolkata Business 8.00 5 101369

... ... ... ... ... ... ... ... ... ... ... ...

293474 Vistara UK-836 Chennai Morning one Night Bangalore Business 9.67 3 107597

296001 Vistara UK-838 Chennai Night one Morning Kolkata Business 11.50 3 102832

296081 Vistara UK-832 Chennai Early_Morning one Night Kolkata Business 15.83 5 102384

296170 Vistara UK-838 Chennai Night one Morning Kolkata Business 11.50 7 104624

296404 Vistara UK-838 Chennai Night one Evening Kolkata Business 21.00 12 102384

123 rows × 11 columns

In [ ]: df_newflight = df_flight[(df_flight['price']>lower_limit) & (df_flight['price']<upper_limit)].copy()

# The original row labels are kept (the index is not reset), so they still match df_flight
df_newflight.shape

Out[ ]: (300030, 11)

123 Outliers in 'price' column are removed.
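
The quartile-based filtering above can be wrapped in a small helper so that the same steps apply to any numeric column. A minimal sketch; `iqr_bounds` and `drop_iqr_outliers` are hypothetical helper names that do not appear in the notebook, and the toy values are illustrative:

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) limits at k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_iqr_outliers(df, column, k=1.5):
    """Keep only the rows whose value in `column` lies strictly inside the limits."""
    lower, upper = iqr_bounds(df[column], k)
    mask = (df[column] > lower) & (df[column] < upper)
    return df[mask].copy()

# Toy data with one obvious outlier (made-up values)
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5, 100]})
lower, upper = iqr_bounds(toy["x"])
cleaned = drop_iqr_outliers(toy, "x")
print(lower, upper, len(cleaned))
```

With such a helper, the price and duration steps would each reduce to a single call, for example drop_iqr_outliers(df_flight, 'price').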

In [ ]: fig, ax = plt.subplots(figsize = (8, 5))

sns.boxplot(x =df_newflight['price'])
ax.set_title("'Price' Boxplot After Outliers Removal", loc = 'center', fontsize = 15)
ax.set_xlabel("INR", fontsize = 10)

plt.show()

2) Duration

In [ ]: fig, ax = plt.subplots(figsize = (8, 5))

sns.boxplot(x =df_newflight['duration'])
ax.set_title("'Duration' Boxplot Before Outliers Removal", loc = 'center', fontsize = 15)
ax.set_xlabel("Hrs", fontsize = 10)

plt.show()

In [ ]: df_newflight['duration'].skew()

Out[ ]: 0.6030484682602605

Since the value is between 0.5 and 1, the data are moderately skewed.
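
The rule of thumb applied to both skewness values can be written down explicitly. A minimal sketch; `describe_skew` is a hypothetical helper, and the "below 0.5" band is an assumption based on the commonly used convention, since the notebook only states the other two bands:

```python
def describe_skew(value):
    """Classify a skewness value using a common rule of thumb."""
    magnitude = abs(value)
    if magnitude < 0.5:
        return "approximately symmetric"  # assumed band, not stated in the notebook
    if magnitude <= 1:
        return "moderately skewed"
    return "highly skewed"

# The two skewness values reported in the notebook
print(describe_skew(1.0613772532064343))  # price
print(describe_skew(0.6030484682602605))  # duration
```

The sign of the value gives the direction: positive values, as in both columns here, mean a longer right tail.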

In [ ]: # Finding the IQR


percentile25 = df_newflight['duration'].quantile(0.25)
percentile75 = df_newflight['duration'].quantile(0.75)
print("Percentile25: ",percentile25)
print("Percentile75: ",percentile75)

iqr = percentile75-percentile25
print("IQR: ",iqr)

upper_limit = percentile75 + 1.5 * iqr


lower_limit = percentile25 - 1.5 * iqr
print("Upper_limit: ",upper_limit)
print("Lower_limit: ",lower_limit)

Percentile25: 6.83
Percentile75: 16.17
IQR: 9.340000000000002
Upper_limit: 30.180000000000003
Lower_limit: -7.1800000000000015

In [ ]: df_newflight[df_newflight['duration']> upper_limit]

Out[ ]: airline flight source_city departure_time stops arrival_time destination_city class duration days_left price

10534 Vistara UK-706 Delhi Afternoon two_or_more Night Bangalore Economy 31.25 4 12222

10535 Vistara UK-706 Delhi Afternoon two_or_more Night Bangalore Economy 33.17 4 12222

10540 Air_India AI-9887 Delhi Early_Morning two_or_more Evening Bangalore Economy 36.92 4 12321

10891 Vistara UK-706 Delhi Afternoon two_or_more Night Bangalore Economy 31.25 6 12222

10892 Vistara UK-706 Delhi Afternoon two_or_more Night Bangalore Economy 33.17 6 12222

... ... ... ... ... ... ... ... ... ... ... ...

296064 Air_India AI-440 Chennai Early_Morning one Afternoon Kolkata Business 30.33 5 55377

296297 Air_India AI-440 Chennai Early_Morning one Afternoon Kolkata Business 30.33 10 55377

296391 Air_India AI-440 Chennai Early_Morning one Afternoon Kolkata Business 30.33 12 55377

296716 Air_India AI-440 Chennai Early_Morning one Afternoon Kolkata Business 30.33 19 55377

297661 Air_India AI-440 Chennai Early_Morning one Afternoon Kolkata Business 30.33 40 55377

2110 rows × 11 columns

In [ ]: df_newflight = df_newflight[(df_newflight['duration']>lower_limit) & (df_newflight['duration']<upper_limit)].copy()

# The original row labels are kept (the index is not reset)
df_newflight.shape

Out[ ]: (297920, 11)

2110 Outliers in 'duration' column are removed.

In [ ]: fig, ax = plt.subplots(figsize = (8, 5))

sns.boxplot(x =df_newflight['duration'])
ax.set_title("'Duration' Boxplot After Outliers Removal", loc = 'center', fontsize = 15)
ax.set_xlabel("Hrs", fontsize = 10)

plt.show()

3) DaysLeft

In [ ]: fig, ax = plt.subplots(figsize = (8, 5))

sns.boxplot(x =df_newflight['days_left'])
ax.set_title("'Days Left' Boxplot", loc = 'center', fontsize = 15)
ax.set_xlabel("Days", fontsize = 10)

plt.show()

No Outliers detected.

DATA PREPROCESSING

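In [ ]: # Label-encode each listed column as integer category codes (categories are sorted, so the codes follow that sorted order)
cat_cols = ['airline', 'source_city', 'departure_time', 'stops', 'arrival_time', 'destination_city', 'class','flight','days_left']

for col in cat_cols:
    df_newflight[col] = df_newflight[col].astype('category').cat.codes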

In [ ]: df_newflight.head()

Out[ ]: airline flight source_city departure_time stops arrival_time destination_city class duration days_left price

0 4 1408 2 2 2 5 5 1 2.17 0 5953

1 4 1387 2 1 2 4 5 1 2.33 0 5953

2 0 1213 2 1 2 1 5 1 2.17 0 5956

3 5 1559 2 4 2 0 5 1 2.25 0 5955

4 5 1549 2 4 2 4 5 1 2.33 0 5955
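
Because .cat.codes replaces the labels with integers, it helps to keep the code-to-label mapping before overwriting a column. A minimal sketch on a toy version of the stops column; the variable names are illustrative:

```python
import pandas as pd

# Toy version of the 'stops' column
stops = pd.Series(["zero", "one", "two_or_more", "one"])

cat = stops.astype("category")
codes = cat.cat.codes

# Categories are sorted alphabetically, so the codes do not follow the number of stops
mapping = dict(enumerate(cat.cat.categories))
print(mapping)
print(codes.tolist())
```

This matches the encoded head() output above, where 'zero' stops appears as code 2. The codes are arbitrary labels, not an ordering by number of stops, which matters more for a linear model than for tree-based ones.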

In [ ]: from sklearn.model_selection import train_test_split

X = df_newflight.drop(columns=['price'])
y = df_newflight['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [ ]: X_train.shape, X_test.shape, y_train.shape, y_test.shape

Out[ ]: ((238336, 10), (59584, 10), (238336,), (59584,))

MODEL SELECTION

In [ ]: from sklearn.linear_model import LinearRegression


from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

MODEL TRAINING AND EVALUATION

LINEAR REGRESSION

In [ ]: from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error, explained_variance_score

models = {
    "Linear Regression": LinearRegression(),
}

for name, model in models.items():
    # Train the model
    model.fit(X_train, y_train)

    # Make predictions on the test set
    y_pred = model.predict(X_test)

    # Calculate Mean Squared Error (MSE)
    mse = mean_squared_error(y_test, y_pred)

    # Calculate Mean Absolute Error (MAE)
    mae = mean_absolute_error(y_test, y_pred)

    # Calculate R-squared (R2) score
    r2 = r2_score(y_test, y_pred)

    # Calculate Explained Variance Score
    explained_variance = explained_variance_score(y_test, y_pred)

    # For regressors, score() returns R-squared; it is reported below as "Accuracy of Model"
    score = model.score(X_test, y_test)

    print(f"Mean Squared Error (MSE): {mse:.3f}")
    print(f"Mean Absolute Error (MAE): {mae:.3f}")
    print(f"R-squared (R2) score: {r2:.3f}")
    print(f"Accuracy of Model: {score*100:.3f}")

Mean Squared Error (MSE): 48631250.548


Mean Absolute Error (MAE): 4630.652
R-squared (R2) score: 0.906
Accuracy of Model: 90.562
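
The metrics printed above can be reproduced by hand, which also shows why the "Accuracy of Model" line is simply R-squared times 100: for scikit-learn regressors, score() returns R-squared. A minimal NumPy sketch on made-up values; the arrays are illustrative and not taken from the test set:

```python
import numpy as np

# Made-up true and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

errors = y_true - y_hat

mse = np.mean(errors ** 2)       # mean squared error
rmse = np.sqrt(mse)              # root mean squared error, in the target's own units
mae = np.mean(np.abs(errors))    # mean absolute error

# R-squared: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, rmse, mae, r2)
```

Taking the square root of the reported MSE gives an error in INR, which is easier to compare with the MAE than the squared value.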

DECISION TREE REGRESSOR

In [ ]: from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

models = {
    "Decision Tree": DecisionTreeRegressor(),
}

for name, model in models.items():
    # Train the model
    model.fit(X_train, y_train)

    # Make predictions on the test set
    y_pred = model.predict(X_test)

    # Calculate Mean Squared Error (MSE)
    mse = mean_squared_error(y_test, y_pred)

    # Calculate Mean Absolute Error (MAE)
    mae = mean_absolute_error(y_test, y_pred)

    # Calculate R-squared (R2) score
    r2 = r2_score(y_test, y_pred)

    # Accuracy of Model
    score = model.score(X_test, y_test)

    print(f"Mean Squared Error (MSE): {mse:.3f}")
    print(f"Mean Absolute Error (MAE): {mae:.3f}")
    print(f"R-squared (R2) score: {r2:.3f}")
    print(f"Accuracy of Model: {score*100:.3f}")

Mean Squared Error (MSE): 8294294.911


Mean Absolute Error (MAE): 887.874
R-squared (R2) score: 0.984
Accuracy of Model: 98.390

In [ ]: fig, ax = plt.subplots(figsize = (8, 6))

sns.scatterplot(x = y_test, y = y_pred)


ax.set_title("'Real x Predicted' DecisionTreeRegressor", loc = 'left', fontsize = 15, pad = 10)
ax.set_xlabel("Real", fontsize = 10)
ax.set_ylabel("Predicted", fontsize = 10)

plt.show()

RANDOM FOREST REGRESSOR

In [ ]: from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

models = {
    "Random Forest": RandomForestRegressor(),
}

for name, model in models.items():
    # Train the model
    model.fit(X_train, y_train)

    # Make predictions on the test set
    y_pred = model.predict(X_test)

    # Calculate Mean Squared Error (MSE)
    mse = mean_squared_error(y_test, y_pred)

    # Calculate Mean Absolute Error (MAE)
    mae = mean_absolute_error(y_test, y_pred)

    # Calculate R-squared (R2) score
    r2 = r2_score(y_test, y_pred)

    # Accuracy of Model
    score = model.score(X_test, y_test)

    print(f"Mean Squared Error (MSE): {mse:.3f}")
    print(f"Mean Absolute Error (MAE): {mae:.3f}")
    print(f"R-squared (R2) score: {r2:.3f}")
    print(f"Accuracy of Model: {score*100:.3f}")

Mean Squared Error (MSE): 5238118.029


Mean Absolute Error (MAE): 864.032
R-squared (R2) score: 0.990
Accuracy of Model: 98.983

In [ ]: fig, ax = plt.subplots(figsize = (8, 6))

sns.scatterplot(x = y_test, y = y_pred)


ax.set_title("'Real x Predicted' RandomForestRegressor", loc = 'left', fontsize = 15, pad = 10)
ax.set_xlabel("Real", fontsize = 10)
ax.set_ylabel("Predicted", fontsize = 10)

plt.show()

Model Prediction

The RandomForestRegressor achieved the highest R-squared (0.990) and the lowest errors, followed by the DecisionTreeRegressor with an R-squared of 0.984.
Linear Regression achieved an R-squared of 0.906.
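
The three sets of results can be placed side by side for comparison. A minimal sketch that copies the MSE, MAE and R-squared values from the outputs above into one table; the `rmse` column is derived here and is not printed in the notebook:

```python
import pandas as pd

# Metric values copied from the three model outputs above
results = pd.DataFrame({
    "model": ["Linear Regression", "Decision Tree", "Random Forest"],
    "mse": [48631250.548, 8294294.911, 5238118.029],
    "mae": [4630.652, 887.874, 864.032],
    "r2": [0.906, 0.984, 0.990],
})

# Root mean squared error puts the squared error back into INR
results["rmse"] = results["mse"] ** 0.5

# Best model first
results = results.sort_values("r2", ascending=False).reset_index(drop=True)
print(results)
```

Note that the tree-based models were trained with default hyperparameters and without a fixed random_state, so rerunning the notebook may give slightly different numbers.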
