ML Project

PREDICTING FLIGHT PRICES
In [ ]: from PIL import Image

Image.open("C:/Users/Aditya Saxena/Desktop/Flightdescrip.jpg")
Out[ ]:
IMPORTING LIBRARIES
In [ ]: import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
LOADING THE DATASET

In [ ]: df_flight = pd.read_csv("C:/Users/Aditya Saxena/Downloads/Clean_Dataset.csv/Clean_Dataset.csv")
In [ ]: df_flight.head()
Out[ ]: Unnamed: 0 airline flight source_city departure_time stops arrival_time destination_city class duration days_left price
0 0 SpiceJet SG-8709 Delhi Evening zero Night Mumbai Economy 2.17 1 5953
1 1 SpiceJet SG-8157 Delhi Early_Morning zero Morning Mumbai Economy 2.33 1 5953
2 2 AirAsia I5-764 Delhi Early_Morning zero Early_Morning Mumbai Economy 2.17 1 5956
3 3 Vistara UK-995 Delhi Morning zero Afternoon Mumbai Economy 2.25 1 5955
4 4 Vistara UK-963 Delhi Morning zero Morning Mumbai Economy 2.33 1 5955
Using .sample() to inspect random rows gives a better picture of the data, because head() and tail() only show neighbouring rows that share the same route and class.
In [ ]: df_flight.sample(5)
Out[ ]: Unnamed: 0 airline flight source_city departure_time stops arrival_time destination_city class duration days_left price
218250 218250 Air_India AI-540 Delhi Night one Morning Kolkata Business 14.17 37 47545
138273 138273 Vistara UK-720 Kolkata Early_Morning two_or_more Evening Bangalore Economy 11.67 21 8111
227634 227634 Vistara UK-845 Mumbai Early_Morning one Afternoon Delhi Business 8.17 27 53152
262119 262119 Vistara UK-776 Kolkata Evening one Late_Night Delhi Business 6.58 16 66063
73971 73971 Air_India AI-687 Mumbai Afternoon one Night Hyderabad Economy 7.58 27 4173
Dropping the unnecessary column "Unnamed: 0" from the dataset.
In [ ]: df_flight.drop(columns='Unnamed: 0', inplace=True)
df_flight
Out[ ]: airline flight source_city departure_time stops arrival_time destination_city class duration days_left price
0 SpiceJet SG-8709 Delhi Evening zero Night Mumbai Economy 2.17 1 5953
1 SpiceJet SG-8157 Delhi Early_Morning zero Morning Mumbai Economy 2.33 1 5953
2 AirAsia I5-764 Delhi Early_Morning zero Early_Morning Mumbai Economy 2.17 1 5956
3 Vistara UK-995 Delhi Morning zero Afternoon Mumbai Economy 2.25 1 5955
4 Vistara UK-963 Delhi Morning zero Morning Mumbai Economy 2.33 1 5955
... ... ... ... ... ... ... ... ... ... ... ...
300148 Vistara UK-822 Chennai Morning one Evening Hyderabad Business 10.08 49 69265
300149 Vistara UK-826 Chennai Afternoon one Night Hyderabad Business 10.42 49 77105
300150 Vistara UK-832 Chennai Early_Morning one Night Hyderabad Business 13.83 49 79099
300151 Vistara UK-828 Chennai Early_Morning one Evening Hyderabad Business 10.00 49 81585
300152 Vistara UK-822 Chennai Morning one Evening Hyderabad Business 10.08 49 81585
300153 rows × 11 columns
In [ ]: df_flight.shape
Out[ ]: (300153, 11)
In [ ]: df_flight.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300153 entries, 0 to 300152
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 airline 300153 non-null object
1 flight 300153 non-null object
2 source_city 300153 non-null object
3 departure_time 300153 non-null object
4 stops 300153 non-null object
5 arrival_time 300153 non-null object
6 destination_city 300153 non-null object
7 class 300153 non-null object
8 duration 300153 non-null float64
9 days_left 300153 non-null int64
10 price 300153 non-null int64
dtypes: float64(1), int64(2), object(8)
memory usage: 25.2+ MB
The dataset consists of 11 columns, with 8 being categorical and 3 being numerical.
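The categorical/numerical split can also be read off programmatically instead of counting rows in the info() output. The following is a minimal sketch on a toy frame (three rows copied from the head() output, reduced to four columns); the names toy, categorical_cols and numerical_cols are illustrative only.

```python
import pandas as pd

# Toy frame: values copied from the head() output, reduced to four columns
toy = pd.DataFrame({
    "airline": ["SpiceJet", "AirAsia", "Vistara"],
    "class": ["Economy", "Economy", "Economy"],
    "duration": [2.17, 2.17, 2.25],
    "price": [5953, 5956, 5955],
})

# Numeric columns (int64 / float64) versus everything else
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Categorical:", categorical_cols)
print("Numerical:", numerical_cols)
```

Applied to df_flight, the same two calls should list the 8 text columns and the 3 numeric columns reported above.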
In [ ]: df_flight.isnull().sum()
Out[ ]: airline 0
flight 0
source_city 0
departure_time 0
stops 0
arrival_time 0
destination_city 0
class 0
duration 0
days_left 0
price 0
dtype: int64
No missing values are found
In [ ]: df_flight.duplicated().sum()
Out[ ]: 0
No duplicated rows are found
In [ ]: df_flight.describe()
Out[ ]: duration days_left price
count 300153.000000 300153.000000 300153.000000
mean 12.221021 26.004751 20889.660523
std 7.191997 13.561004 22697.767366
min 0.830000 1.000000 1105.000000
25% 6.830000 15.000000 4783.000000
50% 11.250000 26.000000 7425.000000
75% 16.170000 38.000000 42521.000000
max 49.830000 49.000000 123071.000000
The average ticket price is approximately 21,000 INR, while the median is only 7,425 INR. A mean this far above the median indicates a right-skewed price distribution with high-priced outliers.
The days_left column is roughly symmetric (mean 26.0 and median 26 are almost equal).
In [ ]: corr = df_flight.corr(numeric_only=True)
corr
Out[ ]: duration days_left price
duration 1.000000 -0.039157 0.204222
days_left -0.039157 1.000000 -0.091949
price 0.204222 -0.091949 1.000000
Flight duration and ticket price are weakly positively correlated (r ≈ 0.20), while days_left and price show a very weak negative correlation (r ≈ -0.09).
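DataFrame.corr() computes the Pearson correlation coefficient by default. As a hedged illustration of what that number measures, the sketch below computes it by hand with NumPy; the helper pearson_r and the toy values are made up for this example and are not part of the notebook.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance of x and y scaled by both standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()          # centre both variables
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Made-up values: price rises exactly linearly with duration, so r should be 1
toy_duration = [2.0, 4.0, 6.0, 8.0]
toy_price = [10.0, 20.0, 30.0, 40.0]

r = pearson_r(toy_duration, toy_price)
print(round(r, 6))
```

A value near 1 or -1 means an almost perfectly linear relationship, so the 0.20 between duration and price points to only a weak linear link.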
In [ ]: sns.heatmap(corr, cmap='RdBu', vmin=-1, vmax= 1, annot=True)
Out[ ]: <Axes: >
EXPLORATORY DATA ANALYSIS
1) Class
In [ ]: df_flight.value_counts(['class'])
Out[ ]: class
Economy 206666
Business 93487
Name: count, dtype: int64
In [ ]: df_flight.value_counts(['class']).plot(kind='pie', autopct='%.2f%%')
Out[ ]: <Axes: ylabel='count'>
About 69 percent of the records (206,666 of 300,153) are Economy class tickets, while the remaining 31 percent are Business class.
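The percentages shown on the pie chart can be checked directly from the counts. This is a small sketch that rebuilds the two counts from the value_counts() output as a Series (the names counts and share are illustrative); on the full frame, value_counts(normalize=True) would give the same fractions.

```python
import pandas as pd

# Counts copied from the value_counts() output for the 'class' column
counts = pd.Series({"Economy": 206666, "Business": 93487})

# Share of each class in percent, rounded to two decimals
share = (counts / counts.sum() * 100).round(2)
print(share)
```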
In [ ]: fig, ax = plt.subplots(figsize = (8, 5))
sns.barplot(df_flight, x='airline',y = 'price',hue = 'class', palette= 'rocket', estimator='mean')
ax.set_title("Multivariate BarPlot", loc = 'center', fontsize = 15)
Out[ ]: Text(0.5, 1.0, 'Multivariate BarPlot')
Among the airlines, only Vistara and Air_India offer Business class, which is why their prices are high.
Business class fares are two to five times higher than Economy class fares.
The most expensive tickets are Vistara's, followed by Air_India's.
The cheapest tickets are AirAsia's.
2) Days_Left
In [ ]: plt.figure(figsize=(15,6))
sns.barplot( data=df_flight, x='days_left', y='price', palette='magma')
plt.title('Price Variation with Days Left Before Departure')
plt.xlabel('Days Left Before Departure')
plt.ylabel('Price')
plt.grid(True)
plt.show()
Ticket prices increase as the number of days left before departure decreases, but they drop considerably when only one day is left before travelling.
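By default sns.barplot draws the mean of y for each x value, so the bar heights correspond to a grouped mean. A minimal sketch of that aggregation, using made-up toy prices rather than the real dataset:

```python
import pandas as pd

# Made-up toy data: two tickets for each value of days_left
toy = pd.DataFrame({
    "days_left": [1, 1, 2, 2, 3, 3],
    "price": [100, 120, 200, 220, 150, 170],
})

# Mean price per days_left, the quantity each bar represents
mean_price = toy.groupby("days_left")["price"].mean()
print(mean_price)
```

Running the same groupby on df_flight would give the exact numbers behind the chart, which is easier to compare than reading bar heights.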
3) Source_City - Destination_City
In [ ]: fig, axes = plt.subplots(1, 2, figsize = (15, 5))
sns.barplot(ax = axes[0], data = df_flight, x = 'source_city', y = 'price', palette='rocket')

axes[0].set_title("Price Vs Source_City", loc = 'center', fontsize = 15)
sns.barplot(ax = axes[1], data = df_flight, x = 'destination_city', y = 'price', palette= 'rocket')

axes[1].set_title("Price Vs Destination_City", loc = 'center', fontsize = 15)
plt.tight_layout()
plt.show()
Delhi is the cheapest source and destination city, followed by Hyderabad.

Mumbai, Bangalore, Kolkata and Chennai have similar prices.
4) Arrival_Time - Departure_Time
In [ ]: fig, axes = plt.subplots(1, 2, figsize = (15, 5))
sns.barplot(ax = axes[0], data = df_flight, x = 'arrival_time', y = 'price', palette='rocket')

axes[0].set_title("Price Vs ArrivalTime", loc = 'center', fontsize = 15)
sns.barplot(ax = axes[1], data = df_flight, x = 'departure_time', y = 'price', palette='rocket')

axes[1].set_title("Price Vs DepartureTime", loc = 'center', fontsize = 15)
plt.tight_layout()
plt.show()
Flights that depart or arrive late at night are the cheapest, followed by afternoon and early-morning flights.
DATA CLEANING
Detecting and Removing Outliers
1) Price
In [ ]: fig, ax = plt.subplots(figsize = (8, 5))
sns.boxplot(x =df_flight['price'], ax = ax)
ax.set_title("'Price' Boxplot Before Outliers Removal", loc = 'center', fontsize = 15)
ax.set_xlabel("INR", fontsize = 10)
plt.show()
In [ ]: df_flight['price'].skew()
Out[ ]: 1.0613772532064343
Since the skewness is above 1, the distribution is strongly right-skewed.

Most data points are concentrated on the left side, and the long right tail contains the outliers shown in the boxplot above.
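To make the skewness reading concrete, the sketch below compares a symmetric series with a right-skewed one. Both series are made-up toy values, not taken from the dataset.

```python
import pandas as pd

# Made-up toy series
symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 1, 2, 2, 3, 20])   # one large value stretches the right tail

print("Symmetric skew:", symmetric.skew())
print("Right-skewed skew:", right_skewed.skew())

# In a right-skewed series the mean is pulled above the median
print("Mean:", right_skewed.mean(), "Median:", right_skewed.median())
```

The same pattern appears in the price column: a skewness above 1 together with a mean (about 21,000 INR) far above the median (7,425 INR).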
In [ ]: # Finding percentile25, percentile75, IQR, upper_limit & lower_limit

percentile25 = df_flight['price'].quantile(0.25)
percentile75 = df_flight['price'].quantile(0.75)
print("Percentile25: ",percentile25)
print("Percentile75: ",percentile75)
iqr = percentile75-percentile25
print("IQR: ",iqr)
upper_limit = percentile75 + 1.5 * iqr

lower_limit = percentile25 - 1.5 * iqr
print("Upper_limit: ",upper_limit)
print("Lower_limit: ",lower_limit)
Percentile25: 4783.0
Percentile75: 42521.0
IQR: 37738.0
Upper_limit: 99128.0
Lower_limit: -51824.0
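The limits above follow the usual 1.5 × IQR rule (Tukey fences). As a sketch, the arithmetic can be wrapped in a small helper; iqr_fences is a hypothetical name introduced here, and the quartiles are the ones printed above.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences: q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles of 'price' as printed above
lower, upper = iqr_fences(4783.0, 42521.0)
print(lower, upper)  # -51824.0 99128.0
```

Because the lower fence is negative and the minimum price is 1,105 INR, only the upper fence actually removes rows for this column.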
In [ ]: df_flight[df_flight['price']> upper_limit]
Out[ ]: airline flight source_city departure_time stops arrival_time destination_city class duration days_left price
215858 Vistara UK-809 Delhi Evening two_or_more Evening Kolkata Business 21.08 1 114434
215859 Vistara UK-809 Delhi Evening two_or_more Evening Kolkata Business 21.08 1 116562
216025 Vistara UK-817 Delhi Evening two_or_more Morning Kolkata Business 17.58 4 100395
216094 Vistara UK-995 Delhi Morning one Evening Kolkata Business 6.50 5 99129
216095 Vistara UK-963 Delhi Morning one Evening Kolkata Business 8.00 5 101369
... ... ... ... ... ... ... ... ... ... ... ...
293474 Vistara UK-836 Chennai Morning one Night Bangalore Business 9.67 3 107597
296001 Vistara UK-838 Chennai Night one Morning Kolkata Business 11.50 3 102832
296081 Vistara UK-832 Chennai Early_Morning one Night Kolkata Business 15.83 5 102384
296170 Vistara UK-838 Chennai Night one Morning Kolkata Business 11.50 7 104624
296404 Vistara UK-838 Chennai Night one Evening Kolkata Business 21.00 12 102384
123 rows × 11 columns
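In [ ]: # Keep only the rows strictly inside the IQR limits
df_newflight = df_flight[(df_flight['price']>lower_limit) & (df_flight['price']<upper_limit)].copy()

# Note: reset_index() returns a new DataFrame; without assignment, df_newflight keeps its original row labels
df_newflight.reset_index()
df_newflight.shape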
Out[ ]: (300030, 11)
123 outliers in the 'price' column are removed (300,153 - 300,030 rows).
In [ ]: fig, ax = plt.subplots(figsize = (8, 5))
sns.boxplot(x =df_newflight['price'], ax = ax)
ax.set_title("'Price' Boxplot After Outliers Removal", loc = 'center', fontsize = 15)
ax.set_xlabel("INR", fontsize = 10)
plt.show()
2) Duration
In [ ]: fig, ax = plt.subplots(figsize = (8, 5))
sns.boxplot(x =df_newflight['duration'], ax = ax)
ax.set_title("'Duration' Boxplot Before Outliers Removal", loc = 'center', fontsize = 15)
ax.set_xlabel("Hrs", fontsize = 10)
plt.show()
In [ ]: df_newflight['duration'].skew()
Out[ ]: 0.6030484682602605
Since the skewness is between 0.5 and 1, the data are moderately right-skewed.
In [ ]: # Finding the IQR

percentile25 = df_newflight['duration'].quantile(0.25)
percentile75 = df_newflight['duration'].quantile(0.75)
print("Percentile25: ",percentile25)
print("Percentile75: ",percentile75)
iqr = percentile75-percentile25
print("IQR: ",iqr)
upper_limit = percentile75 + 1.5 * iqr

lower_limit = percentile25 - 1.5 * iqr
print("Upper_limit: ",upper_limit)
print("Lower_limit: ",lower_limit)
Percentile25: 6.83
Percentile75: 16.17
IQR: 9.340000000000002
Upper_limit: 30.180000000000003
Lower_limit: -7.1800000000000015
In [ ]: df_newflight[df_newflight['duration']> upper_limit]
Out[ ]: airline flight source_city departure_time stops arrival_time destination_city class duration days_left price
10534 Vistara UK-706 Delhi Afternoon two_or_more Night Bangalore Economy 31.25 4 12222
10540 Air_India AI-9887 Delhi Early_Morning two_or_more Evening Bangalore Economy 36.92 4 12321
... ... ... ... ... ... ... ... ... ... ... ...
296064 Air_India AI-440 Chennai Early_Morning one Afternoon Kolkata Business 30.33 5 55377
2110 rows × 11 columns
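In [ ]: # Keep only the rows strictly inside the IQR limits
df_newflight = df_newflight[(df_newflight['duration']>lower_limit) & (df_newflight['duration']<upper_limit)].copy()

# Note: reset_index() returns a new DataFrame; without assignment, df_newflight keeps its original row labels
df_newflight.reset_index()
df_newflight.shape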
Out[ ]: (297920, 11)
2,110 outliers in the 'duration' column are removed (300,030 - 297,920 rows).
In [ ]: fig, ax = plt.subplots(figsize = (8, 5))
sns.boxplot(x =df_newflight['duration'], ax = ax)
ax.set_title("'Duration' Boxplot After Outliers Removal", loc = 'center', fontsize = 15)
plt.show()
3) DaysLeft
In [ ]: fig, ax = plt.subplots(figsize = (8, 5))
sns.boxplot(x =df_newflight['days_left'], ax = ax)
ax.set_title("'Days Left' Boxplot", loc = 'center', fontsize = 15)
plt.show()
No Outliers detected.
DATA PREPROCESSING
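In [ ]: cat_cols = ['airline', 'source_city', 'departure_time', 'stops', 'arrival_time', 'destination_city', 'class','flight','days_left']

# Replace each category with its integer code (categories are sorted, then numbered from 0)
# days_left is already numeric, so encoding it only re-labels its values (1 becomes 0)
for col in cat_cols:
    df_newflight[col] = df_newflight[col].astype('category').cat.codes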
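In [ ]: df_newflight.head()
Out[ ]: airline flight source_city departure_time stops arrival_time destination_city class duration days_left price
0 4 1408 2 2 2 5 5 1 2.17 0 5953
1 4 1387 2 1 2 4 5 1 2.33 0 5953
2 0 1213 2 1 2 1 5 1 2.17 0 5956
3 5 1559 2 4 2 0 5 1 2.25 0 5955
4 5 1549 2 4 2 4 5 1 2.33 0 5955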
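The integer codes follow the sorted order of the category labels, not any meaningful order. A small sketch on toy values of the stops column shows why 'zero' ends up as code 2 in the output above; the variable names are illustrative.

```python
import pandas as pd

# Toy values of the 'stops' column
stops = pd.Series(["zero", "one", "two_or_more", "zero"])
as_category = stops.astype("category")

# Categories are sorted alphabetically, then numbered from 0
print(list(as_category.cat.categories))  # ['one', 'two_or_more', 'zero']
print(as_category.cat.codes.tolist())    # [2, 0, 1, 2]
```

One consequence worth keeping in mind: a linear model reads these codes as magnitudes, so 'zero' (2) looks larger than 'two_or_more' (1). One-hot encoding, for example with pd.get_dummies, is a common alternative for nominal columns, although it is not what this notebook uses.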
In [ ]: from sklearn.model_selection import train_test_split
X = df_newflight.drop(columns=['price'])
y = df_newflight['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
In [ ]: X_train.shape, X_test.shape, y_train.shape, y_test.shape
Out[ ]: ((238336, 10), (59584, 10), (238336,), (59584,))
MODEL SELECTION
In [ ]: from sklearn.linear_model import LinearRegression

from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
MODEL TRAINING AND EVALUATION
LINEAR REGRESSION
In [ ]: from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error, explained_variance_score
models = {
    "Linear Regression": LinearRegression(),
}
for name, model in models.items():
    # Train the model
    model.fit(X_train, y_train)
    # Make predictions on the test set
    y_pred = model.predict(X_test)
    # Calculate Mean Squared Error (MSE)
    mse = mean_squared_error(y_test, y_pred)
    # Calculate Mean Absolute Error (MAE)
    mae = mean_absolute_error(y_test, y_pred)
    # Calculate R-squared (R2) score
    r2 = r2_score(y_test, y_pred)
    # Calculate Explained Variance Score
    explained_variance = explained_variance_score(y_test, y_pred)
    # "Accuracy" here is model.score(), which returns R2 for regressors
    score = model.score(X_test, y_test)
    print(f"Mean Squared Error (MSE): {mse:.3f}")
    print(f"Mean Absolute Error (MAE): {mae:.3f}")
    print(f"R-squared (R2) score: {r2:.3f}")
    print(f"Accuracy of Model: {score*100:.3f}")
Mean Squared Error (MSE): 48631250.548

Mean Absolute Error (MAE): 4630.652
R-squared (R2) score: 0.906
Accuracy of Model: 90.562
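For a scikit-learn regressor, model.score() returns R-squared, which is why the "Accuracy" line equals the R2 score times 100. The sketch below computes MSE, MAE and R-squared by hand with NumPy; y_true and y_hat are made-up toy values chosen to keep the arithmetic easy to follow.

```python
import numpy as np

# Made-up toy targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

mse = np.mean((y_true - y_hat) ** 2)             # mean squared error
mae = np.mean(np.abs(y_true - y_hat))            # mean absolute error
ss_res = np.sum((y_true - y_hat) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot                         # share of variance explained

print(mse, mae, round(r2, 3))
```

Since MSE is in squared units, its square root is easier to interpret: for the Linear Regression result above, the square root of 48,631,250.548 is roughly 6,970 INR.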
DECISION TREE REGRESSOR
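In [ ]: models = {
    "Decision Tree": DecisionTreeRegressor(),
}
# Assumed to mirror the Linear Regression cell: the original loop body did not survive in this copy
for name, model in models.items():
    # Train the model
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Accuracy of Model (R2 for regressors)
    score = model.score(X_test, y_test)
    print(f"Accuracy of Model: {score*100:.3f}")
fig, ax = plt.subplots(figsize = (8, 5))
sns.scatterplot(x = y_test, y = y_pred, ax = ax)
ax.set_title("'Real x Predicted' DecisionTreeRegressor", loc = 'left', fontsize = 15, pad = 10)
ax.set_xlabel("Real", fontsize = 10)
ax.set_ylabel("Predicted", fontsize = 10)
plt.show()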
RANDOM FOREST REGRESSOR
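In [ ]: models = {
    "Random Forest": RandomForestRegressor(),
}
# Assumed to mirror the Linear Regression cell: the original loop body did not survive in this copy
for name, model in models.items():
    # Train the model
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Accuracy of Model (R2 for regressors)
    score = model.score(X_test, y_test)
    print(f"Accuracy of Model: {score*100:.3f}")
fig, ax = plt.subplots(figsize = (8, 5))
sns.scatterplot(x = y_test, y = y_pred, ax = ax)
ax.set_title("'Real x Predicted' RandomForestRegressor", loc = 'left', fontsize = 15, pad = 10)
ax.set_xlabel("Real", fontsize = 10)
ax.set_ylabel("Predicted", fontsize = 10)
plt.show()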
Model Prediction
The RandomForestRegressor achieved the highest R-squared of 0.99, followed by the DecisionTreeRegressor with an R-squared of 0.984.
Linear Regression achieved an R-squared of 0.90.
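The three models can also be compared in a single loop so that they are guaranteed to share the same split and metric. The following is a sketch on synthetic data, not a rerun of the notebook: the features, the target formula, the fixed random_state values and n_estimators=50 are all assumptions made for this illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): a linear effect plus a step effect and some noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(500, 3))
y = 3.0 * X[:, 0] + np.where(X[:, 1] > 5, 20.0, 0.0) + rng.normal(0, 1, 500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Fit every model on the same training split and score it on the same test split
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = r2_score(y_test, model.predict(X_test))

# Print the models from best to worst R-squared
for name, r2 in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: R2 = {r2:.3f}")
```

Keeping all models in one dictionary avoids reusing a stale y_pred from an earlier cell when plotting or comparing results.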
