ML Project
IMPORTING LIBRARIES
In [ ]: import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
In [ ]: df_flight.head()
Out[ ]: Unnamed: 0 airline flight source_city departure_time stops arrival_time destination_city class duration days_left price
0 0 SpiceJet SG-8709 Delhi Evening zero Night Mumbai Economy 2.17 1 5953
1 1 SpiceJet SG-8157 Delhi Early_Morning zero Morning Mumbai Economy 2.33 1 5953
2 2 AirAsia I5-764 Delhi Early_Morning zero Early_Morning Mumbai Economy 2.17 1 5956
3 3 Vistara UK-995 Delhi Morning zero Afternoon Mumbai Economy 2.25 1 5955
4 4 Vistara UK-963 Delhi Morning zero Morning Mumbai Economy 2.33 1 5955
Use .sample() to inspect random rows. Unlike .head() and .tail(), which show only adjacent rows that tend to share the same values, a random sample gives a more representative view of the data.
In [ ]: df_flight.sample(5)
Out[ ]: Unnamed: 0 airline flight source_city departure_time stops arrival_time destination_city class duration days_left price
218250 218250 Air_India AI-540 Delhi Night one Morning Kolkata Business 14.17 37 47545
138273 138273 Vistara UK-720 Kolkata Early_Morning two_or_more Evening Bangalore Economy 11.67 21 8111
227634 227634 Vistara UK-845 Mumbai Early_Morning one Afternoon Delhi Business 8.17 27 53152
262119 262119 Vistara UK-776 Kolkata Evening one Late_Night Delhi Business 6.58 16 66063
73971 73971 Air_India AI-687 Mumbai Afternoon one Night Hyderabad Economy 7.58 27 4173
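The next output no longer contains the `Unnamed: 0` column, so it was presumably dropped at this point; the cell that did so is not shown. A minimal sketch of that step, using a small stand-in frame (`df_demo` is an illustrative name, with rows copied from the output above):

```python
import pandas as pd

# Stand-in frame with a redundant index column, mirroring the layout above.
df_demo = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "airline": ["SpiceJet", "SpiceJet", "AirAsia"],
    "price": [5953, 5953, 5956],
})

# Drop the column that only duplicates the row index.
df_demo = df_demo.drop(columns=["Unnamed: 0"])
print(df_demo.columns.tolist())
```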
In [ ]: df_flight
Out[ ]: airline flight source_city departure_time stops arrival_time destination_city class duration days_left price
0 SpiceJet SG-8709 Delhi Evening zero Night Mumbai Economy 2.17 1 5953
1 SpiceJet SG-8157 Delhi Early_Morning zero Morning Mumbai Economy 2.33 1 5953
2 AirAsia I5-764 Delhi Early_Morning zero Early_Morning Mumbai Economy 2.17 1 5956
3 Vistara UK-995 Delhi Morning zero Afternoon Mumbai Economy 2.25 1 5955
4 Vistara UK-963 Delhi Morning zero Morning Mumbai Economy 2.33 1 5955
... ... ... ... ... ... ... ... ... ... ... ...
300148 Vistara UK-822 Chennai Morning one Evening Hyderabad Business 10.08 49 69265
300149 Vistara UK-826 Chennai Afternoon one Night Hyderabad Business 10.42 49 77105
300150 Vistara UK-832 Chennai Early_Morning one Night Hyderabad Business 13.83 49 79099
300151 Vistara UK-828 Chennai Early_Morning one Evening Hyderabad Business 10.00 49 81585
300152 Vistara UK-822 Chennai Morning one Evening Hyderabad Business 10.08 49 81585
In [ ]: df_flight.shape
In [ ]: df_flight.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300153 entries, 0 to 300152
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 airline 300153 non-null object
1 flight 300153 non-null object
2 source_city 300153 non-null object
3 departure_time 300153 non-null object
4 stops 300153 non-null object
5 arrival_time 300153 non-null object
6 destination_city 300153 non-null object
7 class 300153 non-null object
8 duration 300153 non-null float64
9 days_left 300153 non-null int64
10 price 300153 non-null int64
dtypes: float64(1), int64(2), object(8)
memory usage: 25.2+ MB
The dataset has 11 columns: 8 categorical (object dtype) and 3 numerical.
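The split between categorical and numerical columns can also be obtained programmatically. The sketch below assumes a small stand-in frame (`df_demo`, with a subset of the columns) rather than the full dataset:

```python
import pandas as pd

# Stand-in frame with a few of the dataset's columns.
df_demo = pd.DataFrame({
    "airline": ["SpiceJet", "Vistara"],
    "class": ["Economy", "Business"],
    "duration": [2.17, 10.08],
    "days_left": [1, 49],
    "price": [5953, 69265],
})

# Numerical columns are those with a numeric dtype; the rest are treated as categorical.
numerical_cols = df_demo.select_dtypes(include="number").columns.tolist()
categorical_cols = df_demo.select_dtypes(exclude="number").columns.tolist()

print("Categorical:", categorical_cols)
print("Numerical:", numerical_cols)
```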
In [ ]: df_flight.isnull().sum()
Out[ ]: airline 0
flight 0
source_city 0
departure_time 0
stops 0
arrival_time 0
destination_city 0
class 0
duration 0
days_left 0
price 0
dtype: int64
In [ ]: df_flight.duplicated().sum()
Out[ ]: 0
In [ ]: df_flight.describe()
The average ticket price is approximately 21,000 INR, while the median is 7,425 INR. A mean this far above the median indicates a right-skewed distribution and the presence of high-priced outliers.
The days_left column has a roughly symmetrical distribution (its mean and median are almost equal).
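To see why a mean well above the median points to a right skew, consider the toy series below. The values are made up purely for illustration and are not taken from the dataset:

```python
import pandas as pd

# Hypothetical prices: four cheap tickets and one very expensive one.
prices = pd.Series([4000, 5000, 6000, 7000, 60000])

mean_price = prices.mean()      # pulled upward by the single large value
median_price = prices.median()  # unaffected by how large the top value is

print("mean:", mean_price, "median:", median_price)
```

A single large value moves the mean far more than the median, which is the same pattern seen in the price column.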
In [ ]: corr = df_flight.corr(numeric_only=True)
corr
Flight duration and ticket price are positively correlated.
1) Class
In [ ]: df_flight.value_counts(['class'])
Out[ ]: class
Economy 206666
Business 93487
Name: count, dtype: int64
In [ ]: df_flight.value_counts(['class']).plot(kind='pie', autopct='%.2f%%')
Almost 70 percent of passengers traveled in Economy class, while the rest traveled in Business class.
Among the airlines, only Vistara and Air_India offer Business class, which is why their prices are higher.
Business class fares are two to five times higher than Economy class fares.
The most expensive tickets are from Vistara, followed by Air_India.
The cheapest tickets are from AirAsia.
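The cells behind these observations are not shown. One way to obtain comparable figures is a group-by on airline and class, sketched here on a stand-in frame (`df_demo`) built from a few rows that appear in the outputs above:

```python
import pandas as pd

# A handful of rows taken from the sample outputs above.
df_demo = pd.DataFrame({
    "airline": ["Vistara", "Vistara", "Air_India", "Air_India", "AirAsia", "SpiceJet"],
    "class": ["Business", "Economy", "Business", "Economy", "Economy", "Economy"],
    "price": [53152, 8111, 47545, 4173, 5956, 5953],
})

# Mean price per airline and class; airlines without a class show NaN.
mean_price = df_demo.groupby(["airline", "class"])["price"].mean().unstack()

# Airlines that have at least one Business class row.
business_airlines = sorted(
    df_demo.loc[df_demo["class"] == "Business", "airline"].unique()
)

print(mean_price)
print(business_airlines)
```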
2) Days_Left
In [ ]: plt.figure(figsize=(15,6))
sns.barplot( data=df_flight, x='days_left', y='price', palette='magma')
plt.title('Price Variation with Days Left Before Departure')
plt.xlabel('Days Left Before Departure')
plt.ylabel('Price')
plt.grid(True)
plt.show()
Ticket prices increase as the number of days left before departure decreases, but they drop considerably when only one day is left before travelling.
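The bar plot shows the mean price for each value of days_left. The same numbers can be read from a group-by, sketched below on made-up values that only demonstrate the mechanics (they are not the dataset's figures):

```python
import pandas as pd

# Hypothetical rows: two tickets each for 1, 2 and 10 days before departure.
df_demo = pd.DataFrame({
    "days_left": [1, 1, 2, 2, 10, 10],
    "price": [20000, 22000, 29000, 31000, 24000, 26000],
})

# Mean price per days_left, the quantity sns.barplot draws by default.
mean_by_days = df_demo.groupby("days_left")["price"].mean()
print(mean_by_days)
```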
3) Source_City - Destination_City
plt.tight_layout()
plt.show()
4) Arrival_Time - Departure_Time
plt.tight_layout()
plt.show()
Flights that depart or arrive late at night are the cheapest, followed by those in the afternoon and early morning.
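The plotting code for this section is truncated. As a sketch of how departure and arrival slots can be compared in one table, a pivot of mean price is shown below; the frame and its prices are hypothetical:

```python
import pandas as pd

# Hypothetical prices for combinations of departure and arrival slots.
df_demo = pd.DataFrame({
    "departure_time": ["Late_Night", "Late_Night", "Morning", "Morning"],
    "arrival_time": ["Late_Night", "Morning", "Late_Night", "Morning"],
    "price": [3000, 5000, 4500, 6000],
})

# Rows are departure slots, columns are arrival slots, cells are mean prices.
pivot = df_demo.pivot_table(
    index="departure_time",
    columns="arrival_time",
    values="price",
    aggfunc="mean",
)
print(pivot)
```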
DATA CLEANING
1) Price
# sns.boxplot returns the Axes it draws on; keep it so the title and label can be set
ax = sns.boxplot(x=df_flight['price'])
ax.set_title("'Price' Boxplot Before Outliers Removal", loc='center', fontsize=15)
ax.set_xlabel("INR", fontsize=10)
plt.show()
In [ ]: df_flight['price'].skew()
Out[ ]: 1.0613772532064343
iqr = percentile75-percentile25
print("IQR: ",iqr)
Percentile25: 4783.0
Percentile75: 42521.0
IQR: 37738.0
Upper_limit: 99128.0
Lower_limit: -51824.0
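Only part of the cell that produced these values is shown. The printed limits are consistent with the usual 1.5 × IQR rule, which can be sketched as follows. The helper `iqr_limits` is an illustrative name, not a function from the original notebook:

```python
import pandas as pd

def iqr_limits(series: pd.Series, k: float = 1.5):
    """Return (lower_limit, upper_limit) using the k * IQR rule."""
    percentile25 = series.quantile(0.25)
    percentile75 = series.quantile(0.75)
    iqr = percentile75 - percentile25
    return percentile25 - k * iqr, percentile75 + k * iqr

# Check the rule against the quartiles printed above.
percentile25, percentile75 = 4783.0, 42521.0
iqr = percentile75 - percentile25
upper_limit = percentile75 + 1.5 * iqr
lower_limit = percentile25 - 1.5 * iqr
print("Upper_limit:", upper_limit)
print("Lower_limit:", lower_limit)
```

Because prices cannot be negative, the lower limit has no practical effect here; only the upper limit flags any rows.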
In [ ]: df_flight[df_flight['price']> upper_limit]
Out[ ]: airline flight source_city departure_time stops arrival_time destination_city class duration days_left price
215858 Vistara UK-809 Delhi Evening two_or_more Evening Kolkata Business 21.08 1 114434
215859 Vistara UK-809 Delhi Evening two_or_more Evening Kolkata Business 21.08 1 116562
216025 Vistara UK-817 Delhi Evening two_or_more Morning Kolkata Business 17.58 4 100395
216094 Vistara UK-995 Delhi Morning one Evening Kolkata Business 6.50 5 99129
216095 Vistara UK-963 Delhi Morning one Evening Kolkata Business 8.00 5 101369
... ... ... ... ... ... ... ... ... ... ... ...
293474 Vistara UK-836 Chennai Morning one Night Bangalore Business 9.67 3 107597
296001 Vistara UK-838 Chennai Night one Morning Kolkata Business 11.50 3 102832
296081 Vistara UK-832 Chennai Early_Morning one Night Kolkata Business 15.83 5 102384
296170 Vistara UK-838 Chennai Night one Morning Kolkata Business 11.50 7 104624
296404 Vistara UK-838 Chennai Night one Evening Kolkata Business 21.00 12 102384
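The cell that creates `df_newflight` is not shown. Assuming the outliers were removed rather than capped, the step could look like the sketch below, demonstrated on a stand-in frame with three prices that appear in the outputs above:

```python
import pandas as pd

# Stand-in frame: two rows above the upper limit and one below it.
df_demo = pd.DataFrame({
    "flight": ["UK-995", "UK-963", "SG-8709"],
    "price": [99129, 101369, 5953],
})
upper_limit = 99128.0  # value printed above

# Keep only the rows whose price does not exceed the upper limit.
df_trimmed = df_demo[df_demo["price"] <= upper_limit].reset_index(drop=True)
print(df_trimmed)
```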
# sns.boxplot returns the Axes it draws on; keep it so the title and label can be set
ax = sns.boxplot(x=df_newflight['price'])
ax.set_title("'Price' Boxplot After Outliers Removal", loc='center', fontsize=15)
ax.set_xlabel("INR", fontsize=10)
plt.show()
2) Duration
# sns.boxplot returns the Axes it draws on; keep it so the title and label can be set
ax = sns.boxplot(x=df_newflight['duration'])
ax.set_title("'Duration' Boxplot Before Outliers Removal", loc='center', fontsize=15)
ax.set_xlabel("Hrs", fontsize=10)
plt.show()
In [ ]: df_newflight['duration'].skew()
Out[ ]: 0.6030484682602605
Since the value is between 0.5 and 1, the data is moderately skewed.
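The interpretation above follows a common rule of thumb for skewness. A small helper makes the thresholds explicit; the function name and the exact cut-offs (0.5 and 1) are assumptions based on that rule of thumb, not something the notebook defines:

```python
def describe_skew(skewness: float) -> str:
    """Classify a skewness value using the 0.5 / 1.0 rule of thumb."""
    magnitude = abs(skewness)
    if magnitude < 0.5:
        return "approximately symmetric"
    if magnitude <= 1.0:
        return "moderately skewed"
    return "highly skewed"

# Skewness values reported in this notebook.
print("duration:", describe_skew(0.6030484682602605))
print("price:", describe_skew(1.0613772532064343))
```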
iqr = percentile75-percentile25
print("IQR: ",iqr)
Percentile25: 6.83
Percentile75: 16.17
IQR: 9.340000000000002
Upper_limit: 30.180000000000003
Lower_limit: -7.1800000000000015
In [ ]: df_newflight[df_newflight['duration']> upper_limit]
Out[ ]: airline flight source_city departure_time stops arrival_time destination_city class duration days_left price
10534 Vistara UK-706 Delhi Afternoon two_or_more Night Bangalore Economy 31.25 4 12222
10535 Vistara UK-706 Delhi Afternoon two_or_more Night Bangalore Economy 33.17 4 12222
10540 Air_India AI-9887 Delhi Early_Morning two_or_more Evening Bangalore Economy 36.92 4 12321
10891 Vistara UK-706 Delhi Afternoon two_or_more Night Bangalore Economy 31.25 6 12222
10892 Vistara UK-706 Delhi Afternoon two_or_more Night Bangalore Economy 33.17 6 12222
... ... ... ... ... ... ... ... ... ... ... ...
296064 Air_India AI-440 Chennai Early_Morning one Afternoon Kolkata Business 30.33 5 55377
296297 Air_India AI-440 Chennai Early_Morning one Afternoon Kolkata Business 30.33 10 55377
296391 Air_India AI-440 Chennai Early_Morning one Afternoon Kolkata Business 30.33 12 55377
296716 Air_India AI-440 Chennai Early_Morning one Afternoon Kolkata Business 30.33 19 55377
297661 Air_India AI-440 Chennai Early_Morning one Afternoon Kolkata Business 30.33 40 55377
# sns.boxplot returns the Axes it draws on; keep it so the title and label can be set
ax = sns.boxplot(x=df_newflight['duration'])
ax.set_title("'Duration' Boxplot After Outliers Removal", loc='center', fontsize=15)
ax.set_xlabel("Hrs", fontsize=10)
plt.show()
3) Days_Left
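# sns.boxplot returns the Axes it draws on; keep it so the title and label can be set
ax = sns.boxplot(x=df_newflight['days_left'])
ax.set_title("'Days Left' Boxplot", loc='center', fontsize=15)
ax.set_xlabel("Days", fontsize=10)  # days_left is measured in days, not hours
plt.show()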
No outliers detected.
DATA PREPROCESSING
In [ ]: df_newflight.head()
Out[ ]: airline flight source_city departure_time stops arrival_time destination_city class duration days_left price
X = df_newflight.drop(columns=['price'])
y = df_newflight['price']
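The remaining preprocessing cells are not shown. Since most features are categorical, they must be converted to numbers before fitting scikit-learn regressors, and the data must be split into training and test sets. The sketch below uses one-hot encoding and a 75/25 split; both choices, and the stand-in frame `df_demo`, are assumptions rather than the notebook's actual settings:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frame with a subset of the dataset's columns (rows from outputs above).
df_demo = pd.DataFrame({
    "airline": ["SpiceJet", "SpiceJet", "AirAsia", "Vistara",
                "Vistara", "Air_India", "Vistara", "Air_India"],
    "class": ["Economy", "Economy", "Economy", "Economy",
              "Economy", "Business", "Business", "Economy"],
    "duration": [2.17, 2.33, 2.17, 2.25, 2.33, 14.17, 8.17, 7.58],
    "days_left": [1, 1, 1, 1, 1, 37, 27, 27],
    "price": [5953, 5953, 5956, 5955, 5955, 47545, 53152, 4173],
})

X = df_demo.drop(columns=["price"])
y = df_demo["price"]

# One-hot encode the categorical columns; numeric columns pass through unchanged.
X_encoded = pd.get_dummies(X, drop_first=True)

# Hold out 25% of the rows for testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)
```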
MODEL SELECTION
LINEAR REGRESSION
models = {
"Linear Regression": LinearRegression(),
}
# R-squared of the model on the test set (score() returns R^2 for regressors, not accuracy)
score = model.score(X_test, y_test)
DECISION TREE
models = {
"Decision Tree": DecisionTreeRegressor(),
}
# R-squared of the model on the test set (score() returns R^2 for regressors, not accuracy)
score = model.score(X_test, y_test)
plt.show()
RANDOM FOREST
models = {
"Random Forest": RandomForestRegressor(),
}
# R-squared of the model on the test set (score() returns R^2 for regressors, not accuracy)
score = model.score(X_test, y_test)
plt.show()
Model Prediction
RandomForestRegressor achieved the highest R-squared, 0.99, followed by DecisionTreeRegressor with 0.984.
Linear Regression achieved an R-squared of 0.90.
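The training cells above are only partially shown. A compact way to fit and compare the three regressors is to loop over a single `models` dictionary, sketched below. The data comes from scikit-learn's `make_regression` so the block runs on its own; the hyperparameters and the synthetic data are assumptions, so the scores it prints will not match the figures reported above:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the encoded flight features.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # For regressors, score() returns the R-squared on the given data.
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: R-squared = {scores[name]:.3f}")
```

Evaluating all models on the same held-out split keeps the comparison fair, since each R-squared is computed on rows none of the models saw during training.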