EDA of Hotel Booking Dataset - Kaggle

Mehedi Azad

Data Analyst | Data Science | Machine Learning | AI | Power BI | Python | Data nerd
https://www.linkedin.com/in/mehediazad/

In [ ]: #linkedin link: https://www.linkedin.com/in/mehediazad/


# dataset link: https://drive.google.com/drive/folders/1NPxYfyBHfHOoPpHLd9qbkfFWA

In [1]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# import plotly.express as px

In [2]: df = pd.read_csv(r'G:\dataset\EDA ofhotel booking -kaggle\hotel_bookings.csv\hote


df

Out[2]:
        hotel         is_canceled  lead_time  arrival_date_year  arrival_date_month  arrival_date_week_num
0       Resort Hotel  0            342        2015               July
1       Resort Hotel  0            737        2015               July
2       Resort Hotel  0            7          2015               July
3       Resort Hotel  0            13         2015               July
4       Resort Hotel  0            14         2015               July
...     ...           ...          ...        ...                ...
119385  City Hotel    0            23         2017               August
119386  City Hotel    0            102        2017               August
119387  City Hotel    0            34         2017               August
119388  City Hotel    0            109        2017               August
119389  City Hotel    0            205        2017               August

119390 rows × 32 columns

Checking the first 5 records of this dataset using the head() function
In [3]: df.head()

Out[3]:
   hotel         is_canceled  lead_time  arrival_date_year  arrival_date_month  arrival_date_week_number
0  Resort Hotel  0            342        2015               July                27
1  Resort Hotel  0            737        2015               July                27
2  Resort Hotel  0            7          2015               July                27
3  Resort Hotel  0            13         2015               July                27
4  Resort Hotel  0            14         2015               July                27

5 rows × 32 columns

Checking the last 5 records of this dataset using the tail() function
In [4]: df.tail()

Out[4]:
        hotel       is_canceled  lead_time  arrival_date_year  arrival_date_month  arrival_date_week_numb
119385  City Hotel  0            23         2017               August
119386  City Hotel  0            102        2017               August
119387  City Hotel  0            34         2017               August
119388  City Hotel  0            109        2017               August
119389  City Hotel  0            205        2017               August

5 rows × 32 columns

Checking columns of this dataset


In [5]: df.columns

Out[5]: Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_year',

'arrival_date_month', 'arrival_date_week_number',

'arrival_date_day_of_month', 'stays_in_weekend_nights',

'stays_in_week_nights', 'adults', 'children', 'babies', 'meal',

'country', 'market_segment', 'distribution_channel',

'is_repeated_guest', 'previous_cancellations',

'previous_bookings_not_canceled', 'reserved_room_type',

'assigned_room_type', 'booking_changes', 'deposit_type', 'agent',

'company', 'days_in_waiting_list', 'customer_type', 'adr',

'required_car_parking_spaces', 'total_of_special_requests',

'reservation_status', 'reservation_status_date'],

dtype='object')

Showing statistical information about the data using the describe() function
In [6]: df.describe()

Out[6]:
is_canceled lead_time arrival_date_year arrival_date_week_number arrival_date_day_

count 119390.000000 119390.000000 119390.000000 119390.000000 1193

mean 0.370416 104.011416 2016.156554 27.165173

std 0.482918 106.863097 0.707476 13.605138

min 0.000000 0.000000 2015.000000 1.000000

25% 0.000000 18.000000 2016.000000 16.000000

50% 0.000000 69.000000 2016.000000 28.000000

75% 1.000000 160.000000 2017.000000 38.000000

max 1.000000 737.000000 2017.000000 53.000000

Checking information about the dataset using the info() function
In [7]: df.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 119390 entries, 0 to 119389

Data columns (total 32 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 hotel 119390 non-null object

1 is_canceled 119390 non-null int64

2 lead_time 119390 non-null int64

3 arrival_date_year 119390 non-null int64

4 arrival_date_month 119390 non-null object

5 arrival_date_week_number 119390 non-null int64

6 arrival_date_day_of_month 119390 non-null int64

7 stays_in_weekend_nights 119390 non-null int64

8 stays_in_week_nights 119390 non-null int64

9 adults 119390 non-null int64

10 children 119386 non-null float64

11 babies 119390 non-null int64

12 meal 119390 non-null object

13 country 118902 non-null object

14 market_segment 119390 non-null object

15 distribution_channel 119390 non-null object

16 is_repeated_guest 119390 non-null int64

17 previous_cancellations 119390 non-null int64

18 previous_bookings_not_canceled 119390 non-null int64

19 reserved_room_type 119390 non-null object

20 assigned_room_type 119390 non-null object

21 booking_changes 119390 non-null int64

22 deposit_type 119390 non-null object

23 agent 103050 non-null float64

24 company 6797 non-null float64

25 days_in_waiting_list 119390 non-null int64

26 customer_type 119390 non-null object

27 adr 119390 non-null float64

28 required_car_parking_spaces 119390 non-null int64

29 total_of_special_requests 119390 non-null int64

30 reservation_status 119390 non-null object

31 reservation_status_date 119390 non-null object

dtypes: float64(4), int64(16), object(12)

memory usage: 29.1+ MB

Checking the number of rows & columns in the dataset using shape
In [8]: df.shape
# There are 119390 rows & 32 columns

Out[8]: (119390, 32)

Checking the data types of the dataset using dtypes


In [9]: df.dtypes

Out[9]: hotel object

is_canceled int64

lead_time int64

arrival_date_year int64

arrival_date_month object

arrival_date_week_number int64

arrival_date_day_of_month int64

stays_in_weekend_nights int64

stays_in_week_nights int64

adults int64

children float64

babies int64

meal object

country object

market_segment object

distribution_channel object

is_repeated_guest int64

previous_cancellations int64

previous_bookings_not_canceled int64

reserved_room_type object

assigned_room_type object

booking_changes int64

deposit_type object

agent float64

company float64

days_in_waiting_list int64

customer_type object

adr float64

required_car_parking_spaces int64

total_of_special_requests int64

reservation_status object

reservation_status_date object

dtype: object

Checking null values present in each column using isnull().sum()
In [10]: df.isnull().sum()

Out[10]: hotel 0

is_canceled 0

lead_time 0

arrival_date_year 0

arrival_date_month 0

arrival_date_week_number 0

arrival_date_day_of_month 0

stays_in_weekend_nights 0

stays_in_week_nights 0

adults 0

children 4

babies 0

meal 0

country 488

market_segment 0

distribution_channel 0

is_repeated_guest 0

previous_cancellations 0

previous_bookings_not_canceled 0

reserved_room_type 0

assigned_room_type 0

booking_changes 0

deposit_type 0

agent 16340

company 112593

days_in_waiting_list 0

customer_type 0

adr 0

required_car_parking_spaces 0

total_of_special_requests 0

reservation_status 0

reservation_status_date 0

dtype: int64

In [11]: df.head()

Out[11]:
   hotel         is_canceled  lead_time  arrival_date_year  arrival_date_month  arrival_date_week_number
0  Resort Hotel  0            342        2015               July                27
1  Resort Hotel  0            737        2015               July                27
2  Resort Hotel  0            7          2015               July                27
3  Resort Hotel  0            13         2015               July                27
4  Resort Hotel  0            14         2015               July                27

5 rows × 32 columns

Dropping the 'company' column because it contains a large number of null values (112593 of 119390 rows)
In [12]: df = df.drop('company', axis=1)

In [13]: df.shape

Out[13]: (119390, 31)
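
The "large number of null values" criterion can be made explicit by computing the share of missing values per column and dropping columns above a cutoff. The sketch below is only an illustration on a toy frame: the `toy` data and the 0.7 `threshold` are assumptions, not values from this notebook (in the real data, 'company' is missing in about 94% of rows).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the bookings frame (illustrative values only).
toy = pd.DataFrame({
    "agent":   [9.0, np.nan, 240.0, 9.0],
    "company": [np.nan, np.nan, np.nan, 110.0],
    "adr":     [75.0, 98.0, 0.0, 107.0],
})

# Share of missing values in each column (0.0 to 1.0).
null_share = toy.isnull().mean()

# Assumed cutoff: drop every column with more than 70% missing values.
threshold = 0.7
to_drop = null_share[null_share > threshold].index.tolist()
cleaned = toy.drop(columns=to_drop)
```

With a rule like this, the decision to drop a column is reproducible and can be revisited by changing a single number.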

In [14]: df.columns

Out[14]: Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_year',

'arrival_date_month', 'arrival_date_week_number',

'arrival_date_day_of_month', 'stays_in_weekend_nights',

'stays_in_week_nights', 'adults', 'children', 'babies', 'meal',

'country', 'market_segment', 'distribution_channel',

'is_repeated_guest', 'previous_cancellations',

'previous_bookings_not_canceled', 'reserved_room_type',

'assigned_room_type', 'booking_changes', 'deposit_type', 'agent',

'days_in_waiting_list', 'customer_type', 'adr',

'required_car_parking_spaces', 'total_of_special_requests',

'reservation_status', 'reservation_status_date'],

dtype='object')

There is a large number of null values in the 'agent' column (16340). Replacing the null values with zero using the fillna() function
In [15]: df['agent'] = df['agent'].fillna(0)

In [16]: df['agent']

Out[16]: 0 0.0

1 0.0

2 0.0

3 304.0

4 240.0

...

119385 394.0

119386 9.0

119387 9.0

119388 89.0

119389 9.0

Name: agent, Length: 119390, dtype: float64

To check the number of bookings from each country using the value_counts() function
In [17]: df['country'].value_counts()

Out[17]: PRT 48590

GBR 12129

FRA 10415

ESP 8568

DEU 7287

...

ASM 1

BWA 1

HND 1

MLI 1

BHS 1

Name: country, Length: 177, dtype: int64

There are also some null values in the 'country' column (488). Replacing them with the most frequent country, 'PRT'
In [18]: # Replace the null values of country with 'PRT'
df['country'] = df['country'].fillna('PRT')

In [19]: df['country'].value_counts()

Out[19]: PRT 49078

GBR 12129

FRA 10415

ESP 8568

DEU 7287

...

ASM 1

BWA 1

HND 1

MLI 1

BHS 1

Name: country, Length: 177, dtype: int64
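
Instead of hard-coding 'PRT', the most frequent value can be looked up with `mode()`. This is a minimal sketch on a toy Series (the `country` values below are made up for illustration). Filling with the mode is a simplifying assumption: it treats the missing countries as if they were distributed like the most common one.

```python
import numpy as np
import pandas as pd

# Toy country column with two missing entries (illustrative values only).
country = pd.Series(["PRT", "GBR", np.nan, "PRT", "FRA", np.nan])

# mode() ignores NaN by default and returns the most frequent value(s).
most_common = country.mode().iloc[0]

# Fill the gaps with that value.
filled = country.fillna(most_common)
```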

Checking the null values again in each column using isnull().sum()
In [20]: df.isnull().sum()
# 'children' still has 4 null values; every other column now has none

Out[20]: hotel 0

is_canceled 0

lead_time 0

arrival_date_year 0

arrival_date_month 0

arrival_date_week_number 0

arrival_date_day_of_month 0

stays_in_weekend_nights 0

stays_in_week_nights 0

adults 0

children 4

babies 0

meal 0

country 0

market_segment 0

distribution_channel 0

is_repeated_guest 0

previous_cancellations 0

previous_bookings_not_canceled 0

reserved_room_type 0

assigned_room_type 0

booking_changes 0

deposit_type 0

agent 0

days_in_waiting_list 0

customer_type 0

adr 0

required_car_parking_spaces 0

total_of_special_requests 0

reservation_status 0

reservation_status_date 0

dtype: int64
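
The output above still lists 4 null values in 'children'. One possible way to handle them, shown here on a toy frame, is to fill them with 0. This is an assumption (a missing entry is read as "no children on the booking"), not a step taken in the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'children' column (illustrative values only).
toy = pd.DataFrame({"children": [0.0, 2.0, np.nan, 1.0]})

# Assumption: a missing value means no children were on the booking.
toy["children"] = toy["children"].fillna(0).astype(int)

# Confirm that nothing is left to fill.
remaining_nulls = int(toy["children"].isnull().sum())
```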

To find the correlation between the numeric columns using the corr() function
In [21]: df.corr()

Out[21]:
is_canceled lead_time arrival_date_year arrival_date_week_numbe

is_canceled 1.000000 0.293123 0.016660 0.00814

lead_time 0.293123 1.000000 0.040142 0.12687

arrival_date_year 0.016660 0.040142 1.000000 -0.54056

arrival_date_week_number 0.008148 0.126871 -0.540561 1.00000

arrival_date_day_of_month -0.006130 0.002268 -0.000221 0.06680

stays_in_weekend_nights -0.001791 0.085671 0.021497 0.01820

stays_in_week_nights 0.024765 0.165799 0.030883 0.01555

adults 0.060017 0.119519 0.029635 0.02590

children 0.005048 -0.037622 0.054624 0.00551

babies -0.032491 -0.020915 -0.013192 0.01039

is_repeated_guest -0.084793 -0.124410 0.010341 -0.03013

previous_cancellations 0.110133 0.086042 -0.119822 0.03550

previous_bookings_not_canceled -0.057358 -0.073548 0.029218 -0.02090

booking_changes -0.144381 0.000149 0.030872 0.00550

agent -0.046529 -0.012640 0.056463 -0.01824

days_in_waiting_list 0.054186 0.170084 -0.056497 0.02293

adr 0.047557 -0.063077 0.197580 0.07579

required_car_parking_spaces -0.195498 -0.116451 -0.013684 0.00192

total_of_special_requests -0.234658 -0.095712 0.108531 0.02614
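
The table above contains only numeric columns because the pandas version used here drops the text columns silently. pandas 2.0 and later raise an error when `corr()` meets string columns, so a version-independent variant is to select the numeric columns first. A minimal sketch on a toy frame (the values are illustrative):

```python
import pandas as pd

# Toy frame mixing a text column with numeric ones (illustrative values only).
toy = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"],
    "is_canceled": [1, 0, 1, 0],
    "lead_time": [200, 10, 150, 30],
})

# Keep only numeric columns so corr() never sees strings.
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()
```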



To check the number of duplicated rows using duplicated().sum()
In [22]: df.duplicated().sum()

Out[22]: 32020

To drop the duplicates permanently using the drop_duplicates(inplace=True) function
In [23]: df.drop_duplicates(inplace=True)
In [24]: df.shape

Out[24]: (87370, 31)
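
`duplicated()` marks every row that is identical to an earlier row, so the first occurrence is kept. The sketch below shows the same step without `inplace=True`, which leaves the original frame untouched. The toy data is illustrative. Because the dataset has no booking ID column, treating identical rows as duplicates is itself an assumption: two different guests could in principle make identical bookings.

```python
import pandas as pd

# Toy frame with repeated rows (illustrative values only).
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "City Hotel"],
    "lead_time": [10, 10, 5, 10],
})

# Rows identical to an earlier row; the first occurrence is not counted.
n_dupes = int(toy.duplicated().sum())

# Non-inplace variant: returns a new frame and keeps 'toy' as it was.
deduped = toy.drop_duplicates().reset_index(drop=True)
```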

Checking the unique values of the 'hotel' column using the unique() function
In [25]: df['hotel'].unique()

Out[25]: array(['Resort Hotel', 'City Hotel'], dtype=object)



Checking the number of bookings for each hotel using the value_counts() function
In [26]: df['hotel'].value_counts()

Out[26]: City Hotel 53426

Resort Hotel 33944

Name: hotel, dtype: int64

Count plot on single categorical variable 'hotel'


In [27]: sns.countplot(x='hotel', data=df)
#show the plot using 'plt.show()'
plt.show()
In [28]: df.columns

Out[28]: Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_year',

'arrival_date_month', 'arrival_date_week_number',

'arrival_date_day_of_month', 'stays_in_weekend_nights',

'stays_in_week_nights', 'adults', 'children', 'babies', 'meal',

'country', 'market_segment', 'distribution_channel',

'is_repeated_guest', 'previous_cancellations',

'previous_bookings_not_canceled', 'reserved_room_type',

'assigned_room_type', 'booking_changes', 'deposit_type', 'agent',

'days_in_waiting_list', 'customer_type', 'adr',

'required_car_parking_spaces', 'total_of_special_requests',

'reservation_status', 'reservation_status_date'],

dtype='object')

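To count the total number of bookings across both cancellation groups (summing the per-group counts gives the row count, not the number of cancellations)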


In [29]: df.groupby(['is_canceled'])['hotel'].value_counts().sum()

Out[29]: 87370
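
Because `is_canceled` is a 0/1 flag, its sum is the number of cancelled bookings and its mean is the cancellation rate. A minimal sketch on a toy frame (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy bookings (illustrative values only).
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel", "City Hotel"],
    "is_canceled": [1, 0, 0, 1, 1],
})

# Total number of cancelled bookings.
total_canceled = int(toy["is_canceled"].sum())

# Cancelled bookings and cancellation rate per hotel.
canceled_by_hotel = toy.groupby("hotel")["is_canceled"].sum()
cancel_rate_by_hotel = toy.groupby("hotel")["is_canceled"].mean()
```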

To analyse which hotel has the higher cancellation rate (the bar height is the mean of is_canceled)
In [30]: sns.barplot(x=df['hotel'], y=df['is_canceled'])
plt.show()



In [31]: df.groupby(['arrival_date_month'])['hotel'].count().sort_values(ascending=False)

Out[31]: arrival_date_month

August 11257

July 10055

May 8354

April 7905

June 7765

March 7510

October 6932

September 6689

February 6091

December 5128

November 4993

January 4691

Name: hotel, dtype: int64
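
The counts above are sorted by size. To read them in calendar order instead, the month names can be turned into an ordered categorical. A minimal sketch on a toy Series (illustrative values); `month_order` is a helper list introduced here, and the same list could be passed to the `order=` parameter of `sns.countplot`.

```python
import pandas as pd

# Toy arrival months (illustrative values only).
arrivals = pd.Series(["July", "August", "January", "July", "March"])

# Helper list introduced for this sketch: calendar order of the month names.
month_order = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# An ordered categorical sorts by calendar position instead of alphabetically.
ordered = pd.Categorical(arrivals, categories=month_order, ordered=True)
counts = pd.Series(ordered).value_counts().sort_index()
```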

To know the count of bookings in each month


In [32]: plt.figure(figsize=(13, 6))  # set the size before drawing; rcParams changes only affect later figures
sns.countplot(x='arrival_date_month', data=df)
plt.xticks(rotation=90)
plt.show()
In [33]: df.head()

Out[33]:
   hotel         is_canceled  lead_time  arrival_date_year  arrival_date_month  arrival_date_week_number
0  Resort Hotel  0            342        2015               July                27
1  Resort Hotel  0            737        2015               July                27
2  Resort Hotel  0            7          2015               July                27
3  Resort Hotel  0            13         2015               July                27
4  Resort Hotel  0            14         2015               July                27

5 rows × 31 columns

To know the count of bookings in each week


In [34]: df.groupby(['arrival_date_week_number'])['hotel'].count()

Out[34]: arrival_date_week_number

1 862

2 945

3 1049

4 1124

5 1101

6 1295

7 1629

8 1523

9 1579

10 1630

11 1657

12 1572

13 1817

14 1693

15 1989

16 1736

17 1878

18 2087

19 1813

20 1843

21 2043

22 1753

23 1872

24 1746

25 1786

26 1739

27 2166

28 2343

29 2197

30 2335

31 2286

32 2449

33 2793

34 2491

35 2105

36 1626

37 1474

38 1634

39 1590

40 1427

41 1663

42 1445

43 1605

44 1548

45 1314

46 1141

47 1288

48 1199

49 1169

50 1053

51 785

52 1061

53 1422

Name: hotel, dtype: int64

In [35]: df.groupby(['hotel'])['lead_time'].count().sort_values(ascending=False)

Out[35]: hotel

City Hotel 53426

Resort Hotel 33944

Name: lead_time, dtype: int64

In [36]: df.groupby(['arrival_date_day_of_month'])['arrival_date_month'].count().sort_valu

Out[36]: arrival_date_day_of_month

17 3018

2 3015

26 3000

5 2980

16 2957

19 2949

12 2928

28 2924

18 2923

11 2915

20 2915

27 2902

29 2880

9 2878

15 2868

25 2837

3 2833

21 2822

13 2812

8 2808

6 2804

4 2798

10 2785

23 2776

24 2774

30 2770

1 2769

7 2704

14 2692

22 2601

31 1733

Name: arrival_date_month, dtype: int64


In [37]: sns.barplot(x=df['hotel'], y=df['lead_time'])
plt.show()
In [38]: df.groupby(['arrival_date_day_of_month'])['hotel'].count().sort_values(ascending=

Out[38]: arrival_date_day_of_month

17 3018

2 3015

26 3000

5 2980

16 2957

19 2949

12 2928

28 2924

18 2923

11 2915

20 2915

27 2902

29 2880

9 2878

15 2868

25 2837

3 2833

21 2822

13 2812

8 2808

6 2804

4 2798

10 2785

23 2776

24 2774

30 2770

1 2769

7 2704

14 2692

22 2601

31 1733

Name: hotel, dtype: int64


In [39]: plt.figure(figsize=(13, 6))  # set the size before drawing the plot
sns.countplot(x='arrival_date_day_of_month', data=df)
plt.show()

In [40]: df.columns

Out[40]: Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_year',

'arrival_date_month', 'arrival_date_week_number',

'arrival_date_day_of_month', 'stays_in_weekend_nights',

'stays_in_week_nights', 'adults', 'children', 'babies', 'meal',

'country', 'market_segment', 'distribution_channel',

'is_repeated_guest', 'previous_cancellations',

'previous_bookings_not_canceled', 'reserved_room_type',

'assigned_room_type', 'booking_changes', 'deposit_type', 'agent',

'days_in_waiting_list', 'customer_type', 'adr',

'required_car_parking_spaces', 'total_of_special_requests',

'reservation_status', 'reservation_status_date'],

dtype='object')

To count the number of non-null entries in the 'meal' column


In [41]: df['meal'].count()

Out[41]: 87370

In [42]: data = df['meal'].value_counts()
print(data)
labels = data.index  # keep the labels in the same order as the counts

BB 67955

SC 9481

HB 9084

Undefined 490

FB 360

Name: meal, dtype: int64

In [43]: explode = (0.1, 0.0, 0.2, 0.3, 0.7)


plt.figure(figsize =(10, 8))
plt.pie(data,labels=labels, explode=explode, autopct='%.2f')
plt.show()
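
Pie labels have to be in the same order as the wedge sizes. `value_counts()` sorts by frequency, whereas `unique()` returns values in order of first appearance, so the two orders can differ and wedges can end up with the wrong label. A minimal sketch on a toy Series (illustrative values) that makes the difference visible:

```python
import pandas as pd

# Toy meal column (illustrative values only).
meal = pd.Series(["BB", "HB", "BB", "SC", "BB", "SC"])

# Wedge sizes, sorted by frequency.
data = meal.value_counts()

# Order of first appearance: not guaranteed to match the counts.
labels_by_appearance = list(meal.unique())

# Labels taken from the counts themselves always line up with them.
labels_aligned = list(data.index)
```

Passing `labels_aligned` (or simply `data.index`) to `plt.pie` keeps each percentage attached to the right category.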

In [44]: fig, (ax1, ax2) = plt.subplots(1,2, figsize=(12,6))


ax1.scatter(x=df['hotel'], y=df['meal'], marker='+', color='red')
ax2.scatter(x=df['hotel'], y=df['stays_in_weekend_nights'], marker='^', color='bl

Out[44]: <matplotlib.collections.PathCollection at 0x1a443345610>


In [45]: df.columns

Out[45]: Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_year',

'arrival_date_month', 'arrival_date_week_number',

'arrival_date_day_of_month', 'stays_in_weekend_nights',

'stays_in_week_nights', 'adults', 'children', 'babies', 'meal',

'country', 'market_segment', 'distribution_channel',

'is_repeated_guest', 'previous_cancellations',

'previous_bookings_not_canceled', 'reserved_room_type',

'assigned_room_type', 'booking_changes', 'deposit_type', 'agent',

'days_in_waiting_list', 'customer_type', 'adr',

'required_car_parking_spaces', 'total_of_special_requests',

'reservation_status', 'reservation_status_date'],

dtype='object')

To count the number of adults in each hotel


In [46]: df.groupby(['hotel'])['adults'].sum()

Out[46]: hotel

City Hotel 100246

Resort Hotel 63656

Name: adults, dtype: int64

In [47]: sns.scatterplot(data=df, x="hotel", y="adults", hue='meal')
plt.grid()
plt.show()

In [48]: sns.scatterplot(data=df, x='hotel', y='children', hue='meal')
plt.grid()
plt.show()

In [49]: sns.scatterplot(data=df, x='hotel', y='babies', hue='meal')


plt.grid()
plt.show()

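Insights:
1. BB is by far the most common meal plan (67955 of 87370 bookings).
2. In total, more adults stayed at the City Hotel (100246) than at the Resort Hotel (63656), so the counts do not show a preference for the Resort Hotel among adults.
3. The scatter plots show only which party sizes occur at each hotel, not how many bookings have each size, so on their own they cannot show which hotel families with children or babies prefer.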

In [50]: df.columns

Out[50]: Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_year',

'arrival_date_month', 'arrival_date_week_number',

'arrival_date_day_of_month', 'stays_in_weekend_nights',

'stays_in_week_nights', 'adults', 'children', 'babies', 'meal',

'country', 'market_segment', 'distribution_channel',

'is_repeated_guest', 'previous_cancellations',

'previous_bookings_not_canceled', 'reserved_room_type',

'assigned_room_type', 'booking_changes', 'deposit_type', 'agent',

'days_in_waiting_list', 'customer_type', 'adr',

'required_car_parking_spaces', 'total_of_special_requests',

'reservation_status', 'reservation_status_date'],

dtype='object')

In [51]: df['market_segment'].value_counts()

Out[51]: Online TA 51613

Offline TA/TO 13886

Direct 11798

Groups 4940

Corporate 4202

Complementary 702

Aviation 227

Undefined 2

Name: market_segment, dtype: int64

Market segment of both hotels


In [52]: plt.figure(figsize=(12, 6))  # set the size before drawing the plot
sns.countplot(x='market_segment', data=df)
plt.show()
In [53]: plt.figure(figsize=(12, 6))
sns.countplot(x='market_segment', data=df, hue='hotel', palette='magma')
plt.xticks(rotation=90)
plt.title('Market segment for both hotels')
plt.show()

Checking weekend-night stays by hotel


In [54]: df.groupby(['hotel'])['stays_in_weekend_nights'].count().sort_values(ascending=Fa

Out[54]: hotel

City Hotel 53426

Resort Hotel 33944

Name: stays_in_weekend_nights, dtype: int64
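
`count()` returns the number of rows per hotel, so the result above is the same as the booking counts. To get an actual ratio, one option is the share of bookings with at least one weekend night. A minimal sketch on a toy frame (illustrative values):

```python
import pandas as pd

# Toy bookings (illustrative values only).
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"],
    "stays_in_weekend_nights": [0, 2, 1, 2],
})

# True where the booking includes at least one weekend night.
has_weekend = toy["stays_in_weekend_nights"] > 0

# Share of such bookings per hotel (mean of a boolean is a proportion).
weekend_share = has_weekend.groupby(toy["hotel"]).mean()

# Average number of weekend nights per booking, per hotel.
mean_weekend_nights = toy.groupby("hotel")["stays_in_weekend_nights"].mean()
```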


In [55]: plt.figure(figsize=(12,6))  # create the figure first so the plot is drawn on it
sns.countplot(x='stays_in_weekend_nights', data=df, hue='hotel')
plt.show()

Insight
Most weekend-night stays are booked at the City Hotel.

Checking week-night stays by hotel


In [56]: df['stays_in_week_nights'].count()

Out[56]: 87370
In [57]: df.groupby(['hotel'])['stays_in_week_nights'].count().sort_values(ascending=False

Out[57]: hotel

City Hotel 53426

Resort Hotel 33944

Name: stays_in_week_nights, dtype: int64

In [58]: plt.figure(figsize=(12,6))  # create the figure first so the plot is drawn on it
sns.countplot(x='stays_in_week_nights', data=df, hue='hotel')
plt.show()

Insight
Most week-night stays are booked at the City Hotel.

Checking canceled reservations in each market segment


In [59]: sns.countplot(x="market_segment", data=df, hue="is_canceled", palette="magma")
plt.xticks(rotation=90)
plt.title("Reservation cancelled from both hotels")
plt.show()

In [60]: list(df.columns)

Out[60]: ['hotel',

'is_canceled',

'lead_time',

'arrival_date_year',

'arrival_date_month',

'arrival_date_week_number',

'arrival_date_day_of_month',

'stays_in_weekend_nights',

'stays_in_week_nights',

'adults',

'children',

'babies',

'meal',

'country',

'market_segment',

'distribution_channel',

'is_repeated_guest',

'previous_cancellations',

'previous_bookings_not_canceled',

'reserved_room_type',

'assigned_room_type',

'booking_changes',

'deposit_type',

'agent',

'days_in_waiting_list',

'customer_type',

'adr',

'required_car_parking_spaces',

'total_of_special_requests',

'reservation_status',

'reservation_status_date']
In [61]: df.groupby(['market_segment','hotel']).count()

Out[61]:
                              is_canceled  lead_time  arrival_date_year  arrival_date_month  arrival_date_
market_segment  hotel
Aviation        City Hotel    227          227        227                227
Complementary   City Hotel    513          513        513                513
                Resort Hotel  189          189        189                189
Corporate       City Hotel    2227         2227       2227               2227
                Resort Hotel  1975         1975       1975               1975
Direct          City Hotel    5559         5559       5559               5559
                Resort Hotel  6239         6239       6239               6239
Groups          City Hotel    2635         2635       2635               2635
                Resort Hotel  2305         2305       2305               2305
Offline TA/TO   City Hotel    7271         7271       7271               7271
                Resort Hotel  6615         6615       6615               6615
Online TA       City Hotel    34992        34992      34992              34992
                Resort Hotel  16621        16621      16621              16621
Undefined       City Hotel    2            2          2                  2

14 rows × 29 columns
In [62]: df.groupby(['market_segment'])['hotel'].count()

Out[62]: market_segment

Aviation 227

Complementary 702

Corporate 4202

Direct 11798

Groups 4940

Offline TA/TO 13886

Online TA 51613

Undefined 2

Name: hotel, dtype: int64

In [63]: sns.countplot(x='market_segment', data=df, hue='hotel', palette='magma')


plt.xticks(rotation=90)
plt.title('Market segment in each hotel')
plt.show()

Checking Repeated guest


In [64]: df['is_repeated_guest'].value_counts()
# 0 = the booking is not from a repeated guest
# 1 = the booking is from a repeated guest

Out[64]: 0 83955

1 3415

Name: is_repeated_guest, dtype: int64


In [65]: df.groupby(['is_repeated_guest'])['hotel'].value_counts()
# df.groupby(['hotel'])['is_repeated_guest'].value_counts()

# 0 = the booking is not from a repeated guest
# 1 = the booking is from a repeated guest

Out[65]: is_repeated_guest hotel

0 City Hotel 51718

Resort Hotel 32237

1 City Hotel 1708

Resort Hotel 1707

Name: hotel, dtype: int64
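
The raw counts above are nearly equal for repeated guests (1708 vs 1707), but the hotels differ in size, so the share per hotel is more informative: from the counts above it is roughly 3.2% for the City Hotel and 5.0% for the Resort Hotel. A minimal sketch of computing such shares with `pd.crosstab` on a toy frame (illustrative values):

```python
import pandas as pd

# Toy bookings (illustrative values only).
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 4 + ["Resort Hotel"] * 2,
    "is_repeated_guest": [0, 0, 0, 1, 0, 1],
})

# normalize="index" turns each row into proportions that sum to 1.
repeat_share = pd.crosstab(toy["hotel"], toy["is_repeated_guest"], normalize="index")
```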

In [66]: df.groupby(['hotel','is_repeated_guest']).count()

Out[66]:
                                  is_canceled  lead_time  arrival_date_year  arrival_date_month  arrival_dat
hotel         is_repeated_guest
City Hotel    0                   51718        51718      51718              51718
              1                   1708         1708       1708               1708
Resort Hotel  0                   32237        32237      32237              32237
              1                   1707         1707       1707               1707

4 rows × 29 columns

In [67]: df.groupby(['is_repeated_guest','hotel']).count()

Out[67]:
                                  is_canceled  lead_time  arrival_date_year  arrival_date_month  arrival_dat
is_repeated_guest  hotel
0                  City Hotel     51718        51718      51718              51718
                   Resort Hotel   32237        32237      32237              32237
1                  City Hotel     1708         1708       1708               1708
                   Resort Hotel   1707         1707       1707               1707

4 rows × 29 columns
In [68]: list(df.columns)

Out[68]: ['hotel',

'is_canceled',

'lead_time',

'arrival_date_year',

'arrival_date_month',

'arrival_date_week_number',

'arrival_date_day_of_month',

'stays_in_weekend_nights',

'stays_in_week_nights',

'adults',

'children',

'babies',

'meal',

'country',

'market_segment',

'distribution_channel',

'is_repeated_guest',

'previous_cancellations',

'previous_bookings_not_canceled',

'reserved_room_type',

'assigned_room_type',

'booking_changes',

'deposit_type',

'agent',

'days_in_waiting_list',

'customer_type',

'adr',

'required_car_parking_spaces',

'total_of_special_requests',

'reservation_status',

'reservation_status_date']

In [69]: df['distribution_channel'].value_counts()

Out[69]: TA/TO 69133

Direct 12980

Corporate 5071

GDS 181

Undefined 5

Name: distribution_channel, dtype: int64


In [70]: sns.countplot(x='distribution_channel',data=df)
plt.show()
In [71]: sns.countplot(x='distribution_channel',data=df, hue='hotel')
plt.title('Distribution channel from both hotels')
plt.show()

In [72]: df['reservation_status'].value_counts()

Out[72]: Check-Out 63346

Canceled 23010

No-Show 1014

Name: reservation_status, dtype: int64

In [73]: df.groupby(['hotel', 'reservation_status']).count()

Out[73]:
                                   is_canceled  lead_time  arrival_date_year  arrival_date_month  arrival_dat
hotel         reservation_status
City Hotel    Canceled             15301        15301      15301              15301
              Check-Out            37377        37377      37377              37377
              No-Show              748          748        748                748
Resort Hotel  Canceled             7709         7709       7709               7709
              Check-Out            25969        25969      25969              25969
              No-Show              266          266        266                266

6 rows × 29 columns
In [74]: sns.countplot(x='reservation_status', data=df, hue='hotel')
plt.show()

In [75]: sns.countplot(x='hotel', data=df, hue='reservation_status')


plt.show()

To find the mean adr (average daily rate) for each arrival month


In [76]: df.groupby(['arrival_date_month'])['adr'].mean()

Out[76]: arrival_date_month

April 103.631776

August 150.876120

December 81.425918

February 74.731739

January 70.061422

July 135.521525

June 119.750120

March 81.624414

May 111.191058

November 72.768983

October 90.167276

September 112.081873

Name: adr, dtype: float64
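
The result above averages over both hotels. To compare the hotels, the same `groupby` can be keyed on 'hotel', or on hotel and month together. A minimal sketch on a toy frame (the adr values are illustrative, not from the dataset):

```python
import pandas as pd

# Toy bookings (illustrative values only).
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"],
    "arrival_date_month": ["July", "August", "July", "August"],
    "adr": [100.0, 120.0, 150.0, 190.0],
})

# Mean adr per hotel.
adr_by_hotel = toy.groupby("hotel")["adr"].mean()

# Mean adr per month with one column per hotel.
adr_by_hotel_month = (
    toy.groupby(["hotel", "arrival_date_month"])["adr"].mean().unstack("hotel")
)
```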

In [77]: list(df.columns)

Out[77]: ['hotel',

'is_canceled',

'lead_time',

'arrival_date_year',

'arrival_date_month',

'arrival_date_week_number',

'arrival_date_day_of_month',

'stays_in_weekend_nights',

'stays_in_week_nights',

'adults',

'children',

'babies',

'meal',

'country',

'market_segment',

'distribution_channel',

'is_repeated_guest',

'previous_cancellations',

'previous_bookings_not_canceled',

'reserved_room_type',

'assigned_room_type',

'booking_changes',

'deposit_type',

'agent',

'days_in_waiting_list',

'customer_type',

'adr',

'required_car_parking_spaces',

'total_of_special_requests',

'reservation_status',

'reservation_status_date']
In [78]: #To show the adr in each month

sns.scatterplot(x=df['arrival_date_month'], y=df['adr'])
plt.xticks(rotation=90)
plt.show()

In [79]: df['reserved_room_type'].value_counts()

Out[79]: A 56530

D 17397

E 6047

F 2822

G 2052

B 999

C 915

H 596

P 6

L 6

Name: reserved_room_type, dtype: int64


In [80]: sns.countplot(x='reserved_room_type', data=df)
plt.show()

In [81]: df['assigned_room_type'].value_counts()

Out[81]: A 46301

D 22422

E 7193

F 3625

G 2498

C 2165

B 1820

H 706

I 357

K 276

P 6

L 1

Name: assigned_room_type, dtype: int64


In [82]: sns.countplot(x='assigned_room_type', data=df)
plt.show()

In [83]: df.groupby(['hotel'])['assigned_room_type'].value_counts()

Out[83]: hotel assigned_room_type

City Hotel A 33403

D 13209

E 2051

F 1970

B 1661

G 691

K 276

C 161

P 4

Resort Hotel A 12898

D 9213

E 5142

C 2004

G 1807

F 1655

H 706

I 357

B 159

P 2

L 1

Name: assigned_room_type, dtype: int64
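
The reserved and assigned room type counts differ (for example, type A is reserved 56530 times but assigned 46301 times), which suggests that some guests receive a different room type than they booked. One way to quantify this, sketched on a toy frame with illustrative values:

```python
import pandas as pd

# Toy bookings (illustrative values only).
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"],
    "reserved_room_type": ["A", "A", "D", "E"],
    "assigned_room_type": ["A", "D", "D", "E"],
})

# True where the assigned room type differs from the reserved one.
changed = toy["reserved_room_type"] != toy["assigned_room_type"]

# Share of bookings with a changed room type, per hotel.
changed_share_by_hotel = changed.groupby(toy["hotel"]).mean()

# Reserved type (rows) against assigned type (columns).
room_matrix = pd.crosstab(toy["reserved_room_type"], toy["assigned_room_type"])
```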


In [84]: list(df.columns)

Out[84]: ['hotel',

'is_canceled',

'lead_time',

'arrival_date_year',

'arrival_date_month',

'arrival_date_week_number',

'arrival_date_day_of_month',

'stays_in_weekend_nights',

'stays_in_week_nights',

'adults',

'children',

'babies',

'meal',

'country',

'market_segment',

'distribution_channel',

'is_repeated_guest',

'previous_cancellations',

'previous_bookings_not_canceled',

'reserved_room_type',

'assigned_room_type',

'booking_changes',

'deposit_type',

'agent',

'days_in_waiting_list',

'customer_type',

'adr',

'required_car_parking_spaces',

'total_of_special_requests',

'reservation_status',

'reservation_status_date']
In [85]: df.groupby(['previous_cancellations'])['hotel'].value_counts()

Out[85]: previous_cancellations hotel

0 City Hotel 52248

Resort Hotel 33437

1 City Hotel 966

Resort Hotel 441

2 City Hotel 72

Resort Hotel 40

3 City Hotel 51

Resort Hotel 10

4 City Hotel 24

Resort Hotel 6

5 City Hotel 16

Resort Hotel 3

6 City Hotel 17

11 City Hotel 27

13 City Hotel 4

14 Resort Hotel 1

19 Resort Hotel 1

21 City Hotel 1

24 Resort Hotel 2

25 Resort Hotel 2

26 Resort Hotel 1

Name: hotel, dtype: int64

In [86]: df.groupby(['hotel'])['previous_cancellations'].value_counts()

Out[86]: hotel previous_cancellations

City Hotel 0 52248

1 966

2 72

3 51

11 27

4 24

6 17

5 16

13 4

21 1

Resort Hotel 0 33437

1 441

2 40

3 10

4 6

5 3

24 2

25 2

14 1

19 1

26 1

Name: previous_cancellations, dtype: int64


In [87]: sns.scatterplot(data=df, x=df['previous_cancellations'], y=df['customer_type'], h
plt.grid()
plt.show()

In [88]: df['customer_type'].value_counts()

Out[88]: Transient 71968

Transient-Party 11719

Contract 3139

Group 544

Name: customer_type, dtype: int64

Pie chart of customer_type


In [89]: data = df['customer_type'].value_counts()
print(data)
labels = data.index  # keep the labels in the same order as the counts

explode = (0.1, 0.1, 0.2, 0.4)
plt.figure(figsize =(10, 8))
plt.pie(data,labels=labels, explode=explode, autopct='%.2f')
plt.show()

Transient 71968

Transient-Party 11719

Contract 3139

Group 544

Name: customer_type, dtype: int64

In [90]: df['deposit_type'].value_counts()

Out[90]: No Deposit 86225

Non Refund 1038

Refundable 107

Name: deposit_type, dtype: int64

Pie chart of deposit_type


In [91]: data = df['deposit_type'].value_counts()
# print(data)
labels = data.index  # keep the labels in the same order as the counts

explode = (0.2, 0.3, 0.6)
plt.figure(figsize =(10, 8))
plt.pie(data,labels=labels, explode=explode, autopct='%.2f')
plt.show()

countplot chart of deposit_type


In [92]: sns.countplot(x='deposit_type', data=df, hue='hotel')
plt.show()

In [93]: list(df.columns)

Out[93]: ['hotel',

'is_canceled',

'lead_time',

'arrival_date_year',

'arrival_date_month',

'arrival_date_week_number',

'arrival_date_day_of_month',

'stays_in_weekend_nights',

'stays_in_week_nights',

'adults',

'children',

'babies',

'meal',

'country',

'market_segment',

'distribution_channel',

'is_repeated_guest',

'previous_cancellations',

'previous_bookings_not_canceled',

'reserved_room_type',

'assigned_room_type',

'booking_changes',

'deposit_type',

'agent',

'days_in_waiting_list',

'customer_type',

'adr',

'required_car_parking_spaces',

'total_of_special_requests',

'reservation_status',

'reservation_status_date']

To know the number of bookings that were not cancelled, by arrival month
In [94]: # df[df['is_canceled']==0]['arrival_date_month'].value_counts()
month_guest = df[df['is_canceled']==0]['arrival_date_month'].value_counts().reset
month_guest.columns = ['month','Number of guest']
month_guest

Out[94]:
month Number of guest

0 August 7634

1 July 6857

2 May 5912

3 March 5680

4 April 5497

5 June 5411

6 October 5290

7 September 5047

8 February 4676

9 November 3939

10 December 3750

11 January 3653
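
The column names produced by `value_counts().reset_index()` depend on the pandas version (older versions name the first column `index`), which is why the columns are renamed afterwards. A version-independent way to build the same two-column table is to name both parts explicitly. A minimal sketch on a toy frame (illustrative values):

```python
import pandas as pd

# Toy bookings (illustrative values only).
toy = pd.DataFrame({
    "is_canceled": [0, 0, 1, 0, 0],
    "arrival_date_month": ["August", "July", "August", "August", "May"],
})

month_guest = (
    toy.loc[toy["is_canceled"] == 0, "arrival_date_month"]  # keep non-cancelled bookings
    .value_counts()
    .rename_axis("month")                     # name for the index (the months)
    .reset_index(name="Number of guest")      # name for the counts
)
```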

To plot the number of bookings that were not cancelled in each month
In [95]: sns.lineplot(data=month_guest, x="month", y="Number of guest")
plt.grid()
plt.show()
# Shows how many non-cancelled bookings arrived in each month

In [96]: data = df[df['is_canceled']==0]['arrival_date_month'].value_counts()
print(data)
labels = data.index  # keep the labels in the same order as the counts

# explode = (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2)
explode = (0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1)
plt.figure(figsize =(12, 10))
plt.pie(data,labels=labels, explode=explode, autopct='%.2f')
plt.show()

August 7634

July 6857

May 5912

March 5680

April 5497

June 5411

October 5290

September 5047

February 4676

November 3939

December 3750

January 3653

Name: arrival_date_month, dtype: int64

To count the number of bookings each year


In [97]: df['arrival_date_year'].value_counts()

Out[97]: 2016 42377

2017 31683

2015 13310

Name: arrival_date_year, dtype: int64

To count the number of bookings each year that were not cancelled
In [98]: df[df['is_canceled']==0]['arrival_date_year'].value_counts().reset_index()

Out[98]:
   index  arrival_date_year
0   2016              31169
1   2017              21571
2   2015              10606

Pie chart of the number of guests who visited each year and did not cancel the booking
In [99]: data = df[df['is_canceled']==0]['arrival_date_year'].value_counts()
print(data)
labels = data.index  # same order as the counts (2016, 2017, 2015)
explode = (0.0, 0.1, 0.1)
plt.figure(figsize =(12, 10))
plt.pie(data, labels=labels, explode=explode, autopct='%0.2f')
plt.show()

2016    31169
2017    21571
2015    10606
Name: arrival_date_year, dtype: int64

To understand the number of visitors who have not cancelled their bookings, by year
In [100]: year_guest = df[df['is_canceled']==0]['arrival_date_year'].value_counts().reset_i
year_guest.columns = ['Year', 'Number of guest']
print(year_guest)

sns.lineplot(data=year_guest, x='Year', y='Number of guest')
plt.grid()
plt.show()

   Year  Number of guest
0  2016            31169
1  2017            21571
2  2015            10606

In [101]: list(df.columns)

Out[101]: ['hotel',

'is_canceled',

'lead_time',

'arrival_date_year',

'arrival_date_month',

'arrival_date_week_number',

'arrival_date_day_of_month',

'stays_in_weekend_nights',

'stays_in_week_nights',

'adults',

'children',

'babies',

'meal',

'country',

'market_segment',

'distribution_channel',

'is_repeated_guest',

'previous_cancellations',

'previous_bookings_not_canceled',

'reserved_room_type',

'assigned_room_type',

'booking_changes',

'deposit_type',

'agent',

'days_in_waiting_list',

'customer_type',

'adr',

'required_car_parking_spaces',

'total_of_special_requests',

'reservation_status',

'reservation_status_date']

To count the number of non-cancelled bookings that included children, by hotel
In [102]: children_guest = df[(df['is_canceled'] == 0) & (df['children'])]['hotel'].value_c
children_guest.columns = ['hotel', 'no of guests']
children_guest

Out[102]:
          hotel  no of guests
0    City Hotel          3233
1  Resort Hotel          2146


In [103]: children_guest = df[(df['is_canceled'] == 0) & (df['children'])]['hotel'].value_c
explode = (0.0, 0.05)
labels = children_guest.index  # same order as the counts
plt.figure(figsize=(12, 10))
plt.pie(children_guest, labels=labels, explode=explode, autopct='%0.2f')
plt.show()

To count the number of non-cancelled bookings that included adults, by hotel
In [104]: adult_guest = df[(df['is_canceled']==0) & (df['adults'])]['hotel'].value_counts()
adult_guest.columns = ['hotel', 'no of guests']
adult_guest

Out[104]:
          hotel  no of guests
0    City Hotel         11025
1  Resort Hotel          6159

In [105]: adult_guest = df[(df['is_canceled']==0) & (df['adults'])]['hotel'].value_counts()


labels = adult_guest.index  # same order as the counts
explode = (0.0, 0.04)
plt.figure(figsize=(12, 10))
plt.pie(adult_guest, labels=labels, explode=explode, autopct='%0.2f')
plt.show()

To count the number of non-cancelled bookings that included babies, by hotel
In [106]: babies_guest = df[(df['is_canceled']==0) & (df['babies'])]['hotel'].value_counts(
babies_guest.columns = ['hotel', 'no of guests']
babies_guest

Out[106]:
          hotel  no of guests
0  Resort Hotel           435
1    City Hotel           298

In [107]: babies_guest = df[(df['is_canceled']==0) & (df['babies'])]['hotel'].value_counts(


labels = babies_guest.index  # same order as the counts
explode = (0.0, 0.04)
plt.figure(figsize=(12, 10))
plt.pie(babies_guest, labels=labels, explode=explode, autopct='%0.2f')
plt.show()
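
The three masks above pass a numeric column (`df['adults']`, `df['children']`, `df['babies']`) straight into `&`, so which rows are kept depends on how pandas coerces that column, and `value_counts()` on `hotel` counts bookings rather than people. A hedged sketch with an explicit `> 0` comparison and a separate head count, on a small made-up frame; `toy`, `bookings_with_children` and `head_count` are illustrative names, not the notebook's:

```python
import pandas as pd

# Toy stand-in for df (assumption: only the columns used here).
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel", "City Hotel"],
    "is_canceled": [0, 0, 0, 1, 0],
    "adults": [2, 1, 2, 2, 3],
    "children": [0, 2, 1, 1, 0],
})

not_canceled = toy["is_canceled"] == 0

# Number of non-cancelled bookings (rows) with at least one child, per hotel.
bookings_with_children = (
    toy.loc[not_canceled & (toy["children"] > 0), "hotel"].value_counts()
)

# Head count: total adults and children on non-cancelled bookings, per hotel.
head_count = toy.loc[not_canceled].groupby("hotel")[["adults", "children"]].sum()

print(bookings_with_children)
print(head_count)
```

The same pattern would apply to `babies`; whether bookings or people is the right unit depends on the question being asked.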

Visualisation of non-cancelled bookings that included adults, children and babies, by hotel
In [108]: cancel_by_adults = df[(df['is_canceled']==0) & (df['adults'])]['hotel'].value_cou
cancel_by_adults.columns = ['hotel', 'no of guest']
# cancel_by_adults

cancel_by_children = df[(df['is_canceled']==0) & (df['children'])]['hotel'].value
cancel_by_children.columns = ['hotel', 'no of guest']
# cancel_by_children

cancel_by_babies = df[(df['is_canceled']==0) & (df['babies'])]['hotel'].value_cou
cancel_by_babies.columns = ['hotel', 'no of guest']
# cancel_by_babies

# visualize them all

plt.figure(figsize=(14,6))
sns.lineplot(data=cancel_by_adults, x=cancel_by_adults["hotel"],y=cancel_by_adult
sns.lineplot(data=cancel_by_children, x=cancel_by_children["hotel"],y=cancel_by_c
sns.lineplot(data=cancel_by_babies, x=cancel_by_babies["hotel"],y=cancel_by_babie

plt.show()

Bookings cancelled, by country


In [109]: country_guest = df[(df['is_canceled']==1)]['country'].value_counts().reset_index(
country_guest.columns=['country','no of guest']
country_guest

Out[109]:
    country  no of guest
0       PRT         9824
1       GBR         1985
2       ESP         1862
3       FRA         1733
4       ITA         1075
...     ...          ...
122     KHM            1
123     FJI            1
124     MCO            1
125     UMI            1
126     TJK            1

127 rows × 2 columns

Bookings not cancelled, by country


In [110]: country_guest = df[(df['is_canceled']==0)]['country'].value_counts().reset_index(
country_guest.columns=['country','no of guest']
country_guest

Out[110]:
    country  no of guest
0       PRT        18058
1       GBR         8447
2       FRA         7104
3       ESP         5390
4       DEU         4334
...     ...          ...
160     MAC            1
161     CYM            1
162     TJK            1
163     ZMB            1
164     PLW            1

165 rows × 2 columns
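
The two tables above hold cancelled and non-cancelled counts separately, which makes it hard to compare countries of different size. One possible way to put both counts side by side and derive a cancellation rate is `pd.crosstab`; the sketch below runs on a made-up frame, and the names `toy`, `by_country` and the `cancel_rate` column are assumptions for illustration only:

```python
import pandas as pd

# Toy stand-in for df (assumption: only the two columns used here).
toy = pd.DataFrame({
    "country": ["PRT", "PRT", "PRT", "GBR", "GBR", "FRA"],
    "is_canceled": [1, 0, 1, 0, 0, 1],
})

# One row per country, one column per value of is_canceled (0 and 1).
by_country = pd.crosstab(toy["country"], toy["is_canceled"])
by_country = by_country.rename(columns={0: "not_canceled", 1: "canceled"})

# Total bookings and the share of them that were cancelled.
by_country["total"] = by_country["not_canceled"] + by_country["canceled"]
by_country["cancel_rate"] = by_country["canceled"] / by_country["total"]

# Largest markets first.
by_country = by_country.sort_values("total", ascending=False)
print(by_country)
```

Rates for countries with only a handful of bookings are noisy, so filtering on `total` before ranking by `cancel_rate` may be sensible.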

Visual representation of the number of visitors from each country
In [111]: # import plotly.express as px
# figure = px.line(country_guest, x='country', y='no of guest')
# figure.show()
In [113]: count = df['assigned_room_type'].value_counts().reset_index()
count.columns = ['assignedroom', 'no of guest']
count

Out[113]:
   assignedroom  no of guest
0             A        46301
1             D        22422
2             E         7193
3             F         3625
4             G         2498
5             C         2165
6             B         1820
7             H          706
8             I          357
9             K          276
10            P            6
11            L            1

Assigned room type for bookings that were not cancelled


In [114]: not_canceled = df[(df['is_canceled'])==0]['assigned_room_type'].value_counts().re
not_canceled.columns = ['assignedroom', 'no of guest']
not_canceled

Out[114]:
  assignedroom  no of guest
0            A        32120
1            D        16991
2            E         5495
3            F         2731
4            C         1770
5            G         1745
6            B         1421
7            H          457
8            I          352
9            K          264

Assigned room type for bookings that were cancelled


In [115]: canceled = df[(df['is_canceled'])==1]['assigned_room_type'].value_counts().reset_
canceled.columns = ['assignedroom', 'no of guest']
canceled

Out[115]:
   assignedroom  no of guest
0             A        14181
1             D         5431
2             E         1698
3             F          894
4             G          753
5             B          399
6             C          395
7             H          249
8             K           12
9             P            6
10            I            5
11            L            1

Visualisation to represent the above data

In [116]: sns.lineplot(data=not_canceled, x=not_canceled["assignedroom"],y=not_canceled["no
sns.lineplot(data=canceled, x=canceled["assignedroom"],y=canceled["no of guest"])
plt.show()

Representation of assigned room type vs hotel

In [117]: sns.countplot(data=df, x='assigned_room_type', hue='hotel')
plt.show()

Total Nights
In [119]: df['stays_in_weekend_nights']+df['stays_in_week_nights']

Out[119]: 0         0
          1         0
          2         1
          3         1
          4         2
                   ..
          119385    7
          119386    7
          119387    7
          119388    7
          119389    9
          Length: 87370, dtype: int64

Creating a column named 'total_nights' in the dataset


In [120]: df['total_nights'] = df['stays_in_weekend_nights']+df['stays_in_week_nights']

In [121]: df.head()

Out[121]:
          hotel  is_canceled  lead_time  arrival_date_year arrival_date_month  arrival_date_week_number
0  Resort Hotel            0        342               2015               July                        27
1  Resort Hotel            0        737               2015               July                        27
2  Resort Hotel            0          7               2015               July                        27
3  Resort Hotel            0         13               2015               July                        27
4  Resort Hotel            0         14               2015               July                        27

5 rows × 32 columns

In [122]: df['total_nights']

Out[122]: 0         0
          1         0
          2         1
          3         1
          4         2
                   ..
          119385    7
          119386    7
          119387    7
          119388    7
          119389    9
          Name: total_nights, Length: 87370, dtype: int64

In [123]: df.groupby(['hotel'])['total_nights'].count()

Out[123]: hotel
          City Hotel      53426
          Resort Hotel    33944
          Name: total_nights, dtype: int64
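
Note that `.count()` returns the number of bookings (rows) per hotel, not the number of nights stayed. A small sketch, on made-up data, of how `agg` could report the booking count, the total nights and the average stay together; `toy` and `summary` are illustrative names:

```python
import pandas as pd

# Toy stand-in for df (assumption: only the columns used here).
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel", "Resort Hotel"],
    "stays_in_weekend_nights": [0, 2, 2, 0, 4],
    "stays_in_week_nights": [1, 3, 5, 2, 10],
})
toy["total_nights"] = toy["stays_in_weekend_nights"] + toy["stays_in_week_nights"]

# count = number of bookings, sum = nights stayed, mean = average stay length.
summary = toy.groupby("hotel")["total_nights"].agg(["count", "sum", "mean"])
print(summary)
```

Plotting the `sum` column instead of `count` would give the share of nights rather than the share of bookings.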

Percentage of total_nights counts on hotels


In [124]: data = df.groupby(['hotel'])['total_nights'].count()
labels = data.index  # groupby sorts hotels alphabetically; keep labels in that order
explode = (0.0, 0.1)
plt.figure(figsize=(12,10))
plt.pie(data, labels=labels, explode=explode, autopct="%0.2f")
plt.title("Percentage of total_nights counts on hotels")
plt.show()

In [125]: df.groupby(['hotel'])['total_nights'].value_counts()

Out[125]: hotel         total_nights
          City Hotel    3               13552
                        2               10824
                        1               10282
                        4                9620
                        5                4180
                                        ...
          Resort Hotel  38                  1
                        45                  1
                        46                  1
                        60                  1
                        69                  1
          Name: total_nights, Length: 76, dtype: int64
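
Since the two hotels have different numbers of bookings, raw counts per stay length are hard to compare. A hedged sketch of the same grouping with `normalize=True`, which returns each stay length's share within its hotel; the data and the names `toy` and `share` are made up for illustration:

```python
import pandas as pd

# Toy stand-in for df (assumption: total_nights already computed).
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 4 + ["Resort Hotel"] * 2,
    "total_nights": [3, 3, 2, 1, 7, 7],
})

# Share of bookings for each stay length, computed separately per hotel.
share = toy.groupby("hotel")["total_nights"].value_counts(normalize=True)
print(share)
```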
