In [1]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# ignore warnings
import warnings
warnings.filterwarnings("ignore")
In [2]: df = pd.read_csv("netflix.txt")
In [3]: df.head()
  show_id     type                  title         director                                               cast        country          date_added  release_year rating   duration
0      s1    Movie   Dick Johnson Is Dead  Kirsten Johnson                                                NaN  United States  September 25, 2021          2020  PG-13     90 min
1      s2  TV Show          Blood & Water              NaN  Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...   South Africa  September 24, 2021          2021  TV-MA  2 Seasons
2      s3  TV Show              Ganglands  Julien Leclercq  Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...            NaN  September 24, 2021          2021  TV-MA   1 Season
3      s4  TV Show  Jailbirds New Orleans              NaN                                                NaN            NaN  September 24, 2021          2021  TV-MA   1 Season
4      s5  TV Show           Kota Factory              NaN  Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...          India  September 24, 2021          2021  TV-MA  2 Seasons
In [4]: df.info()
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB
row, col = df.shape
Create a custom function to get the distinct count of every column.
def count_distinct_info(dataframe):
    print("Count and Distinct Count:")
    for column in dataframe.columns:
        unique_count = dataframe[column].nunique()
        total_count = dataframe[column].count()
        print(f"{column}: Total Count = {total_count}, Distinct Count = {unique_count}\n")
count_distinct_info(df)
df.dtypes
Count and Distinct Count:
show_id: Total Count = 8807, Distinct Count = 8807
type: Total Count = 8807, Distinct Count = 2
title: Total Count = 8807, Distinct Count = 8807
director: Total Count = 6173, Distinct Count = 4528
cast: Total Count = 7982, Distinct Count = 7692
country: Total Count = 7976, Distinct Count = 748
date_added: Total Count = 8797, Distinct Count = 1767
release_year: Total Count = 8807, Distinct Count = 74
rating: Total Count = 8803, Distinct Count = 17
duration: Total Count = 8804, Distinct Count = 220
listed_in: Total Count = 8807, Distinct Count = 514
description: Total Count = 8807, Distinct Count = 8775
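The same counts can also be collected into a table instead of printed line by line. This is a minimal sketch, not the notebook's own function: the name count_distinct_table and the toy frame are made up for illustration.

```python
import pandas as pd

def count_distinct_table(dataframe: pd.DataFrame) -> pd.DataFrame:
    """Return the non-null count and distinct count of every column."""
    return pd.DataFrame({
        "total_count": dataframe.count(),       # non-null values per column
        "distinct_count": dataframe.nunique(),  # distinct non-null values per column
    })

# Toy frame standing in for df (illustrative values only)
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie"],
    "director": ["A", None, "A"],
})
summary = count_distinct_table(toy)
print(summary)
```

A table like this can be sorted or filtered, for example to list only the columns whose total_count is below the row count.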
show_id object
type object
title object
director object
cast object
country object
date_added object
release_year int64
rating object
duration object
listed_in object
description object
dtype: object
Changing the date_added data type to datetime.
df['date_added'] = pd.to_datetime(df['date_added'].str.strip(), format="%B %d, %Y")
df.dtypes
show_id object
type object
title object
director object
cast object
country object
date_added datetime64[ns]
release_year int64
rating object
duration object
listed_in object
description object
dtype: object
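Here we can see that title is unique across all 8807 rows, so it also acts as a primary key alongside show_id. date_added was originally read as the object data type, which is why it is converted to datetime64[ns].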
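A minimal sketch of the same conversion on a toy Series, to show how stray whitespace and unparseable values behave. The sample strings are invented, and errors="coerce" is an addition of this sketch (the notebook's call does not use it): it turns anything that does not match the format into NaT instead of raising.

```python
import pandas as pd

# Invented sample values: one clean, one with a leading space, one missing, one invalid
raw = pd.Series(["September 25, 2021", " August 4, 2017", None, "not a date"])

# Strip whitespace first, then parse "Month day, Year"
parsed = pd.to_datetime(raw.str.strip(), format="%B %d, %Y", errors="coerce")

print(parsed)
print("unparsed or missing:", parsed.isna().sum())
```

Counting the NaT values before and after the conversion is a quick way to check that no valid date was lost.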
df.isnull().sum()
show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64
# checking null values in percentage
df.isnull().sum() / row * 100
show_id          0.000000
type             0.000000
title            0.000000
director        29.908028
cast             9.367549
country          9.435676
date_added       0.113546
release_year     0.000000
rating           0.045418
duration         0.034064
listed_in        0.000000
description      0.000000
dtype: float64
df.head(5)
  show_id     type                  title         director                                               cast  ...
0      s1    Movie   Dick Johnson Is Dead  Kirsten Johnson                                                NaN  ...
1      s2  TV Show          Blood & Water              NaN  Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...  ...
2      s3  TV Show              Ganglands  Julien Leclercq  Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...  ...
3      s4  TV Show  Jailbirds New Orleans              NaN                                                NaN  ...
4      s5  TV Show           Kota Factory              NaN  Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...  ...
df.tail(5)
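The frame df1 used from here on has one row per combination of cast member, director, country and genre (columns new_cast, new_director, new_country, new_listed_in), but the cell that builds it is not part of this printout. The following is only a sketch of one common way to get such a frame, assuming the comma-separated columns were split and exploded. The helper name unnest and the toy data are hypothetical.

```python
import pandas as pd

def unnest(dataframe, column, new_column):
    """Split a comma-separated column so that each value gets its own row."""
    out = dataframe.reset_index(drop=True)            # fresh copy with a unique index
    out[new_column] = out[column].str.split(",")      # "A, B" -> ["A", " B"]
    out = out.explode(new_column).reset_index(drop=True)
    out[new_column] = out[new_column].str.strip()     # drop the space left by the split
    return out.drop(columns=[column])

# Toy frame standing in for df (illustrative values only)
toy = pd.DataFrame({
    "title": ["T1", "T2"],
    "cast": ["A, B", None],
    "country": ["X, Y", "Z"],
})
df1_sketch = unnest(toy, "cast", "new_cast")
df1_sketch = unnest(df1_sketch, "country", "new_country")
print(df1_sketch)
```

Rows with a missing value keep a single NaN row, which is consistent with the nulls that df1 still shows in the new_ columns.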
df1.info()
RangeIndex: 202065 entries, 0 to 202064
Data columns (total 12 columns):
 #   Column         Non-Null Count   Dtype
 0   show_id        202065 non-null  object
 1   type           202065 non-null  object
 2   title          202065 non-null  object
 3   date_added     201907 non-null  datetime64[ns]
 4   release_year   202065 non-null  int64
 5   rating         201998 non-null  object
 6   duration       202062 non-null  object
 7   description    202065 non-null  object
 8   new_cast       199916 non-null  object
 9   new_director   151422 non-null  object
 10  new_country    190168 non-null  object
 11  new_listed_in  202065 non-null  object
dtypes: datetime64[ns](1), int64(1), object(10)
memory usage: 18.5+ MB
df1.isnull().sum()
show_id              0
type                 0
title                0
date_added         158
release_year         0
rating              67
duration             3
description          0
new_cast          2149
new_director     50643
new_country      11897
new_listed_in        0
dtype: int64
# handling missing values in date_added
# storing release_year values in a variable in which date_added is null
x = df1.loc[df1['date_added'].isnull(), 'release_year'].unique()
x
array([2013, 2018, 2003, 2008, 2010, 2012, 2016, 2015], dtype=int64)
empty_df = pd.DataFrame()
for i in x:
    # .copy() avoids writing into a slice of df1
    gh = df1.loc[df1['release_year'] == i].copy()
    # most frequent date_added among titles released in year i
    gh['mode_of_date_added'] = gh['date_added'].mode().iloc[0]
    empty_df = pd.concat([gh, empty_df])
df2 = df1[~df1['release_year'].isin(x)]
df3 = pd.concat([df2, empty_df])
# rows coming from df2 have no mode yet, so they keep their own date_added
df3['mode_of_date_added'] = df3['mode_of_date_added'].fillna(df1['date_added'])
df3.drop(columns=['date_added'], inplace=True)
df3.rename(columns={'mode_of_date_added': 'date_added'}, inplace=True)
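The loop above gives every row of the affected release years the yearly mode, including rows that already had a date. As an alternative sketch (not the notebook's method), the mode per release year can be computed once and used to fill only the missing dates. The toy data and the helper mode_or_nat are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for df1 (illustrative values only)
toy = pd.DataFrame({
    "release_year": [2013, 2013, 2013, 2018, 2018],
    "date_added": pd.to_datetime(
        ["2014-01-01", "2014-01-01", None, "2019-03-01", None]
    ),
})

def mode_or_nat(series):
    """Most frequent non-null value of the group, or NaT if the group is all null."""
    modes = series.mode()
    return modes.iloc[0] if not modes.empty else pd.NaT

# One mode per release year, then looked up row by row
mode_by_year = toy.groupby("release_year")["date_added"].agg(mode_or_nat)
fallback = toy["release_year"].map(mode_by_year)

# Only the missing dates are replaced; existing dates stay untouched
toy["date_added"] = toy["date_added"].fillna(fallback)
print(toy)
```

Whether to overwrite all rows of a year or only the missing ones is a design choice; filling only the gaps keeps the observed dates intact.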
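# dropping rows where more than one of new_cast, new_director, new_country is null
# (title is never null, so thresh=3 keeps rows with at most one of the other three missing)
df4 = df3.dropna(subset=['title', 'new_cast', 'new_director', 'new_country'], thresh=3)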
df4.isnull().sum()
show_id              0
type                 0
title                0
release_year         0
rating              65
duration             3
description          0
new_cast          1278
new_director     45162
new_country       6838
new_listed_in        0
date_added           0
dtype: int64
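The counts above still contain nulls because thresh=3 only requires three non-null values among the four listed columns. A small sketch on invented data shows which rows survive:

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "title": ["T1", "T2", "T3"],
    "new_cast": ["A", np.nan, np.nan],
    "new_director": ["D", "D", np.nan],
    "new_country": ["X", "X", "X"],
})
cols = ["title", "new_cast", "new_director", "new_country"]

# Keep rows that have at least 3 non-null values among cols
kept = toy.dropna(subset=cols, thresh=3)
print(kept["title"].tolist())  # ['T1', 'T2']
```

T2 is kept although new_cast is missing, while T3 is dropped because two of the four values are missing.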
# sorting values of our data frame with respect to important columns to identify new ...
df4.sort_values(['show_id', 'type', 'title', 'rating', 'new_listed_in', 'new_cast'], inplace=True)
# filling null values in our new cast
df4['new_cast'] = df4['new_cast'].ffill()
df4.loc[df4['new_country'] ..., 'new_country'] = np.nan
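The condition of the .loc assignment above is cut off in the printout. One plausible reading, and it is only an assumption, is that blank country strings left over from splitting are set to NaN so that they can be filled later. A sketch of that idea on invented data:

```python
import numpy as np
import pandas as pd

# Toy column with an empty string, a whitespace-only string and a real missing value
toy = pd.DataFrame({"new_country": ["India", "", "  ", None, " Japan"]})

# Assumed condition: the value is empty once surrounding whitespace is removed
blank = toy["new_country"].str.strip() == ""
toy.loc[blank, "new_country"] = np.nan

print(toy["new_country"].isna().sum())
```

Values such as " Japan" are left as they are here; stripping them is a separate step.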
df4.sort_values(['show_id', 'type', 'title', 'new_listed_in', 'rating', 'new_cast', 'new_country'], inplace=True)
df4['new_country'] = df4['new_country'].str.strip()
df4['new_country'] = df4['new_country'].ffill()
df4.sort_values(['title', 'type', 'new_listed_in', 'rating', 'new_country', 'new_director'], inplace=True)
df4['new_director'] = df4['new_director'].ffill()
df4.sort_values(['title', 'type', 'new_listed_in', 'new_country', 'new_director', 'rating'], inplace=True)
df4['rating'] = df4['rating'].ffill()
df4.dropna(inplace=True)
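Sorting and then forward filling copies the value of the previous row, so a title that has no value at all inherits one from the title sorted just before it. A hedged sketch of the difference between a plain forward fill and one restricted to each title, on invented data (the group-wise variant is an alternative, not what the notebook does):

```python
import numpy as np
import pandas as pd

# Toy frame: title B has no director at all
toy = pd.DataFrame({
    "title": ["A", "A", "B"],
    "new_director": ["D1", np.nan, np.nan],
})
toy = toy.sort_values(["title", "new_director"])

# Plain forward fill: B receives A's director
plain = toy["new_director"].ffill()

# Group-wise forward fill: values never cross from one title to another
grouped = toy.groupby("title")["new_director"].ffill()

print(pd.DataFrame({"title": toy["title"], "plain": plain, "grouped": grouped}))
```

With the group-wise fill the remaining nulls would still have to be dropped or handled explicitly.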
# finally data is cleaned
df4.info()
Index: 196448 entries, 48557 to 161548
Data columns (total 12 columns):
 #   Column         Non-Null Count   Dtype
 0   show_id        196448 non-null  object
 1   type           196448 non-null  object
 2   title          196448 non-null  object
 3   release_year   196448 non-null  int64
 4   rating         196448 non-null  object
 5   duration       196448 non-null  object
 6   description    196448 non-null  object
 7   new_cast       196448 non-null  object
 8   new_director   196448 non-null  object
 9   new_country    196448 non-null  object
 10  new_listed_in  196448 non-null  object
 11  date_added     196448 non-null  datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(10)
memory usage: 19.5+ MB
df4.head()
      show_id   type   title  release_year rating duration                                        description       new_cast  ...
48557   s2037  Movie  #Alive          2020  TV-MA   99 min  As a grisly virus rampages a city, a lone man ...  Park Shin-hye  ...
48554   s2037  Movie  #Alive          2020  TV-MA   99 min  As a grisly virus rampages a city, a lone man ...      Yoo Ah-in  ...
48558   s2037  Movie  #Alive          2020  TV-MA   99 min  As a grisly virus rampages a city, a lone man ...  Park Shin-hye  ...
48555   s2037  Movie  #Alive          2020  TV-MA   99 min  As a grisly virus rampages a city, a lone man ...      Yoo Ah-in  ...
48556   s2037  Movie  #Alive          2020  TV-MA   99 min  As a grisly virus rampages a city, a lone man ...  Park Shin-hye  ...
df.shape[0]
8807
count_distinct_info(df4)
# since there are too many distinct values in our fields, we will only take the top 10 ...
Count and Distinct Count:
show_id: Total Count = 196448, Distinct Count = 8080
type: Total Count = 196448, Distinct Count = ...
title: Total Count = 196448, Distinct Count = 8080
release_year: Total Count = 196448, Distinct Count = 73
rating: Total Count = 196448, Distinct Count = 14
duration: Total Count = 196448, Distinct Count = 217
description: Total Count = 196448, Distinct Count = 8055
new_cast: Total Count = 196448, Distinct Count = 3822
new_director: Total Count = 196448, Distinct Count = 5075
new_country: Total Count = 196448, Distinct Count = 122
new_listed_in: Total Count = 196448, Distinct Count = 73
date_added: Total Count = 196448, Distinct Count = 1266
df4.head()
      show_id   type   title  release_year rating duration                                        description       new_cast new_director  ...
48557   s2037  Movie  #Alive          2020  TV-MA   99 min  As a grisly virus rampages a city, a lone man ...  Park Shin-hye       Cho Il  ...
48554   s2037  Movie  #Alive          2020  TV-MA   99 min  As a grisly virus rampages a city, a lone man ...      Yoo Ah-in       Cho Il  ...
48558   s2037  Movie  #Alive          2020  TV-MA   99 min  As a grisly virus rampages a city, a lone man ...  Park Shin-hye       Cho Il  ...
48555   s2037  Movie  #Alive          2020  TV-MA   99 min  As a grisly virus rampages a city, a lone man ...      Yoo Ah-in       Cho Il  ...
48556   s2037  Movie  #Alive          2020  TV-MA   99 min  As a grisly virus rampages a city, a lone man ...  Park Shin-hye       Cho Il  ...
# trend of type of movies and tv show in each year
plt.figure(figsize=(12, 3))
sns.lineplot(data=df4.groupby(['type', 'release_year'])['title'].nunique().reset_index(), ...)
plt.ylabel('count')
plt.xlabel('year')
plt.title('Trend analysis of count of Movies and Tv shows')
plt.show()
[Figure: line plot titled "Trend analysis of count of Movies and Tv shows"; x-axis: year, y-axis: count, one line per type.]
This indicates that, by number of titles, movies became much more common than TV shows for release years after 2010.
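The table behind the trend plot is the number of distinct titles per type and release year. A minimal sketch of that aggregation on invented data; the column name count is an assumption of this sketch.

```python
import pandas as pd

# Toy frame standing in for df4: M1 appears twice because the data is unnested
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "TV Show", "TV Show"],
    "release_year": [2019, 2019, 2020, 2020, 2020],
    "title": ["M1", "M1", "M2", "S1", "S2"],
})

# nunique() counts each title once, however many unnested rows it has
trend = (
    toy.groupby(["type", "release_year"])["title"]
    .nunique()
    .reset_index(name="count")
)
print(trend)
```

A frame shaped like trend could then be passed to a call such as sns.lineplot(data=trend, x="release_year", y="count", hue="type"). The x, y and hue arguments here are assumptions, since the notebook's own call is truncated in the printout.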
plt.figure(figsize=(12, 3))
sns.barplot(x='rating', y='title', data=df4.groupby(['rating', 'type'])['title'].nunique() ...
plt.ylabel('count')
plt.show()
[Figure: bar plot of title count by rating, split by type (Movie, TV Show); x-axis: rating, y-axis: count.]
TV-14, TV-MA and TV-G have the highest number of TV shows and movies.
plt.figure(figsize=(17, 5))
sns.barplot(data=df4.groupby(['new_country', 'type'])['title'].nunique().reset_index() ...
plt.xticks(size=10)
plt.ylabel('count')
plt.xlabel('country')
plt.show()
Here we can see that TV shows appear only for the United States, the United Kingdom and Japan, whereas Japan is the only country for which we do not have any movies.
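A country chart like this is easier to read when it is limited to the countries with the most titles. The sketch below assumes such a top-N filter; the value of top_n, the column name count and the toy data are made up, and the notebook's own filtering step is not visible in the printout.

```python
import pandas as pd

# Toy frame standing in for df4 (illustrative values only)
toy = pd.DataFrame({
    "new_country": ["United States", "United States", "United States",
                    "India", "India", "Japan"],
    "type": ["Movie", "Movie", "TV Show", "Movie", "Movie", "TV Show"],
    "title": ["M1", "M2", "S1", "M3", "M4", "S2"],
})
top_n = 2  # hypothetical cut-off

# Countries ranked by number of distinct titles
top_countries = (
    toy.groupby("new_country")["title"].nunique()
    .sort_values(ascending=False)
    .head(top_n)
    .index
)

# Distinct titles per country and type, restricted to the top countries
country_counts = (
    toy[toy["new_country"].isin(top_countries)]
    .groupby(["new_country", "type"])["title"]
    .nunique()
    .reset_index(name="count")
)
print(country_counts)
```

Any statement about which countries have TV shows or movies then applies only to the countries that pass the cut-off.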
# split "90 min" / "2 Seasons" into the number and the unit
df4['time'] = df4['duration'].str.split(" ", expand=True)[0].astype('int')
df4['Length_type'] = df4['duration'].str.split(" ", expand=True)[1]
# df4.groupby(['duration', 'type'])['title'].nunique().reset_index()
sns.barplot(data=df4.loc[df4['type'] == "TV Show"].groupby('time')['title'].nunique().re...
plt.ylabel('count')
plt.xlabel('season')
plt.title('Count of seasons in TV show')
plt.show()
[Figure: bar plot titled "Count of seasons in TV show"; x-axis: season, y-axis: count.]
The plot displays the count of TV shows for each number of seasons.
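The duration split can also be written with a regular expression, which tolerates missing values instead of failing on the integer cast. This is a sketch on invented data, not the notebook's code; the column name length_type is chosen for illustration.

```python
import pandas as pd

# Toy column with a movie, two TV shows and a missing duration
toy = pd.DataFrame({"duration": ["90 min", "2 Seasons", "1 Season", None]})

# Named groups become the column names of the extracted frame
parts = toy["duration"].str.extract(r"^(?P<time>\d+)\s+(?P<length_type>\w+)$")

# Nullable integer dtype keeps the missing row as <NA>
toy["time"] = pd.to_numeric(parts["time"]).astype("Int64")
toy["length_type"] = parts["length_type"]
print(toy)
```

For TV shows the time column holds the number of seasons, and for movies it holds minutes, so the two types should be analysed separately.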
plt.figure(figsize=(22, 16))
sns.barplot(data=df4.loc[df4['type'] == "TV Show"].groupby('new_listed_in')['title'].nunique() ...
plt.xlabel("Listed_in", size=15)
plt.ylabel("count", size=15)
plt.show()
Here we can see that International TV Shows is the most common genre among TV shows.
plt.figure(figsize=(22, 16))
sns.barplot(data=df4.loc[df4['type'] == 'Movie'].groupby('new_listed_in')['title'].nunique() ...
plt.xlabel("Listed_in", size=15)
plt.ylabel("count", size=15)
plt.show()
Among movies, Dramas is the most common genre.
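If the unnested genre values kept a leading space from the comma split, the same genre can show up under two labels and its count is divided between them. Whether that happened in df4 cannot be told from the printout, so the following is only a sketch on invented data that strips the labels before ranking the genres.

```python
import pandas as pd

# Toy frame: "Dramas" appears once with and once without a leading space
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "TV Show"],
    "title": ["M1", "M2", "M2", "S1"],
    "new_listed_in": ["Dramas", " Dramas", "Comedies", "International TV Shows"],
})

# Normalise the labels, then count distinct movie titles per genre
toy["new_listed_in"] = toy["new_listed_in"].str.strip()
genre_counts = (
    toy[toy["type"] == "Movie"]
    .groupby("new_listed_in")["title"]
    .nunique()
    .sort_values(ascending=False)
)
print(genre_counts)
```

Without the strip, Dramas would be reported as two genres with one title each in this toy example.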