In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# ignore warnings
import warnings
warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv("netflix.txt")

In [3]:
df.head()

  show_id     type                  title         director                                               cast        country          date_added  release_year  rating   duration
0      s1    Movie   Dick Johnson Is Dead  Kirsten Johnson                                                NaN  United States  September 25, 2021          2020   PG-13     90 min
1      s2  TV Show          Blood & Water              NaN  Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...   South Africa  September 24, 2021          2021   TV-MA  2 Seasons
2      s3  TV Show              Ganglands  Julien Leclercq  Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...            NaN  September 24, 2021          2021   TV-MA   1 Season
3      s4  TV Show  Jailbirds New Orleans              NaN                                                NaN            NaN  September 24, 2021          2021   TV-MA   1 Season
4      s5  TV Show           Kota Factory              NaN  Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...          India  September 24, 2021          2021   TV-MA  2 Seasons

In [4]:
df.info()

RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB

row, col = df.shape

# create a custom function to get distinct count of every column
def count_distinct_info(dataframe):
    print("Count and Distinct Count:")
    for column in dataframe.columns:
        unique_count = dataframe[column].nunique()
        total_count = dataframe[column].count()
        print(f"{column}: Total Count = {total_count}, Distinct Count = {unique_count}")

count_distinct_info(df)
df.dtypes

Count and Distinct Count:
show_id: Total Count = 8807, Distinct Count = 8807
type: Total Count = 8807, Distinct Count = 2
title: Total Count = 8807, Distinct Count = 8807
director: Total Count = 6173, Distinct Count = 4528
cast: Total Count = 7982, Distinct Count = 7692
country: Total Count = 7976, Distinct Count = 748
date_added: Total Count = 8797, Distinct Count = 1767
release_year: Total Count = 8807, Distinct Count = 74
rating: Total Count = 8803, Distinct Count = 17
duration: Total Count = 8804, Distinct Count = 220
listed_in: Total Count = 8807, Distinct Count = 514
description: Total Count = 8807, Distinct Count = 8775

show_id         object
type            object
title           object
director        object
cast            object
country         object
date_added      object
release_year     int64
rating          object
duration        object
listed_in       object
description     object
dtype: object

changing date_added data type to date

df['date_added'] = pd.to_datetime(df['date_added'].str.strip(), format="%B %d, %Y")
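The conversion above only works if every non-null date_added value matches "%B %d, %Y" once surrounding whitespace is stripped. A minimal sketch on made-up values shows how that format string behaves; the sample strings and the errors="coerce" option are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd

# Hypothetical sample values imitating the date_added column
sample = pd.Series(["September 25, 2021", " August 4, 2017", None, "not a date"])

# %B = full month name, %d = day of month, %Y = four-digit year.
# errors="coerce" turns unparseable strings into NaT instead of raising.
parsed = pd.to_datetime(sample.str.strip(), format="%B %d, %Y", errors="coerce")

print(parsed)
print(parsed.isna().sum(), "values could not be parsed or were missing")
```

Without errors="coerce", a single value that does not match the format raises an exception, which can be the better choice when silent data loss is a concern.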
df.dtypes

show_id                 object
type                    object
title                   object
director                object
cast                    object
country                 object
date_added      datetime64[ns]
release_year             int64
rating                  object
duration                object
listed_in               object
description             object
dtype: object

Here we can see that title also acts as a primary key (8807 distinct values in 8807 rows) and that date_added, originally of object dtype, is now datetime64[ns].

df.isnull().sum()

show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64

# checking null values in percentage
df.isnull().sum()/row*100

show_id          0.000000
type             0.000000
title            0.000000
director        29.908028
cast             9.367549
country          9.435676
date_added       0.113546
release_year     0.000000
rating           0.045418
duration         0.034064
listed_in        0.000000
description      0.000000
dtype: float64

In [ ]:
df.head(5)

  show_id     type                  title         director                                               cast
0      s1    Movie   Dick Johnson Is Dead  Kirsten Johnson                                                NaN
1      s2  TV Show          Blood & Water              NaN  Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...
2      s3  TV Show              Ganglands  Julien Leclercq  Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...
3      s4  TV Show  Jailbirds New Orleans              NaN                                                NaN
4      s5  TV Show           Kota Factory              NaN                            Mayur More, Jitendra...

In [ ]:
df.tail(5)

df1.info()

RangeIndex: 202065 entries, 0 to 202064
Data columns (total 12 columns):
 #   Column         Non-Null Count   Dtype
 0   show_id        202065 non-null  object
 1   type           202065 non-null  object
 2   title          202065 non-null  object
 3   date_added     201907 non-null  datetime64[ns]
 4   release_year   202065 non-null  int64
 5   rating         201998 non-null  object
 6   duration       202062 non-null  object
 7   description    202065 non-null  object
 8   new_cast       199916 non-null  object
 9   new_director   151422 non-null  object
 10  new_country    190168 non-null  object
 11  new_listed_in  202065 non-null  object
dtypes: datetime64[ns](1), int64(1), object(10)
memory usage: 18.5+ MB
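The df1 frame summarized above has far more rows than df and carries new_cast, new_director, new_country and new_listed_in columns, but the cell that built it does not appear in this export. One common way to obtain such a frame is to split each comma-separated column and explode it into one row per value. The sketch below does this on a toy frame; the helper name unnest_column and the sample rows are hypothetical, and the notebook's actual approach may differ.

```python
import pandas as pd

def unnest_column(frame, column, new_name):
    """Split a comma-separated column and give each value its own row."""
    out = frame.copy()
    out[new_name] = out[column].str.split(",")
    out = out.explode(new_name).drop(columns=[column])
    return out.reset_index(drop=True)

# Hypothetical toy data: one title with two cast members and two countries
toy = pd.DataFrame({
    "title": ["A", "B"],
    "cast": ["Ann, Bob", None],
    "country": ["India, Japan", "France"],
})

long = unnest_column(toy, "cast", "new_cast")
long = unnest_column(long, "country", "new_country")
print(long)
```

Splitting on "," alone leaves a leading space on every value after the first (" Bob", " Japan"), which is one reason a later str.strip() on such columns is useful. Missing values stay as a single NaN row rather than disappearing.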
In [ ]:
df1.isnull().sum()

show_id              0
type                 0
title                0
date_added         158
release_year         0
rating              67
duration             3
description          0
new_cast          2149
new_director     50643
new_country      11897
new_listed_in        0
dtype: int64

# handling missing values in date_added
# storing release_year values in a variable in which date_added is null
x = df1.loc[df1['date_added'].isnull(), 'release_year'].unique()
x

array([2013, 2018, 2003, 2008, 2010, 2012, 2016, 2015], dtype=int64)

empty_df = pd.DataFrame()
for i in x:
    # rows released in year i get the most frequent date_added of that year
    gh = df1.loc[df1['release_year'] == i]
    gh['mode_of_date_added'] = gh['date_added'].mode().iloc[0]
    empty_df = pd.concat([gh, empty_df])
df2 = df1[~df1['release_year'].isin(x)]
df3 = pd.concat([df2, empty_df])
# rows from df2 have no mode value yet, so they keep their original date_added
df3['mode_of_date_added'].fillna(df1['date_added'], inplace=True)
df3.drop(columns=['date_added'], inplace=True)
df3.rename(columns={'mode_of_date_added': 'date_added'}, inplace=True)

# dropping rows where two or more of new_cast, new_director, new_country are null
# (thresh=3 keeps rows with at least 3 non-null values among the 4 listed columns)
df4 = df3.dropna(subset=['title', 'new_cast', 'new_director', 'new_country'], thresh=3)
df4.isnull().sum()

show_id              0
type                 0
title                0
release_year         0
rating              65
duration             3
description          0
new_cast          1278
new_director     45162
new_country       6838
new_listed_in        0
date_added           0
dtype: int64

# sort the data frame by the important columns so that each forward fill below takes its value from a neighbouring, similar row
df4.sort_values(['show_id', 'type', 'title', 'rating', 'new_listed_in', 'new_cast'], inplace=True)
# filling null values in our new cast
df4['new_cast'] = df4['new_cast'].ffill()
df4.loc[df4['new_country'] == '', 'new_country'] = np.nan
df4.sort_values(['show_id', 'type', 'title', 'new_listed_in', 'rating', 'new_cast', 'new_country'], inplace=True)
df4['new_country'] = df4['new_country'].str.strip()
df4['new_country'] = df4['new_country'].ffill()
df4.sort_values(['title', 'type', 'new_listed_in', 'rating', 'new_country', 'new_director'], inplace=True)
df4['new_director'] = df4['new_director'].ffill()
df4.sort_values(['title', 'type', 'new_listed_in', 'new_country', 'new_director', 'rating'], inplace=True)
df4['rating'] = df4['rating'].ffill()
df4.dropna(inplace=True)

# finally data is cleaned
df4.info()

Index: 196448 entries, 48557 to 161548
Data columns (total 12 columns):
 #   Column         Non-Null Count   Dtype
 0   show_id        196448 non-null  object
 1   type           196448 non-null  object
 2   title          196448 non-null  object
 3   release_year   196448 non-null  int64
 4   rating         196448 non-null  object
 5   duration       196448 non-null  object
 6   description    196448 non-null  object
 7   new_cast       196448 non-null  object
 8   new_director   196448 non-null  object
 9   new_country    196448 non-null  object
 10  new_listed_in  196448 non-null  object
 11  date_added     196448 non-null  datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(10)
memory usage: 19.5+ MB

df4.head()

      show_id   type   title  release_year  rating  duration                                       description       new_cast
48557   s2037  Movie  #Alive          2020   TV-MA    99 min  As a grisly virus rampages a city, a lone man...  Park Shin-hye
48554   s2037  Movie  #Alive          2020   TV-MA    99 min  As a grisly virus rampages a city, a lone man...      Yoo Ah-in
48558   s2037  Movie  #Alive          2020   TV-MA    99 min  As a grisly virus rampages a city, a lone man...  Park Shin-hye
48555   s2037  Movie  #Alive          2020   TV-MA    99 min  As a grisly virus rampages a city, a lone man...      Yoo Ah-in
48556   s2037  Movie  #Alive          2020   TV-MA    99 min  As a grisly virus rampages a city, a lone man...  Park Shin-hye

In [ ]:
df.shape[0]

8807

In [ ]:
count_distinct_info(df4)
# since there are too many distinct values in our fields, we will only take the top 10

Count and Distinct Count:
show_id: Total Count = 196448, Distinct Count = 8080
type: Total Count = 196448, Distinct Count = 2
title: Total Count = 196448, Distinct Count = 8080
release_year: Total Count = 196448, Distinct Count = 73
rating: Total Count = 196448, Distinct Count = 14
duration: Total Count = 196448, Distinct Count = 217
description: Total Count = 196448, Distinct Count = 8055
new_cast: Total Count = 196448, Distinct Count = 3822
new_director: Total Count = 196448, Distinct Count = 5075
new_country: Total Count = 196448, Distinct Count = 122
new_listed_in: Total Count = 196448, Distinct Count = 73
date_added: Total Count = 196448, Distinct Count = 1266
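The date_added handling earlier in the notebook fills gaps with the most frequent date_added among titles sharing the same release_year. That idea can be sketched more compactly with groupby and transform. The toy data and the helper name group_mode are assumptions for illustration. Unlike the loop in the notebook, which assigns the mode to every row of an affected year, this version changes only the rows whose date is missing.

```python
import pandas as pd

def group_mode(values):
    """Return the most frequent non-null value, or NaT if the group has none."""
    modes = values.mode()
    return modes.iloc[0] if not modes.empty else pd.NaT

# Hypothetical toy data: two release years, each with one missing date
toy = pd.DataFrame({
    "release_year": [2013, 2013, 2013, 2018, 2018],
    "date_added": pd.to_datetime(
        ["2014-01-01", "2014-01-01", None, "2019-05-05", None]
    ),
})

# Most frequent date per release_year, broadcast back to every row
mode_per_year = toy.groupby("release_year")["date_added"].transform(group_mode)

# Fill only the missing dates; existing values are kept as they are
toy["date_added"] = toy["date_added"].fillna(mode_per_year)
print(toy)
```

Whether overwriting existing dates (as the loop does) or preserving them (as here) is preferable depends on how much the original dates matter for later analysis.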
In [ ]:
df4.head()

      show_id   type   title  release_year  rating  duration                                       description       new_cast  new_director
48557   s2037  Movie  #Alive          2020   TV-MA    99 min  As a grisly virus rampages a city, a lone man...  Park Shin-hye        Cho Il
48554   s2037  Movie  #Alive          2020   TV-MA    99 min  As a grisly virus rampages a city, a lone man...      Yoo Ah-in        Cho Il
48558   s2037  Movie  #Alive          2020   TV-MA    99 min  As a grisly virus rampages a city, a lone man...  Park Shin-hye        Cho Il
48555   s2037  Movie  #Alive          2020   TV-MA    99 min  As a grisly virus rampages a city, a lone man...      Yoo Ah-in        Cho Il
48556   s2037  Movie  #Alive          2020   TV-MA    99 min  As a grisly virus rampages a city, a lone man...  Park Shin-hye        Cho Il

In [ ]:
# trend of type of movies and tv show in each year
plt.figure(figsize=(12, 3))
sns.lineplot(data=df4.groupby(['type', 'release_year'])['title'].nunique().reset_index(...
plt.ylabel('count')
plt.xlabel('year')
plt.title('Trend analysis of count of Movies and Tv shows')
plt.show()

[Figure: line plot "Trend analysis of count of Movies and Tv shows"; x-axis: year, y-axis: count, one line per type]

This indicates that movies become far more numerous than TV shows among titles released after 2010.

In [ ]:
plt.figure(figsize=(12, 3))
sns.barplot(x='rating', y='title', data=df4.groupby(['rating', 'type'])['title'].nunique()...
plt.ylabel('count')
plt.show()

[Figure: bar plot of title count per rating, split by type (Movie, TV Show)]

TV-14, TV-MA and TV-G have the largest numbers of TV shows and movies.

plt.figure(figsize=(17, 5))
sns.barplot(data=df4.groupby(['new_country', 'type'])['title'].nunique().reset_index()...
plt.xticks(size=10)
plt.ylabel('count')
plt.xlabel('country')
plt.show()

Here we can notice that, in this plot, TV shows appear only for the United States, the United Kingdom and Japan, whereas Japan is the only country for which we do not have any movies.

df4['time'] = df4['duration'].str.split(" ", expand=True)[0].astype('int')
df4['Length_type'] = df4['duration'].str.split(" ", expand=True)[1]
# df4.groupby(['duration', 'type'])['title'].nunique().reset_index()
sns.barplot(data=df4.loc[df4['type'] == 'TV Show'].groupby('time')['title'].nunique().re...
plt.ylabel('count')
plt.xlabel('season')
plt.title('Count of seasons in TV show')
plt.show()

[Figure: bar plot "Count of seasons in TV show"; x-axis: season, y-axis: count]

The plot displays the number of TV shows for each season count.

In [ ]:
plt.figure(figsize=(22, 16))
sns.barplot(data=df4.loc[df4['type'] == 'TV Show'].groupby('new_listed_in')['title'].nunique()...
plt.xlabel("Listed_in", size=15)
plt.ylabel("count", size=15)
plt.show()

Here we can see that International TV Shows is the most common genre among TV shows.

In [ ]:
plt.figure(figsize=(22, 16))
sns.barplot(data=df4.loc[df4['type'] == 'Movie'].groupby('new_listed_in')['title'].nunique()...
plt.xlabel("Listed_in", size=15)
plt.ylabel("count", size=15)
plt.show()

Among movies, Dramas is the most common genre.
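Several of the plotting calls above are cut off at the page edge, so their remaining arguments cannot be recovered. What they share is the aggregation step: counting distinct titles per group. The sketch below reproduces that step on a toy frame; the sample rows and the column name title_count are assumptions for illustration, and the commented plotting call shows only one plausible set of arguments.

```python
import pandas as pd

# Hypothetical toy data; "#Alive" is repeated, as titles are in the unnested frame
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "TV Show", "TV Show"],
    "release_year": [2020, 2020, 2021, 2021, 2021],
    "title": ["#Alive", "#Alive", "B", "C", "D"],
})

# nunique() counts each title once per group, so the repeated rows
# created by unnesting do not inflate the result
counts = (
    toy.groupby(["type", "release_year"])["title"]
    .nunique()
    .reset_index(name="title_count")
)
print(counts)

# A plotting call could then use these columns, for example (assumed arguments):
# sns.lineplot(data=counts, x="release_year", y="title_count", hue="type")
```

Using count() instead of nunique() here would report 2 for Movie in 2020, because the same title occupies two rows; on a frame with roughly 196,000 rows for about 8,000 titles, that difference dominates the chart.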
