Professional Documents
Culture Documents
Netflix Analysis Report (2105878 - Bibhudutta Swain)
Netflix Analysis Report (2105878 - Bibhudutta Swain)
Submitted by
Bibhudutta Swain(2105878)
Under the guidance of
Dr. Soumya Ranjan Mishra
TABLE CONTENTS
Page No.
1. Abstract____________________________ 2
2. Introduction_________________________ 2-3
Requirement Gathering
Data Collection
Data Processing and Cleaning
4. Methodology_________________________ 4
7. Problem Statement____________________ 5
9. Implementation________________________ 6-16
11. Conclusion___________________________ 16
12. References___________________________ 17
2
ABSTRACT
Exploring datasets of Netflix for Future Release of TV shows and Movies on the
Platform. In this project, we are going to explore the dataset from Kaggle and we
would like to find out how long the Netflix platform takes a movie or a TV show to
release on its platform, how many movies and TV shows are released in specific time
frame, how many movies and TV shows are release in the recent ten years on the
platform, and what were the top 5 genres that the audience of the Netflix platform
liked the most. From here, we would like to apply a machine learning approach to
understand the data fully and provide a great solution where the platform should be
headed to. From our data analysis we conducted on R markdown, we have
discovered that there were a wide variety of genres that movie directors produced
worldwide and we have observed many cast members and genres they were in.
The project is made using different utility analytical tools present in Python
Library of versatile packages. This paper introduces systematic and insightful usage
of methods for Exploratory Data Analysis & Sentiment Analysis by utilizing various
packages concerned.
INTRODUCTION
The statistics field, which has a lengthy history of its own, is thought to be the origin
of the word "data analysis." We can arrive to intriguing results by using statistical
development strategies. Big Data is a result of the world's rapidly advancing
technological ramifications; we are continuously exposed to vast volumes of
unprocessed data that may be improved in the future based on the specifications
and requirements set forth by an organisation.
Following the most typical course of action, which is data gathering, is data analysis.
As a result, data analysis is recognised as a scientific method with the data as its
exclusive topic. In order to find and collect useful information that meets an entity's
demands, it starts with obtaining data from multiple external and internal sources.
From there, intrinsic analysis is performed on the data.
It starts with gathering data from several internal and external sources, then using
that data to do intrinsic analysis in order to find and extract useful information that
meets an entity's needs.Fundamentally, there are two primary methods for data
analysis – based on the nature and characteristic of data - qualitative data
analysis and quantitative data analysis techniques. These data analysis techniques
have the scope to be utilized independently or in combination with other
methods in order to gain access to some of the best business and intelligence-
oriented insights for making better decisions over the already present data.
3
1. Requirements Gathering
The data is the necessary requirement for providing inputs to any type of analysis. It
can be based on the requirements and parameters based on the user. This data
can be either numerical or categorical. The purpose and scope can range from
supervised to unsupervised learning.
2. Data Collection
The required data can be collected from a wide range of sources. It can be structured
based on the criteria provided by analysts to custodians of a particular data set. The
data can be man-made or in the form of technological output over utility (sensor
system tracking) and many other implications.
The data when purported for utilization must be processed on the level where the
needs for analysis are satisfied. This includes placing data in the form of rows and
columns that are human-understandable in nature. Further, it must be cleaned for
getting rid of any redundant data, or minimalize the presence of anomalies prior
to the deployment of the data for analysis. The figure given below explains these
four fundamental steps in a lucrative pictogram.
LITERATURE REVIEW
With the advent of technology, the need for consumption of data has increased
tremendously. Every single activity of our life has become interlinked with data.
As the famous British Mathematician – Clive Humby – once said, “Data is the new
oil."
As interesting as it may sound, the fact that there is a constant need for
intelligent solutions – with data as inputs – cannot be neglected. Due to these
changes, we have witnessed a paradigm shift in almost every single field around the
world. One such field is the entertainment industry that had a rapid transposition as
4
the industry soon began to adopt virtual methods for releasing its content. The
Over-The-Top (OTT) as a means to provide and deliver content has gained huge
prominence around the world. One of the key players in the game being Netflix,
has changed the way people subject themselves to entertainment. The from-
home-all-in-one package type subscription services that these OTT platforms has to
offer had customers glued to them quite literally. All these behavioral information
about large base of customers is essentially a valuable data. Thus, our project
focuses on deriving crucial insights from the publicly available Netflix dataset
(obtained from Kaggle6). We have also made use of Geographical Data Set and
Netflix Title Reviews Data Set.
METHODOLOGY
PROBLEM STATEMENT
Netflix has attracted a large audience in the last 5-10 years. As the number of
viewers increases, the variety of programs will also increase. Advances in technology
and streaming platforms have allowed us to watch movies from anywhere in the
world. Due to differences in tastes and preferences, the number of films released in
a year may vary depending on the current situation. So, let's compare and analyze
the frequency of TV series/films released in a year.
ABOUT DATASET
This data-set consists of TV shows and movies available on Netflix from 1930 but has
been boiled down from 2005 onwards. The data-set is collected from kaggle.
https://www.kaggle.com/datasets/shivamb/netflix-shows?resource=download
Netflix Movies and TV Shows | Kaggle
df.head()
df.info()
df.describe()
6
IMPLEMENTATION
1. Code
import numpy as np
import pandas as pd
import plotly.express as px
from textblob import TextBlob
df = pd.read_csv('netflix_titles.csv')
df.shape
df.head()
df.columns
x = df.groupby(['rating']).size().reset_index(name='counts')
print(x)
directors_list = pd.DataFrame()
print(directors_list)
directors_list.columns = ['Director']
print(directors_list)
top5Directors = directors.head()
print(top5Directors)
cast_df = pd.DataFrame()
cast_df = df['cast'].str.split(',', expand = True).stack()
cast_df = cast_df.to_frame()
cast_df.columns = ['Actor']
actors = cast_df.groupby(['Actor']).size().reset_index(name = 'Total Count')
actors = actors[actors.Actor != 'No cast specified']
actors = actors.sort_values(by=['Total Count'], ascending = False)
print(actors)
top5Actors = actors.head()
top5Actors = top5Actors.sort_values(by=['Total Count'])
print(top5Actors)
df['country'] = df['country'].fillna('NaN')
df.head()
country_df = pd.DataFrame()
country_df = df['country'].str.split(',', expand = True).stack()
country_df = country_df.to_frame()
country_df.columns = ['Country']
countries = country_df.groupby(['Country']).size().reset_index(name = 'Total Count')
countries = countries[countries.Country != 'NaN']
countries = countries.sort_values(by=['Total Count'], ascending = False)
print(countries)
top5Countries = countries.head()
8
df3 = df[['release_year','description']]
df3 = df3.rename(columns = {'release_year': 'Release Year', 'description':
'Description'})
for index, row in df3.iterrows():
d=row['Description']
testimonial = TextBlob(d)
p = testimonial.sentiment.polarity
if p==0:
sent = 'Neutral'
elif p>0:
sent = 'Positive'
else:
sent = 'Negative'
df3.loc[[index, 2], 'Sentiment']=sent
2. About Code
By utilizing the EDA approach towards the Netflix data we have garnered some
crucial insights for other useful purposes. The following are summarized points:
It's clear that Netflix has grown over the years. We can see it from the data that the
company took certain approaches in their marketing strategy to break into new
markets around the world.
CONCLUSION
Data analysis is a vital first step in meeting a client's diverse demands across all
professional domains. Since many firms are actively searching for futuristic,
predictive, and descriptive insights from the raw data they have previously collected,
the diverse variety of insights that may be gleaned from data is important in and of
itself.
REFERENCES