Download as pdf or txt
Download as pdf or txt
You are on page 1of 19

Exploratory

& Sentiment Analysis


of Netflix Data

A Project report submitted in the partial fulfillment of the


requirements for the award of the
Degree of
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE ENGINEERING

Submitted by
Bibhudutta Swain(2105878)
Under the guidance of
Dr. Soumya Ranjan Mishra

Kalinga Institue of Industrial Technology


1

TABLE CONTENTS
Page No.

1. Abstract____________________________ 2

2. Introduction_________________________ 2-3
 Requirement Gathering
 Data Collection
 Data Processing and Cleaning

3. Literature Review_____________________ 3-4

4. Methodology_________________________ 4

5. Objectives of the Project________________ 4

6. Scope of the Project____________________ 4-5

7. Problem Statement____________________ 5

8. About Dataset_________________________ 5-6

9. Implementation________________________ 6-16

10. Results and Ananlysis__________________ 16

11. Conclusion___________________________ 16

12. References___________________________ 17
2

ABSTRACT

Exploring datasets of Netflix for Future Release of TV shows and Movies on the
Platform. In this project, we are going to explore the dataset from Kaggle and we
would like to find out how long the Netflix platform takes a movie or a TV show to
release on its platform, how many movies and TV shows are released in specific time
frame, how many movies and TV shows are release in the recent ten years on the
platform, and what were the top 5 genres that the audience of the Netflix platform
liked the most. From here, we would like to apply a machine learning approach to
understand the data fully and provide a great solution where the platform should be
headed to. From our data analysis we conducted on R markdown, we have
discovered that there were a wide variety of genres that movie directors produced
worldwide and we have observed many cast members and genres they were in.
The project is made using different utility analytical tools present in Python
Library of versatile packages. This paper introduces systematic and insightful usage
of methods for Exploratory Data Analysis & Sentiment Analysis by utilizing various
packages concerned.

Keywords: Exploratory Data Analysis, Sentiment Analysis, Netflix, Kaggle, Machine


learning, Python.

INTRODUCTION

The statistics field, which has a lengthy history of its own, is thought to be the origin
of the word "data analysis." We can arrive to intriguing results by using statistical
development strategies. Big Data is a result of the world's rapidly advancing
technological ramifications; we are continuously exposed to vast volumes of
unprocessed data that may be improved in the future based on the specifications
and requirements set forth by an organisation.

Following the most typical course of action, which is data gathering, is data analysis.
As a result, data analysis is recognised as a scientific method with the data as its
exclusive topic. In order to find and collect useful information that meets an entity's
demands, it starts with obtaining data from multiple external and internal sources.
From there, intrinsic analysis is performed on the data.

It starts with gathering data from several internal and external sources, then using
that data to do intrinsic analysis in order to find and extract useful information that
meets an entity's needs.Fundamentally, there are two primary methods for data
analysis – based on the nature and characteristic of data - qualitative data
analysis and quantitative data analysis techniques. These data analysis techniques
have the scope to be utilized independently or in combination with other
methods in order to gain access to some of the best business and intelligence-
oriented insights for making better decisions over the already present data.
3

1. Requirements Gathering

The data is the necessary requirement for providing inputs to any type of analysis. It
can be based on the requirements and parameters based on the user. This data
can be either numerical or categorical. The purpose and scope can range from
supervised to unsupervised learning.

2. Data Collection

The required data can be collected from a wide range of sources. It can be structured
based on the criteria provided by analysts to custodians of a particular data set. The
data can be man-made or in the form of technological output over utility (sensor
system tracking) and many other implications.

3. Data Processing and Cleaning

The data when purported for utilization must be processed on the level where the
needs for analysis are satisfied. This includes placing data in the form of rows and
columns that are human-understandable in nature. Further, it must be cleaned for
getting rid of any redundant data, or minimalize the presence of anomalies prior
to the deployment of the data for analysis. The figure given below explains these
four fundamental steps in a lucrative pictogram.

Fig 1.1 Relationship of Data, Information and Intelligence

LITERATURE REVIEW

With the advent of technology, the need for consumption of data has increased
tremendously. Every single activity of our life has become interlinked with data.
As the famous British Mathematician – Clive Humby – once said, “Data is the new
oil."

As interesting as it may sound, the fact that there is a constant need for
intelligent solutions – with data as inputs – cannot be neglected. Due to these
changes, we have witnessed a paradigm shift in almost every single field around the
world. One such field is the entertainment industry that had a rapid transposition as
4

the industry soon began to adopt virtual methods for releasing its content. The
Over-The-Top (OTT) as a means to provide and deliver content has gained huge
prominence around the world. One of the key players in the game being Netflix,
has changed the way people subject themselves to entertainment. The from-
home-all-in-one package type subscription services that these OTT platforms has to
offer had customers glued to them quite literally. All these behavioral information
about large base of customers is essentially a valuable data. Thus, our project
focuses on deriving crucial insights from the publicly available Netflix dataset
(obtained from Kaggle6). We have also made use of Geographical Data Set and
Netflix Title Reviews Data Set.

METHODOLOGY

Exploratory data analysis is an approach to analyzing data sets to summarize their


main characteristics, often with visual methods. Statistical models may or may not
be used, but EDA is primarily about understanding what data can tell us beyond
modeling or theoretical analysis. John Tukey advocates data exploration to
encourage statisticians to explore data and perhaps develop hypotheses that will
lead to new data collection and testing. EDA differs from initial data analysis (IDA),
which focuses on examining hypotheses for model fit and hypothesis testing and
handling missing and variable results. EDA has IDA. The main purpose of EDA is to
enable the analyst to visualize the data set and the underlying structure of the data
set, while providing all the unique features that the analyst wants to extract from the
data set: a reasonable, parsimonious output model and a list of outliers.

Random forest classifier is a supervised learning algorithm. The "forest" it builds, is


an ensemble of decision trees, usually trained with the “bagging” method. The
general idea of the bagging method is that a combination of learning models
increases the overall result.

OBJECTIVES OF THE PROJECT

 To understand what content is available in different countries.


 No of movies/shows released based on the month.
 The most rated content based on the genre.
 To identify similar contents by matching criteria.
 Network analysis of actors/directors.
 Is Netflix has increasingly focusing on TV shows rather than movies in recent
years.
 The most observed rating by category in TV shows and movies.
 How many movies or shows it has released year differ from its year added.

SCOPE OF THE PROJECT

 Understanding what content is available in different countries


 Identifying similar content by matching text-based features
5

 Network analysis of Actors / Directors / Countries and find interesting insights


 Sentiment analysis of the Data-set

PROBLEM STATEMENT

Netflix has attracted a large audience in the last 5-10 years. As the number of
viewers increases, the variety of programs will also increase. Advances in technology
and streaming platforms have allowed us to watch movies from anywhere in the
world. Due to differences in tastes and preferences, the number of films released in
a year may vary depending on the current situation. So, let's compare and analyze
the frequency of TV series/films released in a year.

ABOUT DATASET

1. Data Indentified from

This data-set consists of TV shows and movies available on Netflix from 1930 but has
been boiled down from 2005 onwards. The data-set is collected from kaggle.
https://www.kaggle.com/datasets/shivamb/netflix-shows?resource=download
Netflix Movies and TV Shows | Kaggle

2. Details about the attributes of the dataset

 Show_id -Unique ID for every Movie / TV Show (NUMERICAL)


 Type -Identifier - A Movie or TV /Show (CATEGORICAL)
 title -Title of the Movie / TV Show (CATEGORICAL)
 director -Director of the Movie (CATEGORICAL)
 cast -Actors involved in the movie / show (CATEGORICAL) DATASET
 Country -Country where the movie / show was produced (CATEGORICAL) •
date_added -Date it was added on Netflix
 release_year -Actual Release year of the move / show (NUMERICAL)
 rating -TV Rating of the movie / show (CATEGORICAL)
 duration-Total Duration - in minutes or number of seasons (NUMERICAL)
 listed_in -movie genre (CATEGORICAL)
 Description-movie description.

BASIC DATA EXPLORATION

 df.head()
 df.info()
 df.describe()
6

VARIOUS ANALYSIS PERFORMED

 Checking for null values.


 Handling missing values.
 Checking for outliers.
 Handling Outliers that are present.

IMPLEMENTATION

1. Code

import numpy as np
import pandas as pd
import plotly.express as px
from textblob import TextBlob

df = pd.read_csv('netflix_titles.csv')

df.shape

df.head()

df.columns

x = df.groupby(['rating']).size().reset_index(name='counts')
print(x)

pieChart = px.pie(x, values='counts', names='rating', title='Distribution of content


ratings on Netflix')
pieChart.show()

df['director']=df['director'].fillna('Director not specified')


df.head()

directors_list = pd.DataFrame()
print(directors_list)

directors_list = df['director'].str.split(',', expand=True).stack()


print(directors_list)

directors_list.columns = ['Director']
print(directors_list)

directors = directors_list.groupby(['Director']).size().reset_index(name='Total Count')


print(directors)
7

directors = directors[directors.Director != 'Director not specified']


print(directors)

directors = directors.sort_values(by=['Total Count'], ascending = False)


print(directors)

top5Directors = directors.head()
print(top5Directors)

top5Directors = top5Directors.sort_values(by=['Total Count'])


barChart = px.bar(top5Directors, x = 'Total Count', y = 'Director', title = 'Top 5
Directors on Netflix')
barChart.show()

df['cast'] = df['cast'].fillna('No cast specified')


df.head()

cast_df = pd.DataFrame()
cast_df = df['cast'].str.split(',', expand = True).stack()
cast_df = cast_df.to_frame()
cast_df.columns = ['Actor']
actors = cast_df.groupby(['Actor']).size().reset_index(name = 'Total Count')
actors = actors[actors.Actor != 'No cast specified']
actors = actors.sort_values(by=['Total Count'], ascending = False)
print(actors)

top5Actors = actors.head()
top5Actors = top5Actors.sort_values(by=['Total Count'])
print(top5Actors)

barChart2 = px.bar(top5Actors, x ='Total Count', y = 'Actor', title = 'Top 5 Actors on


Netflix')
barChart2.show()

df['country'] = df['country'].fillna('NaN')
df.head()

country_df = pd.DataFrame()
country_df = df['country'].str.split(',', expand = True).stack()
country_df = country_df.to_frame()
country_df.columns = ['Country']
countries = country_df.groupby(['Country']).size().reset_index(name = 'Total Count')
countries = countries[countries.Country != 'NaN']
countries = countries.sort_values(by=['Total Count'], ascending = False)
print(countries)

top5Countries = countries.head()
8

top5Countries = top5Countries.sort_values(by=['Total Count'])


print(top5Countries)

barChart3 = px.bar(top5Countries, x ='Total Count', y = 'Country', title = 'Top 5


Countries on Netflix')
barChart3.show()

df1 = df[['type', 'release_year']]


df1 = df1.rename(columns ={"release_year": "Release Year","type":"Type"})
df2 = df1.groupby(['Release Year', 'Type']).size().reset_index(name='Total Count')
print(df2)

df2 = df2[df2['Release Year'] >= 2000]


graph = px.line(df2, x = "Release Year", y = "Total Count", color = "Type", title =
"Trend on Content Produced on Netflix Every Year")
graph.show()

df3 = df[['release_year','description']]
df3 = df3.rename(columns = {'release_year': 'Release Year', 'description':
'Description'})
for index, row in df3.iterrows():
d=row['Description']
testimonial = TextBlob(d)
p = testimonial.sentiment.polarity
if p==0:
sent = 'Neutral'
elif p>0:
sent = 'Positive'
else:
sent = 'Negative'
df3.loc[[index, 2], 'Sentiment']=sent

df3 = df3.groupby(['Release Year', 'Sentiment']).size().reset_index(name = 'Total


Count')

df3 = df3[df3['Release Year']>2005]


barGraph = px.bar(df3, x ="Release Year", y ="Total Count", color ="Sentiment", title
="Sentiment Analysis of Content on Netflix")
barGraph.show()
9

2. About Code

 numpy used for linear algrebra operations


 pandas used for data preparation
 plotly.express used for data visualization
 textblob used for sentiment analysis
10
11
12
13
14
15
16

RESULTS AND ANALYSIS

By utilizing the EDA approach towards the Netflix data we have garnered some
crucial insights for other useful purposes. The following are summarized points:

 We have successfully cleaned the redundant records by removing them, and


filtered the Data Set by discarding unused features.
 We developed a correlation amongst the utility features and established a
guidance for our analysis.
 The most content type on Netflix is movies.
 The country by the amount of the produces content is the United States,
 The most popular director on Netflix , with the most titles, is Jan Suter.
 International Movies is a genre that is mostly in Netflix.
 Largest count of Netflix content is made with a “TV-14” rating.
 The most popular actor on Netflix TV Shows based on the number of titles is
Takahiro Sakurai.
 The most popular actor on Netflix movie, based on the number of titles, is
Anupam Kher.

It's clear that Netflix has grown over the years. We can see it from the data that the
company took certain approaches in their marketing strategy to break into new
markets around the world.

CONCLUSION

Data analysis is a vital first step in meeting a client's diverse demands across all
professional domains. Since many firms are actively searching for futuristic,
predictive, and descriptive insights from the raw data they have previously collected,
the diverse variety of insights that may be gleaned from data is important in and of
itself.

It helps the organizations to gain access to numerous concealed patterns,


information and bits of knowledge after the analysis had been performed. The
analysis that we have just performed using the Netflix data not only provides us with
incentives to take smart and intelligent business decisions, but also contribute to the
overall growth of the firm. These insights maintain a clear sight and perspective for
various stakeholders and help in targeting a positive vision for the future. The future
scope of Data Analysis is bound to remain intact as long as businesses require Data
Science in their everyday applicable decision-making processes. Also, there is a
great scale of possibilities when it comes to developing unique interactive
solutions and methods that are confined to make data exploration much more
intriguing in nature. These constant advancements have stabilized a promising
direction for data analysis as a systemic study that is going to stay as long as there is
the crunch for data in any viable field of study in the real-world.
17

REFERENCES

 Netflix recommendation system and text analysis.


 Numerical computing with Numpy: https://jovian.ml/aakashns/python-
numerical-computing-with-numpy
 Analyzing tabular data with Pandas: https://jovian.ml/aakashns/python-pandas-
data-analysis
 https://www.kaggle.com/shivamb/netflix-shows(link to the dataset and some
useful notebooks)
 https://en.wikipedia.org/wiki/Data_analysis [6]
https://towardsdatascience.com/represent-your-geospatial-data-using-folium-
c2a0d8c35c5

You might also like