
final_project

May 23, 2023

1 Final Project

2 Exploratory data analysis of the Disney Movies Dataset


2.1 Foreword
In this notebook, I have shared my exploratory data analysis for the Disney Movies dataset located
here.

3 Introduction
3.1 Question(s) of interest
The goal of this project is to analyze the performance of Disney movies based on their
inflation-adjusted gross values. I aim to explore aspects such as the highest-grossing movies, the
role that MPAA ratings play in bringing audiences to theatres, and the impact of genre on a
movie’s success. By examining the data, we can gain insights into Disney’s box office success and
identify patterns that contribute to its financial achievements.

3.2 Dataset description


The data was acquired from data.world, a platform that follows the Creative Commons Attribution
4.0 International License.
The Disney Movies dataset is composed of 5 tables: disney_movies_total_gross.csv,
disney_revenue_1991-2016.csv, disney-characters.csv, disney-director.csv, and
disney-voice-actors. I have used the following table.
• disney_movies_total_gross.csv
– This table contains information about the total gross and inflation-adjusted gross of
Disney movies, along with their release dates, genres, and MPAA ratings.

4 Methods and Results


First, I import the relevant libraries and load the data file.
[16]: # Importing libraries
import pandas as pd
import altair as alt

# Importing files
total_gross = pd.read_csv("data/disney_movies_total_gross.csv")

Checking the tables


[17]: total_gross.head()

[17]: movie_title release_date genre MPAA_rating \


0 Snow White and the Seven Dwarfs Dec 21, 1937 Musical G
1 Pinocchio Feb 9, 1940 Adventure G
2 Fantasia Nov 13, 1940 Musical G
3 Song of the South Nov 12, 1946 Adventure G
4 Cinderella Feb 15, 1950 Drama G

total_gross inflation_adjusted_gross
0 $184,925,485 $5,228,953,251
1 $84,300,000 $2,188,229,052
2 $83,320,000 $2,187,090,808
3 $65,000,000 $1,078,510,579
4 $85,000,000 $920,608,730

Checking the quality of data


[18]: total_gross.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 579 entries, 0 to 578
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 movie_title 579 non-null object
1 release_date 579 non-null object
2 genre 562 non-null object
3 MPAA_rating 523 non-null object
4 total_gross 579 non-null object
5 inflation_adjusted_gross 579 non-null object
dtypes: object(6)
memory usage: 27.3+ KB
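The info() output shows that genre and MPAA_rating contain missing values (562 and 523 non-null entries out of 579) and that every column, including the two gross columns, is stored as object. A quick way to count the gaps per column is sketched below. The frame is a small hand-made stand-in with hypothetical rows, not the real file.

```python
import pandas as pd

# Hypothetical rows for illustration only (not taken from the dataset)
sample = pd.DataFrame({
    "movie_title": ["Movie A", "Movie B", "Movie C"],
    "genre": ["Musical", None, "Drama"],
    "MPAA_rating": ["G", None, None],
})

# Count the missing values in each column
missing_counts = sample.isna().sum()
print(missing_counts)
```

Running the same isna().sum() call on total_gross would be expected to report the gaps in genre and MPAA_rating that info() implies.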
We also need to clean the dataframe before doing any analysis. For this, I wrote a
function, clean_data, in the module “clean_my_data.py”. Keeping the cleaning in a
function means that if the dataset at the URL changes, the code in this report can
still run and give meaningful insights. I have paid special attention to “movie_title”,
because a null value there means the row is unreliable in my opinion (the movie_title
is the most basic value here). The function also drops the release_date column: we only
use inflation_adjusted_gross, which makes release_date redundant (the time factor is
already accounted for by the inflation adjustment).

Calling that function now
[19]: import clean_my_data as cd

[20]: total_gross_cleaned = cd.clean_data(total_gross)

/home/jupyter/prog-python-ds-students/release/final_project/clean_my_data.py:22:
FutureWarning: The default value of regex will change from True to False in a
future version. In addition, single character regular expressions will *not* be
treated as literal strings when regex=True.
data['total_gross'] = data['total_gross'].str.replace('$',
'').str.replace(',', '').astype(float)
/home/jupyter/prog-python-ds-students/release/final_project/clean_my_data.py:23:
FutureWarning: The default value of regex will change from True to False in a
future version. In addition, single character regular expressions will *not* be
treated as literal strings when regex=True.
data['inflation_adjusted_gross'] =
data['inflation_adjusted_gross'].str.replace('$', '').str.replace(',',
'').astype(float)
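The FutureWarning above is raised because str.replace is called without an explicit regex argument. The source of clean_my_data.py is not shown in this report, so the following is only a minimal sketch of how such a cleaning step could be written, based on the description and the two lines quoted in the warning. The name clean_data_sketch is hypothetical, and passing regex=False is my assumption about one way to silence the warning.

```python
import pandas as pd

def clean_data_sketch(data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for clean_my_data.clean_data (not the author's code)."""
    # Drop rows without a title, drop the unused release_date column,
    # and work on an independent copy of the result
    data = (
        data.dropna(subset=["movie_title"])
        .drop(columns=["release_date"], errors="ignore")
        .copy()
    )
    for col in ["total_gross", "inflation_adjusted_gross"]:
        # regex=False treats "$" and "," as literal characters,
        # so pandas has no reason to emit the FutureWarning
        data[col] = (
            data[col]
            .str.replace("$", "", regex=False)
            .str.replace(",", "", regex=False)
            .astype(float)
        )
    return data

# Two raw rows copied from the head() output earlier in this notebook
raw = pd.DataFrame({
    "movie_title": ["Snow White and the Seven Dwarfs", "Pinocchio"],
    "release_date": ["Dec 21, 1937", "Feb 9, 1940"],
    "genre": ["Musical", "Adventure"],
    "MPAA_rating": ["G", "G"],
    "total_gross": ["$184,925,485", "$84,300,000"],
    "inflation_adjusted_gross": ["$5,228,953,251", "$2,188,229,052"],
})
cleaned = clean_data_sketch(raw)
print(cleaned.dtypes)
```

The sketch returns a new frame rather than modifying its input, so the raw table stays available for comparison.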

[21]: # Calculate the highest-grossing movies based on the total gross
highest_grossing_movies = total_gross_cleaned.sort_values(
    'total_gross', ascending=False
).head(10)

print("Top 10 Highest-Grossing Movies:")
print(highest_grossing_movies[['movie_title', 'total_gross']])

Top 10 Highest-Grossing Movies:


movie_title total_gross
564 Star Wars Ep. VII: The Force Awakens 936662225.0
524 The Avengers 623279547.0
578 Rogue One: A Star Wars Story 529483936.0
571 Finding Dory 486295561.0
558 Avengers: Age of Ultron 459005868.0
441 Pirates of the Caribbean: Dead Man’… 423315812.0
179 The Lion King 422780140.0
499 Toy Story 3 415004880.0
532 Iron Man 3 408992272.0
569 Captain America: Civil War 408084349.0

[22]: # Calculate the highest-grossing movies based on the inflation-adjusted gross
highest_inflation_adjusted_movies = total_gross_cleaned.sort_values(
    'inflation_adjusted_gross', ascending=False
).head(10)

print("Top 10 Highest-Grossing Movies (Inflation-Adjusted):")
print(highest_inflation_adjusted_movies[['movie_title', 'inflation_adjusted_gross']])

Top 10 Highest-Grossing Movies (Inflation-Adjusted):


movie_title inflation_adjusted_gross

0 Snow White and the Seven Dwarfs 5.228953e+09
1 Pinocchio 2.188229e+09
2 Fantasia 2.187091e+09
8 101 Dalmatians 1.362871e+09
6 Lady and the Tramp 1.236036e+09
3 Song of the South 1.078511e+09
564 Star Wars Ep. VII: The Force Awakens 9.366622e+08
4 Cinderella 9.206087e+08
13 The Jungle Book 7.896123e+08
179 The Lion King 7.616409e+08

[23]: # Explore the distribution of movie genres


genre_distribution = total_gross_cleaned['genre'].value_counts()
print("Movie Genre Distribution:")
print(genre_distribution)

Movie Genre Distribution:


Comedy 182
Adventure 129
Drama 114
Action 40
Thriller/Suspense 24
Romantic Comedy 23
Documentary 16
Musical 16
Western 7
Horror 6
Black Comedy 3
Concert/Performance 2
Name: genre, dtype: int64

[24]: # Identify the most common genres


most_common_genres = genre_distribution.head(5)
print("Most Common Genres:")
print(most_common_genres)

Most Common Genres:


Comedy 182
Adventure 129
Drama 114
Action 40
Thriller/Suspense 24
Name: genre, dtype: int64

[25]: # Investigate the relationship between MPAA ratings and movie success
mpaa_rating_success = total_gross_cleaned.groupby('MPAA_rating')['total_gross'].mean()

print("Mean Total Gross by MPAA Rating:")
print(mpaa_rating_success)

Mean Total Gross by MPAA Rating:


MPAA_rating
G 9.209061e+07
Not Rated 1.708736e+07
PG 7.362521e+07
PG-13 8.118074e+07
R 2.936536e+07
Name: total_gross, dtype: float64

[26]: total_gross_cleaned.head()

[26]: movie_title genre MPAA_rating total_gross \


0 Snow White and the Seven Dwarfs Musical G 184925485.0
1 Pinocchio Adventure G 84300000.0
2 Fantasia Musical G 83320000.0
3 Song of the South Adventure G 65000000.0
4 Cinderella Drama G 85000000.0

inflation_adjusted_gross
0 5.228953e+09
1 2.188229e+09
2 2.187091e+09
3 1.078511e+09
4 9.206087e+08

[27]: # Bar chart: Top 10 Highest-Grossing Movies (Total Gross)
bar_chart_total_gross = alt.Chart(highest_grossing_movies).mark_bar().encode(
    x=alt.X('total_gross:Q', title='Total Gross'),
    y=alt.Y('movie_title:N', title='Movie Title'),
    tooltip=['movie_title', 'total_gross']
).properties(
    title='Top 10 Highest-Grossing Movies (Total Gross)'
)

bar_chart_total_gross

[27]: alt.Chart(…)

[28]: # Bar chart: Top 10 Highest-Grossing Movies (Inflation-Adjusted Gross)
bar_chart_inflation_adjusted_gross = alt.Chart(
    highest_inflation_adjusted_movies
).mark_bar().encode(
    x=alt.X('inflation_adjusted_gross:Q', title='Inflation-Adjusted Gross'),
    y=alt.Y('movie_title:N', title='Movie Title'),
    tooltip=['movie_title', 'inflation_adjusted_gross']
).properties(
    title='Top 10 Highest-Grossing Movies (Inflation-Adjusted Gross)'
)

bar_chart_inflation_adjusted_gross

[28]: alt.Chart(…)

Because the value of money changes over time, the analysis uses the ‘inflation-adjusted
gross’ column. It also compares the average rather than the total inflation-adjusted
gross, because the number of movies differs across categories and a total would favour
categories that simply contain more movies.
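To make the difference between a total and an average concrete, here is a small sketch that computes the count, sum, and mean per genre. It uses only the first five cleaned rows shown earlier in this notebook, so the numbers illustrate the method and are not results for the full dataset.

```python
import pandas as pd

# First five cleaned rows shown earlier in this notebook
sample = pd.DataFrame({
    "genre": ["Musical", "Adventure", "Musical", "Adventure", "Drama"],
    "inflation_adjusted_gross": [
        5228953251.0, 2188229052.0, 2187090808.0, 1078510579.0, 920608730.0,
    ],
})

# Count, total, and average per genre, sorted by the average
summary = (
    sample.groupby("genre")["inflation_adjusted_gross"]
    .agg(["count", "sum", "mean"])
    .sort_values("mean", ascending=False)
)
print(summary)
```

Keeping the count column next to the mean makes it easy to see how many movies each average is based on.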
[29]: import altair as alt

# Calculate average inflation-adjusted gross for each genre
genre_avg_gross = (
    total_gross_cleaned.groupby('genre')['inflation_adjusted_gross']
    .mean()
    .reset_index()
)

# Sort the DataFrame by average inflation-adjusted gross in descending order
genre_avg_gross = genre_avg_gross.sort_values(
    'inflation_adjusted_gross', ascending=False
)

# Bar chart for average inflation-adjusted gross per genre
bar_genre = alt.Chart(genre_avg_gross).mark_bar().encode(
    x=alt.X('genre:N', title='Genre', sort='-y'),
    y=alt.Y('inflation_adjusted_gross:Q', title='Average Inflation Adjusted Gross'),
    tooltip=['genre', 'inflation_adjusted_gross']
).properties(
    title='Average Inflation Adjusted Gross per Genre of Disney Movies'
)

bar_genre

[29]: alt.Chart(…)

[30]: # Calculate average inflation-adjusted gross for each MPAA rating
mpaa_avg_gross = (
    total_gross_cleaned.groupby('MPAA_rating')['inflation_adjusted_gross']
    .mean()
    .reset_index()
)

# Sort the DataFrame by average inflation-adjusted gross in descending order
mpaa_avg_gross = mpaa_avg_gross.sort_values(
    'inflation_adjusted_gross', ascending=False
)

# Bar chart for average inflation-adjusted gross per MPAA rating
bar_mpaa = alt.Chart(mpaa_avg_gross).mark_bar().encode(
    x=alt.X('MPAA_rating:N', title='MPAA Rating', sort='-y'),
    y=alt.Y('inflation_adjusted_gross:Q', title='Average Inflation Adjusted Gross'),
    tooltip=['MPAA_rating', 'inflation_adjusted_gross']
).properties(
    title='Average Inflation Adjusted Gross per MPAA Rating of Disney Movies'
)

bar_mpaa

[30]: alt.Chart(…)

5 Discussions
The analysis of Disney movies’ average inflation-adjusted gross revealed interesting insights. The
graph of average inflation-adjusted gross per genre shows that the ‘Musical’ genre has a
considerably higher average gross than the other genres. This suggests that Disney musicals tend
to perform exceptionally well at the box office, even after adjusting for inflation. The
popularity and enduring appeal of Disney musicals, with their captivating songs and enchanting
storytelling, have likely contributed to their financial success over the years.
Additionally, the graph of average inflation-adjusted gross per MPAA rating highlights the ‘G’
rating as having the highest average gross among the MPAA ratings. This indicates that Disney
movies rated ‘G’ have historically generated substantial box office revenue once inflation is
taken into account. The ‘G’ rating signifies that a movie is suitable for all audiences, which
aligns with Disney’s family-friendly brand and suggests that these movies appeal to a wide range
of viewers, children and adults alike.
The ‘Musical’ genre and the ‘G’ rating are associated with higher average inflation-adjusted
gross, which suggests strong market demand and audience reception for these types of movies. This
information can be valuable for Disney in understanding audience preferences and making strategic
decisions when developing and marketing future movie projects.
It’s important to note that while the average inflation-adjusted gross provides valuable insights,
individual movie performance can vary within each genre and MPAA rating category. Factors such
as production budgets, marketing strategies, release timing, and critical reception also influence
a movie’s financial success. Therefore, further analysis considering these factors would provide a
more comprehensive understanding of Disney’s movie performance.
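Because individual movies can differ so much within a category, one possible follow-up is to compare each group’s mean with its median and its size. The sketch below uses made-up numbers (the genres "A" and "B" and all values are hypothetical) purely to show how a single very large value can raise a group’s mean above that of a more consistent group.

```python
import pandas as pd

# Made-up values for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "genre": ["A", "A", "A", "B", "B", "B"],
    "inflation_adjusted_gross": [100.0, 110.0, 900.0, 300.0, 310.0, 320.0],
})

# Group size, mean, and median side by side
stats = toy.groupby("genre")["inflation_adjusted_gross"].agg(
    ["count", "mean", "median"]
)
print(stats)
```

In this toy frame, genre A has the higher mean (370 against 310) but the lower median (110 against 310). Applying the same aggregation to total_gross_cleaned would show whether the averages in this report depend on a few exceptional titles.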