Professional Documents
Culture Documents
TMDB Box Office Prediction: Group 6
TMDB Box Office Prediction: Group 6
Abhinav Pateria, Ravi Gunjan, Saurabh Kumar Gupta, Vipul Kumar, Yash Panchal
07/11/2019
Budget
Introduction
In this project we will predict how much revenue a movie
makes at the box office. During this process we will be
going through:
2 Feature engineering
Here we can see that the average revenue for a lot of the
Movies with Homepage generates higher revenue than top production companies is higher than the ‘other’
movies with no Homepage. Movies with homepages seem production companies.
to be making on average 3 times as much as movies
without a homepage. Production company size
Production companies with lower numbered id’s seem to We assigned the production companies that have at least
be making more revenue compared to the ones with 60 movies each as big producers and all the rest as small
higher id’s. There is small negative correlation present producers. We will assume that all NAs are small
( Correlation: -0.1282278) producers.
Language
Absolute majority of the movies are English, we will
create the variable language with levels English versus
Non-English.
The U.S. and Great Britain seem to, on average, be
getting more revenue than the countries that are not
among the top production countries.
IMDB id
We will now extract the IMDb number from the IMDb_id
string in order to see if this variable affects revenue.
There will likely not be any correlation with this and
revenue, but we will plot and explore this to make sure.
Revenue, Tag-line_length: -0.1206457 Predicting Revenue for validation test using ModelLR3
and calculating RMSE value
Revenue, Overview_length: -0.008616267
So, there is a weak correlation between title length and RMSE Value for Linear Regression model (Model3) =
tag-line length and revenue. There is no correlation 0.9133
between overview length and revenue so it is probably
best to not include the variable in our model. Model 2: Random Forest
randomForest(formula = revenue ~ ., data = train, ntree =
Machine Learning Model
Number of trees: 501
Model 1: Linear Regression No. of variables tried at each split: 9
We have splitted data in 80-20 ratio as Train data and Mean of squared residuals: 0.8591925
Validation data.
% Var explained: 51.38
Linear Regression for numerical attributes: Popularity,
Importance of variables using Random Forest
Runtime, Budget to predict revenue (ModelLR1)