Conference Paper

FORECASTING FILM FINANCIAL
SUCCESS WITH MACHINE LEARNING

Gokulakrishnan.A
Department of IT
Panimalar Institute Of Technology
Chennai, India
gokulakrishnantamil@gmail.com

Gokulnath.k
Department of IT
Panimalar Institute Of Technology
Chennai, India
gokulnath7707@gmail.com

Ram vilas.H
Department of IT
Panimalar Institute Of Technology
Chennai, India
ramvilas273@gmail.com

Miss Rajeshwari, M.E.
Assistant professor
Department of IT
Panimalar Institute Of Technology
Chennai, India

Abstract—Predicting society's reaction to a new product in terms of popularity and adoption rate has become an emerging field of data analysis. The motion picture industry is a multi-billion-dollar business, and a massive amount of film-related data is available on the internet. This study proposes a decision support system for the film investment sector using machine learning techniques. This research helps investors associated with this business avoid investment risks. The system predicts an approximate success rate of a film based on its profitability by analyzing historical data from different sources such as IMDb, Rotten Tomatoes, Box Office Mojo, and Metacritic. Using Support Vector Machine (SVM), Neural Network and Natural Language Processing, the system predicts a film's box office profit based on some pre-released features and post-released features. This paper shows that the Neural Network gives an accuracy of 84.1% for pre-released features and 89.27% for all features, while SVM has 83.44% and 88.87% accuracy for pre-released features and all features respectively, when one-away prediction is considered. Moreover, we find that budget, IMDb votes, and number of screens are the most important features, playing a vital role in predicting a film's box office success.

Keywords— Box office gross; Data mining; Machine learning; Film success; Film; Predictive analytics; Critical review; Rating.

I. INTRODUCTION

Predicting society's reaction to a new product in terms of popularity and adoption rate has become an emerging field of data analysis, and this kind of analysis can help the film industry make appropriate decisions. Can film studios and their related stakeholders use a forecasting method to predict the revenue that a new film can generate based on a few given input attributes such as budget, runtime, release year, popularity, and so on?

This study serves as a decision support system for the film investment sector using machine learning techniques. This project helps investors associated with this business avoid investment risks. The system predicts an approximate success rate of a film based on its profitability by analyzing historical data from different sources such as online rating, director, budget, pre-release business, genre, etc.

The film industry has grown immensely over the past few decades, generating billions of dollars of revenue for its stakeholders. People can now watch movies online and offline on a variety of mobile devices during leisure or travel, through Netflix, YouTube and downloads. A prediction system that assesses the box office success of new movies can help film producers and directors make informed decisions when making a film, in order to increase the chance of profitability and box office gross success. New social media tools are constantly appearing, enabling people to gather information on films and post comments about movies. These comments can influence the initial prediction of a film's box office gross success, which some of the existing research does not consider. Critic reviews often come out a few days before the film is released and may, therefore, help in prediction and at the same time influence the box office revenue.

II. SYSTEM DESIGN

EXISTING SYSTEM:

Movies, in general, are products that have a long development stage before they reach final consumers, normally at a high cost. We can describe the film development process, broadly, as being composed of four stages: pre-production, production, post-production and distribution. Given the large growth in the number of movies released over the past few decades, film prediction is necessary.
The only way people can check whether a film will be worth watching is through applications, so this system would analyze the reviews posted by other users, since these reviews are too numerous for a user to read without getting confused. The aims and objectives of our system are as follows. The first step is to identify a dataset of film data which is suitable for analysis. Relevant attributes need to be selected from the film data. Attributes can be general pre-production information regarding film productions, such as film title, sequel, genre, language and information about writers, actors, and directors. Similarly, the data must include some measure of success, such as user film ratings. Secondly, the relevant dataset has to be prepared and structured in such a way that the data used is representative of the film scene at large, as well as suitable for analysis by the relevant machine learning techniques and algorithms. Further, correlation analysis is performed on the relevant dataset to find the relationships between the variables.

The important step in training our system is to apply a classification model. There are many classifiers. Lastly, the prediction performance of the relevant machine learning algorithm has to be evaluated on the dataset in order to determine the success or failure of a film accurately.

PROPOSED SYSTEM:

This project aims to predict film box office gross using supervised learning algorithms, specifically Decision Tree and Random Forest, together with hyperparameter tuning.

The study found that Random Forest with hyperparameter tuning had the highest accuracy and can be used to recommend the best prediction technique. The project used Jupyter software and evaluated essential attributes for film gross prediction.

The final results can be displayed using a Flask app with an HTML user interface for film box office gross prediction.

III. SYSTEM ARCHITECTURE

III. MOTIVATION

As data scientists we wanted to dig deeper into the business side of movies and explore the economics behind what makes a successful film. Basically we wanted to

IV. DATA DESCRIPTION

The dataset we utilized to train and test our model

EXPLANATION:

UI (User Interface): This is the space where interactions between humans and machines occur. It consists of the hardware and software that allow effective operation and control of the machine from the human end, as well as the exchange of information between the machine and the human.

Prediction: In the context of machine learning, a prediction is the output or result generated by a model based on the input data. The model makes predictions by learning patterns from the training data.

Model: In machine learning, a model is a mathematical representation of a real-world process. It is trained on a dataset to make predictions or decisions without being explicitly programmed to perform the task.

User: This refers to the person or system that interacts with the machine learning model. The user provides inputs to the model and receives the predictions or outputs.

Inputs: These are the data points or features that are fed into the machine learning model. The model uses these inputs to make predictions.

Algorithm: This is a set of statistical processing steps. In machine learning, algorithms are used to learn from data and to make predictions or decisions on it.

Evaluation: This is the process of determining the performance of a machine learning model. It involves using various metrics to measure how well the model is doing in terms of accuracy, precision, recall, F1 score, etc.
Train Data: This is the dataset that is used to train the machine learning model. The model learns patterns from this data to make predictions.

Test Data: This is the dataset that is used to evaluate the performance of the machine learning model. The model's predictions are compared with the actual values in this dataset to measure its accuracy.

Data Preprocessing: This is the process of transforming raw data into an understandable format. It involves cleaning the data, handling missing values, encoding categorical variables, scaling numerical variables, etc.

Data CSV: This refers to a file format (Comma Separated Values) where data is stored in a tabular format, with each value separated by a comma. This format is often used for exchanging data between different applications.
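
The terms above (data preprocessing, train data, test data, evaluation) can be illustrated with a small standard-library sketch. It is not the authors' pipeline: the `FILMS` records, the field names, and the helper functions `preprocess`, `train_test_split` and `classification_metrics` are all hypothetical and chosen only for illustration.

```python
import random
from statistics import median

# Hypothetical film records (not from the paper's dataset); None marks a missing value.
FILMS = [
    {"budget": 10.0, "runtime": 95, "genre": "Drama", "hit": 0},
    {"budget": 150.0, "runtime": None, "genre": "Action", "hit": 1},
    {"budget": 60.0, "runtime": 120, "genre": "Action", "hit": 1},
    {"budget": 5.0, "runtime": 100, "genre": "Comedy", "hit": 0},
    {"budget": 80.0, "runtime": 110, "genre": "Drama", "hit": 1},
    {"budget": 20.0, "runtime": None, "genre": "Comedy", "hit": 0},
    {"budget": 200.0, "runtime": 130, "genre": "Action", "hit": 1},
    {"budget": 15.0, "runtime": 90, "genre": "Drama", "hit": 0},
]


def preprocess(rows):
    """Fill missing runtimes with the median, min-max scale budget, one-hot encode genre."""
    fill = median(r["runtime"] for r in rows if r["runtime"] is not None)
    budgets = [r["budget"] for r in rows]
    lo, hi = min(budgets), max(budgets)
    genres = sorted({r["genre"] for r in rows})
    samples = []
    for r in rows:
        scaled_budget = (r["budget"] - lo) / (hi - lo) if hi > lo else 0.0
        runtime = float(r["runtime"] if r["runtime"] is not None else fill)
        one_hot = [1.0 if r["genre"] == g else 0.0 for g in genres]
        samples.append(([scaled_budget, runtime] + one_hot, r["hit"]))
    # Note: in practice the median and min/max should be computed on the
    # training split only, to avoid leaking test information.
    return samples


def train_test_split(samples, test_fraction=0.25, seed=0):
    """Shuffle reproducibly and return (train, test)."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


samples = preprocess(FILMS)
train, test = train_test_split(samples)
print(len(train), "training samples,", len(test), "test samples")
print(classification_metrics([1, 0, 1, 1], [1, 0, 0, 1]))
```

A trained model would be fitted on `train` only, and its predictions on `test` would be passed to `classification_metrics` together with the true labels.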

IV. PROPOSED METHODOLOGY

A significant part of the dataset required for the project was extracted from the global TMDB dataset using its APIs. Following this, the OMDB API was used to extract the MPAA ratings and the IMDb ratings and votes of each film in the dataset. The final dataset has the following features: Genres, ID, Original Language, Original Title, Overview, Popularity Rating, Release Date, Title, TMDB Rating, TMDB Vote Count, IMDb ID, Budget, Revenue, Production Companies, Cast, Crew, Production Countries, Spoken Languages, Runtime, Tagline, MPAA Rating, IMDb Rating, IMDb Vote Count and Star Power.

The final dataset has 6065 films. For Genres, Cast, Crew and Production Companies, the dataset returned a JSON array of responses. The popularity index of the most popular cast member in the film was taken as the star power. Only the top few values in each JSON array were considered in the final dataset, since these are the elements that, along with star power, attract the majority of the audience.

The extracted dataset was modified, and some of the essential features were updated. Features like ID, IMDb ID, Original Title, and Tagline were not relevant to the prediction model and were thus removed. All the NULL values in the dataset were removed by changing the NULL values in 'runtime' to the median value and replacing the NULL values in the other columns with an empty set. After removing the NULL values from all entries, the 'release date' feature was modified by splitting it into three distinct features for the day, month, and year of the release. All data except the first three members in the cast of each entry were deleted.

Then, a check was done for outliers, which are data points distant from the rest of the data in the dataset. They have the ability to distort the final result and prediction, and thus were removed.

The dataset was then explored to understand the relationships between the features, how they interact with each other, to spot anomalies in the data, and to find patterns to help build the model. For this purpose, histograms were plotted to study the range of features like runtime, budget, popularity, release date, IMDb rating, revenue, etc. Following this, a correlation matrix was plotted for the same features to find the linear interaction between every pair of features. This contained the correlation coefficient between each pair. In addition to this, bar plots and frequency polygons were plotted to study the data.

V. ALGORITHM AND TECHNIQUES:

A. Support Vector Machine

Support Vector Machine (SVM) is a supervised machine learning model that is used for classification. SVMs work by maximizing the margin around the separating hyperplane. In a linear SVM the classes can be separated by a line; see figure 3.14 for an example of how the model could look. For example, the red values could be answer A and the blue values answer B. If a new value were introduced to the system and positioned on the red side, the model would predict the new value to be answer A. If more answers are possible, a hyperplane is created to be able to split all the
answers up in different areas. SVMs are effective in high-dimensional spaces, memory efficient, and versatile machine learning algorithms that work well with non-linear data.

B. Random Forest

Random Forests, or random decision forests, are an ensemble method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.

C. KNN Classification algorithm or K-Nearest Neighbor algorithm

K-Nearest Neighbor is one of the simplest machine learning algorithms based on the supervised learning technique. The K-NN algorithm assumes similarity between the new case and the available cases, and puts the new case into the category that is most similar to the available categories.

D. Linear Regression

Linear regression analysis is used to predict the value of a variable based on the value of another variable. The variable you want to predict is called the dependent variable. The variable you are using to predict the other variable's value is called the independent variable. This form of analysis estimates the coefficients of the linear equation, involving one or more independent variables, that best predict the value of the dependent variable. Linear regression fits a straight line or surface that minimizes the discrepancies between predicted and actual output values. There are simple linear regression calculators that use a "least squares" method to discover the best-fit line for a set of paired data. You then estimate the value of X (dependent variable) from Y (independent variable).

VI. EXPERIMENTAL RESULT

The figures below show the results of the module implementation. These screenshots show the user interface through which the modules are being developed.

Flask:

Flask is a lightweight WSGI web application framework. It is designed to make getting started quick and easy, with the ability to scale up to complex applications.

Home Page:

These factors, along with others, can contribute to the overall financial success of a film. However, it's important to note that the specific combination and weight of these factors can vary for each film, and the outcome might not always be predictable.

Result Page:

You do not need to wait long for the results; the amount is predicted within a second.
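
To make the prediction step behind the result page more concrete, the "least squares" fit described under Linear Regression can be sketched for a single feature. This is a minimal illustration under stated assumptions, not the authors' trained model: the helper names `fit_line` and `predict` and the budget and gross figures are hypothetical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the line minimizing the sum of squared residuals."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Evaluate the fitted line at x."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical budgets and grosses (in millions), constructed so that
# gross = 2 * budget + 1 exactly.
budgets = [10.0, 20.0, 30.0, 40.0]
grosses = [21.0, 41.0, 61.0, 81.0]

model = fit_line(budgets, grosses)
print("slope, intercept:", model)
print("predicted gross for a budget of 50:", predict(model, 50.0))
```

With several input attributes the same idea generalizes to multiple linear regression, where one coefficient is estimated per feature.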
CONCLUSION:

The proposed system addresses the limitations of the existing system by providing accurate predictions for film financial forecasting using various supervised learning algorithms. By comparing the performance of different classification algorithms, the system identifies the best-performing technique, ensuring the use of the most effective method for prediction. The utilization of Linear Regression, which has been identified as the algorithm with the highest accuracy, provides a reliable method for predicting film box office gross. The user-friendly interface created using Flask enhances usability and accessibility, allowing users to input data and view the predicted box office gross values easily.

FUTURE ENHANCEMENT:

The proposed system can be further improved by incorporating additional features, such as social media sentiment analysis, to provide a more comprehensive prediction of film success. The system can also be expanded to include real-time predictions, allowing users to access up-to-date predictions as soon as they become available. Additionally, the system can be integrated with existing film streaming platforms, such as Netflix or Amazon Prime, to provide personalized recommendations based on predicted box office gross.

Furthermore, the system can be scaled up to analyze larger datasets, allowing for more accurate predictions and improved performance. The system can also be adapted to predict the success of other forms of media, such as music or books, by using similar supervised learning algorithms and techniques.

Overall, the proposed system provides a reliable and accurate method for predicting film box office gross, and has the potential for further development and expansion in the future.

REFERENCES

[1] J. S. Simonoff and I. R. Sparrow, "Predicting movie grosses: Winners and losers, blockbusters and sleepers," Chance, 2000.

[2] M. Joshi, D. Das, K. Gimpel, and N. A. Smith, "Movie Reviews and Revenues: An Experiment in Text Regression," in Proceedings of the North American Chapter of the Association for Computational Linguistics Human Language Technologies Conference, 2010.

[3] R. Sharda and D. Delen, "Predicting box-office success of motion pictures with neural networks," Expert Systems with Applications, 2006.

[4] "Global box office revenue 2016 | Statistic." [Online]. Available: https://www.statista.com/statistics/259987/global-box-office-revenue/. [Accessed: 03-Jun-2018].

[5] S. Gopinath, P. K. Chintagunta, and S. Venkataraman, "Blogs, Advertising, and Local-Market Movie Box Office Performance," Management Science, vol. 59, no. 12, pp. 2635–2654, 2013.

[6] M. C. A. Mestyán, T. Yasseri, and J. Kertész, "Early Prediction of Movie Box Office Success Based on Wikipedia Activity Big Data," PLoS ONE, vol. 8, no. 8, 2013.

[7] J. S. Simonoff and I. R. Sparrow, "Predicting Movie Grosses: Winners and Losers, Blockbusters and Sleepers," Chance, vol. 13, no. 3, pp. 15–24, 2000.

[8] A. Chen, "Forecasting gross revenues at the film box office," Working paper, University of Washington, Seattle, WA, June, 2002.

[9] M. S. Sawhney and J. Eliashberg, "A Parsimonious Model for Forecasting Gross Box-Office Revenues of Motion Pictures," Marketing Science, vol. 15, no. 2, pp. 113–131, 1996.

[10] M. T. Lash and K. Zhao, "Early Predictions of Movie Success: The Who, What, and When of Profitability," Journal of Management Information Systems, vol. 33, no. 3, pp. 874–903, Feb. 2016.

[11] A. Sivasantoshreddy, P. Kasat, and A. Jain, "Box-Office Opening Prediction of Movies based on Hype Analysis through Data Mining," International Journal of Computer Applications, vol. 56, no. 1, pp. 1–5, 2012.

[12] R. Sharda and D. Delen, "Predicting box-office success of motion pictures with neural networks," Expert Systems with Applications, vol. 30, no. 2, pp. 243–254, 2006.
