2015 8th International Symposium on Computational Intelligence and Design

Movie Box Office Interval Forecasting Based on CART

Zhenlong Zhang, Jianping Chai, Bo Li, Yan Wang, Min An, Zhougui Deng


Information Engineering School
Communication University of China, China
Beijing, China
sunny5rain@cuc.edu.cn, {jp_chai, lbmovie, wangyancuc}@sina.com, anmin@picc.com, dengzhougui@163.com

Abstract: This paper constructs a model for forecasting movie box office revenues during the pre-production period. Our study selects basic variables from the film forecasting field, such as production country, genre, seasonality and star value. The box office revenue is divided into seven intervals ranging from 'flop' to 'blockbuster', and Classification and Regression Tree (CART) is chosen to predict box office in China after comparing different models. The study uses the top 100 films of 2014 by box office revenue as the training set and the top 50 films of the first 5 months of 2015 as the testing set. The model produces high forecasting accuracy on both sets: the fitting degree on the training set is 99%, and the prediction accuracy on the testing set is 76%. The importance of variables indicates that the appeal of directors and stars is notable in China.

Keywords: forecast; movie box office revenues; intervals; pre-production; Classification and Regression Tree

I. INTRODUCTION
The film market in China has grown rapidly over the past three years. The total box office is expected to reach ¥40 billion in 2015, with an annual growth rate of about 30%. China is now the second largest film market in the world and the main engine of global growth. However, the production of a motion picture is an expensive, risky endeavor. Although the domestic movie market is very hot, only a handful of investments are profitable: 70% of domestic films cannot recover their cost [1]. More than 90% of Chinese movies are at a loss. Against this background, an accurate forecast that supports risk control and decision making before the debut of a film has great practical significance.
This paper presents a forecasting model for the domestic box office revenues of major motion pictures. To generate these forecasts, we employ a decision tree and use traditional production data as our major inputs. The variables in our model are production country, first genre, the second genre, seasonality, mean of director and former two actors' fans, and mean of director and former five actors' fans. Post-production and post-release forecasting models are built either on word-of-mouth data from the internet immediately prior to release or on post-release data from the opening weekend of a film. Compared with them, our research can help avoid risk earlier and provide guidance for filmmakers.
In this paper, we present the development of a model based on Classification and Regression Tree (CART) for the forecasting of movie revenues during the pre-production phase. In our model, the forecasting results are divided into seven intervals ranging from 'flop' to 'blockbuster' according to the film box office revenues. Our study selects the top 100 films of 2014 by box office revenue as the training set and the top 50 films of the first 5 months of 2015 as the testing set. In our model, the fitting degree on the training set is 99%, and the prediction accuracy in application is 76%. The importance of variables indicates that the appeal of directors and stars is huge in China. The study is the first to use film fans as a quantization method of star and director value. The method is verified to be very effective in our study.

II. LITERATURE REVIEW
Revenue forecasting is a widely studied field within the film industry. Researchers have tried various methods to extract information from film industry companies, professional film critic websites and social media.
The information of film industry companies, such as production data, can be used to predict the rank of a film. Barry Litman, a U.S. economist, can be described as a pioneer in the field of box office forecasting. In his opinion, the success of a film at the box office is determined mainly by three aspects: film creativity, distribution pattern and movie theater arrangement, and marketing. Film creativity includes 7 indicators: film type, MPAA classification, former influence, production areas, the director and main actors, production costs and film critics. The distribution pattern and movie theater arrangement includes 4 indicators: publisher, schedule, distributing pattern and market forces. Marketing includes 2 indicators: the distribution firm's marketing capabilities and whether or not the film has won awards [2]. Sharda & Delen employed MPAA rating, competition, star value, genre, special effects, sequel and number of screens to predict box office. With these variables as inputs to artificial neural network models, their best average percent hit rate was 56.07% in 2010 [3]. M. Ghiassi, David Lio & Brian Moon employed the same variables and methodology as Sharda & Delen and improved classification accuracy to a 74.4% average percent hit rate in 2014 [4]. They presented the development of a model based upon a dynamic artificial neural network (DAN2) for the forecasting of movie revenues during the pre-production period. Their study demonstrated the effectiveness of DAN2 and showed that DAN2 improves box-office revenue forecasting accuracy by 32.8% over existing models.

978-1-4673-9587-8/15 $31.00 © 2015 IEEE
DOI 10.1109/ISCID.2015.165
Social media information has considerable predictive value. In 2013, Google released a white paper called "Quantifying Movie Magic with Google Search". The white paper described a box office prediction model, which mainly used movie searches, movie advertising clicks, the number of theaters and the box office performance of earlier films in the same series. According to this model, Google believed that it could predict the movie box office one week in advance with an accuracy rate of 92%. Because Google has not released box office forecast results based on this model, the practical value of the model has yet to be tested [5]. Jingfei Du, Hua Xu & Xiaoqiu Huang (2013) employed microblog data of movies and the box office revenues of films in their first two weeks to predict the third week's box office performance by using machine learning methods, such as the neural network, support vector machine and emotion classification [6]. Lian Wang and Jianmin Jia (2014) selected movie-related data (weekly number of screens, release time, film genre, whether the film is a sequel, and country) and Internet search data (Baidu index) as explanatory variables. Their research constructed a log-linear model to predict weekly box office [7].
This paper uses a methodology similar to that of Sharda & Delen. Our model selects traditional film data as variables: production country, first genre, the second genre, seasonality, mean of director and former two actors' fans, and mean of director and former five actors' fans. The study is the first to use film fans as a quantization method of star and director value. The method is verified to be very effective later. The model employs CART for the forecasting of movie revenues during the pre-production phase.

III. DATA COLLECTION
In our model, we chose the top 100 films of 2014 by box office revenue as the training set and the top 50 films of the first 5 months of 2015 as the testing set. All data are collected from the internet, mostly from a movie box office database website and the Douban website. The variables employed in our analysis are as follows: production country, first genre, the second genre, seasonality, mean of director and former two actors' fans, and mean of director and former five actors' fans.

A. Production Country
Film production country has a strong correlation with box office, since the film industry is at different stages of development in different countries. For example, the Hollywood system in the United States is a relatively mature and complete industrial system that leads the development of the global movie industry. Production country can also be regarded as a form of cultural competition. Therefore, production country is a very important variable: to some extent, it represents the production level of the film.

B. First Genre
In our model, we define the first genre as the first entry in a film's genre list. Movie genre reflects the audience's viewing preferences. In the context of box-office forecasting, the first genre of a film is an important attribute in determining prospective audience demographics. In China, American science fiction films tend to attract large audiences.

C. The Second Genre
As a film generally cannot be labeled with only one genre, we add the second genre to supply the information the first genre cannot express. In other words, the second genre acts as a modifying variable.

D. Seasonality
Dramatic seasonal effects occur within the motion picture industry, as the number of Americans who attend movies in theaters varies significantly over the course of the year, sometimes more than doubling within a two-week period, generally around holiday weekends (Einav, 2007) [8]. Seasonality is a complex factor. Choosing a proper day to release a film is very important to its total box-office revenue.

E. Mean of director and former two actors' fans
This variable is similar to "star value": the mean of the director's and the former two actors' fan counts reflects how well liked the director and actors are. We use the number of Douban fans of a director or actor as a measure of their appeal. Using data from a professional film website to quantify the contribution of directors and actors is a new way of quantifying star and director value. Earlier research generally computed star value by comparing the average recent box-office revenues of films in which a star is credited with the averages of other stars.

F. Mean of director and former five actors' fans
As mentioned above, this variable is another measure of the appeal of directors and actors. To better characterize that appeal, we expand the scope to the director and the top five actors.

IV. CART MODEL
CART, a recursive partitioning method, builds classification and regression trees for predicting continuous dependent variables (regression) and categorical dependent variables (classification). The classic CART algorithm was popularized by Breiman et al. (Breiman, Friedman, Olshen, & Stone, 1984; see also Ripley, 1996). The CART (Classification And Regression Tree) algorithm uses binary recursive partitioning: the current sample set is divided into two subsets, so that each non-leaf node has two branches [9]. Therefore, the decision tree built by the CART algorithm is a simple binary tree. A classification tree rests on two basic ideas: the first is to recursively partition the space of independent variables using the training samples; the second is to prune the tree using validation data. Algorithms for constructing decision trees usually work top-down, by choosing at each step the variable that best splits the set of items. Different algorithms use different
metrics for measuring "best". These generally measure the homogeneity of the target variable within the subsets.

A. Gini impurity
Used by the CART algorithm, Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset. Gini impurity serves as the splitting rule and can be computed by summing the probability of each item being chosen times the probability of a mistake in categorizing that item. It reaches its minimum (zero) when all cases in the node fall into a single target category.
To compute Gini impurity for a set of items, suppose $i \in \{1, 2, \ldots, m\}$, and let $f_i$ be the fraction of items labeled with value $i$ in the set [10]. Gini impurity is computed by formula (1).

$$I_G(f) = \sum_{i=1}^{m} f_i (1 - f_i) \qquad (1)$$

V. BOX-OFFICE FORECASTING
As mentioned above, we chose the top 100 films of 2014 by box office revenue as the training set and the top 50 films of the first 5 months of 2015 as the testing set. The data are collected and converted according to the variables shown in TABLE I. The movie metrics we employ are transformed from 6 variables into 34 data points for input into our Classification and Regression Tree model.

TABLE I. VARIABLES

| Variables | Possible values |
|---|---|
| Production country | China, America, Hong Kong, Taiwan, Others |
| First genre | Action, Plot, Adventure, Comedy, Thriller, Animation, Love |
| The second genre | Comedy, Adventure, Fantasy, Action, Thriller, Suspense, Family, Love, Sci-fi, Animation, Costume, War, Crime, Terror, Children |
| Seasonality | New year, Labor day, Summer vacation, National day, Others |
| Mean of director and former two actors' fans | An integer between 0 and 33890 |
| Mean of director and former five actors' fans | An integer between 0 and 19283 |

In our CART-based model, we convert the problem of revenue forecasting from a point estimate into a classification problem. In this way, movies are classified into one of seven classes from 'flop' to 'blockbuster', as shown in TABLE II. This grouping of films allows a CART model to be trained to recognize elements and combinations of elements which are of predictive value from similarly performing films.

TABLE II. CART MODEL: CLASSIFICATION THRESHOLDS

| Class | Revenue range (¥ million) |
|---|---|
| A (Blockbuster) | 1000+ |
| B | 600-1000 |
| C | 400-600 |
| D | 200-400 |
| E | 100-200 |
| F | 50-100 |
| G (Flop) | <50 |

A. Performance metrics
Our model uses average percent hit rate (APHR) [11] as the accuracy metric, which is calculated by formula (2). This metric is arguably the most intuitive way to estimate the predictive performance of models. APHR is the ratio of total correct classifications to the total number of samples across all classes in the classification problem, and is more commonly known as classification accuracy. As the formula shows, the larger the value, the better the classification performance of the model.

$$\mathrm{APHR} = \frac{\text{Number of Samples Correctly Classified}}{\text{Total Number of Samples}} \qquad (2)$$

B. Training model
This study uses IBM SPSS Modeler. We set the 6 variables as 6 inputs and the box office revenue class as the output. The CART model is configured in the enhanced-accuracy mode, which generates a sequence of models to obtain more accurate predictions. The maximum tree depth is 7, which was decided by testing. After training, the fitting degree on the training set is 99%. The importance of the predictive variables is shown in TABLE III.

TABLE III. IMPORTANCE OF VARIABLES

| Variables | Weight of importance |
|---|---|
| Production country | 0.04 |
| First genre | 0.18 |
| The second genre | 0.04 |
| Seasonality | 0.20 |
| Mean of director and former two actors' fans | 0.21 |
| Mean of director and former five actors' fans | 0.33 |

The weights of importance of the predictive variables indicate that Mean of director and former five actors' fans has the largest influence, with a weight of 0.33. Its weight is larger than that of Mean of director and former two actors' fans (0.21), which suggests that the more actors are considered, the more comprehensively star value is quantified. Seasonality cannot be ignored either: its weight of importance is 0.20, which confirms the importance of Seasonality in determining the box office performance of films in China. First genre is more important than the second genre, meaning that first genre can better
represent the major type of films. Production country has the smallest influence value.
This paper also compares the prediction performance of the CART model with that of other models on the training set. The results are shown in TABLE IV. The CART model clearly has the best prediction performance, so our study chooses the CART model to predict box office.

TABLE IV. COMPARISON OF MODELS

| Model | APHR |
|---|---|
| CART | 99% |
| Bayesian Network | 86% |
| SVM | 77% |
| C5.0 | 60% |
| NNs | 40% |

C. Forecasting Results
This study uses the top 50 films of the first 5 months of 2015 as the testing set. We predict their classes with the trained CART model. The results are shown in TABLE V.

TABLE V. FORECAST RESULTS

| Class | Training APHR | Testing APHR |
|---|---|---|
| A | 100% | 100% |
| B | 87.5% | 75% |
| C | 100% | 83.3% |
| D | 100% | 80% |
| E | 100% | 75% |
| F | 100% | 100% |
| G | 100% | 63.2% |
| Average | 98.2% | 82% |
| Total | 99% | 76% |

The results indicate that our model performs very well on both the training and testing sets. In our model, the fitting degree on the training set is 99%, and the prediction accuracy in application is 76%. By applying CART to our data with a similar methodology, we improve classification accuracy from the 56.01% APHR benchmark previously established by Sharda & Delen to 76% APHR, and from the 74.4% APHR established by M. Ghiassi & David Lio to 76%. Therefore, our study shows practical value and innovation. Our research can help filmmakers avoid risk and provide guidance to some extent. The box-office revenue forecasting model introduced in this research and its accuracy establish a scientific decision support tool for stakeholders in the movie industry, offering them a rational, practical advantage.
The study is the first to use film fans as a quantization method of star and director value. Because the weights of importance of the predictive variables indicate that Mean of director and former five actors' fans has the largest influence, with a weight of 0.33, it can be concluded that the method is very effective.

VI. CONCLUSIONS
This paper converts the problem of revenue forecasting from a point estimate into a classification problem. In our study, 6 basic film variables with 34 data points are used as inputs. The study is the first to use film fans as a quantization method of star and director value, which is verified to be effective. After comparing different models, our study chooses the CART algorithm. The trained model can predict the level of box office at an early stage of the film, with a prediction accuracy of 76%, which can provide a decision-making reference for filmmakers and reduce investment risk. In further research, we will add more data indices, such as production companies, distribution companies and film format, to improve prediction accuracy. More models will also be applied in our research.

ACKNOWLEDGMENT
This paper is financially supported by the Engineering Planning Project of Communication University of China (XNG1356), the Engineering Planning Project of Communication University of China (XNG1412), the Outstanding Young Teacher Training Project of Communication University of China (YXJS201527) and the National Science Foundation of China (71172040).

REFERENCES
[1] Yan Wang, Tianxin Jin, "Marketing and risk assessment under the dual perspective of movie box office forecasting," The Chinese Film Market, the third issue, pp. 11-12, 2012.
[2] B. R. Litman, "Predicting Success of Theatrical Movies: An Empirical Study," Journal of Popular Culture, vol. 16, no. 4, pp. 159-175, 1983.
[3] R. Sharda, D. Delen, "Predicting box-office success of motion pictures with neural networks," Expert Systems with Applications, vol. 30, pp. 243-254, 2006.
[4] M. Ghiassi, David Lio, Brian Moon, "Pre-production forecasting of movie revenues with a dynamic artificial neural network," Expert Systems with Applications, vol. 42, pp. 3176-3193, 2015.
[5] Reggie Panaligan, Andrea Chen, "Quantifying Movie Magic with Google Search" [EB/OL], Google Whitepaper | Industry Perspective + User Insights, http://www.google.com.au/think/research-tudies/quantifying-movie-magic.html, 2013.6.
[6] Jingfei Du, Hua Xu, Xiaoqiu Huang, "Box office prediction based on microblog," Expert Systems with Applications, vol. 41, pp. 1680-1689, 2014.
[7] Lian Wang, Jian-min Jia, "Forecasting box office performance based on online search: Evidence from Chinese movie industry," Systems Engineering - Theory & Practice, vol. 34, no. 12, Dec. 2014.
[8] L. Einav, "Seasonality in the U.S. motion picture industry," The RAND Journal of Economics, vol. 38, no. 1, pp. 127-145, 2007.
[9] Machine Learning in Action, Post & Telecom Press, p. 160, 2013.
[10] Data Mining: Concepts, Models, Methods, and Algorithms, Second Edition, Tsinghua University Publication, p. 146, 2013.
[11] Li Zhang, Jianhua Luo, Suying Yang, "Forecasting box office revenue of movies with BP neural network," Expert Systems with Applications, vol. 36, pp. 6580-6587, 2008.

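Section V states that the 6 variables are transformed into 34 data points but does not spell out how. One reading that is consistent with TABLE I is a one-hot encoding of the four categorical variables (5 + 7 + 15 + 5 = 32 indicators) followed by the two numeric fan means. The sketch below assumes that reading. The names `CATEGORIES`, `NUMERIC`, `encode_film` and the sample film are hypothetical.

```python
# Category lists copied from TABLE I; the encoding scheme itself is an assumption.
CATEGORIES = {
    "production_country": ["China", "America", "Hong Kong", "Taiwan", "Others"],
    "first_genre": ["Action", "Plot", "Adventure", "Comedy", "Thriller",
                    "Animation", "Love"],
    "second_genre": ["Comedy", "Adventure", "Fantasy", "Action", "Thriller",
                     "Suspense", "Family", "Love", "Sci-fi", "Animation",
                     "Costume", "War", "Crime", "Terror", "Children"],
    "seasonality": ["New year", "Labor day", "Summer vacation", "National day",
                    "Others"],
}
NUMERIC = ["fans_mean_top2", "fans_mean_top5"]  # hypothetical field names


def encode_film(film):
    """Turn one film record (a dict) into a flat feature vector."""
    vector = []
    for name, values in CATEGORIES.items():
        # One indicator per possible value of the categorical variable
        vector.extend(1 if film[name] == value else 0 for value in values)
    # The two fan-count means are appended as plain numbers
    vector.extend(float(film[name]) for name in NUMERIC)
    return vector


# Hypothetical film record, for illustration only
film = {
    "production_country": "America",
    "first_genre": "Action",
    "second_genre": "Sci-fi",
    "seasonality": "Summer vacation",
    "fans_mean_top2": 12000,
    "fans_mean_top5": 8000,
}
features = encode_film(film)
print(len(features))  # 34
```

Under this assumption the vector length equals the 34 data points reported in the paper; other encodings could give the same count.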

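The class thresholds of TABLE II and the APHR of formula (2) can be sketched together in a few lines of Python. The ranges in TABLE II share their endpoints, so the sketch assumes that a revenue lying exactly on a boundary belongs to the higher class. The names `THRESHOLDS`, `revenue_class`, `aphr` and the sample revenues and predictions are hypothetical and are not the study's data.

```python
# Lower bounds in million yuan from TABLE II, highest class first.
# Assumption: a revenue exactly on a boundary is assigned to the higher class.
THRESHOLDS = [(1000, "A"), (600, "B"), (400, "C"), (200, "D"), (100, "E"), (50, "F")]


def revenue_class(revenue_million):
    """Map a box office revenue (million yuan) to a class from A to G."""
    for lower_bound, label in THRESHOLDS:
        if revenue_million >= lower_bound:
            return label
    return "G"  # below 50 million: 'flop'


def aphr(actual, predicted):
    """Average percent hit rate: correctly classified samples / all samples."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one sample is required")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Hypothetical revenues and model predictions, for illustration only
actual_revenues = [1200, 650, 30, 150, 45]
predicted_classes = ["A", "C", "G", "E", "F"]

actual_classes = [revenue_class(r) for r in actual_revenues]
print(actual_classes)                           # ['A', 'B', 'G', 'E', 'G']
print(aphr(actual_classes, predicted_classes))  # 0.6
```

In this toy example three of the five films are placed in the correct class, giving an APHR of 0.6.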