Machine Learning Approach To Predict Facebook Comment Volume
net/publication/341372125
3 authors, including:
Alaa Elsakran
American University of Sharjah
All content following this page was uploaded by Alaa Elsakran on 14 May 2020.
Abstract:
There is an enormous amount of data uploaded to Facebook every day. Therefore, there is an
essential need to analyze this data for many purposes. In this paper, an in-depth analysis of Facebook
comment volume prediction has been conducted to demonstrate the effect of the number of
comments on marketing. The data used consists of 603,813 observations provided by Facebook.
It includes fifty-three input attributes that have been used to predict the number of comments a
post is going to receive in the next H hours. Several Machine Learning techniques have
been applied to analyze this considerable amount of data, namely Stepwise Linear
Regression, Gradient Boosting, Neural Network, and Decision Tree models. Findings indicate that
the Neural Network model, with Stepwise Linear Regression used as a variable selection step for it,
achieved the smallest error and outperformed the models used in previous studies. However, Gradient
Boosting outperformed the other models with a Hits@10 score of 6.7. Additionally, one of the most
crucial outcomes of this study is the identification of the most significant variables. In particular, it was
found that the publication day of the post, the number of check-ins, shares or likes on the page, and
the number of comments in the last 24 or 48 hours were the most important variables. These
variables help a post garner more comments, which results in more viewers and higher profit for
the marketer. The study concludes with some limitations and suggestions for future research.
Keywords: Comment volume prediction; linear regression; neural network; gradient boosting;
decision tree.
Introduction:
Social media platforms are considered vital sources of data. Facebook is one of the most used
social media platforms, and massive amounts of data are uploaded to it every day. This has
encouraged researchers from different fields to try to glean knowledge by analyzing this
data. Furthermore, researchers were inspired to use data mining techniques to handle the huge volume of
dynamic activities of Facebook users efficiently. These activities include likes, comments and
shares. However, comments have been given the greatest attention as they reflect people’s emotions.
Thus, Facebook Comment Volume Prediction (CVP) has been considered a central
area of research in social media in the last decade. Comment volume refers to the number of comments, the number
of words in the comment section and the number of users who write comments (Kaur & Verma,
2016). Furthermore, since comments indicate people’s feelings and opinions, they can be used
for sharing information, which could have a crucial impact on sales revenue. At present, sales
and marketing methods have changed dramatically, and companies have adapted to this change to
compete with each other. If a company wants to reach the maximum number of customers today,
the fastest way to do so is by promoting through social media platforms.
Suppose that there are over one million active users in social media networks and that a company
or a brand is able to promote and sell their products to at least one-tenth of 1% of them; then they
have promoted the product to more than one thousand users (Belew, 2014). As a company
or a brand, it is impossible to reach this number of customers in one week without social media
platforms. Hence, the current study is of great significance for businesses from a commercial
point of view. Companies, for example, can use social media networks for marketing purposes.
These venues could be utilized by advertisers or sales managers to advertise their products or
even predict their profit. Moreover, reaching a wide audience is a time-consuming
task for brands; however, bloggers’ pages that have a huge comment volume can accelerate the popularity
of brands. Thus, effective insight into Facebook activities might contribute to more effective marketing
approaches.
Nowadays, Facebook has become one of the most significant advertising mediums for brands
to connect with their clients (Striga & Podobnik, 2018). There are many motives that lead a
company or a brand to promote their products and reach new customers through such social
networks. According to Belew (2014), the old-fashioned sales funnel no longer holds. For instance,
McDonald’s Australia serves approximately two million people every day. This huge fast food
company promoted its food using Facebook ads and reached over 5 million
people in just 5 weeks (Kaur & Verma, 2016). Furthermore, Coca-Cola Korea used Facebook
ads to promote a new beverage, which resulted in reaching 4.5 million customers (Kaur & Verma,
2016). Facebook can also provide ratings for doctors, based on the number of comments, to
assist in the decision-making process of choosing a doctor (Carbonell and Brand, 2018). It was
found that people tend to choose the doctor who is more popular, based on the number of
comments and ratings, rather than considering the medical qualifications s/he possesses. Moreover,
Facebook likes were used to predict the winning nominees in an election campaign. Khairuddin
and Rao (2017) found that the number of bare likes was not a good predictor, and that the
number of comments would be more effective in such a study. Additionally, Facebook interactions
were used to assess whether the popularity of an unpublished scientific work may be taken as an
index to the number of citations this scholarly article may obtain when it is formally published. In
this regard, Ringelhan, Wollersheim and Welpe (2015) found that the number of likes along with
Nevertheless, an overview of two closely related studies is presented in the remainder of
this section, and their results are compared with those of the current study. The two papers used the
same Facebook dataset and features that are used in the present study. The target is to
predict the volume of comments in the next H hours. One of these studies was conducted by Singh,
Kaur and Kumar (2015). They used Neural Networks and Decision Trees to perform CVP. The data
was split into 80% for training and 20% for testing. The training data was split into five variants. By
comparing the evaluation metrics of the models used in that study, it was found that Decision
Trees outperformed Neural Networks with an average Hits@10 of 6.3 and an average AUC@10 of
0.79. A limitation of that research was that the authors did not identify the significant variables that
have a great effect on predicting the target variable. Another related study was conducted by Chen,
Ehrich and Li (2017), who showed that Neural Networks with transformed data outperformed other
methods with a Hits@10 score of 6.6 and a Mean Squared Error (MSE) of 5803. The most
significant predictors were found to be the number of posts on a page in the last 24 hours and the
number of shares on the post (Chen, Ehrich and Li, 2017). A limitation of this study is that the
authors did not consider the number of comments on a promoted post as an input variable which
On the other hand, the focus of this research is to analyze the comment volume on
Facebook. Several predictors are used to help predict the number of comments in the next H hours.
Moreover, various data mining techniques, such as Stepwise Linear Regression, Decision Tree,
Neural Network and Gradient Boosting, have been applied for CVP. Ultimately, this paper aims
to reduce the error reported in the previous related studies, in addition to identifying the most significant
Methodology:
Data Preparation:
Initially, the data was provided by Facebook from different pages. It was collected from 2,770
Facebook pages of various categories (see Appendix 1). Moreover, 57,000 posts and 4,120,532
comments were extracted from these Facebook pages using Java and Facebook Query
Language. Afterwards, the data was cleaned by Singh, Kaur and Kumar (2015). Data cleaning
eliminated a large number of posts, leaving 603,813 observations for this
study. There are 53 predictors that are used to predict the number of comments
in the next H hours (see Table 2). The significant variables will be determined from these
Unlike the previous related studies, the five variants in this study were appended for training to
Ten different real-life datasets that are independent of the training data have been used as test data.
These datasets have been recorded without taking into account the base time or date. Each test
dataset consists of 100 observations, which adds up to 1,000 observations in total. By exploring the
data, it has been observed that there are no missing values. However, there are some outliers that
have not been addressed in this analysis, as they were noticed in the derived features. These
derived features consist of mean, median, minimum and maximum values. Moreover, the features’
headings have been labeled. The role of each variable has been chosen according to its description
as given in Table 2. Then, the combined dataset has been partitioned into 40% for training, 30% for
Variable                Role    Description
Page_Talking_About      Input   Number of activities done by followers on the page (likes, shares or comments).
Post_Length             Input   Number of characters or words in the post.
Post_Promotion_Status   Input   Promoted post (1), not promoted post (0).
Post_Share_Count        Input   Number of shares on the post.
Evaluation Metrics:
Hits@10, AUC@10, Mean Absolute Error (MAE) and Average Squared Error
(ASE) are used as evaluation metrics in this study. Hits@10 is a customized measure
that is computed by taking the top 10 posts that were predicted to receive the highest number of
comments for each test case and ranking them in descending order. Then, the number of
these posts that are among the actual top ten posts that received the highest number of comments is
recorded (Singh, Kaur and Kumar 2015). Hits@10 is used to measure the accuracy of prediction
in the model. AUC@10 represents the area under the Receiver Operating Characteristic (ROC) curve,
which is given by the formula $AUC = \frac{T_p}{T_p + F_p}$, where $T_p$ is the number of true
positives and $F_p$ is the number of false positives. AUC@10 is used to measure the prediction
precision of a model. In addition,
MAE and ASE measure the closeness of the predicted results to the real comment volume. These
measures show the efficiency of a model in predicting the actual comment volume, while
Hits@10 shows the efficiency of a model in terms of predicting the ranks of comment
volume. Finally, ASE, MAE, AUC@10 and Hits@10 are used to compare the results of the models.
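As an illustration only, the four evaluation metrics described in this section could be computed as in the following Python sketch. The function names and the toy numbers are hypothetical and are not taken from the study. Ties in the ranking are broken by position, which is an assumption, and AUC@10 is computed as Tp / (Tp + Fp) over the top-ranked predictions, following the definition given in the text.

```python
def top_k_indices(values, k):
    """Indices of the k largest values (ties broken by position; an assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return set(order[:k])


def hits_at_k(actual, predicted, k=10):
    """Number of the k posts with the highest predicted comment counts that are
    also among the k posts with the highest actual comment counts."""
    return len(top_k_indices(actual, k) & top_k_indices(predicted, k))


def auc_at_k(actual, predicted, k=10):
    """Tp / (Tp + Fp), where the k top-ranked predictions are the positives."""
    predicted_top = top_k_indices(predicted, k)
    tp = len(predicted_top & top_k_indices(actual, k))
    fp = len(predicted_top) - tp
    return tp / (tp + fp)


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def average_squared_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical toy test case with four posts, evaluated at k = 2.
toy_actual = [5, 0, 12, 3]
toy_predicted = [4.0, 1.0, 2.0, 9.0]

print(hits_at_k(toy_actual, toy_predicted, k=2))       # 1
print(auc_at_k(toy_actual, toy_predicted, k=2))        # 0.5
print(mean_absolute_error(toy_actual, toy_predicted))  # 4.5
print(average_squared_error(toy_actual, toy_predicted))  # 34.5
```

With ten independent test sets, as used in this study, each metric would be computed per test set and then averaged. Note that when exactly ten posts are ranked, Tp + Fp = 10, so this definition of AUC@10 reduces to Hits@10 divided by 10.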
Modeling:
The data mining techniques used in this study are: Stepwise Linear Regression (LR), Decision
Tree as a variable selection step for Stepwise Linear Regression (DT-LR), Neural Network (NN),
Gradient Boosting (GB), Gradient Boosting as a variable selection step for Neural Network (GB-NN),
and Stepwise Linear Regression as a variable selection step for Neural Network (LR-NN). Variable
selection is applied before the Neural Network to improve its computational efficiency
and decrease its complexity. A Neural Network with a single hidden layer
and three neurons has been used. In an attempt to increase accuracy, other Neural Network models
were tested with the number of hidden layers increased from one to two and three. This
yielded better accuracy on the training data but failed to reduce the errors on the test data;
hence, these models were not taken into consideration. Furthermore, different models were tried to obtain the best
ASE, and only the best six models, mentioned above, were taken into consideration. In the coming
section, the results of the current study are compared with those reported by Singh, Kaur and
Kumar (2015) and by Chen, Ehrich and Li (2017).
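To make the idea of using stepwise regression as a variable selection step more concrete, the following Python sketch shows a simple forward stepwise selection. It is not the procedure used in this study, whose software and entry criteria are not stated here: the entry rule (a minimum relative reduction in the sum of squared errors, `min_relative_gain`), the function names and the toy data are all assumptions made for illustration. The sketch also does not handle collinear columns.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ols_sse(rows, y, columns):
    """Sum of squared errors of a least-squares fit that uses an intercept
    plus the given feature columns."""
    design = [[1.0] + [row[c] for c in columns] for row in rows]
    p = len(design[0])
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(p)]
    beta = solve_linear_system(xtx, xty)
    return sum((t - sum(b * v for b, v in zip(beta, d))) ** 2
               for d, t in zip(design, y))


def forward_stepwise(rows, y, min_relative_gain=0.01):
    """Greedily add the column that lowers the SSE most; stop when the
    improvement falls to min_relative_gain (an assumed rule) or below."""
    selected = []
    remaining = list(range(len(rows[0])))
    best_sse = ols_sse(rows, y, selected)
    while remaining:
        sse, col = min((ols_sse(rows, y, selected + [c]), c) for c in remaining)
        if best_sse - sse <= min_relative_gain * best_sse:
            break
        selected.append(col)
        remaining.remove(col)
        best_sse = sse
    return selected


# Hypothetical toy data: the target depends on columns 0 and 2, not on column 1.
toy_rows = [[0, 1, 1], [1, -1, -1], [2, 1, -1], [3, -1, 1],
            [4, -1, 1], [5, 1, -1], [6, -1, -1], [7, 1, 1]]
toy_y = [3.5, 3.5, 6.5, 12.5, 14.5, 16.5, 19.5, 23.5]

selected = forward_stepwise(toy_rows, toy_y)
print(selected)  # [0, 2]
```

In an LR-NN style pipeline, only the selected columns would then be passed to the neural network, which is what reduces its input dimension and complexity.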
Model comparison analysis has been performed on the models used for this study. As can be seen
from Table 3 below, LR-NN and GB-NN yielded the smallest ASE, while GB resulted in the
highest Hits@10 and AUC@10 values. NN and DT-LR have the lowest MAE but large values of
ASE. DT-LR performed the worst in terms of Hits@10 and AUC@10. In general, LR-NN performs
the best in terms of ASE, with a moderate MAE and acceptable values of Hits@10 and AUC@10. And GB
Table 3: Model Comparison Results
Metric    LR-NN    GB-NN    GB       NN       LR       DT-LR
Hits@10   6.30     6.40     6.70     6.30     5.80     5.50
AUC@10    0.63     0.64     0.67     0.63     0.58     0.55
MAE       23.70    25.30    32.63    20.34    24.75    22.13
ASE       431.30   441.47   524.30   660.70   808.60   812.70
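The comparison in Table 3 can be read off programmatically. The short Python sketch below transcribes the table values and finds the best model for each metric; the dictionary layout and variable names are illustrative choices, not part of the study. It also checks that each AUC@10 value equals the corresponding Hits@10 value divided by 10, which is what the Tp / (Tp + Fp) definition implies when ten posts are ranked.

```python
# Values transcribed from Table 3.
results = {
    "LR-NN": {"hits": 6.30, "auc": 0.63, "mae": 23.70, "ase": 431.30},
    "GB-NN": {"hits": 6.40, "auc": 0.64, "mae": 25.30, "ase": 441.47},
    "GB":    {"hits": 6.70, "auc": 0.67, "mae": 32.63, "ase": 524.30},
    "NN":    {"hits": 6.30, "auc": 0.63, "mae": 20.34, "ase": 660.70},
    "LR":    {"hits": 5.80, "auc": 0.58, "mae": 24.75, "ase": 808.60},
    "DT-LR": {"hits": 5.50, "auc": 0.55, "mae": 22.13, "ase": 812.70},
}

# Higher is better for Hits@10; lower is better for MAE and ASE.
best_hits = max(results, key=lambda name: results[name]["hits"])
best_mae = min(results, key=lambda name: results[name]["mae"])
best_ase = min(results, key=lambda name: results[name]["ase"])

# With ten ranked posts, Tp + Fp = 10, so AUC@10 should equal Hits@10 / 10.
auc_matches_hits = all(
    abs(row["hits"] / 10 - row["auc"]) < 1e-9 for row in results.values()
)

print(best_hits, best_mae, best_ase, auc_matches_hits)  # GB NN LR-NN True
```

No single model is best on every metric, so the choice of "best" model depends on whether ranking quality (Hits@10) or error magnitude (MAE, ASE) matters more for the application.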
Furthermore, the GB model outperformed both the Decision Tree model by Singh, Kaur and Kumar
(2015) and the NN by Chen, Ehrich and Li (2017) in terms of Hits@10. In contrast to the Neural Network model applied
by Chen, Ehrich and Li (2017), the LR-NN model applied in this study has yielded a much smaller
Moreover, in contrast to the two significant variables found by Chen, Ehrich and Li (2017),
this study found other significant variables that have a remarkable effect
on the number of comments in the next H hours. For example, LR-NN reveals that the
following variables have a significant effect on the target variable: CC2, B_Monday, CC3,
P_Saturday, Post_Share_Count, P_Tuesday and 17 other features that are derived variables.
Moreover, other significant variables are found through the use of GB, as shown in Table 5.
Table 5: Significant Variables Given by Gradient Boosting
Significant Variables Importance Level
Base_time 1.00
CC2 0.94
CC5 0.49
Post_Share_count 0.30
CC4 0.13
Page_talking_about 0.10
Page_Checkins 0.08
CC1 0.06
Page_Popularity_likes 0.03
It is worth noting that both the LR-NN and GB models indicate that the number of shares,
likes and comments has a great influence on the target variable. Likewise, the more interactions
a post receives, the more popular it gets, the more viewers it attracts and the greater the number
of comments it will receive. Additionally, it is observed that the number of comments on a post
decreases after Monday and Wednesday, whereas it increases after Thursday. From a sales
comments, while publishing a post on weekdays has the opposite influence on the number of
comments. Moreover, one of the most interesting outcomes of this study is that the number of page
check-ins was found to be a significant variable as well. A high number of check-ins at a place,
restaurant, or shop results in a large increase in the number of comments in the next H hours. For
instance, many shops require Facebook details to log in to their Wi-Fi network. This information
is used to post a check-in on the customer’s Facebook account so that their followers can see
the post. In that way the shop is promoting itself indirectly. The more check-ins a
page has, the more popular it gets. Therefore, the page attracts more viewers and most probably
Furthermore, Facebook users range from ordinary people to well-known bloggers. Bloggers are
social media influencers who share parts of their everyday lives and have thousands or millions of
followers. The advertising technique that companies and brands are most widely adopting now is
cooperating with bloggers. The blogging trend has been growing tremendously since 2016.
Today’s bloggers are more intellectual and are exploiting social media platforms to share
information and do business. According to Zakaria (2018), the number of licensed bloggers in
the UAE reached 650 in 2018. The question that arises is: does advertising on Facebook guarantee
the brand or the company that an advertisement they pay thousands or millions for will give
them the chance to reach thousands or millions of people? Predicting the number of
comments would thus help companies decide the budget of their advertisements or
choose the blogger they will cooperate with, based on the predicted number of comments they
will get by advertising through the blogger’s account. For example, some companies market
through an influencer’s page under a contract for one day only. They then observe the comments
and the numbers of shares and likes for their post on that day. Afterwards, they predict the number of
comments this post will have in the next H hours. This step could help the companies in making
What can be concluded from this study is that, in order for a company or a blogger to receive more
comments on a post, it is recommended to publish posts on Saturday and Thursday. The most
plausible interpretation for this is that people are free on weekends and are therefore more likely
to view the post, which will result in more engagement. Furthermore, both models indicate
that the number of comments a post receives in 24 hours or in 48 hours with respect to a base time
has a considerable effect on predicting the number of comments a post will receive in H hours. It
seems that a post with a large number of comments grabs people’s attention and most probably
To sum up, the phenomenal success of social media platforms has enabled billions of people to share their
opinions by writing comments and engaging with Facebook posts. In this paper, the authors have
considered the most active social media platform, Facebook, to analyze people’s interactions with
posts. This was done by analyzing the effect of 53 variables to predict the number of
comments in the next H hours. One of the most interesting outcomes of this study is that CVP can
be used for enhancing marketing through social media influencers by choosing the right day to
publish a post, and by having more shares and check-ins on the page. Our results show
that the best model is the LR-NN with an ASE of 431.3, MAE of 23.7, AUC@10 of 0.63 and
Hits@10 of 6.3. Several benefits of Facebook CVP have been discussed in the literature review.
Nevertheless, although many studies have been conducted on Facebook CVP, there is still an
enormous area for improvement that is left untouched. For example, our study only predicted the
number of comments on Facebook pages; some variables that could be significant
were not taken into consideration when collecting the data. Some of these are the number of
online users and the number of positive and negative comments made. Additionally, the number of
comments alone does not provide an effective indication of people’s feelings or opinions. To gain a
complete picture of the nature of comments, text-mining techniques can be used for the analysis
of post contents. We believe that this will result in remarkable outcomes that can be effectively
used to understand the thoughts and feelings of people based on their behavior and their language
on social media platforms. In the end, this will also help in understanding if the customers are
References:
Belew, S. (2014). The Art of Social Selling: Finding and Engaging Customers on Twitter,
Facebook, LinkedIn, and Other Social Networks. AMACOM. Retrieved from
Carbonell, G. and Brand, M. (2018). Choosing a physician on social media: Comments and
ratings of users are more important than the qualification of a physician. International
Journal of Human-Computer Interaction, 34(2), 117-128.
Chen, Y., Ehrich, W., and Li, C. (2017). Facebook comment volume prediction. Northwestern
University, 1-5. Retrieved from http://wdehrich.com/facebook-comment-volume-
prediction/docs/Final-Report.pdf
Facebook comment volume. (2016). Center for Machine Learning and Intelligent Systems at the
University of California, Irvine. Retrieved from
https://archive.ics.uci.edu/ml/datasets/Facebook+Comment+Volume+Dataset
Kaur, M., and Verma, P. (2016). Review on Comment Volume Prediction. IOSR Journal of
Computer Engineering, 18(2), 134-138.
Khairuddin, M., and Rao, A. (2017). Significance of likes: Analysing passive interactions on
Facebook during campaigning. PLOS ONE, 12(6), 1-24.
Ringelhan, S., Wollersheim, J. and Welpe, I. (2015). I Like, I Cite? Do Facebook Likes Predict
the Impact of Scientific Work? PLOS ONE, 10(8), 1-21.
Singh, K., Kaur, R., and Kumar, D. (2015). Comment volume prediction using neural networks
and decision trees. ResearchGate, 1(9), 15-20.
Striga, D., and Podobnik, V. (2018). Benford’s Law and Dunbar’s Number: Does Facebook
Have a Power to Change Natural and Anthropological Laws? IEEE Access, 6(1), 14,629-14,642.
Tan, P. N., Steinbach, M. and Kumar, V. (2005). Introduction to Data Mining (1st ed.). Pearson
Addison Wesley.
Zakaria, S. (2018). E-media License a boon to both bloggers and brands in UAE. Khaleej Times,
pp.1-2.
Appendix