Machine Learning Approach To Predict Facebook Comment Volume


See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/341372125


Article · May 2020


All content following this page was uploaded by Alaa Elsakran on 14 May 2020.



Machine Learning Approach to Predict Facebook Comment Volume

Alaa El-Sakran, Abdulla Al-Amin, and Ayman Alzaatreh


Department of Mathematics and Statistics, American University of Sharjah, Sharjah, UAE

Abstract:

An enormous amount of data is uploaded to Facebook every day, so there is an essential need to
analyze this data for many purposes. In this paper, an in-depth analysis of Facebook comment
volume prediction has been conducted to demonstrate the effect of the number of comments on
marketing. The data used consists of 603,813 observations provided by Facebook. It includes
fifty-three input attributes that have been used to predict the number of comments a post will
receive in the next H hours. Several Machine Learning techniques have been applied to analyze
this considerable amount of data, namely Stepwise Linear Regression, Gradient Boosting, Neural
Network, and Decision Tree models. Findings indicate that the Neural Network model, when
Stepwise Linear Regression is used for variable selection, outperformed the models used in
previous studies by achieving the smallest error. However, Gradient Boosting outperformed the
other models with a Hits@10 score of 6.7. Additionally, one of the most crucial outcomes of this
study is determining the most significant variables. In particular, it was found that the
publication day of the post, the number of check-ins, shares or likes on the page, and the number
of comments in the last 24 or 48 hours were the most important variables. These variables help a
post garner more comments, which results in more viewers and higher profit for the marketer. The
study concludes with some limitations and suggestions for future research.

Keywords: Comment volume prediction; linear regression; neural network; gradient boosting;
decision tree.

Introduction:

Social media platforms are considered vital sources of data. Facebook is one of the most used
social media platforms, and massive amounts of data are loaded onto it every day. This has
encouraged researchers from different fields to glean knowledge by analyzing this data.
Furthermore, researchers have been inspired to use data mining techniques to handle the huge
volume of dynamic activities of Facebook users efficiently. These activities include likes,
comments and shares. However, comments have been given the greatest attention as they reflect
people’s emotions. Thus, Facebook Comment Volume Prediction (CVP) has been considered one of the
most essential areas of research in social media in the last decade. Comment volume refers to the
number of comments, the number of words in the comment section and the number of users who write
comments (Kaur & Verma, 2016). Furthermore, since comments indicate people’s feelings and
opinions, they can be used to evaluate advertising campaigns. Nowadays, Facebook is used by
commercial enterprises for sharing information, which can have a crucial impact on sales
revenue. Sales and marketing methods have changed dramatically, and companies have adapted to
this change to compete with each other. If a company wants to reach the maximum number of
customers today, there is only one fast way to do so: promoting through social media platforms.
Suppose that there are over one million active users in social media networks and that a company
or a brand is able to promote and sell its products to at least one-tenth of 1% of them; then it
has promoted the product to more than one thousand users (Belew, 2014). As a company
or a brand, it is impossible to reach this number of customers in one week without social media
platforms. Hence, the current study has great significance for businesses from a commercial point
of view. Companies, for example, can use social media networks for marketing purposes. These
venues could be utilized by advertisers or sales managers to advertise their products or even
predict their profit. Moreover, reaching a wide audience is a time-consuming task for brands;
however, bloggers’ pages that have a huge comment volume can accelerate the popularity of brands.
Thus, effective insight into Facebook activities might contribute to more effective marketing
approaches.

Nowadays, Facebook has become one of the most significant advertising media for brands to connect
with their clients (Striga & Podobnik, 2018). There are many motives that inspire a company or a
brand to promote its products and reach new customers through such social networks. According to
Belew (2014), the old-fashioned sales funnel no longer holds. For instance, McDonald’s Australia
serves approximately two million people every day. This huge fast food company promoted its food
using Facebook ads and reached over 5 million people in just 5 weeks (Kaur & Verma, 2016).
Furthermore, Coca-Cola Korea used Facebook ads to promote a new beverage, reaching 4.5 million
customers (Kaur & Verma, 2016). Facebook can also provide ratings for doctors depending on the
number of comments to assist in the decision-making process of choosing a doctor (Carbonell and
Brand, 2018). It was found that people tend to choose the doctor who is more popular in terms of
comments and ratings rather than considering the medical qualifications the doctor possesses.
Moreover, Facebook likes have been used to predict the winning nominees in an election campaign.
Khairuddin and Rao (2017) found that the bare number of likes was not a good predictor, and that
the number of comments would be more effective in such a study. Additionally, Facebook
interactions have been used to assess whether the popularity of an unpublished scientific work
may be taken as an index of the number of citations the article may obtain once it is formally
published. In this regard, Ringelhan, Wollersheim and Welpe (2015) found that the number of likes
along with the number of comments can be a good indicator for such a study.

An overview of two closely related studies is presented in the remainder of this section, and
their results are compared with those of the current study. The two papers used the same Facebook
datasets and features that are used in the present study. The target is to predict the volume of
comments in the next H hours. One of these studies was conducted by Singh, Kaur and Kumar (2015).
They used Neural Networks and Decision Trees to perform CVP. Data was split into 80% for training
and 20% for testing, and the training data was split into five variants. By comparing the
evaluation metrics of the models used in that study, it was found that Decision Trees
outperformed Neural Networks with an average Hits@10 of 6.3 and an average AUC@10 of 0.79. The
limitation of that research was that it did not identify the significant variables that have a
great effect on predicting the target variable. Another related study was conducted by Chen,
Ehrich and Li (2017), who showed that Neural Networks with transformed data outperformed other
methods with a Hits@10 score of 6.6 and a Mean Squared Error (MSE) of 5803. The most significant
predictors were found to be the number of posts on a page in the last 24 hours and the number of
shares on the post (Chen, Ehrich and Li, 2017). A limitation of that study is that the authors
did not consider the number of comments on a promoted post as an input variable, which might have
an impact on the prediction, in addition to the large MSE of 5803.

The focus of this research, in turn, is to analyze the comment volume on Facebook. Several
predictors are used to help predict the number of comments in the next H hours. Moreover, various
data mining techniques, such as Stepwise Linear Regression, Decision Tree, Neural Network and
Gradient Boosting, have been applied for CVP. Ultimately, this paper aims to reduce the error
reported in the previous related studies and to identify the most significant variables that have
the greatest impact on the target variable.

Methodology:

Data Preparation:

Initially, the data was provided by Facebook from different pages. It was collected from 2,770
Facebook pages of various categories (see Appendix 1). Moreover, 57,000 posts and 4,120,532
comments were extracted from these Facebook pages using Java and Facebook Query Language.
Afterwards, the data was cleaned by Singh, Kaur and Kumar (2015). Data cleaning resulted in
eliminating a huge number of posts; therefore, 603,813 observations were left for this study.
Moreover, there are 53 predictors that are considered to predict the number of comments in the
next H hours (see Table 2). The significant variables will be determined from these predictors
depending on their effect on the target variable.

Unlike the previous related studies, the five variants in this study were appended for training,
adding up to a total of 603,813 observations as shown in Table 1.

Table 1: Number of observations

Variants                        Number of observations
Variant 1                       40,949
Variant 2                       81,831
Variant 3                       121,098
Variant 4                       160,424
Variant 5                       199,030
Total number of observations    603,813

Ten different real-life datasets that are independent of the training data have been used as
test data. These datasets were recorded without taking into account the base time or date. Each
of these test datasets consists of 100 observations, which add up to 1,000 observations in total.
By exploring the data, it has been observed that there are no missing values. However, there are
some outliers that have not been given attention in this analysis, as they were noticed in the
derived features. These derived features consist of mean, median, minimum and maximum values.
Moreover, the features’ headings have been labeled. The role of each variable has been chosen
according to its description as given in Table 2. Then, the combined dataset has been partitioned
into 40% for training, 30% for validation and 30% for testing.
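
The 40/30/30 partition described above can be sketched as follows. This is a minimal illustration in Python rather than the tool used in the study; the random shuffling, the fixed seed and the function name `partition_indices` are assumptions, since the paper does not state how rows were assigned to each part.

```python
import random


def partition_indices(n_rows, fractions=(0.4, 0.3, 0.3), seed=0):
    """Split row indices 0..n_rows-1 into train/validation/test index lists.

    Assumption: rows are assigned at random; the paper does not say whether
    the split was random or stratified.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    n_train = round(n_rows * fractions[0])
    n_valid = round(n_rows * fractions[1])

    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # whatever remains (about 30%)
    return train, valid, test


train, valid, test = partition_indices(603_813)
print(len(train), len(valid), len(test))  # 241525 181144 181144
```

Working with index lists keeps the sketch independent of how the observations themselves are stored.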

Table 2: Specifications of the Variables

Name                              Model Role   Description
B_’weekday’ (7 variables)         Input        Indicates the weekday of the chosen base date or time.
Base_Time                         Input        Chosen time (0-71) to form the system.
Derived Feature (25 variables)    Input        Attributes obtained by computing the mean, minimum, maximum, median and standard deviation of crucial attributes.
Cc1                               Input        Overall number of comments prior to the chosen base time.
Cc2                               Input        Number of comments in the last 24 hours with respect to the base time.
Cc3                               Input        Number of comments between the last 48 hours and the last 24 hours with respect to the base time.
Cc4                               Input        Number of comments in the first 24 hours after the post was published and prior to the base time.
Cc5                               Input        |Cc2 - Cc3|
H_Local                           Input        Number of hours (H) over which the target is measured.
No_Comments_In_H_Hrs              Target       Number of comments in the subsequent H hours (H is taken from the “H_Local” input variable).
P_’weekday’ (7 variables)         Input        Indicates the publication day of the post.
Page_Category                     Input        Category of the source (such as: place, business, blogger, institution).
Page_Checkins                     Input        Number of people who visited this place.
Page_Popularity_Likes             Input        Number of likes on the page.
Page_Talking_About                Input        Number of activities done by followers on the page (likes, shares or comments).
Post_Length                       Input        Number of characters or words in the post.
Post_Promotion_Status             Input        Promoted post (1), non-promoted post (0).
Post_Share_Count                  Input        Number of shares on the post.
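
To make the encoding in Table 2 concrete, the sketch below builds the seven weekday indicators and the derived feature Cc5 for a single post. The helper names `weekday_indicators` and `cc5`, the ordering of the days and the sample values are illustrative assumptions; only the column naming pattern (for example `P_Saturday`) and the definition of Cc5 as the absolute difference between Cc2 and Cc3 come from the paper.

```python
WEEKDAYS = ["Sunday", "Monday", "Tuesday", "Wednesday",
            "Thursday", "Friday", "Saturday"]


def weekday_indicators(prefix, weekday):
    """Return seven 0/1 columns such as P_Monday, with a 1 for `weekday` only."""
    if weekday not in WEEKDAYS:
        raise ValueError(f"unknown weekday: {weekday}")
    return {f"{prefix}_{day}": int(day == weekday) for day in WEEKDAYS}


def cc5(cc2, cc3):
    """Cc5 as defined in Table 2: absolute difference between Cc2 and Cc3."""
    return abs(cc2 - cc3)


# Hypothetical post: published on a Thursday, base time falls on a Friday.
row = {}
row.update(weekday_indicators("P", "Thursday"))
row.update(weekday_indicators("B", "Friday"))
row["Cc5"] = cc5(cc2=12, cc3=30)
```

Each group of seven indicators contains exactly one 1, which is how a single categorical weekday becomes seven of the 53 inputs.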

Evaluation Metrics:

Hits@10, AUC@10, Mean Absolute Error (MAE) and Average Squared Error (ASE) are used as
evaluation metrics in this study. Hits@10 is a customized measure that is computed by taking the
10 posts predicted to receive the highest number of comments in each test case and ranking them
in descending order. Then, the number of these posts that are among the actual top ten posts,
i.e. those that received the highest number of comments, is recorded (Singh, Kaur and Kumar,
2015). Hits@10 is used to measure the accuracy of prediction of the model. AUC@10 represents the
area under the Receiver Operating Characteristic (ROC) curve, which is given by the formula
$\mathrm{AUC} = \frac{T_p}{T_p + F_p}$, where $T_p$ is the number of true positives and $F_p$ is
the number of false positives. AUC@10 is used to measure the prediction exactness of a model. In
addition, MAE measures the closeness of the predicted values to the actual comment volume. The
error measures (MAE and ASE) show the efficiency of a model in predicting the actual comment
volume, while Hits@10 shows the efficiency of a model in terms of predicting the ranks of comment
volume. ASE, MAE, AUC@10 and Hits@10 are used to compare the results of the models used in this
study.
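
The four metrics can be sketched in a few lines of Python for a single test case. This is an illustration under stated assumptions rather than the implementation used in the study: the function names are hypothetical, ties are broken by position, and for AUC@10 the hits among the predicted top ten are counted as true positives and the remaining predicted posts as false positives.

```python
def top_k_indices(values, k=10):
    """Indices of the k largest values, highest first (ties keep input order)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return order[:k]


def hits_at_k(actual, predicted, k=10):
    """Number of posts in the predicted top k that are also in the actual top k."""
    return len(set(top_k_indices(actual, k)) & set(top_k_indices(predicted, k)))


def auc_at_k(actual, predicted, k=10):
    """Tp / (Tp + Fp), with Tp the hits and Fp the misses in the predicted top k."""
    tp = hits_at_k(actual, predicted, k)
    fp = k - tp
    return tp / (tp + fp)


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted comment counts."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def average_squared_error(actual, predicted):
    """Average squared difference between actual and predicted comment counts."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
```

Under these assumptions AUC@10 equals Hits@10 divided by 10. In the study the reported values would correspond to averages over the ten test datasets.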

Modeling:

The data mining techniques used in this study are: Stepwise Linear Regression (LR), Decision
Tree as a variable selection method for Stepwise Linear Regression (DT-LR), Neural Network (NN),
Gradient Boosting (GB), Gradient Boosting as a variable selection method for Neural Network
(GB-NN), and Stepwise Linear Regression as a variable selection method for Neural Network
(LR-NN). The variable selection techniques are used to improve the computational efficiency and
decrease the complexity of the Neural Network. A Neural Network with a single hidden layer and
three neurons has been used. In an attempt to increase accuracy, other Neural Network models were
tested with the number of hidden layers increased from one to two and three. These yielded better
accuracy on the training data but failed to improve the errors on the test data; hence, they were
not taken into consideration. Furthermore, different models were tried to obtain the best ASE,
and only the best six models, mentioned above, were taken into consideration. In the coming
section, the results of the current study are compared with those reported by Singh, Kaur and
Kumar (2015) and Chen, Ehrich and Li (2017).
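
To illustrate why variable selection reduces the complexity of the network, the sketch below counts the parameters of a fully connected network with one hidden layer of three neurons and shows a forward pass. The tanh activation, the linear output and the function names are assumptions, as the paper does not report these details; the figure of 29 inputs is likewise only illustrative, taken from the 12 named plus 17 derived variables that the results section reports as significant for LR-NN.

```python
import math


def mlp_parameter_count(n_inputs, n_hidden=3, n_outputs=1):
    """Weights plus biases of a fully connected net with one hidden layer."""
    hidden_layer = n_hidden * (n_inputs + 1)    # +1 for each neuron's bias
    output_layer = n_outputs * (n_hidden + 1)
    return hidden_layer + output_layer


def predict(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass: inputs -> tanh hidden neurons -> one linear output."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(weights, x)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias


print(mlp_parameter_count(53))  # all 53 inputs: 166 parameters
print(mlp_parameter_count(29))  # 29 selected inputs: 94 parameters
```

With three hidden neurons the count is 3p + 7 for p inputs, so every input removed by the selection step removes three weights.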

Results and Discussion:

Model comparison analysis has been performed on the models used for this study. As can be seen
from Table 3 below, LR-NN and GB-NN yielded the smallest ASE, while GB resulted in the highest
Hits@10 and AUC@10 values. NN and DT-LR have the lowest MAE but large values of ASE. DT-LR
performed the worst in terms of Hits@10 and AUC@10. In general, LR-NN offers the best balance
between MAE and ASE, with the smallest ASE and acceptable values of Hits@10 and AUC@10, and GB
performs the best in terms of Hits@10 and AUC@10.

Table 3: Model Comparison Results

           LR-NN     GB-NN     GB        NN        LR        DT-LR
Hits@10    6.30      6.40      6.70      6.30      5.80      5.50
AUC@10     0.63      0.64      0.67      0.63      0.58      0.55
MAE        23.70     25.30     32.63     20.34     24.75     22.13
ASE        431.30    441.47    524.30    660.70    808.60    812.70
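
The comparison in Table 3 can be reproduced mechanically. The short sketch below stores the reported values and picks the best model per metric; the dictionary layout and the helper name `best_model` are illustrative choices, while the numbers are those of Table 3.

```python
RESULTS = {
    "LR-NN": {"Hits@10": 6.30, "AUC@10": 0.63, "MAE": 23.70, "ASE": 431.30},
    "GB-NN": {"Hits@10": 6.40, "AUC@10": 0.64, "MAE": 25.30, "ASE": 441.47},
    "GB":    {"Hits@10": 6.70, "AUC@10": 0.67, "MAE": 32.63, "ASE": 524.30},
    "NN":    {"Hits@10": 6.30, "AUC@10": 0.63, "MAE": 20.34, "ASE": 660.70},
    "LR":    {"Hits@10": 5.80, "AUC@10": 0.58, "MAE": 24.75, "ASE": 808.60},
    "DT-LR": {"Hits@10": 5.50, "AUC@10": 0.55, "MAE": 22.13, "ASE": 812.70},
}

# Ranking metrics should be maximized, error metrics minimized.
HIGHER_IS_BETTER = {"Hits@10": True, "AUC@10": True, "MAE": False, "ASE": False}


def best_model(metric):
    """Name of the model with the best value for the given metric."""
    choose = max if HIGHER_IS_BETTER[metric] else min
    return choose(RESULTS, key=lambda model: RESULTS[model][metric])


for metric in HIGHER_IS_BETTER:
    print(metric, "->", best_model(metric))
```

Running it selects GB for Hits@10 and AUC@10, NN for MAE and LR-NN for ASE. In every column the reported AUC@10 equals Hits@10 divided by 10.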

Furthermore, the GB model outperformed both the Decision Tree model of Singh, Kaur and Kumar
(2015) and the NN of Chen, Ehrich and Li (2017) in terms of Hits@10. In contrast to the Neural
Network model applied by Chen, Ehrich and Li (2017), the LR-NN model applied in this study
yielded a much smaller ASE, as demonstrated in Table 4.

Table 4: Model Comparison with Previous Studies

           LR-NN    GB       DT (Singh, Kaur & Kumar, 2015)    NN (Chen, Ehrich & Li, 2017)
Hits@10    6.3      6.7      6.3                               6.6
ASE        431.3    524.3    *                                 5803
*The value was not reported by the authors.
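
As a small worked example of what "much smaller" means here, the relative reduction in squared error from 5803 to 431.3 can be computed as below. The helper name `relative_reduction` is hypothetical, and the calculation assumes the two error figures are directly comparable, which is how the paper treats them.

```python
def relative_reduction(baseline, new):
    """Fraction by which `new` is smaller than `baseline`."""
    return (baseline - new) / baseline


ase_reduction = relative_reduction(baseline=5803, new=431.3)
print(f"{ase_reduction:.1%}")  # 92.6%
```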

Moreover, in contrast to the two significant variables found by Chen, Ehrich and Li (2017), this
study found other significant variables that have a remarkable effect on the number of comments
in the next H hours. For example, LR-NN reveals that the following variables have a significant
effect on the target variable: Cc2, B_Monday, Cc3, B_Thursday, Cc5, B_Wednesday, H_Local,
P_Wednesday, Page_Talking_About, P_Saturday, Post_Share_Count, P_Tuesday and 17 other features
that are derived variables. Moreover, other significant variables are found through the use of
GB, as shown in Table 5.

Table 5: Significant Variables Given by Gradient Boosting

Significant Variables     Importance Level
Base_Time                 1.00
Cc2                       0.94
Cc5                       0.49
Post_Share_Count          0.30
Cc4                       0.13
Page_Talking_About        0.10
Page_Checkins             0.08
Cc1                       0.06
Page_Popularity_Likes     0.03

It is noteworthy that both the LR-NN and GB models indicate that the number of shares, likes and
comments has a great influence on the target variable. The more interactions a post receives, the
more popular it gets, the more viewers it attracts and the greater the number of comments it will
receive. Additionally, it is observed that the number of comments on a post decreases after
Monday and Wednesday, whereas it increases after Thursday. From a sales standpoint, publishing a
post on Saturday or Thursday results in an increase in the number of comments, while publishing a
post on the other weekdays has the opposite influence. Moreover, one of the most interesting
outcomes of this study is that the number of page check-ins was found to be a significant
variable as well. A large number of check-ins at a place, restaurant, or shop results in a huge
increase in the number of comments in the next H hours. For instance, many shops require Facebook
details to log in to their Wi-Fi network. This information is used to post a check-in on the
customer’s Facebook account so that their followers can see the post. In that way the shop
promotes itself indirectly. The greater the number of check-ins a page has, the more popular it
gets. Therefore, the page attracts more viewers and most probably receives more comments.

Furthermore, Facebook users range from ordinary people to well-known bloggers. Bloggers are
social media influencers who share parts of their everyday lives and have thousands or millions
of followers. The most common advertising technique that various companies and brands are
adopting now is cooperating with bloggers. The blogging trend has been growing tremendously since
2016. Today’s bloggers are more intellectual and are exploiting social media platforms to share
information and do business. According to Zakaria (2018), the number of licensed bloggers in the
UAE reached 650 in 2018. The question that arises is: does advertising on Facebook guarantee a
brand or company that an advertisement it pays thousands or millions for will reach thousands or
millions of people? Predicting the number of comments would therefore help companies decide the
budget of their advertisements, or choose the blogger they will cooperate with based on the
predicted number of comments they will get by advertising through the blogger’s account. For
example, some companies market through an influencer’s page under a contract for one day only.
They then observe the comments and the numbers of shares and likes for their post on that day.
Afterwards, they predict the number of comments this post will have in the next H hours. This
step could help the companies decide whether or not to extend the advertisement contract for
further weeks.

What can be concluded from this study is that, in order for a company or a blogger to receive
more comments on a post, it is recommended to publish posts on Saturday and Thursday. The most
plausible interpretation is that people are on their weekends and are more likely to view the
post, which will result in more engagement. Furthermore, both models indicate that the number of
comments a post receives in 24 hours or in 48 hours with respect to a base time has a
considerable effect on predicting the number of comments a post will receive in H hours. It seems
that a post with a large number of comments grabs people’s attention, so they view it and most
probably comment on it.

Conclusion and Limitations:

To sum up, the phenomenal success of social media platforms has enabled billions to share their
opinions by writing comments and engaging with Facebook posts. In this paper the authors have
considered the most active social media platform, Facebook, to analyze people’s interactions with
posts. This was done by analyzing the effect of 53 variables to predict the number of comments in
the next H hours. One of the most interesting outcomes of this study is that CVP can be used for
enhancing marketing through social media influencers by choosing the right day to publish a post,
and by having more shares and check-ins on the page. Our results have shown that the best model
is the LR-NN with an ASE of 431.3, MAE of 23.7, AUC@10 of 0.63 and Hits@10 of 6.3. Several
benefits of Facebook CVP have been discussed in the literature review.

Nevertheless, although many studies have been conducted on Facebook CVP, there is an enormous
area for improvement that is left untouched. For example, our study only predicted the number of
comments on Facebook pages; there are some variables that could be significant and have not been
taken into consideration while collecting the data. Some of these are the number of online users
and the number of positive and negative comments made. Additionally, the number of comments does
not provide an effective indication of people’s feelings or opinions. To gain a complete picture
of the nature of comments, text-mining techniques can be used for the analysis of post contents.
We believe that this will result in remarkable outcomes that can be effectively used to
understand the thoughts and feelings of people based on their behavior and their language on
social media platforms. In the end, this will also help in understanding whether customers are
willing to buy a specific product or not.

References:

Belew, S. (2014). The Art of Social Selling: Finding and Engaging Customers on Twitter,
    Facebook, LinkedIn, and Other Social Networks. AMACOM. Retrieved from

Carbonell, G., and Brand, M. (2018). Choosing a physician on social media: Comments and
    ratings of users are more important than the qualification of a physician. International
    Journal of Human-Computer Interaction, 34(2), 117-128.
Chen, Y., Ehrich, W., and Li, C. (2017). Facebook comment volume prediction. Northwestern
    University, 1-5. Retrieved from
    http://wdehrich.com/facebook-comment-volume-prediction/docs/Final-Report.pdf
Facebook comment volume. (2016). Center for Machine Learning and Intelligent Systems at the
    University of California, Irvine. Retrieved from
    https://archive.ics.uci.edu/ml/datasets/Facebook+Comment+Volume+Dataset
Kaur, M., and Verma, P. (2016). Review on Comment Volume Prediction. IOSR Journal of
    Computer Engineering, 18(2), 134-138.
Khairuddin, M., and Rao, A. (2017). Significance of likes: Analysing passive interactions on
    Facebook during campaigning. PLOS ONE, 12(6), 1-24.
Ringelhan, S., Wollersheim, J., and Welpe, I. (2015). I Like, I Cite? Do Facebook Likes Predict
    the Impact of Scientific Work? PLOS ONE, 10(8), 1-21.
Singh, K., Kaur, R., and Kumar, D. (2015). Comment volume prediction using neural networks
    and decision trees. ResearchGate, 1(9), 15-20.
Striga, D., and Podobnik, V. (2018). Benford’s Law and Dunbar’s Number: Does Facebook
    Have a Power to Change Natural and Anthropological Laws? IEEE Access, 6(1), 14,629-14,642.
Tan, P. N., Steinbach, M., and Kumar, V. (2005). Introduction to Data Mining (1st ed.). Pearson
    Addison Wesley.
Zakaria, S. (2018). E-media License a boon to both bloggers and brands in UAE. Khaleej Times,
    pp. 1-2.

Appendix

Appendix 1: Categories of pages and posts

Product/service                   Cars                             TV channel
Public figure                     Clothing                         Telecommunication
Retail and consumer merchandise   Local business                   Entertainment website
Athlete/Coach                     Musician/band                    Shopping/retail
Education website                 Politician                       Personal blog
Arts/entertainment/nightlife      News/media website               App page
Aerospace/defense                 Education                        Vitamins/supplements
Actor/director                    Author                           Professional services
Professional sports team          Sports event                     Movie theater
Travel/leisure                    Restaurant/café                  Small business
Arts/humanities website           School sports team               News personality
Food/beverages                    University                       Teens/kids website
Record label                      TV show                          Government official
Movie                             Website                          Photographer
Song                              Outdoor gear/sporting goods      Bar
Community                         Political party                  Camera/photo
Company                           Sports league                    Book
Artist                            Entertainer                      Producer
Non-governmental organization     Church/religious organization    Society/culture website
Media/news/publishing             Non-profit organization          Games/toys
Spas/beauty/personal care         Automobiles and parts            Bank/financial institution
Studio                            Organization                     Software
Home décor                        TV/movie award                   Magazine
Jewelry/watches                   Hotel                            Electronics
Writer                            Health/medical/pharmaceuticals   School
Health/beauty                     Transportation                   Just for fun
Music video                       Local/travel website             Club
Appliances                        Musical instrument               Comedian
Computers/technology              Radio station                    Sports venue
Insurance company                 Video game/Other                 Sports/recreation/activities
Music award                       Computers                        Publisher
Recreation/sports website         Phone/tablet                     TV network
Reference website                 Internet/software                Health/medical/pharmacy
Business/economy website          Tools/equipment                  Landmark
