
Preliminary Report

Professor | Pedro Duarte Silva

Gonçalo Tavares | 355422025


João Francisco | 355422052
Afonso Tenreiro | 355422075

Course | Regression and Data Analysis

Masters in Management, Business Analytics | 2022/2023


Contents
Introduction............................................................................................................................ 3
Dataset Description................................................................................................................ 3
Model development................................................................................................................ 4
Model 1............................................................................................................................... 4
Model 2............................................................................................................................... 5
Model 3............................................................................................................................... 5
Model 4............................................................................................................................... 6
Autocorrelation.................................................................................................................... 6
Multicollinearity................................................................................................................... 6
Heteroscedasticity.............................................................................................................. 7
Confidence and prediction intervals.....................................................7
Results................................................................................................................................... 7
Autocorrelation results........................................................................................................9
Multicollinearity results........................................................................................................9
Heteroscedasticity results...................................................................................................9
Outliers results..................................................10
Confidence and prediction intervals results........................................................10
Conclusion........................................................................................................................... 10
Appendix.............................................................................................................................. 12

Introduction
This report presents the analysis of a dataset that includes all the posts published on the
Facebook page of a well-known cosmetic brand between January 1st and December 31st of
2014.

The report will include an explanation of the methodology used in the development, choice,
and validation of the chosen models, followed by a discussion of the findings and study
conclusions.

Dataset Description
The dataset used in this analysis consists of all the posts published between the 1st of January
and the 31st of December of 2014 on the Facebook page of a worldwide renowned cosmetic
brand, for a total of 790 posts. The compiled dataset contained four types of features:
· Identification-features that allow identifying each individual post
· Content-the textual content of the post
· Categorization-features that characterize the post
· Performance-metrics for measuring the impact of the post

The proposed dataset includes the following list of input and output features:

The majority of the data was extracted directly from the brand's Facebook page. There
were two exceptions: "total interactions" and "category." The former is a column computed
from the performance metrics downloaded from Facebook, as the sum of the post's
comments, likes, and shares. The latter is a manual classification based on the campaign
with which the posted content is affiliated. Because this classification was a manual process, a
second seasoned social media worker inside the organization validated the category of all 790
posts in order to reduce the risk of misclassification due to typing errors.
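As an illustration, the derived column could be recomputed in R along the following lines (a minimal sketch; the data frame name fb is an assumption, and the column names follow the dataset's R-style naming):

# Sketch: "total interactions" as the sum of the three interaction metrics.
# The data frame name 'fb' is assumed for illustration only.
fb$Total.Interactions <- fb$comment + fb$like + fb$share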

The performance metrics collected characterized posts' performance in several aspects. Some
of them were intuitively derived from interactions with posts, such as the number of
comments, likes, and shares of the post. The “page total likes” measures the number of likes

the page had when the post was published. The remaining metrics are not so intuitive. These
can be logically divided into visualizations and interactions. The former, named “impressions,”
are based on counting the number of times the post was loaded onto the user's browser,
whether directly (organic reach) or through another user's interaction (viral reach). The latter,
“engagements,” account for all the types and origins of clicks on the post. Considering that
engagements define explicit user actions on the post, these constitute a stronger measure for
user feedback on the post when compared to impressions, since loading the contents on the
browser does not actually mean the user has paid attention to it.

Model development
The main goal of this assignment is to understand the impact of the explanatory variables on
lifetime post total reach (the number of unique users who saw a page post). With that in
mind, we started our analysis with a global model (including all the explanatory
variables) and later excluded those that are not significant in explaining the behaviour of
“Lifetime.Post.Total.Reach”.

Initially, we considered the following model:

Model 1

Lifetime.Post.Total.Reach = β0 + β1 Page.total.likes + β2 Lifetime.Post.Total.Impressions +
β3 Lifetime.Post.Consumers + β4 Lifetime.Post.Consumptions +
β5 Lifetime.Post.Impressions.by.people.who.have.liked.your.Page +
β6 Lifetime.Post.reach.by.people.who.like.your.Page +
β7 Lifetime.People.who.have.liked.your.Page.and.engaged.with.your.post +
β8 comment + β9 like + β10 share + β11 Total.Interactions + μ

The model summary - summary(Mod1res) - revealed that the Total.Interactions coefficient is
reported as NA, which indicates perfect multicollinearity (Total.Interactions is, by construction,
the sum of comment, like, and share), so we removed this variable from the model.
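A minimal sketch of how Model 1 might be fitted and inspected in R is shown below; the data frame name fb is an assumption, and the column names follow the dataset's R-style naming:

# Fit the global model (Model 1) with all candidate explanatory variables.
# 'fb' is an assumed data frame name holding the 790 posts.
mod1regression <- lm(Lifetime.Post.Total.Reach ~ Page.total.likes +
                       Lifetime.Post.Total.Impressions +
                       Lifetime.Post.Consumers +
                       Lifetime.Post.Consumptions +
                       Lifetime.Post.Impressions.by.people.who.have.liked.your.Page +
                       Lifetime.Post.reach.by.people.who.like.your.Page +
                       Lifetime.People.who.have.liked.your.Page.and.engaged.with.your.post +
                       comment + like + share + Total.Interactions,
                     data = fb)
# The coefficient of Total.Interactions appears as NA because it is a linear
# combination of comment, like and share (perfect multicollinearity).
summary(mod1regression)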

Model 2

Lifetime.Post.Total.Reach = β0 + β1 Page.total.likes + β2 Lifetime.Post.Total.Impressions +
β3 Lifetime.Post.Consumers + β4 Lifetime.Post.Consumptions +
β5 Lifetime.Post.Impressions.by.people.who.have.liked.your.Page +
β6 Lifetime.Post.reach.by.people.who.like.your.Page +
β7 Lifetime.People.who.have.liked.your.Page.and.engaged.with.your.post +
β8 comment + β9 like + β10 share + μ

Subsequently, we applied stepwise selection to regression model 2 and obtained the following model:

Model 3

Lifetime.Post.Total.Reach = β0 + β1 Page.total.likes + β2 Lifetime.Post.Total.Impressions +
β3 Lifetime.Post.Consumers + β4 Lifetime.Post.Consumptions +
β5 Lifetime.Post.Impressions.by.people.who.have.liked.your.Page +
β6 Lifetime.Post.reach.by.people.who.like.your.Page +
β7 Lifetime.People.who.have.liked.your.Page.and.engaged.with.your.post +
β8 like + μ
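A minimal sketch of the stepwise selection step, assuming Model 2 was fitted as mod2regression:

# AIC-based stepwise selection starting from Model 2; the reduced model is
# stored as Model_step (the name used in the Results section).
Model_step <- step(mod2regression, direction = "both", trace = FALSE)
summary(Model_step)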

Finally, by removing variables with a VIF (Variance Inflation Factor) greater than 10, we
reduced the degree of multicollinearity in the model and improved the accuracy and stability
of the estimated regression coefficients:

Model 4

Lifetime.Post.Total.Reach = β0 + β1 Lifetime.Post.Total.Impressions +
β2 Lifetime.Post.Impressions.by.people.who.have.liked.your.Page +
β3 Lifetime.Post.reach.by.people.who.like.your.Page +
β4 Lifetime.People.who.have.liked.your.Page.and.engaged.with.your.post + μ

We calculated the AIC (Akaike Information Criterion) for each model and concluded that
regression model 3 is the best model, as it has the lowest AIC value among the candidate
regression models.
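This comparison can be reproduced with the AIC() function; a sketch, assuming the four fitted model objects are named as below (the name mod4regression for Model 4 is an assumption):

# Compare the candidate models; the model with the lowest AIC is preferred.
AIC(mod1regression, mod2regression, mod3regression, mod4regression)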

After deciding on regression model 3, we assessed its adequacy in comparison to the more
comprehensive model. In order to do this, we used the ANOVA test in R, which tests the
following hypotheses.

𝐻0: Reduced model (mod3regression) is satisfactory compared to the more complete model
(mod1regression).

𝐻1: The reduced model is not satisfactory compared to the more complete model.
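A minimal sketch of the corresponding nested-model comparison in R:

# F test comparing the reduced model against the more complete model.
anova(mod3regression, mod1regression)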

The overall p-value for the ANOVA test is the p-value in the last row (Pr(>F)), which is
0.0067772. Since this value is below the 5% significance level, it suggests that there is a
significant difference between the two models.

Autocorrelation

The Durbin-Watson test is a statistical test used to detect the presence of autocorrelation in
the residuals of a regression model, which occurs when the residuals are not independent of
each other.
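A minimal sketch of how the test can be run in R, here with dwtest() from the lmtest package:

library(lmtest)
# Durbin-Watson test on the residuals of the chosen model; the alternative
# hypothesis is that the true autocorrelation is greater than 0.
dwtest(mod3regression)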

Multicollinearity

In this report, we conducted a multicollinearity test using the Variance Inflation Factor (VIF)
method to assess the degree of correlation among the predictor variables in a regression
model. The purpose of this test is to identify any predictor variables that may be redundant or
highly correlated with other variables, which can result in unstable and unreliable estimates
of regression coefficients.
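A minimal sketch of the VIF computation, using vif() from the car package on a fitted model:

library(car)
# Variance inflation factors for the predictors of a fitted model;
# values above 10 are taken here as a sign of problematic collinearity.
vif(mod3regression)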

Heteroscedasticity

We ran a Breusch-Pagan test for heteroskedasticity on the regression model
"mod3regression". The Breusch-Pagan test is a statistical test used to detect
the presence of heteroskedasticity in a regression model, which occurs when the variance of
the errors is not constant across all levels of the predictor variables.
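A minimal sketch of the test, using bptest() from the lmtest package:

library(lmtest)
# Breusch-Pagan test for heteroskedasticity on the chosen model.
bptest(mod3regression)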

Outliers

To look for outliers, we used the command influenceIndexPlot(), which shows graphically
that there are some outliers for both mod1regression and mod3regression.
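A minimal sketch of the calls, using influenceIndexPlot() from the car package:

library(car)
# Index plots of Cook's distance, studentized residuals and hat values,
# used to flag potentially influential observations in each model.
influenceIndexPlot(mod1regression)
influenceIndexPlot(mod3regression)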

Outliers do not necessarily invalidate models or estimates, but they can reduce precision and
accuracy. The potential implications of these outliers on model interpretation and use should
be considered.

Confidence and prediction intervals


Analysing both confidence and prediction intervals together provides a more complete
picture of the model's uncertainty and can help us make better decisions based on the model's
predictions. It is important to note, however, that prediction intervals tend to be wider than
confidence intervals due to the added variability of the error term, and caution should be
taken when interpreting them.

Results
Firstly, we started with a model called “mod1regression” that included all the numeric
variables in the dataset. However, the summary of this model revealed that the variable
“Total.Interactions” has an NA coefficient, indicating multicollinearity, so we removed this
variable and created a new model called “mod2regression”, which included all the
remaining numeric variables.

The summary of “mod2regression” showed us that all the variables are significant at the 5%
level except for “like”, “comment”, “share”, “Page.Popularity” and “Post.Month3”. The
Adjusted R-squared value for this model is 0.2359, which means that 23.59% of the variation
in the “Lifetime.Post.Total.Reach” variable can be explained by the other numeric
variables.

Further, we used a stepwise regression method to try to improve the model by only including
significant variables; the resulting model is called “Model_step”. The summary of this model
shows that the variables “Page.Category” and “Post.Month12” are not significant at the
5% level and should be removed. The Adjusted R-squared value for this model is 0.2357,
which is slightly lower than that of the previous model.

Then we created a new model called “mod3regression” that included the variables
“Lifetime.Post.Total.Impressions”, “Lifetime.Post.Consumers” and “Lifetime.Engaged.Users”,
and had an Adjusted R-squared of 0.4725, which indicated that approximately 47%
of the variance in “Lifetime.Post.Total.Reach” can be explained by the model. The
coefficients in the model represent the change in “Lifetime.Post.Total.Reach” associated
with a one-unit increase in each predictor variable, holding all other variables constant; for
example, a one-unit increase in “Lifetime.Post.Total.Impressions” is associated with a
0.0113 increase in “Lifetime.Post.Total.Reach”.

The p-values associated with each variable indicate the level of significance of each
predictor in the model; variables with p-values below 0.05 are considered statistically
significant at the 5% level. In “mod3regression”, all variables are statistically
significant predictors of the variable “Lifetime.Post.Total.Reach”.

As for the variance inflation factors (VIFs), they were used to check for multicollinearity in
the model: VIF values greater than 10 suggest that multicollinearity may be a problem. In
“mod3regression”, all VIF values were below 10, which indicates that multicollinearity was
not a problem.

Finally, we concluded that the final model (“mod3regression”) indicates that the variables
“Lifetime.Post.Total.Impressions”, “Lifetime.Post.Consumers” and “Lifetime.Engaged.Users”
are significant predictors of “Lifetime.Post.Total.Reach”, and that this model explains a
substantial share (approximately 47%) of the variability in this outcome variable.

Autocorrelation results

The output shows that the Durbin-Watson test statistic (DW) is 1.7664, with a p-value of
0.00419. The alternative hypothesis is that the true autocorrelation is greater than 0. This
indicates that there is evidence of positive autocorrelation in the residuals, which violates one
of the assumptions of regression analysis.

Multicollinearity results

The output shows the VIF values for each predictor variable in the model. The variable
"Lifetime.Engaged.Users" has a very high VIF value of 686.85, indicating that it is highly
correlated with the other predictor variables in the model. Similarly, the variable
"Lifetime.Post.Consumers" also has a high VIF value of 540.15. These high VIF values
suggest that these variables may be redundant in the model, and their effects may be
explained by other variables. In contrast, the remaining predictor variables have relatively low
VIF values, indicating low collinearity with the rest of the predictors in the model.

The last line of the output shows the VIF value for the variable "like", which is also relatively
high at 48.14. This suggests that "like" may likewise be highly correlated with the other
predictor variables in the model.

Overall, this output shows the results of a multicollinearity test using the VIF method, which
reveals that some of the predictor variables in the model are highly correlated and may need
to be removed or combined with other variables to improve the model's performance.

Heteroscedasticity results

The output shows that the test statistic (BP) is 127.2, with 7 degrees of freedom, and a very
low p-value of <2.2e-16, indicating strong evidence of heteroskedasticity in the model. This
means that the variance of the errors is not constant across all levels of the predictor
variables, and this violates one of the assumptions of regression analysis.

Overall, this test provides valuable information about the presence of
heteroscedasticity in the regression model and suggests that further analysis is needed to
address this issue.

Outliers results
We obtained exactly the same outliers for both models: observations 416 and 447 according to
Cook's distance and 447 and 483 according to the studentized residuals. These outliers are
potentially influential in both models and may have a significant impact on the regression
estimates and/or model fit. Further investigation is required to determine whether these
outliers are valid data points or potential errors, as well as to assess their impact on the
model and the results.

Confidence and prediction intervals results


The code shows two types of intervals, confidence intervals and prediction intervals, for the
regression model "mod3regression", obtained using the predict function in R.

The first line of code generates 95% confidence intervals for the predicted response variable
using the "interval = 'confidence'" argument in the predict function. The resulting output
shows the fitted values (fit) along with the lower (lwr) and upper (upr) bounds of the
confidence intervals for the first six observations in the dataset.

The second line of code generates 95% prediction intervals for the predicted response
variable using the "interval = 'prediction'" argument in the predict function.
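A minimal sketch of the two calls described above, applied to the fitted model and its own data:

# 95% confidence intervals for the mean response (first six observations shown).
head(predict(mod3regression, interval = "confidence"))
# 95% prediction intervals for individual observations (first six shown).
head(predict(mod3regression, interval = "prediction"))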

Prediction intervals are wider than confidence intervals since, in addition to the uncertainty in
the estimated mean response, they also take into account the variability of the error term.
Confidence intervals only reflect the uncertainty in the estimated mean response.

Overall, this code provides valuable information about the uncertainty around the predicted
response variable and can be used to assess the precision of the model's predictions.
However, caution should be taken when interpreting the results of prediction intervals,
especially when using the same dataset to generate the predictions as the one used to fit the
model.

Conclusion
After analysing the results, we can conclude that there were three main models for predicting
"Lifetime.Post.Total.Reach" based on several predictor variables. The final model
("mod3regression") includes the variables "Lifetime.Post.Total.Impressions", "Lifetime.
Post.Consumers", and "Lifetime.Engaged.Users", and explains approximately 47% of the
variance in the outcome variable. The autocorrelation test revealed positive autocorrelation in
the residuals, violating one of the assumptions of regression analysis. The multicollinearity
test showed that some predictor variables are highly correlated and may need to be removed
or combined to improve the model's performance. The heteroscedasticity test indicated strong
evidence of heteroscedasticity in the model, suggesting that further analysis is needed to
address this issue. Furthermore, outliers were identified and their potential impact on the
model and results needs to be assessed. Finally, prediction intervals are wider than
confidence intervals because they also account for the variability of the error term.

Appendix

