
APPLIED DATA ANALYSIS

IN BUSINESS WITH R

L3: Correlation and regression analysis

Denis Marinšek
CORRELATION ANALYSIS
What is correlation?
Correlation is a method of measuring the degree of relationship (interdependence) of
variables.
[Scatter plots: perfectly negative correlation, no correlation, perfectly positive correlation. The closer the points are to the line, the stronger the correlation.]


CORRELATION ANALYSIS
Pearson correlation coefficient
Its values are between -1 and +1

The absolute values can be interpreted as follows:


- [0.0 – 0.1) = very weak correlation
- [0.1 – 0.3) = weak
- [0.3 – 0.7) = semi-strong
- [0.7 – 0.9) = strong
- [0.9 – 1.0) = very strong correlation

If we analyze ordinal variables (i.e., at least one variable is ordinal), the Spearman correlation coefficient should be used.
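A minimal sketch in R (x and y are hypothetical numeric vectors):

    cor(x, y)                            # Pearson correlation coefficient
    cor.test(x, y)                       # test of H0: rho = 0
    cor.test(x, y, method = "spearman")  # Spearman, if at least one variable is ordinal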

CORRELATION ANALYSIS
Anscombe (1973)

[Figure: Anscombe's quartet. Pearson reports a correlation of 0.8 on all four plots, although the correlation is really 0.8 only on the first plot; on the others the value is misleading. On the second plot the relationship is non-linear, so it is not possible to use Pearson. On the third and fourth plots a single outlier drives the result (without the outlier, r = 1 and r = 0, respectively): it is impossible to use Pearson when there is an outlier.]
CORRELATION ANALYSIS
Example: High school graduation exam
In the national examination center, they want to check the correlation between the
results in 4 examination subjects. They sample 360 high school students and check
their results in English, German, Math and the first elective subject. Is there a
correlation between the scores achieved in the mentioned subjects?

CORRELATION ANALYSIS
Example: High school graduation exam

[Figure: distributions of the exam scores]
CORRELATION ANALYSIS
Example: High school graduation exam

type = "spearman" — use if at least one variable is not numeric but ordinal.

The correlation coefficient describes the linear relationship between X and Y: its sign (positive/negative) and its absolute value (how strong).

The output also reports the sample size and a table of p-values to check whether there is a statistically significant correlation in each pair, e.g.:

H0: $\rho(\text{German}, \text{Math}) = 0$
H1: $\rho(\text{German}, \text{Math}) \neq 0$
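A hedged sketch of producing such a table in R (assumes a data frame exam with numeric columns English, German, Math and Elective; the names are assumptions; uses the Hmisc package):

    library(Hmisc)
    r <- rcorr(as.matrix(exam), type = "pearson")
    r$r   # correlation coefficient for each pair
    r$n   # sample size
    r$P   # p-value for H0: rho = 0 in each pair
    # use type = "spearman" if at least one variable is ordinal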
CORRELATION ANALYSIS
Example: High school graduation exam

[Output: correlation including an ordinal variable (Spearman)]
CORRELATION ANALYSIS

The third variable problem

In correlation analysis, one should not speak of causality, since there may be other
variables in the background that are responsible for the observed relationship.

OLS REGRESSION ANALYSIS
REGRESSION ANALYSIS

We use regression analysis to check whether we can use a set of explanatory variables $X_1, X_2, \dots, X_k$ (generally $X_j$) to explain the differences in the values of the dependent variable $Y$.

Example: Can differences in the height of people be explained by the height of the
parents and the gender of the person?

Regression model (with $\beta$ denoting population parameters):

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$

where $\beta_0$ is the regression constant, $\beta_1, \dots, \beta_k$ are the regression coefficients, and $\varepsilon$ is the error term.

The relationship can be described with the regression function (the expected value of $Y$ given $\mathbf{X}$):

$E(Y \mid \mathbf{X}) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$
REGRESSION ANALYSIS

When analyzing sample data, we can only estimate the regression function ($b$ denotes sample estimates):

$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k$

• $\hat{Y}$: estimated value of the dependent variable.
• $b_0$: estimate of the regression constant, i.e., the estimate of the mean of the dependent variable when the values of all explanatory variables are equal to 0.
• $b_1, b_2, \dots, b_k$: estimates of the partial regression coefficients, which indicate how much the dependent variable changes on average if the explanatory variable $X_j$ increases by one unit, assuming that the other explanatory variables remain unchanged.

We try to generalize the results, showing the influence of each individual explanatory variable $X_j$ on the studied variable $Y$, to the population.

REGRESSION ANALYSIS
Ordinary Least Squares (OLS) method
We need to find such parameter estimates $b_0, b_1, b_2, \dots, b_k$ for which the estimated regression function best fits the sample data. In other words, the deviations between the observed values of the dependent variable ($Y_i$) and the estimates obtained from the regression function ($\hat{Y}_i$) must be as small as possible. These deviations, called residuals, are denoted by $e_i$.

The method of parameter estimation that follows the idea described above is called the ordinary least squares (OLS) method, which is based on minimizing the sum of squared residuals:

$\min \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$

(The residuals are squared because positive and negative deviations would otherwise sum to zero.)
REGRESSION ANALYSIS
Example: Earnings
For a sample of 53 employed people, you have obtained data on annual net income
(Earnings), years of professional experience (Work) and completed faculty education
(Faculty). Check whether the number of years of work experience and education can
explain the differences in annual net income expressed in EUR 1,000.

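A minimal sketch of estimating this model by OLS in R (assumes the data sit in Earnings.csv with columns Earnings, Work and Faculty; the file name is an assumption):

    earnings <- read.csv("Earnings.csv")
    fit <- lm(Earnings ~ Work + Faculty, data = earnings)
    summary(fit)   # coefficients with t-tests, R^2, F-statistic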
REGRESSION ANALYSIS
Example: Earnings
Estimated linear model:

$\widehat{\text{Earnings}} = 24.80 + 1.2 \cdot \text{Work}$

Interpretation: the ESTIMATED value of Earnings increases/decreases ON AVERAGE by the coefficient for each additional unit of the explanatory variable; interpret $b_0$ only if it makes sense.

The column with p-values checks whether each $\beta$ is significant:

H0: $\beta_1 = 0$
H1: $\beta_1 \neq 0$

Here p < 0.001, so we reject H0: there really is a relationship. If we could not reject H0, we could not say for sure whether a relationship exists.

$t = \dfrac{b_1 - \beta_1}{se(b_1)}$, where $\beta_1$ always comes from H0.

Coefficient of determination: how well the regression fits the data; if all points lie on the regression line, then $R^2 = 1$. $R^2$ ranges from 0 to 1 and equals the share of explained variability, i.e., the percentage of the variability of the dependent variable explained by the model.
In simple regression, the square root of $R^2$ equals the Pearson coefficient, i.e., it shows how strong the correlation is: $\sqrt{0.5648} \approx 0.7516$. But be careful: the square root loses the sign (−) of the relationship.
REGRESSION ANALYSIS
Example: Earnings

Residuals: actual value minus fitted value, $e_i = Y_i - \hat{Y}_i$.

The sum of all residuals equals 0 (that is how OLS works when a constant is included).

If we plug the values of X and the estimates b into the regression function, we obtain this fitted value.
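These properties can be verified in R (continuing the hypothetical fit object from the earlier sketch):

    fitted(fit)        # fitted values Y-hat_i
    resid(fit)         # residuals e_i = Y_i - Y-hat_i
    sum(resid(fit))    # approximately 0, up to floating-point error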
REGRESSION ANALYSIS
Example: Earnings

For the dummy variable, check which category is coded with 1: its coefficient compares that category with the one coded 0.
REGRESSION ANALYSIS
Assumptions
General assumptions of the classical linear regression model:

I. Linearity in parameters.
The dependent variable is a linear function of the parameters $\beta_1, \beta_2, \beta_3, \dots, \beta_k$.

II. The expected value of the errors equals 0: $E(\varepsilon \mid \mathbf{x}) = 0$


Essentially, this assumption means that the total effect of all random factors not
included in the explanatory variables is equal to 0. With properly specified regression
models (inclusion of all relevant explanatory variables in correct form and inclusion of
regression constant), we can assume that this assumption is met.

Violation: the regression coefficient estimates $b_j$ are biased.

Solution: use stronger theoretical (scientific) background to include the right explanatory variables in the correct form.
REGRESSION ANALYSIS
Assumptions

(Recall: $e = Y - \hat{Y}$, i.e., a residual is the actual value minus the fitted value.)

III. Homoskedasticity: $Var(\varepsilon \mid \mathbf{x}) = \sigma^2$

We check the assumption with the help of a scatter plot between standardized residuals and (standardized) fitted values: a random, even spread of the standardized residuals across the standardized fitted values indicates homoskedasticity.

Estimated values of the dependent variable, called fitted values: $\hat{Y}_i = b_0 + b_1 X_{i1} + b_2 X_{i2} + \dots + b_k X_{ik}$

Standardized fitted values: $\hat{Y}_{i\,std} = \dfrac{\hat{Y}_i - \bar{\hat{Y}}}{s_{\hat{Y}}}$

Residuals: $e_i = Y_i - \hat{Y}_i$ (standardized residuals typically lie within ±3)

Standardized residuals: $e_{i\,std} = \dfrac{e_i - \bar{e}}{s_e} = \dfrac{e_i}{s_e}$ (since $\bar{e} = 0$)

To test this assumption formally, we can use the Breusch-Pagan heteroskedasticity test:
H0: homoskedasticity
H1: heteroskedasticity
In the example, p < 0.001, so we reject H0.
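A minimal sketch of the Breusch-Pagan test in R (lmtest package; fit is the hypothetical model object from above):

    library(lmtest)
    bptest(fit)   # small p-value -> reject H0 of homoskedasticity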
REGRESSION ANALYSIS
Assumptions

Violation: if homoskedasticity is violated (i.e., there is heteroskedasticity), the standard errors are biased. The t-statistic for testing

H0: $\beta_j = 0$
H1: $\beta_j \neq 0$,

$t = \dfrac{b_j - \beta_j}{se(b_j)}$,

is then biased as well, so we may reject H0 when we should not.

Solution: robust standard errors.
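A hedged sketch of heteroskedasticity-robust standard errors in R (sandwich and lmtest packages; the choice type = "HC1" is an assumption):

    library(sandwich)
    library(lmtest)
    coeftest(fit, vcov = vcovHC(fit, type = "HC1"))   # robust se, t-values, p-values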

REGRESSION ANALYSIS
Assumptions
IV. Normal distribution of errors: $\varepsilon_i \sim N(0, \sigma^2)$

The assumption is particularly important for small samples and is usually checked with a histogram of the standardized residuals (graphical evaluation) or with the help of the Shapiro-Wilk normality test (in R, check its p-value). The assumption concerns the errors in the population, which we cannot observe. But we know that if the errors are normally distributed, the residuals in the sample will be approximately normally distributed as well; since the residuals are the sample counterparts of the errors, we check the distribution of the residuals.

Violation: e.g., right-skewed residuals indicate that the population errors are not normally distributed either. Then we cannot assume the t distribution to be correct; we need the t-value to calculate the p-value, and if the p-value is wrong, we can make mistakes when testing whether a regression coefficient is statistically significant (H0: $\beta_j = 0$).

Solution: a large enough sample (if n > 100, we can ignore this violation).
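A minimal check in R (fit as above):

    hist(rstandard(fit))       # histogram of standardized residuals (graphical evaluation)
    shapiro.test(resid(fit))   # Shapiro-Wilk test; check the p-value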
REGRESSION ANALYSIS
Assumptions
V. Errors are independent: $Cov(\varepsilon_i, \varepsilon_j) = 0$ for each $i \neq j$
In economics, we sometimes analyze panel data, which means that we observe the
same unit several times in a time sequence.

REGRESSION ANALYSIS
Assumptions
Violation:
Standard errors are biased, so t-statistics are biased, and H0: $\beta_j = 0$ may be rejected wrongly.

Solution: econometric methods for panel data (covered in econometrics courses).
REGRESSION ANALYSIS
Assumptions
VI. No perfect multicollinearity, i.e., no exact linear relationship of the form $\lambda_1 X_1 + \lambda_2 X_2 + \dots + \lambda_k X_k = 0$.

VII. The number of units is greater than the number of estimated parameters: $n > k$.

As a rule of thumb, a minimum of 20 units of observation is needed for each explanatory variable.
REGRESSION ANALYSIS
Assumptions
If the assumptions presented (I. to VII.) are satisfied, the regression function estimated
by the least squares method is:
1. best (smallest variance of the parameter estimates),
2. linear,
3. unbiased ($E(b_j) = \beta_j$)

estimator (Best Linear Unbiased Estimator, BLUE).

This is defined in the Gauss-Markov theorem.
REGRESSION ANALYSIS
Assumptions
In addition to the general assumptions, some basic requirements for a linear
regression model estimated by the OLS method must also be met:

1. The dependent variable is numerical; the explanatory variables can be numerical or dichotomous (i.e., dummy variables).

2. Each explanatory variable must vary (nonzero variance), and it is desirable that the explanatory variables take as wide a range of values as possible.
REGRESSION ANALYSIS
Assumptions
3. Potential outliers and units that have a large impact on the estimated regression
function are removed from the data. How are they determined?

How to find outliers:
- calculate the standardized residuals;
- any unit whose standardized residual lies outside ±3 is an outlier (±3 is the typical cutoff for standardized residuals).

Units with a high impact on the estimated regression function are identified with Cook's distances.
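A minimal sketch in R (fit as above):

    rstandard(fit)         # standardized residuals; flag units with |value| > 3
    cooks.distance(fit)    # Cook's distances to find units with high impact
    plot(fit, which = 4)   # plot of Cook's distance by observation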
REGRESSION ANALYSIS
Assumptions
4. There is no overly strong multicollinearity.
The assumption is tested using the correlation matrix or graphically using scatter plots that examine the relationships between the explanatory variables, but the most commonly used tool is the VIF (Variance Inflation Factor) statistic.

How high can the VIF statistic be? It measures the strength of multicollinearity. For each explanatory variable $X_j$, regress it on all the other explanatory variables:

$\hat{X}_j = a_0 + a_1 X_1 + \dots + a_{j-1} X_{j-1} + a_{j+1} X_{j+1} + \dots + a_k X_k \;\Rightarrow\; R^2_{X_j}$

$VIF_j = \dfrac{1}{1 - R^2_{X_j}}$

Rule of thumb: VIF < 5 is the maximum acceptable for each explanatory variable; the average VIF across the model should be close to 1.
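A minimal sketch in R (car package; fit as above):

    library(car)
    vif(fit)   # rule of thumb: each VIF should stay below 5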
REGRESSION ANALYSIS
Assumptions
How else can multicollinearity be detected?
- VIF
- coefficient signs are contrary to our expectations (+ or −)

Violation:
standard errors are inflated (too large), so p-values are biased (too high).

Solution:
drop some of the collinear (control) variables.
REGRESSION ANALYSIS
Assumptions

If multicollinearity is too strong, everything is biased: the p-values are calculated incorrectly, and there is no point in interpreting anything, neither the p-values nor the coefficients, because they are also calculated incorrectly.
REGRESSION ANALYSIS
Assumptions

Here we essentially build one auxiliary regression model for each explanatory variable, to see how well the other explanatory variables explain it:

VIF = 1 / (1 − R²)

The significance of the model as a whole is tested with the F-test, checked with its p-value.
REGRESSION ANALYSIS
Assumptions

[Figure: four scatter plots of standardized residuals against standardized fitted values]

- A curve in the plot: linearity is violated. Solution 1: include a quadratic term, $b_0 + b_1 x + b_2 x^2$. Solution 2: transform the variable into logarithmic form.
- A funnel-shaped spread: it is heteroskedastic (homoskedasticity is violated), so standard errors are biased, even if linearity is fine.
- A random, even cloud: absolutely normal.
- A systematic pattern although linearity and homoskedasticity look OK: independence is violated (problems of dependency). Reasons: time series, or you missed some crucial variable. For example, if you analyze people's performance from three different cities, the cities have also somehow influenced them: the units are not entirely randomly selected, they are grouped, and this is an additional effect. Adding this grouping as a control variable is definitely a plus, but not necessarily the full solution.
REGRESSION ANALYSIS
Model evaluation: Coefficient of determination $R^2$

The quality of fit of the estimated model to the data is assessed by calculating the proportion of the variability of the dependent variable that can be explained by the linear effect of the explanatory variables included in the regression model.
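In the notation of the next slide (SSR = explained sum of squares, RSS = residual sum of squares), this proportion is:

$R^2 = \dfrac{SSR}{SSR + RSS} = 1 - \dfrac{RSS}{SSR + RSS}$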

REGRESSION ANALYSIS
Model evaluation: F-statistics (ANOVA)
Test of significance of the regression as a whole. ($\rho^2$, "rho squared", is the population counterpart of $R^2$.)

H0: $\rho^2 = 0$  or  $\beta_1 = \beta_2 = \dots = \beta_k = 0$
H1: $\rho^2 > 0$  or  at least one $\beta_j$ is different from 0

$F = \dfrac{s_R^2}{s_e^2}$, where $s_R^2 = \dfrac{SSR}{k}$ and $s_e^2 = \dfrac{RSS}{n - k - 1}$

$df_1 = k$, $df_2 = n - k - 1$
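In R, the F-statistic and its p-value appear in the last line of summary(fit); the statistic can also be extracted directly (fit as in the earlier sketches):

    summary(fit)$fstatistic   # F value, df1 = k, df2 = n - k - 1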
REGRESSION ANALYSIS
Significance of regression coefficient

H0: $\beta_j = 0$
H1: $\beta_j \neq 0$

$t = \dfrac{b_j - \beta_j}{se(b_j)}$

$df = n - k - 1$
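In R, summary(fit) reports $b_j$, $se(b_j)$, the t-value and the p-value for every coefficient; confidence intervals complement the tests:

    summary(fit)$coefficients   # Estimate, Std. Error, t value, Pr(>|t|)
    confint(fit)                # 95% confidence intervals for the beta_j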
REGRESSION ANALYSIS
Building regression model
Option 1:
- We include all variables in the regression model simultaneously.
- Since the outcome depends on all variables, it is important to have a good
theoretical basis for including each explanatory variable.

Option 2 (Block Method):


- First, known explanatory variables that have been used in previous studies are
included in the regression model.
- Then, new explanatory variables are gradually included and each model is
evaluated individually. In this way, we can assess the impact of each new variable.

REGRESSION ANALYSIS
Fitting the regression model (workflow):

1. Descriptive statistics of the studied variables.
2. Initial regression.
3. Diagnostics: Correct model specification? Linearity? Multicollinearity? Outliers, units with high impact? Distribution of residuals? Homoskedasticity?
4. Correctly adjusting the regression model.
5. We estimate the final regression model and generalize the findings.
REGRESSION ANALYSIS
Example: Life expectancy
As part of a study on life expectancy, an intern at the United Nations wanted to find out how
average alcohol consumption per capita, average years of schooling, the proportion of
the population that is overweight, and the proportion of the population infected with
the HIV virus affect life expectancy. For 2015, he collected data for 90 countries, which
he divided into two groups: developing countries and developed countries.

$\widehat{\text{Life}} = b_0 + b_1 \text{Alcohol} + b_2 \text{Schooling} + b_3 \text{Obesity} + b_4 \text{HIV} + b_5 \text{Development}$
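A hedged sketch of estimating this model in R (the data frame life, its file name, and the column names are assumptions based on the slide):

    life <- read.csv("LifeExpectancy.csv")
    fit_life <- lm(Life ~ Alcohol + Schooling + Obesity + HIV + Development, data = life)
    summary(fit_life)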
REGRESSION ANALYSIS
Example: Life expectancy
[Output: correlation matrix, used to check the correlation between all pairs of variables]
REGRESSION ANALYSIS
Example: Life expectancy

Robust standard errors are used if heteroskedasticity is present.
REGRESSION ANALYSIS
Example: Life expectancy
Calculating standardized coefficients, in order to compare the explanatory variables with each other, i.e., to analyze which variable contributes more to $R^2$.
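A hedged sketch of standardized coefficients using scale(), so the explanatory variables become directly comparable (the dummy Development is deliberately left unscaled; object names are hypothetical):

    fit_std <- lm(scale(Life) ~ scale(Alcohol) + scale(Schooling) +
                    scale(Obesity) + scale(HIV) + Development, data = life)
    coef(fit_std)   # standardized (beta) coefficients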
REGRESSION ANALYSIS
Example: Life expectancy
The first category is coded with 0, the second with 1. The dummy category coded with 0 is also called the base or reference category.

All other categories are always compared to the reference one; this also applies to a variable with several categories (represented by several dummies). The reported coefficient refers to the category coded with 1.
REGRESSION ANALYSIS
Example: Life expectancy — comparing which model is better

This is another ANOVA, one used to compare nested models. (The ANOVA test seen earlier in the course uses the function aov(); model comparison uses anova().)

H0: $\Delta\rho^2 = 0$, meaning both models are equally good.
H1: $\Delta\rho^2 > 0$, meaning the second model is better.

The residual sum of squares shows how much variability remains unexplained. (However, we do not need it for this course.)
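A minimal sketch of the comparison with anova() (m1 is nested in m2; names continue the hypothetical example):

    m1 <- lm(Life ~ Alcohol + Schooling + Obesity + HIV, data = life)
    m2 <- lm(Life ~ Alcohol + Schooling + Obesity + HIV + Development, data = life)
    anova(m1, m2)   # small p-value -> reject H0, the larger model m2 is better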
DUMMY VARIABLES
REGRESSION ANALYSIS
Dummy variables
In a regression analysis, we sometimes want to include categorical explanatory
variables that have more than two categories. In this case, groups of categories cannot
be separated by a single dichotomous variable with values of 0 and 1. Instead, several
dummy variables are needed.

A dummy variable (D) is a variable that can take the values 0 or 1.

$D_A = \begin{cases} 1 & \text{if the categorical variable has value } A \\ 0 & \text{in all other cases} \end{cases}$

Example:
$D_M = \begin{cases} 1 & \text{if the person is male} \\ 0 & \text{if the person is not male} \end{cases}$
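A minimal sketch of creating such a dummy in R (the data frame d and column gender are hypothetical):

    d$Male <- ifelse(d$gender == "male", 1, 0)   # 1 if the person is male, 0 otherwise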
REGRESSION ANALYSIS
Example: Cars
We are interested in how the consumption of a car (in l/100 km at 90 km/h) depends on the power of the engine and the type of drive: front-wheel drive (1), rear-wheel drive (2), 4x4 (3). Data: Cars.csv.

REGRESSION ANALYSIS
Example: Cars

This output is completely wrong, because Drive was included as a numeric variable.
REGRESSION ANALYSIS
Example: Cars
If a categorical variable has $j$ categories, we need exactly $(j - 1)$ dummy variables.

[Table: coding of the three drive types with dummy variables Dummy_1, Dummy_2, Dummy_3]
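A hedged sketch: declaring Drive as a factor makes R create the $(j - 1)$ dummies automatically (the consumption and power column names are assumptions):

    cars <- read.csv("Cars.csv")
    cars$Drive <- factor(cars$Drive, levels = c(1, 2, 3),
                         labels = c("front", "rear", "4x4"))
    fit_cars <- lm(Consumption ~ Power + Drive, data = cars)
    summary(fit_cars)   # "front" (the first level) is the reference category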
ROBUST STANDARD ERRORS
REGRESSION ANALYSIS
Example: Advertising
For 200 successful startups, you obtained data on the annual sales of their products and the
monthly cost of advertising on Facebook. For each product, you also have information on whether
it belongs to the technology products group and through which distribution channels the
companies sell their products. Since you are a founding member of a new startup company
yourself, you want to estimate a linear regression function of the form:
Sales = f(Facebook, _Technological, _Channel)

a) Evaluate the model to see if all assumptions are met.

b) Explain results.

c) Estimate the sales of your start-up company that has developed a technological product and
plans monthly Facebook advertising costs of EUR 2,500, with sales initially being made only
through domestic online store.

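A hedged sketch of part (c) using predict() (the data frame adv, the column names, and the factor level for the domestic online store are all assumptions):

    fit_adv <- lm(Sales ~ Facebook + Technological + Channel, data = adv)
    new_startup <- data.frame(Facebook = 2500, Technological = 1,
                              Channel = "domestic online store")  # must match a factor level
    predict(fit_adv, newdata = new_startup)   # estimated annual sales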
REGRESSION ANALYSIS
Example: Advertising

VIF adjusted for categorical variables: for the generalized VIF, the value cannot be higher than 2.24 (= the square root of 5).
REGRESSION ANALYSIS
Example: Advertising

Still normal, despite the bimodality. Even if there were a violation, it would still be OK, because the sample size is big (> 100).
REGRESSION ANALYSIS
Example: Advertising

There should not be any gaps in the plot; otherwise, drop the units causing them.
REGRESSION ANALYSIS
Example: Advertising
Steps: create a variable "ID"; sort the data; sort in reverse order; drop this unit, because there was a gap in the graph before.
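A hedged base-R sketch of these steps (the data frame adv is hypothetical; the dropped ID below is a placeholder, not the real unit):

    adv$ID <- 1:nrow(adv)                        # create the variable "ID"
    adv[order(adv$Sales), ]                      # sort
    adv[order(adv$Sales, decreasing = TRUE), ]   # sort in reverse order
    adv <- adv[adv$ID != 123, ]                  # drop the unit that caused the gap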
REGRESSION ANALYSIS
Example: Advertising
Checking the homoskedasticity of the errors for the regression model.
REGRESSION ANALYSIS
Example: Advertising — robust standard errors
Robust standard errors are used where the standardized residuals are heteroskedastic. Here the residuals are heteroskedastic, so the output shown gives the estimates for the model we need.
