Lecture 3
IN BUSINESS WITH R
Denis Marinšek

CORRELATION ANALYSIS
What is correlation?
Correlation is a method of measuring the degree of relationship (interdependence) between variables.
[Figure: scatter plots illustrating perfectly negative correlation, no correlation, and perfectly positive correlation.]

Note: Pearson's coefficient reports a correlation strength of 0.8 for all of the plots shown, although the relationship is genuinely linear with strength 0.8 only in the first one; for the others the single number is misleading. In particular, it is impossible to use Pearson's coefficient when there is an outlier.
CORRELATION ANALYSIS
Example: High school graduation exam
The national examination center wants to check the correlation between the results in four examination subjects. They sample 360 high school students and record their results in English, German, Math, and the first elective subject. Is there a correlation between the scores achieved in these subjects?
When choosing the correlation coefficient, we check:
- the distribution of the variables,
- the sample size,
- the measurement level (for ordinal variables, Spearman's rank correlation is used instead of Pearson's).

A sketch of both coefficients in R follows below.
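A minimal sketch of computing both coefficients in R, using a small invented data frame (the actual exam data for the 360 students is not included in the slides):

# Invented scores for two of the subjects (illustration only)
exams <- data.frame(
  English = c(72, 65, 80, 90, 55, 70, 85, 60),
  Math    = c(68, 60, 75, 88, 50, 72, 80, 58)
)

cor.test(exams$English, exams$Math, method = "pearson")   # interval data, no outliers
cor.test(exams$English, exams$Math, method = "spearman")  # ordinal data or outliers present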
CORRELATION ANALYSIS
In correlation analysis, one should not speak of causality, since there may be other
variables in the background that are responsible for the observed relationship.
OLS REGRESSION ANALYSIS
We use regression analysis to check whether a set of explanatory variables X₁, X₂, …, Xₖ (generally Xⱼ) can explain the differences in the values of the dependent variable Y.

Example: Can differences in people's height be explained by the height of the parents and the gender of the person?

Regression model:

E(Y | X) = β₀ + β₁X₁ + β₂X₂ + … + βₖXₖ

where β₀ is the regression constant, β₁, …, βₖ are the regression coefficients, and E(Y | X) denotes the expected value of Y given the explanatory variables.
REGRESSION ANALYSIS
When analyzing sample data, we can only estimate the regression function:

Ŷ = b₀ + b₁X₁ + b₂X₂ + … + bₖXₖ   (b denotes a sample estimate)

We try to generalize the results, which show the influence of each individual explanatory variable Xⱼ on the studied variable Y, to the population.
REGRESSION ANALYSIS
Ordinary Least Squares (OLS) method
We need to find parameter estimates b₀, b₁, b₂, …, bₖ for which the estimated regression function best fits the sample data. In other words, the deviations between the observed values of the dependent variable (Yᵢ) and the estimates obtained from the regression function (Ŷᵢ) must be as small as possible. These deviations, called residuals, are denoted by eᵢ.

The method of parameter estimation that follows this idea is called the ordinary least squares (OLS) method, which is based on minimizing the following expression:

Σᵢ eᵢ² = Σᵢ (Yᵢ − Ŷᵢ)² → min
REGRESSION ANALYSIS
Example: Earnings
For a sample of 53 employed people, you have obtained data on annual net income (Earnings), years of professional experience (Work), and completed faculty education (Faculty). Check whether the number of years of work experience and education can explain the differences in annual net income, expressed in EUR 1,000. A sketch of the model fit in R follows below.
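A minimal sketch of estimating this model in R, assuming the data sit in a file named Earnings.csv with the column names used above (the file name is an assumption, not given on the slides):

# Assumed file and column names: Earnings.csv with Earnings, Work, Faculty
earnings <- read.csv("Earnings.csv")

model <- lm(Earnings ~ Work + Faculty, data = earnings)
summary(model)  # coefficient estimates, t-tests, R-squared, overall F-test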
Interpreting the estimated linear model: the fitted value is an ESTIMATED value, and each coefficient tells us by how much Y increases (or decreases) ON AVERAGE when the corresponding explanatory variable rises by one unit, holding the other variables constant.

Coefficient of determination R²: it measures how well the regression fits the data. If all points lie on the regression line, then R² = 1; in general R² ranges from 0 to 1. It is the share of explained variability, i.e., the percentage of the variability of the dependent variable that is explained by the model.
In a simple (one-predictor) regression, the square root of R² equals Pearson's correlation coefficient, i.e., it tells how strong the correlation is: √0.5648 ≈ 0.7515. Be careful, though: taking the square root loses a possible negative sign.
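A short R check of this relationship, reusing the hypothetical earnings data from above with a one-predictor model (the 0.5648 on the slide comes from the course data, which is not reproduced here):

# sqrt(R^2) equals |Pearson r| only for a one-predictor model
simple <- lm(Earnings ~ Work, data = earnings)
sqrt(summary(simple)$r.squared)        # magnitude of the correlation
cor(earnings$Earnings, earnings$Work)  # keeps the sign that sqrt() loses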
REGRESSION ANALYSIS
Assumptions
General assumptions of the classical linear regression model:

I. Linearity in parameters.
The dependent variable is a linear function of the parameters β₁, β₂, β₃, …, βₖ.
REGRESSION ANALYSIS
Assumptions

III. Homoskedasticity: Var(ε | x) = σ²
We check the assumption with the help of a scatter plot between the standardized residuals and the (standardized) fitted values: if the standardized residuals are scattered evenly around zero across the fitted values, the errors are homoskedastic.

Residuals: eᵢ = Yᵢ − Ŷᵢ (residual = actual value − fitted value)
Standardized residuals: eᵢ_std = (eᵢ − ē) / sₑ = eᵢ / sₑ (since ē = 0); standardized residuals normally fall within ±3.

Violation: the standard errors are biased, and with them the test statistic t = (bⱼ − βⱼ) / se(bⱼ) and its p-value.
Solution: robust standard errors. The graphical check is sketched below.
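A minimal sketch of this diagnostic plot in R, assuming the fitted model object model from the earlier sketch:

# Homoskedasticity check: standardized residuals vs standardized fitted values
plot(scale(fitted(model)), rstandard(model),
     xlab = "Standardized fitted values",
     ylab = "Standardized residuals")
abline(h = 0, lty = 2)  # a shapeless band around 0 suggests homoskedasticity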
REGRESSION ANALYSIS
Assumptions
IV. Normal distribution of errors: εᵢ ~ N(0, σ²)

The assumption is particularly important for small samples and is usually checked with a histogram of the standardized residuals (graphical evaluation) or with the Shapiro-Wilk normality test. The assumption concerns the errors in the population, which we cannot observe directly. However, if the errors are normally distributed, the residuals in the sample will be approximately normally distributed as well (the residuals are the sample counterparts of the errors), so we check the distribution of the residuals instead.

Violation: if the residuals are, for example, right-skewed, the population errors are not normally distributed, so the t distribution of the test statistic cannot be assumed to be correct. We need the t-value to calculate the p-value, and a wrong p-value can lead to mistakes when testing whether a regression coefficient is statistically significant (H₀: βⱼ = 0).

Solution: a large enough sample (if n > 100, this violation can be ignored). A sketch of both checks follows below.
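A minimal sketch of the graphical and formal normality checks in R, again assuming the fitted object model:

# Normality of errors: inspect the standardized residuals
hist(rstandard(model), breaks = 15,
     main = "Histogram of standardized residuals")
shapiro.test(rstandard(model))  # H0: the residuals are normally distributed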
REGRESSION ANALYSIS
Assumptions
V. Errors are independent: Cov(εᵢ, εⱼ) = 0 for each i ≠ j

In economics, we sometimes analyze panel data, which means that we observe the same unit several times in a time sequence; there this assumption is easily violated.
REGRESSION ANALYSIS
Assumptions
Violation: standard errors are biased, so the t statistic is biased and H₀: βⱼ = 0 may be wrongly rejected.
Solution: econometric methods for time-series and panel data (a topic of econometrics).
REGRESSION ANALYSIS
Assumptions
VI. No perfect multicollinearity of the form: λ₁X₁ + λ₂X₂ + … + λₖXₖ = 0

VII. The number of units is greater than the number of estimated parameters: n > k
REGRESSION ANALYSIS
Assumptions
If the presented assumptions (I. to VII.) are satisfied, the regression function estimated by the least squares method is the:
1. best (smallest variance of the parameter estimates),
2. linear,
3. unbiased (E(bⱼ) = βⱼ)
estimator (Best Linear Unbiased Estimator, BLUE).
REGRESSION ANALYSIS
Assumptions
In addition to the general assumptions, some basic requirements for a linear regression model estimated by the OLS method must also be met:

2. Each explanatory variable must vary (nonzero variance), and it is desirable that the explanatory variables take values over as wide a range as possible.
REGRESSION ANALYSIS
Assumptions
3. Potential outliers and units that have a large impact on the estimated regression function are removed from the data. How are they determined? With Cook's distances.

The strength of multicollinearity is checked with the VIF statistic. How high may the VIF be? A sketch of both diagnostics follows below.
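A minimal sketch of both diagnostics in R; the 4/n cutoff for Cook's distance is a common rule of thumb, not something stated on the slides:

# Influential units: Cook's distances
cooks <- cooks.distance(model)
which(cooks > 4 / nrow(earnings))  # rule-of-thumb cutoff (assumption)

# Multicollinearity: variance inflation factors from the car package
library(car)
vif(model)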
REGRESSION ANALYSIS
Assumptions
How else can multicollinearity be detected?
- with the VIF statistic,
- when estimated coefficients have signs (+ or −) contrary to our expectations.

Violation: standard errors are inflated (too large), so p-values are biased (too high).
Solution: drop one of the correlated control variables.
REGRESSION ANALYSIS
Assumptions

Rules of thumb: VIF < 5 is the maximum acceptable value for each explanatory variable; VIF values around 1, roughly the average in well-specified models, indicate no multicollinearity.

VIFⱼ = 1 / (1 − Rⱼ²), where Rⱼ² is obtained by regressing Xⱼ on the remaining explanatory variables. A manual computation of the formula is sketched below.
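A small sketch of the formula in R, using the two predictors of the hypothetical earnings model (with only two predictors, "the remaining variables" is just the other one):

# VIF_j = 1 / (1 - R_j^2), R_j^2 from regressing X_j on the other predictors
r2_work <- summary(lm(Work ~ Faculty, data = earnings))$r.squared
1 / (1 - r2_work)  # should match car::vif(model)["Work"]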
REGRESSION ANALYSIS
Assumptions

Reading the plot of standardized residuals against fitted values:
- a curve in the plot means the linearity assumption is violated; Solution 1: add a quadratic term, i.e., estimate b₀ + b₁X + b₂X²;
- a funnel shape means the errors are heteroskedastic; Solution 2: transform the variable into logarithmic form;
- a shapeless, even cloud is what we want (no violation);
- visible patterns across ordered observations point to problems of dependency.

Both remedies are sketched below.
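A minimal sketch of the two remedies in R, reusing the hypothetical earnings variables (which remedy applies depends on what the residual plot shows):

# Solution 1: quadratic term for a curved (nonlinear) relationship
model_sq  <- lm(Earnings ~ Work + I(Work^2), data = earnings)

# Solution 2: log-transform the dependent variable to stabilize the variance
model_log <- lm(log(Earnings) ~ Work + Faculty, data = earnings)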
REGRESSION ANALYSIS
Model evaluation: Coefficient of determination R²

The quality of fit of the estimated model to the data is assessed by calculating the proportion of variability in the dependent variable that can be explained by the linear effect of the explanatory variables included in the regression model.
REGRESSION ANALYSIS
Model evaluation: F-statistic (ANOVA)

Test of significance of the regression (ρ², "rho squared", is the population counterpart of R²):

H₀: ρ² = 0, i.e., β₁ = β₂ = … = βₖ = 0
H₁: ρ² > 0, i.e., at least one βⱼ is different from 0

F = s²_R / s²_e,  where  s²_R = SSR / k  and  s²_e = RSS / (n − k − 1)

df₁ = k,  df₂ = n − k − 1

The computation in R is sketched below.
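A short R sketch: summary() already reports this F-statistic, and the p-value can be reproduced by hand from the stored value and degrees of freedom:

fstat <- summary(model)$fstatistic  # value, df1 = k, df2 = n - k - 1
fstat
pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)  # p-value of the F-test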
REGRESSION ANALYSIS
Significance of a regression coefficient

H₀: βⱼ = 0
H₁: βⱼ ≠ 0

t = (bⱼ − βⱼ) / se(bⱼ),  df = n − k − 1
REGRESSION ANALYSIS
Building regression model
Option 1:
- We include all variables in the regression model simultaneously.
- Since the outcome depends on all variables, it is important to have a good
theoretical basis for including each explanatory variable.
REGRESSION ANALYSIS
Fitting the regression model

1. Descriptive statistics of the studied variables.
2. Initial regression.
3. Diagnostics: Correct model specification? Linearity? Multicollinearity? Distribution of residuals? Homoskedasticity?
4. Correctly adjusting the regression model.
REGRESSION ANALYSIS
Example: Life expectancy

Before estimating the model, we check the correlation between all pairs of variables; a sketch in R follows below.
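A minimal sketch of this pairwise check, assuming the data sit in a hypothetical file LifeExpectancy.csv with numeric columns (the slides show only the output, not the file name):

# Assumed file name; the slides do not name the data file
life <- read.csv("LifeExpectancy.csv")

round(cor(life), 2)  # Pearson correlation for every pair of numeric columns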
Calculating standardized coefficients makes it possible to compare the explanatory variables with each other, i.e., to analyze which variable contributes more to R².
Dichotomous variables are coded 0/1: the first category is coded with 0, the second category with 1.
To compare which of two nested models is better:
H₀: Δρ² = 0, meaning both models are equally good.
H₁: Δρ² > 0, meaning the second (larger) model is better.
A sketch of the test in R follows below.
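A minimal sketch of this nested-model comparison in R, illustrated with the hypothetical earnings variables used earlier (the slide itself compares two life-expectancy models):

# H0: the added predictor does not increase rho^2
model_small <- lm(Earnings ~ Work, data = earnings)
model_full  <- lm(Earnings ~ Work + Faculty, data = earnings)
anova(model_small, model_full)  # F-test; a small p-value favors the larger model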
DUMMY VARIABLES
REGRESSION ANALYSIS
Dummy variables
In a regression analysis, we sometimes want to include categorical explanatory
variables that have more than two categories. In this case, groups of categories cannot
be separated by a single dichotomous variable with values of 0 and 1. Instead, several
dummy variables are needed.
Example:

D_M = 1 if the person is male; D_M = 0 if the person is not male.
REGRESSION ANALYSIS
Example: Cars
We are interested in how the consumption of a car (in l/100 km at 90 km/h) depends on the power of the engine and the type of drive: front-wheel drive (1), rear-wheel drive (2), 4x4 (3). Data: Cars.csv.
If a categorical variable has j categories, we need exactly (j − 1) dummy variables; with the three drive types, two dummies are enough. A sketch of the coding in R follows below.
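A minimal sketch in R: declaring Drive as a factor makes lm() create the (j − 1) dummies automatically. Only Cars.csv is named on the slides; the column names Consumption, Power, and Drive are assumptions:

cars_df <- read.csv("Cars.csv")
cars_df$Drive <- factor(cars_df$Drive, levels = c(1, 2, 3),
                        labels = c("Front", "Rear", "4x4"))

# R builds 2 dummies for the 3 drive types; "Front" is the baseline category
model_cars <- lm(Consumption ~ Power + Drive, data = cars_df)
summary(model_cars)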
ROBUST STANDARD ERRORS
REGRESSION ANALYSIS
Example: Advertising
For 200 successful startups, you obtained data on the annual sales of their products and the
monthly cost of advertising on Facebook. For each product, you also have information on whether
it belongs to the technology products group and through which distribution channels the
companies sell their products. Since you are a founding member of a new startup company
yourself, you want to estimate a linear regression function of the form:
Sales = f(Facebook, _Technological, _Channel)

b) Explain the results.
c) Estimate the sales of your startup company, which has developed a technological product and plans monthly Facebook advertising costs of EUR 2,500, with sales initially being made only through a domestic online store.
REGRESSION ANALYSIS
Example: Advertising

With categorical explanatory variables, an adjusted (generalized) VIF is reported; for the generalized VIF, the value should not be higher than 2.24, i.e., the square root of 5.
Even if the normality assumption were violated, the results would still be acceptable, because the sample size is large (n = 200 > 100).
We create an ID variable and order the observations by it, which is needed to check the independence of errors, and we check the homoskedasticity of the errors for the regression model.
Robust standard errors are used where the standardized residuals are heteroskedastic. A sketch of the computation in R follows below.
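A minimal sketch of heteroskedasticity-robust standard errors in R using the sandwich and lmtest packages; the file name and column names are assumptions based on the example description:

library(sandwich)
library(lmtest)

# Assumed: Advertising.csv with Sales, Facebook, Technological, Channel
advertising <- read.csv("Advertising.csv")
model_adv <- lm(Sales ~ Facebook + Technological + Channel,
                data = advertising)

# Re-test the coefficients with heteroskedasticity-consistent (HC1) errors
coeftest(model_adv, vcov = vcovHC(model_adv, type = "HC1"))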