
APPLIED DATA ANALYSIS

IN BUSINESS WITH R

L3: Correlation and regression analysis

Denis Marinšek
CORRELATION ANALYSIS
What is correlation?
Correlation is a method of measuring the degree of relationship (interdependence) of
variables.
[Scatter plots: perfectly negative correlation, no correlation, perfectly positive correlation. The closer the points are to the line, the stronger the correlation.]


CORRELATION ANALYSIS
Pearson correlation coefficient
Its values are between -1 and +1

The absolute values can be interpreted as follows:


- [0.0 – 0.1) = very weak correlation
- [0.1 – 0.3) = weak
- [0.3 – 0.7) = semi-strong
- [0.7 – 0.9) = strong
- [0.9 – 1.0) = very strong correlation

If we analyze ordinal variables (i.e., at least one variable is ordinal), the Spearman correlation coefficient should be used.
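A minimal sketch in R (x and y are hypothetical numeric vectors):

    cor(x, y)                            # Pearson correlation coefficient
    cor.test(x, y)                       # test of H0: rho = 0
    cor.test(x, y, method = "spearman")  # Spearman, if at least one variable is ordinal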

CORRELATION ANALYSIS
Anscombe (1973)

[Figure: Anscombe's quartet. Pearson reports a correlation of 0.8 on all four plots, although the correlation is really 0.8 only on the first plot; on the others the value is misleading. On the second plot the relationship is non-linear, so it is not possible to use Pearson. On the third and fourth plots a single outlier drives the result (without the outlier, r = 1 and r = 0, respectively): it is impossible to use Pearson when there is an outlier.]
CORRELATION ANALYSIS
Example: High school graduation exam
In the national examination center, they want to check the correlation between the
results in 4 examination subjects. They sample 360 high school students and check
their results in English, German, Math and the first elective subject. Is there a
correlation between the scores achieved in the mentioned subjects?

CORRELATION ANALYSIS
Example: High school graduation exam

[Figure: distributions of the exam scores]
CORRELATION ANALYSIS
Example: High school graduation exam

type = "spearman" — use if at least one variable is not numeric but ordinal.

The correlation coefficient describes the linear relationship between X and Y: its sign (positive/negative) and its absolute value (how strong).

The output also reports the sample size and a table of p-values to check whether there is a statistically significant correlation in each pair, e.g.:

H0: $\rho(\text{German}, \text{Math}) = 0$
H1: $\rho(\text{German}, \text{Math}) \neq 0$
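A hedged sketch of producing such a table in R (assumes a data frame exam with numeric columns English, German, Math and Elective; the names are assumptions; uses the Hmisc package):

    library(Hmisc)
    r <- rcorr(as.matrix(exam), type = "pearson")
    r$r   # correlation coefficient for each pair
    r$n   # sample size
    r$P   # p-value for H0: rho = 0 in each pair
    # use type = "spearman" if at least one variable is ordinal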
CORRELATION ANALYSIS
Example: High school graduation exam

[Output: correlation including an ordinal variable (Spearman)]
CORRELATION ANALYSIS

The third variable problem

In correlation analysis, one should not speak of causality, since there may be other
variables in the background that are responsible for the observed relationship.

OLS REGRESSION ANALYSIS
REGRESSION ANALYSIS

We use regression analysis to check whether we can use a set of explanatory variables $X_1, X_2, \dots, X_k$ (generally $X_j$) to explain the differences in the values of the dependent variable $Y$.

Example: Can differences in the height of people be explained by the height of the
parents and the gender of the person?

Regression model (with $\beta$ denoting population parameters):

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$

where $\beta_0$ is the regression constant, $\beta_1, \dots, \beta_k$ are the regression coefficients, and $\varepsilon$ is the error term.

The relationship can be described with the regression function (the expected value of $Y$ given $\mathbf{X}$):

$E(Y \mid \mathbf{X}) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$
REGRESSION ANALYSIS

When analyzing sample data, we can only estimate the regression function ($b$ denotes sample estimates):

$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k$

• $\hat{Y}$: estimated value of the dependent variable.
• $b_0$: estimate of the regression constant, i.e., the estimate of the mean of the dependent variable when the values of all explanatory variables are equal to 0.
• $b_1, b_2, \dots, b_k$: estimates of the partial regression coefficients, which indicate how much the dependent variable changes on average if the explanatory variable $X_j$ increases by one unit, assuming that the other explanatory variables remain unchanged.

We try to generalize the results, showing the influence of each individual explanatory variable $X_j$ on the studied variable $Y$, to the population.

REGRESSION ANALYSIS
Ordinary Least Squares (OLS) method
We need to find such parameter estimates $b_0, b_1, b_2, \dots, b_k$ for which the estimated regression function best fits the sample data. In other words, the deviations between the observed values of the dependent variable ($Y_i$) and the estimates obtained from the regression function ($\hat{Y}_i$) must be as small as possible. These deviations, called residuals, are denoted by $e_i$.

The method of parameter estimation that follows the idea described above is called the ordinary least squares (OLS) method, which is based on minimizing the sum of squared residuals:

$\min \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$

(The residuals are squared because positive and negative deviations would otherwise sum to zero.)
REGRESSION ANALYSIS
Example: Earnings
For a sample of 53 employed people, you have obtained data on annual net income
(Earnings), years of professional experience (Work) and completed faculty education
(Faculty). Check whether the number of years of work experience and education can
explain the differences in annual net income expressed in EUR 1,000.

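A minimal sketch of estimating this model by OLS in R (assumes the data sit in Earnings.csv with columns Earnings, Work and Faculty; the file name is an assumption):

    earnings <- read.csv("Earnings.csv")
    fit <- lm(Earnings ~ Work + Faculty, data = earnings)
    summary(fit)   # coefficients with t-tests, R^2, F-statistic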
REGRESSION ANALYSIS
Example: Earnings
Estimated linear model:

$\widehat{\text{Earnings}} = 24.80 + 1.2 \cdot \text{Work}$

Interpretation: the ESTIMATED value of Earnings increases/decreases ON AVERAGE by the coefficient for each additional unit of the explanatory variable; interpret $b_0$ only if it makes sense.

The column with p-values checks whether each $\beta$ is significant:

H0: $\beta_1 = 0$
H1: $\beta_1 \neq 0$

Here p < 0.001, so we reject H0: there really is a relationship. If we could not reject H0, we could not say for sure whether a relationship exists.

$t = \dfrac{b_1 - \beta_1}{se(b_1)}$, where $\beta_1$ always comes from H0.

Coefficient of determination: how well the regression fits the data; if all points lie on the regression line, then $R^2 = 1$. $R^2$ ranges from 0 to 1 and equals the share of explained variability, i.e., the percentage of the variability of the dependent variable explained by the model.
In simple regression, the square root of $R^2$ equals the Pearson coefficient, i.e., it shows how strong the correlation is: $\sqrt{0.5648} \approx 0.7516$. But be careful: the square root loses the sign (−) of the relationship.
REGRESSION ANALYSIS
Example: Earnings

Residuals: actual value minus fitted value, $e_i = Y_i - \hat{Y}_i$.

The sum of all residuals equals 0 (that is how OLS works when a constant is included).

If we plug the values of X and the estimates b into the regression function, we obtain this fitted value.
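These properties can be verified in R (continuing the hypothetical fit object from the earlier sketch):

    fitted(fit)        # fitted values Y-hat_i
    resid(fit)         # residuals e_i = Y_i - Y-hat_i
    sum(resid(fit))    # approximately 0, up to floating-point error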
REGRESSION ANALYSIS
Example: Earnings

For the dummy variable, check which category is coded with 1: its coefficient compares that category with the one coded 0.
REGRESSION ANALYSIS
Assumptions
General assumptions of the classical linear regression model:

I. Linearity in parameters.
The dependent variable is a linear function of the parameters $\beta_1, \beta_2, \beta_3, \dots, \beta_k$.

II. The expected value of the errors equals 0: $E(\varepsilon \mid \mathbf{x}) = 0$


Essentially, this assumption means that the total effect of all random factors not
included in the explanatory variables is equal to 0. With properly specified regression
models (inclusion of all relevant explanatory variables in correct form and inclusion of
regression constant), we can assume that this assumption is met.

Violation: the regression coefficient estimates $b_j$ are biased.

Solution: use stronger theoretical (scientific) background to include the right explanatory variables in the correct form.
REGRESSION ANALYSIS
Assumptions

(Recall: $e = Y - \hat{Y}$, i.e., a residual is the actual value minus the fitted value.)

III. Homoskedasticity: $Var(\varepsilon \mid \mathbf{x}) = \sigma^2$

We check the assumption with the help of a scatter plot between standardized residuals and (standardized) fitted values: a random, even spread of the standardized residuals across the standardized fitted values indicates homoskedasticity.

Estimated values of the dependent variable, called fitted values: $\hat{Y}_i = b_0 + b_1 X_{i1} + b_2 X_{i2} + \dots + b_k X_{ik}$

Standardized fitted values: $\hat{Y}_{i\,std} = \dfrac{\hat{Y}_i - \bar{\hat{Y}}}{s_{\hat{Y}}}$

Residuals: $e_i = Y_i - \hat{Y}_i$ (standardized residuals typically lie within ±3)

Standardized residuals: $e_{i\,std} = \dfrac{e_i - \bar{e}}{s_e} = \dfrac{e_i}{s_e}$ (since $\bar{e} = 0$)

To test this assumption formally, we can use the Breusch-Pagan heteroskedasticity test:
H0: homoskedasticity
H1: heteroskedasticity
In the example, p < 0.001, so we reject H0.
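A minimal sketch of the Breusch-Pagan test in R (lmtest package; fit is the hypothetical model object from above):

    library(lmtest)
    bptest(fit)   # small p-value -> reject H0 of homoskedasticity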
REGRESSION ANALYSIS
Assumptions

Violation: if homoskedasticity is violated (i.e., there is heteroskedasticity), the standard errors are biased. The t-statistic for testing

H0: $\beta_j = 0$
H1: $\beta_j \neq 0$,

$t = \dfrac{b_j - \beta_j}{se(b_j)}$,

is then biased as well, so we may reject H0 when we should not.

Solution: robust standard errors.
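A hedged sketch of heteroskedasticity-robust standard errors in R (sandwich and lmtest packages; the choice type = "HC1" is an assumption):

    library(sandwich)
    library(lmtest)
    coeftest(fit, vcov = vcovHC(fit, type = "HC1"))   # robust se, t-values, p-values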

REGRESSION ANALYSIS
Assumptions
IV. Normal distribution of errors: $\varepsilon_i \sim N(0, \sigma^2)$

The assumption is particularly important for small samples and is usually checked with a histogram of the standardized residuals (graphical evaluation) or with the help of the Shapiro-Wilk normality test (in R, check its p-value). The assumption concerns the errors in the population, which we cannot observe. But we know that if the errors are normally distributed, the residuals in the sample will be approximately normally distributed as well; since the residuals are the sample counterparts of the errors, we check the distribution of the residuals.

Violation: e.g., right-skewed residuals indicate that the population errors are not normally distributed either. Then we cannot assume the t distribution to be correct; we need the t-value to calculate the p-value, and if the p-value is wrong, we can make mistakes when testing whether a regression coefficient is statistically significant (H0: $\beta_j = 0$).

Solution: a large enough sample (if n > 100, we can ignore this violation).
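A minimal check in R (fit as above):

    hist(rstandard(fit))       # histogram of standardized residuals (graphical evaluation)
    shapiro.test(resid(fit))   # Shapiro-Wilk test; check the p-value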
REGRESSION ANALYSIS
Assumptions
V. Errors are independent: $Cov(\varepsilon_i, \varepsilon_j) = 0$ for each $i \neq j$
In economics, we sometimes analyze panel data, which means that we observe the
same unit several times in a time sequence.

REGRESSION ANALYSIS
Assumptions
Violation:
Standard errors are biased, so t-statistics are biased, and H0: $\beta_j = 0$ may be rejected wrongly.

Solution: econometric methods for panel data (covered in econometrics courses).
REGRESSION ANALYSIS
Assumptions
VI. No perfect multicollinearity, i.e., no exact linear relationship of the form $\lambda_1 X_1 + \lambda_2 X_2 + \dots + \lambda_k X_k = 0$.

VII. The number of units is greater than the number of estimated parameters: $n > k$.

As a rule of thumb, a minimum of 20 units of observation is needed for each explanatory variable.
REGRESSION ANALYSIS
Assumptions
If the assumptions presented (I. to VII.) are satisfied, the regression function estimated
by the least squares method is:
1. best (smallest variance of the parameter estimates),
2. linear,
3. unbiased ($E(b_j) = \beta_j$)

estimator (Best Linear Unbiased Estimator, BLUE).

This is defined in the Gauss-Markov theorem.
REGRESSION ANALYSIS
Assumptions
In addition to the general assumptions, some basic requirements for a linear
regression model estimated by the OLS method must also be met:

1. The dependent variable is numerical; the explanatory variables can be numerical or dichotomous (i.e., dummy variables).

2. Each explanatory variable must vary (nonzero variance), and it is desirable that the explanatory variables take as wide a range of values as possible.
REGRESSION ANALYSIS
Assumptions
3. Potential outliers and units that have a large impact on the estimated regression
function are removed from the data. How are they determined?

How to find outliers:
- calculate the standardized residuals;
- any unit whose standardized residual lies outside ±3 is an outlier (±3 is the typical cutoff for standardized residuals).

Units with a high impact on the estimated regression function are identified with Cook's distances.
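A minimal sketch in R (fit as above):

    rstandard(fit)         # standardized residuals; flag units with |value| > 3
    cooks.distance(fit)    # Cook's distances to find units with high impact
    plot(fit, which = 4)   # plot of Cook's distance by observation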
REGRESSION ANALYSIS
Assumptions
4. There is no overly strong multicollinearity.
The assumption is tested using the correlation matrix or graphically using scatter plots that examine the relationships between the explanatory variables, but the most commonly used tool is the VIF (Variance Inflation Factor) statistic.

How high can the VIF statistic be? It measures the strength of multicollinearity. For each explanatory variable $X_j$, regress it on all the other explanatory variables:

$\hat{X}_j = a_0 + a_1 X_1 + \dots + a_{j-1} X_{j-1} + a_{j+1} X_{j+1} + \dots + a_k X_k \;\Rightarrow\; R^2_{X_j}$

$VIF_j = \dfrac{1}{1 - R^2_{X_j}}$

Rule of thumb: VIF < 5 is the maximum acceptable for each explanatory variable; the average VIF across the model should be close to 1.
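A minimal sketch in R (car package; fit as above):

    library(car)
    vif(fit)   # rule of thumb: each VIF should stay below 5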
REGRESSION ANALYSIS
Assumptions
How else can multicollinearity be detected?
- VIF
- coefficient signs are contrary to our expectations (+ or −)

Violation:
standard errors are inflated (too large), so p-values are biased (too high).

Solution:
drop some of the collinear (control) variables.
REGRESSION ANALYSIS
Assumptions

If multicollinearity is too strong, everything is biased: the p-values are calculated incorrectly, and there is no point in interpreting anything, neither the p-values nor the coefficients, because they are also calculated incorrectly.
REGRESSION ANALYSIS
Assumptions

Here we essentially build one auxiliary regression model for each explanatory variable, to see how well the other explanatory variables explain it:

VIF = 1 / (1 − R²)

The significance of the model as a whole is tested with the F-test, checked with its p-value.
REGRESSION ANALYSIS
Assumptions

[Figure: four scatter plots of standardized residuals against standardized fitted values]

- A curve in the plot: linearity is violated. Solution 1: include a quadratic term, $b_0 + b_1 x + b_2 x^2$. Solution 2: transform the variable into logarithmic form.
- A funnel-shaped spread: it is heteroskedastic (homoskedasticity is violated), so standard errors are biased, even if linearity is fine.
- A random, even cloud: absolutely normal.
- A systematic pattern although linearity and homoskedasticity look OK: independence is violated (problems of dependency). Reasons: time series, or you missed some crucial variable. For example, if you analyze people's performance from three different cities, the cities have also somehow influenced them: the units are not entirely randomly selected, they are grouped, and this is an additional effect. Adding this grouping as a control variable is definitely a plus, but not necessarily the full solution.
REGRESSION ANALYSIS
Model evaluation: Coefficient of determination $R^2$

The quality of fit of the estimated model to the data is assessed by calculating the proportion of the variability of the dependent variable that can be explained by the linear effect of the explanatory variables included in the regression model.
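In the notation of the next slide (SSR = explained sum of squares, RSS = residual sum of squares), this proportion is:

$R^2 = \dfrac{SSR}{SSR + RSS} = 1 - \dfrac{RSS}{SSR + RSS}$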

REGRESSION ANALYSIS
Model evaluation: F-statistics (ANOVA)
Test of significance of the regression as a whole. ($\rho^2$, "rho squared", is the population counterpart of $R^2$.)

H0: $\rho^2 = 0$  or  $\beta_1 = \beta_2 = \dots = \beta_k = 0$
H1: $\rho^2 > 0$  or  at least one $\beta_j$ is different from 0

$F = \dfrac{s_R^2}{s_e^2}$, where $s_R^2 = \dfrac{SSR}{k}$ and $s_e^2 = \dfrac{RSS}{n - k - 1}$

$df_1 = k$, $df_2 = n - k - 1$
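In R, the F-statistic and its p-value appear in the last line of summary(fit); the statistic can also be extracted directly (fit as in the earlier sketches):

    summary(fit)$fstatistic   # F value, df1 = k, df2 = n - k - 1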
REGRESSION ANALYSIS
Significance of regression coefficient

H0: $\beta_j = 0$
H1: $\beta_j \neq 0$

$t = \dfrac{b_j - \beta_j}{se(b_j)}$

$df = n - k - 1$
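In R, summary(fit) reports $b_j$, $se(b_j)$, the t-value and the p-value for every coefficient; confidence intervals complement the tests:

    summary(fit)$coefficients   # Estimate, Std. Error, t value, Pr(>|t|)
    confint(fit)                # 95% confidence intervals for the beta_j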
REGRESSION ANALYSIS
Building regression model
Option 1:
- We include all variables in the regression model simultaneously.
- Since the outcome depends on all variables, it is important to have a good
theoretical basis for including each explanatory variable.

Option 2 (Block Method):


- First, known explanatory variables that have been used in previous studies are
included in the regression model.
- Then, new explanatory variables are gradually included and each model is
evaluated individually. In this way, we can assess the impact of each new variable.

REGRESSION ANALYSIS
Fitting the regression model (workflow):

1. Descriptive statistics of the studied variables.
2. Initial regression.
3. Diagnostics: Correct model specification? Linearity? Multicollinearity? Outliers, units with high impact? Distribution of residuals? Homoskedasticity?
4. Correctly adjusting the regression model.
5. We estimate the final regression model and generalize the findings.
REGRESSION ANALYSIS
Example: Life expectancy
As part of a study on life expectancy, an intern at the United Nations wanted to find out how
average alcohol consumption per capita, average years of schooling, the proportion of
the population that is overweight, and the proportion of the population infected with
the HIV virus affect life expectancy. For 2015, he collected data for 90 countries, which
he divided into two groups: developing countries and developed countries.

$\widehat{\text{Life}} = b_0 + b_1 \text{Alcohol} + b_2 \text{Schooling} + b_3 \text{Obesity} + b_4 \text{HIV} + b_5 \text{Development}$
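A hedged sketch of estimating this model in R (the data frame life, its file name, and the column names are assumptions based on the slide):

    life <- read.csv("LifeExpectancy.csv")
    fit_life <- lm(Life ~ Alcohol + Schooling + Obesity + HIV + Development, data = life)
    summary(fit_life)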
REGRESSION ANALYSIS
Example: Life expectancy
[Output: correlation matrix, used to check the correlation between all pairs of variables]
REGRESSION ANALYSIS
Example: Life expectancy

Robust standard errors are used if heteroskedasticity is present.
REGRESSION ANALYSIS
Example: Life expectancy
Calculating standardized coefficients, in order to compare the explanatory variables with each other, i.e., to analyze which variable contributes more to $R^2$.
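A hedged sketch of standardized coefficients using scale(), so the explanatory variables become directly comparable (the dummy Development is deliberately left unscaled; object names are hypothetical):

    fit_std <- lm(scale(Life) ~ scale(Alcohol) + scale(Schooling) +
                    scale(Obesity) + scale(HIV) + Development, data = life)
    coef(fit_std)   # standardized (beta) coefficients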
REGRESSION ANALYSIS
Example: Life expectancy
The first category is coded with 0, the second with 1. The dummy category coded with 0 is also called the base or reference category.

All other categories are always compared to the reference one; this also applies to a variable with several categories (represented by several dummies). The reported coefficient refers to the category coded with 1.
REGRESSION ANALYSIS
Example: Life expectancy — comparing which model is better

This is another ANOVA, one used to compare nested models. (The ANOVA test seen earlier in the course uses the function aov(); model comparison uses anova().)

H0: $\Delta\rho^2 = 0$, meaning both models are equally good.
H1: $\Delta\rho^2 > 0$, meaning the second model is better.

The residual sum of squares shows how much variability remains unexplained. (However, we do not need it for this course.)
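A minimal sketch of the comparison with anova() (m1 is nested in m2; names continue the hypothetical example):

    m1 <- lm(Life ~ Alcohol + Schooling + Obesity + HIV, data = life)
    m2 <- lm(Life ~ Alcohol + Schooling + Obesity + HIV + Development, data = life)
    anova(m1, m2)   # small p-value -> reject H0, the larger model m2 is better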
DUMMY VARIABLES
REGRESSION ANALYSIS
Dummy variables
In a regression analysis, we sometimes want to include categorical explanatory
variables that have more than two categories. In this case, groups of categories cannot
be separated by a single dichotomous variable with values of 0 and 1. Instead, several
dummy variables are needed.

A dummy variable (D) is a variable that can take the values 0 or 1.

$D_A = \begin{cases} 1 & \text{if the categorical variable has value } A \\ 0 & \text{in all other cases} \end{cases}$

Example:
$D_M = \begin{cases} 1 & \text{if the person is male} \\ 0 & \text{if the person is not male} \end{cases}$
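A minimal sketch of creating such a dummy in R (the data frame d and column gender are hypothetical):

    d$Male <- ifelse(d$gender == "male", 1, 0)   # 1 if the person is male, 0 otherwise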
REGRESSION ANALYSIS
Example: Cars
We are interested in how the consumption of a car (in l/100 km at 90 km/h) depends on the power of the engine and the type of drive: front-wheel drive (1), rear-wheel drive (2), 4x4 (3). Data: Cars.csv.

REGRESSION ANALYSIS
Example: Cars

This output is completely wrong, because Drive was included as a numeric variable.
REGRESSION ANALYSIS
Example: Cars
If a categorical variable has $j$ categories, we need exactly $(j - 1)$ dummy variables.

[Table: coding of the three drive types with dummy variables Dummy_1, Dummy_2, Dummy_3]
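A hedged sketch: declaring Drive as a factor makes R create the $(j - 1)$ dummies automatically (the consumption and power column names are assumptions):

    cars <- read.csv("Cars.csv")
    cars$Drive <- factor(cars$Drive, levels = c(1, 2, 3),
                         labels = c("front", "rear", "4x4"))
    fit_cars <- lm(Consumption ~ Power + Drive, data = cars)
    summary(fit_cars)   # "front" (the first level) is the reference category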
ROBUST STANDARD ERRORS
REGRESSION ANALYSIS
Example: Advertising
For 200 successful startups, you obtained data on the annual sales of their products and the
monthly cost of advertising on Facebook. For each product, you also have information on whether
it belongs to the technology products group and through which distribution channels the
companies sell their products. Since you are a founding member of a new startup company
yourself, you want to estimate a linear regression function of the form:
Sales = f(Facebook, _Technological, _Channel)

a) Evaluate the model to see if all assumptions are met.

b) Explain results.

c) Estimate the sales of your start-up company that has developed a technological product and
plans monthly Facebook advertising costs of EUR 2,500, with sales initially being made only
through domestic online store.

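A hedged sketch of part (c) using predict() (the data frame adv, the column names, and the factor level for the domestic online store are all assumptions):

    fit_adv <- lm(Sales ~ Facebook + Technological + Channel, data = adv)
    new_startup <- data.frame(Facebook = 2500, Technological = 1,
                              Channel = "domestic online store")  # must match a factor level
    predict(fit_adv, newdata = new_startup)   # estimated annual sales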
REGRESSION ANALYSIS
Example: Advertising

VIF adjusted for categorical variables: for the generalized VIF, the value cannot be higher than 2.24 (= the square root of 5).
REGRESSION ANALYSIS
Example: Advertising

Still normal, despite the bimodality. Even if there were a violation, it would still be OK, because the sample size is big (> 100).
REGRESSION ANALYSIS
Example: Advertising

There should not be any gaps in the plot; otherwise, drop the units causing them.
REGRESSION ANALYSIS
Example: Advertising
Steps: create a variable "ID"; sort the data; sort in reverse order; drop this unit, because there was a gap in the graph before.
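A hedged base-R sketch of these steps (the data frame adv is hypothetical; the dropped ID below is a placeholder, not the real unit):

    adv$ID <- 1:nrow(adv)                        # create the variable "ID"
    adv[order(adv$Sales), ]                      # sort
    adv[order(adv$Sales, decreasing = TRUE), ]   # sort in reverse order
    adv <- adv[adv$ID != 123, ]                  # drop the unit that caused the gap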
REGRESSION ANALYSIS
Example: Advertising
Checking the homoskedasticity of the errors for the regression model.
REGRESSION ANALYSIS
Example: Advertising — robust standard errors
Robust standard errors are used where the standardized residuals are heteroskedastic. Here the residuals are heteroskedastic, so the output shown gives the estimates for the model we need.
