
Business Analytics & Machine Learning

Regression Analysis

Prof. Dr. Martin Bichler


Department of Computer Science
School of Computation, Information, and Technology
Technical University of Munich
To Stream or Not to Stream?
There is no built-in support for streaming from the AudiMax in the Galileo so far. We were told that streaming is not yet possible.

With extra equipment that we would need to purchase, we could set up a connection to live.rbg.tum.de. We need to see whether this is a stable solution.

Note that TUM wants students to be back on campus. We might not get the large lecture hall again next year.
Overview
[Diagram: taxonomy of machine learning methods]

• Supervised Learning
− Regression: Linear Regression, Poisson Regression, Ensemble Methods, PCR Regression, Lasso, Neural Networks
− Classification: Logistic Regression, Naive Bayes, Decision Trees, Ensemble Methods, Neural Networks
• Unsupervised Learning
− Clustering: k-Means, MST, Expectation Maximization
− Dimensionality Reduction: PCR, SVD
• Reinforcement Learning: Model-based, Model-free
Course Content
• Introduction
• Regression Analysis
• Regression Diagnostics
• Logistic and Poisson Regression
• Naive Bayes and Bayesian Networks
• Decision Tree Classifiers
• Data Preparation and Causal Inference
• Model Selection and Learning Theory
• Ensemble Methods and Clustering
• Dimensionality Reduction
• Association Rules and Recommenders
• Convex Optimization
• Neural Networks
• Reinforcement Learning

[Diagram: the supervised-learning branch of the taxonomy from the previous slide]
Recommended Literature
• Introduction to Econometrics
− James H. Stock and Mark W. Watson
− Chapters 2–7, 17, 18

• The Elements of Statistical Learning
− Trevor Hastie, Robert Tibshirani, Jerome Friedman
− http://web.stanford.edu/~hastie/Papers/ESLII.pdf
− Sections 3.1–3.2: Linear Methods for Regression

• Any introduction to statistics
(e.g., Statistical Inference by George Casella and Roger L. Berger,
or the online course http://onlinestatbook.com/)
Agenda for Today
Today we revisit three important elements
of statistical inference:

• estimation
• testing
• regression

Note: This class should be a repetition of topics from your introductory statistics
class. If you feel uncomfortable with the material in this class or have never heard
these concepts (estimation, testing, simple regression), this might not be the right
course for you!

Question
What is the probability of selecting a sample of 100 randomly chosen elements with a mean of 300 or more, if the true population mean is 288 and the population standard deviation is 60?
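A quick check of this number in R (a sketch; the values are taken from the question): the sampling distribution of the mean has standard error σ/√n = 60/√100 = 6, so a sample mean of 300 lies z = (300 − 288)/6 = 2 standard errors above the population mean.

> # R code
> z <- (300 - 288) / (60 / sqrt(100))   # z = 2
> pnorm(z, lower.tail = FALSE)          # P(sample mean >= 300)
[1] 0.02275013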
Statistical Estimation

[Diagram: population and sample]
A simple random sample of n elements is selected from a population with unknown mean μ. The sample data provide a value for the sample mean x̄, and the value of x̄ is used to make inferences about the value of μ.
Statistical Estimation

Point estimate
• sample mean
• sample proportion

Interval estimate
• confidence interval for the mean
• confidence interval for the proportion

The point estimate always lies within the interval estimate.
Confidence Interval
[Figure: standard normal density. The central area covers 1 − α = 95% of the probability; each tail beyond −1.96 and 1.96 contains α/2 = 2.5%.]
Confidence Interval (CI)
Suppose the samples are drawn from a normal distribution. The CI provides us with a range of values that we believe, with a given level of confidence, contains the population parameter:

$$\Pr\left(\bar{X} - z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha$$

There is a 95% chance that your interval contains μ:

$$\Pr\left(\bar{X} - 1.96\,SD < \mu < \bar{X} + 1.96\,SD\right) = 0.95$$
Example
Suppose we have a sample of n = 100 persons with mean = 215 and standard deviation = 20.

$$95\%\ \text{CI} = \bar{X} \pm 1.96 \cdot \sigma/\sqrt{n}$$

Lower limit: 215 − 1.96 · (20/10) ≈ 211
Upper limit: 215 + 1.96 · (20/10) ≈ 219

“We are 95% confident that the interval (211, 219) contains μ.”

If the population standard deviation σ is unknown, use the sample standard deviation s and the t-distribution. If n is large enough, you might also use s and the standard normal distribution.
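The same limits can be checked in R; a sketch using the numbers above, with the t-based interval one would use when σ is unknown:

> # R code
> xbar <- 215; s <- 20; n <- 100
> xbar + c(-1, 1) * 1.96 * s / sqrt(n)               # z-interval
[1] 211.08 218.92
> xbar + c(-1, 1) * qt(0.975, n - 1) * s / sqrt(n)   # t-interval, ~ (211.03, 218.97)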
Effect of Sample Size
Suppose we had only 10 observations. What happens to the confidence interval?

$$\bar{X} \pm 1.96 \cdot \frac{\sigma}{\sqrt{n}}$$

For n = 100: 215 ± 1.96 · 20/√100 ≈ (211, 219)
For n = 10: 215 ± 1.96 · 20/√10 ≈ (203, 227)

A larger sample size gives a smaller interval.
Effect of Confidence Level
Suppose we use a 90% confidence level. What happens to the confidence interval?

$$\bar{X} \pm 1.645 \cdot \frac{s}{\sqrt{n}}$$

90%: 215 ± 1.645 · 20/√100 ≈ (212, 218)

A lower confidence level gives a smaller interval.
(A 99% interval would use 2.58 as the multiplier, and the interval would be larger.)
Effect of Standard Deviation
Suppose we had an s of 40 (instead of 20). What happens to the confidence interval?

$$\bar{X} \pm 1.96 \cdot \frac{\sigma}{\sqrt{n}}$$

215 ± 1.96 · 40/√100 ≈ (207, 223)

More variation gives a larger interval.
Estimation for Population Mean µ
Point estimate: $\bar{X} = \frac{\sum_i X_i}{n}$

Estimate of variability in the population (if σ is unknown, use s):
$$s = \sqrt{\frac{1}{n-1} \sum_i (X_i - \bar{X})^2}$$

True standard deviation of the sample mean: $SD = \sigma/\sqrt{n}$
Standard error of the sample mean: $SE = s/\sqrt{n}$

95% confidence interval: $\bar{X} \pm 1.96\,SD$ or $\bar{X} \pm 1.96\,SE$

See the last slides from the first class!
How does the size of the random sample impact the size of a
confidence interval?

Statistical Tests
[Flowchart: formulate hypotheses → collect data to test the hypotheses → accept or reject the hypothesis; both systematic error and chance influence this process.]

Random error (chance) can be controlled by statistical significance or by a confidence interval.
Hypothesis Testing
State the null and alternative hypotheses (H0 and H1)
• H0 usually postulates no difference between groups

Choose the α level (related to the confidence level)
• the probability of falsely rejecting H0 (Type I error), typically 0.05 or 0.01

Calculate the test statistic and find the p-value (p)
• measures how far the data are from what you would expect under the null hypothesis

State the conclusion:
• if p ≤ α, reject H0
• if p > α, there is insufficient evidence to reject H0
Hypothesis Testing
Hypothesis: a statement about the parameters of a population or of a model (μ = 200?)

Test: do the data agree with the hypothesis? (sample mean 220)
Assume a simple random sample from a normal population (or n large enough for the CLT).

H0: μ = μ0
H1: μ ≠ μ0, pick α
Z-Test
Problem of interest: a hypothesis about the population mean μ, where the population standard deviation σ is known.

Z-confidence interval: $\bar{X} \pm z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}}$

Z-test: $z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$

H1          Rejection region
μ ≠ μ0      |z| ≥ z_{1−α/2}
μ > μ0      z ≥ z_{1−α}
μ < μ0      z ≤ z_α = −z_{1−α}
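Base R has no built-in z-test (t.test covers the usual case of unknown σ), but the statistic is simple to compute directly. A minimal sketch, where x is a hypothetical sample vector:

# R code: two-sided one-sample z-test
z_test <- function(x, mu0, sigma) {
  z <- (mean(x) - mu0) / (sigma / sqrt(length(x)))   # test statistic
  p <- 2 * pnorm(abs(z), lower.tail = FALSE)         # two-sided p-value
  c(z = z, p.value = p)
}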
Student t-Distribution:
Test Statistic for a mean µ with unknown σ

$$t = \frac{\bar{X} - \mu}{s/\sqrt{n}} \sim \text{Student-}t \quad (df = n - 1)$$

When the population is normally distributed, the statistic t is Student-t distributed.

The “degrees of freedom” (df), a function of the sample size, determine how spread out the distribution is (compared to the normal distribution). The t-distribution is bell-shaped and symmetric around zero.

[Figure: two t densities with df = n1 and df = n2, where n1 < n2; the curve with the smaller df is more spread out.]
CI and 2-Sided Tests
A level-α two-sided test rejects H0: μ = μ0 exactly when the value μ0 falls outside a level 1 − α confidence interval for μ.

Calculate the 1 − α level confidence interval, then:
• if μ0 lies within the interval (|t| < t_{1−α/2}), do not reject the null hypothesis,
• otherwise (|t| ≥ t_{1−α/2}), reject the null hypothesis.

[Figure: t density with α/2 = .025 in each tail; cut points t_{α/2} and t_{1−α/2}.]
Student t-Distribution for α=0.05

[Figure: t density with α/2 = .025 in each tail; the cut point is the critical value t_{1−α/2, df}.]

t-distribution critical values:

df      t.90    t.95    t.975    t.99     t.995
1       3.078   6.314   12.706   31.821   63.657
2       1.886   2.920    4.303    6.965    9.925
:       :       :        :        :        :
24      .       1.711    2.064    2.492    .
:       :       :        :        :        :
200     1.286   1.653    1.972    2.345    2.601
∞       1.282   1.645    1.960    2.326    2.576
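These critical values can be reproduced with R's quantile function qt, for example:

> # R code
> qt(0.975, df = 24)    # the 95% two-sided critical value used above
[1] 2.063899
> qt(0.975, df = Inf)   # converges to the standard normal value
[1] 1.959964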
Possible Results of Tests

What we decide (columns) vs. reality (rows):

                 Reject null               Fail to reject null
Null true        Type I error (α)          Correct
                 (false positive)
Null false       Correct                   Type II error (β)
                                           (false negative)

Type I error: you reject the null hypothesis when the null hypothesis is actually true.
Type II error: you fail to reject the null hypothesis when the alternative hypothesis is true.
t-Tests
Formula is slightly different for each:

• Single sample:
− tests whether a sample mean is significantly different from a pre-existing value

• Paired samples:
− tests the relationship between 2 linked samples, e.g., means obtained in
2 conditions by a single group of participants

• Independent samples:
− tests the relationship between 2 independent populations

The Paired t-Test with 2 Paired Samples
Null hypothesis: H0: μd = μ1 − μ2 = Δ0

Test statistic: $t = \frac{\bar{d} - \Delta_0}{s/\sqrt{n}}$

H1          Rejection region
μd ≠ Δ0     |t| ≥ t_{1−α/2, n−1}
μd > Δ0     t ≥ t_{1−α, n−1}
μd < Δ0     t ≤ t_{α, n−1} = −t_{1−α, n−1}

Observations are dependent, e.g., pre- and post-test, left and right eyes, brother-sister pairs.
The Paired t-Test with 2 Paired Samples
Subjects: a random sample of 25 students from TUM; the mean grades of the students on two subsequent exams A and B.

Is there a significant difference between the two exams? The answer can be given based on significance testing: the null hypothesis is E(A) = E(B), i.e., Δ0 = 0.

No.   A     B     d = A − B
1     3.7   3.5    0.2
2     2.2   2.3   −0.1
:     :     :      :
25    4.8   4.4    0.4

d̄ = 0.093, s = 0.150, n = 25, s/√n = 0.03

$$t = \frac{\bar{d}}{s/\sqrt{n}} = \frac{0.093}{0.03} = 3.1$$

p = 2 · Pr(T > 3.1 | df = 24) ≈ 0.005, while the critical value is t_{0.975;24} = 2.064.
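The test statistic and the two-sided p-value can be verified from the summary statistics in R:

> # R code
> t_stat <- 0.093 / (0.150 / sqrt(25))          # = 3.1
> 2 * pt(t_stat, df = 24, lower.tail = FALSE)   # two-sided p ~ 0.005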
The p-Value
The p-value describes the probability of observing t = 3.1 (or larger) given the null hypothesis. The smaller the p-value, the more unlikely it is to observe the corresponding sample value (or a more extreme one) by chance under H0.

> # R code
> x = c(3, 0, 5, 2, 5, 5, 5, 4, 4, 5)
> y = c(2, 1, 4, 1, 4, 3, 3, 2, 3, 5)
> t.test(x,y,alt="two.sided", paired=TRUE)

Paired t-test

data: x and y
t = 3.3541, df = 9, p-value = 0.008468
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.325555 1.674445
sample estimates:
mean of the differences
                      1
Independent Samples
Two independent samples (possibly of different size and variance):

Does the amount of credit card debt differ between households in rural areas and households in urban areas?

Population 1: all rural households, with mean μ1
Population 2: all urban households, with mean μ2

Null hypothesis: H0: μ1 = μ2
Alternative hypothesis: H1: μ1 ≠ μ2
Independent Two-Sample t-Test (Welch’s t-Test)
Two-sample unpaired t-test with (un)equal sample sizes, assuming unequal variances:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_{\bar{x}_1 - \bar{x}_2}}, \qquad s_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

Under H0, t follows a t-distribution with degrees of freedom

$$df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$$

[Figure: densities of Group 1 and Group 2.]
Independent Two-Sample t-Test: Example
Group 1: 21, 19, 18, 18, 23, 17, 19, 16, 21, 18 (x̄1 = 19, s1² = 40/9)
Group 2: 22, 25, 27, 24, 26, 24, 28, 26, 30, 28 (x̄2 = 26, s2² = 50/9)

$$s_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{40/9}{10} + \frac{50/9}{10}} = 1$$

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_{\bar{x}_1 - \bar{x}_2}} = \frac{19 - 26}{1} = -7$$

df = 18 (rounded to an integer), and t_{0.975,18} = 2.101, so |t| ≥ t_{0.975,18}
⇒ reject H0 (μ1 − μ2 = 0).
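The example can be replicated with t.test in R, which performs Welch's test by default (var.equal = FALSE):

> # R code
> g1 <- c(21, 19, 18, 18, 23, 17, 19, 16, 21, 18)   # mean 19, variance 40/9
> g2 <- c(22, 25, 27, 24, 26, 24, 28, 26, 30, 28)   # mean 26, variance 50/9
> t.test(g1, g2)   # t = -7, df ~ 17.8, p-value far below 0.05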
Selected Statistical Tests
Parametric Tests
• the family of t-tests: compares two sample means or tests a single sample mean
• F-test: tests the equality of the variances of two samples

Non-parametric Tests
• Wilcoxon signed-rank test for 2 paired i.i.d. samples
• Mann-Whitney U test for 2 independent i.i.d. samples
• Kruskal-Wallis test for several i.i.d. non-normally distributed samples

Tests of the Probability Distribution
• Kolmogorov-Smirnov and Chi-square tests: used to determine whether two underlying probability distributions differ, or whether an underlying probability distribution differs from a hypothesized distribution
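All of the tests above are available in base R's stats package. A sketch, where x, y, z are hypothetical sample vectors and tab a contingency table:

> # R code
> t.test(x, y, paired = TRUE)        # paired t-test (other t-tests analogous)
> var.test(x, y)                     # F-test for equality of two variances
> wilcox.test(x, y, paired = TRUE)   # Wilcoxon signed-rank test
> wilcox.test(x, y)                  # Mann-Whitney U test
> kruskal.test(list(x, y, z))        # Kruskal-Wallis test
> ks.test(x, y)                      # two-sample Kolmogorov-Smirnov test
> chisq.test(tab)                    # Chi-square test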
Please explain the role of confidence intervals in a single-sample t-test.

Linear Regression
Regressions identify relationships between dependent and independent
variables:
• Is there an association between the two variables?
• estimation of impact of an independent variable
• formulation of the relation in a functional form
• used for numerical prediction and time series forecasting

Regression as an established statistical technique:


• Sir Francis Galton (1822-1911) studied the relationship between a father’s height
and the son’s height

Terminology
• data streams X and Y, forming the measurement tuples (x1, y1), ..., (xn, yn)
• xi is the predictor (regressor, covariate, feature, independent variable)
• yi is the response (dependent variable, outcome, label)
• denote the regression function by E(Y | x)
• the linear regression model assumes a specific linear form
The Simple Linear Regression Model
• linear regression is a statistical tool for numerical predictions
• the first-order linear model:

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

Y = scalar response variable
X = scalar predictor variable
β0 = y-axis intercept
β1 = slope of the line (rise/run)
ε = random error term (residual)

β0 and β1 are unknown and are therefore estimated from the data.

[Figure: scatterplot with fitted line; β0 is the intercept, β1 = rise/run is the slope.]
Estimating the Coefficients
• Coefficients are random variables.
• (Ordinary Least Squares) estimates are determined by
− drawing a sample from the population of interest
− calculating sample statistics
− producing a straight line that cuts into the data

The question is: which straight line fits best?

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$

[Figure: scatterplot of data points with a candidate regression line.]
OLS Estimators
Ordinary Least Squares (OLS) approach:
• minimize the sum of squared residuals (the loss function)

$$\min \sum_i e_i^2 = \min \sum_i \left(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right)^2$$

The solution is

$$\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = \frac{\widehat{\text{cov}}(x, y)}{\widehat{\text{var}}(x)}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},$$

giving the fitted line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$.
Example
• A car dealer wants to find the relationship between the odometer reading and the selling price of used cars.
• A random sample of 100 cars is selected, and the data recorded.
• Find the regression line.

Car   Odometer   Price
1     37388      5318
2     44758      5061
3     45833      5008
4     30862      5795
5     31705      5784
6     34010      5359
...   ...        ...

Independent/predictor variable x: odometer reading
Dependent/response variable y: price
Solving a Simple Regression
To calculate β̂0 and β̂1, we first compute several statistics (with n = 100):

$$\bar{x} = 36{,}009.45; \qquad s_x^2 = \frac{\sum_i (x_i - \bar{x})^2}{n - 1} = 43{,}528{,}688$$

$$\bar{y} = 5{,}411.41; \qquad \widehat{\text{cov}}(X, Y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = -1{,}356{,}256$$

Then:

$$\hat{\beta}_1 = \frac{\widehat{\text{cov}}(X, Y)}{s_x^2} = \frac{-1{,}356{,}256}{43{,}528{,}688} = -0.0312$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 5{,}411.41 - (-0.0312)(36{,}009.45) = 6{,}533$$

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 6{,}533 - 0.0312\,x$$
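The arithmetic can be checked in R from the summary statistics; with the raw data at hand, cov(x, y) / var(x) gives the same slope:

> # R code
> b1 <- -1356256 / 43528688       # cov(x, y) / var(x), ~ -0.0312
> b0 <- 5411.41 - b1 * 36009.45   # ybar - b1 * xbar, ~ 6533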
Residual Sum of Squares (RSS)
This is the sum of squared differences between the points and the regression line:

$$RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

It can serve as a measure of how well the line fits the data (the line fits well if the statistic is small). Dividing by the degrees of freedom yields an unbiased estimator of the error variance in the population: $\hat{\sigma}^2 = RSS/(n-2)$.
Total Deviation
The Total Sum of Squares (TSS) is the sum of the Explained Sum of Squares (ESS) and the RSS:

$$\sum_i (y_i - \bar{y})^2 = \sum_i (\hat{y}_i - \bar{y})^2 + \sum_i (y_i - \hat{y}_i)^2$$

TSS = ESS + RSS

Total deviation = explained deviation + unexplained deviation
Coefficient of Determination
R² measures the proportion of the variation in y that is explained by the variation in x:

$$ESS = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, \qquad TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2 = ESS + RSS$$

$$R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS} = \frac{ESS}{TSS}$$

R² takes on any value between zero and one:
• R² = 1: perfect match between the line and the data points
• R² = 0: there is no linear relationship between x and y
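In R, R² can be computed from the residuals of a fitted model; a sketch with hypothetical vectors x and y, matching summary(mod)$r.squared:

> # R code
> mod <- lm(y ~ x)
> rss <- sum(residuals(mod)^2)
> tss <- sum((y - mean(y))^2)
> 1 - rss / tss   # the coefficient of determination R^2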
Testing the Coefficients
• Test the significance of the linear relationship:
H0: β1 = 0
H1: β1 ≠ 0

• The test statistic is

$$t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}, \qquad SE(\hat{\beta}_1) = \sqrt{\frac{RSS}{n-2} \cdot \frac{1}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$

• If SE(β̂1) is large, then β̂1 must be large to reject H0.
• If the error variable is normally distributed, the statistic follows a Student t-distribution with n − 2 degrees of freedom (if n is large, draw on the CLT).
• Reject H0 if t < t_{α/2} or t > t_{1−α/2}.
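A sketch of this test in R for a simple regression on hypothetical vectors x and y; the results match the t value and Pr(>|t|) reported by summary(lm(y ~ x)):

> # R code
> mod <- lm(y ~ x)
> se1 <- sqrt(sum(residuals(mod)^2) / (length(x) - 2) / sum((x - mean(x))^2))
> t1  <- coef(mod)[2] / se1                                # t statistic for the slope
> 2 * pt(abs(t1), df = length(x) - 2, lower.tail = FALSE)  # two-sided p-value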
The Multiple Linear Regression Model
A p-variable regression model can be expressed as a series of equations. Condensed into matrix form, the equations give the general linear model.

The β coefficients are known as partial regression coefficients. For example, with X1 and X2:
− X1 = ‘years of experience’
− X2 = ‘age’
− Y = ‘salary’

Estimated equation:
$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 = \mathbf{X}\hat{\beta}$$

Source: Hastie et al. 2016, p. 45
Matrix Notation

$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$

with dimensions: y is (n × 1), X is (n × (p+1)), β is ((p+1) × 1), and ε is (n × 1).
OLS Estimation
Sample-based counterpart to the population regression model:

$$y = \mathbf{X}\beta + \varepsilon \qquad\qquad y = \mathbf{X}\hat{\beta} + e$$

OLS requires choosing the values of the estimated coefficients such that the Residual Sum of Squares (RSS) is as small as possible for the sample:

$$RSS = e^T e = (y - \mathbf{X}\hat{\beta})^T (y - \mathbf{X}\hat{\beta})$$

We need to differentiate with respect to the unknown coefficients.
Least Squares Estimation
X is n × (p + 1), and y is the vector of outputs:

$$RSS(\beta) = (y - \mathbf{X}\beta)^T (y - \mathbf{X}\beta) = y^T y - 2\beta^T \mathbf{X}^T y + \beta^T \mathbf{X}^T \mathbf{X}\beta$$

If X has full rank, then $\mathbf{X}^T\mathbf{X}$ is positive definite.

First-order condition:
$$\frac{\partial RSS}{\partial \beta} = -2\mathbf{X}^T y + 2\mathbf{X}^T \mathbf{X}\beta = 0$$

$$\hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T y \qquad \text{(the normal equation)}$$

$$\hat{y} = \mathbf{X} (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T y = H y,$$

where H is the “hat” or projection matrix.
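In R, with X a model matrix (including the column of ones) and y the response vector, the normal equations can be solved directly; solve(A, b) is numerically preferable to forming the inverse explicitly:

> # R code
> beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # solves the normal equations
> y_hat    <- X %*% beta_hat                  # fitted values, H y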
Geometrical Representation
• Least squares estimates in ℝⁿ
• Minimize RSS(β) = ‖y − Xβ‖²; the residual vector y − ŷ is then orthogonal to the subspace spanned by the columns of X.

Definition (Projection):
Let C ⊂ ℝⁿ be non-empty, closed, and convex. For a fixed y ∈ ℝⁿ we search for the point ŷ ∈ C with the smallest distance to y (w.r.t. the Euclidean norm), i.e., we solve the minimization problem

$$P_C(y) = \min_{\hat{y} \in C} \|y - \hat{y}\|^2$$

Source: Hastie et al. 2016, p. 46
Projections

[Figure: projection of (2,2,2) onto the plane spanned by (1,0,0) and (0,1,0), and projection of b = (2,2) onto a = (1,0).]

In general, to project b onto some vector a:

$$c = \frac{a^T b}{a^T a}\, a$$

For the examples in the figure:

$$\begin{pmatrix} 2 \\ 2 \\ 0 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix} \begin{pmatrix} 2 \\ 2 \\ 2 \end{pmatrix} \qquad\text{and}\qquad \frac{a^T b}{a^T a}\, a = \begin{pmatrix} 2 \\ 0 \end{pmatrix}$$
Example
y: 2.6, 1.6, 4.0, 3.0, 4.9
x: 1.2, 3.0, 4.5, 5.8, 7.2

$$y = \mathbf{X}\hat{\beta} + e: \qquad \begin{pmatrix} 2.6 \\ 1.6 \\ 4.0 \\ 3.0 \\ 4.9 \end{pmatrix} = \begin{pmatrix} 1 & 1.2 \\ 1 & 3.0 \\ 1 & 4.5 \\ 1 & 5.8 \\ 1 & 7.2 \end{pmatrix} \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{pmatrix} + \begin{pmatrix} e_1 \\ e_2 \\ e_3 \\ e_4 \\ e_5 \end{pmatrix}$$

$$\hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T y = \begin{pmatrix} 5 & 21.7 \\ 21.7 & 116.17 \end{pmatrix}^{-1} \begin{pmatrix} 16.1 \\ 78.6 \end{pmatrix} = \begin{pmatrix} 1.0565 & -0.1973 \\ -0.1973 & 0.0455 \end{pmatrix} \begin{pmatrix} 16.1 \\ 78.6 \end{pmatrix} = \begin{pmatrix} 1.498 \\ 0.397 \end{pmatrix}$$
Check Results in R
> y <- c(2.6, 1.6, 4.0, 3.0, 4.9)
> x <- c(1.2, 3.0, 4.5, 5.8, 7.2)
> mod <- lm(y ~ x)
> summary(mod)

Call:
lm(formula = y ~ x)

Residuals:
      1       2       3       4       5
 0.6259 -1.0883  0.7165 -0.7993  0.5452

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.4980     1.0322   1.451    0.243
x             0.3968     0.2142   1.853    0.161

Residual standard error: 1.004 on 3 degrees of freedom
Multiple R-Squared: 0.5336, Adjusted R-squared: 0.3782
F-statistic: 3.433 on 1 and 3 DF, p-value: 0.1610

What to check: 1. the coefficients, 2. their significance, 3. the coefficient of determination.
Selected Statistics
Adjusted R²
• It represents the proportion of the variability of y explained by X. R² is adjusted so that models with different numbers of variables can be compared:

$$\bar{R}^2 = 1 - \frac{RSS/(n - p - 1)}{TSS/(n - 1)}$$

The F-test
• A significant F indicates a linear relationship between y and at least one of the x's:
H0: β1 = β2 = ⋯ = βp = 0

The t-test of each partial regression coefficient
• A significant t indicates that the variable in question influences the response variable while controlling for the other explanatory variables.
Model Specification
• In regression analysis, specification is the process of developing a regression model.
• This process consists of selecting an appropriate functional form for the model and choosing which variables to include.
• The model might include irrelevant variables or omit relevant variables.

• Dummy variables encode discrete variables (e.g., 0/1 for gender).
• Quadratic models y = β0 + β1 x1 + β2 x2² + ε: use z2 = x2² (linear in β).
• Models with interaction terms y = β0 + β1 x1 x2: use z1 = x1 x2 (linear in β).
• Non-linear models are challenging, but some non-linear regression problems can be linearized. For example, the exponential model y = α x^β ε can be transformed using the logarithm (see the sketch below) to
ln(y) = ln(α) + β ln(x) + ln(ε)
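A sketch of this linearization in R, with hypothetical positive-valued vectors x and y:

> # R code
> mod <- lm(log(y) ~ log(x))      # fits ln(y) = ln(alpha) + beta * ln(x)
> alpha_hat <- exp(coef(mod)[1])  # the intercept estimates ln(alpha)
> beta_hat  <- coef(mod)[2]       # the slope estimates beta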
Subset Selection
Setting: a possibly large set of predictor variables, some irrelevant.

Goal: fit a parsimonious model that explains the variation in Y with a small set of predictors.
• a.k.a. the subset selection or feature selection problem

Automated procedures:
• best subset (among all exponentially many subsets; computationally expensive)
• backward elimination (top-down approach)
• forward selection (bottom-up approach)
• stepwise regression (combines forward/backward)

More in the context of the class on dimensionality reduction:
• subset selection vs. shrinkage methods
Example: Backward Elimination
• Select a significance level to stay in the model (generally, 0.05 is too low and causes too many variables to be removed).

• Fit the full model with all possible predictors.

• Consider the predictor with the lowest t-statistic (highest p-value).
− If p > the significance level, remove the predictor and fit the model without this variable (the model must be re-fitted here because the partial regression coefficients change).
− If p ≤ the significance level, stop and keep the current model.

• Continue until all predictors have p-values below the significance level (see the sketch below).

• Forward selection is similar: the predictor with the lowest p-value is added next, as long as its p-value is below the significance level.
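A minimal sketch of p-value-based backward elimination in R, assuming a hypothetical data frame dat with the response and numeric predictors (note that R's built-in step() selects by AIC instead of p-values):

# R code
backward <- function(dat, response, alpha = 0.10) {
  preds <- setdiff(names(dat), response)
  repeat {
    mod <- lm(reformulate(preds, response), data = dat)
    p <- summary(mod)$coefficients[-1, "Pr(>|t|)"]          # p-values, intercept dropped
    if (max(p) <= alpha || length(preds) == 1) return(mod)  # stop (kept simple)
    preds <- setdiff(preds, names(which.max(p)))            # drop worst, re-fit
  }
}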
