Business Analytics & Machine Learning: Regression Analysis
Regression Analysis
With extra equipment, which we would need to purchase, we could set up a connection to live.rbg.tum.de. We need to see whether this is a stable solution.
Note that TUM wants students to be back on campus. We might not get the large lecture hall again next year.
Overview
Machine Learning
• Regression: Linear Regression, Poisson Regression, PCR Regression, Lasso, Neural Networks
• Classification: Logistic Regression, Naive Bayes, Decision Trees, Ensemble Methods, Neural Networks
• Clustering: k-Means, MST, Expectation Maximization
• Dimensionality Reduction: PCR, SVD
• Model-based / Model-free
Course Content
• Introduction
• Regression Analysis
• Regression Diagnostics
• Logistic and Poisson Regression
• Naive Bayes and Bayesian Networks
• Decision Tree Classifiers
• Data Preparation and Causal Inference
• Model Selection and Learning Theory
• Ensemble Methods and Clustering
• Dimensionality Reduction
• Association Rules and Recommenders
• Convex Optimization
• Neural Networks
• Reinforcement Learning
Recommended Literature
• Introduction to Econometrics
− James H. Stock and Mark W. Watson
− Chapters 2–7, 17, 18
Agenda for Today
Today we revisit three important elements
of statistical inference:
• estimation
• testing
• regression
Note: This class is intended as a review of topics from your introductory statistics
class. If you feel uncomfortable with the material or have never heard of these
concepts (estimation, testing, simple regression), this might not be the right
course for you!
Question
What is the probability of drawing a sample of 100 randomly selected elements with a mean of 300 or more if the true population mean is 288 and the population standard deviation is 60?
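Under the central limit theorem, the sample mean is approximately normal with standard error σ/√n, so the question reduces to a z-score. The course snippets use R; here is an equivalent sanity check in standard-library Python:

```python
import math

# Standard error of the mean: sigma / sqrt(n)
se = 60 / math.sqrt(100)          # = 6.0
# z-score of the observed sample mean
z = (300 - 288) / se              # = 2.0
# P(Z >= z) for a standard normal, via the complementary error function
p = 0.5 * math.erfc(z / math.sqrt(2))
print(round(z, 2), round(p, 4))   # → 2.0 0.0228
```

So a sample mean of 300 or more occurs in only about 2.3% of such samples.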
Statistical Estimation
Estimate
• Point estimate: sample mean, sample proportion
• Interval estimate: confidence interval for the mean, confidence interval for the proportion
Confidence Interval
[Figure: standard normal probability density with a central area of 1 − α = 95% and α/2 = 2.5% in each tail; critical values at −1.96 and 1.96]
Confidence Interval (CI)
Suppose the samples are drawn from a normal distribution. The CI provides us with
a range of values that we believe, with a given level of confidence, contains a
population parameter:

Pr( X̄ − z(1−α/2) · σ/√n < μ < X̄ + z(1−α/2) · σ/√n ) = 1 − α
Example
Suppose we have a sample of n = 100 persons
mean = 215, standard deviation = 20
95% CI = X̄ ± 1.96 · σ/√n
Lower limit: 215 − 1.96 · 20/10 = 211.08
Upper limit: 215 + 1.96 · 20/10 = 218.92
⇒ CI ≈ (211, 219)
“We are 95% confident that the interval (211, 219) contains μ.”
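The interval above can be reproduced in a couple of lines of Python (a sketch using the normal critical value 1.96 from the slide):

```python
import math

xbar, sigma, n = 215, 20, 100
z = 1.96                               # critical value for a 95% CI
half_width = z * sigma / math.sqrt(n)  # 1.96 * 20 / 10 = 3.92
ci = (xbar - half_width, xbar + half_width)
print(ci)                              # approximately (211.08, 218.92)
```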
Effect of Sample Size
Suppose we had only 10 observations.
What happens to the confidence interval?
X̄ ± 1.96 · σ/√n
For n = 100: 215 ± 1.96 · 20/√100 ≈ (211, 219)
For n = 10: 215 ± 1.96 · 20/√10 ≈ (203, 227)
Effect of Confidence Level
Suppose we use a 90% confidence level.
What happens to the confidence interval?
X̄ ± 1.645 · σ/√n
90%: 215 ± 1.645 · 20/√100 ≈ (212, 218)
Effect of Standard Deviation
Suppose we had an s of 40 (instead of 20).
What happens to the confidence interval?
X̄ ± 1.96 · σ/√n
215 ± 1.96 · 40/√100 ≈ (207, 223)
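All three effects (sample size, confidence level, standard deviation) follow from the same half-width z · s/√n; a short Python sketch of the slides' four intervals:

```python
import math

def ci(xbar, s, n, z):
    # half-width of the interval is z * s / sqrt(n)
    h = z * s / math.sqrt(n)
    return round(xbar - h), round(xbar + h)

print(ci(215, 20, 100, 1.96))   # baseline        → (211, 219)
print(ci(215, 20, 10, 1.96))    # smaller n       → (203, 227): wider
print(ci(215, 20, 100, 1.645))  # 90% confidence  → (212, 218): narrower
print(ci(215, 40, 100, 1.96))   # larger s        → (207, 223): wider
```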
Estimation for Population Mean µ
Point estimate: X̄ = Σᵢ Xᵢ / n
Estimate of variability in the population: s = √( Σᵢ (Xᵢ − X̄)² / (n − 1) )
(if σ is unknown, use s)
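The two estimators in Python, applied to a small hypothetical sample:

```python
import math

x = [2, 4, 4, 4, 5, 5, 7, 9]       # hypothetical sample
n = len(x)
xbar = sum(x) / n                   # point estimate of the mean
# sample standard deviation with the n-1 (Bessel) correction
s = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
print(xbar, round(s, 3))            # → 5.0 2.138
```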
How does the size of the random sample impact the size of a confidence interval?
Statistical Tests
[Figure: an observed difference may be due to a systematic effect or due to chance]
Hypothesis Testing
State null and alternative hypothesis (H0 and H1 )
• H0 usually postulates no difference between groups
State conclusion:
• if p ≤ α, reject H0
• if p > α, insufficient evidence to reject H0
Hypothesis Testing
Hypothesis: a statement about parameters of a population or of a model (μ = 200?)
Test: does the data agree with the hypothesis? (sample mean 220)
Simple random sample from a normal population (or n large enough for the CLT)
H0: μ = μ0
H1: μ ≠ μ0, pick α
Z-Test
Problem of interest:
The population mean μ is to be tested; the population standard deviation σ is known.

Z-confidence interval: X̄ ± z(1−α/2) · σ/√n
Z-test: z = (X̄ − μ0) / (σ/√n)

H1          Rejection region
μ ≠ μ0      |z| ≥ z(1−α/2)
μ > μ0      z ≥ z(1−α)
μ < μ0      z ≤ z(α) = −z(1−α)
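Applying the two-sided rejection rule to the numbers from the earlier question slide (X̄ = 300, μ0 = 288, σ = 60, n = 100), a minimal sketch:

```python
import math

xbar, mu0, sigma, n = 300, 288, 60, 100
z = (xbar - mu0) / (sigma / math.sqrt(n))   # = 12 / 6 = 2.0
z_crit = 1.96                               # z(1 - alpha/2) for alpha = 0.05
decision = "reject H0" if abs(z) >= z_crit else "fail to reject H0"
print(z, decision)                          # → 2.0 reject H0
```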
Student t-Distribution:
Test Statistic for a mean µ with unknown σ

t = (X̄ − μ0) / (s/√n) ~ Student-t (df = n − 1)

[Figure: t density with critical values t(α/2) and t(1−α/2)]
Student t-Distribution for α = 0.05
[Figure: t density with critical value t(1−α/2), df]

                     What we decide
               Reject null          Fail to reject null
Reality
  Null true    Type I Error (α)     Correct
               (false positive)
  Null false   Correct              Type II Error (β)
                                    (false negative)

Type I error – you reject the null hypothesis when the null hypothesis is actually true.
Type II error – you fail to reject the null hypothesis when the alternative hypothesis is true.
t-Tests
The formula is slightly different for each:
• Single sample:
− tests whether a sample mean is significantly different from a pre-existing value
• Paired samples:
− tests the relationship between 2 linked samples, e.g., means obtained in 2 conditions by a single group of participants
• Independent samples:
− tests the relationship between 2 independent populations
The Paired t-Test with 2 Paired Samples
Null hypothesis: H0: μd = μ1 − μ2 = Δ0
Test statistic: t = (d̄ − Δ0) / (s_d/√n)

H1          Rejection region
μd ≠ Δ0     |t| ≥ t(1−α/2, n−1)
μd > Δ0     t ≥ t(1−α, n−1)
μd < Δ0     t ≤ t(α, n−1) = −t(1−α, n−1)
The Paired t-Test with 2 Paired Samples
Subjects: random sample of 10 students from TUM; mean grades of the students on two subsequent exams A and B.

> # R code
> x = c(3, 0, 5, 2, 5, 5, 5, 4, 4, 5)
> y = c(2, 1, 4, 1, 4, 3, 3, 2, 3, 5)
> t.test(x, y, alt="two.sided", paired=TRUE)
        Paired t-test
data: x and y
t = 3.3541, df = 9, p-value = 0.008468
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.325555 1.674445
sample estimates:
mean of the differences
                      1
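The t statistic reported by R can be reproduced by hand from the paired differences (standard-library Python only):

```python
import math

x = [3, 0, 5, 2, 5, 5, 5, 4, 4, 5]
y = [2, 1, 4, 1, 4, 3, 3, 2, 3, 5]
d = [xi - yi for xi, yi in zip(x, y)]   # paired differences
n = len(d)
dbar = sum(d) / n                        # mean difference = 1.0
s_d = math.sqrt(sum((di - dbar) ** 2 for di in d) / (n - 1))
t = dbar / (s_d / math.sqrt(n))
print(round(t, 4))                       # → 3.3541, matching the R output
```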
Independent Samples
2 independent samples (possibly of different size and variance):
Does the amount of credit card debt differ between households in rural areas and households in urban areas?
Population 1: all rural households
Population 2: all urban households
Independent Two-Sample t-Test (Welch’s t-Test)
Two-sample unpaired t-test with (un)equal sample sizes, assuming unequal variances.

t = (x̄1 − x̄2) / s_(x̄1−x̄2),  where  s_(x̄1−x̄2) = √( s1²/n1 + s2²/n2 )

Under H0, t follows a t-distribution with degrees of freedom (df)

df = ( s1²/n1 + s2²/n2 )² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ]
Independent Two-Sample t-Test: Example
Group 1: 21, 19, 18, 18, 23, 17, 19, 16, 21, 18   (x̄1 = 19, s1² = 40/9)
Group 2: 22, 25, 27, 24, 26, 24, 28, 26, 30, 28   (x̄2 = 26, s2² = 50/9)

s_(x̄1−x̄2) = √( s1²/n1 + s2²/n2 ) = √( (40/9)/10 + (50/9)/10 ) = 1

t = (x̄1 − x̄2) / s_(x̄1−x̄2) = (19 − 26) / 1 = −7

df = 18 (rounded to integer)
t(0.975, 18) = 2.101;  |t| ≥ t(0.975, 18)
⇒ Reject H0 (μ1 − μ2 = 0)
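The Welch statistic and degrees of freedom for this example, checked in plain Python:

```python
import math

g1 = [21, 19, 18, 18, 23, 17, 19, 16, 21, 18]
g2 = [22, 25, 27, 24, 26, 24, 28, 26, 30, 28]

def var(g):
    # sample variance with the n-1 correction
    m = sum(g) / len(g)
    return sum((v - m) ** 2 for v in g) / (len(g) - 1)

v1, v2 = var(g1) / len(g1), var(g2) / len(g2)   # s_i^2 / n_i
se = math.sqrt(v1 + v2)                          # = 1.0
t = (sum(g1)/len(g1) - sum(g2)/len(g2)) / se
df = (v1 + v2) ** 2 / (v1**2 / (len(g1) - 1) + v2**2 / (len(g2) - 1))
print(round(t, 6), round(df))                    # → -7.0 18
```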
Selected Statistical Tests
Parametric Tests
• the family of t-tests: compares two sample means or tests a single sample mean
• F-test: tests the equality of the variances of two samples
Non-parametric Tests
• Wilcoxon signed-rank test for 2 paired i.i.d. samples
• Mann-Whitney U test for 2 independent i.i.d. samples
• Kruskal-Wallis test for several i.i.d. non-normally distributed samples
Please explain the role of confidence intervals in a single-sample t-test.
Linear Regression
Regressions identify relationships between dependent and independent variables:
• testing for an association between the two variables
• estimation of the impact of an independent variable on the dependent variable
• formulation of the relationship in a functional form
• numerical prediction and time series forecasting
Terminology
• data streams X and Y, forming the measurement tuples (x1, y1), …, (xn, yn)
• xi is the predictor (regressor, covariate, feature, independent variable)
• yi is the response (dependent variable, outcome, label)
• denote the regression function by E(Y | x)
• the linear regression model assumes a specific linear form
The Simple Linear Regression Model
• linear regression is a statistical tool for numerical predictions
• the first-order linear model:

Y = β0 + β1·X + ε

Y  = scalar response variable
X  = scalar predictor variable
β0 = y-axis intercept
β1 = slope of the line (rise/run)
ε  = random error term

β0 and β1 are unknown and are therefore estimated from the data.
[Figure: scatter plot with fitted line; β0 is the intercept, β1 = rise/run]
Estimating the Coefficients
• Coefficients are random variables
• (Ordinary Least Squares) estimates are determined by
− drawing a sample from the population of interest
− calculating sample statistics
− producing the straight line that minimizes the squared deviations from the data:

min Σᵢ eᵢ² = min Σᵢ ( yᵢ − (β̂0 + β̂1·xᵢ) )²
Example
• A car dealer wants to find the relationship between the odometer reading and the selling price of used cars.
• A random sample of 100 cars is selected, and the data recorded.
• Find the regression line.

Car   Odometer   Price
1     37388      5318
2     44758      5061
3     45833      5008
4     30862      5795
5     31705      5784
6     34010      5359
…     …          …

Independent/predictor variable: x (odometer)
Dependent/response variable: y (price)
Solving a Simple Regression
To calculate β0 and β1, we first compute several statistics (with n = 100):

x̄ = 36,009.45;  s_x² = Σ(xᵢ − x̄)² / (n − 1) = 43,528,688
ȳ = 5,411.41;  cov(X, Y) = Σ(xᵢ − x̄)(yᵢ − ȳ) / (n − 1) = −1,356,256

From these, the OLS estimates are
β̂1 = cov(X, Y) / s_x² = −1,356,256 / 43,528,688 ≈ −0.0312
β̂0 = ȳ − β̂1 · x̄ ≈ 5,411.41 + 0.0312 · 36,009.45 ≈ 6,533
⇒ ŷ ≈ 6,533 − 0.0312 · x
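The slope and intercept follow directly from the summary statistics via the usual OLS formulas (β̂1 = cov(X,Y)/s_x², β̂0 = ȳ − β̂1·x̄):

```python
xbar, ybar = 36_009.45, 5_411.41
s2_x = 43_528_688          # sample variance of odometer readings
cov_xy = -1_356_256        # sample covariance of odometer and price

b1 = cov_xy / s2_x         # slope
b0 = ybar - b1 * xbar      # intercept
print(round(b1, 4), round(b0, 1))   # → -0.0312 6533.4
```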
Residual Sum of Squares (RSS)
This is the sum of squared differences between the points and the regression line.
It can serve as a measure of how well the line fits the data
(the fit is good if the statistic is small).

RSS = Σᵢ₌₁ⁿ ( yᵢ − ŷᵢ )²
Total Deviation
The Total Sum of Squares (TSS) is the sum of the Explained Sum of Squares (ESS)
and the RSS:

Σ( yᵢ − ȳ )² = Σ( ŷᵢ − ȳ )² + Σ( yᵢ − ŷᵢ )²
Coefficient of Determination
R² measures the proportion of the variation in y that is explained by the variation in x:

ESS = Σᵢ₌₁ⁿ ( ŷᵢ − ȳ )²,   TSS = Σᵢ₌₁ⁿ ( yᵢ − ȳ )² = ESS + RSS

R² = (TSS − RSS) / TSS = 1 − RSS/TSS = ESS/TSS
Testing the Coefficients
• Test the significance of the linear relationship
H0: β1 = 0
H1: β1 ≠ 0
• The test statistic is

t = β̂1 / SE(β̂1),   where   SE(β̂1) = √( (RSS/(n − 2)) · 1/Σᵢ₌₁ⁿ(xᵢ − x̄)² )
The Multiple Linear Regression Model
A p-variable regression model can be expressed as a series of equations. Condensed into matrix form, the equations give the general linear model y = Xβ + ε.

Estimated equation (here with two predictors):
Ŷ = β̂0 + β̂1·X1 + β̂2·X2 = Xβ̂
OLS Estimation
Sample-based counterpart to the population regression model:
y = Xβ + ε   (population)
y = Xβ̂ + e   (sample)

OLS requires choosing the values of the estimated coefficients such that the Residual Sum of Squares (RSS) is as small as possible for the sample:

RSS = eᵀe = (y − Xβ̂)ᵀ(y − Xβ̂)
Least Squares Estimation
X is n × (p + 1), y is the vector of outputs.

RSS(β) = (y − Xβ)ᵀ(y − Xβ)
β̂ = (XᵀX)⁻¹ Xᵀ y   (aka the Normal Equation)
ŷ = X (XᵀX)⁻¹ Xᵀ y
Geometrical Representation
• Least squares estimates in ℝⁿ
• Minimize RSS(β) = ||y − Xβ||²; the residual vector y − ŷ is then orthogonal to the column space of X.

Definition (Projection):
Let C ⊂ ℝⁿ be non-empty, closed and convex. For a fixed y ∈ ℝⁿ we search a point ŷ ∈ C with the smallest distance to y (w.r.t. the Euclidean norm), i.e., we solve the minimization problem

P_C(y) = min_{ŷ ∈ C} ||y − ŷ||²
Projections
[Figure: projection of b = (2,2) onto a = (1,0) in the plane, and of (2,2,2) onto the plane spanned by (1,0,0) and (0,1,0)]

In general, to project b onto some vector a:

c = (aᵀb / aᵀa) · a

e.g., projecting (2,2,2) onto the x–y plane:

(2, 2, 0)ᵀ = [[1,0,0], [0,1,0], [0,0,0]] · (2, 2, 2)ᵀ
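Both projections in numpy (assuming numpy is available):

```python
import numpy as np

# Project b onto the vector a: c = (a.b / a.a) * a
a = np.array([1.0, 0.0])
b = np.array([2.0, 2.0])
c = (a @ b) / (a @ a) * a
print(c)                               # → [2. 0.]

# Project (2,2,2) onto the x-y plane with a projection matrix
P = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 0]], dtype=float)
print(P @ np.array([2.0, 2.0, 2.0]))   # → [2. 2. 0.]
```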
Example
y: 2.6 1.6 4.0 3.0 4.9
x: 1.2 3.0 4.5 5.8 7.2

Write y = Xβ̂ + e with X = [1 x] (a column of ones and the predictor), β̂ = (β̂0, β̂1)ᵀ and residual vector e = (e1, …, e5)ᵀ.

β̂ = (XᵀX)⁻¹ Xᵀ y

XᵀX = [[5, 21.7], [21.7, 116.17]],   Xᵀy = (16.1, 78.6)ᵀ

β̂ = [[1.0565, −0.1973], [−0.1973, 0.0455]] · (16.1, 78.6)ᵀ = (1.498, 0.397)ᵀ
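The same normal-equation computation in numpy:

```python
import numpy as np

y = np.array([2.6, 1.6, 4.0, 3.0, 4.9])
x = np.array([1.2, 3.0, 4.5, 5.8, 7.2])
X = np.column_stack([np.ones_like(x), x])   # design matrix [1, x]

# Normal equation: beta = (X^T X)^{-1} X^T y
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(beta, 3))                    # → [1.498 0.397]
```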
Check Results in R
> y <- c(2.6, 1.6, 4.0, 3.0, 4.9)
> x <- c(1.2, 3.0, 4.5, 5.8, 7.2)
> mod <- lm(y ~ x)
> summary(mod)
Call:
lm(formula = y ~ x)
Residuals:
      1       2       3       4       5
 0.6259 -1.0883  0.7165 -0.7993  0.5452
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.4980     1.0322   1.451    0.243
x             0.3968     0.2142   1.853    0.161

1. check the coefficients
2. check their significance
3. check the coefficient of determination
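The standard errors and t values in the R output follow from the formula on the earlier slide, SE(β̂1) = √( RSS/(n−2) · 1/Σ(xᵢ − x̄)² ); a numpy cross-check:

```python
import numpy as np

y = np.array([2.6, 1.6, 4.0, 3.0, 4.9])
x = np.array([1.2, 3.0, 4.5, 5.8, 7.2])
X = np.column_stack([np.ones_like(x), x])

beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta
rss = resid @ resid
sigma2 = rss / (len(y) - 2)                 # unbiased error variance
se_b1 = np.sqrt(sigma2 / np.sum((x - x.mean()) ** 2))
t_b1 = beta[1] / se_b1
print(round(se_b1, 4), round(t_b1, 3))      # → 0.2142 1.853
```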
Selected Statistics
Adjusted R²
• It represents the proportion of variability of y explained by X. R² is adjusted so that models with a different number of variables can be compared:

R̄² = 1 − ( RSS/(n − p − 1) ) / ( TSS/(n − 1) )

The F-test
• A significant F indicates a linear relationship between y and at least one of the x’s:
H0: β1 = β2 = ⋯ = βp = 0
Model Specification
• In regression analysis, specification is the process of developing a regression model.
• This process consists of selecting an appropriate functional form for the model and choosing which variables to include.
• The model might include irrelevant variables or omit relevant variables.
Subset Selection
Setting: possibly a large set of predictor variables, some irrelevant
Goal: fit a parsimonious model that explains variation in 𝑌𝑌 with a small set of
predictors
• Aka. subset selection or feature selection problem
Automated procedures:
• best subset (among all exponentially many, computationally expensive)
• backward elimination (top down approach)
• forward selection (bottom up approach)
• stepwise regression (combines forward/backward)
Example: Backward Elimination
• Select a significance level for predictors to stay in the model (0.05 is often considered too strict here, as it causes too many variables to be removed).
• Repeatedly refit the model and drop the least significant predictor until all remaining predictors meet the level.
• Forward selection is similar, in reverse: the predictor with the lowest p-value is added at each step, until no remaining predictor has a p-value below the significance level.
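The loop above can be sketched in a few lines. This is a simplified illustration, not the full procedure: to stay dependency-free it uses a fixed |t| cut-off (roughly the 5% level for moderate df) instead of exact p-values, and the data and variable names are hypothetical:

```python
import numpy as np

def t_stats(X, y):
    # OLS fit; returns coefficients and their t statistics
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(XtX_inv))
    return beta, beta / se

def backward_eliminate(X, y, names, t_crit=2.0):
    # Drop the predictor with the smallest |t| until every remaining
    # predictor exceeds the critical value (the intercept is never dropped).
    names = list(names)
    while X.shape[1] > 1:
        _, t = t_stats(X, y)
        k = 1 + int(np.argmin(np.abs(t[1:])))
        if abs(t[k]) >= t_crit:
            break
        X = np.delete(X, k, axis=1)
        names.pop(k)
    return names

# Hypothetical data: y depends on x1 but not on x2
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=200), rng.normal(size=200)
y = 1 + 2 * x1 + 0.5 * rng.normal(size=200)
X = np.column_stack([np.ones(200), x1, x2])
sel = backward_eliminate(X, y, ["intercept", "x1", "x2"])
print(sel)
```

With this data the irrelevant predictor x2 is typically eliminated while x1 survives.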