
Regression Analysis

• Linear regression is used to predict the value of an outcome variable Y based on one or more input predictor variables X. The aim is to establish a linear relationship (a mathematical formula) between the predictor variable(s) and the response variable, so that we can use this formula to estimate the value of the response Y when only the values of the predictors (Xs) are known.

• The aim of linear regression is to model a continuous variable Y as a mathematical function of one or more X variable(s), so that we can use this regression model to predict Y when only X is known. This mathematical equation can be generalized as follows:

• Y = β1 + β2X + ϵ, where β1 is the intercept, β2 is the slope, and ϵ is the error term.
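As a rough illustration, this model can be estimated by ordinary least squares; a minimal sketch in Python using statsmodels and invented numbers:

```python
# Minimal sketch: estimating Y = b1 + b2*X + e by ordinary least squares.
# The numbers are invented for illustration.
import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

X = sm.add_constant(x)            # adds the intercept column (b1)
model = sm.OLS(y, X).fit()        # estimates b1 (const) and b2 (slope)

print(model.params)               # [intercept b1, slope b2]
print(model.predict([[1.0, 6.0]]))  # predicted Y at X = 6 (first column is the constant)
```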
The Regression “Machine”

The regression model (terminology, structure, and assumptions) + relevant sample data (no missing data) → regression results (today’s focus)
Assumptions of Regression Model

1. The regression model is linear in parameters
2. The mean of residuals is zero
3. Homoscedasticity of residuals
4. No autocorrelation of residuals
5. The X values are fixed (non-random) and the variability in X values is positive
6. The residuals are uncorrelated with the independent variables
7. The number of observations must be greater than the number of Xs
8. No perfect multicollinearity
9. Normality of residuals
Example: deviations from the mean sum to zero (illustrating assumption 2).

Observation   Y     Deviation from mean (Y − 4)
A             2     2 − 4 = −2
B             4     4 − 4 = 0
C             6     6 − 4 = 2
Total         12    0
Average       4     0
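The same idea holds for regression residuals (assumption 2): with an intercept in the model, the OLS residuals average to zero. A minimal sketch with statsmodels and invented numbers:

```python
# Sketch: OLS residuals sum (and average) to zero when the model includes an intercept.
import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(fit.resid.mean())   # ~0, up to floating-point error
```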
Residual Analysis for Linearity

[Figure: plots of Y vs. x and residuals vs. x for two cases. A curved pattern in the residuals indicates a non-linear relationship (“Not Linear”); a random scatter around zero indicates linearity (“Linear”).]
Residual Analysis for Homoscedasticity

[Figure: plots of Y vs. x and residuals vs. x for two cases. A funnel-shaped spread of residuals indicates non-constant variance; an even spread around zero indicates constant variance.]
Residual Analysis for Autocorrelation

[Figure: plots of residuals vs. X for two cases. A systematic pattern across the observation order indicates the residuals are not independent; a random scatter indicates independence.]
Residual Plot: Normality

[Figure: a histogram or normal Q-Q plot of the residuals, used to check the normality assumption.]
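Residual plots of the kind sketched above can be drawn directly; a rough sketch on simulated data, assuming matplotlib and statsmodels are available:

```python
# Sketch: residual diagnostics for linearity/homoscedasticity, autocorrelation, and normality.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 1.0 + 2.0 * x + rng.normal(0, 1, size=x.size)   # simulated data

resid = sm.OLS(y, sm.add_constant(x)).fit().resid

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].scatter(x, resid)   # vs. x: look for curvature (non-linearity) or a funnel (heteroscedasticity)
axes[1].plot(resid)         # vs. observation order: look for runs or patterns (autocorrelation)
sm.qqplot(resid, line="45", fit=True, ax=axes[2])   # Q-Q plot: points near the line suggest normality
plt.tight_layout()
plt.show()
```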
Violations and Implications

Assumption → problem associated with its violation:
1. Model is linear in parameters → non-linearity
2. Mean of residuals is zero → biased intercept
3. Homoscedasticity of residuals → heteroscedasticity
4. No autocorrelation of residuals → autocorrelation
5. X values are fixed and their variability is positive → errors in variables
6. Residuals are uncorrelated with the independent variables → endogeneity
7. Number of observations greater than the number of Xs → no unique solution otherwise
8. No correlation between independent variables → multicollinearity
9. Normality of residuals → invalid hypothesis testing otherwise
Consequences of Multicollinearity
When imperfect multicollinearity is present:
(a) The OLS estimates may be imprecise because of large standard errors.
(b) Affected coefficients may fail to attain statistical significance due to low t-statistics (t = β̂ / SE(β̂)).
(c) Sign reversal might occur.
(d) Addition or deletion of a few observations may result in substantial changes in the estimated coefficients.
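Multicollinearity is commonly screened with variance inflation factors (VIFs); a minimal sketch with statsmodels, using a hypothetical two-predictor data frame:

```python
# Sketch: variance inflation factors (VIF) for each predictor.
# A VIF above ~10 is a common rule-of-thumb warning sign of multicollinearity.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.DataFrame({
    "salt": [2.0, 3.5, 1.0, 4.2, 2.8, 3.1],   # hypothetical predictors
    "age":  [50,  62,  45,  70,  55,  60],
})
X = sm.add_constant(df)

for i, name in enumerate(X.columns):
    if name != "const":                        # the VIF of the constant is not of interest
        print(name, variance_inflation_factor(X.values, i))
```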
Consequences of Heteroskedasticity
1. The OLS estimators are still unbiased and consistent, because none of the explanatory variables is correlated with the error term. So a correctly specified equation will give estimated coefficients that are very close to the true parameters.
2. It affects the distribution of the estimated coefficients, increasing the variances of the distributions and therefore making the OLS estimators inefficient.
3. It leads to underestimated variances of the estimators and therefore to inflated values of the t and F statistics (t = β̂ / SE(β̂)).
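Heteroskedasticity can be tested formally, for example with a Breusch-Pagan test; a minimal sketch with statsmodels on simulated data whose error spread grows with x:

```python
# Sketch: Breusch-Pagan test for heteroskedasticity.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(1)
x = np.linspace(1, 10, 200)
y = 1 + 2 * x + rng.normal(0, 0.5 * x)     # error spread increases with x (heteroskedastic)

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print(lm_pvalue)    # a small p-value rejects the null of homoscedasticity
```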
Consequences of Autocorrelation

1. The OLS estimators are still unbiased and consistent, because neither unbiasedness nor consistency depends on the violated assumption.
2. The OLS estimators will be inefficient and therefore no longer BLUE.
3. The estimated variances of the regression coefficients will be biased and inconsistent, and therefore hypothesis testing is no longer valid. In most cases, the R² will be overestimated and the t-statistics will tend to be higher.
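A standard quick check for first-order autocorrelation of residuals is the Durbin-Watson statistic; a minimal sketch with statsmodels on simulated AR(1) errors:

```python
# Sketch: Durbin-Watson statistic (about 2 = no first-order autocorrelation;
# near 0 = positive autocorrelation; near 4 = negative autocorrelation).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(2)
x = np.arange(100, dtype=float)
e = np.zeros(100)
for t in range(1, 100):
    e[t] = 0.8 * e[t - 1] + rng.normal()   # AR(1) errors -> positive autocorrelation
y = 1 + 0.5 * x + e

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(durbin_watson(fit.resid))
```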
Hypothesis
• A hypothesis is an educated guess about something in the world
around you. It should be testable, either by experiment or
observation.
• The null and alternative hypotheses are two mutually exclusive
statements about a population.
• A hypothesis test uses sample data to determine whether to reject
the null hypothesis. The alternative hypothesis is what you might
believe to be true or hope to prove true.
Null and Alternative hypothesis
• The null hypothesis states that a population parameter (such as the
mean, the standard deviation, and so on) is equal to a hypothesized
value. The null hypothesis is often an initial claim that is based on
previous analyses or specialized knowledge.
• The alternative hypothesis states that a population parameter is smaller than, greater than, or different from the hypothesized value in the null hypothesis. The alternative hypothesis is what you might believe to be true or hope to prove true.
One-sided and two-sided hypotheses
• Use a two-sided alternative hypothesis (also known as a
nondirectional hypothesis) to determine whether the population
parameter is either greater than or less than the hypothesized value.
A two-sided test can detect when the population parameter differs in
either direction, but has less power than a one-sided test.
What do H0 and H1 mean?
• The null hypothesis (H0) is a statement of “no difference,” “no association,” or “no treatment effect.”
• The alternative hypothesis (H1) is a statement of “difference,” “association,” or “treatment effect.”
• H0 is assumed to be true until proven otherwise.
Steps in Hypothesis Testing
• These are the five steps of hypothesis testing:
• Step 1: Specify the null hypothesis.
• Step 2: Specify the alternative hypothesis.
• Step 3: Set the significance level.
• Step 4: Calculate the test statistic and the corresponding p-value.
• Step 5: Draw a conclusion.
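A minimal sketch of the five steps using a one-sample t-test in scipy, on invented data:

```python
# Sketch: the five steps of hypothesis testing with a one-sample t-test.
import numpy as np
from scipy import stats

sample = np.array([5.1, 4.9, 5.6, 5.3, 5.8, 5.0, 5.4])   # invented measurements

# Steps 1-2: H0: mean = 5.0   vs.   H1: mean != 5.0
# Step 3: significance level
alpha = 0.05
# Step 4: test statistic and p-value
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)
# Step 5: conclusion
print(t_stat, p_value)
print("reject H0" if p_value < alpha else "fail to reject H0")
```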
How do you reject a null hypothesis with a t-test?
• If the absolute value of the t-value is greater than the critical value,
you reject the null hypothesis.
• If the absolute value of the t-value is less than the critical value, you
fail to reject the null hypothesis.
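A minimal sketch of this critical-value rule, with a hypothetical t-value and sample size, using scipy's t distribution:

```python
# Sketch: two-sided critical-value decision rule for a t-test.
from scipy import stats

t_value = 2.3    # hypothetical test statistic
n = 7            # hypothetical sample size
alpha = 0.05

t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)   # two-sided critical value
print("reject H0" if abs(t_value) > t_crit else "fail to reject H0")
```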
How do you reject a null hypothesis with a p-value?
• If the p-value is less than 0.05, we reject the null hypothesis that
there's no difference between the means and conclude that a
significant difference does exist.
• If the p-value is larger than 0.05, we cannot conclude that a
significant difference exists.
Rejection Region
Difference between Z-test and t-test:
• Z-test is used when sample size is large (n>50), or the population
variance is known.
• t-test is used when sample size is small (n<50) and population
variance is unknown.
Difference between Z-test and F-test
• A z-test is used for testing the mean of a population versus a
standard, or comparing the means of two populations, with large (n ≥
30) samples whether you know the population standard deviation or
not.
• An F-test is used to compare two populations' variances. The samples can be any size. It is the basis of ANOVA.
Standard Deviation vs Standard Error
• Standard deviation (SD) measures the dispersion of a dataset relative
to its mean.
• The standard error (SE) of a statistic is the approximate standard
deviation of a statistical sample population.
• Standard error of the mean (SEM) measures how much discrepancy there is likely to be between a sample's mean and the population mean.
• The SEM is the SD divided by the square root of the sample size: SEM = SD / √n.
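A small sketch of that relationship on invented numbers:

```python
# Sketch: standard deviation vs. standard error of the mean (SEM = SD / sqrt(n)).
import numpy as np

sample = np.array([12.0, 15.0, 14.0, 10.0, 13.0, 16.0, 11.0])

sd = sample.std(ddof=1)            # sample standard deviation
sem = sd / np.sqrt(sample.size)    # standard error of the mean
print(sd, sem)
```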
An Example

Reality \ Court decision             Person is declared “not guilty”    Person is declared “guilty”
Person is “innocent” (H0 is true)    Correct decision                   Type I error
Person is “guilty” (H1 is true)      Type II error                      Correct decision
Type I and Type II Errors
• Alternative hypothesis H1: the hypothesis that we are interested in proving.
• Null hypothesis H0: the complement of the alternative hypothesis.
• Type I error: rejecting the null hypothesis when it is correct.
It is measured by the level of significance α, i.e., the probability of a Type I error.
This is the probability of falsely rejecting the null hypothesis.
• Type II error: failing to reject the null hypothesis when it is wrong.
It is measured by the probability of a Type II error, β.
Furthermore, 1 − β is called the power of the test, which is the probability of correctly rejecting the null hypothesis.
• Critical value: the dividing point between the region where the null hypothesis is rejected and the region where it is not rejected.
Statistical significance vs Economic
Significance
• Statistical significance itself doesn't imply that your results have
practical consequence.
• If you use a test with very high power, you might conclude that a
small difference from the hypothesized value is statistically significant.
• However, that small difference might be meaningless to your
situation. You should use your specialized knowledge to determine
whether the difference is practically significant.
Types of Multivariate Regression

 Multiple linear regression is for normally distributed outcomes

 Logistic regression is for binary outcomes

 Cox proportional hazards regression is used when time-to-event is the outcome
Outcome (dependent variable): Continuous
Example outcome variable: blood pressure
Appropriate multivariate regression model: linear regression
Example equation: blood pressure (mmHg) = α + β_salt·salt consumption (tsp/day) + β_age·age (years) + β_smoker·ever smoker (yes=1/no=0)
What the coefficients give you: slopes, which tell you how much the outcome variable increases for every 1-unit increase in each predictor.

Outcome (dependent variable): Binary
Example outcome variable: high blood pressure (yes/no)
Appropriate multivariate regression model: logistic regression
Example equation: ln(odds of high blood pressure) = α + β_salt·salt consumption (tsp/day) + β_age·age (years) + β_smoker·ever smoker (yes=1/no=0)
What the coefficients give you: odds ratios, which tell you how much the odds of the outcome increase for every 1-unit increase in each predictor.

Outcome (dependent variable): Time-to-event
Example outcome variable: time to death
Appropriate multivariate regression model: Cox regression
Example equation: ln(rate of death) = α + β_salt·salt consumption (tsp/day) + β_age·age (years) + β_smoker·ever smoker (yes=1/no=0)
What the coefficients give you: hazard ratios, which tell you how much the rate of the outcome increases for every 1-unit increase in each predictor.
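As a rough illustration of the logistic-regression row, the sketch below fits such a model with statsmodels; the data are simulated and the variable names (salt, age, high_bp) merely mirror the hypothetical example above:

```python
# Sketch: logistic regression; exponentiated coefficients are odds ratios.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 200
salt = rng.uniform(1, 5, n)                    # tsp/day
age = rng.uniform(30, 75, n)                   # years
p = 1 / (1 + np.exp(-(-8 + 0.8 * salt + 0.1 * age)))
high_bp = rng.binomial(1, p)                   # 1 = high blood pressure

X = sm.add_constant(pd.DataFrame({"salt": salt, "age": age}))
fit = sm.Logit(high_bp, X).fit(disp=0)

print(np.exp(fit.params))                      # odds ratios per 1-unit increase in each predictor
```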
Statistics for various types of outcome data

Are the observations independent or correlated?

Continuous outcome (e.g. pain scale, cognitive function)
 Independent: t-test, ANOVA, linear correlation, linear regression
 Correlated: paired t-test, repeated-measures ANOVA, mixed models/GEE modeling
 Assumptions: outcome is normally distributed (important for small samples); outcome and predictor have a linear relationship.

Binary or categorical outcome (e.g. fracture yes/no)
 Independent: difference in proportions, relative risks, chi-square test, logistic regression
 Correlated: McNemar’s test, conditional logistic regression, GEE modeling
 Assumptions: the chi-square test assumes sufficient numbers in each cell (>=5).

Time-to-event outcome (e.g. time to fracture)
 Independent: Kaplan-Meier statistics, Cox regression
 Correlated: n/a
 Assumptions: Cox regression assumes proportional hazards between groups.
Continuous outcome (Means)

Are the observations independent or correlated?

Independent:
 t-test: compares means between two independent groups
 ANOVA: compares means between more than two independent groups
 Pearson’s correlation coefficient (linear correlation): shows linear correlation between two continuous variables
 Linear regression: multivariate regression technique used when the outcome is continuous; gives slopes

Correlated:
 Paired t-test: compares means between two related groups (e.g., the same subjects before and after)
 Repeated-measures ANOVA: compares changes over time in the means of two or more groups (repeated measurements)
 Mixed models/GEE modeling: multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time

Alternatives if the normality assumption is violated (and small sample size):
 Wilcoxon signed-rank test: non-parametric alternative to the paired t-test
 Wilcoxon rank-sum test (= Mann-Whitney U test): non-parametric alternative to the t-test
 Kruskal-Wallis test: non-parametric alternative to ANOVA
 Spearman rank correlation coefficient: non-parametric alternative to Pearson’s correlation coefficient
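For instance, the paired t-test and its non-parametric alternative can be run side by side in scipy; the before/after values below are invented:

```python
# Sketch: paired t-test vs. Wilcoxon signed-rank test on the same paired data.
import numpy as np
from scipy import stats

before = np.array([6.1, 7.0, 5.5, 8.2, 6.8, 7.4, 5.9, 6.5])
after  = np.array([5.0, 6.2, 5.1, 7.0, 6.0, 6.8, 5.2, 5.9])

print(stats.ttest_rel(before, after))   # assumes roughly normal differences
print(stats.wilcoxon(before, after))    # non-parametric alternative (no normality assumption)
```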
Binary or categorical outcomes (Proportions)

Are the observations correlated?

Independent:
 Chi-square test: compares proportions between two or more groups
 Relative risks: odds ratios or risk ratios
 Logistic regression: multivariate technique used when the outcome is binary; gives multivariate-adjusted odds ratios

Correlated:
 McNemar’s chi-square test: compares a binary outcome between correlated groups (e.g., before and after)
 Conditional logistic regression: multivariate regression technique for a binary outcome when groups are correlated (e.g., matched data)
 GEE modeling: multivariate regression technique for a binary outcome when groups are correlated (e.g., repeated measures)

Alternatives to the chi-square test if there are sparse cells:
 Fisher’s exact test: compares proportions between independent groups when there are sparse data (some cells <5)
 McNemar’s exact test: compares proportions between correlated groups when there are sparse data (some cells <5)
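Similarly, a minimal sketch contrasting the chi-square test with Fisher’s exact test on an invented 2x2 table (scipy):

```python
# Sketch: chi-square test vs. Fisher's exact test for a 2x2 table of counts.
import numpy as np
from scipy import stats

# rows = group A / group B, columns = fracture yes / fracture no (invented counts)
table = np.array([[12, 38],
                  [ 4, 46]])

chi2, p_chi2, dof, expected = stats.chi2_contingency(table)
odds_ratio, p_fisher = stats.fisher_exact(table)   # preferred when some cells are sparse (<5)

print(p_chi2, p_fisher)
```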
Time-to-event outcome (survival data)

Are the observation groups independent or correlated?

Independent:
 Kaplan-Meier statistics: estimates survival functions for each group (usually displayed graphically); compares survival functions with the log-rank test
 Cox regression: multivariate technique for time-to-event data; gives multivariate-adjusted hazard ratios

Correlated: n/a (already over time)

Modifications to Cox regression if proportional hazards is violated: time-dependent predictors or time-dependent hazard ratios (tricky!)
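A rough sketch of these survival analyses, assuming the third-party lifelines package is available; the data frame and column names here are hypothetical:

```python
# Sketch: Kaplan-Meier survival estimate and a Cox proportional hazards model.
import pandas as pd
from lifelines import KaplanMeierFitter, CoxPHFitter

df = pd.DataFrame({
    "time":  [5, 8, 12, 3, 9, 15, 7, 11],    # follow-up time (e.g., years to fracture)
    "event": [1, 0,  1, 1, 0,  0, 1,  1],    # 1 = event observed, 0 = censored
    "age":   [70, 60, 65, 80, 55, 50, 75, 68],
})

km = KaplanMeierFitter().fit(df["time"], event_observed=df["event"])
print(km.survival_function_)                  # estimated survival curve

cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
cph.print_summary()                           # coefficients and hazard ratios (exp(coef))
```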
