
REGRESSION ANALYSIS

SECTION 1
INTRODUCTION
HISTORY

1805: Legendre, method of least squares
1809: Gauss, method of least squares
1822-1911: Sir Francis Galton, coined the term "regression"
HISTORY

1871-1951: George Udny Yule, joint distribution was assumed to be Gaussian
1857-1936: Karl Pearson, joint distribution was assumed to be Gaussian
1890-1962: Sir Ronald Fisher, weakened the assumption of Yule and Pearson
HISTORY
Gauss published a further development of
the theory of least squares, including a
version of the Gauss–Markov theorem

1805 –
1821
1809

The earliest form of regression was the method of least


squares, published by Legendre in 1805, and by Gauss in
1809. They both applied the method to the problem of
determining, from astronomical observations, the orbits of
bodies about the Sun (mostly comets, but also later the
then newly discovered minor planets).
HISTORY

1890 / 1897 / 1903

The term "regression" was coined by Francis Galton to describe a biological phenomenon: the heights of descendants of tall ancestors tend to regress down towards a normal average (a phenomenon also known as regression toward the mean). For Galton, regression had only this biological meaning.

Galton's work was later extended by Udny Yule and Karl Pearson to a more general statistical context. In the work of Yule and Pearson, the joint distribution of the response and explanatory variables is assumed to be Gaussian.
HISTORY

1922 / 1925

This assumption was weakened by R.A. Fisher. He assumed that the conditional distribution of the response variable is Gaussian, but the joint distribution need not be. In this respect, Fisher's assumption is closer to Gauss's formulation of 1821.

1950s / 1960s

Economists used electromechanical desk calculators to calculate regressions.
HISTORY

Before 1970, it sometimes took up to 24 hours to receive the result from one regression.

Regression methods continue to be an area of active research. In recent decades, new methods have been developed for robust regression; regression involving correlated responses such as time series and growth curves; regression in which the predictor (independent variable) or response variables are curves, images, graphs, or other complex data objects; regression methods accommodating various types of missing data; nonparametric regression; Bayesian methods for regression; regression in which the predictor variables are measured with error; regression with more predictor variables than observations; and causal inference with regression.
STATISTICAL MODELING
Describes Relationship between Variables

Deterministic Models
- Hypothesize exact relationships
- Suitable when prediction error is negligible
- Example: Body mass index (BMI) is a measure of body fat based on height and weight
- Metric formula: BMI = weight (kg) / height (m)²
- Non-metric formula: BMI = 703 × weight (lb) / height (in)²

Probabilistic Models
- Hypothesize 2 components: a deterministic component + random error
- Example: Systolic blood pressure of newborns is 6 times the age in days + random error
- Random error may be due to factors other than age in days (e.g. birthweight)
STATISTICAL MODELING
Types of Probabilistic Models

Probabilistic models include regression models, correlation models, and other models.
TYPES OF REGRESSION MODELS

Simple (1 explanatory variable): linear or non-linear
Multiple (2+ explanatory variables): linear or non-linear


REGRESSION ANALYSIS

In analyzing data for the health sciences disciplines, we find that it is frequently desirable to learn something about the relationship between two numeric variables. We may, for example, be interested in studying the relationship between blood pressure and age, height and weight, the concentration of an injected drug and heart rate, the consumption level of some nutrient and weight gain, the intensity of a stimulus and reaction time, or total family income and medical care expenditures. The nature and strength of the relationships between variables such as these may be examined using linear models such as regression and correlation analysis, two statistical techniques that, although related, serve different purposes.

REGRESSION ANALYSIS

Regression analysis is helpful in assessing specific forms of the relationship between variables, and the ultimate objective when this method of analysis is employed usually is to predict or estimate the value of one variable corresponding to a given value of another variable.
REGRESSION MODELING STEPS

1. Define problem or question
2. Specify model
3. Collect data
4. Do descriptive data analysis
5. Estimate unknown parameters
6. Evaluate model
7. Use model for prediction

SIMPLE VS. MULTIPLE

Simple regression:
1. β₁ represents the unit change in Y per unit change in X
2. does not take into account any other variable besides the single independent variable

Multiple regression:
1. βᵢ represents the unit change in Y per unit change in Xᵢ
2. takes into account the effect of the other independent variables
3. each βᵢ is a net regression coefficient
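To make the contrast concrete, here is a minimal sketch fitting a simple and a multiple model to the same synthetic data with statsmodels; the data, seed, and coefficient values are invented for illustration.

```python
# Sketch: simple vs. multiple regression on made-up data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(size=100)

# Simple: one explanatory variable; the single slope ignores x2 entirely.
simple = sm.OLS(y, sm.add_constant(x1)).fit()

# Multiple: each beta_i is a net regression coefficient, i.e. the unit
# change in Y per unit change in X_i holding the other X's fixed.
multiple = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

print(simple.params)    # [beta0, beta1]
print(multiple.params)  # [beta0, beta1, beta2]
```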
ASSUMPTIONS

1. CONTINUOUS VARIABLES: the two variables should be either interval or ratio variables
2. LINEARITY: the Y variable is linearly related to the value of the X variable
3. INDEPENDENCE OF ERROR: the error (residual) is independent for each value of X
4. NO SIGNIFICANT OUTLIERS: outliers can have a negative effect on the regression analysis
5. HOMOSCEDASTICITY: the variation around the line of regression should be constant for all values of X
6. NORMALITY: the values of Y should be normally distributed at each value of X
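A rough sketch of how some of these assumptions can be checked from the residuals of a fitted line (synthetic data; the checks and thresholds are illustrative, not exhaustive):

```python
# Sketch: residual-based checks for normality, homoscedasticity, outliers.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 80)
y = 3.0 + 1.5 * x + rng.normal(scale=2.0, size=x.size)

b1, b0 = np.polyfit(x, y, 1)   # fitted slope and intercept
resid = y - (b0 + b1 * x)

# Normality of Y at each X ~ normality of the residuals (Shapiro-Wilk).
print("normality p-value:", stats.shapiro(resid).pvalue)

# Homoscedasticity: compare residual spread in the low-X and high-X halves.
lo, hi = resid[x < np.median(x)], resid[x >= np.median(x)]
print("spread low/high X:", lo.std(ddof=1), hi.std(ddof=1))

# Outliers: flag residuals more than 3 standard deviations out.
z = (resid - resid.mean()) / resid.std(ddof=1)
print("outlier count:", int((np.abs(z) > 3).sum()))
```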

GOAL

Develop a statistical model that can predict the values of a dependent (response) variable based upon the values of the independent (explanatory) variables.
SECTION 2
LINEAR REGRESSION ANALYSIS
TYPES OF CORRELATION

Positive correlation Negative correlation No correlation


SIMPLE LINEAR REGRESSION

Simple linear regression describes the linear relationship between a predictor variable, plotted on the x-axis, and a response variable, plotted on the y-axis.

[Scatterplot: Independent Variable (X) on the x-axis, Dependent Variable (Y) on the y-axis]
LINEAR EQUATION

Y = mX + b
m = slope = change in Y / change in X
b = Y-intercept

A straight line is the simplest model of the relationship between two interval-scaled attributes, and its slope gives us an indication of the existence of an association between them. Therefore, an objective way to investigate an association between interval attributes is to draw a straight line through the center of the cloud of points and measure its slope. If the slope is zero, the line is horizontal and we conclude that there is no association. If it is non-zero, then we can conclude that there is an association.

So we have two problems to solve:
- how to draw the straight line that best models the relationship between attributes, and
- how to determine whether its slope is different from zero.
LINEAR REGRESSION MODEL
Relationship Between Variables Is a Linear Function

Yᵢ = β₀ + β₁Xᵢ + εᵢ

where Yᵢ is the dependent (response) variable, Xᵢ is the independent (explanatory) variable, β₀ is the population Y-intercept, β₁ is the population slope, and εᵢ is the random error.
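As a concrete illustration, here is a minimal sketch of simulating data from this model; the parameter values (β₀ = 4, β₁ = 2) and the uniform X are arbitrary choices for the example.

```python
# Sketch: generate data from Y_i = beta0 + beta1 * X_i + eps_i.
import numpy as np

rng = np.random.default_rng(2)
beta0, beta1 = 4.0, 2.0             # population Y-intercept and slope
X = rng.uniform(0, 10, size=50)     # independent (explanatory) variable
eps = rng.normal(0, 1.0, size=50)   # random error
Y = beta0 + beta1 * X + eps         # dependent (response) variable
print(Y[:5])
```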
LINEAR REGRESSION MODEL
Relationship Between Variables Is a Linear Function

Population linear regression model (unknown relationship):
Yᵢ = β₀ + β₁Xᵢ + εᵢ, where εᵢ is the random error and E(Y) = β₀ + β₁Xᵢ.

Sample linear regression model (fitted to the observed values):
Yᵢ = β̂₀ + β̂₁Xᵢ + ε̂ᵢ, with fitted line Ŷᵢ = β̂₀ + β̂₁Xᵢ.
SECTION 3
THE ORDINARY LEAST SQUARES METHOD (OLS)

How to fit data to a linear model? The ordinary least squares (OLS) method.
THE ORDINARY LEAST SQUARES METHOD
Overview

"Best fit" means the differences between the actual Y values and the predicted Y values are at a minimum. But positive differences offset negative ones, so we square the errors: OLS minimizes the Sum of the Squared Errors (SSE),

SSE = Σ(Yᵢ − Ŷᵢ)² = Σε̂ᵢ²
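A small sketch of this point on made-up numbers: the raw residuals of both a good and a bad line can sum to roughly zero, but the squared errors rank the fits correctly.

```python
# Sketch: why we square the errors before summing.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

def sse(b0, b1):
    resid = y - (b0 + b1 * x)
    return np.sum(resid ** 2)

# OLS line: residuals sum to ~0 AND the SSE is minimal.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print("OLS SSE:", sse(b0, b1))

# Flat line through ybar: its residuals also sum to 0, but SSE is larger.
print("flat-line SSE:", sse(y.mean(), 0.0))
```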
LEAST SQUARES GRAPHICALLY

The fitted line is Ŷᵢ = β̂₀ + β̂₁Xᵢ, and each residual is ε̂ᵢ = Yᵢ − Ŷᵢ.

LS minimizes Σε̂ᵢ² = ε̂₁² + ε̂₂² + ε̂₃² + ε̂₄²

The sum of squares of the residuals, Σ(Yᵢ − Ŷᵢ)², is at a minimum: we must find the values of β̂₀ and β̂₁ that minimize it.
THE REGRESSION COEFFICIENTS

β₁ = Sxy / Sxx = σxy / σx²
β₀ = Ȳ − β₁X̄

COEFFICIENT EQUATIONS

Prediction equation: Ŷᵢ = β̂₀ + β̂₁Xᵢ
Sample slope: β̂₁ = Sxy / Sxx = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
Sample Y-intercept: β̂₀ = ȳ − β̂₁x̄
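A direct translation of these formulas into Python, on illustrative data:

```python
# Sketch: compute the OLS coefficients from the slide's formulas.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

Sxy = np.sum((x - x.mean()) * (y - y.mean()))
Sxx = np.sum((x - x.mean()) ** 2)

beta1_hat = Sxy / Sxx                        # sample slope
beta0_hat = y.mean() - beta1_hat * x.mean()  # sample Y-intercept
y_pred = beta0_hat + beta1_hat * x           # prediction equation
print(beta0_hat, beta1_hat)
```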
INTERPRETATION

1. Slope (β̂₁): the estimated Y changes by β̂₁ for each 1-unit increase in X.
   If β̂₁ = 2, then Y is expected to increase by 2 for each 1-unit increase in X.

2. Y-intercept (β̂₀): if β̂₀ = 4, then the average Y is expected to be 4 when X is 0.
REQUIRED STATISTICS

X̄ = ΣX / n
Ȳ = ΣY / n
DESCRIPTIVE STATISTICS

Var(X) = Σ(Xᵢ − X̄)² / (n − 1) = Sxx / (n − 1)
Var(Y) = Σ(Yᵢ − Ȳ)² / (n − 1) = Syy / (n − 1), where Syy = SST
Covar(X, Y) = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / (n − 1) = Sxy / (n − 1)
REGRESSION STATISTICS

The Sum of Squares Regression (SSR) is the sum of the squared differences between the prediction for each observation and the mean of the observed Y values.

The Total Sum of Squares (SST) is equal to SSR + SSE.

SSR = Σ(Ŷ − Ȳ)²  (measure of explained variation)
SSE = Σ(Y − Ŷ)²  (measure of unexplained variation)
SST = Σ(Y − Ȳ)² = SSR + SSE  (measure of total variation in Y)
[Diagram: the total variance of Y to be explained by the predictors is SST; the part of Y's variance explained by X1 is SSR, and the part not explained by X1 is SSE.]
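A sketch computing the three sums of squares on illustrative data and verifying the identity SST = SSR + SSE:

```python
# Sketch: SSR, SSE, SST from an OLS fit.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

SSR = np.sum((y_hat - y.mean()) ** 2)  # explained variation
SSE = np.sum((y - y_hat) ** 2)         # unexplained variation
SST = np.sum((y - y.mean()) ** 2)      # total variation
assert np.isclose(SST, SSR + SSE)
print("R^2:", SSR / SST)
```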
THE COEFFICIENT OF DETERMINATION

The proportion of the total variation (SST) that is explained by the regression (SSR) is known as the Coefficient of Determination, and is often referred to as R²:

R² = SSR / SST

The value of R² can range between 0 and 1; the higher its value, the more accurate the regression model is. It is often reported as a percentage.

THE COEFFICIENT OF DETERMINATION

R² is an important measure of association between variables. It is represented as R² because its value is the square of another measure of association frequently used, the correlation coefficient, which is represented by r.

Although we can obtain R² from r, the two measures are not completely equivalent:
- R² has values between 0 and 1
- r ranges from -1 to +1
- r, in addition to providing a measure of the strength of an association, also informs us of the type of association

In both cases, the greater the absolute value of the coefficient, the greater the strength of the association. Unlike the coefficient of determination, the correlation coefficient is an abstract value that has no direct and precise interpretation, somewhat like a score.
THE COEFFICIENT OF DETERMINATION

These two measures are related to the degree of dispersion of the observations about the regression line. In a scatterplot, when the two variables are independent, the points will be distributed over the entire area of the plot; the regression line is horizontal and the coefficient of determination is zero. When an association exists, the regression line is oblique and the points are more or less spread along the line. The higher the strength of the association, the less the dispersion of the points around the line and the greater will be R² and the absolute value of r. If all the points are on the line, R² has value 1 and r has value +1 or -1.
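A brief sketch of the relation between the two measures on made-up data: r is the signed square root of R², carrying the sign of the slope.

```python
# Sketch: r vs. R^2 for a strong negative association.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([9.8, 8.2, 5.9, 4.1, 2.0])  # decreasing: negative association

r = np.corrcoef(x, y)[0, 1]
print("r:", r)          # close to -1
print("R^2:", r ** 2)   # close to +1
```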
THE COEFFICIENT OF DETERMINATION

The importance of these measures of association comes from the fact that it is very common to find evidence of association between two variables, and it is the strength of the association that tells us whether it has some important meaning. In clinical research, associations explaining less than 50% of the variance of the dependent variable, that is, associations with R² less than 0.50 or, equivalently, with r between -0.70 and +0.70, are usually not regarded as important.
STANDARD ERROR OF REGRESSION

The Standard Error of a regression is a measure of its variability. It can be used in a similar manner to standard deviation, allowing for prediction intervals.

Sₑ = √(S²ₑ)
To estimate the standard error of the regression coefficient:

1. From the regression equation, compute the predicted values ŷ of the dependent variable.
2. Compute the variance of the residuals from y and ŷ:
   S²ₑ = Σ(y − ŷ)² / (n − 2) = SSE / (n − 2) = MSE (Mean Squared Error)
3. Obtain the sum of squares of x from the variance of x:
   Σ(x − x̄)² = S²ₓ (n − 1)
4. The standard error of the regression coefficient is:
   SE(β̂₁) = √( S²ₑ / Σ(x − x̄)² )

 
This estimate of the true standard error of β̂₁ is unbiased on the condition that the dispersion of the points about the regression line is approximately the same along the length of the line. This will happen if the variance of Y is the same for every value of X, that is, if Y is homoscedastic. If this condition is not met, then the estimate of the standard error of β̂₁ may be larger or smaller than the true standard error, and there is no way of telling which.

In summary, we can estimate the standard error of the regression coefficient from our sample and construct confidence intervals, under the following assumptions:
- The dependent variable has a normal distribution for all values of the independent variable.
- The variance of the dependent variable is equal for all values of the independent variable.
- If the independent variable is interval, its distribution is normal.
- The relationship between the two variables is linear.
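A sketch of steps 1-4 on illustrative data, extended with a 95% confidence interval for the slope (the t critical value uses n − 2 degrees of freedom):

```python
# Sketch: standard error and 95% CI for the slope.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.2, 3.9, 6.1, 7.8, 10.2, 11.9])
n = x.size

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)                        # step 1

mse = np.sum(resid ** 2) / (n - 2)               # step 2: S_e^2 = MSE
se_b1 = np.sqrt(mse / np.sum((x - x.mean()) ** 2))  # steps 3-4

t_crit = stats.t.ppf(0.975, df=n - 2)
print("slope:", b1)
print("95% CI:", (b1 - t_crit * se_b1, b1 + t_crit * se_b1))
```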
F TEST IN LINEAR REGRESSION

We can test the null hypothesis that β₁ = 0 with a different test based on analysis of variance.

The figure compares a situation where the null hypothesis is true, on the left, with a situation where the null hypothesis is false, on the right. When the two variables are independent, β₁ = 0 and the slope of the sample regression line will be very nearly zero (not exactly zero because of sampling variation).

RESIDUAL MEAN SQUARE: An estimate of the variance of Y for fixed values of X can be obtained from the variance of the residuals, that is, the variance of the departures of each y from the value predicted by the regression.

If the null hypothesis is false, the regression line will be steep and the departures of the values y from the regression line will be less than the departures from ȳ. Therefore, the residual mean square will be smaller than the total variance of Y. We can compare the two estimates by taking their ratio. The resulting variance ratio would follow an F distribution if the two estimates of the variance were independent, and if the null hypothesis were false the variance ratio would have a value much larger than expected under H₀.
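A sketch of this variance-ratio test on illustrative data, with 1 and n − 2 degrees of freedom:

```python
# Sketch: F test for H0: beta1 = 0 via the variance ratio MSR / MSE.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.2, 3.9, 6.1, 7.8, 10.2, 11.9])
n = x.size

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

MSR = np.sum((y_hat - y.mean()) ** 2) / 1   # regression mean square
MSE = np.sum((y - y_hat) ** 2) / (n - 2)    # residual mean square
F = MSR / MSE
p = stats.f.sf(F, 1, n - 2)                 # upper-tail F probability
print("F:", F, "p-value:", p)
```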
SECTION 4
QUIZ

The regression line is drawn so that:


A. The line goes through more points than any other possible line, straight or
curved
B. The line goes through more points than any other possible straight line.
C. The same number of points are below and above the regression line.
D. The sum of the absolute errors is as small as possible.
E. The sum of the squared errors is as small as possible.

E. The sum of the squared errors is as small as possible.


Q U I Z

In order for the regression technique to give the best and minimum variance prediction, all the
following conditions must be met, EXCEPT for:
A. The relation is linear.
B. We have not omitted any significant variable.
C. Both the X and Y variables (the predictors and the response) are normally distributed.
D. The residuals (errors) are normally distributed.
E. The variance around the regression line is about the same for all values of the predictor.

C. Both the X and Y variables (the predictors and the response) are normally
distributed.
Q U I Z

If a regression has the problem of heteroscedasticity,


A. The predictions it makes will be wrong on average.
B. The predictions it makes will be correct on average, but we will not be certain of
the RMSE (root-mean-square error)
C. It will also have the problem of an omitted variable or variables.
D. It will also be based on a non-linear equation

The assumption of homoscedasticity (meaning “same variance”) is central to linear regression models.  Homoscedasticity
describes a situation in which the error term (that is, the “noise” or random disturbance in the relationship between the
independent variables and the dependent variable) is the same across all values of the independent variables. 
Heteroscedasticity (the violation of homoscedasticity) is present when the size of the error term differs across values of an
independent variable.  The impact of violating the assumption of homoscedasticity is a matter of degree, increasing as
heteroscedasticity increases.

B. Heteroscedasticity implies that the variance will differ for different values of the
regressor.
Q U I Z

In regression, the equation that describes how the response variable (y) is
related to the explanatory variable (x) is:
a. the correlation model
b. the regression model
c. used to compute the correlation coefficient
d. None of these alternatives is correct.

b. the regression model



Q U I Z

The relationship between number of beers consumed (x) and blood alcohol content (y) was studied
in 16 male college students by using least squares regression. The following regression equation
was obtained from this study:
y= -0.0127 + 0.0180x
The above equation implies that:
a. each beer consumed increases blood alcohol by 1.27%
b. on average it takes 1.8 beers to increase blood alcohol content by 1%
c. each beer consumed increases blood alcohol by an average amount of 1.8%
d. each beer consumed increases blood alcohol by exactly 0.018

c. each beer consumed increases blood alcohol by an average amount of 1.8%


Q U I Z

SSE can never be


a. larger than SST
b. smaller than SST
c. equal to 1
d. equal to zero

a. larger than SST


Q U I Z

Regression modeling is a statistical framework for developing a


mathematical equation that describes how
a. one explanatory and one or more response variables are related
b. several explanatory and several response variables response are related
c. one response and one or more explanatory variables are related
d. All of these are correct.

c. one response and one or more explanatory variables are related


Q U I Z

In regression analysis, the variable that is being predicted is the


a. response, or dependent, variable
b. independent variable
c. intervening variable
d. is usually x

a. response, or dependent, variable


Q U I Z

In least squares regression, which of the following is not a required


assumption about the error term ε?
a. The expected value of the error term is one.
b. The variance of the error term is the same for all values of x.
c. The values of the error term are independent.
d. The error term is normally distributed.

a. The expected value of the error term is one.


Q U I Z

Larger values of r² imply that the observations are more closely grouped about the
a. average value of the independent variables
b. average value of the dependent variable
c. least squares line
d. origin

c. least squares line


Q U I Z

In a regression analysis, if r² = 1, then


a. SSE must also be equal to one
b. SSE must be equal to zero
c. SSE can be any positive value
d. SSE must be negative

b. SSE must be equal to zero


Q U I Z

In regression analysis, the variable that is used to explain the change in the
outcome of an experiment, or some natural process, is called
a. the x-variable
b. the independent variable
c. the predictor variable
d. the explanatory variable
e. all of the above (a-d) are correct
f. none are correct

e. all of the above (a-d) are correct


Q U I Z

In the case of an algebraic model for a straight line, if a value for the x variable is
specified, then
a. the exact value of the response variable can be computed
b. the computed response to the independent value will always give a minimal
residual
c. the computed value of y will always be the best estimate of the mean response
d. none of these alternatives is correct.

a. the exact value of the response variable can be computed


Q U I Z

 
In a regression and correlation analysis, if r² = 1, then
a. SSE = SST
b. SSE = 1
c. SSR = SSE
d. SSR = SST

d. SSR = SST
Q U I Z

If the coefficient of determination is a positive value, then the regression equation


a. must have a positive slope
b. must have a negative slope
c. could have either a positive or a negative slope
d. must have a positive y intercept

c. could have either a positive or a negative slope


Q U I Z

If two variables, x and y, have a very strong linear relationship, then


a. there is evidence that x causes a change in y
b. there is evidence that y causes a change in x
c. there might not be any causal relationship between x and y
d. None of these alternatives is correct.

c. there might not be any causal relationship between x and y


Q U I Z

In regression analysis, if the independent variable is measured in kilograms, the


dependent variable
a. must also be in kilograms
b. must be in some unit of weight
c. cannot be in kilograms
d. can be any units

d. can be any units


Q U I Z

In a regression analysis if SSE = 200 and SSR = 300, then the coefficient of
determination is
a. 0.6667
b. 0.6000
c. 0.4000
d. 1.5000

b. 0.6000
Q U I Z

A fitted least squares regression line


a. may be used to predict a value of y if the corresponding x value is given
b. is evidence for a cause-effect relationship between x and y
c. can only be computed if a strong linear relationship exists between x and y
d. None of these alternatives is correct.

a. may be used to predict a value of y if the corresponding x value is given


Q U I Z

You have carried out a regression analysis; but, after thinking about the relationship
between variables, you have decided you must swap the explanatory and the
response variables. After refitting the regression model to the data you expect that:
a. the value of the correlation coefficient will change
b. the value of SSE will change
c. the value of the coefficient of determination will change
d. the sign of the slope will change
e. nothing changes

b. the value of SSE will change


Q U I Z

Suppose you use regression to predict the height of a woman’s current boyfriend by using her
own height as the explanatory variable. Height was measured in feet from a sample of 100
women undergraduates, and their boyfriends, at Dalhousie University. Now, suppose that the
height of both the women and the men are converted to centimeters. The impact of this
conversion on the slope is:
a. the sign of the slope will change
b. the magnitude of the slope will change
c. both a and b are correct
d. neither a nor b are correct

d. neither a nor b are correct


Q U I Z

A residual plot:
a. displays residuals of the explanatory variable versus residuals of the response
variable.
b. displays residuals of the explanatory variable versus the response variable.
c. displays explanatory variable versus residuals of the response variable.
d. displays the explanatory variable versus the response variable.
e. displays the explanatory variable on the x axis versus the response variable on
the y axis.

c. displays explanatory variable versus residuals of the response variable.


Q U I Z

When the error terms have a constant variance, a plot of the residuals versus
the independent variable x has a pattern that
a. fans out
b. funnels in
c. fans out, but then funnels in
d. forms a horizontal band pattern
e. forms a linear pattern that can be positive or negative

d. forms a horizontal band pattern


THANK YOU
S E R V E R . R A R E D I S . O R G / E D U
