R E G R E S S I O N A N A L Y S I S
SECTION 1
I N T R O D U C T I O N
HISTORY

1809: Gauss and the method of least squares.

[Timeline of milestones: 1805–1821, 1809, 1897–1890, 1903, 1922–1925, 1950s, 1960s]
Later developments include Bayesian methods for regression, regression in which the predictor variables are measured with error, regression with more predictor variables than observations, and causal inference with regression.
Statistical modeling
D e s c r i b e s R e l a t i o n s h i p b e t w e e n V a r i a b l e s
(e.g. birthweight)
STATISTICAL MODELING
T y p e s o f P r o b a b i l i s t i c M o d e l s

PROBABILISTIC MODELS
- SIMPLE: 1 explanatory variable
- MULTIPLE: 2+ explanatory variables
Regression is used to study, for example, the relationship between blood pressure and age, height and weight, the intake of some nutrient and weight gain, the intensity of a stimulus and reaction time, or total family income and medical care expenditures. The nature and strength of such a relationship can be described by modeling one variable as a function of another variable.
REGRESSION MODELING: STEPS
1. Define problem
2. Specify model
…
6. Evaluate model
1. CONTINUOUS VARIABLES: the two variables should be either interval or ratio variables.
2. LINEARITY: the Y variable is linearly related to the value of the X variable.
3. INDEPENDENCE OF ERROR: the error (residual) is independent for each value of X.
4. NO SIGNIFICANT OUTLIERS: outliers can have a negative effect on the regression analysis.
5. HOMOSCEDASTICITY: the variation around the line of regression should be constant for all values of X.
6. NORMALITY: the values of Y should be normally distributed at each value of X.
GOAL
Develop a statistical model that can predict the value of one variable from another.

Y = mX + b
m = Slope = Change in Y / Change in X

Draw a straight line through the center of the cloud of points and measure its slope. If the slope is zero, the line is horizontal and we conclude that there is no association. If it is non-zero, there is evidence of an association.
Yᵢ = β₀ + β₁Xᵢ + εᵢ
Y: DEPENDENT (RESPONSE) VARIABLE
X: INDEPENDENT (EXPLANATORY) VARIABLE
L I N E A R R E G R E S S I O N M O D E L
R e l a t i o n s h i p B e t w e e n Va r i a b l e s I s a L i n e a r F u n c t i o n

Unknown population relationship:  Yᵢ = β₀ + β₁Xᵢ + εᵢ   (εᵢ = random error)
Sample relationship:              Yᵢ = β̂₀ + β̂₁Xᵢ + ε̂ᵢ
Fitted line (unsampled values):   Ŷᵢ = β̂₀ + β̂₁Xᵢ
Sample linear regression model:

Yᵢ = β₀ + β₁Xᵢ + εᵢ   (Yᵢ = observed value, εᵢ = random error)
E(Y) = β₀ + β₁Xᵢ
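The population/sample distinction above can be sketched numerically: we draw one observed sample from a hypothetical population model Yᵢ = β₀ + β₁Xᵢ + εᵢ. The parameter values below are arbitrary illustration choices, not values from these notes.

```python
import random

random.seed(42)

# Hypothetical population parameters (illustration only)
beta0, beta1, sigma = 4.0, 2.0, 1.0

# One observed sample from the model Y_i = beta0 + beta1 * X_i + eps_i
x = [i / 10 for i in range(50)]
y = [beta0 + beta1 * xi + random.gauss(0.0, sigma) for xi in x]

# E(Y | X) is the population line without the random error term
expected_y = [beta0 + beta1 * xi for xi in x]
```

In practice only the (x, y) pairs are observed; β₀, β₁, and the errors εᵢ are unknown and must be estimated from the sample.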
SECTION 3
T H E O R D I N A R Y L E A S T S Q U A R E S M E T H O D ( O L S )
THE ORDINARY LEAST SQUARES METHOD (OLS)
How to fit data to a linear model?

T H E O R D I N A R Y L E A S T S Q U A R E S M E T H O D
O v e r v i e w
Positive and negative errors would cancel out, so: square the errors!
L E A S T S Q U A R E S G R A P H I C A L L Y / L E A S T S Q U A R E S R E G R E S S I O N

LS minimizes Σ ε̂ᵢ² = ε̂₁² + ε̂₂² + ε̂₃² + ε̂₄²

Model line:  Ŷᵢ = β̂₀ + β̂₁Xᵢ
Residual:    ε̂ᵢ = Yᵢ − Ŷᵢ

Sum of squares of residuals = min Σ (Yᵢ − Ŷᵢ)²

We must find the values of β̂₀ and β̂₁ that minimise this sum.
T H E R E G R E S S I O N C O E F F I C I E N T S

β̂₁ = Sxy / Sxx = σxy / σx²
β̂₀ = Ȳ − β̂₁X̄

C O E F F I C I E N T E Q U A T I O N S

β̂₁ = Sxy / Sxx = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
β̂₀ = ȳ − β̂₁x̄
Ŷᵢ = β̂₀ + β̂₁Xᵢ
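A minimal sketch of these coefficient equations on toy data (the x and y values below are made up for illustration):

```python
# Arbitrary illustrative data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.3, 5.9, 8.2, 9.8]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# S_xy = sum (x_i - x_bar)(y_i - y_bar),  S_xx = sum (x_i - x_bar)^2
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

beta1 = s_xy / s_xx            # slope
beta0 = y_bar - beta1 * x_bar  # intercept

# Fitted values from the estimated line
y_hat = [beta0 + beta1 * xi for xi in x]
```

For this data the slope works out to 1.93 and the intercept to 0.27.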
I N T E R P R E T A T I O N

1. Slope (β̂₁): the estimated Y changes by β̂₁ for each 1-unit increase in X.
   If β̂₁ = 2, then Y is expected to increase by 2 for each 1-unit increase in X.

2. Y-intercept (β̂₀): if β̂₀ = 4, then the average Y is expected to be 4 when X is 0.
R E Q U I R E D S T A T I S T I C S

X̄ = ΣX / n    Ȳ = ΣY / n

D E S C R I P T I V E S T A T I S T I C S

Var(X) = Σᵢ₌₁ⁿ (Xᵢ − X̄)² / (n − 1) = Sxx / (n − 1)
Var(Y) = Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² / (n − 1) = Syy / (n − 1)    (Syy = SST)
Covar(X, Y) = Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) / (n − 1) = Sxy / (n − 1)
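A quick sketch of Sxx, Syy, Sxy and the (n − 1) denominators, cross-checked against Python's standard library (the data values are arbitrary):

```python
import statistics

# Arbitrary illustrative data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.3, 5.9, 8.2, 9.8]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)   # S_yy is also SST
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

var_x = s_xx / (n - 1)
var_y = s_yy / (n - 1)
covar_xy = s_xy / (n - 1)

# Cross-check against the stdlib sample-variance implementation
assert abs(var_x - statistics.variance(x)) < 1e-9
assert abs(var_y - statistics.variance(y)) < 1e-9
```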
R E G R E S S I O N S T A T I S T I C S

The Sum of Squares Regression (SSR) is the sum of the squared differences between the prediction for each observation and the mean of the observed values.

T h e T o t a l S u m o f S q u a r e s ( S S T ) i s e q u a l t o S S R + S S E.

SSR = Σ (Ŷᵢ − Ȳ)²   (measure of explained variation)
SST = Σ (Yᵢ − Ȳ)²   (measure of total variation in Y)
[Diagram: the total variance of Y to be explained by predictors (SST) splits into the variance explained by X1 (SSR) and the variance NOT explained by X1 (SSE).]
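The partition SST = SSR + SSE can be verified numerically; a sketch with arbitrary toy data:

```python
# Arbitrary illustrative data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.3, 5.9, 8.2, 9.8]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Fit the least-squares line
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained variation

# The partition holds: SST = SSR + SSE
assert abs(sst - (ssr + sse)) < 1e-9
```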
T H E C O E F F I C I E N T O F D E T E R M I N A T I O N

The proportion of total variation (SST) that is explained by the regression (SSR) is known as the coefficient of determination, R² = SSR / SST.

The value of R² can range between 0 and 1; the higher its value, the more accurate the regression model is.

R² is an important measure of association between variables. It is represented as a square because its value is the square of another measure of association frequently used, the correlation coefficient, which is represented by r.

Although we can obtain R² from r, the two measures are not completely equivalent:
- r ranges from −1 to +1.
- In addition to providing a measure of the strength of an association, r also informs us of the type of association.
- In both cases, the greater the absolute value of the coefficient, the greater the strength of the association.
- Unlike the coefficient of determination, the correlation coefficient is an abstract value that has no direct and precise interpretation, somewhat like a score.
T H E C O E F F I C I E N T O F D E T E R M I N A T I O N

These two measures are related to the degree of dispersion of the observations about the regression line. In a scatterplot, when the two variables are independent, the points are distributed over the entire area of the plot: the regression line is horizontal and the coefficient of determination is zero. When an association exists, the regression line is oblique and the points are more or less spread along the line. The higher the strength of the association, the less the dispersion of the points around the line and the greater will be R² and the absolute value of r. If all the points lie on the line, R² has value 1 and r value +1 or −1.
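The relation between R² and r, and the fact that r carries the sign of the slope, can be checked directly (toy data again arbitrary):

```python
import math

# Arbitrary illustrative data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.3, 5.9, 8.2, 9.8]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Coefficient of determination via the sums of squares (SST = S_yy)
b1 = s_xy / s_xx
ssr = b1 ** 2 * s_xx
r_squared = ssr / s_yy

# Correlation coefficient; its square equals R^2, its sign matches the slope
r = s_xy / math.sqrt(s_xx * s_yy)
assert abs(r ** 2 - r_squared) < 1e-12
```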
T H E C O E F F I C I E N T O F D E T E R M I N A T I O N

The importance of these measures of association comes from the fact that it is very common to find evidence of association between two variables, and it is the strength of the association that tells us whether the association matters in practice. In clinical research, associations explaining less than 50% of the variance of the dependent variable, that is, associations with R² less than 0.50 or, equivalently, with r between −0.70 and +0.70, are usually not regarded as important.
S T A N D A R D E R R O R O F R E G R E S S I O N

Se = √S²e   (standard error for the regression model)

1. From the regression equation, compute the predicted values of the dependent variable.
2. Obtain S²e, the variance of the residuals.
3. Obtain the sum of squares of x from the variance of x: Σ(x − x̄)² = S²x (n − 1).
4. The standard error of the regression coefficient is SE(β̂₁) = √( S²e / Σ(x − x̄)² ).
This estimate of the true standard error of β̂₁ is unbiased on the condition that the dispersion of the points about the regression line is approximately the same along the length of the line. This will happen if the variance of Y is the same for every value of X, that is, if Y is homoscedastic. If this condition is not met, then the estimate of the standard error of β̂₁ may be larger or smaller than the true standard error, and there is no way of telling which.

In summary, we can estimate the standard error of the regression coefficient from our sample and construct confidence intervals, under the following assumptions:
- The dependent variable has a normal distribution for all values of the independent variable.
- The variance of the dependent variable is equal for all values of the independent variable.
- If the independent variable is interval, its distribution is normal.
- The relationship between the two variables is linear.
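Following these steps, a sketch computing Se and the standard error of the slope, plus a rough 95% confidence interval. The data are arbitrary, and the t quantile for n − 2 = 3 degrees of freedom (about 3.182) is hard-coded rather than looked up:

```python
import math

# Arbitrary illustrative data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.3, 5.9, 8.2, 9.8]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Step 1: predicted values; Step 2: residual variance S_e^2 = SSE / (n - 2)
y_hat = [b0 + b1 * xi for xi in x]
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
s_e2 = sse / (n - 2)
s_e = math.sqrt(s_e2)

# Step 4: standard error of the slope
se_b1 = math.sqrt(s_e2 / s_xx)

# Approximate 95% CI for the slope (t quantile for 3 df is about 3.182)
ci = (b1 - 3.182 * se_b1, b1 + 3.182 * se_b1)
```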
TEST IN LINEAR REGRESSION

RESIDUAL MEAN SQUARE: an estimate of the variance of Y for fixed values of X can be obtained from the variance of the residuals, that is, the variance of the departure of each y from the value predicted by the regression.

The resulting variance ratio would follow an F distribution if the two estimates of the variance were independent, and if the null hypothesis were false the variance ratio would have a value much larger than expected under the null hypothesis.
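A sketch of the variance-ratio test described above: the ratio of the regression mean square to the residual mean square, which for simple regression also equals the square of the slope's t statistic (toy data arbitrary):

```python
import math

# Arbitrary illustrative data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.3, 5.9, 8.2, 9.8]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # residual

msr = ssr / 1        # regression mean square (1 df for one predictor)
mse = sse / (n - 2)  # residual mean square: estimate of Var(Y | X)
f_stat = msr / mse   # compare against the F(1, n - 2) distribution

# For simple regression, F equals the square of the slope's t statistic
t_stat = b1 / math.sqrt(mse / s_xx)
assert abs(t_stat ** 2 - f_stat) < 1e-6
```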
SECTION 4
Q U I Z
In order for the regression technique to give the best and minimum variance prediction, all the
following conditions must be met, EXCEPT for:
A. The relation is linear.
B. We have not omitted any significant variable.
C. Both the X and Y variables (the predictors and the response) are normally distributed.
D. The residuals (errors) are normally distributed.
E. The variance around the regression line is about the same for all values of the predictor.
C. Both the X and Y variables (the predictors and the response) are normally
distributed.
Q U I Z
The assumption of homoscedasticity (meaning “same variance”) is central to linear regression models. Homoscedasticity
describes a situation in which the error term (that is, the “noise” or random disturbance in the relationship between the
independent variables and the dependent variable) is the same across all values of the independent variables.
Heteroscedasticity (the violation of homoscedasticity) is present when the size of the error term differs across values of an
independent variable. The impact of violating the assumption of homoscedasticity is a matter of degree, increasing as
heteroscedasticity increases.
B. Heteroscedasticity implies that the variance will differ for different values of the
regressor.
Q U I Z
In regression, the equation that describes how the response variable (y) is
related to the explanatory variable (x) is:
a. the correlation model
b. the regression model
c. used to compute the correlation coefficient
d. None of these alternatives is correct.
The relationship between number of beers consumed (x) and blood alcohol content (y) was studied
in 16 male college students by using least squares regression. The following regression equation
was obtained from this study:
y= -0.0127 + 0.0180x
The above equation implies that:
a. each beer consumed increases blood alcohol by 1.27%
b. on average it takes 1.8 beers to increase blood alcohol content by 1%
c. each beer consumed increases blood alcohol by an average amount of 1.8%
d. each beer consumed increases blood alcohol by exactly 0.018
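The slope interpretation behind this question can be checked by plugging into the fitted equation (the equation itself comes from the question; nothing else is assumed):

```python
# Fitted equation from the question: y = -0.0127 + 0.0180 * x
b0, b1 = -0.0127, 0.0180

def predicted_bac(beers):
    """Predicted blood alcohol content for a given number of beers."""
    return b0 + b1 * beers

# Each additional beer changes the *prediction* by the slope, 0.018 (1.8%)
delta = predicted_bac(5) - predicted_bac(4)
```

The slope describes the average change in the prediction per beer, not an exact change for any individual.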
Larger values of R² imply that the observations are more closely grouped about the
a. average value of the independent variables
b. average value of the dependent variable
c. least squares line
d. origin
In regression analysis, the variable that is used to explain the change in the
outcome of an experiment, or some natural process, is called
a. the x-variable
b. the independent variable
c. the predictor variable
d. the explanatory variable
e. all of the above (a-d) are correct
f. none are correct
In the case of an algebraic model for a straight line, if a value for the x variable is
specified, then
a. the exact value of the response variable can be computed
b. the computed response to the independent value will always give a minimal
residual
c. the computed value of y will always be the best estimate of the mean response
d. none of these alternatives is correct.
In a regression and correlation analysis if R² = 1, then
a. SSE = SST
b. SSE = 1
c. SSR = SSE
d. SSR = SST
d. SSR = SST
Q U I Z
In a regression analysis if SSE = 200 and SSR = 300, then the coefficient of
determination is
a. 0.6667
b. 0.6000
c. 0.4000
d. 1.5000
b. 0.6000
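The arithmetic behind this answer, using R² = SSR / SST with SST = SSR + SSE:

```python
sse, ssr = 200.0, 300.0   # values given in the question
sst = ssr + sse           # total variation
r_squared = ssr / sst     # coefficient of determination
```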
Q U I Z
You have carried out a regression analysis; but, after thinking about the relationship
between variables, you have decided you must swap the explanatory and the
response variables. After refitting the regression model to the data you expect that:
a. the value of the correlation coefficient will change
b. the value of SSE will change
c. the value of the coefficient of determination will change
d. the sign of the slope will change
e. nothing changes
Suppose you use regression to predict the height of a woman’s current boyfriend by using her
own height as the explanatory variable. Height was measured in feet from a sample of 100
women undergraduates, and their boyfriends, at Dalhousie University. Now, suppose that the
height of both the women and the men are converted to centimeters. The impact of this
conversion on the slope is:
a. the sign of the slope will change
b. the magnitude of the slope will change
c. both a and b are correct
d. neither a nor b are correct
A residual plot:
a. displays residuals of the explanatory variable versus residuals of the response
variable.
b. displays residuals of the explanatory variable versus the response variable.
c. displays explanatory variable versus residuals of the response variable.
d. displays the explanatory variable versus the response variable.
e. displays the explanatory variable on the x axis versus the response variable on
the y axis.
When the error terms have a constant variance, a plot of the residuals versus
the independent variable x has a pattern that
a. fans out
b. funnels in
c. fans out, but then funnels in
d. forms a horizontal band pattern
e. forms a linear pattern that can be positive or negative