Correlation, Simple Linear Regression and Multiple Linear Regression Practice


Correlation

Simple Linear Regression


Multiple linear regression
Practice

Truong Phuoc Long, Ph.D.

11/28/2022 1
The estimated linear regression

 The estimated linear regression model: ŷ = β̂₀ + β̂₁x
 Y: dependent variable
 X: independent variable
 ŷ is the average value of the dependent variable for a group of
individuals all of the same X value.
 Linear regression attempts to model the relationship between
two variables by fitting a linear equation to observed data.
 Example:
Y: arm circumference; X: height
ŷ is the average arm circumference for a group of
children all of the same height, x
2
Linear regression

3
Interpret the regression coefficients

 The regression line relating estimated mean arm circumference
(cm) to height (cm): ŷ = 2.7 + 0.16x
 The slope β̂₁ = 0.16
 β̂₁ is the average change in arm circumference for a one-unit
(1 cm) increase in height; or
 β̂₁ is the mean difference in arm circumference for two groups
of children who differ by one unit (1 cm) in height, taller to
shorter.
 The intercept β̂₀ = 2.7
 β̂₀ is the mean arm circumference when height is equal to 0.
4
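As a quick sanity check on this interpretation, the fitted line can be evaluated directly. The short Python sketch below uses the slide's equation; the two heights are made up purely for illustration.

```python
# Fitted line from the slide: y-hat = 2.7 + 0.16 * height
def predicted_arm_circumference(height_cm):
    return 2.7 + 0.16 * height_cm

# Two hypothetical children whose heights differ by 1 cm
taller = predicted_arm_circumference(91.0)
shorter = predicted_arm_circumference(90.0)

# The difference in predicted mean arm circumference equals the slope (0.16 cm)
print(round(taller - shorter, 2))
```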
The estimated linear regression

• Make assumptions about the distribution of the error terms, the independence
of the observed values of y, and so on.
• Use the observed values of x and y to estimate β₀ and β₁.
• Make inferences, such as confidence intervals and tests of
hypotheses, for β₀ and β₁.
• Use the estimated model to forecast or predict the value of y for a
particular value of x, in which case a measure of predictive
accuracy may also be of interest.
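A minimal sketch of these steps in Python with statsmodels (the data below are invented solely for illustration; in the practice exercises the same steps are carried out in SPSS):

```python
import numpy as np
import statsmodels.api as sm

# Invented (x, y) data standing in for height and arm circumference
x = np.array([60.0, 65.0, 70.0, 75.0, 80.0, 85.0, 90.0, 95.0])
y = np.array([12.1, 13.0, 13.9, 14.5, 15.6, 16.2, 17.0, 17.8])

X = sm.add_constant(x)                 # design matrix with an intercept column
model = sm.OLS(y, X).fit()             # estimate beta0 and beta1 by least squares

print(model.params)                    # estimated coefficients
print(model.conf_int(alpha=0.05))      # 95% confidence intervals
print(model.pvalues)                   # t-tests of H0: coefficient = 0
print(model.predict([[1.0, 72.0]]))    # predicted y at x = 72 (1.0 is the intercept term)
```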

5
Practice

 Download the “World95.sav” data file from Blackboard.


 Read Label in the “Variable view” tab to explore available
variables in the data set.
 A researcher is interested in daily calorie intake and female
life expectancy.
 Which one is:
dependent variable
independent variable
 Run descriptive statistics on these two variables. How is the
distribution of each variable?
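The exercise is meant to be done in SPSS; as a rough Python equivalent, the sketch below loads the file and runs descriptives. The column names "calories" and "lifeexpf" are assumptions about how World95.sav labels the two variables.

```python
import pandas as pd

# Reading an SPSS .sav file requires the pyreadstat package
df = pd.read_spss("World95.sav")
subset = df[["calories", "lifeexpf"]]       # assumed variable names

print(subset.describe())                    # n, mean, sd, quartiles, min, max
print(subset.skew())                        # a rough look at each distribution's shape
```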

6
Practice

 Now let’s explore the relationship between the two


variables.
 First, look at the scatter plot.

7
Scatter plot

8

 In the Output viewer,


double click on the chart
to bring up the Chart
Editor;
 go to Elements and select
“Fit Line at Total,”
 then select “linear” and
click Close.
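Those are the SPSS steps; a rough matplotlib equivalent of the scatter plot with a total fit line is sketched below (same assumed column names as before).

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_spss("World95.sav").dropna(subset=["calories", "lifeexpf"])
x, y = df["calories"], df["lifeexpf"]

plt.scatter(x, y, s=15)
slope, intercept = np.polyfit(x, y, deg=1)         # least squares fit line
xs = np.linspace(x.min(), x.max(), 100)
plt.plot(xs, intercept + slope * xs, color="red")  # "Fit Line at Total"
plt.xlabel("Daily calorie intake")
plt.ylabel("Average female life expectancy")
plt.show()
```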

9
 It would appear that there is a positive correlation between X and Y.

10
Now let’s find the regression equation

11
Results

 Results if we ask for


descriptives.
 r = 0.775: strong
positive correlation
between the two variables.
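For reference, the same correlation can be computed directly; a minimal sketch with the assumed column names used earlier:

```python
import pandas as pd
from scipy import stats

df = pd.read_spss("World95.sav").dropna(subset=["calories", "lifeexpf"])
r, p = stats.pearsonr(df["calories"], df["lifeexpf"])
print(r, p)   # the slide reports r = 0.775
```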

12
Inference for regression coefficients

 The intercept and the slope: β̂₀ = 25.904; β̂₁ = 0.016
 SE of the coefficients: a measure of how much the estimates would
vary from sample to sample
 95% CI of the coefficients (β̂₀, β̂₁)
- We are 95% confident that the true coefficient of the population regression
falls into this interval.
- 95% CI = estimate ± t* × SE, where t* is the 97.5th percentile of the t
distribution with n − 2 degrees of freedom
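A sketch of the CI formula in code; the standard error and sample size below are illustrative assumptions, and only the point estimate comes from the slide.

```python
from scipy import stats

b1, se_b1, n = 0.016, 0.002, 109           # se_b1 and n are assumed for illustration
t_star = stats.t.ppf(0.975, df=n - 2)      # two-sided 95% critical value
ci = (b1 - t_star * se_b1, b1 + t_star * se_b1)
print(ci)
```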
13
Inference for regression coefficients

 The t-test for whether the coefficient is equal to zero (2-tailed test
with significance level alpha = 0.05)
 Can perform a test for whether the coefficient is equal to a
specific value.

Use t-table to obtain p-value.
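And the matching t-test sketch, under the same illustrative assumptions as the CI sketch above:

```python
from scipy import stats

b1, se_b1, n = 0.016, 0.002, 109                 # assumed values, as before
t_obs = (b1 - 0) / se_b1                         # test of H0: beta1 = 0
p_value = 2 * stats.t.sf(abs(t_obs), df=n - 2)   # two-tailed p-value
print(t_obs, p_value)
```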

14
The standardized coefficients

 The coefficients that you would obtain if you standardized all of the
variables in the regression, including the dependent and all of the
independent variables in multiple linear regression.
 Standardize = put all variables on the same scale (variance of variables = 1)
 Can compare the magnitudes of the coefficients to see which variable
has the larger effect.
 Notice that the larger betas are associated with the larger t-values and lower p-
values.
15
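A minimal sketch of what "standardize all of the variables" means in code: z-score each variable and refit (column names assumed as before).

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_spss("World95.sav").dropna(subset=["calories", "lifeexpf"])
cols = ["calories", "lifeexpf"]
z = (df[cols] - df[cols].mean()) / df[cols].std()   # put both variables on the same scale

X = sm.add_constant(z["calories"])
beta = sm.OLS(z["lifeexpf"], X).fit().params["calories"]
print(beta)   # in simple regression the standardized slope equals Pearson's r
```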
Evaluation of the model

 R: Pearson correlation
 R square: Coefficient of determination
The proportion of the variability of Y that is explained by the linear
regression of Y on X (or the linear relationship between Y and X)
R² = 0.601 → 60.1% of the variability of female life expectancy is
explained by the linear relationship between female life expectancy and
daily calorie intake.
The remaining 100 − 60.1 = 39.9% of the variation is not explained by
this relationship.
16
Evaluation of the model

• Total variance is partitioned into the variance which can be explained by
the independent variables (Regression) and the variance which is not
explained by the independent variables (Residual).
• Sum of squares total (SST) is the sum of squared differences between the
observed values of the dependent variable and its mean.
• Sum of squares regression (SSR) is the sum of squared differences between
the predicted values and the mean of the dependent variable.
• Sum of squares error or residual (SSE) is the sum of squared differences
between the observed values and the predicted values.
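A small sketch of the partition SST = SSR + SSE (and R² = SSR/SST) on made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

sst = np.sum((y - y.mean()) ** 2)        # total
ssr = np.sum((y_hat - y.mean()) ** 2)    # explained by the regression
sse = np.sum((y - y_hat) ** 2)           # residual

print(sst, ssr + sse)                    # equal up to rounding
print(ssr / sst)                         # R squared
```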
17
Evaluation of the model

 Total variance has N-1 degrees of freedom.


 The Regression df = the number of coefficients (including the
intercept) minus 1
 The Residual df = Total df – Regression df
 Mean Square = SSQ (Sum of Squares)/df

18
Evaluation of the model

• F statistic = MS_regression / MS_residual
• Null hypothesis: all of the model coefficients (other than the intercept) are 0
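A sketch of the F computation with toy sums of squares (the values are made up; scipy supplies the p-value):

```python
from scipy import stats

ssr, sse, n, k = 10.9, 0.3, 5, 1            # toy values: k regressors, n observations
ms_regression = ssr / k                     # regression df = k
ms_residual = sse / (n - k - 1)             # residual df = n - k - 1
f_obs = ms_regression / ms_residual
p_value = stats.f.sf(f_obs, k, n - k - 1)   # upper-tail p-value
print(f_obs, p_value)
```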

19
Now it’s your turn

 Look at the data and explore the relationship between two
continuous variables (come up with a research question of your own).

 Make a scatter plot between two variables, see if they


potentially have a linear relationship.

 Estimate the linear regression and interpret the results.

20
Multiple linear regression

21
The General Idea

Simple regression considers the relation between a single


independent variable X and dependent variable Y

- X variable: independent variable, also called an experimental


or predictor variable.
- Y variable: dependent variable or response variable.

22
The General Idea

Multiple regression considers the relation between multiple


independent variables (X1, X2,…, Xk) and dependent variable Y.

If two or more explanatory variables have a linear relationship with


the dependent variable, the regression is called a multiple linear
regression.
23
Multiple Regression Analysis

• Method for studying the relationship between a dependent


variable and two or more independent variables.
• Purposes:
- Prediction
- Explanation
- Theory building

24
Simple vs. Multiple Regression

Simple regression:
• One dependent variable Y predicted from one independent variable X
• One regression coefficient
• r²: proportion of variation in dependent variable Y predictable from X

Multiple regression:
• One dependent variable Y predicted from a set of independent variables (X1, X2, …, Xk)
• One regression coefficient for each independent variable
• R²: proportion of variation in dependent variable Y predictable by the set of independent variables (X's)

25
Design Requirements

• One dependent variable (criterion)


• Two or more independent variables (predictor
variables).
• Sample size: >= 30 (at least 10 times as many cases
as independent variables)

26
Assumptions

• Independence: the scores of any particular subject are independent of the


scores of all other subjects.
• Normality: the scores on the dependent variable are normally distributed
for each of the possible combinations of the levels of the X variables; each of
the variables is normally distributed.
• Homoscedasticity: the variances of the dependent variable for each of the
possible combinations of the levels of the X independent variables are
equal.
• Linearity: the relation between the dependent variable and the independent
variable is linear when all the other independent variables are held constant.

27
Multiple Linear Regression Models

• The dependent variable or response Y may be related to k independent
or regressor variables by the multiple linear regression model (Equation 12-2):
Y = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ + ε
• The parameters βⱼ, j = 0, 1, …, k, are called the regression
coefficients.
• The parameter βⱼ represents the expected change in the response Y
per unit change in xⱼ when all the remaining regressors xᵢ (i ≠ j) are
held constant.

28
Multiple Linear Regression Models

Least Squares Estimation of the Parameters

29
Multiple Linear Regression Models

Least Squares Estimation of the Parameters


• The least squares function is given by
L = Σᵢ (yᵢ − β₀ − β₁xᵢ₁ − … − βₖxᵢₖ)²
• The least squares estimates β̂₀, β̂₁, …, β̂ₖ must satisfy
∂L/∂βⱼ = 0 for j = 0, 1, …, k
30
Multiple Linear Regression Models

Least Squares Estimation of the Parameters


• The least squares normal equations are obtained by setting these partial
derivatives to zero; in matrix notation they are X′X β̂ = X′y.
• The solution to the normal equations gives the least squares
estimators of the regression coefficients.
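A minimal NumPy sketch of solving the normal equations X′X b = X′y for a toy two-regressor data set (all numbers invented):

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([10.0, 30.0, 20.0, 50.0, 40.0, 60.0])
y  = np.array([3.1, 5.2, 6.8, 9.5, 10.1, 13.0])

X = np.column_stack([np.ones_like(x1), x1, x2])   # design matrix with an intercept column
b = np.linalg.solve(X.T @ X, X.T @ y)             # solves the normal equations (X'X) b = X'y
print(b)                                          # [b0, b1, b2]
```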
31
Multiple Linear Regression Models

Example: Investigating wire bond strength. Here, researchers used
data on the pull strength of a wire bond in a semiconductor
manufacturing process, together with wire length and die height, to
illustrate building an empirical model. Estimate the parameters of
the multiple linear regression model.

32
Multiple Linear Regression Models

33
Multiple Linear Regression Models

• The displays can be helpful in visualizing the relationships among


variables in a multivariable data set.
• The plot indicates that there is a strong linear relationship between
strength and wire length.
34
Multiple Linear Regression Models

• Here, we will fit the multiple linear regression model as follows:


Y = 0 + 1 x1 + 2 x2 + 
Where: Y = pull strength, x1 = wire length, and x2 = die height
• From the data in table 12-2, we calculate:

35
Multiple Linear Regression Models

For the model Y = 0 + 1 x1 + 2 x2 + , the normal equations are:

Insert the computed summations into the normal equations, we obtain:

36
Multiple Linear Regression Models

The solution to this set of equations is:

Therefore, the fitted regression equation is:

Practical interpretation: this equation can be used to predict pull
strength for pairs of values of the regressor variables wire length (x₁)
and die height (x₂).
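A hedged sketch of fitting and using such a model in Python; the file name and the column names ("strength", "length", "height") are hypothetical stand-ins for the Table 12-2 data, which are not reproduced here.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("wire_bond.csv")                  # hypothetical file holding the example data
fit = smf.ols("strength ~ length + height", data=df).fit()

print(fit.params)                                  # b0, b1, b2 of the fitted equation
new = pd.DataFrame({"length": [8.0], "height": [275.0]})
print(fit.predict(new))                            # predicted pull strength for one (x1, x2) pair
```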

37
Notes on the coefficient of determination

In a study investigating the relationship between head circumference,
gestational age and weight:
• The simple linear regression of head circumference on
gestational age:
Model 1: ŷ = 3.9143 + 0.7801x
• The multiple linear regression of head circumference on
gestational age (x₁) and weight (x₂):
Model 2: ŷ = 8.3080 + 0.4487x₁ + 0.0047x₂
Guess which R² is larger (that of Model 1 or Model 2)?

38
More notes on the coefficient of determination

 Model 1: R² = 0.6095
 Model 2: R² = 0.7520
 The inclusion of an additional variable in a model can never
cause R² to decrease.
 Comparing R² values to judge whether the model has improved is
not appropriate; use the adjusted R² instead.
 The adjusted R² is an estimator of the population correlation (ρ), but it
cannot be interpreted as the proportion of the variability among the
observed values of y that is explained by the linear regression model.
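The usual adjusted R² formula makes the comparison concrete; in the sketch below, n = 100 is an assumed sample size used only to make the arithmetic concrete.

```python
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(0.6095, n=100, k=1))   # Model 1: one regressor
print(adjusted_r2(0.7520, n=100, k=2))   # Model 2: two regressors
```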

39
Comparing Models - Testing R²

Example: A study of the relation between academic achievement
(AA), grades, and general and academic self-concept (N = 103).
Shavelson, text example p. 530 (Table 18.1) (Example 18.1, p. 538).
Comparing Models - Testing R²

• The Model:

Y′ = a + b1X1 + b2X2 + … + bkXk

- The b's are called partial regression coefficients

• Our example - Predicting AA:

Y′ = 36.83 + (3.52)X_ASC + (−0.44)X_GSC

• Example: Predicted AA for a person with GSC of 4 and ASC of 6

Y′ = 36.83 + (3.52)(6) + (−0.44)(4) = 56.23


Multiple Correlation Coefficient (R) and Coefficient
of Multiple Determination (R2)

• R = the magnitude of the relationship between the


dependent variable and the best linear combination of
the predictor variables.

• R2 = the proportion of variation in Y accounted for by


the set of independent variables (X’s).

42
Explaining Variation: How much?

[Diagram: the total variation in Y is partitioned into the variation predictable
by the combination of independent variables and the unpredictable variation.]

Proportion of Predictable and Unpredictable Variation

[Venn diagram with Y = AA, X1 = ASC, X2 = GSC: R² is the predictable
(explained) variation in Y and (1 − R²) is the unpredictable (unexplained)
variation in Y.]
Various Significance Tests

Testing R²
- Test R² through an F test
- Test competing models (difference between R²s) through an F test of the
difference of R²s
Testing β
- Test each partial regression coefficient (β) by a t-test
- Compare partial regression coefficients with each other: a t-test of the
difference between standardized partial regression coefficients (β's)
Example: Testing R²

• What proportion of variation in AA can be predicted from GSC


and ASC?

- Compute R²: R² = 0.16 (R = 0.41): 16% of the variance in AA
can be accounted for by the composite of GSC and ASC.

• Is R² significantly different from 0?

- F test: F_observed = 9.52, F_critical (0.05; 2, 100) = 3.09

- Reject H0: in the population, there is a significant relationship


between AA and the linear composite of GSC and ASC.
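The F reported here can be reproduced from R² alone with the standard formula F = (R²/k) / ((1 − R²)/(n − k − 1)); plugging in the slide's numbers:

```python
r2, n, k = 0.16, 103, 2
f_obs = (r2 / k) / ((1 - r2) / (n - k - 1))
print(f_obs)   # about 9.5, in line with the slide's F_observed = 9.52
```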
Example: Comparing Models - Testing R²

Comparing models
Model 1: Y′ = 35.37 + (3.38)X_ASC
Model 2: Y′ = 36.83 + (3.52)X_ASC + (−0.44)X_GSC
Compute R² for each model
Model 1: R² = r² = 0.160
Model 2: R² = 0.161
Test the difference between the R²s
F_obs = 0.119, F_crit (0.05; 1, 100) = 3.94
 Conclude that GSC does not add significantly to ASC in predicting AA
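The F for the difference between the two R² values follows the standard nested-model formula; with the slide's numbers:

```python
# F = ((R2_full - R2_reduced) / (k_full - k_reduced)) / ((1 - R2_full) / (n - k_full - 1))
r2_reduced, r2_full, n, k_reduced, k_full = 0.160, 0.161, 103, 1, 2
f_obs = ((r2_full - r2_reduced) / (k_full - k_reduced)) / ((1 - r2_full) / (n - k_full - 1))
print(f_obs)   # about 0.12, in line with the slide's F_obs = 0.119
```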
Testing Significance of b’s

H₀: β = 0

t_observed = (b − β) / (standard error of b)

with N − k − 1 df
Example: t-test of b

t_observed = (−0.44 − 0) / 14.24
t_observed = −0.03
t_critical (0.05, two-tailed, df = 100) = 1.97

Decision: Cannot reject the null hypothesis.

Conclusion: The population β for GSC is not significantly
different from 0.
Comparing Partial Regression Coefficients

• Which is the stronger predictor?


- Comparing b_GSC and b_ASC
• Convert to standardized partial regression coefficients (beta weights, β's)
- β_GSC = −0.038
- β_ASC = 0.417
- On the same scale, so they can be compared: ASC is a stronger predictor than GSC.
• Beta weights (β's) can also be tested for significance with t-tests.
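A tiny sketch of the conversion from a raw coefficient b to a beta weight (beta = b × sd of the predictor / sd of the dependent variable); the standard deviations below are invented for illustration.

```python
def beta_weight(b, sd_x, sd_y):
    # Standardized partial regression coefficient from a raw coefficient
    return b * sd_x / sd_y

print(beta_weight(-0.44, sd_x=1.2, sd_y=13.6))   # toy values, not the study's data
```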
