Correlation, Simple Linear Regression and Multiple Linear Regression Practice


Correlation

Simple Linear Regression


Multiple linear regression
Practice

Truong Phuoc Long, Ph.D.

11/28/2022 1
The estimated linear regression

 The estimated linear regression model: ŷ = β̂₀ + β̂₁x
 Y: dependent variable
 X: independent variable
 ŷ is the average value of the dependent variable for a group of
individuals all of the same X value.
 Linear regression attempts to model the relationship between
two variables by fitting a linear equation to observed data.
 Example:
Y: arm circumference; X: height
ŷ is the average arm circumference for a group of
children all of the same height, x
2
Linear regression

3
Interpret the regression coefficients

 The regression line relating estimated mean arm circumference
(cm) to height (cm): ŷ = 2.7 + 0.16x
 The slope β̂₁ = 0.16
 β̂₁ is the average change in arm circumference for a one-unit
(1 cm) increase in height; or
 β̂₁ is the mean difference in arm circumference for two groups
of children who differ by one unit (1 cm) in height, taller to
shorter.
 The intercept β̂₀ = 2.7
 β̂₀ is the mean arm circumference when height is equal to 0.
4
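As a quick sanity check on this interpretation, the fitted line can be evaluated directly. The short Python sketch below uses the slide's equation; the two heights are made up purely for illustration.

```python
# Fitted line from the slide: y-hat = 2.7 + 0.16 * height
def predicted_arm_circumference(height_cm):
    return 2.7 + 0.16 * height_cm

# Two hypothetical children whose heights differ by 1 cm
taller = predicted_arm_circumference(91.0)
shorter = predicted_arm_circumference(90.0)

# The difference in predicted mean arm circumference equals the slope (0.16 cm)
print(round(taller - shorter, 2))
```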
The estimated linear regression

• Make assumptions about the distribution of the error terms, the independence
of the observed values of y, and so on.
• Use the observed values of x and y to estimate β₀ and β₁.
• Make inferences, such as confidence intervals and tests of
hypotheses, for β₀ and β₁.
• Use the estimated model to forecast or predict the value of y for a
particular value of x, in which case a measure of predictive
accuracy may also be of interest.
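A minimal sketch of these steps in Python with statsmodels (the data below are invented solely for illustration; in the practice exercises the same steps are carried out in SPSS):

```python
import numpy as np
import statsmodels.api as sm

# Invented (x, y) data standing in for height and arm circumference
x = np.array([60.0, 65.0, 70.0, 75.0, 80.0, 85.0, 90.0, 95.0])
y = np.array([12.1, 13.0, 13.9, 14.5, 15.6, 16.2, 17.0, 17.8])

X = sm.add_constant(x)                 # design matrix with an intercept column
model = sm.OLS(y, X).fit()             # estimate beta0 and beta1 by least squares

print(model.params)                    # estimated coefficients
print(model.conf_int(alpha=0.05))      # 95% confidence intervals
print(model.pvalues)                   # t-tests of H0: coefficient = 0
print(model.predict([[1.0, 72.0]]))    # predicted y at x = 72 (1.0 is the intercept term)
```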

5
Practice

 Download the “World95.sav” data file from Blackboard.


 Read Label in the “Variable view” tab to explore available
variables in the data set.
 A researcher is interested in daily calorie intake and female
life expectancy.
 Which one is:
dependent variable
independent variable
 Run descriptive statistics on these two variables. How is the
distribution of each variable?
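The exercise is meant to be done in SPSS; as a rough Python equivalent, the sketch below loads the file and runs descriptives. The column names "calories" and "lifeexpf" are assumptions about how World95.sav labels the two variables.

```python
import pandas as pd

# Reading an SPSS .sav file requires the pyreadstat package
df = pd.read_spss("World95.sav")
subset = df[["calories", "lifeexpf"]]       # assumed variable names

print(subset.describe())                    # n, mean, sd, quartiles, min, max
print(subset.skew())                        # a rough look at each distribution's shape
```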

6
Practice

 Now let’s explore the relationship between the two


variables.
 First, look at the scatter plot.

7
Scatter plot

8

 In the Output viewer,


double click on the chart
to bring up the Chart
Editor;
 go to Elements and select
“Fit Line at Total,”
 then select “linear” and
click Close.
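Those are the SPSS steps; a rough matplotlib equivalent of the scatter plot with a total fit line is sketched below (same assumed column names as before).

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_spss("World95.sav").dropna(subset=["calories", "lifeexpf"])
x, y = df["calories"], df["lifeexpf"]

plt.scatter(x, y, s=15)
slope, intercept = np.polyfit(x, y, deg=1)         # least squares fit line
xs = np.linspace(x.min(), x.max(), 100)
plt.plot(xs, intercept + slope * xs, color="red")  # "Fit Line at Total"
plt.xlabel("Daily calorie intake")
plt.ylabel("Average female life expectancy")
plt.show()
```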

9
 It would appear that there is a positive correlation between X and Y.

10
Now let’s find the regression equation

11
Results

 Results if we ask for


descriptives.
 r = 0.775: strong
positive correlation
between the two variables.
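For reference, the same correlation can be computed directly; a minimal sketch with the assumed column names used earlier:

```python
import pandas as pd
from scipy import stats

df = pd.read_spss("World95.sav").dropna(subset=["calories", "lifeexpf"])
r, p = stats.pearsonr(df["calories"], df["lifeexpf"])
print(r, p)   # the slide reports r = 0.775
```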

12
Inference for regression coefficients

 The intercept and the slope: β̂₀ = 25.904; β̂₁ = 0.016
 SE of the coefficients: a measure of how much the estimates would
vary from sample to sample
 95% CI of the coefficients (β̂₀, β̂₁)
- We are 95% confident that the true coefficient of the population regression
falls into this interval.
- 95% CI = estimate ± t* × SE, where t* is the 97.5th percentile of the t
distribution with n − 2 degrees of freedom
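A sketch of the CI formula in code; the standard error and sample size below are illustrative assumptions, and only the point estimate comes from the slide.

```python
from scipy import stats

b1, se_b1, n = 0.016, 0.002, 109           # se_b1 and n are assumed for illustration
t_star = stats.t.ppf(0.975, df=n - 2)      # two-sided 95% critical value
ci = (b1 - t_star * se_b1, b1 + t_star * se_b1)
print(ci)
```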
13
Inference for regression coefficients

 The t-test for whether the coefficient is equal to zero (2-tailed test
with significance level alpha = 0.05)
 Can perform a test for whether the coefficient is equal to a
specific value.

Use t-table to obtain p-value.
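And the matching t-test sketch, under the same illustrative assumptions as the CI sketch above:

```python
from scipy import stats

b1, se_b1, n = 0.016, 0.002, 109                 # assumed values, as before
t_obs = (b1 - 0) / se_b1                         # test of H0: beta1 = 0
p_value = 2 * stats.t.sf(abs(t_obs), df=n - 2)   # two-tailed p-value
print(t_obs, p_value)
```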

14
The standardized coefficients

 The coefficients that you would obtain if you standardized all of the
variables in the regression, including the dependent and all of the
independent variables in multiple linear regression.
 Standardize = put all variables on the same scale (variance of variables = 1)
 Can compare the magnitudes of the coefficients to see which variable
has the larger effect.
 Notice that the larger betas are associated with the larger t-values and lower p-
values.
15
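A minimal sketch of what "standardize all of the variables" means in code: z-score each variable and refit (column names assumed as before).

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_spss("World95.sav").dropna(subset=["calories", "lifeexpf"])
cols = ["calories", "lifeexpf"]
z = (df[cols] - df[cols].mean()) / df[cols].std()   # put both variables on the same scale

X = sm.add_constant(z["calories"])
beta = sm.OLS(z["lifeexpf"], X).fit().params["calories"]
print(beta)   # in simple regression the standardized slope equals Pearson's r
```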
Evaluation of the model

 R: Pearson correlation
 R square: Coefficient of determination
The proportion of the variability of Y that is explained by the linear
regression of Y on X (or the linear relationship between Y and X)
R² = 0.601 → 60.1% of the variability of female life expectancy is
explained by the linear relationship between female life expectancy and
daily calorie intake.
The remaining 100 − 60.1 = 39.9% of the variation is not explained by
this relationship.
16
Evaluation of the model

• Total variance is partitioned into the variance which can be explained by
the independent variables (Regression) and the variance which is not
explained by the independent variables (Residual).
• Sum of squares total (SST) is the sum of squared differences between the
observed values of the dependent variable and its mean.
• Sum of squares regression (SSR) is the sum of squared differences between
the predicted values and the mean of the dependent variable.
• Sum of squares error or residual (SSE) is the sum of squared differences
between the observed values and the predicted values.
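A small sketch of the partition SST = SSR + SSE (and R² = SSR/SST) on made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

sst = np.sum((y - y.mean()) ** 2)        # total
ssr = np.sum((y_hat - y.mean()) ** 2)    # explained by the regression
sse = np.sum((y - y_hat) ** 2)           # residual

print(sst, ssr + sse)                    # equal up to rounding
print(ssr / sst)                         # R squared
```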
17
Evaluation of the model

 Total variance has N-1 degrees of freedom.


 The Regression df = the number of coefficients (including the
intercept) minus 1
 The Residual df = Total df – Regression df
 Mean Square = SSQ (Sum of Squares)/df

18
Evaluation of the model

• F statistic = MS_regression / MS_residual
• Null hypothesis: all of the model coefficients (other than the intercept) are 0
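A sketch of the F computation with toy sums of squares (the values are made up; scipy supplies the p-value):

```python
from scipy import stats

ssr, sse, n, k = 10.9, 0.3, 5, 1            # toy values: k regressors, n observations
ms_regression = ssr / k                     # regression df = k
ms_residual = sse / (n - k - 1)             # residual df = n - k - 1
f_obs = ms_regression / ms_residual
p_value = stats.f.sf(f_obs, k, n - k - 1)   # upper-tail p-value
print(f_obs, p_value)
```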

19
Now it’s your turn

 Look at the data and explore the relationship between two
continuous variables (come up with a research question of your own).

 Make a scatter plot between two variables, see if they


potentially have a linear relationship.

 Estimate the linear regression and interpret the results.

20
Multiple linear regression

21
The General Idea

Simple regression considers the relation between a single


independent variable X and dependent variable Y

- X variable: independent variable, also called an experimental


or predictor variable.
- Y variable: dependent variable or response variable.

22
The General Idea

Multiple regression considers the relation between multiple


independent variables (X1, X2,…, Xk) and dependent variable Y.

If two or more explanatory variables have a linear relationship with


the dependent variable, the regression is called a multiple linear
regression.
23
Multiple Regression Analysis

• Method for studying the relationship between a dependent


variable and two or more independent variables.
• Purposes:
- Prediction
- Explanation
- Theory building

24
Simple vs. Multiple Regression

Simple regression:
• One dependent variable Y predicted from one independent variable X
• One regression coefficient
• r²: proportion of variation in dependent variable Y predictable from X

Multiple regression:
• One dependent variable Y predicted from a set of independent variables (X1, X2, …, Xk)
• One regression coefficient for each independent variable
• R²: proportion of variation in dependent variable Y predictable by the set of independent variables (X's)

25
Design Requirements

• One dependent variable (criterion)


• Two or more independent variables (predictor
variables).
• Sample size: >= 30 (at least 10 times as many cases
as independent variables)

26
Assumptions

• Independence: the scores of any particular subject are independent of the


scores of all other subjects.
• Normality: the scores on the dependent variable are normally distributed
for each of the possible combinations of the levels of the X variables; each of
the variables is normally distributed.
• Homoscedasticity: the variances of the dependent variable for each of the
possible combinations of the levels of the X independent variables are
equal.
• Linearity: the relation between the dependent variable and the independent
variable is linear when all the other independent variables are held constant.

27
Multiple Linear Regression Models

• The dependent variable or response Y may be related to k independent
or regressor variables by the multiple linear regression model (Equation 12-2):
Y = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ + ε
• The parameters βⱼ, j = 0, 1, …, k, are called the regression
coefficients.
• The parameter βⱼ represents the expected change in the response Y
per unit change in xⱼ when all the remaining regressors xᵢ (i ≠ j) are
held constant.

28
Multiple Linear Regression Models

Least Squares Estimation of the Parameters

29
Multiple Linear Regression Models

Least Squares Estimation of the Parameters


• The least squares function is given by
L = Σᵢ (yᵢ − β₀ − β₁xᵢ₁ − … − βₖxᵢₖ)²
• The least squares estimates β̂₀, β̂₁, …, β̂ₖ must satisfy
∂L/∂βⱼ = 0 for j = 0, 1, …, k
30
Multiple Linear Regression Models

Least Squares Estimation of the Parameters


• The least squares normal equations are obtained by setting these partial
derivatives to zero; in matrix notation they are X′X β̂ = X′y.
• The solution to the normal equations gives the least squares
estimators of the regression coefficients.
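A minimal NumPy sketch of solving the normal equations X′X b = X′y for a toy two-regressor data set (all numbers invented):

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([10.0, 30.0, 20.0, 50.0, 40.0, 60.0])
y  = np.array([3.1, 5.2, 6.8, 9.5, 10.1, 13.0])

X = np.column_stack([np.ones_like(x1), x1, x2])   # design matrix with an intercept column
b = np.linalg.solve(X.T @ X, X.T @ y)             # solves the normal equations (X'X) b = X'y
print(b)                                          # [b0, b1, b2]
```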
31
Multiple Linear Regression Models

Example: Investigating wire bond strength. Here, researchers used
data on the pull strength of a wire bond in a semiconductor
manufacturing process, together with wire length and die height, to
illustrate building an empirical model. Estimate the parameters of
the multiple linear regression model.

32
Multiple Linear Regression Models

33
Multiple Linear Regression Models

• The displays can be helpful in visualizing the relationships among


variables in a multivariable data set.
• The plot indicates that there is a strong linear relationship between
strength and wire length.
34
Multiple Linear Regression Models

• Here, we will fit the multiple linear regression model as follows:


Y = 0 + 1 x1 + 2 x2 + 
Where: Y = pull strength, x1 = wire length, and x2 = die height
• From the data in table 12-2, we calculate:

35
Multiple Linear Regression Models

For the model Y = 0 + 1 x1 + 2 x2 + , the normal equations are:

Insert the computed summations into the normal equations, we obtain:

36
Multiple Linear Regression Models

The solution to this set of equations is:

Therefore, the fitted regression equation is:

Practical interpretation: this equation can be used to predict pull
strength for pairs of values of the regressor variables wire length (x₁)
and die height (x₂).
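A hedged sketch of fitting and using such a model in Python; the file name and the column names ("strength", "length", "height") are hypothetical stand-ins for the Table 12-2 data, which are not reproduced here.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("wire_bond.csv")                  # hypothetical file holding the example data
fit = smf.ols("strength ~ length + height", data=df).fit()

print(fit.params)                                  # b0, b1, b2 of the fitted equation
new = pd.DataFrame({"length": [8.0], "height": [275.0]})
print(fit.predict(new))                            # predicted pull strength for one (x1, x2) pair
```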

37
Notes on the coefficient of determination

In a study investigating the relationship between head circumference,
gestational age and weight:
• The simple linear regression of head circumference on
gestational age:
Model 1: ŷ = 3.9143 + 0.7801x
• The multiple linear regression of head circumference on
gestational age (x₁) and weight (x₂):
Model 2: ŷ = 8.3080 + 0.4487x₁ + 0.0047x₂
Guess which R² is larger (that of Model 1 or Model 2)?

38
More notes on the coefficient of determination

 Model 1: R² = 0.6095
 Model 2: R² = 0.7520
 The inclusion of an additional variable in a model can never
cause R² to decrease.
 Comparing R² values to judge whether the model has improved is
not appropriate; use the adjusted R² instead.
 The adjusted R² is an estimator of the population correlation (ρ), but it
cannot be interpreted as the proportion of the variability among the
observed values of y that is explained by the linear regression model.
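The usual adjusted R² formula makes the comparison concrete; in the sketch below, n = 100 is an assumed sample size used only to make the arithmetic concrete.

```python
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(0.6095, n=100, k=1))   # Model 1: one regressor
print(adjusted_r2(0.7520, n=100, k=2))   # Model 2: two regressors
```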

39
Comparing Models - Testing R²

Example: A study of the relation between academic achievement
(AA), grades, and general and academic self-concept (N = 103).
Shavelson, text example p. 530 (Table 18.1) (Example 18.1, p. 538).
Comparing Models - Testing R²

• The Model:

Y′ = a + b1X1 + b2X2 + … + bkXk

- The b's are called partial regression coefficients

• Our example - Predicting AA:

Y′ = 36.83 + (3.52)X_ASC + (−0.44)X_GSC

• Example: Predicted AA for a person with GSC of 4 and ASC of 6

Y′ = 36.83 + (3.52)(6) + (−0.44)(4) = 56.23


Multiple Correlation Coefficient (R) and Coefficient
of Multiple Determination (R2)

• R = the magnitude of the relationship between the


dependent variable and the best linear combination of
the predictor variables.

• R2 = the proportion of variation in Y accounted for by


the set of independent variables (X’s).

42
Explaining Variation: How much?

[Diagram: the total variation in Y is partitioned into the variation predictable
by the combination of independent variables and the unpredictable variation.]

Proportion of Predictable and Unpredictable Variation

[Venn diagram with Y = AA, X1 = ASC, X2 = GSC: R² is the predictable
(explained) variation in Y and (1 − R²) is the unpredictable (unexplained)
variation in Y.]
Various Significance Tests

Testing R²
- Test R² through an F test
- Test competing models (difference between R²s) through an F test of the
difference of R²s
Testing β
- Test each partial regression coefficient (β) by a t-test
- Compare partial regression coefficients with each other: a t-test of the
difference between standardized partial regression coefficients (β's)
Example: Testing R²

• What proportion of variation in AA can be predicted from GSC


and ASC?

- Compute R²: R² = 0.16 (R = 0.41): 16% of the variance in AA
can be accounted for by the composite of GSC and ASC.

• Is R² significantly different from 0?

- F test: F_observed = 9.52, F_critical (0.05; 2, 100) = 3.09

- Reject H0: in the population, there is a significant relationship


between AA and the linear composite of GSC and ASC.
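The F reported here can be reproduced from R² alone with the standard formula F = (R²/k) / ((1 − R²)/(n − k − 1)); plugging in the slide's numbers:

```python
r2, n, k = 0.16, 103, 2
f_obs = (r2 / k) / ((1 - r2) / (n - k - 1))
print(f_obs)   # about 9.5, in line with the slide's F_observed = 9.52
```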
Example: Comparing Models - Testing R²

Comparing models
Model 1: Y′ = 35.37 + (3.38)X_ASC
Model 2: Y′ = 36.83 + (3.52)X_ASC + (−0.44)X_GSC
Compute R² for each model
Model 1: R² = r² = 0.160
Model 2: R² = 0.161
Test the difference between the R²s
F_obs = 0.119, F_crit (0.05; 1, 100) = 3.94
 Conclude that GSC does not add significantly to ASC in predicting AA
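The F for the difference between the two R² values follows the standard nested-model formula; with the slide's numbers:

```python
# F = ((R2_full - R2_reduced) / (k_full - k_reduced)) / ((1 - R2_full) / (n - k_full - 1))
r2_reduced, r2_full, n, k_reduced, k_full = 0.160, 0.161, 103, 1, 2
f_obs = ((r2_full - r2_reduced) / (k_full - k_reduced)) / ((1 - r2_full) / (n - k_full - 1))
print(f_obs)   # about 0.12, in line with the slide's F_obs = 0.119
```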
Testing Significance of b’s

H₀: β = 0

t_observed = (b − β) / (standard error of b)

with N − k − 1 df
Example: t-test of b

t_observed = (−0.44 − 0) / 14.24
t_observed = −0.03
t_critical (0.05, two-tailed, df = 100) = 1.97

Decision: Cannot reject the null hypothesis.

Conclusion: The population β for GSC is not significantly
different from 0.
Comparing Partial Regression Coefficients

• Which is the stronger predictor?


- Comparing b_GSC and b_ASC
• Convert to standardized partial regression coefficients (beta weights, β's)
- β_GSC = −0.038
- β_ASC = 0.417
- On the same scale, so they can be compared: ASC is a stronger predictor than GSC.
• Beta weights (β's) can also be tested for significance with t-tests.
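A tiny sketch of the conversion from a raw coefficient b to a beta weight (beta = b × sd of the predictor / sd of the dependent variable); the standard deviations below are invented for illustration.

```python
def beta_weight(b, sd_x, sd_y):
    # Standardized partial regression coefficient from a raw coefficient
    return b * sd_x / sd_y

print(beta_weight(-0.44, sd_x=1.2, sd_y=13.6))   # toy values, not the study's data
```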
