Chapter 8-10: Contingency Tables, Correlation and Regression


8. Categorical Data Analysis: Chi-square Non-parametric Test
Categorical variables
❑ Variables measured on nominal or ordinal scales.
A variable with only two levels (values) is called dichotomous (e.g., sex).
A variable with more than two levels (values) is called polytomous (e.g., blood pressure categories).
❑ Continuous (numeric) variables can be converted into categorical variables.


Categorizing continuous variables
❑ Sometimes we require less detail than numerical data afford, so we occasionally transform a numeric variable into a categorical one.
This is achieved by dividing the range of values the numeric variable takes into intervals.
Consider the variable "age": age might be transformed into a new variable, say "age group", which consists of categories (a sketch follows below).
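As an illustration of the interval-dividing step, here is a minimal sketch in Python using pandas; the ages and cut-points are hypothetical, chosen only to show the mechanics:

```python
import pandas as pd

# Hypothetical ages; the cut-points are illustrative, not prescribed by the slides.
ages = pd.Series([23, 35, 47, 52, 61, 70, 18, 44], name="age")

# Divide the range the numeric variable takes into intervals.
age_group = pd.cut(ages,
                   bins=[0, 30, 45, 60, 120],
                   labels=["<30", "30-44", "45-59", "60+"])
print(age_group.value_counts(sort=False))
```

pd.cut returns an ordered categorical variable, consistent with the next slide's point that categorizing a numerical variable generally yields an ordinal variable.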


Continued…
❑ Categorizing continuous variables is common practice in epidemiology, for both descriptive analysis and multivariate modelling.
❑ It is used for both outcome and explanatory variables.
❑ Categorizing a numerical variable generally results in an ordinal categorical variable.


Contingency Table
❑ When working with categorical variables, we often arrange the counts in a tabular format called a contingency table.
❑ If a contingency table involves two dichotomous variables, it is a 2x2 (two-way) table.
❑ A 2x2 contingency table is composed of two rows cross-classified by two columns.
❑ It can be generalized to an r x c contingency table (r rows and c columns).
2x2 Tables - Notation

Group       Outcome Present   Outcome Absent   Total
Group 1     n11 (a)           n12 (b)          n1. (a+b)
Group 2     n21 (c)           n22 (d)          n2. (c+d)
Total       n.1 (a+c)         n.2 (b+d)        n.. (a+b+c+d)
The Contingency Table
Table: Two-Way Classification of a Sample

                  First Criterion of Classification →
Second
Criterion ↓       1      2      3      …      c      Total
1                 N11    N12    N13    …      N1c    R1.
2                 N21    N22    N23    …      N2c    R2.
3                 N31    N32    N33    …      N3c    R3.
…                 …      …      …      …      …      …
r                 Nr1    Nr2    Nr3    …      Nrc    Rr.
Total             C.1    C.2    C.3    …      C.c    N


Observed versus Expected Frequencies
• O_ij: the frequency in the i-th row and j-th column of a contingency table is called an observed frequency; these result from cross-classifying the sample according to the two criteria.
• e_ij: the expected frequencies, under the assumption that the two criteria are independent, are calculated by multiplying the marginal totals of a cell's row and column and dividing by the total frequency:

e_ij = (R_i. × C_.j) / N
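A minimal numpy sketch of this formula, using the folic-acid counts from Example 1 later in the chapter:

```python
import numpy as np

# Observed r x c table (rows = race groups, columns = folic-acid yes/no),
# taken from Example 1 below.
observed = np.array([[260, 299],
                     [ 15,  41],
                     [  7,  14]])

row_totals = observed.sum(axis=1)   # R_i.
col_totals = observed.sum(axis=0)   # C_.j
n = observed.sum()                  # N

# e_ij = (R_i. * C_.j) / N for every cell at once, via an outer product.
expected = np.outer(row_totals, col_totals) / n
print(expected.round(2))
# [[247.86 311.14]
#  [ 24.83  31.17]
#  [  9.31  11.69]]
```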


Chi-square Test
• After calculating the expected frequencies, prepare a table of expected frequencies and use the chi-square statistic:

χ² = Σ_{i=1}^{k} (o_i − e_i)² / e_i

where the summation runs over all k = r × c cells.
• D.F.: the degrees of freedom for using the chi-square table are (r−1)(c−1), at the α level of significance.
• Note that the test is always one-sided (upper tail).
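Continuing the sketch above, the statistic and its one-sided critical value can be computed directly; scipy is assumed to be available:

```python
import numpy as np
from scipy.stats import chi2

observed = np.array([[260, 299], [15, 41], [7, 14]])
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Sum of (o - e)^2 / e over all k = r*c cells.
chi_sq = ((observed - expected) ** 2 / expected).sum()

r, c = observed.shape
df = (r - 1) * (c - 1)

critical = chi2.ppf(0.95, df)   # upper-tail (one-sided) critical value at alpha = 0.05
p_value = chi2.sf(chi_sq, df)
print(round(chi_sq, 3), df, round(critical, 3), round(p_value, 4))
# 9.091 2 5.991 0.0106
```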


TESTS OF INDEPENDENCE
▪ To test whether two criteria of classification are independent.
For example, whether socioeconomic status and area of residence of people in a city are independent.
▪ We divide our sample according to status (low, medium and high income, etc.), and the same sample is categorized according to area (urban, rural, suburban, slums, etc.).
▪ Put the first criterion in columns, equal in number to the classifications of the 1st criterion (socioeconomic status), and the 2nd in rows, where the number of rows equals the number of categories of the 2nd criterion (area of the city).
Tests of association for 2x2 tables
❑ If we have the frequencies in the four cells of a 2x2 table (a, b, c and d), the appropriate test statistic is the chi-squared statistic, defined by:

χ² = Σ (O − E)² / E

❑ where O and E denote the observed and expected frequencies.
❑ The chi-squared test measures the disparity between the observed frequencies (data from the sample) and the expected frequencies.
Assumptions of the χ² test
• No observed frequency is zero.
• No more than 20% of the expected frequencies should be < 5.
• The guideline is that 80% of the cells in the table should have expected frequencies greater than 5, and all cells should have expected frequencies greater than 1.
• Observations should be independent of each other.
• The sample must be large (n > 40): the test is based on a 'large sample' approximation.


χ² table
• The chi-square test is based on the table of χ² values for different degrees of freedom (df).
• If the value of χ² is zero, there is no discrepancy between the observed and the expected frequencies.
• The greater the discrepancy, the larger the value of χ².
• The calculated value of χ² is compared with the tabulated value for the given df.


Degrees of Freedom
[Slides 14-15: figures showing the χ² distribution for different degrees of freedom and the table of critical values]
Example 1
Researchers want to determine whether preconception use of folic acid and race are independent. The data:

Observed Frequencies
          Use of Folic Acid
          Yes     No      Total
White     260     299     559
Black     15      41      56
Other     7       14      21
Total     282     354     636

Expected Frequencies
          Yes                         No                          Total
White     (559)(282)/636 = 247.86     (559)(354)/636 = 311.14     559
Black     (56)(282)/636 = 24.83       (56)(354)/636 = 31.17       56
Other     (21)(282)/636 = 9.31        (21)(354)/636 = 11.69       21
Total     282                         354                         636


Calculations and Testing
• Data: see the table above
• Assumption: simple random sample
• Hypotheses: H0: race and use of folic acid are independent
  HA: the two variables are not independent. Let α = 0.05
• The test statistic is the chi-square given earlier
• Distribution when H0 is true: chi-square with (r−1)(c−1) = (3−1)(2−1) = 2 d.f.
• Decision rule: reject H0 if the value of χ² is greater than χ²_{α, (r−1)(c−1)} = χ²_{0.05, 2} = 5.991
• Calculations:

χ² = (260 − 247.86)²/247.86 + (299 − 311.14)²/311.14 + … + (14 − 11.69)²/11.69 = 9.091
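The same result comes out of scipy's contingency-table helper, a convenient check on the hand calculation:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Folic acid (yes/no) by race, from the observed-frequency table above.
table = np.array([[260, 299],
                  [ 15,  41],
                  [  7,  14]])

chi_sq, p_value, df, expected = chi2_contingency(table)
print(round(chi_sq, 3), df, round(p_value, 4))
# 9.091 2 0.0106 -> reject H0 at alpha = 0.05
```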


Conclusion
• Statistical decision: we reject H0, since 9.091 > 5.991.
• Conclusion: we conclude that H0 is false and that there is a relationship between race and preconception use of folic acid.


Example 2
• A study was conducted to look at the effects of oral contraceptives (OC) on heart disease in women 40 to 44 years of age. Among 5,000 current OC users at baseline, 13 women developed a myocardial infarction (MI) over a 3-year period, whereas among 10,000 non-OC users, 7 developed an MI over the same period.
• Is there a relationship between OC use and MI at the 5% significance level (95% confidence)?


Example….
• 2x2 table for the above example (expected counts in parentheses):

MI status over 3 years
OC-use group     Yes         No               Total
OC users         13 (6.7)    4987 (4993.3)    5,000
Non-OC users     7 (13.3)    9993 (9986.7)    10,000
Total            20          14,980           15,000

• χ² ≈ 9
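A quick verification in scipy; note that for 2x2 tables scipy applies the Yates continuity correction by default, which matches the SPSS advice on the following slides:

```python
import numpy as np
from scipy.stats import chi2_contingency

# MI over 3 years (yes/no) by OC-use group.
table = np.array([[13, 4987],
                  [ 7, 9993]])

# correction=True (the default) applies the Yates continuity correction for 2x2 tables.
chi_corr, p_corr, df, expected = chi2_contingency(table, correction=True)
chi_raw, p_raw, _, _ = chi2_contingency(table, correction=False)

print(round(chi_raw, 2), round(chi_corr, 2))   # approx. 9.04 uncorrected, 7.67 corrected
print(round(p_raw, 4), round(p_corr, 4))       # both p-values are well below 0.05
```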
Using SPSS
1. Analyze ➔ Descriptive Statistics ➔ Crosstabs
Under Crosstabs:
– Put the dependent variable in "Columns" and the independent variable in "Rows".
– Click 'Statistics' and tick 'Chi-square'.
– Click 'Cells' and tick 'Row' under Percentages.


Analyze ➔ Descriptive Statistics ➔ Crosstabs
Put the independent variables in "Rows" and the dependent variable in "Columns".
Under 'Statistics', tick 'Chi-square' and 'Risk'.


Analyze ➔ Descriptive Statistics ➔ Crosstabs
Under 'Cells', tick 'Row' under Percentages.


gender * depression diagnosis Crosstabulation

                            depression diagnosis
gender                non-case    case      Total
female   Count        497         358       855
         % within     58.1%       41.9%     100.0%
male     Count        420         160       580
         % within     72.4%       27.6%     100.0%
Total    Count        917         518       1435
         % within     63.9%       36.1%     100.0%

(Compare the percentages between the different exposure groups.)

Chi-Square Tests
                               Value      df   Asymp. Sig. (2-sided)   Exact Sig. (2-sided)   Exact Sig. (1-sided)
Pearson Chi-Square             30.571b    1    .000
Continuity Correction a        29.955     1    .000
Likelihood Ratio               31.089     1    .000
Fisher's Exact Test                                                    .000                   .000
Linear-by-Linear Association   30.550     1    .000
N of Valid Cases               1435
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 209.37.

Which χ² to report:
• If the table is of 2x2 format, take the χ² under Continuity Correction.
• If it is 2x(>2), take the χ² under Pearson Chi-Square.
• If any cell has an expected count < 5, choose the Likelihood Ratio or Fisher's Exact test.
• If the dependent variable is ordinal, choose Linear-by-Linear Association.
9. Correlation and Regression


Correlation and Regression
• Correlation
– Establishes the relationship between two quantitative variables without assuming a cause-and-effect relationship.
• Correlation is the method of analysis to use when studying the possible association between two quantitative variables.
– It measures the degree of linear relationship between two variables.
• If we want to measure the degree of association, calculating the correlation coefficient does this.


Correlation Analysis…
• A correlation coefficient is calculated as the measure of the strength of this relationship.
– Its symbol is 'r' and its value lies between −1 and +1.
• The standard method (Pearson correlation) leads to a quantity called r that can take on any value from −1 to +1.
– This correlation coefficient r measures the degree of 'straight-line' association between the values of the two variables.
• Thus a value of +1.0 or −1.0 is obtained if all the points in a scatter plot lie on a perfectly straight line.


Pearson’s Correlation Coefficient
• It quantifies the strength of the linear relationship between the outcome and explanatory variables, x and y.
• The estimator of the population correlation ρ is known as Pearson’s coefficient of correlation, or the correlation coefficient (r).


Pearson’s Correlation Coefficient
• It is also called the product-moment correlation coefficient.
• The sample correlation coefficient is denoted by r:

r = (1/(n−1)) Σ_{i=1}^{n} [(x_i − x̄)/s_x] [(y_i − ȳ)/s_y] = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / [(n−1) s_x s_y]

• s_x and s_y are the sample standard deviations of the x and y values. Equivalently:

r = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / √( [Σ_{i=1}^{n} (x_i − x̄)²] [Σ_{i=1}^{n} (y_i − ȳ)²] ) = ρ̂
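A short Python sketch of the formula, checked against scipy; the paired values are made-up illustration data:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical paired measurements.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Direct implementation of the definition above.
num = ((x - x.mean()) * (y - y.mean())).sum()
den = np.sqrt(((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum())
r_manual = num / den

r_lib, p_value = pearsonr(x, y)   # library version, also returns a p-value
print(r_manual, r_lib)            # the two agree
```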


Pearson’s Correlation Coefficient
• The correlation coefficient is a dimensionless number; it has no units of measurement.
– −1 ≤ r ≤ 1
– The values r = 1 and r = −1 occur when there is an exact linear relationship between x and y.
– If y tends to increase in magnitude as x increases, r is greater than 0; x and y are said to be positively correlated (r > 0).


Pearson’s Correlation Coefficient
– If y decreases as x increases, r is less than 0 and the two variables are negatively correlated (r < 0).
– If r = 0, there is no linear relationship between x and y and the variables are uncorrelated.
– r = +1 ⇒ exact straight-line relationship with positive slope.
– r = −1 ⇒ exact straight-line relationship with negative slope.
– r = 0 ⇒ completely random scatter (no linear correlation).
The result of analysis
• Pearson’s Correlation Coefficient (r)
– Tells you two things about the relationship:
1. Strength?
2. Direction?
– Also, the p-value:
3. Significant?



1. Strength
• How strong is the relationship?
• Look at the value of r (Pearson correlation)
• How big is the number?
– 1.0 (-1.0) = Perfect Correlation
– 0.60 to 0.99 (-0.60 to -0.99) = Strong
– 0.30 to 0.59 (-0.30 to -0.59) = Moderate
– 0.01 to 0.29 (-0.01 to -0.29) = Weak
– 0 = No Correlation (no linear correlation)


2. Direction
• What is the direction of the relationship?
– Look at the sign of r
– Positive (+)
• Both variables move in the same direction
• If one goes up, the other goes up too.
• Or, if one goes down, the other goes down too.
– Negative (−)
• Both variables move in opposite directions
• If one goes up, the other goes down.
• Or, if one goes down, the other goes up.
3. Significant
• Significance is indicated by the p-value.
• When the p-value is below 0.05, we consider the correlation statistically significant.


Summary: Features of r
➢ Unit free

➢ Range between -1 and 1

➢ The closer to -1, the stronger the negative linear


relationship

➢ The closer to 1, the stronger the positive linear relationship

➢ The closer to 0, the weaker the linear relationship



Scatter Plot Examples
[Figures: scatter plots illustrating linear versus curvilinear relationships, strong versus weak relationships, no relationship at all, and approximate r values: r = −1, r = −0.6, r = 0, r = +0.3, r = +1]
Scatter plot
It is useful because it:
• Depicts the pattern of the data on the plane.
• Shows whether x and y are related or independent.
• Shows the ranges of x and y.
• Shows unusual observations.
• Suggests what type of model you can fit to the data.


Example
• Example: the correlation between temperature in °F and reduction in pulse rate (beats/minute), with r = −0.941.
• In this sample:
– Strong linear relationship
– Negative association: as temperature increases, pulse rate (beats/minute) is reduced
– The correlation coefficient merely tells us that a linear relationship exists between the two variables; it does not specify whether the relationship is cause-and-effect.


Limitations of the correlation coefficient
• It quantifies only the strength and direction of the linear relationship between two variables.
➢ When the magnitude of the effect is needed, this leads to regression analysis.
• Care must be taken when the data contain any outliers, or pairs of observations that lie considerably outside the range of the other data points.
• A high correlation between two variables does not imply a cause-and-effect relationship.


Use of SPSS for correlation analysis
• Pearson's correlation is used when the two variables
– are continuous, and
– are symmetrically distributed.
• Therefore, we should test the variables for symmetry.
• If they satisfy symmetry, we can analyze them using the Pearson correlation matrix.


Steps in Correlation Analysis
• Step 1. Check the data so that the following assumptions are met:
– the two variables X and Y are random;
– observations (measurements) are independent;
– the underlying population from which the sample is taken is normally distributed.
• Step 2. Construct the scatter plot (NO EXCUSE). This provides a visual impression of the relationship between the variables.


Steps in Correlation Analysis….
• Step 3. State the null and alternative hypotheses
H0: ρ = 0, i.e. no linear relationship exists between the two variables
HA: ρ ≠ 0, or ρ > 0, or ρ < 0
• Step 4. Test statistic: check first that the assumptions are met, estimate ρ by r, and compute

t = r √(n − 2) / √(1 − r²)

which follows a t-distribution with df = n − 2 when H0 is true.
• Step 5. Conclusion
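A sketch of Step 4 in Python; the t formula is the standard one quoted above, and the sample size n = 12 used for the temperature/pulse example is an assumption for illustration only:

```python
import numpy as np
from scipy.stats import t

def correlation_test(r, n):
    """t = r*sqrt(n-2)/sqrt(1-r^2), compared to a t distribution with df = n-2."""
    t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
    df = n - 2
    p_two_sided = 2 * t.sf(abs(t_stat), df)
    return t_stat, df, p_two_sided

# r = -0.941 is the temperature/pulse correlation quoted earlier;
# n = 12 is an assumed sample size, not given on the slides.
print(correlation_test(-0.941, n=12))
```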
Simple Example
[Slides 47-48: a small worked data set and the hand calculation of r]
• We would calculate r ≈ +1.
• Nearly perfect positive correlation!


SPSS….
• Analyze ➔ Correlate ➔ Bivariate
In the Bivariate dialog:
1st: select the continuous variables.
2nd: click the arrow to move them into the Variables box.
3rd: select Pearson (or make sure it is selected).
Finally, click OK to see the result.
In the output, the "Pearson Correlation" value is r.




10. Introduction to Regression Analysis


Regression Analysis:
• Establishes the relationship between two variables when a cause-and-effect relationship is present.
• That means, biologically, the independent (explanatory) variable should be a cause, while the response should be an effect (outcome).


Regression Analysis….
• Regression analysis gives information on the relationship between a response (dependent) variable and one or more predictor (independent) variables, to the extent that such information is contained in the data.
• The goal of regression analysis is to express the response variable as a function of the predictor variables.


Regression Analysis…
• Regression analysis is used to:
– Predict the value of a dependent variable based on
the value of at least one independent variable
– Explain the impact of changes in an independent
variable on the dependent variable
Dependent variable: the variable we wish to explain
Independent variable: the variable used to explain the
dependent variable



Types of regression
• Linear regression
– SLR (simple linear regression)
– MLR (multiple linear regression)
• Logistic regression


Simple Linear Regression (SLR)
• Like correlation, it explores the nature of the relationship between two continuous variables.
• The main difference is that regression looks at the change in one variable (the response, outcome or dependent variable) that corresponds to a given change in the other (the explanatory, predictor or independent variable).


SLR …. Cont’d
➢ Only one independent variable, x, and one dependent variable, Y.
➢ The relationship between x and Y is described by a linear function.
➢ Changes in y are assumed to be caused by changes in x.


Population Linear Regression
The population regression model:

y = β0 + β1·x + ε

where y is the dependent variable, β0 the population y-intercept, β1 the population slope coefficient, x the independent variable, and ε the random error term (residual). β0 + β1·x is the linear component; ε is the random error component.
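To make the components concrete, here is a small simulation of the population model; the parameter values are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

beta0, beta1, sigma = 10.0, 2.5, 3.0      # assumed population parameters

x = rng.uniform(0, 10, size=100)          # independent variable
eps = rng.normal(0, sigma, size=100)      # random error (residual) term
y = beta0 + beta1 * x + eps               # linear component + random error
```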


Linear regression
• Linear regression is a basic and commonly used type of predictive analysis. The overall idea of regression is to examine two things:
– Does a set of predictor variables do a good job of predicting an outcome (dependent) variable?
– Which variables in particular are significant predictors of the outcome variable, and in what way do they (as indicated by the magnitude and sign/direction of the beta estimates) impact the outcome variable?
• These regression estimates are used to explain the relationship between one dependent variable and one or more independent variables.
What is “Linear”?
• Remember this:
• Y = mX + B?
Example: linear relationship (e.g., Y = cholesterol versus X = age)

Y = b0 + b1X
b0 is the intercept,
b1 is the slope.

Prediction
If you know something about X, this knowledge helps you predict something about Y. (Sound familiar? … sound like conditional probabilities?)
Predicted value for an individual…

y_i = β0 + β1·x_i + random error_i

The fixed part, β0 + β1·x_i, lies exactly on the line; the random error follows a normal distribution.
Linear Regression Model: meaning of β0 and β1
β1 > 0 [positive slope], β1 < 0 [negative slope]
β1 = slope (= rise/run)
β0 = y-intercept
Linear regression
• This model, known as the regression line (average), is the equation of a straight line.
• The parameters β0 and β1 are constants called the coefficients of the equation; β0 is the y-intercept of the line and β1 is its slope.
• The y-intercept is the mean value of the response y when x is equal to 0.
• The slope is the change in the mean value of y that corresponds to a one-unit increase in x. If β1 is positive, µ_{y|x} increases in magnitude as x increases; if β1 is negative, µ_{y|x} decreases as x increases.
The method of least squares – parameter estimation
• In the regression model, two variables X and Y are of interest; X is the explanatory variable and Y the response.
– Suppose we were to draw an arbitrary line through the scattered data of the variables X and Y.
– Lines sketched by two different individuals are unlikely to be identical, even though both might attempt to show the same trend.
– The question then arises as to which line best describes the relationship between X and Y, i.e. fits the data well: one mathematical technique for fitting a straight line to a set of points is known as the method of least squares.
The Method of Least Squares
• The difference between the observed score Y and the predicted score Ŷ is known as the error of estimation.
• The regression line, i.e. the line which best fits the given pairs of scores, is the line for which the sum of the squares of these errors of estimation (Σe_i²) is minimized.
Using the sample data:
The dependent variable, y, is modeled as

ŷ = β̂0 + β̂1·x

where
o β̂0 is the intercept or constant;
o β̂1 is the gradient or slope;
o these parameters are referred to as the regression coefficients.


• The least-squares estimates are:

β̂1 = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)²,  β̂0 = ȳ − β̂1·x̄

• The estimators β̂0 and β̂1 serve as point estimates of the regression parameters β0 and β1.
• With a different sample, the estimates would change.
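A minimal sketch of these point estimates, applied to simulated data like that in the earlier sketch; with a different random sample, the printed estimates would change, as noted above:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 10.0 + 2.5 * x + rng.normal(0, 3.0, size=100)   # true beta0 = 10, beta1 = 2.5

# beta1_hat = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
# beta0_hat = ybar - beta1_hat * xbar
b0 = y.mean() - b1 * x.mean()

print(b0, b1)   # close to, but not exactly, 10.0 and 2.5
```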
Explained versus Unexplained Variability
[Figure: scatter of the response Y against the predictor X, with the fitted line and Ȳ marked]
• Total variability: Y_i − Ȳ
• Unexplained variability: Y_i − Ŷ_i
• Explained variability: Ŷ_i − Ȳ
Least squares criterion
[Figure: the fitted line minimizes the sum of squared vertical deviations, Σ(y_i − ŷ_i)²]
Which line has the best “fit” to the data?
β̂0 and β̂1 are determined such that the unexplained (residual) variability is minimized.
Assumptions
• Linear regression assumes that…
1. The relationship between X and Y is linear
2. Y is distributed normally at each value of X
3. The variance of Y at every value of X is the
same (homogeneity of variances)
4. The observations are independent

Checking assumptions
• Normality
– Histogram with normality plot
– Checking skewness
– K-S normality test
• Linearity
– Scatter plot (No Excuse)
• Homogeneity (homoscedasticity) of variances
– Residual plot: residuals versus fitted values (see the sketch below)
• It can help us detect outlying observations in the sample.
• It can reveal failure of the homoscedasticity assumption.
• If the residuals do not exhibit a random scatter but instead follow a distinct trend, the true relationship between x and y might not be linear.
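A minimal residuals-versus-fitted sketch with matplotlib, using simulated data as an assumption; a random scatter around zero supports the linearity and equal-variance assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 10 + 2.5 * x + rng.normal(0, 3, 100)   # simulated data for illustration

# Least-squares fit and residuals.
b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
fitted = b0 + b1 * x
residuals = y - fitted

plt.scatter(fitted, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```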
Steps in testing significance in regression analysis
• Step 1. Examine the data
– Y represents the response and X the explanatory variable
– All observations (measurements) must be independent
– The underlying population from which the sample is selected must be normal
• Step 2. Construct a scatter plot
• Step 3. Define the model:
Y = β0 + β1·X
β0 (the Y-intercept) and β1 (the slope) are called the coefficients of the regression equation.
β0 is the mean value of Y when X = 0
β1 is the change in Y that corresponds to a one-unit change in X
Steps…
• Step 4. Estimate the coefficients, ŷ = β̂0 + β̂1·x, using the least squares method.
• Step 5. State the hypotheses
H0: β1 = 0
HA: β1 ≠ 0, or β1 > 0, or β1 < 0
t = β̂1 / SE(β̂1) has a t distribution with df = n − 2 when H0 is true.
A CI for β1 is calculated as β̂1 ± t_{α/2}·SE(β̂1). (A sketch follows below.)
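Steps 4-5 in Python, as a sketch under the assumption of simulated data; it computes SE(β̂1), the t statistic, and the 95% CI:

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(2)
n = 30
x = rng.uniform(0, 10, n)
y = 10 + 2.5 * x + rng.normal(0, 3, n)   # simulated data for illustration

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)

# SE(beta1_hat) = s / sqrt(Sxx), where s^2 = SSE / (n - 2)
s = np.sqrt((residuals ** 2).sum() / (n - 2))
se_b1 = s / np.sqrt(((x - x.mean()) ** 2).sum())

t_stat = b1 / se_b1                       # df = n - 2 under H0: beta1 = 0
p_value = 2 * t.sf(abs(t_stat), n - 2)
half_width = t.ppf(0.975, n - 2) * se_b1  # t_{alpha/2} * SE(beta1_hat)
print(t_stat, p_value, (b1 - half_width, b1 + half_width))
```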
Steps…
▪ Step 6. Interpretation of results
Once we have obtained the values of the intercept and the slope, and have checked the underlying assumptions for linear regression, the model can be used:
i. to predict what value Y is likely to assume given a particular value of X;
ii. to estimate the mean (or expected value) of the subpopulation of Y values assumed to exist at any particular value of X.
Example: Education & Job Prestige
• The actual SPSS regression results for that data:

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .521a   .272       .271                12.40
a. Predictors: (Constant), HIGHEST YEAR OF SCHOOL COMPLETED

Coefficients a
Model                                  Unstandardized B   Std. Error   Standardized Beta   t        Sig.
1  (Constant)                          9.427              1.418                            6.648    .000
   HIGHEST YEAR OF SCHOOL COMPLETED    2.487              .108         .521                23.102   .000
a. Dependent Variable: RS OCCUPATIONAL PRESTIGE SCORE

Estimates of β0 and β1: "Constant" = β̂0 = 9.427; slope for "Year of School" = β̂1 = 2.487
• Equation: Prestige = 9.4 + 2.5 × Education
• A year of education adds about 2.5 points of job prestige.
Example: Slope Hypothesis Test
• In the same SPSS output (Coefficients table above), the t-value and "Sig." (p-value) are for hypothesis tests about the slope: t = 23.102, Sig. = .000 for the education slope.
• Reject H0 if: |t| > critical t (n − 2 df)
• Or, if "Sig." (p-value) is less than α.
Multiple regression
❑ The simple linear regression model is easily extended to the case of two or more explanatory variables. Such a model is called a multiple regression model, and has the form:

y = β0 + β1·x1 + β2·x2 + … + βn·xn

• For example,
– Birth weight = β0 + β1·Maternal age + β2·Sex + β3·Gestational age + β4·Parity
• After fitting a multiple regression model, we obtain a point estimate for each βi and for the intercept β0.
• Interpretation of the βi coefficients is the same as for the slope in simple linear regression: a change in xi of one unit produces a change in y of βi units (holding the other predictors fixed). We can also test the null hypothesis H0: βi = 0 for each coefficient. (A sketch follows below.)
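A sketch of fitting a birth-weight style model with statsmodels; all variable values, and the coefficients used to simulate them, are hypothetical:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 200

# Hypothetical predictors mimicking the example above.
maternal_age = rng.uniform(18, 40, n)
sex = rng.integers(0, 2, n)             # dummy variable: 0 = female, 1 = male
gest_age = rng.uniform(34, 42, n)
parity = rng.integers(0, 5, n)

# Hypothetical outcome generated from assumed coefficients plus noise.
bw = (400 + 5 * maternal_age + 100 * sex + 60 * gest_age + 20 * parity
      + rng.normal(0, 200, n))

X = sm.add_constant(np.column_stack([maternal_age, sex, gest_age, parity]))
fit = sm.OLS(bw, X).fit()
print(fit.summary())   # one beta_i per predictor plus the intercept beta_0;
                       # each t and P>|t| tests H0: beta_i = 0
```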
– Predicted values, also called fitted values, are the values predicted for each case by applying the regression equation to all cases in the analysis.
– Residuals are the differences between the observed values and those predicted by the regression equation.
– Dummy variables are a way of adding the values of a nominal or ordinal variable to a regression equation.
Multiple linear regression: terminology

Y (predicted):           X's (predictor variables):
Response variable        Explanatory variables
Outcome variable         Covariables
Dependent variable       Independent variables
How Good is the Model?
One of the measures of how well the model explains the data is the R² value. Differences between observations that are not explained by the model remain in the error term.
The R² value tells you what percentage of those differences is explained by the model. An R² of 0.68 means that 68% of the variance in the observed values of the dependent variable is explained by the model, and 32% of those differences remains unexplained in the error term.
Coefficient of determination
• Explained variation + unexplained variation = total variation
• The ratio of the explained variation to the total variation measures how well the linear regression line fits the given pairs of scores. It is called the coefficient of determination, and is denoted by r²:

r² = explained variation / total variation

• The explained variation is never negative and is never larger than the total variation. Therefore, r² is always between 0 and 1. If the explained variation equals 0, r² = 0. (A sketch follows below.)
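A direct computation of r² as explained over total variation, on simulated data assumed for illustration; for a least-squares fit with an intercept this equals 1 − SSE/SST:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 50)
y = 10 + 2.5 * x + rng.normal(0, 3, 50)   # simulated data for illustration

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
fitted = b0 + b1 * x

explained = ((fitted - y.mean()) ** 2).sum()   # explained variation
total = ((y - y.mean()) ** 2).sum()            # total variation
r_squared = explained / total                  # always between 0 and 1
print(r_squared)
```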
R-square
• In most cases, the ratio of residual to total variation falls somewhere between these extremes, that is, between 0.0 and 1.0.
• One minus this ratio is referred to as R-square, or the coefficient of determination.
• This value is immediately interpretable in the following manner: if we have an R-square of 0.4, then we know that the variability of the Y values around the regression line is 1 − 0.4 = 0.6 times the original variance.
• In other words, we have explained 40% of the original variability, and are left with 60% residual variability.
• Ideally, we would like to explain most, if not all, of the original variability.
Review: R-Square
• Visually: the deviation of each point from Ȳ is partitioned into two parts, "explained variance" (from Ȳ to the fitted line) and "error variance" (from the fitted line to the observed point).
[Figure: scatter plot with fitted line Y = 2 + .5X illustrating the partition]
Residual Variance and R-square
♣ The R-square value is an indicator of how well the model fits the
data
♣ An R-square close to 1.0 indicates that we have accounted for
almost all of the variability with the variables specified in the
model.

End of SLR & MLR

