10 Analysis of Relationships

Our interest in this chapter is to determine whether two measurements X and Y taken from
the same sample or population are associated, related, or dependent on each other. We
will measure the strength of the relationship and assess whether we have enough
evidence that the relationship observed in our sample can be generalized to the entire
population the sample came from. The last part of this chapter is regression analysis,
where we use the value of an independent variable to predict or explain variability
among values of a dependent variable.

Expected Learning Outcomes


By the end of this unit, the student must be able to:
• Perform correlation analysis.
• Interpret the magnitude of the association.
• Differentiate dependent from independent variable in a linear model.
• Perform regression analysis.
• Interpret the regression analysis results.

10.1 Measures of Association and Correlation


10.1.1 Pearson Product-Moment Correlation Coefficient
The linear correlation coefficient, denoted by 𝜌 (rho), is a measure of the strength
of the linear relationship existing between two variables, X and Y, that is independent
of their respective scales of measurement.
Assumptions:
1. Both variables are measured at the interval or ratio level
2. There should be no significant outliers
3. The variables should be approximately normally distributed
Remarks:

1. The coefficient takes on values between -1 and 1, inclusive, i.e., -1 ≤ 𝜌 ≤ 1.

2. A positive 𝜌 means that the line slopes upward to the right; a negative 𝜌 means
that it slopes downward to the right.

The University does not authorize you to communicate or reproduce this material. Any distribution or unintended
use of this document may be subject to copyright infringement under RA 8293.
STAT 102 UNIT 10: ANALYSIS OF RELATIONSHIPS

3. When  is 1 or –1, there is perfect linear relationship between X and Y and all
the points (x, y) fall on a straight line. A  close to 1 or –1 indicates a strong
linear relationship, but it does not necessarily imply that X causes Y or Y causes
X. It is possible that a third variable may have caused the change in both x
and y, producing the observed relationship.

4. If  = 0 then there is no linear correlation between X and Y. A value of  =


0, however, does not mean a lack of association. Hence, if a strong quadratic
relationship exists between X and Y, we will still obtain a zero correlation to
indicate a nonlinear relationship.
5. Even though two variables are highly associated or correlated, it is not a
sufficient proof of causation. Causal analysis requires looking into more
complicated properties such as reversibility and strength

The Pearson product-moment coefficient of correlation, denoted by r, is

r = [n ∑XiYi − (∑Xi)(∑Yi)] / √{[n ∑Xi² − (∑Xi)²][n ∑Yi² − (∑Yi)²]}

where all sums run from i = 1 to n.

Remarks:

1. r is used to estimate 𝜌 based on a random sample of n pairs of measurements


(Xi, Yi), i=1,…,n.
2. -1 ≤ r ≤ 1
3. The following rules can be used to verbally describe the strength of correlation
for the absolute value of r:
±0.00 − 0.25 no linear association or weak linear association
±0.26 − 0.50 moderately weak linear association
±0.51 − 0.75 moderately strong linear association
±0.76 − 1.00 strong linear association to perfect linear association

4. Just like , when r = 1 or –1, all the points (xi, yi), i=1,…,n, fall on a straight
line; when r=0, they are scattered and give no evidence of a linear relationship.
Any other value of r suggests the degree to which the points tend to be linearly
related.


Sample Scatterplots with Approximate Values of r:

10.1.2 Spearman Rank Correlation Coefficient


The Spearman rank correlation coefficient (𝑟𝑠) measures the strength and
direction of association between two ranked variables. The data should be at least
ordinal to perform the procedure, and the bivariate relationship should be monotonic.
A monotonic function is one that either never decreases or never increases as its
independent variable increases. The following graphs illustrate monotonic functions:

The Spearman’s rank correlation coefficient is given by:

𝑟𝑠 = 1 − 6 ∑𝑑𝑖² / [n(n² − 1)]
where 𝑑𝑖 = Rank(𝑋𝑖 ) – Rank(𝑌𝑖 ).
Remarks:
1. 𝑟𝑠 takes on values between -1 and 1, inclusive, i.e., -1 ≤ rs ≤ 1.


2. The closer the value is to ±1, the stronger the monotonic relationship. The
following may serve as a guide in interpreting the value of rs:
±0.00 − 0.25 no association or weak association
±0.26 − 0.50 moderately low association
±0.51 − 0.75 moderately high association
±0.76 − 1.00 high association to perfect association

3. The sign (- or +) of rs determines the direction of the relationship: rs = 1
corresponds to a perfectly increasing monotonic relationship, rs = -1 to a
perfectly decreasing one, and rs = 0 to no monotonic relationship.

4. Pearson's correlation is a statistical measure of the strength of a linear
relationship between bivariate data, while Spearman's correlation is a measure of
the strength of a monotonic relationship between bivariate data.
5. For interval-level variables, the Pearson correlation coefficient is initially
considered. However, when its assumptions are not met, Spearman's correlation
is an alternative measure.

10.1.3 Example
Consider the data given below. Let X represent the lot size and Y represent the man
hours required. Construct the scatterplot and compute 𝑟𝑠 and 𝑟.

Observation Lot Size (X) Man Hours (Y)


1 30 73
2 20 50
3 60 128
4 80 170
5 40 87
6 50 108
7 60 135
8 30 69
9 70 148
10 60 132


Solution:
Constructing the scatterplot between lot size (X) and man hours (Y) we have:

To compute r, we need ∑xi, ∑yi, ∑xiyi, ∑xi², and ∑yi².


Obs X Y X*Y X2 Y2
1 30 73 2190 900 5329
2 20 50 1000 400 2500
3 60 128 7680 3600 16384
4 80 170 13600 6400 28900
5 40 87 3480 1600 7569
6 50 108 5400 2500 11664
7 60 135 8100 3600 18225
8 30 69 2070 900 4761
9 70 148 10360 4900 21904
10 60 132 7920 3600 17424
Sum 500 1100 61800 28400 134660

Solving for r,

r = [n ∑xiyi − (∑xi)(∑yi)] / √[(n ∑xi² − (∑xi)²)(n ∑yi² − (∑yi)²)]
  = [10(61800) − (500)(1100)] / √[(10(28400) − 500²)(10(134660) − 1100²)]
  = 0.9978

This means that there is a strong, positive linear correlation between lot size and man
hours. Being positively correlated implies that an increase in lot size is associated with
an increase in man hours needed, and vice versa.
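As a quick check of the hand computation, the same sums and the Pearson formula can be evaluated in a few lines of Python; this is only a sketch using the standard library, not part of the module's required procedure.

```python
import math

# Lot size (X) and man hours (Y) from the table above
x = [30, 20, 60, 80, 40, 50, 60, 30, 70, 60]
y = [73, 50, 128, 170, 87, 108, 135, 69, 148, 132]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

# Pearson product-moment correlation coefficient
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
print(round(r, 4))  # 0.9978
```

The computed sums (∑x = 500, ∑y = 1100, ∑xy = 61800, ∑x² = 28400, ∑y² = 134660) agree with the table, and r rounds to 0.9978 as in the manual solution.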

Computing rs, we rank all x values and all y values from smallest to largest, with
tied values receiving the average of the ranks they occupy (see the table below).

Solving for 𝑟𝑠,

𝑟𝑠 = 1 − 6 ∑d𝑖² / [n(n² − 1)] = 1 − 6(2.5) / [10(10² − 1)] = 0.9848


Obs X Y Rank(Xi) Rank(Yi) di di²


1 30 73 2.5 3 -0.5 0.25
2 20 50 1 1 0 0
3 60 128 7 6 1 1
4 80 170 10 10 0 0
5 40 87 4 4 0 0
6 50 108 5 5 0 0
7 60 135 7 8 -1 1
8 30 69 2.5 2 0.5 0.25
9 70 148 9 9 0 0
10 60 132 7 7 0 0
Sum 2.5

With rs = 0.9848, we arrive at the same conclusion as the Pearson correlation


coefficient that there is a strong monotonic correlation between lot size and man hours.
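The ranking step can likewise be sketched in Python. The helper below assigns average ranks to ties (as done for the two 30s and three 60s in the table); it is an illustrative implementation, not a library routine.

```python
def average_ranks(values):
    """Rank values from smallest to largest; ties receive the average rank."""
    s = sorted(values)
    # first occupied position is index+1, last is index+count; average them
    return [(s.index(v) + 1 + s.index(v) + s.count(v)) / 2 for v in values]

x = [30, 20, 60, 80, 40, 50, 60, 30, 70, 60]
y = [73, 50, 128, 170, 87, 108, 135, 69, 148, 132]
n = len(x)

# di = Rank(Xi) - Rank(Yi), then apply the Spearman formula
d = [rx - ry for rx, ry in zip(average_ranks(x), average_ranks(y))]
rs = 1 - 6 * sum(di ** 2 for di in d) / (n * (n ** 2 - 1))
print(round(rs, 4))  # 0.9848
```

The sum of squared rank differences comes out to 2.5, matching the table, and rs rounds to 0.9848.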

10.2 Testing Hypothesis About the Correlation Coefficient


We can also perform a test of hypothesis for 𝜌. The null hypothesis states that the
underlying linear correlation coefficient is equal to some hypothesized value 𝜌0. This
𝜌0 is usually zero, to determine whether there is a significant linear association
between the variables, but other values important to the study may also be used. The
table below presents the null and alternative hypotheses, the test statistic, and the
corresponding critical regions.

Ho: 𝜌 = 𝜌0

Test Statistic:
t = (r − 𝜌0)√(n − 2) / √(1 − r²),  v = n − 2

Ha: 𝜌 < 𝜌0  Critical Region: t < -tα
Ha: 𝜌 > 𝜌0  Critical Region: t > tα
Ha: 𝜌 ≠ 𝜌0  Critical Region: |t| > tα/2

10.2.1 Example
Given the data in the previous section, test whether the linear correlation between lot
size and man hours is different from zero at 5% level of significance.

Solution:

Let  be the linear correlation coefficient between lot size (X) and man hours (Y).

Step 1. Ho: 𝜌 = 0 versus Ha: 𝜌 ≠ 0


Step 2. α = 0.05
Step 3. Test Statistic:


t = (r − 𝜌0)√(n − 2) / √(1 − r²) = (0.9978 − 0)√(10 − 2) / √(1 − 0.9978²) = 42.5698

Step 4. Decision Rule: Reject Ho if |t| > tα/2 with v = n − 2 = 8 degrees of
freedom, where t0.025(8) = 2.306.
Step 5. Decision: We reject Ho because t = 42.5698 > 2.306.
Step 6. Conclusion: There is sufficient evidence that the linear correlation
coefficient between lot size and man hours is not equal to zero at 5%
level of significance.
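The test statistic in Steps 3 to 5 can be reproduced with a short Python sketch; the critical value 2.306 is taken from the t-table as in the example rather than computed.

```python
import math

r, rho0, n = 0.9978, 0.0, 10

# Test statistic for Ho: rho = rho0 with v = n - 2 degrees of freedom
t = (r - rho0) * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(round(t, 2))  # 42.57

# Tabled critical value t_{0.025} with 8 degrees of freedom (from the example)
t_crit = 2.306
print(abs(t) > t_crit)  # True, so we reject Ho
```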

10.2.2 Correlation Coefficient in Jamovi


Using the software, we can easily calculate the 𝑟 or 𝑟𝑠 by following the steps:

• Click Analyses → Regression → Correlation Matrix.


• Drag the variables into the selection box.
• Under Correlation coefficient, select Pearson and/or Spearman.
• Under Additional options, you may select Report significance to test the
significance of the correlation coefficient.

The output from Jamovi matches the manual calculations shown in the previous
example with lot size and man hours.


10.3 Chi-square Test of Independence


The chi-square test of independence (𝜒²), also called Pearson's chi-square test of
association, is used to determine whether there is a relationship between two
categorical variables. For example, we might test whether a person's music preference
is related to intelligence as measured by IQ. We take a random sample and, for each
subject, determine music preference and classify IQ into categories (high, medium,
low). The observed frequencies are presented in what is known as a contingency table,
shown below:
Music Preference and IQ of 480 Participants

Music IQ
Preference High Medium Low Total
Classical 40 26 17 83
Pop 47 59 25 131
Rock 83 104 79 266
Total 170 189 121 480

A contingency table containing r rows and c columns is referred to as an r×c table.
The row and column totals are called marginal frequencies. Note that in a test for
independence, these marginal frequencies are not fixed in advance but depend instead
on the way the sample distributes itself across the various cells of the table.
Test Procedure:
1. State the null and alternative hypotheses.
Ho: The two variables are independent.
Ha: The two variables are not independent.
2. Choose the level of significance.
3. Compute the test statistic, given by


χ² = ∑i ∑j (Oij − Eij)² / Eij,  i = 1, …, r; j = 1, …, c

where Oij = observed number of cases in the ith row and jth column
Eij = expected number of cases under Ho = (row total × column total) / grand total
4. Decision Rule: Reject Ho if χ² > χ²α,(r−1)(c−1).

Remarks:
1. The test is valid if at least 80% of the cells have expected frequencies of at least
5 and no cell has an expected frequency less than 1.
2. If many expected frequencies are very small, researchers commonly combine
categories of variables to obtain a table having larger cell frequencies.
Generally, one should not pool categories unless there is a natural way to
combine them.
3. For a 2×2 contingency table, a correction called Yates' correction for continuity
is applied. The formula then becomes

χ² = ∑i ∑j (|Oij − Eij| − 0.5)² / Eij

Cramer's V
Cramer's V is a measure of association between two nominal variables. It is often
used as a post-test to determine the strength of association after a chi-square test
has determined significance.

Cramer's V = √( χ² / [n(a − 1)] )

where n = sample size
χ² = observed chi-square value of the r×c table
a = min(r, c)

Remarks:

1. 0 ≤ Cramer’s V ≤ 1.
2. A Cramer’s V value of 0 corresponds to no association between the variables
while a value of 1 corresponds to complete association.


10.3.1 Example
Using the data on musical preference and intelligence, test the hypothesis that music
preference and intelligence are independent. Also, evaluate the degree of association
using Cramer’s V coefficient.

Solution:

Step 1. Ho: Musical preference and intelligence are independent


Ha: Musical preference and intelligence are not independent.
Step 2. α = 0.05
Step 3. To compute the chi-square test statistic, we first compute the expected
frequencies Eij = (row total × column total) / grand total for every cell. The
observed frequencies and expected frequencies (in parentheses) are presented in
the following table.

Music IQ
Preference High Medium Low Total
Classical 40 (29.4) 26 (32.7) 17 (20.9) 83
Pop 47 (46.4) 59 (51.6) 25 (33.0) 131
Rock 83 (94.2) 104 (104.7) 79 (67.1) 266
Total 170 189 121 480

Test Statistic:
χ² = ∑i ∑j (Oij − Eij)² / Eij
= (40 − 29.4)²/29.4 + (26 − 32.7)²/32.7 + (17 − 20.9)²/20.9
+ (47 − 46.4)²/46.4 + (59 − 51.6)²/51.6 + (25 − 33.0)²/33.0
+ (83 − 94.2)²/94.2 + (104 − 104.7)²/104.7 + (79 − 67.1)²/67.1
= 3.822 + 1.373 + 0.728 + 0.008 + 1.061 + 1.939 + 1.332 + 0.005 + 2.110
= 12.378
Step 4. Decision Rule: Reject Ho if χ² > χ²α,(r−1)(c−1), where χ²0.05,(3−1)(3−1) =
χ²0.05,4 = 9.488.
Step 5. Decision: We reject Ho because χ² = 12.378 > 9.488.
Step 6. Conclusion: There is sufficient evidence that music preference and
intelligence are related at 5% level of significance.
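The example also asks for Cramer's V, which the hand computation stops short of. The sketch below computes both the chi-square statistic and Cramer's V in Python. Note that it uses exact (unrounded) expected frequencies, so the statistic comes out slightly larger (about 12.42) than the 12.378 obtained above with expecteds rounded to one decimal place; the conclusion is unchanged.

```python
import math

# Observed frequencies: rows = Classical, Pop, Rock; columns = High, Medium, Low IQ
observed = [[40, 26, 17],
            [47, 59, 25],
            [83, 104, 79]]
row_tot = [sum(row) for row in observed]
col_tot = [sum(col) for col in zip(*observed)]
grand = sum(row_tot)

# Chi-square statistic with exact (unrounded) expected frequencies
chi2 = 0.0
for i in range(3):
    for j in range(3):
        e = row_tot[i] * col_tot[j] / grand
        chi2 += (observed[i][j] - e) ** 2 / e
print(round(chi2, 2))  # 12.42

# Cramer's V with a = min(r, c) = 3
v = math.sqrt(chi2 / (grand * (min(3, 3) - 1)))
print(round(v, 3))  # 0.114
```

A Cramer's V of about 0.11 indicates that, although the association is statistically significant, its strength is weak.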


10.4 Linear Regression Analysis


A simple linear regression is a straightforward approach for predicting a
quantitative variable Y based on a single predictor variable X. It assumes that there
is approximately a linear relationship between X and Y. Mathematically, we can write
this linear relationship using an equation of a line.

The general equation of a straight line is

𝑦 = 𝛽0 + 𝛽1 𝑥
where 𝛽0 is the y-intercept, i.e., the value of y when x = 0, and 𝛽1 is the slope of
the line, i.e., the change in y for a 1-unit increase in x.

10.4.1 Deterministic Model vs. Probabilistic Model


The linear model 𝑦 = 𝛽0 + 𝛽1 𝑥 is said to be a deterministic mathematical model
because, when a value of x is substituted into the equation, the value of y is determined,
and no allowance is made for error.

In contrast, the linear model 𝑦 = 𝛽0 + 𝛽1 𝑥 + 𝜖, where 𝜖 is a random error, is said to
be a probabilistic mathematical model because the model assumes that for any given
value of x the observed value of y varies in a random manner and possesses a
probability distribution with mean E(𝑦|𝑥) = 𝛽0 + 𝛽1 𝑥.

The simple linear regression model is given by

𝑌 = 𝛽0 + 𝛽1 𝑋 + 𝜖

• Y is the response variable. It is also called the dependent or outcome variable.
• X is the explanatory variable or predictor.
• 𝛽0 is the y-intercept; this is the value of Y when X is zero.
• 𝛽1 is the regression coefficient, i.e., how much we expect Y to change for a
one-unit increase in X.
• 𝜖 is the random error of the model; the observed error of estimation is called
the residual, denoted by e.

A linear regression model that involves two or more explanatory variables is called a
multiple regression model, expressed as:

𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + ⋯ + 𝛽𝑝 𝑋𝑝 + 𝜖

Sample Application of Regression Analysis:

1. Sales of a product can be predicted by utilizing the relationship between sales


and amount of TV advertising expenditures. Then we can regress Sales onto
TV by fitting the model:
Sales ≈ β0 + β1 × TV


2. College grade of a student can be predicted by utilizing the relationship between


college grade and entrance exam score. For example, X may represent Entrance
exam score and Y may represent College grade. Then College can be regressed
onto Entrance by fitting the model
𝐶𝑜𝑙𝑙𝑒𝑔𝑒 ≈ β0 + β1 × 𝐸𝑛𝑡𝑟𝑎𝑛𝑐𝑒

Assumptions of the Regression Model


1. The response variable is measured in either interval or ratio level.
2. The relationship is linear between the response and predictor variable.
3. There are no significant outliers in the data.
4. The observations are independent.
5. The data show homoscedasticity. This means that the variance of the residuals is
the same across levels of the explanatory variable.
6. The residuals (errors) of the regression line are approximately normally
distributed.

10.4.2 Fitting a Simple Linear Regression Model


Fitting a simple linear regression model is often aided by software, but for
reference, the manual calculations are also shown in the next section. The following
are the general steps in creating a simple linear regression model:

1. Construct a scatterplot of X versus Y.
2. Obtain the equation that best fits the data using the least squares method or
other methods. Compute the estimates of 𝛽0 and 𝛽1.
3. Evaluate the model found in (2). Test the significance of 𝛽1, and compute the
coefficient of determination.
4. Obtain the residuals. Check whether the assumptions about the error terms, as
manifested by the residuals, are satisfied.
5. If all assumptions are satisfied, interpret the model to explain the relationship
between Y and X.

Estimating 𝛽0 and 𝛽1
The estimated linear regression equation is denoted by:

𝑦̂𝑖 = 𝑏0 + 𝑏1 𝑥𝑖

• 𝑏0 is the estimate of 𝛽0
• 𝑏1 is the estimate of 𝛽1
• 𝑦̂𝑖 is the estimate of 𝑦𝑖


The formulas for b0 (the estimate of 𝛽0) and b1 (the estimate of 𝛽1) are derived
using the method of least squares, where the "best-fitting" line is the one that
minimizes the sum of squared deviations of the observed values of Y from those
predicted by the model. The formulas are

b1 = [n ∑XiYi − (∑Xi)(∑Yi)] / [n ∑Xi² − (∑Xi)²]

b0 = ȳ − b1 x̄
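As an illustration, these least-squares formulas can be applied to the lot size and man hours data from Section 10.1.3; the sketch below uses only the Python standard library.

```python
x = [30, 20, 60, 80, 40, 50, 60, 30, 70, 60]       # lot size
y = [73, 50, 128, 170, 87, 108, 135, 69, 148, 132]  # man hours
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)

# Least-squares estimates of the slope and intercept
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = sum_y / n - b1 * sum_x / n
print(b0, b1)  # 10.0 2.0
```

For these data the estimates come out exactly as b0 = 10 and b1 = 2, so the fitted line is ŷ = 10 + 2x: each additional unit of lot size is associated with two additional man hours, on average.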

Inferences Concerning the Slope of the Line 𝛽1

An estimator for σ² is s² = SSE/(n − 2) = ∑i (yi − ŷi)² / (n − 2), where SSE stands
for the sum of squares of errors. A (1 − α)100% confidence interval for 𝛽1 is

( b1 − tα/2(n−2) sb1 ,  b1 + tα/2(n−2) sb1 )

where sb1 = √[ s² / ( ∑Xi² − (∑Xi)²/n ) ].

Test of Hypothesis Concerning 𝛽1

Ho: 𝛽1 = 0

Test Statistic:
t = b1 / sb1,  v = n − 2

Ha: 𝛽1 < 0  Critical Region: t < -tα
Ha: 𝛽1 > 0  Critical Region: t > tα
Ha: 𝛽1 ≠ 0  Critical Region: |t| > tα/2

Coefficient of Determination
The coefficient of determination, R², is defined as the proportion of the variability
in the observed values of Y that can be explained by X. In simple linear regression,
this coefficient is simply the square of the Pearson correlation coefficient between
X and Y.
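The equivalence R² = 1 − SSE/SST can be checked numerically on the lot size and man hours data, whose least-squares fit works out to b0 = 10 and b1 = 2; this is an illustrative sketch, not part of the module's procedure.

```python
x = [30, 20, 60, 80, 40, 50, 60, 30, 70, 60]
y = [73, 50, 128, 170, 87, 108, 135, 69, 148, 132]
b0, b1 = 10.0, 2.0  # least-squares estimates for these data

y_hat = [b0 + b1 * xi for xi in x]
y_bar = sum(y) / len(y)

sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained variation
sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
r2 = 1 - sse / sst
print(round(r2, 4))  # 0.9956
```

The value 0.9956 agrees with the square of the Pearson correlation computed earlier (0.9978² ≈ 0.9956).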


Diagnostic Checking
Aside from the level of measurements for the response variable which can be easily
verified, there are five other assumptions that need to be checked either graphically or
through formal tests. Among these five assumptions, four are manifested in the
residuals from the model.

A residual 𝑒𝑖 is defined to be the difference between the observed response (𝑦𝑖 ) and the
predicted value (𝑦̂𝑖 ) given the value of the predictor, i.e., 𝑒i = 𝑦𝑖 − 𝑦̂𝑖 .

Diagnostic Check
Check for Informal (Graphical) Formal Tests
1. Linearity Scatter plot matrix of dependent Test for β1
against independent variable
2. Presence of Scatter plot of residuals against Cook’s Distance Measure
Outliers predicted values
3. Normality Normal probability plot Kolmogorov-Smirnov Test
Shapiro-Wilk Test
4. Independence Scatter plot of residuals against Durbin Watson Test
predicted values
5. Homogeneity/ Scatter plot of residuals against Levene’s test
Homoscedasticity predicted values
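Among the formal tests in the table, the Durbin-Watson statistic is simple enough to compute by hand from the residuals. As a sketch, using the lot size and man hours data and its least-squares fit ŷ = 10 + 2x (values near 2 suggest no autocorrelation):

```python
# Residuals from the fitted model y-hat = 10 + 2x
x = [30, 20, 60, 80, 40, 50, 60, 30, 70, 60]
y = [73, 50, 128, 170, 87, 108, 135, 69, 148, 132]
e = [yi - (10.0 + 2.0 * xi) for xi, yi in zip(x, y)]

# Durbin-Watson statistic: sum of squared successive differences over SSE
dw = (sum((e[i] - e[i - 1]) ** 2 for i in range(1, len(e)))
      / sum(ei ** 2 for ei in e))
print(round(dw, 2))  # 2.15
```

A value of 2.15 is close to 2, consistent with no autocorrelation in these residuals; a formal decision would still compare it against the Durbin-Watson bounds or rely on the software's p-value.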

Remarks:

1. It is important to first test the model assumptions before reading the results of
the linear regression analysis.
2. In cases when the assumptions are violated, remedial measures can be adopted.
For example, when normality is not present, data transformations may be
applied.
3. When extending the simple linear regression to multiple independent variables,
multicollinearity, or correlation among these predictors, should be checked by
looking at correlation matrices or by computing the variance inflation factor
(VIF).


10.4.3 Example
Suppose a researcher wishes to investigate the relationship between the achieved grade-
point index (GPI) and the starting salary of recent graduates majoring in business. A
random sample of 30 recent graduates majoring in business is drawn, and the data
pertaining to the GPI and starting salary (in thousands of dollars) are recorded for
each individual in the table below. Perform a simple linear regression analysis.

Starting Starting
Individual GPI Salary Individual GPI Salary
No. (X) (Y) No. (X) (Y)
1 2.7 17.0 16 3.0 17.4
2 3.1 17.7 17 2.6 17.3
3 3.0 18.6 18 3.3 18.1
4 3.3 20.5 19 2.9 18.0
5 3.1 19.1 20 2.4 16.2
6 2.4 16.4 21 2.8 17.5
7 2.9 19.3 22 3.7 21.3
8 2.1 14.5 23 3.1 17.2
9 2.6 15.7 24 2.8 17.0
10 3.2 18.6 25 3.5 19.6
11 3.0 19.5 26 2.7 16.6
12 2.2 15.0 27 2.6 15.0
13 2.8 18.0 28 3.2 18.4
14 3.2 20.0 29 2.9 17.3
15 2.9 19.0 30 3.0 18.5

Solution:

To perform regression analysis in Jamovi, we first encode the data with 3 variables:
ID, GPI, and Salary. ID here only serves as the control number, while GPI and Salary
are treated as the predictor and outcome variable, respectively.

Step 1: Construct a scatterplot of GPI and Salary.

To construct a basic scatterplot in Jamovi, click on Analyses → Exploration →


Scatterplot. Then drag GPI into the X-Axis box and the variable Salary into the Y-
Axis box. Optionally, you may also tick Linear under the Regression Line option
to create a visual representation of the line that best fits the data.

The encoded data and the resulting scatter plot are shown in the following images:


Encoded Data in Jamovi:

Analyses → Exploration → Scatterplot


Step 2: Calculate the Linear Regression Coefficients

To perform simple linear regression in Jamovi, click on Analyses → Regression →


Linear Regression. Then drag the outcome variable Salary into the Dependent
Variable box and the predictor GPI into the Covariates box. Jamovi will automatically
print the model fit measures and model coefficients table in the output window.

Step 3: Evaluate the Model and 𝑅2 Measure

The coefficient of determination, R², is 0.748; this means that about 74.8%
of the variability in salary can be explained or predicted using the information from
GPI. Furthermore, the coefficient of GPI was found to be significantly different from
zero (t = 9.13, p-value < 0.001). This indicates that the predictor can significantly
predict the outcome variable.


Step 4: Check the Assumptions of the Model (Diagnostic Checking)

The diagnostic checks can easily be performed in Jamovi. Simply find the Assumption
Checks drop down option and check the options: Autocorrelation test, Normality test,
Residual plots, and Cook’s distance. The Jamovi command and output window is
shown in the image below:

The results of the diagnostic checks are shown in the output window. Cook's
distance checks for the presence of outliers, the Durbin-Watson test is a formal
test for the presence of autocorrelation, and the Shapiro-Wilk test is a formal test
for normality. The results of the diagnostic tests can be summarized as follows:

• Cook's distance: no extreme distance measures were found; therefore, there is
no potential outlier in the data.
• Durbin-Watson test (p-value > 0.05): there is no significant autocorrelation in
the data.
• Shapiro-Wilk test (p-value > 0.05): there is no significant departure from
normality.
• Scatterplot between residuals and predicted values: a random pattern indicates
constant variance (homoscedasticity).


Residuals and Fitted Values

Step 5: Interpret the Model to Explain the Relationship Between X and Y

The estimated regression model is given by


𝑆𝑎𝑙𝑎𝑟𝑦̂𝑖 = 6.42 + 3.93 𝐺𝑃𝐼𝑖
Interpreting this model, we need to discuss the slope and y-intercept. The intercept
b0 = 6.419 means that if the GPI of an individual were zero, the individual's
starting salary would be estimated at 6.419 thousand dollars, on average. However,
because a GPI of zero does not occur in practice, this interpretation of b0 is not
meaningful. The slope b1 = 3.928 means that for every one-unit increase in the GPI
of an individual, there is a corresponding average increase of 3.928 thousand
dollars in the individual's starting salary.

Suppose a newly graduated individual has a GPI of 3.0. The estimated average
starting salary of the individual is computed as 6.419 + 3.928(3) = 18.203 thousand
dollars. Thus, we predict that the individual's starting salary will be close to this value.
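The prediction step is a direct substitution into the estimated equation, sketched below with the coefficients from the fitted model.

```python
b0, b1 = 6.419, 3.928  # estimated intercept and slope from the fitted model
gpi = 3.0

# Predicted starting salary, in thousands of dollars
predicted_salary = b0 + b1 * gpi
print(round(predicted_salary, 3))  # 18.203
```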

- END OF MODULE 10 -

