Stat 102 Module 10 FS 23 24
10
Analysis of Relationships
In this chapter, we determine whether two measurements X and Y taken from
the same sample or population are associated, that is, related to or dependent on
each other. We measure the strength of the relationship and assess whether there is
enough evidence that the relationship observed in the sample can be generalized to
the population the sample came from. The last part of this chapter covers regression
analysis, where we use the values of an independent variable to predict or explain
the variability among the values of a dependent variable.
1. The coefficient ρ takes on values between -1 and 1, inclusive, i.e., -1 ≤ ρ ≤ 1.
2. A positive ρ means that the line slopes upward to the right; a negative ρ means
that it slopes downward to the right.
The University does not authorize you to communicate or reproduce this material. Any distribution or unintended
of this document may be subject to copyright infringement under RA 8293.
STAT 102 UNIT 10: ANALYSIS OF RELATIONSHIPS 1
3. When ρ is 1 or –1, there is a perfect linear relationship between X and Y and all
the points (x, y) fall on a straight line. A ρ close to 1 or –1 indicates a strong
linear relationship, but it does not necessarily imply that X causes Y or Y causes
X. It is possible that a third variable caused the change in both X
and Y, producing the observed relationship.
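$$r = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{\left(n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right)\left(n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right)}}$$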
Remarks:
4. Just like ρ, when r = 1 or –1, all the points (xi, yi), i = 1, …, n, fall on a straight
line; when r = 0, they are scattered and give no evidence of a linear relationship.
Any other value of r indicates the degree to which the points tend to be linearly
related.
$$r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}$$
where $d_i = \text{Rank}(X_i) - \text{Rank}(Y_i)$.
Remarks:
1. 𝑟𝑠 takes on values between -1 and 1, inclusive, i.e., -1 ≤ rs ≤ 1.
2. The closer the value is to ±1, the stronger the monotonic relationship. The
following may serve as a guide in interpreting the value of rs:
±0.00 − 0.25 no association or weak association
±0.26 − 0.50 moderately low association
±0.51 − 0.75 moderately high association
±0.76 − 1.00 high association to perfect association
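The guide above can be written as a small lookup. This is only a sketch: the function name `interpret_rs` is hypothetical, and because the table leaves gaps between bands (for example between 0.25 and 0.26), the code assumes that each band extends up to and including its upper bound.

```python
def interpret_rs(rs: float) -> str:
    """Map a Spearman coefficient to the verbal guide given in the table."""
    if not -1.0 <= rs <= 1.0:
        raise ValueError("rs must lie between -1 and 1")
    size = abs(rs)  # the guide is symmetric in sign
    # Assumption: each band is closed at its upper bound.
    if size <= 0.25:
        return "no association or weak association"
    if size <= 0.50:
        return "moderately low association"
    if size <= 0.75:
        return "moderately high association"
    return "high association to perfect association"


print(interpret_rs(0.9848))
print(interpret_rs(-0.40))
```

The sign of rs is ignored for the label because it only tells the direction of the monotonic relationship, not its strength.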
10.1.3 Example
Consider the data given below. Let X represent the lot size and Y represent the man
hours required. Construct the scatterplot and compute 𝑟𝑠 and 𝑟.
Solution:
Constructing the scatterplot between lot size (X) and man hours (Y) we have:
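Solving for r,
$$r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left(n\sum x_i^2 - (\sum x_i)^2\right)\left(n\sum y_i^2 - (\sum y_i)^2\right)}} = \frac{10(61800) - (500)(1100)}{\sqrt{\left(10(28400) - 500^2\right)\left(10(134660) - 1100^2\right)}} = 0.9978$$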
This means that there is a strong, positive linear correlation between lot size and man
hours. Being positively correlated implies that an increase in lot size is associated with
an increase in man hours needed, and vice versa.
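The arithmetic for r can be checked with a few lines of Python. This is a minimal sketch that uses only the sums quoted in the solution (n = 10, Σx = 500, Σy = 1100, Σxy = 61800, Σx² = 28400, Σy² = 134660); the function name `pearson_r_from_sums` is hypothetical.

```python
import math


def pearson_r_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Sample correlation coefficient computed from summary sums."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


# Sums taken from the worked solution for lot size (X) and man hours (Y).
r = pearson_r_from_sums(
    n=10, sum_x=500, sum_y=1100, sum_xy=61800, sum_x2=28400, sum_y2=134660
)
print(round(r, 4))  # 0.9978
```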
Computing for rs, we need to rank all x values and all y values from smallest to largest.
Solving for $r_s$,
$$r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)} = 1 - \frac{6(2.5)}{10(10^2 - 1)} = 0.9848$$
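The same computation can be sketched in Python. The helper names are hypothetical. Since Σd² = 2.5 is not a whole number, the data presumably contain ties; the sketch therefore assumes that tied values receive the average of the ranks they occupy, and with ties the shortcut formula should be read as an approximation.

```python
def average_ranks(values):
    """Rank from smallest to largest; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_from_d2(sum_d2, n):
    """Spearman coefficient from the sum of squared rank differences."""
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


print(average_ranks([10, 20, 20, 30]))        # [1.0, 2.5, 2.5, 4.0]
print(round(spearman_from_d2(2.5, 10), 4))    # 0.9848
```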
10.2.1 Example
Given the data in the previous section, test whether the linear correlation between lot
size and man hours is different from zero at 5% level of significance.
Solution:
Let ρ be the linear correlation coefficient between lot size (X) and man hours (Y).
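The worked steps of this test are not reproduced in this text, so the following is only a sketch of one standard approach, not necessarily the module's own layout. It assumes the usual statistic t = r√(n − 2) / √(1 − r²) with n − 2 degrees of freedom, uses r = 0.9978 and n = 10 from the previous example, and takes the two-sided critical value 2.306 from a t table (α/2 = 0.025, 8 degrees of freedom). The function name is hypothetical.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for Ho: rho = 0, assuming the standard formula (n - 2 df)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


r, n = 0.9978, 10           # values from the lot size / man hours example
t_stat = correlation_t_statistic(r, n)
t_critical = 2.306          # t table value for alpha/2 = 0.025, df = 8
reject_h0 = abs(t_stat) > t_critical
print(round(t_stat, 2), reject_h0)
```

Under these assumptions the statistic is far beyond the critical value, so Ho: ρ = 0 would be rejected at the 5% level of significance.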
The output from Jamovi is similar to the manual calculations shown in the previous
example with lot size and man hours.
                          IQ
Music Preference   High   Medium   Low   Total
Classical            40       26    17      83
Pop                  47       59    25     131
Rock                 83      104    79     266
Total               170      189   121     480
A contingency table containing r rows and c columns is referred to as an r×c table. The
row and column totals are called marginal frequencies. Note that in a test for
independence, these marginal frequencies are not fixed in advance but depend instead
on the way the sample distributed itself across the various cells in the table.
Test Procedure:
1. State the null and alternative hypothesis.
Ho: The two variables are independent
Ha: The two variables are not independent.
2. Choose the level of significance.
3. Compute the test statistic, given by
$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
where $O_{ij}$ = observed number of cases in the ith row of the jth column
$E_{ij}$ = expected number of cases under Ho
$$E_{ij} = \frac{(\text{column total}) \times (\text{row total})}{\text{grand total}}$$
4. Decision Rule: Reject Ho if $\chi^2 > \chi^2_{\alpha,(r-1)(c-1)}$.
Remarks:
1. The test is valid if at least 80% of the cells have expected frequencies of at least
5 and no cell has an expected frequency less than 1.
2. If many expected frequencies are very small, researchers commonly combine
categories of variables to obtain a table having larger cell frequencies.
Generally, one should not pool categories unless there is a natural way to
combine them.
3. For a 2x2 contingency table, a correction called Yates’ correction for continuity
is applied. The formula then becomes
$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{\left(|O_{ij} - E_{ij}| - 0.5\right)^2}{E_{ij}}$$
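The remarks above can be illustrated with a short sketch. It computes the expected frequencies, checks the rule on expected cell sizes (reading the second condition as "no expected frequency below 1"), and applies Yates' correction. All function names are hypothetical, the 3×3 table is the music preference data from this section, and the 2×2 table is made up purely for illustration.

```python
def expected_counts(table):
    """Expected frequency of each cell: (row total)(column total)/(grand total)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def expected_counts_ok(table):
    """At least 80% of expected counts are >= 5 and none is below 1."""
    cells = [e for row in expected_counts(table) for e in row]
    share_at_least_5 = sum(e >= 5 for e in cells) / len(cells)
    return share_at_least_5 >= 0.80 and min(cells) >= 1


def chi_square(table, yates=False):
    """Chi-square statistic; yates=True subtracts 0.5 from each |O - E|."""
    correction = 0.5 if yates else 0.0
    expected = expected_counts(table)
    return sum(
        (abs(o - e) - correction) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


music = [[40, 26, 17], [47, 59, 25], [83, 104, 79]]  # data from this section
two_by_two = [[12, 8], [5, 15]]                      # hypothetical 2x2 table

print(expected_counts_ok(music))
print(round(chi_square(two_by_two), 3))
print(round(chi_square(two_by_two, yates=True), 3))
```

In the hypothetical 2×2 table, the corrected statistic is smaller than the uncorrected one, which is the intended effect of the continuity correction: it makes the test more conservative.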
Cramer’s V
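Cramer’s V is a measure of association between two nominal variables. It is often
used as a post-test to determine the strength of association after the chi-square test
has established significance.
$$\text{Cramer's } V = \sqrt{\frac{\chi^2}{n(a - 1)}}$$
where n is the total number of observations and a is the smaller of the number of
rows and the number of columns.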
Remarks:
1. 0 ≤ Cramer’s V ≤ 1.
2. A Cramer’s V value of 0 corresponds to no association between the variables
while a value of 1 corresponds to complete association.
10.3.1 Example
Using the data on musical preference and intelligence, test the hypothesis that music
preference and intelligence are independent. Also, evaluate the degree of association
using Cramer’s V coefficient.
Solution:
Observed frequencies, with expected frequencies in parentheses:
                              IQ
Music Preference   High         Medium        Low          Total
Classical          40 (29.4)    26 (32.7)     17 (20.9)       83
Pop                47 (46.4)    59 (51.6)     25 (33.0)      131
Rock               83 (94.2)    104 (104.7)   79 (67.1)      266
Total              170          189           121            480
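Test Statistic:
$$\begin{aligned}
\chi^2 &= \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij} - E_{ij})^2}{E_{ij}} \\
&= \frac{(40 - 29.4)^2}{29.4} + \frac{(26 - 32.7)^2}{32.7} + \frac{(17 - 20.9)^2}{20.9} \\
&\quad + \frac{(47 - 46.4)^2}{46.4} + \frac{(59 - 51.6)^2}{51.6} + \frac{(25 - 33.0)^2}{33.0} \\
&\quad + \frac{(83 - 94.2)^2}{94.2} + \frac{(104 - 104.7)^2}{104.7} + \frac{(79 - 67.1)^2}{67.1} \\
&= 3.822 + 1.373 + 0.728 + 0.008 + 1.061 + 1.939 + 1.332 + 0.005 + 2.110 \\
&= 12.378
\end{aligned}$$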
Step 4. Decision Rule: Reject Ho if $\chi^2 > \chi^2_{\alpha,(r-1)(c-1)}$, where
$\chi^2_{\alpha,(r-1)(c-1)} = \chi^2_{0.05,(3-1)(3-1)} = \chi^2_{0.05,4} = 9.488$.
Step 5. Decision: We reject Ho because $\chi^2 = 12.378 > 9.488$.
Step 6. Conclusion: There is sufficient evidence that music preference and
intelligence are related at 5% level of significance.
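The example also asks for Cramer's V, which can be obtained from the same table. The sketch below recomputes the chi-square statistic and then V, assuming that a in the formula is the smaller of the number of rows and columns (here 3). Because the code keeps unrounded expected frequencies, its chi-square value is roughly 12.4 rather than exactly 12.378, which was obtained with expected frequencies rounded to one decimal place.

```python
import math

# Observed frequencies: rows are Classical, Pop, Rock; columns are High, Medium, Low IQ.
observed = [[40, 26, 17], [47, 59, 25], [83, 104, 79]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n  # expected frequency under Ho
        chi2 += (o - e) ** 2 / e

# Assumption: a = min(number of rows, number of columns).
a = min(len(observed), len(observed[0]))
cramers_v = math.sqrt(chi2 / (n * (a - 1)))

critical_value = 9.488  # chi-square table value for alpha = 0.05, df = 4
print(round(chi2, 2), chi2 > critical_value, round(cramers_v, 2))
```

Under these assumptions V comes out at about 0.11. Most conventions would read that as a weak association, even though the test finds the relationship statistically significant.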
𝑦 = 𝛽0 + 𝛽1 𝑥
Where 𝛽0 is the y-intercept, i.e. the value of y when x=0, and 𝛽1 is the slope of the
line, i.e. the change in y for a 1-unit increase in x.
𝑌 = 𝛽0 + 𝛽1 𝑋 + 𝜖
A linear regression model that involves two or more explanatory variables is called
a multiple regression model, expressed as:
𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + ⋯ + 𝛽𝑝 𝑋𝑝 + 𝜖
Estimating 𝛽0 and 𝛽1
The estimated linear regression equation is denoted by:
𝑦̂𝑖 = 𝑏0 + 𝑏1 𝑥𝑖
• 𝑏0 is the estimate of 𝛽0
• 𝑏1 is the estimate of 𝛽1
• 𝑦̂𝑖 is the estimate of 𝑦𝑖
The formulas for b0 (the estimate of β0) and b1 (the estimate of β1) are derived using
the method of least squares, where the “best-fitting” line is the one that minimizes the
sum of squares of the deviations of the observed values of Y from those predicted by
the model. The formulas are
$$b_1 = \frac{n\sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2}$$
$$b_0 = \bar{y} - b_1\bar{x}$$
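An estimator for σ² is
$$S^2 = \frac{SSE}{n-2} = \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n-2}$$
where SSE stands for sum of squares of errors. A (1−α)100% Confidence Interval for β1 is
$$\left(b_1 - t_{\alpha/2}^{(n-2)}\, s_{b_1},\; b_1 + t_{\alpha/2}^{(n-2)}\, s_{b_1}\right)$$
where
$$s_{b_1} = \sqrt{\frac{s^2}{\sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}}}$$
Ho          Test Statistic                Ha          Critical Region
β1 = 0      t = b1 / s_b1, v = n − 2      β1 < 0      t < −t_α
                                          β1 > 0      t > t_α
                                          β1 ≠ 0      |t| > t_{α/2}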
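To see how the least-squares estimates, the standard error of b1, and the confidence interval fit together, here is a minimal sketch. The function name and the five data points are hypothetical and serve only as an illustration; the critical value 3.182 is the t table value for α/2 = 0.025 with n − 2 = 3 degrees of freedom.

```python
import math


def simple_linear_regression(x, y, t_crit):
    """Least-squares estimates, standard error of b1, and CI for beta1."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)

    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = sum_y / n - b1 * sum_x / n

    # SSE: squared deviations of observed y from fitted values.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)
    s_b1 = math.sqrt(s2 / (sum_x2 - sum_x ** 2 / n))

    ci = (b1 - t_crit * s_b1, b1 + t_crit * s_b1)
    return b0, b1, s_b1, ci


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1, s_b1, ci = simple_linear_regression(x, y, t_crit=3.182)
t_stat = b1 / s_b1
print(round(b0, 3), round(b1, 3), round(s_b1, 4))
print(tuple(round(v, 3) for v in ci), round(t_stat, 3))
```

For these made-up numbers the interval contains zero and |t| is below 3.182, so the two-sided test and the confidence interval lead to the same conclusion, as they should.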
Coefficient of Determination
The coefficient of determination, R², is defined as the proportion of the variability in
the observed values of Y that can be explained by X. In simple linear regression, this
coefficient is the square of the correlation coefficient between X and Y.
Diagnostic Checking
Aside from the level of measurements for the response variable which can be easily
verified, there are five other assumptions that need to be checked either graphically or
through formal tests. Among these five assumptions, four are manifested in the
residuals from the model.
A residual 𝑒𝑖 is defined to be the difference between the observed response (𝑦𝑖 ) and the
predicted value (𝑦̂𝑖 ) given the value of the predictor, i.e., 𝑒i = 𝑦𝑖 − 𝑦̂𝑖 .
Diagnostic Check
Check for                           Informal (Graphical)                                 Formal Tests
1. Linearity                        Scatter plot matrix of dependent against             Test for β1
                                    independent variable
2. Presence of Outliers             Scatter plot of residuals against predicted values   Cook’s Distance Measure
3. Normality                        Normal probability plot                              Kolmogorov-Smirnov Test,
                                                                                         Shapiro-Wilk Test
4. Independence                     Scatter plot of residuals against predicted values   Durbin-Watson Test
5. Homogeneity/Homoscedasticity     Scatter plot of residuals against predicted values   Levene’s test
Remarks:
1. It is important to first test the model assumptions before reading the result of
the linear regression analysis.
2. When the assumptions are violated, remedial measures can be adopted.
For example, when normality is not present, data transformations may be
applied.
3. When extending the simple linear regression to multiple independent variables,
multicollinearity, or correlation among the predictors, should be checked by
looking at correlation matrices or by computing the variance inflation factor.
10.4.3 Example
Suppose a researcher wishes to investigate the relationship between the achieved grade-
point index (GPI) and the starting salary of recent graduates majoring in business. A
random sample of 30 recent graduates majoring in business is drawn, and the data
pertaining to the GPI and starting salary (in thousands of dollars) are recorded for
each individual in the table below. Perform a simple linear regression analysis.
Individual No.  GPI (X)  Starting Salary (Y)    Individual No.  GPI (X)  Starting Salary (Y)
1 2.7 17.0 16 3.0 17.4
2 3.1 17.7 17 2.6 17.3
3 3.0 18.6 18 3.3 18.1
4 3.3 20.5 19 2.9 18.0
5 3.1 19.1 20 2.4 16.2
6 2.4 16.4 21 2.8 17.5
7 2.9 19.3 22 3.7 21.3
8 2.1 14.5 23 3.1 17.2
9 2.6 15.7 24 2.8 17.0
10 3.2 18.6 25 3.5 19.6
11 3.0 19.5 26 2.7 16.6
12 2.2 15.0 27 2.6 15.0
13 2.8 18.0 28 3.2 18.4
14 3.2 20.0 29 2.9 17.3
15 2.9 19.0 30 3.0 18.5
Solution:
To perform regression analysis in Jamovi, we first encode the data with 3 variables:
ID, GPI, and Salary. ID here only serves as the control number, while GPI and Salary
are treated as the predictor and outcome variable, respectively.
The encoded data and the resulting scatter plot are shown in the following images:
The coefficient of determination, denoted by 𝑅2, is 0.748. This means that about 74.8%
of the variability in salary can be explained by GPI. Furthermore, the coefficient of
GPI was found to be significantly different from zero (t = 9.13, p-value < 0.001).
This indicates that GPI is a significant predictor of starting salary.
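The reported values can be cross-checked outside Jamovi with a short sketch. The data are typed in from the table in this example. The code uses centered sums of squares (Sxx, Syy, Sxy), which are algebraically equivalent to the formulas given earlier, and the variable names are hypothetical. It is a check on the arithmetic, not a substitute for the Jamovi output.

```python
import math

# GPI (X) and starting salary in thousands of dollars (Y), individuals 1 to 30.
gpi = [2.7, 3.1, 3.0, 3.3, 3.1, 2.4, 2.9, 2.1, 2.6, 3.2,
       3.0, 2.2, 2.8, 3.2, 2.9, 3.0, 2.6, 3.3, 2.9, 2.4,
       2.8, 3.7, 3.1, 2.8, 3.5, 2.7, 2.6, 3.2, 2.9, 3.0]
salary = [17.0, 17.7, 18.6, 20.5, 19.1, 16.4, 19.3, 14.5, 15.7, 18.6,
          19.5, 15.0, 18.0, 20.0, 19.0, 17.4, 17.3, 18.1, 18.0, 16.2,
          17.5, 21.3, 17.2, 17.0, 19.6, 16.6, 15.0, 18.4, 17.3, 18.5]

n = len(gpi)
mean_x = sum(gpi) / n
mean_y = sum(salary) / n

sxx = sum((x - mean_x) ** 2 for x in gpi)
syy = sum((y - mean_y) ** 2 for y in salary)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(gpi, salary))

b1 = sxy / sxx                      # slope
b0 = mean_y - b1 * mean_x           # intercept
r_squared = sxy ** 2 / (sxx * syy)  # square of the correlation coefficient

s2 = (syy - b1 * sxy) / (n - 2)     # SSE / (n - 2)
t_stat = b1 / math.sqrt(s2 / sxx)   # t statistic for Ho: beta1 = 0

predicted_at_3 = b0 + b1 * 3.0      # estimated mean salary at GPI = 3.0
print(round(b0, 3), round(b1, 3), round(r_squared, 3), round(t_stat, 2))
print(round(predicted_at_3, 3))
```

The results should agree with the reported R² of 0.748 and t of 9.13, and with the fitted equation used for prediction in this example, up to rounding in the last digit.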
The diagnostic checks can easily be performed in Jamovi. Simply find the Assumption
Checks drop-down option and check the options: Autocorrelation test, Normality test,
Residual plots, and Cook’s distance. The Jamovi command and output windows are
shown in the image below:
The results of the diagnostic checks are shown in the output window. The Cook’s
distance measure checks for presence of outliers, the Durbin-Watson test is a formal
test for presence of autocorrelation, and the Shapiro-Wilk test for normality. The
results of the diagnostic test can be summarized as follows:
• Cook’s distance
no extreme distance measures were found; therefore, there is no evidence of a
potential outlier in the data.
• Durbin-Watson test
(p-value > 0.05) means that there is no evidence of autocorrelation in the data.
• Shapiro-Wilk test
(p-value > 0.05) means that there is no evidence of departure from normality.
Suppose a new graduate has a GPI of 3.0. The estimated average starting
salary of this individual is computed as 6.419 + 3.928(3) = 18.203 thousand dollars.
Thus, we predict that the individual’s starting salary will be close to this value.
- END OF MODULE 10 -