Module 7 Data Management Regression and Correlation

MODULE 7

Data Management: Regression and Correlation

7.1 Introduction
In our daily activities, it is necessary that the relationship between variables
be established before a decision is made. For example, the school registrar must
predict the enrollment before preparing the class schedules. One must know the
sequence of the courses to be offered before a feasible flow chart could be
prepared. In this section, we will discuss some commonly used measures of
association that show the linear relationship between two variables such as
correlation analysis. The term “relationship” means that changes in two variables
are associated with each other. This relationship can be directly or inversely
proportional to each other. Moreover, correlation is used to determine if there is
a relationship between two variables and to determine the strength of the
correlation.
Correlation and linear regression can help us deal with the relationship
between two or more continuous variables. We shall study the dependence of
one variable, the dependent variable, on another, the independent variable.
7.2 Learning Outcome
After finishing this module, you are expected to:

1. explain the purpose of correlation coefficients;
2. choose the appropriate correlation coefficients to show the relationship
between two variables;
3. compute the coefficients of correlation and determination;
4. calculate the average correlation between two variables across several
groups of people;
5. define linear regression;
6. give the purpose of linear regression;
7. define the least-squares regression line and the assumptions
underlying the test of significance;
8. use methods of linear regression and correlation to predict the value of
a variable given certain conditions.
7.3 What You Need to Know
7.3.1 What is the purpose of correlation analysis?
In correlation analysis, the purpose is to measure the strength or closeness
of the relationship between the variables. In other words, we would like to know
how strong or weak the relationship between the variables is. Two variables
being associated in a statistical sense does not guarantee the existence of a
causal relationship. Conversely, the existence of a causal relationship usually
does imply correlation. The magnitude of association is measured by the
absolute value of 𝑟, which can range from 0.00 to 1.00; the greater the absolute
value of 𝑟, the stronger the relationship between the two variables.
The two types of variables involved in a relationship are the independent
variable (𝑋) and the dependent variable (𝑌). In correlation analysis, the 𝑋-variable
is the predictor and the 𝑌-variable is the criterion variable.
A correlation is a relationship between two statistical variables measured
from the same population. In this module, we will only consider linear
correlation which comes in three types: positive linear correlation, negative
linear correlation, and zero linear correlation.
A Positive Linear Correlation indicates that high values for one variable
tend to correspond to high values for the second variable or simply, if one value
increases, so does the other. For example, the height vs. weight for adults (For
a normal individual, as the height increases, the weight also increases).
A Negative Linear Correlation indicates that high values for one variable tend
to correspond to low values for the second variable; that is, as one variable
increases, the other decreases. For instance, the age of a vehicle and its
resale price (As the vehicle gets older, the resale price becomes lower).
A Zero Linear Correlation means that no linear relationship exists
between the variables. For example, height and number of years of education (The
height of a person has no bearing on the number of years he or she has spent
in school).
7.3.1.1 Simple Correlation
In simple correlation, only two variables are studied at once. The two
variables are the independent and dependent variables. The independent
variable (𝑋) is the variable that can be controlled or picked. The dependent
variable (𝑌) is the variable that you assume to be dependent on the other
variable. The independent variable is used to predict the dependent variable if
there is a correlation between the two variables.
One way to determine the type of linear correlation between two variables is
by means of a scatter plot. The scatter plot is a graph with the independent
variable at the bottom (along the 𝑥-axis) and the dependent variable along
the side (the 𝑦-axis). For each pair of numbers, we plot a point, but the points
are not connected with a line.
The scatter plot shows if there is a linear correlation between two variables.
We can then determine the type of linear correlation as follows:

1. Positive Linear Correlation - general trend in the plotted points is from
bottom left to top right.
2. Negative Linear Correlation - general trend in the plotted points is from top
left to bottom right.
3. No Linear Correlation - No general trend in plotted points, or a non-linear
trend.

The strength of the linear correlation can be judged by looking at how closely
the points approximate a straight line.
Example 1
The following table shows the Height (X) vs. Weight (Y) measurements (both in
inches) for 10 men:

x 70.8 66.2 71.7 68.7 67.6 69.2 66.5 67.2 68.3 65.6
y 42.5 40.2 44.4 42.8 40.0 47.3 43.4 40.1 42.1 36.0

Interpretation: The scatter plot of these data, prepared in Excel, shows a
positive linear correlation between the variables.

Example 2.
The following table gives the resale value of a car bought in 1970 at
Php200,000.00.
x (year)     1970 1973 1976 1979 1982 1985 1988 1991 1994 1997
y (Php 000)  200  150  145  135  120  100  79   65   54   35

Interpretation: The diagram indicates a negative linear correlation between the
variables.

Example 3.
Below are data on the scores in an examination. Make a scatter plot and
interpret the data.

Interpretation: There is a fairly positive correlation between scores in the
midterm examination and the final examination.

7.3.1.2 Coefficient of Correlation
A more precise method of determining the type and strength of a linear
correlation is to calculate the coefficient of linear correlation 𝑟, also known as
the Pearson Product-Moment Correlation Coefficient, for the two variables using
the formula:

$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$$

where 𝑛 is the number of ordered pairs.
The coefficient of linear correlation will always be a number between −1.00
and 1.00, with a positive value indicating a positive correlation and a negative
value a negative correlation. A coefficient of 𝑟 = 1.00 for a data set indicates
perfect positive linear correlation, and 𝑟 = −1.00 indicates perfect negative linear
correlation, while 𝑟 = 0 indicates no linear correlation. The closer the value
of 𝑟 is to ±1, the stronger the correlation, and the closer to zero, the weaker the
correlation.
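
As an illustration, the sums-based form of Pearson's 𝑟, with numerator 𝑛Σ𝑥𝑦 − (Σ𝑥)(Σ𝑦) and denominator √([𝑛Σ𝑥² − (Σ𝑥)²][𝑛Σ𝑦² − (Σ𝑦)²]), can be sketched in Python. The function name `pearson_r` and the toy data are assumptions made for this sketch and are not part of the module.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation using the sums-based formula."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical toy data (not from the module)
r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])     # points on a rising line
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])   # points on a falling line
print(r_up, r_down)  # 1.0 -1.0
```

The sketch does not handle the case where all 𝑥 values (or all 𝑦 values) are equal; the denominator is then zero and 𝑟 is undefined.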

Example 4.
Scores of students in the Midterm and Final Examinations were gathered.
The teacher wants to find the strength of linear relationship between the Midterm
scores and the Final Term scores. What is the coefficient of linear correlation?

Midterm score (𝑥)   Final Term score (𝑦)
73                  70
86                  80
93                  96
92                  85
72                  68
65                  68
58                  62
75                  78

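Solution.
The scatter plot in the example suggests that a positive correlation exists
between Midterm and Final term scores. To verify, we solve for the coefficient
of correlation.

Steps and actual process and results:

1. Compute 𝑥𝑦, 𝑥², and 𝑦² for each pair of scores, and the column totals.
   𝑛 = 8, Σ𝑥 = 614, Σ𝑦 = 607, Σ𝑥𝑦 = 47,500, Σ𝑥² = 48,236, Σ𝑦² = 46,917

2. Solve for 𝑟 by substituting the totals into the formula.
   Numerator: 8(47,500) − 614(607) = 7,302
   Denominator: √([8(48,236) − 614²][8(46,917) − 607²]) = √(8,892 × 6,887) ≈ 7,825.5
   𝑟 ≈ 7,302 / 7,825.5 ≈ 0.93

From the result, we know that the Midterm score and the Final
term score have a strong positive linear correlation.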
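
The computation for Example 4 can be checked with a short Python sketch. The data are the Midterm and Final Term scores from the table in Example 4; the variable names are chosen here for illustration. The square of 𝑟 (the coefficient of determination) is included as an extra.

```python
import math

# Scores from the table in Example 4
midterm = [73, 86, 93, 92, 72, 65, 58, 75]   # x
final = [70, 80, 96, 85, 68, 68, 62, 78]     # y

n = len(midterm)
sum_x = sum(midterm)
sum_y = sum(final)
sum_xy = sum(x * y for x, y in zip(midterm, final))
sum_x2 = sum(x * x for x in midterm)
sum_y2 = sum(y * y for y in final)

# Sums-based formula for Pearson's r
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator
r_squared = r ** 2   # coefficient of determination

print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)   # 614 607 47500 48236 46917
print(round(r, 2), round(r_squared, 2))       # 0.93 0.87
```

A value of 𝑟 near 0.93 agrees with the reading of the scatter plot: a strong positive linear correlation.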

7.3.2 Regression Analysis
After a relationship between paired data, which are referred to as bivariate
data, has been discovered, one can model the relationship with an equation. One
method of determining a linear relationship for bivariate data is called linear
regression.
In linear regression, we assume that a change in 𝑥 (independent variable)
will lead directly to a change in 𝑦 (dependent variable). Sometimes, we are
interested in predicting the value of 𝑦 from the value of 𝑥. Generally, it is not
logical to believe that 𝑦 caused 𝑥. By convention, we plot the independent variable
along the horizontal axis or the 𝑥-axis and the dependent variable along the
vertical axis or 𝑦-axis.
Furthermore, simple linear regression is similar to correlation in that the
purpose is to measure to what extent there is a linear relationship between two
variables. In particular, the purpose of linear regression is to "predict" the value
of the dependent variable based upon the values of one or more independent
variables. The relationship is summarized by a regression equation consisting of
a slope and an intercept. The slope represents the amount the dependent
variable increases or decreases with unit increase or decrease in the independent
variable and the intercept indicates the value of the dependent variable when the
independent variable takes the value zero.

7.3.2.1 The Least-Squares Regression Line


The least-squares regression line for a set of bivariate data is the line that
minimizes the sum of the squares of the vertical deviations from each data point
to the line.
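
To make "sum of the squares of the vertical deviations" concrete, the following Python sketch computes that quantity for two candidate lines. The data and both candidate lines are hypothetical and chosen only for illustration; neither is claimed to be the least-squares line.

```python
def sum_squared_errors(xs, ys, a, b):
    """Sum of squared vertical deviations between each point and the line y = a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

# Hypothetical toy data (not from the module)
xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]

sse_line_1 = sum_squared_errors(xs, ys, a=2.0, b=0.0)   # candidate line y = 2x
sse_line_2 = sum_squared_errors(xs, ys, a=1.5, b=1.0)   # candidate line y = 1.5x + 1

print(round(sse_line_1, 2), round(sse_line_2, 2))   # 0.1 1.3
```

The first candidate fits these points better because its sum of squared deviations is smaller. The least-squares line is the line for which this sum is as small as possible.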
The least-squares regression line is also called the least-squares line. By
convention, we use the symbol ŷ (pronounced 𝑦-hat) in place of 𝑦 in the equation
of a least-squares line. This also helps us differentiate the line’s 𝑦-values from
the 𝑦-values of the given ordered pairs.
The equation of the least-squares line for 𝑛 ordered pairs
(𝑥1, 𝑦1), (𝑥2, 𝑦2), (𝑥3, 𝑦3), … , (𝑥𝑛, 𝑦𝑛) is ŷ = 𝑎𝑥 + 𝑏, where

$$a = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}$$

and

$$b = \bar{y} - a\bar{x}$$

The notation x̄ represents the mean of the 𝑥 values and ȳ represents the
mean of the 𝑦 values.
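
A minimal Python sketch of these two formulas follows: the slope is 𝑎 = [𝑛Σ𝑥𝑦 − (Σ𝑥)(Σ𝑦)] / [𝑛Σ𝑥² − (Σ𝑥)²] and the intercept is the mean of 𝑦 minus 𝑎 times the mean of 𝑥. The function name `least_squares_line` and the toy data are assumptions made for illustration.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a*x + b."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    a = (n * sum_xy - sum_x * sum_y) / denominator   # slope
    b = sum_y / n - a * (sum_x / n)                  # intercept: mean(y) - a * mean(x)
    return a, b

# Hypothetical toy data lying exactly on the line y = 2x + 1
a, b = least_squares_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # 2.0 1.0
```

When the points lie exactly on a line, as in this toy data, the formulas recover that line's slope and intercept.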

Example 6.
Find the equation of the least-squares line for the ordered pairs in the table
below.
𝑥 𝑦

2.5 3.4
3.0 4.9
3.3 5.5
3.5 6.6
3.8 7.0
4.0 7.7
4.2 8.3
4.5 8.7

Solution.
From the scatter plot in this example, we see that there is a positive
correlation between the two sets of data.

We now proceed with the process of finding the equation of the regression
line.
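Steps and actual process and results:

1. Prepare the columns for 𝑥𝑦 and 𝑥², and find the column totals.
   𝑛 = 8, Σ𝑥 = 28.8, Σ𝑦 = 52.1, Σ𝑥𝑦 = 195.86, Σ𝑥² = 106.72

2. Compute the slope 𝑎.

$$a = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} = \frac{8(195.86) - 28.8(52.1)}{8(106.72) - (28.8)^2} \approx 2.7303$$

3. Find the means of the 𝑥 and 𝑦 values and the 𝑦-intercept 𝑏.
   x̄ = 28.8/8 = 3.6, ȳ = 52.1/8 = 6.5125
   𝑏 = ȳ − 𝑎x̄ ≈ 6.5125 − 2.7303(3.6) ≈ −3.32

4. Round 𝑎 and 𝑏 to the nearest tenth and find ŷ.
   𝑎 ≈ 2.7, 𝑏 ≈ −3.3
   The least-squares line equation is ŷ = 2.7𝑥 − 3.3.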
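
The arithmetic for Example 6 can be verified with a short Python sketch using the ordered pairs from the table in that example. The variable names are illustrative only.

```python
# Ordered pairs from the table in Example 6
xs = [2.5, 3.0, 3.3, 3.5, 3.8, 4.0, 4.2, 4.5]
ys = [3.4, 4.9, 5.5, 6.6, 7.0, 7.7, 8.3, 8.7]

n = len(xs)
sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

# Slope and intercept from the least-squares formulas
a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = sum_y / n - a * (sum_x / n)

print(round(a, 4))               # 2.7303
print(round(a, 1), round(b, 1))  # 2.7 -3.3
```

Rounded to the nearest tenth, the slope and intercept are 2.7 and −3.3, which is the line used in Example 7.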

The regression line is given by the red line in the next figure.

Example 7.
Use the equation of the least-squares line from the previous example to
predict the average 𝑦 values for each of the following 𝑥 values.
a. 2.8
b. 4.8
Solution.
Steps and actual process and results:

1. Substitute the given 𝑥 values into the formula that was obtained.
   a. ŷ = 2.7(2.8) − 3.3 = 4.26
   b. ŷ = 2.7(4.8) − 3.3 = 9.66

2. Round the computed values to the nearest tenth.
   a. ŷ ≈ 4.3
   b. ŷ ≈ 9.7
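
The substitution in Example 7 can be written as a small Python function. The name `predict` is an assumption made for this sketch; the coefficients 2.7 and −3.3 are the rounded values used in the example.

```python
def predict(x, a=2.7, b=-3.3):
    """Predicted y-hat from the rounded least-squares line y-hat = 2.7x - 3.3."""
    return a * x + b

prediction_a = round(predict(2.8), 1)
prediction_b = round(predict(4.8), 1)
print(prediction_a, prediction_b)  # 4.3 9.7
```

Note that 𝑥 = 4.8 lies outside the range of the 𝑥 values in the Example 6 table (2.5 to 4.5), so the second prediction extends the line beyond the observed data and should be read with more caution.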

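Example 8.
Five children aged 2, 3, 5, 7, and 8 years weigh 14, 20, 32, 42, and 44
kilograms, respectively.

a. Find the equation of the regression line of weight (𝑦) on age (𝑥).
b. Based on this data, what is the approximate weight of a six-year-old
child?
Solution.
(a)
Steps and actual process and results:

1. Prepare the table with columns for 𝑥𝑦 and 𝑥², and find the column totals.
   𝑛 = 5, Σ𝑥 = 25, Σ𝑦 = 152, Σ𝑥𝑦 = 894, Σ𝑥² = 151

2. Compute the slope 𝑎.

$$a = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} = \frac{5(894) - 25(152)}{5(151) - (25)^2} = \frac{670}{130} \approx 5.1538$$

3. Find the means of the 𝑥 and 𝑦 values and the 𝑦-intercept 𝑏.
   x̄ = 25/5 = 5, ȳ = 152/5 = 30.4
   𝑏 = ȳ − 𝑎x̄ ≈ 30.4 − 5.1538(5) ≈ 4.63

4. Round 𝑎 and 𝑏 to the nearest tenth and find ŷ.
   𝑎 ≈ 5.2, 𝑏 ≈ 4.6
   The least-squares line equation is ŷ = 5.2𝑥 + 4.6.

(b)
Steps and actual process and results:

1. Substitute the given value into the formula that was obtained.
   ŷ = 5.2(6) + 4.6 = 35.8

2. Round the computed value to the nearest tenth.
   The predicted weight for a six-year-old is about 35.8 kg.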
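
Example 8 can be worked end to end with a short Python sketch, using the ages and weights given in the problem and following the rounding convention of Example 6 (slope and intercept rounded to the nearest tenth before predicting). The variable names are illustrative only.

```python
# Data from Example 8
ages = [2, 3, 5, 7, 8]           # x, in years
weights = [14, 20, 32, 42, 44]   # y, in kilograms

n = len(ages)
sum_x, sum_y = sum(ages), sum(weights)
sum_xy = sum(x * y for x, y in zip(ages, weights))
sum_x2 = sum(x * x for x in ages)

# Slope and intercept from the least-squares formulas
a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = sum_y / n - a * (sum_x / n)

# Round to the nearest tenth, then predict the weight at age 6
a_rounded, b_rounded = round(a, 1), round(b, 1)
predicted_weight = round(a_rounded * 6 + b_rounded, 1)

print(a_rounded, b_rounded)   # 5.2 4.6
print(predicted_weight)       # 35.8
```

If the unrounded slope and intercept are used instead, the prediction is about 35.6 kg; the small difference comes only from rounding the coefficients first.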
