Course Pack Correlation

Chapter 5 CORRELATION AND REGRESSION

This chapter presents the most commonly used techniques for investigating the relationship between two
quantitative variables. The statistical tool used to investigate the strength of the linear relationship between a pair of
variables is called correlation while the tool used to evaluate the relationship of dependent and independent variables in
the equation is called regression. These are vital in the field of economics, business, and other fields of study.

In statistics, we cannot possibly rely on fortune-telling in order to make predictions. An understanding of
correlation and regression analysis will aid us in forecasting.

Lesson 1 Correlation

Learning Competencies:

After this lesson, you should be able to:

• Draw a scatter plot for a set of ordered pairs;
• Compute the correlation coefficient;
• Test the significance of the correlation coefficient; and
• Solve correlation problems using software.

In the previous lesson, you learned about different statistical tools for testing hypotheses:
the z-test, t-test, F-test (ANOVA), and the H-test.

In this lesson, you will learn about the strength of the linear relationship between two
quantitative variables.

A correlation deals with the relationship between two quantitative variables.

Definition

The linear correlation coefficient, denoted by r, measures the strength and the direction of
a linear relationship between two variables. This coefficient is sometimes called the Pearson
product-moment correlation coefficient, since it was developed by the English mathematician and
biostatistician Karl Pearson.

To compute the value of r, the formula is:

r = [nΣxy – (Σx)(Σy)] / √{[nΣx² – (Σx)²] × [nΣy² – (Σy)²]}

where n is the number of pairs of values of the variables;
x are the values of the independent variable; and
y are the values of the dependent variable.

Bivariate data contain two sets of related data.

A dependent variable in an experiment is the variable that is affected by the independent
variable or outside factors.
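
The computation of r can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the course pack: the function name pearson_r and the sample data are assumptions. It uses the sums-based form of the formula, n times the sum of the products minus the product of the sums, divided by the square root of the two corresponding spread terms.

```python
import math


def pearson_r(xs, ys):
    """Return Pearson's r for paired values, using the sums-based formula."""
    if len(xs) != len(ys):
        raise ValueError("x and y must have the same number of values")
    n = len(xs)
    if n < 2:
        raise ValueError("at least two pairs are required")

    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    if denominator == 0:
        raise ValueError("r is undefined when either variable is constant")
    return numerator / denominator


# Hypothetical data: every point lies exactly on the rising line y = 2x + 1.
print(pearson_r([1, 2, 3, 4], [3, 5, 7, 9]))  # prints 1.0
```

The guard on a zero denominator reflects that r cannot be computed when all x values (or all y values) are identical.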
Illustrative Example 1

The following examples identify the independent and dependent variables in the given situations:

1. The time spent by a student in reviewing a lesson can increase his score in an examination.
dependent variable (y) : score in an examination
independent variable (x) : time spent in reviewing a lesson

2. Age affects human stamina.
dependent variable (y) : human stamina
independent variable (x) : age

In statistics, the Pearson product-moment correlation coefficient is a measure of the linear
dependence between two variables x and y, giving a value between +1 and −1 inclusive, where 1 is perfect positive linear
correlation, 0 is no linear correlation, and −1 is perfect negative linear correlation. Calculating r is pretty complex, so we
usually rely on technology for the computations. We focus on understanding what r says about a scatterplot.

Scatterplot of Correlation with Bivariate Data

[Scatterplots: perfect positive correlation, r = +1 and ρ = 1 (where ρ is read as “rho”); high positive correlation]

• A positive linear correlation means that as the values of x increase, the values of y also increase. Likewise, as x
decreases, y also decreases. The variables x and y have a strong positive linear correlation if the value of r is
close to 1. Thus, r = 1 indicates a perfect positive correlation.

[Scatterplots: perfect negative correlation, r = -1 and ρ = -1; low negative correlation]

• A negative linear correlation means that as the values of x increase, the values of y decrease. The variables x
and y have a strong negative linear correlation if the value of r is close to -1. Thus, r = -1 indicates a perfect
negative correlation.

[Scatterplot: no correlation, r = 0 and ρ = 0]

• The variables x and y have a weak positive or negative linear correlation if the value of r is close to 0. Likewise,
r = 0 implies that x and y have no linear correlation.

Pearson’s r Product Moment Correlation Chart

Absolute value of coefficient    Interpretation
1                                Perfect correlation
0.90 – 0.99                      Very high correlation
0.70 – 0.89                      High correlation
0.50 – 0.69                      Moderate correlation
0.30 – 0.49                      Low correlation
0.10 – 0.29                      Negligible correlation
0                                No correlation
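
The chart can be read as a simple lookup. The sketch below is an illustration only: the function name interpret_r is hypothetical, and rounding |r| to two decimals before the lookup is an assumption made here because the chart lists its bands to two decimal places. Values from 0.01 to 0.09 fall in no band of the chart, so the sketch reports them as not covered instead of guessing a label.

```python
def interpret_r(r):
    """Return the chart's interpretation for r, based on |r| rounded to 2 decimals."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")

    # Assumption: round to two decimals to match the precision of the chart.
    size = round(abs(r), 2)

    if size == 1:
        return "Perfect correlation"
    if size >= 0.90:
        return "Very high correlation"
    if size >= 0.70:
        return "High correlation"
    if size >= 0.50:
        return "Moderate correlation"
    if size >= 0.30:
        return "Low correlation"
    if size >= 0.10:
        return "Negligible correlation"
    if size == 0:
        return "No correlation"
    # The chart has no row for 0.01 to 0.09.
    return "Not covered by the chart"


print(interpret_r(-0.85))  # prints High correlation
```

Because the chart uses the absolute value, the sign of r affects only the direction of the relationship, not the label returned here.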
Check your progress:

How can we determine the strength of association based on the Pearson correlation coefficient?

• The stronger the association of the two variables, the closer the Pearson correlation coefficient, r, will be to
either +1 or -1, depending on whether the relationship is positive or negative, respectively. A value of
+1 or -1 means that all the data points lie on the line of best fit; no data points show
any variation away from this line. Values of r between +1 and -1 (for example, r = 0.8 or -0.4) indicate that there
is variation around the line of best fit. The closer the value of r is to 0, the greater the variation around the line of
best fit. Different relationships and their correlation coefficients are shown in the scatterplots above.

Illustrative example 2.

Find the value of the correlation coefficient from the following table:

SUBJECT   AGE X   GLUCOSE LEVEL Y
1         43      99
2         21      65
3         25      79
4         42      75
5         57      87
6         59      81

Step 1: Make a chart. Use the given data, and add three more columns: xy, x², and y².

SUBJECT   AGE X   GLUCOSE LEVEL Y   XY      X²      Y²
1         43      99
2         21      65
3         25      79
4         42      75
5         57      87
6         59      81

Step 2: Multiply x and y together to fill the xy column. For example, row 1 would be 43 × 99 = 4,257.

SUBJECT   AGE X   GLUCOSE LEVEL Y   XY      X²      Y²
1         43      99                4257
2         21      65                1365
3         25      79                1975
4         42      75                3150
5         57      87                4959
6         59      81                4779

Step 3: Take the square of the numbers in the x column, and put the result in the x² column.

SUBJECT   AGE X   GLUCOSE LEVEL Y   XY      X²      Y²
1         43      99                4257    1849
2         21      65                1365    441
3         25      79                1975    625
4         42      75                3150    1764
5         57      87                4959    3249
6         59      81                4779    3481

Step 4: Take the square of the numbers in the y column, and put the result in the y² column.

SUBJECT   AGE X   GLUCOSE LEVEL Y   XY      X²      Y²
1         43      99                4257    1849    9801
2         21      65                1365    441     4225
3         25      79                1975    625     6241
4         42      75                3150    1764    5625
5         57      87                4959    3249    7569
6         59      81                4779    3481    6561

Step 5: Add up all of the numbers in each column and put the result at the bottom of the column. The Greek letter sigma
(Σ) is a short way of saying “sum of.”

SUBJECT   AGE X   GLUCOSE LEVEL Y   XY      X²      Y²
1         43      99                4257    1849    9801
2         21      65                1365    441     4225
3         25      79                1975    625     6241
4         42      75                3150    1764    5625
5         57      87                4959    3249    7569
6         59      81                4779    3481    6561
Σ         247     486               20485   11409   40022

Step 6: Use the following correlation coefficient formula.

r = [nΣxy – (Σx)(Σy)] / √{[nΣx² – (Σx)²] × [nΣy² – (Σy)²]}

From our table:

• Σx = 247
• Σy = 486
• Σxy = 20,485
• Σx² = 11,409
• Σy² = 40,022
• n is the sample size, in our case n = 6

The correlation coefficient =

[6(20,485) – (247 × 486)] / √{[6(11,409) – 247²] × [6(40,022) – 486²]}

= 2,868 / 5,413.27

= 0.5298

The correlation coefficient always lies between -1 and 1. Our result is r = 0.5298, which by the chart above means
the variables have a moderate positive correlation.
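
The arithmetic of Illustrative example 2 can be checked with a short Python sketch. The variable names are chosen here for illustration; the data are the six age and glucose pairs from the table, and the steps follow the column sums of Steps 2 to 6.

```python
import math

# Data from the table in Illustrative example 2.
ages = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]

n = len(ages)                                      # 6
sum_x = sum(ages)                                  # 247
sum_y = sum(glucose)                               # 486
sum_xy = sum(x * y for x, y in zip(ages, glucose)) # 20485
sum_x2 = sum(x * x for x in ages)                  # 11409
sum_y2 = sum(y * y for y in glucose)               # 40022

numerator = n * sum_xy - sum_x * sum_y             # 2868
denominator = math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)                                                  # about 5413.27

r = numerator / denominator
print(round(r, 4))  # prints 0.5298
```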
Illustrative example 3.

The owner of a chain of fruit shake stores would like to study the correlation between atmospheric temperature
and sales during the summer season. A random sample of 12 days is selected with the results given as follows:

Plot the data on a scatter diagram. Does it appear there is a relationship between atmospheric temperature and sales?
Compute the coefficient of correlation. Determine at the 0.05 significance level whether the correlation in the
population is greater than zero.

Solution:

Step 1 : Graph the scatter plot

Step 2 : State the hypotheses

• H₀ : ρ = 0. There is no correlation between atmospheric temperature and total sales of fruit shake.

• Hₐ : ρ ≠ 0. There is a correlation between atmospheric temperature and total sales of fruit shake.

Step 3 : The level of significance is 0.05.

Step 4 : Determine the degrees of freedom and the critical values of t.

df = n – 2 = 12 – 2 = 10 and t = ±2.228

Step 5 : Compute the value of r.

The coefficient of correlation between the atmospheric temperature and total sales, r = 0.93,
indicates a very high positive correlation (very dependable relationship); that is, an increase in
atmospheric temperature is highly associated with an increase in total sales of fruit shake.

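Step 6 : Decision rule:

In order to decide whether the relationship is significant, we need to determine the value of t:

t = r√(n – 2) / √(1 – r²)

With r = 0.93 and n = 12, t ≈ 8.00. Since 8.00 is greater than the critical value 2.228, we reject the null hypothesis.

Step 7 : Conclusion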

Since the null hypothesis has been rejected, we can conclude that there is evidence that shows
significant association between the atmospheric temperature and the total sales of fruit shake.
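
The decision in this example can be reproduced with a minimal Python sketch. It assumes the standard test statistic for a correlation coefficient, t = r√(n – 2) / √(1 – r²) with n – 2 degrees of freedom. The function name t_statistic is hypothetical, and the critical value 2.228 is copied from Step 4 and not computed here.

```python
import math


def t_statistic(r, n):
    """Return t = r * sqrt(n - 2) / sqrt(1 - r**2) for testing a correlation."""
    if n <= 2:
        raise ValueError("n must be greater than 2")
    if abs(r) >= 1:
        raise ValueError("t is undefined when |r| is 1 or more")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


r = 0.93            # rounded value of r from Step 5
n = 12              # number of sampled days
critical_t = 2.228  # two-tailed critical value from Step 4 (df = 10, 0.05 level)

t = t_statistic(r, n)
reject_null = abs(t) > critical_t
print(round(t, 2), reject_null)  # prints 8.0 True
```

Because r is rounded to 0.93 in the text, the value of t obtained from the unrounded data may differ slightly.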

How to find Pearson’s r Correlation Coefficient in Microsoft Excel?

We can use the CORREL function or the Analysis Toolpak add-in in Excel to find the correlation coefficient between
two variables.

• A correlation coefficient of +1 indicates a perfect positive correlation. As variable X increases, variable Y
increases. As variable X decreases, variable Y decreases.

• A correlation coefficient of -1 indicates a perfect negative correlation. As variable X increases, variable Z
decreases. As variable X decreases, variable Z increases.

To use the Analysis Toolpak add-in in Excel to quickly generate correlation coefficients between multiple
variables, execute the following steps.

1. On the Data tab, in the Analysis group, click Data Analysis.

2. Select Correlation and click OK.

3. For example, select the range A1:C6 as the Input Range.

4. Check Labels in first row.

5. Select cell A8 as the Output Range.

6. Click OK.
Result.

Conclusion: variables A and C are positively correlated (0.91). Variables A and B are not correlated (0.19). Variables B and
C are also not correlated (0.11). You can verify these conclusions by looking at the graph.
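
Outside Excel, the same kind of table of pairwise coefficients can be sketched in Python. This is an illustration under stated assumptions: the helper names pearson_r and correlation_matrix are hypothetical, and the data below are made up for the sketch; they are not the values from the worksheet range A1:C6.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from the sums-based formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


def correlation_matrix(columns):
    """Return r for every pair of named columns as a nested dictionary."""
    names = list(columns)
    return {
        a: {b: pearson_r(columns[a], columns[b]) for b in names}
        for a in names
    }


# Hypothetical data, not the worksheet values from the Excel example.
data = {
    "A": [1, 2, 3, 4, 5],
    "B": [2, 1, 4, 3, 5],
    "C": [5, 4, 3, 2, 1],
}

matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, [round(row[other], 2) for other in data])
```

Each variable correlates perfectly with itself, so the diagonal of the table is always 1, and the table is symmetric because the correlation of A with B equals that of B with A.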

Activity 1.

Directions: Read carefully. These questions pertain to correlations and correlation coefficients.

1. r = 0.50
   A  B  C  D  E  F

2. r = 0
   A  B  C  D  E  F

3. r = -0.85
   A  B  C  D  E  F

4. r = 0.92
   A  B  C  D  E  F

5. r = 1
   A  B  C  D  E  F

6. r = -0.48
   A  B  C  D  E  F

7. What does it mean to say that data has a strong negative correlation?
Choose:
   There is no relationship at all between the variables.
   More than half of the variables have a negative value.
   There is a negative cause and effect relationship.
   A linear model with a negative slope is appropriate.

8. Which of the following correlation coefficients represents the strongest linear relationship?
Choose:  0.79  0.36  -0.12  -0.87

9. The relationship between the number of widgets in a package (x-axis) and the length of the package (y-axis), in
inches, is given in the table at the right. The linear correlation coefficient for this relationship is:
[Table: Number of Widgets (values not recoverable)]
Choose:  1  -1  0.5  0

10. Which calculator output shows the strongest linear relationship between x and y?
Choose:

   LinReg          LinReg          LinReg
   y = ax + b      y = a + bx      y = ax + b
   a = 58.135      a = 0.702       a = 0.952
   b = 7.348       b = 24.286      b = 3.45
   r = 0.843       r = 0.8145      r = 0.633

11. Which value of r represents data with a strong positive linear correlation between two variables?
Choose:  0.91  0.42  1.03

12. The table below shows the average target training heart rates, by age, according to the American Heart
Association. Which value represents the linear correlation coefficient between a person's age, in years, and that
person's average target training heart rate, in beats per minute (bpm)? (Round to four decimal places with Age on
x-axis and Heart Rate on y-axis.)

   Age (years)   Average Target Heart Rate (bpm)
   20            135
   30            129
   40            122
   50            115
   60            108
   70            102

Choose:  -0.6652  1.3231  -0.9996  0.9993

Activity 2.
The National Housing Authority (NHA) wants to investigate the relationship between the size of houses and the
rents paid by tenants in Marikina City. The NHA collected the following information on the sizes (in hundreds of square
feet) for eight houses and the monthly rents (in thousands of pesos) paid by the tenants.

Size of House    35   40   50   60   28   34   45   25
Monthly Rent     11   17   18   20    6   10   19    5

Construct a scatter diagram for these data. Determine whether a relationship exists between the sizes of houses and
the monthly rents using the 0.05 significance level.

Activity 3.
