Correlation and Regression


5. Correlation and Regression


5.1. Introduction
If two variables X and Y vary in such a way that changes in one are accompanied by changes in
the other, then the variables are said to be correlated. Correlation can be zero, negative or
positive.

Examples
Marketers are interested in questions such as: are sales related to advertising? Or is there a
relationship between a person’s age and his or her purchasing power? These are just some of the
questions that can be answered by using the techniques of correlation and regression analysis.

5.1.1 Scatter Diagram or Plots


A scatter diagram is a tool used to investigate the correlation between two variables. The data are
plotted so that the value of one variable, Y, is measured along the Y-axis and the corresponding
value of the other variable, X, is measured along the X-axis.

[Figure: three scatter diagrams, labelled A, B and C, each plotting Y against X]

If all the points on a scatter diagram lie


a). Near a straight line, correlation is linear
b). Near a curve, correlation is non-linear
c). On a straight line, we have perfect linearity

In the three scatter diagrams above,


a). A shows that there is a non-linear relationship between X and Y
b). B shows that there is a positive linear relationship – an increase in X leads to an increase
in Y
c). C shows that there is a negative linear relationship – an increase in X leads to a decrease
in Y

5.2. Correlation Coefficient


The correlation coefficient computed from the sample data measures the strength and direction
of a linear relationship between two variables. The symbol for the sample correlation coefficient
is r. It is the statistic used to measure the strength of the relationship:
o When the points lie close to the line, the dispersion or degree of scatter is low, so a strong
association exists
o When the points are widely dispersed about the line, the association is weak or non-existent

The range of the correlation coefficient is from -1 to +1, written as −1 ≤ r ≤ +1.
a). If r = +1, there is a perfect positive linear relationship between the variables.
b). If r = −1, there is a perfect negative linear relationship between the variables.
c). If r is close to 0, the dispersion is wide and there is little or no linear
relationship between the variables.

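$$ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\left\{ \left( \sum x_i^2 - n\bar{x}^2 \right) \left( \sum y_i^2 - n\bar{y}^2 \right) \right\}^{1/2}} $$

Where
n is the number of data pairs
x̄ is the mean of the x_i's
ȳ is the mean of the y_i's

Or

$$ r = \frac{n\sum xy - \sum x \sum y}{\left[ \left\{ n\sum x^2 - \left(\sum x\right)^2 \right\} \left\{ n\sum y^2 - \left(\sum y\right)^2 \right\} \right]^{1/2}} $$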

5.2.1 Coefficient of Determination


o It is equal to the square of the correlation coefficient, that is, r²
o It shows how much of the variation in the dependent variable can be explained by the independent variable

Example
The following data refers to the amount of money spent by 10 customers who visited a
supermarket in a certain year and their social class index.

Amount spent in supermarket (x), Ksh. in 1,000s    57   54   49   42   38   32   30   24   20   18
Social class index (y)                            113  111  107  103  100   96   94   84   74   76

Calculate
i). Correlation coefficient
ii). Coefficient of determination
Solution
When you are faced with a mathematical or statistical problem that has a formula, check the
parameters you require. In our case for the Correlation coefficient, we need what is in the table
below
        x      y      xy      x²      y²
        57     113    6441    3249    12769
        54     111    5994    2916    12321
        49     107    5243    2401    11449
        42     103    4326    1764    10609
        38     100    3800    1444    10000
        32     96     3072    1024    9216
        30     94     2820    900     8836
        24     84     2016    576     7056
        20     74     1480    400     5476
        18     76     1368    324     5776
Total   364    958    36560   14998   93508

Make sure you have the summations down on the last row of the table as shown above.

We use the Correlation coefficient formula


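$$ r = \frac{n\sum xy - \sum x \sum y}{\left[ \left\{ n\sum x^2 - \left(\sum x\right)^2 \right\} \left\{ n\sum y^2 - \left(\sum y\right)^2 \right\} \right]^{1/2}} $$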

Note that the table above contains only the quantities needed to calculate the value of r.
Substitute the values from the table into the formula.
What is the size (n) of the sample we are using? There are 10 customers, so n = 10.

i).
$$ r = \frac{(10 \times 36560) - (364 \times 958)}{\left\{ \left[ (10 \times 14998) - (364 \times 364) \right] \left[ (10 \times 93508) - (958 \times 958) \right] \right\}^{1/2}} $$

$$ = \frac{16888}{\{17484 \times 17316\}^{1/2}} = \frac{16888}{17399.797} = 0.9706 $$

There is a strong positive relationship between the amount of money spent in the
supermarket and the social class index.

ii). r² = (0.9706)² ≈ 0.942

This means that 94.2% of the variation in the social class index (dependent variable) can be
explained by the variation in the amount of money spent in the supermarket every year
(independent variable), and 5.8% is determined by other factors.
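
The worked example above can be checked with a few lines of code. The following is a minimal Python sketch, not part of the original notes: the function name pearson_r and the variable names are chosen here purely for illustration. It applies the sums formula for r to the supermarket data.

```python
import math

# Data from the supermarket example above.
x = [57, 54, 49, 42, 38, 32, 30, 24, 20, 18]        # amount spent, Ksh. in 1,000s
y = [113, 111, 107, 103, 100, 96, 94, 84, 74, 76]   # social class index


def pearson_r(xs, ys):
    """Correlation coefficient via the sums formula (illustrative helper)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(a * b for a, b in zip(xs, ys))
    sum_x2 = sum(a * a for a in xs)
    sum_y2 = sum(b * b for b in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = pearson_r(x, y)
print(round(r, 4))      # correlation coefficient: 0.9706
print(round(r * r, 3))  # coefficient of determination: 0.942
```

The intermediate sums computed inside the function are the same column totals shown in the last row of the table.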

5.2.2 Rank/ Spearman Correlation


The correlation coefficient above is calculated from the actual values of the variables X and Y in
the sample data.
o However, in some cases, the relative orders of magnitude of these pairs of values are more
instructive than the values themselves.
o It is then more useful to assess the relationship between the ranks of the two variables,
using rank correlation.

The rank correlation is also known as the “Spearman Rank Correlation Coefficient”. It is
applicable when the variables are ranked.

$$ R = 1 - \frac{6\sum d^2}{n(n^2 - 1)} $$

Where d = difference in rank
n = number of pairs of rankings

Example
The table below shows the marks of students for Business Statistics I (Stats 1) and Business
Statistics II (Stats 2). Find R

Stats 1   Rank (Stats 1)   Stats 2   Rank (Stats 2)   d     d²
80        2                80        2                0     0
60        4                50        5                -1    1
65        3                60        3                0     0
50        5                55        4                1     1
35        6                45        6                0     0
30        7                30        7                0     0
90        1                95        1                0     0

∑d² = 2 and n = 7

$$ R = 1 - \frac{6 \times 2}{7(49 - 1)} = 0.9643 $$
There is a high degree of relationship between performances in the two subjects
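
The ranking and the formula can be reproduced in code. This is a minimal Python sketch under one stated assumption: no two students share the same mark, so no tied ranks arise (the formula above does not cover ties). The helper names ranks_descending and spearman_r are illustrative and do not come from the notes.

```python
def ranks_descending(values):
    """Rank 1 goes to the largest value. Assumes there are no tied values."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman_r(xs, ys):
    """Rank correlation R = 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(xs)
    rank_x, rank_y = ranks_descending(xs), ranks_descending(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Marks from the example above.
stats1 = [80, 60, 65, 50, 35, 30, 90]
stats2 = [80, 50, 60, 55, 45, 30, 95]

R = spearman_r(stats1, stats2)
print(ranks_descending(stats1))  # [2, 4, 3, 5, 6, 7, 1]
print(round(R, 4))               # 0.9643
```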
5.3. Regression Analysis
If the value of the correlation coefficient is significant, the next step is to determine the equation
of the regression line which is the data’s line of best fit.
The purpose of the regression line is to enable the researcher to see the trend and make
predictions on the basis of the data.

Regression analysis is a statistical procedure that can be used to develop a mathematical equation
showing how variables are related.
5.3.1 Simple Linear Regression Model
A single variable is used to predict another variable on the assumption of linear relationship,
Y =a+bX
Where
Y , is the dependent or response variable
X , is the independent or explanatory or regressor variable.
a , represents the Y-intercept

b , the slope of the regression line and indicates the amount of change of dependent
variable for a unit change in the independent variable.
5.3.2 Determination of the Regression Line Equation
In algebra, as we considered the topic on graphs, the equation of a line is usually given as
y=mx +b , where m is the slope of the line and b is the y intercept.
The equation of the regression line is written as Y =a+bX . There are several methods for
finding the regression line but we consider one method.

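Formulas for the Regression line

Y = a + bX

$$ b = \frac{n\left(\sum xy\right) - \left(\sum x\right)\left(\sum y\right)}{n\left(\sum x^2\right) - \left(\sum x\right)^2} $$

$$ a = \bar{y} - b\bar{x} \quad \text{or} \quad a = \frac{\sum y}{n} - b\,\frac{\sum x}{n} $$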
Example
i). Find the equation of the regression line for the data below, which were obtained in a study of
age and blood pressure

Subject   Age (x)   Pressure (y)   xy       x²       y²
A         43        128            5,504    1,849    16,384
B         48        120            5,760    2,304    14,400
C         56        135            7,560    3,136    18,225
D         61        143            8,723    3,721    20,449
E         67        141            9,447    4,489    19,881
F         70        152            10,640   4,900    23,104
Total     345       819            47,634   20,399   112,443

Thus ∑x = 345, ∑y = 819, ∑xy = 47,634, ∑x² = 20,399, ∑y² = 112,443 and n = 6

We compute the values of b and a

$$ b = \frac{(6)(47{,}634) - (345)(819)}{6(20{,}399) - (345)^2} = 0.96438 $$

$$ a = \frac{819}{6} - 0.96438 \times \frac{345}{6} = 81.048 $$

Hence, the equation of the regression line is

Y = 81.048 + 0.964X

The regression equation can be used to estimate the pressure given the age.

ii). Find the blood pressure for a person who is aged 50 years. This means that x = 50.

Y = 81.048 + 0.964(50) = 129.25

A person who is 50 years of age is predicted to have a blood pressure of around 129.
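
The same fit and prediction can be sketched in Python. This is an illustrative check, not part of the original notes; the function name fit_line is an assumption made here. It uses the formulas for b and a given earlier, applied to the age and blood pressure data.

```python
# Data from the age and blood pressure example above.
age = [43, 48, 56, 61, 67, 70]
pressure = [128, 120, 135, 143, 141, 152]


def fit_line(xs, ys):
    """Return (a, b) for the regression line Y = a + bX (illustrative helper)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(u * v for u, v in zip(xs, ys))
    sum_x2 = sum(u * u for u in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b


a, b = fit_line(age, pressure)
predicted = a + b * 50  # estimated pressure at age 50

print(round(a, 3), round(b, 5))  # 81.048 0.96438
print(round(predicted, 2))       # 129.27
```

The prediction printed here uses the unrounded slope, which is why it comes out at about 129.27 rather than the 129.25 obtained in the text with b rounded to 0.964. Both round to a pressure of about 129.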

Another method used to determine the regression equation is the method of least squares,
considered in the next part.

5.3.3 Method of Least square


This is the method of fitting the line of best fit. Our estimates of the true values of a and b
leave an error term, or residual, because the line cannot pass exactly through every point (it is
only the best fit).

The fitted line should pass through the points of the scatter diagram in such a manner that the
sum of the squares of the vertical deviations of these points from the line is a minimum.
Since some deviations are negative and others positive, we eliminate the signs by squaring each
deviation, then use the two normal equations to work out the values of a and b.
We have the normal equations

∑y = na + b∑x
∑xy = a∑x + b∑x²
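
The notes state the normal equations without showing where they come from. A brief sketch of the standard derivation follows, where S(a, b) is a symbol introduced here for the sum of squared vertical deviations: setting its partial derivatives with respect to a and b to zero gives the two equations.

```latex
S(a, b) = \sum_{i=1}^{n} \left( y_i - a - b x_i \right)^2

\frac{\partial S}{\partial a} = -2 \sum_{i=1}^{n} \left( y_i - a - b x_i \right) = 0
\quad \Longrightarrow \quad \sum y = na + b \sum x

\frac{\partial S}{\partial b} = -2 \sum_{i=1}^{n} x_i \left( y_i - a - b x_i \right) = 0
\quad \Longrightarrow \quad \sum xy = a \sum x + b \sum x^2
```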
Example
Apply the method of least squares to fit a straight line relationship (Regression of Y on X) for the
following points

x   -2.4   -0.8   0.3   1.9   3.2
y   -5.0   -1.5   2.5   6.4   11.0

Solution
Use the normal equations; first find x² and xy

        x      y      x²      xy      y²
        -2.4   -5.0   5.76    12.0    25.0
        -0.8   -1.5   0.64    1.2     2.25
        0.3    2.5    0.09    0.75    6.25
        1.9    6.4    3.61    12.16   40.96
        3.2    11.0   10.24   35.2    121.0
Total   2.2    13.4   20.34   61.31   195.46

From the table, we have

∑x = 2.2, ∑y = 13.4, ∑x² = 20.34, ∑xy = 61.31, ∑y² = 195.46 and n = 5

Using the normal equations,
∑y = na + b∑x
∑xy = a∑x + b∑x²
and substituting the sums from the table, we obtain
5a + 2.2b = 13.4
2.2a + 20.34b = 61.31
Solving the simultaneous equations, we have
b = 2.861 and a = 1.421

The best straight line for the given values is

Y = 1.42 + 2.86X

This is also called the equation of the regression line of Y on X.
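
The two simultaneous equations can also be solved programmatically. The sketch below is illustrative only and not part of the original notes: it builds the normal equations from the data and solves the resulting 2 × 2 system with Cramer's rule, a technique chosen here for brevity.

```python
# Data from the least squares example above.
x = [-2.4, -0.8, 0.3, 1.9, 3.2]
y = [-5.0, -1.5, 2.5, 6.4, 11.0]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(v * v for v in x)
sum_xy = sum(u * v for u, v in zip(x, y))

# Normal equations:
#   n*a     + sum_x*b  = sum_y
#   sum_x*a + sum_x2*b = sum_xy
# Solved with Cramer's rule for a 2x2 system.
det = n * sum_x2 - sum_x * sum_x
a = (sum_y * sum_x2 - sum_x * sum_xy) / det
b = (n * sum_xy - sum_x * sum_y) / det

print(round(a, 3), round(b, 3))  # 1.421 2.861
```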

Differences between Correlation and Regression


o Regression analysis studies the relationship between the variables, while the coefficient of
correlation is a measure of the degree of relationship between X and Y
o Regression analysis studies both linear and non-linear relationships between variables, while
correlation analysis studies only linear relationships

Exercise (Self Assessment Questions)


1. What is the general form of the regression line used in statistics?
2. What is meant by the term, the variables have a negative relationship?
3. Distinguish between correlation and regression
4. Why is correlation important?
5. Define the term correlation coefficient
6. Given the data below for age and the amount of money spent on buying music CDs in
dollars ($).

Age (x)                    18   26   39   48   53   58
Amount of money ($) (y)    16   12   9    5    6    2

Find the following


a). Correlation coefficient
b). Equation of the regression line
c). Plot the regression line on an X and Y axis
d). Use the equation in (b) above to estimate the amount of money a person of 30 years
will spend on buying music CDs.
e). Compute the regression line using the method of least squares. Does the method give a
similar line of regression as in (b) above for the same data?

6. Time Series Analysis


6.1 Introduction
Economic and business conditions vary over time. So business managers must find ways to keep
up to date with the effects such changes will have on their planning and strategic process of their
organizations. One technique used is forecasting – making predictions of future events using
historical data. Thus, time series analysis is used for forecasting.

6.1.1 Definition of Time Series


A time series is a set of quantitative data obtained at regular periods over time. For
example, the sales volume over the last 3 years for a soft drink company, the number of cinema
attendees over the last 24 months at Kenya Cinema, etc. Our main point of focus is the variability
exhibited from one period to another by the variable of interest.

It is important to record both the numerical value and the time period associated with each
measurement. This information is then used to construct a time series plot or a run chart, with the
measurements on the vertical axis and time on the horizontal axis.

Objectives of a Time Series Analysis


The basic assumption of time series analysis is that those factors which have influenced patterns
of economic activity in the past and present will continue to do so in more or less the same
manner in the future.

6.2 Forecasting Methods


Two common approaches to forecasting are qualitative and quantitative

1) Qualitative forecasting method


o Important when historical data are unavailable
o Considered highly subjective and judgmental
o They include factor listing method, expert opinion and the Delphi technique.

2) Quantitative forecasting method


Sub-divided into two types: Time series and causal

(a) Time series forecasting method


o Involve the projection of future values of a variable based entirely on the past and present
observations of the variable.
o An example of an economic or business time series is the monthly published consumer
price index.

(b) Causal Forecasting methods


o Involve the determination of factors that relate to the variable to be predicted.
o These include multiple regression analysis with lagged variables, econometric modelling,
leading indicator analysis, diffusion indexes and other economic barometers.

6.3 Components of a Time Series


The time series data observed is influenced by the following factors

1. Trend (T)
Long term pattern of development of data or the course which the data has followed over a
considerable period (several years)
Influenced by changes in technology, population, wealth, value, etc

2. Seasonal Variation (S)


Refers to fairly regular periodic fluctuations due to forces which are rhythmic in nature and
occurs periodically each season.
Seasonal variations may be attributed to the following causes - weather conditions, social
customs, religious customs, etc

3. Cyclical Fluctuations (C)


Refers to repeating up-and-down swings or movements whose duration is more than one year,
with differing intensity for each complete cycle.
Influenced by the interaction of numerous factors that drive the economy through the business
cycle: Boom-Recession-Depression-Recovery.

4. Irregular Variations (I)


Refers to the erratic or “residual” fluctuations in a series that are completely
unpredictable - there is no regular period or time of their occurrence. They are caused by
random variations in data due to unforeseen events such as strikes, natural disasters, and wars.

6.4 Estimation of Trend


A time series may have many fluctuations. To make any sensible forecast, it is necessary to
smooth out the fluctuations.

6.4.1 Moving Average


Moving averages (of period L) for the values of a time series are arithmetic means of successive
and overlapping values, taken L at a time. It is a method of reducing fluctuations and obtaining
trend values with a fair degree of accuracy. This smoothing method is highly subjective and
dependent on L (the length of the period selected for constructing the averages).
1. Characteristics of Moving Averages
o The greater the number of periods in the moving average, the greater the smoothing effect.
o The different moving averages produce different effects.
2. Even Period Moving Average
The even numbers are 2, 4, 6, 8, etc. If the period of the moving average is 4 years, then the
average of the first four figures is placed between the 2nd and 3rd years, the next between the
3rd and 4th years, etc.
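
Because an even-period average falls between two time points, a common follow-up step is to average each adjacent pair of moving averages so that the result lines up with an actual period. That centring step is an assumption added here, not something the notes spell out. The sketch below uses invented figures purely for illustration, and the function name centred_moving_average is hypothetical.

```python
def centred_moving_average(values, period):
    """Centred moving average for an even period (illustrative sketch).

    Each plain average of `period` values falls between two time points,
    so adjacent pairs of averages are averaged again to line up with a point.
    """
    if period % 2 != 0:
        raise ValueError("this sketch handles even periods only")
    plain = [sum(values[i:i + period]) / period
             for i in range(len(values) - period + 1)]
    return [(plain[i] + plain[i + 1]) / 2 for i in range(len(plain) - 1)]


# Hypothetical yearly figures, invented purely for illustration.
figures = [10, 14, 12, 16, 18, 20]
centred = centred_moving_average(figures, 4)
print(centred)  # [14.0, 15.75]; the first entry lines up with the 3rd year
```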
Choice of Period of Moving Average
The period should be neither too long nor too short, so that the trend values are not distorted
and irregular fluctuations do not remain in large magnitude. Whenever there is a cycle in the
series, the best period of moving average is the one which coincides with the period of the cycle.

Moving Average,

$$ MA(L) = \frac{Y_1 + Y_2 + \dots + Y_L}{L} $$

The moving average is then centred on the middle of the L values it covers.

Example
Given the data on a factory's output, calculate the 5-day moving average.

Week     Day         Output (Y)   5 Day Total   5 Day MA - Trend (T)
Week 1   Monday      80
         Tuesday     104
         Wednesday   94           460           92.00
         Thursday    120          462           92.40
         Friday      62           468           93.60
Week 2   Monday      82           471           94.20
         Tuesday     110          476           95.20
         Wednesday   97           478           95.60
         Thursday    125          480           96.00
         Friday      64           486           97.20
Week 3   Monday      84           489           97.80
         Tuesday     116          494           98.80
         Wednesday   100          496           99.20
         Thursday    130
         Friday      66

Note:
On the 4th column, the totals are obtained as follows
o 460 = 80 + 104 + 94 + 120 + 62
o 462 = 104 + 94 + 120 + 62 + 82
o 468 = 94 + 120 + 62 + 82 + 110, etc

On the 5th column, the trend or moving averages are obtained as follows
o 92.00 = 460 / 5
o 92.40 = 462 / 5, etc
Can you explain why we are using 5 and not any other number?
It is because we are working with a 5 day moving average for our data.
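
The two columns of the table can be generated with a short loop. The following Python sketch is illustrative only; the function name moving_average is an assumption made here. It returns the running totals and the corresponding averages for the factory output data.

```python
# Daily factory output from the example above (Week 1 Monday to Week 3 Friday).
output = [80, 104, 94, 120, 62, 82, 110, 97, 125, 64, 84, 116, 100, 130, 66]


def moving_average(values, period):
    """Return (totals, averages) of successive overlapping windows."""
    totals = [sum(values[i:i + period])
              for i in range(len(values) - period + 1)]
    averages = [t / period for t in totals]
    return totals, averages


totals, trend = moving_average(output, 5)
print(totals[:3])  # [460, 462, 468]
print(trend[:3])   # [92.0, 92.4, 93.6]
```

With 15 observations and a period of 5 there are 11 averages. The first is centred on Wednesday of Week 1 and the last on Wednesday of Week 3, which is why the first two and last two rows of the table have no trend value.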

Exercise (Self Assessment Questions)


1. What is the main assumption of the time series analysis?
2. List the four components of the time series
3. Define the following terms
a). Forecasting
b). Time series analysis
c). Trend
4. Calculate the trend using a four-quarterly moving average of the data given below
2003 2003 2004 2005
1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
60 65 20 44 62 58 28 50 85 42 33 44 78 71 20 58
