
Correlation and Regression:
Explaining Association and Causation

by Tuhin Chattopadhyay
Slide 1
Application Areas: Correlation

1. Correlation and regression are generally performed
together. Correlation analysis measures the degree of
association between two sets of quantitative data. The
correlation coefficient measures this association. Its
value ranges from -1 (perfect negative correlation)
through 0 (no correlation) to +1 (perfect positive
correlation).

2. For example, how are sales of product A correlated
with sales of product B? Or, how is the advertising
expenditure correlated with other promotional
expenditure? Or, are daily ice cream sales correlated
with daily maximum temperature?

3. Correlation does not necessarily mean there is a
causal effect. Given any two strings of numbers, there
will be some correlation between them. It does not imply
that one variable is causing a change in another, or is
dependent upon another.

4. In many applications, correlation analysis is
followed by regression analysis.
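
As a minimal sketch of how the correlation coefficient described above is computed, the Python snippet below applies the Pearson formula to the ice cream example. The temperature and sales figures are hypothetical and are not taken from the slides; the function name pearson_r is also just an illustrative choice.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sums of cross-products and squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical daily data: maximum temperature (deg C) and ice cream sales (units).
temperature = [24, 27, 30, 32, 35, 29, 26]
ice_cream_sales = [110, 135, 160, 180, 210, 150, 125]

r = pearson_r(temperature, ice_cream_sales)
print(f"r = {r:.3f}")  # close to +1: strong positive association
```

A value near +1 here only says that the two series move together; as point 3 notes, it does not by itself show that temperature causes sales.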
Slide 2 Application Areas: Regression

1. The main objective of regression analysis is to explain the
variation in one variable (called the dependent variable),
based on the variation in one or more other variables (called
the independent variables).

2. Typical applications include ‘explaining’ variations in
sales of a product based on advertising expenses, or number
of salespeople, or number of sales offices, or on all of
these variables together.

3. If one independent variable is used to explain the
variation in a single dependent variable, the model is known
as a simple regression.

4. If multiple independent variables are used to explain the
variation in a dependent variable, it is called a multiple
regression model.

5. Even though the form of the regression equation could be
either linear or non-linear, we will limit our discussion to
linear (straight line) models.

6. As seen from the preceding discussion, the major
application of regression analysis in marketing is in the area
of sales forecasting, based on some independent (or
explanatory) variables. This does not mean that regression
analysis is the only technique used in sales forecasting.
There are a variety of quantitative and qualitative methods
used in sales forecasting, and regression is only one of the
better known (and often used) quantitative techniques.
Slide 3 Methods

There are basically two approaches to regression:

A hit and trial approach.
A pre-conceived approach.

Hit and Trial Approach

In the hit and trial approach we collect data on a large
number of independent variables and then try to fit a
regression model using a stepwise regression procedure,
entering one variable into the regression equation at a time.
The general regression model (linear) is of the type

Y = a + b1x1 + b2x2 + ... + bnxn

where Y is the dependent variable and x1, x2, x3, ..., xn are
the independent variables expected to be related to Y and
expected to explain or predict Y. a is the intercept, and
b1, b2, b3, ..., bn are the coefficients of the respective
independent variables, which will be determined from the
input data.

Pre-conceived Approach

The pre-conceived approach assumes the researcher knows
reasonably well which variables explain Y, and the model
is pre-conceived, say, with 3 independent variables x1, x2,
x3. Therefore, not too much experimentation is done. The
main objective is to find out if the pre-conceived model is
good or not. The equation is of the same form as earlier.
Slide 4

Data
1. Input data on Y and each of the x variables is
required to do a regression analysis. This data is input
into a computer package to perform the regression
analysis.

2. The output consists of the ‘b’ coefficients for all the
independent variables in the model. The output also
gives you the results of a ‘t’ test for the significance of
each variable in the model, and the results of the ‘F’
test for the model as a whole.

3. Assuming the model is statistically significant at the
desired confidence level (usually 90% or 95% for typical
applications in the marketing area), the coefficient of
determination or R² of the model is an important part
of the output. The R² value is the percentage (or
proportion) of the total variance in Y explained by all
the independent variables in the regression equation.
Slide 5 Recommended usage

1. For exploratory research, the hit-and-trial approach
may be used. But for serious decision-making, there has
to be a-priori knowledge of the variables which are likely
to affect Y, and only such variables should be used in the
regression analysis.

2. Unless the model itself is significant at the desired
confidence level (as evidenced by the F test results
printed out for the model), the R² value should not be
interpreted.

3. The variables used (both independent and dependent)
are assumed to be either interval scaled or ratio scaled.
Nominally scaled variables can also be used as
independent variables in a regression model, with dummy
variable coding.

4. If the dependent variable happens to be a nominally
scaled one, discriminant analysis should be the technique
used instead of regression.
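
The dummy variable coding mentioned in point 3 can be sketched as follows. This is only an illustration: the nominal variable "region" and its categories are hypothetical and do not appear in the worked example, and the helper name dummy_code is an assumption. A variable with k categories is turned into k-1 columns of 0/1 values, with one category left out as the baseline.

```python
def dummy_code(values, baseline):
    """Return (column_names, rows) of 0/1 dummies, omitting the baseline category."""
    categories = sorted(set(values) - {baseline})
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical nominal variable: sales region of each territory.
region = ["North", "South", "West", "South", "North"]

names, dummies = dummy_code(region, baseline="North")
print(names)  # one column per non-baseline category
for r, row in zip(region, dummies):
    print(r, row)
```

The resulting 0/1 columns can then be entered into the regression like any other independent variable; each coefficient is read as the difference from the baseline category.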
Slide 6 Worked Example: Problem

1. A manufacturer and marketer of electric motors would
like to build a regression model consisting of five or six
independent variables, to predict sales. Past data has been
collected for 15 sales territories, on Sales and six different
independent variables. Build a regression model and
recommend whether or not it should be used by the
company.

2. We will assume that data are for a particular year, in
different sales territories in which the company operates, and
the variables on which data are collected are as follows:

Dependent Variable

Y = sales in Rs. lakhs in the territory

Independent Variables

X1 = market potential in the territory (in Rs. lakhs).
X2 = No. of dealers of the company in the territory.
X3 = No. of salespeople in the territory.
X4 = Index of competitor activity in the territory on
a 5-point scale
(1 = low, 5 = high level of activity by competitors).
X5 = No. of service people in the territory.
X6 = No. of existing customers in the territory.
Slide 7

Input data:

The data set, consisting of 15 observations, is given in
Fig. 1.

Fig. 1

Data file: REGDATA1.STA (15 cases with 7 variables)

Case  SALES  POTENTL  DEALERS  PEOPLE  COMPET  SERVICE  CUSTOM
   1      5       25        1       6       5        2      20
   2     60      150       12      30       4        5      50
   3     20       45        5      15       3        2      25
   4     11       30        2      10       3        2      20
   5     45       75       12      20       2        4      30
   6      6       10        3       8       2        3      16
   7     15       29        5      18       4        5      30
   8     22       43        7      16       3        6      40
   9     29       70        4      15       2        5      39
  10      3       40        1       6       5        2       5
  11     16       40        4      11       4        2      17
  12      8       25        2       9       3        3      10
  13     18       32        7      14       3        4      31
  14     23       73       10      10       4        3      43
  15     81      150       15      35       4        7      70
Slide 8

Correlation

First, let us look at the correlations of all the variables
with each other. The correlation table (output from
the computer for the Pearson Correlation procedure)
is shown in Fig. 2. The values in the correlation table
are standardised, and range from -1 to +1.

Fig. 2: Correlations Table

STAT. MULTIPLE REGRESS.   Correlations (regdata1.sta)

Variable  POTENTL  DEALERS  PEOPLE  COMPET  SERVICE  CUSTOM  SALES
POTENTL      1.00      .84     .88     .14      .61     .83    .94
DEALERS       .84     1.00     .85    -.08      .68     .86    .91
PEOPLE        .88      .85    1.00    -.04      .79     .85    .95
COMPET        .14     -.08    -.04    1.00     -.18    -.01   -.05
SERVICE       .61      .68     .79    -.18     1.00     .82    .73
CUSTOM        .83      .86     .85    -.01      .82    1.00    .88
SALES         .94      .91     .95    -.05      .73     .88   1.00
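
A correlation table like Fig. 2 can be reproduced from the Fig. 1 data with a few lines of Python. The sketch below assumes NumPy is available (the slides themselves use a statistics package, not Python); np.corrcoef computes Pearson correlations between the columns.

```python
import numpy as np

names = ["SALES", "POTENTL", "DEALERS", "PEOPLE", "COMPET", "SERVICE", "CUSTOM"]

# The 15 territories from Fig. 1, one row per case, columns in the order of `names`.
data = np.array([
    [5, 25, 1, 6, 5, 2, 20],
    [60, 150, 12, 30, 4, 5, 50],
    [20, 45, 5, 15, 3, 2, 25],
    [11, 30, 2, 10, 3, 2, 20],
    [45, 75, 12, 20, 2, 4, 30],
    [6, 10, 3, 8, 2, 3, 16],
    [15, 29, 5, 18, 4, 5, 30],
    [22, 43, 7, 16, 3, 6, 40],
    [29, 70, 4, 15, 2, 5, 39],
    [3, 40, 1, 6, 5, 2, 5],
    [16, 40, 4, 11, 4, 2, 17],
    [8, 25, 2, 9, 3, 3, 10],
    [18, 32, 7, 14, 3, 4, 31],
    [23, 73, 10, 10, 4, 3, 43],
    [81, 150, 15, 35, 4, 7, 70],
], dtype=float)

# rowvar=False: each column is a variable, each row an observation.
corr = np.corrcoef(data, rowvar=False)

# Correlation of every variable with SALES (first row of the matrix).
for name, value in zip(names, corr[0]):
    print(f"{name:8s} {value:6.2f}")
```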
Slide 9

1. Looking at the last column of the table, we find that
except for COMPET (index of competitor activity), all other
variables are highly correlated (ranging from .73 to .95) with
Sales.

2. This means we may have chosen a fairly good set of
independent variables (No. of Dealers, Market Potential, No.
of Customers, No. of Service People, No. of Salespeople) to
try and correlate with Sales.

3. Only the Index of Competitor Activity does not appear to
be strongly correlated (correlation coefficient is -.05) with
Sales. But we must remember that the correlations in Fig. 2
are one-to-one correlations of each variable with the other.
So we may still want to do a multiple regression with an
independent variable showing low correlation with the
dependent variable, because in the presence of other
variables, this independent variable may become a good
predictor of the dependent variable.
Slide 9 contd...

4. The other point to be noted in the correlation table is
whether independent variables are highly correlated with
each other. If they are, as in Fig. 2, this may indicate
that they are not independent of each other, and we may
be able to use only 1 or 2 of them to predict the
dependent variable.

5. As we will see later, our regression ends up
eliminating some of the independent variables, because
not all six of them are required. Some of them, being
correlated with other variables, do not add any value to
the regression model.

6. We now move on to the regression analysis of the
same data.
Slide 10

Regression

We will first run the regression model of the following
form, by entering all the 6 'x' variables in the model:

Y = a + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + b6x6
........ Equation 1

and determine the values of a, b1, b2, b3, b4, b5, and b6.

Regression Output:

The results (output) of this regression model are in Fig. 4
in table form.

Column 4 of the table, titled ‘B’, lists all the coefficients
for the model. According to this,

a (intercept) = -3.17298
b1 = 0.22685
b2 = 0.81938
b3 = 1.09104
b4 = -1.89270
b5 = -0.54925
b6 = 0.06594
Slide 11

These values of a, b1, b2, ..., b6 can be substituted in
Equation 1 above and we can write the equation
(rounding off all coefficients to 2 decimals) as

Sales = -3.17 + 0.23 (potential) + 0.82 (dealers)
+ 1.09 (salespeople) - 1.89 (competitor activity)
- 0.55 (service people) + 0.07 (existing customers)
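
The fitted equation above comes from an ordinary least squares fit of Sales on all six independent variables. As a rough sketch of what the statistics package does, the Python snippet below fits the same model to the Fig. 1 data. It assumes NumPy is available and uses np.linalg.lstsq as the solver, which is a choice made for this illustration rather than the method used in the slides.

```python
import numpy as np

names = ["SALES", "POTENTL", "DEALERS", "PEOPLE", "COMPET", "SERVICE", "CUSTOM"]

# The 15 territories from Fig. 1, columns in the order of `names`.
data = np.array([
    [5, 25, 1, 6, 5, 2, 20],
    [60, 150, 12, 30, 4, 5, 50],
    [20, 45, 5, 15, 3, 2, 25],
    [11, 30, 2, 10, 3, 2, 20],
    [45, 75, 12, 20, 2, 4, 30],
    [6, 10, 3, 8, 2, 3, 16],
    [15, 29, 5, 18, 4, 5, 30],
    [22, 43, 7, 16, 3, 6, 40],
    [29, 70, 4, 15, 2, 5, 39],
    [3, 40, 1, 6, 5, 2, 5],
    [16, 40, 4, 11, 4, 2, 17],
    [8, 25, 2, 9, 3, 3, 10],
    [18, 32, 7, 14, 3, 4, 31],
    [23, 73, 10, 10, 4, 3, 43],
    [81, 150, 15, 35, 4, 7, 70],
], dtype=float)

y = data[:, 0]                                          # SALES
X = np.column_stack([np.ones(len(y)), data[:, 1:]])     # intercept + six x variables

# Least squares estimates of a, b1, ..., b6.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# R-squared = 1 - residual sum of squares / total sum of squares.
fitted = X @ coef
ss_res = float(np.sum((y - fitted) ** 2))
ss_tot = float(np.sum((y - y.mean()) ** 2))
r2 = 1 - ss_res / ss_tot

for name, b in zip(["Intercept"] + names[1:], coef):
    print(f"{name:10s} {b:10.5f}")
print(f"R-squared  {r2:10.5f}")
```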

Before we use this equation, however, we need to
look at the statistical significance of the model, and
the R² value. These are available from Fig. 3, the
Analysis of Variance Table, and Fig. 4.

Fig. 3: The ANOVA Table

STAT. MULTIPLE REGRESS.   Analysis of Variance; Depen. Var: SALES (regdata1.sta)

Effect     Sums of Squares   df   Mean Squares          F    p-level
Regress.          6609.484    6       1101.581   57.13269    .000004
Residual           154.249    8         19.281
Total             6763.733

From Fig. 3, the analysis of variance table, the last
column indicates the p-level to be 0.000004. This
indicates that the model is statistically significant at a
confidence level of (1 - 0.000004)*100, or
(0.999996)*100, or 99.9996%.
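
The quantities in the ANOVA table are linked by simple arithmetic, which the short Python sketch below works through using only the sums of squares and degrees of freedom printed in Fig. 3. The adjusted R² formula used here is the standard one; the slides report its value in Fig. 4 without stating the formula.

```python
# Values taken from Fig. 3.
ss_regression, df_regression = 6609.484, 6
ss_residual, df_residual = 154.249, 8

ss_total = ss_regression + ss_residual

# Mean square = sum of squares / degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# F statistic for the model as a whole.
f_statistic = ms_regression / ms_residual

# R-squared: share of the total variation in SALES explained by the model.
r_squared = ss_regression / ss_total

# Adjusted R-squared penalises the number of independent variables.
n = df_regression + df_residual + 1          # 15 territories
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / df_residual

print(f"F(6, 8)     = {f_statistic:.3f}")
print(f"R-squared   = {r_squared:.4f}")
print(f"Adjusted R2 = {adjusted_r_squared:.4f}")
```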
Slide 12

The R² value is 0.977, from the top of Fig. 4. From
Fig. 4, we also note that ‘t’ tests for significance of
individual independent variables indicate that at the
significance level of 0.10 (equivalent to a confidence
level of 90%), only POTENTL and PEOPLE are
statistically significant in the model. The other 4
independent variables are individually not significant.

Fig. 4 MULTIPLE REGRESSION RESULTS:

All independent variables were entered in one block

Dependent Variable: SALES


Multiple R: .988531605
Multiple R-Square: .977194734
Adjusted R-Square: .960090784
Number of cases: 15
F(6, 8) = 57.13269 p< .000004
Standard Error of Estimate: 4.391024067
Intercept: -3.172982117
Std.Error: 5.813394 t(8) = -.5458 p< .600084
Slide 12 contd...

STAT. MULTIPLE REGRESS.   Regression Summary for Dependent Variable: SALES
R = .98853160   R² = .97719473   Adjusted R² = .96009078
F(6,8) = 57.133   p < .00000   Std. Error of Estimate: 4.3910

N=15           BETA   St. Err. of BETA         B   St. Err. of B       t(8)   p-level
Intercept                               -3.1729       5.813394    -.54581   .600084
POTENTL     .439073           .144411    .22685        .074611    3.04044   .016052
DEALERS     .164315           .126591    .81938        .631266    1.29800   .230457
PEOPLE      .413967           .158646   1.09104        .418122    2.60937   .031161
COMPET     -.084871           .060074  -1.89270       1.339712   -1.41276   .195427
SERVICE    -.040806           .116511   -.54925       1.568233    -.35024   .735204
CUSTOM      .050490           .149302    .06594        .195002     .33817   .743935
Slide 13

However, ignoring the significance of individual
variables for now, we shall use the model as it is, and try
to apply it for decision making.
The real use of the regression model would be to try and
‘predict’ sales in Rs. lakhs, given all the independent
variable values.

The equation we have obtained means, in effect, that
sales will increase in a territory if the potential increases,
if the number of dealers increases, if the number of
salespeople increases, if the level of competitor activity
decreases, if the number of service people decreases, or
if the number of existing customers increases.

The estimated increase in sales for every unit increase or
decrease in these variables is given by the coefficients of
the respective variables. For instance, if the number of
salespeople is increased by 1, sales are estimated to
increase by Rs. 1.09 lakhs, if all other variables are
unchanged. Similarly, if 1 more dealer is added, sales are
expected to increase by Rs. 0.82 lakh, if other variables
are held constant.
Slide 13 contd...

There is one coefficient, that of the SERVICE variable,
which does not make much intuitive sense. If we
increase the number of service people, sales are estimated to
decrease, according to the -0.55 coefficient of the variable
"No. of Service People" (SERVICE).

But if we look at the individual variable ‘t’ tests, we find that
the coefficient of the variable SERVICE is statistically not
significant (p-level 0.735204 from Fig. 4). Therefore, the
coefficient for SERVICE is not to be used in interpreting the
regression, as it may lead to wrong conclusions.

Strictly speaking, only two variables, potential (POTENTL)
and No. of salespeople (PEOPLE), are statistically
significant at the 90 percent confidence level, since their
p-levels are less than 0.10. One should therefore only look
at the relationship of sales with one of these variables, or
both these variables.
Slide 14 Making Predictions/Sales Forecasts

Given the levels of X1, X2, X3, X4, X5, and X6 for a
particular territory, we can use the regression model
for prediction of sales.
Before we do that, we have the option of redoing the
regression model so that the variables not statistically
significant are minimized or eliminated.
We can follow either the Forward Stepwise
Regression method, or the Backward Stepwise
Regression method, to try and eliminate the
'insignificant' variables from the full regression model
containing all six independent variables.

Forward Stepwise Regression

For example, we could ask the computer for a Forward
Stepwise Regression model, in which case the
algorithm adds one independent variable at a time,
starting with the one which ‘explains’ most of the
variation in sales (Y), and adding one more X variable
to it, rechecking the model to see that both variables
form a good model, then adding a third variable if it
still adds to the explanation of Y, and so on. Fig. 5
shows the result of running a forward stepwise
regression, which ends up with only 4 out of 6
independent variables remaining in the regression
model.
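
The forward stepwise idea described above can be sketched in Python as a greedy loop. This is not the statistics package's algorithm: the entry rule used here (an F-to-enter threshold of 1.0) is an assumption made for illustration, since the slides do not state the criteria used. NumPy is assumed to be available.

```python
import numpy as np

names = ["SALES", "POTENTL", "DEALERS", "PEOPLE", "COMPET", "SERVICE", "CUSTOM"]

# The 15 territories from Fig. 1, columns in the order of `names`.
data = np.array([
    [5, 25, 1, 6, 5, 2, 20], [60, 150, 12, 30, 4, 5, 50],
    [20, 45, 5, 15, 3, 2, 25], [11, 30, 2, 10, 3, 2, 20],
    [45, 75, 12, 20, 2, 4, 30], [6, 10, 3, 8, 2, 3, 16],
    [15, 29, 5, 18, 4, 5, 30], [22, 43, 7, 16, 3, 6, 40],
    [29, 70, 4, 15, 2, 5, 39], [3, 40, 1, 6, 5, 2, 5],
    [16, 40, 4, 11, 4, 2, 17], [8, 25, 2, 9, 3, 3, 10],
    [18, 32, 7, 14, 3, 4, 31], [23, 73, 10, 10, 4, 3, 43],
    [81, 150, 15, 35, 4, 7, 70],
], dtype=float)
y = data[:, 0]


def r_squared(cols):
    """R-squared of the regression of SALES on the given column indices."""
    X = np.column_stack([np.ones(len(y))] + [data[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


F_TO_ENTER = 1.0  # assumed entry threshold, not taken from the slides

selected, remaining = [], list(range(1, 7))
current_r2 = 0.0
while remaining:
    # Candidate that raises R-squared the most when added to the model.
    best = max(remaining, key=lambda c: r_squared(selected + [c]))
    new_r2 = r_squared(selected + [best])
    # Residual df after adding the candidate: n - predictors - intercept.
    df_resid = len(y) - (len(selected) + 1) - 1
    f_enter = (new_r2 - current_r2) / ((1 - new_r2) / df_resid)
    if f_enter < F_TO_ENTER:
        break  # the best remaining variable adds too little; stop
    selected.append(best)
    remaining.remove(best)
    current_r2 = new_r2

selected_names = [names[c] for c in selected]
print(selected_names, round(float(current_r2), 4))
```

A different entry threshold, or a p-value based rule, could stop the loop earlier or later, so the set of variables retained depends on that choice.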
Slide 15

Fig. 5

STAT. MULTIPLE REGRESS.   Regression Summary for Dependent Variable: SALES
R = .98831786   R² = .97677220   Adjusted R² = .96748108
F(4,10) = 105.13   p < .00000   Std. Error of Estimate: 3.9637

N=15           BETA   St. Err. of BETA         B   St. Err. of B      t(10)   p-level
Intercept                              -3.74194       4.847683    -.77190   .458025
PEOPLE      .390134           .115138   1.02822        .303453    3.38841   .006904
POTENTL     .462686           .117988    .23905        .060959    3.92147   .002860
DEALERS     .180700           .102687    .90109        .512065    1.75971   .108955
COMPET     -.081195           .053434  -1.81074       1.191624   -1.51955   .159589

The 4 variables in the model are PEOPLE (No. of
salespeople), POTENTL (market potential), DEALERS (No. of
dealers) and COMPET (competitive index). Again we
notice that the only two significant variables (those with
p-level < .10) at 90% confidence are PEOPLE and POTENTL
(p-levels of .006904 and .002860).
But DEALERS is now at a p-level of .108955, very close to
significance at the 90% confidence level. This could be the
equation, instead of the one with 6 independent variables,
that we could use. We would be economizing on the two
variables which are not required, if we decide to use the
model from Fig. 5 instead of that from Fig. 4.
The F test for the model in Fig. 5 also indicates it is highly
significant (from the top of Fig. 5, F = 105.1296, p < .000000),
and the R² value for the model is 0.9767, which is very close
to that of the 6-independent-variable model of Fig. 4. If we
decide to use the model from Fig. 5, it would be written as
follows:
Sales = -3.74 + 1.03 (PEOPLE) + .24 (POTENTL) + .90 (DEALERS) - 1.81 (COMPET)
........ Equation 2
Slide 16

Backward Stepwise Regression

We could, as another alternative, perform a
Backward Stepwise Regression on the same set of 6
independent variables. This procedure starts with all
6 variables in the model, and gradually eliminates,
one after another, those which do not explain much
of the variation in Y, until it ends with an optimal
mix of independent variables according to pre-set
criteria for the exit of variables.
This results in a model with only 2 independent
variables, POTENTL and PEOPLE, remaining in the
equation. This model is shown in Fig. 6.

Fig. 6

Backward stepwise regression, no. of steps: 4

STAT. MULTIPLE REGRESS.   Regression Summary for Dependent Variable: SALES
R = .97975624   R² = .95992229   Adjusted R² = .95324267
F(2,12) = 143.71   p < .00000   Std. Error of Estimate: 4.7528

N=15           BETA   St. Err. of BETA         B   St. Err. of B      t(12)   p-level
Intercept                              -10.6164       2.659532   -3.99183   .001788
POTENTL     .470825           .120127     .2433        .062065    3.91939   .002037
PEOPLE      .540454           .120127    1.4244        .316602    4.49902   .000728
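
Backward elimination can be sketched in Python as follows. The exit rule is an assumption made for this illustration: at each step the variable with the smallest absolute t value is dropped if it is not significant at the two-sided 5% level (critical t values hard-coded from standard tables). The slides only say that the package used "pre-set criteria", so this is not necessarily the rule behind Fig. 6. NumPy is assumed to be available.

```python
import numpy as np

names = ["SALES", "POTENTL", "DEALERS", "PEOPLE", "COMPET", "SERVICE", "CUSTOM"]

# The 15 territories from Fig. 1, columns in the order of `names`.
data = np.array([
    [5, 25, 1, 6, 5, 2, 20], [60, 150, 12, 30, 4, 5, 50],
    [20, 45, 5, 15, 3, 2, 25], [11, 30, 2, 10, 3, 2, 20],
    [45, 75, 12, 20, 2, 4, 30], [6, 10, 3, 8, 2, 3, 16],
    [15, 29, 5, 18, 4, 5, 30], [22, 43, 7, 16, 3, 6, 40],
    [29, 70, 4, 15, 2, 5, 39], [3, 40, 1, 6, 5, 2, 5],
    [16, 40, 4, 11, 4, 2, 17], [8, 25, 2, 9, 3, 3, 10],
    [18, 32, 7, 14, 3, 4, 31], [23, 73, 10, 10, 4, 3, 43],
    [81, 150, 15, 35, 4, 7, 70],
], dtype=float)
y = data[:, 0]

# Two-sided 5% critical values of t, by residual degrees of freedom (assumed rule).
T_CRITICAL = {8: 2.306, 9: 2.262, 10: 2.228, 11: 2.201, 12: 2.179, 13: 2.160}


def fit(cols):
    """Return (coefficients, t statistics, residual df) for SALES on `cols`."""
    X = np.column_stack([np.ones(len(y))] + [data[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    df = len(y) - X.shape[1]
    s2 = (resid @ resid) / df                       # residual variance
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    return coef, coef / se, df


kept = list(range(1, 7))  # start with all six independent variables
while kept:
    coef, t, df = fit(kept)
    abs_t = np.abs(t[1:])                           # skip the intercept
    weakest = int(np.argmin(abs_t))
    if abs_t[weakest] >= T_CRITICAL[df]:
        break                                       # every remaining variable is significant
    kept.pop(weakest)                               # drop the least significant variable

kept_names = [names[c] for c in kept]
print(kept_names, np.round(coef, 4))
```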
Slide 17
The R² for the model has dropped only slightly, to
0.9599, the F-test for the model is highly significant, and
both the independent variables POTENTL and PEOPLE
are significant at the 90% confidence level (p-levels of
.002037 and .000728 from the last column, Fig. 6).
If we were to decide to use this model for prediction, we
would only require data to be collected on the number of
salespeople (PEOPLE) and the market potential (POTENTL)
in a given territory. We could form the equation using the
intercept and coefficients from column ‘B’ in Fig. 6, as
follows:

Sales = -10.6164 + .2433 (POTENTL) + 1.4244 (PEOPLE)
........ Equation 3

Thus, if potential in a territory were to be Rs. 50 lakhs,
and the territory had 6 salespeople, then expected sales,
using the above equation, would be
= -10.6164 + .2433(50) + 1.4244(6)
= Rs. 10.095 lakhs.
Similarly, we could use this model to make predictions
regarding sales in any territory for which Potential and
No. of Salespeople were known.
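
The prediction step with Equation 3 amounts to plugging values into the fitted equation. A minimal Python sketch, using the coefficients from Fig. 6 (the function name predict_sales is only an illustrative choice):

```python
def predict_sales(potential, salespeople):
    """Expected sales (Rs. lakhs) from Equation 3.

    potential   -- market potential of the territory, in Rs. lakhs (POTENTL)
    salespeople -- number of salespeople in the territory (PEOPLE)
    """
    return -10.6164 + 0.2433 * potential + 1.4244 * salespeople


# Territory with potential of Rs. 50 lakhs and 6 salespeople.
forecast = predict_sales(50, 6)
print(f"Expected sales: Rs. {forecast:.3f} lakhs")
```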
Slide 18

Additional comments

1. As we can see from the example discussed, regression
analysis is a simple (particularly on a computer) and
useful technique for predicting one metric dependent
variable based on a set of metric independent variables.
Its use, however, gets more complex if, for instance, the
independent variables are nominally scaled into two
(dichotomous) or more (polytomous) categories.

2. It is also a good idea to define the range of all
independent variables used for constructing the
regression model. For prediction of Y values, only those
X values which fall within or close to this range (used
earlier in the model construction stage) must be used, for
the predictions to be effective.

3. Finally, we have assumed that a linear model is the
only option available to us. That is not the only choice.
A regression model could be of any non-linear variety,
and some of these could be more suitable for particular
cases.
Slide 18 contd….

4. Generally, in the case of a simple regression model, a
look at the plot of Y against X tells us whether the linear
(straight line) approach is best or not. But in a multiple
regression, this visual plot may not indicate the best kind
of model, as there are many independent variables, and a
plot in 2 dimensions is not possible.

5. In this particular example, we have not used any
macroeconomic variables, but in industrial marketing, we
may use industry or macroeconomic variables in a
regression model. For example, to forecast sales of steel,
we may use as independent variables the growth rate of a
country’s GDP, the new construction starts, and the growth
rate of the automobile industry.
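
The advice in comment 2, to predict only for X values inside the range used to build the model, can be turned into a simple guard. The sketch below is an illustration only: the ranges are the minimum and maximum of POTENTL and PEOPLE in Fig. 1, the check is a strict "within range" test (the slides also allow values "close to" the range, which is left to judgement here), and the function name is hypothetical.

```python
# Observed ranges of the two predictors of Equation 3 in the Fig. 1 data.
OBSERVED_RANGE = {
    "POTENTL": (10, 150),   # market potential, Rs. lakhs
    "PEOPLE": (6, 35),      # number of salespeople
}


def within_observed_range(potential, salespeople):
    """True if both inputs lie inside the ranges seen when fitting the model."""
    lo_p, hi_p = OBSERVED_RANGE["POTENTL"]
    lo_s, hi_s = OBSERVED_RANGE["PEOPLE"]
    return lo_p <= potential <= hi_p and lo_s <= salespeople <= hi_s


print(within_observed_range(50, 6))     # inside the fitted range
print(within_observed_range(300, 6))    # potential far outside: extrapolation
```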
