
Lesson 17: SIMPLE REGRESSION

REGRESSION

The most commonly used form of regression is linear regression, and the
most common type of linear regression is called ordinary least squares
regression.

Linear regression uses the values from an existing data set consisting of
measurements of the values of two variables, X and Y, to develop a model that
is useful for predicting the value of the dependent variable, Y, for given
values of X.
 
ELEMENTS OF A REGRESSION EQUATION

The regression equation is written as Y = a + bX + e

Y is the value of the Dependent variable (Y), what is being predicted or
explained

a or Alpha, a constant; equals the value of Y when the value of X=0

b or Beta, the coefficient of X; the slope of the regression line; how much Y
changes for each one-unit change in X.

X is the value of the Independent variable (X), what is predicting or explaining
the value of Y

e is the error term; the error in predicting the value of Y, given the value of X (it
is not displayed in most regression equations).

For example, say we know what the average speed is of cars on the
freeway when we have 2 highway patrols deployed (average speed=75 mph) or
10 highway patrols deployed (average speed=35 mph). But what will be the
average speed of cars on the freeway when we deploy 5 highway patrols?
 

Average Speed on Freeway, mph (Y)    Number of Patrol Cars Deployed (X)
75                                   2
35                                   10

From our known data, we can use the regression formula (calculations
not shown) to compute the values of a and b and obtain the following equation:
Y = 85 + (-5) X, where
Y is the average speed of cars on the freeway

a=85, or the average speed when X=0

b=(-5), the impact on Y of each additional patrol car deployed

X is the number of patrol cars deployed

That is, the average speed of cars on the freeway when there are no
highway patrols working (X=0) will be 85 mph. For each additional highway
patrol car working, the average speed will drop by 5 mph. For five patrols (X=5),
Y = 85 + (-5) (5) = 85 - 25 = 60 mph
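
The arithmetic above can be checked with a short sketch. It assumes only the
two observations in the table; with just two points, the least-squares line is
the straight line through both of them. The function name predict_speed is
made up for this illustration.

```python
# Observations from the lesson: (patrol cars deployed, average speed in mph)
(x1, y1), (x2, y2) = (2, 75), (10, 35)

# With two points, the fitted line passes through both of them exactly.
b = (y2 - y1) / (x2 - x1)   # slope: change in speed per additional patrol car
a = y1 - b * x1             # intercept: predicted speed when X = 0


def predict_speed(patrols):
    """Predicted average speed (mph) for a given number of patrol cars."""
    return a + b * patrols


print(a, b, predict_speed(5))
```

Running it reproduces a = 85, b = -5, and a predicted speed of 60 mph for five
patrols.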

There may be some variations on how regression equations are written in
the literature. For example, you may sometimes see the dependent variable
term (Y) written with a little "hat" ( ^ ) on it, or called Y-hat. This refers to the
predicted value of Y. The plain Y refers to observed values of Y in the data set
used to calculate the regression equation. You may see the symbols for alpha
(a) and beta (b) written in Greek letters, or you may see them written in English
letters. The coefficient of the independent variable may have a subscript, as
may the term for X, for example, b1X1 (this is common in multiple regression).
 
ASSESSING THE REGRESSION EQUATION

We now have a regression equation. But how good is the equation at
predicting values of Y, for given values of X? For that assessment, we turn to
measures of association and measures of statistical significance that are used
with regression equations.

r²

r² is a measure of association; it represents the proportion of the variance in
the values of Y that can be explained by knowing the value of X. r² varies from a
low of 0.0 (none of the variance is explained) to a high of +1.0 (all of the
variance is explained).

s.e.b

s.e.b is the standard error of the computed value of b. A t-test for statistical
significance of the coefficient is conducted by dividing the value of b by its
standard error. By rule of thumb, a t-value greater than 2.0 in absolute value
(that is, above +2.0 or below -2.0) is usually statistically significant, but
you must consult a t-table to be sure. If the t-value indicates that the b
coefficient is statistically significant, this means that the independent
variable or X (number of patrol cars deployed) should be kept in the regression
equation, since it has a statistically significant relationship with the
dependent variable or Y (average speed in mph). If the relationship were not
statistically significant, the value of the b coefficient would be
(statistically speaking) indistinguishable from zero.

F
F is a test for statistical significance of the regression equation as a whole. It is
obtained by dividing the explained variance by the unexplained variance. By
rule of thumb, an F-value of greater than 4.0 is usually statistically significant,
but you must consult an F-table to be sure. If F is significant, then the
regression equation helps us to understand the relationship between X and Y.

For our example above, say we obtained the following values:

r² = .9
Knowing the value of X (the number of patrol cars deployed), we can explain
90% of the variance in Y (the average speed of motorists on the freeway).

s.e.b = 1.5
Dividing b by s.e.b, we obtain a value for t = -5/1.5 = -3.3. Consulting a t-table,
we find that the coefficient is statistically significant. This means that the
independent variable X (number of patrol cars deployed) should be kept in the
regression equation, since it has a statistically significant relationship with the
dependent variable Y (average speed in mph).

F = 8.4
From the F-table, we see that the regression equation as a whole is statistically
significant. This means that the regression equation is helping us to
understand the relationship between X and Y.
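
The t-value calculation can be sketched in a few lines. The helper names
t_value and passes_rule_of_thumb are made up for this illustration, and the
2.0 cutoff is only the rule of thumb mentioned earlier, not a substitute for
consulting a t-table.

```python
def t_value(b, se_b):
    """t statistic for a coefficient: the coefficient divided by its standard error."""
    return b / se_b


def passes_rule_of_thumb(t, threshold=2.0):
    """Rough screen only; a t-table is still needed for a real decision."""
    # The sign of t only reflects the direction of the slope, so compare |t|.
    return abs(t) > threshold


# Values from the patrol car example: b = -5, s.e.b = 1.5
t_patrol = t_value(-5.0, 1.5)
print(round(t_patrol, 1), passes_rule_of_thumb(t_patrol))
```

The comparison uses the absolute value of t because a negative slope, as in
the patrol car example, yields a negative t-value.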
 
 

STEPS IN LINEAR REGRESSION

1. State the hypothesis.
2. State the null hypothesis.
3. Gather the data.
4. Compute the regression equation.
5. Examine tests of statistical significance and measures of association.
6. Relate statistical findings to the hypothesis. Accept or reject the null
hypothesis.
7. Reject, accept or revise the original hypothesis. Make suggestions for
research design and management aspects of the problem.

Example: The motor pool wants to know if it costs more to maintain cars that
are driven more often.
Hypothesis: maintenance costs are affected by car mileage
Null hypothesis: there is no relationship between mileage and maintenance
costs

Dependent variable: Y is the cost in dollars of yearly maintenance on a motor
vehicle
Independent variable: X is the yearly mileage on the same motor vehicle

Data are gathered on each car in the motor pool, regarding number of miles
driven in a given year, and maintenance costs for that year. Here is a sample of
the data collected.
 

Car Number    Miles Driven (X)    Repair Costs (Y)
1             80,000              $1,200
2             29,000              $150
3             53,000              $650
4             13,000              $200
5             45,000              $325

The regression equation is computed as (computations not shown):
Y = 50 + .03 X

For example, if X=50,000 then Y = 50 + .03 (50,000) = $1,550

a=50 or the cost of maintenance when X=0; if there is no mileage on the car,
then the yearly cost of maintenance=$50

b=.03 the amount by which Y increases for each one-unit increase in X; for each
extra mile driven (X), the cost of yearly maintenance increases by $.03

s.e.b = .0005; the value of b divided by s.e.b=60.0; the t-table indicates that
the b coefficient of X is statistically significant (it is related to Y)

r²=.90 we can explain 90% of the variance in repair costs for different vehicles
if we know the vehicle mileage for each car

Conclusion: Reject the null hypothesis of no relationship and accept the
research hypothesis, that mileage affects repair costs.
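
Since the computations are not shown, here is a minimal sketch of how the
quantities used in this example (a, b, r², s.e.b and t) can be computed with
the standard ordinary least squares formulas. The function name simple_ols is
made up for this illustration. It is run on the five sample cars only, so its
output should not be expected to match Y = 50 + .03 X, which presumably comes
from the full motor pool data set.

```python
import math


def simple_ols(xs, ys):
    """Ordinary least squares for one predictor: Y = a + bX + e."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    b = sxy / sxx                 # slope
    a = mean_y - b * mean_x       # intercept

    # Unexplained (residual) and total variation in Y
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)

    r2 = 1 - sse / sst                        # share of variance explained
    se_b = math.sqrt(sse / (n - 2) / sxx)     # standard error of the slope
    return {"a": a, "b": b, "r2": r2, "se_b": se_b, "t": b / se_b}


# The five sample cars from the table (a sample, not the full data set)
miles = [80_000, 29_000, 53_000, 13_000, 45_000]
costs = [1_200, 150, 650, 200, 325]

fit = simple_ols(miles, costs)
for name, value in fit.items():
    print(f"{name}: {value:.4f}")
```

With only five observations, the t-value has three degrees of freedom, so the
rule of thumb of 2.0 is especially rough here and a t-table should be used.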
 
 

ASSUMPTIONS OF LINEAR REGRESSION


In theory, there are several important assumptions that must be satisfied if
linear regression is to be used. These are:
1. Both the independent (X) and the dependent (Y) variables are
measured at the interval or ratio level.
2. The relationship between the independent (X) and the dependent (Y)
variables is linear.
3. Errors in prediction of the value of Y are distributed in a way that
approaches the normal curve.
4. Errors in prediction of the value of Y are all independent of one
another.
5. The distribution of the errors in prediction of the value of Y is constant
regardless of the value of X.

There are a number of advanced statistical tests that can be used to
examine whether or not these assumptions are true for any given regression
equation. However, these are beyond the scope of this discussion.
 
TIME SERIES REGRESSION

Linear regression is useful for exploring the relationship of an
independent variable that marks the passage of time to a dependent variable
when the relationship is linear; that is, when there is an obvious downward, or
upward, trend in the data over time.

However, if the trend of the dependent variable over time is not linear,
then linear regression will not capture the relationship. Linear regression fails
to capture seasonal, cyclical, and counter-cyclical trends in time series data.
Nor does it capture changes in the direction of a time series or changes in its
rate of change over time. For time series regression, it is important to obtain
a plot of the data over time and inspect it for possible non-linear trends.

There is also a problem if the values at one point in the time series are
determined or strongly influenced by values at a previous time. This is called
auto-correlation. It occurs when the values of the dependent variable over time
are not randomly distributed.

Linear regression can be used with interrupted time series research designs.
For example, say a policy is implemented to reduce the number of accidents
among teenage drivers.

1. Data are gathered for at least 20 or 30 time periods (months or
quarters) before the policy is implemented, and then for another 20 or
30 time periods after the policy is implemented.
2. One linear regression is performed for the accident rate data on the
pre-policy time periods.
3. Another linear regression is performed for the accident rate data on
the post-policy time period.
4. There should be differences in the values of the constant, b
coefficient, s.e.b, and r² for the two equations.
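
The steps above can be sketched as follows. The accident rates here are
hypothetical numbers invented purely for illustration (a perfectly rising
trend for 20 periods, then a perfectly falling one); they are not from the
lesson. The helper fit_line is also made up for this sketch. A single fit over
all 40 periods is included for comparison.

```python
def fit_line(xs, ys):
    """Return (a, b, r2) for the least-squares line Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    b = sxy / sxx
    a = mean_y - b * mean_x
    r2 = (sxy ** 2) / (sxx * syy)
    return a, b, r2


# HYPOTHETICAL data: periods 1-20 are pre-policy, periods 21-40 are post-policy.
periods = list(range(1, 41))
rates = [100 + 2 * t if t <= 20 else 182 - 2 * t for t in periods]

pre = fit_line(periods[:20], rates[:20])     # regression on pre-policy periods
post = fit_line(periods[20:], rates[20:])    # regression on post-policy periods
pooled = fit_line(periods, rates)            # one regression on all periods

for label, (a, b, r2) in [("pre", pre), ("post", post), ("pooled", pooled)]:
    print(f"{label:6s} a={a:8.2f} b={b:6.2f} r2={r2:.2f}")
```

In this constructed case the two separate fits have slopes of opposite sign,
while the pooled fit has a slope near zero and explains almost none of the
variance. Real data would be noisier, and a difference between the two
equations would still need to be tested for statistical significance.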

If there is a statistically meaningful difference between the two equations,
this suggests that the policy has had an effect. If all the data points (both
pre- and post-) had been included in a single regression equation, the amount
of variance explained (r²) would be quite low. This is because, if there is a
change after the policy is introduced, the trend is no longer linear. Instead,
there are two different linear trends, one before the policy was introduced,
and another, different one after it was introduced.

In setting up the data for time series regression, the researcher must remember
to number the years (or other time periods) consecutively from 1 to n. These
are the values for the independent (X) variable. The value of the dependent
variable is the accident rate. For example,
 

Independent Variable (X):    Dependent Variable (Y):
Year                         Accident Rate
1                            50,000
2                            51,000
3                            52,000
4                            53,000