Linear Regression Lecture

Simple Linear Regression
Simple linear regression is a statistical method that allows us to study the relationship between
two continuous (quantitative) variables: one independent (x) and one dependent (y). The dependent
variable is the one being explained or predicted. The independent variable, also called the
predictor variable, is the one used to explain the variation in the dependent variable. A (simple)
regression model that gives a straight-line relationship between two variables is called a linear
regression model. The goal is to develop an equation that expresses the linear relationship between
the two variables. Using this equation, we can estimate the value of the dependent variable, Y, for
a selected value of the independent variable, X.
The two diagrams below show a linear and a nonlinear relationship between the dependent variable
food expenditure and the independent variable income.
The relationship is positive if the dependent and independent variables move in the same direction,
i.e., as one increases the other increases. The relationship is negative if they move in opposite
directions, i.e., as one increases the other decreases.
Uses of simple linear regression:

• Predict the values of a dependent variable based on values of the independent (explanatory) variable.
• Explain the effect of the independent variable on the dependent variable.
• Explain the dependency of one variable on the other.
Two types of relationships:

• Deterministic (or functional) relationships. The model defines an exact relationship between the
variables. All points are on the line, and the dependent variable is explained only by the
independent variable. Example: y = 1.5x
• Statistical relationships. This course does not examine deterministic relationships. Instead, we
are interested in statistical relationships, in which the relationship between the variables is not
perfect. Not all points are on the line, so the model accounts for random error. The error is due
to other variables that affect the dependent variable and are not included in the model. This model
hypothesizes a probabilistic relationship between y and x. Example: y = 1.5x + random error
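The contrast between the two kinds of relationship can be sketched in a few lines of Python. This is only an illustration: the normally distributed error with standard deviation 1, the fixed seed, and the function names are assumptions made for the example, not part of the lecture.

```python
import random


def deterministic(x):
    """Exact relationship: every point lies on the line y = 1.5x."""
    return 1.5 * x


def statistical(x, rng, sigma=1.0):
    """Same line plus a random error term (assumed normal, mean 0, sd sigma)."""
    return 1.5 * x + rng.gauss(0.0, sigma)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
xs = [1, 2, 3, 4, 5]
exact = [deterministic(x) for x in xs]
noisy = [statistical(x, rng) for x in xs]

for x, y_exact, y_noisy in zip(xs, exact, noisy):
    print(f"x={x}  deterministic y={y_exact:.2f}  statistical y={y_noisy:.2f}")
```

In the deterministic column every value sits exactly on the line; in the statistical column the values scatter around it.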
Example:
Let us now consider a specific example. Suppose we take a sample of seven households from a
small city and collect information on their incomes and food expenditures for the last month. The
information obtained (in hundreds of dollars) is given below.
The first step is to plot the data in a scatter diagram. We refer to food expenditure as the
dependent variable (y) and income as the independent variable (x). The dependent variable is
plotted on the vertical or Y-axis and the independent variable on the horizontal or X-axis. We have
a pair of observations for each of the seven households. Each pair consists of one observation on
income and a second on food expenditure. For example, the first household’s income for the last
month was $5500 and its food expenditure was $1400. By plotting all seven pairs of values, we obtain
a scatter diagram or scatterplot. The figure below gives the scatter diagram for our data. Each dot
in this diagram represents one household. The diagram suggests that food expenditure is related to
income: as income increases, food expenditure also appears to increase. The relationship looks
strongly linear. If a straight line is drawn through the points, the points will be scattered
closely around the line.
Scatter Diagram of Incomes and Food Expenditures of Seven Households
Equation of a Regression Model: y = A + Bx + ε
y: is the actual value of the dependent variable. ŷ (y hat) is the predicted value of y for a
selected x value.

A: is the y-intercept or constant term. It is the value of y where the regression line crosses the
Y-axis, that is, when x is zero.
B: is the slope of the line, or the average change in the dependent variable y for each change of
one unit (either increase or decrease) in the independent variable x.
x: is any value of the independent variable.
The random error ε denotes the difference between the actual value of y and the predicted value
ŷ for population data.
As shown in the figure below, a large number of straight lines can be drawn through the scatter
diagram. Each of these lines will give different values for a and b. In regression analysis, we try
to find the line that best fits the points in the scatter diagram. Such a line provides the best
possible description of the relationship between the dependent and independent variables. The least
squares method gives such a line. The line obtained by using the least squares method is called
the least squares regression line.
Least Squares Regression line (LSRL)
Properties:
• The sum of the errors is always zero, where the error for a point is e = y − ŷ. The value of an
error is positive if the point is above the regression line and negative if it is below the
regression line.
• The error sum of squares, denoted by SSE = ∑e², is a minimum.
Least Squares Regression line: ŷ = a+ bx
a and b, which are calculated using sample data, are called the estimates of A and B, respectively.
The purpose of regression analysis is to calculate the values of a and b to develop a linear
equation that best fits the data. The formulas for a and b are:
$a = \bar{y} - b\,\bar{x}$
where:
$\bar{y}$ is the mean of y (the dependent variable).
$\bar{x}$ is the mean of x (the independent variable).
$b = \dfrac{SS_{xy}}{SS_{xx}}$
where:
$SS_{xy}$ is the sum of cross products of x and y.
$SS_{xx}$ is the sum of squares of the independent variable.
$SS_{xy} = \sum xy - \dfrac{(\sum x)(\sum y)}{n}$ and $SS_{xx} = \sum x^2 - \dfrac{(\sum x)^2}{n}$
EXAMPLE
Find the least squares regression line for the data on incomes and food expenditures of the
seven households.
$SS_{xy} = 6403 - \dfrac{(386)(108)}{7} = 447.5714$
$SS_{xx} = 23058 - \dfrac{(386)^2}{7} = 1772.8571$
$b = \dfrac{447.5714}{1772.8571} = 0.2525$
$a = \dfrac{108}{7} - 0.2525 \times \dfrac{386}{7} = 1.5050$
ŷ = 1.5050 + 0.2525x
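The arithmetic of this example can be reproduced from the totals it quotes (n = 7, ∑x = 386, ∑y = 108, ∑xy = 6403, ∑x² = 23058). The following is a minimal Python sketch; the variable names are chosen for illustration, and b is rounded to four decimals before a is computed because that is how the worked example proceeds.

```python
# Totals quoted in the worked example (incomes and food expenditures, in $100s)
n = 7
sum_x, sum_y = 386, 108
sum_xy, sum_x2 = 6403, 23058

# SSxy = sum(xy) - (sum x)(sum y)/n  and  SSxx = sum(x^2) - (sum x)^2/n
ss_xy = sum_xy - sum_x * sum_y / n
ss_xx = sum_x2 - sum_x ** 2 / n

# Slope, rounded to four decimals as in the text, then the intercept a = ybar - b*xbar
b = round(ss_xy / ss_xx, 4)
a = round(sum_y / n - b * sum_x / n, 4)

print(f"SSxy = {ss_xy:.4f}, SSxx = {ss_xx:.4f}")
print(f"y-hat = {a:.4f} + {b:.4f} x")
```

Computing a from the unrounded slope would change the fourth decimal slightly, so small differences from software output are expected.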
Using this estimated regression model, we can find the predicted value of y for any specific
value of x. For instance, suppose we randomly select a household whose monthly income is
$6100, so that x= 61 (recall that x denotes income in hundreds of dollars). The predicted value
of food expenditure for this household is:
ŷ = 1.5050+ 0.2525*61= 16.9075 (100s of dollars) = $1690.75
To plot a straight line, we need to know two points that lie on that line. We can find two
points on the line by assigning any two values to x within the range of our data, calculating
the corresponding values of ŷ, and then plotting the line on the scatter diagram.
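As a small sketch of the prediction and of the two-point recipe for drawing the line, the Python below evaluates the estimated equation at x = 61 and at the smallest and largest incomes in the sample (33 and 83, as stated in the interpretation of a). The function and variable names are hypothetical.

```python
A_HAT = 1.5050  # intercept a from the worked example
B_HAT = 0.2525  # slope b from the worked example


def predict_food_expenditure(income_hundreds):
    """Predicted food expenditure (in $100s) for an income given in $100s."""
    return A_HAT + B_HAT * income_hundreds


# Prediction for a household with a monthly income of $6100 (x = 61)
y_hat_61 = predict_food_expenditure(61)

# Two points that lie on the line, taken at the ends of the observed income range
line_points = [(x, predict_food_expenditure(x)) for x in (33, 83)]

print(f"x = 61 -> y-hat = {y_hat_61:.4f} (hundreds of dollars)")
for x, y_hat in line_points:
    print(f"point on the line: ({x}, {y_hat:.4f})")
```

Joining the two printed points on the scatter diagram gives the regression line over the range of the data.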
Interpretation of a and b:
Interpretation of a
Setting x = 0 in the estimated regression line gives ŷ = 1.5050 + 0.2525(0) = 1.5050 hundred
dollars. Thus, we can state that a household with no income is expected to spend $150.50 per month
on food. However, we should be very careful when making this interpretation of a. In our sample
of seven households, the minimum value of x is 33 and the maximum value is 83. Hence, our
regression line is valid only for values of x between 33 and 83. If we predict y for a value of x
outside this range, the prediction usually will not hold true. Thus, since x = 0 is outside the
range of household incomes that we have in the sample data, the prediction that a household with
zero income spends $150.50 per month on food does not carry much credibility. The same is true if
we try to predict y for an income greater than 83, which is the maximum value of x.
Interpretation of b
The value of b in a regression model gives the change in ŷ (dependent variable) due to a change
of one unit in x (independent variable). Because x and y are both measured in hundreds of dollars,
b = 0.2525 means that, on average, a $100 increase in household income increases food expenditure
by $25.25. Equivalently, on average, a $1 increase in income increases food expenditure by $.2525.
Checking the properties of the LSRL
The table above indicates that ∑e = 0 and that SSE = ∑e² = 12.7215. Any other line drawn through
these points has an SSE larger than 12.7215.
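Both properties can be checked from the totals quoted earlier, without the individual data points. The sketch below is an illustration under stated assumptions: it uses the identity ∑e = ∑y − n·a − b·∑x, and it compares lines by SSE minus ∑y², a quantity that differs from SSE only by a constant that is the same for every line. The two comparison lines and the function names are arbitrary choices for the example.

```python
# Totals and estimates quoted in the worked example
n, sum_x, sum_y = 7, 386, 108
sum_xy, sum_x2 = 6403, 23058
a, b = 1.5050, 0.2525


def sum_of_errors(a0, b0):
    """Sum of e = y - (a0 + b0*x) over the sample, written in terms of totals."""
    return sum_y - n * a0 - b0 * sum_x


def sse_without_sum_y2(a0, b0):
    """SSE(a0, b0) minus sum(y^2); the omitted term is the same for every line."""
    return (n * a0 ** 2 + 2 * a0 * b0 * sum_x + b0 ** 2 * sum_x2
            - 2 * a0 * sum_y - 2 * b0 * sum_xy)


print(f"sum of errors for the least squares line: {sum_of_errors(a, b):.6f}")

# Two other candidate lines (arbitrary, for comparison only)
for a0, b0 in [(2.0, 0.25), (1.0, 0.26)]:
    extra = sse_without_sum_y2(a0, b0) - sse_without_sum_y2(a, b)
    print(f"line a={a0}, b={b0}: sum of errors = {sum_of_errors(a0, b0):.3f}, "
          f"SSE larger by {extra:.4f}")
```

A zero sum of errors alone does not identify the least squares line, since any line through the point of means has that property; it is the minimum SSE that singles it out.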
Assumptions Underlying Linear Regression:

1. For each value of X, there are corresponding Y values. These Y values follow
the normal distribution. For any given x, the distribution of errors is normal.
2. The means of these normal distributions lie on the regression line. The random error term
has a mean equal to zero for each x.
3. The standard deviations of these normal distributions are all the same. The distribution of
population errors for each x has the same (constant) standard deviation. This assumption
indicates that the spread of points around the regression line is similar for all x values.
4. The Y values are statistically independent. This means that in selecting a sample, the Y values
chosen for a particular X do not depend on the Y values for any other value of X. The errors
associated with different observations are independent.
Standard Deviation of Random Errors
Note that σe denotes the standard deviation of errors for the population. However, σe usually
is unknown. In such cases, it is estimated by se, which is the standard deviation of errors for
the sample data. The following is the basic formula to calculate se:
$s_e = \sqrt{\dfrac{SSE}{n-2}}$, where $SSE = \sum (y - \hat{y})^2$
In this formula, n − 2 represents the degrees of freedom for the regression model. The reason
for n − 2 is that we lose one degree of freedom to calculate x̄ and one for ȳ. The degrees of
freedom for a simple linear regression model are df = n − 2.
For computational purposes, SSE = SSyy − b·SSxy, so an equivalent formula is:
$s_e = \sqrt{\dfrac{SS_{yy} - b\,SS_{xy}}{n-2}}$, where $SS_{yy} = \sum y^2 - \dfrac{(\sum y)^2}{n}$
Like the value of SSxx, the value of SSyy is always positive. The table below illustrates the
calculation of the standard deviation of errors for the data on incomes and food expenditures:
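The table for this calculation is not reproduced here, but the basic formula can be applied directly to the SSE of 12.7215 quoted in the check of the least squares properties. This is a minimal sketch with illustrative variable names.

```python
import math

n = 7          # households in the sample
sse = 12.7215  # error sum of squares quoted for the least squares line

df = n - 2                 # degrees of freedom for simple linear regression
s_e = math.sqrt(sse / df)  # standard deviation of errors for the sample

print(f"df = {df}, s_e = {s_e:.4f} (hundreds of dollars)")
```

The result is in the units of y, so it is roughly $160 when converted from hundreds of dollars.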
Inferences about B
1. Sampling distribution of b
Mean, Standard Deviation, and Sampling Distribution of b

Because of the assumption of normally distributed random errors, the sampling distribution of b
is normal. The mean and standard deviation of b, denoted by µb and σb, respectively, are
$\mu_b = B$ and $\sigma_b = \dfrac{\sigma_e}{\sqrt{SS_{xx}}}$
However, usually the standard deviation of population errors σe is not known. Hence, the
sample standard deviation of errors se is used to estimate σe. In such a case, when σe is
unknown, the standard deviation of b is estimated by sb, which is calculated as
$s_b = \dfrac{s_e}{\sqrt{SS_{xx}}}$
2. Estimation of B
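Confidence interval for B: $b \pm t_{\alpha/2}\, s_b$, where the value of t is obtained from the
t distribution for α/2 area in the right tail and n − 2 degrees of freedom.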
Example: Construct a 95% confidence interval for B for the data on incomes and food
expenditures of seven households.
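A hedged sketch of this interval in Python, using only values quoted in these notes (b = 0.2525, SSxx = 1772.8571, SSE = 12.7215, n = 7). The critical value 2.571 is the standard t table entry for 0.025 in the right tail with 5 degrees of freedom; it is typed in by hand here rather than computed, and the variable names are illustrative.

```python
import math

n = 7
b = 0.2525          # estimated slope
ss_xx = 1772.8571   # sum of squares of x
sse = 12.7215       # error sum of squares
t_crit = 2.571      # t table value: 0.025 in the right tail, df = n - 2 = 5

s_e = math.sqrt(sse / (n - 2))  # standard deviation of errors
s_b = s_e / math.sqrt(ss_xx)    # estimated standard deviation of b
margin = t_crit * s_b
ci = (b - margin, b + margin)

print(f"s_b = {s_b:.4f}")
print(f"95% confidence interval for B: {ci[0]:.4f} to {ci[1]:.4f}")
```

Under these inputs the interval runs from about .155 to .350, so it lies entirely above zero.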
3. Hypothesis testing of B
1. State hypothesis
Three situations are described:
a. H0: B = 0 (the regression line is parallel to the x-axis)
Ha: B ≠ 0 two-tailed test
b. H0: B ≥ 0
Ha: B < 0 one-tailed test, more specifically a left-tailed test (the line slopes downward)
c. H0: B ≤ 0
Ha: B > 0 one-tailed test, more specifically a right-tailed test (the line slopes upward)
2. Assumptions
• The distribution of errors is normal.
• The mean of the errors is zero for each x.
• The distribution of population errors has the same (constant) standard deviation for each x.
• The Y values are independent.
3. Define significance level, critical value and draw the graph
Use the t distribution with df = n − 2.
4. Calculate the test statistic
$t = \dfrac{b - B}{s_b}$, where B is the value of the slope stated in the null hypothesis
5. Compare test statistic to critical value

6. State decision
7. State conclusion
Example: Test at the 1% significance level whether the slope of the regression line for the
example on incomes and food expenditures of seven households is positive.
1. State hypothesis
H0: B ≤ 0
Ha: B > 0
2. Assumptions
3. Define significance level, critical value and draw the graph
α = 0.01, df = 7 − 2 = 5, right-tailed test, critical value t = 3.365
4. Calculate the test statistic
$t = \dfrac{b - B}{s_b} = \dfrac{0.2525 - 0}{0.0379} = 6.662$
5. 6.662 > 3.365, so the test statistic falls in the rejection region
6. Reject H0
7. The slope of the regression line for incomes and food expenditures is positive.
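The decision rule in this example can be written out as a short Python sketch. All numbers come from the example itself; the critical value 3.365 is entered as a table value rather than computed, and the variable names are illustrative.

```python
b = 0.2525      # estimated slope
s_b = 0.0379    # estimated standard deviation of b
b_null = 0.0    # value of B under the null hypothesis
t_crit = 3.365  # t table value: 0.01 in the right tail, df = 7 - 2 = 5

t_stat = (b - b_null) / s_b

# Right-tailed test: reject H0 when the test statistic exceeds the critical value
reject_h0 = t_stat > t_crit

print(f"t = {t_stat:.3f}, critical value = {t_crit}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

For a left-tailed test the comparison would be `t_stat < -t_crit`, and for a two-tailed test `abs(t_stat) > t_crit` with the critical value taken at α/2.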
Linear Correlation
The correlation coefficient calculated for the population data is denoted as ρ (Greek letter rho)
and the one calculated for sample data is denoted by r.
Correlation is a measure of the strength of the linear relationship between two continuous
variables, the dependent and independent variables. The sample correlation coefficient is often
referred to as Pearson’s r or the Pearson product-moment correlation coefficient. It has no
associated units of measurement, and −1 ≤ r ≤ +1. The values r = 1 and r = −1 occur when there is
an exact linear relationship between the two quantitative variables.
$r = \dfrac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}}$
A correlation coefficient has two components: The sign indicates either a positive or a negative
linear relationship; the absolute value indicates the strength of the relationship.
In a positive linear relationship, as the x scores increase, the y scores also tend to increase. In a
negative linear relationship, as the x scores increase, the y scores tend to decrease.
Interpreting the size of correlations:

r = 0: no linear relationship
0 to 0.50 (or 0 to -0.50): weak linear relationship
0.50 to 0.75 (or -0.50 to -0.75): moderate linear relationship
0.75 to 0.90 (or -0.75 to -0.90): strong linear relationship
0.90 to 1.0 (or -0.90 to -1.0): very strong linear relationship
r = 1 (or -1): perfect linear relationship
A two-way scatter plot of the data helps determine whether there is any evidence of a linear
relationship between the two variables. The more closely the points are clustered around the line,
the stronger the association.
Examples of negative, no and positive correlation are as follows.

EXAMPLE : Calculate the correlation coefficient for the example on incomes and food
expenditures of seven households and interpret its value.
Linear correlation is usually rounded to two decimal places.

The correlation coefficient of .95 for incomes and food expenditures of seven households
indicates that income and food expenditure are very strongly and positively correlated.
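The calculation behind this value can be sketched in Python. SSyy is not quoted in this copy of the notes, so, as an assumption for the example, the sketch backs it out of the quoted SSE = 12.7215 by rearranging SSE = SSyy − b·SSxy. The variable names are illustrative.

```python
import math

ss_xy = 447.5714   # from the least squares example
ss_xx = 1772.8571  # from the least squares example
sse = 12.7215      # error sum of squares quoted for the least squares line

b = ss_xy / ss_xx

# Rearranged from SSE = SSyy - b*SSxy (SSyy itself is not quoted in the notes)
ss_yy = sse + b * ss_xy

r = ss_xy / math.sqrt(ss_xx * ss_yy)

print(f"SSyy = {ss_yy:.4f}")
print(f"r = {r:.2f}")
```

The sign of r always matches the sign of SSxy, and therefore the sign of the slope b.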
Coefficient of Determination
How good is the regression model? In other words: how well does the independent variable
explain the dependent variable in the regression model? The coefficient of determination (r²) is
the proportion of the total variation in the dependent variable y that is explained, or accounted
for, by the variation in the independent variable x.
Practical formula:
$r^2 = \dfrac{b\,SS_{xy}}{SS_{yy}}$, with $0 \le r^2 \le 1$
Example: Calculate the coefficient of determination for the example on incomes and food
expenditures of seven households and interpret its value.
We can state that 90% of the total variation in food expenditures of households is explained by
the variation in their incomes, and the remaining 10% is due to randomness and other variables not
included in the model.
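A short Python sketch of the practical formula for the seven households follows. As an assumption for the example, the total variation SSyy is rebuilt as explained variation (b·SSxy) plus the quoted SSE = 12.7215, since SSyy is not quoted directly in these notes; the names `explained` and `r_squared` are illustrative.

```python
ss_xy = 447.5714   # from the least squares example
ss_xx = 1772.8571  # from the least squares example
sse = 12.7215      # unexplained variation (error sum of squares)

b = ss_xy / ss_xx
explained = b * ss_xy      # variation in y explained by x
ss_yy = explained + sse    # total variation, from SSE = SSyy - b*SSxy

r_squared = explained / ss_yy

print(f"explained = {explained:.4f}, unexplained = {sse:.4f}, total = {ss_yy:.4f}")
print(f"r^2 = {r_squared:.2f}")
```

The same number is obtained as 1 − SSE/SSyy, which makes the split into explained and unexplained variation explicit.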
Regression Analysis: A Complete Example
A random sample of eight drivers insured with a company and having similar minimum required
auto insurance policies was selected. The following table lists their driving experiences
(in years) and monthly auto insurance premiums (in dollars).
(a) Does the insurance premium depend on the driving experience, or does the driving
experience depend on the insurance premium? Do you expect a positive or a negative
relationship between these two variables?
(b) Find the least squares regression line.
(c) Interpret the meaning of the values of a and b.
(d) Plot the scatter diagram and the regression line.
(e) Calculate r and r2, and explain what they mean.
(f) Predict the monthly auto insurance premium for a driver with 10 years of driving
experience.
(g) Compute the standard deviation of errors.
(h) Construct a 90% confidence interval for B. Based on this interval, can we say that the slope is
negative?
(i) Test at the 5% significance level whether B is negative.
(j) Calculate the p-value. Based on the p-value, what is your decision?
Solution
(a) Based on theory and intuition, we expect the insurance premium to depend on driving
experience. Consequently, the insurance premium is a dependent variable and
driving experience is an independent variable in the regression model. A new driver
is considered a high risk by the insurance companies, and he or she has to pay a
higher premium for auto insurance. On average, the insurance premium is expected
to decrease with an increase in the years of driving experience. Therefore, we expect
a negative relationship between these two variables.
(b) The table below shows the calculation of ∑x, ∑y, ∑xy, ∑x², and ∑y². The resulting least
squares regression line is ŷ = 76.6605 − 1.5476x.
(c) The value of a = 76.6605 gives the value of ŷ for x = 0; that is, it gives the monthly
auto insurance premium for a driver with no driving experience. However, as mentioned
earlier, this value is not meaningful because x = 0 is outside the range of our data, i.e., the
sample contains only drivers with 2 or more years of experience.
The value of b gives the change in ŷ due to a change of one unit in x. Thus, b = −1.5476
indicates that, on average, for every extra year of driving experience, the monthly
auto insurance premium decreases by $1.55. Note that when b is negative, y decreases
as x increases.
(d)
(e)
The value of r = −0.77 indicates that driving experience and the monthly auto insurance
premium have a strong negative linear association.
The value of r² = 0.59 states that 59% of the total variation in insurance premiums is explained by
years of driving experience, and 41% is due to other variables that contribute to the
determination of auto insurance premiums. For example, the premium is expected to depend on
the driving record of a driver and the type and age of the car.
(f) Using the estimated regression line, we find the predicted value of y for x = 10 is:
ŷ = 76.6605 − 1.5476(10) = 61.1845
Thus, we expect the monthly auto insurance premium of a driver with 10 years of
driving experience to be $61.18.
(g) The standard deviation of errors is:
(h)
Thus, we can state with 90% confidence that B (slope in the population) lies in the interval -2.57
to -0.52. That is, on average, the monthly auto insurance premium of a driver decreases by an
amount between $.52 and $2.57 for every extra year of driving experience.
Because 0 is not included in the interval and both limits are negative, we can say that the slope
is negative.
(i)
H0: B ≥ 0
Ha: B < 0
Assumptions
The test is left-tailed with df = 8 − 2 = 6, so the critical value for α = 0.05 is t = −1.943.
The value of the test statistic t = −2.937 falls in the rejection region. Hence, we reject
the null hypothesis and conclude that B is negative. That is, the monthly auto insurance
premium decreases with an increase in years of driving experience.
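Part (j) asks for a p-value, which this copy of the notes does not work out. As a hedged sketch, the left-tail area under the t distribution can be approximated with Simpson's rule using only the Python standard library. The helper names `t_pdf` and `left_tail_p` and the step count are assumptions for illustration; in practice a t table or a statistics package would normally be used.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def left_tail_p(t_stat, df, steps=2000):
    """Approximate P(T <= t_stat) for t_stat <= 0 with Simpson's rule.

    Integrates the density from 0 to |t_stat| (steps must be even) and
    subtracts that area from 0.5, using the symmetry of the t distribution.
    """
    upper = abs(t_stat)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


t_stat = -2.937  # test statistic from part (i), negative for a left-tailed test
df = 8 - 2       # n - 2 with eight drivers
alpha = 0.05

p_value = left_tail_p(t_stat, df)
reject_h0 = p_value < alpha

print(f"p-value = {p_value:.4f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

The p-value falls between 0.01 and 0.025, which agrees with the t table for 6 degrees of freedom, so the decision at the 5% level is again to reject H0.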
