Chapter 3 - Regression

Chapter 3: Regression

Ø Main Concepts Covered:


o Response vs Explanatory Variables.
o Scatterplots and Correlation Coefficient
o Simple Linear Regression
o Warnings and R-squared
Ø Main Purpose of Regression
o The main purpose of regression is to predict values of a variable of interest based on the values of
another variable.
o We use a sample of data to determine if there is a relationship between two different variables.
o If there is a relationship between the two variables (or if there is a pattern in the data) then determine
the formula that represents the pattern. Use the formula to make predictions of one variable based on
the value of the other variable.
Ø Two variables:
o You can determine if there is a relationship between two categorical variables by studying conditional
proportions but we will work on this when we do conditional probabilities in Lecture Part 5.
o We will focus on the relationship between two quantitative variables.

§ Response Variable:
• This is the variable you want to predict.
• This variable is sometimes called the dependent variable because it depends on the
other variable.
• This will always be known as y.

§ Explanatory Variable:
• This is the variable we will use to make the prediction.
• This will always be known as x.
Ø Identify the two variables you want to study by determining if there is a relationship between them.
o Identify the type of variable: discrete, continuous, or categorical.
o Identify which variable would be the response variable and which one is the explanatory variable.
o Identify which variable will be labeled x and which variable will be labeled y.
o Example 1: A small business wants to predict the number of items sold per month based on the amount
spent on advertising per month.

o Example 2: The university is going to use the temperature at kickoff to predict the number of hot
chocolates sold during a football game.
Ø Determining if there is a Relationship Between Two Quantitative Variables
o An association exists between two variables if particular values of one variable are more likely to occur
with certain values of the other variable.
o Positive Relationship:
§ As the values of x increase, the values of y tend to increase.
o Negative Relationship:
§ As the values of x increase, the values of y tend to decrease.
o No Clear Relationship:
§ As the values of x increase, the values of y tend to show no consistent increase or decrease.
Ø Graphical Way to Determine if there is a Relationship

o Scatterplots are graphs that can be used to determine if there is a relationship
between two quantitative variables:
§ X-axis (horizontal) will contain the values of the explanatory variable.
§ Y-axis (vertical) will contain the values of the response variable.
§ Since both variables are quantitative both axes are just number lines.
§ Data is plotted as ordered pairs (x, y).
Ø Create a Scatterplot and determine if there appears to be a positive, negative, or no clear relationship between
the variables.
o In StatCrunch:
§ Graph -> Scatterplot
• Put Explanatory variable on x axis.
• Put Response variable on y axis.
o Example 1: A small business wants to predict the number of items sold per month based on the amount
spent on advertising per month.

Appears to be a positive relationship between the variables.

As the amount spent on advertising increases, the number of items sold tends to increase.

o Example 2: The university is going to use the temperature at kickoff to predict the number of hot
chocolates sold during a football game.

Appears to be a negative relationship between the variables.

As temperature increases, the number of hot chocolates sold tends to decrease.
Ø Numerical Summary to Determine if there is a LINEAR Relationship

o In this class we are only going to look for a linear relationship between two variables.
o There are other types of relationships (patterns) that exist but we will only study linear relationships.
o The strength of a relationship is based on how close to a straight line the data falls
on the scatterplot.
o The correlation coefficient is a measurement of the linear relationship between
two quantitative variables.
o Correlation Coefficient:
§ Notation: r = correlation coefficient.

§ Range of values: -1 ≤ r ≤ +1
§ Calculation: Use StatCrunch. Stat-> Summary Stats -> Correlation. Select both the explanatory
and the response variable.
§ Determining if relationship is positive or negative:
• If the value of the correlation coefficient is positive then as x increases, y tends to
increase.
• If the value of the correlation coefficient is negative then as x increases, y tends to
decrease.
§ Strength of the linear relationship:
• The closer the value is to -1 or +1 the stronger the linear relationship.
• If r = -1 then the data falls perfectly on a straight line that is decreasing.
• If r = +1 then the data falls perfectly on a straight line that is increasing.
• The closer the value is to 0 the weaker the linear relationship (the further from a
straight line pattern).
• Let's call |r| > 0.80 a strong linear relationship.
o Calculate the coefficient of correlation and determine if there appears to be a positive, negative, or no
clear relationship between the variables.
§ Use StatCrunch. Stat-> Summary Stats -> Correlation. Select both the explanatory and the
response variable.

o Example 1: A small business wants to predict the number of items sold per month based on the amount
spent on advertising per month.
o Example 2: The university is going to use the temperature at kickoff to predict the number of hot
chocolates sold during a football game.

o Notice if you look back at the scatterplots:


§ Example 1 should have a value of r that is positive (upward trend on the graph).
§ Example 2 should have a value of r that is negative (downward trend on the graph).
§ Example 2 has data points that fall closer to a straight line pattern so we would expect the value
of r to be closer to ± 1.
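StatCrunch does this calculation for you, but a short sketch can make the meaning of r more concrete. The Python below computes the correlation coefficient from its standard definition. The data values are made up for illustration and are not the class datasets, and the function name `correlation` is just a label chosen here.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r for paired quantitative data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how each varies on its own
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data in the spirit of Example 1 (advertising vs. items sold)
ad_spend = [100, 200, 300, 400, 500]
items_sold = [120, 190, 260, 310, 400]

# Hypothetical data in the spirit of Example 2 (temperature vs. hot chocolates)
temperature = [30, 40, 50, 60, 70]
hot_chocolates = [500, 420, 330, 250, 160]

r_ads = correlation(ad_spend, items_sold)
r_temp = correlation(temperature, hot_chocolates)
print("Example 1 style data: r =", round(r_ads, 3))
print("Example 2 style data: r =", round(r_temp, 3))
```

With these made-up numbers the first r comes out positive and the second negative, matching the upward and downward trends described for the two scatterplots.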
Ø Warning about Correlation
o Just because two variables are correlated does not mean that the change in one variable is causing the
change in the other variable.
o Correlation does not imply a cause and effect relationship.
o Correlation could be caused by any of the following:
§ Chance.
§ Cause and effect.
§ Lurking variables.
• Variables that can have a significant effect on the variables you are studying but were not
controlled for.
• Example 1: Possible lurking variable could be time of year. Maybe some of those
observations with a higher number of items sold took place during months close to the
holidays when sales are higher anyway.
• Example 2: Possible lurking variable could be ranking of opposing team. Some games
have higher attendance than others and more items will be sold because there are more
people at the game.
Ø Simple Linear Regression:
o We have determined how to look at a scatterplot to determine if there is a linear relationship between
two variables and how to calculate the correlation coefficient to describe the strength of the
relationship.
o Now we will find an equation for the straight line that best describes the relationship (pattern).
o The equation can be used to predict the value of the response variable (y) from the value of the
explanatory variable (x).
o The methods we will use to find the equation that best represents the linear pattern are called simple
linear regression. (more on this later).
Ø Regression Line:
o Often called the line of best fit.
o ŷ = mx + b
§ Same equation for a straight line you have always seen.
§ ŷ = the predicted value of y (“y-hat”)
§ b = the y-intercept
§ m = the slope
o Interpreting the y-intercept and the slope:
§ y-intercept:
• The predicted value of y when x = 0.
• It is not always meaningful to interpret the y-intercept because there might not be any
values of x near zero. (it wouldn’t make sense for x to equal 0).
• Remember the y-intercept is the number by itself (it is not the value multiplied by x).
§ Slope:
• The amount that ŷ changes when x increases by one.
o When the slope is negative: The predicted value ŷ decreases by the absolute
value of the slope as x increases by one. (correlation would be negative)
o When the slope is positive: The predicted value ŷ increases by the value of
the slope as x increases by one. (correlation would be positive)
• The change in y divided by the change in x.
• Remember the slope is the number multiplied by x.
• Note that the sign of the slope will match the sign of the correlation coefficient.
o Negative slopes => negative relationship => as x increases y tends to decrease.
o Positive slopes => positive relationship => as x increases y tends to increase.
o Example of a prediction equation:
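§ Example 3: Assume the prediction equation (regression equation) was found to be ŷ = 50 + 5x.
• Identify and interpret the slope.
o Slope = 5. As x increases by 1, the predicted value of y increases by 5.
• Identify and interpret the y-intercept.
o Y-intercept = 50. When x = 0, the predicted value of y is 50.
• What do we expect the scatterplot to look like?
o A positive relationship exists, so the dots should trend upward from left to right.
• What do we expect the value of the correlation coefficient to look like?
o It would be a positive number.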

o Example 4: Assume the prediction equation (regression equation) was found to be ŷ = 50 − 5x.
§ Identify and interpret the slope.
• Slope = -5
• As the value of x increases by 1, the predicted value of y will decrease by 5.
§ What do we expect the scatterplot to look like?
• A negative relationship exists.
• Dots on scatterplot should be trending downward from left to right.
§ What do we expect the value of the correlation coefficient to look like?
• It would be a negative number.
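To check the slope interpretations in Examples 3 and 4, the sketch below plugs a few values of x into both prediction equations. The equations come from the examples. The function names and the x values tried are chosen here only for illustration.

```python
def predict_example_3(x):
    """Prediction equation from Example 3: y-hat = 50 + 5x."""
    return 50 + 5 * x

def predict_example_4(x):
    """Prediction equation from Example 4: y-hat = 50 - 5x."""
    return 50 - 5 * x

# Predicted values for a few illustrative values of x
for x in range(0, 4):
    print("x =", x, "| Example 3:", predict_example_3(x), "| Example 4:", predict_example_4(x))

# Change in the predicted value when x goes up by exactly one
change_3 = predict_example_3(11) - predict_example_3(10)
change_4 = predict_example_4(11) - predict_example_4(10)
print("Change per unit of x:", change_3, "and", change_4)
```

Both lines start at the same y-intercept when x = 0. They differ only in the sign of the slope, which is why one trends upward and the other downward.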
Ø Finding the line of best fit:
o StatCrunch will be used to find the equation for the line of best fit through the data.
o Method is called Least Squares Regression
§ It is not the line that passes through the most points.
§ It is the line that minimizes the sum of the squared differences between the points (the true values) and the
regression line (the predicted values).
o The difference between the true value (y) and the predicted value (ŷ) is called the residual.
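In this class StatCrunch finds the line, but the idea behind least squares can be sketched in a few lines of Python. The slope and intercept formulas below are the standard least squares formulas. The five data points and the function names are made up for illustration.

```python
def least_squares(xs, ys):
    """Return (slope, intercept) of the least squares line through the data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_squared_residuals(xs, ys, slope, intercept):
    """Sum of (true value - predicted value) squared over all points."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, not a class dataset
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

m, b = least_squares(xs, ys)
residuals = [y - (b + m * x) for x, y in zip(xs, ys)]

best = sum_squared_residuals(xs, ys, m, b)
# Any other line, such as one with a slightly steeper slope, does worse
other = sum_squared_residuals(xs, ys, m + 0.1, b)
print("slope:", round(m, 3), "intercept:", round(b, 3))
print("sum of squared residuals:", round(best, 3), "vs", round(other, 3))
```

Changing the slope or the intercept away from the fitted values always increases the sum of squared residuals, which is what "best fit" means here.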

Ø Steps to Generate the Line of Best Fit in StatCrunch.


o Put the data into StatCrunch.
o Stat -> Regression -> Simple Linear
§ Select the explanatory variable as x.
§ Select the response variable as y.
§ Hit Compute
Ø Making a Prediction:
o Once you have the regression (prediction) equation you can make a prediction of the response variable
by plugging in a value of the explanatory variable, as long as the value of x is in the relevant range.
o Relevant range = the range of values of x that were used to create the
equation. It goes from the smallest value of x in the data to the largest value of x in the data. (You can
identify it by looking at the scatterplot or the dataset.)
§ Why? The data pattern could change further out in either direction. Since we do not have data
to study beyond the range of values of x we used to generate the equation we do not
extrapolate.
o Example 1: A small business wants to predict the number of items sold per month based on the amount
spent on advertising per month.

§ Use StatCrunch to generate the equation for the line of best fit.

§ Interpret the slope of the line of best fit.


§ Interpret the y-intercept of the line of best fit.

§ Identify the relevant range.

§ Use the prediction equation to predict the number of items sold if the amount spent on
advertising was $400.

§ If the true number of items sold when the business spent $400 on advertising was 320, calculate
the residual.

o Example 2: The university is going to use the temperature at kickoff to predict the number of hot
chocolates sold during a football game.

§ Use StatCrunch to generate the equation for the line of best fit.

§ Interpret the slope of the line of best fit.

§ Interpret the y-intercept of the line of best fit.


§ Identify the relevant range.

§ Use the prediction equation to predict the number of hot chocolates sold during a football game
if the temperature at kickoff is 55°.
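The prediction steps above can be sketched in Python. The fitted equation (ŷ = 50 + 0.7x) and the advertising amounts below are hypothetical stand-ins, since the real values come from the StatCrunch output. Only the $400 input and the true value of 320 are taken from Example 1. The function name `predict_in_range` is also just an illustrative choice.

```python
def predict_in_range(x, intercept, slope, x_min, x_max):
    """Return the predicted y, refusing to extrapolate outside the relevant range."""
    if x < x_min or x > x_max:
        raise ValueError(f"x = {x} is outside the relevant range {x_min} to {x_max}")
    return intercept + slope * x

# Hypothetical advertising amounts used to create the equation
ad_spend = [100, 250, 400, 550, 700]
x_min, x_max = min(ad_spend), max(ad_spend)  # the relevant range

# Hypothetical fitted equation: y-hat = 50 + 0.7x
intercept, slope = 50, 0.7

predicted = predict_in_range(400, intercept, slope, x_min, x_max)
residual = 320 - predicted  # true value minus predicted value
print("predicted:", predicted, "residual:", residual)

# A value of x beyond the data is rejected instead of extrapolated
try:
    predict_in_range(1000, intercept, slope, x_min, x_max)
except ValueError as err:
    print("No prediction:", err)
```

A negative residual means the equation predicted more than what actually happened. A positive residual means it predicted less.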

Ø R-squared:
o The percent of variability we observe in the response variable (y) that is
due to the linear relationship between x and y.
o Denoted r²
o Must fall between 0% and 100%.
o Can be found using software or by squaring the correlation coefficient, r.
o The larger r² the better our predictions are as a result of the
regression equation.
o Examples: Interpret the values of R-squared and state which prediction equation will give better
predictions.
§ For each example: ____% of the variability we observe in the response variable is due to the
linear relationship between the two variables.
§ Note ____% of the variability is due to something else.
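As a rough check on the two ways of finding R-squared, the sketch below squares r and also computes the share of variability in y that the line accounts for. For a least squares line the two numbers agree. The data are hypothetical and the variable names are chosen here for illustration.

```python
import math

# Hypothetical data, not a class dataset
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)  # total variability in y

# Way 1: square the correlation coefficient
r = sxy / math.sqrt(sxx * syy)
r_squared = r ** 2

# Way 2: 1 minus (variability left over after using the line) / (total variability)
slope = sxy / sxx
intercept = mean_y - slope * mean_x
leftover = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
explained_share = 1 - leftover / syy

print(f"r squared: {r_squared:.1%}")
print(f"share of variability due to the line: {explained_share:.1%}")
print(f"share due to something else: {1 - explained_share:.1%}")
```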
