
linear regression

What Is Linear Regression?


In the previous unit, you learned that correlation refers to the direction
(positive or negative) and the strength (very strong to very weak) of the
relationship between two quantitative variables.

Like correlation, linear regression shows the direction and strength of the
relationship between two numeric variables. Regression goes further: it uses the
best-fitting straight line through the points on a scatter plot to predict Y
values from X values. With correlation, the values of X and Y are interchangeable.
With regression, the results of the analysis change if X and Y are swapped.
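
To make the difference concrete, here is a minimal sketch in plain Python. The five data points and the helper names (`centered_sums`, `correlation`, `slope_of`) are made up for illustration and do not come from this unit. It computes the correlation and the least-squares slope in both directions.

```python
from math import sqrt

# Hypothetical illustrative data (not from this unit).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

def centered_sums(a, b):
    """Sums of cross-products and squares, measured around each mean."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    sab = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    saa = sum((p - mean_a) ** 2 for p in a)
    sbb = sum((q - mean_b) ** 2 for q in b)
    return sab, saa, sbb

def correlation(a, b):
    """Correlation coefficient r between a and b."""
    sab, saa, sbb = centered_sums(a, b)
    return sab / sqrt(saa * sbb)

def slope_of(a, b):
    """Least-squares slope of the line that predicts b from a."""
    sab, saa, _ = centered_sums(a, b)
    return sab / saa

# Correlation gives the same answer whichever variable comes first.
print(correlation(xs, ys), correlation(ys, xs))

# Regression does not: predicting Y from X and X from Y give different lines.
print(slope_of(xs, ys), slope_of(ys, xs))
```

For this data the slope for predicting Y from X is about 0.6, while the slope for predicting X from Y is 1.0. If the second line were simply the first one solved for X, its slope would be 1/0.6, or roughly 1.67, so swapping the variables really does produce a different model.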

Note
Concepts in this unit are adapted from Introduction to Statistics.

The Linear Regression Line


Just as with correlations, for regressions to be meaningful, you must:

Use quantitative variables
Check for a linear relationship
Watch out for outliers

Like correlation, linear regression is visualized on a scatter plot.

The regression line on the scatter plot is the best-fitting straight line through
the points. In other words, it is the line that keeps the overall distance between
the points and the line as small as possible. The standard method, least squares,
minimizes the sum of the squared vertical distances from each point to the line.
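
As a rough illustration of what "best-fitting" means, the sketch below assumes the usual least-squares definition (smallest total squared vertical distance) and uses a small made-up data set, not the housing data from this unit. It computes the line with the standard closed-form formulas and then checks that nudging the line in any direction makes the fit worse.

```python
# Hypothetical illustrative data (not the housing data from this unit).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

def sum_squared_error(slope, intercept):
    """Total squared vertical distance from each point to the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Closed-form least-squares estimates for a single predictor.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

best = sum_squared_error(slope, intercept)

# Nudge the slope or the intercept in either direction and measure again.
nudged = [
    sum_squared_error(slope + 0.1, intercept),
    sum_squared_error(slope - 0.1, intercept),
    sum_squared_error(slope, intercept + 0.1),
    sum_squared_error(slope, intercept - 0.1),
]

print(f"best fit: Y = {slope:.2f}*X + {intercept:.2f}")
print("every nudged line fits worse:", all(e > best for e in nudged))
```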

Why is this line useful? If you know an X value, you can use the equation of the
regression line to predict the corresponding Y value.

To make this clearer, let's look at an example.

A Regression Example
Let’s say you want to predict how much you will need to spend to buy a house that
is 1,500 square feet. You can use linear regression to make that prediction.

Place the variable that you want to predict, home prices, on the y-axis (this is
also called the dependent variable).
Place the variable you're basing your predictions on, square footage, on the x-axis
(this is also called the independent variable).

Here is a scatter plot showing house prices (y-axis) and square footage (x-axis).

A scatter plot with blue marks showing house prices (y-axis) and square footage
(x-axis)

The scatter plot shows homes with more square feet tend to have higher prices, but
how much will you have to spend for a house that measures 1,500 square feet?

To help answer that question, create a line through the points. This is linear
regression. The regression line will help you to predict what a typical house of a
certain square footage will cost. In this example, you can see the equation for the
regression line.

The equation for the regression line is highlighted.

The equation for the line is Y = 113*X + 98,653 (with rounding).


What does this equation mean? If you bought a place with no square footage (an
empty lot, for example), the predicted price would be $98,653. Here are the steps
for solving the equation.

To find Y, multiply the value of X by 113 and then add 98,653. In this case, we are
looking at no square footage, so the value of X is 0.

Y = (113 * 0) + 98,653
Y = 0 + 98,653
Y = 98,653

The value 98,653 is called the y-intercept because this is where the line crosses,
or intercepts, the y-axis. It is the value of Y when X equals 0.

The number 113 is the slope of the line. Slope is a number that describes both the
direction and the steepness of the line. In this case, the slope forecasts that for
every additional square foot, the house price will increase by $113.

So, here’s the predicted price of a 1,500-square-foot house:

Y = (113 * 1,500) + 98,653 = $268,153
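
The same arithmetic can be written as a small function. This is only a sketch: the coefficients come from the rounded equation above, while the names `SLOPE`, `INTERCEPT`, and `predict_price` are invented here for illustration.

```python
# Coefficients from the rounded equation in this unit: Y = 113*X + 98,653.
SLOPE = 113          # dollars per additional square foot
INTERCEPT = 98_653   # predicted price in dollars when X is 0

def predict_price(square_feet):
    """Predicted house price in dollars for a given square footage."""
    return SLOPE * square_feet + INTERCEPT

print(predict_price(0))       # the y-intercept
print(predict_price(1500))    # the 1,500-square-foot house
# One extra square foot changes the prediction by exactly the slope.
print(predict_price(1501) - predict_price(1500))
```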

Take another look at this scatter plot. The blue marks are the actual data. You can
see that you have data for homes between 1,100 and 2,450 square feet.

A scatter plot with blue marks, a gray regression line, and orange lines showing
where X and Y meet on the regression line

Note that this equation cannot reliably predict the price of every house. A
500-square-foot house and a 10,000-square-foot house are both outside the range
of the actual data, so the linear pattern may not hold for them. Be careful about
making predictions for values like these.
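
One way to build that caution into a calculation is to flag any input outside the observed range. The sketch below assumes the 1,100 to 2,450 square foot range mentioned above and the rounded equation from this unit. The function name `predict_price_checked` and the idea of returning a flag are illustrative choices, not part of the original lesson.

```python
MIN_SQFT = 1_100   # smallest home in the observed data
MAX_SQFT = 2_450   # largest home in the observed data

def predict_price_checked(square_feet):
    """Return (predicted price, whether the input lies inside the data range)."""
    price = 113 * square_feet + 98_653
    in_range = MIN_SQFT <= square_feet <= MAX_SQFT
    return price, in_range

for sqft in (500, 1_500, 10_000):
    price, in_range = predict_price_checked(sqft)
    note = "" if in_range else " (outside the data range, treat with caution)"
    print(f"{sqft:>6} sq ft -> ${price:,}{note}")
```

The equation still returns a number for 500 or 10,000 square feet. The flag simply makes it visible that the number is an extrapolation rather than a prediction supported by the data.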

The r-Squared Value


In addition to the equation in this example, we also see an r-squared value (also
known as the coefficient of determination).

The r-squared value for the regression line is highlighted.

This value is a statistical measure of how close the data is to the regression
line, or how well the model fits your observations. If every data point fell
exactly on the line, the r-squared value would be 1, or 100%, meaning that the
model fits perfectly.

For our home price data, the r-squared value is 0.70, or 70%. In other words,
square footage explains about 70% of the variation in home prices in this data.
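
The unit does not show how r-squared is calculated. A common definition is 1 minus the ratio of the squared error around the line to the squared variation around the mean of Y. The sketch below uses that definition with made-up data (the housing data itself is not listed here), and the function name `r_squared` is hypothetical.

```python
def r_squared(xs, ys, slope, intercept):
    """Share of the variation in ys explained by the line slope*x + intercept."""
    mean_y = sum(ys) / len(ys)
    # Squared distance from each point to the line.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    # Squared distance from each point to the mean of Y.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical data whose least-squares line is Y = 0.6*X + 2.2.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
print(round(r_squared(xs, ys, 0.6, 2.2), 2))

# Points that sit exactly on the line Y = 2*X + 1 give a perfect fit.
on_line = [2 * x + 1 for x in xs]
print(r_squared(xs, on_line, 2, 1))
```

For the first data set the result is about 0.6, so the line explains roughly 60% of the variation in Y. For the second it is exactly 1.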

Linear Regression Versus Correlation


You may now be wondering how to distinguish between linear regression and
correlation. The table below summarizes each concept.

Linear regression | Correlation
Shows a linear model and prediction, predicting Y from X. | Shows a linear relationship between two values.
Uses r-squared to measure the percentage of variation explained by the model. | Uses r to measure the strength and direction of the correlation.
Does not use X and Y as interchangeable values (because Y is predicted from X). | Uses X and Y as interchangeable values.

Being familiar with the statistical concepts of correlation and regression helps
you to explore and understand the data you work with by examining relationships.
