Statistics Regression Final Project

Basic Statistics

FOR REFERENCE ONLY - PROPERTY OF AUTHOR

Practical applications of correlation plots and regression.

Introduction

I will be explaining what correlation is, what regression lines are, and how each is
determined. To help visualize the practical applications of correlation plots and regression,
I will show 5 data sets with trend lines using Excel. However, any program capable of drawing
scatter plots with linear trend lines will suffice. After a trend line is added and a
regression equation determined for each of the 5 data sets, I hope that their real-world
application will be evident.

Definition of Correlation

A correlation is simply the relationship or interdependence of two variables. Correlations
are useful because they can indicate a predictive relationship. For instance, consider the
amount of time a student spends studying (the first variable) and their academic performance
(the second variable). Common sense would dictate that the more time a student spends
studying, the better their academic performance, and vice versa. Hence, the first variable
and the second variable are linked in that there is, potentially, a positive relationship
between them.



In statistics, these variables are commonly labeled x and y. X is called the independent
variable and y is called the dependent variable. In the example above, the amount of time
spent studying for an exam is independent: it is up to each individual student and does not
depend on the other variable. We can call that x. The students’ academic performance, on the
other hand, does depend on the amount of time spent studying, and therefore we can call that
the y variable.

There are three types of correlation: positive correlation, negative correlation, and zero
(or no) correlation. A positive correlation is a relationship in which the two variables tend
to move in the same direction: as x increases, y tends to increase as well. For instance,
height and weight are positively correlated, since taller people tend to be heavier. A
negative correlation is a relationship in which an increase in one variable is associated
with a decrease in the other variable. For instance, the more time a student spends playing
video games, the lower their GPA tends to be. While one variable increases (time playing
video games), the other decreases (their GPA). It’s important to note that a negative
correlation does not imply a negative side effect. As a basic example, the more time a person
spends exercising, the lower their weight tends to be; this is a negative correlation but not
inherently bad. A zero correlation is one in which there is no relationship between two
variables. For instance, the lumen output of a flashlight and how waterproof it is have no
linear relationship, and therefore we cannot determine a correlation between the two
variables (luminosity and water-resistance).

A positive correlation ranges from just above 0 to +1, and a negative correlation ranges from
just below 0 to -1; how close the value is to +1 or -1 determines the strength of the
correlation. A zero correlation is simply 0. A correlation of +1 is a perfect positive
correlation. Similarly, -1 is a perfect negative correlation.



● A correlation of +0.5 is a stronger positive correlation than +0.33.

● A correlation of -0.2 is a weaker negative correlation than -0.

● The closer the data points are to the trend line, the “stronger” the correlation is. If
many points lie far from the line, it can be said that the correlation is “moderate” or even
“weak”. If there is no distinguishable relationship, then there is no correlation.

Source: danshiebler.com

Scatter Plots

A scatterplot is a graph that is used to plot the data points for two variables. Each
scatterplot has a horizontal axis (x-axis) and a vertical axis (y-axis), with one variable
plotted on each axis. Scatterplots are made up of marks; each mark represents one study
participant's measures on the variables that are on the x-axis and y-axis. Many scatter plots
include a line of best fit, which is a straight line drawn through the center of the data
points that best represents the trend of the data. Scatter plots provide a visual
representation of the relationship between the variables and make it easier to spot trends
quickly.

In statistics, the correlation coefficient r measures the strength and direction of a linear
relationship between two variables on a scatter plot. The correlation coefficient tells us
how closely the data points of a scatter plot fall along a trend line: points close to the
trend line indicate a strong correlation, while points further away indicate a relatively
weaker correlation.

The value of r is always between -1 and +1. To interpret its value, see which of the
following values your correlation coefficient r is closest to:

● r = -1: A perfect downhill (negative) linear relationship
● r = -0.70: A strong downhill (negative) linear relationship
● r = -0.50: A moderate downhill (negative) relationship
● r = -0.30: A weak downhill (negative) linear relationship
● r = 0: No linear relationship (zero correlation)
● r = +0.30: A weak uphill (positive) linear relationship
● r = +0.50: A moderate uphill (positive) relationship
● r = +0.70: A strong uphill (positive) linear relationship
● r = +1: A perfect uphill (positive) linear relationship
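As a rough illustration of how r can be calculated outside of Excel, the following Python
sketch computes the Pearson correlation coefficient directly from its definition. The
function name pearson_r and the data values are made up for this example and are not taken
from any of the data sets in this paper.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient r for two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of deviations from the means (numerator of r).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable (used in the denominator).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, chosen only to illustrate the calculation.
x_values = [1, 2, 3, 4, 5]
y_values = [2, 4, 5, 4, 5]

r = pearson_r(x_values, y_values)
print(round(r, 3))  # about 0.775, which is closest to +0.70 in the list above
```

Because r is built from deviations around each mean, its sign gives the direction of the
relationship and its distance from 0 gives the strength.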



Source: sciencedirect.com

Regression Lines

A regression line is a straight line that describes a data set in a visual way. It’s also
known as a trend line or “line of best fit”. The purpose of the line is to describe the
relationship between a dependent variable, y, and one or more independent variables, x.
Regression lines are very useful for predicting future outcomes and trends, and they are used
in a variety of ways. Some of the more common uses are predicting pandemic infection rates,
stock prices, and sports and gambling odds, as well as other areas where a strong trend may
point to a predictable potential future outcome.
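To show how a regression line can be determined, here is a minimal Python sketch, assuming
an ordinary least-squares fit (the method commonly used for linear trend lines). The function
name fit_line and the data values are hypothetical and are not from the data sets in this
paper.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = sum of products of deviations / sum of squared x deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sxy / sxx
    # The least-squares line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data, chosen only to illustrate the calculation.
x_values = [1, 2, 3, 4, 5]
y_values = [2, 4, 5, 4, 5]

slope, intercept = fit_line(x_values, y_values)
print(f"y = {slope:.3f}x + {intercept:.3f}")  # y = 0.600x + 2.200
```

The resulting slope and intercept are the two numbers that appear in a regression equation
written in the form y = mx + b.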



Data Sets and Examples

Data Set 1

This scatter plot shows a strong positive linear correlation, indicating that the more time a
student spends studying, the higher their test scores tend to be. As x increases, y increases
with a linear upward trend. The correlation coefficient is 0.86. The closer a correlation
coefficient is to 1, the stronger the correlation, so this is a strong positive correlation.
The data points are close to the trend line, which is also indicative of a strong
correlation.

If the number of hours of studying is 7, the predicted test score is about 76. Using the
regression equation in slope-intercept form, y = 3.855x + 49.156, if x = 7, then
y = (3.855 x 7) + 49.156 = 76.141.
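The same prediction can be checked with a few lines of Python. This is only a sketch: the
slope and intercept are the values reported for Data Set 1 above, while the function name
predict_score is my own.

```python
# Coefficients of the Data Set 1 trend line reported above: y = 3.855x + 49.156
SLOPE = 3.855
INTERCEPT = 49.156

def predict_score(hours_studied):
    """Predicted test score for a given number of study hours."""
    return SLOPE * hours_studied + INTERCEPT

predicted = predict_score(7)
print(round(predicted, 3))  # about 76.141
print(round(predicted))     # 76
```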



Data Set 2

This scatter plot represents a moderately weak negative linear correlation with a coefficient
of -0.46. As x increases, y tends to decrease with a linear downward trend. However, compared
to the first data set, it is easy to see that the correlation is not as strong, since the
data points are further from the trend line. According to this graph, for one reason or
another, the more time a student spends in a lab, the lower their course grade tends to be.



Data Set 3

This plot represents a strong (nearly perfect) negative linear correlation with a coefficient
of -0.98. As the age of a person increases, the number of hours spent jogging per week
decreases. The data points are almost on the trend line itself - indicating a very strong
correlation.

To predict the number of hours a 40-year-old person jogs per week, we can use the regression
equation y = -0.1396x + 10.199 and predict that he or she jogs approximately 4.6 hours per
week. If x = 40, then y = (-0.1396 x 40) + 10.199 = 4.615.
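A short Python sketch can reproduce this calculation and also show what the negative slope
means in practice. The coefficients are the ones reported for Data Set 3 above; the function
name predict_jogging_hours is hypothetical.

```python
# Coefficients of the Data Set 3 trend line reported above: y = -0.1396x + 10.199
SLOPE = -0.1396
INTERCEPT = 10.199

def predict_jogging_hours(age):
    """Predicted hours of jogging per week for a person of the given age."""
    return SLOPE * age + INTERCEPT

print(round(predict_jogging_hours(40), 3))  # about 4.615

# The slope is the predicted change in y for each one-unit increase in x,
# so ten additional years of age correspond to 10 * SLOPE hours per week.
print(round(SLOPE * 10, 3))  # about -1.396
```

In other words, according to this trend line, each additional decade of age is associated
with roughly 1.4 fewer hours of jogging per week.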



Data Set 4

This scatter plot represents a moderate positive linear correlation, with a correlation
coefficient of r = 0.59. Some data points are near the trend line while others are further
away, and thus this shows a linear correlation that is neither strong nor weak. According to
this graph, spending more on advertising may influence the number of products sold in a
positive way.

Data Set 5

The plot for data set 5 shows no linear relationship between the variables: the data points
are scattered randomly throughout and do not follow a clear trend line. According to this
graph, there is no evident linear relationship between temperature and plant growth.



Causation versus correlation

Correlation, as defined above, indicates a simple relationship between the values of two
variables. A scatter plot displays this data and is a useful tool for visually determining
whether a correlation exists between the variables and how strong it is.

Causation means that one event causes another event to occur. Causation only applies when
one variable has been shown to cause a change in a dependent variable. Causation is
established through testing and rigorous experiments, with results conventionally reported
at a confidence level of at least 95%.

Causation and correlation can occur simultaneously between two data sets. However,
correlation does not imply causation. As an example, maps seem to show a correlation between
the number of 5G cell phone towers and confirmed COVID-19 cases. However, there is no
evidence that the 5G cell phone towers actually cause or increase the risk of getting
COVID-19. The explanation might be that 5G towers tend to be located in large metropolitan
areas with larger populations, and that may account for the higher number of COVID-19 cases
compared to cities with smaller populations and no 5G towers. The 5G towers can’t be said to
“cause or increase the risk” of contracting COVID-19, and thus there is no causal link even
though a positive correlation may be seen. We are always looking for patterns around us to
explain what we see and to find links between things. Events that seem to “connect” based on
our own common sense and judgement cannot be said to be causal unless tested, and should be
assumed to be correlations.
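The role of a third variable such as population can be illustrated with a small Python
sketch. All of the numbers below are invented for illustration only: the tower and case
counts were chosen to be roughly proportional to each city's population, and neither one was
calculated from the other. They are not real 5G or COVID-19 data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient r for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Invented figures for six hypothetical cities (population in thousands).
population = [50, 120, 300, 750, 1500, 4000]
towers = [1, 2, 6, 14, 31, 79]           # roughly population / 50
cases = [26, 58, 155, 370, 760, 1990]    # roughly population / 2

# Towers and cases are strongly correlated with each other...
print(round(pearson_r(towers, cases), 3))
# ...because each one is strongly correlated with population.
print(round(pearson_r(population, towers), 3))
print(round(pearson_r(population, cases), 3))
```

In this made-up example the strong correlation between towers and cases comes entirely from
the fact that both grow with city size, which is the kind of hidden link that a scatter plot
alone cannot reveal.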

Conclusion

In conclusion, scatter plots, correlation and regression lines are very useful statistical
tools that help determine the relationship between the variables in a set of data. They allow
us to forecast and predict future outcomes, making their usage and interpretation very
important in a variety of settings. Scatter plots allow us to visualize data and quickly
determine what type of correlation, if any, exists between two variables. One drawback of a
scatter plot is that data showing correlation but not causation may be presented as evidence
of a false link between two variables. For instance, in data set 3, it can be said that as we
age we tend to jog less per week. However, this is based on only 6 people, so it cannot be
concluded that getting older results in jogging fewer hours per week. The sample size is very
small and the people may have been cherry-picked to imply causation. More rigorous
experiments and studies would need to be done to determine if there is a causative factor.
Knowing these important statistical measurement tools can help us better understand the
relationship between various factors in our world.
