Regression and Correlation

Data Management

Jolly S. Balila, PhD
Lecturer
Mathematics Society of the Philippines-CALABARZON
Enrichment Program for New GE Course: Mathematics in the Modern World
June 6-9, 2018
Outline

• Introduction to Data Management
• Measures of Central Tendency
• Measures of Dispersion
• Measures of Relative Position
• Normal Distribution
• Linear Regression and Correlation
Data are the facts and figures collected, analyzed, and summarized for presentation and interpretation (Anderson et al., 2018).
Correlation

• Correlation is used to measure and describe a relationship between two variables.
• Usually these two variables are simply observed as they exist in the environment; there is no attempt to control or manipulate the variables.
Correlation

• The correlation coefficient measures three characteristics of the relationship between X and Y:
  - The direction of the relationship.
  - The form of the relationship.
  - The degree of the relationship.
• Pearson’s r
Describing relationships: An example…

Work-life Earnings Estimates by Educational Attainment

Scatter Plot

• What is the relationship between level of education and lifetime earnings?

Education Level and Lifetime Earnings

X (Education)   Y (Income)
8               3.4
7               4.4
6               2.5
5               2.1
4               1.6
3               1.5
2               1.2
1               1

[Figure: scatter plot of Lifetime Earnings (criterion variable, y-axis) against Education (predictor variable, x-axis)]
Direction of Relationship

• A scatter plot shows at a glance the direction of the relationship.
• A positive correlation appears as a cluster of data points that slopes from the lower left to the upper right.

Positive Correlation

• If the higher scores on X are generally paired with the higher scores on Y, and the lower scores on X are generally paired with the lower scores on Y, then the direction of the correlation between the two variables is positive.
Direction of Relationship

• A scatter plot shows at a glance the direction of the relationship.
• A negative correlation appears as a cluster of data points that slopes from the upper left to the lower right.

Negative Correlation

• If the higher scores on X are generally paired with the lower scores on Y, and the lower scores on X are generally paired with the higher scores on Y, then the direction of the correlation between the two variables is negative.
No Correlation

• In cases where there is no correlation between two variables (both high and low values of X are equally paired with both high and low values of Y), there is no direction in the pattern of the dots.
• They are scattered about the plot in an irregular pattern.
Perfect Correlation

• When there is a perfect linear relationship, every change in the X variable is accompanied by a corresponding change in the Y variable.
Form of Relationship

• Pearson’s r assumes an underlying linear relationship (a relationship that can be best represented by a straight line).
• Not all relationships are linear.
Strength of Relationship

• How can we describe the strength of the relationship in a scatter plot?
• With the correlation coefficient: a number between -1 and +1 that indicates the relationship between two variables.
  - The sign (- or +) indicates the direction of the relationship.
  - The magnitude of the number indicates the strength of the relationship.

-1 ------------------------ 0 ------------------------ +1
Perfect Relationship     No Relationship     Perfect Relationship

The closer to -1 or +1, the stronger the relationship.
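Pearson’s r

• Definitional formula:

$$ r = \frac{\text{degree to which } X \text{ and } Y \text{ vary together}}{\text{degree to which } X \text{ and } Y \text{ vary separately}} $$

$$ r = \frac{COV_{XY}}{(s_x)(s_y)}, \qquad COV_{XY} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n} $$

• Computational formula:

$$ r = \frac{n\left(\sum XY\right) - \left(\sum X\right)\left(\sum Y\right)}{\sqrt{\left(n\sum X^2 - \left(\sum X\right)^2\right)\left(n\sum Y^2 - \left(\sum Y\right)^2\right)}} $$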
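The computational formula translates almost line for line into code. The sketch below is illustrative only: the function name `pearson_r` and the variable names are assumptions, not part of the slides, and the data are the education and earnings values from the scatter plot example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r via the computational (raw-score) formula."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Education level (X) and lifetime earnings (Y) from the example in these slides.
education = [8, 7, 6, 5, 4, 3, 2, 1]
earnings = [3.4, 4.4, 2.5, 2.1, 1.6, 1.5, 1.2, 1.0]

print(round(pearson_r(education, earnings), 2))
```

Only the five sums and n are needed, which is why this form is convenient for hand calculation from a table.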
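An Example: Correlation

X (Education)   Y (Income)   XY      X²     Y²
8               3.4          27.2    64     11.56
7               4.4          30.8    49     19.36
6               2.5          15      36     6.25
5               2.1          10.5    25     4.41
4               1.6          6.4     16     2.56
3               1.5          4.5     9      2.25
2               1.2          2.4     4      1.44
1               1            1       1      1
Sum: 36         17.7         97.8    204    48.83

$$ \sum X = 36, \quad \sum Y = 17.7, \quad \sum XY = 97.8, \quad \sum X^2 = 204, \quad \sum Y^2 = 48.83, \quad n = 8 $$

$$ r = \frac{n\left(\sum XY\right) - \left(\sum X\right)\left(\sum Y\right)}{\sqrt{\left(n\sum X^2 - \left(\sum X\right)^2\right)\left(n\sum Y^2 - \left(\sum Y\right)^2\right)}} $$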
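Substituting the tabulated sums into the computational formula gives the value of r for this example. This worked step is reconstructed from the sums in the table rather than copied from the slides, so treat the rounding as approximate:

$$ r = \frac{8(97.8) - (36)(17.7)}{\sqrt{\left[8(204) - (36)^2\right]\left[8(48.83) - (17.7)^2\right]}} = \frac{145.2}{\sqrt{(336)(77.35)}} \approx 0.90 $$

A value this close to +1 indicates a strong positive linear relationship between education level and lifetime earnings.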
An Example: Correlation

• Researchers who measure reaction time for human participants often observe a relationship between the reaction time scores and the number of errors that the participants commit. This relationship is known as the speed-accuracy tradeoff. The following data are from a reaction time study where the researcher recorded the average reaction time (milliseconds) and the total number of errors for each individual in a sample of 8 participants. Calculate the correlation coefficient.

Speed Accuracy Tradeoff

Reaction Time   Errors
184             10
213             6
234             2
197             7
189             13
221             10
237             4
192             9

[Figure: scatter plot of Number of Errors (y-axis) against Reaction Time (x-axis)]
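An Example: Correlation

X       X²       Y     Y²     XY
184     33856    10    100    1840
213     45369    6     36     1278
234     54756    2     4      468
197     38809    7     49     1379
189     35721    13    169    2457
221     48841    10    100    2210
237     56169    4     16     948
192     36864    9     81     1728
Sum: 1667   350385   61   555   12308

$$ r = \frac{n\left(\sum XY\right) - \left(\sum X\right)\left(\sum Y\right)}{\sqrt{\left(n\sum X^2 - \left(\sum X\right)^2\right)\left(n\sum Y^2 - \left(\sum Y\right)^2\right)}} $$

$$ r = \frac{8(12308) - (1667)(61)}{\sqrt{\left[8(350385) - (1667)^2\right]\left[8(555) - (61)^2\right]}} = -0.77 $$

The numerator is negative (98464 - 101687 = -3223), so r is negative: longer reaction times are paired with fewer errors.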
Interpreting Pearson’s r

• Correlation does not equal causation.
• Pearson’s r can tell you the strength and direction of a relationship between two variables, but not the nature of the relationship.
• Correlation analysis is used to measure the strength of the association (linear relationship) between two variables.
  - Correlation is concerned only with the strength of the relationship.
  - No causal effect is implied by correlation.
Calculating Pearson’s r in Microsoft Excel

• In Excel, there are many functions that can calculate a correlation statistic; however, we will only use =PEARSON in this class.

Example: Determine whether there is a relationship between the number of hours spent studying per week and the GPA earned in the class at the end of the quarter. Calculate Pearson’s r for our two variables.

Enter the following data into Excel:

StudyHrs = average number of hours spent per week studying for 209
GPA = grade-point average earned in 209 at the end of the quarter

Step 1: Select the cell where you want your r value to appear (you might want to label it).
Step 2: Click on the function wizard button.
Step 3: Search for and select PEARSON.
Step 4: For Array1, select all the values under StudyHrs. For Array2, select all the values under GPA.
Step 5: That’s it! Once you have your r value, don’t forget to round to 2 decimal places.

Knowledge check: What does the r value of 0.88 tell you about the strength and direction of the correlation between StudyHrs and GPA?
Scatterplots

• A scatterplot is an excellent way to visually display the relationship (correlation) between two variables.
  - Each point on the scatterplot represents an individual’s data on the two variables.
• We will now create a scatterplot for StudyHrs and GPA.

Step 1: Select both columns of variables you wish to plot (StudyHrs and GPA).
Step 2: Click on the tab labeled ‘Insert’, and then select ‘Scatter’ in the ‘Charts’ menu.
Step 3: Select the first plot in the drop-down menu.
Step 4: Remove the legend by clicking on it and pressing Delete.
Step 5: Add axis titles by selecting the ‘Layout’ tab and clicking on ‘Axis Titles.’ For the horizontal title, you want it below the x-axis. For the vertical title, you want the ‘Rotated Title’ option.

NOTE: Your chart must be highlighted for the ‘Layout’ tab to appear under ‘Chart Tools.’
A note about x- and y-axes:

• For scatterplots, it does not matter which variable goes on each axis (this is NOT true for other types of charts).
• However, you need to make sure you label your axes with the proper variable name.
• In this example, GPA is on the y-axis and Study Hours is on the x-axis (we can tell this based on their different ranges of values).
• As a helpful hint, Excel will automatically put the first variable (left-hand column) on the x-axis, and the second variable (right-hand column) on the y-axis.
Step 6: Change the chart title by selecting it, typing a new one, and pressing Enter. Chart and axis titles may be altered by right-clicking on them.

Your scatterplot is now finished!

Remember: Each point in the scatterplot represents an individual’s data.

Knowledge check: Identify Student 8 in the scatterplot.
Describing Correlations and Scatterplots

• Scatterplots and correlations are described:
  - As positive or negative.
  - As weak, moderate, or strong.
  - Using the r value.
• Sentence 1: There is a strong, positive correlation (r = 0.88) between the number of hours studied and GPA.
• Then you want to describe the general relationship between the two variables:
  - Sentence 2: More hours of studying for mathematics were associated with a higher GPA earned in the class at the end of the quarter.
• Interpretation to avoid: “More studying led to a higher GPA.” This implies causation, which cannot be determined using correlational research.
Introduction to Regression Analysis

• Regression analysis is used to:
  - Predict the value of a dependent variable based on the value of at least one independent variable.
  - Explain the impact of changes in an independent variable on the dependent variable.

Dependent variable: the variable we wish to explain.
Independent variable: the variable used to explain the dependent variable.
Simple Linear Regression Model

• Only one independent variable, X
• Relationship between X and Y is described by a linear function
• Changes in Y are assumed to be caused by changes in X
Types of Relationships

[Figure: scatter plots of Y against X showing linear relationships and curvilinear relationships]

Types of Relationships (continued)

[Figure: scatter plots of Y against X showing strong relationships and weak relationships]

Types of Relationships (continued)

[Figure: scatter plot of Y against X showing no relationship]
Hypothetical Data

Plant   Wt (gm)   Ht (cm)
A       1         3
B       2         5
C       3         7
D       4         9

[Figure: height (y) plotted against weight (x). This is called a scatter plot.]

Can we predict the height of the plant given its weight? How tall would a 30gm plant be?
Equation of a Line

Note that we can draw a straight line that will contain all the points.

The equation of this line will help us predict the height of a plant given its weight.

[Figure: plot of height (y) against weight (x)]
Equation of a Line

The equation of a line is Y = a + bX, where a is the y-intercept and b is the slope.

Note that Y refers to the height and X refers to the weight.

[Figure: plot of height (y) against weight (x)]
Equation of a Line

The variables on the x and y axes are sometimes referred to as…

x-axis         y-axis
independent    dependent
predictor      predicted
carrier        response
input          output

[Figure: plot of height (y) against weight (x)]
Equation of a Line

The y-intercept is the point on the y-axis (height) where the line crosses. Therefore the y-intercept is +1.

[Figure: plot of height (y) against weight (x)]
Equation of a Line

The slope is equal to Δy / Δx, or the change in y over the change in x.

In our case, for every 1-unit increase in x, we see a 2-unit increase in y. Thus the slope is 2/1 or 2.

Note: If the line falls instead of rises from left to right, the slope is negative.

[Figure: plot of height (y) against weight (x), with a rise of 2 marked for a run of 1]
Equation of a Line

Therefore the equation of the line is

Y = 1 + 2X

where 1 is the y-intercept and 2 is the slope.

Thus the height (Y) of a 30gm plant is

Y = 1 + 2(30) = 61cm

[Figure: plot of height (y) against weight (x)]
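The same slope, intercept, and prediction can be recovered in a few lines of code. This is a minimal sketch using the four hypothetical plants from the slides; the helper name `line_through` is an assumption introduced for illustration.

```python
def line_through(p1, p2):
    """Return (intercept, slope) of the straight line through two (x, y) points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)   # change in y over change in x
    intercept = y1 - slope * x1     # value of y where the line crosses x = 0
    return intercept, slope


# Hypothetical plant data from the slides: weight in gm -> height in cm.
plants = {"A": (1, 3), "B": (2, 5), "C": (3, 7), "D": (4, 9)}

a, b = line_through(plants["A"], plants["D"])

# Every plant lies exactly on the line Y = a + bX.
all_on_line = all(h == a + b * w for w, h in plants.values())

predicted_height_30gm = a + b * 30
print(a, b, all_on_line, predicted_height_30gm)
```

Note that 30 gm is far outside the observed weights of 1 to 4 gm, so the 61 cm figure assumes the straight-line pattern continues well beyond the data.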
Equation of a Line

In real life, however, we do not get such perfect data. Usually, we get something like this…

[Figure: scatter plot of height (y) against weight (x) in which the points do not all fall on one line]

How can we predict with data like this?
Residuals

Note that we can draw several lines that will follow the general trend of the data.

The problem is: which one best fits the data?

[Figure: plot of height (y) against weight (x)]
Residuals

Each line we draw may contain some of the data points.

For those points not on the line, their vertical distances from the line are called residuals, errors, or deviations.

[Figure: plot of height (y) against weight (x)]
Residuals

Each line we draw also generates its own set of residuals.

The line that produces the smallest residuals overall (specifically, the smallest mean squared residual) is the line that best fits the data.

[Figure: plot of height (y) against weight (x)]
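To make the idea of comparing lines by their residuals concrete, the sketch below scores a few candidate lines on the education and earnings data from the correlation example. The three candidate lines are hypothetical, chosen by eye purely for illustration, and the function name `mean_squared_residual` is an assumption.

```python
def mean_squared_residual(xs, ys, intercept, slope):
    """Average of the squared vertical distances between each point and the line."""
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return sum(e * e for e in residuals) / len(residuals)


# Education level (X) and lifetime earnings (Y) from the correlation example.
education = [8, 7, 6, 5, 4, 3, 2, 1]
earnings = [3.4, 4.4, 2.5, 2.1, 1.6, 1.5, 1.2, 1.0]

# Hypothetical candidate lines as (intercept, slope) pairs.
candidates = [(0.0, 0.5), (0.5, 0.4), (1.0, 0.3)]

scores = {line: mean_squared_residual(education, earnings, *line) for line in candidates}
best_line = min(scores, key=scores.get)

for (a, b), mse in scores.items():
    print(f"Y = {a} + {b}X  ->  mean squared residual {mse:.3f}")
print("Best of these candidates:", best_line)
```

This only ranks the lines that were tried; finding the best line among all possible lines is the job of the least squares method.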
Simple Linear Regression

The task of finding the equation of such a line (called a regression line) is called… simple linear regression.

[Figure: plot of height (y) against weight (x)]
Test of Significance

Method 1 – test whether the slope of the regression line is 0 (i.e., the regression line is horizontal). This is used to determine if the regression line shows a statistically significant linear relationship between X and Y.

Method 2 – test the significance of Pearson’s r, that is, whether the observed r could simply be a result of chance.

Although the two methods employ different t-test formulas, the resulting t-values are the same.
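The claim that both methods give the same t-value can be checked numerically. The slides do not show the two formulas, so the sketch below assumes the standard textbook forms: t = r·sqrt(n − 2)/sqrt(1 − r²) for Pearson’s r, and t = b1/SE(b1) for the least-squares slope, each with n − 2 degrees of freedom. The function name is hypothetical.

```python
from math import sqrt


def t_from_r_and_slope(xs, ys):
    """Return (t computed from Pearson's r, t computed from the regression slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    # Method 2: t statistic built from Pearson's r.
    r = sxy / sqrt(sxx * syy)
    t_from_r = r * sqrt(n - 2) / sqrt(1 - r ** 2)

    # Method 1: least-squares slope divided by its standard error.
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    se_b1 = sqrt(sse / (n - 2) / sxx)
    t_from_slope = b1 / se_b1

    return t_from_r, t_from_slope


# Education and earnings data from the correlation example.
education = [8, 7, 6, 5, 4, 3, 2, 1]
earnings = [3.4, 4.4, 2.5, 2.1, 1.6, 1.5, 1.2, 1.0]

t_r, t_slope = t_from_r_and_slope(education, earnings)
print(round(t_r, 3), round(t_slope, 3))
```

The two numbers agree up to floating-point rounding, which is why either method leads to the same conclusion about significance.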
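Simple Linear Regression Model

The population regression model:

$$ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i $$

where
  Y_i is the dependent variable,
  β0 is the population Y intercept,
  β1 is the population slope coefficient,
  X_i is the independent variable,
  ε_i is the random error term.

β0 + β1 X_i is the linear component; ε_i is the random error component.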
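Simple Linear Regression Model (continued)

$$ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i $$

[Figure: the population regression line plotted on X and Y axes, with intercept β0 and slope β1. At a given X_i, the random error ε_i is the gap between the observed value of Y for X_i and the predicted value of Y for X_i.]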
Simple Linear Regression Equation

The simple linear regression equation provides an estimate of the population regression line:

$$ \hat{Y}_i = b_0 + b_1 X_i $$

where
  Ŷ_i is the estimated (or predicted) Y value for observation i,
  b0 is the estimate of the regression intercept,
  b1 is the estimate of the regression slope,
  X_i is the value of X for observation i.

The individual random error terms e_i have a mean of zero.


Least Squares Method

• b0 and b1 are obtained by finding the values of b0 and b1 that minimize the sum of the squared differences between Y and Ŷ:

$$ \min \sum \left(Y_i - \hat{Y}_i\right)^2 = \min \sum \left(Y_i - (b_0 + b_1 X_i)\right)^2 $$
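For reference, this minimization has a well-known closed-form solution in simple linear regression. The expressions below are the standard textbook result written in the notation of these slides; they are not reproduced from the slides themselves:

$$ b_1 = \frac{n\sum XY - \left(\sum X\right)\left(\sum Y\right)}{n\sum X^2 - \left(\sum X\right)^2}, \qquad b_0 = \bar{Y} - b_1 \bar{X} $$

The numerator of b1 is the same as the numerator of the computational formula for Pearson’s r, so the slope and the correlation coefficient always share the same sign.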
Finding the Least Squares Equation

• The coefficients b0 and b1, and other regression results in this chapter, will be found using Excel.
• Formulas are shown in the text at the end of the chapter for those who are interested.
Interpretation of the Slope and the Intercept

• b0 is the estimated average value of Y when the value of X is zero.
• b1 is the estimated change in the average value of Y as a result of a one-unit change in X.
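As an illustration of these two interpretations, the sketch below estimates b0 and b1 for the education and earnings data from the correlation example. It assumes the standard closed-form least-squares formulas (the slides defer the formulas to the textbook), and the function name `least_squares_line` is hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) for the least-squares line Y-hat = b0 + b1 * X."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = sum_y / n - b1 * sum_x / n  # mean of Y minus b1 times mean of X
    return b0, b1


# Education level (X) and lifetime earnings (Y) from the correlation example.
education = [8, 7, 6, 5, 4, 3, 2, 1]
earnings = [3.4, 4.4, 2.5, 2.1, 1.6, 1.5, 1.2, 1.0]

b0, b1 = least_squares_line(education, earnings)
print(f"Estimated line: Y-hat = {b0:.3f} + {b1:.3f} X")
```

Here b1 is roughly 0.43, so each one-level increase in education is associated with an estimated increase of about 0.43 in average lifetime earnings, in the units of the table. b0 is roughly 0.27, the estimated average earnings at X = 0; since no observation has X = 0, that value is an extrapolation and should be read with caution.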
Proverbs 11:1

• “A false balance is an abomination to the Lord, but a just weight is a delight.”
• Thank you for listening
References

• Anderson, D. R., Sweeney, D. J., Williams, T. A., Camm, J. D., & Cochran, J. J. (2018). Essentials of modern business statistics. Cengage Learning, USA.
• Adams, K. A. (2015). Research methods, statistics, and applications. Los Angeles: SAGE.
• Vogt, W. P. (2014). Selecting the right analyses for your data: Quantitative, qualitative, and mixed methods. New York: The Guilford Press.
EXERCISES
