
INTRODUCTION TO STATISTICS & PROBABILITY

Chapter 2:
Looking at Data–Relationships

Dr. Nahid Sultana



Chapter 2:
Looking at Data–Relationships

Introduction
2.4 Least-Squares Regression
2.5 Cautions about Correlation and Regression
2.6 Data Analysis for Two-Way Tables

Introduction

Objectives

➢ Relationships
➢ Scatterplots
➢ Correlation

Bivariate data

➢ For each individual studied, we record data on two variables.


➢ We then examine whether there is a relationship between these two
variables:

Size and price of a coffee beverage: Suppose you visited a local Starbucks
to buy a Mocha. The barista explains that this blended coffee beverage comes
in three sizes: small, medium, and large, and the prices are $3.15, $4.65,
and $5.15, respectively.

✓ There is a clear association between the size of the Mocha and its price.

Associations Between Variables


Many interesting examples of the use of statistics involve relationships
between pairs of variables.

Two variables measured on the same cases are associated if


knowing the value of one of the variables tells you something about
the values of the other variable that you would not know without this
information.

➢ A response (dependent) variable measures an outcome of a study.


➢ An explanatory (independent) variable explains changes in the
response variable.
Scatterplot

➢ The most useful graph for displaying the relationship between two
quantitative variables on the same individuals is a scatterplot.

How to Make a Scatterplot


1. Decide which variable should go on which axis.
2. Typically, the explanatory or independent variable is plotted
on the x-axis, and the response or dependent variable is plotted
on the y-axis.
3. Label and scale your axes.
4. Plot individual data values.
Scatterplot (Cont…)

Example: Make a scatterplot of the relationship between body weight and
backpack weight for a group of hikers.

Body weight (lb):     120  187  109  103  131  165  158  116
Backpack weight (lb):  26   30   26   24   29   35   31   28
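The slides do not tie this example to any particular software; as an illustration only, here is a minimal sketch of how the scatterplot could be drawn in Python (matplotlib assumed):

```python
# Sketch: scatterplot of the hiker data above (matplotlib assumed)
import matplotlib.pyplot as plt

body = [120, 187, 109, 103, 131, 165, 158, 116]   # explanatory variable (x)
backpack = [26, 30, 26, 24, 29, 35, 31, 28]       # response variable (y)

plt.scatter(body, backpack)
plt.xlabel("Body weight (lb)")
plt.ylabel("Backpack weight (lb)")
plt.show()
```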

Interpreting Scatterplots

How to Examine a Scatterplot

➢ After plotting two variables on a scatterplot, we describe the overall
pattern of the relationship. Specifically, we look for form, direction,
and strength.
Form: linear, curved, clusters, no pattern
Direction: positive, negative, no direction
Strength: how closely the points fit the “form”
➢ … and clear deviations from that pattern
Outliers: an individual value that falls outside the overall pattern of the
relationship.
Interpreting Scatterplots (Cont…)
(Form)

[Example scatterplots showing three forms: linear, no relationship, nonlinear]

Interpreting Scatterplots (Cont…)
(Direction)

Positive association: High values of one variable tend to occur together
with high values of the other variable.
Negative association: High values of one variable tend to occur together
with low values of the other variable.

Interpreting Scatterplots (Cont…)

No relationship: X and Y vary independently. Knowing X tells you


nothing about Y.

Interpreting Scatterplots (Cont…)
(Strength)

The strength of the relationship between the two variables can be


seen by how much variation, or scatter, there is around the main
form.

Interpreting Scatterplots (Cont…)
(Outliers)

In a scatterplot, outliers are points that fall outside of the overall


pattern of the relationship.

Interpreting Scatterplots (Cont…)

✓ There is one possible outlier―the hiker with the body weight of 187 pounds
seems to be carrying relatively less weight than are the other group members.


✓ There is a moderately strong, positive, linear relationship between body
weight and backpack weight.
✓ It appears that lighter hikers are carrying lighter backpacks.
Categorical variables in scatterplots

To add a categorical variable, use a different plot color or symbol for each
category.

What may look like a positive linear relationship is in fact a series of
negative linear associations. Plotting different habitats in different
colors allows us to make that important distinction.
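A minimal sketch of this idea (the habitat values below are hypothetical, since the slide's data are not reproduced here; matplotlib assumed):

```python
# Sketch: marking a categorical variable with a different color per category
import matplotlib.pyplot as plt

groups = {
    "habitat A": ([1, 2, 3, 4], [6.0, 5.1, 4.3, 3.6]),   # hypothetical values
    "habitat B": ([5, 6, 7, 8], [8.2, 7.0, 6.1, 5.4]),
}
for label, (x, y) in groups.items():
    plt.scatter(x, y, label=label)   # each category gets its own color
plt.legend()
plt.show()
```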

Categorical variables in scatterplots (Cont…)
Comparison of men and women
racing records over time.
Each group shows a very strong
negative linear relationship that
would not be apparent without the
gender categorization.
Relationship between lean body
mass and metabolic rate in men
and women.
Both men and women follow the
same positive linear trend, but
women show a stronger association.
Categorical explanatory variables
When the explanatory variable is categorical, you cannot make a
scatterplot, but you can compare the different categories side by side on
the same graph (boxplots, or mean +/− standard deviation).

Comparison of income (quantitative


response variable) for different
education levels (five categories).

But be careful in your


interpretation: This is NOT a
positive association, because
education is not quantitative.
Nonlinear Relationships
▪ There are other forms of relationships besides linear. The
scatterplot below is an example of a nonlinear form.

▪ Note that there is curvature in the relationship between x


and y.

Correlation

➢ The correlation coefficient r


➢ Properties of r
➢ Influential points

The correlation coefficient "r"

➢ The correlation coefficient is a measure of the direction and strength of
a linear relationship.
➢ Correlation can only be used to describe quantitative variables.
Categorical variables don’t have means and standard deviations.
➢ It is calculated using the mean and the standard deviation of both the x
and y variables.
➢ Suppose that we have data on variables x and y for n individuals. The
means and standard deviations of the two variables are x̄ and sx for the
x-values, and ȳ and sy for the y-values.
➢ The correlation r between x and y is

        r = (1/(n−1)) Σᵢ₌₁ⁿ [(xᵢ − x̄)/sx] [(yᵢ − ȳ)/sy]
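As an illustration only (not part of the original slides; numpy assumed), the formula above can be applied directly, here reusing the hiker data from the earlier scatterplot example:

```python
# Sketch: r computed from its definition (standardize x and y, average the products)
import numpy as np

x = np.array([120, 187, 109, 103, 131, 165, 158, 116], dtype=float)  # body weight
y = np.array([26, 30, 26, 24, 29, 35, 31, 28], dtype=float)          # backpack weight

n = len(x)
zx = (x - x.mean()) / x.std(ddof=1)   # standardized x values (sample sd)
zy = (y - y.mean()) / y.std(ddof=1)   # standardized y values
r = (zx * zy).sum() / (n - 1)         # r = (1/(n-1)) * sum of the products

print(round(r, 3))                    # same value as np.corrcoef(x, y)[0, 1]
```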


"r" ranges from -1 to +1

Properties of Correlation
➢ r is always a number between –1 and 1.
➢ r > 0 indicates a positive association.
r < 0 indicates a negative association.
➢ Values of r near 0 indicate a very
weak linear relationship.
➢ The strength of the linear relationship
increases as r moves away from 0
toward –1 or 1.
➢ The extreme values r = –1 and r = 1
occur only in the case of a perfect
linear relationship.
Properties of Correlation

1. Correlation makes no distinction between explanatory and response


variables.
2. r has no units and does not change when we change the units of
measurement of x, y, or both.
3. Positive r indicates positive association between the variables, and
negative r indicates negative association.
4. The correlation r is always a number between –1 and 1.
Cautions:
▪ Correlation requires that both variables be quantitative.
▪ Correlation does not describe curved relationships between
variables, no matter how strong the relationship is.
▪ Correlation is not resistant. r is strongly affected by a few
outlying observations.
▪ Correlation is not a complete summary of two-variable data.
2.4 Least-Squares Regression

Objectives

➢ Regression lines
➢ Least-squares regression line
➢ Facts about Least-Squares Regression
➢ Correlation and Regression

Regression line

➢ Correlation tells us about strength and direction of the linear


relationship between two quantitative variables.
➢ In regression, we study the association between two variables in order to
explain the values of one from the values of the other (i.e., to make
predictions).
➢ When there is a linear association between two variables, then a
straight line equation can be used to model the relationship.
➢ In regression the distinction between Response and Explanatory is
important.

Regression Line

A regression line is a straight line that describes how a response variable


y changes as an explanatory variable x changes.
We can use a regression line to predict the value of y for a given value of x.

Example: Predict the number of


new adult birds that join the colony
based on the percent of adult
birds that return to the colony from
the previous year.

If 60% of adults return, how


many new birds are predicted?
Regression line (Cont…)

➢ A regression line is a line that best describes the linear relationship
between the two variables, and it is expressed by means of an equation of
the form:

        ŷ = b0 + b1 x

where b1 is the slope and b0 is the intercept.

➢ Once the equation of the regression line is established, we can


use it to predict the response y for a specific value of the
explanatory variable x .
The least-squares regression line

The least-squares regression line is the line that makes the sum of
the squares of the vertical distances of the data points from the
line as small as possible.

The least-squares regression line (Cont.)

The equation of the least-squares regression line of y on x is

        ŷ = b0 + b1 x

ŷ is the predicted y value (“y hat”)
b1 is the slope
b0 is the y-intercept

How to plot the least-squares regression line

First we calculate the slope of the line:

        b1 = r (sy / sx)

where
r is the correlation,
sy is the standard deviation of the response variable y,
sx is the standard deviation of the explanatory variable x.

Once we know b1, the slope, we can calculate b0, the y-intercept:

        b0 = ȳ − b1 x̄

where x̄ and ȳ are the sample means of the x and y variables.

Typically, we use stats software.
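For illustration, a minimal sketch of these two formulas in Python (the function name and the example numbers are ours, not from the slides):

```python
# Sketch: least-squares slope and intercept from summary statistics
def lsrl_from_summary(r, x_bar, y_bar, s_x, s_y):
    """Return (b0, b1) for the least-squares line y-hat = b0 + b1*x."""
    b1 = r * s_y / s_x          # slope: b1 = r * (sy / sx)
    b0 = y_bar - b1 * x_bar     # intercept: b0 = y-bar - b1 * x-bar
    return b0, b1

# Example call with made-up summary numbers:
b0, b1 = lsrl_from_summary(r=0.9, x_bar=10.0, y_bar=50.0, s_x=2.0, s_y=5.0)
print(b0, b1)   # 27.5 and 2.25
```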


Two different regression lines can be drawn if we interchange the roles of
x and y.

Example: Fitted line plots for the same data, with the roles of x and y
interchanged:

    Fat = 3.505 − 0.003441 NEA   (fat gain, in kilograms, against nonexercise activity, in calories)
    NEA = 745.3 − 176.1 Fat      (nonexercise activity against fat gain)

The correlation coefficient of NEA and Fat, r = −0.779, stays the same in both cases.
BEWARE!!!

Not all calculators and software use the same convention. Some use:

        ŷ = a + bx

and some use:

        ŷ = ax + b
Make sure you know what YOUR calculator gives you for a and b before
you answer homework or exam questions.
Facts About Least-Squares Regression

Least-squares is the most common method for fitting a regression line to
data. Here are some facts about least-squares regression lines.

➢ Fact 1: A change of one standard deviation in x corresponds to a change
of r standard deviations in y.
➢ Fact 2: The LSRL always passes through (x̄, ȳ).
➢ Fact 3: The distinction between explanatory and response variables is
essential.
Example: Powerboat registrations (in 1000s) and manatee deaths

Year   Powerboats (1000s)   Dead manatees
1977        447                 13
1978        460                 21
1979        481                 24
1980        498                 16
1981        513                 24
1982        512                 20
1983        526                 15
1984        559                 34
1985        585                 33
1986        614                 33
1987        645                 39
1988        675                 43
1989        711                 50
1990        719                 47

Least-squares regression line: ŷ = 0.125x − 41.4

➢ There is a positive linear relationship between the number of


powerboats registered and the number of manatee deaths.
➢ The least-squares regression line has the equation: ŷ = 0.125x − 41.4
➢ Thus, if we were to limit the number of powerboat registrations to
500,000, what could we expect for the number of manatee deaths?
        ŷ = 0.125(500) − 41.4 = 62.5 − 41.4 = 21.1   ➢ Roughly 21 manatees.
----Could we use this regression line to predict the number of manatee
deaths for a year with 200,000 powerboat registrations?
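As an illustration only (numpy assumed; not part of the original slides), the line can be refit from the table above and used for the 500 (i.e., 500,000-registration) prediction. Note that x = 200 would lie well below the observed range of roughly 447 to 719, so that prediction would be an extrapolation (see the next slide):

```python
# Sketch: refitting the manatee regression line from the table above
import numpy as np

powerboats = np.array([447, 460, 481, 498, 513, 512, 526, 559,
                       585, 614, 645, 675, 711, 719], dtype=float)   # in 1000s
manatees = np.array([13, 21, 24, 16, 24, 20, 15, 34,
                     33, 33, 39, 43, 50, 47], dtype=float)

b1, b0 = np.polyfit(powerboats, manatees, 1)   # slope ~0.125, intercept ~ -41.4
print(round(b0 + b1 * 500, 1))                 # predicted deaths at 500,000 registrations ~ 21
```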
Extrapolation !!!

Extrapolation is the use of a


regression line for prediction
far outside the range of values
of x used to obtain the line.

Such predictions are often not


accurate.

Extrapolation (cont…)

➢ Sarah’s height was plotted


against her age.
➢ Can you guess (predict)
her height at age 42
months?
➢ Can you predict her height
at age 30 years (360
months)?

Extrapolation (cont…)

➢ Regression line: ŷ = 71.95 + 0.383x
➢ Height at age 42 months? ŷ = 88
➢ Height at age 30 years? ŷ = 209.8
➢ She is predicted to be 6’10.5”
at age 30! What’s wrong?
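A quick arithmetic check of these predictions (the heights appear to be recorded in centimeters, which is an inference from the numbers, not stated on the slide): ŷ = 71.95 + 0.383(42) = 88.0, and ŷ = 71.95 + 0.383(360) = 209.8 cm ≈ 82.6 inches ≈ 6 ft 10.6 in, essentially the 6’10.5” quoted above. The line was fit to data from early childhood, so using it at x = 360 months is extrapolation.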
Coefficient of determination, r²

➢ Least-squares regression looks at the distances of the data points


from the line only in the y direction.

➢ The variables x and y play different roles in regression.

➢ Even though correlation r ignores the distinction between x and y,


there is a close connection between correlation and regression.

➢ r² is called the coefficient of determination.

➢ r² represents the percentage of the variance in y (vertical scatter from
the regression line) that can be explained by changes in x.
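As an illustration only (numpy assumed), r² can be computed two equivalent ways: as the square of the correlation, or as the fraction of the total scatter in y that the line accounts for. The manatee data from the earlier example are reused here:

```python
# Sketch: r-squared two ways for a simple least-squares fit
import numpy as np

x = np.array([447, 460, 481, 498, 513, 512, 526, 559, 585, 614, 645, 675, 711, 719], float)
y = np.array([13, 21, 24, 16, 24, 20, 15, 34, 33, 33, 39, 43, 50, 47], float)

r = np.corrcoef(x, y)[0, 1]
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

ss_resid = ((y - y_hat) ** 2).sum()      # vertical scatter left around the line
ss_total = ((y - y.mean()) ** 2).sum()   # total scatter of y around its mean
print(r ** 2, 1 - ss_resid / ss_total)   # the two values agree
```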


r = −1, r² = 1: Changes in x explain 100% of the variations in y. Y can be
entirely predicted for any given value of x.

r = 0, r² = 0: Changes in x explain 0% of the variations in y. The values y
takes are entirely independent of what value x takes.

r = 0.87, r² = 0.76: Here the change in x only explains 76% of the change in
y. The rest of the change in y (the vertical scatter, shown as red arrows)
must be explained by something other than x.
r = –0.3, r² = 0.09, or 9%: The regression model explains not even 10% of
the variations in y.

r = –0.7, r² = 0.49, or 49%: The regression model explains nearly half of
the variations in y.

r = –0.99, r² = 0.9801, or ~98%: The regression model explains almost all of
the variations in y.
2.5 Cautions About Correlation and Regression
Objectives

➢ Residuals and residual plots


➢ Outliers and influential observations
➢ Lurking variables
➢ Correlation and causation

Residuals
A residual is the difference between an observed value of the
response variable and the value predicted by the regression line:
residual = observed y – predicted y = y − ŷ

Points above the line have a positive residual; points below the line have
a negative residual. The sum of these residuals is always 0.

[Figure: for each point, the vertical distance (y − ŷ) between the observed
y and the predicted ŷ is the residual.]

Residual plots

➢ A residual plot is a scatterplot of the regression residuals against


the explanatory variable.
➢ Residual plots help us assess the fit of a regression line.
➢ If the residuals are scattered randomly around 0, chances are your data
fit a linear model, are approximately normally distributed, and don’t
contain outliers.
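A minimal sketch of building a residual plot (numpy and matplotlib assumed), reusing the hiker data from earlier in the chapter:

```python
# Sketch: residuals plotted against the explanatory variable
import numpy as np
import matplotlib.pyplot as plt

x = np.array([120, 187, 109, 103, 131, 165, 158, 116], float)   # body weight (lb)
y = np.array([26, 30, 26, 24, 29, 35, 31, 28], float)           # backpack weight (lb)

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)    # residual = observed y - predicted y

plt.scatter(x, residuals)        # residual plot
plt.axhline(0)                   # reference line at residual = 0
plt.xlabel("Body weight (lb)")
plt.ylabel("Residual")
plt.show()
```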
The x-axis in a residual plot is the same as on the scatterplot. Only the
y-axis is different.
➢ Residuals are randomly scattered—good!

➢ Curved pattern—means the relationship you are looking at is not linear.

➢ A change in variability across the plot is a warning sign. You need to
find out why it happens, and remember that predictions made in areas of
larger variability will not be as good.
Outliers and Influential Points
An outlier is an observation that lies outside the overall pattern of the
other observations.

➢ Outliers in the y direction have large residuals.


➢ Outliers in the x direction are often influential for the least-squares
regression line, meaning that the removal of such points would
markedly change the equation of the line.

Outliers and Influential Points (cont…)

Gesell Adaptive Score and Age at First Word

From all of the data: r² = 41%
After removing child 18: r² = 11%

Cautions About Correlation and Regression
➢ Both describe linear relationships.
➢ Both are affected by outliers.
➢ Always plot the data before interpreting.
➢ Beware of extrapolation: Use caution in predicting y when x is
outside the range of observed x’s.
➢ Beware of lurking variables--these have an important effect on the
relationship among the variables in a study, but are not included in
the study.
➢ Correlation does not imply causation!
Example:
A personal trainer wants to look at the relationship between number of hours of
exercise per week and resting heart rate of her clients. The data show a linear
pattern with the summary statistics shown below:

                                            mean    standard deviation
x = hours of exercise per week                        sx = 4.8
y = resting heart rate (beats per minute)             sy = 7.2

r = −0.88

Find the equation of the least-squares regression line for predicting resting
heart rate from the hours of exercise per week.
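One worked step toward the answer, as an illustration (only the slope can be evaluated here, because the sample means x̄ and ȳ did not survive in these notes): b1 = r(sy/sx) = (−0.88)(7.2/4.8) = −1.32 beats per minute per additional hour of exercise, and the intercept would then be b0 = ȳ − b1 x̄ = ȳ + 1.32 x̄.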

2.6 Data Analysis for Two-Way Tables

Objectives

➢ The Two-Way Table


➢ Joint distribution
➢ Marginal Distribution
➢ Conditional Distributions

Two-way tables

Two-way tables summarize data about two categorical variables (or


factors) collected on the same set of individuals.
Example (Smoking Survey in Arizona): High school students were
asked whether they smoke and whether their parents smoke.
Does parental smoking influence the smoking habits of their high school
children?
Explanatory Variable: Smoking habit of student’s parents
(both smoke/ one smoke/ neither smoke)
Response variable: Smoking habit of student
(smokes/does not smoke)
To analyze the relationship we can summarize the result in a two-way table:
Two-way tables (Cont …)

Explanatory (Row) Variable: Smoking habit of student’s parents
Response (Column) Variable: Smoking habit of student

High school students were asked whether they smoke, and whether their
parents smoke (first factor: parent smoking status; second factor: student
smoking status):

                          Student smokes   Student does not smoke
Both parents smoke             400               1380
One parent smokes              416               1823
Neither parent smokes          188               1168

This 3×2 two-way table has 3 rows and 2 columns. The numbers are counts, or
frequencies.
Margins

Margins show the total for each column and each row.

                          Student smokes   Student does not smoke   Total
Both parents smoke             400               1380               1780
One parent smokes              416               1823               2239
Neither parent smokes          188               1168               1356
Total                         1004               4371               5375

The row totals (1780, 2239, 1356) are the margin for parental smoking; the
column totals (1004, 4371) are the margin for student smoking.

➢ For each cell, we can compute a proportion by dividing the cell entry by
the total sample size.
➢ The collection of these proportions is the joint distribution of the two
categorical variables.
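A minimal sketch of these computations (pandas assumed; not part of the original slides), using the counts from the table above:

```python
# Sketch: the 3x2 table, its joint distribution, and its margins
import pandas as pd

counts = pd.DataFrame(
    {"Smokes": [400, 416, 188], "Does not smoke": [1380, 1823, 1168]},
    index=["Both parents smoke", "One parent smokes", "Neither parent smokes"],
)

joint = counts / counts.values.sum()   # each cell divided by the grand total (5375)
print(joint.round(3))
print(counts.sum(axis=1))              # row margins: 1780, 2239, 1356
print(counts.sum(axis=0))              # column margins: 1004, 4371
```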
Marginal distributions
(When we examine the distribution of a single variable in a two-way table)

❖ Marginal distributions: The distribution of the column variable separately
(or the row variable separately), expressed in counts or percents.

                          Student smokes   Student does not smoke   Total
Both parents smoke             400               1380               33.1%
One parent smokes              416               1823               41.7%
Neither parent smokes          188               1168               25.2%
Total                         18.7%              81.3%              100%

For example, the marginal percent for “both parents smoke” is
1780/5375 = 33.1%, and the marginal percent for “student smokes” is
1004/5375 = 18.7%.
Marginal distribution (Cont…)

Parental smoking      Smoker   Nonsmoker   Total
Both                    400      1380      33.1%
One                     416      1823      41.7%
Neither                 188      1168      25.2%
Total                  18.7%     81.3%     100%

The marginal distributions can be displayed on separate bar graphs,
typically expressed as percents instead of raw counts. Each graph represents
only one of the two variables, ignoring the second one. Each marginal
distribution can also be shown in a pie chart.

[Bar graphs: percent of students interviewed by parental smoking status
(Both/One/Neither) and by student smoking status (Smoker/Nonsmoker).]
Conditional Distribution

A conditional distribution is the distribution of one factor for each level
of the other factor.

A conditional percent is computed using the counts within a single row or a
single column. The denominator is the corresponding row or column total
(rather than the table grand total).

                          Student smokes   Student does not smoke
Both parents smoke             400               1380
One parent smokes              416               1823
Neither parent smokes          188               1168

Percent of students who smoke when both parents smoke = 400/1780 = 22.5%
Conditional distributions (Cont…)

➢ Comparing conditional distributions helps us describe the “relationship"


between the two categorical variables.
➢ We can compare the percent of individuals in one level of factor 1 for
each level of factor 2.

400 1380
416 1823
188 1168

Conditional distribution of student smokers for different parental smoking statuses:


Percent of students who smoke when both parents smoke = 400/1780 = 22.5%
Percent of students who smoke when one parent smokes = 416/2239 = 18.6%
Percent of students who smoke when neither parent smokes = 188/1356 = 13.9%
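A minimal sketch of the same conditional computations (pandas assumed; not part of the original slides):

```python
# Sketch: conditional distributions, dividing each row by its own row total
import pandas as pd

counts = pd.DataFrame(
    {"Smokes": [400, 416, 188], "Does not smoke": [1380, 1823, 1168]},
    index=["Both parents smoke", "One parent smokes", "Neither parent smokes"],
)
conditional = counts.div(counts.sum(axis=1), axis=0)   # row percents, not grand-total percents
print((100 * conditional).round(1))                    # "Smokes" column: 22.5, 18.6, 13.9
```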
Conditional distributions (Cont…)

The conditional distributions can be compared graphically by displaying the
percents making up one level of one factor, for each level of the other
factor.

Conditional distribution of student smoking status for different levels of
parental smoking status:

                          Percent who smoke   Percent who do not smoke   Row total
Both parents smoke               22%                   78%                 100%
One parent smokes                19%                   81%                 100%
Neither parent smokes            14%                   86%                 100%


Conditional Distribution
➢ In the table below, the 25 to 34 age group occupies the first column.


Conditional distributions (Cont…)

Here the percents are calculated by age range (columns).

        29.30% = 11071 / 37785 = cell total / column total


The conditional distributions can be compared graphically using side-by-side
bar graphs of one variable for each value of the other variable. Here, the
percents are calculated by age range (columns).



Young adults by gender and chance of getting rich by age 30

Female Male Total


Almost no chance 96 98 194
Some chance, but probably not 426 286 712
A 50-50 chance 696 720 1416
A good chance 663 758 1421
Almost certain 486 597 1083
Total 2367 2459 4826

What are the variables described by this two-way table?

How many young adults were surveyed?


Marginal Distribution

Young adults by gender and chance of getting rich

Examine the marginal distribution of chance of getting rich.

                               Female   Male   Total
Almost no chance                  96      98     194
Some chance, but probably not    426     286     712
A 50-50 chance                   696     720    1416
A good chance                    663     758    1421
Almost certain                   486     597    1083
Total                           2367    2459    4826

Response             Percent
Almost no chance     194/4826 = 4.0%
Some chance          712/4826 = 14.8%
A 50-50 chance       1416/4826 = 29.3%
A good chance        1421/4826 = 29.4%
Almost certain       1083/4826 = 22.4%
Conditional Distribution

Young adults by gender and chance of getting rich

                               Female   Male   Total
Almost no chance                  96      98     194
Some chance, but probably not    426     286     712
A 50-50 chance                   696     720    1416
A good chance                    663     758    1421
Almost certain                   486     597    1083
Total                           2367    2459    4826

1. Calculate the conditional distribution of opinion among males.
2. Examine the relationship between gender and opinion.

Response             Male                Female
Almost no chance     98/2459 = 4.0%      96/2367 = 4.1%
Some chance          286/2459 = 11.6%    426/2367 = 18.0%
A 50-50 chance       720/2459 = 29.3%    696/2367 = 29.4%
A good chance        758/2459 = 30.8%    663/2367 = 28.0%
Almost certain       597/2459 = 24.3%    486/2367 = 20.5%
Simpson’s Paradox

Consider the acceptance rates for the following groups of men and women who
applied to college.

Counts     Accepted   Not accepted   Total
Men           198          162        360
Women          88          112        200
Total         286          274        560

Percents   Accepted   Not accepted
Men           55%          45%
Women         44%          56%

A higher percentage of men were accepted: Is there evidence of
discrimination?
Simpson’s Paradox (cont…)

Consider the acceptance rates when broken down by type of school.

BUSINESS SCHOOL
Counts     Accepted   Not accepted   Total
Men            18          102        120
Women          24           96        120
Total          42          198        240
Percents: Men 15% accepted, 85% not accepted; Women 20% accepted, 80% not accepted.

ART SCHOOL
Counts     Accepted   Not accepted   Total
Men           180           60        240
Women          64           16         80
Total         244           76        320
Percents: Men 75% accepted, 25% not accepted; Women 80% accepted, 20% not accepted.

Within each school a higher percentage of women were accepted than men.
Simpson’s Paradox (cont…)
Within each school a higher percentage of women were accepted than men.

So there is no discrimination against women!!!

➢ Lurking variables have an important effect on the relationship among the
variables in a study, but are not included in the study.

✓ Lurking variable: Applications were split between the Business School
(240) and the Art School (320).

This is an example of Simpson’s Paradox.
➢ When the lurking variable (type of school: Business or Art) is ignored,
the data seem to suggest discrimination against women.
➢ However, when the type of school is considered, the association is
reversed and suggests discrimination against men.
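As an illustration only (not part of the original slides), the reversal can be reproduced directly from the counts in the tables above:

```python
# Sketch: acceptance rates overall and within each school (Simpson's paradox)
accepted = {"Business": {"Men": 18, "Women": 24}, "Art": {"Men": 180, "Women": 64}}
applied  = {"Business": {"Men": 120, "Women": 120}, "Art": {"Men": 240, "Women": 80}}

for sex in ["Men", "Women"]:
    overall = sum(accepted[s][sex] for s in accepted) / sum(applied[s][sex] for s in applied)
    print(sex, "overall:", round(100 * overall), "%")        # 55% for men, 44% for women
    for school in ["Business", "Art"]:
        rate = accepted[school][sex] / applied[school][sex]
        print("  ", school, ":", round(100 * rate), "%")     # women higher within each school
```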
Simpson’s Paradox (cont…)

An association or comparison that holds for all of several groups


can reverse direction when the data are combined to form a
single group. This reversal is called Simpson’s paradox.
