Download as pdf or txt
Download as pdf or txt
You are on page 1of 73

Lecture 2: Summarizing data

Bo LI

School of Economics and Management


Tsinghua University

libo@sem.tsinghua.edu.cn

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 1 / 73


Overview
1 Examining numerical data
Dot plots and the mean
Histograms and shape
Variance and standard deviation
Box plots, quartiles, and the median
Robust statistics
Transforming data
Scatterplots for paired data
2 Considering categorical data
Contingency tables and bar plots
Row and column proportions
Using a bar plot with two variables
Mosaic plots
Pie charts
Comparing numerical data across groups
Deceptive
Bo LI (Tsinghua SEM) graphs Lec 2: Summarizing data 2 / 73
Examining numerical data Dot plots and the mean

Dot plots

Useful for visualizing one numerical variable. Darker colors


represent areas where there are more observations.

2.5 3.0 3.5 4.0

GPA

How would you describe the distribution of GPAs in this data set?
Make sure to say something about the center, shape, and spread of
the distribution.

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 3 / 73


Examining numerical data Dot plots and the mean

Dot plots & mean

2.5 3.0 3.5 4.0

GPA

The mean, also called the average (marked with a triangle in the
above plot), is one way to measure the center of a distribution of
data.
The mean GPA is 3.59.

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 4 / 73


Examining numerical data Dot plots and the mean

Mean

The sample mean, denoted as x̄, can be calculated as

x1 + x2 + · · · + xn
x̄ = ,
n
where x1 , x2 , · · · , xn represent the n observed values.
The population mean is also computed the same way but is
denoted as µ. It is often not possible to calculate µ since
population data are rarely available.
The sample mean is a sample statistic, and serves as a point
estimate of the population mean. This estimate may not be
perfect, but if the sample is good (representative of the
population), it is usually a pretty good estimate.
Bo LI (Tsinghua SEM) Lec 2: Summarizing data 5 / 73
Examining numerical data Dot plots and the mean

Stacked dot plot


Higher bars represent areas where there are more observations,
makes it a little easier to judge the center and the shape of the
distribution.



● ●
● ● ●
● ● ●
● ● ● ●
● ● ● ● ●
● ● ● ● ●
● ● ●● ● ●
●●
● ●
● ● ● ● ●●
● ● ● ● ● ●● ● ●
● ● ● ● ● ● ● ● ●● ●
● ● ● ● ● ● ●● ● ● ●●● ●● ●
● ● ● ● ● ● ● ● ●● ● ●● ●
● ● ● ● ● ● ● ● ● ● ●● ● ●● ●
● ● ● ● ● ● ● ● ●●● ● ● ● ● ●● ●
● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●●●● ● ●
● ● ● ● ●● ●● ● ● ● ● ● ●●● ● ● ● ●●● ● ● ● ● ● ● ●

2.6 2.8 3.0 3.2 3.4 3.6 3.8 4.0

GPA
Bo LI (Tsinghua SEM) Lec 2: Summarizing data 6 / 73
Examining numerical data Histograms and shape

Histograms - Extracurricular hours

Histograms provide a view of the data density. Higher bars


represent where the data are relatively more common.
Histograms are especially convenient for describing the shape of
the data distribution.
The chosen bin width can alter the story the histogram is telling.
150

100

50

0
0 10 20 30 40 50 60 70
Bo LI (Tsinghua SEM) Lec 2: Summarizing data 7 / 73
Examining numerical data Histograms and shape

Bin width
Which one(s) of these histograms are useful? Which reveal too
much about the data? Which hide too much?
200 150

150
100

100

50
50

0 0
0 20 40 60 80 100 0 10 20 30 40 50 60 70

Hours / week spent on extracurricular activities Hours / week spent on extracurricular activities

80 40

60 30

40 20

20 10

0 0
0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70

Hours / week spent on extracurricular activities Hours / week spent on extracurricular activities
Bo LI (Tsinghua SEM) Lec 2: Summarizing data 8 / 73
Examining numerical data Histograms and shape

Shape of a distribution: modality


Does the histogram have a single prominent peak (unimodal),
several prominent peaks (bimodal/multimodal), or no apparent peaks
(uniform)?

14
20
15

15

15

10
10

8
10

10

6
5

4
5

2
0

0
0 5 10 15 0 5 10 15 20 0 5 10 15 20 0 5 10 15 20

Note: In order to determine modality, step back and imagine a


smooth curve over the histogram – imagine that the bars are wooden
blocks and you drop a limp spaghetti over them, the shape the
spaghetti would take could be viewed as a smooth curve.
Bo LI (Tsinghua SEM) Lec 2: Summarizing data 9 / 73
Examining numerical data Histograms and shape

Shape of a distribution: skewness

Is the histogram right skewed, left skewed, or symmetric?

30
15

60

25
20
10

40

15
10
20
5

5
0

0
0 2 4 6 8 10 0 5 10 15 20 25 0 20 40 60 80

Note: Histograms are said to be skewed to the side of the long tail.

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 10 / 73


Examining numerical data Histograms and shape

Shape of a distribution: unusual observations

Are there any unusual observations or potential outliers?

40
30
25

30
20

20
15
10

10
5
0

0 5 10 15 20 20 40 60 80 100

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 11 / 73


Examining numerical data Histograms and shape

Extracurricular activities
How would you describe the shape of the distribution of hours per
week students spend on extracurricular activities?
150

100

50

0
0 10 20 30 40 50 60 70

Hours / week spent on extracurricular activities

Unimodal and right skewed, with a potentially unusual observation


Bo LI (Tsinghua SEM) Lec 2: Summarizing data 12 / 73
Examining numerical data Histograms and shape

Commonly observed shapes of distributions

modality

unimodal uniform
bimodal multimodal

skewness

symmetric
right skew left skew

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 13 / 73


Examining numerical data Histograms and shape

Practice

Which of these variables do you expect to be uniformly


distributed?

weights of adult females


salaries of a random sample of people from North Carolina
house prices
birthdays of classmates (day of the month)

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 14 / 73


Examining numerical data Variance and standard deviation

Variance

Variance is roughly the average squared deviation from the mean.


Pn
2 (xi − x̄)2
s = i=1
n−1

The sample mean is x̄ = 6.71,


80
and the sample size is
60

n = 217. 40

The variance of amount of 20

0
sleep students get per night 2 4 6 8 10 12

Hours of sleep / night


can be calculated
2
as: 2
+···+(7−6.71)2
s2 = (5−6.71) +(9−6.71)
217−1 = 4.11 hours2

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 15 / 73


Examining numerical data Variance and standard deviation

Variance (cont.)

Why do we use the squared deviation in the calculation of


variance?

To get rid of negatives so that observations equally distant from


the mean are weighed equally.
To weigh larger deviations more heavily.

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 16 / 73


Examining numerical data Variance and standard deviation

Standard deviation

The standard deviation is the square root of the variance, and



has the same units as the data. s = s2 .

The standard deviation of


amount of sleep students get 80

per night can be calculated as: 60

√ 40
s = 4.11 = 2.03 hours
20

We can see that all of the data 0


2 4 6 8 10 12
are within 3 standard Hours of sleep / night

deviations of the mean.

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 17 / 73


Examining numerical data Variance and standard deviation

Central Tendency v.s. Dispersion

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 18 / 73


Examining numerical data Variance and standard deviation

Central Tendency v.s. Dispersion

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 19 / 73


Examining numerical data Variance and standard deviation

Central Tendency v.s. Dispersion

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 20 / 73


Examining numerical data Variance and standard deviation

Concentration of a Distribution

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 21 / 73


Examining numerical data Variance and standard deviation

Concentration of a Distribution

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 22 / 73


Examining numerical data Variance and standard deviation

Concentration of a Distribution

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 23 / 73


Examining numerical data Box plots, quartiles, and the median

Median

The median is the value that splits the data in half when ordered
in ascending order.

0, 1, 2, 3, 4

If there are an even number of observations, then the median is


the average of the two values in the middle.

2+3
0, 1, 2, 3, 4, 5 → = 2.5
2
Since the median is the midpoint of the data, 50% of the values
are below it. Hence, it is also the 50th percentile.

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 24 / 73


Examining numerical data Box plots, quartiles, and the median

Q1, Q3, and IQR

The 25th percentile is also called the first quartile, Q1.


The 50th percentile is also called the median.
The 75th percentile is also called the third quartile, Q3.
Between Q1 and Q3 is the middle 50% of the data. The range
these data span is called the interquartile range, or the IQR.

IQR = Q3 − Q1

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 25 / 73


Examining numerical data Box plots, quartiles, and the median

Box plot
The box in a box plot represents the middle 50% of the data, and
the thick line in the box is the median.
70

60
# of study hours / week

50

40

30

20

10

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 26 / 73


Examining numerical data Box plots, quartiles, and the median

Anatomy of a box plot


70

60
# of study hours / week

50
suspected outliers

40
max whisker reach
& upper whisker
30

20 Q3 (third quartile)

● median



10 ●

● Q1 (first quartile)







0 lower whisker

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 27 / 73


Examining numerical data Box plots, quartiles, and the median

Whiskers and outliers

Whiskers of a box plot can extend up to 1.5 × IQR away from the
quartiles.
max upper whisker reach = Q3 + 1.5 × IQR
max lower whisker reach = Q1 − 1.5 × IQR

IQR : 20 − 10 = 10
max upper whisker reach = 20 + 1.5 × 10 = 35
max lower whisker reach = 10 − 1.5 × 10 = −5

A potential outlier is defined as an observation beyond the


maximum reach of the whiskers. It is an observation that appears
extreme relative to the rest of the data.
Bo LI (Tsinghua SEM) Lec 2: Summarizing data 28 / 73
Examining numerical data Box plots, quartiles, and the median

Outliers (cont.)

Why is it important to look for outliers?

Identify extreme skew in the distribution.


Identify data collection and entry errors.
Provide insight into interesting features of the data.

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 29 / 73


Examining numerical data Robust statistics

Extreme observations
How would sample statistics such as mean, median, SD, and IQR
of household income be affected if the largest value was replaced with
$10 million? What if the smallest value was replaced with $10 million?



● ●
● ●
● ● ● ● ●
●● ● ● ●
● ● ● ● ●
● ●●
● ● ● ●
●● ●●●● ● ● ●
● ● ● ● ●
● ●● ● ● ●● ●
● ● ●
● ●● ● ● ● ●
● ● ● ● ● ● ● ● ●
●● ● ● ● ● ● ● ● ● ● ●
● ●
● ● ● ● ● ●● ● ● ● ● ● ● ● ●
● ●●●●● ● ●● ● ● ● ● ● ● ● ● ● ● ●
● ● ● ●●● ●● ● ● ● ● ● ● ● ● ● ● ● ●

0e+00 2e+05 4e+05 6e+05 8e+05 1e+06

Annual Household Income


Bo LI (Tsinghua SEM) Lec 2: Summarizing data 30 / 73
Examining numerical data Robust statistics

Robust statistics

robust not robust


scenario median IQR x̄ s
original data 190K 200K 245K 226K
move largest to $10 million 190K 200K 309K 853K
move smallest to $10 million 200K 200K 316K 854K

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 31 / 73


Examining numerical data Robust statistics

Robust statistics

Median and IQR are more robust to skewness and outliers than
mean and SD. Therefore,

for skewed distributions it is often more helpful to use median and


IQR to describe the center and spread
for symmetric distributions it is often more helpful to use the mean
and SD to describe the center and spread

If you would like to estimate the typical household income for a


student, would you be more interested in the mean or median income?
Median

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 32 / 73


Examining numerical data Robust statistics

Mean vs. median

If the distribution is symmetric, center is often defined as the


mean: mean ≈ median
Symmetric

mean
median

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 33 / 73


Examining numerical data Robust statistics

Mean vs. median

If the distribution is skewed or has extreme outliers, center is often


defined as the median
Right-skewed: mean > median; Left-skewed: mean < median

Right−skewed Left−skewed

mean mean
median median

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 34 / 73


Examining numerical data Robust statistics

Practice
Which is most likely true for the distribution of percentage of time
actually spent taking notes in class versus on Facebook, Twitter, etc.?
50

40

30

20 median: 80%
10 mean: 76%
0
0 20 40 60 80 100

% of time in class spent taking notes

mean> median mean ≈ median

mean < median impossible to tell

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 35 / 73


Examining numerical data Transforming data

Extremely skewed data


When data are extremely skewed, transforming them might make
modeling easier. A common transformation is the log transformation.

The histograms on the left shows the distribution of number of


basketball games attended by students. The histogram on the right
shows the distribution of log of number of games attended.

40
150

30
100
20

50
10

0 0
0 10 20 30 40 50 60 70 0 1 2 3 4

# of basketball games attended # of basketball games attended

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 36 / 73


Examining numerical data Transforming data

Pros and cons of transformations

Skewed data are easier to model with when they are transformed
because outliers tend to become far less prominent after an
appropriate transformation.

# of games 70 50 25 ···

log(# of games) 4.25 3.91 3.22 ···

However, results of an analysis might be difficult to interpret


because the log of a measured variable is usually meaningless.

What other variables would you expect to be extremely skewed?


Salary, housing prices, etc.
Bo LI (Tsinghua SEM) Lec 2: Summarizing data 37 / 73
Examining numerical data Scatterplots for paired data

Scatterplot

Scatterplots are useful for visualizing the relationship between


two numerical variables.
Do life expectancy and total fertility appear
to be associated or independent?
They appear to be linearly and negatively
associated: as fertility increases, life expectancy
decreases.
Was the relationship the same throughout
the years, or did it change?
The relationship changed over the years.
http://www.gapminder.org/world

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 38 / 73


Examining numerical data Scatterplots for paired data

Correlation

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 39 / 73


Examining numerical data Scatterplots for paired data

Correlation

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 40 / 73


Considering categorical data Contingency tables and bar plots

Contingency tables

A table that summarizes data for two categorical variables is


called a contingency table.

The contingency table below shows the distribution of survival and


ages of passengers on the Titanic.

Survival
Died Survived Total
Adult 1438 654 2092
gender
Child 52 57 109
Total 1490 711 2201

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 41 / 73


Considering categorical data Contingency tables and bar plots

Bar plots

A bar plot is a common way to display a single categorical


variable. A bar plot where proportions instead of frequencies are
shown is called a relative frequency bar plot.
1500

60.0%

Relative frequency
1000
Frequency

40.0%

500
20.0%

0 0.0%
Died Survived Died Survived
Survival Survival

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 42 / 73


Considering categorical data Contingency tables and bar plots

Bar plots

How are bar plots different than histograms?


Bar plots are used for displaying distributions of categorical variables, histograms
are used for numerical variables. The x-axis in a histogram is a number line, hence the
order of the bars cannot be changed. In a bar plot, the categories can be listed in any
order (though some orderings make more sense than others, especially for ordinal
variables.)

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 43 / 73


Considering categorical data Row and column proportions

Choosing the appropriate proportion

Does there appear to be a relationship between age and survival


for passengers on the Titanic?

Survival
Died Survived Total
Adult 1438 654 2092
Age
Child 52 57 109
Total 1490 711 2201

To answer this question we examine the row proportions:

% Adults who survived: 654 / 2092 ≈ 0.31


% Children who survived: 57 / 109 ≈ 0.52
Bo LI (Tsinghua SEM) Lec 2: Summarizing data 44 / 73
Considering categorical data Using a bar plot with two variables

Bar plots with two variables

Stacked bar plot: Graphical display of contingency table


information, for counts.
Side-by-side bar plot: Displays the same information by placing
bars next to, instead of on top of, each other.
Standardized stacked bar plot: Graphical display of contingency
table information, for proportions.

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 45 / 73


Considering categorical data Using a bar plot with two variables

What are the differences between the three visualizations shown


below?

1500
1.00
2000

Relative frequency
0.75
1500 1000
Frequency

Frequency
Survival Survival Survival

1000 Died Died 0.50 Died


Survived Survived Survive
500
500 0.25

0 0 0.00
Adult Child Adult Child Adult Child
Age Age Age

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 46 / 73


Considering categorical data Mosaic plots

Mosaic plots

What is the difference between the two visualizations shown


below?
1.00

Adult Child
Relative frequency

0.75

Survival
0.50 Died Died
Survived

0.25

0.00 Survived
Adult Child
Age

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 47 / 73


Considering categorical data Pie charts

Pie charts
Can you tell which order encompasses the lowest percentage of
mammal species? RODENTIA
CHIROPTERA
CARNIVORA
ARTIODACTYLA
PRIMATES
SORICOMORPHA
LAGOMORPHA
DIPROTODONTIA
DIDELPHIMORPHIA
CETACEA
DASYUROMORPHIA
AFROSORICIDA
ERINACEOMORPHA
SCANDENTIA
PERISSODACTYLA
HYRACOIDEA
PERAMELEMORPHIA
CINGULATA
PILOSA
MACROSCELIDEA
TUBULIDENTATA
PHOLIDOTA
MONOTREMATA
PAUCITUBERCULATA
SIRENIA
PROBOSCIDEA
DERMOPTERA
NOTORYCTEMORPHIA
MICROBIOTHERIA

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 48 / 73


Considering categorical data Comparing numerical data across groups

Does there appear to be a relationship between class year and


number of clubs students are in?

8 ● ●

6 ● ●

● ●

0 ● ●

First−year Sophomore Junior Senior

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 49 / 73


Considering categorical data Comparing numerical data across groups

Side-by-side box plots and hollow histograms

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 50 / 73


Considering categorical data Deceptive graphs

Deceptive graphs

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 51 / 73


Considering categorical data Deceptive graphs

Deceptive graphs

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 52 / 73


Considering categorical data Deceptive graphs

Deceptive graphs

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 53 / 73


Considering categorical data Deceptive graphs

Deceptive graphs

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 54 / 73


Considering categorical data Deceptive graphs

Deceptive graphs

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 55 / 73


Case study: Gender discrimination Study description and data

Gender discrimination

In 1972, as a part of a study on gender discrimination, 48 male


bank supervisors were each given the same personnel file and
asked to judge whether the person should be promoted to a
branch manager job that was described as “routine”.
The files were identical except that half of the supervisors had
files showing the person was male while the other half had files
showing the person was female.
It was randomly determined which supervisors got “male”
applications and which got “female” applications.
Of the 48 files reviewed, 35 were promoted.
The study is testing whether females are unfairly discriminated
against.
Bo LI (Tsinghua SEM) Lec 2: Summarizing data 56 / 73
Case study: Gender discrimination Study description and data

Data

At a first glance, does there appear to be a relatonship between


promotion and gender?

Promotion
Promoted Not Promoted Total
Male 21 3 24
Gender
Female 14 10 24
Total 35 13 48

% of males promoted: 21/24 = 0.875


% of females promoted: 14/24 = 0.583

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 57 / 73


Case study: Gender discrimination Study description and data

Practice
We saw a difference of almost 30% (29.2% to be exact) between
the proportion of male and female files that are promoted. Based on
this information, which of the below is true?
If we were to repeat the experiment we will definitely see that
more female files get promoted. This was a fluke.
Promotion is dependent on gender, males are more likely to be
promoted, and hence there is gender discrimination against
women in promotion decisions. Maybe
The difference in the proportions of promoted male and female
files is due to chance, this is not evidence of gender discrimination
against women in promotion decisions. Maybe
Women are less qualified than men, and this is why fewer females
Bo LI (Tsinghua SEM) Lec 2: Summarizing data 58 / 73
Case study: Gender discrimination Competing claims

Two competing claims

“There is nothing going on.”


Promotion and gender are independent, no gender
discrimination, observed difference in proportions is simply due to
chance. → Null hypothesis
“There is something going on.”
Promotion and gender are dependent, there is gender
discrimination, observed difference in proportions is not due to
chance. → Alternative hypothesis

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 59 / 73


Case study: Gender discrimination Competing claims

A trial as a hypothesis test

Hypothesis testing is very


much like a court trial.
H0 : Defendant is innocent
HA : Defendant is guilty
We then present the evidence
- collect data.
Then we judge the evidence - “Could these data plausibly have
happened by chance if the null hypothesis were true?”
If they were very unlikely to have occurred, then the evidence raises
more than a reasonable doubt in our minds about the null
hypothesis.
Ultimately we must make a decision. How unlikely is unlikely?
Image Link
Bo LI (Tsinghua SEM) Lec 2: Summarizing data 60 / 73
Case study: Gender discrimination Competing claims

A trial as a hypothesis test (cont.)

If the evidence is not strong enough to reject the assumption of


innocence, the jury returns with a verdict of “not guilty”.
The jury does not say that the defendant is innocent, just that there
is not enough evidence to convict.
The defendant may, in fact, be innocent, but the jury has no way of
being sure.
Said statistically, we fail to reject the null hypothesis.
We never declare the null hypothesis to be true, because we simply
do not know whether it’s true or not.
Therefore we never “accept the null hypothesis”.

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 61 / 73


Case study: Gender discrimination Competing claims

A trial as a hypothesis test (cont.)

In a trial, the burden of proof is on the prosecution.


In a hypothesis test, the burden of proof is on the unusual claim.
The null hypothesis is the ordinary state of affairs (the status quo),
so it’s the alternative hypothesis that we consider unusual and for
which we must gather evidence.

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 62 / 73


Case study: Gender discrimination Competing claims

Recap: hypothesis testing framework

We start with a null hypothesis (H0 ) that represents the status


quo.
We also have an alternative hypothesis (HA ) that represents our
research question, i.e. what we’re testing for.
We conduct a hypothesis test under the assumption that the null
hypothesis is true, either via simulation (today) or theoretical
methods (later in the course).
If the test results suggest that the data do not provide convincing
evidence for the alternative hypothesis, we stick with the null
hypothesis. If they do, then we reject the null hypothesis in favor of
the alternative.
Bo LI (Tsinghua SEM) Lec 2: Summarizing data 63 / 73
Case study: Gender discrimination Testing via simulation

Simulating the experiment...

... under the assumption of independence, i.e. leave things up to


chance.
If results from the simulations based on the chance model look
like the data, then we can determine that the difference between the
proportions of promoted files between males and females was simply
due to chance (promotion and gender are independent).

If the results from the simulations based on the chance model do


not look like the data, then we can determine that the difference
between the proportions of promoted files between males and females
was not due to chance, but due to an actual effect of gender
(promotion and gender are dependent).
Bo LI (Tsinghua SEM) Lec 2: Summarizing data 64 / 73
Case study: Gender discrimination Testing via simulation

Application activity: simulating the experiment (I)

Use a deck of playing cards to simulate this experiment.


Let a face card represent not promoted and a non-face card
represent a promoted. Consider aces as face cards.
Set aside the jokers.
Take out 3 aces → there are exactly 13 face cards left in the deck
(face cards: A, K, Q, J).
Take out a number card → there are exactly 35 number (non-face)
cards left in the deck (number cards: 2-10).

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 65 / 73


Case study: Gender discrimination Testing via simulation

Application activity: simulating the experiment (II)

Shuffle the cards and deal them intro two groups of size 24,
representing males and females.
Count and record how many files in each group are promoted
(number cards).
Calculate the proportion of promoted files in each group and take
the difference (male - female), and record this value.
Repeat steps 2 - 4 many times.

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 66 / 73


Case study: Gender discrimination Testing via simulation

Step 1

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 67 / 73


Case study: Gender discrimination Testing via simulation

Step 2 - 4

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 68 / 73


Case study: Gender discrimination Checking for independence

Practice
Do the results of the simulation you just ran provide convincing
evidence of gender discrimination against women, i.e. dependence
between gender and promotion decisions?
No, the data do not provide convincing evidence for the alternative
hypothesis, therefore we can’t reject the null hypothesis of
independence between gender and promotion decisions. The
observed difference between the two proportions was due to
chance.
Yes, the data provide convincing evidence for the alternative
hypothesis of gender discrimination against women in promotion
decisions. The observed difference between the two proportions
was due to a real effect of gender.
Bo LI (Tsinghua SEM) Lec 2: Summarizing data 69 / 73
Case study: Gender discrimination Checking for independence

Simulations using software


These simulations are tedious and slow to run using the method
described earlier. In reality, we use software to generate the
simulations. The dot plot below shows the distribution of simulated
differences in promotion rates based on 100 simulations.





● ●
● ●
● ●
● ●
● ●
● ● ●
● ● ●
● ● ●
● ● ●
● ● ●
● ● ● ●
● ● ● ●
● ● ● ●
● ● ● ●
● ● ● ●
● ● ● ●
● ● ● ● ●
● ● ● ● ● ●
● ● ● ● ● ●
● ● ● ● ● ●
● ● ● ● ● ●
● ● ● ● ● ● ●
● ● ● ● ● ● ● ● ● ●

−0.4 −0.2 0 0.2 0.4

Bo LI (Tsinghua SEM) Difference in promotion


Lec 2: Summarizing datarates 70 / 73
Case study: Gender discrimination Checking for independence

Simulation R Code I

1 nsim <− 100


2 n1 <− n2 <− 24
3 gender <− c ( rep ( ’ male ’ , n1 ) , rep ( ’ female ’ , n2 ) )
4 n <− l e n g t h ( gender )
5 sim <− m a t r i x ( 0 , nrow = n , n c o l = nsim )
6 f o r ( i i n 1 : nsim ) {
7 sim [ , i ] <− sample ( gender , n , r e p l a c e = FALSE )
8 }
9 outcome <− c ( rep ( c ( ’ promoted ’ , ’ n o t promoted ’ ) , c ( 2 1 , 3 ) ) ,
rep ( c ( ’ promoted ’ , ’ n o t promoted ’ ) , c ( 1 4 , 10) ) )

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 71 / 73


Case study: Gender discrimination Checking for independence

Simulation R Code II

1 s t a t i s t i c <− f u n c t i o n ( out , group ) {


2 t 1 <− o u t == ” promoted ” & group == ” female ”
3 t 2 <− o u t == ” promoted ” & group == ” male ”
4 sum ( t 1 ) / n1 − sum ( t 2 ) / n2
5 }
6 sim d i s t <− a p p l y ( sim , 2 , s t a t i s t i c , o u t = outcome )
7 X <− c ( ) ; Y <− c ( )
8 f o r ( i i n 1 : l e n g t h ( sim d i s t ) ) {
9 x <− sim d i s t [ i ]
10 r e c <− sum ( sim d i s t == x )
11 X <− append ( X , rep ( x , r e c ) )
12 Y <− append ( Y , 1 : r e c )
13 }

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 72 / 73


Case study: Gender discrimination Checking for independence

Simulation R Code III

1 p l o t ( X , Y , x l i m =range ( sim d i s t ) +c ( −1 ,1) ∗sd ( sim d i s t ) / 4 , x l a b


= ” D i f f e r e n c e i n promotion r a t e s ” , axes = FALSE , y l i m =c
( 0 , max ( Y ) ) , c o l = ” #569BBD” , cex = 0 . 8 , pch =20)
2 a x i s ( 1 , a t = seq ( − 0 . 4 , 0 . 4 , 0 . 1 ) , l a b e l s = c ( −0.4 , ” ” , −0.2 , ” ”
,0 , ” ” ,0.2 , ” ” ,0.4) )
3 a b l i n e ( h =0)

Bo LI (Tsinghua SEM) Lec 2: Summarizing data 73 / 73

You might also like