
STAT 7000 Chapter 1.

2 -
Summarizing Data
Probability
Ash Abebe
Summarizing Data
Discrete Data
Continuous Data
The Empirical Rule
Probability
Summarizing Data
Discrete Data
Describing Discrete Data

Example : Handedness. We ask 20 students whether they are right or left handed.
L R R R R L R R R R
R R R R L R R R R R
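
A natural summary of discrete data like this is a frequency table. The following is a minimal sketch, not part of the original slides, that tallies the 20 responses above with Python's standard library; the variable names are illustrative.

```python
from collections import Counter

# Responses copied from the slide: L = left handed, R = right handed.
responses = "L R R R R L R R R R R R R R L R R R R R".split()

counts = Counter(responses)
n = len(responses)

# Relative frequency = count / sample size.
for category in sorted(counts):
    share = counts[category] / n
    print(f"{category}: count = {counts[category]:2d}, proportion = {share:.2f}")
```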
Continuous Data
Describing Continuous Data : 5 Basic Descriptive Statistics
The 5 basic descriptive statistics are
Min = the minimum sample point
    100% of the data are $\ge$ Min
$Q_1$ = the 1st quartile
    25% of the data are $\le Q_1$ & 75% of the data are $\ge Q_1$
$Q_2$ = the median
    50% of the data are $\le Q_2$ & 50% are $\ge Q_2$
$Q_3$ = the 3rd quartile
    75% of the data are $\le Q_3$ & 25% of the data are $\ge Q_3$
Max = the maximum sample point
    100% of the data are $\le$ Max
Example : 5 Basic Descriptive Statistics
Here is a sample of head sizes (maximum measurement across the
top of the skull in mm) of n = 25 Etruscans.
141 148 132 138 154 142 150 146 155 158 150 140 147
148 144 150 149 145 149 158 143 141 144 144 126
The sorted data is
126 132 138 140 141 141 142 143 144 144 144 145 146
147 148 148 149 149 150 150 150 154 155 158 158
$Q_2$ = the $((n + 1)/2)$th = 13th ordered data point, $Q_2 = 146$.
$Q_1$ = the $(n/4)$th = 6.25th, rounded up to the 7th ordered data point, $Q_1 = 142$.
$Q_3$ = the $(n/4)$th, rounded up to the 7th ordered data point from the largest, $Q_3 = 150$.
Min = 126, Max = 158.
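
The slides do not name a quartile convention, and different conventions can give slightly different quartiles. As an assumption, the sketch below uses Tukey's hinges (the median of each half of the sorted sample, with the overall median included in both halves when n is odd), which reproduces the values found for the Etruscan data. The function names are illustrative.

```python
def median(sorted_xs):
    """Midpoint of an already sorted list."""
    n = len(sorted_xs)
    mid = n // 2
    if n % 2 == 1:
        return float(sorted_xs[mid])
    return (sorted_xs[mid - 1] + sorted_xs[mid]) / 2


def five_number_summary(xs):
    """Return (Min, Q1, Q2, Q3, Max) using Tukey's hinges (an assumed convention)."""
    s = sorted(xs)
    n = len(s)
    half = (n + 1) // 2  # size of each half; includes the median when n is odd
    return (float(s[0]), median(s[:half]), median(s), median(s[n - half:]), float(s[-1]))


etruscan = [141, 148, 132, 138, 154, 142, 150, 146, 155, 158, 150, 140, 147,
            148, 144, 150, 149, 145, 149, 158, 143, 141, 144, 144, 126]

print(five_number_summary(etruscan))
```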
Histogram
[Figure: Histogram of etr. x-axis: etr, 125 to 160; y-axis: Frequency, 0 to 8.]
Stem-and-leaf plot
12 | 6
13 | 2
13 | 8
14 | 01123444
14 | 5678899
15 | 0004
15 | 588
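
The display above splits each stem into two rows, one for leaves 0 to 4 and one for leaves 5 to 9. A minimal sketch of that construction follows, assuming integer data with the last digit used as the leaf; `stem_and_leaf` is an illustrative helper, not something taken from the slides.

```python
from collections import defaultdict


def stem_and_leaf(xs):
    """Return the rows of a stem-and-leaf plot with each stem split at leaf 5."""
    rows = defaultdict(list)
    for x in sorted(xs):
        stem, leaf = divmod(x, 10)
        # Key on (stem, upper-half flag) so leaves 0-4 and 5-9 get separate rows.
        rows[(stem, leaf >= 5)].append(str(leaf))
    return [f"{stem} | {''.join(leaves)}" for (stem, _), leaves in sorted(rows.items())]


etruscan = [141, 148, 132, 138, 154, 142, 150, 146, 155, 158, 150, 140, 147,
            148, 144, 150, 149, 145, 149, 158, 143, 141, 144, 144, 126]

for row in stem_and_leaf(etruscan):
    print(row)
```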
Measures of Center
Suppose our sample of size n is represented by $x_1, x_2, \ldots, x_n$.
There are several measures of center :
The sample median
    $Q_2$ = the mid-point of the sorted sample
The sample mean is the arithmetic average
    $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$
The Hodges-Lehmann estimator is
    $\mathrm{HL} = \operatorname{median}_{i \le j} \left\{ \frac{x_i + x_j}{2} \right\}$
The sample mode is the most frequent value.


Example : Measures of Center
Consider the following contrived dataset : 11, 18, 6, 4, 8, 15, 22.
We can easily get the 5 number summary
Min    Q1     Q2     Q3     Max
4.0    7.0    11.0   16.5   22.0
and $\bar{x} = 12$. To get HL, get pairwise averages
       4      6      8      11     15     18     22
4      4      5      6      7.5    9.5    11     13
6             6      7      8.5    10.5   12     14
8                    8      9.5    11.5   13     15
11                          11     13     14.5   16.5
15                                 15     16.5   18.5
18                                        18     20
22                                               22
and get the median of the 28 pairwise averages, which is the mean of the
14th and 15th sorted values:
(11.5 + 12)/2 = 11.75.
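
The three measures of center for this dataset can be checked with a short sketch. It assumes, as in the table of pairwise averages, that each point is also paired with itself (pairs with i ≤ j); `hodges_lehmann` is an illustrative name.

```python
from itertools import combinations_with_replacement
from statistics import mean, median


def hodges_lehmann(xs):
    """Median of all pairwise averages (x_i + x_j) / 2 with i <= j."""
    pairwise = [(a + b) / 2 for a, b in combinations_with_replacement(xs, 2)]
    return median(pairwise)


data = [11, 18, 6, 4, 8, 15, 22]

print("mean  :", mean(data))
print("median:", median(data))
print("HL    :", hodges_lehmann(data))
```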
Measures of Spread
There are several measures of spread :

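The sample range
    $R = \text{Max} - \text{Min}$
The sample interquartile range (IQR)
    $\text{IQR} = Q_3 - Q_1$
The sample standard deviation
    $s = \sqrt{\frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
The mean absolute deviation (MAD) from the median
    $\text{MAD} = \frac{1}{n} \sum_{i=1}^{n} |x_i - Q_2|$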
Example : Measures of Spread
Consider the dataset : 11, 18, 6, 4, 8, 15, 22. We can easily get
the 5 number summary
Min    Q1     Q2     Q3     Max
4.0    7.0    11.0   16.5   22.0
$R = 22 - 4 = 18$ and $\text{IQR} = Q_3 - Q_1 = 16.5 - 7.0 = 9.5$. The sample
standard deviation is
$s^2 = \frac{(11 - 12)^2 + \cdots + (22 - 12)^2}{6} = 43.67 \Rightarrow s = 6.61$
Finally,
$\text{MAD} = \frac{|11 - 11| + \cdots + |22 - 11|}{7} = 5.29$
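
The four measures of spread can be recomputed with a short sketch. The quartile convention is an assumption (Tukey's hinges), chosen because it gives Q1 = 7.0 and Q3 = 16.5 as in the 5 number summary, so that Q3 − Q1 = 9.5; `tukey_hinges` is an illustrative helper.

```python
from statistics import median, stdev


def tukey_hinges(xs):
    """Return (Q1, Q3) as medians of the lower and upper halves (assumed convention)."""
    s = sorted(xs)
    half = (len(s) + 1) // 2  # each half includes the median when n is odd
    return median(s[:half]), median(s[len(s) - half:])


data = [11, 18, 6, 4, 8, 15, 22]

q1, q3 = tukey_hinges(data)
data_range = max(data) - min(data)
iqr = q3 - q1
s = stdev(data)  # sample standard deviation, divides by n - 1
q2 = median(data)
mad = sum(abs(x - q2) for x in data) / len(data)  # mean absolute deviation from the median

print(f"R = {data_range}, IQR = {iqr}, s = {s:.2f}, MAD = {mad:.2f}")
```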
Robustness
Which measures are sensitive to outliers?
Data                               med   mean    IQR   s
Set 1: 11 18 6 4 8 15 22           11    12      9.5   6.61
Set 2: 11 18 6 4 8 15 72           11    19.1    9.5   23.8
Set 3: 11 18 6 4 8 15 720          11    112     9.5   268
Set 4: 11 18 6 4 8 15 2200         11    323     9.5   828
Set 5: 11 18 6 4 8 15 7200         11    1037    9.5   2717
Set 6: 11 18 6 4 8 15 72000        11    10295   9.5   27210
The median and the IQR stay fixed as the largest value grows, while the
mean and s grow with it.
To affect the median, one needs to contaminate at least 50% of
the data. To affect the IQR, one needs to contaminate at least
25% of the data.
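
The comparison across the six data sets can be reproduced with a short sketch. It assumes quartiles computed as Tukey's hinges (Q1 = 7.0, Q3 = 16.5 for these data); the helper `iqr` and the variable names are illustrative.

```python
from statistics import mean, median, stdev


def iqr(xs):
    """Q3 - Q1 with quartiles taken as Tukey's hinges (an assumed convention)."""
    s = sorted(xs)
    half = (len(s) + 1) // 2
    return median(s[len(s) - half:]) - median(s[:half])


base = [11, 18, 6, 4, 8, 15]
rows = []
for last in (22, 72, 720, 2200, 7200, 72000):
    data = base + [last]  # only the largest value changes between sets
    rows.append((last, median(data), mean(data), iqr(data), stdev(data)))

for last, med, avg, spread, s in rows:
    print(f"last = {last:>5}: med = {med}, mean = {avg:.1f}, IQR = {spread}, s = {s:.2f}")
```

Only the mean and s columns move as the last value increases, which is the sense in which the median and IQR are robust.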
What is an outlier?

An outlier is a data point that is numerically distant from the rest of the data.
A rule of thumb to detect potential outliers:
    Compute the lower inner fence (LIF) and the upper inner fence (UIF) as
    $\text{LIF} = Q_1 - 1.5\,\text{IQR}$, $\text{UIF} = Q_3 + 1.5\,\text{IQR}$
    Any point that is not contained in the two fences is flagged as a potential outlier.
    The data set 11, 18, 6, 4, 8, 15, 22 does not contain any potential outliers.
    The data set 11, 18, 6, 4, 8, 15, 72 has one potential outlier (72).
A boxplot contains information on the 5 number summary and outliers.
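
The fence rule can be sketched in a few lines. As an assumption, quartiles are computed as Tukey's hinges (Q1 = 7.0 and Q3 = 16.5 for both data sets below); `inner_fences` and `potential_outliers` are illustrative names.

```python
from statistics import median


def inner_fences(xs):
    """Return (LIF, UIF) = (Q1 - 1.5 IQR, Q3 + 1.5 IQR)."""
    s = sorted(xs)
    half = (len(s) + 1) // 2
    q1, q3 = median(s[:half]), median(s[len(s) - half:])  # Tukey's hinges (assumed)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def potential_outliers(xs):
    """Points lying outside the interval [LIF, UIF]."""
    lif, uif = inner_fences(xs)
    return [x for x in xs if x < lif or x > uif]


print(inner_fences([11, 18, 6, 4, 8, 15, 22]))
print(potential_outliers([11, 18, 6, 4, 8, 15, 22]))
print(potential_outliers([11, 18, 6, 4, 8, 15, 72]))
```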
Boxplot

The boxplot gives a graphical view of Min, $Q_1$, $Q_2$, $Q_3$, Max
when the data does not contain any potential outliers.
The LIF and/or UIF are plotted if the data has outliers.
Consider the Etruscan skull sizes data. We have
Min = 126, $Q_1 = 142$, $Q_2 = 146$, $Q_3 = 150$, Max = 158.
[Figure: boxplot of the Etruscan skull sizes; axis ticks from 125 to 155.]
Shapes
[Figure: three histograms of Frequency against x, titled Symmetric, Left Skewed, and Right Skewed.]
The Empirical Rule
Empirical Rule
If the histogram of the data is approximately mound-shaped, then
about 68% of the data fall in between $\bar{x} - s$ and $\bar{x} + s$
about 95% of the data fall in between $\bar{x} - 2s$ and $\bar{x} + 2s$
about 99.7% of the data fall in between $\bar{x} - 3s$ and $\bar{x} + 3s$

Example: Scores on an achievement test taken by all high school
seniors in Alabama are known to have an approximately bell-shaped
distribution with mean $\bar{x} = 64$ and standard deviation $s = 10$.
about 68% of the students' scores fall in between 54 and 74
about 95% of the students' scores fall in between 44 and 84
about 99.7% of the students' scores fall in between 34 and 94
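
The intervals in the example are simply the mean plus or minus 1, 2, and 3 standard deviations. The sketch below recomputes them and, under the added assumption that "mound-shaped" means exactly normal, evaluates the corresponding coverage with the error function. The function names are illustrative.

```python
import math


def empirical_intervals(xbar, s):
    """Return {k: (xbar - k*s, xbar + k*s)} for k = 1, 2, 3."""
    return {k: (xbar - k * s, xbar + k * s) for k in (1, 2, 3)}


def normal_coverage(k):
    """P(|Z| < k) for a standard normal Z (assumes an exactly normal shape)."""
    return math.erf(k / math.sqrt(2))


intervals = empirical_intervals(64, 10)
for k, (low, high) in intervals.items():
    print(f"within {k}s: {low} to {high}, normal coverage = {normal_coverage(k):.4f}")
```

For an exactly normal distribution the three coverages are roughly 68.3%, 95.4%, and 99.7%.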


Probability
Some Definitions
An experiment results in an outcome.
    1. Flip a fair coin.
    2. Roll a pair of fair six-sided dice.
The collection of all outcomes of an experiment is the sample space (S).
    1. S = {H, T}
    2. S = {(1, 1), (1, 2), . . . , (6, 6)}
An event is a subset of the sample space.
    1. The coin comes up tails : A = {T}.
    2. The sum on the upfaces is 7 or 11 :
       B = {(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1), (5, 6), (6, 5)}
We are interested in the probabilities (sizes) of events.

Probability
A probability is a function $P(\cdot)$ that assigns numbers to events in
such a way that
1. For any event A, $P(A) \ge 0$.
2. $P(S) = 1$
3. If two events cannot occur at the same time the probability
that one or the other occurs is the sum of the probabilities of
the individual events, i.e. if A and B are two events such that
$A \cap B = \emptyset$, then $P(A \cup B) = P(A) + P(B)$.
If S contains N elements (#(S) = N) that are equally likely, then
$P(A) = \frac{\#(A)}{N}$
Example : Probability
Roll a pair of fair six-sided dice. Then
S = {(1, 1), (1, 2), . . . , (6, 6)}
and #(S) = 36. Let
A = {sum is 7}, B = {sum is 11}
Now #(A) = 6 and #(B) = 2. Thus P(A) = 6/36 = 1/6 and
P(B) = 2/36 = 1/18.
Note that $A \cap B = \emptyset$. Thus $P(A \cup B) = 1/6 + 1/18 = 2/9$.
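
Because the 36 outcomes are equally likely, these probabilities can be checked by direct enumeration. The following is a minimal sketch using exact fractions; `prob` is an illustrative helper implementing P(A) = #(A)/N.

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered pairs of faces from two six-sided dice.
sample_space = set(product(range(1, 7), repeat=2))


def prob(event):
    """P(A) = #(A) / #(S) for equally likely outcomes."""
    return Fraction(len(event), len(sample_space))


A = {outcome for outcome in sample_space if sum(outcome) == 7}
B = {outcome for outcome in sample_space if sum(outcome) == 11}

print(prob(A), prob(B), prob(A | B))
# A and B are disjoint, so additivity applies.
print(A & B == set(), prob(A | B) == prob(A) + prob(B))
```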
