STAT 7000 Chapter 1.2 - Summarizing Data Probability: Ash Abebe
Summarizing Data
Discrete Data
Continuous Data
The Empirical Rule
Probability
Summarizing Data
Discrete Data
Describing Discrete Data
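Sample mean:
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Median ($Q_2$) when the middle of the ordered data falls between two values $x_i$ and $x_j$:
$Q_2 = \frac{x_i + x_j}{2}$
Sample variance:
$s^2 = \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Mean absolute deviation about the median:
$\mathrm{MAD} = \frac{1}{n} \sum_{i=1}^{n} |x_i - Q_2|$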
Summarizing Data
Continuous Data
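Example: Measures of Spread
Consider the dataset: 11, 18, 6, 4, 8, 15, 22. We can easily get
the five-number summary
Min    Q1     Q2     Q3     Max
4.0    7.0    11.0   16.5   22.0
The range is $R = 22 - 4 = 18$ and $\mathrm{IQR} = Q_3 - Q_1 = 16.5 - 7.0 = 9.5$.
The sample mean is $\bar{x} = 12$, so the sample variance is
$s^2 = \frac{(11 - 12)^2 + \cdots + (22 - 12)^2}{6} = 43.67$
and the sample standard deviation is $s = 6.61$.
Finally,
$\mathrm{MAD} = \frac{|11 - 11| + \cdots + |22 - 11|}{7} = 5.29$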
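The arithmetic in this example can be checked with a short Python sketch. It uses only the standard library. The quartile convention (`method="inclusive"`, linear interpolation between order statistics) is an assumption, chosen because it reproduces the five-number summary quoted above; other conventions give slightly different quartiles. The variable names are illustrative.

```python
import statistics

data = [11, 18, 6, 4, 8, 15, 22]

# Quartiles by linear interpolation between order statistics.
# Assumption: this convention matches the summary quoted in the example.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

five_number = (min(data), q1, q2, q3, max(data))
data_range = max(data) - min(data)           # R = Max - Min
iqr = q3 - q1                                # IQR = Q3 - Q1
variance = statistics.variance(data)         # sample variance, divides by n - 1
std_dev = statistics.stdev(data)             # square root of the sample variance
mad = sum(abs(x - q2) for x in data) / len(data)  # mean absolute deviation about the median

print("five-number summary:", five_number)
print("range:", data_range, "IQR:", iqr)
print(f"s^2 = {variance:.2f}, s = {std_dev:.2f}, MAD = {mad:.2f}")
```

Note that the IQR is computed as Q3 minus Q1, which for quartiles 7.0 and 16.5 gives 9.5.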
Robustness
Which measures are sensitive to outliers?
Data                              med   mean    IQR   s
Set 1: 11 18 6 4 8 15 22          11    12      9.5   6.61
Set 2: 11 18 6 4 8 15 72          11    19.1    9.5   23.8
Set 3: 11 18 6 4 8 15 720         11    112     9.5   268
Set 4: 11 18 6 4 8 15 2200        11    323     9.5   828
Set 5: 11 18 6 4 8 15 7200        11    1037    9.5   2717
Set 6: 11 18 6 4 8 15 72000       11    10295   9.5   27210
The median and IQR do not change, while the mean and s grow with the
extreme value. To affect the median, one needs to contaminate at least
50% of the data. To affect the IQR, one needs to contaminate at least
25% of the data.
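A minimal sketch of this comparison in Python, replacing the largest observation with increasingly extreme values and recomputing each summary. The quartile convention (`method="inclusive"`) is an assumption, and the names `base`, `extremes` and `summary` are illustrative. The IQR is computed as Q3 minus Q1.

```python
import statistics

base = [11, 18, 6, 4, 8, 15]                 # the six observations shared by every set
extremes = [22, 72, 720, 2200, 7200, 72000]  # the seventh observation in Sets 1 to 6

summary = {}
for extreme in extremes:
    data = base + [extreme]
    # Assumed quartile convention: linear interpolation between order statistics.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    summary[extreme] = {
        "median": statistics.median(data),
        "mean": statistics.mean(data),
        "iqr": q3 - q1,
        "s": statistics.stdev(data),
    }

for extreme, row in summary.items():
    print(f"{extreme:>6}  med={row['median']}  mean={row['mean']:.1f}  "
          f"IQR={row['iqr']}  s={row['s']:.2f}")
```

Only one of the seven observations (about 14% of the data) is contaminated, which is below both thresholds, so the median and IQR stay fixed in every row while the mean and s do not.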
What is an outlier?
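Compute the lower inner fence (LIF) and the upper inner fence
(UIF) as
$\mathrm{LIF} = Q_1 - 1.5\,\mathrm{IQR}, \quad \mathrm{UIF} = Q_3 + 1.5\,\mathrm{IQR}$
Observations below the LIF or above the UIF are potential outliers.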
The data set 11, 18, 6, 4, 8, 15, 22 does not contain any
potential outliers.
The data set 11, 18, 6, 4, 8, 15, 72 has one potential outlier
(72).
The LIF and/or UIF are plotted if the data has outliers.
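The fence rule can be sketched in Python as follows. The helper names `inner_fences` and `potential_outliers` are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption. A different convention would shift the fences slightly.

```python
import statistics

def inner_fences(data):
    """Return (LIF, UIF) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    # Assumed quartile convention: linear interpolation between order statistics.
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def potential_outliers(data):
    """Return the observations that fall outside the inner fences."""
    lif, uif = inner_fences(data)
    return [x for x in data if x < lif or x > uif]

set_1 = [11, 18, 6, 4, 8, 15, 22]
set_2 = [11, 18, 6, 4, 8, 15, 72]

print(inner_fences(set_1), potential_outliers(set_1))
print(inner_fences(set_2), potential_outliers(set_2))
```

Under these assumptions the first data set yields no potential outliers and the second flags only 72, in agreement with the two statements above.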
Consider the Etruscan skull sizes data. We have
Min = 126, $Q_1 = 142$, $Q_2 = 146$, $Q_3 = 150$, Max = 158.
[Figure: plot of the Etruscan skull sizes; axis ticks from 125 to 155.]
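As a worked illustration, the inner fences can be computed directly from the Etruscan summary values quoted above, using only those five numbers:

```latex
\begin{aligned}
\mathrm{IQR} &= Q_3 - Q_1 = 150 - 142 = 8 \\
\mathrm{LIF} &= Q_1 - 1.5\,\mathrm{IQR} = 142 - 12 = 130 \\
\mathrm{UIF} &= Q_3 + 1.5\,\mathrm{IQR} = 150 + 12 = 162
\end{aligned}
```

Since Min = 126 lies below the LIF of 130, at least the smallest observation is a potential outlier on the low side. Max = 158 lies below the UIF of 162, so there are no potential outliers on the high side.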
Shapes
[Figure: three histograms of Frequency against x, with panels titled Symmetric, Left Skewed, and Right Skewed.]
The Empirical Rule
Empirical Rule
If the histogram of the data is approximately mound-shaped