Annotated 3 Ch3 Data Description F2014


Stat 305, Fall 2014

Name

Chapter 3: Data Description


Theoretical vs. Empirical Distribution
Theoretical Distribution: The expected pattern to be followed by a variable.
Example: Roll 1 six sided die 60 times and record the number on the side that lands
face up.
Variable of interest: # on face-up side
Expected Pattern: ten 1s, ten 2s, . . . , ten 6s

Empirical Distribution: The pattern followed by the observed data (the actual pattern)
Example: Roll 1 six sided die 60 times and record the number on the side that lands
face up.
Variable of interest: # on face-up side
Observed data pattern: six 1s, eight 2s, twelve 3s, ten 4s, fourteen 5s, ten 6s.
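
The comparison between the two patterns can be sketched in a few lines of Python. This is only an illustration: the observed counts are copied from the example above, and the names `observed`, `expected`, and `compare` are made up for this sketch.

```python
# Observed counts copied from the example above (60 rolls of one die).
observed = {1: 6, 2: 8, 3: 12, 4: 10, 5: 14, 6: 10}

n_rolls = sum(observed.values())                      # total number of rolls
expected = {face: n_rolls / 6 for face in observed}   # fair die: equal share per face

def compare(observed, expected):
    """Return rows of (face, expected count, observed count, difference)."""
    return [(f, expected[f], observed[f], observed[f] - expected[f])
            for f in sorted(observed)]

for face, exp, obs, diff in compare(observed, expected):
    print(f"face {face}: expected {exp:.0f}, observed {obs}, difference {diff:+.0f}")
```

The differences sum to zero because both patterns describe the same 60 rolls.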

Descriptive Statistics for Quantitative Data


Recall: Quantitative data are numerical characteristics associated with items in a
sample.
Goal: Describe important distributional characteristics
We will focus on quantitative data in this course.

Dot Diagram
1. Order data (smallest to largest)
2. Label x-axis with range of data values and y-axis with count of values at each distinct
point.
3. Add one dot to the plot for each data value, stacking duplicate values vertically.

Example 1
The government requires manufacturers to monitor the amount of radiation emitted through
the closed door of a microwave. The following are radiation amounts emitted by 24 microwaves measured by one manufacturer.
.01  .08  .05  .11  .02  .12  .08  .03  .10  .07
.10  .05  .10  .20  .01  .09  .05  .09  .02  .10
.18  .20  .30  .15
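
The three steps for a dot diagram can be tried out with a small text-only sketch. It assumes the 24 readings are stored in hundredths (so .01 becomes 1) and uses the letter "o" as the dot; the function name `dot_diagram` is hypothetical.

```python
from collections import Counter

# Radiation readings from Example 1, in hundredths (.01 -> 1).
readings = [1, 8, 5, 11, 2, 12, 8, 3, 10, 7,
            10, 5, 10, 20, 1, 9, 5, 9, 2, 10,
            18, 20, 30, 15]

def dot_diagram(values):
    """Return the rows of a text dot diagram, top row first."""
    counts = Counter(values)
    lo, hi = min(values), max(values)
    height = max(counts.values())
    rows = []
    for level in range(height, 0, -1):
        # One column per possible value; a dot where the stack reaches this level.
        rows.append("".join("o" if counts[v] >= level else " "
                            for v in range(lo, hi + 1)))
    return rows

for row in dot_diagram(readings):
    print(row)
print("x-axis: .01 (left) to .30 (right) in steps of .01")
```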

Stem-and-Leaf Plots
1. Order data values
2. Select one or more leading digits for the stem values; last digit becomes the leaf
3. List possible stem values in a vertical column
4. Draw a vertical line to the right of the stem
5. Add leaf values, in order, on the other side of the vertical line

Example 1 (part 2)
It is easiest to order your data before any analysis.
.01  .01  .02  .02  .03  .05  .05  .05  .07  .08
.08  .09  .09  .10  .10  .10  .10  .11  .12  .15
.18  .20  .20  .30

Key: The decimal point is 1 digit(s) to the left of the |
-OR-
Key: 0|1 = .01

0 | 1122355578899
1 | 00001258
2 | 00
3 | 0

Split stem-and-leaf plots


Have two leaf positions, one for 0-4 leaves and one for 5-9 leaves
Helps give a better indication of the distribution of the data when too many observations fall on a single stem
Key: The decimal point is 1 digit(s) to the left of the |
-OR-
Key: 0|1 = .01

0 | 11223
0 | 55578899
1 | 000012
1 | 58
2 | 00
2 |
3 | 0
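
Both the ordinary and the split stem-and-leaf plots for Example 1 can be generated with a short Python sketch. It assumes the data are stored in hundredths so that the last digit is the leaf; the function name `stem_and_leaf` and its `split` flag are hypothetical.

```python
from collections import defaultdict

# Example 1 readings in hundredths (.01 -> 1), already ordered.
data = [1, 1, 2, 2, 3, 5, 5, 5, 7, 8, 8, 9, 9, 10, 10, 10, 10,
        11, 12, 15, 18, 20, 20, 30]

def stem_and_leaf(values, split=False):
    """Return display lines; the last digit is the leaf, the rest is the stem."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        half = 1 if (split and leaf >= 5) else 0   # 0-4 row or 5-9 row
        groups[(stem, half)].append(str(leaf))
    stems = range(min(values) // 10, max(values) // 10 + 1)
    halves = (0, 1) if split else (0,)
    lines = [f"{s} | {''.join(groups[(s, h)])}" for s in stems for h in halves]
    # Drop a trailing empty row (the unused 5-9 row of the last stem).
    while lines and lines[-1].endswith("| "):
        lines.pop()
    return lines

print("\n".join(stem_and_leaf(data)))
print()
print("\n".join(stem_and_leaf(data, split=True)))
```

Empty rows in the middle of the plot are kept on purpose, since a gap is part of the shape of the distribution.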

Back-to-Back Stem-and-Leaf plots


Used to compare two data sets
One data set as before (on right side of stem, left to right)
Second data set on left side of stem going right to left

More stem-and-leaf plots (Examples)


(Note: These data sets have been ordered for you.)
1. 28.2 29.4 30.1 30.9 31.4 32.0 32.2 32.5 32.5 32.6 33.3 34.2 34.4 34.9 36.6

2. 58.65 58.97 59.72 60.15 62.87

3. Data set 1: 3.5 4.2 4.6 4.6 5.0 5.1 6.4 6.8
Data set 2: 5.2 5.5 5.7 5.7 5.8 5.9 6.2 7.2

Frequency Tables
Use intervals of equal length
Number of intervals varies, a matter of judgment
Every endpoint of the intervals is in exactly one interval (i.e., no overlapping)

Example 1 (part 3)
Interval     Frequency
0.01-0.05    8
0.06-0.10    9
0.11-0.15    3
0.16-0.20    3
0.21-0.25    0
0.26-0.30    1

Relative Frequency Tables


Start with a frequency table
Divide every cell by the total number of observations
Cumulative Relative Frequency Table: Add up relative frequencies as you go.

Example 1:

Interval     Frequency   Relative Frequency   Cumulative Relative Frequency
0.01-0.05    8           .333                 .333
0.06-0.10    9           .375                 .708
0.11-0.15    3           .125                 .833
0.16-0.20    3           .125                 .958
0.21-0.25    0           .000                 .958
0.26-0.30    1           .042                 1.00
Sum          24          1.00
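
The frequency, relative frequency, and cumulative relative frequency columns for Example 1 can be reproduced with a brief Python sketch. It assumes the readings are stored in hundredths so interval membership is exact integer arithmetic; `frequency_table` and its parameters are hypothetical names.

```python
# Example 1 readings in hundredths (.01 -> 1).
data = [1, 1, 2, 2, 3, 5, 5, 5, 7, 8, 8, 9, 9, 10, 10, 10, 10,
        11, 12, 15, 18, 20, 20, 30]

def frequency_table(values, width=5, n_bins=6):
    """Return rows of (interval, frequency, relative freq, cumulative relative freq)."""
    counts = [0] * n_bins
    for v in values:
        counts[(v - 1) // width] += 1      # 1-5 -> bin 0, 6-10 -> bin 1, ...
    n = len(values)
    rows, running = [], 0
    for k, c in enumerate(counts):
        running += c                        # add up frequencies as we go
        lo, hi = k * width + 1, (k + 1) * width
        rows.append((f"{lo / 100:.2f}-{hi / 100:.2f}", c, c / n, running / n))
    return rows

for label, freq, rel, cum in frequency_table(data):
    print(f"{label}  {freq:2d}  {rel:.3f}  {cum:.3f}")
```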

Another Frequency Table


(Note: These data sets have been ordered for you.)
1. 28.2 29.4 30.1 30.9 31.4 32.0 32.2 32.5 32.5 32.6 33.3 34.2 34.4 34.9 36.6

Histogram
A plot of frequency or relative frequency
How to make a histogram:
Use intervals of equal length
Show entire vertical axis beginning at zero and avoid breaking either axis
Keep a uniform scale across a given axis
Center bars of appropriate heights at the midpoints of the intervals

Rule of Thumb: # of intervals $\approx \sqrt{\text{\# of observations}}$

Example 1

Common Distributional Shapes

Bell-shaped (Symmetric Unimodal)

Uniform

Right-Skewed

Left-Skewed

Bimodal

Truncated (J-shaped)

Quantiles
Definition: for any number $0 \le p \le 1$, the p quantile is the number, denoted as Q(p),
such that p is the proportion of the distribution that lies to the left of (below) Q(p),
and $1 - p$ is the proportion of the distribution that lies to the right of Q(p).
For an ordered data set $x_1 \le x_2 \le \dots \le x_n$:
For $i = 1, 2, \dots, n$, the $p = \frac{i - .5}{n}$ quantile of the data set is the ith smallest data point, $x_i$. That is:

$$Q(p) = Q\left(\frac{i - .5}{n}\right) = x_i$$

Example 2
Annual incomes (in thousands of dollars) for 8 families (in a common geographical location)
are given below:
23, 31, 43, 47, 51, 58, 67, 103
Which quantiles are exactly observations from this data set?
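
One way to explore this question is to list the level $p = (i - .5)/n$ attached to each ordered observation. The sketch below is only an illustration of that definition; the function name `observed_quantile_levels` is hypothetical.

```python
incomes = [23, 31, 43, 47, 51, 58, 67, 103]   # Example 2 data, already ordered

def observed_quantile_levels(values):
    """Pair each ordered observation x_i with its level p = (i - .5)/n."""
    n = len(values)
    return [((i - 0.5) / n, x) for i, x in enumerate(sorted(values), start=1)]

for p, x in observed_quantile_levels(incomes):
    print(f"Q({p:.4f}) = {x}")
```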

Quantiles Not Observed in the Data Set


General procedure for finding the p quantile of an empirical distribution
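1. Order data values $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$
2. Set $i = np + 0.5$
3. If $i \in \{1, 2, \dots, n\}$ then $Q(p) = x_{(i)}$;
otherwise,
$$Q(p) = (\lceil i \rceil - i)\, x_{(\lfloor i \rfloor)} + (i - \lfloor i \rfloor)\, x_{(\lceil i \rceil)}$$
Notation:
$\lceil \cdot \rceil$ Ceiling = next largest integer (round up), e.g. $\lceil 4.3 \rceil = 5$
$\lfloor \cdot \rfloor$ Floor = previous smallest integer (round down), e.g. $\lfloor 4.3 \rfloor = 4$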
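
The general procedure can be written as a small Python function. This is a sketch under two assumptions that are not in the notes: i is rounded to 10 decimal places to guard against floating-point noise, and values of p that would put i outside 1 to n raise an error. The name `quantile` is hypothetical.

```python
import math

def quantile(values, p):
    """Empirical p quantile using i = n*p + 0.5 and linear interpolation."""
    x = sorted(values)                     # step 1: order the data
    n = len(x)
    i = round(n * p + 0.5, 10)             # step 2 (rounding is an added safeguard)
    if not 1 <= i <= n:
        # Assumption: the notes do not cover p outside [.5/n, 1 - .5/n].
        raise ValueError("p is outside the range covered by this procedure")
    lo = math.floor(i)
    frac = i - lo
    if frac == 0:                          # i is an integer: Q(p) is an observation
        return x[lo - 1]
    # For non-integer i, ceil(i) - i equals 1 - frac, matching the formula in step 3.
    return (1 - frac) * x[lo - 1] + frac * x[lo]

incomes = [23, 31, 43, 47, 51, 58, 67, 103]
print(quantile(incomes, 0.5))              # interpolates between the 4th and 5th values
```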

Example 2 (part 2)
23, 31, 43, 47, 51, 58, 67, 103
1. What data value corresponds to the .25 quantile?

2. The .90 quantile?

3. The .75 quantile?

Quartile
Special quantiles:
Q(.25): Q1, 1st quartile, lower quartile
Q(.5): Q2, 2nd quartile, median
Q(.75): Q3, 3rd quartile, upper quartile
Special values associated with quartiles:
Inter-quartile range (IQR): Q3 - Q1
Upper fence: Q3 + 1.5 IQR
Lower fence: Q1 - 1.5 IQR

Boxplot
Steps for making a boxplot: (with ordered data)
0. Draw your scale.
1. Draw vertical lines at Q1, Q2, Q3 and connect with a box.
2. Compute IQR = Q3 - Q1, Upper and Lower fences
Upper Fence = Q3 + 1.5 IQR
Lower Fence = Q1 - 1.5 IQR
3. Draw asterisks (or dots) for any data values less than the lower fence and any values
greater than the upper fence; these we will define as outliers.
4. Draw a line from the sides of the box to the smallest value greater than the LF and
the largest value smaller than the UF.
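
The numbers needed for a boxplot can be collected with a short Python sketch. It assumes the quantile procedure from these notes (i = np + 0.5 with interpolation) and treats a value exactly on a fence as not an outlier; `quantile` and `boxplot_summary` are hypothetical helper names.

```python
import math

def quantile(values, p):
    """Empirical p quantile using i = n*p + 0.5 and linear interpolation."""
    x = sorted(values)
    i = round(len(x) * p + 0.5, 10)        # rounding guards against float noise
    lo = math.floor(i)
    frac = i - lo
    return x[lo - 1] if frac == 0 else (1 - frac) * x[lo - 1] + frac * x[lo]

def boxplot_summary(values):
    """Return quartiles, IQR, fences, whisker ends, and outliers."""
    x = sorted(values)
    q1, q2, q3 = (quantile(x, p) for p in (0.25, 0.5, 0.75))
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in x if lower_fence <= v <= upper_fence]
    return {
        "Q1": q1, "Q2": q2, "Q3": q3, "IQR": iqr,
        "lower_fence": lower_fence, "upper_fence": upper_fence,
        "whiskers": (inside[0], inside[-1]),   # where the lines from the box end
        "outliers": [v for v in x if v < lower_fence or v > upper_fence],
    }

incomes = [23, 31, 43, 47, 51, 58, 67, 103]
for name, value in boxplot_summary(incomes).items():
    print(f"{name}: {value}")
```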

Example 2 (part 3)
Make a boxplot
23, 31, 43, 47, 51, 58, 67, 103
1. First we need quartiles.
(a) (From above) Q1 =
(b) (From above) Q3 =
(c) Q2 =

2. Next calculate IQR and fences.


(a) IQR =

(b) Upper Fence =

(c) Lower Fence =

3. Finally draw the boxplot.


Example 3
Ten batteries were tested to determine how long the batteries would last (hrs) under normal
conditions. Below are the ten values that were obtained:
100, 120, 80, 90, 95, 115, 120, 110, 105, 95
1. Calculate Q(.35)

2. Calculate Q(.42)

3. Calculate Q(.90)

4. Draw a boxplot based on the 10 values above.


Side-by-Side Boxplots
Side-by-side boxplots can be used to compare two data sets. Make sure they are set on the
same scale to make a comparison possible.

Quantile-Quantile Plots (Q-Q Plots)


Used to make comparisons of the shapes of distributions for two data sets.
If two data sets are generated by distributions of the same shape, then the quantiles
of one data set should be linearly related to the quantiles of the second data set.


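Plot of the ordered pairs $\left(Q_1\left(\frac{i - .5}{n}\right), Q_2\left(\frac{i - .5}{n}\right)\right)$, for $i = 1, \dots, n$
Points in a straight line indicate the two data sets come from distributions of the same shape.
If $n_1 \ne n_2$, then use the smaller of the two as n.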

Example 4a ($n_1 = n_2$)
Data Set 1: 1, 2, 3, 4, 5
Data Set 2: 6, 7, 8, 9, 10


Example 4b ($n_1 \ne n_2$)
Data Set 1: 1, 2, 3, 4, 5
Data Set 2: 6, 7, 8, 9, 10, 11

Example 4c
Data set 1: 1, 5, 7, 8, 9, 10
Data set 2: -10, -9, -8, -7, -5, -1
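
The points of a Q-Q plot can be computed with a short Python sketch, shown here on the Example 4b data where the sample sizes differ. It assumes the quantile procedure from these notes; `quantile` and `qq_pairs` are hypothetical helper names, and plotting the pairs is left to the reader.

```python
import math

def quantile(values, p):
    """Empirical p quantile using i = n*p + 0.5 and linear interpolation."""
    x = sorted(values)
    i = round(len(x) * p + 0.5, 10)        # rounding guards against float noise
    lo = math.floor(i)
    frac = i - lo
    return x[lo - 1] if frac == 0 else (1 - frac) * x[lo - 1] + frac * x[lo]

def qq_pairs(data1, data2):
    """Return the ordered pairs (Q1(p_i), Q2(p_i)) with p_i = (i - .5)/n."""
    n = min(len(data1), len(data2))        # use the smaller sample size
    levels = [(i - 0.5) / n for i in range(1, n + 1)]
    return [(quantile(data1, p), quantile(data2, p)) for p in levels]

set1 = [1, 2, 3, 4, 5]
set2 = [6, 7, 8, 9, 10, 11]
for q1, q2 in qq_pairs(set1, set2):
    print(f"({q1}, {q2:.1f})")
```

For the smaller data set the quantiles are simply its ordered values; only the larger data set needs interpolation.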

Normal Probability Plot


A type of Q-Q plot that allows us to determine if the distribution of our data is
bell-shaped (the shape of the theoretical normal distribution)
Rather than plot 2 data sets against one another, plot 1 data set against quantiles
from a known normal distribution
A straight line indicates our data is normal/bell-shaped
An S-shape indicates our data is skewed.
We will talk more about Normal Probability Plots in Chapter 5.


Standard Numerical Measures


For univariate quantitative data:
Measures of Location/Center give an indication of where most of the data is located.
Measures of Variability/Spread give an indication of how spread out the data is.

Median (Location)
Same as Q(0.5); i.e., gives the center value of the data set.
Not affected by a few extreme or outlying observations.
Example:
2, 3, 5, 8, 12     Q(.5) = 5
2, 3, 5, 8, 100    Q(.5) = 5

Mean (Location)
For $x_1, x_2, \dots, x_n$ the mean is given by

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Also called first moment or center of mass.
Strongly affected by a few extreme or outlying observations.
Example:
2, 3, 5, 8, 12     $\bar{x} = 6$
2, 3, 5, 8, 100    $\bar{x} = 23.6$
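
The contrast between the median and the mean on these two data sets can be checked with Python's standard `statistics` module. This is only a quick check; the variable names are made up for the sketch.

```python
from statistics import mean, median

typical = [2, 3, 5, 8, 12]
with_outlier = [2, 3, 5, 8, 100]     # same data with one extreme value

for data in (typical, with_outlier):
    print(f"{data}: median = {median(data)}, mean = {mean(data)}")
```

Replacing 12 with 100 leaves the median unchanged but pulls the mean well above most of the data.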

Mode (Location)
The most frequently occurring data point
Can also be used for qualitative data
Can have multiple modes
Not affected by outliers (so to speak)
Example:
2, 3, 5, 5, 5, 8, 8, 12         mode = 5
2, 3, 5, 5, 5, 8, 8, 8, 100     modes = 5, 8
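
Data sets with more than one mode can be handled with `statistics.multimode` (available in Python 3.8 and later), which returns every most-frequent value. The variable names below are made up for this sketch.

```python
from statistics import multimode

one_mode = [2, 3, 5, 5, 5, 8, 8, 12]
two_modes = [2, 3, 5, 5, 5, 8, 8, 8, 100]

print(multimode(one_mode))    # list holding the single most frequent value
print(multimode(two_modes))   # list holding both tied values
```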


IQR/ Range (Variability)


IQR= Q(.75) - Q(.25)
Measures the spread of the middle half of the data.
Not sensitive to extreme values.
Range = Largest value - Smallest value
If data is ordered $x_1 \le x_2 \le \dots \le x_n$, then $R = x_n - x_1$.
Highly sensitive to extreme values.

Variance/Standard Deviation (Variability)


Sample Variance given by the formula:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

-or-

$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2}{n-1}$$

Gives a measure of how much the data is spread from the sample mean. Larger values
of $s^2$ indicate more spread.
Average (squared) distance from the mean.

Sample standard deviation, $s = \sqrt{s^2}$.


Sensitive to extreme outliers.

Example 5
Calculate the mean and standard deviation for the data below.
4, 8, 2, 14, 7, 12


Recalculate the standard deviation using summary statistics:
$\sum_{i=1}^{n} x_i = 47$ and $\sum_{i=1}^{n} x_i^2 = 473$
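
Both routes to the sample variance can be compared on the Example 5 data with a short Python sketch: one from the definition and one from the summary statistics alone. The function names `variance_definition` and `variance_shortcut` are hypothetical.

```python
import math

data = [4, 8, 2, 14, 7, 12]          # Example 5 data

def variance_definition(x):
    """s^2 as the sum of squared deviations from the mean, divided by n - 1."""
    n = len(x)
    xbar = sum(x) / n
    return sum((xi - xbar) ** 2 for xi in x) / (n - 1)

def variance_shortcut(n, total, total_sq):
    """s^2 from n, the sum of the x_i, and the sum of the squared x_i only."""
    return (total_sq - total ** 2 / n) / (n - 1)

s2_a = variance_definition(data)
s2_b = variance_shortcut(len(data), sum(data), sum(v * v for v in data))
print(f"mean = {sum(data) / len(data):.3f}")
print(f"s^2 (definition) = {s2_a:.3f}, s^2 (shortcut) = {s2_b:.3f}")
print(f"s = {math.sqrt(s2_a):.3f}")
```

The two formulas agree up to floating-point rounding, so the shortcut is convenient when only the sums are available.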

Which One Should I Use?


When describing a distribution, generally use a measure of location and a measure of
spread.
Mean and Standard deviation for symmetric quantitative data.
Median and IQR for skewed quantitative data.
Mode for categorical data

Statistics vs. Parameters


Statistic: a numerical summary of sample data. (We will focus on this for now.)
Sample mean, $\bar{x}$
Sample variance, $s^2$
Parameter: a numerical summary of population data. (More on this in chapter 5.)
Population mean, $\mu$
Population variance, $\sigma^2$

Descriptive Statistics for Qualitative Data


Recall: qualitative data we generally aggregate into counts
Generally it is helpful to calculate rates on a per-item basis (proportions)
$$p = \frac{\text{total \# items of interest}}{\text{total \# of items}}$$

$$\hat{p} = \frac{\text{\# items of interest in sample}}{\text{\# of items in sample}}$$

Graphical tools
Bar chart: like a histogram without intervals
Pie chart
Dot diagram
See 3.4 for more details
