Unit 1 - Descriptive Statistics
Graphical method:
Bar and pie charts, Histogram
Summary Statistics:
Measures of location, measures of variability,
boxplot
Average is around 17
Statistic parameter
estimator
Qualitative (quality):
Nominal: names, color, gender, nationality, brand
Ordinal: level of education (school, high school, college)
Size of clothes:
32, 36, 37, 38, 40: quantitative
S, M, L, XL: qualitative
Graphical Methods
Descriptive statistics can be divided into two general areas:
graphical and numerical. In this part, we consider
representing a data set using graphical techniques.
Appropriate graphs are:
For qualitative data: Bar chart and Pie chart
For quantitative data: Histogram; Boxplot
Frequency: the number of observations in each category.
Relative frequency of each category = frequency / n, where n is the
sample size (the sum of the frequencies).
IE FF GC GC OT FF FF FF FF IE
GC FF FF OT FF FF IE GC FF FF
GC IE IE IE GC FF OT OT OT OT
FF IE IE IE OT IE FF OT IE FF
FF IE IE GC IE FF GC GC GC FF
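The frequency and relative frequency of each category can be tallied directly from the 50 codes listed above. The following is a minimal Python sketch; the variable names are chosen here for illustration, and the meaning of the category codes is not assumed.

```python
from collections import Counter

# The 50 category codes copied from the listing above.
data = """
IE FF GC GC OT FF FF FF FF IE
GC FF FF OT FF FF IE GC FF FF
GC IE IE IE GC FF OT OT OT OT
FF IE IE IE OT IE FF OT IE FF
FF IE IE GC IE FF GC GC GC FF
""".split()

n = len(data)                      # sample size = sum of the frequencies
freq = Counter(data)               # frequency of each category
rel_freq = {cat: count / n for cat, count in freq.items()}

# Print categories from most to least frequent.
for cat, count in freq.most_common():
    print(f"{cat}: frequency = {count}, relative frequency = {rel_freq[cat]:.2f}")
```

These counts are what a bar chart would plot as bar heights, and the relative frequencies are the slice proportions of a pie chart.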
The distribution of the CPU times is skewed to the right with one potential outlier.
22, 21, 7, 16, 15, 15, 26, 16, 1, 13, 21, 21, 20, 19
(2) 17, 24, 21, 22, 26, 22, 19, 21, 23, 11, 19, 14, 23, 25,
26, 15, 17, 26, 21, 18, 19,21,24,18,16,20,21,20,23,33
(3) 56,52, 13,34,33, 18, 44, 41, 48, 75, 24, 19,35, 27, 46,
62, 71, 24, 66, 94, 40,18,15,39,53,23,41,78,15,35
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Similarly, the population mean, denoted by µ, is given by
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$
where N is the population size.
Sometimes a sample may contain a few points that are much
larger or smaller than the rest. Such points are called outliers
and may affect the mean.
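To see how a single outlier can pull the mean, here is a small Python sketch. The sample values are hypothetical, invented only for illustration; they are not taken from the course data.

```python
from statistics import mean

# Hypothetical sample (illustrative values only).
sample = [10, 12, 13, 15, 15]
with_outlier = sample + [85]   # the same sample plus one unusually large point

mean_clean = mean(sample)            # 65 / 5
mean_outlier = mean(with_outlier)    # 150 / 6

print("mean without outlier:", mean_clean)
print("mean with outlier:   ", mean_outlier)
```

Adding one point moves the mean from 13 to 25, a value larger than every observation except the outlier itself.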
STAT210: Probability and Statistics 35
Median
The median is the value in the middle when the data are
arranged in ascending order (smallest value to largest value).
To find the median the values in the sample are ordered from
smallest to largest, then
If n is odd, the sample median is the number in position (n+1)/2.
If n is even, the sample median is the average of the
numbers in positions n/2 and (n/2)+1.
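The position rule can be sketched in Python as follows. The function name `sample_median` and the test values are illustrative choices; note that Python lists are indexed from 0, so position k in the rule corresponds to index k-1.

```python
from statistics import median

def sample_median(values):
    """Median by the position rule: sort, then pick the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the value in position (n+1)/2, i.e. index (n+1)/2 - 1.
        return ordered[(n + 1) // 2 - 1]
    # Even n: average of the values in positions n/2 and (n/2)+1.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

# Hypothetical values for illustration.
odd_sample = [7, 1, 5]
even_sample = [4, 1, 3, 2]

print(sample_median(odd_sample))    # middle of 1, 5, 7
print(sample_median(even_sample))   # average of 2 and 3

# The standard library's median follows the same rule.
assert sample_median(odd_sample) == median(odd_sample)
assert sample_median(even_sample) == median(even_sample)
```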
Although the mean is the more commonly used measure of
central location, it is influenced by extremely small and large
data values. In such cases, the median is often the preferred
measure of central location.
Mean vs. Median
Mean tends to be drawn in the direction of the tail of a
skewed distribution. The median is more appropriate when
the distribution is highly skewed.
The mean can be greatly affected by the presence of outliers,
whereas the median is not.
For symmetric distributions, mean and median are the
same.
For skewed distributions, the mean lies towards the longer
tail relative to the median.
Trimmed Mean:
The trimmed mean is a measure of center that is much less
sensitive to outliers than the ordinary mean.
With the trimmed mean, p% of the data is trimmed from each
end of the data set.
First, arrange the sample values in (ascending or descending)
order. Then, trim an equal number of them (np/100 points)
from each end. Finally, compute the sample mean of the
remaining points.
Note: Minitab prints the 5% trimmed mean.
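The three steps of the trimmed mean can be sketched in Python as below. This is an illustration under stated assumptions: the function name `trimmed_mean` and the data are hypothetical, and when np/100 is not a whole number this sketch truncates it, which may differ from the rounding convention used by Minitab or other software.

```python
def trimmed_mean(values, p):
    """Mean of the data after dropping p% of the points from each end."""
    ordered = sorted(values)          # step 1: order the sample
    n = len(ordered)
    k = int(n * p / 100)              # assumption: truncate n*p/100 to an integer
    if 2 * k >= n:
        raise ValueError("p is too large: no points would remain")
    kept = ordered[k:n - k]           # step 2: trim k points from each end
    return sum(kept) / len(kept)      # step 3: mean of the remaining points

# Hypothetical sample with one large outlier (illustrative values only).
data = [2, 4, 5, 6, 7, 8, 9, 10, 11, 100]

print("ordinary mean:   ", sum(data) / len(data))
print("10% trimmed mean:", trimmed_mean(data, 10))
```

With n = 10 and p = 10, one point is removed from each end, so the outlier 100 no longer enters the average.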
The first quartile, Q1, is the value that has approximately 25% of
the observations below it. It represents the median of the lower half
of the data and corresponds to the 25th percentile.
The second quartile or median is the 50th percentile.
The third quartile, Q3, has approximately 75% of the observations
below it and corresponds to the 75th percentile.
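One way to compute the quartiles, following the description of Q1 as the median of the lower half, is sketched below in Python. Quartile conventions differ between textbooks and software, so this sketch may not reproduce Minitab's values exactly. The function name and data are illustrative, and excluding the middle value from both halves when n is odd is an assumption of this sketch.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # lower half of the data
    upper = ordered[half + n % 2:]    # upper half (skips the middle value if n is odd)
    return median(lower), median(ordered), median(upper)

# Hypothetical values for illustration.
q1, q2, q3 = quartiles([1, 2, 3, 4, 5, 6, 7, 8])
print("Q1 =", q1, " median =", q2, " Q3 =", q3)
```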
Measures of Variability: Variance and Standard
Deviation
The variance is the average of squared deviations of values from the
mean. The population variance (σ²) is given by
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$$
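The population variance can be checked numerically with a short Python sketch. The data below are hypothetical and are treated as a complete population of size N; the standard deviation is taken as the square root of the variance.

```python
from math import sqrt
from statistics import pvariance

# Hypothetical population (illustrative values only).
population = [2, 4, 4, 4, 5, 5, 7, 9]

N = len(population)
mu = sum(population) / N                               # population mean
sigma2 = sum((x - mu) ** 2 for x in population) / N    # average squared deviation
sigma = sqrt(sigma2)                                   # population standard deviation

print("mean =", mu, " variance =", sigma2, " standard deviation =", sigma)

# The standard library's pvariance uses the same 1/N definition.
assert abs(sigma2 - pvariance(population)) < 1e-12
```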
The Interquartile Range (IQR) is the range for the middle 50% of
the data.
IQR = Q3 - Q1
It is not influenced by outliers, but it is used to detect them.
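As an illustration of how the IQR is used to detect outliers, the Python sketch below flags points lying more than 1.5 × IQR below Q1 or above Q3. The 1.5 multiplier is the common boxplot convention and is an assumption here, since the rule is not stated in this section. The function name, data, and quartile values are hypothetical.

```python
def iqr_outliers(values, q1, q3, k=1.5):
    """Return the IQR and the points outside [q1 - k*IQR, q3 + k*IQR]."""
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    flagged = [x for x in values if x < lower_fence or x > upper_fence]
    return iqr, flagged

# Hypothetical sample; Q1 and Q3 below are the medians of its lower and upper halves.
values = [10, 12, 13, 15, 15, 16, 18, 40]
iqr, outliers = iqr_outliers(values, q1=12.5, q3=17.0)

print("IQR =", iqr)
print("potential outliers:", outliers)
```

Here the fences are 5.75 and 23.75, so only the value 40 is flagged as a potential outlier.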
Minitab Output:
Descriptive Statistics: CPU Time
53 46 36 48 39 35 37 36 39 45
compare the number of intrusions before and after the change, construct
parallel boxplots and comment on your findings.
Exercise
(3) Match each histogram to the boxplot that represents the
same data set.
24.1 13.3 16.2 17.5 19.0 23.9 14.8 22.2 21.7 20.7
13.5 15.8 13.1 16.1 21.9 23.9 19.3 12.0 19.9 19.4
15.4 16.7 19.5 16.2 16.9 17.1 20.2 13.4 19.8 17.7
19.7 18.7 17.6 15.9 15.2 17.1 15.0 18.8 21.6 11.9