Department of Statistics
Universidad Carlos III de Madrid
Descriptive Statistics
Statistics — 2014–2015
1 Main definitions
Descriptive statistics are used to describe, in a simple way, a data
collection. More specifically, they describe or summarize quantitatively the
main features of the data collection we are concerned with.
Definition 1 (Population, size). A statistical population consists of the
totality of the observations with which we are concerned. The number of
observations in the population is called the size of the population. It can
be finite or infinite, and we denote it by $N$.
It might be very expensive to record all the observations from a very large
population. In fact, a population might be of infinite size, which makes it
unfeasible to record all its observations. In such a case, we take a sample
from the population.
Definition 2 (Sample, size). A sample is a subset of observations selected
from a population. We say that it is representative if the main features of
the population are well represented in it. The sample size is the number of
observations in the sample, and we denote it by $n$.
If the sample coincides with the population, we say that we have a census.
Definition 3 (Variable, observation). A variable ($X$) is a symbol that
represents a characteristic of interest in the population. It describes the
observations from the population. Each observation ($x$) corresponds to one
specific value (numerical or not) of the variable.
Types of variables
The cumulative absolute frequency of the $i$-th value is the number of
elements in the sample that are not greater than $x_i$. It is denoted by $N_i$,
$$N_i = n_1 + n_2 + \dots + n_i$$
The cumulative relative frequency of the $i$-th value is the sum of the
relative frequencies of the elements of the sample that are not greater than
$x_i$. It is denoted by $F_i$,
$$F_i = f_1 + f_2 + \dots + f_i, \qquad F_i = \frac{N_i}{n}$$
A frequency table adopts the structure below,
$x_i$    $n_i$    $f_i$    $N_i$    $F_i$
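The columns of such a table can be computed mechanically from a sample. The following is a minimal sketch in Python; the function name `frequency_table` and the sample values are made up for illustration and are not part of these notes.

```python
from collections import Counter
from fractions import Fraction


def frequency_table(sample):
    """Return rows (x_i, n_i, f_i, N_i, F_i) for the distinct values of a sample."""
    counts = Counter(sample)
    n = len(sample)
    rows = []
    cumulative = 0  # running value of N_i
    for value in sorted(counts):
        n_i = counts[value]
        cumulative += n_i
        rows.append((value, n_i, Fraction(n_i, n), cumulative, Fraction(cumulative, n)))
    return rows


# Hypothetical sample, used only to illustrate the layout of the table.
sample = [2, 3, 3, 1, 2, 3, 5, 2, 3, 1]
for x_i, n_i, f_i, N_i, F_i in frequency_table(sample):
    print(x_i, n_i, f_i, N_i, F_i)
```

Exact fractions are used for the relative frequencies so that the last value of $F_i$ is exactly 1.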
If our dataset contains information about the whole population (it should
be a finite population and we would have a census), we could talk about
population frequencies. The same applies to the measures of central location,
scatter, and the rest of the definitions in these notes. Nevertheless, we will
usually refer to samples. When we consider a population, we will say so
explicitly.
3 Grouped data
Sometimes it is convenient to group the quantitative data we are working
with. The most common reason is that we have a variable that assumes a
lot of different values (not repeated in the sample). In such a case, the set
of possible values of the variable is divided into non-overlapping intervals,
usually called class intervals, cells, or bins. Frequencies will now refer to
them.
A crucial practical issue is choosing the number of bins. It often works
well to take it approximately equal to the square root of the number of
observations.
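As a small illustration of this rule of thumb, the sketch below rounds the square root of the sample size to the nearest integer. The function name `suggested_bin_count` and the choice of rounding (rather than truncating) are assumptions made here, not part of the notes.

```python
import math


def suggested_bin_count(n_observations):
    """Square-root rule of thumb: roughly sqrt(n) bins, and at least one."""
    if n_observations <= 0:
        raise ValueError("the sample must contain at least one observation")
    return max(1, round(math.sqrt(n_observations)))


print(suggested_bin_count(30))   # sqrt(30) is about 5.48
print(suggested_bin_count(100))
```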
The $i$-th class interval is of the type $(\underline{L}_i, \overline{L}_i]$
or, alternatively, $[\underline{L}_i, \overline{L}_i)$. Associated with the
$i$-th bin, we have its lower limit, $\underline{L}_i$, its upper limit,
$\overline{L}_i$, its mid-point, $m_i = (\underline{L}_i + \overline{L}_i)/2$,
and its width, $c_i = \overline{L}_i - \underline{L}_i$. The bins might have
unequal widths.
A frequency table for grouped data adopts the structure below,
$(\underline{L}_i, \overline{L}_i]$    $n_i$    $f_i$    $N_i$    $F_i$
If it is a population mean, that is, if we are considering all the individuals
from the population ($n = N$), it is commonly denoted by $\mu$.
Properties.
1. $\sum_{i=1}^{k} (x_i - \bar{x}) n_i = 0$
Notice that in order to compute the mean, all observations are taken into
account. Therefore, abnormally large (or small) observations have a high
influence on it.
If we are working with grouped data, we take the mid-point of each bin
in order to compute the mean,
$$\bar{x} = \frac{n_1 m_1 + n_2 m_2 + \dots + n_k m_k}{n} = \sum_{i=1}^{k} m_i f_i$$
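The grouped mean can be sketched in a few lines of Python. The bins and counts below are invented for illustration, and `grouped_mean` is a hypothetical helper name.

```python
def grouped_mean(bins, counts):
    """Mean of grouped data: each bin is represented by its mid-point m_i."""
    n = sum(counts)
    midpoints = [(lower + upper) / 2 for lower, upper in bins]
    return sum(m * n_i for m, n_i in zip(midpoints, counts)) / n


# Hypothetical bins (lower limit, upper limit) with unequal widths.
bins = [(0, 10), (10, 20), (20, 40)]
counts = [2, 5, 3]
print(grouped_mean(bins, counts))  # (5*2 + 15*5 + 30*3) / 10
```

Since only the mid-points enter the computation, the result is in general an approximation of the mean of the ungrouped observations.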
4.1.2 Median, Me
We can compute it for quantitative variables. It is a number such that at
least half of the observations are not greater than it, and at least half are
not smaller. If this definition applies to all the numbers in an interval, we
take the mid-point of that interval.
In order to compute the median, we must find the smallest value $x_i$ such
that $F_i \geq 0.5$, that is, $F_i \geq 0.5$ and $F_{i-1} < 0.5$. If
$F_i > 0.5$, then $Me = x_i$; if instead $F_i = 0.5$, then
$Me = (x_i + x_{i+1})/2$.
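The rule above translates directly into code. This is a minimal sketch (the name `median_from_frequencies` is chosen here); the comparison $F_i$ versus $0.5$ is done with integer counts, $2 N_i$ versus $n$, to avoid rounding problems with floating-point fractions.

```python
from collections import Counter


def median_from_frequencies(sample):
    """Median via cumulative frequencies: smallest x_i with F_i >= 0.5."""
    if not sample:
        raise ValueError("the sample is empty")
    counts = Counter(sample)
    values = sorted(counts)
    n = len(sample)
    cumulative = 0  # N_i
    for idx, x in enumerate(values):
        cumulative += counts[x]
        if 2 * cumulative > n:       # F_i > 0.5
            return x
        if 2 * cumulative == n:      # F_i = 0.5: average with the next value
            return (x + values[idx + 1]) / 2


print(median_from_frequencies([1, 2, 2, 5, 7]))  # F_i exceeds 0.5 at x = 2
print(median_from_frequencies([1, 2, 3, 4]))     # F_i = 0.5 at x = 2, so (2 + 3) / 2
```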
Property. The average Euclidean distance of the sample observations to
their median is minimal, that is, for any $a \in \mathbb{R}$,
$$\sum_{i=1}^{k} |x_i - Me| \, n_i \leq \sum_{i=1}^{k} |x_i - a| \, n_i$$
The computation of the median takes into account only the position of
the ordered observations, not the observations themselves. For this reason, it
behaves better than the mean in the presence of outliers (observations that
are numerically distant from the rest of the data).
4.1.3 Mode
It is the value with the highest frequency. It is sensible to compute it even
for categorical data. When there are two modes, we talk about a bimodal
distribution.
In case we are working with grouped data, the mode is the bin (or its
mid-point) that attains the largest ratio of relative frequency to width
($f_i / c_i$).
4.1.4 Harmonic mean, xH
If our data consists of rates, the harmonic mean provides us with the average
rate,
$$\bar{x}_H = \frac{n}{\sum_{i=1}^{k} n_i / x_i}$$
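A short worked example may help. The scenario below (two stretches of equal length driven at 60 and 40 km/h) is made up for illustration, and `harmonic_mean` is a hypothetical helper; Python's standard library also offers `statistics.harmonic_mean` for ungrouped data.

```python
def harmonic_mean(values, counts):
    """Harmonic mean n / sum(n_i / x_i) for values x_i with frequencies n_i."""
    n = sum(counts)
    return n / sum(n_i / x for x, n_i in zip(values, counts))


# Two stretches of equal length, driven at 60 km/h and at 40 km/h.
speeds = [60, 40]
counts = [1, 1]
print(harmonic_mean(speeds, counts))  # close to 48, below the arithmetic mean of 50
```

The average speed over the whole trip is the harmonic mean because each stretch covers the same distance, not the same time.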
4.2 Quantiles
Quantiles are computed for quantitative variables in a similar way to the
median: they only take into account the position of each observation in an
ordered sample. As particular instances of quantiles, we have quartiles,
percentiles, and deciles.
4.2.1 Quartiles
They divide the ordered sample into four parts with an equal number of
observations.
• $Q_1$, first quartile: at least 25% of the observations are not greater than
$Q_1$, and at least 75% are not smaller than $Q_1$.
• $Q_2$, second quartile: it is the median, $Q_2 = Me$.
• $Q_3$, third quartile: at least 75% of the observations are not greater than
$Q_3$, and at least 25% are not smaller than $Q_3$.
4.2.2 Percentiles
They divide the ordered sample into 100 parts.
Given a natural number $1 \leq \alpha \leq 99$, the $\alpha$-th percentile,
$P_\alpha$, satisfies that at least $\alpha$% of the observations are not
greater than $P_\alpha$ and at least $(100 - \alpha)$% of the observations are
not smaller than $P_\alpha$.
In particular, $Q_1 = P_{25}$ and $Q_3 = P_{75}$.
In order to compute the percentile $P_\alpha$, we consider the smallest
observation whose cumulative relative frequency is not smaller than
$\alpha/100$, that is, we consider $x_i$ such that $F_i \geq \alpha/100$ and
$F_{i-1} < \alpha/100$. If $F_i > \alpha/100$, then $P_\alpha = x_i$; if
instead $F_i = \alpha/100$, then
$P_\alpha = (\alpha/100) x_i + (1 - \alpha/100) x_{i+1}$.
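The computation rule for percentiles can be sketched as follows. The function name `percentile` is chosen here, and the cumulative relative frequency is compared with $\alpha/100$ through the integers $100 N_i$ and $\alpha n$ so that the equality case is detected exactly. Statistical packages often use other interpolation conventions, so their results may differ slightly from this rule.

```python
from collections import Counter


def percentile(sample, alpha):
    """alpha-th percentile (1 <= alpha <= 99) following the rule of these notes."""
    if not sample or not 1 <= alpha <= 99:
        raise ValueError("need a non-empty sample and 1 <= alpha <= 99")
    counts = Counter(sample)
    values = sorted(counts)
    n = len(sample)
    cumulative = 0  # N_i
    for idx, x in enumerate(values):
        cumulative += counts[x]
        if 100 * cumulative > alpha * n:      # F_i > alpha/100
            return x
        if 100 * cumulative == alpha * n:     # F_i = alpha/100
            w = alpha / 100
            return w * x + (1 - w) * values[idx + 1]


sample = [1, 2, 3, 4]
print(percentile(sample, 25))  # F_1 = 0.25 exactly: 0.25*1 + 0.75*2
print(percentile(sample, 50))  # coincides with the median rule
print(percentile(sample, 30))  # F_2 = 0.5 > 0.30, so the value is x_2
```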
$$S^2 = \sum_{i=1}^{k} \frac{(x_i - \bar{x})^2 n_i}{n-1}$$
Property.
$$S^2 = \frac{n}{n-1} \left( \sum_{i=1}^{k} x_i^2 f_i - (\bar{x})^2 \right)$$
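The property can be checked numerically. The sketch below computes $S^2$ both from the definition and from the shortcut formula on a made-up frequency table; the values and counts are hypothetical.

```python
# Hypothetical frequency table: values x_i with absolute frequencies n_i.
values = [1, 2, 3, 5]
counts = [2, 3, 4, 1]

n = sum(counts)
mean = sum(x * n_i for x, n_i in zip(values, counts)) / n

# Definition: squared deviations from the mean, divided by n - 1.
s2_definition = sum((x - mean) ** 2 * n_i for x, n_i in zip(values, counts)) / (n - 1)

# Shortcut: n/(n-1) * (sum of x_i^2 f_i minus the squared mean), with f_i = n_i / n.
mean_of_squares = sum(x ** 2 * (n_i / n) for x, n_i in zip(values, counts))
s2_shortcut = n / (n - 1) * (mean_of_squares - mean ** 2)

print(s2_definition, s2_shortcut)  # both are 12.5 / 9 up to rounding
```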
Notice that, in order to define it, we should have observations from all the
individuals in the population.
6 Charts
6.1 Bar chart
A bar chart consists of rectangular bars, each representing a different value
from the sample, with length proportional to the relative (or absolute)
frequency of the value that it represents. The bars can be plotted vertically
or horizontally.
[Figure: bar chart and pie chart, both titled "Energy Consumption in Spain by Energy Source (2013)". Bar chart axis: Percentage. Categories: Petroleum, Natural Gas, Coal, Nuclear Power, Renewable Ener., OTHERS.]
Figure 1: Bar chart (left) and pie chart (right) for the energy consumption
in Spain by source in 2013 (Source: IDAE, http://www.idae.es).
6.3 Histogram
It is the most common representation for continuous data. It consists of
adjacent rectangles, erected over discrete intervals (the bins), each with an
area equal to the relative frequency of the observations in the interval.
Equivalently, the height of the rectangle over the $i$-th bin is $f_i / c_i$.
Since it represents grouped data, it is possible to obtain several different
histograms for the same sample (by constructing different bins to group the
data).
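To make the relation between height and area concrete, the sketch below computes the heights $f_i / c_i$ for the 100m times of the Decathlon example from these notes. The bin edges are an assumption made here for illustration (they are not necessarily those of Figure 2), bins are taken as left-open and right-closed, and `histogram_heights` is a hypothetical helper.

```python
def histogram_heights(sample, edges):
    """Heights f_i / c_i for left-open, right-closed bins (edges[i], edges[i+1]]."""
    n = len(sample)
    heights = []
    for lower, upper in zip(edges, edges[1:]):
        n_i = sum(1 for x in sample if lower < x <= upper)
        heights.append((n_i / n) / (upper - lower))
    return heights


times = [11.10, 10.89, 11.28, 11.08, 10.55, 10.99, 11.06, 10.87, 11.14, 11.33,
         11.23, 11.08, 10.92, 11.36, 10.86, 10.97, 10.89, 11.14, 10.91, 10.85,
         10.98, 10.68, 10.69, 10.80, 10.62, 10.50, 10.90, 10.85, 10.44, 10.95]
edges = [10.4, 10.6, 10.8, 11.0, 11.2, 11.4]  # assumed bins of width 0.2

heights = histogram_heights(times, edges)
total_area = sum(h * (upper - lower)
                 for h, lower, upper in zip(heights, edges, edges[1:]))
print(heights)
print(total_area)  # the areas of the rectangles add up to 1 (up to rounding)
```

Choosing different edges changes the heights, which is why several histograms can be drawn for the same sample.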
[Figure: "Histogram 100m". Horizontal axis: 100m. Vertical axis: Density.]
Figure 2: Histogram for the time at the 100m race at the Decathlon during
the Olympic games in Athens 2004.
Example. Given the times of the athletes at the 100m race of the Decathlon
at the Athens 2004 Olympics, 11.10, 10.89, 11.28, 11.08, 10.55, 10.99, 11.06,
10.87, 11.14, 11.33, 11.23, 11.08, 10.92, 11.36, 10.86, 10.97, 10.89, 11.14,
10.91, 10.85, 10.98, 10.68, 10.69, 10.80, 10.62, 10.50, 10.90, 10.85, 10.44,
10.95, we obtain the following stem-and-leaf diagram:
104 | 4
105 | 0 5
106 | 2 8 9
107 |
108 | 0 5 5 6 7 9 9
109 | 0 1 2 5 7 8 9
110 | 6 8 8
111 | 0 4 4
112 | 3 8
113 | 3 6
The leaves represent hundredths of a second.
Notice that we can visualize a stemplot as a rotated histogram, all of whose
bins have equal width.
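The construction of the stemplot in the example can be sketched in Python. The helper name `stem_and_leaf` and its `scale` parameter are assumptions made here; each time is converted to hundredths of a second, the last digit becomes the leaf and the remaining digits the stem.

```python
def stem_and_leaf(sample, scale=100):
    """Map each stem to its sorted leaves; the leaf is the last digit of x * scale."""
    units = sorted(round(x * scale) for x in sample)
    # Include every stem in the range, even those without leaves.
    stems = {stem: [] for stem in range(units[0] // 10, units[-1] // 10 + 1)}
    for u in units:
        stems[u // 10].append(u % 10)
    return stems


times = [11.10, 10.89, 11.28, 11.08, 10.55, 10.99, 11.06, 10.87, 11.14, 11.33,
         11.23, 11.08, 10.92, 11.36, 10.86, 10.97, 10.89, 11.14, 10.91, 10.85,
         10.98, 10.68, 10.69, 10.80, 10.62, 10.50, 10.90, 10.85, 10.44, 10.95]

stems = stem_and_leaf(times)
for stem, leaves in stems.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```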
[Figure: two box plots. Horizontal axis of the left panel from 10.6 to 11.2, horizontal axis of the right panel from 6.6 to 8.0.]
Figure 3: Box plot for the times at the 100m race of the Decathlon in the
Athens 2004 Olympics.
Notice that, apart from the extreme values, quartiles, and median, the box
plots produced by most statistical packages also indicate which observations,
if any, might be considered outliers. Those are observations whose distance
to the nearest quartile ($Q_1$ or $Q_3$) exceeds $1.5 \, IQR$.
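The outlier criterion can be sketched as follows for the 100m times of the Decathlon example. The quartiles are computed with the percentile rule of these notes; statistical packages may use slightly different quartile conventions, so the flagged observations can differ. The helper names are chosen here for illustration.

```python
from collections import Counter


def percentile(sample, alpha):
    """alpha-th percentile via cumulative frequencies (rule of these notes)."""
    counts = Counter(sample)
    values = sorted(counts)
    n = len(sample)
    cumulative = 0
    for idx, x in enumerate(values):
        cumulative += counts[x]
        if 100 * cumulative > alpha * n:
            return x
        if 100 * cumulative == alpha * n:
            w = alpha / 100
            return w * x + (1 - w) * values[idx + 1]


def iqr_outliers(sample):
    """Observations farther than 1.5 * IQR from the nearest quartile."""
    q1, q3 = percentile(sample, 25), percentile(sample, 75)
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return sorted(x for x in sample if x < lower_fence or x > upper_fence)


times = [11.10, 10.89, 11.28, 11.08, 10.55, 10.99, 11.06, 10.87, 11.14, 11.33,
         11.23, 11.08, 10.92, 11.36, 10.86, 10.97, 10.89, 11.14, 10.91, 10.85,
         10.98, 10.68, 10.69, 10.80, 10.62, 10.50, 10.90, 10.85, 10.44, 10.95]

outliers = iqr_outliers(times)
print(percentile(times, 25), percentile(times, 75))  # Q1 and Q3
print(outliers)
```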
7.1 Skewness
The skewness is a measure of the asymmetry of the distribution of the
variable. It is defined as
$$\text{Skew} = \frac{\sum_{i=1}^{k} n_i (x_i - \bar{x})^3 / n}{s^3}.$$
A negative skew indicates that the tail on the left side of the probability
density function is longer than the right side and the bulk of the values
(including the median) lie to the right of the mean. A positive skew indicates
that the tail on the right side is longer than the left side and the bulk of the
values lie to the left of the mean. A zero value indicates that the values
are relatively evenly distributed on both sides of the mean, typically but not
necessarily implying a symmetric distribution.
The mean of a right-skewed variable is typically greater than its median, and
conversely the mean of a left-skewed variable is typically smaller than its
median.
7.2 Kurtosis
The kurtosis is a measure of the peakedness (concentration about the mean)
of the distribution of the variable. It is defined as
$$\text{Kurt} = \frac{\sum_{i=1}^{k} n_i (x_i - \bar{x})^4 / n}{s^4} - 3.$$
The reference value is the kurtosis of a Gaussian variable, which is set to 0.
This explains the $-3$ in the definition of the kurtosis, and it is the reason
why some authors refer to it as excess kurtosis.
A high kurtosis distribution has a sharper peak and longer, fatter tails,
while a low kurtosis distribution has a more rounded peak and shorter,
thinner tails. Depending on the sign of the kurtosis, we have three types of
distributions:
• A distribution with positive excess kurtosis is called leptokurtic.
• A distribution with zero excess kurtosis is called mesokurtic.
• A distribution with negative excess kurtosis is called platykurtic.
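Both shape measures can be computed from a frequency table as sketched below. The definition of $s$ is not shown in this part of the notes, so the divisor used for it is exposed as the parameter `ddof`; this parameter, its default value, and the function name `shape_measures` are assumptions made for illustration.

```python
import math


def shape_measures(values, counts, ddof=1):
    """Return (skewness, excess kurtosis) for values x_i with frequencies n_i.

    ddof is an assumption: s is computed with divisor n - ddof.
    """
    n = sum(counts)
    mean = sum(x * n_i for x, n_i in zip(values, counts)) / n

    def central_moment(order):
        return sum(n_i * (x - mean) ** order for x, n_i in zip(values, counts)) / n

    s = math.sqrt(n * central_moment(2) / (n - ddof))
    skew = central_moment(3) / s ** 3
    kurt = central_moment(4) / s ** 4 - 3
    return skew, kurt


# Symmetric sample: the third central moment vanishes, so the skewness is 0.
print(shape_measures([1, 2, 3, 4, 5], [1, 1, 1, 1, 1]))
# A long right tail gives a positive skewness.
print(shape_measures([1, 2, 10], [3, 1, 1]))
```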
Absolute marginal frequency of $x_i$, $n_{i\cdot} = n_{i1} + n_{i2} + \dots + n_{il} = \sum_{j=1}^{l} n_{ij}$.
Relative marginal frequency of $x_i$, $f_{i\cdot} = n_{i\cdot} / n$.
Absolute marginal frequency of $y_j$, $n_{\cdot j} = n_{1j} + n_{2j} + \dots + n_{kj} = \sum_{i=1}^{k} n_{ij}$.
Relative marginal frequency of $y_j$, $f_{\cdot j} = n_{\cdot j} / n$.
For X and Y , we can compute any measure of location, spread, or shape.
Alternatively, we can plot any of the previous graphical representations.
A two-way or double-entry table can be filled with absolute or relative
frequencies and adopts the structure below. Marginal frequencies can be
represented on the last row and column.
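As an illustration of the marginal frequencies defined above, the sketch below computes them for a small two-way table of absolute frequencies. The table entries are invented for this example.

```python
# Hypothetical two-way table: table[i][j] = n_ij (rows: values of X, columns: values of Y).
table = [
    [10, 20],
    [30, 40],
]

n = sum(sum(row) for row in table)

# Absolute marginal frequencies: row sums for X, column sums for Y.
row_marginals = [sum(row) for row in table]
column_marginals = [sum(column) for column in zip(*table)]

# Relative marginal frequencies.
row_relative = [n_i / n for n_i in row_marginals]
column_relative = [n_j / n for n_j in column_marginals]

print(row_marginals, column_marginals)
print(row_relative, column_relative)
```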
8.3 Statistical independence
The main reason we study two variables jointly, as a bivariate variable, is to
understand the dependence relation between them, if there is any. Two
variables are statistically independent if there is no relation between them.
Definition 5. Two variables X and Y are independent if the conditional
distribution of X given any value of Y remains unchanged, that is,
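$$\frac{n_{i1}}{n_{\cdot 1}} = \frac{n_{i2}}{n_{\cdot 2}} = \dots = \frac{n_{il}}{n_{\cdot l}} \quad \text{for all } i = 1, \dots, k$$
or equivalently
$$f_{i|1} = f_{i|2} = \dots = f_{i|l} \quad \text{for all } i = 1, \dots, k$$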
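Definition 5 can be checked directly on a two-way table of absolute frequencies. In the sketch below, `are_independent` is a hypothetical helper and both tables are made up; exact fractions are used so that the equality of the conditional frequencies is not affected by rounding.

```python
from fractions import Fraction


def are_independent(table):
    """Check n_i1/n_.1 = n_i2/n_.2 = ... = n_il/n_.l for every row i."""
    column_totals = [sum(column) for column in zip(*table)]
    if any(total == 0 for total in column_totals):
        raise ValueError("every value of Y must be observed at least once")
    for row in table:
        conditional = [Fraction(n_ij, total) for n_ij, total in zip(row, column_totals)]
        if any(f != conditional[0] for f in conditional):
            return False
    return True


# Columns are proportional: the conditional distribution of X does not depend on Y.
print(are_independent([[10, 20], [30, 60]]))
# Columns are not proportional: X and Y are dependent.
print(are_independent([[10, 20], [30, 40]]))
```

With real samples the equalities rarely hold exactly, so this check describes the definition rather than a statistical test.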
[Figure: "Data cloud for Decathlon (long jump vs. 100m)". Horizontal axis: long jump. Vertical axis: 100m.]
Figure 4: Data cloud for the long jump (axis X) versus the times at the 100m
race (axis Y).
• When the covariance is zero ($s_{XY} = 0$), there is no linear relation
between the variables. Nevertheless, a nonlinear relation might well
exist.
When $X$ and $Y$ are independent, their covariance is zero, $s_{XY} = 0$, but
the converse does not hold.
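A small made-up example shows why the converse fails: if $Y = X^2$ and the values of $X$ are symmetric about zero, the covariance vanishes although $Y$ is completely determined by $X$. The definition of $s_{XY}$ is not shown in this part of the notes, so the divisor $n$ used below is an assumption; the conclusion is the same with $n - 1$.

```python
from fractions import Fraction


def covariance(xs, ys):
    """Covariance with divisor n (an assumption; n - 1 gives the same sign and zeros)."""
    n = len(xs)
    mean_x = Fraction(sum(xs), n)
    mean_y = Fraction(sum(ys), n)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]   # Y is a deterministic, nonlinear function of X

print(covariance(xs, ys))  # zero covariance despite the exact dependence
```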
8.4.4 Correlation
Pearson’s correlation is given by
$$\rho_{XY} = \frac{s_{XY}}{s_X s_Y}$$
which is bounded between $-1$ and $1$.
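The correlation can be sketched in Python as follows. The function name `pearson_correlation` and the data are chosen for illustration. Whatever divisor ($n$ or $n - 1$) is used for $s_{XY}$, $s_X$, and $s_Y$, it cancels in the ratio as long as it is the same in all three, so the code works with plain sums of products.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson's correlation s_XY / (s_X * s_Y); the common divisor cancels."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


xs = [1, 2, 3, 4]
print(pearson_correlation(xs, [3, 5, 7, 9]))   # exact increasing linear relation
print(pearson_correlation(xs, [9, 7, 5, 3]))   # exact decreasing linear relation
print(pearson_correlation(xs, [2, 1, 4, 3]))   # weaker relation, strictly between -1 and 1
```

The extreme values $-1$ and $1$ are attained exactly when the points lie on a straight line with nonzero slope.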
[Figure: "Regression line". Horizontal axis: long jump. Vertical axis: 100m.]
Figure 5: Regression line for the 100m time over the long jump distance.
9 Time series
A time series is a sequence of observations of a variable ordered in time.
Commonly these observations are taken at equally spaced time intervals.
Although time series have many applications in Energy Engineering (the
production, consumption, and price of energy are typically modeled as time
series), their study will not be included in this introductory Statistics
course.
[Figure: line chart. Horizontal axis: year.]
Figure 6: Line chart for the total electric energy produced in Alaska during
the years 1990 to 2012 (Source: US Energy Information Administration,
http://www.eia.gov).
The three basic patterns that are studied in a time series are:
• Cycle. A cyclic pattern exhibits rises and falls that are not of fixed
period.