Ignacio Cascos Fernández

Department of Statistics
Universidad Carlos III de Madrid

Descriptive Statistics

Statistics — 2014–2015

1 Main definitions
Descriptive statistics are used to describe, in a simple way, a data collec-
tion. More specifically, they describe or summarize quantitatively the main
features of the data collection we are concerned with.
Definition 1 (Population, size). A statistical population consists of the
totality of the observations with which we are concerned. The number of
observations in the population is called the size of the population; it can be
finite or infinite, and we denote it by N.
It might be very expensive to record all the observations from a very large
population. In fact, a population might be of infinite size, which makes it
unfeasible to record all its observations. In such a case, we take a sample
from the population.
Definition 2 (Sample, size). A sample is a subset of observations selected
from a population. We say that it is representative if the main features from
the population are well represented in it. The sample size is the number of
observations in the sample; we denote it by n.
If the sample coincides with the population, we say that we have a census.
Definition 3 (Variable, observation). A variable (X) is a symbol that repre-
sents a characteristic of interest in the population. It describes the observa-
tions from the population. Each observation (x) corresponds to one specific
value (numerical or not) of the variable.

Types of variables

• Quantitative: it takes numeric values and refers to something that can be
measured.

  – Discrete: the set of possible values is finite or, at least, denumerable
  (example: number of brothers or sisters).
  – Continuous: the set of possible values is uncountable (example:
  lifetime of a battery).

• Qualitative (or categorical): the possible values assumed by the
variable are not numeric. The variable represents a characteristic that
cannot be measured (example: color).

• Dichotomous: may only assume two values (YES/NO); {0, 1}.

2 Frequencies and tables


In order to start working with the observations from a sample, their values
are usually sorted. When the variable is quantitative (numeric), they are
sorted in an increasing way.
Given a variable X, let us consider a sample of size n with k different
values, x1 , . . . , xk (in case of a quantitative variable, x1 < x2 < . . . < xk ).
The absolute frequency of $x_i$ is the number of times that $x_i$ appears in
the sample. It is denoted by $n_i$ and satisfies
$$\sum_{i=1}^{k} n_i = n_1 + n_2 + \dots + n_k = n.$$

The relative frequency of $x_i$ is the quotient of the absolute frequency of
$x_i$ ($n_i$) divided by the sample size ($n$); it is denoted by $f_i$:
$$f_i = \frac{n_i}{n}, \quad \text{and it satisfies} \quad \sum_{i=1}^{k} f_i = 1.$$

For quantitative variables (values from the sample sorted in an increasing
way), we also define the cumulative frequencies.

The cumulative absolute frequency of the i-th value is the number of elements
in the sample not greater than $x_i$; it is denoted by $N_i$:
$$N_i = n_1 + n_2 + \dots + n_i.$$
The cumulative relative frequency of the i-th value is the sum of the relative
frequencies of the elements from the sample that are not greater than $x_i$;
it is denoted by $F_i$:
$$F_i = f_1 + f_2 + \dots + f_i = \frac{N_i}{n}.$$
A frequency table adopts the structure below,
$x_i$ | $n_i$ | $f_i$ | $N_i$ | $F_i$

If our dataset contains information about the whole population (it should
be a finite population and we would have a census), we can talk about
population frequencies. The same applies to the measures of central location,
scatter, and the rest of the definitions in these notes. Nevertheless, we will
usually refer to samples; whenever we consider a population instead, we will
say so explicitly.
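The frequencies defined above can be tabulated with a short Python sketch (the helper name `frequency_table` is ours, not part of the notes):

```python
from collections import Counter

def frequency_table(sample):
    """For each sorted distinct value x_i, return (x_i, n_i, f_i, N_i, F_i)."""
    n = len(sample)
    counts = Counter(sample)
    rows, cum = [], 0
    for x in sorted(counts):
        ni = counts[x]
        cum += ni                      # cumulative absolute frequency N_i
        rows.append((x, ni, ni / n, cum, cum / n))
    return rows

for row in frequency_table([2, 3, 3, 5, 2, 2, 7]):
    print(row)
```

Each row lists a distinct value with its absolute, relative, and cumulative frequencies, so the last row always has $F_k = 1$.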

3 Grouped data
Sometimes it is convenient to group the quantitative data we are working
with. The most common reason is that we have a variable that assumes a
lot of different values (not repeated in the sample). In such a case, the set
of possible values of the variable is divided into non-overlapping intervals,
usually called class intervals, cells, or bins. Frequencies will now refer to
them.
A crucial practical issue is choosing the number of bins. It often works
well to take it approximately equal to the square root of the number of
observations.
The i-th class interval is alternatively of the type $(L_i, \overline{L}_i]$ or $[L_i, \overline{L}_i)$.
Associated with the i-th bin, we have its lower limit, $L_i$, its upper limit,
$\overline{L}_i$, its mid-point, $m_i = (L_i + \overline{L}_i)/2$, and its width, $c_i = \overline{L}_i - L_i$. The bins
might have unequal widths.
A frequency table for grouped data adopts the structure below,

$(L_i, \overline{L}_i]$ | $n_i$ | $f_i$ | $N_i$ | $F_i$
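A minimal sketch of equal-width binning with the square-root rule for the number of bins (the helper name `make_bins` is ours; real statistical packages use more refined rules):

```python
import math

def make_bins(sample, k=None):
    """Group data into k equal-width bins; by default k is about sqrt(n),
    following the square-root rule mentioned above."""
    n = len(sample)
    if k is None:
        k = max(1, round(math.sqrt(n)))
    lo, hi = min(sample), max(sample)
    width = (hi - lo) / k
    counts = [0] * k
    for x in sample:
        i = min(int((x - lo) / width), k - 1)  # the maximum lands in the last bin
        counts[i] += 1
    edges = [lo + i * width for i in range(k + 1)]
    return edges, counts
```

The returned `edges` are the bin limits $L_i$, and `counts` are the absolute frequencies $n_i$ of the bins.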

4 Measuring the location of the data


4.1 Measuring central location
The central location measures are representative values of a data set. Their
aim is to summarize as much information as possible from a data set with a
single number (or observation).

4.1.1 (Sample) mean, $\bar{x}$

It is computed for quantitative variables and it is the geometric center (center
of gravity) of our data,
$$\bar{x} = \frac{n_1 x_1 + n_2 x_2 + \dots + n_k x_k}{n} = \sum_{i=1}^{k} x_i f_i.$$
If it is a population mean, that is, if we are considering all the individuals from
the population ($n = N$), it is commonly denoted by $\mu$.
Properties.

1. $\sum_{i=1}^{k} (x_i - \bar{x})\, n_i = 0$.

2. The mean quadratic distance of the sample observations to their sample
mean is minimal, that is, for any $a \in \mathbb{R}$,
$$\sum_{i=1}^{k} (x_i - \bar{x})^2 n_i \le \sum_{i=1}^{k} (x_i - a)^2 n_i.$$

Notice that in order to compute the mean, all observations are taken into
account. Therefore, abnormally large (or small) observations have a high
influence on it.

If we are working with grouped data, we take the mid-point of each bin
in order to compute the mean,
$$\bar{x} = \frac{n_1 m_1 + n_2 m_2 + \dots + n_k m_k}{n} = \sum_{i=1}^{k} m_i f_i.$$

4.1.2 Median, Me
We can compute it for quantitative variables. It is a number such that at
least half of the observations are not greater than it, and at least half are not
smaller. If this definition applies to all the numbers in an interval, we
take the mid-point of that interval.
In order to compute the median, we must find the smallest value $x_i$ such
that $F_i \ge 0.5$, that is, $F_i \ge 0.5$ and $F_{i-1} < 0.5$. If $F_i > 0.5$, then $Me = x_i$; if
instead $F_i = 0.5$, then $Me = (x_i + x_{i+1})/2$.
Property. The average Euclidean distance of the sample observations to
their median is minimal, that is, for any $a \in \mathbb{R}$,
$$\sum_{i=1}^{k} |x_i - Me|\, n_i \le \sum_{i=1}^{k} |x_i - a|\, n_i.$$

The computation of the median takes into account only the position of
the ordered observations, not the observations themselves. For this reason, it
behaves better than the mean in the presence of outliers (observations that
are numerically distant from the rest of the data).
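The rule above can be sketched in a few lines of Python (the helper name `median` is ours; Python's `statistics.median` uses a slightly different convention):

```python
def median(sample):
    """Median via cumulative relative frequencies: smallest x_i with
    F_i >= 0.5; when F_i is exactly 0.5, average x_i with the next
    distinct value, as in the notes."""
    xs = sorted(set(sample))
    n = len(sample)
    cum = 0
    for idx, x in enumerate(xs):
        cum += sample.count(x)
        F = cum / n
        if F > 0.5:
            return x
        if F == 0.5:
            return (x + xs[idx + 1]) / 2
```

For `[1, 3, 3, 5, 9]` the cumulative relative frequency first exceeds 0.5 at 3, so 3 is the median; for `[1, 2, 3, 4]` it hits exactly 0.5 at 2, so the median is the mid-point 2.5.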

4.1.3 Mode
It is the value with the highest frequency. It makes sense to compute it even
for categorical data. When there are two modes, we talk about a bimodal
distribution.
In case we are working with grouped data, the mode is the bin (or its
mid-point) that attains the largest quotient of relative frequency over
width ($f_i/c_i$).

4.1.4 Harmonic mean, $\bar{x}_H$

If our data consist of rates, the harmonic mean provides us with the average
rate,
$$\bar{x}_H = \frac{n}{\sum_{i=1}^{k} n_i / x_i}.$$

4.1.5 Geometric mean, $\bar{x}_G$

The geometric mean only applies to positive numbers and should be used
when our results are presented as ratios to reference values,
$$\bar{x}_G = \sqrt[n]{x_1^{n_1} x_2^{n_2} \dots x_k^{n_k}}.$$
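Both means are easy to sketch in Python (the helper names are ours; the geometric mean is computed through logarithms to avoid overflow for large samples):

```python
import math

def harmonic_mean(rates):
    """Harmonic mean: n over the sum of reciprocals."""
    return len(rates) / sum(1 / x for x in rates)

def geometric_mean(values):
    """Geometric mean of positive numbers, computed via logarithms."""
    return math.exp(sum(math.log(x) for x in values) / len(values))

# Average speed over two equal distances covered at 40 and 60 km/h:
print(harmonic_mean([40, 60]))
print(geometric_mean([2, 8]))
```

The classic example: driving the same distance at 40 km/h and then at 60 km/h gives an average speed of 48 km/h (the harmonic mean), not 50.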

4.1.6 5% trimmed mean, $\bar{x}_T$

A trimmed (or truncated) mean is the mean value of the data set that results
after discarding a fixed percentage (5%) of the lowest and highest
observations,
$$\bar{x}_T = \frac{1}{0.9}\left( (F_{k_1} - 0.05)\,x_{k_1} + (0.95 - F_{k_2-1})\,x_{k_2} + \sum_{i=k_1+1}^{k_2-1} f_i x_i \right)$$
where $k_1$ and $k_2$ are such that
$$F_{k_1-1} < 0.05 \le F_{k_1}; \qquad F_{k_2-1} \le 0.95 < F_{k_2}.$$

4.2 Quantiles
Quantiles are computed for quantitative variables in a similar way to the
median, they only take into account the position of each observation in an
ordered sample. As particular instances of quantiles, we have quartiles, per-
centiles, and deciles.

4.2.1 Quartiles
They divide the sample into four parts with equal number of observations.

• Q1 , first quartile, at least 25% of the observations are not greater than
Q1 , and at least 75% are not smaller than Q1 .

• Q2 , second quartile, it is the median, Q2 = Me.

• Q3 , third quartile, at least 75% of the observations are not greater than
Q3 , and at least 25% are not smaller than Q3 .

• Q4 , fourth quartile, it is the largest observation from the sample.

4.2.2 Percentiles
They divide the ordered sample into 100 parts.
Given a natural number 1 ≤ α ≤ 99, the α-th percentile, Pα satisfies
that at least α% of the observations are not greater than Pα and at least
(100 − α)% of the observations are not smaller than Pα .
It should be obvious to the reader that Q1 = P25 and Q3 = P75 .
In order to compute percentile Pα , we consider the smallest observation
whose cumulative relative frequency is not smaller than α/100, that is, we
consider $x_i$ such that $F_i \ge \alpha/100$ and $F_{i-1} < \alpha/100$. If $F_i > \alpha/100$, then
$P_\alpha = x_i$; if instead $F_i = \alpha/100$, then $P_\alpha = (\alpha/100)x_i + (1 - \alpha/100)x_{i+1}$.
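This percentile rule generalizes the median computation; a minimal sketch (the helper name `percentile` is ours, and other packages use different interpolation conventions):

```python
def percentile(sample, alpha):
    """P_alpha via the rule above: smallest x_i with F_i >= alpha/100;
    when equality holds, interpolate with the next distinct value."""
    xs = sorted(set(sample))
    n = len(sample)
    a = alpha / 100
    cum = 0
    for idx, x in enumerate(xs):
        cum += sample.count(x)
        F = cum / n
        if F > a:
            return x
        if F == a:
            return a * x + (1 - a) * xs[idx + 1]
```

With `alpha = 50` the interpolation weights are both 1/2, so the function reduces to the median rule of Section 4.1.2.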

5 Measuring the spread of the data


Spread statistics describe the variability or scatter in the data.

5.1 Sample range


The sample range is the difference between the largest and smallest observa-
tions, xk − x1 .

5.2 Interquartile Range


The interquartile range is the difference between the third and the first quar-
tiles, IQR = Q3 − Q1 .

5.3 Sample variance, ($S^2$)

$$S^2 = \sum_{i=1}^{k} \frac{(x_i - \bar{x})^2 n_i}{n - 1}$$

Property.
$$S^2 = \frac{n}{n - 1}\left( \sum_{i=1}^{k} x_i^2 f_i - \bar{x}^2 \right)$$

The population variance, denoted by $\sigma^2$, is commonly defined as
$$\sigma^2 = \sum_{i=1}^{k} (x_i - \bar{x})^2 f_i;$$
notice that, in order to define it, we should have observations from all the
individuals in the population.
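The defining formula and the shortcut property can be checked against each other in a few lines (the helper names are ours):

```python
def sample_variance(sample):
    """S^2 with the n - 1 denominator, from the definition."""
    n = len(sample)
    xbar = sum(sample) / n
    return sum((x - xbar) ** 2 for x in sample) / (n - 1)

def sample_variance_shortcut(sample):
    """Same quantity via the property: n/(n-1) * (sum of x_i^2 f_i - xbar^2)."""
    n = len(sample)
    xbar = sum(sample) / n
    return n / (n - 1) * (sum(x ** 2 for x in sample) / n - xbar ** 2)
```

The shortcut needs only the sum and the sum of squares of the data, which is why it is convenient for hand computation, although it can lose precision in floating point when the mean is large relative to the spread.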

5.4 Standard deviation, (s)


The standard deviation is the square root of the variance and quantifies the
error we make when we represent a sample by its mean value alone. The
sample standard deviation is commonly denoted by S.
The population standard deviation is commonly denoted by σ.

5.5 Median Absolute Deviation

MAD = Me|X − Me(X)|
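A direct translation of the definition, using the standard library's `statistics.median`:

```python
from statistics import median

def mad(sample):
    """Median absolute deviation: the median of |x - Me(X)|."""
    m = median(sample)
    return median(abs(x - m) for x in sample)
```

Like the median itself, the MAD is robust: for `[1, 2, 3, 4, 9]` the outlier 9 does not affect the result, which is 1.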

5.6 Coefficient of Variation, ($CV$)

$$CV = 100\,\frac{s}{|\bar{x}|}$$

The coefficient of variation is a normalized measure of dispersion (it is a
dimensionless quantity). For this reason it is commonly given as a percentage.
We can also measure the shape of the distribution of our data. Nevertheless,
we will first introduce some simple charts in order to explain what we
mean by shape.

6 Charts
6.1 Bar chart
It consists of rectangular bars, each representing a different value from the
sample and with length proportional to the relative (or absolute) frequency of
the value it represents. The bars can be plotted vertically or horizontally.

6.2 Pie chart


A circular chart that is divided into sectors, illustrating proportion. In a pie
chart, each sector represents a different value from the sample and its arc
length is proportional to the relative frequency of the value it represents.

Figure 1: Bar chart (left) and pie chart (right) for the energy consumption
in Spain by source in 2013 (Source: IDAE, http://www.idae.es).

6.3 Histogram
It is the most common representation for continuous data. It consists of
adjacent rectangles, erected over discrete intervals (the bins) and with an

area equal to the relative frequency of the observations in the interval. Al-
ternatively, the height of the rectangle over the i-th bin is fi /ci . Since it
represents grouped data, it is possible to obtain several different histograms
for the same sample (by constructing different bins to group the data).


Figure 2: Histogram for the time at the 100m race at the Decathlon during
the Olympic games in Athens 2004.

6.4 Frequency polygon


A piecewise linear curve joining each pair of consecutive upper mid-points of
the rectangles in a histogram. That is, it joins the points $(m_i, f_i/c_i)$. The
left end of this piecewise linear curve, $(m_1, f_1/c_1)$, is joined with $(L_1, 0)$ and
the right end, $(m_k, f_k/c_k)$, with $(\overline{L}_k, 0)$.

6.5 Stem-and-leaf diagram


The stem-and-leaf diagram or stemplot is a device for presenting quantitative
data in a graphical format, similar to a histogram, to assist in visualizing the
shape of a distribution. In a column on the left, the stems are listed in
increasing order from top to bottom. Each observation is then represented as
a leaf to the right of its stem, with leaves sorted in increasing order from
left to right.

Example. Given the times of the athletes at the 100m race of the Decathlon
at the Athens 2004 Olympics, 11.10, 10.89, 11.28, 11.08, 10.55, 10.99, 11.06,
10.87, 11.14, 11.33, 11.23, 11.08, 10.92, 11.36, 10.86, 10.97, 10.89, 11.14,
10.91, 10.85, 10.98, 10.68, 10.69, 10.80, 10.62, 10.50, 10.90, 10.85, 10.44,
10.95, we obtain the following stem-and-leaf diagram:
104 | 4
105 | 0 5
106 | 2 8 9
107 |
108 | 0 5 5 6 7 9 9
109 | 0 1 2 5 7 8 9
110 | 6 8 8
111 | 0 4 4
112 | 3 8
113 | 3 6

The leaves represent hundredths of a second.
Notice that we can visualize a stemplot as a rotated histogram, all of whose
bins have equal width.

6.6 Box plot


A box plot or box-and-whisker plot depicts samples of quantitative data
through their five-number summaries: the smallest observation (sample min-
imum), lower quartile (Q1 ), median (Me = Q2 ), upper quartile (Q3 ), and
largest observation (sample maximum).


Figure 3: Box plots for the times at the 100m race (left) and the long jump
distances (right) of the Decathlon in the Athens 2004 Olympics.

Notice that, apart from the extreme values, quartiles, and median, the box
plots produced by most statistical packages also indicate which observations,
if any, might be considered outliers. Those are observations whose distance
to the nearest quartile ($Q_1$ or $Q_3$) exceeds $1.5\,IQR$.
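The 1.5 IQR rule can be sketched directly (the helper name `boxplot_outliers` is ours; the quartiles are passed in because conventions for computing them vary):

```python
def boxplot_outliers(sample, q1, q3):
    """Observations farther than 1.5 * IQR from the nearest quartile."""
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in sample if x < lo or x > hi]
```

For a sample with quartiles 2 and 4, the whiskers reach from -1 to 7, so a value like 100 would be flagged as an outlier.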

7 Measuring the shape of the distribution of the data
We can also measure the shape of the histogram.

7.1 Skewness
The skewness is a measure of the asymmetry of the distribution of the
variable. It is defined as
$$\mathrm{Skew} = \frac{\sum_{i=1}^{k} n_i (x_i - \bar{x})^3 / n}{s^3}.$$
A negative skew indicates that the tail on the left side of the probability
density function is longer than the right side and the bulk of the values
(including the median) lie to the right of the mean. A positive skew indicates
that the tail on the right side is longer than the left side and the bulk of the
values lie to the left of the mean. A zero value indicates that the values
are relatively evenly distributed on both sides of the mean, typically but not
necessarily implying a symmetric distribution.
The mean of a right-skewed variable is greater than its median, and con-
versely the mean of a left-skewed variable is smaller than its median.

7.2 Kurtosis
The kurtosis is a measure of the peakedness (concentration about the mean)
of the distribution of the variable. It is defined as
$$\mathrm{Kurt} = \frac{\sum_{i=1}^{k} n_i (x_i - \bar{x})^4 / n}{s^4} - 3.$$
The reference value is the kurtosis of a Gaussian variable, which is set to 0.
This explains the $-3$ in the definition of the kurtosis and is the reason why
some authors refer to it as excess kurtosis.

A high kurtosis distribution has a sharper peak and longer, fatter tails,
while a low kurtosis distribution has a more rounded peak and shorter thin-
ner tails. Depending on the sign of the kurtosis, we have three types of
distributions:
• A distribution with positive excess kurtosis is called leptokurtic.
• A distribution with zero excess kurtosis is called mesokurtic.
• A distribution with negative excess kurtosis is called platykurtic.
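Both shape measures can be sketched together (the helper name `skew_kurt` is ours; note that, as in the definitions above, the central moments use divisor $n$ while $s$ uses divisor $n - 1$):

```python
import math

def skew_kurt(sample):
    """Skewness and excess kurtosis: third and fourth central moments
    (divisor n) scaled by powers of the sample standard deviation s
    (divisor n - 1, as in the variance section)."""
    n = len(sample)
    xbar = sum(sample) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
    m3 = sum((x - xbar) ** 3 for x in sample) / n
    m4 = sum((x - xbar) ** 4 for x in sample) / n
    return m3 / s ** 3, m4 / s ** 4 - 3
```

A perfectly symmetric sample such as `[1, 2, 3]` has skewness 0; its excess kurtosis is negative (platykurtic) because the values spread evenly with no heavy tails.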

8 Simultaneous description of two variables


We will study two characteristics of each individual.
Definition 4. A bivariate variable (X, Y ) is a symbol representing two char-
acteristics from the individuals in a population.
Given a bivariate variable (X, Y ), we consider a sample of size n. Variable
X assumes k different values in the sample, x1 , . . . , xk , if it is quantitative
x1 < x2 < . . . < xk . Variable Y assumes l different values in the sample,
y1 , . . . , yl , if it is quantitative y1 < y2 < . . . < yl .
Observations are now of the type (xi , yj ).
The absolute frequency of (xi , yj ) is the number of times that (xi , yj )
appears in the sample. It is denoted by $n_{ij}$, and satisfies
$$\sum_{i=1}^{k} \sum_{j=1}^{l} n_{ij} = n.$$

The relative frequency of $(x_i, y_j)$ is the quotient of the absolute frequency
of $(x_i, y_j)$, $n_{ij}$, divided by the sample size $n$. It is denoted by $f_{ij}$:
$$f_{ij} = \frac{n_{ij}}{n}, \quad \text{and it satisfies} \quad \sum_{i=1}^{k} \sum_{j=1}^{l} f_{ij} = 1.$$

8.1 Marginal distributions


The univariate distribution of each of the components of a bivariate variable
is called marginal distribution. Given (X, Y ), we can study the marginal
distribution of X and the marginal distribution of Y .

Absolute marginal frequency of $x_i$: $n_{i\cdot} = n_{i1} + n_{i2} + \dots + n_{il} = \sum_{j=1}^{l} n_{ij}$.
Relative marginal frequency of $x_i$: $f_{i\cdot} = n_{i\cdot}/n$.
Absolute marginal frequency of $y_j$: $n_{\cdot j} = n_{1j} + n_{2j} + \dots + n_{kj} = \sum_{i=1}^{k} n_{ij}$.
Relative marginal frequency of $y_j$: $f_{\cdot j} = n_{\cdot j}/n$.
For X and Y , we can compute any measure of location, spread, or shape.
Alternatively, we can plot any of the previous graphical representations.
A two way or double entry table can be filled with absolute or relative
frequencies and adopts the structure below. Marginal frequencies can be
represented on the last row and column.

X\Y    y1     y2     ...   yl     ni·
x1     n11    n12    ...   n1l    n1·
x2     n21    n22    ...   n2l    n2·
...    ...    ...    ...   ...    ...
xk     nk1    nk2    ...   nkl    nk·
n·j    n·1    n·2    ...   n·l    n

8.2 Conditional distributions


We talk about conditional distributions when instead of considering the whole
sample as a reference set, we restrict to the observations that satisfy a certain
condition. Such a condition might involve only one variable or both.
The absolute frequency of xi given a certain condition is the number of
observations satisfying the condition and for which variable X assumes the
value xi .
The relative frequency of $x_i$ given a certain condition is the absolute
frequency of $x_i$ given the condition divided by the total number of
observations in the sample that satisfy the condition.
The distribution of X given that $Y = y_j$, denoted by $X|_{Y=y_j}$, is the
distribution of all observations from X for which Y assumes the value $y_j$. Its
absolute frequencies ($n_{i|j}$) constitute the j-th column of the two-way table.
The relative frequencies are given by $f_{i|j} = n_{ij}/n_{\cdot j}$.
For any conditional distribution, we can compute any measure of location,
spread, or shape. Alternatively, we can plot any of the previous graphical
representations.

8.3 Statistical independence
The main reason we study two variables as a bivariate one is to understand
the dependence relation between them, if there is any. Two variables are
statistically independent if there is no relation between them.
Definition 5. Two variables X and Y are independent if the conditional
distribution of X given any value of Y remains unchanged, that is,
$$\frac{n_{i1}}{n_{\cdot 1}} = \frac{n_{i2}}{n_{\cdot 2}} = \dots = \frac{n_{il}}{n_{\cdot l}} \quad \text{for all } i = 1, \dots, k,$$
or equivalently

fi |1 = fi |2 = . . . = fi |l for all i = 1, . . . , k

The previous relation is also equivalent to
$$\frac{n_{ij}}{n} = \frac{n_{i\cdot}}{n} \times \frac{n_{\cdot j}}{n} \quad \text{for all } i, j.$$
That is, X and Y are statistically independent if the relative frequency of
each pair (xi , yj ) equals the product of the (marginal) relative frequencies of
xi and yj (fij = fi· f·j for all i, j).
The expected value for the $(i, j)$ entry of a two-way table of absolute
frequencies when the variables are independent is $n f_{i\cdot} f_{\cdot j}$.
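Those expected entries are easy to compute from the marginal totals (the helper name `expected_counts` is ours; the same quantities appear later in chi-squared tests of independence):

```python
def expected_counts(table):
    """Expected (i, j) entries of a two-way table under independence:
    n * f_i. * f_.j = (row total) * (column total) / n."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    return [[r * c / n for c in col_tot] for r in row_tot]
```

Comparing the observed table with these expected counts shows how far the sample is from exact independence.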

8.4 Linear regression (Least Squares Estimation), correlation
We now restrict to quantitative variables.

8.4.1 Data cloud


The simplest and most common graphical representation of bivariate
quantitative data consists of a data cloud. Associated with each observation
$(x_i, y_j)$ from the sample, a point with coordinates $x_i$ and $y_j$ is plotted in the
Cartesian plane. In a data cloud, we can appreciate the relation between the
variables.


Figure 4: Data cloud for the long jump (axis X) versus the times at the 100m
race (axis Y).

8.4.2 Covariance, ($s_{XY}$)

The covariance of two variables is a measure of how much they change
together. It is given by
$$s_{XY} = \frac{\sum_{i=1}^{k} \sum_{j=1}^{l} (x_i - \bar{x})(y_j - \bar{y})\, n_{ij}}{n}.$$
Property.
$$s_{XY} = \frac{\sum_{i=1}^{k} \sum_{j=1}^{l} x_i y_j\, n_{ij}}{n} - \bar{x}\,\bar{y}.$$
• A positive covariance (sXY > 0) indicates that higher than average
values of one variable tend to be paired with higher than average values
of the other variable.

• A negative covariance ($s_{XY} < 0$) indicates that higher than average
values of one variable tend to be paired with lower than average values
of the other variable.

• When the covariance is zero ($s_{XY} = 0$), there is no linear relation
between the variables. Nevertheless, a nonlinear relation might well
exist.

When X and Y are independent, their covariance is zero, $s_{XY} = 0$, but the
converse does not hold.
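For paired observations (one $(x, y)$ pair per individual), the covariance reduces to a single sum (the helper name `covariance` is ours; it uses divisor $n$, as in the formula above):

```python
def covariance(xs, ys):
    """Sample covariance with divisor n, as in the notes."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / n
```

A perfectly increasing pairing gives a positive value, while pairing against a constant gives exactly zero.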

8.4.3 Linear regression, Least Squares Fitting

The linear regression establishes a line that fits our data to some extent.
This line (linear equation) allows us to predict the value of one variable
(dependent variable) when we know the value of the other (independent
variable).
Let X be the independent variable and Y the dependent variable; our aim
is to find an equation of the type $\hat{y} = a + bx$ that allows us to predict Y when
the value of X is known. We are interested in the intercept and the slope of
such a line ($a$ and $b$) for which the expression
$$F(a, b) = \sum_{i=1}^{k} \sum_{j=1}^{l} \left( y_j - (a + b x_i) \right)^2 n_{ij}$$
assumes the smallest possible value.


Those values are
sXY sXY
b = 2 , a = y − 2 x.
sX sX
Alternatively, the regression line of Y over X can be written as
sXY
ŷ − y = (x − x).
s2X

The regression line of X over Y can be computed in an analogous way.
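The least-squares formulas above translate directly into code for paired data (the helper name `regression_line` is ours):

```python
def regression_line(xs, ys):
    """Least-squares fit y-hat = a + b*x, with slope b = s_XY / s_X^2
    and intercept a = ybar - b * xbar."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / n
    sx2 = sum((x - xbar) ** 2 for x in xs) / n
    b = sxy / sx2
    return ybar - b * xbar, b
```

For exactly collinear data such as $y = 2x$, the fit recovers slope 2 and intercept 0.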

8.4.4 Correlation
Pearson's correlation is given by
$$r_{XY} = \frac{s_{XY}}{s_X s_Y},$$
which is bounded between $-1$ and $1$.


Figure 5: Regression line for the 100m time over the long jump distance.

• When r = 1 there exists perfect positive linear correlation.

• When r = −1 there exists perfect negative linear correlation.

• When r < 0, both regression lines are decreasing.

• When r > 0, both regression lines are increasing.

• When r = 0, the variables are said to be uncorrelated.

9 Time series
A time series is a sequence of observations of a variable ordered in time.
Commonly these observations are taken at equally spaced time intervals.
Although time series have many applications in Energy Engineering (the
production, consumption, and price of energy are typically modeled as time
series), their study will not be included in this introductory Statistics course.


Figure 6: Line chart for the total electric energy produced in Alaska during
the years 1990 to 2012 (Source: US Energy Information Administration,
http://www.eia.gov).

The three basic patterns that are studied in a time series are:

• Trend. Consisting of a long-term increase or decrease in the data (it
might be linear or not).

• Seasonality. Pattern repeated every fixed and known period of time
(e.g. every week or every year).

• Cycle. A cyclic pattern exhibits rises and falls that are not of fixed
period.
