Probability & Statistics B: Review of Simple Data Summaries

Chapter 1

PROBABILITY & STATISTICS B: Review of simple data summaries
CONTENTS
1. Introduction
2. Graphical summaries
3. Numerical summaries
1. INTRODUCTION
Statistical Science has two main aims:

Descriptive Statistics – To describe particular populations, samples or
phenomena and state conclusions that refer only to the studied data.

Statistical Inference – To use the available data in order to draw general
conclusions regarding the population sampled; to make predictions about
future behaviour of random variables; and to facilitate decision making under
uncertainty.
1. INTRODUCTION
Example 1
A class has 9 students. The GPAs of these 9 students are:

2.60 2.65 3.00 3.15 3.65 3.78 3.83 3.95 3.95

(Descriptive Statistics) Determine the median GPA for this class.


1. INTRODUCTION
Example 2
A college has 10,000 students. You randomly pick 9 students, and their GPAs
are:

2.60 2.65 3.00 3.15 3.65 3.78 3.83 3.95 3.95

(Statistical Inference) Determine the median GPA for this college of 10,000
students.

*reference*
1. INTRODUCTION
Example 3
A college has 10,000 students. You randomly pick 9 students, and their GPAs
are:

2.60 2.65 3.00 3.15 3.65 3.78 3.83 3.95 3.95

(Statistical Inference) You then randomly pick another student. What is the
probability that her GPA is higher than 3.5?
1. INTRODUCTION
Summarizing data, both graphically and numerically, is central to a good
statistical analysis.

Univariate data – data on a single variable.


2. GRAPHICAL SUMMARIES
Graphical displays (plots, tables, etc.) most often provide summaries that help
us understand important aspects of the data immediately.
2. GRAPHICAL SUMMARIES
3 basic types of data:

Categorical data – data record categories.

Discrete data – numeric data that take integer values.

Continuous data – numeric data that take values on a continuous scale.


2. GRAPHICAL SUMMARIES
Example 4
Examples of categorical, discrete, and continuous data?
2. GRAPHICAL SUMMARIES
Categorical data
Tables, barplots, and pie charts are among the most commonly used ways for
visually presenting categorical data.
2. GRAPHICAL SUMMARIES
Example 5
20 students in a Statistics class were asked (in a feedback questionnaire) the
following question:

“How did you find the tutorial sessions?”

(a) Good (b) Okay (c) Bad

Here are the answers: b b b c N/A a b c a a b b b c a c b c a b

Create a table, a barplot, and a pie chart.
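
As a minimal sketch of the table part of this example, the answers can be tallied with Python's standard `collections.Counter`. The variable names are illustrative and not part of the original slides; the barplot and pie chart would be drawn from the same counts.

```python
from collections import Counter

# Answers from Example 5, in the order given on the slide.
answers = "b b b c N/A a b c a a b b b c a c b c a b".split()

counts = Counter(answers)
n = len(answers)

# Frequency table with relative frequencies.
print(f"{'Answer':<8}{'Frequency':>10}{'Relative':>10}")
for category in ["a", "b", "c", "N/A"]:
    print(f"{category:<8}{counts[category]:>10}{counts[category] / n:>10.2f}")
```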


2. GRAPHICAL SUMMARIES
Example 5: Table
2. GRAPHICAL SUMMARIES
Example 5: Barplot
2. GRAPHICAL SUMMARIES
Example 5: Pie chart
Pie chart for questionnaire data: a 25%, b 45%, c 25%, N/A 5%.
2. GRAPHICAL SUMMARIES
Discrete data
We can also present the distribution of discrete (countable) data through
appropriate frequency tables and graphs.
2. GRAPHICAL SUMMARIES
Example 6
The table below gives the distribution of the number of claims per
policyholder for different policyholders in a general insurance portfolio.

Create a table and a barplot.

Number of claims  Frequency
0                 65623
1                 12571
2                 1664
3                 148
4                 13
5                 1
6                 0
2. GRAPHICAL SUMMARIES
Example 6: Table in more detail

Number of claims  Frequency  Relative freq.  Cumulative freq.  Cumul. relative freq.
0                 65623      0.8201          65623             0.8201
1                 12571      0.1571          78194             0.9772
2                 1664
3                 148
4                 13
5                 1
6                 0
Total             80020      1.000           -                 -
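
The remaining columns of the table follow mechanically from the frequencies. The sketch below (variable names are illustrative) computes relative, cumulative, and cumulative relative frequencies for the claim counts of Example 6.

```python
from itertools import accumulate

claims = [0, 1, 2, 3, 4, 5, 6]
freq = [65623, 12571, 1664, 148, 13, 1, 0]

n = sum(freq)                      # total number of policyholders
rel = [f / n for f in freq]        # relative frequencies
cum = list(accumulate(freq))       # cumulative frequencies
cum_rel = [c / n for c in cum]     # cumulative relative frequencies

for x, f, r, c, cr in zip(claims, freq, rel, cum, cum_rel):
    print(f"{x:>2} {f:>6} {r:.4f} {c:>6} {cr:.4f}")
```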
2. GRAPHICAL SUMMARIES
Example 6: Barplot
2. GRAPHICAL SUMMARIES
Continuous data
Continuous data can be presented in similar ways to those used for other
numerical data.

Caution is required with how the data are grouped in relevant categories, or
ranges, when frequency tables are considered.

Graphical summaries help identify the main characteristics of the data, e.g.
whether your data look consistent with a Normal distribution.
2. GRAPHICAL SUMMARIES
Continuous data
All graphs should:

Be reasonably accurate
Be on the correct (appropriate) scale
Have good annotation (title, labels on axes, units, etc.)
2. GRAPHICAL SUMMARIES
Example 7
The amounts of betting payouts (in pounds) from a particular company in a
given year are shown below:

2527 1787 3770 5701 2310 1652 822 918 2770 4891
3061 2126 1729 4618 3469

Create a table and a histogram.


2. GRAPHICAL SUMMARIES
Example 7: Table in detail

Payout amount  Frequency  Relative freq.  Cumulative freq.  Cumul. relative freq.
0-1000         2          0.1333          2                 0.1333
1001-2000      3          0.2000          5                 0.3333
2001-3000
3001-4000
4001-5000
5001-6000
6001+
Total                                     -                 -
2. GRAPHICAL SUMMARIES
Example 7: Histogram
2. GRAPHICAL SUMMARIES
Example 7: Histogram
2. GRAPHICAL SUMMARIES
Continuous data
There are infinitely many ways to choose the groups in a frequency table for
continuous variables.

In practice, grouping depends on the nature of the variable and the aims of
the analysis.
2. GRAPHICAL SUMMARIES
Continuous data
Two basic criteria are used:

Homogeneity of units belonging to the same group leads to a relatively large
number of groups with narrow width.

Simplicity implies a smaller number of groups.


2. GRAPHICAL SUMMARIES
Example 8
The amounts of betting payouts (in pounds) from a particular company in a
given year are shown below:

2527 1787 3770 5701 2310 1652 822 918 2770 4891
3061 2126 1729 4618 3469

Create a table and a histogram by using ranges 0-500, 501-1000, 1001-1500, etc.
2. GRAPHICAL SUMMARIES
Example 8: Table
Payout amount Frequency
0-500
501-1000
1001-1500
1501-2000
2001-2500
2501-3000
3001-3500
3501-4000
4001-4500
4501-5000
5001-5500
5501-6000
6001-6500
6501+
2. GRAPHICAL SUMMARIES
Example 8: Histogram
2. GRAPHICAL SUMMARIES
Histogram for continuous data
A histogram is similar to a bar plot and is very useful when the number of
observations is large.
2. GRAPHICAL SUMMARIES
Histogram for continuous data
To form a histogram from observations $x_1, x_2, \dots, x_n$, we simply:

1. Identify the min and max observations.
2. Split the range of the values into equal intervals (bins).
3. Count the observations falling in each bin.
4. Draw a rectangle above each interval with height equal to the frequency of
that interval, or with area equal to the relevant proportion.
5. The intervals can have unequal length (e.g. at the extremes of the
distribution).
6. In either case the area of each rectangle should be proportional to the
frequency (estimated probability) of its bin.
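
The counting step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the betting payouts of Example 7, and instead of starting the bins at the minimum observation it follows that example's ranges (0-1000, 1001-2000, ...), which start at 0. The helper `bin_counts` is a hypothetical name.

```python
from collections import Counter

payouts = [2527, 1787, 3770, 5701, 2310, 1652, 822, 918, 2770, 4891,
           3061, 2126, 1729, 4618, 3469]

def bin_counts(values, width):
    """Count non-negative integer values in bins 0-width, (width+1)-2*width, ..."""
    # (v - 1) // width maps 1..width to bin 0, width+1..2*width to bin 1, etc.
    # max(..., 0) also places a value of 0 in the first bin.
    return Counter(max((v - 1) // width, 0) for v in values)

width = 1000
counts = bin_counts(payouts, width)

# Text histogram: one '#' per observation in the bin.
for b in range(max(counts) + 1):
    low = 0 if b == 0 else b * width + 1
    high = (b + 1) * width
    print(f"{low:>4}-{high:<5} {'#' * counts[b]} ({counts[b]})")
```

Calling `bin_counts(payouts, 500)` gives the narrower grouping 0-500, 501-1000, and so on, which trades simplicity for more homogeneous groups.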
2. GRAPHICAL SUMMARIES
Stem-and-leaf plot for continuous data
A useful graphical way to represent relatively small data sets.

It retains the values of the data, while also showing the shape of the
distribution.

It is constructed by splitting each observation into two parts: the leading
digit(s) (stem) and the last digit (leaf).
2. GRAPHICAL SUMMARIES
Example 9
The following data show the salaries (in 1000s) of a sample of 30 employees
from a company:

20 21 21 25 25 27 29 30 32 35 36
38 39 45 50 51 52 52 55 56 63 68
69 71 71 73 73 73 74 76

Create a stem-and-leaf plot.
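
One possible way to build the plot programmatically is sketched below, assuming two-digit integer data so that the tens digit is the stem and the units digit is the leaf. The function name `stem_and_leaf` is illustrative.

```python
from collections import defaultdict

salaries = [20, 21, 21, 25, 25, 27, 29, 30, 32, 35, 36,
            38, 39, 45, 50, 51, 52, 52, 55, 56, 63, 68,
            69, 71, 71, 73, 73, 73, 74, 76]

def stem_and_leaf(values):
    """Map each stem (leading digits) to its sorted list of leaves (last digit)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        plot[stem].append(leaf)
    return dict(plot)

plot = stem_and_leaf(salaries)

# Print every stem in range, including stems with no leaves (gaps).
for stem in range(min(plot), max(plot) + 1):
    leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
```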


2. GRAPHICAL SUMMARIES
Example 9: stem-and-leaf plot
2. GRAPHICAL SUMMARIES
Example 10
The following data show the weights of the male and female members of a study
group.

Male: 45 49 55 58 60 68 69 69 75 78 90
Female: 39 40 42 44 44 45 51 53 53 59 60 72

Create a stem-and-leaf plot.


2. GRAPHICAL SUMMARIES
Example 10: stem-and-leaf plot
2. GRAPHICAL SUMMARIES
Dotplot for continuous data
A useful graphical way to represent relatively small data sets.

It shows data values as dots along a continuous horizontal axis (in ascending
order).

It is good for showing clusters of points, gaps (where there are no
observations), and atypical observations or outliers.
2. GRAPHICAL SUMMARIES
Example 11
The following data show the salaries (in 1000s) of a sample of 30 employees
from a company:

20 21 21 25 25 27 29 30 32 35 36
38 39 45 50 51 52 52 55 56 63 68
69 71 71 73 73 73 74 76

Create a dotplot.
2. GRAPHICAL SUMMARIES
Example 11: dotplot
3. NUMERICAL SUMMARIES
Graphs usually provide simple and easily interpreted descriptions of the
structure of the frequency distribution of studied data.
3. NUMERICAL SUMMARIES
Often we want to summarize the data further. Numerical summaries do this and
are useful because:

It is not always easy to recall information presented in detailed
tables/graphs.

In some cases detailed information is not required.

Most importantly, numerical summaries facilitate comparisons.


3. NUMERICAL SUMMARIES
Numerical summaries are mainly used for the computation of certain
characteristic measures of the population, known as parameters of the
population with respect to the considered variable.

These parameters take the form of numerical expressions that determine the
location, dispersion, and general form of the distribution of the population
variable.


3. NUMERICAL SUMMARIES
Measures of location

1. The sample mean

2. The sample median

3. Quantiles, quartiles and percentiles


3. NUMERICAL SUMMARIES
The sample mean
Also referred to as the arithmetic mean.

If $x_1, x_2, \dots, x_n$ constitute a random sample (observed values), then the
sample mean is given by:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
3. NUMERICAL SUMMARIES
The sample mean
For grouped data with $k$ distinct values $x_1, x_2, \dots, x_k$, the sample
mean is given by:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{k} f_i x_i$$

where $f_i$ is the frequency of $x_i$, and $n = \sum_{i=1}^{k} f_i$.


3. NUMERICAL SUMMARIES
Example 12
You are given the following general insurance claim data:

1100 1500 1600 2000 3500 4000 4800 4800 5000 5100

Calculate the sample mean.
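
A direct translation of the sample mean formula, shown here as a small sketch with illustrative variable names:

```python
claims = [1100, 1500, 1600, 2000, 3500, 4000, 4800, 4800, 5000, 5100]

n = len(claims)
mean = sum(claims) / n   # (1/n) * sum of the observations
print(mean)              # 3340.0
```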


3. NUMERICAL SUMMARIES
Example 13
The table below gives the distribution of the number of claims per
policyholder for different policyholders in a general insurance portfolio.

Find the mean number of claims per policyholder.

Number of claims  Frequency
0                 65623
1                 12571
2                 1664
3                 148
4                 13
5                 1
6                 0
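
For frequency data the mean weights each value by its frequency. The following is a minimal sketch of that calculation for the claims table (names are illustrative):

```python
values = [0, 1, 2, 3, 4, 5, 6]
freq = [65623, 12571, 1664, 148, 13, 1, 0]

n = sum(freq)                                     # total number of policyholders
total_claims = sum(f * x for f, x in zip(freq, values))
mean = total_claims / n                           # (1/n) * sum of f_i * x_i
print(n, total_claims, round(mean, 4))
```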
3. NUMERICAL SUMMARIES
The sample mean
Some properties of the sample mean:

a) If the variable is constant, $X = c$, then $\bar{x} = c$

b) $\min_i x_i \le \bar{x} \le \max_i x_i$, $i = 1, 2, \dots, n$

c) $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$

d) If $y_i = a + b x_i$ with $a$ and $b$ real numbers, then $\bar{y} = a + b\bar{x}$


3. NUMERICAL SUMMARIES
The sample mean
Some properties of the sample mean:

e) If the data consist of $k$ groups of size $n_1, n_2, \dots, n_k$ and
respective means $\bar{x}_1, \bar{x}_2, \dots, \bar{x}_k$, then the overall mean
is given by:

$$\bar{x} = \frac{1}{n} \sum_{j=1}^{k} n_j \bar{x}_j$$

where $n = n_1 + n_2 + \dots + n_k$.
3. NUMERICAL SUMMARIES
Example 14
A college has three schools: School of Engineering, School of Mathematical
Sciences, and School of Management.

School of Engineering has 55 students, and their average GPA is 3.5.


School of Mathematical Sciences has 105 students, and their average GPA is
3.8.
School of Management has 76 students, and their average GPA is 3.33.

Calculate the average GPA of the students from this college.
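
Property (e) of the sample mean applies directly here: the overall mean is the size-weighted average of the group means. A short sketch, with an illustrative dictionary layout:

```python
# school -> (number of students, average GPA)
schools = {
    "Engineering": (55, 3.5),
    "Mathematical Sciences": (105, 3.8),
    "Management": (76, 3.33),
}

n = sum(size for size, _ in schools.values())
overall = sum(size * gpa for size, gpa in schools.values()) / n
print(n, round(overall, 2))
```

Note that the simple average of the three school means would give a different value, because the schools have different sizes.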


3. NUMERICAL SUMMARIES
The sample median
The sample median is the value that splits the data into two halves, i.e. the
observation with rank $(n+1)/2$. Writing $x_{(r)}$ for the $r$-th smallest
observation:

Sample median = $x_{((n+1)/2)}$ if $n$ is odd.

Sample median = $\frac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right)$ if $n$ is even.
3. NUMERICAL SUMMARIES
Example 15
You are given the following general insurance claim data:

1100 1500 1600 2000 3500 4000 4800 4800 5000 5100

Calculate the sample median.
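
The odd/even rule for the median can be sketched as follows. The function name `sample_median` is illustrative; Python's standard `statistics.median` follows the same rule.

```python
def sample_median(values):
    """Middle ordered value (n odd) or average of the two middle values (n even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # rank (n+1)/2, zero-based index n//2
    return (ordered[mid - 1] + ordered[mid]) / 2  # ranks n/2 and n/2 + 1

claims = [1100, 1500, 1600, 2000, 3500, 4000, 4800, 4800, 5000, 5100]
median = sample_median(claims)
print(median)  # 3750.0
```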


3. NUMERICAL SUMMARIES
The sample median
Notes:

a) With frequency data, the median can be calculated with the help of the
cumulative frequencies in a suitable table.

b) The median can also be calculated for open-ended distributions, i.e. when
the first or last group has no lower or upper bound.

c) The median is more robust than the mean, i.e. it is far less affected by
extreme values in the data set.
3. NUMERICAL SUMMARIES
Example 16
AT Chocolate Factory, the monthly salaries are:

1 CEO – $100,000

4 managers – $8000 $7500 $7000 $6000

10 regular staff – $5000 $4000 $3500 $3500 $3500 $3300 $3200 $3000 $2900 $2600

Calculate the sample mean and sample median.
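
This example illustrates the robustness of the median: a single extreme salary pulls the mean far above what a typical employee earns. A sketch using Python's standard `statistics` module:

```python
import statistics

salaries = ([100000]                      # CEO
            + [8000, 7500, 7000, 6000]    # managers
            + [5000, 4000, 3500, 3500, 3500,
               3300, 3200, 3000, 2900, 2600])  # regular staff

mean = statistics.mean(salaries)
median = statistics.median(salaries)
print(round(mean, 2), median)
```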


3. NUMERICAL SUMMARIES
Quantiles, quartiles and percentiles
The $p$ quantile is the point in the data with position $1 + (n-1)p$, i.e. the
ordered value $x_{(1+(n-1)p)}$.

First quartile: $Q_1 = x_{(1+0.25(n-1))}$

Second quartile: $Q_2 = x_{(1+0.5(n-1))}$

Third quartile: $Q_3 = x_{(1+0.75(n-1))}$


3. NUMERICAL SUMMARIES
Quantiles, quartiles and percentiles
𝑄1 exceeds exactly 25% of the data.

𝑄2 is also the median, which exceeds exactly 50% of the data.

𝑄3 exceeds exactly 75% of the data.

$x_{(1+(n-1)p)}$ exceeds exactly $100p\%$ of the data.

Appropriate interpolation is needed if quantiles fall between data values.


3. NUMERICAL SUMMARIES
Example 17
You are given the following general insurance claim data:

1100 1500 1600 2000 3500 4000 4800 4800 5000 5100

Calculate the 0.1 quantile and 3rd quartile.
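
A sketch of the position rule $1 + (n-1)p$ follows. It assumes linear interpolation between the two neighbouring ordered values when the position is not an integer (the slides ask for "appropriate interpolation" without naming a scheme), and the function name `quantile` is illustrative.

```python
import math

def quantile(values, p):
    """p quantile at position 1 + (n - 1) p, for 0 <= p <= 1, with linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    pos = 1 + (n - 1) * p        # 1-based, possibly fractional position
    lower = math.floor(pos)
    frac = pos - lower
    if lower >= n:               # p = 1 gives the maximum
        return ordered[-1]
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])

claims = [1100, 1500, 1600, 2000, 3500, 4000, 4800, 4800, 5000, 5100]
q_10 = quantile(claims, 0.10)    # position 1.9, between 1100 and 1500
q_75 = quantile(claims, 0.75)    # position 7.75, between 4800 and 4800
print(q_10, q_75)
```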


3. NUMERICAL SUMMARIES
Quantiles, quartiles and percentiles
The $p$ quantile is also sometimes defined as the ordered data value
$x_{((n+1)p)}$.
3. NUMERICAL SUMMARIES
Example 18
You are given the following general insurance claim data:

1100 1500 1600 2000 3500 4000 4800 4800 5000 5100

Calculate the 0.1 quantile and 3rd quartile, by using $x_{((n+1)p)}$.
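
The alternative position rule $(n+1)p$ can be sketched in the same way. Two assumptions are made here that the slides do not state: linear interpolation between neighbouring ordered values, and clamping of positions outside $1, \dots, n$ to the minimum or maximum. The function name is illustrative.

```python
import math

def quantile_n_plus_1(values, p):
    """p quantile at position (n + 1) p, with linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumption: positions below 1 or above n are clamped to the data range.
    pos = min(max((n + 1) * p, 1), n)
    lower = math.floor(pos)
    frac = pos - lower
    if lower >= n:
        return ordered[-1]
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])

claims = [1100, 1500, 1600, 2000, 3500, 4000, 4800, 4800, 5000, 5100]
q_10 = quantile_n_plus_1(claims, 0.10)   # position 1.1
q_75 = quantile_n_plus_1(claims, 0.75)   # position 8.25
print(q_10, q_75)
```

The two definitions generally give different answers on small samples, so it is worth stating which one is used.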


3. NUMERICAL SUMMARIES
Boxplots
Boxplots provide a graphical summary of the data, based on the five-number
summary – min, 𝑄1 , median, 𝑄3 , max.

Also referred to as box-and-whisker plots.


3. NUMERICAL SUMMARIES
Boxplots

Very useful for identifying symmetry and outliers.

Presented as a box with lines at $Q_1$, the median and $Q_3$, and whiskers which
extend to the min and the max.

Usually, however, the whiskers extend only to the most extreme observations
lying within 1.5 times the interquartile range (IQR) of the box.

Observations beyond the whiskers are plotted as points and flagged as possible
outliers.


3. NUMERICAL SUMMARIES
Boxplots
3. NUMERICAL SUMMARIES
Example 19
The following data show the salaries (in 1000s) of a sample of 30 employees
from a company:

20 21 21 25 25 27 29 30 32 35 36
38 39 45 50 51 52 52 55 56 63 68
69 71 71 73 73 73 74 76

Create a boxplot and comment.
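
The numbers behind the boxplot can be computed before drawing it. The sketch below assumes the position rule $1 + (n-1)p$ with linear interpolation for the quartiles and the common 1.5 × IQR rule for flagging possible outliers; function and variable names are illustrative.

```python
import math

salaries = [20, 21, 21, 25, 25, 27, 29, 30, 32, 35, 36,
            38, 39, 45, 50, 51, 52, 52, 55, 56, 63, 68,
            69, 71, 71, 73, 73, 73, 74, 76]

def quantile(values, p):
    """p quantile at position 1 + (n - 1) p, with linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    pos = 1 + (n - 1) * p
    lower = math.floor(pos)
    frac = pos - lower
    if lower >= n:
        return ordered[-1]
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])

q1, med, q3 = (quantile(salaries, p) for p in (0.25, 0.5, 0.75))
iqr = q3 - q1

# Observations outside these fences would be plotted as individual points.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in salaries if x < lower_fence or x > upper_fence]

print("five-number summary:", min(salaries), q1, med, q3, max(salaries))
print("IQR:", iqr, "possible outliers:", outliers)
```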


3. NUMERICAL SUMMARIES
Example 19: Boxplot
3. NUMERICAL SUMMARIES
Measures of dispersion

1. The sample variance

2. The interquartile range (IQR)

3. The coefficient of variation (CV)

A measure of location is more reliable when the deviation (or dispersion) of
the observations from it is relatively small.
3. NUMERICAL SUMMARIES
The sample variance
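With observations $x_1, x_2, \dots, x_n$, the sample variance is given by:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

$$s^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$$

$$s^2 = \frac{n}{n-1} \left[ \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \bar{x}^2 \right]$$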
3. NUMERICAL SUMMARIES
The sample standard deviation
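The sample standard deviation is simply the square root of the sample
variance.

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

$$s = \sqrt{\frac{n}{n-1} \left[ \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \left( \frac{1}{n} \sum_{i=1}^{n} x_i \right)^2 \right]}$$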
3. NUMERICAL SUMMARIES
Example 20
You are given the following general insurance claim data:

1100 1500 1600 2000 3500 4000 4800 4800 5000 5100

Calculate the sample variance and hence standard deviation.
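
As a quick check that the definitional and computational forms of the sample variance agree, here is a small sketch with illustrative variable names:

```python
import math

claims = [1100, 1500, 1600, 2000, 3500, 4000, 4800, 4800, 5000, 5100]
n = len(claims)
mean = sum(claims) / n

# Definition: sum of squared deviations from the mean, divided by n - 1.
var_def = sum((x - mean) ** 2 for x in claims) / (n - 1)

# Computational form: sum of squares minus (sum)^2 / n, divided by n - 1.
var_short = (sum(x * x for x in claims) - sum(claims) ** 2 / n) / (n - 1)

sd = math.sqrt(var_def)
print(round(var_def, 2), round(var_short, 2), round(sd, 2))
```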


3. NUMERICAL SUMMARIES
The sample variance
For grouped data, the sample variance is given by:

$$s^2 = \frac{n}{n-1} \left[ \frac{1}{n} \sum_{i=1}^{k} f_i x_i^2 - \left( \frac{1}{n} \sum_{i=1}^{k} f_i x_i \right)^2 \right]$$
3. NUMERICAL SUMMARIES
Example 21
The table below gives the distribution of the number of claims per
policyholder for different policyholders in a general insurance portfolio.

Find the sample variance of number of claims per policyholder.

Number of claims  Frequency
0                 65623
1                 12571
2                 1664
3                 148
4                 13
5                 1
6                 0
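
The grouped-data variance formula can be applied to the claims table as in the following sketch (variable names are illustrative):

```python
values = [0, 1, 2, 3, 4, 5, 6]
freq = [65623, 12571, 1664, 148, 13, 1, 0]

n = sum(freq)
sum_fx = sum(f * x for f, x in zip(freq, values))        # sum of f_i * x_i
sum_fx2 = sum(f * x * x for f, x in zip(freq, values))   # sum of f_i * x_i^2

# s^2 = n/(n-1) * [ (1/n) sum f x^2 - ((1/n) sum f x)^2 ]
variance = n / (n - 1) * (sum_fx2 / n - (sum_fx / n) ** 2)
print(sum_fx, sum_fx2, round(variance, 4))
```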
3. NUMERICAL SUMMARIES
The sample variance
Some properties of the sample variance:

a) $s^2 \ge 0$, with $s^2 = 0$ if and only if the variable is constant, i.e. $X = c$.

b) If $y_i = a + b x_i$ with $a$ and $b$ real numbers, then $s_y^2 = b^2 s_x^2$.

c) For any real number $c$, the sum of squares $\sum_{i=1}^{n} (x_i - c)^2$ takes
its minimum value when $c = \bar{x}$.
3. NUMERICAL SUMMARIES
The sample variance
Some properties of the sample variance:

d) If the data consist of $k$ groups of size $n_1, n_2, \dots, n_k$ and
respective variances $s_1^2, s_2^2, \dots, s_k^2$, then the overall variance is
given by:

$$s^2 = \frac{1}{n-1} \sum_{j=1}^{k} (n_j - 1) s_j^2 + \frac{1}{n-1} \sum_{j=1}^{k} n_j (\bar{x}_j - \bar{x})^2$$

where $n = n_1 + n_2 + \dots + n_k$.
3. NUMERICAL SUMMARIES
The interquartile range (IQR)
The IQR is more resistant to extreme observations in the data than the sample
variance:

IQR = 𝑄3 − 𝑄1

i.e. it is the length of the interval containing the central half of the
observations.
3. NUMERICAL SUMMARIES
Example 22
The following data show the salaries (in 1000s) of a sample of 30 employees
from a company:

20 21 21 25 25 27 29 30 32 35 36
38 39 45 50 51 52 52 55 56 63 68
69 71 71 73 73 73 74 76

Find the IQR.


3. NUMERICAL SUMMARIES
The coefficient of variation
The coefficient of variation is defined as:

$$CV = \frac{s}{\bar{x}}$$

This facilitates comparisons as:

It gives a measure of dispersion that is relative to the mean
It does not depend on the scale of the measurement
It is free of any measurement units
3. NUMERICAL SUMMARIES
Example 23
You are given the following sample daily stock prices:

Stock A: 12 15 17 14 13 13

Stock B: 420 440 490 480 485 480

Calculate the coefficient of variation for both.
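
A sketch of the comparison using Python's standard `statistics` module, where `statistics.stdev` uses the $n-1$ divisor as in the sample variance above. The helper name is illustrative.

```python
import statistics

stock_a = [12, 15, 17, 14, 13, 13]
stock_b = [420, 440, 490, 480, 485, 480]

def coefficient_of_variation(values):
    """Sample standard deviation divided by the sample mean."""
    return statistics.stdev(values) / statistics.mean(values)

cv_a = coefficient_of_variation(stock_a)
cv_b = coefficient_of_variation(stock_b)
print(round(cv_a, 4), round(cv_b, 4))
```

Stock B has the larger standard deviation in absolute terms, but relative to its mean price it varies less than Stock A.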


3. NUMERICAL SUMMARIES
Measures of asymmetry and kurtosis

1. Pearson’s symmetry coefficient

2. Pearson’s kurtosis coefficient


3. NUMERICAL SUMMARIES
Measures of asymmetry and kurtosis
First define the sample moments:

The $k$th order moment about the value $c$ is defined as
$\frac{1}{n} \sum_{i=1}^{n} (x_i - c)^k$.

When $c = 0$, we have the simple moments:

$$\nu_k = \frac{1}{n} \sum_{i=1}^{n} x_i^k$$

When $c = \bar{x}$, we have the central moments:

$$\mu_k = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^k$$
3. NUMERICAL SUMMARIES
Pearson’s symmetry coefficient
Pearson’s symmetry coefficient is defined as:

$$\beta_1 = \frac{\mu_3}{s^3} = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^3}{\left[ \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \right]^{3/2}}$$
3. NUMERICAL SUMMARIES
Pearson’s symmetry coefficient
If the data are symmetrically distributed, $\beta_1 = 0$

Positive asymmetry (right-skewed): $\beta_1 > 0$

Negative asymmetry (left-skewed): $\beta_1 < 0$

The converse of the first statement is not true! $\beta_1 = 0$ does not
necessarily imply that the data are symmetric.
3. NUMERICAL SUMMARIES
Example 24 (From Example 7)
The amounts of betting payouts (in pounds) from a particular company in a
given year are shown below:

2527 1787 3770 5701 2310 1652 822 918 2770 4891
3061 2126 1729 4618 3469

Calculate the Pearson’s symmetry coefficient.
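
The coefficient can be computed directly from its definition, with the third central moment using divisor $n$ and $s^2$ using divisor $n-1$. This is a sketch with an illustrative function name, not a library routine.

```python
def pearson_skewness(values):
    """beta_1 = mu_3 / s^3, with mu_3 using divisor n and s^2 using divisor n - 1."""
    n = len(values)
    mean = sum(values) / n
    mu3 = sum((x - mean) ** 3 for x in values) / n
    s2 = sum((x - mean) ** 2 for x in values) / (n - 1)
    return mu3 / s2 ** 1.5

payouts = [2527, 1787, 3770, 5701, 2310, 1652, 822, 918, 2770, 4891,
           3061, 2126, 1729, 4618, 3469]

beta1 = pearson_skewness(payouts)
print(round(beta1, 3))
```

A positive value here is consistent with a longer right tail in the payout data.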


3. NUMERICAL SUMMARIES
Recall Example 7: Histogram
3. NUMERICAL SUMMARIES
Pearson’s kurtosis coefficient
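Pearson’s kurtosis coefficient is defined as:

$$\gamma_2 = \frac{\mu_4}{s^4} - 3 = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^4}{\left[ \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \right]^{2}} - 3$$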
3. NUMERICAL SUMMARIES
Pearson’s kurtosis coefficient
Pearson’s kurtosis coefficient measures the kurtosis of the frequency
distribution of the data (i.e. how peaked the distribution is).

The kurtosis coefficient $\gamma_2$ of any normal distribution is 0.

More peaked than the normal distribution, with heavy tails: $\gamma_2 > 0$

Less peaked than the normal distribution, with light tails: $\gamma_2 < 0$
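
The kurtosis coefficient can be sketched in the same way as the symmetry coefficient, using the fourth central moment (divisor $n$) and $s^2$ (divisor $n-1$). The function name is illustrative, and the payout data of Example 7 is reused only as sample input.

```python
def pearson_kurtosis(values):
    """gamma_2 = mu_4 / s^4 - 3, with mu_4 using divisor n and s^2 using divisor n - 1."""
    n = len(values)
    mean = sum(values) / n
    mu4 = sum((x - mean) ** 4 for x in values) / n
    s2 = sum((x - mean) ** 2 for x in values) / (n - 1)
    return mu4 / s2 ** 2 - 3

payouts = [2527, 1787, 3770, 5701, 2310, 1652, 822, 918, 2770, 4891,
           3061, 2126, 1729, 4618, 3469]

gamma2 = pearson_kurtosis(payouts)
print(round(gamma2, 3))
```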
