
Exploratory Data Analysis

ONNETH O. TEJADA
STAT 1200 – Management Science
2nd Semester, 2021-2022

CENTRAL LUZON STATE UNIVERSITY


Learning Outcomes
DEPARTMENT OF
STATISTICS

After completing this chapter, students must be able to:


• Summarize data using measures of central tendency, such as mean,
median and mode.
• Describe data using measures of variation, such as range, variance and
standard deviation.
• Identify the position of a data value in a dataset using various
measures of position, such as percentiles, deciles, and quartiles.
• Use the different graphical techniques of exploratory data analysis.

2.1 Numerical Summary Measures
2.1.1 Measures of Location
1. Percentile
• Percentiles are one type of quantile (or fractile); quantiles partition data
into groups with roughly the same number of values in each group.
• Percentiles are measures of location, denoted $P_1, P_2, \ldots, P_{99}$, which divide a
set of data into 100 groups with about 1% of the values in each group.
• The percentile that corresponds to a particular data value $x$ is found as
follows:

$$\text{percentile of value } x = \frac{\text{number of values less than } x}{\text{total number of values}} \times 100$$


Example:
The data below show the sales (in thousands of dollars) of 20 randomly selected
businessmen. Find the percentile of 257.4.
215.9 225.8 253.6 254.5 257.4 260.5 266.1 269.8 272.9 285.3
293.9 304.6 320.5 324.8 332.7 336.7 369.8 375.1 442.2 450.1

$$\text{percentile of } 257.4 = \frac{4}{20} \times 100 = 20$$
Therefore, the sales value 257.4 is the 20th percentile: it separates the
lowest 20% of the sales from the highest 80%.
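
The percentile formula can be sketched in Python as follows. This is a minimal illustration rather than part of the original slides; the function name `percentile_of_value` is an assumption chosen for this example.

```python
def percentile_of_value(data, x):
    """Return the percentile of x: the share of values strictly below x, times 100."""
    below = sum(1 for value in data if value < x)  # number of values less than x
    return 100 * below / len(data)                 # divide by the total number of values


sales = [215.9, 225.8, 253.6, 254.5, 257.4, 260.5, 266.1, 269.8, 272.9, 285.3,
         293.9, 304.6, 320.5, 324.8, 332.7, 336.7, 369.8, 375.1, 442.2, 450.1]

# Four of the 20 values lie below 257.4, as in the worked example.
print(percentile_of_value(sales, 257.4))
```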


Converting percentile to a data value:


$$L = \frac{k}{100} \times n$$
Where:
$n$ = total number of values in the data set
$k$ = percentile being used (Example: for the 25th percentile, $k = 25$)
$L$ = locator that gives the position of a value
(Example: for the 12th value in the sorted list, $L = 12$)
$P_k$ = $k$th percentile (Example: $P_{25}$ is the 25th percentile)


Example:
Find the value of the 90th percentile of the sales (in thousands of dollars) of
20 randomly selected businessmen.
215.9 225.8 253.6 254.5 257.4 260.5 266.1 269.8 272.9 285.3
293.9 304.6 320.5 324.8 332.7 336.7 369.8 375.1 442.2 450.1

$$L = \frac{90}{100} \times 20 = 18$$

$L = 18$ means that the 90th percentile is the 18th value, counting from the
lowest. That is, $P_{90} = 375.1$: about 90% of the sales are at or below 375.1
and about 10% are above 375.1.
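
The locator calculation can be sketched in Python as follows. The slides only show cases where $L$ is a whole number, in which case the $L$th sorted value is taken. Rounding a fractional $L$ up to the next whole position is an assumption made for this sketch, and the function name `value_at_percentile` is hypothetical.

```python
import math


def value_at_percentile(data, k):
    """Return P_k using the locator L = (k / 100) * n on the sorted data."""
    ordered = sorted(data)
    n = len(ordered)
    locator = k * n / 100
    # Assumption: a fractional locator is rounded up to the next position.
    position = max(math.ceil(locator), 1)
    return ordered[position - 1]  # positions count from 1, list indices from 0


sales = [215.9, 225.8, 253.6, 254.5, 257.4, 260.5, 266.1, 269.8, 272.9, 285.3,
         293.9, 304.6, 320.5, 324.8, 332.7, 336.7, 369.8, 375.1, 442.2, 450.1]

print(value_at_percentile(sales, 90))  # L = 18, so the 18th sorted value
```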

2. Quartiles
• Quartiles are measures of location, denoted by $Q_1, Q_2, Q_3$, which divide a
set of data into four groups with about 25% of the values in each group.
$Q_1$ (First quartile): Separates the bottom 25% of the sorted values from the top
75%. (To be more precise, at least 25% of the sorted values are
less than or equal to $Q_1$, and at least 75% of the values are greater
than or equal to $Q_1$.)
$Q_2$ (Second quartile): Same as the median; separates the bottom 50% of the sorted
values from the top 50%.
$Q_3$ (Third quartile): Separates the bottom 75% of the sorted values from the top
25%. (To be more precise, at least 75% of the sorted values are
less than or equal to $Q_3$, and at least 25% of the values are greater
than or equal to $Q_3$.)

Example:
Find the value of the 1st quartile of the sales (in thousands of dollars) of
20 randomly selected businessmen.
215.9 225.8 253.6 254.5 257.4 260.5 266.1 269.8 272.9 285.3
293.9 304.6 320.5 324.8 332.7 336.7 369.8 375.1 442.2 450.1

Note: $Q_1 = P_{25}$, $Q_2 = P_{50}$, $Q_3 = P_{75}$

$$L = \frac{25}{100} \times 20 = 5$$
$L = 5$ means that the 25th percentile is the 5th value, counting from the lowest.
That is, $P_{25} = Q_1 = 257.4$.


Some other statistics defined using quartiles and percentiles are the
following:
• interquartile range (or IQR)
$$\text{IQR} = Q_3 - Q_1$$
• semi-interquartile range
$$\frac{Q_3 - Q_1}{2}$$
• midquartile
$$\frac{Q_3 + Q_1}{2}$$
• 10–90 percentile range
$$P_{90} - P_{10}$$
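
As an illustration, the four statistics can be computed for the sales data used in the earlier examples. This is a sketch only: the helper `percentile` is a hypothetical name, and it takes the $L$th sorted value for the locator $L = (k/100) \times n$, rounding up when $L$ is fractional (an assumption, since the slides only show whole-number locators).

```python
import math


def percentile(data, k):
    """P_k via the locator L = (k / 100) * n; fractional L is rounded up (assumption)."""
    ordered = sorted(data)
    position = max(math.ceil(k * len(ordered) / 100), 1)
    return ordered[position - 1]


sales = [215.9, 225.8, 253.6, 254.5, 257.4, 260.5, 266.1, 269.8, 272.9, 285.3,
         293.9, 304.6, 320.5, 324.8, 332.7, 336.7, 369.8, 375.1, 442.2, 450.1]

q1, q3 = percentile(sales, 25), percentile(sales, 75)
p10, p90 = percentile(sales, 10), percentile(sales, 90)

iqr = q3 - q1                 # interquartile range
semi_iqr = (q3 - q1) / 2      # semi-interquartile range
midquartile = (q3 + q1) / 2   # midquartile
range_10_90 = p90 - p10       # 10-90 percentile range

print(round(iqr, 2), round(semi_iqr, 2), round(midquartile, 2), round(range_10_90, 2))
```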


2.1.2 Measures of Central Tendency


• This chapter presents several numerical measures that provide additional
alternatives for summarizing data
• A number that is meant to convey the idea of ‘centralness’ for the data set
• A value about which the set of observations tend to cluster
• Typical/average value of the data set

3 measures of central tendency:


1. Arithmetic mean
2. Median
3. Mode

1. Arithmetic mean
• the most common average
• Sum of all observations divided by the number of observations.

Population mean:
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$
Where:
$X_i$ = value of the $i$th observation
$N$ = number of observations or population size
$i = 1, 2, \ldots, N$

Sample mean:
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$
Where:
$X_i$ = value of the $i$th observation
$n$ = number of observations or sample size
$i = 1, 2, \ldots, n$


Properties of the Mean:


• It always exists.
• It is unique.
• It reflects the magnitude of every observation
• It is easily affected by extreme values.
• The mean of the subgroups can be combined into the overall mean of all
the data, called the weighted mean.
• The sample mean is a point estimator of the population mean.
• The mean discussed here is also called the arithmetic mean. There are other
kinds of mean, such as the harmonic mean (for averaging rates) and the
geometric mean.

Example 1:
Given the following temperatures (in degrees Celsius):
33 32 30 29 25 30 32
Find the mean temperature.

Solution:
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{33+32+30+29+25+30+32}{7} = \frac{211}{7} = 30.14$$
Therefore, the mean temperature is 30.14 degrees Celsius.


Example 2:
Given the following weights (in kg):
49, 51, 65, 49, 58, 60
Find the mean weight.

Solution:
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{49+51+65+49+58+60}{6} = \frac{332}{6} = 55.33$$
This implies that the mean weight is 55.33 kg.
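
Both examples can be checked with a few lines of Python. This is a minimal sketch of the sample mean formula; the function name `arithmetic_mean` is an assumption chosen for this illustration.

```python
def arithmetic_mean(observations):
    """Sum of all observations divided by the number of observations."""
    return sum(observations) / len(observations)


temperatures = [33, 32, 30, 29, 25, 30, 32]  # Example 1, degrees Celsius
weights = [49, 51, 65, 49, 58, 60]           # Example 2, kg

print(round(arithmetic_mean(temperatures), 2))
print(round(arithmetic_mean(weights), 2))
```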


2. Median (or middle observation)


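• denoted by $\tilde{X}$ or Md
• is a single value that divides an array of observations into two equal parts,
such that half of the observations are above it and half are below it
✓ Check first that the data are arranged in an array (sorted order)
In symbols:
$$\tilde{X} = \begin{cases} X_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd} \\[4pt] \dfrac{X_{\left(\frac{n}{2}\right)} + X_{\left(\frac{n}{2}+1\right)}}{2} & \text{if } n \text{ is even} \end{cases}$$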


Properties of the Median:


• It is a positional value.
• Extreme values do not affect the median as strongly as they do the mean.

Remark:
• The median is the measure of location most often reported for annual
income and property value data because a few extremely large incomes or
property values can inflate the mean. In such cases, the median is the
preferred measure of central location.


Example 1:

Given the following temperatures (in degrees Celsius):


33 32 30 29 25 30 32
Find the median.

Solution:
Arrange the data into array form: 25 29 30 30 32 32 33.
Since n = 7 is an odd number,
$$\tilde{X} = X_{\left(\frac{n+1}{2}\right)} = X_{\left(\frac{7+1}{2}\right)} = X_{(4)}$$
Therefore, the median is the 4th observation in the array, which is 30.

Example 2:
Given the following weights (in kg):
49, 51, 65, 49, 58, 60
Find the median weight.
Solution:
Arrange the data into array form: 49, 49, 51, 58, 60, 65.
Since n = 6 (no. of observations) is an even number,
$$\tilde{X} = \frac{X_{\left(\frac{n}{2}\right)} + X_{\left(\frac{n}{2}+1\right)}}{2} = \frac{X_{\left(\frac{6}{2}\right)} + X_{\left(\frac{6}{2}+1\right)}}{2} = \frac{X_{(3)} + X_{(4)}}{2} = \frac{51+58}{2} = 54.5$$
Therefore, the median is the average of the 3rd and 4th observations in the
array, which is 54.5 kg.
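
The odd and even cases of the median formula can be sketched in Python as follows. This is an illustration only; the function name `median` is chosen for this sketch, and it sorts the data first, matching the instruction to arrange the observations in an array.

```python
def median(observations):
    """Middle value of the sorted data (odd n) or mean of the two middle values (even n)."""
    ordered = sorted(observations)  # arrange the data into array form
    n = len(ordered)
    if n % 2 == 1:
        # Position (n + 1) / 2, counting from 1
        return ordered[(n + 1) // 2 - 1]
    # Average of positions n / 2 and n / 2 + 1, counting from 1
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


temperatures = [33, 32, 30, 29, 25, 30, 32]  # Example 1, n = 7
weights = [49, 51, 65, 49, 58, 60]           # Example 2, n = 6

print(median(temperatures))
print(median(weights))
```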

3. Mode
• Denoted by $\hat{X}$ or Mo
• Locates the point where the observation values occur with the greatest
density
• Generally a less popular measure than the mean or the median
• Determined by counting the frequency of each value and finding the value
with the highest frequency of occurrence

Example: 2, 5, 2, 3, 5, 2, 1, 4, 2, 2, 2, 1, 2, 2, 2, 3, 2, 2, 2, 2
Answer: To find the mode, find the frequency of each observation. The value 2
occurs 13 times, more often than any other value; therefore, the mode is 2.

Characteristics of the Mode


• It does not always exist; and if it does, it may not be unique. A data set
is said to be unimodal if there is only one mode, bimodal if there are
two modes, trimodal if there are three modes, and so on.
• It is not affected by extreme values.
• The mode can be used for qualitative as well as quantitative data.
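
Counting frequencies and allowing for zero, one, or several modes can be sketched in Python as follows. This is an illustration, not the slides' own procedure: the function name `modes` is hypothetical, and treating a data set in which every distinct value occurs equally often as having no mode is an assumed convention.

```python
from collections import Counter


def modes(values):
    """Return the list of modes; empty if the data set has no mode."""
    if not values:
        return []
    counts = Counter(values)          # frequency of each distinct value
    highest = max(counts.values())
    # Assumed convention: if all distinct values tie, there is no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return [value for value, c in counts.items() if c == highest]


data = [2, 5, 2, 3, 5, 2, 1, 4, 2, 2, 2, 1, 2, 2, 2, 3, 2, 2, 2, 2]
print(modes(data))                    # unimodal example from the slide
print(modes([1, 1, 2, 2, 3]))         # bimodal
print(modes(["red", "blue", "red"]))  # works for qualitative data as well
```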


Graphical Comparison of the 3 Measures of Central Tendency


• Measures of skewness indicate to what degree and in which direction
(positive or negative) a frequency distribution departs from symmetry.
• In mathematics, a figure is called symmetric if a line drawn through some
point of it, perpendicular to the X-axis, divides the figure into two
congruent parts, i.e., parts that are mirror images of each other and can
be superimposed on one another.
• In statistics, the mean, median and mode of a symmetric (unimodal)
distribution coincide. When they differ, the distribution is asymmetric.


1. Normal (Symmetric Distribution)


• Bell-shaped
• Symmetric, meaning the left side is just the mirror image of the right side
• The three measures of central tendency have the same value

$\bar{x} = \tilde{x} = \hat{x}$ (mean = median = mode)


2. Positively Skewed (Skewed to the right)


• Distribution tapers more to the right than to the left
• Longer tail to the right
• More concentration of values below than above the mean

$\bar{x} > \tilde{x} > \hat{x}$ (mean > median > mode)



3. Negatively Skewed (Skewed to the left)


• Distribution tapers more to the left than to the right
• Longer tail to the left
• More concentration of values above than below the mean

$\bar{x} < \tilde{x} < \hat{x}$ (mean < median < mode)
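
The ordering of the mean and the median gives a quick, rough indication of the direction of skew. The sketch below applies it to the sales data from Section 2.1.1. It is a heuristic only, not a formal skewness measure, and the function name `skew_direction` is an assumption.

```python
import math
from statistics import mean, median


def skew_direction(data):
    """Compare mean and median as a rough indicator of skew direction."""
    m, md = mean(data), median(data)
    if math.isclose(m, md):
        return "roughly symmetric"
    return "positively skewed" if m > md else "negatively skewed"


sales = [215.9, 225.8, 253.6, 254.5, 257.4, 260.5, 266.1, 269.8, 272.9, 285.3,
         293.9, 304.6, 320.5, 324.8, 332.7, 336.7, 369.8, 375.1, 442.2, 450.1]

# A few large sales values pull the mean above the median.
print(round(mean(sales), 2), round(median(sales), 2), skew_direction(sales))
```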


2.1.3 Measures of Variability


• A measure of variability is a quantity that measures the spread or
variability of the observations in a given population.

Common measures of variability:


1. Range
2. Variance
3. Standard deviation
4. Standard error of the mean
5. Coefficient of variation


1. Range
• difference between the highest value (HV) and the lowest value (LV) in the
population
• it uses only the extreme values
• it fails to communicate any information about the clustering or the lack of
clustering of the values between the extremes
• a weakness is that an outlier can greatly alter its value
R = HV − LV = max − min


2. Variance
• the mean of the squared deviations of the observations from the mean (for a
sample, the sum of squared deviations is divided by n − 1)

Characteristics of the Variance


• Always non-negative
• A large variance corresponds to a highly dispersed set of values
• Easy to manipulate for further mathematical computations
• Makes use of all the observations in the data
• Expressed in the square of the unit of the data


3. Standard Deviation
• A measure of the typical deviation of the individual values in a distribution
from the mean of the distribution; computed as the square root of the variance
• Values close together have a small standard deviation, but values with
much more variation have a larger standard deviation.
• The standard deviation has the same units of measurement (such as
minutes or grams or dollars) as the original data values.
• It is affected by the value of every observation. It may be distorted by a few
extreme values.


4. Standard Error of the mean


• standard deviation of the sampling distribution of the mean
Population: $\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}}$    Sample: $s_{\bar{X}} = \dfrac{s}{\sqrt{n}}$
(where $n$ is the sample size)

5. Coefficient of Variation (CV)


• defined as the ratio of the standard deviation to the mean, expressed as a
percentage
• unitless; useful for comparing two data sets with different units of
measurement
Population: $CV = \dfrac{\sigma}{\mu} \times 100$    Sample: $CV = \dfrac{s}{\bar{X}} \times 100$


Example 1:
Given the following temperatures (in degrees Celsius):
33 32 30 29 25 30 32
Solution:
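a. Range = R = HV − LV = 33 − 25 = 8
b. Variance:
$$s^2 = \frac{\sum X_i^2 - \frac{(\sum X_i)^2}{n}}{n-1} = \frac{6{,}403 - \frac{(211)^2}{7}}{7-1} = \frac{6{,}403 - 6{,}360.14}{6} = \frac{42.86}{6} = 7.14$$
c. Standard deviation = $s = \sqrt{s^2} = \sqrt{7.14} = 2.67$
d. Standard error of the mean = $s_{\bar{X}} = \dfrac{s}{\sqrt{n}} = \dfrac{2.67}{\sqrt{7}} = 1.01$
e. Coefficient of variation = $CV = \dfrac{s}{\bar{X}} \times 100 = \dfrac{2.67}{30.14} \times 100 = 8.86\%$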

Example 2:
Given the following weights (in kg):
49, 51, 65, 49, 58, 60
Solution:
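a. Range = R = HV − LV = 65 − 49 = 16
b. Variance:
$$s^2 = \frac{\sum X_i^2 - \frac{(\sum X_i)^2}{n}}{n-1} = \frac{18{,}592 - \frac{(332)^2}{6}}{6-1} = \frac{18{,}592 - 18{,}370.67}{5} = \frac{221.33}{5} = 44.27$$
c. Standard deviation = $s = \sqrt{s^2} = \sqrt{44.27} = 6.65$
d. Standard error of the mean = $s_{\bar{X}} = \dfrac{s}{\sqrt{n}} = \dfrac{6.65}{\sqrt{6}} = 2.71$
e. Coefficient of variation = $CV = \dfrac{s}{\bar{X}} \times 100 = \dfrac{6.65}{55.33} \times 100 = 12.02\%$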
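
The five sample measures can be computed together in Python. The sketch below uses the same computational formula for the variance as the worked examples; the function name `variability_summary` is an assumption. Because the code carries unrounded intermediate values, the last digit of the standard error or CV can differ slightly from a hand computation that rounds $s$ first.

```python
import math


def variability_summary(data):
    """Range, sample variance, standard deviation, standard error, and CV (%)."""
    n = len(data)
    total = sum(data)                    # sum of X_i
    total_sq = sum(x * x for x in data)  # sum of X_i squared
    mean = total / n
    variance = (total_sq - total ** 2 / n) / (n - 1)
    sd = math.sqrt(variance)
    return {
        "range": max(data) - min(data),
        "variance": variance,
        "sd": sd,
        "se": sd / math.sqrt(n),
        "cv": sd / mean * 100,
    }


temperatures = [33, 32, 30, 29, 25, 30, 32]  # Example 1
weights = [49, 51, 65, 49, 58, 60]           # Example 2

for name, data in (("temperatures", temperatures), ("weights", weights)):
    summary = variability_summary(data)
    print(name, {key: round(value, 2) for key, value in summary.items()})
```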

Supplement to the Use of Standard Deviation


1. Empirical Rule
For data having bell-shaped
distribution:
• Approximately 68% of the data values will be within 1 sd of the mean.
• Approximately 95% of the data values will be within 2 sd of the mean.
• Almost all (about 99.7%) of the values will be within 3 sd of the mean.
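
For an exactly normal distribution, these percentages follow from the normal curve and can be checked with the error function in Python's standard library. This is a sketch for illustration; the function name `normal_coverage` is an assumption, and real bell-shaped data will only match these figures approximately.

```python
import math


def normal_coverage(k):
    """Proportion of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(100 * normal_coverage(k), 2))
```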


2. Chebyshev’s Theorem
• enables us to make statements about the proportions of data values that
must be within a specified number of standard deviations of the mean
• Chebyshev’s Theorem can be applied to any data set regardless of the
shape of the distribution
• it states that at least $(1 - 1/z^2)$ of the data values must be within $z$ standard
deviations of the mean, where $z$ is any value greater than 1
Some implications of this theorem, with z = 2, 3, and 4 standard deviations
of the mean:
• At least 75% of the data values must be within z = 2 standard deviations of the mean.
• At least 89% of the data values must be within z = 3 standard deviations of the mean.
• At least 94% of the data values must be within z = 4 standard deviations of the mean.
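
The bound can be compared with what actually happens in a data set. The sketch below uses the sales data from Section 2.1.1, which are not bell-shaped, and checks that the observed proportion within $z$ standard deviations is never below the Chebyshev bound. The function names are assumptions, and the sample standard deviation (divisor $n - 1$) is used here.

```python
import math


def chebyshev_bound(z):
    """Minimum proportion of values within z standard deviations of the mean (z > 1)."""
    if z <= 1:
        raise ValueError("z must be greater than 1")
    return 1 - 1 / z ** 2


def fraction_within(data, z):
    """Observed proportion of values within z sample standard deviations of the mean."""
    n = len(data)
    mean = sum(data) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return sum(1 for x in data if abs(x - mean) <= z * sd) / n


sales = [215.9, 225.8, 253.6, 254.5, 257.4, 260.5, 266.1, 269.8, 272.9, 285.3,
         293.9, 304.6, 320.5, 324.8, 332.7, 336.7, 369.8, 375.1, 442.2, 450.1]

for z in (2, 3, 4):
    print(z, round(chebyshev_bound(z), 4), fraction_within(sales, z))
```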

2.2 Graphical Summary Techniques
2.2.1 Stem-and-Leaf Display
A statistical technique to present a set of data. Each numerical value is
divided into two parts. The leading digit(s) becomes the stem and the
trailing digit the leaf. The stems are located along the vertical axis, and the
leaf values are stacked against each other along the horizontal axis.
Example:
temperatures (in degrees Celsius):
15 25 26 27 28
29 29 30 30 31
32 32 33 34 35
36 37 40 42 51

Stem | Leaf
1    | 5
2    | 5 6 7 8 9 9
3    | 0 0 1 2 2 3 4 5 6 7
4    | 0 2
5    | 1
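
The display can be reproduced with a short Python sketch. It assumes non-negative whole numbers, with the last digit as the leaf and the remaining leading digit(s) as the stem; the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group sorted integer values by stem (leading digits) with leaves (last digit)."""
    display = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # e.g. 37 -> stem 3, leaf 7
        display[stem].append(leaf)
    return dict(display)


temperatures = [15, 25, 26, 27, 28, 29, 29, 30, 30, 31,
                32, 32, 33, 34, 35, 36, 37, 40, 42, 51]

for stem, leaves in stem_and_leaf(temperatures).items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```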


2.2.2 Boxplots
Boxplots give us information about the distribution and spread of the data.
Procedure for Constructing a Boxplot
1. Find the 5-number summary consisting of the minimum value, $Q_1$, the median, $Q_3$,
and the maximum value.
2. Construct a scale with values that include the minimum and maximum data
values.
3. Construct a box (rectangle) extending from $Q_1$ to $Q_3$ and draw a line in the box at
the median value.
4. Draw lines extending outward from the box to the minimum and maximum data
values.
*In a boxplot, a data value is an outlier if it is above $Q_3 + 1.5 \times \text{IQR}$ or below
$Q_1 - 1.5 \times \text{IQR}$.

Example: Boxplot

[Figure: boxplot annotated with the criteria for identifying outliers, $Q_1 - 1.5 \times \text{IQR}$ and $Q_3 + 1.5 \times \text{IQR}$]
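
The numbers needed to draw a boxplot, together with the outlier check, can be sketched in Python for the sales data from Section 2.1.1. This is an illustration under stated assumptions: the quartiles use the locator $L = (k/100) \times n$ from Section 2.1.1 (rounding a fractional $L$ up), the median uses the average of the two middle values, and the names `percentile` and `boxplot_summary` are hypothetical.

```python
import math
from statistics import median


def percentile(data, k):
    """P_k via the locator L = (k / 100) * n; fractional L is rounded up (assumption)."""
    ordered = sorted(data)
    position = max(math.ceil(k * len(ordered) / 100), 1)
    return ordered[position - 1]


def boxplot_summary(data):
    """Five-number summary plus the values flagged by the 1.5 x IQR criteria."""
    q1, q3 = percentile(data, 25), percentile(data, 75)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [x for x in sorted(data) if x < lower_fence or x > upper_fence]
    return {"min": min(data), "q1": q1, "median": median(data),
            "q3": q3, "max": max(data), "outliers": outliers}


sales = [215.9, 225.8, 253.6, 254.5, 257.4, 260.5, 266.1, 269.8, 272.9, 285.3,
         293.9, 304.6, 320.5, 324.8, 332.7, 336.7, 369.8, 375.1, 442.2, 450.1]

print(boxplot_summary(sales))
```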


2.2.3 Histogram
• A graph in which the classes are marked on the horizontal axis and the class
frequencies on the vertical axis.
• The class frequencies are represented by the heights of the bars, and the
bars are drawn adjacent to each other.
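
The class frequencies behind a histogram can be tallied with a short Python sketch. It reuses the temperature data from the stem-and-leaf example; the class limits (starting at 10 with a width of 10) and the function name `class_frequencies` are assumptions made for this illustration.

```python
def class_frequencies(values, lower, width, classes):
    """Count how many values fall in each of `classes` adjacent classes of equal width."""
    counts = [0] * classes
    for value in values:
        index = int((value - lower) // width)  # which class the value belongs to
        if 0 <= index < classes:
            counts[index] += 1
    return counts


temperatures = [15, 25, 26, 27, 28, 29, 29, 30, 30, 31,
                32, 32, 33, 34, 35, 36, 37, 40, 42, 51]

counts = class_frequencies(temperatures, lower=10, width=10, classes=5)

# Text version of the histogram: one bar per class, bar length = class frequency.
for i, count in enumerate(counts):
    start = 10 + 10 * i
    print(f"{start}-{start + 9}: " + "#" * count)
```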


2.2.4 Scatterplot
• A type of mathematical diagram using Cartesian coordinates to display
values for two variables for a set of data


