STAT 1200 - 2 Exploratory Data Analysis
ONNETH O. TEJADA
STAT 1200 – Management Science
2nd Semester, 2021-2022
Example:
The data below show the sales (in thousand $) of 20 randomly selected
businessmen. Find the percentile of 257.4.
215.9 225.8 253.6 254.5 257.4 260.5 266.1 269.8 272.9 285.3
293.9 304.6 320.5 324.8 332.7 336.7 369.8 375.1 442.2 450.1
Four of the 20 values are less than 257.4, so
$$\text{percentile of } 257.4 = \frac{4}{20} \times 100 = 20$$
Therefore, the sales value 257.4 is the 20th percentile. It separates the
lowest 20% of the sales from the highest 80%.
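The calculation above can be sketched in Python. This is a minimal illustration, assuming the percentile of a value is the share of observations strictly below it; the function name `percentile_of_value` is hypothetical and not part of the course material.

```python
def percentile_of_value(data, x):
    """Percentile of x: (number of values less than x) / (total number) * 100."""
    below = sum(1 for value in data if value < x)
    return 100 * below / len(data)


sales = [215.9, 225.8, 253.6, 254.5, 257.4, 260.5, 266.1, 269.8, 272.9, 285.3,
         293.9, 304.6, 320.5, 324.8, 332.7, 336.7, 369.8, 375.1, 442.2, 450.1]

print(percentile_of_value(sales, 257.4))  # 20.0
```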
Example:
Find the value of the 90th percentile of the sales (in thousand $) of 20
randomly selected businessmen.
215.9 225.8 253.6 254.5 257.4 260.5 266.1 269.8 272.9 285.3
293.9 304.6 320.5 324.8 332.7 336.7 369.8 375.1 442.2 450.1
$$L = \frac{90}{100} \times 20 = 18$$
$L = 18$ means that the 90th percentile is the 18th value, counting from the
lowest. That is, $P_{90} = 375.1$, which indicates that about 90% of the sales
are below 375.1 and about 10% are above 375.1.
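A short Python sketch of this locator approach follows. It mirrors the worked example, where a whole-number locator points directly at that position in the sorted data. Rounding a fractional locator up is an assumption, since the example does not cover that case, and some textbooks instead average two neighbouring values when the locator is whole. The name `percentile_value` is hypothetical.

```python
import math


def percentile_value(data, k):
    """Value of the k-th percentile using the locator L = (k / 100) * n."""
    values = sorted(data)
    locator = k * len(values) / 100
    # Assumption: a fractional locator is rounded up to the next position.
    position = max(1, math.ceil(locator))
    return values[position - 1]  # positions are counted from 1


sales = [215.9, 225.8, 253.6, 254.5, 257.4, 260.5, 266.1, 269.8, 272.9, 285.3,
         293.9, 304.6, 320.5, 324.8, 332.7, 336.7, 369.8, 375.1, 442.2, 450.1]

print(percentile_value(sales, 90))  # 375.1
```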
Exploratory Data Analysis | 6
DEPARTMENT OF
STATISTICS
2. Quartiles
• Quartiles are measures of location, denoted by $Q_1$, $Q_2$, $Q_3$, which divide a
set of data into four groups with about 25% of the values in each group.
$Q_1$ (First quartile): Separates the bottom 25% of the sorted values from the top
75%. (To be more precise, at least 25% of the sorted values are
less than or equal to $Q_1$ and at least 75% of the values are greater
than or equal to $Q_1$.)
$Q_2$ (Second quartile): Same as the median; separates the bottom 50% of the sorted
values from the top 50%.
$Q_3$ (Third quartile): Separates the bottom 75% of the sorted values from the top
25%. (To be more precise, at least 75% of the sorted values are
less than or equal to $Q_3$ and at least 25% of the values are greater
than or equal to $Q_3$.)
Example:
Find the value of the 1st quartile of the sales (in thousand $) of 20
randomly selected businessmen.
215.9 225.8 253.6 254.5 257.4 260.5 266.1 269.8 272.9 285.3
293.9 304.6 320.5 324.8 332.7 336.7 369.8 375.1 442.2 450.1
Some other statistics defined using quartiles and percentiles are the
following:
• interquartile range (or IQR): $Q_3 - Q_1$
• semi-interquartile range: $\frac{Q_3 - Q_1}{2}$
• midquartile: $\frac{Q_3 + Q_1}{2}$
• 10–90 percentile range: $P_{90} - P_{10}$
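As a rough illustration, the four statistics can be computed for the sales data in Python. The sketch assumes quartiles and percentiles are located with $L = \frac{k}{100} \times n$ and read off at that position (rounding up when fractional). Other quartile conventions give slightly different values. The helper `percentile_value` is hypothetical.

```python
import math


def percentile_value(data, k):
    """k-th percentile via the locator L = (k / 100) * n (assumed convention)."""
    values = sorted(data)
    position = max(1, math.ceil(k * len(values) / 100))
    return values[position - 1]


sales = [215.9, 225.8, 253.6, 254.5, 257.4, 260.5, 266.1, 269.8, 272.9, 285.3,
         293.9, 304.6, 320.5, 324.8, 332.7, 336.7, 369.8, 375.1, 442.2, 450.1]

q1 = percentile_value(sales, 25)   # first quartile = 25th percentile
q3 = percentile_value(sales, 75)   # third quartile = 75th percentile
p10 = percentile_value(sales, 10)
p90 = percentile_value(sales, 90)

iqr = q3 - q1                      # interquartile range
semi_iqr = (q3 - q1) / 2           # semi-interquartile range
midquartile = (q3 + q1) / 2        # midquartile
range_10_90 = p90 - p10            # 10-90 percentile range

print(round(iqr, 2), round(semi_iqr, 2), round(midquartile, 2), round(range_10_90, 2))
```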
1. Arithmetic mean
• the most common average
• Sum of all observations divided by the number of observations.
Example 1:
Given the following temperatures (in degrees Celsius):
33 32 30 29 25 30 32
Find the mean temperature.
Solution:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{33+32+30+29+25+30+32}{7} = \frac{211}{7} = 30.14$$
Therefore, the mean temperature is 30.14 degrees Celsius.
Example 2:
Given the following weights (in kg):
49, 51, 65, 49, 58, 60
Find the mean weight.
Solution:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{49+51+65+49+58+60}{6} = \frac{332}{6} = 55.33$$
Therefore, the mean weight is 55.33 kg.
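Both mean examples can be checked with a few lines of Python. This is only a sketch of the definition (sum of the observations divided by their number); the function name `arithmetic_mean` is hypothetical.

```python
def arithmetic_mean(values):
    """Sum of all observations divided by the number of observations."""
    return sum(values) / len(values)


temperatures = [33, 32, 30, 29, 25, 30, 32]
weights = [49, 51, 65, 49, 58, 60]

print(round(arithmetic_mean(temperatures), 2))  # 30.14
print(round(arithmetic_mean(weights), 2))       # 55.33
```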
Remark:
• The median is the measure of location most often reported for annual
income and property value data because a few extremely large incomes or
property values can inflate the mean. In such cases, the median is the
preferred measure of central location.
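Example 1:
Given the following temperatures (in degrees Celsius):
33 32 30 29 25 30 32
Find the median temperature.
Solution:
Arrange the data into array form: 25 29 30 30 32 32 33.
Since n = 7 is an odd number,
$$\tilde{x} = x_{\left(\frac{n+1}{2}\right)} = x_{\left(\frac{7+1}{2}\right)} = x_{(4)}$$
Therefore, the median is the 4th observation in the array, which is 30.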
Example 2:
Given the following weights (in kg):
49, 51, 65, 49, 58, 60
Find the median weight.
Solution:
Arrange the data into array form: 49, 49, 51, 58, 60, 65.
Since n = 6 (the number of observations) is an even number,
$$\tilde{x} = \frac{x_{(n/2)} + x_{(n/2+1)}}{2} = \frac{x_{(6/2)} + x_{(6/2+1)}}{2} = \frac{x_{(3)} + x_{(4)}}{2} = \frac{51+58}{2} = 54.5$$
Therefore, the median is the average of the 3rd and 4th observations in the
array, which is 54.5 kg.
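The odd and even cases can be sketched together in Python. The function below is a hypothetical helper that sorts the data and then applies the two position rules used in the examples.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)-th observation (index mid when counting from 0).
        return ordered[mid]
    # Even n: average of the (n / 2)-th and (n / 2 + 1)-th observations.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([33, 32, 30, 29, 25, 30, 32]))  # 30
print(median([49, 51, 65, 49, 58, 60]))      # 54.5
```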
3. Mode
• Denoted by $\hat{x}$ or Mo
• Locates the point where the observation values occur with the greatest
density
• Generally a less popular measure than the mean or the median
• Determined by counting the frequency of each value and finding the value
with the highest frequency of occurrence
Example: 2, 5, 2, 3, 5, 2, 1, 4, 2, 2, 2, 1, 2, 2, 2, 3, 2, 2, 2, 2
Answer: To find the mode, find the frequency for each observation.
Therefore, the mode is 2, the observation with the highest frequency (it occurs 13 times).
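The frequency count can be reproduced with Python's `collections.Counter`. This is a sketch; the function name `modes` is hypothetical, and it returns a list so that ties (more than one mode) are reported rather than hidden.

```python
from collections import Counter


def modes(values):
    """Return every value that attains the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)


data = [2, 5, 2, 3, 5, 2, 1, 4, 2, 2, 2, 1, 2, 2, 2, 3, 2, 2, 2, 2]

# Frequency of each observation, listed by value.
for value, count in sorted(Counter(data).items()):
    print(value, count)

print(modes(data))  # [2]
```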
$\bar{x} = \tilde{x} = \hat{x}$
1. Range
• difference between the highest value (HV) and the lowest value (LV) in the
population
• it uses only the extreme values
• it fails to communicate any information about the clustering or the lack of
clustering of the values between the extremes
• a weakness is that an outlier can greatly alter its value
R = HV − LV = max − min
2. Variance
• the mean of the squared deviations of the observations from the mean
3. Standard Deviation
• The average deviation between the individual scores in the distribution and
the mean for the distribution; square root of the variance
• Values close together have a small standard deviation, but values with
much more variation have a larger standard deviation.
• The standard deviation has the same units of measurement (such as
minutes or grams or dollars) as the original data values.
• It is affected by the value of every observation. It may be distorted by a few
extreme values.
Example 1:
Given the following temperatures (in degrees Celsius):
33 32 30 29 25 30 32
Solution:
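a. Range = R = HV − LV = 33 − 25 = 8
b. Variance = $s^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n-1} = \frac{6{,}403 - \frac{(211)^2}{7}}{7-1} = \frac{6{,}403 - 6{,}360.14}{6} = \frac{42.86}{6} = 7.14$
c. Standard deviation = $s = \sqrt{s^2} = \sqrt{7.14} = 2.67$
d. Standard error of the mean = $s_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{2.67}{\sqrt{7}} = 1.01$
e. Coefficient of variation = $CV = \frac{s}{\bar{x}} \times 100 = \frac{2.67}{30.14} \times 100 = 8.86\%$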
Example 2:
Given the following weights (in kg):
49, 51, 65, 49, 58, 60
Solution:
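a. Range = R = HV − LV = 65 − 49 = 16
b. Variance = $s^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n-1} = \frac{18{,}592 - \frac{(332)^2}{6}}{6-1} = \frac{18{,}592 - 18{,}370.67}{5} = \frac{221.33}{5} = 44.27$
c. Standard deviation = $s = \sqrt{s^2} = \sqrt{44.27} = 6.65$
d. Standard error of the mean = $s_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{6.65}{\sqrt{6}} = 2.71$
e. Coefficient of variation = $CV = \frac{s}{\bar{x}} \times 100 = \frac{6.65}{55.33} \times 100 = 12.02\%$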
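The five measures in these examples can be sketched in Python using the same computational formula for the sample variance. The function name `dispersion_summary` is hypothetical. Because the code carries unrounded intermediate values, the last digit of the standard error or coefficient of variation may differ slightly from hand calculations that round at each step.

```python
import math


def dispersion_summary(values):
    """Range, sample variance, standard deviation, standard error, and CV (%)."""
    n = len(values)
    total = sum(values)                      # sum of x
    total_sq = sum(v * v for v in values)    # sum of x squared
    mean = total / n
    # Computational formula: (sum x^2 - (sum x)^2 / n) / (n - 1)
    variance = (total_sq - total ** 2 / n) / (n - 1)
    sd = math.sqrt(variance)
    return {
        "range": max(values) - min(values),
        "variance": variance,
        "sd": sd,
        "se": sd / math.sqrt(n),
        "cv": sd / mean * 100,
    }


temperatures = [33, 32, 30, 29, 25, 30, 32]
weights = [49, 51, 65, 49, 58, 60]

for name, data in (("temperatures", temperatures), ("weights", weights)):
    summary = dispersion_summary(data)
    print(name, {key: round(value, 2) for key, value in summary.items()})
```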
2. Chebyshev’s Theorem
• enables us to make statements about the proportions of data values that
must be within a specified number of standard deviations of the mean
• Chebyshev’s Theorem can be applied to any data set regardless of the
shape of the distribution
• it states that at least $(1 - 1/z^2)$ of the data values must be within z standard
deviations of the mean, where z is any value greater than 1
Some implications of this theorem, with z = 2, 3, and 4 standard deviations
of the mean:
▪ At least 75% of the data values must be within z = 2 standard deviations of the mean.
▪ At least 89% of the data values must be within z = 3 standard deviations of the mean.
▪ At least 94% of the data values must be within z = 4 standard deviations of the mean.
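The three percentages follow directly from the bound $1 - 1/z^2$, as this short Python sketch shows. The function name `chebyshev_lower_bound` is hypothetical.

```python
def chebyshev_lower_bound(z):
    """Minimum proportion of data within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem requires z greater than 1")
    return 1 - 1 / z ** 2


for z in (2, 3, 4):
    print(f"z = {z}: at least {chebyshev_lower_bound(z):.2%}")
```

The printed bounds are 75.00%, 88.89%, and 93.75%, which round to the 75%, 89%, and 94% quoted above.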
2.2 Graphical Summary Techniques
2.2.1 Stem-and-Leaf Display
A statistical technique to present a set of data. Each numerical value is
divided into two parts. The leading digit(s) becomes the stem and the
trailing digit the leaf. The stems are located along the vertical axis, and the
leaf values are stacked against each other along the horizontal axis.
Example:
temperatures (in degrees Celsius):
15 25 26 27 28
29 29 30 30 31
32 32 33 34 35
36 37 40 42 51

Stem | Leaf
1    | 5
2    | 5 6 7 8 9 9
3    | 0 0 1 2 2 3 4 5 6 7
4    | 0 2
5    | 1
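The display above can be generated with a short Python sketch. It assumes non-negative whole-number data where the tens digit(s) form the stem and the units digit is the leaf; the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group sorted values by stem (leading digits) with the units digit as leaf."""
    leaves = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        leaves[stem].append(leaf)
    return {stem: leaves[stem] for stem in sorted(leaves)}


temperatures = [15, 25, 26, 27, 28, 29, 29, 30, 30, 31,
                32, 32, 33, 34, 35, 36, 37, 40, 42, 51]

for stem, leaf_list in stem_and_leaf(temperatures).items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaf_list))
```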
2.2.2 Boxplots
Boxplots give us information about the distribution and spread of the data.
Procedure for Constructing a Boxplot
1. Find the 5-number summary consisting of the minimum value, $Q_1$, the median, $Q_3$,
and the maximum value.
2. Construct a scale with values that include the minimum and maximum data
values.
3. Construct a box (rectangle) extending from $Q_1$ to $Q_3$ and draw a line in the box at
the median value.
4. Draw lines extending outward from the box to the minimum and maximum data
values.
*In a boxplot, a data value is an outlier if it is above $Q_3 + 1.5 \times IQR$ or below
$Q_1 - 1.5 \times IQR$.
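The numbers behind a boxplot can be sketched in Python. The quartiles here come from `statistics.quantiles` with `method="inclusive"`, which is an assumption: it is one of several quartile conventions and may not match the hand method used in this course. The function names are hypothetical, and the data are the temperatures from the stem-and-leaf example.

```python
import statistics


def five_number_summary(data):
    """Minimum, Q1, median, Q3, maximum (quartile convention is an assumption)."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, q2, q3, max(data)


def outlier_fences(q1, q3):
    """Lower and upper limits: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def outliers(data, q1, q3):
    """Data values lying outside the fences."""
    lower, upper = outlier_fences(q1, q3)
    return [value for value in data if value < lower or value > upper]


temperatures = [15, 25, 26, 27, 28, 29, 29, 30, 30, 31,
                32, 32, 33, 34, 35, 36, 37, 40, 42, 51]

minimum, q1, med, q3, maximum = five_number_summary(temperatures)
print(minimum, q1, med, q3, maximum)
print(outliers(temperatures, q1, q3))
```

With these data, 15 and 51 fall outside the fences and would be flagged as outliers.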
Example: Boxplot
2.2.3 Histogram
• A graph in which the classes are marked on the horizontal axis and the class
frequencies on the vertical axis.
• The class frequencies are represented by the heights of the bars, and the
bars are drawn adjacent to each other.
2.2.4 Scatterplot
• A type of mathematical diagram using Cartesian coordinates to display
values for two variables for a set of data