Week 2 - Chapter 1 Introduction To Statistics (Part 2)
INTRODUCTION TO STATISTICS
SLIDE | 1
CONTENTS
1.1 STATISTICAL TERMINOLOGIES
1.2 STATISTICAL PROBLEM-SOLVING METHODOLOGY
1.3 REVIEWS ON DESCRIPTIVE STATISTICS
1.4 EXPLORATORY DATA ANALYSIS
1.5 NORMAL PROBABILITY PLOT
THIS PART COMBINES ALL THREE SUB-CHAPTERS (1.3, 1.4 AND 1.5)
REASON: ALL OF THEM ARE EDA TECHNIQUES
SLIDE | 2
1.3 REVIEWS ON DESCRIPTIVE STATISTICS
SLIDE | 3
LEARNING OUTCOMES
Summarise the data using measures of central tendency, such as the mean, median, mode, and midrange.
Describe the data using measures of variation, such as the range, variance, standard deviation, and coefficient of variation.
Identify the position of a data value in a data set using measures of position such as quartiles, deciles, and percentiles.
SLIDE | 4
1.4 EXPLORATORY DATA ANALYSIS
SLIDE | 5
LEARNING OUTCOMES
Identify outliers.
SLIDE | 6
1.5 NORMAL PROBABILITY PLOT
SLIDE | 7
LEARNING OUTCOMES
SLIDE | 8
EXPLORATORY DATA ANALYSIS (EDA)
EXPLORATORY DATA ANALYSIS
Definition: A process of utilising statistical tools such as numerical and graphical summaries to investigate data sets in order to
understand their important characteristics.
UNIVARIATE
• NON-GRAPHICAL
• GRAPHICAL
MULTIVARIATE
• NON-GRAPHICAL
• GRAPHICAL
This chapter limits the discussion to univariate analysis; multivariate analysis is not discussed.
SLIDE | 9
EXPLORATORY DATA ANALYSIS (EDA)
OVERVIEWS OF TECHNIQUES (CHAPTER 1)
NON-GRAPHICAL
• Measures of Central Tendency
  Mean
  Median
  Mode
  Midrange
• Measures of Variation/Dispersion
  Standard deviation
  Variance
  Coefficient of Variation (CVar)
  Range
  IQR
• Measures of Position
  Quartile
  Decile
  Percentile
GRAPHICAL
• Histogram (Please refer to Slides 38 & 39-Part 1)
• Stem-and-leaf plot (Removed from this course)
• Mixture stem and leaf plot (Bivariate analysis)
• Boxplot
• Parallel boxplot (Bivariate & Multivariate analysis)
• Normal probability plot
SLIDE | 10
EXPLORATORY DATA ANALYSIS (EDA)
MEASURES OF CENTRAL TENDENCY (LIMITED TO UNGROUPED DATA)
DEFINITION
A summary statistic that represents the centre point or typical value of a data set.
MEAN
POPULATION: $\mu = \frac{\sum x}{N}$
SAMPLE: $\bar{x} = \frac{\sum x}{n}$
MEDIAN
Please refer to Slide 20 for the details.
MIDRANGE
$\text{Midrange} = \frac{\text{lowest value} + \text{highest value}}{2}$
MODE
The most commonly occurring value in a data series (the value with the highest frequency).
*Note: In some situations, there is no mode (the frequency of every observation equals 1). In other situations, there is more than one mode (several observations share the same highest frequency).
*Note: Calculator fx-570EX ClassWiz can be used to double-check your answers. Please consult with me if you cannot find a solution.
SLIDE | 11
EXPLORATORY DATA ANALYSIS (EDA)
MEASURES OF CENTRAL TENDENCY (LIMITED TO UNGROUPED DATA)
IDENTIFY THE SHAPE OF THE DISTRIBUTION
SYMMETRICAL DISTRIBUTION
Mean = Median = Mode
SLIDE | 12
EXERCISE 1.3.1
d) 11.4 11.6 12.6 12.7 12.8 13.3 13.3 13.6 13.7 13.8
Answer: Left-skewed distribution; Reason: (Mean = 12.88) < (Median = 13.05) < (Mode = 13.3)
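The answer to part d) can be checked with a short script. This is a minimal sketch using Python's standard `statistics` module; the function names `central_tendency` and `shape` are illustrative, not part of the course material. The left-skewed and symmetrical rules follow the slides above, while the right-skewed rule (Mean > Median > Mode) is assumed here as the mirror image.

```python
from statistics import mean, median, multimode

def central_tendency(data):
    """Return the mean, median, mode(s) and midrange of ungrouped data."""
    modes = multimode(data)
    # If every value occurs exactly once, multimode returns all values:
    # treat that case as "no mode", as described in the slide note.
    if len(modes) == len(data):
        modes = []
    return {
        "mean": mean(data),
        "median": median(data),
        "modes": modes,
        "midrange": (min(data) + max(data)) / 2,
    }

def shape(data):
    """Suggest the shape of the distribution from mean, median and mode."""
    stats = central_tendency(data)
    if len(stats["modes"]) != 1:
        return "undetermined (no unique mode)"
    m, med, mo = stats["mean"], stats["median"], stats["modes"][0]
    if m < med < mo:
        return "left-skewed"
    if m > med > mo:          # assumed mirror image of the left-skewed rule
        return "right-skewed"
    if m == med == mo:
        return "symmetrical"
    return "undetermined"

data_d = [11.4, 11.6, 12.6, 12.7, 12.8, 13.3, 13.3, 13.6, 13.7, 13.8]
stats_d = central_tendency(data_d)
print(round(stats_d["mean"], 2), round(stats_d["median"], 2), stats_d["modes"])
print(shape(data_d))
```

Rounded to two decimal places, the script reproduces the values quoted in the answer (12.88, 13.05 and 13.3).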
SLIDE | 13
EXERCISE 1.3.1 (CONTINUED)
2. The following set of data represents the number of hospitals for selected countries.
123 108 195 138 115 179 119 148 147 180 146 178 189
108 193 114 179 147 108 128 164 174 128 159 193 175
d) What is the best measure of the average of this set of data? Why?
Answer: Median. The sample data are skewed, and the median is the best measure of central tendency for a skewed dataset.
SLIDE | 14
EXPLORATORY DATA ANALYSIS (EDA)
MEASURES OF VARIATION/DISPERSION (LIMITED TO UNGROUPED DATA)
DEFINITION
A summary statistic that describes the spread or dispersion of the data.
MEASURE OF VARIATION/DISPERSION
SLIDE | 15
EXPLORATORY DATA ANALYSIS (EDA)
MEASURES OF VARIATION/DISPERSION (LIMITED TO UNGROUPED DATA)-COMPARISON
SLIDE | 16
FUNDAMENTALS OF ACCURACY AND PRECISION
Accuracy is the difference between the true average and the observed average.
If the average value differs from the true average, then the system is not accurate.
SLIDE | 17
EXERCISE 1.3.2
Method A: 79 73 78 76 80 75 82 70 77
Method B: 80 85 78 79 75 73 70 60 65
2. The following set of sample data represents the battery lifetime (in hours) from two different brands. Which brand of battery
performed better?
A: 4.2 6.7 7.3 7.5 8.0 8.5 8.7 8.8 9.2 9.3
B: 9.6 9.7 9.8 9.9 10.1 10.2 11.0 11.0 11.0 11.1
SLIDE | 18
EXERCISE 1.3.2 (CONTINUED)
3. The average age of the accountants at a large company is 31 years with a standard deviation of 4 years. The average salary of
the accountants is RM 44255 per year with a standard deviation of RM 780. Compare the variations of age and income.
Solution:
The two variables are measured in different units. Hence, the CVar is needed to compare their variation.
Income is less variable than age. This is because $CVar_{income} = \frac{780}{44255} \times 100\% \approx 1.76\%$ is smaller than $CVar_{age} = \frac{4}{31} \times 100\% \approx 12.90\%$.
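The comparison in this solution can be reproduced numerically. The sketch below assumes the usual definition of the coefficient of variation, CVar = (standard deviation / mean) × 100%; the function name `coefficient_of_variation` is illustrative only.

```python
def coefficient_of_variation(std_dev, mean):
    """Coefficient of variation as a percentage (unit-free)."""
    return std_dev / mean * 100

# Figures taken from the question: age in years, income in RM per year.
cvar_age = coefficient_of_variation(4, 31)
cvar_income = coefficient_of_variation(780, 44255)

print(f"CVar(age)    = {cvar_age:.2f}%")
print(f"CVar(income) = {cvar_income:.2f}%")
print("Less variable:", "income" if cvar_income < cvar_age else "age")
```

Because the CVar divides out the measurement unit, years and RM can be compared directly.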
SLIDE | 19
EXERCISE 1.3.2 (CONTINUED)
a) If you are playing football and you always hit the left goal post instead of scoring.
Answer: Not accurate, but precise.
b) A candy manufacturer claims that each packet contains 20 candies. A sample packet has 18, 21, 19, 21, 19, 20, 22
candies, respectively. The average is 20 candies with an error of 1 candy.
Answer: Precise*, while accuracy cannot be determined. This is because the population mean is not given in the
question.
*Note: Judging this sample as precise is subjective, given that the sample standard deviation is $s \approx 1.41$ candies.
c) A manufacturer claims that each chocolate packet contains 20 chocolates. A sample of packets has 17, 18, 18, 17, 18,
17, and 17 chocolates, respectively.
Answer: Precise, while accuracy cannot be determined. This is because the population mean is not given in the
question.
SLIDE | 20
EXERCISE 1.3.2 (CONTINUED)
d) In an experiment, with five trials, the end results of the five trials are 35 kg, 36 kg, 36 kg, 35 kg, and 36 kg. The actual
value (as found in a scientific data book) is meant to be 42 kg.
Answer: Precise but not accurate.
e) In an experiment, with five trials, the average value is 35 kg. The actual value (as found in a scientific data book) is meant
to be 35 kg.
Answer: Precise and accurate.
SLIDE | 21
EXPLORATORY DATA ANALYSIS (EDA)
MEASURES OF POSITION (LIMITED TO UNGROUPED DATA)
DEFINITION
A summary statistic that determines the position of a single value in relation to the other values.
MEASURES OF POSITION
RULE OF THUMB
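Let $c$ be the computed position of the required quartile, decile, or percentile in the sorted data (for the $p$-th percentile of $n$ values, $c = \frac{n \cdot p}{100}$).
Scenario 1: If $c$ is not a whole number, round it up to the next whole number and use the value at that position.
Scenario 2: If $c$ is a whole number, then use the average of the values at positions $c$ and $c + 1$.
COMMON MISTAKE IN MY PREVIOUS CLASS
Students treat the subscript of $x$ as the value of the observation. In statistics, the subscript represents the order (position) of the observation in the sorted data, NOT the value of the observation.
*Note: Decile is widely utilised in social sciences such as finance and economics.
SLIDE | 22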
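The rule of thumb can be sketched in code. This example assumes the position is computed as c = n·p/100 for the p-th percentile; that formula is an assumption, chosen because it reproduces the quartiles tabulated later in this chapter. The function name `percentile` and the use of the ten test scores from Exercise 1.3.3 are for illustration only.

```python
def percentile(data, p):
    """p-th percentile (0 < p < 100, integer) using the round-up rule of thumb."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(data)
    # c = n * p / 100, kept as quotient and remainder to avoid float error.
    whole, remainder = divmod(len(ordered) * p, 100)
    if remainder:
        # Scenario 1: c is not whole, so round up to position whole + 1
        # (1-based), which is index `whole` in a 0-based list.
        return ordered[whole]
    # Scenario 2: c is whole, so average the values at positions c and c + 1.
    return (ordered[whole - 1] + ordered[whole]) / 2

scores = [9, 22, 11, 14, 13, 3, 7, 15, 18, 16]
q1 = percentile(scores, 25)   # 1st quartile = 25th percentile
q2 = percentile(scores, 50)   # median = 2nd quartile
q3 = percentile(scores, 75)   # 3rd quartile
d6 = percentile(scores, 60)   # 6th decile = 60th percentile
print(q1, q2, q3, d6)
```

Note that the function returns a data value (or the average of two data values), never the position itself, which is the common mistake described above.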
EXERCISE 1.3.3
Solution:
Rearrange the dataset in ascending order:
1 2 3 4 4 5 6 7 9
c)
SLIDE | 23
EXERCISE 1.3.3 (CONTINUED)
2. A teacher gives a 25-point test to ten students. The scores are shown below.
9 22 11 14 13 3 7 15 18 16
Solution:
Rearrange the dataset in ascending order:
3 7 9 11 13 14 15 16 18 22
c)
SLIDE | 24
EXPLORATORY DATA ANALYSIS (EDA)
OUTLIERS
DEFINITION
An extremely high or extremely low data value when compared with the rest of the data values.
OUTLIER(S)
Any observation that lies outside the interval $[Q_1 - 1.5 \times IQR,\; Q_3 + 1.5 \times IQR]$, where $IQR = Q_3 - Q_1$, is treated as a potential outlier.
*Note: If you are interested in data science, you may try to explore the extreme outliers.
SLIDE | 25
EXERCISE 1.4.1
Solution:
Rearrange the dataset in ascending order:
1 2 3 4 4 5 6 7 19
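i. $Q_1 = 3$; $Q_3 = 6$ (so $IQR = 3$)
ii. Lower boundary: $Q_1 - 1.5 \times IQR = 3 - 1.5(3) = -1.5$
iii. Upper boundary: $Q_3 + 1.5 \times IQR = 6 + 1.5(3) = 10.5$
iv. Decision: Since 19 lies outside the range $[-1.5, 10.5]$, 19 is an outlier.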
SLIDE | 26
EXERCISE 1.4.1 (CONTINUED)
Solution:
Rearrange the dataset in ascending order:
2 3 4 5 6 6 7 7 8 11 12 19 21
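i. $Q_1 = 5$; $Q_3 = 11$ (so $IQR = 6$)
ii. Lower boundary: $Q_1 - 1.5 \times IQR = 5 - 1.5(6) = -4$
iii. Upper boundary: $Q_3 + 1.5 \times IQR = 11 + 1.5(6) = 20$
iv. Decision: Since 21 lies outside the range $[-4, 20]$, 21 is an outlier.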
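Both parts of Exercise 1.4.1 can be checked with a small outlier detector. This is a sketch under two assumptions: quartiles follow the round-up rule of thumb for measures of position, and the boundaries are the conventional $Q_1 - 1.5 \times IQR$ and $Q_3 + 1.5 \times IQR$. The helper names `quartile` and `find_outliers` are hypothetical.

```python
def quartile(ordered, k):
    """k-th quartile (k = 1, 2, 3) of already sorted data, round-up rule."""
    whole, remainder = divmod(len(ordered) * k, 4)
    if remainder:                      # position not whole: round up
        return ordered[whole]
    return (ordered[whole - 1] + ordered[whole]) / 2

def find_outliers(data):
    """Return (lower boundary, upper boundary, list of outliers)."""
    ordered = sorted(data)
    q1, q3 = quartile(ordered, 1), quartile(ordered, 3)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr             # assumed 1.5 * IQR boundaries
    upper = q3 + 1.5 * iqr
    outliers = [x for x in ordered if x < lower or x > upper]
    return lower, upper, outliers

first = find_outliers([1, 2, 3, 4, 4, 5, 6, 7, 19])
second = find_outliers([2, 3, 4, 5, 6, 6, 7, 7, 8, 11, 12, 19, 21])
print(first)
print(second)
```

In the second dataset, 19 stays inside the boundaries, so only 21 is flagged, which matches the decision in the solution.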
SLIDE | 27
EXPLORATORY DATA ANALYSIS (EDA)
THE LEAST SENSITIVE MEASURES OF CENTRAL TENDENCY AND MEASURES OF VARIATION TO
OUTLIER(S) AND SKEWED DATA
*Note: You may try to simulate datasets that are skewed or contain outliers. Compute the measures before and after removing the outliers, then compare the changes. The measures that change the least are the most robust to skewness and outliers.
SLIDE | 28
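The experiment suggested in the note can be sketched as follows. Instead of simulated data, this example reuses the first dataset of Exercise 1.4.1 (with the outlier 19) so the figures can be verified by hand; the quartile rule and all function names are assumptions for illustration.

```python
from statistics import mean, median, stdev

def quartile(ordered, k):
    """k-th quartile of already sorted data, using the round-up rule."""
    whole, remainder = divmod(len(ordered) * k, 4)
    if remainder:
        return ordered[whole]
    return (ordered[whole - 1] + ordered[whole]) / 2

def summary(data):
    """Measures of central tendency and variation for ungrouped data."""
    ordered = sorted(data)
    return {
        "mean": mean(ordered),
        "median": median(ordered),
        "stdev": stdev(ordered),                  # sample standard deviation
        "range": ordered[-1] - ordered[0],
        "iqr": quartile(ordered, 3) - quartile(ordered, 1),
    }

with_outlier = [1, 2, 3, 4, 4, 5, 6, 7, 19]
without_outlier = [x for x in with_outlier if x != 19]

before, after = summary(with_outlier), summary(without_outlier)
changes = {name: abs(before[name] - after[name]) for name in before}

# Smallest change first: these measures are the least sensitive to the outlier.
for name, change in sorted(changes.items(), key=lambda item: item[1]):
    print(f"{name:>6}: {before[name]:.2f} -> {after[name]:.2f} (change {change:.2f})")
```

For this dataset the median and IQR do not change at all, whereas the mean, standard deviation and range shift noticeably once 19 is removed.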
EXPLORATORY DATA ANALYSIS (EDA)
STEM AND LEAF PLOT (ADDITIONAL KNOWLEDGE)
DEFINITION
A device for presenting quantitative data in a graphical format, similar to a histogram, to assist in visualising the shape of the distribution.
SLIDE | 29
EXPLORATORY DATA ANALYSIS (EDA)
BOXPLOT
DEFINITION
Boxplot is a graphical representation of a five-number summary (minimum, 1st quartile, 2nd quartile (median), 3rd quartile, and
maximum) of a data set and outliers.
STEP 1
Arrange the data in ascending order
SLIDE | 30
EXERCISE 1.4.2
1. Plot a boxplot for the following data. Then, describe the data.
a) 3.2 5.9 4.3 6.9 4.5 8.0 4.7 8.9 5.7 11.9
b) 5.8 9.7 6.7 13.4 6.8 14.7 7.2 16.4 8.2 28.1
Solution:
a) Rearrange the dataset in ascending order:
3.2 4.3 4.5 4.7 5.7 5.9 6.9 8.0 8.9 11.9
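i. $Q_1 = 4.5$; $Q_3 = 8.0$ (so $IQR = 3.5$)
ii. Lower boundary: $Q_1 - 1.5 \times IQR = 4.5 - 1.5(3.5) = -0.75$
iii. Upper boundary: $Q_3 + 1.5 \times IQR = 8.0 + 1.5(3.5) = 13.25$
iv. Decision: Since no observation lies outside the range $[-0.75, 13.25]$, there is no outlier.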
SLIDE | 31
EXERCISE 1.4.2 (CONTINUED)
Solution:
b) Rearrange the dataset in ascending order:
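5.8 6.7 6.8 7.2 8.2 9.7 13.4 14.7 16.4 28.1
i. $Q_1 = 6.8$; $Q_3 = 14.7$ (so $IQR = 7.9$)
ii. Lower boundary: $Q_1 - 1.5 \times IQR = 6.8 - 1.5(7.9) = -5.05$
iii. Upper boundary: $Q_3 + 1.5 \times IQR = 14.7 + 1.5(7.9) = 26.55$
iv. Decision: Since 28.1 lies outside the range $[-5.05, 26.55]$, 28.1 is an outlier.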
SLIDE | 32
EXERCISE 1.4.2 (CONTINUED)
2. Two samples of ten springs made of steel rods supplied by two different companies were compared. The measurement of
flexibility (in N/m) for each spring was recorded as follows.
Company A: 4.2 6.7 7.3 7.5 8.0 8.5 8.7 8.8 9.2 9.3
Company B: 9.6 9.7 9.8 9.9 10.1 10.2 11.0 11.0 11.0 11.1
Compare the distributions using boxplots. Then, give a comment on the flexibility of springs supplied by two different
companies.
Solution:
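i. Company A: $Q_1 = 7.3$; $Q_3 = 8.8$; $IQR = 1.5$
   Company B: $Q_1 = 9.8$; $Q_3 = 11.0$; $IQR = 1.2$
ii.
                 Company A                   Company B
Lower boundary   7.3 - 1.5(1.5) = 5.05       9.8 - 1.5(1.2) = 8.00
Upper boundary   8.8 + 1.5(1.5) = 11.05      11.0 + 1.5(1.2) = 12.80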
SLIDE | 33
EXERCISE 1.4.2 (CONTINUED)
iii. Five-number summary & Outlier(s)
                 Company A   Company B
Minimum          6.7         9.6
1st quartile     7.3         9.8
Median           8.25        10.15
3rd quartile     8.8         11.0
Maximum          9.3         11.1
Outlier(s)       4.2         -
iv. [Figure: parallel boxplots on a common axis from 4 to 12. Company A: left-skewed distribution. Company B: right-skewed distribution.]
SLIDE | 34
EXERCISE 1.4.2 (CONTINUED)
COMMENTS
Average:
Since $\text{median}_B = 10.15 > \text{median}_A = 8.25$, the springs supplied by Company B have higher flexibility than those supplied by Company A.
Variability:
Since $IQR_A = 1.5 > IQR_B = 1.2$, the flexibility of the springs supplied by Company A is less consistent than that of Company B.
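The numbers behind the two boxplots can be reproduced with a short script. This sketch assumes the round-up quartile rule and 1.5 × IQR boundaries, and it reports the minimum and maximum as the smallest and largest observations that are not outliers, which is how the five-number summary for Company A is tabulated. The function names are hypothetical.

```python
def quartile(ordered, k):
    """k-th quartile of already sorted data, using the round-up rule."""
    whole, remainder = divmod(len(ordered) * k, 4)
    if remainder:
        return ordered[whole]
    return (ordered[whole - 1] + ordered[whole]) / 2

def boxplot_summary(data):
    """Five-number summary (outliers excluded from min/max), IQR and outliers."""
    ordered = sorted(data)
    q1, q2, q3 = (quartile(ordered, k) for k in (1, 2, 3))
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in ordered if lower <= x <= upper]
    outliers = [x for x in ordered if x < lower or x > upper]
    return {
        "min": inside[0], "q1": q1, "median": q2, "q3": q3, "max": inside[-1],
        "iqr": iqr, "outliers": outliers,
    }

company_a = [4.2, 6.7, 7.3, 7.5, 8.0, 8.5, 8.7, 8.8, 9.2, 9.3]
company_b = [9.6, 9.7, 9.8, 9.9, 10.1, 10.2, 11.0, 11.0, 11.0, 11.1]
summary_a = boxplot_summary(company_a)
summary_b = boxplot_summary(company_b)

for name, s in (("Company A", summary_a), ("Company B", summary_b)):
    print(name, {key: (round(v, 2) if isinstance(v, float) else v)
                 for key, v in s.items()})
```

Comparing the two medians addresses the "average" comment, and comparing the two IQR values addresses the "variability" comment.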
SLIDE | 35
EXERCISE 1.4.2 (CONTINUED)
3. The following table presents the viscosity (in Pascal) of chemical substances from three (3) batches of the chemical
process.
Batches Viscosity
Batch A 13.3 14.1 14.3 14.5 14.5 14.6 14.8 15.2 15.3 15.3
Batch B 13.3 13.7 14.1 14.5 14.9 15.2 15.3 15.4 15.6 15.8
Batch C 13.4 13.7 14.1 14.3 14.3 14.8 15.1 15.8 16.4 16.9
SLIDE | 36
EXERCISE 1.4.2 (CONTINUED)
Solution:
a)
Measures of position Batch A Batch B Batch C
1st quartile 14.30 14.10 14.10
Median 14.55 15.05 14.55
3rd quartile 15.20 15.40 15.80
Outlier No No No
Batch B
Since no observation lies outside $[Q_1 - 1.5 \times IQR,\; Q_3 + 1.5 \times IQR] = [12.15, 17.35]$, there is no outlier in the Batch B dataset.
SLIDE | 37
EXERCISE 1.4.2 (CONTINUED)
Solution:
b) Measures of position Batch A Batch B Batch C
Minimum 13.30 13.30 13.40
1st quartile 14.30 14.10 14.10
Median 14.55 15.05 14.55
3rd quartile 15.20 15.40 15.80
Maximum 15.30 15.80 16.90
Outlier No No No
[Figure: parallel boxplots. Batch A: left-skewed distribution. Batch B: left-skewed distribution. Batch C: right-skewed distribution.]
SLIDE | 38
EXERCISE 1.4.2 (CONTINUED)
Solution:
c)
Shape: Batch A: Left-skewed distribution; Batch B: Left-skewed distribution; Batch C: Right-skewed distribution
Average:
Since $\text{median}_A = \text{median}_C = 14.55$, the average viscosity of the chemical substance is equivalent for Batches A and C. Conversely, the
average viscosity for Batch B is higher than for Batches A and C. This is because $\text{median}_B = 15.05 > 14.55$.
Variability:
Since $IQR_A = 0.90 < IQR_B = 1.30 < IQR_C = 1.70$, the viscosity of the chemical substance for Batch A is more consistent than for Batches B and C. In
contrast, Batch C is the least consistent of the three batches.
SLIDE | 39
EXPLORATORY DATA ANALYSIS (EDA)
NORMAL PROBABILITY PLOT (A SPECIAL CASE FOR Q-Q PLOT)
DEFINITION
The normal probability plot is a graphical technique for assessing whether or not a data set is an approximately normal distribution.
STEP 1
Sort the data in ascending order and denote each sorted data as , where .
STEP 2
Number the sorted data from $1$ to $n$.
STEP 3
Calculate the probability value for each using .
SLIDE | 40
EXERCISE 1.5
1. A sample of size six is drawn. The sample, arranged in increasing order, is given as follows.
2. The following data represent the number of movies in Asia for the 14-year period.
2084 1497 1014 910 899 870 859 848 837 826 815 750 737 637
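The coordinates for a normal probability plot of the movie data can be sketched as follows. The plotting-position formula is not legible in the steps above, so this example assumes $p_i = (i - 0.5)/n$, one common convention; other texts use different formulas. The function name `normal_plot_points` is illustrative, and `NormalDist().inv_cdf` from the standard library converts each probability into a standard normal z-score.

```python
from statistics import NormalDist

def normal_plot_points(data):
    """Return (x_i, p_i, z_i) for each sorted observation x_i."""
    ordered = sorted(data)                 # Step 1: sort in ascending order
    n = len(ordered)
    standard_normal = NormalDist()         # mean 0, standard deviation 1
    points = []
    for i, x in enumerate(ordered, start=1):   # Step 2: number from 1 to n
        p = (i - 0.5) / n                  # Step 3: ASSUMED plotting position
        z = standard_normal.inv_cdf(p)     # z-score with area p to its left
        points.append((x, p, z))
    return points

movies = [2084, 1497, 1014, 910, 899, 870, 859, 848, 837, 826, 815, 750, 737, 637]
points = normal_plot_points(movies)
for x, p, z in points:
    print(f"{x:>5}  p = {p:.4f}  z = {z:+.3f}")
```

Plotting the observations against the z-scores gives the normal probability plot; points that fall close to a straight line suggest an approximately normal distribution, while a pronounced curve or isolated points suggest skewness or outliers.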
SLIDE | 41
THANK YOU
END OF CHAPTER 1
SLIDE | 42
APPENDIX A
COMPUTE THE MEAN AND VARIANCE UTILISING CALCULATOR FX-570MS
YouTube Link
https://www.youtube.com/watch?v=whq2I09V11c
SLIDE | 43
APPENDIX B
COMPUTE THE MEAN AND VARIANCE UTILISING CALCULATOR FX-570ES/FX-570ES PLUS
YouTube Link
https://www.youtube.com/watch?v=9CJItYX10fY&t=14s
SLIDE | 44
APPENDIX C
COMPUTE THE MEAN AND VARIANCE UTILISING CALCULATOR FX-570EX CLASSWIZ
YouTube Link
https://www.youtube.com/watch?v=K1OmgQOPG3o
SLIDE | 45