Week 2 - Chapter 1 Introduction To Statistics (Part 2)

CHAPTER 1 (PART 2)
INTRODUCTION TO STATISTICS
SLIDE | 1
CONTENTS
THIS PART COMBINES THREE SUB-CHAPTERS (1.3, 1.4 AND 1.5). REASON: ALL OF THEM ARE EDA TECHNIQUES
1.1 STATISTICAL TERMINOLOGIES
1.1.1 What is Statistics?
1.1.2 Why We Need Statistics?
1.1.3 Population and Sample
1.1.4 Descriptive and Inferential Statistics (Combined with 1.1.1)
1.1.5 Role of the Computer in Statistics
1.2 STATISTICAL PROBLEM-SOLVING METHODOLOGY
1.2.1 Identifying the Problem or Opportunity
1.2.2 Deciding on the Method of Data Collection
1.2.3 Collecting the Data (Sampling Techniques)
1.2.4 Classifying and Summarising the Data
1.2.5 Presenting and Analysing the Data
1.2.6 Making the Decision and Conclusion
1.3 REVIEWS ON DESCRIPTIVE STATISTICS
1.3.1 Measures of Central Tendency
1.3.2 Measures of Variation (Dispersion)
1.3.3 Measures of Position
1.3.4 Descriptive Statistics Using Microsoft Excel
1.4 EXPLORATORY DATA ANALYSIS
1.4.1 Outliers
1.4.2 Boxplot
1.5 NORMAL PROBABILITY PLOT
SLIDE | 2
1.3
REVIEWS ON DESCRIPTIVE
STATISTICS
SLIDE | 3
LEARNING OUTCOMES
Summarise the data using measures of central tendency, such as the mean, median, mode, and
midrange.
Describe the data using measures of variation, such as the range, variance, standard deviation,
and coefficient of variation.
Identify the position of a data value in a data set using measures of position such as quartiles,
deciles, and percentiles.
SLIDE | 4
1.4
EXPLORATORY DATA ANALYSIS
SLIDE | 5
LEARNING OUTCOMES
Identify outliers.
Draw and interpret a boxplot.
SLIDE | 6
1.5
NORMAL PROBABILITY PLOT
SLIDE | 7
LEARNING OUTCOMES
Check the normality assumption using the normal probability plot.

*Note: The normal probability plot is a special case of the quantile-quantile (Q-Q) plot. Most statistical software provides both the probability-probability (P-P) plot and the Q-Q plot.
SLIDE | 8
EXPLORATORY DATA ANALYSIS (EDA)
EXPLORATORY DATA ANALYSIS
Definition: A process of utilising statistical tools such as numerical and graphical summaries to investigate data sets in order to
understand their important characteristics.
UNIVARIATE: non-graphical, graphical
MULTIVARIATE: non-graphical, graphical
*Note: This chapter limits the discussion to univariate analysis. Multivariate analysis is not discussed in this chapter.
SLIDE | 9
OVERVIEW OF TECHNIQUES (CHAPTER 1)
NON-GRAPHICAL
• Measures of Central Tendency
  - Mean
  - Median
  - Mode
  - Midrange
• Measures of Variation/Dispersion
  - Standard deviation
  - Variance
  - Coefficient of Variation (CVar)
  - Range
  - IQR
• Measures of Position
  - Quartile
  - Decile
  - Percentile
GRAPHICAL
• Histogram (Please refer to Slides 38 & 39, Part 1)
• Stem-and-leaf plot (Removed from this course)
• Mixture stem-and-leaf plot (Bivariate analysis)
• Boxplot
• Parallel boxplot (Bivariate & multivariate analysis)
• Normal probability plot
SLIDE | 10
MEASURES OF CENTRAL TENDENCY (LIMITED TO UNGROUPED DATA)
DEFINITION
A summary statistic that represents the centre point or typical value of a data set.
MEASURES OF CENTRAL TENDENCY
MEAN
POPULATION: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
SAMPLE: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
MEDIAN
Please refer to Slide 20 for the details.
MIDRANGE
$\text{Midrange} = \frac{\text{lowest value} + \text{highest value}}{2}$
MODE
The most commonly occurring value in a data series (the value with the highest frequency).
*Note: In some situations, there is no mode (the frequency of every observation equals 1). In other situations, there is more than one mode (several observations share the same frequency, and it is the highest compared to the rest).
*Note: The calculator fx-570EX ClassWiz can be used to double-check your answers. Please consult me if you cannot find a solution.
SLIDE | 11
MEASURES OF CENTRAL TENDENCY (LIMITED TO UNGROUPED DATA)
IDENTIFY THE SHAPE OF THE DISTRIBUTION
LEFT-SKEWED DISTRIBUTION
Mean < Median < Mode
Mean < Median
Mean < Mode (STPM)
This also indicates that the majority of the observations are located on the right side.
SYMMETRICAL DISTRIBUTION
Mean = Median = Mode
Mean = Median
Mean = Mode (STPM)
RIGHT-SKEWED DISTRIBUTION
Mode < Median < Mean
Median < Mean
Mode < Mean (STPM)
This also indicates that the majority of the observations are located on the left side.
SLIDE | 12
EXERCISE 1.3.1
1. Determine the shape of the distribution of the following data.
a) Mean = Mode = Median=11

Answer: Symmetrical distribution
b) Mean = 25, Mode = 13, Median =17

Answer: Right-skewed distribution; Reason: (Mode = 13) < (Median =17) < (Mean =25)
c) Mean = 5, Mode = 73, Median =17

Answer: Left-skewed; Reason: (Mean = 5) < (Median = 17) < (Mode =73)
d) 11.4 11.6 12.6 12.7 12.8 13.3 13.3 13.6 13.7 13.8
Answer: Left-skewed distribution; Reason: (Mean = 12.88) < (Median = 13.05) < (Mode = 13.3)
SLIDE | 13
EXERCISE 1.3.1 (CONTINUED)
2. The following set of data represents the number of hospitals for selected countries.
123 108 195 138 115 179 119 148 147 180 146 178 189
108 193 114 179 147 108 128 164 174 128 159 193 175
a) Find the mean, median, mode, and midrange.

Answer: Mean = 151.3462; Median = 147.5; Mode = 108; Midrange = (108 + 195)/2 = 151.5.
b) Are the average values calculated in (a) parameters or statistics? Why?

Answer: Statistics. The average values are computed from sample data.
c) What is the distribution type that describes the data?

Answer: Right-skewed distribution; Reason: (Mode =108) < (Median = 147.5) < (Mean = 151.3462).
d) What is the best measure of the average of this set of data? Why?
Answer: Median. The sample data are skewed, and the median is the best measure of central tendency for a skewed dataset.
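The answers to Exercise 1.3.1, Question 2 can be double-checked with a short script. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative and not part of the module notes.

```python
from statistics import mean, median, multimode

# Number of hospitals for the selected countries (Exercise 1.3.1, Question 2)
hospitals = [123, 108, 195, 138, 115, 179, 119, 148, 147, 180, 146, 178, 189,
             108, 193, 114, 179, 147, 108, 128, 164, 174, 128, 159, 193, 175]

sample_mean = mean(hospitals)
sample_median = median(hospitals)
modes = multimode(hospitals)  # list of all values with the highest frequency
midrange = (min(hospitals) + max(hospitals)) / 2

print("Mean     =", round(sample_mean, 4))
print("Median   =", sample_median)
print("Mode(s)  =", modes)
print("Midrange =", midrange)

# Shape check from the slide: Mode < Median < Mean means right-skewed
is_right_skewed = modes[0] < sample_median < sample_mean
print("Right-skewed:", is_right_skewed)
```

`multimode` returns a list, so it also covers the situation where a data set has more than one mode.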
SLIDE | 14
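MEASURES OF VARIATION/DISPERSION (LIMITED TO UNGROUPED DATA)
DEFINITION
A summary statistic that describes the spread or dispersion of the data.
MEASURES OF VARIATION/DISPERSION
STANDARD DEVIATION
POPULATION: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
SAMPLE: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
VARIANCE
POPULATION: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
SAMPLE: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
COEFFICIENT OF VARIATION
POPULATION: $CVar = \frac{\sigma}{\mu} \times 100\%$
SAMPLE: $CVar = \frac{s}{\bar{x}} \times 100\%$
RANGE
$\text{Range} = \text{highest value} - \text{lowest value}$
INTERQUARTILE RANGE
$IQR = Q_3 - Q_1$, where $Q_1$ = 1st quartile and $Q_3$ = 3rd quartile
WHEN DO WE USE THE COEFFICIENT OF VARIATION (RELATIVE DISPERSION)?
Scenario 1: When the variables being compared are in different measurement units.
Scenario 2: When the means of the variables being compared differ widely, even though they share the same measurement unit.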
SLIDE | 15
MEASURES OF VARIATION/DISPERSION (LIMITED TO UNGROUPED DATA): COMPARISON
When $s_1 < s_2$ OR $CVar_1 < CVar_2$, this indicates that:
Data set 1 | Data set 2
Less dispersed | More dispersed
Less spread | More spread
Less variable (small variation) | More variable (large variation)
More consistent | Less consistent
More precise | Less precise
More accurate (Please ignore) | Less accurate (Please ignore)
Better | Worse
Reason for ignoring the accuracy row: accuracy is based on the average, not the variation.
SLIDE | 16
FUNDAMENTALS OF
ACCURACY AND PRECISION
Accuracy is judged by the difference between the true average and the observed average.
If the observed average differs from the true average, then the system is not accurate.
Precision is the degree to which repeated measurements under unchanged conditions show the same result. In other words, precision refers to the closeness of two or more measurements to each other.
SLIDE | 17
EXERCISE 1.3.2
1. Which of the following sets of sample data is less variable?
Method A: 79 73 78 76 80 75 82 70 77
Method B: 80 85 78 79 75 73 70 60 65
Answer: Method A. Reason: $s_A = 3.6742 < s_B = 7.8493$
2. The following sets of sample data represent the battery lifetime (in hours) of two different brands. Which brand of battery performed better?
A: 4.2 6.7 7.3 7.5 8.0 8.5 8.7 8.8 9.2 9.3
B: 9.6 9.7 9.8 9.9 10.1 10.2 11.0 11.0 11.0 11.1
Answer: Brand B. Reason: $s_A = 1.5296$ hours > $s_B = 0.6150$ hours
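The sample standard deviations quoted in Questions 1 and 2 can be reproduced as follows. This is a minimal sketch; `statistics.stdev` divides by n - 1, which matches the sample formula, and the variable names are illustrative.

```python
from statistics import stdev

# Question 1: two methods
method_a = [79, 73, 78, 76, 80, 75, 82, 70, 77]
method_b = [80, 85, 78, 79, 75, 73, 70, 60, 65]

# Question 2: battery lifetimes (hours) for two brands
brand_a = [4.2, 6.7, 7.3, 7.5, 8.0, 8.5, 8.7, 8.8, 9.2, 9.3]
brand_b = [9.6, 9.7, 9.8, 9.9, 10.1, 10.2, 11.0, 11.0, 11.0, 11.1]

# Sample standard deviation (denominator n - 1)
s_method_a = stdev(method_a)
s_method_b = stdev(method_b)
s_brand_a = stdev(brand_a)
s_brand_b = stdev(brand_b)

print("Methods:", round(s_method_a, 4), round(s_method_b, 4))
print("Brands: ", round(s_brand_a, 4), round(s_brand_b, 4))

# The data set with the smaller standard deviation is the less variable one
less_variable_method = "A" if s_method_a < s_method_b else "B"
more_consistent_brand = "A" if s_brand_a < s_brand_b else "B"
print(less_variable_method, more_consistent_brand)
```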
SLIDE | 18
3. The average age of the accountants at a large company is 31 years with a standard deviation of 4 years. The average salary of the accountants is RM 44255 per year with a standard deviation of RM 780. Compare the variations of age and income.
Solution:
The two variables have different measurement units. Hence, the CVar is needed to compare their variation.
$CVar_{\text{age}} = \frac{4}{31} \times 100\% = 12.90\%$
$CVar_{\text{income}} = \frac{780}{44255} \times 100\% = 1.76\%$
Income is less variable than age. This is because $CVar_{\text{income}} = 1.76\% < CVar_{\text{age}} = 12.90\%$.
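The comparison in Question 3 can be sketched in a few lines. The helper name `coefficient_of_variation` is hypothetical and chosen only for illustration; it applies the CVar definition (standard deviation divided by the mean, times 100%).

```python
def coefficient_of_variation(std_dev, mean_value):
    """Return the coefficient of variation as a percentage."""
    return std_dev / mean_value * 100

# Age: mean 31 years, standard deviation 4 years
cvar_age = coefficient_of_variation(4, 31)

# Income: mean RM 44255, standard deviation RM 780
cvar_income = coefficient_of_variation(780, 44255)

print("CVar (age)    =", round(cvar_age, 2), "%")
print("CVar (income) =", round(cvar_income, 2), "%")

# The variable with the smaller CVar is relatively less variable
less_variable = "income" if cvar_income < cvar_age else "age"
print("Less variable:", less_variable)
```

Because the CVar is unit-free, years and ringgit can be compared directly, which is not possible with the two standard deviations alone.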
SLIDE | 19
4. Identify each situation as either accurate or precise or both.
a) If you are playing football and you always hit the left goal post instead of scoring.
Answer: Not accurate, but precise.
b) A candy manufacturer claims that each packet contains 20 candies. A sample of packets has 18, 21, 19, 21, 19, 20, and 22 candies, respectively. The average is 20 candies with an error of 1 candy.
Answer: Precise*, while accuracy cannot be identified. This is because the population mean is not given in the question.
*Note: "Precise" here is a subjective decision, as the sample standard deviation is $s = 1.4142$ candies.
c) A manufacturer claims that each chocolate packet contains 20 chocolates. A sample of packets has 17, 18, 18, 17, 18, 17, and 17 chocolates, respectively.
Answer: Precise, while accuracy cannot be identified. This is because the population mean is not given in the question.
SLIDE | 20
4. Identify each situation as either accurate or precise or both.
d) In an experiment, with five trials, the end results of the five trials are 35 kg, 36 kg, 36 kg, 35 kg, and 36 kg. The actual
value (as found in a scientific data book) is meant to be 42 kg.
Answer: Precise but not accurate.
e) In an experiment, with five trials, the average value is 35 kg. The actual value (as found in a scientific data book) is meant
to be 35 kg.
Answer: Precise and accurate.
SLIDE | 21
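MEASURES OF POSITION (LIMITED TO UNGROUPED DATA)
DEFINITION
A summary statistic that determines the position of a single value in relation to the other values.
MEASURES OF POSITION
QUARTILE: $Q_i$ ($i = 1, 2, 3$), at position $c = \frac{i \cdot n}{4}$
DECILE: $D_i$ ($i = 1, 2, \ldots, 9$), at position $c = \frac{i \cdot n}{10}$
PERCENTILE: $P_i$ ($i = 1, 2, \ldots, 99$), at position $c = \frac{i \cdot n}{100}$
Here $n$ is the number of observations and $x_{(k)}$ denotes the $k$-th observation after sorting the data in ascending order.
RULE OF THUMB
Scenario 1: If $c$ is not a whole number, round it up to the next whole number $k$ and use $x_{(k)}$.
Scenario 2: If $c$ is a whole number, then use $\frac{x_{(c)} + x_{(c+1)}}{2}$.
COMMON MISTAKE IN MY PREVIOUS CLASS
Students assume the subscript of $x_{(k)}$ is the value of the observation. In statistical theory, it represents the order of the observation, NOT the value of the observation.
*Note: Deciles are widely used in social sciences such as finance and economics.
SLIDE | 22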
EXERCISE 1.3.3
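1. Given a set of data as 9 2 1 4 3 7 5 4 6.
a) Find the value corresponding to the 4th decile.
b) Find the value corresponding to the 3rd quartile.
Solution:
Rearrange the dataset in ascending order:
1 2 3 4 4 5 6 7 9
a) Position: $\frac{4(9)}{10} = 3.6$, which is not a whole number, so round up to 4. The 4th decile is the 4th sorted value: $D_4 = 4$.
b) Position: $\frac{3(9)}{4} = 6.75$, which is not a whole number, so round up to 7. The 3rd quartile is the 7th sorted value: $Q_3 = 6$.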
SLIDE | 23
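2. A teacher gives a 25-point test to ten students. The scores are shown below.
9 22 11 14 13 3 7 15 18 16
a) Find the score corresponding to the 20th percentile.
b) Find the score corresponding to the 7th decile.
Solution:
Rearrange the dataset in ascending order:
3 7 9 11 13 14 15 16 18 22
a) Position: $\frac{20(10)}{100} = 2$, which is a whole number, so average the 2nd and 3rd sorted values: $P_{20} = \frac{7 + 9}{2} = 8$.
b) Position: $\frac{7(10)}{10} = 7$, which is a whole number, so average the 7th and 8th sorted values: $D_7 = \frac{15 + 16}{2} = 15.5$.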
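The rule of thumb for measures of position can be sketched as a small function. The name `value_at_position` and its parameters are hypothetical; `parts` is 4 for quartiles, 10 for deciles and 100 for percentiles. Integer arithmetic is used to decide whether the position is a whole number, which avoids floating-point rounding surprises.

```python
def value_at_position(data, i, parts):
    """Return the i-th quartile/decile/percentile using the slide's rule.

    parts = 4 (quartile), 10 (decile) or 100 (percentile).
    """
    if not 0 < i < parts:
        raise ValueError("i must satisfy 0 < i < parts")
    xs = sorted(data)
    n = len(xs)
    # Position c = n * i / parts, split into whole part and remainder
    whole, remainder = divmod(n * i, parts)
    if remainder:
        # Scenario 1: c is not whole, round up to position whole + 1
        # (1-based position whole + 1 is 0-based index whole)
        return xs[whole]
    # Scenario 2: c is whole, average the c-th and (c + 1)-th sorted values
    return (xs[whole - 1] + xs[whole]) / 2


data = [9, 2, 1, 4, 3, 7, 5, 4, 6]
scores = [9, 22, 11, 14, 13, 3, 7, 15, 18, 16]

print(value_at_position(data, 4, 10))      # 4th decile
print(value_at_position(data, 3, 4))       # 3rd quartile
print(value_at_position(scores, 20, 100))  # 20th percentile
print(value_at_position(scores, 7, 10))    # 7th decile
```

Note that statistical software often uses interpolation-based definitions, so library quantile functions may give slightly different values from this rule.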
SLIDE | 24
OUTLIERS
DEFINITION
An extremely high or extremely low data value when compared with the rest of the data values.
LOWER BOUNDARY (LOWER INNER FENCE): $Q_1 - 1.5 \times IQR$
UPPER BOUNDARY (UPPER INNER FENCE): $Q_3 + 1.5 \times IQR$
where $IQR = Q_3 - Q_1$
OUTLIER(S)
Any observation that lies outside the interval from the lower boundary to the upper boundary is regarded as a potential outlier.
*Note: If you are interested in data science, you may try to explore extreme outliers.
SLIDE | 25
EXERCISE 1.4.1
1. Given 19 2 1 4 3 7 5 4 6. Find outlier(s) if any.
Solution:
Rearrange the dataset in ascending order:
1 2 3 4 4 5 6 7 19
i. $Q_1 = 3$; $Q_3 = 6$
ii. Lower boundary: $3 - 1.5(6 - 3) = -1.5$
iii. Upper boundary: $6 + 1.5(6 - 3) = 10.5$
iv. Decision: Since 19 lies outside the range $[-1.5, 10.5]$, 19 is the outlier.
SLIDE | 26
2. Given 19 6 2 11 4 3 7 7 5 8 6 21 12. Find outlier(s) if any.
Solution:
Rearrange the dataset in ascending order:
2 3 4 5 6 6 7 7 8 11 12 19 21
i. $Q_1 = 5$; $Q_3 = 11$
ii. Lower boundary: $5 - 1.5(11 - 5) = -4$
iii. Upper boundary: $11 + 1.5(11 - 5) = 20$
iv. Decision: Since 21 lies outside the range $[-4, 20]$, 21 is the outlier.
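The outlier check in Exercise 1.4.1 can be sketched as follows. The function names are hypothetical, and the quartiles are assumed to follow the round-up rule of thumb from the measures of position slide; a library quantile function may use a different definition and therefore different fences.

```python
def quartile(sorted_data, i):
    """i-th quartile of already sorted data, using the round-up rule."""
    whole, remainder = divmod(len(sorted_data) * i, 4)
    if remainder:
        return sorted_data[whole]  # position rounded up
    return (sorted_data[whole - 1] + sorted_data[whole]) / 2


def inner_fences(data):
    """Return (lower boundary, upper boundary) = Q1 - 1.5 IQR, Q3 + 1.5 IQR."""
    xs = sorted(data)
    q1, q3 = quartile(xs, 1), quartile(xs, 3)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(data):
    """Return the observations lying outside the inner fences."""
    lower, upper = inner_fences(data)
    return [x for x in sorted(data) if x < lower or x > upper]


question_1 = [19, 2, 1, 4, 3, 7, 5, 4, 6]
question_2 = [19, 6, 2, 11, 4, 3, 7, 7, 5, 8, 6, 21, 12]

for name, dataset in (("Question 1", question_1), ("Question 2", question_2)):
    print(name, inner_fences(dataset), find_outliers(dataset))
```

In Question 2 the value 19 is inside the upper boundary of 20, so only 21 is flagged.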
SLIDE | 27
THE LEAST SENSITIVE MEASURES OF CENTRAL TENDENCY AND MEASURES OF VARIATION TO
OUTLIER(S) AND SKEWED DATA
MEASURES OF CENTRAL TENDENCY: WHO IS THE WINNER?
- MEAN
- MEDIAN
- MODE
- MIDRANGE
MEASURES OF VARIATION/DISPERSION: WHO IS THE WINNER?
- STANDARD DEVIATION
- VARIANCE
- COEFFICIENT OF VARIATION
- RANGE
- INTERQUARTILE RANGE
*Note: You may simulate data sets that are skewed or contain outliers, compute each measure before and after removing the outliers, and then compare the changes. The measure whose value changes the least is the most robust to skewness and outliers.
SLIDE | 28
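The before-and-after comparison suggested in the note can be sketched as follows. Rather than simulating new data, this sketch reuses the data set from Exercise 1.4.1, Question 1, where 19 was identified as the outlier. The helper names are hypothetical, and the quartiles are assumed to follow the round-up rule of thumb used in this chapter.

```python
from statistics import mean, median, stdev


def quartile(sorted_data, i):
    """i-th quartile of already sorted data, using the round-up rule."""
    whole, remainder = divmod(len(sorted_data) * i, 4)
    if remainder:
        return sorted_data[whole]
    return (sorted_data[whole - 1] + sorted_data[whole]) / 2


def summary(data):
    """Selected measures of central tendency and variation for one data set."""
    xs = sorted(data)
    return {
        "mean": mean(xs),
        "median": median(xs),
        "midrange": (xs[0] + xs[-1]) / 2,
        "std dev": stdev(xs),
        "range": xs[-1] - xs[0],
        "IQR": quartile(xs, 3) - quartile(xs, 1),
    }


with_outlier = [1, 2, 3, 4, 4, 5, 6, 7, 19]
without_outlier = [x for x in with_outlier if x != 19]

before = summary(with_outlier)
after = summary(without_outlier)

# Absolute change in each measure once the outlier is removed
changes = {name: abs(before[name] - after[name]) for name in before}
for name, change in changes.items():
    print(f"{name:9s} before={before[name]:.4f} after={after[name]:.4f} "
          f"change={change:.4f}")
```

For this particular data set the median and the IQR do not change at all, while the mean, midrange, standard deviation and range all shift noticeably.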
STEM AND LEAF PLOT (ADDITIONAL KNOWLEDGE)
DEFINITION
A device for presenting quantitative data in a graphical format, similar to a histogram, to assist in visualising the shape of the distribution.
[Figures: stem and leaf plot; mixture stem and leaf plot (back-to-back stem and leaf plot)]
SLIDE | 29
BOXPLOT
DEFINITION
A boxplot is a graphical representation of the five-number summary (minimum, 1st quartile, 2nd quartile (median), 3rd quartile, and maximum) of a data set, together with its outliers.
STEPS TO CONSTRUCT A BOXPLOT
STEP 1
Arrange the data in ascending order.
STEP 2 (I modified the step in the module)
Calculate $Q_1$ (1st quartile), $Q_2$ (median), and $Q_3$ (3rd quartile).
STEP 3 (I modified the step in the module)
Identify the presence of outlier(s).
Lower boundary: $Q_1 - 1.5 \times IQR$; Upper boundary: $Q_3 + 1.5 \times IQR$
STEP 4 (I modified the step in the module)
Determine the minimum and maximum values.
*Note: Outlier(s) cannot be taken into account as the minimum or maximum value.
STEP 5 (I modified the step in the module)
Sketch the boxplot based on the calculated five-number summary.
SLIDE | 30
EXERCISE 1.4.2
1. Plot a boxplot for the following data. Then, describe the data.
a) 3.2 5.9 4.3 6.9 4.5 8.0 4.7 8.9 5.7 11.9
b) 5.8 9.7 6.7 13.4 6.8 14.7 7.2 16.4 8.2 28.1
Solution:
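a) Rearrange the dataset in ascending order:
3.2 4.3 4.5 4.7 5.7 5.9 6.9 8.0 8.9 11.9
i. $Q_1 = 4.5$; $Q_2 = 5.8$; $Q_3 = 8.0$
ii. Lower boundary: $4.5 - 1.5(8.0 - 4.5) = -0.75$
iii. Upper boundary: $8.0 + 1.5(8.0 - 4.5) = 13.25$
iv. Decision: Since no observation lies outside the range $[-0.75, 13.25]$, there is no outlier. Therefore, minimum = 3.2 and maximum = 11.9.
[Figure: boxplot of data set (a)]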
SLIDE | 31
1. Plot a boxplot for the following data. Then, describe the data.
a) 3.2 5.9 4.3 6.9 4.5 8.0 4.7 8.9 5.7 11.9
b) 5.8 9.7 6.7 13.4 6.8 14.7 7.2 16.4 8.2 28.1
Solution:
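b) Rearrange the dataset in ascending order:
5.8 6.7 6.8 7.2 8.2 9.7 13.4 14.7 16.4 28.1
i. $Q_1 = 6.8$; $Q_2 = 8.95$; $Q_3 = 14.7$
ii. Lower boundary: $6.8 - 1.5(14.7 - 6.8) = -5.05$
iii. Upper boundary: $14.7 + 1.5(14.7 - 6.8) = 26.55$
iv. Decision: Since 28.1 lies outside the range $[-5.05, 26.55]$, 28.1 is the outlier. As a result, minimum = 5.8 and maximum = 16.4.
[Figure: boxplot of data set (b)]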
SLIDE | 32
2. Two samples of ten springs made of steel rods supplied by two different companies were compared. The measurement of
flexibility (in N/m) for each spring was recorded as follows.
Company A: 4.2 6.7 7.3 7.5 8.0 8.5 8.7 8.8 9.2 9.3
Company B: 9.6 9.7 9.8 9.9 10.1 10.2 11.0 11.0 11.0 11.1
Compare the distributions using boxplots. Then, give a comment on the flexibility of springs supplied by two different
companies.
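Solution:
i. Company A: $Q_1 = 7.3$; $Q_2 = 8.25$; $Q_3 = 8.8$
Company B: $Q_1 = 9.8$; $Q_2 = 10.15$; $Q_3 = 11.0$
ii.
Boundary | Company A | Company B
Lower boundary | $7.3 - 1.5(1.5) = 5.05$ | $9.8 - 1.5(1.2) = 8.00$
Upper boundary | $8.8 + 1.5(1.5) = 11.05$ | $11.0 + 1.5(1.2) = 12.80$
Presence of outlier(s) | 4.2 is the outlier | No outlier
SLIDE | 33
iii.
Five-number summary & outlier(s) | Company A | Company B
Minimum | 6.7 | 9.6
$Q_1$ | 7.3 | 9.8
$Q_2$ (median) | 8.25 | 10.15
$Q_3$ | 8.8 | 11.0
Maximum | 9.3 | 11.1
Outlier(s) | 4.2 | -
iv. [Figure: boxplots of Company A and Company B on a common axis from 4 to 12. Company A: left-skewed distribution; Company B: right-skewed distribution.]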
SLIDE | 34
COMMENTS
Average:
Since $\text{Median}_A = 8.25 < \text{Median}_B = 10.15$, the springs supplied by Company B have higher flexibility than those supplied by Company A.
Variability:
Since $IQR_A = 1.5 > IQR_B = 1.2$, the flexibility of the springs supplied by Company A is less consistent than that of Company B.
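The five-number summaries for the two companies can be reproduced with a short sketch of the boxplot steps. The function names are hypothetical, and the quartiles are assumed to follow the round-up rule of thumb from the measures of position slide. As in the steps above, outliers are excluded before the minimum and maximum are chosen.

```python
def quartile(sorted_data, i):
    """i-th quartile of already sorted data, using the round-up rule."""
    whole, remainder = divmod(len(sorted_data) * i, 4)
    if remainder:
        return sorted_data[whole]
    return (sorted_data[whole - 1] + sorted_data[whole]) / 2


def five_number_summary(data):
    """Five-number summary plus outliers, following the boxplot steps."""
    xs = sorted(data)                                          # Step 1
    q1, q2, q3 = (quartile(xs, i) for i in (1, 2, 3))          # Step 2
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr              # Step 3
    outliers = [x for x in xs if x < lower or x > upper]
    inside = [x for x in xs if lower <= x <= upper]            # Step 4
    return {
        "min": inside[0], "q1": q1, "median": q2, "q3": q3,
        "max": inside[-1], "outliers": outliers,
    }


company_a = [4.2, 6.7, 7.3, 7.5, 8.0, 8.5, 8.7, 8.8, 9.2, 9.3]
company_b = [9.6, 9.7, 9.8, 9.9, 10.1, 10.2, 11.0, 11.0, 11.0, 11.1]

summary_a = five_number_summary(company_a)
summary_b = five_number_summary(company_b)
print("Company A:", summary_a)
print("Company B:", summary_b)
```

The resulting dictionaries hold the values needed for Step 5, so they could be handed to any plotting tool to sketch the two boxplots on a common axis.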
SLIDE | 35
3. The following table presents the viscosity (in Pascal) of chemical substances from three (3) batches of the chemical
process.
Batches Viscosity
Batch A 13.3 14.1 14.3 14.5 14.5 14.6 14.8 15.2 15.3 15.3
Batch B 13.3 13.7 14.1 14.5 14.9 15.2 15.3 15.4 15.6 15.8
Batch C 13.4 13.7 14.1 14.3 14.3 14.8 15.1 15.8 16.4 16.9
a) Complete the table below by showing all the necessary calculations.

Measures of position Batch A Batch B Batch C
1st quartile 14.30 14.10
Median 14.55 14.55
3rd quartile 15.40 15.80
Outlier No No
b) Draw three boxplots on the same x-axis by using the information in (a).
c) Compare the boxplots in terms of shape, average and variability.
SLIDE | 36
Solution:
a)
Measures of position | Batch A | Batch B | Batch C
1st quartile | 14.30 | 14.10 | 14.10
Median | 14.55 | 15.05 | 14.55
3rd quartile | 15.20 | 15.40 | 15.80
Outlier | No | No | No
Outlier check for Batch B:
Lower boundary: $14.10 - 1.5(15.40 - 14.10) = 12.15$
Upper boundary: $15.40 + 1.5(15.40 - 14.10) = 17.35$
Since no observation lies outside the range $[12.15, 17.35]$, there is no outlier in the dataset of Batch B.
SLIDE | 37
Solution:
b)
Measures of position | Batch A | Batch B | Batch C
Minimum | 13.30 | 13.30 | 13.40
1st quartile | 14.30 | 14.10 | 14.10
Median | 14.55 | 15.05 | 14.55
3rd quartile | 15.20 | 15.40 | 15.80
Maximum | 15.30 | 15.80 | 16.90
Outlier | No | No | No
[Figure: boxplots of Batches A, B and C on a common axis from 13 to 17. Batch A: left-skewed distribution; Batch B: left-skewed distribution; Batch C: right-skewed distribution.]
SLIDE | 38
Solution:
c)
Shape: Batch A: Left-skewed distribution; Batch B: Left-skewed distribution; Batch C: Right-skewed distribution
Average:
Since $\text{Median}_A = \text{Median}_C = 14.55$, the average viscosity of the chemical substance is the same for Batches A and C. Conversely, the average viscosity for Batch B is higher than for Batches A and C. This is because $\text{Median}_B = 15.05 > 14.55$.
Variability:
Since $IQR_A = 0.90 < IQR_B = 1.30 < IQR_C = 1.70$, the viscosity of the chemical substance for Batch A is more consistent than for Batches B and C. In contrast, Batch C is the least consistent of the three batches.
SLIDE | 39
NORMAL PROBABILITY PLOT (A SPECIAL CASE OF THE Q-Q PLOT)
DEFINITION
The normal probability plot is a graphical technique for assessing whether or not a data set is approximately normally distributed.
STEPS TO PLOT A NORMAL PROBABILITY PLOT
STEP 1
Sort the data in ascending order and denote each sorted value as $x_{(i)}$, where $i = 1, 2, \ldots, n$.
STEP 2
Number the sorted data from $1$ to $n$.
STEP 3
Calculate the probability value for each $x_{(i)}$ from its order $i$ and the sample size $n$.
SLIDE | 40
EXERCISE 1.5
1. A sample of size of six is drawn. The sample is arranged in increasing order and given as follows.
3.01 3.35 4.79 5.96 7.89 9.15
Do these data appear to come from an approximately normal distribution?

Answer: Yes (Reason: The points approximately form a straight line.)
2. The following data represent the number of movies in Asia for the 14-year period.
2084 1497 1014 910 899 870 859 848 837 826 815 750 737 637
Does this data appear to come from an approximately normal distribution?

Answer: No (Reason: The points deviate from a straight line.)
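The two answers above can be checked numerically with a rough sketch of the plotting steps. Two assumptions are made here and are not taken from the slides: the probability value for the i-th sorted observation is taken as (i - 0.5)/n, which is one common plotting-position convention (the module may use a different one), and the straightness of the plot is summarised by the correlation between the sorted data and the corresponding standard normal quantiles. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean


def normal_plot_points(data):
    """Return (x, z) pairs for a normal probability plot.

    ASSUMPTION: probability value p_i = (i - 0.5) / n for the i-th sorted value.
    """
    xs = sorted(data)                       # Step 1: sort ascending
    n = len(xs)
    standard_normal = NormalDist()          # mean 0, standard deviation 1
    points = []
    for i, x in enumerate(xs, start=1):     # Step 2: number from 1 to n
        p = (i - 0.5) / n                   # Step 3: assumed probability value
        z = standard_normal.inv_cdf(p)      # matching standard normal quantile
        points.append((x, z))
    return points


def straightness(points):
    """Pearson correlation of the plotted points (1.0 is a perfect line)."""
    xs = [x for x, _ in points]
    zs = [z for _, z in points]
    x_bar, z_bar = mean(xs), mean(zs)
    sxz = sum((x - x_bar) * (z - z_bar) for x, z in points)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    szz = sum((z - z_bar) ** 2 for z in zs)
    return sxz / sqrt(sxx * szz)


sample = [3.01, 3.35, 4.79, 5.96, 7.89, 9.15]
movies = [2084, 1497, 1014, 910, 899, 870, 859, 848, 837, 826, 815, 750, 737, 637]

r_sample = straightness(normal_plot_points(sample))
r_movies = straightness(normal_plot_points(movies))
print("Question 1 correlation:", round(r_sample, 3))
print("Question 2 correlation:", round(r_movies, 3))
```

A correlation close to 1 is consistent with points lying near a straight line. The correlation is only a numerical aid; the decision in the exercise is still made by looking at the plot.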
SLIDE | 41
THANK YOU
END OF CHAPTER 1
SLIDE | 42
APPENDIX A
COMPUTE THE MEAN AND VARIANCE UTILISING CALCULATOR FX-570MS
YouTube Link
https://www.youtube.com/watch?v=whq2I09V11c
1st edition 2nd edition
SLIDE | 43
APPENDIX B
COMPUTE THE MEAN AND VARIANCE UTILISING CALCULATOR FX-570ES/FX-570ES PLUS
YouTube Link
https://www.youtube.com/watch?v=9CJItYX10fY&t=14s
SLIDE | 44
APPENDIX C
COMPUTE THE MEAN AND VARIANCE UTILISING CALCULATOR FX-570EX CLASSWIZ
YouTube Link
https://www.youtube.com/watch?v=K1OmgQOPG3o
SLIDE | 45
