
CHAPTER 1 (PART 2)

INTRODUCTION TO STATISTICS

SLIDE | 1
CONTENTS
COMBINES ALL THREE SUB-CHAPTERS (REASON: ALL OF THEM ARE EDA TECHNIQUES)

1.1 STATISTICAL TERMINOLOGIES
1.1.1 What is Statistics?
1.1.2 Why We Need Statistics?
1.1.3 Population and Sample
1.1.4 Descriptive and Inferential Statistics (Combined with 1.1.1)
1.1.5 Role of the Computer in Statistics

1.2 STATISTICAL PROBLEM-SOLVING METHODOLOGY
1.2.1 Identifying the Problem or Opportunity
1.2.2 Deciding on the Method of Data Collection
1.2.3 Collecting the Data (Sampling Techniques)
1.2.4 Classifying and Summarising the Data
1.2.5 Presenting and Analysing the Data
1.2.6 Making the Decision and Conclusion

1.3 REVIEWS ON DESCRIPTIVE STATISTICS
1.3.1 Measures of Central Tendency
1.3.2 Measures of Variation (Dispersion)
1.3.3 Measures of Position
1.3.4 Descriptive Statistics Using Microsoft Excel

1.4 EXPLORATORY DATA ANALYSIS
1.4.1 Outliers
1.4.2 Boxplot

1.5 NORMAL PROBABILITY PLOT

SLIDE | 2
1.3

REVIEWS ON DESCRIPTIVE
STATISTICS

SLIDE | 3
LEARNING OUTCOMES

Summarise the data using measures of central tendency, such as the mean, median, mode, and
midrange.

Describe the data using measures of variation, such as the range, variance, standard deviation,
and coefficient of variation.

Identify the position of a data value in a data set using measures of position such as quartiles,
deciles, and percentiles.

SLIDE | 4
1.4

EXPLORATORY DATA ANALYSIS

SLIDE | 5
LEARNING OUTCOMES

Identify outliers.

Draw and interpret a boxplot.

SLIDE | 6
1.5

NORMAL PROBABILITY PLOT

SLIDE | 7
LEARNING OUTCOMES

Check the normality assumption using the normal probability plot.


*Note: The normal probability plot is a special case of the quantile-quantile (Q-Q) plot. Most statistical software provides both the probability-probability (P-P) plot and the Q-Q plot for checking normality.

SLIDE | 8
EXPLORATORY DATA ANALYSIS (EDA)
EXPLORATORY DATA ANALYSIS
Definition: A process of utilising statistical tools such as numerical and graphical summaries to investigate data sets in order to
understand their important characteristics.

UNIVARIATE
• NON-GRAPHICAL
• GRAPHICAL

MULTIVARIATE
• NON-GRAPHICAL
• GRAPHICAL

This chapter limits the discussion to univariate analysis. Multivariate analysis is not discussed in this chapter.

SLIDE | 9
EXPLORATORY DATA ANALYSIS (EDA)
OVERVIEWS OF TECHNIQUES (CHAPTER 1)

NON-GRAPHICAL
• Measures of Central Tendency
  - Mean
  - Median
  - Mode
  - Midrange
• Measures of Variation/Dispersion
  - Standard deviation
  - Variance
  - Coefficient of Variation (CVar)
  - Range
  - IQR
• Measures of Position
  - Quartile
  - Decile
  - Percentile

GRAPHICAL
• Histogram (Please refer to Slides 38 & 39, Part 1)
• Stem-and-leaf plot (Removed from this course)
• Mixture stem-and-leaf plot (Bivariate analysis)
• Boxplot
• Parallel boxplot (Bivariate & Multivariate analysis)
• Normal probability plot

SLIDE | 10
EXPLORATORY DATA ANALYSIS (EDA)
MEASURES OF CENTRAL TENDENCY (LIMITED TO UNGROUPED DATA)

DEFINITION
A summary statistic that represents the centre point or typical value of a data set.

MEASURES OF CENTRAL TENDENCY

MEAN
POPULATION: μ = (Σ x_i) / N
SAMPLE: x̄ = (Σ x_i) / n

MEDIAN
Please refer to Slide 20 for the details.

MODE
The most commonly occurring value in a data series (the value with the highest frequency).
*Note: In some situations, there is no mode (the frequency of each observation equals 1). In other situations, there is more than one mode (several observations share the same frequency, and it is the highest compared to the rest).

MIDRANGE
Midrange = (minimum + maximum) / 2

*Note: The fx-570EX ClassWiz calculator can be used to double-check your answers. Please consult me if you cannot find a solution.

SLIDE | 11
EXPLORATORY DATA ANALYSIS (EDA)
MEASURES OF CENTRAL TENDENCY (LIMITED TO UNGROUPED DATA)
IDENTIFY THE SHAPE OF THE DISTRIBUTION

LEFT-SKEWED DISTRIBUTION
Order on the curve (left to right): Mean, Median, Mode
Mean < Median < Mode
Mean < Median
Mean < Mode (STPM)
This also indicates that the majority of the data observations are located on the right side.

SYMMETRICAL DISTRIBUTION
Mean = Median = Mode
Mean = Median
Mean = Mode (STPM)

RIGHT-SKEWED DISTRIBUTION
Order on the curve (left to right): Mode, Median, Mean
Mode < Median < Mean
Median < Mean
Mode < Mean (STPM)
This also indicates that the majority of the data observations are located on the left side.

SLIDE | 12
EXERCISE 1.3.1

1. Determine the shape of the distribution of the following data.

a) Mean = Mode = Median=11


Answer: Symmetrical distribution

b) Mean = 25, Mode = 13, Median =17


Answer: Right-skewed distribution; Reason: (Mode = 13) < (Median =17) < (Mean =25)

c) Mean = 5, Mode = 73, Median =17


Answer: Left-skewed; Reason: (Mean = 5) < (Median = 17) < (Mode =73)

d) 11.4 11.6 12.6 12.7 12.8 13.3 13.3 13.6 13.7 13.8
Answer: Left-skewed distribution; Reason: (Mean = 12.88) < (Median = 13.05) < (Mode = 13.3)
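
The comparison used in this exercise can be sketched in a few lines of Python. This is only an illustration under one assumption: the shape is decided from the mean and the median alone (the "Mean < Median" / "Median < Mean" rule). The function name classify_shape and the tolerance tol are hypothetical helpers, not part of the module.

```python
def classify_shape(mean, median, tol=1e-9):
    """Classify the shape of a distribution by comparing the mean and the median.

    Assumption: only the mean-versus-median rule is applied; the mode is ignored.
    """
    if abs(mean - median) <= tol:
        return "symmetrical"
    return "left-skewed" if mean < median else "right-skewed"


# Values taken from Exercise 1.3.1, Question 1
print(classify_shape(11, 11))        # part (a)
print(classify_shape(25, 17))        # part (b)
print(classify_shape(5, 17))         # part (c)
print(classify_shape(12.88, 13.05))  # part (d)
```

The tolerance is included only because means computed from decimal data are rarely exactly equal to the median in floating-point arithmetic.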

SLIDE | 13
EXERCISE 1.3.1 (CONTINUED)
2. The following set of data represents the number of hospitals for selected countries.

123 108 195 138 115 179 119 148 147 180 146 178 189

108 193 114 179 147 108 128 164 174 128 159 193 175

a) Find the mean, median, mode, and midrange.


Answer: Mean = 151.3462; Median = 147.5; Mode = 108; Midrange = (108 + 195)/2 = 151.5.

b) Are the average values calculated in (a) parameters or statistics? Why?


Answer: Statistics. The average values are computed from sample data.

c) What is the distribution type that describes the data?


Answer: Right-skewed distribution; Reason: (Mode =108) < (Median = 147.5) < (Mean = 151.3462).

d) What is the best measure of the average of this set of data? Why?
Answer: Median. This is because the sample data is a skewed dataset. Therefore, the best measure of central tendency
for the skewed dataset is the median.
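
As a cross-check of part (a), the four measures can be computed with Python's standard statistics module. This is a sketch for verification only; the variable names are illustrative, and the midrange is computed as (minimum + maximum) / 2, as in the answer above.

```python
import statistics

# Number of hospitals for selected countries (Exercise 1.3.1, Question 2)
hospitals = [
    123, 108, 195, 138, 115, 179, 119, 148, 147, 180, 146, 178, 189,
    108, 193, 114, 179, 147, 108, 128, 164, 174, 128, 159, 193, 175,
]

mean = statistics.mean(hospitals)
median = statistics.median(hospitals)
modes = statistics.multimode(hospitals)   # list of all values with the highest frequency
midrange = (min(hospitals) + max(hospitals)) / 2

print(f"Mean = {mean:.4f}, Median = {median}, Mode = {modes}, Midrange = {midrange}")
```

multimode is used instead of mode so that a data set with several modes returns all of them rather than only the first one encountered.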

SLIDE | 14
EXPLORATORY DATA ANALYSIS (EDA)
MEASURES OF VARIATION/DISPERSION (LIMITED TO UNGROUPED DATA)

DEFINITION
A summary statistic that can be used to describe the spread or dispersion of the data.

MEASURE OF VARIATION/DISPERSION

STANDARD DEVIATION
POPULATION: σ = √[ Σ(x_i − μ)² / N ]
SAMPLE: s = √[ Σ(x_i − x̄)² / (n − 1) ]

VARIANCE
POPULATION: σ² = Σ(x_i − μ)² / N
SAMPLE: s² = Σ(x_i − x̄)² / (n − 1)

COEFFICIENT OF VARIATION
POPULATION: CVar = (σ / μ) × 100%
SAMPLE: CVar = (s / x̄) × 100%

RANGE
Range = maximum − minimum

INTERQUARTILE RANGE
IQR = Q3 − Q1
where
Q1 = 1st quartile
Q3 = 3rd quartile

WHEN DO WE EMPLOY THE COEFFICIENT OF VARIATION (RELATIVE DISPERSION)?


Scenario 1: When the variables being compared are in different measurement units.
Scenario 2: When the means of the variables being compared differ widely, even though they share the same measurement unit.

SLIDE | 15
EXPLORATORY DATA ANALYSIS (EDA)
MEASURES OF VARIATION/DISPERSION (LIMITED TO UNGROUPED DATA)-COMPARISON

When s1 < s2 OR CVar1 < CVar2, this indicates that:

Data set 1                                  Data set 2
• Less dispersed                            • More dispersed
• Less spread                               • More spread
• Less variable (small variation)           • More variable (large variation)
• More consistent                           • Less consistent
• More precise                              • Less precise
• More accurate (Please ignore)             • Less accurate (Please ignore)
• Better                                    • Worse

Reason for ignoring "accurate": accuracy is based on the average, not the variation.

SLIDE | 16
FUNDAMENTALS OF
ACCURACY AND PRECISION
Accuracy is the difference between the true average and the observed average.
If the observed average differs from the true average, then the system is not accurate.

Precision is the degree to which repeated measurements under unchanged conditions show the same result. In other words, precision refers to the closeness of two or more measurements to each other.

SLIDE | 17
EXERCISE 1.3.2

1. Which of the following sets of sample data is less variable?

Method A: 79 73 78 76 80 75 82 70 77
Method B: 80 85 78 79 75 73 70 60 65

Answer: Method A. Reason: s_A = 3.6742 < s_B = 7.8493

2. The following sets of sample data represent the battery lifetime (in hours) of two different brands. Which brand of battery performed better?

A: 4.2 6.7 7.3 7.5 8.0 8.5 8.7 8.8 9.2 9.3
B: 9.6 9.7 9.8 9.9 10.1 10.2 11.0 11.0 11.0 11.1

Answer: Brand B. Reason: s_A = 1.5296 hours > s_B = 0.6150 hours
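
The sample standard deviations quoted in Questions 1 and 2 can be checked with statistics.stdev, which divides by n − 1 (the sample formula). This is a verification sketch only; the dictionary layout and names are illustrative.

```python
import statistics

samples = {
    "Method A": [79, 73, 78, 76, 80, 75, 82, 70, 77],
    "Method B": [80, 85, 78, 79, 75, 73, 70, 60, 65],
    "Battery A": [4.2, 6.7, 7.3, 7.5, 8.0, 8.5, 8.7, 8.8, 9.2, 9.3],
    "Battery B": [9.6, 9.7, 9.8, 9.9, 10.1, 10.2, 11.0, 11.0, 11.0, 11.1],
}

# statistics.stdev uses the sample formula (denominator n - 1);
# statistics.pstdev would give the population version (denominator N).
sample_sd = {name: statistics.stdev(values) for name, values in samples.items()}

for name, s in sample_sd.items():
    print(f"{name}: s = {s:.4f}")
```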

SLIDE | 18
EXERCISE 1.3.2 (CONTINUED)

3. The average age of the accountants at a large company is 31 years, with a standard deviation of 4 years. The average salary of the accountants is RM 44255 per year, with a standard deviation of RM 780. Compare the variations of age and income.

Solution:
Based on the question, the two variables have different measurement units. Hence, the CVar is needed to compare the variation.

CVar(age) = (4 / 31) × 100% = 12.90%
CVar(income) = (780 / 44255) × 100% = 1.76%

Income is less variable than age. This is because CVar(income) = 1.76% < CVar(age) = 12.90%.
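
The coefficient of variation for this question can be reproduced as follows. This is a small sketch assuming CVar = (standard deviation / mean) × 100%; the function name cvar is illustrative.

```python
def cvar(std_dev, mean):
    """Coefficient of variation, expressed as a percentage."""
    return std_dev / mean * 100


cvar_age = cvar(4, 31)           # years
cvar_income = cvar(780, 44255)   # RM per year

print(f"CVar(age)    = {cvar_age:.2f}%")
print(f"CVar(income) = {cvar_income:.2f}%")
print("Less variable:", "income" if cvar_income < cvar_age else "age")
```

Because the CVar is unit-free, the two percentages can be compared directly even though one variable is measured in years and the other in RM.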

SLIDE | 19
EXERCISE 1.3.2 (CONTINUED)

4. Identify each situation as either accurate or precise or both.

a) If you are playing football and you always hit the left goal post instead of scoring.
Answer: Not accurate, but precise.

b) A candy manufacturer claims that each packet contains 20 candies. A sample of packets has 18, 21, 19, 21, 19, 20, and 22 candies, respectively. The average is 20 candies with an error of 1 candy.
Answer: Precise*, while accuracy cannot be identified. This is because the population mean is not given in the question.
*Note: Whether the sample is precise is also a subjective decision, as the sample standard deviation is s = 1.4142 candies.

c) A manufacturer claims that each chocolate packet contains 20 chocolates. A sample of packets has 17, 18, 18, 17, 18, 17, and 17 chocolates, respectively.
Answer: Precise, while accuracy cannot be identified. This is because the population mean is not given in the question.

SLIDE | 20
EXERCISE 1.3.2 (CONTINUED)

4. Identify each situation as either accurate or precise or both.

d) In an experiment, with five trials, the end results of the five trials are 35 kg, 36 kg, 36 kg, 35 kg, and 36 kg. The actual
value (as found in a scientific data book) is meant to be 42 kg.
Answer: Precise but not accurate.

e) In an experiment, with five trials, the average value is 35 kg. The actual value (as found in a scientific data book) is meant
to be 35 kg.
Answer: Precise and accurate.
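
For part (d), the two ideas can be put into numbers. This is a sketch under two assumptions that are not stated in the module: accuracy is summarised by the absolute difference between the observed average and the true value, and precision is summarised by the sample standard deviation of the repeated trials. No cut-off for "accurate" or "precise" is implied.

```python
import statistics

trials = [35, 36, 36, 35, 36]   # kg, results of the five trials in part (d)
true_value = 42                 # kg, value from the scientific data book

observed_mean = statistics.mean(trials)
bias = abs(observed_mean - true_value)   # large bias -> not accurate
spread = statistics.stdev(trials)        # small spread -> precise

print(f"Observed mean = {observed_mean} kg")
print(f"Difference from true value = {bias:.1f} kg")
print(f"Sample standard deviation = {spread:.4f} kg")
```

The trials sit close to each other (small spread) but far from 42 kg (large difference), which matches the answer "precise but not accurate".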

SLIDE | 21
EXPLORATORY DATA ANALYSIS (EDA)
MEASURES OF POSITION (LIMITED TO UNGROUPED DATA)
DEFINITION
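A summary statistic that can be used to determine the position of a single value in relation to the other values.

MEASURES OF POSITION

QUARTILE: position c = kn / 4 for the k-th quartile (k = 1, 2, 3)
DECILE: position c = kn / 10 for the k-th decile (k = 1, 2, ..., 9)
PERCENTILE: position c = kn / 100 for the k-th percentile (k = 1, 2, ..., 99)
Here n is the number of observations and x_c is the c-th observation of the data sorted in ascending order.

RULE OF THUMB
Scenario 1: If c is not a whole number, round it up to the next whole number and use x_c.
Scenario 2: If c is a whole number, then use the average of x_c and x_(c+1).
COMMON MISTAKE IN MY PREVIOUS CLASS
They assume the subscript of x_c is the value of the observation. In statistics theory, it represents the order of the observation, but NOT the value of the observation.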

*Note: Decile is widely utilised in social sciences such as finance and economics.

SLIDE | 22
EXERCISE 1.3.3

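1. Given a set of data as 9 2 1 4 3 7 5 4 6.

a) Find the value corresponding to the 4th decile.


b) Find the value corresponding to the 3rd quartile.

Solution:
Rearrange the dataset in ascending order (n = 9):

1 2 3 4 4 5 6 7 9

a) Position c = 4(9)/10 = 3.6, which is not a whole number, so round up to 4. The 4th decile is the 4th observation: D4 = 4.
b) Position c = 3(9)/4 = 6.75, which is not a whole number, so round up to 7. The 3rd quartile is the 7th observation: Q3 = 6.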

SLIDE | 23
EXERCISE 1.3.3 (CONTINUED)

2. A teacher gives a 25-point test to ten students. The scores are shown below.

9 22 11 14 13 3 7 15 18 16

a) Find the score corresponding to the 20th percentile.


b) Find the score corresponding to the 7th decile.

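Solution:
Rearrange the dataset in ascending order (n = 10):

3 7 9 11 13 14 15 16 18 22

a) Position c = 20(10)/100 = 2, which is a whole number, so average the 2nd and 3rd observations: P20 = (7 + 9)/2 = 8.
b) Position c = 7(10)/10 = 7, which is a whole number, so average the 7th and 8th observations: D7 = (15 + 16)/2 = 15.5.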
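
The position rule used in Exercise 1.3.3 (round up when the position is not a whole number, otherwise average that observation and the next one) can be sketched as a small Python function. The function name position_value and its parameters are illustrative, not from the module. Note that this rule differs from the interpolation methods used by many software packages, so their results may not match.

```python
def position_value(data, k, parts):
    """Return the k-th quantile when the sorted data are split into `parts` parts.

    parts = 4 gives quartiles, 10 gives deciles, 100 gives percentiles.
    The position is c = n * k / parts (1-based).
    """
    if not data:
        raise ValueError("data must not be empty")
    if not 0 < k < parts:
        raise ValueError("k must satisfy 0 < k < parts")
    x = sorted(data)
    # Integer arithmetic avoids floating-point problems when testing for a whole number.
    whole, remainder = divmod(len(x) * k, parts)
    if remainder:
        # c is not whole: round up to position whole + 1 (0-based index `whole`).
        return x[whole]
    # c is whole: average the observations at positions c and c + 1.
    return (x[whole - 1] + x[whole]) / 2


data_q1 = [9, 2, 1, 4, 3, 7, 5, 4, 6]
scores = [9, 22, 11, 14, 13, 3, 7, 15, 18, 16]

print(position_value(data_q1, 4, 10))   # 4th decile, Question 1(a)
print(position_value(data_q1, 3, 4))    # 3rd quartile, Question 1(b)
print(position_value(scores, 20, 100))  # 20th percentile, Question 2(a)
print(position_value(scores, 7, 10))    # 7th decile, Question 2(b)
```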

SLIDE | 24
EXPLORATORY DATA ANALYSIS (EDA)
OUTLIERS

DEFINITION
An extremely high or extremely low data value when compared with the rest of the data values.

LOWER BOUNDARY (LOWER INNER FENCE)
Lower boundary = Q1 − 1.5 × IQR

UPPER BOUNDARY (UPPER INNER FENCE)
Upper boundary = Q3 + 1.5 × IQR

where IQR = Q3 − Q1

OUTLIER(S)
Any observation that lies outside the range [Q1 − 1.5 × IQR, Q3 + 1.5 × IQR] is considered a potential outlier.

*Note: If you are interested in data science, you may try to explore the extreme outliers.

SLIDE | 25
EXERCISE 1.4.1

1. Given 19 2 1 4 3 7 5 4 6. Find outlier(s) if any.

Solution:
Rearrange the dataset in ascending order:

1 2 3 4 4 5 6 7 19

i. Q1 = 3 (3rd observation); Q3 = 6 (7th observation); IQR = 6 − 3 = 3
ii. Lower boundary: 3 − 1.5(3) = −1.5

iii. Upper boundary: 6 + 1.5(3) = 10.5

iv. Decision: Since 19 lies outside the range [−1.5, 10.5], 19 is an outlier.

SLIDE | 26
EXERCISE 1.4.1 (CONTINUED)

2. Given 19 6 2 11 4 3 7 7 5 8 6 21 12. Find outlier(s) if any.

Solution:
Rearrange the dataset in ascending order:

2 3 4 5 6 6 7 7 8 11 12 19 21

i. Q1 = 5 (4th observation); Q3 = 11 (10th observation); IQR = 11 − 5 = 6
ii. Lower boundary: 5 − 1.5(6) = −4

iii. Upper boundary: 11 + 1.5(6) = 20

iv. Decision: Since 21 lies outside the range [−4, 20], 21 is an outlier.
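
The outlier check in Exercise 1.4.1 can be sketched in Python as follows. It assumes the boundaries Q1 − 1.5 × IQR and Q3 + 1.5 × IQR and the round-up/average position rule for the quartiles; the function names quartile and find_outliers are illustrative only.

```python
def quartile(sorted_x, k):
    """k-th quartile (k = 1, 2, 3) of already-sorted data, position c = n * k / 4."""
    whole, remainder = divmod(len(sorted_x) * k, 4)
    if remainder:                      # c is not whole: round up
        return sorted_x[whole]
    return (sorted_x[whole - 1] + sorted_x[whole]) / 2   # c is whole: average two neighbours


def find_outliers(data):
    """Return (lower boundary, upper boundary, list of potential outliers)."""
    x = sorted(data)
    q1, q3 = quartile(x, 1), quartile(x, 3)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return lower, upper, [v for v in x if v < lower or v > upper]


print(find_outliers([19, 2, 1, 4, 3, 7, 5, 4, 6]))                  # Question 1
print(find_outliers([19, 6, 2, 11, 4, 3, 7, 7, 5, 8, 6, 21, 12]))   # Question 2
```

In Question 2 the value 19 stays inside the upper boundary, so only 21 is flagged.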

SLIDE | 27
EXPLORATORY DATA ANALYSIS (EDA)
THE LEAST SENSITIVE MEASURES OF CENTRAL TENDENCY AND MEASURES OF VARIATION TO
OUTLIER(S) AND SKEWED DATA

MEASURES OF CENTRAL TENDENCY: WHO IS THE WINNER?
• MEAN
• MEDIAN
• MODE
• MIDRANGE

MEASURES OF VARIATION/DISPERSION: WHO IS THE WINNER?
• STANDARD DEVIATION
• VARIANCE
• COEFFICIENT OF VARIATION
• RANGE
• INTERQUARTILE RANGE

*Note: You may try to simulate a data set that is skewed or contains outliers. Compute each measure before and after removing the outliers, then compare the changes in the values. The measures with the smallest changes are the most robust to skewness and outliers.

SLIDE | 28
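
The before-and-after comparison suggested in the note on robustness can be tried with a few lines of Python. This sketch reuses the data of Exercise 1.4.1, Question 1, where 19 was identified as the outlier; the helper names and the choice of measures are illustrative, and the quartiles follow the round-up/average position rule.

```python
import statistics


def quartile(sorted_x, k):
    """k-th quartile of already-sorted data, position c = n * k / 4."""
    whole, remainder = divmod(len(sorted_x) * k, 4)
    if remainder:
        return sorted_x[whole]
    return (sorted_x[whole - 1] + sorted_x[whole]) / 2


def summary(data):
    x = sorted(data)
    return {
        "mean": statistics.mean(x),
        "median": statistics.median(x),
        "midrange": (x[0] + x[-1]) / 2,
        "std dev": statistics.stdev(x),
        "range": x[-1] - x[0],
        "IQR": quartile(x, 3) - quartile(x, 1),
    }


with_outlier = [1, 2, 3, 4, 4, 5, 6, 7, 19]
without_outlier = [v for v in with_outlier if v != 19]

before, after = summary(with_outlier), summary(without_outlier)
changes = {name: abs(before[name] - after[name]) for name in before}

for name, change in changes.items():
    print(f"{name:>9}: before = {before[name]:.4f}, after = {after[name]:.4f}, change = {change:.4f}")
```

The measures whose values barely move when 19 is removed are the ones least sensitive to the outlier in this particular data set; a different data set may show different sizes of change.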
EXPLORATORY DATA ANALYSIS (EDA)
STEM AND LEAF PLOT (ADDITIONAL KNOWLEDGE)

DEFINITION
A device for presenting quantitative data in a graphical format, similar to a histogram, to assist in visualising the shape of the distribution.

[Figures: a stem-and-leaf plot; a mixture stem-and-leaf plot (back-to-back stem-and-leaf plot)]

SLIDE | 29
EXPLORATORY DATA ANALYSIS (EDA)
BOXPLOT
DEFINITION
A boxplot is a graphical representation of the five-number summary (minimum, 1st quartile, 2nd quartile (median), 3rd quartile, and maximum) of a data set, together with any outliers.

STEPS TO CONSTRUCT A BOXPLOT

STEP 1
Arrange the data in ascending order.

STEP 2 (I modified the step in the module)
Calculate Q1 (1st quartile), Q2 (median), and Q3 (3rd quartile).

STEP 3 (I modified the step in the module)
Identify the presence of the outlier(s).
Lower Boundary: Q1 − 1.5 × IQR; Upper Boundary: Q3 + 1.5 × IQR

STEP 4 (I modified the step in the module)
Determine the minimum and maximum values.
*Note: Outlier(s) cannot be taken into account as minimum and maximum values.

STEP 5 (I modified the step in the module)
Sketch the boxplot based on the calculated five-number summary.

SLIDE | 30
EXERCISE 1.4.2
1. Plot a boxplot for the following data. Then, describe the data.

a) 3.2 5.9 4.3 6.9 4.5 8.0 4.7 8.9 5.7 11.9
b) 5.8 9.7 6.7 13.4 6.8 14.7 7.2 16.4 8.2 28.1

Solution:
a) Rearrange the dataset in ascending order:

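3.2 4.3 4.5 4.7 5.7 5.9 6.9 8.0 8.9 11.9

i. Q1 = 4.5 (3rd observation); Q2 = (5.7 + 5.9)/2 = 5.8; Q3 = 8.0 (8th observation); IQR = 8.0 − 4.5 = 3.5
ii. Lower boundary: 4.5 − 1.5(3.5) = −0.75

iii. Upper boundary: 8.0 + 1.5(3.5) = 13.25

iv. Decision: Since no observation lies outside the range [−0.75, 13.25], there is no outlier. Therefore, minimum = 3.2 and maximum = 11.9.

[Figure: boxplot of data set (a)]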

SLIDE | 31
EXERCISE 1.4.2 (CONTINUED)
1. Plot a boxplot for the following data. Then, describe the data.

a) 3.2 5.9 4.3 6.9 4.5 8.0 4.7 8.9 5.7 11.9
b) 5.8 9.7 6.7 13.4 6.8 14.7 7.2 16.4 8.2 28.1

Solution:
b) Rearrange the dataset in ascending order:

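5.8 6.7 6.8 7.2 8.2 9.7 13.4 14.7 16.4 28.1

i. Q1 = 6.8 (3rd observation); Q2 = (8.2 + 9.7)/2 = 8.95; Q3 = 14.7 (8th observation); IQR = 14.7 − 6.8 = 7.9
ii. Lower boundary: 6.8 − 1.5(7.9) = −5.05

iii. Upper boundary: 14.7 + 1.5(7.9) = 26.55

iv. Decision: Since 28.1 lies outside the range [−5.05, 26.55], 28.1 is an outlier. As a result, minimum = 5.8 and maximum = 16.4.

[Figure: boxplot of data set (b)]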
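
The boxplot construction steps (sort, find the quartiles, screen for outliers, then take the minimum and maximum from the remaining observations) can be sketched in Python. The function names are illustrative, and the quartiles use the round-up/average position rule, so plotting libraries that interpolate quartiles may draw slightly different boxes.

```python
def quartile(sorted_x, k):
    """k-th quartile of already-sorted data, position c = n * k / 4."""
    whole, remainder = divmod(len(sorted_x) * k, 4)
    if remainder:
        return sorted_x[whole]
    return (sorted_x[whole - 1] + sorted_x[whole]) / 2


def boxplot_summary(data):
    """Five-number summary with outliers excluded from the minimum and maximum."""
    x = sorted(data)
    q1, q2, q3 = (quartile(x, k) for k in (1, 2, 3))
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in x if lower <= v <= upper]
    outliers = [v for v in x if v < lower or v > upper]
    return {"min": inside[0], "Q1": q1, "Q2": q2, "Q3": q3,
            "max": inside[-1], "outliers": outliers}


data_a = [3.2, 5.9, 4.3, 6.9, 4.5, 8.0, 4.7, 8.9, 5.7, 11.9]
data_b = [5.8, 9.7, 6.7, 13.4, 6.8, 14.7, 7.2, 16.4, 8.2, 28.1]

summary_a = boxplot_summary(data_a)
summary_b = boxplot_summary(data_b)
print(summary_a)
print(summary_b)
```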

SLIDE | 32
EXERCISE 1.4.2 (CONTINUED)
2. Two samples of ten springs made of steel rods supplied by two different companies were compared. The measurement of
flexibility (in N/m) for each spring was recorded as follows.
Company A: 4.2 6.7 7.3 7.5 8.0 8.5 8.7 8.8 9.2 9.3
Company B: 9.6 9.7 9.8 9.9 10.1 10.2 11.0 11.0 11.0 11.1
Compare the distributions using boxplots. Then, give a comment on the flexibility of springs supplied by two different
companies.
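Solution:
i. Company A: Q1 = 7.3; Q3 = 8.8; IQR = 8.8 − 7.3 = 1.5
   Company B: Q1 = 9.8; Q3 = 11.0; IQR = 11.0 − 9.8 = 1.2
ii.
                           Company A                  Company B
Lower boundary             7.3 − 1.5(1.5) = 5.05      9.8 − 1.5(1.2) = 8.0
Upper boundary             8.8 + 1.5(1.5) = 11.05     11.0 + 1.5(1.2) = 12.8
Presence of outlier(s)     4.2 is the outlier         No outlier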

SLIDE | 33
EXERCISE 1.4.2 (CONTINUED)
iii. Five-number summary & Outlier(s)

                  Company A    Company B
Minimum           6.7          9.6
Q1                7.3          9.8
Q2 (median)       8.25         10.15
Q3                8.8          11.0
Maximum           9.3          11.1
Outlier(s)        4.2          -

iv. [Figure: boxplots of Company A and Company B drawn on a common axis from 4 to 12]
Company A: Left-skewed distribution
Company B: Right-skewed distribution

SLIDE | 34
EXERCISE 1.4.2 (CONTINUED)
COMMENTS
Average:
Since median of A = 8.25 < median of B = 10.15, the springs supplied by Company B have higher flexibility than those supplied by Company A.
Variability:
Since IQR of A = 1.5 > IQR of B = 1.2, the flexibility of the springs supplied by Company A is less consistent than that of Company B.

SLIDE | 35
EXERCISE 1.4.2 (CONTINUED)
3. The following table presents the viscosity (in Pascal) of chemical substances from three (3) batches of the chemical
process.
Batches Viscosity

Batch A 13.3 14.1 14.3 14.5 14.5 14.6 14.8 15.2 15.3 15.3

Batch B 13.3 13.7 14.1 14.5 14.9 15.2 15.3 15.4 15.6 15.8

Batch C 13.4 13.7 14.1 14.3 14.3 14.8 15.1 15.8 16.4 16.9

a) Complete the table below by showing all the necessary calculations.


Measures of position Batch A Batch B Batch C
1st quartile 14.30 14.10
Median 14.55 14.55
3rd quartile 15.40 15.80
Outlier No No
b) Draw three boxplots on the same x-axis by using the information in (a).
c) Compare the boxplots in terms of shape, average and variability.

SLIDE | 36
EXERCISE 1.4.2 (CONTINUED)

Solution:
a)
Measures of position Batch A Batch B Batch C
1st quartile 14.30 14.10 14.10
Median 14.55 15.05 14.55
3rd quartile 15.20 15.40 15.80
Outlier No No No

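Outlier check for Batch B (IQR = 15.40 − 14.10 = 1.30):

            Lower Boundary               Upper Boundary
Batch B     14.10 − 1.5(1.30) = 12.15    15.40 + 1.5(1.30) = 17.35

Since no observation lies outside the range [12.15, 17.35], there is no outlier in the dataset of Batch B.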

SLIDE | 37
EXERCISE 1.4.2 (CONTINUED)
Solution:
b)
Measures of position    Batch A    Batch B    Batch C
Minimum                 13.30      13.30      13.40
1st quartile            14.30      14.10      14.10
Median                  14.55      15.05      14.55
3rd quartile            15.20      15.40      15.80
Maximum                 15.30      15.80      16.90
Outlier                 No         No         No

[Figure: boxplots of Batches A, B and C drawn on a common axis from 13 to 17]
Batch A: Left-skewed distribution
Batch B: Left-skewed distribution
Batch C: Right-skewed distribution

SLIDE | 38
EXERCISE 1.4.2 (CONTINUED)
Solution:
c)
Shape: Batch A: Left-skewed distribution; Batch B: Left-skewed distribution; Batch C: Right-skewed distribution

Average:
Since median of A = median of C = 14.55, the average viscosity of the chemical substance is equivalent for Batches A and C. Conversely, the average viscosity for Batch B is higher than for Batches A and C. This is because median of B = 15.05 > 14.55.

Variability:
Since IQR of A = 0.90 < IQR of B = 1.30 < IQR of C = 1.70, the viscosity of the chemical substance for Batch A is more consistent than for Batches B and C. In contrast, Batch C is the least consistent of the three batches.
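
The comparison in part (c) rests on the median (average) and the IQR (variability) of each batch. The sketch below recomputes both for the three batches; the helper name quartile and the dictionary layout are illustrative, and the quartiles follow the round-up/average position rule.

```python
def quartile(sorted_x, k):
    """k-th quartile of already-sorted data, position c = n * k / 4."""
    whole, remainder = divmod(len(sorted_x) * k, 4)
    if remainder:
        return sorted_x[whole]
    return (sorted_x[whole - 1] + sorted_x[whole]) / 2


batches = {
    "A": [13.3, 14.1, 14.3, 14.5, 14.5, 14.6, 14.8, 15.2, 15.3, 15.3],
    "B": [13.3, 13.7, 14.1, 14.5, 14.9, 15.2, 15.3, 15.4, 15.6, 15.8],
    "C": [13.4, 13.7, 14.1, 14.3, 14.3, 14.8, 15.1, 15.8, 16.4, 16.9],
}

results = {}
for name, values in batches.items():
    x = sorted(values)
    q1, q2, q3 = (quartile(x, k) for k in (1, 2, 3))
    # Rounding only tidies floating-point noise such as 0.8999999999999986.
    results[name] = {"Q1": q1, "median": round(q2, 2), "Q3": q3, "IQR": round(q3 - q1, 2)}
    print(f"Batch {name}: {results[name]}")

most_consistent = min(results, key=lambda b: results[b]["IQR"])
least_consistent = max(results, key=lambda b: results[b]["IQR"])
print("Most consistent:", most_consistent, "| Least consistent:", least_consistent)
```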

SLIDE | 39
EXPLORATORY DATA ANALYSIS (EDA)
NORMAL PROBABILITY PLOT (A SPECIAL CASE FOR Q-Q PLOT)

DEFINITION
The normal probability plot is a graphical technique for assessing whether or not a data set is approximately normally distributed.

STEPS TO PLOT NORMAL PROBABILITY PLOT

STEP 1
Sort the data in ascending order and denote each sorted value as x_i, where i = 1, 2, ..., n.

STEP 2
Number the sorted data from 1 to n.

STEP 3
Calculate the probability value for each using .

SLIDE | 40
EXERCISE 1.5
1. A sample of size six is drawn. The sample is arranged in increasing order and given as follows.

3.01 3.35 4.79 5.96 7.89 9.15

Do these data appear to come from an approximately normal distribution?


Answer: Yes (Reason: The points approximately form a straight line.)

2. The following data represent the number of movies in Asia for the 14-year period.

2084 1497 1014 910 899 870 859 848 837 826 815 750 737 637

Do these data appear to come from an approximately normal distribution?


Answer: No (Reason: The points deviate from a straight line.)
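
The coordinates of a normal probability plot for the two data sets can be computed as sketched below. Two points are ASSUMPTIONS and not taken from the slides: the probability value for the i-th sorted observation is taken as (i − 0.5)/n, which is one common convention, and the correlation between the data and the normal quantiles is used as a rough numerical stand-in for "how straight the points look". No formal cut-off is implied; the visual judgement of the plot remains the method of this chapter.

```python
from statistics import NormalDist, mean


def normal_plot_points(data):
    """Return (x_i, z_i) pairs: sorted data against standard normal quantiles."""
    x = sorted(data)
    n = len(x)
    # ASSUMPTION: plotting position (i - 0.5) / n for i = 1, ..., n.
    probs = [(i - 0.5) / n for i in range(1, n + 1)]
    z = [NormalDist().inv_cdf(p) for p in probs]
    return list(zip(x, z))


def straightness(points):
    """Pearson correlation between the data and the normal quantiles."""
    xs, zs = zip(*points)
    mx, mz = mean(xs), mean(zs)
    sxz = sum((a - mx) * (b - mz) for a, b in points)
    sxx = sum((a - mx) ** 2 for a in xs)
    szz = sum((b - mz) ** 2 for b in zs)
    return sxz / (sxx * szz) ** 0.5


sample_1 = [3.01, 3.35, 4.79, 5.96, 7.89, 9.15]
sample_2 = [2084, 1497, 1014, 910, 899, 870, 859, 848, 837, 826, 815, 750, 737, 637]

r1 = straightness(normal_plot_points(sample_1))
r2 = straightness(normal_plot_points(sample_2))
print(f"Question 1: r = {r1:.3f}")
print(f"Question 2: r = {r2:.3f}")
```

Plotting the pairs from normal_plot_points (data on one axis, normal quantiles on the other) gives the picture to inspect; a value of r closer to 1 corresponds to points lying closer to a straight line.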

SLIDE | 41
THANK YOU
END OF CHAPTER 1

SLIDE | 42
APPENDIX A
COMPUTE THE MEAN AND VARIANCE UTILISING CALCULATOR FX-570MS

YouTube Link
https://www.youtube.com/watch?v=whq2I09V11c

1st edition 2nd edition

SLIDE | 43
APPENDIX B
COMPUTE THE MEAN AND VARIANCE UTILISING CALCULATOR FX-570ES/FX-570ES PLUS

YouTube Link
https://www.youtube.com/watch?v=9CJItYX10fY&t=14s

SLIDE | 44
APPENDIX C
COMPUTE THE MEAN AND VARIANCE UTILISING CALCULATOR FX-570EX CLASSWIZ

YouTube Link
https://www.youtube.com/watch?v=K1OmgQOPG3o

SLIDE | 45
