Stat L2 2021 Fall

BHSC230

Research Methods I:
Statistics for Behavioral Science
3 semester credits, Fall 2021

Lecture 2
Measures of central tendency

Instructor : Kelvin Ng
Email : teacherkelvinng@gmail.com
Timetable

Time : 1:45-4:30pm
Recommended Textbook

• Freund & Perles (2004). Statistics: A first course, 8th ed., Pearson.
• Coolican, H. (2004). Research Methods and Statistics in Psychology, 4th ed., Hodder & Stoughton.
• or any introductory statistics textbook
• or equivalent
Assessment

• There will be quizzes, exercises and exams, and possibly a presentation.
• Generally, there will be a quiz roughly every 3-4 weeks, including the mid-term and final examinations.
• The first will be in Sept/Oct.
• Attendance (10%)
• Attendance is taken several times during class.
• Whenever the number of absences exceeds 20 percent of the total course appointments, the instructor may give a failing grade.
• Don’t come late or leave early. Being tardy or leaving early (15 mins) 3 times equals one absence.
• Class participation (10%)
• e.g. answering or asking questions
Class rule

• No computer or phone is needed in this class.
• All should be turned off.
• A calculator is needed.
Course Outline
1. Probability concepts
2. measure of central tendency
3. measures of variation
4. frequency distributions
5. using frequency distributions
6. point-estimation and confidence intervals
7. sampling distribution
8. levels of significance in hypothesis testing
9. t and z tests, ANOVA
10. chi-square
11. correlation
12. regression
This week

• Overview
• Types of data
• Measures of central tendency
Types of data

A. Nominal data (= or ≠)
B. Ordinal data (< or >)
C. Interval data (+ or −)
D. Ratio data (× or ÷)
Types of data
A. Nominal data
(e.g. marital status; jersey numbers of football players; freshman, sophomore, junior or senior; hot or cold)
• Weakest level of data
• Can only tell whether the data are the same or not (= or ≠).
• Clearly shows the number of respondents in every category.
• The emphasis is on the frequency of each category.
• Cannot do any arithmetic on the data
• e.g. freshman + sophomore = senior?!
• Not suitable for calculating an average.
• e.g. M=1, F=2
• Average gender for this class is 1.56 ?!
• 4 M and 5 F
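
A minimal Python sketch of the point above, assuming a hypothetical class of 4 males and 5 females with the arbitrary codes M=1, F=2. Counting each category is meaningful for nominal data; averaging the codes produces a number that describes nobody.

```python
from collections import Counter

# Hypothetical class: 4 males and 5 females; the codes are arbitrary labels.
genders = ["M"] * 4 + ["F"] * 5
codes = {"M": 1, "F": 2}

# Meaningful for nominal data: the frequency of each category.
frequency = Counter(genders)
print(frequency)

# Computable, but meaningless: the "average gender" of the coded labels.
mean_code = sum(codes[g] for g in genders) / len(genders)
print(round(mean_code, 2))
```

Swapping the codes to M=2, F=1 would change the "average" without changing the class, which is another sign that the number carries no information.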
B. Ordinal data
• Allows minimal operations on the data.
• Can be rank ordered, i.e. less than or greater than (< or >)
• e.g. happier than, more painful, harder, louder; champion, first runner-up, second runner-up
• Can tell that one level is ‘better’ than another level
• e.g. Likert scale 5, 4, 3, 2, 1
• but the size of the difference between levels is meaningless
• e.g. you are happier than Mary, but there is no point in calculating the difference in happiness between Mary and you
• Second lowest level of data
C. Interval data
• Allows the differences between levels to be calculated, and the difference between levels is meaningful.
• Can perform addition and subtraction (+ −) but cannot perform multiplication or division (× ÷).
• e.g. 30°C is hotter than 20°C. The difference is 10°C. 20°C is hotter than 10°C. The difference is also 10°C.
• The difference between these 2 pairs is the same (10°C)
• but 20°C is not twice as hot as 10°C!
• The reason is that the unit °C does not have an absolute zero reference point.
D. Ratio data (has a true zero value)
• Can also perform multiplication and division
• The highest level of data
• e.g. amount of money ($), body height (cm), temperature on a scale with absolute zero (e.g. kelvin), marks (0%-100%)
• e.g. 160 cm is twice as tall as 80 cm
• The data type must have a zero reference point, otherwise division is meaningless
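
The interval/ratio distinction can be illustrated with a short Python sketch. It assumes the standard conversion K = °C + 273.15 (not stated in the slides) to move the same two temperatures onto a scale with a true zero.

```python
# Interval scale (degrees Celsius): differences are meaningful, ratios are not.
c_low, c_high = 10.0, 20.0
difference_c = c_high - c_low      # 10.0 degrees: a meaningful difference
naive_ratio = c_high / c_low       # 2.0, but 20 C is NOT "twice as hot" as 10 C

# Ratio scale (kelvin, assumed conversion K = C + 273.15): a true zero exists.
k_low, k_high = c_low + 273.15, c_high + 273.15
true_ratio = k_high / k_low        # about 1.035, only ~3.5% "hotter"

# Height in cm is ratio data, so division is meaningful.
height_ratio = 160 / 80            # 2.0: twice as tall

print(difference_c, naive_ratio, round(true_ratio, 3), height_ratio)
```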
What type of data is each of the following? Justify your answer.
1. GPA?
2. Hair length?
3. Short-sightedness?
4. The ranking in class?
5. GDP of a country?
6. Floor level of a building?
7. Direction on a map?
8. Types of fruit in a supermarket?
9. Grade of HKCEE?
10. Pixels of a picture?
11. Likert scale? (1, 2, 3, 4, 5)
Questions to discuss
1. Which is the lowest level? And which is the highest?
2. Can ordinal data be converted to interval data?
3. Can ratio data be converted to interval data?
4. Which type of data is most useful?
5. Can you suggest an example of each type of data?
• Take a break now!
Measure of central tendency

A. Mean, median, and mode
B. Percentile
Mean

• Mean: the most popular measure of central tendency
• A layperson calls the mean an average.
• Equal to the sum of the items divided by the number of items
• What is the mean of the following numbers? 1, 3, 5, 7, 9
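
As a quick check of the definition (sum divided by number of items), here is a small Python sketch using the five numbers above. The variable names are illustrative only.

```python
from statistics import mean

numbers = [1, 3, 5, 7, 9]

# Mean = sum of the items divided by the number of items.
manual_mean = sum(numbers) / len(numbers)
print(manual_mean)

# The standard library's statistics.mean gives the same value.
print(mean(numbers))
```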
Weighted mean

• What is the mean of the following numbers?
a) 1, 3, 3, 5, 7, 9, 11, 11, 13
b) 140, 150, 160, 170, 180
c) 140, 150, 150, 160, 170, 170, 180
d) 140, 143, 143, 150, 180, 180.
Sometimes you do not need to calculate! Just redistribute the values.
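
One way to read "weighted mean" here is that repeated values are weighted by how often they occur. The sketch below, with a hypothetical helper `weighted_mean`, computes the mean of sets a) to d) from (value, frequency) pairs rather than from the raw list.

```python
from collections import Counter

def weighted_mean(values):
    """Mean computed from (value, frequency) pairs instead of the raw list."""
    freq = Counter(values)                       # value -> number of occurrences
    total = sum(v * f for v, f in freq.items())  # each value weighted by its frequency
    return total / sum(freq.values())

datasets = {
    "a": [1, 3, 3, 5, 7, 9, 11, 11, 13],
    "b": [140, 150, 160, 170, 180],
    "c": [140, 150, 150, 160, 170, 170, 180],
    "d": [140, 143, 143, 150, 180, 180],
}

results = {name: weighted_mean(data) for name, data in datasets.items()}
print(results)
```

Sets a) to c) are symmetric about their middle value, which is why their means can be found by "redistributing" without any arithmetic; set d) is not, so it has to be calculated.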
Examples

• What is the mean of the following properties of Hong Kong citizens?
1. Height
2. Weight
3. Income
4. Savings
5. Number of mobile phones
6. Number of credit cards
Mean can be misleading
• Sometimes the mean is quite misleading. Consider the following quiz marks of 10 students in a class.
• 20, 30, 40, 40, 40, 45, 45, 45, 100, 100.
• What is the mean?
• It is 50.5, which is higher than the passing mark of 50. Is this satisfactory?
• The principal announces that this class has attained the benchmark and all should proceed to the next level/grade.
• Is this decision reasonable? Why?
• Sometimes the mean can be affected or ‘disturbed’ heavily by extreme values (e.g. the two marks of 100).
• e.g. when you calculate the mean income of HK citizens, if you include several Mr. Lee(s), your result is heavily biased.
• Even if the billionaires are excluded, most HK citizens have an income below the mean income of HK citizens! Don’t you believe it?
• What is the mean income of HK citizens?
• >20,000 per month
• If the Governor claims HK citizens earn a fairly high income by world standards because the mean income of HK citizens is >20,000 per month, is it justified?
• Not really, because ~70% of HK citizens earn well below this amount.
• Thus, other measures of central tendency are needed as well.
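
The quiz example can be checked with a short Python sketch using the ten marks listed above and the passing mark of 50. It shows how the two marks of 100 pull the mean above the passing mark even though most students failed.

```python
marks = [20, 30, 40, 40, 40, 45, 45, 45, 100, 100]
passing_mark = 50

# Mean of all ten marks, including the two extreme values.
mean_all = sum(marks) / len(marks)

# How many students actually scored below the passing mark?
below_pass = sum(1 for m in marks if m < passing_mark)

# Mean after leaving out the two extreme marks of 100.
typical = [m for m in marks if m < 100]
mean_without_extremes = sum(typical) / len(typical)

print(mean_all, below_pass, mean_without_extremes)
```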
Median and mode

• What is the median?
• The middle or centre of a set of data.
• If the data are arranged by value in ascending/descending order, the middle one (or the sum of the middle two divided by 2) is the median.
• e.g. 1, 3, 5, 7, 7, 9. The median is ___
What’s the median of the following?
a) 1, 3, 3, 5, 7, 9, 11, 11, 13
b) 140, 150, 160, 170, 180
c) 140, 150, 150, 160, 170, 170, 180
d) 140, 143, 143, 150, 180, 180.
e) Back to the case of the class quiz marks: 20, 30, 40, 40, 40, 45, 45, 45, 100, 100.
Sometimes the median equals the mean, but sometimes it does not. It depends largely on whether there are extreme values.
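
The rule for the median (middle value, or the two middle values added and divided by 2) can be sketched in Python as follows. The helper name `median` is illustrative; sets a) to e) are the ones listed above.

```python
def median(values):
    """Middle value of the sorted data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd count: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the middle two

datasets = {
    "a": [1, 3, 3, 5, 7, 9, 11, 11, 13],
    "b": [140, 150, 160, 170, 180],
    "c": [140, 150, 150, 160, 170, 170, 180],
    "d": [140, 143, 143, 150, 180, 180],
    "e": [20, 30, 40, 40, 40, 45, 45, 45, 100, 100],
}

medians = {name: median(data) for name, data in datasets.items()}
print(medians)
```

For set e) the median falls below the passing mark of 50, unlike the mean, because the two marks of 100 do not move the middle of the ordered list.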
Things to discuss
• What is the median income of HK citizens?
• It was just ~$15,500 in 2016, which is much less than the mean value of ~$24,000.
• Political bodies say the income of HK citizens is too low, but the government says the income of HK citizens is quite reasonable.
• Which side would you agree with? Why?
• Which measure of central tendency is more reliable?
• Both are needed!
Skewness

• The skewness of a curve reflects how asymmetrical the curve is.
• A distribution having a tail on the left is called negatively skewed, whereas
• a distribution having a tail on the right is called positively skewed.
• https://www.youtube.com/watch?v=XSSRrVMOqlQ

• What is skewness?
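
A minimal sketch of how the sign of skewness can be computed, assuming the moment coefficient of skewness (third central moment divided by the 1.5 power of the second central moment). This formula is a common textbook choice and is not taken from the slides. The quiz marks from earlier serve as a right-tailed example; the mirrored and symmetric sets are made up for illustration.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment (variance)
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

right_tail = [20, 30, 40, 40, 40, 45, 45, 45, 100, 100]  # tail on the right
left_tail = [120 - x for x in right_tail]                # mirrored: tail on the left
symmetric = [1, 3, 5, 7, 9]

print(skewness(right_tail) > 0)   # positively skewed
print(skewness(left_tail) < 0)    # negatively skewed
print(skewness(symmetric))        # zero for a symmetric set
```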
Income distribution of HK 2006

Income              Number        %
1,000 – 1,999       65 534       2.1
2,000 – 3,999      149 921       4.7
4,000 – 5,999      329 103       9.8
6,000 – 7,999      460 953      13.8
8,000 – 9,999      418 416      12.5
10,000 – 14,999    693 526      20.7
15,000 – 19,999    354 073      10.6
20,000 – 24,999    222 694       6.7
25,000 – 39,999    264 781       7.9
≥ 40,000           210 878       6.3
Total            3 344 986     100
Income distribution of HK 2011

Income              Number        %
<2,000              61 935       1.9
2,000 – 3,999      110 714       3.4
4,000 – 5,999      159 539       4.9
6,000 – 7,999      362 962      11.1
8,000 – 9,999      452 218      13.9
10,000 – 14,999    754 368      23.0
15,000 – 19,999    411 534      12.6
20,000 – 24,999    284 518       8.7
25,000 – 29,999    141 632       4.3
30,000 – 39,999    216 234       6.6
40,000 – 59,999    173 093       5.3
≥ 60,000           147 889       4.5
Total            3 278 665     100
Income distribution of HK 2016
Income              Number        %
<2,000              43 583       1.3
2,000 – 3,999       78 813       2.3
4,000 – 5,999       99 300       2.9
6,000 – 7,999      130 754       3.8
8,000 – 9,999      283 102       8.3
10,000 – 14,999    891 262      26.1
15,000 – 19,999    572 777      16.8
20,000 – 24,999    372 665      10.9
25,000 – 29,999    190 703       5.6
30,000 – 39,999    277 029       8.1
40,000 – 59,999    247 662       7.2
≥ 60,000           231 457       6.8
Total            3 419 107     100
• Are the distributions positively skewed or negatively skewed?
• Please tabulate the data and plot a graph.
• Income of HK citizens is ______ skewed.
• Another good example is the number of sex partners in a lifetime.
• Can you think of some examples?
• An example of a negatively skewed curve is retirement age.
• Most people retire between the ages of 50 and 70, and very few before 40.
• Can you think of some examples?
Income distribution of HK citizens

Income                 1996                2001                2006
                    Number      %       Number      %       Number      %
< 1,000             31 447     1.0      29 659     0.9      26 764     0.8
1,000 – 1,999       26 154     0.8      27 410     0.8      39 364     1.2
2,000 – 3,999      242 429     8.0     278 579     8.6     324 434     9.7
4,000 – 5,999      316 331    10.5     266 587     8.3     329 103     9.8
6,000 – 7,999      478 408    15.9     397 899    12.3     460 953    13.8
8,000 – 9,999      476 114    15.8     395 476    12.2     418 416    12.5
10,000 – 14,999    668 722    22.2     743 033    23.0     693 526    20.7
15,000 – 19,999    295 968     9.8     370 981    11.5     354 073    10.6
20,000 – 24,999    166 805     5.5     251 116     7.8     222 694     6.7
25,000 – 39,999    171 238     5.7     258 035     8.0     264 781     7.9
≥ 40,000           142 848     4.7     210 332     6.5     210 878     6.3
Total            3 016 464   100     3 229 107   100     3 344 986   100
• How about GPA? Is it positively skewed or negatively skewed?
• Sometimes positive and sometimes negative; most likely it is roughly symmetrically distributed.
• How about body height?
2006
[Bar chart: Income distribution. x-axis: income; y-axis: percentage of each group (0-40)]

Income           %
0-5000          17
5001-10000      31
10001-15000     20
15001-20000     10
20001-25000      7
25001-30000      3
30000-35000      3
35000-40000      2
2011
[Bar chart: Income distribution of HK in 2011. x-axis: income (<2,000 to ≥ 60,000); y-axis: percentage (0-25)]
2016
[Bar chart: Income distribution of HK in 2016. x-axis: income (<2,000 to ≥ 60,000); y-axis: percentage (0-30)]
2016
Income              %
<10,000            18.6
10,000 – 19,999    42.9
20,000 – 29,999    16.5
30,000 – 39,999     8.1
40,000 – 59,999     7.2
≥ 60,000            6.8
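
The wider bands in this table can be reproduced by summing the percentages of the finer 2016 bands. The sketch below does that in Python; the index ranges in `groups` are my own illustrative way of describing which fine bands go into each wide band.

```python
# Percentages from the detailed 2016 income table (income band, % of people).
fine_bands = [
    ("<2,000", 1.3), ("2,000-3,999", 2.3), ("4,000-5,999", 2.9),
    ("6,000-7,999", 3.8), ("8,000-9,999", 8.3), ("10,000-14,999", 26.1),
    ("15,000-19,999", 16.8), ("20,000-24,999", 10.9), ("25,000-29,999", 5.6),
    ("30,000-39,999", 8.1), ("40,000-59,999", 7.2), (">=60,000", 6.8),
]

# Illustrative grouping: (wide band label, first index, one past the last index).
groups = [
    ("<10,000", 0, 5), ("10,000-19,999", 5, 7), ("20,000-29,999", 7, 9),
    ("30,000-39,999", 9, 10), ("40,000-59,999", 10, 11), (">=60,000", 11, 12),
]

# Sum the fine-band percentages inside each wide band (rounded to 1 decimal).
regrouped = {
    label: round(sum(pct for _, pct in fine_bands[start:stop]), 1)
    for label, start, stop in groups
}
print(regrouped)

# The band holding the largest share of people (the modal band).
modal_band = max(regrouped, key=regrouped.get)
print(modal_band)
```

The largest share sits in a low band while smaller shares stretch out towards the high incomes, which is the shape to look for when deciding the direction of skew.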
2016
[Bar chart: percentage (0-50) by income band, <10,000 to ≥ 60,000]
Mode

• Another measure of location that is sometimes used to describe the middle of a set of data.
• The mode is the value that occurs with the highest frequency.
• It is the most typical value of a set of data.
What is the mode of each of the following sets of data?
a) 1, 3, 3, 5, 7, 9, 11, 11, 13
b) 140, 150, 160, 170, 180
c) 140, 150, 150, 160, 170, 170, 180
d) 140, 143, 143, 150, 180, 180.
e) Back to the case of the class quiz marks: 20, 30, 40, 40, 40, 45, 45, 45, 100, 100.
f) What is the mode of the income of HK citizens?
• Determining the mode is simple and requires no calculation at all.
• However, a set of data may have no mode or more than one mode.
• Thus, the mode is not a good measure of central location by itself.
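
The "no mode or more than one mode" cases can be seen by counting frequencies. Below is a small Python sketch with a hypothetical helper `modes` that returns every value sharing the highest frequency, and an empty list when every value occurs only once (treated here as "no mode").

```python
from collections import Counter

def modes(values):
    """All values with the highest frequency; [] if every value occurs once."""
    freq = Counter(values)
    highest = max(freq.values())
    if highest == 1:
        return []                                 # no value repeats: no mode
    return sorted(v for v, f in freq.items() if f == highest)

datasets = {
    "a": [1, 3, 3, 5, 7, 9, 11, 11, 13],
    "b": [140, 150, 160, 170, 180],
    "c": [140, 150, 150, 160, 170, 170, 180],
    "d": [140, 143, 143, 150, 180, 180],
    "e": [20, 30, 40, 40, 40, 45, 45, 45, 100, 100],
}

results = {name: modes(data) for name, data in datasets.items()}
print(results)
```

Four of the five sets turn out to have two modes and one has none, which illustrates why the mode alone is a weak summary.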
• Take a break now!
Percentiles and quartiles

• Only applied to large sets of data. Why?
• A set of data can be divided into 4 equal portions when the data are arranged in either ascending or descending order.
• The quartiles are the values that separate these four portions, each holding one fourth of all the values.
• There are the first quartile (Q1), second quartile (Q2) and third quartile (Q3).
Quartile
• Data set: 1, 3, 6, 9, 12, 15, 18, 23, 24, 35, 37
• The first quartile (Q1) is the median of all the values to the left of the median position of the whole set of data (here the median itself is excluded; including it is also an accepted convention).
• Also known as the lower quartile.
• Approximately, it is the 25th percentile: about 25% of the values lie below it.
• e.g. 1, 3, 6, 9, 12, 15, 18, 23, 24, 35, 37
• Thus, Q1 is 6.
• Data set: 1, 3, 6, 9, 12, 15, 18, 23, 24, 35, 37
• Then which is Q3?
• The third quartile (Q3) is the median of all the values to the right of the median position of the whole set of data (again excluding the median itself).
• Approximately, it is the 75th percentile: about 25% of the values lie above it.
• Also known as the upper quartile.
• It’s 24.
• Data set: 1, 3, 6, 9, 12, 15, 18, 23, 24, 35, 37
• Which is Q2?
• It’s 15. It’s the median.
• The second quartile (Q2) is the median of the whole set of data.
• i.e. Q3 ≥ Q2 ≥ Q1 (here 24 > 15 > 6)
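
The "median of each half" procedure can be sketched in Python as follows. This version assumes the convention that the overall median is left out of both halves when the number of values is odd; other conventions exist and give slightly different quartiles. The function names are illustrative.

```python
def median(values):
    """Middle value of the sorted data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    Assumed convention: with an odd number of values, the overall median
    is excluded from both the lower and the upper half.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                                   # values left of the median
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]  # values right of it
    return median(lower), median(ordered), median(upper)

data = [1, 3, 6, 9, 12, 15, 18, 23, 24, 35, 37]
q1, q2, q3 = quartiles(data)
print(q1, q2, q3)
```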
Percentile

• Percentiles are obtained when the whole set of data is divided into 100 equal portions.
• A percentile is the value of a variable below which a certain percent of observations fall.
• So the 20th percentile is the value (or score) below which 20 percent of the observations may be found.
• e.g. the 50th percentile is equal to the median
• the 75th percentile is equal to the third quartile (Q3)
• the 25th percentile is equal to the first quartile (Q1)
• When your score is equal to the 95th percentile in a test, you did better than 95% of the students in the whole class!
• Is it good?
Data set: 1, 2, 3, 4, ..., 98, 99, 100
Which is the 50th percentile? 50.5
Which is the 25th percentile? 25.5
Which is the 75th percentile? 75.5

Data set: 1, 2, 3, 4, ..., 98, 99, 100
Which is the 1st percentile? ~1.5
Which is the 10th percentile? ~10.5
Which is the 99th percentile? ~99.5
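
The answers above are consistent with one common textbook rule, sketched here in Python as an assumption rather than as the method used in the lecture: compute k = p × n / 100; if k is a whole number, average the k-th and (k+1)-th ordered values, otherwise round k up and take that value. The helper `percentile` is hypothetical and accepts whole-number p from 1 to 99.

```python
def percentile(values, p):
    """p-th percentile (integer p, 1..99) by a simple textbook position rule.

    Assumed rule: k = p * n / 100. If k is whole, average the k-th and
    (k+1)-th ordered values; otherwise take the value at position ceil(k).
    """
    if not 1 <= p <= 99:
        raise ValueError("p must be a whole number between 1 and 99")
    ordered = sorted(values)
    n = len(ordered)
    k, remainder = divmod(p * n, 100)     # integer arithmetic avoids rounding issues
    if remainder == 0:
        return (ordered[k - 1] + ordered[k]) / 2  # between k-th and (k+1)-th values
    return ordered[k]                             # ceil(p*n/100)-th value (0-based index k)

data = list(range(1, 101))                # 1, 2, 3, ..., 100
for p in (1, 10, 25, 50, 75, 99):
    print(p, percentile(data, p))
```

Other software uses different interpolation rules, so percentiles reported by a spreadsheet or statistics package may differ slightly from these values.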
The End of Lecture 2
