STAT2170 and STAT6180: Applied Statistics: // // / Karol Binkowski
I Population refers to all the individuals (as a whole) that are relevant
to a study or a research question of interest. It can be as general or
as specific as you like, depending on what group you want to make
conclusions about. E.g.:
I all rock-wallabies (including past and future)
I I opened a 200g bag and found 217 M&M in total but only 8 yellow.
I Therefore, 8/217 = 3.7% were yellow
I Reminder: we expected 16.7%
I Would we get a different answer if we had picked a different sample?
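The question in the last bullet can be explored with a small simulation. The following is a minimal sketch in R, assuming (for illustration only) that each M&M is yellow with probability 1/6, independently of the others, and that every bag holds 217 M&Ms. The seed and the number of simulated bags are arbitrary choices.

```r
# Observed sample proportion from the bag described above
n_total  <- 217
n_yellow <- 8
p_obs <- n_yellow / n_total
round(100 * p_obs, 1)          # percentage of yellow M&Ms in this bag

# Assumption for illustration: each M&M is yellow with probability 1/6,
# independently of the others, and every bag contains 217 M&Ms.
set.seed(1)                    # arbitrary seed, for reproducibility
sim_yellow <- rbinom(5, size = n_total, prob = 1 / 6)
sim_props  <- sim_yellow / n_total
round(100 * sim_props, 1)      # the percentage varies from bag to bag
```

Each simulated bag is a different sample from the same population, so the sample proportion changes from bag to bag even though the population proportion stays fixed.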
                     Sample   Population
mean                 x̄        µ
standard deviation   s        σ
variance             s²       σ²
proportion           p        π
regression coef.     b        β
correlation coef.    r        ρ
I symmetric about µ
I extends from −∞ to ∞
DEPARTMENT OF MATHEMATICS & STATISTICS 13
The Central Limit Theorem (CLT)
I For any distribution, the means of repeated samples of the same size will
be approximately normally distributed if the sample size (n) is large
enough. That is, regardless of the distribution of a random variable, say, Y,
$$\bar{Y} \overset{\text{approx}}{\sim} N\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{if } n \text{ is sufficiently large.}$$
I The more skewed the original data (population), the larger the sample
size required for $\bar{Y}$ to be approximately normally distributed.
I For a variable that is not very skewed, a sample size (n) of 10-15 may be
large enough, and 20-30 will often be large enough.
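The behaviour described by the CLT can be checked by simulation. The sketch below is illustrative only: the exponential population, the seed, the sample sizes 5 and 30, and the helper names `sample_means` and `skewness` are all assumptions chosen for this example, not part of the original notes.

```r
# Assumption for illustration: the population is exponential with rate 1,
# a right-skewed distribution with mean mu = 1 and variance sigma^2 = 1.
set.seed(2170)        # arbitrary seed, for reproducibility
n_reps <- 10000       # number of repeated samples per sample size

# Draw n_reps samples of size n and return the mean of each sample
sample_means <- function(n) {
  replicate(n_reps, mean(rexp(n, rate = 1)))
}

# Sample skewness (0 for a perfectly symmetric distribution)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

means_5  <- sample_means(5)
means_30 <- sample_means(30)

# Centre and spread of the n = 30 means versus the CLT values mu and sigma^2 / n
c(mean = mean(means_30), variance = var(means_30), clt_variance = 1 / 30)

# Skewness shrinks towards 0 as n grows, i.e. the means look more normal
c(n_5 = skewness(means_5), n_30 = skewness(means_30))
```

Calling `hist(means_5)` and `hist(means_30)` afterwards gives a visual comparison of the two sampling distributions.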
I σ = 23.8cm
$$\frac{\bar{Y} - \mu}{\sigma/\sqrt{n}} \sim Z \sim N(0, 1)$$
I Data: y
18.3 17.9 19.1 16.8 18.9 17.4 19.6 18.3 19.6 16.3
Notation          Terminology
H0 : µ = 19.25    Null hypothesis
H1 : µ ≠ 19.25    Alternative hypothesis; two-tailed/two-sided
α = 0.05          Significance level
I The lower the P-Value, the less evidence we have to ‘believe’ H0 (and
the more evidence we have against H0 ).
I How likely is it to get our sample (or more extreme) if the true
proportion of yellow M&Ms is 16.7%?
I Conclusion?
I Note: calculations for the P-Value for a proportion are not shown.
This will not be tested in this course (it made for a ‘yum’ example).
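For readers who are curious, one way to put a number on "how likely" is an exact binomial calculation. The sketch below is an assumption about how such a P-Value could be obtained; it is not necessarily the calculation behind the original example. It treats "more extreme" as "8 or fewer yellow M&Ms" and treats the 217 M&Ms as independent draws.

```r
# Exact binomial calculation, assuming the 217 M&Ms behave like independent
# draws with P(yellow) = 0.167 under H0. This is one possible approach and
# may not be the calculation behind the original example.
n_total  <- 217
n_yellow <- 8
p0       <- 0.167

# P(8 or fewer yellow M&Ms out of 217) if the true proportion is 16.7%
p_lower <- pbinom(n_yellow, size = n_total, prob = p0)
p_lower

# The same lower-tail probability via binom.test()
res <- binom.test(n_yellow, n_total, p = p0, alternative = "less")
res$p.value
```

Both lines compute the same lower-tail probability, so they should agree.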
Type I and Type II errors
I H1 : Accused is guilty.
I As α decreases, β increases.
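The trade-off between α and β can be illustrated numerically. The sketch below borrows the encoding-time setting used elsewhere in these notes (µ0 = 19.25, σ = 1.5, n = 10) for a one-sided z-test with H1 : µ < µ0. The "true" mean of 18.5 and the helper `type2_error` are hypothetical and introduced only for this example.

```r
mu0   <- 19.25          # mean under H0
sigma <- 1.5            # population SD, treated as known
n     <- 10             # sample size
mu1   <- 18.5           # hypothetical true mean under H1 (assumption)
se    <- sigma / sqrt(n)

# beta = P(fail to reject H0 | true mean is mu1) for a lower-tailed z-test
type2_error <- function(alpha) {
  crit <- mu0 + qnorm(alpha) * se        # reject H0 if the sample mean < crit
  1 - pnorm(crit, mean = mu1, sd = se)   # chance the sample mean lands above crit
}

alphas <- c(0.10, 0.05, 0.01)
betas  <- sapply(alphas, type2_error)
data.frame(alpha = alphas, beta = round(betas, 3))
```

Reading down the table, a smaller α moves the rejection cut-off further from µ0, which makes it harder to reject H0 when it is false and so increases β.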
I NOT that µ is random and there is a 95% chance that it is inside the
fixed interval!
Data: y = (18.3, 17.9, 19.1, 16.8, 18.9, 17.4, 19.6, 18.3, 19.6, 16.3)
I 95% CI and P-value (α = 0.05)
I CI for µ is (17.29, 19.15)
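The interval above can be reproduced in R. This sketch assumes the population standard deviation is known to be σ = 1.5, the value used in the z calculation in these notes; the variable names are illustrative.

```r
y <- c(18.3, 17.9, 19.1, 16.8, 18.9, 17.4, 19.6, 18.3, 19.6, 16.3)
sigma <- 1.5                      # population SD, treated as known
alpha <- 0.05

n      <- length(y)
se     <- sigma / sqrt(n)         # standard error of the sample mean
z_crit <- qnorm(1 - alpha / 2)    # two-sided critical value for 95% confidence

ci <- mean(y) + c(-1, 1) * z_crit * se
round(ci, 2)                      # 17.29 19.15
```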
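$$z_{\text{obs}} = \frac{\bar{y} - \mu_0}{\sigma/\sqrt{n}} = \frac{18.22 - 19.25}{1.5/\sqrt{10}} = \frac{-1.03}{0.4743416} = -2.17$$
[Figure: standard normal density on the range −3 to 3, with the lower-tail area 0.015 marked (one-tail only)]
P-Value = P(Z < −2.1714307) = 0.015 < 0.05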
I Here we are only interested in the probability of a sample mean at least
as far below 19.25 as the one observed. Since the P-value < 0.05, we reject
H0 at the 5% significance level and conclude that there is evidence that
the true mean encoding time of the new encoder is less than 19.25
minutes, i.e., shorter than that of the encoder on the market.
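The one-sided z-test above can be reproduced in R with `pnorm()`. This is a minimal sketch that treats σ = 1.5 as known, as in the calculation; the variable names are illustrative.

```r
y <- c(18.3, 17.9, 19.1, 16.8, 18.9, 17.4, 19.6, 18.3, 19.6, 16.3)
mu0   <- 19.25                    # mean under H0
sigma <- 1.5                      # population SD, treated as known

z_obs   <- (mean(y) - mu0) / (sigma / sqrt(length(y)))
p_value <- pnorm(z_obs)           # lower tail only, because H1 is mu < mu0

z_obs                             # about -2.17
round(p_value, 3)                 # 0.015
p_value < 0.05                    # TRUE, so H0 is rejected at the 5% level
```

For the two-sided alternative the P-Value would instead be `2 * pnorm(-abs(z_obs))`.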
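Two-sided; H1 : µ ≠ µ0
[Figure: standard normal density; reject H0 in both tails, below −zα/2 and above zα/2, each tail with area α/2]
[Figures: one-sided tests; reject H0 in the lower tail below −zα (area α), or in the upper tail above zα (area α)]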
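The critical values that bound the rejection regions of a z-test come from `qnorm()`. The sketch below uses α = 0.05 and, as an illustration, compares the observed statistic from the encoding example (rounded to −2.17) with each cut-off.

```r
alpha <- 0.05

z_two <- qnorm(1 - alpha / 2)   # two-sided cut-off z_{alpha/2}, about 1.96
z_one <- qnorm(1 - alpha)       # one-sided cut-off z_{alpha}, about 1.645
c(two_sided = z_two, one_sided = z_one)

z_obs <- -2.17                  # observed statistic from the encoding example

abs(z_obs) > z_two              # two-sided test: reject H0 if TRUE
z_obs < -z_one                  # lower-tailed test: reject H0 if TRUE
```

The one-sided cut-off is closer to zero than the two-sided one because all of α sits in a single tail.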
I Warning: You need to have a reason to do a one sided test. Does the
treatment improve/increase/reduce, etc? There is no right or wrong
answer - one researcher might do a two tailed test, another might do
a one tailed test.
I If you don’t have a reason, do a two sided test.
I As ν → ∞, tν → N (0, 1)
I The usual case is that no parameters are known and all we have is our
sample.
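The convergence of the t distribution to N(0, 1) as ν grows can be seen in its critical values. A short sketch, with an arbitrary selection of degrees of freedom:

```r
dfs    <- c(2, 5, 10, 30, 100, 1000)   # arbitrary degrees of freedom to compare
t_crit <- qt(0.975, df = dfs)          # upper 2.5% point of t with each df
z_crit <- qnorm(0.975)                 # upper 2.5% point of N(0, 1)

data.frame(df = dfs,
           t_crit = round(t_crit, 3),
           z_crit = round(z_crit, 3))
```

The t critical value is always larger than the normal one, reflecting the extra uncertainty from estimating σ with s, and the gap closes as the degrees of freedom increase.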
y1 = 8, y2 = 9, y3 = ?
I If we know ȳ = 9, we know the value of y3: the three values must sum to
3 × 9 = 27, so y3 = 27 − 8 − 9 = 10.
I The data has 2 degrees of freedom (2 of the sample values are free to
change before the 3rd is fixed).
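The same reasoning in R, as a small sketch: once the sample mean and two of the three values are fixed, the third value is determined.

```r
y1    <- 8
y2    <- 9
y_bar <- 9                       # known sample mean
n     <- 3                       # sample size

# The values must sum to n * y_bar, so the last one is not free to vary
y3 <- n * y_bar - (y1 + y2)
y3                               # 10

mean(c(y1, y2, y3))              # 9, matching y_bar
```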
I You will need to register (one option is to use your student email to
Sign up with Google) and create a New Project to set up your
personal/unit-specific Workspace.
I As you will be working off a server, you will have to Upload any data
files first before carrying out your analysis and then Export/download
your results (if appropriate).
I You can find the Upload and Export (nested under More) button
within the Files pane.
I If you have a Mac, install the latest release from the newest
R-x.x.x.pkg link (or a legacy version if you have an older operating
system). After you install R, you should also install XQuartz
http://xquartz.macosforge.org to be able to use some
visualisation packages.
I If you are using Linux, choose your specific operating system and
follow the installation instructions.
I Open RStudio
I Set up a project:
I Click on File menu at the top (or the add project icon)
I You should then have a *.Rproj file. Simply open it to return to your
workspace.
I Navigate to the directory in the Files pane and then set directory
using the More button.
I header=TRUE indicates that the first entry in the pulse.dat file (which
is ‘pulse’) is the variable name and the remaining entries are the data.
I dat is an object that contains the pulse data. You can view it in two ways:
I Directly using dat$pulse (recommended)
I pulse
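The effect of `header=TRUE` can be sketched as follows. The reading command is not shown in this part of the notes, so `read.table()` is an assumption here, and the three values written to the temporary file are hypothetical stand-ins for the real pulse.dat, included only so that the example runs anywhere.

```r
# Hypothetical stand-in for pulse.dat: first line is the variable name,
# the remaining lines are (made-up) data values.
tmp <- tempfile(fileext = ".dat")
writeLines(c("pulse", "72", "65", "81"), tmp)

# header = TRUE: treat the first entry as the column name, not as data
dat <- read.table(tmp, header = TRUE)

names(dat)      # "pulse"
dat$pulse       # the data values, accessed directly by column name
```

With the real file in the working directory, the path `tmp` would be replaced by `"pulse.dat"`.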
stem(dat$pulse)
#
# The decimal point is 1 digit(s) to the right of the |
#
# 5 | 56
# 6 | 12
# 6 | 55689
# 7 | 133
# 7 | 667777
# 8 | 01112
# 8 | 556667
# 9 | 3
[Figure: Student Pulse rates (bpm); histogram with x-axis dat$pulse (60 to 90) and y-axis Frequency (0 to 7)]
summary(dat$pulse)
sd(dat$pulse)
# [1] 9.729739
mean(dat$pulse)
# [1] 75.23333
dat1
# time
# 1 18.3
# 2 17.9
# 3 19.1
# 4 16.8
# 5 18.9
# 6 17.4
# 7 19.6
# 8 18.3
# 9 19.6
# 10 16.3
head(dat1)
# time
# 1 18.3
# 2 17.9
# 3 19.1
# 4 16.8
# 5 18.9
# 6 17.4
tail(dat1)
# time
# 5 18.9
# 6 17.4
# 7 19.6
# 8 18.3
# 9 19.6
# 10 16.3
I This is the ad-hoc ‘standard’ five-number summary along with the
sample mean.
#
# One Sample t-test
#
# data: dat1$time
# t = -2.8769, df = 9, p-value = 0.01827
# alternative hypothesis: true mean is not equal to 19.25
# 95 percent confidence interval:
# 17.4101 19.0299
# sample estimates:
# mean of x
# 18.22
#
# One Sample t-test
#
# data: dat1$time
# t = -2.8769, df = 9, p-value = 0.009134
# alternative hypothesis: true mean is less than 19.25
# 95 percent confidence interval:
# -Inf 18.87629
# sample estimates:
# mean of x
# 18.22
#
# One Sample t-test
#
# data: dat1$time
# t = -2.8769, df = 9, p-value = 0.9909
# alternative hypothesis: true mean is greater than 19.25
# 95 percent confidence interval:
# 17.56371 Inf
# sample estimates:
# mean of x
# 18.22
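The three outputs above are consistent with calls to `t.test()` using the two-sided, "less" and "greater" alternatives. The sketch below is a reconstruction: how `dat1` was created is not shown in these notes, so building it with `data.frame()` is an assumption, and the object names `two_sided`, `lower` and `upper` are illustrative.

```r
# Assumption: dat1 is a data frame with a single column `time`
dat1 <- data.frame(
  time = c(18.3, 17.9, 19.1, 16.8, 18.9, 17.4, 19.6, 18.3, 19.6, 16.3)
)

# H1: mu != 19.25 (the default alternative is "two.sided")
two_sided <- t.test(dat1$time, mu = 19.25)

# H1: mu < 19.25
lower <- t.test(dat1$time, mu = 19.25, alternative = "less")

# H1: mu > 19.25
upper <- t.test(dat1$time, mu = 19.25, alternative = "greater")

two_sided
lower
upper
```

The test statistic and degrees of freedom are the same in all three calls; only the P-Value and the form of the confidence interval change with the alternative. The "less" P-Value is half the two-sided one, and the "greater" P-Value is one minus the "less" P-Value.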