STAT2170 and STAT6180: Applied Statistics
Week 1: Review of Basic Statistics

Karol Binkowski
DEPARTMENT OF MATHEMATICS & STATISTICS

Statistical terms and notation
I When we carry out an experiment or a survey, we often take a sample
from a much larger population.
I Population refers to all the individuals (as a whole) that are relevant
to a study or a research question of interest. It can be as general or
as specific as you like, depending on what group you want to make
conclusions about. E.g.:
I all rock-wallabies (including past and future)
I all two-year old rock-wallabies in the Sydney region
I all students who studied statistics last year
I all currently registered accounting firms in NSW
I Population of interest is known as the target population

Sample
I A sample is usually a manageable subset of the target population to
be studied.
I We may use that sample to make statements (known as making
inference) about that population.
I The experiment or survey needs to be designed so that the sample to
be used is an unbiased representation of that population,
I i.e. a representative sample of the target population.
I One way to obtain a representative sample is to use some random
selection process, such as taking simple random samples. There are
also other ways to get good and representative samples (covered in
STAT2114 or STAT6114).

Sampling
I Think of an example of a sample and name its target population

A (yum) example
I Say we want to know what proportion of M&Ms are yellow
I Without a sample, what would be your estimate?

I If an equal colour distribution, we expect __ out of __ to be yellow,
or __%
I We can’t count all M&Ms produced, so . . .

My M&Ms sample
I I opened a 200g bag and found 217 M&Ms in total but only 8 yellow.
I Therefore, 8/217 = 3.7% were yellow
I Reminder: we expected 16.7%
I Would we get a different answer if we had picked a different sample?

Sampling demonstration
I Video (Video’s Made by a/Prof Robin Turner, n.d.): Randomly
sampling 200 M&M’s assuming an equal colour distribution and
determining the proportion of yellow
I 3.7% is expressed as 0.037 here (i.e. as a proportion)
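
The sampling demonstration can be imitated in R. The following is only a sketch under stated assumptions: six equally likely colours (consistent with the expected 16.7%), hypothetical colour names, and an arbitrary seed and number of repetitions. It is not the code behind the video.

```r
# Sketch: repeatedly sample 200 M&Ms when all colours are equally likely.
# The colour names, the seed and the 1000 repetitions are assumptions.
set.seed(1)
colours <- c("red", "orange", "yellow", "green", "blue", "brown")

# Proportion of yellow in one simulated sample of n M&Ms
prop_yellow <- function(n = 200) {
  bag <- sample(colours, size = n, replace = TRUE)
  mean(bag == "yellow")
}

# Repeat the sampling to see how the proportion varies between samples
props <- replicate(1000, prop_yellow())
summary(props)          # centred near 1/6
mean(props <= 8 / 217)  # share of simulated samples as low as the observed 3.7%
```

Each run of prop_yellow() gives a different proportion, which is the sample-to-sample variation the slide asks about.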

Parameters and Statistics
I Parameters: (known as population parameters)
I values that characterize a population
I fixed for that population
I usually unknown
I Statistics: (known as sample statistics)
I values that characterize a sample
I vary from sample to sample
I known exactly (can be computed from sample)
I used to estimate parameters

Notation
                     Sample   Population
mean                 x̄        µ
standard deviation   s        σ
variance             s²       σ²
proportion           p        π
regression coef.     b        β
correlation coef.    r        ρ

Normal distribution and central limit theorem

Normal Distribution
I bell-shaped and completely defined by µ & σ (see graphs)
I may be written as Y ∼ N(µ, σ²)
I symmetric about µ
I extends from −∞ to ∞
I Many variables are modelled with a Normal distribution, such as weights,
heights and GPA scores
[Figure: Normal density curves. Left panel: different µ, same σ. Right panel: same µ, different σ.]
The Central Limit Theorem (CLT)
I For any distribution, means of repeated samples of the same size will
be approximately normally distributed, if the sample size (n) is large
enough, i.e., regardless of the distribution of a random variable, say,
Y,
$$\bar{Y} \overset{\text{approx}}{\sim} N\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{if } n \text{ is sufficiently large.}$$
I The more skewed the original data (population), the larger the sample
size required for Ȳ to be approximately normally distributed.
I For a variable that is not very skewed, a sample size (n) of 10-15 may be
large enough and 20-30 will often be large enough.
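
A small simulation can make the CLT concrete. This sketch assumes a hypothetical skewed population (exponential with mean 1 and standard deviation 1), a sample size of 30 and 5000 repeated samples. None of these choices come from the slides.

```r
# Sketch: sample means from a skewed population look roughly Normal.
# Assumed population: exponential with rate 1, so mu = 1 and sigma = 1.
set.seed(2)
n <- 30
sample_means <- replicate(5000, mean(rexp(n, rate = 1)))

mean(sample_means)  # close to mu = 1
sd(sample_means)    # close to sigma / sqrt(n) = 1 / sqrt(30)
hist(sample_means, main = "Means of 5000 samples of size 30")
```

Reducing n (say to 5) and rerunning should give a visibly more skewed histogram, in line with the remark that more skewed populations need larger samples.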

CLT demonstration
I Video (Video’s Made by a/Prof Robin Turner, n.d.): Randomly
sampling 30 people from a (skewed) population and estimating the
average height in cm
I σ = 23.8cm

Z-scores
I If Y is Normally distributed N(µ, σ²), then:
I For an individual Y:
$$\frac{Y - \mu}{\sigma} \sim Z$$
where Z ∼ N(0, 1), the so called ‘standard’ normal distribution.
I For a sample mean Ȳ:
$$\frac{\bar{Y} - \mu}{\sigma/\sqrt{n}} \sim Z \sim N(0, 1)$$
This sample mean result is exact if Y is Normally distributed but only
approximate if Y is not normal but n is large enough (and the
approximation improves as n increases)
I The standard deviation of sample means, known as the standard
error (se), is
$$\mathrm{se}(\bar{Y}) = \frac{\sigma}{\sqrt{n}}$$
It is clear from this expression that the standard deviation of the
sample means can be reduced by increasing the sample size.
Note:
I σ is the measure of variability between individuals in the population;
I The SE of the sample mean measures the variability of the sample
means, and thus is an indication of the accuracy of a sample mean
as an estimate of the population mean.

Standard error demonstration
I Video (Video’s Made by a/Prof Robin Turner, n.d.) : Randomly
sampling 10, 30 and 100 people from a (skewed) population and
estimating the average height in cm
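
The effect of sample size on the standard error can also be computed directly from se(Ȳ) = σ/√n. This sketch assumes that the σ = 23.8 cm quoted for the CLT demonstration applies to this population as well.

```r
# Sketch: standard error of the mean height for n = 10, 30 and 100.
sigma <- 23.8          # population sd in cm (assumed to carry over from the CLT demo)
n <- c(10, 30, 100)
se <- sigma / sqrt(n)  # se = sigma / sqrt(n)
round(se, 2)
# [1] 7.53 4.35 2.38
```

Going from 10 to 100 people cuts the standard error by a factor of √10, not 10.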

Hypothesis testing - z-test

Example: z-test
I A video encoder algorithm has an encoding time for an industry
standard video that is normally distributed with mean of 19.25
minutes and a variance of 2.25 (i.e. σ = 1.5)
I A new video encoder is developed and our question is:
I Does this new encoder have a different encoding time compared
with the encoder available on the market?
I Take a sample of 10 different computers (equivalent specification) and
obtain the encoding time in minutes for each using the new encoder:
I Data: y
18.3 17.9 19.1 16.8 18.9 17.4 19.6 18.3 19.6 16.3
I ȳ = 18.22 and s = 1.132

Is this evidence that the new encoder:
I has a population mean of 19.25, i.e., this is just random (chance)
variation about that value.
I OR has a population mean that is NOT 19.25.
We assume that the standard deviation is unchanged, i.e., σ = 1.5. (More
on this later.) As σ is known, we may carry out a z-test with the following
hypotheses, to answer the research question.
Notation          Terminology
H0 : µ = 19.25    Null hypothesis
H1 : µ ≠ 19.25    Alternative hypothesis; two tailed/sided
α = 0.05          Significance level

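$$z_{\text{obs}} = \frac{\bar{y} - \mu_0}{\sigma/\sqrt{n}} = \frac{18.22 - 19.25}{1.5/\sqrt{10}} = \frac{-1.03}{0.4743416} = -2.1714$$
[Figure: standard normal density on an axis from −3 to 3, with an area of 0.015 marked in each tail]
P-Value = P(Z < −2.17 or Z > 2.17) = (1 − 0.985) × 2 = 0.03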

I Therefore, reject H0 at the 5% significance level as the P-Value is less
than 0.05, and there is sufficient evidence at the 5% level of
significance to conclude that the true mean encoding time for the new
encoder is NOT 19.25 minutes but significantly less than 19.25
minutes. (It was 18.22 in the sample.)
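
The z-test above can be reproduced in base R. Base R has no built-in one-sample z-test function, so this sketch computes the statistic by hand and takes the tail area from pnorm(). The variable names are illustrative only.

```r
# Sketch: two-sided z-test for the encoder data with sigma assumed known.
y <- c(18.3, 17.9, 19.1, 16.8, 18.9, 17.4, 19.6, 18.3, 19.6, 16.3)
mu0 <- 19.25   # null value from H0
sigma <- 1.5   # assumed known population sd
n <- length(y)

z_obs <- (mean(y) - mu0) / (sigma / sqrt(n))
p_two_sided <- 2 * pnorm(-abs(z_obs))  # area in both tails

z_obs        # about -2.17
p_two_sided  # about 0.03
```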

Steps of hypothesis testing:
1. Set up null and alternative hypotheses and choose level of significance
α
2. Assume H0 is true and calculate the test statistic, checking the
relevant assumptions required for the test.
3. Calculate the P-Value, assuming H0 is true.
4. Compare P-Value with level of significance:
I Reject H0 if P-Value < α.
I Do Not Reject H0 if P-Value ≥ α.
5. Write a statistical and contextual conclusion.

More on P-Value
I P-Value is the probability, calculated assuming H0 is true, of the test
statistic taking its observed value or a more extreme value (further away
from the hypothesised value specified in H0, known as the null value).
I The lower the P-Value, the less reason we have to ‘believe’ H0 (and
the more evidence we have against H0).
I The P-Value is NOT the probability that H0 is true. (A not-guilty
verdict in court does not mean that the defendant is innocent. They
are either innocent or there wasn’t enough evidence to convict them.)

Significance Level
I The significance level is the ‘cut-off point’, the minimum P-Value
for continuing to believe H0 (not rejecting it).
I For the encoding example, if H0 was true, we would expect 3% (the
P-Value) of our samples to have a sample mean that is 2.17 standard
errors (1.03 minutes) or further away from 19.25, i.e., 97% of the time,
we would observe a sample mean closer to 19.25.
I If using 5% (0.05) as the significance level, we would say that it is
unlikely (as P-Value < 0.05) that this sample was from a population
with mean of 19.25, i.e., 3% (< 5%) is considered unlikely enough for
us to conclude that the true (population) mean of the new encoder is
in fact not 19.25. So we conclude that we have evidence to suggest that
µ (population mean of the new encoder) is significantly different from
19.25.

I If we had chosen a significance level of α = 0.01, then the P-Value
(0.03) is considered large as it’s greater than this α and so would not
reject H0 . In this case we can say that we do NOT have sufficient
evidence to conclude that the true mean time of the new encoder, µ,
is not 19.25 at the 1% level of significance, i.e., the true mean could
be 19.25.
I Based on our analyses, we have enough evidence to conclude that µ
is not 19.25 at the 5% level of significance, but not enough evidence
to conclude that µ is not 19.25 at the 1% level of significance.
I We will never know for sure whether µ is 19.25 or is not. So whatever
our conclusion, we could be making an error.

Back to the M&Ms example
I Our sample had 3.7% yellow
I How likely is it to get our sample (or more extreme) if the true
proportion of yellow M&Ms is 16.7%?
I P-Value < 0.0001
I Conclusion?
I Note: calculations for the P-Value for a proportion are not shown.
This will not be tested in this course (it made for a ‘yum’ example).
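
For the curious, one way to obtain a P-Value for the M&Ms proportion in R is an exact binomial test. The slides do not say which method produced the quoted P-Value, so treating it as binom.test() is an assumption, and as noted above this is not examinable.

```r
# Sketch: exact binomial test of H0: proportion of yellow = 1/6,
# using the observed 8 yellow out of 217 M&Ms.
res <- binom.test(x = 8, n = 217, p = 1 / 6)
res$estimate        # sample proportion, 8/217
res$p.value < 1e-4  # TRUE: consistent with "P-Value < 0.0001"
```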
Type I and Type II errors
If H0 is true and we reject H0 (because our sample has an unusually low or
high mean due to random chance) ⇒ Type-I error.
P(Type I error) = α (the significance level),
since the test rejects a proportion α of the z-values as being too extreme.
If H0 is false and we do not reject H0 because we obtained an unusual
sample with a mean similar to the hypothesised value ⇒ Type-II error.
P(Type II error) = β (= 1 − the power of the test),
which depends on α, the true value of µ and of course, the sample size n.

Possible Outcomes for tests of significance
              H0 true            H0 false
H0 retained   1 − α              Type II error: β
H0 rejected   Type I error: α    Power: 1 − β
E.g. Intuitive Court outcomes
I H0 : Accused is innocent
I H1 : Accused is guilty.
Decision    H0 true (accused innocent)   H0 false (accused guilty)
Acquittal   Correct acquittal            Bad acquittal
Convicted   Bad conviction               Correct conviction

I Note: Can’t simultaneously make α and β low (for a fixed sample size):
I As α increases, β decreases.
I As α decreases, β increases.
I Typically α is set to 0.05, sometimes 0.01, and a high α value would
result in a relatively low β.
I The choice of level of significance depends on the experiment/survey.

For example:
I If trialing a new drug to lower blood pressure, we may consider a very
low α (say α = 0.005) as we don’t want to take the risk of putting on
the market a useless (new) drug, although this will increase the risk of
missing a useful (new) drug.
I In a preliminary screening trial of new drugs:
I One may consider a high α (say α = 0.10), as we don’t want to miss
any possible useful drug that could warrant further testing.
I However, under such a condition, we may end up believing in (i.e.,
rejecting H0 for) all or some of the new drugs tested that may later turn
out to be useless.
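
The trade-off between α and β can be illustrated numerically for a two-sided z-test. In this sketch beta_z() is a hypothetical helper written for illustration, and the "true" mean of 18.25 is an arbitrary assumption. The other defaults reuse the encoder example.

```r
# Sketch: P(Type II error) for a two-sided z-test of H0: mu = mu0
# when the true mean is mu_true. beta_z() is not a built-in function.
beta_z <- function(alpha, mu_true, mu0 = 19.25, sigma = 1.5, n = 10) {
  z_crit <- qnorm(1 - alpha / 2)
  shift <- (mu_true - mu0) / (sigma / sqrt(n))
  # H0 is retained when the observed z falls between -z_crit and z_crit
  pnorm(z_crit - shift) - pnorm(-z_crit - shift)
}

beta_z(alpha = c(0.01, 0.05, 0.10), mu_true = 18.25)  # beta falls as alpha rises
beta_z(alpha = 0.05, mu_true = 18.25, n = c(10, 30))  # beta falls as n rises
```

When mu_true equals mu0 the function returns 1 − α, the top-left cell of the outcomes table.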
Confidence intervals
I Rather than hypothesis testing where we test for the parameter being
a particular value, a second approach is to use the sample to estimate
a range of plausible values for the parameter, using
Confidence Intervals
I Recall, P(−1.96 < Z < 1.96) = 0.95
I There is a 95% chance that a randomly selected z will lie in the range
(−1.96, 1.96).
I If a variable Y ∼ N(µ, σ²), then
$$P\left(-1.96 < \frac{\bar{Y} - \mu}{\sigma/\sqrt{n}} < 1.96\right) = 0.95$$
I There is a 95% chance that the interval with random endpoints
Ȳ ± 1.96 σ/√n would contain the fixed true population µ.
I NOT that µ is random and there is a 95% chance that it is inside the
fixed interval!
I 95% Confidence interval for µ (when σ is known) is
$$\bar{y} \pm 1.96\,\frac{\sigma}{\sqrt{n}}$$
Generalising Confidence levels
I General form for 100(1 − α)% CI for µ: (if σ is known)
$$\bar{y} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$
I Earlier encoder timing example, use α = 0.05, a 95% CI for µ is:
$$18.22 \pm 1.96 \times 1.5/\sqrt{10} = 18.22 \pm 0.930 = (17.29, 19.15)$$
[Figure: the interval from ȳ − 1.96 σ/√n to ȳ + 1.96 σ/√n, centred at ȳ, on an axis from 17.2 to 19.2]
I Use α = 0.01, a 99% CI for µ is:
$$18.22 \pm 2.58 \times 1.5/\sqrt{10} = 18.22 \pm 1.224 = (16.996, 19.444)$$
I Interpretation: We are 99% confident that the interval
(16.996, 19.444) contains the true mean.
I NB: This is a wider interval, as expected.
I For both the z-test and associated confidence interval, we need to
know the true value of σ.
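
The same intervals can be computed in R using qnorm() for the critical value. In this sketch z_ci() is a hypothetical helper, not a built-in function. It uses unrounded critical values, so the 99% interval differs slightly in the third decimal place from the hand calculation with 2.58.

```r
# Sketch: z-based confidence intervals for the encoder example.
ybar <- 18.22
sigma <- 1.5   # assumed known
n <- 10

z_ci <- function(level) {
  z <- qnorm(1 - (1 - level) / 2)  # e.g. about 1.96 for level = 0.95
  ybar + c(-1, 1) * z * sigma / sqrt(n)
}

round(z_ci(0.95), 2)  # 17.29 19.15
round(z_ci(0.99), 2)  # wider than the 95% interval
```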

Confidence intervals vs Hypothesis testing
I Confidence Intervals: Give a range of values where we believe the
parameter to be, given the set of data.
I Focus is on the estimation of the parameter.
I Hypothesis testing: Gives a measure of how much ‘evidence’ we have
against a claim.
I Focus is on specific claims about parameter.
I The choice of which to be used depends on the research question.

Encoder Example: Two sided tests H0 : µ = 19.25 (H1 : µ ≠ 19.25) and confidence intervals
Data: y = (18.3, 17.9, 19.1, 16.8, 18.9, 17.4, 19.6, 18.3, 19.6, 16.3)
I 95% CI and P-value (α = 0.05)
I CI for µ is (17.29, 19.15), which does not contain 19.25
I P-Value = 0.03 < 0.05, so H0 is rejected
I 99% CI and P-value (α = 0.01)
I CI for µ is (17.00, 19.44), which contains 19.25
I P-Value = 0.03 > 0.01, so H0 is not rejected

One sided tests
Previously in the encoder data, a two-sided test was conducted,
H0 : µ = 19.25    H1 : µ ≠ 19.25
I Sometimes the research question is only concerned with one direction.
I Here a quicker encoder (lower encoding time) is of interest.
I Investigate whether encoding time is lower than 19.25
I Carry out a one sided test.

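One sided test: Example H0 : µ = 19.25; H1 : µ < 19.25
$$z_{\text{obs}} = \frac{\bar{y} - \mu_0}{\sigma/\sqrt{n}} = \frac{18.22 - 19.25}{1.5/\sqrt{10}} = \frac{-1.03}{0.4743416} = -2.17$$
[Figure: standard normal density on an axis from −3 to 3, with an area of 0.015 marked in the lower tail only]
P-Value = P(Z < −2.1714307) = 0.015 < 0.05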
I Here we are only interested in the probability of a more extreme
mean that is lower than 19.25. Since P-value < 0.05, we reject H0 at
the 5% significance level, and conclude that there is evidence that the
true mean encoding time of the new encoder is less than 19.25
minutes, i.e., shorter than that of the encoder on the market.

Or we can say there is sufficient evidence at the 5% level of significance to
conclude that the true mean is less than 19.25.
Note: When we are only interested in one tail/side, say µ < 19.25, if a
different sample gives a ȳ that is greater than 19.25, no matter how high
it is, we will still retain H0. E.g.:
If ȳ = 50.0, resulting in a very large P-Value (almost 1), we will retain H0
that µ = 19.25 because there is absolutely no evidence to support H1
(µ < 19.25).
The only difference between one tailed and two tailed test is that you are
changing the rejection and acceptance regions (P-Value = One-tail vs
Two-tails). Everything else is the same.
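
The one-sided P-Values discussed above can be checked with pnorm(), which returns the lower-tail area by default. This is a sketch with illustrative variable names. The sample mean of 50 is the hypothetical value from the note.

```r
# Sketch: lower-tail P-Values for H1: mu < 19.25 with sigma = 1.5 and n = 10.
se <- 1.5 / sqrt(10)

z_obs <- (18.22 - 19.25) / se
p_lower <- pnorm(z_obs)   # about 0.015, half of the two-sided 0.03

z_high <- (50 - 19.25) / se
p_high <- pnorm(z_high)   # essentially 1, so H0 is retained

c(p_lower, p_high)
```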

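Comparison of Two-sided vs One-sided
[Figure: rejection regions under the standard normal curve.
Two-sided, H1 : µ ≠ µ0: reject H0 in both tails, area α/2 below −zα/2 and area α/2 above zα/2.
One-sided, H1 : µ < µ0: reject H0 in the lower tail, area α below −zα.
One-sided, H1 : µ > µ0: reject H0 in the upper tail, area α above zα.]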

One sided vs two sided tests
I Often only interested in one side/tail

I fertilizer to improve yield
I drug to reduce blood pressure
I learning technique to increase memory
I have numbers of a species declined
I Warning: You need to have a reason to do a one sided test. Does the
treatment improve/increase/reduce, etc? There is no right or wrong
answer - one researcher might do a two tailed test, another might do
a one tailed test.
I If you don’t have a reason, do a two sided test.
I Don’t let the data suggest a one sided test.

t-distribution and t-test

tν-distribution
I Generalises the Z distribution
I Symmetric about zero.
I Slightly flatter at the centre at zero than Z
I ‘Fatter’ tails than Z
I Defined by its ‘degrees of freedom’, ν
I As ν → ∞, tν → N(0, 1)
[Figure: density curves of Z, t2 and t10 on an axis from −2 to 2]
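
The fatter tails and the convergence to N(0, 1) show up in the critical values. This sketch compares the 97.5% quantile of the t distribution for a few arbitrarily chosen degrees of freedom with the Normal quantile.

```r
# Sketch: upper 2.5% critical values of t for increasing degrees of freedom.
df <- c(2, 10, 30, 1000)  # arbitrary illustrative choices
t_crit <- qt(0.975, df)

round(t_crit, 3)  # decreases towards the Normal value as df grows
qnorm(0.975)      # about 1.96
```

A larger critical value means a wider confidence interval, which reflects the extra uncertainty from estimating σ with s.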

One sample t-test
I If σ is known then our test statistic is:
$$z_{\text{obs}} = \frac{\bar{y} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1).$$
I Often it is not realistic to assume that σ is known, which is required
for a z test and relevant CI.
I The usual case is that no parameters are known and all we have is our
sample.
I So σ is estimated with s, the sample standard deviation.
I The new test statistic is
$$t_{\text{obs}} = \frac{\bar{Y} - \mu_0}{S/\sqrt{n}}$$
I NB: If Y ∼ N(µ0, σ²) then tobs ∼ tn−1.

Hypotheses: H0 : µ = 19.25; H1 : µ ≠ 19.25
Obs Test stat:
$$t = \frac{\bar{y} - 19.25}{s/\sqrt{n}} = \frac{18.22 - 19.25}{1.132/\sqrt{10}} = \frac{-1.03}{0.3579698} = -2.877$$
Null distribution: If H0 is true, T behaves like a tn−1 = t9
[Figure: t9 density on an axis from −5 to 5]
P-Value = 2P(tn−1 > |t|)
= 2P(t9 > |−2.877|)
= 0.018 (exact)
< 2 × 0.01 = 0.02 (using tables)
< 0.05
P-Value = P(t9 > 2.877 or t9 < −2.877) = 0.009 × 2 = 0.018
Therefore, reject H0 as the P-Value is less than 0.05, and there is sufficient
evidence that the mean encoding time is not 19.25 mins.

Encoder example: Hypothesis tests and confidence intervals
I P-Value = 0.0183. Conclusion depends on α:
I At the level α = 0.05, we reject H0.
I At the level α = 0.01, we do not reject H0.
I 95% confidence interval for µ, where tα/2 = 2.262 is taken from the t9
distribution:
$$\bar{y} \pm t_{\alpha/2}\,\frac{s}{\sqrt{n}} = 18.22 \pm 2.262 \times \frac{1.132}{\sqrt{10}} = 18.22 \pm 0.81 = (17.41, 19.03)$$
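
The t statistic, P-Value and interval above can be computed step by step in R with sd(), pt() and qt(). This sketch uses illustrative variable names and should agree with what t.test() reports for the same data.

```r
# Sketch: one-sample t-test and 95% CI for the encoder data, done by hand.
y <- c(18.3, 17.9, 19.1, 16.8, 18.9, 17.4, 19.6, 18.3, 19.6, 16.3)
mu0 <- 19.25
n <- length(y)

se <- sd(y) / sqrt(n)                       # estimated standard error
t_obs <- (mean(y) - mu0) / se               # about -2.877
p_value <- 2 * pt(-abs(t_obs), df = n - 1)  # two-sided, about 0.018
ci <- mean(y) + c(-1, 1) * qt(0.975, df = n - 1) * se

t_obs
p_value
round(ci, 2)  # 17.41 19.03
```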

Degrees of Freedom
I The degrees of freedom of the t distribution is always determined by
the degrees of freedom of the sample variance, s². It has a
mathematical definition (beyond the scope of this unit), but is
difficult to define simply in words. For the one-sample t test, the
degrees of freedom is n − 1. In this case:
I We have used s² in the t-test.
I In calculating s², we estimated µ with ȳ.
I So we lose a ‘degree of freedom’ in the data by using the sample
estimate.

Example: Sample of 3 observations
y1 = 8, y2 = 9, y3 = ?
I If we know ȳ = 9, we know the value of y3 (y3 = 3 × 9 − 8 − 9 = 10).
I The data has 2 degrees of freedom (2 of the sample values are free to
change before the 3rd is fixed).
I For the single sample case, remember (n − 1) degrees of freedom.

R (RStudio) and Class exercises

Setup: Part 1 – Download and install
I If you are using RStudio in the computing labs on campus, you may
want to use the one on Rstudio Cloud (https://rstudio.cloud).
Accessible via any browser.
I You will need to register (one option is to use your student email to
Sign up with Google) and create a New Project to setup your
personal/unit specific Workspace.
I As you will be working off a server, you will have to Upload any data
files first before carrying out your analysis and then Export/download
your results (if appropriate).
I You can find the Upload and Export (nested under More) button
within the Files pane.
I We are not recommending the version on appstream and
http://rstudio.science.mq.edu.au/.
I The best option is to simply install it on your computer. It is freely
available online.
I To install RStudio on your own computer,
I first download Base R from https://www.r-project.org/
I (preferably latest R version 4.0.2 (2020-06-22))
I If you have a Mac, install the latest release from the newest
R-x.x.x.pkg link (or a legacy version if you have an older operating
system). After you install R, you should also install XQuartz
http://xquartz.macosforge.org to be able to use some
visualisation packages.
I If you are installing the Windows version, choose the “base”
subdirectory and click on the download link at the top of the page.
After you install R, you should also install RTools
https://cran.rstudio.com/bin/windows/Rtools/; use the
“recommended” version highlighted near the top of the list.
I If you are using Linux, choose your specific operating system and
follow the installation instructions.
I download and install RStudio Desktop
https://www.rstudio.com/products/rstudio/download (Open
Source License) version for your operating system under the list titled
Installers for Supported Platforms.

Setup: Part 2 – Setup workspace
I Decide on a place to store your data and files
I Open RStudio
I Setup a project:
I Click on File menu at the top (or the add project icon)
I Select Existing Directory
I Browse to the directory of your choice
I Click Create Project
I You should then have a *.Rproj file. Simply open that to bring you
back to your workspace.

I Alternatively, setup the working directory:
I Click on Session menu at the top,
I Select Set Working Directory
I Select Choose Directory,
I Browse to the directory of your choice
I Click Select Folder
I Alternative methods for setting up the working directory:
I Command line: type setwd("C:/stat2170/") to set the working
directory to be the stat2170 directory on your C: drive (use the
equivalent path for Linux or Mac OS).
I Navigate to the directory in the Files pane and then set directory
using the More button.

Reading Data
I data below are measurements of pulse rate in beats per minute
(pulse), collected in a first year stats class.
dat = read.table("pulse.dat", header=TRUE)
I header=TRUE instructs that the first entry in the pulse.dat file (which
is ‘pulse’) is the variable name and the remaining entries are the data
I dat is an object that contains the pulse data. It can be accessed in two ways:
I Directly using dat$pulse (recommended)
I Attaching it to the environment
I attach(dat)
I pulse
I Downside to attaching: the attached variables are invisible in the workspace.
I Don’t forget to detach(dat).

Stem and leaf plot
stem(dat$pulse)
#
# The decimal point is 1 digit(s) to the right of the |
#
# 5 | 56
# 6 | 12
# 6 | 55689
# 7 | 133
# 7 | 667777
# 8 | 01112
# 8 | 556667
# 9 | 3

Histogram
hist(dat$pulse, main = "Student Pulse rates (bpm)")
[Figure: histogram titled "Student Pulse rates (bpm)"; x-axis dat$pulse from 60 to 90, y-axis Frequency from 0 to 7]

Boxplot
boxplot(dat$pulse, horizontal = TRUE,
        main = "Student Pulse rates (bpm)")
[Figure: horizontal boxplot titled "Student Pulse rates (bpm)"; axis from 60 to 90]

Numerical Summaries
summary(dat$pulse)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   55.00   68.25   77.00   75.23   81.75   93.00
sd(dat$pulse)
# [1] 9.729739
mean(dat$pulse)
# [1] 75.23333

Basics: Reading Data, summarise and test.
I Suppose the encoder data is stored in a text file, encoder.txt, in
the working directory.
I To carry out the t-test shown on the previous slides, type the
following R code into R or RStudio:
dat1 = read.table("encoder.txt", header = TRUE)
str(dat1) # Look at the data object
dat1 # Look at all observations
head(dat1) # display first six observations
summary(dat1$time) # obtain descriptive statistics
t.test(dat1$time, mu = 19.25) # Conduct the test
I NB: t.test uses a two sided test by default
I type ?t.test to see documentation
I RStudio output shown on the next few slides

Basics: RStudio output
dat1 = read.table("encoder.txt", header = TRUE)
str(dat1) # Look at the object
# 'data.frame': 10 obs. of 1 variable:
# $ time: num 18.3 17.9 19.1 16.8 18.9 17.4 19.6 18.3 19.6 16.3
dat1 # Look at the entire object
# time
# 1 18.3
# 2 17.9
# 3 19.1
# 4 16.8
# 5 18.9
# 6 17.4
# 7 19.6
# 8 18.3
# 9 19.6
# 10 16.3

head(dat1) # display first six records/rows
# time
# 1 18.3
# 2 17.9
# 3 19.1
# 4 16.8
# 5 18.9
# 6 17.4
tail(dat1) # display last six records/rows
# time
# 5 18.9
# 6 17.4
# 7 19.6
# 8 18.3
# 9 19.6
# 10 16.3


summary(dat1$time) # obtain descriptive statistics
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   16.30   17.52   18.30   18.22   19.05   19.60
I This is the ad-hoc ‘standard’ five number summary along with the
sample mean.

t.test(dat1$time, mu = 19.25) # Conduct the test
#
# One Sample t-test
#
# data: dat1$time
# t = -2.8769, df = 9, p-value = 0.01827
# alternative hypothesis: true mean is not equal to 19.25
# 95 percent confidence interval:
# 17.4101 19.0299
# sample estimates:
# mean of x
# 18.22

t.test(dat1$time, mu = 19.25, alternative = "less")
#
# One Sample t-test
#
# data: dat1$time
# t = -2.8769, df = 9, p-value = 0.009134
# alternative hypothesis: true mean is less than 19.25
# 95 percent confidence interval:
# -Inf 18.87629
# sample estimates:
# mean of x
# 18.22
I Note the P-Value is half of the corresponding two-sided P-value

t.test(dat1$time, mu = 19.25, alternative = "greater")
#
# One Sample t-test
#
# data: dat1$time
# t = -2.8769, df = 9, p-value = 0.9909
# alternative hypothesis: true mean is greater than 19.25
# 95 percent confidence interval:
# 17.56371 Inf
# sample estimates:
# mean of x
# 18.22
I P-value > 0.05 ⇒ retain H0

References
Video’s Made by a/Prof Robin Turner. n.d.
