
STAT2170 and STAT6180: Applied Statistics

Week 1: Review of Basic Statistics


Karol Binkowski

Statistical terms and notation


Statistical terms and notation

I When we carry out an experiment or a survey, we often take a sample from a much larger population.

I Population refers to all the individuals (as a whole) that are relevant
to a study or a research question of interest. It can be as general or
as specific as you like, depending on what group you want to make
conclusions about. E.g.:
I all rock-wallabies (including past and future)

I all two-year old rock-wallabies in the Sydney region

I all students who studied statistics last year

I all currently registered accounting firms in NSW

I Population of interest is known as the target population



Sample

I A sample is usually a manageable subset of the target population to be studied.

I We may use that sample to make statements (known as making inference) about that population.

I The experiment or survey needs to be designed so that the sample to be used is an unbiased representation of that population
I i.e. a representative sample of the target population.

I One way to obtain a representative sample is to use some random selection process, such as taking simple random samples. There are also other ways to get good and representative samples (covered in STAT2114 or STAT6114).



Sampling

I Think of an example of a sample and name its target population


A (yum) example
I Say we want to know what proportion of M&Ms are yellow

I Without a sample, what would be your estimate?


I If an equal colour distribution, we expect __ out of __ to be yellow,
or __%

I We can’t count all M&Ms produced, so . . .



My M&Ms sample

I I opened a 200g bag and found 217 M&Ms in total but only 8 yellow.
I Therefore, 8/217 ≈ 3.7% were yellow
I Reminder: we expected 16.7%
I Would we get a different answer if we had picked a different sample?



Sampling demonstration

I Video (Videos made by A/Prof Robin Turner, n.d.): Randomly sampling 200 M&M's assuming an equal colour distribution and determining the proportion of yellow (a simulation sketch follows below)

I 3.7% is expressed as 0.037 here (i.e. as a proportion)
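
A minimal R sketch of this kind of simulation (not the code used in the video); it assumes six equally likely colours:

# Simulate many bags of 200 M&Ms with each of the 6 colours equally likely,
# and look at how the proportion of yellows varies from sample to sample.
set.seed(1)
yellow_counts <- rbinom(1000, size = 200, prob = 1/6)
prop_yellow <- yellow_counts / 200
summary(prop_yellow)        # proportions vary around 1/6 = 0.167
mean(prop_yellow <= 8/217)  # how often a sample is as low as 3.7% yellow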


Parameter and Statistics

I Parameters: (known as population parameters)


I values that characterize a population

I fixed for that population

I usually never known

I Statistics: (known as sample statistics)


I values that characterize a sample

I vary from sample to sample

I known exactly (can be computed from sample)

I used to estimate parameters



Notation

                      Sample   Population
mean                  x̄        µ
standard deviation    s        σ
variance              s²       σ²
proportion            p        π
regression coef.      b        β
correlation coef.     r        ρ



Normal distribution and central limit theorem
Normal Distribution
I bell-shaped and completely defined by µ & σ (see graphs)

I may be written as Y ∼ N(µ, σ²)

I symmetric about µ

I extends from −∞ to ∞

I Many variables are modelled with a Normal distribution, such as weights, heights and GPA scores

[Graphs: Normal density curves with different µ, same σ (left panel) and same µ, different σ (right panel).]
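
A small R sketch that reproduces plots of this kind; the particular µ and σ values are illustrative assumptions:

x <- seq(-6, 6, length.out = 400)
op <- par(mfrow = c(1, 2))
plot(x, dnorm(x, mean = -2, sd = 1), type = "l", ylab = "density",
     main = "Different mu, same sigma")
lines(x, dnorm(x, mean = 2, sd = 1), lty = 2)
plot(x, dnorm(x, mean = 0, sd = 1), type = "l", ylab = "density",
     main = "Same mu, different sigma")
lines(x, dnorm(x, mean = 0, sd = 2), lty = 2)
par(op)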
The Central Limit Theorem (CLT)

I For any distribution, means of repeated samples of the same size will be approximately normally distributed if the sample size (n) is large enough. That is, regardless of the distribution of a random variable Y,

      Ȳ ∼ N(µ, σ²/n)   (approximately), if n is sufficiently large.

I The more skewed the original data (population), the larger the sample size required for Ȳ to be approximately normally distributed (see the simulation sketch below).

I For a not so skewed variable, a size (n) of 10-15 may be large enough
and 20-30 will often be large enough.
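
A minimal simulation sketch of the CLT; the exponential population is an assumption chosen only because it is skewed:

set.seed(1)
means_n5  <- replicate(5000, mean(rexp(5, rate = 1)))    # sample means, n = 5
means_n30 <- replicate(5000, mean(rexp(30, rate = 1)))   # sample means, n = 30
op <- par(mfrow = c(1, 2))
hist(means_n5,  main = "n = 5",  xlab = "sample mean")   # still right-skewed
hist(means_n30, main = "n = 30", xlab = "sample mean")   # closer to bell-shaped
par(op)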



CLT demonstration

I Video (Videos made by A/Prof Robin Turner, n.d.): Randomly sampling 30 people from a (skewed) population and estimating the average height in cm

I σ = 23.8cm





Z-scores

I If Y is Normally distributed N(µ, σ²), then:

I For an individual Y:

      (Y − µ)/σ ∼ Z

  where Z ∼ N(0, 1), the so-called 'standard' normal distribution.

I For a sample mean Ȳ:

      (Ȳ − µ)/(σ/√n) ∼ Z ∼ N(0, 1)

  This sample mean result is exact if Y is Normally distributed, but only approximate if Y is not normal and n is large enough (the approximation improves as n increases).

I The standard deviation of sample means, known as the standard error (se), is

      se(Ȳ) = σ/√n
It is clear from the expression of the standard error, se(Ȳ) = σ/√n, that the standard deviation of the sample means can be reduced by increasing the sample size.
Note:
I σ is the measure of variability between individuals in the population;

I The SE of the sample mean measures the variability of the sample means, and thus is an indication of the accuracy of a sample mean as an estimate of the population mean.


Standard error demonstration
I Video (Videos made by A/Prof Robin Turner, n.d.): Randomly sampling 10, 30 and 100 people from a (skewed) population and estimating the average height in cm
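
A rough R sketch in the spirit of this demonstration; the sd of 23.8 cm comes from the height example, while the gamma population with mean 170 cm is an assumption used only to get a skewed shape:

set.seed(1)
pop_sd <- 23.8
for (n in c(10, 30, 100)) {
  means <- replicate(5000, mean(rgamma(n, shape = 51, scale = 170/51)))
  cat(sprintf("n = %3d  sd of sample means = %5.2f  sigma/sqrt(n) = %5.2f\n",
              n, sd(means), pop_sd / sqrt(n)))
}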



Hypothesis testing - z-test
Example : z-test
I A video encoder algorithm has a rendering time for an industry standard video that is normally distributed with a mean of 19.25 minutes and a variance of 2.25 (i.e. σ = 1.5)

I A new video encoder is developed and our question is:


I Does this new encoder have a different encoding time compared
with the encoder available on the market?

I Take a sample of 10 different computers (equivalent specification) and obtain the encoding time in minutes for each using the new encoder:

I Data (y): 18.3 17.9 19.1 16.8 18.9 17.4 19.6 18.3 19.6 16.3

I ȳ = 18.22 and s = 1.132


Is this evidence that the new encoder:
I has a population mean of 19.25, i.e., this is just random (chance)
variation about that value.

I OR has a population mean that is NOT 19.25.

We assume that the standard deviation is unchanged, i.e., σ = 1.5. (More on this later.) As σ is known, we may carry out a z-test with the following hypotheses, to answer the research question.

Notation Terminology
H0 : µ = 19.25 Null hypothesis
H1 : µ ≠ 19.25 Alternative hypothesis; two tailed/sided
α = 0.05 Significance level



zobs = (ȳ − µ0)/(σ/√n) = (18.22 − 19.25)/(1.5/√10) = −1.03/0.4743 = −2.17

[Graph: standard normal density with the two tail areas of 0.015 each shaded beyond ±2.17.]

P-Value = P(Z < −2.17 or Z > 2.17) = (1 − 0.985) × 2 = 0.03


I Therefore, reject H0 at the 5% significance level as the P-Value is less
than 0.05, and there is sufficient evidence at the 5% level of
significance to conclude that the true mean encoding time for the new
encoder is NOT 19.25 minutes but significantly less than 19.25
minutes. (It was 18.22 in the sample.)
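
Base R has no one-sample z-test function, but the calculation on this slide can be checked directly (a sketch, matching the values above):

y <- c(18.3, 17.9, 19.1, 16.8, 18.9, 17.4, 19.6, 18.3, 19.6, 16.3)
sigma <- 1.5                                    # assumed known
z_obs <- (mean(y) - 19.25) / (sigma / sqrt(length(y)))
z_obs                     # -2.17
2 * pnorm(-abs(z_obs))    # two-sided P-Value, about 0.03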



Steps of hypothesis testing:

1. Set up the null and alternative hypotheses and choose the level of significance α.

2. Assume H0 is true and calculate the test statistic, checking the relevant assumptions required for the test.

3. Calculate the P-Value, assuming H0 is true.

4. Compare P-Value with level of significance:


I Reject H0 if P-Value < α.

I Do Not Reject H0 if P-Value > α.

5. Write a statistical and contextual conclusion.



More on P-Value

I P-Value is the probability of the test statistic taking its observed value or a more extreme value (further away from the hypothesised value specified in H0, known as the null value).

I The lower the P-Value, the less evidence we have to ‘believe’ H0 (and
the more evidence we have against H0 ).

I The P-Value is NOT the probability that H0 is true. (A not-guilty verdict in court does not mean that the defendant is innocent. They are either innocent or there wasn't enough evidence to convict them.)



Significance Level

I The significance level is the 'cut-off point', the minimum probability for believing H0 (not rejecting it).

I For the encoding example, if H0 was true, we would expect 3% (the P-Value) of our samples to have a sample mean at least as far from 19.25 as the one we observed (2.17 standard errors away), i.e., 97% of the time we would observe a sample mean closer to 19.25.

I If using 5% (0.05) as the significance level, we would say that it is unlikely (as the P-Value < 0.05) that this sample came from a population with a mean of 19.25, i.e., 3% (< 5%) is considered unlikely enough for us to doubt that the true (population) mean of the new encoder is in fact 19.25. So we conclude that we have evidence to suggest that µ (population mean of the new encoder) is significantly different from 19.25.



I If we had chosen a significance level of α = 0.01, then the P-Value (0.03) is considered large as it is greater than this α, and so we would not reject H0. In this case we can say that we do NOT have sufficient evidence to conclude that the true mean time of the new encoder, µ, is not 19.25 at the 1% level of significance, i.e., the true mean could be 19.25.

I Based on our analyses, we have enough evidence to conclude that µ is not 19.25 at the 5% level of significance, but not enough evidence to conclude that µ is not 19.25 at the 1% level of significance.

I We will never know for sure whether µ is 19.25 or not. So whatever our conclusion, we could be making an error.



Back to the M&Ms example

I Our sample had 3.7% yellow

I How likely is it to get our sample (or more extreme) if the true
proportion of yellow M&Ms is 16.7%?

I P-Value < 0.0001

I Conclusion?
I Note: calculations for the P-Value for a proportion are not shown.
This will not be tested in this course (it made for a ‘yum’ example).
Type I and Type II errors

If H0 is true and we reject H0 (because, by random chance, our sample has an unusually low or high mean) ⇒ Type-I error.
P(Type I error) = α (the significance level),
since the test rejects the most extreme α proportion of z-values.

If H0 is false and we do not reject H0 (because we obtained an unusual sample with a mean similar to the hypothesised value) ⇒ Type-II error.
P(Type II error) = β (= 1 − the power of the test),
which depends on α, the true value of µ and, of course, the sample size n.



Possible Outcomes for tests of significance
H0 true H0 false
H0 retained 1−α Type II error: β
H0 rejected Type I error: α Power: 1 − β

E.g. Intuitive Court outcomes


I H0 : Accused is innocent

I H1 : Accused is guilty.

Decision     H0 true (accused innocent)   H0 false (accused guilty)
Acquittal    Correct acquittal            Bad acquittal
Convicted    Bad conviction               Correct conviction



I Note: Can’t simultaneously make α and β low:
I As α increases, β decreases.

I As α decreases, β increases.

I Typically α is set to 0.05, sometimes 0.01; a high α value would result in a relatively low β (see the power calculation sketch at the end of this section).

I The choice of level of significance depends on the experiment/survey.


For example:
I If trialing a new drug to lower blood pressure, we may consider a very
low α (say α = 0.005) as we don’t want to take the risk of putting on
the market a useless (new) drug, although this will increase the risk of
missing a useful (new) drug.

I In a preliminary screening trial of new drugs:


I One may consider a high α (say α = 0.10), as we don't want to miss any possible useful drug that could warrant further testing.

I However, under such a condition, we may reject H0 for (i.e., 'believe in') some of the new drugs tested that may later turn out not to be useful.
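
A minimal sketch of this trade-off using base R's power.t.test(); the effect size (delta) and sd below are illustrative assumptions, not values from the slides:

power.t.test(n = 10, delta = 1, sd = 1.5, sig.level = 0.05,
             type = "one.sample")$power    # power (1 - beta) at alpha = 0.05
power.t.test(n = 10, delta = 1, sd = 1.5, sig.level = 0.005,
             type = "one.sample")$power    # stricter alpha gives lower power (higher beta)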
Confidence intervals
I Rather than hypothesis testing, where we test for the parameter being a particular value, a second approach is to use the sample to estimate what values (or range of values) the parameter might take, using Confidence Intervals

I Recall, P(−1.96 < Z < 1.96) = 0.95


I There is a 95% chance that a randomly selected z will lie in the range
(−1.96, 1.96).

I If a variable Y ∼ N(µ, σ²), then P(−1.96 < (Ȳ − µ)/(σ/√n) < 1.96) = 0.95
I There is a 95% chance that the interval with random endpoints ȳ ± 1.96 σ/√n would contain the fixed true population µ.

I NOT that µ is random and there is a 95% chance that it is inside the fixed interval!

I The 95% Confidence interval for µ (when σ is known) is

      ȳ ± 1.96 σ/√n
Generalising Confidence levels

I General form of the 100(1 − α)% CI for µ (if σ is known):

      ȳ ± zα/2 σ/√n

I For the earlier encoder timing example, with α = 0.05, a 95% CI for µ is:

      18.22 ± 1.96 × 1.5/√10
      18.22 ± 0.92904 = (17.29, 19.15)

[Number line: the interval from ȳ − 1.96 σ/√n to ȳ + 1.96 σ/√n, centred at ȳ = 18.22.]


Using α = 0.01, a 99% CI for µ is:
      18.22 ± 2.58 × 1.5/√10
      18.22 ± 1.22292 = (16.996, 19.444)

I Interpretation: We are 99% confident that the interval (16.996, 19.444) contains the true mean.
I NB: This is a wider interval, as expected.

I For both the z-test and the associated confidence interval, we need to know the true value of σ.
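
A quick R sketch of the two intervals above, using qnorm() for the critical values:

y_bar <- 18.22; sigma <- 1.5; n <- 10
y_bar + c(-1, 1) * qnorm(0.975) * sigma / sqrt(n)   # 95% CI: (17.29, 19.15)
y_bar + c(-1, 1) * qnorm(0.995) * sigma / sqrt(n)   # 99% CI: (17.00, 19.44)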



Confidence intervals vs Hypothesis testing

I Confidence Intervals: Give a range of values where we believe the parameter to be, given the set of data.
I Focus is on the estimation of the parameter.

I Hypothesis testing: Gives a measure of how much 'evidence' we have against a claim.
I Focus is on specific claims about the parameter.

I The choice of which to be used depends on the research question.



Encoder Example: Two-sided test H0 : µ = 19.25 (H1 : µ ≠ 19.25) and confidence intervals

Data: y = (18.3, 17.9, 19.1, 16.8, 18.9, 17.4, 19.6, 18.3, 19.6, 16.3)
I 95% CI and P-value (α = 0.05)
I CI for µ is (17.29, 19.15)

I P-Value = 0.03 < 0.05

I 99% CI and P-value (α = 0.01)


I CI for µ is (17, 19.44)

I P-Value = 0.03 > 0.01



One sided tests

Previously in the encoder data, a two-sided test was conducted:

      H0 : µ = 19.25    H1 : µ ≠ 19.25

I Sometimes the research question is only concerned with one direction.


I Here a quicker encoder (lower encoding time) is of interest.

I Investigate whether encoding time is lower than 19.25

I Carry out a one sided test.



One sided test: Example H0 : µ = 19.25; H1 : µ < 19.25

zobs = (ȳ − µ0)/(σ/√n) = (18.22 − 19.25)/(1.5/√10) = −1.03/0.4743 = −2.17

[Graph: standard normal density with the lower tail area of 0.015 shaded below −2.17 (one tail only).]

P-Value = P(Z < −2.17) = 0.015 < 0.05
I Here we are only interested in the probability of a more divergent
mean that is lower than 19.25. Since P-value < 0.05, we reject H0 at
5% significance level, and conclude that there is evidence that the
true mean encoding time of the new encoder is less than 19.25
minutes, i.e., shorter than that of the encoder on the market.
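
A one-line R check of this one-sided P-Value (a sketch, using the same z statistic as above):

z_obs <- (18.22 - 19.25) / (1.5 / sqrt(10))
pnorm(z_obs)    # lower-tail P-Value, about 0.015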



Or we can say there is sufficient evidence at the 5% level of significance to
conclude that the true mean is less than 19.25
Note: When we are only interested in one tail/side, say µ < 19.25, if a different sample gives a ȳ that is greater than 19.25, no matter how high it is, we will still retain H0 (probably with a high level of confidence). E.g.:
If ȳ = 50.0, resulting in a very large P-Value (almost 1), we will retain H0
that µ = 19.25 because there is absolutely no evidence to support H1
(µ < 19.25).
The only difference between one tailed and two tailed test is that you are
changing the rejection and acceptance regions (P-Value = One-tail vs
Two-tails). Everything else is the same.



Comparison of Two-sided vs One-sided

Two-sided; H1 : µ ≠ µ0
[Graph: standard normal density; reject H0 in both tails, each of area α/2, beyond −zα/2 and zα/2.]

One-sided; H1 : µ < µ0
[Graph: reject H0 in the lower tail of area α, below −zα.]

One-sided; H1 : µ > µ0
[Graph: reject H0 in the upper tail of area α, above zα.]


One sided vs two sided tests

I Often only interested in one side/tail


I fertilizer to improve yield

I drug to reduce blood pressure

I learning technique to increase memory

I have numbers of a species declined

I Warning: You need to have a reason to do a one sided test. Does the
treatment improve/increase/reduce, etc? There is no right or wrong
answer - one researcher might do a two tailed test, another might do
a one tailed test.
I If you don’t have a reason, do a two sided test.

I Don’t let the data suggest a one sided test.



t-distribution and t-test

tν-distribution

I Generalises the Z distribution
I Symmetric about zero
I Slightly flatter at the centre (at zero) than Z
I 'Fatter' tails than Z
I Defined by its 'degrees of freedom' ν (written tν)
I As ν → ∞, tν → N(0, 1)

[Graph: densities of Z, t2 and t10; the t densities are lower at zero and heavier in the tails.]
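
A short R sketch that draws the comparison described above (Z against t2 and t10):

x <- seq(-4, 4, length.out = 400)
plot(x, dnorm(x), type = "l", ylab = "density", main = "Z vs t distributions")
lines(x, dt(x, df = 2),  lty = 2)   # t2: fatter tails, flatter at zero
lines(x, dt(x, df = 10), lty = 3)   # t10: already close to Z
legend("topright", legend = c("Z", "t2", "t10"), lty = 1:3)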


One sample t-test

I If σ is known then our test statistic is:

      zobs = (ȳ − µ0)/(σ/√n) ∼ N(0, 1)

I Often it is not realistic to assume that σ is known, which is required for a z-test and the relevant CI.

I The usual case is that no parameters are known and all we have is our sample.
I So σ is estimated with s, the sample standard deviation.

I The new test statistic is

      tobs = (Ȳ − µ0)/(S/√n)

I NB: If Y ∼ N(µ0, σ²) then tobs ∼ tn−1.


Hypotheses: H0 : µ = 19.25; Ha : µ ≠ 19.25

Observed test statistic:
      t = (ȳ − 19.25)/(s/√n) = (18.22 − 19.25)/(1.132/√10) = −1.03/0.358 = −2.877

Null distribution: If H0 is true, T behaves like a tn−1 = t9 distribution.

[Graph: t9 density with both tails beyond ±2.877 shaded.]

P-Value = 2 P(tn−1 > |t|)
        = 2 P(t9 > |−2.877|)
        = 0.018 (exact)
        < 2 × 0.01 = 0.02 (using tables)
        < 0.05

Equivalently, P-Value = P(t9 > 2.877 or t9 < −2.877) = 0.009 × 2 = 0.018.

Therefore, reject H0 as the P-Value is less than 0.05, and there is sufficient evidence that the mean encoding time is not 19.25 mins.
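
The same calculation can be done step by step in R (a sketch; the t.test() call shown later in these slides gives the same answer):

y <- c(18.3, 17.9, 19.1, 16.8, 18.9, 17.4, 19.6, 18.3, 19.6, 16.3)
t_obs <- (mean(y) - 19.25) / (sd(y) / sqrt(length(y)))
t_obs                                      # -2.877
2 * pt(-abs(t_obs), df = length(y) - 1)    # two-sided P-Value, about 0.018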


Encoder example: Hypothesis tests and confidence intervals

I P-Value = 0.0181. Conclusion depends on α:


I At the level α = 0.05, we reject H0 .

I At the level α = 0.01, we do not reject H0 .

I 95% confidence interval for µ:

      ȳ ± tα/2 × s/√n
      18.22 ± 2.262 × 1.132/√10
      18.22 ± 0.81 = (17.41, 19.03)
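
A quick check of this interval in R, using qt() for the critical value (a sketch):

y <- c(18.3, 17.9, 19.1, 16.8, 18.9, 17.4, 19.6, 18.3, 19.6, 16.3)
mean(y) + c(-1, 1) * qt(0.975, df = 9) * sd(y) / sqrt(10)   # (17.41, 19.03)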


Degrees of Freedom

I The degrees of freedom of the t distribution is always determined by the degrees of freedom of the sample variance, s². It has a mathematical definition (beyond the scope of this unit), but is difficult to define simply in words. For the one-sample t test, the degrees of freedom is n − 1. In this case:
I We have used s² in the t-test.

I In calculating s², we estimated µ with ȳ.

I So we lose a 'degree of freedom' in the data by using the sample estimate.



Example: Sample of 3 observations

y1 = 8, y2 = 9, y3 = ?
I If we know ȳ = 9, we know the value of y3 (it must be 10).
I The data has 2 degrees of freedom (2 of the sample values are free to change before the 3rd is fixed).

I For the single sample case, remember (n − 1) degrees of freedom.
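
A tiny R sketch of the same point: once the sample mean is fixed, the last observation is determined by the others.

y1 <- 8; y2 <- 9; y_bar <- 9; n <- 3
y3 <- n * y_bar - y1 - y2   # the third value is forced
y3                          # 10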



R (RStudio) and Class exercises
Setup: Part 1 – Download and install
I If you are using RStudio in the computing labs on campus, you may want to use the one on RStudio Cloud (https://rstudio.cloud), accessible via any browser.

I You will need to register (one option is to use your student email to
Sign up with Google) and create a New Project to setup your
personal/unit specific Workspace.

I As you will be working off a server, you will have to Upload any data
files first before carrying out your analysis and then Export/download
your results (if appropriate).
I You can find the Upload and Export (nested under More) button
within the Files pane.

I We are not recommending the version on appstream and http://rstudio.science.mq.edu.au/.

I The best option is to simply install it on your computer. It is freely available online.
I To install RStudio on your own computer,
I first download Base R from https://www.r-project.org/
I (preferably latest R version 4.0.2 (2020-06-22))

I If you have a Mac, install the latest release from the newest
R-x.x.x.pkg link (or a legacy version if you have an older operating
system). After you install R, you should also install XQuartz
http://xquartz.macosforge.org to be able to use some
visualisation packages.

I If you are installing the Windows version, choose the "base" subdirectory and click on the download link at the top of the page. After you install R, you should also install RTools https://cran.rstudio.com/bin/windows/Rtools/; use the "recommended" version highlighted near the top of the list.

I If you are using Linux, choose your specific operating system and
follow the installation instructions.

I download and install the RStudio Desktop (Open Source License) version for your operating system from https://www.rstudio.com/products/rstudio/download, under the list titled Installers for Supported Platforms.



Setup: Part 2 – Setup workspace

I Decide on a place to store your data and files

I Open RStudio

I Setup a project:
I Click on File menu at the top (or the add project icon)

I Select Existing Directory

I Browse to the directory of your choice

I Click Create Project

I You should then gain a *.Rproj file. Simply open that to bring you
back to your workspace.



I Alternatively, setup the working directory:
I Click on Session menu at the top,

I Select Set Working Directory

I Select Choose Directory,

I Browse to the directory of your choice

I Click Select Folder

I Alternative methods for setting up the working directory:

I Command line: type setwd("C:/stat2170/") to set the working directory to be the stat2170 directory on your C: drive (use the equivalent path for Linux or Mac OS).

I Navigate to the directory in the Files pane and then set directory
using the More button.



Reading Data

I The data below are measurements of pulse rate in beats per minute (pulse), collected in a first year stats class.

dat = read.table("pulse.dat", header=TRUE)

I header=TRUE instructs that the first entry in the pulse.dat file (which
is ‘pulse’) is the variable name and the remaining entries are the data
I dat is an object that contains the pulse data. Can see it two ways
I Directly using dat$pulse (recommended)

I Attaching it to the environment


I attach(dat)

I pulse

I A downside to attaching: the attached data is invisible in the workspace.

I Don’t forget to detach(dat).



Stem and leaf plot

stem(dat$pulse)

#
# The decimal point is 1 digit(s) to the right of the |
#
# 5 | 56
# 6 | 12
# 6 | 55689
# 7 | 133
# 7 | 667777
# 8 | 01112
# 8 | 556667
# 9 | 3



Histogram

hist(dat$pulse, main = "Student Pulse rates (bpm)")

[Histogram of dat$pulse; title "Student Pulse rates (bpm)", x-axis dat$pulse (60 to 90), y-axis Frequency.]


Boxplot

boxplot(dat$pulse, horizontal = TRUE,
        main = "Student Pulse rates (bpm)")

[Horizontal boxplot of dat$pulse; title "Student Pulse rates (bpm)", axis from 60 to 90.]


Numerical Summaries

summary(dat$pulse)

#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   55.00   68.25   77.00   75.23   81.75   93.00

sd(dat$pulse)

# [1] 9.729739

mean(dat$pulse)

# [1] 75.23333



Basics: Reading Data, summarise and test.

I Suppose the encoder data is stored in a text file, encoder.txt, in the working directory.
I To carry out the t-test shown on the previous slides, type the
following R code into R or RStudio:

dat1 = read.table("encoder.txt", header = TRUE)
str(dat1)              # Look at the data object
dat1                   # Look at all observations
head(dat1)             # display first six observations
summary(dat1$time)     # obtain descriptive statistics
t.test(dat1$time, mu = 19.25)  # Conduct the test

I NB: t.test uses a two-sided test by default

I type ?t.test to see the documentation

I RStudio output shown on the next few slides



Basics: RStudio output

dat1 = read.table("encoder.txt", header = TRUE)
str(dat1)    # Look at the object

# 'data.frame': 10 obs. of 1 variable:


# $ time: num 18.3 17.9 19.1 16.8 18.9 17.4 19.6 18.3 19.6 16.3

dat1 # Look at the entire object

# time
# 1 18.3
# 2 17.9
# 3 19.1
# 4 16.8
# 5 18.9
# 6 17.4
# 7 19.6
# 8 18.3
# 9 19.6
# 10 16.3



head(dat1) # display first six records/rows

# time
# 1 18.3
# 2 17.9
# 3 19.1
# 4 16.8
# 5 18.9
# 6 17.4

tail(dat1) # display last six records/rows

# time
# 5 18.9
# 6 17.4
# 7 19.6
# 8 18.3
# 9 19.6
# 10 16.3



summary(dat1$time) # obtain descriptive statistics

#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   16.30   17.52   18.30   18.22   19.05   19.60

I This is the ad-hoc 'standard' five number summary along with the sample mean.



summary(dat1$time) # obtain descriptive statistics

#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   16.30   17.52   18.30   18.22   19.05   19.60

t.test(dat1$time, mu = 19.25) # Conduct the test

#
# One Sample t-test
#
# data: dat1$time
# t = -2.8769, df = 9, p-value = 0.01827
# alternative hypothesis: true mean is not equal to 19.25
# 95 percent confidence interval:
# 17.4101 19.0299
# sample estimates:
# mean of x
# 18.22



t.test(dat1$time, mu = 19.25, alternative = "less")

#
# One Sample t-test
#
# data: dat1$time
# t = -2.8769, df = 9, p-value = 0.009134
# alternative hypothesis: true mean is less than 19.25
# 95 percent confidence interval:
# -Inf 18.87629
# sample estimates:
# mean of x
# 18.22

I Note the P-Value is half of the corresponding two-sided P-value



t.test(dat1$time, mu = 19.25, alternative = "greater")

#
# One Sample t-test
#
# data: dat1$time
# t = -2.8769, df = 9, p-value = 0.9909
# alternative hypothesis: true mean is greater than 19.25
# 95 percent confidence interval:
# 17.56371 Inf
# sample estimates:
# mean of x
# 18.22

I P-value > 0.05 ⇒ retain H0



References

Videos made by A/Prof Robin Turner. n.d.

