STAT 04 Simplify Notes
2 This is where R functions are typed (or otherwise input) and most of the output appears
3 The 'Environment' tab will display data objects created by the program
4 There are a number of tabs but those you are most likely to use are Files, Plots and Help
1. Simple Functions
To make minor changes to earlier commands, use the ↑ key in pane 2 to cycle through what you have entered previously
Arithmetic operators in order of precedence: (), ^, * and /, + and - (* and / have equal precedence, as do + and -)
Exponent notation
e.g. 3.142e-4
0.0003142
Assign a value to a name (never leave a space between < and -, they should be treated as a single entity)
e.g. a <- 2
Find Remainder (use %%)
666 / 17
[1] 39.17647
666 - 39 * 17
[1] 3
666%%17
[1] 3
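The manual calculation above (divide, keep the whole part, subtract) can be done in one step each. A minimal sketch using the same numbers; %/% is R's integer-division operator, which these notes do not otherwise mention:

```r
# Integer quotient and remainder of 666 divided by 17
quotient  <- 666 %/% 17    # whole number of times 17 goes into 666
remainder <- 666 %% 17     # what is left over
quotient * 17 + remainder  # reconstructs the original number
```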
Useful Packages
base Core arithmetical, logical, input, output and programming functions. The second item under 'A'
is abs, the function for calculating absolute values
stats Fundamental statistical functions including all well known distributions and tests
graphics R has very powerful graphical facilities which we will explore later in the course.
Note that, confusingly, the absolute value of a number (eg |−6|=6) is sometimes also referred to using
the term modulus - the R function is abs () - but here the term 'modulus' will be used for the remainder
when a number is divided by another
‘Help’ tab
Usage lists the pieces of data (arguments) required for the function to work. Many of the arguments
have a default setting, given by the = sign
Value explains what is output when you use the function
See Also describes functions which are similar to the one you are considering and examining these is a
good way of exploring R
Examples near the bottom are always helpful and often use the built in datasets. Mentions of S, S3 and
S4 refer to the letter after R as discussed in Section 1.1.
2.1 Vectors
Vectors - a sequence of pieces of the same type of data, e.g. (2,4,8,16) or ("John","Paul","George","Ringo")
You cannot mix types within a vector, e.g. (2,"John"); c () would convert both elements to character
c () – combine (lower case: R is case sensitive)
Examples (type these into R)
fib_seq_7 <- c (1, 1, 2, 3, 5, 8, 13)
first_134 <- seq (from = 1, to = 134, by = 1)
first_135_not_1 <- first_134 + 1
Display all the values by simply typing the name
fib_seq_7
[1] 1 1 2 3 5 8 13
Result Display:
2. three_digit_integers_up_to_134 - the code in [ ] compares the values in first_134 with 99 and produces a
vector of TRUE / FALSE entries. The whole statement then just picks out those elements of the original
vector which are TRUE.
3. first_20_sq - the vector within [ ] is different from the one in which we are interested. They have the
same length, so the TRUE / FALSE vector created can be used to pick elements from the vector of
interest
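The statements being described are not shown in these notes. A plausible reconstruction, assuming three_digit_integers_up_to_134 was built by comparing first_134 with 99 and first_20_sq by indexing a vector of squares (the name first_134_sq is hypothetical):

```r
first_134 <- seq(from = 1, to = 134, by = 1)

# A logical vector inside [ ] keeps only the elements where the test is TRUE
three_digit_integers_up_to_134 <- first_134[first_134 > 99]

# The logical vector comes from first_134, but it selects from a
# different vector of the same length (the squares)
first_134_sq <- first_134^2                    # hypothetical helper name
first_20_sq  <- first_134_sq[first_134 <= 20]
```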
first_134 <= 20
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[10] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[19] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Rest are all FALSE
[ ] can use TRUE / FALSE as well as the numerical position in the vector
first_20_sq
[1] 1 4 9 16 25 36 49 64 81 100 121 144 169 196 225 256 289
[18] 324 361 400
[ ] in the first column of the output represents the position of the left hand element of that row in the display
of the vector
Continuous (quantitative) These data can take any value within some interval on the real line.
afence <- runif (141)
runif: random uniform deviates - selecting randomly from the interval [0,1]
Creating summary statistics
summary (afence)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.01067 0.29054 0.48652 0.51276 0.76836 0.99110
Discrete (quantitative) These are numerical data that can only take certain values, often integers,
e.g. the built-in 'rivers' dataset
For integer data, minimum and maximum are the two summary statistics (from the six) that are always integers
Ordinal (qualitative) These are data with a natural ordering but which are described by labels
Poker cards - the ordering of the suits is spades (♠), hearts (♡), diamonds (♢), clubs (♣), and the ranks A(ce),
K(ing), Q(ueen), J(ack), 10, 9, 8, 7, 6, 5, 4, 3, 2
table () is useful for discrete data with repeated values; it is also fine for categorical and ordinal data
It is not useful for continuous data: each entry is likely to be different, so the table would just show
each value with a count of 1 and so would be worthless
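A small sketch of the contrast, using made-up values (the vector shoe_sizes is hypothetical, not course data):

```r
# Discrete data with repeats: table() gives a useful count per value
shoe_sizes  <- c(5, 6, 6, 7, 7, 7, 8, 8, 9)   # hypothetical example values
size_counts <- table(shoe_sizes)
size_counts

# Continuous data: almost every value is unique, so every count is 1
set.seed(1)                                    # fixed seed so the run is repeatable
afence <- runif(141)
max(table(afence))                             # largest count in the table
```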
Nominal (qualitative, Categorical) These are data described by labels with no natural ordering. They behave in
a similar way to ordinal data - e.g. gender, football jersey number
2.3 Matrices and arrays
Matrices deal with two dimensions and arrays with more than two
A matrix, Neo, can be created with the following function
Neo <- matrix (data = seq (from = 1, to = 9, by = 1), nrow = 3)
Type View (Neo) – or click on any Data item – to display the data
By default a matrix is filled going down the columns (use the help function to check how to fill by row)
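The help page ?matrix documents a byrow argument for this. A minimal sketch (Neo_by_row is a hypothetical name):

```r
# Default: values fill down the columns
Neo <- matrix(data = seq(from = 1, to = 9, by = 1), nrow = 3)

# byrow = TRUE: values fill across the rows instead
Neo_by_row <- matrix(data = seq(from = 1, to = 9, by = 1), nrow = 3, byrow = TRUE)

Neo[1, ]         # first row of the column-filled matrix: 1 4 7
Neo_by_row[1, ]  # first row of the row-filled matrix:    1 2 3
```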
Neo[3,3] <- 51
Neo[,1] *6
[1] 6 12 18
Matrix and vector multiplication is carried out with the operator %*%
Match the dimensions of the matrix and vector correctly; R treats a vector with six entries as a 6×1 vector
Neo squared is
Neo_2 <- Neo %*% Neo
Neo_2[1,3]
[1] 396
Alternative way
(Neo %*% Neo)[1,3]
[1] 396
Three dimensional array where the size of each dimension is given by the second argument
Display Array (by typing ‘Burr’)
Burr
, , 1
, , 2
     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12
, , 3
, , 4
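The call that created Burr is not shown. One definition consistent with the second slice displayed above, assuming the values 1 to 24 and dimensions 2 × 3 × 4 (an assumption, not taken from the notes):

```r
# The second argument, dim, gives the size of each dimension: rows, columns, layers
Burr <- array(data = 1:24, dim = c(2, 3, 4))

Burr[, , 2]    # the second layer, a 2 x 3 matrix holding 7 to 12
Burr[2, 3, 2]  # a single element: row 2, column 3, layer 2
```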
Name the rows & columns of matrices and arrays using functions such as rownames () or dimnames ()
colnames (Neo) <- c ("Keanu", "Charles", "Reeves")
Save the data - Session, Save Workspace As … and then use a suitable name
2.4 Objects
R is what is known as an Object Oriented Programming Language.
This means that any data object, such as piover2, is a member of a particular class
Examples
class (piover2)
[1] "numeric"
class (c (2, 3, 5))
[1] "numeric"
class (first_134 > 99)
[1] "logical"
class (0 + 1i)
[1] "complex"
class (a)
[1] "numeric"
class (my_hand)
[1] "character"
class (Neo)
[1] "matrix"
The default class for any number is numeric. If you would like the class to be integer you can describe it that way in
the first place
a <- as.integer (2)
class (a)
[1] "integer"
first_134 %% 3 == 0
[1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE
[10] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE
[19] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE
[28] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE
Etc.
##Select the values that are TRUE
first_134[first_134 %% 3 == 0]
[1] 3 6 9 12 15 18 21 24 27 30 33 36 39 42
[15] 45 48 51 54 57 60 63 66 69 72 75 78 81 84
[29] 87 90 93 96 99 102 105 108 111 114 117 120 123 126
[43] 129 132
## Use == to test whether one value is exactly equal to another
Data frame
Unlike a matrix, each column can hold a different class of data
Uses [ , ] notation
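A minimal sketch of these points with a made-up data frame (pets and its columns are hypothetical):

```r
# Each column has its own class, unlike a matrix
pets <- data.frame(
  Name = c("Rex", "Tom", "Nemo"),          # text labels
  Legs = c(4L, 4L, 0L),                    # integer
  Type = factor(c("dog", "cat", "fish"))   # factor
)

sapply(pets, class)  # class of every column
pets[2, "Legs"]      # [row, column] indexing, as for a matrix
pets$Type            # $ picks out a single column by name
```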
Delete data
rm (selective_affinities)
Use the detach () function once you have finished using a dataframe or you may end up with column names from different
dataframes which could easily get confused.
class (qanda$Hair)
[1] "factor"
Hair can only take values from a particular set of levels – factors in R are treated as descriptions
Rename a data object (creating a new copy under the new name)
qanda <- selective_affinities
4.1 Programs
A program is a sequence of computer language commands that can be saved and run as a whole
To start a new program use the menu commands below (it is suggested to put ".R" at the end of the name)
File, New File, R Script
write.csv () creates a data file which Excel can read easily (saves data to a spreadsheet)
print () function just displays individual objects – use when there is an inherent structure
summary (qanda$Hair) presents the count for each level of Hair
A summary of Hair is 170 11 46 1 2
calculate the mean, median and standard deviation of Siblings from the sample
1. date ()
2. format ()
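A sketch of how date () and format () might be combined with these statistics, assuming a made-up vector in place of qanda$Siblings (the values are hypothetical):

```r
siblings <- c(0, 1, 1, 2, 2, 2, 3, 5)   # hypothetical stand-in for qanda$Siblings

sib_mean   <- mean(siblings)
sib_median <- median(siblings)
sib_sd     <- sd(siblings)               # sample standard deviation (divisor n - 1)

# date() returns the current date and time as a character string
cat("Report produced on", date(), "\n")
# format() controls how many significant digits are shown
cat("Mean:", format(sib_mean, digits = 3),
    "Median:", format(sib_median, digits = 3),
    "SD:", format(sib_sd, digits = 3), "\n")
```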
Add heading, change axis labels, change the form of data points (circular symbol too large)
plot (table (qanda$Siblings), main = "Frequency of numbers of siblings",
xlab = "number of siblings", ylab = "number of students")
main gives the header and ylab the y axis label
class (qanda$Siblings)
[1] "integer"
class (table (qanda$Siblings))
[1] "table"
the plot () function treats the two classes in different ways - the vector of integers has circles for each datapoint
and the table has vertical lines
Text after the # symbol is ignored when the program runs – use it to explain the program so that others can
understand it
boxplot()
Shows the connection between hair colour and handspan (comparing groups); useful to show the location, shape and spread of
quantitative data
boxplot (qanda$Handspan ~ qanda$Hair, xlab = "Hair", ylab = "Handspan")
The ~ symbol is used when we create a statistical model using a formula.
stripchart() - creates dotplots
dotplots for Handspan and Siblings
par (mfrow = c (1,2))
stripchart (qanda$Handspan, main = "Handspan", xlab = "handspan (cm)", ylab = "",
method = "jitter", pch = 20, col = "red")
stripchart (qanda$Siblings, main = "Siblings", xlab = "number of siblings", ylab = "",
method = "stack", pch = 20, col = "blue", at = 0)
As there were very few possible values for Siblings, a stacked plot seemed to represent the data most clearly.
The Handspan data are continuous so the jitter method which reduces overlap seemed most appropriate
hist () – histogram
R defaults to equally spaced breaks between bars and to a frequency histogram, rather than a histogram of the
density
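To get a density histogram instead, hist () accepts freq = FALSE. A minimal sketch using the built-in rivers dataset rather than the course data:

```r
# Frequency histogram (the default): bar heights are counts
h_freq <- hist(rivers, main = "Frequency", xlab = "length (miles)")

# Density histogram: bar areas sum to 1
h_dens <- hist(rivers, freq = FALSE, main = "Density", xlab = "length (miles)")

sum(h_freq$counts)                         # number of observations
sum(h_dens$density * diff(h_dens$breaks))  # total area under the bars
```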
Scatter plots:
plot (x = qanda$Handspan, y = qanda$Shoes,
main = "Are handspan and numbers of pairs of shoes related?",
xlab = "handspan (cm)", ylab = "pairs of shoes", pch = 3, col = "green")
pch is the plotting symbol used for each data point (see ?par)
6.1 Additional features for plots
Use the file Cocoa_prices.csv and rename it
Cocoa_prices <- read.csv ("Cocoa_prices.csv", header = FALSE,
colClasses = list (V1 = 'factor'))
cocoa <- Cocoa_prices
plot(cocoa)
6.1.1 Lines
More lines: we would like to show lines for 2013 and 2014 and, perhaps, the mean values for each year
Use the lines () function to add a line from the data and abline () to produce a straight line for the mean
lines (cocoa_wide$"2013", col = "blue")
abline (h = mean (cocoa_wide$"2013"), lty = 3, lwd = 0.5, col = "blue")
6.1.2 Legend
Axis problem: the x axis would look better with the names of months
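One way to handle both the legend and the month labels, sketched with the built-in AirPassengers series rather than the cocoa data (the object names y1949 and y1950 are hypothetical):

```r
# Two years of monthly values from a built-in dataset
y1949 <- as.numeric(window(AirPassengers, start = c(1949, 1), end = c(1949, 12)))
y1950 <- as.numeric(window(AirPassengers, start = c(1950, 1), end = c(1950, 12)))

# xaxt = "n" suppresses the default numeric x axis
plot(y1949, type = "l", col = "red", xaxt = "n", xlab = "",
     ylab = "passengers (thousands)", ylim = range(y1949, y1950))
lines(y1950, col = "blue")
abline(h = mean(y1950), lty = 3, col = "blue")   # dotted line at the 1950 mean

# Month abbreviations on the x axis, then a legend identifying the lines
axis(side = 1, at = 1:12, labels = month.abb)
legend("topleft", legend = c("1949", "1950", "1950 mean"),
       col = c("red", "blue", "blue"), lty = c(1, 1, 3))
```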
Useful Function:
7.2 Discrete distributions
Bernoulli distribution
Binomial Distribution
dbinom () The mass function (the 'd' is for density, the continuous equivalent) for the Binomial distribution. As
well as specifying the value of X that we are interested in, we need to set the values of the parameters.
Example: the distribution of the number of times I wear an ironed shirt in 2019 (call that X) is Binomial with
parameters n=365 and p=0.2.
The probability that I will wear an ironed shirt exactly 16 times in 2019.
X∼Bin(365,0.2)
P(X=16)
dbinom(x, size, prob, log = FALSE)
dbinom (x = 16, size = 365, prob = 0.2)
[1] 3.358186e-18
The ?dbinom help page lists a fourth possible argument, log = FALSE. The '=' tells you that there is a default
value, namely FALSE, and you can ignore the argument if you are happy with that value.
Calculate the probability that I wear an ironed shirt for E[X] days in 2019
X∼Bin(365,0.2), E[X]=365×0.2=73
dbinom (x = 73, size = 365, prob = 0.2)
[1] 0.05214145
Calculate the probability that I wear an ironed shirt 16 days or fewer in 2019
P(X≤16)
sum (dbinom (x = 0:16, size = 365, prob = 0.2))
[1] 4.096763e-18
Binomial distribution has a first value of 0 not 1
pbinom () The cdf (The ‘p’ stands for probability, however, this is not the probability density function) for the
Binomial distribution.
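The cdf gives the same cumulative probability as summing the mass function. A minimal check using the ironed-shirt parameters from the example above:

```r
# P(X <= 16) for X ~ Bin(365, 0.2), computed two ways
by_sum <- sum(dbinom(x = 0:16, size = 365, prob = 0.2))
by_cdf <- pbinom(q = 16, size = 365, prob = 0.2)

all.equal(by_sum, by_cdf)   # TRUE up to floating-point rounding

# lower.tail = FALSE gives the complementary probability P(X > 16)
upper <- pbinom(q = 16, size = 365, prob = 0.2, lower.tail = FALSE)
```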
The sums of the left and right hand 17 entries will only be the same if the Binomial distribution here is
symmetrical. That is only the case if p=0.5
the sum of the 17 right hand tail entries
pbinom (q = 348, size = 365, prob = 0.2, lower.tail = FALSE)
[1] 1.109471e-218
Calculate the probability that the number of days that I will wear an ironed shirt in 2019 begins with the number
‘2’.
starts_2 <- c (2, 20:29, 200:299)
sum (dbinom (x = starts_2, size = 365, prob = 0.2))
[1] 1.625303e-10
sum (dbinom (x = starts_2, size = 365, prob = 0.2, log = TRUE))
[1] -21150.55
qbinom () The quantile (percentile) function (or inverse cdf) for the Binomial distribution.
e.g. to work out the smallest value for q such that Pr(X≤q)≥0.8 (ie the smallest value of the random variable, X,
below which 80% of the distribution lies)
qbinom (p = 0.8, size = 365, prob = 0.2)
[1] 79
The interpretation is: in a year, what is the minimum number of days so that the probability of my wearing an
ironed shirt on that number or fewer is at least 80%
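This definition can be checked directly with pbinom (), using the same parameters:

```r
q80 <- qbinom(p = 0.8, size = 365, prob = 0.2)

# The cdf at q80 reaches 80%, but one value lower it does not
at_q80    <- pbinom(q = q80,     size = 365, prob = 0.2)
below_q80 <- pbinom(q = q80 - 1, size = 365, prob = 0.2)
c(at_q80 >= 0.8, below_q80 < 0.8)
```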
The skew is positive as the right hand tail of the distribution stretches out further than the left (think about the
tail probabilities we calculated in Exercise 7.2), compared with the median.
Calculate the vector y to represent the probabilities of each value of x. Plot x against y as a bar chart with
appropriate headers and labels to show the distribution of X graphically.
x <- 0:365
y <- dbinom (x = x, size = 365, prob = 0.2)
barplot (y, main = "Bin (365, 0.2) distribution",
xlab = "mode of number of days ironed shirt worn in a year",
ylab = "probability", cex.main = 2)
x_max <- order (y)[366] - 1
axis(1, at = x_max + 1, labels = x_max, tick = FALSE)
Geometric Distribution
Exercise 7.4 Let Y∼geometric(0.2) be the random variable describing the number of days
from the beginning of 2019 until I wear an ironed shirt.
Calculate P(Y=7)
dgeom (x = 6, prob = 0.2)
[1] 0.0524288
use x = 6 as this represents the number of failures, the 7th attempt being a success
What is E[Y]?
For a geometric(p) distribution the mean is 1/p or, in this case, 5
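Both results can be checked from first principles. In this sketch Y counts days up to and including the first success, while R's dgeom () counts the failures before it, hence the shift by one:

```r
p <- 0.2

# P(Y = 7): six failures followed by one success
by_hand <- (1 - p)^6 * p
by_r    <- dgeom(x = 6, prob = p)

# E[Y] as a truncated sum of day * probability; the tail beyond 1000 is negligible
failures <- 0:1000
mean_Y   <- sum((failures + 1) * dgeom(x = failures, prob = p))
c(by_hand, by_r, mean_Y)
```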
Poisson Distribution
x <- 0:20
y <- dpois (x = x, lambda = 5)
y1 <- dbinom (x = x, size = 10, prob = 0.5)
y2 <- dbinom (x = x, size = 20, prob = 0.25)
y3 <- dbinom (x = x, size = 40, prob = 0.125)
max_y <- max (y, y1, y2, y3)
plot (x, y, main = "Binomial tends to Poisson as n increases for constant np",
type = "b", xlab = "", ylab = "probability", ylim = c (0, max_y))
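The three Binomial vectors computed above are not used by the plot () call shown. A possible continuation, assuming the intent was to overlay them on the Poisson curve (the colours and legend text are assumptions):

```r
x  <- 0:20
y  <- dpois(x = x, lambda = 5)
y1 <- dbinom(x = x, size = 10, prob = 0.5)
y2 <- dbinom(x = x, size = 20, prob = 0.25)
y3 <- dbinom(x = x, size = 40, prob = 0.125)
max_y <- max(y, y1, y2, y3)

plot(x, y, type = "b", xlab = "", ylab = "probability", ylim = c(0, max_y),
     main = "Binomial tends to Poisson as n increases for constant np")
# Overlay each Binomial mass function; np = 5 in every case
lines(x, y1, col = "red")
lines(x, y2, col = "orange")
lines(x, y3, col = "blue")
legend("topright",
       legend = c("Poisson(5)", "Bin(10, 0.5)", "Bin(20, 0.25)", "Bin(40, 0.125)"),
       col = c("black", "red", "orange", "blue"), lty = 1)

# Largest gap between each Binomial and the Poisson probabilities
gaps <- c(max(abs(y1 - y)), max(abs(y2 - y)), max(abs(y3 - y)))
gaps
```

The gaps shrink as n grows, which is the convergence the plot title refers to.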
When we are using a quantile function, the first argument must be a probability. However, if we accidentally
enter an invalid probability, the following happens
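The behaviour can be reproduced as in this sketch, which assumes the exponential quantile function qexp () with rate = 1, since the original call is not shown:

```r
# A probability must lie in [0, 1]; 1.5 is invalid
bad <- suppressWarnings(qexp(p = 1.5, rate = 1))
bad          # NaN, "not a number"
is.nan(bad)
```

Without suppressWarnings (), R also issues a warning that NaNs were produced.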
calculate the area under the whole of the density plot for Y
pexp (q = Inf, rate = 1)
[1] 1
Normal Distribution
Let X∼N(2,4), i.e. mean 2 and variance 4, so sd = 2
P(X<4)
pnorm(q = 4, mean = 2, sd = 2)
[1] 0.8413447
NA (not available)
NA≠NaN: any NaN data is clearly NA, but the reverse is not true, eg missing data is not necessarily not a
number, it is just missing
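The asymmetry can be seen with is.na () and is.nan (); a minimal sketch with a made-up vector:

```r
na_is_nan <- is.nan(NA)    # FALSE: a missing value is not necessarily "not a number"
nan_is_na <- is.na(NaN)    # TRUE:  NaN counts as not available

values <- c(1, NA, NaN)    # hypothetical example values
is.na(values)              # FALSE TRUE TRUE
clean_mean <- mean(values, na.rm = TRUE)   # na.rm = TRUE drops both NA and NaN
```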
Also, 2 standard deviations away from the mean in either direction takes you well past the quartiles. Overall,
normality is certainly worth further investigation
The fit is not great. It looks as if the data are negatively skewed
The axes have different numbering schemes. The x (theoretical) axis shows the number of standard deviations
away from the mean while the y (sample) axis shows the actual data values.
The straight line passes through the theoretical qL and qU
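The plot being discussed is not reproduced in these notes. A sketch of how such a plot is drawn, using the built-in precip dataset as a stand-in for the course data:

```r
# Normal QQ plot: theoretical quantiles on x, sample values on y
qq <- qqnorm(precip, main = "Normal QQ plot of precip")

# By default qqline() draws the line through the first and third quartiles
qqline(precip, col = "red")

range(qq$x)   # x axis: standard deviations from the mean
range(qq$y)   # y axis: the data values themselves
```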
8 Simulation
a simulation envelope: you create a number (often 19) of samples, of the same size as the dataset, from the
theoretical distribution and compare the QQ plot for the dataset with that for the highest and lowest of them
Randomness: random numbers from a non quantum computer are not random - there must be an algorithm that
generates them - they are actually pseudorandom
We can compare a histogram of the random data with the shape of the corresponding Normal distribution as
follows
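The code itself is missing here. A minimal sketch, assuming a sample of 1000 standard Normal values (the sample size and seed are arbitrary choices):

```r
set.seed(42)                       # fixed seed so the pseudorandom sample is repeatable
z <- rnorm(n = 1000, mean = 0, sd = 1)

# Density histogram so the bars are on the same scale as the Normal curve
hist(z, freq = FALSE, main = "Simulated N(0, 1) data", xlab = "value")
# Overlay the theoretical density
curve(dnorm(x, mean = 0, sd = 1), add = TRUE, col = "red")
```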
What % of the area under a Normal curve is represented by ±3 standard deviations from the mean?
pnorm(q = 3, mean = 0, sd = 1) - pnorm(q = -3, mean = 0, sd = 1)
[1] 0.9973002
Exercise 8.2
Consider a discrete distribution which can take values {1,2,3,4}, each
with probability 1/4. If we take a random sample of 4 items from that
distribution, the probability that we will have exactly one of each value
is only 3/32. The probability that we will nearly have one of each
value, ie 3 different values with one repeated, is 9/16
4/4 × 3/4 × 2/4 × 1/4 = 3/32
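Both probabilities can be confirmed by listing every possible sample. A sketch by exhaustive enumeration (the object names are hypothetical):

```r
# All 4^4 = 256 equally likely samples of size 4 from {1, 2, 3, 4}
samples <- expand.grid(a = 1:4, b = 1:4, c = 1:4, d = 1:4)

# Number of different values in each sample
n_distinct <- apply(samples, 1, function(s) length(unique(s)))

p_all_four <- mean(n_distinct == 4)   # one of each value
p_three    <- mean(n_distinct == 3)   # three values, one repeated
c(p_all_four, p_three)
```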
Exercise 8.3
Exercise 8.4
Useful Function
9 Multiple Plots
Use Sheffield_max_temp_data.csv
Boxplots
Sheff_temp <- read.csv ("Sheffield_max_temp_data.csv", header = FALSE)
colnames (Sheff_temp) <- c ("Year", "Month", "Max_temp")
boxplot (Sheff_temp$Max_temp ~ Sheff_temp$Month, xlab = "", ylab = "Celsius",
main = "Average maximum monthly temperature in Sheffield, 1883 - 2014")
The median temperatures are at their lowest in January, rise to a peak in July and then decline. However, as
February and August are quite similar to January and July respectively, the decline in temperature during the
second half of the year happens more quickly than the rise during the first half. For most pairs of consecutive
months, there is little overlap between their central 50% boxes. This is not the case for the three winter months
where the boxes are similar. Looking at the whiskers and outliers we have the same clear distinction between
months apart from during winter, except that, for example, the whiskers range for any month will always
overlap with that for a month two away in either direction. Although January has the lowest median the coldest
three months over the time period considered are all in February. Overall, there is a clear temperature gradient
throughout the year, apart from winter, but the wide variation within months means you could need a hot water
bottle at any point
Change the x axis labels to month abbreviations
axis (side = 1, at = 1:12, labels = month.abb)
Display two plots side by side: the first your output from Exercise 9.1 and the second a time series plot of the
temperatures from January of each year with appropriate x axis numbers.
par (mfrow = c (1, 2))
boxplot (Sheff_temp$Max_temp ~ Sheff_temp$Month, xlab = "", ylab = "Celsius",
main = "Monthly, Sheffield 1883 - 2014", xaxt = "n")
axis (1, at = 1:12, labels = month.abb)
plot (Sheff_temp[Sheff_temp$Month == 1,3], type = "l", xlab = "", ylab = "Celsius",
main = "Januarys, Sheffield 1883 - 2014", xaxt = "n")
axis (1, at = seq (1, 121, 20), labels = seq (1883, 2003, 20)) # position 1 is January 1883
Text justification For text at a particular location, justification determines whether the text is centred
there (the default) or starts or ends there. Use the adj argument
Colour Pretty much any element of a plot can be coloured and there is a separate section in ?par which
discusses this. Some of the arguments are bg (background colour), col.axis and fg
Font As well as colour, it is possible to change the size (eg cex.main) and type, such as italics
(eg font.lab), of any text in a plot
Lines Again, there is a section in ?par explaining how you can customise lines. Particular arguments
include width (lwd), ends (lend) and type, such as dotted (eg lty)
Margins These are the blank areas around a plot and it may be worth spending some time adjusting
these, particularly if you have multiple plots side by side. Arguments beginning ma and om can be used
Axes These arguments typically begin x or y and determine the endpoints of the axes together with the
locations of tick marks etc