STAT 04 Simplify Notes


STAT 04 Notes

R is a programming language; RStudio is an integrated development environment (IDE) for R. The numbers below refer to the panes of the RStudio window.

• 2 The Console: this is where R functions are typed (or otherwise input) and where most of the output appears
• 3 The 'Environment' tab displays the data objects created by the program
• 4 There are a number of tabs, but those you are most likely to use are Files, Plots and Help

1. Simple Functions
To make minor changes to earlier commands, use the ↑ key in 2 to cycle through what you have entered previously
Arithmetic operators in order of precedence: ( ), ^, * and /, + and - (* and / have equal precedence and are evaluated left to right, as are + and -)

Absolute value – abs ()
Factorial (!) – factorial ()
e.g. |7!/cos(π)|
abs(factorial(7)/cos(pi))
[1] 5040

Exponent notation
e.g. 3.142e-4
0.0003142

Assign a value to a name with <- (never leave a space between < and -; they should be treated as a single entity, as a < - 2 would instead compare a with -2)
e.g. a <- 2

Find the remainder (use %%)
666 / 17
[1] 39.17647
666 - 39 * 17
[1] 3
666%%17
[1] 3
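
The whole-number part of a division can also be obtained directly with R's integer-division operator %/%. A minimal sketch using the same numbers as above (the object names are just for illustration):

```r
# Integer division and remainder for 666 divided by 17
quotient  <- 666 %/% 17   # whole-number part of the division
remainder <- 666 %% 17    # what is left over

# The two pieces rebuild the original number
rebuilt <- quotient * 17 + remainder
rebuilt == 666
```

Together, %/% and %% replace the two-step calculation 666 - 39 * 17.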

Useful Packages

• base Core arithmetical, logical, input, output and programming functions. The second item under 'A' is abs, the function for calculating absolute values
• stats Fundamental statistical functions, including all the well-known distributions and tests
• graphics R has very powerful graphical facilities, which we will explore later in the course

Note that, confusingly, the absolute value of a number (e.g. |−6| = 6) is sometimes also called the modulus. The R function for it is abs (); in these notes 'modulus' means the remainder when one number is divided by another.
'Help' tab

• Usage lists the pieces of data (arguments) required for the function to work. Many of the arguments have a default setting, shown after the = sign
• Value explains what is output when you use the function
• See Also lists functions similar to the one you are considering; examining these is a good way of exploring R
• Examples, near the bottom, are always helpful and often use the built-in datasets
Mentions of S, S3 and S4 refer to the letter after R as discussed in Section 1.1.

2.1 Vectors
Vectors - a sequence of pieces of the same type of data, e.g. (2, 4, 8, 16) or ("John", "Paul", "George", "Ringo")
You cannot mix types in a vector, e.g. (2, "John"); R would convert the 2 to the character string "2"
c () – combine
Examples (type these into R)
fib_seq_7 <- c (1, 1, 2, 3, 5, 8, 13)
first_134 <- seq (from = 1, to = 134, by = 1)
first_135_not_1 <- first_134 + 1
Display all the values by simply typing the name of the object
fib_seq_7
[1] 1 1 2 3 5 8 13

Result Display:

Writing first_135_not_1 from scratch:


first_135_not_1 <- seq (from = 2, to = 135, by = 1)

Combine two vectors of the same length to create a third

e.g. create a vector of the first 134 positive square integers
first_134_sq <- first_134 * first_134

The same result is achieved by applying a function to a single vector

first_134_sq <- first_134 ^ 2
Note that * is not vector (matrix) multiplication: each element in the first vector is multiplied by the equivalent entry in the second

Pick out elements of a vector

Use square brackets, [ ], after the vector's name
fib_seq_7[6]
[1] 8
fib_seq_7[4:5]
[1] 3 5
fib_seq_7[-6]
[1] 1 1 2 3 5 13
fib_seq_7[-(3:6)]
[1] 1 1 13
The number in the square brackets refers to the position in the vector, not the value
[-6] removes the element in position 6; [-(3:6)] removes the elements in positions 3 to 6

More Complex example


first_133 <- first_134[-134]
three_digit_integers_up_to_134 <- first_134[first_134 > 99]
first_20_sq <- first_134_sq[first_134 <= 20]

1. first_133 - takes away the 134th entry

2. three_digit_integers_up_to_134 - the code in [ ] compares the values in first_134 with 99 and produces a vector of TRUE / FALSE entries. The whole statement then picks out those elements of the original vector for which the entry is TRUE.

3. first_20_sq - the vector within [ ] is different from the one in which we are interested. They have the same length, so the TRUE / FALSE vector created can be used to pick elements from the vector of interest
first_134 <= 20
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[10] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[19] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Rest are all FALSE

[ ] can use TRUE / FALSE as well as the numerical position in the vector
first_20_sq
[1] 1 4 9 16 25 36 49 64 81 100 121 144 169 196 225 256 289
[18] 324 361 400
The number in [ ] at the start of each output row is the position in the vector of the left-hand element of that row

Use [ ] to update existing vectors or create new ones


fib_seq_8 <- c (fib_seq_7, fib_seq_7[6] + fib_seq_7[7])
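
The three ways of using [ ] described in this section (positions, negative positions and TRUE / FALSE vectors) can be put side by side. This is only a sketch; apart from fib_seq_7 and fib_seq_8, the object names are made up for illustration.

```r
fib_seq_7 <- c(1, 1, 2, 3, 5, 8, 13)

by_position <- fib_seq_7[c(1, 7)]              # elements in positions 1 and 7
without_6th <- fib_seq_7[-6]                   # everything except position 6
odd_values  <- fib_seq_7[fib_seq_7 %% 2 == 1]  # TRUE / FALSE selection of odd values

# Extend the sequence by adding the last two entries together
fib_seq_8 <- c(fib_seq_7, fib_seq_7[6] + fib_seq_7[7])
```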

2.2 Types of Data


Quantitative data can be counted, measured, and expressed using numbers. 
Qualitative data is descriptive and conceptual, can be categorized based on traits and characteristics.

Continuous (quantitative) These data can take any value within some interval on the real line.
afence <- runif (141)
runif () generates random deviates from a Uniform distribution, by default selecting randomly from the interval [0,1]
Creating summary statistics
summary (afence)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.01067 0.29054 0.48652 0.51276 0.76836 0.99110
Discrete (quantitative) These are numerical data that can only take certain values, often integers, e.g. the built-in 'rivers' dataset

For integer data, the minimum and maximum are the only two of the six summary statistics that are always integers

Ordinal (qualitative) These are data with a natural ordering but which are described by labels
Poker cards - the ordering of the suits is spades (♠), hearts (♡), diamonds (♢), clubs (♣), and of the ranks A(ce), K(ing), Q(ueen), J(ack), 10, 9, 8, 7, 6, 5, 4, 3, 2

Use the table () function, which creates a frequency or summary table, rather than the summary () function, which does not provide useful information for qualitative data
my_hand <- c ("spades", "hearts", "hearts", "spades", "diamonds", "diamonds",
"hearts", "clubs", "spades", "spades","diamonds", "hearts", "clubs")
table (my_hand)
my_hand
clubs diamonds hearts spades
2 3 4 4

table () is useful for discrete data with repeated values, and also for categorical and ordinal data
It is not useful for continuous data: each entry is likely to be different, so the table would just show each value with a count of 1 and would be worthless
Nominal (qualitative, categorical) These are data described by labels with no natural ordering. They behave in a similar way to ordinal data, e.g. gender, football jersey number
2.3 Matrices and arrays
Matrices deal with two dimensions and arrays with more than two
A matrix, Neo, can be created with the following function
Neo <- matrix (data = seq (from = 1, to = 9, by = 1), nrow = 3)

Type View (Neo), or click on any Data item in the Environment tab, to display the data

By default a matrix is filled by going down the columns (use ?matrix to check how to fill by row)
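
The help page for matrix () describes a byrow argument that switches the fill order. A small sketch comparing the two (the object names by_column and by_row are hypothetical):

```r
# Default: values run down the columns
by_column <- matrix(data = 1:9, nrow = 3)

# byrow = TRUE: values run along the rows instead
by_row <- matrix(data = 1:9, nrow = 3, byrow = TRUE)

by_column[1, ]   # first row when filled by column
by_row[1, ]      # first row when filled by row
```

Filling by row gives the transpose of the default matrix when the matrix is square, as here.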

Entries of a matrix can be extracted or updated using two indices, [row, column]

If there is no entry for one of the indices then the whole of that dimension is selected.
Neo[2, 3]
[1] 8
2 is row 2, 3 is column 3

Iamamatrix <- matrix (data = seq (from = 3, to = 26), nrow = 4)

Code to extract the vector (20, 24)

Iamamatrix[2, 5:6] or Iamamatrix[2, c (5, 6)]

tom <- Neo[, 3]

Alternative
tom <- Neo[1:3, 3]

Neo[3,3] <- 51

Neo[,1] *6
[1] 6 12 18

Matrix and vector multiplication is carried out with the operator %*%
Match the dimensions of the matrix and vector correctly; R treats a vector with six entries as a 6×1 matrix (or as 1×6 where that is what makes the product conformable)
Neo² is
Neo_2 <- Neo %*% Neo

Neo_2[1,3]
[1] 396
Alternative way
(Neo %*% Neo)[1,3]
[1] 396

t () and solve () give the transpose and inverse of a matrix respectively

Why is the inverse function named solve ()?
One of the most common reasons for wanting to invert a matrix is to solve a set of simultaneous equations.
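
As an illustration of that naming, here is a sketch that solves a made-up pair of equations, 2x + y = 3 and x + 3y = 5. The matrix A, the vector b and the other object names are assumptions for this example only.

```r
# Coefficient matrix (filled by column) and right-hand side
A <- matrix(data = c(2, 1, 1, 3), nrow = 2)
b <- c(3, 5)

# solve (A, b) solves A %*% xy = b directly
xy <- solve(A, b)

# The same answer via the inverse: solve (A) is the inverse of A
via_inverse <- solve(A) %*% b
```

Calling solve () with two arguments avoids forming the inverse explicitly, which is usually the preferred route when only the solution is needed.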

array () - for data in more than two dimensions

Burr <- array (data = seq (from = 1, to = 24, by = 1), dim = c (2, 3, 4))

This creates a three-dimensional array where the size of each dimension is given by the second argument, dim
Display the array by typing 'Burr'
Burr
, , 1

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

, , 2

     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12

, , 3

     [,1] [,2] [,3]
[1,]   13   15   17
[2,]   14   16   18

, , 4

     [,1] [,2] [,3]
[1,]   19   21   23
[2,]   20   22   24

Extract or update data using three indices, [ , , ]

e.g. extract the entry containing the number 16
Burr[2, 2, 3]
[1] 16

Extract the 2×2 matrix with first row (15, 17) and second row (21, 23)


t (Burr[1, 2:3, 3:4])
[,1] [,2]
[1,] 15 17
[2,] 21 23

Name the rows and columns of matrices and arrays using functions such as rownames (), colnames () or dimnames ()
colnames (Neo) <- c ("Keanu", "Charles", "Reeves")

Save the data - Session, Save Workspace As… and then use a suitable name
2.4 Objects
R is what is known as an object-oriented programming language.
This means that any data object, such as piover2, is a member of a particular class
Examples
class (piover2)
[1] "numeric"
class (c (2, 3, 5))
[1] "numeric"
class (first_134 > 99)
[1] "logical"
class (0 + 1i)
[1] "complex"
class (a)
[1] "numeric"
class (my_hand)
[1] "character"
class (Neo)
[1] "matrix"
(from R 4.0.0 onwards the output is "matrix" "array")

The default class for any number is numeric. If you would like the class to be integer you can describe it that way in the first place
a <- as.integer (2)
class (a)
[1] "integer"

Multiplying an integer object by 2 gives a numeric object by default; convert the result to make it integer

a <- as.integer (2)
class (2 * a)
[1] "numeric"

a <- as.integer (2)
class (as.integer (2 * a))
[1] "integer"
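
An alternative to as.integer () is R's L suffix, which writes a number as an integer in the first place. A brief sketch (the names doubled_numeric and doubled_integer are made up):

```r
a <- 2L                      # the L suffix makes this an integer
class(a)

doubled_numeric <- 2 * a     # 2 is numeric, so the result is numeric
doubled_integer <- 2L * a    # integer times integer stays integer

class(doubled_numeric)
class(doubled_integer)
```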

Other useful functions:

1. length () - outputs the number of elements in a vector
2. & and | - symbols representing AND and OR: used when combining statements that result in TRUE / FALSE outputs
3. ncol () - outputs the number of columns in a matrix or array

Combinations (e.g. 6C2)

e.g. Number of ways of picking 5 objects from a set of 14
choose (14, 5)
[1] 2002

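Subsets and remainders

Use first_134 and [ ] to produce a vector of those integers less than 135 that are exactly divisible by 3
A first attempt that does not work: the remainders 1 and 2 are treated as positions in the vector and the zeros are dropped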
first_134[first_134 %% 3]
[1] 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1
[30] 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2
[59] 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1
[88] 2 1 2
To find the values that are divisible by 3, test whether the remainder is equal to 0

first_134 %% 3 == 0
[1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE
[10] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE
[19] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE
[28] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE
Etc.
##Select the values that are TRUE

first_134[first_134 %% 3 == 0]
[1] 3 6 9 12 15 18 21 24 27 30 33 36 39 42
[15] 45 48 51 54 57 60 63 66 69 72 75 78 81 84
[29] 87 90 93 96 99 102 105 108 111 114 117 120 123 126
[43] 129 132
##Use == if you want something to exactly equal to something

If dick <- diag (Neo), what is Neo[Neo[2,] > dick[2], 2]?

dick
[1] 1 5 51
dick[2]
[1] 5
Neo[2,]
[1] 2 5 8
Only the third entry of Neo[2,] is greater than 5, so the expression picks out row 3 of column 2
Neo[3,2]
[1] 6
Neo[Neo[2,] > dick[2], 2]
[1] 6

3.1 Summary Statistics


eggs <- c (5, 11, 9, 7, 6, 11, 7, 8, 3, 14, 6, 9, 9, 7, 9)
General summary statistics
mean, median, mode, standard deviation, qL, qU, range, skewness
If the data change (e.g. the number of eggs on day 9), the statistics that would definitely change are the standard deviation, range and skewness (examine the formulae in each case).
Those that would possibly change are the median (if, for instance, 3 and 14 change to 8 and 9), the mode (currently, 3 days each have 7 and 9 eggs), qL (day 3.75 when the days are put in order of number of eggs) and qU (day 11.25). The mean is 8 eggs per day and will stay that way.
The mode, standard deviation and skewness cannot be deduced from the output of the summary () function
Use table () to find the mode of eggs
Pick the column(s) of table (eggs) with the highest count
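
Picking the highest count out of a table can also be done in code rather than by eye. The sketch below uses a small made-up vector, values, rather than the eggs data, and the names counts and modes are illustrative.

```r
values <- c(2, 3, 3, 5, 5, 7)   # hypothetical data with two modes

counts <- table(values)          # frequency of each distinct value

# Keep the value(s) whose count equals the largest count;
# table names are character strings, so convert back to numbers
modes <- as.numeric(names(counts)[counts == max(counts)])
modes
```

Returning every value with the top count means ties are reported rather than silently dropped.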

Other summary statistics

• Percentiles are useful for large datasets when you might be interested in extreme values (e.g. the top 3% of IQ scores)
• The coefficient of variation is the standard deviation divided by the mean. It gives a sense of how big the standard deviation is relative to the data and simplifies comparisons of spread and variability between datasets
• The MAD is the mean absolute deviation, a measure of spread or dispersion which, like the median, is useful for skewed data or data with outliers
• Kurtosis is a measure of the shape of the peak of a dataset when you plot it. It can be useful for comparing a number of datasets but is best forgotten

3.2 Working Directory


To find out which working directory you are in
Type getwd ()

To change the working directory in RStudio

Session, Set Working Directory, Choose Directory…
Alternative ways
Select Files in 4, navigate to the directory you wish to use and then use the menu options in 4:
More, Set As Working Directory
Or type setwd () with the path of the directory in quotes

Change the default working directory using the menu option

Tools, Global Options, General

3.3 Importing data and dataframes


Data used: selective_affinities.csv
Download the file to the working directory, then type:
selective_affinities <- read.csv ("selective_affinities.csv",
colClasses = c (Hair = 'factor', Birth = 'factor'))
Alternatively, use the import function in the menu bar of 3 and tick 'Strings as factors'
Import Dataset, From Text (base)…

Data frame
Unlike a matrix, a data frame can hold a different class of data in each column
Uses [ , ] notation

Use table () for categorical data (e.g. hair colour)


table (selective_affinities[,1])
Black Blonde Brown Grey Red
170 11 46 1 2

Use summary () for continuous data (e.g. Handspan)


summary (selective_affinities[,3])
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.60 15.53 18.80 17.61 20.20 28.00

Delete data
rm (selective_affinities)

Access the columns in a dataframe

e.g. qanda$Hair
qanda$Hair[1:2]
[1] Black Blonde
Levels: Black Blonde Brown Grey Red
You do not have to type qanda$ every time if you attach the dataframe
attach (qanda)
Hair[1:20]
[1] Black Blonde Black Black Brown Black Black Black Black Black
[11] Black Black Black Black Black Black Red Brown Black Black
Levels: Black Blonde Brown Grey Red
detach (qanda)

Use the detach () function once you have finished using a dataframe, or you may end up with column names from different dataframes which could easily get confused.

class (qanda$Hair)
[1] "factor"
Hair can only take values from a particular set of levels – factors in R are treated as descriptions
Rename a data object (by creating a new copy of it; do this before removing the original with rm ())
qanda <- selective_affinities

Import a text file (e.g. selective_affinities2.txt)


1. use the menu option [Import Dataset, From Text (base)]  
2. read.delim ()
3. read.table ()
The read.table () function defaults to the data having no headers (in a dataset, a header means that the first line of the file holds the names of the columns). Your command will need to include the argument header = TRUE:
even_more_selective_affinities <- read.table ("selective_affinities2.txt", header = TRUE,
colClasses = c (Hair = 'factor', Birth = 'factor'))

In the read.csv () function the default is that headers are present
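
The effect of the header argument can be seen on a tiny file. This sketch writes a made-up two-row file to a temporary location (the file contents and object names are assumptions, not the course data) and reads it both ways.

```r
# Write a small space-separated file with a header line
tmp <- tempfile(fileext = ".txt")
writeLines(c("Hair Height", "Black 170", "Brown 165"), tmp)

# Without header = TRUE the column names are read as a row of data
no_header <- read.table(tmp)

# With header = TRUE the first line supplies the column names
with_header <- read.table(tmp, header = TRUE,
                          colClasses = c(Hair = "factor"))

unlink(tmp)   # tidy up the temporary file

nrow(no_header)     # includes the header line as data
nrow(with_header)   # data rows only
```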

4.1 Programs
A program is a sequence of computer language commands that can be saved and run as a whole
To start a new program, use the menu commands below (you are advised to put ".R" at the end of the file name)
File, New File, R Script

Run or source the program

• to run the whole program, use the button near the top right-hand corner of 1, the menu option Code, Run Region, Run All, or Ctrl+Alt+R
• to run part of the program, highlight the code you want to run and press Ctrl+↩, or cut and paste it to 2
• to run a single line of code, place your cursor anywhere on that line in 1 and press Ctrl+↩
• to source a program file that has been saved, use the menu option Code, Source File… and choose your file, or use the source () function
- source () will run the program but only show the output
- source () with echo = TRUE will show both the code and the output

Find the number of people who participated in the questionnaire


nrow (qanda)
[1] 230

The save (list, file = "") function saves data objects to a file

save (qanda, file = "qanda.Rdata")
The load ("") function retrieves the objects from the file, e.g. load ("qanda.Rdata")
Save a program – Ctrl+S

write.csv () creates a data file which Excel can read easily (saves data to a spreadsheet)
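
A short round trip shows write.csv () in use. The data frame scores is invented for the example, and row.names = FALSE is one reasonable choice to stop R adding an extra column of row numbers to the file.

```r
# A small made-up data frame
scores <- data.frame(name = c("Ann", "Bob"), mark = c(71, 64))

# Write it to a temporary csv file, without the row numbers
tmp <- tempfile(fileext = ".csv")
write.csv(scores, file = tmp, row.names = FALSE)

# Read it back to confirm the contents survived
scores_back <- read.csv(tmp)
unlink(tmp)   # tidy up the temporary file

scores_back
```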

4.2 Program Output


Clear the R console
RStudio menu - Edit, Clear Console, or press Ctrl+L

One way to display output on the screen is the cat () function

no_qu <- nrow (qanda)
cat ("The number of people who filled in the questionnaire is", no_qu, "\n")
The number of people who filled in the questionnaire is 230
The arguments are the text (i.e. "The number…"), the data object (i.e. no_qu) and a command to start a new line, "\n".
Round the decimals with round ()
cat ("The correlation between siblings and Height is", round (cor (qanda$Siblings, qanda$Height), 2))
The correlation between siblings and Height is -0.01

The print () function just displays individual objects – use it when the object has an inherent structure
For example, passing summary (qanda$Hair) to cat () loses the level names and presents
A summary of Hair is 170 11 46 1 2

calculate the mean, median and standard deviation of Siblings from the sample

4.3 Bivariate Data

Univariate data (i.e. one value for each person) - e.g. Hair or Siblings
Multivariate data - e.g. all seven items of data per person
Bivariate data (two values for each person) – measure the relationship with the correlation function cor (), e.g. Handspan and Height
cor (qanda$Handspan, qanda$Height)
[1] 0.2273731

var () in R calculates the sample variance (dividing by n − 1) rather than the population variance (dividing by n)
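
If the population variance is wanted, the sample variance can be rescaled. A minimal sketch with a made-up vector x (all object names are illustrative):

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # hypothetical data, mean 5
n <- length(x)

sample_var     <- var(x)                     # divides by n - 1
population_var <- sample_var * (n - 1) / n   # rescaled to divide by n

# Check against the definition of the population variance
by_definition <- mean((x - mean(x))^2)
```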


Useful function:

1. date ()
2. format ()

Optional Exercise: Data summary


5.1 Basic Plots
Use ?plot and ?par in the help to look up extra arguments
plot (qanda$Siblings)

Add a heading, change the axis labels and change the form of the data points (the circular symbols are too large)
plot (table (qanda$Siblings), main = "Frequency of numbers of siblings",
xlab = "number of siblings", ylab = "number of students")
main gives the heading, xlab the x axis label and ylab the y axis label

class (qanda$Siblings)
[1] "integer"
class (table (qanda$Siblings))
[1] "table"
The plot () function treats the two classes in different ways - the vector of integers has a circle for each data point and the table has vertical lines

R generally ignores spaces and blank lines

Most arguments for improving a plot can be found in ?plot and ?par (graphical parameters)
To improve
• Increase the size of the heading by 50%
• Colour the lines red
• Triple the width of the lines
plot (table (qanda$Siblings), main = "Frequency of numbers of siblings",
xlab = "number of siblings", ylab = "number of students", cex.main = 1.5,
col = "red", lwd = 3)

Example (Exercise 5.2)


A galaxy-class university on the planet Trantor has presented student satisfaction ratings for the last few years, as shown in Figure 5.3
To improve
trantor_data <- c (87, 88, 87, 88, 88, 89, 87)
plot (trantor_data, main = "Trantor University student satisfaction", type = "l",
xlab = "year", ylab = "satisfaction % (from 50%)", cex.main = 2,
ylim = c (50, 100), col = "red", lwd = 2)

type = "l" (the letter l, not the digit 1) makes the plot a line

Use ylim = c () to set the limits of the y axis
The R graphics package does not allow us to show a break in the y axis

5.2 Saving Plots


Export plots
The Export option is in the 'Plots' tab in 4
As an image (technically a png file) – better for display on screen
dev.copy (png, file = "trantor_plot.png", width = 30, height = 18.5, units = "cm", res = 300)
dev.off ()

As a PDF – better for printing

dev.copy (device = pdf, file = "trantor_plot.pdf", width = 11, height = 8)
dev.off ()
Remember to check the unit of measurement: pdf () takes its width and height in inches, and png () uses pixels unless units is given
The dev.off () function tells R to stop writing to the new device (i.e. revert to writing to the screen) and only then is the file created

Text after the # symbol is ignored when the program runs – use it to add explanations of the program for others to understand

Red and green should not be placed next to each other in plots, as red-green is the most common form of colour blindness, particularly among males

5.3 Other Plots


barplot ()
Use for discrete or categorical data: all the data except for Handspan and Height (which are continuous) could be represented with a bar chart.

boxplot ()
Shows the connection between hair colour and handspan (compares datasets); useful for showing the location, shape and spread of quantitative data
boxplot (qanda$Handspan ~ qanda$Hair, xlab = "Hair", ylab = "Handspan")
The ~ symbol is used when we create a statistical model using a formula.
stripchart () - creates dotplots
Dotplots for Handspan and Siblings
par (mfrow = c (1,2))
stripchart (qanda$Handspan, main = "Handspan", xlab = "handspan (cm)", ylab = "",
method = "jitter", pch = 20, col = "red")
stripchart (qanda$Siblings, main = "Siblings", xlab = "number of siblings", ylab = "",
method = "stack", pch = 20, col = "blue", at = 0)
As there were very few possible values for Siblings, a stacked plot seemed to represent the data most clearly.
The Handspan data are continuous so the jitter method which reduces overlap seemed most appropriate

hist () – histogram
R defaults to equally spaced breaks between bars and to a frequency histogram, rather than a histogram of the density

Histograms for Handspan:


par (mfrow = c (2,2))
# Histogram 1 - the titles and y axis labels describe the differences between the plots
hist (qanda$Handspan, freq = TRUE, main = "Handspan - default breaks",
xlab = "handspan", ylab = "frequency")
# Histogram 2 - these comments break up the code
hist (qanda$Handspan, freq = TRUE, main = "Handspan - breaks every 0.5cm",
xlab = "handspan", ylab = "frequency",
breaks = seq (floor (min (qanda$Handspan)), ceiling (max (qanda$Handspan)), 0.5))
# Histogram 3 - the word histogram has Greek components
hist (qanda$Handspan, freq = FALSE, main = "Handspan - integer breaks",
xlab = "handspan", ylab = "relative frequency / cm",
breaks = seq (floor (min (qanda$Handspan)), ceiling (max (qanda$Handspan)), 1))
# Histogram 4 - which mean `animal or plant tissue + written down'
hist (qanda$Handspan, main = "Handspan - first break at 18cm",
xlab = "handspan", ylab = "relative frequency / cm",
breaks = c (floor (min (qanda$Handspan)), seq (18, ceiling (max (qanda$Handspan)), 1)))

stem() – stem and leaf plots

Scatter plots:
plot (x = qanda$Handspan, y = qanda$Shoes,
main = "Are handspan and numbers of pairs of shoes related?",
xlab = "handspan (cm)", ylab = "pairs of shoes", pch = 3, col = "green")
pch is the plotting symbol for the data points (see ?par)
6.1 Additional features for plots
Use the file Cocoa_prices.csv and rename the imported data
Cocoa_prices <- read.csv ("Cocoa_prices.csv", header = FALSE,
colClasses = list (V1 = 'factor'))
cocoa <- Cocoa_prices
plot(cocoa)

Manipulating the characters in data fields (strings):

Split the first column of the dataframe (currently named V1 - have a look at it in RStudio) into month and year
cocoa_imp <- Cocoa_prices
Mths <- substr (cocoa_imp$V1, start = 1, stop = 3)
Yrs <- paste ("20", substr (cocoa_imp$V1, start = 5, stop = 6), sep = "")
The original years we imported were missing the century, so we added it ("20") using the paste () function.
sep = "" is needed because the default for the paste () function is to put a space between objects that are pasted together, which would give, e.g., 20 12.
cocoa <- data.frame (Yrs, Mths, cocoa_imp$V2)

Rename the final column of the new dataframe
colnames (cocoa) <- c (colnames (cocoa)[1:2], "Price")


reorganise the data to give separate columns for each year using  reshape ()
cocoa_wide <- reshape (cocoa, timevar = "Yrs", idvar = "Mths", direction =
"wide")
colnames (cocoa_wide) <- c (colnames (cocoa_wide)[1], "2012", "2013",
"2014")

Move between long and wide dataframes - stack () and unstack ()


Plot the 2012 data as a line with appropriate headings, x and y axis labels and y axis range (I have used 2000 to
3000). 
plot (cocoa_wide[,2], type = "l", ylim = c (2000, 3000),
main = "Cocoa price: 2012 monthly averages",
xlab = "month", ylab = "price")
Reshape cocoa_wide into cocoa_long so that it is in the original format of cocoa

months <- as.character (cocoa_wide$Mths)


years <- colnames (cocoa_wide[2:4])
cocoa_long <- reshape (cocoa_wide, varying = years, v.names = "Price", timevar = "Yrs",
times = years, direction = "long")

6.1.1 Lines
More lines: we would like to show lines for 2013 and 2014 and, perhaps, the mean values for each year
Use the lines () function to add a line from the data and abline () to produce a straight line for the mean
lines (cocoa_wide$"2013", col = "blue")
abline (h = mean (cocoa_wide$"2013"), lty = 3, lwd = 0.5, col = "blue")
6.1.2 Legend

Missing legend: a description of each of the lines would be helpful

legend (x = 10, y = 2300, legend = c ("2012", "2013", "2014"),
col = c ("black", "blue", "red"), lty = rep (1, times = 3), cex = 0.5)
6.1.3 Text

Text: we haven't described the dotted lines

text (x = 4, y = 2700, labels = "The dotted lines are annual means",
font = 3, col = "gold2")
6.1.4 Axes

Axis problem: the x axis would look better with the names of the months

months <- as.character (cocoa_wide$Mths)
axis (side = 1, at = 1:12, labels = months)

Exercise 6.4 – model answer in Moodle

Comment of the graph:


In most years, the price shows an increasing trend through the year until summer in the northern hemisphere and
then a decreasing trend until the year end. However, in 2013 the price continued to rise throughout the year.
Care should be taken in drawing too strong a conclusion as we only have three years data and the typical year (if
there is one) may be 2013 rather than 2012

Useful functions:

• points () for adding points to plots
• symbols () for adding symbols to plots
• segments () as an alternative to lines ()

7.2 Discrete distributions

Bernoulli distribution

Binomial Distribution
dbinom () The mass function for the Binomial distribution (the 'd' is for density, the continuous equivalent). As well as specifying the value of X that we are interested in, we need to set the values of the parameters.

Example: the distribution of the number of times I wear an ironed shirt in 2019 (call that X) is Binomial with parameters n = 365 and p = 0.2.
The probability that I will wear an ironed shirt exactly 16 times in 2019:
X ∼ Bin(365, 0.2)
P(X = 16)
Usage: dbinom (x, size, prob, log = FALSE)
dbinom (x = 16, size = 365, prob = 0.2)
[1] 3.358186e-18
The ?dbinom help page lists a fourth possible argument, log = FALSE. The '=' tells you that there is a default value, namely FALSE, and you can ignore the argument if you are happy with that value.

Calculate the probability that I wear an ironed shirt for E[X] days in 2019
X ∼ Bin(365, 0.2), E[X] = 365 × 0.2 = 73
dbinom (x = 73, size = 365, prob = 0.2)
[1] 0.05214145

Calculate the probability that I wear an ironed shirt on 16 days or fewer in 2019
P(X ≤ 16)
sum (dbinom (x = 0:16, size = 365, prob = 0.2))
[1] 4.096763e-18
Remember that the Binomial distribution starts at 0, not 1

pbinom () The cumulative distribution function (cdf) for the Binomial distribution. The 'p' stands for probability; note that this is not the probability density function.
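
Because the cdf is a running total of the mass function, pbinom () should agree with a sum of dbinom () values. A sketch checking this for P(X ≤ 16) with X ∼ Bin(365, 0.2); the object names are made up.

```r
# P(X <= 16) by adding up the individual probabilities
by_summing <- sum(dbinom(x = 0:16, size = 365, prob = 0.2))

# The same probability straight from the cdf
by_cdf <- pbinom(q = 16, size = 365, prob = 0.2)

# lower.tail = FALSE gives the complementary probability P(X > 16)
upper_tail <- pbinom(q = 16, size = 365, prob = 0.2, lower.tail = FALSE)

by_cdf + upper_tail   # the two tails together cover everything
```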

The sums of the left-hand and right-hand 17 entries will only be the same if the Binomial distribution is symmetrical. That is only the case if p = 0.5
The sum of the 17 right-hand tail entries (349 to 365)
pbinom (q = 348, size = 365, prob = 0.2, lower.tail = FALSE)
[1] 1.109471e-218

Calculate the probability that the number of days on which I will wear an ironed shirt in 2019 begins with the digit '2'.
starts_2 <- c (2, 20:29, 200:299)
sum (dbinom (x = starts_2, size = 365, prob = 0.2))
[1] 1.625303e-10
With log = TRUE, dbinom () returns the logarithms of the probabilities, so their sum is the log of a product and not a probability
sum (dbinom (x = starts_2, size = 365, prob = 0.2, log = TRUE))
[1] -21150.55

qbinom () The quantile (percentile) function (or inverse cdf) for the Binomial distribution.
e.g. to work out the smallest value of q such that P(X ≤ q) ≥ 0.8 (i.e. the smallest value of the random variable, X, at or below which at least 80% of the distribution lies)
qbinom (p = 0.8, size = 365, prob = 0.2)
[1] 79

The interpretation is: in a year, what is the minimum number of days such that the probability of my wearing an ironed shirt on that number of days or fewer is at least 80%?

The skew is positive as the right-hand tail of the distribution stretches out further than the left (think about the tail probabilities we calculated in Exercise 7.2), compared with the median.

Calculate the vector y to represent the probabilities of each value of x. Plot x against y as a bar chart with appropriate headings and labels to show the distribution of X graphically.
x <- 0:365
y <- dbinom (x = x, size = 365, prob = 0.2)
barplot (y, main = "Bin (365, 0.2) distribution",
xlab = "mode of number of days ironed shirt worn in a year",
ylab = "probability", cex.main = 2)
x_max <- order (y)[366] - 1
axis(1, at = x_max + 1, labels = x_max, tick = FALSE)

Calculate the median using qbinom (p = 0.5, size = 365, prob = 0.2), which gives 73

The mode is the highest bar in the plot and is also 73

Geometric Distribution
Exercise 7.4 Let Y ∼ geometric(0.2) be the random variable describing the number of days from the beginning of 2019 until I wear an ironed shirt.
Calculate P(Y = 7)
dgeom (x = 6, prob = 0.2)
[1] 0.0524288
Use x = 6 as this represents the number of failures, the 7th attempt being a success

Do you think the distribution of Y is positively or negatively skewed?

The distribution is positively skewed as the right-hand tail extends infinitely

What is E[Y]?
For a geometric(p) distribution the mean is 1/p or, in this case, 5
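
The mean can be checked numerically, keeping in mind that dgeom () counts failures rather than days. The sketch below truncates the infinite sum at 1000 failures, an arbitrary cut-off chosen because the remaining tail is negligible; the object names are illustrative.

```r
p <- 0.2
failures <- 0:1000                      # number of failures before the first success
probs <- dgeom(x = failures, prob = p)

# Mean number of failures, which is (1 - p) / p
mean_failures <- sum(failures * probs)

# Mean number of days (failures plus the successful day), which is 1 / p
mean_days <- sum((failures + 1) * probs)
```

The difference of one between the two means is the successful day itself, which is why x = 6 corresponds to Y = 7 above.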

Poisson Distribution
x <- 0:20
y <- dpois (x = x, lambda = 5)
y1 <- dbinom (x = x, size = 10, prob = 0.5)
y2 <- dbinom (x = x, size = 20, prob = 0.25)
y3 <- dbinom (x = x, size = 40, prob = 0.125)
max_y <- max (y, y1, y2, y3)

plot (x, y, main = "Binomial tends to Poisson as n increases for constant np",
type = "b", xlab = "", ylab = "probability", ylim = c (0, max_y))
lines (x = x, y = y1, type = "b", col = "red")
lines (x = x, y = y2, type = "b", col = "blue")
lines (x = x, y = y3, type = "b", col = "gold")
legend (x = 10, y = 0.2, legend = c ("Poisson (5)", "Binomial (10, 0.5)",
"Binomial (20, 0.25)", "Binomial (40, 0.125)"),
col = c ("black", "red", "blue", "gold"), lty = rep (1, 4))

7.3 Continuous Distribution


Uniform Distribution
X∼U(−1,1)
The parameters are the minimum and maximum, in this case -1 and 1.
Evaluate the density function at 0.3
dunif (x = 0.3, min = -1, max = 1)
[1] 0.5
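This should not be interpreted as a probability, i.e. it is not P(X = 0.3); it is just the value of a function at a point
P(X ≤ 1)
punif (q = 1, min = -1, max = 1)
[1] 1
P(X < 1) is the same, as for a continuous distribution P(X = 1) = 0; evaluating the cdf just below 1 shows this
punif (q = 0.999999999999, min = -1, max = 1)
[1] 1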

When we are using a quantile function, the first argument must be a probability. If we accidentally enter an invalid probability, the following happens

qunif (p = 2, min = -1, max = 1)
Warning in qunif(p = 2, min = -1, max = 1): NaNs produced
[1] NaN
where NaN stands for Not a Number. There is an infinite quantity in R (Inf), if not in real life, and NaN can also arise when calculations involve it, such as
a <- - 88 / 0
a * 0
[1] NaN
where the answer is not clearly either 0 or Inf.
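
R provides is.infinite () and is.nan () to test for these special values. A short sketch building on the example above (the names undefined_1 and undefined_2 are made up):

```r
a <- -88 / 0          # dividing a negative number by zero gives -Inf
is.infinite(a)

undefined_1 <- a * 0        # infinity times zero has no clear answer
undefined_2 <- Inf - Inf    # nor does infinity minus infinity

is.nan(undefined_1)
is.nan(undefined_2)
```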

The Exponential distribution can take values from zero to infinity

Y ∼ exp(1), evaluating the pdf and cdf at a value of 0.5
dexp(.5)
[1] 0.6065307
pexp(.5)
[1] 0.3934693
The fact that the pdf is greater than the cdf is not a problem as the pdf does not represent a probability.  

calculate the area under the whole of the density plot for Y
pexp (q = Inf, rate = 1)
[1] 1

Normal Distribution
Let X∼N(2,4)
P(X<4)
pnorm(q = 4, mean = 2, sd = 2)
[1] 0.8413447

What are qX1 such that P(X<qX1)=0.95 and qX2 such that P(X<qX2)=0.975?


qnorm(p = 0.95, mean = 2, sd = 2)
[1] 5.289707
qnorm(p = 0.975, mean = 2, sd = 2)
[1] 5.919928
Note that R has μ and σ (not σ²) as the parameters for the Normal distribution.

Consider Y∼N(0,1). Calculate qY1 and qY2 at the same percentage points as for X.


qnorm(p = 0.95, mean = 0, sd = 1)
[1] 1.644854
qnorm(p = 0.975, mean = 0, sd = 1)
[1] 1.959964
The algebraic relationship between qX1 and qY1 is qX1 = μ + σ × qY1 = 2 + 2 × qY1

The same holds for qX2 and qY2
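
The relationship between the quantiles of N(2, 4) and N(0, 1), namely qX = μ + σ × qY, can be confirmed numerically. A minimal sketch; the object names mu, sigma, q_X, q_Y and from_standard are illustrative.

```r
mu <- 2
sigma <- 2                  # standard deviation, i.e. the square root of 4
p <- c(0.95, 0.975)

q_Y <- qnorm(p = p, mean = 0, sd = 1)        # standard Normal quantiles
q_X <- qnorm(p = p, mean = mu, sd = sigma)   # quantiles of X

# Shift and scale the standard Normal quantiles
from_standard <- mu + sigma * q_Y

all.equal(q_X, from_standard)
```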

NA (not available)
NA ≠ NaN: any NaN data is clearly NA, but the reverse is not true, e.g. missing data is not necessarily not a number, it is just missing
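
The one-way relationship between NaN and NA shows up in the functions is.na () and is.nan (). A sketch with a made-up vector v:

```r
v <- c(1, NA, NaN)

missing_flags <- is.na(v)    # TRUE for both NA and NaN
nan_flags     <- is.nan(v)   # TRUE for NaN only

# na.rm = TRUE removes both kinds before calculating
mean_of_rest <- mean(v, na.rm = TRUE)
```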

Refer to the Handspan histograms you created in Exercise 5.8

Create HS2, a copy of Handspan with the final value replaced by NA
HS2 <- c (qanda$Handspan[1:(nrow (qanda) - 1)], NA)

Deciding whether data are close to Normal

Look at the graph
Look at some summary statistics
About 95% of the data should be within 1.96 standard deviations either side of the mean
summary (HS2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   1.60   15.50   18.80   17.61   20.20   28.00       1
sd (HS2, na.rm = TRUE)
[1] 4.216173
sd () will ignore NAs only if told to remove them as we have done here, using na.rm = TRUE.
The mean and median are only moderately close. The first and third quartiles are 3.275cm and 1.4cm from the
median respectively, suggesting negative skew.

Also, 2 standard deviations away from the mean in either direction takes you well past the quartiles. Overall,
normality is certainly worth further investigation
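The 95% rule of thumb above can be checked directly. The sketch below uses a hypothetical helper, prop_within (), and simulated data standing in for the handspans, since the qanda dataset is not reproduced here.

```r
# Hypothetical helper: proportion of non-missing values within k sds of the mean
prop_within <- function(v, k = 1.96) {
  v <- v[!is.na(v)]                       # drop NAs first
  mean(abs(v - mean(v)) <= k * sd(v))
}

set.seed(1)                               # arbitrary seed
demo <- c(rnorm(1000, mean = 18, sd = 4), NA)   # simulated stand-in data
prop_within(demo)                         # should be close to 0.95 for Normal data
```

A value well away from 0.95 for real data would be another hint that the Normal distribution is a poor fit.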

A QQ plot for the Handspan data with an appropriate heading


qqnorm (y = qanda$Handspan, pch = 20)
qqline (qanda$Handspan, col = "red")

The fit is not great. It looks as if the data are negatively skewed 
The axes have different numbering schemes. The x (theoretical) axis shows the number of standard deviations
away from the mean while the y (sample) axis shows the actual data values.
The straight line passes through the points where the theoretical lower and upper quartiles (qL and qU) meet the
corresponding sample quartiles

8 Simulation
Simulation envelope: you create a number (often 19) of samples, of the same size as the dataset, from the
theoretical distribution and compare the QQ plot for the dataset with that for the highest and lowest of them
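One way a simulation envelope might be drawn is sketched below. It assumes that "highest and lowest" means the largest and smallest simulated value at each rank, and it uses simulated data (dat) as a stand-in for a real dataset; neither assumption comes from the course material.

```r
set.seed(42)                              # arbitrary seed, for reproducibility
dat <- rnorm(50, mean = 18, sd = 4)       # hypothetical stand-in for a real dataset
n <- length(dat)

# 19 simulated samples of size n from the fitted Normal, each sorted (one per column)
sims <- replicate(19, sort(rnorm(n, mean = mean(dat), sd = sd(dat))))
lower <- apply(sims, 1, min)              # lowest simulated value at each rank
upper <- apply(sims, 1, max)              # highest simulated value at each rank

theo <- qnorm(ppoints(n))                 # theoretical quantiles, as used by qqnorm ()
plot(theo, sort(dat), pch = 20, ylim = range(dat, lower, upper),
     xlab = "Theoretical quantiles", ylab = "Sample quantiles",
     main = "QQ plot with simulation envelope")
lines(theo, lower, lty = 2)               # lower edge of the envelope
lines(theo, upper, lty = 2)               # upper edge of the envelope
```

Points falling outside the dashed lines suggest the data are less Normal than any of the 19 simulated samples.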

Randomness: random numbers from a non-quantum computer are not truly random. There must be an algorithm that
generates them, so they are actually pseudorandom
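Because the numbers come from an algorithm, the same starting point gives the same sequence. A short sketch using set.seed (); the seed value 123 is arbitrary.

```r
set.seed(123)                 # fix the starting point of the generator
first <- rnorm(3)

set.seed(123)                 # reset to the same starting point
second <- rnorm(3)

identical(first, second)      # TRUE: the "random" numbers repeat exactly
```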

rand_norm <- rnorm (n = 100)


generates 100 random values from the N(0,1) distribution

We can compare a histogram of the random data with the shape of the corresponding Normal distribution as
follows

x <- seq (from = -3, to = 3, length.out = 10000)


y <- dnorm (x = x, mean = 0, sd = 1)
max_y <- max (y, hist (rand_norm, plot = FALSE)$density)
hist (rand_norm, freq = FALSE, xlab = "standard deviations", ylab = "",
xlim = c (-4, 4), ylim = c (0, max_y), main = "Random data v distribution")
lines (x, y, col = "blue")

What % of the area under a Normal curve is represented by ±3 standard deviations from the mean?
pnorm(q = 3, mean = 0, sd = 1) - pnorm(q = -3, mean = 0, sd = 1)
[1] 0.9973002

Exercise 8.2
Consider a discrete distribution which can take values {1,2,3,4}, each
with probability 1/4. If we take a random sample of 4 items from that
distribution, the probability that we will have exactly one of each value
is only 3/32. The probability that we will nearly have one of each
value, ie 3 different values with one repeated, is 9/16
4/4 * 3/4 * 2/4 * 1/4 = 3/32
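These two probabilities can be approximated by simulation with sample (). This is a rough sketch; the seed and the number of simulations are arbitrary choices.

```r
set.seed(2024)                # arbitrary seed
n_sims <- 20000

# Number of distinct values in each sample of 4 drawn with replacement from 1:4
n_distinct <- replicate(n_sims,
                        length(unique(sample(1:4, size = 4, replace = TRUE))))

mean(n_distinct == 4)         # estimate of 3/32 = 0.09375
mean(n_distinct == 3)         # estimate of 9/16 = 0.5625
```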

Exercise 8.3
Exercise 8.4
Useful Functions

 crossprod () for the vector inner (dot) product, ie t(x) %*% y; use tcrossprod () or outer () for the outer product


 sample () for drawing samples from discrete distributions with few values
 sort ()
 order ()
 curve () which simplifies the plotting of distributions
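A quick sketch of some of the functions listed above, applied to a made-up vector x:

```r
x <- c(3, 1, 2)

sort(x)            # 1 2 3
order(x)           # 2 3 1: positions of the smallest, next smallest, ... values
x[order(x)]        # same result as sort(x)

crossprod(x, x)    # 1 x 1 matrix holding t(x) %*% x = 14

# curve () evaluates an expression in x over a range and plots it
curve(dnorm(x), from = -3, to = 3, ylab = "density")
```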

9 Multiple Plots
Use Sheffield_max_temp_data.csv
Boxplots
Sheff_temp <- read.csv ("Sheffield_max_temp_data.csv", header = FALSE)
colnames (Sheff_temp) <- c ("Year", "Month", "Max_temp")
boxplot (Sheff_temp$Max_temp ~ Sheff_temp$Month, xlab = "", ylab = "Celsius",
main = "Average maximum monthly temperature in Sheffield, 1883 - 2014")

The median temperatures are at their lowest in January, rise to a peak in July and then decline. However, as
February and August are quite similar to January and July respectively, the decline in temperature during the
second half of the year happens more quickly than the rise during the first half. For most pairs of consecutive
months, there is little overlap between their central 50% boxes. This is not the case for the three winter months,
where the boxes are similar. Looking at the whiskers and outliers, we have the same clear distinction between
months apart from during winter, except that, for example, the whisker range for any month will always
overlap with that for a month two away in either direction. Although January has the lowest median, the three
coldest individual months over the time period considered were all Februaries. Overall, there is a clear
temperature gradient throughout the year, apart from winter, but the wide variation within months means you
could need a hot water bottle at any point.

Change the x axis labels to month abbreviations
axis (side = 1, at = 1:12, labels = month.abb)

Create Sheff_temp_noNA, which is the same as Sheff_temp but without the NAs, then find the years with the
highest and lowest temperatures


Sheff_temp_noNA <- na.omit (Sheff_temp)
max_year <- Sheff_temp_noNA[Sheff_temp_noNA[,3] == max (Sheff_temp_noNA$Max_temp), 1]
min_year <- Sheff_temp_noNA[Sheff_temp_noNA[,3] == min (Sheff_temp_noNA$Max_temp), 1]
It starts with the data without the NAs and then within the [ ]s it selects the row where the value in column 3
(ie the temperature) is the same as the maximum temperature over the period. It selects, for that row, the value in
column 1, which is the year. The double equals sign is used whenever you want to test whether a variable has a
particular value.

Alternative - use which.max () and which.min ()


Sheff_temp_noNA <- na.omit (Sheff_temp)
max_year <- Sheff_temp_noNA[which.max (Sheff_temp_noNA[,3]),1]
min_year <- Sheff_temp_noNA[which.min (Sheff_temp_noNA[,3]),1]
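The two approaches differ when the maximum is tied. The sketch below uses a small made-up data frame (toy), not the Sheffield data, to show the difference.

```r
# Hypothetical toy data with a tied maximum temperature
toy <- data.frame(Year = c(1990, 1991, 1992),
                  Month = c(7, 7, 7),
                  Max_temp = c(21.5, 23.0, 23.0))

# == returns every row that matches the maximum
toy[toy$Max_temp == max(toy$Max_temp), "Year"]   # 1991 1992

# which.max () returns the position of the first maximum only
toy[which.max(toy$Max_temp), "Year"]             # 1991
```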

Alternative – use identify ()


identify (x = Sheff_temp$Month, y = Sheff_temp$Max_temp,
labels = Sheff_temp$Year, atpen = TRUE)
Here you have identified what the x and y axes represent (ie month and temperature) and then specified what label
you would like to appear where you clicked (ie year). The atpen setting just tells the function to put the label
where you clicked rather than at some predetermined point

Display two plots side by side: the first your output from Exercise 9.1 and the second a time series plot of the
temperatures from January of each year with appropriate x axis numbers.
par (mfrow = c (1, 2))
boxplot (Sheff_temp$Max_temp ~ Sheff_temp$Month, xlab = "", ylab = "Celsius",
main = "Monthly, Sheffield 1883 - 2014", xaxt = "n")
axis (1, at = 1:12, labels = month.abb)
plot (Sheff_temp[Sheff_temp$Month == 1,3], type = "l", xlab = "", ylab = "Celsius",
main = "Januarys, Sheffield 1883 - 2014", xaxt = "n")
# January 1883 is plotted at index 1, so the tick positions start at 1 rather than 0
axis (1, at = seq (1, 121, 20), labels = seq (1883, 2003, 20))

9.1.3 Matrix Plots


The pairs () function creates a matrix of plots, comparing every pair of columns from the original dataframe with
each other so that you can attempt to assess visually whether there is any correlation or other form of
dependence
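A minimal pairs () example, using the built-in iris dataset rather than any dataset from the course:

```r
# iris is built into R; columns 1 to 4 hold numeric measurements
pairs(iris[, 1:4], pch = 20, main = "Iris measurements")

# The correlation matrix gives a numerical counterpart to the plots
cor(iris[, 1:4])
```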
par ()
creates default settings for plots that will be used until they are changed with another par () command. Some
arguments, such as mfrow, can only be set with a par () function while others, such as col or lwd, can also be
changed for particular plots within, say, the plot () command. Settings that can be changed with these par ()
arguments include

 Text justification For text at a particular location, justification determines whether the text is centred
there (the default) or starts or ends there. Use the adj argument
 Colour Pretty much any element of a plot can be coloured and there is a separate section in ?par which
discusses this. Some of the arguments are bg (background colour), col.axis and fg
 Font As well as colour, it is possible to change the size (eg cex.main) and type, such as italics
(eg font.lab), of any text in a plot
 Lines Again, there is a section in ?par explaining how you can customise lines. Particular arguments
include width (lwd), ends (lend) and type, such as dotted (lty)
 Margins These are the blank areas around a plot and it may be worth spending some time adjusting
these, particularly if you have multiple plots side by side. Arguments beginning ma and om can be used
 Axes These arguments typically begin x or y and determine the endpoints of the axes together with the
locations of tick marks etc
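Since par () settings persist, it can help to save the old settings and restore them afterwards. A short sketch, with arbitrary example plots and margin values:

```r
# par () returns the previous values of the settings it changes
old_par <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))

# col.main is set here for this plot only
hist(rnorm(100), main = "Left plot", col.main = "blue")

# lwd and lty are set here for this plot only
plot(1:10, type = "l", lwd = 2, lty = "dotted", main = "Right plot")

par(old_par)       # restore the earlier settings
```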
