
Important R codes and notes

#set the path (working directory) first

#Create R Project

#name scripts

General

Run ?function_name to get help on a function, e.g. ?mean

Basic ggplot

library(ggplot2)
x <- seq(from = -10, to = 10, by = 0.1)
y <- _

#geom can be line, bar, etc


qplot(_, _, geom = "line", ylab = "Y", xlab = "X")

library(readxl)

name <- read_excel("filename.xlsx", sheet = "Sheet 1")

#importing a csv file (read_csv() comes from the readr package,
which is loaded by tidyverse, so load tidyverse first)

compensation <- read_csv("compensation.csv")

library(tidyverse)

#tidyverse is a collection of packages that includes dplyr and
ggplot2, so loading it loads both of them

#dplyr is the data-manipulation package; it can also be loaded on
its own

library(dplyr)

# Looking at the data (base R functions unless noted; the dplyr
verbs follow further down):

#names() tells you the names you assigned each column

#head() returns the first six rows of the dataset

#tail() returns the last six rows

#dim() tells you the numbers of rows and columns, i.e. the
dimensions of the dataset.

#str() returns the structure of the dataset, combining nearly all
of the previous functions into one handy function. It is an
outstanding way to ensure that the data you have imported are
what you intended to import, and to remind yourself about the
features of your data.

#glimpse() function (dplyr): a transposed view of the data, with
one line per column and its first values running across the page

#as_tibble() function (dplyr): converts the data frame to a tibble,
which prints only the first ten rows and the columns that fit on
screen

#Problem: Imported data has more columns and/or rows than it
should, and there are NAs
Answer: This is likely to be caused by Excel, which has saved
some extra columns and/or rows. To see if this is the problem,
open your data file (.csv file) using Notepad (Windows) or
TextEdit (Mac). If you see any extra commas at the end of rows,
or lines of commas at the bottom of your data, this is the
problem. To fix this, open your .csv file in Excel, select only the
data you want, copy this, and paste this into a new Excel file.
Save this new file as before (and delete the old one). It should
not have the extra rows/columns.

#Problem: There's only one column in the imported data, but
definitely more than one in the data file
Answer: This is probably caused by your file not being 'comma
separated', but R expecting it to be. Most often this happens
when Excel decides it wants to save a .csv file with semicolon (;)
separation instead of commas. There are several options. On the
Excel side, try to ensure that Excel is using commas for .csv
files (though this is sometimes easier said than done). On the R
side, try using the Import Dataset facility in RStudio; during the
process, you can see both the raw data and the imported data,
look at the raw data to see what the separator is, and then select
it in the dialogue box.

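
The one-column problem described above can also be handled in code. The following is a minimal sketch, assuming the readr package is installed; the temporary file and its three rows are made up purely for illustration and are not the real compensation data.

```r
library(readr)

# Hypothetical example file: semicolon-separated, written to a temp location
tmp <- tempfile(fileext = ".csv")
writeLines(c("Root;Fruit;Grazing",
             "6.2;59.8;Ungrazed",
             "7.0;60.9;Grazed"), tmp)

# read_csv() expects commas, so each whole line ends up in a single column
wrong <- read_csv(tmp, show_col_types = FALSE)
ncol(wrong)

# Naming the separator explicitly recovers the separate columns
right <- read_delim(tmp, delim = ";", show_col_types = FALSE)
ncol(right)
names(right)
```

Comparing ncol() before and after is a quick way to confirm that the separator was the cause.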
#summary() Provides summary statistics of dataframe

#select() is for selecting columns

#select() can also be used to select all columns except one. For
example, if we wanted to leave out the Root column, leaving only
the Fruit and Grazing columns:
#select(compensation, -Root)

#slice() is for selecting rows

#You can ask for one row, a sequence, or a discontinuous set.
For example, use slice(compensation, 2) to select row 2 only.
Use slice(compensation, 2:10) to select rows 2 to 10.

#The c() function collects the rows that we want and is a very
useful function in R to specify discontinuous lists for many
things. Try the same code, but this time asking for row numbers
1, 5, 10 and 15: slice(compensation, c(1, 5, 10, 15))

#How about select(slice(dataset, c(2,5,6)), column, column)?
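
The select() and slice() calls above can be tried on a small stand-in data frame. This is only a sketch: the six rows below are invented for illustration and merely reuse the column names Root, Fruit and Grazing from the notes.

```r
library(dplyr)

# Made-up stand-in for the compensation data (illustrative values only)
compensation <- data.frame(
  Root    = c(6.2, 7.0, 8.1, 9.3, 5.5, 10.2),
  Fruit   = c(59.8, 60.9, 85.3, 90.1, 18.4, 95.0),
  Grazing = c("Ungrazed", "Ungrazed", "Grazed", "Grazed", "Ungrazed", "Grazed")
)

# Columns: everything except Root
no_root <- select(compensation, -Root)

# Rows: a discontinuous set collected with c()
some_rows <- slice(compensation, c(1, 3, 5))

# Nested: first pick rows 2, 5 and 6, then keep two columns
picked <- select(slice(compensation, c(2, 5, 6)), Root, Fruit)
picked
```

The nested form is read from the inside out: slice() runs first and its result is handed to select().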

#filter() function: This function is useful for subsetting rows.

Operators:

- == equals
- != does not equal
- < less than
- > greater than
- >= equal to or greater than
- <= equal to or less than

One can easily select rows according to multiple conditions. For
example, to keep only rows with Fruit > 80 OR Fruit < 20, we
employ a vertical line:

#filter(compensation, Fruit > 80 | Fruit < 20)

Assign the filtered data to a name with <- if you want to use it
later!
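
A short sketch of filter() with the operators listed above, run on a made-up stand-in for the compensation data (the values are invented for illustration; only the column names come from the notes).

```r
library(dplyr)

# Made-up stand-in for the compensation data (illustrative values only)
compensation <- data.frame(
  Root    = c(6.2, 7.0, 8.1, 9.3, 5.5, 10.2),
  Fruit   = c(59.8, 60.9, 85.3, 90.1, 18.4, 95.0),
  Grazing = c("Ungrazed", "Ungrazed", "Grazed", "Grazed", "Ungrazed", "Grazed")
)

# OR: keep rows where either condition holds
extremes <- filter(compensation, Fruit > 80 | Fruit < 20)

# AND: conditions separated by a comma must all hold
big_grazed <- filter(compensation, Fruit > 80, Grazing == "Grazed")

# Not equal
not_grazed <- filter(compensation, Grazing != "Grazed")

extremes
```

Each result is assigned to its own object here, so the original data frame stays untouched.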

#mutate() function: As with all dplyr functions, mutate() starts
with the dataframe in which the variables reside, and then
designates a new column name and the transformation. We will
make this new column appear in our working dataframe by
employing a neat trick: assigning the values returned by mutate()
to an object of the same name as the original data. We are
essentially overwriting the data frame! e.g.:

compensation <- mutate(compensation, IFruit = log(Fruit))

IFruit is the new column. Note that R is case sensitive, so the
name on the left must be spelled exactly like the original
(compensation, not Compensation) for the overwrite to happen.

#arrange() function: Sometimes it's important or desirable to put
the observations (rows) of our data in a particular order, i.e. to
sort them. For example, we might want to see the compensation
data in order of increasing Fruit production. We can use the
arrange() function:

#similar to sorting with sort() or order() in base R

#arrange(compensation, Fruit)
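
The mutate() and arrange() steps can be sketched together as follows. The data frame is a made-up stand-in for the compensation data, and the column name logFruit is a hypothetical choice for this example.

```r
library(dplyr)

# Made-up stand-in for the compensation data (illustrative values only)
compensation <- data.frame(
  Root    = c(6.2, 7.0, 8.1, 9.3, 5.5, 10.2),
  Fruit   = c(59.8, 60.9, 85.3, 90.1, 18.4, 95.0),
  Grazing = c("Ungrazed", "Ungrazed", "Grazed", "Grazed", "Ungrazed", "Grazed")
)

# Add a transformed column and overwrite the original object
compensation <- mutate(compensation, logFruit = log(Fruit))

# Sort rows by increasing Fruit
increasing <- arrange(compensation, Fruit)

# desc() reverses the order
decreasing <- arrange(compensation, desc(Fruit))

head(increasing)
```

arrange() returns a sorted copy, so the result needs to be assigned if the sorted order is to be kept.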

#You can also filter rows AND select existing columns. Example:
Imagine you want the rows with fruit production > 80, and the
rootstock widths ONLY. That's a filter() and a select() job:
#select(filter(compensation, Fruit > 80), Root)

The pipe command is %>%. You can read this like 'put the answer
of the left-hand command into the function on the right'.
Using the pipe is optional; the nested version above does the
same thing.
e.g.: compensation %>% filter(Fruit > 80) %>% select(Root)
assigned: FruitRoot <- compensation %>% filter(Fruit > 80) %>%
select(Root)

#summarise() with group_by() calculates summary statistics for
each group:
summarise(group_by(compensation, Grazing), meanFruit = mean(Fruit))

If we want more than one summary statistic, we can add them
with commas in between: summarise(group_by(compensation,
Grazing), meanFruit = mean(Fruit), sdFruit = sd(Fruit))
If you want to use the means, you must use the <- symbol and
assign the result to a new object.
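
The piped and grouped forms above can be sketched as follows, again on a made-up stand-in for the compensation data. The object names FruitRoot and fruit_summary are just illustrative choices.

```r
library(dplyr)

# Made-up stand-in for the compensation data (illustrative values only)
compensation <- data.frame(
  Root    = c(6.2, 7.0, 8.1, 9.3, 5.5, 10.2),
  Fruit   = c(59.8, 60.9, 85.3, 90.1, 18.4, 95.0),
  Grazing = c("Ungrazed", "Ungrazed", "Grazed", "Grazed", "Ungrazed", "Grazed")
)

# Pipe: filter rows, then keep one column
FruitRoot <- compensation %>%
  filter(Fruit > 80) %>%
  select(Root)

# Grouped summary: one row per Grazing level, two statistics
fruit_summary <- compensation %>%
  group_by(Grazing) %>%
  summarise(meanFruit = mean(Fruit), sdFruit = sd(Fruit))

fruit_summary
```

Because fruit_summary is an ordinary data frame (tibble), it can be passed to further dplyr verbs or to ggplot().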

VISUALISATION

Using ggplot2

library (ggplot2)

Like the functions in the dplyr package, ggplot() takes the data
frame as its first argument:

ggplot(dataframe, aes(x = xvar, y = yvar)) + geom_()

(replace geom_() with a specific geom, e.g. geom_point())

- Two layers here
- Layer 1 = dataframe and aes
- Layer 2 = geom
- Arguments go inside a layer's brackets, separated by commas; the brackets
are needed even when there is nothing to put in them, e.g. geom_point()
- Layers are joined with a +, which goes at the end of a line, not at the start
of the next one
- Each graph label is one layer, so xlab and ylab are 2 different layers
- Faceting, or latticing, is about producing a matrix of graphs, automatically
structured by a factor/categorical variable in your data.
- this trick works for almost all graphics in ggplot2's toolbox.
- with facet_wrap(), the ~ symbol (called tilde) precedes the grouping
variable

# the first argument to give the function ggplot() is the data frame

# the aes() function defines the graph's aesthetics; in this example, it tells R
to associate the x-position of the data points with the values in the Root
variable, and the y-position of the points with the values in the Fruit variable.
We are setting up and establishing which variables define the axes of the graph.

#The other significant thing to commit to memory about ggplot2 is that it works
by adding layers and components to the aesthetic map. The data and
aesthetics always form the first layer. Then we add the geometric objects,
like points, lines, and other things that report the data. We can also add/adjust
other components, like x- and y-label customization. The trick is to know that
the + symbol is how we add these layers and components.

# ggplot(data, aes(x = this, y = that, colour = columnname))

- Or size = columnname
- columnname is the exact name of the column whose values decide the
colour (or size) of each point, typically a grouping column when one x
variable is plotted for several groups

# follow the first line with a + and then a return/enter. On the next line, we add
the geometric layer: points. We use the function geom_point() or geom_line()

# geom_point(), geom_boxplot(), geom_histogram()

#theme_...() functions, e.g. theme_bw() or theme_grey(), change the
background and overall look

#theme_bw() changes the background to white: the background is white,
and the points and graph are black

#the brackets after each function can be filled, e.g. aes(),
geom_point(), theme_...()

# they can be filled with colour, size, etc.

#ggplot(dataframe, aes(x = absh, y = hsahj)) + geom_point(size = 5) +
theme_bw()

# can also add labels

# + xlab("Fruit") + ylab("Amount")

SCATTER PLOT

ggplot(compensation, aes(x = Root, y = Fruit)) + geom_point()

ggplot(compensation, aes(x = Root, y = Fruit)) +
geom_point(size = 5) + theme_bw()

ggplot(compensation, aes(x = Root, y = Fruit, colour = Grazing)) +
geom_point(size = 5) + xlab("Root Biomass") +
ylab("Fruit Production") + theme_bw()
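
To try the scatter plot without the original file, here is a minimal sketch on a made-up stand-in for the compensation data (the values are invented; only the column names come from the notes). Storing the plot in an object, here called p, is an assumption of this example rather than something the notes require.

```r
library(ggplot2)

# Made-up stand-in for the compensation data (illustrative values only)
compensation <- data.frame(
  Root    = c(6.2, 7.0, 8.1, 9.3, 5.5, 10.2),
  Fruit   = c(59.8, 60.9, 85.3, 90.1, 18.4, 95.0),
  Grazing = c("Ungrazed", "Ungrazed", "Grazed", "Grazed", "Ungrazed", "Grazed")
)

# Layer 1: data and aesthetics; then points, labels and theme, joined by +
p <- ggplot(compensation, aes(x = Root, y = Fruit, colour = Grazing)) +
  geom_point(size = 5) +
  xlab("Root Biomass") +
  ylab("Fruit Production") +
  theme_bw()

# Typing p (or print(p)) at the console draws the figure
```

Keeping the plot in an object makes it easy to add further layers later, e.g. p + facet_wrap(~Grazing).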

BOXPLOT

#can add geom_point() to make box plot with points

#geom_point( alpha= 0.5) for transparency

ggplot(____, aes(x = ____, y = Fruit)) + geom_boxplot() +
xlab("Grazing treatment") + ylab("Fruit Production") + theme_bw()

ggplot(____, aes(x = ____, y = ____)) + geom_boxplot() +
geom_point(size = 4, colour = 'pink', alpha = 0.5) +
xlab("Grazing treatment") + ylab("Fruit Production") + theme_bw()
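
One possible way to fill in the blanks above, shown on a made-up stand-in for the compensation data. Putting Grazing on the x-axis is an assumption suggested by the axis label "Grazing treatment"; the notes themselves leave the blanks open.

```r
library(ggplot2)

# Made-up stand-in for the compensation data (illustrative values only)
compensation <- data.frame(
  Root    = c(6.2, 7.0, 8.1, 9.3, 5.5, 10.2),
  Fruit   = c(59.8, 60.9, 85.3, 90.1, 18.4, 95.0),
  Grazing = c("Ungrazed", "Ungrazed", "Grazed", "Grazed", "Ungrazed", "Grazed")
)

# Boxplot of Fruit per Grazing level, with the raw points drawn on top
box <- ggplot(compensation, aes(x = Grazing, y = Fruit)) +
  geom_boxplot() +
  geom_point(size = 4, colour = "pink", alpha = 0.5) +  # alpha = transparency
  xlab("Grazing treatment") +
  ylab("Fruit Production") +
  theme_bw()

# Typing box at the console draws the figure
```

The order of the layers matters: geom_point() comes after geom_boxplot() so that the points sit on top of the boxes.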

HISTOGRAM

#aes only has x, with no y

#ggplot(____, aes(x = ____)) + geom_histogram()

#use binwidth or bins when the graph looks odd

# you can change either the binwidth (how wide each bar is in 'fruit
units')

#or the number of bins (e.g. ggplot() defaulted to 30 here). Both of
the following produce roughly the same view of the data:

#ggplot(compensation, aes(x = Fruit)) + geom_histogram(binwidth = 15)

OR

#ggplot(compensation, aes(x = Fruit)) + geom_histogram(bins = 10)

#facet_wrap():

#divides the dataset by a grouping variable (a factor with two or
more levels) so that R produces a separate graph for each level,
all with the same x variable

#Example: divide the data by the Grazing treatment, which has the
levels grazed and ungrazed, providing two histograms

#with facet_wrap(), the ~ symbol (called tilde) ALWAYS precedes the
grouping variable

# ggplot(compensation, aes(x = Fruit)) +
geom_histogram(binwidth = 15) + facet_wrap(~Grazing, ncol = 1)
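
A runnable sketch of the faceted histogram, using a made-up stand-in for the compensation data (illustrative values only; with so few rows the histogram itself is not meaningful, it only shows the mechanics).

```r
library(ggplot2)

# Made-up stand-in for the compensation data (illustrative values only)
compensation <- data.frame(
  Root    = c(6.2, 7.0, 8.1, 9.3, 5.5, 10.2),
  Fruit   = c(59.8, 60.9, 85.3, 90.1, 18.4, 95.0),
  Grazing = c("Ungrazed", "Ungrazed", "Grazed", "Grazed", "Ungrazed", "Grazed")
)

# aes() has only x; facet_wrap() makes one panel per Grazing level,
# stacked in a single column because ncol = 1
hist_by_grazing <- ggplot(compensation, aes(x = Fruit)) +
  geom_histogram(binwidth = 15) +
  facet_wrap(~Grazing, ncol = 1)

# Typing hist_by_grazing at the console draws the figure
```

Stacking the panels in one column puts the two histograms on a shared x-axis, which makes the groups easier to compare.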

# Saving figures: there are two ways.

1. In the Plots tab on the right of RStudio, there is an Export button. This
provides options to save to image file types such as .png or .tiff, to save to
PDF, or to copy the figure to the clipboard. This works on all platforms via
RStudio. A dialogue box opens in which figure size, resolution, and location
can be specified.
2. ggplot2 has a built-in function called ggsave(). ggsave() works by saving
the figure in the Plots window (the last plot displayed) to a filename you
specify. ggsave() picks the figure type from the suffix you give in the
filename, so a name ending in .png saves the current figure as a .png file
in the working directory (where R is looking!). You can also define a more
complex location, change the height and width, the units, and the
resolution, if you so desire. The help file ?ggsave is pretty helpful.

#ncol = 1 (in facet_wrap()) specifies the number of columns of graphs to
plot.
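
A minimal sketch of ggsave() as described above. The data, the object name p and the file name fruit_histogram.png are all made up for this example; the file is written to a temporary folder so nothing in the working directory is touched.

```r
library(ggplot2)

# Made-up data for illustration
compensation <- data.frame(
  Fruit = c(59.8, 60.9, 85.3, 90.1, 18.4, 95.0)
)

p <- ggplot(compensation, aes(x = Fruit)) +
  geom_histogram(bins = 10)

# The .png suffix decides the file type; size, units and resolution are optional
out_file <- file.path(tempdir(), "fruit_histogram.png")
ggsave(out_file, plot = p, width = 12, height = 8, units = "cm", dpi = 300)

file.exists(out_file)
```

Passing plot = p explicitly avoids relying on whichever figure happens to be in the Plots window; leaving it out saves the last plot displayed.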

Functions:

- aes()- maps variables to the axes and to other aesthetics such as colour
- geom_...()- the geometric objects: points, lines, etc.
- theme_bw()- removes the grey background
- geom_point(size = 5)- increases the size of the points
- xlab() and ylab()- label the axes
- colour = - change the colours of the points to match the
grouping factor (variable or level).
- shape = - change the shape of the points to match the
grouping factor (variable or level).
- geom_boxplot()- creates a boxplot
- a points layer can be added to a box-and-whisker plot, with its own
colour and size
- ggsave()- saves the figure in the Plots window to a filename you
specify
- ncol = 1- in facet_wrap(), the number of columns of graphs to plot

TESTS

Definitions

t-test statistic-

degrees of freedom-

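p-value- probability of getting a test statistic at least as extreme as the one
observed if H0 were true (it is NOT the probability that H0 is true)

variance- how varied the data are, i.e. the spread of points around the mean

normally distributed-

null hypothesis or H0- the groups are the same; there is no difference in their
means

alternative hypothesis- there is a difference between the means; it is supported
when H0 is rejected

95% CI- an interval built so that, in repeated sampling, 95% of such intervals
contain the true difference between the means, e.g. 95% CI [5.32; 6.4]. If the
CI includes 0, the data are consistent with H0

Paired- the same subjects (or matched pairs) are measured under both
conditions, so each value in one group has a partner in the other

Unpaired- two independent groups, each receiving a different treatment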

1. Plot it first. Make the picture that should tell you the answer
2. Build model- Once you've made a picture, you embark on
translating the hypothesis you are testing into a statistical
model
3. Check assumptions- Once you've specified your model in R
and run it, the vital first step is NOT to then interpret the
results, but instead to assess that you are not violating the
assumptions of your model. For example, a two-sample t-test
may assume equal variance in the two groups. By
assessing if these assumptions are met, you are ensuring
that the results returned by the modelling are reliable. If the
assumptions are not met, then the predictions from, or
interpretation of, the model are compromised. You will not be
making reliable inference.
4. Interpret the output of the statistical modelling. It is here
that we interpret the test statistic and associated p-value.
5. Replot- The final step is to integrate the modelling results
into your original figure, a process some of you may know as
adding predicted or fitted lines (or points) to your graph

1. plot your data


2. build your model
3. check your assumptions
4. interpret your model
5. replot the data and the model.

The two-sample t-test


R's t.test() runs the 'Welch test' by default. It compares the behavior and/or
response of 2 groups with regards to… or the manipulation of…
The two-sample t-test is a comparison of the means of two groups of numeric
values. The groups can be two levels of the same factor, e.g. West garden and
East garden

- X= categorical
- X= 2 categories or qualitative groups
- Y= continuous

#It is appropriate when the sample sizes in each group are small.

#However, it does make some assumptions about the data being analysed. The
standard two-sample Student's t-test assumes that the data in each group are
normally distributed and that their variances are equal. The Welch version
(var.equal = FALSE, the default in t.test()) drops the equal-variance assumption.

1 Plot data
- Load: readxl, tidyverse (which includes dplyr and ggplot2)
- likely use a histogram (when only 2 groups) or a boxplot
- use facet_wrap() to generate one graph per group
- facet_wrap(~MainCategory, ncol = 1)
2 Build model- run two-sample t-test

- use t.test() to ask whether there is a difference between the
means of the two groups
- H0 is no, H alt is yes
- t.test(yname ~ xname, data = dataset)
- xname is x, the grouping category, hence the ~

3 Check assumptions (normality within each group; equal variances only for
the Student's version), then read the t.test() output:
1. Look at t
2. Look at p
3. Hypothesis
4. CI- This interval is for the difference between the two means. Keep in
mind that the difference would be 0 if the means were the same, so if the
interval does not include zero we can conclude that the means are probably
different. This falls in line with the test statistic and associated p-value.
5. Finally, the output provides the means in each group.
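
The t.test() call and the pieces of its output listed above can be sketched as follows. The data frame gardens, its column names value and garden, and all the numbers are made up for illustration; only the East/West garden idea comes from the notes.

```r
# Made-up measurements for two gardens (illustrative values only)
gardens <- data.frame(
  value  = c(61, 59, 65, 63, 60, 62, 64, 58,
             77, 75, 80, 79, 74, 78, 76, 81),
  garden = rep(c("West", "East"), each = 8)
)

# Welch two-sample t-test (the default): response ~ grouping variable
result <- t.test(value ~ garden, data = gardens)

result$statistic   # t
result$parameter   # degrees of freedom
result$p.value     # p-value
result$conf.int    # 95% CI for the difference between the two means
result$estimate    # the mean in each group

# Student's version, which assumes equal variances
student <- t.test(value ~ garden, data = gardens, var.equal = TRUE)
```

Printing result shows the same information in one block; pulling out the pieces with $ is handy when a single number is needed for a report.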

4 Interpret

• Interpret as much as you can from your graph before doing any statistics.
• Write a biologically focused (rather than statistically focused) sentence
describing your result using your statistics to support your conclusion

5 No replot

One-way ANOVA

1. Plot your data- a box-and-whisker plot is a quick and
effective tool for viewing variation in a response variable as
a function of a grouping, categorical variable

ggplot(daphnia, aes(x = parasite, y = growth.rate)) +
geom_boxplot() + theme_bw() + coord_flip()
coord_flip() switches the axes, which can make the plot easier to read

2. Build your model- use lm(), e.g.: model_grow <-
lm(growth.rate ~ parasite, data = daphnia)
3. Check your assumptions- use autoplot(), e.g.:
autoplot(model_grow, smooth.colour = NA)
4. Interpret your model- use anova(), e.g.: anova(model_grow)
5. Replot the data and the model.
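
The model-building and interpretation steps above can be sketched with simulated data. Everything about the data here is an assumption: the three group names, the group means and the sample sizes are invented, and only the column names growth.rate and parasite follow the notes. The autoplot() diagnostic call for lm objects is assumed to come from the ggfortify package, so it is left commented out.

```r
set.seed(1)

# Simulated stand-in for the daphnia data (hypothetical groups and values)
daphnia <- data.frame(
  parasite    = rep(c("control", "parasite_A", "parasite_B"), each = 10),
  growth.rate = c(rnorm(10, mean = 1.2, sd = 0.1),
                  rnorm(10, mean = 0.8, sd = 0.1),
                  rnorm(10, mean = 0.5, sd = 0.1))
)

# 2. Build the model: response ~ grouping variable
model_grow <- lm(growth.rate ~ parasite, data = daphnia)

# 3. Check assumptions (needs ggfortify; assumption, not run here)
# library(ggfortify)
# autoplot(model_grow, smooth.colour = NA)

# 4. Interpret: ANOVA table with Df, F value and p-value
anova_table <- anova(model_grow)
anova_table

# Group differences relative to the first level
summary(model_grow)$coefficients
```

With three groups the parasite row of the table has 2 degrees of freedom (number of groups minus one).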

Two-way ANOVA
Linear regression
Multiple Regression
GLM
