Important R Codes and Notes
#Create R Project
#name scripts
General
Basic ggplot
library(ggplot2)
x <- seq(from = -10, to = 10, by = 0.1)
y <- _
library(readxl)
library(tidyverse)
# dplyr verbs (plus base R's dim()):
#dim() tells you the number of rows and columns, i.e. the
dimensions of the dataset.
#select() can also be used to select all columns except one. For
example, if we wanted to leave out the Root column, leaving only
the Fruit and Grazing columns:
#select(compensation, -Root)
#The c() function combines values into a vector; here it lists the
row numbers that we want. It is a very useful function in R for
specifying discontinuous lists for many things. Try the same code,
but this time asking for row numbers 1, 5, 10 and 15:
#slice(compensation, c(1, 5, 10, 15))
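A minimal sketch of dim(), select() and slice() together. The data frame below is a made-up stand-in for compensation: only the column names (Root, Fruit, Grazing) come from these notes, the values are invented, and it has just six rows, so the row numbers asked for are 1, 3 and 5.

```r
library(dplyr)

# Hypothetical stand-in for the compensation data; all values are invented.
compensation <- data.frame(
  Root    = c(6.2, 7.0, 5.5, 8.1, 9.3, 6.8),
  Fruit   = c(59.8, 85.2, 18.4, 92.7, 101.3, 15.9),
  Grazing = c("Ungrazed", "Ungrazed", "Grazed", "Ungrazed", "Grazed", "Grazed")
)

dim(compensation)                            # number of rows, then number of columns
no_root <- select(compensation, -Root)       # every column except Root
picked  <- slice(compensation, c(1, 3, 5))   # rows 1, 3 and 5 only
```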
== equals
!= does not equal
< less than
> greater than
>= equal to or greater than
<= equal to or less than
One can easily select rows according to multiple conditions. For
example, to keep only rows with Fruit > 80 OR Fruit < 20, we
use the vertical bar | (the OR operator) inside filter().
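The OR condition described above might look like this. It is a sketch on invented data: the compensation data frame here is a hypothetical stand-in, and the name extremes is only for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the compensation data; all values are invented.
compensation <- data.frame(
  Root  = c(6.2, 7.0, 5.5, 8.1, 9.3, 6.8),
  Fruit = c(59.8, 85.2, 18.4, 92.7, 101.3, 15.9)
)

# Keep rows where Fruit is above 80 OR below 20; | means "or".
extremes <- filter(compensation, Fruit > 80 | Fruit < 20)
extremes
```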
#You can also combine filtering rows AND selecting columns. Example:
imagine you want the rows with fruit production > 80, and the Root
(rootstock width) column ONLY. That’s a filter() and a select() agenda:
#select(filter(compensation, Fruit > 80), Root)
The pipe command is %>%. You can read this like ‘put the answer
of the left-hand command into the function on the right’.
Not necessary: the nested form select(filter(...), ...) gives the same result.
e.g.: compensation %>% filter(Fruit > 80) %>% select(Root)
assigned: FruitRoot <- compensation %>% filter(Fruit > 80) %>%
select(Root)
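To see that the pipe is only a different way of writing the same steps, here is a small sketch comparing the nested and piped forms. The compensation data frame is again an invented stand-in, and the names nested and piped are for illustration only.

```r
library(dplyr)

# Hypothetical stand-in for the compensation data; all values are invented.
compensation <- data.frame(
  Root  = c(6.2, 7.0, 5.5, 8.1, 9.3, 6.8),
  Fruit = c(59.8, 85.2, 18.4, 92.7, 101.3, 15.9)
)

# Nested form: read from the inside out.
nested <- select(filter(compensation, Fruit > 80), Root)

# Piped form: read left to right, each result feeds the next function.
piped <- compensation %>%
  filter(Fruit > 80) %>%
  select(Root)

all.equal(nested, piped)   # checks that both forms return the same data
```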
VISUALISATION
Using ggplot2
library(ggplot2)
# the first argument to give the function ggplot() is the data frame
# the aes() function defines the graph’s aesthetics; tell R, in this example, to
associate the x-position of the data points with the values in the Root variable,
and the y-position of the points with the values in the Fruit variable. We are
setting up and establishing which variables define the axes of the graph.
#The other significant thing to commit to memory about ggplot2 is that it works
by adding layers and components to the aesthetic map. The data and
aesthetics always form the first layer. Then we add the geometric objects,
like points, lines, and other things that report the data. We can also add/adjust
other components, like x- and y-label customization. The trick is to know that
the + symbol is how we add these layers and components.
- Or size = columnname (inside aes())
- columnname is the exact name of the column whose data will be
visualized, e.g. when there is 1 x column and multiple y columns
# follow the first line with a + and then a return/enter. On the next line, we add
the geometric layer: points. We use the function geom_point() or geom_line()
# geom_point(), geom_boxplot(), geom_histogram()
# + xlab("Fruit") + ylab("Amount")
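Putting the layers together, a scatter plot could be built like this. It is a sketch on an invented stand-in for compensation; the axis label text is chosen here for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for the compensation data; all values are invented.
compensation <- data.frame(
  Root  = c(6.2, 7.0, 5.5, 8.1, 9.3, 6.8),
  Fruit = c(59.8, 85.2, 18.4, 92.7, 101.3, 15.9)
)

# First layer: data and aesthetics. Then + adds the points and the axis labels.
p <- ggplot(compensation, aes(x = Root, y = Fruit)) +
  geom_point() +
  xlab("Root") +
  ylab("Fruit")

p   # printing the object draws the plot
```

Swapping geom_point() for another geom changes the type of graph without touching the first line.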
SCATTER PLOT
BOXPLOT
HISTOGRAM
# can change either the binwidth (how wide each bar is in ‘fruit
units’)
OR the number of bins (bins)
#facet_wrap():
#with facet_wrap(), the ~ symbol (called a tilde) ALWAYS precedes the
grouping variable
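A histogram with a chosen binwidth, split into one panel per group, might be sketched as follows. The data are invented, and the binwidth of 15 is an arbitrary value picked for this example.

```r
library(ggplot2)

# Hypothetical stand-in for the compensation data; all values are invented.
compensation <- data.frame(
  Fruit   = c(59.8, 85.2, 18.4, 92.7, 101.3, 15.9),
  Grazing = c("Ungrazed", "Ungrazed", "Grazed", "Ungrazed", "Grazed", "Grazed")
)

# binwidth sets how wide each bar is, in the units of Fruit.
# facet_wrap(~Grazing) draws one panel for each value of Grazing.
h <- ggplot(compensation, aes(x = Fruit)) +
  geom_histogram(binwidth = 15) +
  facet_wrap(~Grazing, ncol = 1)

h
```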
1. In the Plots tab on the right of RStudio, there is an Export button. This
provides options to save to image file types such as .png or .tiff, to save to
PDF, or to copy the figure to the clipboard. This works on all platforms via
RStudio. A dialogue box opens that lets you specify the figure size,
resolution, and save location.
2. ggplot2 has a built-in function called ggsave(). ggsave() works by saving
the figure in the Plots window to a filename you specify. Wonderfully,
ggsave() creates the correct figure type by using the suffix you specified
in your filename. Given a filename ending in .png, it will save the current
figure to the working directory (where R is looking!) as a .png file. Of
course, you can define a more complex location, change the height and width,
the units, and the resolution, if you so desire. The help file ?ggsave is
pretty helpful.
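A sketch of ggsave() with the size, units and resolution spelled out. The file name fruit_vs_root.png and the dimensions are made up for this example, and the file is written to a temporary folder here; passing just a file name would save into the working directory instead.

```r
library(ggplot2)

# Hypothetical stand-in for the compensation data; all values are invented.
compensation <- data.frame(
  Root  = c(6.2, 7.0, 5.5, 8.1, 9.3, 6.8),
  Fruit = c(59.8, 85.2, 18.4, 92.7, 101.3, 15.9)
)

p <- ggplot(compensation, aes(x = Root, y = Fruit)) +
  geom_point()

# The .png suffix tells ggsave() which file type to create.
out_file <- file.path(tempdir(), "fruit_vs_root.png")
ggsave(out_file, plot = p, width = 12, height = 8, units = "cm", dpi = 300)
```

If plot is left out, ggsave() saves the last plot that was displayed.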
#ncol = 1 in facet_wrap() specifies the number of columns of graphs
(panels) to plot.
Functions:
TESTS
Definitions
t-test statistic-
degrees of freedom-
normally distributed-
null hypothesis (H0) - the groups are the same, i.e. there is no difference in their means
1. Plot it first. Make the picture that should tell you the answer
2. Build model- Once you’ve made a picture, you embark on
translating the hypothesis you are testing into a statistical
model
3. Check assumptions- Once you’ve specified your model in R
and run it, the vital first step is NOT to interpret the
results, but instead to assess whether you are violating the
assumptions of your model. For example, a two-sample
t-test may assume equal variance in the two groups. By
assessing whether these assumptions are met, you are ensuring
that the results returned by the modelling are reliable. If the
assumptions are not met, then the predictions from, or
interpretation of, the model are compromised. You will not be
making reliable inferences.
4. Interpret the output of the statistical modelling. It is here
that we interpret the test statistic and associated p-value.
5. Replot. The final step is to integrate the modelling results into
your original figure, a process some of you may know as adding
predicted or fitted lines (or points) to your graph.
Two-sample t-test
- X = categorical
- X = 2 categories or qualitative groups
- Y1 = continuous
#The two-sample t-test is appropriate when the sample sizes in each group are small.
#However, it does make some assumptions about the data being analysed. The
standard two-sample Student's t-test assumes that the data in each group are
normally distributed and that their variances are equal.
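A minimal sketch of running the standard test in base R. The data frame dat, its columns group and response, and all the values are invented for illustration. var.test() is shown as one possible way to compare the two variances; it is an assumption of this sketch, not something these notes prescribe.

```r
# Invented measurements for two hypothetical groups, A and B.
dat <- data.frame(
  group    = rep(c("A", "B"), each = 6),
  response = c(10.1, 11.4, 9.8, 10.9, 11.0, 10.3,
               12.2, 13.1, 12.7, 11.9, 13.4, 12.5)
)

# F test comparing the variances of the two groups.
var_check <- var.test(response ~ group, data = dat)

# var.equal = TRUE requests the standard Student's t-test.
# The default, var.equal = FALSE, runs Welch's version, which does not
# assume equal variances.
student <- t.test(response ~ group, data = dat, var.equal = TRUE)
student
```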
1 Plot data
Load: readxl, tidyverse, dplyr, ggplot2
likely use histogram (when only 2 groups) or boxplot
use facet_wrap() to generate graphs for each group
facet_wrap(~MainCategory, ncol= 1)
2 Build model- run two sample t test
3 Check assumptions
1. Look at t
2. Look at p
3. Hypothesis
4. CI- This interval is around the difference between the two means. Keep
in mind that the difference would be 0 if the means were the same.
5. If the interval does not include zero, we can conclude that the means
are probably different. This falls in line with the test statistic and
associated p-value.
6. Finally, the output provides the means in each group.
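Each item in the list above can also be pulled out of the result by name. This is a sketch on the same kind of invented two-group data; dat, group, response, fit and excludes_zero are hypothetical names.

```r
# Invented measurements for two hypothetical groups, A and B.
dat <- data.frame(
  group    = rep(c("A", "B"), each = 6),
  response = c(10.1, 11.4, 9.8, 10.9, 11.0, 10.3,
               12.2, 13.1, 12.7, 11.9, 13.4, 12.5)
)

fit <- t.test(response ~ group, data = dat, var.equal = TRUE)

fit$statistic   # the t value
fit$parameter   # degrees of freedom
fit$p.value     # the p-value
fit$conf.int    # confidence interval for the difference between the means
fit$estimate    # the mean of each group

# TRUE when the whole interval lies on one side of zero.
excludes_zero <- fit$conf.int[1] > 0 || fit$conf.int[2] < 0
```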
4 Interpret
• Interpret as much as you can from your graph before doing any statistics.
• Write a biologically focused (rather than statistically focused) sentence
describing your result using your statistics to support your conclusion
5 No replot
One-way ANOVA
Two-way ANOVA
Linear regression
Multiple Regression
GLM