Introduction to statistics using R - Session 2

2023-04-12


Introduction to R and RStudio - R intermission
During this short intro to R, we will see the main basic functions, and we will learn how to go from importing a data set (one I prepared as an example) to summarizing its descriptive statistics.

R basics
R is case sensitive.
Let’s create an object and see what happens if we use the wrong case to call it

hab<-"forest"
hab

## [1] "forest"

#Hab

That’s right, we get an error message.
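Uncomment and run the line above to see it for yourself: since R is case sensitive, Hab is not the same object as hab, and you should see something along the lines of:

Hab

## Error: object 'Hab' not found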

R tolerates extra spaces


For clarity, I encourage you to use spaces.

hab2<-"prairie"
rm(hab2) #deletes the newly created hab2 so you can see that adding spaces changes nothing
hab2 <- "prairie"

First, you can see that adding spaces or not changes nothing. Second, notice the # sign: as explained, you can add any comment you like in your R
code; as long as it starts with a #, it will not be run.

🍀If for some reason R does not respond, or you made a mistake, you can terminate the command currently running by pressing the
Esc key.

R is a calculator, with some useful functions!

3+2 #simple addition

## [1] 5

The [1] in front of the result is the index of the first value printed on that line. Not very useful here for a
simple calculation, but when you get an output spanning 10 lines, it comes in handy!

pi #pi is a built-in constant in R

## [1] 3.141593

pi*5^2 #pi*r^2 i.e., area of a circle with r = 5

## [1] 78.53982

log(8) #log base e

## [1] 2.079442

log10(8) #log base 10

## [1] 0.90309

exp(8) #exponential or natural anti-log

## [1] 2980.958

sqrt(8) #square root

## [1] 2.828427

As we said during the last session, R has convenient functions for pretty much everything basic. I encourage you to use R (every now and then) when
you have an easy calculation to perform, instead of using Excel or your computer's built-in calculator, for example. Using R regularly is the only way to
get used to it and become proficient.

As you just saw by running these lines, the results only got displayed in the console. That’s because we did not create any object. So if you want to
keep an output for later use, don’t forget to create objects.

Create objects

r <- log(25) #we created an object r (radius)


area<-pi*r^2 #now we can create an object area reusing the object r directly

To create an object you use the operator <-, sometimes referred to as the "gets" operator. So, the first line above reads as "r gets log(25)". You
can also use =

diam = r*2

As you can see the objects are stored in the Environment pane. You can create objects with multiple values (we’ll see that in a minute).

🍀Tips about naming objects: not as easy as it seems. You can name an object pretty much anything you want, but there are a few rules:

1. Keep the names as short as possible, while keeping them informative (easier said than done).
2. NO special characters; that's just opening the door for trouble.
3. If you name an object with multiple words, use . or _ to separate them, or capitalize each word (ex.: my_data, my.model, MyOutput).
4. A few words are not allowed because they are reserved for specific cases, like TRUE or NA. You can try, but R won't let you (I actually encourage you to
try, to see what happens).
5. Don't give an object the name of a function. R might or might not let you, but if it does, you will definitely run
into problems (ex.: instead of data, name it dat or df).
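A quick illustration of rules 4 and 5 (the object name below is made up; uncomment the first line to see R refuse):

#TRUE <- 5                      #reserved word: R returns an error if you try
fox.counts_2023 <- c(1, 2, 3)   #letters, numbers, dots and underscores are all fine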

Let's see how to create objects with multiple values for different data types

int<- c(1, 2, 3) #see int is numeric and has 3 obs [1:3]


chr<- c("Hello world!", "Howdy!", "great day!") #and now we have a character variable, with 3 values [1:3]

Now, don't forget that when writing character values you need to add quotation marks "". Let's see what happens if you forget.

#sky<- star

You get an error: R thinks you were calling an object that doesn't exist.

sky<- c("star", "moon", "sun")

Here we go, it works! And congrats, you have been using an R function: c() is a function, short for concatenate (combine). Functions in R are ALWAYS
followed by round brackets, and the arguments you put into a function are separated by commas.
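For example, round() is a built-in function taking two arguments, separated by a comma:

round(pi, digits = 2)   #round pi to 2 decimal places

## [1] 3.14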

R functions and vectors


Remember (Remember the fifth of November…): During last session we used many functions.

mean(int)

## [1] 2

sd(int)

## [1] 1

length(chr)

## [1] 3

Just to name a few. For most common calculations, R has a pre-written function. Don't hesitate to re-read last session's code to remember
the different functions we used.

When dealing with a vector of length > 1, you can extract specific values from your vector. For example, we want the 2nd entry of vector chr

chr[2]

## [1] "Howdy!"

# you can store this value for later use


h<- chr[2]

Using c(), you can extract multiple values

chr[c(1,2)]

## [1] "Hello world!" "Howdy!"

Let's create a vector containing a sequence of numbers. You have (like for pretty much everything else) quite a few ways to do it. Below are two
common ways to create a sequence of integers within a certain range.

my.vec<-5:20
vec<-seq(from = 5, to = 20, by = 1)

The first way is shorter but only works when we want our sequence to increase in steps of 1. With the second, you can customize the step further (see
below).

vec2<-seq(5, 20, by = 0.5)


vec3<-seq(5, 20, by = 0.1)

Let’s go back to my.vec: we can ask R to tell us which entries are bigger than 10

my.vec[my.vec > 10] #in my.vec which values of my.vec are > 10

## [1] 11 12 13 14 15 16 17 18 19 20

or you can ask, for each value, whether it is bigger than 10

my.vec > 10

## [1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [13] TRUE TRUE TRUE TRUE

In the command above, R returns logical values TRUE and FALSE

Now, let’s see the most common logical operators: |, &, ==, !=, >=, <=

my.vec[my.vec < 10 & my.vec > 5] # & is AND

## [1] 6 7 8 9

my.vec[my.vec <= 10 & my.vec >= 5]

## [1] 5 6 7 8 9 10

my.vec[my.vec > 8 | my.vec ==5] # | is OR and == is equal to

## [1] 5 9 10 11 12 13 14 15 16 17 18 19 20

my.vec[my.vec != 20 & my.vec != 10] #adding ! in front of = means you want to exclude it

## [1] 5 6 7 8 9 11 12 13 14 15 16 17 18 19

You can also replace elements in your vector

my.vec[2]<- 800 #just 1


my.vec

## [1] 5 800 7 8 9 10 11 12 13 14 15 16 17 18 19 20

my.vec[c(6, 7)] <- 500 # or many


my.vec

## [1] 5 800 7 8 9 500 500 12 13 14 15 16 17 18 19 20

my.vec[my.vec > 19] <- 1000 #conditionally


my.vec

## [1] 5 1000 7 8 9 1000 1000 12 13 14 15 16 17 18 19
## [16] 1000

my.vec[my.vec < 1000 & my.vec > 100]<- 5 #double condition


my.vec

## [1] 5 1000 7 8 9 1000 1000 12 13 14 15 16 17 18 19
## [16] 1000

You can also perform calculation on entire vectors.

my.vec2<- my.vec*2
log.vec<- log(my.vec)

You can also do these calculations on data frame columns. Look back at the code from session 1: we did just that (e.g., asking for the mean of the column
wing_span from df with mean(df$wing_span)).
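As a reminder of what that looks like (assuming session 1's data frame df with a numeric wing_span column, which is not loaded here, hence the #):

#df$wing_span_m <- df$wing_span / 100    #element-wise calculation on a whole column
#mean(df$wing_span, na.rm = TRUE)        #summary function applied to a column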

Ok, that’s it for the basics. Just 2 last pieces of info:

🍀Tip: If you want to save your work, remember these 2 functions: save and save.image

#save(nameOfObject, file = "name_of_file.RData") #to save an object. Very useful when your object is a model that has been running for 10 days!

#save.image(file = "name_of_file_Date.RData") #to save your whole workspace at once

Try these functions out!

To load an RData file there is, wait for it… the load() function.

#load(file = "name_of_file_Date.RData")

You've noticed I've put a # in front of these 3 commands. That's because, when I knit this .Rmd file to HTML, R runs everything, and if one
line returns an error, the knitting aborts. If you want to try them, replace the generic object names I've inserted with real ones, and don't forget to delete the
#.
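For instance, here is a filled-in version of those commands using the my.vec object we created earlier (the file name is made up; delete the # to run them):

#save(my.vec, file = "my_vec_20230412.RData")   #writes the object to a .RData file in your working directory
#rm(my.vec)                                     #remove it from the Environment
#load("my_vec_20230412.RData")                  #and my.vec is back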

🍀Tips for naming a file on your computer (this works for any file, from Word docs and ppt presentations to R files or databases):

1. Never overwrite former versions unless you have a very good reason to. You may want to trace back the changes you've been making, especially when
collaborating on the same file with other people.
2. Whether you already have multiple versions of the same file or not, always add the date; that will help you stay organized. And please don't make the
rookie mistake of naming a file V2.2 (they're not software), or V3, or Final, or FinalVersion. Trust me, I've seen many a computer with 10 FinalVersions of
the same file… that's never a good idea. If you collaborate internationally, remember that we all have different conventions for writing dates: 05/11/2023
will be either November 5th, 2023, or May 11th, 2023, depending on where you're from. I strongly recommend using YYYYMMDD; it's a widely used format
for international collabs because it is not ambiguous.
3. Keep the name as short as you can, but still provide all the key info. Your future You will thank Past You for it, and collaborators may appreciate it. For
example, eagle_data.csv is a terrible name, because we have no info besides the fact that it's about eagles.
Canada_baldeagle_morpho_data_20102022_20230412.csv is indeed longer, but we have the info we need now: the db contains the morphometric
data of the Canadian bald eagle population from 2010 to 2022, and was last updated on April 12, 2023. The same goes for an R workspace:
“R_seminar_series_session2_20231204.RData” is much better than “R_code_FinalVersion.RData”.
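By the way, R can build such dated names for you; for example (the prefix here is made up):

paste0("THg_summary_", format(Sys.Date(), "%Y%m%d"), ".csv")   #e.g. "THg_summary_20230412.csv" on the day this was written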

Asking R for help or information


There are a few ways. Let’s take for example the read.csv function

?read.csv # the documentation now appears in the Help pane

## starting httpd help server ... done

If you’re not sure anymore of the exact name of the function you can use the help.search function

help.search("read csv")

R now proposes a few different options that correspond to your key words.

Now, good practice is to systematically report the version of R and of the packages you used to perform an analysis. That's also key info to
provide if you're asking for help online (just like for anything else computer-related): some bugs or specific behaviors are linked to your software version.

sessionInfo() tells you everything about your session, including the R version, the platform and OS, the current time zone and language, and the
loaded packages (attached or not). Check it out:

sessionInfo()

## R version 4.2.2 (2022-10-31 ucrt)


## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.31 R6_2.5.1 jsonlite_1.8.4 evaluate_0.20
## [5] cachem_1.0.7 rlang_1.1.0 cli_3.6.0 rstudioapi_0.14
## [9] jquerylib_0.1.4 bslib_0.4.2 rmarkdown_2.21 tools_4.2.2
## [13] xfun_0.37 yaml_2.3.7 fastmap_1.1.0 compiler_4.2.2
## [17] htmltools_0.5.4 knitr_1.42 sass_0.4.5

To get the version of a specific package you can use packageVersion()

packageVersion("dplyr")

## [1] '1.1.0'

And to cite your packages right, use citation()

citation("dplyr")

##
## To cite package 'dplyr' in publications use:
##
## Wickham H, François R, Henry L, Müller K, Vaughan D (2023). _dplyr: A
## Grammar of Data Manipulation_. R package version 1.1.0,
## <https://CRAN.R-project.org/package=dplyr>.
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {dplyr: A Grammar of Data Manipulation},
## author = {Hadley Wickham and Romain François and Lionel Henry and Kirill Müller and Davis Vaughan},
## year = {2023},
## note = {R package version 1.1.0},
## url = {https://CRAN.R-project.org/package=dplyr},
## }

Let’s play with a real data set!

Presentation of the data set


The data set I have provided is a modified version of a real-life data set that was built to assess total mercury (THg)
concentrations in diverse organs of red foxes. The data set contains the following variables: individual ID, age in number of years as an
integer, age category (adult or juvenile), sex (F or M), trapping location (traploc), organ (liver, muscle, renal cortex, renal medulla, brain, claw and guard hair), and total
mercury concentration (THg) as a continuous variable. As is often the case in real-life data sets, we have some NAs for THg in diverse tissues.

Steps to produce descriptive stats of your data set


1. Setting your working directory
First, we need to define our working directory (wd). Here, I recommend that you create a new folder called "R_seminar_session2" in Documents to
run this code.

As for most commands in R, there are many ways to define the wd. Let's see a couple of them.

The classic way (which I very much dislike)

#wd<- setwd("C:/Users/crodrigues/Documents/R_seminar_session2")

Why do I dislike it? Because it's hardly a reproducible line of code: since each person on earth has organized their computer files differently, the
path will change for each of us!

Worse even: say I finally decide to properly organize the files on my computer, while I have many R projects going on. I delete folders,
create new ones, move things around, etc… I would then have to rewrite every setwd command in my code… Huge waste of time!

Also, the longer the path, the more likely you are to make mistakes when writing it and get errors such as Error in setwd("xxx") :
cannot change working directory. You would then need to pinpoint where you wrote something wrong. Again, a waste of time…

I could go on about why I hate setting the wd the classic way. Coding is for lazy people, because they always find the easiest way to do it 😜
My (easier) solution here is:

#setwd(choose.dir())

I put a # because, unfortunately, it often creates issues when knitting an Rmd file into HTML, but in a normal R session it opens a pop-up window in which
you can directly choose which folder you want to use as the working directory. Plus, if you send your code to friends, they don't need to change that line:
they can use it directly with their own file organization (note that choose.dir() only exists on Windows).

You can either set it directly as above, or store the path in an object we will call wd, which then appears under Values in the Environment pane. One subtlety:
setwd() returns the previous working directory, so to store the new path it is safer to call getwd() right after setting it. If you need to feed the
path of the wd to another function at some point, you can then just use the name of the object "wd".

#setwd(choose.dir()); wd <- getwd()

Now that we have our working directory, we want to load our data set.

2. Loading the data set


You need your data in a text format, ideally csv or txt. Here, we have a csv, so let's load it

df<- read.csv("RF_mercury_long_20230406.csv")

The command is read.csv(), and we call our data frame df (How original! 😜).


If you work with Frenchies, they will likely have a weird format of csv with “;” as separator instead of “,” and a “,” for decimal point instead of “.”. In
that case, the command will be read.csv2()

If you have a .txt file the command is read.table().
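For reference, the equivalents would look like this (the file names here are made up, so the lines are commented out):

#df<- read.csv2("RF_mercury_long_20230406_fr.csv")                           #";" separator, "," decimal
#df<- read.table("RF_mercury_long_20230406.txt", header = TRUE, sep = "\t")  #tab-delimited .txt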

The same way you chose the wd from a pop-up window, you could do the same for whatever files you want to import.

#df<- read.delim(file.choose())

But unlike for the wd, I do not recommend this selection method. If you have multiple versions of your files (since you should never overwrite a
previous version), it's better to write down the name of the file you last worked with. Say you're publishing your paper, the analyses were done
a year ago (or more!), and a reviewer asks you to re-do or check something: trust me, you will thank Past You for having written down the exact file
name you used last. That will keep you from failing to reproduce the results you reported in your "Results" section. Of course, that
assumes you are not renaming your databases on a regular basis…

3. Summarizing the data for data exploration


This step covers the very first part of the data exploration process, which allows you to get to know your data set like it's your best friend.

Data exploration is a crucial part of data analysis. In fact, most of the time you will spend way more time exploring your data than actually
modelling them. Running a linear model takes a minute, but choosing it (and then validating it, a step we will cover in a future session) will take much,
much longer, and is what guarantees that you did a proper job and can trust your results. If you don't provide detailed information on these steps,
reviewers will reject your paper (or at least they should, if they know what they're doing; I totally would), because there would be no way to know whether
we can trust your claims. Plus, it ensures reproducibility: if I take your data and follow the steps you describe in your "Methods" section, I should
1. be easily able to do so, just by reading your Methods, and 2. find the same results. Reproducibility ensures that, as scientists, we do our job right.

3.1 Check out the missing values


First, we will check how many NAs we have and where they are, to make sure they do not follow a pattern, which could be a problem.

colSums(is.na(df))

##      id     age age_cat     sex traploc   organ     THg
##       0       0       0       0       0       0      55

which(colSums(is.na(df))>0)

## THg
## 7

names(which(colSums(is.na(df))>0))

## [1] "THg"

The first line returns the number of NAs for each column (this is the line I personally reach for first, because it is the most
informative). The function is.na() basically transforms each cell of your data frame into a logical value: is it NA, TRUE or FALSE? Then the
colSums() function counts the number of TRUEs per column.

The second line names the columns where the count of NAs is > 0 and gives you their position, and the third line only provides the column names.
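Two complementary one-liners (base R; the first one just re-counts what colSums() already told us) can also help:

sum(is.na(df))                #total number of NAs in the whole data frame

## [1] 55

#table(rowSums(is.na(df)))    #how many rows have 0, 1, 2, ... missing values (try it!)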

Now that we know that we only have missing values in THg, we want to check that there is no pattern. So, we will switch to the package dplyr (which you
will get to know quite well in future sessions) to summarize the NAs by group.

library(dplyr) #first we load the package

##
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
##
##     filter, lag

## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union

df<- as.data.frame(unclass(df),
                   stringsAsFactors = TRUE) #this line converts all characters into factors at once

dCount.sex <- df %>% # Count NA by sex
  group_by(sex) %>%
  summarize(count_na = sum(is.na(THg)), n = length(THg))
dCount.sex

## # A tibble: 2 × 3
## sex count_na n
## <fct> <int> <int>
## 1 F 26 182
## 2 M 29 203

dCount.age <- df %>% # Count NA by age category
  group_by(age_cat) %>%
  summarize(count_na = sum(is.na(THg)), n = length(THg))
dCount.age

## # A tibble: 2 × 3
## age_cat count_na n
## <fct> <int> <int>
## 1 adult 27 175
## 2 juvenile 28 210

#To check by age, we can plot it, because a plot is easier to visualize for most people than a table with numbers for so many categories
#First, we create a table for age like above
dCount.age2 <- df %>% # Count NA by age
  group_by(age) %>%
  summarize(count_na = sum(is.na(THg)), n = length(THg))

#Then we create a new variable: the proportion of NAs (NAs per age / n)
dCount.age2$prop_na<- dCount.age2$count_na/dCount.age2$n

#Finally we load ggpubr (a wrapper for ggplot2, easier to use) and make a scatterplot with the correlation R value and a p value for a Pearson correlation
library(ggpubr)

## Loading required package: ggplot2

ggscatter(dCount.age2, x = "age", y = "prop_na",
          add = "reg.line",   # Add regression line
          conf.int = TRUE,    # Add confidence interval
          add.params = list(color = "blue",
                            fill = "lightgray")
          ) +
  stat_cor(method = "pearson", label.x = 0.2, label.y = 1) # Add correlation coefficient

So, missing values seem to be pretty random. The R we obtained for age is quite high, but that’s because we have very few old animals, so we can
safely consider that the apparent pattern is just due to sampling.

We first loaded the dplyr package using the library() function, and the second line is a command to convert all characters from your data set into
factors, which are (in general) easier to deal with in R. The usual way to convert from one data type to another is a function of the form
as.NewDataType(); below, we convert id back and forth as an example.

df$id<- as.character(as.factor(df$id)) #back to character


df$id<- as.factor(as.character(df$id)) #and again convert id to factors
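If you want to double-check every column's class at once, str() does that (left commented here because its output is long):

#str(df)    #id, age_cat, sex, traploc and organ should now show up as factors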

3.2 Summarizing THg per group


Here, we will use dplyr again, but this time to summarize THg concentration per group.

thg.SexAge <- df %>%
  filter(!is.na(THg)) %>%
  group_by(sex, age_cat) %>%
  summarize(min = min(THg), max = max(THg), mean = mean(THg),
            se = sd(THg)/sqrt(length(THg)), n = length(THg))

## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.

thg.SexAge

## # A tibble: 4 × 7
## # Groups: sex [2]
## sex age_cat min max mean se n
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <int>
## 1 F adult 20.9 2747. 682. 134. 39
## 2 F juvenile 10.6 2510. 341. 49.9 117
## 3 M adult 31.9 3495. 677. 70.4 109
## 4 M juvenile 15.4 2501. 364. 68.9 65

We have to filter out the NAs, otherwise they will cause problems (take out the filter line and re-run the code to see).
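If you'd rather keep all the rows, an alternative (just a sketch, not run here) is to pass na.rm = TRUE to each summary function and count the non-missing values yourself:

#thg.SexAge2 <- df %>%
#  group_by(sex, age_cat) %>%
#  summarize(mean = mean(THg, na.rm = TRUE),
#            se = sd(THg, na.rm = TRUE)/sqrt(sum(!is.na(THg))),
#            n = sum(!is.na(THg)))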

Let’s do the same for trapping location.

thg.loc <- df %>%
  filter(!is.na(THg)) %>%
  group_by(traploc) %>%
  summarize(min = min(THg), max = max(THg), mean = mean(THg),
            se = sd(THg)/sqrt(length(THg)), n = length(THg))
thg.loc
thg.loc

## # A tibble: 10 × 6
## traploc min max mean se n
## <fct> <dbl> <dbl> <dbl> <dbl> <int>
## 1 Button Bay 15.4 2510. 420. 95.5 39
## 2 Goose Creek 42.8 2468. 658. 243. 13
## 3 Line 25 20.0 2595. 573. 152. 21
## 4 Mack Lake area 17.3 107. 57.2 15.5 5
## 5 North River 12.4 3424. 589. 91.6 63
## 6 Seal River 10.6 3495. 603. 141. 30
## 7 Southknife Lake 45.8 2641. 721. 173. 23
## 8 Town of Churchill 17.5 2501. 407. 67.4 78
## 9 Twin Lakes 16.7 2097. 352. 70.7 49
## 10 Wakeworth Lake 25.8 2090. 661. 266. 9

xtabs(~ traploc + sex, data = df)

## sex
## traploc F M
## Button Bay 21 21
## Goose Creek 14 7
## Line 25 0 21
## Mack Lake area 7 0
## North River 42 21
## Seal River 7 35
## Southknife Lake 7 21
## Town of Churchill 28 63
## Twin Lakes 42 14
## Wakeworth Lake 14 0

tab<-xtabs(~ traploc + age_cat + sex, data = df)


ftable(tab)

## sex F M
## traploc age_cat
## Button Bay adult 0 7
## juvenile 21 14
## Goose Creek adult 7 7
## juvenile 7 0
## Line 25 adult 0 14
## juvenile 0 7
## Mack Lake area adult 0 0
## juvenile 7 0
## North River adult 14 21
## juvenile 28 0
## Seal River adult 0 21
## juvenile 7 14
## Southknife Lake adult 7 21
## juvenile 0 0
## Town of Churchill adult 7 28
## juvenile 21 35
## Twin Lakes adult 7 7
## juvenile 35 7
## Wakeworth Lake adult 7 0
## juvenile 7 0

We saw that some locations (namely Goose Creek and Wakeworth Lake) may be associated with higher THg in fox tissues, but we also
saw above that adult females seem to have more mercury in their tissues. We thus need to check whether the distribution of sexes per location is balanced
(the 2 xtabs lines, which we also saw last session, show you the count per location of sex - line 1 - and of sex × age_cat - line 2), and it is not balanced. If
we were going to analyze these data today, that info should get you to think about possibly excluding one of the variables sex or traploc, as they may
be strongly associated, and should definitely make you test for that association specifically. We'll talk about correlation between explanatory
variables further in future sessions.
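If you want to test that association formally, one option (just a sketch, left commented; with counts this small, and cells at 0, R will likely warn you about the chi-squared approximation, and a Fisher exact test may be preferable) is:

#chisq.test(xtabs(~ traploc + sex, data = df))                              #association between trapping location and sex
#fisher.test(xtabs(~ traploc + sex, data = df), simulate.p.value = TRUE)    #alternative for small counts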

Next, we can produce a table summarizing THg per tissue per sex and age. We will use a different way this time, one that does not involve dplyr.

dfc<-df[complete.cases(df),]
sum.tab <- aggregate(dfc$THg,
                     by = list(dfc$sex, dfc$age_cat, dfc$organ),
                     FUN = function(x) c(min = min(x), max = max(x),
                                         median = median(x), mean = mean(x),
                                         se = sd(x)/sqrt(length(x)),
                                         n = length(x)))
sum.tab<-do.call(data.frame, sum.tab)
colnames(sum.tab)<- c("sex", "age", "organ", "min", "max", "median", "mean", "se", "n")

Like with dplyr, leaving the NAs in would cause problems. We need to get rid of them, and we did that with the function complete.cases(), which only keeps
rows that are complete. Then, we used the function aggregate(). The arguments you give to aggregate() are the column you want to summarize by
group (here, THg), then the groups (as a list), and finally the functions you want (here, we asked for min, max, median, mean, standard error*,
and sample size). *Remember that the standard error (se) does not have a built-in function, so you need to compute it by hand: sd (standard deviation) / sqrt
(square root) of n (sample size).

Then, we converted sum.tab into a real data frame, and renamed the columns.

Yay! We have our summary table! 🎉


Below is another bit of code to make the table pretty. We won't cover it in class, but I encourage you to try running it.

💪 We have our summary table. Now, we’ll see how to make it pretty.
library(rempsyc)

## Warning: package 'rempsyc' was built under R version 4.2.3

## Suggested APA citation: Thériault, R. (2022). rempsyc: Convenience functions for psychology
## (R package version 0.1.1) [Computer software]. https://rempsyc.remi-theriault.com

library(flextable) #load the two necessary libraries

## Warning: package 'flextable' was built under R version 4.2.3

##
## Attaching package: 'flextable'

## The following objects are masked from 'package:ggpubr':
##
##     border, font, rotate

nice_table(sum.tab) #nice function, eh?

sex age organ min max median mean se n

F adult brain 20.86 109.93 54.95 54.46 16.15 5

M adult brain 31.92 269.44 117.16 114.44 14.31 17

F juvenile brain 10.57 126.31 30.14 34.89 6.46 17

M juvenile brain 15.45 45.71 20.03 25.88 4.33 7

F adult claw 1,079.20 1,732.24 1,424.47 1,413.12 133.00 6

M adult claw 523.99 2,134.76 1,054.77 1,197.70 150.83 13

F juvenile claw 209.40 1,764.90 512.78 704.44 128.21 16

M juvenile claw 209.85 1,727.20 607.53 777.40 196.29 9

F adult GH 1,976.55 2,747.21 2,159.25 2,260.72 116.88 6

M adult GH 658.73 3,424.01 1,973.80 1,813.03 236.37 13

F juvenile GH 481.09 2,509.83 926.77 1,240.99 194.82 16

M juvenile GH 411.50 2,501.03 1,327.61 1,248.34 264.80 9

F adult KidCort 178.62 455.45 363.94 340.49 63.62 4

M adult KidCort 326.75 3,494.81 809.66 1,018.81 208.39 15

F juvenile KidCort 107.40 495.49 245.96 262.25 29.90 15

M juvenile KidCort 166.53 406.48 268.49 272.45 25.68 9

F adult KidMed 29.50 220.63 138.79 131.93 43.06 4

M adult KidMed 49.63 1,035.02 235.06 289.59 64.87 15

F juvenile KidMed 28.25 167.54 62.39 77.60 9.92 15

M juvenile KidMed 37.56 141.14 73.10 82.97 11.11 9

F adult liver 154.98 430.22 180.34 246.80 41.92 7

M adult liver 181.33 1,348.47 340.33 500.97 73.38 18

F juvenile liver 48.51 259.36 101.92 114.94 12.02 19

M juvenile liver 86.92 263.93 118.35 133.53 15.71 11

F adult muscle 35.98 184.30 61.68 92.58 21.90 7

M adult muscle 48.86 679.73 175.75 225.81 40.89 18

F juvenile muscle 17.33 101.55 42.57 46.22 5.18 19

M juvenile muscle 24.77 121.54 43.03 53.19 8.22 11

Now we have a nice table, but we need to rearrange it: we technically have multilevel groups (sex, and age within sex) that should become multilevel
headers, whereas here each sex × age combination appears as separate rows.

#First, we'll combine sex and age in the same column
sum.tab$sex_age<- paste0(substr(sum.tab$sex, 1, 1), ".",
                         substr(sum.tab$age, 1, 1))

# Remove columns using select()
sum.tab2 <- sum.tab %>% select(-c(sex, age))

#Then, we'll separate the rows by group sex_age
fa<- subset(sum.tab2, sex_age == "F.a")
ma<- subset(sum.tab2, sex_age == "M.a")
fj<- subset(sum.tab2, sex_age == "F.j")
mj<- subset(sum.tab2, sex_age == "M.j")

#And create a new df with the 4 df bound columnwise - we need to exclude the first column (organ) from 3 of the df and the last column (sex_age) from all of them
dat <- cbind(fa[,1:7], fj[,2:7], ma[,2:7], mj[,2:7])

#Now we rename the columns (except the first one) the proper way, so that the function understands which header it should take
names(dat)[-1] <- c(paste0("Female.adult.", names(sum.tab2[2:7])),
                    paste0("Female.juvenile.", names(sum.tab2[2:7])),
                    paste0("Male.adult.", names(sum.tab2[2:7])),
                    paste0("Male.juvenile.", names(sum.tab2[2:7])))

#Now we'll rename the organs the way we want them to appear in the table
# Renaming factor levels with dplyr
dat <- dat %>%
  mutate(organ=recode(organ, "GH"="Guard hair", "brain" = "Brain",
                      "claw"="Claw", "KidCort" = "Renal cortex",
                      "KidMed" = "Renal medulla", "liver"= "Liver",
                      "muscle"= "Muscle"))

#Now we use nice_table()
nice_table(dat)

[Output: a wide table with one row per organ (Brain, Claw, Guard hair, Renal cortex, Renal medulla, Liver, Muscle) and, for each sex × age group, the columns min, max, median, mean, se and n (Female.adult.min, Female.adult.max, …, Male.juvenile.n). It is too wide to display here; the values are the same as in the summary table above, and the properly formatted version appears below.]

#All seems in order, ready for the last step which consists of separating headers
nice_table(dat, separate.header = TRUE, italics = seq(dat))

Female Male

organ adult juvenile adult juvenile

min max median mean se n min max median mean se n min max median mean se n min max median mean se n

Brain 20.86 109.93 54.95 54.46 16.15 5 10.57 126.31 30.14 34.89 6.46 17 31.92 269.44 117.16 114.44 14.31 17 15.45 45.71 20.03 25.88 4.33 7

Claw 1,079.20 1,732.24 1,424.47 1,413.12 133.00 6 209.40 1,764.90 512.78 704.44 128.21 16 523.99 2,134.76 1,054.77 1,197.70 150.83 13 209.85 1,727.20 607.53 777.40 196.29 9

Guard hair 1,976.55 2,747.21 2,159.25 2,260.72 116.88 6 481.09 2,509.83 926.77 1,240.99 194.82 16 658.73 3,424.01 1,973.80 1,813.03 236.37 13 411.50 2,501.03 1,327.61 1,248.34 264.80 9

Renal cortex 178.62 455.45 363.94 340.49 63.62 4 107.40 495.49 245.96 262.25 29.90 15 326.75 3,494.81 809.66 1,018.81 208.39 15 166.53 406.48 268.49 272.45 25.68 9

Renal medulla 29.50 220.63 138.79 131.93 43.06 4 28.25 167.54 62.39 77.60 9.92 15 49.63 1,035.02 235.06 289.59 64.87 15 37.56 141.14 73.10 82.97 11.11 9

Liver 154.98 430.22 180.34 246.80 41.92 7 48.51 259.36 101.92 114.94 12.02 19 181.33 1,348.47 340.33 500.97 73.38 18 86.92 263.93 118.35 133.53 15.71 11

Muscle 35.98 184.30 61.68 92.58 21.90 7 17.33 101.55 42.57 46.22 5.18 19 48.86 679.73 175.75 225.81 40.89 18 24.77 121.54 43.03 53.19 8.22 11

💪
That's all for this example. Using flextable, you can really customize your tables any way you wish. Here is the link to the flextable user
guide: https://ardata-fr.github.io/flextable-book/
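And if you want to keep the formatted table outside of R, flextable objects can be exported, for example to a Word document (just a sketch; the file name is made up):

#flextable::save_as_docx(nice_table(dat, separate.header = TRUE), path = "THg_summary_table_20230412.docx")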

We will stop here for today. We will see much more of R during the next sessions, as we slowly shift from
mostly theoretical seminars to mostly practical ones.
