Applied Biostatistics 2020 - 02 The R Environment

Global Nutrition & Health
Elective Module “Applied Biostatistics”
Teacher: Alexandr Parlesak

February 3rd – March 13th, 2020
University College
Copenhagen
72 h Face-to-face learning;
84 h Directed learning;
84 h Autonomous learning
10 ECTS Points
Applied Biostatistics 2020 A. Parlesak 1
The R Environment
Data types and data handling in R: a first overview

Help from the Textbook:
Introductory Statistics with R

Second Edition 2008
Peter Dalgaard
Copyright © 2008 by Springer

Sources of Help (1)
The command (call) I entered does not work.
What can I do?
• Check the spelling. R is very sensitive to spelling
mistakes, including upper/lower case letters.
• Check whether you are applying the call appropriately:
use the Compendium in the textbook (Appendix C).
• If you have no idea which function to
apply: check the introduction document installed
along with R in your folder C:\Programmer\R\R-3.6.2\doc\manual\R-Intro.pdf
• Try a simple Google search on your problem. The R
community is strong and active.
• Try ?term (e.g. ?read.table) in the RStudio environment to open the help page of a function.
Sources of Help (2)
The command (call) I entered does not work.
What can I do?
• Check “Appendix C. Compendium” in the textbook.
• If the input prompt in the R environment (“>”) has changed
to a “+” and R does not react to your calls, then
your last call is not finished yet (frequently a “)” is
missing). Complete the call or press Esc to cancel it.
• If it’s a long call, use the “arrow up” key to
recall the last command line(s).

Basic Data Types in R
• A number (integer, real, or complex) is always directly
understood by R, so you can use R as an
‘overgrown’ calculator.
• Text (a “string”, R type “character”) has to be enclosed in quotation marks.
Values that R cannot read as numbers when data are imported are read as text. This also happens if you
use a comma as a decimal separator. Any
mathematical operation with strings returns an error
message.
• Logical values (TRUE, FALSE) are an exception:
they are recognized as true/false states (“logicals”) if
written in capital letters and without quotation marks.
For a better understanding of the following chapters, go through
the “Intro to basics” exercises at https://www.datacamp.com/
(you have to register as a user).
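The three basic data types can be checked directly at the console. The following is a minimal sketch; the object names (x, txt, flag, res, num) are chosen here for illustration only.

```r
# Numbers: R works like a calculator
x <- 7 / 2
class(x)                 # "numeric"

# Text must be quoted; a comma as decimal separator makes the value text
txt <- "3,5"
class(txt)               # "character"

# Arithmetic with text fails; tryCatch() captures the error instead of stopping
res <- tryCatch(txt + 1, error = function(e) "error")
res

# Replacing the comma by a point makes the value readable as a number
num <- as.numeric(sub(",", ".", txt, fixed = TRUE))
num

# Logicals are written in capital letters and without quotation marks
flag <- TRUE
class(flag)              # "logical"
```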
Basic Terms/Operations
in the R Environment
• Basic arithmetic and functions
sqrt(x), log(x), exp(x), min(V),
max(V), length(V)
• Random numbers from a normal distribution: rnorm(n)
• The assignment operator (<-)
• Getting help (e.g. ?read.table)
• Vector generation
c(x1,x2,x3,… xn)
• Calculations with vectors (BMI example)
• sum(V), mean(V), sd(V), median(V)
• Random picking function: sample(DF$V)
(etc.; check App. C, p. 328)
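The calls listed above can be tried on a small random vector. This is only a sketch: the vector V, the fixed seed, and the sample size of 10 are assumptions made for the example.

```r
set.seed(1)              # assumption: fixed seed so the random numbers are reproducible
V <- rnorm(10)           # 10 random values from a standard normal distribution

length(V)                # number of elements
min(V); max(V)           # smallest and largest value
sum(V); mean(V); sd(V); median(V)

sqrt(16)                 # square root
log(exp(1))              # natural logarithm of e

shuffled <- sample(V)    # all elements in random order
picked   <- sample(V, 3) # three randomly picked elements
picked
```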

Some Basics: Calculating with Exponents and Logarithms
o The exponent (power, x) of a number tells you how often the
number (base, b) is multiplied by itself,
e.g. \( 10^4 = 10 \times 10 \times 10 \times 10 = 10\,000 \).
o The logarithm (log) is the function that returns the exponent to a given base:
\( x = \log_b(a) \Leftrightarrow b^x = a \) (\( a, b > 0 \) and \( b \neq 1 \))
- decadic logarithm: base 10 (lg): \( \lg(10^4) = 4 \) (R: log10(x))
- natural logarithm: base e (ln) (R: log(x))
- binary logarithm: base 2 (lb) (R: log2(x))
o Commonly used bases are 10, Euler’s number (e = 2.71828...), and 2
(e arises in the problem of compound interest).
o In addition, the following rules are valid:
· \( \log(a) + \log(b) = \log(ab) \)
· \( \log(a) - \log(b) = \log(a/b) \)
· \( n \log(a) = \log(a^n) \)
· \( \log_b(x) = \log_b(g) \cdot \log_g(x) \)
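The logarithm rules can be verified numerically in R. In this sketch the values of a, b, n, x, and g are arbitrary choices for illustration; all.equal() is used because floating-point results can differ in the last digits.

```r
log10(10^4)                       # decadic logarithm
log2(8)                           # binary logarithm
log(exp(1))                       # natural logarithm

a <- 8; b <- 32; n <- 3           # arbitrary positive example values
rule_sum  <- all.equal(log(a) + log(b), log(a * b))
rule_diff <- all.equal(log(a) - log(b), log(a / b))
rule_pow  <- all.equal(n * log(a), log(a^n))

# Change of base: log_b(x) = log_b(g) * log_g(x)
x <- 50; g <- 10
rule_base <- all.equal(log(x, base = b), log(g, base = b) * log(x, base = g))

c(rule_sum, rule_diff, rule_pow, rule_base)
```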
Problem of Compound Interest (1)
o The formula for interest
that is compounded is
\( A = P \left(1 + \frac{r}{n}\right)^{nt} \)
o A represents the amount of money after a certain amount of time
P represents the principal, i.e. the amount of money you start with
r represents the annual interest rate and is always written as a decimal
t represents the amount of time in years
n is the number of times interest is compounded in one year, for
example:
if interest is compounded annually then n = 1
if interest is compounded quarterly then n = 4
if interest is compounded monthly then n = 12
http://www.algebralab.org/Word/Word.aspx?file=Algebra_InterestII.xml

Problem of Compound Interest (2)
Suppose Karen has $1000 that she invests in an account that pays
3.5% interest compounded quarterly. How much money does Karen
have at the end of 5 years?
P = $1000 (the amount being invested)
r = 0.035 (the interest rate, 3.5% as a decimal)
n = 4 (interest is compounded quarterly, i.e. four times per year)
t = 5 (the money will stay in the account for 5 years)
Hence, A is
\( A = 1000 \left(1 + \frac{0.035}{4}\right)^{4 \cdot 5} = 1000 \cdot 1.00875^{20} \approx 1190.34 \)
So after 5 years, the account is worth $1190.34.
Using n as the number of compounding
intervals, the limit for large n is the number e:
\( \lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^n = e \)
More generally, an account that starts at $1, and yields (1+R) dollars
at simple interest, will yield \( e^R \) dollars with continuous compounding.
http://www.algebralab.org/Word/Word.aspx?file=Algebra_InterestII.xml
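Karen's example and the limit towards e can be reproduced in R. This is a sketch using the values from the slide; the object names (t_years, n_values, approx_e) are chosen here for illustration.

```r
P <- 1000        # principal
r <- 0.035       # yearly interest rate as a decimal
n <- 4           # compounding intervals per year
t_years <- 5     # time in years

A <- P * (1 + r / n)^(n * t_years)
round(A, 2)      # amount after 5 years

# (1 + 1/n)^n approaches Euler's number as n grows
n_values <- c(1, 10, 100, 1e4, 1e6)
approx_e <- (1 + 1 / n_values)^n_values
cbind(n_values, approx_e, e = exp(1))
```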
Continuously Paid Interest
The formula to calculate continuously paid interest is:
• \( A = P e^{rt} \)
A: amount of money after a certain time
P: principal, the amount of money you start with
r: yearly interest rate (use as decimal)
t: time in years
e: Euler’s number
• Problem: Egon, Bob, Kim, Ann, and Mette have
123 945.-/23 567.-/467 734.-/78 945.-/41 296.- US$ in their
accounts, respectively.
• They deposited 89 321.-/19 999.-/456 621.-/74 333.-/39 879.- US$
25, 32, 47, 14, and 8 months ago, respectively, in an account at Bank Utopia.
• Generate a dataframe for this situation (first build the vectors and
then the dataframe from these: Dalgaard p. 20)
• Use vector calculation to identify the yearly interest rate that the five
investors got, assuming continuous compounding of interest.
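One possible way to approach this problem is sketched below. Solving \( A = P e^{rt} \) for r gives \( r = \ln(A/P) / t \), with t in years (months divided by 12). The vector and column names (investor, balance, deposit, months, accounts) are assumptions made for this sketch, not names prescribed by the exercise.

```r
# Build the vectors first, then the dataframe
investor <- c("Egon", "Bob", "Kim", "Ann", "Mette")
balance  <- c(123945, 23567, 467734, 78945, 41296)   # current amount A in US$
deposit  <- c(89321, 19999, 456621, 74333, 39879)    # principal P in US$
months   <- c(25, 32, 47, 14, 8)                     # time since deposit

accounts <- data.frame(investor, balance, deposit, months)

# Vector calculation: r = ln(A / P) / t, with t in years
accounts$years <- accounts$months / 12
accounts$rate  <- log(accounts$balance / accounts$deposit) / accounts$years
accounts
```

Multiplying the rate column by 100 gives the yearly rate in percent.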
Vectors in R
• Any type of data (number, string, or logical value) can be combined into a
vector, which (here) means an ordered one-row/one-column array of data of the same type.
• Vectors are created with the “c” command (for concatenate):
> work_days <- c("Monday","Tuesday","Wednesday","Thursday","Friday")
> work_days
[1] "Monday"    "Tuesday"   "Wednesday" "Thursday"  "Friday"
• The order of elements is important: linking elements of vectors depends
completely on their position within the vector.
• Vectors that contain numbers (integer, real, or complex) can be linked by
mathematical operations to give a new vector:
> weight <- c(67.4,56.8,66.2,79.4)
> height <- c(1.68,1.62,1.74,1.88)
> BMI <- weight/height^2
> BMI
[1] 23.88039 21.64304 21.86550 22.46492
For a better understanding, go through the “Vectors” exercise at
https://www.datacamp.com/ and follow the instructions in Dalgaard, p. 4/12:
1.1.3 “Vectorized arithmetic” and 1.2.3 “Vectors”.
Vectors in R: Ordering of Ratios
If a vector holds ratio-scale (numeric) values, the “order” function returns the
positions of its elements in ascending order (the inverse of the ranking):
> xv <- c(4,67,34,21,9,89,76)
> xv
[1] 4 67 34 21 9 89 76
> order(xv)
[1] 1 5 4 3 2 7 6
Applying the “order” function within the selection brackets returns the sorted
vector:
> xv_ord <- xv[order(xv)]
> xv_ord
[1] 4 9 21 34 67 76 89
For a better understanding, go through the “Dataframes” exercise at
https://www.datacamp.com/
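The same idea extends to descending order and to sorting a whole dataframe by one of its columns. The sketch below reuses xv from the slide; the small dataframe df and its values are hypothetical.

```r
xv <- c(4, 67, 34, 21, 9, 89, 76)

sort(xv)                                    # same result as xv[order(xv)]
xv_desc <- xv[order(xv, decreasing = TRUE)] # largest value first
xv_desc

# order() is most useful for sorting all rows of a dataframe by one column
df <- data.frame(id = c("a", "b", "c"), score = c(30, 10, 20))  # hypothetical data
df_sorted <- df[order(df$score), ]
df_sorted
```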

Vectors in R: Factors and Levels
• For the analysis of nominal and ordinal categorical
variables, a vector can be converted to a
“factor”. The distinct categories within a factor are called levels:
> survey_vector = c("M", "F", "F", "M", "M")
> survey_vector
[1] "M" "F" "F" "M" "M"
> factor_survey_vector = factor(survey_vector)
> factor_survey_vector
[1] M F F M M
Levels: F M
(The levels are automatically sorted in alphabetical order.)
• The levels can easily be renamed with the “levels” function. The new names
must follow the alphabetically sorted order of the existing levels (here: F, M):
> levels(factor_survey_vector) = c("Female", "Male")
> factor_survey_vector
[1] Male   Female Female Male   Male
Levels: Female Male
Great practical exercise on this (and the next 2 slides): pick Chapter 4, “Factors”, at
https://www.datacamp.com/courses/introduction-to-r
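Because renaming with levels() depends on the alphabetical order of the existing levels, it can be safer to state the mapping explicitly. The sketch below uses the levels and labels arguments of factor() for this; the object name f is chosen for illustration.

```r
survey_vector <- c("M", "F", "F", "M", "M")

# levels: the codes as they occur in the data; labels: the new names, in the same order
f <- factor(survey_vector,
            levels = c("F", "M"),
            labels = c("Female", "Male"))
f
levels(f)
table(f)      # frequency of each level
```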
Vectors in R: Ordering of Levels
In case of ordinal (ranked) levels, the order has to be defined. This is particularly important for
Likert scales, which are key tools in social research:
> increase_tax = c("agree", "strongly disagree", "strongly disagree",
"strongly agree", "indifferent", "disagree", "strongly agree", "indifferent",
"strongly disagree", "strongly disagree", "strongly disagree", "disagree",
"strongly agree")
> factor_increase_tax = factor(increase_tax)
> factor_increase_tax
[1] agree strongly disagree strongly disagree strongly agree indifferent
[6] disagree strongly agree indifferent strongly disagree strongly disagree
[11] strongly disagree disagree strongly agree
Levels: agree disagree indifferent strongly agree strongly disagree
(Levels are automatically sorted in alphabetical order.)
With the levels() command, you can rename all levels within the vector:
> levels(factor_increase_tax) = c("A","D","I","SA","SD")
> factor_increase_tax
[1] A SD SD SA I D SA I SD SD SD D SA
Levels: A D I SA SD
By adding the “ordered = TRUE” argument to the factor call, you can fix the ranking of the levels.
The sequence of the concatenated elements in the “levels” argument enforces the hierarchy;
the values themselves stay unchanged:
> factor_increase_tax = factor(factor_increase_tax, ordered = TRUE, levels =
c("SD","D","I","A","SA"))
> factor_increase_tax
[1] A SD SD SA I D SA I SD SD SD D SA
Levels: SD < D < I < A < SA
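Once the levels are ordered, R can compare and rank the answers. The sketch below builds the ordered factor in a single step from the original answers; the names likert_levels, f, and median_level are assumptions made for this example.

```r
increase_tax <- c("agree", "strongly disagree", "strongly disagree",
                  "strongly agree", "indifferent", "disagree", "strongly agree",
                  "indifferent", "strongly disagree", "strongly disagree",
                  "strongly disagree", "disagree", "strongly agree")

# The sequence given in 'levels' defines the ranking from lowest to highest
likert_levels <- c("strongly disagree", "disagree", "indifferent",
                   "agree", "strongly agree")
f <- factor(increase_tax, levels = likert_levels, ordered = TRUE)

table(f)                    # counts in ranked (not alphabetical) order
f[1] > f[2]                 # comparisons work for ordered factors

# Median answer: take the median of the integer codes and look up its level
median_level <- levels(f)[median(as.integer(f))]
median_level
```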
Matrices and Dataframes in R
• Different vectors (irrespective of the type of data) can be bound to give a
dataframe, which can e.g. contain all results from a study (e.g. the vectors
named “Age”, “Sex”, “SES”, “Height”, “Body_mass” etc.)
[data.frame(V1,V2,V3,…), cbind(V,DF) and rbind(V,DF)]
• The dataframe can have a name (“Met_Study”) and the vectors are then
called e.g. “Met_Study$Age”.
• A rectangular arrangement of vectors of the same data type (without headers) is
called a matrix.
• If you wish to work with the vector names only (without the name of the
dataframe before the vector), use the attach command (detach to
undo this).
• A vector can (but does not have to) have a header, which can be
assigned with the “names” function.
For a better understanding, go through the “Matrices” exercise at
https://www.datacamp.com/ and follow the instructions in Dalgaard, p. 16,
1.2.7 “Matrices and arrays”.
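A small dataframe can be built from the weight and height vectors of the BMI example. This is a sketch only: the Sex values are hypothetical, and with() is shown as one alternative to attach/detach.

```r
weight <- c(67.4, 56.8, 66.2, 79.4)
height <- c(1.68, 1.62, 1.74, 1.88)
Sex    <- c("f", "f", "m", "m")          # hypothetical values for illustration

# Bind vectors of different types into a dataframe
Met_Study <- data.frame(Sex, Height = height, Body_mass = weight)
Met_Study$BMI <- Met_Study$Body_mass / Met_Study$Height^2

names(Met_Study)                         # headers of the vectors
dim(Met_Study)                           # rows and columns
Met_Study$BMI                            # one vector of the dataframe

# with() evaluates a call inside the dataframe without attaching it
mean_bmi <- with(Met_Study, mean(BMI))

# Vectors of the same type bound column-wise give a matrix
m <- cbind(height, weight)
is.matrix(m)
```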
Getting Overview of Dataframes in R (1)
• E.g. for the built-in dataframe “mtcars”, you can get an overview of the
header and the tail with head() and tail():
> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
> tail(mtcars)
                mpg cyl  disp  hp drat    wt qsec vs am gear carb
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2

Getting Overview of Dataframes in R (2)
• E.g. for the same dataframe, you can get an overview of the dataframe
structure with str():
> str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...

Terms/Operations of the R Environment
You Should be Familiar with by Now
- for the Compendium, check Dalgaard, App. C, p. 328
Getting info on dataframes/vectors
• dim(DATAFRAME), names(DF)
• attach(DF) and detach(DF)
• head(DF), tail(DF)
• str(DF) or str(DF$V)
• summary(DF)
Compare vectors within DFs
• present$ht_boys > present$ht_girls
Addressing segments of DFs (e.g. DF[1:10,45:66] for rows 1 to 10 and columns 45 to 66)
Generate subsets of DFs
• subset(DF,DF$gender=="male")
Logical operators “and” (&) and “or” (|, ALT 124) for subsets:
• subset(DF,DF$gender=="female" & DF$age>18)
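These calls can be tried on the built-in dataframe mtcars. The following is a sketch; the object names manual, frugal, and either are chosen here for illustration.

```r
dim(mtcars)                  # number of rows and columns
names(mtcars)                # headers of the vectors
summary(mtcars$mpg)          # summary of one vector

mtcars[1:3, 1:4]             # segment: rows 1 to 3, columns 1 to 4

# Subsets: inside subset() the vector names can be used directly
manual <- subset(mtcars, am == 1)                 # cars with manual transmission
frugal <- subset(mtcars, cyl == 4 & mpg > 30)     # "and": both conditions hold
either <- subset(mtcars, cyl == 4 | mpg > 30)     # "or": at least one holds

nrow(manual); nrow(frugal); nrow(either)
```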
Importing Data from Spreadsheets (1)
I would like to import data from a spreadsheet
(e.g. Excel). How can I do this?
Arrange data appropriately in the
spreadsheet, meaning:
• Generate a column with an identifier for
each case (e.g. “Student_1”, “Student_2”, etc.).
• All data measured for one case
are placed in one row (line).
• The first (top) field of each column contains the name of the vector
(= variable measured within the sample, e.g. “Age”, “Sex”, “Height”, …).
• Make sure to mark missing data with an appropriate identifier (e.g.
“NA” for “not available”); don’t simply leave the fields empty!
• Avoid blanks in variable names; use an underscore (“_”) to join
combined variable names (e.g. “Sleep_dur”).
• Choose short names for variables (vectors) and data frames.
Save the spreadsheet
in the text format:
• Open Excel with the
spreadsheet to be imported
into R as the active one.
• Save it with the “Save as …”
function, choosing the
“Tab delimited” option.
• If you continue (click “Yes”
twice) the active spreadsheet
will be saved to a text file.
• Note that, depending on the default settings, decimal separators can be
points or commas. R expects points by default. If commas are used,
change “,” to “.” with the “Replace” function of Excel.
• Blanks in names of variables/cases will give you two separate columns,
one for each part of the name separated by the blank. Use “_”.
Import the spreadsheet (text format) into R:
• Open R.
• Apply the “read.table” function
(e.g. cd <- read.table("C:/Documents and Settings/v41920/My Documents/PHMetropol/Biostatistics/ClinDat_Alc.txt", header=TRUE) )
Note the use of forward slashes instead of backslashes in the path.
• The argument “header=TRUE” assigns the top field of each column as the vector
name.
• You can call the resulting data frame by its name (here: “cd”).
• You can call the single vectors of the created data frame with their names
linked by “$” to the name of the data frame (e.g. “cd$Height”).
• Note that different groups (e.g. “m” and “f” in “cd$Sex”) are part of ONE
vector. This needs to be communicated appropriately to the test to be
applied: use t.test(cd$Height~cd$Sex); the form t.test(cd$Height,cd$Sex)
is meant for the case of 2 separate vectors.
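The import steps can be rehearsed without Excel by writing a small tab-delimited text file and reading it back. Everything in this sketch is hypothetical: the file is created in a temporary location, and the cases, variable names, and values are invented for illustration.

```r
# Hypothetical content of a tab-delimited text file: header row, one row per case
lines <- c("ID\tSex\tHeight\tSleep_dur",
           "Student_1\tf\t1.68\t7.5",
           "Student_2\tm\t1.88\tNA",
           "Student_3\tf\t1.62\t8.0",
           "Student_4\tm\t1.74\t6.5")
path <- tempfile(fileext = ".txt")       # temporary file instead of a real path
writeLines(lines, path)

# header = TRUE: the first row holds the vector names; sep = "\t": tab-delimited
# (for files with decimal commas, read.table() also has the argument dec = ",")
cd <- read.table(path, header = TRUE, sep = "\t")
str(cd)

mean_sleep <- mean(cd$Sleep_dur, na.rm = TRUE)   # "NA" was read as missing value
mean_sleep

# Both groups are part of ONE vector, so the formula form is used
t.test(cd$Height ~ cd$Sex)
```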

Importing Data from URLs (1)
Import a dataframe (R data format, .RData) into R/RStudio:
• Apply the “load(url(LINK))” function for R datasets:
load(url("http://s3.amazonaws.com/assets.datacamp.com/course/dasi/ames.RData"))
• This call does not work with password-protected sites.
• In this case, the name of the dataframe corresponds to the file name (here:
ames). load() restores objects under the names they were saved with; the vectors
are addressed as ames$V unless the dataframe is attached with attach(ames).

Saving Data from R (1)
Save a dataframe from R to a text file:
• This assumes that the data you imported/created in R and would like to save
are in the workspace.
• Apply e.g. the “write.csv” function to save as a comma-delimited file
(e.g. write.csv(present, file="C:/Users/alpa/Documents/present.txt")).
• You can edit the resulting file, under its given name (here: “present.txt”), with
any editing program (Notepad, Excel, etc.).
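A round trip shows what write.csv() produces. In this sketch the dataframe content is hypothetical and a temporary file replaces the real path; row.names = FALSE is an optional argument that leaves out the column of row numbers.

```r
# Hypothetical small dataframe standing in for "present"
present <- data.frame(year  = c(1940, 1941, 1942),
                      boys  = c(10, 12, 11),
                      girls = c(9, 11, 12))

path <- tempfile(fileext = ".csv")       # temporary file instead of a real path
write.csv(present, file = path, row.names = FALSE)

readLines(path)                          # the file as plain, comma-delimited text
present2 <- read.csv(path)               # read the file back into R
present2
```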

Saving Data from R (2)
You can save selected objects from the R workspace (e.g. currently
active dataframes) to disk with the call save; save.image saves the whole current workspace:
e.g. > save(ames, file="C:/Users/alpa/Documents/ames.Rdata")

> ?save
Usage
save(..., list = character(),
     file = stop("'file' must be specified"),
     ascii = FALSE, version = NULL, envir = parent.frame(),
     compress = !ascii, compression_level,
     eval.promises = TRUE, precheck = TRUE)
save.image(file = ".RData", version = NULL, ascii = FALSE,
           compress = !ascii, safe = TRUE)
Arguments
... the names of the objects to be saved
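The interplay of save() and load() can be checked with a small round trip. The dataframe scores and the temporary file in this sketch are hypothetical.

```r
scores <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))   # hypothetical data

path <- tempfile(fileext = ".RData")     # temporary file instead of a real path
save(scores, file = path)                # save the named object to disk

rm(scores)                               # remove it from the workspace
restored <- load(path)                   # load() returns the names of the restored objects
restored
scores                                   # available again under its original name
```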

Graph Building in R
• Histograms:
hist(DF$V)
• Plots of continuous-continuous interactions:
plot(DF$V1, DF$V2) or
plot(DF$V1 ~ DF$V2) (formula form y ~ x: here V2 is on the x-axis)
• Plots of categorical-continuous interactions:
boxplot(DF$V1 ~ DF$V2)
barplot(DF$V1 ~ DF$V2)
• Multiple chart generation
mosaicplot(table(DF$V1, DF$V2))
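The plot types can be tried on the built-in dataframe mtcars. This is a sketch: the choice of variables is arbitrary, and the bar plot is drawn from group means computed with tapply(), which is one of several possible ways to feed barplot().

```r
if (!interactive()) pdf(NULL)            # no plot file is written when run as a script

hist(mtcars$mpg)                         # histogram of one continuous vector

plot(mtcars$wt, mtcars$mpg)              # scatter plot: x first, then y
plot(mpg ~ wt, data = mtcars)            # same plot in formula form (y ~ x)

boxplot(mpg ~ cyl, data = mtcars)        # continuous by categorical

mean_mpg <- tapply(mtcars$mpg, mtcars$cyl, mean)   # mean mpg per cylinder group
barplot(mean_mpg)

counts <- table(mtcars$cyl, mtcars$am)   # contingency table of two categorical vectors
mosaicplot(counts)
```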

2 Equations – 2 Unknowns Exercise
• You are requested to mix soy bean flour and corn (maize) flour to
give 1 kg of a final blend that contains the maximum amount of
protein, but does not exceed a total fat content of 9.5%.
• In which proportions do you have to mix the soy bean flour (fat:
20.7%, protein: 37.8%, USDA 16415) and corn (maize) flour (fat:
3.86%, protein: 9.28%, USDA 20019)?
• What is the final concentration of folate in the mixture? (soy flour:
365 μg/100 g, corn flour: 29 μg/100 g)
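One possible approach is sketched below. Soy flour contains more protein but also more fat than corn flour, so the protein content is highest when the fat limit of 9.5% is reached exactly. This gives two equations for the two unknown amounts: soy + corn = 1 kg and 0.207 soy + 0.0386 corn = 0.095 kg. The object names (coef_matrix, rhs, mix) are assumptions of this sketch, and solve() is only one way to handle the system.

```r
# Rows: total mass (kg) and total fat (kg); columns: soy flour, corn flour
coef_matrix <- matrix(c(1,     1,
                        0.207, 0.0386),
                      nrow = 2, byrow = TRUE)
rhs <- c(1, 0.095)                       # 1 kg blend with 9.5% fat

mix <- solve(coef_matrix, rhs)           # solves coef_matrix %*% mix = rhs
names(mix) <- c("soy", "corn")
mix                                      # kg of each flour in 1 kg of blend

protein <- sum(mix * c(0.378, 0.0928))   # kg protein per kg blend
folate  <- sum(mix * c(365, 29))         # folate in the blend, in μg/100 g
protein; folate
```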
