
Global Nutrition & Health

Elective Module “Applied Biostatistics”

Teacher: Alexandr Parlesak


February 3rd – March 13th, 2020

University College
Copenhagen

72 h Face-to-face learning;
84 h Directed learning;
84 h Autonomous learning

10 ECTS Points
Applied Biostatistics 2020 A. Parlesak 1

The R Environment

Data types and data handling in R – a first overview


Help from the Textbook:

Peter Dalgaard: Introductory Statistics with R, Second Edition (2008)
Copyright © 2008 by Springer


Sources of Help (1)
The command (call) I entered does not work.
What can I do?
• Check the spelling. R is very sensitive to spelling
mistakes and distinguishes upper-case from lower-case letters.
• Check whether you are applying the call appropriately:
use the Compendium in the textbook (Appendix C).
• If you do not have a clue about the function you want to
apply, check the Introduction document installed
along with R, e.g. in the folder
C:\Programmer\R\R-3.6.2\doc\manual\R-Intro.pdf
• Try a simple Google search on your problem: the “R”
community is strong and active.
• Try ?term in the RStudio environment, where term is the unknown
function name (e.g. ?mean).
Sources of Help (2)
The command (call) I entered does not work.
What can I do?

• Check “Appendix C. Compendium” in the textbook.

• If the input prompt in the R environment (“>”) has changed
to a “+” and R does not react to your calls, then
your last call is not finished yet (frequently a “)” is
missing). Complete the call or, in RStudio, press Esc to cancel it.

• If it’s a long call, use the “arrow up” key to
recall the last command line(s).
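The help tools mentioned on these two slides can also be tried from a script. The following is a minimal sketch; the search pattern "^read\\." is only an illustrative choice and not part of the course material.

```r
# In an interactive session, ?read.table opens the help page of a
# function whose name is already known.

# apropos() lists all objects whose names match a pattern. This helps
# when only part of a function name is remembered.
hits <- apropos("^read\\.")
print(hits)

# args() prints the arguments of a function without opening the help page.
print(args(read.table))
```

In an interactive session, example(mean) additionally runs the examples from the help page of mean().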



Basic Data Types in R
• A number (integer, real, or complex) is always directly
understood by R, so you can use R as an
‘overgrown’ calculator.
• Text (“string”, “character”) is written in quotation marks. When data
are imported, entries that R cannot read as numbers are treated as text.
This also happens if a comma is used as the decimal separator
(e.g. “3,5” instead of 3.5). Any mathematical operation on strings
returns an error message.
• Logical values (TRUE, FALSE) are a separate data type (“logicals”).
They are recognized as true/false statuses only if written in capital
letters and without quotation marks.
For a better understanding of the following chapters, go through
the “Intro to basics” exercises at https://www.datacamp.com/
(you have to register as a user).
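The data types described above can be inspected with class(). The sketch below uses made-up values; the variable names are hypothetical and chosen only for illustration.

```r
# One value of each basic type; class() reports how R stores it.
x_int  <- 5L        # integer (the L suffix marks an integer)
x_real <- 3.5       # real number (class "numeric")
x_cplx <- 2 + 3i    # complex number
x_text <- "3,5"     # quoted, so it is stored as text
x_log  <- TRUE      # logical: capital letters, no quotation marks

classes <- c(class(x_int), class(x_real), class(x_cplx),
             class(x_text), class(x_log))
print(classes)

# Arithmetic on text fails; tryCatch() captures the error message
# instead of stopping the script.
msg <- tryCatch(x_text + 1, error = function(e) conditionMessage(e))
print(msg)

# Replace the decimal comma by a point, then convert the text to a number.
x_fixed <- as.numeric(sub(",", ".", x_text, fixed = TRUE))
print(x_fixed + 1)
```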
Basic Terms/Operations
in the R Environment

• Basic arithmetic and functions:
sqrt(x), log(x), exp(x), min(V), max(V), length(V)
• Random numbers from a normal distribution: rnorm(n)
• The assignment operator (<-)
• Getting help (e.g. ?read.table)
• Vector generation:
c(x1, x2, x3, …, xn)
• Calculations with vectors (BMI example)
• sum(V), mean(V), sd(V), median(V)
• Random picking function: sample(DF$V)
(etc.; check App. C, p. 328)
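A short sketch of the listed functions applied to made-up numbers (the values of x and V are assumptions for illustration only):

```r
x <- 16
V <- c(2, 4, 6, 8)     # hypothetical example vector

print(sqrt(x))         # square root
print(exp(log(x)))     # exp() reverses the natural logarithm
print(c(min(V), max(V), length(V)))
print(c(sum(V), mean(V), median(V)))
print(sd(V))           # sample standard deviation

set.seed(1)            # makes the random results reproducible
r <- rnorm(5)          # five draws from a standard normal distribution
s <- sample(V)         # the elements of V in random order
print(r)
print(s)
```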



Some Basics: Calculating with Exponents and Logarithms

o The exponent (power, x) tells you how often a number (base, b) is
multiplied by itself,
e.g. \(10^4 = 10 \times 10 \times 10 \times 10 = 10\,000\).
o The logarithm (log) is the function that returns the exponent for a given base:
\[ x = \log_b(a) \iff b^x = a \qquad (a, b > 0 \text{ and } b \neq 1) \]
- decadic logarithm: base 10 (lg): \(\lg(10^4) = 4\) (“R”: log10(x))
- natural logarithm: base e (ln) (“R”: log(x))
- binary logarithm: base 2 (lb) (“R”: log2(x))
o Commonly used bases are 10, Euler’s number (e = 2.71828...), and 2
(=> problem of compound interest)
o Besides this, the following rules are valid:
· \(\log(a) + \log(b) = \log(ab)\)
· \(\log(a) - \log(b) = \log(a/b)\)
· \(n \log(a) = \log(a^n)\)
· \(\log_b(x) = \log_b(g) \, \log_g(x)\)
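The rules above can be checked numerically in R. This is a sketch with arbitrarily chosen numbers (u, v and n are assumptions); all.equal() is used because floating-point results may differ in the last digits.

```r
u <- 8
v <- 32
n <- 3

# Sum, difference and power rule, here with the natural logarithm
rule_sum  <- all.equal(log(u) + log(v), log(u * v))
rule_diff <- all.equal(log(u) - log(v), log(u / v))
rule_pow  <- all.equal(n * log(u), log(u^n))

# Change of base with b = 10, g = 2, x = 8
rule_base <- all.equal(log10(8), log10(2) * log2(8))

print(c(rule_sum, rule_diff, rule_pow, rule_base))

# The three logarithm functions named on the slide
print(log10(10^4))       # decadic logarithm
print(log(exp(1)))       # natural logarithm
print(log2(8))           # binary logarithm
print(log(8, base = 2))  # any other base via the "base" argument
```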
Problem of Compound Interest (1)

o The formula for interest that is compounded is
\[ A = P \left(1 + \frac{r}{n}\right)^{nt} \]
o A represents the amount of money after a certain amount of time
P represents the principal, i.e. the amount of money you start with
r represents the yearly interest rate and is always written as a decimal
t represents the amount of time in years
n is the number of times interest is compounded in one year, for example:
if interest is compounded annually then n = 1
if interest is compounded quarterly then n = 4
if interest is compounded monthly then n = 12

Source: http://www.algebralab.org/Word/Word.aspx?file=Algebra_InterestII.xml


Problem of Compound Interest (2)
Suppose Karen has $1000 that she invests in an account that pays
3.5% interest compounded quarterly. How much money does Karen
have at the end of 5 years?

P = $1000 (the amount being invested)
r = 0.035 (the interest rate, decimal of 3.5%)
n = 4 (interest is compounded quarterly, i.e. four times per year)
t = 5 (the money will stay in the account for 5 years)
Hence, A is
\[ A = 1000 \left(1 + \frac{0.035}{4}\right)^{4 \cdot 5} \approx 1190.34 \]
So after 5 years, the account is worth $1190.34.

Using n as the number of compounding intervals, the limit for large n is the number e:
\[ \lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^{n} = e \]
More generally, an account that starts at $1 and yields (1+R) dollars
at simple interest will yield \(e^R\) dollars with continuous compounding.
Source: http://www.algebralab.org/Word/Word.aspx?file=Algebra_InterestII.xml
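Karen's example can be reproduced in R with the numbers from the slide. The second part is a sketch of the limit towards e; the value n_large is an arbitrary choice.

```r
# Compound interest for Karen's account
P <- 1000     # principal in dollars
r <- 0.035    # yearly interest rate as a decimal
n <- 4        # compounding intervals per year (quarterly)
t <- 5        # years

A <- P * (1 + r / n)^(n * t)
print(round(A, 2))

# For a growing number of compounding intervals, (1 + 1/n)^n approaches e
n_large <- 1e6
approx_e <- (1 + 1 / n_large)^n_large
print(c(approx_e, exp(1)))
```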
Continuously Paid Interests
The formula to calculate continuously compounded interest is:
• \(A = P e^{rt}\)

A: amount of money after a certain time
P: principal, i.e. the amount of money you start with
r: yearly interest rate (use as decimal)
t: time in years
e: Euler’s number
• Problem: Egon, Bob, Kim, Ann, and Mette have
123 945 / 23 567 / 467 734 / 78 945 / 41 296 US$ in their
accounts, respectively.
• They deposited 89 321 / 19 999 / 456 621 / 74 333 / 39 879 US$
25, 32, 47, 14, and 8 months ago, respectively, in an account at Bank Utopia.
• Generate a dataframe for this situation (first build the vectors and
then the dataframe from these: Dalgaard p. 20)
• Use vector calculation to identify the yearly interest rate that the five
investors received, assuming continuous compounding of interest.
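One possible way to approach this problem is sketched below. It assumes that the formula is solved for r, which gives r = ln(A/P)/t, and that months are converted to years by dividing by 12. The vector and dataframe names are hypothetical.

```r
# Build the vectors first, then the dataframe
investor <- c("Egon", "Bob", "Kim", "Ann", "Mette")
balance  <- c(123945, 23567, 467734, 78945, 41296)   # A, in US$
deposit  <- c(89321, 19999, 456621, 74333, 39879)    # P, in US$
months   <- c(25, 32, 47, 14, 8)

accounts <- data.frame(investor, balance, deposit, months)

# Vector calculation: A = P * exp(r * t)  =>  r = log(A / P) / t
accounts$years <- accounts$months / 12
accounts$rate  <- log(accounts$balance / accounts$deposit) / accounts$years

print(accounts)
```

Inserting the calculated rates back into the formula A = P e^{rt} should return the current balances, which is a simple check of the result.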
Vectors in R
• Any type of data (number, string, or logical variable) can be bound to a
vector, which (here) means an ordered one-line/one-column array of data.
• Vectors are created with the “c” command (for concatenate):
> work_days <- c("Monday","Tuesday","Wednesday","Thursday","Friday")
> work_days
[1] "Monday"    "Tuesday"   "Wednesday" "Thursday"  "Friday"
• The order of elements is important – linking elements of vectors depends
completely on their position within the vector.
• Vectors that contain numbers (integers, real, or complex) can be linked by
mathematical operations to give a new vector:
> weight <- c(67.4,56.8,66.2,79.4)
> height <- c(1.68,1.62,1.74,1.88)
> BMI=weight/height^2
> BMI
[1] 23.88039 21.64304 21.86550 22.46492
For a better understanding, go through the “Vectors” exercise at
https://www.datacamp.com/ and follow the instructions in Dalgaard, p. 4/12
– 1.1.3 “Vectorized arithmetic” and 1.2.3 “Vectors”.
Vectors in R: Ordering of Ratios
If a vector contains numeric (ratio-scale) values, the “order” function returns
the positions of its elements in ascending order of their values:
> xv<-c(4,67,34,21,9,89,76)
> xv
[1] 4 67 34 21 9 89 76
> order(xv)
[1] 1 5 4 3 2 7 6

Applying the “order” function within the selection brackets returns the sorted
vector:
> xv_ord<-xv[order(xv)]
> xv_ord
[1] 4 9 21 34 67 76 89
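The positions returned by order() can also be used to sort a second vector in the same way, or to sort in descending order. A small sketch follows; the vector "labels" is a made-up companion vector and not part of the slide.

```r
xv     <- c(4, 67, 34, 21, 9, 89, 76)
labels <- c("a", "b", "c", "d", "e", "f", "g")   # hypothetical labels

idx <- order(xv)                 # positions of the elements, smallest first
labels_sorted <- labels[idx]     # labels arranged like the sorted values
print(labels_sorted)

xv_desc <- xv[order(xv, decreasing = TRUE)]      # largest value first
print(xv_desc)

# sort() gives the same result as indexing with order()
print(identical(sort(xv), xv[idx]))
```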

For a better understanding, go through the “Dataframes” exercise at
https://www.datacamp.com/



Vectors in R: Factors and Levels
• For the analysis of nominal and ordinal categorical variables, a vector
can be converted to a “factor”. The distinct categories within a factor
are called levels:
> survey_vector = c("M", "F", "F", "M", "M")
> survey_vector
[1] "M" "F" "F" "M" "M"
> factor_survey_vector = factor(survey_vector)
> factor_survey_vector
[1] M F F M M
Levels: F M
(levels are automatically sorted in alphabetical order)
• The levels can easily be renamed with the “levels” function. The new names
must follow the alphabetical order of the existing levels (here: F, then M):
> levels(factor_survey_vector) = c("Female", "Male")
> factor_survey_vector
[1] Male   Female Female Male   Male
Levels: Female Male
Great practical exercise on this (and the next 2 slides): pick Chapter 4 “Factors” at
https://www.datacamp.com/courses/introduction-to-r
Vectors in R: Ordering of Levels
In case of ordinal (ranked) levels, the order has to be defined. This is particularly important for
Likert scales, which are key tools in social research:
> increase_tax = c("agree", "strongly disagree", "strongly disagree",
"strongly agree", "indifferent", "disagree", "strongly agree", "indifferent",
"strongly disagree", "strongly disagree", "strongly disagree", "disagree",
"strongly agree")
> factor_increase_tax = factor(increase_tax)
> factor_increase_tax
[1] agree strongly disagree strongly disagree strongly agree indifferent
[6] disagree strongly agree indifferent strongly disagree strongly disagree
[11] strongly disagree disagree strongly agree
Levels: agree disagree indifferent strongly agree strongly disagree
(Levels are automatically sorted in alphabetical order)
With the levels() command, you can rename all levels within the vector:
> levels(factor_increase_tax) = c("A","D","I","SA","SD")
> factor_increase_tax
[1] A SD SD SA I D SA I SD SD SD D SA
Levels: A D I SA SD

By adding the “ordered = TRUE” argument to the factor call, you can fix the ranking of the levels.
The hierarchy follows the sequence of the concatenated elements given in “levels”:
> factor_increase_tax = factor(factor_increase_tax, ordered = TRUE, levels =
c("SD","D","I","A","SA"))
> factor_increase_tax
[1] A SD SD SA I D SA I SD SD SD D SA
Levels: SD < D < I < A < SA
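Once the levels are ordered, R can compare and count them along the scale. The sketch below starts directly from the abbreviated answers of the slide; the object names "answers" and "likert" are hypothetical.

```r
answers <- c("A", "SD", "SD", "SA", "I", "D", "SA", "I",
             "SD", "SD", "SD", "D", "SA")

likert <- factor(answers, ordered = TRUE,
                 levels = c("SD", "D", "I", "A", "SA"))

# Frequencies are reported in the defined order, not alphabetically
counts <- table(likert)
print(counts)

# Comparisons are only meaningful for ordered factors
n_agree <- sum(likert >= "A")    # answers "agree" or "strongly agree"
print(n_agree)

# The median category, found via the integer codes of the levels
median_level <- levels(likert)[median(as.integer(likert))]
print(median_level)
```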
Matrices and Dataframes in R
• Different vectors of equal length (irrespective of the type of data) can be bound to give a
dataframe, which can e.g. contain all results from a study (e.g. the vectors
named “Age”, “Sex”, “SES”, “Height”, “Body_mass” etc.):
data.frame(V1,V2,V3,…), cbind(V,DF) and rbind(V,DF)
• The dataframe can have a name (“Met_Study”) and the vectors are then
called e.g. “Met_Study$Age”.
• A rectangular arrangement of values of the same data type (in rows and
columns) is called a matrix.
• If you wish to operate with the vector names only (without the name of the
dataframe before the vector), use the attach command (detach to
undo this).
• The elements of a vector can (but do not have to) have names, which can be
assigned with the “names” function. Applied to a dataframe, “names” returns
or sets the column headers.

For a better understanding, go through the “Matrices” exercise at
https://www.datacamp.com/ and follow the instructions in Dalgaard, p. 16 –
1.2.7 “Matrices and arrays”.
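A minimal sketch of these operations, using the vector names from the slide but made-up values for three imaginary participants:

```r
Age    <- c(23, 31, 27)
Sex    <- c("f", "m", "f")
Height <- c(1.68, 1.82, 1.74)

# Bind vectors of different types to a dataframe
Met_Study <- data.frame(Age, Sex, Height)

# Add a further vector as a new column
Met_Study <- cbind(Met_Study, Body_mass = c(61.2, 80.5, 66.0))
print(Met_Study)

# Address a single vector via the dataframe name
print(mean(Met_Study$Age))

# Vectors of the same data type can be bound to a matrix
m <- cbind(Age, Height)
print(m)

# Names for the elements of a vector
names(Height) <- c("P1", "P2", "P3")
print(Height)
```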
Getting Overview of Dataframes in R (1)
• E.g. for the built-in dataframe “mtcars”, you can get an overview of the
header and the tail with head() and tail():
> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

> tail(mtcars)
                mpg cyl  disp  hp drat    wt qsec vs am gear carb
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2



Getting Overview of Dataframes in R (2)

• E.g. for the same dataframe, you can get an overview of the dataframe
structure with str():
> str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...



Terms/Operations of the R Environment
You Should be Familiar with by Now
- for the Compendium, check Dalgaard, App. C, p. 328
Getting info on dataframes/vectors
• dim(DATAFRAME), names(DF)
• attach(DF) and detach(DF)
• head(DF), tail(DF)
• str(DF) or str(DF$V)
• summary(DF)
Compare vectors within DFs
• present$ht_boys > present$ht_girls
Addressing segments of DFs by rows and columns (e.g. DF[1:10,45:66])
Generate subsets of DFs
• subset(DF,DF$gender=="male")
Logical operators “and” (&) and “or” (|, ALT 124) for subsets:
• subset(DF,DF$gender=="female" & DF$age>18)
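The calls listed above can be tried on the built-in dataframe mtcars, which needs no import. This is only a sketch; the chosen conditions (transmission type, horsepower, number of cylinders) are arbitrary examples.

```r
# Size and column names of the dataframe
print(dim(mtcars))
print(names(mtcars))

# Address a segment: rows 1 to 3, columns 1 to 4
print(mtcars[1:3, 1:4])

# Compare two vectors element by element; the result is a logical vector
heavy_and_strong <- mtcars$wt > 3 & mtcars$hp > 150
print(sum(heavy_and_strong))

# Subsets with one condition, with "and" (&), and with "or" (|)
manual        <- subset(mtcars, am == 1)
manual_strong <- subset(mtcars, am == 1 & hp > 150)
four_or_six   <- subset(mtcars, cyl == 4 | cyl == 6)

print(c(nrow(manual), nrow(manual_strong), nrow(four_or_six)))
```

Inside subset(), the vectors can be named without the dataframe prefix, which keeps the conditions short.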
Importing Data from Spreadsheets (1)
I would like to import data from a spreadsheet
(e.g. Excel). How can I do this?
Arrange the data appropriately in the
spreadsheet, meaning:
• Generate a column with an identifier for
each case (e.g. “Student_1”, “Student_2”, etc.).
• Put all data measured for one case
in one row (line).
• The first (top) field of each column contains the name of the vector
(= variable measured within the sample, e.g. “Age”, “Sex”, “Height”, …).
• Make sure to mark missing data with an appropriate identifier (e.g.
“NA” for “not available”); do not simply leave the fields empty!
• Avoid blanks in variable names; use an underscore (“_”) to join
combined variable names (e.g. “Sleep_dur”).
• Choose short names for variables (vectors) and data frames.
Importing Data from Spreadsheets (2)

Save the spreadsheet in text format:
• Open Excel with the
spreadsheet to be imported
into R as the active sheet.
• Save it with the “Save as …”
function, choosing the
“Tab delimited” option.
• If you continue (click “Yes”
twice), the active spreadsheet
will be saved to a text file.
• Note that, depending on the default settings, decimal separators can be
points or commas. R expects points by default. If commas are used,
change “,” to “.” with the “Replace” function of Excel (alternatively,
read.table accepts the argument dec=",").
• Blanks in names of variables/cases will give you two separate columns,
one for each part of the name separated by the blank. Use “_” instead.
Importing Data from Spreadsheets (3)
Import the spreadsheet (text format) into R:
• Open R.
• Apply the “read.table” function
(e.g. cd<-read.table("C:/Documents and Settings/v41920/My Documents/PHMetropol/Biostatistics/ClinDat_Alc.txt",header=T) )
Note the use of forward slashes instead of backslashes in the path.
• The argument “header=T” takes the top field of each column as the vector
name.
• You can call the resulting data frame by its name (here: “cd”).
• You can call the single vectors of the created data frame by their names
linked by “$” to the name of the data frame (e.g. “cd$Height”).
• Note that different groups (e.g. “m” and “f” in “cd$Sex”) are part of ONE
vector. This needs to be communicated appropriately to the test to be
applied: use t.test(cd$Height~cd$Sex) here; t.test(cd$Height,cd$Sex) would be
the call for the case of 2 separate numeric vectors.
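The import steps can be rehearsed without Excel. The sketch below writes a small tab-delimited text file to a temporary location and reads it back; the file content and the six imaginary students are assumptions for illustration, and sep = "\t" is set explicitly because the file is tab-delimited.

```r
# Create a small tab-delimited text file like the one exported from Excel
tmp <- tempfile(fileext = ".txt")
writeLines(c("ID\tSex\tHeight",
             "Student_1\tf\t1.68",
             "Student_2\tm\t1.82",
             "Student_3\tf\tNA",
             "Student_4\tm\t1.79",
             "Student_5\tf\t1.71",
             "Student_6\tm\t1.86"), tmp)

# Import: the top row supplies the vector names, "NA" marks missing data
cd <- read.table(tmp, header = TRUE, sep = "\t")
str(cd)

# Both groups are part of ONE vector, so the formula notation is used
res <- t.test(Height ~ Sex, data = cd)
print(res)

unlink(tmp)   # remove the temporary file
```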



Importing Data from URLs (1)

Import a dataframe (RData format) into R/R Studio:

• Apply the “load(url(LINK))” function for R datasets:

load(url("http://s3.amazonaws.com/assets.datacamp.com/course/dasi/ames.RData"))

Note the use of forward slashes in the link.

• This call does not work with password-protected sites.

• load() restores the saved objects under the names they were stored with.
In this case, the name of the dataframe corresponds to the file name (here:
ames). The dataframe is placed in the workspace; to call its vectors without
the name of the dataframe, apply attach(ames).



Saving Data from Spreadsheets (1)

Save a dataframe from R to a text file:

• This assumes that the dataframe you imported/created in R and would like
to save is in the workspace.

• Apply e.g. the “write.csv” function to save it as a comma-delimited file
(e.g. write.csv(present, file="C:/Users/alpa/Documents/present.txt")).

Note the use of forward slashes instead of backslashes in the path.

• You can edit the resulting file under its given name (here: “present.txt”) with
any editing program (Notepad, Excel, etc.).
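A round trip with write.csv and read.csv shows that nothing is lost when saving. In this sketch, the content of “present” is made up and a temporary file replaces the path on the slide; row.names = FALSE is an optional choice that omits the row numbers from the file.

```r
# Hypothetical stand-in for the dataframe "present"
present <- data.frame(year  = c(2001, 2002, 2003),
                      boys  = c(10, 12, 11),
                      girls = c(9, 13, 10))

out <- tempfile(fileext = ".csv")
write.csv(present, file = out, row.names = FALSE)

# The file is plain text and can be inspected line by line
print(readLines(out))

# Reading it back returns the same columns and values
back <- read.csv(out)
print(back)

unlink(out)   # remove the temporary file
```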



Saving Data from Spreadsheets (2)

You can save selected objects from the current workspace of R (e.g. currently
active dataframes) to disk with the call save. The call save.image saves the
whole workspace:

e.g. > save(ames, file="C:/Users/alpa/Documents/ames.Rdata")

> ?save
Usage
save(..., list = character(),
     file = stop("'file' must be specified"),
     ascii = FALSE, version = NULL, envir = parent.frame(),
     compress = !ascii, compression_level,
     eval.promises = TRUE, precheck = TRUE)

save.image(file = ".RData", version = NULL, ascii = FALSE,
           compress = !ascii, safe = TRUE)
Arguments
... the names of the objects to be saved
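The interplay of save and load can be sketched as follows. The dataframe “ames_demo” and its two rows are made up for illustration, and a temporary file is used instead of a fixed path.

```r
# Hypothetical small dataframe to be saved
ames_demo <- data.frame(price = c(215000, 105000),
                        area  = c(1656, 896))

f <- tempfile(fileext = ".RData")
save(ames_demo, file = f)

# Remove the object from the workspace, then restore it from disk
rm(ames_demo)
restored <- load(f)     # load() returns the names of the restored objects
print(restored)
print(ames_demo)

unlink(f)               # remove the temporary file
```

The object comes back under the name it had when it was saved, independent of the file name.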



Graph Building in R

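• Histograms:
hist(DF$V)
• Plots of continuous-continuous interactions:
plot(DF$V1, DF$V2) or
plot(DF$V1~ DF$V2)
(in the formula notation, the vector after “~” is drawn on the x-axis)
• Plots of categorical-continuous interactions:
boxplot(DF$V1~ DF$V2)
barplot(tapply(DF$V1, DF$V2, mean))
(barplot expects one bar height per category, e.g. the group means)
• Multiple chart generation
mosaicplot(table(DF$V1, DF$V2))
(mosaicplot expects a contingency table)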
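The graph types can be tried with the built-in dataframe mtcars. This is a sketch with arbitrarily chosen variables; pdf(NULL) opens a graphics device that writes no file so that the script also runs without a display, and it can be omitted (together with dev.off()) in an interactive RStudio session.

```r
pdf(NULL)   # graphics device without output file; omit when working interactively

# Histogram of one continuous vector
h <- hist(mtcars$mpg)

# Continuous-continuous: both notations put weight on the x-axis
plot(mtcars$wt, mtcars$mpg)
plot(mpg ~ wt, data = mtcars)

# Categorical-continuous: fuel consumption by number of cylinders
b <- boxplot(mpg ~ cyl, data = mtcars)
group_means <- tapply(mtcars$mpg, mtcars$cyl, mean)
barplot(group_means)

# Categorical-categorical: cylinders by transmission type
mosaicplot(table(mtcars$cyl, mtcars$am))

dev.off()   # close the graphics device
```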



2 Equations – 2 Unknowns Exercise

• You are requested to mix soy bean flour and corn (maize) flour to
give 1 kg of a final blend that contains the maximum amount of
protein, but does not exceed a total fat content of 9.5%.
• In which proportions do you have to mix the soy bean flour (fat:
20.7%, protein: 37.8%, USDA 16415) and corn (maize) flour (fat:
3.86%, protein: 9.28%, USDA 20019)?
• What is the final concentration of folate in the mixture? (soy flour:
365 μg/100 g, corn flour: 29 μg/100 g)
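One possible approach is sketched below. It assumes that the protein maximum is reached exactly at the fat limit, because soy flour contains both more protein and more fat than corn flour, so as much soy flour as the fat limit allows is used. This turns the task into two equations with two unknowns, which solve() handles; the object names are hypothetical.

```r
# Unknowns: s = kg soy flour, m = kg corn (maize) flour in 1 kg of blend
# Equation 1 (mass): 1     * s + 1      * m = 1
# Equation 2 (fat):  0.207 * s + 0.0386 * m = 0.095
lhs <- matrix(c(1,     1,
                0.207, 0.0386), nrow = 2, byrow = TRUE)
rhs <- c(1, 0.095)

mix <- solve(lhs, rhs)
names(mix) <- c("soy", "corn")
print(mix)                                  # kg of each flour per kg of blend

# Weighted averages for the blend
protein_pct <- sum(mix * c(37.8, 9.28))     # protein in %
folate      <- sum(mix * c(365, 29))        # folate in microgram per 100 g
print(c(protein_pct, folate))
```

A quick check is to insert the result into both equations: the two amounts should add up to 1 kg and the fat content of the blend should be 9.5%.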
