
Advanced R

Chapter 2.5: Parsing & Metaprogramming

Daniel Horn & Sheila Görz

Summer Semester 2022

Daniel Horn & Sheila Görz Advanced R Summer Semester 2022 1 / 67


Chapter overview

1 The three variants of parsing

2 The internal representation - Expressions

3 Valid inputs in R

4 Separators

5 Metaprogramming

6 Using expressions

7 Summary and conclusion



How does R evaluate commands?

Before we start programming, we first have to ask ourselves: what happens internally in R when a command is entered into the console or when R commands are sourced from a file? After all, program code is just text at first. So how do we get from this text to actual commands?
First, the textual representation of the R code is transformed into an internal form → Parsing
This internal form is passed to the 'R evaluator', which then evaluates the code.
What does such an internal code representation look like? How is it stored? What actually happens during parsing? How many different types of parsing are there?
Let's dive into all these questions now.



General remarks on parsing

In the context of programming, parsing describes the segmentation of program code into smaller chunks so that the 'evaluator' can evaluate it correctly.
The parser transforms the code into a tree structure for the evaluator. This tree is also called the syntax tree or parse tree.
The parser also checks the code for syntax errors and prevents the evaluation of syntactically incorrect code.
There are three different variants of parsing in R:
  Read - eval - print.
  Parsing code from text files.
  Parsing character strings.
Objects that have been parsed but not yet evaluated are represented by the expression type.



The three variants of parsing


The three variants of parsing I

1. Read - eval - print

This is the standard procedure in R: The parser reads the (console) input until a complete statement is provided. This statement is immediately transformed into the internal form and passed to the evaluator, which then evaluates the statement and prints the result to the console.
When the console displays a '>', the previous statement is complete and R is waiting for a new one.
When the console displays a '+', the statement is not yet complete and R is waiting for its continuation.
If a syntax error occurs during parsing, the parsing process is aborted and the parser is restored to its original state.


The three variants of parsing II

Brief detour: When is an R statement complete?

Every opening parenthesis has a closing counterpart.
Every unary operator has its argument.
Every binary operator has its second argument.
Every character string is closed with the same quotation mark that opened it.
Input also ends when 'enter' is pressed and the (unfinished) statement contains a syntax error: the parser then reports the error instead of waiting for more input.
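
The rules above can be probed with parse(text = ...). The following is only a sketch: statement_status() is a hypothetical helper, and it assumes that R reports its parser messages in English.

```r
# Hypothetical helper: classify a piece of console-style input.
# Assumption: parser error messages are in English.
statement_status <- function(code) {
  tryCatch({
    parse(text = code)
    "complete"
  }, error = function(e) {
    if (grepl("unexpected end of input", conditionMessage(e), fixed = TRUE)) {
      "incomplete"       # the console would show the '+' prompt here
    } else {
      "syntax error"     # the console would abort with an error
    }
  })
}

statement_status("sum(1, 2)")   # "complete"
statement_status("sum(1,")      # "incomplete" (closing parenthesis missing)
statement_status("5 +")         # "incomplete" (second operand missing)
statement_status("5 + )")       # "syntax error"
```

The console does essentially the same bookkeeping: as long as the input is merely incomplete, it keeps reading lines.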


The three variants of parsing III

2. Parsing code from text files

If code is read from a text file (.txt, .R, etc.), the file is parsed completely before anything is evaluated.
This means: If a file containing code is read (e.g. using the function source()) and there is a syntax error in line x, then the preceding lines 1 to x-1 are not evaluated either.
To explicitly parse code, R offers the parse() function. Parsed statements can be evaluated using eval().
Combining parse() and eval() mimics the source() function in a very simplified way.
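
As a rough illustration of that last point, here is a minimal sketch of a source()-like function. my_source() and the temporary script file are hypothetical names used only for this example; the real source() does considerably more.

```r
# Write a small script to a temporary file (for illustration only)
script <- tempfile(fileext = ".R")
writeLines(c("x <- 5 + 3", "y <- 1:10", "z <- sum(4, 6, 9)"), script)

# Hypothetical, heavily simplified stand-in for source()
my_source <- function(file, envir = parent.frame()) {
  exprs <- parse(file)             # the whole file is parsed first ...
  for (i in seq_along(exprs)) {    # ... then the statements are evaluated in order
    eval(exprs[[i]], envir = envir)
  }
  invisible(NULL)
}

env <- new.env()
my_source(script, envir = env)
ls(env)   # "x" "y" "z"
```

Because parse() runs before the loop, a syntax error anywhere in the file stops the function before a single statement is evaluated.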


The three variants of parsing IV

Example:
# The file parsing_example.txt contains the following statements:
# x <- 5 + 3
# y <- 1:10
# z <- sum(4, 6, 9)
parse("parsing_example.txt")

## expression(x <- 5 + 3, y <- 1:10, z <- sum(4, 6, 9))

eval(parse("parsing_example.txt"))
ls()

## [1] "x" "y" "z"


The three variants of parsing V

3. Parsing character strings

Character strings can also be turned into evaluable expressions via parse().
To do this, pass the character string as the text argument of the parse() function.
Example:
parse(text = c("5 + 3", "1:10", "sum(4, 6, 9)"))

## expression(5 + 3, 1:10, sum(4, 6, 9))

# Syntax errors are detected:
parse(text = "5 + ")

## Error in parse(text = "5 + "): <text>:2:0: unexpected end of input
## 1: 5 +
##    ^



The internal representation - Expressions


Internal representation - The parse tree I

The expression returned by parse() is exactly the data type that eval() expects as its input.
Internally, parsed R code is stored as a tree: the parse tree, also known as the abstract syntax tree (AST).
The internal representation can be accessed using getParseData(). However, this returns the tree as a data.frame, so one still needs to 'craft' the actual tree structure from it.
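
One possible way to 'craft' the tree is to follow the parent column recursively. The helper print_node() below is a hypothetical sketch for illustration, not part of any package.

```r
pd <- utils::getParseData(parse(text = "2 + 3 * 4", keep.source = TRUE))

# Hypothetical helper: print all children of node `id`, indented by depth.
# The root node of the parse data has parent 0.
print_node <- function(pd, id = 0, depth = 0) {
  for (child in pd$id[pd$parent == id]) {
    row <- pd[pd$id == child, ]
    cat(strrep("  ", depth), row$token, " ", row$text, "\n", sep = "")
    print_node(pd, child, depth + 1)
  }
}

print_node(pd)
```

Each level of indentation corresponds to one step from a node to its children; terminal nodes additionally show their text.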


Internal representation - The parse tree II


getParseData(parse(text = "2 + 3 * 4 / 1:2", keep.source = TRUE))

##    line1 col1 line2 col2 id parent     token terminal text
## 19     1    1     1   15 19      0      expr    FALSE
## 1      1    1     1    1  1      2 NUM_CONST     TRUE    2
## 2      1    1     1    1  2     19      expr    FALSE
## 3      1    3     1    3  3     19       '+'     TRUE    +
## 18     1    5     1   15 18     19      expr    FALSE
## 10     1    5     1    9 10     18      expr    FALSE
## 4      1    5     1    5  4      5 NUM_CONST     TRUE    3
## 5      1    5     1    5  5     10      expr    FALSE
## 6      1    7     1    7  6     10       '*'     TRUE    *
## 7      1    9     1    9  7      8 NUM_CONST     TRUE    4
## 8      1    9     1    9  8     10      expr    FALSE
## 9      1   11     1   11  9     18       '/'     TRUE    /
## 17     1   13     1   15 17     18      expr    FALSE
## 11     1   13     1   13 11     12 NUM_CONST     TRUE    1
## 12     1   13     1   13 12     17      expr    FALSE
## 13     1   14     1   14 13     17       ':'     TRUE    :
## 14     1   15     1   15 14     15 NUM_CONST     TRUE    2
## 15     1   15     1   15 15     17      expr    FALSE


The parse data of "2 + 3 * 4 / 1:2" drawn as a tree (node ids; terminal nodes are followed by their text):

19
  2
    1: 2
  3: +
  18
    10
      5
        4: 3
      6: *
      8
        7: 4
    9: /
    17
      12
        11: 1
      13: :
      15
        14: 2


Internal representation - The parse tree III


getParseData(parse(text = "add <- function(x, y) {
return(x + y)
}", keep.source = TRUE))

##    line1 col1 line2 col2 id parent                token terminal     text
## 37     1    1     3   27 37      0                 expr    FALSE
## 1      1    1     1    3  1      3               SYMBOL     TRUE      add
## 3      1    1     1    3  3     37                 expr    FALSE
## 2      1    5     1    6  2     37          LEFT_ASSIGN     TRUE       <-
## 36     1    8     3   27 36     37                 expr    FALSE
## 4      1    8     1   15  4     36             FUNCTION     TRUE function
## 5      1   16     1   16  5     36                  '('     TRUE        (
## 6      1   17     1   17  6     36       SYMBOL_FORMALS     TRUE        x
## 7      1   18     1   18  7     36                  ','     TRUE        ,
## 9      1   20     1   20  9     36       SYMBOL_FORMALS     TRUE        y
## 10     1   21     1   21 10     36                  ')'     TRUE        )
## 33     1   23     3   27 33     36                 expr    FALSE
## 12     1   23     1   23 12     33                  '{'     TRUE        {
## 27     2   30     2   42 27     33                 expr    FALSE
## 14     2   30     2   35 14     16 SYMBOL_FUNCTION_CALL     TRUE   return
## 16     2   30     2   35 16     27                 expr    FALSE
## 15     2   36     2   36 15     27                  '('     TRUE        (
## 23     2   37     2   41 23     27                 expr    FALSE
## 17     2   37     2   37 17     19               SYMBOL     TRUE        x
## 19     2   37     2   37 19     23                 expr    FALSE
## 18     2   39     2   39 18     23                  '+'     TRUE        +
## 20     2   41     2   41 20     22               SYMBOL     TRUE        y
## 22     2   41     2   41 22     23                 expr    FALSE
## 21     2   42     2   42 21     27                  ')'     TRUE        )
## 31     3   27     3   27 31     33                  '}'     TRUE        }

The parse data of the add() definition drawn as a tree (node ids as in the parse data table; terminal nodes are followed by their text):

37
  3
    1: add
  2: <-
  36
    4: function
    5: (
    6: x
    7: ,
    9: y
    10: )
    33
      12: {
      27
        16
          14: return
        15: (
        23
          19
            17: x
          18: +
          22
            20: y
        21: )
      31: }

Internal representation - The parse tree IV

Unfortunately, drawing these pretty trees by hand is very cumbersome. Therefore, from now on we use the ast() function from the pryr package to depict ASTs. Here, () represents an inner node while the other lines represent the tree's leaves.

pryr::ast(sin(2 + 3))

## \- ()
##   \- `sin
##   \- ()
##     \- `+
##     \- 2
##     \- 3

pryr::ast(mean(1:10, trim = 0.1))

## \- ()
##   \- `mean
##   \- ()
##     \- `:
##     \- 1
##     \- 10
##   \- 0.1

On the next couple of slides, we will take a look at the structure and contents of an AST.

The leaves: constants and symbols


First, we'll deal with the tree's leaves: Every leaf of an AST is either a constant or a symbol.

Constants:

pryr::ast(2)
## \- 2
pryr::ast("a")
## \- "a"
pryr::ast(TRUE)
## \- TRUE

Constants are always scalars, i.e. vectors of length 1.

Names of variables (≙ symbols):

pryr::ast(x)
## \- `x
pryr::ast(sin)
## \- `sin
pryr::ast(`+`)
## \- `+

Most functions are also accessed via symbols.


Function calls I

Function calls represent the inner nodes of an AST:

pryr::ast(sin(2))

## \- ()
##   \- `sin
##   \- 2

pryr::ast(sum(1L, 2L, 3L))

## \- ()
##   \- `sum
##   \- 1L
##   \- 2L
##   \- 3L

As an inner node of the AST, every function call has one or more children, which are trees (ASTs) themselves. The first child always denotes the called function while the remaining children represent the function's arguments.


Function calls II

The AST is a recursive data structure: Every argument of a function call can be another function call itself. Thus, the tree can have an almost arbitrary depth:

pryr::ast(sin(cos(x)))

## \- ()
##   \- `sin
##   \- ()
##     \- `cos
##     \- `x

pryr::ast(mean(seq(1, 10)))

## \- ()
##   \- `mean
##   \- ()
##     \- `seq
##     \- 1
##     \- 10
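
The recursive structure can also be walked with base R alone. The following show_ast() is a hypothetical, simplified stand-in for pryr::ast(); it expects an already quoted expression and does not handle missing arguments or pairlists.

```r
# Hypothetical helper: print the AST of a quoted expression using base R only
show_ast <- function(x, depth = 0) {
  indent <- strrep("  ", depth)
  if (is.call(x)) {
    cat(indent, "()\n", sep = "")                       # inner node
    for (i in seq_along(x)) show_ast(x[[i]], depth + 1) # recurse into children
  } else if (is.symbol(x)) {
    cat(indent, "`", as.character(x), "\n", sep = "")   # leaf: symbol
  } else {
    cat(indent, deparse(x), "\n", sep = "")             # leaf: constant
  }
}

show_ast(quote(mean(seq(1, 10))))
```

The function mirrors the definition from the slides: a call is an inner node whose children are again ASTs, everything else is a leaf.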


Function calls III

It's not advisable to rely on the order of the arguments, because argument matching rules in R are rather complicated after all:

pryr::ast(mean(x = c(1, 2), trim = 0.1, na.rm = TRUE))

## \- ()
##   \- `mean
##   \- ()
##     \- `c
##     \- 1
##     \- 2
##   \- 0.1
##   \- TRUE

pryr::ast(mean(t = 0.1, , TRUE, x = c(1, 2)))

## \- ()
##   \- `mean
##   \- 0.1
##   \- `MISSING
##   \- TRUE
##   \- ()
##     \- `c
##     \- 1
##     \- 2


Function calls IV

Recall: Everything that happens in R is a function call. The AST always contains the prefix form of a function. (It's a bit more complicated for replacement functions ...)

pryr::ast(2 + 3)

## \- ()
##   \- `+
##   \- 2
##   \- 3

pryr::ast(while(cond) cond = TRUE)

## \- ()
##   \- `while
##   \- `cond
##   \- ()
##     \- `=
##     \- `cond
##     \- TRUE

(Almost) nothing else can appear inside an AST, so constants, symbols and calls describe the internal structure of every piece of R code. The one exception:

Pairlists

An inner node may also be a pairlist (denoted by []):

pryr::ast(function(x = 2) x)

## \- ()
##   \- `function
##   \- []
##     \ x = 2
##   \- `x
##   \- <srcref>

Pairlists are used in R mostly for the formals of a function. As such, they appear in the AST wherever the function `function` is called. Pairlists themselves can again contain constants, symbols and calls.

→ We'll ignore pairlists for this part of the lecture.



Valid inputs in R


Now, we'll take a short look at valid inputs in R. In particular, we're interested in:
  Which number formats does R understand?
  Which character strings have a special meaning?
  What are valid variable names?
  Which words can't be used as variable names?
Note that the parser generally ignores comments: all characters between the '#' and the end of the line are ignored. Thus, comments can contain basically anything.


Valid inputs in R - Numeric constants I

An extensive description of how R parses numeric constants can be found under ?NumericConstants.
There are four numeric constants that don't start with a digit or a decimal point:
  Inf (double), NaN (double), NA_real_ (double), NA_integer_ (integer).
R 'only' understands the ASCII characters 0-9 as digits.
The decimal separator is . rather than ,.
R understands scientific notation:

2.1e-2
## [1] 0.021
2.1E2
## [1] 210


Valid inputs in R - Numeric constants II

For numbers with an absolute value between 0 and 1, the leading zero can be omitted:

.5
## [1] 0.5
-.5
## [1] -0.5

Numbers followed by an L are interpreted as integers, as long as they don't contain a decimal point.
Numbers followed by an i are interpreted as complex and as such no longer belong to the numeric type.
A negative sign in front of a number is interpreted as a unary operator by the parser and thus doesn't belong to the number itself.
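
These rules can be checked directly with typeof() and quote(). The variable name neg is only used for this small illustration.

```r
typeof(1)       # "double"
typeof(1L)      # "integer"
typeof(1i)      # "complex"
typeof(.5)      # "double"

# The minus sign is not part of the constant: -1 is a call to the unary `-`
neg <- quote(-1)
is.call(neg)    # TRUE
neg[[1]]        # the symbol `-`
neg[[2]]        # the constant 1
```

So the parser only ever sees non-negative numeric constants; negative numbers arise when the call to `-` is evaluated.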


Valid inputs in R - Numeric constants III

In R, numbers can also be entered in hexadecimal notation. In this case, numbers start with 0x, followed by the digits 0-9, the letters a-f and/or the letters A-F.

0xa
## [1] 10
0x123
## [1] 291

Hexadecimal numbers can be followed by a power of 2 (in decimal notation). Here, p marks the beginning of the exponent (otherwise, it wouldn't be possible to generate non-integers):

# 175 * 2^12
0xafp12
## [1] 716800
# 17 * 2^(-2)
0x11p-2
## [1] 4.25


Valid inputs in R - characters

All printable characters are valid inside character strings; which characters these are can vary depending on the encoding used.
There are a few so-called 'escape sequences' that have a special meaning. Two examples:
  \n causes a line break,
  \t is a tab character.
A complete list of all escape sequences can be found at:
https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Literal-constants


Valid inputs in R - Variable names I

The naming of variables in R is subject to a set of simple rules:

Valid variable names
1 Consist of letters, digits, dots and underscores.
2 Start with either a letter or a dot. When starting with a dot, the second character must not be a digit.
3 Must not contain blank characters.
4 Must not be reserved words (more on them shortly).

We can circumvent these rules by using backticks. E.g.: a§b is not a valid name, whereas `a§b` is a valid name.

a§b
## Error: <text>:1:2: unexpected input
## 1: a§
##     ^

`a§b` <- 2
`a§b`
## [1] 2


Valid inputs in R - Variable names II

Definition: Hidden variables
A variable whose name starts with a dot is called a hidden variable. Hidden variables are only displayed by ls() (for example) when the parameter all.names is set to TRUE.

.a <- 5
ls()

## [1] "a§b" "x" "y" "z"

ls(all.names = TRUE)

## [1] ".a" "a§b" "x" "y" "z"

This behavior is consistent with most operating systems, where files whose names begin with a '.' are also called hidden and are only displayed on explicit request.

Valid inputs in R - Reserved words

There are in total 19 words in R that are not allowed as variable names.
They are:
if and else,
for, while and repeat,
in, next and break,
function,
TRUE and FALSE,
NULL, Inf and NaN,
NA, NA_integer_, NA_real_, NA_complex_ and NA_character_.
A complete list can also be found via ?Reserved.
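
Whether a string is a syntactically valid name, including the reserved-word rule, can be checked with base R's make.names(), which leaves valid names unchanged and alters all others. The wrapper is_valid_name() is a hypothetical helper for illustration.

```r
# Hypothetical helper: TRUE if `x` can be used as a variable name without backticks
is_valid_name <- function(x) make.names(x) == x

is_valid_name(c("a_b", ".a", ".2a", "a b", "if", "2x"))
## [1]  TRUE  TRUE FALSE FALSE FALSE FALSE
```

".2a" fails because the dot is followed by a digit, "a b" contains a blank, "if" is a reserved word and "2x" starts with a digit.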



Separators


Separators I

As already mentioned, the decimal point is used to separate a number's decimal places.
The comma is used to separate function arguments and multiple indices.
Although blank spaces are sensible in many places (see chapter on programming style), they are only strictly necessary in a few instances. An example:

x<-5    # assignment: x is set to 5
x< -5   # comparison: is x less than -5?
## [1] FALSE

Semicolons are used to execute multiple commands in a single line.


Separators II

A line break inside of an R statement is interpreted as blank space by the parser:
x<
-5
## [1] FALSE

If an R statement is complete at a line break, it is regarded as finished and the parser stops reading.
This is also the case when the R statement could potentially continue:
if (x < 2) print(x)
else print("too large")
## Error: <text>:2:1: unexpected 'else'
## 1: if (x < 2) print(x)
## 2: else
##    ^
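
A short sketch of how the problem can be reproduced and avoided. The variable names res and braced are only used for this illustration: inside braces the statement is not yet complete at the line break, so the parser keeps reading and accepts the else.

```r
x <- 5

# At top level, the 'if' statement is already complete at the line break
res <- try(parse(text = "if (x < 2) print(x)\nelse print('too large')"),
           silent = TRUE)
inherits(res, "try-error")   # TRUE

# Inside braces, the same two lines parse without problems
braced <- parse(text = "{\n  if (x < 2) print(x)\n  else print('too large')\n}")
length(braced)               # 1

# The usual style: keep 'else' on the same line as the closing brace
if (x < 2) {
  print(x)
} else {
  print("too large")
}
```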



Metaprogramming


What is metaprogramming?

Definition according to Wikipedia
Metaprogramming is a programming technique in which computer programs have the ability to treat other programs as their data.

We'll see more on this in the chapter on functional programming: functions are programs, and we can write functions that generate functions.

For us, the term metaprogramming has a more specific meaning. In R, it means that:
1 We have parsed a program, i.e. we have an expression.
2 We want to write a second program that modifies this expression.


How do we obtain an expression?

When running program code in R, we always receive the resulting value in return. The expression is passed directly to the evaluator and evaluated without being accessible to us.

To actually access the expression itself, we can use quote():

1:10
## [1] 1 2 3 4 5 6 7 8 9 10

quote(1:10)
## 1:10

quote() returns its input without evaluating it. Alternatively, we can create an expression using parse(), which does not evaluate it either.


An expression’s data type


The data type of a quoted call such as 1:10 is not expression, but rather language:
str(quote(1:10))

## language 1:10

The function expression(), on the other hand, returns an object of the expression type. We'll call these objects expression objects: They are a vector data type, i.e. a vector containing our expressions, and rank above lists in the coercion hierarchy.
expr.obj = expression(1:10, 11:20)
typeof(expr.obj)

## [1] "expression"

typeof(expr.obj[[1]])

## [1] "language"
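
Note that "language" is the type of calls only. Quoted constants and symbols have their own types, as this small check illustrates (the name ex is only used here):

```r
typeof(quote(f(x)))   # "language" (a call)
typeof(quote(x))      # "symbol"
typeof(quote(1))      # "double"   (a constant)

# An expression object can hold all three kinds of elements
ex <- expression(a, 1, f(a))
sapply(ex, typeof)
## [1] "symbol"   "double"   "language"
```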


Evaluating an expression

To evaluate an expression, i.e. to explicitly call the evaluator, R offers the function eval().

expr = quote({x = 1})
x
## [1] 5

eval(expr)
x
## [1] 1

eval() is the exact opposite of quote(). Their calls cancel each other out:
eval(quote(eval(eval(quote(quote(2 + 2))))))

## [1] 4


Modifying an expression I

An expression is either a constant, a symbol or a function call. Since the first two are simple data types, we'll deal with the latter:

expr = quote(sum(1, 2, 3))
expr
## sum(1, 2, 3)

pryr::call_tree(expr)
## \- ()
##   \- `sum
##   \- 1
##   \- 2
##   \- 3

length(expr)
## [1] 4

expr[1]
## sum()

expr[[1]]
## sum


Modifying an expression II
Findings: Expressions have a length and can be subsetted similarly to lists. Just like for lists: [ returns an expression with the respective elements, [[ returns the content of the element.

typeof(expr[1])
## [1] "language"
typeof(expr[[1]])
## [1] "symbol"

typeof(expr[2])
## [1] "language"
typeof(expr[[2]])
## [1] "double"

What we already know from the previous subchapter:
Single elements of a call can be symbols, constants (integer, double, logical, character or complex; raw is not possible!) or calls themselves (language).

Modifying an expression III

The first element always denotes the function that is being called. This element can be a symbol or a call itself (there are no constants that are also functions).

factory = function()
  function(x) x^2
expr2 = quote(factory()(2))
expr2[[1]]
## factory()

typeof(expr2[[1]])
## [1] "language"

The other elements of the call are the arguments of the function call. They may be named and, if so, can be referenced by their name.
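
A brief illustration of accessing arguments by name; the object expr3 is introduced only for this example.

```r
expr3 <- quote(mean(x = c(1, 2, NA), na.rm = TRUE))

names(expr3)     # "" "x" "na.rm" (the function itself has no name)
expr3$na.rm      # TRUE
expr3[["x"]]     # c(1, 2, NA)

# Named arguments can be modified the same way
expr3$na.rm <- FALSE
expr3            # mean(x = c(1, 2, NA), na.rm = FALSE)
eval(expr3)      # NA
```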


Modifying an expression IV

Just like vectors, calls can be modified using the regular replacement functions $<- and [[<-:

expr[[1]] = as.symbol("mean")
expr
## mean(1, 2, 3)

expr[[2]] = 5
expr
## mean(5, 2, 3)

As usual: R does not police us; we can do as we please. Still, we should take care not to shoot ourselves in the foot.

expr[[1]] = 1
expr
## 1(5, 2, 3)

expr[-1]
## 5(2, 3)


Creating a call

To generate a call manually, we can use the functions call() and as.call():

expr = call(":", 1, 10)
expr
## 1:10
eval(expr)
## [1] 1 2 3 4 5 6 7 8 9 10

expr = call("mean", quote(1:10), na.rm = TRUE)
expr
## mean(1:10, na.rm = TRUE)
eval(expr)
## [1] 5.5

Here, the first argument of call() must always be the name of a function, given as a character string.
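
as.call() is mentioned above but not demonstrated. A minimal sketch: it takes a list whose first element is the function (here a symbol) and whose remaining elements are the arguments. The names parts and cl are only used for this illustration.

```r
parts <- list(as.symbol("sum"), 1, 2, 3)
cl <- as.call(parts)
cl          # sum(1, 2, 3)
eval(cl)    # 6

# The result is the same object that quote() would produce
identical(cl, quote(sum(1, 2, 3)))   # TRUE
```

as.call() is convenient when the arguments are already collected in a list, whereas call() expects them one by one.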


What’s the current call?

Sometimes, it's useful for functions to know their own call. That's what the functions sys.call() and match.call() are for:

f = function(butter = 1, magarine = 2, milch = 3) {
  list(sys = sys.call(),
       match = match.call())
}
f(ma = 2, 2)
## $sys
## f(ma = 2, 2)
##
## $match
## f(butter = 2, magarine = 2)

For example, lm() uses this to return the call as part of its result:
lm(mpg ~ wt, data = mtcars)$call

## lm(formula = mpg ~ wt, data = mtcars)



Using expressions


Example 1: Axis labels

x = seq(0, 2 * pi, length = 100)
sinx = sin(x)
plot(x, sinx, type = "l")

[Plot: line graph of sinx against x; the x-axis is labeled "x", the y-axis is labeled "sinx".]

How does R know the axis labels?

Example 2: Column names

x = 1:7
y = 2:8
data.frame(x, y, x + y)

##   x y x...y
## 1 1 2     3
## 2 2 3     5
## 3 3 4     7
## 4 4 5     9
## 5 5 6    11
## 6 6 7    13
## 7 7 8    15

How does R know the column names?

In both examples, the respective R functions not only use the values of their input parameters, but also the corresponding expressions.

The limits of quote()

Let's try to recreate this behavior using quote() while reducing the functionality to a bare minimum:

f = function(x)
  quote(x)
f(1:10)
## x
f(sin(x))
## x

quote() can't work here:
quote() returns its input as an expression without evaluating it. However, in this case the input of quote() is always x.
What we know: x is a promise object consisting of an expression, an environment and a value (cf. Chapter 2.4). We want to access this very expression and not the expression of quote()'s input.


substitute()

The required functionality is provided by the substitute() function.

f = function(x)
  substitute(x)
f(1:10)
## 1:10
f(sin(x))
## sin(x)

Combined with deparse(), we can now generate names and recreate Examples 1 and 2:

g = function(x)
  deparse(substitute(x))
g(1:10)
## [1] "1:10"
g(sin(x))
## [1] "sin(x)"
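
To connect this back to Example 1, here is a small sketch of how a plotting-like function could derive its labels. describe() is a hypothetical function made up for this illustration; it is not how plot() is actually implemented.

```r
# Hypothetical function: derive axis labels from the argument expressions
describe <- function(x, y) {
  list(xlab = deparse(substitute(x)),
       ylab = deparse(substitute(y)))
}

v <- 1:3
describe(v, sin(v))
## $xlab
## [1] "v"
##
## $ylab
## [1] "sin(v)"
```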


Avoiding quotation marks

Both of these calls are equivalent: each loads the package e1071.

library("e1071")
library(e1071)

Why does the second call work? After all, there is no e1071 object:
e1071

## Error in eval(expr, envir, enclos): object 'e1071' not found

Taking a closer look at the source code of library(), we find:

package <- as.character(substitute(package))

Thus, the expression of the input is used here as well.


The limits of substitute()

Let’s take a look at these (equivalent?) implementations:

f = function(x)
  substitute(x)
g = function(x)
  deparse(f(x))

h = function(x)
  deparse(substitute(x))

However, their outputs differ:

g(1:10)
## [1] "x"

h(1:10)
## [1] "1:10"

To make sense of this, we’ll examine how substitute() works exactly.


What does substitute() do?

The R help page tells us about substitute(expr, env):

substitute() returns the parse tree for the (unevaluated) expression expr, substituting any variables bound in env.

It checks every element of the AST individually for a potential substitution:
Constants and calls are never substituted.
If env is the global environment, no substitutions occur either.
For symbols, env is checked for a corresponding binding:
  Existing: The symbol is replaced by its value.
  Existing and the value is a promise object: The symbol is replaced by the corresponding expression.
  Not existing: No substitution takes place.
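
The three cases for symbols can be observed in one small example. subst_demo() and the variable names inside it are hypothetical and only serve as an illustration.

```r
subst_demo <- function(arg) {
  local_value <- 42
  # arg:         bound to a promise       -> replaced by its expression
  # local_value: bound to an ordinary value -> replaced by the value
  # unbound:     no binding in this environment -> left as it is
  substitute(arg + local_value + unbound + 1)
}

subst_demo(a * b)
## a * b + 42 + unbound + 1
```

The constant 1 and the calls to `+` themselves are left untouched; only the symbols inside them are checked.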


substitute() is an interactive function I


Equipped with this knowledge, we now understand these implementations:

h = function(x)
  deparse(expr = substitute(x))
h(1:10)
## [1] "1:10"

1 First, deparse() is called. The content of the formal parameter expr is a promise object.
2 When expr is accessed for the first time, the promise is evaluated. This evaluation takes place inside of the calling environment of deparse(), i.e. in the execution environment of h().
3 The input of substitute() is the expression x. As such, the corresponding AST only consists of this symbol.
4 Since x is bound to a promise object in the calling environment of substitute(), it is replaced by the corresponding expression.

substitute() is an interactive function II


f = function(x)
  substitute(x)
g = function(x)
  deparse(expr = f(x))
g(1:10)
## [1] "x"

1 deparse() is called, expr is a promise object.
2 When expr is accessed for the first time, f(x) is executed in the execution environment of g().
3 f(x) calls substitute(x). The input of substitute() is once more only the expression x.
4 x is a promise object in the execution environment of f().
5 Since the call was f(x), the corresponding expression is merely x.
6 The environment of the promise is the execution environment of g(). Here, x (again as a promise object) is bound.
7 However, substitute() only returns the expression of the promise in f(), i.e. x. It does not follow the chain to the promise in g(), whose expression is 1:10.

substitute() is an interactive function III

What does this quirk of substitute() mean for us?
We can only use substitute() in an interactive context: The desired result only occurs when substitute() is called inside the function that was called with the actual expression.
However, using substitute() in a purely interactive way in the global environment does not help us either:
x = 1:10
substitute(x)
## x

Thus, substitute() is a rather awkward tool. But aside from quote(), it is the only tool we have to access underlying expressions.


The subset() function


Example 3: Working interactively, we can use the subset() function to select rows of a dataset:

sample.df = data.frame(
  a = 1:5,
  b = 5:1,
  c = c(5, 3, 1, 4, 1)
)
sample.df
##   a b c
## 1 1 5 5
## 2 2 4 3
## 3 3 3 1
## 4 4 2 4
## 5 5 1 1

subset(sample.df, a >= 4)
##   a b c
## 4 4 2 4
## 5 5 1 1

subset(sample.df, b == c)
##   a b c
## 1 1 5 5
## 5 5 1 1

How does subset() work? The variables a, b and c don't exist in any environment, yet no error occurs.

Let’s recreate subset() I

We already know most of the central components of subset():
We can obtain the expression a >= 4 using substitute().
We can evaluate it using eval().
We want to evaluate the expression in a certain environment, namely in the data frame sample.df. To this end, eval() has an envir argument.

mySubset = function(df, condition) {
  expr = substitute(condition)
  rows = eval(expr = expr, envir = df)
  df[rows, ]
}
mySubset(sample.df, a >= 4)
##   a b c
## 4 4 2 4
## 5 5 1 1


Let’s recreate subset() II

Unfortunately, we're not done just yet:

y = 4
mySubset(sample.df, a >= y)
##   a b c
## 4 4 2 4
## 5 5 1 1

expr = 4
mySubset(sample.df, a >= expr)
## Warning in a >= expr: longer object length is not a multiple of shorter
## object length
## [1] a b c
## <0 rows> (or 0-length row.names)

We didn't supply an environment for the envir argument, but a data.frame. This worked just fine for the first test.
But: A data.frame has no parent environment. Thus, R can't apply its usual scoping rules and the desired result is not guaranteed: in the second test, expr is found in the execution environment of mySubset(), where it holds our expression, instead of in the global environment.


Let’s recreate subset() III

eval() allows us to pass not only environments as envir, but also lists and, as such, data frames.
If envir is a list, the enclosing environment to be used for scoping is specified via the enclos argument. Its default is the environment from which eval() is called, here the execution environment of mySubset().
We want variables to be looked up in the calling environment of mySubset().

mySubset = function(df, condition) {
  expr = substitute(condition)
  rows = eval(expr = expr, envir = df,
              enclos = parent.frame())
  df[rows, ]
}
expr = 4
mySubset(sample.df, a >= expr)
##   a b c
## 4 4 2 4
## 5 5 1 1

Alternatively, we could convert the data frame into an environment ourselves using the list2env() function.
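
A minimal sketch of this list2env() alternative. The function name mySubsetEnv() is hypothetical; the data frame is redefined so that the example is self-contained.

```r
# Hypothetical variant of mySubset(): build an environment from the data frame
# whose parent is the calling environment, then evaluate the condition in it
mySubsetEnv <- function(df, condition) {
  expr <- substitute(condition)
  env <- list2env(df, parent = parent.frame())
  rows <- eval(expr, envir = env)
  df[rows, ]
}

sample.df <- data.frame(a = 1:5, b = 5:1, c = c(5, 3, 1, 4, 1))
expr <- 4
mySubsetEnv(sample.df, a >= expr)
##   a b c
## 4 4 2 4
## 5 5 1 1
```

Column names are found in the new environment first; everything else, such as expr, is looked up in the calling environment via regular scoping.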


Loopholes I

Just like substitute(), mySubset() can only be used in an interactive context. As a counter-example, we combine mySubset() with a permutation of the dataset's rows.

scramble = function(x)
  x[sample(nrow(x)), ]
subscramble = function(x, condition)
  scramble(mySubset(x, condition))
subscramble(sample.df, a >= 4)
## Error in eval(expr = expr, envir = df, enclos = parent.frame()): object 'a' not found

Here, we see an effect similar to what we observed a couple of slides ago when calling substitute() inside another function:


Loopholes II
During the evaluation of mySubset(), condition is replaced by the corresponding expression, which is stored in expr.
However, this expression is merely the symbol condition, which itself is bound to a promise object.
The desired expression belongs to this promise object and is inaccessible to us.
When calling eval(), the symbol condition is then evaluated inside the specified environment. Evaluating a symbol means (cf. ?eval) replacing it by its value.
The calling environment of mySubset() is the execution environment of subscramble(), where the symbol condition is bound to a promise object. Thus, the result of eval() is this promise object.
Subsequently, the promise object is evaluated in its own environment where, however, no binding for a exists.

Loopholes III

Workaround: We define two functions. The first function expects the expression itself as its input, while the second function extracts it:

mySubsetQ = function(df, condition) {
  rows = eval(condition, df,
              parent.frame())
  df[rows, ]
}
mySubset = function(df, condition) {
  mySubsetQ(df, substitute(condition))
}

In an interactive context, we can keep using mySubset() as usual. Inside other functions, we resort to mySubsetQ() and have to extract the expressions ourselves:


Loopholes IV

mySubset(sample.df, a >= expr)
##   a b c
## 4 4 2 4
## 5 5 1 1

subscramble = function(x, condition) {
  expr = substitute(condition)
  scramble(mySubsetQ(x, expr))
}
subscramble(sample.df, a >= 4)
##   a b c
## 4 4 2 4
## 5 5 1 1


Modifying expressions using substitute()

Until now, we have only used substitute() to extract expressions from promise objects. The actual use case of substitute() is modifying expressions:

substitute(a + b, list(a = "y"))
## "y" + b

y = 2
substitute(a + b, list(a = y))
## 2 + b

substitute(a + b, list(a = quote(y)))
## y + b

substitute(a + b, list(a = call(":", 1, 10)))
## 1:10 + b

substitute(a + b, list(`+` = quote(`*`)))
## a * b

substitute(a + b, list(`+` = 1))
## 1(a, b)



Summary and conclusion


What have we learned and how should we use it? I

Originating as text, R code is first translated into an expression by the parser and then evaluated by the evaluator.
Expressions have a recursive tree structure and consist of symbols, constants and function calls.
We can treat expressions like lists, especially when it comes to subsetting and modifying single elements.
We can access and process the expression of a statement using quote() and substitute().
quote() and substitute() are rather difficult to handle, because we're using R to look inside of R.
