Data Transformation

Data transformation with dplyr :: CHEATSHEET

dplyr functions work with pipes and expect tidy data. In tidy data, each variable is in its own column and each observation, or case, is in its own row. With pipes, x |> f(y) becomes f(x, y).

Summarize Cases

Apply summary functions to columns to create a new table of summary statistics. Summary functions take vectors as input and return one value (see back).

summarize(.data, …) Compute table of summaries.
mtcars |> summarize(avg = mean(mpg))

count(.data, …, wt = NULL, sort = FALSE, name = NULL) Count number of rows in each group defined by the variables in … Also tally(), add_count(), add_tally().
mtcars |> count(cyl)

Group Cases

Use group_by(.data, …, .add = FALSE, .drop = TRUE) to create a "grouped" copy of a table grouped by columns in ... dplyr functions will manipulate each "group" separately and combine the results.
mtcars |>
  group_by(cyl) |>
  summarize(avg = mean(mpg))

Use rowwise(.data, …) to group data into individual rows. dplyr functions will compute results for each row. Also apply functions to list-columns. See tidyr cheat sheet for list-column workflow.
starwars |>
  rowwise() |>
  mutate(film_count = length(films))

ungroup(x, …) Returns ungrouped copy of table.
g_mtcars <- mtcars |> group_by(cyl)
ungroup(g_mtcars)

Manipulate Cases

EXTRACT CASES
Row functions return a subset of rows as a new table.

filter(.data, …, .preserve = FALSE) Extract rows that meet logical criteria.
mtcars |> filter(mpg > 20)

distinct(.data, …, .keep_all = FALSE) Remove rows with duplicate values.
mtcars |> distinct(gear)

slice(.data, …, .preserve = FALSE) Select rows by position.
mtcars |> slice(10:15)

slice_sample(.data, …, n, prop, weight_by = NULL, replace = FALSE) Randomly select rows. Use n to select a number of rows and prop to select a fraction of rows.
mtcars |> slice_sample(n = 5, replace = TRUE)

slice_min(.data, order_by, …, n, prop, with_ties = TRUE) and slice_max() Select rows with the lowest and highest values.
mtcars |> slice_min(mpg, prop = 0.25)

slice_head(.data, …, n, prop) and slice_tail() Select the first or last rows.
mtcars |> slice_head(n = 5)

Logical and boolean operators to use with filter()
==   <   <=   is.na()    %in%   |   xor()
!=   >   >=   !is.na()   !      &
See ?base::Logic and ?Comparison for help.

ARRANGE CASES
arrange(.data, …, .by_group = FALSE) Order rows by values of a column or columns (low to high), use with desc() to order from high to low.
mtcars |> arrange(mpg)
mtcars |> arrange(desc(mpg))

ADD CASES
add_row(.data, …, .before = NULL, .after = NULL) Add one or more rows to a table.
cars |> add_row(speed = 1, dist = 1)

Manipulate Variables

EXTRACT VARIABLES
Column functions return a set of columns as a new vector or table.

pull(.data, var = -1, name = NULL, …) Extract column values as a vector, by name or index.
mtcars |> pull(wt)

select(.data, …) Extract columns as a table.
mtcars |> select(mpg, wt)

relocate(.data, …, .before = NULL, .after = NULL) Move columns to new position.
mtcars |> relocate(mpg, cyl, .after = last_col())

Use these helpers with select() and across(), e.g. mtcars |> select(mpg:cyl)
contains(match)      num_range(prefix, range)       :, e.g., mpg:cyl
ends_with(match)     all_of(x)/any_of(x, …, vars)   !, e.g., !gear
starts_with(match)   matches(match)                 everything()

MANIPULATE MULTIPLE VARIABLES AT ONCE
df <- tibble(x_1 = c(1, 2), x_2 = c(3, 4), y = c(4, 5))

across(.cols, .funs, …, .names = NULL) Summarize or mutate multiple columns in the same way.
df |> summarize(across(everything(), mean))

c_across(.cols) Compute across columns in row-wise data.
df |>
  rowwise() |>
  mutate(x_total = sum(c_across(1:2)))

MAKE NEW VARIABLES
Apply vectorized functions to columns. Vectorized functions take vectors as input and return vectors of the same length as output (see back).

mutate(.data, …, .keep = "all", .before = NULL, .after = NULL) Compute new column(s). Also add_column().
mtcars |> mutate(gpm = 1 / mpg)
mtcars |> mutate(gpm = 1 / mpg, .keep = "none")

rename(.data, …) Rename columns. Use rename_with() to rename with a function.
mtcars |> rename(miles_per_gallon = mpg)
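As a rough illustration of how the case and variable verbs on this page chain together, the sketch below groups mtcars by cyl, summarizes two columns with across(), then filters and arranges the result. The object names by_cyl and result and the threshold of 10 cars are chosen here for illustration only and do not come from the cheat sheet.

```r
library(dplyr)

# Mean of mpg and wt for each cylinder group, plus the group size.
by_cyl <- mtcars |>
  group_by(cyl) |>
  summarize(across(c(mpg, wt), mean), n = n())

# Keep groups with more than 10 cars (illustrative threshold),
# ordered from highest to lowest mean mpg.
result <- by_cyl |>
  filter(n > 10) |>
  arrange(desc(mpg))

print(result)
```

Because summarize() returns one row per group, the filter() and arrange() steps here operate on the small summary table rather than on the original 32 rows.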
CC BY SA Posit Software, PBC • info@posit.co • posit.co • Learn more at dplyr.tidyverse.org • HTML cheatsheets at pos.it/cheatsheets • dplyr 1.1.2 • Updated: 2023-07
Vectorized Functions

TO USE WITH MUTATE ()
mutate() applies vectorized functions to columns to create new columns. Vectorized functions take vectors as input and return vectors of the same length as output.

OFFSET
dplyr::lag() - offset elements by 1
dplyr::lead() - offset elements by -1

CUMULATIVE AGGREGATE
dplyr::cumall() - cumulative all()
dplyr::cumany() - cumulative any()
cummax() - cumulative max()
dplyr::cummean() - cumulative mean()
cummin() - cumulative min()
cumprod() - cumulative prod()
cumsum() - cumulative sum()

RANKING
dplyr::cume_dist() - proportion of all values <=
dplyr::dense_rank() - rank with ties = min, no gaps
dplyr::min_rank() - rank with ties = min
dplyr::ntile() - bins into n bins
dplyr::percent_rank() - min_rank scaled to [0,1]
dplyr::row_number() - rank with ties = "first"

MATH
+, - , *, /, ^, %/%, %% - arithmetic ops
log(), log2(), log10() - logs
<, <=, >, >=, !=, == - logical comparisons
dplyr::between() - x >= left & x <= right
dplyr::near() - safe == for floating point numbers

MISCELLANEOUS
dplyr::case_when() - multi-case if_else()
starwars |>
  mutate(type = case_when(
    height > 200 | mass > 200 ~ "large",
    species == "Droid" ~ "robot",
    TRUE ~ "other")
  )
dplyr::coalesce() - first non-NA values by element across a set of vectors
dplyr::if_else() - element-wise if() + else()
dplyr::na_if() - replace specific values with NA
pmax() - element-wise max()
pmin() - element-wise min()

Summary Functions

TO USE WITH SUMMARIZE ()
summarize() applies summary functions to columns to create a new table. Summary functions take vectors as input and return single values as output.

COUNT
dplyr::n() - number of values/rows
dplyr::n_distinct() - # of uniques
sum(!is.na()) - # of non-NAs

POSITION
mean() - mean, also mean(!is.na())
median() - median

LOGICAL
mean() - proportion of TRUEs
sum() - # of TRUEs

ORDER
dplyr::first() - first value
dplyr::last() - last value
dplyr::nth() - value in nth location of vector

RANK
quantile() - nth quantile
min() - minimum value
max() - maximum value

SPREAD
IQR() - Inter-Quartile Range
mad() - median absolute deviation
sd() - standard deviation
var() - variance

Row Names

Tidy data does not use rownames, which store a variable outside of the columns. To work with the rownames, first move them into a column.

tibble::rownames_to_column() Move row names into col.
a <- mtcars |> rownames_to_column(var = "C")

tibble::column_to_rownames() Move col into row names.
a |> column_to_rownames(var = "C")

Also tibble::has_rownames() and tibble::remove_rownames().

Combine Tables

COMBINE VARIABLES
bind_cols(…, .name_repair) Returns tables placed side by side as a single table. Column lengths must be equal. Columns will NOT be matched by id (to do that look at Relational Data below), so be sure to check that both tables are ordered the way you want before binding.

COMBINE CASES
bind_rows(…, .id = NULL) Returns tables one on top of the other as a single table. Set .id to a column name to add a column of the original table names, as in this result with .id = "DF":
  DF A B C
  x  a t 1
  x  b u 2
  y  c v 3
  y  d w 4

SET OPERATIONS
intersect(x, y, …) Rows that appear in both x and y.
  A B C
  c v 3

setdiff(x, y, …) Rows that appear in x but not y.
  A B C
  a t 1
  b u 2

union(x, y, …) Rows that appear in x or y (duplicates removed). union_all() retains duplicates.
  A B C
  a t 1
  b u 2
  c v 3
  d w 4

Use setequal() to test whether two data sets contain the exact same rows (in any order).

RELATIONAL DATA
The join illustrations use two tables, x and y:
  x: A B C      y: A B D
     a t 1         a t 3
     b u 2         b u 2
     c v 3         d w 1

Use a "Mutating Join" to join one table to columns from another, matching values with the rows that they correspond to. Each join retains a different combination of values from the tables.

left_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …, keep = FALSE, na_matches = "na") Join matching values from y to x.
  A B C D
  a t 1 3
  b u 2 2
  c v 3 NA

right_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …, keep = FALSE, na_matches = "na") Join matching values from x to y.
  A B C  D
  a t 1  3
  b u 2  2
  d w NA 1

inner_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …, keep = FALSE, na_matches = "na") Join data. Retain only rows with matches.
  A B C D
  a t 1 3
  b u 2 2

full_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …, keep = FALSE, na_matches = "na") Join data. Retain all values, all rows.
  A B C  D
  a t 1  3
  b u 2  2
  c v 3  NA
  d w NA 1

Use a "Filtering Join" to filter one table against the rows of another.

semi_join(x, y, by = NULL, copy = FALSE, …, na_matches = "na") Return rows of x that have a match in y. Use to see what will be included in a join.
  A B C
  a t 1
  b u 2

anti_join(x, y, by = NULL, copy = FALSE, …, na_matches = "na") Return rows of x that do not have a match in y. Use to see what will not be included in a join.
  A B C
  c v 3

Use a "Nest Join" to inner join one table to another into a nested data frame.

nest_join(x, y, by = NULL, copy = FALSE, keep = FALSE, name = NULL, …) Join data, nesting matches from y in a single new data frame column.
  A B C y
  a t 1 <tibble [1x2]>
  b u 2 <tibble [1x2]>
  c v 3 <tibble [1x2]>

COLUMN MATCHING FOR JOINS
Use by = c("col1", "col2", …) to specify one or more common columns to match on.
left_join(x, y, by = "A")
  A B.x C B.y D
  a t   1 t   3
  b u   2 u   2
  c v   3 NA  NA

Use a named vector, by = c("col1" = "col2"), to match on columns that have different names in each table.
left_join(x, y, by = c("C" = "D"))
  A.x B.x C A.y B.y
  a   t   1 d   w
  b   u   2 b   u
  c   v   3 a   t

Use suffix to specify the suffix to give to unmatched columns that have the same name in both tables.
left_join(x, y, by = c("C" = "D"),
  suffix = c("1", "2"))
  A1 B1 C A2 B2
  a  t  1 d  w
  b  u  2 b  u
  c  v  3 a  t
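To make the difference between mutating and filtering joins concrete, here is a minimal sketch. It assumes the x and y tables are as read off the join illustrations on this sheet and rebuilds them by hand. The object names left, inner and only_x are illustrative, and matching on both A and B is a choice made for this example, not something the sheet prescribes.

```r
library(dplyr)

# Example tables, typed in by hand from the join illustrations.
x <- tibble(A = c("a", "b", "c"), B = c("t", "u", "v"), C = c(1, 2, 3))
y <- tibble(A = c("a", "b", "d"), B = c("t", "u", "w"), D = c(3, 2, 1))

# Mutating joins add columns from y to the matching rows of x.
left  <- left_join(x, y, by = c("A", "B"))   # keeps every row of x
inner <- inner_join(x, y, by = c("A", "B"))  # keeps only matched rows

# A filtering join only filters x; no columns from y are added.
only_x <- anti_join(x, y, by = c("A", "B"))  # rows of x with no match in y

print(left)
print(inner)
print(only_x)
```

Running anti_join() before a left_join() is one way to check which rows of x will end up with NA in the joined columns.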