Professional Documents
Culture Documents
Data Transformation
Data Transformation
dplyr functions work with pipes and expect tidy data. In tidy data:
A B C A B C
Manipulate Cases Manipulate Variables
&
pipes EXTRACT CASES EXTRACT VARIABLES
Row functions return a subset of rows as a new table. Column functions return a set of columns as a new vector or table.
Each variable is in Each observation, or x |> f(y)
its own column case, is in its own row becomes f(x, y) filter(.data, …, .preserve = FALSE) Extract rows pull(.data, var = -1, name = NULL, …) Extract
Summarize Cases w
www
ww that meet logical criteria.
mtcars |> filter(mpg > 20) w
www column values as a vector, by name or index.
mtcars |> pull(wt)
w
www
Apply summary functions to columns to create a new table of
w
www
ww
rows with duplicate values. mtcars |> select(mpg, wt)
summary statistics. Summary functions take vectors as input and mtcars |> distinct(gear)
return one value (see back).
relocate(.data, …, .before = NULL, .a er = NULL)
slice(.data, …, .preserve = FALSE) Select rows
w
www
ww
summary function Move columns to new position.
by position. mtcars |> relocate(mpg, cyl, .a er = last_col())
mtcars |> slice(10:15)
summarize(.data, …)
w
ww w
www
ww
Compute table of summaries. slice_sample(.data, …, n, prop, weight_by =
mtcars |> summarize(avg = mean(mpg)) NULL, replace = FALSE) Randomly select rows. Use these helpers with select() and across()
Use n to select a number of rows and prop to e.g. mtcars |> select(mpg:cyl)
count(.data, …, wt = NULL, sort = FALSE, name = select a fraction of rows.
NULL) Count number of rows in each group defined contains(match) num_range(prefix, range) :, e.g., mpg:cyl
mtcars |> slice_sample(n = 5, replace = TRUE) ends_with(match) all_of(x)/any_of(x, …, vars) !, e.g., !gear
by the variables in … Also tally(), add_count(),
w
ww add_tally(). starts_with(match) matches(match) everything()
mtcars |> count(cyl) slice_min(.data, order_by, …, n, prop,
with_ties = TRUE) and slice_max() Select rows
with the lowest and highest values. MANIPULATE MULTIPLE VARIABLES AT ONCE
Group Cases w
www
ww
mtcars |> slice_min(mpg, prop = 0.25)
df <- tibble(x_1 = c(1, 2), x_2 = c(3, 4), y = c(4, 5))
slice_head(.data, …, n, prop) and slice_tail()
Use group_by(.data, …, .add = FALSE, .drop = TRUE) to create a Select the first or last rows. across(.cols, .funs, …, .names = NULL) Summarize
w
ww
"grouped" copy of a table grouped by columns in ... dplyr mtcars |> slice_head(n = 5) or mutate multiple columns in the same way.
functions will manipulate each "group" separately and combine df |> summarize(across(everything(), mean))
the results.
Logical and boolean operators to use with filter() c_across(.cols) Compute across columns in
w
ww
== < <= is.na() %in% | xor() row-wise data.
w
www
ww mtcars |> != > >= !is.na() ! &
df |>
rowwise() |>
w
group_by(cyl) |>
summarize(avg = mean(mpg)) See ?base::Logic and ?Comparison for help. mutate(x_total = sum(c_across(1:2)))
MAKE NEW VARIABLES
ARRANGE CASES Apply vectorized functions to columns. Vectorized functions take
Use rowwise(.data, …) to group data into individual rows. dplyr arrange(.data, …, .by_group = FALSE) Order vectors as input and return vectors of the same length as output
functions will compute results for each row. Also apply functions (see back).
w
www
ww
rows by values of a column or columns (low to
to list-columns. See tidyr cheat sheet for list-column workflow. high), use with desc() to order from high to low. vectorized function
mtcars |> arrange(mpg) mutate(.data, …, .keep = "all", .before = NULL,
starwars |> mtcars |> arrange(desc(mpg))
ww
www w
www
ww
.a er = NULL) Compute new column(s). Also
w
w ww
rowwise() |> add_column().
mutate(film_count = length(films)) mtcars |> mutate(gpm = 1 / mpg)
ADD CASES mtcars |> mutate(gpm = 1 / mpg, .keep = "none")
add_row(.data, …, .before = NULL, .a er = NULL)
ungroup(x, …) Returns ungrouped copy of table.
w
www
ww
Add one or more rows to a table. rename(.data, …) Rename columns. Use
w
www
w
g_mtcars <- mtcars |> group_by(cyl) cars |> add_row(speed = 1, dist = 1) rename_with() to rename with a function.
ungroup(g_mtcars) mtcars |> rename(miles_per_gallon = mpg)
CC BY SA Posit So ware, PBC • info@posit.co • posit.co • Learn more at dplyr.tidyverse.org • HTML cheatsheets at pos.it/cheatsheets • dplyr 1.1.2 • Updated: 2023-07
ft
ft
ft
ft
ft
Vectorized Functions Summary Functions Combine Tables
TO USE WITH MUTATE () TO USE WITH SUMMARIZE () COMBINE VARIABLES COMBINE CASES
mutate() applies vectorized functions to summarize() applies summary functions to x y
columns to create new columns. Vectorized columns to create a new table. Summary A B C E F G A B C E F G A B C
TRUE ~ "other") Tidy data does not use rownames, which store a A B C intersect(x, y, …)
A B.x C B.y D Use by = c("col1", "col2", …) to
) variable outside of the columns. To work with the
c v 3
Rows that appear in both x and y.
a t 1 t 3
specify one or more common
dplyr::coalesce() - first non-NA values by rownames, first move them into a column. b u 2 u 2
columns to match on.
element across a set of vectors c v 3 NA NA
setdi (x, y, …)
tibble::rownames_to_column() le _join(x, y, by = "A") A B C
dplyr::if_else() - element-wise if() + else() A B C A B
a t 1 Rows that appear in x but not y.
dplyr::na_if() - replace specific values with NA 1 a t 1 a t Move row names into col. b u 2
CC BY SA Posit So ware, PBC • info@posit.co • posit.co • Learn more at dplyr.tidyverse.org • HTML cheatsheets at pos.it/cheatsheets • dplyr 1.1.2 • Updated: 2023-07
ft
ft
ft
ft
ff
ff
ff
ff
ff
ff
ff
ff
ff
ft
ff
ff
ft
ff