Welcome To The Tidyverse: February 2019

Welcome to the  
tidyverse
February 2019
Hadley Wickham  
@hadleywickham 
Chief Scientist, RStudio
The tidyverse is a
language for solving
data science challenges
with R code.
Your turn: What data do we
need to recreate this plot?
We need five variables
# A tibble: 193 x 6
country four_regions year income life_exp pop
<chr> <chr> <int> <int> <dbl> <int>
1 Afghanistan asia 2015 1750 57.9 33700000
2 Albania europe 2015 11000 77.6 2920000
3 Algeria africa 2015 13700 77.3 39900000
4 Andorra europe 2015 46600 82.5 78000
5 Angola africa 2015 6230 64 27900000
6 Antigua and Barbuda americas 2015 20100 77.2 99900
7 Argentina americas 2015 19100 76.5 43400000
8 Armenia europe 2015 8180 75.4 2920000
9 Australia asia 2015 43800 82.6 23800000
10 Austria europe 2015 44100 81.4 8680000
# … with 183 more rows
This is “tidy” data
# A tibble: 193 x 6
country four_regions year income life_exp pop
<chr> <chr> <int> <int> <dbl> <int>
Observation 1 Afghanistan asia 2015 1750 57.9 33700000

2 Albania europe 2015 11000 77.6 2920000
3 Algeria africa 2015 13700 77.3 39900000
4 Andorra europe 2015 46600 82.5 78000
5 Angola africa 2015 6230 64 27900000
6 Antigua and Barbuda americas 2015 20100 77.2 99900
7 Argentina americas 2015 19100 76.5 43400000
8 Armenia europe 2015 8180 75.4 2920000
9 Australia asia 2015 43800 82.6 23800000
10 Austria europe 2015 44100 81.4 8680000
Variable
The original data looked like this
# A tibble: 193 x 220
country `1800` `1801` `1802` `1803` `1804` `1805` `1806` `1807` `1808` `1809` `1810` `1811` `1812` `1813`
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 Afghan… 603 603 603 603 603 603 603 603 603 603 604 604 604 604
2 Albania 667 667 667 667 667 668 668 668 668 668 668 668 668 668
3 Algeria 715 716 717 718 719 720 721 722 723 724 725 726 727 728
4 Andorra 1200 1200 1200 1200 1210 1210 1210 1210 1220 1220 1220 1220 1220 1230
5 Angola 618 620 623 626 628 631 634 637 640 642 645 648 651 654
6 Antigu… 757 757 757 757 757 757 757 758 758 758 758 758 758 758
7 Argent… 1510 1510 1510 1510 1510 1510 1510 1510 1510 1510 1510 1510 1510 1510
8 Armenia 514 514 514 514 514 514 514 514 514 514 514 515 515 515
9 Austra… 814 816 818 820 822 824 825 827 829 831 833 835 837 839
10 Austria 1850 1850 1860 1870 1880 1880 1890 1900 1910 1920 1920 1930 1940 1950
# … with 183 more rows, and 205 more variables: `1814` <int>, `1815` <int>, `1816` <int>, `1817` <int>,
# `1818` <int>, `1819` <int>, `1820` <int>, `1821` <int>, `1822` <int>, `1823` <int>, `1824` <int>,
# `1846` <int>, `1847` <int>, `1848` <int>, `1849` <int>, `1850` <int>, `1851` <int>, `1852` <int>, ...
Start with single year
Called the pipe;

pronounced “then”
Called the reverse assignment
operator; pronounced “creates”
gapminder %>%
filter(year == 2015) ->
gapminder15
Phonics are important!
gapminder %>% Take the gapminder data, then
filter(year == 2015) -> filter rows where year equals 2015, creating
gapminder15 gapminder15 variable
80
70
life_exp
60
gapminder15 %>%
50 ggplot(aes(income, life_exp))
0 25000 50000 75000 100000 12500
income
● ●
● ●
●
● ● ● ● ●● ● ●
● ● ●● ●
● ● ●
● ● ●● ● ● ● ● ●
●
80 ●
●
● ● ●
●● ● ● ●
● ● ● ●
●
● ● ●
● ● ● ●●
● ● ●
● ● ● ● ● ● ●
● ● ●
●● ●●● ●
● ● ● ● ●
●●● ●
●
●
● ● ●
●● ● ●
● ● ●
●
●● ● ●
● ●● ●● ●
● ●
● ● ● ●
● ● ●
●● ● ●
70 ● ● ●●
● ●
life_exp
●● ● ●
●
●●● ● ● ●
● ●
●
●● ●
● ●● ● ●
● ●
●
●●●● ●●
●●
● ●
●
●
●
● ●
● ●
●
●●
60 ●●
●●●
●● ●
●● ●
●
●●
gapminder15 %>%
●
ggplot(aes(income, life_exp)) +
50 ● ● geom_point()
0 25000 50000 75000 100000 12500
income
● ●
● ●
●
● ● ● ● ●● ● ●
● ● ●● ●
● ● ●
● ● ●● ● ● ● ● ●
●
80 ●
●
● ● ●
●● ● ● ●
● ● ● ●
●
● ● ●
● ● ● ●●
● ● ●
● ● ● ● ● ● ●
● ● ●
●● ●●● ●
● ● ● ● ●
●●● ●
●
●
● ● ●
●● ● ●
● ● ●
●
●● ● ●
● ●● ●● ●
● ●
● ● ● ●
● ● ●
●● ● ●
70 ● ● ●●
● ●
life_exp
●● ● ●
●
●●● ● ● ●
● ●
●
●● ●
● ●● ● ●
● ●
●
●●●● ●●
●●
● ●
●
●
●
● ●
● ●
●
●●
60 ●●
●●●
●● ●
●● ●
●
●●
●
50 ● ●
0 25000 50000 75000 100000 12500

income
● ●
● ●
● ●●
● ● ● ● ● ●
●
● ● ●●
● ● ● ● ●
● ● ● ● ●●
● ●
80 ●
●
● ● ●
●● ● ● ●
● ●● ● ●
●
●● ● ● ●●
● ● ●
● ● ● ● ● ●
● ● ● ● ● ●
● ● ● ●●
● ●
●
● ●● ● ● ● ●
● ● ● ●
●
● ● ● ●
● ● ●
●
● ● ● ●
● ● ● ● ● ●
● ● ●●
● ● ●
● ●
● ●
● ● ●
70 ● ●
●
● ●
life_exp
●● ● ● ●
● ●
● ● ● ●
● ●
● ● ●
●
● ● ● ● ●
● ●
● ● ●
● ● ● ●
● ●
● ●
●● ●
● ●●
● ●
● ●
60 ● ● ●
gapminder15 %>%
● ● ●
● ●
● ● ●
● ●
●
●
geom_point() +
50 ● ● scale_x_log10()
1e+03 1e+04 1e+05
income
● ●
● ●
● ●●
● ● ● ● ● ●
●
● ● ●●
● ● ● ● ●
● ● ● ●●●
● ●
80 ●
●
● ● ●
●● ● ● ●
● ● ● ●
●
●
● ● ● ● ●
● ● ●● ● ●
● ● ● ●
● ● ●● ● ●
● ● ● ●●
● ●
● ●● ● ● ●●●
● ● ● ●
●
● ● ● ●
● ● ●
●
● ● ● ●
● ●● ●
●
●
●
● ●●
● ● ●
● ●
●
four_regions
● ● ● ●
70 ● ●
●
● ●
● africa
life_exp
●● ● ● ●
● ●● ● ●
● ●
● ●
● ● americas
● ●
● ● ● ● ●
●
● ● ● ● ● ● ● ● asia
● ● ●
● ●
●●
● ●● ●
● europe
● ●
● ●
60 ●
●
● ●
● ● ● ●
● ● ●
● ●
●
●
gapminder15 %>%
50 ● ●
geom_point(aes(colour = four_regions)) +
1e+03 1e+04 1e+05
scale_x_log10()
income
● ●
●
● ●●
● ● ● ● ● ● ●
●
● ● ●●
●
● ● ●
● ●● ● ● ●●● ●
80 ●●
● ● ●
● ●● ● ●
●● ● ●
●
●● ● ● ●● ●
● ●●●
●● ● ●
● ● ● ●
●
●
●
●
●
●
●
●
●
pop
●
● ● ●
● ●● ● ● ●
● ● ●
●
●
● ●
●
●
● ●
●
● 5e+08
●
● ● ●
●
●
●
●
●
●
●
●
●
●● ●
●
●●
● ●
● ●
● ● 1e+09
70 ● ●
●
● ●
●
life_exp
●● ● ●
●
●
● ● ● ● ● ●
four_regions
● ● ● ●
● ● ● ●
●
● ●
● ● ● ●● ● ● africa
● ● ●
● ●
●
●
● ●● ● ● americas
●
● ●
●
60 ●
●
● ● ● asia
● ● ● ●
● ● ●
●
● ● ● europe
●
gapminder15 %>%
50 ● ●
geom_point(aes(colour = four_regions, size = pop)) +

1e+03 1e+04 1e+05
scale_x_log10()
income
Your turn: What’s missing?
● ●
●
● ●●
●● ● ●●● ● ●
● ● ●●
●
● ● ● ● ● ● ●●● ●
● ●
80 ●●
● ● ●
● ●● ● ●
●● ● ●
●
●● ● ● ●● ●
● ●●●
●● ●
● ● ● ●
●
●
●
●
●●
●
●
●
●
pop
●
● ● ●
● ●● ● ● ●
● ● ●
●
●
● ●
●
●
● ●
●
● 5e+08
●
● ● ●
●
●
●
●
●
●
●
●
●
●● ●
●
●●
● ●
● ●
● ● 1e+09
70 ● ●
●
● ●
● ●
life_exp
●● ● ●
●
● ● ●
●
● ● ●
● ●
●
four_regions
● ● ● ●
●
● ●
● ● ● ●● ● ● africa
● ● ●
● ●
●
●
● ●● ● ● americas
●
● ●
●
60 ●
●
● ● ● asia
● ● ● ●
● ● ●
●
● ● ● europe
●
50 ● ●
1e+03 1e+04 1e+05

income
It’s a lot more work to make a expository graphic
gapminder15 %>%
geom_point(aes(fill = four_regions, size = pop), shape = 21) +
scale_x_log10(breaks = 2^(-1:7) * 1000) +
scale_size(range = c(1, 20), guide = FALSE) +
scale_fill_manual(
guide = FALSE,
values = c(
africa = "#60D2E6",
americas = "#9AE847",
asia = "#EC6475",
europe = "#FBE84D"
)
) +
labs(
x = "Income (GDP / capita)",
y = "Life expectancy (years)"
)
Your turn: Can you spot
●
●●● ●●
●● ●●●
●
●
●
●
●
● ●
80 the subtle problem? ●
●
●
●● ●●
●
●●●●
● ●
●
● ●
●● ●● ●
●
● ●
●
●
●
● ●●● ●●● ●
●
●
●
● ● ●
● ● ●
● ●
●●
●
●●
●
●
●
● ●
●● ● ● ●●
●
●● ● ●
●
●
Life expectancy (years)
● ● ●
●
●
●
●
●
●●
●
●
● ● ● ● ●
●
●
● ●●
●
● ●
●
70 ●
●
●
●
●
● ● ●
●● ●
●
●
● ●● ● ● ● ●
● ● ● ●
●● ● ● ● ●
●● ●
●
●
●
● ●
● ● ●
●
●
●
●
● ● ●● ●
●
60 ● ● ● ●●
●
● ● ●
● ●
●
●● ●
●
50 ● ●
500 1000 2000 4000 8000 16000 32000 64000 128000

Income (GDP / capita)
●
●●● ●●
●● ●●●
●
●
●
●
●
● ●
80 ●
●
●
●● ●●
●
●●●●
● ●
●
● ●
●● ●● ●
●
● ●
●
●
●
● ●●● ●●● ●
●
●
●
● ● ●
● ● ●
● ●
●●
●
●●
●
●
●
● ●
●● ● ● ●●
●
●● ● ●
●
●
Life expectancy (years)
● ● ●
●
●
●
●
●
●●
●
●
● ● ● ● ●
●
●
● ●●
●
● ●
●
70 ●
●
●
●
●
● ● ●
●● ●
●
●
● ●● ● ● ● ●
● ● ● ●
●● ● ● ● ●
●● ●
●
●
●
● ●
● ● ●
●
●
●
●
● ● ●● ●
●
60 ● ● ● ●●
●
● ● ●
● ●
●
●● ●
●
50 ● ●
500 1000 2000 4000 8000 16000 32000 64000 128000

A little motivation
data %>%
arrange(desc(pop)) %>%
...
You can also turn your code into a function
gap_plot <- function(data) {
data %>%
arrange(desc(pop)) %>%
geom_point(aes(fill = four_regions, size = pop), shape = 21) +
scale_x_log10(breaks = 2^(-1:7) * 1000) +
scale_size(range = c(1, 20), guide = FALSE) +
scale_fill_manual(values = c(
africa = "#60D2E6",
americas = "#9AE847",
asia = "#EC6475",
europe = "#FBE84D"
)) +
labs(
x = "Income (GDP / capita)",
y = "Life expectancy",
fill = "Region"
)
}
gapminder %>%
80
filter(country == "New Zealand") %>%
gap_plot()
70
Life expectancy
60 Region
● asia
50
●
●●
●
●
●
●●
●
●
●
●
40 ●
●
●
●
●●●
●
●
●●●
●
●
● ● ● ● ● ● ● ● ● ●●
●●●
● ● ● ● ● ● ●●●● ●
●●●
●●
●
1000 2000 4000 8000 16000 32000
gapminder %>% ●
● ●
50 filter(year == 1900) %>% ●
●
● ●
gap_plot() ● ● ●
●
●
●
●●
●
●
●
●
●
●
●
●
●● ● Region
Life expectancy
40 ● ●
● ● ● ●
●● ●
● africa
● ●●
●
● ● ●
●
●●
● ●
●
● ● americas
●
●
●
●● ●
●
●
●
●
●●● ●
●
● ● ●
●●● ●● ●
●
●
● ● ● asia
● ● ●● ●● ●
●
●
● ●
●●
●● ●
● ●
●
●
● ●
●
●
●
●
● ● ● ● ●●
●
● ●● ● ●
● ●
●
●
●● ●● ● europe
30
● ●
●
● ●
●
●
● ●●
●●
●
● ●
●●
●
● ●
● ●
●●
●
●
●
●
●
●
●
●● ●
●●
●● ●
●
●●
●
● ●
● ● ● ● ●
● ●
●
●
20
●●
500 1000 2000 4000 8000

gapminder %>% ●
● ●
●●
50
filter(year == 1905) %>% ●
● ● ●● ●
●
●
gap_plot() ●
●
●
●
●
●
●
● ● ● ●
● ● ● ●
40
● ●
● ●
● ●
●
● Region
Life expectancy
●
●
● ● ● ●
●
●
●
● ●
●
●
●● ●
●
●● ● africa
● ●
●
● ●
●
● ● ● ● ● ● ● ●
● ●●●●
●
● ●●●
● ●
●●
●
●
●●
●
● ● ●●
● ●●
●
● ● ●
● ●
● ● ● ● americas
●
●● ●
● ●●● ● ●● ●
● ● ●● ●
30
● ●
●
● ●
●
● ● ●●
● ●
●
●
● ●
● ●● ●
●
●
●
●
●
● asia
● ● ●●
● ●
●
●
●
● ●
●
● ●
●●●● ●
● ●
●
●
●● ● ● europe
●
● ●
●
● ●
20 ●●
●
10
500 1000 2000 4000 8000

How can we do this with every plot?
by_year <- gapminder %>%

filter(year %% 5 == 0) %>%
group_split(year)
plots <- map(by_year, ~ gap_plot(.x))

Datasets Plots
Demo
The tidyverse is a
with R code.
Import Visualise
readr
readxl ggplot2
haven
xml2
tidyverse.org
Tidy
tibble
Transform
dplyr lubridate
Model
tidyr forcats stringr
hms
recipes
parsnip
purrr
magrittr
r4ds.had.co.nz
Program
The tidyverse is a
with R code.
The disadvantages of code are obvious
Why code?
1. Code is text
2. Code is read-able
3. Code is reproducible
⌘C
⌘V
Copy
Paste
Why code?
1. Code is text
What have you done?
Why code?
1. Code is text
.Rmd Prose and code
.md Prose and results
.html Human shareable

.Rmd Prose and code
.md Prose and results
.doc .tex
.html ...
.ppt .pdf
Other skills
SQL marketing
git(hub)
Conclusions
We wanted users to be able to begin in an interactive
environment, where they did not consciously think of
themselves as programming. Then as their needs became
clearer and their sophistication increased, they should be
able to slide gradually into programming, when the
language and system aspects would become more
important.
— John Chambers, “Stages in the Evolution of S”
https://unsplash.com/photos/8HyrGTYPQ68 — Eric Muhr
c c e ss
su P i t o f
https://simplystatistics.org/2018/07/12/use-r-keynote-2018/
# A tibble: 153 x 6
Ozone Solar.R Wind Temp Month Day
<int> <int> <dbl> <int> <int> <int>
1 41 190 7.4 67 5 1
2 36 118 8 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9
10 NA 194 8.6 69 5 10
library(dplyr)
airquality %>%
group_by(Month) %>%
summarize(o3 = mean(Ozone, na.rm = TRUE))
aggregate(
airquality[“Ozone”],
airquality[“Month”],
mean, na.rm = TRUE
)
Why doesn’t
airquality$Ozone work?
aggregate(
mean, na.rm = TRUE
)
aggregate(
mean, na.rm = TRUE
) Passing a function to another function
aggregate(
mean, na.rm = TRUE
) Argument to mean, not to aggregate()
Import Visualise
readr ggplot2
readxl
haven
xml2
tidyverse.org
Tidy Transform
tibble dplyr lubridate
Model
tidyr forcats stringr
hms
recipes
parsnip
purrr
magrittr
r4ds.had.co.nz
Program
This work is licensed as
Creative Commons 
Attribution-ShareAlike 4.0  
International
To view a copy of this license, visit  
https://creativecommons.org/licenses/by-sa/4.0/

Welcome To The Tidyverse: February 2019

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Welcome To The Tidyverse: February 2019

Uploaded by

Copyright:

Available Formats

Welcome to the

Observation 1 Afghanistan asia 2015 1750 57.9 33700000

Called the pipe;

gapminder %>% Take the gapminder data, then

0 25000 50000 75000 100000 12500

geom_point(aes(colour = four_regions, size = pop)) +

1e+03 1e+04 1e+05

500 1000 2000 4000 8000 16000 32000 64000 128000

500 1000 2000 4000 8000 16000 32000 64000 128000

500 1000 2000 4000 8000

500 1000 2000 4000 8000

by_year <- gapminder %>%

plots <- map(by_year, ~ gap_plot(.x))

.md Prose and results

.html Human shareable

.md Prose and results

You might also like