Professional Documents
Culture Documents
Welcome To The Tidyverse: February 2019
Welcome To The Tidyverse: February 2019
tidyverse
February 2019
Hadley Wickham
@hadleywickham
Chief Scientist, RStudio
The tidyverse is a
language for solving
data science challenges
with R code.
Your turn: What data do we
need to recreate this plot?
We need five variables
# A tibble: 193 x 6
country four_regions year income life_exp pop
<chr> <chr> <int> <int> <dbl> <int>
1 Afghanistan asia 2015 1750 57.9 33700000
2 Albania europe 2015 11000 77.6 2920000
3 Algeria africa 2015 13700 77.3 39900000
4 Andorra europe 2015 46600 82.5 78000
5 Angola africa 2015 6230 64 27900000
6 Antigua and Barbuda americas 2015 20100 77.2 99900
7 Argentina americas 2015 19100 76.5 43400000
8 Armenia europe 2015 8180 75.4 2920000
9 Australia asia 2015 43800 82.6 23800000
10 Austria europe 2015 44100 81.4 8680000
# … with 183 more rows
This is “tidy” data
# A tibble: 193 x 6
country four_regions year income life_exp pop
<chr> <chr> <int> <int> <dbl> <int>
filter(year == 2015) -> filter rows where year equals 2015, creating
gapminder15 gapminder15 variable
80
70
life_exp
60
gapminder15 %>%
50 ggplot(aes(income, life_exp))
0 25000 50000 75000 100000 12500
income
● ●
● ●
●
● ● ● ● ●● ● ●
● ● ●● ●
● ● ●
● ● ●● ● ● ● ● ●
●
80 ●
●
● ● ●
●● ● ● ●
● ● ● ●
●
● ● ●
● ● ● ●●
● ● ●
● ● ● ● ● ● ●
● ● ●
●● ●●● ●
● ● ● ● ●
●●● ●
●
●
● ● ●
●● ● ●
● ● ●
●
●● ● ●
● ●● ●● ●
● ●
● ● ● ●
● ● ●
●● ● ●
70 ● ● ●●
● ●
life_exp
●● ● ●
●
●●● ● ● ●
● ●
●
●● ●
● ●● ● ●
● ●
●
●●●● ●●
●●
● ●
●
●
●
● ●
● ●
●
●●
60 ●●
●●●
●● ●
●● ●
●
●●
gapminder15 %>%
●
ggplot(aes(income, life_exp)) +
50 ● ● geom_point()
0 25000 50000 75000 100000 12500
income
● ●
● ●
●
● ● ● ● ●● ● ●
● ● ●● ●
● ● ●
● ● ●● ● ● ● ● ●
●
80 ●
●
● ● ●
●● ● ● ●
● ● ● ●
●
● ● ●
● ● ● ●●
● ● ●
● ● ● ● ● ● ●
● ● ●
●● ●●● ●
● ● ● ● ●
●●● ●
●
●
● ● ●
●● ● ●
● ● ●
●
●● ● ●
● ●● ●● ●
● ●
● ● ● ●
● ● ●
●● ● ●
70 ● ● ●●
● ●
life_exp
●● ● ●
●
●●● ● ● ●
● ●
●
●● ●
● ●● ● ●
● ●
●
●●●● ●●
●●
● ●
●
●
●
● ●
● ●
●
●●
60 ●●
●●●
●● ●
●● ●
●
●●
●
50 ● ●
●● ● ● ●
● ●
● ● ● ●
● ●
● ● ●
●
● ● ● ● ●
● ●
● ● ●
● ● ● ●
● ●
● ●
●● ●
● ●●
● ●
● ●
60 ● ● ●
gapminder15 %>%
● ● ●
● ●
● ● ●
● ●
●
●
ggplot(aes(income, life_exp)) +
geom_point() +
50 ● ● scale_x_log10()
1e+03 1e+04 1e+05
income
● ●
● ●
● ●●
● ● ● ● ● ●
●
● ● ●●
● ● ● ● ●
● ● ● ●●●
● ●
80 ●
●
● ● ●
●● ● ● ●
● ● ● ●
●
●
● ● ● ● ●
● ● ●● ● ●
● ● ● ●
● ● ●● ● ●
● ● ● ●●
● ●
● ●● ● ● ●●●
● ● ● ●
●
● ● ● ●
● ● ●
●
● ● ● ●
● ●● ●
●
●
●
● ●●
● ● ●
● ●
●
four_regions
● ● ● ●
70 ● ●
●
● ●
● africa
life_exp
●● ● ● ●
● ●● ● ●
● ●
● ●
● ● americas
● ●
● ● ● ● ●
●
● ● ● ● ● ● ● ● asia
● ● ●
● ●
●●
● ●● ●
● europe
● ●
● ●
60 ●
●
● ●
● ● ● ●
● ● ●
● ●
●
●
gapminder15 %>%
ggplot(aes(income, life_exp)) +
50 ● ●
geom_point(aes(colour = four_regions)) +
1e+03 1e+04 1e+05
scale_x_log10()
income
● ●
●
● ●●
● ● ● ● ● ● ●
●
● ● ●●
●
● ● ●
● ●● ● ● ●●● ●
80 ●●
● ● ●
● ●● ● ●
●● ● ●
●
●● ● ● ●● ●
● ●●●
●● ● ●
● ● ● ●
●
●
●
●
●
●
●
●
●
pop
●
● ● ●
● ●● ● ● ●
● ● ●
●
●
● ●
●
●
● ●
●
● 5e+08
●
● ● ●
●
●
●
●
●
●
●
●
●
●● ●
●
●●
● ●
● ●
● ● 1e+09
70 ● ●
●
● ●
●
life_exp
●● ● ●
●
●
● ● ● ● ● ●
four_regions
● ● ● ●
● ● ● ●
●
● ●
● ● ● ●● ● ● africa
● ● ●
● ●
●
●
● ●● ● ● americas
●
● ●
●
60 ●
●
● ● ● asia
● ● ● ●
● ● ●
●
● ● ● europe
●
gapminder15 %>%
ggplot(aes(income, life_exp)) +
50 ● ●
● ●
●
● ●●
●● ● ●●● ● ●
● ● ●●
●
● ● ● ● ● ● ●●● ●
● ●
80 ●●
● ● ●
● ●● ● ●
●● ● ●
●
●● ● ● ●● ●
● ●●●
●● ●
● ● ● ●
●
●
●
●
●●
●
●
●
●
pop
●
● ● ●
● ●● ● ● ●
● ● ●
●
●
● ●
●
●
● ●
●
● 5e+08
●
● ● ●
●
●
●
●
●
●
●
●
●
●● ●
●
●●
● ●
● ●
● ● 1e+09
70 ● ●
●
● ●
● ●
life_exp
●● ● ●
●
● ● ●
●
● ● ●
● ●
●
four_regions
● ● ● ●
●
● ●
● ● ● ●● ● ● africa
● ● ●
● ●
●
●
● ●● ● ● americas
●
● ●
●
60 ●
●
● ● ● asia
● ● ● ●
● ● ●
●
● ● ● europe
●
50 ● ●
● ●
●
●
●
● ●●● ●●● ●
●
●
●
● ● ●
● ● ●
● ●
●●
●
●●
●
●
●
● ●
●● ● ● ●●
●
●● ● ●
●
●
Life expectancy (years)
● ● ●
●
●
●
●
●
●●
●
●
● ● ● ● ●
●
●
● ●●
●
● ●
●
70 ●
●
●
●
●
● ● ●
●● ●
●
●
● ●● ● ● ● ●
● ● ● ●
●● ● ● ● ●
●● ●
●
●
●
● ●
● ● ●
●
●
●
●
● ● ●● ●
●
60 ● ● ● ●●
●
● ● ●
● ●
●
●● ●
●
50 ● ●
● ●
●
●
●
● ●●● ●●● ●
●
●
●
● ● ●
● ● ●
● ●
●●
●
●●
●
●
●
● ●
●● ● ● ●●
●
●● ● ●
●
●
Life expectancy (years)
● ● ●
●
●
●
●
●
●●
●
●
● ● ● ● ●
●
●
● ●●
●
● ●
●
70 ●
●
●
●
●
● ● ●
●● ●
●
●
● ●● ● ● ● ●
● ● ● ●
●● ● ● ● ●
●● ●
●
●
●
● ●
● ● ●
●
●
●
●
● ● ●● ●
●
60 ● ● ● ●●
●
● ● ●
● ●
●
●● ●
●
50 ● ●
data %>%
arrange(desc(pop)) %>%
ggplot(aes(income, life_exp)) +
...
You can also turn your code into a function
gap_plot <- function(data) {
data %>%
arrange(desc(pop)) %>%
ggplot(aes(income, life_exp)) +
geom_point(aes(fill = four_regions, size = pop), shape = 21) +
scale_x_log10(breaks = 2^(-1:7) * 1000) +
scale_size(range = c(1, 20), guide = FALSE) +
scale_fill_manual(values = c(
africa = "#60D2E6",
americas = "#9AE847",
asia = "#EC6475",
europe = "#FBE84D"
)) +
labs(
x = "Income (GDP / capita)",
y = "Life expectancy",
fill = "Region"
)
}
gapminder %>%
80
filter(country == "New Zealand") %>%
gap_plot()
70
Life expectancy
60 Region
● asia
50
●
●●
●
●
●
●●
●
●
●
●
40 ●
●
●
●
●●●
●
●
●●●
●
●
● ● ● ● ● ● ● ● ● ●●
●●●
● ● ● ● ● ● ●●●● ●
●●●
●●
●
1000 2000 4000 8000 16000 32000
Income (GDP / capita)
gapminder %>% ●
● ●
50 filter(year == 1900) %>% ●
●
● ●
gap_plot() ● ● ●
●
●
●
●●
●
●
●
●
●
●
●
●
●● ● Region
Life expectancy
40 ● ●
● ● ● ●
●● ●
● africa
● ●●
●
● ● ●
●
●●
● ●
●
● ● americas
●
●
●
●● ●
●
●
●
●
●●● ●
●
● ● ●
●●● ●● ●
●
●
● ● ● asia
● ● ●● ●● ●
●
●
● ●
●●
●● ●
● ●
●
●
● ●
●
●
●
●
● ● ● ● ●●
●
● ●● ● ●
● ●
●
●
●● ●● ● europe
30
● ●
●
● ●
●
●
● ●●
●●
●
● ●
●●
●
● ●
● ●
●●
●
●
●
●
●
●
●
●● ●
●●
●● ●
●
●●
●
● ●
● ● ● ● ●
● ●
●
●
20
●●
gap_plot() ●
●
●
●
●
●
●
● ● ● ●
● ● ● ●
40
● ●
● ●
● ●
●
● Region
Life expectancy
●
●
● ● ● ●
●
●
●
● ●
●
●
●● ●
●
●● ● africa
● ●
●
● ●
●
● ● ● ● ● ● ● ●
● ●●●●
●
● ●●●
● ●
●●
●
●
●●
●
● ● ●●
● ●●
●
● ● ●
● ●
● ● ● ● americas
●
●● ●
● ●●● ● ●● ●
● ● ●● ●
30
● ●
●
● ●
●
● ● ●●
● ●
●
●
● ●
● ●● ●
●
●
●
●
●
● asia
● ● ●●
● ●
●
●
●
● ●
●
● ●
●●●● ●
● ●
●
●
●● ● ● europe
●
● ●
●
● ●
20 ●●
●
10
haven
xml2
tidyverse.org
Tidy
tibble
Transform
dplyr lubridate
Model
tidyr forcats stringr
hms
recipes
parsnip
purrr
magrittr
r4ds.had.co.nz
Program
The tidyverse is a
language for solving
data science challenges
with R code.
The disadvantages of code are obvious
Why code?
1. Code is text
2. Code is read-able
3. Code is reproducible
⌘C
⌘V
Copy
Paste
Why code?
1. Code is text
2. Code is read-able
3. Code is reproducible
What have you done?
Why code?
1. Code is text
2. Code is read-able
3. Code is reproducible
.Rmd Prose and code
.doc .tex
.html ...
.ppt .pdf
Other skills
SQL marketing
git(hub)
Conclusions
We wanted users to be able to begin in an interactive
environment, where they did not consciously think of
themselves as programming. Then as their needs became
clearer and their sophistication increased, they should be
able to slide gradually into programming, when the
language and system aspects would become more
important.
— John Chambers, “Stages in the Evolution of S”
https://unsplash.com/photos/8HyrGTYPQ68 — Eric Muhr
c c e ss
su P i t o f
https://simplystatistics.org/2018/07/12/use-r-keynote-2018/
# A tibble: 153 x 6
Ozone Solar.R Wind Temp Month Day
<int> <int> <dbl> <int> <int> <int>
1 41 190 7.4 67 5 1
2 36 118 8 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9
10 NA 194 8.6 69 5 10
# … with 143 more rows
library(dplyr)
airquality %>%
group_by(Month) %>%
summarize(o3 = mean(Ozone, na.rm = TRUE))
aggregate(
airquality[“Ozone”],
airquality[“Month”],
mean, na.rm = TRUE
)
Why doesn’t
airquality$Ozone work?
aggregate(
airquality[“Ozone”],
airquality[“Month”],
mean, na.rm = TRUE
)
aggregate(
airquality[“Ozone”],
airquality[“Month”],
mean, na.rm = TRUE
) Passing a function to another function
aggregate(
airquality[“Ozone”],
airquality[“Month”],
mean, na.rm = TRUE
) Argument to mean, not to aggregate()
Import Visualise
readr ggplot2
readxl
haven
xml2
tidyverse.org
Tidy Transform
tibble dplyr lubridate
Model
tidyr forcats stringr
hms
recipes
parsnip
purrr
magrittr
r4ds.had.co.nz
Program
This work is licensed as
Creative Commons
Attribution-ShareAlike 4.0
International
To view a copy of this license, visit
https://creativecommons.org/licenses/by-sa/4.0/