
Data Import :: CHEAT SHEET

R’s tidyverse is built around tidy data stored in tibbles, which are enhanced data frames.

The front side of this sheet shows how to read text files into R with readr.

The reverse side shows how to create tibbles with tibble and to lay out tidy data with tidyr.

OTHER TYPES OF DATA
Try one of the following packages to import other types of files
• haven - SPSS, Stata, and SAS files
• readxl - Excel files (.xls and .xlsx)
• DBI - databases
• jsonlite - JSON
• xml2 - XML
• httr - Web APIs
• rvest - HTML (Web Scraping)

Save Data
Save x, an R object, to path, a file path, as:

Comma delimited file
  write_csv(x, path, na = "NA", append = FALSE, col_names = !append)
File with arbitrary delimiter
  write_delim(x, path, delim = " ", na = "NA", append = FALSE, col_names = !append)
CSV for Excel
  write_excel_csv(x, path, na = "NA", append = FALSE, col_names = !append)
String to file
  write_file(x, path, append = FALSE)
String vector to file, one element per line
  write_lines(x, path, na = "NA", append = FALSE)
Object to RDS file
  write_rds(x, path, compress = c("none", "gz", "bz2", "xz"), ...)
Tab delimited files
  write_tsv(x, path, na = "NA", append = FALSE, col_names = !append)

Read Tabular Data - These functions share the common arguments:
  read_*(file, col_names = TRUE, col_types = NULL, locale = default_locale(), na = c("", "NA"), quoted_na = TRUE, comment = "", trim_ws = TRUE, skip = 0, n_max = Inf, guess_max = min(1000, n_max), progress = interactive())

Comma Delimited Files
  read_csv("file.csv")
  To make file.csv run:
  write_file(x = "a,b,c\n1,2,3\n4,5,NA", path = "file.csv")

  file.csv      result
  a,b,c         A B C
  1,2,3         1 2 3
  4,5,NA        4 5 NA

Semi-colon Delimited Files
  read_csv2("file2.csv")
  write_file(x = "a;b;c\n1;2;3\n4;5;NA", path = "file2.csv")

  file2.csv     result
  a;b;c         A B C
  1;2;3         1 2 3
  4;5;NA        4 5 NA

Files with Any Delimiter
  read_delim("file.txt", delim = "|")
  write_file(x = "a|b|c\n1|2|3\n4|5|NA", path = "file.txt")

  file.txt      result
  a|b|c         A B C
  1|2|3         1 2 3
  4|5|NA        4 5 NA

Fixed Width Files
  read_fwf("file.fwf", col_positions = c(1, 3, 5))
  write_file(x = "a b c\n1 2 3\n4 5 NA", path = "file.fwf")

  file.fwf      result
  a b c         A B C
  1 2 3         1 2 3
  4 5 NA        4 5 NA

Tab Delimited Files
  read_tsv("file.tsv") Also read_table().
  write_file(x = "a\tb\tc\n1\t2\t3\n4\t5\tNA", path = "file.tsv")

USEFUL ARGUMENTS

Example file
  write_file("a,b,c\n1,2,3\n4,5,NA","file.csv")
  f <- "file.csv"

  a,b,c
  1,2,3
  4,5,NA

Skip lines
  read_csv(f, skip = 1)

  1 2 3
  4 5 NA

No header
  read_csv(f, col_names = FALSE)

  A B C
  1 2 3
  4 5 NA

Read in a subset
  read_csv(f, n_max = 1)

  A B C
  1 2 3

Provide header
  read_csv(f, col_names = c("x", "y", "z"))

  x y z
  A B C
  1 2 3
  4 5 NA

Missing Values
  read_csv(f, na = c("1", "."))

  A  B C
  NA 2 3
  4  5 NA

Read Non-Tabular Data

Read a file into a single string
  read_file(file, locale = default_locale())
Read each line into its own string
  read_lines(file, skip = 0, n_max = -1L, na = character(), locale = default_locale(), progress = interactive())
Read Apache style log files
  read_log(file, col_names = FALSE, col_types = NULL, skip = 0, n_max = -1, progress = interactive())
Read a file into a raw vector
  read_file_raw(file)
Read each line into a raw vector
  read_lines_raw(file, skip = 0, n_max = -1L, progress = interactive())

Data types
readr functions guess the types of each column and convert types when appropriate (but will NOT convert strings to factors automatically).

A message shows the type of each column in the result.

  ## Parsed with column specification:
  ## cols(
  ##   age = col_integer(),
  ##   sex = col_character(),
  ##   earn = col_double()
  ## )

  age is an integer
  sex is a character
  earn is a double (numeric)

1. Use problems() to diagnose problems.
  x <- read_csv("file.csv"); problems(x)

2. Use a col_ function to guide parsing.
  • col_guess() - the default
  • col_character()
  • col_double(), col_euro_double()
  • col_datetime(format = "") Also col_date(format = ""), col_time(format = "")
  • col_factor(levels, ordered = FALSE)
  • col_integer()
  • col_logical()
  • col_number(), col_numeric()
  • col_skip()

  x <- read_csv("file.csv", col_types = cols(
    A = col_double(),
    B = col_logical(),
    C = col_factor()))

3. Else, read in as character vectors then parse with a parse_ function.
  • parse_guess()
  • parse_character()
  • parse_datetime() Also parse_date() and parse_time()
  • parse_double()
  • parse_factor()
  • parse_integer()
  • parse_logical()
  • parse_number()

  x$A <- parse_number(x$A)
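The three Data types steps (check problems(), guide parsing with col_ functions, or read as character and parse afterwards) can be sketched together. This is a minimal sketch, not the sheet's own example: the temporary file path and the choice of column types are assumptions, and the column names a, b, c follow the lowercase header of the example file.

```r
library(readr)

# Assumption: write the example file to a temporary path so the
# working directory is left untouched.
f <- tempfile(fileext = ".csv")
write_file("a,b,c\n1,2,3\n4,5,NA", f)

# Step 2: guide parsing with one col_ function per column.
x <- read_csv(f, col_types = cols(
  a = col_double(),
  b = col_integer(),
  c = col_character()
))

# Step 1: problems() returns a tibble of parse failures
# (it has zero rows when nothing failed).
p <- problems(x)

# Step 3: read every column as character, then parse column by column.
y <- read_csv(f, col_types = cols(.default = col_character()))
y$a <- parse_number(y$a)
y$b <- parse_integer(y$b)
```

Reading as character first is the slower route, but it lets each column be inspected before a type is committed to.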
RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn more at tidyverse.org • readr 1.1.0 • tibble 1.2.12 • tidyr 0.6.0 • Updated: 2021–03
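The write_* and read_* functions on the front side pair up into a round trip. The sketch below is illustrative only: the data frame df and the temporary paths are assumptions, and the path argument is passed by position.

```r
library(readr)

# Hypothetical data with the same shape as the sheet's example file.
df <- data.frame(a = c(1, 4), b = c(2, 5), c = c(3, NA))

# Comma delimited round trip: missing values are written as "NA".
csv_path <- tempfile(fileext = ".csv")
write_csv(df, csv_path)
csv_lines <- read_lines(csv_path)   # one string per line of the file
back_csv <- read_csv(csv_path, col_types = cols(.default = col_double()))

# The same round trip with an arbitrary delimiter.
txt_path <- tempfile(fileext = ".txt")
write_delim(df, txt_path, delim = "|")
back_txt <- read_delim(txt_path, delim = "|",
                       col_types = cols(.default = col_double()))
```

Supplying col_types on the way back in keeps the column types from depending on what readr guesses from the file.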

Tibbles - an enhanced data frame
The tibble package provides a new S3 class for storing tabular data, the tibble. Tibbles inherit the data frame class, but improve three behaviors:
• Subsetting - [ always returns a new tibble, [[ and $ always return a vector.
• No partial matching - You must use full column names when subsetting
• Display - When you print a tibble, R provides a concise view of the data that fits on one screen

tibble display

  # A tibble: 234 × 6
     manufacturer model      displ
     <chr>        <chr>      <dbl>
   1 audi         a4           1.8
   2 audi         a4           1.8
   3 audi         a4           2.0
   4 audi         a4           2.0
   5 audi         a4           2.8
   6 audi         a4           2.8
   7 audi         a4           3.1
   8 audi         a4 quattro   1.8
   9 audi         a4 quattro   1.8
  10 audi         a4 quattro   2.0
  # ... with 224 more rows, and 3
  #   more variables: year <int>,
  #   cyl <int>, trans <chr>

data frame display (A large table to display)

  156 1999 6 auto(l4)
  157 1999 6 auto(l4)
  158 2008 6 auto(l4)
  159 2008 8 auto(s4)
  160 1999 4 manual(m5)
  161 1999 4 auto(l4)
  162 2008 4 manual(m5)
  163 2008 4 manual(m5)
  164 2008 4 auto(l4)
  165 2008 4 auto(l4)
  166 1999 4 auto(l4)
  [ reached getOption("max.print") -- omitted 68 rows ]

• Control the default appearance with options:
  options(tibble.print_max = n, tibble.print_min = m, tibble.width = Inf)
• View full data set with View() or glimpse()
• Revert to data frame with as.data.frame()

CONSTRUCT A TIBBLE IN TWO WAYS

tibble(…)
  Construct by columns.
  tibble(x = 1:3, y = c("a", "b", "c"))

tribble(…)
  Construct by rows.
  tribble( ~x, ~y,
            1, "a",
            2, "b",
            3, "c")

Both make this tibble:

  A tibble: 3 × 2
        x y
    <int> <chr>
  1     1 a
  2     2 b
  3     3 c

as_tibble(x, …) Convert data frame to tibble.
enframe(x, name = "name", value = "value") Convert named vector to a tibble
is_tibble(x) Test whether x is a tibble.

Tidy Data with tidyr
Tidy data is a way to organize tabular data. It provides a consistent data structure across packages.

A table is tidy if:
• Each variable is in its own column
• Each observation, or case, is in its own row

Tidy data:
• Makes variables easy to access as vectors
• Preserves cases during vectorized operations (A * B -> C)

Reshape Data - change the layout of values in a table
Use pivot_longer() and pivot_wider() to reorganize the values of a table into a new layout.

pivot_longer(data, cols, names_to = "name", names_prefix = NULL, names_sep = NULL, names_pattern = NULL, names_ptypes = list(), names_transform = list(), names_repair = "check_unique", values_to = "value", values_drop_na = FALSE, values_ptypes = list(), values_transform = list(), ...)
pivot_longer() pivots cols columns, moving column names into a names_to column, and column values into a values_to column.

  table4a                  result
  country 1999 2000        country year cases
  A       0.7K 2K          A       1999 0.7K
  B       37K  80K         B       1999 37K
  C       212K 213K        C       1999 212K
                           A       2000 2K
                           B       2000 80K
                           C       2000 213K

  pivot_longer(table4a, cols = 2:3, names_to = "year", values_to = "cases")

pivot_wider(data, id_cols = NULL, names_from = name, names_prefix = "", names_sep = "_", names_glue = NULL, names_sort = FALSE, names_repair = "check_unique", values_from = value, values_fill = NULL, values_fn = NULL, ...)
pivot_wider() pivots a names_from and a values_from column into a rectangular field of cells.

  table2                        result
  country year type  count      country year cases pop
  A       1999 cases 0.7K       A       1999 0.7K  19M
  A       1999 pop   19M        A       2000 2K    20M
  A       2000 cases 2K         B       1999 37K   172M
  A       2000 pop   20M        B       2000 80K   174M
  B       1999 cases 37K        C       1999 212K  1T
  B       1999 pop   172M       C       2000 NA    NA
  B       2000 cases 80K
  B       2000 pop   174M
  C       1999 cases 212K
  C       1999 pop   1T

  pivot_wider(table2, names_from = type, values_from = count)

Handle Missing Values

drop_na(data, ...)
  Drop rows containing NA’s in … columns.
fill(data, ..., .direction = c("down", "up"))
  Fill in NA’s in … columns with most recent non-NA values.
replace_na(data, replace = list(), ...)
  Replace NA’s by column.

  x          drop_na(x, x2)    fill(x, x2)    replace_na(x, list(x2 = 2))
  x1 x2      x1 x2             x1 x2          x1 x2
  A  1       A  1              A  1           A  1
  B  NA      D  3              B  1           B  2
  C  NA                        C  1           C  2
  D  3                         D  3           D  3
  E  NA                        E  3           E  2

Expand Tables - quickly create tables with combinations of values

complete(data, ..., fill = list())
  Adds to the data missing combinations of the values of the variables listed in …
  complete(mtcars, cyl, gear, carb)
expand(data, ...)
  Create new tibble with all possible combinations of the values of the variables listed in …
  expand(mtcars, cyl, gear, carb)

Split Cells
Use these functions to split or combine cells into individual, isolated values.

separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn", ...)
Separate each cell in a column to make several columns.

  table3                     result
  country year rate          country year cases pop
  A       1999 0.7K/19M      A       1999 0.7K  19M
  A       2000 2K/20M        A       2000 2K    20M
  B       1999 37K/172M      B       1999 37K   172M
  B       2000 80K/174M      B       2000 80K   174M
  C       1999 212K/1T       C       1999 212K  1T
  C       2000 213K/1T       C       2000 213K  1T

  separate(table3, rate, sep = "/", into = c("cases", "pop"))

separate_rows(data, ..., sep = "[^[:alnum:].]+", convert = FALSE)
Separate each cell in a column to make several rows.

  table3                     result
  country year rate          country year rate
  A       1999 0.7K/19M      A       1999 0.7K
  A       2000 2K/20M        A       1999 19M
  B       1999 37K/172M      A       2000 2K
  B       2000 80K/174M      A       2000 20M
  C       1999 212K/1T       B       1999 37K
  C       2000 213K/1T       B       1999 172M
                             B       2000 80K
                             B       2000 174M
                             C       1999 212K
                             C       1999 1T
                             C       2000 213K
                             C       2000 1T

  separate_rows(table3, rate, sep = "/")

unite(data, col, ..., sep = "_", remove = TRUE)
Collapse cells across several columns to make a single column.

  table5                     result
  country century year       country year
  Afghan  19      99         Afghan  1999
  Afghan  20      00         Afghan  2000
  Brazil  19      99         Brazil  1999
  Brazil  20      00         Brazil  2000
  China   19      99         China   1999
  China   20      00         China   2000

  unite(table5, century, year, col = "year", sep = "")
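The Split Cells functions can be tried on small hand-built tables. This is a sketch under assumptions: table3 and table5 are rebuilt here with tribble() using the abbreviated values shown on the sheet, so they are illustrative stand-ins rather than the datasets shipped with tidyr.

```r
library(tibble)
library(tidyr)

# Hypothetical stand-in for the sheet's table3 (first three rows only).
table3 <- tribble(
  ~country, ~year, ~rate,
  "A",      1999,  "0.7K/19M",
  "A",      2000,  "2K/20M",
  "B",      1999,  "37K/172M"
)

# One cell becomes several columns.
split_cols <- separate(table3, rate, sep = "/", into = c("cases", "pop"))

# One cell becomes several rows.
split_rows <- separate_rows(table3, rate, sep = "/")

# Hypothetical stand-in for the sheet's table5 (first two rows only).
table5 <- tribble(
  ~country, ~century, ~year,
  "Afghan", "19",     "99",
  "Afghan", "20",     "00"
)

# Several columns collapse into one.
united <- unite(table5, century, year, col = "year", sep = "")
```

Both separate() and separate_rows() leave the pieces as character vectors here because convert = FALSE is the default.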
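The Reshape Data and Handle Missing Values functions on the reverse side can be exercised the same way. In this minimal sketch the tables are hand-built with tribble() and tibble() from the values printed on the sheet; they are assumptions for illustration, not data that ships with tidyr.

```r
library(tibble)
library(tidyr)

# Hypothetical stand-in for the sheet's table4a.
table4a <- tribble(
  ~country, ~`1999`, ~`2000`,
  "A",      "0.7K",  "2K",
  "B",      "37K",   "80K",
  "C",      "212K",  "213K"
)

# Column names move into "year", cell values into "cases".
long <- pivot_longer(table4a, cols = 2:3,
                     names_to = "year", values_to = "cases")

# pivot_wider() reverses the operation.
wide <- pivot_wider(long, names_from = year, values_from = cases)

# The table x used in the Handle Missing Values section.
x <- tibble(x1 = c("A", "B", "C", "D", "E"),
            x2 = c(1, NA, NA, 3, NA))

dropped  <- drop_na(x, x2)               # keep only rows where x2 is present
filled   <- fill(x, x2)                  # carry the last non-NA value down
replaced <- replace_na(x, list(x2 = 2))  # substitute a fixed value
```

The row order of long may differ from the order drawn on the sheet, so code that depends on order should sort explicitly.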
