Data Import::: Cheat Sheet

Data Import : : CHEAT SHEET
R’s tidyverse is built around tidy data stored

in tibbles, which are enhanced data frames.
Read Tabular Data - These functions share the common arguments: Data types
read_*(file, col_names = TRUE, col_types = NULL, locale = default_locale(), na = c("", "NA"), readr functions guess
The front side of this sheet shows the types of each column and
how to read text files into R with quoted_na = TRUE, comment = "", trim_ws = TRUE, skip = 0, n_max = Inf, guess_max = min(1000,
n_max), progress = interactive()) convert types when appropriate (but will NOT
readr. convert strings to factors automatically).
The reverse side shows how to Comma Delimited Files
a,b,c
A B C
read_csv("file.csv") A message shows the type of each column in the
create tibbles with tibble and to 1 2 3
result.
1,2,3 To make file.csv run:
layout tidy data with tidyr. 4 5 NA
4,5,NA write_file(x = "a,b,c\n1,2,3\n4,5,NA", path = "file.csv")
## Parsed with column specification:
## cols(
OTHER TYPES OF DATA A B C Semi-colon Delimited Files ## age = col_integer(), age is an
a;b;c read_csv2("file2.csv")
Try one of the following packages to import 1 2 3 ## sex = col_character(), integer
other types of files 1;2;3 4 5 NA write_file(x = "a;b;c\n1;2;3\n4;5;NA", path = "file2.csv") ## earn = col_double()
4;5;NA ## )
• haven - SPSS, Stata, and SAS files
Files with Any Delimiter sex is a
• readxl - excel files (.xls and .xlsx) earn is a double (numeric) character
• DBI - databases
A B C read_delim("file.txt", delim = "|")
a|b|c 1 2 3 write_file(x = "a|b|c\n1|2|3\n4|5|NA", path = "file.txt")
• jsonlite - json 1|2|3 4 5 NA 1. Use problems() to diagnose problems.
• xml2 - XML 4|5|NA Fixed Width Files x <- read_csv("file.csv"); problems(x)
• httr - Web APIs read_fwf("file.fwf", col_positions = c(1, 3, 5))
• rvest - HTML (Web Scraping) abc
A B C
write_file(x = "a b c\n1 2 3\n4 5 NA", path = "file.fwf")
1 2 3 2. Use a col_ function to guide parsing.
123 4 5 NA • col_guess() - the default
Save Data 4 5 NA
Tab Delimited Files
read_tsv("file.tsv") Also read_table(). • col_character()
• col_double(), col_euro_double()
write_file(x = "a\tb\tc\n1\t2\t3\n4\t5\tNA", path = "file.tsv")
Save x, an R object, to path, a file path, as: • col_datetime(format = "") Also
USEFUL ARGUMENTS col_date(format = ""), col_time(format = "")
Comma delimited file
write_csv(x, path, na = "NA", append = FALSE, Example file Skip lines • col_factor(levels, ordered = FALSE)
a,b,c 1 2 3
col_names = !append) write_file("a,b,c\n1,2,3\n4,5,NA","file.csv") read_csv(f, skip = 1) • col_integer()
1,2,3 4 5 NA
File with arbitrary delimiter f <- "file.csv" • col_logical()
4,5,NA
write_delim(x, path, delim = " ", na = "NA", • col_number(), col_numeric()
append = FALSE, col_names = !append) A B C No header A B C Read in a subset • col_skip()
CSV for excel
1 2 3
read_csv(f, col_names = FALSE) 1 2 3 read_csv(f, n_max = 1) x <- read_csv("file.csv", col_types = cols(
write_excel_csv(x, path, na = "NA", append =
4 5 NA A = col_double(),
Provide header B = col_logical(),
FALSE, col_names = !append) x y z
Missing Values C = col_factor()))
String to file
A B C read_csv(f, col_names = c("x", "y", "z")) A B C
write_file(x, path, append = FALSE)

1 2 3 NA 2 3 read_csv(f, na = c("1", "."))
4 5 NA 4 5 NA 3. Else, read in as character vectors then parse
String vector to file, one element per line with a parse_ function.
write_lines(x,path, na = "NA", append = FALSE) • parse_guess()
Object to RDS file Read Non-Tabular Data • parse_character()
• parse_datetime() Also parse_date() and
write_rds(x, path, compress = c("none", "gz", Read a file into a raw vector
"bz2", "xz"), ...) Read a file into a single string parse_time()
read_file(file, locale = default_locale()) read_file_raw(file)
Tab delimited files • parse_double()
Read each line into its own string Read each line into a raw vector • parse_factor()
write_tsv(x, path, na = "NA", append = FALSE,
col_names = !append) read_lines(file, skip = 0, n_max = -1L, na = character(), read_lines_raw(file, skip = 0, n_max = -1L, • parse_integer()
locale = default_locale(), progress = interactive()) progress = interactive()) • parse_logical()
Read Apache style log files • parse_number()
read_log(file, col_names = FALSE, col_types = NULL, skip = 0, n_max = -1, progress = interactive()) x$A <- parse_number(x$A)
RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn more at tidyverse.org • readr 1.1.0 • tibble 1.2.12 • tidyr 0.6.0 • Updated: 2021–03
Tibbles - an enhanced data frame Tidy Data with tidyr Split Cells
Tidy data is a way to organize tabular data. It provides a consistent data structure across packages.
The tibble package provides a new Use these functions to
A table is tidy if: Tidy data:
S3 class for storing tabular data, the A * B -> C split or combine cells
tibble. Tibbles inherit the data frame A B C A B C A B C A * B C into individual, isolated
class, but improve three behaviors: values.
• Subsetting - [ always returns a new tibble,
[[ and $ always return a vector.
& separate(data, col, into, sep = "[^[:alnum:]]
Each variable is in Each observation, or Makes variables easy Preserves cases during +", remove = TRUE, convert = FALSE,
• No partial matching - You must use full its own column case, is in its own row to access as vectors vectorized operations extra = "warn", fill = "warn", ...)
column names when subsetting
Separate each cell in a column to make
• Display - When you print a tibble, R provides a
concise view of the
Reshape Data - change the layout of values in a table several columns.
table3
data that fits on Use pivot_longer() and pivot_wider() to reorganize the values of a table into a new layout.
# A tibble: 234 × 6 country year rate country year cases pop
manufacturer model displ
one screen 1
<chr>
audi
<chr> <dbl>
a4 1.8 pivot_longer(data, cols, names_to = "name", pivot_wider(data, id_cols = NULL, names_from = name, A 1999 0.7K/19M A 1999 0.7K 19M
2 audi a4 1.8
3 audi a4 2.0 A 2000 2K/20M A 2000 2K 20M
4
5
audi
audi
a4
a4
2.0
2.8
names_prefix = NULL, names_sep = NULL, names_prefix = "", names_sep = "_", names_glue = NULL,
B 1999 37K/172M B 1999 37K 172
6
7
audi
audi
a4
a4
2.8
3.1 names_pattern = NULL, names_ptypes = list(), names_sort = FALSE, names_repair = "check_unique", B 2000 80K/174M B 2000 80K 174
names_transform = list(), names_repair = values_from = value, values_fill = NULL, values_fn = NULL, ...)
8 audi a4 quattro 1.8
w
w
9 audi a4 quattro 1.8
10 audi a4 quattro 2.0 C 1999 212K/1T C 1999 212K 1T
# ... with 224 more rows, and 3
# more variables: year <int>,
"check_unique", values_to = "value", values_drop_na = C 2000 213K/1T C 2000 213K 1T
# cyl <int>, trans <chr>
FALSE, values_ptypes = list(), values_transform = pivot_wider() pivots a names_from and a
list(), ...) values_from column into a rectangular field of separate(table3, rate, sep = "/",
tibble display cells. into = c("cases", "pop"))
pivot_longer() pivots cols columns, moving table2
156 1999 6 auto(l4)
column names into a names_to column, and
separate_rows(data, ..., sep = "[^[:alnum:].]
157 1999 6 auto(l4) country year type count country year cases pop
158 2008 6 auto(l4)
159 2008
160 1999
8 auto(s4)
4 manual(m5) column values into a values_to column. A 1999 cases 0.7K A 1999 0.7K 19M
161 1999
162 2008
4 auto(l4)
4 manual(m5) A 1999 pop 19M A 2000 2K 20M +", convert = FALSE)
163 2008
164 2008
4 manual(m5)
4 auto(l4) table4a A 2000 cases 2K B 1999 37K 172M
165 2008
166 1999
[ reached
4
4
auto(l4)
auto(l4)
getOption("max.print") country 1999 2000 country year cases A 2000 pop 20M B 2000 80K 174M Separate each cell in a column to make
A large table -- omitted 68 rows ]
A 0.7K 2K A 1999 0.7K B 1999 cases 37K C 1999 212K 1T several rows.
to display data frame display
B 37K 80K B 1999 37K B 1999 pop 172M C 2000 NA NA table3
C 212K 213K C 1999 212K B 2000 cases 80K
• Control the default appearance with options: A 2000 2K B 2000 pop 174M
country year rate country year rate
A 1999 0.7K/19M A 1999 0.7K
options(tibble.print_max = n, B
C
2000 80K
2000 213K
C
C
1999
1999
cases 212K
pop 1T
A 2000 2K/20M A 1999 19M
tibble.print_min = m, tibble.width = Inf) B 1999 37K/172M A 2000 2K
B 2000 80K/174M A 2000 20M
• View full data set with View() or glimpse() pivot_longer(table4a, cols = 2:3, pivot_wider(table2, names_from = type, C 1999 212K/1T B 1999 37K
names_to = "year", values_to = "cases") values_from = count) C 2000 213K/1T B 1999 172M
• Revert to data frame with as.data.frame() B 2000 80K
B 2000 174M
CONSTRUCT A TIBBLE IN TWO WAYS
tibble(…)
Handle Missing Values C
C
1999
1999
212K
1T
Both drop_na(data, ...) fill(data, ..., .direction = c("down", "up")) replace_na(data, C 2000 213K
Construct by columns.
replace = list(), ...)
C 2000 1T
make this Drop rows containing Fill in NA’s in … columns with most
tibble(x = 1:3, y = c("a", "b", "c")) tibble NA’s in … columns. recent non-NA values. Replace NA’s by column. separate_rows(table3, rate, sep = "/")
tribble(…) x x x
Construct by rows. A tibble: 3 × 2
x y
x1
A
x2
1
x1
A
x2
1
x1
A
x2
1
x1
A
x2
1
x1
A
x2
1
x1
A
x2
1 unite(data, col, ..., sep = "_", remove = TRUE)
tribble( ~x, ~y, <int> <chr> B NA D 3 B NA B 1 B NA B 2
Collapse cells across several columns to
1 1 a C NA C NA C 1 C NA C 2
1, "a", 2 2 b D 3 D 3 D 3 D 3 D 3 make a single column.
2, "b", 3 3 c E NA E NA E 3 E NA E 2
table5
3, "c")
drop_na(x, x2) fill(x, x2) replace_na(x, list(x2 = 2)) country century year country year
as_tibble(x, …) Convert data frame to tibble. Afghan 19 99 Afghan 1999
enframe(x, name = "name", value = "value") Expand Tables - quickly create tables with combinations of values Afghan
Brazil
20
19
00
99
Afghan
Brazil
2000
1999
Convert named vector to a tibble Brazil 20 00 Brazil 2000
complete(data, ..., fill = list()) expand(data, ...) China 19 99 China 1999
is_tibble(x) Test whether x is a tibble. Adds to the data missing combinations of the Create new tibble with all possible combinations China 20 00 China 2000
values of the variables listed in … of the values of the variables listed in … unite(table5, century, year,
complete(mtcars, cyl, gear, carb) expand(mtcars, cyl, gear, carb) col = "year", sep = "")
RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn more at tidyverse.org • readr 1.1.0 • tibble 1.2.12 • tidyr 0.6.0 • Updated: 2021–03

Data Import::: Cheat Sheet

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Data Import::: Cheat Sheet

Uploaded by

Copyright:

Available Formats

Data Import : : CHEAT SHEET

R’s tidyverse is built around tidy data stored

write_file(x, path, append = FALSE)

You might also like