Professional Documents
Culture Documents
CleaningData Chapter 3
CleaningData Chapter 3
Type conversions
Cleaning Data in R
Types of variables in R
Cleaning Data in R
Types of variables in R
> class("hello")
[1] "character"
> class(3.844)
[1] "numeric"
> class(77L)
[1] "integer"
> class(factor("yes"))
[1] "factor"
> class(TRUE)
[1] "logical"
Cleaning Data in R
Type conversions
> as.character(2016)
[1] "2016"
> as.numeric(TRUE)
[1] 1
> as.integer(99)
[1] 99
> as.factor("something")
[1] something
Levels: something
> as.logical(0)
[1] FALSE
Cleaning Data in R
Overview of lubridate
Cleaning Data in R
CLEANING DATA IN R
Let's practice!
CLEANING DATA IN R
String manipulation
Cleaning Data in R
Overview of stringr
Cleaning Data in R
Cleaning Data in R
Cleaning Data in R
CLEANING DATA IN R
Let's practice!
CLEANING DATA IN R
Missing and
special values
Cleaning Data in R
Missing values
In R, represented as NA
#N/A (Excel)
Empty string
Cleaning Data in R
Special values
1/0
1/0 + 1/0
33333^33333
0/0
1/0 - 1/0
Cleaning Data in R
Cleaning Data in R
to find NAs
B
Min.
: 3.0
1st Qu.:13.0
Median :23.0
Mean
:38.0
3rd Qu.:55.5
Max.
:88.0
NA's
:1
Min.
: 1.00
1st Qu.: 1.75
Median : 2.50
Mean
:12.75
3rd Qu.:13.50
Max.
:45.00
Cleaning Data in R
CLEANING DATA IN R
Let's practice!
CLEANING DATA IN R
Outliers and
obvious errors
Cleaning Data in R
Outliers
# Simulate some data
> set.seed(10)
> x <- c(rnorm(30, mean = 15, sd = 5), -5, 28, 35)
# View a boxplot
> boxplot(x, horizontal = TRUE)
Outliers
Cleaning Data in R
Outliers
Several causes
Valid measurements
Variability in measurement
Experimental error
Cleaning Data in R
Obvious errors
What if these values are supposed to represent ages?
Cleaning Data in R
Obvious errors
Several causes
Measurement error
Cleaning Data in R
Min.
:-1.0
1st Qu.:40.3
Median :48.5
Mean
:47.8
3rd Qu.:56.3
Max.
:75.1
Cleaning Data in R
Cleaning Data in R
CLEANING DATA IN R
Let's practice!