58.tidy Data in R For Linguists

BRADLEY RENTZ
T I D Y D ATA
What is tidy data?
Tidy data have the following characteristics:
• Observations are in rows

• Variables are in columns
• Contained in a single dataset
An example of tidy data
Participant Gender Trial Value

01 M 1 100
01 M 2 210
02 F 1 50
02 F 2 75
An example of messy data
Participant Trial1 Trial2

M01 100 210
F02 50 75
R Packages for tidy data
library("tidyverse")
## Loading tidyverse: ggplot2

## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
4 bradley rentz
## Conflicts with tidy packages ----------------
## filter(): dplyr, stats

## lag(): dplyr, stats
Workflow for data wrangling
• Import data
• Change from wide data to long
• Split columns
• Fix typos or irregulaties
• Create new variables as needed
Import messy data
data.messy <- read_csv("messy_data.csv")
## Parsed with column specification:

## cols(
## participant = col_character(),
## vowel = col_character(),
## word = col_character(),
## f2_trial1 = col_integer(),
## f1_trial4 = col_integer()
## )
Find the problems
participant vowel word f2_trial1 f2_trial2 f2_trial3

f.29 a apple 1736 1742 1730
f.30 a apple 1766 1783 1744
m.27 a apple 1780 1803 1763
m.18 a apple 1701 1707 1687
6 bradley rentz
participant vowel word f2_trial1 f2_trial2 f2_trial3

m.29 a apple 1693 1697 1689
f.34 a apple 1768 1779 1741
f.19 a apple 1758 1776 1755
Explore the data more
f2_trial4 f1_trial1 f1_trial2 f1_trial3 f1_trial4

1737 700 693 713 684
1768 710 713 747 694
1780 682 685 708 656
1704 715 714 735 701
1696 667 684 706 657
1773 683 679 693 659
1760 758 748 777 734
What makes that dataset messy?
Why is that data messy? How does it violate the tidy data rules?
Problem 1: The columns contain observations
The first problem is the f2_trial1, f2_trial2, etc., represent observations

instead of variables. We want to convert all of these columns to just
three columns: trial (1 or 2), measurement type (F1 or F2), and value.
To do this we use gather() from the package tidyr, which con-
verts from a wide format to long format.
data.long <- gather(data.messy,

key=measurement,
value=value,
-participant,-vowel,-word)
Long data
participant vowel word measurement value

f.29 a apple f2_trial1 1736
f.19 u boot f2_trial1 737
m.30 u boot f2_trial2 642
f.48 u boot f1_trial2 279
f.34 u Hoop f1_trial4 224
Problem 2: Two variables in one column
The next problem that we have to deal with is that two of our columns
have two variables in them. Participant has both gender and the par-
ticipant number. Measurment has both the trial and the measurement
type (F1 or F2). To separate the columns we use separate() from
8 bradley rentz
tidyr.
Separating columns
# separate measurement
data.long <- separate(data.long,
measurement,
c("measurement","trial"),
sep="_")
# separate participant (note: for . need regex \.
# but need to escape the \ with a \, so \\. )
# convert=T allows the new columns to have different data types
data.long <- separate(data.long,
participant,
c("gender","participant"),
sep="\\.",
convert=T)
Almost tidy data
gender participant vowel word measurement trial value

f 29 a apple f2 trial1 1736
m 27 u hoop f2 trial1 731
f 30 u hoop f2 trial2 614
f 37 u boot f2 trial3 390
m 18 u boot f2 trial4 588
f 37 u hoop f1 trial1 238
f 21 a apple f1 trial4 727
Problem 3: Irregular spellings
When entering data by hand, there are sometime irregularities in

spelling, such a variation in capitalizations, typos, or extra spaces.
To check for irregulaties, we convert all of our string variables to
factors and then check what levels of the factors exist.
data.long$gender <- as.factor(data.long$gender)

data.long$participant <- as.factor(data.long$participant)
data.long$word <- as.factor(data.long$word)
data.long$vowel <- as.factor(data.long$vowel)
data.long$measurement <- as.factor(data.long$measurement)
tidy data 9
data.long$trial <- as.factor(data.long$trial)
Checking factors
To check the levels of each factor, we can use the summary() com-
mand. What problems in spelling are there?
summary(data.long[,1:6])
## gender participant vowel word

## f:256 29 : 96 a:192 apple:112
## m:192 30 : 64 A: 32 at :112
## 18 : 32 u:224 boot :104
## 19 : 32 Boot : 8
## 21 : 32 hoop :104
## 27 : 32 Hoop : 8
## (Other):160
## measurement trial
## f1:224 trial1:112
## f2:224 trial2:112
## trial3:112
## trial4:112
##
##
##
Using stringr to fix strings
Now that we have identified the errors, we can fix them with str_replace()
from stringr.
library(stringr)
data.long$word <- str_replace(data.long$word,"Hoop","hoop")
data.long$word <- str_replace(data.long$word,"Boot","boot")
data.long$vowel <- str_replace(data.long$vowel,"A","a")
Clean data
Let’s check to see if there are other errors.
data.long$gender <- as.factor(data.long$gender)

data.long$participant <- as.factor(data.long$participant)
data.long$word <- as.factor(data.long$word)
10 bradley rentz
data.long$vowel <- as.factor(data.long$vowel)

data.long$measurement <- as.factor(data.long$measurement)
data.long$trial <- as.factor(data.long$trial)
summary(data.long[,1:6])
## gender participant vowel word

## f:256 29 : 96 a:224 apple:112
## m:192 30 : 64 u:224 at :112
## 18 : 32 boot :112
## 19 : 32 hoop :112
## 21 : 32
## 27 : 32
## (Other):160
## measurement trial
## f1:224 trial1:112
## f2:224 trial2:112
## trial3:112
## trial4:112
##
##
##
Exploring dplyr functions
Now that we have clean data we can process and explore our data
with the dplyr functions filter(), arrange(), select(), summarise(),
group_by(), and mutate(). We will also use the pipe %>% from the
magrittr package to make using these (and other) functions easier.
The pipe
The pipe %>% allows us to nest code very easily is read as then.
For example, let’s say we want to select only a few columns of
a dataframe, then select row values that are greater than a certain
value, then mean for each variable by participant.
The pseudo-code would like this:
new.df <- old.df %>% select columns that we want %>%

filter the row values that we want %>%
group by participant %>%
calculate the mean for each variable
Selecting columns with select()
The function select() from dplyr is used to select certain columns

from a dataframe. Columns can selected individually, excluded indi-
vidually by using - before the name, or selected as a range with :, as
in first_column:last_column.
Let’s select the columns participant, vowel, measurement, trial,
and value.
data.selected <- data.long %>%

select(participant, vowel, measurement, trial, value)
# or can select those by excluding the other columns with -
data.selected2 <- data.long %>%
select(-gender,-word)
12 bradley rentz
Arranging values
The arrange() function from dplyr is in a sort function. You can use
it to sort by one column or by multiple columns. It can also sort in
descending order by using desc() arround the column name.
Let’s sort the data.selected by participant, then vowel, then de-
scending order of trial.
data.sorted <- data.selected %>%

arrange(participant, vowel, desc(trial))
Filtering rows
The function filter() from dplyr allows us to subset the data by

selecting rows based on whatever logical rule we want.
For this example let’s keep on the rows where the value is greater
than or equal to 500 and the participant is not 19.
data.filtered <- data.sorted %>%

filter(participant != "19" & value >= 500)
Logical values
The logical values that can be used are
== equal to
!= not equal to
< less than
<= less than or equal to
> greater than
>= greater than or equal too
& AND
| OR
Adding new columns
The mutate() function of dplyr allows us to add columns to our

dataframe based on calculations run on other columns. It is a 1:1
calculation. As an example let’s say we want to calculate the z-score
for the values we measured. To do this we have to first use a helper
function called group_by(). This function groups the data in groups
first before doing other calculations. Let’s calculate the z-score for
each value grouped by participant and measurement type (F1 or F2).
tidy data 13
data.zscore <- data.filtered %>%

group_by(participant, measurement) %>%
mutate(z.score = (value - mean(value))/sd(value))
Z score
participant vowel measurement trial value z.score

18 a f2 trial4 1704 0.9467597
18 a f2 trial4 1734 0.9975243
18 a f1 trial4 701 -1.2665945
18 a f1 trial4 731 -0.2533189
18 a f2 trial3 1687 0.9179931
18 a f2 trial3 1707 0.9518362
Summarizing data
The summarise() function from dplyr also creates new columns;

however, it is not a 1:1 function. The function instead creates a re-
duced output that summarises the data. It also only returns as many
columns as is necessary to display the results.
An example application would be to show the means, standard
deviations, or similar statistics.
data.global.means <- data.filtered %>%

summarise(global.means = mean(value))
data.global.means
## # A tibble: 1 × 1
## global.means
## <dbl>
## 1 1050.686
Grouped central tendencies
data.grouped.ct <- data.filtered %>%

group_by(participant, vowel, measurement) %>%
summarise(means = mean(value), sds = sd(value),
medians = median(value))
head(data.grouped.ct)
## Source: local data frame [6 x 6]

14 bradley rentz
## Groups: participant, vowel [4]

##
## participant vowel measurement means
## <fctr> <fctr> <fctr> <dbl>
## 1 18 a f1 738.500
## 2 18 a f2 1716.375
## 3 18 u f2 572.625
## 4 21 a f1 751.625
## 5 21 a f2 1677.125
## 6 21 u f2 744.000
## # ... with 2 more variables: sds <dbl>,
## # medians <dbl>

58.tidy Data in R For Linguists

Uploaded by

Copyright:

Available Formats

You might also like

58.tidy Data in R For Linguists

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

58.tidy Data in R For Linguists

Uploaded by

Copyright:

Available Formats

BRADLEY RENTZ

Tidy data have the following characteristics:

• Observations are in rows

An example of tidy data

Participant Gender Trial Value

An example of messy data

Participant Trial1 Trial2

R Packages for tidy data

## Loading tidyverse: ggplot2

## Conflicts with tidy packages ----------------

## filter(): dplyr, stats

Import messy data

data.messy <- read_csv("messy_data.csv")

## Parsed with column specification:

Find the problems

participant vowel word f2_trial1 f2_trial2 f2_trial3

participant vowel word f2_trial1 f2_trial2 f2_trial3

Explore the data more

f2_trial4 f1_trial1 f1_trial2 f1_trial3 f1_trial4

Problem 1: The columns contain observations

The first problem is the f2_trial1, f2_trial2, etc., represent observations

data.long <- gather(data.messy,

participant vowel word measurement value

Problem 2: Two variables in one column

Almost tidy data

gender participant vowel word measurement trial value

Problem 3: Irregular spellings

When entering data by hand, there are sometime irregularities in

data.long$gender <- as.factor(data.long$gender)

data.long$trial <- as.factor(data.long$trial)

## gender participant vowel word

Using stringr to fix strings

Let’s check to see if there are other errors.

data.long$gender <- as.factor(data.long$gender)

data.long$vowel <- as.factor(data.long$vowel)

## gender participant vowel word

new.df <- old.df %>% select columns that we want %>%

Selecting columns with select()

The function select() from dplyr is used to select certain columns

data.selected <- data.long %>%

data.sorted <- data.selected %>%

The function filter() from dplyr allows us to subset the data by

data.filtered <- data.sorted %>%

The logical values that can be used are

Adding new columns

The mutate() function of dplyr allows us to add columns to our

data.zscore <- data.filtered %>%

participant vowel measurement trial value z.score

The summarise() function from dplyr also creates new columns;

data.global.means <- data.filtered %>%

Grouped central tendencies

data.grouped.ct <- data.filtered %>%

## Source: local data frame [6 x 6]

## Groups: participant, vowel [4]

You might also like