58.tidy Data in R For Linguists
TIDY DATA
What is tidy data?
library("tidyverse")
• Import data
• Change from wide data to long
• Split columns
• Fix typos or irregularities
• Create new variables as needed
Why is that data messy? How does it violate the tidy data rules?
Long data
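As a minimal sketch of the wide-to-long step, assuming a small made-up data frame (the column names and values below are hypothetical, not the workshop data), pivot_longer() from tidyr gathers several measurement columns into name/value pairs. Older tidyr code does the same job with gather().

```r
library(tidyr)

# Hypothetical wide data: one column per measurement and trial
data.wide <- data.frame(
  participant = c("F.1", "M.2"),
  word = c("boot", "hoop"),
  vowel = c("u", "u"),
  F1_1 = c(310, 325),
  F2_1 = c(870, 900),
  stringsAsFactors = FALSE
)

# Gather the measurement columns into two columns:
# "measurement" holds the old column name, "value" holds the number
data.long.demo <- pivot_longer(data.wide,
                               cols = c(F1_1, F2_1),
                               names_to = "measurement",
                               values_to = "value")
data.long.demo
```

Each row of the result now holds a single observation, at the cost of repeating the participant, word, and vowel values.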
The next problem that we have to deal with is that two of our columns
have two variables in them. The participant column holds both gender and
the participant number. The measurement column holds both the measurement
type (F1 or F2) and the trial. To separate the columns we use separate()
from tidyr.
8 bradley rentz
Separating columns
# separate measurement into measurement type and trial
data.long <- separate(data.long,
                      measurement,
                      c("measurement", "trial"),
                      sep = "_")
# separate participant into gender and participant number
# (note: sep is a regex, so a literal . must be written \.,
# and the \ must itself be escaped in an R string: "\\.")
# convert = TRUE allows the new columns to have different data types
data.long <- separate(data.long,
                      participant,
                      c("gender", "participant"),
                      sep = "\\.",
                      convert = TRUE)
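To see what the two calls do, here is a minimal sketch on a made-up two-row data frame (the values are hypothetical and assume entries shaped like "F.1" and "F1_1"):

```r
library(tidyr)

# Hypothetical long data with two variables packed into each column
sep.demo <- data.frame(
  participant = c("F.1", "M.2"),
  measurement = c("F1_1", "F2_3"),
  value = c(310, 900),
  stringsAsFactors = FALSE
)

# "F1_1" becomes measurement "F1" and trial "1" (trial stays character)
sep.demo <- separate(sep.demo, measurement,
                     c("measurement", "trial"), sep = "_")

# "F.1" becomes gender "F" and participant 1;
# convert = TRUE turns the participant number into an integer
sep.demo <- separate(sep.demo, participant,
                     c("gender", "participant"),
                     sep = "\\.", convert = TRUE)
str(sep.demo)
```

Because the first call omits convert = TRUE, trial remains a character column. Add the argument there as well if trial should be numeric.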
Checking factors
To check the levels of each factor, we can use the summary() command.
What problems in spelling are there?
summary(data.long[,1:6])
Now that we have identified the errors, we can fix them with str_replace()
from stringr.
library(stringr)
data.long$word <- str_replace(data.long$word,"Hoop","hoop")
data.long$word <- str_replace(data.long$word,"Boot","boot")
data.long$vowel <- str_replace(data.long$vowel,"A","a")
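A small sketch of the same fix on a made-up vector (the values are hypothetical) shows two details worth knowing: str_replace() returns a character vector, so a factor column has to be rebuilt with factor(), and it replaces only the first match in each string.

```r
library(stringr)

# Hypothetical column with inconsistent capitalisation
words <- factor(c("Hoop", "hoop", "Boot", "boot"))

# str_replace() returns character, not factor
fixed <- str_replace(words, "Hoop", "hoop")
fixed <- str_replace(fixed, "Boot", "boot")

# Rebuild the factor so the misspelled levels disappear
fixed <- factor(fixed)
levels(fixed)
```

If capitalisation is the only problem, str_to_lower() from stringr is an alternative that lowercases every value in one call.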
Clean data
Now that we have clean data we can process and explore our data
with the dplyr functions filter(), arrange(), select(), summarise(),
group_by(), and mutate(). We will also use the pipe %>% from the
magrittr package to make using these (and other) functions easier.
The pipe
The pipe %>% allows us to chain function calls without nesting them and
is read as "then".
For example, let’s say we want to select only a few columns of
a dataframe, then keep the rows whose values are greater than a certain
value, then compute the mean of each variable by participant.
The pseudo-code would look like this:
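As a runnable sketch of those steps, assuming a small made-up data frame (the column names, the values, and the threshold of 300 are illustrative, not taken from the workshop data):

```r
library(dplyr)

# Hypothetical formant data
pipe.demo <- data.frame(
  participant = c(1, 1, 2, 2),
  vowel = c("a", "u", "a", "u"),
  F1 = c(700, 320, 750, 280),
  F2 = c(1200, 900, 1300, 850)
)

pipe.result <- pipe.demo %>%
  select(participant, F1, F2) %>%   # keep a few columns, then
  filter(F1 > 300) %>%              # keep rows above a threshold, then
  group_by(participant) %>%         # group by participant, then
  summarise(mean.F1 = mean(F1),     # take the mean of each variable
            mean.F2 = mean(F2))
pipe.result
```

Without the pipe, the same computation would be written inside out, as summarise(group_by(filter(select(...)))), which is harder to read.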
Arranging values
The arrange() function from dplyr is a sort function. You can use
it to sort by one column or by multiple columns. It can also sort in
descending order by using desc() around the column name.
Let’s sort data.selected by participant, then vowel, then descending
order of trial.
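A minimal sketch of that sort, using a made-up stand-in for data.selected (the name selected.demo and its values are hypothetical):

```r
library(dplyr)

# Hypothetical stand-in for data.selected
selected.demo <- data.frame(
  participant = c(2, 1, 1, 1),
  vowel = c("a", "u", "a", "a"),
  trial = c(1, 1, 1, 2)
)

# Sort by participant, then vowel, then trial from highest to lowest
sorted.demo <- arrange(selected.demo, participant, vowel, desc(trial))
sorted.demo
```

Columns listed later only break ties among rows that are equal on the earlier columns.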
Filtering rows
Logical values
== equal to
!= not equal to
< less than
<= less than or equal to
> greater than
>= greater than or equal to
& AND
| OR
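These operators are what filter() from dplyr expects as row conditions. A minimal sketch on made-up data (column names and values are hypothetical):

```r
library(dplyr)

# Hypothetical data
filter.demo <- data.frame(
  gender = c("F", "F", "M", "M"),
  vowel = c("a", "u", "a", "u"),
  F1 = c(850, 350, 700, 300)
)

# Rows where vowel is exactly "a"
only.a <- filter(filter.demo, vowel == "a")

# Rows that are not male AND have F1 of at least 400
f.high <- filter(filter.demo, gender != "M" & F1 >= 400)

# Rows with very low OR very high F1
extremes <- filter(filter.demo, F1 < 320 | F1 > 800)
```

Separating conditions with a comma inside filter() has the same effect as joining them with &.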
Z score
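One common way to compute a z-score is with mutate(): subtract the mean and divide by the standard deviation. The sketch below assumes the scores are computed within each participant and uses made-up values. Both the grouping choice and the data are assumptions, not the workshop's own setup.

```r
library(dplyr)

# Hypothetical F1 values for two participants
z.demo <- data.frame(
  participant = c(1, 1, 1, 2, 2, 2),
  F1 = c(300, 400, 500, 600, 700, 800)
)

# z = (value - group mean) / group standard deviation
z.result <- z.demo %>%
  group_by(participant) %>%
  mutate(F1.z = (F1 - mean(F1)) / sd(F1)) %>%
  ungroup()
z.result
```

Unlike summarise(), mutate() keeps every row and adds the new column alongside the original values.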
Summarizing data
## # A tibble: 1 × 1
## global.means
## <dbl>
## 1 1050.686
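Output of this shape comes from a summarise() call that collapses the whole data frame to one row. A minimal sketch, assuming made-up data (the column names and values are hypothetical and unrelated to the result printed above):

```r
library(dplyr)

# Hypothetical data
sum.demo <- data.frame(
  gender = c("F", "F", "M", "M"),
  value = c(500, 700, 600, 800)
)

# One global mean: a single row and a single column
global <- summarise(sum.demo, global.means = mean(value))

# One mean per group: one row for each level of gender
by.gender <- sum.demo %>%
  group_by(gender) %>%
  summarise(group.means = mean(value))
by.gender
```

Adding group_by() before summarise() is what turns a single global summary into one summary per group.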