Data Transformation
Data Science
for Linguists
Rapanui VSO
German SVO
Abui SOV
English SVO
Wembawemba VOS
The language Rapanui has the word order VSO
The language German has the word order SVO
The language Abui has the word order SOV
The language English has the word order SVO
The language Wembawemba has the word order VOS
(X, Y)
Variable
Description
• What are the fundamental operations for manipulating tabular data?
Mutate (column insertion or modification)
X    Y    X⊙Y
a    b    a⊙b
c    d    c⊙d
e    f    e⊙f
g    h    g⊙h
j    k    j⊙k
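A minimal sketch of a mutate step in R with dplyr, assuming a small made-up table. The table name `sounds`, its column names and its values are hypothetical and only serve as an illustration; addition stands in for the generic row-wise operation ⊙.

```r
library(dplyr)

# Hypothetical toy table: names and numbers are made up for illustration.
sounds <- data.frame(
  language   = c("A", "B", "C"),
  consonants = c(22, 10, 16),
  vowels     = c(5, 5, 7)
)

# mutate() adds a new column that is computed row by row
# from existing columns; the number of rows stays the same.
with_total <- sounds %>%
  mutate(phonemes = consonants + vowels)

print(with_total)
```

The same call can also overwrite an existing column by reusing its name on the left-hand side.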
Summarize/aggregate (row statistics computation)
⊕ ⊕ ⊕ ⊕
7
1
Min Max Avg
2
1 8 5.2
5
8
The summary does not have to preserve the structure of the original
columns!
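A short sketch of a summarize step with dplyr, assuming a one-column table with made-up values (the name `measurements` and the numbers are hypothetical). It illustrates that the result has one row and three new columns, so the shape of the original table is not preserved.

```r
library(dplyr)

# Hypothetical input: a single numeric column with made-up values.
measurements <- data.frame(x = c(4, 9, 2, 5))

# summarize() collapses all rows into one row of statistics.
stats <- measurements %>%
  summarize(
    min = min(x),
    max = max(x),
    avg = mean(x)
  )

print(stats)  # one row with min = 2, max = 9, avg = 5
```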
Group/Split
Index
a
a
a
b
a
a
b
b
a
b
Language  Family
a         X
b         X
c         Y
d         Z

Family  Languages
X       2
Y       1
Z       1
Languages = number_of_rows
Rank = rank(Phonemes)
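The two formulas above could be written in dplyr roughly as follows. This is a sketch: the language-to-family assignment is the toy table from the slide, while the `phonemes` column and the table name `lang_table` are hypothetical additions needed to show the ranking.

```r
library(dplyr)

# Toy table: languages a-d with families X, Y, Z as on the slide.
# The phoneme counts are made up for illustration.
lang_table <- data.frame(
  language = c("a", "b", "c", "d"),
  family   = c("X", "X", "Y", "Z"),
  phonemes = c(30, 12, 25, 40)
)

# Languages = number_of_rows: count the rows in each family group.
family_sizes <- lang_table %>%
  group_by(family) %>%
  summarize(languages = n())

# Rank = rank(Phonemes): a mutate step that keeps every row.
ranked <- lang_table %>%
  mutate(phoneme_rank = rank(phonemes))

print(family_sizes)  # X: 2, Y: 1, Z: 1
print(ranked)
```

Counting is a summarize step (one row per group), whereas ranking is a mutate step (one value per original row).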
Which operations do you need?
• What is the word order of Rapanui?
Indo-European SVO,SOV 2
Austronesian VOS 1
Timor-Alor-Pantar SOV 1
Join (combine tables)
id  x       id  y           id  x  y
a   1       a   pear        a   1  pear
b   0   ⊛   b   pear    =   b   0  pear
c   1       c   apple       c   1  apple
d   0       d   apple       d   0  apple

id  x       id  y           id  x  y
a   1       a   pear        a   1  pear
b   0   ⊛   a   apple   =   a   1  apple
c   1       b   plum        b   0  plum
d   0       b   grape       b   0  grape
                            c   1  -
                            d   0  -

id  x       id  y           id  x  y
a   1       a   pear        a   1  pear
b   0   ⊛   c   apple   =   b   0  -
d   0       d   apple       c   -  apple
                            d   0  apple

id  x       id  y           id  x  y
a   1   ⊛   a   pear    =   a   1  pear
b   0       c   apple       d   0  apple
d   0       d   apple
Only matching rows are kept, rows with gaps are deleted
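As a sketch, the difference between keeping only matching rows and keeping all rows can be reproduced with dplyr's `inner_join()` and `full_join()`. The two small tables mirror the id/x and id/y tables from the slides; the variable names `left_tbl` and `right_tbl` are assumptions.

```r
library(dplyr)

# The two toy tables from the slides.
left_tbl  <- data.frame(id = c("a", "b", "d"), x = c(1, 0, 0))
right_tbl <- data.frame(id = c("a", "c", "d"), y = c("pear", "apple", "apple"))

# inner_join(): only ids present in both tables survive (a and d).
inner <- inner_join(left_tbl, right_tbl, by = "id")

# full_join(): every id from either table is kept; gaps become NA.
full <- full_join(left_tbl, right_tbl, by = "id")

print(inner)
print(full)
```

In R the gaps shown as "-" on the slides appear as `NA`.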
Left/Right Join
id  x       id  y
a   1       a   pear
b   0   ⊛   c   apple
d   0       d   apple

Left join         Right join
id  x  y          id  x  y
a   1  pear       a   1  pear
b   0  -          c   -  apple
d   0  apple      d   0  apple
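A sketch of the same two results with dplyr's `left_join()` and `right_join()`, assuming the toy tables from the slide (the variable names `left_tbl` and `right_tbl` are hypothetical).

```r
library(dplyr)

left_tbl  <- data.frame(id = c("a", "b", "d"), x = c(1, 0, 0))
right_tbl <- data.frame(id = c("a", "c", "d"), y = c("pear", "apple", "apple"))

# left_join(): keep every row of the left table; b has no match, so y is NA.
left_result <- left_join(left_tbl, right_tbl, by = "id")

# right_join(): keep every row of the right table; c has no match, so x is NA.
right_result <- right_join(left_tbl, right_tbl, by = "id")

print(left_result)
print(right_result)
```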
3 Max m. 3 2 Max - -
5 Boris m. 5 1 Boris - -
6 Julia w. 6 1 Julia - -
name mother father grandma grandpa
Max - - - -
Boris - - - -
Julia - - - -
• Joins help us to transform data into a form that is easier to query and aggregate
Pivoting
e.g. “I want to pick the smokers out of my dataset, split them into women
and men, and compute the average life expectancy for both groups”
• How do I do it?
e.g. “I load the dataset in R with the command read_csv(), choose the
smoker data with filter(smoker), group the dataset using
group_by(sex) and compute the life expectancy via
summarize(life_expectancy = mean(age_at_death))”
• Draw up the initial, the intermediate and the final table structure
• Consider every separate step and try to describe it as precisely as possible,
without going into technical detail
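The smoker example can be sketched end to end as follows. This assumes a small inline table instead of a file loaded with read_csv(); the table name `people` and all values are made up, while the column names `smoker`, `sex` and `age_at_death` follow the example.

```r
library(dplyr)

# Hypothetical stand-in for the loaded dataset; values are made up.
people <- data.frame(
  sex          = c("f", "m", "f", "m", "f"),
  smoker       = c(TRUE, TRUE, FALSE, TRUE, TRUE),
  age_at_death = c(70, 66, 88, 74, 80)
)

result <- people %>%
  filter(smoker) %>%                                   # initial table -> smokers only
  group_by(sex) %>%                                    # split into groups by sex
  summarize(life_expectancy = mean(age_at_death))      # one row per group

print(result)  # f: 75, m: 70
```

Each line of the pipeline corresponds to one of the table structures you would draw up beforehand: the initial table, the filtered table, the grouped table and the final summary.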
WALS Ethnologue
filter(F98A != empty)
mutate(SAcoding = F98A)
inner_join(iso_code)
summarize(counts) grouped by continent
How?
WALS %>%
  # keep only the language code and the feature of interest
  select(iso_code, F98A) %>%
  # drop languages without a value for F98A
  filter(!is.na(F98A)) %>%
  mutate(SAcoding = …) %>%
  # add the continent of each language from Ethnologue
  inner_join(
    select(Ethnologue, iso_code, continent),
    join_by(iso_code)
  ) %>%
  # count both codings per continent
  summarize(
    SA_same = sum(SAcoding == "same"),
    SA_different = sum(SAcoding != "same"),
    .by = continent
  )
• A collection of R packages
• DataCamp courses!