
Module 272-015

Data Science
for Linguists

Taras Zakharko (taras.zakharko@uzh.ch)
Department of Comparative Linguistics
Basics of data transformation
Why tables?

Language      Word order
Rapanui       VSO
German        SVO
Abui          SOV
English       SVO
Wembawemba    VOS

The language Rapanui has the word order VSO
The language German has the word order SVO
The language Abui has the word order SOV
The language English has the word order SVO
The language Wembawemba has the word order VOS

The language X has the word order Y

(X, Y)
• Tables represent relations

• Rows are observations (exemplars)

• Columns are variables

• Order of rows and columns is usually not significant

[Figure: a table annotated with the labels "Variable", "Exemplar/Observation" and "Description"]

• What are the fundamental operations for manipulating tabular data?

• How can these operations be combined with each other?


Filter/subset (row restriction)
Select (column restriction)
Rename

X Y
Mutate (column insertion or modification)

a b a⊙b
c d c⊙d
e f e⊙f
g h g⊙h
j k j⊙k
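
The four basic operations above can be sketched in R with dplyr, the toolkit used later in these slides. This is a minimal sketch: the table is the word order example from the beginning, while the column name `VerbInitial` and the derived values are illustrative assumptions, not part of the original slides.

```r
library(dplyr)

# Word order table from the "Why tables?" slide
languages <- tribble(
  ~Language,    ~WordOrder,
  "Rapanui",    "VSO",
  "German",     "SVO",
  "Abui",       "SOV",
  "English",    "SVO",
  "Wembawemba", "VOS"
)

# Filter/subset: keep only the rows that satisfy a condition
svo <- filter(languages, WordOrder == "SVO")

# Select: keep only some of the columns
names_only <- select(languages, Language)

# Rename: new_name = old_name
renamed <- rename(languages, Order = WordOrder)

# Mutate: add a column computed from existing columns
# (VerbInitial is a hypothetical column used for illustration)
with_flag <- mutate(languages, VerbInitial = startsWith(WordOrder, "V"))
```

Each call returns a new table and leaves `languages` unchanged, which is what makes the operations easy to combine.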
Summarize/aggregate (row statistics computation)

⊕ ⊕ ⊕ ⊕

Produces a single row that summarizes the entire table based on some user-specified criterion
Summarize/aggregate (row statistics computation)

Values: 7, 1, 2, 5, 8

Min   Max   Avg
1     8     4.6


The summary does not have to preserve the structure of the original
columns!
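
A minimal sketch of such a summary in dplyr, assuming a single numeric column named `x` (a name chosen here for illustration) holding the five values shown on the slide:

```r
library(dplyr)

# One column with the values from the slide
values <- tibble(x = c(7, 1, 2, 5, 8))

# Summarize: the result has one row and three new columns,
# none of which existed in the input table
stats <- summarize(
  values,
  Min = min(x),
  Max = max(x),
  Avg = mean(x)
)
```

The input had one column and five rows; the summary has three columns and one row, which illustrates that the structure of the original columns is not preserved.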
Group/Split

Index

a
a
a
b
a
a
b
b
a
b

The index (group variable) defines the groups. A group is a subset of the data in which the value of the index is identical across the rows
Grouped operations

Input:
Language  Family
a         X
b         X
c         Y
d         Z

Summarize grouped by Family, with Languages = number_of_rows

Result:
Family  Languages
X       2
Y       1
Z       1
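
A grouped summarize along these lines could look as follows in dplyr. This is a sketch using the toy table from the slide; `n()` is dplyr's row counter and plays the role of number_of_rows.

```r
library(dplyr)

# Toy input table from the slide
langs <- tribble(
  ~Language, ~Family,
  "a",       "X",
  "b",       "X",
  "c",       "Y",
  "d",       "Z"
)

# One output row per family; n() counts the rows in each group
per_family <- langs %>%
  group_by(Family) %>%
  summarize(Languages = n())
```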


Grouped operations

Input:
Language  Family  Classes
a         X       1
b         X       5
c         Y       2
d         Y       2

Filter grouped by Family, with Classes ≥ mean(Classes)

Result:
Language  Family  Classes
b         X       5
c         Y       2
d         Y       2
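
The same grouped filter, sketched in dplyr with the toy data from the slide. The point to notice is that `mean(Classes)` is evaluated separately within each family, not over the whole table.

```r
library(dplyr)

# Toy input table from the slide
langs <- tribble(
  ~Language, ~Family, ~Classes,
  "a",       "X",     1,
  "b",       "X",     5,
  "c",       "Y",     2,
  "d",       "Y",     2
)

# Keep rows whose Classes value is at least the mean of their own family
at_least_mean <- langs %>%
  group_by(Family) %>%
  filter(Classes >= mean(Classes)) %>%
  ungroup()
```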


Grouped operations

Input:
Language  Family  Phonemes
a         X       30
b         X       25
c         Y       18
d         Y       22

Mutate grouped by Family, with Rank = rank(Phonemes)

Result:
Language  Family  Phonemes  Rank
a         X       30        2
b         X       25        1
c         Y       18        1
d         Y       22        2
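
A sketch of the grouped mutate in dplyr, again with the toy data from the slide. Base R's `rank()` assigns rank 1 to the smallest value, and because of the grouping it restarts for every family.

```r
library(dplyr)

# Toy input table from the slide
langs <- tribble(
  ~Language, ~Family, ~Phonemes,
  "a",       "X",     30,
  "b",       "X",     25,
  "c",       "Y",     18,
  "d",       "Y",     22
)

# The number of rows stays the same; a Rank column is added per family
ranked <- langs %>%
  group_by(Family) %>%
  mutate(Rank = rank(Phonemes)) %>%
  ungroup()
```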


Filter Selects rows (observations) according to a condition

Select/Rename Selects columns (or changes their arrangement)

Mutate Changes data in a column or adds a new column

Summarize Aggregates over rows and columns

All operations can be grouped!


Example

Language Family WordOrder


Rapanui Austronesian VSO
German Indo-European SVO
Abui Timor-Alor-Pantar SOV
English Indo-European SVO
Sorbian Indo-European SOV
Wemba Wemba Pama-Nyungan VOS

Which operations
do you need?
• What is the word order of Rapanui?

• How many Pama-Nyungan languages are described in the dataset?

• How many language families are described in the dataset?
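
One possible way to answer the three questions with dplyr is sketched below. The variable names are illustrative, and other combinations of operations would work equally well.

```r
library(dplyr)

# Example table from the slide
languages <- tribble(
  ~Language,     ~Family,             ~WordOrder,
  "Rapanui",     "Austronesian",      "VSO",
  "German",      "Indo-European",     "SVO",
  "Abui",        "Timor-Alor-Pantar", "SOV",
  "English",     "Indo-European",     "SVO",
  "Sorbian",     "Indo-European",     "SOV",
  "Wemba Wemba", "Pama-Nyungan",      "VOS"
)

# Word order of Rapanui: filter the row, then extract one column
rapanui_order <- languages %>%
  filter(Language == "Rapanui") %>%
  pull(WordOrder)

# Number of Pama-Nyungan languages: filter, then count rows
n_pama_nyungan <- languages %>%
  filter(Family == "Pama-Nyungan") %>%
  nrow()

# Number of language families: count the distinct values of one column
n_families <- n_distinct(languages$Family)
```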


Example
Language      Family              WordOrder
Rapanui       Austronesian        VSO
German        Indo-European       SVO
Abui          Timor-Alor-Pantar   SOV
English       Indo-European       SVO
Sorbian       Indo-European       SOV
Wemba Wemba   Pama-Nyungan        VOS

Which operations do you need?

Family              Attested.WO   N.Languages
Indo-European       SVO,SOV       3
Austronesian        VSO           1
Timor-Alor-Pantar   SOV           1
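
A table of this shape can be produced with a single grouped summarize. The sketch below is one possible solution, not necessarily the intended one; collapsing the word orders with `paste()` is an assumption about how the Attested.WO column was built. Note that it returns one row for every family in the input, including Pama-Nyungan.

```r
library(dplyr)

# Example table from the slide
languages <- tribble(
  ~Language,     ~Family,             ~WordOrder,
  "Rapanui",     "Austronesian",      "VSO",
  "German",      "Indo-European",     "SVO",
  "Abui",        "Timor-Alor-Pantar", "SOV",
  "English",     "Indo-European",     "SVO",
  "Sorbian",     "Indo-European",     "SOV",
  "Wemba Wemba", "Pama-Nyungan",      "VOS"
)

# Per family: the distinct word orders as one string, and the row count
by_family <- languages %>%
  group_by(Family) %>%
  summarize(
    Attested.WO = paste(unique(WordOrder), collapse = ","),
    N.Languages = n()
  )
```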
Join (combine tables)

id  x       id  y           id  x  y
a   1       a   pear        a   1  pear
b   0   ⊛   b   pear    =   b   0  pear
c   1       c   apple       c   1  apple
d   0       d   apple       d   0  apple

Tables must have a common key


Join (combine tables)

id  x       id  y           id  x  y
a   1       a   pear        a   1  pear
b   0   ⊛   a   apple   =   a   1  apple
c   1       b   plum        b   0  plum
d   0       b   grape       b   0  grape
                            c   1  -
                            d   0  -

Values will be duplicated where needed to match all combinations!
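
The duplication can be reproduced with dplyr. The sketch below assumes a left join, since the result on the slide keeps the unmatched rows c and d; the table names `left` and `right` are illustrative.

```r
library(dplyr)

# The two input tables from the slide; id is the common key
left <- tribble(
  ~id, ~x,
  "a", 1,
  "b", 0,
  "c", 1,
  "d", 0
)
right <- tribble(
  ~id, ~y,
  "a", "pear",
  "a", "apple",
  "b", "plum",
  "b", "grape"
)

# Each left row is repeated once per matching right row;
# rows without a match get NA in the y column
joined <- left_join(left, right, by = "id")
```

The result has six rows although `left` has only four: the rows for a and b each appear twice.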


Full Join

id  x       id  y           id  x  y
a   1       a   pear        a   1  pear
b   0   ⊛   c   apple   =   b   0  -
d   0       d   apple       c   -  apple
                            d   0  apple

Gaps are filled with a dummy value (NA in R)

Inner Join

id  x       id  y           id  x  y
a   1       a   pear        a   1  pear
b   0   ⊛   c   apple   =   d   0  apple
d   0       d   apple

Only matching rows are kept, rows with gaps are deleted
Left/Right Join

id  x       id  y
a   1       a   pear
b   0   ⊛   c   apple
d   0       d   apple

Left join:        Right join:
id  x  y          id  x  y
a   1  pear       a   1  pear
b   0  -          c   -  apple
d   0  apple      d   0  apple

Match rows from the left/right table, fill gaps where needed

(an inner join is both a left and a right join)
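
The four join types can be compared side by side in dplyr. This is a sketch using the two small tables from the join slides; the names `left` and `right` are illustrative.

```r
library(dplyr)

# Input tables from the slides; id is the common key
left <- tribble(
  ~id, ~x,
  "a", 1,
  "b", 0,
  "d", 0
)
right <- tribble(
  ~id, ~y,
  "a", "pear",
  "c", "apple",
  "d", "apple"
)

# Full join: all rows from both tables, gaps filled with NA
full <- full_join(left, right, by = "id")

# Inner join: only rows whose key occurs in both tables
inner <- inner_join(left, right, by = "id")

# Left join: all rows of the left table, NA where right has no match
left_only <- left_join(left, right, by = "id")

# Right join: all rows of the right table, NA where left has no match
right_only <- right_join(left, right, by = "id")
```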
Example

Persons:
id  name   sex
1   Maria  f
2   Peter  m
3   Max    m
4   Sarah  f
5   Boris  m
6   Julia  f

Parent/child relation:
parent  child
1       2
1       4
3       2
3       4
5       1
6       1

Joined (parents of each person):
name   mother  father
Maria  Julia   Boris
Peter  Maria   Max
Max    -       -
Sarah  Maria   Max
Boris  -       -
Julia  -       -

Joined again (grandparents via the mother):
name   mother  father  grandma  grandpa
Maria  Julia   Boris   -        -
Peter  Maria   Max     Julia    Boris
Max    -       -       -        -
Sarah  Maria   Max     Julia    Boris
Boris  -       -       -        -
Julia  -       -       -        -
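
One way to derive these tables with dplyr and tidyr is sketched below. The sequence of steps is an assumption, since the slides only show the inputs and the results, and the sex codes are written as "f" and "m" here. The helper column `role` and the table names are illustrative.

```r
library(dplyr)
library(tidyr)

person <- tribble(
  ~id, ~name,   ~sex,
  1,   "Maria", "f",
  2,   "Peter", "m",
  3,   "Max",   "m",
  4,   "Sarah", "f",
  5,   "Boris", "m",
  6,   "Julia", "f"
)
parent_child <- tribble(
  ~parent, ~child,
  1, 2,
  1, 4,
  3, 2,
  3, 4,
  5, 1,
  6, 1
)

# Attach name and sex of each parent, label the role,
# then pivot so that every child has a mother and a father column
parents <- parent_child %>%
  inner_join(person, by = c("parent" = "id")) %>%
  mutate(role = if_else(sex == "f", "mother", "father")) %>%
  select(child, role, parent_name = name) %>%
  pivot_wider(names_from = role, values_from = parent_name)

# Left join keeps persons whose parents are not in the data
family <- person %>%
  left_join(parents, by = c("id" = "child")) %>%
  select(name, mother, father)

# Self-join: look up the mother's own parents
with_grandparents <- family %>%
  left_join(
    select(family, mother = name, grandma = mother, grandpa = father),
    by = "mother"
  )
```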

• When designing a database, we often want to have “good structure” (normalization)

• When querying the database, “good structure” can be an obstacle

• Joins help us to transform data into a form that is easier to query and aggregate
Pivoting

Long form:
Language  Case  Marker
German    Nom   Ø
German    Acc   suffix
German    Loc   suffix
Dyirbal   Erg   suffix
Dyirbal   Abs   Ø
Russian   Loc   suffix
Russian   Loc   prep

Wide form:
Language  Nom  Acc     Abs  Erg     Loc
German    Ø    suffix  -    -       suffix
Russian   Ø    suffix  -    -       suffix, prep
Dyirbal   -    -       Ø    suffix  ?

Turn around the table structure, interchanging rows and columns

Pivot wide: rows are transformed into columns
Pivot long: columns are transformed into rows
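
Both directions can be sketched with tidyr. The example uses only the rows listed in the long table, so cells that have no row there come out as NA. Collapsing the two Russian locative markers into one string before pivoting is an assumption about how the wide table was built.

```r
library(dplyr)
library(tidyr)

# Long form: one row per language, case and marker
cases <- tribble(
  ~Language, ~Case, ~Marker,
  "German",  "Nom", "Ø",
  "German",  "Acc", "suffix",
  "German",  "Loc", "suffix",
  "Dyirbal", "Erg", "suffix",
  "Dyirbal", "Abs", "Ø",
  "Russian", "Loc", "suffix",
  "Russian", "Loc", "prep"
)

# Pivot wide: one column per case.
# Multiple markers for the same cell are first collapsed into one string.
wide <- cases %>%
  group_by(Language, Case) %>%
  summarize(Marker = paste(Marker, collapse = ", "), .groups = "drop") %>%
  pivot_wider(names_from = Case, values_from = Marker)

# Pivot long: back to one row per language and case, dropping empty cells
long <- wide %>%
  pivot_longer(
    -Language,
    names_to = "Case",
    values_to = "Marker",
    values_drop_na = TRUE
  )
```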
Data transformation:
from theory to practice
What? vs. How?

• What needs to be done?

e.g. “I want to pick the smokers out of my dataset, split them into women
and men, and compute the average life expectancy for both groups”

General plan of action!

• How do I do it?

e.g. “I load the dataset in R with the command read_csv(), choose the
smoker data with filter(smoker), group the dataset using
group_by(sex) and compute the life expectancy via
summarize(life_expectancy = mean(age_at_death))”

The exact implementation!
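
Put together, the "How?" description corresponds to a short pipeline. The sketch below replaces `read_csv()` with a small in-memory table so that it runs on its own; the data values are invented for illustration, and the column names follow the description above.

```r
library(dplyr)

# Hypothetical data standing in for the file loaded with read_csv()
people <- tribble(
  ~sex, ~smoker, ~age_at_death,
  "f",  TRUE,    70,
  "m",  TRUE,    66,
  "f",  FALSE,   84,
  "m",  TRUE,    62,
  "f",  TRUE,    74
)

# Keep the smokers, split by sex, average the age at death per group
result <- people %>%
  filter(smoker) %>%
  group_by(sex) %>%
  summarize(life_expectancy = mean(age_at_death))
```

Each line of the pipeline maps onto one step of the plan, which is why writing the plan first makes the code easy to derive.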


What?

• Always start with a plan!

• Draw up the initial, the intermediate and the final table structure

• Consider every separate step and try to describe it as precisely as possible,
without going into technical detail

• Think “visually”: table transformations are geometric transformations!

What?: Data Flow Graphs

WALS                            Ethnologue
  |                                 |
select(iso_code, F98A)          select(iso_code, continent)
  |                                 |
filter(F98A != empty)               |
  |                                 |
mutate(SAcoding = F98A)             |
  |                                 |
  +------ inner_join(iso_code) -----+
                  |
    summarize(counts) grouped by(continent)
How?

• Exact implementation depends on the tools you use

SQL select name,address from Person where age>=18

R Person %>% filter(age>=18) %>% select(name, address)

• It is a language/tool that you have to learn

• We are using R with the tidyverse extension


From data flow graph to R tidyverse code

WALS %>%
  select(iso_code, F98A) %>%
  filter(!is.na(F98A)) %>%
  mutate(SAcoding = …) %>%
  inner_join(
    select(Ethnologue, iso_code, continent),
    join_by(iso_code)
  ) %>%
  summarize(
    SA_same = sum(SAcoding == "same"),
    SA_different = sum(SAcoding != "same"),
    .by = continent
  )

• A collection of R packages

• Defines a grammar of data transformations

• Based on the data flow model

• “Tidy Data” principle

• DataCamp courses!
