LD&C-SP25 1 Schnell+Haig+Seifart
Archive
University of Zurich
University Library
Strickhofstrasse 39
CH-8057 Zurich
www.zora.uzh.ch
Year: 2022
The following work is licensed under a Creative Commons: Attribution-NonCommercial 4.0 International (CC
BY-NC 4.0) License.
1 Introduction
The global decline in linguistic diversity was first brought to public atten-
tion some 30 years ago (Hale et al. 1992), and has since continued unabated
(Seifart et al. 2018; Bromham et al. 2021). From the perspective of language
communities, language loss means an irredeemable rupture of collective and
individual memory; for the language sciences, each disappearing language
shrinks our window on the range of variability in human language. The ur-
gency of language documentation could scarcely be more obvious.1
For typology, a major utility of language documentation was the adequate
representation of typological rara in linguistic theory: syntactic ergativity,
1 This volume grew out of a workshop on Corpus-based Typology: Spoken Language from a
Cross-linguistic Perspective, held at the Annual General Meeting of the German Linguistics
Society (DGfS) at the University of Hamburg in March 2020. We, the editors, would like
to express our thanks to the audience at the workshop for very stimulating discussion, to
the external reviewers of the contributions of this volume for their critical feedback, and
to Nils Schiborr for supervising the final volume production. The responsibility for any
remaining errors is of course our own.
2 LD&C SP25 — Doing corpus-based typology with spoken language data
2 Corpus-based typology
Corpus-based typology refers to a set of approaches that use corpora to con-
duct language typology. The following sections are devoted to key aspects of
this research agenda: Section 2.1 addresses the concept of ‘corpus’ in a cross-
linguistic context; Section 2.2 looks at the role of language-internal variation;
Section 2.3 addresses different aspects of language usage. We refer the reader
Language documentation and corpus-based typology 3
to Levshina (2021a) and Schnell & Schiborr (in press) for more extensive over-
views of recent developments; here we merely summarize the main points
before identifying remaining challenges and desiderata (Section 2.4), which
the current volume intends to address.
(UDHR)2 for which translations in about 1400 languages are now available
(cf. Bentz & Ferrer-i-Cancho 2016). The VoxClamantis project (Salesky et al.
2020) is a corpus of Bible translations read aloud by native speakers, with first-
pass phoneme-level alignments for more than 600 languages. A lower degree
of content-control is found in so-called parallax corpora, where speakers pro-
duce texts in response to a specific non-verbal stimulus. The best-known ex-
amples are those based on re-tellings of the Pear Story video (Chafe 1980),
or the Frog Story picture book (Mayer 1969; see Berman & Slobin 1994). See
Barth & Evans (2017) and Barth & Evans (this volume) for examples of the
parallax methodology. Finally, at the lowest level of content control are cor-
pora produced without any specific pre-defined content constraints. These
include, for example, life stories, traditional narratives, descriptions of activ-
ities, and so on, loosely referred to as “original text typology” (Haig et al.
2011). Texts of this type are among the most common outputs of language
documentation, and are thus of particular relevance in the current connec-
tion. Degree of content control has a considerable impact on the nature of
the typological questions that can be addressed: High content control is de-
signed to elicit a high degree of semantic consistency across different cor-
pora, and is thus ideal for probing, for instance, cross-linguistic differences
and commonalities of specific event types in specific contexts (e.g. motion
events, Wälchli & Sölling 2013). However, high content control comes at the
cost of a lack of naturalness, and possible interference from the source texts.
Corpora also vary across other dimensions, for example medium (spoken,
signed, or written, further discussed in Section 2.4 below), single-participant
(monologue) or multi-participant, register, and genre. This volume is de-
voted to spoken language corpora, which always minimally involve record-
ing, and generally transcription, followed by further levels of annotation,
which may exhibit various degrees of conceptual abstraction, from a simple,
content-based translation to exhaustive morpheme-for-morpheme glossing,
with indications of prosody, information structure, or semantic roles. See Sei-
fart (this volume), Mettouchi & Vanhove (this volume) and Haig et al. (this
volume) for different approaches to phonological, prosodic, morphosyntactic,
and information structure annotation.
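The layering of annotations described above can be pictured as a small data structure. The following is a minimal sketch only: the class name `Utterance`, its field names, and the placeholder tokens are assumptions made for illustration and do not correspond to the format of any particular corpus or annotation tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Utterance:
    """One time-aligned stretch of a recording with stacked annotation tiers.

    All names here are hypothetical; real corpora use their own schemas.
    """
    start: float                  # seconds into the recording
    end: float
    transcription: List[str]      # word-level transcription
    translation: str = ""         # free, content-based translation
    glosses: List[str] = field(default_factory=list)        # one gloss per word
    extra_tiers: Dict[str, List[str]] = field(default_factory=dict)  # e.g. prosody

    def is_aligned(self) -> bool:
        """True if every non-empty word-level tier has one label per word.

        Empty tiers are treated as not yet annotated and are skipped.
        """
        n = len(self.transcription)
        word_tiers = [self.glosses, *self.extra_tiers.values()]
        return all(len(tier) == n for tier in word_tiers if tier)


# Placeholder tokens, not data from any language.
u = Utterance(
    start=0.0,
    end=1.5,
    transcription=["w1", "w2", "w3"],
    translation="(free translation)",
    glosses=["GLOSS1", "GLOSS2", "GLOSS3"],
    extra_tiers={"semantic_role": ["A", "-", "P"]},
)
print(u.is_aligned())
```

Keeping each tier as a separate list makes it easy to check that a deeper annotation layer has not drifted out of step with the transcription it describes.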
2 https://www.unicode.org/udhr/
3 https://universaldependencies.org/
in language use. For instance, Cysouw (2014) induces semantic roles (such as
Goal, Agent, or Patient), or even macro-roles (like S, A, and P) from the usage
of case-like markers across 15 languages, based on parallel religious texts. In
this section we give examples of research that seeks motivations for typological
(dis)preferences at different levels of linguistic structure in patterns
of language use. These examples illustrate what CBT’s corpus-linguistic
approach adds to the study of variability beyond the recognition of intra-
language variation that is already registered within MVT and related ap-
proaches, which can, in principle, be applied to, for example, elicited or pre-
processed data.
In syntactic typology, dependency length minimisation (DLM) has long
been hypothesized to underlie the typological preference for harmonic word
order patterns, for instance VO co-occurring with prepositions in head-initial
languages (cf. Behaghel 1909; Dryer 1992; Hawkins 1994, 2004). How DLM
shapes actual language use has now been directly investigated comparatively
in corpora from over 50 languages (Futrell et al. 2020; Jing et al. 2021). Simil-
arly, there are now corpus-based results on the avoidance of crossing depend-
encies (Blasi et al. 2019), as well as cross-linguistic differences in word order
variability (more or less “free” word order) (Futrell et al. 2015; Levshina 2019)
and the trade-off between word order variability and case marking (Koplenig
et al. 2017; Levshina 2019, 2021b).
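To make the notion of dependency length concrete, the sketch below sums the linear distance between each word and its syntactic head for two orderings of the same toy tree. It is an illustration under stated assumptions, not the procedure of any of the studies cited: the function name and the abstract five-word tree are invented, and heads are encoded as 1-based positions with 0 for the root, as in the HEAD column of the CoNLL-U format.

```python
def total_dependency_length(heads):
    """Sum of linear distances between each word and its head.

    heads[i] is the 1-based position of the head of word i + 1;
    0 marks the root, which has no incoming dependency.
    """
    return sum(abs((i + 1) - h) for i, h in enumerate(heads) if h != 0)


# Toy tree: a verb V with a one-word dependent P and a three-word
# dependent phrase (m1 m2 N, headed by N). Same tree, two linear orders.

# Order 1: V P m1 m2 N  (short dependent before long dependent)
short_before_long = [0, 1, 5, 5, 1]

# Order 2: V m1 m2 N P  (long dependent before short dependent)
long_before_short = [0, 4, 4, 1, 1]

print(total_dependency_length(short_before_long))  # 8
print(total_dependency_length(long_before_short))  # 10
```

Placing the short dependent next to the verb yields the smaller total, which is the kind of preference that DLM predicts speakers to show in actual usage.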
Another classic topic in syntactic typology is zero reference, discussed
in the theoretical and typological literature under the heading of “pro-drop”
(Rizzi 1982; Jaeggli & Safir 1989; Roberts & Holmberg 2009; cf. Dryer 2013).
Bickel (2003) and Stoll & Bickel (2009) investigate the rate of zero expression
of syntactic arguments across a set of three languages from the Himalayas,
and Russian. Factors impacting on referential density, that is, the ratio of overt
to covert forms of reference, are identified as patterns of clause combining
as well as ethnolinguistic considerations of discourse production and recep-
tion. While these authors focus on possible realizations of all arguments in
a particular corpus, Torres Cacoullos & Travis (2019) focus on the specific
conditions of zero reference for subjects, finding, for instance, that the same
factors are involved in both English and Spanish texts, but differ in their de-
gree of magnitude. Vollmer (2019) applies classification tree methodologies
to investigate referential choice in a diverse sample of 8 languages, likewise
identifying underlying commonalities across diverse corpora. These works
thus for the Uniform Information Density hypothesis (Frank & Jaeger 2008).
Comparing 17 languages, Coupé et al. (2019) find that higher syllable com-
plexity (which induces higher informativity of each syllable) correlates with
slower articulation rate, resulting in a pattern whereby information trans-
mission rates across languages are surprisingly comparable. Regarding the
temporal chunking of speech, Inbar et al. (2020) measured the regularity of
sequences of intonation units across six spoken corpora, with results suggesting
that they closely match brain waves at 1 Hz.
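The trade-off reported for information transmission can be illustrated with simple arithmetic: if information rate is taken as information per syllable multiplied by syllables per second, a language with denser syllables and slower speech can end up with the same rate as one with lighter syllables and faster speech. The figures below are invented for illustration and are not values from Coupé et al. (2019); the function name is likewise an assumption.

```python
def information_rate(bits_per_syllable, syllables_per_second):
    """Information rate in bits per second.

    Assumes rate = information density x speech rate.
    """
    return bits_per_syllable * syllables_per_second


# Hypothetical languages with invented values.
dense_slow = information_rate(bits_per_syllable=8.0, syllables_per_second=5.0)
light_fast = information_rate(bits_per_syllable=5.0, syllables_per_second=8.0)

print(dense_slow, light_fast)  # 40.0 40.0
```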
animacy) as the most relevant factor of pronoun use. Biber (1995: 359) finds
differences across registers to be remarkably similar across diverse languages
(English, Somali, Korean, Nukulaelae Tuvaluan),4 pointing to the high rel-
evance of considering cross-register variation in cross-linguistic research of
language use.
Finally, CBT has to confront the problem of the relative comparability of
corpora: Bickel (2003) makes a strong case for the use of content-controlled
data in a study of referential density, given the high content-sensitivity of
the parameter under study. Given the lesser representativeness of such ex-
perimentally elicited texts, an alternative approach is to determine pragmat-
ically defined usage contexts that can be considered broadly comparable, as
advanced in pragmatic typology (e.g. Dingemanse et al. 2014). Which of these
two approaches is best suited for a given research agenda will depend heavily
on the respective projects.
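The content-sensitivity of referential density is easier to see once the measure is spelled out. The sketch below assumes one common operationalisation, namely the proportion of argument positions that are filled by an overt form; the function name, the label set, and the two toy texts are hypothetical and serve only to show the computation.

```python
def referential_density(clauses):
    """Proportion of argument positions realised by an overt form.

    clauses: list of clauses, each a list of labels ("overt" or "zero"),
    one label per available argument position.
    """
    positions = sum(len(clause) for clause in clauses)
    if positions == 0:
        raise ValueError("no argument positions to count")
    overt = sum(label == "overt" for clause in clauses for label in clause)
    return overt / positions


# Two invented texts with the same number of argument positions (8).
text_a = [["overt", "overt"], ["zero", "overt"], ["zero"], ["overt", "zero", "zero"]]
text_b = [["overt", "overt"], ["overt", "overt"], ["overt"], ["overt", "zero", "zero"]]

print(referential_density(text_a))  # 0.5
print(referential_density(text_b))  # 0.75
```

A text that keeps introducing new referents will score higher than one that follows a single protagonist, whatever the language, which is why a measure of this kind is compared most safely across texts with similar content.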
4 Biber (1995) focuses on a comparative study of English and Somali registers, yet the assess-
ment encompasses all four languages.
cover potential variation across language users (as mentioned in Section 2.4).
Hence, to investigate lexical choices in reference to humans across diverse
languages, Social Cognition Parallax Interview Corpus (SCOPIC) texts are
elicited with the help of a picture task stimulus that is designed so as to elicit
instances of human references, and likewise lexical choices in other domains.
This optimally enables comparisons across languages and across individual
users, since contexts are kept stable. Barth et al. (this volume) present four
case studies on intra-language and intra-language user variation in a sample
of thirteen languages using these data. Overall, the authors find substan-
tial variation across users of a single language, which actually exceeds the
variability between languages, especially in the case of research questions
in semantic typology. These findings highlight the necessity for closer mon-
itoring of community-level representativeness in CBT and for investigating
methods that allow researchers to assess the contribution of individual language
users.
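One simple way to see what it means for variation across users to exceed variation between languages is to compare the spread of per-user values inside each language with the spread of the language means. The sketch below is purely descriptive and is not the method used by Barth et al.; the function name, the data layout, and all numbers are invented for illustration.

```python
from statistics import mean, pvariance


def variation_summary(rates_by_language):
    """Return (between-language variance, mean within-language variance).

    rates_by_language maps a language label to a list of per-user rates,
    for example each user's proportion of a given lexical choice.
    """
    language_means = [mean(rates) for rates in rates_by_language.values()]
    between = pvariance(language_means)
    within = mean(pvariance(rates) for rates in rates_by_language.values())
    return between, within


# Invented per-user rates for two hypothetical languages.
toy_rates = {
    "lang_a": [0.2, 0.8],
    "lang_b": [0.3, 0.9],
}

between, within = variation_summary(toy_rates)
print(within > between)  # True: users differ more than the language means do
```

In a real study, a mixed-effects model with random effects for language and for user would be a more appropriate tool; the point here is only to show which two quantities are being compared.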
5 https://multicast.aspra.uni-bamberg.de/resources/hambam/
by the wayside in the reception and citation of this seminal article), there are
good reasons from a collaborative documentary point of view to give preced-
ence to communities’ preferences, often for traditional narratives, and the
contributions in this volume bear witness to the fact that such data are most
useful for CBT. This is corroborated by a general corpus-linguistic view that
seeks to determine various conditions on language use, so that confinement
to experimentally elicited data would yield too narrow a scope of situational
factors.
Finally, corpora need to be sustainably and accessibly archived with clear
instructions on how they can be used, in line with best practices of documentary
linguistics (Gippert et al. 2006; Thieberger & Berez 2012) and of open and
reproducible science (Wilkinson et al. 2016; Berez-Kroeker et al. 2018). Nearly
all contributors to this special publication are in the process of building web-
accessible corpora and hence contribute to the enterprise of open science
and collaborative research that CBT relies on much more than other fields in
linguistics.
4 Conclusions
Its initial success story notwithstanding, CBT is still in its infancy and unified
research agendas and standards are still emerging. Considerable advances
have already been made by researchers working primarily on digital corpora
of written standardized languages. This volume broadens the scope of CBT
by connecting it with language documentation, enabling a shift towards com-
parison based on context-embedded, naturally variable spoken and signed
language usage, where the social and cognitive-articulatory factors that mo-
tivate typological generalizations are actually operative. As coverage of the
world’s languages in spoken and signed corpora grows, CBT will also be able
to feed into a more holistic approach to areal and diachronic typology by in-
vestigating the conditioned use of linguistic structures by different users with
different demographics and across social settings and communities with dif-
ferent cultural backgrounds. This development will incur major challenges
in data collection and processing, including in particular linguistic annota-
tions of various kinds for comparative purposes. While this amounts to an
enormous undertaking, evoking a “sea change” in linguistics propagated by
Levinson & Evans (2010), we believe that the confluence of current advances
in corpus-based typology and language documentation will bring us a signi-
ficant step further in this direction.
References
Barth, Danielle & Evans, Nicholas. 2017. SCOPIC design and overview. In Barth,
Danielle & Evans, Nicholas (eds.), The Social Cognition Parallax Interview Cor-
pus (SCOPIC): A cross-linguistic resource (Language Documentation & Conservation
special publication 12), 1–23. Honolulu, HI: University of Hawai’i Press. (https:
//hdl.handle.net/10125/24742).
Barth, Danielle & Evans, Nicholas & Arka, I Wayan & Bergqvist, Henrik & Forker,
Diana & Gipper, Sonja & Hodge, Gabrielle & Kashima, Eri & Kasuga, Yuki
& Kawakami, Carine & Kimoto, Yukinori & Knuchel, Dominique & Kogura,
Norikazu & Kurabe, Keita & Mansfield, John & Narrog, Heiko & Pratiwi, De-
sak Putu Eka & van Putten, Saskia & Senge, Chikako & Tykhostup, Olena. This
volume. Language vs. individuals in cross-linguistic corpus typology. In Haig,
Geoffrey & Schnell, Stefan & Seifart, Frank (eds.), Doing corpus-based typology
with spoken language corpora: State of the art (Language Documentation & Conser-
vation special publication 25), 179–232. Honolulu, HI: University of Hawai’i Press.
(https://hdl.handle.net/10125/74661).
Barth, Danielle & Schnell, Stefan. 2021. Understanding corpus linguistics. London: Rout-
ledge.
Behaghel, Otto. 1909. Beziehungen zwischen Umfang und Reihenfolge von
Satzgliedern. Indogermanische Forschungen 25. 110–142. (https://doi.org/10.
1515/9783110242652.110).
Bentz, Christian & Ferrer-i-Cancho, Ramon. 2016. Zipf’s law of abbreviation as a language
universal. In Bentz, Christian & Jäger, Gerhard & Yanovich, Igor (eds.), Proceedings
of the Leiden Workshop on Capturing Phylogenetic Algorithms for Linguistics.
Tübingen: University of Tübingen.
(https://publikationen.uni-tuebingen.de/xmlui/handle/10900/68558).
Berez-Kroeker, Andrea L. & Andreassen, Helene N. & Gawne, Lauren & Holton, Gary
& Kung, Susan Smythe & Pulsifer, Peter & Collister, Lauren B. & The Data
Citation and Attribution in Linguistics Group & The Linguistics Data Interest
Group. 2018. The Austin Principles of Data Citation in Linguistics (Version 1.0).
(http://site.uit.no/linguisticsdatacitation/austinprinciples/).
Berman, Ruth A. & Slobin, Dan I. 1994. Relating events in narrative: A cross-linguistic
developmental study. Mahwah, NJ: Erlbaum.
Biber, Douglas. 1995. Dimensions of register variation: A cross-linguistic comparison.
Cambridge: Cambridge University Press.
Bickel, Balthasar. 2003. Referential density in discourse and syntactic typology. Lan-
guage 79(4). 708–736.
Bickel, Balthasar. 2009. Typological patterns and hidden diversity. Plenary talk
delivered at the 8th Biannual Meeting of the Association for Linguistic Typology,
Berkeley, United States of America, 24 July 2009.
(https://www.comparativelinguistics.uzh.ch/dam/jcr:00000000-774a-e877-0000-0000407727a0/alt2009bickel-plenary.pdf).
Bickel, Balthasar. 2015. Distributional typology. In Heine, Bernd & Narrog, Heiko
(eds.), The Oxford handbook of linguistic analysis, 2nd edn. Oxford: Oxford Uni-
versity Press. (https://doi.org/10.1093/oxfordhb/9780199677078.013.0046).
Blasi, Damian & Cotterell, Ryan & Wolf-Sonkin, Lawrence & Stoll, Sabine & Bickel,
Balthasar & Baroni, Marco. 2019. On the distribution of deep clausal embeddings:
A large cross-linguistic study. In Korhonen, Anna & Traum, David & Màrquez,
Lluís (eds.), Proceedings of the 57th Annual Meeting of the Association for Computa-
tional Linguistics, 3938–3943. Florence: Association for Computational Linguist-
ics.
Bromham, Lindell & Dinnage, Russell & Skirgård, Hedvig & Ritchie, Andrew & Car-
dillo, Marcel & Meakins, Felicity & Greenhill, Simon & Hua, Xia. 2021. Global pre-
dictors of language endangerment and the future of linguistic diversity. Nature
Ecology & Evolution. (https://doi.org/10.1038/s41559-021-01604-y).
Bybee, Joan L. & Pagliuca, William & Perkins, Revere D. 1990. On the asymmetries
in the affixation of grammatical material. In Croft, William & Denning, Keith &
Kemmer, Suzanne (eds.), Studies in typology and diachrony: Papers presented to
Joseph H. Greenberg on his 75th birthday, 1–42. Amsterdam: John Benjamins.
Chafe, Wallace (ed.). 1980. The Pear Stories: Cognitive, cultural, and linguistic aspects
of narrative production. Norwood, NJ: Ablex.
Chanard, Christian. 2015. ELAN-CorpA: Lexicon-aided annotation in ELAN. In
Mettouchi, Amina & Chanard, Christian (eds.), CorpAfroAs: The CorpAfroAs corpus
of spoken AfroAsiatic languages. (https://doi.org/10.1075/scl.68.website).
Cohen Priva, Uriel. 2017. Informativity and the actuation of lenition. Language 93(3).
569–597.
Comrie, Bernard. 1981. Language universals and linguistic typology. Oxford: Blackwell.
Coupé, Christophe & Oh, Yoon & Dediu, Dan & Pellegrino, François. 2019. Different
languages, similar encoding efficiency: Comparable information rates across the
human communicative niche. Science Advances 5(9). eaaw2594.
(https://doi.org/10.1126/sciadv.aaw2594).
Cutler, Ann & Hawkins, John A. & Gilligan, Gary. 1985. The suffixing preference: A
processing explanation. Linguistics 23(5). 723–758. (https://doi.org/10.1515/
ling.1985.23.5.723).
Cysouw, Michael. 2014. Inducing semantic roles. In Luraghi, Silvia & Narrog, Heiko
(eds.), Perspectives on semantic roles, 23–68. Amsterdam: John Benjamins.
(https://doi.org/10.1075/tsl.106.02cys).
Dahl, Östen. 2015. How WEIRD are WALS languages? Paper presented at the Closing
Conference of the Department of Linguistics at the Max Planck Institute for
Evolutionary Anthropology, Leipzig, Germany, 1–3 May 2015.
(https://www.eva.mpg.de/fileadmin/content_files/linguistics/conferences/2015-diversity-linguistics/Dahl_slides.pdf).
Dingemanse, Mark & Blythe, Joe & Dirksmeyer, Tyko. 2014. Formats for other-
initiation of repair across languages: An exercise in pragmatic typology. Studies
in Language 38(1). 5–43. (https://doi.org/10.1075/sl.38.1.01din).
Dingemanse, Mark & Rossi, Giovanni & Floyd, Simeon. 2017. Place reference in
story beginnings: A cross-linguistic study of narrative and interactional
affordances. Language in Society 46(2). 129–158.
(https://doi.org/10.1017/S0047404516001019).
Donohue, Mark. 1999. A grammar of Tukang Besi. Berlin: Mouton de Gruyter.
Dryer, Matthew S. 1992. The Greenbergian word order correlations. Language 68(1).
81–138.
Dryer, Matthew S. 2013. Feature 101A: Expression of pronominal subjects. In Dryer,
Matthew S. & Haspelmath, Martin (eds.), The World Atlas of Language Structures
Online. Leipzig: Max Planck Institute for Evolutionary Anthropology. (https://
wals.info/feature/101A).
Duranti, Alessandro. 1981. The Samoan fono: A sociolinguistic study. Canberra: Aus-
tralian National University.
Duranti, Alessandro. 1997. Linguistic anthropology. Cambridge: Cambridge University
Press.
ELAN developers. 2020. ELAN (Version 6.2). Nijmegen: Max Planck Institute for Psy-
cholinguistics. (https://archive.mpi.nl/tla/elan).
Frank, Austin F. & Jaeger, Florian. 2008. Speaking rationally: Uniform information
density as an optimal strategy for language production. Proceedings of the 30th An-
nual Meeting of the Cognitive Science Society (CogSci-08), Washington D.C., United
States of America, 23–26 July 2008. 939–944.
Futrell, Richard & Levy, Roger P. & Gibson, Edward. 2020. Dependency locality as an
explanatory principle for word order. Language 96(2). 371–412.
Futrell, Richard & Mahowald, Kyle & Gibson, Edward. 2015. Quantifying word order
freedom in dependency corpora. Proceedings of the 3rd International Conference
on Dependency Linguistics (Depling 2015), Uppsala, Sweden, 24–26 August 2015. 91–
100.
Gerdes, Kim & Kahane, Sylvain & Chen, Xinying. 2021. Typometrics: From implicational
to quantitative universals in word order typology. Glossa 6(1). 1–31.
(https://doi.org/10.5334/gjgl.764).
Gibson, Edward & Futrell, Richard & Piantadosi, Steven T. & Dautriche, Isabelle &
Mahowald, Kyle & Bergen, Leon & Levy, Roger. 2019. How efficiency shapes hu-
man language. Trends in Cognitive Sciences 23(12). 1087. (https://doi.org/10.
1016/j.tics.2019.09.005).
Gippert, Jost & Himmelmann, Nikolaus P. & Mosel, Ulrike (eds.). 2006. Essentials of
language documentation. Berlin: Mouton de Gruyter.
Greenberg, Joseph H. 1963. Some universals of grammar with particular reference
to the order of meaningful elements. In Greenberg, Joseph H. (ed.), Universals of
language, 73–113. Cambridge, MA: MIT Press.
Greenberg, Joseph H. 1966. Language universals, with special reference to feature hier-
archies. The Hague: Mouton.
Haig, Geoffrey & Schnell, Stefan. 2014. Annotations using GRAID (Grammatical Rela-
tions and Animacy in Discourse): Introduction and guidelines for annotators (Version
7.0). (https://multicast.aspra.uni-bamberg.de/#annotations).
Haig, Geoffrey & Schnell, Stefan (eds.). 2021. Multi-CAST: Multilingual Corpus of
Annotated Spoken Texts. Version 2108. (https://multicast.aspra.uni-bamberg.de/).
Haig, Geoffrey & Schnell, Stefan & Schiborr, Nils N. This volume. Universals of ref-
erence in discourse and grammar: Evidence from the Multi-CAST collection of
spoken corpora. In Haig, Geoffrey & Schnell, Stefan & Seifart, Frank (eds.), Doing
corpus-based typology with spoken language corpora: State of the art (Language Doc-
umentation & Conservation special publication 25), 141–177. Honolulu, HI: Univer-
sity of Hawai’i Press. (https://hdl.handle.net/10125/74660).
Haig, Geoffrey & Schnell, Stefan & Wegener, Claudia. 2011. Comparing corpora from
endangered languages: Explorations in language typology based on original texts.
In Haig, Geoffrey & Nau, Nicole & Schnell, Stefan & Wegener, Claudia (eds.),
Documenting endangered languages: Achievements and perspectives, 55–86. Berlin:
Mouton de Gruyter. (https://doi.org/10.1515/9783110260021.55).
Hale, Ken & Krauss, Michael & Watahomigie, Lucille J. & Yamamoto, Akira Y. & Craig,
Colette & Masayesva Jeanne, LaVerne & England, Nora C. 1992. Endangered lan-
guages. Language 68(1). 1–42. (https://doi.org/10.1353/lan.1992.0052).
Haspelmath, Martin. 2021. Explaining grammatical coding asymmetries: Form-
frequency correspondences and predictability. Journal of Linguistics 57(3). 605–
633. (https://doi.org/10.1017/S0022226720000535).
Haspelmath, Martin & Calude, Andreea & Spagnol, Michael & Narrog, Heiko & Bamy-
aci, Elif. 2014. Coding causal-noncausal verb alternations: A form-frequency cor-
respondence explanation. Journal of Linguistics 50(3). 587–625.
Hawkins, John A. 1994. A performance theory of order and constituency. Cambridge:
Cambridge University Press.
Hawkins, John A. 2004. Efficiency and complexity in grammars. Oxford: Oxford Uni-
versity Press.
Hawkins, John A. 2014. Cross-linguistic variation and efficiency. Oxford: Oxford Uni-
versity Press.
Hellwig, Birgit & Defina, Rebecca & Kidd, Evan & Allen, Shanley & Davidson, Lucy &
Kelly, Barbara F. This volume. Child language documentation: The sketch acquis-
ition project. In Haig, Geoffrey & Schnell, Stefan & Seifart, Frank (eds.), Doing
corpus-based typology with spoken language corpora: State of the art (Language
Documentation & Conservation special publication 25), 29–58. Honolulu, HI: Uni-
versity of Hawai’i Press. (https://hdl.handle.net/10125/74657).
Henrich, Joseph & Heine, Steven J. & Norenzayan, Ara. 2010. The weirdest people in
the world? Behavioral and Brain Sciences 33(2–3). 61–83. (https://doi.org/10.
1017/S0140525X0999152X).
Hildebrandt, Kristine A. & Jany, Carmen & Silva, Wilson (eds.). 2017. Documenting
variation in endangered languages (Language Documentation & Conservation special
publication 13). Honolulu, HI: University of Hawai’i Press.
(http://hdl.handle.net/10125/24754).
Himmelmann, Nikolaus P. 1998. Documentary and descriptive linguistics. Linguistics
36(2). 161–195.
Himmelmann, Nikolaus P. 2014. Asymmetries in the prosodic phrasing of function
words: Another look at the suffixing preference. Language 90(4). 927–960.
Hymes, Dell H. 1961. Functions of speech: An evolutionary approach. In Gruber,
Frederick C. (ed.), Anthropology and education, 55–83. Philadelphia, PA: University
of Pennsylvania Press.
Hymes, Dell H. 1962. The ethnography of speaking. In Gladwin, Thomas & Sturtevant,
William C. (eds.), Anthropology and human behaviour, 13–53. Washington, D.C.:
Anthropological Society of Washington.
Inbar, Maya & Grossman, Eitan & Landau, Ayelet N. 2020. Sequences of intonation
units form a 1 Hz rhythm. Scientific Reports 10(1). 15846. (https://doi.org/10.
1038/s41598-020-72739-4).
Izre’el, Shlomo & Mettouchi, Amina. 2015. Representation of speech in CorpAfroAs:
Transcriptional strategies and prosodic units. In Mettouchi, Amina & Vanhove,
Martine & Caubet, Dominique (eds.), Corpus-based studies of lesser-described lan-
guages: The CorpAfroAs corpus of spoken AfroAsiatic, 13–41. Amsterdam: John Ben-
jamins.
Jaeggli, Osvaldo & Safir, Kenneth J. 1989. The null subject parameter and parametric
theory. In Jaeggli, Osvaldo & Safir, Kenneth J. (eds.), The null subject parameter,
1–44. Dordrecht: Kluwer.
Jing, Yingqi & Widmer, Paul & Bickel, Balthasar. 2021. Word order variation is partially
constrained by syntactic complexity. Cognitive Science 45(11). e13056. (https://
doi.org/10.1111/cogs.13056).
Just, Erika & Čéplö, Slavomír. To appear. Differential object indexing in Maltese — a
corpus based pilot study. In Turek, Przemysław & Nintemann, Julia (eds.), Maltese:
Contemporary changes and historical innovations. Berlin: Mouton de Gruyter.
Koplenig, Alexander & Meyer, Peter & Wolfer, Sascha & Müller-Spitzer, Carolin. 2017.
The statistical trade-off between word order and word structure — Large-scale
evidence for the principle of least effort. PLoS ONE 12(3). e0173614. (https://doi.
org/10.1371/journal.pone.0173614).
Labov, William. 1972. Sociolinguistic patterns. Philadelphia, PA: University of
Pennsylvania Press.
Labov, William. 1994. Principles of linguistic change: Internal factors. Malden, MA:
Blackwell.
Levinson, Stephen C. & Evans, Nicholas. 2010. Time for a sea change in linguistics:
Response to comments on ‘The myth of language universals’. Lingua 120(12). 2733–
2758. (https://doi.org/10.1016/j.lingua.2010.08.001).
Levshina, Natalia. 2019. Token-based typology and word order entropy: A study based
on universal dependencies. Linguistic Typology 23(3). 533–572. (https://doi.
org/10.1515/lingty-2019-0025).
Levshina, Natalia. 2021a. Corpus-based typology: Applications, challenges and some
solutions. Linguistic Typology. (https://doi.org/10.1515/lingty-2020-0118).
Levshina, Natalia. 2021b. Cross-linguistic trade-offs and causal relationships between
cues to grammatical subject and object, and the problem of efficiency-related ex-
planations: A reverse-engineering approach. Frontiers in Psychology 12. 648200.
(https://doi.org/10.3389/fpsyg.2021.648200).
MacWhinney, Brian. 2000. The CHILDES Project: Tools for analysing talk. Mahwah, NJ:
Erlbaum.
Piantadosi, Steven T. & Tily, Harry J. & Gibson, Edward. 2011. Word lengths are optim-
ized for efficient communication. Proceedings of the National Academy of Sciences
108(9). 3526–3529. (https://doi.org/10.1073/pnas.1012551108).
Piantadosi, Steven T. & Tily, Harry J. & Gibson, Edward. 2012. The communicative
function of ambiguity in language. Cognition 122(3). 1280–1291.
Pimentel, Tiago & Meister, Clara & Salesky, Elizabeth & Teufel, Simone & Blasi, Dam-
ián & Cotterell, Ryan. 2021. A surprisal–duration trade-off across and within the
world’s languages. Preprint published on arXiv 2109.15000. (http://arxiv.org/
abs/2109.15000).
Rizzi, Luigi. 1982. Issues in Italian syntax. Dordrecht: Foris.
Roberts, Ian & Holmberg, Anders. 2009. Introduction: Parameters in minimalist theory.
In Biberauer, Theresa & Holmberg, Anders & Roberts, Ian & Sheehan, Michelle
(eds.), Parametric variation: Null subjects in minimalist theory, 1–57. Cambridge:
Cambridge University Press.
Sacks, Harvey & Schegloff, Emanuel A. & Jefferson, Gail. 1974. A simplest systemat-
ics for the organization of turn-taking in conversation. Language 50(4). 696–735.
(https://doi.org/10.2307/412243).
Salesky, Elizabeth & Chodroff, Eleanor & Pimentel, Tiago & Wiesner, Matthew &
Cotterell, Ryan & Black, Alan W. & Eisner, Jason. 2020. A corpus for large-scale
phonetic typology. Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics (ACL’20). 4526–4546.
(https://doi.org/10.18653/v1/2020.acl-main.415).
Schegloff, Emanuel A. 2006. Sequence organization in interaction: A primer in conver-
sation analysis. Cambridge: Cambridge University Press.
Schnell, Stefan & Barth, Danielle. 2018. Discourse motivations for pronominal and
zero objects across registers in Vera’a. Language Variation and Change 30(1). 51–
81.
Schnell, Stefan & Schiborr, Nils N. In press. Cross-linguistic corpus studies in linguistic
typology. Annual Review of Linguistics.
Seifart, Frank. This volume. Combining documentary linguistics and corpus phonetics
to advance corpus-based typology. In Haig, Geoffrey & Schnell, Stefan & Seifart,
Frank (eds.), Doing corpus-based typology with spoken language corpora: State of
the art (Language Documentation & Conservation special publication 25), 115–139.
Honolulu, HI: University of Hawai’i Press. (https://hdl.handle.net/10125/74659).
Seifart, Frank & Evans, Nicholas & Hammarström, Harald & Levinson, Stephen C.
2018. Language documentation 25 years on. Language 94(4). e324–e345. (https:
//doi.org/10.1353/lan.2018.0070).
Seifart, Frank & Haig, Geoffrey & Himmelmann, Nikolaus P. & Jung, Dagmar &
Margetts, Anne & Trilsbeek, Paul (eds.). 2012. Potentials of language documentation:
Methods, analyses, and utilization (Language Documentation & Conservation special
publication 3). Honolulu, HI: University of Hawai’i Press.
(http://hdl.handle.net/10125/4510).
Slobin, Dan I. (ed.). 1985. The crosslinguistic study of language acquisition, Volume 1:
The data. Mahwah, NJ: Erlbaum.
Stanford, James N. & Preston, Dennis R. (eds.). 2009. Variation in indigenous and minor-
ity languages. Amsterdam: John Benjamins.
Stoll, Sabine & Bickel, Balthasar. 2009. How deep are differences in referential dens-
ity? In Guo, Jiansheng & Lieven, Elena & Budwig, Nancy & Ervin-Tripp, Susan &
Nakamura, Keiko & Özçalışkan, Şeyda (eds.), Crosslinguistic approaches to the psy-
chology of language: Research in the tradition of Dan Isaac Slobin, 543–555. London:
Psychology Press.
Stoll, Sabine & Bickel, Balthasar. 2013. Capturing diversity in language acquisition
research. In Bickel, Balthasar & Grenoble, Lenore A. & Peterson, David A. & Tim-
berlake, Alan (eds.), Language typology and historical contingency, 195–216. Ams-
terdam: Benjamins.
Thieberger, Nicholas & Berez, Andrea L. 2012. Linguistic data management. In
Thieberger, Nicholas (ed.), The Oxford handbook of linguistic fieldwork, 90–118.
Oxford: Oxford University Press.
Tomlin, Russell. 1986. Basic word order: Functional principles. London: Routledge.
Torres Cacoullos, Rena & Travis, Catherine E. 2018. Bilingualism in the community:
Code-switching and grammars in contact. Cambridge: Cambridge University Press.
(https://doi.org/10.1017/9781108235259).
Torres Cacoullos, Rena & Travis, Catherine E. 2019. Variationist typology: Shared
probabilistic constraints across (non-)null subject languages. Linguistics 57(3).
653–692.
Vollmer, Maria C. 2019. How radical is pro-drop in Mandarin? A quantitative corpus
study on referential choice in Mandarin Chinese. MA thesis, University of Bamberg.
Wälchli, Bernhard. 2009. Data reduction typology and the bimodal distribution bias.
Linguistic Typology 13(1). 77–94.
Wälchli, Bernhard & Sölling, Arnd. 2013. The encoding of motion events: Building
typology bottom-up from text data in many languages. In Goschler, Juliana &
Stefanowitsch, Anatol (eds.), Variation and change in the encoding of motion events,
102–125. Amsterdam: John Benjamins.
Wilkinson, Mark D. & Dumontier, Michel & Aalbersberg, IJsbrand J. & Appleton, Gab-
rielle & Axton, Myles & Baak, Arie & Blomberg, Niklas & alii. 2016. The FAIR Guid-
ing Principles for scientific data management and stewardship. Scientific Data 3(1).
1–9. (https://doi.org/10.1038/sdata.2016.18).
Zakharko, Taras & Witzlack-Makarevich, Alena & Nichols, Johanna & Bickel,
Balthasar. 2017. Late aggregation as a design principle for typological databases.
Paper presented at the ALT Workshop on Design Principles of Typological Data-
bases, Canberra, Australia, 15 December 2017.
Zeman, Daniel & Nivre, Joakim & Abrams, Mitchell & alii. 2021. Universal Dependen-
cies 2.8. Prague: Universal Dependencies Consortium. (https://hdl.handle.net/
11234/1-3687).
Zipf, George K. 1935. The psycho-biology of language: An introduction to dynamic
philology. Cambridge, MA: MIT Press.