
InMiTe

Lesson 9 (2 November 2021)


Feedback on the BootCaT + AntConc exercise
Consulting annotated corpora: intro to the Corpus Query Language
Formal issues
• Follow the instructions concerning file names, folder structure, file formats, etc.
• In general: well done!
• BUT
• Files sent: “copy the texts included in the modules_it and modules_it_boot corpora in that folder. Rename the folder: modules_it_final”
• File formats: bootcat-antconc_exercise.rar, bootcat-antconc_exerciseØ, readme.modules
• File name issues: BootCat-AntConc_exercise.zip, Øbootcat-antconc_exercise, file zip (as folder name), readme.txt.txt, surnames_exercise.docx.docx, README.txt, ReadMe_AcademicCourseDescriptions.txt
• NB: better to avoid diacritics in file names
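Several of the file name issues listed above can be caught automatically before a submission is sent. The following is a minimal sketch, not part of the exercise instructions: the function name `file_name_issues` and the three checks it performs (exact match with the expected name, repeated extension, diacritics) are assumptions chosen for illustration.

```python
import unicodedata
from pathlib import PurePath


def file_name_issues(name: str, expected: str) -> list[str]:
    """Return a list of problems found in `name` (hypothetical checks)."""
    issues = []
    # The comparison is case-sensitive: README.txt is not readme.txt.
    if name != expected:
        issues.append(f"expected '{expected}'")
    # A doubled extension such as .txt.txt or .docx.docx.
    suffixes = PurePath(name).suffixes
    if len(suffixes) >= 2 and suffixes[-1] == suffixes[-2]:
        issues.append("repeated extension")
    # Decompose accented characters; combining marks reveal diacritics.
    decomposed = unicodedata.normalize("NFD", name)
    if any(unicodedata.combining(ch) for ch in decomposed):
        issues.append("diacritics in file name")
    return issues


for candidate in ["readme.txt", "readme.txt.txt", "README.txt"]:
    print(candidate, file_name_issues(candidate, "readme.txt"))
```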
BootCaT (1)
• Seed selection: complex terms vs. (genre-specific) n-grams
• Complex terms: for domain-oriented (& genre-oriented) corpora!
• Length = variable (from 2 to n words), selected among n-grams or based on concordance/cluster analysis of keywords
• They must be structurally complete (e.g. corso di studio, acquisizione delle conoscenze, laurea magistrale a ciclo unico)
• N-grams: for genre-oriented corpora!
• Length = 3 (3-grams), selected among the most frequent n-grams generated by AntConc
• Not necessarily “complete” from a structural point of view (e.g. per gli studenti, si propone di, la capacità di, del corso di)
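To make the notion of "most frequent 3-grams" concrete, here is a minimal sketch of counting n-grams in plain text. It is not how AntConc works internally: the tokenisation (lowercased runs of letters), the function name `top_ngrams` and the sample sentences are all assumptions made for illustration.

```python
import re
from collections import Counter


def top_ngrams(text: str, n: int = 3, k: int = 5) -> list[tuple[str, int]]:
    """Return the k most frequent n-grams in `text` with their counts."""
    # Lowercase and keep runs of letters only (accented letters included).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    # Slide a window of n tokens over the text; note that this simple
    # version lets n-grams cross sentence boundaries.
    ngrams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(ngrams).most_common(k)


# Invented sample text, for illustration only.
sample = ("Il corso si propone di fornire agli studenti le basi. "
          "Il corso si propone di sviluppare la capacità di analisi.")
print(top_ngrams(sample, n=3, k=3))
```

Frequent 3-grams found this way, such as "si propone di", are candidate genre seeds; they still need to be screened by hand, as the examples above show.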
BootCaT (2)
• For genre-oriented corpora (as in this case)
• ALL tuples should contain both key terms (simple and complex) and 3-grams
• Cleaning up seeds
• “Not structurally complete” n-grams are not the same as random text
• “esame da cfu” (from: esame da 15 CFU), “dell attività formativa” (from: dell’attività formativa), “ordinamento percorso crediti” (from: [boilerplate text]), “cfu cfu obiettivi” (from: [boilerplate text?])
• Be careful about seeds which might restrict searches to single websites/domains (or misleading ones)
• “guide web”, “insegnamento linguistico”, “scuola secondaria”, alunni…
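The requirement that every tuple mixes key terms and 3-grams can be expressed as a simple filter over seed combinations. The sketch below is an illustration only and does not reproduce BootCaT's own tuple generation: the function name `mixed_tuples`, the tuple size of 3 and the fixed random seed are assumptions.

```python
import random
from itertools import combinations


def mixed_tuples(terms, trigrams, size=3, how_many=10, seed=0):
    """Sample tuples that contain at least one term and one 3-gram."""
    pool = [(s, "term") for s in terms] + [(s, "3gram") for s in trigrams]
    candidates = [
        tuple(s for s, _ in combo)
        for combo in combinations(pool, size)
        # Keep a combination only if both kinds of seed are present.
        if {kind for _, kind in combo} == {"term", "3gram"}
    ]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(candidates, min(how_many, len(candidates)))


# Seeds taken from the examples in these slides.
terms = ["corso di studio", "acquisizione delle conoscenze",
         "laurea magistrale a ciclo unico"]
trigrams = ["si propone di", "la capacità di", "per gli studenti"]

for tup in mixed_tuples(terms, trigrams, how_many=5):
    print(" ".join(f'"{s}"' for s in tup))
```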
The readme.txt file (1)
Essential info (for BootCaT corpora)
• Target genre and/or domain
• “a sub-genre of informative texts” => informative texts are not a genre
• “domain is university course programmes” => this is not a domain
• Seeds and tuples
• No need to mention discarded items (seeds, tuples, …)
• No need to include URLs for BootCaT corpus => Only for manually-built corpora
• Mention “key” building steps (not in the form of a narration) for the corpus as a whole
• How initial seeds were selected (based on analysis of texts / intuition), whether you cleaned up tuples/URLs manually and what criteria you adopted, whether you excluded whole sites/domains, etc.
• NO: “I inserted these seeds, then BootCaT provided these tuples….”, or “I opened the Collocates function of AntConc…” or “I did not find irrelevant seeds”
• Be careful about vague sentences: “Sentences have to be easily understandable”, “texts are written in order to provide an accurate description of the course”
The readme.txt file (2)
• File editing
• In English!
• Format the file
• Empty lines between paragraphs
• Bulleted lists for seeds and tuples
• Capital letters for titles, …
• Watch out for typos
• “Broad terrms” and “signle parts”
• File format MUST be .txt
• If you draft the file in MS Word and then copy-paste your text into a text editor (or you save as text), check file appearance [=> my suggestion is not to use MS Word at all]
AntConc (1)
File specifications! Only leave the table…
• Keywords should be selected so as to be typical of the domain/genre under scrutiny
• Is “ateneo” likely to be a keyword for course descriptions? How about “tirocinio”?
• The notion of collocation/complex term
• “attestato di formazione”, “commissione di laurea”, “propedeutico alla laurea”, “corso di studio istituito”, “corso si articola in” => OK
• “presso l’Università”, “laurea in”, “prova scritta e”, “studenti saranno”, “in grado di” => NO. These are neither collocations nor complex terms (they may be colligations?)
• Always select and report complete units of meaning (unless instructions are given to the contrary)
• “modalità di verifica”, “il corso offre allo studente” => OK
• “ateneo si accerta che”, “obiettivi al termine”, “laurea magistrale in” => NO
AntConc (2)
• “Advanced” search
• Used to understand in which structures nodes co-occur with
collocates => In the exercise: impossible that 0 hits are returned

• “Collocates”
• “del” / “è” / “tuttora” / “deve” / “emato” / “genova”… they are not
collocates => they are not lexical words (instructions!)
• “al termine del corso lo studente…”: is “studente” a collocate of
“corso” in this context? (Cf. also: “numeriche” as collocate of
“corso”)
• After changing the “Collocate Measure”, remember to click on
“Start” again
• It is highly unlikely (if not impossible) that the 3 measures produce
the same lists of collocates
• At least MI yields different results
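Why different measures rank collocates differently can be seen from the formulas themselves. The sketch below uses the textbook definitions of Mutual Information (MI) and t-score with a simplified expected frequency that ignores the span size; AntConc's exact implementation may differ, and all frequencies are invented toy figures, not results from the exercise corpus.

```python
import math


def expected(f_node: int, f_coll: int, n: int) -> float:
    """Co-occurrences expected by chance (simplified: span size ignored)."""
    return f_node * f_coll / n


def mi(o: int, f_node: int, f_coll: int, n: int) -> float:
    """Mutual Information: log2 of observed over expected co-occurrences."""
    return math.log2(o / expected(f_node, f_coll, n))


def t_score(o: int, f_node: int, f_coll: int, n: int) -> float:
    """t-score: (observed - expected) divided by sqrt(observed)."""
    return (o - expected(f_node, f_coll, n)) / math.sqrt(o)


# Toy figures (assumptions): corpus size and frequency of the node "corso".
N, F_NODE = 1_000_000, 2_000
# collocate -> (co-occurrences with the node, overall frequency)
counts = {"laurea": (300, 3_000), "propedeutico": (8, 10), "studio": (400, 8_000)}

by_mi = sorted(counts, key=lambda w: mi(counts[w][0], F_NODE, counts[w][1], N), reverse=True)
by_t = sorted(counts, key=lambda w: t_score(counts[w][0], F_NODE, counts[w][1], N), reverse=True)
print("MI ranking:     ", by_mi)
print("t-score ranking:", by_t)
```

With these figures MI puts the rare word first, because it rewards exclusivity, while t-score puts the frequent word first, because it rewards the amount of evidence.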
“Translation-driven” corpora:* a typology
• Reference corpora: monolingual (usually)
• Specialised corpora: monolingual or multilingual
• Multilingual corpora: comparable or parallel
* Zanettin (2012)
During the Terminology part (if we have time)
Specialised corpora
• They represent a specific topic domain and/or genre
• Small, DIY, text only, consulted with one’s preferred concordancer (e.g. AntConc)
Reference corpora
• They can be used as a basis to describe language as a whole (caveats apply) or a macro-variety of a language, or to provide evidence on non-specialised language features
• Very large, often annotated, consulted with (online) concordancers provided by corpus creators
A few (online) reference corpora
• CORIS (Italian)
• http://corpora.dslo.unibo.it/TCORIS/
• British National Corpus (British English; registration required)
• http://bncweb.lancs.ac.uk/bncwebSignup/user/login.php
• COCA (American English)
• https://www.english-corpora.org/coca/
• Leeds Internet corpora
• Chinese, English, French, German, Italian, Russian, Spanish, …
• http://corpus.leeds.ac.uk/internet.html
• Mannheim corpora (German)
• http://corpora.ids-mannheim.de/ccdb/
• Corpus del Español (Spanish)
• http://www.corpusdelespanol.org/
• CREA (Spanish)
• http://corpus.rae.es/creanet.html
• Russian National Corpus (Russian)
• http://ruscorpora.ru/en/search-main.html
What characterises online corpora?
• Web-based architecture
• No need to install software on your computer, accessible
with most browsers, from PC, Mac and Linux computers
• You can’t upload your texts (but…)
What characterises online corpora?
• Web-based architecture
• Annotation: “Corpus annotation is the practice of adding interpretative linguistic information to a corpus” (Leech 2005:17)
• Contextual (“metadata”)
• Author, date of publication, topic, text type, …
• Structural
• Titles, chapter number, paragraphs, sentences, …
• Linguistic
• Morphosyntactic/Part-of-speech (POS) tagging: nouns, verbs,
inflected forms, …
• Lemmatisation: word base forms
What characterises online corpora?
• Web-based architecture
• Annotation
• Indexing
• Allows fast queries, even with very large corpora
• Allows exploitation of annotation in sophisticated ways (e.g.
search for word “process”, only when used as a verb, only
in IT-related texts)
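The "process as a verb in IT-related texts" query can be mimicked on a toy scale to show how indexing and annotation work together. This is a minimal sketch, not the architecture of any real corpus tool: the three mini-texts, the `topic` metadata field and the Penn-style tags are assumptions made for illustration.

```python
from collections import defaultdict

# Each text carries contextual metadata plus (word, POS, lemma) tokens.
corpus = [
    {"topic": "IT", "tokens": [("Servers", "NNS", "server"),
                               ("process", "VBP", "process"),
                               ("requests", "NNS", "request")]},
    {"topic": "IT", "tokens": [("The", "DT", "the"),
                               ("process", "NN", "process"),
                               ("crashed", "VBD", "crash")]},
    {"topic": "law", "tokens": [("Courts", "NNS", "court"),
                                ("process", "VBP", "process"),
                                ("claims", "NNS", "claim")]},
]

# Index: lowercased word form -> list of (text number, token position).
# Built once, so a query never has to scan the whole corpus.
index = defaultdict(list)
for doc_id, doc in enumerate(corpus):
    for position, (word, _tag, _lemma) in enumerate(doc["tokens"]):
        index[word.lower()].append((doc_id, position))


def search(word: str, tag_prefix: str, topic: str) -> list[tuple[int, int]]:
    """Find `word` where its POS tag starts with `tag_prefix`, in texts on `topic`."""
    hits = []
    for doc_id, position in index.get(word, []):
        doc = corpus[doc_id]
        if doc["topic"] == topic and doc["tokens"][position][1].startswith(tag_prefix):
            hits.append((doc_id, position))
    return hits


print(search("process", "V", "IT"))
```

In CQL-style syntax, a comparable token-level condition is typically written along the lines of `[word="process" & tag="V.*"]`; the attribute names and tag values depend on how each corpus was annotated.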
An annotated corpus (behind the scenes)
[Figure: an annotated corpus file, with the labels MANUAL and AUTOMATIC alongside CONTEXTUAL METADATA, STRUCTURAL METADATA and the token columns WORD (token), POS, LEMMA]
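A common way of storing this kind of annotation is the "vertical" format: one token per line with tab-separated attributes, and structural or contextual information on tag lines. The sketch below parses a tiny invented sample; the attribute order (word, POS, lemma), the tag names `<doc>` and `<s>`, and the metadata fields are assumptions, since real corpora define their own.

```python
import re

# Invented sample in a vertical-style layout (tabs separate the columns).
SAMPLE = """<doc author="unknown" topic="IT">
<s>
Servers\tNNS\tserver
process\tVBP\tprocess
requests\tNNS\trequest
</s>
</doc>
"""


def parse_vertical(text: str) -> list[dict]:
    """Turn vertical-format text into a list of documents with metadata."""
    docs = []
    for line in text.splitlines():
        if line.startswith("<doc"):
            # Contextual metadata: attribute="value" pairs on the <doc> tag.
            meta = dict(re.findall(r'(\w+)="([^"]*)"', line))
            docs.append({"meta": meta, "sentences": []})
        elif line == "<s>":
            # Structural annotation: a new sentence starts here.
            docs[-1]["sentences"].append([])
        elif line and not line.startswith("<"):
            # Linguistic annotation: one token with its POS tag and lemma.
            word, pos, lemma = line.split("\t")
            docs[-1]["sentences"][-1].append(
                {"word": word, "pos": pos, "lemma": lemma})
    return docs


parsed = parse_vertical(SAMPLE)
print(parsed[0]["meta"])
print([tok["lemma"] for tok in parsed[0]["sentences"][0]])
```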
Using annotated corpora: NoSketch Engine (NoSkE)
• A reduced version of Sketch Engine ($$$)
• A powerful tool with advanced features (word sketches, synonym detection, mono- and multi-lingual terminology extraction, etc.)
• Corpora that we will consult using NoSkE (@ CoLiTec)
• ukWaC: 2 billion words, web-derived, reference corpus of English
• acWaC-EU: 90 million words, specialised corpus of English by EU Universities (institutional academic language)
• Focus on Corpus Query Language
• Also used in interfaces based on Corpus WorkBench (CWB)
NoSkE- and CWB-based corpora
• Aranea corpora
• http://ucts.uniba.sk/guest/index.html
• Clarin Slovenia corpora
• http://nl.ijs.si/noske/index-en.html
• CoLiTec corpora
• http://corpora.dipintra.it
• CorpusEye
• http://corp.hum.sdu.dk
• Kontext corpora
• https://kontext.korpus.cz/
• Leeds Internet Corpora
• http://corpus.leeds.ac.uk/internet.html
• Opus Corpus
• http://opus.lingfil.uu.se/
