Intro To The Corpus Query Language
• How initial seeds were selected (based on analysis of texts / intuition), whether
you cleaned up tuples/URLs manually and what criteria you adopted, whether
you excluded whole sites/domains, etc.
• NO: “I inserted these seeds, then BootCaT provided these tuples….”, or “I
opened the Collocates function of AntConc…” or “I did not find irrelevant seeds”
• Be careful about vague sentences: “Sentences have to be easily
understandable”, “texts are written in order to provide an accurate description of
the course”
The readme.txt file (2)
• File editing
• In English!
• Format the file
• Empty lines between paragraphs
• Bulleted lists for seeds and tuples
• Capital letters for titles, …
• “Collocates”
• “del” / “è” / “tuttora” / “deve” / “emato” / “genova”… these are not
collocates, because they are not lexical words (see the instructions!)
• “al termine del corso lo studente…”: is “studente” a collocate of
“corso” in this context? (Cf. also: “numeriche” as collocate of
“corso”)
• After changing the “Collocate Measure”, remember to click on
“Start” again
• It is highly unlikely (if not impossible) that the 3 measures produce
the same lists of collocates
• At least MI yields different results
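Why the measures disagree can be sketched with a toy calculation. The block below uses common textbook formulations of MI and T-score (AntConc's exact implementation may differ, and the collocation window is ignored), and all frequencies are invented for illustration only.

```python
import math

# Hypothetical counts, invented for illustration (not from a real corpus).
CORPUS_SIZE = 100_000   # total tokens in the corpus
NODE_FREQ = 200         # frequency of the node word, e.g. "corso"

# candidate word -> (overall frequency, co-occurrences with the node)
candidates = {
    "studente": (50, 20),    # rare lexical word, often found near the node
    "del": (5000, 60),       # very frequent function word
}


def expected(node_freq, colloc_freq, corpus_size):
    """Co-occurrences expected by chance (simplified: window size ignored)."""
    return node_freq * colloc_freq / corpus_size


def mutual_information(observed, exp):
    """MI: log2 of observed over expected co-occurrences."""
    return math.log2(observed / exp)


def t_score(observed, exp):
    """T-score: (observed - expected) / sqrt(observed)."""
    return (observed - exp) / math.sqrt(observed)


scores = {}
for word, (freq, observed) in candidates.items():
    exp = expected(NODE_FREQ, freq, CORPUS_SIZE)
    scores[word] = {
        "MI": mutual_information(observed, exp),
        "T": t_score(observed, exp),
    }

# Rank the candidates under each measure, highest score first.
by_mi = sorted(scores, key=lambda w: scores[w]["MI"], reverse=True)
by_t = sorted(scores, key=lambda w: scores[w]["T"], reverse=True)
print("Ranked by MI:     ", by_mi)
print("Ranked by T-score:", by_t)
```

With these invented counts, MI ranks the rare lexical word first, while T-score ranks the frequent function word first: the two measures reward different things, so identical lists would be suspicious.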
Feedback on the
BootCaT + AntConc exercise
Consulting annotated corpora: intro to
the Corpus Query Language
“Translation-driven” corpora:* a typology
• Corpora: Reference / Specialised
* Zanettin (2012)
• Annotation
• Indexing
• Allows fast queries, even with very large corpora
• Allows exploitation of annotation in sophisticated ways (e.g.
search for word “process”, only when used as a verb, only
in IT-related texts)
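The two ideas above can be sketched in a few lines of Python. This is only an illustration, not how any real corpus tool is implemented: the Token fields, the POS labels, the "domain" metadata and the sample tokens are all assumptions made up for the example, and the index is keyed by lemma rather than by word form.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    word: str     # the form as it appears in the text
    pos: str      # part-of-speech tag (annotation)
    lemma: str    # dictionary form (annotation)
    domain: str   # contextual metadata of the source text (hypothetical field)


# A tiny invented corpus: each token carries its annotation.
corpus = [
    Token("We", "PRON", "we", "IT"),
    Token("process", "VERB", "process", "IT"),
    Token("requests", "NOUN", "request", "IT"),
    Token("The", "DET", "the", "IT"),
    Token("process", "NOUN", "process", "IT"),
    Token("crashed", "VERB", "crash", "IT"),
    Token("They", "PRON", "they", "law"),
    Token("processed", "VERB", "process", "law"),
    Token("claims", "NOUN", "claim", "law"),
]

# Indexing: map each lemma to the positions where it occurs, so that a
# query does not have to scan the whole corpus.
index = defaultdict(list)
for position, token in enumerate(corpus):
    index[token.lemma].append(position)


def search(lemma, pos, domain):
    """Positions of `lemma` with the given POS tag, in texts of `domain`."""
    return [
        i for i in index.get(lemma, [])
        if corpus[i].pos == pos and corpus[i].domain == domain
    ]


# "process", only as a verb, only in IT-related texts
hits = search("process", "VERB", "IT")
print(hits)
```

The lookup in the index narrows the search to the occurrences of one lemma; the annotation (POS, domain) then filters out the noun use and the non-IT text.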
An annotated corpus (behind the scenes)
[Figure: layers of an annotated corpus, labelled as manual or automatic annotation: contextual metadata, structural metadata, and token-level annotation (word/token, POS, lemma).]
Using annotated corpora:
• A reduced version of Sketch Engine ($$$)
• A powerful tool with advanced features (word sketches,
synonym detection, mono- and multi-lingual terminology
extraction, etc.)