Intro To The Corpus Query Language

InMiTe
Lesson 9 (2 November 2021)

Feedback on the
BootCaT + AntConc exercise
Consulting annotated corpora: intro to the
Corpus Query Language
Formal issues
• Follow the instructions concerning file names, folder
structure, file formats, etc.
• In general: well done!
• BUT
• Files sent: “copy the texts included in the modules_it and
modules_it_boot corpora in that folder. Rename the folder:
modules_it_final”
• File formats: bootcat-antconc_exercise.rar, bootcat-
antconc_exerciseØ, readme.modules
• File name issues: BootCat-AntConc_exercise.zip, Øbootcat-
antconc_exercise, file zip (as folder name), readme.txt.txt,
surnames_exercise.docx.docx, README.txt,
ReadMe_AcademicCourseDescriptions.txt
• NB: better to avoid diacritics in file names
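The file-name problems listed above can be caught automatically before sending. The snippet below is only a minimal sketch: the function name `file_name_issues` and the four checks are assumptions made for illustration (they are not part of the exercise instructions), and which names are actually expected depends on those instructions.

```python
from pathlib import PurePath
from typing import List


def file_name_issues(name: str) -> List[str]:
    """Return a list of likely problems in a file name (hypothetical checks)."""
    issues = []
    suffixes = [s.lower() for s in PurePath(name).suffixes]
    # Doubled extension, e.g. "readme.txt.txt"
    if len(suffixes) >= 2 and suffixes[-1] == suffixes[-2]:
        issues.append("doubled extension")
    # No extension at all
    if not suffixes:
        issues.append("missing extension")
    # Diacritics and other non-ASCII characters
    if not name.isascii():
        issues.append("non-ASCII characters")
    # Capital letters (assumption: the expected names are all lower case)
    if name != name.lower():
        issues.append("capital letters")
    return issues


for candidate in ["readme.txt.txt", "README.txt", "readme.txt"]:
    print(candidate, file_name_issues(candidate))
```

An empty list means that none of these particular checks fired; it does not prove that the name matches the instructions.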
BootCaT (1)
• Seed selection: complex terms vs. (genre-specific) n-grams
• Complex terms: for domain-oriented (& genre-oriented) corpora!
• Length = variable (from 2 to n words), selected among n-grams or
based on concordance/cluster analysis of keywords
• They must be structurally complete (e.g. corso di studio,
acquisizione delle conoscenze, laurea magistrale a ciclo unico)
• N-grams: for genre-oriented corpora!
• Length = 3 (3-grams), selected among the most frequent n-grams
generated by AntConc
• Not necessarily “complete” from a structural point of view (e.g. per
gli studenti, si propone di, la capacità di, del corso di)
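To make "the most frequent 3-grams" concrete, here is a minimal sketch of how such a list can be computed. It is not how AntConc works internally: the function name `top_ngrams`, the simple `\w+` tokenisation (which splits forms such as dell'attività at the apostrophe) and the sample text are all assumptions made for illustration.

```python
import re
from collections import Counter


def top_ngrams(text, n=3, k=5):
    """Return the k most frequent n-grams in text as (n-gram, frequency) pairs."""
    # Lower-case and keep only word characters (accented letters included)
    tokens = re.findall(r"\w+", text.lower())
    # Slide a window of n tokens over the token list
    ngrams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(" ".join(gram) for gram in ngrams)
    return counts.most_common(k)


sample = (
    "Il corso si propone di fornire agli studenti la capacità di analisi. "
    "Il corso si propone di sviluppare la capacità di sintesi."
)
for gram, freq in top_ngrams(sample, n=3, k=4):
    print(freq, gram)
```

On this invented sample, the four 3-grams that occur twice include "si propone di" and "la capacità di", i.e. sequences that are frequent in the genre without being structurally complete.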
BootCaT (2)
• For genre-oriented corpora (as in this case)
• ALL tuples should contain both key terms (simple and
complex) and 3-grams
• Cleaning up seeds
• A “not structurally complete” n-gram is not random text
• “esame da cfu”, “dell attività formativa”, “ordinamento
percorso crediti”, “cfu cfu obiettivi”
(from: esame da 15 CFU // dell’attività formativa // [boilerplate text] // [boilerplate text?])
• Be careful about seeds which might restrict searches to
single websites/domains (or misleading ones)
• “guide web”, “insegnamento linguistico”, “scuola
secondaria”, alunni…
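The requirement that every tuple mixes key terms and 3-grams can be illustrated with a short sketch. BootCaT generates tuples on its own; the code below does not reproduce its procedure. The function name `build_tuples`, its parameters and the seed lists are hypothetical, and the two lists are assumed not to share any item.

```python
import random


def build_tuples(key_terms, trigrams, n_tuples=5, size=3, seed=0):
    """Build distinct tuples, each with at least one key term and one 3-gram."""
    if size < 2:
        raise ValueError("size must be at least 2")
    rng = random.Random(seed)  # fixed seed so that the example is repeatable
    pool = key_terms + trigrams
    tuples = set()
    attempts = 0
    while len(tuples) < n_tuples and attempts < 1000:
        attempts += 1
        term = rng.choice(key_terms)   # guarantees one key term
        gram = rng.choice(trigrams)    # guarantees one 3-gram
        rest = [s for s in pool if s not in (term, gram)]
        extra = rng.sample(rest, size - 2)  # fill the remaining slots
        tuples.add(tuple(sorted([term, gram] + extra)))
    return sorted(tuples)


key_terms = ["corso di studio", "laurea magistrale", "tirocinio"]
trigrams = ["si propone di", "la capacità di", "per gli studenti"]
for t in build_tuples(key_terms, trigrams):
    print(t)
```

Picking one item from each list first, and only then filling the remaining slots at random, is what enforces the constraint.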
The readme.txt file (1)
Essential info (for BootCaT corpora)
• Target genre and/or domain
• “a sub-genre of informative texts” => informative texts are not a genre
• “domain is university course programmes” => this is not a domain
• Seeds and tuples
• No need to mention discarded items (seeds, tuples, …)
• No need to include URLs for BootCaT corpus => Only for manually-built corpora
• Mention “key” building steps (not in the form of a narration) for the corpus as a whole
• How initial seeds were selected (based on analysis of texts / intuition), whether
you cleaned up tuples/URLs manually and what criteria you adopted, whether
you excluded whole sites/domains, etc.
• NO: “I inserted these seeds, then BootCaT provided these tuples….”, or “I
opened the Collocates function of AntConc…” or “I did not find irrelevant seeds”
• Be careful about vague sentences: “Sentences have to be easily
understandable”, “texts are written in order to provide an accurate description of
the course”
The readme.txt file (2)
• File editing
• In English!
• Format the file
• Empty lines between paragraphs
• Bulleted lists for seeds and tuples
• Capital letters for titles, …
• Watch out for typos
• “Broad terrms and signle parts”
• File format MUST be .txt
• If you draft the file in MS Word and then copy-paste your text in a
text editor (or you save as text), check file appearance [=> my
suggestion is not to use MS Word at all]
AntConc (1)
File specifications! Only leave the table…
• Keywords should be selected so as to be typical of the domain/genre
under scrutiny
• Is “ateneo” likely to be a keyword for course descriptions? How about
“tirocinio”?
• The notion of collocation/complex term
• “attestato di formazione”, “commissione di laurea”, “propedeutico alla
laurea”, “corso di studio istituito”, “corso si articola in” => OK
• “presso l’Università”, “laurea in”, “prova scritta e”, “studenti saranno”, “in
grado di” => NO. These are neither collocations nor complex terms (they
may be colligations?)
• Always select and report complete units of meaning (unless
instructions are given to the contrary)
• “modalità di verifica”, “il corso offre allo studente” => OK
• “ateneo si accerta che”, “obiettivi al termine”, “laurea magistrale in” => NO
AntConc (2)
• “Advanced” search
• Used to understand in which structures nodes co-occur with
collocates => In the exercise: impossible that 0 hits are returned
• “Collocates”
• “del” / “è” / “tuttora” / “deve” / “emato” / “genova”… they are not
collocates => they are not lexical words (instructions!)
• “al termine del corso lo studente…”: is “studente” a collocate of
“corso” in this context? (Cf. also: “numeriche” as collocate of
“corso”)
• After changing the “Collocate Measure”, remember to click on
“Start” again
• It is highly unlikely (if not impossible) that the 3 measures produce
the same lists of collocates
• At least MI yields different results
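Why the measures rank collocates differently can be seen from a small worked example. The sketch below uses the common textbook definitions, MI = log2(O/E) and t-score = (O - E)/sqrt(O), with E = f(node) x f(collocate) / N; AntConc's own implementation may differ in detail (for instance in how the span is taken into account), and all frequencies are invented for illustration.

```python
import math


def expected(node_freq, coll_freq, corpus_size):
    """Co-occurrences expected if node and collocate were independent."""
    return node_freq * coll_freq / corpus_size


def mi_score(observed, node_freq, coll_freq, corpus_size):
    """Mutual Information: log2 of observed over expected co-occurrences."""
    return math.log2(observed / expected(node_freq, coll_freq, corpus_size))


def t_score(observed, node_freq, coll_freq, corpus_size):
    """T-score: (observed - expected) / sqrt(observed)."""
    exp = expected(node_freq, coll_freq, corpus_size)
    return (observed - exp) / math.sqrt(observed)


# Invented counts: node "corso" occurs 1,000 times in a 100,000-token corpus
N, NODE = 100_000, 1_000
candidates = {
    "propedeutico": (20, 15),   # (collocate frequency, co-occurrences): rare word
    "studio": (2_000, 200),     # frequent word
}
for word, (freq, obs) in candidates.items():
    print(word,
          round(mi_score(obs, NODE, freq, N), 2),
          round(t_score(obs, NODE, freq, N), 2))
```

With these counts MI ranks the rare "propedeutico" first, while the t-score ranks the frequent "studio" first: MI rewards exclusivity, the t-score rewards the amount of evidence.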
Consulting annotated corpora: intro to the Corpus Query Language
“Translation-driven” corpora:* a typology
• Corpora
  • Reference
    • Monolingual (usually)
  • Specialised
    • Monolingual
    • Multilingual
      • Comparable
      • Parallel
During the Terminology part (if we have time)
* Zanettin (2012)
Specialised corpora
• They represent a specific topic domain and/or genre
• Small, DIY, text only, consulted with one’s preferred concordancer
(e.g. AntConc)
Reference corpora
• They can be used as a basis to describe language as a whole
(caveats apply) or a macro-variety of a language, or to provide
evidence on non-specialised language features
• Very large, often annotated, consulted with (online) concordancers
provided by corpus creators
A few (online) reference corpora
• CORIS (Italian)
• http://corpora.dslo.unibo.it/TCORIS/
• British National Corpus (British English; registration required)
• http://bncweb.lancs.ac.uk/bncwebSignup/user/login.php
• COCA (American English)
• https://www.english-corpora.org/coca/
• Leeds Internet corpora
• Chinese, English, French, German, Italian, Russian, Spanish, …
• http://corpus.leeds.ac.uk/internet.html
• Mannheim corpora (German)
• http://corpora.ids-mannheim.de/ccdb/
• Corpus del Español (Spanish)
• http://www.corpusdelespanol.org/
• CREA (Spanish)
• http://corpus.rae.es/creanet.html
• Russian National Corpus (Russian)
• http://ruscorpora.ru/en/search-main.html
What characterises online corpora?
• Web-based architecture
• No need to install software on your computer, accessible
with most browsers, from PC, Mac and Linux computers
• You can’t upload your texts (but…)
• Annotation
“Corpus annotation is the practice of adding interpretative linguistic
information to a corpus” (Leech 2005:17)
• Contextual (“metadata”)
• Author, date of publication, topic, text type, …
• Structural
• Titles, chapter number, paragraphs, sentences, …
• Linguistic
• Morphosyntactic/Part-of-speech (POS) tagging: nouns, verbs,
inflected forms, …
• Lemmatisation: word base forms
• Indexing
• Allows fast queries, even with very large corpora
• Allows exploitation of annotation in sophisticated ways (e.g.
search for word “process”, only when used as a verb, only
in IT-related texts)
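The idea behind indexing can be sketched in a few lines: positions are stored once under the annotation values, so that a query becomes a lookup instead of a scan of the whole corpus. This is an illustration only, not the data structure used by any real concordancer; the mini-corpus, the tag names and the function names are hypothetical. In CQL as used by NoSketch Engine, a comparable restriction is typically written along the lines of `[lemma="process" & tag="V.*"]`, although attribute names and tag values depend on the corpus.

```python
from collections import defaultdict

# Hypothetical annotated mini-corpus: (word, POS, lemma, topic of the text)
TOKENS = [
    ("The", "DET", "the", "IT"),
    ("servers", "NOUN", "server", "IT"),
    ("process", "VERB", "process", "IT"),
    ("requests", "NOUN", "request", "IT"),
    ("The", "DET", "the", "law"),
    ("process", "NOUN", "process", "law"),
    ("was", "VERB", "be", "law"),
    ("slow", "ADJ", "slow", "law"),
    ("We", "PRON", "we", "IT"),
    ("processed", "VERB", "process", "IT"),
    ("logs", "NOUN", "log", "IT"),
]


def build_index(tokens):
    """Map each (lemma, POS, topic) combination to its corpus positions."""
    index = defaultdict(list)
    for position, (word, pos, lemma, topic) in enumerate(tokens):
        index[(lemma, pos, topic)].append(position)
    return index


def query(index, lemma, pos, topic):
    """Look up positions directly, without re-reading the corpus."""
    return index.get((lemma, pos, topic), [])


index = build_index(TOKENS)
hits = query(index, "process", "VERB", "IT")
print(hits, [TOKENS[i][0] for i in hits])
```

Because the index is keyed on the lemma, the query finds both "process" and "processed" used as verbs in IT texts, and skips the noun in the legal text.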
An annotated corpus (behind the scenes)
[Figure: an annotated corpus file. Labels: MANUAL, AUTOMATIC, CONTEXTUAL METADATA, STRUCTURAL METADATA, WORD (token), POS, LEMMA]
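Annotated corpora of this kind are often stored in a "vertical" layout: one token per line with tab-separated WORD, POS and LEMMA columns, and XML-like tags for contextual and structural metadata. The sketch below reads a tiny sample in that layout. The sample text, its tag and attribute names and the function name `parse_vertical` are assumptions made for illustration, and a token beginning with "<" would not be handled correctly by this simplified reader.

```python
import re

# Hypothetical sample: <doc> carries contextual metadata, <s> marks sentences
SAMPLE = """<doc author="A. Writer" year="2021">
<s>
The\tDT\tthe
students\tNNS\tstudent
process\tVBP\tprocess
data\tNNS\tdata
</s>
</doc>
"""

ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')


def parse_vertical(text):
    """Return one dict per token, with its annotation and document metadata."""
    tokens = []
    doc_meta = {}
    sentence = 0
    for line in text.splitlines():
        if not line:
            continue
        if line.startswith("<doc"):
            # Contextual metadata: attributes of the <doc> tag
            doc_meta = dict(ATTRIBUTE.findall(line))
        elif line == "<s>":
            # Structural metadata: a new sentence starts here
            sentence += 1
        elif line.startswith("</"):
            continue
        else:
            word, pos, lemma = line.split("\t")
            tokens.append({"word": word, "pos": pos, "lemma": lemma,
                           "sentence": sentence, **doc_meta})
    return tokens


for token in parse_vertical(SAMPLE):
    print(token)
```

Each token ends up carrying its own linguistic annotation plus the metadata of the sentence and document it belongs to, which is what makes combined queries possible.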
Using annotated corpora: NoSketch Engine (NoSkE)
• A reduced version of Sketch Engine ($$$)
• A powerful tool with advanced features (word sketches,
synonym detection, mono- and multi-lingual terminology
extraction, etc.)
• Corpora that we will consult using NoSkE (@ CoLiTec)
• ukWaC: 2 billion words, web-derived, reference corpus of
English
• acWaC-EU: 90 million words, specialised corpus of English
by EU Universities (institutional academic language)
• Focus on Corpus Query Language
• Also used in interfaces based on Corpus WorkBench (CWB)
NoSkE- and CWB-based corpora
• Aranea corpora
• http://ucts.uniba.sk/guest/index.html
• Clarin Slovenia corpora
• http://nl.ijs.si/noske/index-en.html
• CoLiTec corpora
• http://corpora.dipintra.it
• CorpusEye
• http://corp.hum.sdu.dk
• Kontext corpora
• https://kontext.korpus.cz/
• Leeds Internet Corpora
• http://corpus.leeds.ac.uk/internet.html
• Opus Corpus
• http://opus.lingfil.uu.se/
