
Corpora for Translators

Jarmila Fictumová
Corpora
Monolingual: foreign-language, Czech, general, specialized
(based on genre or field, ad hoc)
Bi- and Multilingual: parallel (= translation corpora);
comparable (e.g. for searching technical terminology)
Learner corpora: monolingual; parallel

CORPUS LINGUISTICS TERMINOLOGY
BASICS.

Tagging (mark-up; annotation): assigning explicit linguistic
information to a text (e.g. part-of-speech and semantic annotation).
TARGET LANGUAGE (TL): the language into which we translate.
CQL: Corpus Query Language, used to write corpus searches (the
abbreviation is also expanded as Contextual Query Language).
DIACHRONIC CORPUS: a corpus that documents language development
over an extended time period.
CONCORDANCE: the immediate context of a lexical unit.

CORPUS: an extensive collection of authentic electronic texts (written or
speech transcripts) collected according to specific criteria.
Corpus manager: software that searches for concordances of specified
terms; it finds all the instances of a term within a given corpus.
KWIC: key word in context; a display in which the key word is shown with
its context, usually aligned in the centre of the screen.
LEMMA: a word form chosen as the representative (a headword) of a
group of related word forms.
Lexeme: each unique word within a corpus (e.g. "He lived by the forest
down by the river" contains nine words but only seven unique ones,
because "by" and "the" each occur twice).
OPEN CORPUS: a corpus to which new content is added regularly.
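
The KWIC display described above can be illustrated with a minimal Python sketch. It is not how any particular corpus manager works; the function name `kwic`, the `width` parameter and the square brackets around the key word are assumptions made for this example only.

```python
import re

def kwic(text, keyword, width=30):
    """Return one aligned line per occurrence of keyword in text."""
    lines = []
    # \b restricts the match to whole words; IGNORECASE finds "By" and "by"
    pattern = r"\b" + re.escape(keyword) + r"\b"
    for m in re.finditer(pattern, text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-justify the left context so every key word starts in the same column
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

sample = "He lived by the forest down by the river"
for line in kwic(sample, "by"):
    print(line)
```

Each printed line shows one occurrence of the key word with its left and right context, which is the essence of a concordance listing.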


PARALLEL CORPUS: texts in different languages that are translations of
one another, aligned sentence by sentence, not unlike translation memories.
COMPARABLE CORPUS: texts in different languages that are not
translations, but do have some features in common.
SYNCHRONIC CORPUS: a corpus that represents a language at one point in
time; it is not used to study changes resulting from language development.
Tokens: all running words contained within a corpus, counting every
occurrence regardless of form.
Source Language (SL): the language from which we translate.
Alignment/Pairing: finding and matching the corresponding
segments in different language versions of a text.
Find more at Overview of the basic corpus linguistics terms
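
The difference between tokens and unique words (the glossary's "lexemes", often called types elsewhere) can be checked with a few lines of Python. This is only a sketch: the `tokenize` helper is a deliberately naive assumption (lowercasing, letters and apostrophes only), whereas real corpus tools use far more careful tokenization.

```python
import re
from collections import Counter

def tokenize(text):
    # Simplification: lowercase the text and treat runs of letters/apostrophes as words
    return re.findall(r"[a-z']+", text.lower())

sentence = "He lived by the forest down by the river"
tokens = tokenize(sentence)
counts = Counter(tokens)

print(len(tokens))            # number of tokens (running words): 9
print(len(counts))            # number of unique word forms: 7
print(counts.most_common(2))  # the two most frequent forms with their counts
```

Lemmas are a further step that this sketch does not attempt: grouping forms such as "lived", "lives" and "living" under "live" needs a morphological dictionary or a lemmatizer.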


Corpus Tools


Information taken from the article "Corpus Linguistics
Help with Text Writing" (muni.cz 14.1.2014)
by Zuzana Nevěřilová, researcher at the Natural
Language Processing Centre at the Faculty of
Informatics at MU and teacher at the Centre for
Computer Linguistics at the Faculty of Arts at
MU.


The Sketch Engine has been developed for over ten years by Lexical Computing
Ltd. in cooperation with the Natural Language Processing Centre at Masaryk
University. All university students and employees have free access to this
corpus-based program.
... The Sketch Engine computes a word sketch showing which partner words the key
word co-occurs with, how often, and in what context.
The Sketch Engine can then use the word sketches to compute suitable word
partners for larger units (phrases). The output of this process is a Thesaurus that
helps us find words related in meaning.
The software also contains a number of advanced functions for working
with user-generated corpora (automatic keyword extraction, sub-corpora based on
document length or author attributes) and with multilingual (parallel) corpora. The
Sketch Engine currently provides access to more than 400 corpora in 70 languages.
All of the functions are described in the documentation.
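
The idea of "partner words" behind a word sketch can be approximated with a simple co-occurrence count. The Python sketch below is a strong simplification and not the Sketch Engine's method, which groups partners by grammatical relation and ranks them statistically; the function `cooccurrences`, the `window` parameter and the sample text are all invented for illustration.

```python
from collections import Counter

def cooccurrences(tokens, keyword, window=2):
    """Count words appearing within `window` tokens of each keyword occurrence."""
    partners = Counter()
    for i, tok in enumerate(tokens):
        if tok != keyword:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the keyword itself
                partners[tokens[j]] += 1
    return partners

# Tiny made-up sample, purely for illustration
text = "strong tea and strong coffee but weak tea and weak coffee with strong tea"
tokens = text.split()
print(cooccurrences(tokens, "tea", window=1).most_common(3))
```

Raw counts like these favour frequent function words such as "and", which is one reason real tools filter by grammatical relation and use association scores rather than plain frequency.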
WEB-BASED CORPORA

(MORE INFORMATION in the article about a lecture by ING. VLADIMÍR BENKO)


ENGLISH-LANGUAGE CORPORA
Araneum Anglicum Maius (En Web 14.04) 1,20 G
enTenTen12
New Model Corpus
ukWaC
Times

CZECH-LANGUAGE CORPORA
czTenTen12 [v. 7]
OPUS2 Czech
CzechParl 2012
Bruna Bohemica Minor (czes 14.04) 121 M

CNK



CORPUS INTERFACE USER GUIDE





FOREIGN-LANGUAGE CORPORA

Mark Davies: Professor, Corpus Linguistics, Brigham Young
University
corpus.byu.edu

Further options


University of Leeds: Tutorial in English



PILOT RUN

A TOOL FOR CREATING ERROR-TAGGED MONO- AND BILINGUAL PARALLEL
OR LEARNER CORPORA.
