
1. What is a concordance?

A display of every instance of a word or other search item in a corpus,
together with a given amount of preceding and following context for
each result.

2. What is KWIC?
An acronym for ‘key word in context’. A KWIC view is a concordance
view: each hit is shown in the middle of a line, with its context on
either side.
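
To make the idea of a concordance in KWIC view concrete, here is a minimal sketch in Python. It is not how any particular concordancer works; the function name `kwic`, the `width` parameter and the sample text are assumptions made up for this illustration.

```python
import re

def kwic(text, keyword, width=30):
    """Return one line per hit of `keyword` (whole word, case-insensitive)."""
    lines = []
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the key word lines up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

# Invented sample text, for illustration only.
sample = "The cat sat on the mat. A cat is a small animal. My cat sleeps all day."
for line in kwic(sample, "cat", width=20):
    print(line)
```

Each printed line is one result: the key word in brackets, with up to `width` characters of preceding and following context.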

3. What is a token?
The smallest text unit in a corpus. Any single, particular instance of an
individual word in a corpus, although a single word can be split into
more than one token, for example he’s (he + ’s).

4. What is the difference between an absolute (raw) and a relative
(normalized) frequency?
Absolute frequency is the total number of results (hits) you get.
Relative (normalized) frequency is the absolute frequency expressed as
a proportion of the size of the corpus, usually per million words. It
makes it possible to compare corpora of different sizes.
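
The calculation can be sketched in a few lines of Python. The function name `per_million` and the counts below are hypothetical, chosen only to show why raw counts alone can mislead when corpora differ in size.

```python
def per_million(raw_hits, corpus_size):
    """Relative frequency: hits per million words of the corpus."""
    return raw_hits * 1_000_000 / corpus_size

# Hypothetical counts, for illustration only.
small = per_million(raw_hits=500, corpus_size=2_000_000)
large = per_million(raw_hits=20_000, corpus_size=100_000_000)

print(small)  # 250.0 per million words
print(large)  # 200.0 per million words
```

In this made-up example the larger corpus has far more hits in absolute terms, yet the item is relatively more frequent in the smaller corpus.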

5. What does pmw stand for?


‘per million words’. Another abbreviation used is ipm – ‘instances per
million’. This is a way of giving a relative (normalized) frequency.

6. What is a corpus in language studies and what is it good for?

It is an electronic collection of texts of different types, to be used
and/or studied with various automated tools.

7. What is a lemma search good for?


It finds all word forms of a lexeme in a single search (e.g. run, runs,
running, ran). Make sure that what you enter in the lemma search is a
lemma (the dictionary form of the lexeme).

8. What is tokenization?
The automatic process of converting all of a text into separate tokens,
for example, by splitting conjoined words like he’s, separating
punctuation (such as commas and full stops) from words and
removing capitalisation.
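
A rough sketch of these three steps in Python follows. It is a deliberate simplification and not the tokenizer of any real corpus tool: the regular expression and the function name `tokenize` are assumptions for this illustration, and forms such as isn't or quoted words would need extra rules.

```python
import re

# A token is a run of word characters, a clitic such as 's or ’s,
# or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+|['’]\w+|[^\w\s]")

def tokenize(text):
    """Lowercase the text, then split it into words, clitics and punctuation."""
    return TOKEN_RE.findall(text.lower())

print(tokenize("He’s here, and she’s late."))
# ['he', '’s', 'here', ',', 'and', 'she', '’s', 'late', '.']
```

Note that he’s comes out as two tokens (he + ’s), and the comma and the full stop become tokens of their own.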

12. What is tagging?


An informal term for corpus annotation, especially forms of
annotation that assign an analysis to every word in a corpus (such as
part-of-speech tagging).

9. What is POS? What is the plural of the noun corpus?


POS stands for part of speech. The plural of corpus is corpora.

10. What is a concordancer?


A concordancer is a corpus tool that retrieves every instance of a
specific item from a corpus (most typically a word, but also a part of
a word, or a phrase). In other words, it gives you concordances (or
KWICs: key words in context).

11. What is COCA and how much of it is spoken language?


The Corpus of Contemporary American English. COCA contains about one
billion words, of which 127.4 million are transcribed speech.

12. What is the BNC and how big is it?


The British National Corpus. It contains 100 million words (90 million
words of written language and 10 million words of spoken language).
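
As a quick check on the proportions, the spoken share of each corpus can be worked out from the figures given in these notes. This is only arithmetic on those rounded figures (in millions of words); the function name `share` is made up for the illustration.

```python
def share(part, whole):
    """Percentage of `whole` made up by `part`."""
    return part / whole * 100

# Sizes in millions of words, as stated in the answers above.
coca_spoken = share(127.4, 1000)  # about 12.7 % of COCA is spoken
bnc_spoken = share(10, 100)       # 10 % of the BNC is spoken

print(f"COCA: {coca_spoken:.1f} %  BNC: {bnc_spoken:.1f} %")
```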

13. What is a tagger?


A program which automatically assigns a tag (e.g. information about
part of speech) to every item in the corpus.
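
The basic idea can be sketched with a toy lookup tagger in Python. Everything here is hypothetical: the mini-lexicon, the tag labels (DET, NOUN and so on) and the function name `tag` are invented for this illustration. Real taggers also use the context of a word to decide between possible tags; this sketch does not.

```python
# Hypothetical mini-lexicon and tag labels, invented for this illustration.
LEXICON = {
    "the": "DET", "a": "DET",
    "cat": "NOUN", "mat": "NOUN",
    "sat": "VERB", "on": "PREP",
    ".": "PUNCT",
}

def tag(tokens, default="UNK"):
    """Attach a tag to every token; unknown items get the default tag."""
    return [(tok, LEXICON.get(tok.lower(), default)) for tok in tokens]

for token, label in tag(["The", "cat", "sat", "on", "the", "mat", "."]):
    print(token, label)
```

Every item gets a label, including the punctuation mark; a token missing from the lexicon would be labelled UNK.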

13. What is a reference corpus?


A reference corpus is any corpus chosen as a standard of comparison
for your own corpus. It usually has to be quite large and
representative, and it does not change over time. A reference corpus
is needed, for example, when calculating keywords.

14. What is a tagset?


The list of tags, i.e. the labels attached to every single item in the
corpus. The item is most typically a word (the tag then gives e.g. its
part of speech = POS), but punctuation etc. is tagged as well.

15. What does * stand for in the COCA search engine?
Either any sequence of characters (when used as part of a single
token), or any one token (when separated from other tokens by spaces).

16. What does pipe | stand for?


Either/or. X|Y matches X or Y, so the results include hits for both.
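
The behaviour of * and | described in the last two answers can be imitated for single tokens with regular expressions in Python. This is only an approximation for study purposes, not the COCA search engine itself; the function names `pattern_to_regex` and `search` and the token list are assumptions for this sketch.

```python
import re

def pattern_to_regex(query):
    """Translate a single-token query containing * and | into a regex."""
    alternatives = []
    for alt in query.split("|"):
        # Escape everything, then turn the escaped * into "any characters".
        alternatives.append(re.escape(alt).replace(r"\*", ".*"))
    return re.compile("(?:" + "|".join(alternatives) + ")", re.IGNORECASE)

def search(tokens, query):
    """Return the tokens that match the whole query."""
    rx = pattern_to_regex(query)
    return [tok for tok in tokens if rx.fullmatch(tok)]

# Invented token list, for illustration only.
tokens = ["run", "runs", "running", "ran", "walk", "walks", "rerun"]

print(search(tokens, "run*"))      # ['run', 'runs', 'running']
print(search(tokens, "*run"))      # ['run', 'rerun']
print(search(tokens, "ran|walk"))  # ['ran', 'walk']
```

A query consisting of * alone matches every token in the list, which mirrors its use as "any token".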

17. How is the word occurrence [výskyt] pronounced? What about the
verb to occur [vyskytnout se]?

18. How is the noun query [korpusový dotaz, příkaz k vyhledávání]
pronounced?

19. What is CLAWS?


The Constituent Likelihood Automatic Word-tagging System, a tagger
often used to annotate English data with POS information. COCA is
annotated with the CLAWS7 tagset.
