COCA

Recap and Introduction 1
• So far, we have been working with AntConc, which we described as an
offline corpus processor.
• However, there are also online corpus processors that come with
built-in mega corpora of millions of words.
• Which is better: online or offline?
• Well, each has its own pros and cons.

• For the online ones:
• Pros: (1) save memory, (2) free, and (3) ready to use
• Cons: (1) restricted to certain functions and (2) limited to certain texts
Session 8 - Online Corpus Processors: The Case of COCA 1

Recap and Introduction 2
• For the offline ones:
• Pros: (1) work on any corpus of your choice, (2) free
• Cons: (1) consume your computer memory
• Since we have explored offline corpus processors, it is about time to
explore the online ones. We will learn lots of things from them as well.

COCA
• The Corpus of Contemporary American English (COCA) is a corpus
representing American English. It comprises 520 million words of text
(20 million words for each year from 1990 to 2015), divided equally
among spoken, fiction, popular magazine, newspaper, and academic
texts.

COCA: Signing Up
• To start, we need to sign up.

COCA: Searching for Single Words
• To search for a single word in COCA, all you need to do is to type your
search query in the search box.
• For example, typing ‘jump’, we get the following result:
• What does the figure 19,993 stand for?

COCA: Single Words and Raw Frequencies
• It is the raw frequency of the word ‘jump’ in COCA.
• It means that the word ‘jump’ occurs 19,993 times in the corpus.
• Put differently, it means that the word ‘jump’ has occurred in 19,993
contexts in the corpus.
• How do we get these 19,993 contexts? By clicking the word itself.
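To make the idea of raw frequency concrete, here is a minimal Python sketch that counts word forms in a toy text. The sample sentence and the function name raw_frequencies are invented for illustration; this is not how COCA computes its figures, only the same counting idea on a tiny scale.

```python
import re
from collections import Counter

def raw_frequencies(text):
    """Count how many times each lowercased word form occurs in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Toy sample, invented for this example.
sample = "They jump high. One jump was enough, but the dogs jump again."
freqs = raw_frequencies(sample)
print(freqs["jump"])  # raw frequency of 'jump' in the toy sample
```

Each of the counted occurrences corresponds to one context, which is why the raw frequency and the number of contexts are the same figure.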

COCA: Filtering by POS
• When we type ‘jump’ as a search query, the results will include all the
parts of speech of jump; i.e. jump as a verb and as a noun.
• What if we want the raw frequency of jump as a noun?
• You will need to use the ‘POS’ option to the right of the search box.
• Does this mean that the corpus is annotated or raw?
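Filtering by part of speech presupposes that every token carries a tag. The sketch below shows the idea on a small hand-tagged list; the tag names (NOUN, VERB, and so on) and the function count_word are assumptions made for this example and are not COCA's own tag set.

```python
def count_word(tagged_tokens, word, tag=None):
    """Count occurrences of word, optionally restricted to one part-of-speech tag."""
    return sum(
        1
        for token, token_tag in tagged_tokens
        if token == word and (tag is None or token_tag == tag)
    )

# Hand-tagged toy data, invented for illustration.
tagged = [
    ("the", "DET"), ("jump", "NOUN"), ("was", "VERB"), ("high", "ADJ"),
    ("they", "PRON"), ("jump", "VERB"), ("again", "ADV"),
    ("a", "DET"), ("long", "ADJ"), ("jump", "NOUN"),
]

print(count_word(tagged, "jump"))          # all parts of speech
print(count_word(tagged, "jump", "NOUN"))  # jump as a noun only
```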

COCA: Searching for Phrases
• You can search for phrases the same way you search for single words.
• For example, searching for ‘kick the bucket’ gives the following result:
• 24 is the raw frequency of the entire phrase.
• To view the contexts in which your query phrase is used, you click the
phrase itself.
• However, with phrases we can’t use the POS filter.

COCA: Searching with the Wildcard 1
• What if you want to search for kick the bucket, kicks the bucket,
kicked the bucket, and kicking the bucket?
• One option is to enter each phrase as a separate query. This is tedious.
• Another option is to use the asterisk or the wildcard (*) as in:
kick* the bucket
• The wildcard stands for anything: any string of characters that occurs in
the position of the wildcard.
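The behaviour of a query such as kick* the bucket can be approximated offline with a regular expression. The sketch below is only an approximation of the idea, not COCA's search engine; the function name query_to_regex and the sample text are invented, and the wildcard is assumed to cover letters only.

```python
import re

def query_to_regex(query):
    """Translate a wildcard query into a regex: '*' stands for any run of letters."""
    parts = []
    for token in query.split():
        if token == "*":
            # A free-standing wildcard stands for one whole word.
            parts.append(r"[a-z]+")
        else:
            # Escape the token, then turn the escaped '*' back into "any letters".
            parts.append(re.escape(token).replace(r"\*", r"[a-z]*"))
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)

text = ("He kicked the bucket. They kick the bucket. "
        "She kicks the ball. Kicking the bucket is an idiom.")
pattern = query_to_regex("kick* the bucket")
matches = [m.group(0).lower() for m in pattern.finditer(text)]
print(matches)
```

A single query therefore stands in for all the inflected variants that would otherwise have to be typed one by one.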

• If a wildcard means anything, then we can use it to know or identify
fixed and flexible expressions.
• Fixed expressions do not allow any other words to come in between.
For example, kick the bucket will always be the same; never kick the
big bucket or kick the last bucket.
• To make sure, try this query in COCA: kick the * bucket
• How about the expression ‘at first glance’? Is it fixed or flexible? Can
‘first’ be replaced by something else?
• To know, we can try ‘at * glance’. What did you get?

• Wildcards can also be used to do morphological searches.
• What if we want to know which words are used with the suffix
‘-icity’?
• To know the answer, we can use the wildcard as in *icity.
• Notice the difference between: *icity and *˽icity.
• What different results does each one of them give you?
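One way to think about the two queries, assuming that a space separates query slots (an assumption about the interface, to be checked against your own results): *icity is a single slot matching one word that ends in icity, whereas the version with a space is two slots, any word followed by the separate word icity. The sketch below illustrates that reading on a toy token list; match_query is a hypothetical helper.

```python
from fnmatch import fnmatchcase

def match_query(tokens, query):
    """Return every token sequence matching a space-separated wildcard query."""
    patterns = query.split()
    n = len(patterns)
    hits = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if all(fnmatchcase(tok, pat) for tok, pat in zip(window, patterns)):
            hits.append(" ".join(window))
    return hits

# Toy sentence, invented for illustration.
tokens = "the electricity bill showed the simplicity and publicity of the city".split()
print(match_query(tokens, "*icity"))   # single words ending in -icity
print(match_query(tokens, "* icity"))  # any word followed by the separate word 'icity'
```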

COCA: Searching for Parts of Speech
• What if we want to know the most frequent common noun in COCA?
• We can search for parts of speech using the tags provided in the interface.
• If we want the most frequent common noun in COCA, we can use the
following:
• Try it and see what the most frequent common noun is.

COCA: Searching for Lemmas
• Although wildcards can be used to find word derivations, they only
find affix-based derivations. But what about zero-affix derivations
such as ate (the past tense of eat)?
• To find all the derivations of a given lemma, including both
affix-based and zero-affix derivations, we can try the following:
• What is the result of your query?
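The limitation of wildcards described here can be seen in a small Python sketch. The lemma table LEMMA_FORMS is hand-written and hypothetical; a real corpus would supply this information through lemma annotation. The pattern eat* stands in for a wildcard query.

```python
from fnmatch import fnmatchcase

# Hypothetical, hand-written lemma table for illustration only.
LEMMA_FORMS = {"eat": {"eat", "eats", "eating", "eaten", "ate"}}

tokens = "she ate early they eat late he eats alone we were eating".split()

# A wildcard only finds forms that share the typed string.
wildcard_hits = [t for t in tokens if fnmatchcase(t, "eat*")]
# A lemma lookup also finds forms with a different shape, such as 'ate'.
lemma_hits = [t for t in tokens if t in LEMMA_FORMS["eat"]]

print(wildcard_hits)
print(lemma_hits)
```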

COCA: Searching for Synonyms
• We can search COCA for synonyms as well. To do so, all we need is
the following:
• Do you see something wrong in the results? How can we get better
results?
• Does this mean that COCA is semantically annotated?

COCA: Searching Genres and Periods of Time 1
• COCA is a general corpus with many genres including spoken, fiction,
magazine, newspaper, and academic genres.
• What if we want to know the frequency of Egypt in each genre?
• COCA also includes texts from different periods of time: 1990 – 2015.
• What if we want to know the frequency of Egypt in each period?
• The best way to do so is to use the chart option.

COCA: Searching Genres and Periods of Time 2
• There are three different numbers in the chart of Egypt:
• Freq. stands for raw frequency.
• Size (M) stands for the size of the texts in a given genre/period of
time.
• How about Per Mil?

COCA: Raw vs. Normalized Frequencies
• Per Mil is the normalized frequency per million.
• Raw frequency is the number of occurrences in the corpus.
• It does not always give an accurate idea about which word is more frequent.
• Hence, we typically use normalized frequency, which is calculated as
follows:
$$\text{Normalized Frequency}(w) = \frac{C(w)}{N} \times \text{common base}$$
where C(w) is the raw frequency of the given word, N is the total number of
words in the corpus, and the common base ranges from 10 to 1,000,000
depending on the size of the corpus.
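As a worked example, the Python sketch below takes normalized frequency to be C(w) divided by N, multiplied by the common base, and applies it to figures already quoted in these slides: the raw frequency of ‘jump’ (19,993) and the stated corpus size (520 million words). The function name and the second pair of counts are invented for illustration.

```python
def normalized_frequency(raw_count, corpus_size, base=1_000_000):
    """Return the frequency of a word per `base` words of corpus."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    # Same value as (raw_count / corpus_size) * base, written to limit rounding error.
    return raw_count * base / corpus_size

# 'jump' in COCA, using the figures quoted in the slides.
per_mil = normalized_frequency(19_993, 520_000_000)
print(round(per_mil, 2))

# Hypothetical counts: the word is rarer in raw terms in the small section,
# yet more frequent once section size is taken into account.
small_section = normalized_frequency(50, 1_000_000)
large_section = normalized_frequency(80, 4_000_000)
print(small_section, large_section)
```

The second comparison shows why raw counts can mislead when the texts being compared differ in size.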
COCA: Key Word In Context (KWIC) 1
• The Key Word In Context (KWIC) is the concordance function which
displays up to 1,000 random contexts of the query word.
• Two questions:
• What if I want to see more than 1,000 contexts?
• What is the difference between the KWIC and clicking the word frequency to
see the contexts?
• The main difference is that with the KWIC, we get the context with
the parts of speech encoded in colors.
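For readers who want to see what a concordance does under the hood, here is a minimal Python sketch of a KWIC display. It is a simplified illustration, not COCA's implementation: the function kwic, the window width, and the sample text are all invented, and no part-of-speech colouring is attempted.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each hit of keyword."""
    tokens = re.findall(r"\w+", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

sample = "Please remind me to call her. Those photos always remind him of home."
for left, node, right in kwic(sample, "remind", width=2):
    # Right-align the left context so the keyword lines up in one column.
    print(f"{left:>15} | {node} | {right}")
```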

COCA: Key Word In Context (KWIC) 2
• What do these colors stand for?

COCA: Comparing Words 1
• To differentiate near synonyms, COCA provides the ‘compare’ function.
• It displays the collocations of each word sorted by frequency.
• It also displays the ratio of the raw frequency of the first word to that of
the second word.
• For example, comparing ‘steady’ as an adjective to ‘stable’ as an
adjective yields the following table. The table shows the difference
that ‘stable’ means unchanged, while ‘steady’ is used

Comparing Words in BYU Corpora 2

Comparing Words in BYU Corpora 3
• The first line in the ‘steady’ table reads as follows:
• The raw frequency of word 1 – ‘steady’ – with ‘pace’ is 218. Yet, the raw
frequency of word 2 – ‘stable’ – with ‘pace’ is 0.
• The third column is the ratio of word 1 to word 2 and it reads as follows: there
are 436.0 times as many cases of steady pace as there are of stable pace.
• Remember: you can get more accurate results by adding the part of
speech to each of your queries.
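A ratio of 436.0 cannot come from dividing 218 by 0 directly. One reading that reproduces the figure, offered here as an assumption rather than as documented COCA behaviour, is that a zero count is replaced by 0.5 before dividing, since 218 / 0.5 = 436.0. The sketch below encodes that reading; compare_ratio and zero_substitute are hypothetical names.

```python
def compare_ratio(freq_w1, freq_w2, zero_substitute=0.5):
    """Ratio of two collocate frequencies, avoiding division by zero.

    Assumption: a zero count for word 2 is replaced by zero_substitute.
    """
    denominator = freq_w2 if freq_w2 > 0 else zero_substitute
    return freq_w1 / denominator

# 'steady pace' (218) against 'stable pace' (0), as in the table.
print(compare_ratio(218, 0))
```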

COCA: Finding Collocations 1
• To find the collocations of a given word, you can use the ‘collocate’
function.
• For example, the top collocations of remind_v* are:

• You can refine your collocation search by looking for collocations in a
specific part of speech.
• Suppose that we want to find the adverbial collocations of the verb
remind; then the query should look like:
• The top 3 adverbs collocating with remind as a verb are: just, how,
and also.

• In the ‘collocates’ function, there is the ribbon below. What does it stand for?
• This is the window size – i.e. the search space in which the engine tries to
find collocations. It is meant to find both adjacent and non-adjacent
collocations.
• Adjacent collocations are the ones that immediately precede or follow your
query word. They are usually inseparable. In this case, you need to set the
window size to ±1. Examples of adjacent collocations are at hand and kick
the bucket.
• Non-adjacent collocations are the ones that can be separated by one or more
words such as give up.
• Looking for the adjacent left-hand collocation of remind_v* yields:
• Looking for the adjacent right-hand collocation of remind_v* yields:
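The effect of the window size can be sketched in Python as follows. This is a simplified illustration on an invented token list, not the way COCA ranks collocates: the function collocates simply counts every word that falls within the chosen number of positions to the left and right of the node word.

```python
from collections import Counter

def collocates(tokens, node, left=1, right=1):
    """Count the words found within the window around each occurrence of node."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - left)
        window = tokens[start:i] + tokens[i + 1:i + 1 + right]
        counts.update(window)
    return counts

# Toy data, invented for illustration.
tokens = "please remind me to call and just remind me again so i can remind them".split()

left_adjacent = collocates(tokens, "remind", left=1, right=0)
right_adjacent = collocates(tokens, "remind", left=0, right=1)
print(left_adjacent.most_common())
print(right_adjacent.most_common())
```

Widening left or right beyond 1 is what allows non-adjacent collocations to be picked up as well.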
