
Recap and Introduction 1

• So far, we have been working with AntConc, which we described as an
offline corpus processor.

• However, there are also online corpus processors that come with
built-in mega corpora of millions of words.

• Which is better: online or offline?

• Well, each has its own pros and cons.


• For the online ones:
• Pros: (1) save memory, (2) free, and (3) ready to use
• Cons: (1) restricted to certain functions and (2) limited to certain texts

Session 8 - Online Corpus Processors: The Case of COCA 1


Recap and Introduction 2
• For the offline ones:
• Pros: (1) work on any corpus of your choice, (2) free
• Cons: (1) consume your computer memory

• Since we have explored offline corpus processors, it is about time to
explore the online ones. We will learn lots of things from them as well.



COCA
• The Corpus of Contemporary American English (COCA) is a corpus
representing American English. It comprises 520 million words of text
(20 million words for each year from 1990 to 2015), divided equally
among spoken, fiction, popular magazine, newspaper, and academic texts.



COCA: Signing Up

• To start, we need to sign up.



COCA: Searching for Single Words
• To search for a single word in COCA, all you need to do is to type your
search query in the search box.

• For example, typing ‘jump’, we get the following result:

• What does the figure 19,993 stand for?


COCA: Single Words and Raw Frequencies
• It is the raw frequency of the word ‘jump’ in COCA.

• It means that the word ‘jump’ occurs 19,993 times in the
corpus.

• Put differently, the word ‘jump’ appears in 19,993
contexts in the corpus.

• How do we get these 19,993 contexts? By clicking the word itself.
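
To make the idea of raw frequency concrete, here is a minimal Python sketch that counts a word in a short text. The sample sentence and the simple tokenizer are made up for illustration; COCA's own tokenization is not described in these slides and is likely to differ.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def raw_frequency(text, word):
    """Return how many times `word` occurs among the tokens of `text`."""
    return Counter(tokenize(text))[word.lower()]

# Made-up sample text (not taken from COCA).
sample = "Jump in. She made a big jump, and then they jump again."
print(raw_frequency(sample, "jump"))
```

Each of the counted occurrences corresponds to one context, which is why the raw frequency and the number of contexts are the same figure.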



COCA: Filtering by POS
• When we type ‘jump’ as a search query, the results will include all the
parts of speech of jump; i.e. jump as a verb and as a noun.

• What if we want the raw frequency of jump as a noun?

• You will need to use the ‘POS’ option to the right of the search box.

• Does this mean that the corpus is annotated or raw?


COCA: Searching for Phrases
• You can search for phrases the same way you search for single words.

• For example, if you search for ‘kick the bucket’, the result shows a frequency of 24.

• This 24 is the raw frequency of the entire phrase.

• To view the contexts in which your query phrase is used, you click the
phrase itself.

• However, with phrases we can’t use the POS filter.


COCA: Searching with the Wildcard 1
• What if you want to search for all of the following: kick the bucket, kicks the
bucket, kicked the bucket, and kicking the bucket?

• One option is to enter each phrase as a separate query. This is tedious.

• Another option is to use the asterisk, also called the wildcard (*), as in:

kick* the bucket

• The wildcard stands for anything that occupies its position: attached to a
word (kick*), it matches any sequence of characters; on its own (*), it
matches any single word.
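
The wildcard behaviour described above can be imitated offline with regular expressions. The following is only a sketch: the function name `wildcard_to_regex` and the sample text are made up, and this is not how COCA implements its search.

```python
import re

def wildcard_to_regex(query):
    """Turn a query such as 'kick* the bucket' into a compiled regex.

    A '*' attached to a word matches any run of letters; a token that
    consists only of '*' matches exactly one word.
    """
    parts = []
    for token in query.split():
        if token == "*":
            parts.append(r"[A-Za-z']+")
        else:
            # Escape the token, then re-enable '*' as "any letters".
            parts.append(re.escape(token).replace(r"\*", r"[A-Za-z']*"))
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)

# Made-up sample text (not taken from COCA).
sample = "He kicked the bucket. They kick the bucket; nobody kicks the big bucket."
matches = wildcard_to_regex("kick* the bucket").findall(sample)
print(matches)
```

With this sample, the query finds the first two forms, while 'kicks the big bucket' is not matched because an extra word intervenes.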



COCA: Searching with the Wildcard 2
• If a wildcard means anything, then we can use it to distinguish
fixed from flexible expressions.

• Fixed expressions do not allow any other words to come in between.
For example, kick the bucket will always be the same; never kick the
big bucket or kick the last bucket.

• To make sure, try this query in COCA: kick the * bucket

• How about the expression ‘at first glance’? Is it fixed or flexible? Can
‘first’ be replaced by something else?

• To find out, we can try ‘at * glance’. What did you get?
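
One way to think about the fixed/flexible test is to count which words fill the wildcard slot. The sketch below does this on a made-up sample; the helper name `slot_fillers` is hypothetical, and real answers should come from running the query in COCA.

```python
import re
from collections import Counter

def slot_fillers(text, before, after):
    """Count the single words that appear between `before` and `after`."""
    pattern = re.compile(
        r"\b" + re.escape(before) + r"\s+([a-z']+)\s+" + re.escape(after) + r"\b",
        re.IGNORECASE,
    )
    return Counter(word.lower() for word in pattern.findall(text))

# Made-up sample text (not taken from COCA).
sample = ("At first glance it looked fine. At a glance, nothing moved. "
          "At first glance, he seemed calm.")
fillers = slot_fillers(sample, "at", "glance")
print(fillers.most_common())
```

If the slot is filled by several different words, the expression is flexible at that position; if nothing or only one word ever appears there, it behaves like a fixed expression.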


COCA: Searching with the Wildcard 3
• Wildcards can also be used to do morphological searches.

• What if we want to know which words are used with the suffix ‘-icity’?

• To find out, we can use the wildcard as in *icity.

• Notice the difference between *icity and *˽icity (where ˽ marks a space).

• What different results does each one of them give you?
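
As a rough offline analogue of the *icity query, the sketch below lists the words in a text that end in a given suffix. The function name and the sample sentence are made up for illustration.

```python
import re
from collections import Counter

def words_with_suffix(text, suffix):
    """Count the words in `text` that end with `suffix` (and are longer than it)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens
                   if t.endswith(suffix) and len(t) > len(suffix))

# Made-up sample text (not taken from COCA).
sample = "Electricity and publicity matter; authenticity beats publicity."
hits = words_with_suffix(sample, "icity")
print(hits.most_common())
```

This mirrors *icity, where the wildcard is attached to the suffix and so matches within a single word.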



COCA: Searching for Parts of Speech
• What if we want to know the most frequent common noun in COCA?

• We can search for parts of speech using the tags provided in the interface.

• If we want the most frequent common noun in COCA, we can use the
following:

• Try it and see what the most frequent common noun is.


COCA: Searching for Lemmas
• Although wildcards can be used to find word derivations, they only
find affix-based derivations. But what about zero-affix derivations
such as ate (from eat)?

• To find all the derivations of a given lemma, including both affix-based
and zero-affix derivations, we can try the following:

• What is the result of your query?


COCA: Searching for Synonyms

• We can search COCA for synonyms as well. To do so, all we need is the following:

• Do you see something wrong in the results? How can we get better results?

• Does this mean that COCA is semantically annotated?
COCA: Searching Genres and Periods of Time 1
• COCA is a general corpus with many genres including spoken, fiction,
magazine, newspaper, and academic genres.

• What if we want to know the frequency of Egypt in each genre?

• COCA also includes texts from different periods of time: 1990 – 2015.

• What if we want to know the frequency of Egypt in each period?

• The best way to do so is to use the chart option.



COCA: Searching Genres and Periods of Time 2
• There are three different numbers in the chart for Egypt:
• Freq. stands for raw frequency.
• Size (M) stands for the size, in millions of words, of the texts in a given
genre/period of time.
• How about Per Mil?



COCA: Raw vs. Normalized Frequencies
• Per Mil is the normalized frequency per million.

• Raw frequency is the number of occurrences in the corpus.

• On its own, it does not always give an accurate idea of which word is more
frequent, because the sections being compared may differ in size.

• Hence, we typically use normalized frequency, which is calculated as
follows:

$$\text{Normalized Frequency}(w) = \frac{C(w)}{N} \times \text{common base}$$

where C(w) is the raw frequency of the given word, N is the total number of
words in the corpus, and the common base ranges from 10 to 1,000,000
depending on the size of the corpus.
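
The calculation can be sketched in a few lines of Python. The example uses the figures quoted earlier in these slides (19,993 occurrences of ‘jump’ in a corpus of 520 million words) and a common base of one million; the function name is made up for illustration.

```python
def normalized_frequency(raw_count, corpus_size, base=1_000_000):
    """Return the frequency of a word per `base` words of corpus.

    raw_count   -- C(w), the raw frequency of the word
    corpus_size -- N, the total number of words in the corpus
    base        -- the common base (one million by default)
    """
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count / corpus_size * base

# Figures quoted in these slides: 'jump' occurs 19,993 times in 520 million words.
per_million = normalized_frequency(19_993, 520_000_000)
print(round(per_million, 2))  # 38.45
```

Because the result is expressed per million words, it can be compared across genres or periods whose sizes differ.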
COCA: Key Word In Context (KWIC) 1
• The Key Word In Context (KWIC) display is the concordance function, which
displays up to 1,000 random contexts of the query word.

• Two questions:

• What if I want to see more than 1,000 contexts?

• What is the difference between the KWIC and clicking the word frequency to
see the contexts?

• The main difference is that with the KWIC, we get the context with
the parts of speech encoded in colors.
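
For readers who want to see what a concordance does under the hood, here is a minimal KWIC sketch. It only aligns the keyword with a few words of left and right context; it does not reproduce COCA's color coding of parts of speech, and the sample sentence is made up.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence.

    `width` is the number of words shown on each side of the keyword.
    """
    tokens = re.findall(r"[A-Za-z']+", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Made-up sample text (not taken from COCA).
sample = "The cat will jump over the wall and then jump back down again."
lines = kwic(sample, "jump")
for left, key, right in lines:
    # Right-align the left context so the keyword lines up in one column.
    print(f"{left:>20}  [{key}]  {right}")
```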



COCA: Key Word In Context (KWIC) 2

• What do these colors stand for?



COCA: Comparing Words 1
• To differentiate near synonyms, COCA uses the ‘compare’ function.

• It displays the collocations of each word sorted by frequency.

• It also displays the ratio of the raw frequency of the first word to that of the second word.

• For example, comparing ‘steady’ as an adjective to ‘stable’ as an
adjective yields the following table. The table shows the difference
that ‘stable’ means unchanged, while ‘steady’ is used



Comparing Words in BYU Corpora 2



Comparing Words in BYU Corpora 3
• The first line in the ‘steady’ table reads as follows:
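• The raw frequency of word 1 – ‘steady’ – with ‘pace’ is 218. Yet, the raw
frequency of word 2 – ‘stable’ – with ‘pace’ is 0.

• The third column is the ratio of word 1 to word 2, and it reads as follows: there
are 436.0 times as many cases of steady pace as there are of stable pace. Since
218 cannot be divided by 0, the figure 436.0 corresponds to 218 ÷ 0.5, which
suggests that a count of 0 is treated as 0.5 when the ratio is computed.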
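
The ratio in the third column can be sketched as follows. The substitution of 0.5 for a zero count is an assumption inferred from the figures on this slide (218 and 0 giving 436.0), not a documented rule quoted in the slides, and the function name is made up.

```python
def frequency_ratio(freq_w1, freq_w2, zero_substitute=0.5):
    """Return how many times more often word 1 co-occurs than word 2.

    Assumption: a zero count for word 2 is replaced by `zero_substitute`
    so that the division is defined.
    """
    denominator = freq_w2 if freq_w2 > 0 else zero_substitute
    return freq_w1 / denominator

# Figures from the 'steady' vs 'stable' table: 'pace' occurs 218 times vs 0 times.
print(frequency_ratio(218, 0))  # 436.0
```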

• Remember: you can always guarantee more accurate results by adding
the part of speech to each of your queries.



COCA: Finding Collocations 1
• To find the collocations of a given word, you can use the ‘collocate’
function.

• For example, the top collocations of remind_v* are:



COCA: Finding Collocations 2
• You can refine your collocation search by looking for collocations in a
specific part of speech.

• Suppose that we want to find the adverbial collocations of the verb
remind. Then the query should look like:

• The top 3 adverbs collocating with remind as a verb are: just, how, and also.



COCA: Finding Collocations 3
• In the ‘collocates’ function, there is the ribbon below. What does it stand for?

• This is the window size – i.e. the search space in which the engine tries to
find collocations. It is meant to find both adjacent and non-adjacent
collocations.

• Adjacent collocations are the ones that immediately precede or follow your
query word. They are usually inseparable. In this case, you need to set the
window size to ±1. Examples of adjacent collocations are at hand and kick
the bucket.

• Non-adjacent collocations are the ones that can be separated by one or more
words such as give up.
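
The effect of the window size can be illustrated with a small sketch that counts the words found within a chosen number of positions to the left and right of a query word. The function name and the sample text are made up; the sketch ignores sentence boundaries and parts of speech, which a real collocation search would take into account.

```python
import re
from collections import Counter

def collocates(text, node, left=1, right=1):
    """Count the words within `left`/`right` positions of each `node` occurrence."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node.lower():
            continue
        counts.update(tokens[max(0, i - left):i])      # left-hand window
        counts.update(tokens[i + 1:i + 1 + right])     # right-hand window
    return counts

# Made-up sample text (not taken from COCA).
sample = "Please remind me later. Just remind him gently. They always remind me."
right_only = collocates(sample, "remind", left=0, right=1)
left_only = collocates(sample, "remind", left=1, right=0)
print(right_only.most_common())
print(left_only.most_common())
```

Setting both sides to 1 corresponds to the ±1 window used for adjacent collocations; larger values widen the search space to catch non-adjacent ones.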
COCA: Finding Collocations 4

• Looking for the adjacent left-hand collocation of remind_v* yields:

• Looking for the adjacent right-hand collocation of remind_v* yields:
