Ready Made Corpus

Off-the-Shelf Corpora 1
• Last week, we saw examples of existing corpora. These corpora are
typically referred to as ready-made or off-the-shelf corpora.
For example: COCA, Film Corpus, Wikipedia Corpus
• They come with many advantages:
• They are mostly free, so there are no financial obstacles.
• Copyright issues are typically settled.
• They are ready-made, so no time is spent on corpus compilation.
• They have handy Web interfaces, so you can search them and extract whatever
type of information you want.
Session 4 - Ready-Made and Custom-Made Corpora 1
Off-the-Shelf Corpora 2
• Although ready-made corpora have many advantages, they have one
main disadvantage. What is it?
• They are not customizable. You either take them or leave them. What
if you can’t find what you want? In that case, you’ll have to build your
own corpus.
For example, Film Corpus (2015)

Building Your Own Corpora 1
• The first step is to look for an electronic version of your data.
• One place to start if you want to build your own corpus is the text
archives. What are these?
• These are Web-based repositories of fiction and non-fiction texts that are
scanned from the original sources or typed in by volunteers. The texts
can usually be retrieved in a variety of formats. (Which format should
you seek? Why?)
• Plain text, of course.
• The two most famous and trusted text archives are:
• Project Gutenberg
• Oxford Text Archive
Using Project Gutenberg
• You study or will be studying legal translation, so you might be interested in
seeing how the terms you are studying are used in legal texts written by native
speakers.
• You might be interested in comparing British and American legal texts to
see whether they use the terms in the same way.
• Project Gutenberg can help you with that.

• From the left-hand menu, click Bookshelves
• Browse the results and find the Law Bookshelf
• There you will find two bookshelves: British law and American law
• Click the bookshelves and you will find several texts in multiple formats: HTML, Plain
Text UTF-8, EPUB, and Kindle.
• Choose the Plain Text UTF-8 format.
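Once two plain text files are available, the comparison of British and American usage described above can be sketched in a few lines of Python. This is only a minimal illustration: the function name term_frequency and the two sample strings are made up for the example (they are placeholders, not quotations from Project Gutenberg texts), and a real study would load the downloaded files instead.

```python
import re

def term_frequency(text, term):
    """Return (occurrences of term, total number of words) for a text."""
    # Lowercase the text and split it into simple alphabetic word tokens.
    words = re.findall(r"[a-z']+", text.lower())
    hits = sum(1 for word in words if word == term.lower())
    return hits, len(words)

# Invented placeholder sentences standing in for the downloaded files.
british_sample = "The claimant served notice. The claimant then applied to the court."
american_sample = "The plaintiff filed a complaint. The court heard the plaintiff."

for label, sample in [("British", british_sample), ("American", american_sample)]:
    for term in ["claimant", "plaintiff"]:
        hits, total = term_frequency(sample, term)
        print(f"{label}: '{term}' occurs {hits} time(s) in {total} words")
```

With real files, each sample string would be replaced by the contents of a file opened with the UTF-8 encoding.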
UTF-8 Encoding 1
• We know that we should be using the Plain Text format, but what does
UTF-8 stand for?
• UTF-8 (Unicode Transformation Format, 8-bit) is a text encoding. Words and
sentences consist of characters like the Latin character ‘ô’ and the Arabic
character ‘ب’.
• Computers do not read characters the way we humans do.
• Instead, characters of different languages are grouped together in what
is known as character sets or repertoires.
• To refer to characters in an unambiguous way within each set, each
character is associated with a code point. For example, the dollar sign
$ is U+0024.
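The link between a character and its code point can be checked in Python. The short sketch below uses only the built-in functions ord and chr; the helper name code_point is chosen for this example. The two non-ASCII characters are written as escapes so that the example does not depend on how this file itself is saved.

```python
def code_point(ch):
    """Return the code point of a single character in U+XXXX notation."""
    return f"U+{ord(ch):04X}"

dollar = "$"
o_circumflex = "\u00f4"   # the Latin character ô
arabic_beh = "\u0628"     # the Arabic character ب

for ch in [dollar, o_circumflex, arabic_beh]:
    print(ch, code_point(ch))

# chr goes the other way: from a code point number back to the character.
print(chr(0x0024))
```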
UTF-8 Encoding 2
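• UTF-8 is an encoding of the Unicode character set, which assigns a code
point to the characters and signs of virtually every written language.
UTF-8 stores each code point as a sequence of one to four bytes.
• Therefore, when you save your corpus as a UTF-8 file, you can open
and read it on virtually any computer and with most corpus analysis software.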
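Saving and reopening a UTF-8 file can be sketched as follows. This is a minimal illustration, assuming a throwaway file in a temporary folder; the file name corpus_sample.txt and the sample string are invented for the example.

```python
import tempfile
from pathlib import Path

# A short sample mixing Latin, Arabic, and a sign: "rôle ب $".
text = "r\u00f4le \u0628 $"

# Encoding turns the characters into bytes. Plain Latin letters, spaces, and
# "$" take one byte each in UTF-8, while "ô" and "ب" take two bytes each.
encoded = text.encode("utf-8")
print(len(text), "characters,", len(encoded), "bytes")

# Write the text to a file as UTF-8, then read it back with the same encoding.
with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "corpus_sample.txt"
    path.write_text(text, encoding="utf-8")
    restored = path.read_text(encoding="utf-8")

print(restored == text)
```

Naming the encoding explicitly when reading and writing avoids depending on the default encoding of the computer in use.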

Getting Offline Data 1
• Sometimes, the data you need is only available as hard-copy books. What can you do
then? Remember, you need your files to be machine-readable.
• One way to overcome this is to (1) scan the books and then (2) use Optical Character
Recognition (OCR) software to convert the scanned PDFs into text files.
• OCR programs are highly effective if the texts are typed.
• There are many commercial OCR programs, but there are freeware ones too, such as:
• SimpleOCR
• TopOCR
• FreeOCR
• Because OCR programs are automatic, they are never 100% accurate. So how
can we choose the best OCR software for our research?
Getting Offline Data 2
• To select the best available OCR program, you need to calculate the accuracy rate
of each one on the same text. How do you do so?
1. Pick a page from the file you want to convert to Plain Text.
2. Count how many words in total are on the page.
3. Upload the page to each of the aforementioned OCR programs.
4. For each OCR program, count how many errors were found.
5. Subtract the number of errors from the total number of words to get the
number of correctly recognized words.
6. Divide the number of correctly recognized words by the total number of
words.
7. Pick the OCR program with the highest accuracy rate.
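The calculation in steps 5 to 7 can be sketched in Python. The word total and the error counts below are hypothetical numbers chosen only to show the arithmetic, and "Program A", "Program B", and "Program C" are placeholder names, not results for any real OCR program.

```python
def ocr_accuracy(total_words, errors):
    """Accuracy rate = (total words - errors) / total words."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    if not 0 <= errors <= total_words:
        raise ValueError("errors must be between 0 and total_words")
    correct_words = total_words - errors   # step 5
    return correct_words / total_words     # step 6

# Hypothetical counts for one test page of 300 words.
total_words = 300
error_counts = {"Program A": 12, "Program B": 30, "Program C": 21}

rates = {name: ocr_accuracy(total_words, errors)
         for name, errors in error_counts.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")

# Step 7: pick the program with the highest accuracy rate.
best = max(rates, key=rates.get)
print("Best:", best)
```

For example, 12 errors on a 300-word page leave 288 correct words, and 288 / 300 = 0.96, an accuracy rate of 96%.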
