Ready Made Corpus
• They have handy Web interfaces so that you can search them and extract whatever
type of information you want.
Session 4 - Ready-Made and Custom-Made Corpora 1
Off-the-Shelf Corpora 2
• Although ready-made corpora have many advantages, they have one
main disadvantage. What is it?
• They are not customizable: you either take them or leave them. What
if you can’t find what you want? In that case, you’ll have to build your
own corpus.
• One place to start if you want to build your own corpus is a text
archive. What are text archives?
• They are Web-based repositories of fiction and non-fiction texts that have been
scanned from the original sources or typed in by volunteers. The texts
can usually be retrieved in a variety of formats. (Which format should
you seek? Why?)
• Plain text, of course
• The two most famous and trusted text archives are:
• Project Gutenberg
• Oxford Text Archive
Using Project Gutenberg
• You study or will be studying legal translation. So you might be interested in
seeing how the terms you are studying are used in legal texts written by native
speakers.
• You might also be interested in comparing British and American legal texts to
see whether they use the terms in the same way.
• Therefore, when you save your corpus as a UTF-8 file, you can open
and read it on virtually any computer and with most corpus analysis software.
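A minimal sketch of saving a corpus as a UTF-8 plain-text file, using only the Python standard library. The function names and the file names in the usage note are hypothetical and chosen for illustration. `convert_to_utf8` assumes you already know the encoding of the source file.

```python
from pathlib import Path


def save_corpus_utf8(text: str, path: str) -> None:
    """Write the corpus text to disk as a UTF-8 plain-text file."""
    Path(path).write_text(text, encoding="utf-8")


def load_corpus_utf8(path: str) -> str:
    """Read a UTF-8 plain-text corpus file back into a string."""
    return Path(path).read_text(encoding="utf-8")


def convert_to_utf8(src: str, dst: str, src_encoding: str) -> None:
    """Re-save a text file that uses another encoding as UTF-8.

    src_encoding is an assumption supplied by the caller (for example
    "cp1252"); it is not detected automatically.
    """
    text = Path(src).read_text(encoding=src_encoding)
    save_corpus_utf8(text, dst)
```

For example, `convert_to_utf8("legal_old.txt", "legal_utf8.txt", "cp1252")` would re-save a Windows-1252 file as UTF-8, assuming a file with that name and encoding exists.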
• One way to overcome this is to (1) scan the books and then (2) use Optical Character
Recognition (OCR) software to convert the scanned PDFs into text files.
• OCR programs are highly effective when the texts are typed.
• There are many commercial OCR programs, but there are also freeware ones, such as:
• SimpleOCR
• TopOCR
• FreeOCR
• Because OCR programs are automatic, they are never 100% accurate. So how
can we choose the best OCR software for our research?
Getting Offline Data 2
• To select the best available OCR program, calculate the accuracy rate
of each one on the same text. How?
1. Pick a page from the file you want to convert to plain text.
2. Count the total number of words on the page.
3. Run the page through each of the aforementioned OCR programs.
4. For each OCR program, count the errors (incorrectly recognized words) in its output.
5. Subtract the number of errors from the total number of words to get the
number of correctly recognized words.
6. Divide the number of correctly recognized words by the total number of
words to get the accuracy rate.
7. Pick the OCR program with the highest accuracy rate.
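The arithmetic in steps 5 to 7 can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the program labels and error counts in the usage note are made-up placeholders, not measurements of any real OCR program.

```python
def accuracy_rate(total_words: int, errors: int) -> float:
    """Return the share of correctly recognized words (steps 5 and 6)."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    if not 0 <= errors <= total_words:
        raise ValueError("errors must be between 0 and total_words")
    correct = total_words - errors   # step 5: subtract the errors
    return correct / total_words     # step 6: divide by the total


def best_ocr(total_words: int, errors_by_program: dict) -> tuple:
    """Return (best program, all accuracy rates) for one test page (step 7)."""
    rates = {
        name: accuracy_rate(total_words, errors)
        for name, errors in errors_by_program.items()
    }
    best = max(rates, key=rates.get)  # program with the highest rate
    return best, rates
```

For a page of 250 words, calling `best_ocr(250, {"ocr_a": 10, "ocr_b": 25, "ocr_c": 5})` gives accuracy rates of 0.96, 0.90, and 0.98, so the placeholder program `ocr_c` would be chosen. Multiply a rate by 100 to express it as a percentage.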