How To Create An Online Corpus

Abstract:

The aim of this paper is to provide a step-by-step guideline for beginners on building a
corpus. Building a corpus used to be a very time-consuming part of linguistic research, but
it is now much easier, because a great deal of retrievable electronic text can be found on
the web.
Building a corpus still requires a number of factors to be taken into consideration. This
paper describes the factors that are important to know when making written and spoken
corpora.

Introduction:
Here we explain some steps that can be followed to create a spoken or written corpus.
A number of software tools are available for analyzing a corpus; some are freeware and
some must be paid for.

AntConc http://www.antlab.sci.waseda.ac.jp/software.html
Wordsmith http://lexically.net/
Monoconc http://www.monoconc.com/
CasualConc https://sites.google.com/site/casualconc/
Wmatrix http://ucrel.lancs.ac.uk/wmatrix/
SketchEngine http://www.sketchengine.co.uk/

How to create an online corpus:


1. Introduction
The Sketch Engine is a leading corpus tool. It has been widely used in
lexicography. It is now ten years since its launch (Kilgarriff et al. 2004).

To create a corpus in the interface, log in and go to the home page (if you are already logged in, you can get
to this page by clicking Home at the top right of the screen).
The five sections on this page describe:

1. the process of creating a corpus in the Sketch Engine interface
2. adding data to the corpus by uploading a file
3. adding data to the corpus using WebBootCaT
4. creating parallel (bilingual) corpora
5. the functions that are available on a user (your) corpus, including compiling it for Sketch Engine

1- Creating a corpus:
Near the top of the left-hand side menu, click Create corpus and fill in the fields in the form.
After clicking Create, an empty corpus is created and you are taken directly to the next step: adding files.
If you wish to add data using WebBootCaT instead, click Cancel and then Add data from web using
WebBootCaT in the corpus screen.

2- Add a file:
If you select Add new file, you have the options to

• upload a file from your computer
• download from a URL

When you have created a corpus, many tools are available to you in the left-hand side panel. Select
the corpus by clicking on its name on the home page; under the Corpus heading in the left-hand side
menu you can:

1. Add new file
2. Add web data (BootCaT)
3. Compile corpus

How to make a corpus for freeware software:
1 Loading files and the AntConc interface
AntConc works only on plain-text files with the file extension .txt (e.g. Hamlet.txt).
It will read XML files that are saved as .txt files. AntConc will not read .doc, .docx, or .pdf files.
You will need to convert these into .txt files.
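
As a rough illustration of the point above, the following Python sketch lists the files in a folder that do not yet have the .txt extension and so would need converting before loading. The function name files_to_convert and the folder layout are assumptions made for this example; they are not part of AntConc.

```python
from pathlib import Path


def files_to_convert(folder):
    """Return the sorted names of files in `folder` that lack the .txt extension.

    Hypothetical helper: it only inspects file names, it does not convert anything.
    """
    return sorted(
        p.name
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() != ".txt"
    )


if __name__ == "__main__":
    # "my_corpus" is a placeholder folder name; replace it with your own.
    corpus_dir = Path("my_corpus")
    if corpus_dir.is_dir():
        for name in files_to_convert(corpus_dir):
            print("Needs converting to .txt:", name)
```

The check is on the extension only, so a file that is named .txt but is not really plain text would still need to be opened and inspected by hand.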

1.1 Making a plain-text file.


Open http://news.stv.tv/, choose a news article (doesn’t matter which one, as long as it is
primarily text). Highlight all text in the article (header, byline, etc), and right-click “copy”.
Open MSWord and select ‘paste’.
Delete any non-textual objects, such as images: we are preparing for text analysis, so we only
want to retain the text. You probably want to delete their footer, as well:

Feedback: We want your feedback on our site. If you’ve got questions, spotted an inaccuracy or just want to share some ideas about our news service, please email us on web@stv.tv.

Download: The STV News app is Scotland’s favourite and is available for iPhone from the Apple store and for Android from Google Play. Download it today and continue to enjoy STV News wherever you are.

Join in: For debate, chat, comment and more, join our communities on the STV News Facebook page or follow @STVNews on Twitter.

Instead of saving it as a .doc or .docx file, we’re going to save it as a .txt file to the desktop.
File > Save as > [article title], but in the drop down menu labeled SAVE AS TYPE we’re going to
choose the file type “Plain Text (.txt)”.
This will give you a warning: “Saving as a text file will cause all formatting, pictures, and objects
in your file to be lost.” You also get an option for Text Encoding. Select Other Encoding:
“Unicode (UTF-8)” and click OK.
Go to the desktop and check that the file has been saved. (Depending on some
settings, it might appear as ‘Cameron.txt’, or just as ‘Cameron’.)

To be safe, make sure every file is saved with the .txt suffix! Each file you want to use in your
corpus must be a plain text file for AntConc to use it. You can open the file in Notepad to see
what it looks like.
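
For readers who prefer to script this step, here is a minimal Python sketch of the same idea: save the copied text as UTF-8, make sure the name ends in .txt, and check that the saved file really decodes as UTF-8. The function names save_as_plain_text and is_utf8_text are hypothetical and chosen only for this example.

```python
from pathlib import Path


def save_as_plain_text(text, path):
    """Write `text` as UTF-8, adding the .txt suffix if it is missing."""
    path = Path(path)
    if path.suffix.lower() != ".txt":
        path = path.with_name(path.name + ".txt")
    path.write_text(text, encoding="utf-8")
    return path


def is_utf8_text(path):
    """Return True if the bytes of the file decode cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    # The article text here is a placeholder for whatever you copied.
    saved = save_as_plain_text("Pasted article text goes here.", "Cameron")
    print(saved.name, is_utf8_text(saved))
```

Appending the suffix rather than replacing it means that a name such as ‘Cameron’ becomes ‘Cameron.txt’, which matches the advice above about always saving with the .txt suffix.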
Further reading on corpus construction:

Biber, D. (1993). “Representativeness in Corpus Design”. Literary and Linguistic Computing, 8 (4): 243-257. http://llc.oxfordjournals.org/content/8/4/243.abstract

Wynne, M. (ed.) (2005). Developing Linguistic Corpora: a Guide to Good Practice. Oxford: Oxbow Books. http://www.ahds.ac.uk/creating/guides/linguistic-corpora/

Getting AntConc:

Go to http://www.laurenceanthony.net/software/antconc/releases/AntConc324/ and
download the file Antconc.exe (for PCs). In Internet Explorer, you will be asked whether you
want to RUN or SAVE the file; select RUN, and it will go directly to the security warning.
In Firefox, select SAVE FILE, then RUN the downloaded file.
You want to run the software, so click RUN in the security warning dialog box.

Getting Started:
When AntConc launches, a window on the left-hand side lists all the corpus files that are loaded.

There are 7 tabs in the centre:

Concordance: shows what is known as a Keyword in Context (KWIC) view.
Concordance Plot: shows a very simple visualization of your KWIC search, in which each
instance is represented as a little black line, positioned from the beginning to the end of each
file containing the search term.
File View: shows a full file, for larger context of a result.
Clusters: shows words which very frequently appear together.
Collocates: whereas clusters show words which definitely appear together in a corpus,
collocates show words which are statistically likely to appear together.
Word List: lists all the words in your corpus.
Keyword List: shows comparisons between two corpora.
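
To make the ideas behind two of these tabs concrete, here is a small Python sketch of a Keyword in Context view and a frequency-sorted word list. It is only an illustration of the concepts, not AntConc's own implementation; the tokenizer, the function names and the sample sentence are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (a crude assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every hit of `keyword`."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


def word_list(tokens):
    """Return (word, frequency) pairs, most frequent first."""
    return Counter(tokens).most_common()


if __name__ == "__main__":
    # Made-up sample text, standing in for a loaded corpus file.
    sample = "The quality of the sound is good. The quality of the data matters."
    tokens = tokenize(sample)
    for left, key, right in kwic(tokens, "quality", width=2):
        print(f"{left:>15} | {key} | {right}")
    print(word_list(tokens)[:3])
```

A real concordancer also handles punctuation, case options and sorting of the context columns, which this sketch leaves out.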

Further help and resources are available, linked 1/3 of the way down on the software page,
after Citing/Referencing AntConc. Here’s a selection:
• https://groups.google.com/forum/#!forum/antconc
• AntConc3.2.0 Help
• AntConc3.1.2 Help
• Various video tutorials (https://www.youtube.com/user/AntlabJPN)
• #corpusMOOC / https://www.futurelearn.com/courses/corpus-linguistics (running
again September 2014)

Search Operators
The * operator (zero or more characters) can help, for instance, to find both the singular and
the plural forms of nouns.

Example: search for qualit*, then sort the results. What tends to precede and
follow quality and qualities? (Note that quality* would not find qualities, because the plural
does not begin with the letters quality.) For a full list of available wildcard operators and what
they mean, go to Global Settings > Wildcard Settings.

What’s the difference between * and ?

Search for th*n and th?n. What do these two search queries tell us?

More specifically: searching with the ? operator (any single character)

wom?n – both women and woman
m?n – man and men, but also min
Contrast with m*n: not helpful, because you’ll also get mean, melon etc.
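
The difference between the two operators can be tried out with a short Python sketch that translates them into regular expressions. This is an approximation written for this example, not AntConc's own matching code: it assumes that * stands for zero or more word characters and ? for exactly one, and the function names are hypothetical.

```python
import re


def wildcard_to_regex(pattern):
    """Translate * (zero or more characters) and ? (exactly one) into a regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(r"\w*")
        elif ch == "?":
            parts.append(r"\w")
        else:
            parts.append(re.escape(ch))
    # \b anchors keep the match to whole words only.
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)


def search(pattern, text):
    """Return every whole word in `text` that matches the wildcard pattern."""
    return wildcard_to_regex(pattern).findall(text)


if __name__ == "__main__":
    text = "then thin than thicken woman women man men min mean melon"
    for query in ["th?n", "th*n", "wom?n", "m?n", "m*n"]:
        print(query, "->", search(query, text))
```

Running it on the sample words shows why m*n is the less useful query: it accepts any number of letters between m and n, so it picks up mean and melon as well as man and men.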
Books and Scholarly Journals related to How to design a CORPUS:
Corpora http://www.euppublishing.com/journal/cor
ICAME http://icame.uib.no/journal.html
IJCL https://benjamins.com/#catalog/journals/ijcl
Literary and Linguistic Computing http://llc.oxfordjournals.org/

Biber, Douglas (1993). “Representativeness in Corpus Design”. Literary and Linguistic Computing, 8 (4): 243-257. http://llc.oxfordjournals.org/content/8/4/243.abstract

As examples of written corpora, there are some compiled corpora:

Xiao, Z. (2009). Well-Known and Influential Corpora, A Survey http://www.lancaster.ac.uk/staff/xiaoz/papers/corpus%20survey.htm, based on Xiao
(2009), “Theory-driven corpus research: using corpora to inform aspect theory”. In A. Lüdeling
& M. Kyto (eds) Corpus Linguistics: An International Handbook [Volume 2]. Berlin: Mouton de
Gruyter. 987-1007.
Various Historical Corpora http://www.helsinki.fi/varieng/CoRD/corpora/index.html
Oxford Text Archive http://ota.ahds.ac.uk/
Linguistic Data Consortium http://catalog.ldc.upenn.edu/
CQPWeb, a front end to various corpora https://cqpweb.lancs.ac.uk/
BYU Corpora http://corpus.byu.edu/
NLTK Corpora http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml

Spoken corpus:
Having said that we can also create a spoken corpus, the question arises of how to make one.
There are five easy steps to making a spoken corpus, presented by Cambridge University students.

1. Check your device

Before you make any recordings, check that your equipment works. Most mobiles come with a recording
function; if yours does not, there are many free apps available.

2. Choose your location


You can record any conversation anywhere. We need to be able to hear the conversation clearly so
please choose a place without too much background noise.

3. Select your speakers


Anyone can be part of your conversation, as long as you get their permission. Everyone needs to
sign a Speaker Consent Form before you can record. We are interested in capturing a wide range of
language, so the more different speakers involved in your recordings the better! Please try not to
record any more than 5 speakers at a time.

4. Record
Each recording should be between 10 minutes and 2 hours long. If you're worried that the conversation
might dry up, you could always think about some topics beforehand.

5. Upload and send


Once you have your recordings you can submit them following the guidelines below:

a. Rename your recordings according to the following convention: BNC[insert initials]001 and
so on. For example, Maggie Jane Smith's recordings would be: BNCMJS001.
b. Fill in the Recording Information Sheet for each recording you submit.
c. Send us all your audio files with matching Recording Information Sheets
to corpus@cambridge.org. We recommend WeTransfer for file sharing.
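
The naming convention and the limits mentioned in the steps above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the sketch assumes that the initials are the first letter of each part of the speaker's name, and it says nothing about the audio file format, which the guidelines do not specify.

```python
def recording_name(full_name, number):
    """Build a name such as BNCMJS001 from a full name and a running number."""
    initials = "".join(part[0].upper() for part in full_name.split())
    return f"BNC{initials}{number:03d}"


def check_recording(minutes, speakers):
    """Return a list of problems with a recording (an empty list means none found)."""
    problems = []
    if not 10 <= minutes <= 120:
        problems.append("length should be between 10 minutes and 2 hours")
    if speakers > 5:
        problems.append("no more than 5 speakers at a time")
    return problems


if __name__ == "__main__":
    print(recording_name("Maggie Jane Smith", 1))
    print(recording_name("Maggie Jane Smith", 2))
    print(check_recording(minutes=45, speakers=3))
    print(check_recording(minutes=5, speakers=6))
```

Zero-padding the number to three digits keeps the recordings in order when the files are sorted by name.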
