
Lab/Assignment

Step 1:
Download the data set, [data] (sentences from the BNC Baby corpus), from Blackboard.
The data set is UTF-8 encoded and contains 332,938 English sentences sampled from
the British National Corpus (BNC). It consists of four genre-based subsets of 1M
words each (academic, fiction, newspaper, and conversation), for a total of 4M
words. The BNC is a snapshot of British English in the early 1990s and is
considered a general (not restricted to any subject field), monolingual (British
English), and mixed (spoken and written) corpus.
Format of this data set: SentNum[TAB]English sentence, with a space as the word
delimiter
For example, the first three sentences are:
1	Latest corporate unbundler reveals laid-back approach : Roland Franklin , who is leading a 697m pound break-up bid for DRG , talks to Frank Kane
2	By FRANK KANE
3	IT SEEMS that Roland Franklin , the latest unbundler to appear in the UK , has made a fatal error in the preparation of his 697m break-up bid for stationery and packaging group DRG .
Please note that each line of this data set contains exactly one sentence and ends
with \n.
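
The line format described above can be read with a few lines of Python. This is only a minimal sketch, not a required solution: the function names parse_lines and read_corpus are hypothetical, and it assumes every non-empty line has exactly the form SentNum[TAB]sentence.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_lines(lines: Iterable[str]) -> Iterator[Tuple[int, List[str]]]:
    """Yield (sentence_number, tokens) for each 'SentNum<TAB>sentence' line."""
    for line in lines:
        line = line.rstrip("\n")          # every line ends with \n
        if not line:
            continue                      # skip empty lines, if any
        num, _, sentence = line.partition("\t")
        # split() breaks on whitespace; the data set uses single spaces
        yield int(num), sentence.split()


def read_corpus(path: str) -> List[Tuple[int, List[str]]]:
    """Read the whole file; the path is whatever name you saved the data under."""
    with open(path, encoding="utf-8") as handle:   # the data set is UTF-8
        return list(parse_lines(handle))


# The second example sentence from the handout, as it would appear in the file.
parsed = list(parse_lines(["2\tBy FRANK KANE\n"]))
print(parsed)
```

Note that punctuation marks such as "," and ":" are already separated by spaces in the data, so they come out as tokens of their own.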
Step 2:
Find the sentences (specifically, the sentence numbers) containing each of the following words:
(1-1) priceless
(1-2) valuable
(1-3) absent
(2-1) specialist/specialists
(2-2) expert/experts
(2-3) courage
Take the word priceless as an example: it appears in sentences {2264, 11734,
12634, ...}. For cases (2-1) and (2-2), the plural forms (specialists and experts)
also count. For example, if specialist appears in sentences {A,B,C,D} and
specialists appears in sentences {C,D,E}, your program should output that
specialist/specialists appears in sentences {A,B,C,D,E}, the union of the two
sets.
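
The union rule above can be sketched as follows. The function name sentences_containing and the four sample lines are made up for illustration. The sketch also assumes case-insensitive matching on whole tokens, which the assignment does not specify; drop the lower() calls if exact case is wanted.

```python
from typing import Iterable, List, Set


def sentences_containing(lines: Iterable[str], forms: Set[str]) -> List[int]:
    """Return sorted sentence numbers whose tokens include any of the forms."""
    targets = {form.lower() for form in forms}   # assumption: ignore case
    hits = set()
    for line in lines:
        num, sep, sentence = line.rstrip("\n").partition("\t")
        if not sep:
            continue                             # not a SentNum<TAB>sentence line
        tokens = {tok.lower() for tok in sentence.split()}
        if tokens & targets:                     # any form present in the sentence
            hits.add(int(num))                   # a set gives the union for free
    return sorted(hits)


# Hypothetical sample lines, not taken from the corpus.
sample = [
    "10\tA specialist was called in .\n",
    "11\tThe specialists and a specialist agreed .\n",
    "12\tTwo specialists disagreed .\n",
    "13\tNobody showed up .\n",
]
print(sentences_containing(sample, {"specialist"}))                 # [10, 11]
print(sentences_containing(sample, {"specialists"}))                # [11, 12]
print(sentences_containing(sample, {"specialist", "specialists"}))  # [10, 11, 12]
```

Matching on whole tokens rather than substrings keeps a query for expert from also hitting words such as expertise.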

This lab is due at 11:59pm on Nov. 1st, 2016. Please turn in your program as well
as the lists of sentence numbers containing the word of (1-1), the word of (1-2),
..., and the word of (2-3). Taking the word priceless as an example, your program
should output that the sentence numbers containing the word are {2264, 11734,
12634, ...}.
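
One possible way to produce all six lists in a single pass over the file is an inverted index that maps each word to the set of sentence numbers it occurs in. This is a sketch under assumptions, not the expected solution: the names QUERIES, build_index, format_set, and report are hypothetical, matching is assumed to be case-insensitive on whole tokens, and the exact output layout is a guess based on the {…} notation used above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set

# Case labels and word forms as listed in Step 2.
QUERIES = {
    "(1-1)": ["priceless"],
    "(1-2)": ["valuable"],
    "(1-3)": ["absent"],
    "(2-1)": ["specialist", "specialists"],
    "(2-2)": ["expert", "experts"],
    "(2-3)": ["courage"],
}


def build_index(lines: Iterable[str]) -> Dict[str, Set[int]]:
    """Map each lowercased token to the set of sentence numbers containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for line in lines:
        num, sep, sentence = line.rstrip("\n").partition("\t")
        if not sep:
            continue
        for tok in sentence.split():
            index[tok.lower()].add(int(num))
    return index


def format_set(numbers: Set[int]) -> str:
    """Render sentence numbers in ascending order as {a, b, c}."""
    return "{" + ", ".join(str(n) for n in sorted(numbers)) + "}"


def report(index: Dict[str, Set[int]]) -> List[str]:
    """Build one output line per case, taking the union over its word forms."""
    out = []
    for label, forms in QUERIES.items():
        hits = set().union(*(index.get(form, set()) for form in forms))
        out.append(f"{label} {'/'.join(forms)}: {format_set(hits)}")
    return out


# Hypothetical sample lines, not taken from the corpus.
sample = ["1\tAn expert spoke .\n", "2\tExperts showed courage .\n"]
for row in report(build_index(sample)):
    print(row)
```

With the real data, the sample list would be replaced by the opened file, for example open(path, encoding="utf-8"), where path is wherever the downloaded data set was saved.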
