
Spring End Semester Examination 2023

6th Semester B.Tech.


Natural Language Processing (IT 3035)
Evaluation Scheme

Q1. (a) Write a regular expression that matches a string that has the letter 'a' followed by
anything, ending in the letter 'b'.
Answer: a.*?b$

(b) Illustrate with suitable examples how Lemmatization differs from Stemming in NLP.
Answer: Stemming is a process that removes the last few characters of a word, often producing forms with incorrect meanings or spellings. Stemming is used for large datasets where performance is an issue. Lemmatization considers the context and converts the word to its meaningful base form, called the lemma. Lemmatization is computationally expensive since it involves look-up tables and other sophisticated tools. For instance, stemming the word 'Caring' may return 'Car' (which can cause an incorrect interpretation of the word being processed), whereas lemmatizing the same word returns 'Care'.
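A minimal sketch of this contrast, assuming NLTK is installed and its WordNet data has been downloaded (nltk.download('wordnet')); the stemmer clips suffixes by rule and may output a non-word, while the lemmatizer returns a dictionary form.

```python
# Contrast a rule-based stemmer with a dictionary-based lemmatizer (assumes NLTK + WordNet data).
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["caring", "studies", "meeting"]:
    # The stemmer chops suffixes by rule and may produce a non-word (e.g. "studies" -> "studi"),
    # while the lemmatizer maps to a valid dictionary form (e.g. "studies" -> "study",
    # "caring" as a verb -> "care").
    print(word, "| stem:", stemmer.stem(word), "| lemma:", lemmatizer.lemmatize(word, pos="v"))
```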

(c) Define Perplexity measure in NLP.


Answer: In information theory, perplexity is a measurement of how well a probability
distribution or probability model predicts a sample. In the context of Statistical Natural
Language Processing, perplexity is one way to evaluate language models. The perplexity PP
of a discrete probability distribution p is defined as:
PP(p) = 2^{H(p)} = ∏_x p(x)^{-p(x)}
In the above expression, H(p) is the entropy (in bits) of the distribution and x ranges over
events.
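A small numeric sketch of the definition above for an arbitrary example distribution; it computes PP(p) both as 2 raised to the entropy and via the product form, and the two agree.

```python
# Perplexity of a discrete distribution p: PP(p) = 2^H(p) = prod_x p(x)^(-p(x)).
import math

p = {"a": 0.5, "b": 0.25, "c": 0.25}          # an arbitrary example distribution

entropy = -sum(px * math.log2(px) for px in p.values())   # H(p) in bits
perplexity = 2 ** entropy
perplexity_prod = math.prod(px ** (-px) for px in p.values())  # equivalent product form

print(entropy, perplexity, perplexity_prod)    # 1.5 bits, perplexity ~2.828 in both forms
```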

(d) Sentence: He lifted the beetle with red cap.


The type of the ambiguity present in the above sentence:
(i) Syntactic
(ii) Semantic
(iii) Pragmatic
(iv) None of the above
Justify your answer.
Answer: (i) Syntactic Ambiguity
Justification: There are two possible parses of the sentence. In one parse, the prepositional phrase "with red cap" attaches to the verb "lifted" (he used a red cap to lift the beetle); in the other, it attaches to the noun "beetle" (the beetle has a red cap).

(e) Explain syntactic constituency with suitable example.


Answer: In syntactic analysis, a constituent is a word or a group of words that functions as a single unit within a hierarchical structure. Many constituents are phrases. A phrase is a sequence of one or more words (in some theories, two or more) built around a head lexical item and working as a unit within a sentence.
Example Sentence: The woman with red hair won the first prize.
Here, "The woman with red hair" is a constituent: it is a noun phrase built around the head noun "woman" and functions as a single unit (the subject of the sentence).

(f) Named Entity Recognition is a sub task of --------- that seeks to locate and classify named
entities in texts.
(i) Information Retrieval
(ii) Information Extraction
(iii) Text Classification
(iv) None of the above
Answer: (ii) Information Extraction

(g) The difference H(X) - H(X|Y) is ---------


(i) Relative Entropy
(ii) Mutual Information
(iii) Joint Entropy
(iv) None of the above
All symbols have their usual meanings.
Answer: (ii) Mutual Information

(h) A sequence of words that occur together unusually often is ---------


(i) Concordance
(ii) Phrase Structure
(iii) Collocation
(iv) None of the above
Answer: (iii) Collocation

(i) Explain the “Feast or Famine” problem in connection with the task of Information
Retrieval.
Answer: The “Feast or Famine” problem arises when Boolean queries are used for Information Retrieval. Boolean queries often return either too few (≈0) or too many (thousands or even millions of) results, and it usually takes a lot of skill to formulate a query that produces a manageable number of hits.

(j) Pick the Odd One Out from the following:


(i) Naive Bayes classifier
(ii) Probabilistic Context-Free Grammars
(iii) Conditional Random Field
(iv) Hidden Markov Model
Justify your answer.
Answer: (iii) Conditional Random Field
Justification: A Conditional Random Field is a discriminative machine learning model, whereas the other three are generative models.

Q2. (a) Explain the benefits of eliminating stop words. Give examples for situations in which
elimination of stop words may be harmful.
Answer: Stop words occur in abundance in any human language. They usually carry little meaningful information, and since their frequencies are very high, removing them from the corpus yields a much smaller dataset and faster computations on the text. By removing these words, we remove low-level information from the text in order to give more focus to the important information. In other words, for many tasks the removal of such words has no negative consequence on the model we train.
Stop-word removal can, however, have an adverse effect. Whether stop words should be removed depends heavily on the task being performed and the goal we want to achieve. In applications such as Part-of-Speech (PoS) tagging, Named Entity Recognition (NER), and parsing, we should not remove stop words; rather, we should preserve them, as they provide grammatical information needed by those applications. Some examples where removal is harmful:
Suppose we are training a model that performs sentiment analysis on movie reviews. If the original review is “The movie was not good at all.”, the text after removal of stop words would be “movie good”. The original review is clearly negative, but after the removal of stop words it appears positive, which does not reflect reality.
Consider another scenario where we search for “To be or not to be” in a search application that removes stop words. The query would completely miss the legendary work of Shakespeare by that name.
Stop words can also help to disambiguate a search query, allowing the user to get more accurate results. If we search for “notebook without DVD drive” and stop words are removed, the word “without” is dropped and the results become irrelevant.
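A minimal sketch of the sentiment example above, assuming NLTK with its 'stopwords' corpus downloaded; NLTK's default English stop-word list contains "not", so the negation is lost.

```python
# Removing stop words can drop the negation "not" and flip the apparent sentiment.
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))   # includes "not" in NLTK's default list

review = "The movie was not good at all"
kept = [w for w in review.lower().split() if w not in stop_words]
print(kept)   # ['movie', 'good'] -- the negative review now looks positive
```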

Evaluation Scheme: 1 Mark for benefits and 3 marks for harmfulness examples.

(b) List the problems associated with N-gram language model. Explain how these problems
are handled.
Answer: Problems associated with N-gram language model:
(i) The N-gram model, like many statistical models, is significantly dependent on the training
corpus.
(ii) The performance of the N-gram model varies with the change in the value of N.
(iii) Since any corpus is limited, some perfectly acceptable English word sequences are bound to be missing from it and therefore have zero probability for the corresponding N-gram. As a result, the N-gram matrix for any training corpus is bound to contain a substantial number of putative “zero-probability N-grams”.
(iv) A similar problem is that of unknown or Out-Of-Vocabulary (OOV) words: words that do not appear in the training set but do appear in the test set.

Solving the problems associated with the N-gram model:


(1) Make the corpus as generic as possible by including texts from a wide variety of sources. This brings the MLE values significantly closer to the true probabilities and also reduces the chance of encountering unknown or out-of-vocabulary words.
(2) Increasing the value of N enhances the power of the language model, but also increases the computational cost significantly. Practical studies also show that increasing N beyond four does not significantly improve the power of the model.
(3) To deal with the zeros of the N-gram matrices and with unknown words, we apply smoothing strategies. The simplest is Laplace Add-1 smoothing, which is easy to implement but not very effective. Many other smoothing strategies are quite effective: Good-Turing smoothing, Kneser-Ney smoothing, etc.
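A minimal sketch of remedy (3), Laplace Add-1 smoothing for bigram probabilities over a tiny made-up corpus; an unseen bigram receives a small non-zero probability instead of zero.

```python
# Add-1 smoothed bigram probability: P(w_i | w_{i-1}) = (count(w_{i-1}, w_i) + 1) / (count(w_{i-1}) + V)
from collections import Counter

corpus = ["<s> i like nlp </s>", "<s> i like deep learning </s>"]   # toy training corpus

unigrams = Counter(w for sent in corpus for w in sent.split())
bigrams = Counter()
for sent in corpus:
    ws = sent.split()
    bigrams.update(zip(ws, ws[1:]))
V = len(unigrams)                                 # vocabulary size

def p_add1(prev, word):
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(p_add1("i", "like"))     # seen bigram
print(p_add1("i", "hate"))     # unseen bigram, still gets a non-zero probability
```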

Evaluation Scheme: 2 Marks for problems and 2 marks for remedies to them.

Q3. (a) Explain Top Down Parsing and Bottom Up Parsing with example.
Answer:
Top-Down Parsing: A parsing strategy that starts at the highest level of the parse tree (the start symbol) and works down the tree using the rules of the grammar. Productions are used to derive strings, and a derivation is abandoned when it cannot match the input; parsing terminates when the derived string matches the input. This technique corresponds to a leftmost derivation. For example, to parse "the dog sat", a top-down parser starts with S, expands S -> NP VP, then NP -> Det Noun, and so on until the leaves match the input words.
Bottom-Up Parsing: A parsing strategy that starts at the lowest level of the parse tree (the words) and works up the tree using the rules of the grammar. It can be viewed as an attempt to reduce the input string to the start symbol of the grammar. This technique corresponds to a rightmost derivation in reverse. The main decision is when to apply a production rule so as to reduce the string to the start symbol. For the same sentence, a bottom-up parser first reduces "the" to Det and "dog" to Noun, then Det Noun to NP, "sat" to Verb and then VP, and finally NP VP to S.
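A minimal sketch contrasting the two strategies, assuming NLTK; the toy grammar and sentence are made up for illustration. RecursiveDescentParser parses top-down from S, while ShiftReduceParser parses bottom-up from the words.

```python
# Top-down vs bottom-up parsing of a toy grammar (assumes NLTK).
import nltk

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det Noun
    VP -> Verb
    Det -> 'the'
    Noun -> 'dog'
    Verb -> 'sat'
""")

sentence = "the dog sat".split()

top_down = nltk.RecursiveDescentParser(grammar)    # expands S downwards to match the input
bottom_up = nltk.ShiftReduceParser(grammar)        # reduces the input upwards to S

for tree in top_down.parse(sentence):
    print("top-down:", tree)
for tree in bottom_up.parse(sentence):
    print("bottom-up:", tree)
```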

Evaluation Scheme: 2 marks for Top-Down, 2 marks for Bottom-Up. In each case, 1 mark for the explanation and 1 mark for a suitable example.

(b) Define Probabilistic Context-Free Grammar (PCFG). List down the disadvantages of PCFG.
Answer:
Definition of PCFG: A probabilistic context-free grammar G can be defined by a quintuple: G
= (M, T, R, S, P), where
⚫ M is the set of non-terminal symbols
⚫ T is the set of terminal symbols
⚫ R is the set of production rules
⚫ S is the start symbol
⚫ P is the set of probabilities on production rules
Limitation of PCFG:
(i) PCFGs do not take lexical information into account, making parse plausibility less than
ideal and making PCFGs worse than n-grams as a language model.
(ii) PCFGs have certain biases; e.g., the probability of a smaller tree is generally greater than that of a larger tree.
(iii) When two different analyses use the same set of rules, they have the same probability.
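A minimal sketch of a PCFG, assuming NLTK; the grammar and its probabilities are made up for illustration. The probabilities of rules sharing a left-hand side sum to 1, and ViterbiParser returns the most probable parse.

```python
# A toy PCFG G = (M, T, R, S, P): each rule carries a probability (assumes NLTK).
import nltk

pcfg = nltk.PCFG.fromstring("""
    S -> NP VP      [1.0]
    NP -> Det Noun  [0.7]
    NP -> NP PP     [0.3]
    VP -> Verb NP   [0.6]
    VP -> VP PP     [0.4]
    PP -> Prep NP   [1.0]
    Det -> 'the'    [1.0]
    Noun -> 'dog'   [0.5]
    Noun -> 'park'  [0.5]
    Verb -> 'saw'   [1.0]
    Prep -> 'in'    [1.0]
""")

parser = nltk.ViterbiParser(pcfg)          # returns the most probable parse
for tree in parser.parse("the dog saw the dog in the park".split()):
    print(tree.prob(), tree)
```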

Evaluation: 2 marks for definition, 2 marks for limitation.

Q4. (a) “Naive Bayes’ Classifier is a linear classifier in logarithmic space” --- Justify the above
statement with proper reasoning.
Answer: Naive Bayes is a probabilistic classifier: for a document d, out of all classes c ∈ C, the classifier returns the class ĉ that has the maximum posterior probability given the document. That is:
ĉ = argmax_{c∈C} P(c|d)
Applying Bayes' rule and dropping the denominator P(d), which is constant across classes, this reduces to:
ĉ = argmax_{c∈C} P(d|c) · P(c)
If the document d is represented by a set of features F = {f1, ..., fn}, then the above reduces to:
ĉ = argmax_{c∈C} P(f1, ..., fn | c) · P(c)
Under the bag-of-words assumption and the Naive Bayes (conditional independence) assumption, the expression reduces to:
ĉ = argmax_{c∈C} P(c) ∏_{f∈F} P(f|c)
In a natural language processing task, the features are usually words:
c_NB = argmax_{c∈C} P(c) ∏_{i ∈ word positions} P(w_i|c)
Taking logarithms (the logarithm is a monotonically increasing function, so the argmax is unchanged):
c_NB = argmax_{c∈C} [ log P(c) + ∑_{i ∈ word positions} log P(w_i|c) ]
The expression inside the argmax is a linear function in log space, which justifies the given statement.
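A minimal sketch of the linear form above: in log space the Naive Bayes score of a class is log P(c) plus a weighted sum of word counts, i.e. a linear function of the document's features. The parameter values below are made up purely for illustration.

```python
# Naive Bayes scoring in log space: score(c) = log P(c) + sum_w count(w, d) * log P(w|c)
import math
from collections import Counter

def nb_score(doc_words, log_prior, log_likelihood):
    counts = Counter(doc_words)
    return log_prior + sum(n * log_likelihood[w] for w, n in counts.items())

# Hypothetical (made-up) parameters for one class, just to show the linear form:
log_prior = math.log(0.5)
log_likelihood = {"good": math.log(0.2), "movie": math.log(0.3), "bad": math.log(0.05)}

print(nb_score(["good", "movie", "good"], log_prior, log_likelihood))
```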
(b) In the table below you are given the bag of words in four training documents and
their labels (spam or not spam).
Doc   Bag of Words                                Label
1     {price, weight, loss, vitamin, discount}    Spam
2     {vitamin, weight, discount, sad}            Spam
3     {loss, sad, sorry}                          Non-spam
4     {weight, foundation, price}                 Non-spam
Consider further a test document whose bag of words is as follows: {weight, price,
discount, foundation, loss}.
Find the label for the above test document using a Naive Bayes’ document classifier
that uses add-1 smoothing for zeros.
Answer: V = {price, weight, loss, vitamin, discount, sad, sorry, foundation}
Prior from Training:
P(Spam) = P(Non-spam) = 1/2
Likelihoods from training (with add-1 smoothing):
P(w_i | c) = (count(w_i, c) + 1) / (∑_{w∈V} count(w, c) + |V|)

P(weight | Spam)     = (2+1)/(9+8) = 3/17      P(weight | Non-spam)     = (1+1)/(6+8) = 2/14
P(price | Spam)      = (1+1)/(9+8) = 2/17      P(price | Non-spam)      = (1+1)/(6+8) = 2/14
P(discount | Spam)   = (2+1)/(9+8) = 3/17      P(discount | Non-spam)   = (0+1)/(6+8) = 1/14
P(foundation | Spam) = (0+1)/(9+8) = 1/17      P(foundation | Non-spam) = (1+1)/(6+8) = 2/14
P(loss | Spam)       = (1+1)/(9+8) = 2/17      P(loss | Non-spam)       = (1+1)/(6+8) = 2/14

Scoring the test set:


P(Spam) · P(Test | Spam)         = (1/2) × (3 × 2 × 3 × 1 × 2) / 17^5 = 18/17^5 ≈ 1.2677 × 10^-5
P(Non-spam) · P(Test | Non-spam) = (1/2) × (2 × 2 × 1 × 2 × 2) / 14^5 = 8/14^5  ≈ 1.4875 × 10^-5

Hence, the test document is Non-spam.
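A short sketch that re-derives the numbers above from the training documents using add-1 smoothing; it reproduces the two scores and the Non-spam decision.

```python
# Recompute the add-1 smoothed Naive Bayes scores for the test document.
vocab = {"price", "weight", "loss", "vitamin", "discount", "sad", "sorry", "foundation"}

spam_words = ["price", "weight", "loss", "vitamin", "discount",      # doc 1
              "vitamin", "weight", "discount", "sad"]                # doc 2
nonspam_words = ["loss", "sad", "sorry",                             # doc 3
                 "weight", "foundation", "price"]                    # doc 4

def likelihood(word, class_words):
    return (class_words.count(word) + 1) / (len(class_words) + len(vocab))

test = ["weight", "price", "discount", "foundation", "loss"]
prior = 0.5                                       # P(Spam) = P(Non-spam) = 1/2

score_spam = prior
score_nonspam = prior
for w in test:
    score_spam *= likelihood(w, spam_words)
    score_nonspam *= likelihood(w, nonspam_words)

print(score_spam, score_nonspam)                              # ~1.27e-05 vs ~1.49e-05
print("Spam" if score_spam > score_nonspam else "Non-spam")   # Non-spam
```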

Q5. (a) Write a short note on rule-based POS tagging. Discuss its advantages and disadvantages.
Answer:
Rule-based POS Tagging: One of the oldest techniques of tagging is rule-based POS tagging. Rule-based taggers use a dictionary or lexicon to obtain the possible tags for each word. If a word has more than one possible tag, the tagger uses hand-written rules to identify the correct one. Disambiguation is performed by analysing the linguistic features of a word along with those of its preceding and following words. For example, if the preceding word is an article, then the word in question is likely a noun. As the name suggests, all such information in rule-based POS tagging is coded in the form of rules. These rules may be either (i) context-pattern rules, or (ii) regular expressions compiled into finite-state automata and intersected with a lexically ambiguous sentence representation. Rule-based POS tagging follows a two-stage architecture. In the first stage, it uses a dictionary to assign each word a list of potential parts of speech. In the second stage, it uses large lists of hand-written disambiguation rules to narrow the list down to a single part of speech for each word.

Advantages of Rule-based POS tagging:


(i) Relatively simple to implement and so, often used as a starting point for more complex
machine learning-based taggers.
(ii) It can be fast.
(iii) It can capture some linguistic regularities and exceptions.
(iv) Since the rules are built manually, these taggers often yield high precision results.

Disadvantages of Rule-based POS tagging:


(i) It requires a lot of manual effort and domain knowledge to create and maintain the rules, which can be time-consuming and error-prone.
(ii) Since it requires many person-hours of development, it often has a high development cost.
(iii) It cannot handle unseen or ambiguous words, such as neologisms, slang, or homographs, that do not match any rule.
(iv) It is not adaptable to different genres, domains, or languages, as the rules may not generalize well across different contexts; such taggers are therefore not easily transportable.
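A minimal sketch of the rule-based idea, assuming NLTK; the hand-written regular-expression patterns below are illustrative only and play the role of the disambiguation rules discussed above.

```python
# A tiny rule-based tagger: hand-written regular-expression rules assign tags (assumes NLTK).
import nltk

patterns = [
    (r'^(the|a|an)$', 'DT'),      # articles
    (r'.*ing$', 'VBG'),           # gerunds
    (r'.*ed$', 'VBD'),            # simple past
    (r'.*s$', 'NNS'),             # plural nouns
    (r'^[0-9]+$', 'CD'),          # numbers
    (r'.*', 'NN'),                # default: noun
]

tagger = nltk.RegexpTagger(patterns)
print(tagger.tag("the dogs chased a running cat".split()))
```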

Evaluation Scheme: 2 marks for Short Note, 1 mark for advantages, 1 mark for
disadvantages.

(b) Consider the following two matrices in connection with an HMM model for POS
tagging:
Transition probability matrix:
        AB      PN      PP      VB      EOS
AB      1/11    1/10    1/12    1/11    1/25
PN      1/11    1/11    1/11    1/10    1/14
PP      1/11    1/12    1/12    1/10    1/16
VB      1/13    1/11    1/12    1/14    1/18
EOS     1/11    1/10    1/10    1/13    1/15

Emission probability matrix:
        she     got     up
AB      1/25    1/25    1/14
PN      1/13    1/25    1/25
PP      1/25    1/25    1/13
VB      1/25    1/14    1/19
BOS: Beginning of sentence; EOS: End of Sentence
Find the best POS tag sequence for the sentence: “she got up” following Viterbi
algorithm.

Evaluation Scheme: The transition probabilities from BOS (Beginning of Sentence) are missing from the given table, so the problem cannot be solved as stated. Any honest attempt at this question should therefore be awarded full credit (4 marks).
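Since the BOS row is missing, the specific tables cannot be used as given; the sketch below is a generic Viterbi decoder for an HMM POS tagger, with small made-up probabilities (not taken from the question's tables) only to show the algorithm.

```python
# Generic Viterbi decoding for an HMM tagger.
def viterbi(words, tags, start_p, trans_p, emit_p):
    # v[t][tag]: probability of the best tag sequence ending in `tag` at position t
    v = [{tag: start_p[tag] * emit_p[tag][words[0]] for tag in tags}]
    back = [{}]
    for t in range(1, len(words)):
        v.append({})
        back.append({})
        for tag in tags:
            best_prev = max(tags, key=lambda p: v[t - 1][p] * trans_p[p][tag])
            v[t][tag] = v[t - 1][best_prev] * trans_p[best_prev][tag] * emit_p[tag][words[t]]
            back[t][tag] = best_prev
    # Follow back-pointers from the best final tag
    last = max(tags, key=lambda tag: v[-1][tag])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

# Made-up example values (NOT from the question's tables):
tags = ["PN", "VB", "PP"]
start_p = {"PN": 0.6, "VB": 0.2, "PP": 0.2}
trans_p = {"PN": {"PN": 0.1, "VB": 0.7, "PP": 0.2},
           "VB": {"PN": 0.2, "VB": 0.1, "PP": 0.7},
           "PP": {"PN": 0.4, "VB": 0.3, "PP": 0.3}}
emit_p = {"PN": {"she": 0.5, "got": 0.01, "up": 0.01},
          "VB": {"she": 0.01, "got": 0.5, "up": 0.1},
          "PP": {"she": 0.01, "got": 0.01, "up": 0.5}}

print(viterbi("she got up".split(), tags, start_p, trans_p, emit_p))  # ['PN', 'VB', 'PP']
```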
Q6. (a) Consider the following tiny phrase structure treebank, consisting of just two
trees. Read off all rules whose left hand sides are either NP or VP and estimate their
rule probabilities using maximum likelihood estimation.

Answer: Probabilities associated with rules starting with NP:


NP --> PRP (2/7)
NP --> NP PP (1/7)
NP --> DT NN (2/7)
NP --> DET NN (2/7)
Probabilities associated with rules starting with VP:
VP --> VB NP (1/2)
VP --> VB NP PP (1/2)
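A small sketch of the maximum likelihood estimate used above, P(A → β) = count(A → β) / count(A), applied to the NP rule counts read off the treebank (2 + 1 + 2 + 2 = 7 NP expansions in total).

```python
# MLE rule probabilities from rule counts: P(A -> beta) = count(A -> beta) / count(A).
from collections import Counter

np_rule_counts = Counter({"NP -> PRP": 2, "NP -> NP PP": 1, "NP -> DT NN": 2, "NP -> DET NN": 2})
total_np = sum(np_rule_counts.values())          # 7 NP expansions in the treebank

for rule, count in np_rule_counts.items():
    print(rule, count / total_np)                # 2/7, 1/7, 2/7, 2/7
```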

Evaluation Scheme: 2 marks for probabilities of NP rules, 2 marks for probabilities of VP rules.

(b) Consider a corpus of English containing approximately 560 million tokens. In this
corpus we have the counts of unigrams and bigrams as per the table below. Estimate
Prob(snow) and Prob(snow | white) using maximum likelihood estimation without
smoothing.
         snow      white     white snow    purple    purple snow
Count    38,186    256,091   122           11,218    0

Answer:
P(snow) = count(snow) / N = 38,186 / (560 × 10^6) = 6.8189 × 10^-5
P(snow | white) = count(white snow) / count(white) = 122 / 256,091 = 4.7639 × 10^-4
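A quick numeric check of the two estimates above.

```python
# Unsmoothed MLE estimates from the corpus counts.
n_tokens = 560e6
print(38186 / n_tokens)      # P(snow)        ~ 6.82e-05
print(122 / 256091)          # P(snow|white)  ~ 4.76e-04
```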
Evaluation Scheme: 2 marks for P(snow) and 2 marks for P(snow|white).

Q7. (a) Assuming the grammar below construct the parse tree for the sentence: “the
big yellow dog sat under the house”
S --> NP VP ; VP --> VP PP ; VP --> Verb NP ;
VP --> Verb ; NP --> Det NOM; NOM --> Adj NOM ;
NOM --> Noun ; PP --> Prep NP ; Det --> the ;
Adj --> big ; Adj --> yellow ; Noun --> dog ;
Noun --> house ; Verb --> sat ; Prep --> under
Answer: Following the given grammar, the parse tree for the sentence (in bracketed form) is:
[S [NP [Det the] [NOM [Adj big] [NOM [Adj yellow] [NOM [Noun dog]]]]]
   [VP [VP [Verb sat]] [PP [Prep under] [NP [Det the] [NOM [Noun house]]]]]]
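A minimal sketch, assuming NLTK, that encodes the given grammar and prints the parse tree for the sentence; the printed tree matches the bracketed structure above.

```python
# Parse the sentence with the given grammar (assumes NLTK).
import nltk

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    VP -> VP PP | Verb NP | Verb
    NP -> Det NOM
    NOM -> Adj NOM | Noun
    PP -> Prep NP
    Det -> 'the'
    Adj -> 'big' | 'yellow'
    Noun -> 'dog' | 'house'
    Verb -> 'sat'
    Prep -> 'under'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the big yellow dog sat under the house".split()):
    print(tree)
```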

(b) Find the Minimum Edit Distance between the words “TEACHER” and
“STUDENT” following a dynamic programming based algorithm.
Answer: Following the Minimum Edit Distance (Levenshtein) algorithm, we fill an (|TEACHER|+1) × (|STUDENT|+1) table of subproblem distances and read the result from the bottom-right cell; see the sketch below.
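A minimal dynamic-programming sketch of this computation. With unit insertion, deletion, and substitution costs (plain Levenshtein) the distance between TEACHER and STUDENT is 7; under the common textbook convention that charges 2 for a substitution, it comes out as 10.

```python
# Minimum edit distance via dynamic programming (Levenshtein by default).
def min_edit_distance(src, tgt, sub_cost=1):
    n, m = len(src), len(tgt)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i                                   # deletions
    for j in range(1, m + 1):
        d[0][j] = j                                   # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if src[i - 1] == tgt[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j] + 1,            # delete
                          d[i][j - 1] + 1,            # insert
                          d[i - 1][j - 1] + sub)      # substitute / copy
    return d[n][m]

print(min_edit_distance("TEACHER", "STUDENT"))              # 7 with unit costs
print(min_edit_distance("TEACHER", "STUDENT", sub_cost=2))  # 10 with substitution cost 2
```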
Q8. (a) Consider the following CFG and convert the same into equivalent CFG in
Chomsky Normal form:
S --> NP VP ; S --> VP ; NP --> NP PP;
NP --> PropNoun ; VP --> Verb ; VP --> Verb NP ;
VP --> VP PP ; PP --> Prep NP ; PropNoun --> DALLAS ;
PropNoun --> ALICE ; PropNoun --> BOB ;
PropNoun --> AUSTIN ; Verb --> ADORE ;
Verb --> SEE ; Prep --> IN ; Prep --> WITH
Answer: All rules of the given grammar are already either of the form A --> B C or A --> terminal, except for the unit productions S --> VP, NP --> PropNoun and VP --> Verb. Eliminating the unit productions yields the following equivalent grammar in Chomsky Normal Form:
S --> NP VP ; S --> Verb NP ; S --> VP PP ;
S --> ADORE ; S --> SEE ;
NP --> NP PP ; NP --> DALLAS ; NP --> ALICE ;
NP --> BOB ; NP --> AUSTIN ;
VP --> Verb NP ; VP --> VP PP ; VP --> ADORE ; VP --> SEE ;
PP --> Prep NP ; Verb --> ADORE ; Verb --> SEE ;
Prep --> IN ; Prep --> WITH
(b) Using the normalized grammar derived above, find the CKY parsing chart for the
sentence: “SEE BOB IN AUSTIN”
Answer: With the CNF grammar from part (a), the CKY chart (cell [i, j] holds the non-terminals spanning words i+1 through j) is as follows:
[0,1] SEE: {S, VP, Verb}      [0,2]: {S, VP}      [0,3]: {}        [0,4]: {S, VP}
[1,2] BOB: {NP}               [1,3]: {}           [1,4]: {NP}
[2,3] IN: {Prep}              [2,4]: {PP}
[3,4] AUSTIN: {NP}
Since S appears in the cell spanning the whole sentence ([0,4]), the sentence is accepted by the grammar.
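A minimal CKY sketch over the CNF grammar from part (a); it rebuilds the chart above and confirms that S spans the whole sentence.

```python
# CKY recognition over the CNF grammar derived in part (a).
from collections import defaultdict

# Lexical rules: word -> non-terminals that can rewrite to it in the CNF grammar.
lexical = {
    "SEE": {"Verb", "VP", "S"}, "ADORE": {"Verb", "VP", "S"},
    "BOB": {"NP"}, "ALICE": {"NP"}, "DALLAS": {"NP"}, "AUSTIN": {"NP"},
    "IN": {"Prep"}, "WITH": {"Prep"},
}
# Binary rules as (LHS, B, C) for LHS -> B C.
binary = [("S", "NP", "VP"), ("S", "Verb", "NP"), ("S", "VP", "PP"),
          ("NP", "NP", "PP"), ("VP", "Verb", "NP"), ("VP", "VP", "PP"),
          ("PP", "Prep", "NP")]

def cky(words):
    n = len(words)
    chart = defaultdict(set)                      # chart[(i, j)] = non-terminals spanning words[i:j]
    for i, w in enumerate(words):
        chart[(i, i + 1)] = set(lexical.get(w, ()))
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for lhs, b, c in binary:
                    if b in chart[(i, k)] and c in chart[(k, j)]:
                        chart[(i, j)].add(lhs)
    return chart

words = "SEE BOB IN AUSTIN".split()
chart = cky(words)
print(chart[(0, len(words))])                     # contains 'S', so the sentence parses
```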
