
Department of Computer Engineering

Academic Term: July-November 2023

Rubrics for Lab Experiments

Class: B.E. Computer          Subject Name: NLP

Semester: VII                 Subject Code: CSDC7023

Practical No: 4

Title: N-Gram Model

Date of Performance: 17/08/2023

Roll No: 9228

Name of the Student: Ruben Rodrigues

Evaluation:

Performance Indicator | Below average | Average | Good | Excellent | Marks
On time submission (2) | Not submitted (0) | Submitted after deadline (1) | Early or on-time submission (2) | --- |
Test cases and output (4) | Incorrect output (1) | The expected output is verified only for a few test cases (2) | The expected output is verified for all test cases but is not presentable (3) | Expected output is obtained for all test cases; presentable and easy to follow (4) |
Coding efficiency (2) | The code is not structured at all (0) | The code is structured but not efficient (1) | The code is structured and efficient (2) | - |
Knowledge (2) | Basic concepts not clear (0) | Understood the basic concepts (1) | Could explain the concept with a suitable example (1.5) | Could relate the theory with a real-world application (2) |
Total
Natural Language Processing (BE COMP – Sem-VII)

Experiment – 4
N-gram Model

Aim: To implement the N-gram model

Task 1:

Quiz output:

I sit you EOS : 0.00808


Can you sit near I EOS : 1/29700
I can sit EOS : 0.0121
You sit EOS : 0.0181
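The quiz values above come from whichever training corpus the quiz specifies. As a rough, self-contained sketch of the mechanics only (chain-rule product of bigram MLE estimates), the following uses a made-up toy corpus with hypothetical <s>/EOS sentence markers, so the numbers it prints will not match the quiz values:

from collections import Counter

# hypothetical toy corpus (NOT the quiz corpus); sentences are padded
# with <s> (begin) and EOS (end) markers
corpus = [
    ["<s>", "i", "can", "sit", "EOS"],
    ["<s>", "you", "sit", "EOS"],
    ["<s>", "can", "you", "sit", "EOS"],
]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter()
for sent in corpus:
    bigrams.update(zip(sent, sent[1:]))

def sentence_prob(sentence):
    # chain rule with bigram MLE: P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})
    p = 1.0
    for prev, cur in zip(sentence, sentence[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

print(sentence_prob(["<s>", "you", "sit", "EOS"]))  # 0.333... on this toy corpus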

import pandas as pd

# keep only the selected_text and sentiment columns (positions 2 and 3)
df = pd.read_csv('/content/Tweets.csv', usecols=[2,3])

df.head()

   selected_text                          sentiment
0  I`d have responded, if I were going    neutral
1  Sooo SAD                               negative
2  bullying me                            negative
3  leave me alone                         negative
4  Sons of ****,                          negative

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

df = df.dropna()  # drop rows with missing text or sentiment

le = LabelEncoder()
df['sentiment'] = le.fit_transform(df['sentiment'])  # encode sentiment labels as integers

X = list(df['selected_text'])
y = list(df['sentiment'])

# hold out 25% of the tweets for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(analyzer='word', ngram_range=(1,1), stop_words='english')

X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.transform(X_test)

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score
import numpy as np

clf = MultinomialNB()
clf.fit(X_train_cv, y_train)

y_pred = clf.predict(X_test_cv)

score = f1_score(y_test, y_pred, average='micro')


print('F-1 score : {}'.format(np.round(score,4)))

F-1 score : 0.7699

cv = CountVectorizer(analyzer='word', ngram_range=(1,2), stop_words='english')

X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.transform(X_test)

clf = MultinomialNB()
clf.fit(X_train_cv, y_train)

y_pred = clf.predict(X_test_cv)

score = f1_score(y_test, y_pred, average='micro')


print('F-1 score : {}'.format(np.round(score,4)))

F-1 score : 0.768


for N in range(1, 11):
    cv = CountVectorizer(analyzer='word', ngram_range=(1, N), stop_words='english')

    X_train_cv = cv.fit_transform(X_train)
    X_test_cv = cv.transform(X_test)

    clf = MultinomialNB()
    clf.fit(X_train_cv, y_train)
    y_pred = clf.predict(X_test_cv)

    score = np.round(f1_score(y_test, y_pred, average='micro'), 4)

    print('F-1 score of model with n-gram range of {}: {}'.format((1, N), score))

F-1 score of model with n-gram range of (1, 1): 0.7699


F-1 score of model with n-gram range of (1, 2): 0.768
F-1 score of model with n-gram range of (1, 3): 0.7655
F-1 score of model with n-gram range of (1, 4): 0.7652
F-1 score of model with n-gram range of (1, 5): 0.7658
F-1 score of model with n-gram range of (1, 6): 0.7662
F-1 score of model with n-gram range of (1, 7): 0.7662
F-1 score of model with n-gram range of (1, 8): 0.7659
F-1 score of model with n-gram range of (1, 9): 0.7659
F-1 score of model with n-gram range of (1, 10): 0.7659

Conclusion: Based on the results, the model performs best with the n-gram range of (1, 1), i.e. with unigram features alone (F-1 = 0.7699). Widening the range to include bigrams and larger n-grams does not improve performance: the additional n-grams mostly make the input feature space sparser without adding useful signal, which slightly hampers the model.
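To see this sparsity effect directly, one can count the features CountVectorizer produces at each range. A minimal sketch, reusing X_train from above (the feature counts and density it prints depend on the dataset, so treat the trend, not any specific numbers, as the point):

from sklearn.feature_extraction.text import CountVectorizer

for N in range(1, 6):
    cv = CountVectorizer(analyzer='word', ngram_range=(1, N), stop_words='english')
    X_cv = cv.fit_transform(X_train)
    # fraction of non-zero entries in the document-term matrix
    density = X_cv.nnz / (X_cv.shape[0] * X_cv.shape[1])
    print('ngram_range=(1,{}): {} features, density {:.6f}'.format(N, X_cv.shape[1], density))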

Postlab:

Perplexity is a metric for evaluating language models, including N-gram models, in the field of NLP. It measures how well a model predicts a sequence of tokens: the lower the value, the better the model.

For an N-gram model, which is a type of probabilistic language model that predicts the next word in a sequence based on the preceding words, perplexity is computed as:

PP(W) = P(w1, w2, ..., wN)^(-1/N)

Where:
N = number of words in the sequence
P(w1, w2, ..., wN) = probability assigned by the model to the entire sequence
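As an illustration, here is a minimal sketch of this computation for a bigram model on a made-up toy corpus, assuming unsmoothed MLE probabilities (so every test bigram must occur in training) and hypothetical <s>/EOS sentence markers:

import math
from collections import Counter

train = [["<s>", "i", "can", "sit", "EOS"],
         ["<s>", "you", "sit", "EOS"]]

unigrams = Counter(w for sent in train for w in sent)
bigrams = Counter()
for sent in train:
    bigrams.update(zip(sent, sent[1:]))

def perplexity(sentence):
    # PP(W) = P(w1..wN)^(-1/N), accumulated in log space for numerical stability
    log_p = 0.0
    for prev, cur in zip(sentence, sentence[1:]):
        log_p += math.log(bigrams[(prev, cur)] / unigrams[prev])
    n = len(sentence) - 1  # number of predicted tokens
    return math.exp(-log_p / n)

print(perplexity(["<s>", "i", "can", "sit", "EOS"]))  # low PP = sequence fits the model well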
