
FACULTY OF COMPUTER SCIENCE AND ENGINEERING

Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Topi

Lab Duration: 3 hr. AI361L NLP Lab Marks: 10

Lab No: 1 Instructor: Memoona Saleem

Lab Activity
Lab Task: Communicating with ChatGPT and Exploring NLP Applications
This lab focuses on interacting with ChatGPT to understand its capabilities in processing natural
language, especially when dealing with ambiguous sentences. Additionally, students will explore
different practical applications of Natural Language Processing (NLP) to gain insights into the
field's breadth and depth.
1. Interact with ChatGPT:
Begin by crafting a series of sentences that you will use to communicate with ChatGPT. Aim for
a mix of straightforward and ambiguous sentences to test the model's understanding and
contextual interpretation capabilities.
- Examples of ambiguous sentences might include:
- "Can you tell me the time the clock stopped?"
- "I saw the man with a telescope."
- "They are cooking apples."
- Record ChatGPT's responses to each sentence, noting particularly how it handles the
ambiguous ones.
2. Analyze ChatGPT's Performance:
For each interaction, analyze whether ChatGPT correctly understood the context of your
sentences, especially the ambiguous ones. Identify any patterns in responses where the model
may misinterpret the intended meaning.
3. Explore Practical NLP Applications:
Choose several real-world NLP applications (e.g., machine translation, sentiment analysis, chatbots). For each application, consider the following:
- The specific NLP techniques and models used.
- The benefits and limitations of applying NLP in this context.
- Any potential future developments or improvements in NLP that could enhance the application.
FACULTY OF COMPUTER SCIENCE AND ENGINEERING
Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Topi

Lab Duration: 3 hr. AI361L NLP Lab Marks: 10

Lab No: 2 Instructor: Memoona Saleem

Task Statement:
In this NLP lab task, you will work with a text file containing a collection of recipes. Your
objective is to extract four different types of information from each recipe: the amount, measure
type, ingredient, and instructions.
Instructions:
1. Read Text File:
- Begin by reading the text file containing the collection of recipes. You can use Python's file
handling capabilities or any suitable library for text file processing.
2. Extract Information:
- For each recipe in the text file, extract the following four types of information:
- Amount: The quantity of the ingredient (e.g., "1", "2.5").
- Measure Type: The unit of measurement for the ingredient (e.g., "cup", "teaspoon").
- Ingredient: The name of the ingredient (e.g., "flour", "sugar").
- Instructions: The cooking instructions or steps for preparing the recipe.
3. Data Processing:
- Utilize natural language processing (NLP) techniques, such as tokenization and part-of-
speech tagging, to extract the required information accurately.
- Consider using regular expressions to identify patterns related to amounts, measure types, and ingredients within the recipe text (see the sketch after these steps).
4. Output:
- Organize the extracted information for each recipe into a structured format, such as a list of
dictionaries or a pandas DataFrame.
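A minimal sketch of steps 1–4 is shown below. The ingredient-line format, the measure-word list, and the file name `recipes.txt` are illustrative assumptions that you will need to adapt to the actual file.

```python
import re

# Assumed ingredient-line format: "<amount> <measure> <ingredient>",
# e.g. "2 cups flour" or "1/2 teaspoon salt". Lines that do not match
# (including measure-less ones like "2 eggs") are treated as instructions.
INGREDIENT_RE = re.compile(
    r"^(?P<amount>\d+(?:\.\d+)?(?:/\d+)?)\s+"
    r"(?P<measure>cups?|tablespoons?|teaspoons?|grams?|oz|ml)\s+"
    r"(?P<ingredient>.+)$",
    re.IGNORECASE,
)

def parse_recipe(text):
    """Split one recipe block into ingredient dicts and instruction text."""
    ingredients, instructions = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = INGREDIENT_RE.match(line)
        if match:
            ingredients.append(match.groupdict())
        else:
            instructions.append(line)
    return {"ingredients": ingredients, "instructions": " ".join(instructions)}

# Assumption: recipes in the file are separated by blank lines.
with open("recipes.txt", encoding="utf-8") as f:
    recipes = [parse_recipe(block) for block in f.read().split("\n\n") if block.strip()]
print(recipes[0])
```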
FACULTY OF COMPUTER SCIENCE AND ENGINEERING
Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Topi

Lab Duration: 3 hr. AI361L NLP Lab Marks: 10

Lab No: 3 Instructor: Memoona Saleem

Information Extraction Using Regular Expressions


Wikipedia contains a wealth of information in a semi-structured format. While reading a
Wikipedia article, we can easily identify key pieces of information such as a person's name, age,
date of birth, spouse, and net worth. However, programmatically extracting this information
requires parsing the unstructured text and capturing the required details.
Task 1
1. Data Collection:
- Choose a public figure's Wikipedia page (e.g., Elon Musk, Bill Gates, etc.).
- Copy a portion of the text that includes the individual's name, age, date of birth, spouse(s), net worth, and details about their life.
2. Information Extraction:
- Write a Python script using regular expressions to extract the following information from the
text you copied:
- Full Name
- Age
- Date of Birth
- Spouse(s) Name(s)
- Net Worth
- Five additional attributes of your choice
- Pay attention to the different ways this information can be presented in text. For example, the date of birth might appear as "June 28, 1971" or "1971-06-28".
- Net worth might be presented in various currencies or as a range (e.g., "USD $20 billion" or "20–30 billion dollars"). A starting-point sketch follows below.
FACULTY OF COMPUTER SCIENCE AND ENGINEERING
Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Topi

Lab Duration: 3 hr. AI361L NLP Lab Marks: 10

Lab No: 4 Instructor: Memoona Saleem

Task: Creating a Simple Search Engine using Bag of Words


The bag of words model is a way of representing text data when modeling text with machine
learning algorithms. The model ignores the order of words but maintains multiplicity.
Preparation:
- Provide students with a dataset consisting of various text documents. This could be anything
from news articles, plot summaries of movies, tweets, etc.
- The dataset should be preprocessed to remove any HTML tags and special characters, and should be case-normalized.
Task Steps:
1. Tokenization:
- Ask students to write a function to tokenize the documents into words. Each word represents
a feature of the document.
2. Stop Words Removal:
- Provide a list of stop words.
- Students should filter out these common words that add no significant value to the meaning
of the document.
3. Frequency Distribution:
- Students will create a frequency distribution for each document, counting how often each
word appears.
4. Bag of Words Model Creation:
- Using the frequency distributions, students will construct a bag of words model for the entire dataset.
- They will create a matrix where each row represents a document, each column represents a
word in the dataset's vocabulary, and the entries are the word frequencies.
5. Implementing the Search Function:
- Students will write a function that takes a query as input and converts it into its bag of words
representation.
- The function will then compare the query's bag of words with the dataset's bag of words to find the most relevant documents.
- Relevance can be determined by the number of matching words and their frequencies (see the sketch after these steps).
6. Testing the Search Engine:
- Each student will test their search engine with a set of queries to retrieve relevant documents.
- They should analyze which queries worked well and which didn’t and discuss why.
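A minimal end-to-end sketch of steps 1–5 is given below. The tiny stop-word list, the example documents, and the frequency-sum relevance score are illustrative assumptions; in the lab, use the provided stop-word list and dataset.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}  # use the provided list

def tokenize(doc):
    """Lowercase, split into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", doc.lower()) if w not in STOP_WORDS]

def bag_of_words(doc):
    """Frequency distribution of a document's tokens."""
    return Counter(tokenize(doc))

def search(query, docs):
    """Score documents by the summed frequencies of matching query words."""
    q = bag_of_words(query)
    scores = []
    for i, doc in enumerate(docs):
        bow = bag_of_words(doc)
        scores.append((sum(bow[w] for w in q), i))
    return [i for score, i in sorted(scores, reverse=True) if score > 0]

docs = ["the cat sat on the mat", "dogs and cats are friends", "stock markets fell today"]
print(search("cat mat", docs))  # -> [0] (only document 0 contains "cat" and "mat")
```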

Sample Input & Output:


FACULTY OF COMPUTER SCIENCE AND ENGINEERING
Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Topi

Lab Duration: 3 hr. AI361L NLP Lab Marks: 10

Lab No: 5 Instructor: Memoona Saleem

1. TF-IDF Analysis Task:


Create a mini search engine for a set of documents.
- Task:
- You should come up with a corpus of documents (these could be articles, book chapters, or any collection of text).
- Write a Python program that calculates the TF-IDF score for each word in each document.
- Then create a simple search function that uses the TF-IDF scores to find and rank documents based on a query of one or more words (a sketch follows the extension below).
- Extension:
- You could expand the search engine to include basic preprocessing of the text, such as tokenization, stop-word removal, stemming, or lemmatization.
- Discuss how different preprocessing steps might affect the TF-IDF scores and the search results.
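A minimal from-scratch sketch is shown below, assuming raw term frequency and a smoothed idf of log((1 + N) / (1 + df)); other tf and idf variants are equally valid and will change the rankings.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def tfidf_scores(docs):
    """Return one {word: tf-idf} dict per document (raw tf, smoothed idf)."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter(w for toks in tokenized for w in set(toks))  # document frequency
    n = len(docs)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        scores.append({w: tf[w] * math.log((1 + n) / (1 + df[w])) for w in tf})
    return scores

def search(query, docs):
    """Rank documents by the sum of TF-IDF scores of the query words."""
    scores = tfidf_scores(docs)
    q = set(tokenize(query))
    return sorted(range(len(docs)),
                  key=lambda i: sum(scores[i].get(w, 0.0) for w in q),
                  reverse=True)

docs = ["apples and oranges", "apples are fruit", "trains run on rails"]
print(search("apples fruit", docs))  # -> [1, 0, 2]; document 1 matches both words
```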
2. N-gram Model Task:
- Come up with a text dataset (such as a set of tweets, sentences from books, or movie subtitles), or reuse a dataset from an earlier lab.
- Create an N-gram model that predicts the next word(s) based on the previous word(s) in a sequence.
- Then use your model to generate new sentences based on a given starting sequence of words (see the sketch after the extension below).
- Extension:
- Explore the effects of smoothing techniques on the model's performance.
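The sketch below trains a bigram (N = 2) model by counting continuations and samples the next word in proportion to its count. It is unsmoothed, so the smoothing techniques from the extension would modify the probability estimates; the toy sentences are placeholders for your dataset.

```python
import random
from collections import defaultdict, Counter

def train_bigram(sentences):
    """Count bigram continuations: model[w1] is a Counter of following words."""
    model = defaultdict(Counter)
    for s in sentences:
        tokens = ["<s>"] + s.lower().split() + ["</s>"]
        for w1, w2 in zip(tokens, tokens[1:]):
            model[w1][w2] += 1
    return model

def generate(model, start="<s>", max_len=15):
    """Generate a sentence by repeatedly sampling the next word."""
    word, out = start, []
    for _ in range(max_len):
        choices = model.get(word)
        if not choices:
            break
        # Sample the next word proportionally to its bigram count.
        word = random.choices(list(choices), weights=choices.values())[0]
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

sentences = ["the cat sat on the mat", "the dog sat on the rug"]
model = train_bigram(sentences)
print(generate(model))
```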
Deliverable: Submit your code file via email.
FACULTY OF COMPUTER SCIENCE AND ENGINEERING
Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Topi

Lab Duration: 3 hr. AI361L NLP Lab Marks: 10

Lab No: 6 Instructor: Memoona Saleem

1. Word Similarity and Analogy Task


- Objective:
To explore and understand the semantic relationships captured by GloVe and Word2Vec
embeddings.
- Task:
Students will use pre-trained GloVe and Word2Vec models to find words that are most similar
to a given set of words (e.g., "king," "computer," "Paris") and solve word analogies (e.g.,
"king - man + woman = ?") (see the sketch below).
- Outcome:
This task will help students grasp how word embeddings capture semantic relationships and
analogies, illustrating the models' ability to understand context and meaning.
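A minimal sketch using `gensim`'s downloader is shown below; `glove-wiki-gigaword-50` is one of several pre-trained models it exposes (downloaded on first use), and a Word2Vec model such as `word2vec-google-news-300` (a much larger download) can be swapped in for comparison.

```python
import gensim.downloader as api

# Loads a small 50-dimensional GloVe model (~66 MB, cached after first download).
glove = api.load("glove-wiki-gigaword-50")

# Nearest neighbors in the embedding space.
print(glove.most_similar("king", topn=5))

# The classic analogy: king - man + woman ≈ queen.
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```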
2. Visualization of Word Embeddings
- Objective:
To visualize the high-dimensional word embeddings in a two-dimensional space.
- Task:
Students will use dimensionality reduction techniques like PCA (Principal Component
Analysis) or t-SNE to visualize the word embeddings generated by GloVe and Word2Vec,
then explore how similar words cluster together (see the sketch below).
- Outcome:
This visual task will aid in understanding the geometric relationships between words in the
embedding space, highlighting how similarity and context are represented.
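A minimal PCA sketch with `scikit-learn` and `matplotlib` follows; the word list is an arbitrary illustration, and t-SNE (`sklearn.manifold.TSNE`) can be substituted for PCA with the same plotting code.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")  # same model as the previous task
words = ["king", "queen", "man", "woman", "paris", "london", "france", "england"]
vectors = [glove[w] for w in words]

# Project the 50-d vectors onto their first two principal components.
points = PCA(n_components=2).fit_transform(vectors)

plt.figure(figsize=(6, 5))
plt.scatter(points[:, 0], points[:, 1])
for (x, y), w in zip(points, words):
    plt.annotate(w, (x, y))
plt.title("GloVe embeddings projected with PCA")
plt.show()
```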
3. Text Classification Using Word Embeddings
- Objective:
To apply word embeddings in a practical NLP task.
- Task: Students will build a simple text classification model (e.g., for sentiment analysis or
topic categorization) using features derived from GloVe or Word2Vec embeddings. They can
compare the performance of models using raw text features, Word2Vec embeddings, and
GloVe embeddings (a baseline sketch follows below).
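One simple baseline is sketched below under the assumption that a document can be represented by the average of its word vectors; the toy texts and labels are placeholders for a real sentiment or topic dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")

def doc_vector(text):
    """Average the embeddings of in-vocabulary words (zeros if none match)."""
    vecs = [glove[w] for w in text.lower().split() if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(glove.vector_size)

# Placeholder data: 1 = positive, 0 = negative.
texts = ["a wonderful uplifting film", "a dull and tedious movie",
         "truly great acting", "boring and painful to watch"]
labels = [1, 0, 1, 0]

clf = LogisticRegression().fit([doc_vector(t) for t in texts], labels)
print(clf.predict([doc_vector("a great film")]))  # should predict 1 (positive)
```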
Submission Guidelines:
Write your code in Python.
Submit your Jupyter notebook via email.
Compile your findings, code snippets (if any), and discussions into a Word document.
Ensure your document is well-organized, with clear headings for each part and task.
Include any references used for your theoretical explanations and practical experiments.
FACULTY OF COMPUTER SCIENCE AND ENGINEERING
Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Topi

Lab Duration: 3 hr. AI361L NLP Lab Marks: 10

Lab No: 7 Instructor: Memoona Saleem

The Secret of the Silent Library

Detective Team Lead: ________________
Detective #2: ____________________
Detective #3: ____________________
Detective #4: ____________________
Detective #5: ____________________
Detective #6: ____________________
Detective #7: ____________________
Detective #8: ____________________
Detective #9: ____________________
Detective #10: ____________________
Location:
- The Silent Library: A prestigious and ancient library known for its vast collection of rare
books and quiet study spaces.
Time:
- Late Evening: The library is usually closed to the public at this time, but a special event was
being held.
Characters:
1. Victim: Evelyn Reed, the head librarian, known for her strict rules and dedication to
preserving the library's collection.
2. Suspects:
- Marcus Finch: A historian researching an ancient manuscript. He frequently argued with
Evelyn over access to the library's rarest collections.
- Lila Sutton: A local journalist who was investigating a story about a rumored hidden treasure
within the library.
- Henry Clarke: Evelyn's assistant librarian, who recently discovered that he was being
written out of her will.
- Sophia Green: A regular visitor and mystery novelist who often used the library for
inspiration. She had a public disagreement with Evelyn the day before the murder.
Clues (Distributed across text documents):
1. Witness Statement (Document 1): Accounts from library guests about the evening's events.
2. Diary Entries (Document 2): Evelyn's diary revealing her thoughts on the suspects.
3. Library Logs (Document 3): Check-in and check-out records showing who was in the library
at the time of the murder.
4. Email Exchanges (Document 4): Correspondence between Evelyn and a mysterious figure
regarding one of the library's rare books.

Hints to Reveal the Murderer:


1: Tokenization
- Objective: Break down Witness Statements into sentences to find any mentions of arguments or
suspicious behavior.
- Technique: Use Python's `nltk` or `spaCy` library to tokenize the text.
2: Named Entity Recognition (NER)
- Objective: Identify all named entities in the Diary Entries and Library Logs to see who was
mentioned around the time of the murder.
- Technique: Apply NER using `spaCy` to categorize words into people, places, and times.
3: Sentiment Analysis
- Objective: Analyze the sentiment of Diary Entries related to each suspect.
- Technique: Use `TextBlob` or another sentiment analysis tool to gauge the emotional tone of
the entries.
4: Keyword Search
- Objective: Search Email Exchanges for keywords related to the rare book that Evelyn was
arguing about.
- Technique: Implement regex searches to find specific patterns or keywords.
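A combined sketch of hints 2 and 3 is given below; the file name `diary_entries.txt` and the suspects' first names are assumptions based on the clue documents described above, and the spaCy model must be installed first with `python -m spacy download en_core_web_sm`.

```python
import spacy
from textblob import TextBlob

nlp = spacy.load("en_core_web_sm")

# Assumption: the diary clue document is saved locally under this name.
diary = open("diary_entries.txt", encoding="utf-8").read()
doc = nlp(diary)

# Hint 2: named entities — who (and when) is mentioned in the diary?
for ent in doc.ents:
    if ent.label_ in ("PERSON", "DATE", "TIME"):
        print(ent.text, ent.label_)

# Hint 3: sentiment polarity of each sentence mentioning a suspect.
suspects = ["Marcus", "Lila", "Henry", "Sophia"]
for sent in doc.sents:
    for name in suspects:
        if name in sent.text:
            polarity = TextBlob(sent.text).sentiment.polarity
            print(name, round(polarity, 2), sent.text[:60])
```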
The Final Challenge: Identifying the Murderer
- Combine Insights: Use findings from each task to piece together the murderer's identity.
- Key Evidence: Marcus Finch's frequent heated arguments over access, Lila's investigation into
the rumored treasure, Henry's motive related to the will, and Sophia's public disagreement.
- Solution Script: Students write a Python script to analyze the texts, extract entities, perform
sentiment analysis, and identify the suspect with the strongest motive and opportunity based on
the clues.
FACULTY OF COMPUTER SCIENCE AND ENGINEERING
Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Topi

Lab Duration: 3 hr. AI361L NLP Lab Marks: 10

Lab No: 8 Instructor: Memoona Saleem

Implementing perplexity calculation in Python or other programming languages involves the following
steps:
1. Preprocess the Text Data: As discussed in Section 2, preprocess the text data by tokenizing it,
removing stopwords and punctuation, and creating n-grams.
2. Build the Language Model: Depending on the chosen language model (e.g., n-gram model or neural
language model), build the model and estimate the probabilities of n-grams or train the neural model on
the preprocessed data.
3. Prepare the Test Dataset: Create a test dataset containing sequences of words that the language model
has not seen during training. This dataset will be used to evaluate the perplexity of the model.
4. Calculate Perplexity: For each sequence in the test dataset, calculate the probability of the sequence using the language model. Then compute the inverse probability and take the geometric mean across all sequences to get the perplexity score.
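A minimal sketch for a bigram model with add-one smoothing is shown below; it computes perplexity as exp(-(1/N) · Σ log P(w_i | w_{i-1})), which is equivalent to the geometric-mean-of-inverse-probability formulation in step 4. The toy sentences are placeholders for your corpus.

```python
import math
from collections import Counter

def train_bigram_counts(sentences):
    """Collect unigram and bigram counts with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.lower().split() + ["</s>"]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams

def perplexity(test_sentences, unigrams, bigrams):
    """exp(-(1/N) * sum of log P(w_i | w_{i-1})), with add-one smoothing."""
    vocab = len(unigrams)
    log_prob, n = 0.0, 0
    for s in test_sentences:
        toks = ["<s>"] + s.lower().split() + ["</s>"]
        for w1, w2 in zip(toks, toks[1:]):
            p = (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab)  # Laplace smoothing
            log_prob += math.log(p)
            n += 1
    return math.exp(-log_prob / n)

train = ["the cat sat on the mat", "the dog sat on the rug"]
test = ["the cat sat on the rug"]
uni, bi = train_bigram_counts(train)
print(perplexity(test, uni, bi))
```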
FACULTY OF COMPUTER SCIENCE AND ENGINEERING
Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Topi

Lab Duration: 3 hr. AI361L NLP Lab Marks: 10

Lab No: 9 Instructor: Memoona Saleem

Q. Implement a Naive Bayes model from scratch and apply it to an NLP use case of your choice.
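A minimal from-scratch sketch of a multinomial Naive Bayes text classifier with add-one smoothing is given below; the toy sentiment data is a placeholder for whatever NLP use case you choose.

```python
import math
import re
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing for text classification."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(re.findall(r"[a-z]+", text.lower()))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        words = re.findall(r"[a-z]+", text.lower())
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in words:  # add-one smoothed log likelihoods
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best

texts = ["great movie", "awful film", "loved it", "hated it"]
labels = ["pos", "neg", "pos", "neg"]
model = NaiveBayes().fit(texts, labels)
print(model.predict("loved the movie"))  # expect "pos"
```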
FACULTY OF COMPUTER SCIENCE AND ENGINEERING
Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Topi

Lab Duration: 3 hr. AI361L NLP Lab Marks: 10

Lab No: 10 Instructor: Memoona Saleem

Task: Building a Part-of-Speech Tagger using Hidden Markov Models


Objective:
Students will develop a basic Part-of-Speech (POS) tagger using Hidden Markov Models to
understand how probabilities are used to predict word types in sentences.
Materials Needed:
- A tagged corpus (such as the Brown, Penn Treebank, or any other POS tagged corpus available
in libraries like NLTK)
- Python programming environment
- Libraries such as NLTK or another statistical modeling library that supports HMMs
Steps:
1. Introduction to POS Tagging:
- Discuss the concept of POS tagging and its importance in NLP.
- Explain how HMMs can be used to model language for POS tagging.
2. Model Training:
- Guide students through the process of loading a pre-tagged training corpus.
- Show how to train an HMM on this data. Discuss the importance of transition probabilities
(from one POS tag to another) and emission probabilities (from POS tags to words).
3. Implementation:
- Students will write code to implement the HMM for POS tagging. They will calculate the transition and emission probabilities using the training data (see the sketch after these steps).
4. Testing and Evaluation:
- Students will test their model on a separate set of sentences from the corpus. They can use
existing tools to compare the accuracy of their tags against the gold standard tags in the corpus.
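A minimal sketch using NLTK's built-in HMM trainer is shown below; the train/test split point is arbitrary, and the Lidstone estimator is one way to keep unseen words from receiving zero emission probability.

```python
import nltk
from nltk.corpus import treebank
from nltk.tag import hmm
from nltk.probability import LidstoneProbDist

nltk.download("treebank")  # first run only

# Split the tagged corpus into training and test sentences (split point is arbitrary).
tagged = list(treebank.tagged_sents())
train, test = tagged[:3000], tagged[3000:]

# The trainer estimates transition and emission probabilities from counts;
# Lidstone smoothing avoids zero probabilities for unseen events.
trainer = hmm.HiddenMarkovModelTrainer()
tagger = trainer.train_supervised(
    train, estimator=lambda fd, bins: LidstoneProbDist(fd, 0.1, bins))

print(tagger.tag("The clock stopped at noon".split()))
print("Accuracy:", tagger.accuracy(test))  # use tagger.evaluate(test) on NLTK < 3.6
```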
FACULTY OF COMPUTER SCIENCE AND ENGINEERING
Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Topi

Lab Duration: 3 hr. AI361L NLP Lab Marks: 10

Lab No: 11 Instructor: Memoona Saleem

Task 1: Sentiment and Emotion Analysis with RoBERTa


Innovative Elements:
1. Dataset Preparation:
- Use a dataset that includes both sentiment labels (positive, negative, neutral) and emotion labels (joy, sadness, anger, fear, surprise, etc.). Consider augmenting a standard sentiment analysis dataset with emotion labels through manual annotation or semi-supervised learning techniques.
2. Model Adaptation and Fine-Tuning:
- Adapt the RoBERTa model to output two predictions: one for sentiment and another for emotion. This could involve modifying the model architecture to include two output layers, each tailored to a specific task (a sketch of this adaptation appears after the hint below).
- Fine-tune this adapted model on your dataset, ensuring that it learns to predict both sentiment and emotions effectively.
3. Evaluation and Metrics:
- Develop a comprehensive evaluation strategy that includes accuracy, precision, recall, and
F1-score for both sentiment and emotion predictions.
- Analyze the correlation between sentiment and emotion predictions to validate the model's
effectiveness in capturing nuanced emotional expressions.
4. Visualization and Interpretation:
- Visualize the results using confusion matrices for both sentiments and emotions to identify
any interesting patterns or common misclassifications.
- Use dimensionality reduction techniques (like PCA or t-SNE) on the model's embeddings to
explore how different sentiments and emotions are represented in the vector space.
Hint: Step 1. Data Augmentation via Semi-supervised Learning
Initial Model Training:
Train a basic model on a small, manually labeled subset of your data. This model can be a simple
machine learning model or a pre-trained NLP model fine-tuned for emotion detection.
Label Propagation:
Use the initial model to predict emotion labels for the unlabeled part of your dataset.
Select predictions with high confidence levels to automatically label more data.
Model Refinement:
Continuously refine your model by retraining it on the newly labeled data combined with the
initially labeled set.
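A minimal sketch of the two-head adaptation using Hugging Face `transformers` and PyTorch is shown below; the head sizes (3 sentiments, 6 emotions) are assumptions that should match your dataset's label sets.

```python
import torch.nn as nn
from transformers import RobertaModel, RobertaTokenizer

class TwoHeadRoberta(nn.Module):
    """RoBERTa encoder with separate classification heads for sentiment and emotion."""

    def __init__(self, n_sentiments=3, n_emotions=6):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained("roberta-base")
        hidden = self.encoder.config.hidden_size
        self.sentiment_head = nn.Linear(hidden, n_sentiments)
        self.emotion_head = nn.Linear(hidden, n_emotions)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # representation of the <s> token
        return self.sentiment_head(cls), self.emotion_head(cls)

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = TwoHeadRoberta()
batch = tokenizer(["I can't believe we won!"], return_tensors="pt", padding=True)
sent_logits, emo_logits = model(batch["input_ids"], batch["attention_mask"])
# Fine-tuning would minimize the sum of two cross-entropy losses, one per head.
```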
Sample Output

Task 2: Named Entity Recognition (NER) with LSTMs


Objective:
Develop an LSTM-based model to perform Named Entity Recognition (NER) on text data. This
task involves identifying and classifying entities such as names of persons, organizations,
locations, and other entities of interest within text.
Steps:
Dataset Selection:
Choose a dataset annotated with named entities, such as the CoNLL-2003 dataset or the
OntoNotes corpus.
Data Preprocessing:
Preprocess the text data and annotate named entities according to a predefined tagging scheme
(e.g., IOB tagging).
Model Architecture:
Design an LSTM-based architecture for sequence labeling, where the model assigns a label to each token in the input sequence, indicating whether it is part of a named entity (see the sketch after these steps).
Training:
Train the LSTM model on the annotated text data, optimizing for sequence labeling accuracy
using techniques like the CRF (Conditional Random Field) layer in conjunction with LSTM.
Inference:
Use the trained LSTM model to predict named entities in new, unseen text data.
Evaluate the model's performance using standard NER metrics such as precision, recall, and F1-score, comparing against baseline models and other state-of-the-art approaches.
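A minimal Keras sketch of the BiLSTM tagger is given below; the vocabulary size, sequence length, tag count, and random batch are placeholders for values derived from your corpus, and a CRF layer (e.g., `tfa.layers.CRF` from TensorFlow Addons) could replace the softmax output as suggested above.

```python
import numpy as np
from tensorflow.keras import layers, models

# Toy dimensions; in practice derive these from your corpus and tagging scheme.
VOCAB_SIZE, MAX_LEN, N_TAGS = 10_000, 50, 9  # e.g. 9 IOB tags for CoNLL-2003

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 100, mask_zero=True),          # 0 = padding id
    layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
    layers.TimeDistributed(layers.Dense(N_TAGS, activation="softmax")),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Dummy batch just to show the expected shapes: token ids in, one tag per token out.
x = np.random.randint(1, VOCAB_SIZE, size=(32, MAX_LEN))
y = np.random.randint(0, N_TAGS, size=(32, MAX_LEN))
model.fit(x, y, epochs=1, verbose=0)
model.summary()
```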
