TD 3 - Feature Extraction and Feature Selection

Ms. Laifa Meriem BBA university - 2023

TD 3
Feature extraction

Problem 1:
You are given a dataset of customer reviews for a product. Each review is labeled with its
sentiment: positive or negative. Your task is to perform sentiment analysis on this dataset using
both the bag of words and TF-IDF representations.

Dataset:

Review                                                  Sentiment

"The product is excellent, I love it!"                  Positive
"Not satisfied with the quality, very disappointed."    Negative
"Amazing experience, highly recommended."               Positive
"Poor design and functionality."                        Negative
"Great value for the price."                            Positive
"Terrible customer service."                            Negative

Tasks:

1. Bag of Words Representation:

● Tokenize each review and create a bag of words representation for each using
word frequencies.

● Create a vocabulary based on all unique words in the dataset.

● Represent each review as a vector using the bag of words approach.

2. TF-IDF Representation:

● Calculate the TF-IDF values for each term in the reviews.

● Create a TF-IDF representation for each review.

3. Sentiment Analysis:

● Based on the bag of words and TF-IDF representations, predict the sentiment
(positive or negative) for each review manually.

● Use your representations to identify important words contributing to each
sentiment.
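
The bag of words tasks above can be checked with a short script. This is a minimal sketch, not a reference solution: it assumes that tokenization means lowercasing and keeping alphabetic runs only (punctuation is dropped), and that the vocabulary is sorted alphabetically. The names `tokenize`, `vocabulary` and `bow_vectors` are illustrative choices, not part of the sheet.

```python
import re
from collections import Counter

# The six reviews from the Problem 1 dataset.
reviews = [
    "The product is excellent, I love it!",
    "Not satisfied with the quality, very disappointed.",
    "Amazing experience, highly recommended.",
    "Poor design and functionality.",
    "Great value for the price.",
    "Terrible customer service.",
]

def tokenize(text):
    # Assumption: lowercase the text and keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

# Step 1: tokenize each review.
tokenized = [tokenize(review) for review in reviews]

# Step 2: vocabulary = all unique words, sorted for a stable column order.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Step 3: one count vector per review, one entry per vocabulary word.
counts = [Counter(tokens) for tokens in tokenized]
bow_vectors = [[count[term] for term in vocabulary] for count in counts]

for review, vector in zip(reviews, bow_vectors):
    print(vector, review)
```

Under these tokenization assumptions the vocabulary has 28 words, since "the" is the only word shared between reviews. A different tokenizer (for example one that keeps punctuation or removes stop words) would give a different vocabulary.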

Note: For TF-IDF, assume a total of 6 documents in the corpus (the number of reviews in the
dataset). You can use the TF-IDF formulas and calculations covered previously to solve
this problem.
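
The formulas themselves are not reproduced on this sheet. One common convention, given here only as an assumption (other variants use a different logarithm base or add smoothing), is the following, where f_{t,d} is the number of times term t occurs in review d, N is the number of reviews and df(t) is the number of reviews containing t:

```latex
\[
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)}, \qquad
\text{tf-idf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
\]
% Worked values for N = 6, with lowercased tokens and punctuation removed:
\[
\mathrm{idf}(\text{the}) = \ln\frac{6}{3} \approx 0.693, \qquad
\mathrm{idf}(\text{product}) = \ln\frac{6}{1} \approx 1.792, \qquad
\text{tf-idf}(\text{product}, d_1) = \frac{1}{7}\cdot 1.792 \approx 0.256
\]
```

Here d_1 is the first review, which has 7 tokens under that tokenization, and "the" occurs in three of the six reviews.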

Problem 2: (you can use the computer for this one)


You are tasked with performing sentiment analysis on a larger dataset of 50 customer reviews
for various products. Each review is labeled with its sentiment: positive or negative. Additionally,
you are asked to identify the most important words contributing to each sentiment using both the
bag of words and TF-IDF representations.

Dataset:

You are provided with a dataset containing 50 customer reviews, with labels indicating whether
the sentiment is positive or negative. HERE

1. Bag of Words Representation:

a. Tokenize each review.

b. Create a bag of words representation for each review using word frequencies.

c. Develop a vocabulary based on all unique words in the dataset.

d. Represent each review as a vector using the bag of words approach.

2. TF-IDF Representation:

a. Calculate the TF-IDF values for each term in the reviews.

b. Create a TF-IDF representation for each review.

3. Sentiment Analysis: Manually predict the sentiment (positive or negative) for each review
based on both the bag of words and TF-IDF representations.

4. Feature Importance:

a. For each sentiment (positive and negative), identify the top 3 words with the
highest importance based on both bag of words and TF-IDF representations.

b. Importance can be determined by looking at the highest frequency in the bag of
words representation and the highest TF-IDF values in the TF-IDF representation.

5. Summary:

a. Provide a summary of your findings, including insights into the important words for
positive and negative sentiments according to both representations.

b. Compare and contrast the results obtained from the bag of words and TF-IDF
analyses.
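
The feature importance step can be sketched as follows. Because the 50-review file is not reproduced here, the script uses the six reviews from Problem 1 as a stand-in; replace `data` with the real dataset. The tokenizer, the TF-IDF variant (term count divided by review length, times ln(N / df)), the choice to sum TF-IDF scores per sentiment and the alphabetical tie-breaking are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Stand-in data: the six Problem 1 reviews, not the 50-review dataset.
data = [
    ("The product is excellent, I love it!", "positive"),
    ("Not satisfied with the quality, very disappointed.", "negative"),
    ("Amazing experience, highly recommended.", "positive"),
    ("Poor design and functionality.", "negative"),
    ("Great value for the price.", "positive"),
    ("Terrible customer service.", "negative"),
]

def tokenize(text):
    # Assumption: lowercase the text and keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

docs = [(tokenize(text), label) for text, label in data]
n_docs = len(docs)

# Document frequency: number of reviews that contain each term.
df = Counter(term for tokens, _ in docs for term in set(tokens))

bow_by_label = defaultdict(Counter)    # summed word counts per sentiment
tfidf_by_label = defaultdict(Counter)  # summed TF-IDF scores per sentiment
for tokens, label in docs:
    counts = Counter(tokens)
    bow_by_label[label].update(counts)
    for term, count in counts.items():
        tf = count / len(tokens)
        idf = math.log(n_docs / df[term])
        tfidf_by_label[label][term] += tf * idf

def top_words(scores, k=3):
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]

for label in ("positive", "negative"):
    print(label, "BoW:", top_words(bow_by_label[label]))
    print(label, "TF-IDF:", top_words(tfidf_by_label[label]))
```

On this stand-in data the raw counts rank "the" first for the positive class, whereas TF-IDF down-weights it because it occurs in three of the six reviews. With this length-normalized term frequency, words from short reviews also score higher, which is worth keeping in mind when comparing the two representations.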

Problem 3:
● Write the pseudocode of the Bag of Words algorithm.
● Write the pseudocode of the TF-IDF algorithm.

Make sure you provide details for each step.
