NLP Workshop Assignment - Feedback Data - Apr 20, 2019

NLP Workshop Assignment

Monday, April 29, 2019


Setting up your system - Anaconda
1. Download the 64-bit Graphical Installer of Anaconda for Windows from this link.
2. Follow the installation guide here.
3. Once Anaconda is installed, search for ‘Jupyter Notebook’ in the Start menu and launch it.
4. Create a new Python 3 notebook.
5. Import the dataset into the notebook, then continue with the processing steps described in the following sections.

Problem Statement
Microsoft has collected feedback from its users for a flagship event.
It has shared the feedback comments with us and asked the MAQ team to generate insights from this data.
Dataset: Microsoft_EventFeedback.csv

Name                 Values
Number of Rows       8412
Number of Columns    11
Columns              QuestionText, AnswerText, SessionTrack, Topic, TechnologyPillar, UserId,...
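
A minimal sketch of loading the dataset with pandas, assuming the CSV file sits in the notebook's working directory. The helper name load_feedback is hypothetical; the row and column counts to compare against are the ones listed in the table.

```python
import pandas as pd


def load_feedback(path="Microsoft_EventFeedback.csv"):
    """Read the feedback CSV into a DataFrame and report its basic shape."""
    df = pd.read_csv(path)
    rows, cols = df.shape
    print(f"Loaded {rows} rows and {cols} columns")
    print("Columns:", ", ".join(df.columns))
    return df
```

Calling df = load_feedback() in the notebook should report 8412 rows and 11 columns if the file matches the description; a different count is a sign that the wrong file or delimiter was used.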

MAQ Software - Confidential


Goals
The end goals are to:
1. Calculate the overall sentiment score for the event.
2. Find the top keywords that occur in these comments.



Text Pre-processing – Initial Cleaning / Casing
You need to perform the following tasks on the provided feedback comments dataset:

1. Clean the missing comments data
- Remove rows with a null value in the comments

2. Clean the text in the comments
- Convert the comments to lowercase and expand contractions
- Remove digits, punctuation, and special characters from the comments
- Remove any other noise in the comments


Before                       After
<3                           NULL
Speaker didn't show up :(    Speaker did not show up
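
The cleaning tasks above can be sketched as follows. This is one possible approach, not a prescribed one: the contraction map is a small hypothetical sample, the helper names are made up for illustration, and the comment column is assumed to be AnswerText.

```python
import re

import pandas as pd

# Hypothetical, deliberately small contraction map; extend it for real data.
CONTRACTIONS = {
    "didn't": "did not",
    "don't": "do not",
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "i'm": "i am",
}
_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in CONTRACTIONS) + r")\b"
)


def clean_comment(text):
    """Lowercase, expand contractions, and strip digits and punctuation.

    Returns None when the input is missing or nothing but noise remains.
    """
    if not isinstance(text, str):
        return None
    text = text.lower().replace("\u2019", "'")  # normalise curly apostrophes
    text = _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(1)], text)
    text = re.sub(r"[^a-z\s]", " ", text)  # drop digits, punctuation, emoticons
    text = re.sub(r"\s+", " ", text).strip()
    return text or None


def drop_empty_comments(df, column="AnswerText"):
    """Clean one text column and remove rows whose comment ends up empty."""
    out = df.copy()
    out[column] = out[column].map(clean_comment)
    return out.dropna(subset=[column]).reset_index(drop=True)
```

With this sketch, "<3" becomes a missing value and its row is dropped, while "Speaker didn't show up :(" becomes "speaker did not show up" (lowercase, because casing is normalised in the same pass).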

Text Pre-processing – Spell Checker / Abbreviations
1. Perform a spell check on the comments data
- Run a spell checker on the comment text and correct any misspellings

2. Comments might contain words or phrases that are not present in any standard dictionary. Search engines and models do not recognize these tokens, so they need to be normalized with regular expressions and manually prepared data dictionaries.

Examples: acronyms, hashtags with attached words, etc.

Before    After
MS        Microsoft
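
A rough sketch of these normalization steps using only the Python standard library. The abbreviation dictionary holds just the MS example from the table; any further entries would have to be prepared by hand. The spelling helper uses difflib against a vocabulary you supply, which is a simple stand-in for a dedicated spell-checking library, and all function names here are hypothetical.

```python
import re
from difflib import get_close_matches

# Manually prepared dictionary; only the "MS" entry comes from the assignment.
ABBREVIATIONS = {"ms": "Microsoft"}


def expand_abbreviations(text, mapping=ABBREVIATIONS):
    """Replace whole-word abbreviations (case-insensitive) with their expansion."""
    if not mapping:
        return text
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in mapping) + r")\b", re.IGNORECASE
    )
    return pattern.sub(lambda m: mapping[m.group(1).lower()], text)


def split_hashtags(text):
    """Turn '#GreatSession' into 'Great Session' by splitting on case changes."""
    def _split(match):
        return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", match.group(1))
    return re.sub(r"#(\w+)", _split, text)


def correct_spelling(text, vocabulary, cutoff=0.8):
    """Swap each unknown word for its closest vocabulary entry, if one is close enough."""
    known = set(vocabulary)
    fixed = []
    for word in text.split():
        if word in known:
            fixed.append(word)
            continue
        match = get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        fixed.append(match[0] if match else word)
    return " ".join(fixed)
```

Hashtags need to be split before lowercasing, since the split relies on the capital letters. The cutoff value is a judgment call: a lower value corrects more typos but also risks rewriting valid words that are missing from the vocabulary.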



Text Pre-processing – Stopword Removal
1. Remove the standard English stopwords from the comments

Before                       After
It is a fantastic session    Fantastic session

2. Identify any domain-specific stopwords from the list of comments.
(Hint: Words that appear in most of the documents, or that are very rare, can be removed as well.)
(Example: Microsoft, etc.)
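
One way to approach both tasks, sketched under a few assumptions: the standard list is scikit-learn's built-in ENGLISH_STOP_WORDS (NLTK's list would work just as well), comments are already lowercased, and the thresholds max_doc_ratio and min_docs are hypothetical starting points to tune by inspection.

```python
from collections import Counter

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS


def remove_stopwords(text, extra=()):
    """Drop standard English stopwords plus any domain-specific extras."""
    stop = ENGLISH_STOP_WORDS | set(extra)
    return " ".join(word for word in text.split() if word not in stop)


def domain_stopword_candidates(comments, max_doc_ratio=0.5, min_docs=2):
    """Return (too_common, too_rare) word sets based on document frequency."""
    doc_freq = Counter()
    for comment in comments:
        doc_freq.update(set(comment.split()))  # count each word once per comment
    total = len(comments)
    too_common = {w for w, n in doc_freq.items() if n / total > max_doc_ratio}
    too_rare = {w for w, n in doc_freq.items() if n < min_docs}
    return too_common, too_rare
```

The candidate sets are only suggestions. A frequent word such as "great" carries sentiment, so review the list manually before passing it to remove_stopwords through the extra argument.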



Text Pre-processing – Lemmatization
1. Text Lemmatization
- Lemmatization is a systematic, step-by-step procedure for obtaining the root form (lemma) of a word

Before          After
am, are, is     be
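
To make the idea concrete, here is a toy lookup-based sketch. It is only an illustration of the mapping from inflected forms to a lemma: the lookup table is a hypothetical hand-written sample, whereas a real lemmatizer (for example the ones shipped with NLTK or spaCy) relies on a full vocabulary and morphological analysis.

```python
# Hypothetical sample table; the "am, are, is -> be" entries mirror the example above.
LEMMA_LOOKUP = {
    "am": "be",
    "are": "be",
    "is": "be",
    "was": "be",
    "were": "be",
    "sessions": "session",
    "speakers": "speaker",
}


def lemmatize(text, lookup=LEMMA_LOOKUP):
    """Replace each word with its lemma when the lookup knows it."""
    return " ".join(lookup.get(word, word) for word in text.split())
```

Words that are missing from the table pass through unchanged, which is the main limitation of this approach and the reason to switch to a library lemmatizer for the actual assignment.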



Text Features – BOW/TF-IDF
1. Convert the comments data into features:
- Use BOW (Bag of Words)
- Use TF-IDF (Term Frequency-Inverse Document Frequency)
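
A short sketch of both feature types using scikit-learn's CountVectorizer and TfidfVectorizer with their default settings. The wrapper function build_features is a hypothetical convenience, and the input is assumed to be a list of already pre-processed comment strings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


def build_features(comments):
    """Fit BOW and TF-IDF vectorizers and return them with their matrices."""
    bow = CountVectorizer()
    tfidf = TfidfVectorizer()
    x_bow = bow.fit_transform(comments)      # raw term counts per comment
    x_tfidf = tfidf.fit_transform(comments)  # counts reweighted by rarity
    return bow, x_bow, tfidf, x_tfidf
```

Each matrix has one row per comment and one column per distinct term, so len(bow.vocabulary_) gives the vocabulary size after pre-processing. Note that the default tokenizer ignores single-character tokens.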



Keyword Extraction
Extract the keywords from the comments using the features:

1. Extract the top keywords (uni-grams and bi-grams) from all the comments

2. Create a word cloud and a bar plot showing the top keywords



Sentiment Analysis
You need to:

1. Capture the sentiment of each comment (Positive, Negative, or Neutral)

2. Calculate the overall sentiment score for the event

3. Create a bar chart showing the positive sentiment score for each Technology Pillar

Comment after preprocessing           Sentiment Score
The guy giving Pepsi demo awesome     positive
Great session informative             neutral
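
The three tasks can be sketched with a tiny word-list scorer. Everything here is an assumption made for illustration: the word lists are hypothetical samples, the overall score is defined as (positive - negative) / total because the assignment does not fix a formula, and the column names AnswerText and TechnologyPillar are taken from the dataset description. A trained or lexicon-based tool such as VADER or TextBlob would normally replace sentiment_label.

```python
import pandas as pd

# Hypothetical toy lexicons; a real run would use a proper sentiment tool.
POSITIVE = {"awesome", "fantastic", "excellent", "good", "helpful", "love"}
NEGATIVE = {"bad", "boring", "poor", "terrible", "confusing", "late"}


def sentiment_label(text):
    """Label a pre-processed comment as positive, negative, or neutral."""
    words = text.split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def overall_score(labels):
    """Assumed definition: (positives - negatives) / total, in [-1, 1]."""
    labels = list(labels)
    if not labels:
        return 0.0
    return (labels.count("positive") - labels.count("negative")) / len(labels)


def positive_share_by_pillar(df, text_col="AnswerText", pillar_col="TechnologyPillar"):
    """Fraction of positive comments per pillar, highest first."""
    is_positive = df[text_col].map(sentiment_label) == "positive"
    return is_positive.groupby(df[pillar_col]).mean().sort_values(ascending=False)
```

The Series returned by positive_share_by_pillar can be charted directly in the notebook with its .plot.bar() method, and its first index entry is the pillar with the highest positive share.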



Questions? (Give Snapshots from the notebook)

1. How many rows remain in the dataset after the initial cleaning?
2. Were any domain-specific stopwords found? (Give the top 3, if any.)
3. What is the overall sentiment score of the event?
4. Which Technology Pillar has the most positive sentiment score?
5. What is the vocabulary count for the dataset? (Number of distinct words after pre-processing)
6. What are the top 5 keywords (uni-grams and bi-grams)?
7. What is the most popular keyword in 'Data And AI'?
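
For the last question, a small sketch that filters the comments to one pillar before counting words. It assumes the comments are already pre-processed, that they live in the AnswerText column, and that 'Data And AI' is a value of TechnologyPillar; the function name is hypothetical.

```python
from collections import Counter

import pandas as pd


def top_keyword_for_pillar(df, pillar, text_col="AnswerText", pillar_col="TechnologyPillar"):
    """Return the most frequent (word, count) within one pillar, or None."""
    subset = df.loc[df[pillar_col] == pillar, text_col].dropna()
    counts = Counter(word for comment in subset for word in comment.split())
    return counts.most_common(1)[0] if counts else None
```

For example, top_keyword_for_pillar(df, "Data And AI") returns the single most frequent word for that pillar, with ties broken by first appearance.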



Output Required

1. A Jupyter Notebook with all the steps performed

2. A document with the answers to the questions, including snapshots from the notebook



Discussion
