
180220131001 ABHIJEET KUMAR DUBEY

Assignment 2
Python for Data Science

1. Explain Bag-of-words in NLP.

The bag-of-words model is a simplifying representation used in natural language processing (NLP) and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.

In this model, a document is thought of as a bag of words. Lists of words are created in the BoW process. These words aren't sentences, as grammar is ignored in their collection and construction. The words are often representative of the content of a sentence. While grammar and order of appearance are ignored, multiplicity is counted and may be used later to determine the focus points of the document.

The frequency of each term is tallied while the semantic relationships are ignored.

BoW may use multisets of documents or treat sentences as different documents, tallying counts for each. Consider these two sentences, treated as documents D1 and D2:

D1: Jill is fond of Jack but Jack is fond of his neighbour's dog, Sam.

D2: Sam is more fond of the peanut butter sandwiches Jack always has.

The shared vocabulary, in order of first appearance, is:

Jill, is, fond, of, Jack, but, his, neighbour's, dog, Sam, more, the, peanut, butter, sandwiches, always, has

Counting how many times each vocabulary word occurs in each document gives:

D1: 1 2 2 2 2 1 1 1 1 1 0 0 0 0 0 0 0

D2: 0 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1
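A minimal hand-rolled sketch of these two count vectors in Python, using collections.Counter over the shared vocabulary (the naive tokenizer below, which only lowercases and strips punctuation, is an assumption for illustration):

from collections import Counter

d1 = "Jill is fond of Jack but Jack is fond of his neighbour's dog, Sam."
d2 = "Sam is more fond of the peanut butter sandwiches Jack always has."

def tokens(text):
    # Naive tokenization: lowercase, drop commas/periods, split on spaces.
    return text.lower().replace(",", "").replace(".", "").split()

# Vocabulary in order of first appearance across both documents.
vocab = list(dict.fromkeys(tokens(d1) + tokens(d2)))

for doc in (d1, d2):
    counts = Counter(tokens(doc))
    print([counts[w] for w in vocab])  # the D1 and D2 vectors above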


Similar to BoW, bag of features (BoF) can be applied to other information such as images. BoF analyzes an image's pixel data, such as the level from 0-255 for each pixel's red, blue, green and luminance values. By understanding how an image's pixels are positioned and the relative values of each pixel, an algorithm can be used to pick out patterns or textures.

Application

In practice, the Bag-of-words model is mainly used as a tool of feature generation. After transforming the text into a "bag of words", we can calculate various measures to characterize the text. The most common type of feature calculated from the Bag-of-words model is term frequency, namely, the number of times a term appears in the text. For the example above, we can construct the following two lists to record the term frequencies of all the distinct words (one list per document, ordered as in the shared vocabulary):


(1) [1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

(2) [0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

Each entry of the lists refers to the count of the corresponding word in the vocabulary (this is also the histogram representation). For example, in the first list (which represents document 1), the first two entries are "1, 2":

The first entry corresponds to the word "Jill", which is the first word in the vocabulary, and its value is "1" because "Jill" appears in the first document once.

The second entry corresponds to the word "is", which is the second word in the vocabulary, and its value is "2" because "is" appears in the first document twice.
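The same feature generation can be done with scikit-learn's CountVectorizer (a minimal sketch; note that its default tokenizer lowercases, drops punctuation and one-character tokens, and orders the vocabulary alphabetically, so the exact columns differ slightly from the hand count above):

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Jill is fond of Jack but Jack is fond of his neighbour's dog, Sam.",
    "Sam is more fond of the peanut butter sandwiches Jack always has.",
]

vectorizer = CountVectorizer()        # tokenize, lowercase, count
X = vectorizer.fit_transform(docs)    # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                         # term-frequency vectors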

2. Explain working of TF-IDF in detail.

Term frequency
In the case of the term frequency tf(t,d), the simplest choice is to use
the raw count of a term in a document, i.e., the number of times that
term t occurs in document d. If we denote the raw count by ft,d, then the
simplest tf scheme is tf(t,d) = ft,d. Other possibilities include

 Boolean "frequencies": tf(t,d) = 1 if t occurs in d and 0 otherwise;
 term frequency adjusted for document length: tf(t,d) = ft,d ÷ (number of words in d);
 logarithmically scaled frequency: tf(t,d) = log(1 + ft,d);
 augmented frequency, to prevent a bias towards longer documents, e.g. the raw frequency divided by the raw frequency of the most frequently occurring term in the document: tf(t,d) = 0.5 + 0.5 × ft,d ÷ (maximum ft′,d over all terms t′ in d). A small sketch of these schemes follows below.
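A plain-Python sketch of the tf weighting schemes listed above (a minimal illustration; the 0.5 constants in the augmented variant follow the common textbook form, and the argument names are ours: raw_count is ft,d, doc_len is the number of words in d, max_count is the raw frequency of the most frequent term in d):

import math

def tf_variants(raw_count, doc_len, max_count):
    return {
        "raw": raw_count,                        # tf = f
        "boolean": 1 if raw_count > 0 else 0,    # presence/absence
        "length_adjusted": raw_count / doc_len,  # f / |d|
        "log_scaled": math.log(1 + raw_count),   # log(1 + f)
        "augmented": 0.5 + 0.5 * raw_count / max_count,
    }

print(tf_variants(raw_count=3, doc_len=100, max_count=7))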


Inverse document frequency
The inverse document frequency is a measure of how much information the word provides, i.e., whether it is common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word (obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient):

idf(t,D) = log(N ÷ nt)

where N is the total number of documents in the corpus D and nt is the number of documents in which the term t appears.

TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across the set of documents, i.e., tf-idf(t,d,D) = tf(t,d) × idf(t,D).

It has many uses, most importantly in automated text analysis, and is very useful for scoring words in machine learning algorithms for Natural Language Processing (NLP).

TF-IDF (term frequency-inverse document frequency) was invented for document search and information retrieval. A word's score increases proportionally to the number of times the word appears in a document, but is offset by the number of documents that contain the word. So, words that are common in every document, such as "this", "what" and "if", rank low even though they may appear many times, since they don't mean much to that document in particular.

However, if the word "bug" appears many times in a document, while not appearing many times in others, it probably means that it is very relevant. For example, if what we're doing is trying to find out which topics some NPS responses belong to, the word "bug" would probably end up being tied to the topic Reliability, since most responses containing that word would be about that topic.
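A minimal sketch of this scoring with scikit-learn's TfidfVectorizer (one common implementation; the documents below are made-up NPS-style responses, and scikit-learn uses a smoothed idf formula, so the exact numbers are implementation-specific):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the app crashed with a bug on startup",
    "great design and easy to use",
    "another bug made the app freeze",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # rows: documents, columns: terms

# Rare-but-frequent terms like "bug" outscore ubiquitous ones like "the".
terms = vectorizer.get_feature_names_out()
for term, score in zip(terms, X.toarray()[0]):
    if score > 0:
        print(f"{term}: {score:.3f}")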
3. Write the differences between RDBMS and NoSQL databases.

RDBMS (Relational Database Management System) vs NoSQL (Not Only SQL):

1. RDBMS stores data in a completely structured way. NoSQL stores data in an unstructured way.
2. The amount of data stored in an RDBMS depends on the physical memory of the system; in other words, it is vertically scalable. In NoSQL there is no such limit, as you can scale it horizontally.
3. In an RDBMS, the schema represents the logical view in which the data is organized and tells how the relations are associated. NoSQL databases mostly work on open-source development models.
4. For defining and manipulating data, an RDBMS uses the structured query language (SQL), which is very powerful. NoSQL uses UnQL (unstructured query language), which is focused on collections of documents and varies from database to database.
5. RDBMS database examples: MySQL, Oracle, SQLite, Postgres and MS-SQL. NoSQL database examples: MongoDB, BigTable, Redis, RavenDB, Cassandra, HBase, Neo4j and CouchDB.
6. As for the type of data, an RDBMS is not the best fit for hierarchical data storage. NoSQL is the best fit for hierarchical data storage because it follows the key-value pair way of storing data, similar to JSON; HBase is an example of this.
7. Scalability: an RDBMS database is vertically scalable, so increasing load is managed by adding CPU, RAM or SSD to a single server. NoSQL is horizontally scalable, so large traffic is handled by adding a few more servers. RDBMS is best suited for highly transactional applications and is more stable, promising atomicity and integrity of the data. NoSQL still relies on community support, and only limited experts are available for large-scale NoSQL deployments.
8. Properties: RDBMS follows the ACID properties (Atomicity, Consistency, Isolation, Durability). NoSQL follows Brewer's CAP theorem (Consistency, Availability and Partition tolerance).
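To make the query-language difference (point 4) concrete, here is a hedged Python sketch contrasting the two styles: SQL via the standard-library sqlite3 module, and a document-store query via pymongo (the users table/collection and its fields are made up for illustration, and the pymongo part assumes a MongoDB server is running):

import sqlite3

# Relational style: fixed schema, SQL queries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.execute("INSERT INTO users VALUES ('Jill', 30)")
print(conn.execute("SELECT name FROM users WHERE age > 25").fetchall())

# Document style (requires pymongo and a running MongoDB server):
# from pymongo import MongoClient
# users = MongoClient().testdb.users
# users.insert_one({"name": "Jill", "age": 30, "tags": ["admin"]})
# print(list(users.find({"age": {"$gt": 25}}, {"name": 1})))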


4. Explain stemming and stop words in NLP.

Stemming is a normalization technique used in Natural Language Processing which helps reduce the number of computations. There are stemmers in NLP libraries, such as the Porter Stemmer, the Snowball Stemmer, etc.

For example, if we have the word 'walking', we remove the suffix 'ing' and reduce the word to 'walk'.

Stemming is used to reduce the dimensionality of data, which is helpful for ML models. In simple words, if there are words like walk, walks, walked and walking, all these words are different but contextually similar. We replace all of them with 'walk' by removing the suffixes.

There are two problems with stemming: over-stemming and under-stemming. Over-stemming means that if we stem words like University, Universe and Universal, we get the same root for all three even though they mean different things. Under-stemming means that stemming turns alumnus into alumnu, alumni into alumni and alumnae into alumna; even though alumnus, alumni and alumnae all mean the same thing, they are rooted to different words. A small sketch of both problems follows below.
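A minimal stemming sketch with NLTK's PorterStemmer (assumes nltk is installed; the word lists are the examples from the text):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Contextually similar variants collapse to one stem.
for word in ["walk", "walks", "walked", "walking"]:
    print(word, "->", stemmer.stem(word))  # all reduce to "walk"

# Over-stemming: distinct meanings collapse to the same stem.
print([stemmer.stem(w) for w in ["university", "universe", "universal"]])

# Under-stemming: related words end up with different stems.
print([stemmer.stem(w) for w in ["alumnus", "alumni", "alumnae"]])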

Stop Words: A stop word is a commonly used word (such as “the”, “a”,
“an”, “in”) that a search engine has been programmed to ignore, both
when indexing entries for searching and when retrieving them as the
result of a search query.

We would not want these words to take up space in our database or take up valuable processing time. For this, we can remove them easily by storing a list of words that we consider to be stop words. NLTK (Natural Language Toolkit) in Python has lists of stop words stored for 16 different languages. You can find them in the nltk_data directory.
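A minimal stop-word removal sketch with NLTK (assumes the stopwords and punkt resources have been downloaded once via nltk.download; the sample sentence is made up):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))

sentence = "This is an example showing off stop word filtration."
tokens = word_tokenize(sentence)

# Keep only the tokens that are not in the stop-word list.
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)  # ['example', 'showing', 'stop', 'word', 'filtration', '.']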


5. Explain boxplot, pie chart and histogram using graphs.

Boxplots are a measure of how well distributed the data in a data set is. The plot divides the data set at the three quartiles: it represents the minimum, maximum, median, first quartile and third quartile of the data set. It is also useful for comparing the distribution of data across data sets, by drawing a boxplot for each of them.
Drawing a Box Plot
Boxplots can be drawn by calling Series.plot.box() and DataFrame.plot.box(), or DataFrame.boxplot(), to visualize the distribution of values within each column.
For instance, here is a boxplot representing five trials of 10 observations of a uniform random variable on [0, 1).
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(10, 5), columns=['A', 'B', 'C', 'D', 'E'])
df.plot.box(grid=True)
Output

A Pie Chart can only display one series of data. Pie charts show the size of items (called wedges) in one data series, proportional to the sum of the items. The data points in a pie chart are shown as a percentage of the whole pie.
The Matplotlib API has a pie() function that generates a pie diagram representing data in an array. The fractional area of each wedge is given by x/sum(x). If sum(x) < 1, then the values of x give the fractional area directly and the array will not be normalized. The resulting pie will have an empty wedge of size 1 - sum(x).
The pie chart looks best if the figure and axes are square, or the Axes aspect is equal.
Input

from matplotlib import pyplot as plt

fig = plt.figure()
ax = fig.add_axes([0, 0, 1, 1])
ax.axis('equal')  # equal aspect so the pie is drawn as a circle
langs = ['C', 'C++', 'Java', 'Python', 'PHP']
students = [23, 17, 35, 29, 12]
ax.pie(students, labels=langs, autopct='%1.2f%%')
plt.show()

Output

A histogram is an accurate representation of the distribution of numerical data. It is an estimate of the probability distribution of a continuous variable, and is a kind of bar graph. To construct a histogram, follow these steps −

• Bin the range of values, i.e., divide the entire range of values into a series of intervals.
• Count how many values fall into each interval.

The bins are usually specified as consecutive, non-overlapping intervals of a variable.
The matplotlib.pyplot.hist() function plots a histogram. It computes and draws the histogram of x.
from matplotlib import pyplot as plt
import numpy as np

fig, ax = plt.subplots(1, 1)
a = np.array([22, 87, 5, 43, 56, 73, 55, 54, 11, 20, 51, 5, 79, 31, 27])
ax.hist(a, bins=[0, 25, 50, 75, 100])
ax.set_title("histogram of result")
ax.set_xticks([0, 25, 50, 75, 100])
ax.set_xlabel('marks')
ax.set_ylabel('no. of students')
plt.show()
Output

~Thank You~
