Python For Data Science: 1. Explain Bag-Of-Words in NLP
Assignment 2
Python for Data Science
Jill is fond of Jack but Jack is fond of his neighbour’s dog, Sam. Sam is
more fond of the peanut butter sandwiches Jack always has.
Jill is fond of Jack but his neighbour’s dog Sam more the peanut butter
sandwiches always has
D1: 1 2 2 2 2 1 1 1 1 1 0 0 0 0 0 0 0
D2: 0 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1
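The counting step behind these vectors can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper names (tokenize, build_vocabulary, bag_of_words) are hypothetical, and it assumes lowercasing and a simple letters-and-apostrophes tokenizer, which the text above does not specify.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: lowercase everything and keep runs of letters/apostrophes.
    return re.findall(r"[a-z']+", text.lower())

def build_vocabulary(documents):
    # Collect unique words in order of first appearance across all documents.
    vocab, seen = [], set()
    for doc in documents:
        for token in tokenize(doc):
            if token not in seen:
                seen.add(token)
                vocab.append(token)
    return vocab

def bag_of_words(text, vocab):
    # One count per vocabulary word; words absent from the text get 0.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]

docs = [
    "Jill is fond of Jack but Jack is fond of his neighbour's dog, Sam.",
    "Sam is more fond of the peanut butter sandwiches Jack always has.",
]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocab) for doc in docs]

print(vocab)
for name, vector in zip(("D1", "D2"), vectors):
    print(name, vector)
```

Word order inside each sentence is discarded; only the per-word counts survive, which is what makes this a "bag" of words.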
180220131001 ABHIJEET KUMAR DUBEY
Application
(1) [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
(2) [0, 1, 1, 1, 0, 1, 0, 1, 1, 1]
Each entry of these vectors is the count of the corresponding word in the
vocabulary list (this is also the histogram representation). For example, in
the first vector (which represents document 1), the first two entries are "1,2":
The first entry corresponds to the word "John", which is the first word in
the list, and its value is "1" because "John" appears in the first
document once.
The second entry corresponds to the word "likes", which is the second
word in the list, and its value is "2" because "likes" appears in the first
document twice.
Term frequency
In the case of the term frequency $\mathrm{tf}(t,d)$, the simplest choice is to use
the raw count of a term in a document, i.e., the number of times that
term $t$ occurs in document $d$. If we denote the raw count by $f_{t,d}$, then the
simplest tf scheme is $\mathrm{tf}(t,d) = f_{t,d}$. Other possibilities include
However, if the word Bug appears many times in a document, while not
appearing many times in others, it probably means that it’s very
relevant. For example, if what we’re doing is trying to find out which
topics some NPS responses belong to, the word Bug would probably
end up being tied to the topic Reliability, since most responses
containing that word would be about that topic.
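The intuition above (a word that is frequent in one document but rare in the others is probably relevant) can be illustrated with a small sketch. The three responses are made-up examples, and the weighting used here, raw count multiplied by log(N / df), is one common choice assumed for illustration rather than a formula stated in this text.

```python
import math

# Hypothetical NPS-style responses, invented for this example.
responses = [
    "the app has a bug and the bug crashes the app",
    "the support team was helpful",
    "the price is fair",
]
corpus = [response.split() for response in responses]

def term_frequency(term, tokens):
    # Raw count: how many times the term occurs in one document.
    return tokens.count(term)

def inverse_document_frequency(term, corpus):
    # Assumed weighting: log(N / df), where df is the number of
    # documents that contain the term at least once.
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / df) if df else 0.0

def tf_idf(term, tokens, corpus):
    return term_frequency(term, tokens) * inverse_document_frequency(term, corpus)

for term in ("bug", "the"):
    print(term, round(tf_idf(term, corpus[0], corpus), 3))
```

Under this weighting, "the" scores zero because it occurs in every response, while "bug" gets a positive score because it is concentrated in a single response.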
3. Write the difference between RDBMS and NoSQL databases.
No. | RDBMS (Relational Database Management System) | NoSQL (Not Only SQL)
1 | It is a completely structured way of storing data. | It is an unstructured way of storing data.
2 | The amount of data stored in an RDBMS depends on the physical memory of the system; in other words, it is vertically scalable. | In NoSQL, capacity is not tied to one system; it can be scaled horizontally by adding more machines.
There are two problems with stemming: over-stemming and under-stemming.
Over-stemming means that words with different meanings are reduced to
the same stem. If we stem University, Universe and Universal, all three
give the same stem (univers with the Porter stemmer) even though they
mean different things. Under-stemming means that related words are
reduced to different stems: alumnus becomes alumnu, alumni stays
alumni, and alumnae becomes alumna, so even though all three words
mean the same, they are rooted to different words.
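Both failure modes can be reproduced with a deliberately crude suffix stripper. The sketch below is a toy written for illustration only: the rule list and the name toy_stem are hypothetical, and it is not the Porter algorithm or any library's stemmer.

```python
# Toy rule list, checked in order; the first matching suffix is removed.
SUFFIXES = ["ity", "al", "e", "s"]

def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Over-stemming: three different meanings collapse to one stem.
for word in ("University", "Universe", "Universal"):
    print(word, "->", toy_stem(word))

# Under-stemming: three related forms end up with three stems.
for word in ("alumnus", "alumni", "alumnae"):
    print(word, "->", toy_stem(word))
```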
Stop Words: A stop word is a commonly used word (such as “the”, “a”,
“an”, “in”) that a search engine has been programmed to ignore, both
when indexing entries for searching and when retrieving them as the
result of a search query.
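A minimal sketch of stop-word removal follows. The stop list contains only the four words mentioned above and the query string is invented; real systems use much longer lists, and the function name remove_stop_words is hypothetical.

```python
# Tiny stop list taken from the examples in the text; real lists are longer.
STOP_WORDS = {"the", "a", "an", "in"}

def remove_stop_words(text):
    # Assumption: whitespace tokenization and case-insensitive matching.
    return [token for token in text.lower().split() if token not in STOP_WORDS]

query = "The best pizza in a town"
print(remove_stop_words(query))
```

The same filter would be applied both when documents are indexed and when a query is processed, so that the two sides are compared on the same set of terms.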
A pie chart can display only one series of data. Pie charts show the
size of items (called wedges) in one data series, proportional to the sum
of the items. The data points in a pie chart are shown as a percentage
of the whole pie.
The Matplotlib API has a pie() function that generates a pie diagram
representing data in an array. The fractional area of each wedge is
given by x/sum(x). If sum(x) < 1 and normalization is turned off
(normalize=False in recent Matplotlib versions), the values of x give the
fractional area directly and the array is not normalized. The resulting
pie will have an empty wedge of size 1 - sum(x).
The pie chart looks best if the figure and axes are square, or the Axes
aspect is equal.
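The wedge arithmetic described above can be checked by hand. The sketch below only computes the fractions x/sum(x) and the corresponding angles in plain Python; the sizes and labels are made-up values, and wedge_fractions is a hypothetical helper, not part of Matplotlib.

```python
def wedge_fractions(x):
    # Each wedge's share of the pie is its value divided by the total.
    total = sum(x)
    if total <= 0:
        raise ValueError("values must have a positive sum")
    return [value / total for value in x]

# Hypothetical data series for illustration.
labels = ["Python", "Java", "C++", "Other"]
sizes = [15, 30, 45, 10]

fractions = wedge_fractions(sizes)
angles = [fraction * 360 for fraction in fractions]

for label, fraction, angle in zip(labels, fractions, angles):
    print(f"{label}: {fraction:.0%} of the pie, {angle:.0f} degrees")
```

Passing the same sizes and labels to Matplotlib's pie() function (for example, plt.pie(sizes, labels=labels)) should draw wedges in these proportions.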
A histogram is an accurate representation of the distribution of
numerical data. It is an estimate of the probability distribution of a
continuous variable. It is a kind of bar graph. To construct a histogram,
follow these steps −
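One common way to build a histogram, assuming equal-width bins (the bin rule is an assumption here, not something specified in this text), is to split the range of the values into intervals and count how many values fall into each. The function name histogram_counts and the sample data are hypothetical.

```python
def histogram_counts(values, num_bins):
    # Split [min, max] into num_bins equal-width intervals.
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("all values are identical; cannot form bins")
    width = (high - low) / num_bins
    counts = [0] * num_bins
    for value in values:
        # The maximum value is placed in the last bin instead of a new one.
        index = min(int((value - low) / width), num_bins - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(num_bins + 1)]
    return edges, counts

# Made-up sample data for illustration.
data = [1, 2, 2, 3, 5, 7, 8, 8, 9, 10]
edges, counts = histogram_counts(data, num_bins=3)
print(edges)
print(counts)
```

Drawing one bar per bin, with height equal to the count, gives the histogram.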
~Thank You~