Natural Language Processing (NLP)


Applications of NLP

- Sentiment analysis
- Text classification
- Smart assistants
- Visual Q/A
- Machine translation
- Topic modelling
- Spell checkers
- Chatbots
- Text generation
Best language models in NLP

① BERT -> Bidirectional Encoder Representations from Transformers
② RoBERTa -> Robustly Optimized BERT Pretraining Approach
③ XLNet
④ OpenAI's GPT-3
⑤ ALBERT
⑥ T5
NLP sits at the intersection of computer science, AI and human language.

NLP is a branch of AI which deals with communication between human language and computers; it allows a machine/computer to understand human language.
Most popular libraries of NLP

① Hugging Face (HF) Transformers
② spaCy
③ NLTK
④ Gensim
⑤ Fairseq
Learning settings: self-supervised, zero-shot, few-shot, unsupervised, semi-supervised

Classification
↳ Binary classification (2 classes)
↳ Multiclass classification (> 2 classes)


Levels of linguistic structure

- Pragmatics and discourse: meaning in the context of communication
- Semantics: literal meaning of words, phrases and sentences
- Syntax: study of the structure of words & phrases
- Morphology: study of the structure of words
- Phonology and phonetics: linguistic structure of sound (phonemes) and the making of speech (out of NLP's scope)
Morphology -> prefix + stem (root) + suffix

Ex: uncomfortable = un (prefix) + comfort (stem/root) + able (suffix)

Morphological processes: (1) Inflection (2) Derivation (3) Compounding

(1) Inflection
stem -> plural, past, progressive
jump -> jumps, jumped, jumping
like -> likes, liked, liking

(2) Derivation
paint -> painter
print -> reprint, printable

Note: corpora -> plural of corpus (text data)

(3) Compounding
stem + stem
house + boat -> houseboat
boat + house -> boathouse
(order matters: houseboat and boathouse mean different things)
Morphological ambiguity => same form, different meanings

Ex: "Will will will Will's will"
will -> person's name (noun)
will -> modal verb
will -> verb
will -> desire (noun)

Stages of the NLP pipeline

Segmentation -> Tokenization -> Stemming -> Lemmatization -> POS (parts of speech) tagging -> NER (named entity recognition)

Stemming algorithms
↳ Porter stemming
↳ Lancaster stemming
↳ Snowball stemming
NLP PIPELINE (in wider perspective) / Lifecycle

Data acquisition (data collection) -> Text extraction & text cleaning -> Pre-processing -> Feature engineering -> Modelling -> Evaluation -> Deployment -> Maintaining & updating the model

Evaluation feeds back into the earlier stages for improvement.
Probability for language models
↳ Joint probability
↳ Conditional probability
↳ Marginal probability

Chain rule:
$P(A, B) = P(A)\,P(B \mid A)$
$P(A, B, C) = P(A)\,P(B \mid A)\,P(C \mid A, B)$
$P(A, B, C, D) = P(A)\,P(B \mid A)\,P(C \mid A, B)\,P(D \mid A, B, C)$

Markov's assumption -> n-gram
unigram -> $P(w_i)$
bigram -> $P(w_i \mid w_{i-1})$
trigram -> $P(w_i \mid w_{i-2}, w_{i-1})$
Basics of probability theory

Ex: tossing two coins simultaneously, possible outcomes S = {TT, TH, HT, HH}

1. Sample space
2. Random experiment
3. Favourable event
4. Success
5. Random variable
Probability distributions

Discrete:
1. General discrete
2. Binomial / Bernoulli
3. Hypergeometric
4. Geometric
5. Poisson

Continuous:
1. General continuous
2. Uniform distribution
3. Exponential distribution
4. Normal / Gaussian
5. Standard normal

Random variable (RV)
↳ Discrete RV (DRV) -> pmf (prob. mass function)
↳ Continuous RV (CRV) -> pdf (prob. density function)

Ex:
X           : 0    1    2
f(x) = P(X) : 1/4  1/2  1/4
F(x)        : 1/4  3/4  1

(1) Expectation: $E(X) = \sum x\,f(x)$ = mean of X = average of X
(2) Variance: $\mathrm{var}(X) = E(X^2) - (E(X))^2$
(3) Standard deviation: $S = \sqrt{\mathrm{var}(X)}$
Q. Consider the following pmf of a random variable X:

$P(x, q) = q$ if $x = 0$; $1 - q$ if $x = 1$; $0$ otherwise.

If q = 0.4, variance = ?
Q. A machine produces 0, 1 or 2 defective pieces in a day with associated probabilities of 1/6, 2/3 and 1/6 respectively. Then the mean value and the variance of the no. of defective pieces produced by the machine in a day, respectively, are

(a) 1, 1/3   (b) 1/3, 1   (c) 1, 4/3   (d) 4/3, 1/3

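Ans 1:
$E(X) = 0 \times 0.4 + 1 \times 0.6 = 0.6$
$\mathrm{var}(X) = E(X^2) - (E(X))^2 = (0 \times 0.4 + 1 \times 0.6) - 0.36 = 0.6 - 0.36 = 0.24$

Ans 2:
$E(X) = \sum x\,f(x) = 0 \times \tfrac{1}{6} + 1 \times \tfrac{2}{3} + 2 \times \tfrac{1}{6} = 1$
$\mathrm{var}(X) = E(X^2) - (E(X))^2 = \left(0 + \tfrac{2}{3} + \tfrac{4}{6}\right) - 1 = \tfrac{1}{3}$
So the answer is option (a).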
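The two answers above can be checked numerically. The following is a minimal Python sketch, assuming the pmf is given as a dictionary from value to probability; the helper name `mean_and_variance` is made up for illustration and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def mean_and_variance(pmf):
    """Return (E[X], var[X]) for a pmf given as {value: probability}."""
    assert sum(pmf.values()) == 1, "probabilities must sum to 1"
    mean = sum(x * p for x, p in pmf.items())
    second_moment = sum(x * x * p for x, p in pmf.items())
    # var(X) = E(X^2) - (E(X))^2
    return mean, second_moment - mean ** 2


# pmf with q = 0.4: P(X=0) = 0.4, P(X=1) = 0.6
bernoulli = {0: Fraction(2, 5), 1: Fraction(3, 5)}
# defective pieces per day: 0, 1 or 2 with probabilities 1/6, 2/3, 1/6
defects = {0: Fraction(1, 6), 1: Fraction(2, 3), 2: Fraction(1, 6)}

print(mean_and_variance(bernoulli))
print(mean_and_variance(defects))
```

Using `Fraction` keeps the results exact, so they can be compared directly with the hand-worked values.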
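Q. The function p(x) is given by $p(x) = A / x^{\mu}$, where A and μ are constants with μ > 1 and 1 ≤ x < ∞, and p(x) = 0 for −∞ < x < 1. For p(x) to be a probability density function, the value of A should be equal to

(A) μ − 1   (B) μ + 1   (C) 1/(μ − 1)   (D) 1/(μ + 1)

Ans:
$\int_{-\infty}^{\infty} p(x)\,dx = 1 \Rightarrow \int_{1}^{\infty} A\,x^{-\mu}\,dx = 1$
$\Rightarrow A \left[ \frac{x^{-\mu+1}}{-\mu+1} \right]_{1}^{\infty} = 1$
$\Rightarrow \frac{A}{\mu - 1} = 1 \Rightarrow A = \mu - 1$
So the answer is option (A).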

Properties of a continuous distribution

X is a continuous random variable (CRV) with pdf f(x).

(i) If X is a CRV, then the probability at a single point is 0:
$P(X = a) = 0$, $P(X = b) = 0$
(ii) If X is a CRV,
$P(a < X \le b) = \int_a^b f(x)\,dx$ = area under the density function curve.
(iii) $P(-\infty < X < \infty) = \int_{-\infty}^{\infty} f(x)\,dx = 1$ (total area)
(iv) $f(x) \ge 0$, always non-negative.
0. h is
If a crv.
having pdf is
given by
(02n<l

[
f(x)
ca2
=

in,'urse
(i) Find c

(ii)
p((((X(3/2)
p(x <3(u)
(iii)
(iv) p(n)(2)
*

1
(i)
Sf(u)dn
=

(nz.du (Yn.an
-

+
1
=

(! ==
y

1
=> +
z

3 5 [(x2 EJ 1
-

+ =

=>
5 E 1
=

& taa
a/

c 6(1)
=

3
finnan n.de
(ii) PCnF

2!
*

E1
= +

5
=
-

z q -
+

(5 -

zy F
+
-

z]
1
r
-

5x
=

5x En e
(iii) p =
junz.du
o
+

I (n.dn

3/2

-3!! +
2 1,
5 19 E
-
+
=

12
8 + 27
-

4x -

=
1xz2
(iv) P(x=( 2)
Scuranton
=

an

1/2

=
-1, + In
5
=
-

zy
+

c-
=

(5 -

E E
+
-

1]
1x
=

7.36
=

-x+
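Q. For the function f(x) = a + bx, 0 ≤ x ≤ 1, to be a valid pdf, which one of the following statements is true?

(a) a = 1, b = 4   (b) a = 0.5, b = 1   (c) a = 0, b = 1   (d) a = 1, b = −1

Ans:
$\int_0^1 (a + bx)\,dx = 1 \Rightarrow \left[ ax + \frac{b x^2}{2} \right]_0^1 = 1 \Rightarrow a + \frac{b}{2} = 1$
Only a = 0.5, b = 1 satisfies this, so the answer is option (b).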

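Q. Find the value of h. (The pdf is given as a figure: three regions with areas A1, A2, A3.)

$A_1 + A_2 + A_3 = 1$
$\Rightarrow \tfrac{1}{2} \times 1 \times h + \tfrac{1}{2} \times 1 \times 2h + \tfrac{1}{2} \times 1 \times 3h = 1$
$\Rightarrow h = \tfrac{1}{3}$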

9. g<X70

Son
-

# (x) =

.0=nc)

j
3n 1
-
1[n<2
9

(2=xc
L 1
(i) find P(z<X(3(2)
(ii)p(X (2)=

(iii)p(x(3/2)
Q. If X is uniformly distributed in (0, 10), then find
(i) f(x)
(ii) mean, variance, std deviation
(iii) P(2 < X < 6)
(iv) P(0 < X < 5)
(v) P(X = 3)
(vi) P(X > 8)
Q.2 If a random variable is uniformly distributed with mean 1 and variance 4/3, then P(X < 0) = ?

random variable X
is
9.3 The pdf of
a
22) 0212
for
(4
-
f(x) =
-

otherwise
· ,
=

Find mean
the
Uniform Density

If X is a uniformly distributed continuous random variable, then the probability density function in the finite interval (a, b) is given as

$f(x) = \frac{1}{b - a}$ for $a \le x \le b$; $0$ otherwise.

Properties of a pdf:
1. $f(x) \ge 0$
2. $\int_{-\infty}^{\infty} f(x)\,dx = 1$
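Ans 1 (X uniform in (0, 10)):

(i) $f(x) = \frac{1}{b - a} = \frac{1}{10}$ for $0 < x < 10$; $0$ otherwise.

(ii) Mean:
$E(X) = \int_a^b x\,f(x)\,dx = \frac{1}{b - a} \left[ \frac{x^2}{2} \right]_a^b = \frac{b^2 - a^2}{2(b - a)} = \frac{a + b}{2}$
∴ mean = (0 + 10)/2 = 5

Variance:
$\mathrm{var}(X) = E(X^2) - (E(X))^2$
$E(X^2) = \int_a^b \frac{x^2}{b - a}\,dx = \frac{b^3 - a^3}{3(b - a)} = \frac{a^2 + ab + b^2}{3}$
$\mathrm{var}(X) = \frac{a^2 + ab + b^2}{3} - \frac{(a + b)^2}{4} = \frac{(b - a)^2}{12} = \frac{100}{12} = \frac{25}{3}$

Std deviation = $\sqrt{\mathrm{var}(X)} = \sqrt{25/3} = 5/\sqrt{3} \approx 2.89$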
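The closed forms for a uniform random variable on (a, b) are easy to evaluate in code. Below is a small Python sketch, assuming X ~ Uniform(0, 10); the function names `uniform_stats` and `uniform_prob` are hypothetical helpers, not part of any library.

```python
import math


def uniform_stats(a, b):
    """Mean, variance and standard deviation of Uniform(a, b)."""
    mean = (a + b) / 2
    variance = (b - a) ** 2 / 12
    return mean, variance, math.sqrt(variance)


def uniform_prob(a, b, lo, hi):
    """P(lo < X < hi) for X ~ Uniform(a, b): overlap length / (b - a)."""
    lo, hi = max(lo, a), min(hi, b)
    return max(hi - lo, 0) / (b - a)


mean, variance, std = uniform_stats(0, 10)
p_2_to_6 = uniform_prob(0, 10, 2, 6)
p_0_to_5 = uniform_prob(0, 10, 0, 5)
p_equals_3 = uniform_prob(0, 10, 3, 3)   # a single point has probability 0
p_above_8 = uniform_prob(0, 10, 8, math.inf)
print(mean, variance, std, p_2_to_6, p_0_to_5, p_equals_3, p_above_8)
```

Every probability is just the length of the overlap between the query interval and (a, b), divided by b − a.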

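Ans 2: mean = 1 and variance = 4/3

$\frac{a + b}{2} = 1 \Rightarrow a + b = 2$
$\frac{(b - a)^2}{12} = \frac{4}{3} \Rightarrow b - a = 4$
∴ a = −1, b = 3

$P(X < 0) = \frac{0 - (-1)}{3 - (-1)} = \frac{1}{4}$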
f(z)E(4 n2) for
13
0 n22
: -

0
= potheswise

=>
(n. (4 -22) an

E(x)
1
=

(f60s(25-1 - 25 u) - cs(2++1 + 25+


·
2+

d0
28)


sin
[
(2x7, -2x52) sinpiticos
I
-

=
-

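Continuous exponential pd

pdf: $f(x) = \lambda e^{-\lambda x}$ for $x > 0$; $0$ for $x \le 0$

Mean:
$E(X) = \int_0^{\infty} x\,f(x)\,dx = \int_0^{\infty} x\,\lambda e^{-\lambda x}\,dx = \left[ -x e^{-\lambda x} - \frac{e^{-\lambda x}}{\lambda} \right]_0^{\infty} = \frac{1}{\lambda}$

Variance:
$\mathrm{var}(X) = E(X^2) - (E(X))^2$
$E(X^2) = \int_0^{\infty} x^2\,\lambda e^{-\lambda x}\,dx = \frac{2}{\lambda^2}$
$\mathrm{var}(X) = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}$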

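Q. If the call duration is exponentially distributed with λ = 1/10, then find the probability that the call duration
(i) exceeds 5 mins
(ii) is between 3 & 5 mins
(iii) is less than 8 mins
(iv) is greater than the mean call duration.

Ans:
$f(x) = \frac{1}{10} e^{-x/10}$ for $x \ge 0$; $0$ for $x < 0$

(i) $P(X > 5) = \int_5^{\infty} \frac{1}{10} e^{-x/10}\,dx = e^{-5/10} \approx 0.6065$
(ii) $P(3 < X < 5) = \int_3^5 \frac{1}{10} e^{-x/10}\,dx = e^{-3/10} - e^{-5/10} \approx 0.1343$
(iii) $P(X < 8) = \int_0^8 \frac{1}{10} e^{-x/10}\,dx = 1 - e^{-8/10} \approx 0.5507$
(iv) mean = 1/λ = 10, so $P(X > 10) = e^{-1} \approx 0.3679$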
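The exponential probabilities for the call-duration question can be sketched in Python from the CDF $F(x) = 1 - e^{-\lambda x}$. This assumes λ = 1/10 per minute as stated in the question; `exp_cdf` is a hypothetical helper name.

```python
import math


def exp_cdf(x, rate):
    """P(X <= x) for an exponential random variable with the given rate."""
    return 1 - math.exp(-rate * x) if x > 0 else 0.0


rate = 1 / 10  # lambda, so the mean call duration is 1/lambda = 10 minutes

p_exceeds_5 = 1 - exp_cdf(5, rate)                    # e^(-5/10)
p_between_3_and_5 = exp_cdf(5, rate) - exp_cdf(3, rate)
p_less_than_8 = exp_cdf(8, rate)
p_exceeds_mean = 1 - exp_cdf(1 / rate, rate)          # e^(-1)

print(p_exceeds_5, p_between_3_and_5, p_less_than_8, p_exceeds_mean)
```

Each interval probability is a difference of two CDF values, which mirrors the definite integrals of the pdf.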

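Q. Let X be an ERV with mean = 1. Find P(X > 2 | X > 1).

Ans: mean = 1/λ = 1, so λ = 1
$f(x) = e^{-x}$ for $x \ge 0$; $0$ for $x < 0$

$P(X > 2) = \int_2^{\infty} e^{-x}\,dx = e^{-2}$
$P(X > 1) = \int_1^{\infty} e^{-x}\,dx = e^{-1}$
$P(X > 2 \mid X > 1) = \frac{P(X > 2)}{P(X > 1)} = \frac{e^{-2}}{e^{-1}} = e^{-1} = \frac{1}{e}$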

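Q. Let X1 & X2 be two independent ERVs with means 0.5 & 0.25 respectively. What is the distribution of Y = min(X1, X2)?

Ans:
$\lambda_1 = \frac{1}{0.5} = 2$, $\lambda_2 = \frac{1}{0.25} = 4$
$f(x_1) = 2e^{-2x_1}$, $f(x_2) = 4e^{-4x_2}$

$P(Y > y) = P(X_1 > y)\,P(X_2 > y) = e^{-2y}\,e^{-4y} = e^{-6y}$
So Y is exponentially distributed with λ = 2 + 4 = 6, i.e. mean 1/6.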
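$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$

If X & Y are independent variables,
$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$

Correlation coeff: $\rho = \frac{\mathrm{Cov}(X, Y)}{\sigma_X\,\sigma_Y}$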

Q. Consider two boxes, Box1 & Box2. Box1 contains 4 red and 6 black balls whereas Box2 contains 5 red and 5 black balls. Now a coin is tossed; if head occurs then one ball is randomly drawn from Box1, whereas if tail occurs then one ball is drawn from Box2.
(i) Find the probability of getting a red ball.
(ii) If the ball obtained is red, what is the prob that it comes from Box1?

Ans:
$P(B_1) = \tfrac{1}{2}$, $P(B_2) = \tfrac{1}{2}$

(i) $P(\text{Red}) = P(\text{Red} \cap B_1) + P(\text{Red} \cap B_2) = P(B_1)\,P(\text{Red} \mid B_1) + P(B_2)\,P(\text{Red} \mid B_2) = \tfrac{1}{2} \times \tfrac{4}{10} + \tfrac{1}{2} \times \tfrac{5}{10} = \tfrac{9}{20}$

(ii) $P(B_1 \mid \text{Red}) = \frac{P(B_1)\,P(\text{Red} \mid B_1)}{P(\text{Red})} = \frac{1/5}{9/20} = \frac{4}{9}$
Events

1) Mutually dependent
2) Mutually independent: $P(A \cap B) = P(A)\,P(B)$
3) Mutually exclusive (either head or tail)
4) Collectively exhaustive: $P(A) + P(B) + P(C) = 1$

Types of probability: marginal, conditional, joint

Joint prob:
$P(A \cap B) = P(A)\,P(B \mid A)$ (dependent)
$P(A \cap B) = P(A)\,P(B)$ (independent)

If $P(A \cap B) = 0$, then A & B are mutually exclusive.

Total probability:
$P(B) = P(A_1 \cap B) + P(A_2 \cap B) + P(A_3 \cap B) = P(A_1)\,P(B \mid A_1) + P(A_2)\,P(B \mid A_2) + P(A_3)\,P(B \mid A_3)$

Bayes' theorem:
$P(A_1 \mid B) = \frac{P(A_1)\,P(B \mid A_1)}{\sum_i P(A_i)\,P(B \mid A_i)}$
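Q. In a bolt factory, machines A, B & C manufacture 25%, 35% & 40% of the total output respectively. Of their outputs, 5%, 4% and 2% are defective. A bolt is drawn at random.
(i) Prob that it is defective. (Ans: 0.0345)
(ii) Prob that the defective bolt is drawn from A. (Ans: 0.3623)

Ans:
(i) $P(\text{def}) = P(A)\,P(\text{def} \mid A) + P(B)\,P(\text{def} \mid B) + P(C)\,P(\text{def} \mid C)$
$= \frac{25}{100} \times \frac{5}{100} + \frac{35}{100} \times \frac{4}{100} + \frac{40}{100} \times \frac{2}{100} = \frac{125 + 140 + 80}{10000} = \frac{345}{10000} = 0.0345$

(ii) $P(A \mid \text{def}) = \frac{P(A)\,P(\text{def} \mid A)}{P(\text{def})} = \frac{125}{345} \approx 0.3623$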
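The bolt-factory question is a direct use of total probability followed by Bayes' theorem. Here is a minimal Python sketch; it assumes defect rates of 5%, 4% and 2% for machines A, B and C (the values that reproduce the quoted answer 0.3623), and the function names are illustrative only.

```python
def total_probability(priors, likelihoods):
    """P(B) = sum over i of P(A_i) * P(B | A_i)."""
    return sum(priors[k] * likelihoods[k] for k in priors)


def posterior(priors, likelihoods, k):
    """Bayes' theorem: P(A_k | B) = P(A_k) * P(B | A_k) / P(B)."""
    return priors[k] * likelihoods[k] / total_probability(priors, likelihoods)


share = {"A": 0.25, "B": 0.35, "C": 0.40}        # share of total output
defect_rate = {"A": 0.05, "B": 0.04, "C": 0.02}  # assumed P(defective | machine)

p_defective = total_probability(share, defect_rate)
p_a_given_defective = posterior(share, defect_rate, "A")
print(round(p_defective, 4), round(p_a_given_defective, 4))
```

The same two functions cover any "which source did it come from" question once the priors and likelihoods are written down.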
Q. Consider the following corpus of sentences.

<s> three friends aman, akbar and anthony are reading a book </s>
<s> aman is reading malgudi days </s>
<s> akbar is reading a detective book </s>
<s> anthony is reading a book by r k narayan </s>

Assume a bigram language model. Calculate P(<s> aman is reading a book </s>).
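A bigram model estimates each factor as count(previous word, word) / count(previous word) and multiplies the factors along the sentence. The sketch below is one way to do this in Python for the four-sentence corpus; it assumes lowercased text with punctuation removed and no smoothing, which are choices made here for illustration rather than something the notes specify.

```python
from collections import Counter

# Corpus with punctuation stripped and everything lowercased (an assumption)
corpus = [
    "three friends aman akbar and anthony are reading a book",
    "aman is reading malgudi days",
    "akbar is reading a detective book",
    "anthony is reading a book by rk narayan",
]


def train_bigram(sentences):
    """Count bigram contexts and bigrams, adding <s> and </s> markers."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams


def sentence_prob(sentence, contexts, bigrams):
    """Maximum-likelihood bigram probability of a sentence (no smoothing)."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if contexts[prev] == 0:
            return 0.0
        prob *= bigrams[(prev, cur)] / contexts[prev]
    return prob


contexts, bigrams = train_bigram(corpus)
p = sentence_prob("aman is reading a book", contexts, bigrams)
print(p)
```

Under these assumptions the factors are 1/4 × 1/2 × 3/3 × 3/4 × 2/3 × 2/3, so the product is 1/24.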

Q.2 There are 3 bags B1, B2 & B3. Bag B1 contains 5 red & 5 green balls, bag B2 contains 3 red & 5 green balls and bag B3 contains 5 red & 3 green balls. Bags B1, B2 & B3 have probabilities 3/10, 3/10 & 4/10 respectively of being chosen. A bag is selected at random and a ball is chosen at random from the bag.

(A) Find the probability that the selected bag is B3, given that the chosen ball is green.
(B) Find the probability that the selected bag is B3 and the chosen ball is green.
(C) Find the probability that the chosen ball is green, given that the selected bag is B3.
(D) Find the prob that the chosen ball is green.

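Ans 2:
$P(\text{green}) = \tfrac{3}{10} \times \tfrac{5}{10} + \tfrac{3}{10} \times \tfrac{5}{8} + \tfrac{4}{10} \times \tfrac{3}{8} = 0.15 + 0.1875 + 0.15 = 0.4875$

(A) $P(B_3 \mid \text{green}) = \frac{P(B_3)\,P(\text{green} \mid B_3)}{P(\text{green})} = \frac{0.15}{0.4875} = \frac{4}{13} \approx 0.3077$
(B) $P(B_3 \cap \text{green}) = P(B_3)\,P(\text{green} \mid B_3) = \tfrac{4}{10} \times \tfrac{3}{8} = \tfrac{3}{20} = 0.15$
(C) $P(\text{green} \mid B_3) = \tfrac{3}{8}$
(D) $P(\text{green}) = 0.4875$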

A1 =
4 4 25
8
+
+

⑥ p =

=
-
24 24

following corpus 3 sentences


8. Consider
of
the
is the
what total count knight
of
biggests for
which likelihood
the
will be estimated assume
consider
we do not perform any preprocessing,
token as Cs)
1
& CIS
& end
beginning
of
the
musuem [(>
=> (sJulia
is visiting the
are friends as
natasha
Cs) Julia, groves meet
will Julia in
<) Zoe & natasha
the musuem </s)
(b/20 (C)/6 (d)18
(a)23 v
Function words vs content words

Function words
↳ stop words
↳ closed class words
↳ a, an, the, to, is, of, ...
↳ prepositions - in, on, of, by
↳ determiners - a, an, the
↳ pronouns - I, he, his, him, she

Content words
↳ carry the info / topic
↳ open class words (we keep on adding new words)
↳ noun, verb, adjective, adverb

TTR (type-token ratio) = no. of types (unique words) / no. of tokens

Ex: "will will will" -> 1 type, 3 tokens
TTR = 1/3

* If TTR is high, then new words will be found more.
Q. In a corpus, it was found that the word with rank 4 has a frequency of 600. What can be the best guess for the rank of a word with freq. 300?

Ans: f × r = k
=> 600 × 4 = 300 × r
=> r = 8
Q. In the sentence "the only thing we have to fear is the fear itself", find TTR.

=> 9 types / 11 tokens = 9/11
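The type-token ratio is the number of distinct words divided by the total number of words. A minimal Python sketch follows, assuming whitespace tokenization and lowercasing; `type_token_ratio` is a made-up helper name.

```python
def type_token_ratio(text):
    """TTR = number of unique tokens (types) / total number of tokens."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens)


ttr_fear = type_token_ratio("the only thing we have to fear is the fear itself")
ttr_will = type_token_ratio("will will will")
print(ttr_fear, ttr_will)
```

With this tokenization the first sentence has 11 tokens and 9 types ("the" and "fear" each occur twice).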

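Q. Let the ranks of 2 words w1 & w2 in a corpus be 1600 and 400 respectively. Let m1 & m2 represent the no. of meanings of w1 and w2 respectively. The ratio m1 : m2 would tentatively be?

Ans: $m \propto \frac{1}{\sqrt{r}}$
$m_1 \times \sqrt{1600} = m_2 \times \sqrt{400} \Rightarrow m_1 \times 40 = m_2 \times 20 \Rightarrow \frac{m_1}{m_2} = \frac{1}{2}$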

Q. Which are true?
(a) Ambiguity can appear in tokenization steps. T
(b) Ambiguity will not appear in the sentence segmentation step. F
(c) Function words are generally more frequent in any text than content words. T
(d) Outputs of lemmatization are always real words. T

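Zipf's Law

f: freq of word
r: rank of word (in descending order of frequency)
k = constant

$f \propto \frac{1}{r} \Rightarrow f \cdot r = k$

$P_r$ (prob of word of rank r) = $\frac{f}{N}$, where N is the total no. of words

Zipf's other laws:
$m \propto \frac{1}{\sqrt{r}}$
↳ m: no. of meanings
↳ r: rank of a word
↳ l: length of a word (frequent words tend to be shorter)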
Heaps' Law

$|V| = k N^{\beta}$
|V| = size of vocabulary, N = total no. of words

Q. What is the size of unique words in a document where the total no. of words is 12000, K = 3.71 & β = 0.69?

$|V| = 3.71 \times (12000)^{0.69} \approx 2421$
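Heaps' law is a one-line power law, so the worked answer can be reproduced directly. A short Python sketch, assuming the constants K = 3.71 and β = 0.69 from the question; the function name is illustrative.

```python
def heaps_vocabulary_size(n_tokens, k, beta):
    """Heaps' law: |V| = k * N ** beta."""
    return k * n_tokens ** beta


vocab_size = heaps_vocabulary_size(12000, k=3.71, beta=0.69)
print(round(vocab_size))
```

Because β < 1, doubling the number of tokens less than doubles the estimated vocabulary.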

Q. If the first corpus has TTR = 0.085 & the second corpus has TTR = 0.78, which of the following are T/F?

(i) 1st corpus has more tendency to have unique words. F
(ii) 2nd corpus has more tendency to have unique words. T
(iii) TTR values can sometimes be > 1. F
(iv) TTR indicates the degree of lexical variance. T
1)
in
Ambiguity lexicography
2) Issues in Tokenization

10-12 slides
Feature extraction -> conversion of text into vectors

1. BoW -> Bag of Words model
2. Binary BoW
3. TF-IDF (Term Frequency - Inverse Document Frequency)

Also: label encoding & one-hot encoding

Corpus
S1: He is an awesome boy and an awesome dancer too.
S2: She is also an awesome girl.
S3: Both the girl and the boy are awesome.
S4: It was a good movie with awesome acting.

Pre-processing:
① Stopwords removal
② Lowering of the case
③ Stemming & lemmatization
④ Removal of punctuation symbols
⑤ Handling of negation

After removal of stopwords:

vocabulary -> freq
awesome (f1) -> 5
boy (f2) -> 2
dancer (f3) -> 1
girl -> 2
good -> 1
movie -> 1
acting -> 1

First sentence: awesome boy awesome dancer

       awesome  boy  dancer  girl  good  movie  acting
S1 = [ 2        1    1       0     0     0      0 ]

All vectors have a fixed size (the vocabulary size).
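The bag-of-words steps (stopword removal, lowercasing, punctuation removal, counting) can be sketched in plain Python. The stopword list below is a hypothetical one chosen just for this four-sentence corpus, no stemming is applied, and the vocabulary is ordered by first occurrence; none of these choices come from a particular library.

```python
from collections import Counter

# Hypothetical stopword list, just large enough for this corpus
STOPWORDS = {"he", "is", "an", "and", "too", "she", "also", "both",
             "the", "are", "it", "was", "a", "with"}

corpus = [
    "He is an awesome boy and an awesome dancer too.",
    "She is also an awesome girl.",
    "Both the girl and the boy are awesome.",
    "It was a good movie with awesome acting.",
]


def tokenize(sentence):
    """Lowercase, strip punctuation and drop stopwords."""
    words = (w.strip(".,!?").lower() for w in sentence.split())
    return [w for w in words if w and w not in STOPWORDS]


def bag_of_words(sentences, binary=False):
    """Return (vocabulary, vectors); binary=True gives the binary BoW."""
    tokenized = [tokenize(s) for s in sentences]
    # Vocabulary in order of first occurrence
    vocab = list(dict.fromkeys(w for doc in tokenized for w in doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([min(counts[w], 1) if binary else counts[w] for w in vocab])
    return vocab, vectors


vocab, vectors = bag_of_words(corpus)
_, binary_vectors = bag_of_words(corpus, binary=True)
print(vocab)
print(vectors[0], binary_vectors[0])
```

Every sentence maps to a vector of the same length as the vocabulary, which is what makes the representation usable as model input.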

Q. S1: The pizza was good
S2: The pizza was not good

       the  pizza  was  not  good
S1 = [ 1    1      1    0    1 ]
S2 = [ 1    1      1    1    1 ]

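$TF(t, d) = \frac{\text{no. of times term } t \text{ is present in document } d}{\text{total no. of terms in } d}$

$IDF(t) = \log \frac{\text{total no. of documents}}{\text{no. of documents containing term } t}$

TF-IDF score: $TF\text{-}IDF(t, d) = TF(t, d) \times IDF(t)$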

Q. d1: He is a good boy
d2: She is a good girl
d3: Both the boy and the girl are good
Calculate the TF-IDF score for the above corpus.

After stopword removal:
d1: good boy
d2: good girl
d3: boy girl good

TF(t, d):
        d1    d2    d3
good    1/2   1/2   1/3
boy     1/2   0     1/3
girl    0     1/2   1/3

IDF(t):
good: log(3/3) = 0
boy: log(3/2)
girl: log(3/2)

TF-IDF = TF × IDF:
      good   boy             girl
d1:   0      (1/2) log(3/2)  0
d2:   0      0               (1/2) log(3/2)
d3:   0      (1/3) log(3/2)  (1/3) log(3/2)
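The TF-IDF table for the three stopword-free documents can be reproduced with a few lines of Python. This is a sketch that assumes a base-10 logarithm (the notes only write "log") and the unsmoothed definitions TF = count / document length and IDF = log(N / document frequency); the function names are illustrative.

```python
import math

# Documents after stopword removal
docs = {
    "d1": ["good", "boy"],
    "d2": ["good", "girl"],
    "d3": ["boy", "girl", "good"],
}


def tf(term, doc):
    """Term frequency: occurrences of term / number of terms in the document."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """Inverse document frequency, base-10 log (assumed); term must occur somewhere."""
    doc_freq = sum(term in doc for doc in docs.values())
    return math.log10(len(docs) / doc_freq)


def tf_idf(term, name, docs):
    return tf(term, docs[name]) * idf(term, docs)


table = {name: {t: tf_idf(t, name, docs) for t in ("good", "boy", "girl")}
         for name in docs}
print(table)
```

"good" occurs in every document, so its IDF is log(3/3) = 0 and it contributes nothing to any row.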

Q. From a website we got 3 reviews of a movie:
R1: This movie is very scary and long.
R2: This movie is not scary and is slow.
R3: This movie is spooky and good.

R1 has 7 terms, R2 has 8 terms, R3 has 6 terms.

term     TF(R1)  TF(R2)  TF(R3)  IDF
this     1/7     1/8     1/6     log(3/3) = 0
movie    1/7     1/8     1/6     log(3/3) = 0
is       1/7     2/8     1/6     log(3/3) = 0
very     1/7     0       0       log(3/1)
scary    1/7     1/8     0       log(3/2)
and      1/7     1/8     1/6     log(3/3) = 0
long     1/7     0       0       log(3/1)
not      0       1/8     0       log(3/1)
slow     0       1/8     0       log(3/1)
spooky   0       0       1/6     log(3/1)
good     0       0       1/6     log(3/1)

TF-IDF (all terms not listed are 0):
R1: very = (1/7) log 3, scary = (1/7) log(3/2), long = (1/7) log 3
R2: scary = (1/8) log(3/2), not = (1/8) log 3, slow = (1/8) log 3
R3: spooky = (1/6) log 3, good = (1/6) log 3

Spelling correction -> Edit distance

Ex: "I am writing an email on [misspelt word] of KIIT."
The incorrect word is replaced by the correct one: the candidate (behalf, behave, behaviour) at minimum distance.

Edit distance = min no. of operations
↳ Insertion
↳ Deletion
↳ Substitution
(Levenshtein distance)

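$D(i, j) = \min \begin{cases} D(i-1, j) + 1 \\ D(i, j-1) + 1 \\ D(i-1, j-1) + \begin{cases} 2 & \text{if } x_i \ne y_j \\ 0 & \text{if } x_i = y_j \end{cases} \end{cases}$

Edit distance table for INTENTION -> EXECUTION:

N   9   8   9  10  11  12  11  10   9   8
O   8   7   8   9  10  11  10   9   8   9
I   7   6   7   8   9  10   9   8   9  10
T   6   5   6   7   8   9   8   9  10  11
N   5   4   5   6   7   8   9  10  11  10
E   4   3   4   5   6   7   8   9  10   9
T   3   4   5   6   7   8   7   8   9   8
N   2   3   4   5   6   7   8   7   8   7
I   1   2   3   4   5   6   7   6   7   8
#   0   1   2   3   4   5   6   7   8   9
    #   E   X   E   C   U   T   I   O   N

Min edit distance = 8 (top-right cell).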
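The edit-distance table is filled bottom-up with dynamic programming: each cell takes the cheapest of a deletion, an insertion, or a substitution. Below is a minimal Python sketch; it assumes insertion and deletion cost 1 and takes the substitution cost as a parameter (2 as in the table, or 1 for plain Levenshtein distance). The function name `edit_distance` is illustrative.

```python
def edit_distance(source, target, sub_cost=2):
    """Minimum edit distance via dynamic programming.

    Insertion and deletion cost 1; substitution costs sub_cost
    (0 when the characters already match).
    """
    rows, cols = len(source) + 1, len(target) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete all of source[:i]
    for j in range(cols):
        d[0][j] = j                      # insert all of target[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            same = source[i - 1] == target[j - 1]
            d[i][j] = min(
                d[i - 1][j] + 1,                          # deletion
                d[i][j - 1] + 1,                          # insertion
                d[i - 1][j - 1] + (0 if same else sub_cost),  # substitution
            )
    return d[-1][-1]


dist_sub2 = edit_distance("intention", "execution")
dist_sub1 = edit_distance("intention", "execution", sub_cost=1)
print(dist_sub2, dist_sub1)
```

For spelling correction, the same function can rank candidate words by their distance to the misspelt word and pick the minimum.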

Q. "The boy put the keys on the table." Assign the POS tag for each word.

The -> det
boy -> noun
put -> verb
the -> det
keys -> noun
on -> prep
the -> det
table -> noun

Q. "The grand jury commented on a number of other topics".

The -> DT (determiner)
grand -> JJ (adjective)
jury -> NN (noun)
commented -> VBD (verb, past tense)
on -> IN (preposition)
a -> DT (determiner)
number -> NN (noun)
of -> IN (preposition)
other -> JJ (adjective)
topics -> NNS (plural noun)


Q. "I need a flight from Atlanta":

I -> PRP (personal pronoun)
need -> VBP (verb)
a -> DT (determiner)
flight -> NN (noun)
from -> IN (preposition)
Atlanta -> NNP (proper noun)

$P(VB \mid TO) \times P(NR \mid VB) \times P(\text{race} \mid VB)$

P(VB | TO), P(NR | VB) = state transition prob.
P(race | VB) = emission prob.
P2D.
0
P 1e/T
=
=

9 (1) Sun rises in the east.


p=1,1 =0

independence in 1947.
(2) India got p 0.04
=
Ir
will get 5 emails in next one year.
(3) You
will home
The prime minister
come to
your
0 =0.01 IN
(3) toMosOW.

1=0.002 N
will snow in Delhi in June.
(5) It
holiday next Sunday. p=0.04 In
167 You will not get a

will back.
(7) The dog pr It
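I: Information content

$I(x_i) = \log \frac{1}{p(x_i)} = -\log p(x_i)$

For independent events: $I(x_i, y_j) = I(x_i) + I(y_j)$

Entropy: $H = \sum_i P(x_i)\,I(x_i) = \sum_i P(x_i) \log_2 \frac{1}{P(x_i)} = -\sum_i P(x_i) \log_2 P(x_i)$

Average length: $L_{avg} = \sum_i P_i L_i$

$P(x, y) = p(x)\,p(y \mid x)$

Chain rule of entropy: $H(X, Y) = H(X) + H(Y \mid X)$, i.e. $H(Y \mid X) = H(X, Y) - H(X)$

Independent: $H(X, Y) = H(X) + H(Y)$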
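Information content and entropy are short formulas, so a few lines of Python are enough to experiment with them. This sketch assumes base-2 logarithms (bits); the function names are illustrative.

```python
import math


def information_content(p):
    """I(x) = -log2 p(x): a certain event (p = 1) carries no information."""
    return -math.log2(p)


def entropy(probs):
    """H = -sum p * log2 p over outcomes with non-zero probability."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


certain = information_content(1.0)        # e.g. "Sun rises in the east"
rare = information_content(0.002)         # a very unlikely event
fair_coin = entropy([0.5, 0.5])
four_way = entropy([0.25, 0.25, 0.25, 0.25])
print(certain, rare, fair_coin, four_way)
```

The rarer the event, the larger its information content; entropy is the average of that quantity over all outcomes.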
Bayes classification

$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$

Fruit = {yellow, sweet, long}

P(Fruit | orange) = P(yellow | orange) × P(sweet | orange) × P(long | orange)
P(Fruit | banana) = P(yellow | banana) × P(sweet | banana) × P(long | banana)
P(Fruit | others) = P(yellow | others) × P(sweet | others) × P(long | others)

p(yellow orange) 350,000x200


=
=0. 3

50 850x200
=0in
(sweetlosange) x
=

P(long lowange)
0
=

:P(fruit/orange) 0
=

P(yellow/bananal=00x 08x =
1

0.75
↑ (sweet/banana 1 =

3870
x0x
=
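$P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)$

$\hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y)$

Naive Bayes treats the features as independent (not dependent) given the class.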