ML 3 & 4 Notes


ML3

* To calculate distance between 2 points:
- In 2D data, we use Euclidean Distance.
- For 2 points P1 = (x1, y1) and P2 = (x2, y2), the distance is:
  Euclidean Distance = sqrt((x2 - x1)^2 + (y2 - y1)^2)
- Even if there are 100s of rows in those columns, it is still 2D data only; the number of columns decides the dimensionality.
- Eg: in 3D, for points P1 = (x1, y1, z1) and P2 = (x2, y2, z2):
  Euclidean Distance = sqrt((x2 - x1)^2 + (y2 - y1)^2 + (z2 - z1)^2 + ...), extended up to n dimensions.
- At a time, we can only find the distance between 2 points. If there are 3 points, then we do all 3 pairwise distances separately.
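A minimal sketch of this formula in Python (numpy assumed; the function name is ours, not from the notes). The same code handles 2D, 3D, or n-dimensional points:

    import numpy as np

    def euclidean_distance(p1, p2):
        # Square the per-coordinate differences, sum them, take the root.
        p1, p2 = np.asarray(p1, dtype=float), np.asarray(p2, dtype=float)
        return np.sqrt(np.sum((p2 - p1) ** 2))

    print(euclidean_distance([1, 2], [4, 6]))        # 2D -> 5.0
    print(euclidean_distance([1, 2, 3], [4, 6, 8]))  # 3D -> ~7.07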
* KNN (K Nearest Neighbours)
- Also known as a lazy algorithm. (We are now going to use this.)
- KNN doesn't do anything during training: it does nothing with the training data during the training stage/phase.
- KNN is a supervised Machine Learning algorithm that assigns a class label (for classification) or predicts a value (for regression) for a new data point by considering the K nearest data points from the training dataset, based on a chosen distance metric. The predicted output is determined by a majority (for classification) or an average (for regression) of the K nearest neighbours' values.
- In KNN you will decide the value of K. It usually is an odd number, preferably 5 or 7.
- After finding the 5 or 7 nearest neighbours, we take the mean of all their values in regression.
- And in case of classification, it considers the class that is in the majority (see the sketch below).
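To make the majority-vote idea concrete, here is a small from-scratch sketch (ours, not from the notes; plain numpy, Euclidean distance as above):

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_new, k=5):
        # Lazy: nothing was learned up front; all work happens at prediction time.
        dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))  # distance to every training point
        nearest = np.argsort(dists)[:k]                        # indices of the k closest points
        return Counter(y_train[nearest]).most_common(1)[0][0]  # majority class among them

    X_train = np.array([[1, 1], [1, 2], [5, 5], [6, 5]])
    y_train = np.array([0, 0, 1, 1])
    print(knn_predict(X_train, y_train, np.array([1.5, 1.5]), k=3))  # -> 0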
* Code: IrisLogistic
- from sklearn import datasets
- iris = datasets.load_iris()
- iris
- import pandas as pd
- X = iris.data
- y = iris.target
- X = pd.DataFrame(X)
- X.shape
- X.head()
- # splitting the data
- from sklearn.model_selection import train_test_split
- X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, stratify=y)
- X_train.shape
- from sklearn.linear_model import LogisticRegression
- logi = LogisticRegression()
- logi.fit(X_train, y_train)
- y_pred = logi.predict(X_test)
- y_test
- y_pred
- from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
- confusion_matrix(y_test, y_pred)
- accuracy_score(y_test, y_pred)
- classification_report(y_test, y_pred)
  # If we use the above line as-is, the report won't be visible properly, hence we print it:
- print(classification_report(y_test, y_pred))
Note: we write stratify=y to proportionally distribute the classes across the training and testing data.
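A quick sketch of what stratify=y does (toy labels of our own, not the iris data): the test split keeps the same class proportions as y.

    import numpy as np
    from sklearn.model_selection import train_test_split

    y = np.array([0] * 80 + [1] * 20)      # imbalanced labels: 80% / 20%
    X = np.arange(100).reshape(-1, 1)      # dummy feature, just for the split

    _, _, _, y_test = train_test_split(X, y, test_size=0.2,
                                       random_state=123, stratify=y)
    print(np.bincount(y_test))             # [16 4] -> still 80% / 20%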
* The Accuracy Metrics of Classification: (100% asked in interviews)
- Eg: 100 Covid patients (y): 50 actual +ve, 50 actual -ve.
  Of the 50 +ve, 30 are predicted correctly and 20 wrongly; all 50 -ve are predicted correctly.
  Here, the accuracy is (30 + 50) / (30 + 20 + 50) = 80/100 = 80%.
- Eg: Let's take another Covid patients dataset of 100: 50 actual +ve and 50 actual -ve.
  - 20 of the +ve are correctly predicted as +ve; 30 of the +ve are wrongly predicted as -ve.
  - 50 of the -ve are correctly predicted as -ve; 0 of the -ve are wrongly predicted as +ve.
- Now we put this in the matrix. Here, the main focus is the +ve class (in medical conditions, for example).

  Confusion Matrix                Predicted
                             P                       N
  Actual  P   True Positive  TP = 20   False Negative FN = 30
          N   False Positive FP = 0    True Negative  TN = 50

  Accuracy = (TP + TN) / (TP + TN + FP + FN)
           = (20 + 50) / (20 + 50 + 0 + 30) = 70/100 = 0.7 = 70%
- Another Eg: 200 patients: 50 actual +ve, 150 actual -ve.
  Of the 50 +ve: 10 predicted as +ve, 40 predicted as -ve.
  Of the 150 -ve: 140 predicted as -ve, 10 predicted as +ve.

  Confusion Matrix      Predicted
                        P           N
  Actual  P       TP = 10     FN = 40
          N       FP = 10     TN = 140

  Here, Accuracy = (TP + TN) / (TP + TN + FP + FN)
                 = (10 + 140) / (10 + 140 + 10 + 40) = 150/200 = 0.75 = 75%

- But this accuracy is misleading!
- We are finding covid +ve, so our main focus should be the +ve class. In the above example, actual covid +ve were 50 and we correctly predicted just 10. That's only 20% accuracy on the +ve class; 80% of it is wrong.
This is due to Data Imbalance!

- Hence, majority of the time, accuracy is not a good metric.
- Instead, we use these formulae (with the matrix above: TP = 10, FN = 40, FP = 10, TN = 140):

(i) Recall = TP / (TP + FN) = 10/50 = 0.2
    - Recall is the actual accuracy here.
    - Recall is used for the positive/true class.

(ii) Precision = TP / (TP + FP) = 10/20 = 0.5

(iii) F1-Score (harmonic mean of Recall and Precision) = 2PR / (P + R)

(iv) Specificity = TN / (TN + FP) = 140/150 ≈ 0.93
    - Specificity is used for the negative/-ve class.

Exercise: 1000 patients (y): 500 actual +ve, 500 actual -ve.
Of the 500 +ve: 250 predicted as +ve, 250 predicted as -ve.
Of the 500 -ve: 200 predicted as +ve, 300 predicted as -ve.

  Confusion matrix      Predicted
                        P            N
  Actual  P      TP = 250     FN = 250
          N      FP = 200     TN = 300

- Recall = TP / (TP + FN) = 250/500 = 0.5
- Precision = TP / (TP + FP) = 250/450 ≈ 0.55
- F1-Score = 2PR / (P + R) = 2(0.275)/1.05 ≈ 0.52
- Specificity = TN / (TN + FP) = 300/500 = 0.6
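As a quick check of the exercise, here is a small sketch (ours, not from the notes) that recomputes the metrics straight from the confusion-matrix counts:

    # Counts from the exercise: 1000 patients
    TP, FN = 250, 250   # actual +ve: 500
    FP, TN = 200, 300   # actual -ve: 500

    recall      = TP / (TP + FN)                                 # 250/500 = 0.5
    precision   = TP / (TP + FP)                                 # 250/450 ≈ 0.556
    f1          = 2 * precision * recall / (precision + recall)  # ≈ 0.526
    specificity = TN / (TN + FP)                                 # 300/500 = 0.6
    print(recall, precision, f1, specificity)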
ML 4
* Majority of the time, the data won't be balanced. It will be imbalanced data.
- Any data distribution apart from 50-50 will be imbalanced.
- Generally, a 60-40 imbalance is fine and we can use it as it is; we won't try to balance that data.
- But most of the time, generally, the data imbalance is too much. For example, positive class → 1%, negative class → 99%.
- When your data is imbalanced, there are 3 possible things that you can do:
  i. Don't balance it
  ii. Balance it
  iii. Don't use that data.
- There are 2 ways to balance the data:
  i) Oversampling (SMOTE)
  ii) Undersampling (NearMiss)
* OVERSAMPLING: Oversampling involves increasing the number of instances in the minority class to balance it with the majority class. This is typically done by duplicating existing instances or generating synthetic data points.
- Eg: Covid data: 1000 rows → 200 +ve / 800 -ve.
  In oversampling, it will repeat the 200 +ve rows until they become 800:
  1000 rows (200 +ve / 800 -ve) → 1600 rows (800 +ve / 800 -ve)

* SMOTE:
- SMOTE (Synthetic Minority Over-Sampling Technique) is a data augmentation technique in machine learning used to balance imbalanced datasets by generating synthetic examples for the minority class through interpolation between existing instances and their nearest neighbours.
* UNDERSAMPLING:
- Undersampling involves reducing the number of instances in the majority class to balance it with the minority class. This is typically done by randomly selecting a subset of instances of the majority class.
- Generally, we use undersampling most.
- It reduces the majority class so it becomes equal to the minority class.
- Eg: 1000 rows → 200 +ve / 800 -ve: select 200 random values from the 800, leaving 400 rows in total.
- But in the above example data, we will use oversampling, as oversampling keeps more data, i.e. 1600 rows. We can't do much with 400 rows of data if we go with undersampling.
* Interview Question: If the 1st class is 98% and the 2nd class is 2%, what will you do?
- Ans: It depends on the data. If it is 2% of 5 lakh or 2% of 1 crore rows, we can use undersampling.
- US and OS do not depend on the percentage; they depend on how many rows you will have at the end.
- Oversampling is good only if there is a moderate number of minority instances and the imbalance is a lot.
- In classical Machine Learning, a minimum of 10,000 rows is required for the model to work properly. And if you use Neural Networks or Deep Learning, we need at least 50,000 to 60,000 rows.
- Nowadays, we have huge amounts of data being generated. So after doing undersampling, we generally do have 25,000 to 30,000 rows, and that is enough to scale any Machine Learning model, even using Ensemble techniques.
* Code: CreditCardFraud
- import pandas as pd
- df = pd.read_csv("...")
- df
- df.isnull().sum()
- df.head()
- df.shape
- X = df.drop(['Class'], axis=1)
- y = df['Class']
- y.value_counts()
- # Percentage of fraud cases
- (...) * 100
- # splitting the data
- from sklearn.model_selection import train_test_split
- X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123, stratify=y)

LOGISTIC WITH IMBALANCE:
- from sklearn.linear_model import LogisticRegression
- logit = LogisticRegression()
- logit.fit(X_train, y_train)
- y_pred = logit.predict(X_test)
- y_test
- y_pred
- from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
- confusion_matrix(y_test, y_pred)
- accuracy_score(y_test, y_pred)
- print(classification_report(y_test, y_pred))

# UNDERSAMPLING:
- from imblearn.under_sampling import NearMiss
- nm = NearMiss()
- X_us, y_us = nm.fit_resample(X, y)
- X_us.shape
- X_train, X_test, y_train, y_test = train_test_split(X_us, y_us, test_size=0.3, random_state=123, stratify=y_us)
- logit.fit(X_train, y_train)
- y_pred = logit.predict(X_test)
- print(confusion_matrix(y_test, y_pred))
- accuracy_score(y_test, y_pred)
- print(classification_report(y_test, y_pred))

# OVERSAMPLING:
- from imblearn.over_sampling import SMOTE
- smt = SMOTE()
- X_os, y_os = smt.fit_resample(X, y)
- X_os.shape
- X_train, X_test, y_train, y_test = train_test_split(X_os, y_os, test_size=0.3, random_state=123, stratify=y_os)
- logit.fit(X_train, y_train)
- y_pred = logit.predict(X_test)
- from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
- confusion_matrix(y_test, y_pred)
- accuracy_score(y_test, y_pred)
- print(classification_report(y_test, y_pred))
* Overfitting and Underfitting:
- Overfitting and underfitting are two common problems in Machine Learning and statistical modeling that arise when a model is not able to generalise well from the training data to new, unseen data.

- Overfitting: Overfitting occurs when a model learns the training data too well, capturing noise and random fluctuations in the data rather than the underlying patterns or trends. As a result, the model performs very well on the training data but poorly on new, unseen data.

- Underfitting: Underfitting occurs when a model is too simple to capture the underlying patterns of the data. The model performs poorly on both the training data and new, unseen data because it fails to learn the data's underlying structure.
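To see both problems in practice, here is a small sketch (ours, not from the notes; the dataset and parameters are illustrative) comparing train vs. test accuracy of two decision trees on noisy data:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Noisy synthetic data (hypothetical example, not the course dataset)
    X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2,
                               random_state=123)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=123, stratify=y)

    # An unconstrained tree memorises the training data (overfitting):
    deep = DecisionTreeClassifier(random_state=123).fit(X_train, y_train)
    print(deep.score(X_train, y_train), deep.score(X_test, y_test))  # ~1.0 vs noticeably lower

    # A depth-1 stump is too simple (underfitting): poor on both splits.
    stump = DecisionTreeClassifier(max_depth=1, random_state=123).fit(X_train, y_train)
    print(stump.score(X_train, y_train), stump.score(X_test, y_test))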
* KNN code from ML3: IrisKNN
- from sklearn import datasets
- iris = datasets.load_iris()
- iris
- X = iris.data
- y = iris.target
- X.shape
- # splitting the data
- from sklearn.model_selection import train_test_split
- X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, stratify=y)
- # Model
- from sklearn.neighbors import KNeighborsClassifier
- knn = KNeighborsClassifier(n_neighbors=7)
  # We write n_neighbors=7 to change the number of nearest neighbours from the default 5 to 7.
  # The default value of n_neighbors is 5.
- knn.fit(X_train, y_train)
- y_pred = knn.predict(X_test)
- y_test
- y_pred
- from sklearn import metrics
  # Another way to import metrics
- metrics.confusion_matrix(y_test, y_pred)
- metrics.accuracy_score(y_test, y_pred)
- print(metrics.classification_report(y_test, y_pred))
