
Maharashtra State Board of Technical Education

Certificate

This is to certify that Mr. / Ms. Rahul Chandrakant Shinde, Roll No. 42, of Sixth Semester of Diploma in Computer Engineering of Institute VPM's Polytechnic, Thane (Code 0007) has completed the term work satisfactorily in subject Data Warehousing with mining techniques (22621) for the academic year 2020 to 2021 as prescribed in the curriculum.

Place: Thane            Enrollment No. 1800070108
Date: 10/06/2021        Exam Seat No. 103794

Subject Teacher        Head of the Department        Principal

Seal of the Institute
Content Page
List of Practicals and Progressive Assessment Sheet

Sr. No. | Title of the practical | Date of performance | Date of submission | Assessment marks (50) | Dated sign. of teacher | Remarks (if any)
1 | Install Oracle database server and client | 6/4/2021 | 9/4/2021 | 23 | |
2 | Import source data structures in Oracle | 19/4/2021 | 23/4/2021 | 20 | |
3 | Develop target data structures in Oracle | 19/4/2021 | 23/4/2021 | 22 | |
4 | Install data mining tool WEKA. Study the GUI explorer on WEKA | 19/4/2021 | 24/4/2021 | 20 | |
5 | Develop an application for OLAP and its operations: roll-up, drill-down | 3/5/2021 | 8/5/2021 | 24 | |
6 | Develop an application for OLAP and its operations: slice and dice | 3/5/2021 | 8/5/2021 | 23 | |
7 | Implement data cleaning techniques I (Data preprocessing: finding and replacing missing values in sample datasets) | 16/5/2021 | 19/5/2021 | 23 | |
8 | Implement data cleaning techniques II (Data transformation: transforming data from one format to another format) | 16/5/2021 | 19/5/2021 | 22 | |
9 & 10 | Preprocess dataset WEATHER.arff, including creating an ARFF file and reading it into WEKA, using the WEKA explorer | 24/5/2021 | 28/5/2021 | 20 | |
11 & 12 | Demonstration of preprocessing on dataset customer.arff: attribute selection and normalization; draw various graphs using WEKA | 25/5/2021 | 28/5/2021 | 21 | |
13 | Perform association technique on customer dataset using Apriori algorithm | 4/6/2021 | 7/6/2021 | 21 | |
14 | Perform association technique on customer dataset II (using classification algorithm KNN on sample dataset) | 4/6/2021 | 9/6/2021 | 23 | |

Total Marks: 262
Total Marks (Scaled to 25 Marks): 23.54


Practical No. 1
Aim: Install Oracle database server and client.

Downloading and Installing the Oracle Database Software

Oracle requires that you use the 64-bit Oracle database software. To download and install the 64-bit Oracle database software:

1. Sign on to the Oracle support website:
   https://support.oracle.com/CSP/ui/flash.html
2. Click the Patches and Updates link. The Patch Search screen appears.
3. In the Patch Name or Number field, enter 17694377 & Search. The patchset list appears.
4. Select Patch 17694377: 12.1.0.2.0 PATCH SET FOR ORACLE DATABASE SERVER for the 64-bit version of your platform.
5. Click Download. The File Download screen appears.
6. Click p17694377_121020_platform_1of8.zip and p17694377_121020_platform_2of8.zip for the database. Specifically, the files to download for each platform are:

   Select Platform (64-bit) | Download Files
   Linux x86-64 | p17694377_121020_Linux-x86-64_1of8.zip, p17694377_121020_Linux-x86-64_2of8.zip
   Oracle Solaris on SPARC (64-bit) | p17694377_121020_SOLARIS64_1of8.zip, p17694377_121020_SOLARIS64_2of8.zip
   HP-UX Itanium | p17694377_121020_HPUX-IA64_1of8.zip, p17694377_121020_HPUX-IA64_2of8.zip
   IBM AIX on POWER Systems | p17694377_121020_AIX64-5L_1of8.zip, p17694377_121020_AIX64-5L_2of8.zip

7. Install the database according to an Oracle database installation guide applicable to your installation requirements.

Downloading and Installing the Oracle Client Software

You must install either the 11.2.0.x or the 12.1.0.x version of the Oracle Client software.

Downloading Oracle Software for Oracle Client 12.1.0.2

To download and install the 32-bit Oracle Client software for Oracle Client 12.1.0.2:

1. Sign on to the Oracle support website:
   https://support.oracle.com/CSP/ui/flash.html
2. Click the Patches and Updates link. The Patch Search screen appears.
3. In the Patch Name or Number field, enter 17694377 & Search. The patchset list appears.
4. Select Patch 17694377: 12.1.0.2.0 PATCH SET FOR ORACLE DATABASE SERVER for the 32-bit version of your platform (even if your platform is 64-bit).
5. Click Download. The File Download screen appears.
6. Click one of the following files for the Oracle database client:

   Select Platform (32-bit) | Download File
   Linux x86 | p17694377_121020_LINUX.zip
   Oracle Solaris on SPARC (32-bit) | p17694377_121020_SOLARIS.zip
   HP-UX Itanium (32-bit) | p17694377_121020_HPUX-IT32.zip
   IBM AIX on POWER Systems (32-bit) | p17694377_121020_AIX.zip

   A download progress screen appears.
7. Install the client according to the installation instructions in the downloaded ZIP file.
8. Run the Net Configuration Assistant (netca) and perform Local Net Service Name configuration, to ensure that the database client can connect to your database server.

After you install Oracle Client 12.1.0.2, go to the Oracle-client-home/lib directory and add a libnnz11.so symbolic link:

cd Oracle-client-home/lib
ln -s libnnz12.so libnnz11.so
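To verify that the installed client can actually reach the server through the net service name configured above, a quick scripted check can help. The sketch below is illustrative and not part of the original procedure; it assumes the cx_Oracle Python driver is installed, and the credentials (scott/tiger) and net service name (orcl) are placeholders you would replace with your own.

# Hypothetical connectivity check for the freshly installed Oracle client.
# Replace user, password and the net service name with your own values.
import cx_Oracle

conn = cx_Oracle.connect("scott", "tiger", "orcl")  # placeholder credentials/DSN
cur = conn.cursor()
cur.execute("SELECT * FROM v$version")              # fetch the server version banner
for row in cur:
    print(row[0])
conn.close()

If this prints the Oracle version banner, the client installation and the Local Net Service Name configuration are both working.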

Downloading Oracle Client Software for Oracle Client 11.2.0.4

To download and install the 32-bit Oracle software for Oracle Client 11.2.0.4:

1. Sign on to the Oracle support website:
   https://support.oracle.com/CSP/ui/flash.html
2. Click the Patches and Updates link. The Patch Search screen appears.
3. In the Patch Name or Number field, enter 13390677 & Search. The patchset list appears.
4. Select Patch 13390677: 11.2.0.4.0 PATCH SET FOR ORACLE DATABASE SERVER for the 32-bit version of your platform.
5. Click Download. The File Download screen appears.
6. Click one of the following files for the Oracle database client:

   Select Platform (32-bit) | Download File
   Linux x86 | p13390677_112040_LINUX_1of7.zip
   Oracle Solaris on SPARC (32-bit) | p13390677_112040_SOLARIS_1of6.zip
   HP-UX Itanium (32-bit) | p13390677_112040_HPUX-IT32_1of6.zip
   IBM AIX on POWER Systems (32-bit) | p13390677_112040_AIX_1of7.zip

   A download progress screen appears.
7. Install the client according to the installation instructions in the downloaded ZIP file.
8. Run the Net Configuration Assistant and perform Local Net Service Name configuration, to ensure that the database client can connect to your database server.

Conclusion:
By following the procedure, we successfully installed the Oracle database server and client.


Rahul Shinde

Practical No. 2
Aim: Import source data structures in Oracle.

Theory:
The table is the basic data structure used in the relational database. A table is a collection of rows. Each row in a table contains one or more columns.

A view is an Oracle data structure constructed with a SQL statement. The SQL statement is stored in the database. When we use a view in a query, the stored query is executed and the base table data is returned to the user. Views do not contain data, but represent ways to look at the base table data in the way the query specifies. A view is built on a collection of base tables, which can be either actual tables in an Oracle database or other views.

We can use a view for several purposes:
- To simplify access to data stored in multiple tables.
- To isolate an application from the specific structure of the underlying tables.

An index is a data structure that speeds up access to particular rows in a database. An index is associated with a particular table and contains the data from one or more particular columns in the table.

Steps to import source data structures in Oracle:

Step 1: a) Click on Windows Start. b) Click on All Programs. c) Click on Oracle - OraDb11g_home. d) Click Warehouse Builder, then click on Design Center.

Step 2: After Warehouse Builder is loaded, a pop-up window will appear. Enter the username and password. Click OK.

Step 3: Click on the File menu. Click New Project.

Step 4: Enter the project name and click OK.

Step 5: The created project will appear on the right side. Click on Databases.

Step 6: Right-click on Oracle. Click on New Oracle Module.

Step 7: A window will appear. Click on Next. Enter the name. Click on Next. The location is set by default; you can change it by clicking on Edit.
  a) A window will appear. Enter the username and password.
  b) Enter the host and service name.
  c) Click on Test Connection. A success message will be displayed if the connection is proper.

Step 8: Again click on Next.

Step 9: Click on Finish. The created Oracle module will appear inside the Oracle folder.

Step 10: Again go to All Programs, click on Oracle OraDb11g_home, click on Application Development, click on SQL Plus.

Step 11: The command prompt of SQL Plus will appear. Enter the username and password.

Step 12: Create an items table in which keep id as a primary key. Create orders and product tables. (A sketch of these CREATE TABLE statements appears after these steps.)

Step 13: Now, in the Warehouse Builder, right-click on Tables. Click on Import. Click on Database Objects. A wizard will appear. Click on Next. Check the Table checkbox. Click on Next.

Step 14: Now click on the items table and move it to the selected pane by clicking on >. Do the same with the other tables that we created above in SQL Plus.

Step 15: Click on Next. Click on Finish. It will start importing the tables. After that click on OK. All tables will appear in the Tables folder.
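As a concrete illustration of Step 12, the sketch below creates minimal items, orders and product tables. The column lists are hypothetical (the practical specifies only that items has id as a primary key); the DDL strings can be typed directly at the SQL*Plus prompt, or run through Python's cx_Oracle driver as shown here with placeholder credentials.

# Minimal sketch of Step 12: create the three source tables.
# Column definitions are illustrative; only "id as primary key" comes from the text.
import cx_Oracle

ddl_statements = [
    """CREATE TABLE items (
           id     NUMBER PRIMARY KEY,
           name   VARCHAR2(50),
           price  NUMBER(10, 2))""",
    """CREATE TABLE orders (
           order_id NUMBER PRIMARY KEY,
           item_id  NUMBER REFERENCES items(id),
           qty      NUMBER)""",
    """CREATE TABLE product (
           prod_id NUMBER PRIMARY KEY,
           descr   VARCHAR2(100))""",
]

conn = cx_Oracle.connect("scott", "tiger", "orcl")  # placeholder connection
cur = conn.cursor()
for ddl in ddl_statements:
    cur.execute(ddl)   # DDL statements auto-commit in Oracle
conn.close()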

Conclusion:
We have learnt how to import source data structures in Oracle.
Rahul Shinde

Practical No. 3
Aim: Develop target data structures in Oracle.

Theory:
Relational databases: These databases are categorized by a set of tables where data gets fit into a pre-defined category. The table consists of rows and columns, where a column has an entry for data for a specific category and rows contain instances of that data defined according to the category. The Structured Query Language (SQL) is the standard user and application program interface for these tables, which makes these databases easier to extend, to join two databases with a common relation, and to modify all existing applications.

Dimensional databases: A dimensional database is a database that uses a dimensional data model to organize data. This model uses fact tables and dimension tables in a star or snowflake schema.
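To make the fact-table/dimension-table idea concrete before the GUI steps, here is a hypothetical miniature star schema (not from the original text): a central sales fact table referencing time and product dimensions. The statements could be executed exactly as in Practical No. 2.

# Hypothetical star schema: dimension tables plus a central fact table.
star_schema_ddl = [
    # dimension: calendar attributes at day grain
    """CREATE TABLE dim_time (
           time_id NUMBER PRIMARY KEY,
           day     NUMBER, month NUMBER, year NUMBER)""",
    # dimension: product attributes
    """CREATE TABLE dim_product (
           product_id NUMBER PRIMARY KEY,
           name       VARCHAR2(50), category VARCHAR2(30))""",
    # fact table: foreign keys into each dimension plus numeric measures
    """CREATE TABLE fact_sales (
           time_id    NUMBER REFERENCES dim_time(time_id),
           product_id NUMBER REFERENCES dim_product(product_id),
           revenue    NUMBER(12, 2),
           qty_sold   NUMBER)""",
]
# Execute each statement via SQL*Plus or cx_Oracle as in Practical No. 2.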

Steps to develop target data structures in Oracle:

Step 1: a) Click on Windows Start. b) Click on All Programs. c) Click on Oracle - OraDb11g_home. d) Click Warehouse Builder, then click on the Design Center.

Step 2: After Warehouse Builder is loaded, a pop-up window will appear. Enter the username and password. Click OK.

Step 3: Click on the View menu. Click on Global Navigator.

Step 4: Click on Security. Click on Users.

Step 5: Click on Users. Click on New Users.

Step 6: A window will open to create a new user. Click on Next.

Step 7: Now, on the bottom side there is a Create DB User button. Click on it.

Step 8: Another window will appear to create the database user. Enter the name and password to create the new DB user. Click OK.

Step 9: Click Next. Click Finish. Now a user is created and it will be displayed within the Users folder.

Step 10: Now click on the project which is currently open.

Step 11: Click on Databases. Right-click on Oracle. Click on New Oracle Module.

Step 12: A window will open. Click Next. Enter the name. Click Next. The location can be modified by clicking on Edit. Enter the username and password. Click on Test Connection, and it will prompt "Successful" if the connection is done properly. Click Next.

Step 13: Click Finish.

Conclusion:
We have successfully developed target data structures in Oracle.


Rahul Shinde

Practical No. 4
Aim: Install data mining tool WEKA. Study the GUI explorer on WEKA.

Theory:
WEKA is an acronym that stands for the Waikato Environment for Knowledge Analysis. WEKA is an open source software providing tools for data preprocessing, implementations of several machine learning algorithms, and visualization tools, so that we can develop machine learning techniques and apply them to real-world data mining problems.

GUI explorer:
On the top of the explorer we will see six tabs:
1) Preprocess
2) Classify
3) Cluster
4) Associate
5) Select attributes
6) Visualize

1) Preprocess tab: Initially, as you open the explorer, only the Preprocess tab is enabled. The first step in machine learning is to preprocess the data. Thus, in the Preprocess option, you will select the data file and process it before applying the various machine learning algorithms.

2) Classify tab: The Classify tab provides you several machine learning algorithms for the classification of your data.

3) Cluster tab: Under the Cluster tab, there are several clustering algorithms provided, such as SimpleKMeans and so on.

4) Associate tab: Under the Associate tab, you would find FPGrowth and Apriori.

5) Select Attributes tab: Select Attributes allows you to make attribute selections based on several algorithms such as Principal Components etc.

6) Visualize tab: The Visualize option allows you to visualize your processed data for analysis.

Steps to install the WEKA data mining tool:
i) WEKA is available as a self-extracting executable for the Windows OS. Select either the 32-bit version or the 64-bit version of the WEKA package. The packages would be named as follows:
   - Self-extracting executable with Oracle's 64-bit Java for 64-bit Windows OS
   - Self-extracting executable with Oracle's 32-bit Java for 32-bit Windows OS
ii) Download the package, then click the package to start the installation process once the download is completed.
iii) Once the installation is complete, WEKA would appear in the program menu. The icon of WEKA is a bird shape. Click on the icon to start WEKA.

Conclusion:
We have successfully installed the WEKA data mining tool and learnt about the GUI explorer.


Rahul Shinde

Practical No. 5
Aim: Develop an application for OLAP and its roll-up, drill-down operations.

Introduction:
An OLAP cube is a data structure that overcomes the limitations of relational databases by providing rapid analysis of data. It is a multidimensional cube that is built using OLAP databases. OLAP cubes can display and sum large amounts of data while also providing users with searchable access to any data points, so that the data can be rolled up, sliced and diced as needed to handle the widest variety of questions that are relevant to a user's area of interest. An OLAP cube connects to a data source to read and process raw data to perform aggregations and calculations for its associated measures. The data source for all Service Manager OLAP cubes is the data mart, which includes the data marts for both the Operations Manager and the Configuration Manager.

There are three components associated with any data cube: Measures, Dimensions and Hierarchies.

[Figure 1: Data cube with dimensions Location (Far East, North America, ...) and product lines (Outdoor Products, GO Sport Line, Environmental Line)]
OLAP Operations:
OLAP provides a user-friendly environment for interactive data analysis. One of the most popular front-end applications for OLAP is a PC spreadsheet program.

OLAP operations on multidimensional data are:

1. Roll-up (drill-up): ROLLUP is used in tasks involving subtotals. It creates subtotals at any level of aggregation needed, from the most detailed up to a grand total, i.e. climbing up a concept hierarchy for a dimension such as time or geography.

Example: A query could involve a roll-up of year > month > day or country > state > city. When a roll-up is performed, one or more dimensions from the data cube are removed. The ROLLUP operation in the example below would return the total revenue across all products at increasing aggregation levels of location, from state to country to region, for the different quarters.

QUERY SYNTAX:
SELECT ... GROUP BY ROLLUP (grouping_column_reference_list);

EXAMPLE:
SELECT TIME, LOCATION, PRODUCT, SUM(REVENUE) AS PROFIT FROM SALES
GROUP BY ROLLUP (TIME, LOCATION, PRODUCT);
Benefits of using ROLLUP:
The above subtotal operation could be achieved using 'SELECT' statements with UNION ALL, but the query performance is then inefficient, as it makes multiple table accesses, and the syntax is also complicated.

2. Drill-down (roll-down):
This is the reverse of the roll-up operation discussed above. The data is aggregated from a higher-level summary to a lower-level summary/detailed data.

QUERY SYNTAX:
SELECT ... GROUP BY ROLLDOWN (columns);

EXAMPLE:
SELECT TIME, LOCATION, PRODUCT, SUM(REVENUE) AS PROFIT
FROM SALES GROUP BY ROLLDOWN (TIME, LOCATION, PRODUCT);

Drill-down can be performed either by:
1. Stepping down a concept hierarchy for a dimension
2. Introducing a new dimension
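Both operations can also be emulated outside Oracle. Note that Oracle's SQL dialect provides GROUP BY ROLLUP but, as far as we know, no ROLLDOWN keyword; in practice drill-down is expressed simply by grouping on additional, finer-grained columns. The minimal pandas sketch below (hypothetical data, not from this manual) shows both directions as re-aggregation.

# Roll-up and drill-down as re-aggregation, on a tiny hypothetical sales table.
import pandas as pd

sales = pd.DataFrame({
    "year":    [2020, 2020, 2020, 2021],
    "quarter": ["Q1", "Q1", "Q2", "Q1"],
    "city":    ["Pune", "Thane", "Pune", "Thane"],
    "revenue": [100, 150, 120, 200],
})

# Drill-down view: keep the finest grain (year, quarter, city)
detail = sales.groupby(["year", "quarter", "city"])["revenue"].sum()

# Roll-up: climb the time hierarchy quarter -> year (drop a level)
by_year = sales.groupby("year")["revenue"].sum()

print(detail)
print(by_year)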

Conclusion:
OLAP operations Roll-up and Drill-down are performed.
Rahul Shinde

Practical No. 6
Aim: Develop an application for OLAP and its operations: Slice and Dice.

Slicing:
A slice in a multidimensional array is a column of data corresponding to a single value for one or more members of the dimension. It helps the user to visualize and gather the information specific to a dimension. When you think of slicing, think of it as a specialized filter for a particular value in a dimension. For instance, if a user wanted to know the total number of 'OP' products sold across all the other locations (Europe, Far East, North America, South America), the user would perform a horizontal slice (shown in Fig. 1).

[Fig. 1: Data cube with dimensions Location (Far East, North America, ...) and product lines (Outdoor Products, GO Sport Line, Environmental Line)]
QUERY SYNTAX:
SELECTION CONDITIONS ON SOME ATTRIBUTES USING <WHERE CLAUSE>, <GROUP BY> AND AGGREGATION ON SOME ATTRIBUTE

EXAMPLE:
SELECT PRODUCTS, SUM(REVENUE) FROM SALES WHERE PRODUCTS='OP' GROUP BY PRODUCTS;

Dicing:
Dicing is similar to slicing, but it works a little differently. When one thinks of slicing, filtering is done to focus on a particular attribute. Dicing, on the other hand, is more of a zoom feature that selects a subset over all the dimensions, but for specific values of a dimension. For instance, if a user wanted to know the revenue earned due to the 'EL' product in the particular market location of Europe, the user would perform a dicing operation (shown in Fig. 2).

[Fig. 2: Data cube: Slicing and Dicing — revenue values (2000-6000) by product (OP, GO, SL, EL) and location (Europe, Far East, N. America, S. America)]


QUERY SYNTAX:
SELECTION CONDITION ON SOME ATTRIBUTES USING <WHERE CLAUSE>, <GROUP BY> AND AGGREGATION ON SOME ATTRIBUTE

EXAMPLE:
SELECT PRODUCTS, SUM(REVENUE) FROM SALES WHERE PRODUCTS='EL'
AND LOCATION='EUROPE' GROUP BY PRODUCTS;
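For comparison with the SQL above, a slice and a dice can also be written as plain filters. A minimal pandas sketch follows, using a hypothetical sales table with the dimension values from the figures:

# Slice = fix one dimension value; dice = fix values on several dimensions.
import pandas as pd

sales = pd.DataFrame({
    "product":  ["EL", "EL", "GO", "SL"],
    "location": ["Europe", "Far East", "Europe", "N. America"],
    "revenue":  [4000, 2000, 6000, 2000],       # illustrative figures
})

slice_el = sales[sales["product"] == "EL"]       # one dimension fixed
dice_el_europe = sales[(sales["product"] == "EL") &
                       (sales["location"] == "Europe")]  # sub-cube on two dimensions

print(slice_el.groupby("product")["revenue"].sum())
print(dice_el_europe["revenue"].sum())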

Conclusion: OLAP operations Slice and Dice are performed.
Rahul Shinde

Practical No. 7
Aim: Implement data cleaning techniques I (Data preprocessing: finding and replacing missing values in sample datasets).

Introduction:
The data cleaning techniques include data preprocessing and data transformation.

Data Preprocessing:
Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often noisy, missing (incomplete) or inconsistent; it may contain many errors. Data preprocessing is a proven method of resolving all these issues. Data preprocessing prepares raw data for further processing.

Data preprocessing is used in database-driven applications such as customer relationship management and rule-based applications (like neural networks).

Data preprocessing goes through a series of steps:
i) Data Cleaning: Data is cleansed through processes such as filling in missing values, smoothing the noisy data, or resolving the inconsistencies in the data.
ii) Data Integration: Data with different representations are put together and conflicts within the data are to be resolved.
iii) Data Transformation: Data is normalized, aggregated and generalized.
iv) Data Reduction: This step aims to present a reduced representation of the data in a data warehouse.
v) Data Discretization: Involves the reduction of the number of values of a continuous attribute by dividing the range of the attribute into intervals.

Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data cleaning is done. It involves the handling of missing data, noisy data etc.

(a) Missing Data:
This situation arises when some data is missing in the data. It can be handled in various ways. Some of them are:
i. Ignore the tuples: This approach is suitable only when the dataset we have is quite large and multiple values are missing within a tuple.
ii. Fill the missing values: There are various ways to do this task. You can choose to fill the missing values manually, by the attribute mean, or by the most probable value.

(b) Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to faulty data collection, data entry errors etc. It can be handled in the following ways:
i. Binning Method: This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size and then various methods are performed to complete the task. Each segment is handled separately. One can replace all data in a segment by its mean, or boundary values can be used to complete the task.
ii. Regression: Here data can be made smooth by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).
iii. Clustering: This approach groups the similar data in a cluster. The outliers may be undetected or they will fall outside the clusters.
Example:
Binning Methods for Data Smoothing
The binning method can be used for smoothing the data. Mostly, data is full of noise. Data smoothing is a data pre-processing technique using a different kind of algorithm to remove the noise from the data set. This allows important patterns to stand out.

Unsorted data for price in dollars:
Before sorting: 8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34
First of all, sort the data.
After sorting: 8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34

Smoothing the data by equal-frequency bins:
Bin 1: 8, 9, 15, 16
Bin 2: 21, 21, 24, 26
Bin 3: 27, 30, 30, 34
undaram
Smoothing by bin means:
For Bin 1:
(8 + 9 + 15 + 16) / 4 = 12
(4 indicating the total values, i.e. 8, 9, 15, 16)
Bin 1 = 12, 12, 12, 12
For Bin 2:
(21 + 21 + 24 + 26) / 4 = 23
Bin 2 = 23, 23, 23, 23
For Bin 3:
(27 + 30 + 30 + 34) / 4 ≈ 30
Bin 3 = 30, 30, 30, 30

Smoothing by bin boundaries:
Bin 1: 8, 8, 16, 16
Bin 2: 21, 21, 26, 26
Bin 3: 27, 27, 27, 34

How to smooth data by bin boundaries?

You need to pick the minimum and maximum value. Put the minimum on the left side and the maximum on the right side. Now, what will happen to the middle values?

Middle values in bin boundaries move to the closest neighbour value, i.e. the one with less distance.

Unsorted data for price in dollars:
Before sorting: 8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34
First of all, sort the data.
After sorting: 8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34

Smooth data after bin boundaries:
Before bin boundary: Bin 1: 8, 9, 15, 16

Here, 8 is the minimum value and 16 is the maximum value. 9 is nearer to 8, so 9 will be treated as 8; 15 is nearer to 16 and farther away from 8, so 15 will be treated as 16.

After bin boundary: Bin 1: 8, 8, 16, 16

Before bin boundary: Bin 2: 21, 21, 24, 26
After bin boundary: Bin 2: 21, 21, 26, 26
Before bin boundary: Bin 3: 27, 30, 30, 34
After bin boundary: Bin 3: 27, 27, 27, 34
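The whole worked example can be reproduced with a short script. This is a sketch, not part of the original write-up; it uses the same twelve prices and the same three equal-frequency bins, and rounds the bin means the way the worked example does.

# Equal-frequency binning with smoothing by bin means and by bin boundaries.
prices = [8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34]
prices.sort()                                  # 8,9,15,16,21,21,24,26,27,30,30,34

k = 3
size = len(prices) // k                        # 4 values per bin
bins = [prices[i * size:(i + 1) * size] for i in range(k)]

for b in bins:
    mean = sum(b) / len(b)
    by_mean = [round(mean)] * len(b)           # smoothing by bin means
    lo, hi = b[0], b[-1]
    by_boundary = [lo if v - lo <= hi - v else hi for v in b]  # nearest boundary
    print(b, "->", by_mean, "|", by_boundary)

Running it prints the same smoothed bins as above: means 12, 23, 30 and boundaries (8,8,16,16), (21,21,26,26), (27,27,27,34).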

Conclusion:
Here we learnt about bin boundary values before and after smoothing, and also learned about data preprocessing, data cleaning, and how to smooth data by bin boundaries.
Rahul Shinde

Practical No. 8
Aim: Implement data cleaning techniques II (Data transformation: transforming data from one format to another format).

Introduction:

Data Transformation and Discretization

Data transformation is the process of converting data from one format to another, typically from the format of a source system into the required format of a destination system. Data transformation is a component of most data integration and data management tasks, such as data wrangling and data warehousing.

The goal of the data transformation process is to extract data from a source, convert it into a usable format, and deliver it to a destination. This entire process is known as ETL (Extract, Transform, Load). During the extraction phase, data is identified and pulled from many different locations or sources into a single repository.

Data extracted from the source location is often raw and not usable in its original form. To overcome this obstacle, the data must be transformed. The following steps in the transformation process occur:

1. Data discovery: The first step in the data transformation process consists of identifying and understanding the data in its source format. This is usually accomplished with the help of a data profiling tool. This step helps you decide what needs to happen to the data in order to get it into the desired format.

2. Data mapping: During this phase, the actual transformation process is planned.

3. Generating code: In order for the transformation process to be completed, a code must be created to run the transformation job. Often these codes are generated with the help of a data transformation platform.

4. Executing the code: The data transformation process that has been planned and coded is now put into motion, and the data is converted to the desired output.

5. Review: Transformed data is checked to make sure it has been formatted correctly.

In addition to these basic steps, other customized operations may occur. For example:
- Filtering (e.g. selecting only certain columns to load).
- Enriching (e.g. splitting Full Name into First Name, Middle Name, Last Name).
- Splitting a column into multiple columns and vice versa.
- Joining together the data from multiple sources.
- Removing duplicate data.
Data transformation involves the following ways:

1. Normalization: It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0 to 1.0); a minimal sketch appears after this list.

2. Attribute Selection: In this strategy, new attributes are constructed from the given set of attributes to help the mining process.

3. Discretization: This is done to replace the raw values of a numeric attribute by interval levels or conceptual levels.

4. Concept Hierarchy Generation: Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute "city" can be converted to "country".
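As promised above, here is a minimal sketch of the first technique, min-max normalization to the range [0.0, 1.0], using the usual formula v' = (v - min) / (max - min). The values are hypothetical, not from the text.

# Min-max normalization: rescale values into the range [0.0, 1.0].
values = [64, 72, 85, 70]                        # hypothetical numeric attribute

lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
print(normalized)                                # 64 -> 0.0, 85 -> 1.0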

Conclusion:
Hence, here we learnt about data transformation, data discretization and data discovery, and also learnt about the basic steps and customized operations with proper examples.


Rahul Shinde Roll No.42

Practical No. 9 & 10


Aim-Preprocess dataset WEATHER.arff including creating an ARFF file and
reading it into WEKA, using the WEKA explorer.
1. Introduction
WEKA is a data mining system developed by the University of Waikato in New Zealand that
implements data mining algorithms. WEKA is a state-of-the-art facility for developing machine learning
(ML) techniques and their application to real-world data mining problems. It is a collection of machine
learning algorithms for data mining tasks. The algorithms are applied directly to a dataset. WEKA
implements algorithms for data preprocessing, classification, regression, clustering, association rules;
it also includes visualization tools. New machine learning schemes can also be developed with
this package. WEKA is open source software issued under the GNU General Public License.

2. Launching WEKA Explorer

You can launch Weka from C:\Program Files directory, from your desktop selecting

icon, or from the Windows task bar ‘Start’ → ‘Programs’ → ‘Weka 3-4’. When ‘WEKA
GUI Chooser’ window appears on the screen, you can select one of the four options at the bottom
of the window :

1. Simple CLI provides a simple command-line interface and allows direct execution of
Weka commands.
2. Explorer is an environment for exploring data.

3. Experimenter is an environment for performing experiments and conducting statistical


tests between learning schemes.
4. KnowledgeFlow is a Java-Beans-based interface for setting up and running machine
learning experiments.

For the exercises in this tutorial you will use ‘Explorer’. Click on ‘Explorer’ button in the ‘WEKA
GUI Chooser’ window.

‘WEKA Explorer’ window appears on a screen.

3. Preprocessing Data

At the very top of the window, just below the title bar there is a row of tabs. Only the first
tab, ‘Preprocess’, is active at the moment because there is no dataset open. The first three
buttons at the top of the preprocess section enable you to load data into WEKA. Data can be imported
from a file in various formats: ARFF, CSV, C4.5, binary; it can also be read from a URL or from an
SQL database (using JDBC) [4]. The easiest and the most common way of getting the data into WEKA
is to store it as Attribute-Relation File Format (ARFF) file.
You’ve already been given “weather.arff” file for this exercise; therefore, you can skip section 3.1 that
will guide you through the file conversion.

File Conversion

We assume that all your data is stored in a Microsoft Excel spreadsheet “weather.xls”.

WEKA expects the data file to be in Attribute-Relation File Format (ARFF). Before you apply
the algorithm to your data, you need to convert your data first into a comma-separated file and then into
ARFF format (a file with the .arff extension) [1]. To save your data in comma-separated format, select
the ‘Save As…’ menu item from Excel ‘File’ pull-down menu. In the ensuing dialog box select ‘CSV
(Comma Delimited)’ from the file type pop-up menu, enter a name of the file, and click ‘Save’
button. Ignore all messages that appear by clicking ‘OK’. Open this file with Microsoft Word. Your
screen will look like the screen below.

The rows of the original spreadsheet are converted into lines of text where the elements are separated
from each other by commas. In this file you need to change the first line, which holds the attribute
names, into the header structure that makes up the beginning of an ARFF file. Add a @relation tag
with the dataset’s name, an @attribute tag with the attribute information, and a @data tag as
shown below.
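Since the screenshot of the converted file is not reproduced here, the sketch below shows one way to generate the same ARFF header and data section programmatically. The attribute names and value sets are the ones used later in this exercise (outlook, temperature, humidity, windy, play); the two data rows are illustrative.

# Write a minimal weather.arff: @relation, @attribute declarations, then @data rows.
header = """@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
"""
rows = ["sunny,85,85,FALSE,no",        # illustrative instances
        "overcast,83,86,FALSE,yes"]

with open("weather.arff", "w") as f:
    f.write(header + "\n".join(rows) + "\n")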

Choose ‘Save As…’ from the ‘File‘ menu and specify ‘Text Only with Line Breaks’ as the file
type. Enter a file name and click ‘Save’ button. Rename the file to the file with extension .arff to
indicate that it is in ARFF format.

Opening file from a local file system

Click on ‘Open file…’ button.



It brings up a dialog box allowing you to browse for the data file on the local file system, choose
“weather.arff” file.

Some databases have the ability to save data in CSV format. In this case, you can select CSV
file from the local filesystem. If you would like to convert this file into ARFF format, you can click
on ‘Save’ button. WEKA automatically creates ARFF file from your CSV file.

Opening file from a web site

A file can be opened from a website. Suppose, that “weather.arff” is on the following
website:

The URL of the web site in our example is http://gaia.ecs.csus.edu/~aksenovs/. It means that the
file is stored in this directory, just as in the case of your local file system. To open this file, click
on ‘Open URL…’ button; it brings up a dialog box requesting the source URL.

Enter the URL of the web site followed by the file name, in this example the URL is
http://gaia.ecs.csus.edu/~aksenovs/weather.arff, where weather.arff is the name of the file you
are trying to load from the website.

Reading data from a database

Data can also be read from an SQL database using JDBC. Click on ‘Open DB…’ button,
‘GenericObjectEditor’ appears on the screen.

To read data from a database, click on ‘Open’ button and select the database from a filesystem.

Preprocessing window

At the bottom of the window there is ‘Status’ box. The ‘Status’ box displays messages that keep you
informed about what is going on. For example, when you first opened the ‘Explorer’, the message
says, “Welcome to the Weka Explorer”. When you load the “weather.arff” file, the ‘Status’ box displays
the message “Reading from file…”. Once the file is loaded, the message in the ‘Status’ box changes
to say “OK”. Right-click anywhere in ‘Status box’, it brings up a menu with two options:

1. Available Memory that displays in the log and in ‘Status’ box the amount of
memory available to WEKA in bytes.
2. Run garbage collector that forces the Java garbage collector to search for memory
that is no longer used, free this memory up, and make it available for new tasks.

To the right of ‘Status box’ there is a ‘Log’ button that opens up the log. The log records every action
in WEKA and keeps a record of what has happened. Each line of text in the log contains time of entry.
For example, if the file you tried to open is not loaded, the log will have record of the problem that
occurred during opening.
To the right of the ‘Log’ button there is an image of a bird. The bird is WEKA status icon.
The number next to the ‘X’ symbol indicates the number of concurrently running processes. When you
load a file, the bird sits down, which means that there are no processes running. The number of
processes beside the symbol ‘X’ is zero, which means that the system is idle. Later, in the classification
problem, when generating results, look at the bird: it gets up and starts moving, which indicates that a
process has started. The number next to ‘X’ becomes 1, which means that there is one process running, in
this case the calculation.
If the bird is standing and not moving for a long time, it means that something has gone wrong.
In this case you should restart WEKA Explorer.

Loading data
Let’s load the data and look at what is happening in the ‘Preprocess’ window.

The most common and easiest way of loading data into WEKA is from ARFF file, using ‘Open
file…’ button (section 3.2). Click on ‘Open file…’ button and choose “weather.arff” file from your
local filesystem. Note, the data can be loaded from a CSV file as well, because some databases
have the ability to export data only in CSV format.

Once the data is loaded, WEKA recognizes attributes that are shown in the ‘Attribute’ window. Left
panel of ‘Preprocess’ window shows the list of recognized attributes:

No. is a number that identifies the order of the attribute as they are in data file, Selection tick boxes
allow you to select the attributes for working relation, Name is a name of an attribute as it was
declared in the data file.

The ‘Current relation’ box above ‘Attribute’ box displays the base relation (table) name and the current
working relation (which are initially the same) - “weather”, the number of instances - 14 and the number
of attributes - 5.

During the scan of the data, WEKA computes some basic statistics on each attribute. The following
statistics are shown in ‘Selected attribute’ box on the right panel of ‘Preprocess’ window:

Name is the name of an attribute,


Type is most commonly Nominal or Numeric, and
Missing is the number (percentage) of instances in the data for which this attribute is unspecified,
Distinct is the number of different values that the data contains for this attribute, and
Unique is the number (percentage) of instances in the data having a value for this attribute that no
other instances have.

An attribute can be deleted from the ‘Attributes’ window. Highlight an attribute you would like to
delete and hit Delete button on your keyboard.

By clicking on an attribute, you can see the basic statistics on that attribute. The frequency for
each attribute value is shown for categorical attributes. Min, max, mean, standard deviation
(StdDev) is shown for continuous attributes.

Click on attribute Outlook in the ‘Attribute’ window.

Outlook is nominal. Therefore, you can see the following frequency statistics for this attribute in the
‘Selected attributes’ window:
Missing = 0 means that the attribute is specified for all instances (no missing values), Distinct = 3
means that Outlook has three different values: sunny, overcast, rainy, and Unique = 0 means that
other instances do not have the same value as Outlook has.

Just below these values there is a table displaying count of instances of the attribute Outlook. As you
can see, there are three values: sunny with 5 instances, overcast with 4 instances, and rainy with 5
instances. These numbers match the numbers of instances in the base relation and table
“weather.xls”.

Lets take a look at the attribute Temperature.



Temperature is a numeric value; therefore, you can see min, max, means, and standard deviation in
‘Selected Attribute’ window.
Missing = 0 means that the attribute is specified for all instances (no missing values), Distinct = 12
means that Temperature has twelve different values, and
Unique = 10 means that ten of the Temperature values occur in only one instance each (no other instance shares them).
Temperature is a Numeric value; therefore, you can see the statistics describing the distribution of
values in the data - Minimum, Maximum, Mean and Standard Deviation. Minimum = 64 is the lowest
temperature, Maximum = 85 is the highest temperature, mean and standard deviation.
Compare the result with the attribute table “weather.xls”; the numbers in WEKA match the numbers
in the table.

You can select a class in the ‘Class’ pull-down box. The last attribute in the ‘Attributes’ window is
the default class selected in the ‘Class’ pull-down box.
You can Visualize the attributes based on selected class. One way is to visualize selected
attribute based on class selected in the ‘Class’ pull-down window, or visualize all attributes by
clicking on ‘Visualize All’ button.

Setting Filters
Pre-processing tools in WEKA are called “filters”. WEKA contains filters for discretization,
normalization, resampling, attribute selection, transformation and combination of attributes. Some
techniques, such as association rule mining, can only be performed on categorical data. This requires
performing discretization on numeric or continuous attributes. For classification example you do not
need to transform the data. For your practice, suppose you need to perform a test on categorical data.
There are two attributes that need to be converted: ‘temperature’ and ‘humidity’. In other words, you
will keep all of the values for these attributes in the data. This means you can discretize by removing
the keyword "numeric" as the type for the
‘temperature’ attribute and replace it with the set of “nominal” values. You can do this by applying a
filter.
In ‘Filters’ window, click on the ‘Choose’ button.

This will show a pull-down menu with a list of available filters. Select Supervised → Attribute →
Discretize and click on ‘Apply’ button. The filter will convert Numeric values into Nominal.

When filter is chosen, the fields in the window changes to reflect available options.

As you can see, there is no change in the value Outlook. Select the value Temperature and look at the
‘Selected attribute’ box: the ‘Type’ field shows that the attribute type has changed from Numeric to
Nominal. The list has changed as well: instead of statistical values there is a count of instances, and
the total count is 14, which means that there are 14 instances with a Temperature value.

Note, when you right-click on a filter, a ‘GenericObjectEditor’ dialog box comes up on your screen.
The box lets you choose the filter configuration options. The same box can be used for
classifiers, clusterers and association rules.
Clicking on ‘More’ button brings up an ‘Information’ window describing what the different options
can do.

At the bottom of the editor window there are four buttons. ‘Open’ and ‘Save’ buttons allow you to
save object configurations for future use. ‘Cancel’ button allows you to exit without saving
changes. Once you have made changes, click ‘OK’ to apply them.

4. Building “Classifiers”

Classifiers in WEKA are the models for predicting nominal or numeric quantities. The
learning schemes available in WEKA include decision trees and lists, instance-based classifiers,
support vector machines, multi-layer perceptrons, logistic regression, and bayes’ nets. “Meta”-
classifiers include bagging, boosting, stacking, error-correcting output codes, and locally weighted
learning .

Once you have your data set loaded, all the tabs are available to you. Click on the ‘Classify’ tab.

‘Classify’ window comes up on the screen.



Now you can start analyzing the data using the provided algorithms. In this exercise you will
analyze the data with C4.5 algorithm using J48, WEKA’s implementation of decision tree learner.
The sample data used in this exercise is the weather data from the file “weather.arff”. Since C4.5
algorithm can handle numeric attributes, in contrast to the ID3 algorithm from which C4.5 has
evolved, there is no need to discretize any of the attributes. Before you start this exercise, make
sure you do not have filters set in the ‘Preprocess’ window. Filter exercise in section 3.6 was just
a practice.

Choosing a Classifier

Click on ‘Choose’ button in the ‘Classifier’ box just below the tabs and select C4.5
classifier WEKA → Classifiers → Trees → J48.

Setting Test Options

Before you run the classification algorithm, you need to set test options. Set test options in
the ‘Test options’ box. The test options that available to you are [2]:
1. Use training set. Evaluates the classifier on how well it predicts the class of the
instances it was trained on.
2. Supplied test set. Evaluates the classifier on how well it predicts the class of a set of
instances loaded from a file. Clicking on the ‘Set…’ button brings up a dialog allowing
you to choose the file to test on.
3. Cross-validation. Evaluates the classifier by cross-validation, using the number of folds
that are entered in the ‘Folds’ text field.
4. Percentage split. Evaluates the classifier on how well it predicts a certain percentage of
the data, which is held out for testing. The amount of data held out depends on the value
entered in the ‘%’ field.

In this exercise you will evaluate classifier based on how well it predicts 66% of the
tested data. Check ‘Percentage split’ radio-button and keep it as default 66%. Click on ‘More
options…’ button.

Identify what is included into the output. In the ‘Classifier evaluation options’ make sure that the
following options are checked [2]:

1. Output model. The output is the classification model on the full training set, so that it
can be viewed, visualized, etc.
2. Output per-class stats. The precision/recall and true/false statistics for each class
output.
3. Output confusion matrix. The confusion matrix of the classifier’s predictions is included
in the output.
4. Store predictions for visualization. The classifier’s predictions are remembered so
that they can be visualized.
5. Set ‘Random seed for Xval / % Split’ to 1. This specifies the random seed used when
randomizing the data before it is divided up for evaluation purposes.

The remaining options that you do not use in this exercise but that are available to you are:

6. Output entropy evaluation measures. Entropy evaluation measures are included in


the output.
7. Output predictions. The classifier’s predictions are remembered so that they can be
visualized.

Once the options have been specified, you can run the classification algorithm. Click on ‘Start’
button to start the learning process. You can stop learning process at any time by clicking on ‘Stop’
button.

When training set is complete, the ‘Classifier’ output area on the right panel of ‘Classify’
window is filled with text describing the results of training and testing. A new entry appears in the
‘Result list’ box on the left panel of ‘Classify’ window.

Analyzing Results

=== Run information ===

Scheme:     weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:   weather
Instances:  14
Attributes: 5 (outlook, temperature, humidity, windy, play)
Test mode:  split 66% train, remainder test

Run Information gives you the following: the algorithm you used (J48), the relation name (“weather”), the number of instances in the relation (14), and the number of attributes in the relation (5) together with the list of the attributes: outlook, temperature, humidity, windy, play. It also shows the test mode you selected: split = 66%.

=== Classifier model (full training set) ===

J48 pruned tree
------------------

outlook = sunny
|   humidity <= 75: yes (2.0)
|   humidity > 75: no (3.0)
outlook = overcast: yes (4.0)
outlook = rainy
|   windy = t: no (2.0)
|   windy = f: yes (3.0)

Number of Leaves  : 5
Size of the tree  : 8

Time taken to build model: 0.06 seconds

The classifier model is a pruned decision tree in textual form that was produced on the full training data. As you can see, the first split is on the ‘outlook’ attribute; at the second level, the splits are on ‘humidity’ and ‘windy’. In the tree structure, a colon represents the class label that has been assigned to a particular leaf, followed by the number of instances that reach that leaf. Below the tree structure, there are the number of leaves (which is 5) and the number of nodes in the tree, i.e. the size of the tree (which is 8). The program also gives the time it took to build the model, which is 0.06 seconds.

=== Evaluation on test split ===
=== Summary ===

Correctly Classified Instances        2        40      %
Incorrectly Classified Instances      3        60      %
Kappa statistic                      -0.3636
Mean absolute error                   0.6
Root mean squared error               0.7746
Relative absolute error             126.9231 %
Root relative squared error         157.6801 %
Total Number of Instances             5

This part of the output gives estimates of the tree’s predictive performance, generated by WEKA’s evaluation module. It outputs the list of statistics summarizing how accurately the classifier was able to predict the true class of the instances under the chosen test mode. In this case only 40% of the 5 test instances have been classified correctly. This indicates that the results obtained from the training data are not optimistic compared with what might be obtained from an independent test set from the same source. In addition to the classification error, the evaluation output includes measurements derived from the class probabilities assigned by the tree. More specifically, it outputs the mean absolute error (0.6) of the probability estimates, and the root mean squared error (0.77), which is the square root of the quadratic loss; the mean absolute error is calculated in a similar way by using the absolute instead of the squared difference. The reason that the errors are not 1 or 0 is that not all training instances are classified correctly.

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.667     1         0.5         0.667    0.571       yes
0         0.333     0           0        0           no

Detailed Accuracy By Class demonstrates a more detailed per-class breakdown of the classifier’s prediction accuracy.

=== Confusion Matrix ===

 a b   <-- classified as
 2 1 | a = yes
 2 0 | b = no

From the Confusion Matrix you can see that one instance of class ‘yes’ has been assigned to class ‘no’, and two of class ‘no’ are assigned to class ‘yes’.

Visualization of Results

After training a classifier, the result list adds an entry.

WEKA lets you see a graphical representation of the classification tree. Right-click on the entry
in ‘Result list’ for which you would like to visualize a tree. It invokes a menu containing the
following items:

Select the item ‘Visualize tree’; a new window comes up to the screen displaying the tree.

WEKA also lets you visualize classification errors. Right-click on the entry in ‘Result list’ again
and select ‘Visualize classifier errors’ from the menu:

‘Weka Classifier Visualize’ window displaying graph appears on the screen.



On the ‘Weka Classifier Visualize’ window, beneath the X-axis selector there is a drop- down list,
‘Colour’, for choosing the color scheme. This allows you to choose the color of points based on the
attribute selected. Below the plot area, there is a legend that describes what values the colors
correspond to. In your example, red represents ‘no’, while blue represents ‘yes’. For better visibility
you should change the color of label ‘yes’. Left-click on ‘yes’ in the ‘Class colour’ box and select lighter
color from the color palette.

To the right of the plot area there is a series of horizontal strips. Each strip represents an attribute, and
the dots within it show the distribution values of the attribute. You can choose what axes are used in
the main graph by clicking on these strips (left-click changes X-axis, right- click changes Y- axis).
Change X - axis to ‘Outlook’ attribute and Y - axis to ‘Play’. The instances are spread out in the plot
area and concentration points are not visible. Keep sliding ‘Jitter’, a random displacement given to
all points in the plot, to the right, until you can spot concentration points.

On the plot you can see the results of classification. Correctly classified instances are represented
as crosses; incorrectly classified ones are represented as squares. In this example, in the lower left corner
you can see a blue cross indicating a correctly classified instance: if Outlook = ‘sunny’ → play = ‘yes’.
Rahul Shinde Roll No.42

Practical No. 11 & 12

Aim: Demonstration of preprocessing on dataset customer.arff. Attribute
selection and normalization; draw various graphs using WEKA.

1. Fire up WEKA to get the GUI Chooser panel. Select Explorer from the four
choices on the right side.

2. We are on Preprocess now. Click the Open file button to bring up a standard
dialog through which you can select a file. Choose the customer_labThree.csv
file.
3. To perform classification with Weka, the last attribute in the dataset is taken
as the class label and it should be nominal. Since the last attribute of the dataset
customer_labThree.csv is of numeric type (1/0), we should convert it to nominal
type in the next step.
4. Unsupervised attribute filter – NumericToNominal is chosen to perform this
conversion. Since we would like to convert the last attribute only, change the
attributeIndices to last.

5. After applying the filter, the last attribute becomes nominal type and it is taken
as the class label for the dataset – now the data set is visualized in two colors.

6. If the class attribute is not the last attribute, you could set it in edit window.

7. You should also convert the types of the other attributes. Attributes region,
townsize, agecat, jobcat, empcat, card2tenurecat, and internet are all nominal
values; however, they are treated as numeric type by Weka. And attributes
gender, union, equip, wireless, called, callwait, forward, confer, ebill are binary
values;
they are treated as numeric types as well. NumericToNominal filter should be
applied to convert them. You could also normalize attribute educat to [0, 1] since
education categories are rankings.

Attribute Selection - Since not all attributes are relevant to the classification
job, you should perform attribute selection before training the classifier.

8. You could remove irrelevant attributes by hand. For example, the first attribute
custId should be removed. Select it and click Remove button to remove it.
9. You also could run automatic attribute selection. We have introduced two
methods of evaluating attributes individually – InfoGainAttributeEval and
ChiSquaredAttributeEval. The default attribute selection method of Weka is
CfsSubsetEval, which evaluates subsets of attributes.

10. To use evaluator InfoGainAttributeEval, a search method Ranker is selected to
rank all attributes according to the evaluation results. We use the full dataset as the
training dataset. The results show that the first 8 attributes are good. (A plain-Python
sketch of this kind of information-gain scoring appears at the end of this practical.)

11. Run feature selection the second time with CfsSubsetEval and BestFirst search
method. Compare results of two feature selection methods.
12. If you decide to reduce the dataset by removing unimportant attributes, you
could choose to save the reduced dataset by right-clicking the Result list. Save the
file name as customer.arff.

Naïve Bayes Classifier: bayes/NaïveBayes

13. Open the saved processed data file customer.arff and then click Classify Tab on
top of the window. Click Choose button under Classifier. The drop-down list of
all classifiers shows. Choose NaiveBayes from the bayes folder.
14. Left-click the field of Classifier and choose Show Property from the drop-down list.
The property window of NaiveBayes opens. If you do not want to use the Normal
Distribution for numeric data, set useKernelEstimator to true; you also could
perform supervised discretization on numeric data by setting
useSupervisedDiscretization to true. Click OK button to save all the settings.

15. To partition the training data set and test data set, choose 10-fold
cross-validation.
16. Click Start button on the left of the window; the algorithm begins to run.
The output is shown in the right window.

[Screenshot of the NaiveBayes output: parameters of normal distributions for numeric attributes; frequency counts of nominal values (NaiveBayes avoids zero frequencies by applying the Laplace correction); and the accuracy.]

K-Nearest-Neighbor: lazy/IBK

17. We would like to perform K-Nearest-Neighbor classification on the same
dataset. You could try different K and see what value gives a better result.
Compare the results with the Naïve Bayes classifier.

Decision Tree: trees/J48 (Implementing C4.5)

18. We would like to build a Decision Tree model on the same given training data
set. Take all default values of the parameters.
19. To visualize the decision tree we build, right-click the Result list item for J48.

20. All trained classification models can be saved by right-clicking the Result list items.

Ensemble (Metalearning) classifier.meta.Voting

21. You could combine multiple classifiers to perform an ensemble method.
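For reference, the InfoGainAttributeEval evaluator used in step 10 ranks an attribute A by its information gain, IG(S, A) = H(S) - Σv (|Sv| / |S|) · H(Sv), where H is entropy and Sv is the subset of instances with value v. The plain-Python sketch below illustrates that formula on a hypothetical nominal attribute; it is an illustration of the computation, not WEKA's actual code.

# Information gain of a nominal attribute with respect to a class label.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(attr_values, labels):
    n = len(labels)
    gain = entropy(labels)
    for v in set(attr_values):
        subset = [l for a, l in zip(attr_values, labels) if a == v]
        gain -= (len(subset) / n) * entropy(subset)   # weighted child entropy
    return gain

# Hypothetical column and class label (e.g. 'internet' vs. a 0/1 class):
attr  = ["yes", "yes", "no", "no", "yes", "no"]
label = ["1", "1", "0", "0", "1", "1"]
print(info_gain(attr, label))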


Rahul Shinde

Practical No. 13
Aim: Perform association technique on customer dataset I (implementing Apriori algorithm on customer dataset).

Apriori Algorithm Example:

(i) Dataset:

TID | Items
T1 | I1, I2, I5
T2 | I2, I4
T3 | I2, I3
T4 | I1, I2, I4
T5 | I1, I3
T6 | I2, I3
T7 | I1, I3
T8 | I1, I2, I3, I5
T9 | I1, I2, I3

Minimum support count is 2.
Minimum confidence is 60%.

Step 1: K = 1
(i) Create candidate set C1 and find the support count of each item by scanning the dataset:

Itemset | Sup-count
I1 | 6
I2 | 7
I3 | 6
I4 | 2
I5 | 2

(ii) Compare each candidate item's support count with the minimum support count.
This gives us itemset L1:

Itemset | Sup-count
I1 | 6
I2 | 7
I3 | 6
I4 | 2
I5 | 2

Step 2: K = 2
(i) Generate candidate set C2 using L1 (this is called the join step). The condition of joining Lk-1 and Lk-1 is that they should have (k-2) elements in common. Now find the support count of these itemsets by searching in the dataset:

Itemset | Sup-count
{I1, I2} | 4
{I1, I3} | 4
{I1, I4} | 1
{I1, I5} | 2
{I2, I3} | 4
{I2, I4} | 2
{I2, I5} | 2
{I3, I4} | 0
{I3, I5} | 1
{I4, I5} | 0

(ii) Compare the candidate set (C2) support counts with the minimum support count.
This gives us itemset L2:

Itemset | Sup-count
{I1, I2} | 4
{I1, I3} | 4
{I1, I5} | 2
{I2, I3} | 4
{I2, I4} | 2
{I2, I5} | 2

Step 3: K = 3
(i) Generate candidate set C3 using L2 (join step) and find the support count of the remaining itemsets by searching in the dataset:

Itemset | Sup-count
{I1, I2, I3} | 2
{I1, I2, I5} | 2

(ii) Compare the candidate set (C3) support counts with the minimum support count.
This gives us itemset L3:

Itemset | Sup-count
{I1, I2, I3} | 2
{I1, I2, I5} | 2
Step 4:
We stop here because no further frequent itemsets are found.

Confidence:
Minimum confidence is 60%.
Confidence(A → B) = Support count(A ∪ B) / Support count(A)

Itemset: {I1, I2, I3} // from L3

Rules can be:
1. {I1, I2} → {I3} // confidence = sup(I1, I2, I3) / sup(I1, I2) = 2/4 × 100 = 50%
2. {I1, I3} → {I2} // confidence = sup(I1, I2, I3) / sup(I1, I3) = 2/4 × 100 = 50%
3. {I2, I3} → {I1} // confidence = sup(I1, I2, I3) / sup(I2, I3) = 2/4 × 100 = 50%
4. {I1} → {I2, I3} // confidence = sup(I1, I2, I3) / sup(I1) = 2/6 × 100 = 33%
5. {I2} → {I1, I3} // confidence = sup(I1, I2, I3) / sup(I2) = 2/7 × 100 = 28%
6. {I3} → {I1, I2} // confidence = sup(I1, I2, I3) / sup(I3) = 2/6 × 100 = 33%

Here the minimum confidence is 60%, so none of these rules is strong. Now, we find strong association rules with the help of the second itemset, {I1, I2, I5}.
Itemset: {I1, I2, I5} // from L3

Rules can be:
1. {I1, I2} → {I5} // confidence = sup(I1, I2, I5) / sup(I1, I2) = 2/4 × 100 = 50%
2. {I1, I5} → {I2} // confidence = sup(I1, I2, I5) / sup(I1, I5) = 2/2 × 100 = 100%
3. {I2, I5} → {I1} // confidence = sup(I1, I2, I5) / sup(I2, I5) = 2/2 × 100 = 100%
4. {I1} → {I2, I5} // confidence = sup(I1, I2, I5) / sup(I1) = 2/6 × 100 = 33%
5. {I2} → {I1, I5} // confidence = sup(I1, I2, I5) / sup(I2) = 2/7 × 100 = 28%
6. {I5} → {I1, I2} // confidence = sup(I1, I2, I5) / sup(I5) = 2/2 × 100 = 100%

If the minimum confidence is 60%, then the following rules can be considered as strong association rules:
{I1, I5} → {I2}, confidence = 100%
{I2, I5} → {I1}, confidence = 100%
{I5} → {I1, I2}, confidence = 100%
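The candidate-generation and support-counting steps above can be cross-checked with a short script. This sketch (not part of the original journal) brute-forces the frequent itemsets for the same nine transactions and the same minimum support count of 2:

# Brute-force frequent-itemset mining on the worked example's transactions.
from itertools import combinations

transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
min_sup = 2
items = sorted(set().union(*transactions))

for k in range(1, len(items) + 1):
    frequent = {}
    for cand in combinations(items, k):
        sup = sum(1 for t in transactions if set(cand) <= t)  # support count
        if sup >= min_sup:
            frequent[cand] = sup
    if not frequent:
        break                                  # no frequent k-itemsets: stop
    print(f"L{k}:", frequent)

Running it prints the same L1, L2 and L3 tables derived by hand above. (Apriori proper would prune candidates using Lk-1 instead of enumerating all combinations; the brute force is enough to verify the counts.)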

Rahul Shinde Roll No.42

Practical No. 13
Aim: Perform association technique on customer dataset using Apriori algorithm.

Association rule mining is a technique to identify underlying relations between


different items. Take an example of a Super Market where customers can buy a variety
of items. Usually, there is a pattern in what the customers buy. For instance, mothers
with babies buy baby products such as milk and diapers. Damsels may buy makeup
items whereas bachelors may buy beers and chips etc. In short, transactions involve a
pattern. More profit can be generated if the relationship between the items purchased
in different transactions can be identified.
For instance, if items A and B are bought together more frequently, then several steps
can be taken to increase the profit. For example:
1. A and B can be placed together so that when a customer buys one of the
products he doesn't have to go far away to buy the other product.
2. People who buy one of the products can be targeted through an advertisement
campaign to buy the other.
3. Collective discounts can be offered on these products if the customer buys both
of them.
4. Both A and B can be packaged together.

The process of identifying an association between products is called association rule mining.

Apriori Algorithm for Association Rule Mining
Different statistical algorithms have been developed to implement association rule
mining, and Apriori is one such algorithm. In this article we will study the theory behind
the Apriori algorithm and will later implement the Apriori algorithm in Python.
Theory of Apriori Algorithm
There are three major components of Apriori algorithm:
❖ Support
❖ Confidence
❖ Lift
We will explain these three concepts with the help of an example.
Suppose we have a record of one thousand customer transactions, and we want to find the Support, Confidence, and Lift for two items, e.g. burgers and ketchup. Out of one thousand transactions, 100 contain ketchup while 150 contain a burger. Out of the 150 transactions where a burger is purchased, 50 transactions contain ketchup as well. Using this data, we want to find the support, confidence, and lift.

Support
Support refers to the default popularity of an item and can be calculated by dividing the number of transactions containing a particular item by the total number of transactions. Suppose we want to find the support for item B. This can be calculated as:
Support(B) = (Transactions containing B) / (Total Transactions)

For instance, if out of 1000 transactions, 100 transactions contain ketchup, then the support for item Ketchup can be calculated as:
Support(Ketchup) = (Transactions containing Ketchup) / (Total Transactions)
Support(Ketchup) = 100 / 1000 = 10%

Confidence
Confidence refers to the likelihood that an item B is also bought if item A is bought. It
can be calculated by finding the number of transactions where A and B are bought
together, divided by total number of transactions where A is bought. Mathematically,
it can be represented as:
Confidence(A → B) = (Transactions containing both A and B) / (Transactions containing A)

Coming back to our problem, we had 50 transactions where Burger and Ketchup were bought together, while in 150 transactions a burger was bought. Then the likelihood of buying ketchup when a burger is bought can be represented as the confidence of Burger → Ketchup and can be mathematically written as:
Confidence(Burger → Ketchup) = (Transactions containing both Burger and Ketchup) / (Transactions containing Burger)
Confidence(Burger → Ketchup) = 50 / 150 = 33.3%

You may notice that this is similar to what you'd see in the Naïve Bayes algorithm; however, the two algorithms are meant for different types of problems.

Lift

Lift(A → B) refers to the increase in the ratio of the sale of B when A is sold. Lift(A → B) can be calculated by dividing Confidence(A → B) by Support(B). Mathematically it can be represented as:
Lift(A → B) = Confidence(A → B) / Support(B)

Coming back to our Burger and Ketchup problem, Lift(Burger → Ketchup) can be calculated as:
Lift(Burger → Ketchup) = Confidence(Burger → Ketchup) / Support(Ketchup)
Lift(Burger → Ketchup) = 33.3 / 10 = 3.33

Lift basically tells us that the likelihood of buying a Burger and Ketchup together is 3.33 times higher than the likelihood of just buying the ketchup. A Lift of 1 means there is no association between products A and B. A Lift greater than 1 means products A and B are more likely to be bought together. Finally, a Lift of less than 1 refers to the case where two products are unlikely to be bought together.
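As a quick sanity check on these three formulas, here is a small Python snippet using the burger/ketchup numbers from the example above:

# Numbers from the worked example above.
total = 1000               # total transactions
ketchup = 100              # transactions containing ketchup
burger = 150               # transactions containing a burger
both = 50                  # transactions containing both

support_ketchup = ketchup / total        # 0.10  (10%)
confidence = both / burger               # 0.333 (33.3%)
lift = confidence / support_ketchup      # 3.33
print(support_ketchup, confidence, lift)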

Steps Involved in Apriori Algorithm


For large sets of data, there can be hundreds of items in hundreds of thousands of transactions. The Apriori algorithm tries to extract rules for each possible combination of items. For instance, Lift can be calculated for item 1 and item 2, item 1 and item 3, item 1 and item 4, and then item 2 and item 3, item 2 and item 4, and then combinations of items, e.g. item 1, item 2 and item 3; similarly item 1, item 2 and item 4, and so on.
As you can see from the above example, this process can be extremely slow due to the
number of combinations. To speed up the process, we need to perform the following
steps:
1. Set a minimum value for support and confidence. This means that we are only
interested in finding rules for the items that have certain default existence (e.g.
support) and have a minimum value for co-occurrence with other items (e.g.
confidence).
2. Extract all the subsets having higher value of support than minimum threshold.
3. Select all the rules from the subsets with confidence value higher than
minimum threshold.
4. Order the rules by descending order of Lift.

Implementing Apriori Algorithm with Python


Enough of theory, now is the time to see the Apriori algorithm in action. In this section we will use the Apriori algorithm to find rules that describe associations between different products, given 7500 transactions over the course of a week at a French retail store. The dataset can be downloaded from the following link:
https://drive.google.com/file/d/1y5DYn0dGoSbC22xowBq2d4po6h1JxcTQ/view?usp=sharing
Another interesting point is that we do not need to write the script to calculate support, confidence, and lift for all the possible combinations of items. We will use an off-the-shelf library where all of the code has already been implemented.
The library I'm referring to is apyori, and its source is available online. I suggest you download and install the library in the default path for your Python libraries before proceeding.
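If the library is not yet installed, it can typically be fetched from PyPI (assuming pip is available in your environment):

pip install apyori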
Note: All the scripts in this article have been executed using the Spyder IDE for Python.
Follow these steps to implement the Apriori algorithm in Python:
Import the Libraries
The first step, as always, is to import the required libraries. Execute the following script to do so:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from apyori import apriori

In the script above we import pandas, numpy, pyplot, and apriori libraries.

Importing the Dataset


Now let's import the dataset and see what we're working with. Download the
dataset and place it in the "Datasets" folder of the "D" drive (or change the code
below to match the path of the file on your computer) and execute the following
script:
store_data = pd.read_csv('D:\\Datasets\\store_data.csv')

Let's call the head() function to see how the dataset looks:


store_data.head()
A snippet of the dataset is shown in the output above. If you look carefully at the data, you can see that the header is actually the first transaction. Each row corresponds to a transaction and each column corresponds to an item purchased in that specific transaction. NaN tells us that the item represented by the column was not purchased in that specific transaction.
In this dataset there is no header row. But by default, the pd.read_csv function treats the first row as the header. To get rid of this problem, add the header=None option to the pd.read_csv function, as shown below:

store_data = pd.read_csv('D:\\Datasets\\store_data.csv', header=None)

Now execute the head() function:

store_data.head()

In this updated output you will see that the first line is now treated as a record instead of the header, as shown below:

Now we will use the Apriori algorithm to find out which items are commonly sold together, so that the store owner can take action to place the related items together or advertise them together in order to increase profit.

Data Preprocessing
The Apriori library we are going to use requires our dataset to be in the form of a list
of lists, where the whole dataset is a big list and each transaction in the dataset is an
inner list within the outer big list. Currently we have data in the form of a pandas dataframe. To convert our pandas dataframe into a list of lists, execute the following script:

records = []
for i in range(0, 7501):
    records.append([str(store_data.values[i, j]) for j in range(0, 20)])
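Note that str() converts missing cells (NaN) into the literal string 'nan', which then appears as if it were an item. As an illustrative variant (my own refinement, not part of the original script), the missing values can be skipped while building the list:

records = []
for i in range(0, 7501):
    # Keep only the items actually purchased in this transaction.
    records.append([str(v) for v in store_data.values[i] if not pd.isna(v)])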

Applying Apriori
The next step is to apply the Apriori algorithm on the dataset. To do so, we can use the apriori class that we imported from the apyori library.
The apriori class requires some parameter values to work. The first parameter is the list of lists that you want to extract rules from. The second parameter is the min_support parameter. This parameter is used to select the items with support values greater than the value specified by the parameter. Next, the min_confidence parameter filters those rules that have confidence greater than the confidence threshold specified by the parameter. Similarly, the min_lift parameter specifies the minimum lift value for the shortlisted rules. Finally, the min_length parameter specifies the minimum number of items that you want in your rules.
Let's suppose that we want rules for only those items that are purchased at least 5 times a day, or 7 x 5 = 35 times in one week, since our dataset is for a one-week time period. The support for those items is then 35/7500 ≈ 0.0047; we round this down and use 0.0045. The minimum confidence for the rules is 20% or 0.2. Similarly, we specify the value for lift as 3 and finally min_length is 2, since we want at least two products in our rules. These values are mostly just arbitrarily chosen, so you can play with them and see what difference they make in the rules you get back out.
Execute the following script:

association_rules = apriori(records, min_support=0.0045, min_confidence=0.2, min_lift=3, min_length=2)
association_results = list(association_rules)
In the second line here we convert the rules found by the apriori class into a list, since it is easier to view the results in this form.
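As a side note, each element of association_results is (as far as apyori's data structures go) a RelationRecord namedtuple, so its fields can also be read by name rather than by position, which some may find more readable:

# Illustrative sketch: inspect the first mined rule by field name.
first = association_results[0]
print(first.items)          # the items involved in the rule
print(first.support)        # the rule's support
stat = first.ordered_statistics[0]
print(stat.items_base, '->', stat.items_add)
print(stat.confidence, stat.lift)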

Viewing the Results


Let's first find the total number of rules mined by the apriori class. Execute the following script:

print(len(association_results))

The script above should return 48. Each item corresponds to one rule.
Let's print the first item in the association_results list to see the first rule. Execute the following script:

print(association_results[0])

The output should look like this:


RelationRecord(items=frozenset({'light cream', 'chicken'}),
support=0.004532728969470737,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'light cream'}),
items_add=frozenset({'chicken'}), confidence=0.29059829059829057,
lift=4.84395061728395)])

The first element of the list is itself a record containing three parts. The first part shows the grocery items in the rule.
For instance, from the first item, we can see that light cream and chicken are commonly bought together. This makes sense since people who purchase light cream are careful about what they eat and hence are more likely to buy chicken, i.e. white meat, instead of red meat, i.e. beef. Or this could mean that light cream is commonly used in recipes for chicken.
The support value for the first rule is 0.0045. This number is calculated by dividing the number of transactions containing light cream by the total number of transactions. The confidence level for the rule is 0.2905, which shows that out of all the transactions that contain light cream, 29.05% also contain chicken. Finally, the lift of 4.84 tells us that chicken is 4.84 times more likely to be bought by the customers who buy light cream, compared to the default likelihood of the sale of chicken.
The following script displays the rule, the support, the confidence, and the lift for each rule in a clearer way:

for item in association_results:
    # The first index of the record contains the base item and the add item.
    pair = item[0]
    items = [x for x in pair]
    print("Rule: " + items[0] + " -> " + items[1])
    # The second index of the record is the support.
    print("Support: " + str(item[1]))
    # The third index holds the ordered statistics: confidence and lift.
    print("Confidence: " + str(item[2][0][2]))
    print("Lift: " + str(item[2][0][3]))
    print("=====================================")

If you execute the above script, you will see all the rules returned by the apriori class. The first four rules look like this:
Rule: light cream -> chicken
Support: 0.004532728969470737
Confidence: 0.29059829059829057
Lift: 4.84395061728395
=====================================
Rule: mushroom cream sauce -> escalope
Support: 0.005732568990801126
Confidence: 0.3006993006993007
Lift: 3.790832696715049
=====================================
Rule: escalope -> pasta
Support: 0.005865884548726837
Confidence: 0.3728813559322034
Lift: 4.700811850163794
=====================================
Rule: ground beef -> herb & pepper
Support: 0.015997866951073192
Confidence: 0.3234501347708895
Lift: 3.2919938411349285
=====================================

We have already discussed the first rule. Let's now discuss the second rule. The second rule states that mushroom cream sauce and escalope are bought together frequently. The support for mushroom cream sauce is 0.0057. The confidence for this rule is 0.3006, which means that out of all the transactions containing mushroom cream sauce, 30.06% are likely to contain escalope as well. Finally, a lift of 3.79 shows that escalope is 3.79 times more likely to be bought by the customers that buy mushroom cream sauce, compared to its default sale.
Conclusion
Association rule mining algorithms such as Apriori are very useful for finding simple associations between our data items. They are easy to implement and have high explainability. However, for more advanced insights, such as those used by Google or Amazon, more complex algorithms, such as recommender systems, are used. Still, you can probably see that this method is a very simple way to get basic associations if that's all your use case needs.

Practical No. 14

Aim: Perform Association Technique on customer dataset II (using classification algorithm of KNN on sample dataset).

Theory :

KNN Algorithm - Finding Nearest Neighbors

Introduction
K-nearest neighbors (KNN) algorithm is a type of supervised ML algorithm which
can be used for both classification as well as regression predictive problems.
However, it is mainly used for classification predictive problems in industry. The
following two properties would define KNN well −
➢ Lazy learning algorithm − KNN is a lazy learning algorithm because it does not have a specialized training phase; instead it uses all of the training data at classification time.
➢ Non-parametric learning algorithm − KNN is also a non-parametric
learning algorithm because it doesn’t assume anything about the
underlying data.

Working of KNN Algorithm


K-nearest neighbors (KNN) algorithm uses ‘feature similarity’ to predict the
values of new datapoints which further means that the new data point will be
assigned a value based on how closely it matches the points in the training set.
We can understand its working with the help of the following steps −
Step 1 − For implementing any algorithm, we need a dataset. So during the first step of KNN, we must load the training as well as the test data.
Step 2 − Next, we need to choose the value of K i.e. the nearest data points. K
can be any integer.
Step 3 − For each point in the test data do the following −
• 3.1 − Calculate the distance between the test data and each row of the training data with the help of any of the methods, namely Euclidean, Manhattan, or Hamming distance. The most commonly used method to calculate distance is Euclidean.
• 3.2 − Now, based on the distance value, sort them in ascending order.
• 3.3 − Next, it will choose the top K rows from the sorted array.
• 3.4 − Now, it will assign a class to the test point based on the most frequent class of these rows.
Step 4 − End
Example
The following is an example to understand the concept of K and the working of the KNN algorithm −
Suppose we have a dataset which can be plotted as follows −

Now, we need to classify a new data point with a black dot (at point 60,60) into the blue or red class. We are assuming K = 3, i.e. it would find the three nearest data points. It is shown in the next diagram −
We can see in the above diagram the three nearest neighbors of the data point with the black dot. Among those three, two of them lie in the red class; hence the black dot will also be assigned to the red class.
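To make the steps above concrete, here is a minimal from-scratch sketch in Python. The red/blue points are made-up values chosen so that, as in the diagram, two of the three nearest neighbors of (60, 60) are red:

import numpy as np
from collections import Counter

# Hypothetical 2-D training points with their class labels.
X_train = np.array([[58, 57], [62, 63], [40, 30],
                    [59, 62], [75, 80], [20, 70]], dtype=float)
y_train = np.array(['red', 'red', 'red', 'blue', 'blue', 'blue'])

def knn_predict(x, k=3):
    # Step 3.1: Euclidean distance from the query point to every training row.
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # Steps 3.2 and 3.3: sort the distances and take the top K rows.
    nearest = np.argsort(dists)[:k]
    # Step 3.4: assign the most frequent class among those K rows.
    return Counter(y_train[nearest]).most_common(1)[0][0]

print(knn_predict(np.array([60.0, 60.0])))  # prints 'red'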

Implementation in Python
As we know K-nearest neighbors (KNN) algorithm can be used for both
classification as well as regression. The following are the recipes in Python to use
KNN as classifier as well as regressor −

KNN as Classifier
First, start with importing necessary Python packages −

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Next, download the iris dataset from its weblink as follows −

path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

Next, we need to assign column names to the dataset as follows −

headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

Now, we need to read the dataset into a pandas dataframe as follows −

dataset = pd.read_csv(path, names=headernames)
dataset.head()
   sepal-length  sepal-width  petal-length  petal-width        Class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa

Data preprocessing will be done with the help of the following script lines.

X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

Next, we will divide the data into train and test split. The following code will split the dataset into 60% training data and 40% testing data −

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40)

Next, data scaling will be done as follows −

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
Next, train the model with the help of the KNeighborsClassifier class of sklearn as follows −

from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=8)
classifier.fit(X_train, y_train)

At last we need to make a prediction. It can be done with the help of the following script −

y_pred = classifier.predict(X_test)

Next, print the results as follows −

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)
print("Classification Report:")
print(result1)
result2 = accuracy_score(y_test, y_pred)
print("Accuracy:", result2)

Output

Confusion Matrix:
[[21  0  0]
 [ 0 16  0]
 [ 0  7 16]]
Classification Report:
                 precision    recall  f1-score   support
    Iris-setosa       1.00      1.00      1.00        21
Iris-versicolor       0.70      1.00      0.82        16
 Iris-virginica       1.00      0.70      0.82        23
      micro avg       0.88      0.88      0.88        60
      macro avg       0.90      0.90      0.88        60
   weighted avg       0.92      0.88      0.88        60

Accuracy: 0.8833333333333333

KNN as Regressor
First, start with importing necessary Python packages −

import numpy as np
import pandas as pd

Next, download the iris dataset from its weblink as follows −

path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

Next, we need to assign column names to the dataset as follows −

headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

Now, we need to read the dataset into a pandas dataframe as follows −

data = pd.read_csv(path, names=headernames)
array = data.values
X = array[:, :2]
Y = array[:, 2]
data.shape

output: (150, 5)

Next, import KNeighborsRegressor from sklearn to fit the model −

from sklearn.neighbors import KNeighborsRegressor
knnr = KNeighborsRegressor(n_neighbors=10)
knnr.fit(X, Y)

At last, we can find the MSE as follows −

print("The MSE is:", format(np.power(Y - knnr.predict(X), 2).mean()))

Output
The MSE is: 0.12226666666666669
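Equivalently, the same quantity can be computed with scikit-learn's built-in metric (a small illustrative addition, reusing the knnr, X and Y objects fitted above):

from sklearn.metrics import mean_squared_error

# Same value as the manual np.power(...) computation above.
print("The MSE is:", mean_squared_error(Y, knnr.predict(X)))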

Pros and Cons of KNN


Pros
• It is a very simple algorithm to understand and interpret.
• It is very useful for nonlinear data because there is no assumption about the data in this algorithm.
• It is a versatile algorithm as we can use it for classification as well as regression.
• It has relatively high accuracy, but there are much better supervised learning models than KNN.

Cons
• It is a computationally expensive algorithm because it stores all the training data.
• High memory storage required as compared to other supervised learning
algorithms.
• Prediction is slow in case of big N.

• It is very sensitive to the scale of data as well as irrelevant features.

Applications of KNN
The following are some of the areas in which KNN can be applied successfully −

Banking System
KNN can be used in a banking system to predict whether an individual is fit for loan approval, and whether that individual has characteristics similar to those of defaulters.

Calculating Credit Ratings


KNN algorithms can be used to find an individual's credit rating by comparing the individual with persons having similar traits.

Politics
With the help of KNN algorithms, we can classify a potential voter into various classes like "Will Vote", "Will Not Vote", "Will Vote for Party 'Congress'", or "Will Vote for Party 'BJP'".
Other areas in which KNN algorithm can be used are Speech Recognition,
Handwriting Detection, Image Recognition and Video Recognition.
