
IE4903 DATA MINING LECTURE HOMEWORK 4

Fato LB-1535459

Q1)

Since the Gini index, entropy, and misclassification error all measure impurity, their values increase as impurity increases. They therefore attain their maxima when the records are distributed equally among all classes, which carries the least information: with n classes and T records at node t, p(i|t) = (T/n)/T = 1/n for every class i.

Maximum Gini index = 1 - sum_i p(i|t)^2 = 1 - n*(1/n)^2 = 1 - 1/n

Maximum entropy = -sum_i p(i|t)*log2 p(i|t) = -n*(1/n)*log2(1/n) = log2 n

Maximum misclassification error = 1 - max_i p(i|t) = 1 - 1/n
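As a quick numerical check (a Python sketch, not part of the homework code), the three measures can be evaluated at the uniform distribution p(i|t) = 1/n:

```python
import math

def gini(p):
    """Gini index of a class-probability distribution p."""
    return 1 - sum(q * q for q in p)

def entropy(p):
    """Entropy (base 2) of a class-probability distribution p."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def misclass(p):
    """Misclassification error of a class-probability distribution p."""
    return 1 - max(p)

n = 4
uniform = [1 / n] * n
print(gini(uniform))      # 1 - 1/n = 0.75
print(entropy(uniform))   # log2(n) = 2.0
print(misclass(uniform))  # 1 - 1/n = 0.75
```

A pure node ([1, 0, 0, 0]) gives 0 for all three measures, confirming that the uniform case is the maximum.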

Q3) Code for the question:


matris = xlsread('Associations.xls');
v = randperm(2000);                      % random permutation for the train/validation split
tmatris = matris; vmatris = matris;
for i = 1:1600                           % first 1600 shuffled rows -> training set
    tmatris(i,:) = matris(v(i),:);
end
tmatris(1601:2000,:) = [];
for i = 1601:2000                        % remaining 400 shuffled rows -> validation set
    vmatris(i-1600,:) = matris(v(i),:);
end
vmatris(401:2000,:) = [];

j = 0; k = 0;
birtmatris = tmatris; sifirtmatris = tmatris;
for i = 1:1600                           % split training rows by class label (column 11)
    if tmatris(i,11) == 1
        j = j + 1;
        birtmatris(j,:) = tmatris(i,:);
    else
        k = k + 1;
        sifirtmatris(k,:) = tmatris(i,:);
    end
end
birtmatris(j+1:1600,:) = [];
sifirtmatris(k+1:1600,:) = [];
pbir = j/1600;                           % prior P(class = 1)
psifir = k/1600;                         % prior P(class = 0)

% Conditional probabilities P(attribute t = 1 | class), estimated from the training set
birgbir = zeros(1,10); sifirgbir = zeros(1,10);
for t = 1:10
    h = 0;
    for i = 1:j
        if birtmatris(i,t) == 1
            h = h + 1;
        end
    end
    birgbir(t) = h/j;                    % P(x_t = 1 | class 1)
    sifirgbir(t) = 1 - birgbir(t);       % P(x_t = 0 | class 1)
end
birgsifir = zeros(1,10); sifirgsifir = zeros(1,10);
for t = 1:10
    h = 0;
    for i = 1:k
        if sifirtmatris(i,t) == 1
            h = h + 1;
        end
    end
    birgsifir(t) = h/k;                  % P(x_t = 1 | class 0)
    sifirgsifir(t) = 1 - birgsifir(t);   % P(x_t = 0 | class 0)
end

% Classify the validation set with naive Bayes
Label = zeros(1,400);
for i = 1:400
    Bir = pbir; Birprime = psifir;       % reset the two posteriors for every record
    for t = 1:10
        if vmatris(i,t) == 1
            Bir = Bir*birgbir(t);
            Birprime = Birprime*birgsifir(t);
        else
            Bir = Bir*sifirgbir(t);
            Birprime = Birprime*sifirgsifir(t);
        end
    end
    if Bir > Birprime
        Label(i) = 1;
    else
        Label(i) = 0;
    end
end
eV = [0 0; 0 0];                         % validation confusion matrix
for i = 1:400
    if Label(i) == vmatris(i,11)+1       % predicted 1, actual 0
        eV(1,2) = eV(1,2)+1;
    elseif Label(i) == vmatris(i,11)-1   % predicted 0, actual 1
        eV(2,1) = eV(2,1)+1;
    elseif Label(i) == 0                 % predicted 0, actual 0
        eV(1,1) = eV(1,1)+1;
    elseif Label(i) == 1                 % predicted 1, actual 1
        eV(2,2) = eV(2,2)+1;
    end
end
ErrorV = (eV(1,2)+eV(2,1))/400;

% Classify the training set the same way
Label = zeros(1,1600);
for i = 1:1600
    Bir = pbir; Birprime = psifir;       % reset the posteriors for every record
    for t = 1:10
        if tmatris(i,t) == 1
            Bir = Bir*birgbir(t);
            Birprime = Birprime*birgsifir(t);
        else
            Bir = Bir*sifirgbir(t);
            Birprime = Birprime*sifirgsifir(t);
        end
    end
    if Bir > Birprime
        Label(i) = 1;
    else
        Label(i) = 0;
    end
end
eT = [0 0; 0 0];                         % training confusion matrix
for i = 1:1600
    if Label(i) == tmatris(i,11)+1
        eT(1,2) = eT(1,2)+1;
    elseif Label(i) == tmatris(i,11)-1
        eT(2,1) = eT(2,1)+1;
    elseif Label(i) == 0
        eT(1,1) = eT(1,1)+1;
    elseif Label(i) == 1
        eT(2,2) = eT(2,2)+1;
    end
end
ErrorT = (eT(1,2)+eT(2,1))/1600;

Since the training and validation sets are split randomly at each run, the error matrices change from run to run. One example output:

eT =
  1397    28
     1   174
ErrorT = 0.13

eV =
   341    17
     1    41
ErrorV = 0.15
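The same naive Bayes logic can be sketched compactly in Python with NumPy (an illustration only: the synthetic data below is a stand-in for Associations.xls, and the variable names are my own, not the script's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 10 binary attributes + a binary class label
X = rng.integers(0, 2, size=(2000, 10))
y = rng.integers(0, 2, size=2000)

# Train/validation split, as in the MATLAB script (1600 / 400)
Xt, yt = X[:1600], y[:1600]
Xv, yv = X[1600:], y[1600:]

# Priors and per-attribute conditionals P(x_t = 1 | class)
p1 = yt.mean()
p0 = 1 - p1
cond1 = Xt[yt == 1].mean(axis=0)   # P(x_t = 1 | class 1)
cond0 = Xt[yt == 0].mean(axis=0)   # P(x_t = 1 | class 0)

def classify(x):
    """Pick the class with the larger naive Bayes score for a binary vector x."""
    s1 = p1 * np.prod(np.where(x == 1, cond1, 1 - cond1))
    s0 = p0 * np.prod(np.where(x == 1, cond0, 1 - cond0))
    return 1 if s1 > s0 else 0

labels = np.array([classify(x) for x in Xv])
error = np.mean(labels != yv)      # validation misclassification rate
```

Because the scores are reset for every record, each row is classified independently, which is the behavior the MATLAB loops are meant to have.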

Q4) Functions used for this question:

1. Function for splitting a node with the minimum split Gini:
function [purity,splitattr,splitindex,splitamount,newmatrix] = splitnode(datamatrix,startindex,endindex)
splitattr = 0; splitindex = 0; splitamount = 0;
newmatrix = datamatrix;
datasayisi = endindex-startindex+1;      % number of records in the node
birler = 0; sifirlar = 0;
% count class-1 and class-0 records over the node's own rows (label in column 7);
% the original loop ran over 1:datasayisi, which reads the wrong rows when startindex > 1
for i = startindex:endindex
    if datamatrix(i,7) == 1
        birler = birler+1;
    else
        sifirlar = sifirlar+1;
    end
end
purity = 0;
if birler == datasayisi || sifirlar == datasayisi
    purity = 1;                          % node is pure, no further split
end

if purity == 0
    bestsplit(1:103,1:6) = 1;            % candidate split Ginis (103 boundaries x 6 attributes)
    splitplace(1:103,1:6) = 0;
    for i = 1:6
        datamatrix(startindex:endindex,:) = sortrows(datamatrix(startindex:endindex,:),i);
        for j = startindex:endindex-1
            if datamatrix(j,7) == datamatrix(j+1,7)
                bestsplit(j,i) = 1;      % no class change at this boundary, skip it
            else
                bestsplit(j,i) = ginihesapla(datamatrix(:,7),startindex,endindex,j);
                splitplace(j,i) = datamatrix(j,i);
            end
        end
    end
    minginis(1:2,1:6) = 0;
    for i = 1:6
        [minginis(1,i),minginis(2,i)] = min(bestsplit(:,i));   % best boundary per attribute
    end
    [~,splitattr] = min(minginis(1,:));  % attribute with the lowest split Gini
    splitindex = minginis(2,splitattr);
    splitamount = splitplace(splitindex,splitattr);
    newmatrix(startindex:endindex,:) = sortrows(datamatrix(startindex:endindex,:),splitattr);
end
end

2. Function calculating the split Gini used by the previous function:


function [gini] = ginihesapla(vektor,startpt,endpt,ayrimindex)
% Weighted Gini of splitting the 0/1 label vector vektor(startpt:endpt) after ayrimindex
toplambiryukari = sum(vektor(startpt:ayrimindex));    % ones in the upper part
toplam1 = ayrimindex-startpt+1;                       % size of the upper part
toplamsifiryukari = toplam1-toplambiryukari;          % zeros in the upper part
toplambirasagi = sum(vektor(ayrimindex+1:endpt));     % ones in the lower part
toplam2 = endpt-ayrimindex;                           % size of the lower part
toplamsifirasagi = toplam2-toplambirasagi;            % zeros in the lower part
toplam = toplam1+toplam2;
gini1 = 1-(toplambiryukari/toplam1)^2-(toplamsifiryukari/toplam1)^2;
gini2 = 1-(toplambirasagi/toplam2)^2-(toplamsifirasagi/toplam2)^2;
gini = (toplam1/toplam)*gini1+(toplam2/toplam)*gini2; % size-weighted average of the two parts
end
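For reference, the same weighted split Gini can be written compactly in Python (a sketch mirroring ginihesapla on a 0/1 label list; the function name is my own):

```python
def split_gini(labels, cut):
    """Weighted Gini of splitting the 0/1 list `labels` after position `cut`."""
    def gini(part):
        p = sum(part) / len(part)          # fraction of ones in this part
        return 1 - p * p - (1 - p) * (1 - p)
    n = len(labels)
    # size-weighted average of the Gini of the two parts
    return (cut / n) * gini(labels[:cut]) + ((n - cut) / n) * gini(labels[cut:])

# A perfect split has Gini 0; a fully mixed split has Gini 0.5
print(split_gini([0, 0, 1, 1], 2))   # 0.0
print(split_gini([0, 1, 0, 1], 2))   # 0.5
```

The split search in splitnode simply evaluates this quantity at every boundary where the class label changes and keeps the minimum.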

3. Main script constructing the decision tree on the training set:


% (earlier lines of the script, not shown here, load the data into `matrix`,
%  permute its 130 rows via `v`, and build the 104-row training set `tmatrix`)
for i = 105:130                          % remaining 26 shuffled rows -> validation set
    vmatrix(i-104,:) = matrix(v(i),:);
end
vmatrix(27:130,:) = [];
treematrix(:,7) = 0;
startindx = 1; endindx = 104;

% Node 1: the root, split over the whole training set
[purity1,splitattr1,splitindex1,splitamount1,newtmatrix1] = splitnode(tmatrix,startindx,endindx);
treematrix(1,:) = [1 startindx endindx purity1 splitattr1 splitindex1 splitamount1];

% Nodes 2 and 3: children of the root
[purity2,splitattr2,splitindex2,splitamount2,newtmatrix2] = splitnode(newtmatrix1,startindx,splitindex1);
treematrix(2,:) = [2 startindx splitindex1 purity2 splitattr2 splitindex2 splitamount2];
[purity3,splitattr3,splitindex3,splitamount3,newtmatrix3] = splitnode(newtmatrix1,splitindex1+1,endindx);
treematrix(3,:) = [3 splitindex1+1 endindx purity3 splitattr3 splitindex3 splitamount3];

% Nodes 4 and 5: children of node 2, split only if node 2 is impure
if purity2 == 0
    [purity4,splitattr4,splitindex4,splitamount4,newtmatrix4] = splitnode(newtmatrix2,startindx,splitindex2);
    treematrix(4,:) = [4 startindx splitindex2 purity4 splitattr4 splitindex4 splitamount4];
    [purity5,splitattr5,splitindex5,splitamount5,newtmatrix5] = splitnode(newtmatrix2,splitindex2+1,splitindex1);
    treematrix(5,:) = [5 splitindex2+1 splitindex1 purity5 splitattr5 splitindex5 splitamount5];
end

% Nodes 6 and 7: children of node 3, split only if node 3 is impure
if purity3 == 0
    [purity6,splitattr6,splitindex6,splitamount6,newtmatrix6] = splitnode(newtmatrix3,splitindex1+1,splitindex3);
    % the end index of node 6 is splitindex3; the original line stored splitindex6,
    % which is 0 whenever node 6 turns out to be pure
    treematrix(6,:) = [6 splitindex1+1 splitindex3 purity6 splitattr6 splitindex6 splitamount6];
    [purity7,splitattr7,splitindex7,splitamount7,newtmatrix7] = splitnode(newtmatrix3,splitindex3+1,endindx);
    treematrix(7,:) = [7 splitindex3+1 endindx purity7 splitattr7 splitindex7 splitamount7];
end

The output is a matrix whose columns are, respectively: node number, start index, end index, purity, split attribute, split index, and the attribute value used to split the node:

treematrix =
     1     1   104     0     1    50  12.77
     2     1    50     1     0     0     0
     3    51   104     0     5    55    87
     4     1    50     1     0     0     0
     5    51    53     1     0     0     0
     6    51     0     1     0     0     0
     7    56   104     1     0     0     0

(Row 6's end index appears as 0 in this run because the script stored splitindex6, which is 0 for a pure node, instead of splitindex3.)
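The greedy procedure that the script unrolls node by node can also be sketched recursively in Python (an illustration of the technique on toy data, not a port of the exact script; rows are tuples of attributes with the class label last):

```python
def split_gini(labels, cut):
    """Weighted Gini of splitting the 0/1 list `labels` after position `cut`."""
    def gini(part):
        p = sum(part) / len(part)
        return 1 - p * p - (1 - p) * (1 - p)
    n = len(labels)
    return (cut / n) * gini(labels[:cut]) + ((n - cut) / n) * gini(labels[cut:])

def grow(rows, node=1, tree=None):
    """Greedy Gini tree: maps node number -> (split attribute, split value) or 'pure'."""
    if tree is None:
        tree = {}
    labels = [r[-1] for r in rows]
    if len(set(labels)) <= 1:
        tree[node] = 'pure'                      # pure node, stop splitting
        return tree
    best = None                                  # (gini, attribute, cut, ordered rows)
    for a in range(len(rows[0]) - 1):
        ordered = sorted(rows, key=lambda r: r[a])
        ys = [r[-1] for r in ordered]
        for cut in range(1, len(ordered)):
            if ys[cut - 1] != ys[cut]:           # only boundaries where the class changes
                g = split_gini(ys, cut)
                if best is None or g < best[0]:
                    best = (g, a, cut, ordered)
    g, a, cut, ordered = best
    tree[node] = (a, ordered[cut - 1][a])        # split attribute and value, as in treematrix
    grow(ordered[:cut], 2 * node, tree)          # children numbered like a binary heap
    grow(ordered[cut:], 2 * node + 1, tree)
    return tree

# Toy data: two attributes + label; attribute 0 separates the classes at value 2
data = [(1, 7, 0), (2, 3, 0), (8, 4, 1), (9, 6, 1)]
tree = grow(data)
print(tree)   # {1: (0, 2), 2: 'pure', 3: 'pure'}
```

The recursion replaces the script's hand-unrolled node-by-node calls, so deeper trees need no extra code.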
