IE4903 Data Mining Lecture Homework 4
Fato LB-1535459
Q1)
Since the Gini index, entropy, and misclassification error all measure impurity, their values increase as impurity increases. They therefore attain their maxima when the records are equally distributed among all classes, which carries the least information: if there are n classes and T records at node t, then

    p(i|t) = (T/n) / T = 1/n   for every class i.

Maximum Gini index = 1 - Σᵢ p(i|t)² = 1 - n·(1/n)² = 1 - 1/n

Maximum entropy = -Σᵢ p(i|t)·log₂ p(i|t) = -n·(1/n)·log₂(1/n) = log₂ n

Maximum misclassification error = 1 - maxᵢ p(i|t) = 1 - 1/n
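A quick numerical check of these maxima (a Python sketch; the function names are ours, not from the homework):

```python
import math

def gini(p):
    # Gini index of a class distribution p: 1 - sum_i p_i^2
    return 1 - sum(pi ** 2 for pi in p)

def entropy(p):
    # Entropy: -sum_i p_i * log2(p_i), treating 0*log2(0) as 0
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def misclassification(p):
    # Misclassification error: 1 - max_i p_i
    return 1 - max(p)

n = 4
uniform = [1 / n] * n              # equally distributed records: p(i|t) = 1/n

print(gini(uniform))               # 1 - 1/n = 0.75
print(entropy(uniform))            # log2(n) = 2.0
print(misclassification(uniform))  # 1 - 1/n = 0.75
```

A skewed distribution such as [0.7, 0.1, 0.1, 0.1] scores strictly lower on all three measures, consistent with the uniform case being the maximum.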
% Conditional probabilities estimated from the training split:
% birgbir(t)   = P(x_t = 1 | class 1),  sifirgbir(t)   = P(x_t = 0 | class 1)
% birgsifir(t) = P(x_t = 1 | class 0),  sifirgsifir(t) = P(x_t = 0 | class 0)
v(11:2000) = [];                 % keep a 1-by-10 template vector
birgbir = v; sifirgbir = v;
for t = 1:10
    h = 0;
    for i = 1:j                  % j = number of class-1 training records
        if birtmatris(i,t) == 1
            h = h + 1;
        end
    end
    birgbir(t) = h/j;
    sifirgbir(t) = 1 - birgbir(t);
end
birgsifir = v; sifirgsifir = v;
for t = 1:10
    h = 0;
    for i = 1:k                  % k = number of class-0 training records
        if sifirtmatris(i,t) == 1
            h = h + 1;
        end
    end
    birgsifir(t) = h/k;
    sifirgsifir(t) = 1 - birgsifir(t);
end

% Classify the 400 validation records: start each product from the class
% prior, multiply in the conditional of every observed attribute value, and
% pick the class with the larger product.
Label = zeros(1,400);
for i = 1:400
    Bir = pbir;                  % reset the products for every record
    Birprime = psifir;
    for t = 1:10
        if vmatris(i,t) == 1
            Bir = Bir*birgbir(t);
            Birprime = Birprime*birgsifir(t);
        else
            Bir = Bir*sifirgbir(t);
            Birprime = Birprime*sifirgsifir(t);
        end
    end
    if Bir > Birprime
        Label(i) = 1;
    else
        Label(i) = 0;
    end
end

% Validation confusion matrix (rows: true class 0/1, columns: predicted 0/1)
eV = [0 0; 0 0];
for i = 1:400
    if Label(i) == vmatris(i,11) + 1       % predicted 1, true class 0
        eV(1,2) = eV(1,2) + 1;
    elseif Label(i) == vmatris(i,11) - 1   % predicted 0, true class 1
        eV(2,1) = eV(2,1) + 1;
    elseif Label(i) == 0                   % predicted 0, true class 0
        eV(1,1) = eV(1,1) + 1;
    elseif Label(i) == 1                   % predicted 1, true class 1
        eV(2,2) = eV(2,2) + 1;
    end
end
ErrorV = (eV(1,2) + eV(2,1))/400;

% Repeat the classification and confusion matrix for the 1600 training records
Label = zeros(1,1600);
for i = 1:1600
    Bir = pbir;
    Birprime = psifir;
    for t = 1:10
        if tmatris(i,t) == 1
            Bir = Bir*birgbir(t);
            Birprime = Birprime*birgsifir(t);
        else
            Bir = Bir*sifirgbir(t);
            Birprime = Birprime*sifirgsifir(t);
        end
    end
    if Bir > Birprime
        Label(i) = 1;
    else
        Label(i) = 0;
    end
end
eT = [0 0; 0 0];
for i = 1:1600
    if Label(i) == tmatris(i,11) + 1
        eT(1,2) = eT(1,2) + 1;
    elseif Label(i) == tmatris(i,11) - 1
        eT(2,1) = eT(2,1) + 1;
    elseif Label(i) == 0
        eT(1,1) = eT(1,1) + 1;
    elseif Label(i) == 1
        eT(2,2) = eT(2,2) + 1;
    end
end
ErrorT = (eT(1,2) + eT(2,1))/1600;
Since the training and validation sets are split randomly at each run, the error matrices change from run to run. One example output:

eT matrix:
1397    28
   1   174
ErrorT = 0.13

eV matrix:
 341    17
   1    41
ErrorV = 0.15
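The same naive Bayes computation can be sketched in Python (the homework itself is in MATLAB). The names train_nb and predict_nb and the synthetic 10-attribute binary data below are illustrative stand-ins for the homework's birtmatris/sifirtmatris/vmatris structures, assuming binary attributes and a 0/1 class label:

```python
import random

def train_nb(records):
    # Estimate the class priors and the per-attribute conditionals
    # P(x_t = 1 | class) by counting, exactly as the MATLAB loops do.
    by_class = {0: [], 1: []}
    for x, y in records:
        by_class[y].append(x)
    n = len(records)
    priors = {c: len(rows) / n for c, rows in by_class.items()}
    d = len(records[0][0])
    cond = {c: [sum(x[t] for x in rows) / len(rows) for t in range(d)]
            for c, rows in by_class.items()}
    return priors, cond

def predict_nb(priors, cond, x):
    # Multiply the prior by the conditional of each observed attribute
    # value; the class with the larger product wins.
    best_c, best_score = 0, -1.0
    for c in sorted(priors):
        score = priors[c]
        for t, xt in enumerate(x):
            score *= cond[c][t] if xt == 1 else 1 - cond[c][t]
        if score > best_score:
            best_c, best_score = c, score
    return best_c

# Synthetic data: class-1 records turn each attribute on with probability
# 0.8, class-0 records with probability 0.2.
random.seed(0)
data = []
for _ in range(2000):
    y = int(random.random() < 0.5)
    x = tuple(int(random.random() < (0.8 if y else 0.2)) for _ in range(10))
    data.append((x, y))

train, valid = data[:1600], data[1600:]
priors, cond = train_nb(train)
error_rate = sum(predict_nb(priors, cond, x) != y for x, y in valid) / len(valid)
print(error_rate)  # validation error rate
```

Unlike the MATLAB script above, the products here are reset for every record; in a larger problem the products would be replaced by sums of logarithms to avoid underflow.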
Q4) Functions used for this question:

1. Function for splitting a node at the cut point with the minimum Gini index
function [purity, splitattr, splitindex, splitamount, newmatrix] = splitnode(datamatrix, startindex, endindex)
% Splits the node holding rows startindex..endindex of datamatrix on the
% attribute and cut point with the minimum Gini index.
% Column 7 of datamatrix holds the class label.
splitattr = 0;
splitindex = 0;
splitamount = 0;
newmatrix = datamatrix;
datasayisi = endindex - startindex + 1;    % number of records at this node

% Count the classes over the node's own rows; the node is pure when a
% single class holds every record.
birler = 0;
sifirlar = 0;
for i = startindex:endindex
    if datamatrix(i,7) == 1
        birler = birler + 1;
    else
        sifirlar = sifirlar + 1;
    end
end
purity = 0;
if birler == datasayisi || sifirlar == datasayisi
    purity = 1;
end
if purity == 0
    bestsplit(1:103, 1:6) = 1;     % Gini value of each candidate cut point
    splitplace(1:103, 1:6) = 0;    % attribute value at each candidate cut
    for i = 1:6
        datamatrix(startindex:endindex,:) = ...
            sortrows(datamatrix(startindex:endindex,:), i);
        for j = startindex:endindex-1
            if datamatrix(j,7) == datamatrix(j+1,7)
                bestsplit(j,i) = 1;    % no class change: not a useful cut
            else
                bestsplit(j,i) = ginihesapla(datamatrix(:,7), startindex, endindex, j);
                splitplace(j,i) = datamatrix(j,i);
            end
        end
    end
    % Pick the attribute and cut point with the smallest Gini value.
    minginis(1:2, 1:6) = 0;
    for i = 1:6
        [minginis(1,i), minginis(2,i)] = min(bestsplit(:,i));
    end
    [~, splitattr] = min(minginis(1,:));
    splitindex = minginis(2, splitattr);
    splitamount = splitplace(splitindex, splitattr);
    % Return the records re-sorted by the chosen split attribute.
    newmatrix(startindex:endindex,:) = ...
        sortrows(datamatrix(startindex:endindex,:), splitattr);
end
end
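The split search in splitnode can be sketched in Python under the same idea: sort the records by each attribute and score only the cut points where the class label changes. The names gini_of and best_split are ours, and the size-weighted Gini of the two children stands in for ginihesapla, whose code is not shown here:

```python
def gini_of(labels):
    # Gini impurity of a list of 0/1 class labels.
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 1.0 - p ** 2 - (1 - p) ** 2

def best_split(rows, labels):
    # Scan every attribute; for each, visit the records in sorted order and
    # score only the cut points where the class label changes, as splitnode
    # does. Returns (attribute, cut value, weighted Gini) of the best cut.
    n = len(rows)
    best = (None, None, float("inf"))
    for attr in range(len(rows[0])):
        order = sorted(range(n), key=lambda i: rows[i][attr])
        ys = [labels[i] for i in order]
        for j in range(n - 1):
            if ys[j] == ys[j + 1]:
                continue                      # no class change: skip this cut
            left, right = ys[:j + 1], ys[j + 1:]
            w = (len(left) * gini_of(left) + len(right) * gini_of(right)) / n
            if w < best[2]:
                best = (attr, rows[order[j]][attr], w)
    return best

print(best_split([(1.0,), (2.0,), (3.0,), (10.0,), (11.0,)],
                 [0, 0, 0, 1, 1]))  # → (0, 3.0, 0.0): a perfect cut after 3.0
```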
if purity3 == 0
    [purity6, splitattr6, splitindex6, splitamount6, newtmatrix6] = ...
        splitnode(newtmatrix3, splitindex1+1, splitindex3);
    treematrix(6,:) = [6 (splitindex1+1) splitindex6 purity6 splitattr6 splitindex6 splitamount6];
    [purity7, splitattr7, splitindex7, splitamount7, newtmatrix7] = ...
        splitnode(newtmatrix3, splitindex3+1, endindx);
    treematrix(7,:) = [7 (splitindex3+1) endindx purity7 splitattr7 splitindex7 splitamount7];
end
The output is a matrix with one row per node; its columns give the node number, startindex, endindex, purity, attribute type, attribute index, and attribute value used to split the node, respectively:

node  startindex  endindex  purity  attribute  index  value
  1        1         104       0        1        50    12.77
  2        1          50       1        0         0     0
  3       51         104       0        5        55    87
  4        1          50       1        0         0     0
  5       51          53       1        0         0     0
  6       51           0       1        0         0     0
  7       56         104       1        0         0     0
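The grow-until-pure procedure behind this table can be sketched as a self-contained Python recursion (again, the homework itself is MATLAB). The names gini, split, and grow are illustrative, and this simplified tree records (record count, purity flag, split attribute, split value) per node instead of the index ranges used above:

```python
def gini(labels):
    # Gini impurity of a list of 0/1 class labels.
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 1.0 - p ** 2 - (1 - p) ** 2

def split(rows, labels):
    # Cheapest size-weighted Gini cut over all attributes and positions;
    # returns the (attribute, threshold) pair.
    n = len(rows)
    best = (float("inf"), None, None)
    for a in range(len(rows[0])):
        order = sorted(range(n), key=lambda i: rows[i][a])
        ys = [labels[i] for i in order]
        for j in range(n - 1):
            w = ((j + 1) * gini(ys[:j + 1]) + (n - j - 1) * gini(ys[j + 1:])) / n
            if w < best[0]:
                best = (w, a, rows[order[j]][a])
    return best[1], best[2]

def grow(rows, labels, node=1, tree=None):
    # Depth-first splitting until every node is pure. Each entry maps a node
    # number to (record count, purity flag, split attribute, split value).
    if tree is None:
        tree = {}
    if len(set(labels)) <= 1:
        tree[node] = (len(labels), 1, None, None)   # pure leaf: stop
        return tree
    a, t = split(rows, labels)
    li = [i for i in range(len(rows)) if rows[i][a] <= t]
    ri = [i for i in range(len(rows)) if rows[i][a] > t]
    if not li or not ri:                            # degenerate cut on ties
        tree[node] = (len(labels), 1, None, None)
        return tree
    tree[node] = (len(labels), 0, a, t)
    grow([rows[i] for i in li], [labels[i] for i in li], 2 * node, tree)
    grow([rows[i] for i in ri], [labels[i] for i in ri], 2 * node + 1, tree)
    return tree

tree = grow([(1,), (2,), (3,), (4,)], [0, 0, 1, 1])
print(tree)  # node 1 splits attribute 0 at value 2; nodes 2 and 3 are pure
```

Numbering children of node k as 2k and 2k+1 reproduces the node numbers 1–7 of the table above for a depth-2 tree.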