
Dealing with Data, Bagging, Boosting

Types of Data: Binary Data
ID   Salary   Male/Female   Mortgage   Car
A    50000        1             0       0
B    85000        0             1       1
C    55000        1             0       1
D    95000        1             1       0
E    75000        0             0       0
F    45000        0             1       1
G    65000        1             1       0
A binary variable has two states, 0 or 1, where 0 means the variable is absent and 1 means it is present. Thus the variable smoker has value 1 if the person smokes and 0 if they do not. A binary variable is symmetric if both of its states are equally valuable and carry the same weight. A variable denoting the gender of a person is a symmetric binary variable, as the male and female values are equally important.
Consider the data above. Here Male/Female, Mortgage and Car are binary variables, as they take the values 0 and 1 only. How do we find the distance between A and B?
Types of Data: Binary Data

                  object j
                  1     0     sum
object i    1     q     r     q+r
            0     s     t     s+t
            sum   q+s   r+t   p
We construct a matrix as shown above. The matrix shows the matching between two objects i and j: q denotes the number of variables where both i and j are 1, r denotes the number of variables where i = 1 and j = 0, and so on.

    d(i,j) = (r + s) / (q + r + s + t) = (r + s) / p

The distance between i and j is also called the dissimilarity between i and j.

Symmetric Data

Calculation of d(A,B), i.e. the dissimilarity between A and B:

            B
            1     0     sum
A     1     0     1     1
      0     2     0     2
      sum   2     1     3

    d(A,B) = (r + s) / p = 3/3 = 1

Calculation of d(A,C), i.e. the dissimilarity between A and C:

            C
            1     0     sum
A     1     1     0     1
      0     1     1     2
      sum   2     1     3

    d(A,C) = (r + s) / p = 1/3 = 0.33
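To make the computation concrete, here is a minimal Python sketch of the symmetric binary dissimilarity applied to records A, B and C from the table above (the function name is ours):

def symmetric_binary_dissimilarity(x, y):
    # d(i,j) = (r + s) / (q + r + s + t), i.e. mismatches over all variables
    mismatches = sum(1 for a, b in zip(x, y) if a != b)   # r + s
    return mismatches / len(x)                            # p = q + r + s + t

A = (1, 0, 0)   # Male/Female, Mortgage, Car
B = (0, 1, 1)
C = (1, 0, 1)

print(symmetric_binary_dissimilarity(A, B))   # 3/3 = 1.0
print(symmetric_binary_dissimilarity(A, C))   # 1/3, about 0.33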
Asymmetric Binary Variables

Using the same contingency table of q, r, s and t as above:
A variable is asymmetric if the outcomes of its states are not equally important, such as the positive and negative outcomes of a disease test. Let the variable be the HIV status of a person: it is 1 if the disease is present and 0 if it is absent. Given two asymmetric binary variables, the agreement of two 1s (a positive match) is considered more significant than that of two 0s. In this case the formula for dissimilarity becomes:

    d(i,j) = (r + s) / (q + r + s)

where t (the number of negative matches) is not considered.
name   gender   fever   cough   test-1   test-2   test-3   test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       Y       N        N        N        N

name   gender   fever   cough   test-1   test-2   test-3   test-4
Jack   M        1       0       1        0        0        0
Mary   F        1       0       1        0        1        0
Jim    M        1       1       0        0        0        0

In the above case gender is symmetric and the other attributes are asymmetric binary. We convert the asymmetric values to 1 for Yes and Positive and 0 for No and Negative.

    d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
    d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
    d(Mary, Jim)  = (1 + 2) / (1 + 1 + 2) = 0.75
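A minimal sketch of the asymmetric formula, applied to the Jack/Mary/Jim records above (the function name is ours; gender is excluded because it is symmetric):

def asymmetric_binary_dissimilarity(x, y):
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)   # positive matches
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    return (r + s) / (q + r + s)   # t (0-0 matches) is not considered

# fever, cough, test-1, test-2, test-3, test-4
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)

print(round(asymmetric_binary_dissimilarity(jack, mary), 2))   # 0.33
print(round(asymmetric_binary_dissimilarity(jack, jim), 2))    # 0.67
print(round(asymmetric_binary_dissimilarity(mary, jim), 2))    # 0.75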
Categorical Variables
A categorical variable is a generalization of the binary variable in that it can take on more than two states. For example, map_color is a categorical variable that may take five states: red, yellow, green, pink, and blue.

The dissimilarity between two categorical objects i and j can be computed based on the ratio of mismatches:

    d(i,j) = (p - m) / p

where m is the number of matches (i.e. the number of variables for which i and j are in the same state), and p is the total number of variables.
We take into account the object identifier and test-1 only and construct the dissimilarity matrix. We have p = 1 since only one variable is considered.
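A minimal sketch of d(i,j) = (p - m) / p. With p = 1 (only test-1), the dissimilarity is simply 0 for a match and 1 for a mismatch; the state values below are illustrative, since the test-1 column itself is not reproduced here:

def categorical_dissimilarity(x, y):
    p = len(x)                                   # total number of variables
    m = sum(1 for a, b in zip(x, y) if a == b)   # number of matching variables
    return (p - m) / p

objects = {1: ("red",), 2: ("blue",), 3: ("green",), 4: ("red",)}
for i in objects:
    for j in objects:
        if i < j:
            print(i, j, categorical_dissimilarity(objects[i], objects[j]))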
Ordinal Variables

A discrete ordinal variable resembles a categorical variable, except that the M states of the ordinal variable are ordered in a meaningful sequence. Each value is replaced by its rank r in {1, ..., M}, and the rank is normalized to the interval [0, 1] as z = (r - 1) / (M - 1).
We consider the object identifier and test-2 (an ordinal variable) and replace each test-2 value by its rank. Since there are three states (fair, good and excellent), M = 3.
object-identifier   test-2 (rank)   normalized value
1                        3          (3 - 1) / (3 - 1) = 1
2                        1          (1 - 1) / (3 - 1) = 0
3                        2          (2 - 1) / (3 - 1) = 0.5
4                        3          (3 - 1) / (3 - 1) = 1

Ranks: 1 = fair, 2 = good, 3 = excellent

We next calculate the Euclidean distance between the objects using the normalized values. The distance between objects 2 and 1 is ((0 - 1)^2)^(1/2) = 1, the distance between objects 3 and 1 is ((0.5 - 1)^2)^(1/2) = 0.5, and so on. This results in the dissimilarity matrix for the four objects.
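A minimal sketch of the rank-and-normalize procedure, reproducing the table and the two distances above (the variable names are ours):

rank = {"fair": 1, "good": 2, "excellent": 3}   # M = 3 ordered states
test2 = {1: "excellent", 2: "fair", 3: "good", 4: "excellent"}

M = len(rank)
z = {obj: (rank[v] - 1) / (M - 1) for obj, v in test2.items()}
print(z)   # {1: 1.0, 2: 0.0, 3: 0.5, 4: 1.0}

# With a single variable, the Euclidean distance reduces to the absolute difference.
d21 = ((z[2] - z[1]) ** 2) ** 0.5   # 1.0
d31 = ((z[3] - z[1]) ** 2) ** 0.5   # 0.5
print(d21, d31)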
Ratio-Scaled Variables

For ratio-scaled variables we take the log values. Consider the object identifier and the test-3 variable:

object-identifier   test-3   log values
1                     445    log(445)  = 2.65
2                      22    log(22)   = 1.34
3                     164    log(164)  = 2.21
4                    1210    log(1210) = 3.08

From the values in the last column we calculate the Euclidean distance to obtain the distance matrix.
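A minimal sketch of the log transform, reproducing the table above (base-10 logs, as the slide's values indicate; names are ours):

import math

test3 = {1: 445, 2: 22, 3: 164, 4: 1210}
logv = {obj: math.log10(v) for obj, v in test3.items()}
print({k: round(v, 2) for k, v in logv.items()})   # {1: 2.65, 2: 1.34, 3: 2.21, 4: 3.08}

# Pairwise Euclidean distance (one variable, so it is the absolute difference of logs).
for a in test3:
    for b in test3:
        if a < b:
            print(a, b, round(abs(logv[a] - logv[b]), 2))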
Bagging

Bagging, also known as bootstrap aggregating, is a technique that repeatedly samples (with replacement) from a data set. Each bootstrap sample has the same size as the original data. Because the sampling is done with replacement, some instances may appear several times in the same training set, while others may be omitted from it.

Consider the following data, where x denotes a one-dimensional attribute and y the class label:

x   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
y    1     1     1    -1    -1    -1    -1     1     1     1

We apply a classifier that induces a one-level binary decision tree with a test condition x <= k, where k is a split point chosen to minimize the entropy of the leaf nodes.
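A minimal Python sketch of bagging with such one-level trees (decision stumps). The stump search below minimizes classification error rather than entropy, and all names are ours:

import random

X = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
Y = [1, 1, 1, -1, -1, -1, -1, 1, 1, 1]

def train_stump(xs, ys):
    # Find (k, left, right) so that "left if x <= k else right" minimizes errors.
    best = None
    for k in sorted(set(xs)):
        for left, right in ((1, -1), (-1, 1)):
            errs = sum(1 for x, y in zip(xs, ys)
                       if (left if x <= k else right) != y)
            if best is None or errs < best[0]:
                best = (errs, k, left, right)
    return best[1:]

random.seed(0)
stumps = []
for _ in range(10):                                   # 10 bagging rounds
    idx = [random.randrange(len(X)) for _ in X]       # bootstrap sample, same size
    stumps.append(train_stump([X[i] for i in idx], [Y[i] for i in idx]))

def predict(x):
    votes = sum((l if x <= k else r) for k, l, r in stumps)
    return 1 if votes > 0 else -1                     # sign of the summed votes

print([predict(x) for x in X])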
Bagging

For example, bagging round 1 produces the rule: for x <= 0.35, y = 1 and for x > 0.35, y = -1, and this rule determines that round's predicted values of y. The predicted values of y from all rounds are then added, and the sign of the sum gives the final classification.
Boosting

Boosting is an iterative procedure that adaptively changes the distribution of the training data by focusing more on previously misclassified records.

Initially, all N records are assigned equal weights. Unlike bagging, the weights may change at the end of each boosting round: records that are wrongly classified have their weights increased, while records that are classified correctly have their weights decreased.
Boosting - AdaBoost

AdaBoost Algorithm
1:  w = { w_j = 1/N | j = 1, 2, ..., N }   {Initialize the weights for all N examples}
2:  Let k be the number of boosting rounds.
3:  for i = 1 to k do
4:    Create training set D_i by sampling (with replacement) from D according to w.
5:    Train a base classifier C_i on D_i.
6:    Apply C_i to all examples in the original training set D and calculate the weighted error:

          e_i = (1/N) * SUM_{j=1..N} w_j * d( C_i(x_j) != y_j )

      where d(.) = 1 if its argument is true and 0 otherwise.
7:    if e_i > 0.5 then
          w = { w_j = 1/N | j = 1, 2, ..., N }   {Reset the weights for all N examples}
          Go back to step 4.
8:    end if
9:    Calculate the importance of the classifier:

          a_i = (1/2) ln( (1 - e_i) / e_i )

10:   Update the weight of each example (see the weight-update equation below).
11: end for
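As a sketch, the loop above maps directly to Python. The code below assumes the train_stump helper from the bagging sketch earlier is in scope as the base classifier, and it keeps the 1/N factor in the weighted error exactly as written in step 6:

import math
import random

def adaboost(X, Y, rounds, seed=0):
    rng = random.Random(seed)
    N = len(X)
    w = [1.0 / N] * N                                   # step 1: uniform weights
    ensemble = []                                       # (alpha, stump) pairs
    while len(ensemble) < rounds:
        # Step 4: sample with replacement from D according to w.
        idx = rng.choices(range(N), weights=w, k=N)
        split, left, right = train_stump([X[i] for i in idx], [Y[i] for i in idx])
        # Step 6: apply the classifier to the original training set D.
        pred = [left if x <= split else right for x in X]
        eps = sum(wj for wj, p, y in zip(w, pred, Y) if p != y) / N
        if eps > 0.5:                                   # step 7: reset and resample
            w = [1.0 / N] * N
            continue
        eps = max(eps, 1e-10)                           # guard against log(0)
        alpha = 0.5 * math.log((1 - eps) / eps)         # step 9: importance
        # Step 10: multiplicative weight update, then normalize by Z.
        w = [wj * math.exp(-alpha if p == y else alpha)
             for wj, p, y in zip(w, pred, Y)]
        Z = sum(w)
        w = [wj / Z for wj in w]
        ensemble.append((alpha, (split, left, right)))
    return ensemble

def classify(ensemble, x):
    # Two-class form of C*(x) = argmax_y SUM_i a_i * d(C_i(x) = y).
    score = sum(a * (l if x <= s else r) for a, (s, l, r) in ensemble)
    return 1 if score > 0 else -1

The final classifier takes the sign of the alpha-weighted votes, which is the two-class form of the argmax rule given below.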
Base classifiers: C_1, C_2, ..., C_T

Error rate of classifier C_i:

    e_i = (1/N) * SUM_{j=1..N} w_j * d( C_i(x_j) != y_j )

Importance of a classifier:

    a_i = (1/2) ln( (1 - e_i) / e_i )
Boosting - AdaBoost

Weight update:

    w_j^(i+1) = ( w_j^(i) / Z_i ) * exp(-a_i)   if C_i(x_j) = y_j
    w_j^(i+1) = ( w_j^(i) / Z_i ) * exp(+a_i)   if C_i(x_j) != y_j

where Z_i is the normalization factor.

If any intermediate round produces an error rate higher than 50%, the weights are reverted back to 1/N and the resampling procedure is repeated.

Classification:

    C*(x) = argmax_y SUM_{i=1..T} a_i * d( C_i(x) = y )
Boosting - AdaBoost

x   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
y    1     1     1    -1    -1    -1    -1     1     1     1

N = 10, the number of elements shown above. w = 1/N = 1/10 = 0.1 is the initial weight assigned to each element in the data.

Let k = number of boosting rounds = 3.
Boosting - AdaBoost

In each of the three boosting rounds the elements are sampled with replacement according to the current weights, so an element may appear more than once in the same round.
In round 1 all elements are given the same weight 1/10 = 0.1. The weights of the training records in the later rounds are derived in the calculations below.
Boosting - AdaBoost

The base classifier of round 1 produces the split: if x <= 0.75 then y = -1, else y = 1. We define the values of y based on this criterion and compare them with the original data set:

x   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
y    1     1     1    -1    -1    -1    -1     1     1     1

We observe that the predicted values of y corresponding to x = 0.1, 0.2 and 0.3 are wrong; all other values are correctly classified. This is called training a base classifier.
Boosting - AdaBoost
We need to calculate the value of

and so that new weights can be
calculated according to the equation:

( )

=
= =
N
j
j j i j i
y x C w
N
1
) (
1
o c
factor ion normalizat the is where
) ( if exp
) ( if exp
) (
) 1 (
j
i i j
i i j
j
j
i
j
i
Z
y x C
y x C
Z
w
w
j
j

=
=
=

+
o
o
|
|
.
|

\
|

=
i
i
i
c
c
o
1
ln
2
1
Boosting - AdaBoost

The calculation of e_1 is as follows: d = 1 if the predicted value for a data element does not match the original value, else it is 0. Thus d = 1 for the first three data elements; w = 0.1 is the weight assigned to each element in the first round.

    e_1 = (1/10)(0.1 x 1 + 0.1 x 1 + 0.1 x 1 + 0 + 0 + ...) = (1/10)(0.3) = 0.03
Boosting - AdaBoost

We have the value of e_1 = 0.03, so

    a_1 = (1/2) ln( (1 - 0.03) / 0.03 ) = 1.738
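A quick numeric check of e_1 and a_1 (variable names are ours):

import math

N = 10
w = [0.1] * N
mis = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]   # delta: first three elements misclassified

eps1 = sum(wj * m for wj, m in zip(w, mis)) / N   # (1/10)(0.3) = 0.03
alpha1 = 0.5 * math.log((1 - eps1) / eps1)        # 0.5 * ln(0.97 / 0.03)
print(round(eps1, 2), round(alpha1, 3))           # 0.03 1.738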

Boosting - AdaBoost

We now need to calculate the new weights given by the equation

    w_j^(2) = ( w_j^(1) / Z_1 ) * exp(-a_1)   if C_1(x_j) = y_j
    w_j^(2) = ( w_j^(1) / Z_1 ) * exp(+a_1)   if C_1(x_j) != y_j

where Z_1 is the normalization factor, which ensures that SUM_j w_j^(2) = 1. The condition in the exponent distinguishes matching from non-matching values. Comparing the round-1 predictions with the original data,

x   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
y    1     1     1    -1    -1    -1    -1     1     1     1

the values for x = 0.1, 0.2 and 0.3 do not match, and the remaining seven values match.
We need to calculate the value of Z_1, the normalization factor:

    1 = (0.1/Z_1)(e^1.738) + (0.1/Z_1)(e^1.738) + (0.1/Z_1)(e^1.738)
        + (0.1/Z_1)(e^-1.738) + ... + (0.1/Z_1)(e^-1.738)
      = (0.1/Z_1)(3 x e^1.738) + (0.1/Z_1)(7 x 0.176)

The value of Z_1 must make the right-hand side equal to 1. Solving the equation gives Z_1 = 1.83.


For the non-matching (misclassified) instances the new weights are:

    w = (0.1 / 1.83) x e^1.738 = 0.31

For the matching (correctly classified) instances the new weights are:

    w = (0.1 / 1.83) x e^-1.738 = 0.0096, approximately 0.01

The whole process is then repeated with the new weights.
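A quick numeric check of Z_1 and the two new weight values (names are ours):

import math

alpha1 = 1.738
Z1 = 0.1 * (3 * math.exp(alpha1) + 7 * math.exp(-alpha1))
w_wrong = (0.1 / Z1) * math.exp(alpha1)    # weight of each misclassified element
w_right = (0.1 / Z1) * math.exp(-alpha1)   # weight of each correctly classified element
print(round(Z1, 2), round(w_wrong, 2), round(w_right, 4))   # 1.83 0.31 0.0096
print(3 * w_wrong + 7 * w_right)                            # the new weights sum to 1.0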
Boosting - AdaBoost

After three rounds the classifiers have importances a_1 = 1.738, a_2 = 2.7784 and a_3 = 4.1195, and the final classification sums their alpha-weighted votes. For a record that the three classifiers predict as -1, 1 and 1:

    -1 x (1.738) + 1 x (2.7784) + 1 x (4.1195) = 5.16, so the predicted class is 1.

For a record that they predict as -1, 1 and -1:

    -1 x (1.738) + 1 x (2.7784) + (-1) x (4.1195) = -3.08, so the predicted class is -1.
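A quick check of the two weighted-vote sums (the prediction vectors are read off the signs above):

alphas = [1.738, 2.7784, 4.1195]

votes_a = [-1, 1, 1]    # predictions of the three classifiers for one record
votes_b = [-1, 1, -1]   # predictions for another record

print(sum(a * v for a, v in zip(alphas, votes_a)))   # about  5.16 -> class  1
print(sum(a * v for a, v in zip(alphas, votes_b)))   # about -3.08 -> class -1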
