
Text Mining and Classification

Karianne Bergen
kbergen@stanford.edu
Institute for Computational and Mathematical Engineering,
Stanford University

Machine Learning Short Course | August 11-15, 2014


Text Classification
• Determine a characteristic of a document based on the text:
  – Author identification
  – Sentiment analysis (e.g. positive vs. negative review)
  – Subject or topic category
  – Spam filtering



Text Classification

[Figure: scam email example]
http://www.theshedonline.org.au/activities/activity/scam-email-examples



Document Features
• How do we generate a set of input features from a text document to pass to the machine learning algorithm?
  – Bag of words / term-document matrix
  – N-grams



Bag-of-Words Model
• Representation of text data in terms of frequencies of words from a dictionary
  – The grammar and ordering of words are ignored
  – Just keep the (unordered) list of words that appear and the number of times they appear
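A minimal base-R sketch of this idea for a single document (the example string and variable names are just for illustration):

# lowercase, split on whitespace, and count occurrences of each word
words <- strsplit(tolower("one fish two fish"), "\\s+")[[1]]
table(words)   # grammar and ordering are discarded; only counts remain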



Bag-of-Words Model



Term-Document Matrix
• The term-document matrix is useful for working with text data
  – Sparse matrix describing the frequency of words occurring in a collection of documents
  – Rows represent terms/words, columns represent individual documents
  – Entry (i, j) gives the number of occurrences of term i in document j



Term-Document Matrix
• Example
  – Documents:
    1. "one fish two fish"
    2. "red fish blue fish"
    3. "black fish blue fish"
    4. "old fish new fish"
  – Terms: "one", "two", "fish", "red", "blue", "black", "old", "new"



Term-Document Matrix

              Document
Term        1   2   3   4
"one"       1   0   0   0
"two"       1   0   0   0
"fish"      2   2   2   2
"red"       0   1   0   0
"blue"      0   1   1   0
"black"     0   0   1   0
"old"       0   0   0   1
"new"       0   0   0   1
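The same matrix can be built programmatically; a small sketch using the tm package (assuming it is installed) is:

library(tm)
docs <- c("one fish two fish", "red fish blue fish",
          "black fish blue fish", "old fish new fish")
tdm <- TermDocumentMatrix(VCorpus(VectorSource(docs)))
inspect(tdm)   # rows are terms, columns are documents, entries are counts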



N-gram
• N-gram: a contiguous sequence of n items (e.g. words or characters)
• Used for language modeling - features retain information related to word ordering
• e.g. "It's kind of fun to do the impossible."
       - Walt Disney
  – 3-grams: "It's kind of," "kind of fun," "of fun to," "fun to do," "to do the," "do the impossible"
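A minimal base-R sketch of word n-grams (the function and variable names are just illustrative):

word_ngrams <- function(text, n = 3) {
  # strip punctuation, lowercase, split into words
  w <- strsplit(tolower(gsub("[[:punct:]]", "", text)), "\\s+")[[1]]
  # paste each run of n consecutive words into one n-gram
  vapply(seq_len(length(w) - n + 1),
         function(i) paste(w[i:(i + n - 1)], collapse = " "), character(1))
}
word_ngrams("It's kind of fun to do the impossible.")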



Text Mining: NMF
• Unsupervised learning method for dimensionality reduction
• NMF is a type of matrix factorization
  – Original matrix and factors only contain positive or zero values
  – For dimensionality reduction and clustering
  – Non-negativity of factors makes the results easier to interpret than other factorizations



Nonnegative Matrix Factorization
• NMF factors a matrix X into the product of two non-negative matrices:
      X ≈ W H,   W ≥ 0,  H ≥ 0
• W is the "dictionary" matrix and its columns are "metafeatures"; H is the coefficient matrix
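A common way to compute the factors (standard practice, not stated on the slide) is to minimize the reconstruction error in the Frobenius norm:

      min_{W ≥ 0, H ≥ 0}  ||X − W H||_F^2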
 



NMF for Text
• X ∈ ℝ^(t×d): term-document matrix
• W ∈ ℝ^(t×k): k columns ("metafeatures"), each representing a collection of terms
• H ∈ ℝ^(k×d): coefficients
• Each document is represented as a positive combination of the k metafeatures
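In symbols (a direct consequence of the factorization above), document j, i.e. column j of X, satisfies

      x_j ≈ W h_j = Σ_{i=1}^{k} H_{ij} w_i,   with H_{ij} ≥ 0,

where w_i denotes the i-th metafeature (column of W).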



NMF for Text
• Example
  – Documents:
    1. "one fish two fish"
    2. "red fish blue fish"
    3. "old fish new fish"
    4. "some are red and some are blue"
    5. "some are old and some are new"
  – Terms: "one", "two", "fish", "red", "blue", "old", "new", "some", "are", "and"



NMF for Text:
X (term-document matrix)

              Document
Term        1   2   3   4   5
"one"       1   0   0   0   0
"two"       1   0   0   0   0
"fish"      2   2   2   0   0
"red"       0   1   0   1   0
"blue"      0   1   0   1   0
"old"       0   0   1   0   1
"new"       0   0   1   0   1
"some"      0   0   0   2   2
"are"       0   0   0   2   2
"and"       0   0   0   1   1



NMF for Text:
W (dictionary matrix)

                                  Metafeature
Term      "one"+"two"   "fish"   "red"+"blue"   "old"+"new"   "some"+"are"+0.5·"and"
"one"          1           0           0             0                  0
"two"          1           0           0             0                  0
"fish"         0           1           0             0                  0
"red"          0           0           1             0                  0
"blue"         0           0           1             0                  0
"old"          0           0           0             1                  0
"new"          0           0           0             1                  0
"some"         0           0           0             0                  1
"are"          0           0           0             0                  1
"and"          0           0           0             0                  0.5
NMF for Text:
H (coefficient matrix)

                                     Document
Metafeature                      1   2   3   4   5
"one"+"two"                      1   0   0   0   0
"fish"                           2   2   2   0   0
"red"+"blue"                     0   1   0   1   0
"old"+"new"                      0   0   1   0   1
"some"+"are"+0.5·"and"           0   0   0   2   2

• e.g. "one fish two fish"  →  "one" "fish" "two" "fish"
       = 1 × "one" + 1 × "two" + 2 × "fish"
  OR   = 1 × ("one" + "two") + 2 × "fish"
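To make the factorization concrete, a small R sketch (variable names are just for illustration) that enters these W and H matrices and checks that their product reproduces the term-document matrix X above:

terms <- c("one","two","fish","red","blue","old","new","some","are","and")
W <- matrix(0, nrow = 10, ncol = 5, dimnames = list(terms, NULL))
W[c("one","two"), 1] <- 1; W["fish", 2] <- 1
W[c("red","blue"), 3] <- 1; W[c("old","new"), 4] <- 1
W[c("some","are"), 5] <- 1; W["and", 5] <- 0.5
H <- matrix(0, nrow = 5, ncol = 5)
H[1, 1] <- 1; H[2, 1:3] <- 2; H[3, c(2, 4)] <- 1; H[4, c(3, 5)] <- 1; H[5, 4:5] <- 2
W %*% H   # reproduces X exactly for this toy example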
NMF for Text
• Metafeatures in the dictionary matrix W may reveal interesting patterns in the data
  – Positivity of metafeatures helps with interpretability
  – Groupings of words in metafeatures often occur together in the same document
    • e.g. "red" and "blue" or "old" and "new"



NMF for Text
• e.g. Text from news articles in the business section
  – 2500 articles, 50 authors
  – 948 terms after pre-processing (stemming, stop word removal, removal of infrequent terms)
  – Apply NMF factorization with k = 25
  – Metafeatures in dictionary factor W roughly correspond to topics within the text
  – Representation of text: 948 terms → 25 topics



NMF for Text

Ford Motor Co. Thursday announced sweeping
organizational changes and a major shake-up of its
senior management, replacing the head of its
global automotive operations. The moves include
combining Ford's four components divisions into a
single organization with 75,000 employees and $14
billion in revenues, and a consolidation of the
automaker's vehicle product development centers to
three from five.

→  { "ford" "motor" "thursday" "announc" "chang"
     "major" "senior" "manag" "replac" … }



NMF for Text

Metafeature 1: cargo 0.47, air 0.47, airlin 0.24, servic 0.18, kong 0.16, hong 0.16, aircraft 0.13, airport 0.13, flight 0.12

Metafeature 2: internet 0.43, comput 0.42, corp 0.30, use 0.29, system 0.20, microsoft 0.19, softwar 0.18, inc 0.16, technolog 0.16, industri 0.16, network 0.15, product 0.13, servic 0.13, busi 0.11

Metafeature 3: china 0.73, beij 0.31, chines 0.30, state 0.21, offici 0.20, said 0.19, trade 0.14, foreign 0.13, unite 0.11

Metafeature 4: plant 0.47, worker 0.35, uaw 0.24, strike 0.21, ford 0.19, part 0.17, local 0.15, auto 0.15, said 0.14, motor 0.13, truck 0.13, chrysler 0.13, work 0.13, automak 0.13, union 0.13, contract 0.11





NMF for Images

[Figure: an image approximated (≈) as a sum (+) of nonnegative basis images]



# NMF in R

# install.packages("NMF") # NMF package (provides nmf)
library(NMF)

# scale each column of the nonnegative data matrix to sum to 1
V <- scale(data, center = FALSE, scale = colSums(data))

k <- 20
res <- nmf(V, k)

W <- basis(res)      # get dictionary matrix W
H <- coef(res)       # get coefficient matrix H
V.hat <- fitted(res) # get estimate W*H
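# A possible follow-up (a hedged sketch, assuming the rows of the input
# matrix were named with the terms): list the top-weighted terms in each
# metafeature, i.e. in each column of W
top.terms <- apply(W, 2, function(w) rownames(W)[order(w, decreasing = TRUE)[1:10]])
top.terms   # one column of top terms per metafeature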



Text Classification
• Naïve Bayes
  – Simple algorithm based on Bayes' rule from statistics
  – Uses the bag-of-words model for documents
  – Has been shown to be very effective for text classification



Naïve Bayes
• NB chooses the most likely class label based on the following assumption about the data:
  – Independent feature (word) model – the presence of any word in a document is unrelated to the presence/absence of other words
• This assumption makes it easier to combine the contributions of features: there is no need to model interactions between words
• Even though this assumption rarely holds, NB still works well in practice



Naïve Bayes
• Compute Prob(Y = j | X) for each class j and choose the class with the greatest probability
• Bayesian classifiers:
      Prob(Y | X) = Prob(Y) Prob(X | Y) / Prob(X)
• For Naïve Bayes:
      Ŷ = argmax_Y  Prob(Y) ∏_{j=1}^{d} Prob(X_j | Y)
  – Prob(Y), Prob(X_j | Y) estimated using training data
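In practice (a standard implementation detail, not shown on the slide), the product is evaluated in log space to avoid numerical underflow when d is large:

      Ŷ = argmax_Y [ log Prob(Y) + Σ_{j=1}^{d} log Prob(X_j | Y) ]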



Naïve Bayes
• Advantages:
  – Does not require a large training set to obtain good performance, especially in text applications
  – Independence assumption leads to faster computations
  – Is not sensitive to irrelevant features
• Disadvantages:
  – Independence-of-features assumption
  – Good classifier, but poor probability estimates



Author identification
• Collection of poems – William Shakespeare or Robert Frost?
Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth;
Then took the other, as just as fair,
And having perhaps the better claim…

Shall I compare thee to a summer's day?


Thou art more lovely and more temperate.
Rough winds do shake the darling buds of May,
And summer's lease hath all too short a date.
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimmed;
And every fair from fair sometime declines,
By chance, or nature's changing course, untrimmed;…



Author identification
install.packages("tm") # text mining

library(tm) # loads library

# shakespeare
s.dir = "shakespeare"
s.docs <- Corpus(DirSource(directory = s.dir,
                           encoding = "UTF-8"))

# frost
f.dir = "frost"
f.docs <- Corpus(DirSource(directory = f.dir,
                           encoding = "UTF-8"))



cleanCorpus<-function(corpus){

# apply stemming
corpus <-tm_map(corpus, stemDocument, lazy=TRUE)

# remove punctuation
corpus.tmp <- tm_map(corpus,removePunctuation)

# remove white spaces


corpus.tmp <- tm_map(corpus.tmp,stripWhitespace)

# remove stop words


corpus.tmp <-
tm_map(corpus.tmp,removeWords,stopwords("en"))

return(corpus.tmp)
}



d.docs <- c(s.docs, f.docs) # combine data sets
d.cldocs <- cleanCorpus(d.docs) # preprocessing

# forms document-term matrix
d.tdm <- DocumentTermMatrix(d.cldocs)

# removes infrequent terms
d.tdm <- removeSparseTerms(d.tdm, 0.97)

> dim(d.tdm) # [ #docs, #numterms ]
[1] 264 518

> inspect(d.tdm) # inspect entries in document-term matrix



# exploring the data
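# Note: the per-author matrices s.tdm and f.tdm used below are not built in
# the code shown above; a minimal sketch of one way to construct them,
# mirroring the combined pipeline, is:
s.tdm <- removeSparseTerms(DocumentTermMatrix(cleanCorpus(s.docs)), 0.97)
f.tdm <- removeSparseTerms(DocumentTermMatrix(cleanCorpus(f.docs)), 0.97)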

# terms appearing > 55 times in shakespeare’s poems


> findFreqTerms(s.tdm,55)
[1] "and" "but" "doth" "eye" "for" "heart" "love"
"mine" "sweet" "that" "the" "thee" "thi" "thou"
"time" "yet"

# terms appearing > 55 times in frost’s poems


> findFreqTerms(f.tdm,55)
[1] "and" "back" "but" "come" "know" "like" "look"
"make" "one" "say" "see" "that" "the" "they" "way"
"what" "with" "you"



# exploring the data

# identify associations between terms - shakespeare


> findAssocs(s.tdm, "winter", 0.2)
winter
summer 0.50
age 0.40
youth 0.34
like 0.24
old 0.23
beauti 0.21
seen 0.21



# exploring the data

# identify associations between terms - frost


> findAssocs(f.tdm, "winter", 0.5)
winter
climb 0.66
town 0.62
toward 0.57
side 0.55
black 0.53
mountain 0.52



# assign class labels to each document,
# based on the document author

class.names = c('shakespeare','frost')
d.class = c(rep(class.names[1], nrow(s.tdm)),
            rep(class.names[2], nrow(f.tdm)))

d.class = as.factor(d.class)
> levels(d.class)
[1] "frost" "shakespeare"



# separate data into training and test sets

set.seed(123) # set random seed


train_frac = 0.6 # fraction of data for training
train_idx = sample.int(nrow(d.tdm), size =
ceiling(nrow(d.tdm) * train_frac),
replace = FALSE);
train_idx <- sort(train_idx)
test_idx <- setdiff(1:nrow(d.tdm), train_idx)

d.tdm.train <- d.tdm[train_idx,]


d.tdm.test <- d.tdm[test_idx,]
d.class.train <- d.class[train_idx]
d.class.test <- d.class[test_idx]



# separate data into training and test sets
> d.tdm.train
<<DocumentTermMatrix (documents: 159, terms: 518)>>
Non-/sparse entries : 6167/76195
Sparsity : 93%
Maximal term length : 9
Weighting : term frequency (tf)

> d.tdm.test
<<DocumentTermMatrix (documents: 105, terms: 518)>>
Non-/sparse entries : 4578/49812
Sparsity : 92%
Maximal term length : 9
Weighting : term frequency (tf)



# CART

install.packages("rpart") # install cart package


library(rpart) # load library

d.frame.train <- data.frame(as.matrix(d.tdm.train));


d.frame.train$class <- as.factor(d.class.train)

treefit <- rpart(class ~., data = d.frame.train)

> summary(treefit)
Variables actually used in tree construction:
[1] doth eyes green grow let thee which



Decision  Tree  result  

plot(treefit, uniform=TRUE)
text(treefit, use.n=T)



•  William  Shakespeare  or  Robert  Frost?  

Two roads diverged in a yellow wood,


And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth;
Then took the other, as just as fair,
And having perhaps the better claim…

Shall I compare thee to a summer's day?


Thou art more lovely and more temperate.
Rough winds do shake the darling buds of May,
And summer's lease hath all too short a date.
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimmed;
And every fair from fair sometime declines,
By chance, or nature's changing course, untrimmed;…



# CART
Node number 1: 159 observations, complexity param=0.3947368
predicted class=shakespeare expected loss=0.4779874 P(node) =1
class counts: 76 83
probabilities: 0.478 0.522
left son=2 (120 obs) right son=3 (39 obs)
Primary splits:
thee < 0.0007022472 to the left, improve=21.14, (0 missing)
thi < 0.01323529 to the left, improve=21.14, (0 missing)
thou < 0.003511236 to the left, improve=19.58, (0 missing)
doth < 0.0007022472 to the left, improve=16.21, (0 missing)
love < 0.01906318 to the left, improve=14.89, (0 missing)
Surrogate splits:
thou < 0.003511236 to the left, agree=0.906, (0 split)
thi < 0.0007022472 to the left, agree=0.899, (0 split)
art < 0.005088523 to the left, agree=0.836, (0 split)
thine < 0.0007022472 to the left, agree=0.824,(0 split)
hast < 0.009433962 to the left, agree=0.805, (0 split)



# CART
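# Note: d.frame.test is not created in the code shown on the slides; a
# minimal sketch of the matching test data frame would be:
d.frame.test <- data.frame(as.matrix(d.tdm.test))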

predclass <- predict(treefit, d.frame.test)

colNames = colnames(predclass)
d.class.pred <- as.factor(colNames[max.col(predclass)])
tree.table <- table(d.class.pred, d.class.test,
                    dnn = list('predicted','actual'))

> tree.table
             actual
predicted     frost shakespeare
  frost          55          12
  shakespeare     1          37



# CART

errorRate<-function(table){
TP = table[1,1]; # true positives
TN = table[2,2]; # true negatives
FP = table[1,2]; # false positives
FN = table[2,1]; # false negatives
error_rate = (FP + FN)/(TP + TN + FP + FN)
return(error_rate)
}

> errorRate(tree.table)
[1] 0.1238095






COME unto these yellow sands,
And then take hands:
Court'sied when you have, and kiss'd,--
The wild waves whist,--
Foot it featly here and there;
And, sweet sprites, the burthen bear.
Hark, hark!
Bow, wow,
The watch-dogs bark:
Bow, wow.
Hark, hark! I hear
The strain of strutting chanticleer
Cry, Cock-a-diddle-dow!

True Author: Shakespeare
Predicted: Frost

How countlessly they congregate
O'er our tumultuous snow,
Which flows in shapes as tall as trees
When wintry winds do blow!--
As if with keenness for our fate,
Our faltering few steps on
To white rest, and a place of rest
Invisible at dawn,--
And yet with neither love nor hate,
Those stars like some snow-white
Minerva's snow-white marble eyes
Without the gift of sight.

True Author: Frost
Predicted: Shakespeare



# KNN
library(class)
knn_res <- knn(as.matrix(d.tdm.train), as.matrix(d.tdm.test),
               d.class.train, k = 5, prob = TRUE)
knn.table <- table(knn_res, d.class.test,
                   dnn = list('predicted','actual'))
> knn.table
             actual
predicted     frost shakespeare
  frost          56          33
  shakespeare     0          16

> errorRate(knn.table)
[1] 0.3142857



# naive bayes
library(e1071) # provides naiveBayes()

nb_classifier <- naiveBayes(as.matrix(d.tdm.train),
                            d.class.train, laplace = 1)
res <- predict(nb_classifier, as.matrix(d.tdm.test),
               type = "raw", threshold = 0.5)
> res
frost shakespeare
[1,] 2.265614e-244 1.000000e+00
[2,] 2.285289e-165 1.000000e+00
[3,] 5.696532e-67 1.000000e+00

[104,] 1.000000e+00 0.000000e+00
[105,] 1.000000e+00 0.000000e+00
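# The confusion table res.table used on a later slide is not constructed
# here; a minimal sketch (nb.pred is an illustrative name), taking the most
# probable class for each test document:
nb.pred <- as.factor(colnames(res)[max.col(res)])
res.table <- table(nb.pred, d.class.test, dnn = list('predicted','actual'))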



# naive bayes

> nb_classifier$apriori # breakdown of training data


d.class.train
frost shakespeare
77 82




> errorRate(res.table)
[1] 0.1619048



NMF for Text
• 'cargo' 'air' 'airlin' 'servic' 'kong' 'hong' 'aircraft' 'airport' 'flight'
  (0.4711  0.4696  0.2349  0.1772  0.1648  0.1583  0.1328  0.1271  0.1245)
• 'internet' 'comput' 'corp' 'use' 'system' 'microsoft' 'softwar' 'inc' 'technolog' 'industri' 'network' 'product' 'servic' 'busi'
  (0.4285  0.4165  0.2990  0.2885  0.1958  0.1883  0.1776  0.1630  0.1618  0.1565  0.1519  0.1347  0.1320  0.1146)
• 'china' 'beij' 'chines' 'state' 'offici' 'said' 'trade' 'foreign' 'unite'
  (0.7297  0.3059  0.3034  0.2089  0.2038  0.1884  0.1400  0.1337  0.1147)
• 'plant' 'worker' 'uaw' 'strike' 'ford' 'part' 'local' 'auto' 'said' 'motor' 'truck' 'chrysler' 'work' 'automak' 'union' 'contract' 'agreement' 'three' 'mich'
  (0.4729  0.3485  0.2438  0.2141  0.1877  0.1692  0.1498  0.1452  0.1382  0.1310  0.1305  0.1291  0.1281  0.1264  0.1261  0.1130  0.1044  0.1040  0.1023)



# CART
Node number 1: 159 observations, complexity param=0.3947368
predicted class=shakespeare expected loss=0.4779874 P(node) =1
class counts: 76 83
probabilities: 0.478 0.522
left son=2 (120 obs) right son=3 (39 obs)
Primary splits:
thee < 0.5 to the left, improve=21.14719, (0 missing)
thi < 0.5 to the left, improve=20.35459, (0 missing)
thou < 0.5 to the left, improve=19.57953, (0 missing)
doth < 0.5 to the left, improve=16.20745, (0 missing)
tree < 0.5 to the right, improve=13.91526, (0 missing)
Surrogate splits:
thou < 0.5 to the left, agree=0.906, adj=0.615, (0 split)
thi < 0.5 to the left, agree=0.899, adj=0.590, (0 split)
art < 0.5 to the left, agree=0.830, adj=0.308, (0 split)
thine < 0.5 to the left, agree=0.824, adj=0.282, (0 split)
hast < 0.5 to the left, agree=0.805, adj=0.205, (0 split)



Sample  R  code  
> Auto=read.table("Auto.data")
> fix(Auto)
> dim(Auto)
[1] 392 9

> names(Auto)
[1] "mpg" "cylinders " "displacement" "horsepower "
[5] "weight" "acceleration" "year" "origin"
[9] "name"

