Cheat Sheet Final
Mean squared error (MSE) = SSE/(n-k-1); its square root is the Residual Standard Error (#8 on table on next page). MSR (mean square regression) = SSR/k (k = # of predictors)
R2 = SSR/SST
Variable selection risks Type I errors (including unimportant independent variables in the model) and Type II errors (eliminating important independent variables). The adjusted coefficient of determination (adjusted R2) penalizes the inclusion of useless predictors.
Regression Selection: R2 - adjusted R2 - Cp. Want the smallest Cp.
R2 = SSR/SST = 1 - SSE/SST
Adjusted R2 = 1 - (n-1)*MSE/SST = 1 - [(n-1)/(n-k-1)]*(1-R2)
Cp = SSEk/MSEfull + 2(k+1) - n (SSEk from the k-predictor model, MSEfull from the full model)
R2 > adjusted R2, and for poor-fitting models adjusted R2 may be negative. Choose the model with the highest adjusted R2.
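The selection formulas above can be sketched directly; every number below is a hypothetical sum of squares, not real data:

```python
# Model-selection metrics from the formulas above (hypothetical inputs).
n, k = 50, 3                   # observations, predictors in candidate model
sse, sst = 120.0, 500.0        # candidate model SSE and total SS
sse_full, k_full = 110.0, 6    # full model, used for Mallows' Cp
mse_full = sse_full / (n - k_full - 1)

r2 = 1 - sse / sst                                # R2 = 1 - SSE/SST
r2_adj = 1 - (n - 1) / (n - k - 1) * (1 - r2)     # adjusted R2
cp = sse / mse_full + 2 * (k + 1) - n             # Mallows' Cp
```

Note that `r2_adj` is always below `r2`, matching the rule of thumb above.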
If the model with more parameters (your model) doesn't perform better than the model with fewer parameters, the F-test will have a high p-value (the extra parameters are not a significant improvement). If the model with more parameters is better than the model with fewer parameters, you will get a lower p-value.
Error rate (whole dataset) = root node error x relative error.
Root node error = (# records in the smaller class, i.e., # misclassified at the root) / total records.
10-fold CV error rate = root node error x xerror.
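The two identities above only involve multiplying rates; a minimal sketch, with all three inputs as hypothetical values read off an rpart-style CP table:

```python
# Converting rpart-style relative errors into absolute error rates.
root_node_error = 0.40   # smaller-class proportion at the root (hypothetical)
rel_error = 0.55         # relative (training) error of a subtree
xerror = 0.62            # 10-fold cross-validated relative error

train_error_rate = root_node_error * rel_error   # whole-dataset error rate
cv_error_rate = root_node_error * xerror         # estimated CV error rate
```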
Prune Tree
Step 1: Find the best subtree of each size (1, 2, 3, ...).
Step 2: Pick the tree in the sequence that gives the smallest misclassification error in the validation set.
The idea behind pruning is to recognize that a very large tree is likely to be overfitting the training data and that the weakest branches, which hardly reduce the error rate, should be removed.
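Step 2 reduces to a minimum over the candidate subtrees; a sketch with hypothetical validation error rates per subtree size:

```python
# Step 2 as code: among the best subtree of each size, keep the one with
# the smallest validation-set misclassification error (rates hypothetical).
val_error_by_size = {1: 0.31, 2: 0.24, 3: 0.19, 4: 0.21, 5: 0.23}
best_size = min(val_error_by_size, key=val_error_by_size.get)
```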
Regression Tree
The tree method can also be used for numerical response variables (a regression tree).
- Both the principle and the procedure are the same as for the classification tree.
- Three details differ from the classification tree:
(i) Prediction
(ii) Impurity measures
(iii) Evaluating performance
Logistic Regression
The logistic regression model explains the relationship between a binary response and predictors using a logit link function. Y represents the binary response; P(Y = 1) = p is the probability of belonging to class 1.
logit(p) = ln[p/(1-p)] = β0 + β1x1 + ... + βkxk
Odds = p/(1-p) = e^(β0 + β1x1 + ... + βkxk)
p = Odds/(1 + Odds) = e^(β0 + β1x1 + ... + βkxk) / [1 + e^(β0 + β1x1 + ... + βkxk)]
If xj increases by 1 unit, then the odds change by (e^βj - 1)(100)% (holding all other predictors constant).
Odds ratio (for a dummy predictor CD):
Odds ratio = e^(β0 + βCD(1) + ... + βkxk) / e^(β0 + βCD(0) + ... + βkxk) = e^βCD
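A minimal sketch of the logit/odds identities and the two interpretation rules above; all coefficients are hypothetical, not from a fitted model:

```python
import math

# Hypothetical coefficients: intercept, one numeric predictor x1, one dummy CD.
b0, b1, b_cd = -1.5, 0.8, 0.4
x1 = 2.0

def prob(cd):
    logit = b0 + b1 * x1 + b_cd * cd              # β0 + β1x1 + βCD*CD
    return math.exp(logit) / (1 + math.exp(logit))  # p = e^logit/(1+e^logit)

p1 = prob(1)
odds1 = p1 / (1 - p1)                             # Odds = p/(1-p)
odds0 = prob(0) / (1 - prob(0))

# +1 unit in x1 changes the odds by (e^β1 - 1)*100 percent.
pct_change_per_unit_x1 = (math.exp(b1) - 1) * 100

# Odds ratio for the dummy CD: odds(CD=1)/odds(CD=0) = e^βCD.
odds_ratio = odds1 / odds0
```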
Association Rules
Support s = (# transactions that include both condition and result item sets) / (total number of records)
Confidence c = (# transactions that include both condition and result item sets) / (# transactions with condition item sets) = P(result | condition)
Lift ratio = confidence / P(result) = P(condition and result) / [P(condition) x P(result)]
A lift ratio greater than 1.0 suggests that there is some usefulness to the rule - the level of association between the condition and result item sets is higher than would be expected if they were independent. The larger the lift ratio, the greater the strength of the association.
The support indicates the rule's impact in terms of overall size. If only a small number of transactions are affected, the rule may be of little use (unless the consequent is very valuable and/or the rule is very efficient in finding it).
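The three measures above follow directly from transaction counts; a sketch with hypothetical counts:

```python
# Support, confidence, and lift from raw transaction counts (hypothetical).
n_total = 1000        # total number of transactions
n_condition = 200     # transactions containing the condition item set
n_result = 150        # transactions containing the result item set
n_both = 60           # transactions containing both item sets

support = n_both / n_total            # s
confidence = n_both / n_condition     # c = P(result | condition)
benchmark = n_result / n_total        # P(result)
lift_ratio = confidence / benchmark   # > 1 suggests a useful rule
```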
Cluster Analysis
dij is a distance metric, or dissimilarity measure, between records i and j. (xi1, xi2, ..., xip) is the vector of p measurements for record i; (xj1, xj2, ..., xjp) is the vector of p measurements for record j.
The following properties are required:
- Nonnegativity: dij >= 0
- Self-proximity: dii = 0
- Symmetry: dij = dji
- Triangle inequality: dij <= dik + dkj
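Euclidean distance is one common choice of dij that satisfies all four properties; a self-contained sketch with hypothetical records:

```python
import math

# Euclidean distance between two records of p measurements; a valid
# dissimilarity measure satisfying the four required properties.
def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

xi, xj, xk = [1.0, 2.0], [4.0, 6.0], [0.0, 0.0]
assert dist(xi, xj) >= 0                              # nonnegativity
assert dist(xi, xi) == 0                              # self-proximity
assert dist(xi, xj) == dist(xj, xi)                   # symmetry
assert dist(xi, xj) <= dist(xi, xk) + dist(xk, xj)    # triangle inequality
```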
Interpretation
We explore the characteristics of each cluster by:
a. Obtaining summary statistics from each cluster on each measurement that was used in the cluster analysis
b. Examining the clusters for the presence of some common feature (variable) that was not used in the cluster analysis
c. Cluster labeling: based on the interpretation, trying to assign a name or label to each cluster
AIC = -2L + 2k, where L is the log likelihood and k = # of parameters. Since residual deviance = -2L, equivalently AIC = residual deviance + 2k (the -2 factor is already included in the reported deviance).
Lower AIC and BIC are better.
AIC indicates how well the maximum-likelihood estimates fit the data, while giving a penalty to a higher # of predictors.
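The equivalence between the log-likelihood form and the deviance form of AIC, with a hypothetical log likelihood:

```python
# AIC from a log likelihood, and equivalently from the residual deviance
# that model output typically reports (numbers hypothetical).
log_lik = -57.3
k = 4                                # number of estimated parameters

aic = -2 * log_lik + 2 * k           # AIC = -2L + 2k
residual_deviance = -2 * log_lik     # the -2 factor is already in the deviance
aic_from_deviance = residual_deviance + 2 * k
```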
- Holding the other variables constant, the (response variable) is more/less likely to be in class 1 if the variable is 1.
Root node error x xerror = cross-validation error rate