Rating Easy? AI? Sys? Thy? Morning?

+2 y y n y n
+2 y y n y n
+2 n y n n n
+2 n n n y n
+2 n y y n y
+1 y y n n n
+1 y y n y n
+1 n y n y n
0 n n n n y
0 y n n y y
0 n y n y n
0 y y y y y
-1 y y y n y
-1 n n y y n
-1 n n y n y
-1 y n y n y
-2 n n y y n
-2 n y y n y
-2 y n y n n
-2 y n y n y

Table 1: Course rating data set

224 a course in machine learning

K-nearest neighbors, 58 categorical features, 30 dual variables, 151

dA-distance, 82 chain rule, 117, 120
p-norms, 104 chord, 102 early stopping, 53, 132
0/1 loss, 100 circuit complexity, 138 embedding, 178
80% rule, 80 clustering, 35, 178 ensemble, 164
clustering quality, 178 error driven, 43
absolute loss, 14 complexity, 34 error rate, 100
activation function, 130 compounding error, 215 estimation error, 71
activations, 41 concave, 102 Euclidean distance, 31
AdaBoost, 166 concavity, 193 evidence, 127
adaptation, 74 concept, 157 example normalization, 59, 60
algorithm, 99 confidence intervals, 68 examples, 9
all pairs, 92 constrained optimization problem, 112 expectation maximization, 186, 189
all versus all, 92 expected loss, 16
approximation error, 71 contour, 104 expert, 212
architecture selection, 139 convergence rate, 107 exponential loss, 102, 169
area under the curve, 64, 96 convex, 99, 101
argmax problem, 199 covariate shift, 74 feasible region, 113
AUC, 64, 95, 96 cross validation, 65, 68 feature augmentation, 78
AVA, 92 cubic feature map, 144 feature combinations, 54
averaged perceptron, 52 curvature, 107 feature mapping, 54
feature normalization, 59
back-propagation, 134, 137 data covariance matrix, 184 feature scale, 33
bag of words, 56 data generating distribution, 15 feature space, 31
bagging, 165 decision boundary, 34 feature values, 11, 29
base learner, 164 decision stump, 168 feature vector, 29, 31
batch, 173 decision tree, 8, 10 features, 11, 29
Bayes error rate, 20 decision trees, 57 forward-propagation, 137
Bayes optimal classifier, 19 density estimation, 76 fractional assignments, 191
Bayes optimal error rate, 20 development data, 26 furthest-first heuristic, 180
Bayes rule, 117 dimensionality reduction, 178
Bernoulli distribution, 121 discrepancy, 82 Gaussian distribution, 121
bias, 42 discrete distribution, 121 Gaussian kernel, 147
bias/variance trade-off, 72 disparate impact, 80 Gaussian Mixture Models, 191
binary features, 30 distance, 31 generalize, 9, 17
bipartite ranking problems, 95 domain adaptation, 74 generative story, 123
boosting, 155, 164 dominates, 63 geometric view, 29
bootstrap resampling, 165 dot product, 45 global minimum, 106
bootstrapping, 67, 69 dual problem, 151 GMM, 191

gradient, 105 KKT conditions, 152 non-convex, 135

gradient ascent, 105 non-linear, 129
gradient descent, 105 label, 11 Normal distribution, 121
Lagrange multipliers, 119 normalize, 46, 59
Hamming loss, 202 Lagrange variable, 119 null hypothesis, 67
hard-margin SVM, 113 Lagrangian, 119
hash kernel, 177 lattice, 200 objective function, 100
held-out data, 26 layer-wise, 139 one versus all, 90
hidden units, 129 learning by demonstration, 212 one versus rest, 90
hidden variables, 186 leave-one-out cross validation, 65 online, 42
hinge loss, 102, 203 level-set, 104 optimization problem, 100
histogram, 12 license, 2 oracle, 212, 220
horizon, 213 likelihood, 127 oracle experiment, 28
hyperbolic tangent, 130 linear classifier, 169 output unit, 129
hypercube, 38 linear classifiers, 169 OVA, 90
hyperparameter, 26, 44, 101 linear decision boundary, 41 overfitting, 23
hyperplane, 41 linear regression, 110 oversample, 88
hyperspheres, 38 linearly separable, 48
hypothesis, 71, 157 link function, 130 p-value, 67
hypothesis class, 71, 160 log likelihood, 118 PAC, 156, 166
hypothesis testing, 67 log posterior, 127 paired t-test, 67
log probability, 118 parametric test, 67
i.i.d. assumption, 117 log-likelihood ratio, 122 parity, 21
identically distributed, 24 logarithmic transformation, 61 parity function, 138
ILP, 195, 207 logistic loss, 102 patch representation, 56
imbalanced data, 85 logistic regression, 126 PCA, 184
imitation learning, 212 LOO cross validation, 65 perceptron, 41, 42, 58
importance sampling, 75 loss function, 14 perpendicular, 45
importance weight, 86 loss-augmented inference, 205 pixel representation, 55
independent, 24 loss-augmented search, 205 policy, 212
independently, 117 polynomial kernels, 146
independently and identically distributed, 117 margin, 49, 112 positive semi-definite, 146
margin of a data set, 49 posterior, 127
indicator function, 100 marginal likelihood, 127 precision, 62
induce, 16 marginalization, 117 precision/recall curves, 63
induced distribution, 88 Markov features, 198 predict, 9
induction, 9 maximum a posteriori, 127 preference function, 94
inductive bias, 20, 31, 33, 103, 121 maximum depth, 26 primal variables, 151
integer linear program, 207 maximum likelihood estimation, 118 principal components analysis, 184
integer linear programming, 195 mean, 59 prior, 127
iteration, 36 Mercer’s condition, 146 probabilistic modeling, 116
model, 99 Probably Approximately Correct, 156
jack-knifing, 69 modeling, 25 programming by example, 212
Jensen’s inequality, 193 multi-layer network, 129 projected gradient, 151
joint, 124 projection, 46
naive Bayes assumption, 120 psd, 146
K-nearest neighbors, 32 nearest neighbor, 29, 31
Karush-Kuhn-Tucker conditions, 152 neural network, 169 radial basis function, 139
kernel, 141, 145 neural networks, 54, 129 random forests, 169
kernel trick, 146 neurons, 41 random variable, 117
kernels, 54 noise, 21 RBF kernel, 147
index 227

RBF network, 139 span, 143 time horizon, 213

recall, 62 sparse, 104 total variation distance, 82
receiver operating characteristic, 64 specificity, 64 train/test mismatch, 74
reconstruction error, 184 squared loss, 14, 102 training data, 9, 16, 24
reductions, 88 statistical inference, 116 training error, 16
redundant features, 56 statistically significant, 67 trajectory, 213
regularized objective, 101 steepest ascent, 105 trellis, 200
regularizer, 100, 103 stochastic gradient descent, 173 truncated gradients, 175
reinforcement learning, 212 stochastic optimization, 172 two-layer network, 129
representer theorem, 143, 145 strong law of large numbers, 24
ROC curve, 64 strong learner, 166 unary features, 198
strong learning algorithm, 166 unbiased, 47
sample complexity, 157, 158, 160 strongly convex, 107 underfitting, 23
sample mean, 59 structural risk minimization, 99 unit hypercube, 39
sample selection bias, 74 structured hinge loss, 203 unit vector, 46
sample variance, 59 structured prediction, 195 unsupervised adaptation, 75
semi-supervised adaptation, 75 sub-sampling, 87 unsupervised learning, 35
sensitivity, 64 subderivative, 108
separating hyperplane, 99 subgradient, 108 validation data, 26
sequential decision making, 212 subgradient descent, 109 Vapnik-Chervonenkis dimension, 162
SGD, 173 sum-to-one, 117 variance, 59, 165
shallow decision tree, 21, 168 support vector machine, 112 variational distance, 82
shape representation, 56 support vectors, 153 VC dimension, 162
sigmoid, 130 surrogate loss, 102 vector, 31
sigmoid function, 126 symmetric modes, 135 visualize, 178
sigmoid network, 139 vote, 32
sign, 130 t-test, 67 voted perceptron, 52
single-layer network, 129 test data, 25 voting, 52
singular, 110 test error, 25
slack, 148 test set, 9 weak learner, 166
slack parameters, 113 text categorization, 56 weak learning algorithm, 166
smoothed analysis, 180 the curse of dimensionality, 37 weights, 41
soft assignments, 190 threshold, 42
soft-margin SVM, 113 Tikhonov regularization, 99 zero/one loss, 14

