Learning3 6pp PDF

Lecture 4: Machine learning III
Question
Is it possible to learn a linear classifier with an infinite number of features using finite computation?
Yes
No
CS221: Artificial Intelligence (Autumn 2013) - Percy Liang
Roadmap
Review of features and non-linearity
How good is a predictor?
Unsupervised learning
Conclusion
Learning framework
[figure: $x$, feature extraction $\phi(x)$ (domain knowledge), parameter tuning $f_w$]
Linear classifiers, nearest neighbors, kernel methods, decision trees, neural networks
Every predictor is linear!

...with the right features.
Last lecture, we talked about the two stages of machine learning: feature extraction and parameter tuning. Feature extraction focuses on manually using domain knowledge to figure out what information to expose to the learning algorithm and how to expose it. Parameter tuning is about figuring out how to best use the features $\phi(x)$ exposed by feature extraction to produce good predictors. Both stages influence the quality of the final predictor. If we use a simple method (linear classifiers), then we need to make sure that the features are specified in a way such that the prediction is in fact linear in the features (not necessarily in the original $x$!). If we use a more powerful method (decision trees), then we often have to work less hard (but not be completely insensitive) in making sure the features are in a form that parameter tuning can take advantage of. Of course, making the learning algorithm do more work means that we need more data and computation to learn and predict, whereas making the learning algorithm do less work means that the human (you) has to do more work manually. The art of machine learning is finding a good balance between the two.
[figure: a simple $\phi(x)$ with a complex $f$, versus a complex $\Phi(x)$ with a linear predictor]
Kernel methods
Prediction on $x'$:
There are a lot of machine learning methods (nearest neighbors, kernel methods, decision trees, neural networks), all with their own logical intuitions. To get a deeper understanding, it is useful to relate all of these methods to each other. One easy way to do this is to look at the possible decision boundaries defined by each of these methods. Another way, which we'll pursue now, is to show that the predictors defined by these methods can be represented by a linear classifier based on more complex features $\Phi(x)$. In other words, decision trees, for example, take simple features $\phi(x)$ and find a complex decision boundary on $\phi(x)$. This is equivalent to mapping $x$ to some more complex features $\Phi(x)$ and finding a simple linear decision boundary on $\Phi(x)$. This should not be surprising because features are arbitrary functions. If we took $\Phi(x)$ to be the feature vector where we have one feature per possible input $x$ and set the weights to be the correct output for that $x$, then the so-called linear classifiers actually represent all possible predictors.
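score $= \sum_{x} \alpha_x k(x, x')$
Gaussian kernel: $k(x, x') = \exp\left(-\frac{\|\phi(x) - \phi(x')\|^2}{2\sigma^2}\right) = \Phi(x) \cdot \Phi(x')$
Every kernel corresponds to dot products of feature vectors $\Phi(x) \cdot \Phi(x')$ (in this case, infinite dimensional!)
score $= w \cdot \Phi(x')$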
Decision trees
Recall that the prediction score on a new point $x'$ is a combination of the example coefficients $\alpha_x$ (think of them as predictions of those points) weighted by $k(x, x')$, which captures the similarity between $x$ and $x'$. The Gaussian kernel, in particular, says two points are similar if their feature vectors are very close. We stated (but didn't prove) that each kernel $k(x, x')$ corresponds to the inner product between two feature vectors, say $\Phi(x) \cdot \Phi(x')$. It is important that this is just a mathematical statement: we don't actually need to construct the feature vectors because we are working with kernels. It's worth noting that in the case of the Gaussian kernel, the feature vectors are infinite-dimensional. In other words, by using a Gaussian kernel, we are implicitly doing learning on an infinite number of features. You would expect that having an infinite number of features gives you something quite powerful, and indeed it does. After all, Gaussian kernels are like nearest neighbors, which are also quite powerful as they allow for very flexible decision boundaries. What are these features? There is a feature for each input point $a$. The value of that feature on an input $x$ is the similarity to $a$: $\Phi_a(x) = k(a, x)$. In other words, the properties defined by the features are just similarities to all points in the space. Finally, we can use the identity $w = \sum_x \alpha_x \Phi(x)$ to recover our familiar definition of the score for a linear classifier.
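To make the kernel score concrete, here is a minimal sketch in Python. It assumes the basic feature vectors are short tuples of numbers and that the coefficients $\alpha_x$ have already been learned; the names `gaussian_kernel`, `kernel_score`, `alphas` and `train_points` are illustrative and do not come from the lecture.

```python
import math

def gaussian_kernel(u, v, sigma=1.0):
    """k(x, x') = exp(-||phi(x) - phi(x')||^2 / (2 sigma^2)) on two feature tuples."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-squared_distance / (2 * sigma ** 2))

def kernel_score(alphas, train_points, new_point, sigma=1.0):
    """score = sum over training points x of alpha_x * k(x, x')."""
    return sum(alpha * gaussian_kernel(point, new_point, sigma)
               for alpha, point in zip(alphas, train_points))

# Hypothetical coefficients: the first training point votes +1, the second -1.
alphas = [1.0, -1.0]
train_points = [(0.0, 0.0), (3.0, 4.0)]
score = kernel_score(alphas, train_points, (0.0, 0.0))
```

Note that the infinite-dimensional feature vectors never appear: only kernel evaluations between pairs of points are computed, which is why the computation stays finite.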
$\phi(x) = (\text{sex}, \text{age}, \text{sibsp})$
Each leaf $j$ corresponds to a complex feature $\Phi_j(x)$:

score $= w_1 \mathbf{1}[\phi_1(x) = 1 \wedge \phi_2(x) \le 9.5 \wedge \phi_3(x) > 2.5] + w_2 \Phi_2(x) + \cdots = w \cdot \Phi(x)$
Neural networks
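[figure: neural network with inputs $\phi_1(x), \phi_2(x), \phi_3(x)$, hidden units $h_1, h_2$, output weights $w$, and output score]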
In decision trees, we can associate each leaf $j$ with a feature $\Phi_j(x)$. The weight of that feature is simply the label assigned to the leaf, and the feature value is the conjunction of the conditions along the path from the root to that leaf. Again, this gives us the usual linear prediction score $w \cdot \Phi(x)$. In kernel methods, we essentially throw in all the features, and learning's job is to tune the weights of those features (implicitly through the $\alpha_x$'s). In decision trees, the features $\Phi(x)$ are actually selected by the learning algorithm (through the construction of the decision tree itself), which makes it a kind of feature learning algorithm.
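The leaf-as-feature view of decision trees can be sketched as follows. The tree shape, the encoding `sex == 1`, and the leaf labels in `LEAF_WEIGHTS` are assumptions made up for illustration; only the thresholds 9.5 and 2.5 echo the slide.

```python
def leaf_features(phi):
    """Map basic features (sex, age, sibsp) to one indicator feature per leaf.

    Each indicator is the conjunction of the conditions on the path from
    the root to that leaf, so exactly one entry is 1 for any input.
    """
    sex, age, sibsp = phi
    return [
        1 if sex != 1 else 0,
        1 if sex == 1 and age > 9.5 else 0,
        1 if sex == 1 and age <= 9.5 and sibsp > 2.5 else 0,
        1 if sex == 1 and age <= 9.5 and sibsp <= 2.5 else 0,
    ]

# Hypothetical leaf labels; the weight of each feature is the label of its leaf.
LEAF_WEIGHTS = [+1, -1, -1, +1]

def tree_score(phi):
    """score = w . Phi(x): a linear function of the leaf indicator features."""
    return sum(w * f for w, f in zip(LEAF_WEIGHTS, leaf_features(phi)))
```

Because exactly one indicator fires, the dot product simply returns the label of the leaf that the input falls into.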
score $= w \cdot \sigma(V \phi(x))$, where $\sigma(V \phi(x))$ plays the role of $\Phi(x)$
Summary
Start with a basic set of features $\phi(x)$
Neural networks take the original feature vector $\phi(x)$ and map it to a vector of activations $h$. We can understand this activation vector as a complex feature vector $\Phi(x)$, which is used in the usual linear way to define the prediction score $w \cdot \Phi(x)$. Like decision trees, this feature vector $\Phi(x)$ is set by the learning algorithm via tuning the parameters of the first layer $V$. The feature vector $\Phi(x)$ can be thought of as the output of a set of linear predictors, each mapping $\phi(x)$ to $\sigma(v_k \cdot \phi(x))$. Here's a cartoon of what's going on to provide intuition: if the neural network is trying to map an image $x$ onto whether there is a face in the image ($y$), each hidden unit might correspond to the presence or absence of a face part (e.g., ear, eye, nose, mouth). However, we don't tell the neural network what parts to look for; it decides for itself what parts are useful for the final face detection task.
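A minimal sketch of this two-stage computation is given below. It assumes $\sigma$ is the logistic function (the notes do not say which activation is used) and represents $V$ as a list of rows $v_k$; the function names are illustrative.

```python
import math

def logistic(z):
    """Assumed activation function: sigma(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def hidden_features(V, phi):
    """Phi(x): one activation h_k = sigma(v_k . phi(x)) per row v_k of V."""
    return [logistic(sum(v * p for v, p in zip(row, phi))) for row in V]

def network_score(w, V, phi):
    """score = w . sigma(V phi(x)): linear in the learned features."""
    return sum(w_k * h_k for w_k, h_k in zip(w, hidden_features(V, phi)))
```

With all entries of `V` set to zero, every hidden unit outputs 0.5 regardless of the input, so the features carry no information until the first layer is tuned.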
Linear classifiers: use features directly
Kernel methods: linear classifiers with a possibly infinite set of features $\Phi(x)$
Decision trees and neural networks: linear classifiers with automatically learned features $\Phi(x)$
Roadmap
In summary, we always start out with a basic set of features $\phi(x)$. If the features are good enough, then we can simply use a linear classifier. This is typical for most text applications, where we have a feature for the presence/absence of each word. In this case, the feature vector is quite high dimensional (at least thousands), so linear classifiers are powerful enough. However, if the features are not good enough, then we can either hand engineer the features to make them better or simply appeal to a more complex learning algorithm that can do more with the basic features $\phi(x)$. For instance, if $x$ is an image and $\phi(x)$ is just the vector of pixel intensities, then linear classifiers don't work very well because the features are too low-level and noisy. We saw several ways that learning algorithms can be thought of as linear classifiers using more complex features. Kernel methods just throw in the kitchen sink and use an infinite set of features. Decision trees choose features that carve out regions of the input space. Neural networks choose features that correspond to linear predictors that capture higher-order properties of the basic input (through the activations of the hidden layers). Finally, it's worth noting that we've spent most of our time talking about the expressiveness of hypothesis classes defined by the various learning algorithms. This is perhaps the part that is most important and hardest to understand. The mechanics of how to actually tune the parameters are fairly straightforward: using gradient-based optimization in the case of kernel methods and neural networks, and using the ID3 algorithm for decision trees.
Review of features and non-linearity
Conclusion
Question
What's the true objective of machine learning?
Minimize error on the training set
Minimize training error + complexity penalty
Minimize error on the validation set
Minimize error on the test set
Minimize error on unseen future examples
To learn about machines
A strawman algorithm
Algorithm: rote learning
Training: just store $D_{train}$.
Predictor $f(x)$: If $(x, y) \in D_{train}$: return $y$. Else: segfault.
Gets zero training error!
But doesn't generalize to examples it hasn't seen...
Evaluation
[figure: $D_{train}$ → Learner → $f$]
So far, we've talked about the expressivity of different hypothesis classes and gave algorithms to tune parameters, but we have neglected perhaps the most important question: what is our true objective? Clearly, machine learning can't be about just minimizing the training loss as we've pretended to with our fancy loss minimization framework. The rote learning algorithm does a perfect job of that, and yet is clearly a bad idea. It overfits to the training data and doesn't generalize to unseen examples.
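The rote learner is short enough to write out in full. The sketch below is one possible rendering in Python: a raised `KeyError` stands in for the slide's "segfault", and `error_rate` (a helper introduced here for illustration) counts such a failure as a wrong prediction.

```python
class RoteLearner:
    """Memorizes the training set and can only answer inputs it has seen."""

    def fit(self, examples):
        # Training: just store D_train as a lookup table from x to y.
        self.table = {x: y for x, y in examples}

    def predict(self, x):
        if x in self.table:
            return self.table[x]
        raise KeyError(x)  # stands in for the slide's "segfault"

def error_rate(predict, examples):
    """Fraction of examples predicted wrongly; a failure counts as an error."""
    mistakes = 0
    for x, y in examples:
        try:
            mistakes += predict(x) != y
        except KeyError:
            mistakes += 1
    return mistakes / len(examples)

train = [("a", +1), ("b", -1)]
unseen = [("c", +1), ("d", -1)]
learner = RoteLearner()
learner.fit(train)
train_error = error_rate(learner.predict, train)
unseen_error = error_rate(learner.predict, unseen)
```

On this toy data the training error is zero while every unseen example is an error, which is the gap between training loss and the real objective that the notes describe.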
How good is the predictor $f$?
Key idea: the real learning objective
Our goal is to minimize error on unseen future examples.
Evaluation
Problem: don't have unseen future examples
Surrogate:
Definition: test set
The test set $D_{test}$ contains examples only used for final evaluation.
So what is the true objective then? Taking a step back, what we're doing is building a system which happens to use machine learning, and then we're going to deploy it. What we really care about is how accurate that system is on those unseen future inputs. Of course, we can't access unseen future examples, so the next best thing is to create a test set. As much as possible, we should treat the test set as a pristine thing that's unseen and from the future. We shouldn't manually or automatically tune our predictor based on the test error, because we wouldn't be able to do that on future examples. Of course at some point we have to run our algorithm on the test set, but just be aware that each time this is done, the test set becomes a less reliable indicator of how well you're actually doing.
Why is overfitting so bad?
Here's a less contrived case of overfitting: comparing nearest neighbors and linear classification. In this demo, nearest neighbors gets zero training error, but 20% test error, whereas linear classifiers get about 10% on both. So overfitting does lead to worse results, especially when there's noise in our data.
[demo: compare nearest neighbors and linear classifiers]
Generalization
When will a learning algorithm generalize well?
[figure: $D_{train}$ and $D_{test}$]
So far, we have an intuitive feel for what overfitting is. How do we make this precise? In particular, when does a learning algorithm generalize from the training set to the test set?
Bias and variance
[figure: all predictors; hypothesis class $F$; target predictor $f^*$; learned predictor $f_w$; labels: bias, variance, feature extraction, parameter tuning]
Training and test error
[plot: training and test error (vertical axis, 0.0 to 0.5) against hypothesis class size]
Bias: how good is the hypothesis class?
Variance: how good is the learned predictor relative to the hypothesis class?
Test error is bias + variance
Underfitting: bias too high (majority algorithm)
Overfitting: variance too high (rote learning algorithm)
Fitting: balancing bias and variance
Question
What are possible ways to reduce overfitting (select all that apply)?
Recall that our learning framework consists of first choosing a hypothesis class $F$ and then choosing a particular element from $F$. We use the term bias to refer to how far the entire hypothesis class is from the target predictor $f^*$. Larger hypothesis classes have lower bias. We use the term variance to refer to how good the predictor $f_w$ returned by the learning algorithm is, but only with respect to the hypothesis class. Larger hypothesis classes lead to higher variance because it's harder to know based on limited data which predictor is good. We can roughly think about the training error as capturing the bias: how well we can fit. The test error that we care about includes both bias and variance. We'd like both to be small, but there's a tradeoff. Note: we are using the terms bias and variance casually here to convey intuition. There is in fact a formal definition of bias and variance, but we will not discuss it here.
Remove features
Add $0.2 \|w\|^2$ to the objective function
Make sure $\|w\| \le 1$
Run SGD for fewer iterations
Round your weights $w$ to integers
Controlling size of hypothesis class
Linear predictors are specified by weight vector $w \in \mathbb{R}^d$
Keeping the dimensionality $d$ small:
$f_w(x) = w_0 + w_1 x + w_2 x^2$
Keeping the norm $\|w\|$ small:
[whiteboard: reduce dimensionality, norm]
Dimensionality
Linear classifiers: keep number of features small
Decision trees: keep number of leaves small

To balance bias and variance, we need to be able to adjust the size of our hypothesis class. For concreteness, think about linear predictors, in which the hypothesis class is specified by weight vectors $w \in \mathbb{R}^d$. There are two primary ways to reduce (or increase) the size of these hypothesis classes: the first is by controlling the dimensionality $d$, and the second is by controlling the norm $\|w\|$.
Neural networks: keep number of hidden units small
But kernel methods could have an infinite number of features...
Keeping dimensionality small is perhaps the most intuitive way to reduce the complexity of a hypothesis class. For linear classifiers, this simply means keeping the number of features small. In feature extraction, we should only add features if we think they are relevant. Recall that we can think of decision trees and neural networks as linear classifiers using some compound features $\Phi(x)$. From this perspective, it is natural to use the number of leaves and number of hidden units as a measure of complexity. But what about kernel methods, which could have an infinite number of features? Clearly, we need another way to measure the size of a hypothesis class.
Constraining the norm

Constrained optimization: $\min_{w : \|w\| \le B} \text{TrainLoss}(w)$
Algorithm: projected gradient descent
Initialize $w = [0, \dots, 0]$
For $t = 1, \dots, T$:
  $w \leftarrow w - \eta_t \nabla_w \text{TrainLoss}(w)$
  If $\|w\| > B$: $w \leftarrow B \frac{w}{\|w\|}$
Same as gradient descent, except if $w$ leaves the constraint set, put it back.
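The projected gradient descent steps above can be sketched in Python as follows. This is an illustration under assumptions: the step size is held constant at `eta` rather than varying with $t$, and the caller supplies `grad`, a function returning $\nabla_w \text{TrainLoss}(w)$; both names are hypothetical.

```python
import math

def project_onto_ball(w, B):
    """If ||w|| > B, rescale w to length B; otherwise leave it unchanged."""
    norm = math.sqrt(sum(w_i * w_i for w_i in w))
    if norm > B:
        return [B * w_i / norm for w_i in w]
    return list(w)

def projected_gradient_descent(grad, d, B, T=100, eta=0.1):
    """Gradient descent on TrainLoss, projecting back into the ball each step."""
    w = [0.0] * d
    for _ in range(T):
        g = grad(w)
        w = [w_i - eta * g_i for w_i, g_i in zip(w, g)]
        w = project_onto_ball(w, B)
    return w

# Toy loss ||w - (3, 4)||^2, whose unconstrained minimizer has norm 5.
target = (3.0, 4.0)
grad = lambda w: [2 * (w_i - t_i) for w_i, t_i in zip(w, target)]
w_hat = projected_gradient_descent(grad, d=2, B=1.0)
```

With $B = 1$ the iterate settles at roughly $(0.6, 0.8)$, the point of the unit ball closest to the unconstrained minimizer.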
The second way to keep hypothesis classes small is to not let the weight vectors get too big. We can simply take our original loss minimization framework and add a constraint that the norm (also called length or magnitude) of $w$ isn't too big ($\|w\| \le B$). This defines the optimization problem over weight vectors in a ball of radius $B$. The gradient descent algorithm we had before for unconstrained minimization can be adapted easily to handle the constraint: do gradient descent as usual, and if the weights ever leave the radius $B$ ball, then shrink the length so that we're back in the constraint set.
Penalizing the norm

Regularized objective: $\min_w \text{TrainLoss}(w) + \frac{\lambda}{2} \|w\|^2$
Algorithm: gradient descent
Initialize $w = [0, \dots, 0]$
For $t = 1, \dots, T$:
  $w \leftarrow w - \eta_t [\nabla_w \text{TrainLoss}(w) + \lambda w]$
Same as gradient descent, except shrink the weights towards zero.
Note: SVM = hinge loss + regularization
Early stopping
Algorithm: gradient descent
Initialize $w = [0, \dots, 0]$
For $t = 1, \dots, T$:
  $w \leftarrow w - \eta_t \nabla_w \text{TrainLoss}(w)$
Idea: simply make $T$ smaller
Intuition: if we have fewer updates, then $\|w\|$ can't get too big.
Lesson: try to minimize the training error, but don't try too hard.
A related way to keep the weights small is called regularization, which involves adding an additional term to the objective function which penalizes the norm of $w$ in a soft way rather than enforcing a hard constraint. This is probably the most common way to control the norm. We can use gradient descent on this regularized objective, and this simply leads to an algorithm which subtracts a scaled down version of $w$ each iteration. This has the effect of keeping $w$ closer to the origin than it otherwise would be.
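The regularized update can be sketched as below, again with an assumed constant step size `eta` and a caller-supplied `grad` for $\nabla_w \text{TrainLoss}(w)$. The penalty is taken to be $\frac{\lambda}{2}\|w\|^2$, whose gradient is the $\lambda w$ term that appears in the update.

```python
def regularized_gradient_descent(grad, d, lam, T=100, eta=0.1):
    """Gradient descent on TrainLoss(w) + (lam / 2) * ||w||^2."""
    w = [0.0] * d
    for _ in range(T):
        g = grad(w)
        # The extra lam * w_i term pulls every weight towards zero.
        w = [w_i - eta * (g_i + lam * w_i) for w_i, g_i in zip(w, g)]
    return w

# Toy loss ||w - (3, 4)||^2; lam = 2 halves the solution, lam = 0 leaves it alone.
target = (3.0, 4.0)
grad = lambda w: [2 * (w_i - t_i) for w_i, t_i in zip(w, target)]
w_regularized = regularized_gradient_descent(grad, d=2, lam=2.0)
w_unregularized = regularized_gradient_descent(grad, d=2, lam=0.0)
```

For this toy loss the regularized minimizer is $2 \cdot \text{target} / (2 + \lambda)$, so $\lambda = 2$ gives $(1.5, 2)$: the same direction as the unregularized answer but closer to the origin.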
Summary
Control dimensionality: remove features, prune nodes of decision trees, reduce number of hidden units
Control norm: constrain $\|w\|$, penalize $\|w\|$, or just stop gradient descent early
Key idea: keep it simple
Try to minimize training error, but keep the hypothesis class small.
Question: how small?
A really cheap way to keep the weights small is to do early stopping. As we run more iterations of gradient descent, the objective function improves. If we cared about the objective function, this would always be a good thing. However, our true objective is not the training loss. Each time we update the weights, $w$ has the potential of getting larger, so by running gradient descent for fewer iterations, we are implicitly ensuring that $w$ stays small. Though early stopping seems hacky, there is actually some theory behind it. And one paradoxical note is that we can sometimes get better solutions by performing less computation.
Hyperparameters
Definition: hyperparameters
Properties of the learning algorithm (features, regularization parameter $\lambda$, number of iterations $T$, step size $\eta_t$, etc.).
More generally, how do we choose hyperparameters?
Choose hyperparameters to minimize $D_{train}$ error? No - solution would be to include all features, set $\lambda = 0$, $T \to \infty$.
Choose hyperparameters to minimize $D_{test}$ error? No - choosing based on $D_{test}$ makes it an unreliable estimate of error!
We've seen many ways to control the size of the hypothesis class (and thus reduce variance) based on either reducing the dimensionality or reducing the norm. It is important to note that what matters is the size of the hypothesis class, not how complex the predictors in the hypothesis class look. To put it another way, using complex features backed by 1000 lines of code doesn't hurt you if there are only 5 of them. However, if we make the hypothesis class too small, then the bias gets too big. In practice, how do we decide the appropriate size?
Validation
Problem: can't use test set!
Solution: randomly take out 20% of training and use it instead of the test set to estimate test error.
[figure: $D_{train}$ split into $D_{train} \setminus D_{val}$ and $D_{val}$]
Definition: validation set
A validation (development) set is taken out of the training data which acts as a surrogate for the test set.
Generally, our learning algorithm has multiple hyperparameters to set. These hyperparameters cannot be set by the learning algorithm on the training data because we would just choose a degenerate solution and overfit. On the other hand, we can't use the test set either because then we would spoil the test set. The solution is to invent something that looks like a test set. There's no other data lying around, so we'll have to steal it from the training set. The resulting set is called the validation set. With this validation set, now we can simply try out a bunch of different hyperparameters and choose the setting that yields the lowest error on the validation set. Which hyperparameter values should we try? Generally, you should start by getting the right order of magnitude (e.g., $\lambda = 0.0001, 0.001, 0.01, 0.1, 1, 10$) and then refine if necessary.
Cross-validation
Problem: only one validation set (usually small) provides an unreliable estimate of test error.
Solution: generate multiple validation sets.
Partition training data $D_{train}$ into $K$ folds: $D_{train}^{(1)}, D_{train}^{(2)}, D_{train}^{(3)}, D_{train}^{(4)}, D_{train}^{(5)}$
Use each fold $k = 1, \dots, K$ as the validation set.
Average the $K$ validation error rates.
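The cross-validation procedure above can be sketched as follows. The sketch assumes the examples have already been shuffled and that there are at least $K$ of them; `learn` (training set to predictor) and `error` (predictor and examples to error rate) are hypothetical callbacks supplied by the caller.

```python
def k_fold_splits(examples, K):
    """Yield (train, validation) pairs, using each of the K folds once."""
    folds = [examples[k::K] for k in range(K)]
    for k in range(K):
        validation = folds[k]
        train = [ex for j, fold in enumerate(folds) if j != k for ex in fold]
        yield train, validation

def cross_validation_error(learn, error, examples, K=5):
    """Average the K validation error rates."""
    errors = []
    for train, validation in k_fold_splits(examples, K):
        predictor = learn(train)
        errors.append(error(predictor, validation))
    return sum(errors) / K
```

To tune a hyperparameter such as the regularization strength, one would call `cross_validation_error` once per candidate value and keep the value with the lowest average error, leaving the test set untouched.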
Summary
Key to good machine learning: balance bias and variance
Control tradeoff using size of hypothesis class
Test set: only for final evaluation
Use validation set to tune hyperparameters
Roadmap
Review of features and non-linearity
How good is a predictor?
Unsupervised learning
Conclusion
Supervision?
Supervised learning:
Prediction: $D_{train}$ contains input-output pairs $(x, y)$
Fully-labeled data is very expensive to obtain (we can get 10,000 labeled examples)
Unsupervised learning:
Clustering: $D_{train}$ only contains inputs $x$
Unlabeled data is much cheaper to obtain (we can get 100 million unlabeled examples)
Word clustering using HMMs [Brown et al, 1992]

Input: raw text (100 million words of news articles)...
Output:
Cluster 1: Friday Monday Thursday Wednesday Tuesday Saturday Sunday weekends Sundays Saturdays
Cluster 2: June March July April January December October November September August
Cluster 3: water gas coal liquid acid sand carbon steam shale iron
Cluster 4: great big vast sudden mere sheer gigantic lifelong scant colossal
Cluster 5: man woman boy girl lawyer doctor guy farmer teacher citizen
Cluster 6: American Indian European Japanese German African Catholic Israeli Italian Arab
Cluster 7: pressure temperature permeability density porosity stress velocity viscosity gravity tension
Cluster 8: mother wife father son husband brother daughter sister boss uncle
Cluster 9: machine device controller processor CPU printer spindle subsystem compiler plotter
Cluster 10: John George James Bob Robert Paul William Jim David Mike
Cluster 11: anyone someone anybody somebody
Cluster 12: feet miles pounds degrees inches barrels tons acres meters bytes
Cluster 13: director chief professor commissioner commander treasurer founder superintendent dean custodian
Cluster 14: had hadn't hath would've could've should've must've might've
Cluster 15: head body hands eyes voice arm seat eye hair mouth
Impact: used in many state-of-the-art NLP systems
Feature learning using neural networks [Le et al, 2012]

Input: 10 million images (sampled frames from YouTube)
Output: [figure]
Impact: state-of-the-art results on object recognition (22,000 categories)
Key idea: unsupervised learning
Data has lots of rich latent structures; want methods to discover this structure automatically.
We have so far covered the basics of supervised learning. If you get a labeled training set of $(x, y)$ pairs, then you can train a predictor. However, where do these examples $(x, y)$ come from? If you're doing image classification, someone has to sit down and label each image, and generally this tends to be expensive enough that we can't get that many examples. On the other hand, there are tons of unlabeled examples sitting around (e.g., Flickr for photos; Wikipedia and news articles for text documents). The main question is whether we can harness all that unlabeled data to help us make better predictions. This is the goal of unsupervised learning. Empirically, unsupervised learning has produced some pretty impressive results. HMMs can be used to take a ton of raw text and cluster related words together. An unsupervised variant of neural networks called autoencoders can be used to take a ton of raw images and output clusters of images. No one told the learning algorithms explicitly what the clusters should look like; they just figured it out.
Types of unsupervised learning

Clustering (e.g., K-means):
Dimensionality reduction (e.g., PCA):
Clustering
Definition: clustering
Input: training set of input points $D_{train} = \{x_1, \dots, x_n\}$
Output: assignment of each point to a cluster $(z_1, \dots, z_n)$ where $z_i \in \{1, \dots, K\}$
Intuition: want similar points to be put in the same cluster, dissimilar points to be put in different clusters
[whiteboard]
The task of clustering is to take a set of points as input and return a partitioning of the points into $K$ clusters. We will represent the partitioning using an assignment vector $z = (z_1, \dots, z_n)$; for each $i$, $z_i \in \{1, \dots, K\}$ specifies which of the $K$ clusters point $i$ is assigned to.
K-means objective
Setup: each cluster $k = 1, \dots, K$ is represented by a centroid $\mu_k \in \mathbb{R}^d$
Intuition: want each point $\phi(x_i)$ close to its assigned centroid $\mu_{z_i}$
Objective function: $\text{Loss}_{\text{kmeans}}(z, \mu) = \sum_{i=1}^n \|\phi(x_i) - \mu_{z_i}\|^2$
Need to choose centroids $\mu$ and assignments $z$ jointly
K-means is a particular method for performing clustering which is based on associating each cluster with a centroid $\mu_k$ for $k = 1, \dots, K$. The intuition is to assign the points to clusters and place the centroid for each cluster so that each point $\phi(x_i)$ is close to its assigned centroid $\mu_{z_i}$.
K-means algorithm (Step 1)

Goal: given centroids $\mu_1, \dots, \mu_K$, assign each point to the best centroid.
Algorithm: Step 1 of K-means
For each point $i = 1, \dots, n$:
  Assign $i$ to cluster with closest centroid: $z_i \leftarrow \arg\min_{k = 1, \dots, K} \|\phi(x_i) - \mu_k\|^2$.
Trying to optimize the K-means objective all at once is a daunting task, so let's try to break it down. First, assume we know where the centroids $\mu$ are. Then we can optimize the K-means objective with respect to $z$ alone quite easily. It is easy to show that the best setting for $z_i$ is the cluster $k$ that minimizes the distance to the centroid $\mu_k$ (which is fixed).
K-means algorithm (Step 2)

Goal: given cluster assignments $z_1, \dots, z_n$, find the best centroids $\mu_1, \dots, \mu_K$.
Algorithm: Step 2 of K-means
For each cluster $k = 1, \dots, K$:
  Set $\mu_k$ to average of points assigned to cluster $k$: $\mu_k \leftarrow \frac{1}{|\{i : z_i = k\}|} \sum_{i : z_i = k} \phi(x_i)$
Now, turning things around, let's suppose we knew what the assignments $z$ were. We can again look at the K-means objective function and try to optimize it with respect to the centroids $\mu$. The best choice of $\mu_k$ is to place the centroid at the average of all the points assigned to cluster $k$.
K-means algorithm
$\min_z \min_\mu \text{Loss}_{\text{kmeans}}(z, \mu)$
Key idea: alternating minimization
Tackle hard problem by solving two easy problems.
K-means algorithm
Objective: $\min_z \min_\mu \text{Loss}_{\text{kmeans}}(z, \mu)$
Algorithm: K-means
Initialize $\mu_1, \dots, \mu_K$ randomly.
For $t = 1, \dots, T$:
  Step 1: set assignments $z$ given $\mu$
  Step 2: set centroids $\mu$ given $z$
[demo]
Now we have the two ingredients to state the full K-means algorithm. We start by initializing all the centroids randomly. Then, we iteratively alternate back and forth between steps 1 and 2, optimizing $z$ given $\mu$ and vice-versa.
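Putting the two steps together, a minimal pure-Python sketch of K-means might look like this. It assumes points are tuples of numbers, initializes the centroids on $K$ randomly sampled data points (one way of "initializing randomly"; the slides do not specify how), and leaves a centroid unchanged if no points are assigned to it. The function names are illustrative.

```python
import random

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(points, K, T=20, seed=0):
    """Alternate Step 1 (assignments) and Step 2 (centroids) for T rounds."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, K)]
    z = [0] * len(points)
    for _ in range(T):
        # Step 1: assign each point to the cluster with the closest centroid.
        z = [min(range(K), key=lambda k: squared_distance(p, centroids[k]))
             for p in points]
        # Step 2: move each centroid to the average of its assigned points.
        for k in range(K):
            members = [p for p, z_i in zip(points, z) if z_i == k]
            if members:
                centroids[k] = [sum(c) / len(members) for c in zip(*members)]
    loss = sum(squared_distance(p, centroids[z_i]) for p, z_i in zip(points, z))
    return z, centroids, loss

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
z, centroids, loss = kmeans(points, K=2)
```

On these four points the two obvious pairs end up in separate clusters, and the final loss is the sum of squared distances from each point to its pair's midpoint.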
Local minima
K-means is guaranteed to converge to a local minimum, but is not guaranteed to find the global minimum.
[demo: getting stuck in local optima]
Solutions:
Run multiple times from different random initializations
Initialize with a heuristic (K-means++)
K-means is guaranteed to decrease the loss function each iteration and will converge to a local minimum, but it is not guaranteed to find the global minimum, so one must exercise caution when applying K-means. One solution is to simply run K-means several times from multiple random initializations and then choose the solution that has the lowest loss. Or we could try to be smarter in how we initialize K-means. K-means++ is an initialization scheme which places the centroids on the training points in a way that they tend to be far apart from each other.
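Using the clusters

Goal: create better features $\Phi(x)$ for supervised learning
Intuition: $k$-th feature measures similarity between $\mu_k$ and $\phi(x)$
[figure: point $\phi(x)$ among centroids $\mu_1, \mu_2, \mu_3, \mu_4$]
Assumption: points inside same cluster tend to have same output label
Cluster-based features
[figure: $\phi(x)$ mapped to $\Phi(x)$]
Distance to centroid: $d_k = \|\phi(x) - \mu_k\|$
Average distance: $\bar{d} = \frac{1}{K} \sum_{k=1}^K d_k$
$\Phi_k(x) = \max\{\bar{d} - d_k, 0\}$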
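The cluster-based feature map above is a direct computation once the centroids are known. Here is a small sketch, assuming points and centroids are tuples of numbers; the name `cluster_features` is illustrative.

```python
import math

def cluster_features(phi, centroids):
    """Phi_k(x) = max(d_bar - d_k, 0), where d_k = ||phi(x) - mu_k||."""
    distances = [math.sqrt(sum((a - b) ** 2 for a, b in zip(phi, mu)))
                 for mu in centroids]
    average = sum(distances) / len(distances)
    return [max(average - d_k, 0.0) for d_k in distances]

# A point sitting on the first of two centroids.
features = cluster_features((0.0, 0.0), [(0.0, 0.0), (3.0, 4.0)])
```

Here the distances are 0 and 5, the average is 2.5, and the features come out as `[2.5, 0.0]`: only centroids closer than average get a nonzero feature, so the representation is sparse.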
Having performed clustering on our large amount of unlabeled data, let's see how it can help us perform better classification. The key idea will be to use the centroids to define new features. We will define a new feature vector $\Phi(x) = (\Phi_1(x), \dots, \Phi_K(x))$ where the value of feature $k$ represents the proximity of $\phi(x)$ to the centroid $\mu_k$. Specifically, $\Phi_k(x)$ will be the amount by which the distance from $\phi(x)$ to centroid $\mu_k$ is less than the average distance between $\phi(x)$ and a centroid (or zero if it is not less). Why is this a reasonable thing to do? The underlying assumption is that points inside the same cluster tend to have the same output label, so by describing each point through its proximity to the clusters, we are transforming the data points into $\Phi(x)$, which is in a way much cleaner than $\phi(x)$.
Unsupervised learning summary

Leverage tons of unlabeled data
Difficult optimization:
latent variables $z$
parameters $\mu$
Using these as features in supervised learning improves accuracy!
Roadmap
Review of features and non-linearity
Conclusion
Fancier prediction tasks

[figure: $x$ → $f_w$ → $y$]
So far: $y \in \{-1, +1\}$ (binary classification) or $y \in \mathbb{R}$ (regression): reflex-based models
Structured prediction: $y$ is a sequence, tree, etc.: more complex models
Same strategy: write down a loss function, take the derivative, and do gradient descent (more on this later)!
Conclusion
Think in terms of hypothesis class:
[figure: all predictors; hypothesis class $F$; target predictor $f^*$; learned predictor $f_w$; labels: bias, variance, feature extraction, parameter tuning]
Separate (powerful) modeling and (simple) algorithms:

[figure: $D_{train}$ → Learner (feature extraction, parameter tuning) → $f$]
So far, all reflex-based; next time: state-based
This concludes our tour of machine learning. Machine learning is a huge topic, and we have only skimmed the surface. One thing I'd like to encourage you to do is to think in terms of hypothesis classes: what kind of functions can your learning algorithm potentially learn? This gives a good way to decouple the modeling from the algorithms, and gives us a way to understand generalization in terms of bias and variance tradeoffs. From a modeling perspective, we can focus our attention on figuring out what hypothesis class to use (in deciding which features to use). From the learning and optimization perspective, we can focus on the best way to choose a predictor and return its parameters. So far, we have only focused on reflex-based models where the predictor only outputs a yes/no or a number. Next time, we will start looking at models which can perform higher-level reasoning, but machine learning will still be our companion for the remainder of the class.
Question
What was the most confusing part of today's lecture?
