Learning3 6pp PDF
Question
Is it possible to learn a linear classifier with an infinite number of features using finite computation?
Yes
No
Roadmap
Review of features and non-linearity
Unsupervised learning
Conclusion
Learning framework
[Figure: $x$, feature extraction $\phi(x)$, parameter tuning, $f_w$, domain knowledge]
Linear classifiers, nearest neighbors, kernel methods, decision trees, neural networks
[Figure: a complex predictor $f$ on simple features $\phi(x)$ is equivalent to a linear predictor on complex features $\phi'(x)$]
Kernel methods
Prediction on $x'$:
There are a lot of machine learning methods (nearest neighbors, kernel methods, decision trees, neural networks), all with their own logical intuitions. To get a deeper understanding, it is useful to relate all of these methods to each other. One easy way to do this is to look at the possible decision boundaries defined by each of these methods. Another way, which we'll pursue now, is to show that the predictors defined by these methods can be represented by a linear classifier based on more complex features $\phi'(x)$. In other words, decision trees, for example, take simple features $\phi(x)$ and find a complex decision boundary on $\phi(x)$. This is equivalent to mapping $x$ to some more complex features $\phi'(x)$ and finding a simple linear decision boundary on $\phi'(x)$. This should not be surprising because features are arbitrary functions. If we took $\phi'(x)$ to be the feature vector where we have one feature per possible input $x$ and set the weights to be the correct output for that $x$, then the so-called linear classifiers actually represent all possible predictors.
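score $= \sum_{x \in D_{\text{train}}} \alpha_x k(x, x')$
Gaussian kernel: $k(x, x') = \exp\left(-\frac{\|\phi(x) - \phi(x')\|^2}{2\sigma^2}\right) = \phi'(x) \cdot \phi'(x')$
Every kernel corresponds to dot products of feature vectors $\phi'(x) \cdot \phi'(x')$ (in this case, infinite dimensional!)
score $= w \cdot \phi'(x')$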
Decision trees
Recall that the prediction score on a new point $x'$ is a combination of the example coefficients $\alpha_x$ (think of them as predictions of those points) weighted by $k(x, x')$, which captures the similarity between $x$ and $x'$. The Gaussian kernel, in particular, says two points are similar if their feature vectors are very close. We stated (but didn't prove) that each kernel $k(x, x')$ corresponds to the inner product between two feature vectors, say $\phi'(x) \cdot \phi'(x')$. It is important that this is just a mathematical statement: we don't actually need to construct the feature vectors because we are working with kernels. It's worth noting that in the case of the Gaussian kernel, the feature vectors are infinite-dimensional. In other words, by using a Gaussian kernel, we are implicitly learning on an infinite number of features. You would expect that having an infinite number of features gives you something quite powerful, and indeed it does. After all, Gaussian kernels are like nearest neighbors, which are also quite powerful as they allow for very flexible decision boundaries. What are these features? There is a feature for each input point $a$. The value of that feature on an input $x$ is the similarity to $a$: $\phi'_a(x) = k(a, x)$. In other words, the properties defined by the features are just similarities to all points in the space. Finally, we can use the identity $w = \sum_x \alpha_x \phi'(x)$ to recover our familiar definition of the score for a linear classifier.
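To make the kernel prediction rule concrete, here is a minimal sketch of the score as a similarity-weighted sum of example coefficients. It is an illustration under stated assumptions, not the course's implementation: the basic features are taken to be the raw coordinates, the bandwidth `sigma` and its placement in the exponent follow the standard Gaussian kernel form, and the coefficients in `alphas` are hypothetical rather than learned.

```python
import math

def gaussian_kernel(a, b, sigma=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for equal-length vectors."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def kernel_score(alphas, train_points, x_new, sigma=1.0):
    """score = sum over training points x of alpha_x * k(x, x_new)."""
    return sum(a * gaussian_kernel(x, x_new, sigma)
               for a, x in zip(alphas, train_points))

# Hypothetical training points and coefficients (not learned here).
train_points = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]
alphas = [1.0, 1.0, -1.0]

# A point near the two positive examples gets a positive score;
# a point on top of the negative example gets a negative score.
score_near = kernel_score(alphas, train_points, (0.5, 0.5))
score_far = kernel_score(alphas, train_points, (4.0, 4.0))
```

Note that the infinite-dimensional feature vectors never appear: the computation only touches the finite set of training points.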
score $= w \cdot \phi'(x)$
CS221: Artificial Intelligence (Autumn 2013) - Percy Liang
Neural networks
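[Figure: neural network with input features $\phi_1(x), \phi_2(x), \phi_3(x)$, hidden units $h_1, h_2$, first-layer weights $V$, output weights $w$, and output score]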
In decision trees, we can associate each leaf $j$ with a feature $\phi'_j(x)$. The weight of that feature is simply the label assigned to the leaf, and the feature value is the conjunction of the conditions along the path from the root to that leaf. Again, this gives us the usual linear prediction score $w \cdot \phi'(x)$. In kernel methods, we essentially throw in all the features, and learning's job is to tune the weights of those features (implicitly through the $\alpha_x$'s). In decision trees, the features $\phi'(x)$ are actually selected by the learning algorithm (through the construction of the decision tree itself), which makes it a kind of feature learning algorithm.
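The correspondence between leaves and features can be checked on a toy example. The sketch below uses a hypothetical hand-written tree with three leaves (the thresholds and labels are made up for illustration) and shows that the tree's prediction equals a linear score over one indicator feature per leaf.

```python
def tree_predict(x):
    """A hypothetical depth-2 decision tree over a 2-D input."""
    if x[0] < 2:
        return +1.0 if x[1] < 1 else -1.0
    return +1.0

def leaf_features(x):
    """One indicator per leaf: the conjunction of the tests on its root-to-leaf path."""
    return [
        1.0 if (x[0] < 2 and x[1] < 1) else 0.0,       # leaf 0
        1.0 if (x[0] < 2 and not x[1] < 1) else 0.0,   # leaf 1
        1.0 if not x[0] < 2 else 0.0,                  # leaf 2
    ]

# The weight of each leaf feature is the label assigned to that leaf.
leaf_labels = [+1.0, -1.0, +1.0]

def linear_score(x):
    """Usual linear score: dot product of weights and leaf features."""
    return sum(w * f for w, f in zip(leaf_labels, leaf_features(x)))
```

Exactly one leaf feature is active for any input, so the dot product simply reads off that leaf's label.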
score $= w \cdot \sigma(V \phi(x))$, where $\phi'(x) = \sigma(V \phi(x))$
Summary
Start with a basic set of features $\phi(x)$
Neural networks take the original feature vector $\phi(x)$ and map it to a vector of activations $h$. We can understand this activation vector as a complex feature vector $\phi'(x)$, which is used in the usual linear way to define the prediction score $w \cdot \phi'(x)$. Like decision trees, this feature vector $\phi'(x)$ is set by the learning algorithm via tuning the parameters of the first layer $V$. The feature vector $\phi'(x)$ can be thought of as the output of a set of linear predictors, each mapping $\phi(x)$ to $\sigma(v_k \cdot \phi(x))$. Here's a cartoon of what's going on to provide intuition: if the neural network is trying to map an image $x$ onto whether there is a face in the image ($y$), each hidden unit might correspond to the presence or absence of a face part (e.g., ear, eye, nose, mouth). However, we don't tell the neural network what parts to look for; it decides for itself what parts are useful for the final face detection task.
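As a rough sketch of the two-stage computation described above, the code below first computes the hidden activations and then applies an ordinary linear score to them. It assumes $\sigma$ is the logistic function, and the matrices `V` and `w` are hypothetical values chosen only to make the arithmetic easy to follow.

```python
import math

def sigmoid(z):
    """Logistic function (assumed here to be the activation sigma)."""
    return 1.0 / (1.0 + math.exp(-z))

def hidden_features(V, phi):
    """phi'(x) = sigma(V phi(x)): one logistic unit per row v_k of V."""
    return [sigmoid(sum(v_kj * p_j for v_kj, p_j in zip(v_k, phi))) for v_k in V]

def nn_score(w, V, phi):
    """score = w . phi'(x), a linear classifier on the learned features."""
    h = hidden_features(V, phi)
    return sum(w_k * h_k for w_k, h_k in zip(w, h))

# Hypothetical parameters: 3 input features, 2 hidden units.
V = [[1.0, -1.0, 0.0],
     [0.0, 2.0, -1.0]]
w = [3.0, -2.0]
```

With an all-zero input both hidden units output 0.5, so the score is 3.0 * 0.5 - 2.0 * 0.5 = 0.5.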
Linear classifiers: use features directly
Kernel methods: linear classifiers with a possibly infinite set of features $\phi'(x)$
Decision trees and neural networks: linear classifiers with automatically learned features $\phi'(x)$
Roadmap
In summary, we always start out with a basic set of features $\phi(x)$. If the features are good enough, then we can simply use a linear classifier. This is typical for most text applications, where we have a feature for the presence/absence of each word. In this case, the feature vector is quite high dimensional (at least thousands), so linear classifiers are powerful enough. However, if the features are not good enough, then we can either hand-engineer the features to make them better or simply appeal to a more complex learning algorithm that can do more with the basic features $\phi(x)$. For instance, if $x$ is an image and $\phi(x)$ is just the vector of pixel intensities, then linear classifiers don't work very well because the features are too low-level and noisy.
We saw several ways that learning algorithms can be thought of as linear classifiers using more complex features. Kernel methods just throw in the kitchen sink and use an infinite set of features. Decision trees choose features that carve out regions of the input space. Neural networks choose features that correspond to linear predictors that capture higher-order properties of the basic input (through the activations of the hidden layers).
Finally, it's worth noting that we've spent most of our time talking about the expressiveness of hypothesis classes defined by the various learning algorithms. This is perhaps the part that is most important and hardest to understand. The mechanics of how to actually tune the parameters are fairly straightforward: using gradient-based optimization in the case of kernel methods and neural networks, and using the ID3 algorithm for decision trees.
Unsupervised learning
Conclusion
Question
What's the true objective of machine learning?
Minimize error on the training set
Minimize training error + complexity penalty
Minimize error on the validation set
Minimize error on the test set
Minimize error on unseen future examples
To learn about machines
A strawman algorithm
Algorithm: rote learning
Training: just store $D_{\text{train}}$.
Predictor $f(x)$: If $(x, y) \in D_{\text{train}}$: return $y$. Else: segfault.
Gets zero training error!
But doesn't generalize to examples it hasn't seen...
Evaluation
[Figure: the learner maps $D_{\text{train}}$ to a predictor $f$]
So far, we've talked about the expressivity of different hypothesis classes and given algorithms to tune parameters, but we have neglected perhaps the most important question: what is our true objective? Clearly, machine learning can't be about just minimizing the training loss, as we've pretended with our fancy loss minimization framework. The rote learning algorithm does a perfect job of that, and yet is clearly a bad idea. It overfits to the training data and doesn't generalize to unseen examples.
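The rote learning strawman is easy to write down, which makes its failure mode easy to see. The sketch below is illustrative only: a Python `KeyError` stands in for the slide's "segfault", the helper `error_rate` counts any such failure as a mistake, and the tiny datasets are made up.

```python
class RoteLearner:
    """Memorizes the training set; has no answer for anything else."""

    def fit(self, train):
        # Training: just store D_train as a lookup table from input to label.
        self.table = {tuple(x): y for x, y in train}
        return self

    def predict(self, x):
        # Raises KeyError on an unseen input (the slide's "segfault").
        return self.table[tuple(x)]

def error_rate(predict, examples):
    """Fraction of examples that are mispredicted or not handled at all."""
    mistakes = 0
    for x, y in examples:
        try:
            mistakes += (predict(x) != y)
        except KeyError:
            mistakes += 1
    return mistakes / len(examples)

# Hypothetical data: the test inputs never appear in training.
train = [((0, 0), +1), ((1, 1), -1)]
test = [((2, 2), +1), ((3, 3), -1)]
learner = RoteLearner().fit(train)
```

On this data the training error is zero while every test example fails, which is the gap between training loss and the real objective.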
Key idea: the real learning objective
Our goal is to minimize error on unseen future examples.
Evaluation
Problem: don't have unseen future examples
Surrogate:
Definition: test set
The test set $D_{\text{test}}$ contains examples only used for final evaluation.
So what is the true objective then? Taking a step back, what we're doing is building a system which happens to use machine learning, and then we're going to deploy it. What we really care about is how accurate that system is on those unseen future inputs. Of course, we can't access unseen future examples, so the next best thing is to create a test set. As much as possible, we should treat the test set as a pristine thing that's unseen and from the future. We shouldn't manually or automatically tune our predictor based on the test error, because we wouldn't be able to do that on future examples. Of course at some point we have to run our algorithm on the test set, but just be aware that each time this is done, the test set becomes a less reliable indicator of how well you're actually doing.
Why is overfitting so bad? Here's a less contrived case of overfitting: comparing nearest neighbors and linear classification. In this demo, nearest neighbors gets zero training error but 20% test error, whereas linear classifiers get about 10% on both. So overfitting does lead to worse results, especially when there's noise in our data.
Generalization
When will a learning algorithm generalize well?
[Figure: $D_{\text{train}}$ and $D_{\text{test}}$]
So far, we have an intuitive feel for what overfitting is. How do we make this precise? In particular, when does a learning algorithm generalize from the training set to the test set?
[Figure: within the set of all predictors, feature extraction fixes the hypothesis class $\mathcal{F}$ and parameter tuning picks $f_w \in \mathcal{F}$; bias is the gap between $\mathcal{F}$ and the target predictor $f^*$, and variance is the gap between $f_w$ and the best predictor in $\mathcal{F}$]
[Plot: training and test error versus hypothesis class size]
Bias: how good is the hypothesis class?
Variance: how good is the learned predictor relative to the hypothesis class?
Test error is bias + variance
Underfitting: bias too high (majority algorithm)
Overfitting: variance too high (rote learning algorithm)
Fitting: balancing bias and variance
Question
What are possible ways to reduce overfitting (select all that apply)?
Recall that our learning framework consists of first choosing a hypothesis class $\mathcal{F}$ and then choosing a particular element from $\mathcal{F}$. We use the term bias to refer to how far the entire hypothesis class is from the target predictor $f^*$. Larger hypothesis classes have lower bias. We use the term variance to refer to how good the predictor $f_w$ returned by the learning algorithm is, but only with respect to the hypothesis class. Larger hypothesis classes lead to higher variance because it's harder to know, based on limited data, which predictor is good. We can roughly think about the training error as capturing the bias: how well we can fit. The test error that we care about includes both bias and variance. We'd like both to be small, but there's a tradeoff. Note: we are using the terms bias and variance casually here to convey intuition. There is in fact a formal definition of bias and variance, but we will not discuss it here.
Make sure $\|w\| \le 1$
Run SGD for fewer iterations
Round your weights $w$ to integers
$f_w(x) = w_0 + w_1 x + w_2 x^2$
Dimensionality
Linear classifiers: keep number of features small
$\min_{w : \|w\| \le B} \text{TrainLoss}(w)$
Algorithm: projected gradient descent
Initialize $w = [0, \dots, 0]$
For $t = 1, \dots, T$:
  $w \leftarrow w - \eta_t \nabla_w \text{TrainLoss}(w)$
  If $\|w\| > B$: $w \leftarrow B \frac{w}{\|w\|}$
Same as gradient descent, except if $w$ leaves constraint set, put it back.
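The projection step can be sketched in a few lines. This is an illustration under assumptions that are not in the slides: the training loss is taken to be mean squared error over `(phi, y)` pairs, the step size `eta` is constant rather than a schedule, and the two-example dataset is made up so that the unconstrained minimizer is $w = [3, 4]$ with norm 5.

```python
import math

def train_loss_gradient(w, data):
    """Gradient of the mean squared loss (w . phi - y)^2 over (phi, y) pairs."""
    grad = [0.0] * len(w)
    for phi, y in data:
        residual = sum(wj * pj for wj, pj in zip(w, phi)) - y
        for j, pj in enumerate(phi):
            grad[j] += 2.0 * residual * pj / len(data)
    return grad

def projected_gradient_descent(data, dim, B, T=200, eta=0.1):
    w = [0.0] * dim
    for _ in range(T):
        # Ordinary gradient step.
        grad = train_loss_gradient(w, data)
        w = [wj - eta * gj for wj, gj in zip(w, grad)]
        # Projection: if w left the ball ||w|| <= B, rescale it back onto the boundary.
        norm = math.sqrt(sum(wj * wj for wj in w))
        if norm > B:
            w = [B * wj / norm for wj in w]
    return w

# Hypothetical data whose unconstrained optimum is w = [3, 4].
data = [([1.0, 0.0], 3.0), ([0.0, 1.0], 4.0)]
w_tight = projected_gradient_descent(data, dim=2, B=1.0)
w_loose = projected_gradient_descent(data, dim=2, B=10.0)
```

With `B=10.0` the constraint never binds and the result approaches $[3, 4]$; with `B=1.0` the weights end up on the unit ball, pointing in the same direction.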
$\min_w \text{TrainLoss}(w) + \frac{\lambda}{2} \|w\|^2$
Algorithm: gradient descent
Initialize $w = [0, \dots, 0]$
For $t = 1, \dots, T$:
  $w \leftarrow w - \eta_t [\nabla_w \text{TrainLoss}(w) + \lambda w]$
Same as gradient descent, except shrink the weights towards zero.
Note: SVM = hinge loss + regularization
Early stopping
Algorithm: gradient descent
Initialize $w = [0, \dots, 0]$
For $t = 1, \dots, T$:
  $w \leftarrow w - \eta_t \nabla_w \text{TrainLoss}(w)$
Idea: simply make $T$ smaller
Intuition: if we have fewer updates, then $\|w\|$ can't get too big.
Lesson: try to minimize the training error, but don't try too hard.
A related way to keep the weights small is called regularization, which involves adding an additional term to the objective function that penalizes the norm of $w$ in a soft way rather than enforcing a hard constraint. This is probably the most common way to control the norm. We can use gradient descent on this regularized objective, and this simply leads to an algorithm which subtracts a scaled-down version of $w$ each iteration. This has the effect of keeping $w$ closer to the origin than it otherwise would be.
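A small sketch of gradient descent on a regularized objective shows the shrinkage effect. The assumptions here are illustrative: the penalty is $\frac{\lambda}{2}\|w\|^2$ (so its gradient is $\lambda w$), the training loss is mean squared error, the step size `eta` is constant, and the dataset is made up so that the unregularized minimizer is $w = [3, 4]$.

```python
def train_loss_gradient(w, data):
    """Gradient of the mean squared loss (w . phi - y)^2 over (phi, y) pairs."""
    grad = [0.0] * len(w)
    for phi, y in data:
        residual = sum(wj * pj for wj, pj in zip(w, phi)) - y
        for j, pj in enumerate(phi):
            grad[j] += 2.0 * residual * pj / len(data)
    return grad

def regularized_gradient_descent(data, dim, lam, T=500, eta=0.1):
    w = [0.0] * dim
    for _ in range(T):
        grad = train_loss_gradient(w, data)
        # The extra lam * w term subtracts a scaled-down copy of w each step.
        w = [wj - eta * (gj + lam * wj) for wj, gj in zip(w, grad)]
    return w

# Hypothetical data whose unregularized optimum is w = [3, 4].
data = [([1.0, 0.0], 3.0), ([0.0, 1.0], 4.0)]
w_plain = regularized_gradient_descent(data, dim=2, lam=0.0)
w_shrunk = regularized_gradient_descent(data, dim=2, lam=1.0)
```

For this particular loss the regularized minimizer is $y_j / (1 + \lambda)$ in each coordinate, so `lam=1.0` halves the weights to roughly $[1.5, 2.0]$.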
Summary
Control dimensionality: remove features, prune nodes of decision trees, reduce number of hidden units
Control norm: constrain $\|w\|$, penalize $\|w\|$, or just stop gradient descent early
Key idea: keep it simple
Try to minimize training error, but keep the hypothesis class small.
Question: how small?
A really cheap way to keep the weights small is to do early stopping. As we run more iterations of gradient descent, the objective function improves. If we cared about the objective function, this would always be a good thing. However, our true objective is not the training loss. Each time we update the weights, $\|w\|$ has the potential of getting larger, so by running gradient descent for fewer iterations, we are implicitly ensuring that $\|w\|$ stays small. Though early stopping seems hacky, there is actually some theory behind it. And one paradoxical note is that we can sometimes get better solutions by performing less computation.
Hyperparameters
Definition: hyperparameters
Properties of the learning algorithm (features, regularization parameter $\lambda$, number of iterations $T$, step size $\eta_t$, etc.).
More generally, how do we choose hyperparameters?
Choose hyperparameters to minimize $D_{\text{train}}$ error? No - solution would be to include all features, set $\lambda = 0$, $T \to \infty$.
Choose hyperparameters to minimize $D_{\text{test}}$ error? No - choosing based on $D_{\text{test}}$ makes it an unreliable estimate of error!
We've seen many ways to control the size of the hypothesis class (and thus reduce variance) based on either reducing the dimensionality or reducing the norm. It is important to note that what matters is the size of the hypothesis class, not how complex the predictors in the hypothesis class look. To put it another way, using complex features backed by 1000 lines of code doesn't hurt you if there are only 5 of them. However, if we make the hypothesis class too small, then the bias gets too big. In practice, how do we decide the appropriate size?
Validation
Problem: can't use test set!
Solution: randomly take out 20% of training and use it instead of the test set to estimate test error.
[Figure: $D_{\text{train}}$ split into $D_{\text{train}} \setminus D_{\text{val}}$ and $D_{\text{val}}$]
Definition: validation set
A validation (development) set is taken out of the training data and acts as a surrogate for the test set.
Generally, our learning algorithm has multiple hyperparameters to set. These hyperparameters cannot be set by the learning algorithm on the training data because we would just choose a degenerate solution and overfit. On the other hand, we can't use the test set either because then we would spoil the test set. The solution is to invent something that looks like a test set. There's no other data lying around, so we'll have to steal it from the training set. The resulting set is called the validation set. With this validation set, we can simply try out a bunch of different hyperparameters and choose the setting that yields the lowest error on the validation set. Which hyperparameter values should we try? Generally, you should start by getting the right order of magnitude (e.g., $\lambda = 0.0001, 0.001, 0.01, 0.1, 1, 10$) and then refine if necessary.
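The selection loop described above can be sketched as follows. Everything model-specific here is an assumption made for illustration: the learner is a one-dimensional ridge regression with a closed-form solution, the error is mean squared error, and the helper names (`fit_ridge_1d`, `split_train_validation`, `choose_lambda`) are hypothetical.

```python
import random

def fit_ridge_1d(train, lam):
    """Minimize sum (w*x - y)^2 + lam * w^2 in closed form for scalar x."""
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, y in train)
    return sxy / (sxx + lam)

def mean_squared_error(w, data):
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def split_train_validation(data, val_fraction=0.2, seed=0):
    """Randomly hold out a fraction of the training data as the validation set."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

def choose_lambda(data, candidates):
    """Train once per candidate and keep the one with the lowest validation error."""
    train, val = split_train_validation(data)
    scored = [(mean_squared_error(fit_ridge_1d(train, lam), val), lam)
              for lam in candidates]
    return min(scored)[1]

# Hypothetical noiseless data y = 2x, so less regularization is always better here.
data = [(float(i), 2.0 * i) for i in range(1, 21)]
candidates = [0.0001, 0.001, 0.01, 0.1, 1, 10]
best_lam = choose_lambda(data, candidates)
```

The test set plays no part in this loop; it would only be touched once, after the hyperparameter has been fixed.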
Cross-validation
Problem: only one validation set (usually small) provides an unreliable estimate of test error.
Solution: generate multiple validation sets.
Partition training data $D_{\text{train}}$ into $K$ folds: $D_{\text{train}}^{(1)}, D_{\text{train}}^{(2)}, D_{\text{train}}^{(3)}, D_{\text{train}}^{(4)}, D_{\text{train}}^{(5)}$
Use each fold $k = 1, \dots, K$ as the validation set.
Average the $K$ validation error rates.
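The fold-and-average procedure might look like the sketch below. The assumptions are illustrative: folds are contiguous slices (in practice the data would usually be shuffled first), and the learner is a simple majority-label predictor chosen only to keep the example small; `fit_majority` and `error_rate` are hypothetical helpers.

```python
def k_fold_indices(n, K):
    """Partition indices 0..n-1 into K nearly equal contiguous folds."""
    folds, start = [], 0
    for k in range(K):
        size = n // K + (1 if k < n % K else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validation_error(data, K, fit, error):
    """Use each fold once as the validation set and average the K error rates."""
    errors = []
    for held_out in k_fold_indices(len(data), K):
        held = set(held_out)
        train = [ex for i, ex in enumerate(data) if i not in held]
        val = [data[i] for i in held_out]
        errors.append(error(fit(train), val))
    return sum(errors) / K

def fit_majority(train):
    """Hypothetical learner: always predict the most common training label."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def error_rate(predict, examples):
    return sum(predict(x) != y for x, y in examples) / len(examples)

# Hypothetical data: ten inputs, two of which have the minority label.
labels = [1, 1, 1, 1, -1, 1, 1, 1, 1, -1]
data = [(i, y) for i, y in enumerate(labels)]
cv_error = cross_validation_error(data, 5, fit_majority, error_rate)
```

Each example is used for validation exactly once, so the averaged estimate uses all of the training data without ever evaluating a predictor on points it was trained on.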
Summary
Key to good machine learning: balance bias and variance
Roadmap
Review of features and non-linearity
How good is a predictor?
Unsupervised learning
Conclusion
Supervision?
Supervised learning:
Prediction: $D_{\text{train}}$ contains input-output pairs $(x, y)$
Fully-labeled data is very expensive to obtain (we can get 10,000 labeled examples)
Unsupervised learning:
Clustering: $D_{\text{train}}$ only contains inputs $x$
Unlabeled data is much cheaper to obtain (we can get 100 million unlabeled examples)
[Brown et al, 1992]
[Le et al, 2012]
Key idea: unsupervised learning
Data has lots of rich latent structures; want methods to discover this structure automatically.
We have so far covered the basics of supervised learning. If you get a labeled training set of $(x, y)$ pairs, then you can train a predictor. However, where do these examples $(x, y)$ come from? If you're doing image classification, someone has to sit down and label each image, and generally this tends to be expensive enough that we can't get that many examples. On the other hand, there are tons of unlabeled examples sitting around (e.g., Flickr for photos; Wikipedia and news articles for text documents). The main question is whether we can harness all that unlabeled data to help us make better predictions. This is the goal of unsupervised learning. Empirically, unsupervised learning has produced some pretty impressive results. HMMs can be used to take a ton of raw text and cluster related words together. An unsupervised variant of neural networks called autoencoders can be used to take a ton of raw images and output clusters of images. No one told the learning algorithms explicitly what the clusters should look like; they just figured it out.
Clustering
Definition: clustering
Input: training set of input points $D_{\text{train}} = \{x_1, \dots, x_n\}$
Output: assignment of each point to a cluster $(z_1, \dots, z_n)$ where $z_i \in \{1, \dots, K\}$
Intuition: Want similar points to be put in same cluster, dissimilar points to be put in different clusters
[whiteboard]
K-means objective
Setup:
Each cluster $k = 1, \dots, K$ is represented by a centroid $\mu_k \in \mathbb{R}^d$
Intuition: want each point $\phi(x_i)$ close to its assigned centroid $\mu_{z_i}$
Objective function:
$\text{Loss}_{\text{kmeans}}(z, \mu) = \sum_{i=1}^n \|\phi(x_i) - \mu_{z_i}\|^2$
The task of clustering is to take a set of points as input and return a partitioning of the points into $K$ clusters. We will represent the partitioning using an assignment vector $z = (z_1, \dots, z_n)$; for each $i$, $z_i \in \{1, \dots, K\}$ specifies which of the $K$ clusters point $i$ is assigned to.
K-means is a particular method for performing clustering which is based on associating each cluster with a centroid $\mu_k$ for $k = 1, \dots, K$. The intuition is to assign the points to clusters and place the centroid for each cluster so that each point $\phi(x_i)$ is close to its assigned centroid $\mu_{z_i}$.
Algorithm: Step 1 of K-means
For each point $i = 1, \dots, n$:
  Assign $i$ to cluster with closest centroid: $z_i \leftarrow \arg\min_{k = 1, \dots, K} \|\phi(x_i) - \mu_k\|^2$.
Trying to optimize the K-means objective all at once is a daunting task, so let's try to break it down. First, assume we know where the centroids $\mu$ are. Then we can optimize the K-means objective with respect to $z$ alone quite easily. It is easy to show that the best setting for $z_i$ is the cluster $k$ that minimizes the distance to the centroid $\mu_k$ (which is fixed).
Algorithm: Step 2 of K-means
For each cluster $k = 1, \dots, K$:
  Set $\mu_k$ to average of points assigned to cluster $k$: $\mu_k \leftarrow \frac{1}{|\{i : z_i = k\}|} \sum_{i : z_i = k} \phi(x_i)$
K-means algorithm
$\min_z \min_\mu \text{Loss}_{\text{kmeans}}(z, \mu)$
Now, turning things around, let's suppose we knew what the assignments $z$ were. We can again look at the K-means objective function and try to optimize it with respect to the centroids $\mu$. The best $\mu_k$ is obtained by placing the centroid at the average of all the points assigned to cluster $k$.
Key idea: alternating minimization
Tackle hard problem by solving two easy problems.
K-means algorithm
Objective: $\min_z \min_\mu \text{Loss}_{\text{kmeans}}(z, \mu)$
Algorithm: K-means
Initialize $\mu_1, \dots, \mu_K$ randomly.
For $t = 1, \dots, T$:
  Step 1: set assignments $z$ given $\mu$
  Step 2: set centroids $\mu$ given $z$
[demo]
Now we have the two ingredients to state the full K-means algorithm. We start by initializing all the centroids randomly. Then, we iteratively alternate back and forth between steps 1 and 2, optimizing $z$ given $\mu$ and vice versa.
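Putting the two steps together, a minimal sketch of the alternating loop could look like this. Several choices are assumptions made for the example rather than part of the lecture: points are plain tuples standing in for $\phi(x_i)$, centroids are initialized on $K$ randomly sampled training points, an empty cluster keeps its old centroid, and clusters are indexed from 0.

```python
import random

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans_loss(points, z, mu):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(sq_dist(p, mu[zi]) for p, zi in zip(points, z))

def assign_step(points, mu):
    """Step 1: assign each point to the cluster with the closest centroid."""
    return [min(range(len(mu)), key=lambda k: sq_dist(p, mu[k])) for p in points]

def centroid_step(points, z, mu):
    """Step 2: set each centroid to the average of the points assigned to it."""
    new_mu = []
    for k in range(len(mu)):
        members = [p for p, zi in zip(points, z) if zi == k]
        if members:
            new_mu.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_mu.append(mu[k])  # assumption: an empty cluster keeps its centroid
    return new_mu

def kmeans(points, K, T=20, seed=0):
    rng = random.Random(seed)
    mu = rng.sample(points, K)  # assumption: initialize on K training points
    for _ in range(T):
        z = assign_step(points, mu)
        mu = centroid_step(points, z, mu)
    return z, mu

# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
z, mu = kmeans(points, K=2)
```

Each step can only lower or preserve `kmeans_loss`, since step 1 picks the best assignment for fixed centroids and step 2 picks the best centroids for fixed assignments.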
Local minima
K-means is guaranteed to converge to a local minimum, but is not guaranteed to find the global minimum.
K-means never increases the loss function from one iteration to the next and will converge to a local minimum, but it is not guaranteed to find the global minimum, so one must exercise caution when applying K-means. One solution is to simply run K-means several times from multiple random initializations and then choose the solution that has the lowest loss. Or we could try to be smarter in how we initialize K-means. K-means++ is an initialization scheme which places the centroids on the training points in a way that they tend to be far apart from each other.
[demo: getting stuck in local optima]
Solutions:
Run multiple times from different random initializations
Initialize with a heuristic (K-means++)
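For the second solution, here is a sketch of K-means++ seeding in its standard form: the first centroid is a uniformly random training point, and each later centroid is a training point sampled with probability proportional to its squared distance from the nearest centroid chosen so far. The function name and the toy data are hypothetical, and the code assumes the data contains at least $K$ distinct points.

```python
import random

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans_pp_init(points, K, rng):
    """Pick K centroids on training points, favoring points far from those already picked."""
    centroids = [rng.choice(points)]
    while len(centroids) < K:
        # Squared distance from each point to its nearest chosen centroid.
        d2 = [min(sq_dist(p, c) for c in centroids) for p in points]
        # Sample the next centroid with probability proportional to that distance.
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids

# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
init = kmeans_pp_init(points, 2, random.Random(0))
```

Already-chosen points have weight zero, so the seeds end up on distinct points, and on data like this the second seed is far more likely to land in the other group than in the first one.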
Cluster-based features
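[Figure: basic features $\phi(x)$, cluster-based features $\phi'(x)$, and distance $d_k$ to centroid $k$]
Assumption: points inside same cluster tend to have same output label
$\phi'_k(x) = \max\{\bar{d} - d_k, 0\}$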
Difficult optimization:
latent variables $z$
parameters $\mu$
Roadmap
Review of features and non-linearity
So far: $y \in \{-1, +1\}$ (binary classification) or $y \in \mathbb{R}$ (regression): reflex-based models
Structured prediction: $y$ is a sequence, tree, etc.: more complex models
Same strategy: write down a loss function, take the derivative, and do gradient descent (more on this later)!
Unsupervised learning
Conclusion
Conclusion
Think in terms of hypothesis class:
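[Figure: all predictors; feature extraction fixes the hypothesis class $\mathcal{F}$ (bias relative to the target $f^*$), and parameter tuning picks $f_w \in \mathcal{F}$ (variance)]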
This concludes our tour of machine learning. Machine learning is a huge topic, and we have only skimmed the surface. One thing I'd like to encourage you to do is to think in terms of hypothesis classes: what kind of functions can your learning algorithm potentially learn? This gives a good way to decouple the modeling from the algorithms, and gives us a way to understand generalization in terms of bias and variance tradeoffs. From a modeling perspective, we can focus our attention on figuring out what hypothesis class to use (that is, deciding which features to use). From the learning and optimization perspective, we can focus on the best way to choose a predictor and return its parameters. So far, we have only focused on reflex-based models where the predictor only outputs a yes/no or a number. Next time, we will start looking at models which can perform higher-level reasoning, but machine learning will still be our companion for the remainder of the class.
Question
What was the most confusing part of today's lecture?