
Supervised Learning
Recap
• What is the defining characteristic of a supervised learning algorithm?
• What are the two major types of supervised learning?
Assignment
• Give an example of supervised learning from your daily life.
• What is the most important information to get started?
• What types of features can you extract from the data?
• Construct a feature vector with at least 5 dimensions (a sketch follows below).
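
Purely as an illustration (the task and the feature names here are invented, not part of the assignment), a 5-dimensional feature vector for a hypothetical spam-detection example could look like this:

    import numpy as np

    # Hypothetical example: predicting whether an email is spam.
    # Each feature is one measurable property of the email.
    feature_vector = np.array([
        132,    # number of words in the email
        3,      # number of links
        1,      # sender unknown (1) or known (0)
        0.12,   # fraction of characters in uppercase
        2,      # number of spelling mistakes
    ])
    label = 1   # supervised learning also needs a label: spam (1) or not (0)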
Generalization, Overfitting and Underfitting

1. “If the customer is older than 45 and has less than 3 children or is not divorced, then they want to buy a
boat.”
2. “People older than 50 want to buy a boat.”
We can make up many rules that work well on this data.
Generalization
• In supervised learning, we want to build a model on the training data and then be able to
make accurate predictions on new, unseen data that has the same characteristics as the
training set that we used.
• If a model is able to make accurate predictions on unseen data, we say it is able to
generalize from the training set to the test set.
• If the training and test sets have enough in common, we expect the model to also be
accurate on the test set.
• However, if we build very complex models, we can always be as accurate as we like on the
training set, but that accuracy will not necessarily carry over to the test set.
Generalization
• “If the customer is older than 45 and has less than 3 children or is not
divorced, then they want to buy a boat.”
• This rule is 100% accurate on the available data, as are many other rules that
could be made from this data.
• Remember, we want to know if new customers are likely to buy a boat.
• Whether an algorithm will perform well on new data is evaluated using the
test set.
• Simple models generalize better to new data.
• “People older than 50 want to buy a boat.”
• We would trust it more than the model involving children and marital status
in addition to age.
Overfitting and Underfitting
• Building a model that is too complex for the amount of information we have is called overfitting.
• Overfitting occurs when you fit a model too closely to the particularities of the training set and
obtain a model that works well on the training set but is not able to generalize to new data.
• On the other hand, if your model is too simple—say, “Everybody who owns a house buys a
boat”—then you might not be able to capture all the aspects of and variability in the data, and your
model will do badly even on the training set.
• Choosing too simple a model is called underfitting.

• There is a sweet spot in between that will yield the best generalization performance.
• This is the model we want to find.
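
A minimal sketch of searching for this sweet spot empirically (the dataset, and the use of k-NN, introduced in the next section, as the model, are assumptions for illustration): sweep model complexity and compare training and test accuracy.

    # Small k = complex model (overfits); very large k = simple model (underfits).
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for k in [1, 3, 5, 10, 50]:
        model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        print(k, model.score(X_train, y_train), model.score(X_test, y_test))

At k=1 the training accuracy is perfect but test accuracy suffers (overfitting); at very large k both drop (underfitting); the sweet spot lies in between.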
k-Nearest Neighbours
• The k-NN algorithm is arguably the simplest machine learning algorithm.
• Building the model consists only of storing the training dataset.
• To make a prediction for a new data point, the algorithm finds the closest data points in the
training dataset—its “nearest neighbours.”

A man is known by the company he keeps.


One Nearest Neighbour Example
[Figure: a two-class dataset; a new point receives the class of its single nearest neighbour.]
1-Nearest Neighbour: Training Step
[Figure: the training step simply stores the labelled training examples.]
1-Nearest Neighbour: Prediction Step
[Figure: the prediction step finds the closest stored training point and returns its label.]
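
A minimal 1-NN sketch from scratch (an illustrative implementation, not code from the slides): training just stores the data; prediction returns the label of the closest stored point.

    import numpy as np

    class OneNN:
        def fit(self, X, y):
            # "Training" is just memorizing the dataset.
            self.X, self.y = np.asarray(X), np.asarray(y)
            return self

        def predict(self, X_new):
            labels = []
            for x in np.asarray(X_new):
                # Euclidean distance from x to every stored training point.
                dists = np.sqrt(((self.X - x) ** 2).sum(axis=1))
                labels.append(self.y[np.argmin(dists)])
            return np.array(labels)

    # Tiny usage example with made-up points.
    model = OneNN().fit([[0, 0], [1, 1], [5, 5]], [0, 0, 1])
    print(model.predict([[0.2, 0.1], [4, 4]]))   # -> [0 1]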
Which Point is Closest?
Commonly Used Distance Measure
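
The formula on this slide did not survive extraction; a standard reconstruction, assuming it showed the Euclidean distance between two points x and x' with features indexed j = 1, …, n, is:

    d(x, x') = \sqrt{\sum_{j=1}^{n} (x_j - x'_j)^2}

The closest point is the one that minimizes this distance.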
More Than One Neighbour
• We can also consider an arbitrary number, k, of neighbors.
• When considering more than one neighbor, we use voting to assign a label.
• This means that for each test point, we count how many neighbors belong to
class 0 and how many neighbors belong to class 1.
• We then assign the majority class among the k-nearest neighbors.
K-NN for Classification
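
A minimal sketch of k-NN classification with scikit-learn's KNeighborsClassifier (the built-in iris dataset is an assumption for illustration; any labelled dataset would do):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # k=3: each prediction is the majority class among the 3 nearest neighbours.
    clf = KNeighborsClassifier(n_neighbors=3)
    clf.fit(X_train, y_train)
    print("Test accuracy:", clf.score(X_test, y_test))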
k-Neighbors Regression
Analyzing the k-Neighbors Regressor
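
For regression, the prediction is the mean of the k nearest neighbours' target values. A minimal sketch with scikit-learn's KNeighborsRegressor (the synthetic noisy sine data is an assumption for illustration):

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor

    # Illustrative 1-D regression problem: a noisy sine wave.
    rng = np.random.RandomState(0)
    X = np.sort(rng.uniform(-3, 3, size=40)).reshape(-1, 1)
    y = np.sin(X).ravel() + rng.normal(scale=0.2, size=40)

    # k=1 memorizes the training data; k=9 averages over more
    # neighbours and gives smoother predictions.
    for k in [1, 9]:
        reg = KNeighborsRegressor(n_neighbors=k).fit(X, y)
        print(k, "training R^2:", round(reg.score(X, y), 2))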
KNN: Strengths, Weaknesses
• There are two important parameters to the KNeighbors classifier:
• the number of neighbors, and
• how you measure distance between data points.
• In practice, using a small number of neighbors like three or five often works well,
but you should certainly adjust this parameter.
• Choosing the right distance measure is somewhat beyond the scope of this book.
• By default, Euclidean distance is used, which works well in many settings.
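
A minimal sketch of setting both parameters explicitly on scikit-learn's KNeighborsClassifier (the values shown are illustrative, not recommendations):

    from sklearn.neighbors import KNeighborsClassifier

    # n_neighbors controls model complexity; metric controls the distance.
    # metric="minkowski" with p=2 is the default Euclidean distance.
    clf = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)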
KNN: Strengths, Weaknesses
• k-NN is very easy to understand and often gives reasonable performance without a lot of adjustments.
• A good baseline method to try before considering more advanced techniques.
• Building the nearest neighbors model is usually very fast, but when your training set is very large
(either in number of features or in number of samples) prediction can be slow.
• When using the k-NN algorithm, it's important to preprocess your data (for a distance-based method this typically means scaling the features; see the sketch after this list).
• Does not perform well on datasets with many features (hundreds or more), and it does particularly
badly with datasets where most features are 0 most of the time (so-called sparse datasets).
• It is not often used in practice, due to prediction being slow and its inability to handle many
features.
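
A minimal sketch of the preprocessing point above (the choice of StandardScaler is an assumption; any feature scaling would illustrate it). Because k-NN relies on distances, features on large scales otherwise dominate:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)   # features on very different scales
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Without scaling, large-scale features dominate the distance computation.
    raw = KNeighborsClassifier().fit(X_train, y_train)
    # With scaling, every feature contributes comparably.
    scaled = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_train, y_train)

    print("raw   :", raw.score(X_test, y_test))
    print("scaled:", scaled.score(X_test, y_test))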
