
Machine learning

Lecture 1 Introduction

In machine learning we look at problems we can't solve explicitly. Approximate solutions and limited reliability, predictability, and interpretability are acceptable in machine learning problems. A machine learning problem needs plenty of examples to learn from, both good and bad ones.

Reinforcement learning means taking actions in a world based on delayed feedback; online learning means predicting and learning at the same time.

In offline learning we separate learning, predicting, and acting: take a fixed dataset of examples, train a model on these examples, test the model by checking its predictions, and if it works, put it into production.

In machine learning you give the computer an abstract task like classification, regression, or clustering, and create an algorithm for it. The abstract tasks can be divided into two categories:

- Supervised: explicit examples of input and output. Learn to predict the output for an unseen input. The main ones are classification and regression.
- Unsupervised: only inputs are provided. Find any pattern that explains something about the data.

Supervised learning

With classification you classify data into different classes: assign a class to each example. The things you measure are called the features of an instance; features can be numeric or categorical. Each instance comes with a target value, the class you are trying to learn. A classifier takes an instance and guesses its label. The data with features is shown to a learning algorithm, which produces a classifier that assigns each example to a class. In the feature space you can show all instances as points, one axis per feature. A linear classifier separates the classes with a line; with three features the separating boundary becomes a plane (in general, a hyperplane). In the model space, every possible model is a point.

loss_data(model) = performance of the model on the data (the lower, the better)

A decision tree classifies by splitting on one feature at a time, isolating each category step by step. k-Nearest Neighbours (kNN) is a lazy classifier algorithm: it just remembers the training data.
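As a minimal sketch of the lazy-classifier idea (the NumPy arrays, Euclidean distance, and majority vote are my own illustrative choices):

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest neighbours."""
    dists = np.linalg.norm(X_train - x_new, axis=1)  # distance to every remembered instance
    nearest = np.argsort(dists)[:k]                  # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                 # most common label wins

# Toy usage: two classes in a 2D feature space
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.9, 0.2])))  # -> 1
```

Note that the "learning" step is nothing more than storing X and y; all the work happens at prediction time.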

Classifiers come in variations. Features can be either numeric or categorical. Binary classification uses two classes; multiclass classification handles more than two; in multilabel classification none, some, or all classes may be true. With class probability scores, the classifier reports a probability or score for each class.

Regression works like classification but predicts a number instead of a category: it assigns a number to each example. You model the relation between the features and the numbers. A loss function for regression, such as the mean squared error (MSE), is needed to see how good the model is; the lower the number, the better the model. You can fit a line (linear regression), a regression tree, or kNN regression.
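A minimal sketch of the MSE loss (the toy data and the two candidate linear models are my own illustrative choices):

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return np.mean((y_true - y_pred) ** 2)

# Toy data: y is roughly 2x + 1 with a little noise
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

# Two candidate models y = wx + b; the lower the loss, the better the model
print(mse_loss(y, 2.0 * x + 1.0))  # good fit -> about 0.03
print(mse_loss(y, 0.5 * x + 3.0))  # poor fit -> about 2.7
```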

Unsupervised learning

Clustering uses a model to assign a cluster id to each instance. It looks like classification, but you don't have labels: the learning algorithm has to find the clusters itself, although you do have to provide the number of clusters you expect. With k-means you can, for example, cluster the data points into three groups: by repeatedly reassigning points to the nearest cluster center and recomputing the centers, you end up with clusters of related points, and further research is needed to find out what the clusters mean.
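A sketch of the k-means loop under simple assumptions (k = 3 and the synthetic data are my own choices; empty clusters are handled crudely by keeping the old center):

```python
import numpy as np

def kmeans(X, k=3, steps=100, seed=0):
    """Cluster the rows of X into k groups by iterated reassignment."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(steps):
        # Assign every point to its nearest center
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        # Move each center to the mean of the points assigned to it
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):  # stop once nothing moves
            break
        centers = new
    return labels, centers

# Toy usage: 60 points scattered around three locations
X = np.random.default_rng(1).normal(size=(60, 2)) + np.repeat([[0, 0], [5, 5], [0, 5]], 20, axis=0)
labels, centers = kmeans(X, k=3)
```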

With density estimation you want to learn how likely new data is: the higher the number, the more likely the data.

Generative modeling is building a model from which you can sample new examples. Semi-supervised learning can be useful when unlabeled data is very cheaply available but labeled data is very expensive to acquire: it involves learning from a small labeled data set combined with unlabeled data.

Supervised learning comprises classification and regression; unsupervised learning comprises clustering, density estimation, and generative modeling.

Machine learning algorithms and training bias can cause unintentional harm: sometimes a prediction can be offensive, hurtful, or harmful. Sensitive attributes should be used with extreme care.

Generalization

Overfitting means fitting the random noise in the training data. Never judge your model's performance on the training data; a model must do well on new data. The task is to fit the training data and discard the noise. The problem of induction is the problem of how we learn: deductive reasoning is discrete, unambiguous, provable, and based on known rules; inductive reasoning is fuzzy, ambiguous, experimental, and based on unknown rules.

Lecture 2 Learning models and search

The basic machine learning recipe is:

1. Abstract your problem to a standard task
2. Choose your instances and their features
3. Choose your model class
4. Search for a good model

x_i   instance i in the data
x^j   feature j (of some instance)
x_i^j feature j of instance i

In the feature space every instance is a point; in the model space every possible model is a point.

Black box optimization

With random search you start with a random point p in the model space, then loop: pick another point close to p and move there if its loss is lower. Optimization and machine learning are not entirely the same: optimization means finding the minimum, or the best possible approximation of it, while machine learning means finding the lowest loss that generalizes. So you minimize the loss on the test data while seeing only the training data. Variations of random search sample the next point with a fixed radius, a random uniform distribution, or a normal distribution.
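A minimal random-search sketch (the quadratic toy loss and the normal-distribution step are my own illustrative choices):

```python
import numpy as np

def random_search(loss, p, steps=1000, scale=0.1, seed=0):
    """Hill climber: propose a nearby point, keep it only if the loss improves."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        q = p + rng.normal(scale=scale, size=p.shape)  # random point close to p
        if loss(q) < loss(p):
            p = q
    return p

# Toy loss with its minimum at (1, -2)
loss = lambda w: (w[0] - 1) ** 2 + (w[1] + 2) ** 2
print(random_search(loss, np.zeros(2)))  # ends up near [1, -2]
```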

With evolutionary algorithms you start with a population of k models, then loop: rank the population by loss, remove the half with the worst loss, and breed a new population of k models from the survivors.
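A sketch under simplifying assumptions (breeding here is just copying a surviving parent and adding noise; real evolutionary algorithms often use crossover as well):

```python
import numpy as np

def evolve(loss, k=20, dim=2, steps=100, scale=0.1, seed=0):
    """Evolutionary search: keep the best half, refill with mutated copies."""
    rng = np.random.default_rng(seed)
    pop = rng.normal(size=(k, dim))  # initial population of k models
    for _ in range(steps):
        pop = pop[np.argsort([loss(p) for p in pop])]  # rank by loss, best first
        parents = pop[: k // 2]                        # remove the worst half
        children = parents + rng.normal(scale=scale, size=parents.shape)
        pop = np.vstack([parents, children])           # breed back up to k models
    return pop[0]

loss = lambda w: (w[0] - 1) ** 2 + (w[1] + 2) ** 2
print(evolve(loss))  # ends up near [1, -2]
```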

Gradient descent

You start by picking a random point p in the model space and then enter a loop. Instead of trying random points close to p, we can use calculus to find the direction in which the loss drops most quickly: this direction is the opposite of the gradient. The gradient is an n-dimensional version of the derivative. Gradient descent takes small steps in this direction in order to find the minimum of a function.

The tangent hyperplane of a function f approximates f locally. The gradient of f gives the slope of the tangent hyperplane; the vector expressing this slope is the direction of steepest ascent, and the opposite vector is the direction of steepest descent. The lower the learning rate, the slower but more precisely gradient descent finds the minimum.
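A minimal gradient-descent sketch (the quadratic loss, its hand-derived gradient, and the learning rate are my own illustrative choices):

```python
import numpy as np

def gradient_descent(grad, p, lr=0.1, steps=100):
    """Repeatedly step against the gradient, the direction of steepest descent."""
    for _ in range(steps):
        p = p - lr * grad(p)  # small step opposite to the gradient
    return p

# Toy loss (w0 - 1)^2 + (w1 + 2)^2 has gradient (2(w0 - 1), 2(w1 + 2))
grad = lambda w: np.array([2 * (w[0] - 1), 2 * (w[1] + 2)])
print(gradient_descent(grad, np.zeros(2)))  # converges to about [1, -2]
```

A smaller lr would need more steps but overshoots less, matching the learning-rate trade-off described above.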

Sometimes your loss function should not be the same as your evaluation function.

Black-box optimization methods are simple and work on discrete model spaces; examples are random search, simulated annealing, and evolutionary algorithms. Gradient descent is powerful but works only on continuous model spaces. For classification you then have to find a smooth loss function, such as the least-squares loss.

Lecture 3 Model evaluation

Experiments

In classification, the main metric of performance is the proportion of misclassified examples, called the error. The proportion of correctly classified examples is called the accuracy. A hyperparameter is a value you have to choose yourself. To compare different models you can use different values of a hyperparameter such as k: train classifier A, train classifier B, then compute the error of each. When you split the data into training and test data, the proportion of the split is not important; the absolute size of the test data is. The bigger the number of test instances, the more precise our estimate of the model's error.

The modern recipe of evaluation is:

1. Split the data into train and test data
2. Choose the model, hyperparameters, etc. using only the training set
3. State your hypothesis
4. Test the hypothesis once on the test data

A model that fits the training data perfectly may not be much use when it comes to data you haven't seen before. Don't reuse your test data, because then you are overfitting on it. With temporal data you can't split the data randomly.
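A minimal sketch of this recipe (assuming scikit-learn is available; the synthetic data and the kNN model are my own stand-ins):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for a real classification problem
X, y = make_classification(n_samples=1000, random_state=0)

# 1. Split the data into train and test once, up front
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 2. Choose model and hyperparameters using only the training set
model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# 3./4. State a hypothesis, then test it once on the held-out test data
print("test accuracy:", model.score(X_test, y_test))
```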

Statistical testing

Statistics and machine learning are very much alike, but we use relatively few statistical tools and don't often do significance tests. However, there are a few important statistical concepts to be aware of, even if we don't use the whole statistical toolbox to interrogate them rigorously.

The first is the difference between the true metric of a problem or task, and the value you measure.
This is a very basic principle in statistics. For instance, we can’t observe the mean height of all Dutch
people currently living, but we can take a random sample of 100 Dutch people, and use their average
height as an estimate of the true average height.

We usually imagine that the data is sampled from some distribution p(x). In this view, we're not really
interested in training a classifier that does well on the dataset we have, even on the test data. What
we really want is a classifier that does well on any data sampled from p(x).

Imagine sampling one instance from the data distribution and classifying it with some classifier C. If
you do this, there is a certain probability that C will be correct. This is called the true accuracy. It is
not something we can ever know or compute (except in very specific cases). The only thing we can do
is take a large number of samples from p(x), classify them with C, and approximate the true accuracy
with the relative frequency of correct classifications in our sample. This is what we are doing when
we compute the accuracy of a classifier on the test set or the validation set: we are estimating the
true accuracy. To explicitly distinguish this estimate from the true accuracy, we sometimes call this
the sample accuracy.
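A small simulation of this idea (the 80% true accuracy is made up so that the estimate can be compared against a known truth; with a real classifier the true accuracy stays hidden):

```python
import numpy as np

rng = np.random.default_rng(0)
true_accuracy = 0.8  # known here only because we built the simulation ourselves

# Each sampled instance is classified correctly with probability 0.8
for n in [100, 1_000, 100_000]:
    correct = rng.random(n) < true_accuracy
    print(f"{n} samples -> sample accuracy: {correct.mean():.3f}")
```

The larger the sample, the closer the sample accuracy tends to be to the true accuracy, which is exactly why the absolute size of the test set matters.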
