Kernel Functions
Tejumade Afonja
Jan 2, 2017 · 6 min read

Lately, I have been doing some reading up on machine learning, and kernels happen to be an interesting part of classification problems. Before I go further: this topic was inspired by a Medium post written by Alan, Do-it-yourself NLP for bot developers. Thanks, A.

What is a kernel function?


To talk about kernels, we need to understand terms like SVM (support vector machines), classification, supervised learning, machine learning... blah blah. So many terms, right? But don't let that discourage you (I knew nothing about any of those before the DIY exercise). Let's walk through it together:

So what exactly is "machine learning (ML)"? Well, it turns out that ML is actually a lot of things, but the overarching theme is best summed up by this oft-quoted statement made by Arthur Samuel way back in 1959:

“Machine Learning is the field of study that gives computers the ability to learn without
being explicitly programmed.”

"A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E." — Tom Mitchell, Carnegie Mellon University
So if you want your program to predict, for example, traffic patterns at a busy intersection (task T), you can run it through a machine learning algorithm with data about past traffic patterns (experience E). If it has successfully "learned", it will then do better at predicting future traffic patterns (performance measure P).

Among the different types of ML tasks is what we call supervised learning (SL). This is a situation where you put in data you already have answers for. For example, to predict whether a dog is a particular breed, we load in millions of records of dog properties like type, height, skin color, body hair length, etc. In ML lingo, these properties are referred to as 'features'. A single entry in this list of features is a data instance, while the collection of everything is the training data, which forms the basis of your prediction: if you know the skin color, body hair length, height, and so on of a particular dog, then you can predict the breed it will probably belong to.
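To make this concrete, here is a minimal sketch of supervised learning with scikit-learn. The feature values and breed labels are invented for illustration, and the SVC classifier it uses is the support vector machine we are about to meet.

```python
from sklearn.svm import SVC

# Each row is one data instance: [height_cm, hair_length_cm, weight_kg].
# All values here are made up for illustration.
X_train = [
    [25, 5, 4],   # chihuahua
    [60, 6, 30],  # labrador
    [23, 7, 3],   # chihuahua
    [58, 5, 28],  # labrador
]
y_train = ["chihuahua", "labrador", "chihuahua", "labrador"]

model = SVC(kernel="linear")
model.fit(X_train, y_train)  # learn from the training data

# Predict the breed of a dog the model has not seen before.
print(model.predict([[24, 6, 4]]))  # -> ['chihuahua'], most likely
```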

Before we can jump into kernels, we need to understand what a support vector machine is. Support Vector Machines (SVMs) are supervised learning models with associated learning algorithms that analyze data for classification (classification means knowing what belongs to what, e.g. 'apple' belongs to the class 'fruit' while 'dog' belongs to the class 'animals'; see Fig. 1).

Fig. 1

In support vector machines, classification looks somewhat like Fig. 2 below :), which separates the blue balls from the red.

SVM is a classifier formally defined by a separating hyperplane. A hyperplane is a subspace of one dimension less than its ambient space. The dimension of a mathematical space (or object) is informally defined as the minimum number of coordinates (x, y, z axes) needed to specify any point (like each blue and red point) within it, while an ambient space is the space surrounding a mathematical object. A mathematical object is an abstract object arising in mathematics; an abstract object is an object which does not exist at any particular time or place, but rather exists as a type of thing, i.e., an idea or abstraction (Wikipedia).

Therefore the hyperplane of the two-dimensional space below (Fig. 2) is a one-dimensional line dividing the red and blue dots.

Fig. 2

Going back to the example of trying to predict the breed of a particular dog, the pipeline goes like this:

Data (all breeds of dog) → Features (skin color, hair, etc.) → Learning algorithm
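As a quick, hedged sketch (the points below are invented), this is what fitting a linear SVM looks like in scikit-learn, including reading off the separating hyperplane, which in two dimensions is just a line like the one in Fig. 2:

```python
import numpy as np
from sklearn.svm import SVC

# A few invented 2-D points, like the red and blue balls in Fig. 2.
X = np.array([[1, 2], [2, 3], [2, 1],   # red
              [6, 5], [7, 7], [8, 6]])  # blue
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)

# The separating hyperplane satisfies w.x + b = 0; in 2-D it is a line.
w, b = clf.coef_[0], clf.intercept_[0]
print(f"separating line: {w[0]:.2f}*x1 + {w[1]:.2f}*x2 + {b:.2f} = 0")
```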

So why Kernels?
Consider Fig. 3 below.

Fig. 3

Can you try to solve the above problem linearly like we did with Fig. 2?

NO!

The red and blue balls cannot be separated by a straight line, as they are randomly distributed, and this, in reality, is how the data in most real-life problems are distributed: randomly.

In machine learning, a "kernel" is usually used to refer to the kernel trick, a method of using a linear classifier to solve a non-linear problem. It entails transforming linearly inseparable data (like Fig. 3) into linearly separable data (like Fig. 2). The kernel function is what is applied to each data instance to map the original non-linear observations into a higher-dimensional space in which they become separable.
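As a hedged illustration, here is the trick in action on a concentric-circles dataset (a stand-in for Fig. 3): a linear kernel struggles, while an RBF kernel, which implicitly maps the points into a higher-dimensional space, separates them almost perfectly.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings of points: linearly inseparable, like Fig. 3.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)

print("linear kernel accuracy:", linear.score(X, y))  # around 0.5, i.e. guessing
print("rbf kernel accuracy:   ", rbf.score(X, y))     # close to 1.0
```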

Using the dog breed prediction example again, kernels offer a better alternative. Instead of defining a slew of features, you define a single kernel function to compute similarity between breeds of dog. You provide this kernel, together with the data and labels, to the learning algorithm, and out comes a classifier.
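Here is a sketch of what "providing a kernel" can look like in scikit-learn, which accepts a precomputed matrix of pairwise similarities; the similarity function and the data below are made up, standing in for a real measure of how alike two dogs are.

```python
import numpy as np
from sklearn.svm import SVC

def similarity(a, b):
    # Hypothetical similarity between two data instances.
    return np.exp(-np.sum((a - b) ** 2))

rng = np.random.default_rng(0)
X = rng.random((10, 3))        # 10 instances, 3 features each (made up)
y = np.array([0, 1] * 5)       # made-up labels

# Kernel + data + labels go in...
gram = np.array([[similarity(a, b) for b in X] for a in X])

# ...and out comes a classifier.
clf = SVC(kernel="precomputed").fit(gram, y)
```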

How does it work?


To better understand how kernels work, let us use Lili Jiang's mathematical illustration:

Mathematical definition: K(x, y) = <f(x), f(y)>. Here K is the kernel function, x and y are n-dimensional inputs, f is a map from n-dimensional to m-dimensional space, and <x, y> denotes the dot product. Usually, m is much larger than n.

Intuition: normally, calculating <f(x), f(y)> requires us to calculate f(x) and f(y) first, and then take their dot product. These two computation steps can be quite expensive, as they involve manipulations in m-dimensional space, where m can be a large number. But after all the trouble of going to the high-dimensional space, the result of the dot product is really a scalar: we come back to one-dimensional space again! Now, the question we have is: do we really need to go through all the trouble to get this one number? Do we really have to go to the m-dimensional space? The answer is no, if you find a clever kernel.

Simple example: x = (x1, x2, x3); y = (y1, y2, y3). Then for the function f(x) = (x1x1, x1x2, x1x3, x2x1, x2x2, x2x3, x3x1, x3x2, x3x3), the kernel is K(x, y) = (<x, y>)².

Let’s plug in some numbers to make this more intuitive: suppose x = (1, 2, 3); y = (4, 5,
6). Then:
f(x) = (1, 2, 3, 2, 4, 6, 3, 6, 9)
f(y) = (16, 20, 24, 20, 25, 30, 24, 30, 36)
<f(x), f(y)> = 16 + 40 + 72 + 40 + 100 + 180 + 72 + 180 + 324 = 1024

A lot of algebra, mainly because f is a mapping from 3-dimensional to 9-dimensional space.

Now let us use the kernel instead:


K(x, y) = (4 + 10 + 18)² = 32² = 1024
Same result, but this calculation is so much easier.
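You can verify the equivalence in a few lines of Python (a sketch using the same numbers as above):

```python
import numpy as np

def f(v):
    # Map a 3-vector to the 9-vector of all pairwise products v_i * v_j.
    return np.array([vi * vj for vi in v for vj in v])

x, y = np.array([1, 2, 3]), np.array([4, 5, 6])

print(f(x) @ f(y))        # 1024: the long way, through 9-D space
print(np.dot(x, y) ** 2)  # 1024: the kernel K(x, y) = (<x, y>)^2
```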

That's about it for kernels. Good job! You just took the first step to becoming a Machine Learning Expert :)

Extra note: To learn more, you can check out how I predicted the stock market at Numerai ML and what kernels are in machine learning and SVM.

Pelumi Aboluwarin did a fantastic job in reading the draft and suggesting this topic. Thank
you!

If you enjoyed reading this as much as I enjoyed writing it, you know what to do ;) show it some love, and if you have suggestions on topics you would like me to write about, drop them in the comment section below. Thanks for reading :)

*All images are from the web.*

Extra Readings

1. https://en.wikipedia.org/wiki/Statistical_classification

2. https://en.wikipedia.org/wiki/Supervised_learning
