
Dimensionality reduction

CIC1205 - Machine Learning

Eduardo Bezerra

CEFET/RJ
Outline

Introduction

Linear Algebra Refresher

PCA

2 / 31
Introduction
Dimensionality Reduction

• A large number of attributes (dimensions, variables, predictors, features) can be available for each example in a dataset.
• However, not all of them may be required for building an ML model.
• Indeed, using all attributes may in many cases worsen predictive performance (i.e., quality of results).

3 / 31
Dimensionality Reduction

Therefore, in some cases we will need some form of dimensionality reduction, in which one wishes to retain as much of the information as possible, but in fewer attributes.

4 / 31
Dimensionality Reduction - PCA

The most trivial form of dimensionality reduction is simply to use one's judgement/experience and select a subset of attributes.

A less subjective, automated approach is principal component analysis (PCA). It picks out the directions that contain the greatest amount of information.

5 / 31
Linear Algebra Refresher
Data matrix

We are used to a structure like the one below in a supervised learning setting. We have been calling it a training dataset {(x^(i), y^(i))}_{i=1}^m.

X        y
x^(1)    y^(1)
x^(2)    y^(2)
x^(3)    y^(3)
x^(4)    y^(4)
x^(5)    y^(5)
...      ...
x^(i)    y^(i)
...      ...
6 / 31
Data matrix

• Now, we are going to focus on the data matrix part X = {x^(i)}_{i=1}^m, with x^(i) ∈ ℜ^n.
• In Statistics, there is the concept of a random variable.
• In the context of a data matrix X, we can think of each existing feature (i.e., each dimension of x^(i)) as a random variable.

X
x^(1)
x^(2)
x^(3)
x^(4)
x^(5)
...
x^(i)
...

7 / 31
Data matrix

For example, think of a data matrix with two dimensions (i.e., n = 2) and four examples (i.e., m = 4):

• x1 = number of hours of study
• x2 = final grade obtained

x1    x2
10    7.0
1     1.0
15    8.5
20    10.0

If we were to sample examples from X, we would see random values for x1 and x2.

But notice that, in general, variables in a data matrix have dependencies between them. A way to capture this dependency is through the concept of covariance...

8 / 31
Covariance

Given two random variables x_j and x_k, the covariance between them is defined as follows:

cov(x_j, x_k) = \frac{1}{m-1} \sum_{i=1}^{m} (x_j^{(i)} - \bar{x}_j)(x_k^{(i)} - \bar{x}_k)

In the above expression,

• x̄_j and x̄_k are the means of x_j and x_k, respectively;
• m is the number of examples in our data matrix.

But what does this all mean?! Answer: the value cov(x_j, x_k) tells us whether those two variables are related, and, if so, what is the direction of such relationship.

9 / 31
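A minimal NumPy sketch of this formula (the two toy arrays are made up for illustration): it applies the definition directly and checks the result against np.cov, which uses the same 1/(m - 1) normalization by default.

```python
import numpy as np

# Two toy variables with m = 4 observations each (illustrative values).
x_j = np.array([2.0, 4.0, 6.0, 8.0])
x_k = np.array([1.0, 3.0, 2.0, 5.0])
m = len(x_j)

# Sample covariance, exactly as in the formula: centered products summed, divided by m - 1.
cov_jk = np.sum((x_j - x_j.mean()) * (x_k - x_k.mean())) / (m - 1)

# np.cov returns the full 2x2 covariance matrix; the off-diagonal entry is cov(x_j, x_k).
assert np.isclose(cov_jk, np.cov(x_j, x_k)[0, 1])
print(cov_jk)
```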
Covariance matrix

For example, consider our two previous variables:

• x1 = number of hours of study
• x2 = final grade obtained

We can organize the corresponding covariance values in a matrix like this:

Σ = [ var(x1)      cov(x1, x2) ]
    [ cov(x2, x1)  var(x2)     ]

Suppose that we compute the entries of the covariance matrix for our two variables and come up with the following:

Σ = [ 65.7  5.0 ]
    [ 5.0   1.5 ]

10 / 31
Covariance matrix

Let us put our data matrix X and the corresponding covariance matrix Σ side by side:

Σ = [ 65.7  5.0 ]      x1    x2
    [ 5.0   1.5 ]      10    7.0
                        1    1.0
                       15    8.5
                       20   10.0

First, notice that Σ is symmetric. That is because cov(x1, x2) = cov(x2, x1).

The positive value 5.0 indicates that x1 and x2 tend to increase/decrease together.

Now, the value 65.7 is the covariance of x1 with itself, i.e., its variance. In the same way, 1.5 is the variance of x2.

In the data matrix, you should notice that the values of x1 are more spread out than the values of x2. That is why var(x1) > var(x2).

11 / 31
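As a quick check, assuming NumPy is available, the sample covariance matrix of these four rows can be computed with np.cov: var(x1) works out to 197/3 ≈ 65.7, matching the first diagonal entry, and the matrix is symmetric with var(x1) far larger than var(x2), as discussed above.

```python
import numpy as np

# Data matrix from the example: rows are examples, columns are x1 (hours) and x2 (grade).
X = np.array([[10.0,  7.0],
              [ 1.0,  1.0],
              [15.0,  8.5],
              [20.0, 10.0]])

# rowvar=False tells NumPy that each column (not each row) is a variable.
Sigma = np.cov(X, rowvar=False)
print(Sigma)

assert np.allclose(Sigma, Sigma.T)         # symmetric: cov(x1, x2) == cov(x2, x1)
assert Sigma[0, 0] > Sigma[1, 1]           # x1 is more spread out than x2
assert np.isclose(Sigma[0, 0], 197.0 / 3)  # var(x1) ≈ 65.7
```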
Variance

The covariance between a random variable and itself is the variance of this variable.

• This gives us a notion of the dispersion of the values of this variable around its
mean.
• That is, variance is a measure of the deviation (from the mean) for points in one
dimension.
• Variance is always a non-negative value.

12 / 31
Covariance

The covariance between two random variables measures how much they vary together, and also indicates the direction of variation. Its value is

• positive if when one variable increases (decreases), the other tends to increase (decrease);
• negative if when one variable increases (decreases), the other tends to decrease (increase);
• zero when there is no (linear) relationship between those variables.

Hence covariance measures how correlated two variables are...

13 / 31
Covariance - visual interpretation

Figure 1: a) Positive correlation. Figure 2: b) Uncorrelated/no correlation. Figure 3: c) Negative correlation. (Each panel is a scatter plot of x2 against x1.)

14 / 31
Covariance matrix

Notice that we can build a covariance matrix for any number of variables. For example, if we have a data matrix with five variables, then:

Σ = [ var(x1)      cov(x1, x2)  cov(x1, x3)  cov(x1, x4)  cov(x1, x5) ]
    [ cov(x2, x1)  var(x2)      cov(x2, x3)  cov(x2, x4)  cov(x2, x5) ]
    [ cov(x3, x1)  cov(x3, x2)  var(x3)      cov(x3, x4)  cov(x3, x5) ]
    [ cov(x4, x1)  cov(x4, x2)  cov(x4, x3)  var(x4)      cov(x4, x5) ]
    [ cov(x5, x1)  cov(x5, x2)  cov(x5, x3)  cov(x5, x4)  var(x5)     ]

15 / 31
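The same call scales to any number of variables; a sketch with five columns of random data (the shapes and the seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # m = 100 examples, n = 5 variables (arbitrary)

Sigma = np.cov(X, rowvar=False)    # 5x5 covariance matrix laid out as above

print(Sigma.shape)                 # (5, 5)
assert np.allclose(Sigma, Sigma.T)
```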
Dot product

Consider two n-dimensional vectors u and v:

u = (u_1, u_2, ..., u_n)ᵀ   and   v = (v_1, v_2, ..., v_n)ᵀ

The dot product of these two vectors is defined as

u · v = \sum_{i=1}^{n} u_i v_i = u_1 v_1 + u_2 v_2 + ... + u_n v_n
16 / 31
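A quick NumPy illustration of the definition (the two vectors are arbitrary):

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

# Sum of element-wise products, exactly as in the definition.
manual = sum(u_i * v_i for u_i, v_i in zip(u, v))

assert np.isclose(manual, np.dot(u, v))   # also available as u @ v
print(np.dot(u, v))                       # 1*4 + 2*5 + 3*6 = 32
```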
Orthogonal vectors

Two vectors are orthogonal if their dot product is zero.

For example, vectors u and v below are orthogonal:

u = (1, 0, -1)ᵀ,   v = (1, √2, 1)ᵀ

17 / 31
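Checking the example above with NumPy:

```python
import numpy as np

u = np.array([1.0, 0.0, -1.0])
v = np.array([1.0, np.sqrt(2.0), 1.0])

# Orthogonal: the dot product is zero (1*1 + 0*sqrt(2) + (-1)*1 = 0).
assert np.isclose(np.dot(u, v), 0.0)
```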
Orthogonal matrix

A square real matrix with orthonormal vectors as its columns. For a matrix A to be orthogonal, it must be true that A Aᵀ = I_n.

Example (the columns below are mutually orthogonal; scaling each one to unit length makes the matrix orthogonal):

A = [  1   1    1  ]
    [  0   √2  -√2 ]
    [ -1   1    1  ]

18 / 31
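A small check of the example, assuming the third column carries a minus sign on √2 so that the three columns are mutually orthogonal: after scaling each column to unit norm, the matrix satisfies the definition.

```python
import numpy as np

# Columns of the example above: mutually orthogonal but not yet unit length.
A = np.array([[ 1.0, 1.0,           1.0],
              [ 0.0, np.sqrt(2.0), -np.sqrt(2.0)],
              [-1.0, 1.0,           1.0]])

# Scale each column to unit norm so the columns become orthonormal.
Q = A / np.linalg.norm(A, axis=0)

# Now Q satisfies the definition: Q Q^T = I_3 (and Q^T Q = I_3).
assert np.allclose(Q @ Q.T, np.eye(3))
assert np.allclose(Q.T @ Q, np.eye(3))
```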
Singular Value Decomposition

Let A be any m × n matrix. Then SVD decomposes this matrix into two orthogonal (unitary) matrices and a rectangular diagonal matrix containing the singular values. Mathematically:

A = U Σ Vᵀ

where

• U → (m × m) orthogonal matrix;
• Σ → (m × n) rectangular diagonal matrix, whose first r diagonal entries are the non-zero singular values;
• V → (n × n) orthogonal matrix.

19 / 31
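A sketch with np.linalg.svd (the matrix shape and values are arbitrary); note that NumPy returns the singular values as a vector and Vᵀ rather than V:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))        # any m x n matrix (here m = 4, n = 3)

U, s, Vt = np.linalg.svd(A)        # full SVD: U is 4x4, Vt is 3x3, s holds singular values

# Rebuild the rectangular diagonal matrix and check A = U Sigma V^T.
Sigma = np.zeros((4, 3))
np.fill_diagonal(Sigma, s)
assert np.allclose(A, U @ Sigma @ Vt)

# U and V have orthonormal columns.
assert np.allclose(U.T @ U, np.eye(4))
assert np.allclose(Vt @ Vt.T, np.eye(3))
```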
Linear transformations

Suppose the covariance matrix for some dataset is the following:

Σ = [ 9  5 ]
    [ 5  4 ]

Now, we will interpret Σ as a linear transformation. That means we will consider that multiplying Σ by a vector v is equivalent to applying a function to v, which produces another vector u:

u = Σ v

In this interpretation, Σ is a function f : ℜ² → ℜ², such that:

f(x1, x2) = (9 x1 + 5 x2, 5 x1 + 4 x2)

Notice the coefficients in f were taken from the entries of Σ.

20 / 31
Linear transformations

f(x1, x2) = (9 x1 + 5 x2, 5 x1 + 4 x2)

So, this transformation f maps points in ℜ² to points in ℜ².

• Point (0, 0) goes to (0, 0)
• Point (1, 0) goes to (9, 5)
• Point (0, 1) goes to (5, 4)
• Point (-1, 0) goes to (-9, -5)
• Point (0, -1) goes to (-5, -4)

21 / 31
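These mappings are just matrix-vector products; a minimal check in NumPy:

```python
import numpy as np

Sigma = np.array([[9.0, 5.0],
                  [5.0, 4.0]])

# Applying the transformation u = Sigma v to the points listed above.
points = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1)]
for p in points:
    u = Sigma @ np.array(p, dtype=float)
    print(p, "->", tuple(u))   # e.g. (1, 0) -> (9.0, 5.0)
```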
Eigenvalues and eigenvectors

Geometrically, the transformation encoded by Σ maps points on a circle (left) to points on an ellipse (right).

In most cases, the transformation both stretches and rotates the original vector...

22 / 31
Eigenvalues and eigenvectors

...but there are two special vectors, let us call them v1 and v2, that are only stretched by the transformation (i.e., not rotated).

These are the eigenvectors of Σ:

Σ v1 = λ1 v1
Σ v2 = λ2 v2

The values λ1 and λ2 are the corresponding stretching factors, and are called eigenvalues.

Also, it can be proved that, because Σ is symmetric, v1 and v2 are perpendicular to each other, and this happens even in higher dimensions.

23 / 31
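A NumPy check of both properties for the Σ used above (np.linalg.eigh is the routine for symmetric matrices; it returns eigenvalues in ascending order, with the eigenvectors as columns):

```python
import numpy as np

Sigma = np.array([[9.0, 5.0],
                  [5.0, 4.0]])

eigvals, eigvecs = np.linalg.eigh(Sigma)
v1, v2 = eigvecs[:, 1], eigvecs[:, 0]     # v1 = eigenvector of the largest eigenvalue

# Each eigenvector is only stretched: Sigma v = lambda v.
assert np.allclose(Sigma @ v1, eigvals[1] * v1)
assert np.allclose(Sigma @ v2, eigvals[0] * v2)

# Because Sigma is symmetric, the eigenvectors are perpendicular.
assert np.isclose(np.dot(v1, v2), 0.0)
```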
Eigenvalues and eigenvectors

[Figure: the eigenvectors v1 and v2 of Σ plotted in the (x1, x2) plane.]

24 / 31
Summary

By now, you should have a high-level understanding of the following concepts.

• Covariance Matrix
• SVD (Singular Value Decomposition)
• Linear Transformation
• Eigenvectors and eigenvalues

25 / 31
PCA
Intuition

Suppose our data matrix X has five variables:

• x1 = number of weekly hours of study
• x2 = final grade obtained
• x3 = number of weekly hours of video games
• x4 = number of rooms
• x5 = crime rate

Also, suppose we want to reduce the dimension of X to two. Maybe PCA will combine dimensions x1, x2, and x3 into a feature related to academic performance. And maybe it will combine features x4 and x5 into a feature related to housing information.

26 / 31
Projection error

• What PCA does is to project the original data points onto a new coordinate system.
• Definition (projection error): the sum of the squares of the distances between each data point and the surface corresponding to the smaller-dimensional space onto which the points must be projected.
• PCA produces the new coordinate system in such a way as to minimize the projection error of the data on the smaller-dimensional space.

Credits: https://alliance.seas.upenn.edu/~cis520

27 / 31
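A sketch of how the projection error can be measured for a single candidate direction (the toy points and the direction w below are made up for illustration); PCA chooses the direction(s) that make this quantity as small as possible.

```python
import numpy as np

# Toy 2-D data, centered, and a candidate unit direction w to project onto.
X = np.array([[2.0, 1.0], [-1.0, -0.5], [3.0, 1.4], [-4.0, -1.9]])
X = X - X.mean(axis=0)
w = np.array([1.0, 0.5])
w = w / np.linalg.norm(w)

# Project each point onto the line spanned by w, then reconstruct it in 2-D.
X_proj = (X @ w)[:, None] * w             # shape (m, 2)

# Projection error: sum of squared distances between points and their projections.
error = np.sum((X - X_proj) ** 2)
print(error)
```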
PCA - basic idea

The new coordinate system formed by PCA is a basis of eigenvectors such that

• the largest variation in the data is in the direction of the eigenvector associated with the largest eigenvalue (called the first principal component),
• the second largest variation in the data is in the direction of the eigenvector associated with the second largest eigenvalue (called the second principal component),
• and so on.

Therefore, if we want to reduce the dimensionality of the (transformed) data matrix, it is better to keep the directions with the greatest eigenvalues. That is the basic idea behind PCA.

28 / 31
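In scikit-learn this ordering is exposed directly; a sketch using sklearn.decomposition.PCA (the toy data is arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated 5-D toy data

pca = PCA(n_components=5).fit(X)

# Components come out ordered by the variance they capture (largest eigenvalue first).
print(pca.explained_variance_)         # eigenvalues of the covariance matrix, descending
print(pca.explained_variance_ratio_)   # fraction of total variance per component
```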
PCA - Steps of the algorithm

Now, we are ready to understand the PCA algorithm.

Consider using PCA to reduce the dimensionality of a data matrix X ∈ ℜ^(m×n).

PCA will take X as input and produce as output another matrix X_projected ∈ ℜ^(m×k).

30 / 31
PCA - Steps of the algorithm

1. Center and normalize X.
2. Compute Σ, the covariance matrix of X: Σ = \frac{1}{m-1} \sum_{i=1}^{m} x^{(i)} (x^{(i)})^T
3. Decompose Σ using SVD: Σ = U S Vᵀ. The principal components are the columns of the matrix U.
4. Select the k columns of U corresponding to the k largest singular values in S. These columns are the k principal components.
5. Construct the projection matrix: W = [u_1, u_2, ..., u_k]
6. Project the data matrix onto the basis formed by the selected columns:

X_projected = X W
31 / 31
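A compact NumPy rendering of these six steps (a sketch, assuming step 1 means standardizing each column; the function and variable names are illustrative):

```python
import numpy as np

def pca_project(X, k):
    """Steps 1-6 above, in NumPy."""
    # 1. Center and normalize each column.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    m = X.shape[0]
    # 2. Covariance matrix (equal to the sum of outer products of the centered rows).
    Sigma = (X.T @ X) / (m - 1)
    # 3. SVD of Sigma; the columns of U are the principal components.
    U, S, Vt = np.linalg.svd(Sigma)
    # 4-5. np.linalg.svd already sorts singular values in descending order,
    #      so the first k columns of U form the projection matrix W.
    W = U[:, :k]
    # 6. Project.
    return X @ W

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X_projected = pca_project(X, k=2)
print(X_projected.shape)   # (100, 2)
```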
