
Dimensionality reduction

CIC1205 - Machine Learning

Eduardo Bezerra

CEFET/RJ
Outline

Introduction

Linear Algebra Refresher

PCA

2 / 31
Introduction
Dimensionality Reduction

• A large number of attributes (dimensions, variables, predictors, features) can be available for each example in a dataset.
• However, not all of them may be required for building an ML model.
• Indeed, using all attributes may in many cases worsen predictive performance (i.e., quality of results).

3 / 31
Dimensionality Reduction

Therefore, in some cases we will need some form of dimensionality reduction, in which one wishes to retain as much of the information as possible, but in fewer attributes.

4 / 31
Dimensionality Reduction - PCA

The most trivial form of dimensionality reduction is simply to use one's judgement/experience and select a subset of attributes.

A less subjective, automated approach is principal component analysis (PCA). It picks out the directions that contain the greatest amount of information.

5 / 31
Linear Algebra Refresher
Data matrix

We are used to a structure like the one below in a supervised learning setting. We have been calling it a training dataset {(x^(i), y^(i))}_{i=1}^m.

X        y
x^(1)    y^(1)
x^(2)    y^(2)
x^(3)    y^(3)
x^(4)    y^(4)
x^(5)    y^(5)
...      ...
x^(i)    y^(i)
...      ...
6 / 31
Data matrix

• Now, we are going to focus on the data matrix part X = {x^(i)}_{i=1}^m, with x^(i) ∈ ℜ^n.
• In Statistics, there is the concept of a random variable.
• In the context of a data matrix X, we can think of each existing feature (i.e., each dimension of x^(i)) as a random variable.

X
x^(1)
x^(2)
x^(3)
x^(4)
x^(5)
...
x^(i)
...

7 / 31
Data matrix

For example, think of a data matrix with two dimensions (i.e., n = 2) and four examples (i.e., m = 4):

• x1 = number of hours of study
• x2 = final grade obtained

x1    x2
10    7.0
1     1.0
15    8.5
20    10.0

If we were to sample examples from X, we would see random values for x1 and x2.

But notice that, in general, variables in a data matrix have dependencies between them. A way to capture this dependency is through the concept of covariance...

8 / 31
Covariance

Given two random variables x_j and x_k, the covariance between them is defined as follows:

cov(x_j, x_k) = \frac{1}{m-1} \sum_{i=1}^{m} (x_j^{(i)} - \bar{x}_j)(x_k^{(i)} - \bar{x}_k)

In the above expression,

• x̄_j and x̄_k are the means of x_j and x_k, respectively;
• m is the number of examples in our data matrix.

But what does this all mean?! Answer: the value cov(x_j, x_k) tells us whether those two variables are related, and, if so, what is the direction of such relationship.

9 / 31
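A minimal NumPy sketch of this formula (the two toy arrays are made up for illustration): it applies the definition directly and checks the result against np.cov, which uses the same 1/(m - 1) normalization by default.

```python
import numpy as np

# Two toy variables with m = 4 observations each (illustrative values).
x_j = np.array([2.0, 4.0, 6.0, 8.0])
x_k = np.array([1.0, 3.0, 2.0, 5.0])
m = len(x_j)

# Sample covariance, exactly as in the formula: centered products summed, divided by m - 1.
cov_jk = np.sum((x_j - x_j.mean()) * (x_k - x_k.mean())) / (m - 1)

# np.cov returns the full 2x2 covariance matrix; the off-diagonal entry is cov(x_j, x_k).
assert np.isclose(cov_jk, np.cov(x_j, x_k)[0, 1])
print(cov_jk)
```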
Covariance matrix

For example, consider our two previous variables:

• x1 = number of hours of study
• x2 = final grade obtained

We can organize the corresponding covariance values in a matrix like this:

Σ = [ var(x1)      cov(x1, x2) ]
    [ cov(x2, x1)  var(x2)     ]

Suppose that we compute the entries of the covariance matrix for our two variables and come up with the following:

Σ = [ 65.7  5.0 ]
    [ 5.0   1.5 ]

10 / 31
Covariance matrix

Let us put our data matrix X and the corresponding covariance matrix Σ side by side:

Σ = [ 65.7  5.0 ]      x1    x2
    [ 5.0   1.5 ]      10    7.0
                        1    1.0
                       15    8.5
                       20   10.0

First, notice that Σ is symmetric. That is because cov(x1, x2) = cov(x2, x1).

The positive value 5.0 indicates that x1 and x2 tend to increase/decrease together.

Now, the value 65.7 is the covariance of x1 with itself, i.e., its variance. In the same way, 1.5 is the variance of x2.

In the data matrix, you should notice that the values of x1 are more spread out than the values of x2. That is why var(x1) > var(x2).

11 / 31
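As a quick check, assuming NumPy is available, the sample covariance matrix of these four rows can be computed with np.cov: var(x1) works out to 197/3 ≈ 65.7, matching the first diagonal entry, and the matrix is symmetric with var(x1) far larger than var(x2), as discussed above.

```python
import numpy as np

# Data matrix from the example: rows are examples, columns are x1 (hours) and x2 (grade).
X = np.array([[10.0,  7.0],
              [ 1.0,  1.0],
              [15.0,  8.5],
              [20.0, 10.0]])

# rowvar=False tells NumPy that each column (not each row) is a variable.
Sigma = np.cov(X, rowvar=False)
print(Sigma)

assert np.allclose(Sigma, Sigma.T)         # symmetric: cov(x1, x2) == cov(x2, x1)
assert Sigma[0, 0] > Sigma[1, 1]           # x1 is more spread out than x2
assert np.isclose(Sigma[0, 0], 197.0 / 3)  # var(x1) ≈ 65.7
```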
Variance

The covariance between a random variable and itself is the variance of this variable.

• This gives us a notion of the dispersion of the values of this variable around its
mean.
• That is, variance is a measure of the deviation (from the mean) for points in one
dimension.
• Variance is always a non-negative value.

12 / 31
Covariance

The covariance between two random variables measures how much they vary together, and also indicates the direction of variation. Its value is

• positive if when one variable increases (decreases), the other tends to increase (decrease);
• negative if when one variable increases (decreases), the other tends to decrease (increase);
• zero when there is no (linear) relationship between those variables.

Hence covariance measures how correlated two variables are...

13 / 31
Covariance - visual interpretation

Figure 1: a) Positive correlation. Figure 2: b) Uncorrelated/no correlation. Figure 3: c) Negative correlation. (Each panel is a scatter plot of x2 against x1.)

14 / 31
Covariance matrix

Notice that we can build a covariance matrix for any number of variables. For example, if we have a data matrix with five variables, then:

Σ = [ var(x1)      cov(x1, x2)  cov(x1, x3)  cov(x1, x4)  cov(x1, x5) ]
    [ cov(x2, x1)  var(x2)      cov(x2, x3)  cov(x2, x4)  cov(x2, x5) ]
    [ cov(x3, x1)  cov(x3, x2)  var(x3)      cov(x3, x4)  cov(x3, x5) ]
    [ cov(x4, x1)  cov(x4, x2)  cov(x4, x3)  var(x4)      cov(x4, x5) ]
    [ cov(x5, x1)  cov(x5, x2)  cov(x5, x3)  cov(x5, x4)  var(x5)     ]

15 / 31
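The same call scales to any number of variables; a sketch with five columns of random data (the shapes and the seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # m = 100 examples, n = 5 variables (arbitrary)

Sigma = np.cov(X, rowvar=False)    # 5x5 covariance matrix laid out as above

print(Sigma.shape)                 # (5, 5)
assert np.allclose(Sigma, Sigma.T)
```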
Dot product

Consider two n-dimensional vectors u and v:

u = (u_1, u_2, ..., u_n)ᵀ   and   v = (v_1, v_2, ..., v_n)ᵀ

The dot product of these two vectors is defined as

u · v = \sum_{i=1}^{n} u_i v_i = u_1 v_1 + u_2 v_2 + ... + u_n v_n
16 / 31
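A quick NumPy illustration of the definition (the two vectors are arbitrary):

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

# Sum of element-wise products, exactly as in the definition.
manual = sum(u_i * v_i for u_i, v_i in zip(u, v))

assert np.isclose(manual, np.dot(u, v))   # also available as u @ v
print(np.dot(u, v))                       # 1*4 + 2*5 + 3*6 = 32
```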
Orthogonal vectors

Two vectors are orthogonal if their dot product is zero.

For example, vectors u and v below are orthogonal:

u = (1, 0, -1)ᵀ,   v = (1, √2, 1)ᵀ

17 / 31
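Checking the example above with NumPy:

```python
import numpy as np

u = np.array([1.0, 0.0, -1.0])
v = np.array([1.0, np.sqrt(2.0), 1.0])

# Orthogonal: the dot product is zero (1*1 + 0*sqrt(2) + (-1)*1 = 0).
assert np.isclose(np.dot(u, v), 0.0)
```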
Orthogonal matrix

A square real matrix with orthonormal vectors as its columns. For a matrix A to be orthogonal, it must be true that A Aᵀ = I_n.

Example (the columns below are mutually orthogonal; scaling each one to unit length makes the matrix orthogonal):

A = [  1   1    1  ]
    [  0   √2  -√2 ]
    [ -1   1    1  ]

18 / 31
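A small check of the example, assuming the third column carries a minus sign on √2 so that the three columns are mutually orthogonal: after scaling each column to unit norm, the matrix satisfies the definition.

```python
import numpy as np

# Columns of the example above: mutually orthogonal but not yet unit length.
A = np.array([[ 1.0, 1.0,           1.0],
              [ 0.0, np.sqrt(2.0), -np.sqrt(2.0)],
              [-1.0, 1.0,           1.0]])

# Scale each column to unit norm so the columns become orthonormal.
Q = A / np.linalg.norm(A, axis=0)

# Now Q satisfies the definition: Q Q^T = I_3 (and Q^T Q = I_3).
assert np.allclose(Q @ Q.T, np.eye(3))
assert np.allclose(Q.T @ Q, np.eye(3))
```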
Singular Value Decomposition

Let A be any m × n matrix. Then SVD decomposes this matrix into two orthogonal (unitary) matrices and a rectangular diagonal matrix containing the singular values. Mathematically:

A = U Σ Vᵀ

where

• U → (m × m) orthogonal matrix;
• Σ → (m × n) rectangular diagonal matrix, whose first r diagonal entries are the non-zero singular values;
• V → (n × n) orthogonal matrix.

19 / 31
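A sketch with np.linalg.svd (the matrix shape and values are arbitrary); note that NumPy returns the singular values as a vector and Vᵀ rather than V:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))        # any m x n matrix (here m = 4, n = 3)

U, s, Vt = np.linalg.svd(A)        # full SVD: U is 4x4, Vt is 3x3, s holds singular values

# Rebuild the rectangular diagonal matrix and check A = U Sigma V^T.
Sigma = np.zeros((4, 3))
np.fill_diagonal(Sigma, s)
assert np.allclose(A, U @ Sigma @ Vt)

# U and V have orthonormal columns.
assert np.allclose(U.T @ U, np.eye(4))
assert np.allclose(Vt @ Vt.T, np.eye(3))
```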
Linear transformations

Suppose the covariance matrix for some dataset is the following:

Σ = [ 9  5 ]
    [ 5  4 ]

Now, we will interpret Σ as a linear transformation. That means we will consider that multiplying Σ by a vector v is equivalent to applying a function to v, which produces another vector u:

u = Σ v

In this interpretation, Σ is a function f : ℜ² → ℜ², such that:

f(x1, x2) = (9 x1 + 5 x2, 5 x1 + 4 x2)

Notice the coefficients in f were taken from the entries of Σ.

20 / 31
Linear transformations

f(x1, x2) = (9 x1 + 5 x2, 5 x1 + 4 x2)

So, this transformation f maps points in ℜ² to points in ℜ².

• Point (0, 0) goes to (0, 0)
• Point (1, 0) goes to (9, 5)
• Point (0, 1) goes to (5, 4)
• Point (-1, 0) goes to (-9, -5)
• Point (0, -1) goes to (-5, -4)

21 / 31
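These mappings are just matrix-vector products; a minimal check in NumPy:

```python
import numpy as np

Sigma = np.array([[9.0, 5.0],
                  [5.0, 4.0]])

# Applying the transformation u = Sigma v to the points listed above.
points = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1)]
for p in points:
    u = Sigma @ np.array(p, dtype=float)
    print(p, "->", tuple(u))   # e.g. (1, 0) -> (9.0, 5.0)
```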
Eigenvalues and eigenvectors

Geometrically, the transformation encoded by Σ maps points on a circle (left) to points on an ellipse (right).

In most cases, the transformation both stretches and rotates the original vector...

22 / 31
Eigenvalues and eigenvectors

...but there are two special vectors, let us call them v1 and v2, that are only stretched by the transformation (i.e., not rotated).

These are the eigenvectors of Σ:

Σ v1 = λ1 v1
Σ v2 = λ2 v2

The values λ1 and λ2 are the corresponding stretching factors, and are called eigenvalues.

Also, it can be proved that, because Σ is symmetric, v1 and v2 are perpendicular to each other, and this happens even in higher dimensions.

23 / 31
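A NumPy check of both properties for the Σ used above (np.linalg.eigh is the routine for symmetric matrices; it returns eigenvalues in ascending order, with the eigenvectors as columns):

```python
import numpy as np

Sigma = np.array([[9.0, 5.0],
                  [5.0, 4.0]])

eigvals, eigvecs = np.linalg.eigh(Sigma)
v1, v2 = eigvecs[:, 1], eigvecs[:, 0]     # v1 = eigenvector of the largest eigenvalue

# Each eigenvector is only stretched: Sigma v = lambda v.
assert np.allclose(Sigma @ v1, eigvals[1] * v1)
assert np.allclose(Sigma @ v2, eigvals[0] * v2)

# Because Sigma is symmetric, the eigenvectors are perpendicular.
assert np.isclose(np.dot(v1, v2), 0.0)
```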
Eigenvalues and eigenvectors

[Figure: the eigenvectors v1 and v2 of Σ plotted in the (x1, x2) plane.]

24 / 31
Summary

By now, you should have a high-level understanding of the following concepts.

• Covariance Matrix
• SVD (Singular Value Decomposition)
• Linear Transformation
• Eigenvectors and eigenvalues

25 / 31
PCA
Intuition

Suppose our data matrix X has five variables:

• x1 = number of weekly hours of study
• x2 = final grade obtained
• x3 = number of weekly hours of video games
• x4 = number of rooms
• x5 = crime rate

Also, suppose we want to reduce the dimension of X to two. Maybe PCA will combine dimensions x1, x2, and x3 into a feature related to academic performance. And maybe it will combine features x4 and x5 into a feature related to housing information.

26 / 31
Projection error

• What PCA does is to project the original data points onto a new coordinate system.
• Definition (projection error): the sum of the squares of the distances between each data point and the surface corresponding to the smaller-dimensional space onto which the points must be projected.
• PCA produces the new coordinate system in such a way as to minimize the projection error of the data on the smaller-dimensional space.

Credits: https://alliance.seas.upenn.edu/~cis520

27 / 31
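A sketch of how the projection error can be measured for a single candidate direction (the toy points and the direction w below are made up for illustration); PCA chooses the direction(s) that make this quantity as small as possible.

```python
import numpy as np

# Toy 2-D data, centered, and a candidate unit direction w to project onto.
X = np.array([[2.0, 1.0], [-1.0, -0.5], [3.0, 1.4], [-4.0, -1.9]])
X = X - X.mean(axis=0)
w = np.array([1.0, 0.5])
w = w / np.linalg.norm(w)

# Project each point onto the line spanned by w, then reconstruct it in 2-D.
X_proj = (X @ w)[:, None] * w             # shape (m, 2)

# Projection error: sum of squared distances between points and their projections.
error = np.sum((X - X_proj) ** 2)
print(error)
```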
PCA - basic idea

The new coordinate system formed by PCA is a basis of eigenvectors such that

• the largest variation in the data is in the direction of the eigenvector associated with the largest eigenvalue (called the first principal component),
• the second largest variation in the data is in the direction of the eigenvector associated with the second largest eigenvalue (called the second principal component),
• and so on.

Therefore, if we want to reduce the dimensionality of the (transformed) data matrix, it is better to keep the directions with the greatest eigenvalues. That is the basic idea behind PCA.

28 / 31
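In scikit-learn this ordering is exposed directly; a sketch using sklearn.decomposition.PCA (the toy data is arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated 5-D toy data

pca = PCA(n_components=5).fit(X)

# Components come out ordered by the variance they capture (largest eigenvalue first).
print(pca.explained_variance_)         # eigenvalues of the covariance matrix, descending
print(pca.explained_variance_ratio_)   # fraction of total variance per component
```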
PCA - Steps of the algorithm

Now, we are ready to understand the PCA algorithm.

Consider using PCA to reduce the dimensionality of a data matrix X ∈ ℜ^(m×n).

PCA will take X as input and produce as output another matrix X_projected ∈ ℜ^(m×k).

30 / 31
PCA - Steps of the algorithm

1. Center and normalize X.
2. Compute Σ, the covariance matrix of X: Σ = \frac{1}{m-1} \sum_{i=1}^{m} x^{(i)} (x^{(i)})^T
3. Decompose Σ using SVD: Σ = U S Vᵀ. The principal components are the columns of the matrix U.
4. Select the k columns of U corresponding to the k largest singular values in S. These columns are the k principal components.
5. Construct the projection matrix: W = [u_1, u_2, ..., u_k]
6. Project the data matrix onto the basis formed by the selected columns:

X_projected = X W
31 / 31
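A compact NumPy rendering of these six steps (a sketch, assuming step 1 means standardizing each column; the function and variable names are illustrative):

```python
import numpy as np

def pca_project(X, k):
    """Steps 1-6 above, in NumPy."""
    # 1. Center and normalize each column.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    m = X.shape[0]
    # 2. Covariance matrix (equal to the sum of outer products of the centered rows).
    Sigma = (X.T @ X) / (m - 1)
    # 3. SVD of Sigma; the columns of U are the principal components.
    U, S, Vt = np.linalg.svd(Sigma)
    # 4-5. np.linalg.svd already sorts singular values in descending order,
    #      so the first k columns of U form the projection matrix W.
    W = U[:, :k]
    # 6. Project.
    return X @ W

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X_projected = pca_project(X, k=2)
print(X_projected.shape)   # (100, 2)
```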
