MLXX (Dimensionality Reduction) - 1
Eduardo Bezerra
CEFET/RJ
Outline
Introduction
PCA
Introduction
Dimensionality Reduction
Dimensionality Reduction - PCA
PCA picks out the directions that contain the greatest amount of information (i.e., the greatest variance) in the data.
Linear Algebra Refresher
Data matrix
We are used to a structure like the one below in a supervised learning setting. We have been calling it a training dataset $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$:

$$X = \begin{bmatrix} x^{(1)} \\ x^{(2)} \\ x^{(3)} \\ x^{(4)} \\ x^{(5)} \\ \vdots \\ x^{(i)} \\ \vdots \end{bmatrix} \qquad y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ y^{(3)} \\ y^{(4)} \\ y^{(5)} \\ \vdots \\ y^{(i)} \\ \vdots \end{bmatrix}$$
Data matrix
• Now, we are going to focus on the data matrix part $X = \{x^{(i)}\}_{i=1}^{m}$, with $x^{(i)} \in \Re^{n}$.
• In Statistics, there is the concept of a random variable.
• In the context of a data matrix $X$, we can think of each feature (i.e., each dimension of $x^{(i)}$) as a random variable.

$$X = \begin{bmatrix} x^{(1)} \\ x^{(2)} \\ x^{(3)} \\ x^{(4)} \\ x^{(5)} \\ \vdots \\ x^{(i)} \\ \vdots \end{bmatrix}$$
Data matrix
But notice that, in general, the variables in a data matrix have dependencies between them. A way to capture these dependencies is through the concept of covariance...
Covariance
Given two random variables $x_j$ and $x_k$, the covariance between them is defined as follows:

$$\mathrm{cov}(x_j, x_k) = \frac{1}{m-1} \sum_{i=1}^{m} \left(x_j^{(i)} - \bar{x}_j\right)\left(x_k^{(i)} - \bar{x}_k\right)$$

But what does this all mean?! Answer: the value $\mathrm{cov}(x_j, x_k)$ tells us whether those two variables are related and, if so, the direction of such relationship.
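As a minimal sketch, the formula above translates directly to Python (the data values below are purely illustrative):

```python
# Sample covariance of two variables, exactly as in the formula above:
# cov(xj, xk) = (1 / (m - 1)) * sum_i (xj_i - mean_j) * (xk_i - mean_k)
def cov(xj, xk):
    m = len(xj)
    mean_j = sum(xj) / m
    mean_k = sum(xk) / m
    return sum((a - mean_j) * (b - mean_k) for a, b in zip(xj, xk)) / (m - 1)

xj = [10.0, 1.0, 15.0, 20.0]   # illustrative data
xk = [7.0, 1.0, 8.5, 10.0]
print(cov(xj, xk))   # positive value: the two variables move together
print(cov(xj, xj))   # the covariance of a variable with itself is its variance
```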
Covariance matrix
Suppose that we compute the entries of the covariance matrix for our two variables and come up with the following:

$$\Sigma = \begin{bmatrix} 65.7 & 5.0 \\ 5.0 & 1.5 \end{bmatrix}$$
Covariance matrix
Let us put our data matrix X and the corresponding covariance matrix Σ side by side:

$$\Sigma = \begin{bmatrix} 65.7 & 5.0 \\ 5.0 & 1.5 \end{bmatrix} \qquad \begin{array}{cc} x_1 & x_2 \\ 10 & 7.0 \\ 1 & 1.0 \\ 15 & 8.5 \\ 20 & 10.0 \end{array}$$

The positive value 5.0 indicates that x1 and x2 tend to increase/decrease together.
Now, the value 65.7 is the covariance of x1 with itself, i.e., its variance. In the same way, 1.5 is the variance of x2.
In the data matrix, you should notice that the values of x1 are more spread out than the values of x2. That is why var(x1) > var(x2).
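A quick way to compute such a matrix is NumPy's np.cov, which by default uses the same m - 1 denominator as our definition (treat the matrix values above as illustrative; the exact numbers depend on the data):

```python
import numpy as np

# One row per example, one column per variable (x1, x2), as in the table above.
X = np.array([[10.0,  7.0],
              [ 1.0,  1.0],
              [15.0,  8.5],
              [20.0, 10.0]])

# rowvar=False tells np.cov that columns (not rows) are the variables.
# The default normalization divides by m - 1, matching our definition.
Sigma = np.cov(X, rowvar=False)
print(Sigma)   # 2x2 symmetric matrix; the diagonal holds var(x1) and var(x2)
```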
Variance
The covariance between a random variable and itself is the variance of this variable.
• This gives us a notion of the dispersion of the values of this variable around its
mean.
• That is, variance is a measure of the deviation (from the mean) for points in one
dimension.
• Variance is always a non-negative value.
Covariance
The covariance between two random variables measures how much they vary together, and also indicates the direction of variation. Its value is
• positive if, when one variable increases (decreases), the other tends to increase (decrease);
• negative if, when one variable increases (decreases), the other tends to decrease (increase);
• zero when there is no (linear) relationship between those variables.
Covariance - visual interpretation
[Figure: three scatter plots over axes x1 and x2, one for each case: positive, negative, and zero covariance.]
Covariance matrix
Notice that we can build a covariance matrix for any number of variables. For example, if we have a data matrix with five variables, then:

$$\Sigma = \begin{bmatrix}
\mathrm{var}(x_1) & \mathrm{cov}(x_1,x_2) & \mathrm{cov}(x_1,x_3) & \mathrm{cov}(x_1,x_4) & \mathrm{cov}(x_1,x_5) \\
\mathrm{cov}(x_2,x_1) & \mathrm{var}(x_2) & \mathrm{cov}(x_2,x_3) & \mathrm{cov}(x_2,x_4) & \mathrm{cov}(x_2,x_5) \\
\mathrm{cov}(x_3,x_1) & \mathrm{cov}(x_3,x_2) & \mathrm{var}(x_3) & \mathrm{cov}(x_3,x_4) & \mathrm{cov}(x_3,x_5) \\
\mathrm{cov}(x_4,x_1) & \mathrm{cov}(x_4,x_2) & \mathrm{cov}(x_4,x_3) & \mathrm{var}(x_4) & \mathrm{cov}(x_4,x_5) \\
\mathrm{cov}(x_5,x_1) & \mathrm{cov}(x_5,x_2) & \mathrm{cov}(x_5,x_3) & \mathrm{cov}(x_5,x_4) & \mathrm{var}(x_5)
\end{bmatrix}$$
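The same NumPy call scales to any number of variables. A short sketch of the 5 x 5 case (with random data, purely for illustration), which also checks the symmetry visible in the layout above:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))       # 100 examples, 5 variables (random demo data)

Sigma = np.cov(X, rowvar=False)     # 5x5 covariance matrix
print(Sigma.shape)                  # (5, 5)
print(np.allclose(Sigma, Sigma.T))  # True: a covariance matrix is symmetric
```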
Dot product
Given two vectors $u, v \in \Re^{n}$, their dot product is $u \cdot v = \sum_{j=1}^{n} u_j v_j$.
Orthogonal vectors
Two vectors $u$ and $v$ are orthogonal when their dot product is zero: $u \cdot v = 0$.
[Figure: two perpendicular vectors in the plane.]
Orthogonal matrix
A square real matrix with orthonormal vectors as its columns. For a matrix A to be orthogonal, it must be true that $AA^{T} = I_n$.
Example:

$$A = \begin{bmatrix} \frac{1}{\sqrt{2}} & 0 & \frac{1}{\sqrt{2}} \\ 0 & 1 & 0 \\ -\frac{1}{\sqrt{2}} & 0 & \frac{1}{\sqrt{2}} \end{bmatrix}$$
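A short NumPy check of this property for a matrix of the form above:

```python
import numpy as np

s = 1 / np.sqrt(2)
A = np.array([[  s, 0.0,   s],
              [0.0, 1.0, 0.0],
              [ -s, 0.0,   s]])

# Orthogonality check: A @ A.T must equal the identity matrix I_n.
print(np.allclose(A @ A.T, np.eye(3)))   # True
```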
Singular Value Decomposition
Let A be any m × n matrix. Then SVD decomposes this matrix into two orthogonal matrices and a rectangular diagonal matrix containing the singular values. Mathematically:

$$A = U \Sigma V^{T}$$

where
• $U$ → an $m \times m$ orthogonal matrix;
• $\Sigma$ → an $m \times n$ rectangular diagonal matrix whose diagonal holds the singular values (only the first $r = \mathrm{rank}(A)$ of them are nonzero);
• $V$ → an $n \times n$ orthogonal matrix.
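A minimal NumPy sketch (the matrix A below is an arbitrary example). Note that np.linalg.svd returns the singular values as a vector and returns V already transposed:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])   # an arbitrary m x n matrix (here 3 x 2)

U, s, Vt = np.linalg.svd(A)  # U: m x m, s: singular values, Vt: V transposed

# Rebuild the rectangular diagonal matrix and verify A = U Sigma V^T.
S = np.zeros(A.shape)
S[:len(s), :len(s)] = np.diag(s)
print(np.allclose(A, U @ S @ Vt))   # True
```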
Linear transformations
Now, we will interpret Σ as a linear transformation. That means we will consider that multiplying Σ by a vector v is equivalent to applying a function to v, which produces another vector u:

$$u = \Sigma v$$

In this interpretation, Σ is a function $f : \Re^2 \to \Re^2$ such that $u = f(v) = \Sigma v$.
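As a sketch, applying the Σ from our running example to an arbitrary vector v:

```python
import numpy as np

Sigma = np.array([[65.7, 5.0],
                  [ 5.0, 1.5]])   # the covariance matrix from before

v = np.array([1.0, 2.0])          # an arbitrary input vector (illustrative)
u = Sigma @ v                     # u = Sigma v
print(u)                          # in general, v gets both stretched and rotated
```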
Eigenvalues and eigenvectors
In most cases, the transformation both stretches and rotates the original vector...
Eigenvalues and eigenvectors
...but there are two special vectors, let us call them v1 and v2, that are only stretched by the transformation (i.e., not rotated).
These are the eigenvectors of Σ:

$$\Sigma v_1 = \lambda_1 v_1$$
$$\Sigma v_2 = \lambda_2 v_2$$

The values λ1 and λ2 are the corresponding stretching factors, and are called eigenvalues.
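A NumPy sketch of this property, again with the Σ from our running example:

```python
import numpy as np

Sigma = np.array([[65.7, 5.0],
                  [ 5.0, 1.5]])

# eigh is NumPy's eigensolver for symmetric matrices; eigenvalues come back
# in ascending order, eigenvectors as the columns of the second result.
eigvals, eigvecs = np.linalg.eigh(Sigma)

v1, lam1 = eigvecs[:, -1], eigvals[-1]   # pair with the largest eigenvalue
# Applying Sigma to v1 only stretches it by lam1 (no rotation):
print(np.allclose(Sigma @ v1, lam1 * v1))   # True
```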
Eigenvalues and eigenvectors
[Figure: the eigenvectors v1 and v2 plotted in the (x1, x2) plane.]
Summary
• Covariance Matrix
• SVD (Singular Value Decomposition)
• Linear Transformation
• Eigenvectors and eigenvalues
PCA
Intuition
Intuition
Also, suppose we want to reduce the dimension of X to two. Maybe PCA will combine dimensions x1, x2, and x3 into a feature related to academic performance. And maybe it will combine features x4 and x5 into a feature related to housing information.
Projection error
PCA chooses directions that minimize the projection error, i.e., the distances between the original points and their projections onto the new subspace.
PCA - basic idea
The new coordinate system formed by PCA is a basis of eigenvectors such that
• the largest variation in the data is in the direction of the eigenvector associated with the largest eigenvalue (called the first principal component),
• the second largest variation in the data is in the direction of the eigenvector associated with the second largest eigenvalue (called the second principal component),
• and so on.
PCA - Steps of the algorithm
PCA will take $X \in \Re^{m \times n}$ as input and produce as output another matrix $X_{\text{projected}} \in \Re^{m \times k}$, with $k < n$:
1. Mean-center the data matrix $X$.
2. Compute the covariance matrix $\Sigma$ of $X$.
3. Compute the eigenvectors and eigenvalues of $\Sigma$ (e.g., via SVD).
4. Build $W \in \Re^{n \times k}$, whose columns are the eigenvectors associated with the $k$ largest eigenvalues.
5. Project the data:

$$X_{\text{projected}} = X W$$
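Putting the steps together, a minimal NumPy sketch of the whole algorithm (the function name, random data, and the choice k = 2 are illustrative):

```python
import numpy as np

def pca(X, k):
    """Project the data matrix X (m x n) onto its k principal components."""
    # 1. Mean-center each variable (column).
    Xc = X - X.mean(axis=0)
    # 2. Covariance matrix of the centered data (n x n).
    Sigma = np.cov(Xc, rowvar=False)
    # 3. Eigendecomposition of the symmetric matrix Sigma.
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    # 4. W: the eigenvectors with the k largest eigenvalues, as columns.
    order = np.argsort(eigvals)[::-1]
    W = eigvecs[:, order[:k]]
    # 5. Project: X_projected = X W, an m x k matrix.
    return Xc @ W

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # random demo data
print(pca(X, k=2).shape)        # (100, 2)
```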