
FEM 2063 - Data Analytics

CHAPTER 5
Statistical Unsupervised Learning

Overview

➢Singular Value Decomposition (SVD)

➢Principal Components Analysis (PCA)

Unsupervised Learning
Unsupervised vs Supervised Learning:

• The previous three chapters dealt with supervised learning methods, mainly regression and classification.

• In supervised learning, a dataset is given in which each element has a set of features X1, X2, …, Xp as well as a response or outcome variable Y. The goal was then to build a model to predict Y using X1, X2, …, Xp.

• In unsupervised learning, only a set of features X1, X2, …, Xp is observed (no outcome variable Y). The goal is to extract the most valuable information from the observations.
Unsupervised Learning
The goal is to discover interesting things about the observations:

• Is there an informative way to visualize the data?

• Can we discover subgroups among the variables or among the observations?
Unsupervised Learning

• More subjective than supervised learning, as there is no simple goal for the analysis, such as prediction of a response.

• It is often easier to obtain unlabeled data (e.g., from a lab instrument or a computer) than labeled data, which may require expert intervention.
Unsupervised Learning
Examples of applications in various fields:

• subgroups of breast cancer patients, grouped by their gene expression measurements

• groups of shoppers, characterized by their browsing and purchase histories

• movies, grouped by the ratings assigned by movie viewers


Unsupervised Learning

We study two statistical methods:

• Singular Value Decomposition (SVD)

• Principal Components Analysis (PCA)


Unsupervised Learning
Datasets in the form of matrices
We are given n objects and p features describing the objects.

Dataset
An n-by-p matrix A
n rows representing n objects
Each object has p numeric values describing it.

Goal

1. Understand the structure of the data, e.g., the underlying process generating the data.

2. Reduce the number of features representing the data.


Unsupervised Learning
Example 1 - Market basket matrices

A is an n-by-p matrix: n customers (rows) and p products (columns), e.g., milk, bread, rice, etc.

Aij = quantity of the j-th product purchased by the i-th customer.

Aim: find a subset of the products that characterizes customer behavior.
Unsupervised Learning
Example 2 - Social-network matrices

A is an n-by-p matrix: n users (rows) and p groups (columns), e.g., BU group, opera, etc.

Aij = participation of the i-th user in the events of the j-th group.

Aim: find a subset of the groups that accurately clusters social-network users.
Unsupervised Learning
Example 3 - Document matrices

A is an n-by-p matrix: n documents (rows) and p terms (columns), e.g., theorem, proof, etc.

Aij = frequency of the j-th term in the i-th document.

Aim: find a subset of the terms that accurately clusters the documents.
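As a small illustration of such a document matrix, here is a hedged sketch using scikit-learn's CountVectorizer; the three toy documents below are made up for this example, not taken from the slides.

from sklearn.feature_extraction.text import CountVectorizer

# Toy documents (assumed for illustration)
docs = ["the proof of the theorem",
        "a new proof technique",
        "stock markets fell sharply"]

vec = CountVectorizer()
A = vec.fit_transform(docs).toarray()   # n documents x p terms, Aij = frequency of term j in document i

print(vec.get_feature_names_out())      # the p terms (columns of A)
print(A)                                # the n-by-p document matrix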
Unsupervised Learning
Example 4 - Recommendation systems

A is an n-by-p matrix: n customers (rows) and p products (columns).

Aij = frequency of the j-th product among the products bought by the i-th customer.

Aim: find a subset of the products that accurately describes the behavior of the customers.
Overview

➢Singular Value Decomposition (SVD)

➢Principal Components Analysis (PCA)

Singular Value Decomposition (SVD)

Data matrices have n rows (one for each object) and p columns (one for each feature).

Rows are vectors in a Euclidean space. For example, X = [[1, 0.5], [0.5, 1]] describes two objects, d and x, each by two features.

Two objects are "close" if the angle θ(d, x) between their corresponding vectors is small.

(Figure: the row vectors d and x plotted in the feature 1 / feature 2 plane, with the angle θ(d, x) between them.)
SVD
Example
Input: 2-dimensional points.

Output:

1st (right) singular vector: direction of maximal variance.

2nd (right) singular vector: direction of maximal variance, after removing the projection of the data along the first singular vector.

(Figure: scatter plot of the 2-d points with the 1st and 2nd right singular vectors drawn as arrows.)
SVD
Singular values

σ1: measures how much of the data variance is explained by the first singular vector.

σ2: measures how much of the data variance is explained by the second singular vector.

(Figure: the same scatter plot, with σ1 shown along the 1st singular vector and σ2 along the 2nd singular vector.)
SVD
SVD decomposition

A = U · Σ · VT

where A is n×p, U is n×ℓ, Σ is ℓ×ℓ, and VT is ℓ×p.

U (V): orthogonal matrix containing the left (right) singular vectors of A.

Σ: diagonal matrix containing the singular values of A: σ1 ≥ σ2 ≥ … ≥ σℓ.

V' = VT: the transpose of V.
SVD
SVD and Rank-k approximations

A = U Σ VT

(Figure: the leading singular values and vectors capture the significant structure in A, while the trailing ones mostly capture noise.)
SVD
Rank-k approximations (Ak)

Ak = Uk Σk VkT

where Ak is n×p, Uk is n×k, Σk is k×k, and VkT is k×p.

Uk (Vk): orthogonal matrix containing the top k left (right) singular vectors of A.

Σk: diagonal matrix containing the top k singular values of A.

Ak is the best rank-k approximation of A.
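A minimal numpy sketch of a rank-k approximation; the matrix A and the choice k = 1 below are only illustrative (A is the same small matrix used in the Python example later in this chapter).

import numpy as np

A = np.array([[2.0, 0.0, 2.0],
              [1.0, 2.0, -1.0]])

# Economy-size SVD of A
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 1                                          # keep only the top-k singular values/vectors
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k approximation Ak = Uk Sk Vk'

print(A_k)
print(np.linalg.norm(A - A_k))   # Frobenius error = sqrt of the sum of squared dropped singular values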
SVD
Example computation:
Find the SVD of A, UΣVT, where A is the same matrix used in the Python example below:

A = [[2, 0, 2],
     [1, 2, -1]]

Calculate the square matrix AA':

AA' = [[8, 0],
       [0, 6]]

Find the eigenvalues of AA':

The eigenvalues of AA' are 8 and 6.


SVD
Example computation:
Find the eigenvector of AA’ for the eigenvalue λ=8:

solve

The eigenvector of AA’ for 8 is: u1 = (already normalized)


SVD
Example computation:

Similarly, the eigenvector of AA' for λ = 6 is u2 = (0, 1)' (already normalized).
SVD
Example computation:
Calculate the square matrix A’ A

Find its eigenvalues:

The eigenvalues of A’A are 8, 6 and 0


SVD
Example computation:
For A’A the eigenvalues are λ=8, 6 and 0:

The eigenvector of AAT for λ= 8 is:

For λ= 6:
(all not normalized)

For λ=0:
SVD
Example computation:

After normalization:

v1 = (1/√2)(1, 0, 1)', v2 = (1/√6)(1, 2, -1)', v3 = (1/√3)(-1, 1, 1)'.
SVD
Example computation:
Finally, the SVD of A is A = UΣVT, with

U = [[1, 0],
     [0, 1]]

Σ = [[√8, 0, 0],
     [0, √6, 0]]

VT = [[1/√2, 0, 1/√2],
      [1/√6, 2/√6, -1/√6],
      [-1/√3, 1/√3, 1/√3]]

(Singular vectors are determined only up to sign, so a software computation, such as the Python example on the next slide, may return some of them with flipped signs.)
SVD
Python Example

import numpy as np
from scipy.linalg import svd

A = np.array([[2, 0, 2], [1, 2, -1]])

U, S, Vt = svd(A)

print(A)
print(U)
print(S)
print(Vt)

Output:

[[ 2  0  2]
 [ 1  2 -1]]
[[-1.00000000e+00  1.17756934e-16]
 [ 1.17756934e-16  1.00000000e+00]]
[2.82842712 2.44948974]
[[-0.70710678  0.         -0.70710678]
 [ 0.40824829  0.81649658 -0.40824829]
 [-0.57735027  0.57735027  0.57735027]]
Overview

➢Singular Value Decomposition (SVD)

➢Principal Components Analysis (PCA)

Principal Components Analysis (PCA)

• PCA produces a low-dimensional representation of a dataset. It finds a sequence of linear combinations of the variables that have maximal variance and are mutually uncorrelated.

• PCA can be used for dimensionality reduction and as a tool for data visualization.
PCA
PCA and SVD
• PCA is SVD performed on centered data.

• PCA looks for the direction such that the data projected onto it has maximal variance.

• PCA/SVD continues by seeking the next direction that is orthogonal to all previously found directions.

• All directions are mutually orthogonal.


PCA
Details
The first principal component of a set of features X1, X2, …, Xp is the normalized linear combination of the features

Z1 = φ11 X1 + φ21 X2 + … + φp1 Xp

that has the largest variance.

Normalized means that the squared loadings sum to one: φ11² + φ21² + … + φp1² = 1.

The elements φ11, φ21, …, φp1 are the loadings of the first principal component; together, the loadings make up the principal component loading vector φ1 = (φ11, φ21, …, φp1)'.
PCA
Further PCs
• The second principal component is the linear combination of X1, …, Xp that has maximal variance among all linear combinations that are uncorrelated with Z1.

• The second principal component scores z12, z22, …, zn2 take the form

zi2 = φ12 xi1 + φ22 xi2 + … + φp2 xip,

where φ2 is the second principal component loading vector, with elements φ12, φ22, …, φp2.
PCA
Geometry

• The loading vector φ1 with elements φ11, φ21, …, φp1 defines a direction in feature space along which the data vary the most.

• If we project the n data points x1, …, xn onto this direction, the projected values z11, …, zn1 are the principal component scores themselves.
PCA
Example

The population size (pop) and ad spending (ad) for 100 different cities are
shown as purple circles. The green solid line indicates the first principal
component direction, and the blue dashed line indicates the second principal
component direction.
PCA
Computation of PCs

• Given an n × p dataset X.

• Define M as the centered version of X (the column means of M are zero).

• The principal component directions of X are the eigenvectors of M'M.

• If the SVD of M is UDV', the principal component directions are the right singular vectors of M (the columns of V), and the principal component scores are given by UD.
PCA
Further PCs

It turns out that constraining Z2 to be uncorrelated with Z1 is equivalent to constraining the direction φ2 to be orthogonal (perpendicular) to the direction φ1, and similarly for later components.
PCA
Illustration

• USA arrests data: For each of the 50 states in the United States, the data set
contains the number of arrests per 100,000 residents for each of three crimes:
Assault, Murder, and Rape. UrbanPop (% of the population in each state
living in urban areas) is also recorded.

• n = 50, p = 4.

• PCA was performed


PCA
USA arrests data: plot with first 2 PCs
PCA
Figure details

The first two principal components for the US Arrests data.

• The blue state names represent the scores for the first two principal
components.

• The orange arrows indicate the first two principal component loading
vectors (with axes on the top and right).

• This figure is known as a biplot, because it displays both the principal


component scores and the principal component loadings.
PCA
PCA loadings

(Table of the loading values of the four variables on the first two principal components; these correspond to the orange arrows in the biplot above.)
PCA
Scaling of the variables matters
If the variables are in different units, scaling each to have standard deviation
equal to one is recommended.
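A short hedged sketch of this advice with scikit-learn; StandardScaler and PCA are standard sklearn classes, and the random array X below is only a placeholder for real data.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Placeholder data (assumed): 50 observations, 4 variables, one on a much larger scale
X = np.random.default_rng(0).normal(size=(50, 4))
X[:, 1] *= 100

Z = StandardScaler().fit_transform(X)    # each column rescaled to mean 0, standard deviation 1
pca = PCA().fit(Z)
print(pca.explained_variance_ratio_)     # PVE no longer dominated by the large-scale variable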
PCA
Proportion of variance explained (PVE)

• To understand the strength of each component, we are interested in knowing the proportion of variance explained (PVE) by each one.

• The total variance present in a data set (with centered variables) is

Var_total = Σ (j = 1..p) (1/n) Σ (i = 1..n) xij²,

and the variance explained by the mth PC is

Var(Zm) = (1/n) Σ (i = 1..n) zim².

• We have the following: the variances of all the principal components sum to the total variance, so the PVEs sum to one.


PCA
Proportion of variance explained (PVE)

The PVE of the mth PC is

PVEm = Σ (i = 1..n) zim² / Σ (j = 1..p) Σ (i = 1..n) xij².

Example: plots of the PVE of each principal component and of the cumulative PVE (a scree plot).
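A minimal numpy sketch of the PVE formula above, using an assumed toy matrix X; the results should match sklearn's explained_variance_ratio_.

import numpy as np

# Toy data (assumed): n = 5 observations, p = 2 features
X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])

M = X - X.mean(axis=0)                   # centered variables x_ij
U, D, Vt = np.linalg.svd(M, full_matrices=False)

Z = M @ Vt.T                             # principal component scores z_im
total_var = (M**2).sum()                 # sum over i and j of x_ij^2
pve = (Z**2).sum(axis=0) / total_var     # PVE of each PC
print(pve)                               # individual PVEs (sum to one)
print(pve.cumsum())                      # cumulative PVE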


PCA
Example: The Digit dataset

1797 hand-written digits in the form of 8×8 images.

Each image is represented by a feature vector of length 64.


PCA
Using PCA:

• Project from 64 dimensions down to 2

• Visualization

https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html
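A hedged sketch of this projection and visualization with scikit-learn, in the spirit of the linked handbook example; load_digits and PCA are standard sklearn utilities.

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()                       # 1797 images, each flattened to 64 features
pca = PCA(n_components=2)                    # project from 64 dimensions down to 2
projected = pca.fit_transform(digits.data)

# Scatter plot of the 2-d projection, colored by the true digit label
plt.scatter(projected[:, 0], projected[:, 1], c=digits.target, cmap='tab10', s=5)
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.colorbar(label='digit')
plt.show()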