
FEM 2063 - Data Analytics

CHAPTER 5
Statistical Unsupervised Learning

Overview

➢Singular Value Decomposition (SVD)

➢Principal Components Analysis (PCA)

Unsupervised Learning
Unsupervised vs Supervised Learning:

• The previous three chapters dealt with supervised learning methods, mainly regression and classification.

• In supervised learning, a dataset is given in which each element has a set of features X1, X2, …, Xp as well as a response or outcome variable Y. The goal was then to build a model to predict Y using X1, X2, …, Xp.

• In unsupervised learning, only a set of features X1, X2, …, Xp is observed (no outcome variable Y). The goal is to extract the most valuable information from the observations.
Unsupervised Learning
The goal is to discover interesting things about the observations:

• Is there an informative way to visualize the data?

• Can we discover subgroups among the variables or among the observations?
Unsupervised Learning

• More subjective than supervised learning, as there is no simple goal for the analysis, such as prediction of a response.

• It is often easier to obtain unlabeled data (e.g., from a lab instrument or a computer) than labeled data, which may require expert intervention.
Unsupervised Learning
Examples of applications in various fields:

• subgroups of breast cancer patients, grouped by their gene expression measurements

• groups of shoppers, characterized by their browsing and purchase histories

• movies, grouped by the ratings assigned by movie viewers


Unsupervised Learning

We study two statistical methods:

• Singular Value Decomposition (SVD)

• Principal Components Analysis (PCA)


Unsupervised Learning
Datasets in the form of matrices
We are given n objects and p features describing the objects.

Dataset
An n-by-p matrix A
n rows representing n objects
Each object has p numeric values describing it.

Goal

1. Understand the structure of the data, e.g., the underlying process generating the data.

2. Reduce the number of features representing the data.


Unsupervised Learning
Example 1 - Market basket matrices

A is an n-by-p matrix: n customers (rows) and p products (columns), e.g., milk, bread, rice, etc.

Aij = quantity of the j-th product purchased by the i-th customer.

Aim: find a subset of the products that characterizes customer behavior.
Unsupervised Learning
Example 2 - Social-network matrices

A is an n-by-p matrix: n users (rows) and p groups (columns), e.g., BU group, opera, etc.

Aij = participation of the i-th user in the events of the j-th group.

Aim: find a subset of the groups that accurately clusters social-network users.
Unsupervised Learning
Example 3 - Document matrices

A is an n-by-p matrix: n documents (rows) and p terms (columns), e.g., theorem, proof, etc.

Aij = frequency of the j-th term in the i-th document.

Aim: find a subset of the terms that accurately clusters the documents.
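As a small illustration of such a document matrix, here is a hedged sketch using scikit-learn's CountVectorizer; the three toy documents below are made up for this example, not taken from the slides.

from sklearn.feature_extraction.text import CountVectorizer

# Toy documents (assumed for illustration)
docs = ["the proof of the theorem",
        "a new proof technique",
        "stock markets fell sharply"]

vec = CountVectorizer()
A = vec.fit_transform(docs).toarray()   # n documents x p terms, Aij = frequency of term j in document i

print(vec.get_feature_names_out())      # the p terms (columns of A)
print(A)                                # the n-by-p document matrix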
Unsupervised Learning
Example 4 - Recommendation systems

A is an n-by-p matrix: n customers (rows) and p products (columns).

Aij = frequency of the j-th product among the products bought by the i-th customer.

Aim: find a subset of the products that accurately describes the behavior of the customers.
Overview

➢Singular Value Decomposition (SVD)

➢Principal Components Analysis (PCA)

Singular Value Decomposition (SVD)

Data matrices have n rows (one for each object) and p columns (one for each feature).

Rows are vectors in a Euclidean space. For example, X = [[1, 0.5], [0.5, 1]] describes two objects, d and x, each by two features.

Two objects are "close" if the angle θ(d, x) between their corresponding vectors is small.

(Figure: the row vectors d and x plotted in the feature 1 / feature 2 plane, with the angle θ(d, x) between them.)
SVD
Example
Input: 2-dimensional points.

Output:

1st (right) singular vector: direction of maximal variance.

2nd (right) singular vector: direction of maximal variance, after removing the projection of the data along the first singular vector.

(Figure: scatter plot of the 2-d points with the 1st and 2nd right singular vectors drawn as arrows.)
SVD
Singular values

σ1: measures how much of the data variance is explained by the first singular vector.

σ2: measures how much of the data variance is explained by the second singular vector.

(Figure: the same scatter plot, with σ1 shown along the 1st singular vector and σ2 along the 2nd singular vector.)
SVD
SVD decomposition

A = U · Σ · VT

where A is n×p, U is n×ℓ, Σ is ℓ×ℓ, and VT is ℓ×p.

U (V): orthogonal matrix containing the left (right) singular vectors of A.

Σ: diagonal matrix containing the singular values of A: σ1 ≥ σ2 ≥ … ≥ σℓ.

V' = VT: the transpose of V.
SVD
SVD and Rank-k approximations

A = U Σ VT

(Figure: the leading singular values and vectors capture the significant structure in A, while the trailing ones mostly capture noise.)
SVD
Rank-k approximations (Ak)

Ak = Uk Σk VkT

where Ak is n×p, Uk is n×k, Σk is k×k, and VkT is k×p.

Uk (Vk): orthogonal matrix containing the top k left (right) singular vectors of A.

Σk: diagonal matrix containing the top k singular values of A.

Ak is the best rank-k approximation of A.
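A minimal numpy sketch of a rank-k approximation; the matrix A and the choice k = 1 below are only illustrative (A is the same small matrix used in the Python example later in this chapter).

import numpy as np

A = np.array([[2.0, 0.0, 2.0],
              [1.0, 2.0, -1.0]])

# Economy-size SVD of A
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 1                                          # keep only the top-k singular values/vectors
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k approximation Ak = Uk Sk Vk'

print(A_k)
print(np.linalg.norm(A - A_k))   # Frobenius error = sqrt of the sum of squared dropped singular values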
SVD
Example computation:
Find the SVD of A, UΣVT, where A is the same matrix used in the Python example below:

A = [[2, 0, 2],
     [1, 2, -1]]

Calculate the square matrix AA':

AA' = [[8, 0],
       [0, 6]]

Find the eigenvalues of AA':

The eigenvalues of AA' are 8 and 6.


SVD
Example computation:
Find the eigenvector of AA’ for the eigenvalue λ=8:

solve

The eigenvector of AA’ for 8 is: u1 = (already normalized)


SVD
Example computation:

Similarly, the eigenvector of AA' for λ = 6 is u2 = (0, 1)' (already normalized).
SVD
Example computation:
Calculate the square matrix A’ A

Find its eigenvalues:

The eigenvalues of A’A are 8, 6 and 0


SVD
Example computation:
For A’A the eigenvalues are λ=8, 6 and 0:

The eigenvector of AAT for λ= 8 is:

For λ= 6:
(all not normalized)

For λ=0:
SVD
Example computation:

After normalization:

v1 = (1/√2)(1, 0, 1)', v2 = (1/√6)(1, 2, -1)', v3 = (1/√3)(-1, 1, 1)'.
SVD
Example computation:
Finally, the SVD of A is A = UΣVT, with

U = [[1, 0],
     [0, 1]]

Σ = [[√8, 0, 0],
     [0, √6, 0]]

VT = [[1/√2, 0, 1/√2],
      [1/√6, 2/√6, -1/√6],
      [-1/√3, 1/√3, 1/√3]]

(Singular vectors are determined only up to sign, so a software computation, such as the Python example on the next slide, may return some of them with flipped signs.)
SVD
Python Example

import numpy as np
from scipy.linalg import svd

A = np.array([[2, 0, 2], [1, 2, -1]])

U, S, Vt = svd(A)

print(A)
print(U)
print(S)
print(Vt)

Output:

[[ 2  0  2]
 [ 1  2 -1]]
[[-1.00000000e+00  1.17756934e-16]
 [ 1.17756934e-16  1.00000000e+00]]
[2.82842712 2.44948974]
[[-0.70710678  0.         -0.70710678]
 [ 0.40824829  0.81649658 -0.40824829]
 [-0.57735027  0.57735027  0.57735027]]
Overview

➢Singular Value Decomposition (SVD)

➢Principal Components Analysis (PCA)

Principal Components Analysis (PCA)

• PCA produces a low-dimensional representation of a dataset. It finds a sequence of linear combinations of the variables that have maximal variance and are mutually uncorrelated.

• PCA can be used for dimensionality reduction and as a tool for data visualization.
PCA
PCA and SVD
• PCA is SVD performed on centered data.

• PCA looks for the direction such that the data projected onto it has maximal variance.

• PCA/SVD continues by seeking the next direction that is orthogonal to all previously found directions.

• All directions are mutually orthogonal.


PCA
Details
The first principal component of a set of features X1, X2, …, Xp is the normalized linear combination of the features

Z1 = φ11 X1 + φ21 X2 + … + φp1 Xp

that has the largest variance.

Normalized means that the squared loadings sum to one: φ11² + φ21² + … + φp1² = 1.

The elements φ11, φ21, …, φp1 are the loadings of the first principal component; together, the loadings make up the principal component loading vector φ1 = (φ11, φ21, …, φp1)'.
PCA
Further PCs
• The second principal component is the linear combination of X1, …, Xp that has maximal variance among all linear combinations that are uncorrelated with Z1.

• The second principal component scores z12, z22, …, zn2 take the form

zi2 = φ12 xi1 + φ22 xi2 + … + φp2 xip,

where φ2 is the second principal component loading vector, with elements φ12, φ22, …, φp2.
PCA
Geometry

• The loading vector φ1 with elements φ11, φ21, …, φp1 defines a direction in feature space along which the data vary the most.

• If we project the n data points x1, …, xn onto this direction, the projected values z11, …, zn1 are the principal component scores themselves.
PCA
Example

The population size (pop) and ad spending (ad) for 100 different cities are
shown as purple circles. The green solid line indicates the first principal
component direction, and the blue dashed line indicates the second principal
component direction.
PCA
Computation of PCs

• Given an n × p dataset X.

• Define M as the centered version of X (the column means of M are zero).

• The principal component directions of X are the eigenvectors of M'M.

• If the SVD of M is UDV', the principal component directions are the right singular vectors of M (the columns of V), and the principal component scores are given by UD.
PCA
Further PCs

It turns out that constraining Z2 to be uncorrelated with Z1 is equivalent to constraining the direction φ2 to be orthogonal (perpendicular) to the direction φ1, and similarly for later components.
PCA
Illustration

• USA arrests data: For each of the 50 states in the United States, the data set
contains the number of arrests per 100,000 residents for each of three crimes:
Assault, Murder, and Rape. UrbanPop (% of the population in each state
living in urban areas) is also recorded.

• n = 50, p = 4.

• PCA was performed


PCA
USA arrests data: plot with first 2 PCs
PCA
Figure details

The first two principal components for the US Arrests data.

• The blue state names represent the scores for the first two principal
components.

• The orange arrows indicate the first two principal component loading
vectors (with axes on the top and right).

• This figure is known as a biplot, because it displays both the principal


component scores and the principal component loadings.
PCA
PCA loadings

(Table of the loading values of the four variables on the first two principal components; these correspond to the orange arrows in the biplot above.)
PCA
Scaling of the variables matters
If the variables are in different units, scaling each to have standard deviation
equal to one is recommended.
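A short hedged sketch of this advice with scikit-learn; StandardScaler and PCA are standard sklearn classes, and the random array X below is only a placeholder for real data.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Placeholder data (assumed): 50 observations, 4 variables, one on a much larger scale
X = np.random.default_rng(0).normal(size=(50, 4))
X[:, 1] *= 100

Z = StandardScaler().fit_transform(X)    # each column rescaled to mean 0, standard deviation 1
pca = PCA().fit(Z)
print(pca.explained_variance_ratio_)     # PVE no longer dominated by the large-scale variable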
PCA
Proportion of variance explained (PVE)

• To understand the strength of each component, we are interested in knowing the proportion of variance explained (PVE) by each one.

• The total variance present in a data set (with centered variables) is

Var_total = Σ (j = 1..p) (1/n) Σ (i = 1..n) xij²,

and the variance explained by the mth PC is

Var(Zm) = (1/n) Σ (i = 1..n) zim².

• We have the following: the variances of all the principal components sum to the total variance, so the PVEs sum to one.


PCA
Proportion of variance explained (PVE)

The PVE of the mth PC is

PVEm = Σ (i = 1..n) zim² / Σ (j = 1..p) Σ (i = 1..n) xij².

Example: plots of the PVE of each principal component and of the cumulative PVE (a scree plot).
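A minimal numpy sketch of the PVE formula above, using an assumed toy matrix X; the results should match sklearn's explained_variance_ratio_.

import numpy as np

# Toy data (assumed): n = 5 observations, p = 2 features
X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])

M = X - X.mean(axis=0)                   # centered variables x_ij
U, D, Vt = np.linalg.svd(M, full_matrices=False)

Z = M @ Vt.T                             # principal component scores z_im
total_var = (M**2).sum()                 # sum over i and j of x_ij^2
pve = (Z**2).sum(axis=0) / total_var     # PVE of each PC
print(pve)                               # individual PVEs (sum to one)
print(pve.cumsum())                      # cumulative PVE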


PCA
Example: The Digit dataset

1797 hand-written digits in the form of 8×8 images.

Each image is represented by a feature vector of length 64.


PCA
Using PCA:

• Project from 64 dimensions down to 2

• Visualization

https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html
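A hedged sketch of this projection and visualization with scikit-learn, in the spirit of the linked handbook example; load_digits and PCA are standard sklearn utilities.

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()                       # 1797 images, each flattened to 64 features
pca = PCA(n_components=2)                    # project from 64 dimensions down to 2
projected = pca.fit_transform(digits.data)

# Scatter plot of the 2-d projection, colored by the true digit label
plt.scatter(projected[:, 0], projected[:, 1], c=digits.target, cmap='tab10', s=5)
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.colorbar(label='digit')
plt.show()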