ML Lectures 2022 Part 1


CU5004

Machine Learning
in
Communication Networks
Machine Learning (ML)
• ML is a branch of artificial intelligence (AI):
– Uses computing based systems to make sense out
of data
• Extracting patterns, fitting data to functions, classifying
data, etc
– ML systems can learn and improve
• With historical data, time and experience
– Bridges theoretical computer science and real, noisy data

ML in real-life

OBJECTIVES:

To enable the student to understand the concepts of machine learning and their
application in wireless communication and biomedicine.
To familiarize the student with a set of well-known supervised, semi-supervised
and unsupervised learning algorithms.

COURSE OUTCOMES:

At the end of the course the student would be


CO1: Demonstrate understanding of the mathematical principles underlying machine
learning.
CO2: Familiar with the different machine learning techniques and their use cases.
CO3: In a position to formulate machine learning problems corresponding to different
applications.
CO4: Able to recognize the characteristics of machine learning techniques that are
useful to solve real-world problems.
CO5: In a position to read current research papers, understand the issues and the
machine learning based solution approaches.
UNIT I MATHEMATICAL BACKGROUND 9
Linear Algebra – Arithmetic of matrices, Norms, Eigen decomposition, Singular value decomposition, Pseudo
inverse, Principal Component analysis. Probability theory – probability distribution, conditional probability,
Chain rule, Bayes rule, Information theory, Structured Probabilistic models.

UNIT II MACHINE LEARNING BASICS 9


Supervised and Unsupervised learning, Capacity, Overfitting and Underfitting, Cross Validation, Linear
regression, Logistic Regression, Regularization, Naive Bayes, Support Vector Machines (SVM), Decision tree,
Random forest, K-Means Clustering, k nearest neighbor.

UNIT III NEURAL NETWORKS 9


Feedforward Networks , Backpropagation, Convolutional Neural Networks-LeNet, AlexNet, ZF-Net, VGGNet,
GoogLeNet, ResNet, Visualizing Convolutional Neural Networks, Guided Backpropagation, Deep Dream, Deep
Art, Fooling Convolutional Neural Networks, Recurrent Neural Network (RNN) – Backpropagation through time
(BPTT), Vanishing and Exploding Gradients.

UNIT IV ML IN WIRELESS AND SECURITY 10


Water-filling power allocation, Optimization for MIMO Systems, OFDM Systems and MIMO-OFDM systems.
Optimization in beamformer design – Robust receive beamforming, Transmit downlink beamforming.
Application: Radar for target detection, Array Processing, MUSIC, ML in Side channel analysis.

UNIT V ML IN BIO-MEDICAL 10
Machine Learning in Medical Imaging. Deep Learning for Health Informatics. Deep Learning Automated ECG
Noise Detection and Classification System for Unsupervised Healthcare Monitoring, Techniques for Electronic
Health Record (EHR) Analysis.
Machine Learning as a Process
• Define Objectives
  – Define measurable and quantifiable goals
  – Use this stage to learn about the problem
• Data Preparation
  – Normalization
  – Transformation
  – Missing Values
  – Outliers
• Model Building
  – Data Splitting
  – Features Engineering
  – Estimating Performance
  – Evaluation and Model Selection
• Model Evaluation
  – Study the models' accuracy
  – Work better than the naïve approach or previous system
  – Do the results make sense in the context of the problem
• Model Deployment

ML as a Process: Data Preparation
• Needed for several reasons
  • Some models have strict data requirements
    • Scale of the data, data point intervals, etc.
  • Some characteristics of the data may dramatically impact model performance
• Time spent on data preparation should not be underestimated

Pipeline: Raw Data (missing values, error values, different scales,
dimensionality, type problems, many others) → Data Transformation (scaling,
centering, skewness, outliers, missing values, errors) → Data ready for the
Modeling phase

ML as a Process: Feature Engineering
• Determining the predictors (features) to be used is one of the most critical
  questions
• Sometimes we need to add predictors
• Reduce the number of predictors:
  • Fewer predictors give a more interpretable model and are less costly
  • Most models are affected by high dimensionality, especially by
    non-informative predictors
  • Binning predictors

Feature selection approaches:
• Wrappers – multiple models, adding and removing parameters; algorithms that
  use models as input and performance as output (e.g. genetic algorithms)
• Filters – evaluate the relevance of each predictor, based normally on
  correlations

ML as a Process: Model Building
• Data Splitting
  – Allocate data to different tasks
    • model training
    • performance evaluation
  – Define Training, Validation and Test sets
• Feature Selection (review the decisions made previously)
• Estimating Performance
  – Visualization of results – discovery of interesting areas of the
    problem space
  – Statistics and performance measures
• Evaluation and Model Selection
  – The 'no free lunch' theorem: no a priori assumptions can be made
  – Avoid relying on favorite models when they are not justified

• Transfer learning: a machine learning method where we reuse a pre-trained
  model as the starting point for a model on a new task.
• Collective classification: combined classification of inter-linked objects
  using label-attribute correlations and label-label neighbor correlations.
• Accuracy: number of correct predictions made as a ratio of all predictions
  made → (TP+TN)/(TP+FP+FN+TN)
• Precision: fraction of items returned by our ML model that are correct
  (e.g. document retrieval) → TP/(TP+FP)
• Recall or Sensitivity: fraction of actual positives returned by our ML
  model → TP/(TP+FN)
• Specificity: fraction of actual negatives correctly returned by our ML
  model → TN/(TN+FP)
• Margin: the distance between the two hyperplanes that separate
  linearly-separable classes of data points
• Squared Error: average squared difference between the estimated values and
  the actual value
• Likelihood: ~ probability or chances of something happening
• Posterior Probability: updated probability of an event occurring after
  taking into consideration new information
• Entropy: the randomness or disorder of the information being processed in
  machine learning (LOW → purity of data)
• K-L Divergence: the Kullback-Leibler divergence score, or KL divergence
  score, quantifies how much one probability distribution differs from
  another probability distribution (LOW)
• Cost / Utility: cost-utility analysis (CUA) is one type of economic
  evaluation that can help you compare the costs and effects of alternative
  interventions
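To make the confusion-matrix formulas above concrete, here is a minimal Python
sketch; the TP/FP/FN/TN counts are made-up illustrative values, not numbers
from the slides:

```python
# Hypothetical binary confusion-matrix counts (illustrative values only)
TP, FP, FN, TN = 40, 10, 5, 45

accuracy    = (TP + TN) / (TP + FP + FN + TN)   # correct predictions / all predictions
precision   = TP / (TP + FP)                    # how many returned positives are correct
recall      = TP / (TP + FN)                    # how many actual positives are returned
specificity = TN / (TN + FP)                    # how many actual negatives are returned

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} specificity={specificity:.3f}")
```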
ML in Communication Network design
•Machine/deep learning for signal detection, channel modeling, estimation,
interference mitigation, and decoding.
•Resource and network optimization using machine learning techniques.
•Distributed learning algorithms and implementations over realistic
communication networks.
•Machine learning techniques for application/user behavior prediction and
user experience modeling and optimization.
•Machine learning techniques for anomaly detection in communication
networks.
•Machine learning for emerging communication systems and applications,
such as drone systems, IoT, edge computing, caching, smart cities, and
vehicular networks.
•Machine learning for transport-layer congestion control.
•Machine learning for integrated radio frequency/non-radio frequency
communication systems.
ML in Communication Network design
•Machine learning techniques for information-centric networks and data
mining.
•Machine learning for network slicing, network virtualization, and software
defined networking.
•Performance analysis and evaluation of machine learning techniques in
wired/wireless communication systems.
•Scalability and complexity of machine learning in networks.
•Techniques for efficient hardware implementation of neural networks in
communications.
•Synergies between distributed/federated learning and communications.
•Secure machine learning over communication networks.
Example
• A vector in R^n is an ordered set of n real numbers
  – e.g. v = (1, 6, 3, 4) is in R^4
  – As a column vector: [1, 6, 3, 4]^T
  – As a row vector: [1 6 3 4]
• An m-by-n matrix is an object in R^(m x n) with m rows and n columns, each
  entry filled with a (typically) real number, e.g.
      [ 1   2  8 ]
      [ 4  78  6 ]
      [ 9   3  2 ]
• Tensor: an array with more than two axes; the element of A at coordinates
  (i, j, k) is written A_{i,j,k}
Norms
• Vector norms: a norm ||x|| is informally a measure of the "length" of the
  vector
• Common norms: L1, L2 (Euclidean), L-infinity
  – L^p norm: ||x||_p = ( sum_i |x_i|^p )^(1/p)
  – The squared L2 norm is more convenient to work with mathematically and
    computationally
  – The L2 norm increases very slowly near the origin; the L1 norm is better
    able to discriminate between zero and small nonzero element values, which
    makes it well suited for many ML applications
• Max norm (L-infinity): the absolute value of the element with the largest
  magnitude, ||x||_inf = max_i |x_i|
• Frobenius norm (for matrices): ||A||_F = sqrt( sum_{i,j} A_{i,j}^2 ),
  analogous to the L2 norm of a vector
• Element-by-element multiplication of two matrices → Hadamard product
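A small NumPy sketch of these norms and of the Hadamard product; the vector
and matrix values below are arbitrary examples:

```python
import numpy as np

x = np.array([1.0, -3.0, 0.0, 2.0])
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])

l1   = np.linalg.norm(x, 1)        # L1 norm: sum of absolute values
l2   = np.linalg.norm(x, 2)        # L2 (Euclidean) norm
sql2 = x @ x                       # squared L2 norm, simply x.x
linf = np.linalg.norm(x, np.inf)   # max norm: largest absolute element
frob = np.linalg.norm(A, 'fro')    # Frobenius norm of a matrix

hadamard = A * B                   # element-by-element (Hadamard) product
print(l1, l2, sql2, linf, frob)
print(hadamard)
```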
We will use lower-case letters for vectors; the elements are referred to as x_i.

• Vector dot (inner) product: u•v = sum_i u_i v_i
  – If u•v = 0, ||u||_2 ≠ 0, ||v||_2 ≠ 0 → u and v are orthogonal
  – If u•v = 0, ||u||_2 = 1, ||v||_2 = 1 → u and v are orthonormal

• Vector outer product: u v^T, the matrix whose (i, j) entry is u_i v_j


Transpose: You can think of it as
– “flipping” the rows and columns
OR
– “reflecting” vector/matrix on line

T
e.g. a
   a b 
b
T
a b  a c 
    
c d  b d 
We will use upper-case letters for matrices. The elements are referred to
as A_{i,j}.

• Matrix product, e.g. for 2x2 matrices:

    A = [a11  a12] ,  B = [b11  b12]
        [a21  a22]        [b21  b22]

    AB = [a11*b11 + a12*b21   a11*b12 + a12*b22]
         [a21*b11 + a22*b21   a21*b12 + a22*b22]
Special matrices

    [a 0 0]                      [a b c]
    [0 b 0]   diagonal           [0 d e]   upper-triangular
    [0 0 c]                      [0 0 f]

    [a b 0 0]                    [a 0 0]
    [c d e 0]   tri-diagonal     [b c 0]   lower-triangular
    [0 f g h]                    [d e f]
    [0 0 i j]
Symmetric matrix → A = A^T

    [1 0 0]
    [0 1 0]   I (identity matrix)        Unit vector → vector with unit norm
    [0 0 1]

Orthogonal matrix → A^T A = A A^T = I ;  A^-1 = A^T
  – Its rows are mutually orthonormal and its columns are mutually orthonormal
Linear independence
• A set of vectors is linearly independent if none of them can be written as
  a linear combination of the others.
• Vectors v1, ..., vk are linearly independent if c1*v1 + ... + ck*vk = 0
  implies c1 = ... = ck = 0:

      [ |   |   | ] [c1]   [0]
      [ v1  v2  v3] [c2] = [0]
      [ |   |   | ] [c3]   [0]

  This condition is central to determining the dimension of a vector space.

  e.g.   [1 0]           [0]
         [2 3] [u]   =   [0]   → (u, v) = (0, 0), i.e. the columns are
         [1 3] [v]       [0]     linearly independent.

  The columns of A are linearly independent if Ax = 0 only for x = 0.

• Linearly independent row vectors → none of the row vectors is a linear
  combination of the other row vectors
  (e.g. if x3 = −2*x1 + x2, the rows x1, x2, x3 are linearly dependent)
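One way to check linear independence numerically is via the matrix rank; a
minimal NumPy sketch using the 3x2 example above (the second matrix is a
made-up dependent example):

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [2.0, 3.0],
              [1.0, 3.0]])

# The columns are linearly independent iff rank equals the number of columns,
# i.e. Ax = 0 only for x = 0.
print(np.linalg.matrix_rank(A) == A.shape[1])   # True

# A dependent example: the third row equals -2*row1 + row2
B = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [2.0, 1.0, 0.0]])
print(np.linalg.matrix_rank(B))                 # 2, so the rows are dependent
```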
Span of a vector space
• The span of a set of vectors is the set of all points obtainable by linear
  combination of the original vectors.
• If all vectors in a vector space may be expressed as linear combinations of
  a set of vectors v1, ..., vk, then we can say that v1, ..., vk spans that
  vector space.
• The cardinality of this set is the dimension of the vector space.
  – (Example on slide: expressing a vector such as (2, 2) in terms of
    different spanning sets, and the standard basis (1,0,0), (0,1,0), (0,0,1)
    of R^3.)
• A basis is a maximal set of linearly independent vectors / a minimal set of
  spanning vectors of a vector space.
Rank of a Matrix
• rank(A) (the rank of an m-by-n matrix A) is
  = the maximal number of linearly independent columns
  = the maximal number of linearly independent rows
  = the dimension of col(A)
  = the dimension of row(A)
• If A is m by n, then
  – rank(A) <= min(m, n)
  – If rank(A) = m, then A has full row rank
  – If rank(A) = n, then A has full column rank
Inverse of a matrix
• The inverse of a square matrix A, denoted by A^-1, is the unique matrix
  such that
  – A A^-1 = A^-1 A = I (identity matrix)
• If A^-1 and B^-1 exist, then
  – (AB)^-1 = B^-1 A^-1
  – (A^T)^-1 = (A^-1)^T
• For orthogonal (orthonormal) matrices: A^-1 = A^T
• For diagonal matrices: the inverse is diagonal, with the reciprocals of the
  diagonal entries
Dimensions

Reference: Thomas Minka, "Old and New Matrix Algebra Useful for Statistics"

Exercise: prove the following using example vectors/matrices
(see http://matrixcookbook.com/)
Matrix Inverse
• A ∈ R^(m×n), b ∈ R^m is a known vector, and x ∈ R^n is a vector of unknown
  variables
• For Ax = b to have a solution, b should be in the span of the columns of A
  → the column space (range) of A
• A^-1 exists only if Ax = b has exactly one solution for every value of b
  → requires n = m
• A must be a square matrix with all columns linearly independent
• Singular → a square matrix whose columns are linearly dependent
Eigen Decomposition
• Decompose a matrix into a set of eigenvectors and eigenvalues
• An eigenvector of a square matrix A is a nonzero vector v such that
  multiplication by A alters only the scale of v:
      Av = λv
• The scalar λ is the eigenvalue corresponding to this eigenvector
• Scaled eigenvectors have the same eigenvalues, so we usually work with
  unit eigenvectors
Eigen Decomposition
• Suppose matrix A has n linearly independent eigenvectors {v(1), ..., v(n)}
  with corresponding eigenvalues {λ1, ..., λn}
• Matrix V → concatenation of all the eigenvectors, one eigenvector per column
• Vector λ = [λ1, ..., λn]^T → concatenation of the eigenvalues
• Eigen decomposition of A → A = V diag(λ) V^-1
• Every real symmetric matrix can be decomposed into an expression using only
  real-valued eigenvectors and eigenvalues:
      A = Q Λ Q^T
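A short NumPy illustration of the eigendecomposition of a real symmetric
matrix (the matrix here is an arbitrary example):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])          # real symmetric matrix

eigvals, Q = np.linalg.eigh(A)      # eigh is intended for symmetric/Hermitian matrices
Lam = np.diag(eigvals)

# Reconstruct A = Q Λ Q^T and verify A v = λ v for one eigenpair
print(np.allclose(Q @ Lam @ Q.T, A))                    # True
print(np.allclose(A @ Q[:, 0], eigvals[0] * Q[:, 0]))   # True
```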
Eigen Value Decomposition
• Effect of eigenvectors and eigenvalues
  – Eigenvectors define the orthonormal directions v(1) and v(2)
  – Eigenvalues stretch the unit circle along the respective directions by
    their values
• The eigendecomposition is unique only if all the eigenvalues are unique
• Singular → if and only if any of the eigenvalues are zero
• Positive definite → matrix whose eigenvalues are all positive
  (x^T A x = 0 implies x = 0)
• Positive semidefinite → matrix whose eigenvalues are all positive or zero
  valued (x^T A x >= 0 for all x)
• Negative definite → matrix whose eigenvalues are all negative
• Negative semidefinite → matrix whose eigenvalues are all negative or zero
  valued
Singular Value Decomposition
• SVD enables us to discover similar kinds of information as the
  eigendecomposition reveals
• Every real matrix has a singular value decomposition, but the same is not
  true of the eigenvalue decomposition - if a matrix is not square, the
  eigendecomposition is not defined, and we must use a singular value
  decomposition instead
• Any matrix A can be SVD-decomposed as A = U D V^T, where
  – D is a diagonal matrix with d = rank(A) non-zero elements; these
    elements are called the singular values
  – If A is an m × n matrix, then U is defined to be an m × m matrix, D an
    m × n matrix, and V an n × n matrix
  – The columns of U are known as the left-singular vectors
  – The columns of V are known as the right-singular vectors
  – The first d columns of U are an orthogonal basis for col(A)
  – The first d columns of V are an orthogonal basis for row(A)
Singular Value Decomposition (SVD)

• Applications of the SVD
  – Matrix Pseudoinverse
  – Low-rank matrix approximation
  – MIMO Systems
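A minimal NumPy sketch of the SVD and of a rank-1 (low-rank) approximation,
using an arbitrary non-square example matrix:

```python
import numpy as np

A = np.array([[ 3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])          # 2 x 3, so no eigendecomposition exists

U, s, Vt = np.linalg.svd(A)               # s holds the singular values
D = np.zeros_like(A)
D[:len(s), :len(s)] = np.diag(s)

print(np.allclose(U @ D @ Vt, A))         # exact reconstruction: A = U D V^T

# Low-rank approximation: keep only the largest singular value
A1 = s[0] * np.outer(U[:, 0], Vt[0, :])
print(A1)                                 # best rank-1 approximation of A
```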
The Moore-Penrose Pseudoinverse
• Matrix "inversion" for non-square matrices
• Moore-Penrose pseudoinverse → A^+ = lim_{α→0} (A^T A + αI)^-1 A^T
• Practical algorithm → A^+ = V D^+ U^T
  where U, D and V come from the singular value decomposition of A, and the
  pseudoinverse D^+ of a diagonal matrix D is obtained by taking the
  reciprocal of its nonzero elements and then taking the transpose of the
  resulting matrix.
• A taller than it is wide → possibly no solution; the pseudoinverse gives
  the solution x for which Ax is as close as possible to y in terms of the
  Euclidean norm
• A wider than it is tall → possibly multiple solutions; the pseudoinverse
  gives the solution with the smallest Euclidean norm among all possible
  solutions
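A sketch of the SVD-based pseudoinverse described above, checked against
NumPy's built-in pinv; the matrix is an arbitrary tall (full column-rank)
example:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])                     # taller than it is wide

U, s, Vt = np.linalg.svd(A, full_matrices=False)
D_plus = np.diag(1.0 / s)                      # reciprocal of the nonzero singular values
A_plus = Vt.T @ D_plus @ U.T                   # A^+ = V D^+ U^T

print(np.allclose(A_plus, np.linalg.pinv(A)))  # True

# Least-squares solution of Ax = y via the pseudoinverse
y = np.array([1.0, 0.0, 1.0])
x = A_plus @ y
print(x)
```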
Low-rank Matrix Inversion
• In many applications (e.g. linear regression, Gaussian models) we need to
  calculate the inverse of the covariance matrix X^T X
  (each row of the n-by-m matrix X is a data sample)
• If the number of features is huge (e.g. each sample is an image,
  #samples n << #features m), inverting the m-by-m matrix X^T X becomes a
  problem
• The complexity of matrix inversion is generally O(n^3)
• Matlab can comfortably solve matrix inversion with m of the order of
  thousands, but not much more than that
Low-rank Matrix Inversion
• With the help of the SVD, we actually do NOT need to explicitly invert X^T X
  – Decompose X = U D V^T
  – Then X^T X = V D U^T U D V^T = V D^2 V^T
  – Since V D^2 V^T · V (D^2)^-1 V^T = I
  – We know that (X^T X)^-1 = V (D^2)^-1 V^T
  – Inverting the diagonal matrix D^2 is trivial
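A quick NumPy check of the identity (X^T X)^-1 = V (D^2)^-1 V^T. Note this
assumes X has full column rank so that X^T X is invertible; the data matrix
below is random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))              # n samples x m features, full column rank

U, d, Vt = np.linalg.svd(X, full_matrices=False)

inv_direct = np.linalg.inv(X.T @ X)            # explicit inversion (what we want to avoid)
inv_svd    = Vt.T @ np.diag(1.0 / d**2) @ Vt   # V (D^2)^-1 V^T: only a diagonal to invert

print(np.allclose(inv_direct, inv_svd))        # True
```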
Parallel Decomposition of MIMO Channels
Singular value decomposition of the channel matrix: H = U Σ V^H

System model (M_T transmit antennas, M_R receive antennas):

    [ y_1(k)   ]   [ h_11     ...  h_1,M_T   ] [ s_1(k)   ]   [ n_1(k)   ]
    [   ...    ] = [  ...     ...    ...     ] [   ...    ] + [   ...    ]
    [ y_M_R(k) ]   [ h_M_R,1  ...  h_M_R,M_T ] [ s_M_T(k) ]   [ n_M_R(k) ]

After SVD-based precoding at the transmitter and combining at the receiver,
the channel is diagonalised into parallel sub-channels:

    [ y~_1(k)   ]   [ σ_1  0  ...  0  ] [ s~_1(k)   ]   [ n~_1(k)   ]
    [   ...     ] = [  0   .  ...  0  ] [   ...     ] + [   ...     ]
    [ y~_M_R(k) ]   [  0   0  ... σ_M ] [ s~_M_T(k) ]   [ n~_M_R(k) ]

SVD → Singular Value Decomposition:  H = U Σ V^H
Precoding in MIMO

    Y~ = U^H Y ;    Y = H X + n
       = U^H [ U Σ V^H X + n ]
       = (U^H U) Σ V^H X + U^H n
       = Σ X~ + n~ ,

    where X~ = V^H X (equivalently X = V X~) and n~ = U^H n
    [Note: var(n~_i) = var(n_i), since U is unitary]

    Y~ = Σ X~ + n~
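An illustrative NumPy sketch of how the SVD turns a MIMO channel into parallel
sub-channels; the channel matrix, symbols and noise below are randomly
generated stand-ins, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)
Mr, Mt = 3, 3
H = (rng.standard_normal((Mr, Mt)) + 1j * rng.standard_normal((Mr, Mt))) / np.sqrt(2)

U, sigma, Vh = np.linalg.svd(H)                 # H = U Sigma V^H

x_tilde = rng.standard_normal(Mt)               # symbols for the parallel sub-channels
x = Vh.conj().T @ x_tilde                       # precoding: x = V x~
n = 0.01 * (rng.standard_normal(Mr) + 1j * rng.standard_normal(Mr))

y = H @ x + n                                   # physical channel
y_tilde = U.conj().T @ y                        # receive combining: y~ = U^H y

# y~ equals Sigma x~ plus noise whose variance is unchanged by the unitary U
print(np.allclose(y_tilde, sigma * x_tilde + U.conj().T @ n))   # True
```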
Principal Component Analysis (PCA)
Factor or Component Analysis
• Discover a new set of factors/dimensions/axes against which to represent, describe
or evaluate the data
– For more effective reasoning, insights, or better visualization
– Reduce noise in the data
– Typically a smaller set of factors: dimension reduction
– Better representation of data without losing much information
– Can build more effective data analyses on the reduced-dimensional
space: classification, clustering, pattern recognition

• Factors are combinations of observed variables


– May be more effective bases for insights, even if physical meaning is
obscure
– Observed data are described in terms of these factors rather than in
terms of original variables/dimensions
Motivation
• Clustering
– One way to summarize a complex real-valued data set with a
single categorical variable
• Dimensionality reduction
– Another way to simplify complex high-dimensional data
– Summarize data with a lower dimensional real valued vector

• Given data points in d dimensions


• Convert them to data points in r<d
dimensions
• With minimal loss of information
Dimensionality reduction
• PCA (Principal Component Analysis):
– Find projection that maximize the variance

• ICA (Independent Component Analysis):


– Very similar to PCA except that it assumes non-Gaussian
features

• Multidimensional Scaling:
– Find projection that best preserves inter-point distances

• LDA (Linear Discriminant Analysis):


– Maximizing the component axes for class-separation
• …
Basic Concept
• Areas of variance in data are where items can be best discriminated and
key underlying phenomena observed
– Areas of greatest “signal” in the data
• If two items or dimensions are highly correlated or dependent
– They are likely to represent highly related phenomena
– If they tell us about the same underlying variance in the data,
combining them to form a single measure is reasonable
• Parsimony
• Reduction in Error
• Combine related variables, and focus on uncorrelated or independent
ones, especially those along which the observations have high variance
• Smaller set of variables that explain most of the variance in the original
data, in more compact and insightful form
Factor Analysis
• What if the dependences and correlations are not so strong or
direct?
• Suppose you have 3 variables, or 4, or 5, or 10000?
• Look for the phenomena underlying the observed covariance/co-
dependence in a set of variables
– Once again, phenomena that are uncorrelated or independent, and
especially those along which the data show high variance

• These phenomena are called “factors” or “principal components” or


“independent components,” depending on the methods used
– Factor analysis: based on variance/covariance/correlation
– Independent Component Analysis: based on independence
Example: Data Compression
• Reduce data from 2D to 1D (e.g. the same length measured in inches and in cm)
• Reduce data from 3D to 2D
(Figures adapted from Andrew Ng's slides)
What are the new axes?

(Figure: data plotted against Original Variable A and Original Variable B,
with the principal component directions PC 1 and PC 2 overlaid)

• Orthogonal directions of greatest variance in the data
• Projections along PC1 discriminate the data most along any one axis
Principal Components
• First principal component is the direction of greatest
variability (covariance) in the data

• Second is the next orthogonal (uncorrelated)


direction of greatest variability
– So first remove all the variability along the first component,
and then find the next direction of greatest variability

• And so on …
PCA Summary
• PCA is "an orthogonal linear transformation that transforms the data
to a new coordinate system such that the greatest variance by any
projection of the data comes to lie on the first coordinate (first
principal component), the second greatest variance lies on the
second coordinate (second principal component), and so on.”
• Most common form of factor analysis
• The new variables/dimensions
– Are linear combinations of the original ones
– Are uncorrelated with one another
» Orthogonal in original dimension space
– Capture as much of the original variance in the data as
possible
– Are called Principal Components
PCA Summary

• Principle
– Linear projection method to reduce the number of parameters
– Transform a set of correlated variables into a new set of uncorrelated
variables
– Map the data into a space of lower dimensionality
– Form of unsupervised learning

• Properties
– It can be viewed as a rotation of the existing axes to new positions in
the space defined by original variables
– New axes are orthogonal and represent the directions with maximum
variability
Principal Component Analysis (PCA) problem formulation

• Reduce from 2 dimensions to 1 dimension: find a direction (a vector) onto
  which to project the data so as to minimize the projection error.
• Reduce from n dimensions to k dimensions: find k vectors onto which to
  project the data so as to minimize the projection error.
Covariance
• Variance and Covariance:
– Measure of the “spread” of a set of points around their
center of mass(mean)
• Variance:
– Measure of the deviation from the mean for points in
one dimension
• Covariance:
– Measure of how much each of the dimensions vary
from the mean with respect to each other

• Covariance is measured between two dimensions


• Covariance sees if there is a relation between two dimensions
• Covariance in one dimension is the variance
– Positive covariance: both dimensions increase or decrease together
– Negative covariance: as one increases, the other decreases
Computing the Components
• Data points are vectors in a multidimensional space
• Projection of vector x onto an axis (dimension) u is u.x
• Direction of greatest variability is that in which the average square of
the projection is greatest
– u : such that E((u.x)2) over all x is maximized
– (we subtract the mean along each dimension, and center the original axis
system at the centroid of all data points, for simplicity)
– This direction of u is the direction of the first Principal Component
Computing the Components
• E((u.x)2) = E ((u.x) (u.x)T) = E (u.x.x T.uT)
• The matrix C = x.xT contains the correlations (similarities)
of the original axes based on how the data values project
onto them
• So we are looking for u that maximizes uCuT, subject to u
being unit-length
• It is maximized when u is the principal eigenvector of the
matrix C, in which case
– uCuT = λuuT = λ if u is unit-length, where λ is the principal
eigenvalue of the correlation matrix C
– The eigenvalue denotes the amount of variability captured along
that dimension
Why the Eigenvectors?
Maximise u^T x x^T u subject to u^T u = 1

Construct the Lagrangian u^T x x^T u – λ u^T u

Set the vector of partial derivatives to zero:
    x x^T u – λu = (x x^T – λI) u = 0

As u ≠ 0, u must be an eigenvector of x x^T with eigenvalue λ
Eigenvectors of a Correlation Matrix
Computing the Components
• Similarly for the next axis, etc.
• So, the new axes are the eigenvectors of the matrix of
correlations of the original variables, which captures the
similarities/differences of the original variables based on
how data samples project to them

• Geometrically: centering followed by rotation


– Linear transformation
Relation to SVD
• The first root is called the principal eigenvalue which has an associated
orthonormal (uTu = 1) eigenvector u

• Subsequent roots are ordered such that λ1> λ2 >… > λM with rank(D)
non-zero values

• Eigenvectors form an orthonormal basis i.e. uiTuj = δij

• Eigenvalue decomposition of xxT = UDUT , where U = [u1, u2, …, uM]


and D = diag[λ 1, λ 2, …, λ M]

• Eigenvalue decomposition of xTx = VDVT

• SVD is closely related to the above  x=U D1/2 VT


where, U - left eigenvectors , V - right eigenvectors , and the Singular
values = Square root of eigenvalues
PCs, Variance and Least-Squares

• The first PC retains the greatest amount of variation in the


sample

• The kth PC retains the kth greatest fraction of the variation in


the sample

• The kth largest eigenvalue of the correlation matrix C is the


variance in the sample along the kth PC

• The least-squares view: PCs are a series of linear least


squares fits to a sample, each orthogonal to all previous ones
How Many PCs?
• For n original dimensions, the correlation matrix is n x n, and
has up to n eigenvectors. So at most n PCs.

• Where does dimensionality reduction come from?


Dimensionality Reduction
• Can ignore the components of lesser significance.

(Figure: bar chart of the variance (%) explained by each of PC1 through PC10,
decreasing from the first component toward the last)
• You do lose some information, but if the eigenvalues are small, you
don’t lose much
– n dimensions in original data
– calculate n eigenvectors and eigenvalues
– choose only the first p eigenvectors, based on their eigenvalues
– final data set has only p dimensions
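A sketch of this selection rule in NumPy: compute the eigenvalues of the
covariance matrix, look at the fraction of variance each PC explains, and keep
only the first p components. The data and the 95% threshold below are
arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 10)) @ rng.standard_normal((10, 10))  # correlated data

Xc = X - X.mean(axis=0)                        # center each dimension
C = np.cov(Xc, rowvar=False)                   # 10 x 10 covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]          # sort high to low

explained = eigvals / eigvals.sum()            # fraction of variance per PC
p = int(np.searchsorted(np.cumsum(explained), 0.95) + 1)    # keep ~95% of variance

Z = Xc @ eigvecs[:, :p]                        # final data set with only p dimensions
print(explained.round(3), p, Z.shape)
```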
PCA – mathematical background
• Suppose the attributes are A1 and A2, and we have n training examples.
  x's denote values of A1 and y's denote values of A2 over the training
  examples.

• Variance of an attribute:

      var(A1) = sum_{i=1}^{n} (x_i - x̄)^2 / (n - 1)

• Covariance of two attributes:

      cov(A1, A2) = sum_{i=1}^{n} (x_i - x̄)(y_i - ȳ) / (n - 1)

• If the covariance is positive, both dimensions increase together;
  if negative, as one increases, the other decreases;
  if zero: the two dimensions are independent of each other.

• Suppose we have n attributes A1, ..., An; the covariance matrix is

      C_{n×n} = (c_{i,j}),  where  c_{i,j} = cov(A_i, A_j)
Covariance
• Used to find relationships between dimensions in high-dimensional data sets
• The sample mean: x̄ = (1/n) sum_{i=1}^{n} x_i


Eigenvector and Eigenvalue
    Ax = λx

A: square matrix
x: eigenvector or characteristic vector
λ: eigenvalue or characteristic value

• The zero vector can not be an eigenvector
• The value zero can be an eigenvalue
Eigenvector and Eigenvalue
    Ax = λx

A: square matrix
x: eigenvector or characteristic vector
λ: eigenvalue or characteristic value

Example
Eigenvector and Eigenvalue
    Ax = λx   →   Ax - λx = 0   →   (A – λI)x = 0

If we define a new matrix B = A – λI, then Bx = 0.

If B had an inverse, then x = B^-1 · 0 = 0. BUT an eigenvector cannot be the
zero vector!

So x will be an eigenvector of A if and only if B does not have an inverse,
or equivalently det(B) = 0:
    det(A – λI) = 0
Eigenvector and Eigenvalue
Example 1: Find the eigenvalues of
    A = [ 2  -12 ]
        [ 1   -5 ]

    |λI – A| = (λ – 2)(λ + 5) + 12
             = λ^2 + 3λ + 2 = (λ + 1)(λ + 2)

    two eigenvalues: λ = −1, −2

Note: the roots of the characteristic equation can be repeated. That is,
λ1 = λ2 = ... = λk. If that happens, the eigenvalue is said to be of
multiplicity k.

Example 2: Find the eigenvalues of
    A = [ 2  1  0 ]
        [ 0  2  0 ]
        [ 0  0  2 ]

    |λI – A| = (λ – 2)^3 = 0   →   λ = 2 is an eigenvalue of multiplicity 3
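Both examples can be checked directly with NumPy:

```python
import numpy as np

A = np.array([[2.0, -12.0],
              [1.0,  -5.0]])

eigvals, eigvecs = np.linalg.eig(A)
print(np.sort(eigvals))        # [-2., -1.], matching (λ + 1)(λ + 2) = 0

B = np.array([[2.0, 1.0, 0.0],
              [0.0, 2.0, 0.0],
              [0.0, 0.0, 2.0]])
print(np.linalg.eigvals(B))    # [2., 2., 2.] -> eigenvalue 2 with multiplicity 3
```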
Principal Component Analysis
• Input: a set of data vectors x, each of dimension D
• A set of basis vectors
• Summarize / represent a D-dimensional vector x with a K-dimensional
  feature vector h(x)
Principal Component Analysis
• The basis vectors are orthonormal
• New data representation: h(x)
• Empirical mean of the data: subtracted from the data before projection

PCA – Steps
PCA – Covariance Example

    [ cov(H, H)  cov(H, M) ]   [ var(H)   104.5  ]   [  47.7   104.5 ]
    [ cov(M, H)  cov(M, M) ] = [ 104.5   var(M)  ] = [ 104.5   370   ]

    Covariance matrix
PCA - Example
1. Given the original data set S = {x1, ..., xk}, produce a new set by
   subtracting the mean of each attribute Ai from each xi.
   (Original means: x̄ = 1.81, ȳ = 1.91; after centring, both means are 0.)

2. Calculate the covariance matrix of the centred x and y values.

3. Calculate the (unit) eigenvectors and eigenvalues of the covariance
   matrix. The eigenvector with the largest eigenvalue traces the linear
   pattern in the data.
4. Order the eigenvectors by eigenvalue, highest to lowest.

    v1 = [ -.677873399 ]      λ1 = 1.28402771
         [ -.735178956 ]

    v2 = [ -.735178956 ]      λ2 = .0490833989
         [  .677873399 ]

   In general, you get n components. To reduce dimensionality to p, ignore
   the n - p components at the bottom of the list.
   Construct the new feature vector:
   FeatureVector = (v1, v2, ... vp)

    FeatureVector1 = [ -.677873399  -.735178956 ]
                     [ -.735178956   .677873399 ]

   or the reduced-dimension feature vector:

    FeatureVector2 = [ -.677873399 ]
                     [ -.735178956 ]

5. Derive the new data set.

    TransformedData = RowFeatureVector × RowDataAdjust

    RowFeatureVector1 = [ -.677873399  -.735178956 ]
                        [ -.735178956   .677873399 ]

    RowFeatureVector2 = [ -.677873399  -.735178956 ]

    RowDataAdjust =
      [ .69  -1.31  .39  .09  1.29  .49   .19  -.81  -.31   -.71 ]
      [ .49  -1.21  .99  .29  1.09  .79  -.31  -.81  -.31  -1.01 ]

   This gives the original data in terms of the chosen components
   (eigenvectors) - that is, along these axes.
Reconstructing the original data
We did:
    TransformedData = RowFeatureVector × RowDataAdjust

so we can do
    RowDataAdjust = RowFeatureVector^-1 × TransformedData
                  = RowFeatureVector^T × TransformedData

and
    RowDataOriginal = RowDataAdjust + OriginalMean
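The whole worked example (steps 2-5 plus the reconstruction) fits in a few
lines of NumPy; the centred data are the RowDataAdjust values shown above.
Note that eigenvector signs are arbitrary, so they may differ from the slides:

```python
import numpy as np

# Centred data from the worked example (rows = attributes, columns = samples)
row_data_adjust = np.array([
    [.69, -1.31, .39, .09, 1.29, .49,  .19, -.81, -.31,  -.71],
    [.49, -1.21, .99, .29, 1.09, .79, -.31, -.81, -.31, -1.01]])

C = np.cov(row_data_adjust)                 # 2 x 2 covariance matrix, (n-1) divisor

eigvals, eigvecs = np.linalg.eigh(C)        # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]           # sort highest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
print(eigvals)                              # ~ [1.284, 0.049]

# Keep only the first principal component and project the data onto it
row_feature_vector = eigvecs[:, :1].T       # shape (1, 2)
transformed = row_feature_vector @ row_data_adjust

# Reconstruct (approximately) the centred data from the 1-D representation
reconstructed = row_feature_vector.T @ transformed
print(np.round(reconstructed, 2))
```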
Probability Theory - A Tool For Artificial
Intelligence Applications
Frequentist probability: related directly to the rates at which events occur.

Eg. When we say that an outcome has a probability p of occurring, it means that if
we repeated the experiment (e.g., drawing a hand of cards) infinitely many times,
then a proportion p of the repetitions would result in that outcome.

Bayesian probability: related to qualitative levels of certainty.

Eg, If a doctor analyzes a patient and says that the patient has a 40 percent chance
of having the flu, this means something very different—we cannot make infinitely
many replicas of the patient, nor is there any reason to believe that different
replicas of the patient would present with the same symptoms yet have varying
underlying conditions. In the case of the doctor diagnosing the patient, we use
probability to represent a degree of belief, with 1 indicating absolute certainty that
the patient has the flu and 0 indicating absolute certainty that the patient does not
have the flu.
Mixtures of Distributions
• Defining probability distributions by combining other simpler probability
  distributions
• Mixture Distribution:
  • A mixture distribution is made up of several component distributions
  • On each trial, the choice of which component distribution should generate
    the sample is determined by sampling a component identity from a
    multinoulli distribution:

      P(x) = sum_i P(c = i) P(x | c = i)
Mixture Model
• Mixture model is one simple strategy for combining
probability distributions to create a richer distribution

• Mixture model allows us to briefly glimpse a concept that will
be of paramount importance later - the latent variable → a latent
variable is a random variable that we cannot observe directly
– Latent variables may be related to x through the joint distribution, in
this case, P(x, c) = P (x | c)P (c). The distribution P(c) over the latent
variable and the distribution P (x | c) relating the latent variables to
the visible variables determines the shape of the distribution P (x)

• A very powerful and common type of mixture model is the


Gaussian mixture model, in which the components p(x | c = i)
are Gaussians.
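A minimal NumPy sketch of ancestral sampling from a Gaussian mixture: first
sample the latent component identity c from a multinoulli (categorical)
distribution, then sample x from the chosen Gaussian. All mixture parameters
below are made-up illustrative values:

```python
import numpy as np

rng = np.random.default_rng(3)

weights = np.array([0.5, 0.3, 0.2])     # P(c = i), a multinoulli distribution
means   = np.array([-2.0, 0.0, 3.0])    # mean of each Gaussian component
stds    = np.array([0.5, 1.0, 0.8])     # standard deviation of each component

n = 1000
c = rng.choice(len(weights), size=n, p=weights)    # latent component identities
x = rng.normal(means[c], stds[c])                  # x | c  ~  N(mean_c, std_c^2)

# Empirical check: the sample mean approaches sum_i P(c = i) * mean_i
print(x.mean(), (weights * means).sum())
```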
Probability Functions used in Deep Learning Models
• Logistic Sigmoid

      σ(x) = 1 / (1 + exp(−x))

  – Saturates when its argument is very positive or very negative:
    the function becomes very flat and insensitive to small changes in
    its input.
  – The inverse function σ^-1(x) is called the 'logit'.

• Softplus function

      ζ(x) = log(1 + exp(x))

  – It is a smoothed, or "softened," version of
      x+ = max(0, x)
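The two functions in NumPy; the softplus is written in the rearranged form
max(0, x) + log1p(exp(−|x|)), which avoids overflow for large x. That
rearrangement is a common implementation choice, not something from the
slides:

```python
import numpy as np

def sigmoid(x):
    # 1 / (1 + exp(-x)); saturates toward 0 / 1 for very negative / positive x
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    # log(1 + exp(x)) = max(0, x) + log(1 + exp(-|x|)): a smoothed max(0, x)
    return np.maximum(0.0, x) + np.log1p(np.exp(-np.abs(x)))

x = np.linspace(-6, 6, 7)
print(sigmoid(x))     # flat near 0 and 1 at the ends
print(softplus(x))    # close to 0 for very negative x, close to x for large x
```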
Some
Useful
Properties
Information theory
• A branch of applied mathematics that revolves around
quantifying how much information is present in a signal
• Learning that an unlikely event has occurred is more informative
than learning that a likely event has occurred
• Self-information of an event x = x is defined as I(x) = − log P(x), in nats
• One nat is the amount of information gained by observing an
event of probability 1/e
• Self-information deals only with a single outcome.
• Uncertainty in an entire probability distribution can be quantified
using the Shannon entropy: H(x) = E_{x~P}[I(x)] = −E_{x~P}[log P(x)]

• Differential entropy → the Shannon entropy when x is continuous
• If we have two separate probability distributions P(x) and Q(x) over the
same random variable x, we can measure how different these two distributions
are using the Kullback-Leibler (KL) divergence:
    D_KL(P || Q) = E_{x~P}[ log P(x) − log Q(x) ]
The KL divergence:
• is non-negative
• is 0 if and only if P and Q are the same distribution in the
case of discrete variables, or equal “almost everywhere” in
the case of continuous variables
• is often conceptualized as measuring some sort of distance
between these distributions
• is not a true distance measure because it is not symmetric:
DKL(P || Q) not equal to DKL(Q || P ) for some P and Q.
• Cross-entropy - a quantity that is closely related to the KL divergence:

    H(P, Q) = H(P) + D_KL(P || Q)

    H(P, Q) = −E_{x~P}[ log Q(x) ]

  (by convention, lim_{x→0} x log x = 0)
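These information-theoretic quantities are easy to compute for small discrete
distributions; a NumPy sketch with two made-up distributions P and Q (natural
log, so the results are in nats):

```python
import numpy as np

P = np.array([0.7, 0.2, 0.1])     # two discrete distributions over 3 outcomes
Q = np.array([0.4, 0.4, 0.2])

H_P  = -(P * np.log(P)).sum()             # Shannon entropy H(P)
D_PQ =  (P * np.log(P / Q)).sum()         # KL divergence D_KL(P || Q) >= 0
D_QP =  (Q * np.log(Q / P)).sum()         # not equal to D_PQ (not symmetric)
H_PQ = -(P * np.log(Q)).sum()             # cross-entropy H(P, Q)

print(np.isclose(H_PQ, H_P + D_PQ))       # True: H(P, Q) = H(P) + D_KL(P || Q)
print(D_PQ, D_QP)
```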


Structured Probabilistic Models
• Machine learning algorithms often involve probability
distributions over a very large number of random variables ,
involving direct interactions between relatively few variables.
• Using a single function to describe the entire joint probability
distribution can be very inefficient → Factorization
• For example, suppose we have three random variables: a, b
and c. Suppose that a influences the value of b, and b
influences the value of c, but that a and c are independent
given b, then p(a, b, c) = p(a) p(b | a) p(c | b)

• Representing the factorization of a probability distribution with a graph
  → structured probabilistic model or graphical model
Example – Directed Model

Directed models use graphs


with directed edges, and they
represent factorizations
into conditional probability
distributions.
Example – Undirected Model
Any set of nodes that are all
connected to each other in G is called a
clique. Each clique C(i) in an undirected
model is associated with a factor φ(i)(C(i) ).

Undirected models use graphs with


undirected edges, and they represent
factorizations into a set of functions;

These functions are usually not probability


distributions of any kind.
Supervised and Unsupervised Learning

• Unsupervised Learning
  – There is no predefined and known set of outcomes
  – Look for hidden patterns and relations in the data
  – A typical example: Clustering

(Figure: k-means clusters of the iris data set, Petal.Width vs. Petal.Length,
coloured by irisCluster$cluster)

Supervised and Unsupervised Learning

• Supervised Learning
– For every example in the data there is always a predefined outcome
– Models the relations between a set of descriptive features and a
target (Fits data to a function)
– 2 groups of problems:
• Classification
• Regression

Supervised Learning

• Classification
  – Predicts which class a given sample of data (sample of descriptive
    features) is part of (discrete value).

  (Figure: confusion matrix for an iris classifier, in percent)

                          Actual
    Predicted      setosa   versicolor   virginica
    setosa          100.0        0.0         0.0
    versicolor        0.0       96.0         4.0
    virginica         0.0        4.0        96.0

• Regression
  – Predicts continuous values.
