
Overview of Modern AI and ML Chapter 4

Dimensionality Reduction and PCA


C.V Jawahar

4.1 Dimensionality Reduction and Feature Extraction

In general, we start with a feature representation x corresponding to a physical phenomenon or an object of
interest. (In the case of supervised learning, we also have a corresponding y, i.e., pairs (xi , yi ).) In practice, this x
is not the result of a very careful selection of measurements/features. These features are either what we could
think of as relevant or what is feasible to measure in practice. This could be the “log” of some process. There could be
redundancy and correlation within these features too. They may not be the best for our problem either. A
problem of interest to us is to find a new feature representation z, often lower in dimension, as

z = Ux. (4.5)
If x is a d-dimensional vector and z is a k-dimensional vector (where k < d), U is a k × d matrix.
Two variants:

• The above linear transformation (matrix multiplication) leads to a set of “new” features that are
“derived” out of the existing features. The new features are linear combinations of the existing features. We will
have a closer look at this today. This can be supervised or unsupervised, i.e., y may be available
(supervised) or unavailable (unsupervised).

• The second direction is to create a new lower-dimensional representation by selecting only the useful/impor-
tant features from the original set. This is a classical subset selection problem, which is hard to solve
(similar to your familiar knapsack problem). This is not in our scope. Such problems are attempted
with greedy or backtracking algorithms. In such cases, U can be thought of as a matrix of 0s and 1s.

Some people distinguish these two carefully as feature extraction and feature selection. We can also look at
both as dimensionality reduction. Dimensionality reduction makes the downstream computations efficient.
Sometimes storage/memory is also made efficient through dimensionality reduction. If your original x
is “raw data” (say an image represented as a vector), then such techniques are treated as principled
ways to define and extract features.
When we do dimensionality reduction, we may be removing useful information (or noise). Due to this,
there is a small chance that we reduce the accuracy of the downstream task (say classification). In
general, dimensionality reduction schemes aim at no major reduction in accuracy (and are happy with some increase
in accuracy), but in a lower dimension.
If “noise” (irrelevant information) is removed or suppressed in the process, we may expect some
increase in accuracy. At the same time, if we lose some “relevant information”, then we may lose
accuracy. But, alas, we do not know what is noise and what is signal. We would like the machine to figure
this out from the examples/data. If both these problems are solved independently (as is the case in the
classical ML schemes), there is no way we can tell the dimensionality reduction scheme what is the best
thing to do.
The above equation is for linear dimensionality reduction. There are also many analogous non-linear dimen-
sionality reduction schemes. They are mostly not in our scope.


Let us start with an example. Consider a situation where we measure kids in a school with two features: x1,
the height of the student, and x2, the weight of the student. From a school, we collect the data and plot it as in
Figure 4.8. It is intuitive that if one increases, the other also increases. (Indeed, there could be exceptional
kids!!) Loosely, the relationship between them is linear. What we want is to capture this linear relationship
with a direction vector u and project all the samples onto it. That is what is shown as a new feature x3.
In this case x3 may be thought of as something like “age”. Our objective is to find u, project all the
samples onto u, and create a new feature. In the case of a typical linear dimensionality reduction, we may
have multiple such u's. We stack them as rows, and that is what our original matrix U was.
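To make this concrete, here is a small NumPy sketch (the measurements and the direction u below are made up for illustration) that projects two correlated features onto a single direction:

```python
import numpy as np

# Hypothetical height (cm) and weight (kg) measurements of a few kids.
X = np.array([[110.0, 18.0],
              [120.0, 22.0],
              [130.0, 27.0],
              [140.0, 33.0],
              [150.0, 40.0]])

# A made-up unit direction u in the (height, weight) plane.
u = np.array([0.96, 0.28])
u = u / np.linalg.norm(u)   # ensure ||u|| = 1

# Projecting every sample onto u gives one new feature per sample;
# this plays the role of x3 in Figure 4.8.
x3 = X @ u
print(x3)                   # five scalars instead of five 2D points
```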

4.2 Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is one of the most popular techniques for dimensionality reduction. It
is unsupervised, i.e., it does not use the label information.
Typically, the input to PCA is a set of samples x1 , . . . , xN, each in d dimensions. The objective is to get
a set of new samples z1 , . . . , zN, each in k dimensions, where k < d.
The central piece of PCA is the covariance matrix. If the data were scalar, the mean and standard deviation could have
helped us capture the distribution of the data. In our case, the data is a vector x. To model the distribution
of the data, we need a matrix, i.e., the covariance matrix.

4.2.1 PCA: Algorithm


• Compute the mean µ = (1/N) ∑_{i=1}^{N} xi of the data.

• Center the data by subtracting the mean: xi ← xi − µ.

• Compute the covariance matrix Σ = (1/N) ∑_{i=1}^{N} xi xiT.

• Compute the eigenvalues and eigenvectors of Σ. Eigenvalues (popularly represented as λi) are scalars;
eigenvectors are d-dimensional vectors. Corresponding to each eigenvalue, we have an eigenvector.

Figure 4.8: Given an input in 2D (axes x1 and x2), we are interested in finding a new direction/vector (shown as x3) that helps in capturing
most of the information in a lower dimension.

• Take the k eigenvectors corresponding to the largest k eigenvalues. Keep these eigenvectors as the
rows of a k × d matrix U.

• Compute the reduced-dimensional vectors as zi = Uxi for all i = 1, . . . , N. (A code sketch of these steps follows.)
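The steps above translate almost line by line into NumPy. The following is a minimal sketch of the algorithm exactly as described (no scaling or other refinements), not a library-grade implementation:

```python
import numpy as np

def pca(X, k):
    """Reduce an N x d data matrix X (one sample per row) to N x k."""
    # Compute the mean and center the data.
    mu = X.mean(axis=0)
    Xc = X - mu

    # Covariance matrix (d x d): Sigma = (1/N) sum_i x_i x_i^T.
    Sigma = (Xc.T @ Xc) / X.shape[0]

    # Eigen decomposition; eigh is appropriate because Sigma is symmetric.
    eigvals, eigvecs = np.linalg.eigh(Sigma)

    # Sort eigenvalues (and their eigenvectors) in decreasing order.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # U keeps the top-k eigenvectors as its rows (a k x d matrix).
    U = eigvecs[:, :k].T

    # z_i = U x_i for every centered sample.
    Z = Xc @ U.T
    return Z, U, mu, eigvals

# Example: 100 random samples in 5 dimensions reduced to 2 dimensions.
X = np.random.randn(100, 5)
Z, U, mu, eigvals = pca(X, k=2)
print(Z.shape, U.shape)   # (100, 2) (2, 5)
```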

How many eigenvectors?

In practice, what should be the value of k? This is often decided by looking at how much information is lost
in doing the dimensionality reduction. An estimate of this is the fraction of the total eigenvalue sum that we discarded, i.e.,

(∑_{i=k+1}^{d} λi) / (∑_{i=1}^{d} λi)

Often k is picked such that the above ratio is less than 5% or 10%.
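Continuing the sketch above, the ratio can be computed directly from the sorted eigenvalues; the 5%/10% threshold is only a rule of thumb:

```python
import numpy as np

def choose_k(eigvals, max_loss=0.05):
    """Smallest k whose discarded-eigenvalue ratio is below max_loss."""
    eigvals = np.sort(eigvals)[::-1]               # largest first
    total = eigvals.sum()
    for k in range(1, len(eigvals) + 1):
        discarded = eigvals[k:].sum() / total      # sum_{i>k} lambda_i / sum_i lambda_i
        if discarded < max_loss:
            return k
    return len(eigvals)

print(choose_k(np.array([5.0, 3.0, 0.3, 0.1, 0.05])))   # 3 for this toy spectrum
```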

4.3 Appreciating PCA

PCA can be appreciated in many ways.

Maximizing Variance We look for a direction u such that, if we project the data xi onto it, we obtain a
new representation zi that has maximum variance. This means that we retain most of the information and
lose the redundancy. The solution to this problem can be shown to be the eigenvector corresponding to the
largest eigenvalue of the covariance matrix Σ. The best direction to project onto so as to maximize the variance
is the eigenvector corresponding to the largest eigenvalue; the second best is the one corresponding to the second
largest eigenvalue, and so on (which reduces to our PCA!!).
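A quick numerical check of this claim (with synthetic 2D data): the variance of the data projected onto the top eigenvector is at least as large as the variance along any random direction.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[3.0, 1.2], [1.2, 1.0]], size=1000)
Xc = X - X.mean(axis=0)
Sigma = (Xc.T @ Xc) / len(Xc)

eigvals, eigvecs = np.linalg.eigh(Sigma)
u_top = eigvecs[:, -1]                       # eigenvector of the largest eigenvalue

var_top = np.var(Xc @ u_top)                 # variance along the top eigenvector
for _ in range(5):
    u = rng.normal(size=2)
    u /= np.linalg.norm(u)
    assert np.var(Xc @ u) <= var_top + 1e-9  # no direction beats the eigenvector
print(var_top)                               # approximately the largest eigenvalue
```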

Minimizing Reconstruction Loss Let u1 , . . . , ud be a set of new dimensions. We want to represent the
samples in the new space. What is the loss in reconstructing with this new set of basis vectors? We
would like to minimize it. Minimizing the reconstruction error can be shown to be equivalent to maximizing uT Σu
with our familiar constraint uT u = 1. This gives, as the solution, the eigenvectors corresponding to the
largest eigenvalues, and finally reduces to PCA.
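The reconstruction view can also be checked numerically: project onto the top-k eigenvectors, map back to the original space, and measure the error. A minimal sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                # toy data: 200 samples in 5 dimensions
Xc = X - X.mean(axis=0)
Sigma = (Xc.T @ Xc) / len(Xc)
eigvals, eigvecs = np.linalg.eigh(Sigma)     # eigenvalues in ascending order

def reconstruction_error(Xc, eigvecs, k):
    """Mean squared error of reconstructing the centered data from k components."""
    U = eigvecs[:, -k:]          # top-k eigenvectors as columns
    Z = Xc @ U                   # k-dimensional codes z_i
    X_hat = Z @ U.T              # back-projection into the original d-dimensional space
    return np.mean((Xc - X_hat) ** 2)

# The error shrinks as k grows and is exactly zero when k = d.
print([round(reconstruction_error(Xc, eigvecs, k), 4) for k in range(1, 6)])
```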

Line Fitting that Minimizes Orthogonal Distance Look at the problem of line fitting where we need to
predict yi from xi. We minimized an error in yi. If you draw this in a 2D plane (when x was 1-d), the error
is parallel to the y axis. In general, the error is always defined with respect to y. A very related problem is:
how do we fit a line that minimizes the orthogonal distance from the samples to the line? i.e.,

min_w ∑_i d⊥(xi , w)

Note that xi is a point and w is a line. This also reduces to the eigenvector solution of PCA.
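The two fits can be compared directly: ordinary least squares minimizes vertical errors, while the PCA direction minimizes orthogonal distances (often called total least squares). A sketch with synthetic 2D points:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=200)   # noisy points around a line

# Ordinary least squares: minimizes errors parallel to the y axis.
slope_ls, intercept_ls = np.polyfit(x, y, deg=1)

# PCA / total least squares: minimizes orthogonal distances to the line.
P = np.column_stack([x, y])
mu = P.mean(axis=0)
Sigma = np.cov((P - mu).T)
eigvals, eigvecs = np.linalg.eigh(Sigma)
direction = eigvecs[:, -1]                  # top eigenvector = direction of the line
slope_tls = direction[1] / direction[0]     # the line passes through the mean mu

print(slope_ls, slope_tls)                  # similar, but not identical, estimates
```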

4.4 Application: Eigen Faces

A classical example/application of PCA is in face recognition and face representation. Let us assume that
we are given N face images, each of size √d × √d. We can visualize each image as a d-dimensional vector. Assume that
our input is all the faces from a passport office. All the faces are approximately of the same size, they are
all frontal, and they are all expression neutral.

• Let us assume that the mean µ is subtracted from each of the faces. If all of the inputs were faces, then
the mean also looks very similar to a face (Figure 4.9).

• Let x1 , . . . , xN be the mean-subtracted faces.

• We are interested in finding the eigenvectors of the matrix Σ. Say they are u1 , . . . , uk. Note that
k ≤ min(N, d).

• We then project each of the inputs x onto these new eigenvectors and obtain a new feature:

zi = uiT x,   i = 1, . . . , k

• The new representation z is obtained like this. You can also look at this as forming a k × d matrix U
which has the eigenvectors as its rows.

• We now obtain a compact (k-dimensional) representation zi for each of the input face images xi,
where i = 1, . . . , N (see the sketch after this list).
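Here is a compact sketch of the eigenface pipeline, assuming the N face images have already been loaded and flattened into an N × d matrix (the random data, the image size and the choice k = 50 below are placeholders for illustration). It uses the standard trick of eigen-decomposing the smaller N × N matrix A Aᵀ instead of the d × d covariance matrix:

```python
import numpy as np

# faces: an N x d matrix, each row a flattened (sqrt(d) x sqrt(d)) face image.
faces = np.random.rand(500, 64 * 64)       # random placeholder for real face data

mu = faces.mean(axis=0)                    # the "mean face" (compare Figure 4.9)
A = faces - mu                             # mean-subtracted faces

# For d >> N it is cheaper to eigen-decompose the N x N matrix A A^T and map
# its eigenvectors back to d dimensions; this is the usual eigenface trick.
eigvals, V = np.linalg.eigh(A @ A.T / len(A))
order = np.argsort(eigvals)[::-1][:50]     # keep k = 50 components (arbitrary here)
U = (A.T @ V[:, order]).T                  # k x d matrix whose rows are eigenfaces
U /= np.linalg.norm(U, axis=1, keepdims=True)

# Compact k-dimensional representation z_i = U x_i for every face.
Z = A @ U.T
print(Z.shape)                             # (500, 50)
```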

The representation thus obtained is a useful feature for characterizing the face image. This feature has been shown
to work well for recognizing identity, expressions, gender, etc. This is an example of a feature that is derived
from the data itself.

Figure 4.9: Mean of all the faces available in the dataset.

Figure 4.10: First 4 eigen faces generated from the data.


Overview of Modern AI and ML Chapter 5

What is Deep Learning?


C.V Jawahar

5.1 Shallow and Deep Functions

Assume we want to predict the volume of a gas from the atmospheric temperature. We can use well known “textbook
knowledge”, Charles's law, which models this as V ∝ T. If we know the coefficient of this relation, we
should use it to directly compute V = kT. Do not use machine learning to learn such well known and
simple functions; we can use our understanding of basic science to do this. For the sake of argument, let
us use ML to attempt this. We want to predict V = w · T. What do we do? We use many measurements
of volume and temperature (Vi , Ti ) and learn the function V = f (T) = w · T. What we learn is w. This
could make sense when the data has noise, i.e., our measurements of these quantities are noisy due to sensor
issues and also due to limitations in resolution (we do not measure extremely high precision quantities in
practice). I hope you appreciate by now how to learn such models from the data.
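As a toy illustration (with a made-up coefficient and made-up noisy measurements), the scalar w can be learned by least squares:

```python
import numpy as np

rng = np.random.default_rng(42)
k_true = 0.0821                                   # hypothetical true coefficient
T = rng.uniform(250, 350, size=50)                # temperature measurements
V = k_true * T + rng.normal(scale=0.5, size=50)   # noisy volume measurements

# Least-squares estimate of w for the model V = w * T (no intercept):
w = (T @ V) / (T @ T)
print(w)    # close to k_true despite the measurement noise
```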
There are many situations where linear functions like f (x) = wT x are perfectly adequate, including simple
linear classifiers and linear regressors that give us useful solutions (e.g., with good features, many problems
of our interest can in fact be solved with high accuracy by a linear classifier). Many working solutions
still use linear or simple nonlinear models.
Very soon we started to look for more complex functions to model problems. This led to natural nonlinear
functions in the form of Multi Layer Perceptrons and nonlinear Support Vector Machines. (We will see both
of these in more detail at a later stage.) However, the non-linearity in these was limited. But they were enough
if we wanted to capture the nonlinearities that we see in many walks of life (e.g., the relationship between how much I eat
and how much weight I gain).
Many modern AI problems are centered around perception. When I see a “cat”, how do I know it is a cat?
How do I process this raw signal? Our understanding is quite limited. Based on our understanding of
how human perception systems work, we only know that the signal travels from the eyes to some specific
parts of the brain (the visual cortex) in a systematic manner. We also know that, in the entire process of
transmission, many simple nonlinear computations are carried out. This also helped us in anticipating the solution
to many complex problems in AI as one of such deep, cascaded nonlinear computations.
Let us assume that we have some simple computational block y = f (x), very similar to what we have seen
earlier. Note that y need not be a simple {0, 1} as in classification or a real number as in simple regression. We
use bold y to indicate that this can be a vector valued function. Let us look at this as a new representation
of the same x. Maybe then we can look at this as x2 = f (x1 ), i.e., it converts a feature representation
x1 into another one, x2. The hope is that x2 is more useful than x1 for the eventual goal that we have (say
to interpret that the image is that of a cat). We can look at this more generically as a set of functions
xk+1 = f (xk ). This means that the final equivalent function is very complex. For example, suppose we had two
such blocks: the first one transforms x1 to x2 and the second one x2 to x3. Then we see the overall
function as x3 = f2 (f1 (x1 )) = g(x1 ). Indeed, g() is now a much more complex function than f1 () or f2 ().
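A tiny sketch of this composition idea: two simple nonlinear blocks f1 and f2, each a linear map followed by a nonlinearity, compose into a noticeably more complex overall function g. The weights here are random placeholders, not learned values.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))    # placeholder weights, not learned values
W2 = rng.normal(size=(2, 4))

def f1(x):                      # x1 -> x2: a simple nonlinear block
    return np.tanh(W1 @ x)

def f2(x):                      # x2 -> x3: another simple nonlinear block
    return np.tanh(W2 @ x)

def g(x):                       # the composed, more complex function
    return f2(f1(x))

x1 = np.array([0.5, -1.0, 2.0])
print(g(x1))                    # x3 = f2(f1(x1)), a 2-dimensional output
```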
Historically, it was not very difficult to visualize this viewpoint of learning complex functions from many relatively
simple nonlinear functions. However, the challenge was always how to learn such a function.
When deep learning algorithms came up with computationally attractive solutions to learn a hierarchy
of (simpler) nonlinear functions, we found that many human perception tasks (like recognizing cats from
images or recognizing words and instructions from sounds) became much more accurate.


5.2 Deep Neural Networks

Over the years, we have devised many algorithms that can learn the parameters associated with “simple” or
“shallow” functions (whether linear or nonlinear). Examples include logistic regression, SVMs and
MLPs. One of the most popular approaches for learning was based on gradient descent. The
error backpropagation algorithm (popularly known as backpropagation) has been one of the most popular
algorithms to learn nonlinear functions for many practical reasons: (i) it works for both convex and non-
convex cases; (ii) it is incremental in nature and does not demand that all the samples be kept in memory,
or that huge matrices of the order of the number of samples be used for computation; (iii) it is simple, intuitive
and efficient. We will see the technical details of the backpropagation algorithm at a later stage.
When we want to design deep functions for learning, the natural choice was to use backpropagation.
This also led to the individual functions fi (·) being built as neural networks. In today's world, deep learning
has become synonymous with deep neural networks.
Any neural network that has multiple (say more than 2) layers is doing deep learning. With this simplicity,
we have been doing deep learning forever!! Then what is new about today's deep learning? The truth is that a
number of difficulties (numerical, algorithmic) that practically stopped us from learning such deep functions
have been addressed in the last 10 years or so. Here are some examples. (i) Deep neural networks have many more
parameters to learn than the traditional shallow functions/networks. With the internet explosion, large
data repositories and annotated data sets evolved. (ii) When the function is deep, optimization becomes
difficult. With gradient-based learning, the initial layers of the network do not receive enough gradient
signal (popularly known as the vanishing gradient problem). A number of improvements in the optimization
scheme and in architecture design happened on top of the error backpropagation algorithm to address this. (iii)
MLPs are not the best neural network architecture for all problems. There have been many practical
advances in architectures. Architectures can change according to the input data, the type of connections,
etc. Architectures like Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs)
became more popular than traditional MLPs.

Figure 5.11: In Deep Learning, we can look at the final solution as a concatenation of many nonlinear
functions: x → f1( ) → f2( ) → f3( ) → … → fn( ) → y.

Figure 5.12: In Deep Learning, we can look at the final solution as a concatenation of many nonlinear
functions: y = fn(fn-1(… f2(f1(x)) …)). Here the final g( ) can be seen as nested. Compare with the previous figure.
Though we look at AlexNet, or 2012, as possibly the turning point in the history of deep learning, there are
a couple of very important and pioneering attempts to acknowledge. The first is from Yann LeCun, who worked
extensively on a deep CNN architecture. He designed the LeNet architecture more than 20 years back and
demonstrated its applicability in the practical problem of cheque reading (reading isolated digits). This also
led to the popular MNIST data set that we all use for laboratory experiments. Table 5.1 summarizes the strong
similarity of the LeNets to the well acclaimed AlexNet. Take this with a pinch of salt; there have been
many minor but critical changes in between. The second is to acknowledge the extensive work of XXXX on
recurrent neural networks in the early years. He spent many years making recurrent neural networks
practical. In short, CNNs and LSTMs existed ahead of the modern surge in AI.

                LeNet (1989)    LeNet (1998)    AlexNet (2012)
Task            Digit           Digit           Objects
Classes         10              10              1000
Image size      16 × 16         28 × 28         256 × 256 × 3
Examples        7291            60000           1.2 M
Units           1256            8084            658000
Parameters      9760            60k             60M
Connections     65k             344k            652M
Operations      11 billion      412 billion     200 quadrillion

Table 5.1: A simplified comparison of LeNet and AlexNet. Then what changed over time? Only data and
computing? Not that simple.

5.3 Deep Learning

Another view of deep learning comes from asking: what do we try to learn?


The intelligent systems of the early days captured knowledge with the help of humans and used well structured
code and inference to make decisions. A class of such solutions is expert systems. One could use if-else
style code/reasoning to reach conclusions. Often humans had to create the knowledge that lives in
the code used for inference, and also the knowledge in the external databases connected to it. For example, if a
patient has certain symptoms, then the diagnosis should be this. There were many successful implementations
of these. They primarily transferred systematic human reasoning to a digital computer.
Classical machine learning looked at the problem as learning from the data. The problem was formulated
in two critical stages. The first stage represents the object in a mathematically convenient form (say a vector), and the
second stage does classification or regression. The second stage advanced a lot with many powerful algorithms
ranging from decision trees to support vector machines; they surely have an independent existence. The
first stage was quite intuitive in the early years. We could represent a fruit with colour and shape; we could
represent an animal with its colour and skin texture. These extremely intuitive features were a recipe for
disaster. Very soon we started to realize that humans are unable to express the intuition of ‘what makes an
animal a cat’ in a computationally friendly manner. We were also not able to imagine or enumerate all the
potential variations that exist in the world. We started to realize that it may be easier to learn from the
data.
The limited, intuitive (or, popularly and perhaps derogatorily, called handcrafted) features were replaced by
features that can be learned or computed with well understood mathematical models. For example, Fourier
transforms, wavelet transforms, numerical estimates of probability distributions, etc. became very popular
for this. A wide variety of them had a sound mathematical interpretation (why high frequencies are useful,
rather than a guess that eyes distinguish a cat from other animals). This also led to robust features: even if the eyes of
the cats could work as a feature, it was impossible to code and extract the eyes very reliably. Also, a new set of features
that learn a certain basis from the data became popular. Examples include methods like bag of words and bag
of visual words. This era gave impressive results on many seemingly difficult problems. Machine learning
became an integral part of many sub-areas of AI.
Along with the previous development, another branch of machine learning started to prosper, popularly
known as representation learning. In its simplest form, it aims to learn a representation that suits the
problem, given the data and our eventual goal. Dimensionality reduction techniques, manifold learning and
many other areas of machine learning touch this direction of development. It is indeed very difficult to
separate these methods in a watertight manner. We should not even aim to do that.
The modern end to end learning takes this one step further. Can we learn from the data to the goals
directly? Given what a driver sees, can we learn the driver's actions (like applying brakes!!)? This
end to end learning is possibly the one integral ingredient of the success of modern deep learning. Note
that deep learning need not be end to end always. Many of us may not want to train a model end to end,
given our limited resources (data, computing, storage, etc.). We can still use the power and advantages of deep
learning in our projects. Deep end to end learning is beneficial in many respects. We can learn features
and classifiers together. This gives us the benefit of getting the ‘best’ solution directly from the data. In
practice, such models have been doing quite well compared to the previous era of features.
Overview of Modern AI and ML Chapter 6

Deep Features
C.V Jawahar

6.1 Features and Data Driven Features

A basic problem of our interest is to design good features. The performance of any solution will depend on
the quality of the feature/representation. Though we can design features or representations with our own
intuitions, it is well known that the best feature surely depends on the task and the data. Why do we
think that colour is a good feature for classifying fruits? Maybe it is so, but the choice came from the
intuition that we have about fruits. Why not learn features from the data itself? There have been many data
driven features in the past as well: bag of words, PCA and many other such representations are data driven.
There have also been many approximations based on wavelets, DCT, etc. in the literature around the same
time.
A good feature for compression (a compact representation that can reproduce the signal/data) could be very
different from that required for separating apples and oranges. The representation required for separating
fruits vs. vegetables could be yet another one. Most of the time, we start with an “acceptable” representation
and then solve a specific problem of classification, regression, etc. If we could learn a task specific feature, that
would be great. That is what deep learning also enabled: by defining the learning problem as end
to end, we are now learning the representation and the classification together.
However, it is not always possible to come up with enough data and annotation for such an end to end learning
task. This is where we appreciate the utility of “generic” features that are very good for a wide variety of
tasks. With the emergence of deep learning, the popularity of data driven features has increased a lot. A
classical example is word2vec (though it may not be deep enough).

6.1.1 X2Vecs

We have seen word embeddings like word2vec. Though this may not be an example of deep features, it is
an excellent example of how data driven features can be useful. The word2vec representation allowed us to learn
the “meaning” of a word from many sentences that use that word. A very impressive idea. Similar approaches
have now been defined to come up with embeddings for a wide variety of data sets and tasks. Similar to word2vec,
we now have representations like sentence2vec, doc2vec, etc. at different levels. Another area where such
representations helped a lot is in dealing with data that is highly heterogeneous, for example, the medical
records of a patient or the attributes of a product. This led to embeddings like prod2vec. A number of
powerful data driven embeddings have surfaced in recent years.

diff2vec               doc2vec             graph2vec    tweet2vec
batter-pitcher-2vec    illustration-2vec   ida2vec      node2vec
sentence2vec           wiki2vec            topic2vec    entity2vec

Table 6.2: A Zoo of X2Vec in the literature. See references online
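As a hedged illustration of how such embeddings are typically trained in practice, here is a minimal sketch using the gensim library (the toy corpus is made up; the parameter names follow recent gensim versions and may differ in older ones):

```python
# pip install gensim
from gensim.models import Word2Vec

# A toy corpus: each "sentence" is a list of tokens.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]

# Train a small word2vec model; vector_size is the embedding dimension.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

vec = model.wv["cat"]                  # a 50-dimensional embedding for "cat"
print(model.wv.most_similar("cat"))    # nearest words in the learned space
```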


6.2 Deep Features

We have seen deep learning as a sequence of operations resulting in obtaining the goal. This can also be
seen as a systematic transformation of the representation from pixels/raw data to the class label/goal. Let
us look at the problem of classifying images. One end of the spectrum is the raw data (pixels represented
as a 2D or 3D array in three colours). The other end is the label, say “Traffic Junction”. One is too low-level a
representation of the scene and the other is too high-level a description of the scene. Deep learning
solutions can bridge this semantic gap of representation.
We can see x1 → x2 → · · · → xN as a systematic semantic enrichment of the representation. If we had taken
the lowest level representation/features, then we would have needed an extremely powerful classifier to solve
our problem. If we had taken xN−1, then the feature is rich and powerful and the “classifier/regressor/ML
algorithm” required to solve the problem becomes simple. In short, if the representation is very good and powerful,
the next stage can be simple; and if the representation is poor and low-level, then the next stage has to
be very complex/powerful to yield good results. We can now look at deep architectures as the initial stages
doing feature extraction and the later stages doing classification. However, there is no need to see end to
end learning or deep learning as a two stage learning process. This view just helps in appreciating the intermediate
representations of the deep networks.
Ever since the empirical success of deep learning in many problems, people have tried to find out what these
deep networks learn. This is a natural curiosity, and it is also important for interpreting what these machine
learning models are learning from the data. The indications are that the initial layers learn features that
are simple and more generic, while the later stages learn features that are semantically rich and powerful.
For example, for an image classification problem, the initial layers learn generic features including colour,
edges, corners, and many low-level features that traditional image and signal processing has been focusing
on, while the next (say mid) stages learn features that are more complex, like shapes (circles) or parts of
objects (like the tyres of vehicles). The latter stages provide more complex features that help in classifying
the scene as a traffic signal.
We will see how to visualize and appreciate this argument at a later stage, when we know more about
deep neural architectures. At this stage, we can look at deep learning as a powerful tool that can give us
features for a number of tasks. If you now have a new task of image classification, it is advisable to try
out how a deep feature performs before trying out classical features based on DCT or wavelets. Similarly,
for speech, we now see deep features that can compete with, if not outperform, the age old MFCC features. We
already saw how word2vec outperforms one hot representations.
The migration to deep features may look somewhat counter-intuitive at the beginning. If we have built
a solution for classifying 1000 natural objects, why should it be applicable to objects (e.g., auto
rickshaws) that are seen only in India? Indeed there is a small issue here. But we should note that auto
rickshaws are also made of parts that are present in cars and buses. In other words, many low-level parts
(read as patches) that we see in one “new” image were present in the many, many images that we saw in
a very different environment. Extending the argument, we can see that many “sounds” that are present
in Hindi are also present in English; both are made by humans who have similar sound generation
systems. This should convince us that many of the features that are learnt for a different task should also be
useful in new tasks.

6.3 Transferring the Knowledge to New Domains

When we learn the success stories of deep learning for a specific task, a number of questions often bother
us:

• Why should I care about the success of AlexNet in the ImageNet challenge and other similar success stories?

• I would like to use deep learning for my own problem. However, I do not have enough data to do end
to end training.

Since deep networks have a lot more parameters (weights, unknowns), they require a large amount of data to
learn end to end from scratch (“from scratch” implies that we are training the network/model from the very
beginning). This makes such training difficult (if not impossible) for most of our tasks. A very useful
way to look at recent deep learning is as learning a set of new representations. In the last
section, we saw that deep networks learn a hierarchy of features:
the initial stages learn simple and more generic features and the later stages learn semantically richer
and more specific features. The question now is how do we make use of these features? The answer is
simple. You can chop/cut the network at any stage and take the internal representation (internal values of
the network) as a representation for the new task.
For example, if the original network was trained on 1000-class object recognition and we have a new task
of classifying 100 indoor scenes, we could take an intermediate representation towards the end of the network and use it as
a representation. Such a representation could be a few thousand real numbers in size. It has been observed, for a
wide variety of tasks, that such features learned from a somewhat different task are very useful. This removes
the need for having our own data in huge quantities.
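A minimal PyTorch/torchvision sketch of this idea: take an ImageNet-pretrained network, chop off its final classification layer, and use what remains as a fixed feature extractor (the weights argument follows recent torchvision versions; older versions use pretrained=True instead):

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a network trained on 1000-class ImageNet object recognition.
backbone = models.resnet18(weights="IMAGENET1K_V1")

# Chop off the final fully connected layer; what remains maps an image
# to a 512-dimensional internal representation.
backbone.fc = nn.Identity()
backbone.eval()

# A batch of (here random, placeholder) images: 4 x 3 x 224 x 224.
images = torch.randn(4, 3, 224, 224)
with torch.no_grad():
    features = backbone(images)
print(features.shape)    # torch.Size([4, 512]) -- the transferred feature
```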
Consider yet another problem of working with Indian faces. What we may have is a face recognition model
that is built over Hollywood or western celebrities. However, it has been observed that a representation
transferred from the model trained on western celebrities is very useful for matching two Indian faces. We
can appreciate this now, since all our faces have a lot in common.
A deeper question emerges now. Can I use the representation that is learnt on different data or a different task and
then further improve it with minimal data/effort/computation? Yes; indeed this is possible. This is popularly
known as fine tuning or, sometimes, transfer learning. We will see this in more detail at a later stage.
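As a short sketch of fine tuning under the same assumptions: reuse the pretrained backbone, replace the final layer for the new task (say 100 indoor scene classes), and train only that layer (or, with more data, the whole network with a small learning rate):

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")

# Freeze the transferred representation; only the new head will be trained.
for p in model.parameters():
    p.requires_grad = False

# Replace the 1000-way ImageNet head with a 100-way head for the new task.
model.fc = nn.Linear(model.fc.in_features, 100)

optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# One hypothetical training step on a placeholder batch.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 100, (8,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(loss.item())
```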

Figure 6.13: End2End Classification = Representation Learning + Classification. The network maps the input x (the raw signal) through low-level and high-level features to the output y (the high-level label).


Figure 6.14: Features that can be transferred for new tasks. The initial (representation learning) part of the network maps the input x (the raw signal) to a representation; the final (classification) part maps it to the output y (the high-level label).
