

© 2019 - WorldQuant University – All rights reserved.


Revised: 08/19/2019
MScFE 650 Machine Learning in Finance – Table of Contents

Unit 1: Principal Component Analysis (Part 1)

Unit 2: Principal Component Analysis (Part 2)
        Principal Component Analysis (Part 3)

Unit 3: Linear Discriminant Analysis (Part 1)

Unit 4: Linear Discriminant Analysis (Part 2)

Unit 5: Academic Papers in Review


Unit 1: Principal Component Analysis (Part 1)

In the next three videos, we will introduce principal component analysis (PCA), an unsupervised
learning technique used for dimensionality reduction.

Before you start with modeling, an important first step is data preparation. This is where
dimensionality reduction can be especially useful: in feature engineering. If you look at the
scikit-learn user guide, there is a whole section on data preparation, covering preprocessing
techniques such as standardizing and normalizing the data, and related issues.
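As a small sketch of the kind of preprocessing the user guide describes (assuming NumPy and scikit-learn are available; the tiny array below is made up purely for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: 4 observations, 2 features (made-up numbers).
X = np.array([[1.0, 200.0],
              [2.0, 180.0],
              [3.0, 240.0],
              [4.0, 220.0]])

# Standardize each feature to zero mean and unit variance.
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

print(X_std.mean(axis=0))  # approximately [0, 0]
print(X_std.std(axis=0))   # approximately [1, 1]
```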

In this module, we will talk about different types of data preparation, specifically those related to
dimensionality reduction. The first dimensionality reduction method we will consider is
principal component analysis and the second is linear discriminant analysis. Let's start by
considering principal components.

Principal component analysis

The easiest way to explain this is to draw a figure using a two-dimensional plot. Drawing in
two dimensions is the simplest, although we could also visualize the technique in three
dimensions. The idea is that, if we have data values distributed as shown in the figure below, then
we would ideally like a technique that helps us identify the direction of maximum spread, or
variance.


The first step is always to determine the mean of the data. In this illustration the x's indicate the
data points. Let us indicate the mean with a dot. Now, you can imagine yourself sitting at the dot
and looking at the data values. You will probably see that the spread in one direction is much
larger than in any other direction. This is the first principal direction that we are looking for.

In two dimensions, it is easy to find the second principal direction, because it is simply orthogonal
to the first one. We have now illustrated two principal directions for this dataset in two
dimensions. The question now is: how do we translate what we have drawn here into mathematical
terms?

Firstly, let us consider having several observations written as $x_n$; if we have $N$ observations,
the index $n$ runs from 1 to $N$. As you will see in my notation, I always put an underscore when
writing a column vector.

For example, $x$ may have components $x_1, x_2, x_3, \dots$, but the point is that it is a column
vector:

$$x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}$$

If I want a row vector, I take the transpose, so the row vector is $x^T = [\,x_1 \;\; x_2 \;\; x_3\,]$. That
is just a matter of notation.

As I said, the first thing we want to do is to remove the mean from the data, so we first have to
calculate the mean. Let us put an overbar on the mean; it is simply the average of the observations:


$$\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n .$$

Now that we have calculated the mean, the next thing to do is to remove the mean from all the
observations. Typically, to remove the mean, we take the observation and subtract the mean. We
repeat that for all values.

$$x_n - \bar{x}$$

In effect what we are doing is translating the origin to the position of the mean.
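As a minimal sketch of this centering step (assuming NumPy; the data matrix X below is a hypothetical array with one observation per row):

```python
import numpy as np

# Hypothetical data: N observations (rows), d features (columns).
X = np.random.default_rng(0).normal(size=(100, 3))

x_bar = X.mean(axis=0)      # the mean vector, x-bar
X_centered = X - x_bar      # subtract the mean from every observation

# Each column of the centered data now has (numerically) zero mean.
print(np.allclose(X_centered.mean(axis=0), 0.0))  # True
```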

In this video, PCA was introduced and we discussed some of the notation that will be used
throughout the course. In the next video, we will discuss how to find the principal directions using
mathematics.


Unit 2: Principal Component Analysis (Part 2)

In the previous video, we introduced PCA and some of the notation that we will be using
throughout the course. In this video, we will discuss how to find the principal direction — the
direction of maximum spread or variance.

Remember that we will have more than one principal direction, so let us call the first one $u_1$,
which is unknown at this point. We want to calculate $u_1$. For that we need to project the
observations onto $u_1$.

You may remember from linear algebra that the scalar projection onto a unit vector is just the
length of the vector times the cosine of the angle between the two vectors.

$u_1$ must have unit length:

$$u_1^T u_1 = 1$$

Remember that $u_1$ without the transpose is a column vector, so $u_1^T u_1$ is a row vector
multiplied by a column vector, which is just an inner product. The inner product of a vector with
itself is its squared length, and for a unit vector this must equal 1. That is the constraint we place
on $u_1$. The projection of a generic observation $x$ onto the direction $u_1$ is then simply an
inner product.

This we can write as:


$$\text{Projection of } x \text{ onto } u_1: \quad u_1^T (x - \bar{x})$$

Note that we actually project the vector with the mean subtracted. We do this for all observations.
Remember we want to find the direction of the maximum spread. How do we calculate the spread?

The variance is the average of the squared projections (the projected values have zero mean, since
we subtracted the mean), and in mathematical terms it is given by the following:

$$\sigma^2 = \frac{1}{N}\sum_{n=1}^{N}\left[u_1^T(x_n-\bar{x})\right]^2
= \frac{1}{N}\sum_{n=1}^{N} u_1^T (x_n-\bar{x})(x_n-\bar{x})^T u_1
= u_1^T\left[\frac{1}{N}\sum_{n=1}^{N}(x_n-\bar{x})(x_n-\bar{x})^T\right] u_1
= u_1^T S u_1 ,$$

where $S$ is the data covariance matrix.

We have now obtained an expression for the spread in the direction of $u_1$. However, $u_1$ is
unknown. We choose $u_1$ so that the spread is a maximum. You can imagine sitting again at the
mean and looking in different directions; when you see the direction in which the spread is
largest, you know that you have found $u_1$. We must do that mathematically: we maximize this
variance with respect to $u_1$.

$$\max_{\|u_1\|=1} \sigma^2 = \max_{u_1}\left[\, u_1^T S u_1 + \lambda \left(1 - u_1^T u_1\right) \right].$$

We can make the value of $u_1^T S u_1$ arbitrarily large by simply increasing the magnitude of
$u_1$. Since we are only interested in its direction anyway, we need to impose the condition that it
has unit length.

We do this by adding the Lagrange multiplier term, with multiplier $\lambda$, as in the equation
above. If you are not familiar with Lagrange multipliers, don't worry too much about it right now,
but please consult the notes.

Now we maximize with respect to $u_1$: as always, we take the derivative, here with respect to the
components of $u_1$, and set it equal to 0.


On the left-hand side we then get $2 S u_1$ and, on the right-hand side, from differentiating the
quadratic constraint term, 2 times the Lagrange multiplier times $u_1$. We can cancel the 2's, and
then we have a very interesting equation:

$$2 S u_1 = 2\lambda u_1 \quad\Longrightarrow\quad S u_1 = \lambda u_1 .$$

$S$ is the covariance matrix, which we calculate from the data. We have two unknowns, namely
$u_1$ and $\lambda$. We want the variance to be maximized along the direction $u_1$, so we must
solve this equation for $u_1$; that is what we are after.

$$S u_1 = \lambda u_1$$

$$\sigma^2 = u_1^T S u_1 = \lambda\, u_1^T u_1 = \lambda$$

This means $\sigma^2_{\max} = \lambda_{\max}$.

We only need to recognize that equation for what it is. Hopefully, you recognize it as an
eigenvalue problem, where $u_1$ is an eigenvector with eigenvalue $\lambda$. Also, $S$ is a
symmetric matrix, so we know all the eigenvalues will be real. What is more, and you may want to
do this as an exercise, $S$ is also a positive semi-definite matrix.

If you don’t remember what that is, you might want to brush up on your linear algebra, but it
means that we are guaranteed that all these eigenvalues will be non-negative. Some of them may
be 0, but they cannot be negative.

We want to maximize the variance, which equals an eigenvalue of the data covariance matrix.
Therefore, the maximum value for the variance equals the largest eigenvalue, and $u_1$, the first
principal direction, is the eigenvector of the data covariance matrix belonging to the largest
eigenvalue.
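As a minimal numerical sketch of this result (assuming NumPy; the data below is randomly generated for illustration), we can check that the variance of the data projected onto the top eigenvector of $S$ equals the largest eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical 2-D data with more spread in one direction.
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])

Xc = X - X.mean(axis=0)                # remove the mean
S = (Xc.T @ Xc) / Xc.shape[0]          # data covariance matrix S

eigvals, eigvecs = np.linalg.eigh(S)   # eigh: for symmetric matrices
u1 = eigvecs[:, np.argmax(eigvals)]    # eigenvector of the largest eigenvalue

proj = Xc @ u1                         # projections onto u1
print(proj.var(), eigvals.max())       # these two numbers agree
```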

We have basically solved the problem, because we can do the same as before, but this time we
look at the direction that is orthogonal to $u_1$. We could do this formally but, at this point, we
can use our intuition and conclude that the rest of the principal directions will simply be the rest
of the eigenvectors.

What is more, the eigenvalues belonging to those eigenvectors have a very specific meaning:
they measure the spread in those directions.

We have not only obtained the principal directions, but also calculated the spread in those
directions. That is a very nice property, because if, for instance, we find that the eigenvalues in
some directions are negligible or 0, then we know that those directions have no spread. I think of
these as empty dimensions: there is no activity in those directions. If there is no activity in those
directions, we can get rid of them and describe the data in terms of the principal directions, or
components, where we have a significant spread.

Coming back to our original figure, we now know $u_1$. The next step is to project each of the
observations orthogonally onto $u_1$. In this case, because of the way it has been drawn, one can
see that all the data values are close to $u_1$. If we project onto $u_1$, we do not incur a large
error at all, since the points are so close to the line in the first place.

This means that, for this example, the data was essentially one dimensional. In our next video we
are going to look at how to compute these eigenvalues and eigenvectors of the data covariance
matrix.


Unit 2: Principal Component Analysis (Part 3)

In this video, we will briefly discuss additional important aspects of PCA — namely:

o how to compute the eigenvalues and eigenvectors in a numerically stable way,

o how to project onto the principal directions (dimensionality reduction), and

o how to approximately reconstruct the original data from their projections.

How do we compute the eigenvalues and eigenvectors of a data covariance matrix? Previously, we
mentioned that it is not a good idea to calculate the covariance matrix explicitly, because we lose
roughly half of the significant digits; forming it is a numerically unstable process. To find the
principal directions, we need the eigenvalues and eigenvectors of the data covariance matrix, but
no matter how accurately we calculate those eigenvalues and eigenvectors, the damage has
already been done by forming the covariance matrix. We therefore need to avoid calculating the
covariance matrix altogether.

If you look at the scikit-learn documentation on PCA, you will see that it refers to the singular
value decomposition (SVD). It does not form the data covariance matrix at all; instead, it arranges
the original data values in a matrix and calculates the singular value decomposition of that matrix.
This achieves the same as calculating the eigenvalues and eigenvectors, but bypasses the
covariance matrix. There will be a dedicated notebook in which you can further explore how this
is done, and we will see some of the exciting properties of the singular value decomposition. You
can also explore this in the notes, where it is briefly explained. It is one of the great factorizations
of matrix linear algebra and is definitely worth looking at.
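As a rough sketch of the idea (assuming NumPy; the data is randomly generated, and this is not the exact scikit-learn implementation): the right singular vectors of the centered data matrix are the principal directions, and the squared singular values divided by $N$ are the eigenvalues of $S$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # hypothetical data: N=200, d=5
Xc = X - X.mean(axis=0)                # center the data

# SVD of the centered data matrix; no covariance matrix is ever formed.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

principal_directions = Vt              # rows of Vt are the u_i
eigenvalues = s**2 / Xc.shape[0]       # eigenvalues of the covariance matrix S

# Cross-check against the (less stable) covariance route.
S = (Xc.T @ Xc) / Xc.shape[0]
print(np.allclose(np.sort(np.linalg.eigvalsh(S)), np.sort(eigenvalues)))  # True
```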

$$Q = [\,u_1, u_2, \ldots, u_d\,]$$

We have the principal directions; let us assemble them as the columns of a matrix called $Q$. The
vectors $u_1, u_2, \ldots, u_d$ are all orthogonal and normalized, so $Q$ is an orthogonal matrix.

It is also important to avoid any misunderstanding when calling this an orthonormal matrix: it
simply means that the columns are both orthogonal and normalized. When we talk about these
matrices, the columns will always be normalized; there is no reason not to do that. So when I refer
to an orthogonal matrix, its columns will be orthogonal and normalized.


How do we do the dimensionality reduction?

Remember that each of these eigenvectors (the $u$'s) comes with an eigenvalue. Let us assume
that we have sorted the eigenvalues from large to small, so the small eigenvalues are last on the
list; if we have $d$ columns, then $u_d$ belongs to the smallest eigenvalue. I mentioned before
that an eigenvalue measures the spread in a particular direction. So, if some eigenvalues are small,
or perhaps 0, it means that we can safely discard those dimensions. We now form a new matrix
and call it $Q_\nu$:

$$Q_\nu = [\,u_1, u_2, \ldots, u_\nu\,].$$

This means that we keep only the first $\nu$ columns of the original matrix, with the assumption
that $\lambda_{\nu+1}$ and all later eigenvalues are very small. Now, the question is how to
calculate the projections onto these directions.

The projections are given by the following:

$$y = Q_\nu^T (x - \bar{x}).$$

Let us say the dimension of the data is $d$; then $Q_\nu$ is a $d \times \nu$ matrix, $Q_\nu^T$ is a
$\nu \times d$ matrix and $x - \bar{x}$ is a $d \times 1$ vector, so the result $y$ is a $\nu \times 1$
vector. This means that $y$ has only $\nu$ components, whereas the original had $d$ components.
If $\nu < d$, we have dimensionality reduction. If all the eigenvalues are significant, of course,
$\nu$ will be equal to $d$ and we do not get any dimensionality reduction.

To restore the original data, we cannot really invert this equation, because $Q_\nu$ is not a square
matrix. Since its columns are orthonormal we have $Q_\nu^T Q_\nu = I_\nu$, but

$$Q_\nu Q_\nu^T \neq I_d ,$$

so $Q_\nu^T$ is not a genuine inverse.

The final step is to do the inversion. If we want to project an observation, we have a formula for
that, but how can we recover the original from the projection? We cannot recover it completely,
because where there is a small spread (i.e. where the eigenvalue is small) there is information
loss. What we do is use the generalized inverse, which gives the following reconstruction:


$$x \approx \bar{x} + Q_\nu\, y .$$
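A small sketch of the projection and approximate reconstruction (assuming NumPy and reusing the SVD idea from above; the data and the choice of $\nu$ are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 3-D data that is essentially 2-dimensional.
latent = rng.normal(size=(300, 2)) @ np.array([[2.0, 0.0, 1.0],
                                               [0.0, 1.0, -1.0]])
X = latent + 0.01 * rng.normal(size=(300, 3))   # tiny noise in the third direction

x_bar = X.mean(axis=0)
Xc = X - x_bar
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)

nu = 2                                # keep the first two principal directions
Q_nu = Vt[:nu].T                      # d x nu matrix of principal directions

Y = Xc @ Q_nu                         # projections: y = Q_nu^T (x - x_bar), row-wise
X_hat = x_bar + Y @ Q_nu.T            # reconstruction: x ~ x_bar + Q_nu y

print(np.abs(X - X_hat).max())        # small reconstruction error
```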

In our next lesson, we are going to start exploring linear discriminant analysis. The LDA problem is
very different from PCA, as it is closely related to the classification problem, which will be
discussed later in the module.


Unit 3: Linear Discriminant Analysis (Part 1)

In this video, we discuss linear discriminant analysis (LDA).

The LDA problem is very different from PCA. It is more closely related to the classification
problem, which we will discuss later. In the case of LDA, we start at the same place as before,
where we are given $N$ observations, but the crucial difference is that each observation comes
with a label. That label indicates which class the observation belongs to.

One of the applications of LDA, which we will look at later, is handwritten digit recognition: given
an image of a digit, the computer must decide which digit from 0 to 9 it is, so we have 10 different
classes and want to assign each handwritten digit to one of those 10 classes. Let me illustrate with
an example: there are two separate classes, one indicated with crosses and the other with circles.
If we had not known the labels, we would not have been able to do this, but with this indication we
know to which class each of the observations belongs.

In this example, by using dimensionality reduction, we want to project the data values to a lower
dimensional space, so that the classes have maximum separation in the lower dimensional space.
In the notes for this lecture, there is a picture in which the two classes are color-coded in blue and
red. Now we find a line: by using dimensionality reduction, we reduce the two-dimensional data to
one dimension, which can be represented as a line. We want to find a line such that, when the
data is projected onto it, we get maximum separation between the classes in the projected space.


How do we do that? We have already learnt how to do a projection. The direction we want to
project onto we call $w$. The projection of an observation $x_n$ will now simply be a number
$y_n$ in the projected space, because $w^T$ is a row vector: we multiply it with the column vector
$x_n$, and that gives us a number.

Note that we do not subtract the mean in this case. It is not necessary, because what we want to
find is a direction and the line on which we project will go through the origin. This is the general
projection formula:

$$y_n = w^T x_n .$$

We can write down the mean for each of the two clusters. For cluster 1, with $N_1$ observations,
we get the following:

$$m_1 = \frac{1}{N_1}\sum_{x_n \in C_1} x_n$$

For cluster 2, we have a similar formula. We can also project the means. To denote the projected
means, I am going to put an overbar on $y$: $\bar{y}_1$ is $w^T m_1$ and $\bar{y}_2$ is
$w^T m_2$.

$$\bar{y}_1 = w^T m_1, \qquad \bar{y}_2 = w^T m_2$$
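A minimal sketch of these quantities (assuming NumPy; the two small clusters and the candidate direction w are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical two-class data in 2-D.
C1 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
C2 = rng.normal(loc=[2.0, 1.0], scale=0.5, size=(50, 2))

m1, m2 = C1.mean(axis=0), C2.mean(axis=0)   # class means

w = np.array([1.0, 0.0])                    # some candidate projection direction
w = w / np.linalg.norm(w)                   # keep it a unit vector

y1 = C1 @ w                                 # projections y_n = w^T x_n for class 1
y2 = C2 @ w                                 # projections for class 2

print(w @ m1, y1.mean())                    # projected mean equals mean of projections
print(w @ m2, y2.mean())
```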

Now, we need to find the direction in which to project, and we must translate that into
mathematics. The first thing we might try is to find $w$ such that the difference between the two
projected means (the difference between $\bar{y}_2$ and $\bar{y}_1$) is a maximum. That can be
written as the following:

$$\bar{y}_2 - \bar{y}_1 = w^T (m_2 - m_1).$$

Can we now find 𝑤 so that this distance is a maximum?

Again, we are only interested in the direction of 𝑤, not in its magnitude. If we want to maximize
this, we must add the Lagrange multiplier. After adding the Lagrange multiplier, we will have the
quantity that we must maximize.


How do we maximize this quantity? We take the derivative with respect to $w$ and set it equal
to 0. This gives:

$$m_2 - m_1 + 2\lambda w = 0 .$$

Now we find that $w = -\frac{1}{2\lambda}(m_2 - m_1)$. The $\lambda$ is not important and the
minus sign is also not important. What we see is that, if we maximize the distance between
$\bar{y}_2$ and $\bar{y}_1$, then the projection direction points along the direction from $m_1$
to $m_2$; it is just the vector that connects the two means.

Let us go back to the original illustration and put in the two means. We connect them, and that is
the direction in which we project. The projection is perpendicular to this line; if we project, for
instance, a value from the cross class and a value from the circle class, we can already see that
there will be an overlap between the two classes. This is quite clear from the image in the lecture
notes.

That is not a particularly good choice. The question is: how can we improve on this?

If you think about it for a moment, it should be quite clear that, if we project in a slightly different
direction, we can get a much better separation between the two classes. The question remains:
how can we write down a formula for this new direction? What property makes this particular
direction special?


You will realize that it is not a bad idea to maximize the separation between the two means. But,
at the same time, we could also consider projecting in the direction that ensures the smallest
possible scatter within each of the two projected classes. We want the projected values of each
class to cluster as tightly as possible.

In the next lecture, we will take a closer look at within-class scatter and how it is used in LDA.


Unit 4: Linear Discriminant Analysis (Part 2)

In this video, we will be carrying on from where we left off in the first LDA lecture. We will explore
a mathematical way to not only maximize the difference between the means, but also minimize the
within-class scatter. Let’s get started.

Within-class scatter simply means the spread of each of the two classes after they have been
projected. We must write down formulas for this. First, recall the formulas for the projected
means, which we had before: these were $\bar{y}_1$ and $\bar{y}_2$, and we have formulas for
them in terms of the original means. We now want to calculate the scatter.

For the first class, let us call the scatter $S_1^2$: we sum, over all the observations in class 1, the
projected value minus the projected class mean, squared:

$$S_1^2 = \sum_{y_n \in C_1} (y_n - \bar{y}_1)^2$$

We have a similar formula for $S_2^2$, the scatter of cluster 2:

$$S_2^2 = \sum_{y_n \in C_2} (y_n - \bar{y}_2)^2$$

Now that we have measures of the scatter, we want to simultaneously maximize the distance
between the two projected means and minimize the within-class scatter:

$$J(w) = \frac{(\bar{y}_2 - \bar{y}_1)^2}{S_1^2 + S_2^2} .$$

How can this be done simultaneously? By writing down the ratio above. How do we maximize $J$
as a function of $w$? It will be a maximum when the numerator is large and the denominator is
small. That is exactly what we want: a large distance between the projected means and a small
within-class scatter.

What we do here is simultaneously maximize the one quantity and minimize the other quantity. All
we need to do is rewrite those equations in terms of the original data:


$$J(w) = \frac{w^T S_B\, w}{w^T S_W\, w} .$$

It is a fairly straightforward calculation to show the following:

$$S_B = (m_2 - m_1)(m_2 - m_1)^T$$

$$S_W = \sum_{x_n \in C_1} (x_n - m_1)(x_n - m_1)^T + \cdots$$

I'll leave it as an exercise for you to complete the equation, or you can have a look at the notes.
We want to maximize $J(w)$ with respect to $w$. Again, we are only interested in the direction of
$w$; it need not be a unit vector, and we do not need to add a Lagrange multiplier. If we scale $w$,
say we multiply $w$ by a non-zero constant, the constant appears in both the numerator and the
denominator and simply divides out. It is only the direction of $w$ that influences $J(w)$. Now we
take the derivative of $J(w)$ with respect to $w$ and set it equal to 0:


¯ ¯

$$\nabla J(w) = 0 .$$

After some algebra this becomes the following:

$$(w^T S_W\, w)\, S_B\, w = (w^T S_B\, w)\, S_W\, w .$$

This is the equation that we need to solve for $w$. It looks complicated because of the non-linear
terms. Now comes a bit of magic that I'll show you. It is important to keep in mind that we are
only interested in the direction of $w$.

We therefore cancel all scalars. Since $w^T S_W w$ is a number, we discard it, and the same with
$w^T S_B w$. What we now find is that $S_W w$ is proportional to $S_B w$:

$$S_W\, w \propto S_B\, w$$

© 2019 - WorldQuant University – All rights reserved.


18
MScFE 650 Machine Learning in Finance - Module 2: Unit 4 Video Transcript

This equation has some meaning: since $S_B w = (m_2 - m_1)(m_2 - m_1)^T w$ is just a scalar
multiple of $(m_2 - m_1)$, if we ignore $S_W$ on the left-hand side for a moment, $w$ points in the
direction of the vector between the two original means. This is what we had before. Now, however,
we have $S_W w$ on the left-hand side:

$$S_W\, w = m_2 - m_1 .$$

This modifies the direction between the two means and determines the optimal direction in which
to project. If we go to more classes and higher dimensions, things tend to become a bit more
technical. If you are interested in the technical details, I suggest you read through the notes; all
the mechanics are explained there. I am not going to work through all the technical details. What
is most important is that you understand the principle behind LDA.
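A minimal sketch of solving for this direction in the two-class case (assuming NumPy; the clusters are made up, and np.linalg.solve is used for $S_W w = m_2 - m_1$ rather than any particular library's LDA implementation):

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical two-class 2-D data with correlated within-class spread.
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
C1 = rng.multivariate_normal([0.0, 0.0], cov, size=100)
C2 = rng.multivariate_normal([2.0, 0.5], cov, size=100)

m1, m2 = C1.mean(axis=0), C2.mean(axis=0)

# Within-class scatter matrix S_W (sum over both classes).
S_W = (C1 - m1).T @ (C1 - m1) + (C2 - m2).T @ (C2 - m2)

# Optimal projection direction: solve S_W w = m2 - m1, then normalize.
w = np.linalg.solve(S_W, m2 - m1)
w = w / np.linalg.norm(w)

# Projected class means are now well separated relative to the within-class scatter.
y1, y2 = C1 @ w, C2 @ w
print(abs(y2.mean() - y1.mean()) / np.sqrt(y1.var() + y2.var()))
```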

In summary, we started with given data, and each data value has a label that assigns it to a
specific class. The purpose of LDA is to project those data values to a lower-dimensional space so
as to have maximum class separation within the projected space. The best way to achieve this is
to simultaneously maximize one quantity, the between-class scatter, and minimize the other, the
within-class scatter. Those ideas can be carried over to higher dimensions. It becomes a little more
technical to get the details right, but that is the basic idea behind applying LDA in higher
dimensions.

We will also take you through a notebook with exercises in which you compare the projections of
PCA and LDA; the conclusion will inevitably be that the LDA projection is much better than PCA at
preserving class separation in the projected space. LDA is a very powerful method and is often
used when you are in very high dimensions and need to project to lower dimensions.

In this video we explored a mathematical way to not only maximize the difference between the
means, but also minimize the within-class scatter.


Unit 5: Academic Papers in Review

In this video, we review some academic papers to show some applications of dimensionality
reduction techniques used in the financial industry.

Alpha design

Most of us should be familiar with the typical alpha design strategies used, at least, by traditional
quantitative investment firms. The two that jump out immediately are factor-based investment
strategies and statistical arbitrage.

Factor-based investment strategies

The two textbooks that have popularized this investment strategy are: “Active Portfolio
Management” by Grinold & Kahn and “Quantitative Equity Portfolio Management” by Chincarini
& Kim.

$$r_j = a_j + b_{j1} F_1 + b_{j2} F_2 + \cdots + b_{jn} F_n + \epsilon_j$$

In this type of strategy, expected returns are modeled as a regression on various factor premiums
such as value, size, and momentum. Once the regression is fitted, the factor loadings are
computed. These factor loadings are very useful in risk modeling, as well as in stock selection for
forming portfolios. In the case of stock selection, one thing that can be done is to create
statistical factors using dimensionality reduction techniques like PCA. (Note that PCA is a linear
technique.)

We may want to find the 𝑘 different factors that explain 95% of the variance, and use those as
statistical features in our model. More about this technique can be read on page 222 of
Quantitative Equity Portfolio Management.
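As a rough sketch of selecting the number of statistical factors by explained variance (assuming NumPy and scikit-learn; the returns matrix here is random noise standing in for a real panel of asset returns):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
# Hypothetical daily returns: 500 days x 20 assets (made-up data).
returns = rng.normal(scale=0.01, size=(500, 20))

pca = PCA()
pca.fit(returns)

# Smallest k whose principal components explain at least 95% of the variance.
cum_var = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(cum_var, 0.95) + 1)
print(k, cum_var[k - 1])

# The first k component scores can then be used as statistical factors.
factors = PCA(n_components=k).fit_transform(returns)
```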

Another good source to look at is the Quantopian lecture series, in particular their notebook on
PCA. In this example, they apply PCA to 10 stocks, 5 tech and 5 gold-mining: IBM, MSFT, FB, T,
INTC, ABX, NEM, AU, AEM and GFI.


They apply PCA to the portfolio and notice that 79% of the variance can be explained by using just
the first two principal components.

An interesting paper that expands on using PCA to reduce the dimensionality of the features, and
then fit an artificial neural network on those features, is "Forecasting Daily Stock Market Return
Using Dimensionality Reduction" by Xiao Zhong and David Enke. This paper is publicly available
from ResearchGate.

Statistical arbitrage: finding valid pairs

Pairs trading is a common and popular technique used in quantitative investing. One of the
challenges faced, though, is to find pairs of assets that exhibit similar features. To solve this, we
can use dimensionality reduction techniques like t-SNE, a non-linear method.

I highly recommend that students go through the notebook provided by Quantopian, which shows
in great detail a concrete example of using machine learning techniques to find pairs. It shows how
PCA is used in combination with fundamental features to cluster stocks using the DBSCAN
clustering algorithm. Afterwards, t-distributed Stochastic Neighbor Embedding, or t-SNE, is
applied to help visualize the validated clusters in two-dimensional space.
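A rough sketch of that pipeline (assuming NumPy and scikit-learn; the feature matrix is random and the parameter choices are illustrative only, not those of the Quantopian notebook):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN
from sklearn.manifold import TSNE

rng = np.random.default_rng(5)
# Hypothetical features per stock: e.g. return history plus a few fundamentals.
features = rng.normal(size=(60, 40))        # 60 stocks, 40 raw features

X = StandardScaler().fit_transform(features)
X_pca = PCA(n_components=10).fit_transform(X)   # compress to 10 components

labels = DBSCAN(eps=3.0, min_samples=3).fit_predict(X_pca)  # -1 marks noise
print("clusters found:", set(labels) - {-1})

# 2-D embedding of the PCA features, used purely for visual inspection.
embedding = TSNE(n_components=2, perplexity=10.0).fit_transform(X_pca)
print(embedding.shape)                      # (60, 2)
```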


In the figure shown in the video, we can see the time series of the stocks contained in cluster 2.
Look at how similar in nature these stocks are. The next step is to validate the cointegration
relationship.

Here is a link to a final project completed by a group of Stanford graduates, implemented using
Quantopian's platform.¹

¹ MS&E448 Project: Statistical Arbitrage. Carolyn Soo (csjy), Zhengyi Lian (zylian), Hang Yang (hyang63), Jiayu Lou (jiayul).


Statistical arbitrage: model-based approach

A very popular paper in the field of statistical arbitrage is by Marco Avellaneda and Jeong-Hyun
Lee and is called “Statistical Arbitrage in the U.S. Equities Market”.

Whilst exploring this topic I found a great repo from the Big Financial Data for Algorithmic Trading
course taught at Stanford University. I highly recommend you browse some of the final projects
presented by students.

In this video we reviewed some examples of dimensionality reduction techniques used in finance. I
would highly recommend studying the examples on statistical arbitrage more closely. It also gives
you a good opportunity to give some thought to what you would like to do your capstone project
on.

