TT18 Dissertation Vu 0

Risk Parity with Constrained
Gaussian Mixture Models
Kellogg College
University of Oxford
A thesis submitted for the degree of

Master of Science
Trinity 2018
This thesis is dedicated to my parents.
Acknowledgements
A dissertation submitted to the University of Oxford in accordance with the

requirements of the degree of Master of Science in the Department of Math-
ematics. It has not been submitted for any other degree or diploma of any
examining body. Except where specifically acknowledged, It is all the work of
the Author.
Abstract
We study and implement the equivalent risk contribution portfolio using ex-
pected shortfall as the risk measure and Gaussian mixture to model the asset
returns as proposed by Roncalli et al. in [15]. In order to estimate the models
parameters, we study the constrained Gaussian Mixture Model framework from
another separate paper by Ari [1]. We compare this model with other traditional
approach including equivalent risk contribution portfolio using volatility as the
risk measure and the well known mean-variance portfolio. All algorithms and
backtests are implemented in Python and the code is listed in Appendix C.
Contents
1 Introduction 1
1.1 Objective and Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Diversification and Skewness 3

2.1 Introduction to diversification . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Skewness aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3 Constrained Gaussian Mixture Models 10

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2 Parameters estimation with simple case . . . . . . . . . . . . . . . . . . . . 10
3.3 Exponential family form of Gaussian Mixture Models . . . . . . . . . . . . . 13
3.4 Generalized Expectation-Maximization . . . . . . . . . . . . . . . . . . . . . 15
3.4.1 Bound on the log-likelihood . . . . . . . . . . . . . . . . . . . . . . . 15
3.4.2 Expectation step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.4.3 Primal problem for the Maximization step . . . . . . . . . . . . . . . 17
3.4.4 Dual problem for the Maximization step . . . . . . . . . . . . . . . . 19
3.5 Constrained Gaussian Mixture Models . . . . . . . . . . . . . . . . . . . . . 22
3.5.1 Problem definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.5.2 Primal parameterizations for the Maximization step . . . . . . . . . 23
3.5.3 Dual parameterizations for the Maximization step . . . . . . . . . . 24
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4 Risk Parity with Expected Shortfall and Gaussian Mixture 27

4.1 Introduction to Risk Allocation . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1.1 Risk budgeting portfolio . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1.2 Existence and uniqueness of solution . . . . . . . . . . . . . . . . . . 27
4.2 Risk parity with skewness . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.2.1 Jump risk and skewness risk . . . . . . . . . . . . . . . . . . . . . . . 28
i
4.2.2 Expected shortfall risk measure . . . . . . . . . . . . . . . . . . . . . 31
4.2.3 Existence and uniqueness of the portfolio . . . . . . . . . . . . . . . 33
4.3 Results analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.3.2 Stressed regime parameters calibration . . . . . . . . . . . . . . . . . 38
4.3.3 Filtering algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3.4 Backtesting results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3.5 Risk premia portfolio . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5 Conclusion and Future Work 50
A Mathematics Supplementary 51
A.1 Convex Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
A.2 Generalized Exponential Family of Distributions . . . . . . . . . . . . . . . 52
A.3 Multinomial Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
A.4 Gaussian Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
B Risk Allocation Supplementary 60

B.1 Risk measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
B.2 Risk contribution and Euler allocation principle . . . . . . . . . . . . . . . . 61
C Code Listing 63
Bibliography 80
ii
List of Figures
2.1 The skewness coefficient γ1 (X + Y ) when the random vector (X, Y ) is log-
normal. σX = 0.5, σY = 0.5, γ1 (X) = 1.8, γ1 (Y ) = 1.8 . . . . . . . . . . . . . 7
normal. σX = 1.0, σY = 1.0, γ1 (X) = 6.2, γ1 (Y ) = 6.2 . . . . . . . . . . . . . 8
normal. σX = 1.0, σY = 0.5, γ1 (X) = 6.2, γ1 (Y ) = 1.8 . . . . . . . . . . . . . 8
normal. σX = 1.0, σY = 1.5, γ1 (X) = 6.2, γ1 (Y ) = 33.5 . . . . . . . . . . . . 9
4.1 Skewness coefficient γ1 (Y ) as the function of volatility σ1 (Y ) . . . . . . . . 30

4.2 Skewness coefficient γ1 (Y ) as the function of volatility σ2 (Y ) . . . . . . . . 30
4.3 Cumulative PnL of bonds, equities and volatility carry strategies . . . . . . 38
4.4 Markowitz maximum Sharpe portfolio weights with constraint σ <= 0.05 . 42
4.5 ERC portfolio weights using volatility . . . . . . . . . . . . . . . . . . . . . 42
4.6 ERC portfolio weights using expected shortfall (ES) with GMM . . . . . . . 43
4.7 Weight turnover comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.8 Cumulative PnL comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.9 Year on year return comparison . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.10 Markowitz maximum Sharpe portfolio weights with constraint σ <= 0.09 for
the risk premia portfolio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.11 ERC portfolio weights using volatility for the risk premia portfolio . . . . . 47
4.12 ERC portfolio weights using expected shortfall (ES) with GMM for the risk
premia portfolio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.13 Weight turnover comparison of risk premia portfolio . . . . . . . . . . . . . 48
4.14 Cumulative PnL comparison of risk premia portfolio with different asset al-
location models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.15 Year on year return for risk premia portfolio comparison . . . . . . . . . . . 49
iii
List of Tables
2.1 Skewness sensitivities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
4.1 Parameters estimation of µ̃ under different historical periods with λ = 0.02 39

4.2 Correlation matrix under normal and stressed regime under different histor-
ical periods with λ = 0.02 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3 Performance metrics comparison between different asset allocation algorithms 41
4.4 Performance metrics comparison with jump probabilities . . . . . . . . . . . 45
4.5 Performance metrics of risk premia portfolio between different asset alloca-
tion algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
iv
Listings
C.1 Markowitz mean-variance optimisation function . . . . . . . . . . . . . . . . 63

C.2 Objective functions and Jacobian functions for volatility based risk parity
and expected-shortfall based risk parity with Gaussian mixture models . . . 64
C.3 Constrained Expectation Maximization Algorithm . . . . . . . . . . . . . . 68
C.4 Function to calculate the cumulative PnL . . . . . . . . . . . . . . . . . . . 72
v
Chapter 1
Introduction
1.1 Objective and Contributions

Traditionally, the concept of diversification has been associated with risk optimisation and
multi-asset classes. The modern portfolio theory formulated by Markowitz places the em-
phasis on the trade-off between expected return and risk. While expected return is the linear
combination of expected returns of each asset in the portfolio, the portfolio’s volatility is a
convex combination of the assets volatilities, which means that the risk of the portfolio is
lower than the sum of individual risks. This asymmetry has made correlation become the
central parameter in this traditional approach and reduced the concept of volatility diver-
sification to the concept of volatility optimisation. Over the last few years, factor investing
whose purpose is to capture the excess return from market anomalies and systematic risk
factors in equities and alternative risk premia which is an extension of factor investing to all
asset classes have become very popular. Risk parity allocation model which uses volatility
as the risk measure and diversifies by allocating the same amount of risk across the assets
has been the top choice, due to its simplicity and robustness, to control the risk exposure to
these systematic risk factors. By ignoring the return dimension, risk parity using volatility
attempts to remove the instability of Markowitz’s theory which has the caveat that only the
small inaccuracy in forecasted return will result in a very different solution [7]. However,
volatility is not the best metric to assess the risk of the asset since it does not adapt to
assets with high jump risk. And jump risk is related to skewness risk. An example is the
volatility carry strategy, which has a very low volatility but a high skewness. Hence, the
objective of this thesis is to study the use of expected shortfall in the risk parity frame-
work. Simply replacing volatility with expected shortfall as risk measure does not work well
since the model still assumes that the dependence behaviour of the portfolio components is
unchanged during gap events. i.e. it is still using the spot covariance matrix to estimate
expected shortfall. Therefore, we will study another approach proposed by Roncalli [15] to
overcome this issue in which asset returns are modelled using a Gaussian mixture model
1
with two states: a normal state and a stressed state where the jumps occur. The expected
shortfall and marginal risk contribution from this model exist in analytical form. When
we take into account skewness, another problem arose is the aggregation of skewness risk
premia which we will also study. In particular, investors are exposed to a risk of large
drawdown when they mix strategies with high skewness together. Linear correlation is not
the right statistical tool that we can rely on anymore. Hence, this risk is difficult to mitigate
by volatility diversification.
The main contributions of this thesis are as follows. First, we study the skewness
based risk parity model proposed by Roncalli et al.[15]. In this paper, the author uses
the Expectation-Maximisation (EM) algorithm to estimate the parameters of the Gaussian
mixture model. The difficulty comes when we first tried to implement this algorithm because
when various constraints are introduced on the parameters, the author does not specify
the methods to estimate the model parameters under these constraints. Hence the next
contribution is the study and implementation of a constrained Gaussian mixture model
framework proposed by Ari in [1]. The novelty comes from utilizing this framework to
implement and study how the asset allocation algorithm in [15] behaves with different
parameters constraints. Finally, we backtest various portfolios and compare the result with
the traditional volatility based risk parity approach and the mean variance approach.
1.2 Organization of the Thesis

In Chapter 2, we study the skewness aggregation problem as discussed by Hamdan et al.[9]
showing that unlike volatility diversification, there is no guaranteed monotonic relationship
between correlation and skewness coefficients. Even though negative correlation can reduce
volatility, it can in some cases increase the tail risks of the portfolio in relative terms.
In Chapter 3, we study and implement the constrained Gaussian mixture model frame-
work by Ari in [1] which allows prior information to be incorporated in the form of convex
constraints on the parameters. The information and the source parametersations of the
Gaussian mixture model are studied and their relationship is shown using the convex du-
ality theory. From that, convex primal and dual problems for the Maximisation step for
adding convex constraints on the parameters are derived.
In Chapter 4, we study and implement the equal risk contribution (ERC) portfolio using
expected shortfall and Gaussian mixture model for assets returns as proposed by Roncalli
et al.[15]. Combing with the framework we studied in Chapter 3, we can calibrate the
parameters and study the behaviour of this asset allocation model. We study the drawbacks
of the traditional volatility based ERC portfolio and see whether the new model adds any
value.
In Chapter 5, we summarize our conclusions and plans for future work.
2
Chapter 2
Diversification and Skewness
2.1 Introduction to diversification

We know that diversification is a concept that is extensively applied in portfolio manage-
ment. The main idea is based on the convexity property of risk measure1 :
R(X + Y ) ≤ R(X) + R(Y ) (2.1)
where R is the risk measure of portfolios X and Y . In the case if we use volatility, σ, as
the risk measure:
p
σ(X + Y ) = σ 2 (X) + σ 2 (Y ) + 2ρ(X, Y )σ(X)σ(Y )
(2.2)
≤ σ(X) + σ(Y )
where ρ(X, Y ) is the correlation between X and Y . Hence, one way to minimize volatility
is to select assets that have low or negative correlations. However, correlation as the de-
pendence measure is only sensible when we assume Gaussian distribution of asset returns,
which is the challenge when constructing portfolio with alternative risk premia due to the
high negative skewness of the strategy return.
2.2 Skewness aggregation

The aggregated skewness γ1 (X + Y ) depends on the individual skewness γ1 (X) and γ1 (Y ).
To understand the problem of skewness diversification, we study this complex dependence.
Let X, Y, Z be random variables. Let us denote µn (X) as the nth central moment of X. We
have:
µ2 (X + Y ) = µ2 (X) + µ2 (Y ) + 2cov(X, Y ) (2.3)
and

µ3 (X + Y ) = µ3 (X) + µ3 (Y ) + 3 cov(X, X, Y ) + cov(X, Y, Y ) . (2.4)
1
We refer the readers to Appendix B for the formal definition of risk measure.
3

In addition, we denote cov(X, Y, Z) = E (X − E[X])(Y − E[Y ])(Z − E[Z]) . We can then
define the skewness γ1 (X + Y ) and coskewness γ1 (X, Y, Z) as:
µ3 (X + Y )
γ1 (X + Y ) = 3/2
(2.5)
µ2 (X + Y )
and
cov(X, Y, Z)
γ1 (X, Y, Z) = . (2.6)
σ(X)σ(Y )σ(Z)
Proposition 1. The skewness coefficient of X + Y is close to the skewness of X if σ(X)
σ(Y ) given X is independent of Y . In the case of dependent random variables, the coskew-
ness coefficients are the functions of the correlations ρ(X 2 , Y ) and ρ(X, Y 2 ).
Proof. Firstly, in the case of independent random variables, we have:
µ2 (X + Y ) = σ 2 (X) + σ 2 (Y ) (2.7)
and
µ3 (X + Y ) = µ3 (X) + µ3 (Y ) . (2.8)
eq. (2.5) becomes:
µ3 (X) + µ3 (Y )
γ1 (X + Y ) = 3/2
σ 2 (X)
+ σ 2 (Y )
(2.9)
σ 3 (X) σ 3 (Y )
= γ1 (X) 3/2 + γ1 (Y ) 3/2 .
σ 2 (X) + σ 2 (Y ) σ 2 (X) + σ 2 (Y )
From this, we can see that γ1 (X + Y ) ≈ γ1 (X) if σ(X) σ(Y ).

In the general case where there is dependency between random variables, eq. (2.5) becomes:
σ 3 (X) σ 3 (Y )
γ1 (X + Y ) = γ1 (X) + γ 1 (Y )
σ 3/2 (X + Y ) σ 3/2 (X + Y ))
(2.10)
3 cov(X, X, Y ) + cov(X, Y, Y )
+ .
σ 3/2 (X + Y )
Using eq. (2.6), we can deduce that:
σ 3 (X) σ 3 (Y )
γ1 (X + Y ) = γ1 (X) + γ 1 (Y )
σ 3/2 (X + Y ) σ 3/2 (X + Y )
(2.11)
3σ 2 (X)σ(Y ) 3σ(X)σ 2 (Y )
+ γ1 (X, X, Y ) 3/2 + γ1 (X, Y, Y ) 3/2 .
σ (X + Y ) σ (X + Y )
Hence, the skewness of the sum is a weighted average of skewness and coskewness coefficients.
4
Furthermore, if E[X] = E[Y ] = 0, we can obtain that:
cov(X, X, Y )
γ1 (X, X, Y ) =
σ(X)2 σ(Y )
E X Y − X 2 E[Y ] − 2XY E[X] + 2XE[X]E[Y ] + E[X]2 Y − E[X]2 E[Y ]

=
σ(X)2 σ(Y )
cov(X 2 , Y ) − 2cov(X, Y )E[X]
=
σ(X)2 σ(Y )
= ρ(X 2 , Y ) .
(2.12)
Similar result can be obtained for γ1 (X, Y, Y ). This has shown that the coskewness coeffi-
cients are the function of the correlations ρ(X 2 , Y ) and ρ(X, Y 2 ).
One direct consequence of the above is that µ3 (X + Y ) is a decreasing function of

the correlation ρ(X, Y ), but an increasing function of correlations ρ(X 2 , Y ) and ρ(X, Y 2 ).
Hence, γ1 (X + Y ) is a decreasing function of ρ(X, Y ).
Effect ρ(X, Y ) ρ(X 2 , Y ) ρ(X, Y 2 )

σ(X + Y ) + 0 0
µ3 (X + Y ) − + +
γ1 (X + Y ) − + +
Table 2.1: Skewness sensitivities
However, skewness aggregation is complex since we have not accounted for the inter-
dependence between ρ(X, Y ), ρ(X 2 , Y ), ρ(X, Y 2 ). To illustrate this, we study an example
where the analytical solutions exist for γ1 (X + Y ) [9].
Proposition 2. Let (X, Y ) be the random vector that follows a bivariate log-normal dis-
tribution, i.e., ln X ∼ (µX , σX ) and ln Y ∼ (µY , σY ), the skewness of the sum X + Y is
µ3 (X+Y )
γ1 (X + Y ) = 3/2 and
µ2 (X+Y )
exp(ρσX σY ) − 1
ρX,Y = q q (2.13)
exp(σX2 ) − 1 exp(σ 2 ) − 1
Y
2 σY2
cov(X, X, Y ) = exp 2µX + σX + µY + × (exp(ρσX σY ) − 1)
2 (2.14)
2 2
× (exp(σX + ρσX σY ) + exp(σX ) − 2)
2
σX
cov(X, Y, Y ) = exp 2µY + σY2 + µX + × (exp(ρσY σX ) − 1)
2 (2.15)
× (exp(σY2 + ρσY σX ) + exp(σY2 ) − 2)
5
where ρ denotes the correlation between lnX and lnY .
Proof. We recall from properties of log-normal distribution that:

2
σX
E[enX ] = E[X n ] = exp nµX + n2 (2.16)
2
with n ≥ 1. Hence we have that
µ2 (X) = E[X 2 ] − E[X]2

(2.17)
2 2
= exp(2µX + σX ) × (exp(σX ) − 1)
and
µ3 (X) = E[(X − E[X])3 ]

= E X 3 − 3X 2 E[X] + 3XE[X]2 − E[X]3

(2.18)
2
σX 2 2

= exp 3µX + 3 × exp(3σX ) − 3 exp(σX )+2 .
2
Furthermore, let Z = n ln X + m ln Y , we have:
σZ2
E[eZ ] = exp(µZ + ) (2.19)
2
where
µZ = nµX + mµY (2.20)
and
σZ2 = n2 σX
2
+ m2 σY2 + 2mnρσX σY . (2.21)
We obtain that:
E[X n Y m ] = E[eZ ]
n2 σX
2 + m2 σ 2 + 2mnρσ σ (2.22)
Y X Y
= exp(nµX + mµY + ).
2
It follows that:
cov(X, Y )
ρX,Y =
σX σY
E[XY ] − E[X]E[Y ]
=q q
2 ) × (exp(σ 2 ) − 1) exp(2µ + σ 2 ) × (exp(σ 2 ) − 1
exp(2µX + σX X Y Y Y
2 2
σX σY (2.23)
exp(µX + µY + 2 +q 2 ) × (exp(ρσq
X σY ) − 1)
= 2 2
σX σY 2 2
exp(µX + µY + 2 + 2 ) exp(σX ) − 1 exp(σY ) −1
exp(ρσX σY ) − 1
=q q .
exp(σX2 ) − 1 exp(σ 2 ) − 1
Y
6
3.5
3.0
γ1 (X + Y )
2.5
2.0
1.5
−0.75 −0.50 −0.25 0.00 0.25 0.50 0.75 1.00

ρX,Y
Figure 2.1: The skewness coefficient γ1 (X+Y ) when the random vector (X, Y ) is log-normal.
σX = 0.5, σY = 0.5, γ1 (X) = 1.8, γ1 (Y ) = 1.8
From eq. (2.12) and eq. (2.22), we also obtain:
cov(X, X, Y ) = cov(X 2 , Y ) − 2cov(X, Y )E[X]

= E[X 2 Y ] − E[X 2 ]E[Y ] − 2E[X](E[XY ] − E[X]E[Y ])
σY2 (2.24)
2
= exp 2µX + σX + µY + × (exp(ρσX σY ) − 1)
2
2 2

× exp(σX + ρσX σY ) + exp(σX )−2 .
Similarly, we obtain the result for cov(X, Y, Y ).
2.3 Conclusion
We know that if we consider the volatility as the risk measure, there is a monotonic in-
creasing relationship between the correlation parameter and volatility. However, we can see
that there is no monotonic relationship between the skewness risk measure and the corre-
lation parameter ρX,Y . We can see this from figure 2.1 and figure 2.2 where the skewness
first decreases with the correlation and then increases. Moreover, when one skewness is
dominated by the other, as in the case in figure 2.3 and figure 2.4, the skewness decreases
monotonically as the correlation increases. This again contradicts the case when we use
volatility as the risk measure. Hence, from this simple example we can see that the problem
of skewness aggregation is difficult when the variables are correlated. Thus we can conclude
that any portfolio optimization techniques that rely solely on correlation and volatility as
the risk measure are in danger of being totally blind to the skewness as well as the convexity
7
6.5
6.0
γ1 (X + Y )
5.5
5.0
4.5
−0.4 −0.2 0.0 0.2 0.4 0.6 0.8 1.0

ρX,Y
σX = 1.0, σY = 1.0, γ1 (X) = 6.2, γ1 (Y ) = 6.2
8.0
7.5
7.0
γ1 (X + Y )
6.5
6.0
5.5
5.0
−0.6 −0.4 −0.2 0.0 0.2 0.4 0.6 0.8

ρX,Y
σX = 1.0, σY = 0.5, γ1 (X) = 6.2, γ1 (Y ) = 1.8
8
34
32
γ1 (X + Y )
30
28
26
−0.2 0.0 0.2 0.4 0.6 0.8

ρX,Y
σX = 1.0, σY = 1.5, γ1 (X) = 6.2, γ1 (Y ) = 33.5
and concavity of asset returns [14]. This motivates us to study the expected shortfall risk
measure in the asset allocation model in chapter 4.
9
Chapter 3
Constrained Gaussian Mixture

Models
3.1 Introduction
As mentioned in Chapter 1, we will study the risk parity asset allocation model where asset
returns are modelled by a mixture of two Gaussians. One is for the normal regime and the
other is for the stressed regime. The Expectation-Maximization (EM) algorithm is a very
popular algorithm to estimate the parameters for mixture of Gaussian models. It is based
on the maximum likelihood principle with the convergence to local maxima. However, EM
on its own is not very useful in our case since it does not allow us to impose constraints on
the parameters. In particular, we would like incorporate prior information of our model in
the form of constraints. For e.g. To study the model, we would like to impose constraints
on the jump intensity or on the returns and covariance matrices of assets during stressed
regime as the affine transformation of those in the normal regime. The original paper by
Roncalli et al.[15] does not specify how they solved the constrained maximum likelihood
problem when estimating the models parameters. Hence, in this chapter, we study the
recent research paper by Ari in [1] and apply the results to our asset allocation model in
the later chapter.
3.2 Parameters estimation with simple case

We first study the simple case of parameter estimation for the mixture of two Gaussians
model using this popular algorithm. Assume we have the set X = {x1 , . . . , xN } of observed
data with the following density function:
f (xn ) = (1 − λ)φ0 (xn ; µ1 , Σ1 ) + λφ0 (xn ; µ2 , Σ2 ) (3.1)
10
where φ0 (xn ; µ, Σ) is the probability density function (pdf) of the normal distribution with
the parameters (µ, Σ), xn ∈ Rd and 1 − λ, λ are the probabilities of xn belonging to the first
and the second Gaussian respectively. We would like to use maximum likelihood estimation
to estimate our parameters. Let us define:
θ = (λ, µ1 , Σ1 , µ2 , Σ2 ) . (3.2)
To find the estimate θ̂, we would like to maximize the log likelihood function:
N
X 2
X
`(θ) = ln λk φ(xn ; µk , Σk ) . (3.3)
n=1 k=1
The derivative of this function with respect to µk is:

N
`(θ) X λk φ(xn ; µk , Σk )
= P2 Σ−1
k (xn −µk ) . (3.4)
µk s=1 λ s φ(x n ; µ s , Σ s )
n=1
The first order condition implies that

N
X
λk,n Σ−1
k (xn −µk ) = 0 (3.5)
n=1
where we have defined

λk φ(xn ; µk , Σk )
λk,n = P2 . (3.6)
s=1 λs φ(xn ; µs , Σs )
Hence we obtain: PN
n=1 λk,n xn
µ̂k = PN
. (3.7)
n=1 λk,n
To find the derivative of `(θ) with respect to Σk , we first consider the pdf of multivariate
normal distribution as a function of Σ−1
k
1 1
g(Σ−1 exp − (xn −µk )T Σ−1

k )= d 1 k (x n −µ k )
(2π) |Σk |
2 2
2
1 (3.8)
|Σk |− 2 1
exp − trace Σ−1 T

= d k (x n −µ k )(x n −µ k ) .
(2π) 2 2
11
Hence we can deduce 1
1
∂g(Σ−1
k ) 1 |Σ−1
k |
−2
|Σ−1
k |Σk 1
exp − trace Σ−1 (xn −µk )(xn −µk )T

−1 = d k
∂Σk 2 (2π) 2 2
1
1 |Σ−1 |− 2 1
− (xn −µk )(xn −µk )T k d/2 exp − trace Σ−1 (xn −µk )(xn −µk )T

2 2 k
(2π)
Σk − (xn −µk )(xn −µk )T

1 1
exp − (xn −µk )T Σ−1

= 1 k (xn −µk ) .
(2π)d/2 |Σk | 2 2 2
(3.9)
It follows
N
∂`(θ) 1X λk φ(xn ; µk , Σk )
Σk − (xn −µk )(xn −µk )T .

−1 = P2 (3.10)
∂Σk 2 s=1 λs φ(xn ; µs , Σs )
n=1
The first order condition implies that

N
X
λk,n Σk − (xn −µk )(xn −µk )T = 0 .

(3.11)
n=1
We obtain that PN T
n=1 λk,n (xn −µ̂k )(xn −µ̂k )
Σ̂k = PN . (3.12)
n=1 λk,n
Next, let c be the Lagrange multiplier to ensure that the mixture probabilities λk sum to
unity, the first order condition implies that:
∂`(θ)
− c = 0. (3.13)
∂λj
It follows
N
X φ(xn ; µk , Σk )
P2 = c. (3.14)
n=1 s=1 λs φ(xn ; µs , Σs )
Hence, we have c = N and:

N
1 X
λk = λk,n . (3.15)
N
n=1
The problem here is that λk,n depends on the parameters µ, Σ that we try to estimate.
Hence the EM algorithm consists of an iterative process using the previous estimates of
µ, Σ to estimate the current parameters:
(0) (0) (0) (0) (0) (0)
1. Start with guesses about the mixture component µ1 , µ2 , Σ1 , Σ2 , λ1 , λ2 , t = 0.
1
We use:
∂|A|
= |A|(A−1 )T
∂A
∂ trace(AT B)
=B.
∂A
12
2. Using the current parameter guesses, calculate the weights λk,n (E-step)
(t) (t) (t)
(t) λ φ(xn ; µk , Σk )
λk,n = P k (t) (t) (t)
. (3.16)
2
s=1 λ s φ(x n ; µ s , Σs )
3. Using the current weights, maximize the weighted likelihood to get the new parameter
estimates (M-step)
PN (t)
(t+1) k=1 λk,n
λk = , (3.17)
N
PN (t)
(t+1) n=1 λk,n xn
µk = PN (t)
, (3.18)
n=1 λk,n
PN (t) (t) (t) T
(t+1) n=1 λk,n (xn −µk )(xn −µk )
Σk = P (t)
. (3.19)
n=1 λk,n
4. Iterate Step 2 and 3 until convergence.

(∞) (∞) (∞)
5. Finally, µ̂k = µk , λ̂k = λk , Σ̂k = Σk .
3.3 Exponential family form of Gaussian Mixture Models

We generalize our study using the Gaussian mixture models with K Gaussian components,
d dimensional random vector x ∈ Rd and a random variable y ∈ {1, . . . , K} following a
multinomial distribution with the joint distribution P (x, y|θ). It turns out that we can
express the joint distribution P (x, y|θ) in exponential family form [5]:
P (x, y|θ) = exp(θT T(x, y) − A(θ, y)) . (3.20)
To see that, let us first consider the marginal distribution of y, P (y|θy ). Since y follows
the multinomial distribution, from section A.3, we know that it can be expressed into
exponential family form as
P (y|θy ) = exp(θyT Ty (y) − A(θy )) (3.21)
where θy ∈ RK−1 is the natural parameters, Ty : Ωy → RK−1 is the sufficient statistics

function and A(θy ) is the log partition function. In addition, since x follows the Gaussian
distribution, using the result from section A.4, we can write the conditional distribution
P (x |y = k, θx |y=k ) as:
P (x |y = k, θx |y=k ) = P (x |θx |y=k )

(3.22)
= exp(θxT |y=k Tx (x) − A(θx |y=k )) k = 1, . . . , K
13
where θx |y ∈ Rd × K–d is the natural parameters. Tx : Rd → Rd × K+
d is the sufficient statis-
tics function and A(θx |y=k ) is the log partition function. Moreover, the joint distribution
P (x |y, θx |y ) can also be expressed as:
K
Y –1 PK−1
P (x |y, θx |y ) = P (x | θx|y=k )δ(y=k) P (x |θx |y=K )(1− i=1 δ(y=i))
k =1
YK
= P (x | θx|y=k )δyk
k =1
(3.23)
K
Y
= exp log P (x | θx|y=k )δyk
k =1
XK
= exp( δyk log P (x | θx|y=k ))
k =1
where δyk is the kth element of the vector δy of delta functions defined as:
K−1
X
δy = (δ(y = 0), . . . , δ(y = K − 1), (1 − δ(y = i)))T . (3.24)
i=1
Now, using eq. (3.22), we can rewrite eq. (3.23) into exponential family form:
K
X
T
P (x |y, θx |y ) = exp δyk (θx|y=k Tx (x) − A(θx|y=k ))
k =1
K
X K
X
T
= exp θx|y=k (δyk Tx (x)) − δyk A(θx|y=k )
k =1 k =1
K K
(3.25)
X X
T
= exp θx|y=k Tx |y=k (x, y) − δyk A(θx|y=k )
k =1 k =1
K
X
= exp(θxT |y Tx |y (x, y) − δyk A(θx|y=k ))
k =1
= exp(θxT |y Tx |y (x, y) − A(θx |y , y))
where we have defined:

T T
θx |y = (θx,y=1 , . . . , θx,y=K )T , (3.26)

Tx |y (x, y) = δy1 Tx (x), . . . , δyK Tx (x)
(3.27)
= Tx |y=1 (x, y), . . . , Tx |y=K (x, y) ,
K
X
A(θx |y , y) = δyk A(θx|y=k ) . (3.28)
k =1
14
Finally, Using Bayes’ rule and the results above, the joint distribution P (x, y|θ) can be
written in exponential family form as:
P (x, y|θ) = P (y|θy )P (x |y, θx |y )

= exp(θyT Ty (y) − A(θy )) exp(θxT |y Tx |y (x, y) − A(θx |y , y))
(3.29)
= exp θyT Ty (y) + θxT |y Tx |y (x, y) − A(θy ) − A(θx |y , y)

= exp(θT T(x, y) − A(θ, y))
where we have defined the natural parameters θ = (θyT , θx|y=k

T )T , the sufficient statistics
T(x, y) = (θy (y), θx |y (x, y)) and the log partition function A(θ, y) = A(θy ) + A(θx|y=k , y).
3.4 Generalized Expectation-Maximization

3.4.1 Bound on the log-likelihood
To formulate the Maximization step as a convex optimization problem with convex con-
straint set, we need to understand the theoretical underpinning of the EM algorithm. Let
us assume we are given a data set X = {x1 , . . . , xN } of N independent and identically
distributed (i.i.d) normally distributed random vectors and the corresponding random vari-
ables Y = {y1 , . . . , yN } that follow the multinomial distribution. The goal is to maximize
the log likelihood `(θ) = log P (X |θ) where θ are the natural parameters. In continuous
form, `(θ) can be written as:
`(θ) = log P (X |θ)

Z
= log P (X , Y|θ)dY
Z
P (X , Y|θ) (3.30)
= log Q(Y) dY
Q(Y)
P (X , Y|θ)
Z
≥ Q(Y) log dY = F (Q, θ) .
Q(Y)
In the last equation, we have introduced the arbitrary distribution Q(Y) to obtain a lower
bound F (Q, θ) on the log likelihood using Jensen’s inequality.
3.4.2 Expectation step
The Expectation-Maximization algorithm is an iterative algorithm that consists of two

steps. In the expectation step (E-step), we aim to maximize the lower bound function
F (Q, θt−1 ) with the distribution Q over Y while holding the parameters, θt−1 , from the
previous iteration t − 1 fixed. i.e. We find Qt such that:
Qt = arg max F (Q, θt−1 ) . (3.31)

Q
15
If we consider discrete distributions, then F (Q, θ), the lower bound on the log likelihood
`(θ| xn ), is also the function of the log likelihood `(θ| xn , yn ) of the joint distribution of the
observed variables X , hidden variables Y with the distributions Q = {q(y1 ), . . . , q(yn )} over
Y. Hence, for n = 1, . . . , N , we have:
K
X P (xn , yn = k|θ)
F (q(yn ), θ) = q(yn = k) log
q(yn = k)
k=1
K
X P (yn = k|xn , θ)P (xn |θ)
= q(yn = k) log
q(yn = k) (3.32)
k=1
K K
X X P (yn = k|xn , θ)
= q(yn = k) log P (xn |θ) + q(yn = k) log
q(yn = k)
k=1 k=1
= `(θ|xn ) − KL[q(yn )||P (yn |xn , θ)]

where the second term is called the Kullback-Leibler divergence. We obtain the overall
bound function F (Q, θ) as:
N
1 X
F (Q, θ) = `(θ| xn ) − KL[q(yn )||P (yn |xn , θ)]
N (3.33)
n=1
= `N (θ|X ) − KLN [Q||P (Y|X , θ)]

where we have defined:
N
1 X
`N (θ|X ) = `(θ| xn ) (3.34)
N
n=1
and
N
1 X
KLN [Q||P (Y|X , θ)] = KL[q(yn )||P (yn |xn , θ)] . (3.35)
N
n=1
This means that in the E-step, we aim to maximize the bound function:
F (Q, θt−1 ) = `N (θt−1 |X ) − KLN [Q||P (Y|X , θt−1 )] . (3.36)
We note that `N (θt−1 |X ) is not a function of Q .This means that for fixed θt−1 , F (Q, θt−1 )
is bounded above by `N (θt−1 |X ) and achieves that bound when
KLN [Q||P (Y|X , θt−1 )] is minimized. i.e. We find Qt such that:
Qt = arg min KLN [Q||P (Y|X , θt−1 )] . (3.37)

Q
From the expression of the KL divergence term in eq. (3.32), we can see that if the distri-
bution q(yn ) is equal to the posterior distribution q(yn | xn , θt−1 ), this will minimize the KL
divergence and also make it zero. Thus we obtain:
KLN [Q||P (Y|X , θt−1 )] = 0 if q(yn ) = P (yn | xn , θt−1 ), n = 1, . . . , N . (3.38)
Hence after the E-step, the lower bound function F (P (Y|X , θt−1 ), θt−1 ) and the log likeli-
hood function `N (θt−1 |X ) are equal.
16
3.4.3 Primal problem for the Maximization step
To understand M-step, we rewrite eq. (3.32) as:

K
X K
X
F (q(yn ), θ) = q(yn = k) log P (xn , yn = k|θ) − q(yn = k) log q(yn = k)
k=1 k=1 (3.39)
= Eq(yn ) [log P (xn , yn |θ)] + H[q(yn )]
PK
where H[q(yn )] = k=1 q(yn = k) log q(yn = k) is the entropy of q(yn ). The overall bound
function F (Q, θ) becomes:
N
1 X
F (Q, θ) = Eq(yn ) [log P (xn , yn |θ)] + H[q(yn )]
N (3.40)
n=1
= EQ [log P (X , Y|θ)] + HN (Q)
where:
N
1 X
EQ [log P (X , Y|θ)] = Eq(yn ) [log P (xn , yn |θ)] (3.41)
N
n=1
and
N
1 X
HN (Q) = H[q(yn )] . (3.42)
N
n=1
The bound function that we try to maximize in the M-step with respect to θ and fixed
distributions of hidden variables Q(t) is:
F (Q(t) , θ) = EQ(t) [log P (X , Y|θ)] + HN (Q(t) ) . (3.43)
Since HN (Q(t) ) does not depend on θ, we can say that the M-step is maximizing the sum
of the expected joint log-likelihoods of both observed and hidden variables X , Y:
θt = arg min EQ(t) [log P (X , Y|θ)] . (3.44)

θ
17
Since the Gaussian mixture distribution belongs to the exponential family, from eq. (3.29),
we can express the joint log-likelihoods as:
Eq(yn ) [log P (xn , yn |θ)] = Eq(yn ) [log exp θT T(xn , yn ) − A(θ, yn ) ]

= Eq(yn ) [θT T(xn , yn ) − A(θ, yn )]

= Eq(yn ) [θT T(xn , yn )] − Eq(yn ) [A(θ, yn )]
= θT Eq(yn ) [T(xn , yn )] − Eq(yn ) [A(θ, yn )]
= − Eq(yn ) [A(θy ) + A(θx|y , yn )]
K
X
+ θyT Eq(yn ) [Ty (yn )] + T
θx|y=k Eq(yn ) [Tx|y=k (xn , yn )]
k=1
K
X
= − Eq(yn ) [A(θy ) + δynk A(θx|y , yn )]
k=1
K−1
X K
X (3.45)
T
+ θy=k Eq(yn ) [δynk ] + θx|y=k Eq(yn ) [δynk Tx (xn )]
k=1 k=1
K
X
= −A(θy ) − Eq(yn ) [δynk ]A(θx|y=k )
k=1
K−1
X K
X
T
+ θy=k Eq(yn ) [δynk ] + θx|y=k Eq(yn ) [δynk ] Tx (xn )
k=1 k=1
K
X
= −A(θy ) − q(yn = k)A(θx|y=k )
k=1
K−1
X K
X
T
+ θy=k q(yn = k) + θx|y=k q(yn = k) Tx (xn ) .
k=1 k=1
For the overall bound function F (Q, θ), this becomes:

N
1 X
EQ [log P (X , Y|θ)] = Eq(yn ) [log P (xn , yn |θ)]
N
n=1
K N
X 1 X
= −A(θy ) − q(yn = k) A(θx|y=k )
N
k=1 n=1
K−1 X N XK X N
X 1 T 1
+ θy=k q(yn = k) + θx|y=k q(yn = k) Tx (xn )
N N
k=1 n=1 k=1 n=1
K
X K−1
X K
X
T
= −A(θy ) − λsk A(θx|y=k ) + θy=k νsy=k + λsk θx|y=k νs x |y=k
k=1 k=1 k=1
(3.46)
where we have defined: the expected empirical probabilities of Gaussian components as

λsk = N1 N
P
n=1 q(yn = k) for k = 1, . . . , K, the expected empirical moments of y as νsy=k =
18
λsk for k = 1, . . . , K − 1 and finally, the expected empirical conditional moments of x |y as
νs x |y=k = λsk1 N N
P
n=1 q(yn = k) Tx (xn ) for k = 1, . . . , K. Thus the optimization problem
for the M-step can now be written as:
K
X K−1
X K
X
T
minimize A(θy ) + λsk A(θx|y=k ) − θy=k νsy=k − λsk θx|y=k νs x |y=k (3.47)
k=1 k=1 k=1
where θ ∈ Rn is the optimization variable. From definition 7, we know that θ as the domain
of the log partition function A(θ) must belong to a convex set of parameters Cθ where A(θ)
is well defined.
Proposition 3. The minimization problem eq. (3.47) is a convex parameterization problem

parameterized by natural parameters θ.
Proof. From Proposition 12, we know that A(θy ), A(θx |y=1 ), . . . , A(θx |y=K ) are convex in
θ. The sum of convex functions, A(θ) + K
P
k=1 λsk A(θx|y=k ) defined over the convex set Cθ
is convex in θ. In addition, the expression − K−1
P PK T
k=1 θy=k νsy=k − k=1 λsk θx|y=k νs x |y=k is a
linear function of θ. The sum of convex function and a linear function is convex [12]. Hence,
using definition 2 , the minimization problem eq. (3.47) is a convex optimisation problem
in θ.
3.4.4 Dual problem for the Maximization step
As shown above, the M-step is a convex optimization problem with natural parameters θ as
the optimization variable. We will see that the Lagrange dual optimization problem with
moment parameters ν can be formulated.
The Lagrange dual function of problem in eq. (3.47) is the constant p∗ where
K
X K−1
X K
X
p∗ = inf A(θy ) + λsk A(θx|y=k ) − θy=k νsy=k − T
λsk θx|y=k νs x |y=k . (3.48)
θ∈dom A
k=1 k=1 k=1
We can rewrite this into:

K
X K−1
X K
X
T
minimize A(θy ) + λsk A(θx|y=k ) − θ̄y=k νsy=k − λsk θ̄x|y=k νs x |y=k
k=1 k=1 k=1
(3.49)
subject to θ̄y = θy ,
λsk θ̄x|y=k = λsk θx|y=k k = 1, . . . , K
where we have introduced the equality constraints θ̄y = θy , λsk θ̄x|y=k = λsk θx|y=k with the
new variable θ̄ ∈ Rn . We can see that the problem in eq. (3.47) and 3.49 are equivalent.
19
From definition 5, we can see that the Lagrangian L : Rn × Rn × Rn → R of eq. (3.49) is:
K
X K
X
T T
L(θ, θ̄, ν) = A(θy ) + λsk A(θx|y=k ) − θ̄y=k νsy=k − λsk θ̄x|y=k νs x |y=k
k=1 k=1
(3.50)
K
X
+ νyT (θ̄y − θy ) + νxT |y=k (λsk θ̄x|y=k
T T
− λsk θx|y=k )
k=1
where ν = (νyT , νx |y=1 , . . . , νx |y=1 )T ∈ Rn are the Lagrange multipliers. Using the definition
6, we can define the Lagrange dual function g : Rn → R as:
g(ν) = inf L(θ, θ̄, ν)
θ∈dom A,θ̄∈Rn
K
X K
X
T T
= inf A(θy ) + λsk A(θx|y=k ) − θ̄y=k νsy=k − λsk θ̄x|y=k νs x |y=k
θ∈dom A,θ̄∈Rn (3.51)
k=1 k=1
K
X
+ νyT (θ̄y − θy ) + T
νx |y=k (λsk θ̄x|y=k T
− λsk θx|y=k ).
k=1
We can separate θ and θ̄. In addition, from proposition 13, we have proved the relation-
ship between the log-partition function A(θy ), A(θx |y=1 ), . . . , A(θx |y=K ) and the entropy
functions H(νy ), H(νx |y=1 ), . . . , H(νx |y=K ):
H(ν) = inf A(θ) − θT ν . (3.52)

θ∈dom A
Thus we can obtain:

K
X
A(θy ) − θyT νy + T

g(ν) = inf λsk A(θx|y=k ) − θx|y=k νs x |y=k
θ∈dom A
k=1
K
X
+ inf θ̄yT (νy − νsy ) + T
λsk θ̄x|y=k (νx |y=k − νs x |y=k )
θ̄∈Rn
k =1
(3.53)
K
X
= H(νy ) + λsk H(νx |y=k )
k =1
K
X
+ inf θ̄yT (νy − νsy ) + T
λsk θ̄x|y=k (νx |y=k − νs x |y=k ) .
n
θ̄∈R
k =1
Since the Lagrangian L is linear in θ̄y , θ̄x |y=1 , . . . , θ̄x |y=K so g(ν) = −∞ unless νy − νsy =
0, νx |y=k − νs x |y=k = 0 for k = 1, . . . , K. Hence, the dual function g(ν) is:

H(νy ) + PK λ H(ν
k =1 sk x |y=k ), if ν = νs
g(ν) = (3.54)
−∞, otherwise
1 PN
where νs is the expected empirical moments νs = N n=1 T(xn ) We notice that due to the
equality constraint ν = νs , we cannot add new constraints on the moment parameters ν in
eq. (3.54). It turns out that eq. (3.54) can be written into an unconstrained optimization
problem that is equivalent to eq. (3.47) as shown below.
20
Proposition 4. The solutions of the maximum likelihood problem in eq. (3.47) is the same
as the unconstrained dual problem:
K
X K
X
maximize H(νy ) + λsk H(νx|y=k ) + νyT θsy + λsk νxT |y=k θs x |y=k . (3.55)
k=1 k =1
Proof. Since both problems are unconstrained optimization problems, we can obtain the
solution by setting the derivatives of the objective functions with respect to the optimization
variables to zero. Let us take the first derivative of the maximum likelihood problem
eq. (3.47) and set it to zero:
K
X K
X
T T

∇θ A(θy ) + λsk A(θx|y=k ) − θy=k νsy − λsk θx|y=k νs x |y=k = 0
k=1 k=1
   
∇θy (A(θy ) − θyT νsy ) 0
 T
   (3.56)
 ∇θ λs1 A(θx |y=1 ) − λs1 νx |y=1 νsx|y=1   0
 
x |y=1
=

 ..   ... 
  

 .   
T

∇θx |y=K λsK A(θx |y=K ) − λsk νx |y=1 νsx|y=K 0
   
∇θ A(θy ) νsy
   
 ∇θ
 x |y=1 A(θx |y=1 )   νs x |y=1 
  
 ..  =  .. . (3.57)

 .  
  . 

∇θx |y=K A(θx |y=K ) νs x |y=K
Recall from proposition 11 that the first derivative of the log partition function with respect
to the information parameters is equal to the source parameters. ∇θ A(θ) = ν. We have
∇θy A(θy ) = νy and ∇θx |y=k A(θx |y=k ) = νx |y=k . Hence,
   
νy νsy
   
 νx |y=1   νs x |y=1 
 . = . (3.58)
   
 ..   ..
   . 

νx|y=K νs x |y=K
Hence the derivative is zero when νy = νsy and νx |y=k = νs x |y=k for k = 1, . . . , K. Now let
us take the derivative of eq. (3.55):
K
X K
X
∇ν H(νy ) + λsk H(νx|y=k ) + νyT θsy + λsk νxT |y=k θs x |y=k = 0
k=1 k =1
   
∇vy H(νy ) + νyT θsy

0
 T
   (3.59)
 ∇ν λs1 H(νx|y=1 + λs1 νx |y=1 θs x |y=1   0
 
x |y=1
 = .


 .
..   .. 
   
T

∇νx |y=K λsK H(νx|y=K + λsK νx |y=K θs x |y=K 0
21
   
∇νy H(νy ) −θsy
   
 ∇ν H(ν x |y=1 )   −θs x |y=1 
x |y=1
= . (3.60)
   
 .. ..

 .  
  . 

∇νx |y=K H(νx |y=K ) −θs x |y=K
From what is shown in proposition 13, we know that −∇ν H(ν) = θ. Hence, ∇νy H(νy ) =
−θy and ∇νx |y=k H(νx |y=k ) = −θx |y=k . We obtain:
   
θy θsy
   
 θx |y=1   θs x |y=1 
 . = . (3.61)
   
 ..   ..
   . 

θx|y=K θs x |y=K
We can see that the derivative is zero when θy = θsy and θx |y=k = θs x |y=k for k =
1, . . . , K. From the relations between optimal parameters θsy , θs x |y=1 , . . . , θs x |y=K and
νsy , νs x |y=1 , . . . , νs x |y=K via the log partition functions A and the entropy functions H ,
we can see that the optimization problem eq. (3.47) and the unconstrained dual problem
eq. (3.55) have the same optimal solutions.
3.5 Constrained Gaussian Mixture Models

3.5.1 Problem definition
Let us denote the probability density function of the Gaussian mixture models with K
Gaussian components , d dimensional random vector x ∈ Rd and a random variable
y ∈ {1, . . . , K} following a multinomial distribution P (x, y|θ). In addition, P (x, y|θ)
is parameterized by the information parameters θ = {η, m1 , S1 , ..., mK , SK } where η ∈
RK−1 , mk ∈ Rd , Sk ∈ S+
d . Next, let us assume that N random vectors of observations
X = {x1 , ..., xN } are independent, identically distributed (i.i.d.) and generated from the
PK
marginal distributions P (x |θ) = k=1 P (x, y = k|θ). Different to the unconstrained
Gaussian mixture model estimation, here we assume that we have a set of constraints
denoted by C which are either convex constraints defined over the information parameters
θ = {η, m1 , S1 , ..., mK , SK } or over the source parameters ν = {λ, µ1 , Σ1 , ..., µK , ΣK } where
α ∈ RK−1 , µk ∈ Rd , Σk ∈ S+
d for k = 1, . . . , K. Given these constraints, our objective is to
find the estimator θ̂ that maximizes the likelihood. i.e. We need to find:
N
1 X
θ̂ = arg max `(θ| xn ) . (3.62)
θ∈C N
n=1
22
3.5.2 Primal parameterizations for the Maximization step
Using the parameterization of the Gaussian and multinomial distribution in the form of
information parameters, we can express the primal problem for the M-step as a convex
optimization problem with information parameters as optimization variables.
From the relationship shown in eq. (A.33) and eq. (A.21), the natural parameters θ can
be expressed in terms of the information parameters η, m1 , S1 , . . . , mK , SK as θy = η and
θx |y=k = (mTk , (− vec( 21 Sk )T )T where we have defined the vector η as
η = (η1 , ..., ηK−1 )

λ1 λK−1 (3.63)
= (log PK –1 , . . . , log PK –1 ).
(1 − i=1 λi ) (1 − i=1 λi )
As shown in eq. (A.38) and eq. (A.24), we can express the expected empirical moments
parameters νsy , νs x |y=1 , . . . , νs x |y=K using the source parameters λs , µs1 , Σs1 , ..., ΣsK as
T
νsy = λs , νs x |y=k = µTsk , vec(Σsk +µsk µTsk )T . In addition, from eq. (A.22) and eq. (A.34),
we can write the log partition functions as:
K−1
X
A(θy ) = log(1 + exp ηk ) , (3.64)
k=1
1 1 d
A(θx |y=k ) = − log |Sk | + mTk Sk−1 mk + log 2π . (3.65)
2 2 2
The inner product terms in eq. (3.47) become:
θyT νsy = η T λs
K−1
X (3.66)
= ηk λsk ,
k=1
1 T
θxT |y=k νs x |y=k = (mTk , − vec( Sk )T ) µTsk , vec(Σsk + µsk µTsk )T
2 (3.67)
1
= mk µsk − tr (Sk )(Σsk + µsk µTsk ) .
T

2
As a result, the convex optimization problem in eq. (3.47) can be parameterized using the
information parameters as an unconstrained optimization problem:
K−1 K
X X1 1 d
minimize log(1 + λsk (− log |Sk | + mTk Sk−1 mk + log 2π)
exp ηk ) +
2 2 2
k=1 k=1
(3.68)
K−1 K
X X
T 1 T

− ηk λsk − λsk mk µsk − tr (Sk )(Σsk + µsk µsk )
2
k=1 k=1
where η ∈ RK−1 , mk ∈ Rd , Sk ∈ S+
d for k = 1, . . . , K are the optimization variables. The
optimization problem depends on the expected sufficient statistics which are computed
23
apriori after the E-step as the expected probabilities λsk , the expected means µsk and the
expected covariance matrices Σsk for k = 1, . . . , K:
N
1 X
λsk = q(yn = k) , (3.69)
N
n=1
N
1 X
µsk = q(yn = k) xn , (3.70)
λsk N
n=1
N
1 X
Σsk = q(yn = k)(xn −µsk )(xn −µsk )T . (3.71)
λsk N
n=1
Using Definition 3.62, we can add constraints over the information parameters θ =
{η, m1 , S1 , . . . , mK , SK } to the optimization problem eq. (3.68) where θ ∈ Cθ with Cθ de-
fined as the convex constraint set including convex inequality and affine equality constraints
on θ.
3.5.3 Dual parameterizations for the Maximization step
We can express the dual problem for the M-step using the source parameters. Using the
relationship shown in eq. (A.38) and eq. (A.24), we can express the moment parameters
ν using the source parameters λ, µ1 , Σ1 , . . . , µk , Σk as νy = λ and νx |y=k = µTk , vec(Σk +
T
µk µTk )T .
The expected natural parameters can be written in terms of the information parameters
as νs , ms1 , Ss1 , ..., msK , SsK as θsy = ηs and θs x |y=k = (mTsk , (− vec( 12 Ssk )T )T . Hence, as
shown in eq. (A.39) and eq. (A.28), the entropy functions can be written as:
K−1
X K−1
X K−1
X
H(νy ) = − λk log λk − (1 − λk ) log(1 − λk ) , (3.72)
k=1 k=1 k=1
1 d
log |Σk | + log(2πe) .
H(νx |y=k ) = (3.73)
2 2
The inner product terms in eq. (3.55) become:
νyT θsy = αT ηs
K−1
X (3.74)
= αk ηsk ,
k=1
1
νxT |y=k θs x |y=k = µTk , vec(Σk + µk µTk )T (mTsk , − vec( Ssk )T )T

2
T T 1
= µk msk + tr (Σk + µk µk )(− Ssk )
2 (3.75)
1 1
= µTk msk − tr(Σk Ssk ) − tr(µk µTk Ssk )
2 2
T 1 1 T
= µk msk − tr(Σk Ssk ) − µk Ssk µk .
2 2
24
As a result, the convex optimization problem in eq. (3.55) can be parameterized using the
source parameters as an unconstrained optimization problem:
K−1
X K−1
X K−1
X
maximize − λk log λk − (1 − λk ) log(1 − λk )
k=1 k=1 k=1
K K
X 1 d X
+ λsk log |Σk | + log(2πe) + λk ηsk (3.76)
2 2
k=1 k=1
K
X 1 1
+ λsk µTk msk − tr(Σk Ssk ) − µTk Ssk µk
2 2
k=1
where λ ∈ RK−1 , µk ∈ Rd , Σk ∈ S+
d for k = 1, . . . , K are the optimization variables. The
optimization problem depends on the expected information parameters:

λk
ηsk = log PK−1 for k = 1, ..., K − 1 , (3.77)
1− i=1 λsi
msk = Σ−1
sk µsk for k = 1, ..., K , (3.78)
Ssk = Σ−1
sk for k = 1, ..., K . (3.79)
which are calculated after the E-step using the expected probabilities λsk , the expected
means µsk and the expected covariance matrices Σsk for k = 1, . . . , K:
N
1 X t
λsk = q (yn = k) for k = 1, ..., K , (3.80)
N
n=1
N
1 X t
µsk = q (yn = k)xn for k = 1, ..., K , (3.81)
λsk N
n=1
N
1 X t
Σsk = q (yn = k)(xn − µsk )(xn − µsk )T for k = 1, ..., K . (3.82)
λsk N
n=1
Similar to the primal constrained optimization problem, using Definition 3.62, we can add
constraints over the moment parameters ν = {λ, µ1 , Σ1 , . . . , µK , ΣK } to the optimization
problem eq. (3.76) where ν ∈ Cθ with Cν defined as the convex constraint set including
convex inequality and affine equality constraints on ν. Some example constraints that are
frequently used in practice [16] and can be formulated as affine quality or convex inequality
convex constraints on the moment parameters are:
d to be diagonal and related to another di-

• Constraint on covariance matrix Σk ∈ S+
d via an affine transformation with non-negative
agonal covariance matrix Σ̃k ∈ S+
25
variables a1 , . . . , ad can be formulated as linear equality and convex inequality con-
straints in the variables Σk as
Σi,i i,i
k = ai Σ̃k
Σi,j
k = 0 i 6= j
(3.83)
Σk 0
ai ≥ 0 i = 1, . . . , d .
• Constraint on the relationship mean vector µk ∈ Rd and another known vector µ̃k ∈
Rm via the affine transformation A ∈ Rd×m , b ∈ Rm . This can be formulated as linear
equality constraints in the variable µk , A, b
µk = Aµ˜k + b . (3.84)
• Constraint Σk = AΣ̃k AT as the relationship between the covariance matrix Σk ∈ S+d
and another covariance matrix Σ˜k ∈ S+d via the transformation A ∈ Rd×m . Although
this is not an affine equality constraint in the variable Σk , A, the relaxation Σk

AΣ̃k AT can be used since it is a convex inequality constraint in the variable Σk , A[12].
3.6 Conclusion
The framework of generalized Expectation-Maximization in section 3.4 has allowed the
derivation and parameterization of the Maximization step as a convex optimization problem
with convex constraint set as presented in section 3.5. The novelty of this thesis is we are
able to apply the result in eq. (3.76) to study the expected shortfall parity model presented
in the next chapter.
26
Chapter 4
Risk Parity with Expected

Shortfall and Gaussian Mixture
4.1 Introduction to Risk Allocation

4.1.1 Risk budgeting portfolio
We refer the reader to appendix B for the introduction to risk measures, the principle of
risk allocation and the notations used in this chapter. The main idea of risk allocation is to
achieve the diversification by allocating assets based on the risk contribution to the whole
portfolio.
Definition 1. Let bi be the proportion of the portfolio-wide risk R(x) that we want to assign
the risk contribution of asset ith to. We can define the solution x∗ of risk budgeting portfolio
as in [11]:
n
X ∂R(x)
x∗ = {x ∈ [0, 1]n : xi = 1, RC i = xi = bi R(x)} . (4.1)
∂ xi
1
4.1.2 Existence and uniqueness of solution
Proposition 5. For strictly positive risk budgets, and convex risk measure R, the solution
for eq. (4.1) in definition 1 satisfies the following optimisation problem:
y ∗ = arg min R(y)

y
P
 n bi ln yi ≥ c
i=1
subject to (4.2)
y ≥ 0
where c is an arbitrary constant.
27
Proof. The Lagrange function is:
Xn
L(y; ζ, ζc ) = R(y) − ζ T y − ζc ( bi ln yi − c) (4.3)
i=1
where ζ ∈ Rn and ζc ∈ R are the Lagrange multipliers. Hence, the solution y ∗ has to satisfy
the condition:
L(y; ζ, ζc ) ∂R(y) bi
= − ζi − ζc = 0 . (4.4)
∂yi ∂yi yi
The complementary slackness property implies that we must have:

ζi yi = 0 ,
(4.5)
ζ (Pn b ln y − c) = 0 .
c i=1 i i
Since we know that yi cannot be zero, we must have yi > 0 and ζi = 0. This also implies
that we must have ζc > 0 and hence, ni=1 bi ln yi = c. eq. (4.4) becomes:
P
∂R
yi = ζc bi (4.6)
∂yi
Hence, from the definition of risk contribution, we can verify that:
RC i = ζc bi . (4.7)
We can conclude that the optimisation problem has a solution and is unique.
1
For the equivalent risk contribution (ERC) portfolio, we would have the risk budget bi = n
y∗
and the normalized weights x∗ = Pn i
yj∗
.
j=1
4.2 Risk parity with skewness

4.2.1 Jump risk and skewness risk
Before considering the relationship between jump risk and skewness risk of our model, we
first define the model of the asset returns using Gaussian mixture as in [15]. Let R(x) be
the return of the portfolio.
R(x) = Y = (1 − λ)Y1 + λY2 . (4.8)
Let µ, Σ respectively be the vector of asset returns and covariance matrix in the normal
regime. We have the normal regime modelled by Y1 ∼ N (µ1 (x), σ12 (x)) where µ1 (x) = xT µ
and σ12 = xT Σ x. And the stressed regime with the jump probability of λ is modelled by
Y2 ∼ N (µ2 (x), σ22 (x)) with µ2 (x) = xT (µ + µ̃) and σ22 (x) = xT (Σ + Σ̃) x. Thus we have the
following density function for R(x):
f (y) = (1 − λ)f1 (y) + λf2 (y)
= (1 − λ)φ0 (y; µ1 (x), σ1 (x)) + λφ0 (y; µ2 (x), σ2 (x)) (4.9)

1 y − µ1 (x) 1 y − µ2 (x)
= (1 − λ) φ +λ φ
σ1 (x) σ1 (x) σ2 (x) σ2 (x)
28
where we have used the notation φ0 (z; µ, σ) as the probability density function of z of the
normal distribution with parameters (µ, σ) and φ(z) as the standard normal probability
density function of z.
Proposition 6. The skewness of the asset returns model in eq. (4.8) is given by:

3 2 2
λ(1 − λ) (2λ − 1)(µ1 − µ2 ) + 3(µ1 − µ2 )(σ1 − σ2 )
γ1 (Y ) = 3 . (4.10)
2
(1 − λ)σ12 + λσ22 + λ(1 − λ)(µ1 − µ2 )2
Proof. The k-th moment of Y is given by:
E[Y k ] = (1 − λ)E[Y1k ] + λE[Y2k ] . (4.11)
Recall that for normally distributed variables, we have E[Yi ] = µi , E[Yi2 ] = µ2i + σi2 and
E[Yi3 ] = µ3i + 3µi σi2 . We obtain:
E[Y ] = (1 − λ)µ1 + λµ2 . (4.12)
Hence we can also deduce that

σ 2 (Y ) = (1 − λ)E[Y12 ] + λE[Y22 ] − E2 [Y ]
= (1 − λ)(µ21 + σ12 ) + λ(µ22 + σ22 ) − (λµ1 + (1 − λ)µ2 )2
(4.13)
= (1 − λ)σ12 + λσ22 + 2λ(1 − λ)(µ21 + µ22 ) − 2µ1 µ2 λ(1 − λ)
= (1 − λ)σ12 + λσ22 + λ(1 − λ)(µ1 − µ2 )2 .
Then we can compute:
E (Y − E)3 = E[Y 3 ] − 3E[Y ](E[Y 2 ] − E2 [Y ]) − E3 [Y ]

= E[Y 3 ] − 3E[Y ]σ 2 (Y ) − E3 [Y ]
= (1 − λ)(µ31 + 3µ1 σ12 ) + λ(µ32 + 3µ2 σ22 )
(4.14)
− 3 (1 − λ)µ1 + λµ2 (1 − λ)σ12 + λσ22 + (1 − λ)λ(µ1 − µ2 )2

3
− (1 − λ)µ1 + λµ2
= (1 − λ)λ(2λ − 1)(µ1 − µ2 )3 + 3λ(1 − λ)(µ1 − µ2 )(σ12 − σ22 ) .
Thus we obtain:
Y − E[Y ] 3

γ1 (Y ) = E
σ(Y )

3 2 2
(1 − λ)λ (2λ − 1)(µ1 − µ2 ) + 3(µ1 − µ2 )(σ1 − σ2 ) (4.15)
= 3 .
2
(1 − λ)σ12 + λσ22 + λ(1 − λ)(µ1 − µ2 )2
29
0.0
−0.5
γ1 (Y )
−1.0
−1.5
−2.0
0.0 0.1 0.2 0.3 0.4 0.5

σ1 (Y )
Figure 4.1: Skewness coefficient γ1 (Y ) as the function of volatility σ1 (Y )
0.2
0.0
−0.2
−0.4
γ1 (Y )
−0.6
−0.8
−1.0
−1.2
0.0 0.1 0.2 0.3 0.4 0.5

σ2 (Y )
Figure 4.2: Skewness coefficient γ1 (Y ) as the function of volatility σ2 (Y )
30
In figure 4.1, we plot the relationship between the volatility in the normal regime with
the skewness coefficient while keeping other parameters constant. As we can see, γ1 (Y )
increases with σ1 (Y ). That is because as the normal volatility increases, it is harder to
detect distinctive jump from normal movement of returns. In other words, jumps have
small influence on the distribution of the returns. On the other hand, the absolute value
of γ1 (Y ) is highest when the portfolio’s volatility is low because jumps can dramatically
change the distribution of the returns. These conclusions are also confirmed in [9] and [15].
From figure 4.2 we can also observe that the magnitude of γ1 (Y ) increases with the jump
regime volatility.
4.2.2 Expected shortfall risk measure
Proposition 7. The expected shortfall for the return model in eq. (4.8) has the form:
ESα (x) = (1 − λ)ϕ(VaRα (x), µ1 (x), σ1 (x)) + λϕ(VaRα (x), µ2 (x), σ2 (x)) (4.16)
where
c a+b b a+b
ϕ(a, b, c) = φ( )− Φ(− ). (4.17)
1−α c 1−α c
Proof. Let Y ∼ N (µ, σ 2 ). Consider:
ϕ = E[1{Y ≥ a}.Y ]
Z ∞
y y − µ
= φ dy
a σ σ
Z ∞
= (µ + σt)φ(t) dt
σ −1 (a−µ)
Z ∞ (4.18)
∞ σ 1
t exp − t2 dt

= µ Φ(t) σ−1 (a−µ) + √
2π σ−1 (a−µ) 2
∞
a − µ σ 1
=µ 1−Φ +√ − exp − t2
σ 2π 2 σ −1 (a−µ)
a − µ a − µ
= µΦ − + σφ
σ σ
x−µ
where we have used the change of variable t = σ . From eq. (4.8), we can see that g(y)
is the density function of L(x)
1 y + µ1 (x) 1 y + µ2 (x)
g(y) = (1 − λ) φ( )+λ φ( ) (4.19)
σ1 (x) σ1 (x) σ2 (x) σ2 (x)
where L(x) = −R(x) ∼ N (−µ(x), σ 2 (x)) is the portfolio’s loss. Thus using the result from
31
eq. (4.18) , we obtain:
ESα (x) = E[L(x)|L(x) ≥ VaRα (x)]

1
= E[1{L(x) ≥ VaRα (x)}.L(x)]
1−αZ
∞
1
= yg(y)dy
1 − α VaRα (x)
Z ∞
1
= (1 − λ)E[1{L1 (x) ≥ VaRα (x).L1 (x)}] (4.20)
1 − α VaRα (x)
+ λE[1{L2 (x) ≥ VaRα (x).L2 (x)}]

1−λ VaRα (x) + µ1 (x) VaRα (x) + µ1 (x)
= σ1 (x)φ − µ1 (x)Φ −
1−α σ1 (x) σ1 (x)

λ VaRα (x) + µ2 (x) VaRα (x) + µ2 (x)
+ σ2 (x)φ − µ2 (x)Φ − .
1−α σ2 (x) σ2 (x)
In addition, from the definition of VaR, we have:
P(L(x) ≤ VaRα (x)) = α . (4.21)
It follows:
Z VaRα (x)
g(y)dy = α . (4.22)
−∞
Thus VaR can be found using a bisection algorithm by solving the equation

VaRα (x) + µ1 (x) VaRα (x) + µ1 (x)
(1 − λ)Φ + λΦ = α. (4.23)
σ1 (x) σ1 (x)
It can also be shown that there is an analytical solution of the marginal risk contribution
using the expected shortfall measure. Let us define
VaRα (x) + µi (x)
hi (x) = . (4.24)
σi (x)
Hence, using the result from eq. (B.11), we have that:
VaRα (x) + µ h1 (x)

∂x h1 (x) = − 2 Σx, (4.25)
σi (x) σ1 (x)
VaRα (x) + µ + µ̃ h2 (x)

∂x h2 (x) = − 2 (Σ + Σ̃) x . (4.26)
σi (x) σ2 (x)
From the definition of VaR, we have:
(1 − λ)Φ(h1 (x)) + λΦ(h2 (x)) = α . (4.27)
32
Differentiate the above equation with respect to x, we obtain:
(1 − λ)φ(h1 (x))∂x h1 (x) + λφ(h2 (x))∂x h2 (x) = 0 . (4.28)
Thus, we can deduce that:

h1 (x) h2 (x)
ω̄1 (x) σ1 (x) Σx − µ + ω̄2 (x) σ2 (x) (Σ + Σ̃) x −(µ + µ̃)
∂x VaRα (x) = (4.29)
ω̄1 (x) + ω̄2 (x)
where
πi φ(hi (x))
ω̄i (x) = . (4.30)
σi (x)
Hence ,we can express the expected shortfall as

1−λ
ESα (x) = σ1 (x)φ h1 (x) − µ1 (x)Φ −h1 (x)
1−α
(4.31)
λ
+ σ2 (x)φ h2 (x) − µ2 (x)Φ −h2 (x) .
1−α
Finally, we can deduce the marginal risk contribution ∂x ES α (x) by differentiating the above
expression of ESα (x) with respect to x:

1−λ
∂x ES α (x) = ∂x σ1 (x)φ(h1 (x)) − σ1 (x)h1 (x)φ(h1 (x))∂x h1 (x)
1−α

1−λ
∂x µ1 (x)Φ(−h1 (x)) − µ1 (x)φ(h1 (x))∂x h1 (x)
1−α
(4.32)
λ
∂x σ2 (x)φ(h2 (x)) − σ2 (x)h2 (x)φ(h2 (x))∂x h2 (x)
1−α

λ
∂x µ2 (x)Φ(−h2 (x)) − µ2 (x)φ(h2 (x))∂x h2 (x) .
1−α
4.2.3 Existence and uniqueness of the portfolio
Recall that if we use volatility risk measure, since R(y) ≥ 0, the risk parity portfolio
always exists and is unique and is the solution of the optimization problem 4.2 . The
existence of solution is more complex when we take into account the standard deviation-
based risk measure since we may have the situation of limy→∞ R(y) = −∞. When a
standard deviation-based risk measure is used, we have add another constraint to ensure
the existence of the solution:
R(x) ≥ 0 . (4.33)
This is equivalent to have the scaling factor c greater than the maximum Sharpe ratio. i.e.

c > max sup SR(x), 0 . (4.34)
x∈[0,1]n
33
To understand this, we study the relationship between the risk contribution, performance
contribution and volatility contribution as in [13]. Let us define the risk contribution RC i
of asset i as:
RC i = −µi (x) + cσi (x) (4.35)
where µi (x) = µi xi and σi (x) = xi (Σx) i

σ(x) . We also define the normalized risk contribution of
each asset as:
−µi (x) + cσi (x)
RC ∗i = (4.36)
R(x)
and the normalized performance contribution as:
µi (x) µi x i
PC ∗i = = Pn (4.37)
µ(x) j=1 xj µj
and finally, the normalized volatility contribution as:

σi (x) xi (Σx)i
VC ∗i = = T . (4.38)
σ(x) x Σx
Hence, we can obtain the following result.
Proposition 8. The risk contribution of asset i is the weighted average of the performance
contribution and the volatility contribution:
RC ∗i = (1 − ω) PC ∗i +ω VC ∗i (4.39)
cσ(x)
where ω = −µ(x)+cσ(x) .
Proof. We have:
RC ∗i = (1 − ω) PC ∗i +ω VC ∗i
−µ(x) + cσ(x) − cσ(x) −µi xi cσ(x) xi (Σx)i
= Pn +
−µ(x) + cσ(x) j=1 xj µj −µ(x) + cσ(x) xT Σx
−µi xi cxi (Σx)i
= + (4.40)
−µ(x) + cσ(x) −µ(x) + cσ(x)
1
= −µi xi + cxi (Σx)i
R(x)
−µi (x) + cσi (x)
= .
R(x)
∂ω
Furthermore, we can obtain ∂c :

∂ω σ(x) −µ(x) + cσ(x) − cσ(x)σ(x)
=
∂c (−µ(x) + cσ(x))2
(4.41)
−µ(x)σ(x)
= .
(−µ(x) + cσ(x))2
34
As a result, we can see that if c = 0, ω = 0. And ω is a decreasing function of c until the
µ(x)
value of c∗ = σ(x) which is the Sharpe ratio of the portfolio. When c > c∗ , ω is positive and
approaches 1 as c approaches ∞. We can see that when c is lower than the Sharpe ratio of
the portfolio, the risk contribution is return based and hence can be negative. To guarantee
that the solution to the problem 4.2 exists, we must have c > c∗ . i.e. the risk contribution
is volatility based and will be always positive.
To study the existence and solution of our model, we need to write the expected shortfall
in the form of standard deviation-based risk measure and find the scaling factor c. We
begin with finding the VaR lower bound as in [15]. We assume the normal case where the
confidence level α is greater than the jump intensity λ since practically α is higher than
50% while λ is lower than 50%.
Proposition 9. Assume that α ≥ λ, the lower bound VaR− of the value at risk is :
α − λ
VaR− = −µ1 (x) + Φ−1 σ1 (x) . (4.42)
1−λ
Proof. Let L1 (x) ∼ N (−µ1 (x), σ12 (x)), L2 (x) ∼ N (−µ2 (x), σ22 (x)) be the loss of each
regimes and g1 (y), g2 (y) be the density functions of these losses respectively. The value-at-
risk with the confidence level α is defined as:
Z VaRα (x)
g(y)dy = α (4.43)
−∞
where g(y) = (1 − λ)g1 (y) + λg2 (y). Hence, we can also deduce:
Z ∞
(1 − λ)g1 (y) + λg2 (y)dy = 1 − α . (4.44)
VaRα (x)
Since λg2 (y) ≥ 0, we obtain:

Z ∞
(1 − λ)g1 (y) dy ≤ 1 − α . (4.45)
VaRα (x)
It follows that: ∞
(1 − α)
Z
g1 (y) dy ≤ . (4.46)
VaRα (x) (1 − λ)
(1−α)
Let 1 − α0 = (1−λ) . Since we make the assumption that α ≥ λ, we can deduce that:
Z ∞ Z ∞
0
g1 (y) dy ≤ 1 − α = g1 (y) dy (4.47)
VaRα (x) VaR1α0 (x)
where VaR1α0 the value-at-risk at the confidence level α0 of the portfolio under the first
R∞
regime. Since a g1 (y) dy is a decreasing function of a, we can obtain:
VaRα (x) ≥ VaR1α0 . (4.48)
This means that VaR1α0 is a lower bound of the value-at-risk. Using the result from eq. (B.4),
we can write this as the lower bound VaR− = −µ1 (x) + Φ−1 α−λ

1−λ σ1 (x) .
35
Then, we can use the result to find the lower bound for the expected shortfall.
Proposition 10. The lower bound of the expected shortfall, denoted ES− , is given as:

− 1−λ −1 α − λ −1 α − λ
ES = −(1 + λ)µ1 (x) + φ Φ + λΦ σ1 (x) . (4.49)
1−α 1−λ 1−λ
Proof. The expected shortfall is defined as:
Z ∞
1
ESα (x) = yg(y)dy
1 − α VaRα (x)
(4.50)
(1 − λ) ∞
Z Z ∞
λ
= yg1 (y) dy + yg2 (y) dy .
(1 − α) VaRα (x) (1 − α) VaRα (x)
Assuming the worst case scenario when g2 (y) = δy (VaRα (x)), we can deduce that:
Z ∞
yg2 (y) dy ≥ (1 − α)VaRα (x) ≥ (1 − α)VaR1α0 . (4.51)
VaRα (x)
Since the expected shortfall is an increasing function of the value-at-risk and confidence
level, we also have:
Z ∞ Z ∞
yg1 (y) dy ≥ yg1 (y) dy = (1 − α)ES1α0 (4.52)
VaRα (x) VaR1α0 (x)
where ES1α0 is the expected shortfall with the confidence level α0 under the first regime.
Combining the result from eq. (4.51) and eq. (4.52), we obtain:
1−λ λ
ES(x) ≥ (1 − α0 )ES1α0 (x) + (1 − α)VaR1α0 (x) . (4.53)
1−α 1−α
1−α
Since 1 − α0 = 1−λ , this is simplified to:
ES(x) ≥ ES1α0 (x) + λVaR1α0 (x) . (4.54)
Using the standard deviation forms of expected shortfall and VaR from eq. (B.5) and
eq. (B.4), the lower bound ES− can be written as:

− 1−λ −1 α − λ −1 α − λ
ES = −µ1 (x) + φ Φ σ1 (x) + λ −µ1 (x) + Φ σ1 (x)
1−α 1−λ 1−λ

1−λ −1 α − λ −1 α − λ
= −(1 + λ)µ1 (x) + φ Φ + λΦ σ1 (x) .
1−α 1−λ 1−λ
(4.55)
As a result, using the result we discussed after proposition 8, we know that that RB portfolio
exists and unique if
+
c ≥ SR = max sup SR(x), 0 . (4.56)
x∈[0,1]n
36
Since expected shortfall is an increasing function of α, in our case, this means that:
α ≥ max(α− , 0) (4.57)
where α− is the solution to:

− −
1−λ −1 α − λ −1 α − λ
φ Φ + λΦ = (1 + λ)SR+
1 (4.58)
1 − α− 1−λ 1−λ
Roncalli et al. in [15] show that in the case when the Sharpe ratio is high, α− will approach
1 and the solutions may not exist as R(x) is negative for our model. One way to ensure
that the solution always exists is to have the expected return µi = 0 for all assets. This
means that our asset allocation algorithm has no views on the future performance of assets.
i.e. The performance contribution in eq. (4.39) is always zero. One of the extra benefits of
doing this is that it makes the algorithm directly comparable to the ERC portfolio using
volatility as the risk measure since when volatility is the risk measure, the performance
contribution is also zero. We will compare these two approaches in the next section.
4.3 Results analysis

4.3.1 Introduction
In this section we study the ERC portfolio using expected shortfall (ES) and Gaussian
Mixture Models (GMM) for assets returns as shown in the previous section. We start with
the portfolio of 3 underlyings representing 3 asset classes. First asset is US volatility carry
strategy in which it tries to capture the risk premium of implied volatility being most of
the time higher than realized volatility by selling delta hedged at the money straddles on
S&P 500 Index. The second asset is US Equity where we use the S&P 500 Total Return
Index as the proxy. And the third asset is US Bond where we use the performance of 10
year US Treasury Note. The time-series of these 3 assets are plotted in figure 4.3. We
can observe from the volatility carry timeseries that there are irregular jumps caused by
spikes in realized volatility in the stressed market conditions. We evaluate by comparing
the performance, risk management, weight turnover of this algorithm with the default ERC
portfolio using volatility as risk measure and the Markowitz mean-variance algorithm.
To evaluate these algorithms, we first assume that the portfolio weights are re-balanced
monthly on the first business date of the month. In addition, let Nrw be the rolling window
of 250 business dates where we calculate the weekly returns of the underlyings. For the ERC
portfolio using ES with GMM, to set up the portfolio weight optimization as in eq. (4.2)
with the risk measure R(x) = ESα (x) using expression in eq. (4.31), we would need to
estimate the parameters (µ̂n , Σ̂n ) on each re-balancing date n. i.e.
N
Xrw
0 0

µ̂n , Σ̂n = arg max ln (1 − λ)φ (Rn−s ; µn , Σn ) + λφ (Rn−s ; µn + µ̃, Σn + Σ̃) . (4.59)
(µn ,Σn ) s=1
37
Volatility Carry
400
Equity
350 Bond
Cumulative PnL
300
250
200
150
100
2002 2004 2006 2008 2010 2012 2014 2016 2018

Time
Figure 4.3: Cumulative PnL of bonds, equities and volatility carry strategies
4.3.2 Stressed regime parameters calibration
To solve the problem in eq. (4.59), first we would need to calibrate the parameters (µ̃, Σ̃)
from historical data using maximum likelihood estimation. Default unconstrained parame-
ters estimation in section 3.2 using EM algorithm achieves local maxima and does not allow
us to control the jump probability. In addition, since we have a bond asset in our portfolio,
it also makes more sense to have the same return and volatility in normal market regime
and stressed market regime for US 10 year Note. As a result, utilizing the constrained
mixture of Gaussian framework that we studied in chapter 3, we can estimate (µ̃, Σ̃) us-
ing maximum likelihood by solving1 the convex optimization problem eq. (3.76) with the
following constraints:
λ = 0.02 , (4.60)
µ̃bond = µbond , (4.61)
σ̃bond = σbond . (4.62)

We chose λ = 0.02 to be the base case because it corresponds to have a single jump in 50
weeks which is also equal to the length of our rolling window. Let us use the index 1, 2, 3
to represent the volatility carry, equity and bond respectively. We use various historical
periods to calibrate (µ̃, Σ̃) and show the results in table 4.1 and 4.2 where in table 4.2 we
have shown the correlation matrix in the normal regime ρ and in the stressed regime ρ̃. In
1
We use the open source convex optimization solver Python library CVXOPT https://cvxopt.org/
38
Period µ µ + µ̃
[2003 − 01 − 01, 2004 − 01 − 01] (0.18, 0.26, 0.03) (−0.10, −0.15, 0.03)
[2005 − 01 − 01, 2009 − 01 − 01] (0.10, 0.15, 0.03) (−0.06, −0.09, 0.03)
[2003 − 01 − 01, 2018 − 01 − 01] (0.09, 0.16, 0.03) (−0.05, −0.08, 0.03)
Table 4.1: Parameters estimation of µ̃ under different historical periods with λ = 0.02
Period ρ ρ̃
   
1 1
   
[2005 − 01 − 01, 2006 − 01 − 01]  0.19
 1 

 0.25
 1 

−0.06 −0.35 1 −0.02 −0.23 1
   
1 1
   
[2005 − 01 − 01, 2009 − 01 − 01]  0.39
 1 

 0.51
 1 

−0.31 −0.33 1 −0.03 −0.08 1
   
1 1
   
[2005 − 01 − 01, 2018 − 01 − 01]  0.40
 1 

 0.44
 1 

−0.25 −0.35 1 −0.06 −0.11 1
Table 4.2: Correlation matrix under normal and stressed regime under different historical
periods with λ = 0.02
this example, we can see that the estimated values of (µ̃, Σ̃) are stable when we use a long
enough period and that period contains a stressed event. for e.g. the 2008 financial crisis.
4.3.3 Filtering algorithm

Having now calibrated the parameters (µ̃, Σ̃), we can fix them and estimate (µ̂n , Σ̂n in

eq. (4.59) on each re-balancing date n. Even though we can estimate (µ̂n , Σ̂n using the
constrained Gaussian Mixture Model framework with the equality affine constraints µ̃n = µ̃
and the relaxed convex inequality constraint Σ̃n I Σ̃I T , it is recommended by Jacod et
al. in [3] that using the filtering approach to isolate the continuous and jump components
should be preferred to estimate the return and covariance matrix in this case. Let us outline
the filtering algorithm as proposed in [3] and [15]. We recall that the probability density
function of our assets return model is:

f (R) = λφ0 (R; µn , Σn ) + (1 − λ)φ0 (R; µn + µ̃, Σn + Σ̃) . (4.63)
On each re-balancing date n, given λ, µ̃, Σ̃, we would like to estimate the parameters µ̂n
and Σ̂n . We say that a jump is detected on rebalancing date n if the filtering probability
39
λ̂n is larger than a given threshold λ∗ . Let Nrw be the length of the rolling window.
1. At time n, we calculate the posterior jump probabilities:
λφ0 (Rn−s ; µ̂n−1 + µ̃, Σ̂n−1 + Σ̃)

λ̂n−s = (4.64)
(1 − λ)φ0 (Rn−s ; µ̂n−1 , Σ̂n−1 ) + λφ0 (Rn−s ; µ̂n−1 + µ̃, Σ̂n−1 + Σ̃)
for s = 1, . . . , Nrw . Note that λ̂n−s are based on the estimates µ̂n−1 and Σ̂n−1 calcu-
lated at time n − 1.
2. Calculate the estimates of µ̂n−1 and Σ̂n−1 :

Nrw
1X
µ̂n = 1{λ̂n−s ≤ λ∗ }Rn−s , (4.65)
n̂
s=1
N rw
1X
Σ̂n = 1{λ̂n−s ≤ λ∗ }(Rn−s − µ̂n )(Rn−s − µ̂n )T (4.66)
n̂
s=1
where n̂ is
rw −1
NX
n̂ = 1{λ̂n−s ≤ λ∗ } . (4.67)
s=1
These estimates can then be used to calculate λ̂n+1−s on the next re-balancing date
n + 1.
4.3.4 Backtesting results
Using the parameter estimates Σ̂n , Σ̃n , µ̃n obtained from the calibration step and the filtering
algorithm, we can set up the optimization problem as describe in eq. (4.2). Note that even
though we estimated µ̂n in the filtering algorithm, we will only keep the covariance structure
of the returns and set µ̂n to be the zero vector. The result is that we have removed the bias
of the return estimation and make our algorithm comparable to the ERC portfolio using
volatility as risk measure as discussed in section 4.2.3.
Before comparing these two approaches, we recall that the ERC portfolio using R(x) =
σ(x) as risk measure is one of the most popular asset allocation algorithm used in asset
management industry because of its simplicity and robustness compared to the Markowitz
mean-variance approach since it removes the instability of expected return estimation and
try to increase diversification by distributing equally the risk contribution of assets in the
portfolio. However, it suffers from several drawbacks as noted in [15]. Firstly, using standard
deviation as risk measure does not capture the non normality of asset returns. Secondly,
the weight allocation is not smooth. Whenever there is a jump, the weight of the volatility
carry strategy decreases sharply. We can see this from figure 4.5. This sharp decrease
is accompanied by a sharp increase in the weight allocation when the jump in the asset
returns exits the rolling window used to estimate the covariance matrix. As a result, we
40
obtain a high weight turnover. Also as a consequence, the weight is generally maximum
just before the jump occurs and it is generally too late to reduce the allocation after the
jump occurrence because jumps are not frequent and not correlated.
We present the weight allocation of ERC portfolio with expected shortfall (ES) risk
measure and Gaussian Mixture Model (GMM) in figure 4.6 with the base case parameter
λ = 0.02. We observe that the weight allocation is now much smoother compared to ERC
portfolio using volatility. As the result, we notice a better average annualized turnover
as shown in figure 4.7. The worst turnover comes from the mean-variance algorithm with
more than 10 times bigger turnover compared to ERC portfolio with ES and GMM. This is
due to the model instability in estimating the expected return leading to instability in the
weight allocation. We can see that from figure 4.4. In term of risk management metrics, we
again see from table 4.3 that the ERC portfolio with ES and GMM has lowest maximum
drawdown, lowest skewness coefficient in term of magnitude and best Sortino ratio which
better captures the risk adjusted return when the portfolio returns exhibit skew in their
distributions.
Next we study the behaviour of the new model with respect to the jump probability
parameter λ. As presented in table 4.4, as the jump probability decreases and approaches
zero, the algorithm behaves like if we use the single Gaussian model. The turnover increases
sharply as the single spot covariance matrix does not capture any possibility of jump risk.
Metrics Volatility ERC ES/GMM ERC Mean-Variance

Annualized return 5.11% 4.95% 5.34%
Annualized volatility 4.43% 4.22% 5.43%
Sharpe ratio 1.17 1.17 0.98
Sortino ratio 1.21 1.41 1.16
Maximum drawdown 14.53% 12.01% 15.04%
Skewness −1.25 −0.65 −1.19
Average annualized turnover 44% 19% 250%
Table 4.3: Performance metrics comparison between different asset allocation algorithms
4.3.5 Risk premia portfolio
Having studied our model using the toy portfolio with three underlyings in the above section,
we turn our focus to the practical application of our asset allocation model to the portfolio
of risk premia strategies. We pick some well known risk premia strategies and include them
in our portfolio:
• G10 FX Carry Strategy. Bloomberg ticker is UISFC1UE Index. The strategy goes
41
Figure 4.4: Markowitz maximum Sharpe portfolio weights with constraint σ <= 0.05
Figure 4.5: ERC portfolio weights using volatility
42
Figure 4.6: ERC portfolio weights using expected shortfall (ES) with GMM
0.6 Vol ERC

GMM/ES ERC
0.5
0.4
Turnover
0.3
0.2
0.1
0.0
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
Year
Figure 4.7: Weight turnover comparison
43
220 Vol ERC
GMM/ES ERC
200 Mean-Variance
Cumulative PnL
180
160
140
120
100
2004 2006 2008 2010 2012 2014 2016 2018

Time
Figure 4.8: Cumulative PnL comparison
Vol ERC
0.100 GMM/ES ERC
0.075
YoY return
0.050
0.025
0.000
−0.025
−0.050
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
Year
Figure 4.9: Year on year return comparison
44
Metrics λ = 0.04 λ = 0.02 λ = 0.001
Skewness −0.54 −0.65 −0.70
Table 4.4: Performance metrics comparison with jump probabilities
long and short G10 FX based on the one month carry signal.
• G10 FX Value Strategy. Bloomberg ticker is UISFV1UE Index. The strategy goes
long and short G10 FX based on the relationship between the current and historical
spot FX prices.
• MSCI World Beta Neutral Momentum Strategy. Bloomberg ticker is UISEMGSE

Index. The strategy goes long stocks with highest 12 month momentum and short
the MSCI World Index.
• MSCI World Beta Neutral Low Volatility Strategy. Bloomberg ticker is UISELGSE
Index. The strategy goes long stocks with lowest 12 month volatility and short the
MSCI World Index.
• MSCI World Beta Neutral Value Strategy. Bloomberg ticker is UISEVGSE Index.
The strategy goes long stocks with best value score based on various financial ratios
and short the MSCI World Index.
• MSCI World Beta Neutral Quality Strategy. Bloomberg ticker is UISEQGSE Index.
The strategy goes long stocks with best quality score based on various financial ratios
and short the MSCI World Index.
• US Mean Reversion Strategy. Bloomberg ticker is UISERUAE Index. The strategy

goes long daily variance swaps and short biweekly variance swaps to capture the mean
reversion effect of the S&P 500.
• US Carry - Term Structure Slope Strategy. Bloomberg ticker is UISRCX8E Index.

The strategy is duration neutral and decides to long and short 2-year, 5-year, 10-year,
30-year US Treasury bonds based on the term structure slope.
45
14
12 UISELGSE Index
UISEMGSE Index
10 UISEQGSE Index
UISERUAE Index
8
Weight
UISEVGSE Index
UISFC1UE Index
6 UISFV1UE Index
UISRCX8E Index
4 UISRTLGE Index
UISXTXUE Index
2
0
2008 2010 2012 2014 2016 2018
Time
Figure 4.10: Markowitz maximum Sharpe portfolio weights with constraint σ <= 0.09 for
the risk premia portfolio
• Cross-Currency Carry - Cross Term Structure Strategy. Bloomberg ticker is UIS-

RTLGE Index. The strategy seeks to invest in rates assets with highest carry and
short rates assets with lowest carry accross different term structures.
• Cross-Asset Trend Strategy. Bloomberg ticker is UISXTXUE Index. The strategy

goes long assets with highest momentum across FX, Rates and Equities asset class.
In this test, we remove the constraint that the total weight must be 1. In addition, we
introduce the convex inequality constraint on the variance of the portfolio such that the
volatility of the portfolio is targeted at 9%. This allows the algorithm to control the leverage
of the portfolio and is a technique commonly used in practice. We report the weight
allocation of different algorithms in figure 4.10, 4.12 and 4.11. In figure 4.12, we see that
our algorithm leverage is very stable. On the other hand, we can again observe the sharp
increase and decrease in the allocation of the volatility based risk parity portfolio. This leads
to the instability of the leverage. The instability is worst for mean-variance portfolio with
the complete in and out of different underlying positions during the backtest period. The
performance metrics are reported in table 4.5. The result shows that our asset allocation is
stable, with the realized volatility closest the the targeted volatility of 9%. The algorithm
also achieves best risk-adjusted return in term of Sharpe and Sortino ratio. It also has the
lowest maximum drawdown. And as expected, it also has the lowest annualized turnover.
This shows that our algorithm is highly implementable in practice, especially under liquidity
constraints.
46
10
UISELGSE Index
UISEMGSE Index
8
UISEQGSE Index
UISERUAE Index
Weight
6 UISEVGSE Index
UISFC1UE Index
UISFV1UE Index
4
UISRCX8E Index
UISRTLGE Index
2 UISXTXUE Index
0
2008 2010 2012 2014 2016 2018
Time
Figure 4.11: ERC portfolio weights using volatility for the risk premia portfolio
6 UISELGSE Index
UISEMGSE Index
5
UISEQGSE Index
UISERUAE Index
4
Weight
UISEVGSE Index
UISFC1UE Index
3
UISFV1UE Index
UISRCX8E Index
2
UISRTLGE Index
UISXTXUE Index
1
0
2008 2010 2012 2014 2016 2018
Time
Figure 4.12: ERC portfolio weights using expected shortfall (ES) with GMM for the risk
premia portfolio
47
10
Vol ERC
GMM/ES ERC
8
Turnover
0
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
Year
Figure 4.13: Weight turnover comparison of risk premia portfolio
250
225
200
Cumulative PnL
175 Vol ERC

GMM/ES ERC
150
Mean-Variance
125
100
75
2008 2010 2012 2014 2016 2018
Time
Figure 4.14: Cumulative PnL comparison of risk premia portfolio with different asset allo-
cation models
48
Vol ERC
0.20
GMM/ES ERC
0.15
YoY return
0.10
0.05
0.00
−0.05
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
Year
Figure 4.15: Year on year return for risk premia portfolio comparison
Metrics Volatility ERC ES/GMM ERC Mean-Variance

Skewness −0.94 −0.33 −0.44
Table 4.5: Performance metrics of risk premia portfolio between different asset allocation
algorithms
49
Chapter 5
Conclusion and Future Work
The thesis starts with the study showing that we cannot diversify the skewness by relying
on the correlation parameter as in the case of volatility diversification. In particular, we
can optimize the portfolio to have low volatility but this can also result in high stress risk.
Therefore, to take skewness risk into account and not rely on just volatility risk, we study
and implement the paper by Roncalli et al.[15] where expected shortfall is used in the risk
parity framework together with the mixture distribution model for asset returns. The nov-
elty comes from applying the theory of constrained Gaussian Mixture Model by Ari in [1] to
estimate the parameters in this model. The idea is that prior information can be formulated
in the form of convex constraints either on the source of the information parameters and
these constraints can be handled by solving the constrained convex optimization problems
for the Maximization-step of the EM algorithm. The results presented in this thesis show
that our allocation algorithm has overcome the shortcomings of volatility-based risk parity.
It achieves stable leverage with better risk-adjusted returns and with lower maximum draw-
down and weight turnover. This implies that our algorithm is highly applicable in practice,
especially under liquidity constraints.
Despite of the brilliance of Markowitz’s theory, the mean-variance framework suffers
from several drawbacks that prevent it to perform in practice. In particular, it often concen-
trates the allocation on a few assets only, which inevitably leads to disastrous out-of-sample
risks. In addition, the solution portfolio is also very sensitive to small changes in the return
forecast [4]. This has lead to the risk allocation approach and the birth of volatility-based
risk parity. In this thesis, we have only studied one variation of how volatility-based risk
parity can be improved. Hence, the future work could be focused on comparing this ap-
proach with other recent advances in risk parity such as hierarchical risk parity by Prado
[8] and minimum-torsion bets by Meucci [6].
50
Appendix A
Mathematics Supplementary
A.1 Convex Optimization

Definition 2. A convex optimization problem is defined as a optimisation problem where
the objective function is convex and minimized or concave and maximized subject to convex
inequality constraint functions and affine equality constraint functions.
Definition 3. The Fenchel conjugate function of f: Rn → R is f ∗ : Rn → R where f ∗ is

defined as:
f ∗ (ν) = sup θT ν − f (θ) . (A.1)
θ∈dom f
The domain of the conjugate function is determined by the values ν ∈ Rn where the supre-
mum is finite. In addition, the conjugate function f ∗ is differentiable f is the Legendre
transform of f where:
f ∗ (ν) = [θT ν − f (θ)]ν=∇θ f (θ) . (A.2)
Definition 4. The Fenchel-Young inequality is expressed as:
f ∗ (ν) + f (θ) − θT ν ≥ 0 (A.3)
where the equality happens if ∇θ f (θ) = ν and ∇ν f ∗ (ν) = θ.
Definition 5. Let f0 (x) be the objective function we try to minimize with x ∈ Rn subject
to the equality constraints hi (x) = 0 for i = 1, . . . , p and inequality constraints fi (x) ≤ bi
for i = 1, . . . , m. The Lagrangian L : Rn × Rm × Rp → R is defined as:
X X
L(x, ζ, ν) = f0 (x) + ζi (fi (x) − bi ) + νi hi (x) . (A.4)
i i
The variable ζi and νi are called the Lagrange multipliers.
51
Definition 6. The Lagrange dual function g : Rm × Rp → R is defined as the minimum
value of the Lagrangian eq. (A.4) over x for any ζ and ν
g(ζ, ν) = inf L(x, ζ, ν) (A.5)

x∈dom P
p
where dom P = (∩m
i=0 dom fi ) ∩ (∩i=1 dom hi ). We notice that the function g(ζ, ν) is a
concave function of ζ and ν because it is defined as the infimum of affine function of ν and
ζ.
A.2 Generalized Exponential Family of Distributions

Definition 7. The exponential family distribution is a set of distributions over random
vector x with the probability density function
P (x |θ) = exp θT T(x) − A(θ) .

(A.6)
where θ ∈ Rn are called the natural parameters, T : Ωx → Rn are called the sufficient
statistics and A(θ) is called the log partition function defined as:
Z
exp θT T(x) dx

A(θ) = log (A.7)
x∈Ωx
which preserves the property that the integral of P (x |θ) over x is 1. Let us denote the set
of all parameters θ as Cθ = {θ ∈ Rn | A(θ) < ∞}. For the exponential family distributions,
we have that the set of parameters Cθ is an open convex set in Rn .
The moment parameter ν ∈ Rn is defined as the expected value of the sufficient statistic
function:
ν = EP (x |θ) [T(x)] . (A.8)
The relation between the moment parameters ν ∈ Cν and the natural parameters θ ∈ Cθ
can be seen via the moment generating property of the log partition function A(θ)
Proposition 11. The gradient of the log partition function A(θ) with respect to the natural
parameters θ is equal to the moment parameters ν.
52
Proof.
T(x) exp θT T(x) dx

R
x∈Ωx
∇θ A(θ) = R
T T(x̂) dx̂
x̂∈Ωx̂ exp θ
T
R
x∈Ωx T(x) exp θ T(x) dx
= R
exp log x̂∈Ωx̂ exp θT T(x̂) dx̂
T
R
x∈Ωx T(x) exp θ T(x) dx (A.9)
=
exp A(θ)
Z
T(x) exp θT T(x) − A(θ) dx

=
x∈Ωx
= EP (x |θ) [T(x)]
.=ν
In addition, when maximum likelihood estimation is used, the most crucial property of
the log partition function A(θ) is that it is convex in the parameter θ.
Proposition 12. The log partition function A(θ) is a convex function of the natural pa-
rameters θ. i.e. A(λθ1 + (1 − λ)θ2 ) ≤ λA(θ1 ) + (1 − λ)A(θ2 ).
Proof.
Z
exp (λθ1 + (1 − λ)θ2 )T T(x) dx

A(λθ1 + (1 − λ)θ2 ) = log
Zx∈Ωx
exp λθ1T T(x) + (1 − λ)θ2T T(x) dx

= log
Zx∈Ωx
exp λθ1T T(x) exp (1 − λ)θ2T T(x) dx

= log
x∈Ωx
λ 1−λ
Z Z
λ T 1−λ T
≤ exp θ1 T(x) exp θ T(x)
x∈Ωx λ x∈Ωx 1−λ 2
Z Z
= λ log exp θ1T T(x) + (1 − λ) log exp θ2T T(x)
x∈Ωx x∈Ωx
= λ A(θ1 ) + (1 − λ) A(θ2 )
(A.10)
where we have used the Holder’s inequality.
Definition 8. The entropy of the a probability distribution P (x) defined on the sample
space Ωx is: Z
H(P (x)) = − P (x) log P (x) dx (A.11)
Ωx
The entropy function and the log partition function are also Fenchel conjugate functions.
53
Proposition 13. The entropy function and the log partition function are also Fenchel
conjugate functions
−H(ν) = sup θT ν − A(θ) . (A.12)
θ∈dom A
As a result
H(ν) = inf A(θ) − θT ν . (A.13)
θ∈dom A
Proof. The entropy of exponential family distributions as a function of the moment param-
eters ν can be expressed as:
H(ν) = −EP (x |θ) [log P (x |θ)]

= −EP (x |θ) [θT T(x) − A(θ)] (A.14)
= −θT ν + A(θ) .
Hence, it follows that:

A(θ) − H(ν) − θT ν = 0 . (A.15)
This corresponds to the Fenchel-Young inequality in definition 4 with the equality holds.
In addition, from proposition 11 and definition 3, we know that θT ν − A(θ) achieves the
supremum when ν = ∇θ A(θ).
A.3 Multinomial Distribution

This section shows how we can represent multinomial distribution in exponential family
form. In the mixture of Gaussian Models, we use one dimensional multinominal random
variable y with the sample space Ωy = {1, . . . , K} and the source parameters λ with a set
of probabilities {λ1 , . . . , λK }. The probability density function is:
K
δ(y=k)
Y
P (y|λ) = λk (A.16)
k=1
where δ(y = k) is the delta function. To express the multinominal distribution in exponen-
tial form
P (y|θy ) = exp(θyT Ty (y) − A(θy )) (A.17)
where θy ∈ RK−1 , Ty : Ωy → RK−1 , A(θy ) are the natural parameters, sufficient statistics
function and log partition function respectively, we need to parameterize the probability
54
density function P (y|λ) using the K − 1 components of λ [1]:
K
δ(y=k)
Y
P (y|λ) = λk
k =1
K –1
δ(y=k) δ(y=K)
Y
= λk λK
k =1 (A.18)
K –1 K –1
–1
(1−PKi=1
δ(y=k) δ(y=i)
Y X
= λk 1− λi
k =1 i=1
= P (y|λ̂)
where λ̂ = {λ1 , . . . , λK−1 }. Then we can write:

K –1 K –1
–1
(1−PKi=1
δ(y=k) δ(y=i)
Y X
P (y|λ̂) = exp log λk 1− λi
k =1 i=1
K
X –1 K
X –1 K
X –1

= exp δ(y = k) log λk + (1 − δ(y = i) log 1 − λi
i=1 i=1 i=1
K
X –1 K
X –1
= exp δ(y = k) log λk + log(1 − λi ) (A.19)
i=1 i=1
K
X –1 K
X –1

− δ(y = k) log 1 − λi
k =1 i=1
K –1 K –1
X λk X
−1
= exp log PK –1 δ(y = k) − log(1 − λi ) .
k =1
(1 − i=1 λi ) i=1
We obtain that:

Ty (y) = δ(y = 1), . . . , δ(y = K − 1) (A.20)
with the natural parameters

λ1 λK−1
θy = (log PK –1 , . . . , log PK –1 ) (A.21)
(1 − i=1 λi ) (1 − i=1 λi )
and log partition function:
K
X –1
A(θy ) = log(1 − λi )−1
i=1
PK−1 PK –1
1− j=1 λj + k =1 λk
= log( PK –1 )
1 − k =1 λk
K –1
(A.22)
X λk
= log(1 + exp log PK−1 )
k =1
1 − j=1 λj
K
X –1
= log(1 + exp θy=k ) .
k =1
55
Hence, P (y|θy ) has the general exponential form:
K
X –1 K
X –1
P (y|θy ) = exp θy=k δ(y = k) − log(1 + exp θy=k ) . (A.23)
k =1 k =1
In addition, the moment parameters induced by the sufficient statistic function in

eq. (A.20) and the entropy function induced by the log partition function in eq. (A.22)
can be derived [1]. From Proposition 11 , we have the mapping ∇θy A : θy → νy :
∂ A(θy ) exp θy=k

= PK –1
∂θy=k 1 + k =1 exp θy=k
λk
PK−1
1− j=1 λj
= PK –1 λk (A.24)
1+ k =1 1−PK−1 λj
j=1
= λk
= νy=k .
Furthermore, we notice that:

K
X –1
log νy=k = θy=k + log(1 + exp θy=i )−1 . (A.25)
i=1
As a result,
K
X –1
θy=k = log νy=k − log(1 + exp θy=i )−1 . (A.26)
i=1
Hence, using the relation from eq. (A.22) and applying it to eq. (A.24), we have:
νy=k
θy=k = log PK –1 . (A.27)
1− i=1 νy=i
Finally, using Fenchel duality as shown in proposition 13, we can obtain the relationship
56
between the log partition function and the entropy function H(νy )
H(νy ) = inf A(θy ) − θyT νy

θy
= A(θy ) − θyT νy

νy =∇θy A(θy )
K
X –1 K
X –1

= log(1 + exp θy=i ) − θy=k νy=k θ νy=k
y=k =log PK –1
1− ν
i=1 k =1 i=1 y=i
K –1
X νy=k
= log(1 + exp log PK –1 )
i=1 1− k =1 νy=i
K –1
X νy=k
− log νy=k
PK –1
k =1
1 − k =1 νy=i
P –1 PK –1
1− K

i=1 νy=i + i=1 νy=i
= log P –1
1− K i=1 νy=i (A.28)
K –1
X νy=k
− log PK –1 νy=k
k =1
1− k =1 νy=i
K
X –1 K
X –1
= log(1 − νy=i )−1 − νy=k log νy=k
i=1 k =1
K
X –1 K
X –1
− νy=k log(1 − νy=i )−1
k =1 i=1
K
X –1 K
X –1 K
X –1
=− νy=k log νy=k + (1 − νy=k ) log(1 − νy=i )−1
k =1 k =1 i=1
K
X –1 K
X –1 K
X –1
=− νy=k log νy=k − (1 − νy=k ) log(1 − νy=k ) .
k =1 k =1 k =1
A.4 Gaussian Distribution

We are familiar with the parameterization of the Gaussian distribution in source form with
the source parameters µ, Σ and the probability density function P (x |µ, Σ) given by:

1 1 T −1
P (x |µ, Σ) = d 1 exp − (x −µ) Σ (x −µ) (A.29)
(2π) 2 |Σ| 2 2
where µ ∈ Rd , Σ ∈ S+
d are mean and covariance matrix of random vector x with the sample
space Ωx = Rd . S+
d denotes the set of symmetric positive semidefinite matrices. The
alternative parameterization in information form is shown in [5] and [1].
Definition 9. The parameterization of Gaussian distribution using information vector m ∈

Rd and the information matrix S ∈ S+
d has the probability density function:

1 1 1 d
P (x |m, S) = exp mT x +tr(− S x xT ) + log |S| − mT S −1 m − log 2π (A.30)
2 2 2 2
57
where x is the random vector with sample space Ωx = Rd .
We can write the information form P (x |m, S) into exponential family form as:

1 1 1 d
P (x |m, S) = exp mT x +tr(− S x xT ) + log |S| − mT S −1 m − log 2π
2 2 2 2

T 1 T 1 1 T −1 d (A.31)
= exp m x +tr(− S x x ) − (− log |S| + m S m + log 2π)
2 2 2 2
= exp(θxT T(x) − A(θx ))
where the sufficient statistic function Tx : Rd → Rd × K+

d , Kd = {vec(R) ∈ Rd(d+1)/2 |R ∈
+
d } is:
S+
Tx = (xT , vec(x xT )T )T (A.32)
which has the natural parameters θx ∈ Rd × K–d , K–d = {vec(− 21 S) ∈ Rd(d+1)/2 |S ∈ S+

d}
1
θx = (mT , vec(− S)T )T . (A.33)
2
The log partition function A : Rd × K–d , K–d = {vec(− 21 S) ∈ Rd(d+1)/2 |S ∈ S+

d } is:
1 1 d
A(θx ) = − log |S| + mT S −1 m + log 2π . (A.34)
2 2 2
The relationship between the source parameters µ, Σ and the information parameters m, S
can be seen as:

1 1 T −1
P (x |µ, Σ) = d 1 exp − (x −µ) Σ (x −µ)
(2π) 2 |Σ| 2 2

1 1 1 d
= exp − xT Σ−1 x + xT Σ−1 µ − µT Σ−1 µ + log |Σ−1 | − log 2π
2 2 2 2

1 1
= exp tr(− Σ−1 x xT ) + (Σ−1 µ)T x − (Σ−1 µ)T Σ(Σ−1 µ)
2 2

1 d
+ log |Σ−1 | − log 2π (A.35)
2 2

1 1
= exp (Σ−1 µ)T x + tr(− Σ−1 x xT ) + log |Σ−1 |
2 2

1 d
− (Σ−1 µ)T Σ(Σ−1 µ) − log 2π
2 2

T 1 T 1 1 T −1 d
= exp m x + tr(− S x x ) + log |S| − m S m − log 2π .
2 2 2 2
Hence, we can see that m = Σ−1 µ, S = Σ−1 and µ = S −1 , Σ = S −1 . Similarly, using the
property of log partition function ∇θx A(θx ) = νx , we can obtain:
∇m A(m, S) = S −1 m
(A.36)
= µ,
58
∇− 1 A(m, S) = S −1 + S −1 mmT S −1
2
(A.37)
= Σ + µµT .
Therefore, the moment parameters νx ∈ Rd × K+

d is:
νx = (µT , vec(Σ + µµT )T )T . (A.38)
Finally, using Fenchel duality as shown in proposition 13, we can obtain the relationship
between the log partition function and the entropy function H(νx )
H(νx ) = inf A(θx ) − θxT νx

T
= A(θx ) − −θx νx
θx ∇ A(θx )=νx

1 1 T −1 d
= − log |S| + m S m + log 2π − mT µ
2 2 2

1
− tr(− S(Σ + µµT ))
2 m=Σ−1 µ,S=Σ−1
1 1 d
= − log |Σ−1 | + (Σ−1 µ)T Σ(Σ−1 µ) + log 2π
2 2 2
1
− (Σ−1 µ)T µ − tr(− Σ−1 (Σ + µµT ))
2
1 1 T −1 d
= log |Σ| + µ Σ µ + log 2π − µT Σ−1 µ
2 2 2
1 1
+ tr(Σ−1 Σ) + tr(Σ−1 µµT )
2 2
1 1 T −1 T 1 T −1 T d 1
= log |Σ| + µ Σ µ + µ Σ µ − µT Σ−1 µT + log 2π + tr(Σ−1 Σ)
2 2 2 2 2
1 d d
= log |Σ| + log 2π +
2 2 2
1 d
= log |Σ| + log(2πe) .
2 2
(A.39)
59
Appendix B
Risk Allocation Supplementary
B.1 Risk measures

There are many different measures we can use to measure risk in the portfolio. Let us define
L(x) as the loss of the portfolio and L(x) = −R(x) where x is the weight vector and R(x)
is the return of the portfolio, we can define the following risk measures:
• The portfolio volatility: R(x) = σ(L(x)) = σ(x).
• Value-at-risk with confidence level α: R.(x) = VaRα (x) = inf{l : P (L(x) ≤ l) ≥ α}.
1
R1
• Expected shortfall: R(x) = 1−α α VaRu (x)du.
If we assume that R ∼ N (µ, Σ), we can rewrite the above risk measures into generic form
of standard deviation based risk measure:
√
SDc (x) = −µ(x) + c xT Σ x . (B.1)

Take an example of value-at risk, we have P r R(x) ≤ V aRa (x) = 1 − α. Thus,
R(x) − µ(x) −V aRα (x) − µ(x)
P √ = √ = 1 − α. (B.2)
xT Σ x xT Σ x
Hence, we can easily see that:
−V aRα (x) − µ(x)
√ = Φ−1 (1 − α) (B.3)
T
x Σx
where Φ is the cumulative distribution function of normal random variable. We finally have:
√
SDc (x) = V aRα (x) = −µ(x) + Φ−1 (α) xT Σ x (B.4)
where c = Φ−1 (α). It can also be shown in [11] that the standard deviation form of expected
shortfall (ES) can be expressed as:
√
xT Σ x
ESα (x) = −µ(x) + φ(Φ−1 (α)) . (B.5)
(1 − α)
60
B.2 Risk contribution and Euler allocation principle
After the first step of measuring risk in the portfolio, we need to also decompose the risk
portfolio into a sum of risk contributions by assets. This process is referred to as risk
allocation [10]. Risk contributions can be defined using Euler principle in [11] as follows.
Let Υi be the profit and loss of the asset i in the portfolio and hence, the profit and loss of
the whole portfolio with n assets is:
n
X
Υ= Υi . (B.6)
i=1
The risk measure R(Υ) is the portfolio-wide risk and R(Υi |Υ) as risk contribution of asset
ith to the portfolio-wide risk. Tasche in [2] defines the risk-adjusted performance measure-
ment (RAPM) as:
E[Υ]
RAPM(Υ) = (B.7)
R[Υ]
and also:
E[Υ]
RAPM(Υi |Υ) = . (B.8)
R(Υi |Υ)
Then we can state the two desirable risk contribution properties as in [2]:
Pn
1. The full allocation property: i=1 R(Υi |Υ) = R(Υ).
2. The RAPM compatible property: if ∃i , i > 0 such that RAPM(Υi |Υ) > RAPM(Υ)
then RAPM(Υ + hΥi ) > RAPM(Υ), ∀h, 0 < h < i .
If these two conditions are satisfied then it is shown in [2] that R(Υi |Υ) can be uniquely
d
defined as: R(Υi |Υ) = dh R(Υ + hΥi ). Hence, based on this framework, if we consider
the risk measure R(x) defined in term of weights, the risk contribution of asset i can be
uniquely defined as:
∂R(x)
RC i = xi (B.9)
∂xi
and hence, the Euler decomposition is satisfied:
n n
X ∂R(x) X
R(x) = xi = RC i . (B.10)
∂xi
i=1 i=1
This definition plays the key role in risk budgeting portfolios. For e.g., if we consider
the case of Gaussian asset returns: R ∼ N (µ, Σ) and with volatility as the risk measure
√
R(x) = σ(x) = xT Σ x, we can verify that this satisfies the full allocation property. First,
we have the marginal volatility:
∂σ(x) Σx
=√ . (B.11)
∂x xT Σ x
61
Then we can compute the risk contribution of the ith asset:
(Σ x)i
RC i = xi √ . (B.12)
xT Σ x
Hence, the full allocation property is satisfied:
n n
X X (Σ x)i
RC i = xi √
i=1 i=1 xT Σ x
(B.13)
Σx
= xT √ = σ(x) .
xT Σ x
62
Appendix C
Code Listing
1 from s c i p y import o p t i m i z e a s spo

2 import pandas a s pd
3 import numpy a s np
4
5 # Max d i v e r s i f i c a t i o n o b j e c t i v e f u n c t i o n
6 def m i n v a r o b j f u n c ( wgtvec , covarmat ) :
7 wgtvec = np . matrix ( wgtvec )
8 r e t u r n np . a r r a y ( np . dot ( wgtvec , np . dot ( covarmat , wgtvec . T) ) ) [ 0 ] [ 0 ]
9 d e f minimum variance ( data ,
10 schedule ,
11 signal ,
12 params ={} ,
13 ∗∗ kwargs ) :
14 ”””
15 Simple minimum/mean v a r i a n c e o p t i m i z e r .
16 Function m i n i m i z e s t h e p o r t f o l i o v o l a t i l i t y with c o n s t r a i n t s t h a t
w e i g h t s add up t o 100%.
17 An a d d i t i o n a l c o n s t r a i n t o f minimum p o r t f o l i o r e t u r n can be added .
18
19 I f : math : ‘ C‘ i s : math : ‘ n∗n ‘ c o v a r i a n c e matrix , : math : ‘ R‘ a : math : ‘ n ∗ 1 ‘

v e c t o r o f u n d e r l y i n g r e t u r n s and
20 : math : ‘W‘ i s a : math : ‘ n ∗ 1 ‘ v e c t o r o f w e i g h t s .
21
22 : math : ‘ p o r t f o l i o v o l a t i l i t y = \\ s q r t {WˆT ∗ C ∗ W} ‘
23
24 : math : ‘ p o r t f o l i o r e t u r n = WˆT ∗ R‘
25
26 Required S i g n a l : ’ covariance matrix ’

27 Optional Signal : ’ u n d e r l y i n g r e t u r n ’ ( i f t a r g e t r e t u r n i s not None )
28 Required I n p u t s : None
29 Optional Inputs ( d e f a u l t value ) :
30 t o l ( 1 e −8) : tolerance for optimizer
31 t a r g e t r e t u r n ( None ) : minimum r e t u r n ( f o r mean−v a r i a n c e v e r s i o n )
63
32
33
34 ”””
35 o p t i m c o n s = ( { ’ type ’ : ’ eq ’ , ’ fun ’ : lambda x : sum ( x ) − 1 } , )
36
37 t a r g e t r e t u r n = params . g e t ( ’ t a r g e t r e t u r n ’ , None )
38 m e a n v a r i a n c e = not ( t a r g e t r e t u r n i s None ) # F a l s e
39 # if ’ t a r g e t r e t u r n ’ i n l i s t ( params . k e y s ( ) ) :
40 # t a r g e t r e t u r n = params [ ’ t a r g e t r e t u r n ’ ]
41 # m e a n v a r i a n c e = True
42
43 schedule = schedule i f hasattr ( schedule , ’ iter ’ ) e l s e [ schedule ]

44 a l l w e i g h t s = pd . DataFrame ( )
45
46 for date rebal in schedule :

47 covariance matrix = signal [ date rebal ] [ ’ covariance matrix ’ ]
48 cm le n = l e n ( c o v a r i a n c e m a t r i x )
49 b o u n d a r i e s = [ ( 0 , 1 ) ] ∗ cm le n
50 s t a r t i n g v a l u e s = [ 1 . 0 / cm le n ] ∗ cm le n
51
52 i f mean variance :
53 return matrix = signal [ date rebal ] [ ’ underlying return ’ ]
54 o p t i m c o n s r e t u r n = ( { ’ type ’ : ’ eq ’ , ’ fun ’ : lambda x , r , mu : np .
prod ( 1 + np . dot ( r , x ) ) − 1 − mu,
55 ’ args ’ : ( return matrix , target return , ) } ,)
56 optim cons = optim cons + optim cons return
57
58 r e s = spo . minimize ( minvar objfunc ,

59 starting values ,
60 bounds=b o u n d a r i e s ,
61 a r g s =( c o v a r i a n c e m a t r i x , ) ,
62 method= ’SLSQP ’ ,
63 c o n s t r a i n t s=o p t i m c o n s ,
64 t o l =1e −8)
65 i f not r e s [ ’ s u c c e s s ’ ] :
66 p r i n t ( ’ERROR with max d i v e r s i f i c a t i o n − a s o f ’ + s t r ( d a t e r e b a l ) )
67 p r i n t ( ’MSG: ’ + r e s [ ’ message ’ ] )
68 a l l w e i g h t s = a l l w e i g h t s . append (
69 pd . DataFrame ( r e s [ ’ x ’ ] , columns =[ d a t e r e b a l ] , i n d e x=
c o v a r i a n c e m a t r i x . columns ) . T)
70
71 return all weights . f i l l n a (0)
Listing C.1: Markowitz mean-variance optimisation function
3 import s c i p y
64
4 from s c i p y . s t a t s import m u l t i v a r i a t e n o r m a l
6 import copy
7
8 def v o l m i x t c o n s s q r t ( wgtvec , c o v a r m a t 1 , c o v a r m a t 2 , no jump prob ,

jump prob , v o l t a r g e t , mu 1 , mu 2 ) :
9 v o l = −np . s q r t ( no jump prob ∗ ( v o l f u n c ( wgtvec , c o v a r m a t 1 ) ∗ ∗ 2 ) + jump prob
∗ ( v o l f u n c ( wgtvec , c o v a r m a t 2 ) ∗ ∗ 2 ) + no jump prob ∗ jump prob ∗ ( ( r e t f u n c (
wgtvec , mu 1 ) − r e t f u n c ( wgtvec , mu 2 ) ) ∗ ∗ 2 ) )
10 vol optim val = vol + ( vol target )
11 return vol optim val
12
13 def e x p e c t e d s h o r t f a l l j a c o p t ( wgt vector , c o e f s ) :

14 var value = b i s e c t i o n o p t ( lower a , higher b , wgt vector , c o e f s [ 0 ] , c o e f s
[1] , coefs [2] , coefs [3] , coefs [4] , coefs [6])
15 h 1 = ( v a r v a l u e+n e g r e t f u n c ( w g t v e c t o r , c o e f s [ 2 ] ) ) / v o l f u n c ( w g t v e c t o r ,
coefs [0])
16 h 2 = ( v a r v a l u e+n e g r e t f u n c ( w g t v e c t o r , c o e f s [ 3 ] ) ) / v o l f u n c ( w g t v e c t o r ,
coefs [1])
17 w 1 = ( c o e f s [ 5 ] ∗ s c i p y . s t a t s . norm . pdf ( h 1 ) ) / v o l f u n c ( w g t v e c t o r , c o e f s
[0])
[1])
19 d v a r = ( w 1 ∗ ( h 1 / v o l f u n c ( w g t v e c t o r , c o e f s [ 0 ] ) ∗np . dot ( c o e f s [ 0 ] ,
w g t v e c t o r ) − c o e f s [ 2 ] ) +\
20 w 2 ∗ ( h 2 / v o l f u n c ( w g t v e c t o r , c o e f s [ 1 ] ) ∗np . dot ( c o e f s [ 1 ] ,
w g t v e c t o r ) − c o e f s [ 3 ] ) ) /\
21 ( w 1+w 2 )
22
23 d e l t a 1 = (1+ h 1 / v o l f u n c ( w g t v e c t o r , c o e f s [ 0 ] ) ∗ v a r v a l u e ) ∗np . dot ( c o e f s

[ 0 ] , w g t v e c t o r )−\
24 var value ∗( d var + c o e f s [ 2 ] )
25
26 d e l t a 2 = (1+ h 2 / v o l f u n c ( w g t v e c t o r , c o e f s [ 1 ] ) ∗ v a r v a l u e ) ∗np . dot ( c o e f s

[ 1 ] , w g t v e c t o r )−\
27 var value ∗( d var + c o e f s [ 3 ] )
28
29 d e s = w 1/(1− c o e f s [ 4 ] ) ∗ d e l t a 1 + w 2/(1− c o e f s [ 4 ] ) ∗ d e l t a 2 − (1/(1 − c o e f s

[ 4 ] ) ) ∗ ( c o e f s [ 5 ] ∗ c o e f s [ 2 ] ∗ s c i p y . s t a t s . norm . c d f (− h 1 ) + c o e f s [ 6 ] ∗ c o e f s [ 3 ] ∗
s c i p y . s t a t s . norm . c d f (− h 2 ) )
30
31 return d es
32
33 def e x p e c t e d s h o r t f a l l j a c o p t ( wgt vector , c o e f s ) :

34 h 1 = ( b i s e c t i o n o p t ( lower a , higher b , wgt vector , c o e f s [ 0 ] , c o e f s [ 1 ] ,
c o e f s [ 2 ] , c o e f s [ 3 ] , c o e f s [ 4 ] , c o e f s [ 6 ] ) +n e g r e t f u n c ( w g t v e c t o r , c o e f s [ 2 ] )
) / vol func ( wgt vector , c o e f s [ 0 ] )
65
35 h 2 = ( b i s e c t i o n o p t ( lower a , higher b , wgt vector , c o e f s [ 0 ] , c o e f s [ 1 ] ,
c o e f s [ 2 ] , c o e f s [ 3 ] , c o e f s [ 4 ] , c o e f s [ 6 ] ) +n e g r e t f u n c ( w g t v e c t o r , c o e f s [ 3 ] )
) / vol func ( wgt vector , c o e f s [ 1 ] )
[0])
[1])
38 d v a r = ( w 1 ∗ ( h 1 / v o l f u n c ( w g t v e c t o r , c o e f s [ 0 ] ) ∗np . dot ( c o e f s [ 0 ] ,
w g t v e c t o r ) − c o e f s [ 2 ] ) +\
39 w 2 ∗ ( h 2 / v o l f u n c ( w g t v e c t o r , c o e f s [ 1 ] ) ∗np . dot ( c o e f s [ 1 ] ,
w g t v e c t o r ) − c o e f s [ 3 ] ) ) /\
40 ( w 1+w 2 )
41 d e l t a 1 = (1+ h 1 / v o l f u n c ( w g t v e c t o r , c o e f s [ 0 ] ) ∗ b i s e c t i o n o p t ( l o w e r a ,
higher b , wgt vector , c o e f s [ 0 ] , c o e f s [ 1 ] , c o e f s [ 2 ] , c o e f s [ 3 ] , c o e f s [ 4 ] ,
c o e f s [ 6 ] ) ) ∗np . dot ( c o e f s [ 0 ] , w g t v e c t o r )−\
42 b i s e c t i o n o p t ( lower a , higher b , wgt vector , c o e f s [ 0 ] , c o e f s
[ 1 ] , c o e f s [ 2 ] , c o e f s [ 3 ] , c o e f s [ 4 ] , c o e f s [ 6 ] ) ∗( d var + c o e f s [ 2 ] )
43 d e l t a 2 = (1+ h 2 / v o l f u n c ( w g t v e c t o r , c o e f s [ 1 ] ) ∗ b i s e c t i o n o p t ( l o w e r a ,
higher b , wgt vector , c o e f s [ 0 ] , c o e f s [ 1 ] , c o e f s [ 2 ] , c o e f s [ 3 ] , c o e f s [ 4 ] ,
c o e f s [ 6 ] ) ) ∗np . dot ( c o e f s [ 1 ] , w g t v e c t o r )−\
44 b i s e c t i o n o p t ( lower a , higher b , wgt vector , c o e f s [ 0 ] , c o e f s [ 1 ] ,
c o e f s [ 2 ] , c o e f s [ 3 ] , c o e f s [ 4 ] , c o e f s [ 6 ] ) ∗( d var + c o e f s [ 3 ] )
45 d e s = w 1/(1− c o e f s [ 4 ] ) ∗ d e l t a 1 + w 2/(1− c o e f s [ 4 ] ) ∗ d e l t a 2 − (1/(1 − c o e f s
[ 4 ] ) ) ∗ ( c o e f s [ 5 ] ∗ c o e f s [ 2 ] ∗ s c i p y . s t a t s . norm . c d f (− h 1 ) + c o e f s [ 6 ] ∗ c o e f s [ 3 ] ∗
s c i p y . s t a t s . norm . c d f (− h 2 ) )
46 return d es + r i s k p a r i t y j a c ( wgt vector , c o e f s [ 7 ] )
47
48 def r i s k p a r i t y f u n c ( wgtvec , c o e f s ) :
49 r e t u r n −1 ∗ np . sum ( np . abs ( c o e f s ) ∗ np . l o g ( np . abs ( wgtvec ) ) )
50
51 def r i s k p a r i t y j a c ( wgtvec , c o e f s ) :
52 j a c v a l = [ −1.0 ∗ np . abs ( c o e f s ) [ k ] / wgtvec [ k ] f o r k i n r a n g e ( l e n ( wgtvec ) )
] #
53 return jac val
54
55 def e x p e c t e d s h o r t f a l l o p t ( wgt vector , c o e f s ) :

56 e s = ((1 − c o e f s [ 6 ] ) /(1− c o e f s [ 4 ] ) ∗ ( ( v o l f u n c ( w g t v e c t o r , c o e f s [ 0 ] ) ) ∗\
57 s c i p y . s t a t s . norm . pdf ( ( b i s e c t i o n o p t ( l o w e r a , h i g h e r b , wgt vector ,
c o e f s [ 0 ] , c o e f s [ 1 ] , c o e f s [ 2 ] , c o e f s [ 3 ] , c o e f s [ 4 ] , c o e f s [ 6 ] ) +n e g r e t f u n c (
w g t v e c t o r , c o e f s [ 2 ] ) ) / v o l f u n c ( w g t v e c t o r , c o e f s [ 0 ] ) )−\
58 n e g r e t f u n c ( w g t v e c t o r , c o e f s [ 2 ] ) ∗ s c i p y . s t a t s . norm . c d f ( −((
b i s e c t i o n o p t ( lower a , higher b , wgt vector , c o e f s [ 0 ] , c o e f s [ 1 ] , c o e f s
[ 2 ] , c o e f s [ 3 ] , c o e f s [ 4 ] , c o e f s [ 6 ] ) +n e g r e t f u n c ( w g t v e c t o r , c o e f s [ 2 ] ) ) /
v o l f u n c ( w g t v e c t o r , c o e f s [ 0 ] ) ) ) ) )+\
59 ( ( c o e f s [ 6 ] ) /(1− c o e f s [ 4 ] ) ∗ ( ( v o l f u n c ( w g t v e c t o r , c o e f s [ 1 ] ) ) ∗\
60 s c i p y . s t a t s . norm . pdf ( ( b i s e c t i o n o p t ( l o w e r a , h i g h e r b , wgt vector ,
c o e f s [ 0 ] , c o e f s [ 1 ] , c o e f s [ 2 ] , c o e f s [ 3 ] , c o e f s [ 4 ] , c o e f s [ 6 ] ) +n e g r e t f u n c (
66
w g t v e c t o r , c o e f s [ 3 ] ) ) / v o l f u n c ( w g t v e c t o r , c o e f s [ 1 ] ) )−\
61 n e g r e t f u n c ( w g t v e c t o r , c o e f s [ 3 ] ) ∗ s c i p y . s t a t s . norm . c d f ( −((
b i s e c t i o n o p t ( lower a , higher b , wgt vector , c o e f s [ 0 ] , c o e f s [ 1 ] , c o e f s
[ 2 ] , c o e f s [ 3 ] , c o e f s [ 4 ] , c o e f s [ 6 ] ) +n e g r e t f u n c ( w g t v e c t o r , c o e f s [ 3 ] ) ) /
vol func ( wgt vector , c o e f s [ 1 ] ) ) ) ) )
62 return es + r i s k p a r i t y f u n c ( wgt vector , c o e f s [ 7 ] )
63
64
65 def v o l f u n c ( wgt vector , covar ) :

66 r e t u r n np . s q r t ( np . dot ( np . dot ( w g t v e c t o r , c o v a r ) , w g t v e c t o r ) )
67
68 def r e t f u n c ( wgt vector , r e t v e c t o r ) :

69 r e t u r n np . dot ( w g t v e c t o r , r e t v e c t o r )
70
71 def n e g r e t f u n c ( wgt vector , r e t v e c t o r ) :

72 r e t u r n −np . dot ( w g t v e c t o r , r e t v e c t o r )
73
74 l o w e r a = −100
75 h i g h e r b = 100
76 d e f b i s e c t i o n o p t ( a , b , w g t v e c t o r , c o v a r 1 , c o v a r 2 , mu vt 1 , mu vt 2 , alpha ,
lambda jump , t o l =1e −6) :
77 i f v a r f u n c t i o n o p t ( a , w g t v e c t o r , c o v a r 1 , c o v a r 2 , mu vt 1 , mu vt 2 ,
alpha , lambda jump ) ∗ v a r f u n c t i o n o p t ( b , w g t v e c t o r , c o v a r 1 , c o v a r 2 ,
mu vt 1 , mu vt 2 , alpha , lambda jump ) > 0 :
78 p r i n t ( v a r f u n c t i o n o p t ( a , w g t v e c t o r , c o v a r 1 , c o v a r 2 , mu vt 1 ,
mu vt 2 , alpha , lambda jump ) )
79 p r i n t ( v a r f u n c t i o n o p t ( b , w g t v e c t o r , c o v a r 1 , c o v a r 2 , mu vt 1 ,
mu vt 2 , alpha , lambda jump ) )
80 p r i n t ( ”No r o o t found . ” )
81 else :
82 while (b − a ) /2.0 > t o l :
83 midpoint = ( a + b ) / 2 . 0
84 i f v a r f u n c t i o n o p t ( midpoint , w g t v e c t o r , c o v a r 1 , c o v a r 2 ,
mu vt 1 , mu vt 2 , alpha , lambda jump ) == 0 :
85 r e t u r n ( midpoint ) #The midpoint i s t h e x−i n t e r c e p t / r o o t .
86 e l i f v a r f u n c t i o n o p t ( a , w g t v e c t o r , c o v a r 1 , c o v a r 2 , mu vt 1 ,
mu vt 2 , alpha , lambda jump ) ∗ v a r f u n c t i o n o p t ( midpoint , w g t v e c t o r ,
c o v a r 1 , c o v a r 2 , mu vt 1 , mu vt 2 , alpha , lambda jump ) < 0 : # I n c r e a s i n g
but below 0 c a s e
87 b = midpoint
88 else :
89 a = midpoint
90 r e t u r n ( midpoint )
91
92 d e f v a r f u n c t i o n o p t ( v a r a , w g t v e c t o r , c o v a r 1 , c o v a r 2 , mu vt 1 , mu vt 2 ,
alpha , lambda jump ) :
93
67
94 v a r r e s u l t = (1−lambda jump ) ∗ s c i p y . s t a t s . norm . c d f ( ( v a r a + n e g r e t f u n c (
w g t v e c t o r , mu vt 1 ) ) / v o l f u n c ( w g t v e c t o r , c o v a r 1 ) ) + \
95 lambda jump ∗ s c i p y . s t a t s . norm . c d f ( ( v a r a + n e g r e t f u n c (
w g t v e c t o r , mu vt 2 ) ) / v o l f u n c ( w g t v e c t o r , c o v a r 2 ) ) − a l p h a
96 return var result
Listing C.2: Objective functions and Jacobian functions for volatility based risk parity and
expected-shortfall based risk parity with Gaussian mixture models
3 from s c i p y . s t a t s import m u l t i v a r i a t e n o r m a l
5 import copy
6
7 h undl list = [
8 ’UISELGSE Index ’ ,
9 ’UISEMGSE Index ’ ,
10 ’UISEQGSE Index ’ ,
11 ’UISERUAE Index ’ ,
12 ’UISEVGSE Index ’ ,
13 ’UISFC1UE Index ’ ,
14 ’UISFV1UE Index ’ ,
15 ’UISRCX8E Index ’ ,
16 ’UISRTLGE Index ’ ,
17 ’UISXTXUE Index ’
18 ]
19
20 u n d l l i s t = [ adm . r e t r i e v e ( h u n d l ) f o r h u n d l i n h u n d l l i s t ]
21
22 d e f g e t r e t u r n ( t i m e s e r i e s , days =5) :
23 s h i f t e d t s = t i m e s e r i e s . s h i f t ( days )
24 return timeseries / s h i f t e d t s − 1
25
26 un ret list = []
27 f o r undl i n u n d l l i s t :
28 u n r e t = g e t r e t u r n ( undl )
29 u n r e t l i s t . append ( u n r e t )
30
31 r e t d f = pd . c o n c a t ( u n r e t l i s t , a x i s =1 , j o i n= ’ i n n e r ’ )
32 r e t d f = r e t d f . dropna ( )
33 c o v m a t r i x = r e t d f . cov ( )
34 corr matrix = ret df . corr ()
35 r e t m ea n = r e t d f . mean ( ) . v a l u e s
36 s c a l e d l e v e l s d f = 100 ∗ (1+ r e t d f ) . cumprod ( )
37
38 def i s p o s d e f (x) :
39
68
40 p r i n t ( np . l i n a l g . e i g v a l s ( x ) )
41
42 r e t u r n np . a l l ( np . l i n a l g . e i g v a l s ( x ) >= 0 )
43
44 ret df period = ret df . iloc [:1000]

45
46 from cvxpy import ∗

47 alpha k = Variable (1)
48 mu k 1 = V a r i a b l e ( i n t ( l e n ( h u n d l l i s t ) ) )
49 mu k 2 = V a r i a b l e ( i n t ( l e n ( h u n d l l i s t ) ) )
50 covar k 1 = Variable ( int ( len ( h u n d l l i s t ) ) , int ( len ( h u n d l l i s t ) ) )
51 covar k 2 = Variable ( int ( len ( h u n d l l i s t ) ) , int ( len ( h u n d l l i s t ) ) )
52
53 from s k l e a r n import mi xtu re

54 c l f = mi xtu re . GaussianMixture ( n components =2 , m a x i t e r =1000 , t o l =1e −7) . f i t (
ret df period )
55 c l f = mi xtu re . GaussianMixture ( n components =2 , m a x i t e r =500 , t o l =1e −7 ,
weights init =[0.2 , 0 . 8 ] ) . f i t ( r e t d f p e r i o d )
56
57 print ( ’ weights ’ )
58 print ( c l f . weights )
59
60 if c l f . weights [ 0 ] > c l f . weights [ 1 ] :

61 alpha k . value = c l f . weights [ 0 ]
62 mu k 1 . v a l u e = c l f . means [ 0 ]
64 covar k 1 . value = c l f . covariances [ 0 ]
66 else :
67 alpha k . value = c l f . weights [ 1 ]
72
73 iter t = 3
74 N = r e t d f p e r i o d . shape [ 0 ]
75 K = 2
76
77 f o r t i t e r in range ( i t e r t ) :
78 q k = [ a l p h a k . v a l u e , 1−a l p h a k . v a l u e ]
79 m u s k l i s t = [ np . a r r a y ( mu k 1 . v a l u e ) . f l a t t e n ( ) , np . a r r a y ( mu k 2 . v a l u e ) .
flatten () ]
80 c o v a r s k l i s t = [ c o v a r k 1 . value , c o v a r k 2 . value ]
81 q posterior list = []
82 q p o s t e r i o r m a p = {}
83 import math
69
84
85 # marginal p r o b a b i l i t y
86 m a r p r o b l i s t = np . z e r o s ( (K, N) )
87 f o r k i d x t e m p i n r a n g e ( 0 , K) :
88 f o r j t e m p i n r a n g e ( 0 , N) :
89 m a r p r o b l i s t [ k idx temp , j t e m p ] = \
90 ( q k [ k i d x t e m p ] ∗ m u l t i v a r i a t e n o r m a l . pdf ( r e t d f p e r i o d . i l o c [
j t e m p ] . v a l u e s , mean=m u s k l i s t [ k i d x t e m p ] , cov=c o v a r s k l i s t [ k i d x t e m p
]) )
91
92 f o r k i d x i n r a n g e ( 0 , K) :
93 f o r j i n r a n g e ( 0 , N) :
94 j o i n t p r o b k = q k [ k i d x ] ∗ m u l t i v a r i a t e n o r m a l . pdf ( r e t d f p e r i o d .
i l o c [ j ] , mean=m u s k l i s t [ k i d x ] , cov=c o v a r s k l i s t [ k i d x ] )
95 q p o s t e r i o r = j o i n t p r o b k / np . sum ( m a r p r o b l i s t [ : , j ] )
96 q p o s t e r i o r l i s t . append ( q p o s t e r i o r )
97
98 q posterior map [ k idx ] = q p o s t e r i o r l i s t

99 q posterior list = []
100
101 ##### a p r i o r i a f t e r E s t e p
102 # empirical
103 # probability
104
105 alpha sk list = []

107 alpha sk = 0
108 f o r j i n r a n g e ( 0 , N) :
109 a l p h a s k += q p o s t e r i o r m a p [ k i d x ] [ j ]
110 a l p h a s k l i s t . append ( a l p h a s k / N)
111 alpha sk list [1] = 1 − alpha sk list [0]
112
113 # mean
114 mu sk list = [ ]
115 w e i g h t e d x s m a p = {}
117 w e i g h t e d x s m a p [ k i d x ] = np . empty ( [ i n t ( l e n ( h u n d l l i s t ) ) , N ] )
118
120 mu sk = 0
121 f o r j i n r a n g e ( 0 , N) :
122 weighted x = ( q posterior map [ k idx ] [ j ] ) ∗ r e t d f p e r i o d . i l o c [ j ] .
values
123 mu sk += w e i g h t e d x
124 weighted xs map [ k idx ] [ : , j ] = weighted x / a l p h a s k l i s t [ k idx ]
125 m u s k l i s t . append ( mu sk / N / a l p h a s k l i s t [ k i d x ] )
126
70
127 covar sk list = []
129 sum covar = 0
130 f o r j i n r a n g e ( 0 , N) :
131 w0 temp = ( r e t d f p e r i o d . i l o c [ j ] − m u s k l i s t [ k i d x ] ) . v a l u e s
132 w0 covar = np . o u t e r ( w0 temp , w0 temp )
133 w e i g h t e d c o v a r = ( q p o s t e r i o r m a p [ k i d x ] [ j ] ) ∗ w0 covar
134 sum covar += w e i g h t e d c o v a r
135 sum covar = sum covar / ( a l p h a s k l i s t [ k i d x ] ∗N)
136 c o v a r s k l i s t . append ( sum covar )
137
138 #####
139 # i n f o r m a t i o n params
140 n sk list = []
141 f o r k i d x i n r a n g e ( 0 , K−1) :
142 n s k = np . l o g ( a l p h a s k l i s t [ k i d x ] / (1− ( np . sum ( a l p h a s k l i s t [ : − 1 ] ) ) )
)
143 n s k l i s t . append ( n s k )
144 m sk list = [ ]
146 m sk = np . dot ( np . l i n a l g . i n v ( c o v a r s k l i s t [ k i d x ] ) , m u s k l i s t [ k i d x ] )
147 m s k l i s t . append ( m sk )
148
149 S sk list = []
151 S s k = np . l i n a l g . i n v ( c o v a r s k l i s t [ k i d x ] )
152 S s k l i s t . append ( S s k )
153
154 a l p h a s k 1 = Parameter ( s i g n= ’ p o s i t i v e ’ )
155 a l p h a s k 2 = Parameter ( s i g n= ’ p o s i t i v e ’ )
156 n s k p r = Parameter ( 1 )
157 dim = Parameter ( s i g n= ’ p o s i t i v e ’ )
158 m sk 1 = Parameter ( i n t ( l e n ( h u n d l l i s t ) ) )
159 m sk 2 = Parameter ( i n t ( l e n ( h u n d l l i s t ) ) )
160 alpha sk 1 . value = alpha sk list [0]
161 alpha sk 2 . value = alpha sk list [1]
162 n sk pr . value = n s k l i s t [ 0 ]
163 m sk 1 . v a l u e = m s k l i s t [ 0 ]
164 m sk 2 . v a l u e = m s k l i s t [ 1 ]
165 dim . v a l u e = i n t ( l e n ( h u n d l l i s t ) )
166 c o v a r k 1 = Se mi de f ( i n t ( l e n ( h u n d l l i s t ) ) )
167 c o v a r k t i l d e = Se mi de f ( i n t ( l e n ( h u n d l l i s t ) ) )
168
169 d u a l f u n c = e n t r ( a l p h a k ) + e n t r (1− a l p h a k ) + a l p h a s k 1 ∗ 0 . 5 ∗ l o g d e t (
covar k 1 ) + \
170 a l p h a s k 1 ∗dim ∗ 0 . 5 ∗ l o g ( 2 ∗ np . e ∗np . p i ) + a l p h a s k 2 ∗ 0 . 5 ∗ l o g d e t (
c o v a r k 1+c o v a r k t i l d e ) + a l p h a s k 2 ∗dim ∗ 0 . 5 ∗ l o g ( 2 ∗ np . e ∗np . p i ) + \
71
171 a l p h a k ∗ n s k p r + a l p h a s k 1 ∗ mu k 1 . T∗ m sk 1 − a l p h a s k 1 ∗ 0 . 5 ∗ t r a c e (
c o v a r k 1 ∗ S s k l i s t [ 0 ] ) − a l p h a s k 1 ∗ 0 . 5 ∗ quad form ( mu k 1 , S s k l i s t [ 0 ] ) +
\
172 a l p h a s k 2 ∗ ( mu k 1 . T) ∗ m sk 2 − a l p h a s k 2 ∗ 0 . 5 ∗ t r a c e ( ( c o v a r k 1+
c o v a r k t i l d e ) ∗ S s k l i s t [ 1 ] ) − a l p h a s k 2 ∗ 0 . 5 ∗ quad form ( mu k 1 , S s k l i s t
[1])
173
174 prob = Problem ( Maximize ( d u a l f u n c )

175 ,
176 [
177 a l p h a k == 0 . 9 6 ,
178 mu k 1 > 0 ,
179 mu k 2 − mu k 1 < 0 ,
180 mu k 2 > −mu k 1 ,
181 mu k 2 < 0 ,
182 ])
183
184 p r i n t ( prob . s o l v e ( s o l v e r=CVXOPT, v e r b o s e=True , k k t s o l v e r= ’ r o b u s t ’ , m a x i t e r s

=1000) )
185 print ( t i t e r )
186 covar k 2 . value = covar k 1 . value + c o v a r k t i l d e . value
187 print ( i s p o s d e f ( covar k 1 . value ) )
188 print ( i s p o s d e f ( c o v a r k t i l d e . value ) )
189 print ( i s p o s d e f ( covar k 2 . value ) )
190 c o v a r k 1 t e m p = np . a r r a y ( c o v a r k 1 . v a l u e )
191 c o v a r k 2 t e m p = np . a r r a y ( c o v a r k 2 . v a l u e )
192 c o v a r k t i l d e t e m p = np . a r r a y ( c o v a r k t i l d e . v a l u e )
193 mu k 1 temp = np . a r r a y ( mu k 1 . v a l u e ) . f l a t t e n ( )
194 mu k 2 temp = np . a r r a y ( mu k 2 . v a l u e ) . f l a t t e n ( )
195 D 1=np . d i a g ( np . s q r t ( np . d i a g ( c o v a r k 1 t e m p ) ) )
196 c o r r e l k 1 = np . dot ( np . dot ( np . l i n a l g . i n v ( D 1 ) , c o v a r k 1 t e m p ) , np . l i n a l g .
inv ( D 1 ) )
197 D 2=np . d i a g ( np . s q r t ( np . d i a g ( c o v a r k 2 t e m p ) ) )
198 c o r r e l k 2 = np . dot ( np . dot ( np . l i n a l g . i n v ( D 2 ) , c o v a r k 2 t e m p ) , np . l i n a l g .
inv ( D 2 ) )
Listing C.3: Constrained Expectation Maximization Algorithm
1 def index calculation default ( basket timeseries ,

2 basket weights ,
3 params ,
4 s t r a t e g y c a c h e=None ,
5 d a t a h a n d l e r=None ,
6 ∗∗ kwargs ) :
7 t i m e s e r i e s s t a r t = min ( b a s k e t t i m e s e r i e s . apply ( lambda x : x [ pd . i s n u l l ( x )
== F a l s e ] , 0 ) . i n d e x )
8 basket timeseries = basket timeseries . f i l l n a (0.0)
9 asOf = np . d a t e t i m e 6 4 ( kwargs [ ’ asOf ’ ] )
72
10 quoting calendar = b a s k e t t i m e s e r i e s . index . values
11 i n c e p t i o n D a t e = np . d a t e t i m e 6 4 ( kwargs . g e t ( ’ i n c e p t i o n D a t e ’ ) )
12 r e b a l s t a r t d a t e = max( i n c e p t i o n D a t e , b a s k e t w e i g h t s . i n d e x . v a l u e s [ 0 ] )
13 c a l i b r a t i o n d a t e s = kwargs [ ’ r e b a l a n c i n g d a y m a p ’ ]
14 s i g n a l c a l e n d a r = kwargs [ ’ s i g n a l c a l e n d a r ’ ]
15 s i g n a l c a l e n d a r d a t e = np . a r r a y ( s i g n a l c a l e n d a r . d a t e l i s t )
16 t r a d i n g d a t e s = b a s k e t t i m e s e r i e s . index . values
17
18 l a s t t r a d e d a t e i = np . s e a r c h s o r t e d ( t r a d i n g d a t e s , asOf )
19 i f t r a d i n g d a t e s [ l a s t t r a d e d a t e i ] > asOf :
20 l a s t t r a d e d a t e i −= 1
21 last trade date = trading dates [ last trade date i ]
22
23 u n d l l i s t = b a s k e t w e i g h t s . columns
24 basket timeseries = basket timeseries [ undl list ]
25 basket timeseries values = basket timeseries [ u n d l l i s t ] . values
26 ## R e t r i e v e FX data
27 f x r e f = params . g e t ( ’ fx map ’ , None )
28 f x h e d g e d = not f x r e f i s None
29 i f fx hedged :
30 #f x r e f = params . g e t ( ’ fx map ’ , { } )
31 all fx = []
32 f x d i s c r i m i n a t o r = params . g e t ( ’ f x d i s c r i m i n a t o r ’ , None )
33 f o r undl i n u n d l l i s t :
34 i f undl i n f x r e f . k e y s ( ) :
35 i f not f x r e f [ undl ] [ 0 ] i s None :
36 f x t s = s t r a t e g y c a c h e . r e t r i e v e d a t a ( f x r e f [ undl ] [ 0 ] ,
d i s c r i m i n a t o r=f x d i s c r i m i n a t o r )
37 f x t s = f x t s ∗∗ ( f x r e f [ undl ] [ 1 ] )
38 else :
39 f x t s = pd . S e r i e s ( 1 . 0 , i n d e x=b a s k e t t i m e s e r i e s . i n d e x )
40 else :
41 f x t s = pd . S e r i e s ( 1 . 0 , i n d e x=b a s k e t t i m e s e r i e s . i n d e x )
42 a l l f x += [ f x t s ]
43
44 f x t s = pd . c o n c a t ( a l l f x , j o i n= ’ o u t e r ’ , a x i s =1) . f i l l n a ( method= ’ pad ’ )

45 f x t s . columns = u n d l l i s t
46 f x d f = t s . f a j ( b a s k e t t i m e s e r i e s . index , f x t s ) [ u n d l l i s t ]
47 fx df values = fx df . values
48
49 ## d a t e s
50 r e b a l d a t e s = basket weights . index . values
51 r e b a l w e i g h t s i d x = np . s e a r c h s o r t e d ( r e b a l d a t e s , [ r e b a l s t a r t d a t e ,
last trade date ])
52 i f r e b a l w e i g h t s i d x [ 1 ] ! = l e n ( r e b a l d a t e s ) and r e b a l d a t e s [
rebal weights idx [ 1 ] ] > last trade date :
53 r e b a l w e i g h t s i d x [ 1 ] −= 1
73
54 roll dates = rebal dates [ rebal weights idx [ 0 ] : rebal weights idx [1]+1]
55 r o l l d a t e s = np . append ( r o l l d a t e s , l a s t t r a d e d a t e )
56 rebal weights = basket weights . values [ rebal weights idx [ 0 ] :
rebal weights idx [1]+1]
57 r e b a l d a t e s s m o o t h i n g = i n t ( params . g e t ( ’ smoothing ’ , 1 ) )
58 c a r r y c o s t d a y c o n v e n t i o n = params . g e t ( ’ c a r r y c o s t d a y c o n v e n t i o n ’ , 3 6 5 )
59
60 # extract costs
61 c a r r y c o s t = params . g e t ( ’ c a r r y c o s t ’ , 0 . 0 )
62 r e b a l c o s t = params . g e t ( ’ r e b a l c o s t ’ , 0 . 0 )
63 i f not i s i n s t a n c e ( c a r r y c o s t , d i c t ) :
64 carry cost = { i : carry cost for i in u n d l l i s t }
65 i f not i s i n s t a n c e ( r e b a l c o s t , d i c t ) :
66 rebal cost = { i : rebal cost for i in u n d l l i s t }
67
68 c a r r y c o s t d f = pd . S e r i e s ( c a r r y c o s t ) [ u n d l l i s t ] . v a l u e s
69 r e b a l c o s t d f = pd . S e r i e s ( r e b a l c o s t ) [ u n d l l i s t ] . v a l u e s
70 index dates = [ ]
71 index levels = [ ]
72
73 # For w e i g h t decom
74 total cost series = []
75 no cost ret series = []
76 no cost ret df = [ ]
77 i n d e x r e f = kwargs . g e t ( ’ i n c e p t i o n V a l u e ’ , 1 0 0 . 0 )
78 l a s t u n i t l e v e l = np . a r r a y ( [ ] )
79 i n c l u d e c o s t d a y o n e = params . g e t ( ’ d a y o n e c o s t ’ , True )
80 u n i t r o u n d i n g = params . g e t ( ’ r o u n d i n g t a r g e t ’ , None )
81
82 i f inceptionDate < r e b a l s t a r t d a t e :
83 pre start date = trading dates [( trading dates < rebal start date ) ∗
84 ( t r a d i n g d a t e s >= i n c e p t i o n D a t e ) ]
85 i n d e x l e v e l s . append ( [ i n d e x r e f ] ∗ l e n ( p r e s t a r t d a t e ) )
86 index dates = [ pre start date ] ∗ len ( pre start date )
87 t o t a l c o s t s e r i e s . append ( [ 0 ] ∗ l e n ( p r e s t a r t d a t e ) )
88 n o c o s t r e t s e r i e s . append ( [ 0 ] ∗ l e n ( p r e s t a r t d a t e ) )
89 n o c o s t r e t d f . append ( [ [ 0 ] ∗ l e n ( u n d l l i s t ) ] ∗ l e n ( p r e s t a r t d a t e ) )
90
91 rebal cost date list = []

92 carry cost 2d array list = []
93 rebal cost ts list = []
94 a l l u n i t s c h e d u l e 2 d a r r a y = None
95 days rebal list = []
96 no cost daily ret = [ ]
97 one day = np . t i m e d e l t a 6 4 ( 1 , ’D ’ )
98 index rebal ref = index ref
99
74
100 # r o l l d a t e s a r e from w e i g h t a l l o c a t o r w e i g h t s
101 r d i = np . s e a r c h s o r t e d ( t r a d i n g d a t e s , r o l l d a t e s )
102 s c d i = np . s e a r c h s o r t e d ( s i g n a l c a l e n d a r d a t e , r o l l d a t e s )
103 num rolls = len ( r o l l d a t e s )
104 t i m e s e r i e s s t a r t n p = np . d a t e t i m e 6 4 ( t i m e s e r i e s s t a r t )
105 d e t e r m i n a t i o n d t s = [ max( c a l i b r a t i o n d a t e s [ x ] , t i m e s e r i e s s t a r t n p ) i f x
i n c a l i b r a t i o n d a t e s e l s e asOf f o r x i n r o l l d a t e s ]
106 u n d l r e f p r i c e i d x = np . s e a r c h s o r t e d ( t r a d i n g d a t e s , d e t e r m i n a t i o n d t s )
107
108 f o r k in range ( num rolls − 1) :

109 nrd = r o l l d a t e s [ k + 1 ]
110 n e x t d e t e r m i n a t i o n d a t e = d e t e r m i n a t i o n d t s [ k+1]
111
112 wgt = r e b a l w e i g h t s [ k ]
113 r o l l t d = t r a d i n g d a t e s [ r d i [ k ] : r d i [ k +1]+1]
114 undl ref price = basket timeseries values [ undl ref price idx [k ] ]
115
116 # converting weight to u n i t s

117 t a r g e t u n i t s = wgt / u n d l r e f p r i c e ∗ i n d e x r e b a l r e f
118 i f fx hedged :
119 t a r g e t u n i t s /= f x d f v a l u e s [ u n d l r e f p r i c e i d x [ k ] ]
120 t a r g e t u n i t s = np . nan to num ( t a r g e t u n i t s )
121 t a r g e t u n i t s = t a r g e t u n i t s i f u n i t r o u n d i n g i s None e l s e t a r g e t u n i t s
. round ( u n i t r o u n d i n g )
122
123 i f len ( l a s t u n i t l e v e l ) > 0:

124 traded units = ( target units − l a s t u n i t l e v e l ) /
rebal dates smoothing
125 else :
126 traded units = target units / rebal dates smoothing
127
128 d a y s r e b a l = s i g n a l c a l e n d a r d a t e [ s c d i [ k ] : s c d i [ k]+
rebal dates smoothing ]
129 u n i t s s c h e d u l e = np . a r r a y ( [ t r a d e d u n i t s ∗ ( i + 1 ) f o r i i n r a n g e ( 0 ,
len ( days rebal ) ) ] )
131 units schedule = units schedule + last unit level
132
133 a l l u n i t s c h e d u l e 2 d a r r a y = np . v s t a c k (
134 [ all unit schedule 2d array , units schedule ]) if
a l l u n i t s c h e d u l e 2 d a r r a y i s not None e l s e u n i t s s c h e d u l e
135 d a y s r e b a l l i s t . append ( d a y s r e b a l )
136
137 # c o s t used a b s o l u t e change i n u n i t s

138 c o s t r e b a l r e f = abs ( t r a d e d u n i t s ) ∗ r e b a l c o s t d f
139 c a r r y c o s t r e f = abs ( u n i t s s c h e d u l e ) ∗ c a r r y c o s t d f
140 d a y s r e f = d a y s r e b a l [ d a y s r e b a l <= r o l l t d [ − 1 ] ]
75
141 s u b r o l l s = np . append ( d a y s r e f , nrd )
142 s r d i = np . s e a r c h s o r t e d ( r o l l t d , s u b r o l l s )
143 b s r d i = np . s e a r c h s o r t e d ( t r a d i n g d a t e s , s u b r o l l s )
144 f o r l in range ( len ( s u b r o l l s ) − 1) :
145 r o l l t d s u b r o l l = r o l l t d [ s r d i [ l ] : s r d i [ l +1]+1]
146 wgt sr = units schedule [ l ]
147 carry cost sr = carry cost ref [ l ]
148
149 # b i g improvement vs u s i n g r o l l t d s u b r o l l
150 u n d l t s = b a s k e t t i m e s e r i e s v a l u e s [ b s r d i [ l ] : b s r d i [ l +1]+1]
151 l e v e l r e f = undl ts [ 0 ]
152 undl ret = undl ts − l e v e l r e f
153
154 # this i s intermediate

155 d a i l y r e t = undl ts [ 1 : ] − undl ts [: −1]
156
157 i f fx hedged :
158 f x t s = f x d f v a l u e s [ b s r d i [ l ] : b s r d i [ l +1]+1]
159 u n d l r e t ∗= f x t s
160 d a i l y r e t ∗= f x t s [ 1 : ]
161
162 n o c o s t p o r t f o l i o r e t r o l l t d d a y s = ( undl ret ∗ wgt sr )

163 n o c o s t p o r t f o l i o r e t = np . sum ( n o c o s t p o r t f o l i o r e t r o l l t d d a y s ,
a x i s =1)
164
165 daily drag per undl = ( carry cost sr ∗ l e v e l r e f )

166 c a r r y c o s t 2 d n p a r r a y = [ np . z e r o s ( l e n ( u n d l l i s t ) ) ]
167 f o r num days i n ( r o l l t d s u b r o l l − r o l l t d s u b r o l l [ 0 ] ) [ 1 : ] / one day
:
168 d r a g t o t a l p e r u n d l = d a i l y d r a g p e r u n d l ∗ num days /
carry cost day convention
169 c a r r y c o s t 2 d n p a r r a y . append ( d r a g t o t a l p e r u n d l )
170
171 c a r r y c o s t t o t a l = np . a r r a y ( [ np . sum ( x ) f o r x i n
c a r r y c o s t 2 d n p a r r a y ∗( f x t s i f fx hedged e l s e 1) ] )
172
173 # another intermediate

174 try :
175 c a r r y c o s t a p p e n d = np . a r r a y ( c a r r y c o s t 2 d n p a r r a y [ 1 : ] ∗ ( f x t s
[1:] i f fx hedged e l s e 1) ) − \
176 np . a r r a y ( c a r r y c o s t 2 d n p a r r a y [ : − 1 ] ∗ (
f x t s [: −1] i f fx hedged e l s e 1) )
177
178 i f len ( carry cost append ) > 0:

179 c a r r y c o s t 2 d a r r a y l i s t . append ( c a r r y c o s t a p p e n d )
180 except ValueError :
181 msg = ’ Reach l a s t d a t e − ’ + s t r ( n e x t d e t e r m i n a t i o n d a t e )
76
182 l o g g e r . debug ( msg )
183
184 cost rebal = l e v e l r e f ∗ cost rebal ref ∗ include cost day one
185
186 i f fx hedged :
187 c o s t r e b a l ∗= f x t s [ 0 ]
188 r c t o t a l = c o s t r e b a l . sum ( )
189
190 total cost = carry cost total + rc total

191 ref with cost = index ref − total cost
192 index roll = ref with cost + no cost portfolio ret
193
194 # f o r w e i g h t decom
195 total cost roll = carry cost total + rc total
196
197 # f o r c r e a t i n g d a t a fr a m e
198 # of rebal cost
199 r e b a l c o s t t s l i s t . append ( c o s t r e b a l )
200 r e b a l c o s t d a t e l i s t . append ( r o l l t d s u b r o l l [ 0 ] )
201 # o f no c o s t p o r t f o l i o r e t u r n
202 n o c o s t d a i l y r e t . append ( d a i l y r e t ∗ w g t s r )
203
204 # append e v e r y t h i n g e x c e p t l a s t i n d e x r o l l v a l u e
205 i f n e x t d e t e r m i n a t i o n d a t e <r o l l t d s u b r o l l [ −1] and
next determination date in r o l l t d s u b r o l l :
206 i n d e x r e b a l r e f = i n d e x r o l l [ np . s e a r c h s o r t e d ( r o l l t d s u b r o l l ,
next determination date ) ]
207 i n d e x d a t e s . append ( r o l l t d s u b r o l l [ : − 1 ] )
208 i n d e x l e v e l s . append ( i n d e x r o l l [ : − 1 ] )
209
210 # c o s t f o r decomp
211 t o t a l c o s t s e r i e s . append ( t o t a l c o s t r o l l [ : − 1 ] )
212
213 # ignore f i r s t value

214 # i n d e x r e f i s i n d e x l e v e l with c a r r y c o s t and r e b a l c o s t a p p l i e d
215 # if s i z e 1 ( l a s t trading date ) , t h i s get ignored .
216 # i n d e x r e f s e r i e s . append ( r e f w i t h c o s t [ 1 : ] )
217 n o c o s t r e t s e r i e s . append ( n o c o s t p o r t f o l i o r e t [ 1 : ] )
218 n o c o s t r e t d f . append ( n o c o s t p o r t f o l i o r e t r o l l t d d a y s [ 1 : , :] −
n o c o s t p o r t f o l i o r e t r o l l t d d a y s [: −1 , : ] )
219 i n d e x r e f = i n d e x r o l l [ −1]
220 i n c l u d e c o s t d a y o n e = i n c l u d e c o s t d a y o n e o r True
221 l a s t u n i t l e v e l = u n i t s s c h e d u l e [ −1]
222
223 # l a s t i t e r , append l a s t i n d e x r o l l v a l u e
224 i n d e x d a t e s . append ( r o l l t d s u b r o l l [ − 1 : ] )
225 i n d e x d a t e s = np . c o n c a t e n a t e ( i n d e x d a t e s )
77
226 i n d e x l e v e l s . append ( i n d e x r o l l [ − 1 : ] )
227 i n d e x l e v e l = pd . S e r i e s ( np . c o n c a t e n a t e ( i n d e x l e v e l s ) , i n d e x=i n d e x d a t e s )
228
229 # c o s t f o r decomp
230 t o t a l c o s t s e r i e s . append ( t o t a l c o s t r o l l [ − 1 : ] )
231 t o t a l c o s t s e r i e s = pd . S e r i e s ( np . c o n c a t e n a t e ( t o t a l c o s t s e r i e s ) , i n d e x=
index dates )
232 n o c o s t r e t s e r i e s = pd . S e r i e s ( np . c o n c a t e n a t e ( n o c o s t r e t s e r i e s ) , i n d e x=
index dates [: −1])
233 n o c o s t r e t d f = pd . DataFrame ( np . c o n c a t e n a t e ( n o c o s t r e t d f ) , i n d e x=
i n d e x d a t e s [ : − 1 ] , columns=u n d l l i s t )
234
235 # make d at a f r am e f o r i n t e r m e d i a t e r e s u l t s :
236 c a r r y c o s t d f = pd . DataFrame ( data=np . c o n c a t e n a t e ( c a r r y c o s t 2 d a r r a y l i s t )
, columns=u n d l l i s t )
237
238 t r a d i n g d a t e s f r o m = t r a d i n g d a t e s [ t r a d i n g d a t e s >= r o l l d a t e s [ 0 ] ]
239 second td = trading dates from [ 1 ]
240
241 # do not s t o r e c a r r y c o s t on f i s t s u b r o l l day − always 0

242 t r a d i n g d a t e s r a n g e = t r a d i n g d a t e s [ ( t r a d i n g d a t e s >= s e c o n d t d ) & (
t r a d i n g d a t e s <= nrd ) ]
243 carry cost df [ ’ trading date ’ ] = trading dates range
244 carry cost df = carry cost df . set index ( [ ’ trading date ’ ] )
245 n o c o s t d a i l y r e t d f = pd . DataFrame ( data=np . c o n c a t e n a t e ( n o c o s t d a i l y r e t )
, columns=u n d l l i s t ,
246 i n d e x=t r a d i n g d a t e s r a n g e )
247 n o c o s t d a i l y r e t d f . i n d e x . name = ’ r e b a l d a t e ’
248 r e b a l c o s t d f = pd . DataFrame ( r e b a l c o s t t s l i s t , columns=u n d l l i s t , i n d e x=
rebal cost date list )
249 r e b a l c o s t d f . i n d e x . name = ’ r e b a l d a t e ’
250
251 ### Add w e i g h t s t o be r e b a l a n c e a f t e r asOf d a t e

252 wgt next = basket weights . l o c [ so r t e d ( c a l i b r a t i o n d a t e s . keys ( ) ) ]
253 wgt next = wgt next [ wgt next . index > l a s t t r a d e d a t e ]
254
255 f o r i in range ( len ( wgt next ) ) :

256 wgt future = wgt next . i l o c [ i ]
257 r e b a l d a t e = wgt next . index . values [ i ]
258 calib date = calibration dates [ rebal date ]
259 d a y s r e f = s i g n a l c a l e n d a r d a t e [ s i g n a l c a l e n d a r d a t e >= r e b a l d a t e ] [ 0 :
rebal dates smoothing ]
260 index level ref = index level . loc [ calib date ]
261 undl level ref = basket timeseries . loc [ calib date ]
262 target units = ( wgt future / u n d l l e v e l r e f ) ∗ i n d e x l e v e l r e f
263
264 # i f f x h e d g e d then r e a d j u s t u n i t s !
78
265 i f fx hedged :
266 try :
267 # f i n d f x r a t e on d e t e r m i n a t i o n d a t e
268 det date wgt next = c a l i b r a t i o n d a t e s [ wgt next . index . values
[ −1]]
269 t a r g e t u n i t s /= f x d f . l o c [ d e t d a t e w g t n e x t ]
270 except IndexError :
271 t a r g e t u n i t s /= f x d f . i l o c [ −1]
272
273 # r o u n d i n g f o r u n i t s a f t e r asOf d a t e
274 t a r g e t u n i t s = np . round ( t a r g e t u n i t s ,
275 params [ ’ r o u n d i n g t a r g e t ’ ] ) if ’ rounding target
’ i n params e l s e t a r g e t u n i t s
276

278 traded units = ( target units − l a s t u n i t l e v e l ) /
rebal dates smoothing
279 else :
280 traded units = target units / rebal dates smoothing
281
282 u n i t s s c h e d u l e = np . a r r a y ( [ t r a d e d u n i t s . v a l u e s ∗ ( i + 1 ) f o r i i n
range (0 , len ( d a y s r e f ) ) ] )
284 units schedule = units schedule + last unit level
285
286 a l l u n i t s c h e d u l e 2 d a r r a y = np . v s t a c k (
287 [ all unit schedule 2d array , units schedule ]) if
a l l u n i t s c h e d u l e 2 d a r r a y i s not None e l s e u n i t s s c h e d u l e
288 d a y s r e b a l l i s t . append ( d a y s r e f )
289
290 a l l u n i t s c h e d u l e = pd . DataFrame ( data=a l l u n i t s c h e d u l e 2 d a r r a y ,

291 i n d e x =[ item f o r s u b l i s t i n
d a y s r e b a l l i s t f o r item i n s u b l i s t ] ,
292 columns=u n d l l i s t )
293
294 t o p d r a g f e e = None
295 if ’ d r a g f e e ’ i n params :
296 i n d e x l e v e l = s i . d r a g f e e ( d r a g r a t e=params [ ’ d r a g f e e ’ ] , t i m e s e r i e s=
index level )
297 t o p d r a g f e e = params [ ’ d r a g f e e ’ ]
298
299 return { ’ index level ’ : index level ,

300 ’ weight schedule ’ : all unit schedule
301 }
Listing C.4: Function to calculate the cumulative PnL
79
Bibliography
[1] C. Ari. Maximum likelihood estimation of robust constrained Gaussian Mixture Mod-
els. 2013.
[2] D. Tasche. Capital Allocation to Business Units and Sub-Portfolios: The Euler Prin-
ciple. The New Accord: The Challenge of Economic Capital, 2008.
[3] J. Jacod and Y. Ait-Sahalia. Analyzing the Spectrum of Asset Returns: Jump and
Volatility Components in High Frequency Data. Journal of Economic Literature,
50:1007–1050, 2012.
[4] J.P. Bouchaud, M. Potters, R. Benichou and Y. Lemperiere. Agnostic Risk Parity:
Taming Known and Unknown-Unknowns. 2016.
[5] M. Wainwright and M. Jordan. Graphical models, exponential families and variational
inference. 2008.
[6] A. Meucci. Risk Budgeting and Diversification Based on Optimized Uncorrelated Fac-
tors. 2015.
[7] R. Michaud. Efficient asset allocation: A practical guide to stock portfolio optimization
and asset allocation. 1998.
[8] M. Prado. Building diversified portfolios that outperform out-of-sample. 2016.
[9] R. Hamdan, F. Pavlowsky, T. Roncalli and B. Zheng. A Primer on Alternative Risk

Premia. 2016.
[10] R. Litterman. Hot Spots and Hedges. Goldman Sachs Risk Management Series, 1996.
[11] T. Roncalli. Introduction to Risk Parity and Budgeting. 2012.
[12] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press,

2004.
[13] T. Roncalli. Introducing Expected Returns into Risk Parity Portfolios: A New Frame-
work for Asset Allocation. 2013.
80
[14] T. Roncalli. Keep Up The Momentum. 2017.
[15] T. Roncalli, N. Kostyuchyk, B. Bruder. Risk Parity Portfolios with Skewness Risk.
2016.
[16] Y. Tenne and C. Goh. Computational Intelligence in Optimization: Applications and

Implementations. Springer, 2010.
81

TT18 Dissertation Vu 0

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

TT18 Dissertation Vu 0

Uploaded by

Copyright:

Available Formats

Risk Parity with Constrained

Gaussian Mixture Models

A thesis submitted for the degree of

A dissertation submitted to the University of Oxford in accordance with the

2 Diversification and Skewness 3

3 Constrained Gaussian Mixture Models 10

4 Risk Parity with Expected Shortfall and Gaussian Mixture 27

5 Conclusion and Future Work 50

B Risk Allocation Supplementary 60

4.1 Skewness coefficient γ1 (Y ) as the function of volatility σ1 (Y ) . . . . . . . . 30

2.1 Skewness sensitivities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

4.1 Parameters estimation of µ̃ under different historical periods with λ = 0.02 39

C.1 Markowitz mean-variance optimisation function . . . . . . . . . . . . . . . . 63

1.1 Objective and Contributions

1.2 Organization of the Thesis

Diversification and Skewness

2.1 Introduction to diversification

R(X + Y ) ≤ R(X) + R(Y ) (2.1)

2.2 Skewness aggregation

Proof. Firstly, in the case of independent random variables, we have:

eq. (2.5) becomes:

From this, we can see that γ1 (X + Y ) ≈ γ1 (X) if σ(X)  σ(Y ).

Using eq. (2.6), we can deduce that:

One direct consequence of the above is that µ3 (X + Y ) is a decreasing function of

Effect ρ(X, Y ) ρ(X 2 , Y ) ρ(X, Y 2 )

Table 2.1: Skewness sensitivities

Proof. We recall from properties of log-normal distribution that:

µ2 (X) = E[X 2 ] − E[X]2

µ3 (X) = E[(X − E[X])3 ]

−0.75 −0.50 −0.25 0.00 0.25 0.50 0.75 1.00

From eq. (2.12) and eq. (2.22), we also obtain:

cov(X, X, Y ) = cov(X 2 , Y ) − 2cov(X, Y )E[X]

Similarly, we obtain the result for cov(X, Y, Y ).

−0.4 −0.2 0.0 0.2 0.4 0.6 0.8 1.0

−0.6 −0.4 −0.2 0.0 0.2 0.4 0.6 0.8

−0.2 0.0 0.2 0.4 0.6 0.8

Constrained Gaussian Mixture

3.2 Parameters estimation with simple case

f (xn ) = (1 − λ)φ0 (xn ; µ1 , Σ1 ) + λφ0 (xn ; µ2 , Σ2 ) (3.1)

The derivative of this function with respect to µk is:

The first order condition implies that

where we have defined

The first order condition implies that

Hence, we have c = N and:

4. Iterate Step 2 and 3 until convergence.

3.3 Exponential family form of Gaussian Mixture Models

P (x, y|θ) = exp(θT T(x, y) − A(θ, y)) . (3.20)

P (y|θy ) = exp(θyT Ty (y) − A(θy )) (3.21)

where θy ∈ RK−1 is the natural parameters, Ty : Ωy → RK−1 is the sufficient statistics

P (x |y = k, θx |y=k ) = P (x |θx |y=k )

= exp(θxT |y Tx |y (x, y) − A(θx |y , y))

where we have defined:

P (x, y|θ) = P (y|θy )P (x |y, θx |y )

= exp(θT T(x, y) − A(θ, y))

where we have defined the natural parameters θ = (θyT , θx|y=k

3.4 Generalized Expectation-Maximization

`(θ) = log P (X |θ)

3.4.2 Expectation step

The Expectation-Maximization algorithm is an iterative algorithm that consists of two

Qt = arg max F (Q, θt−1 ) . (3.31)

= `(θ|xn ) − KL[q(yn )||P (yn |xn , θ)]

= `N (θ|X ) − KLN [Q||P (Y|X , θ)]

From this, we can see that γ1 (X + Y ) ≈ γ1 (X) if σ(X) σ(Y ).

this is not an affine equality constraint in the variable Σk , A, the relaxation Σk