Boosted Density Estimation Remastered
Abstract

There has recently been a steady increase in the number of iterative approaches to density estimation. However, an accompanying burst of formal convergence guarantees has not followed; all results pay the price of heavy assumptions which are often unrealistic or hard to check. The Generative Adversarial Network (GAN) literature — seemingly orthogonal to the aforementioned pursuit — has had the side effect of a renewed interest in variational divergence minimisation (notably f-GAN). We show how to combine this latter approach and classical boosting theory in supervised learning to get the first density estimation algorithm that provably achieves geometric convergence under very weak assumptions. We do so by a trick allowing us to combine classifiers as the sufficient statistics of an exponential family. Our analysis includes an improved variational characterisation of f-GAN.

1. Introduction

In the emerging area of Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) a binary classifier, called a discriminator, is used to learn a highly efficient sampler for a data distribution P, combining what would traditionally be two steps — first learning the density function from a family of densities, then fine-tuning a sampler — into one. Interest in this field has sparked a series of formal inquiries and generalisations describing GANs in terms of (among other things) divergence minimisation (Nowozin et al., 2016; Arjovsky et al., 2017). Using a similar framework to Nowozin et al. (2016), Grover & Ermon (2018) make a preliminary analysis of an algorithm that takes a series of iteratively trained discriminators to estimate a density function.¹ The cost here, insofar as we have been able to devise, is that one forgoes learning an efficient sampler (as with a GAN), and must make do with classical sampling techniques to sample from the learned density. We leave the issue of efficient sampling from these densities as an open problem, and instead focus on analysing the densities learned, with formal convergence guarantees. Previous formal results have established a range of guarantees, from qualitative convergence (Grover & Ermon, 2018) to geometric convergence rates (Tolstikhin et al., 2017), with numerous results in between (see §1.1).

Our starting point is fundamentally different: we learn a density from a sequence of binary classifiers. Under weak learning assumptions similar to those of boosting, this sequence is shown to fit the target density arbitrarily closely.

With the advent of deep learning, such an approach appears to be very promising, as the formal bounds we obtain yield geometric convergence under assumptions arguably much weaker than those of other similar works in this area.

The rest of the paper and our contributions are as follows: in §2, to make explicit the connections between classification, density estimation, and divergence minimisation, we re-introduce the variational f-divergence formulation, and in doing so are able to fully explain some of the underspecified components of f-GAN (Nowozin et al., 2016); in §3, we relax a number of the assumptions of Grover & Ermon (2018), and then give both more general and much stronger bounds for their algorithm; in §4, we apply our algorithm to several toy datasets in order to demonstrate convergence and compare directly with Tolstikhin et al. (2017); and finally, §5 concludes.

The appendices that follow in the supplementary material are: §A, a comparison of our formal results with other related works; §B, a geometric account of the function class in the variational form of an f-divergence; §C, a further relaxation of the weak learning assumptions to some that could actually be estimated experimentally, and a proof that the boosting rates are slightly worse but of essentially the same order; §D, proofs for the main formal results of the paper; and finally, §E, technical details for the settings of our experiments.

1 The Australian National University, 2 Data61, 3 The University of Sydney. Correspondence to: Zac Cranko <zac.cranko@anu.edu.au>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

¹ Grover & Ermon (2018) call this procedure "multiplicative discriminative boosting".
Table 1: Summary of related works and results in terms of (i) the components aggregated ("updates") in the final density (this density can be implicit), (ii) the rate to a given KL/JS value, and (iii) the assumption(s).
1.1. Related work

In learning a density function iteratively, it is remarkable that most previous approaches (Guo et al., 2016; Li & Barron, 2000; Miller et al., 2017; Rosset & Segal, 2002; Tolstikhin et al., 2017; Zhang, 2003, and references therein) have investigated a single update rule, not unlike Frank–Wolfe optimisation:

    q_t = h(α_t · j(d_t) + (1 − α_t) · j(q_{t−1})),    (1)

wherein h, j are some transformations that are in general (but not always) the identity, (q_t)_{t∈ℕ} is the sequence of density functions learnt, and (α_t, d_t)_{t∈ℕ} are the step sizes and updates. The updates and step sizes are chosen so that for some measure of divergence I(P, Q_t) → 0 as t → ∞, where I(P, Q_t) is the divergence of the true distribution P from Q_t. Grover & Ermon (2018) is one (recent) rare exception to (1) wherein alternative choices are explored. Few works in this area are accompanied by convergence proofs, and even fewer provide convergence rates (Guo et al., 2016; Li & Barron, 2000; Rosset & Segal, 2002; Tolstikhin et al., 2017; Zhang, 2003).

To establish convergence and/or bound convergence by a rate, all approaches necessarily make structural assumptions or approximations on the parameters involved in (1). These assumptions can be on the (local) variation of the divergence (Guo et al., 2016; Naito & Eguchi, 2013; Zhang, 2003), the true distribution or the updates (Dai et al., 2016; Grover & Ermon, 2018; Guo et al., 2016; Li & Barron, 2000), the step size (Miller et al., 2017; Tolstikhin et al., 2017), the previous updates (d_i)_{i≤t} (Dai et al., 2016; Rosset & Segal, 2002), and so on. Often, in order to produce the best geometric convergence bounds, the update is required to be close to the optimal one (Tolstikhin et al., 2017, Cor. 2, 3). Table 1 compares the best results of the leading three to our approach. We give for each of them the updates aggregated, the assumptions on which the results rely, and the rate to come close to a fixed value of KL divergence (Jensen–Shannon, JS, for Tolstikhin et al. (2017)), which is just the order of the number of iterations necessary, hiding all other dependences for simplicity.

However, it must be kept in mind that for many of these works (viz. Tolstikhin et al., 2017) the primary objective is to develop an efficient black box sampler for P, in particular for large dimensions. Our objective, however, is to focus on the lack of formal results on the densities and their convergence, instead leaving the problem of sampling from these densities as an open question.

2. Preliminaries

In the sequel X is a topological space. Unnormalised Borel measures on X are indicated by decorated capital letters, P̃, and Borel probability measures by capital letters without decoration, P. To a function f : X → (−∞, +∞] we associate another function f*, called the Fenchel conjugate, with f*(x*) := sup_{x∈X} ⟨x*, x⟩ − f(x). If f is convex, proper, and lower semi-continuous, then f = (f*)*. If f is strictly convex and differentiable on int(dom f), then (f*)′ = (f′)^{−1}. Theorem-like formal statements are numbered to be consistent with their appearance in the appendix (§D), to which we defer all proofs.

An important tool of ours are the f-divergences of information theory (Ali & Silvey, 1966; Csiszár, 1967). The f-divergence of P from Q is I_f(P, Q) := ∫ f(dP/dQ) dQ, where it is assumed that f : ℝ → (−∞, +∞] is convex and lower semi-continuous, and Q dominates P.² Every f-divergence has a variational representation (Reid & Williamson, 2011) via the Fenchel conjugate:

    I_f(P, Q) = ∫ (f*)*(dP/dQ) dQ
              = ∫ sup_{t>0} (t · dP/dQ − f*(t)) dQ
              = sup_{u∈(dom f*)^X} ∫ (u · dP/dQ − f* ∘ u) dQ
              = sup_{u∈(dom f*)^X} (E_P u − E_Q f* ∘ u),    (2)

where the supremum is implicitly restricted to measurable functions.

In contrast to the abstract family (dom f*)^X, binary classification models tend to be specified in terms of density …

² Common divergence measures such as Kullback–Leibler (KL) and total variation can easily be shown to be members of this family by picking f accordingly (Reid & Williamson, 2011). Several examples of these are listed in Table 2.
Table 3: Classification decision rules.

    Collection      R(X)     D(X)         C(X)
    Decision rule   d ≥ 1    D ≥ 1 − D    c ≥ 0

Figure 4: Bijections for reparameterising (V): the maps exp : C(X) → R(X), φ : D(X) → R(X), and f′ : R(X) → int(dom f*)^X.

It is a common result (Nguyen et al., 2010; Grover & Ermon, 2018; Nowozin et al., 2016) that the supremum in (2) is achieved for f′ ∘ dP/dQ. It is convenient to define the reparameterised variational problem:

    maximise_{d∈F}  J(d) := E_P (f′ ∘ d) − E_Q (f* ∘ f′ ∘ d),    (V)

where F ⊆ R(X), so that the unconstrained maximum is achieved for dP/dQ. See §B for more details.³

Example 1 (Neural classifier). A neural network with a softmax layer corresponds to an element D ∈ C(X). To convert this to an element d ∈ R(X), simply substitute the softmax with an exponential activation function. This is just the arrow C(X) → R(X) in Figure 4.

Example 2 (f-GAN). The GAN objective (Goodfellow et al., 2014) is implicitly solving the reparameterised variational problem (V):

    sup_{D∈D(X)} (E_P log D + E_Q log(1 − D))
        = sup_{D∈D(X)} (E_P (f′ ∘ φ) ∘ D − E_Q (f* ∘ f′ ∘ φ) ∘ D)
        = GAN(P, Q),

where φ : D(X) → R(X) is the bijection illustrated in Figure 4.

The updates we consider are

    dQ̃_t = d_t^{α_t} dQ̃_{t−1},    Q_t = (1/Z_t) Q̃_t,  where  Z_t := ∫ dQ̃_t,    (3)

where α_t ∈ [0, 1] is the step size (for reasons that will be clear shortly), d_t : X → ℝ_+ is a density ratio, and we fix the initial distribution Q_0. After t updates we have the distribution

    Q_t(dx) = (∫ ∏_{i=1}^t d_i^{α_i} dQ_0)^{−1} ∏_{i=1}^t d_i^{α_i}(x) Q_0(dx).    (4)

Proposition 2. The normalisation factors can be written recursively with Z_t = Z_{t−1} · E_{Q_{t−1}} d_t^{α_t}.

Proposition 3. Let Q_t be defined via (4) with a sequence of binary classifiers c_1, …, c_t ∈ C(X), where c_i = log d_i for i ∈ [t]. Then Q_t is an exponential family distribution with natural parameter α := (α_1, …, α_t) and sufficient statistic c(x) := (c_1(x), …, c_t(x)).

The sufficient statistics of our distributions are classifiers that would hence be learned, along with the appropriate

³ While it might seem like there are certain inclusions here (for example D(X) ⊆ R(X) ⊆ C(X)), these categories of functions really are distinct objects when thought of with respect to their corresponding binary classification decision rules (listed in Table 3).
fitting of natural parameters. As explained in the proof, the representation may not be minimal; however, without further constraints on α, the exponential family is regular (Barndorff-Nielsen, 1978). A similar interpretation of a neural network, in terms of parameterising the sufficient statistics of a deformed exponential family, is given by Nock et al. (2017).

In the remainder of this section, we show how to learn the density ratios d_i and choose the step sizes α_i from observed data to ensure convergence Q_t →_t P.

3.1. Helper results for the convergence of Q_t to P

The updates d_t are learnt by solving (V) with Q replaced by Q_{t−1}. To make explicit the dependence of Q_t on α_t we will sometimes write dQ̃_t|α_t := d_t^{α_t} dQ̃_{t−1} and dQ_t|α_t := Z_t^{−1} dQ̃_t|α_t. Since Q_t is an exponential family (Proposition 3), we measure the divergence between P and Q_t using the KL divergence (Table 2), which is the canonical divergence of exponential families (Amari & Nagaoka, 2000).

Notice that we can write any solution to (V) as d_t = dP/dQ_{t−1} · ε_t, where we call ε_t : X → ℝ_+ the error term, due to the fact that it is determined by the difference between the constrained and global solutions to (V). A more detailed analysis of the quantity ε_t is presented in §B.

Lemma 5. For any α_t ∈ [0, 1], letting Q_t, Q_{t−1} be as in (3), we have

    KL(P, Q_t|α_t) ≤ (1 − α_t) KL(P, Q_{t−1}) + α_t (log E_P ε_t − E_P log ε_t)    (5)

for all d_t ∈ R(X), where ε_t := (dP/dQ_{t−1})^{−1} d_t.

Remark 3. Grover & Ermon (2018) assume a uniform error term, ε_t ≡ 1. In this case Lemma 5 yields geometric convergence:

    ∀α_t ∈ [0, 1] : KL(P, Q_t|α_t) ≤ (1 − α_t) KL(P, Q_{t−1}).

This result is significantly stronger than Grover & Ermon (2018, Thm. 2), who just show the non-increase of the KL divergence. If, in addition to achieving uniform error, we let α_t = 1, then (5) guarantees Q_t|α_t=1 = P.

We can express the updates (4) and (5) in a way that more closely resembles the Frank–Wolfe update (1). Since ε_t takes on positive values, we can identify it with a density ratio involving a (not necessarily normalised) measure R̃_t, as follows:

    R̃_t(dx) := ε_t(x) · P(dx)  and  R_t := (∫ dR̃_t)^{−1} · R̃_t.    (6)

Introducing R̃_t allows us to lend some interpretation to Lemma 5 in terms of the probability measure R_t. If we assume X is a measure space and that all measures have positive density functions (denoted by lowercase letters) with respect to the ambient measure, then

    q_t ∝ d_t^{α_t} q_{t−1} = (p/q_{t−1})^{α_t} ε_t^{α_t} q_{t−1} = r̃_t^{α_t} q_{t−1}^{1−α_t},

or equivalently

    ∃C > 0 : log q_t = α_t log r_t + (1 − α_t) log q_{t−1} − C.

This shows the manner in which (3) is a special case of (1).

Corollary 6. For any α_t ∈ [0, 1] and ε_t ∈ [0, +∞)^X, let Q_t be as in (4) and R_t as in (6). If R_t satisfies

    KL(P, R_t) ≤ γ KL(P, Q_{t−1})    (7)

for γ ∈ [0, 1], then

    KL(P, Q_t|α_t) ≤ (1 − α_t(1 − γ)) KL(P, Q_{t−1}).

We obtain the same geometric convergence as Tolstikhin et al. (2017, Cor. 2) for a boosted distribution Q_t which is not a convex mixture, which, to our knowledge, is a new result. Corollary 6 is restricted to the KL divergence, but we do not need the technical domination assumption that Tolstikhin et al. (2017, Cor. 2) require. From the standpoint of weak versus strong learning, Tolstikhin et al. (2017, Cor. 2) require a condition similar to (7), that is, the iterate R_t has to be close enough to P. It is the objective of the following sections to relax this requirement to something akin to the weak updates common in a boosting scheme.

3.2. Convergence under weak assumptions

In the previous section we established two preliminary convergence results (Remark 3, Corollary 6) that equal the state of the art and/or rely on similarly strong assumptions. We now show how to relax these in favour of placing some weak conditions on the binary classifiers learnt in (2). Define the two expected edges of c_t (cf. Nock & Nielsen, 2008):

    ν_{Q_{t−1}} := (1/c⋆) E_{Q_{t−1}}[−c_t]  and  μ_P := (1/c⋆) E_P[c_t],    (8)

where c⋆ := max_t ess sup |c_t| and the maximum is implicitly over all previous iterations. Classical boosting results rely on assumptions on such edges for different kinds of c_t (Freund & Schapire, 1997; Schapire, 1990; Schapire & Singer, 1999), and on the implicit and weak assumption, which we also make, that 0 < c⋆ < ∞, that is, the classifiers have bounded and nonzero confidence. By construction then, ν_{Q_{t−1}}, μ_P ∈ [−1, 1]. The difference of sign of c_t is due to the decision rule for a binary classifier (Table 3), whereby c_t(x) ≥ 0 reflects that c_t classifies x ∈ X as originating from P rather than Q_{t−1}, and vice versa for −c_t(x).
Figure 5: Illustration of the WLA in one dimension with a classifier c_t and its decision rule (indicated by the dashed grey line). The red (resp. blue) area is the area under the (c_t/c⋆) · p (resp. (c_t/c⋆) · q_{t−1}) line (where p, q_{t−1} are the corresponding density functions), that is, μ_P (resp. ν_{Q_{t−1}}). (a) The WLA is satisfied since μ_P and ν_{Q_{t−1}} are positive. (b) The WLA is violated since ν_{Q_{t−1}} is negative.
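The expected edges in (8) can be estimated from samples, which is how the situation of Figure 5 would be checked in practice. The following is a hypothetical sketch of ours: the two Gaussians and the clipped linear classifier are made up, and c⋆ is imposed by clipping so that the edges land in [−1, 1].

```python
import random

random.seed(2)

# A classifier with confidence bounded by c* = 2 (made-up, via clipping).
cstar = 2.0
ct = lambda x: max(-cstar, min(cstar, x))

# Samples from P (shifted right) and from Q_{t-1} (shifted left): the
# classifier's sign separates them, as in Figure 5(a).
xP = [random.gauss(1.0, 1.0) for _ in range(100000)]
xQ = [random.gauss(-1.0, 1.0) for _ in range(100000)]

mu_P = sum(ct(x) for x in xP) / (cstar * len(xP))    # mu_P  = E_P[c_t] / c*
nu_Q = sum(-ct(x) for x in xQ) / (cstar * len(xQ))   # nu_Q  = E_Q[-c_t] / c*

# Both edges positive: c_t tends to label P-samples + and Q-samples -,
# so the weak-learning condition of Figure 5(a) holds for this classifier.
assert 0 < mu_P <= 1 and 0 < nu_Q <= 1
```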
We normalise NLL by its true value to make this quantity more interpretable. The KL divergence is computed using numerical integration, and as such it can be quite tricky to ensure stability when running stochastically varying experiments, and it becomes very hard to compute in dimensions higher than n = 2. In these computationally difficult cases we use NLL, which is much more stable by comparison. We plot the mean and 95% confidence intervals for these quantities.

4.1. Results

Complete details about the experimental procedures, including target data and network architectures, are deferred to the supplementary material (§E).

4.1.1. ERROR AND CONVERGENCE

Figure 6: As KL(P, Q_t) → 0 it becomes harder and harder to train a good classifier c_t. (Panels plot classifier accuracy, which tends to 0.5 if Q_t → P, and KL(P, Q_t), lower is better, over six boosting iterations.)

Figure 7: The effect of different activation functions, modelling a ring of Gaussians. Panels: (a) P, (b) ReLU, (c) tanh, (d) Softplus. The "petals" in the ReLU condition are likely due to the linear hyperplane sections of the final network layer being shaped by the final exponential layer.

… the sixth round, with ReLU and SELU being the marginal winners. It is also interesting to note the narrow error ribbons on tanh compared to the other functions, indicating more consistent training.

⁴ It is easy to pick Q_0 to be convenient to sample from.
4.1.3. NETWORK TOPOLOGY

To compare the effect of the choice of network architecture, we fix the activation function and try a variety of architectures, varying both the depth and the number of nodes per layer. For this experiment the target distribution P is a mixture of 8 Gaussians that are randomly positioned at the beginning of each run of training. Let m × n denote a fully connected neural network c_t with m hidden layers and n nodes per layer. After each hidden layer we apply the SELU activation function.

Numerical results are plotted in Figure 8b. Interestingly, doubling the nodes per layer has little benefit, showing only a moderate advantage. By comparison, increasing the network depth allows us to achieve over a 70% reduction in the minimal divergence we are able to reach.

… of NLL (Figure 9). After we achieve the optimal NLL of 1, we observe that NLL becomes quite variable as we begin to overfit. Secondly, overfitting the likelihood becomes harder as we increase the dimensionality, taking roughly twice the number of iterations to pass NLL = 1 in the n = 4 condition as in the n = 2 condition. We conjecture that not overfitting is a matter of early stopping the boosting, in a similar way as was proven for the consistency of boosting algorithms (Bartlett & Traskin, 2006).

4.1.5. COMPARISON WITH TOLSTIKHIN ET AL. (2017)

To compare the performance of our model (here called DISCRIM) with ADAGAN, we replicate their Gaussian mixture toy experiment,⁵ fitting a randomly located eight-component isotropic Gaussian mixture where each component has constant variance. These are sampled using the code provided by Tolstikhin et al. (2017).
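The m × n architecture notation from §4.1.3 can be made concrete with a small sketch (ours; the weights below are random and untrained, and the paper's actual architectures and training details are in §E): m hidden layers of n nodes each, a SELU after every hidden layer, and a final linear layer producing the real-valued classifier output c_t(x).

```python
import math
import random

random.seed(1)

def selu(x, a=1.6732632423543772, l=1.0507009873554805):
    # SELU activation: l*x for x > 0, otherwise l*a*(exp(x) - 1).
    return l * x if x > 0 else l * a * (math.exp(x) - 1.0)

def mlp(m, n, dim_in=2):
    """Random weights for an m x n network: m hidden layers of n nodes,
    mapping R^dim_in to a single real-valued classifier output."""
    sizes = [dim_in] + [n] * m + [1]
    return [[[random.gauss(0, sizes[i] ** -0.5) for _ in range(sizes[i])]
             for _ in range(sizes[i + 1])] for i in range(len(sizes) - 1)]

def forward(net, x):
    h = x
    for layer_idx, layer in enumerate(net):
        z = [sum(w * v for w, v in zip(row, h)) for row in layer]
        # SELU after each hidden layer; the last layer stays linear.
        h = z if layer_idx == len(net) - 1 else [selu(zi) for zi in z]
    return h[0]

net = mlp(2, 10)                 # a "2 x 10" architecture from the experiment
c = forward(net, [0.3, -0.7])    # c_t(x), a real-valued confidence
```

By Proposition 3, exponentiating such an output turns the classifier into a density-ratio factor d_t = exp(c_t) of the boosted model.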
Figure 8: KL divergence (lower is better) over six iterations of boosting for (a) a variety of activation functions (ReLU, SELU, Softplus, Sigmoid, tanh) and (b) the network topology experiment (SELU with 1 × 10, 2 × 10, and 2 × 20 networks).

We compute the coverage metric⁶ of Tolstikhin et al. (2017):

    C_β(P, Q) := P(lev_{>λ} q),  where  Q(lev_{>λ} q) = β,  and  β ∈ [0, 1].

That is, we first find λ to determine a set where most of the mass of Q lies, lev_{>λ} q, then look at how much of the mass of P resides there. We plot the coverage metric in Figure 10b.

Notably, ADAGAN converges tightly, with NLL centred around its mean, while DISCRIM begins to vary wildly. However, the AdaGAN procedure includes a step size that decreases with 1/t — thereby preventing overfitting — whereas DISCRIM uses a constant step size of 1/2, suggesting that a similarly decreasing schedule for α_t may have desirable properties.
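A discrete sketch of the coverage metric (our illustration; the paper evaluates it on continuous densities): order outcomes by decreasing q-density to build the level set lev_{>λ} q holding mass β under Q, then measure the P-mass of that set. A mode-dropping Q concentrates its mass where P has little, so its coverage is low.

```python
def coverage(p, q, beta):
    """C_beta(P, Q) on a finite space: the P-mass of the smallest set of
    highest-q-density outcomes whose Q-mass reaches beta."""
    order = sorted(range(len(q)), key=lambda i: -q[i])  # decreasing q-density
    mass_q = mass_p = 0.0
    for i in order:
        if mass_q >= beta:        # the level set already holds beta of Q's mass
            break
        mass_q += q[i]
        mass_p += p[i]
    return mass_p

# Made-up example: Q puts its mass where P is thin, so coverage is poor.
P = [0.1, 0.2, 0.3, 0.4]
Q = [0.4, 0.3, 0.2, 0.1]
c = coverage(P, Q, 0.7)    # the set {0, 1} covers only 0.1 + 0.2 of P
```

When Q = P the metric returns β itself, which is why values close to β (higher, in the paper's plots) indicate good mode coverage.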
… a variety of kernel density estimators, with bandwidth selected via the Scott/Silverman rule.⁷

Results from this experiment are displayed in Figure 11 (in the supplementary material). On average Q_1 fits the target distribution P better than all but the most efficient kernels, and at Q_2 we begin overfitting, which aligns with the observations made in §4.1.4. We note that this performance is achieved with a model of around 200 parameters, while the kernel estimators each have 2000, i.e. we match KDE's performance with models a tenth of the size. Also, in this experiment α_t is selected to minimise NLL; however, it is not hard to imagine that a different selection criterion for α_t would yield better properties with respect to overfitting.

Figure 10: Comparing the performance of DISCRIM and ADAGAN. (Panel (a): negative log-likelihood, closer to 1 is better; panel (b): the coverage metric, higher is better; both over boosting iterations 0–8.)

4.2. Summary

… observe overfitting in the resulting densities Q_T and instability in the training procedure after the point of overfitting (§4.1.4, §4.1.5, §4.1.6), indicating that a pro- …

Finally, while we have used KDE as a point of comparison for our algorithm, there is no reason why the two techniques could not be combined. Since KDE is a closed-form mixture, one could build some kind of kernel density distribution and use it for Q_0, which one could then refine with a neural network.

5. Conclusion

The prospect of learning a density iteratively with a boosting-like procedure has recently been met with significant attention. However, the success of these approaches hinges on the existence of oracles satisfying very strong assumptions.

By contrast, we have shown that a weak learner in the original sense of Kearns (1988) is sufficient to yield comparable or better convergence bounds than previous approaches when reinterpreted in terms of learning a density ratio. To derive this result we leverage a series of related contributions, including 1) a finer characterisation of f-GAN (Nowozin et al., 2016), and 2) a full characterisation of the distributions we learn, in terms of an exponential family.

Experimentally, our approach shows promising results with respect to the capture of modes, and significantly outperforms AdaGAN during the early boosting iterations using a comparatively very small architecture. Our experiments leave open the challenge of obtaining a black box sampler for domains with moderate to large dimension. We conjecture that the exponential family characterisation should be of significant help in tackling this challenge.
Dudík, M., Phillips, S. J., and Schapire, R. E. Performance guarantees for regularized maximum entropy density estimation. In COLT, 2004.

Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. JCSS, 55:119–139, 1997.

Kearns, M. Thoughts on hypothesis boosting. Unpublished manuscript, 45:105, 1988.

Nock, R. and Nielsen, F. A real generalization of discrete AdaBoost. Artificial Intelligence, 171:25–41, 2007.

Nock, R. and Nielsen, F. On the efficient minimization of classification-calibrated surrogates. In NIPS, pp. 1201–1208, 2008.

Rosset, S. and Segal, E. Boosting density estimation. In NIPS, pp. 641–648, 2002.