Professional Documents
Culture Documents
A Survey of Vectorization Methods in Topological Data Analysis
A Survey of Vectorization Methods in Topological Data Analysis
A Survey of Vectorization Methods in Topological Data Analysis
This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2023.3308391
Abstract—Attempts to incorporate topological information in supervised learning tasks have resulted in the creation of several
techniques for vectorizing persistent homology barcodes. In this paper, we study thirteen such methods. Besides describing an
organizational framework for these methods, we comprehensively benchmark them against three well-known classification tasks.
Surprisingly, we discover that the best-performing method is a simple vectorization, which consists only of a few elementary summary
statistics. Finally, we provide a convenient web application which has been designed to facilitate exploration and experimentation with
various vectorization methods.
✦
1 I NTRODUCTION
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2023.3308391
frequency and increasing complexity — the reader will texture database [6], the SHREC14 shape retrieval dataset
encounter thirteen of them here. The bad news, on the [7], and the Fashion-MNIST database [8]. Surprisingly, the
other hand, comes in the form of three serious challenges best-performing vectorization in all three cases is a rather
which must be confronted by those who build or use such naı̈ve one obtained by collecting basic statistical quantities
vectorizations: associated to (the multiset of) intervals in a given barcode.
Our third contribution is a companion web application
1) Given the large number of options, even established
which computes and visualizes all thirteen vectorization
practitioners are not aware of all the vectorization
techniques which have been investigated in this paper.
techniques; similarly, knowledge of which vector-
In addition to running online1 , this web app can also be
izations are suitable for which types of data is diffi-
downloaded2 and run locally on more challenging datasets.
cult – if not impossible – to glean from the published
literature.
2) There is a natural metric between barcodes called Not In This Paper
the bottleneck distance; when it is endowed with Vectorization methods form but a small part of the ever
this metric, the space of barcodes becomes infinite- expanding interface between topological data analysis and
dimensional and highly nonlinear. As such, it does machine learning. As such, there are several related tech-
not admit any faithful embeddings into finite- niques which are not benchmarked here. The precise inclu-
dimensional vector spaces. sion criteria for our study in this paper are as follows.
3) Even the stable vectorizations, which preserve
distances by mapping barcodes into infinite- 1) We restrict our attention to those methods which
dimensional vector spaces, may suffer from a lack produce genuine vectors from barcodes. In particu-
of discriminative power in practice: by design, they lar, kernel methods [9], [10] are beyond the scope of
are poor at distinguishing between datasets whose this paper.
coarse structures are similar and whose differences 2) We only consider those vectorizations that are either
reside in finer scales. straightforward for us to implement, or have an
easily accessible and trusted implementation. For
instance, path signature based vectorizations [11],
In This Paper
[12] are excluded.
Here we seek to comprehensively describe, catalogue and 3) We do not compare machine learning architectures
benchmark vectorization methods for persistent homology designed for the explicit purpose of inferring (per-
barcodes. The first contribution of this paper is the following sistent) homology [13], [14], [15].
taxonomy of the known methods, which we hope will serve 4) We do not touch upon various attempts to design or
as a convenient organizational framework for beginners and study neural networks using tools from topological
experts alike — data analysis [16], [17].
1) Statistical vectorizations: these summaries consist of 5) Finally, even among methods which satisfy the first
basic statistical quantities; four criteria, we have discarded techniques which
2) Algebraic vectorizations: these are generated from regularly obtained a classification accuracy below
polynomials; fifty percent.
3) Curve vectorizations: these come from maps R → H ,
where H is a vector space; Similar Efforts
4) Functional vectorizations: these are maps of the form The authors of [18] have summarised – but not compared
X → H for X ̸= R; – several vectorization and kernel methods for barcodes.
5) Ensemble vectorizations: these are generated from Another summary (sans comparison) may be found in [19],
collections of training barcodes. with emphasis on metric aspects of the chosen vectoriza-
There are unavoidable overlaps between these five cate- tions. The work of [20] describes a common overarching
gories. When such an overlap occurs, we have placed the framework for what we have called curve vectorizations
given vectorization technique in the earliest relevant cate- here. More recently, [21] and [22] have described and com-
gory among those in the list above; thus, an algebraic vec- pared five and four vectorization methods respectively.
torization given by polynomial functions of basic statistical
quantities will be placed in category (1) rather than category Outline
(2). The reader might claim, quite reasonably, that category
Notation and preliminaries involving barcodes are estab-
(3) should be subsumed into category (4). However, the
lished in Section 2. In Sections 3 and 4 we introduce the
sheer number of curve-based vectorizations compelled us
thirteen vectorizations (suitably organised into our taxon-
to set them apart.
omy) and the three datasets. Section 5 contains the results
The second contribution of this paper is a comprehensive
of our experiments whose finer details have been relegated
benchmarking of thirteen vectorization techniques across
to Appendices A and B. We provide a description of the
these five categories on three well-known image classifi-
web app in Section 6 and some brief concluding remarks in
cation datasets. These datasets were selected to simultane-
Section 7.
ously (a) provide an increasing level of difficulty for topo-
logical methods, and (b) to be instantly recognizable for the 1. https://persistent-homology.streamlit.app
broader machine learning community. These are: the Outex 2. https://github.com/dashtiali/vectorisation-app
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2023.3308391
At its core, persistent homology studies sequences of finite- Persistence modules arise naturally from a wide class of
dimensional vector spaces V = {Vi | 0 ≤ i ≤ n} and linear datasets. The first step in topological data analysis involves
maps a = {ai : Vi−1 → Vi | 1 ≤ i ≤ n}: imposing the structure of a filtered cell complex – either
simplicial [28, Chapter 8] or cubical [29] – from the data [1],
V0
a1
/ V1 a2
/ ··· an
/ Vn . [2], [3]. The two most prominent examples of filtered cell
complex structures arising from data are as follows.
Such sequences (V, a) are called persistence modules. Among
1) Given a finite point cloud X ⊂ Rn , one con-
the simplest examples are interval modules — for each pair
structs a family of increasing simplicial com-
of integers p ≤ q with [p, q] ⊂ [0, n], the corresponding
plexes {Sϵ | ϵ ≥ 0} defined as follows. A collection
interval module (I [p,q] , c[p,q] ) has
{x0 , . . . , xk } forms a k -simplex in Sϵ if and only if
the (Euclidean) distance between xi and xj is no
(
[p,q] 1 if p ≤ i ≤ q
dim Ii = larger than ϵ for all i, j in {0, . . . , k}. Since there are
0 otherwise; only finitely many ϵ values at which new simplices
[p,q]
are introduced, the filtration is indexed by a subset
similarly, the map ci is the identity whenever p + 1 ≤ i ≤ of the natural numbers. The collection Sϵ is called
q and zero otherwise. the Vietoris-Rips filtration of X . These filtrations
can be similarly defined for any metric space.
2) Consider a grayscale image I , given in terms
2.1 Structure and Stability of m × n pixels with intensity values in the
Every persistence module decomposes into a direct sum set {0, 1, . . . , 255}. This naturally forms a two-
of interval modules. In particular, we have the following dimensional cubical complex, which can be en-
structure theorem [23], [24]. dowed with the upper-star filtration by intensity
values. In particular, each elementary cube of di-
Theorem 2.1. For every persistence module (V, a), there mension < 2 appears at the smallest intensity en-
exists a unique set Bar(V, a) of subintervals of [0, n] countered among the 2-dimensional cubes in its
along with a unique function Bar(V, a) → Z>0 denoted immediate neighbourhood. Higher-dimensional cu-
[p, q] 7→ µp,q for which we have an isomorphism bical filtrations may be similarly generated from
M µp,q higher-dimensional pixel grids.
(V, a) ≃ I [p,q] , c[p,q] .
[p,q]∈Bar(V,a) There are several other filtration types which may be
used to model point and image datasets. Once the data has
Thus, the algebraic object (V, a) may be fully recovered (up been suitably modeled by a filtered simplicial or cubical
to isomorphism) from purely combinatorial data consisting complex, persistence modules are obtained by computing
of the set of intervals Bar(V, a) and the multiplicity function homology groups with coeffiecients in a field. In general,
µ. Alternately, one may view Bar(V, a) itself as a multiset these homology groups are not invariant to the choice of
with µp,q copies of each interval [p, q]. This multiset is called filtration. The reader who is interested in the definition
the barcode of (V, a). It is often useful in applications to let and computation of homology is urged to either consult
the vector spaces Vi be indexed by real numbers rather than standard algebraic topology references such as [30, Ch 2]
integers. With this modification in place, Bar(V ) becomes a or see the more recent [3], [4], [31], [32].
collection of real intervals [p, q] ⊂ R. A substantial difficulty in topological data analysis is
The most important property of persistence modules, that although persistent homology barcodes can be readily
beyond the structure theorem, is their stability [24]. There associated with a large class of datasets, the space of all
is a natural metric on the set of persistence modules called such barcodes is notoriously unpleasant to encounter from
the interleaving distance and a metric on the set of barcodes a statistical perspective. Fortunately, barcodes are combina-
called the bottleneck distance. torial objects which can be mapped to Hilbert spaces in a
plethora of reasonable ways. Indeed, across the last decade,
Theorem 2.2. The assignment (V, a) 7→ Bar(V, a) is an such vectorization methods have been proposed by various
isometry from the space of persistence modules (with authors, and our main purpose in this work is to benchmark
interleaving distance) to the space of barcodes (with many of these methods against standard classification tasks.
bottleneck distance).
The advantage of this theorem is that barcodes remain 3 V ECTORIZATION M ETHODS FOR BARCODES
robust to (certain types of) perturbations of the original
dataset, thus conferring upon the topological data anal- Throughout this section, we assume knowledge of the bar-
ysis pipeline a degree of noise-tolerance. The significant code B := Bar(V, a) of an R-indexed persistence module
difficulty from a statistical perspective, however, is that along with its multiplicity function µ : B → Z>0 . We note
the metric space of persistence barcodes with bottleneck that for each interval [p, q] in B the numbers p and q are
distance is nonlinear — even averages can not be defined called its birth and death respectively, and the length q − p is
for arbitrary collections of barcodes [25], [26], [27]. called its lifespan.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2023.3308391
where Lµ is the weighted sum Definition 3.4. A tropical coordinate function for µ : B →
X Z>0 is a function F of variables {x1 , y1 , . . . , xn , yn }
Lµ := µp,q · (q − p). (1) which is both tropical and symmetric as described below.
[p,q]∈B
1) Tropical: there is an expression for F which uses only
The entropy from Definition 3.1(3) was introduced in the operations max, min, + and − on the variables
[34], [35]. Our second statistical vectorization is from [36], {xi } and {yi }.
where entropy has been upgraded to a real-valued piece- 2) Symmetric: any permutation of {1, . . . , n}, when ap-
wise constant function rather than a single number. plied to both {xi } and {yi }, leaves F unchanged.
Definition 3.2. The entropy summary function of µ : B →
Z>0 is the map Sµ : R → R given by Let λi be the lifespan qi − pi of the i-th interval in B . To
q−p q−p generate feature maps from the tropical coordinate functions
1p≤t<q · µp,q ·
X
Sµ (t) = − · log . described above, one simply evaluates them at xi = λi and
Lµ Lµ
[p,q]∈B yi equal to either max(rλi , pi ) or min(rλi , pi ) for a positive
Here 1• is the indicator function — it equals 1 when integer parameter r. Examples of such tropical coordinate
the conditional • is true and it equals 0 otherwise. The features include:
number Lµ appearing in the expression above is defined
in (1). F1 = max λi
i
The entropy summary function has also been called the life F2 = max(λi + λj )
i<j
entropy curve, e.g., in [20].
F3 = max (λi + λj + λk )
i<j<k
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2023.3308391
B ). Consider the three continuous maps R, S, T : R2 → C small bottleneck distance — we can always add lots of
defined as follows: very small intervals to a given barcode without changing
its bottleneck distance by a significant amount. One way to
R(x, y) = x + iy
( y−x rectify the bottleneck instability of Betti and lifespan curves
√
α 2
· (x + iy) if (x, y) ̸= (0, 0) is to test the containment not only of t in each interval
S(x, y) = [p, q] ∈ B , but rather of the largest subinterval of the form
0 otherwise
y−x h i [t − s, t + s]. This modification leads to one of the oldest and
T (x, y) = · (cos α − sin α) + i(cos α + sin α) , best-known stable curve vectorizations [43], [44], as defined
2
p below.
where α is the norm x2 + y 2 . Definition 3.8. The persistence landscape of the bar-
Definition 3.5. Given a barcode µ : B → Z>0 , let X : code µ : B → Z>0 is a sequence of curves
R2 → C be any one of the three functions R, S, T defined {Λµi : R → [−∞, ∞] | i ∈ Z>0 } defined as follows. The
above. The complex polynomial vectorization of µ of µ
image Λi (t) of each t in R equals
type X is the sequence of coefficients of the complex
polynomial in one variable z given by
1[t−s,t+s]⊂[p,q] · µp,q ≥ i .
X
sup s ≥ 0
µ
Y
CX (z) := [z − X(p, q)] p,q .
[p,q]∈B
[p,q]∈B
By convention, the supremum over the empty set is zero.
In practice, it is customary to either take only the first few Moreover, since our barcode B is assumed to be finite, the
highest degree coefficients of CX (z) or to multiply it by µ
landscape functions Λi become identically zero for suffi-
a suitable power of z . This is done to guarantee that the ciently large i. An alternate approach to defining persistence
feature vectors assigned to a collection of barcodes all have landscapes comes from the function ∆ : B × R → R, given
the same dimension. by
Other Algebraic Vectorizations: In the subsequent sec-
tion, we describe how to extract vectorizations by using ∆([p, q], t) := max (min(t − p, q − t), 0) . (2)
barcode data to build curves which take values in a vec-
tor space. Once such a curve has been extracted, one can For each i ∈ Z>0 , the curve Λµi
from Definition 3.8 equals
compute its path signature via iterated integrals [42]. The the i-th largest number in the multiset that contains µp,q
path signature resides in the tensor algebra of the target copies of ∆([p, q], t) for each interval [p, q] in B . The fourth
vector space; elements of the tensor algebra are equivalent and final curve vectorization that we consider here was
to coefficients of non-commuting polynomials, and hence introduced in [45], and it is also defined in terms of the
constitute algebraic vectorizations of barcodes — see [11], functions ∆ from (2).
[12] for examples of this approach. Definition 3.9. Let w : B → R>0 be any function, which we
will denote [p, q] 7→ wp,q . The w-weighted persistence
3.3 Curve Vectorizations silhouette of µ : B → Z>0 is the map ϕw µ : R → R
defined as the weighted average
There are several interesting ways of turning barcodes into P
one or more curves, which for our purposes here mean wp,q · µp,q · ∆([p, q], t)
ϕwµ (t) := .
(piecewise) continuous maps from R to a convenient vector
P
wp,q · µp,q
space. Feature vectors can then be constructed by sampling
the given curve at finite subsets of R. Perhaps the simplest Here both sums on the right are indexed over all [p, q] ∈
and most widely used curve-based vectorization is the B , and ∆ is defined in (2).
following. Reasonable choices of weight functions are provided by
Definition 3.6. The Betti curve of µ : B → Z>0 is the curve setting wp,q = (q − p)α for a real-valued scale parameter
βµ : R → R given by α ≥ 0. For small α, the shorter intervals dominate the value
of the silhouette curve, whereas for large α it is the longer
1p≤t<q · µp,q .
X
βµ (t) = intervals which play a more substantial role — see [45, Sec
[p,q]∈B
4] for details.
Here 1• is the indicator function as described in Definition Other Curve Vectorizations: See the envelope embedding
3.2, so this function counts the number of intervals (with from [11], the accumulated persistence function in [46], and the
multiplicity) in B which contain t. Very similar in spirit (and persistent Betti function of [47]. In [48], the persistent Betti
formula) to the Betti curve is the following vectorization function is decomposed along the Haar basis to produce a
from [20]. vectorization. More recently, [20] provides a general frame-
work for constructing several different curve vectorizations.
Definition 3.7. The lifespan curve of µ : B → Z>0 is the
map Lµ : R → R given by
3.4 Functional Vectorizations
1p≤t<q · µp,q · (q − p).
X
Lµ (t) =
[p,q]∈B
Here we catalogue those barcode vectorizations which are
given by maps from spaces other than R. The first, and
It is not difficult to create very different-looking Betti and perhaps most prominent member of this category is the fol-
lifespan curves from two barcodes which have arbitrarily lowing vectorization from [49]. Its definition below makes
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2023.3308391
use of two auxiliary components besides the given bar- where Vµ (fi ) is as defined in (3).
code µ : B → Z>0 . The first is a continuous, piecewise-
Two convenient choices of T , called tent functions and in-
differentiable function f : R2 → R≥0 satisfying f (x, 0) = 0
terpolating polynomials, have been highlighted in [38]. Tent
for all x ∈ R. And the second is a collection of smooth
functions are indexed by points (u, v) ∈ R2 and require an
probability distributions Ψ := {ψp,q | [p, q] ∈ B} where ψp,q
additional parameter δ > 0; they have the form
has mean (p, q − p).
Definition 3.10. The persistence surface of µ : B → Z>0 with δ 1
gu,v (x, y) = max 1 − · max(|x − u|, |y − v|), 0 (4)
respect to f and Ψ (as described above) is the function δ
R2 → R given by By construction, each such function is supported on the
square of side length 2δ around the point (u, v) in the birth-
ρµf,Ψ (x, y) =
X
µp,q · f (p, q − p) · ψp,q (x, y).
lifespan plane. The normal pipeline for selecting finitely
[p,q]∈B
many template functions requires covering a sufficiently
µ
The persistence image If,Φ of µ with respect to (f, Φ) large bounded subset of W with a square grid and then
assigns a real number to every subset Z ⊂ R2 ; this selecting the appropriate tent functions supported on grid
number is given by integrating the persistence surface cells. We direct interested readers to [38, Sections 6 and 7]
over Z : for details on interpolating polynomials and for suggestions
ZZ on how one might select suitable n and f ∈ Subn (T ) for a
Iµf,Ψ (Z) = ρµf,Ψ (x, y) dx dy. given classification task.
Z
Other Functional Vectorizations: See the generalised per-
In order to obtain a vector from the persistence image, sistence landscape in [50] and the crocker stacks of [51].
one lets Z range over grid pixels in a rectangular subset
of R2 and renormalizes the resulting array of numbers, thus 3.5 Ensemble Vectorizations
producing a grayscale image. Standard choices of f and Ψ =
{ψp,q } are: Our last category contains two methods which require ac-
cess to a sufficiently large collection of training barcodes
0
t≤0 µi : Bi → Z>0 in order to generate a vectorization. The first
f (x, y) = t/λmax 0 < t < λmax of these methods, introduced in [52], [53], is a modification
of the template system vectorization from Definition 3.11.
1 t > λmax
We recall that W ⊂ R2 is defined as {(x, y) | 0 ≤ x < y} and
(x − p)2 + (y − (q − p))2
1 that every barcode B is identified with a subset P (B) ⊂ W
ψp,q (x, y) = · exp − .
2πσ 2 2σ 2 via the map that sends intervals [p, q] of positive length to
Here λmax is the largest lifespan max[p,q]∈B (q − p) encoun- points (p, q).
tered among the intervals in B , and σ is a user-defined Definition 3.12. The adaptive template system induced by
parameter which forms the common standard deviation of a collection of barcodes {µi : Bi → Z>0 } is obtained via
every ψp,q in sight. S following two steps. Letting P ⊂ W be the union
the
The second and final functional vectorization which i P (Bi ), one
we will
examine was introduced in the paper [38]. Set 1) identifies finitely many ellipses Ej ⊂ W which
W := (x, y) ∈ R2 | y > 0 , and note that points (x, y) ∈ W tightly contain P , and then
parameterize intervals [x, x + y] ⊂ R of strictly positive 2) constructs suitable functions gj supported on Ej , as
length that could possibly lie in a given barcode. Let Cc (W) described in (5) below.
be the set of all continuous functions f : W → R with
compact support3 . The given barcode µ : B → Z>0 induces The desired vectorization of a new barcode µ : B → Z>0
a function Vµ : Cc (W) → R via is now obtained by using these gj , rather than tent functions,
X as template functions in Definition 3.11. Three different
Vµ (f ) = µp,q · f (p, q − p). (3)
methods for finding the Ej can be found in [52, Sec 3]. Let
[p,q]∈B
v ∗ denote the transpose of a given vector v in R2 . Now
A subset T of Cc (W) is called a template system if for any each ellipse E with centre x = (x1 , x2 )∗ corresponds to a
distinct pair µ1 : B1 → Z>0 and µ2 : B2 → Z>0 of barcodes, symmetric 2 × 2 matrix A satisfying
there exists at least one f ∈ T so that Vµ1 (f ) ̸= Vµ2 (f ).
E = z ∈ R2 | (z − x)∗ A(z − x) = 1 .
Definition 3.11. Fix an integer n > 0 and let Subn (T )
be the collection of all size n subsets of a template Setting h(z) := (z − x)∗ A(z − x), the adaptive template
system T as described above. The template function function g supported on E is
vectorization of µ : B → Z>0 with respect to T is (
1 − h(z) h(z) < 1
the map τ : Subn (T ) → Rn defined as follows. Given g(z) = (5)
f = {f1 , . . . , fn } in Subn (T ), the associated vector in Rn 0 otherwise.
is The second instance of an ensemble vectorization frame-
τ µ (f ) := (Vµ (f1 ), . . . , Vµ (fn )) , work which we benchmark in this paper is from [54]. Let
µi : Bi → Z>0 be a collection of training barcodes as before,
3. In other words, Cc (W) contains those continuous real-valued func-
tions on W which evaluate to 0 outside the intersection of a sufficiently and fix a dimension parameter b ∈ Z>0 . Much like the
large rectangle with W in R2 . adaptive template systems of Definition 3.12, the automatic
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2023.3308391
Ωµi :=
X
µp,q · Ωi (p, q).
[p,q]∈B 4.2 SHREC14
Other Ensemble Vectorizations: The persistence code- The Shape Retrieval of a non-rigid 3D Human Models
books approach from [55] proposes three different types of dataset, usually abbreviated SHREC14 [7], is designed to
barcode vectorizations; these are based on bag-of-word em- test shape classification and retrieval algorithms. It contains
beddings, VLAD (vector of locally aggregated descriptors), real and synthetic human shapes and poses stored as 3D
and Fisher Vectors respectively. meshes (which are already simplicial complexes). We use the
synthetic part of the dataset; this constitutes a classification
task with 15 classes (5 men, 5 women and 5 children), each
4 DATASETS one with 20 different poses — see the upper-right corner of
The vectorization methods described in the preceding Fig. 2.
section have been benchmarked against three standard We apply the Heat Kernel Signature (HKS) to obtain
datasets; these are described below and arranged in in- filtrations [9], [58]. For a fixed real parameter t > 0, this
creasing order of difficulty for topological methods. All filtration assigns to each mesh point x the value
three of them have been used in the past for comparing
vectorizations (or kernels) for persistence barcodes [9], [10], ∞
X
[11], [38], [52], [56]. HKSt (x) = e−λi t · ϕi (x)2 (6)
i=0
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2023.3308391
4.3 FMNIST all techniques perform rather well, with Persistence Statis-
The Fashion-MNIST database contains 28 × 28 grayscale tics and Algebraic Functions sharing the best performance
images (7, 000 images per class, with 10 classes) — see the with 99.2% accuracy each, followed closely by Persistent
left side of Fig. 2 for some sample images. We split this Silhouettes with 98.3% each.
dataset into 60, 000 training and 10, 000 testing images. Results from the full experiment with 68 classes are
The filtration used for generating barcodes is as follows: contained in Table 2; as one might expect, the performance
we performed padding, median filter, and shallow thresh- of every single vectorization degrades in the passage from
olding before computing canny edges [59]. Then each pixel is Outex10 to Outex68. Here Persistence Statistics is the clear
given a filtration value equalling its distance from the edge- winner by a significant margin, earning 93.4% accuracy.
pixels. Finally, all other cells inherit filtration values from Tropical Coordinates ranks second with 88.7%. Setting aside
the top pixels via the lower star filtration rule. the outstanding performance of Persistence Statistics, it
appears clear from these results that the algebraic vectoriza-
tions perform far better on Outex68 than the vectorizations
5 R ESULTS from the other categories.
We note that the authors of [20] have also used Outex to
Here we report the classification accuracy of the thirteen compare the performance of various curve vectorizations,
vectorization methods from Section 3 on each of the three with Persistence Statistics being used as a baseline. They
datasets from Section 4. Implementation details and pa- also obtained their best results with Persistence Statistics.
rameter choices are provided in Appendix A. The source
code is available at the following GitHub repository: https:
//github.com/Cimagroup/vectorization-maps. 5.2 SHREC14
We used 10 different t-values t1 < t2 < · · · < t10 , as in
[9], [38], [52], for generating filtrations via the heat kernel
5.1 Outex from (6). At t10 we found several sparse or empty barcodes,
Table 1 displays the classification accuracy for the smaller which led us to discard that classification problem. Table 3
(and easier) experiment on 10 classes. As one might expect, collects the best performance for each method across the
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2023.3308391
first 9 values of t; it also contains values of the optimal ordinary Template Functions despite having recourse to
parameters (see Appendix A) and the optimal values of t. 60, 000 training barcodes. We do not have a clear explana-
Persistence Statistics yielded the best classification ac- tion for this phenomenon, particularly in light of a fairly
curacy of 94.7%, followed closely by Template Functions competitive performance from ATOL (which was also ex-
at 94.4%. One remarkable feature of these results is that posed to the same training data).
the dataset does not appear to favour any one category of
vectorizations over the other — it is possible to achieve over
88% accuracy by using a suitable statistical, algebraic, curve, 6 W EB A PPLICATION
functional or ensemble vectorization. In fact, only the curve- In order to illustrate and visualize the vectorization methods
based vectorizations failed to achieve over 90% accuracy on described here, we have built an interactive web application
this dataset. The variation of classification accuracy with the Brava that runs on any modern browser; it is available at
heat kernel parameter t is discussed in Appendix B.
https://persistent-homology.streamlit.app/
5.3 FMNIST The app has been built in Python using the Streamlit library
along with several existing Python libraries. The sidebar
The results of our experiments on FMNIST are recorded in
contains options for selecting different types of input data
Table 4. We note that these experiments only used infor-
and displays several options for data visualization. One
mation contained in the 0-dimensional barcodes and that
sample image/point-cloud from each of the three datasets
the SVM classifier was not used. The classification accuracy
used in this paper has been pre-loaded, but the user is
of all the methods is much lower than the corresponding
free to upload their own data. Specifications, formatting
figures for the two preceding datasets. Once more, the
guidelines, and downloading instructions are available in
Persistence Statistics vectorization takes the top spot with
our GitHub repository:
74.9% and Template Functions are slightly behind at 74.7%
https://github.com/dashtiali/vectorisation-app
Vectorization Method Accuracy Parameters
Persistence Statistics 0.749
Entropy Summary 0.696 30
Algebraic Functions 0.710
Tropical Coordinates 0.696 10
Complex Polynomial 0.661 10, R
Betti Curve 0.618 50
Lifespan Curve 0.692 30
Persistence Landscape 0.694 30, 5
Persistence Silhouette 0.670 30, 0
Persistence Image 0.698 1, 12
Template Functions 0.747 10, 2
Adaptive Template System 0.602 GMM, 5
ATOL 0.730 16
TABLE 4
FMNIST results. All the scores have been achieved for Random
Forest classifier with 100 trees. Fig. 3. A screenshot of the web app
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2023.3308391
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2023.3308391
A PPENDIX A
A.8 Persistence Landscapes
I MPLEMENTATION AND PARAMETER D ETAILS
We have used the GUDHI implementation of persis-
We have made use of several existing software packages,
tence landscapes (see Definition 3.8). The are two pa-
such as GUDHI [60], Teaspoon5 or Scikit-learn [61], as well
rameters to consider: the resolution (to identify the grid
as our own implementations in some cases. Salient informa-
points where each landscape is sampled) and the to-
tion regarding each method has been provided in the list
tal number of landscapes used. The resolution was opti-
below. Full details can be found in the GitHub repository
mized over {50, 100, 200} for Outex and SHREC14, and
accompanying this paper6 .
over {15, 30, 50} for FMNIST; the number of landscapes
5. https://lizliz.github.io/teaspoon/index.html ranged over {2, 5, 10, 20} for Outex and SHREC14 and over
6. https://github.com/Cimagroup/vectorization-maps {1, 2, 3, 5} for FMNIST.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2023.3308391
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2023.3308391
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2023.3308391
[41] B. Di Fabio and M. Ferri, “Comparing persistence diagrams J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot,
through complex vectors,” in Image Analysis and Processing — and E. Duchesnay, “Scikit-learn: Machine learning in Python,”
ICIAP 2015, V. Murino and E. Puppo, Eds. Cham: Springer Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
International Publishing, 2015, pp. 294–305. [62] C. Cortes and V. Vapnik, “Support-vector networks,” Machine
[42] I. Chevyrev and A. Kormilitzin, “A primer on the signature learning, vol. 20, no. 3, pp. 273–297, 1995.
method in machine learning,” arXiv:1603.03788 [stat.ML], 2016. [63] T. K. Ho, “Random decision forests,” in Proceedings of 3rd interna-
[43] P. Bubenik, “Statistical topological data analysis using persistence tional conference on document analysis and recognition, vol. 1. IEEE,
landscapes,” J. Mach. Learn. Res., vol. 16, no. 1, pp. 77–102, 1995, pp. 278–282.
Jan. 2015. [Online]. Available: http://dl.acm.org/citation.cfm?id=
2789272.2789275 Dashti Ali is a machine learning and software
[44] P. Bubenik and P. Dłotko, “A persistence landscapes toolbox for developer. He obtained his MSc in scientific
topological statistics,” Journal of Symbolic Computation, vol. 78, pp. computation from the university of Nottingham in
91–114, 2017. 2013. He does research in machine learning and
[45] F. Chazal, B. T. Fasy, F. Lecci, A. Rinaldo, and L. Wasserman, applications of applied and computational topol-
“Stochastic convergence of persistence landscapes and ogy with particular interest in digital image anal-
silhouettes,” in Proceedings of the Thirtieth Annual Symposium on ysis. As a software developer based in Canada,
Computational Geometry, ser. SOCG’14. New York, NY, USA: he develops cross platform applications. In addi-
Association for Computing Machinery, 2014, p. 474–483. [Online]. tion, he is a research fellow at Koya University.
Available: https://doi.org/10.1145/2582112.2582128
[46] C. A. Biscio and J. Møller, “The accumulated persistence function,
a new useful functional summary statistic for topological data
analysis, with a view to brain artery trees and spatial point pro- Aras Asaad is a mathematician and machine
cess applications,” Journal of Computational and Graphical Statistics, learning researcher.His research interests are
vol. 28, no. 3, pp. 671–681, 2019. mainly in applications of applied topology and
[47] K. Xia, “Persistent similarity for biomolecular structure compari- geometry, with particular interest in computer
son,” Communications in Information and Systems, vol. 18, no. 4, pp. vision. He obtained his PhD in mathematics and
269–298, 2018. computing from The University of Buckingham in
[48] Z. Dong, C. Hu, C. Zhou, and H. Lin, “Vectorization of persis- 2020. He is an honorary junior research fellow in
tence barcode with applications in pattern classification of porous school of computing at The University of Buck-
structures,” Computers & Graphics, vol. 90, pp. 182–192, 2020. ingham and research fellow at Koya University.
[49] H. Adams, T. Emerson, M. Kirby, R. Neville, C. Peterson, P. Ship-
man, S. Chepushtanova, E. Hanson, F. Motta, and L. Ziegelmeier,
“Persistence images: A stable vector representation of persistent Maria-Jose Jimenez does her research on ap-
homology,” J. Mach. Learn. Res., vol. 18, no. 1, p. 218–252, 2017. plications of computational topology, with special
[50] E. Berry, Y.-C. Chen, J. Cisewski-Kehe, and B. T. Fasy, “Functional interest on biological data. She obtained her
summaries of persistence diagrams,” Journal of Applied and Compu- PhD in mathematics in 2003 from Universidad
tational Topology, vol. 4, no. 2, pp. 211–262, 2020. de Sevilla, where she is an Associate professor.
[51] L. Xian, H. Adams, C. M. Topaz, and L. Ziegelmeier, “Capturing She is currently doing a one-year research stay
dynamics of time-varying data via topology,” Foundations of Data in the Mathematical Institute, University of Ox-
Science, vol. 4, no. 1, 2022. ford.
[52] L. Polanco and J. A. Perea, “Adaptive template systems: Data-
driven feature selection for learning with persistence diagrams,”
in 2019 18th IEEE International Conference On Machine Learning And
Applications (ICMLA). IEEE, 2019, pp. 1115–1121.
[53] S. Tymochko, E. Munch, and F. A. Khasawneh, “Adaptive parti- Vidit Nanda is an applied and computational al-
tioning for template functions on persistence diagrams,” in 2019 gebraic topologist. He obtained his PhD in math-
18th IEEE International Conference On Machine Learning And Appli- ematics from Rutgers University in 2012, and
cations (ICMLA), 2019, pp. 1227–1234. has held postdoctoral positions at the University
[54] M. Royer, F. Chazal, C. Levrard, Y. Umeda, and Y. Ike, of Pennsylvania, the Alan Turing Institute and the
“Atol: Measure vectorization for automatic topologically-oriented Institute for Advanced Study. In 2018 he joined
learning,” in Proceedings of The 24th International Conference on the Mathematical Institute at the University of
Artificial Intelligence and Statistics, ser. Proceedings of Machine Oxford as an Associate Professor.
Learning Research, A. Banerjee and K. Fukumizu, Eds.,
vol. 130. PMLR, 4 2021, pp. 1000–1008. [Online]. Available:
https://proceedings.mlr.press/v130/royer21a.html
[55] B. Zieliński, M. Lipiński, M. Juda, M. Zeppelzauer, and P. Dłotko, Eduardo Paluzo-Hidalgo has research inter-
“Persistence codebooks for topological data analysis,” Artificial ests primarily in machine learning and compu-
Intelligence Review, vol. 54, no. 3, pp. 1969–2009, 2021. tational topology. He obtained his PhD in mathe-
[56] W. Guo, K. Manohar, S. L. Brunton, and A. G. Banerjee, “Sparse- matics in 2021 from Universidad de Sevilla. He is
tda: Sparse realization of topological data analysis for multi-way currently a researcher in the Department of Ap-
classification,” IEEE Transactions on Knowledge and Data Engineer- plied Math I at Universidad de Sevilla, Spain, and
ing, vol. 30, no. 7, pp. 1403–1408, 2018. a member of the Combinatorial Image Analysis
[57] P. Dlotko, “Cubical complex,” in GUDHI User and Reference Man- group.
ual, 3.6.0 ed. GUDHI Editorial Board, 2022. [Online]. Available:
https://gudhi.inria.fr/doc/3.6.0/group cubical complex.html
[58] J. Sun, M. Ovsjanikov, and L. Guibas, “A concise and provably
informative multi-scale signature based on heat diffusion,” in
Manuel Soriano-Trigueros specializes in topo-
Computer graphics forum, vol. 28. Wiley Online Library, 2009, pp.
logical persistence and topological data anal-
1383–1392.
ysis. He obtained his PhD in mathematics at
[59] J. Canny, “A computational approach to edge detection,” IEEE
Universidad de Sevilla in 2023. In 2017, he ob-
Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-
tained an MSc in Mathematics from the same
8, no. 6, pp. 679–698, 1986.
university. He also studied one year as an ex-
[60] The GUDHI Project, GUDHI User and Reference Manual, change student at Warwick University. Currently,
3.6.0 ed. GUDHI Editorial Board, 2022. [Online]. Available: he is a postdoctoral researcher at the Institute of
https://gudhi.inria.fr/doc/3.6.0/ Science and Technology Austria.
[61] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion,
O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg,
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/