
Review of “TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second”[2]

Agajan Torayev
July 2023

1 Introduction
The paper builds on top of previous work[3], which proposed Prior-Data Fitted
Networks (PFNs). In a nutshell, the proposed approach aims to approximate the
Posterior Predictive Distribution (PPD) for supervised learning within a
Bayesian framework:
p(y | x, D) ∝ ∫_Φ p(y | x, ϕ) p(D | ϕ) p(ϕ) dϕ        (1)

where y is the output, x is the test sample for which the output y is unknown,
D = {(x_1, y_1), . . . , (x_n, y_n)} is the training set, Φ is the space of hypotheses,
and ϕ ∈ Φ is one particular hypothesis. To approximate the PPD, the authors train
a Transformer[4] (with slight modifications to the attention mechanism). In this
paper, Φ is the space of Structural Causal Models (SCMs) and Bayesian Neural
Networks (BNNs).
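
To make the prior-fitting idea concrete, below is a minimal sketch of the training loop implied by equation (1), not the authors' implementation: synthetic datasets are repeatedly drawn from a prior over hypotheses (here a toy random-MLP prior standing in for the paper's SCM/BNN priors), and a Transformer is trained with cross-entropy to predict held-out labels given the remaining points as context. All names (sample_prior_dataset, TinyPFN) are illustrative assumptions.

    # Minimal sketch of prior-data fitting (PFN-style training). A toy random-MLP
    # prior stands in for the paper's SCM/BNN priors; names are illustrative.
    import torch
    import torch.nn as nn

    N_FEATURES, N_CLASSES, N_SAMPLES, D_MODEL = 5, 3, 64, 128

    def sample_prior_dataset():
        """Draw one synthetic classification dataset from a toy prior:
        a randomly initialised MLP maps inputs to class logits."""
        x = torch.randn(N_SAMPLES, N_FEATURES)
        prior_net = nn.Sequential(nn.Linear(N_FEATURES, 32), nn.Tanh(),
                                  nn.Linear(32, N_CLASSES))
        with torch.no_grad():
            y = prior_net(x).argmax(dim=-1)
        return x, y

    class TinyPFN(nn.Module):
        """Transformer that attends over training and query points jointly.
        (The actual TabPFN masks attention so queries only attend to the
        training points; full attention is used here for brevity.)"""
        def __init__(self):
            super().__init__()
            self.embed_x = nn.Linear(N_FEATURES, D_MODEL)
            self.embed_y = nn.Embedding(N_CLASSES + 1, D_MODEL)  # +1 = "unknown" label
            layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.head = nn.Linear(D_MODEL, N_CLASSES)

        def forward(self, x, y_context, n_context):
            # Query positions receive the special "unknown" label token.
            y_tokens = y_context.clone()
            y_tokens[n_context:] = N_CLASSES
            h = self.embed_x(x) + self.embed_y(y_tokens)
            h = self.encoder(h.unsqueeze(0)).squeeze(0)
            return self.head(h[n_context:])  # logits for the query points only

    model = TinyPFN()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(100):  # each step sees a fresh synthetic task from the prior
        x, y = sample_prior_dataset()
        n_context = N_SAMPLES // 2             # first half acts as D, second half as queries
        logits = model(x, y, n_context)
        loss = loss_fn(logits, y[n_context:])  # cross-entropy on held-out labels
        opt.zero_grad(); loss.backward(); opt.step()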
The theoretical grounding of the paper, in particular why the approach works and
why minimising the cross-entropy loss approximates the true Bayesian posterior,
relies entirely on the previous work[3]. The previous work used Bayesian Neural
Networks as the prior data-generating mechanism; this work uses Structural Causal
Models in addition to Bayesian Neural Networks. Although the authors argue for a
fundamentally probabilistic approach, i.e., using BNNs and SCMs as the prior, I
could not find any strong arguments for this choice, though I assume the reasoning
is justified in the previous work. Therefore, this paper is mostly a technical
improvement over the previous work, and there are no strong theoretical arguments
for the choice of Transformers, SCMs, BNNs, etc.
The experiments are well-defined, and the comparison with baselines is done
fairly. However, the experiments cover only a subset of classification tasks
(fully numerical features, no missing values, fewer than 1024 training examples,
and fewer than 10 classes). For this subset, the experiments show that TabPFN
outperforms tree-based approaches and is on par with AutoML libraries. However,
the authors argue that deep-learning-based methods do not work well on small
datasets (which is true), that tree-based methods take longer to train (only
relatively; in my practical experience I am more than happy with how they
perform), and that the pre-trained PFN solves both problems.
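
To make the evaluated regime explicit, the check below is a hypothetical helper of my own (not part of the paper or any released code) that tests whether a dataset falls inside the subset of tasks covered by the experiments; the thresholds come from the constraints listed above.

    # Hypothetical helper: does a dataset fall inside the regime the paper evaluates
    # (purely numerical features, no missing values, < 1024 training rows, < 10 classes)?
    import pandas as pd

    def fits_tabpfn_regime(X: pd.DataFrame, y: pd.Series) -> bool:
        all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in X.dtypes)
        no_missing = not X.isna().any().any()
        small_enough = len(X) < 1024
        few_classes = y.nunique() < 10
        return all_numeric and no_missing and small_enough and few_classes

    # Example: a tiny synthetic dataset that satisfies all four constraints.
    X = pd.DataFrame({"f1": [0.1, 0.5, 0.9], "f2": [1.0, 2.0, 3.0]})
    y = pd.Series([0, 1, 0])
    print(fits_tabpfn_regime(X, y))  # True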
The key weakness of the paper is that it only works with small datasets. However,
the approach is elegant and could be applied to other domains, such as
deep-learning-based image compression methods (an area I have some experience
with). The weakness is fundamental to the choice of the Transformer architecture,
which scales quadratically with the input and therefore struggles on large
datasets with many features. As a result, the current method is limited to small,
purely numerical datasets with no missing values, which is far from real-world
data (at least in manufacturing, where I am used to working with predictive
models).
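
As a rough back-of-the-envelope illustration of this scaling issue (my own estimate, ignoring multiple heads, layers, and memory-efficient attention implementations), the snippet below shows how the memory for a single attention score matrix grows quadratically when every training row becomes a token in the context.

    # Rough estimate of the memory needed for one attention score matrix when
    # every training row is a token in the Transformer's context.
    def attention_matrix_gib(n_rows: int, bytes_per_value: int = 4) -> float:
        return n_rows ** 2 * bytes_per_value / 1024 ** 3

    for n in (1024, 10_000, 100_000):
        print(f"{n:>7} rows -> {attention_matrix_gib(n):8.3f} GiB per attention matrix")
    # 1024 rows -> ~0.004 GiB; 100,000 rows -> ~37 GiB, before counting heads or layers.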
I would want to see more justification for choosing Structural Causal Models to
define the prior. I understand that SCMs and BNNs are used to create prior
datasets for synthetic prior fitting, but I would like to see why. For example,
it is unclear how “rung 1.5” is considered.
As follow-up work, I would use a different Transformer architecture to remove the
input-length limitations. For example, this work[1] claims to scale the
Transformer to 1M tokens and beyond using the Recurrent Memory Transformer.
Overall, I liked the work’s relative novelty of pre-training on synthetic prior
datasets, which resembles GPT-style pre-training for NLP tasks.

References
[1] Aydar Bulatov, Yuri Kuratov, and Mikhail S Burtsev. “Scaling Transformer
to 1M tokens and beyond with RMT”. In: arXiv preprint arXiv:2304.11062
(2023).
[2] Noah Hollmann et al. “TabPFN: A transformer that solves small tabular
classification problems in a second”. In: arXiv preprint arXiv:2207.01848
(2022).
[3] Samuel Müller et al. “Transformers can do Bayesian inference”. In: arXiv
preprint arXiv:2112.10510 (2021).
[4] Ashish Vaswani et al. “Attention is all you need”. In: Advances in neural
information processing systems 30 (2017).
