
A Systematic Survey of Chemical Pre-trained Models

Jun Xia1,∗, Yanqiao Zhu2,∗, Yuanqi Du3,∗ and Stan Z. Li1,†

1 Westlake University   2 University of California, Los Angeles   3 Cornell University
{xiajun, stan.zq.li}@westlake.edu.cn, yzhu@cs.ucla.edu, yd392@cs.cornell.edu

∗ Equal contribution.   † Corresponding author.

Abstract

Deep learning has achieved remarkable success in learning representations for molecules, which is crucial for various biochemical applications, ranging from property prediction to drug design. However, training Deep Neural Networks (DNNs) from scratch often requires abundant labeled molecules, which are expensive to acquire in the real world. To alleviate this issue, tremendous efforts have been devoted to Chemical Pre-trained Models (CPMs), where DNNs are pre-trained on large-scale unlabeled molecular databases and then fine-tuned on specific downstream tasks. Despite this prosperity, the fast-growing field still lacks a systematic review. In this paper, we present the first survey that summarizes the current progress of CPMs. We first highlight the limitations of training molecular representation models from scratch to motivate CPM studies. Next, we systematically review recent advances on this topic from several key perspectives, including molecular descriptors, encoder architectures, pre-training strategies, and applications. We also highlight the challenges and promising avenues for future research, providing a useful resource for both the machine learning and scientific communities.

[Figure 1: A typical learning pipeline for Chemical Pre-trained Models (CPMs). Pre-training: a backbone (GNNs or Transformers) with a self-supervised head is trained on a large molecular database using pre-training tasks. Fine-tuning: the pre-trained backbone is transferred, equipped with a new head, and fine-tuned on downstream molecular datasets for downstream tasks such as MPP, DDI, and DTI. MPP: Molecular Property Prediction; DDI: Drug-Drug Interactions; DTI: Drug-Target Interactions.]

1 Introduction

Extracting vector representations for molecules is critical to applying machine learning methods to a broad spectrum of molecular tasks. Initially, molecular fingerprints were developed to encode molecules into binary vectors with rule-based algorithms [Consonni and Todeschini, 2009]. Subsequently, various Deep Neural Networks (DNNs) have been employed to encode molecules in a data-driven manner. Early attempts exploit sequence-based neural architectures (e.g., RNNs, LSTMs, and Transformers) to encode molecules represented as Simplified Molecular-Input Line-Entry System (SMILES) strings [Weininger, 1988]. Later, it was argued that molecules can naturally be represented as graphs with atoms as nodes and bonds as edges, which inspired a line of works that leverage such structured inductive bias for better molecular representations [Kearnes et al., 2016]. The key advancement underlying these approaches is Graph Neural Networks (GNNs), which consider graph structures and attributive features simultaneously by recursively aggregating node features from neighborhoods [Kipf and Welling, 2017]. Recently, another line of development of GNNs for molecular representations models the 3D geometric symmetries of molecular conformations, considering that molecules are by nature in constant motion in 3D space [Schütt et al., 2017]. However, the majority of the above works learn molecular representations under supervised settings, which limits their widespread application in practice for the following reasons. (1) Scarcity of labeled data: task-specific labels of molecules can be extremely scarce because molecular data labeling often requires expensive wet-lab experiments. (2) Poor out-of-distribution generalization: learning molecules with different sizes or functional groups requires out-of-distribution generalization in many real-world cases. For example, suppose one wishes to predict the properties of a newly synthesized molecule that differs from all the molecules in the training set; models trained from scratch cannot extrapolate well to such out-of-distribution molecules [Hu et al., 2020a].

Pre-trained Language Models (PLMs) have been a potential solution to the above challenges in the Natural Language Processing (NLP) community [Devlin et al., 2019]. Inspired by their success, as shown in Fig. 1, Chemical Pre-trained Models (CPMs) have been introduced to learn universal molecular representations from massive unlabeled molecules and then be fine-tuned on specific downstream tasks.
[Figure 2: A taxonomy of Chemical Pre-trained Models (CPMs) with representative examples.
Molecular descriptors (§2): Fingerprints (MEMO [Zhu et al., 2022b]); Sequences (ChemBERTa [Chithrananda et al., 2020], SMILES-BERT [Wang et al., 2019]); 2D topology graphs (GraphCL [You et al., 2020], Hu et al. [Hu et al., 2020a]); 3D geometry graphs (GraphMVP [Liu et al., 2022], 3D Infomax [Stärk et al., 2022]).
Encoder architectures (§2): GNNs (GraphCL [You et al., 2020], GraphMVP [Liu et al., 2022], Hu et al. [Hu et al., 2020a]); Transformers (GROVER [Rong et al., 2020], Graphormer [Ying et al., 2021], MEMO [Zhu et al., 2022b]).
Pre-training strategies (§3), self-supervised learning: AE (SMILES Transformer [Honda et al., 2019], PanGu Drug Model [Lin et al., 2022]); AM (GPT-GNN [Hu et al., 2020b], MGSSL [Zhang et al., 2021], MolGPT [Bagal et al., 2021]); MCM (AttrMasking [Hu et al., 2020a], Mole-BERT [Xia et al., 2023]); CP (GROVER [Rong et al., 2020], ContextPred [Hu et al., 2020a]); CSC (DGI [Velickovic et al., 2019], InfoGraph [Sun et al., 2020]); SSC (GraphCL [You et al., 2020], MoCL [Sun et al., 2021]); RCD (MPG [Li et al., 2021], D-SLA [Kim et al., 2022]); DN (Denoising [Zaidi et al., 2023], GeoSSL [Liu et al., 2023]). Extensions: Knowledge-enriched (KCL [Fang et al., 2022b], MoCL [Sun et al., 2021]); Multimodal (KV-PLM [Zeng et al., 2022b], MolT5 [Edwards et al., 2022]).
Applications (§4), drug discovery: Property prediction (ChemRL-GEM [Fang et al., 2022a]); Molecular generation (MolGPT [Bagal et al., 2021], Uni-Mol [Zhou et al., 2023]); DDI prediction (MPG [Li et al., 2021]); DTI prediction (GraphMVP [Liu et al., 2022]).]

Initially, researchers adopted sequence-based pre-training strategies on string-based molecular data such as SMILES. A typical strategy is to pre-train the neural encoders to predict randomly masked tokens, as in BERT [Devlin et al., 2019]. This line of work includes ChemBERTa [Chithrananda et al., 2020], SMILES-BERT [Wang et al., 2019], Molformer [Ross et al., 2022], etc. More recently, the community has explored pre-training on (both 2D and 3D) molecular graphs. For example, [Hu et al., 2020a] propose to mask atom or edge attributes and predict the masked attributes, and [Liu et al., 2022] pre-train GNNs by maximizing the correspondence between 2D topological and 3D geometric structures.

Although CPMs have been increasingly applied in molecular representation learning, this rapidly expanding field still lacks a systematic review. In this paper, we present the first survey of CPMs to assist audiences of diverse backgrounds in understanding, using, and developing CPMs for various practical tasks. The contributions of this work can be summarized from the following four aspects. (1) A structured taxonomy. A broad overview of the field is presented with a structured taxonomy that categorizes existing works from four perspectives (Fig. 2): molecular descriptors, encoder architectures, pre-training strategies, and applications. (2) Thorough review of the current progress. Based on the taxonomy, the current research progress of pre-trained models for molecules is systematically delineated. (3) Abundant additional resources. Open-sourced CPMs, available datasets, and an important paper list are collected and can be found at https://github.com/junxia97/awesome-pretrain-on-molecules. These resources will be updated on a regular basis. (4) Discussion of future directions. The limitations of existing works are discussed and several promising research directions are highlighted.

2 Molecular Descriptors and Encoders

To feed molecules to DNNs, molecules have to be featurized as numerical descriptors. Various descriptors are designed to describe molecules in a concise format. In this section, we briefly review these molecular descriptors and their corresponding neural encoder architectures.

Fingerprints (FP). Molecular fingerprints describe the presence or absence of particular substructures of a molecule with binary strings. For example, PubChemFP [Wang et al., 2017] encodes 881 structural key types that correspond to the substructures of a fragment of compounds in the PubChem database.

Sequences. The most frequently used sequential descriptor for molecules is the Simplified Molecular-Input Line-Entry System (SMILES) [Weininger, 1988], owing to its versatility and interpretability. Each atom is represented by a corresponding ASCII symbol, while chemical bonds, branching, and stereochemistry are denoted by specific symbols. Transformers [Vaswani et al., 2017] are a powerful neural model for processing sequences and modeling the complex relationships among tokens: we can first split sequence-based molecular descriptors into a series of tokens denoting atoms and bonds and then apply Transformers on top of these tokens [Chithrananda et al., 2020; Wang et al., 2019].
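To make the two descriptors above concrete, the following is a minimal sketch assuming RDKit and Python are available: it computes a Morgan fingerprint as a binary substructure vector and splits a SMILES string into tokens with a deliberately simplified regular expression. The aspirin SMILES and the token pattern are illustrative choices made here, not the fingerprint keys or tokenizers of any surveyed model.

```python
# A minimal sketch of the fingerprint and sequence descriptors, assuming RDKit.
# The SMILES string and the token pattern below are illustrative only.
import re
from rdkit import Chem
from rdkit.Chem import AllChem

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, used purely as an example
mol = Chem.MolFromSmiles(smiles)

# 1) Fingerprint: a fixed-length binary vector marking substructure presence.
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
print(fp.GetNumOnBits(), "bits set out of", fp.GetNumBits())

# 2) Sequence: split the SMILES into atom/bond/branching tokens so that a
#    Transformer can attend over them (a simplified tokenizer, for illustration).
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFIbcnops]|[=#\-+()/\\%@0-9]")
tokens = TOKEN_PATTERN.findall(smiles)
print(tokens)
```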
2D graphs. Molecules can naturally be represented as 2D graphs, with atoms as nodes and bonds as edges. Each node and edge can also carry feature vectors denoting, for instance, atom types/chirality and bond types/directions [Hu et al., 2020a]. GNNs [Kipf and Welling, 2017; Xu et al., 2019] can then be used to learn 2D molecular graph representations, and hybrid architectures of GNNs and Transformers [Rong et al., 2020; Ying et al., 2021] can also be leveraged to capture the topological structures of molecular graphs.

3D graphs. 3D geometries represent the spatial arrangement of the atoms of a molecule in 3D space, where each atom is associated with its type and coordinates plus some optional geometric attributes such as velocity. The advantage of using 3D geometry is that conformer information is critical to many molecular properties, especially quantum properties. In addition, it is also possible to directly leverage stereochemistry information such as chirality given the 3D geometries. A number of approaches [Schütt et al., 2017; Satorras et al., 2021; Du et al., 2022a] have developed message-passing mechanisms on 3D geometries, which enable the graph representations to respect certain physical symmetries, such as equivariance to translations and rotations.
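As a concrete illustration of the 2D graph descriptor, the sketch below converts a SMILES string into node features, a directed edge list, and bond-order edge attributes. It assumes RDKit and PyTorch, and the two-dimensional atom features are a deliberate simplification of the richer attribute sets (chirality, bond direction, etc.) used by the surveyed models.

```python
# A minimal sketch of turning a molecule into a 2D graph (atoms as nodes,
# bonds as edges), assuming RDKit and PyTorch; real CPMs use richer features.
import torch
from rdkit import Chem

def mol_to_graph(smiles: str):
    mol = Chem.MolFromSmiles(smiles)
    # Node features: here just the atomic number and formal charge per atom.
    x = torch.tensor(
        [[a.GetAtomicNum(), a.GetFormalCharge()] for a in mol.GetAtoms()],
        dtype=torch.float,
    )
    # Edge list: each bond yields two directed edges; edge feature = bond order.
    src, dst, edge_attr = [], [], []
    for b in mol.GetBonds():
        i, j = b.GetBeginAtomIdx(), b.GetEndAtomIdx()
        src += [i, j]
        dst += [j, i]
        edge_attr += [[b.GetBondTypeAsDouble()]] * 2
    edge_index = torch.tensor([src, dst], dtype=torch.long)
    return x, edge_index, torch.tensor(edge_attr, dtype=torch.float)

x, edge_index, edge_attr = mol_to_graph("c1ccccc1O")  # phenol as an example
print(x.shape, edge_index.shape, edge_attr.shape)
```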
3 Pre-training Strategies

In this section, we elaborate on several representative self-supervised pre-training strategies of CPMs.

3.1 AutoEncoding (AE)

Reconstructing molecules with autoencoders (Fig. 3a) serves as a natural self-supervised target for learning expressive molecular representations. The prediction target in molecule reconstruction is the (partial) structure of the given molecule, such as the attributes of a subset of atoms or chemical bonds. A typical example is the SMILES Transformer [Honda et al., 2019], which leverages a Transformer-based encoder-decoder network and learns representations by reconstructing molecules represented as SMILES strings. More recently, unlike conventional autoencoders whose input and output are of the same type, [Lin et al., 2022] pre-train a graph-to-sequence asymmetric conditional variational autoencoder to learn molecular representations. Although autoencoders can learn meaningful representations for molecules, they focus on single molecules and fail to capture inter-molecule relationships, which limits their performance in some downstream tasks [Li et al., 2021].

3.2 Autoregressive Modeling (AM)

Autoregressive Modeling (AM) factorizes the molecular input into a sequence of sub-components and then predicts the sub-components one by one, conditioned on the previous sub-components in the sequence. Following the idea of GPT [Brown et al., 2020] in NLP, MolGPT [Bagal et al., 2021] pre-trains a Transformer network to predict the next token in SMILES strings in such an autoregressive manner. For molecular graphs, GPT-GNN [Hu et al., 2020b] reconstructs the molecular graph in a sequence of steps (Fig. 3b), in contrast to graph autoencoders that reconstruct the whole graph at once. In particular, given a graph with its nodes and edges randomly masked, GPT-GNN generates one masked node and its edges at a time and maximizes the likelihood of the node and edges generated in each iteration; it then iteratively generates nodes and edges until all masked nodes are generated. Analogously, MGSSL [Zhang et al., 2021] autoregressively generates molecular graph motifs instead of individual atoms or bonds. Formally, such autoregressive modeling objectives can be written as

$\mathcal{L}_{\mathrm{AM}} = -\mathbb{E}_{\mathcal{M} \in \mathcal{D}} \sum_{i=1}^{|\mathcal{C}|} \log p(\mathcal{C}_i \mid \mathcal{C}_{<i})$,    (1)

where $\mathcal{C}_i$ and $\mathcal{C}_{<i}$ denote the attributes of the $i$-th component of the molecule $\mathcal{M}$ and of the components generated before index $i$, respectively. Compared with other strategies, AM allows CPMs to perform better at generating molecules because its training procedure resembles that of molecule generation [Bagal et al., 2021]. However, AM is more computationally expensive and requires a preset ordering of atoms or bonds, which may be inappropriate for molecules because atoms and bonds have no inherent order.
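The sketch below instantiates the autoregressive objective of Eq. (1) on token ids of SMILES strings with a small causal Transformer, assuming PyTorch. `CausalSmilesLM` and the toy batch are hypothetical stand-ins for a decoder-only model such as MolGPT, not the released implementation of any surveyed method.

```python
# A minimal sketch of the autoregressive objective in Eq. (1) on SMILES token
# ids, assuming PyTorch; the model below is a hypothetical MolGPT-style stand-in.
import torch
import torch.nn as nn

class CausalSmilesLM(nn.Module):
    def __init__(self, vocab_size=64, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):                       # (batch, seq_len)
        seq_len = token_ids.size(1)
        # Additive causal mask: position i may only attend to positions <= i.
        causal = torch.triu(
            torch.full((seq_len, seq_len), float("-inf")), diagonal=1
        )
        h = self.encoder(self.embed(token_ids), mask=causal)
        return self.lm_head(h)                          # (batch, seq_len, vocab)

def autoregressive_loss(model, token_ids):
    # Predict component i from components < i: shift inputs and targets by one.
    logits = model(token_ids[:, :-1])
    targets = token_ids[:, 1:]
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
    )

batch = torch.randint(0, 64, (8, 32))   # toy batch of pre-tokenized SMILES
loss = autoregressive_loss(CausalSmilesLM(), batch)
```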
3.3 Masked Component Modeling (MCM)

In language domains, Masked Language Modeling (MLM) has emerged as a dominant pre-training objective. Specifically, MLM randomly masks out tokens from the input sentences, and the model is trained to predict the masked tokens using the remaining ones [Devlin et al., 2019]. Masked Component Modeling (MCM, Fig. 3c) generalizes the idea of MLM to molecules. Specifically, MCM masks out some components (e.g., atoms, bonds, and fragments) of the molecules and then trains the model to predict them given the remaining components. Generally, its objective can be formulated as

$\mathcal{L}_{\mathrm{MCM}} = -\mathbb{E}_{\mathcal{M} \in \mathcal{D}} \sum_{\widetilde{\mathcal{M}} \in m(\mathcal{M})} \log p(\widetilde{\mathcal{M}} \mid \mathcal{M} \backslash m(\mathcal{M}))$,    (2)

where $m(\mathcal{M})$ denotes the masked components of the molecule $\mathcal{M}$ and $\mathcal{M} \backslash m(\mathcal{M})$ denotes the remaining components. For sequence-based pre-training, ChemBERTa [Chithrananda et al., 2020], SMILES-BERT [Wang et al., 2019], and Molformer [Ross et al., 2022] mask random characters in SMILES strings and then recover them from the Transformer outputs of the corrupted SMILES strings. For molecular graph pre-training, [Hu et al., 2020a] propose to randomly mask input atom/chemical bond attributes and pre-train GNNs to predict them. Similarly, GROVER [Rong et al., 2020] attempts to predict masked subgraphs to capture the contextual information in molecular graphs. Recently, Mole-BERT [Xia et al., 2023] argues that masking atom types can be problematic due to the extremely small and unbalanced atom set found in nature; to mitigate this issue, they design a context-aware tokenizer that encodes atoms as chemically meaningful discrete values for masking.

MCM is especially beneficial for richly-annotated molecules. For example, masking atom attributes enables GNNs to learn simple chemistry rules such as valency, as well as potentially more complex chemistry descriptors such as the electronic or steric effects of functional groups. Additionally, compared with the aforementioned AM strategies, MCM predicts the masked components based on their surrounding environments, as opposed to AM, which merely relies on the preceding components in a predefined sequence. Hence, MCM can capture more complete chemical semantics. However, as MCM often masks a fixed portion of each molecule during pre-training following BERT [Devlin et al., 2019], it cannot train on all the components of each molecule, which results in less efficient sample utilization.
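Below is a minimal sketch of the MCM objective of Eq. (2) on molecular graphs, assuming PyTorch: a random subset of atom types is replaced by a mask token and predicted back from the corrupted graph. `ToyGraphEncoder` is a hypothetical single-round message-passing stand-in; real CPMs such as AttrMasking use deeper GINs and can also mask bond attributes.

```python
# A minimal sketch of the MCM objective in Eq. (2): mask a random subset of
# atom types and predict them from the remaining graph context. Assumes PyTorch;
# the toy encoder below stands in for a real GNN and is not a released API.
import torch
import torch.nn as nn

NUM_ATOM_TYPES = 119          # atomic numbers; one extra index is the mask token
MASK_TOKEN = NUM_ATOM_TYPES

class ToyGraphEncoder(nn.Module):
    # Placeholder encoder: embeds atom types and mixes each atom with the mean
    # of its bonded neighbours once; surveyed CPMs use deeper GNNs (e.g., GIN).
    def __init__(self, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(NUM_ATOM_TYPES + 1, hidden_dim)
        self.update = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, atom_types, edge_index):
        h = self.embed(atom_types)
        agg = torch.zeros_like(h).index_add_(0, edge_index[1], h[edge_index[0]])
        deg = torch.zeros(h.size(0), 1).index_add_(
            0, edge_index[1], torch.ones(edge_index.size(1), 1)
        ).clamp(min=1)
        return torch.relu(self.update(torch.cat([h, agg / deg], dim=-1)))

class MaskedAtomPredictor(nn.Module):
    def __init__(self, graph_encoder, hidden_dim):
        super().__init__()
        self.encoder = graph_encoder
        self.head = nn.Linear(hidden_dim, NUM_ATOM_TYPES)

    def loss(self, atom_types, edge_index, mask_ratio=0.15):
        masked = torch.rand(atom_types.size(0)) < mask_ratio
        masked[0] = True                      # ensure at least one masked atom
        corrupted = atom_types.clone()
        corrupted[masked] = MASK_TOKEN        # hide the selected atom types
        node_repr = self.encoder(corrupted, edge_index)   # (num_atoms, dim)
        logits = self.head(node_repr[masked])
        # Cross-entropy on the masked positions only, as in Eq. (2).
        return nn.functional.cross_entropy(logits, atom_types[masked])

atom_types = torch.tensor([6, 6, 8, 7])              # toy molecule: C, C, O, N
edge_index = torch.tensor([[0, 1, 1, 2, 1, 3],       # each bond listed in
                           [1, 0, 2, 1, 3, 1]])      # both directions
model = MaskedAtomPredictor(ToyGraphEncoder(64), hidden_dim=64)
print(model.loss(atom_types, edge_index))
```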
[Figure 3: Semantic diagrams of seven unsupervised pre-training strategies: (a) Autoencoding, (b) Autoregressive Modeling (already generated atoms or bonds condition the atoms or bonds to be generated), (c) Masked Components Modeling (masked atom types, e.g. {C, N, O, S, ...}, are predicted), (d) Context Prediction (a k-hop neighborhood of a center atom is compared with the context graph between context anchors), (e) Contrastive Learning (maximize agreement between two encoded views), (f) Replaced Component Detection (classify components as 1: replaced or 0: not replaced), (g) Denoising (add noise to atom coordinates and predict the noise). E: encoder; D: decoder.]

3.4 Context Prediction (CP)

Context Prediction (CP, Fig. 3d) aims to capture the semantics of molecules/atoms in an explicit, context-aware manner. Generally, CP can be formulated as

$\mathcal{L}_{\mathrm{CP}} = -\mathbb{E}_{\mathcal{M} \in \mathcal{D}} \log p(t \mid \mathcal{M}_1, \mathcal{M}_2)$,    (3)

where $t = 1$ if the neighborhood components $\mathcal{M}_1$ and the surrounding context $\mathcal{M}_2$ share the same center atom and $t = 0$ otherwise. For example, [Hu et al., 2020a] use a binary classification of whether a subgraph of a molecule and a surrounding context structure belong to the same node. While simple and effective, CP requires an auxiliary neural model to encode the context into a fixed vector, adding extra computational overhead for large-scale pre-training.
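A minimal sketch of the context prediction objective of Eq. (3), assuming PyTorch: given pooled embeddings of a k-hop neighborhood and of a context graph (produced by the main and the auxiliary encoder, which are abstracted away here), a dot-product scorer is trained with binary cross-entropy to decide whether they share the same center atom. The negative sampling by batch permutation is a toy illustration only.

```python
# A minimal sketch of Context Prediction (Eq. 3): decide whether a neighbourhood
# embedding and a context embedding come from the same centre atom. Assumes
# PyTorch; the encoders producing the embeddings are hypothetical stand-ins.
import torch
import torch.nn as nn

def context_prediction_loss(neigh_emb, ctx_emb, same_center):
    # neigh_emb, ctx_emb: (batch, dim); same_center: (batch,) labels t in Eq. (3).
    logits = (neigh_emb * ctx_emb).sum(dim=-1)          # dot-product scorer
    return nn.functional.binary_cross_entropy_with_logits(
        logits, same_center.float()
    )

# Toy usage: positives pair each neighbourhood with its own context; negatives
# pair it with a context from another molecule in the batch (illustration only).
batch, dim = 16, 64
neigh = torch.randn(batch, dim)
pos_ctx = neigh + 0.1 * torch.randn(batch, dim)
neg_ctx = pos_ctx[torch.randperm(batch)]
loss = context_prediction_loss(
    torch.cat([neigh, neigh]),
    torch.cat([pos_ctx, neg_ctx]),
    torch.cat([torch.ones(batch), torch.zeros(batch)]),
)
```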
tation strategies for molecule-level pre-training represented
by graphs. Additionally, some recent works maximize the
3.5 Contrastive Learning (CL)
agreement between various descriptors of identical molecules
Contrastive Learning (CL, Fig. 3e) pre-trains the model by and repel the different ones. For example, SMICLR [Pinheiro
maximizing the agreement between a pair of similar inputs, et al., 2022] jointly leverages a graph encoder and a SMILES
such as two different augmentations or descriptors of the string encoder to perform SSC; MM-Deacon [Guo et al., 2022]
same molecule. According to the contrastive granularity (e.g., utilizes two separate transformers to encode the SMILES and
molecule- or substructure-level), we introduce two categories the International Union of Pure and Applied Chemistry (IU-
of CL in CPMs: Cross-Scale Contrast (CSC) and Same-Scale PAC) of molecules, after which a contrastive objective is used
Contrast (SSC). to promote similarity of SMILES and IUPAC representations
3D Infomax [Stärk et al., 2022] proposes to maximize the agreement between learned 3D geometry and 2D graph representations; and GeomGCL [Li et al., 2022b] adopts a dual-view Geometric Message Passing Neural Network (GeomMPNN) to encode both the 2D and 3D graphs of a molecule and designs a geometric contrastive objective. The general formulation of the SSC pre-training objective is

$\mathcal{L}_{\mathrm{SSC}} = -\mathbb{E}_{\mathcal{M} \in \mathcal{D}} \left[ \log s(\mathcal{M}, \mathcal{M}') - \log \sum_{\mathcal{M}^- \in \mathcal{N}} s(\mathcal{M}, \mathcal{M}^-) \right]$,    (5)

where $\mathcal{M}'$ is an augmented version or another descriptor of the molecule $\mathcal{M}$, and $\mathcal{N}$ is the set of negative samples.
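The same-scale contrast of Eq. (5) is commonly implemented as an InfoNCE-style batch loss, as in the sketch below (assuming PyTorch): two augmented views of each molecule form the positive pair, and all other molecules in the batch serve as negatives. Temperature, projection heads, and augmentation choices vary across GraphCL and its variants, so this is a generic sketch rather than any specific released loss.

```python
# A minimal InfoNCE-style sketch of the same-scale contrast in Eq. (5), assuming
# PyTorch: the diagonal of the similarity matrix holds the positive pairs.
import torch
import torch.nn.functional as F

def ssc_loss(z1, z2, temperature=0.1):
    # z1, z2: (batch, dim) embeddings of two augmented views of the same batch.
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature        # pairwise similarities s(., .)
    targets = torch.arange(z1.size(0))        # positives sit on the diagonal
    # Symmetrised cross-entropy: each view must retrieve its counterpart.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = ssc_loss(torch.randn(32, 128), torch.randn(32, 128))
```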
Although CL has achieved promising results, several critical issues impede its broader application. First, it is difficult to preserve semantics during molecular augmentation. Existing solutions pick augmentations by manual trial and error [You et al., 2020], through cumbersome optimization [You et al., 2021], or under the guidance of expensive domain knowledge [Sun et al., 2021], but an efficient and principled way to design chemically appropriate augmentations for molecular pre-training is still lacking. Also, the assumption behind CL that similar inputs should be pulled closer in representation space may not always hold for molecular representation learning. For example, in the case of molecular activity cliffs [Stumpfe et al., 2019], similar molecules exhibit completely different properties. Therefore, it remains unsolved which pre-training strategies can better capture discrepancies between molecules. Additionally, the CL objective in most CPMs randomly chooses all other molecules in a batch as negative samples regardless of their true semantics, which undesirably repels molecules of similar properties and undermines performance due to these false negatives [Xia et al., 2022a].

3.6 Replaced Components Detection (RCD)

Replaced Components Detection (RCD, Fig. 3f) proposes to recognize randomly replaced components of the input molecules. For example, MPG [Li et al., 2021] splits each molecule into two parts, shuffles the structures by combining parts from two molecules, and trains the encoder to detect whether the combined parts belong to the same molecule. This objective can be written as

$\mathcal{L}_{\mathrm{RCD}} = -\mathbb{E}_{\mathcal{M} \in \mathcal{D}} \left[ \log p(t \mid \mathcal{M}_1, \mathcal{M}_2) \right]$,    (6)

where $t = 1$ if the two parts $\mathcal{M}_1$ and $\mathcal{M}_2$ are from the same molecule $\mathcal{M}$ and $t = 0$ otherwise. While RCD can uncover intrinsic patterns in molecular structures, the encoders are pre-trained to always produce the same "non-replacement" label for all natural molecules and the "replacement" label for randomly combined molecules. In downstream tasks, however, the input molecules are all natural ones, causing the molecular representations produced by RCD to be less distinguishable.

3.7 DeNoising (DN)

Inspired by the success of denoising diffusion probabilistic models [Ho et al., 2020], DeNoising (DN, Fig. 3g) has recently been adopted as a pre-training strategy for learning molecular representations as well. A recent work [Zaidi et al., 2023] adds noise to the atomic coordinates of 3D molecular geometries and pre-trains the encoders to predict the noise; the authors demonstrate that such a denoising objective approximates learning a molecular force field. A concurrent work, Uni-Mol [Zhou et al., 2023], adds noise to atomic coordinates motivated by the fact that masked atom types can be easily inferred given 3D atomic positions. More recently, GeoSSL [Liu et al., 2023] proposes a distance-denoising pre-training method to model the dynamic nature of 3D molecules. Generally, the denoising pre-training objective can be formulated as

$\mathcal{L}_{\mathrm{DN}} = \mathbb{E}_{\mathcal{M} \in \mathcal{D}} \, \| \epsilon - f_\theta(\widetilde{\mathcal{M}}) \|^2$,    (7)

where $\epsilon$ denotes the added noise, $\widetilde{\mathcal{M}}$ denotes the input molecule $\mathcal{M}$ with noise added, and $f_\theta(\cdot)$ denotes the encoder that predicts the noise.
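The denoising objective of Eq. (7) can be sketched as follows, assuming PyTorch: Gaussian noise is added to the atomic coordinates and the encoder regresses that noise. `ToyGeomEncoder` is a hypothetical, non-equivariant placeholder used only to make the snippet runnable; the surveyed methods use SE(3)-equivariant 3D networks.

```python
# A minimal sketch of the denoising objective in Eq. (7): perturb coordinates
# with Gaussian noise and predict the noise. Assumes PyTorch; the toy encoder
# below is NOT equivariant and only stands in for a real 3D geometric network.
import torch
import torch.nn as nn

def denoising_loss(geometry_encoder, atom_types, coords, sigma=0.1):
    # atom_types: (num_atoms,), coords: (num_atoms, 3) equilibrium geometry.
    noise = sigma * torch.randn_like(coords)        # epsilon in Eq. (7)
    noisy_coords = coords + noise                   # the corrupted molecule
    predicted_noise = geometry_encoder(atom_types, noisy_coords)
    return ((predicted_noise - noise) ** 2).sum(dim=-1).mean()

class ToyGeomEncoder(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(119, hidden)
        self.mlp = nn.Sequential(nn.Linear(hidden + 3, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 3))

    def forward(self, atom_types, coords):
        return self.mlp(torch.cat([self.embed(atom_types), coords], dim=-1))

loss = denoising_loss(ToyGeomEncoder(), torch.tensor([6, 8, 1]), torch.randn(3, 3))
```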
Table 1: A summary of representative Chemical Pre-trained Models (CPMs) in the literature.

Model | Input | Backbone architecture | Pre-training task | Pre-training database | #Params. | Link

Sequence:
SMILES Transformer [Honda et al., 2019] | SMILES | Transformer | AE | ChEMBL (861K) [Gaulton et al., 2012] | — | Link
ChemBERTa [Chithrananda et al., 2020] | SMILES/SELFIES [Krenn et al., 2020] | Transformer | MCM | PubChem (77M) [Wang et al., 2017] | — | Link
SMILES-BERT [Wang et al., 2019] | SMILES | Transformer | MCM | ZINC15 (∼18.6M) [Sterling and Irwin, 2015] | — | Link
Molformer [Ross et al., 2022] | SMILES | Transformer | MCM | ZINC15 (1B) + PubChem (111M) | — | —

Graph / Geometry:
Hu et al. [Hu et al., 2020a] | Graph | 5-layer GIN | CP + MCM | ZINC15 (2M) + ChEMBL (456K) | ∼2M | Link
GraphCL [You et al., 2020] | Graph | 5-layer GIN | SSC | ZINC15 (2M) | ∼2M | Link
JOAO [You et al., 2021] | Graph | 5-layer GIN | SSC | ZINC15 (2M) | ∼2M | Link
AD-GCL [Suresh et al., 2021] | Graph | 5-layer GIN | SSC | ZINC15 (2M) | ∼2M | Link
GraphLoG [Xu et al., 2021b] | Graph | 5-layer GIN | SSC | ZINC15 (2M) | ∼2M | Link
MGSSL [Zhang et al., 2021] | Graph | 5-layer GIN | MCM + AM | ZINC15 (250K) | ∼2M | Link
MPG [Li et al., 2021] | Graph | MolGNet [Li et al., 2021] | RCD + MCM | ZINC + ChEMBL (11M) | 53M | Link
LP-Info [You et al., 2022] | Graph | 5-layer GIN | SSC | ZINC15 (2M) | ∼2M | Link
SimGRACE [Xia et al., 2022b] | Graph | 5-layer GIN | SSC | ZINC15 (2M) | ∼2M | Link
GraphMAE [Hou et al., 2022] | Graph | 5-layer GIN | AE | ZINC15 (2M) | ∼2M | Link
MGMAE [Feng et al., 2022] | Graph | 5-layer GIN | AE | ZINC15 (2M) + ChEMBL (456K) | ∼2M | —
GROVER [Rong et al., 2020] | Graph | GTransformer [Rong et al., 2020] | CP + MCM | ZINC + ChEMBL (10M) | 48M∼100M | Link
MolCLR [Wang et al., 2022b] | Graph | GCN + GIN | SSC | PubChem (10M) | — | Link
Graphormer [Ying et al., 2021] | Graph | Graphormer [Ying et al., 2021] | Supervised | PCQM4M-LSC (∼3.8M) [Hu et al., 2021] | — | Link
3D-EMGP [Jiao et al., 2023] | Geometry | E(3)-equivariant GNNs | DN | GEOM (100K) [Axelrod and Gómez-Bombarelli, 2022] | — | Link
Mole-BERT [Xia et al., 2023] | Graph | 5-layer GIN | MCM + SSC | ZINC15 (2M) | ∼2M | Link
Denoising [Zaidi et al., 2023] | Geometry | GNS [Sanchez-Gonzalez et al., 2020] | DN | PCQM4Mv2 (∼3.4M) | — | Link
GeoSSL [Liu et al., 2023] | Geometry | PaiNN [Schütt et al., 2021] | DN | Molecule3D [Xu et al., 2021c] (∼1M) | — | Link

Multimodal / External knowledge:
DMP [Zhu et al., 2021] | Graph + SMILES | DeeperGCN + Transformer | MCM + SSC | PubChem (110M) | 104.1M | Link
GraphMVP [Liu et al., 2022] | Graph + Geometry | 5-layer GIN + SchNet [Schütt et al., 2017] | SSC + AE | GEOM (50K) | ∼2M | Link
3D Infomax [Stärk et al., 2022] | Graph + Geometry | PNA [Corso et al., 2020] | SSC | QM9 (50K) + GEOM (140K) + QMugs (620K) | — | Link
KCL [Fang et al., 2022b] | Graph + Knowledge Graph | GCN + KMPNN [Fang et al., 2022b] | SSC | ZINC15 (250K) | <1M | Link
KV-PLM [Zeng et al., 2022b] | SMILES + Text | Transformer | MLM + MCM | PubChem (150M) | ∼110M | Link
MEMO [Zhu et al., 2022b] | SMILES + FP + Graph + Geometry | Transformer + GIN + SchNet | SSC | GEOM (50K) | — | —
MolT5 [Edwards et al., 2022] | SMILES + Text | Transformer | Replace Corrupted Spans | ZINC15 (100M) | 60M / 770M | Link
MICER [Yi et al., 2022] | SMILES + Image | CNNs + LSTM | AE | ZINC20 | — | Link
MM-Deacon [Guo et al., 2022] | SMILES + IUPAC | Transformer | SSC | PubChem (10M) | — | —
PanGu Drug Model [Lin et al., 2022] | Graph + SELFIES [Krenn et al., 2020] | Transformer | AE | ZINC20 + DrugSpaceX + UniChem (∼1.7B) | ∼104M | Link
KPGT [Li et al., 2022a] | SMILES + FP | LiGhT [Li et al., 2022a] | MCM | ChEMBL29 (2M) | — | Link
ChemRL-GEM [Fang et al., 2022a] | Graph + Geometry | GeoGNN [Fang et al., 2022a] | MCM + CP | ZINC15 (20M) | — | Link
ImageMol [Zeng et al., 2022a] | Molecular Images | ResNet18 [He et al., 2016] | AE + SSC + CP | PubChem (∼10M) | — | Link
Uni-Mol [Zhou et al., 2023] | Geometry + Protein Pockets | Transformer | MCM + DN | ZINC/ChEMBL + PDB [Berman et al., 2000] | — | Link

3.8 Extensions

Knowledge-enriched pre-training. CPMs usually learn general molecular representations from a large molecular database, but they often lack domain-specific knowledge. To improve their performance, several recent works try to inject external knowledge into CPMs. For example, GraphCL [You et al., 2020] first points out that bond perturbations (adding or dropping bonds as data augmentations) are conceptually incompatible with domain knowledge and empirically not helpful for contrastive pre-training on chemical compounds, and therefore avoids bond perturbations for molecular graph augmentation. More explicitly, MoCL [Sun et al., 2021] proposes a domain-knowledge-based molecular augmentation operator called substructure substitution, in which a valid substructure of a molecule is replaced by a bioisostere, producing a new molecule with physical or chemical properties similar to the original one. More recently, KCL [Fang et al., 2022b] constructs a chemical element Knowledge Graph (KG) to summarize microscopic associations between chemical elements and presents a novel Knowledge-enhanced Contrastive Learning (KCL) framework for molecular representation learning. Additionally, MGSSL [Zhang et al., 2021] first leverages existing algorithms [Degen et al., 2008] to extract semantically meaningful motifs and then pre-trains neural encoders to predict the motifs in an autoregressive manner. ChemRL-GEM [Fang et al., 2022a] proposes to utilize molecular geometry information to enhance molecular graph pre-training: it designs a geometry-based GNN architecture as well as several geometry-level self-supervised learning strategies (bond length prediction, bond angle prediction, and atomic distance matrix prediction) to capture molecular geometry knowledge during pre-training. Although knowledge-enriched pre-training helps CPMs capture chemical domain knowledge, it requires expensive prior knowledge as guidance, which poses a hurdle to broader application when the prior is incomplete, incorrect, or expensive to obtain.

Multimodal pre-training. In addition to the descriptors mentioned in Sec. 2, molecules can also be described using other modalities, including images and biochemical texts, and some recent works perform multimodal pre-training on molecules. For example, KV-PLM [Zeng et al., 2022b] first tokenizes both the SMILES strings and biochemical texts, then randomly masks part of the tokens and pre-trains the neural encoders to recover the masked tokens. Analogously, following the replace-corrupted-spans task of T5 [Raffel et al., 2020], MolT5 [Edwards et al., 2022] first masks some spans of abundant SMILES strings and biochemical text descriptions of molecules and then pre-trains the Transformer to predict the masked spans. In this way, these pre-trained models can generate both SMILES strings and biochemical texts, which is especially effective for text-guided molecule generation and molecule captioning (generating descriptive texts for molecules). [Zhu et al., 2022b] propose to maximize the consistency between the embeddings of four molecular descriptors and their aggregated embedding using a contrastive objective, so that the various descriptors can collaborate with each other for molecular property prediction tasks. Additionally, MICER [Yi et al., 2022] adopts an autoencoder-based pre-training framework for molecular image captioning: molecular images are fed to the pre-trained encoder, and the corresponding SMILES strings are then decoded. The above-mentioned multimodal pre-training strategies can advance translation between the various modalities; moreover, these modalities can work together to create a more complete knowledge base for various downstream tasks.

4 Applications

The following section takes drug discovery as a case study and showcases several promising applications of CPMs (Tab. 1).

4.1 Molecular Property Prediction (MPP)

The bioactivity of a new drug candidate is influenced by various factors in real life, including solubility in the gastrointestinal tract, intestinal membrane permeability, and intestinal/hepatic first-pass metabolism. However, such labels for molecules can be extremely scarce because wet-lab experiments are often laborious and expensive. CPMs offer a solution that can exploit massive unlabeled molecules and serve as powerful backbones for downstream molecular property prediction tasks [Wang et al., 2022b; Zaidi et al., 2023]. Furthermore, compared with models trained from scratch, CPMs can better extrapolate to out-of-distribution molecules, which is especially important when predicting the properties of newly synthesized drugs [Hu et al., 2020a].
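The fine-tuning stage of Fig. 1 for property prediction can be sketched as follows, assuming PyTorch: a pre-trained backbone is paired with a freshly initialized prediction head and both are trained on labeled molecules. The backbone, data loader, and hyperparameters here are hypothetical placeholders rather than the interface of any released CPM.

```python
# A minimal sketch of fine-tuning a pre-trained backbone for property prediction
# (Fig. 1). Assumes PyTorch; `backbone` and the loader are toy placeholders.
import torch
import torch.nn as nn

def finetune_for_mpp(backbone, train_loader, hidden_dim, num_tasks, epochs=10):
    head = nn.Linear(hidden_dim, num_tasks)             # new task-specific head
    params = nn.ModuleList([backbone, head]).parameters()
    optimizer = torch.optim.Adam(params, lr=1e-4)
    loss_fn = nn.BCEWithLogitsLoss()                    # binary property labels

    for _ in range(epochs):
        for graphs, labels in train_loader:             # labels: (batch, num_tasks)
            mol_repr = backbone(graphs)                 # (batch, hidden_dim)
            loss = loss_fn(head(mol_repr), labels.float())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return backbone, head

# Toy usage with synthetic data standing in for molecular graphs and labels.
backbone = nn.Sequential(nn.Linear(16, 64), nn.ReLU())
loader = [(torch.randn(8, 16), torch.randint(0, 2, (8, 1)))]
finetune_for_mpp(backbone, loader, hidden_dim=64, num_tasks=1, epochs=1)
```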
4.2 Molecular Generation (MG)

Molecular generation, a long-standing challenge in computer-aided drug design, has been revolutionized by machine learning methods, especially generative models, which narrow the search space and improve computational efficiency, making it possible to delve into the seemingly infinite drug-like chemical space [Du et al., 2022b]. CPMs such as MolGPT [Bagal et al., 2021], which employs an autoregressive pre-training approach, have proven instrumental in generating valid, unique, and novel molecular structures. The emergence of multimodal molecular pre-training techniques [Edwards et al., 2022; Zeng et al., 2022b] has further expanded the possibilities of molecular generation by enabling the transformation of descriptive text into molecular structures. Another crucial area where CPMs have demonstrated their prowess is the generation of three-dimensional molecular conformations, particularly for the prediction of protein-ligand binding poses. Unlike conventional approaches based on molecular dynamics or Markov chain Monte Carlo, which are often hindered by computational limitations, especially for larger molecules [Hawkins, 2017], CPMs based on 3D geometry [Zhu et al., 2022a; Zhou et al., 2023] exhibit remarkable superiority in conformation generation tasks, as they can capture inherent relationships between 2D molecules and 3D conformations during the pre-training process.
4.3 Drug-Target Interactions (DTI)

Predictive analysis of Drug-Target Interactions (DTI) is a vital step in the early stages of drug discovery, as it helps to identify drug candidates with binding potential to specific protein targets. This is particularly important in drug repurposing, where the goal is to recycle approved drugs for a new disease, thereby reducing the need for further drug discovery and minimizing safety risks. Also, obtaining sufficient drug-target data for supervised training can be challenging; CPMs can overcome this issue by providing molecular encoders with good initializations. To achieve accurate DTI predictions, it is essential to consider both molecular encoders and target encoders, predict binding affinities, and co-train both for the DTI prediction task [Nguyen et al., 2021]. Previous works such as MPG [Li et al., 2021] follow these principles to advance DTI predictions.

4.4 Drug-Drug Interactions (DDI)

Accurately predicting Drug-Drug Interactions (DDI) is another crucial stage in drug discovery pipelines, as such interactions can result in adverse reactions that harm health and can even cause death. Moreover, accurate DDI predictions can also assist in making informed medication recommendations, making them an essential part of the regulatory investigation prior to market approval. From the machine learning perspective, DDI prediction can be regarded as a classification task that determines the influence of combined drugs as synergistic, additive, or antagonistic. Effective DDI prediction requires expressive molecular representations, which can be obtained using CPMs. MPG [Li et al., 2021] is a representative example that has demonstrated the usefulness of CPMs by adopting DDI prediction as a downstream task.

5 Conclusions and Future Outlooks

In conclusion, this paper provides a comprehensive overview of Chemical Pre-trained Models. We start by reviewing the widely-used descriptors and encoders for molecules, then present the representative pre-training strategies and evaluate their advantages and disadvantages. We also showcase various successful applications of CPMs in drug discovery and development. Despite the fruitful progress, several challenges still warrant further research in the future.

5.1 Improving Pre-training Architectures

While remarkable advancements have been achieved in analyzing the learning capabilities of neural architectures, such as the WL-test for GNNs, these analyses lack specificity in determining the optimal design for highly structured molecules. The ideal featurizations and architectures for CPMs remain elusive, as evidenced by conflicting results such as the negative impact of Graph Attention Networks (GATs) [Velickovic et al., 2018], which are widely adopted in graph learning, on downstream performance in previous studies [Hu et al., 2020a; Hou et al., 2022]. Furthermore, there is a pressing need to explore ways of seamlessly integrating message-passing techniques into Transformers as a unified encoder to accommodate pre-training on large-scale molecular graphs. Additionally, as discussed in Sec. 3, the pre-training objectives still leave much room for improvement, with efficient masking strategies for subcomponents in MCM being a prime example.

5.2 Building Reliable and Realistic Benchmarks

Despite the numerous studies conducted on CPMs, their experimental results can sometimes be unreliable due to the inconsistent evaluation settings employed (e.g., random seeds and dataset splits). For instance, on MoleculeNet [Wu et al., 2018], which contains several expensive datasets for molecular property prediction, the performance of the same model can vary significantly with different random seeds, possibly due to the relatively small scale of these molecular datasets. It is also crucial to establish more reliable and realistic benchmarks for CPMs that take out-of-distribution generalization into account. One solution is to evaluate CPMs through scaffold splitting, which splits molecules based on their substructures. In reality, researchers must often apply CPMs trained on already known molecules to newly synthesized, unknown molecules that may differ greatly in properties and belong to divergent domains. In this regard, the recently established Therapeutics Data Commons (TDC) [Huang et al., 2021] offers a promising opportunity to fairly evaluate CPMs across a diverse range of therapeutic applications.
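A simplified scaffold split can be sketched with RDKit's Bemis-Murcko scaffolds as below: molecules sharing a scaffold are kept in the same split so that test scaffolds are unseen during training. This is a stripped-down illustration of the MoleculeNet/TDC procedures, not their exact implementations.

```python
# A simplified sketch of scaffold splitting, assuming RDKit. Whole scaffold
# groups are assigned to either the training or the test split.
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, train_frac=0.8):
    groups = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)
        groups[scaffold].append(idx)
    # Assign whole scaffold groups (largest first) to train until the quota is met.
    ordered = sorted(groups.values(), key=len, reverse=True)
    cutoff = int(train_frac * len(smiles_list))
    train_idx, test_idx = [], []
    for group in ordered:
        target = train_idx if len(train_idx) + len(group) <= cutoff else test_idx
        target.extend(group)
    return train_idx, test_idx

train_idx, test_idx = scaffold_split(["CCO", "c1ccccc1O", "c1ccncc1", "CCN"])
```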
5.3 Broadening the Impact of CPMs

The ultimate goal of CPMs is to develop versatile molecular encoders that can be applied to various downstream tasks related to molecules. Compared to the progress of PLMs in the NLP community, there remains a substantial disparity between methodological advancements and practical applications of CPMs. On one hand, the representations produced by CPMs have not yet been extensively used to replace conventional molecular descriptors in chemistry, and these pre-trained models have not yet become standard tools for the community. On the other hand, there is limited exploration into how these models can benefit a more extensive range of downstream tasks beyond individual molecules, such as chemical reaction prediction, molecular similarity searches in virtual screening, retrosynthesis, chemical space exploration, and many others.

5.4 Establishing Theoretical Foundations

Even though CPMs have demonstrated impressive performance in various downstream tasks, a rigorous theoretical understanding of these models has been limited. This lack of underpinnings presents a hindrance to both the scientific community and industry stakeholders who seek to maximize the potential of these models. The theoretical foundations of CPMs must be established in order to fully comprehend their mechanisms and how they drive improved performance in various applications. For example, a recent empirical study [Sun et al., 2022] has questioned the superiority of certain self-supervised graph pre-training strategies over non-pre-trained counterparts in some downstream tasks. Further research is necessary to gain a more robust understanding of the effectiveness of different molecular pre-training objectives, so as to provide guidance for optimal methodology design.

Acknowledgements

This work was supported by the National Key R&D Program of China (2022ZD0115100), the National Natural Science Foundation of China (U21A20427), the Research Center for Industries of the Future (WU2022C043), and the Competitive Research Fund (WU2022A009) from the Westlake Center for Synthetic Biology and Integrated Bioengineering.
References

[Axelrod and Gómez-Bombarelli, 2022] S. Axelrod and R. Gómez-Bombarelli. GEOM, Energy-Annotated Molecular Conformations for Property Prediction and Molecular Generation. Sci. Data, 2022.
[Bagal et al., 2021] V. Bagal, R. Aggarwal, et al. MolGPT: Molecular Generation Using a Transformer-Decoder Model. J. Chem. Inf. Model., 2021.
[Berman et al., 2000] H. M. Berman, J. Westbrook, et al. The Protein Data Bank. Nucleic Acids Res., 2000.
[Brown et al., 2020] T. B. Brown, B. Mann, et al. Language Models are Few-Shot Learners. In NeurIPS, 2020.
[Chithrananda et al., 2020] S. Chithrananda, G. Grand, et al. ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction. arXiv.org, October 2020.
[Consonni and Todeschini, 2009] V. Consonni and R. Todeschini. Molecular Descriptors for Chemoinformatics. 2009.
[Corso et al., 2020] G. Corso, L. Cavalleri, et al. Principal Neighbourhood Aggregation for Graph Nets. In NeurIPS, 2020.
[Degen et al., 2008] J. Degen, C. Wegscheid-Gerlach, et al. On the Art of Compiling and Using ‘Drug-Like’ Chemical Fragment Spaces. ChemMedChem, 2008.
[Devlin et al., 2019] J. Devlin, M. Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL, 2019.
[Du et al., 2022a] W. Du, H. Zhang, et al. SE(3) Equivariant Graph Neural Networks with Complete Local Frames. In ICML, 2022.
[Du et al., 2022b] Y. Du, T. Fu, et al. MolGenSurvey: A Systematic Survey in Machine Learning Models for Molecule Design. arXiv.org, March 2022.
[Edwards et al., 2022] C. Edwards, T. Lai, et al. Translation between Molecules and Natural Language. In EMNLP, 2022.
[Fang et al., 2022a] X. Fang, L. Liu, et al. Geometry-Enhanced Molecular Representation Learning for Property Prediction. Nat. Mach. Intell., 2022.
[Fang et al., 2022b] Y. Fang, Q. Zhang, et al. Molecular Contrastive Learning with Chemical Element Knowledge Graph. In AAAI, 2022.
[Feng et al., 2022] J. Feng, Z. Wang, et al. MGMAE: Molecular Representation Learning by Reconstructing Heterogeneous Graphs with A High Mask Ratio. In CIKM, 2022.
[Gaulton et al., 2012] A. Gaulton, L. J. Bellis, et al. ChEMBL: A Large-Scale Bioactivity Database for Drug Discovery. Nucleic Acids Res., 2012.
[Guo et al., 2022] Z. Guo, P. K. Sharma, et al. Multilingual Molecular Representation Learning via Contrastive Pre-training. In ACL, 2022.
[Hassani and Ahmadi, 2020] K. Hassani and A. H. K. Ahmadi. Contrastive Multi-View Representation Learning on Graphs. In ICML, 2020.
[Hawkins, 2017] P. C. D. Hawkins. Conformation Generation: The State of the Art. J. Chem. Inf. Model., 2017.
[He et al., 2016] K. He, X. Zhang, et al. Deep Residual Learning for Image Recognition. In CVPR, 2016.
[Hjelm et al., 2019] R. D. Hjelm, A. Fedorov, et al. Learning Deep Representations by Mutual Information Estimation and Maximization. In ICLR, 2019.
[Ho et al., 2020] J. Ho, A. Jain, et al. Denoising Diffusion Probabilistic Models. In NeurIPS, 2020.
[Honda et al., 2019] S. Honda, S. Shi, et al. SMILES Transformer: Pre-trained Molecular Fingerprint for Low Data Drug Discovery. arXiv.org, November 2019.
[Hou et al., 2022] Z. Hou, X. Liu, et al. GraphMAE: Self-Supervised Masked Graph Autoencoders. In KDD, 2022.
[Hu et al., 2020a] W. Hu, B. Liu, et al. Strategies for Pre-training Graph Neural Networks. In ICLR, 2020.
[Hu et al., 2020b] Z. Hu, Y. Dong, et al. GPT-GNN: Generative Pre-Training of Graph Neural Networks. In KDD, 2020.
[Hu et al., 2021] W. Hu, M. Fey, et al. OGB-LSC: A Large-Scale Challenge for Machine Learning on Graphs. In NeurIPS Datasets and Benchmarks, 2021.
[Huang et al., 2021] K. Huang, T. Fu, et al. Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development. In NeurIPS Datasets and Benchmarks, 2021.
[Jiao et al., 2023] R. Jiao, J. Han, et al. Energy-Motivated Equivariant Pretraining for 3D Molecular Graphs. In AAAI, 2023.
[Kearnes et al., 2016] S. Kearnes, K. McCloskey, et al. Molecular Graph Convolutions: Moving Beyond Fingerprints. J. Comput. Aided Mol. Des., 2016.
[Kim et al., 2022] D. Kim, J. Baek, et al. Graph Self-supervised Learning with Accurate Discrepancy Learning. In NeurIPS, 2022.
[Kipf and Welling, 2017] T. N. Kipf and M. Welling. Semi-Supervised Classification with Graph Convolutional Networks. In ICLR, 2017.
[Krenn et al., 2020] M. Krenn, F. Häse, et al. Self-Referencing Embedded Strings (SELFIES): A 100% Robust Molecular String Representation. Mach. Learn. Sci. Technol., 2020.
[Li et al., 2021] P. Li, J. Wang, et al. An Effective Self-Supervised Framework for Learning Expressive Molecular Global Representations to Drug Discovery. Briefings Bioinform., 2021.
[Li et al., 2022a] H. Li, D. Zhao, et al. KPGT: Knowledge-Guided Pre-training of Graph Transformer for Molecular Property Prediction. In KDD, 2022.
[Li et al., 2022b] S. Li, J. Zhou, et al. GeomGCL: Geometric Graph Contrastive Learning for Molecular Property Prediction. In AAAI, 2022.
[Lin et al., 2022] X. Lin, C. Xu, et al. PanGu Drug Model: Learn a Molecule Like a Human. bioRxiv.org, April 2022.
[Liu et al., 2022] S. Liu, H. Wang, et al. Pre-training Molecular Graph Representation with 3D Geometry. In ICLR, 2022.
[Liu et al., 2023] S. Liu, H. Guo, et al. Molecular Geometry Pretraining with SE(3)-Invariant Denoising Distance Matching. In ICLR, 2023.
[Nguyen et al., 2021] T. Nguyen, H. Le, et al. GraphDTA: Predicting Drug-Target Binding Affinity with Graph Neural Networks. Bioinform., 2021.
[Pinheiro et al., 2022] G. A. Pinheiro, J. L. Da Silva, et al. SMICLR: Contrastive Learning on Multiple Molecular Representations for Semisupervised and Unsupervised Representation Learning. J. Chem. Inf. Model., 2022.
[Raffel et al., 2020] C. Raffel, N. Shazeer, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res., 2020.
[Rong et al., 2020] Y. Rong, Y. Bian, et al. Self-Supervised Graph Transformer on Large-Scale Molecular Data. In NeurIPS, 2020.
[Ross et al., 2022] J. Ross, B. Belgodere, et al. Molformer: Large Scale Chemical Language Representations Capture Molecular Structure and Properties. Nat. Mach. Intell., 2022.
[Sanchez-Gonzalez et al., 2020] A. Sanchez-Gonzalez, J. Godwin, et al. Learning to Simulate Complex Physics with Graph Networks. In ICML, 2020.
[Satorras et al., 2021] V. G. Satorras, E. Hoogeboom, et al. E(n) Equivariant Graph Neural Networks. In ICML, 2021.
[Schütt et al., 2017] K. Schütt, P.-J. Kindermans, et al. SchNet: A Continuous-Filter Convolutional Neural Network for Modeling Quantum Interactions. In NIPS, 2017.
[Schütt et al., 2021] K. Schütt, O. T. Unke, et al. Equivariant Message Passing for the Prediction of Tensorial Properties and Molecular Spectra. In ICML, 2021.
[Stärk et al., 2022] H. Stärk, D. Beaini, et al. 3D Infomax Improves GNNs for Molecular Property Prediction. In ICML, 2022.
[Sterling and Irwin, 2015] T. Sterling and J. J. Irwin. ZINC 15–Ligand Discovery for Everyone. J. Chem. Inf. Model., 2015.
[Stumpfe et al., 2019] D. Stumpfe, H. Hu, et al. Evolving Concept of Activity Cliffs. ACS Omega, 2019.
[Sun et al., 2020] F. Sun, J. Hoffmann, et al. InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization. In ICLR, 2020.
[Sun et al., 2021] M. Sun, J. Xing, et al. MoCL: Contrastive Learning on Molecular Graphs with Multi-level Domain Knowledge. In KDD, 2021.
[Sun et al., 2022] R. Sun, H. Dai, et al. Does GNN Pretraining Help Molecular Representation? In NeurIPS, 2022.
[Suresh et al., 2021] S. Suresh, P. Li, et al. Adversarial Graph Augmentation to Improve Graph Contrastive Learning. In NeurIPS, 2021.
[Vaswani et al., 2017] A. Vaswani, N. Shazeer, et al. Attention is All You Need. In NIPS, 2017.
[Velickovic et al., 2018] P. Velickovic, G. Cucurull, et al. Graph Attention Networks. In ICLR, 2018.
[Velickovic et al., 2019] P. Velickovic, W. Fedus, et al. Deep Graph Infomax. In ICLR, 2019.
[Wang et al., 2017] Y. Wang, S. H. Bryant, et al. PubChem BioAssay: 2017 Update. Nucleic Acids Res., 2017.
[Wang et al., 2019] S. Wang, Y. Guo, et al. SMILES-BERT: Large Scale Unsupervised Pre-Training for Molecular Property Prediction. In BCB, 2019.
[Wang et al., 2021] Y. Wang, Y. Min, et al. Molecular Graph Contrastive Learning with Parameterized Explainable Augmentations. In BIBM, 2021.
[Wang et al., 2022a] Y. Wang, R. Magar, et al. Improving Molecular Contrastive Learning via Faulty Negative Mitigation and Decomposed Fragment Contrast. J. Chem. Inf. Model., 2022.
[Wang et al., 2022b] Y. Wang, J. Wang, et al. MolCLR: Molecular Contrastive Learning of Representations via Graph Neural Networks. Nat. Mach. Intell., 2022.
[Weininger, 1988] D. Weininger. SMILES, A Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules. J. Chem. Inf. Comput. Sci., 1988.
[Wu et al., 2018] Z. Wu, B. Ramsundar, et al. MoleculeNet: A Benchmark for Molecular Machine Learning. Chem. Sci., 2018.
[Xia et al., 2022a] J. Xia, L. Wu, et al. ProGCL: Rethinking Hard Negative Mining in Graph Contrastive Learning. In ICML, 2022.
[Xia et al., 2022b] J. Xia, L. Wu, et al. SimGRACE: A Simple Framework for Graph Contrastive Learning without Data Augmentation. In WWW, 2022.
[Xia et al., 2023] J. Xia, C. Zhao, et al. Mole-BERT: Rethinking Pre-training Graph Neural Networks for Molecules. In ICLR, 2023.
[Xu et al., 2019] K. Xu, W. Hu, et al. How Powerful are Graph Neural Networks? In ICLR, 2019.
[Xu et al., 2021a] D. Xu, W. Cheng, et al. InfoGCL: Information-Aware Graph Contrastive Learning. In NeurIPS, 2021.
[Xu et al., 2021b] M. Xu, H. Wang, et al. Self-supervised Graph-level Representation Learning with Local and Global Structure. In ICML, 2021.
[Xu et al., 2021c] Z. Xu, Y. Luo, et al. Molecule3D: A Benchmark for Predicting 3D Geometries from Molecular Graphs. arXiv.org, September 2021.
[Yi et al., 2022] J. Yi, C. Wu, et al. MICER: A Pre-Trained Encoder–Decoder Architecture for Molecular Image Captioning. Bioinform., 2022.
[Ying et al., 2021] C. Ying, T. Cai, et al. Do Transformers Really Perform Badly for Graph Representation? In NeurIPS, 2021.
[You et al., 2020] Y. You, T. Chen, et al. Graph Contrastive Learning with Augmentations. In NeurIPS, 2020.
[You et al., 2021] Y. You, T. Chen, et al. Graph Contrastive Learning Automated. In ICML, 2021.
[You et al., 2022] Y. You, T. Chen, et al. Bringing Your Own View: Graph Contrastive Learning without Prefabricated Data Augmentations. In WSDM, 2022.
[Zaidi et al., 2023] S. Zaidi, M. Schaarschmidt, et al. Pre-training via Denoising for Molecular Property Prediction. In ICLR, 2023.
[Zeng et al., 2022a] X. Zeng, H. Xiang, et al. Accurate Prediction of Molecular Properties and Drug Targets Using a Self-Supervised Image Representation Learning Framework. Nat. Mach. Intell., 2022.
[Zeng et al., 2022b] Z. Zeng, Y. Yao, et al. A Deep-Learning System Bridging Molecule Structure and Biomedical Text with Comprehension Comparable to Human Professionals. Nat. Commun., 2022.
[Zhang et al., 2021] Z. Zhang, Q. Liu, et al. Motif-based Graph Self-Supervised Learning for Molecular Property Prediction. In NeurIPS, 2021.
[Zhou et al., 2023] G. Zhou, Z. Gao, et al. Uni-Mol: A Universal 3D Molecular Representation Learning Framework. In ICLR, 2023.
[Zhu et al., 2021] J. Zhu, Y. Xia, et al. Dual-view Molecule Pre-training. arXiv.org, June 2021.
[Zhu et al., 2022a] J. Zhu, Y. Xia, et al. Unified 2D and 3D Pre-Training of Molecular Representations. In KDD, 2022.
[Zhu et al., 2022b] Y. Zhu, D. Chen, et al. Featurizations Matter: A Multiview Contrastive Learning Approach to Molecular Pretraining. In AI4Science@ICML, 2022.
