Bioinformatics, YYYY, 0–0
doi: 10.1093/bioinformatics/xxxxx
Advance Access Publication Date: DD Month YYYY
Applications Note

Downloaded from https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac842/6961187 by guest on 27 December 2022


Genome analysis

ASTER: accurately estimating the number of cell types in single-cell chromatin accessibility data

Shengquan Chen1,*, Rongxiang Wang2, Wenxin Long1 and Rui Jiang2,*

1School of Mathematical Sciences and LPMC, Nankai University, Tianjin 300071, China and 2Ministry of Education Key Laboratory of Bioinformatics, Research Department of Bioinformatics at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
* To whom correspondence should be addressed.

Associate Editor: XXXXXXX
Received on XXXXX; revised on XXXXX; accepted on XXXXX

Abstract
Summary: Recent innovations in single-cell chromatin accessibility sequencing (scCAS) have revolutionized the characterization of epigenomic heterogeneity. Estimation of the number of cell types is a crucial step for downstream analyses and biological implications. However, efforts to perform estimation specifically for scCAS data are limited. Here we propose ASTER, an ensemble learning-based tool for accurately estimating the number of cell types in scCAS data. ASTER outperformed baseline methods in systematic evaluation on 27 datasets of various protocols, sizes, numbers of cell types, degrees of cell-type imbalance, cell states, and qualities, providing valuable guidance for scCAS data analysis.
Availability and implementation: ASTER along with detailed documentation is freely accessible at https://aster.readthedocs.io/ under the MIT License. It can be seamlessly integrated into existing scCAS analysis workflows. The source code is available at https://github.com/biox-nku/aster.
Contact: chenshengquan@nankai.edu.cn and ruijiang@tsinghua.edu.cn
Supplementary information: Supplementary data are available at Bioinformatics online.

© The Author(s) 2022. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

1 Introduction
Rapid advances in single-cell chromatin accessibility sequencing (scCAS) technologies, such as single-cell assay for transposase-accessible chromatin with sequencing (scATAC-seq), have enabled the characterization of epigenomic heterogeneity and the interrogation of gene regulation at an unprecedented resolution. A number of embedding and clustering methods have been proposed to identify groups of cells with similar epigenomic patterns in scCAS data (Chen, et al., 2019; Chen, et al., 2021). However, none of these methods suggests the number of cell types present in the data, which is crucial in clustering analysis and can be critical for downstream analyses of single-cell data (Yu, et al., 2022).

Several methods have been proposed specifically for cell-type number estimation in single-cell RNA sequencing (scRNA-seq) data (Supplementary Text S1), and their performance has been benchmarked systematically (Yu, et al., 2022). Although almost all the widely-used scCAS data analysis workflows, e.g., Signac (Stuart, et al., 2021), ArchR (Granja, et al., 2021) and EpiScanpy (Danese, et al., 2021), adopt community detection-based techniques to find the best possible grouping, the estimation of the number of cell types in scCAS data is still typically subjective and relies largely on the investigator's desired clustering resolution and/or prior knowledge (Supplementary Text S2).

To address this need, we propose a Python package named ASTER to accurately estimate the number of cell types in scCAS data.

2 Methods
Given a peak-by-cell scCAS data matrix $\mathbf{X} \in \mathbb{R}^{p \times n}$, ASTER estimates the number of cell types based on ensemble strategies (Fig. 1a). Firstly, ASTER performs estimation based on the within-cluster sum-of-squares (WSS) criterion. Specifically, ASTER applies term frequency-inverse document frequency (TF-IDF) transformation (V1) to matrix $\mathbf{X}$ (Supplementary Text S3). ASTER then performs principal component analysis (PCA) using the widely-used EpiScanpy workflow and performs K-Means clustering to measure

$$\mathrm{WSS} = \sum_{i=1}^{N} \min_{\mu_j \in C} \left( \lVert x_i - \mu_j \rVert^2 \right),$$

where $N$ denotes the number of cells, $x_i$ the representation of the $i$-th cell, $\mu_j$ the representation of the $j$-th cluster center, and $C$ the resulting clusters.
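
As a rough illustration of the WSS criterion, the sketch below computes WSS over a range of $k$ on toy 2-D points and then picks a knee of the resulting $k$-versus-WSS curve. It is not ASTER's implementation: the K-Means here is a toy pure-Python Lloyd's algorithm with a deterministic farthest-point initialization (ASTER relies on the EpiScanpy workflow and K-Means on PCA embeddings), the knee is chosen by a simplified chord-distance heuristic standing in for the Kneedle detector of Satopaa et al. (2011), and all function names and data are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def wss(points, centers):
    """Within-cluster sum of squares: every point contributes its squared
    distance to the nearest cluster center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def kmeans(points, k, n_iter=50):
    """Toy Lloyd's K-Means (assumption: deterministic farthest-point init)."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from its nearest existing center.
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            groups[nearest].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        updated = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if updated == centers:
            break
        centers = updated
    return centers


def knee_by_chord(ks, scores):
    """Simplified knee pick (not the full Kneedle algorithm): rescale both axes
    to [0, 1] and return the k whose point lies farthest below the chord from
    (0, 1) to (1, 0), which fits a decreasing curve such as k-versus-WSS."""
    k_lo, k_hi = ks[0], ks[-1]
    s_lo, s_hi = min(scores), max(scores)
    best_k, best_gap = ks[0], float("-inf")
    for k, s in zip(ks, scores):
        x = (k - k_lo) / (k_hi - k_lo)
        y = (s - s_lo) / (s_hi - s_lo)
        gap = (1 - x) - y
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k


# Hypothetical toy data: three well-separated groups of four points each.
blobs = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 0), (10, 1), (11, 0), (11, 1),
    (0, 10), (0, 11), (1, 10), (1, 11),
]
ks = list(range(1, 7))
curve = [wss(blobs, kmeans(blobs, k)) for k in ks]
best_k = knee_by_chord(ks, curve)
print(best_k)
```

In practice, scikit-learn's `KMeans` exposes the same quantity as its `inertia_` attribute, so the hand-written `wss` and `kmeans` helpers are only there to keep the sketch dependency-free.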
A good clustering is one with a low WSS score and a low number ($k$) of clusters. However, this is a tradeoff because WSS decreases as $k$ increases. Therefore, we adopt an elbow method to identify the elbow/knee point (the point with maximum curvature) of the $k$-versus-WSS curve (Satopaa, et al., 2011). The $k$ of the elbow point is adopted as the optimal number of clusters.

Secondly, ASTER performs estimation based on the Davies-Bouldin index (Davies and Bouldin, 1979). Instead of TF-IDF transformation (V1), ASTER applies another widely-used TF-IDF transformation (V2) to $\mathbf{X}$ (Supplementary Text S3). ASTER then performs PCA and K-Means as above to measure the Davies-Bouldin index, which is defined as

$$\frac{1}{k} \sum_{i=1}^{k} \max_{i \neq j} \frac{s_i + s_j}{d_{ij}},$$

where $k$ is the number of clusters, $s_i$ the average distance between each cell of cluster $i$ and the centroid of that cluster, and $d_{ij}$ the distance between cluster centroids $i$ and $j$. A lower index indicates a better partition, and the $k$ that provides the minimum index is thus adopted as the optimal number of clusters.

Thirdly, ASTER performs estimation based on the silhouette coefficient (Rousseeuw, 1987), which is defined for a single cell as

$$\frac{b - a}{\max(a, b)},$$

where $a$ denotes the mean distance between the cell and all other cells in the same cluster, and $b$ denotes the mean distance between the cell and all other cells in the next nearest cluster. ASTER performs TF-IDF transformation (V2) and PCA as above, and then constructs a neighborhood graph of cells using the EpiScanpy workflow. Instead of K-Means clustering, ASTER adopts another two widely-used clustering methods, i.e., Louvain and Leiden, which require a resolution parameter but not the number of clusters. To obtain the desired number of clusters, a binary search strategy is usually adopted (Chen, et al., 2019; Chen, et al., 2021; Danese, et al., 2021). However, each attempt in the search process is time-consuming, especially for large data. To speed up the search process, we further improve the search strategy based on the weighted bias as follows:

$$r_{next} = r_{this} + (r_{max} - r_{min}) \times \frac{k - k_{min}}{k_{max} - k_{min}},$$

where $r_{next}$ and $r_{this}$ denote the resolutions in the next and this attempt, respectively, $r_{max}$ and $r_{min}$ denote the maximum and minimum resolutions to be searched, respectively, and $k_{max}$ and $k_{min}$ denote the obtained numbers of clusters using the maximum and minimum resolutions, respectively. For each $k$, ASTER calculates the mean silhouette coefficient of all cells based on Louvain and Leiden clustering results, respectively, and then sums up the two means. A higher silhouette coefficient relates to a model with better-defined clusters, and the $k$ that provides the maximum coefficient is thus adopted as the optimal number of clusters.

Finally, ASTER estimates the number of cell types by averaging the three numbers estimated above, that is, the ensemble estimation is based on three metrics, two TF-IDF approaches, and three clustering methods. Besides, building upon the widely-used AnnData format, ASTER can be seamlessly integrated into the EpiScanpy analysis workflow.

Fig. 1. Benchmarking results of various methods based on 27 scCAS datasets. (a) The ensemble estimation strategy of ASTER. (b) Performance of different methods on datasets generated from different species and protocols, and with various sizes, dimensions, numbers of batches, numbers of cell types, proportions of the major type, degrees of cell-type imbalance, levels of sparsity, and cell states. Note that we encountered memory errors (exceeded 256 GB) when performing scCCESS and scLCA on BoneMarrowB. (c) P-values of one-sided paired Wilcoxon signed-rank tests that test if a method (one of the row names) achieves significantly lower absolute estimation deviation on the 27 datasets than another method (one of the column names). (d) The performance of various methods on BoneMarrowA at different dropout rates evaluated by estimation error.

3 Results
We evaluated the performance of ASTER by estimation error (the difference between the estimated and true number of cell types) and estimation deviation (the estimation error normalized by the true number of cell types) as recommended in a recent benchmark study (Yu, et al., 2022). Note that this task is different from clustering, and higher clustering concordance does not necessarily mean a more accurate estimation (Yu, et al., 2022).
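
The two evaluation measures defined above, and the one-sided paired Wilcoxon signed-rank test described in the caption of Fig. 1c, can be sketched as follows. This is only an illustration under stated assumptions: the sign convention (estimated minus true) is assumed because the text does not spell it out, the function names and the example numbers are hypothetical and are not results from the paper, and the p-value is computed from the exact null distribution with average ranks for ties (detected by exact equality).

```python
def estimation_error(estimated, true):
    """Difference between estimated and true number of cell types
    (assumed sign convention: estimated minus true)."""
    return estimated - true


def estimation_deviation(estimated, true):
    """Estimation error normalized by the true number of cell types."""
    return estimation_error(estimated, true) / true


def signed_rank_p_less(x, y):
    """Exact one-sided paired Wilcoxon signed-rank p-value for the
    alternative that x tends to be smaller than y."""
    d = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(d)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks2 = [0] * n  # twice the average rank, so tied ranks stay integers
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        for t in range(i, j + 1):
            ranks2[order[t]] = i + j + 2  # 2 * mean of ranks i+1 .. j+1
        i = j + 1
    w_plus2 = sum(r for r, v in zip(ranks2, d) if v > 0)
    # Null distribution of the positive-rank sum: each rank is included
    # or excluded with equal probability (dynamic programming over sums).
    counts = {0: 1}
    for r in ranks2:
        new = dict(counts)
        for s, c in counts.items():
            new[s + r] = new.get(s + r, 0) + c
        counts = new
    favorable = sum(c for s, c in counts.items() if s <= w_plus2)
    return favorable / 2 ** n


# Hypothetical example: true cell-type counts and two methods' estimates.
true_k = [5, 8, 10, 4]
method_a = [5, 7, 11, 4]
method_b = [9, 3, 16, 8]
dev_a = [abs(estimation_deviation(e, t)) for e, t in zip(method_a, true_k)]
dev_b = [abs(estimation_deviation(e, t)) for e, t in zip(method_b, true_k)]
p_value = signed_rank_p_less(dev_a, dev_b)
print(p_value)
```

With SciPy available, `scipy.stats.wilcoxon(dev_a, dev_b, alternative="less")` performs the same kind of test and is the more practical choice for real data.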
We compared the performance of ASTER with four methods (Supplementary Text S4), including Louvain and Leiden with default resolution, two widely-used methods in scCAS data analysis, and scCCESS and scLCA (Cheng, et al., 2019), two of the best methods in the most recent benchmark study for scRNA-seq data (Yu, et al., 2022). We collected 27 datasets generated from different protocols, and with various sizes, dimensions, numbers of batches, numbers of cell types, degrees of cell-type imbalance, cell states, and levels of sparsity for systematic benchmarking (Supplementary Text S5 and Table S1).

As shown in Fig. 1b, ASTER accurately estimates the number of cell types in scCAS data and significantly outperformed the baseline methods. First, ASTER performed well on BM0828BoneMarrow (a dataset of differentiating bone marrow cells from a donor), indicating its ability to handle datasets where expression changes among cells are expected to be gradients. Second, ASTER performed well on CLP/CMP/MPP (a subset of bone marrow cells from 4 donors) and BoneMarrowA (the entire dataset of bone marrow cells from 7 donors), indicating its ability to handle datasets derived from multiple batches. However, ASTER does not model the batch variation specifically. Since technical variation may be large in some scCAS datasets, we recommend performing batch correction before estimating the number of cell types by ASTER. Third, ASTER also outperformed other methods on Melanoma (a dataset of cells in time series after knockdown of SOX10 in two short-term patient cultures). Fourth, we evaluated ASTER on BoneMarrowB, a dataset containing 136,463 cells from 2 batches. ASTER again provided superior performance, indicating its ability to handle large-scale datasets, which can lead to poor estimation in the benchmark study (Yu, et al., 2022). Fifth, in addition to differentiating cell states, we also demonstrated the superior performance of ASTER on three differentiated cell-line mixtures. Sixth, in addition to the above human datasets generated by 3 different protocols, ASTER also outperformed other methods on 19 mouse datasets generated by another 3 protocols. Seventh, ASTER also performed well on challenging datasets generated from complex tissues and with high degrees of cell-type imbalance. One-sided paired Wilcoxon signed-rank tests further demonstrated that the advantages of ASTER over the baseline methods were significant (Fig. 1c). We provide more details of the above results in Supplementary Text S6 and Figs. S1-2.

To mimic protocols that generate sparser scCAS data, we downsampled the reads in BoneMarrowA, which provides cell-type labels after fluorescence-activated cell sorting. ASTER consistently outperformed other methods when the dropout rate varied from 5% to 90% (Fig. 1d). We also performed model ablation analysis to demonstrate the advantage of the ensemble strategies of ASTER (Supplementary Text S7 and Figs. S3-4). Besides, among all the 27 datasets, the improved Louvain and Leiden clustering strategies in ASTER reduced the average number of searches by 4.74 and 2.04, respectively.

4 Conclusion
Based on comprehensive experiments on multiple datasets, ASTER provides an accurate way to estimate the number of cell types in scCAS data. We anticipate that ASTER will provide valuable guidance and greatly assist with refining cell ontology in scCAS data analysis.

Funding
This work was supported by the National Key Research and Development Program of China [2021YFF1200902] and the National Natural Science Foundation of China [62203236, 62273194, 61873141, 61721003].

Conflict of Interest: none declared.

References
Chen, H., et al. Assessment of computational methods for the analysis of single-cell ATAC-seq data. Genome Biol. 2019;20(1):241.
Chen, S., et al. RA3 is a reference-guided approach for epigenetic characterization of single cells. Nat. Commun. 2021;12(1).
Cheng, C., et al. Latent cellular analysis robustly reveals subtle diversity in large-scale single-cell RNA-seq data. Nucleic Acids Res. 2019;47(22):e143.
Danese, A., et al. EpiScanpy: integrated single-cell epigenomic analysis. Nat. Commun. 2021;12(1).
Davies, D.L. and Bouldin, D.W. A cluster separation measure. IEEE Trans Pattern Anal Mach Intell 1979;1(2):224-227.
Granja, J.M., et al. ArchR is a scalable software package for integrative single-cell chromatin accessibility analysis. Nat. Genet. 2021;53(3):403-411.
Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 1987;20:53-65.
Satopaa, V., et al. Finding a "Kneedle" in a Haystack: Detecting Knee Points in System Behavior. In, 2011 31st International Conference on Distributed Computing Systems Workshops. 2011. p. 166-171.
Stuart, T., et al. Single-cell chromatin state analysis with Signac. Nat. Methods 2021;18(11):1333-1341.
Yu, L., et al. Benchmarking clustering algorithms on estimating the number of cell types from single-cell RNA-sequencing data. Genome Biol. 2022;23(1).
