ASTER Accurately Estimating The Number of Cell Typ
Bioinformatics, YYYY, 0–0
doi: 10.1093/bioinformatics/xxxxx
Advance Access Publication Date: DD Month YYYY
Applications Note
S.Chen et al.
good clustering is one with a low WSS score and a low number ($k$) of clusters. However, this is a tradeoff because WSS decreases as $k$ increases. Therefore, we adopt an elbow method to identify the elbow/knee point (the point with maximum curvature) of a $k$-versus-WSS line (Satopaa, et al., 2011). The $k$ of the elbow point is adopted as the optimal number of clusters.

Secondly, ASTER performs estimation based on the Davies-Bouldin index (Davies and Bouldin, 1979). Instead of TF-IDF transformation (V1), ASTER applies another widely-used TF-IDF transformation (V2) to $\mathbf{X}$ (Supplementary Text S3). ASTER then performs PCA and K-Means as above to measure the Davies-Bouldin index, which is defined as

\[ \frac{1}{k} \sum_{i=1}^{k} \max_{i \neq j} \frac{s_i + s_j}{d_{ij}}, \]

where $k$ is the number of clusters, $s_i$ the average distance between each cell of cluster $i$ and the centroid of that cluster, and $d_{ij}$ the distance between cluster centroids $i$ and $j$. A lower index indicates a better partition, and the $k$ that provides the minimum index is thus adopted as the optimal number of clusters.

Thirdly, ASTER performs estimation based on the silhouette coefficient (Rousseeuw, 1987), which is defined for a single cell as

\[ \frac{b - a}{\max(a, b)}, \]

where $a$ denotes the mean distance between the cell and all other cells in the same cluster, $b$ denotes the mean distance between the cell and all other cells in the next nearest cluster. ASTER performs TF-IDF transformation (V2) and PCA as above, and then constructs a neighborhood graph of cells using the EpiScanpy workflow. Instead of K-Means clustering, ASTER adopts another two widely-used clustering methods, i.e., Louvain and Leiden, which require a resolution parameter but not the number of clusters. To obtain the desired number of clusters, a binary search strategy is usually adopted (Chen, et al., 2019; Chen, et al., 2021; Danese, et al., 2021). However, each attempt in the search process is time-consuming, especially for large data. To speed up the search process, we further improve the search strategy based on the weighted bias as follows:

\[ r_{next} = r_{this} + (r_{max} - r_{min}) \times \frac{k - k_{min}}{k_{max} - k_{min}}, \]

where $r_{next}$ and $r_{this}$ denote the resolutions in the next and this attempt, respectively, $r_{max}$ and $r_{min}$ denote the maximum and minimum resolutions to be searched, respectively, $k_{max}$ and $k_{min}$ denote the obtained numbers of clusters using the maximum and minimum resolutions, respectively. For each $k$, ASTER calculates the mean silhouette coefficient of all cells based on Louvain and Leiden clustering results, respectively, and then sums up the two means. A higher silhouette coefficient relates to a model with better-defined clusters, and the $k$ that provides the maximum coefficient is thus adopted as the optimal number of clusters.

Finally, ASTER estimates the number of cell types by averaging the three numbers estimated above, that is, the ensemble estimation is based on three metrics, two TF-IDF approaches, and three clustering methods. Besides, building upon the widely-used AnnData format, ASTER can be seamlessly integrated into the EpiScanpy analysis workflow.

3 Results

We evaluated the performance of ASTER by estimation error (the difference between the estimated and true number of cell types) and estimation deviation (the estimation error normalized by the true number of cell types) as recommended in a recent benchmark study (Yu, et al., 2022). Note that this task is different from clustering and higher clustering concordance does not necessarily mean a more accurate estimation (Yu,
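The weighted-bias update used in the resolution search of the Methods can be sketched in a few lines of Python. This is a minimal sketch that only transcribes the printed formula. The function name `next_resolution` and its argument names are hypothetical, and $k$ is assumed here to be the desired number of clusters, which the text does not state explicitly. How the bounds are refreshed between attempts is not specified in this excerpt, so the sketch covers a single step only.

```python
def next_resolution(r_this, r_min, r_max, k, k_min, k_max):
    """One weighted-bias update of the clustering resolution.

    Hypothetical helper: a direct transcription of
    r_next = r_this + (r_max - r_min) * (k - k_min) / (k_max - k_min).
    Assumption: k is the desired number of clusters.
    """
    if k_max == k_min:
        # The weight is undefined when both bounds yield the same cluster count.
        raise ValueError("k_max and k_min must differ to compute the weight")
    weight = (k - k_min) / (k_max - k_min)
    return r_this + (r_max - r_min) * weight


if __name__ == "__main__":
    # Illustrative values only: the target k = 6 lies halfway between
    # k_min = 2 and k_max = 10, so the step is half of the resolution range.
    print(next_resolution(r_this=0.5, r_min=0.0, r_max=2.0, k=6, k_min=2, k_max=10))  # prints 1.5
```

Compared with a binary search, which always moves by half of the current interval, the step here scales with how far the target $k$ sits between the cluster counts obtained at the two bounds.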
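The silhouette-based selection in the Methods (a per-cell coefficient, its mean over all cells for Louvain and for Leiden, the sum of the two means, then the $k$ with the largest sum) could look roughly as follows. This is a sketch under stated assumptions, not ASTER's implementation: the helper names and the dictionaries mapping each $k$ to a mean coefficient are hypothetical, the toy values are made up purely for illustration, and returning 0 when both distances are zero is a convention chosen here.

```python
def silhouette_coefficient(a, b):
    """Silhouette of one cell: (b - a) / max(a, b).

    a: mean distance to the other cells of the same cluster.
    b: mean distance to the cells of the next nearest cluster.
    """
    denom = max(a, b)
    if denom == 0:
        # Assumed convention for the degenerate case a == b == 0.
        return 0.0
    return (b - a) / denom


def select_k(louvain_means, leiden_means):
    """Pick the k whose summed mean silhouette (Louvain + Leiden) is largest.

    Both arguments are hypothetical dicts {k: mean silhouette over all cells}.
    Only values of k present in both dicts are compared.
    """
    shared = sorted(louvain_means.keys() & leiden_means.keys())
    if not shared:
        raise ValueError("no k was evaluated with both clustering methods")
    summed = {k: louvain_means[k] + leiden_means[k] for k in shared}
    return max(summed, key=summed.get)


if __name__ == "__main__":
    print(silhouette_coefficient(1.0, 4.0))  # prints 0.75
    # Toy means, for illustration only.
    louvain = {3: 0.2, 4: 0.5, 5: 0.4}
    leiden = {3: 0.3, 4: 0.4, 5: 0.45}
    print(select_k(louvain, leiden))  # prints 4
```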
scCCESS and scLCA (Cheng, et al., 2019), two of the best methods in the most recent benchmark study for scRNA-seq data (Yu, et al., 2022). We

Conflict of Interest: none declared.