
Bioinformatics, 36(12), 2020, 3927–3929

doi: 10.1093/bioinformatics/btaa205
Advance Access Publication Date: 27 March 2020
Applications Note

Downloaded from https://academic.oup.com/bioinformatics/article-abstract/36/12/3927/5812779 by University Libraries | Virginia Tech user on 27 June 2020
Data and text mining
debCAM: a bioconductor R package for fully unsupervised deconvolution of complex tissues
Lulu Chen1, Chiung-Ting Wu1, Niya Wang2, David M. Herrington3, Robert Clarke4 and Yue Wang1,*
1Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, USA, 2Search Ranking Unit, Google LLC, Mountain View, CA 94043, USA, 3Department of Internal Medicine, Wake Forest University, Winston-Salem, NC 27157, USA and 4Department of Oncology, Lombardi Comprehensive Cancer Center, Georgetown University, Washington, DC 20057, USA
*To whom correspondence should be addressed.
Associate Editor: Jonathan Wren
Received on October 18, 2019; revised on March 5, 2020; editorial decision on March 19, 2020; accepted on March 23, 2020

Abstract
Summary: We develop a fully unsupervised deconvolution method to dissect complex tissues into molecularly distinctive tissue or cell subtypes based on bulk expression profiles. We implement an R package, deconvolution by Convex Analysis of Mixtures (debCAM), that can automatically detect tissue/cell-specific markers, determine the number of constituent subtypes, calculate subtype proportions in individual samples and estimate tissue/cell-specific expression profiles. We demonstrate the performance and biomedical utility of debCAM on gene expression, methylation, proteomics and imaging data. With enhanced data preprocessing and prior knowledge incorporation, the debCAM software tool will allow biologists to perform a more comprehensive and unbiased characterization of tissue remodeling in many biomedical contexts.
Availability and implementation: http://bioconductor.org/packages/debCAM.
Contact: yuewang@vt.edu
Supplementary information: Supplementary data are available at Bioinformatics online.

1 Introduction

Tissue heterogeneity serves as both an underexploited information source in characterizing complex tissue phenotypes and a major confounding factor in studying individual tissue or cell subtypes. Having analytic tools to define the molecular landscape of tissue heterogeneity, and to determine how subtypes are remodeled with phenotypic transitions, will be essential for the next step in systems biology research. We have developed a fully unsupervised deconvolution method, namely, deconvolution by Convex Analysis of Mixtures (debCAM), to dissect complex tissues into molecularly distinctive tissue or cell subtypes based on bulk expression profiles (Avila Cobos et al., 2018; Chan et al., 2008; Wang et al., 2016). Importantly, debCAM requires no a priori information on the number, signatures or compositions of the subtypes present in heterogeneous samples, and does not require pure subtype references. Supported by a well-grounded mathematical framework (Chan et al., 2008; Wang et al., 2016), debCAM automatically detects tissue/cell-specific markers, determines the number of constituent subtypes, calculates subtype proportions in individual samples and estimates tissue/cell-specific expression profiles.

The debCAM R package implements and tests the latest functionalities of the debCAM algorithm pipeline in the literature. We have demonstrated the real biomedical utility of the debCAM tool on gene expression (Wang et al., 2016), proteomics (Herrington et al., 2018), imaging (Chen et al., 2011) and methylation data. These applications have led to novel findings and hypotheses. The present software also enhances data preprocessing, accelerates robust scatter simplex identification and integrates supervising information, such as known markers.

2 Description

2.1 Methodology and software

Ideally, a molecularly distinctive subtype would be composed of individual molecular features that are exclusively expressed in the cognate cell or tissue subtype of interest and in no others; such features are the so-called molecular markers. Fundamental to the success of debCAM are the newly proven mathematical theorems (Chan et al., 2008; Chen et al., 2011; Wang et al., 2016), which show that the scatter simplex of mixed expressions is a rotated and compressed version of the scatter simplex of pure subtype expressions, where the molecular markers are located at each of the vertices (Fig. 1A–E). debCAM works by geometrically identifying the vertices (and their resident markers) of the scatter simplex of globally measured expressions. Tissue samples

© The Author(s) 2020. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com

Fig. 1. debCAM workflow and case studies. (A) The implemented debCAM pipeline with four major functional modules: data preprocessing, simplex and marker detection, deconvolution of mixed expression profiles and model selection (Supplementary Material). Matrix ‘A’ contains the mixing proportions, whose column vectors correspond to the vertices of the scatter simplex; matrix ‘S’ contains subtype-specific expression profiles; and K is the number of subtypes present. All are unknown and need to be estimated. (B) Scatter simplex of 12 in vitro mixtures shows six vertices, hosting the subtype-specific markers (detected by debCAM). (C) A priori known markers (five immune cell types) superimposed onto the scatter simplex reside close to the corresponding vertices. (D and E) Scatter simplex of purified neuron and glia subtypes (astrocytes and mature oligodendrocytes) clearly shows three distinctive vertices. As expected, with varying mixing proportions, the scatter simplex of bulk tissue samples is a rotated and compressed version of the original simplex. (F) Neuron-specific expression profiles obtained from purified samples and from bulk tissue samples are highly correlated over all CpG and marker sites.
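
The geometry summarized in the caption (mixing matrix A, subtype profiles S, markers sitting on the simplex vertices) can be illustrated with a small simulation. The following is only a sketch under stated assumptions: it is written in Python with NumPy rather than R, it does not call any debCAM function, the data are simulated, and the helper names `simulate_mixtures` and `perspective_projection` are hypothetical. It assumes the linear model X = A S with ideal markers (expressed in exactly one subtype) and takes perspective projection to mean scaling each feature so that its values across samples sum to one.

```python
import numpy as np


def simulate_mixtures(n_samples=12, n_subtypes=3, n_genes=200,
                      n_markers_each=5, seed=0):
    """Simulate X = A @ S with a few ideal markers per subtype (toy data)."""
    rng = np.random.default_rng(seed)
    # A: samples x subtypes; each row holds one sample's proportions (sums to 1).
    A = rng.dirichlet(np.ones(n_subtypes), size=n_samples)
    # S: subtypes x genes; non-negative subtype-specific expression profiles.
    S = rng.gamma(2.0, 50.0, size=(n_subtypes, n_genes))
    markers = {}
    for k in range(n_subtypes):
        idx = np.arange(k * n_markers_each, (k + 1) * n_markers_each)
        S[:, idx] = 0.0  # silent in every subtype ...
        S[k, idx] = 1.0 + rng.gamma(2.0, 50.0, size=n_markers_each)  # ... except k
        markers[k] = idx
    X = A @ S  # observed bulk profiles (samples x genes)
    return A, S, X, markers


def perspective_projection(M):
    """Scale each column so that its entries sum to 1."""
    return M / M.sum(axis=0, keepdims=True)


A, S, X, markers = simulate_mixtures()
P = perspective_projection(X)  # one point per gene on the scatter simplex
V = perspective_projection(A)  # simplex vertices: normalized columns of A

# Every ideal marker gene of subtype k lands exactly on vertex k.
for k, idx in markers.items():
    assert np.allclose(P[:, idx], V[:, [k]])

# Every gene is a convex combination of the vertices:
# weights are non-negative and sum to 1 for each gene.
W = S * A.sum(axis=0)[:, None] / X.sum(axis=0)
assert np.allclose(P, V @ W)
assert np.allclose(W.sum(axis=0), 1.0)
```

In this toy setting the vertices are known because A is simulated; in the unsupervised setting described in the paper, A, S and K are all unknown and the vertices have to be found from the projected data alone.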

to be deconvolved by debCAM contain an unknown number and varying proportions of molecularly distinctive subtypes. Expression of a given molecular feature (e.g. mRNA and protein) in a specific subtype is modeled as being linearly proportional to the abundance of that subtype. The minimum description length (MDL) information criterion determines the number of subtypes present (Chen et al., 2011; Wang et al., 2016).

The debCAM software tool performs the following major analytics pipelines (Fig. 1A), in addition to normalization and antilogarithm: (i) Data preprocessing. Molecular features whose expression levels are lower or higher than pre-fixed thresholds are removed as noise or outliers. The number of mixture profiles is reduced by principal component analysis. On the scatter simplex constructed by perspective projection and local outlier factoring, molecular features are aggregated into representative cluster centers using K-means or Affinity Propagation Clustering. The interior cluster centers are removed by QuickHull and a greedy/floating search algorithm. (ii) Simplex and marker detection. An exhaustive combinatorial search is conducted to identify the optimal scatter simplex for a given number of vertices among the peripheral cluster centers, guided by a convex-hull-to-data fitting criterion. The members of the identified simplex vertex clusters are considered subtype-specific markers. (iii) Deconvolution of mixed expression profiles. The mixing proportions are first estimated by the collective expression levels of subtype-specific markers, followed by non-negative least squares estimation of subtype-specific expression profiles. (iv) Model selection. The optimal number of subtypes present in samples is determined by the MDL information criterion among the competing models.

2.2 Case study with validation

We use three datasets of two omics types (mRNA and methylation) and three validation schemes (gold standard, ground truth and cross-validation) to assess the performance of the debCAM R package.

We first tested debCAM on biologically mixed gene expression profiles (GSE64385). The 12 samples represent the mixtures of five immune cell subtypes and one cancer cell-line in various known proportions. The optimal scatter simplex blindly identified by debCAM is given in Figure 1B, showing six distinctive vertices. To validate the accuracy of subtype-specific markers blindly detected by debCAM, we superimpose the color-coded known markers onto the scatter simplex, where they lie in close proximity to the corresponding vertices (Fig. 1C; the cancer cell-line-specific markers are unavailable for comparison). More convincingly, the sample-wise subtype proportions estimated by debCAM match the ground truth almost perfectly, with correlation coefficients of 0.975–0.996.

We further assessed debCAM on DNA methylation data (GSE41826) profiled from both purified (flow-sorted) neuron and glia subtypes as well as bulk tissue samples from prefrontal cortex.

Application of debCAM to both purified and bulk tissue samples consistently detects three molecularly distinctive tissue subtypes, whose cellular phenotypes are inferred as neurons, astrocytes and mature oligodendrocytes according to their temporal cellular dynamics (Supplementary Fig. S6). The subtype-specific markers residing near the simplex vertices of bulk tissue samples are first detected by debCAM and then cross-validated by the largely common markers residing around the simplex vertices of purified subtype samples (Fig. 1D and E, Supplementary Table S1). Furthermore, the subtype-specific expression profiles estimated by debCAM are highly correlated between purified and bulk samples, achieving almost perfect correlation coefficients of 0.982–0.992 over all CpG sites and correlation coefficients of 0.884–0.962 over CpG markers (Fig. 1F, Supplementary Fig. S4).

3 Discussion

debCAM provides a completely unsupervised deconvolution tool, complementary to numerous existing deconvolution methods (Avila Cobos et al., 2018; Hart et al., 2015; Lobanova and Lobanov, 2019; Newman et al., 2015). Moreover, the debCAM software can readily perform semi-supervised deconvolution by incorporating relevant a priori information such as known or reference-derived subtype-specific markers. We expect the debCAM method, with its Bioconductor R package, to be a very useful software tool for unbiased molecular analysis of complex tissues in their native environment. Though the case studies are illustrated only in scatter space, debCAM is in principle applicable to sample space when significantly dense sampling is available, such as single-cell profiling. Historically this technique was motivated by the need to identify expressed marker genes for tissue/cell subtypes within a mixed sample; however, the approach described here is applicable to any molecular profiling technique for identifying the unique molecular markers or other features that are affiliated with distinct subtypes within the studied samples. As a fully unsupervised machine-learning method, debCAM is, as expected, not immune to hidden confounding factors, e.g. batch effect and collinearity (Supplementary Figs S7 and S8).

Funding

This work was supported by the National Institutes of Health [HL111362-05A1, HL133932, NS115658]; and the Department of Defence [W81XWH-18-1-0723, BC171885P1].

Conflict of Interest: none declared.

References

Avila Cobos,F. et al. (2018) Computational deconvolution of transcriptomics data from mixed cell populations. Bioinformatics, 34, 1969–1979.
Chan,T.-H. et al. (2008) A convex analysis framework for blind separation of non-negative sources. IEEE Trans. Signal Process., 56, 5120–5134.
Chen,L. et al. (2011) Tissue-specific compartmental analysis for dynamic contrast-enhanced MR imaging of complex tumors. IEEE Trans. Med. Imaging, 30, 2044–2058.
Hart,Y. et al. (2015) Inferring biological tasks using Pareto analysis of high-dimensional data. Nat. Methods, 12, 233–235.
Herrington,D.M. et al. (2018) Proteomic architecture of human coronary and aortic atherosclerosis. Circulation, 137, 2741–2756.
Lobanova,E. and Lobanov,S. (2019) Efficient quantitative hyperspectral image unmixing method for large-scale Raman micro-spectroscopy data analysis. Anal. Chim. Acta, 1050, 32–43.
Newman,A.M. et al. (2015) Robust enumeration of cell subsets from tissue expression profiles. Nat. Methods, 12, 453–457.
Wang,N. et al. (2016) Mathematical modelling of transcriptional heterogeneity identifies novel markers and subpopulations in complex tissues. Sci. Rep., 6, 18909.
