Graph-Based Characterisations of Cell Types and Functionally Related Modules in Promoter Capture Hi-C Data
Lelde Lace, Gatis Melkus, Peteris Rucevskis, Edgars Celms, Karlis Cerans, Paulis Kikusts,
Mārtiņš Opmanis, Darta Rituma and Juris Viksna
Institute of Mathematics and Computer Science, University of Latvia, Rainis Boulevard 29, Riga, Latvia
lelde.lace, gatis.melkus, peteris.rucevskis, edgars.celms, karlis.cerans, paulis.kikusts, martins.opmanis, darta.rituma,
Keywords: Hi-C Networks, Cell Type Specificity, Graph-Based Metrics, Graph Topology.
Abstract: Current technologies for chromosome conformation capture, notably Hi-C, make it possible to study a broad spectrum of functional interactions between genome elements. Although significant progress has been made, there are still many open questions regarding the best approaches for analysing Hi-C data to identify biologically significant features. In this paper we approach this problem by focusing strictly on the topological properties of Hi-C interaction graphs. Graph topological properties were analysed from the perspective of two research questions: 1) are topological properties alone able to distinguish between different cell types and assign biologically meaningful distances between them; 2) what is a typical structure of Hi-C interaction graphs and can we assign a biological significance to structural elements or features? The analysis was applied to a set of Hi-C interactions in 17 human haematopoietic cell types. Promising results have been obtained in answering both questions. Firstly, we propose a concrete set Base11 of 11 topology-based metrics that provide good discriminatory power between cell types. Secondly, we have explored the topological features of connected components of Hi-C interaction graphs and demonstrate that such components tend to be well conserved within particular cell type subgroups and can be well associated with known biological processes.
Lace, L., Melkus, G., Rucevskis, P., Celms, E., Cerans, K., Kikusts, P., Opmanis, M., Rituma, D. and Viksna, J.
Graph-based Characterisations of Cell Types and Functionally Related Modules in Promoter Capture Hi-C Data.
DOI: 10.5220/0007390800780089
In Proceedings of the 12th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2019), pages 78-89
ISBN: 978-989-758-353-7
Copyright © 2019 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved
... datasets could best be mined to answer additional biological questions. Several methods exist by which the resolution of a Hi-C dataset can be theoretically improved through more sophisticated interaction calling algorithms. These include several algorithms with broadly comparable performances for general Hi-C data (Forcato et al., 2017) and also more specific tools such as the CHiCAGO analysis pipeline meant specifically for cHi-C datasets (Cairns et al., 2016), and these algorithms can involve techniques ranging from modelling technical biases (Ay et al., 2014) to employing a deep convolutional neural network (Zhang et al., 2018). Another, distinct set of approaches consists of clustering methods in which available 3C and Hi-C data are analysed in order to discover functionally related modules of genes and regulatory elements. These include the use of δ-teams models for Hi-C data to identify both known and putative gene clusters (Schulz et al., 2018), hard- and soft-clustering algorithms that could theoretically assist in the interpretation of combined metagenomic sequencing and 3C data (DeMaere and Darling, 2016), and also spectral clustering-based methods such as the Arboretum-Hi-C framework, which was found to be useful in identifying chromatin features in mammalian Hi-C datasets at several levels of organization and pointed to the potential utility of graph-based clustering in both analysing and comparing Hi-C datasets in general (Siahpirani et al., 2016).

The advantages of analysing Hi-C interaction data in graph-related terms are already well accepted: for example, the approaches of Siahpirani et al. (2016) and Cairns et al. (2016) explicitly discuss graph-based formalisms, and their methods have been successfully applied when analysing new data sets (Javierre et al., 2016). However, these analyses mainly take into account only the properties of interaction matrices (which can be considered as weighted graphs), together with some additional data (e.g. interaction segment distances and associations with known gene regulations). The analysis of topological features of these interaction graphs and of their biological significance remains largely unexplored. (Although some studies explicitly mention 'topological features', these are very limited and usually relate to the topology of chromosomes rather than of interaction graphs; e.g. in (Wang et al., 2013) they are interpreted as the distributions of interaction endpoints on chromosomes.)

In this paper we focus explicitly only on the topological properties of Hi-C interaction graphs. These properties were identified and analysed from the perspective of two research questions: 1) are topological properties of Hi-C interaction graphs alone able to distinguish between different cell types and assign biologically meaningful distances between them; 2) what is a typical structure of Hi-C interaction graphs and can we assign some biological significance to structural elements or features of these graphs?

2 DATASET USED FOR THIS STUDY

For this study we use a dataset of long-range interactions between promoters and other regulatory elements that was generated by The Babraham Institute and University of Cambridge (Javierre et al., 2016). The data comprise interactions that were determined by promoter capture Hi-C in 17 human primary haematopoietic cell types. The measurements have identified interaction regions of 31253 promoters across all chromosomes; from these, high-confidence PCHi-C interactions have been selected using the CHiCAGO pipeline (Cairns et al., 2016) by keeping interactions with a score of 5 or more. These data are available from the Open Science Framework website (Javierre et al., 2016). This data set is still largely unique because it contains genome-wide data covering a representative subset of the entire haematopoietic lineage collected using a unified protocol.

From these data we have constructed directed graphs separately for each chromosome and for each cell type (the few inter-chromosome interactions were rejected, and chromosome Y was not considered because it has very few interactions). The vertices of the graphs are chromosome segments that correspond either to promoters ('baits') or to the detected interaction regions ('other ends'); the edges correspond to interactions and are directed from 'baits' to 'other ends'. Unlike most other analyses we only consider the topology of interaction graphs, without assigning concrete sets of genes to the vertices. In total we obtained 23 × 17 chromosome and cell type-specific graphs, each of which can be considered as a subgraph of the 'complete interaction network', which has a total of 251209 vertices (ranging between 2904 and 23079 per chromosome) and 723165 edges. Similarly, each of these graphs can be considered as a subgraph of the 'interaction network' of one of the chromosomes, which is specific for a given cell type.

In a number of computational tests the graphs were further modified by varying the CHiCAGO score threshold for edge inclusion between 3 and 8. Although no comprehensive analysis was done, in general such variations had a limited impact on the stability of the results.
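The graph construction described in Section 2 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record layout (`bait_chr`, `bait`, `oe_chr`, `oe`, `scores`) and the function name `build_graphs` are hypothetical, and the toy segment names and scores are made up. Only the rules themselves (edges directed from bait to other end, a CHiCAGO score threshold of 5, rejection of inter-chromosome interactions, omission of chromosome Y) come from the text.

```python
from collections import defaultdict

def build_graphs(interactions, min_score=5.0):
    """Return {(chromosome, cell_type): set of directed (bait, other_end) edges}."""
    graphs = defaultdict(set)
    for rec in interactions:
        # Reject inter-chromosome interactions and skip chromosome Y.
        if rec["bait_chr"] != rec["oe_chr"] or rec["bait_chr"] == "Y":
            continue
        for cell_type, score in rec["scores"].items():
            # Keep an edge only in cell types where the score passes the threshold.
            if score >= min_score:
                graphs[(rec["bait_chr"], cell_type)].add((rec["bait"], rec["oe"]))
    return dict(graphs)

# Toy records; segment names and scores are invented for illustration.
interactions = [
    {"bait_chr": "5", "bait": "b1", "oe_chr": "5", "oe": "o1",
     "scores": {"Mon": 7.2, "Neu": 1.0}},
    {"bait_chr": "5", "bait": "b1", "oe_chr": "5", "oe": "b2",
     "scores": {"Mon": 5.0, "Neu": 6.1}},
    {"bait_chr": "5", "bait": "b2", "oe_chr": "7", "oe": "o9",
     "scores": {"Mon": 9.0}},
]
graphs = build_graphs(interactions)
```

Varying `min_score` between 3 and 8 would correspond to the threshold tests mentioned above; storing each graph as a plain edge set keeps later set operations (such as intersections across cell types) simple.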
BIOINFORMATICS 2019 - 10th International Conference on Bioinformatics Models, Methods and Algorithms
Figure 2: The most discriminating metrics for cell type distances Dcont (top) and Db (bottom). The relative discriminatory power of any pair of metrics is proportional to the ratio of their absolute values of "effect size" shown on the y axis. The exact ranges of "effect size" are data-dependent and are not directly comparable between different cell type distances, and (even more notably) between full genome and single chromosome metrics data.

Figure 3: The most discriminating metrics for cell type distances Dt (top) and D4 (bottom).

Table 1: Pearson correlations between cell type distances and predictions of their values.

Metrics        Dcont  Db    Dt    D4    DA    DB    DC
Base57         0.79   0.68  0.67  0.72  0.77  0.44  0.63
Base34         0.78   0.67  0.66  0.71  0.76  0.42  0.62
Base11         0.78   0.67  0.66  0.70  0.74  0.40  0.58
Base11 + V     0.81   0.69  0.67  0.73  0.75  0.46  0.62
Base11 + E     0.80   0.68  0.67  0.72  0.75  0.45  0.61
Base11 + E17   0.80   0.68  0.67  0.72  0.75  0.45  0.61
Base11 + E9    0.87   0.72  0.73  0.76  0.75  0.55  0.71
V              0.67   0.57  0.56  0.55  0.47  0.36  0.45
E              0.78   0.66  0.65  0.70  0.68  0.29  0.25
E17            0.78   0.66  0.65  0.70  0.68  0.29  0.25
E9             0.86   0.71  0.72  0.75  0.69  0.33  0.31

... in Table 1. To some extent, it shows the suitability of different metrics for discrimination between cell types, although the obtained correlation values are likely influenced by over-fitting and should be treated with caution. The results are nevertheless very stable: 10 repeated bootstrapping tests with training sets containing 75% of the data produced deviations of at most 2%. Of greater interest is the comparative performance of different metrics in cell type differentiation. The good performance of the 'counting' metrics V, E, E9 and E17 is not surprising, since one should expect that more closely related cell types will share more common Hi-C interactions. Nevertheless, they alone do not outperform sets of Base metrics (apart from E9, which counts the interactions that are common to no more than 50% of cell types, and thus is dataset dependent). Surprising, however, is the low performance of these metrics for identification of select clusters of closely related cell types (distances DA, DB and DC). Overall, these results confirm that the Base topology-based metrics, particularly the Base11 subset, perform well in lineage-based identification of blood cell types in chromatin interaction data.

An interesting feature can be observed by comparing the performance of Base11 on different chromosomes (shown by heatmaps in Figure 4). The chromosomes are grouped into a number of similarity clusters, which is not the case for randomised data. The clusters, however, strongly depend on the cell type distance used (although there are a few stably related pairs of chromosomes as well as a few persistent outliers).

For the Dcont distance, Figure 5 shows the statistical significance of Base11 metrics in the cases where it correlates well with the significance of the metrics for the whole chromosome set.
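To make the 'counting' metrics more concrete, the following Python sketch computes, for one cell type, the number of vertices, the number of edges, and the number of its edges that occur in no more than a given number of cell types. This is an illustration under assumptions rather than the paper's definitions: the function name `counting_metrics`, the key `E_rare` and the parameter `max_shared` are hypothetical, and the exact definitions of E9 and E17 may differ from the 'rare edge' count used here.

```python
from collections import Counter

def counting_metrics(edge_sets, cell_type, max_shared):
    """Counting metrics for one cell type, given {cell_type: set of edges}."""
    edges = edge_sets[cell_type]
    vertices = {v for edge in edges for v in edge}
    # Number of cell types in which each edge occurs.
    occurrence = Counter(e for s in edge_sets.values() for e in s)
    # Edges of this cell type present in at most `max_shared` cell types
    # (an assumed reading of "common to no more than 50% of cell types").
    rare = sum(1 for e in edges if occurrence[e] <= max_shared)
    return {"V": len(vertices), "E": len(edges), "E_rare": rare}

# Toy edge sets for four cell types on one chromosome (invented data).
edge_sets = {
    "Mon":  {("a", "b"), ("a", "c"), ("d", "e")},
    "Neu":  {("a", "b"), ("d", "e")},
    "EP":   {("a", "b")},
    "tCD8": {("a", "b"), ("f", "g")},
}
# With four cell types, "no more than 50%" corresponds to max_shared=2.
metrics = counting_metrics(edge_sets, "Mon", max_shared=2)
```

In this toy example the edge ("a", "b") occurs in all four cell types and is therefore excluded from the rare count, while ("a", "c") and ("d", "e") are counted.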
The questions we are trying to answer here are: 1) what is a 'typical structure' of a Hi-C interaction graph? 2) how do the structures of graphs change if we consider common subgraphs for given sets of cell types? 3) can we assign some biological significance to the structural elements or features of these graphs?

4.1 General Topological Properties of Interaction Graphs

Since the data that we have used are focused particularly on small-scale interactions, it is not unexpected that our interaction graphs separate easily into connected components. Approximately 29% of these contain just 2 vertices (isolated edges or cycles of length 2), there are comparatively few components of size 3-7 (29%) and the remaining 45% contain 8 or more vertices. Very few of them are comparatively large (up to 150 vertices), with a typical range being 10-70 vertices and an average size being 32 vertices. A sample distribution of connected component sizes is shown in Figure 7. To investigate how these components change for different cell types we further consider components of 10 or more vertices (although chosen somewhat arbitrarily, this threshold seems well suited to largely reduce random 'noise' in graphs). The average number of such components in each chromosome and cell type-specific graph is around 50.

... components that are shared by Mon and at least one other cell type and that are almost absent in at least one other cell type is around 25%. Comparatively few (around 5%) components remain little changed in all of the cell types. The remaining 70% could be further subdivided into not very strictly separated subclasses, ranging from ones for which part of the component shows good cell type specificity to quite noisy ones (full statistics are lacking, but manual inspection indicates that the first of these behaviours tends to be more common).

Figure 8: The reduction of sizes of connected components (with 10 or more vertices) for chromosome 5 and Mon cell type when the component is replaced by a subgraph shared by all 17 cell types. The remaining component size (in %) is shown on the horizontal axis and the percentage of components on the vertical axis.

As an example, the following figures illustrate a connected component of chromosome 5 and cell type EP. The complete initial component is shown in Figure 9. Figure 10 shows the parts of the component that are shared by Mac0, Mac1 and Mac2 (although a number of interactions and nodes are lost, the topological structure remains mostly preserved) and the parts that are additionally shared by cell type tCD8, in which case only a few vertices and interactions remain. The component is completely absent in cell type Neu.
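The component analysis of Section 4.1 can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function names `connected_components` and `retention` are hypothetical, components are taken to be weakly connected (edge direction is ignored), and retention is measured here as the percentage of a component's vertices that still have an edge within the component in the shared subgraph, which is one plausible reading of 'remaining component size'.

```python
def connected_components(edges):
    """Weakly connected components of a directed edge set, via union-find."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, w in edges:
        parent[find(u)] = find(w)
    groups = {}
    for v in list(parent):
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())

def retention(component, shared_edges):
    """Percentage of component vertices kept by edges of the shared subgraph."""
    kept = {v for u, w in shared_edges
            if u in component and w in component
            for v in (u, w)}
    return 100.0 * len(kept) / len(component)

# Toy graphs for two cell types on one chromosome (invented data).
mon = {("a", "b"), ("a", "c"), ("c", "d"), ("e", "f")}
neu = {("a", "b"), ("x", "y")}
shared = mon & neu  # subgraph common to both cell types

components = sorted(connected_components(mon), key=len, reverse=True)
largest = components[0]
largest_retention = retention(largest, shared)
```

In practice one would first filter components by size (10 or more vertices in the text) and take `shared` as the intersection of edge sets over all cell types of interest, for example all 17 cell types as in Figure 8.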
Figure 10: The parts of the EP Hi-C interaction component shared with cell types Mac0, Mac1 and Mac2 (top) and additionally with tCD8 (bottom). There are no common interactions shared with cell type Neu.

4.2 Biological Interpretation of Hi-C Interaction Components

The observation that connected components in Hi-C interaction graphs are either largely shared by two cell types, or are present in one of them and largely absent in another, strongly suggests that they have biological roles. These, however, might differ between the components, and the very large number of components makes comprehensive analysis practically infeasible. Still, we have performed limited analyses that confirm a strong relation of component structure with known biological interactions, and we have explored several illustrative components in more detail.

In order to ascertain whether the clusters previously obtained by linking bait-to-bait and bait-to-end interactions contained likely candidates for functionally related gene modules we first chose a sequence of related cell types from the myeloid haematopoietic lineage to examine for cell type-specific chromatin architecture. Starting from a large pool of common connected components for resting Mac0 and inflammatory macrophages Mac1, we sequentially added more cell types to the selection criteria (alternatively activated macrophages Mac2, monocytes Mon, neutrophils Neu and endothelial precursors EP) to nar-

       75%-100%   50%-75%    25%-50%    0%-25%     0%
       retained,  retained,  retained,  retained,  retained,  Total,
Chr    n(ne)      n(ne)      n(ne)      n(ne)      n(ne)      n(ne)
1      23(5)      15(10)     45(36)     39(29)     111(14)    233(94)
2      14(1)      12(10)     36(31)     38(29)     96(17)     196(88)
3      13(1)      9(4)       34(24)     29(26)     83(11)     168(66)
4      7(3)       10(5)      15(11)     24(17)     80(14)     136(50)
5      9(3)       11(5)      30(25)     17(15)     80(8)      147(56)
6      7(2)       16(10)     26(22)     20(15)     79(8)      148(57)
7      11(3)      13(6)      20(16)     22(21)     84(10)     150(56)
8      6(0)       13(4)      14(10)     15(12)     73(13)     121(39)
9      7(0)       13(7)      16(12)     19(16)     65(8)      120(43)
10     17(3)      9(5)       15(12)     26(24)     65(10)     132(54)
11     11(2)      18(8)      29(19)     28(22)     67(3)      153(54)
12     8(0)       13(7)      24(17)     12(11)     87(13)     144(48)
13     7(0)       3(1)       11(8)      13(11)     31(5)      65(25)
14     9(2)       8(4)       12(9)      12(5)      62(9)      103(29)
15     11(0)      8(5)       19(9)      17(13)     68(3)      123(30)
16     7(1)       9(8)       18(6)      8(5)       47(4)      89(24)
17     10(1)      18(6)      24(18)     21(19)     67(3)      140(47)
18     2(0)       5(1)       13(9)      10(4)      38(9)      68(23)
19     18(1)      16(7)      18(7)      6(2)       48(0)      106(17)
20     10(1)      8(4)       12(10)     7(5)       38(5)      75(25)
21     4(0)       2(1)       5(4)       5(3)       24(4)      40(12)
22     6(0)       5(3)       9(4)       11(7)      34(3)      65(17)
X      4(1)       16(7)      20(13)     14(11)     66(2)      120(34)
Total  221(30)    250(128)   465(332)   413(322)   1493(176)  2842(988)

After collecting data about node retention, the largest components showing high (75-100%) or low (0-25%) retention in chromosomes of interest according to our previous graph analysis (chromosomes 5, 9, 14, 15 and 19) were analysed for enrichment with registered transcription factor protein-protein interactions, known transcription factor binding sites, co-expressed transcription factors and binding motifs with the Enrichr web tool (Chen et al., 2013;
Figure 11: A component of Hi-C interactions shared by cell types Mac0, Mac1 and Mac2, Mon, Neu and EP. The initial larger
component, shown with dashed lines, is present in Mac0.
Figure 12: Another chromosome 5 component of Hi-C interactions shared by cell types Mac0, Mac1 and Mac2, Mon, Neu
and EP. The initial larger component, shown with dashed lines, is present in Mac0.
... interesting to test our approach on another genome-wide PCHi-C interaction dataset, in order to: 1) assess the applicability of Base11 metrics for discrimination between cell types other than the blood cells studied here; and 2) analyse the similarity of topological structure and component behaviour of interaction graphs in order to assess how the properties observed for haematopoietic cells generalise to other cell types. Unfortunately, as far as we know, another appropriate PCHi-C dataset covering multiple cell types has yet to become available.

There is also good potential to further extend the graph topology based approach that we have used here. It has been quite successful in showing that topological properties alone can be quite informative for discrimination between different cell types and also for assigning biological meaning to specific components of interaction graphs. At the same time, some useful information is absent from the current graph representation. Notably, an edge in an interaction graph might represent an interaction that forms a well-defined loop on a chromosome (if the distance between interaction segments is limited and there are no intermediate interactions between them), or it can represent a long range interaction with a far less obvious biological role. We plan to further develop the mathematical formalism for description of interaction graphs in order to incorporate and analyse such features, although additional studies are needed to determine the best way to achieve this.

ACKNOWLEDGEMENTS

The research was supported by ERDF project 1.1.1.1/16/A/135.

REFERENCES

Ay, F., Bailey, T., and Noble, W. (2014). Statistical confidence estimation for Hi-C data reveals regulatory chromatin contacts. Genome Research, 24(6):999–1011.
Belton, J., McCord, R., et al. (2012). Hi-C: A comprehensive technique to capture the conformation of genomes. Methods, 58(3):268–276. 3D chromatin architecture.
Cairns, J., Freire-Pritchett, P., et al. (2016). CHiCAGO: robust detection of DNA looping interactions in capture Hi-C data. Genome Biol, 17:127.
Celms, E., Cerans, K., et al. (2018). Application of graph clustering and visualisation methods to analysis of biomolecular data. Communications in Computer and Information Science, 838:243–257.
Chen, E., Tan, C., et al. (2013). Enrichr: interactive and collaborative html5 gene list enrichment analysis tool. BMC Bioinformatics, 14:128.
Das, A., Yang, C., et al. (2018). High-resolution mapping and dynamics of the transcriptome, transcription factors, and transcription co-factor networks in classically and alternatively activated macrophages. Front. Immunol., 9:22.
Dekker, J., Rippe, K., et al. (2002). Capturing chromosome conformation. Science, 295(5558):1306–1311.
DeMaere, M. and Darling, A. (2016). Deconvoluting simulated metagenomes: the performance of hard- and soft-clustering algorithms applied to metagenomic chromosome conformation capture (3C). PeerJ, 4:e2676.
Dryden, N., Broome, L., et al. (2014). Unbiased analysis of potential targets of breast cancer susceptibility loci by capture Hi-C. Genome Research, 24:1854–1868.
Forcato, M., Nicoletti, C., et al. (2017). Comparison of computational methods for Hi-C data analysis. Nature Methods, 14:679–685.
Guimaraes, J. and Zavolan, M. (2016). Patterns of ribosomal protein expression specify normal and malignant human cells. Genome Biology, 17:236.
Jäger, R., Migliorini, G., et al. (2015). Capture Hi-C identifies the chromatin interactome of colorectal cancer risk loci. Nature Communications, 6(6178).
Javierre, B., Burren, O., et al. (2016). Lineage-specific genome architecture links enhancers and non-coding disease variants to target gene promoters. Cell, 167(5):1369–1384.
Kanki, Y., Nakaki, R., et al. (2017). Dynamically and epigenetically coordinated GATA/ETS/SOX transcription factor expression is indispensable for endothelial cell differentiation. Nucleic Acids Research, 45(8):4344–4358.
Kuleshov, M., Jones, M., et al. (2016). Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Research, 44:W90–W97.
Lajoie, B., Dekker, J., and Kaplan, N. (2015). The hitchhiker's guide to Hi-C analysis: Practical guidelines. Methods, 72:65–75.
Lavin, Y., Mortha, A., et al. (2016). Regulation of macrophage development and function in peripheral tissues. Nature Reviews Immunology, 15(12):731–744.
Lieberman-Aiden, E., van Berkum, N., et al. (2009). Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science, 326(5950):289–293.
Martin, P., McGovern, A., et al. (2015). Capture Hi-C reveals novel candidate genes and complex long-range interactions with related autoimmune risk loci. Nature Communications, 6(10069).
McDonel, P., Demmers, J., et al. (2012). Sin3a is essential for the genome integrity and viability of pluripotent cells. Dev. Biol., 363(1):62–73.
Mifsud, B., Tavares-Cadete, F., et al. (2015). Mapping long-range promoter contacts in human cells with high-