Pes1ug20cs540 438

Model Formalization for Genomes Comparative Analysis Using a Graph Database
Baddela Divya Malika (PES1UG20CS540), Suchit S K (PES1UG20CS438)
Introduction
The exponential growth in available genomic data raises the question of how to
translate this vast amount of genetic information into the observable traits and
behaviors we wish to understand. Comparative genomics aims to learn from sequences
about conserved characteristics and indicators of evolution. Homology search is one of
its central tasks. The homology of a subsequence is a shared ancestry that may be
inferred from its similarity to other sequences.
Because shared characteristics are encoded in the DNA sections conserved between
host organisms, homology search makes it possible to measure evolutionary distances
and to discover functional subsequences.
Due to point mutations that occur during DNA replication and recombination, genomic
information varies both within and across species and exhibits both deterministic and
probabilistic features. Mutation rates vary across species and between different types
of point mutations, and they can also be influenced by nearby subsequences. To
compare genomes properly, single-nucleotide insertions, deletions, and substitutions
must be taken into account.
Literature Review
There are already many commercially available next-generation machines that can
sequence more than a billion DNA bases in a single run, and more are being developed.
Because of their small genome sizes, microorganisms are at the forefront of attempts to
use the high throughput of these technologies to open new windows on the study of
genetic diversity and evolution. Metagenomic surveys, which assess the species
abundance and metabolic makeup of microbial communities, also make it possible to
infer some population genetic factors from environmental DNA. Population genomic
techniques are filling in our understanding of sequence diversity across isolates of a
single species and between closely related species.
De Bruijn graphs have been suggested as a data structure that makes it easier to
analyze linked whole-genome sequences in both population and comparative-genomic
contexts. Current methods, however, do not scale well to multiple large genomes.
De Bruijn Graphs
Algorithm
In graph theory, the standard de Bruijn graph is the graph obtained by taking all strings
of length L over a finite alphabet as vertices, and adding edges between vertices that
have an overlap of L−1. In the following, we consider assembly using a slightly modified
version of the standard de Bruijn graph, built from the L-spectrum of a genome.
Given the L-spectrum of a genome, we construct a de Bruijn graph as follows:
1. Add a vertex for each (L-1)-mer in the L-spectrum.

2. Add k edges between two (L-1)-mers if their overlap has length L-2 and the
corresponding L-mer appears k times in the L-spectrum.
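The two construction steps above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function names `l_spectrum` and `de_bruijn` and the toy genome string are assumptions introduced here, and parallel edges are stored as a multiplicity count rather than as separate edge objects.

```python
from collections import Counter

def l_spectrum(genome, L):
    """Multiset of all length-L substrings (L-mers) of the genome."""
    return Counter(genome[i:i + L] for i in range(len(genome) - L + 1))

def de_bruijn(spectrum):
    """Build the graph from an L-spectrum.

    Vertices are the (L-1)-mers; an L-mer that appears k times
    contributes k parallel edges from its prefix to its suffix.
    """
    vertices = set()
    edges = Counter()  # (prefix, suffix) -> edge multiplicity k
    for lmer, k in spectrum.items():
        prefix, suffix = lmer[:-1], lmer[1:]
        vertices.update((prefix, suffix))
        edges[(prefix, suffix)] += k
    return vertices, edges

# Toy example (hypothetical sequence), L = 3.
vertices, edges = de_bruijn(l_spectrum("ACGTACGA", 3))
```

In this toy case the 3-mer ACG occurs twice, so the edge from AC to CG has multiplicity 2, while every other edge has multiplicity 1.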
Markov Chains
A stochastic process that satisfies the Markov property (sometimes characterized as
"memorylessness") is referred to as a Markov process. In simpler terms, it is a process
whose future outcomes can be predicted from its current state alone, and, most
importantly, such predictions are just as accurate as those that could be made if the
complete history of the process were known.
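For a discrete-time chain, the memorylessness described above is usually written as the following identity. The symbols X_n (the state at step n) and u, v, u_0, ..., u_{n-1} (particular states) are notation assumed here for illustration and do not appear in the original text.

```latex
P(X_{n+1} = v \mid X_n = u,\; X_{n-1} = u_{n-1},\; \dots,\; X_0 = u_0)
  = P(X_{n+1} = v \mid X_n = u)
```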
Types of Markov Chains
Proposed Methodology
We propose using a graph database for homology search, building on the graph
theory-based methodology described above.
Graph Multi-genome Representation
Consider a set {Γ1, Γ2, . . . , Γm} of genomic sequences Γi of host organisms of a
particular species, called a multi-genome. We propose an extended de Bruijn Graph
B({Γ1, Γ2, . . . , Γm}, ΣDNA, k) = G(V, E), where each node v ∈ V is supplemented with
information about which genomes it belongs to. Thus, each genome Γi is represented
as a walk in the graph B({Γ1, Γ2, . . . , Γm}, ΣDNA, k).
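A minimal Python sketch of this extended graph is given below. It assumes that the nodes are k-mers, that an edge joins two k-mers that are consecutive in some genome, and that membership is stored as a set of genome indices per node; the function name `multi_genome_graph` and the two toy sequences are hypothetical.

```python
from collections import defaultdict

def multi_genome_graph(genomes, k):
    """Extended de Bruijn graph over several genomes (illustrative sketch).

    Returns:
      membership: k-mer -> set of indices of the genomes containing it
      edges:      set of (u, v) pairs of consecutive k-mers
      walks:      for each genome, its list of k-mers in sequence order
    """
    membership = defaultdict(set)
    edges = set()
    walks = []
    for i, genome in enumerate(genomes):
        walk = [genome[j:j + k] for j in range(len(genome) - k + 1)]
        for node in walk:
            membership[node].add(i)
        edges.update(zip(walk, walk[1:]))
        walks.append(walk)
    return membership, edges, walks

# Two toy genomes (assumed data) that differ in one nucleotide, k = 2.
membership, edges, walks = multi_genome_graph(["ACGTA", "ACCTA"], 2)
```

Here the nodes AC and TA belong to both genomes, while CG and GT belong only to the first and CC and CT only to the second, so each genome can be read back as its own walk through the shared graph.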
Now consider the Markov chain underlying the de Bruijn Graph B({Γ1, Γ2, . . . , Γm},
ΣDNA, k). With sequence indices as discrete time, the state space of this Markov
chain is V. We supplement every edge (u, v) ∈ E with a label puv that represents the
probability of observing the nucleotide v[k − 1] following the k-mer u in the species
multi-genome.
We denote the proposed graph model by MBG({Γ1,Γ2,…, Γm}, ΣDNA, k). For every
nucleotide subsequence H, the probability pH of observing this subsequence in the
species multi-genome can be calculated as the product of the edge weights along the
corresponding walk in MBG({Γ1,Γ2,...,Γm}, ΣDNA, k).
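The sketch below illustrates how edge labels and pH might be computed. It rests on assumptions not stated in the text: each edge weight is estimated as the relative frequency with which one k-mer is immediately succeeded by the next across all genomes, the probability of the starting k-mer is ignored, and a transition never seen in the data gets probability 0. The function names and toy sequences are hypothetical.

```python
from collections import Counter, defaultdict

def transition_probabilities(genomes, k):
    """Estimate edge weights as relative frequencies (an assumed estimator)."""
    counts = defaultdict(Counter)  # u -> Counter of successor k-mers v
    for genome in genomes:
        for j in range(len(genome) - k):
            u, v = genome[j:j + k], genome[j + 1:j + k + 1]
            counts[u][v] += 1
    return {
        u: {v: c / sum(succ.values()) for v, c in succ.items()}
        for u, succ in counts.items()
    }

def walk_probability(H, p, k):
    """Product of edge weights along the walk spelled by subsequence H."""
    prob = 1.0
    for j in range(len(H) - k):
        u, v = H[j:j + k], H[j + 1:j + k + 1]
        prob *= p.get(u, {}).get(v, 0.0)  # unseen transition -> 0
    return prob

p = transition_probabilities(["ACGTA", "ACCTA"], 2)
p_seen = walk_probability("ACGTA", p, 2)
p_unseen = walk_probability("ACGA", p, 2)
```

With these two toy genomes, the k-mer AC is followed by CG in one genome and by CC in the other, so each of those edges has weight 0.5 and every other edge has weight 1. The subsequence ACGTA therefore gets probability 0.5, and ACGA gets 0 because the transition from CG to GA was never observed.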
Genomic Variations Representation
The concept of bubbles is used to represent genomic variations flanked by homologous
sequences. In this paper, the terms single-nucleotide variation and point mutation are
used interchangeably; both refer to a single-nucleotide dissimilarity between DNA
sequences. A single-nucleotide variation serves as the unit of dissimilarity, which
makes it possible to evaluate a homology score. A bubble is a subgraph corresponding
to a possible mutation.
A pair of walks x and y form a bubble (x, y) if all of the following hold true:
• x and y share common starting and ending vertices;
• x and y share no other common vertices except for the starting and the ending ones.
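The two conditions can be checked directly on walks given as lists of vertices. The sketch below is an illustration under assumptions: the function name `is_bubble` is hypothetical, and the extra requirement that the two walks differ (so that a walk paired with itself is not counted) is added here and is not part of the definition above.

```python
def is_bubble(x, y):
    """Check whether walks x and y (lists of vertices) form a bubble."""
    if len(x) < 2 or len(y) < 2 or x == y:
        return False  # need distinct walks with a start and an end
    same_ends = x[0] == y[0] and x[-1] == y[-1]
    # Interior vertices (everything but the endpoints) must not overlap.
    inner_disjoint = not (set(x[1:-1]) & set(y[1:-1]))
    return same_ends and inner_disjoint

# Toy walks over 2-mers (assumed data): a single-nucleotide variation.
x = ["AC", "CG", "GT", "TA"]
y = ["AC", "CC", "CT", "TA"]
z = ["AC", "CG", "CT", "TA"]  # shares the interior vertex CG with x

bubble_xy = is_bubble(x, y)
bubble_xz = is_bubble(x, z)
```

The pair (x, y) satisfies both conditions, whereas (x, z) fails the second one because the walks meet at CG before reaching the common end vertex.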
Results
The proposed model can be used to store genomic data in a graph database
management system. The search for homologous subsequences then becomes a graph
traversal problem, in which walks representing homologous subsequences appear as
chains of bubbles.
Walks carrying homology candidates can be scored by penalizing the number of
mutations and rewarding a higher probability of observing the candidate subsequence
in the species multi-genome. The accuracy of the underlying Markov chain model
depends significantly on the quality of the empirical data used to estimate the transition
probabilities. Because separate, uniform databases can be created for different species
multi-genomes, the proposed methodology allows homology search to be parallelized
and a distributed storage system to be deployed.
Conclusions
In this study, a graph data model for comparative genome analysis is proposed, with a
particular emphasis on the homology search task. The model is intended to be used
with a graph database. It combines a de Bruijn Graph structure with an underlying
Markov chain to represent a multi-genome as a graph. The accuracy of homology search
based on the proposed model depends on the quality of the empirical data used to
estimate the transition probabilities.
Genomic variations are represented by bubbles, which makes it possible to evaluate a
similarity score between subsequences during graph traversal in order to infer
homology. Ranking complex bubbles, however, remains a challenge when identifying
the best homology candidates.
The proposed model enables parallelizing homology search by allowing the creation of
uniform databases for a set of species multi-genomes.
