Pes1ug20cs540 438
Introduction
The exponential growth in available genomic data raises the question of how to
translate this vast amount of genetic information into the observable traits and
behaviors we wish to understand. Comparative genomics analyzes sequences to learn
about conserved characteristics and evolutionary indicators, and homology search is
one of its core tasks. The homology of a subsequence is its shared ancestry with
other sequences, which may be inferred from their similarity.
Homology search makes it possible to measure evolutionary distances and to discover
functional subsequences, because shared characteristics are encoded in the DNA
sections conserved between host organisms.
Because point mutations occur during DNA replication and recombination, genomic data
vary both within and across species and so exhibit both deterministic and
probabilistic features. Moreover, mutation rates vary across species and between
types of point mutation, and they can also be influenced by nearby subsequences. To
compare genomes properly, single-nucleotide insertions, deletions, and substitutions
must be taken into account.
Literature Review
There are already many commercially accessible next-generation machines that can
sequence more than a billion DNA bases in a single run, and more are being created.
Microorganisms are at the forefront of attempts to use the high throughput of these
technologies to create new windows on the study of genetic diversity and evolution due
to their small genome sizes. It is also possible to infer some population genetic
factors from ambient DNA using metagenomic surveys, which assess the species
abundance and metabolic makeup of microbial communities. Population genomic
techniques are filling in our understanding of sequence diversity across isolates of
a single species and between closely related species.
It has been suggested that de Bruijn graphs be used as a data structure to make it
easier to analyze linked whole genome sequences in both population and
comparative-genomic contexts. Current methods, however, do not scale well to multiple
big genomes.
De Bruijn Graphs
Algorithm
In graph theory, the standard de Bruijn graph is obtained by taking all strings of
length L over a finite alphabet as vertices, and adding a directed edge from vertex u
to vertex v whenever the last L−1 symbols of u equal the first L−1 symbols of v, that
is, whenever the two strings overlap by L−1. In the following, we consider assembly
using a slightly modified version of the standard de Bruijn graph, built from the
L-spectrum of a genome (the set of all length-L substrings that occur in it).
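The construction can be illustrated with a minimal Python sketch. It assumes that the "slightly modified version" means the standard de Bruijn graph restricted to the L-mers that actually occur in the genome; the text does not spell out the modification, so this is an illustrative reading. The function names `l_spectrum` and `de_bruijn_graph` are hypothetical.

```python
def l_spectrum(genome: str, L: int) -> set[str]:
    """Return the set of all length-L substrings (L-mers) of the genome."""
    return {genome[i:i + L] for i in range(len(genome) - L + 1)}


def de_bruijn_graph(genome: str, L: int, alphabet: str = "ACGT") -> dict[str, list[str]]:
    """Adjacency lists of the de Bruijn graph restricted to the L-spectrum.

    Assumption: vertices are the L-mers present in the genome, and there is
    an edge u -> v whenever the last L-1 symbols of u equal the first L-1
    symbols of v and both L-mers belong to the spectrum.
    """
    spectrum = l_spectrum(genome, L)
    graph = {}
    for u in sorted(spectrum):
        # Every possible successor of u is u shifted by one symbol.
        candidates = (u[1:] + c for c in alphabet)
        graph[u] = sorted(v for v in candidates if v in spectrum)
    return graph


graph = de_bruijn_graph("ACGTACGA", 3)
print(graph["ACG"])  # ['CGA', 'CGT']
```

In this toy genome the vertex ACG has two successors, because the 3-mer ACG is followed once by T and once by A.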
Markov Chains
A stochastic process that satisfies the Markov property (sometimes described as
"memorylessness") is referred to as a Markov process. In simpler terms, it is a
process whose future outcomes can be predicted from its current state alone, and,
most importantly, such predictions are just as accurate as those that could be made
if the complete history of the process were known.
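For a discrete-time chain, the property can be written as follows. The symbols X_n (the state at step n) and x_0, ..., x_n, x (particular state values) are notation introduced here for illustration:

```latex
\[
P(X_{n+1} = x \mid X_n = x_n, X_{n-1} = x_{n-1}, \dots, X_0 = x_0)
  = P(X_{n+1} = x \mid X_n = x_n)
\]
```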
Types of Markov Chains
Proposed Methodology
We propose using a graph database for homology search, building on the graph-theoretic
model described above.
Now consider the Markov chain underlying the de Bruijn graph B({Γ1, Γ2, ..., Γm},
ΣDNA, k). Taking the position in the sequence as discrete time, the state space of
this Markov chain is the vertex set V. We supplement every edge (u, v) ∈ E with a
label puv that represents the probability of observing the nucleotide v[k − 1]
immediately after the k-mer u in the species multi-genome.
We denote the proposed graph model by MBG({Γ1, Γ2, ..., Γm}, ΣDNA, k). For every
nucleotide subsequence H, the probability pH of observing H in the species
multi-genome can be calculated as the product of the edge weights along the
corresponding walk in MBG({Γ1, Γ2, ..., Γm}, ΣDNA, k).
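The calculation of pH can be sketched in Python. The text does not say how the edge labels are estimated, so this sketch assumes they are relative transition frequencies counted over the genomes. The names `build_mbg` and `walk_probability` are hypothetical, and the result is conditional on the first k-mer of H, since only edge weights enter the product.

```python
from collections import Counter, defaultdict


def build_mbg(genomes: list[str], k: int) -> dict[str, dict[str, float]]:
    """Estimate edge labels p_uv for every pair of consecutive k-mers.

    Assumption: p_uv is the relative frequency with which k-mer u is
    followed by k-mer v across all genomes of the multi-genome.
    """
    counts = defaultdict(Counter)
    for genome in genomes:
        for i in range(len(genome) - k):
            u = genome[i:i + k]
            v = genome[i + 1:i + k + 1]
            counts[u][v] += 1
    mbg = {}
    for u, successors in counts.items():
        total = sum(successors.values())
        mbg[u] = {v: c / total for v, c in successors.items()}
    return mbg


def walk_probability(mbg: dict[str, dict[str, float]], h: str, k: int) -> float:
    """Multiply edge weights along the walk spelled by subsequence h.

    Returns 0.0 if the walk uses an edge that is absent from the graph.
    """
    if len(h) < k + 1:
        raise ValueError("h must span at least one edge (length >= k + 1)")
    p = 1.0
    for i in range(len(h) - k):
        u = h[i:i + k]
        v = h[i + 1:i + k + 1]
        p *= mbg.get(u, {}).get(v, 0.0)
    return p


mbg = build_mbg(["ACGT", "ACGA"], k=2)
print(walk_probability(mbg, "ACGT", k=2))  # 0.5
```

In the toy multi-genome, AC is always followed by CG (weight 1.0), while CG is followed by GT in one genome and GA in the other (weight 0.5 each), so the walk for ACGT has probability 0.5.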
A pair of walks x and y form a bubble (x, y) if both of the following hold:
• x and y share common starting and ending vertices;
• x and y share no other common vertices except for the starting and the ending ones.
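The two conditions above can be checked directly. The following is a minimal sketch that assumes each walk is given as a list of vertex labels; the function name `is_bubble` is hypothetical, and degenerate cases such as two identical single-edge walks are not treated specially.

```python
def is_bubble(x: list[str], y: list[str]) -> bool:
    """Check whether walks x and y (lists of vertices) form a bubble."""
    # A walk needs at least a starting and an ending vertex.
    if len(x) < 2 or len(y) < 2:
        return False
    # Condition 1: common starting and ending vertices.
    if x[0] != y[0] or x[-1] != y[-1]:
        return False
    # Condition 2: the interior vertices of the two walks are disjoint.
    return set(x[1:-1]).isdisjoint(y[1:-1])


print(is_bubble(["s", "a", "t"], ["s", "b", "c", "t"]))  # True
print(is_bubble(["s", "a", "t"], ["s", "a", "b", "t"]))  # False
```

The second call fails because both walks pass through the interior vertex "a".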
Results
The proposed model can be used to store genetic data in a graph database management
system. The search for homologous subsequences then becomes a graph traversal
problem, in which a walk representing homologous subsequences forms a chain of
bubbles.
Walks carrying homologous candidates can be scored by penalizing the number of
mutations and rewarding a higher probability of observing the candidate subsequence
in the species multi-genome. The accuracy of the underlying Markov chain model
depends strongly on the quality of the empirical data used to estimate the transition
probabilities. Because a separate, uniformly structured database can be created for
each species multi-genome, the proposed methodology allows homology search to be
parallelized and deployed on a distributed storage system.
Conclusions
In this study, a graph data model for comparative genome analysis is proposed, with a
particular emphasis on the homology search task. The model is intended to be used
with a graph database. It combines a de Bruijn graph structure with an underlying
Markov chain to represent a multi-genome as a graph. The accuracy of a homology
search based on the model depends on the quality of the empirical data used to
estimate the transition probabilities.
When the graph is traversed, genomic variants are represented by bubbles, which makes
it possible to evaluate a similarity score between subsequences and so infer
homology. Ranking complicated bubbles, however, remains a challenge when identifying
the best homology candidates.