21 MI040 JEEL VYAS BNF Assignment

 The operators AND, OR, NOT, NOR, NAND, and XOR are examples of boolean operators,
also known as logical operators. These operators are employed in programming, search engines,
algorithms, and calculations using conditional statements.
 The key insight offered by Boole is that all categorical subject-predicate constructs involve
classes of objects connected logically by a finite set of functions. Boole identified three logical
operators that correspond to the words "not," "and," and "or" in everyday English. These
three logical connectives, which are also referred to as complementation, conjunction, and
disjunction, are known as Boolean operators. They are written symbolically as NOT, AND,
and OR.
 NOT: The Boolean Operator NOT helps narrow your search by excluding certain terms from
your search. When using NOT, you are telling the database that you want information that is
related to the first term, but not the second.
 AND: The most frequently used Boolean operator is AND. Because AND instructs the database that all
of your search terms must be in the results, it will restrict your search and give you fewer
results. Although every result will contain all of the keywords you combine with AND,
keep in mind that the keywords might not always appear adjacent to one another.
 OR: The Boolean Operator OR broadens your search. Remember that in database searching, OR
means MORE results. OR tells the database that you want results that mention one or both of
your search terms. OR is a helpful operator to use if you have a search term that has several
synonyms, like preschool OR nursery school. You will notice when you do your searching that
some authors might use the term "preschool" and others will use "nursery school" to mean the
exact same thing. OR helps you make sure that you find as many articles as possible about your
topic.
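 The effect of the three operators on a search can be sketched with Python sets. This is only
an illustration: the four documents and the helper name matching are made up for the example, and a
real database would use an index rather than scanning the text.

```python
# Toy corpus: document id -> text (hypothetical example data).
documents = {
    1: "preschool reading programs",
    2: "nursery school reading programs",
    3: "preschool nutrition",
    4: "high school nutrition",
}


def matching(term):
    """Return the set of document ids whose text contains the term."""
    return {doc_id for doc_id, text in documents.items() if term in text}


preschool = matching("preschool")
nursery = matching("nursery school")
reading = matching("reading")

and_result = preschool & reading   # AND: both terms must be present (fewer results)
or_result = preschool | nursery    # OR: either term or both (more results)
not_result = preschool - reading   # NOT: the first term without the second

print(and_result, or_result, not_result)
```

 AND maps to set intersection, OR to set union, and NOT to set difference, which is why AND
narrows a search and OR broadens it.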
1) What are databases? Why are they important in biological studies?
 A database is an organized collection of structured information, or data, typically stored
electronically in a computer system. A database is usually controlled by a database management
system (DBMS). Together, the data and the DBMS, along with the applications that are
associated with them, are referred to as a database system, often shortened to just database.
 Bioinformatics databases, also known as biological databases, are computerised, well-organized
repositories of biological data that offer a uniform method for data searching and updating. They
can be described as collections of data gathered through published literature, computational
analysis, and scientific experimentation. These databases contain organised biological
data such as protein sequences, molecular structures, and DNA sequences. They are an efficient
way to store, search for, and retrieve data. The vast majority of different biological data can be
found in free-to-use biological databases. They contribute to the reduction of data redundancy.
They aid in the understanding of biological events by scientists. With the right tools, it is possible
to create new data from existing databases, such as protein structure predictions made by
artificial intelligence. The creation of new scientific discoveries and advances in agriculture,
health, and other fields is made possible by biological databases.
 The three main categories of biological databases are:
 1. Primary (archival) databases. These archive the experimental results submitted by scientists,
such as genomic sequences and macromolecular structures. The entered data are not curated, and
users can access them exactly as they were submitted. When data are added to the database, they
are assigned accession numbers. Each record is uniquely identified by its accession number, which
does not change and can be used to retrieve the same data later. Examples of primary databases for
proteins and nucleic acids include the Protein Data Bank (PDB) and GenBank.
 2. Secondary databases. These contain data derived from the analysis of primary databases.
Computational techniques are applied to the primary data, and the results are carefully curated.
Compared to the primary database, a secondary database therefore holds more useful and
informative data. InterPro is an example of a secondary (protein) database.
 3. Composite databases. These combine a number of primary database resources. This makes it
easier to find information, because fewer databases containing the same information need to be
searched. The methodology, for instance the search algorithm used, varies greatly from one
composite database to another. For instance, DrugBank provides information on medications and
their intended uses, BioGraph integrates several biomedical science concepts, and BioModels is a
database of mathematical models of biological and biomedical systems.
 FEATURES OF DATABASES:
 • Large volume of data
 • Data heterogeneity
 • Easy to add new data
 • Dynamic
 • Uncertainty
 • Data curation
 • Large-scale data integration
 • Data sharing
 IMPORTANCE OF DATABASES:
 • The need for storing and communicating large datasets has grown
 • Resources are distributed (experimental platforms and bioinformatics platforms differ)
 • To make biological data available to scientists
 • To make biological data available in computer-readable form with ease of access
 Databases serve as an information repository. They enable knowledge discovery, which is the
process of finding links between pieces of information that weren't previously recognised.
This helps researchers uncover novel biological insights from unprocessed data. Over the past ten
years or more, secondary databases have evolved into the molecular biologist's reference library,
offering a wealth of data on virtually any gene or gene product that has been studied by the
scientific community.
2) Describe FASTA file. Why is it important in Bioinformatics?
 In the text-based FASTA format, base pairs or amino acids are denoted by single-letter codes to
represent either peptide or nucleotide sequences. In FASTA format, a sequence starts with a single
line of description and is then followed by lines of sequence information. By placing a greater-than
(">") symbol in the first column, the description line may be separated from the sequence data. All
lines of text should ideally be no more than 80 characters.
 FASTA files use the .fasta file extension, which identifies files containing nucleic acid (DNA or
RNA) and protein sequences. In addition to the core sequence data, a .fasta file includes headers
with additional information, such as names and comments. Each header, with its sequence name and
symbols, is written on a single line. When biologists and biochemists want to record DNA and other
sequence data electronically, they typically use the FASTA format, stored with this file
extension.
 Software that uses FASTA format
 In most cases throughout this workshop you will encounter this format when using a reference
sequence. Database query tools like BLAST and multiple-sequence alignment algorithms accept FASTA
format as input. Also, when you download reference genomes they are delivered in this format.
 An example sequence in FASTA format is:
>gi|186681228|ref|YP_001864424.1| phycoerythrobilin:ferredoxin oxidoreductase
MNSERSDVTLYQPFLDYAIAYMRSRLDLEPYPIPTGFESNSAVVGKGKNQEEVVTTSYAFQTAKLRQIRA
AHVQGGNSLQVLNFVIFPHLNYDLPFFGADLVTLPGGHLIALDMQPLFRDDSAYQAKYTEPILPIFHAHQ
QHLSWGGDFPEEAQPFFSPAFLWTRPQETAVVETQVFAAFKDYLKAYLDFVEQAEAVTDSQNLVAIKQAQ
LRYLRYRAEKDPARGMFKRFYGAEWTEEYIHGFLFDLERKLTVVK
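 Because the format is plain text, a file like the one above can be read with a few lines of
code. The following is a minimal sketch, not a full parser: the function name parse_fasta and the
two short records in the example are made up for illustration, and features such as comment lines
or validation of the residue letters are ignored.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (description, sequence) pairs."""
    records = []
    description = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                records.append((description, "".join(chunks)))
            description = line[1:]
            chunks = []
        elif description is not None:
            # Sequence data may be wrapped over several lines; join them.
            chunks.append(line)
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


# Hypothetical two-record input used only to exercise the parser.
example = """>seq1 hypothetical first record
MNSERSDVTL
YQPFLDYAIA
>seq2 hypothetical second record
ACGTACGT
"""

records = parse_fasta(example)
for description, sequence in records:
    print(description, len(sequence))
```

 The wrapped sequence lines of each record are joined into one string, so the line width chosen
when the file was written does not affect the parsed sequence.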
 IMPORTANCE OF FASTA:
 FASTA stands for "fast-all" or "FastA". It was the first database similarity search tool,
created before the development of BLAST.
 FASTA is a method used to search for similarities between DNA and protein sequences.
 To locate matches, FASTA employs a "hashing" method that looks for short segments of identical
residues of length k. Such a string of residues is referred to as a k-tuple or ktup; it is
comparable to a word in BLAST but typically shorter.
 For protein sequences a k-tuple typically consists of two residues, whereas for DNA sequences
it typically consists of six residues.
 The query sequence is thus divided into short sequence patterns or words known as k-tuples,
and the target sequences are then searched for these k-tuples to detect similarities between the
two.
 FASTA is a good tool for similarity searches.
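 The k-tuple lookup described above can be sketched as follows. This shows only the hashing
step (indexing the query's k-tuples and finding where they recur in a target); it is not the
complete FASTA program, which goes on to score and join the matches. The function names and the two
short sequences are assumptions made for the example.

```python
from collections import defaultdict


def ktuple_index(sequence, k):
    """Map each k-tuple in the sequence to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


def shared_ktuples(query, target, k):
    """Return (query_pos, target_pos) pairs where the same k-tuple occurs in both."""
    index = ktuple_index(query, k)
    hits = []
    for j in range(len(target) - k + 1):
        # Look up the target's k-tuple in the hashed query index.
        for i in index.get(target[j:j + k], []):
            hits.append((i, j))
    return hits


# Hypothetical toy sequences; k=2 is used here only to keep the example small.
hits = shared_ktuples("ACGTAC", "TTACGT", k=2)
print(hits)
```

 Each hit records where an identical k-tuple starts in the query and in the target. A larger k
gives fewer, more specific hits, which is why DNA searches can use longer k-tuples than protein
searches.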
 A basic framework for studying bioinformatics and computational biology is to learn about
biological sequences and sequence analysis. Learning sequence analysis techniques requires a
fundamental understanding of sequence representation. The FASTA file format is one of the most
widely used in bioinformatics. FASTA is a text file format that represents nucleotide or peptide
sequences, with base pairs or amino acids written as single-letter codes. A sequence in FASTA
format begins with a single-line description starting with a > symbol, followed by the sequence
data. This sequence format was first used by the FASTA program for sequence alignment.
 Main Use: FASTA file format was initially introduced as a part of the FASTA software package
and has since been accepted as the industry norm for representing nucleotide and protein
sequences. The sequence data in the FASTA file is present in a plain-text format making it easily
comprehensible for the user to read the file as well. Manipulation and parsing of FASTA
sequences can easily be done using standard text processors and scripting
languages such as R, Python, Perl, and Ruby. The FASTA file format is capable of storing multiple
nucleotide/protein sequences and is hence also known as the FASTA database
format. FASTA files can be analyzed using most DNA analysis software on the market.
 Features: FASTA files usually begin with a header line that includes comments or other
descriptive information related to the sequence. The first line in a FASTA file usually starts with
a > symbol followed by the name of the sequence. The subsequent lines contain the nucleotide or
peptide sequence, for example A, T, G, and C for DNA, in line with the IUPAC codes. Each line
should consist of a maximum of 80 characters.
3) What is Multiple sequence alignment? How is it related to Phylogeny and
Phylogenic tree construction?
 Multiple sequence alignment, commonly known as MSA, is an active research field in
bioinformatics and a vital tool in molecular biology, bioinformatics, and computational
biology. An MSA is a sequence alignment of three or more biological sequences of similar
length, such as protein, DNA, or RNA sequences.
 A major focus of present molecular biology is the clarification of the interrelationships among the
sequence, structure, function, and evolution (FESS relationships) of a family of genes or gene
products. Multiple sequence alignment has proved to be an effective technique in numerous areas
of research, including phylogenetic reconstruction, the elucidation of functionally significant
regions, and the prediction of higher order structures of proteins and RNAs. Automatically creating a
multiple alignment from a group of related sequences is, however, far from simple. A number of
approaches to this computationally difficult problem have been explored, along with a number of
significant uses of multiple alignment for understanding the FESS relationships.
 For the sequence study of protein structure and the prediction of its function, multiple sequence
alignments are a crucial tool. They can also be used to construct phylogeny and perform other
routine sequence analysis activities. It is possible to infer homology and evolutionary links between
the sequences under study using the results of the sequence analysis.
 Recent developments in multiple sequence alignment have brought advances in accuracy, in
scalability to thousands of proteins, and in the flexibility to compare proteins that have varied
domain structures. Even so, MSA is characterized as a computational problem of very high
complexity.
 Multiple sequence alignment deals with the alignment of three or more biological sequences. Since
three or more biological sequences rarely have exactly the same length, and aligning them by hand is
very time-consuming, computational algorithms are used to create and
analyze the alignments. Many bioinformatics techniques and procedures depend
on the accuracy of the MSA, which is therefore of critical importance.
 Multiple sequence alignment is the process of aligning three or more DNA sequences using
bioinformatic software tools. The goal of this process is to identify regions of similarity
between the sequences, as well as differences that may indicate that a mutation has occurred at
some point during evolution. The main challenge with this type of analysis is finding accurate
methods for comparing sequences of very different lengths. One way these challenges can be
overcome is through the use of bioinformatic or algorithmic analysis to help identify regions that are
similar in different sequences.
 Bioinformatic analysis is considered an ideal method to perform this type of comparison since it
does not require the use of expensive equipment or highly trained personnel. Bioinformatic software
tools are readily available to most scientists, thus making this type of analysis more
cost-effective. Bioinformatic analysis is used in a number of fields to perform tasks such as
analyzing RNA expression patterns for biomarkers of disease, studying gene function,
and identifying new drug targets for diseases. It is also used to identify protein
structure and function.
 Multiple sequence alignment (MSA) is a common tool in phylogenetic analysis, where the
evolutionary relationships of different organisms are identified and organized in a hierarchical
structure in which closely related species are placed near each other. MSAs are often used to create a
phylogenetic tree, or tree of life, because of two distinct features that MSAs possess. First, most
functional domains are known and well annotated in sequences; thus, MSAs help identify these
functional domains in non-annotated sequences. The second feature is the capability to find
conserved regions that are known to be functionally important. The main purpose of a phylogenetic
tree is to depict the evolution of the organisms. Thus, correctly reconstructing the evolution and
representing it as a phylogenetic tree is a critical task. Phylogenetic trees are usually generated by the
distance methods or character-based methods.
 A phylogenetic tree can be made using multiple sequence alignments. [52] There are two factors that
make this possible. The first is that non-annotated sequences can be aligned using functional
domains that have been identified in annotated sequences. The other is the discovery of conserved
areas that are known to be crucial for certain functions. In order to evaluate and discover
evolutionary links through sequence homology, multiple sequence alignments can now be used. It is
possible to identify point mutations as well as insertion or deletion events (known as indels).
 By discovering conserved domains, multiple sequence alignments can also be utilised to find
functionally significant sites, such as binding sites, active sites, or sites corresponding to other
critical functions. When examining multiple sequence alignments, it is helpful to consider several
properties of the sequences being compared. These properties include identity, similarity, and
homology. Identity means that the sequences have the same residue at their respective positions.
Similarity, on the other hand, means that the residues being compared are alike according to some
quantitative measure. For instance, in nucleotide sequences, pyrimidines are considered similar to
one another, as are purines. Similarity in turn suggests homology: the more
similar two sequences are, the more likely they are to be homologous. This similarity in
sequences can then help to identify common ancestry.
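 Identity and similarity can be computed directly from the rows of an alignment. The sketch
below is an illustration under stated assumptions: the three aligned sequences are invented, gap
columns are simply skipped, and the distance is taken as 1 minus identity (the proportion of
differing sites), which is one simple choice of input for the distance methods mentioned above.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def identity_and_similarity(a, b):
    """Return (identity, similarity) fractions for two aligned nucleotide rows.

    Columns where either row has a gap ('-') are skipped.
    """
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    compared = identical = similar = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x == y:
            identical += 1
            similar += 1
        elif {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            # Different residues of the same chemical class count as similar.
            similar += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return identical / compared, similar / compared


# Hypothetical three-row alignment (equal-length rows, '-' marks a gap).
alignment = {
    "seqA": "ACGT-ACGTA",
    "seqB": "ACGTTACGTA",
    "seqC": "ATGT-ACACT",
}

names = sorted(alignment)
distances = {
    (m, n): 1 - identity_and_similarity(alignment[m], alignment[n])[0]
    for m in names
    for n in names
    if m < n
}
print(distances)
```

 In this toy alignment seqA and seqB are at distance 0, while seqC differs from both, so a
distance-based tree would group seqA with seqB first.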
4) What are primers used in Molecular biology? What is their importance?
 A primer is a short, single-stranded nucleic acid that is utilised by all living things to start the
production of DNA. A primer must be linked to the template before DNA polymerase may start a
complementary strand because DNA polymerase enzymes, which are in charge of DNA replication,
can only add nucleotides to the 3' end of an existing nucleic acid. After attaching to the RNA primer,
DNA polymerase synthesises the entire strand by adding nucleotides. Later, the RNA strands must
be precisely cut out and replaced with DNA nucleotides, leaving a nick that must be patched up
using an enzyme called ligase.
 Several enzymes, including Fen1, Lig1, and others that associate with DNA polymerase, are needed
for the removal of the RNA primer to ensure that the RNA nucleotides are removed and the DNA
nucleotides are added. Living things only employ RNA primers, but DNA primers are more
temperature stable and are typically used in biochemistry and molecular biology laboratory
techniques that call for in vitro DNA synthesis, such as DNA sequencing and polymerase chain
reaction.
 In the lab, primers can be created for specialised processes like the polymerase chain reaction (PCR).
The melting temperature of the primers and the reaction's annealing temperature are two unique
factors that must be taken into account while designing PCR primers. Additionally, the primer's
binding sequence on the template DNA needs to be carefully chosen. To do this, a tool called the
Basic Local Alignment Search Tool (BLAST) can be used to search the DNA and identify unique and
specific sites where the primer can bind.
 The primary role of the primer is to provide a free 3' end that serves as the initiation point
for the polymerase. A pair of primers, one on the template strand and the other on the
complementary strand, bind at opposite ends of the sequence to be amplified, and elongation
proceeds from the 3' end of each primer.
 Both primers are extended in the 5'-3' direction: the forward primer anneals to the strand that
runs 3'-5', and the reverse primer anneals to the strand that runs 5'-3'. The elongation
step therefore produces two new strands, giving two molecules of double-stranded DNA.
 The forward and reverse primers are not complementary to each other; instead, the forward primer
binds at one end of the target, while the reverse primer binds at the other end.
 Compared to replication, PCR makes extensive use of DNA primers because:
 DNA primers are more stable than RNA primers.
 RNA primers cannot be removed after the process is finished, since the polymerization occurs
in one direction only.
 DNA primers are simple to create.
 Because the product being amplified is DNA, it is advisable to use sequence-specific DNA
primers.
 Primer base sequences should be between 16 and 25 bases long and should contain between 40 and
60 percent G+C. Since G-C base pairing is stronger than A-T pairing, this improves primer
stability. A primer's 3' end should have a C, G, CG, or GC. This improves priming effectiveness by
preventing the primer's end from breathing, thereby preventing dimerization.
 The annealing temperature of the primer is also crucial, because it is affected by the primer's
length and sequence. The annealing temperature (Ta) is the temperature at which the primer binds
to the template most effectively and specifically (Meyers, p 547).
 It should be as high as possible in order to prevent random binding to other sequences and also
to reduce the chances of cyclization during the PCR. Ta is taken to be 15 less than the melting
temperature (Tm). The Tm of a primer is calculated by assigning a temperature of 2 °C for every A
or T in the sequence and 4 °C for every G or C. Primer sequences should also be checked for
self-complementarity, the formation of hairpin loops, and the tendency to hybridize preferentially
with each other rather than with the template.
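 The length, G+C, 3'-end, and Tm rules above are simple enough to check in code. The sketch
below applies exactly the figures given in the text; the function names and the 20-base primer are
made up for illustration, the offset between Tm and Ta is passed in explicitly rather than fixed,
and checks for self-complementarity or hairpins are not attempted.

```python
def gc_fraction(primer):
    """Fraction of bases in the primer that are G or C."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def melting_temp(primer):
    """Tm in degrees C by the rule in the text: 2 per A or T, 4 per G or C."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def annealing_temp(primer, offset):
    """Ta as Tm minus an offset supplied by the caller."""
    return melting_temp(primer) - offset


def check_primer(primer):
    """Apply the design rules listed in the text and report each result."""
    primer = primer.upper()
    return {
        "length_ok": 16 <= len(primer) <= 25,
        "gc_ok": 0.40 <= gc_fraction(primer) <= 0.60,
        "ends_in_g_or_c": primer[-1] in "GC",
        "tm": melting_temp(primer),
    }


primer = "AGCTTGACCTGAAGTCGATC"  # hypothetical 20-base primer
report = check_primer(primer)
print(report)
print(annealing_temp(primer, offset=15))
```

 For this primer, 10 A/T bases and 10 G/C bases give Tm = 2 x 10 + 4 x 10 = 60 °C, and an
offset of 15 gives Ta = 45 °C.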
5) Define Restriction enzymes. Why is there a need to analyse restriction
sites in silico?
 A restriction enzyme, also known as a restriction endonuclease, is a bacterial protein that cleaves DNA
at particular locations along the molecule. In the bacterial cell, restriction enzymes cut foreign
DNA, destroying infecting organisms. Because restriction enzymes can be extracted from bacterial
cells and employed in the lab to work on DNA fragments, including those containing genes, they are
essential tools for recombinant DNA technology.
 A bacterium uses a restriction enzyme to protect itself from bacteriophages, also known as phages,
which are bacterial viruses. A phage inserts its DNA into the bacterial cell when it infects a
bacterium so that it can multiply. By slicing the phage DNA into several pieces, the restriction
enzyme hinders reproduction of the phage DNA. Restriction enzymes get their name because they
restrict the range of bacteriophage strains that can infect a bacterium.
 Each restriction enzyme recognizes a short, specific sequence of nucleotide bases (the four basic
chemical subunits of the linear double-stranded DNA molecule: adenine, cytosine, thymine, and
guanine). These regions are called recognition sequences, or recognition sites, and are randomly
distributed throughout the DNA. Different bacterial species make restriction enzymes that recognize
different nucleotide sequences.
 When a restriction endonuclease detects a sequence, it cuts through the DNA molecule by catalysing
the hydrolysis of the bonds between adjacent nucleotides. By hiding their recognition sequences,
bacteria stop this kind of DNA degradation from happening to their own DNA. The recognition
sequence is changed and shielded from the endonuclease by methylases, which add methyl groups
to adenine or cytosine bases. The restriction-modification system of a bacterial species is made up of
the restriction enzyme and the corresponding methylase.
 There are four categories of restriction enzymes—I, II, III, and IV—that have historically been
recognised. These categories differ mainly in structure, cleavage site, specificity, and cofactors. In
contrast to the type II system, where the restriction enzyme is independent of its methylase, types I
and III enzymes are comparable in that both restriction and methylase actions are carried out by one
sizable enzyme complex. Type II restriction enzymes also differ from types I and III in that they
cleave DNA at particular locations inside the recognition site, whereas types I and III cleave DNA
randomly, often hundreds of bases from the recognition sequence. Thousands of type II restriction
enzymes have been found in many different bacterial species.
 These enzymes can identify a few hundred different sequences, which are typically four to eight
bases long. Type IV restriction enzymes have poor sequence selectivity and only cleave methylated
DNA.
 In silico Restriction Digestion (activity)
 Restriction enzymes work as molecular scissors. The ones we employ in molecular biology cut within
well-known sequences that occur often enough, but not too often, to separate our DNA into
analyzable fragments. 6-cutters are frequently used by molecular biologists.
This means that the location of digestion is "restricted" to a recognition sequence of 6 nucleotides.
As mentioned, these sequences are typically palindromic.
 Consider a strand of DNA that is linear in nature. You get two pieces when you cut it once, as with a
piece of string. Think about a plasmid now. It has already been mentioned that this is a circular piece of
DNA. A circle has no beginning or end. One piece of DNA, rather than two, is produced when the plasmid is
cut once. Bear this in mind when digesting circular plasmids. In this case, nucleotide 1 is adjacent to and
contiguous with the sequence's final nucleotide.
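 The difference between digesting linear and circular DNA can be sketched in a few lines of
code. This is an illustration only, not what UGENE does internally: the function names and the
20-base test sequence are invented, the recognition site is the HindIII site AAGCTT with the cut
assumed to fall after the first base, and only one strand is searched, which is sufficient because
the site is palindromic.

```python
def find_sites(sequence, site, circular=False):
    """Return 0-based start positions of the recognition site in the sequence."""
    sequence = sequence.upper()
    site = site.upper()
    # For circular DNA, append the first len(site) - 1 bases so that a site
    # spanning the junction between the last and first nucleotide is found.
    search = sequence + sequence[:len(site) - 1] if circular else sequence
    positions = []
    start = search.find(site)
    while start != -1:
        positions.append(start)
        start = search.find(site, start + 1)
    return positions


def fragment_lengths(sequence, site, cut_offset, circular=False):
    """Return the fragment lengths produced by cutting at every site."""
    n = len(sequence)
    cuts = sorted({(p + cut_offset) % n if circular else p + cut_offset
                   for p in find_sites(sequence, site, circular)})
    if not cuts:
        return [n]  # uncut molecule
    if circular:
        # Each fragment runs from one cut to the next, wrapping past the origin.
        return [(cuts[(i + 1) % len(cuts)] - cuts[i]) % n or n
                for i in range(len(cuts))]
    bounds = [0] + cuts + [n]
    return [b - a for a, b in zip(bounds, bounds[1:])]


# Hypothetical 20-base sequence containing two AAGCTT sites.
seq = "GGAAGCTTCCCCAAGCTTGG"
linear = fragment_lengths(seq, "AAGCTT", cut_offset=1)
circle = fragment_lengths(seq, "AAGCTT", cut_offset=1, circular=True)
print(linear, circle)
```

 Two cuts give three fragments on the linear molecule but only two on the circular one, and in
both cases the fragment lengths add up to the length of the sequence.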
In silico digestion
1. This activity is meant to supplement Identification of DNA (activity)
2. Launch UGENE and open the following files:
 pGlo.gb
 pUC19.gb
 pcDNA3.1.gb
 pUC with Insert
3. In the Objects menu, right-click on the sequences and select “Mark as circular”
 the sequences will now be treated as circular DNA
 the first nucleotide and the last nucleotide become adjacent as a continuous sequence this
way
4. From top menu, select Actions → Analyze → Find restriction sites

 this will load a set of default restriction enzymes
 if none are loaded, select them individually (found in alphabetical order)
 choose: BamHI, BglII, ClaI, DraI, EcoRI, EcoRV, HindIII, KpnI, PstI, SalI, SmaI, XbaI,
XmaI, NotI
5. Actions → Cloning → Digest into fragments
 Choose your enzyme
 Try HindIII for each plasmid
 Fragments will be added to the circular view and annotations for these fragments will be
added
 You can use this information to calculate how many fragments come from enzymes based on
how many times they cut and by the nucleotide coordinates found in the annotation of the
fragment
6) Why is there a need for analysis of DNA sequences?
 DNA Sequencing: The term "DNA analysis" is used to describe the interpretation of genetic
sequences, and it has a wide range of applications. It can be used to distinguish between members of
the same species as well as between distinct species. Unsurprisingly, there is a greater difference
between the DNA sequences of two different species than between two members of the same
species. Nevertheless, substantial DNA can still be shared by various species. For instance, bananas
and humans both have roughly 50% of the same DNA.
 The black and white photos with bands may be the most identifiable representations of DNA analysis
(they look a bit similar to barcodes). Together, these bands can serve as a kind of "genetic
fingerprint" that can be used to compare various samples. Each of these bands represents a different
DNA fragment. A DNA sample from a crime scene can be rapidly and accurately compared to the
DNA of a suspect using this method, and a person's biological tie to their purported father can be
established or disproven.
 Discovery of DNA analysis:
 Before DNA analysis, gathering fingerprints from crime scenes was the main method used to solve
crimes. The issue with this was that criminals could easily evade capture by wearing gloves or
cleaning surfaces. However, DNA can be gathered from nearly any source. It's considerably more
difficult for offenders to avoid shedding biological materials like saliva, hair, perspiration, skin,
earwax, or mucus.
 These biological materials initially began to be used in the investigation of crimes in the mid-1980s,
when the idea of DNA analysis first emerged. Alec Jeffreys, an English biologist, created the DNA
profiling technique, also known as genetic fingerprinting, to analyse DNA in order to match suspects
to materials gathered from crime scenes. Since then, technology has advanced significantly, making
DNA analysis more affordable, available, and useful for a variety of official and consumer uses.
 Types of DNA Analysis:
 Restriction Fragment Length Polymorphism (RFLP)
 Short Tandem Repeat (STR) Analysis
 Single Nucleotide Polymorphism (SNP) Analysis
 Y DNA Analysis
 Mitochondrial DNA (mtDNA) Analysis:
 Uses of DNA analysis
 Paternity and other relationship tests:
By demonstrating or disproving the biological relationship between an alleged father and his child,
for example, DNA analysis can be used to determine if a specific person is biologically connected to
another person.
If there are questions regarding paternity, it is possible to test more than one supposed father.
Another method of determining paternity is to take a mother's blood sample, isolate the baby's DNA,
and compare it to the DNA of the presumed father.
If you're eager to establish a different kind of biological relationship, DNA analysis can also be used
to confirm or refute maternity, siblingship, grandparentage, and many other relationships.
 Ancestry
DNA analysis has given researchers a new tool for studying human populations, a field that has been
researched for decades. Researchers can examine samples acquired from living people and ancient
artefacts all around the world to determine the origins of human populations and to investigate
various ethnic groups. For instance, using mitochondrial DNA analysis, researchers in the 1980s
concluded that all humans are descended from a single woman known as "Mitochondrial Eve" who lived
hundreds of thousands of years ago. Consumer ancestry DNA kits that let people learn
more about their ancestry and ethnic mix have been developed using research like this and
are still being developed now.
 Identification
DNA testing is used in both archaeology and criminal investigations to identify human remains.
However, DNA analysis is also utilised to obtain a person's DNA profile for people in high-risk
occupations or situations (such as oil rig workers or military personnel). As DNA is significantly
more durable than conventional forms of identification, it can be used to identify someone in the
event of a fatal accident.
The US army now uses DNA profiles to supplement the traditional ‘dog tag’, and every new military
recruit is required to provide saliva and blood samples. These can then be used to identify those
killed while discharging duties. Outside the military however, DNA is rarely used for this purpose, as
fingerprints and dental records are still preferred.
 Predicting Disease
 Genetic ties to diseases can also be found through DNA analysis. To identify the genetic elements
that raise a person's risk of developing a condition (like Alzheimer's disease), researchers look at
groups of people who already have the disorder. Because of this, online DNA tests have been created
that can determine your "genetic propensity" to a variety of diseases.
 Naturally, genetic predisposition tests are more strictly controlled than other types of DNA analysis
and typically require a doctor's approval. It's crucial to note that these tests only reveal whether you
have a higher chance of contracting a disease rather than diagnosing a sickness. A genetic test can
reveal an increased or decreased genetic propensity to a variety of illnesses, including rheumatoid
arthritis, breast cancer, and type 2 diabetes.
 The future of DNA analysis
Large volumes of reasonably high-quality DNA were required for the initial DNA analysis
techniques. However, newer techniques can give DNA results in a lot less time, with considerably
lower quality materials. Additionally, methods for obtaining viable DNA from inaccessible or
polluted sources have been devised.
DNA analysis still has its limitations and occasionally falls short of expectations. DNA analysis
typically yields information or proof, but not always definite conclusions. The information stored in
your DNA is becoming more and more accessible and affordable, yet the interpretation of DNA
results is still primarily dependent on experts.
 Tracking Epidemics
When a virus outbreak occurs, it is crucial to know the structure of the virus and how it interacts with
the human body, because this lets us monitor the spread of the disease more accurately. Genome
sequencing was used to monitor the Ebola outbreak, and it is currently being used to
monitor COVID-19.
With COVID-19, we are dealing with a pandemic rather than an epidemic. When new cases begin to
emerge in a country, scientists use sequencing to support contact tracing.
This can help identify the source of outbreaks in an effort to prevent a second wave.
7) Name any two software tools for submitting DNA sequences online. Why is submission
of DNA sequences to an online repository important?
 Genomic Testing:
 Genomic testing is used to investigate gene mutations. Mutations can indicate disorders and
diseases, some of them as serious as cancer. Genomic sequencing tests also measure varying levels of
gene expression in order to reveal the relationships between a group of genes. Genomic analysis is
therefore essential for understanding gene activity and reaching a reliable diagnosis.
 Different Types of Genomic Testing:
Genomic tests are medical tests done to confirm or rule out the possibility that a person has or will
develop a medical condition. The following types of genomic testing are performed by geneticists or
genetic counsellors:
 Diagnostic testing
Diagnostic testing identifies genetic disorders that an individual may have. It is based
on clinical presentations that clinicians have drawn up to support a preliminary diagnosis.
This type of genomic test examines an allele or a particular gene variant linked to a certain
disease. Bioinformatics tools assist with genomic sequencing for further analysis.
 Clinical predictive testing
Clinical predictive testing is done to see whether an individual is
susceptible to a certain disorder. Free and open-source bioinformatics tools help with
examining the causative gene variant for targeted tests.
 Pharmacogenomic testing
Free and open-source bioinformatics tools help with pharmacogenomic testing by
studying the genomic determinants of different drug responses. This analysis
helps assess whether a particular medicine would be effective.
 Tumour testing
Tumour testing sequences tumour DNA so that the mutations within it can be
studied, with the help of bioinformatics tools. This kind of genomic testing is highly
significant for identifying malignancies and tumours.
 How Bioinformatics Tools Help with Genomic Testing
 Good bioinformatics software supports genetic analysis and next-generation sequencing.
Next-generation sequencing technologies are used to examine gene mutations and determine the type
of disease that may develop.
 In India, for example, researchers have begun
using bioinformatics methods to sequence the genome of SARS-CoV-2, the virus that causes COVID-19.
Although it is an RNA virus, it also carries proteins on its membrane. These elements contribute to
the virus's current characteristics and its rapid mutation.
 To understand how a virus changes, genome sequencing is used to examine how its DNA,
RNA, and proteins are organised.
 Scientists in India have used bioinformatics techniques with next-generation sequencing and genetic
testing capabilities to analyse these changes.
 Popular bioinformatics software makes this kind of genomic data analysis possible.
 Microarray technology thus serves as a foundation for bioinformatics tools that convert complex
genomic data into useful information.
 Paid and free bioinformatics tools also support the interpretation of sequencing data by aligning and
assembling DNA fragments. They can be used for data mining, RNA expression
profiling, variant calling, data visualisation, and gene fusion detection.
Popular bioinformatics applications are well suited to sequence analysis and curation. The best solutions in
the field have built-in computational and big data analysis tools for genome sequencing. Widely used
applications include the following:
 geWorkbench
 BioPerl
 BioJava
 Biopython
 InterMine
 EMBOSS
 Clustal Omega
 BLAST
 BEDTools
 Bioclipse
 Bioconductor
BioPerl
Best for: Computational molecular biology
BioPerl is most widely used for computational molecular biology. Its modules follow the
standard CPAN style, which is the distinguishing feature of this platform. It
offers Perl modules for working with peptide and nucleotide sequence data.
BioPerl Modules:
 Parsing BLAST output
 Graphical rendering
 EST clusters
 Cytogenetic and radiation hybrid maps
BioPerl Download: BioPerl can be installed on Linux, Unix, and macOS.
 Clustal Omega
Clustal Omega is a tool for multiple sequence
alignment. It supports different input types, such as sequences to be aligned and HMM
profiles.
Features of Clustal Omega Alignment Tool:
 Pairwise sequence alignment
 Progressive alignment
 Fast guide tree
 HMM profile techniques
 Phylogenetic tree generation
8) What is Data Mining for proteins?
 Data mining is the process of extracting useful information from an accumulation of data, often from
a data warehouse or collection of linked data sets.
 Data mining tools include powerful statistical, mathematical, and analytics capabilities whose
primary purpose is to sift through large sets of data to identify trends, patterns, and relationships to
support informed decision-making and planning.
 Accurately identifying functional sites in proteins is one of the most important topics in bioinformatics
and systems biology.
 In bioinformatics, identifying protease cleavage sites in protein sequences can aid drug/inhibitor design.
In systems biology, post-translational protein-protein interaction activity is one of the major components
for analyzing signaling pathway activities.
 Determining functional sites through laboratory experiments is normally time-consuming and expensive.
Computer programs have therefore been widely used for this kind of task.
 Mining protein sequence data using computer programs covers two major issues:
 1) discovering how amino acid specificity affects functional sites and
 2) discovering what amino acid specificity is. Both need a proper coding mechanism prior to using a
proper machine learning algorithm.
 The development of the bio-basis function neural network (BBFNN) has opened a new approach to protein
sequence data mining.
 The bio-basis function used in BBFNN is biologically sound: it encodes the biological information in
protein sequences well, i.e. it measures the similarity between protein sequences well.
 BBFNN has therefore outperformed conventional neural networks in many areas of protein
sequence data mining, from protease cleavage site prediction to disordered protein identification.
 Work in this area covers the variants of BBFNN and their applications in mining protein sequence data.
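The idea of coding a peptide by its similarity to reference peptides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the exponential form below is one commonly described version of the bio-basis function, the match/mismatch scores are a toy stand-in for a real amino acid substitution matrix, and the names `similarity`, `bio_basis`, and `encode` are hypothetical.

```python
import math


def similarity(x, b, match=2, mismatch=-1):
    """Toy similarity score between two equal-length peptides.

    Stand-in for summing substitution-matrix scores position by position;
    the match/mismatch values are illustrative assumptions.
    """
    if len(x) != len(b):
        raise ValueError("peptides must have the same length")
    return sum(match if xi == bi else mismatch for xi, bi in zip(x, b))


def bio_basis(x, basis, gamma=1.0):
    """Bio-basis value of peptide x with respect to one basis peptide.

    Assumed form: exp(gamma * (s(x, b) - s(b, b)) / s(b, b)),
    which equals 1 when x is identical to the basis peptide.
    """
    s_bb = similarity(basis, basis)
    return math.exp(gamma * (similarity(x, basis) - s_bb) / s_bb)


def encode(x, bases, gamma=1.0):
    """Turn a peptide into a numeric feature vector, one value per basis."""
    return [bio_basis(x, b, gamma) for b in bases]


bases = ["AVLK", "GGSP"]          # hypothetical basis peptides
features = encode("AVLR", bases)  # numeric input for a learning algorithm
print(features)
```

The resulting vector is larger for basis peptides that resemble the query, which is the kind of numeric coding a machine learning algorithm needs before it can be trained on sequence data.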
9) Write short notes on the following in silico studies:

1. Importance of studying protein parameters:
Proteins are among the most important components of life. Each protein has a particular shape and
structure, and proteins work together in a complex and coordinated manner to sustain life.
Knowing how proteins function therefore helps us understand the nature of life, and a close
examination of their structures can show us how they operate. In structural biology, the function of
proteins is investigated using knowledge of their structure.
Although we have so far found living things only on Earth, we may discover more in the future. To
understand the traits of any such life, we must first understand how life emerged and
developed on Earth.
Protein parameters are also studied for practical purposes:
 Measuring tissue or cellular protein content in biological research.
 Diagnosing disease with ELISA and western blot, which requires an understanding of body physiology.
 Identifying the underlying cause of a disease.
 Quality control of protein supplements and packaged food products.
 Producing vaccines.
 Forensic investigation, and the study of hormone and enzyme biosynthesis.
 Identifying bacteria.
2. Importance of studying protein glycosylation:
 Glycosylation is the main source of the microheterogeneity of proteins (glycoforms), which show
complexity at both the molecular and cellular levels.
 Under normal physiological conditions, protein sugar prints are conserved, not random.
Glycosylation can serve a variety of purposes.
 Physical properties it affects include folding, trafficking, packing, stabilisation, protease protection,
quaternary structure, and the organisation of water structure.
 Properties related to recognition and biological triggering are characterised by weak interactions,
multiple presentation, and precise geometry.
 Many of these properties may only be functional in a particular biological setting. Changes in sugar
prints may both reflect and cause physiological changes, such as those seen in cancer and rheumatoid
arthritis.
 Glycosylation needs to be assessed in many systems, which requires precise monitoring of sugar
prints using automated and predictive techniques.
 Glycosylation often affords a sensitive means of monitoring pharmaceutical products for quality
control.
3. Translation tools
 TRANSLATION: DNA to PROTEIN
 Simple translation tools - DNA to protein sequences:
 Open Reading Frame Finder (NCBI) - searches for open reading frames (ORFs) in the DNA
sequence you enter. The program returns the range of each ORF, along with its protein translation.
Use ORF Finder to search newly sequenced DNA for potential protein-coding segments, and verify the
predicted protein using SMART BLAST or regular BLASTP.
 Six-frame translations can also be done with tools hosted at Tuebingen, Russia, and Bioline.
 EMBOSS Sixpack (EMBL-EBI) - reads a DNA sequence and outputs the three forward and
(optionally) three reverse translations in a visual manner. Alternatively, use EMBOSS Transeq.
 MBS Translator (JustBio Tools) - translates specifically from
ATG and presents the results with the nucleotide sequence overlaying the amino acid sequence,
which is ideal for cut and paste into a manuscript. You need to register to use this free tool.
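What these translation tools do can be illustrated with a short Python sketch. It is a minimal illustration, not the method used by any of the tools named above: it assumes the standard genetic code, treats an ORF simply as a stretch from a methionine (M) to the next stop codon, and uses a made-up example sequence. The function names are hypothetical.

```python
import re
from itertools import product

BASES = "TCAG"
# Standard genetic code, amino acids listed in TCAG codon order ("*" = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A<->T, C<->G)."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the first base; unknown codons give 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Translate the three forward (+1..+3) and three reverse (-1..-3) frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def find_orfs(protein, min_length=2):
    """Return M...stop stretches (without the stop) found in one translated frame."""
    hits = re.finditer(r"M[^*]*(?=\*)", protein)
    return [m.group() for m in hits if len(m.group()) >= min_length]


example = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"  # made-up sequence
frames = six_frame_translation(example)
for name, protein in frames.items():
    print(name, protein, find_orfs(protein))
```

Real ORF finders add options this sketch leaves out, such as alternative start codons, other genetic codes, and a minimum ORF length measured in nucleotides.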
4. Importance of signal peptides
 A signal peptide is a short peptide present at the N-terminus of most newly synthesized proteins that
are destined for the secretory pathway. These proteins include those that reside inside
certain organelles, are secreted from the cell, or are inserted into most cellular membranes. Although
most type I membrane-bound proteins have signal peptides, the majority of type II and
multi-spanning membrane-bound proteins are targeted to the secretory pathway by their
first transmembrane domain, which biochemically resembles a signal sequence except that it is not
cleaved. Signal peptides are a kind of target peptide.
 Signal peptides prompt a cell to translocate the protein, usually to the cellular membrane.
In prokaryotes, signal peptides direct the newly synthesized protein to the SecYEG
protein-conducting channel, which is present in the plasma membrane.
 A homologous system exists in eukaryotes, where the signal peptide directs the newly synthesized
protein to the Sec61 channel, which shares structural and sequence homology with SecYEG, but is
present in the endoplasmic reticulum. Both the SecYEG and Sec61 channels are commonly referred
to as the translocon, and transit through this channel is known as translocation. While secreted
proteins are threaded through the channel, transmembrane domains may diffuse across a lateral gate
in the translocon to partition into the surrounding membrane.
5. Functional (conserved) domain search:
 Using conserved domains to find protein homologs: if you study proteins, one thing you might
want to do is look for homologs of a certain protein using its sequence. This can shed light on the
function and mechanism of the protein and point out potential models for the protein of interest
among proteins with known three-dimensional structures. Proteins that show a high degree of
sequence similarity to protein domain fingerprints are grouped in the Conserved Domain
Database (CDD), and any protein sequence can be used to search these groups. Since the scoring
matrices utilised are designed to find significant functional sites and sequence motifs that are highly
conserved within the domain, such searches are frequently more sensitive than normal BLAST
searches. The outcomes can then be used to investigate how these proteins have evolved or to
pinpoint these crucial structure and sequence characteristics.
6. Protein structure prediction
 Sequence and structural homology are the main foundation for predicting protein
structure. Because a protein's activity depends primarily on its three-dimensional (3D) structure, protein
structure prediction, or modelling, is crucial. In turn, a protein's amino acid sequence determines
its 3D structure, and a slight change in the sequence can result in significant changes
to the protein's native structure. Although a thorough understanding of protein
3D structure is critical, it can be challenging to determine a protein's native structure as it exists in the
physiological environment of the body. The structures of proteins and protein-ligand complexes are
primarily determined using X-ray crystallography and nuclear magnetic resonance (NMR)
spectroscopy. These experimental methods take a long time.
 Understanding a protein's function requires a basic understanding of its tertiary, or
three-dimensional, structure. In X-ray crystallography, proteins are
crystallised and their structures are then determined through
X-ray diffraction. Determining a 3D structure this way is not always easy and can take three to five
years. NMR is another effective method for determining protein structure.
With NMR, the protein can be studied in an aqueous environment that may approximate
its real physiological state more closely than X-ray crystallography allows.
10) What are the various types of protein structures? What are the uses of
visualization of proteins?
 Proteins are polymers of amino acids found in living things. A polypeptide chain is made up of
amino acids joined by peptide bonds.
 One or more polypeptide chains folded into a three-dimensional shape make up a protein. Proteins
contain a variety of folds, loops, and curves that give them their complex shapes.
 Protein folding happens spontaneously. Chemical interactions between
different segments of the polypeptide chain help hold the protein together and give it its shape.
 Protein molecules can be divided into two categories: globular proteins and fibrous proteins. Most
globular proteins are roughly spherical, soluble, and compact. Most fibrous proteins
are elongated and insoluble. Globular and fibrous proteins may exhibit one or more of the four
levels of protein structure.
Four Protein Structure Types
1. Primary Structure: The specific order in which amino acids are joined to form a protein
is referred to as primary structure. Twenty amino acids are used to build proteins. Amino acids
typically share the following structural features:
A carbon (the alpha carbon) bonded to the four groups below:
A hydrogen atom (H)
A carboxyl group (-COOH)
An amino group (-NH2)
A "variable" group or "R" group
2. Secondary Structure:
Secondary structure refers to the local coiling or folding of a polypeptide chain.
Proteins have two main kinds of secondary structure. The
alpha (α) helix is one. This structure resembles a coiled spring and is held in place by
hydrogen bonds along the polypeptide chain. The beta (β) pleated sheet is the second form
of secondary structure seen in proteins. It is held together by hydrogen bonds between neighbouring
segments of the folded chain, which gives the structure a folded or
pleated appearance.
3. Tertiary Structure:
The term "tertiary structure" describes the complete 3-D structure of a protein's polypeptide chain.
The links and forces that keep a protein in its tertiary structure come in many different forms.
Hydrophobic interactions play a significant role in how a protein folds and takes on its shape. The
amino acid's "R" group can be hydrophilic or hydrophobic. While amino acids with hydrophobic "R"
groups will try to stay away from water and position themselves closer to the centre of the protein,
amino acids with hydrophilic "R" groups will seek contact with their aqueous environment.
Hydrogen bonds within the polypeptide chain and between amino acid "R" groups help stabilise
protein structure by holding the protein in the shape established by the hydrophobic interactions.
Protein folding can also bring positively and negatively charged "R" groups close enough
to one another to form ionic bonds.
Folding can likewise lead to covalent bonding between the "R" groups of cysteine amino acids.
This kind of bonding forms a disulfide bridge. Van der Waals forces also play a role in
stabilising protein structure. These interactions involve the attractive and repulsive
forces that develop between polarised molecules and influence the bonding that takes place
between them.
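The hydrophobic or hydrophilic character of "R" groups described above can be estimated directly from a sequence. The Python sketch below is a minimal illustration that assumes the Kyte-Doolittle hydropathy scale, which is one common choice and is not named in the text; the peptide and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values: positive = hydrophobic, negative = hydrophilic.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathy: mean scale value over all residues."""
    sequence = sequence.upper()
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def hydropathy_profile(sequence, window=5):
    """Mean hydropathy of every window of consecutive residues."""
    return [
        gravy(sequence[i:i + window])
        for i in range(len(sequence) - window + 1)
    ]


peptide = "LLLLLKKKKK"  # made-up peptide: hydrophobic half, then charged half
print(gravy(peptide))
print(hydropathy_profile(peptide))
```

Windows with high values are candidates for buried or membrane-spanning segments, while low values suggest segments exposed to water. This is a rough guide and not a substitute for a solved structure.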
4. Quaternary Structure
Quaternary Structure refers to the structure of a protein macromolecule formed by interactions
between multiple polypeptide chains. Each polypeptide chain is referred to as a subunit. Proteins
with quaternary structure may consist of more than one of the same type of protein subunit. They
may also be composed of different subunits. Hemoglobin is an example of a protein with quaternary
structure. Hemoglobin, found in the blood, is an iron-containing protein that binds oxygen
molecules. It contains four subunits: two alpha subunits and two beta subunits.
Uses of visualization of proteins
Once a protein structure has been solved, it has to be presented in a three-dimensional
view based on the solved Cartesian coordinates. Before computer visualization software was
developed, molecular structures were represented by physical models made of metal wires, rods, and
spheres. With advances in computer hardware and software, sophisticated
computer graphics programs have been developed for visualizing and manipulating complicated
three-dimensional structures. Computer graphics help to analyze and compare protein structures
and to gain insight into the functions of the proteins.
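Visualization programs start from these Cartesian coordinates, which are commonly distributed in PDB-format files. The Python sketch below is a minimal illustration that assumes the fixed-column layout of PDB ATOM records; the two records are made-up example data and the function names are hypothetical.

```python
import math

# Two made-up ATOM records laid out in PDB fixed columns
# (x in columns 31-38, y in 39-46, z in 47-54).
PDB_LINES = [
    "ATOM      1  N   GLY A   1      11.104   6.134  -6.504",
    "ATOM      2  CA  GLY A   1      12.560   6.216  -6.462",
]


def parse_atom(line):
    """Extract the atom name, residue, chain and coordinates from one record."""
    return {
        "name": line[12:16].strip(),
        "residue": line[17:20].strip(),
        "chain": line[21],
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }


def load_atoms(lines):
    """Parse every ATOM or HETATM record and skip all other lines."""
    return [parse_atom(l) for l in lines if l.startswith(("ATOM", "HETATM"))]


def distance(a, b):
    """Euclidean distance between two parsed atoms, in the file's units."""
    return math.dist(a["xyz"], b["xyz"])


atoms = load_atoms(PDB_LINES)
print(atoms)
print(round(distance(atoms[0], atoms[1]), 3))
```

A viewer does the same parsing for every atom and then draws the points as wires, sticks, spheres, or ribbons, which is what makes it possible to compare structures visually.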
11) What is Metagenomics? Explain its importance
 Metagenomics is both a research field and a collection of research methods that includes numerous
related techniques. Meta means "transcendent" in Greek. Metagenomics sidesteps the most significant
barriers to progress in clinical and environmental microbiology: the unculturability and
genetic diversity of the majority of microorganisms.
 In the first sense, that of a set of methods, "meta" acknowledges the need for computational techniques
that maximise understanding of the genetic make-up and behaviour of communities so
complex that they can only ever be sampled, never fully characterised. In the second sense, that of a
research field, "meta" denotes that this emerging science aims to understand biology at the
systems level, transcending the individual organism to concentrate on the community of
genes and how genes may interact to serve collective functions.
 Individual organisms nonetheless remain the basic units of community interactions, and
we predict that studies of individuals and their genomes will benefit from and be stimulated by
metagenomics. We anticipate a convergence of top-down metagenomics, bottom-up classical
microbiology, and organism-level genomics over the coming decades. We will come to view
communities, and the network of communities that makes up the biosphere, as nested systems of
systems that are essential to human survival. It will sometimes be possible to apply the new
knowledge to urgent and significant problems.
 The term "metagenomics" encompasses studies that aim to understand transorganismal behaviours
and the biosphere at the genomic level, studies that characterise communities or their
members at the genome level, high-throughput gene-level studies of communities using methods
from genomics, and other "omics" studies. Metagenomics in either sense will probably never be
tightly circumscribed by a definition, and it would be undesirable to try to do so now. Although
metagenomics is still at an early stage and currently concentrates on non-eukaryotic microbes, there is no
question that its ideas and techniques will ultimately transform all of biology.
 In precisely this way, the science of all organisms and its applications in epidemiology, clinical
microbiology, virology, agriculture, forestry, fisheries, biotechnology, microbial forensics, and many
other fields have been revolutionised by genomics, a discipline developed to aid in the advancement
of biomedicine and the understanding of our own species.
 A useful analogy is a jigsaw puzzle. Instead of one puzzle, you have 10 smaller puzzles put together
in one box. When you think about metagenomics and the genomes of these 10 organisms, you are trying
to solve 10 puzzles simultaneously to understand all the different pictures in this same box of genomes.
 Importance of metagenomics:
 At first glance, the purpose is to identify the microbes present, but there is more to it
than that.
 Metagenomics is needed for two common reasons. The first is the limits of traditional
microbiological methods.
With those methods, investigating the microbes in a soil sample lets us identify a few hundred of
them in a few months, and only if the sample is kept free of contamination.
 Metagenomic analysis, in contrast, allows us to analyse tens of thousands
of microorganisms quickly and without contamination problems.
 Second, most microbial activity is governed by the complex behaviour of the whole community
of microorganisms. To investigate an effect or action adequately, we must therefore
study the whole community.
 Note that analysis of the 16S rRNA gene is not metagenomics. Analysis of the 16S
rRNA gene is used to identify and describe microorganisms; it cannot support a functional genomic
study.
 Metagenomics, however, is used to investigate the functional genomic regions of microbes. Here, we
look into the behaviour of particular microbial genes and how they relate to particular diseases.
 One of the main advantages of the technique is its ability to study many
microorganisms in a single experiment, which is not possible with conventional microbiology
methods.
 Applications of metagenomics:
 Biotechnology studies: In recent years, scientists have placed greater emphasis on metagenomic
analysis in microbial research. Metagenomic research has led to the discovery of enzymes such as
proteases, lipases, and nitrilases. Studying microbes is a route to enzymes, antibiotics, biochemicals,
bioactive substances, and medicines.
 Ecological studies: Metagenomics and other microbial research are crucial for understanding
invasive species, ecology, and conservation. Numerous animals and microbes make their
homes in the sea, rivers, soil, air, and rain forest. The intricate symbiotic links between animals,
microorganisms, and plants help us understand the condition of a habitat. For instance,
one species of animal may use the faeces of another as a nutrient-rich food source, and
the action of microbes may be what makes this possible. Metagenomic studies shed light on the
significance of both for an ecosystem. They are also used in studies of conservation and endangered
species.
 Healthcare and medicine: Metagenomic analysis allows for the direct study of complex illnesses. To
determine which microorganisms may be present in a patient, a sample is taken from
the patient and subjected to DNA sequencing. By isolating RNA from the sample and converting it into
cDNA, it is also possible to assess the impact of various RNA viruses on human health. In environmental
work, metagenomic analysis of how the microbial load behaves makes it possible to monitor and
evaluate the effects of pollutants on the ecosystem and environment.
 Agriculture and soil ecology: Studies of soil and agriculture use metagenomics
extensively. Numerous microbes and plants live in soil. A soil sample weighing around
one gramme contains between 10^9 and 10^10 microbial cells. Sequencing every microbe in one gramme
of soil would produce about one gigabyte of sequencing data, which is enormous.
 The main subject of these investigations is the intricate interactions between plants and microbes.
Microbiomes that aid plant development bring significant economic benefits in terms of yield.