
UNIT 1

INTRODUCTION
Life in Space and Time, Dogmas, Data Archives, WWW, Computers,
Biological Classification, Use of Sequences, Protein Structure, Clinical
Implications.
Bioinformatics:
Bioinformatics is a scientific field that is similar to but distinct from biological
computation, and it is often considered synonymous with computational biology.
Biological computation uses bioengineering and biology to build biological
computers, whereas bioinformatics uses computation to better understand biology.
Bioinformatics is an interdisciplinary field that develops methods and software
tools for understanding biological data, in particular when the data sets are large
and complex.
Life in Space and Time:
Dogmas:
The central dogma of molecular biology is an explanation of the flow of genetic
information within a biological system. It was first stated by Francis Crick in 1957
and re-stated in a Nature paper published in 1970.
The central dogma of molecular biology deals with the detailed residue-by-residue
transfer of sequential information. It states that such information cannot be
transferred back from protein to either protein or nucleic acid.
The central dogma has also been described as "DNA makes RNA and RNA makes
protein". However , the dogma also admits the reverse flow of information from
RNA to DNA but not from proteins to DNA or RNA, as recalled by Francis Crick
(1970). The dogma is a framework for understanding the transfer of sequence
information between bipolymers sequential carrying information. There are 3
major classes of such biopolymers: DNA and RNA (both nucleic acids), and
protein. There are 3×3 = 9 conceivable direct transfers of information that can
occur between these. The dogma classes these into 3 groups of 3: 3 general
transfers (believed to occur normally in most cells), 3 special transfers (known to
occur, but only under specific conditions in case of some viruses or in a
laboratory), and 3 unknown transfers (believed never to occur). The general
transfers describe the normal flow of biological information: DNA can be copied
to DNA (DNA replication), DNA information can be copied into mRNA
(transcription), and proteins can be synthesized using the information in mRNA as
a template (translation).
Central Dogma:
The ‘Central Dogma’ is the process by which the instructions in DNA are
converted into a functional product. It was first proposed in 1958 by Francis Crick,
co-discoverer of the structure of DNA.
The central dogma of molecular biology explains the flow of genetic information,
from DNA to RNA, to make a functional product, a protein.
The central dogma suggests that DNA contains the information needed to make all
of our proteins, and that RNA is a messenger that carries this information to
the ribosomes.
The ribosomes serve as factories in the cell where the information is ‘translated’
from a code into the functional product.
The process by which the DNA instructions are converted into the functional
product is called gene expression.
Gene expression has two key stages - transcription and translation.
In transcription, the information in the DNA of every cell is converted into small,
portable RNA messages.
During translation, these messages travel from where the DNA is in the cell
nucleus to the ribosomes where they are ‘read’ to make specific proteins.
The central dogma states that the pattern of information that occurs most
frequently in our cells is:
From existing DNA to make new DNA (DNA replication)
From DNA to make new RNA (transcription)
From RNA to make new proteins (translation).
An illustration showing the flow of information between DNA, RNA and protein.
Image credit: Genome Research Limited
Reverse transcription is the transfer of information from RNA to make new DNA;
this occurs in the case of retroviruses, such as HIV. It is the process by which the
genetic information from RNA is assembled into new DNA.
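As a simple illustration of these information transfers, the sketch below uses Biopython's Seq class (assuming Biopython is installed; the DNA fragment is a made-up example) to carry out transcription, translation, and reverse transcription on a short sequence:

# A minimal sketch of the information transfers described above, assuming
# Biopython is installed; the DNA fragment is a made-up example.
from Bio.Seq import Seq

dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

mrna = dna.transcribe()                # DNA -> mRNA (transcription)
protein = mrna.translate()             # mRNA -> protein (translation)
back_to_dna = mrna.back_transcribe()   # mRNA -> DNA (reverse transcription)

print(mrna)          # AUGGCC... (T replaced by U)
print(protein)       # amino acid sequence; '*' marks stop codons
print(back_to_dna)   # identical to the original DNA fragment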
Does the ‘Central Dogma’ always apply?
With modern research it is becoming clear that some aspects of the central dogma
are not entirely accurate.
Current research is focusing on investigating the function of non-coding RNA.
Although this does not follow the central dogma it still has a functional role in the
cell.
Data Archives:
Data processed in bioinformatics workflows:
Data that are input to and output from workflows are stored and linked with
the corresponding execution specification, thus tracking all information related to a
computational experiment. For example, users can search for experiments that used
specific data as an input or created specific data as an output.

GSA: Genome Sequence Archive

Introduction

Next-generation sequencing (NGS) technologies have been extensively and
routinely applied to a wide range of important issues in life and health sciences,
leading to an unprecedented explosion in sequence data. Considering the
increasingly higher throughput and lower costs attributable to rapid advancements
of NGS technologies, large-scale sequencing projects for population genomics and
precision medicine are ongoing or in the planning stages around the world, e.g., the
US Precision Medicine Initiative (PMI), UK10K Project, Icelandic Population
Genome Project, and Dog 10K Project. As a corollary, such a deluge of
sequencing data poses great challenges in big data deposition, integration, and
translation. Accordingly, it is fundamentally crucial to store and manage
sequencing data in support of integrative in-depth analyses and large-scale data
mining.
The International Nucleotide Sequence Database Collaboration
(INSDC), operating between the DNA Data Bank of Japan (DDBJ) [8], the
European Molecular Biology Laboratory-European Bioinformatics Institute
(EMBL-EBI), and the National Center for Biotechnology Information (NCBI),
provides valuable services for archiving a broad spectrum of sequence data.
However, with the exponentially accumulating volume of sequence data,
submitting big data to INSDC database resources becomes increasingly daunting
and time-consuming, simply because network bandwidth is a formidable
bottleneck for big data transfer across countries/regions. This situation is
particularly severe in China; in our experience, for instance, submission of ∼1
terabyte (TB) data to the NCBI Sequence Read Archive (SRA) takes ∼2 weeks
based on the 150-Mbps upload bandwidth over a shared international network in
Beijing Institute of Genomics (BIG), Chinese Academy of Sciences (CAS). China,
with the increasing funding support in biomedical research, has been a powerhouse
in generating enormous amounts of sequencing data. Given the huge population
and rich biodiversity in China, there is no doubt that data generated from
sequencing projects for the Chinese population (e.g., CAS PMI
at http://news.xinhuanet.com/english/2016–01/09/c_134993997.htm) and
domestically featured species will grow at extraordinary rates, which accordingly
brings a great challenge and burden to the current practice of data submission
and sharing.
To address this issue, here we present Genome Sequence Archive
(GSA; http://bigd.big.ac.cn/gsa or http://gsa.big.ac.cn), a data repository for
archiving raw sequence data. As a core database resource of BIG Data
Center [11] (http://bigd.big.ac.cn), GSA is built based on INSDC data standards
and structures and provides data archival services for scientific communities not
only in China but also throughout the world. GSA accepts raw sequence reads
produced by a variety of sequencing platforms, stores both sequence reads and
metadata, and provides free and unrestricted access to all publicly available data
for worldwide scientific communities.

Implementation

GSA is implemented with Java Server Pages (JSP; a Java programming framework
for constructing dynamic web pages), Spring (an application framework and
inversion of control container; http://www.springsource.org), Struts (a Model-
View-Controller framework for creating Java web
applications; http://struts.apache.org), and MyBatis (a persistence framework for
the database connection and operation; http://www.mybatis.org). GSA adopts
MySQL (http://www.mysql.org) as relational database management system to
store metadata information. All code is developed using Eclipse
(http://www.eclipse.org), an integrated development environment (IDE) that
features rapid development of Java-based web applications. To provide stable web
services, GSA is hosted on a CentOS-7 operating system with four servers,
namely, Apache serving static content, Tomcat serving dynamic content, a MySQL
server for database management, and an FTP server for file upload and download.

Database content and usage

Data structure and organization

Designed for compatibility, GSA follows INSDC data standards and structures. All
data are organized into four objects, i.e., BioProject, BioSample, Experiment, and
Run (Figure 1). “BioProject”, bearing an accession number prefixed with “PRJC”
(where C, hereinafter, stands for China), provides an overall description for an
individual research initiative, including basic description, organism, data type,
submitter, funding information, and publication(s) if available. “BioSample”,
possessing an accession number prefixed with SAMC, contains descriptive
information about biological materials used in the experiments, including sample
types and attributes. “Experiment”, having an accession number prefixed with
CRX, provides a detailed description of treatments for a specific BioSample,
including experiment intention, library method, and sequencing type. “Run”,
adopting an accession number prefixed with CRR, includes a list of sequence data
file(s) related to a specific experiment. It is noted that “Experiment” and “Run”
constitute China Read Archive (CRA). Based on these standardized data objects,
GSA not only facilitates data submission and deposition, but also enables data
sharing and exchange.
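To make this hierarchy concrete, the sketch below models the four data objects and their accession-number prefixes in Python. This is purely illustrative: GSA itself is implemented in Java (see Implementation above), and the field names used here are assumptions.

# Illustrative sketch of GSA's four data objects and accession prefixes.
# Field names are assumptions; GSA's real implementation is in Java.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Run:                      # accession prefix CRR: data files for one experiment
    accession: str
    files: List[str] = field(default_factory=list)

@dataclass
class Experiment:               # accession prefix CRX: treatment of a specific BioSample
    accession: str
    library_method: str
    runs: List[Run] = field(default_factory=list)

@dataclass
class BioSample:                # accession prefix SAMC: biological material used
    accession: str
    organism: str

@dataclass
class BioProject:               # accession prefix PRJC ("C" for China): research initiative
    accession: str
    title: str
    samples: List[BioSample] = field(default_factory=list)
    experiments: List[Experiment] = field(default_factory=list)

Experiment and Run objects together correspond to the China Read Archive (CRA) described above.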
Figure 1. Data model in GSA. Prefixes of accession numbers for data objects,
including BioProject, BioSample, Experiment, and Run, are indicated in red.
Data objects Experiment and Run constitute China Read Archive.
In addition, GSA features umbrella projects and provides an organizational
structure for a large collaborative project consisting of multiple sub-projects that
are funded by the same grant and have very close collaborations. GSA is well
supported by CAS that functions as the national scientific think tank and academic
governing body. Currently, two umbrella projects from CAS Strategic Priority
Research Programs and one CAS Key Research Program make it officially
mandatory to submit sequencing data to GSA.

Data archive and statistics

GSA accepts data submissions from all over the world, covers the spectrum of
sequence reads generated by a variety of sequencing platforms, and accommodates
several commonly-used file formats, like FASTQ, BAM, and VCF. GSA performs
validations for all submitted data items to ensure data integrity and increase data
reusability. Similar to INSDC members, GSA allows users to set data as either
public or controlled, indicating that the data is publicly accessible or placed under
controlled access over a given period of time, respectively. Regarding data
security, all submitted data have copies stored in physically separate disks. Since
its inception in August 2015, GSA has shown a dramatic increase in data
submissions in terms of the numbers of BioProjects, BioSamples, Experiments,
and Runs, as well as file size (Figure 2). As of December 2016, GSA houses a total
of 198 BioProjects, 8674 BioSamples, 9263 Experiments and 10,745 Runs for
more than 80 species, submitted by more than 160 data providers from a total of 39
institutions, and archives more than 200 TBs of sequence data.

Figure 2. Data statistics of GSA. A. Numbers of BioProjects and BioSamples
in GSA. B. Numbers of Experiments and Runs, as well as file size in GSA.
All statistics are based on data submissions ranging from December 2015 to
December 2016.

Data submission and retrieval

To create a submission, users need to register and log into the GSA system.
Basically, to submit data to GSA, there are five straightforward steps involving
BioProject, BioSample, Experiment, Run, and Sequence Files (Figure 3). In order
to maximally simplify the submission procedure, GSA is equipped with a
user-friendly input wizard for metadata collection. To ease sequence file uploading,
GSA provides an FTP server supporting both IPv4 and IPv6. In
addition, GSA provides user-friendly web interfaces for data query and browsing.
Users can search the data of interest by specifying a given BioProject, BioSample,
Experiment, or Run ID. Moreover, GSA allows users to conduct advanced search
by inputting species name, sequencing type, sequencing platform,
disease/phenotype/trait, tissue/cell line, etc. GSA also allows users to browse all
publicly available BioProjects, BioSamples, and Experiments.
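As an illustration of the file-upload step, the sketch below uses Python's standard ftplib module; the host name, account, target directory, and file name are placeholders, not GSA's actual connection details.

# Hypothetical sketch of uploading a sequence file to an FTP submission server.
# Host, credentials, directory and file name are placeholders only.
from ftplib import FTP

HOST = "ftp.example.org"          # placeholder submission host
USER = "submitter"                # placeholder account
PASSWORD = "secret"               # placeholder password

with FTP(HOST) as ftp:
    ftp.login(user=USER, passwd=PASSWORD)
    ftp.cwd("/incoming")          # placeholder target directory
    with open("sample_reads.fastq.gz", "rb") as fh:
        ftp.storbinary("STOR sample_reads.fastq.gz", fh)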
Figure 3. Graphic illustration of data submissions to GSA. Two representative
studies are provided here as examples to depict the data objects involved in
data submission.

Perspectives and concluding remarks

“With great power comes great responsibility.” Nowadays, China is the
second largest economy, playing an increasingly important and influential role in
the global economy. Equally, in academia, it is time for us to implement the
practice of archiving sequence data for worldwide scientific communities,
especially considering the large quantities of sequence data generated in China.
Equivalent to INSDC members, GSA is committed to archiving raw sequence data.
GSA’s ultimate goal, which is also the expectation from funding agencies, is to
provide free archival services for raw sequence data, establish and promote a
centralized archival practice in China, play an important role in global sequence
data archiving, and support research activities in both academia and industry
throughout the world. In addition, there are also strong domestic incentives and
agreements from academia, industry, and government (over 1000 supporters from
more than 380 organizations; http://bigd.big.ac.cn/gdsd) to deposit data into GSA
and make GSA a centralized archival resource in China.
To sum up, GSA is a data repository for archiving raw sequence data. Designed for
compatibility, GSA adopts INSDC data standards and structures, archives both
sequence reads and metadata submitted from all over the world, and makes all
these data publicly available to worldwide scientific communities. In the era of big
data, GSA is not only an important complement to existing INSDC members by
alleviating the increasing burden of handling the sequence data deluge, but also takes
on significant responsibility for global big data archiving and provides free
unrestricted access to all publicly available data in support of research activities
throughout the world. In the future, we will not only upgrade the infrastructure of GSA to
achieve big data storage, exchange and sharing, but also will develop new
functionalities to archive population-based PMI data and a variety of metagenome
data.

WWW:
Bioinformatics is a new and emerging branch of Biotechnology. It mainly involves
the use of software to utilize information from vast biological databases developed
by experienced Biotechnologists. Gene sequencing is a part of Bioinformatics in
which a large amount of biotechnology-related data is processed. This brings
biotechnology within the ambit of information technology, and hence the label,
Bioinformatics. In fact, Bioinformatics is a new discipline that combines molecular
biology and computer science. Genomic research, the sequencing of the human
genome, and advances in disease-related studies have both required and driven its
rapid development. So we can say that in Bioinformatics, computers are required to
store, retrieve, analyze or predict the composition or the structure of biomolecules.
It is a fascinating hybrid of computer science and biology.
The National Center for Biotechnology Information (NCBI 2001) defines
Bioinformatics as: "Bioinformatics is the field of science in which biology,
computer science, and information technology merge into a single
discipline...There are three important sub-disciplines within Bioinformatics: the
development of new algorithms and statistics with which to assess relationships
among members of large data sets; the analysis and interpretation of various types
of data including nucleotide and amino acid sequences, protein domains, and
protein structures; and the development and implementation of tools that enable
efficient access and management of different types of information." Bioinformatics
is a particularly international subject, with a notably high degree of information
sharing among researchers in different countries. It is also known as computational
biology (e.g., USC Computational Biology, NCSA Computational Biology).
USE OF INTERNET IN BIOINFORMATICS
When we talk about sources of biological information and computers for providing
it, we cannot ignore the role and impact of the information superhighway, i.e., the
Internet. The Internet is the most powerful tool of this information age and serves as
a platform for Bioinformatics tools. It provides the opportunity to search for
information that was previously available only by visiting an information centre.
Areas of Services
The Internet provides various facilities for Bioinformatics, such as:
• Bioinformatics research
• Courses
• Resources
• Biological databases
• Construction tools
• Software resources
• WWW search tools
• Courses of Bioinformatics
• Advanced topics in Bioinformatics
• Scientific databases
• Electronic journals
• Asking queries of the librarian online
• News, events and activities, such as announcements for the Bioinformatics interest
group, meetings on federated databases, and molecular biosciences and technology
seminars.

Subject Specific Sites


These sites are more likely to concentrate on a particular area of Bioinformatics.
These sites are further divided into the various areas of Bioinformatics, e.g., codon
usage, genome analysis/genomic comparisons, phylogenetics, etc.
General Bioinformatics Web Sites
Many of the sites offer the same sorts of links, many of them to other
Bioinformatics sites; many have links to a Sequence Retrieval System or other
facilities for sequence retrieval. These are categorized as:
• Academic Sites
• Corporate/Government Sites
BIOINFORMATICS, INTERNET AND LINKS
The Internet facilitates linking to current programs and initiatives that use the Internet
to form clearing-houses and distributed networks of biological information. Some
are integrated systems for agricultural genome analysis, including databases,
conferences, publications, courses, and a particularly good plant genome online
database tutorial. Links to many tools and programs are available from the
National Institutes of Health for sequence analysis and molecular biology,
including databases, protocols and tutorials. Many pages include links to a
number of model organism databases, banks and tables, and to a number of genetic
databases. The Department of Molecular and Cellular Biology at Harvard University
maintains such a service. Presently, "Molecules R US", a World
Wide Web (WWW) forms interface, facilitates access (browsing, searching and
retrieval) to the molecular structure data contained within the Brookhaven Protein
Data Bank (PDB). Through the Molecules R US request form, one may request a
PDB data file to be returned in one of several different formats.
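As a minimal illustration of retrieving molecular structure data programmatically (using the current RCSB PDB file-download service rather than the original Molecules R US form; the entry ID is an arbitrary example), one could do something like:

# Minimal sketch: download a PDB entry over HTTP with the standard library.
# The entry ID is an arbitrary example; the URL pattern is the current RCSB
# download service, not the Molecules R US interface described above.
import urllib.request

pdb_id = "1CRN"   # example entry
url = f"https://files.rcsb.org/download/{pdb_id}.pdb"

with urllib.request.urlopen(url) as response:
    pdb_text = response.read().decode("utf-8")

with open(f"{pdb_id}.pdb", "w") as out:
    out.write(pdb_text)

print(pdb_text.splitlines()[0])   # HEADER record of the entry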
Biological Classification:
According to fossil evidence, life on earth is speculated to have begun roughly 3.7
billion years ago. Today, the earth is home to countless species – ranging from
microscopic microbes to gargantuan blue whales. The diversity of life is so vast
that there are many species yet to be discovered. For instance, the giant squid
(Architeuthis dux) was nothing more than a sailor's tall tale until a live specimen
was first photographed in its natural habitat in 2004.
Similarly, there are many organisms that are yet to be identified or discovered.
However, we do need a system to classify the organisms that we do know about.
This is due to the fact that the same organism or its variations may exist across
multiple locations around the planet. And these organisms are given different
names according to the locations, though all are biologically the same organism.
Hence, the idea of biological classification was put forward. We shall explore what
biological classification is, and its basis, in detail.
What is Biological Classification
Biological classification is the scientific procedure of arranging the organisms in a
hierarchical series of groups and sub-groups on the basis of their similarities and
dissimilarities. Many biologists have contributed to this method of classification,
and it took researchers years to decide on the most fundamental characteristics for
classification.

Basis of Classification
The history of biological classification began with Aristotle, the Greek
philosopher, who is often called the father of biological classification. He
described animal classification based on their habitat, i.e., air, water and land. He
was the first person to recognize the need for groups and group names in the study
of the animal kingdom.
Later, biologists started to work on the classification of living organisms based on
their characteristics. Characteristics can be explained in many ways. A group of
organisms is similar enough to be classified together by certain characteristics.
Characteristics are the appearance/form and behaviour/function of something.
These characteristics decide which organism will be placed in which group.
For example, a dog has limbs, but a snake doesn’t. A dog and a snake can move,
but plants cannot. These are the characteristics of different organisms. These
behaviours classify them into different groups. But which characteristic should be
considered fundamental: form or function? As per the above example, should a dog
be classified on the basis of its body design or its locomotion? Classification on such
grounds alone was therefore not successful.
In the mid-1700s, Carolus Linnaeus, a Swedish physician and botanist, published
several books on different species of plants and animals. According to his system
of classification, he grouped organisms according to common physical traits and
developed the two-part binomial taxonomy system of categorizing organisms
according to genus and species. This type of classification was effective. Later, his
work was combined with the work of Charles Darwin, forming the foundation of
modern taxonomy.
Some of the characteristics which are used today to classify organisms are
as follows:

• Type of cell – Prokaryotic or Eukaryotic.
• Number of cells – Unicellular or Multicellular.
• Mode of nutrition – Autotrophs (Photosynthetic) or Heterotrophs (Non-photosynthetic).
• The level of organization and development of organs.
Biological classification is the process by which scientists group living organisms.
Organisms are classified based on how similar they are. Historically, similarity was
determined by examining the physical characteristics of an organism but modern
classification uses a variety of techniques including genetic analysis.

Organisms are classified according to a system of seven ranks:

1. Kingdom
2. Phylum
3. Class
4. Order
5. Family
6. Genus
7. Species
For example, the honey bee (Apis mellifera) would be classified in the following
way:

1. Kingdom = Animalia
2. Phylum = Arthropoda
3. Class = Insecta
4. Order = Hymenoptera
5. Family = Apidae
6. Genus = Apis
7. Species = Apis mellifera
Species names are always written with the genus, either in full or abbreviated form,
for example, Apis mellifera or A. mellifera respectively.
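The seven ranks form a simple ordered hierarchy, so they can be represented as an ordered mapping. The short sketch below encodes the honey bee example from above (illustrative only):

# The seven-rank classification of the honey bee, stored as an ordered mapping.
honey_bee = {
    "Kingdom": "Animalia",
    "Phylum":  "Arthropoda",
    "Class":   "Insecta",
    "Order":   "Hymenoptera",
    "Family":  "Apidae",
    "Genus":   "Apis",
    "Species": "Apis mellifera",
}

def binomial(taxon, abbreviate=False):
    """Return the species name, abbreviating the genus if requested (e.g. A. mellifera)."""
    genus, species = taxon["Species"].split()
    return f"{genus[0]}. {species}" if abbreviate else f"{genus} {species}"

print(binomial(honey_bee))                   # Apis mellifera
print(binomial(honey_bee, abbreviate=True))  # A. mellifera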

Use of Sequences:

Sequence Analysis:
In bioinformatics, sequence analysis is the process of subjecting a DNA, RNA or
peptide sequence to any of a wide range of analytical methods to understand its
features, function, structure, or evolution.
Methodologies used include sequence alignment, searches against biological
databases, and others.
Purpose of sequencing:
DNA sequencing is a laboratory technique used to determine the exact sequence
of bases (A, C, G, and T) in a DNA molecule. The DNA base sequence carries the
information a cell needs to assemble protein and RNA molecules. DNA sequence
information is important to scientists investigating the functions of genes.
Since the development of methods of high-throughput production of gene and
protein sequences, the rate of addition of new sequences to the databases
has increased exponentially. Such a collection of sequences does not, by itself, increase
the scientist's understanding of the biology of organisms. However, comparing
these new sequences to those with known functions is a key way of understanding
the biology of an organism from which the new sequence comes. Thus, sequence
analysis can be used to assign function to genes and proteins by the study of the
similarities between the compared sequences. Nowadays, there are many tools and
techniques that provide the sequence comparisons (sequence alignment) and
analyze the alignment product to understand its biology.
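As a hedged example of such a comparison, the sketch below aligns two made-up DNA fragments with Biopython's PairwiseAligner (assuming Biopython is installed); real analyses would typically use database search tools such as BLAST, but the principle is the same:

# Minimal pairwise sequence alignment sketch using Biopython's PairwiseAligner.
# The two DNA fragments and the scoring values are made-up examples.
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = "global"          # Needleman-Wunsch style global alignment
aligner.match_score = 1
aligner.mismatch_score = -1
aligner.open_gap_score = -2
aligner.extend_gap_score = -0.5

seq1 = "ACCGTGGATAG"
seq2 = "ACGTGGAAG"

alignments = aligner.align(seq1, seq2)
best = alignments[0]
print("score:", best.score)
print(best)                      # shows the aligned sequences with gaps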
Sequence analysis in molecular biology includes a very wide range of relevant
topics:

1. The comparison of sequences in order to find similarity, often to infer if they
are related (homologous)
2. Identification of intrinsic features of the sequence such as active sites,
post-translational modification sites, gene structures, reading frames,
distributions of introns and exons, and regulatory elements
3. Identification of sequence differences and variations, such as point
mutations and single nucleotide polymorphisms (SNPs), in order to obtain
genetic markers.
4. Revealing the evolution and genetic diversity of sequences and organisms
5. Identification of molecular structure from sequence alone
In chemistry, sequence analysis comprises techniques used to determine the
sequence of a polymer formed of several monomers (see Sequence analysis of
synthetic polymers). In molecular biology and genetics, the same process is called
simply "sequencing".
In marketing, sequence analysis is often used in analytical customer relationship
management applications, such as NPTB models (Next Product to Buy).
In sociology, sequence methods are increasingly used to study life-course and
career trajectories, patterns of organizational and national development,
conversation and interaction structure, and the problem of work/family synchrony.
This body of research has given rise to the emerging subfield of social sequence
analysis.

Protein Structure:
Protein structure is the three-dimensional arrangement of atoms in an amino acid
chain molecule. Proteins form by amino acids undergoing condensation reactions,
in which the amino acids lose one water molecule per reaction in order to attach
to one another with a peptide bond.
Types of Protein Structure:
The four levels of protein structure are primary, secondary, tertiary, and
quaternary. It is helpful to understand the nature and function of each level
of protein structure in order to fully understand how a protein works.
Protein in Bioinformatics:
Bioinformatics plays an important role in all aspects of protein analysis, including
sequence analysis, structure analysis, and evolution analysis. With bioinformatics
techniques and databases, the function, structure and evolutionary history of
proteins can be identified.

Basic Principles of Protein Structure:
When an all atom model of a protein structure is seen for the first time (be it a
figure, a three-dimensional model or a computer graphics representation) it may
appear a daunting task to decipher any underlying pattern within it. After all, such
structures typically contain thousands if not tens- or hundreds-of-thousands of
atoms. For this reason, it is often considered convenient to simplify the problem by
using a hierarchical description of protein structure in which successive layers of
the hierarchy describe increasingly more complex levels of organization. Typically
four levels are used and referred to as the primary, secondary, tertiary and
quaternary structure of a protein. It is often useful to include two additional
intervening levels between the secondary and tertiary structures, which are referred
to as super secondary structures and domains. Many excellent textbooks exist on
the subject which should be consulted for more detailed information (1, 29–31).
Here, we will attempt only to give an outline in order to emphasize the importance
of the study of the three-dimensional structure of proteins for the drug design
process which will be described towards the end of this chapter.
Primary structure:
The primary structure of a protein originally referred to its complete covalent
structure but is more frequently interpreted as being the sequence of amino acids of
each polypeptide chain of which the protein is composed. These are often one and
the same thing but disulphide bonds and other rarer types of covalent bond formed
between amino acid side chains are not directly encoded by the sequence itself.
A polypeptide chain is a unidimensional heteropolymer composed of amino acid
residues. There are basically only twenty naturally occurring amino acids which
are directly encoded by the corresponding gene, although in exceptional cases stop
codons can be used for the incorporation of two additional amino acids
(selenocysteine and pyrrolysine (32)). All of these amino acids are α-amino acids
which possess the generic structure given in Figure 3a. Common to all such amino
acids is the amino group, carboxylic acid group and hydrogen bound to the central
carbon atom (the α carbon). Only the R group (also known as the side chain)
differs from one amino acid to another and it varies in terms of size, polarity,
hydrophobicity, charge, shape, volume etc. With the twenty different amino acids
available, nature is able to produce the wide diversity of functions which proteins
perform in living organisms.

Figure 3
The amino acids. In (a) the generic structure of an α-amino acid is given, in which
only the R group (side chain) varies from one type to another. In (b) and (c) the
difference between L- and D-amino acids respectively is shown (one being the
mirror image of the other).
It should be noted that the α-carbon is tetrahedral and in general is bound to four
different chemical moieties. As such it is asymmetric (a chiral center) and two
different enantiomers (D and L) for each amino acid exist (Fig 3b). Amongst the
naturally occurring amino acids, the only exception is the amino acid glycine,
whose R-group is a hydrogen atom, making the α-carbon symmetric. This confers
a series of important conformational properties on this amino acid which are often
essential for the maintenance of a given structure. For this reason, critical glycines
are often conserved amongst members of a given protein family. The remaining
amino acids are (with very few exceptions) always found to be L-amino acids. This
has important consequences for the chiral structures observed in proteins at all
levels of the hierarchy. Only two other chiral centers exist within the twenty amino
acids, these are the β-carbons within the side-chains of the amino acids threonine
and isoleucine, which also exist as only one of the two possible enantiomers.
A polypeptide chain is generated by a series of condensation reactions between the
carboxyl group of one amino acid and the amino group of the next, yielding a
covalent bond (Fig. 4). This is an amide bond which is given the trivial name of a
peptide bond in the case of polypeptide chains. The union of two amino acids
results in a dipeptide, which possesses a free amino group (N-terminus) and
carboxyl group (C-terminus) allowing the condensation reaction to continue ad
infinitum in principle, in both directions (Fig. 4). After undergoing condensation,
the amino acid is termed a residue (to denote `that which is left' after
condensation). Given the existence of the 20 naturally occurring amino acids, for a
polypeptide of length n there are 20^n possible amino acid sequences. Since n is
typically of the order of hundreds and may even be tens-of-thousands, the number
of theoretically possible polypeptide sequences is literally astronomical. The vast
majority do not currently exist (indeed, there is insufficient matter in the universe)
and never have existed before (as there has been insufficient time to generate
them). The polypeptides we observe in the extant species are the products selected
from a tiny fraction of all possible combinations that were expressed along the
evolutionary process.
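To get a feeling for these numbers, the short sketch below simply computes 20^n for a few chain lengths (plain arithmetic, for illustration only):

# Number of possible amino acid sequences for a polypeptide of length n is 20**n.
for n in (10, 100, 300):
    count = 20 ** n
    print(f"n = {n:4d}: 20^n has {len(str(count))} digits")
# n =   10: 14 digits  (about 1e13)
# n =  100: 131 digits
# n =  300: 391 digits (vastly more than the ~1e80 atoms in the observable universe)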

Figure 4
A polypeptide chain is generated by a series of condensation reactions, in
vivo normally occurring within the ribosome during protein synthesis. The
individual amino acids are shown in their zwitterionic form, in which the amino
and carboxylic acid groups are charged.
The conformational properties of a polypeptide chain
Due to the chemical nature of the amide bond, the peptide linkage between two
amino acids is subject to the phenomenon of resonance, meaning that it acquires
the characteristics of a partial double bond. The C-N distance (1.32Å) is shorter
than a normal single bond and longer than a normal double bond. Furthermore, this
introduces rigidity into the structure as the bond is no longer freely rotatable. The
consequence is that the α-carbon atoms of two adjacent amino acids, together with
the carbonyl (C=O) and NH groups of the peptide group itself, all lie within the
same plane (Fig. 5). The torsion (dihedral) angle associated with this bond is
termed ω, which in order to be planar must be either 180° (trans) or
0° (cis). Cis peptide bonds are only common when the amino acid proline lies on
the C-terminal side of the bond. This is due to the unusual nature of the proline
side chain which forms a covalent bond with its own mainchain nitrogen
generating a closed ring. As such proline is strictly speaking a secondary amino
acid and neither trans nor cis are particularly energetically favorable.
Approximately one third adopt the cis conformation and these are frequently
conserved amongst protein family members (33).

Figure 5. A short section of polypeptide chain showing the planar peptide groups
and identifying the torsion angles ϕ and ψ.
Polypeptides necessarily obey standard stereochemistry and therefore all bond
lengths and angles are effectively fixed, varying only minimally around their
standard values. Therefore, the only source of conformational freedom that the
polypeptide possesses comes from the torsional rotation around its single bonds.
We have already seen that the peptide bond is not freely rotatable and therefore ω is
effectively fixed, normally in the trans configuration. There are only two remaining
single bonds per residue along the main chain, and from these bonds one may
associate the torsion angles ϕ and ψ (Fig. 5), which are defined to be 0° when in
the trans conformation. The conformation of the mainchain of a given residue may
therefore be described in terms of these two parameters, which may be
conveniently represented in terms of a two-dimensional coordinate system (the
Ramachandran plot), in which both ϕ and ψ vary between −180° and +180°.
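As a brief illustration, the ϕ and ψ angles of each residue can be extracted from a structure file with Biopython's Bio.PDB module; in the sketch below, Biopython is assumed to be installed and "example.pdb" is a placeholder file name.

# Sketch: extract backbone torsion angles (phi, psi) from a PDB file with Biopython.
# "example.pdb" is a placeholder; get_phi_psi_list() returns angles in radians.
import math
from Bio.PDB import PDBParser, PPBuilder

parser = PDBParser(QUIET=True)
structure = parser.get_structure("example", "example.pdb")

ppb = PPBuilder()
for polypeptide in ppb.build_peptides(structure):
    for residue, (phi, psi) in zip(polypeptide, polypeptide.get_phi_psi_list()):
        if phi is not None and psi is not None:   # chain termini lack phi or psi
            print(residue.get_resname(),
                  round(math.degrees(phi), 1),
                  round(math.degrees(psi), 1))

Plotting these (ϕ, ψ) pairs for many residues gives the Ramachandran plot described above.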
Clinical Implications of Bioinformatics

Over the past decade, the science of clinical bioinformatics has become one of the
fastest growing areas of research and development within the healthcare
environment. Indeed, the job of a bioinformatician has become an integral part of
research laboratories. In particular, clinical bioinformatics aims to address the
challenges in diagnosis, prognosis, and therapies of patients with diseases such as
cancer, neurodegenerative (e.g. ALS, Alzheimer’s and Parkinson’s disease),
allergic (e.g. asthma), and psychiatric disorders (e.g. depression), amongst others.
In 1970, Ben Hesper and Paulien Hogeweg coined the term bioinformatics to refer
to “the study of information processes in biotic systems”. In consequence,
bioinformatics was placed as a field parallel to biochemistry and biophysics.1 Since
then, the digital world expanded and the definition of bioinformatics took on a
whole new meaning. It now combines the fields of biology, computer science,
engineering, mathematics, and statistics to decipher biological data and make sense
of it in translational research.
Over the past decade, the advent of high-throughput or next-generation sequencing
(NGS) has accelerated the rate at which genes and co-regulated gene networks are
discovered. Indeed, a vast amount of data is now available, in particular from the
completion of the human genome project in 2003. Together, this data is being used
to modulate disease outcome, predisposition, and progression.2 For this reason, the
science of clinical bioinformatics has become one of the fastest growing areas of
development within the healthcare environment. It is an important component in
laboratories that generate and interpret data from molecular genetics testing.
Overall, the aim of clinical bioinformatics is to address the challenges in initial
diagnosis, prognosis, and therapies of patients3 with diseases such as cancer,
neurodegenerative and psychiatric disorders, amongst others.
Cancer
In clinical medicine, it has become apparent that there is a need to develop and
introduce advanced and new bioinformatics methodologies to answer the specific
question of cancer.4 In order for cancer bioinformatics to be effective, the tools
must thus concentrate on the communication, metabolism, proliferation, and
signalling of the disease. In particular, cancer bioinformatics is expected to have a
significant role in the identification and validation of biomarkers. For example, one
of the strategies is to evaluate and monitor biomarkers at different stages and time
points during cancer development. Identified as dynamic network biomarkers,
these markers should be compared with clinical informatics, such as patient
complaints, history, symptoms, and therapies. In addition, these biomarkers should
also correlate to biochemical analyses, imaging profiles, pathologies, physician’s
examinations, and other measurements.5
For instance, through a genetic screen of hepatocellular carcinoma, Sawey et
al.6 discovered that a common alteration in liver cancer (11q13.3 amplification)
causes the activation of the fibroblast growth factor 19 (FGF19), a hormone that
regulates bile production with effects on glucose and lipid metabolism. In turn,
through subsequent bioinformatics analysis with mouse models and RNAi, it was
found that activation of FGF19 results in selective responsiveness to FGF19
inhibition. Therefore, Sawey et al. propose for the 11q13.3 amplification to be used
as a biomarker for patients who, in all likelihood, will respond to anti-FGF19
therapies. In a somewhat similar approach, Baert-Desurmont et al.7 revealed that a
combination of single nucleotide polymorphisms (8q23, 15q13 and 18q21 SNPs)
could explain an increased risk for colorectal cancer.
Using genome-wide screening methods, aberrant expression profiles of
microRNAs (miRNAs) have also been identified in human cancers, thus revealing
their potential as diagnostic and prognostic biomarkers of cancer.8 Now, in order to
infer the regulatory processes of miRNAs, bioinformatics approaches are
fundamental. For example, Laczny et al.9 developed a comprehensive and
integrative tool, called miRTrail, to generate reliable and robust data on
deregulated pathogenic processes which could offer insights into the interactions
between genes and miRNAs. In fact, the use of miRTrail on melanoma samples
demonstrated how this platform opened new avenues for investigating a wide
range of diseases, including cancer.
In clinical practice and medical research, medical image processing facilitates the
accurate, initial detection and diagnosis of cancer. Indeed, medical imaging –
imaging in clinical pathology, nuclear magnetic resonance imaging, positron
emission tomography, and ultrasonic computed tomography – is one of the most
important factors in the application of cancer bioinformatics. Kimori et al.10 for
instance, used a mathematical morphology-based approach to enhance fine features
of a lesion with high suppression of surrounding tissues. Here, the effectiveness of
the method was evaluated in terms of the contrast improvement ratio as applied to
three kinds of medical images: a chest radiographic image, a mammographic
image, and a retinal image.
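A mathematical morphology-based enhancement of this kind can be sketched with a white top-hat transform, which subtracts a morphological opening from the image so that small bright features stand out against a slowly varying background. This is only a generic illustration (using NumPy/SciPy and a synthetic image), not Kimori et al.'s exact method.

# Sketch of morphology-based enhancement: a white top-hat transform removes the
# slowly varying background and keeps small bright features (e.g. fine lesions).
# Generic illustration only, on a synthetic image; not Kimori et al.'s method.
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
background = ndimage.gaussian_filter(rng.random((256, 256)), sigma=30) * 200
image = background.copy()
image[100:104, 120:124] += 50          # small bright "lesion"

# Top-hat = original image minus its morphological opening with a structuring
# element larger than the features of interest.
enhanced = ndimage.white_tophat(image, size=15)

print("lesion contrast before:", image[102, 122] - image.mean())
print("lesion contrast after: ", enhanced[102, 122] - enhanced.mean())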
Overall, the aim of cancer bioinformatics is to continue developing tools so that the
right treatment is provided to the right patient at the right time, based on the
characteristics of each patient’s tumour; in other words, tailored bioinformatics.
Neurodegenerative Diseases
It is known that the economic and societal costs of neurodegenerative diseases
are accelerating. Therefore, there is a demand to find new solutions to resolve the
situation.11 However, having said that, progress in this area has proved to be
challenging. In part, this is because the cause of diseases such as Alzheimer’s (AD)
or Parkinson’s disease (PD) is not known, making them difficult to understand.12 In
addition, while understanding these diseases on a molecular level could lead to the
development of better biomarkers and treatments, the enormous amount of data
involved renders it an arduous task. For this reason, bioinformatics approaches are
used to manage data from high-throughput technologies, pushing forward the
frontiers of this field.
In regard to late-onset AD and PD, both have an obvious genetic component,
however, their genetic architecture is complex, with just a few, constant, associated
risk factors. It is therefore possible that undiscovered AD and PD-related genes
exist. Kim et al.13 using biomedical text mining, were able to pinpoint genes that
have a direct relationship with both neurodegenerative diseases. In another
approach, Hofmann-Apitius et al.12 developed a bioinformatics and modelling
method based on patient data available to the public. Here, the work presented was
driven through AETIONOMY, a public-private partnership between the European
Union and the pharmaceutical industry association EFPIA.
ALS, short for amyotrophic lateral sclerosis, is another neurodegenerative disease,
but one that affects nerve cells in the brain and the spinal cord. To date, there is a
vast volume of data capturing this motor neurone disease. In consequence, there is
a corresponding need for storage and interpretation. In keeping with this, Abel et
al.14 presented an ALS online bioinformatics database (ALSoD) combining
genotype, phenotype, and geographical information with associated analysis tools.
Likewise, PRO-MINE (PROtein Mutations In NEurodegeneration)15 is a database
describing all TDP-43 disease mutations identified up to now; TDP-43 is a
multifunctional RNA-binding protein found in AD, ALS, and also frontotemporal
lobar degeneration.
Allergic and Psychiatric Disorders
In 2008, TIME magazine named 23andMe the invention of the year. 23andMe
provides a home-based saliva collection kit that decodes the genomic DNA of
adults and interprets their genetic health risks, with results accessible online. In
particular, it tests for ten diseases, including AD, PD, and some rare blood
diseases. It is important to note that the 23andMe kit describes if an individual has
a higher risk of developing a disease but it is not intended to diagnose disease. It is
meant to provide information that can be used to inform life decisions.
Using the 23andMe gene pool, Hyde et al.16 discovered 15 genetic loci associated
with a risk of major depression in people of European descent. In a similar
approach, genome-wide analyses for personality traits identified 6 loci with
correlations to psychiatric disorders.17 In addition, through a multi-trait analysis of
a genome-wide association study, Turley et al.18 identified loci for depressive
symptoms, neuroticism, and subjective well-being. Using this 23andMe gene pool,
scientists have also discovered that asthma, eczema and hay fever share a genetic
origin, in part due to shared genetic risk variants that dysregulate the expression of
immune-related genes.19
Conclusion
Overall, clinical bioinformatics is a critical step in discovering and developing
new diagnostics and therapies for diseases. Here, we described cancer,
neurodegenerative and psychiatric disorders, however, bioinformatics has been
used in other disorders as well, such as acute rejection after renal
transplantation20 and lung diseases.21 In addition, bioinformatics has been used in
studies of model organisms such as Saccharomyces cerevisiae (yeast), Drosophila
melanogaster (flies), and Mus musculus (mice), which in turn shed light onto non-
model organisms such as humans.
It is evident that bioinformatics will continue to push the boundaries of medicine
and shape clinical testing for the future. Just like microscopes, computers have
become a requirement, and the job of a bioinformatician is now an integral part of
research laboratories and of the clinical setting. In the future, success will
depend on improved analytics, annotations, software to deliver this information,
and systems to capture the realised knowledge.22
