
Joint BecA-ILRI Hub, SLU and UNESCO Advanced

Genomics and Bioinformatics

Mark Wamalwa 7th - 17th October 2013


BecA-ILRI Hub, Nairobi, Kenya
http://hub.africabiosciences.org/
http://www.Ilri.org/
m.wamalwa@cgiar.org
Plan for the Week

Day 1: Introduction to Linux
Day 2: Shell programming; Introduction to Perl programming
Day 3: Perl programming contd; Nucleotide and protein sequence manipulation
Day 4: Regulatory sequence analysis; CLC Genomics
Day 5: CLC Genomics contd; Cocktail
What is Bioinformatics/ Computational
Biology?
Bioinformatics: seeks to analyze large sets of biological data in order
to solve biological questions, to formulate hypotheses and to build
models of the underlying biological processes.

Bioinformatics: collection and storage of biological information


Bulk Data analysis
Bulk Data storage
Bulk Data mining
Computational biology: development of algorithms and statistical
models to analyze biological data
Scope of bioinformatics

Storage and retrieval of biological data
Molecular structures: visualization and analysis, classification, prediction
Sequence analysis: Sequence alignments, database searches, motif detection
Genomics: annotation, comparative genomics
Phylogeny
Functional genomics: Transcriptome, proteome, interactome
Analysis of biochemical networks: metabolic networks, regulatory networks
Systems biology: Modelling and simulation of dynamical systems

Multidisciplinarity
[Figure: bioinformatics at the intersection of molecular biology, genomics, genetics, mathematics, biochemistry, statistics, biophysics, numerical analysis, algorithmics, evolution, image analysis and data management]
Multidisciplinary
- Scientists cannot be experts in all of these domains
- Problems:
  - Biologists (generally) hate statistics and computers
  - Computer scientists (generally) ignore statistics and biology
  - Statisticians and mathematicians (generally) spend their time writing formulas everywhere
  - Complexity of the biological domain: each time you try to formulate a rule, there is a possible counter-example
- Solution: multidisciplinary teams/multi-lab projects
Applications
- Research in biology
  - Molecular organization of the cell/organism
  - Development
  - Mechanisms of evolution
- Medicine
  - Diagnosis of cancers
  - Detecting genes involved in cancer
- Pharmaceutical research
  - Mechanisms of drug action
  - Drug target identification
- Biotechnology
  - Gene therapy
  - Bioengineering
From wet science to bioinformatics
- Progress in biology stimulated the incorporation of new methods in bioinformatics
  - Structure analysis (since the 50s): structure comparison, structure prediction
  - Sequencing (since the 70s): sequence alignment, sequence search in databases
  - Genomes (since the 90s): genome annotation, comparative genomics, functional classifications (ontologies)
  - Transcriptome (since 1997): multivariate analysis
  - Proteome (~2000): graph analysis
High throughput technologies
- Genome projects stimulated drastic improvement of sequencing technology
- Post-genomic era
  - Genome sequence is not sufficient to predict gene function
  - This stimulated the development of new experimental methods
    - transcriptomics (microarrays)
    - proteomics (two-hybrid, mass spectrometry, ...)
- The "omics" trend:
  - High-throughput methods started a fashion for "omics".
  - Some of the "omics" are not associated with any new or high-throughput approach; they are just a new name for an existing method or for an abstract concept.
Large-scale analyses
- The availability of massive amounts of data makes it possible to address questions that could not even be imagined a few years ago
  - genome-scale measurement of transcriptional regulation
  - comparative genomics
- Downstream analyses require a good understanding of statistics
- Warning: the global trends
  - the capability to analyze large amounts of data carries a risk of remaining at a superficial level, or of being fooled by forgetting to check the pertinence of the results (with some in-depth examples)
  - good news: this does not prevent the authors from publishing in highly cited journals
Bioinformatics is a science of inference

- The risks of inference
- Any analysis of massive data will unavoidably generate a certain rate of errors (false positives and false negatives).
- Good research and development will include an evaluation of the error rates.
- Good methods will minimize the error rate.
- Trade-off between specificity and sensitivity.
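
The trade-off between specificity and sensitivity can be made concrete with a small sketch. It assumes each prediction carries a score and that a gold-standard label is available for evaluation; the scores, labels and the function name `evaluate` are hypothetical illustration values, not data from this course.

```python
def evaluate(scores, labels, threshold):
    """Return (sensitivity, specificity) when score >= threshold is called positive.

    Assumes the labels contain at least one true positive case and one
    true negative case, so neither denominator is zero.
    """
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, t in zip(predicted, labels) if p and t)          # true positives
    fp = sum(1 for p, t in zip(predicted, labels) if p and not t)      # false positives
    fn = sum(1 for p, t in zip(predicted, labels) if not p and t)      # false negatives
    tn = sum(1 for p, t in zip(predicted, labels) if not p and not t)  # true negatives
    sensitivity = tp / (tp + fn)  # fraction of real positives that are detected
    specificity = tn / (tn + fp)  # fraction of real negatives that are rejected
    return sensitivity, specificity


# Hypothetical prediction scores and their true status (illustration only).
scores = [0.95, 0.80, 0.70, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [True, True, False, True, False, True, False, False]

for threshold in (0.25, 0.50, 0.75):
    sens, spec = evaluate(scores, labels, threshold)
    print(f"threshold={threshold:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

With these made-up values, lowering the threshold recovers every true positive at the price of more false positives, while raising it does the opposite.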
Why bioinformatics then ?
- In most cases, wet biology will be required afterwards to validate the predictions
- Bioinformatics can
  - reduce data to a small set of testable predictions
  - assign a degree of confidence to each prediction
- The biologist will often have to choose the appropriate degree of confidence, depending on the trade-off between
  - cost of validating predictions
  - benefit expected from the right predictions
- Bioinformatics as in silico biology
  - Allows the exploration of domains that cannot be addressed experimentally, e.g., the study of past evolutionary events
    - Phylogenetic inference and comparative genomics give us insights into the mechanisms of evolution and into past evolutionary events
    - The time scale of these events is however so large (billions of years) that one cannot conceive of reproducing the inferred events with experimental methods.
Goals of Bioinformatics
Molecular Biology as an Information Science.
What is the Information?

Central Dogma of Molecular Biology
DNA -> RNA -> Protein -> Phenotype -> DNA
Molecules: Sequence, Structure, Function
Processes: Mechanism, Specificity, Regulation

Central Paradigm for Bioinformatics
Genomic Sequence Information -> mRNA (level) -> Protein Sequence -> Protein Structure -> Protein Function -> Phenotype
Large Amounts of Information: Standardized, Statistical
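
The DNA -> RNA -> Protein steps of the central dogma can be treated as simple string operations, which is what makes molecular biology look like an information science. The sketch below is a minimal illustration and makes several assumptions: the input is the coding (sense) strand, it is read in frame from its first base, and the standard genetic code applies. The example sequence and the function names `transcribe` and `translate` are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in U, C, A, G order for each codon position.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna):
    """Return the mRNA for a coding-strand DNA sequence (T replaced by U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon from position 0, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical coding sequence used only for illustration.
coding_dna = "ATGGCCATTGTAATGGGCCGCTGA"
mrna = transcribe(coding_dna)
protein = translate(mrna)
print(mrna)     # AUGGCCAUUGUAAUGGGCCGCUGA
print(protein)  # MAIVMGR
```

Real sequences need more care (ambiguous bases, alternative genetic codes, choice of reading frame), which this sketch deliberately ignores.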
Most cellular functions are performed or facilitated by proteins.
Protein:
- Primary biocatalyst
- Cofactor transport/storage
- Mechanical motion/support
- Immune protection
- Control of growth/differentiation
RNA:
- Information transfer (mRNA)
- Protein synthesis (tRNA/mRNA)
- Some catalytic activity
DNA:
- Genetic material
(idea from D Brutlag, Stanford, graphics from S Strobel)
Scope of Bioinformatics
- Development of computational tools
  - Writing software
  - Creating databases
- Application of these tools to generate biological knowledge
  - Molecular sequence analysis
  - Molecular structural analysis
  - Molecular functional analysis



The Bioinformatics Platform
High-performance computing server:
32 total processing cores
128GB of memory (RAM)
8TB of disk space
25TB LTO4 tape backup library
Linux cluster
32 CPUs (AMD 64-bit)
128 Gigabyte RAM
>10 terabytes disk storage
Grid computing
Parallel applications:
> Genome assembly (Newbler, MIRA, Celera, Velvet, CAP3, ...)
> Genome annotation (Glimmer, ...)
> Phylogenetic analysis (BEAST, MrBayes)
> Other sequence analysis tools (BLAST, ClustalW, HMMER, R)

BecA-ILRI Genomics Platform
Opportunities for genomics and metagenomics research

Capillary sequencing: ABI 3130-xl, ABI 3730-xl, ABI 3500-xl
Next generation sequencing: 454 GS pyrosequencer
Genomics, Viral genomics, Functional Genomics, Metagenomics
1 sample = 1 library = 1 plate
500 Mb/run: 1/2 cassava genome, 1/8 human genome
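
The per-run figures above are simple arithmetic, sketched below. The 500 Mb yield and the 1/2 and 1/8 fractions come from the slide; the genome sizes are back-calculated from those fractions and are therefore rough values implied by the slide, not measured genome sizes. The function names are hypothetical, and the run count assumes perfectly even coverage, which real sequencing never achieves.

```python
import math

RUN_YIELD_MB = 500  # sequence yield per run, as stated on the slide


def implied_genome_size_mb(run_yield_mb, fraction_per_run):
    """Genome size implied by 'one run covers this fraction of the genome'."""
    return run_yield_mb / fraction_per_run


def runs_needed(genome_size_mb, run_yield_mb, coverage=1.0):
    """Number of runs whose total yield reaches the requested fold coverage."""
    return math.ceil(genome_size_mb * coverage / run_yield_mb)


# Fractions taken from the slide; sizes are derived, not independently sourced.
cassava_mb = implied_genome_size_mb(RUN_YIELD_MB, 1 / 2)
human_mb = implied_genome_size_mb(RUN_YIELD_MB, 1 / 8)

print(cassava_mb, runs_needed(cassava_mb, RUN_YIELD_MB))          # 1000.0 2
print(human_mb, runs_needed(human_mb, RUN_YIELD_MB))              # 4000.0 8
print(runs_needed(human_mb, RUN_YIELD_MB, coverage=10))           # 80
```

In practice an assembly needs many-fold coverage, so the number of runs scales with the coverage argument rather than stopping at one-fold.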
Bioinformatics Core Activities
Statistical support
- Experimental design
Primary data analysis
- NGS QC, spatial defect removal
- 454 GA pipeline
Secondary/downstream analysis
- Differential expression
- ChIP-seq peak calling
- Structural variation, genomic rearrangements
- SNP and CN analysis
- microRNA profiling
- GO enrichment
Training/Capacity Building
- motif finding
- functional/network analysis
- microarray analysis
Data management
- NGS data storage and manipulation
- Data warehouse facilities: databases
Software development
- Bioconductor packages: NGS annotation packages
- Automated NGS analysis packages
Bioinformatics tools
- Ensembl, Galaxy, Cytoscape
From Sequence (genomics/metagenomics) to impact

[Figure: flow from (meta)genome sequencing to impact.
Data: databases; compilation of complete genomes, metagenomes, annotation and curation of metadata; extraction of important biological information.
Analyses: phylogenetic analysis; geographical mapping; protein modeling; sequence variation analysis; primer, microarray selection; discovery of new micro-organisms and pathways.
Impact: diagnostics; global diseases surveillance; vaccine development; drug development; improved drug; environmental sustainability; improved public health intervention.]
Books

- Zvelebil, M. & Baum, J.O. (2007). Understanding Bioinformatics. pp. 772.
- Pevsner, J. (2003). Bioinformatics and Functional Genomics. Wiley.
  - All the slides available at: http://www.bioinfbook.org/
- Mount, D.W. (2004). Bioinformatics: Sequence and Genome Analysis. pp. 692.
  - http://www.bioinformaticsonline.org/
- Westhead, D.R., Parish, J.H. & Twyman, R.M. (2002). Bioinformatics. BIOS Scientific Publishers, Oxford.
- Branden et al. (1998). Introduction to Protein Structure. pp. 410.



The BecA Hub team
08 countries, 17 females, 19 males
Australia, Benin, Cameroon, England, Ethiopia, Italy, Kenya, USA
Thank you!!!
