#1 Pendahuluan

Introduction to Bioinformatics
1
What is Bioinformatics?
2
3
NIH – definitions
What is Bioinformatics? - Research, development,
and application of computational tools and on molecular
approaches for expanding the use of biological,
medical, behavioral, and health data, including the
means to acquire, store, organize, archive, analyze,
or visualize such data.
What is Computational Biology? - The

development and application of analytical and
theoretical methods, mathematical modeling and
computational simulation techniques to the study of
biological, behavioral, and social data.
4
NSF – introduction
Large databases that can be accessed and analyzed with
sophisticated tools have become central to biological
research and education. The information content in the
genomes of organisms, in the molecular dynamics of
proteins, and in population dynamics, to name but a few
areas, is enormous. Biologists are increasingly finding that
the management of complex data sets is becoming a
bottleneck for scientific advances. Therefore,
bioinformatics is rapidly become a key technology in all
fields of biology.
5
NSF – mission statement
The present bottlenecks in bioinformatics include the education of
biologists in the use of advanced computing tools, the recruitment
of computer scientists into this evolving field, the limited
availability of developed databases of biological information, and
the need for more efficient and intelligent search engines for
complex databases.
6
NSF – mission statement
The present bottlenecks in bioinformatics include the education of
biologists in the use of advanced computing tools, the recruitment of
computer scientists into this evolving field, the limited availability
of developed databases of biological information, and the need for
more efficient and intelligent search engines for complex databases.
7
Molecular Bioinformatics
Molecular Bioinformatics involves the use
of computational tools to discover new
information in complex data sets (from the
one-dimensional information of DNA through
the two-dimensional information of RNA and
the three-dimensional information of proteins,
to the four-dimensional information of
evolving living systems).
8
Bioinformatics (Oxford English Dictionary):
The branch of science concerned with

information and information flow in biological
systems, esp. the use of computational
methods in genetics and genomics.
9
The field of science in which biology, computer science and
information technology merge into a single discipline
Biologists
collect molecular data:
DNA & Protein sequences,
gene expression, etc.
Bioinformaticians
Study biological questions by
analyzing molecular data
Computer scientists
(+Mathematicians, Statisticians, etc.)
Develop tools, softwares, algorithms
to store and analyze the data.
10
The hereditary information of all living organisms, with the exception of some viruses,
is carried by deoxyribonucleic acid (DNA) molecules.
2 purines: 2 pyrimidines:
adenine (A) guanine (G) cytosine (C) thymine (T)
two rings one ring

11
The entire complement of genetic material carried by an individual is called the
genome.
Eukaryotes may have up to 3 subcellular

genomes:
1. Nuclear
2. Mitochondrial
3. Plastid
Bacteria have either circular or linear

Human chromosomes
genomes and may also carry plasmids
12
Circular genome
Central dogma: DNA makes RNA makes Protein
Modified dogma: DNA makes DNA and RNA, RNA makes DNA, RNA
an Protein
13
Amino acids - The protein building blocks
14
15
Any region of the DNA sequence can, in principle, code for six different amino acid
sequences, because any one of three different reading frames can be used to
interpret each of the two strands.
16
Protein folding
A human Hemoglobin:
17
How does it all looks like on a computer monitor?
18
A cDNA sequence
>gi|14456711|ref|NM_000558.3| Homo sapiens hemoglobin, alpha 1 (HBA1), mRNA

ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCG
CCTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCACC
ACCAAGACCTACTTCCCGCACTTCGACCTGAGCCACGGCTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCGA
CGCGCTGACCAACGCCGTGGCGCACGTGGACGACATGCCCAACGCGCTGTCCGCCCTGAGCGACCTGCACGCGCACA
AGCTTCGGGTGGACCCGGTCAACTTCAAGCTCCTAAGCCACTGCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGCC
GAGTTCACCCCTGCGGTGCACGCCTCCCTGGACAAGTTCCTGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACCG
TTAAGCTGGAGCCTCGGTGGCCATGCTTCTTGCCCCTTGGGCCTCCCCCCAGCCCCTCCTCCCCTTCCTGCACCCGT
ACCCCCGTGGTCTTTGAATAAAGTCTGAGTGGGCGGC
19
A cDNA sequence (reading frame)
>gi|14456711|ref|NM_000558.3| Homo sapiens hemoglobin, alpha 1 (HBA1), mRNA

ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCC
GCCTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCAC
CACCAAGACCTACTTCCCGCACTTCGACCTGAGCCACGGCTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCG
ACGCGCTGACCAACGCCGTGGCGCACGTGGACGACATGCCCAACGCGCTGTCCGCCCTGAGCGACCTGCACGCGCAC
AAGCTTCGGGTGGACCCGGTCAACTTCAAGCTCCTAAGCCACTGCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGC
CGAGTTCACCCCTGCGGTGCACGCCTCCCTGGACAAGTTCCTGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACC
GTTAAGCTGGAGCCTCGGTGGCCATGCTTCTTGCCCCTTGGGCCTCCCCCCAGCCCCTCCTCCCCTTCCTGCACCC
GTACCCCCGTGGTCTTTGAATAAAGTCTGAGTGGGCGGC
A protein sequence
>gi|4504347|ref|NP_000549.1| alpha 1 globin [Homo sapiens]

MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAH
VDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
20
And, a whole genome…
ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCTGG
GGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCACCACCAAGACCT
ACTTCCCGCACTTCGACCTGAGCCACGGCTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCGACGCGCTGACCAACGC
CGTGGCGCACGTGGACGACATGCCCAACGCGCTGTCCGCCCTGAGCGACCTGCACGCGCACAAGCTTCGGGTGGACCCGGTC
AACTTCAAGCTCCTAAGCCACTGCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGCCGAGTTCACCCCTGCGGTGCACGCCT
CCCTGGACAAGTTCCTGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACCGTTAAGCTGGAGCCTCGGTGGCCATGCTTCT
TGCCCCTTGGGCCTCCCCCCAGCCCCTCCTCCCCTTCCTGCACCCGTACCCCCGTGGTCTTTGAATAAAGTCTGAGTGGGCG
GCACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCT
GGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCACCACCAAGAC
CTACTTCCCGCACTTCGACCTGAGCCACGGCTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCGACGCGCTGACCAAC
GCCGTGGCGCACGTGGACGACATGCCCAACGCGCTGTCCGCCCTGAGCGACCTGCACGCGCACAAGCTTCGGGTGGACCCGG
TCAACTTCAAGCTCCTAAGCCACTGCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGCCGAGTTCACCCCTGCGGTGCACGC
CTCCCTGGACAAGTTCCTGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACCGTTAAGCTGGAGCCTCGGTGGCCATGCTT
CTTGCCCCTTGGGCCTCCCCCCAGCCCCTCCTCCCCTTCCTGCACCCGTACCCCCGTGGTCTTTGAATAAAGTCTGAGTGGG
CGGCACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGC
CTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCACCACCAAG
ACCTACTTCCCGCACTTCGACCTGAGCCACGGCTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCGACGCGCTGACCA
ACGCCGTGGCGCACGTGGACGACATGCCCAACGCGCTGTCCGCCCTGAGCGACCTGCACGCGCACAAGCTTCGGGTGGACCC
GGTCAACTTCAAGCTCCTAAGCCACTGCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGCCGAGTTCACCCCTGCGGTGCAC
GCCTCCCTGGACAAGTTCCTGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACCGTTAAGCTGGAGCCTCGGTGGCCATGC
TTCTTGCCCCTTGGGCCTCCCCCCAGCCCCTCCTCCCCTTCCTGCACCCGTACCCCCGTGGTCTTTGAATAAAGTCTGAGTG
GGCGGCGCCGTGGCGCACGTGGACGACATGCCCAACGCGCTGTCCGCCCTGAGCGACCTGCACGCGCACAAGCTTCGGGTGG
ACCCGGTCAACTTCAAGCTCCTAAGCCACTGCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGCCGAGTTCACCCCTGCGGT
GCACGCCTCCCTGGACAAGTTCCTGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACCGTTAAGCTGGAGCCTCGGTGGCC
ATGCTTCTTGCCCCTTGGGCCTCCCCCCAGCCCCTCCTCCCCTTCCTGCACCCGTACCCCCGTGGTCTTTGAATAAAGTCTG
AGTGGGCGGCACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAA
GGCCGCCTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCACC
21
ACCAAGACCTACTTCCCGCACTTCGACCTGAGCCACGGCTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCG...
How big are whole genomes?
E. coli 4.6 x 106 nucleotides

– Approx. 4,000 genes
Yeast 15 x 106 nucleotides

Human 3 x 109 nucleotides

Smallest human chromosome 50 x 106 nucleotides
22
What do we actually do with bioinformatics?
23
Sequence assembly
24
(next generation sequencing)
Genome annotation
25
Molecular evolution
Origins and evolutionary genomics of the 2009 swine-origin H1N1 influenza A epidemic
26
Smith et al. (2009) Nature 459, 1122-1125
Analysis of gene expression
Gene expression profile of relapsing versus non-relapsing Wilms tumors.

A set of 39 genes discriminates between the two classes of tumors.
27
(http://www.biozentrum2.uni-wuerzburg.de/, Prof. Gessler)
Analysis of regulation
28
Toledo and Bardot (2009) Nature 460, 466-467
Protein structure prediction
Protein docking
29
30
Luscombe, Greenbaum, Gerstein (2001)
From DNA to Genome
Sanger sequences
Watson and Crick insulin protein
1955
DNA model
1960 Dayhoff’s Atlas
Sequence
alignment 1965 ARPANET
(early Internet)
1970
PDB (Protein
Sanger dideoxy
Data Bank) 1975 DNA sequencing
1980 PCR (Polymerase
GenBank database
Chain Reaction)31
1985
NCBI SWISS-PROT
database
FASTA 1990
Human Genome
Initiative
BLAST
EBI
1995
World Wide Web First bacterial

genome
Yeast genome 2000 First human
genome draft 32
Origin of bioinformatics and
biological databases:
The first protein sequence reported was that of
bovine insulin in 1956, consisting of 51
residues.
Nearly a decade later, the first nucleic acid

sequence was reported, that of yeast
tRNAalanine with 77 bases.
33
In 1965, Dayhoff gathered all the available
sequence data to create the first bioinformatic
database (Atlas of Protein Sequence and
Structure).
The Protein DataBank followed in 1972 with a

collection of ten X-ray crystallographic protein
structures. The SWISSPROT protein sequence
database began in 1987. 34
Nucleotides
35
Complete Genomes
as of August 2011:
Eukaryotes 37
Prokaryotes 1708
Total 1745
36
What can we do with sequences and other type of molecular information?
37
Open reading frames
Functional sites
Annotation
Structure, function
38
CCTGACAAATTCGACGTGCGGCATTGCATGCAGACGTGCATG
CGTGCAAATAATCAATGTGGACTTTTCTGCGATTATGGAAGAA
CTTTGTTACGCGTTTTTGTCATGGCTTTGGTCCCGCTTTGTTC
AGAATGCTTTTAATAAGCGGGGTTACCGGTTTGGTTAGCGAGA
AGAGCCAGTAAAAGACGCAGTGACGGAGATGTCTGATG CAA
TAT GGA CAA TTG GTT TCT TCT CTG AAT ......
.............. TGAAAAACGTA
39
promoter TF binding site
Transcription
CCTGACAAATTCGACGTGCGGCATTGCATGCAGACGTGCATG
CGTGCAAATAATCAATGTGGACTTTTCTGCGATTATGGAAGAA
Start Site
CTTTGTTACGCGTTTTTGTCATGGCTTTGGTCCCGCTTTGTTC
AGAATGCTTTTAATAAGCGGGGTTACCGGTTTGGTTAGCGAGA
AGAGCCAGTAAAAGACGCAGTGACGGAGATGTCTGATG CAA
TAT GGA CAA TTG GTT TCT TCT CTG AAT .................................
.............. TGAAAAACGTA Ribosome binding Site
ORF = Open Reading Frame

CDS = Coding Sequence 40
Comparing ORFs
Identifying orthologs
Inferences on structure
and function
Comparative
genomics
Comparing functional sites
Inferences on regulatory
networks
41
Similarity profiles
Researchers can learned a great deal about the structure and

function of human genes by examining their counterparts in
42
model organisms.
Alignment preproinsulin
Xenopus MALWMQCLP-LVLVLLFSTPNTEALANQHL
Bos MALWTRLRPLLALLALWPPPPARAFVNQHL
**** : * *.*: *:..* :. *:****
Xenopus CGSHLVEALYLVCGDRGFFYYPKIKRDIEQ
Bos CGSHLVEALYLVCGERGFFYTPKARREVEG
***************:***** ** :*::*
Xenopus AQVNGPQDNELDG-MQFQPQEYQKMKRGIV
Bos PQVG---ALELAGGPGAGGLEGPPQKRGIV
.**. ** * * *****
Xenopus EQCCHSTCSLFQLENYCN
Bos EQCCASVCSLYQLENYCN
**** *.***:******* 43
44
Ultraconserved Elements in the
Human Genome
Gill Bejerano, Michael Pheasant, Igor Makunin, Stuart
Stephen, W. James Kent, John S. Mattick, & David Haussler
(Science 2004. 304:1321-1325)
There are 481 segments longer than 200 base pairs (bp) that
are absolutely conserved (100% identity with no insertions or
deletions) between orthologous regions of the human, rat, and
mouse genomes. Nearly all of these segments are also
conserved in the chicken and dog genomes, with an average of
95 and 99% identity, respectively. Many are also significantly
conserved in fish. These ultraconserved elements of the human
genome are most often located either overlapping exons in
genes involved in RNA processing or in introns or nearby genes
involved in the regulation of transcription and development.
There are 156 intergenic, untranscribed,

ultraconserved segments
45
Junk:
Supporting evidence
Junk is real!
46
Genome-wide profiling of:
• mRNA levels
• Protein levels
Co-expression of genes
and/or proteins
Functional
genomics
Identifying protein-protein
interactions
Networks of interactions
47
Understanding the function of genes and other
parts of the genome
48
Structural Assign structure to all
genomics proteins encoded in
a genome
49
Biological
databases
50
Database or databank?
Initially
• Databank (in UK)
• Database (in the USA)
Solution
• The abbreviation db
51
What is a Database?
A structured collection of data held in computer storage; esp. one

that incorporates software to make it accessible in a variety of ways;
transf., any large collection of information.
database management: the organization and manipulation of data in

a database.
database management system (DBMS): a software package that

provides all the functions required for database management.
database system: a database together with a database

management system.
52
Oxford Dictionary
What is a database?
• A collection of data
– structured
– searchable (index) -> table of contents
– updated periodically (release) -> new edition
– cross-referenced (hyperlinks) -> links with other db
• Includes also associated tools (software) necessary

for access, updating, information insertion,
information deletion….
• Data storage management: flat files, relational

databases…
53
Database: a « flat file » example
Flat-file database (« flat file, 3 entries »):
Accession number: 1
First Name: Amos
Last Name: Bairoch
Course: Pottery 2000; Pottery 2001;
//
Accession number: 2
First Name: Dan
Last name: Graur
Course: Pottery 2000, Pottery 2001; Ballet 2001, Ballet 2002
//
Accession number 3:
First Name: John
Last name: Travolta
Course: Ballet 2001; Ballet 2002;
//
• Easy to manage: all the entries are visible at the same time !
54
Database: a « relational » example
Relational database (« table file »):
Teacher Accession Education

number
Amos 1 Biochemistry
Dan 2 Genetics
John 3 Scientology
Course Year Involved

teachers
Advanced 2000; 2001 1; 2
Pottery
Ballet for Fat 2001; 2002 2; 3
People
55
Why biological databases?
• Exponential growth in biological data.
• Data (genomic sequences, 3D structures, 2D

gel analysis, MS analysis, Microarrays….) are
no longer published in a conventional
manner, but directly submitted to databases.
• Essential tools for biological research. The

only way to publish massive amounts of data
without using all the paper in the world.
56
Distribution of sequences
• Books, articles 1968 -> 1985
• Computer tapes 1982 -> 1992
• Floppy disks 1984 -> 1990
• CD-ROM 1989 ->
• FTP 1989 ->
• On-line services 1982 -> 1994
• WWW 1993 ->
• DVD 2001 ->
57
Some statistics
• More than 1000 different ‘biological’ databases
• Variable size: <100Kb to >20Gb

– DNA: > 20 Gb
– Protein: 1 Gb
– 3D structure: 5 Gb
– Other: smaller
• Update frequency: daily to annually to seldom to forget

about it.
• Usually accessible through the web (some free, some not)
58
 Some databases in the field of molecular biology…
AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb,

ARR, AsDb, BBDB, BCGD, Beanref, Biolmage,
BioMagResBank, BIOMDB, BLOCKS, BovGBASE,
BOVMAP, BSORF, BTKbase, CANSITE, CarbBank,
CARBHYD, CATH, CAZY, CCDC, CD4OLbase, CGAP,
ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG,
CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb,
Picty_cDB, DIP, DOGS, DOMO, DPD, DPlnteract, ECDC,
ECGC, EC02DBASE, EcoCyc, EcoGene, EMBL, EMD db,
ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView,
GCRDB, GDB, GENATLAS, Genbank, GeneCards,
Genline, GenLink, GENOTK, GenProtEC, GIFTS,
GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB,
HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD,
HIDB, HIDC, HlVdb, HotMolecBase, HOVERGEN, HPDB,
HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat,
KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB,
Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP5
Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us,
MPDB, MRR, MutBase, MycDB, NDB, NRSub, 0-lycBase,
OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB,
PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD,
PPDB, PRESAGE, PRINTS, ProDom, Prolysis, PROSITE,
PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE,
SCOP, SeqAnaiRef, SGD, SGP, SheepMap, Soybase,
SPAD, SRNA db, SRPDB, STACK, StyGene,Sub2D,
SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE, SWISS-
MODEL Repository, SWISS-PROT, TelDB, TGN, tmRDB,
TOPS, TRANSFAC, TRR, UniGene, URNADB, V BASE,
VDRR, VectorDB, WDCM, WIT, WormPep, YEPD, YPD, 59
YPM, etc .................. !!!!
Categories of databases for Life
Sciences
• Sequences (DNA, protein)
• Genomics
• Mutation/polymorphism
• Protein domain/family
• Proteomics (2D gel, Mass Spectrometry)
• 3D structure
• Metabolic networks
• Regulatory networks
• Bibliography
• Expression (Microarrays,…)
• Specialized 60
NCBI:
http://www.ncbi.nlm.nih.gov
EBI:
http://www.ebi.ac.uk/
DDBJ:
http://www.ddbj.nig.ac.jp/
61
Literature Databases:
Bookshelf: A collection of searchable biomedical books linked to
PubMed.
PubMed: Allows searching by author names, journal titles, and a

new Preview/Index option. PubMed database provides access to
over 12 million MEDLINE citations back to the mid-1960's. It
includes History and Clipboard options which may enhance your
search session.
PubMed Central: The U.S. National Library of Medicine digital

archive of life science journal literature.
OMIM: Online Mendelian Inheritance in Man is a database of

human genes and genetic disorders (also OMIA). 62
PubMed (Medline)
• MEDLINE covers the fields of medicine, nursing, dentistry,
veterinary medicine, public health, and preclinical sciences
• Contains citations from approximately 5,200 worldwide journals in
37 languages; 60 languages for older journals.
• Contains over 20 million citations since 1948
• Contains links to biological db and to some journals
• New records are
added to
PreMEDLINE daily!
• Alerting services
– http://www.pubcrawler.ie/
– http://www.biomail.org
65
A search by subject: “mitochondrion evolution”
Type in a Query term
• Enter your search words in the
query box and hit the “Go” button
http://www.ncbi.nlm.nih.gov/entrez/query/static/help/helpdoc.html#Searching
68
The Syntax …
1. Boolean operators: AND, OR, NOT must be entered in
UPPERCASE (e.g., promoters OR response elements). The default
is AND.
2. Entrez processes all Boolean operators in a left-to-right sequence.

The order in which Entrez processes a search statement can be
changed by enclosing individual concepts in parentheses. The terms
inside the parentheses are processed first. For example, the search
statement: g1p3 OR (response AND element AND promoter).
3. Quotation marks: The term inside the quotation marks is read as one
phrase (e.g. “public health” is different than public health, which will
also include articles on public latrines and their effect on health
workers).
4. Asterisk: Extends the search to all terms that start with the letters
before the asterisk. For example, dia* will include such terms as 69
diaphragm, dial, and diameter.
Refine the Query
• Often a search finds too many (or too few) sequences, so you
can go back and try again with more (or fewer) keywords in
your query
• The “History” feature allows you to combine any of your past
queries.
• The “Limits” feature allows you to limit a query to specific
organisms, sequences submitted during a specific period of
time, etc.
• [Many other features are designed to search for literature in
MEDLINE]
70
Related Items
You can search for a text term in sequence annotations or in
MEDLINE abstracts, and find all articles, DNA, and protein
sequences that mention that term.
Then from any article or sequence, you can move to "related
articles" or "related sequences".
•Relationships between sequences are computed with BLAST
•Relationships between articles are computed with "MESH" terms
(shared keywords)
•Relationships between DNA and protein sequences rely on accession
numbers
•Relationships between sequences and MEDLINE articles rely on both
shared keywords and the mention of accession numbers in the articles.
71
72
73
74
A search by authors: “Esser” [au] AND “martin” [au]
A search by title word: “Wolbachia pipientis” [ti]
Database Search Strategies
• General search principles - not limited
to sequence (or to biology).
• Start with broad keywords and narrow
the search using more specific terms.
• Try variants of spelling, numbers, etc.
• Search many databases.
• Be persistent!!
77
Searching PubMed
• Structureless searches
– Automatic term mapping
• Structured searches
– Tags, e.g. [au], [ta], [dp], [ti]
– Boolean operators, e.g. AND, OR, NOT, ()
• Additional features
– Subsets, limits
– Clipboard, history
78
Start working:
Search PubMed
1. cuban cigars
2. cuban OR cigars
3. “cuban cigars”
4. cuba* cigar*
5. (cuba* cigar*) NOT smok*
6. Fidel Castro
7. “fidel castro”
8. #6 NOT #7 79
“Details” and “History” in
PubMed
80
“Details” and “History” in
PubMed
81
The OMIM (Online Mendelian
Inheritance in Man)
– Genes and genetic disorders

– Edited by team at Johns Hopkins
– Updated daily
82
MIM Number Prefixes
* gene with known sequence
+ gene with known sequence and
phenotype
# phenotype description, molecular
basis known
% mendelian phenotype or locus,
molecular basis unknown
no prefix other, mainly phenotypes with
suspected mendelian basis
83
Searching OMIM
• Search Fields
– Name of trait, e.g., hypertension
– Cytogenetic location, e.g., 1p31.6
– Inheritance, e.g., autosomal dominant
– Gene, e.g., coagulation factor VIII
84
OMIM search tags
All Fields [ALL]

Allelic Variant [AV] or [VAR]
Chromosome [CH] or [CHR]
Clinical Synopsis [CS] or [CLIN]
Gene Map [GM] or [MAP]
Gene Name [GN] or [GENE]
Reference [RE] or [REF]
85
86
Start working:
Search OMIM
How many types of hemophilia are there?

For how many is the affected gene known?
What are the genes involved in hemophilia A?
What are the mutations in hemophilia A?
87
Online Literature databases
1. How to use the UH online Library?

2. Online glossaries
3. Google Scholar
4. Google Books
5. Web of Science 88
How to use the online UH Library? http://info.lib.uh.edu/
89
90
Online Glossaries
Bioinformatics :
http://www.geocities.com/bioinformaticsweb/glossary.html
http://big.mcw.edu/
Genomics:
http://www.geocities.com/bioinformaticsweb/genomicglossary.html
Molecular Evolution:
http://workshop.molecularevolution.org/resources/glossary/
Biology dictionary:
http://www.biology-online.org/dictionary/satellite_cells
Other glossaries, e.g., the list of phobias:

http://www.phobialist.com/class.html
91
4. Google Scholar
http://www.scholar.google.com/
92
What is Google Scholar?
Enables you to search specifically for scholarly

literature, including peer-reviewed papers,
theses, books, preprints, abstracts and technical
reports from all broad areas of research.
93
Use Google Scholar to find articles from a
wide variety of academic publishers,
professional societies, preprint repositories
and universities, as well as scholarly articles
available across the web.
94
Google Scholar
orders your
search results by
how relevant they
are to your query,
so the most
useful references
should appear at
the top of the
page
This relevance
ranking takes into
account the: full
text of each article.
the article's author,
the publication in
which the article
appeared and how
often it has been
cited in scholarly 95
literature.
What other DATA can we retrieve from the record?
96
97
98
5. Google Book Search
99
100
Start working:
Search Google Books
How many times is the tail of the giraffe

mentioned in On the Origin of Species by Mr.
Darwin?
101
6. Web of science
http://http://apps.webofknowledge.com.ezproxy.lib.uh.edu/WOS_GeneralSearch_input.do?product
=WOS&search_mode=GeneralSearch&SID=4FB7LbbLgDMhG9fDiLh&preferencesSaved=
102
103
104
Strategi: penelitian kolaboratif
Contoh 1: Repositori dan Genome Browser
Pengetahuan Departemen
yang diperlukan terkait
DNA/genom, RNA, Biologi

Aplication
Data Molekuler protein (proses
Source Information Genome
Browser biologi, dll) Biokimia
DNA
RNA Data Storage Pertanuan, dll

& Analysis
Protein
Database,
Data Store / Ilmu Komputer
Storage Repositori Datamining
Biological
Knowledge Statistika
Biochemistry Data
Bioinformatika
Biology
Knowledge
Data Engineering Information
retrieval, usability, Ilmu Komputer
algoritme dan
Repositori data bioinformatika pemrograman
Manfaat repositori dan aplikasi
bioinformatika
Pencarian Sekuens
Genom, RNA, Protein, dll
Identifikasi SNP untuk

mendukung pemuliaan
Peneliti Pemulia
Klasifikasi tumbuhan,
hewan
Peneliti Biologi
Pencarian/perancangan
obat, termasuk obat
Peneliti biokimia/ herbal
farmasi
Contoh 2 : Sistem Pengidentifikasi SNP
Definisi, karakteristik Biologi

SNP dan Verifikasi
SNP Biokimia
Agronomi,
Pemuliaan, Tek
Pertanian
Multiple Sequence
Alignment dengan
Dynamic Ilmu Komputer
Programming, (Lab. IR, NCC
Parallel Computing Bioinformatika)
Contoh 3: DNA Sequence Assembly
Karakteristik Biologi
sekuens DNA,
repeats, micro- Biokimia
satellite,
sequencing
error,
polymorphism
de Bruijn graph
Solusi dicapai
dengan
menemukan algoritme graf,
struktur data, Matematika
Eulerian Path
dynamic
programming Ilmu Komputer
Consensus (Lab. Information
dilakukan Retrieval,
dengan Multiple Bioinformatika)
Alignment
DNA Sequence Assembly
Selayang pandang proses pembacaan seluruh genom dengan teknik shotgun.
Potongan
DNA genom
Reads
Contigs
Scaffolds
Pemetaan
scaffolds
Peta genom
Perakitan urutan genom
Gambaran menyeluruh dari semua reads
yang saling bersambung dan membentuk Tampilan dekat: urutan basa
sebuah contig. dari contig yang terbentuk.
Tampilan dekat: urutan basa

dari tiap read (bacaan) dan
posisinya dalam contig
(sambungan).
Contoh 4: Metagenome Fragment Binning
Binning
tanah
DNA Sequencing
http://www.eng.unimelb.edu.au
Pengetahuan Departemen Pengetahuan Departemen

yang diperlukan terkait yang diperlukan terkait
Karakteristik Biologi Machine Statistika

data Learning,
metagenom Biokimia Peluang Ilmu Komputer,
Bioinformatika
Contoh 5: Protein-protein docking
Karakteristik Biologi
sekuens protein,
ligand, receptor, Biokimia
protein-protein
interaction kimia
Clustering/machine Fisika
learning
kombinatorial, Statistika
geometri, quantum,
energi, pemodelan Matematika
dan simulasi,
grafika komputer, Ilmu Komputer (Lab.
high performance CI, IR, NCC, KDE,
computing SEIns Bioinformatika)
Capaian riset Departemen
Ilmu Komputer IPB
2016/4/5 113
PENGELOMPOKAN FRAGMEN
METAGENOM DENGAN
METODE GROWING SELF
ORGANIZING MAP
Marlinda Vasty Overbeek

Dr Eng Wisnu Ananta Kusuma, ST, MT
Dr Ir Agus Buono, MSi, MKom
Dipublikasikan di ICACSIS 2013 114

Metagenome fragment binning - background
 Hanya 1% mikro-organisme yang dapat

dikulturkan [Wu 2008]
 Sisanya diambil langsung dari lingkungan

(metagenome samples) seperti tanah [Daniel 2004],
jaringan tubuh manusia [Qin 2010], and air [Tyson 2004]
 Metagenome sample mengandung mikroorganisme

yang bercampur, prose binning fragments diperlukan
untuk mengelompokkan fragmen ke dalam level
taksonomi tertentu
Binning is required for metagenome sequence

analysis 115
2016/4/5
Latar Belakang
Bercampurnya fragmen-2 dari banyak mikroorganisme
menyebabkan dihasilkannya interspecies chimeras
INTERSPECIES CHIMERAS
B
BINNING
KESALAHAN
PERAKITAN
116
Solusi Masalah
Environmental Sampels
BINNING
Firmicutes Proteobacteria
117
BINNING berdasarkan taksonomi Filum
Metagenome fragment binning – two main approaches
 Homology based approach
• compare fragments against reference databases by sequence alignment
• assign taxon ID to each sequence based on the NCBI taxonomy
• summary the results at different taxonomy levels [Huson 2007]
DNA fragments
reference database
 Composition based approach

• bypass the need of sequence alignment
• calculate k-mers frequencies as compositional features
• these features are used as inputs for unsupervised or supervised learning
Homology based approach is time consuming, thus

researchers introduce compositional based approach using 118
2016/4/5 machine learning
Pengelopokan Fragmen Metagenom dengan Metode
119
Growing Self Organizing Map
Evaluasi
Kombinasi Parameter terbaik pada frekuensi Spaced K-Mer dengan parameter Learning
Rate (LR) 0.5 dan Neighborhood Size (NS) 1
Perhitungan Quantization Error pada Spaced K-Mer

1.000
0.900
0.800
0.700
Error
0.600
0.500
0.400 Titik Minimum
0.300
0.200
0.100
0.000
0 1 2 3 4
LR 0.1 0.813 0.807 0.822 0.819 0.789
Quantization Error
LR 0.25 0.842 0.816 0.827 0.827
Terkecil pada Learning
0.806
LR 0.5 0.823 0.665 0.746
Rate 0.5 dan0.803 0.798
LR 0.75 0.870 0.767 Neighborhood
0.806 Size 1
0.801 0.813
LR 0.9 0.818 0.776 dengan error0.801
0.811 0.665 0.786
LR 0.1 LR 0.25 LR 0.5 LR 0.75 LR 0.9
120
PENGEMBANGAN SISTEM IDENTIFIKASI DAN
ANALISIS SINGLE NUCLEOTIDE POLYMORPHISM
UNTUK PEMULIAAN
TANAMAN KEDELAI
Tim Peneliti:
Dr. Wisnu Ananta Kusuma (Ilmu Komputer, IPB)
Dr. Agus Buono (Ilmu Komputer, IPB)
Dr. Ir. I Made Tasman (BB Biogen)
Habib Rijzaani (BB Biogen)
Mukhlis Hidayat, M.Kom (Matematika, Unsyiah)
BIDANG PENELITIAN: PROGRAM PENCAPAIAN SWASEMBADA DAN

SWASEMBADA BERKELANJUTAN (SB) – A.4.1
Kontribusi Penelitian
 Penciptaan galur-galur unggul
 Pemanfaatan analisis genom untuk
pemuliaan
 Pemanfaatan teknologi informasi untuk
mendukung usaha pemuliaan berbasis
analisis genom
Tahun
Lingkup Kegiatan tahun 2013-2015 2014
diperoleh SNP potensial
Membangun perangkat lunak

asosiasi SNP dan fenotipenya
Sekuens DNA dari
beberapa varietas
Identifikasi SNP usaha perakitan

varietas melalui
persilangan
berdasarkan informasi
SNP potensial
Tahun Membangun perangkat lunak

2013 Varietas
Identifikasi SNP dengan teknik Unggul
Multiple Sequence Alignment Tahun 2015
Metodologi – Detil – Tahun
2013
• Data :
 Menggunakan data sekuens kedelai dari BB
Biogen
Sekuens yang akan dijajarkan adalah
merupakan subset/segmen tertentu dari
genom keseluruhan
Jumlah sekuens yang dijajarkan banyak
>gi|125991767|ref|NM_174354.3| Bos taurus inhibitor of kappa light
(puluhan)
polypeptide sehingga
gene enhancer diperlukan
in B-cells, kinase gammalingkungan
(IKBKG), mRNA
CGCCGGAGCAGGTCAAACCAGCCCTATTAGATAATAGCCTCGTTGTGGTTCCAACCCTAGCGGGAGAAG
paralel untuk menjadikan proses
GGGCCCTGACGTGTTGGATGAGCAGGCCCCCCTGGAAGTCCGGCAGGGGACCAGGACATGCTGGGTGAA
GAGTCTTCTCTGGGGAAGCCAGCCATGCTCCACGTGCCTTCAGAGCAGGGCACCCCTGAGAC
komputasinya efisien
CGTTGTGGTTCCAACCCTAGCGGGAGAAGGGGCCCTGACGTGTTGGATGAGCAGGCCCCCCTGGAAGTC
CGGCAGGGGAC
Format data yang digunakan adalah FASTA
Metodologi – Detil – Tahun
2013
• Teknik yang digunakan untuk
mengidentifikasi SNP adalah Multiple
Sequence Alignment
• Langkah-langkahnya :
A. Pairwise alignment
– Menghitung matriks jarak kemiripan
Luaran dan Capaian
Perangkat Lunak Identifikasi SNP dan modul

interface berbasis web
Aplikasi Berbasis Web
PROGRESS: 40% PROGRESS: 80%

MEKANISME MSA PROGRESIF
1. Menghitung kemiripan antar-sekuens
2. Membangun guide tree dengan clustering
3. Penjajaran progresif
(Lloyd 2010)
Paralelisasi pairwise alignment
• Matriks penjajaran dibagi secara blocked
columnwise
• Tiap prosesor menghitung isi matriks pada
kolom bagiannya masing-masing secara
pipelining
(Li et al. 2012)

KENDALA
Implementasi MSA dengan menggunakan komputasi

GPU ternyata tidak sederhana. Untuk itu pada tahap
awal ini MSA akan diimplementasikan dengan
menggunakan MPI
MODUL APLIKASI WEB
Homepage
Menu Login
Memulai Submit Job
“Semoga kini telah menjadi Mimpi Bersama”
Membangun Sistem Enterprise Pengelolaan Data
Bioinformatika Lokal Indonesia
10-15 tahun

#1 Pendahuluan

Uploaded by

Copyright:

Available Formats

You might also like

#1 Pendahuluan

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

#1 Pendahuluan

Uploaded by

Copyright:

Available Formats

Introduction to Bioinformatics

What is Computational Biology? - The

The branch of science concerned with

adenine (A) guanine (G) cytosine (C) thymine (T)

two rings one ring

Eukaryotes may have up to 3 subcellular

Bacteria have either circular or linear

>gi|14456711|ref|NM_000558.3| Homo sapiens hemoglobin, alpha 1 (HBA1), mRNA

>gi|14456711|ref|NM_000558.3| Homo sapiens hemoglobin, alpha 1 (HBA1), mRNA

>gi|4504347|ref|NP_000549.1| alpha 1 globin [Homo sapiens]

E. coli 4.6 x 106 nucleotides

Yeast 15 x 106 nucleotides

Human 3 x 109 nucleotides

Smallest human chromosome 50 x 106 nucleotides

Gene expression profile of relapsing versus non-relapsing Wilms tumors.

World Wide Web First bacterial

Nearly a decade later, the first nucleic acid

The Protein DataBank followed in 1972 with a

.............. TGAAAAACGTA Ribosome binding Site

ORF = Open Reading Frame

Researchers can learned a great deal about the structure and

There are 156 intergenic, untranscribed,

A structured collection of data held in computer storage; esp. one

database management: the organization and manipulation of data in

database management system (DBMS): a software package that

database system: a database together with a database

– updated periodically (release) -> new edition

– cross-referenced (hyperlinks) -> links with other db

• Includes also associated tools (software) necessary

• Data storage management: flat files, relational

Teacher Accession Education

Course Year Involved

• Data (genomic sequences, 3D structures, 2D

• Essential tools for biological research. The

• Variable size: <100Kb to >20Gb

• Update frequency: daily to annually to seldom to forget

• Usually accessible through the web (some free, some not)

AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb,

PubMed: Allows searching by author names, journal titles, and a

PubMed Central: The U.S. National Library of Medicine digital

OMIM: Online Mendelian Inheritance in Man is a database of

2. Entrez processes all Boolean operators in a left-to-right sequence.

– Genes and genetic disorders

All Fields [ALL]

How many types of hemophilia are there?

1. How to use the UH online Library?

Other glossaries, e.g., the list of phobias:

Enables you to search specifically for scholarly

Search Google Books

How many times is the tail of the giraffe

DNA/genom, RNA, Biologi

RNA Data Storage Pertanuan, dll

Identifikasi SNP untuk

Definisi, karakteristik Biologi

Tampilan dekat: urutan basa

Pengetahuan Departemen Pengetahuan Departemen

Karakteristik Biologi Machine Statistika