
Bioinformatics Notebook

BCS-8B

By:

Abdul Hannan Malik


FA18-BCS-001

Lab 1: Exploring NCBI

What is NCBI?
Basic Research
Responsibilities
Structure of NCBI
Outreach and Education
Tools and Topics

Lab 2: Exploring Repositories

GenBank:
Accessing:
Confidentiality
Privacy
European Molecular Biology Laboratory
Governance
Our Public Engagement and Outreach
Sustainable and inclusive
DNA Data Bank of Japan (DDBJ)
Mission
Governing Structure

LAB 3: ORF Finder

Procedure
Output

Lab 4: Blast DNA Sequence

Procedure
Output
Sequence

Lab 5: Explore ExPASy Tool

Lab 6: Dot Plot Genetic Sequences

Procedure
Output

Lab 7: Blasting a Protein Sequence with PSI - BLAST

Graphic
Graphic Summary

Lab 8: Multiple Sequence Alignment

Muscle
T-Coffee
ClustalW

Lab 9: Explore Molecular Evolutionary Genetics Analysis (MEGA) Software

Lab 10: RNA Protein Analysis

Output
Minimum Free Energy Prediction
Thermodynamic Ensemble Prediction
Graphical Output

Lab 1: Exploring NCBI

What is NCBI?

NCBI is one of the leading information repositories for understanding the language of
living cells, i.e., how they are made and what they do. That language has only four letters
(the nucleotide bases A, C, G and T), yet millions, if not trillions, of individual living
organisms are composed from them. With such a large pool of data, it is difficult to keep
track of the various functions and features found in even the most complex of organisms, Man.

Thus, enter molecular biology, a discipline that allows us to make sense of these “alphabets”
as they combine into words and phrases. The challenge is to find new solutions for handling
data at this scale and to give researchers better access to analysis and computing, so that
we can advance our understanding of our own genetic legacy.

Basic Research

● NCBI has a mission to develop new information technologies to aid in understanding
the fundamental molecular and genetic processes that control health and disease.

● NCBI has also created automated systems for storing and analyzing molecular
biology data.

● It also performs research into advanced methods of computer-based information
processing for analyzing the structure and function of biologically important molecules.

Responsibilities

● Conduct research on biomedical problems.

● Maintain collaborations with several NIH institutes, academia, industry and other
governmental agencies.

● Foster scientific communication by sponsoring meetings, workshops, and lecture
series.

● Support training on basic and applied research in computational biology for
postdoctoral fellows through the NIH Intramural Research Program (IRP).

● Develop, distribute, support and coordinate access to a variety of databases and
software for the scientific and medical communities.

Structure of NCBI

● The Computational Biology Branch conducts basic and applied research in multiple
fields of molecular biology and genetics, including genome analysis, sequence
comparisons, search methodologies, dynamics and structure/function prediction.

● The Information Engineering Branch performs applied research in data
representation and analysis. This includes the development of computer-based
systems for storage, management and retrieval of knowledge relating to molecular
biology, genetics and biochemistry.

● The Information Resources Branch plans, directs and manages the technical operations
of NCBI, including the computer systems used for research and development,
along with the computer systems used to access public databases.

Outreach and Education

NCBI fosters scientific communication in the field of computing as applied to molecular
biology and genetics by sponsoring meetings, workshops and lecture series. A
Scientific Visitors Program has been established to foster collaborations with extramural
scientists.

Tools and Topics

Analysis of Complete Genomes

BioNLP

Clusters of Orthologous Groups (COGs)

Genetics Analysis Software



HistoneDB2.0 with variants

IBIS

LogOddsLogo

MutaBind

MutaGene

SNPDelScore

Structure

Lab 2: Exploring Repositories

GenBank:

● GenBank is the NIH genetic sequence database, an annotated collection of
all publicly available DNA sequences.

● It is part of the INSDC (International Nucleotide Sequence Database
Collaboration), which comprises DDBJ, ENA and GenBank at NCBI.

● A GenBank release occurs every two months and is available from the FTP site.

● The release notes for the current version of GenBank provide detailed information

about the release and notifications of upcoming changes to GenBank.

● An annotated sample GenBank record for a Saccharomyces cerevisiae gene

demonstrates many of the features of the GenBank flat file format.



Accessing:

● Search GenBank for sequence identifiers and annotations with Entrez Nucleotide.

● Search and align GenBank sequences to a query sequence using BLAST (Basic Local

Alignment Search Tool). See BLAST info for more information about the numerous

BLAST databases.

● Search, link, and download sequences programmatically using NCBI e-utilities.

● The ASN.1 and flat file formats are available at NCBI's anonymous FTP server.
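
As a rough illustration of the programmatic access mentioned above, the sketch below builds an E-utilities EFetch URL for a FASTA record. The helper name `efetch_fasta_url` is made up for this notebook, and the accession AE008975 is simply the one used later in Lab 3; the code only constructs the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities services.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_fasta_url(accession, db="nuccore"):
    """Build an EFetch URL that requests one record as plain-text FASTA.

    `efetch_fasta_url` is a hypothetical helper, not part of any NCBI library.
    """
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


url = efetch_fasta_url("AE008975")
print(url)
```

The resulting URL can be opened in a browser or passed to `urllib.request.urlopen` to download the sequence; NCBI asks heavy users to throttle requests, so a script that loops over many accessions should pause between calls.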

Confidentiality

Some authors are concerned that the appearance of their data in GenBank prior to

publication will compromise their work. GenBank will, upon request, withhold release of

new submissions for a specified period of time. However, if the accession number or

sequence data appears in print or online prior to the specified date, your sequence will be

released. In order to prevent the delay in the appearance of published sequence data, we

urge authors to inform us of the appearance of the published data. As soon as it is

available, please send the full publication data--all authors, title, journal, volume, pages and

date--to the following address: update@ncbi.nlm.nih.gov

Privacy

If you are submitting human sequences to GenBank, do not include any data that could

reveal the personal identity of the source. GenBank assumes that the submitter has

received any necessary informed consent authorizations required prior to submitting

sequences.

European Molecular Biology Laboratory



Europe-wide, global impact, infinite curiosity. The European Molecular Biology Laboratory

is a powerhouse of biological expertise.

Governance

EMBL Council comprises national representatives of our member states and is EMBL’s

governing body. Our Director General is Professor Edith Heard FRS.

EMBL pursues five missions in research, services, training, technology transfer and policy

development. Our five-year plans are set out in our Scientific Programme.

Our Public Engagement and Outreach

Our public engagement and outreach activities seek to ensure wider awareness and

application of EMBL’s knowledge by general scientific audiences, teachers, young learners,

and the public.

Sustainable and inclusive

We are committed to transitioning to a sustainable organization, and to creating a more

inclusive research and work culture, including the provision of independent and impartial

support for staff.



DNA Data Bank of Japan (DDBJ)

DDBJ Center collects nucleotide sequence data as a member of INSDC (International Nucleotide
Sequence Database Collaboration) and provides freely available nucleotide sequence data and a
supercomputer system to support research activities in life science.

Mission
It is generally accepted that research in biology today requires computers and
experimental equipment in equal measure. Information obtained from enormous, exhaustive
data sets has greatly contributed to the paradigm shift in biology.

In silico and in vitro / in vivo analyses together will push back the frontiers of life sciences.
In particular, researchers in life science must rely on computers to analyze nucleotide
sequence data accumulating at a remarkably rapid rate.

DDBJ Center aims to play a major role in carrying out research in information biology and to
run DDBJ operations for the world.

Nucleotide sequences record organismal evolution more directly than other biological
materials and are thus invaluable not only for research in life sciences but also for human
welfare in general. The database is, so to speak, a common treasure of humankind. With
this in mind, we make the database accessible online to anyone in the world.

Governing Structure
DDBJ Center operates at the Research Organization of Information and Systems, National
Institute of Genetics (NIG) in Mishima, Japan, with the endorsement of MEXT, the Japanese
Ministry of Education, Culture, Sports, Science and Technology.

DDBJ Center is reviewed and advised by its own advisory board, DNA Database Advisory
Committee (an outside committee of NIG), and also by the advisory board to INSDC,
International Advisory Committee.

LAB 3: ORF Finder

Procedure
1. Open the NCBI website (https://www.ncbi.nlm.nih.gov).
a. Select the Nucleotide category and search for AE008975.
b. Click on FASTA and copy a few kb of the sequence.
2. On the NCBI site, go to ORF Finder.
3. Paste the copied sequence and run the search.
4. Observe the start codon and the number of ORFs found.
5. Result: start codon = ATG, ORFs found = 14.
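
To make the idea behind the procedure concrete, here is a minimal sketch of an ORF scan: it looks in all six reading frames for an ATG followed in-frame by a stop codon. The function names and the `min_len` threshold are assumptions for illustration; the NCBI tool has its own settings, so the count from this sketch will not necessarily match the 14 ORFs reported above.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_len=30):
    """Scan six reading frames for ATG...stop stretches.

    Returns (strand, frame, start, end) tuples. Coordinates are 0-based,
    end-exclusive, and measured on the strand being scanned.
    `min_len` (in nucleotides, stop codon included) is an illustrative default.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens an ORF
                elif start is not None and codon in STOP_CODONS:
                    if i + 3 - start >= min_len:
                        orfs.append((strand, frame + 1, start, i + 3))
                    start = None  # look for the next ORF in this frame
    return orfs


# Toy sequence (made up): one short ORF in forward frame 3.
toy = "CCATGAAATTTGGGTAACC"
print(find_orfs(toy, min_len=9))
```

Pasting the copied portion of AE008975 into `find_orfs` gives a rough cross-check of the web tool's output, keeping in mind that the two may treat partial ORFs and length cut-offs differently.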

Output

Lab 4: Blast DNA Sequence


Procedure
1. Open the NCBI website (https://www.ncbi.nlm.nih.gov).
a. Select the Gene category and search for human insulin.
b. Click on FASTA and copy a few kb of the sequence.
2. Go to http://www.softberry.com/
a. Click on the option for finding genes in eukaryotes.
b. Click on FGENESH.
c. Paste the sequence, select human as the organism and run the search.
3. Observe the results in the PDF output.

Output

Sequence
Predicted protein(s):
>FGENESH:[mRNA] 1 1 exon (s) 1083 - 4811 3729 bp, chain +
ATGGCGAGCCCTCCGGAGAGCGATGGCTTCTCGGACGTGCGCAAGGTGGGCTACCTGCGC
AAACCCAAGAGCATGCACAAACGCTTCTTCGTACTGCGCGCGGCCAGCGAGGCTGGGGGC
CCGGCGCGCCTCGAGTACTACGAGAACGAGAAGAAGTGGCGGCACAAGTCGAGCGCCCCC
AAACGCTCGATCCCCCTTGAGAGCTGCTTCAACATCAACAAGCGGGCTGACTCCAAGAAC
AAGCACCTGGTGGCTCTCTACACCCGGGACGAGCACTTTGCCATCGCGGCGGACAGCGAG
GCCGAGCAAGACAGCTGGTACCAGGCTCTCCTACAGCTGCACAACCGTGCTAAGGGCCAC
CACGACGGAGCTGCGGCCCTCGGGGCGGGAGGTGGTGGGGGCAGCTGCAGCGGCAGCTCC
GGCCTTGGTGAGGCTGGGGAGGACTTGAGCTACGGTGACGTGCCCCCAGGACCCGCATTC
AAAGAGGTCTGGCAAGTGATCCTGAAGCCCAAGGGCCTGGGTCAGACAAAGAACCTGATT
GGTATCTACCGCCTTTGCCTGACCAGCAAGACCATCAGCTTCGTGAAGCTGAACTCGGAG
GCAGCGGCCGTGGTGCTGCAGCTGATGAACATCAGGCGCTGTGGCCACTCGGAAAACTTC
TTCTTCATCGAGGTGGGCCGTTCTGCCGTGACGGGGCCCGGGGAGTTCTGGATGCAGGTG
GATGACTCTGTGGTGGCCCAGAACATGCACGAGACCATCCTGGAGGCCATGCGGGCCATG
AGTGATGAGTTCCGCCCTCGCAGCAAGAGCCAGTCCTCGTCCAACTGCTCTAACCCCATC
AGCGTCCCCCTGCGCCGGCACCATCTCAACAATCCCCCGCCCAGCCAGGTGGGGCTGACC
CGCCGATCACGCACTGAGAGCATCACCGCCACCTCCCCGGCCAGCATGGTGGGCGGGAAG
CCAGGCTCCTTCCGTGTCCGCGCCTCCAGTGACGGCGAAGGCACCATGTCCCGCCCAGCC
TCGGTGGACGGCAGCCCTGTGAGTCCCAGCACCAACAGAACCCACGCCCACCGGCATCGG
GGCAGCGCCCGGCTGCACCCCCCGCTCAACCACAGCCGCTCCATCCCCATGCCGGCTTCC
CGCTGCTCGCCTTCGGCCACCAGCCCGGTCAGTCTGTCGTCCAGTAGCACCAGTGGCCAT
GGCTCCACCTCGGATTGTCTCTTCCCACGGCGATCTAGTGCTTCGGTGTCTGGTTCCCCC

AGCGATGGCGGTTTCATCTCCTCGGATGAGTATGGCTCCAGTCCCTGCGATTTCCGGAGT
TCCTTCCGCAGTGTCACTCCGGATTCCCTGGGCCACACCCCACCAGCCCGCGGTGAGGAG
GAGCTAAGCAACTATATCTGCATGGGTGGCAAGGGGCCCTCCACCCTGACCGCCCCCAAC
GGTCACTACATTTTGTCTCGGGGTGGCAATGGCCACCGCTGCACCCCAGGAACAGGCTTG
GGCACGAGTCCAGCCTTGGCTGGGGATGAAGCAGCCAGTGCTGCAGATCTGGATAATCGG
TTCCGAAAGAGAACTCACTCGGCAGGCACATCCCCTACCATTACCCACCAGAAGACCCCG
TCCCAGTCCTCAGTGGCTTCCATTGAGGAGTACACAGAGATGATGCCTGCCTACCCACCA
GGAGGTGGCAGTGGAGGCCGACTGCCGGGACACAGGCACTCCGCCTTCGTGCCCACCCGC
TCCTACCCAGAGGAGGGTCTGGAAATGCACCCCTTGGAGCGTCGGGGGGGGCACCACCGC
CCAGACAGCTCCACCCTCCACACGGATGATGGCTACATGCCCATGTCCCCAGGGGTGGCC
CCAGTGCCCAGTGGCCGAAAGGGCAGTGGAGACTATATGCCCATGAGCCCCAAGAGCGTA
TCTGCCCCACAGCAGATCATCAATCCCATCAGACGCCATCCCCAGAGAGTGGACCCCAAT
GGCTACATGATGATGTCCCCCAGCGGTGGCTGCTCTCCTGACATTGGAGGTGGCCCCAGC
AGCAGCAGCAGCAGCAGCAACGCCGTCCCTTCCGGGACCAGCTATGGAAAGCTGTGGACA
AACGGGGTAGGGGGCCACCACTCTCATGTCTTGCCTCACCCCAAACCCCCAGTGGAGAGC
AGCGGTGGTAAGCTCTTACCTTGCACAGGTGACTACATGAACATGTCACCAGTGGGGGAC
TCCAACACCAGCAGCCCCTCCGACTGCTACTACGGCCCTGAGGACCCCCAGCACAAGCCA
GTCCTCTCCTACTACTCATTGCCAAGATCCTTTAAGCACACCCAGCGCCCCGGGGAGCCG
GAGGAGGGTGCCCGGCATCAGCACCTCCGCCTTTCCACTAGCTCTGGTCGCCTTCTCTAT
GCTGCAACAGCAGATGATTCTTCCTCTTCCACCAGCAGCGACAGCCTGGGTGGGGGATAC
TGCGGGGCTAGGCTGGAGCCCAGCCTTCCACATCCCCACCATCAGGTTCTGCAGCCCCAT
CTGCCTCGAAAGGTGGACACAGCTGCTCAGACCAATAGCCGCCTGGCCCGGCCCACGAGG
CTGTCCCTGGGGGATCCCAAGGCCAGCACCTTACCTCGGGCCCGAGAGCAGCAGCAGCAG
CAGCAGCCCTTGCTGCACCCTCCAGAGCCCAAGAGCCCGGGGGAATATGTCAATATTGAA
TTTGGGAGTGATCAGTCTGGCTACTTGTCTGGCCCGGTGGCTTTCCACAGCTCACCTTCT
GTCAGGTGTCCATCCCAGCTCCAGCCAGCTCCCAGAGAGGAAGAGACTGGCACTGAGGAG
TACATGAAGATGGACCTGGGGCCGGGCCGGAGGGCAGCCTGGCAGGAGAGCACTGGGGTC
GAGATGGGCAGACTGGGCCCTGCACCTCCCGGGGCTGCTAGCATTTGCAGGCCTACCCGG
GCAGTGCCCAGCAGCCGGGGTGACTACATGACCATGCAGATGAGTTGTCCCCGTCAGAGC
TACGTGGACACCTCGCCAGCTGCCCCTGTAAGCTATGCTGACATGCGAACAGGCATTGCT
GCAGAGGAGGTGAGCCTGCCCAGGGCCACCATGGCTGCTGCCTCCTCATCCTCAGCAGCC
TCTGCTTCCCCGACTGGGCCTCAAGGGGCAGCAGAGCTGGCTGCCCACTCGTCCCTGCTG
GGGGGCCCACAAGGACCTGGGGGCATGAGCGCCTTCACCCGGGTGAACCTCAGTCCTAAC
CGCAACCAGAGTGCCAAAGTGATCCGTGCAGACCCACAAGGGTGCCGGCGGAGGCATAGC
TCCGAGACTTTCTCCTCAACACCCAGTGCCACCCGGGTGGGCAACACAGTGCCCTTTGGA
GCGGGGGCAGCAGTAGGGGGCGGTGGCGGTAGCAGCAGCAGCAGCGAGGATGTGAAACGC

CACAGCTCTGCTTCCTTTGAGAATGTGTGGCTGAGGCCTGGGGAGCTTGGGGGAGCCCCC
AAGGAGCCAGCCAAACTGTGTGGGGCTGCTGGGGGTTTGGAGAATGGTCTTAACTACATA
GACCTGGATTTGGTCAAGGACTTCAAACAGTGCCCTCAGGAGTGCACCCCTGAACCGCAG
CCTCCCCCACCCCCACCCCCTCATCAACCCCTGGGCAGCGGTGAGAGCAGCTCCACCCGC
CGCTCAAGTGAGGATTTAAGCGCCTATGCCAGCATCAGTTTCCAGAAGCAGCCAGAGGAC
CGTCAGTAG
>FGENESH: 1 1 exon (s) 1083 - 4811 1242 aa, chain +
MASPPESDGFSDVRKVGYLRKPKSMHKRFFVLRAASEAGGPARLEYYENEKKWRHKSSAP
KRSIPLESCFNINKRADSKNKHLVALYTRDEHFAIAADSEAEQDSWYQALLQLHNRAKGH
HDGAAALGAGGGGGSCSGSSGLGEAGEDLSYGDVPPGPAFKEVWQVILKPKGLGQTKNLI
GIYRLCLTSKTISFVKLNSEAAAVVLQLMNIRRCGHSENFFFIEVGRSAVTGPGEFWMQV
DDSVVAQNMHETILEAMRAMSDEFRPRSKSQSSSNCSNPISVPLRRHHLNNPPPSQVGLT
RRSRTESITATSPASMVGGKPGSFRVRASSDGEGTMSRPASVDGSPVSPSTNRTHAHRHR
GSARLHPPLNHSRSIPMPASRCSPSATSPVSLSSSSTSGHGSTSDCLFPRRSSASVSGSP
SDGGFISSDEYGSSPCDFRSSFRSVTPDSLGHTPPARGEEELSNYICMGGKGPSTLTAPN
GHYILSRGGNGHRCTPGTGLGTSPALAGDEAASAADLDNRFRKRTHSAGTSPTITHQKTP
SQSSVASIEEYTEMMPAYPPGGGSGGRLPGHRHSAFVPTRSYPEEGLEMHPLERRGGHHR
PDSSTLHTDDGYMPMSPGVAPVPSGRKGSGDYMPMSPKSVSAPQQIINPIRRHPQRVDPN
GYMMMSPSGGCSPDIGGGPSSSSSSSNAVPSGTSYGKLWTNGVGGHHSHVLPHPKPPVES
SGGKLLPCTGDYMNMSPVGDSNTSSPSDCYYGPEDPQHKPVLSYYSLPRSFKHTQRPGEP
EEGARHQHLRLSTSSGRLLYAATADDSSSSTSSDSLGGGYCGARLEPSLPHPHHQVLQPH
LPRKVDTAAQTNSRLARPTRLSLGDPKASTLPRAREQQQQQQPLLHPPEPKSPGEYVNIE
FGSDQSGYLSGPVAFHSSPSVRCPSQLQPAPREEETGTEEYMKMDLGPGRRAAWQESTGV
EMGRLGPAPPGAASICRPTRAVPSSRGDYMTMQMSCPRQSYVDTSPAAPVSYADMRTGIA
AEEVSLPRATMAAASSSSAASASPTGPQGAAELAAHSSLLGGPQGPGGMSAFTRVNLSPN
RNQSAKVIRADPQGCRRRHSSETFSSTPSATRVGNTVPFGAGAAVGGGGGSSSSSEDVKR
HSSASFENVWLRPGELGGAPKEPAKLCGAAGGLENGLNYIDLDLVKDFKQCPQECTPEPQ
PPPPPPPHQPLGSGESSSTRRSSEDLSAYASISFQKQPEDRQ
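
The two header lines in this output can be cross-checked against each other. The sketch below parses the coordinates and lengths from the headers and confirms that the exon span, the mRNA length and the protein length agree (one codon per amino acid plus a final stop codon). The regular expression and the helper name `parse_header` are assumptions based only on the header text shown above, not on any documented FGENESH format.

```python
import re

mrna_header = ">FGENESH:[mRNA] 1 1 exon (s) 1083 - 4811 3729 bp, chain +"
protein_header = ">FGENESH: 1 1 exon (s) 1083 - 4811 1242 aa, chain +"


def parse_header(header):
    """Extract (start, end, length, unit) from a header like the ones above."""
    m = re.search(r"(\d+)\s*-\s*(\d+)\s+(\d+)\s+(bp|aa)", header)
    if m is None:
        raise ValueError("header does not match the expected pattern")
    start, end, length = (int(g) for g in m.groups()[:3])
    return start, end, length, m.group(4)


start, end, mrna_len, _ = parse_header(mrna_header)
_, _, protein_len, _ = parse_header(protein_header)

span = end - start + 1   # length of the single predicted exon
codons = mrna_len // 3   # codons in the coding sequence, stop included

print(span, mrna_len, codons, protein_len)
```

Here the exon span equals the reported mRNA length, and the number of codons is one more than the number of amino acids, which is what a single-exon coding sequence ending in a stop codon (TAG in this case) should give.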

Lab 5: Explore ExPASy Tool


1. Go to the NCBI database (https://www.ncbi.nlm.nih.gov).
2. Select Gene and search for “Human Insulin”.
3. Go to “FASTA” to get its genetic sequence.
4. Repeat the same steps for any other insulin gene (such as cat insulin).
5. Go to the ExPASy Translate tool (https://web.expasy.org/translate/) and paste
each sequence into the input field.
6. After processing, you will get the translation in each reading frame.
7. Select the frame with the longest open reading frame.
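
What a translation tool does can be sketched in a few lines: read the DNA three bases at a time and look each codon up in the standard genetic code. The version below is a simplified stand-in that covers only the three forward frames (the web tool also reports the reverse strand); the function names are made up for this notebook. The test fragment is the first 30 bases of the FGENESH mRNA from Lab 4.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq, frame=0):
    """Translate DNA (or RNA) starting at offset `frame`; '*' marks a stop."""
    seq = seq.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)  # X = unknown codon


def longest_orf(protein):
    """Longest stretch that starts at M and runs to the next stop (or the end)."""
    best = ""
    for segment in protein.split("*"):
        idx = segment.find("M")
        if idx != -1 and len(segment) - idx > len(best):
            best = segment[idx:]
    return best


def best_forward_frame(seq):
    """Return (frame_number, translation) for the frame with the longest ORF."""
    frames = {f + 1: translate(seq, f) for f in range(3)}
    return max(frames.items(), key=lambda kv: len(longest_orf(kv[1])))


fragment = "ATGGCGAGCCCTCCGGAGAGCGATGGCTTC"  # start of the Lab 4 mRNA
print(translate(fragment))
```

Calling `best_forward_frame` on a full sequence mirrors step 7 above: the frame whose translation contains the longest uninterrupted run from a methionine to a stop is usually the one of interest.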

Lab 6: Dot Plot Genetic Sequences

Procedure
1. Go to the NCBI database (https://www.ncbi.nlm.nih.gov).
2. Select Gene and search for “Human Insulin”.
3. Go to “FASTA” to get its genetic sequence.
4. Repeat the same steps for any other insulin gene (such as cat insulin).
5. Go to EMBOSS: dotmatcher (https://www.bioinformatics.nl/cgi-bin/emboss/dotmatcher)
and paste both sequences into their respective fields.
6. After processing, you get a dotmatcher graph showing the results.
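
The principle behind a dot plot can be sketched without the web tool: slide a window along both sequences and mark a dot wherever enough positions in the two windows match. The sketch below counts plain identities, whereas dotmatcher scores windows with a substitution matrix, so this is a simplified stand-in; the function names, window size and threshold are illustrative assumptions.

```python
def dot_matrix(seq_a, seq_b, window=3, threshold=3):
    """Boolean matrix: cell [i][j] is True when the windows starting at
    seq_a[i] and seq_b[j] share at least `threshold` identical positions."""
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    matrix = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(
                a == b for a, b in zip(seq_a[i:i + window], seq_b[j:j + window])
            )
            row.append(matches >= threshold)
        matrix.append(row)
    return matrix


def render(matrix):
    """Draw the matrix as text, one '*' per dot."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in matrix)


# A sequence compared with itself gives an unbroken main diagonal.
m = dot_matrix("ACGTAC", "ACGTAC")
print(render(m))
```

Similar regions between two different sequences show up as diagonal runs away from or along the main diagonal, and breaks or shifts in those runs hint at insertions and deletions.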

Output

Lab 7: Blasting a Protein Sequence with PSI - BLAST


1. Go to the NCBI database (https://www.ncbi.nlm.nih.gov).

2. Select Protein and search for “Human Retina”.

3. Go to “FASTA” to get its protein sequence.

4. Repeat the same steps for any other protein.

5. Go to NCBI BLAST and paste the protein sequence into the query field.

6. In the program selection section, select the ‘PSI - BLAST’ algorithm and click the
BLAST button.

7. After processing, you will get the results in the form of a graph.

Graphic

Multiple Alignment

MSA Viewer

Graphic Summary

Lab 8: Multiple Sequence Alignment

Muscle
1. Go to the NCBI database (https://www.ncbi.nlm.nih.gov).

2. Select Gene and search for “Human Insulin”.

3. Go to “FASTA” to get its protein sequence.

4. Repeat the same steps for any other protein (such as cat insulin).

5. Go to the Muscle MSA tool and paste the protein sequences into the input field.

6. Run the sequences.

7. After the process, you will get the results.



T-Coffee
1. Go to the NCBI database (https://www.ncbi.nlm.nih.gov).

2. Select Gene and search for “Human Insulin”.

3. Go to “FASTA” to get its protein sequence.

4. Repeat the same steps for any other protein (such as cat insulin).

5. Go to the T-Coffee tool and paste the protein sequences into the input field.

6. Run the sequences.

7. After the process, you will get the results.



ClustalW
1. Go to the NCBI database (https://www.ncbi.nlm.nih.gov).

2. Select Gene and search for “Human Insulin”.

3. Go to “FASTA” to get its protein sequence.

4. Repeat the same steps for any other protein (such as cat insulin).

5. Go to the ClustalW MSA tool and paste the protein sequences into the input field.

6. Run the sequences.

7. After the process, you will get the results.
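
Multiple alignment tools typically build on pairwise alignment ideas, so a small pairwise example helps in reading their output. The sketch below is a textbook Needleman-Wunsch global alignment with a simple match/mismatch/gap scheme. It is not the algorithm used internally by Muscle, T-Coffee or ClustalW, and the scoring values are arbitrary assumptions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of two sequences.

    Returns (score, aligned_a, aligned_b). Scoring values are illustrative.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


result = needleman_wunsch("GATTACA", "GATACA")
print(result[0])
print(result[1])
print(result[2])
```

The gap characters in the output play the same role as the dashes in the alignments returned by the three web tools: they mark positions where one sequence has a residue and the other does not.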



Lab 9: Explore Molecular Evolutionary Genetics Analysis (MEGA) Software
1. Open the MEGA software.

2. Select the ‘Align’ option from the toolbar and select the ‘create new alignment’
option.

3. In the ‘Data Type Format’ window, select the ‘DNA’ option.

4. Then, under the ‘Edit’ option, select the ‘Input Sequence From File’ option. You can
get this sequence file from NCBI.

5. You will need to get more than one sequence in order to align them.

6. Export the alignment in MEGA format from file options.

7. After that, go to the ‘Phylogeny’ option on the toolbar and open the file saved above
in MEGA format.

8. After that, the phylogenetic tree is generated as a result.
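
Distance-based tree building starts from a table of pairwise distances between the aligned sequences. The sketch below computes the simplest such measure, the p-distance (the proportion of sites that differ), for a toy alignment. The sequences and names are invented for illustration, and this is only the first step of what a phylogeny program does, not a reproduction of MEGA's procedure.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites, ignoring columns with a gap in either sequence."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_table(alignment):
    """Map every pair of sequence names to its p-distance."""
    return {
        (n1, n2): p_distance(alignment[n1], alignment[n2])
        for n1, n2 in combinations(alignment, 2)
    }


# Hypothetical toy alignment, not real insulin data.
alignment = {
    "seqA": "ACGTACGTAC",
    "seqB": "ACGTACGTAT",
    "seqC": "ACGAACGGAT",
}

table = distance_table(alignment)
closest = min(table, key=table.get)
print(table)
print(closest)
```

A clustering method such as UPGMA or neighbor joining would join the closest pair first and then repeat on the reduced table, which is how the branching order of the tree emerges.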



Lab 10: RNA Protein Analysis


1. Go to the NCBI database (https://www.ncbi.nlm.nih.gov).

2. Select Nucleotide and search for an RNA sequence (example: C.glauca symC mRNA for
haemoglobin).

3. Go to “FASTA” to get its nucleotide sequence.

4. Go to an RNA analysis server (http://rna.tbi.univie.ac.at or
https://rna.urmc.rochester.edu).

5. Go to the RNAfold web server.

6. Paste the FASTA sequence and process it.

7. You will get the following readings:

a. Minimum free energy prediction

b. Thermodynamic Ensemble Prediction

c. Graphical Output
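
To give a feel for what a structure prediction returns, the sketch below implements the Nussinov algorithm, which finds a nested structure with the maximum number of base pairs and prints it in dot-bracket notation. This is a deliberately simplified stand-in: RNAfold minimizes free energy under a thermodynamic model rather than counting pairs, so its results will generally differ. The function name, the allowed pairs and the minimum loop size are assumptions for illustration.

```python
def nussinov(seq, min_loop=3):
    """Return (pair_count, dot_bracket) maximizing nested base pairs.

    `min_loop` is the minimum number of unpaired bases inside a hairpin.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    seq = seq.upper().replace("T", "U")
    n = len(seq)
    dp = [[0] * n for _ in range(n)]

    # Fill the table for increasingly long subsequences seq[i..j].
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case: position j is unpaired
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in pairs:  # case: k pairs with j
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best

    # Trace back one optimal structure.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue
        if dp[i][j] == dp[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in pairs:
                left = dp[i][k - 1] if k > i else 0
                if dp[i][j] == left + 1 + dp[k + 1][j - 1]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return (dp[0][n - 1] if n else 0), "".join(structure)


count, fold = nussinov("GGGAAAUCC")
print(count, fold)
```

Matching parentheses in the dot-bracket string mark paired bases and dots mark unpaired ones, which is the same notation the RNAfold server prints alongside its energy values.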

Output
Minimum Free Energy Prediction

Thermodynamic Ensemble Prediction



Graphical Output
