Ass 2 Bioinformatics
Task: Write a Python script to find the total number of open reading frames (ORFs) in
the complete genome sequence of Hepatitis C virus and predict the gene on the basis of
the ORFs. Also find the translated protein sequence of the predicted ORF gene (its CDS).
Solution:
We use a Jupyter (IPython) notebook for this task. First we import the Biopython libraries needed to
compute the translated protein sequence once the ORF gene has been found. We open the genome file and
declare the ORF_Finder function, which takes two arguments: the DNA sequence and the reading frame.
Inside the function we define the start and stop codons, store the start and stop positions in lists,
and set where scanning starts and stops. Because the genome has many base pairs, we use a for loop, with
the slice i:i+3 cutting the sequence into chunks of 3 nucleotides (codons). Each ORF found is stored in a
dictionary (key, value) under if/else conditions, and a counter tracks how many ORFs have been added so
that each one gets its own number (ORF1, ORF2, and so on). A second dictionary holds the genome sequence:
its for loop treats the first line of the FASTA file (the header) as the name and reads only the
remaining lines as sequence. After that we define the frame, because the script can read any of the 3
reading frames. Finally we go through the dictionary with the .items() function and print each ORF
number and its data. The longest ORF we find is 9084 bp long, so we predict that it is a gene. When ORFs
were first discovered, scientists directly called them genes, but later discoveries in genomics showed
that not every ORF is a gene. After finding the ORF gene, we compute the translated protein sequence
with the .translate() function from Biopython.
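The scanning step described above can be sketched as follows. This is a minimal illustration, not the assignment's exact function: the name find_orfs, the use of ATG as the only start codon, and the rule that each stop codon closes the ORF opened by the first ATG seen since the previous stop are all assumptions. Coordinates are 1-based and inclusive, and the length counts the stop codon, matching the stop - start + 1 formula used in the script.

```python
# Assumed codon sets for this sketch (forward strand only).
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(dna, frame):
    """Return {name: (start, stop, length, frame)} for one reading frame.

    frame is 1, 2 or 3; start and stop are 1-based inclusive positions.
    """
    dna = dna.upper()
    orfs = {}
    start = None  # 0-based index of the first ATG since the last stop codon
    # Step through the chosen frame one codon (3 nucleotides) at a time.
    for i in range(frame - 1, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if codon == START_CODON and start is None:
            start = i
        elif codon in STOP_CODONS and start is not None:
            name = "ORF" + str(len(orfs) + 1)
            orfs[name] = (start + 1, i + 3, i + 3 - start, frame)
            start = None
    return orfs


demo = "CCATGAAATTTTAGGGATGCCCTGA"  # made-up sequence, not the HCV genome
for frame in (1, 2, 3):
    print(frame, find_orfs(demo, frame))
```

On the made-up demo sequence, frame 3 contains one ORF (ATG AAA TTT TAG, positions 3 to 14) and frame 2 contains another (ATG CCC TGA, positions 17 to 25), while frame 1 has none.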
Script:
# Importing Biopython Libraries
from Bio.Seq import Seq
    ORFfound = {}
    if Num_Stops >= 1 and Num_Starts >= 1:  # First statement: the numbers of stop codons and start codons are both greater than or equal to 1
        ORFs = True
        Stop_Before = 0
        Start_Before = 0
        Position_Stop_Previous = 0
        Position_Start_Previous = 0
        Counter = 0
            Position_Start = Position_Start.rstrip()
                Counter += 1
                Name_Orf = "ORF" + str(Counter)
                Position_Stop_Previous += int(Position_Stop) - int(Position_Stop_Previous)
                Position_Start_Previous += int(Position_Start) - int(Position_Start_Previous)
                Size_Orf = int(Position_Stop) - int(Position_Start) + 1
                ORFfound[Name_Orf] = Position_Start, Position_Stop, Size_Orf, Frame
            else:
                pass
    else:
        ORFs = False
    return ORFfound
#FUNCTION END
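SEQs = {}
for Line in Genome_Seq:
    Line = Line.rstrip()
    if Line.startswith('>'):
        # Header line: use the first word after '>' as the sequence name
        Dna_Words = Line.split()
        Name = Dna_Words[0][1:]
        SEQs[Name] = ''
    else:
        # Sequence line: append it to the current record
        SEQs[Name] = SEQs[Name] + Line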
# DEFINE FRAME TO FIND ORF
Frame = int(input("Enter Frame Number: "))
for Header, seq in SEQs.items():
    Orf = ORF_Finder(seq, Frame)
    print("\nThe Total ORF Found in Frame", Frame, "is : ", len(Orf), '\n\n')
    # Longest ORF: bases 340 to 9423 (1-based), i.e. the slice 339:9423
    Orf_gene = seq[339:9423]
    # Wrap the ORF in a Biopython Seq object so that it can be translated
    Orf_Protein = Seq(Orf_gene)
    for Num_Orf, (Start_Orf, Stop_Orf, Length_Orf, Frame_Orf) in Orf.items():
        print(Num_Orf, "starts from", Start_Orf, "bp and stop at", Stop_Orf, "bp , length:", Length_Orf, ", Frame:", Frame_Orf, '\n')
        if Length_Orf >= 150:
            print(Num_Orf, "in Frame", Frame_Orf, "is legimate ORF as it contains more than 150 Base Pairs in b/w start and stop codon \n")
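The script wraps the predicted gene in Seq(Orf_gene), and the Solution text says the protein is then obtained with .translate(). To show what that translation step does without requiring Biopython, here is a pure-Python sketch using the standard genetic code. The helper name translate_orf and the choice to stop at the first stop codon are assumptions made for illustration, not part of the original script.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_orf(dna):
    """Translate a DNA string codon by codon, stopping at the first stop codon."""
    dna = dna.upper().replace("U", "T")  # accept RNA-style input as well
    protein = []
    for i in range(0, len(dna) - 2, 3):
        # Codons containing ambiguous bases (e.g. N) are reported as X.
        residue = CODON_TABLE.get(dna[i:i + 3], "X")
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


print(translate_orf("ATGAAATTTTAG"))  # MKF
```

With Biopython installed, Seq("ATGAAATTTTAG").translate(to_stop=True) gives the same MKF result on this toy input.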
Output:
Enter Frame Number: 1
The Total ORF Found in Frame 1 is : 2
ORF2 starts from 340 bp and stop at 9423 bp , length: 9084 , Frame: 1
ORF2 in Frame 1 is legimate ORF as it contains more than 150 Base Pairs in b/w start and stop codon
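The coordinates in this output can be cross-checked against the slice seq[339:9423] used in the script. The short sketch below only redoes the arithmetic; the variable names are illustrative, and it assumes that the reported positions are 1-based and inclusive, as the stop - start + 1 length formula in the script suggests.

```python
start_1based, stop_1based = 340, 9423

# Inclusive length, as computed by Size_Orf in the script.
length = stop_1based - start_1based + 1

# Python slices are 0-based and exclude the end index.
slice_start, slice_end = start_1based - 1, stop_1based

# An ORF should contain a whole number of codons.
codons, remainder = divmod(length, 3)

print(length, (slice_start, slice_end), codons, remainder)
# 9084 (339, 9423) 3028 0
```

The length is a multiple of 3, so the span divides cleanly into 3028 codons. If the span includes the stop codon, that corresponds to 3027 amino acids.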
Note:
This script only works for FASTA sequence files, which contain a header in the
first line. If we open and read a .txt file with no header in the first line, the
script raises an error. To read such files we have to change the for loop that
fills the 2nd dictionary (the one that holds the DNA sequence); the rest of the
script remains the same.
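One way to make that change is sketched below: a reader that accepts both FASTA input and plain sequence text with no header. The function name read_sequences and the fallback name "sequence" are hypothetical choices for this example and do not appear in the original script.

```python
def read_sequences(lines, default_name="sequence"):
    """Collect sequences from FASTA lines or from headerless plain text."""
    seqs = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # FASTA header: the first word after '>' becomes the name
            words = line[1:].split()
            name = words[0] if words else default_name
            seqs[name] = ""
        else:
            if name is None:
                # No header seen yet: store the sequence under a default name
                name = default_name
                seqs[name] = ""
            seqs[name] += line
    return seqs


print(read_sequences([">HCV_demo example header", "ATGAAA", "TTTTAG"]))
print(read_sequences(["ATGAAA", "", "TTTTAG"]))
```

The function takes any iterable of lines, so an open file handle can be passed directly in place of the lists used here.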