Biopython Tutorial and Cookbook

Jeff Chang, Brad Chapman, Iddo Friedberg, Thomas Hamelryck,
Michiel de Hoon, Peter Cock, Tiago Antao, Eric Talevich, Bartek Wilczyński
Last Update: 6 April 2017 (Biopython 1.69)
Contents
Chapter 1 Introduction
1.1 What is Biopython?
1.2 What can I find in the Biopython package
1.3 Installing Biopython
1.4 Frequently Asked Questions (FAQ)
Chapter 2 Quick Start - What can you do with Biopython?
2.1 General overview of what Biopython provides
2.2 Working with sequences
2.3 A usage example
2.4 Parsing sequence file formats
2.4.1 Simple FASTA parsing example
2.4.2 Simple GenBank parsing example
2.4.3 I love parsing - please don't stop talking about it!
2.5 Connecting with biological databases
2.6 What to do next
Chapter 3 Sequence objects
3.1 Sequences and Alphabets
3.2 Sequences act like strings
3.3 Slicing a sequence
3.4 Turning Seq objects into strings
3.5 Concatenating or adding sequences
3.6 Changing case
3.7 Nucleotide sequences and (reverse) complements
3.8 Transcription
3.9 Translation
3.10 Translation Tables
3.11 Comparing Seq objects
3.12 MutableSeq objects
3.13 UnknownSeq objects
3.14 Working with strings directly
Chapter 4 Sequence annotation objects
4.1 The SeqRecord object
4.2 Creating a SeqRecord
4.2.1 SeqRecord objects from scratch
4.2.2 SeqRecord objects from FASTA files
4.2.3 SeqRecord objects from GenBank files
4.3 Feature, location and position objects
4.3.1 SeqFeature objects
4.3.2 Positions and locations
4.3.3 Sequence described by a feature or location
4.4 Comparison
4.5 References
4.6 The format method
4.7 Slicing a SeqRecord
4.8 Adding SeqRecord objects
4.9 Reverse-complementing SeqRecord objects
Chapter 5 Sequence Input/Output
5.1 Parsing or Reading Sequences
5.1.1 Reading Sequence Files
5.1.2 Iterating over the records in a sequence file
5.1.3 Getting a list of the records in a sequence file
5.1.4 Extracting data
5.2 Parsing sequences from compressed files
5.3 Parsing sequences from the net
5.3.1 Parsing GenBank records from the net
5.3.2 Parsing SwissProt sequences from the net
5.4 Sequence files as Dictionaries
5.4.1 Sequence files as Dictionaries - In memory
5.4.2 Sequence files as Dictionaries - Indexed files
5.4.3 Sequence files as Dictionaries - Database indexed files
5.4.4 Indexing compressed files
5.4.5 Discussion
5.5 Writing Sequence Files
5.5.1 Round trips
5.5.2 Converting between sequence file formats
5.5.3 Converting a file of sequences to their reverse complements
5.5.4 Getting your SeqRecord objects as formatted strings
5.6 Low level FASTA and FASTQ parsers
Chapter 6 Multiple Sequence Alignment objects
6.1 Parsing or Reading Sequence Alignments
6.1.1 Single Alignments
6.1.2 Multiple Alignments
6.1.3 Ambiguous Alignments
6.2 Writing Alignments
6.2.1 Converting between sequence alignment file formats
6.2.2 Getting your alignment objects as formatted strings
6.3 Manipulating Alignments
6.3.1 Slicing alignments
6.3.2 Alignments as arrays
6.4 Alignment Tools
6.4.1 ClustalW
6.4.2 MUSCLE
6.4.3 MUSCLE using stdout
6.4.4 MUSCLE using stdin and stdout
6.4.5 EMBOSS needle and water
6.4.6 Biopython's pairwise2
Chapter 7 BLAST
7.1 Running BLAST over the Internet
7.2 Running BLAST locally
7.2.1 Introduction
7.2.2 Standalone NCBI BLAST+
7.2.3 Other versions of BLAST
7.3 Parsing BLAST output
7.4 The BLAST record class
7.5 Deprecated BLAST parsers
7.5.1 Parsing plain-text BLAST output
7.5.2 Parsing a plain-text BLAST file full of BLAST runs
7.5.3 Finding a bad record somewhere in a huge plain-text BLAST file
7.6 Dealing with PSI-BLAST
7.7 Dealing with RPS-BLAST
Chapter 8 BLAST and other sequence search tools (experimental code)
8.1 The SearchIO object model
8.1.1 QueryResult
8.1.2 Hit
8.1.3 HSP
8.1.4 HSPFragment
8.2 A note about standards and conventions
8.3 Reading search output files
8.4 Dealing with large search output files with indexing
8.5 Writing and converting search output files
Chapter 9 Accessing NCBI's Entrez databases
9.1 Entrez Guidelines
9.2 EInfo: Obtaining information about the Entrez databases
9.3 ESearch: Searching the Entrez databases
9.4 EPost: Uploading a list of identifiers
9.5 ESummary: Retrieving summaries from primary IDs
9.6 EFetch: Downloading full records from Entrez
9.7 ELink: Searching for related items in NCBI Entrez
9.8 EGQuery: Global Query - counts for search terms
9.9 ESpell: Obtaining spelling suggestions
9.10 Parsing huge Entrez XML files
9.11 Handling errors
9.12 Specialized parsers
9.12.1 Parsing Medline records
9.12.2 Parsing GEO records
9.12.3 Parsing UniGene records
9.13 Using a proxy
9.14 Examples
9.14.1 PubMed and Medline
9.14.2 Searching, downloading, and parsing Entrez Nucleotide records
9.14.3 Searching, downloading, and parsing GenBank records
9.14.4 Finding the lineage of an organism
9.15 Using the history and WebEnv
9.15.1 Searching for and downloading sequences using the history
9.15.2 Searching for and downloading abstracts using the history
9.15.3 Searching for citations
Chapter 10 Swiss-Prot and ExPASy
10.1 Parsing Swiss-Prot files
10.1.1 Parsing Swiss-Prot records
10.1.2 Parsing the Swiss-Prot keyword and category list
10.2 Parsing Prosite records
10.3 Parsing Prosite documentation records
10.4 Parsing Enzyme records
10.5 Accessing the ExPASy server
10.5.1 Retrieving a Swiss-Prot record
10.5.2 Searching Swiss-Prot
10.5.3 Retrieving Prosite and Prosite documentation records
10.6 Scanning the Prosite database
Chapter 11 Going 3D: The PDB module
11.1 Reading and writing crystal structure files
11.1.1 Reading a PDB file
11.1.2 Reading an mmCIF file
11.1.3 Reading files in the MMTF format
11.1.4 Reading files in the PDB XML format
11.1.5 Writing PDB files
11.2 Structure representation
11.2.1 Structure
11.2.2 Model
11.2.3 Chain
11.2.4 Residue
11.2.5 Atom
11.2.6 Extracting a specific Atom/Residue/Chain/Model from a Structure
11.3 Disorder
11.3.1 General approach
11.3.2 Disordered atoms
11.3.3 Disordered residues
11.4 Hetero residues
11.4.1 Associated problems
11.4.2 Water residues
11.4.3 Other hetero residues
11.5 Navigating through a Structure object
11.6 Analyzing structures
11.6.1 Measuring distances
11.6.2 Measuring angles
11.6.3 Measuring torsion angles
11.6.4 Determining atom-atom contacts
11.6.5 Superimposing two structures
11.6.6 Mapping the residues of two related structures onto each other
11.6.7 Calculating the Half Sphere Exposure
11.6.8 Determining the secondary structure
11.6.9 Calculating the residue depth
11.7 Common problems in PDB files
11.7.1 Examples
11.7.2 Automatic correction
11.7.3 Fatal errors
11.8 Accessing the Protein Data Bank
11.8.1 Downloading structures from the Protein Data Bank
11.8.2 Downloading the entire PDB
11.8.3 Keeping a local copy of the PDB up to date
11.9 General questions
11.9.1 How well tested is Bio.PDB?
11.9.2 How fast is it?
11.9.3 Is there support for molecular graphics?
11.9.4 Who's using Bio.PDB?
Chapter 12 Bio.PopGen: Population genetics
12.1 GenePop
Chapter 13 Phylogenetics with Bio.Phylo
13.1 Demo: What's in a Tree?
13.1.1 Coloring branches within a tree
13.2 I/O functions
13.3 View and export trees
13.4 Using Tree and Clade objects
13.4.1 Search and traversal methods
13.4.2 Information methods
13.4.3 Modification methods
13.4.4 Features of PhyloXML trees
13.5 Running external applications
13.6 PAML integration
13.7 Future plans
Chapter 14 Sequence motif analysis using Bio.motifs
14.1 Motif objects
14.1.1 Creating a motif from instances
14.1.2 Creating a sequence logo
14.2 Reading motifs
14.2.1 JASPAR
14.2.2 MEME
14.2.3 TRANSFAC
14.3 Writing motifs
14.4 Position-Weight Matrices
14.5 Position-Specific Scoring Matrices
14.6 Searching for instances
14.6.1 Searching for exact matches
14.6.2 Searching for matches using the PSSM score
14.6.3 Selecting a score threshold
14.7 Each motif object has an associated Position-Specific Scoring Matrix
14.8 Comparing motifs
14.9 De novo motif finding
14.9.1 MEME
14.10 Useful links
Chapter 15 Cluster analysis
15.1 Distance functions
15.2 Calculating cluster properties
15.3 Partitioning algorithms
15.4 Hierarchical clustering
15.5 Self-Organizing Maps
15.6 Principal Component Analysis
15.7 Handling Cluster/TreeView-type files
15.8 Example calculation
15.9 Auxiliary functions
Chapter 16 Supervised learning methods
16.1 The Logistic Regression Model
16.1.1 Background and Purpose
16.1.2 Training the logistic regression model
16.1.3 Using the logistic regression model for classification
16.1.4 Logistic Regression, Linear Discriminant Analysis, and Support Vector Machines
16.2 k-Nearest Neighbors
16.2.1 Background and purpose
16.2.2 Initializing a k-nearest neighbors model
16.2.3 Using a k-nearest neighbors model for classification
16.3 Naïve Bayes
16.4 Maximum Entropy
16.5 Markov Models
Chapter 17 Graphics including GenomeDiagram
17.1 GenomeDiagram
17.1.1 Introduction
17.1.2 Diagrams, tracks, feature-sets and features
17.1.3 A top down example
17.1.4 A bottom up example
17.1.5 Features without a SeqFeature
17.1.6 Feature captions
17.1.7 Feature sigils
17.1.8 Arrow sigils
17.1.9 A nice example
17.1.10 Multiple tracks
17.1.11 Cross-Links between tracks
17.1.12 Further options
17.1.13 Converting old code
17.2 Chromosomes
17.2.1 Simple Chromosomes
17.2.2 Annotated Chromosomes
Chapter 18 KEGG
18.1 Parsing KEGG records
18.2 Querying the KEGG API
Chapter 19 Bio.phenotype: analyse phenotypic data
19.1 Phenotype Microarrays
19.1.1 Parsing Phenotype Microarray data
19.1.2 Manipulating Phenotype Microarray data
19.1.3 Writing Phenotype Microarray data
Chapter 20 Cookbook - Cool things to do with it
20.1 Working with sequence files
20.1.1 Filtering a sequence file
20.1.2 Producing randomised genomes
20.1.3 Translating a FASTA file of CDS entries
20.1.4 Making the sequences in a FASTA file upper case
20.1.5 Sorting a sequence file
20.1.6 Simple quality filtering for FASTQ files
20.1.7 Trimming off primer sequences
20.1.8 Trimming off adaptor sequences
20.1.9 Converting FASTQ files
20.1.10 Converting FASTA and QUAL files into FASTQ files
20.1.11 Indexing a FASTQ file
20.1.12 Converting SFF files
20.1.13 Identifying open reading frames
20.2 Sequence parsing plus simple plots
20.2.1 Histogram of sequence lengths
20.2.2 Plot of sequence GC%
20.2.3 Nucleotide dot plots
20.2.4 Plotting the quality scores of sequencing read data
20.3 Dealing with alignments
20.3.1 Calculating summary information
20.3.2 Calculating a quick consensus sequence
20.3.3 Position Specific Score Matrices
20.3.4 Information Content
20.4 Substitution Matrices
20.4.1 Using common substitution matrices
20.4.2 Creating your own substitution matrix from an alignment
20.5 BioSQL - storing sequences in a relational database
Chapter 21 The Biopython testing framework
21.1 Running the tests
21.1.1 Running the tests using Tox
21.2 Writing tests
21.2.1 Writing a print-and-compare test
21.2.2 Writing a unittest-based test
21.3 Writing doctests
21.4 Writing doctests in the Tutorial
Chapter 22 Advanced
22.1 Parser Design
22.2.1 SubsMat
22.2.2 FreqTable
Chapter 23 Where to go from here - contributing to Biopython
23.1 Bug Reports + Feature Requests
23.2 Mailing lists and helping newcomers
23.3 Contributing Documentation
23.4 Contributing cookbook examples
23.5 Maintaining a distribution for a platform
23.6 Contributing Unit Tests
23.7 Contributing Code
Chapter 24 Appendix: Useful stuff about Python
24.1 What the heck is a handle?
24.1.1 Creating a handle from a string
Chapter 1 Introduction
1.1 What is Biopython?
The Biopython Project is an international association of developers of freely available Python (http://www.python.org) tools for computational molecular biology. Python
is an object-oriented, interpreted, flexible language that is becoming increasingly popular for scientific computing. Python is easy to learn, has a very clear syntax and can
easily be extended with modules written in C, C++ or FORTRAN.
The Biopython web site (http://www.biopython.org) provides an online resource for modules, scripts, and web links for developers of Python-based software for
bioinformatics use and research. Basically, the goal of Biopython is to make it as easy as possible to use Python for bioinformatics by creating high-quality, reusable
modules and classes. Biopython features include parsers for various Bioinformatics file formats (BLAST, Clustalw, FASTA, Genbank, ...), access to online services
(NCBI, Expasy, ...), interfaces to common and not-so-common programs (Clustalw, DSSP, MSMS, ...), a standard sequence class, various clustering modules, a KD tree
data structure etc. and even documentation.
Basically, we just like to program in Python and want to make it as easy as possible to use Python for bioinformatics by creating high-quality, reusable modules and
scripts.
1.2 What can I find in the Biopython package
The main Biopython releases have lots of functionality, including:
- The ability to parse bioinformatics files into Python utilizable data structures, including support for the following formats:
  - Blast output, both from standalone and WWW Blast
  - Clustalw
  - FASTA
  - GenBank
  - PubMed and Medline
  - ExPASy files, like Enzyme and Prosite
  - SCOP, including "dom" and "lin" files
  - UniGene
  - SwissProt
- Files in the supported formats can be iterated over record by record or indexed and accessed via a Dictionary interface.
- Code to deal with popular online bioinformatics destinations such as:
  - NCBI: Blast, Entrez and PubMed services
  - ExPASy: Swiss-Prot and Prosite entries, as well as Prosite searches
- Interfaces to common bioinformatics programs such as:
  - Standalone Blast from NCBI
  - Clustalw alignment program
  - EMBOSS command line tools
- A standard sequence class that deals with sequences, ids on sequences, and sequence features.
- Tools for performing common operations on sequences, such as translation, transcription and weight calculations.
- Code to perform classification of data using k Nearest Neighbors, Naive Bayes or Support Vector Machines.
- Code for dealing with alignments, including a standard way to create and deal with substitution matrices.
- Code making it easy to split up parallelizable tasks into separate processes.
- GUI-based programs to do basic sequence manipulations, translations, BLASTing, etc.
- Extensive documentation and help with using the modules, including this file, online wiki documentation, the web site, and the mailing list.
- Integration with BioSQL, a sequence database schema also supported by the BioPerl and BioJava projects.
We hope this gives you plenty of reasons to download and start using Biopython!
1.3 Installing Biopython
All of the installation information for Biopython was separated from this document to make it easier to keep updated.
The short version is: go to our downloads page (http://biopython.org/wiki/Download), download and install the listed dependencies, then download and install
Biopython. Biopython runs on many platforms (Windows, Mac, and on the various flavors of Linux and Unix). For Windows we provide pre-compiled click-and-run
installers, while for Unix and other operating systems you must install from source as described in the included README file. This is usually as simple as the standard
commands:
python setup.py build
python setup.py test
sudo python setup.py install
(You can in fact skip the build and test, and go straight to the install, but it's better to make sure everything seems to be working.)
The longer version of our installation instructions covers installation of Python, Biopython dependencies and Biopython itself. It is available in PDF
(http://biopython.org/DIST/docs/install/Installation.pdf) and HTML formats (http://biopython.org/DIST/docs/install/Installation.html).
1.4 Frequently Asked Questions (FAQ)
1. How do I cite Biopython in a scientific publication?
    Please cite our application note [1, Cock et al., 2009] as the main Biopython reference. In addition, please cite any publications from the following list if
    appropriate, in particular as a reference for specific modules within Biopython (more information can be found on our website):
    - For the official project announcement: [13, Chapman and Chang, 2000]
    - For Bio.PDB: [18, Hamelryck and Manderick, 2003]
    - For Bio.Cluster: [14, De Hoon et al., 2004]
    - For Bio.Graphics.GenomeDiagram: [2, Pritchard et al., 2006]
    - For Bio.Phylo and Bio.Phylo.PAML: [9, Talevich et al., 2012]
    - For the FASTQ file format as supported in Biopython, BioPerl, BioRuby, BioJava, and EMBOSS: [7, Cock et al., 2010]
2. How should I capitalize "Biopython"? Is "BioPython" OK?
    The correct capitalization is "Biopython", not "BioPython" (even though that would have matched BioPerl, BioJava and BioRuby).
3. What is going wrong with my print commands?
    This tutorial now uses the Python 3 style print function. As of Biopython 1.62, we support both Python 2 and Python 3. The most obvious language difference is that the
    print statement in Python 2 became a print function in Python 3.
    For example, this will only work under Python 2:
    >>> print "Hello World!"
    Hello World!
    If you try that on Python 3 you'll get a SyntaxError. Under Python 3 you must write:
    >>> print("Hello World!")
    Hello World!
    Surprisingly that will also work on Python 2, but only for simple examples printing one thing. In general you need to add this magic line to the start of your
    Python scripts to use the print function under Python 2.6 and 2.7:
    from __future__ import print_function
    If you forget to add this magic import, under Python 2 you'll see extra brackets produced by trying to use the print function when Python 2 is interpreting it as a
    print statement and a tuple.
4. How do I find out what version of Biopython I have installed?
    Use this:
    >>> import Bio
    >>> print(Bio.__version__)
    ...
    If the "import Bio" line fails, Biopython is not installed. Note that those are double underscores before and after version. If the second line fails, your version is
    very out of date.
    If the version string ends with a plus like "1.66+", you don't have an official release, but an old snapshot of the in-development code after that version was released.
    This naming was used until June 2016 in the run-up to Biopython 1.68.
    If the version string ends with ".dev<number>" like "1.68.dev0", again you don't have an official release, but instead a snapshot of the in-development code before
    that version was released.
5. Where is the latest version of this document?
    If you download a Biopython source code archive, it will include the relevant version in both HTML and PDF formats. The latest published version of this
    document (updated at each release) is online:
    http://biopython.org/DIST/docs/tutorial/Tutorial.html
    http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
6. What is wrong with my sequence comparisons?
    There was a major change in Biopython 1.65 making the Seq and MutableSeq classes (and subclasses) use simple string-based comparison (ignoring the alphabet,
    other than possibly giving a warning), which you can do explicitly with str(seq1) == str(seq2).
    Older versions of Biopython would use instance-based comparison for Seq objects, which you can do explicitly with id(seq1) == id(seq2).
    If you still need to support old versions of Biopython, use these explicit forms to avoid problems. See Section 3.11.
7. Why is the Seq object missing the upper & lower methods described in this Tutorial?
    You need Biopython 1.53 or later. Alternatively, use str(my_seq).upper() to get an upper case string. If you need a Seq object, try Seq(str(my_seq).upper()) but
    be careful about blindly re-using the same alphabet.
8. Why doesn't the Seq object translation method support the cds option described in this Tutorial?
    You need Biopython 1.51 or later.
9. What file formats do Bio.SeqIO and Bio.AlignIO read and write?
    Check the built-in docstrings (from Bio import SeqIO, then help(SeqIO)), or see http://biopython.org/wiki/SeqIO and http://biopython.org/wiki/AlignIO on the
    wiki for the latest listing.
10. Why won't the Bio.SeqIO and Bio.AlignIO functions parse, read and write take filenames? They insist on handles!
    You need Biopython 1.54 or later, or just use handles explicitly (see Section 24.1). It is especially important to remember to close output handles explicitly after
    writing your data.
11. Why won't the Bio.SeqIO.write() and Bio.AlignIO.write() functions accept a single record or alignment? They insist on a list or iterator!
    You need Biopython 1.54 or later, or just wrap the item with [...] to create a list of one element.
12. Why doesn't str(...) give me the full sequence of a Seq object?
13. Why doesn't Bio.Blast work with the latest plain text NCBI blast output?
    The NCBI keep tweaking the plain text output from the BLAST tools, and keeping our parser up to date is/was an ongoing struggle. If you aren't using the latest
    version of Biopython, you could try upgrading. However, we (and the NCBI) recommend you use the XML output instead, which is designed to be read by a
    computer program.
14. Why doesn't Bio.Entrez.parse() work? The module imports fine but there is no parse function!
15. Why has my script using Bio.Entrez.efetch() stopped working?
    This could be due to NCBI changes in February 2012 introducing EFetch 2.0. First, they changed the default return modes, so you probably want to add
    retmode="text" to your call. Second, they are now stricter about how to provide a list of IDs; Biopython 1.59 onwards turns a list into a comma-separated string
    automatically.
16. Why doesn't Bio.Blast.NCBIWWW.qblast() give the same results as the NCBI BLAST website?
    You need to specify the same options: the NCBI often adjust the default settings on the website, and they do not match the QBLAST defaults anymore. Check
    things like the gap penalties and expectation threshold.
17. Why doesn't Bio.Blast.NCBIXML.read() work? The module imports but there is no read function!
    You need Biopython 1.50 or later. Or, use next(Bio.Blast.NCBIXML.parse(...)) instead.
18. Why doesn't my SeqRecord object have a letter_annotations attribute?
    Per-letter-annotation support was added in Biopython 1.50.
19. Why can't I slice my SeqRecord to get a sub-record?
20. Why can't I add SeqRecord objects together?
21. Why doesn't Bio.SeqIO.convert() or Bio.AlignIO.convert() work? The modules import fine but there is no convert function!
    You need Biopython 1.52 or later. Alternatively, combine the parse and write functions as described in this tutorial (see Sections 5.5.2 and 6.2.1).
22. Why doesn't Bio.SeqIO.index() work? The module imports fine but there is no index function!
23. Why doesn't Bio.SeqIO.index_db() work? The module imports fine but there is no index_db function!
    You need Biopython 1.57 or later (and a Python with SQLite3 support).
24. Where is the MultipleSeqAlignment object? The Bio.Align module imports fine but this class isn't there!
    You need Biopython 1.54 or later. Alternatively, the older Bio.Align.Generic.Alignment class supports some of its functionality, but using this is now discouraged.
25. Why can't I run command line tools directly from the application wrappers?
    You need Biopython 1.55 or later. Alternatively, use the Python subprocess module directly.
26. I looked in a directory for code, but I couldn't find the code that does something. Where's it hidden?
    One thing to know is that we put code in __init__.py files. If you are not used to looking for code in this file this can be confusing. The reason we do this is to
    make the imports easier for users. For instance, instead of having to do a "repetitive" import like from Bio.GenBank import GenBank, you can just use from Bio
    import GenBank.
27. Why does the code from CVS seem out of date?
    In late September 2009, just after the release of Biopython 1.52, we switched from using CVS to git, a distributed version control system. The old CVS server will
    remain available as a static and read-only backup, but if you want to grab the latest code, you'll need to use git instead. See our website for more details.
28. Why doesn't Bio.Fasta work?
    We deprecated the Bio.Fasta module in Biopython 1.51 (August 2009) and removed it in Biopython 1.55 (August 2010). There is a brief example showing how to
    convert old code to use Bio.SeqIO instead in the DEPRECATED.rst file.
For more general questions, the Python FAQ pages http://www.python.org/doc/faq/ may be useful.
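The version naming rules from FAQ item 4 can be sketched as a small helper. This is only an illustration in plain Python: the function name classify_biopython_version is hypothetical and not part of Biopython, and the strings it returns are made up for this example.

```python
def classify_biopython_version(version):
    """Describe a Biopython version string (hypothetical helper, not a Biopython API)."""
    if version.endswith("+"):
        # e.g. "1.66+": in-development snapshot taken AFTER 1.66 was released
        return "snapshot after release " + version[:-1]
    if ".dev" in version:
        # e.g. "1.68.dev0": in-development snapshot taken BEFORE 1.68 was released
        return "snapshot before release " + version.split(".dev")[0]
    return "official release " + version


for example in ("1.69", "1.66+", "1.68.dev0"):
    print(example, "->", classify_biopython_version(example))
```

In practice you would pass in Bio.__version__ rather than a literal string.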
Chapter 2 Quick Start - What can you do with Biopython?
This section is designed to get you started quickly with Biopython, and to give a general overview of what is available and how to use it. All of the examples in this
section assume that you have some general working knowledge of Python, and that you have successfully installed Biopython on your system. If you think you need to
brush up on your Python, the main Python web site provides quite a bit of free documentation to get started with (http://www.python.org/doc/).
Since much biological work on the computer involves connecting with databases on the internet, some of the examples will also require a working internet connection in
order to run.
Now that that is all out of the way, let's get into what we can do with Biopython.
2.1 General overview of what Biopython provides
As mentioned in the introduction, Biopython is a set of libraries to provide the ability to deal with "things" of interest to biologists working on the computer. In general
this means that you will need to have at least some programming experience (in Python, of course!) or at least an interest in learning to program. Biopython's job is to
make your job easier as a programmer by supplying reusable libraries so that you can focus on answering your specific question of interest, instead of focusing on the
internals of parsing a particular file format (of course, if you want to help by writing a parser that doesn't exist and contributing it to Biopython, please go ahead!). So
Biopython's job is to make you happy!
One thing to note about Biopython is that it often provides multiple ways of "doing the same thing." Things have improved in recent releases, but this can still be
frustrating as in Python there should ideally be one right way to do something. However, this can also be a real benefit because it gives you lots of flexibility and control
over the libraries. The tutorial helps to show you the common or easy ways to do things so that you can just make things work. To learn more about the alternative
possibilities, look in the Cookbook (Chapter 20, this has some cool tricks and tips), the Advanced section (Chapter 22), the built-in "docstrings" (via the Python help
command, or the API documentation) or ultimately the code itself.
2.2 Working with sequences
Disputably (of course!), the central object in bioinformatics is the sequence. Thus, we'll start with a quick introduction to the Biopython mechanisms for dealing with
sequences, the Seq object, which we'll discuss in more detail in Chapter 3.
Most of the time when we think about sequences we have in mind a string of letters like "AGTACACTGGT". You can create such a Seq object with this sequence as follows;
the ">>>" represents the Python prompt followed by what you would type in:
>>> from Bio.Seq import Seq
>>> my_seq = Seq("AGTACACTGGT")
>>> my_seq
Seq('AGTACACTGGT', Alphabet())
>>> print(my_seq)
AGTACACTGGT
>>> my_seq.alphabet
Alphabet()
What we have here is a sequence object with a generic alphabet, reflecting the fact we have not specified if this is a DNA or protein sequence (okay, a protein with a lot
of Alanines, Glycines, Cysteines and Threonines!). We'll talk more about alphabets in Chapter 3.
In addition to having an alphabet, the Seq object differs from the Python string in the methods it supports. You can't do this with a plain string:
>>> my_seq
Seq('AGTACACTGGT', Alphabet())
>>> my_seq.complement()
Seq('TCATGTGACCA', Alphabet())
>>> my_seq.reverse_complement()
Seq('ACCAGTGTACT', Alphabet())
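To make clear what these two methods compute, here is a minimal plain-Python sketch of the same operations on an ordinary string. It is an illustration under simplifying assumptions (only the four unambiguous DNA letters are handled), not how Biopython implements the Seq methods; the names COMPLEMENT_TABLE, complement and reverse_complement are chosen for this example.

```python
# Map each unambiguous DNA letter to its complementary base (upper and lower case).
# Assumption: ambiguity codes such as N or R are not handled in this sketch.
COMPLEMENT_TABLE = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(dna):
    """Return the complementary strand, read in the same direction."""
    return dna.translate(COMPLEMENT_TABLE)


def reverse_complement(dna):
    """Return the complementary strand, reversed."""
    return complement(dna)[::-1]


print(complement("AGTACACTGGT"))          # TCATGTGACCA
print(reverse_complement("AGTACACTGGT"))  # ACCAGTGTACT
```

The printed strings match the Seq results shown above; the Seq object simply saves you from writing and maintaining such helpers yourself.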
The next most important class is the SeqRecord or Sequence Record. This holds a sequence (as a Seq object) with additional annotation including an identifier, name and
description. The Bio.SeqIO module for reading and writing sequence file formats works with SeqRecord objects, which will be introduced below and covered in more
detail by Chapter 5.
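As a rough mental model of "a sequence plus annotation", the sketch below bundles a sequence with an identifier, name and description. MiniRecord is a hypothetical stand-in written for this illustration only; it is not the real SeqRecord class, which stores a Seq object and offers much more.

```python
from dataclasses import dataclass


@dataclass
class MiniRecord:
    """Hypothetical stand-in for the idea behind SeqRecord (not the real class)."""

    seq: str                                    # the sequence itself
    id: str = "<unknown id>"                    # identifier, e.g. an accession
    name: str = "<unknown name>"                # short name
    description: str = "<unknown description>"  # free-text description


# The identifier and description here are made up for the example.
record = MiniRecord("AGTACACTGGT", id="example1", description="A made-up example sequence")
print(record.id, len(record.seq))
```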
This covers the basic features and uses of the Biopython sequence class. Now that you've got some idea of what it is like to interact with the Biopython libraries, it's time
to delve into the fun, fun world of dealing with biological file formats!
2.3 A usage example
Before we jump right into parsers and everything else to do with Biopython, let's set up an example to motivate everything we do and make life more interesting. After
all, if there wasn't any biology in this tutorial, why would you want to read it?
Since I love plants, I think we're just going to have to have a plant-based example (sorry to all the fans of other organisms out there!). Having just completed a recent trip
to our local greenhouse, we've suddenly developed an incredible obsession with Lady Slipper Orchids (if you wonder why, have a look at some Lady Slipper Orchids
photos on Flickr, or try a Google Image Search).
Of course, orchids are not only beautiful to look at, they are also extremely interesting for people studying evolution and systematics. So let's suppose we're thinking
about writing a funding proposal to do a molecular study of Lady Slipper evolution, and would like to see what kind of research has already been done and how we can
add to that.
After a little bit of reading up we discover that the Lady Slipper Orchids are in the Orchidaceae family and the Cypripedioideae sub-family and are made up of 5 genera:
Cypripedium, Paphiopedilum, Phragmipedium, Selenipedium and Mexipedium.
That gives us enough to get started delving for more information. So, let's look at how the Biopython tools can help us. We'll start with sequence parsing in Section 2.4,
but the orchids will be back later on as well: for example we'll search PubMed for papers about orchids and extract sequence data from GenBank in Chapter 9, extract
data from Swiss-Prot from certain orchid proteins in Chapter 10, and work with ClustalW multiple sequence alignments of orchid proteins in Section 6.4.1.
2.4 Parsing sequence file formats
A large part of much bioinformatics work involves dealing with the many types of file formats designed to hold biological data. These files are loaded with interesting
biological data, and a special challenge is parsing these files into a format so that you can manipulate them with some kind of programming language. However the task
of parsing these files can be frustrated by the fact that the formats can change quite regularly, and that formats may contain small subtleties which can break even the most
well designed parsers.
We are now going to briefly introduce the Bio.SeqIO module; you can find out more in Chapter 5. We'll start with an online search for our friends, the lady slipper
orchids. To keep this introduction simple, we're just using the NCBI website by hand. Let's just take a look through the nucleotide databases at NCBI, using an Entrez
online search (http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Nucleotide) for everything mentioning the text Cypripedioideae (this is the subfamily of lady
slipper orchids).
When this tutorial was originally written, this search gave us only 94 hits, which we saved as a FASTA formatted text file and as a GenBank formatted text file (files
ls_orchid.fasta and ls_orchid.gbk, also included with the Biopython source code under docs/tutorial/examples/).
If you run the search today, you'll get hundreds of results! When following the tutorial, if you want to see the same list of genes, just download the two files above or
copy them from docs/examples/ in the Biopython source code. In Section 2.5 we will look at how to do a search like this from within Python.
2.4.1 Simple FASTA parsing example
If you open the lady slipper orchids FASTA file ls_orchid.fasta in your favourite text editor, you'll see that the file starts like this:
>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
...
It contains 94 records; each has a line starting with ">" (greater-than symbol) followed by the sequence on one or more lines. Now try this in Python:
from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))
You should get something like this on your screen:
gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
740
...
gi|2765564|emb|Z78439.1|PBZ78439
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', SingleLetterAlphabet())
592
Notice that the FASTA format does not specify the alphabet, so Bio.SeqIO has defaulted to the rather generic SingleLetterAlphabet() rather than something DNA
specific.
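To show the record structure just described (a ">" header line followed by one or more sequence lines), here is a deliberately minimal plain-Python reader. It is a sketch for illustration, not the parser Bio.SeqIO uses; the function name read_fasta and the second record are made up, and the first record is shortened to the first 40 letters of the orchid sequence above.

```python
import io


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from FASTA-formatted text lines."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            # The identifier is the first whitespace-delimited word after ">".
            words = line[1:].split(None, 1)
            identifier, chunks = (words[0] if words else ""), []
        elif identifier is not None:
            chunks.append(line)  # sequence may span several lines
    if identifier is not None:
        yield identifier, "".join(chunks)


example = io.StringIO(
    ">gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA\n"
    "CGTAACAAGGTTTCCGTAGG\n"
    "TGAACCTGCGGAAGGATCAT\n"
    ">seq2 a second, made-up record\n"
    "ACGT\n"
)
records = list(read_fasta(example))
for identifier, sequence in records:
    print(identifier, len(sequence))
```

For real work, prefer SeqIO.parse as shown above: it returns SeqRecord objects and copes with details this sketch ignores.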
2.4.2 Simple GenBank parsing example
Now let's load the GenBank file ls_orchid.gbk instead. Notice that the code to do this is almost identical to the snippet used above for the FASTA file; the only
difference is we change the filename and the format string:
from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))
This should give:
Z78533.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', IUPACAmbiguousDNA())
740
...
Z78439.1
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', IUPACAmbiguousDNA())
592
This time Bio.SeqIO has been able to choose a sensible alphabet, IUPAC Ambiguous DNA. You'll also notice that a shorter string has been used as the seq_record.id in
this case.
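If you need to match records between the two files, the shorter GenBank identifier can be recovered from the longer FASTA one. The sketch below assumes the pipe-delimited "gi|number|database|accession|locus" layout seen in the FASTA output above; the helper name accession_from_fasta_id is hypothetical and not part of Biopython.

```python
def accession_from_fasta_id(fasta_id):
    """Return the accession.version field of a "gi|...|db|accession|locus" identifier.

    Hypothetical helper: identifiers in any other layout are returned unchanged.
    """
    fields = fasta_id.split("|")
    if len(fields) >= 4 and fields[0] == "gi":
        return fields[3]
    return fasta_id


print(accession_from_fasta_id("gi|2765658|emb|Z78533.1|CIZ78533"))  # Z78533.1
```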
2.4.3 I love parsing - please don't stop talking about it!
Biopython has a lot of parsers, and each has its own little special niches based on the sequence format it is parsing and all of that. Chapter 5 covers Bio.SeqIO in more
detail, while Chapter 6 introduces Bio.AlignIO for sequence alignments.
While the most popular file formats have parsers integrated into Bio.SeqIO and/or Bio.AlignIO, for some of the rarer and unloved file formats there is either no parser at
all, or an old parser which has not been linked in yet. Please also check the wiki pages http://biopython.org/wiki/SeqIO and http://biopython.org/wiki/AlignIO for the
latest information, or ask on the mailing list. The wiki pages should include an up-to-date list of supported file types, and some additional examples.
The next place to look for information about specific parsers and how to do cool things with them is in the Cookbook (Chapter 20 of this Tutorial). If you don't find the
information you are looking for, please consider helping out your poor overworked documentors and submitting a cookbook entry about it! (once you figure out how to
do it, that is!)
2.5 Connecting with biological databases
One of the very common things that you need to do in bioinformatics is extract information from biological databases. It can be quite tedious to access these databases
manually, especially if you have a lot of repetitive work to do. Biopython attempts to save you time and energy by making some online databases available from Python
scripts. Currently, Biopython has code to extract information from the following databases:
- Entrez (and PubMed) from the NCBI: see Chapter 9.
- ExPASy: see Chapter 10.
- SCOP: see the Bio.SCOP.search() function.
The code in these modules basically makes it easy to write Python code that interacts with the CGI scripts on these pages, so that you can get results in an easy-to-deal-with
format. In some cases, the results can be tightly integrated with the Biopython parsers to make it even easier to extract information.
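To give a feel for what "interacting with a CGI script" means, the sketch below only builds the kind of query URL such a script expects; it makes no network request. The endpoint and the db and term parameter names are those of the NCBI E-utilities ESearch service and are an assumption not taken from this text; build_esearch_url is a hypothetical helper. In real code, use Bio.Entrez (Chapter 9), which builds and sends these requests for you.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed endpoint of the NCBI E-utilities ESearch service (not stated in this tutorial).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term):
    """Return a query URL asking ESearch for `term` in database `db` (hypothetical helper)."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


url = build_esearch_url("nucleotide", "Cypripedioideae")
print(url)
# Decode the query string again to check what would be sent.
print(parse_qs(urlsplit(url).query))
```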
2.6 What to do next
Now that you've made it this far, you hopefully have a good understanding of the basics of Biopython and are ready to start using it for doing useful work. The best thing
to do now is finish reading this tutorial, and then if you want start snooping around in the source code, and looking at the automatically generated documentation.
Once you get a picture of what you want to do, and what libraries in Biopython will do it, you should take a peek at the Cookbook (Chapter 20), which may have example
code to do something similar to what you want to do.
If you know what you want to do, but can't figure out how to do it, please feel free to post questions to the main Biopython list (see
http://biopython.org/wiki/Mailing_lists). This will not only help us answer your question, it will also allow us to improve the documentation so it can help the next
person do what you want to do.
Enjoy the code!
Chapter 3 Sequence objects
Biological sequences are arguably the central object in Bioinformatics, and in this chapter we'll introduce the Biopython mechanism for dealing with sequences, the Seq
object. Chapter 4 will introduce the related SeqRecord object, which combines the sequence information with any annotation, used again in Chapter 5 for Sequence
Input/Output.
Sequences are essentially strings of letters like AGTACACTGGT, which seems very natural since this is the most common way that sequences are seen in biological file
formats.
There are two important differences between Seq objects and standard Python strings. First of all, they have different methods. Although the Seq object supports many of
the same methods as a plain string, its translate() method differs by doing biological translation, and there are also additional biologically relevant methods like
reverse_complement(). Secondly, the Seq object has an important attribute, alphabet, which is an object describing what the individual characters making up the sequence
string "mean", and how they should be interpreted. For example, is AGTACACTGGT a DNA sequence, or just a protein sequence that happens to be rich in Alanines, Glycines,
Cysteines and Threonines?
3.1 Sequences and Alphabets
The alphabet object is perhaps the most important thing that makes the Seq object more than just a string. The currently available alphabets for Biopython are defined in the Bio.Alphabet module. We'll use the IUPAC alphabets (http://www.chem.qmw.ac.uk/iupac/) here to deal with some of our favorite objects: DNA, RNA and Proteins.
Bio.Alphabet.IUPAC provides basic definitions for proteins, DNA and RNA, but additionally provides the ability to extend and customize the basic definitions. For instance, for proteins, there is a basic IUPACProtein class, but there is an additional ExtendedIUPACProtein class providing for the additional elements "U" (or "Sec" for selenocysteine) and "O" (or "Pyl" for pyrrolysine), plus the ambiguous symbols "B" (or "Asx" for asparagine or aspartic acid), "Z" (or "Glx" for glutamine or glutamic acid), "J" (or "Xle" for leucine or isoleucine) and "X" (or "Xxx" for an unknown amino acid). For DNA you've got choices of IUPACUnambiguousDNA, which provides for just the basic letters, IUPACAmbiguousDNA (which provides for ambiguity letters for every possible situation) and ExtendedIUPACDNA, which allows letters for modified bases. Similarly, RNA can be represented by IUPACAmbiguousRNA or IUPACUnambiguousRNA.
The advantages of having an alphabet class are twofold. First, this gives an idea of the type of information the Seq object contains. Secondly, this provides a means of constraining the information, as a means of type checking.
Now that we know what we are dealing with, let's look at how to utilize this class to do interesting work. You can create an ambiguous sequence with the default generic alphabet like this:
>>> from Bio.Seq import Seq
>>> my_seq = Seq("AGTACACTGGT")
>>> my_seq
Seq('AGTACACTGGT', Alphabet())
>>> my_seq.alphabet
Alphabet()
However, where possible you should specify the alphabet explicitly when creating your sequence objects, in this case an unambiguous DNA alphabet object:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("AGTACACTGGT", IUPAC.unambiguous_dna)
>>> my_seq
Seq('AGTACACTGGT', IUPACUnambiguousDNA())
>>> my_seq.alphabet
IUPACUnambiguousDNA()
Unless of course, this really is an amino acid sequence:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_prot = Seq("AGTACACTGGT", IUPAC.protein)
>>> my_prot
Seq('AGTACACTGGT', IUPACProtein())
>>> my_prot.alphabet
IUPACProtein()
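To make the ambiguity concrete, here is a minimal sketch in plain Python (it does not use Biopython). The letter sets and the helper fits_alphabet are illustrative assumptions, not the Bio.Alphabet classes themselves; the point is only that the same string can be a legal DNA sequence and a legal protein sequence, which is why the alphabet has to be stated rather than guessed.

```python
# Simplified letter sets for illustration only; these are assumptions,
# not the definitions used by Bio.Alphabet.IUPAC.
UNAMBIGUOUS_DNA = set("GATC")
PROTEIN = set("ACDEFGHIKLMNPQRSTVWY")


def fits_alphabet(sequence, letters):
    """Return True if every character of sequence is one of letters."""
    return set(sequence.upper()) <= letters


text = "AGTACACTGGT"
print(fits_alphabet(text, UNAMBIGUOUS_DNA))      # valid as DNA
print(fits_alphabet(text, PROTEIN))              # also valid as protein
print(fits_alphabet("EVRNAK", UNAMBIGUOUS_DNA))  # not valid as DNA
```

A check like this can reject impossible letters, but it cannot tell you which interpretation was intended.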
3.2 Sequences act like strings
In many ways, we can deal with Seq objects as if they were normal Python strings, for example getting the length, or iterating over the elements:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCG", IUPAC.unambiguous_dna)
>>> for index, letter in enumerate(my_seq):
...     print("%i %s" % (index, letter))
0 G
1 A
2 T
3 C
4 G
>>> print(len(my_seq))
5
You can access elements of the sequence in the same way as for strings (but remember, Python counts from zero!):
>>> print(my_seq[0]) #first letter
G
>>> print(my_seq[2]) #third letter
T
>>> print(my_seq[-1]) #last letter
G
The Seq object has a .count() method, just like a string. Note that this means that like a Python string, this gives a non-overlapping count:
>>> from Bio.Seq import Seq
>>> "AAAA".count("AA")
2
>>> Seq("AAAA").count("AA")
2
For some biological uses, you may actually want an overlapping count (i.e. 3 in this trivial example). When searching for single letters, this makes no difference:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna)
>>> len(my_seq)
32
>>> my_seq.count("G")
9
>>> 100 * float(my_seq.count("G") + my_seq.count("C")) / len(my_seq)
46.875
While you could use the above snippet of code to calculate a GC%, note that the Bio.SeqUtils module has several GC functions already built. For example:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> from Bio.SeqUtils import GC
>>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna)
>>> GC(my_seq)
46.875
Note that using the Bio.SeqUtils.GC() function should automatically cope with mixed case sequences and the ambiguous nucleotide S which means G or C.
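To show what coping with mixed case and the ambiguous letter S involves, here is a rough sketch on plain Python strings. The function gc_percent is a hypothetical helper written for this illustration; it is not the code behind Bio.SeqUtils.GC().

```python
def gc_percent(sequence):
    """Percentage of G, C or S letters in sequence, ignoring case.

    S is the IUPAC ambiguity code for "G or C", so it is counted too.
    An empty sequence gives 0.0 rather than a division by zero.
    """
    if not sequence:
        return 0.0
    upper = sequence.upper()
    gc_letters = sum(upper.count(letter) for letter in "GCS")
    return 100.0 * gc_letters / len(upper)


print(gc_percent("GATCGATGGGCCTATATAGGATCGAAAATCGC"))
print(gc_percent("gatcGATGggcc"))
```

The first call reproduces the 46.875 computed by hand above (15 G or C letters out of 32).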
Also note that just like a normal Python string, the Seq object is in some ways "read-only". If you need to edit your sequence, for example simulating a point mutation, look at Section 3.12 below which talks about the MutableSeq object.
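If you do need the overlapping count mentioned earlier in this section, one possible approach is to search repeatedly with str.find, moving on by one letter each time. The helper count_overlapping below is an assumption of this sketch rather than a Biopython function, and it works on plain strings, so convert a Seq with str() first.

```python
def count_overlapping(sequence, sub):
    """Count occurrences of sub in sequence, allowing overlaps."""
    if not sub:
        raise ValueError("sub must not be empty")
    count = 0
    start = sequence.find(sub)
    while start != -1:
        count += 1
        # Step forward by one letter only, so overlapping hits are found.
        start = sequence.find(sub, start + 1)
    return count


print("AAAA".count("AA"))             # non-overlapping count
print(count_overlapping("AAAA", "AA"))  # overlapping count
```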
3.3 Slicing a sequence
A more complicated example, let's get a slice of the sequence:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
>>> my_seq[4:12]
Seq('GATGGGCC', IUPACUnambiguousDNA())
Two things are interesting to note. First, this follows the normal conventions for Python strings. So the first element of the sequence is 0 (which is normal for computer science, but not so normal for biology). When you do a slice the first item is included (i.e. 4 in this case) and the last is excluded (12 in this case), which is the way things work in Python, but of course not necessarily the way everyone in the world would expect. The main goal is to stay consistent with what Python does.
The second thing to notice is that the slice is performed on the sequence data string, but the new object produced is another Seq object which retains the alphabet information from the original Seq object.
Also like a Python string, you can do slices with a start, stop and stride (the step size, which defaults to one). For example, we can get the first, second and third codon positions of this DNA sequence:
>>> my_seq[0::3]
Seq('GCTGTAGTAAG', IUPACUnambiguousDNA())
>>> my_seq[1::3]
Seq('AGGCATGCATC', IUPACUnambiguousDNA())
>>> my_seq[2::3]
Seq('TAGCTAAGAC', IUPACUnambiguousDNA())
Another stride trick you might have seen with a Python string is the use of a -1 stride to reverse the string. You can do this with a Seq object too:
>>> my_seq[::-1]
Seq('CGCTAAAAGCTAGGATATATCCGGGTAGCTAG', IUPACUnambiguousDNA())
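The same slicing rules can be used to cut a sequence into codons. The sketch below works on a plain string and uses a hypothetical helper, split_codons; it also checks that taking the first letter of each codon agrees with the [0::3] stride shown above.

```python
def split_codons(sequence):
    """Split sequence into consecutive three-letter pieces.

    A trailing piece of one or two letters is kept as it is.
    """
    return [sequence[i:i + 3] for i in range(0, len(sequence), 3)]


seq = "GATCGATGGGCC"
codons = split_codons(seq)
print(codons)

# The first codon positions are the same letters as a stride of three.
first_positions = "".join(codon[0] for codon in codons)
print(first_positions == seq[0::3])
```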
3.4 Turning Seq objects into strings
If you really do just need a plain string, for example to write to a file, or insert into a database, then this is very easy to get:
>>> str(my_seq)
'GATCGATGGGCCTATATAGGATCGAAAATCGC'
Since calling str() on a Seq object returns the full sequence as a string, you often don't actually have to do this conversion explicitly. Python does this automatically in the print function (and the print statement under Python 2):
>>> print(my_seq)
GATCGATGGGCCTATATAGGATCGAAAATCGC
You can also use the Seq object directly with a %s placeholder when using the Python string formatting or interpolation operator (%):
>>> fasta_format_string = ">Name\n%s\n" % my_seq
>>> print(fasta_format_string)
>Name
GATCGATGGGCCTATATAGGATCGAAAATCGC
<BLANKLINE>
This line of code constructs a simple FASTA format record (without worrying about line wrapping). Section 4.6 describes a neat way to get a FASTA formatted string from a SeqRecord object, while the more general topic of reading and writing FASTA format sequence files is covered in Chapter 5.
>>> str(my_seq)
'GATCGATGGGCCTATATAGGATCGAAAATCGC'
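If you do want line wrapping in a hand-built FASTA record, a small helper is enough. This is only a sketch: to_fasta and its default width of 60 are assumptions made for the example, and for real work the SeqRecord and Bio.SeqIO routes mentioned above are the better choice.

```python
def to_fasta(name, sequence, width=60):
    """Return a FASTA record with the sequence wrapped at width letters."""
    if width < 1:
        raise ValueError("width must be at least 1")
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">%s\n%s\n" % (name, "\n".join(lines))


# Works on plain strings; pass str(my_seq) if you have a Seq object.
print(to_fasta("Name", "GATCGATGGGCCTATATAGGATCGAAAATCGC", width=10))
```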
3.5 Concatenating or adding sequences
Naturally, you can in principle add any two Seq objects together, just like you can with Python strings to concatenate them. However, you can't add sequences with incompatible alphabets, such as a protein sequence and a DNA sequence:
>>> from Bio.Alphabet import IUPAC
>>> from Bio.Seq import Seq
>>> protein_seq = Seq("EVRNAK", IUPAC.protein)
>>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
>>> protein_seq + dna_seq
Traceback (most recent call last):
...
TypeError: Incompatible alphabets IUPACProtein() and IUPACUnambiguousDNA()
If you really wanted to do this, you'd have to first give both sequences generic alphabets:
>>> from Bio.Alphabet import generic_alphabet
>>> protein_seq.alphabet = generic_alphabet
>>> dna_seq.alphabet = generic_alphabet
>>> protein_seq + dna_seq
Seq('EVRNAKACGT', Alphabet())
Here is an example of adding a generic nucleotide sequence to an unambiguous IUPAC DNA sequence, resulting in an ambiguous nucleotide sequence:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_nucleotide
>>> from Bio.Alphabet import IUPAC
>>> nuc_seq = Seq("GATCGATGC", generic_nucleotide)
>>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
>>> nuc_seq
Seq('GATCGATGC', NucleotideAlphabet())
>>> dna_seq
Seq('ACGT', IUPACUnambiguousDNA())
>>> nuc_seq + dna_seq
Seq('GATCGATGCACGT', NucleotideAlphabet())
You may often have many sequences to add together, which can be done with a for loop like this:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> list_of_seqs = [Seq("ACGT", generic_dna), Seq("AACC", generic_dna), Seq("GGTT", generic_dna)]
>>> concatenated = Seq("", generic_dna)
>>> for s in list_of_seqs:
...     concatenated += s
...
>>> concatenated
Seq('ACGTAACCGGTT', DNAAlphabet())
Or, a more elegant approach is to use the built-in sum function with its optional start value argument (which otherwise defaults to zero):
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> list_of_seqs = [Seq("ACGT", generic_dna), Seq("AACC", generic_dna), Seq("GGTT", generic_dna)]
>>> sum(list_of_seqs, Seq("", generic_dna))
Seq('ACGTAACCGGTT', DNAAlphabet())
Unlike the Python string, the Biopython Seq does not (currently) have a .join method.
3.6 Changing case
Python strings have very useful upper and lower methods for changing the case. As of Biopython 1.53, the Seq object gained similar methods which are alphabet aware. For example,
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> dna_seq = Seq("acgtACGT", generic_dna)
>>> dna_seq
Seq('acgtACGT', DNAAlphabet())
>>> dna_seq.upper()
Seq('ACGTACGT', DNAAlphabet())
>>> dna_seq.lower()
Seq('acgtacgt', DNAAlphabet())
These are useful for doing case insensitive matching:
>>> "GTAC" in dna_seq
False
>>> "GTAC" in dna_seq.upper()
True
Note that strictly speaking the IUPAC alphabets are for upper case sequences only, thus:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
>>> dna_seq
Seq('ACGT', IUPACUnambiguousDNA())
>>> dna_seq.lower()
Seq('acgt', DNAAlphabet())
3.7 Nucleotide sequences and (reverse) complements
For nucleotide sequences, you can easily obtain the complement or reverse complement of a Seq object using its built-in methods:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
>>> my_seq
Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA())
>>> my_seq.complement()
Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG', IUPACUnambiguousDNA())
>>> my_seq.reverse_complement()
Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC', IUPACUnambiguousDNA())
As mentioned earlier, an easy way to just reverse a Seq object (or a Python string) is to slice it with a -1 step:
>>> my_seq[::-1]
Seq('CGCTAAAAGCTAGGATATATCCGGGTAGCTAG', IUPACUnambiguousDNA())
In all of these operations, the alphabet property is maintained. This is very useful in case you accidentally end up trying to do something weird like take the (reverse) complement of a protein sequence:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> protein_seq = Seq("EVRNAK", IUPAC.protein)
>>> protein_seq.complement()
Traceback (most recent call last):
...
ValueError: Proteins do not have complements!
The example in Section 5.5.3 combines the Seq object's reverse complement method with Bio.SeqIO for sequence input/output.
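To see what the two methods are doing underneath, here is a sketch on plain strings. It handles only the four unambiguous DNA letters (in either case); the function names are chosen for this example and this is not how Biopython implements its own methods, which also cover ambiguity codes and RNA.

```python
# Translation table swapping each base for its partner; unambiguous DNA only.
_COMPLEMENT_TABLE = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(dna):
    """Return the complement of an unambiguous DNA string."""
    return dna.translate(_COMPLEMENT_TABLE)


def reverse_complement(dna):
    """Return the reverse complement of an unambiguous DNA string."""
    return complement(dna)[::-1]


my_dna = "GATCGATGGGCCTATATAGGATCGAAAATCGC"
print(complement(my_dna))
print(reverse_complement(my_dna))
```

The two printed strings should match the complement() and reverse_complement() results shown at the start of this section.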
3.8 Transcription
Before talking about transcription, I want to try to clarify the strand issue. Consider the following (made up) stretch of double stranded DNA which encodes a short peptide:

     DNA coding strand (aka Crick strand, strand +1)
5'  ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG  3'
    |||||||||||||||||||||||||||||||||||||||
3'  TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC  5'
     DNA template strand (aka Watson strand, strand -1)

                       |
                 Transcription
                       |
                       v

5'  AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG  3'
     Single stranded messenger RNA

The actual biological transcription process works from the template strand, doing a reverse complement (TCAG → CUGA) to give the mRNA. However, in Biopython and bioinformatics in general, we typically work directly with the coding strand because this means we can get the mRNA sequence just by switching T → U.
Now let's actually get down to doing a transcription in Biopython. First, let's create Seq objects for the coding and template DNA strands:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
>>> coding_dna
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
>>> template_dna = coding_dna.reverse_complement()
>>> template_dna
Seq('CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT', IUPACUnambiguousDNA())
These should match the figure above. Remember that by convention nucleotide sequences are normally read from the 5' to 3' direction, while in the figure the template strand is shown reversed.
Now let's transcribe the coding strand into the corresponding mRNA, using the Seq object's built-in transcribe method:
>>> coding_dna
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
>>> messenger_rna = coding_dna.transcribe()
>>> messenger_rna
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())
As you can see, all this does is switch T → U, and adjust the alphabet.
If you do want to do a true biological transcription starting with the template strand, then this becomes a two-step process:
>>> template_dna.reverse_complement().transcribe()
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())
The Seq object also includes a back-transcription method for going from the mRNA to the coding strand of the DNA. Again, this is a simple U → T substitution and associated change of alphabet:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna)
>>> messenger_rna
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())
>>> messenger_rna.back_transcribe()
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
Note: The Seq object's transcribe and back_transcribe methods were added in Biopython 1.49. For older releases you would have to use the Bio.Seq module's functions instead, see Section 3.14.
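Because transcription of the coding strand is only a letter substitution, it can be sketched in a few lines on plain strings. The helper names below (including transcribe_template for the two-step route from the template strand) are assumptions of this sketch, and only the four unambiguous DNA letters are handled.

```python
_COMPLEMENT_TABLE = str.maketrans("ACGTacgt", "TGCAtgca")


def transcribe(coding_dna):
    """Coding strand DNA to mRNA: replace T with U."""
    return coding_dna.replace("T", "U").replace("t", "u")


def back_transcribe(rna):
    """mRNA to coding strand DNA: replace U with T."""
    return rna.replace("U", "T").replace("u", "t")


def transcribe_template(template_dna):
    """Template strand to mRNA: reverse complement first, then T to U."""
    coding = template_dna.translate(_COMPLEMENT_TABLE)[::-1]
    return transcribe(coding)


coding = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
template = "CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT"
print(transcribe(coding))
print(transcribe_template(template) == transcribe(coding))
print(back_transcribe(transcribe(coding)) == coding)
```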
3.9 Translation
Sticking with the same example discussed in the transcription section above, now let's translate this mRNA into the corresponding protein sequence, again taking advantage of one of the Seq object's biological methods:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna)
>>> messenger_rna
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())
>>> messenger_rna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
You can also translate directly from the coding strand DNA sequence:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
>>> coding_dna
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
>>> coding_dna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
You should notice in the above protein sequences that in addition to the end stop character, there is an internal stop as well. This was a deliberate choice of example, as it gives an excuse to talk about some optional arguments, including different translation tables (Genetic Codes).
The translation tables available in Biopython are based on those from the NCBI (see the next section of this tutorial). By default, translation will use the standard genetic code (NCBI table id 1). Suppose we are dealing with a mitochondrial sequence. We need to tell the translation function to use the relevant genetic code instead:
>>> coding_dna.translate(table="Vertebrate Mitochondrial")
Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))
You can also specify the table using the NCBI table number, which is shorter and is often included in the feature annotation of GenBank files:
>>> coding_dna.translate(table=2)
Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))
Now, you may want to translate the nucleotides up to the first in-frame stop codon, and then stop (as happens in nature):
>>> coding_dna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
>>> coding_dna.translate(to_stop=True)
Seq('MAIVMGR', IUPACProtein())
>>> coding_dna.translate(table=2)
Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))
>>> coding_dna.translate(table=2, to_stop=True)
Seq('MAIVMGRWKGAR', IUPACProtein())
Notice that when you use the to_stop argument, the stop codon itself is not translated, and the stop symbol is not included at the end of your protein sequence.
You can even specify the stop symbol if you don't like the default asterisk:
>>> coding_dna.translate(table=2, stop_symbol="@")
Seq('MAIVMGRWKGAR@', HasStopCodon(IUPACProtein(), '@'))
Now, suppose you have a complete coding sequence CDS, which is to say a nucleotide sequence (e.g. mRNA, after any splicing) which is a whole number of codons (i.e. the length is a multiple of three), commences with a start codon, ends with a stop codon, and has no internal in-frame stop codons. In general, given a complete CDS, the default translate method will do what you want (perhaps with the to_stop option). However, what if your sequence uses a non-standard start codon? This happens a lot in bacteria, for example the gene yaaX in E. coli K12:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> gene = Seq("GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCCATGGCA" + \
...            "GCACAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGATAGGCGATCGTGAT" + \
...            "AATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAACAT" + \
...            "TATGAATGGCGAGGCAATCGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCAT" + \
...            "AAGAAAGCTCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAA",
...            generic_dna)
>>> gene.translate(table="Bacterial")
Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HR*',
HasStopCodon(ExtendedIUPACProtein(), '*'))
>>> gene.translate(table="Bacterial", to_stop=True)
Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HHR',
ExtendedIUPACProtein())
In the bacterial genetic code GTG is a valid start codon, and while it does normally encode Valine, if used as a start codon it should be translated as methionine. This happens if you tell Biopython your sequence is a complete CDS:
>>> gene.translate(table="Bacterial", cds=True)
Seq('MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HHR',
ExtendedIUPACProtein())
In addition to telling Biopython to translate an alternative start codon as methionine, using this option also makes sure your sequence really is a valid CDS (you'll get an exception if not).
The example in Section 20.1.3 combines the Seq object's translate method with Bio.SeqIO for sequence input/output.
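To make the roles of the table, to_stop and stop_symbol options concrete, here is a simplified translation sketch on plain DNA strings. It is not Biopython's implementation: the two lookup dictionaries are built from the compact 64-letter amino acid strings for NCBI tables 1 and 2 (codons ordered T, C, A, G at each position), and the sketch does not handle RNA, ambiguous letters or the cds option.

```python
from itertools import product

# All 64 codons in NCBI order: each position cycles through T, C, A, G.
CODONS = ["".join(c) for c in product("TCAG", repeat=3)]

# Amino acids for each codon, in the same order ("*" marks a stop codon).
STANDARD = dict(zip(
    CODONS, "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"))
VERTEBRATE_MITO = dict(zip(
    CODONS, "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"))


def translate(dna, table=STANDARD, to_stop=False, stop_symbol="*"):
    """Translate unambiguous DNA codon by codon; trailing bases are ignored."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino = table[dna[i:i + 3].upper()]
        if amino == "*":
            if to_stop:
                break  # stop before the first in-frame stop codon
            amino = stop_symbol
        protein.append(amino)
    return "".join(protein)


coding = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(translate(coding))
print(translate(coding, to_stop=True))
print(translate(coding, table=VERTEBRATE_MITO))
print(translate(coding, table=VERTEBRATE_MITO, stop_symbol="@"))
```

The internal TGA is a stop in the standard code but tryptophan (W) in the vertebrate mitochondrial code, which is why the results differ.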
3.10 Translation Tables
In the previous sections we talked about the Seq object translation method (and mentioned the equivalent function in the Bio.Seq module, see Section 3.14). Internally these use codon table objects derived from the NCBI information at ftp://ftp.ncbi.nlm.nih.gov/entrez/misc/data/gc.prt, also shown on http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi in a much more readable layout.
As before, let's just focus on two choices: the Standard translation table, and the translation table for Vertebrate Mitochondrial DNA.
>>> from Bio.Data import CodonTable
>>> standard_table = CodonTable.unambiguous_dna_by_name["Standard"]
>>> mito_table = CodonTable.unambiguous_dna_by_name["Vertebrate Mitochondrial"]
Alternatively, these tables are labeled with ID numbers 1 and 2, respectively:
>>> from Bio.Data import CodonTable
>>> standard_table = CodonTable.unambiguous_dna_by_id[1]
>>> mito_table = CodonTable.unambiguous_dna_by_id[2]
You can compare the actual tables visually by printing them:
>>> print(standard_table)
Table 1 Standard, SGC0

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--
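and:
>>> print(mito_table)
Table 2 Vertebrate Mitochondrial, SGC1

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA W   | A
T | TTG L   | TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L   | CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I(s)| ACT T   | AAT N   | AGT S   | T
A | ATC I(s)| ACC T   | AAC N   | AGC S   | C
A | ATA M(s)| ACA T   | AAA K   | AGA Stop| A
A | ATG M(s)| ACG T   | AAG K   | AGG Stop| G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V(s)| GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--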
You may find the following properties useful, for example if you are trying to do your own gene finding:
>>> mito_table.stop_codons
['TAA', 'TAG', 'AGA', 'AGG']
>>> mito_table.start_codons
['ATT', 'ATC', 'ATA', 'ATG', 'GTG']
>>> mito_table.forward_table["ACG"]
'T'
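As a small taste of do-it-yourself gene finding, the sketch below scans a plain DNA string for the first in-frame stop codon, given a list of stop codons such as the one shown above. The function first_stop_index and the hard-coded lists are assumptions of this example; with Biopython you would pass in the table's stop_codons instead.

```python
def first_stop_index(dna, stop_codons, frame=0):
    """Return the index of the first in-frame stop codon, or -1 if none.

    frame is the offset (0, 1 or 2) at which reading starts.
    """
    for i in range(frame, len(dna) - 2, 3):
        if dna[i:i + 3] in stop_codons:
            return i
    return -1


coding = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
standard_stops = ["TAA", "TAG", "TGA"]
mito_stops = ["TAA", "TAG", "AGA", "AGG"]  # as listed for table 2 above

print(first_stop_index(coding, standard_stops))
print(first_stop_index(coding, mito_stops))
```

Under the standard code the reading frame ends at the internal TGA; under the vertebrate mitochondrial code it continues to the final TAG.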
3.11 Comparing Seq objects
Sequence comparison is actually a very complicated topic, and there is no easy way to decide if two sequences are equal. The basic problem is that the meaning of the letters in a sequence is context dependent: the letter "A" could be part of a DNA, RNA or protein sequence. Biopython uses alphabet objects as part of each Seq object to try to capture this information, so comparing two Seq objects could mean considering both the sequence strings and the alphabets.
For example, you might argue that the two DNA Seq objects Seq("ACGT", IUPAC.unambiguous_dna) and Seq("ACGT", IUPAC.ambiguous_dna) should be equal, even though they do have different alphabets. Depending on the context this could be important.
This gets worse. Suppose you think Seq("ACGT", IUPAC.unambiguous_dna) and Seq("ACGT") (i.e. the default generic alphabet) should be equal. Then, logically, Seq("ACGT", IUPAC.protein) and Seq("ACGT") should also be equal. Now, in logic if A=B and B=C, by transitivity we expect A=C. So for logical consistency we'd require Seq("ACGT", IUPAC.unambiguous_dna) and Seq("ACGT", IUPAC.protein) to be equal, which most people would agree is just not right. This transitivity also has implications for using Seq objects as Python dictionary keys.
Now, in everyday use, your sequences will probably all have the same alphabet, or at least all be the same type of sequence (all DNA, all RNA, or all protein). What you probably want is to just compare the sequences as strings, which you can do explicitly:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> seq1 = Seq("ACGT", IUPAC.unambiguous_dna)
>>> seq2 = Seq("ACGT", IUPAC.ambiguous_dna)
>>> str(seq1) == str(seq2)
True
>>> str(seq1) == str(seq1)
True
So, what does Biopython do? Well, as of Biopython 1.65, sequence comparison only looks at the sequence, essentially ignoring the alphabet:
>>> seq1 == seq2
True
>>> seq1 == "ACGT"
True
As an extension to this, using sequence objects as keys in a Python dictionary is now equivalent to using the sequence as a plain string for the key. See also Section 3.4.
Note if you compare sequences with incompatible alphabets (e.g. DNA vs RNA, or nucleotide versus protein), then you will get a warning but for the comparison itself only the string of letters in the sequence is used:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna, generic_protein
>>> dna_seq = Seq("ACGT", generic_dna)
>>> prot_seq = Seq("ACGT", generic_protein)
>>> dna_seq == prot_seq
BiopythonWarning: Incompatible alphabets DNAAlphabet() and ProteinAlphabet()
True
WARNING: Older versions of Biopython instead used to check if the Seq objects were the same object in memory. This is important if you need to support scripts on both old and new versions of Biopython. Here make the comparison explicit by wrapping your sequence objects with either str(...) for string based comparison or id(...) for object instance based comparison.
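The link between string-based equality and dictionary keys can be sketched with a toy class. TinySeq below is purely hypothetical and much simpler than the real Seq class (in particular it issues no warnings); it only illustrates that when equality and hashing both depend solely on the letters, an object and the equivalent plain string behave as the same dictionary key.

```python
class TinySeq:
    """Toy sequence whose equality and hash ignore the alphabet label."""

    def __init__(self, data, alphabet="generic"):
        self.data = data
        self.alphabet = alphabet  # kept for information, never compared

    def __str__(self):
        return self.data

    def __eq__(self, other):
        if isinstance(other, (TinySeq, str)):
            return str(self) == str(other)
        return NotImplemented

    def __hash__(self):
        # Must agree with __eq__: equal objects need equal hashes.
        return hash(self.data)


dna = TinySeq("ACGT", "dna")
protein = TinySeq("ACGT", "protein")
print(dna == protein)
print(dna == "ACGT")

lookup = {dna: "first entry"}
print(lookup["ACGT"])  # the plain string finds the same entry
```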
3.12 MutableSeq objects
Just like the normal Python string, the Seq object is "read only", or in Python terminology, immutable. Apart from wanting the Seq object to act like a string, this is also a useful default since in many biological applications you want to ensure you are not changing your sequence data:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)
Observe what happens if you try to edit the sequence:
>>> my_seq[5] = "G"
Traceback (most recent call last):
...
TypeError: 'Seq' object does not support item assignment
However, you can convert it into a mutable sequence (a MutableSeq object) and do pretty much anything you want with it:
>>> mutable_seq = my_seq.tomutable()
>>> mutable_seq
MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
Alternatively, you can create a MutableSeq object directly from a string:
>>> from Bio.Seq import MutableSeq
>>> from Bio.Alphabet import IUPAC
>>> mutable_seq = MutableSeq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)
Either way will give you a sequence object which can be changed:
>>> mutable_seq
MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
>>> mutable_seq[5] = "C"
>>> mutable_seq
MutableSeq('GCCATCGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
>>> mutable_seq.remove("T")
>>> mutable_seq
MutableSeq('GCCACGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
>>> mutable_seq.reverse()
>>> mutable_seq
MutableSeq('AGCCCGTGGGAAAGTCGCCGGGTAATGCACCG', IUPACUnambiguousDNA())
Do note that unlike the Seq object, the MutableSeq object's methods like reverse_complement() and reverse() act in-situ!
An important technical difference between mutable and immutable objects in Python means that you can't use a MutableSeq object as a dictionary key, but you can use a Python string or a Seq object in this way.
Once you have finished editing your MutableSeq object, it's easy to get back to a read-only Seq object should you need to:
>>> new_seq = mutable_seq.toseq()
>>> new_seq
Seq('AGCCCGTGGGAAAGTCGCCGGGTAATGCACCG', IUPACUnambiguousDNA())
You can also get a string from a MutableSeq object just like from a Seq object (Section 3.4).
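For a one-off change you may not need a mutable object at all: with immutable data the usual idiom is to build a new sequence by slicing around the changed position. The sketch below does this for plain strings; point_mutation is a hypothetical helper, not part of Biopython.

```python
def point_mutation(sequence, position, new_letter):
    """Return a copy of sequence with one letter replaced.

    The original string is left untouched, because strings are immutable.
    """
    if not 0 <= position < len(sequence):
        raise IndexError("position out of range")
    if len(new_letter) != 1:
        raise ValueError("new_letter must be a single character")
    return sequence[:position] + new_letter + sequence[position + 1:]


original = "GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA"
mutated = point_mutation(original, 5, "C")
print(original)
print(mutated)
```

This gives the same letters as setting mutable_seq[5] = "C" above, at the cost of copying the whole string, which is why a MutableSeq is the better fit when you make many edits.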
3.13 UnknownSeq objects
The UnknownSeq object is a subclass of the basic Seq object and its purpose is to represent a sequence where we know the length, but not the actual letters making it up. You could of course use a normal Seq object in this situation, but it wastes rather a lot of memory to hold a string of a million "N" characters when you could just store a single letter "N" and the desired length as an integer.
>>> from Bio.Seq import UnknownSeq
>>> unk = UnknownSeq(20)
>>> unk
UnknownSeq(20, alphabet = Alphabet(), character = '?')
>>> print(unk)
????????????????????
>>> len(unk)
20
You can of course specify an alphabet, meaning for nucleotide sequences the letter defaults to "N" and for proteins "X", rather than just "?".
>>> from Bio.Seq import UnknownSeq
>>> from Bio.Alphabet import IUPAC
>>> unk_dna = UnknownSeq(20, alphabet=IUPAC.ambiguous_dna)
>>> unk_dna
UnknownSeq(20, alphabet = IUPACAmbiguousDNA(), character = 'N')
>>> print(unk_dna)
NNNNNNNNNNNNNNNNNNNN
You can use all the usual Seq object methods too. Note that these give back memory saving UnknownSeq objects where appropriate, as you might expect:
>>> unk_dna
UnknownSeq(20, alphabet = IUPACAmbiguousDNA(), character = 'N')
>>> unk_dna.complement()
UnknownSeq(20, alphabet = IUPACAmbiguousDNA(), character = 'N')
>>> unk_dna.reverse_complement()
UnknownSeq(20, alphabet = IUPACAmbiguousDNA(), character = 'N')
>>> unk_dna.transcribe()
UnknownSeq(20, alphabet = IUPACAmbiguousRNA(), character = 'N')
>>> unk_protein = unk_dna.translate()
>>> unk_protein
UnknownSeq(6, alphabet = ProteinAlphabet(), character = 'X')
>>> print(unk_protein)
XXXXXX
>>> len(unk_protein)
6
You may be able to find a use for the UnknownSeq object in your own code, but it is more likely that you will first come across them in a SeqRecord object created by Bio.SeqIO (see Chapter 5). Some sequence file formats don't always include the actual sequence, for example GenBank and EMBL files may include a list of features but for the sequence just present the contig information. Alternatively, the QUAL files used in sequencing work hold quality scores but they never contain a sequence; instead there is a partner FASTA file which does have the sequence.
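The memory-saving idea is simple enough to sketch directly: store the length and one character, and only build the full string on demand. The class CompactUnknown below is a hypothetical illustration of that design, not the UnknownSeq implementation, and it supports only length, string conversion, indexing and slicing.

```python
class CompactUnknown:
    """Placeholder sequence storing a length and one repeated character."""

    def __init__(self, length, character="?"):
        if length < 0:
            raise ValueError("length must not be negative")
        self._length = length
        self._character = character

    def __len__(self):
        return self._length

    def __str__(self):
        # The full string is only created when somebody asks for it.
        return self._character * self._length

    def __getitem__(self, index):
        if isinstance(index, slice):
            # A slice of unknown letters is a shorter run of unknown letters.
            new_length = len(range(*index.indices(self._length)))
            return CompactUnknown(new_length, self._character)
        if not -self._length <= index < self._length:
            raise IndexError("index out of range")
        return self._character


unk = CompactUnknown(20, "N")
print(len(unk))
print(str(unk))
print(len(unk[5:10]))
```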
3.14 Working with strings directly
To close this chapter, for those of you who really don't want to use the sequence objects (or who prefer a functional programming style to an object orientated one), there are module level functions in Bio.Seq which will accept plain Python strings, Seq objects (including UnknownSeq objects) or MutableSeq objects:
>>> from Bio.Seq import reverse_complement, transcribe, back_transcribe, translate
>>> my_string = "GCTGTTATGGGTCGTTGGAAGGGTGGTCGTGCTGCTGGTTAG"
>>> reverse_complement(my_string)
'CTAACCAGCAGCACGACCACCCTTCCAACGACCCATAACAGC'
>>> transcribe(my_string)
'GCUGUUAUGGGUCGUUGGAAGGGUGGUCGUGCUGCUGGUUAG'
>>> back_transcribe(my_string)
'GCTGTTATGGGTCGTTGGAAGGGTGGTCGTGCTGCTGGTTAG'
>>> translate(my_string)
'AVMGRWKGGRAAG*'
You are, however, encouraged to work with Seq objects by default.
Chapter 4 Sequence annotation objects
Chapter 3 introduced the sequence classes. Immediately "above" the Seq class is the Sequence Record or SeqRecord class, defined in the Bio.SeqRecord module. This class allows higher level features such as identifiers and features (as SeqFeature objects) to be associated with the sequence, and is used throughout the sequence input/output interface Bio.SeqIO described fully in Chapter 5.
If you are only going to be working with simple data like FASTA files, you can probably skip this chapter for now. If on the other hand you are going to be using richly annotated sequence data, say from GenBank or EMBL files, this information is quite important.
While this chapter should cover most things to do with the SeqRecord and SeqFeature objects, you may also want to read the SeqRecord wiki page (http://biopython.org/wiki/SeqRecord), and the built-in documentation (also available online for SeqRecord and SeqFeature):
>>> from Bio.SeqRecord import SeqRecord
>>> help(SeqRecord)
...
4.1 The SeqRecord object
The SeqRecord (Sequence Record) class is defined in the Bio.SeqRecord module. This class allows higher level features such as identifiers and features to be associated with a sequence (see Chapter 3), and is the basic data type for the Bio.SeqIO sequence input/output interface (see Chapter 5).
The SeqRecord class itself is quite simple, and offers the following information as attributes:
.seq
The sequence itself, typically a Seq object.
.id
The primary ID used to identify the sequence, a string. In most cases this is something like an accession number.
.name
A "common" name/id for the sequence, a string. In some cases this will be the same as the accession number, but it could also be a clone name. I think of this as being analogous to the LOCUS id in a GenBank record.
.description
A human readable description or expressive name for the sequence, a string.
.letter_annotations
Holds per-letter-annotations using a (restricted) dictionary of additional information about the letters in the sequence. The keys are the name of the information, and the information is contained in the value as a Python sequence (i.e. a list, tuple or string) with the same length as the sequence itself. This is often used for quality scores (e.g. Section 20.1.6) or secondary structure information (e.g. from Stockholm/PFAM alignment files).
.annotations
A dictionary of additional information about the sequence. The keys are the name of the information, and the information is contained in the value. This allows the addition of more "unstructured" information to the sequence.
.features
A list of SeqFeature objects with more structured information about the features on a sequence (e.g. position of genes on a genome, or domains on a protein sequence). The structure of sequence features is described below in Section 4.3.
.dbxrefs
A list of database cross-references as strings.
4.2 Creating a SeqRecord
Using a SeqRecord object is not very complicated, since all of the information is presented as attributes of the class. Usually you won't create a SeqRecord "by hand", but instead use Bio.SeqIO to read in a sequence file for you (see Chapter 5 and the examples below). However, creating a SeqRecord can be quite simple.
4.2.1 SeqRecord objects from scratch
To create a SeqRecord at a minimum you just need a Seq object:
>>> from Bio.Seq import Seq
>>> simple_seq = Seq("GATC")
>>> from Bio.SeqRecord import SeqRecord
>>> simple_seq_r = SeqRecord(simple_seq)
Additionally, you can also pass the id, name and description to the initialization function, but if not they will be set as strings indicating they are unknown, and can be modified subsequently:
>>> simple_seq_r.id
'<unknown id>'
>>> simple_seq_r.id = "AC12345"
>>> simple_seq_r.description = "Made up sequence I wish I could write a paper about"
>>> print(simple_seq_r.description)
Made up sequence I wish I could write a paper about
>>> simple_seq_r.seq
Seq('GATC', Alphabet())
Including an identifier is very important if you want to output your SeqRecord to a file. You would normally include this when creating the object:
>>> from Bio.Seq import Seq
>>> simple_seq = Seq("GATC")
>>> from Bio.SeqRecord import SeqRecord
>>> simple_seq_r = SeqRecord(simple_seq, id="AC12345")
As mentioned above, the SeqRecord has a dictionary attribute annotations. This is used for any miscellaneous annotations that don't fit under one of the other more specific attributes. Adding annotations is easy, and just involves dealing directly with the annotation dictionary:
>>> simple_seq_r.annotations["evidence"] = "None. I just made it up."
>>> print(simple_seq_r.annotations)
{'evidence': 'None. I just made it up.'}
>>> print(simple_seq_r.annotations["evidence"])
None. I just made it up.
Working with per-letter-annotations is similar; letter_annotations is a dictionary like attribute which will let you assign any Python sequence (i.e. a string, list or tuple) which has the same length as the sequence:
>>> simple_seq_r.letter_annotations["phred_quality"] = [40, 40, 38, 30]
>>> print(simple_seq_r.letter_annotations)
{'phred_quality': [40, 40, 38, 30]}
>>> print(simple_seq_r.letter_annotations["phred_quality"])
[40, 40, 38, 30]
The dbxrefs and features attributes are just Python lists, and should be used to store strings and SeqFeature objects (discussed later in this chapter) respectively.
4.2.2 SeqRecord objects from FASTA files
This example uses a fairly large FASTA file containing the whole sequence for Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, originally downloaded from the NCBI. This file is included with the Biopython unit tests under the GenBank folder, or available online as NC_005816.fna from our website.
The file starts like this, and you can check there is only one record present (i.e. only one line starting with a greater than symbol):
>gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus ... pPCP1, complete sequence
TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCC
...
Back in Chapter 2 you will have seen the function Bio.SeqIO.parse(...) used to loop over all the records in a file as SeqRecord objects. The Bio.SeqIO module has a sister function for use on files which contain just one record which we'll use here (see Chapter 5 for details):
>>> from Bio import SeqIO
>>> record = SeqIO.read("NC_005816.fna", "fasta")
>>> record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG',
SingleLetterAlphabet()), id='gi|45478711|ref|NC_005816.1|', name='gi|45478711|ref|NC_005816.1|',
description='gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus ... sequence',
dbxrefs=[])
Now, let's have a look at the key attributes of this SeqRecord individually, starting with the seq attribute which gives you a Seq object:
>>> record.seq
Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', SingleLetterAlphabet())
Here Bio.SeqIO has defaulted to a generic alphabet, rather than guessing that this is DNA. If you know in advance what kind of sequence your FASTA file contains, you can tell Bio.SeqIO which alphabet to use (see Chapter 5).
Next, the identifiers and description:
>>> record.id
'gi|45478711|ref|NC_005816.1|'
>>> record.name
'gi|45478711|ref|NC_005816.1|'
>>> record.description
'gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus ... pPCP1, complete sequence'
As you can see above, the first word of the FASTA record's title line (after removing the greater than symbol) is used for both the id and name attributes. The whole title line (after removing the greater than symbol) is used for the record description. This is deliberate, partly for backwards compatibility reasons, but it also makes sense if you have a FASTA file like this:
>Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1
TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCC
...
Note that none of the other annotation attributes get populated when reading a FASTA file:
>>> record.dbxrefs
[]
>>> record.annotations
{}
>>> record.letter_annotations
{}
>>> record.features
[]
In this case our example FASTA file was from the NCBI, and they have a fairly well defined set of conventions for formatting their FASTA lines. This means it would be possible to parse this information and extract the GI number and accession for example. However, FASTA files from other sources vary, so this isn't possible in general.
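As a rough illustration of that kind of parsing, the sketch below splits an identifier of the form gi|number|database|accession| on the vertical bars. The function parse_ncbi_title and the dictionary keys it returns are assumptions made for this example; it only recognises this one layout and returns None for anything else, which reflects how fragile such title-line parsing is.

```python
def parse_ncbi_title(title):
    """Extract GI number and accession from an NCBI style FASTA title.

    title is the title line without the leading greater than symbol.
    Returns a dictionary, or None if the first word is not in the
    gi|number|database|accession| layout.
    """
    words = title.split(None, 1)
    if not words:
        return None
    fields = words[0].split("|")
    if len(fields) >= 4 and fields[0] == "gi":
        return {"gi": fields[1], "database": fields[2], "accession": fields[3]}
    return None


title = "gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus"
print(parse_ncbi_title(title))
print(parse_ncbi_title("Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1"))
```

With a SeqRecord read from this file you could pass record.description (or just record.id) to such a function.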
4.2.3 SeqRecord objects from GenBank files
As in the previous example, we're going to look at the whole sequence for Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, originally downloaded from the NCBI, but this time as a GenBank file. Again, this file is included with the Biopython unit tests under the GenBank folder, or available online as NC_005816.gb from our website.
This file contains a single record (i.e. only one LOCUS line) and starts:
LOCUS       NC_005816               9609 bp    DNA     circular BCT 21-JUL-2008
DEFINITION  Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete
            sequence.
ACCESSION   NC_005816
VERSION     NC_005816.1  GI:45478711
PROJECT     GenomeProject:10638
...
Again, we'll use Bio.SeqIO to read this file in, and the code is almost identical to that used above for the FASTA file (see Chapter 5 for details):
>>> from Bio import SeqIO
>>> record = SeqIO.read("NC_005816.gb", "genbank")
>>> record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG',
IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816',
description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.',
dbxrefs=['Project:10638'])
You should be able to spot some differences already! Taking the attributes individually, the sequence string is the same as before, but this time Bio.SeqIO has been able to automatically assign a more specific alphabet (see Chapter 5 for details):
>>> record.seq
Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', IUPACAmbiguousDNA())
The name comes from the LOCUS line, while the id includes the version suffix. The description comes from the DEFINITION line:
>>> record.id
'NC_005816.1'
>>> record.name
'NC_005816'
>>> record.description
'Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.'
GenBank files don't have any per-letter annotations:
>>> record.letter_annotations
{}
Most of the annotations information gets recorded in the annotations dictionary, for example:
>>> len(record.annotations)
11
>>> record.annotations["source"]
'Yersinia pestis biovar Microtus str. 91001'
The dbxrefs list gets populated from any PROJECT or DBLINK lines:
>>> record.dbxrefs
['Project:10638']
Finally, and perhaps most interestingly, all the entries in the features table (e.g. the genes or CDS features) get recorded as SeqFeature objects in the features list.
>>> len(record.features)
29
We'll talk about SeqFeature objects next, in Section 4.3.
4.3 Feature, location and position objects
4.3.1 SeqFeature objects
Sequence features are an essential part of describing a sequence. Once you get beyond the sequence itself, you need some way to organize and easily get at the more abstract information that is known about the sequence. While it is probably impossible to develop a general sequence feature class that will cover everything, the Biopython SeqFeature class attempts to encapsulate as much of the information about the sequence as possible. The design is heavily based on the GenBank/EMBL feature tables, so if you understand how they look, you'll probably have an easier time grasping the structure of the Biopython classes.
The key idea about each SeqFeature object is to describe a region on a parent sequence, typically a SeqRecord object. That region is described with a location object, typically a range between two positions (see Section 4.3.2 below).
The SeqFeature class has a number of attributes, so first we'll list them and their general features, and then later in the chapter work through examples to show how this applies to a real life example. The attributes of a SeqFeature are:
.type
This is a textual description of the type of feature (for instance, this will be something like 'CDS' or 'gene').
.location
The location of the SeqFeature on the sequence that you are dealing with, see Section 4.3.2 below. The SeqFeature delegates much of its functionality to the location object, and includes a number of shortcut attributes for properties of the location:
.ref
shorthand for .location.ref: any (different) reference sequence the location is referring to. Usually just None.
.ref_db
shorthand for .location.ref_db: specifies the database any identifier in .ref refers to. Usually just None.
.strand
shorthand for .location.strand: the strand on the sequence that the feature is located on. For double stranded nucleotide sequence this may either be 1 for the top strand, -1 for the bottom strand, 0 if the strand is important but is unknown, or None if it doesn't matter. This is None for proteins, or single stranded sequences.
.qualifiers
This is a Python dictionary of additional information about the feature. The key is some kind of terse one-word description of what the information contained in the value is about, and the value is the actual information. For example, a common key for a qualifier might be "evidence" and the value might be "computational (non-experimental)". This is just a way to let the person who is looking at the feature know that it has not been experimentally (i.e. in a wet lab) confirmed. Note that the value will be a list of strings (even when there is only one string). This is a reflection of the feature tables in GenBank/EMBL files.
.sub_features
This used to be used to represent features with complicated locations like 'joins' in GenBank/EMBL files. This has been deprecated with the introduction of the CompoundLocation object, and should now be ignored.
4.3.2 Positions and locations
The key idea about each SeqFeature object is to describe a region on a parent sequence, for which we use a location object, typically describing a range between two positions. To try to clarify the terminology we're using:
position
This refers to a single position on a sequence, which may be fuzzy or not. For instance, 5, 20, <100 and >200 are all positions.
location
A location is a region of sequence bounded by some positions. For instance 5..20 (i.e. 5 to 20) is a location.
I just mention this because sometimes I get confused between the two.
4.3.2.1 FeatureLocation object
Unless you work with eukaryotic genes, most SeqFeature locations are extremely simple: you just need start and end coordinates and a strand. That's essentially all the basic FeatureLocation object does.
In practice of course, things can be more complicated. First of all we have to handle compound locations made up of several regions. Secondly, the positions themselves may be fuzzy (inexact).
4.3.2.2 CompoundLocation object
Biopython 1.62 introduced the CompoundLocation as part of a restructuring of how complex locations made up of multiple regions are represented. The main usage is for handling 'join' locations in EMBL/GenBank files.
4.3.2.3 Fuzzy Positions
So far we've only used simple positions. One complication in dealing with feature locations comes in the positions themselves. In biology many times things aren't entirely certain (as much as us wet lab biologists try to make them certain!). For instance, you might do a dinucleotide priming experiment and discover that the start of an mRNA transcript is at one of two sites. This is very useful information, but the complication comes in how to represent this as a position. To help us deal with this, we have the concept of fuzzy positions. Basically there are several types of fuzzy positions, so we have several classes to deal with them:
ExactPosition
As its name suggests, this class represents a position which is specified as exact along the sequence. This is represented as just a number, and you can get the position by looking at the position attribute of the object.
BeforePosition
This class represents a fuzzy position that occurs prior to some specified site. In GenBank/EMBL notation, this is represented as something like '<13', signifying that the real position is located somewhere less than 13. To get the specified upper boundary, look at the position attribute of the object.
AfterPosition
Contrary to BeforePosition, this class represents a position that occurs after some specified site. This is represented in GenBank as '>13', and like BeforePosition, you get the boundary number by looking at the position attribute of the object.
WithinPosition
Occasionally used for GenBank/EMBL locations, this class models a position which occurs somewhere between two specified nucleotides. In GenBank/EMBL notation, this would be represented as '(1.5)', to represent that the position is somewhere within the range 1 to 5. To get the information in this class you have to look at two attributes. The position attribute specifies the lower boundary of the range we are looking at, so in our example case this would be one. The extension attribute specifies the range to the higher boundary, so in this case it would be 4. So object.position is the lower boundary and object.position + object.extension is the upper boundary.
OneOfPosition
Occasionally used for GenBank/EMBL locations, this class deals with a position where several possible values exist, for instance you could use this if the start codon was unclear and there were two candidates for the start of the gene. Alternatively, that might be handled explicitly as two related gene features.
UnknownPosition
This class deals with a position of unknown location. This is not used in GenBank/EMBL, but corresponds to the '?' feature coordinate used in UniProt.
Here's an example where we create a location with fuzzy end points:
>>> from Bio import SeqFeature
>>> start_pos = SeqFeature.AfterPosition(5)
>>> end_pos = SeqFeature.BetweenPosition(9, left=8, right=9)
>>> my_location = SeqFeature.FeatureLocation(start_pos, end_pos)
Note that the details of some of the fuzzy locations changed in Biopython 1.59, in particular for BetweenPosition and WithinPosition you must now make it explicit which integer position should be used for slicing etc. For a start position this is generally the lower (left) value, while for an end position this would generally be the higher (right) value.
If you print out a FeatureLocation object, you can get a nice representation of the information:
>>> print(my_location)
[>5:(8^9)]
We can access the fuzzy start and end positions using the start and end attributes of the location:
>>> my_location.start
AfterPosition(5)
>>> print(my_location.start)
>5
>>> my_location.end
BetweenPosition(9, left=8, right=9)
>>> print(my_location.end)
(8^9)
If you don't want to deal with fuzzy positions and just want numbers, they are actually subclasses of integers so should work like integers:
>>> int(my_location.start)
5
>>> int(my_location.end)
9
For compatibility with older versions of Biopython you can ask for the nofuzzy_start and nofuzzy_end attributes of the location, which are plain integers:
>>> my_location.nofuzzy_start
5
>>> my_location.nofuzzy_end
9
Notice that this just gives you back the position attributes of the fuzzy locations.
Similarly, to make it easy to create a location without worrying about fuzzy positions, you can just pass in numbers to the FeatureLocation constructor, and you'll get back out ExactPosition objects:
>>> exact_location = SeqFeature.FeatureLocation(5, 9)
>>> print(exact_location)
[5:9]
>>> exact_location.start
ExactPosition(5)
>>> int(exact_location.start)
5
>>> exact_location.nofuzzy_start
5
That is most of the nitty gritty about dealing with fuzzy positions in Biopython. It has been designed so that dealing with fuzziness is not that much more complicated than dealing with exact positions, and hopefully you find that true!
4.3.2.4 Location testing
You can use the Python keyword in with a SeqFeature or location object to see if the base/residue for a parent coordinate is within the feature/location or not.
For example, suppose you have a SNP of interest and you want to know which features this SNP is within, and let's suppose this SNP is at index 4350 (Python counting!). Here is a simple brute force solution where we just check all the features one by one in a loop:
>>> my_snp = 4350
>>> for feature in record.features:
...     if my_snp in feature:
...         print("%s %s" % (feature.type, feature.qualifiers.get('db_xref')))
...
source ['taxon:229193']
gene ['GeneID:2767712']
CDS ['GI:45478716', 'GeneID:2767712']
Note that gene and CDS features from GenBank or EMBL files defined with joins are the union of the exons: they do not cover any introns.
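To make the last point concrete, membership in a joined location can be pictured as testing each constituent region in turn. The sketch below is plain Python rather than Biopython code: position_in_parts is a hypothetical helper, each region is assumed to be a (start, end) tuple in Python counting, and the two-exon coordinates are made up for illustration.

```python
def position_in_parts(position, parts):
    """True if position lies in any (start, end) region (0-based, end exclusive).

    Hypothetical helper; a simple location is a list with one region,
    a join is a list with several.
    """
    return any(start <= position < end for start, end in parts)


pim_gene = [(4342, 4780)]             # simple location from the example record
two_exons = [(100, 200), (300, 400)]  # made-up join with an intron at 200:300

print(position_in_parts(4350, pim_gene))  # SNP inside the gene
print(position_in_parts(250, two_exons))  # position inside the intron
```

A position in the intron is reported as outside the location, matching the union-of-exons behaviour described above.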
4.3.3 Sequence described by a feature or location
A SeqFeature or location object doesn't directly contain a sequence, instead the location (see Section 4.3.2) describes how to get this from the parent sequence. For example consider a (short) gene sequence with location 5:18 on the reverse strand, which in GenBank/EMBL notation using 1-based counting would be complement(6..18), like this:
>>> from Bio.Seq import Seq
>>> from Bio.SeqFeature import SeqFeature, FeatureLocation
>>> example_parent = Seq("ACCGAGACGGCAAAGGCTAGCATAGGTATGAGACTTCCTTCCTGCCAGTGCTGAGGAACTGGGAGCCTAC")
>>> example_feature = SeqFeature(FeatureLocation(5, 18), type="gene", strand=-1)
You could take the parent sequence, slice it to extract 5:18, and then take the reverse complement. If you are using Biopython 1.59 or later, the feature location's start and end are integer-like so this works:
>>> feature_seq = example_parent[example_feature.location.start:example_feature.location.end].reverse_complement()
>>> print(feature_seq)
AGCCTTTGCCGTC
This is a simple example so this isn't too bad. However, once you have to deal with compound features (joins) this is rather messy. Instead, the SeqFeature object has an extract method to take care of all this:
>>> feature_seq = example_feature.extract(example_parent)
>>> print(feature_seq)
AGCCTTTGCCGTC
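For readers who want to see the slice-then-reverse-complement step spelled out, here is a minimal sketch on a plain Python string. It is not how Biopython implements extract; the helper name extract_region and the translation table are assumptions for illustration, and only unambiguous DNA letters are handled.

```python
# Translation table for complementing unambiguous DNA (illustrative only).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract_region(parent, start, end, strand=1):
    """Return parent[start:end], reverse complemented if strand is -1."""
    region = parent[start:end]
    if strand == -1:
        # Complement each base, then reverse the string.
        region = region.translate(COMPLEMENT)[::-1]
    return region


parent = "ACCGAGACGGCAAAGGCTAGCATAGGTATGAGACTTCCTTCCTGCCAGTGCTGAGGAACTGGGAGCCTAC"
print(extract_region(parent, 5, 18, strand=-1))  # AGCCTTTGCCGTC
```

The result agrees with the feature sequence shown above for location 5:18 on the reverse strand.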
The length of a SeqFeature or location matches that of the region of sequence it describes.
>>> print(example_feature.extract(example_parent))
AGCCTTTGCCGTC
>>> print(len(example_feature.extract(example_parent)))
13
>>> print(len(example_feature))
13
>>> print(len(example_feature.location))
13
For simple FeatureLocation objects the length is just the difference between the start and end positions. However, for a CompoundLocation the length is the sum of the constituent regions.
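The arithmetic behind these lengths, and behind the conversion between GenBank/EMBL coordinates and Python coordinates, can be sketched in a few lines. The helper names genbank_to_python and location_length are hypothetical (not Biopython functions), and the join(1..3,7..10) example is made up for illustration.

```python
def genbank_to_python(start, end):
    """Convert 1-based inclusive coordinates to 0-based, end-exclusive ones."""
    return start - 1, end


def location_length(parts):
    """Sum of region lengths; a simple location is a list with one region."""
    return sum(end - start for start, end in parts)


print(genbank_to_python(6, 18))        # (5, 18), i.e. complement(6..18)
print(location_length([(5, 18)]))      # 13, the simple location above

# A made-up join(1..3,7..10) becomes two regions in Python counting:
join_parts = [genbank_to_python(1, 3), genbank_to_python(7, 10)]
print(location_length(join_parts))     # 3 + 4 = 7
```

Only the start changes in the conversion, because the 1-based inclusive end and the 0-based exclusive end are the same number.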
4.4 Comparison
The SeqRecord objects can be very complex, but here's a simple example:
>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord
>>> record1 = SeqRecord(Seq("ACGT"), id="test")
>>> record2 = SeqRecord(Seq("ACGT"), id="test")
What happens when you try to compare these "identical" records?
>>> record1 == record2
...
Perhaps surprisingly, older versions of Biopython would use Python's default object comparison for the SeqRecord, meaning record1 == record2 would only return True if these variables pointed at the same object in memory. In this example, record1 == record2 would have returned False!
>>> record1 == record2  # on old versions of Biopython!
False
As of Biopython 1.67, SeqRecord comparison like record1 == record2 will instead raise an explicit error to avoid people being caught out by this:
>>> record1 == record2
...
NotImplementedError: SeqRecord comparison is deliberately not implemented. Explicitly compare the attributes of interest.
Instead you should check the attributes you are interested in, for example the identifier and the sequence:
>>> record1.id == record2.id
True
>>> record1.seq == record2.seq
True
Beware that comparing complex objects quickly gets complicated (see also Section 3.11).
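If you find yourself comparing the same attributes repeatedly, one option is a small helper that names them explicitly. This is only a sketch: records_match is a hypothetical function, not part of Biopython, and the stand-in objects below are built with types.SimpleNamespace so the example runs without Biopython installed.

```python
from types import SimpleNamespace


def records_match(first, second, attributes=("id", "seq")):
    """Compare two record-like objects on an explicit list of attributes."""
    return all(getattr(first, name) == getattr(second, name)
               for name in attributes)


# Stand-ins for SeqRecord objects (plain strings instead of Seq objects).
record1 = SimpleNamespace(id="test", seq="ACGT", description="first copy")
record2 = SimpleNamespace(id="test", seq="ACGT", description="second copy")

print(records_match(record1, record2))                              # id and seq agree
print(records_match(record1, record2, ("id", "seq", "description")))  # descriptions differ
```

Making the attribute list a parameter keeps the decision about what "equal" means visible at the call site.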
4.5 References
Another common annotation related to a sequence is a reference to a journal or other published work dealing with the sequence. We have a fairly simple way of representing a Reference in Biopython: we have a Bio.SeqFeature.Reference class that stores the relevant information about a reference as attributes of an object.
The attributes include things that you would expect to see in a reference like journal, title and authors. Additionally, it also can hold the medline_id and pubmed_id and a comment about the reference. These are all accessed simply as attributes of the object.
A reference also has a location object so that it can specify a particular location on the sequence that the reference refers to. For instance, you might have a journal that is dealing with a particular gene located on a BAC, and want to specify that it only refers to this position exactly. The location is a potentially fuzzy location, as described in Section 4.3.2.
Any reference objects are stored as a list in the SeqRecord object's annotations dictionary under the key "references". That's all there is to it. References are meant to be easy to deal with, and hopefully general enough to cover lots of usage cases.
4.6 The format method
The format() method of the SeqRecord class gives a string containing your record formatted using one of the output file formats supported by Bio.SeqIO, such as FASTA:
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein
record = SeqRecord(Seq("MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD" \
                      +"GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK" \
                      +"NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM" \
                      +"SSAC", generic_protein),
                   id="gi|14150838|gb|AAK54648.1|AF376133_1",
                   description="chalcone synthase [Cucumis sativus]")
print(record.format("fasta"))
which should give:
>gi|14150838|gb|AAK54648.1|AF376133_1 chalcone synthase [Cucumis sativus]
MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD
GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK
NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM
SSAC
This format method takes a single mandatory argument, a lower case string which is supported by Bio.SeqIO as an output format (see Chapter 5). However, some of the file formats Bio.SeqIO can write to require more than one record (typically the case for multiple sequence alignment formats), and thus won't work via this format() method. See also Section 5.5.4.
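To show what the FASTA output above consists of, here is a minimal standard-library sketch that builds the same kind of string by hand: a title line made from the identifier and description, followed by the sequence wrapped at 60 letters per line. The function name to_fasta is hypothetical and this is not Biopython's implementation; the 60-column width is taken from the output shown above.

```python
def to_fasta(identifier, description, sequence, width=60):
    """Return a FASTA-formatted string with the sequence wrapped at width."""
    title = identifier
    if description:
        title = "%s %s" % (identifier, description)
    # Cut the sequence into chunks of at most `width` letters.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">%s\n%s\n" % (title, "\n".join(lines))


print(to_fasta("example_id", "made-up description", "ACGT" * 35))
```

In real code, record.format("fasta") is the better choice because it stays consistent with what Bio.SeqIO writes to files.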
4.7 Slicing a SeqRecord
You can slice a SeqRecord, to give you a new SeqRecord covering just part of the sequence. What is important here is that any per-letter annotations are also sliced, and any features which fall completely within the new sequence are preserved (with their locations adjusted).
For example, taking the same GenBank file used earlier:
>>> record
description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence',
>>> len(record)
9609
>>> len(record.features)
41
For this example we're going to focus in on the pim gene, YP_pPCP05. If you have a look at the GenBank file directly you'll find this gene/CDS has location string 4343..4780, or in Python counting 4342:4780. In Python zero-based counting, the gene and CDS are entries 20 and 21 in the features list:
>>> print(record.features[20])
type: gene
location: [4342:4780](+)
qualifiers:
    Key: db_xref, Value: ['GeneID:2767712']
    Key: gene, Value: ['pim']
    Key: locus_tag, Value: ['YP_pPCP05']
<BLANKLINE>
>>> print(record.features[21])
type: CDS
location: [4342:4780](+)
qualifiers:
    Key: codon_start, Value: ['1']
    Key: db_xref, Value: ['GI:45478716', 'GeneID:2767712']
    Key: note, Value: ['similar to many previously sequenced pesticin immunity ...']
    Key: product, Value: ['pesticin immunity protein']
    Key: protein_id, Value: ['NP_995571.1']
    Key: transl_table, Value: ['11']
    Key: translation, Value: ['MGGGMISKLFCLALIFLSSSGLAEKNTYTAKDILQNLELNTFGNSLSH...']
Let's slice this parent record from 4300 to 4800 (enough to include the pim gene/CDS), and see how many features we get:
>>> sub_record = record[4300:4800]
>>> sub_record
SeqRecord(seq=Seq('ATAAATAGATTATTCCAAATAATTTATTTATGTAAGAACAGGATGGGAGGGGGA...TTA',
dbxrefs=[])
>>> len(sub_record)
500
>>> len(sub_record.features)
2
Our sub-record just has two features, the gene and CDS entries for YP_pPCP05:
>>> print(sub_record.features[0])
type: gene
location: [42:480](+)
qualifiers:
    Key: db_xref, Value: ['GeneID:2767712']
    Key: gene, Value: ['pim']
    Key: locus_tag, Value: ['YP_pPCP05']
<BLANKLINE>
>>> print(sub_record.features[1])
type: CDS
location: [42:480](+)
qualifiers:
    Key: codon_start, Value: ['1']
    Key: db_xref, Value: ['GI:45478716', 'GeneID:2767712']
    Key: note, Value: ['similar to many previously sequenced pesticin immunity ...']
    Key: product, Value: ['pesticin immunity protein']
    Key: protein_id, Value: ['NP_995571.1']
    Key: transl_table, Value: ['11']
    Key: translation, Value: ['MGGGMISKLFCLALIFLSSSGLAEKNTYTAKDILQNLELNTFGNSLSH...']
Notice that their locations have been adjusted to reflect the new parent sequence!
While Biopython has done something sensible and hopefully intuitive with the features (and any per-letter annotation), for the other annotation it is impossible to know if this still applies to the sub-sequence or not. To avoid guessing, the annotations and dbxrefs are omitted from the sub-record, and it is up to you to transfer any relevant information as appropriate.
>>> sub_record.annotations
{}
>>> sub_record.dbxrefs
[]
The same point could be made about the record id, name and description, but for practicality these are preserved:
>>> sub_record.id
'NC_005816.1'
>>> sub_record.name
'NC_005816'
>>> sub_record.description
'Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence'
This illustrates the problem nicely though: our new sub-record is not the complete sequence of the plasmid, so the description is wrong! Let's fix this and then view the sub-record as a reduced GenBank file using the format method described above in Section 4.6:
>>> sub_record.description = "Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, partial."
>>> print(sub_record.format("genbank"))
...
See Sections 20.1.7 and 20.1.8 for some FASTQ examples where the per-letter annotations (the read quality scores) are also sliced.
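The rule for which features survive a slice, and how their coordinates change, can be sketched with plain tuples. This is an illustration of the behaviour described in this section rather than Biopython's own code: slice_features is a hypothetical helper, features are assumed to be (start, end, type) tuples in Python counting, and the last feature in the list uses made-up coordinates.

```python
def slice_features(features, start, end):
    """Keep features lying completely within start:end, shifted to the new origin."""
    kept = []
    for f_start, f_end, f_type in features:
        if start <= f_start and f_end <= end:
            kept.append((f_start - start, f_end - start, f_type))
    return kept


features = [
    (0, 9609, "source"),   # covers the whole plasmid, so it cannot fit in a slice
    (4342, 4780, "gene"),  # pim gene from the example above
    (4342, 4780, "CDS"),
    (4790, 5900, "gene"),  # made-up neighbour that overlaps the slice end
]
print(slice_features(features, 4300, 4800))
```

Only the pim gene and CDS are kept, now at 42:480, which matches the sub-record shown above.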
4.8 Adding SeqRecord objects
You can add SeqRecord objects together, giving a new SeqRecord. What is important here is that any common per-letter annotations are also added, all the features are preserved (with their locations adjusted), and any other common annotation is also kept (like the id, name and description).
For an example with per-letter annotation, we'll use the first record in a FASTQ file. Chapter 5 will explain the SeqIO functions:
>>> from Bio import SeqIO
>>> record = next(SeqIO.parse("example.fastq", "fastq"))
>>> len(record)
25
>>> print(record.seq)
CCCTTCTTGTCTTCAGCGTTTCTCC
>>> print(record.letter_annotations["phred_quality"])
[26, 26, 18, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 22, 26, 26, 26, 26,
26, 26, 26, 23, 23]
Let's suppose this was Roche 454 data, and that from other information you think the TTT should be only TT. We can make a new edited record by first slicing the SeqRecord before and after the "extra" third T:
>>> left = record[:20]
>>> print(left.seq)
CCCTTCTTGTCTTCAGCGTT
>>> print(left.letter_annotations["phred_quality"])
[26, 26, 18, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 22, 26, 26, 26, 26]
>>> right = record[21:]
>>> print(right.seq)
CTCC
>>> print(right.letter_annotations["phred_quality"])
[26, 26, 23, 23]
Now add the two parts together:
>>> edited = left + right
>>> len(edited)
24
>>> print(edited.seq)
CCCTTCTTGTCTTCAGCGTTCTCC
>>> print(edited.letter_annotations["phred_quality"])
[26, 26, 18, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 22, 26, 26, 26, 26,
26, 26, 23, 23]
Easy and intuitive? We hope so! You can make this shorter with just:
>>> edited = record[:20] + record[21:]
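What the slice-and-add does to the sequence and its quality scores can be mimicked with an ordinary string and list, which may help when checking an edit by hand. The helper drop_letter is hypothetical; the sequence and scores are the ones from the FASTQ record above.

```python
def drop_letter(sequence, qualities, index):
    """Remove the letter at index, keeping the per-letter scores in step."""
    if len(sequence) != len(qualities):
        raise ValueError("need exactly one quality score per letter")
    new_sequence = sequence[:index] + sequence[index + 1:]
    new_qualities = qualities[:index] + qualities[index + 1:]
    return new_sequence, new_qualities


seq = "CCCTTCTTGTCTTCAGCGTTTCTCC"
quals = [26, 26, 18] + [26] * 12 + [22] + [26] * 7 + [23, 23]
edited_seq, edited_quals = drop_letter(seq, quals, 20)
print(edited_seq)         # CCCTTCTTGTCTTCAGCGTTCTCC
print(len(edited_quals))  # 24
```

The point of doing this on a SeqRecord instead is that the sequence and every per-letter annotation are kept in step for you.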
Now, for an example with features, we'll use a GenBank file. Suppose you have a circular genome:
>>> record
>>> len(record)
9609
>>> len(record.features)
41
>>> record.dbxrefs
['Project:58037']
>>> record.annotations.keys()
['comment', 'sequence_version', 'source', 'taxonomy', 'keywords', 'references',
'accessions', 'data_file_division', 'date', 'organism', 'gi']
You can shift the origin like this:
>>> shifted = record[2000:] + record[:2000]
>>> shifted
SeqRecord(seq=Seq('GATACGCAGTCATATTTTTTACACAATTCTCTAATCCCGACAAGGTCGTAGGTC...GGA',
dbxrefs=[])
>>> len(shifted)
9609
Note that this isn't perfect in that some annotation like the database cross references and one of the features (the source feature) have been lost:
>>> len(shifted.features)
40
>>> shifted.dbxrefs
[]
>>> shifted.annotations.keys()
[]
This is because the SeqRecord slicing step is cautious in what annotation it preserves (erroneously propagating annotation can cause major problems). If you want to keep the database cross references or the annotations dictionary, this must be done explicitly:
>>> shifted.dbxrefs = record.dbxrefs[:]
>>> shifted.annotations = record.annotations.copy()
>>> shifted.dbxrefs
['Project:58037']
>>> shifted.annotations.keys()
['comment', 'sequence_version', 'source', 'taxonomy', 'keywords', 'references',
'accessions', 'data_file_division', 'date', 'organism', 'gi']
Also note that in an example like this, you should probably change the record identifiers since the NCBI references refer to the original unmodified sequence.
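Why exactly one feature goes missing becomes clearer if you follow the coordinates through the two slices. The sketch below is an illustration with plain (start, end, type) tuples, not Biopython's implementation; shift_origin is a hypothetical helper and the second gene's coordinates are made up.

```python
def shift_origin(features, length, cut):
    """Feature coordinates after record[cut:] + record[:cut] on a circular record.

    Features spanning the cut point fit in neither slice and are dropped.
    """
    # Features wholly in record[cut:] move towards the new start.
    after_cut = [(s - cut, e - cut, t) for s, e, t in features if s >= cut]
    # Features wholly in record[:cut] are appended after the first slice.
    offset = length - cut
    before_cut = [(s + offset, e + offset, t) for s, e, t in features if e <= cut]
    return after_cut + before_cut


features = [
    (0, 9609, "source"),   # spans the whole record, so it spans the cut
    (4342, 4780, "gene"),  # pim gene from the example record
    (100, 1100, "gene"),   # made-up gene near the original origin
]
print(shift_origin(features, 9609, 2000))
```

The source feature covers the whole plasmid, so it cannot fall completely within either slice, which is why the shifted record has one feature fewer.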
4.9 Reverse-complementing SeqRecord objects
One of the new features in Biopython 1.57 was the SeqRecord object's reverse_complement method. This tries to balance ease of use with worries about what to do with the annotation in the reverse complemented record.
For the sequence, this uses the Seq object's reverse complement method. Any features are transferred with the location and strand recalculated. Likewise any per-letter annotation is also copied but reversed (which makes sense for typical examples like quality scores). However, transfer of most annotation is problematical.
For instance, if the record ID was an accession, that accession should not really apply to the reverse complemented sequence, and transferring the identifier by default could easily cause subtle data corruption in downstream analysis. Therefore the SeqRecord's id, name, description, annotations and database cross references are all not transferred by default.
The SeqRecord object's reverse_complement method takes a number of optional arguments corresponding to properties of the record. Setting these arguments to True means copy the old values, while False means drop the old values and use the default value. You can alternatively provide the new desired value instead.
Consider this example record:
>>> print("%s %i %i %i %i" % (record.id, len(record), len(record.features), len(record.dbxrefs), len(record.annotations)))
NC_005816.1 9609 41 1 13
Here we take the reverse complement and specify a new identifier, but notice how most of the annotation is dropped (but not the features):
>>> rc = record.reverse_complement(id="TESTING")
>>> print("%s %i %i %i %i" % (rc.id, len(rc), len(rc.features), len(rc.dbxrefs), len(rc.annotations)))
TESTING 9609 41 0 0
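The recalculation of a feature's location and strand can be written down as simple arithmetic for the exact-position case. This is a sketch of the idea, not Biopython's code: reverse_complement_location is a hypothetical helper, coordinates are in Python counting, and fuzzy or compound locations are not handled.

```python
def reverse_complement_location(start, end, strand, length):
    """Map a simple location onto the reverse complement of its parent.

    length is the length of the parent sequence; strands 1 and -1 swap,
    while 0 or None are left unchanged.
    """
    new_strand = -strand if strand in (1, -1) else strand
    return length - end, length - start, new_strand


# The pim gene (4342:4780 on the forward strand) in the 9609 bp plasmid:
print(reverse_complement_location(4342, 4780, 1, 9609))
```

The region keeps its length (438 bases here) and simply swaps strand, as you would expect.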
Chapter 5 Sequence Input/Output
In this chapter we'll discuss in more detail the Bio.SeqIO module, which was briefly introduced in Chapter 2 and also used in Chapter 4. This aims to provide a simple interface for working with assorted sequence file formats in a uniform way. See also the Bio.SeqIO wiki page (http://biopython.org/wiki/SeqIO), and the built-in documentation (also online):
>>> from Bio import SeqIO
>>> help(SeqIO)
...
The "catch" is that you have to work with SeqRecord objects (see Chapter 4), which contain a Seq object (see Chapter 3) plus annotation like an identifier and description. Note that when dealing with very large FASTA or FASTQ files, the overhead of working with all these objects can make scripts too slow. In this case consider the low-level SimpleFastaParser and FastqGeneralIterator parsers which return just a tuple of strings for each record (see Section 5.6).
5.1 Parsing or Reading Sequences
The workhorse function Bio.SeqIO.parse() is used to read in sequence data as SeqRecord objects. This function expects two arguments:
1. The first argument is a handle to read the data from, or a filename. A handle is typically a file opened for reading, but could be the output from a command line program, or data downloaded from the internet (see Section 5.3). See Section 24.1 for more about handles.
2. The second argument is a lower case string specifying sequence format: we don't try and guess the file format for you! See http://biopython.org/wiki/SeqIO for a full listing of supported formats.
There is an optional argument alphabet to specify the alphabet to be used. This is useful for file formats like FASTA where otherwise Bio.SeqIO will default to a generic alphabet.
The Bio.SeqIO.parse() function returns an iterator which gives SeqRecord objects. Iterators are typically used in a for loop as shown below.
Sometimes you'll find yourself dealing with files which contain only a single record. For this situation use the function Bio.SeqIO.read() which takes the same arguments. Provided there is one and only one record in the file, this is returned as a SeqRecord object. Otherwise an exception is raised.
5.1.1 Reading Sequence Files
In general Bio.SeqIO.parse() is used to read in sequence files as SeqRecord objects, and is typically used with a for loop like this:
from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))
The above example is repeated from the introduction in Section 2.4, and will load the orchid DNA sequences in the FASTA format file ls_orchid.fasta. If instead you wanted to load a GenBank format file like ls_orchid.gbk then all you need to do is change the filename and the format string:
from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))
Similarly, if you wanted to read in a file in another file format, then assuming Bio.SeqIO.parse() supports it you would just need to change the format string as appropriate, for example "swiss" for SwissProt files or "embl" for EMBL text files. There is a full listing on the wiki page (http://biopython.org/wiki/SeqIO) and in the built-in documentation (also online).
Another very common way to use a Python iterator is within a list comprehension (or a generator expression). For example, if all you wanted to extract from the file was a list of the record identifiers, we can easily do this with the following list comprehension:
>>> from Bio import SeqIO
>>> identifiers = [seq_record.id for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank")]
>>> identifiers
['Z78533.1', 'Z78532.1', 'Z78531.1', 'Z78530.1', 'Z78529.1', 'Z78527.1', ..., 'Z78439.1']
There are more examples using SeqIO.parse() in a list comprehension like this in Section 20.2 (e.g. for plotting sequence lengths or GC%).
5.1.2 Iterating over the records in a sequence file
In the above examples, we have usually used a for loop to iterate over all the records one by one. You can use the for loop with all sorts of Python objects (including lists, tuples and strings) which support the iteration interface.
The object returned by Bio.SeqIO is actually an iterator which returns SeqRecord objects. You get to see each record in turn, but once and only once. The plus point is that an iterator can save you memory when dealing with large files.
Instead of using a for loop, you can also use the next() function on an iterator to step through the entries, like this:
from Bio import SeqIO
record_iterator = SeqIO.parse("ls_orchid.fasta", "fasta")
first_record = next(record_iterator)
print(first_record.id)
print(first_record.description)
second_record = next(record_iterator)
print(second_record.id)
print(second_record.description)
Note that if you try to use next() and there are no more results, you'll get the special StopIteration exception.
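If you would rather not handle StopIteration yourself, Python's built-in next() accepts a second argument which is returned when the iterator is exhausted. The sketch below uses a small generator as a stand-in for SeqIO.parse() so that it runs without any sequence file; stand_in_records and take are hypothetical names used only for illustration.

```python
def stand_in_records():
    """Hypothetical stand-in for SeqIO.parse(): yields two identifiers."""
    yield "Z78533.1"
    yield "Z78532.1"


def take(iterator, count):
    """Return up to count items, stopping quietly if the iterator runs out."""
    items = []
    for _ in range(count):
        item = next(iterator, None)  # None is returned instead of raising
        if item is None:
            break
        items.append(item)
    return items


print(take(stand_in_records(), 5))  # only two records are available
```

This only works if None can never be a genuine item, which is the case for SeqRecord objects coming from a parser.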
One special case to consider is when your sequence files have multiple records, but you only want the first one. In this situation the following code is very concise:
from Bio import SeqIO
first_record = next(SeqIO.parse("ls_orchid.gbk", "genbank"))
A word of warning here: using the next() function like this will silently ignore any additional records in the file. If your files have one and only one record, like some of the online examples later in this chapter, or a GenBank file for a single chromosome, then use the new Bio.SeqIO.read() function instead. This will check there are no extra unexpected records present.
5.1.3 Getting a list of the records in a sequence file
In the previous section we talked about the fact that Bio.SeqIO.parse() gives you a SeqRecord iterator, and that you get the records one by one. Very often you need to be able to access the records in any order. The Python list data type is perfect for this, and we can turn the record iterator into a list of SeqRecord objects using the built-in Python function list() like so:
from Bio import SeqIO
records = list(SeqIO.parse("ls_orchid.gbk", "genbank"))
print("Found %i records" % len(records))
print("The last record")
last_record = records[-1]  # using Python's list tricks
print(last_record.id)
print(repr(last_record.seq))
print(len(last_record))
print("The first record")
first_record = records[0]  # remember, Python counts from zero
print(first_record.id)
print(repr(first_record.seq))
print(len(first_record))
Giving:
Found 94 records
The last record
Z78439.1
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', IUPACAmbiguousDNA())
592
The first record
Z78533.1
740
You can of course still use a for loop with a list of SeqRecord objects. Using a list is much more flexible than an iterator (for example, you can determine the number of records from the length of the list), but does need more memory because it will hold all the records in memory at once.
5.1.4 Extracting data
The SeqRecord object and its annotation structures are described more fully in Chapter 4. As an example of how annotations are stored, we'll look at the output from parsing the first record in the GenBank file ls_orchid.gbk.
from Bio import SeqIO
record_iterator = SeqIO.parse("ls_orchid.gbk", "genbank")
first_record = next(record_iterator)
print(first_record)
That should give something like this:
ID: Z78533.1
Name: Z78533
Description: C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA.
Number of features: 5
/sequence_version=1
/source=Cypripedium irapeanum
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', ..., 'Cypripedium']
/keywords=['5.8S ribosomal RNA', '5.8S rRNA gene', ..., 'ITS1', 'ITS2']
/references=[...]
/accessions=['Z78533']
/data_file_division=PLN
/date=30-NOV-2006
/organism=Cypripedium irapeanum
/gi=2765658
This gives a human readable summary of most of the annotation data for the SeqRecord. For this example we're going to use the .annotations attribute which is just a Python dictionary. The contents of this annotations dictionary were shown when we printed the record above. You can also print them out directly:
print(first_record.annotations)
Like any Python dictionary, you can easily get a list of the keys:
print(first_record.annotations.keys())
or values:
print(first_record.annotations.values())
In general, the annotation values are strings, or lists of strings. One special case is any references in the file get stored as reference objects.
Suppose you wanted to extract a list of the species from the ls_orchid.gbk GenBank file. The information we want, Cypripedium irapeanum, is held in the annotations dictionary under 'source' and 'organism', which we can access like this:
>>> print(first_record.annotations["source"])
Cypripedium irapeanum
or:
>>> print(first_record.annotations["organism"])
Cypripedium irapeanum
In general, 'organism' is used for the scientific name (in Latin, e.g. Arabidopsis thaliana), while 'source' will often be the common name (e.g. thale cress). In this example, as is often the case, the two fields are identical.
Nowletsgothroughalltherecords,buildingupalistofthespecieseachorchidsequenceisfrom:
fromBioimportSeqIO
all_species=[]
all_species.append(seq_record.annotations["organism"])
print(all_species)
Anotherwayofwritingthiscodeistousealistcomprehension:
fromBioimportSeqIO
all_species=[seq_record.annotations["organism"]forseq_recordin\
SeqIO.parse("ls_orchid.gbk","genbank")]
print(all_species)
Ineithercase,theresultis:
['Cypripediumirapeanum','Cypripediumcalifornicum',...,'Paphiopedilumbarbatum']
Great.ThatwasprettyeasybecauseGenBankfilesareannotatedinastandardisedway.
Now,letssupposeyouwantedtoextractalistofthespeciesfromaFASTAfile,ratherthantheGenBankfile.Thebadnewsisyouwillhavetowritesomecodeto
extractthedatayouwantfromtherecordsdescriptionlineiftheinformationisinthefileinthefirstplace!OurexampleFASTAformatfilels_orchid.fastastartslike
this:
>gi|2765658|emb|Z78533.1|CIZ78533C.irapeanum5.8SrRNAgeneandITS1andITS2DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
...
Youcancheckbyhand,butforeveryrecordthespeciesnameisinthedescriptionlineasthesecondword.Thismeansifwebreakupeachrecords.descriptionatthe
spaces,thenthespeciesisthereasfieldnumberone(fieldzeroistherecordidentifier).Thatmeanswecandothis:
fromBioimportSeqIO
all_species=[]
all_species.append(seq_record.description.split()[1])
print(all_species)
This gives:
['C.irapeanum', 'C.californicum', 'C.fasciculatum', 'C.margaritaceum', ..., 'P.barbatum']
The concise alternative using list comprehensions would be:
from Bio import SeqIO
all_species = [seq_record.description.split()[1] for seq_record in \
               SeqIO.parse("ls_orchid.fasta", "fasta")]
print(all_species)
In general, extracting information from the FASTA description line is not very nice. If you can get your sequences in a well annotated file format like GenBank or EMBL, then this sort of annotation information is much easier to deal with.
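Because the position of the species name is only a convention of this particular file, it can help to isolate the extraction in a small helper which fails loudly on unexpected lines. The following is a minimal sketch in plain Python. The function name species_from_description is hypothetical (it is not part of Biopython), and it assumes the layout shown above, with the identifier as the first word and the abbreviated species as the second:

```python
def species_from_description(description):
    """Return the second whitespace-separated word of a FASTA title line.

    Hypothetical helper: assumes the ls_orchid.fasta layout, where field
    zero is the record identifier and field one is the species.
    """
    fields = description.split()
    if len(fields) < 2:
        raise ValueError("No species field in description: %r" % description)
    return fields[1]


title = ("gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum "
         "5.8S rRNA gene and ITS1 and ITS2 DNA")
species = species_from_description(title)
print(species)
```

A helper like this could replace the bare .split()[1] in the loop above, so that a malformed title line is reported rather than silently giving the wrong word.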
5.2 Parsing sequences from compressed files
In the previous section, we looked at parsing sequence data from a file. Instead of using a filename, you can give Bio.SeqIO a handle (see Section 24.1), and in this section we'll use handles to parse sequences from compressed files.
As you'll have seen above, we can use Bio.SeqIO.read() or Bio.SeqIO.parse() with a filename. For instance, this quick example calculates the total length of the sequences in a multiple record GenBank file using a generator expression:
>>> from Bio import SeqIO
>>> print(sum(len(r) for r in SeqIO.parse("ls_orchid.gbk", "gb")))
67518
Here we use a file handle instead, using the with statement to close the handle automatically:
>>> with open("ls_orchid.gbk") as handle:
...     print(sum(len(r) for r in SeqIO.parse(handle, "gb")))
...
67518
Or, the old fashioned way where you manually close the handle:
>>> handle = open("ls_orchid.gbk")
>>> print(sum(len(r) for r in SeqIO.parse(handle, "gb")))
67518
>>> handle.close()
Now, suppose we have a gzip compressed file instead? These are very commonly used on Linux. We can use Python's gzip module to open the compressed file for reading, which gives us a handle object:
>>> import gzip
>>> with gzip.open("ls_orchid.gbk.gz", "rt") as handle:
...     print(sum(len(r) for r in SeqIO.parse(handle, "gb")))
...
67518
Similarly if we had a bzip2 compressed file (sadly the function name isn't quite as consistent under Python 2):
>>> import bz2
>>> if hasattr(bz2, "open"):
...     handle = bz2.open("ls_orchid.gbk.bz2", "rt")  # Python 3
... else:
...     handle = bz2.BZ2File("ls_orchid.gbk.bz2", "r")  # Python 2
...
>>> with handle:
...     print(sum(len(r) for r in SeqIO.parse(handle, "gb")))
...
67518
There is a gzip (GNU Zip) variant called BGZF (Blocked GNU Zip Format), which can be treated like an ordinary gzip file for reading, but has advantages for random access which we'll talk about later in Section 5.4.4.
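If you regularly meet a mixture of plain, gzip and bzip2 files, the choice of opener can be wrapped in one place. This is only a sketch under stated assumptions: the helper name open_text is hypothetical, it picks the opener from the filename extension alone, and it targets Python 3 only (where both gzip.open and bz2.open accept the "rt" text mode):

```python
import bz2
import gzip


def open_text(filename):
    """Open a plain, gzip or bzip2 file in text mode, chosen by extension.

    Hypothetical helper for Python 3; the file contents are not inspected.
    """
    if filename.endswith(".gz"):
        return gzip.open(filename, "rt")
    if filename.endswith(".bz2"):
        return bz2.open(filename, "rt")
    return open(filename)
```

The returned object can be used in a with statement and passed to Bio.SeqIO.parse() exactly like the handles in the examples above.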
5.3 Parsing sequences from the net
In the previous sections, we looked at parsing sequence data from a file (using a filename or handle), and from compressed files (using a handle). Here we'll use Bio.SeqIO with another type of handle, a network connection, to download and parse sequences from the internet.
Note that just because you can download sequence data and parse it into a SeqRecord object in one go doesn't mean this is a good idea. In general, you should probably download sequences once and save them to a file for reuse.
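One way to follow that advice is a small "download only if missing" wrapper. The sketch below uses only the standard library. Both fetch_once and its download argument are hypothetical names, and download is assumed to be any zero-argument function returning the file contents as a string (for instance, a function that reads from a network handle):

```python
import os


def fetch_once(filename, download):
    """Return an open text handle to filename, downloading only if missing.

    Hypothetical helper: download is any zero-argument callable returning
    the file contents as a string.
    """
    if not os.path.isfile(filename):
        with open(filename, "w") as out_handle:
            out_handle.write(download())
    return open(filename)
```

On the first call the data is fetched and saved; later calls simply reopen the local copy, which can then be given to Bio.SeqIO like any other handle.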
5.3.1 Parsing GenBank records from the net
Section 9.6 talks about the Entrez EFetch interface in more detail, but for now let's just connect to the NCBI and get a few Opuntia (prickly-pear) sequences from GenBank using their GI numbers.
First of all, let's fetch just one record. If you don't care about the annotations and features, downloading a FASTA file is a good choice as these are compact. Now remember, when you expect the handle to contain one and only one record, use the Bio.SeqIO.read() function:
from Bio import Entrez
from Bio import SeqIO
Entrez.email = "A.N.Other@example.com"
with Entrez.efetch(db="nucleotide", rettype="fasta", retmode="text", id="6273291") as handle:
    seq_record = SeqIO.read(handle, "fasta")
print("%s with %i features" % (seq_record.id, len(seq_record.features)))
Expected output:
gi|6273291|gb|AF191665.1|AF191665 with 0 features
The NCBI will also let you ask for the file in other formats, in particular as a GenBank file. Until Easter 2009, the Entrez EFetch API let you use "genbank" as the return type, however the NCBI now insist on using the official return types of "gb" (or "gp" for proteins) as described on EFetch for Sequence and other Molecular Biology Databases. As a result, in Biopython 1.50 onwards, we support "gb" as an alias for "genbank" in Bio.SeqIO.
from Bio import Entrez
from Bio import SeqIO
Entrez.email = "A.N.Other@example.com"
with Entrez.efetch(db="nucleotide", rettype="gb", retmode="text", id="6273291") as handle:
    seq_record = SeqIO.read(handle, "gb")  # using "gb" as an alias for "genbank"
print("%s with %i features" % (seq_record.id, len(seq_record.features)))
The expected output of this example is:
AF191665.1 with 3 features
Notice this time we have three features.
Now let's fetch several records. This time the handle contains multiple records, so we must use the Bio.SeqIO.parse() function:
from Bio import Entrez
from Bio import SeqIO
Entrez.email = "A.N.Other@example.com"
with Entrez.efetch(db="nucleotide", rettype="gb", retmode="text",
                   id="6273291,6273290,6273289") as handle:
    for seq_record in SeqIO.parse(handle, "gb"):
        print("%s %s..." % (seq_record.id, seq_record.description[:50]))
        print("Sequence length %i, %i features, from: %s"
              % (len(seq_record), len(seq_record.features), seq_record.annotations["source"]))
That should give the following output:
AF191665.1 Opuntia marenae rpl16 gene; chloroplast gene for c...
Sequence length 902, 3 features, from: chloroplast Opuntia marenae
AF191664.1 Opuntia clavata rpl16 gene; chloroplast gene for c...
Sequence length 899, 3 features, from: chloroplast Grusonia clavata
AF191663.1 Opuntia bradtiana rpl16 gene; chloroplast gene for...
Sequence length 899, 3 features, from: chloroplast Opuntia bradtianaa
See Chapter 9 for more about the Bio.Entrez module, and make sure to read about the NCBI guidelines for using Entrez (Section 9.1).
5.3.2 Parsing SwissProt sequences from the net
Now let's use a handle to download a SwissProt file from ExPASy, something covered in more depth in Chapter 10. As mentioned above, when you expect the handle to contain one and only one record, use the Bio.SeqIO.read() function:
from Bio import ExPASy
from Bio import SeqIO
with ExPASy.get_sprot_raw("O23729") as handle:
    seq_record = SeqIO.read(handle, "swiss")
print(seq_record.id)
print(seq_record.name)
print(seq_record.description)
print(repr(seq_record.seq))
print("Length %i" % len(seq_record))
print(seq_record.annotations["keywords"])
Assuming your network connection is OK, you should get back:
O23729
CHS3_BROFI
RecName: Full=Chalcone synthase 3; EC=2.3.1.74; AltName: Full=Naringenin-chalcone synthase 3;
Seq('MAPAMEEIRQAQRAEGPAAVLAIGTSTPPNALYQADYPDYYFRITKSEHLTELK...GAE', ProteinAlphabet())
Length 394
['Acyltransferase', 'Flavonoid biosynthesis', 'Transferase']
5.4 Sequence files as Dictionaries
We're now going to introduce three related functions in the Bio.SeqIO module which allow dictionary like random access to a multi-sequence file. There is a trade off here between flexibility and memory usage. In summary:
Bio.SeqIO.to_dict() is the most flexible but also the most memory demanding option (see Section 5.4.1). This is basically a helper function to build a normal Python dictionary with each entry held as a SeqRecord object in memory, allowing you to modify the records.
Bio.SeqIO.index() is a useful middle ground, acting like a read only dictionary and parsing sequences into SeqRecord objects on demand (see Section 5.4.2).
Bio.SeqIO.index_db() also acts like a read only dictionary but stores the identifiers and file offsets in a file on disk (as an SQLite3 database), meaning it has very low memory requirements (see Section 5.4.3), but will be a little bit slower.
See the discussion for a broad overview (Section 5.4.5).
5.4.1 Sequence files as Dictionaries - In memory
The next thing that we'll do with our ubiquitous orchid files is to show how to index them and access them like a database using the Python dictionary data type (like a hash in Perl). This is very useful for moderately large files where you only need to access certain elements of the file, and makes for a nice quick 'n dirty database. For dealing with larger files where memory becomes a problem, see Section 5.4.2 below.
You can use the function Bio.SeqIO.to_dict() to make a SeqRecord dictionary (in memory). By default this will use each record's identifier (i.e. the .id attribute) as the key. Let's try this using our GenBank file:
>>> from Bio import SeqIO
>>> orchid_dict = SeqIO.to_dict(SeqIO.parse("ls_orchid.gbk", "genbank"))
There is just one required argument for Bio.SeqIO.to_dict(), a list or generator giving SeqRecord objects. Here we have just used the output from the SeqIO.parse function. As the name suggests, this returns a Python dictionary.
Since this variable orchid_dict is an ordinary Python dictionary, we can look at all of the keys we have available:
>>> len(orchid_dict)
94
>>> list(orchid_dict.keys())
['Z78484.1', 'Z78464.1', 'Z78455.1', 'Z78442.1', 'Z78532.1', 'Z78453.1', ..., 'Z78471.1']
You can leave out the list(...) bit if you are still using Python 2. Under Python 3 the dictionary methods like .keys() and .values() return iterable view objects rather than lists.
If you really want to, you can even look at all the records at once:
>>> list(orchid_dict.values())  # lots of output!
...
We can access a single SeqRecord object via the keys and manipulate the object as normal:
>>> seq_record = orchid_dict["Z78475.1"]
>>> print(seq_record.description)
P.supardii 5.8S rRNA gene and ITS1 and ITS2 DNA
>>> print(repr(seq_record.seq))
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GGT', IUPACAmbiguousDNA())
So, it is very easy to create an in memory database of our GenBank records. Next we'll try this for the FASTA file instead.
Note that those of you with prior Python experience should all be able to construct a dictionary like this by hand. However, typical dictionary construction methods will not deal with the case of repeated keys very nicely. Using Bio.SeqIO.to_dict() will explicitly check for duplicate keys, and raise an exception if any are found.
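To see why repeated keys matter, the sketch below contrasts a plain dictionary comprehension with an explicit duplicate check. It is an illustration only and not Biopython's implementation: the function to_dict_strict is hypothetical, and simple (identifier, sequence) tuples stand in for SeqRecord objects:

```python
def to_dict_strict(records, key_function):
    """Build a dictionary, raising ValueError on a repeated key."""
    result = {}
    for record in records:
        key = key_function(record)
        if key in result:
            raise ValueError("Duplicate key %r" % key)
        result[key] = record
    return result


# Plain tuples standing in for SeqRecord objects; note "id1" appears twice.
records = [("id1", "ACGT"), ("id2", "GGCC"), ("id1", "TTAA")]

# A dictionary comprehension silently keeps only the last "id1" entry.
naive = {rec[0]: rec for rec in records}
print(len(naive))

try:
    to_dict_strict(records, key_function=lambda rec: rec[0])
except ValueError as err:
    message = str(err)
    print(message)
```

The comprehension quietly loses a record, whereas the explicit check reports the problem, which is the behaviour described above for Bio.SeqIO.to_dict().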
5.4.1.1 Specifying the dictionary keys
Using the same code as above, but for the FASTA file instead:
from Bio import SeqIO
orchid_dict = SeqIO.to_dict(SeqIO.parse("ls_orchid.fasta", "fasta"))
print(orchid_dict.keys())
This time the keys are:
['gi|2765596|emb|Z78471.1|PDZ78471', 'gi|2765646|emb|Z78521.1|CCZ78521', ...
 ..., 'gi|2765613|emb|Z78488.1|PTZ78488', 'gi|2765583|emb|Z78458.1|PHZ78458']
You should recognise these strings from when we parsed the FASTA file earlier in Section 2.4.1. Suppose you would rather have something else as the keys, like the accession numbers. This brings us nicely to SeqIO.to_dict()'s optional argument key_function, which lets you define what to use as the dictionary key for your records.
First you must write your own function to return the key you want (as a string) when given a SeqRecord object. In general, the details of the function will depend on the sort of input records you are dealing with. But for our orchids, we can just split up the record's identifier using the "pipe" character (the vertical line) and return the fourth entry (field three):
def get_accession(record):
    """Given a SeqRecord, return the accession number as a string.

    e.g. "gi|2765613|emb|Z78488.1|PTZ78488" -> "Z78488.1"
    """
    parts = record.id.split("|")
    assert len(parts) == 5 and parts[0] == "gi" and parts[2] == "emb"
    return parts[3]
Then we can give this function to the SeqIO.to_dict() function to use in building the dictionary:
from Bio import SeqIO
orchid_dict = SeqIO.to_dict(SeqIO.parse("ls_orchid.fasta", "fasta"), key_function=get_accession)
print(orchid_dict.keys())
Finally, as desired, the new dictionary keys:
>>> print(orchid_dict.keys())
['Z78484.1', 'Z78464.1', 'Z78455.1', 'Z78442.1', 'Z78532.1', 'Z78453.1', ..., 'Z78471.1']
Not too complicated, I hope!
5.4.1.2 Indexing a dictionary using the SEGUID checksum
To give another example of working with dictionaries of SeqRecord objects, we'll use the SEGUID checksum function. This is a relatively recent checksum, and collisions should be very rare (i.e. two different sequences with the same checksum), an improvement on the CRC64 checksum.
Once again, working with the orchids GenBank file:
from Bio import SeqIO
from Bio.SeqUtils.CheckSum import seguid
for record in SeqIO.parse("ls_orchid.gbk", "genbank"):
    print(record.id, seguid(record.seq))
This should give:
Z78533.1 JUEoWn6DPhgZ9nAyowsgtoD9TTo
Z78532.1 MN/s0q9zDoCVEEc+k/IFwCNF2pY
...
Z78439.1 H+JfaShya/4yyAj7IbMqgNkxdxQ
Now, recall the Bio.SeqIO.to_dict() function's key_function argument expects a function which turns a SeqRecord into a string. We can't use the seguid() function directly because it expects to be given a Seq object (or a string). However, we can use Python's lambda feature to create a "one off" function to give to Bio.SeqIO.to_dict() instead:
>>> from Bio.SeqUtils.CheckSum import seguid
>>> seguid_dict = SeqIO.to_dict(SeqIO.parse("ls_orchid.gbk", "genbank"),
...                             lambda rec: seguid(rec.seq))
>>> record = seguid_dict["MN/s0q9zDoCVEEc+k/IFwCNF2pY"]
>>> print(record.id)
Z78532.1
>>> print(record.description)
C.californicum 5.8S rRNA gene and ITS1 and ITS2 DNA
That should have retrieved the record Z78532.1, the second entry in the file.
5.4.2 Sequence files as Dictionaries - Indexed files
As the previous couple of examples tried to illustrate, using Bio.SeqIO.to_dict() is very flexible. However, because it holds everything in memory, the size of file you can work with is limited by your computer's RAM. In general, this will only work on small to medium files.
For larger files you should consider Bio.SeqIO.index(), which works a little differently. Although it still returns a dictionary like object, this does not keep everything in memory. Instead, it just records where each record is within the file; when you ask for a particular record, it then parses it on demand.
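The idea of recording offsets and parsing on demand can be sketched in a few lines of plain Python for the FASTA format. This is a simplified illustration and not how Bio.SeqIO.index() is actually implemented. The function names are hypothetical, and it assumes every title line starts with ">" followed directly by an identifier:

```python
def index_fasta_offsets(filename):
    """Scan a FASTA file once, mapping each record id to its byte offset."""
    offsets = {}
    with open(filename, "rb") as handle:
        while True:
            offset = handle.tell()  # position of the line about to be read
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                key = line[1:].split(None, 1)[0].decode()
                offsets[key] = offset
    return offsets


def read_record_text(filename, offset):
    """Seek to offset and return the raw text of the single record there."""
    lines = []
    with open(filename, "rb") as handle:
        handle.seek(offset)
        lines.append(handle.readline())  # the ">" title line
        for line in handle:
            if line.startswith(b">"):
                break  # start of the next record
            lines.append(line)
    return b"".join(lines).decode()
```

Only the small dictionary of offsets is held in memory; the text of a record is read from disk when it is requested, and could then be handed to a parser.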
As an example, let's use the same GenBank file as before:
>>> from Bio import SeqIO
>>> orchid_dict = SeqIO.index("ls_orchid.gbk", "genbank")
>>> len(orchid_dict)
94
>>> orchid_dict.keys()
['Z78484.1', 'Z78464.1', 'Z78455.1', 'Z78442.1', 'Z78532.1', 'Z78453.1', ..., 'Z78471.1']
>>> seq_record = orchid_dict["Z78475.1"]
>>> print(seq_record.description)
P.supardii 5.8S rRNA gene and ITS1 and ITS2 DNA
>>> seq_record.seq
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GGT', IUPACAmbiguousDNA())
>>> orchid_dict.close()
Note that Bio.SeqIO.index() won't take a handle, but only a filename. There are good reasons for this, but it is a little technical. The second argument is the file format (a lower case string as used in the other Bio.SeqIO functions). You can use many other simple file formats, including FASTA and FASTQ files (see the example in Section 20.1.11). However, alignment formats like PHYLIP or Clustal are not supported. Finally as an optional argument you can supply an alphabet, or a key function.
Here is the same example using the FASTA file; all we change is the filename and the format name:
>>> orchid_dict = SeqIO.index("ls_orchid.fasta", "fasta")
>>> len(orchid_dict)
94
>>> orchid_dict.keys()
['gi|2765596|emb|Z78471.1|PDZ78471', 'gi|2765646|emb|Z78521.1|CCZ78521', ...
 ..., 'gi|2765613|emb|Z78488.1|PTZ78488', 'gi|2765583|emb|Z78458.1|PHZ78458']
5.4.2.1 Specifying the dictionary keys
Suppose you want to use the same keys as before? Much like with the Bio.SeqIO.to_dict() example in Section 5.4.1.1, you'll need to write a tiny function to map from the FASTA identifier (as a string) to the key you want:
def get_acc(identifier):
    """Given a SeqRecord identifier string, return the accession number as a string.

    e.g. "gi|2765613|emb|Z78488.1|PTZ78488" -> "Z78488.1"
    """
    parts = identifier.split("|")
    assert len(parts) == 5 and parts[0] == "gi" and parts[2] == "emb"
    return parts[3]
Then we can give this function to the Bio.SeqIO.index() function to use in building the dictionary:
>>> orchid_dict = SeqIO.index("ls_orchid.fasta", "fasta", key_function=get_acc)
>>> print(orchid_dict.keys())
['Z78484.1', 'Z78464.1', 'Z78455.1', 'Z78442.1', 'Z78532.1', 'Z78453.1', ..., 'Z78471.1']
Easy when you know how?
5.4.2.2 Getting the raw data for a record
The dictionary-like object from Bio.SeqIO.index() gives you each entry as a SeqRecord object. However, it is sometimes useful to be able to get the original raw data straight from the file. For this use the get_raw() method, which takes a single argument (the record identifier) and returns a bytes string (extracted from the file without modification).
A motivating example is extracting a subset of records from a large file where either Bio.SeqIO.write() does not (yet) support the output file format (e.g. the plain text SwissProt file format) or where you need to preserve the text exactly (e.g. GenBank or EMBL output from Biopython does not yet preserve every last bit of annotation).
Let's suppose you have downloaded the whole of UniProt in the plain text SwissProt file format from their FTP site (ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz) and uncompressed it as the file uniprot_sprot.dat, and you want to extract just a few records from it:
>>> uniprot = SeqIO.index("uniprot_sprot.dat", "swiss")
>>> with open("selected.dat", "wb") as out_handle:
...     for acc in ["P33487", "P19801", "P13689", "Q8JZQ5", "Q9TRC7"]:
...         out_handle.write(uniprot.get_raw(acc))
...
Note with Python 3 onwards, we have to open the file for writing in binary mode because the get_raw() method returns bytes strings.
There is a longer example in Section 20.1.5 using the SeqIO.index() function to sort a large sequence file (without loading everything into memory at once).
5.4.3 Sequence files as Dictionaries - Database indexed files
Biopython 1.57 introduced an alternative, Bio.SeqIO.index_db(), which can work on even extremely large files since it stores the record information as a file on disk (using an SQLite3 database) rather than in memory. Also, you can index multiple files together (providing all the record identifiers are unique).
The Bio.SeqIO.index_db() function takes three required arguments:
Index filename, we suggest using something ending .idx. This index file is actually an SQLite3 database.
List of sequence filenames to index (or a single filename).
File format (lower case string as used in the rest of the SeqIO module).
As an example, consider the GenBank flat file releases from the NCBI FTP site, ftp://ftp.ncbi.nih.gov/genbank/, which are gzip compressed GenBank files.
As of GenBank release 210, there are 38 files making up the viral sequences, gbvrl1.seq, ..., gbvrl38.seq, taking about 8GB on disk once decompressed, and containing in total nearly two million records.
If you were interested in the viruses, you could download all the virus files from the command line very easily with the rsync command, and then decompress them with gunzip:
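# For illustration only, see reduced example below
$ rsync -avP "ftp.ncbi.nih.gov::genbank/gbvrl*.seq.gz" .
$ gunzip gbvrl*.seq.gz
Unless you care about viruses, that's a lot of data to download just for this example, so let's download just the first four chunks (about 25MB each compressed), and decompress them (taking in all about 1GB of space):
# Reduced example, download only the first four chunks
$ curl -O ftp://ftp.ncbi.nih.gov/genbank/gbvrl1.seq.gz
$ curl -O ftp://ftp.ncbi.nih.gov/genbank/gbvrl2.seq.gz
$ curl -O ftp://ftp.ncbi.nih.gov/genbank/gbvrl3.seq.gz
$ curl -O ftp://ftp.ncbi.nih.gov/genbank/gbvrl4.seq.gz
$ gunzip gbvrl*.seq.gz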
Now, in Python, index these GenBank files as follows:
>>> import glob
>>> from Bio import SeqIO
>>> files = glob.glob("gbvrl*.seq")
>>> print("%i files to index" % len(files))
4 files to index
>>> gb_vrl = SeqIO.index_db("gbvrl.idx", files, "genbank")
>>> print("%i sequences indexed" % len(gb_vrl))
272960 sequences indexed
Indexing the full set of virus GenBank files took about ten minutes on my machine, just the first four files took about a minute or so.
However, once done, repeating this will reload the index file gbvrl.idx in a fraction of a second.
You can use the index as a read only Python dictionary without having to worry about which file the sequence comes from, e.g.
>>> print(gb_vrl["AB811634.1"].description)
Equine encephalosis virus NS3 gene, complete cds, isolate: Kimron1.
5.4.3.1 Getting the raw data for a record
Just as with the Bio.SeqIO.index() function discussed above in Section 5.4.2.2, the dictionary like object also lets you get at the raw bytes of each record:
>>> print(gb_vrl.get_raw("AB811634.1"))
LOCUS       AB811634                 723 bp    RNA     linear   VRL 17-JUN-2015
DEFINITION  Equine encephalosis virus NS3 gene, complete cds, isolate: Kimron1.
ACCESSION   AB811634
...
//
5.4.4 Indexing compressed files
Very often when you are indexing a sequence file it can be quite large, so you may want to compress it on disk. Unfortunately efficient random access is difficult with the more common file formats like gzip and bzip2. In this setting, BGZF (Blocked GNU Zip Format) can be very helpful. This is a variant of gzip (and can be decompressed using standard gzip tools) popularised by the BAM file format, samtools, and tabix.
To create a BGZF compressed file you can use the command line tool bgzip which comes with samtools. In our examples we use a filename extension *.bgz, so they can be distinguished from normal gzipped files (named *.gz). You can also use the Bio.bgzf module to read and write BGZF files from within Python.
The Bio.SeqIO.index() and Bio.SeqIO.index_db() functions can both be used with BGZF compressed files. For example, if you started with an uncompressed GenBank file:
>>> from Bio import SeqIO
>>> orchid_dict = SeqIO.index("ls_orchid.gbk", "genbank")
>>> len(orchid_dict)
94
You could compress this (while keeping the original file) at the command line using the following command, but don't worry, the compressed file is already included with the other example files:
$ bgzip -c ls_orchid.gbk > ls_orchid.gbk.bgz
You can use the compressed file in exactly the same way:
>>> orchid_dict = SeqIO.index("ls_orchid.gbk.bgz", "genbank")
>>> len(orchid_dict)
94
or:
>>> orchid_dict = SeqIO.index_db("ls_orchid.gbk.bgz.idx", "ls_orchid.gbk.bgz", "genbank")
>>> len(orchid_dict)
94
The SeqIO indexing automatically detects the BGZF compression. Note that you can't use the same index file for the uncompressed and compressed files.
5.4.5 Discussion
So, which of these methods should you use and why? It depends on what you are trying to do (and how much data you are dealing with). However, in general picking Bio.SeqIO.index() is a good starting point. If you are dealing with millions of records, multiple files, or repeated analyses, then look at Bio.SeqIO.index_db().
Reasons to choose Bio.SeqIO.to_dict() over either Bio.SeqIO.index() or Bio.SeqIO.index_db() boil down to a need for flexibility despite its high memory needs. The advantage of storing the SeqRecord objects in memory is they can be changed, added to, or removed at will. In addition to the downside of high memory consumption, indexing can also take longer because all the records must be fully parsed.
Both Bio.SeqIO.index() and Bio.SeqIO.index_db() only parse records on demand. When indexing, they scan the file once looking for the start of each record and do as little work as possible to extract the identifier.
Reasons to choose Bio.SeqIO.index() over Bio.SeqIO.index_db() include:
Faster to build the index (more noticeable in simple file formats).
Slightly faster access as SeqRecord objects (but the difference is only really noticeable for simple to parse file formats).
Can use any immutable Python object as the dictionary keys (e.g. a tuple of strings, or a frozen set) not just strings.
Don't need to worry about the index database being out of date if the sequence file being indexed has changed.
Reasons to choose Bio.SeqIO.index_db() over Bio.SeqIO.index() include:
Not memory limited. This is already important with files from second generation sequencing where 10s of millions of sequences are common, and using Bio.SeqIO.index() can require more than 4GB of RAM and therefore a 64bit version of Python.
Because the index is kept on disk, it can be reused. Although building the index database file takes longer, if you have a script which will be rerun on the same datafiles in future, this could save time in the long run.
Indexing multiple files together.
The get_raw() method can be much faster, since for most file formats the length of each record is stored as well as its offset.
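To illustrate the last point, here is a rough sketch of an on-disk index which stores both the offset and the length of each FASTA record, so that the raw bytes can be fetched with a single seek and read. The table layout and the function names are invented for this illustration; this is not the schema or code used by Bio.SeqIO.index_db():

```python
import sqlite3


def build_offset_db(index_filename, filename):
    """Store (key, offset, length) for each FASTA record in an SQLite file.

    The table layout is invented for this sketch only.
    """
    rows = []
    with open(filename, "rb") as handle:
        key, start = None, 0
        while True:
            offset = handle.tell()
            line = handle.readline()
            # A new title line or the end of file closes the current record.
            if key is not None and (not line or line.startswith(b">")):
                rows.append((key, start, offset - start))
                key = None
            if not line:
                break
            if line.startswith(b">"):
                key = line[1:].split(None, 1)[0].decode()
                start = offset
    con = sqlite3.connect(index_filename)
    con.execute("CREATE TABLE IF NOT EXISTS offsets "
                "(key TEXT PRIMARY KEY, offset INTEGER, length INTEGER)")
    con.executemany("INSERT OR REPLACE INTO offsets VALUES (?, ?, ?)", rows)
    con.commit()
    return con


def get_raw(con, filename, key):
    """Return the raw bytes of one record using a single seek and read."""
    row = con.execute("SELECT offset, length FROM offsets WHERE key = ?",
                      (key,)).fetchone()
    if row is None:
        raise KeyError(key)
    offset, length = row
    with open(filename, "rb") as handle:
        handle.seek(offset)
        return handle.read(length)
```

Because the length is known in advance, there is no need to scan forward looking for the start of the next record, and the index file survives between runs of a script.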
5.5 Writing Sequence Files
We've talked about using Bio.SeqIO.parse() for sequence input (reading files), and now we'll look at Bio.SeqIO.write() which is for sequence output (writing files). This is a function taking three arguments: some SeqRecord objects, a handle or filename to write to, and a sequence format.
Here is an example, where we start by creating a few SeqRecord objects the hard way (by hand, rather than by loading them from a file):
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein
rec1=SeqRecord(Seq("MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD"\
+"GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK"\
+"NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM"\
+"SSAC",generic_protein),
id="gi|14150838|gb|AAK54648.1|AF376133_1",
description="chalcone synthase [Cucumis sativus]")
rec2=SeqRecord(Seq("YPDYYFRITNREHKAELKEKFQRMCDKSMIKKRYMYLTEEILKENPSMCEYMAPSLDARQ"\
+"DMVVVEIPKLGKEAAVKAIKEWGQ",generic_protein),
id="gi|13919613|gb|AAK33142.1|",
description="chalcone synthase [Fragaria vesca subsp. bracteata]")
rec3=SeqRecord(Seq("MVTVEEFRRAQCAEGPATVMAIGTATPSNCVDQSTYPDYYFRITNSEHKVELKEKFKRMC"\
+"EKSMIKKRYMHLTEEILKENPNICAYMAPSLDARQDIVVVEVPKLGKEAAQKAIKEWGQP"\
+"KSKITHLVFCTTSGVDMPGCDYQLTKLLGLRPSVKRFMMYQQGCFAGGTVLRMAKDLAEN"\
+"NKGARVLVVCSEITAVTFRGPNDTHLDSLVGQALFGDGAAAVIIGSDPIPEVERPLFELV"\
+"SAAQTLLPDSEGAIDGHLREVGLTFHLLKDVPGLISKNIEKSLVEAFQPLGISDWNSLFW"\
+"IAHPGGPAILDQVELKLGLKQEKLKATRKVLSNYGNMSSACVLFILDEMRKASAKEGLGT"\
+"TGEGLEWGVLFGFGPGLTVETVVLHSVAT",generic_protein),
id="gi|13925890|gb|AAK49457.1|",
description="chalcone synthase [Nicotiana tabacum]")
my_records = [rec1, rec2, rec3]
Now we have a list of SeqRecord objects, we'll write them to a FASTA format file:
from Bio import SeqIO
SeqIO.write(my_records, "my_example.faa", "fasta")
And if you open this file in your favourite text editor it should look like this:
>gi|14150838|gb|AAK54648.1|AF376133_1 chalcone synthase [Cucumis sativus]
MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD
GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK
NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM
SSAC
>gi|13919613|gb|AAK33142.1| chalcone synthase [Fragaria vesca subsp. bracteata]
YPDYYFRITNREHKAELKEKFQRMCDKSMIKKRYMYLTEEILKENPSMCEYMAPSLDARQ
DMVVVEIPKLGKEAAVKAIKEWGQ
>gi|13925890|gb|AAK49457.1| chalcone synthase [Nicotiana tabacum]
MVTVEEFRRAQCAEGPATVMAIGTATPSNCVDQSTYPDYYFRITNSEHKVELKEKFKRMC
EKSMIKKRYMHLTEEILKENPNICAYMAPSLDARQDIVVVEVPKLGKEAAQKAIKEWGQP
KSKITHLVFCTTSGVDMPGCDYQLTKLLGLRPSVKRFMMYQQGCFAGGTVLRMAKDLAEN
NKGARVLVVCSEITAVTFRGPNDTHLDSLVGQALFGDGAAAVIIGSDPIPEVERPLFELV
SAAQTLLPDSEGAIDGHLREVGLTFHLLKDVPGLISKNIEKSLVEAFQPLGISDWNSLFW
IAHPGGPAILDQVELKLGLKQEKLKATRKVLSNYGNMSSACVLFILDEMRKASAKEGLGT
TGEGLEWGVLFGFGPGLTVETVVLHSVAT
Suppose you wanted to know how many records the Bio.SeqIO.write() function wrote to the handle? If your records were in a list you could just use len(my_records), however you can't do that when your records come from a generator/iterator. The Bio.SeqIO.write() function returns the number of SeqRecord objects written to the file.
Note: If you tell the Bio.SeqIO.write() function to write to a file that already exists, the old file will be overwritten without any warning.
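As a small sketch of using that return value, the following builds records with a generator expression (so len() is not available) and writes them to an in-memory handle. The record identifiers and sequences are made up for illustration, and the example assumes Biopython is installed:

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

sequences = ["ACGT", "GGCC", "TTAA"]  # made-up example data

# A generator expression: the records are created one at a time on demand,
# so there is no list whose length we could ask for.
records = (SeqRecord(Seq(seq), id="seq%i" % i, description="")
           for i, seq in enumerate(sequences))

out_handle = StringIO()
count = SeqIO.write(records, out_handle, "fasta")
print("Wrote %i records" % count)
```

Here count is taken from the return value of Bio.SeqIO.write(), which is the only convenient place to learn how many records a generator produced.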
5.5.1 Round trips
Some people like their parsers to be "round-tripable", meaning if you read in a file and write it back out again it is unchanged. This requires that the parser must extract enough information to reproduce the original file exactly. Bio.SeqIO does not aim to do this.
As a trivial example, any line wrapping of the sequence data in FASTA files is allowed. An identical SeqRecord would be given from parsing the following two examples which differ only in their line breaks:
>YAL068C-7235.2170 Putative promoter sequence
TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCACAGTTTTCGTTAAGA
GAACTTAACATTTTCTTATGACGTAAATGAAGTTTATATATAAATTTCCTTTTTATTGGA
>YAL068C-7235.2170 Putative promoter sequence
TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCA
CAGTTTTCGTTAAGAGAACTTAACATTTTCTTATGACGTAAATGA
AGTTTATATATAAATTTCCTTTTTATTGGA
To make a round-tripable FASTA parser you would need to keep track of where the sequence line breaks occurred, and this extra information is usually pointless. Instead Biopython uses a default line wrapping of 60 characters on output. The same problem with white space applies in many other file formats too. Another issue in some cases is that Biopython does not (yet) preserve every last bit of annotation (e.g. GenBank and EMBL).
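The line wrapping point can be demonstrated with plain strings. In this sketch the helper names wrap_sequence and unwrap are hypothetical and the sequence is made up; it simply shows that two different wrappings carry the same sequence, which is why the original line breaks are not worth recording:

```python
def wrap_sequence(seq, width=60):
    """Return seq split into lines of at most width characters."""
    return "\n".join(seq[i:i + width] for i in range(0, len(seq), width))


def unwrap(text):
    """Join wrapped sequence lines back into one string."""
    return "".join(text.split())


seq = "ACGT" * 30  # a made-up 120 letter sequence
wrapped_60 = wrap_sequence(seq, 60)
wrapped_45 = wrap_sequence(seq, 45)

print(wrapped_60 == wrapped_45)                  # the text on disk differs
print(unwrap(wrapped_60) == unwrap(wrapped_45))  # the sequence does not
```

Reading either layout and writing it back out with a fixed width of 60 gives the same file, but not necessarily the file you started with.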
Occasionally preserving the original layout (with any quirks it may have) is important. See Section 5.4.2.2 about the get_raw() method of the Bio.SeqIO.index() dictionary-like object for one potential solution.
5.5.2 Converting between sequence file formats
In the previous example we used a list of SeqRecord objects as input to the Bio.SeqIO.write() function, but it will also accept a SeqRecord iterator like we get from Bio.SeqIO.parse(). This lets us do file conversion by combining these two functions.
For this example we'll read in the GenBank format file ls_orchid.gbk and write it out in FASTA format:
from Bio import SeqIO
records = SeqIO.parse("ls_orchid.gbk", "genbank")
count = SeqIO.write(records, "my_example.fasta", "fasta")
print("Converted %i records" % count)
Still, that is a little bit complicated. So, because file conversion is such a common task, there is a helper function letting you replace that with just:
from Bio import SeqIO
count = SeqIO.convert("ls_orchid.gbk", "genbank", "my_example.fasta", "fasta")
print("Converted %i records" % count)
The Bio.SeqIO.convert() function will take handles or filenames. Watch out though: if the output file already exists, it will overwrite it! To find out more, see the built in help:
>>> from Bio import SeqIO
>>> help(SeqIO.convert)
...
In principle, just by changing the filenames and the format names, this code could be used to convert between any file formats available in Biopython. However, writing some formats requires information (e.g. quality scores) which other file formats don't contain. For example, while you can turn a FASTQ file into a FASTA file, you can't do the reverse. See also Sections 20.1.9 and 20.1.10 in the cookbook chapter which look at inter-converting between different FASTQ formats.
Finally, as an added incentive for using the Bio.SeqIO.convert() function (on top of the fact your code will be shorter), doing it this way may also be faster! The reason for this is the convert function can take advantage of several file format specific optimisations and tricks.
5.5.3 Converting a file of sequences to their reverse complements
Suppose you had a file of nucleotide sequences, and you wanted to turn it into a file containing their reverse complement sequences. This time a little bit of work is required to transform the SeqRecord objects we get from our input file into something suitable for saving to our output file.
To start with, we'll use Bio.SeqIO.parse() to load some nucleotide sequences from a file, then print out their reverse complements using the Seq object's built in .reverse_complement() method (see Section 3.7):
>>> from Bio import SeqIO
>>> for record in SeqIO.parse("ls_orchid.gbk", "genbank"):
...     print(record.id)
...     print(record.seq.reverse_complement())
Now, if we want to save these reverse complements to a file, we'll need to make SeqRecord objects. We can use the SeqRecord object's built in .reverse_complement() method (see Section 4.9) but we must decide how to name our new records.
This is an excellent place to demonstrate the power of list comprehensions which make a list in memory:
>>> records = [rec.reverse_complement(id="rc_"+rec.id, description="reverse complement") \
...            for rec in SeqIO.parse("ls_orchid.fasta", "fasta")]
>>> len(records)
94
Now list comprehensions have a nice trick up their sleeves, you can add a conditional statement:
>>> records = [rec.reverse_complement(id="rc_"+rec.id, description="reverse complement") \
...            for rec in SeqIO.parse("ls_orchid.fasta", "fasta") if len(rec) < 700]
>>> len(records)
18
That would create an in memory list of reverse complement records where the sequence length was under 700 base pairs. However, we can do exactly the same with a generator expression, but with the advantage that this does not create a list of all the records in memory at once:
>>> records = (rec.reverse_complement(id="rc_"+rec.id, description="reverse complement") \
...            for rec in SeqIO.parse("ls_orchid.fasta", "fasta") if len(rec) < 700)
As a complete example:
>>> from Bio import SeqIO
>>> records = (rec.reverse_complement(id="rc_"+rec.id, description="reverse complement") \
...            for rec in SeqIO.parse("ls_orchid.fasta", "fasta") if len(rec) < 700)
>>> SeqIO.write(records, "rev_comp.fasta", "fasta")
18
There is a related example in Section 20.1.3, translating each record in a FASTA file from nucleotides to amino acids.
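One practical consequence of using a generator expression is that it can only be consumed once. The plain Python sketch below shows this with (name, sequence) tuples standing in for SeqRecord objects. The reverse_complement function here is a simplified stand-in which only handles unambiguous A, C, G and T letters, not the full set of codes the Biopython method deals with:

```python
# Translation table for unambiguous DNA letters only (a simplification).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a plain DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


# Made-up records as (name, sequence) tuples.
records = [("seq1", "AACG"), ("seq2", "ACGTACGTAC"), ("seq3", "GGA")]

# Generator expression with a condition: nothing is computed yet.
short_rc = (("rc_" + name, reverse_complement(seq))
            for name, seq in records if len(seq) < 5)

first_pass = list(short_rc)   # consumes the generator
second_pass = list(short_rc)  # the generator is now exhausted
print(first_pass)
print(second_pass)
```

So if you need both to count the records and to write them out, either build a list, or rely on the count returned by Bio.SeqIO.write() as in the complete example above.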
5.5.4 Getting your SeqRecord objects as formatted strings
Suppose that you don't really want to write your records to a file or handle; instead you want a string containing the records in a particular file format. The Bio.SeqIO interface is based on handles, but Python has a useful built in module which provides a string based handle.
For an example of how you might use this, let's load in a bunch of SeqRecord objects from our orchids GenBank file, and create a string containing the records in FASTA format:
from Bio import SeqIO
from io import StringIO  # on Python 2, use "from StringIO import StringIO" instead
records = SeqIO.parse("ls_orchid.gbk", "genbank")
out_handle = StringIO()
SeqIO.write(records, out_handle, "fasta")
fasta_data = out_handle.getvalue()
print(fasta_data)
This isn't entirely straightforward the first time you see it! On the bright side, for the special case where you would like a string containing a single record in a particular file format, use the SeqRecord class' format() method (see Section 4.6).
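If string based handles are new to you, this standard library sketch may help. It shows that any code written against a handle can write to, or read from, a StringIO object exactly as it would a file. The write_tab function and its data are made up for illustration (Python 3 syntax):

```python
from io import StringIO


def write_tab(rows, handle):
    """Write (name, sequence) pairs as tab separated lines to any handle."""
    count = 0
    for name, seq in rows:
        handle.write("%s\t%s\n" % (name, seq))
        count += 1
    return count


out_handle = StringIO()
write_tab([("seq1", "ACGT"), ("seq2", "GG")], out_handle)
tab_data = out_handle.getvalue()  # everything written so far, as one string

in_handle = StringIO(tab_data)  # a read handle backed by the same text
first_line = in_handle.readline()
print(repr(first_line))
```

The function never needs to know whether it was given a real file or a string based handle, which is the same property the Bio.SeqIO functions rely on.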
Note that although we don't encourage it, you can use the format() method to write to a file, for example something like this:
from Bio import SeqIO
with open("ls_orchid_long.tab", "w") as out_handle:
    for record in SeqIO.parse("ls_orchid.gbk", "genbank"):
        if len(record) > 100:
            out_handle.write(record.format("tab"))
While this style of code will work for a simple sequential file format like FASTA or the simple tab separated format used here, it will not work for more complex or interlaced file formats. This is why we still recommend using Bio.SeqIO.write(), as in the following example:
from Bio import SeqIO
records = (rec for rec in SeqIO.parse("ls_orchid.gbk", "genbank") if len(rec) > 100)
SeqIO.write(records, "ls_orchid.tab", "tab")
Making a single call to SeqIO.write(...) is also much quicker than multiple calls to the SeqRecord.format(...) method.
5.6 Low level FASTA and FASTQ parsers
Working with the low-level SimpleFastaParser or FastqGeneralIterator is often more practical than Bio.SeqIO.parse when dealing with large high-throughput FASTA or FASTQ sequencing files where speed matters. As noted in the introduction to this chapter, the file-format neutral Bio.SeqIO interface has the overhead of creating many objects even for simple formats like FASTA.
When parsing FASTA files, internally Bio.SeqIO.parse() calls the low-level SimpleFastaParser with the file handle. You can use this directly: it iterates over the file handle returning each record as a tuple of two strings, the title line (everything after the > character) and the sequence (as a plain string):
>>> from Bio.SeqIO.FastaIO import SimpleFastaParser
>>> count = 0
>>> total_len = 0
>>> with open("ls_orchid.fasta") as in_handle:
...     for title, seq in SimpleFastaParser(in_handle):
...         count += 1
...         total_len += len(seq)
...
>>> print("%i records with total sequence length %i" % (count, total_len))
94 records with total sequence length 67518
As long as you don't care about line wrapping (and you probably don't for short read high-throughput data), then outputting FASTA format from these strings is also very fast:
...
out_handle.write(">%s\n%s\n" % (title, seq))
...
Likewise, when parsing FASTQ files, internally Bio.SeqIO.parse() calls the low-level FastqGeneralIterator with the file handle. If you don't need the quality scores turned into integers, or can work with them as ASCII strings, this is ideal:
>>> from Bio.SeqIO.QualityIO import FastqGeneralIterator
>>> count = 0
>>> total_len = 0
>>> with open("example.fastq") as in_handle:
...     for title, seq, qual in FastqGeneralIterator(in_handle):
...         count += 1
...         total_len += len(seq)
...
>>> print("%i records with total sequence length %i" % (count, total_len))
3 records with total sequence length 75
There are more examples of this in the Cookbook (Chapter 20), including how to output FASTQ efficiently from strings using this code snippet:
...
out_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
...
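To make the "tuple of strings" idea concrete, here is a plain Python sketch of a generator which yields (title, sequence) tuples from a FASTA handle, combined with the fast string based output shown above. It is a simplified stand-in written for this illustration, not the code behind SimpleFastaParser, and the example data is made up:

```python
from io import StringIO


def simple_fasta_tuples(handle):
    """Yield (title, sequence) string tuples from a FASTA handle."""
    title, chunks = None, []
    for line in handle:
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:].rstrip(), []
        elif title is not None:
            chunks.append(line.strip())
    if title is not None:
        yield title, "".join(chunks)


in_handle = StringIO(">read1 first\nACGT\nAC\n>read2\nGG\n")
out_handle = StringIO()
for title, seq in simple_fasta_tuples(in_handle):
    if len(seq) >= 3:  # keep only the longer reads
        out_handle.write(">%s\n%s\n" % (title, seq))
filtered = out_handle.getvalue()
print(filtered)
```

No SeqRecord or Seq objects are created at any point, which is where the speed advantage of the low-level parsers comes from.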
Chapter 6  Multiple Sequence Alignment objects
This chapter is about Multiple Sequence Alignments, by which we mean a collection of multiple sequences which have been aligned together, usually with the insertion
of gap characters and the addition of leading or trailing gaps, such that all the sequence strings are the same length. Such an alignment can be regarded as a matrix of letters,
where each row is held as a SeqRecord object internally.
We will introduce the MultipleSeqAlignment object which holds this kind of data, and the Bio.AlignIO module for reading and writing them in various file formats
(following the design of the Bio.SeqIO module from the previous chapter). Note that both Bio.SeqIO and Bio.AlignIO can read and write sequence alignment files. The
appropriate choice will depend largely on what you want to do with the data.
The final part of this chapter is about our command line wrappers for common multiple sequence alignment tools like ClustalW and MUSCLE.
6.1  Parsing or Reading Sequence Alignments
We have two functions for reading in sequence alignments, Bio.AlignIO.read() and Bio.AlignIO.parse(), which following the convention introduced in Bio.SeqIO are for
files containing one or multiple alignments respectively.
Using Bio.AlignIO.parse() will return an iterator which gives MultipleSeqAlignment objects. Iterators are typically used in a for loop. Examples of situations where you
will have multiple different alignments include resampled alignments from the PHYLIP tool seqboot, or multiple pairwise alignments from the EMBOSS tools water or
needle, or Bill Pearson's FASTA tools.
However, in many situations you will be dealing with files which contain only a single alignment. In this case, you should use the Bio.AlignIO.read() function which
returns a single MultipleSeqAlignment object.
Both functions expect two mandatory arguments:
1. The first argument is a handle to read the data from, typically an open file (see Section 24.1), or a filename.
2. The second argument is a lower case string specifying the alignment format. As in Bio.SeqIO we don't try and guess the file format for you! See
http://biopython.org/wiki/AlignIO for a full listing of supported formats.
There is also an optional seq_count argument which is discussed in Section 6.1.3 below for dealing with ambiguous file formats which may contain more than one
alignment.
A further optional alphabet argument allows you to specify the expected alphabet. This can be useful as many alignment file formats do not explicitly label the
sequences as RNA, DNA or protein, which means Bio.AlignIO will default to using a generic alphabet.
6.1.1  Single Alignments
As an example, consider the following annotation-rich protein alignment in the PFAM or Stockholm file format:
#STOCKHOLM1.0
#=GSCOATB_BPIKE/3081ACP03620.1
#=GSCOATB_BPIKE/3081DRPDB;1ifl;152;
#=GSQ9T0Q8_BPIKE/152ACQ9T0Q8.1
#=GSCOATB_BPI22/3283ACP15416.1
#=GSCOATB_BPM13/2472ACP69541.1
#=GSCOATB_BPM13/2472DRPDB;2cpb;149;
#=GSCOATB_BPM13/2472DRPDB;2cps;149;
#=GSCOATB_BPZJ2/149ACP03618.1
#=GSQ9T0Q9_BPFD/149ACQ9T0Q9.1
#=GSQ9T0Q9_BPFD/149DRPDB;1nh4A;149;
#=GSCOATB_BPIF1/2273ACP03619.2
#=GSCOATB_BPIF1/2273DRPDB;1ifk;150;
COATB_BPIKE/3081AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA
#=GRCOATB_BPIKE/3081SSHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
Q9T0Q8_BPIKE/152AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA
COATB_BPI22/3283DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA
COATB_BPM13/2472AEGDDP...AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA
#=GRCOATB_BPM13/2472SSST...CHCHHHHCCCCTCCCTTCHHHHHHHHHHHHHHHHHHHHCTT
COATB_BPZJ2/149AEGDDP...AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA
Q9T0Q9_BPFD/149AEGDDP...AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA
#=GRQ9T0Q9_BPFD/149SS...HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
COATB_BPIF1/2273FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA
#=GRCOATB_BPIF1/2273SSXXHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
#=GCSS_consXHHHHHHHHHHHHHHHCHHHHHHHHCHHHHHHHHHHHHHHHHHHHHHHHC
#=GCseq_consAEssss...AptAhDSLpspAThIu.sWshVsslVsAsluIKLFKKFsSKA
//
This is the seed alignment for the Phage_Coat_Gp8 (PF05371) PFAM entry, downloaded from a now out of date release of PFAM from http://pfam.sanger.ac.uk/. We
can load this file as follows (assuming it has been saved to disk as PF05371_seed.sth in the current working directory):
>>> from Bio import AlignIO
>>> alignment = AlignIO.read("PF05371_seed.sth", "stockholm")
This code will print out a summary of the alignment:
>>> print(alignment)
SingleLetterAlphabet() alignment with 7 rows and 52 columns
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRL...SKACOATB_BPIKE/3081
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKL...SRAQ9T0Q8_BPIKE/152
DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRL...SKACOATB_BPI22/3283
AEGDDPAKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKL...SKACOATB_BPM13/2472
AEGDDPAKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKL...SKACOATB_BPZJ2/149
AEGDDPAKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKL...SKAQ9T0Q9_BPFD/149
FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKL...SRACOATB_BPIF1/2273
You'll notice in the above output the sequences have been truncated. We could instead write our own code to format this as we please by iterating over the rows as
SeqRecord objects:
>>> print("Alignment length %i" % alignment.get_alignment_length())
Alignment length 52
>>> for record in alignment:
...     print("%s - %s" % (record.seq, record.id))
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKACOATB_BPIKE/3081
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRAQ9T0Q8_BPIKE/152
DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKACOATB_BPI22/3283
AEGDDPAKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKACOATB_BPM13/2472
AEGDDPAKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKACOATB_BPZJ2/149
AEGDDPAKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKAQ9T0Q9_BPFD/149
FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRACOATB_BPIF1/2273
You could also use the alignment object's format method to show it in a particular file format; see Section 6.2.2 for details.
Did you notice in the raw file above that several of the sequences include database cross-references to the PDB and the associated known secondary structure? Try this:
>>> for record in alignment:
...     if record.dbxrefs:
...         print("%s %s" % (record.id, record.dbxrefs))
COATB_BPIKE/3081['PDB;1ifl;152;']
COATB_BPM13/2472['PDB;2cpb;149;','PDB;2cps;149;']
Q9T0Q9_BPFD/149['PDB;1nh4A;149;']
COATB_BPIF1/2273['PDB;1ifk;150;']
To have a look at all the sequence annotation, try this:
>>> for record in alignment:
...     print(record)
Sanger provide a nice web interface at http://pfam.sanger.ac.uk/family?acc=PF05371 which will actually let you download this alignment in several other formats. This
is what the file looks like in the FASTA file format:
>COATB_BPIKE/3081
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA
>Q9T0Q8_BPIKE/152
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA
>COATB_BPI22/3283
DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA
>COATB_BPM13/2472
AEGDDPAKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA
>COATB_BPZJ2/149
AEGDDPAKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA
>Q9T0Q9_BPFD/149
AEGDDPAKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA
>COATB_BPIF1/2273
FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA
Note the website should have an option about showing gaps as periods (dots) or dashes; we've shown dashes above. Assuming you download and save this as file
"PF05371_seed.faa" then you can load it with almost exactly the same code:
from Bio import AlignIO
alignment = AlignIO.read("PF05371_seed.faa", "fasta")
print(alignment)
All that has changed in this code is the filename and the format string. You'll get the same output as before; the sequences and record identifiers are the same. However,
as you should expect, if you check each SeqRecord there is no annotation nor database cross-references because these are not included in the FASTA file format.
Note that rather than using the Sanger website, you could have used Bio.AlignIO to convert the original Stockholm format file into a FASTA file yourself (see below).
With any supported file format, you can load an alignment in exactly the same way just by changing the format string. For example, use "phylip" for PHYLIP files,
"nexus" for NEXUS files or "emboss" for the alignments output by the EMBOSS tools. There is a full listing on the wiki page (http://biopython.org/wiki/AlignIO) and
in the built in documentation (also online):
>>> help(AlignIO)
...
6.1.2  Multiple Alignments
The previous section focused on reading files containing a single alignment. In general however, files can contain more than one alignment, and to read these files we
must use the Bio.AlignIO.parse() function.
Suppose you have a small alignment in PHYLIP format:
56
AlphaAACAAC
BetaAACCCC
GammaACCAAC
DeltaCCACCA
EpsilonCCAAAC
If you wanted to bootstrap a phylogenetic tree using the PHYLIP tools, one of the steps would be to create a set of many resampled alignments using the tool seqboot.
This would give output something like this, which has been abbreviated for conciseness:
56
AlphaAAACCA
BetaAAACCC
GammaACCCCA
DeltaCCCAAC
EpsilonCCCAAA
56
AlphaAAACAA
BetaAAACCC
GammaACCCAA
DeltaCCCACC
EpsilonCCCAAA
56
AlphaAAAAAC
BetaAAACCC
GammaAACAAC
DeltaCCCCCA
EpsilonCCCAAC
...
56
AlphaAAAACC
BetaACCCCC
GammaAAAACC
DeltaCCCCAA
EpsilonCAAACC
If you wanted to read this in using Bio.AlignIO you could use:
from Bio import AlignIO
alignments = AlignIO.parse("resampled.phy", "phylip")
for alignment in alignments:
    print(alignment)
    print("")
This would give the following output, again abbreviated for display:
AAACCAAlpha
AAACCCBeta
ACCCCAGamma
CCCAACDelta
CCCAAAEpsilon
AAACAAAlpha
AAACCCBeta
ACCCAAGamma
CCCACCDelta
CCCAAAEpsilon
AAAAACAlpha
AAACCCBeta
AACAACGamma
CCCCCADelta
CCCAACEpsilon
...
AAAACCAlpha
ACCCCCBeta
AAAACCGamma
CCCCAADelta
CAAACCEpsilon
As with the function Bio.SeqIO.parse(), using Bio.AlignIO.parse() returns an iterator. If you want to keep all the alignments in memory at once, which will allow you to
access them in any order, then turn the iterator into a list:
from Bio import AlignIO
alignments = list(AlignIO.parse("resampled.phy", "phylip"))
last_align = alignments[-1]
first_align = alignments[0]
6.1.3  Ambiguous Alignments
Many alignment file formats can explicitly store more than one alignment, and the division between each alignment is clear. However, when a general sequence file
format has been used there is no such block structure. The most common such situation is when alignments have been saved in the FASTA file format. For example
consider the following:
>Alpha
ACTACGACTAGCTCAGG
>Beta
ACTACCGCTAGCTCAGAAG
>Gamma
ACTACGGCTAGCACAGAAG
>Alpha
ACTACGACTAGCTCAGG
>Beta
ACTACCGCTAGCTCAGAAG
>Gamma
ACTACGGCTAGCACAGAAG
This could be a single alignment containing six sequences (with repeated identifiers). Or, judging from the identifiers, this is probably two different alignments each with
three sequences, which happen to all have the same length.
What about this next example?
>Alpha
ACTACGACTAGCTCAGG
>Beta
ACTACCGCTAGCTCAGAAG
>Alpha
ACTACGACTAGCTCAGG
>Gamma
ACTACGGCTAGCACAGAAG
>Alpha
ACTACGACTAGCTCAGG
>Delta
ACTACGGCTAGCACAGAAG
Again, this could be a single alignment with six sequences. However this time based on the identifiers we might guess this is three pairwise alignments which by chance
have all got the same lengths.
This final example is similar:
>Alpha
ACTACGACTAGCTCAGG
>XXX
ACTACCGCTAGCTCAGAAG
>Alpha
ACTACGACTAGCTCAGG
>YYY
ACTACGGCAAGCACAGG
>Alpha
ACTACGACTAGCTCAGG
>ZZZ
GGACTACGACAATAGCTCAGG
In this third example, because of the differing lengths, this cannot be treated as a single alignment containing all six records. However, it could be three pairwise
alignments.
Clearly trying to store more than one alignment in a FASTA file is not ideal. However, if you are forced to deal with these as input files Bio.AlignIO can cope with the
most common situation where all the alignments have the same number of records. One example of this is a collection of pairwise alignments, which can be produced by
the EMBOSS tools needle and water, although in this situation Bio.AlignIO should be able to understand their native output using "emboss" as the format string.
To interpret these FASTA examples as several separate alignments, we can use Bio.AlignIO.parse() with the optional seq_count argument which specifies how many
sequences are expected in each alignment (in these examples, 3, 2 and 2 respectively). For example, using the third example as the input data:
for alignment in AlignIO.parse(handle, "fasta", seq_count=2):
    print("Alignment length %i" % alignment.get_alignment_length())
    for record in alignment:
        print("%s - %s" % (record.seq, record.id))
    print("")
giving:
Alignmentlength19
ACTACGACTAGCTCAGGAlpha
ACTACCGCTAGCTCAGAAGXXX
Alignmentlength17
ACTACGGCAAGCACAGGYYY
Alignmentlength21
GGACTACGACAATAGCTCAGGZZZ
Using Bio.AlignIO.read() or Bio.AlignIO.parse() without the seq_count argument would give a single alignment containing all six records for the first two examples.
For the third example, an exception would be raised because the lengths differ, preventing them being turned into a single alignment.
If the file format itself has a block structure allowing Bio.AlignIO to determine the number of sequences in each alignment directly, then the seq_count argument is not
needed. If it is supplied, and doesn't agree with the file contents, an error is raised.
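Conceptually, seq_count just tells the parser to cut the stream of records into groups of a fixed size. The sketch below shows that grouping on plain (identifier, sequence) tuples using only the standard library. The helper name group_records and the toy records are hypothetical, and this is not a description of how Bio.AlignIO is implemented internally.

```python
from itertools import islice


def group_records(records, seq_count):
    """Yield lists of seq_count records; raise if some are left over."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, seq_count))
        if not batch:
            return
        if len(batch) != seq_count:
            raise ValueError("Expected %i records per alignment, found %i"
                             % (seq_count, len(batch)))
        yield batch


# Toy stand-ins for the six records of a file holding three pairwise alignments.
records = [("Alpha", "ACT"), ("XXX", "ACC"),
           ("Alpha", "ACT"), ("YYY", "ACG"),
           ("Alpha", "ACT"), ("ZZZ", "GGA")]
groups = list(group_records(records, 2))
print(len(groups))
```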
Note that this optional seq_count argument assumes each alignment in the file has the same number of sequences. Hypothetically you may come across stranger
situations, for example a FASTA file containing several alignments each with a different number of sequences, although I would love to hear of a real world example of
this. Assuming you cannot get the data in a nicer file format, there is no straightforward way to deal with this using Bio.AlignIO. In this case, you could consider reading
in the sequences themselves using Bio.SeqIO and batching them together to create the alignments as appropriate.
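What "as appropriate" means depends entirely on your data. As one possible rule, assumed here purely for illustration, you could start a new batch whenever an identifier already seen in the current batch turns up again. The sketch works on plain (identifier, sequence) tuples, and batch_on_repeat is a hypothetical helper rather than a Biopython function.

```python
def batch_on_repeat(records):
    """Group (identifier, sequence) tuples, starting a new batch whenever
    an identifier already present in the current batch appears again."""
    batch, seen = [], set()
    for name, seq in records:
        if name in seen:
            yield batch
            batch, seen = [], set()
        batch.append((name, seq))
        seen.add(name)
    if batch:
        yield batch


# Toy records: a three-row alignment followed by a two-row alignment.
records = [("Alpha", "ACT"), ("Beta", "ACC"), ("Gamma", "ACG"),
           ("Alpha", "ACT"), ("Delta", "CCA")]
batches = list(batch_on_repeat(records))
print([len(b) for b in batches])
```

Each batch could then be turned into SeqRecord objects and passed to MultipleSeqAlignment.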
6.2  Writing Alignments
We've talked about using Bio.AlignIO.read() and Bio.AlignIO.parse() for alignment input (reading files), and now we'll look at Bio.AlignIO.write() which is for
alignment output (writing files). This is a function taking three arguments: some MultipleSeqAlignment objects (or for backwards compatibility the obsolete Alignment
objects), a handle or filename to write to, and a sequence format.
Here is an example, where we start by creating a few MultipleSeqAlignment objects the hard way (by hand, rather than by loading them from a file). Note we create some
SeqRecord objects to construct the alignment from.
from Bio.Alphabet import generic_dna
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Align import MultipleSeqAlignment

# All rows of one alignment must have the same length, so gaps are written as "-".
align1 = MultipleSeqAlignment([
    SeqRecord(Seq("ACTGCTAGCTAG", generic_dna), id="Alpha"),
    SeqRecord(Seq("ACT-CTAGCTAG", generic_dna), id="Beta"),
    SeqRecord(Seq("ACTGCTAGDTAG", generic_dna), id="Gamma"),
])

align2 = MultipleSeqAlignment([
    SeqRecord(Seq("GTCAGC-AG", generic_dna), id="Delta"),
    SeqRecord(Seq("GACAGCTAG", generic_dna), id="Epsilon"),
    SeqRecord(Seq("GTCAGCTAG", generic_dna), id="Zeta"),
])

align3 = MultipleSeqAlignment([
    SeqRecord(Seq("ACTAGTACAGCTG", generic_dna), id="Eta"),
    SeqRecord(Seq("ACTAGTACAGCT-", generic_dna), id="Theta"),
    SeqRecord(Seq("-CTACTACAGGTG", generic_dna), id="Iota"),
])

my_alignments = [align1, align2, align3]
Now we have a list of Alignment objects, we'll write them to a PHYLIP format file:
from Bio import AlignIO
AlignIO.write(my_alignments, "my_example.phy", "phylip")
And if you open this file in your favourite text editor it should look like this:
 3 12
Alpha      ACTGCTAGCT AG
Beta       ACT-CTAGCT AG
Gamma      ACTGCTAGDT AG
 3 9
Delta      GTCAGC-AG
Epsilon    GACAGCTAG
Zeta       GTCAGCTAG
 3 13
Eta        ACTAGTACAG CTG
Theta      ACTAGTACAG CT-
Iota       -CTACTACAG GTG
It's more common to want to load an existing alignment, and save that, perhaps after some simple manipulation like removing certain rows or columns.
Suppose you wanted to know how many alignments the Bio.AlignIO.write() function wrote to the handle? If your alignments were in a list like the example above, you
could just use len(my_alignments), however you can't do that when your records come from a generator/iterator. Therefore the Bio.AlignIO.write() function returns the
number of alignments written to the file.
Note - If you tell the Bio.AlignIO.write() function to write to a file that already exists, the old file will be overwritten without any warning.
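If silent overwriting worries you, you can check for the file yourself before writing. This is a minimal sketch using only the standard library; check_new_file is a hypothetical helper name, not something Biopython provides, and another process could still create the file between the check and the write.

```python
import os


def check_new_file(filename):
    """Raise an error if filename already exists, else return it unchanged."""
    if os.path.exists(filename):
        raise IOError("Refusing to overwrite existing file %r" % filename)
    return filename


# Intended use (assumes my_alignments from the example above):
#     AlignIO.write(my_alignments, check_new_file("my_example.phy"), "phylip")
```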
6.2.1  Converting between sequence alignment file formats
Converting between sequence alignment file formats with Bio.AlignIO works in the same way as converting between sequence file formats with Bio.SeqIO
(Section 5.5.2). We generally load the alignment(s) using Bio.AlignIO.parse() and then save them using Bio.AlignIO.write(), or just use the
Bio.AlignIO.convert() helper function.
For this example, we'll load the PFAM/Stockholm format file used earlier and save it as a ClustalW format file:
from Bio import AlignIO
count = AlignIO.convert("PF05371_seed.sth", "stockholm", "PF05371_seed.aln", "clustal")
print("Converted %i alignments" % count)
Or, using Bio.AlignIO.parse() and Bio.AlignIO.write():
from Bio import AlignIO
alignments = AlignIO.parse("PF05371_seed.sth", "stockholm")
count = AlignIO.write(alignments, "PF05371_seed.aln", "clustal")
print("Converted %i alignments" % count)
The Bio.AlignIO.write() function expects to be given multiple alignment objects. In the example above we gave it the alignment iterator returned by
Bio.AlignIO.parse().
In this case, we know there is only one alignment in the file so we could have used Bio.AlignIO.read() instead, but notice we have to pass this alignment to
Bio.AlignIO.write() as a single element list:
from Bio import AlignIO
alignment = AlignIO.read("PF05371_seed.sth", "stockholm")
AlignIO.write([alignment], "PF05371_seed.aln", "clustal")
Either way, you should end up with the same new ClustalW format file "PF05371_seed.aln" with the following content:
CLUSTALX(1.81)multiplesequencealignment
COATB_BPIKE/3081AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSS
Q9T0Q8_BPIKE/152AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVS
COATB_BPI22/3283DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSS
COATB_BPM13/2472AEGDDPAKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS
COATB_BPZJ2/149AEGDDPAKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFAS
Q9T0Q9_BPFD/149AEGDDPAKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS
COATB_BPIF1/2273FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVS
COATB_BPIKE/3081KA
Q9T0Q8_BPIKE/152RA
COATB_BPI22/3283KA
COATB_BPM13/2472KA
COATB_BPZJ2/149KA
Q9T0Q9_BPFD/149KA
COATB_BPIF1/2273RA
Alternatively, you could make a PHYLIP format file which we'll name "PF05371_seed.phy":
from Bio import AlignIO
AlignIO.convert("PF05371_seed.sth", "stockholm", "PF05371_seed.phy", "phylip")
This time the output looks like this:
752
COATB_BPIKAEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSS
Q9T0Q8_BPIAEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVS
COATB_BPI2DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSS
COATB_BPM1AEGDDPAKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS
COATB_BPZJAEGDDPAKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFAS
Q9T0Q9_BPFAEGDDPAKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS
COATB_BPIFFAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVS
KA
RA
KA
KA
KA
KA
RA
One of the big handicaps of the original PHYLIP alignment file format is that the sequence identifiers are strictly truncated at ten characters. In this example, as you can
see, the resulting names are still unique but they are not very readable. As a result, a more relaxed variant of the original PHYLIP format is now quite widely used:
from Bio import AlignIO
AlignIO.convert("PF05371_seed.sth", "stockholm", "PF05371_seed.phy", "phylip-relaxed")
This time the output looks like this, using a longer indentation to allow all the identifiers to be given in full:
752
COATB_BPIKE/3081AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSS
Q9T0Q8_BPIKE/152AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVS
COATB_BPI22/3283DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSS
COATB_BPM13/2472AEGDDPAKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS
COATB_BPZJ2/149AEGDDPAKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFAS
Q9T0Q9_BPFD/149AEGDDPAKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS
COATB_BPIF1/2273FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVS
KA
RA
KA
KA
KA
KA
RA
If you have to work with the original strict PHYLIP format, then you may need to compress the identifiers somehow, or assign your own names or numbering system.
The following bit of code manipulates the record identifiers before saving the output:
from Bio import AlignIO
alignment = AlignIO.read("PF05371_seed.sth", "stockholm")
name_mapping = {}
for i, record in enumerate(alignment):
    name_mapping[i] = record.id
    record.id = "seq%i" % i
print(name_mapping)
AlignIO.write([alignment], "PF05371_seed.phy", "phylip")
This code used a Python dictionary to record a simple mapping from the new sequence system to the original identifier:
{0:'COATB_BPIKE/3081',1:'Q9T0Q8_BPIKE/152',2:'COATB_BPI22/3283',...}
Here is the new (strict) PHYLIP format output:
752
seq0AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSS
seq1AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVS
seq2DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSS
seq3AEGDDPAKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS
seq4AEGDDPAKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFAS
seq5AEGDDPAKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS
seq6FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVS
KA
RA
KA
KA
KA
KA
RA
In general, because of the identifier limitation, working with strict PHYLIP file formats shouldn't be your first choice. Using the PFAM/Stockholm format on the other
hand allows you to record a lot of additional annotation too.
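If you do go the renaming route, the name_mapping dictionary recorded earlier is what lets you translate the short names back once a downstream tool has finished with the file. The following is a minimal sketch on plain strings; restore_names is a hypothetical helper and the identifiers are made-up placeholders, not taken from the PFAM file.

```python
def restore_names(short_names, name_mapping):
    """Translate 'seq<i>' placeholders back to the original identifiers."""
    # Strip the leading "seq" and use the remaining number as the dictionary key.
    return [name_mapping[int(name[3:])] for name in short_names]


# Placeholder mapping in the same shape as the dictionary built above.
name_mapping = {0: "long_identifier_A/1-50", 1: "long_identifier_B/3-52"}
restored = restore_names(["seq1", "seq0"], name_mapping)
print(restored)
```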
6.2.2  Getting your alignment objects as formatted strings
The Bio.AlignIO interface is based on handles, which means if you want to get your alignment(s) into a string in a particular file format you need to do a little bit more
work (see below). However, you will probably prefer to take advantage of the alignment object's format() method. This takes a single mandatory argument, a lower case
string which is supported by Bio.AlignIO as an output format. For example:
print(alignment.format("clustal"))
As described in Section 4.6, the SeqRecord object has a similar method using output formats supported by Bio.SeqIO.
Internally the format() method is using the StringIO string based handle and calling Bio.AlignIO.write(). You can do this in your own code if for example you are using
an older version of Biopython:
from Bio import AlignIO
from StringIO import StringIO
alignments = AlignIO.parse("PF05371_seed.sth", "stockholm")
out_handle = StringIO()
AlignIO.write(alignments, out_handle, "clustal")
clustal_data = out_handle.getvalue()
print(clustal_data)
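The string based handle idea is not specific to Biopython. The sketch below shows the same write-then-getvalue pattern using only the standard library. Note that it imports StringIO from the io module, which is where the class lives on Python 3, rather than from the Python 2 StringIO module used above; the text written is just a stand-in for real alignment output.

```python
from io import StringIO

# Write to an in-memory handle exactly as you would to an open file.
out_handle = StringIO()
out_handle.write("stand-in for alignment text\n")
text = out_handle.getvalue()

# The reverse direction: wrap an existing string so it can be read as a handle.
in_handle = StringIO(text)
first_line = in_handle.readline()
print(first_line)
```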
6.3  Manipulating Alignments
Now that we've covered loading and saving alignments, we'll look at what else you can do with them.
6.3.1  Slicing alignments
First of all, in some senses the alignment objects act like a Python list of SeqRecord objects (the rows). With this model in mind hopefully the actions of len() (the
number of rows) and iteration (each row as a SeqRecord) make sense:
>>> print("Number of rows: %i" % len(alignment))
Number of rows: 7
>>> for record in alignment:
...     print("%s - %s" % (record.seq, record.id))
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKACOATB_BPIKE/3081
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRAQ9T0Q8_BPIKE/152
DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKACOATB_BPI22/3283
FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRACOATB_BPIF1/2273
You can also use the list-like append and extend methods to add more rows to the alignment (as SeqRecord objects). Keeping the list metaphor in mind, simple slicing of
the alignment should also make sense: it selects some of the rows giving back another alignment object:
>>> print(alignment)
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRL...SKACOATB_BPIKE/3081
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKL...SRAQ9T0Q8_BPIKE/152
DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRL...SKACOATB_BPI22/3283
>>> print(alignment[3:7])
What if you wanted to select by column? Those of you who have used the NumPy matrix or array objects won't be surprised at this: you use a double index.
>>> print(alignment[2, 6])
T
Using two integer indices pulls out a single letter, shorthand for this:
>>> print(alignment[2].seq[6])
T
You can pull out a single column as a string like this:
>>> print(alignment[:, 6])
TTT---T
You can also select a range of columns. For example, to pick out three of the rows we extracted earlier, but take just their first six columns:
>>> print(alignment[3:6, :6])
AEGDDPCOATB_BPM13/2472
AEGDDPCOATB_BPZJ2/149
AEGDDPQ9T0Q9_BPFD/149
Leaving the first index as : means take all the rows:
>>> print(alignment[:, :6])
AEPNAACOATB_BPIKE/3081
AEPNAAQ9T0Q8_BPIKE/152
DGTSTACOATB_BPI22/3283
AEGDDPCOATB_BPM13/2472
AEGDDPCOATB_BPZJ2/149
AEGDDPQ9T0Q9_BPFD/149
FAADDACOATB_BPIF1/2273
This brings us to a neat way to remove a section. Notice columns 7, 8 and 9 which are gaps in three of the seven sequences:
>>> print(alignment[:, 6:9])
TNYCOATB_BPIKE/3081
TNYQ9T0Q8_BPIKE/152
TSYCOATB_BPI22/3283
COATB_BPM13/2472
COATB_BPZJ2/149
Q9T0Q9_BPFD/149
TSQCOATB_BPIF1/2273
Again, you can slice to get everything after the ninth column:
>>> print(alignment[:, 9:])
ATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKACOATB_BPIKE/3081
ATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRAQ9T0Q8_BPIKE/152
ATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKACOATB_BPI22/3283
AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKACOATB_BPM13/2472
AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKACOATB_BPZJ2/149
AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKAQ9T0Q9_BPFD/149
AKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRACOATB_BPIF1/2273
Now, the interesting thing is that addition of alignment objects works by column. This lets you do this as a way to remove a block of columns:
>>> edited = alignment[:, :6] + alignment[:, 9:]
>>> print(edited)
AEPNAAATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKACOATB_BPIKE/3081
AEPNAAATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRAQ9T0Q8_BPIKE/152
DGTSTAATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKACOATB_BPI22/3283
FAADDAAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRACOATB_BPIF1/2273
Another common use of alignment addition would be to combine alignments for several different genes into a meta-alignment. Watch out though: the identifiers need to
match up (see Section 4.8 for how adding SeqRecord objects works). You may find it helpful to first sort the alignment rows alphabetically by id:
>>> edited.sort()
>>> print(edited)
DGTSTAATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKACOATB_BPI22/3283
FAADDAAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRACOATB_BPIF1/2273
AEPNAAATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKACOATB_BPIKE/3081
AEPNAAATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRAQ9T0Q8_BPIKE/152
Note that you can only add two alignments together if they have the same number of rows.
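The column removal used earlier in this section (keep the first six columns, skip three, keep the rest) is ordinary slicing applied to every row. As a rough illustration on plain strings rather than alignment objects, with two toy rows whose names and gap positions are assumed for the example:

```python
# Toy rows: columns 7 to 9 (indices 6, 7 and 8) are gaps in the second row.
rows = {
    "row1": "AEPNAATNYATEAM",
    "row2": "AEGDDP---AKAAF",
}

# Keep columns 0-5 and everything from column 9 onwards, for every row.
edited = dict((name, seq[:6] + seq[9:]) for name, seq in rows.items())
for name in sorted(edited):
    print("%s %s" % (edited[name], name))
```

With real alignment objects the slicing and addition also carry the row identifiers along, which is why the identifiers need to match up when adding.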
6.3.2  Alignments as arrays
Depending on what you are doing, it can be more useful to turn the alignment object into an array of letters, and you can do this with NumPy:
>>> import numpy as np
>>> align_array = np.array([list(rec) for rec in alignment], np.character)
>>> print("Array shape %i by %i" % align_array.shape)
Array shape 7 by 52
If you will be working heavily with the columns, you can tell NumPy to store the array by column (as in Fortran) rather than its default of by row (as in C):
>>> align_array = np.array([list(rec) for rec in alignment], np.character, order="F")
Note that this leaves the original Biopython alignment object and the NumPy array in memory as separate objects; editing one will not update the other!
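For small alignments, or where NumPy is not installed, a column-wise view can also be had in plain Python by transposing the rows with zip. This is only a sketch of the "matrix of letters" idea using the standard library; the three short rows are toy data.

```python
from collections import Counter

# Toy rows standing in for the sequence strings of an alignment.
rows = ["AEPNAA", "AEGDDP", "DGTSTA"]

# zip(*rows) transposes the matrix: one tuple of letters per column.
columns = ["".join(col) for col in zip(*rows)]

# Count the letters in each column, for example to find the commonest one.
counts = [Counter(col) for col in columns]
print(columns[0])
print(counts[0].most_common(1))
```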
6.4  Alignment Tools
There are lots of algorithms out there for aligning sequences, both pairwise alignments and multiple sequence alignments. These calculations are relatively slow, and you
generally wouldn't want to write such an algorithm in Python. For pairwise alignments Biopython contains the Bio.pairwise2 module (see Section 6.4.6), which is
supplemented by functions written in C for speed enhancements. In addition, you can use Biopython to invoke a command line tool on your behalf. Normally you would:
1. Prepare an input file of your unaligned sequences, typically this will be a FASTA file which you might create using Bio.SeqIO (see Chapter 5).
2. Call the command line tool to process this input file, typically via one of Biopython's command line wrappers (which we'll discuss here).
3. Read the output from the tool, i.e. your aligned sequences, typically using Bio.AlignIO (see earlier in this chapter).
All the command line wrappers we're going to talk about in this chapter follow the same style. You create a command line object specifying the options (e.g. the input
filename and the output filename), then invoke this command line via a Python operating system call (e.g. using the subprocess module).
Most of these wrappers are defined in the Bio.Align.Applications module:
>>> import Bio.Align.Applications
>>> dir(Bio.Align.Applications)
...
['ClustalwCommandline', 'DialignCommandline', 'MafftCommandline', 'MuscleCommandline',
'PrankCommandline', 'ProbconsCommandline', 'TCoffeeCommandline' ...]
(Ignore the entries starting with an underscore; these have special meaning in Python.) The module Bio.Emboss.Applications has wrappers for some of the EMBOSS
suite, including needle and water, which are described below in Section 6.4.5, and wrappers for the EMBOSS packaged versions of the PHYLIP tools (which EMBOSS
refer to as one of their EMBASSY packages: third party tools with an EMBOSS style interface). We won't explore all these alignment tools here in the section, just a
sample, but the same principles apply.
6.4.1  ClustalW
ClustalW is a popular command line tool for multiple sequence alignment (there is also a graphical interface called ClustalX). Biopython's Bio.Align.Applications
module has a wrapper for this alignment tool (and several others).
Before trying to use ClustalW from within Python, you should first try running the ClustalW tool yourself by hand at the command line, to familiarise yourself with the other
options. You'll find the Biopython wrapper is very faithful to the actual command line API:
>>> from Bio.Align.Applications import ClustalwCommandline
>>> help(ClustalwCommandline)
...
For the most basic usage, all you need is to have a FASTA input file, such as opuntia.fasta (available online or in the Doc/examples subdirectory of the Biopython source
code). This is a small FASTA file containing seven prickly-pear DNA sequences (from the cactus family Opuntia).
By default ClustalW will generate an alignment and guide tree file with names based on the input FASTA file, in this case opuntia.aln and opuntia.dnd, but you can
override this or make it explicit:
>>> cline = ClustalwCommandline("clustalw2", infile="opuntia.fasta")
>>> print(cline)
clustalw2 -infile=opuntia.fasta
Notice here we have given the executable name as clustalw2, indicating we have version two installed, which has a different filename to version one (clustalw, the
default). Fortunately both versions support the same set of arguments at the command line (and indeed, should be functionally identical).
You may find that even though you have ClustalW installed, the above command doesn't work: you may get a message about "command not found" (especially on
Windows). This indicates that the ClustalW executable is not on your PATH (an environment variable, a list of directories to be searched). You can either update your
PATH setting to include the location of your copy of ClustalW tools (how you do this will depend on your OS), or simply type in the full path of the tool. For example:
>>> import os
>>> clustalw_exe = r"C:\Program Files\newclustal\clustalw2.exe"
>>> clustalw_cline = ClustalwCommandline(clustalw_exe, infile="opuntia.fasta")
>>> assert os.path.isfile(clustalw_exe), "ClustalW executable missing"
>>> stdout, stderr = clustalw_cline()
Remember, in Python strings \n and \t are by default interpreted as a new line and a tab, which is why we put a letter "r" at the start for a raw string that isn't
translated in this way. This is generally good practice when specifying a Windows style file name.
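You can see the difference for yourself with a couple of lines at the Python prompt. The path used here is made up for the demonstration:

```python
# Without the r prefix, \n and \t are turned into a newline and a tab.
plain = "C:\newclustal\tools"
# With the r prefix every backslash is kept as a literal character.
raw = r"C:\newclustal\tools"

print("\n" in plain)  # the plain string contains a newline character
print("\n" in raw)    # the raw string does not
```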
Internally this uses the subprocess module which is now the recommended way to run another program in Python. This replaces older options like the os.system() and
the os.popen* functions.
Now, at this point it helps to know about how command line tools "work". When you run a tool at the command line, it will often print text output directly to screen. This
text can be captured or redirected, via two "pipes", called standard output (the normal results) and standard error (for error messages and debug messages). There is also
standard input, which is any text fed into the tool. These names get shortened to stdin, stdout and stderr. When the tool finishes, it has a return code (an integer), which by
convention is zero for success.
When you run the command line tool like this via the Biopython wrapper, it will wait for it to finish, and check the return code. If this is non zero (indicating an error), an
exception is raised. The wrapper then returns two strings, stdout and stderr.
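To make the three ideas concrete (stdout, stderr and the return code), here is a small sketch using the subprocess module directly. It is not the Biopython wrapper's own code. So that it runs anywhere, it uses the Python interpreter itself as a stand-in for an alignment tool, and the messages printed by the child process are made up.

```python
import subprocess
import sys

# A tiny child program: one line on stdout, one line on stderr.
child_code = "import sys; print('result'); sys.stderr.write('log message\\n')"

child = subprocess.Popen([sys.executable, "-c", child_code],
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE,
                         universal_newlines=True)
# communicate() waits for the child to finish and returns both pipes as strings.
stdout, stderr = child.communicate()
return_code = child.returncode

# By convention zero means success; anything else is treated as an error.
if return_code != 0:
    raise RuntimeError("Tool failed with return code %i" % return_code)
print(stdout)
```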
In the case of ClustalW, when run at the command line all the important output is written directly to the output files. Everything normally printed to screen while you wait
(via stdout or stderr) is boring and can be ignored (assuming it worked).
What we care about are the two output files, the alignment and the guide tree. We didn't tell ClustalW what filenames to use, but it defaults to picking names based on the
input file. In this case the output should be in the file opuntia.aln. You should be able to work out how to read in the alignment using Bio.AlignIO by now:
>>> from Bio import AlignIO
>>> align = AlignIO.read("opuntia.aln", "clustal")
>>> print(align)
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGAgi|6273285|gb|AF191659.1|AF191
TATACATAAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGAgi|6273286|gb|AF191660.1|AF191
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGAgi|6273290|gb|AF191664.1|AF191
In case you are interested (and this is an aside from the main thrust of this chapter), the opuntia.dnd file ClustalW creates is just a standard Newick tree file, and
Bio.Phylo can parse these:
>>> from Bio import Phylo
>>> tree = Phylo.read("opuntia.dnd", "newick")
>>> Phylo.draw_ascii(tree)
_______________gi|6273291|gb|AF191665.1|AF191665
__________________________|
||______gi|6273290|gb|AF191664.1|AF191664
||__|
||_____gi|6273289|gb|AF191663.1|AF191663
|
_|_________________gi|6273287|gb|AF191661.1|AF191661
|
|__________gi|6273286|gb|AF191660.1|AF191660
|
|__gi|6273285|gb|AF191659.1|AF191659
|___|
|gi|6273284|gb|AF191658.1|AF191658
<BLANKLINE>
Chapter 13 covers Biopython's support for phylogenetic trees in more depth.
6.4.2  MUSCLE
MUSCLE is a more recent multiple sequence alignment tool than ClustalW, and Biopython also has a wrapper for it under the Bio.Align.Applications module. As
before, we recommend you try using MUSCLE from the command line before trying it from within Python, as the Biopython wrapper is very faithful to the actual
command line API:
>>> from Bio.Align.Applications import MuscleCommandline
>>> help(MuscleCommandline)
...
For the most basic usage, all you need is to have a FASTA input file, such as opuntia.fasta (available online or in the Doc/examples subdirectory of the Biopython source
code). You can then tell MUSCLE to read in this FASTA file, and write the alignment to an output file:
>>> cline = MuscleCommandline(input="opuntia.fasta", out="opuntia.txt")
>>> print(cline)
muscle -in opuntia.fasta -out opuntia.txt
Note that MUSCLE uses "-in" and "-out" but in Biopython we have to use "input" and "out" as the keyword arguments or property names. This is because "in" is a
reserved word in Python.
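If you are ever unsure whether a tool's option name clashes with the Python language in this way, the standard library's keyword module can tell you. A quick check:

```python
import keyword

# "in" is part of the Python language, so it cannot be a keyword argument name.
in_is_reserved = keyword.iskeyword("in")
# "input" and "out" are ordinary names, so they are fine as argument names.
input_is_reserved = keyword.iskeyword("input")
out_is_reserved = keyword.iskeyword("out")

print(in_is_reserved, input_is_reserved, out_is_reserved)
```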
By default MUSCLE will output the alignment as a FASTA file (using gapped sequences). The Bio.AlignIO module should be able to read this alignment using
format="fasta". You can also ask for ClustalW-like output:
>>> cline = MuscleCommandline(input="opuntia.fasta", out="opuntia.aln", clw=True)
>>> print(cline)
muscle -in opuntia.fasta -out opuntia.aln -clw
Or, strict ClustalW output where the original ClustalW header line is used for maximum compatibility:
>>> cline = MuscleCommandline(input="opuntia.fasta", out="opuntia.aln", clwstrict=True)
>>> print(cline)
muscle -in opuntia.fasta -out opuntia.aln -clwstrict
The Bio.AlignIO module should be able to read these alignments using format="clustal".
MUSCLE can also output in GCG MSF format (using the msf argument), but Biopython can't currently parse that, or using HTML which would give a human readable
web page (not suitable for parsing).
You can also set the other optional parameters, for example the maximum number of iterations. See the built in help for details.
You would then run the MUSCLE command line as described above for ClustalW, and parse the output using Bio.AlignIO to get an alignment object.
6.4.3 MUSCLE using stdout
Using a MUSCLE command line as in the examples above will write the alignment to a file. This means there will be no important information written to the standard out (stdout) or standard error (stderr) handles. However, by default MUSCLE will write the alignment to standard output (stdout). We can take advantage of this to avoid having a temporary output file! For example:
>>> muscle_cline = MuscleCommandline(input="opuntia.fasta")
>>> print(muscle_cline)
muscle -in opuntia.fasta
If we run this via the wrapper, we get back the output as a string. In order to parse this we can use StringIO to turn it into a handle. Remember that MUSCLE defaults to using FASTA as the output format:
>>> stdout, stderr = muscle_cline()
>>> from StringIO import StringIO  # Python 2; on Python 3 use "from io import StringIO"
>>> from Bio import AlignIO
>>> align = AlignIO.read(StringIO(stdout), "fasta")
>>> print(align)
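As a Biopython-free illustration of the StringIO step, the sketch below wraps a string in a handle and reads it back record by record. The FASTA text is made up, and read_fasta is a hypothetical, deliberately minimal reader standing in for AlignIO.read; it is not how Biopython parses FASTA.

```python
from io import StringIO

# Made-up gapped FASTA text, standing in for the stdout string from a tool.
fasta_text = ">alpha\nACGT-A\n>beta\nAC-TTA\n"

# StringIO turns the string into a file-like handle.
handle = StringIO(fasta_text)


def read_fasta(handle):
    """Return (name, sequence) pairs from a FASTA handle (minimal sketch)."""
    records = []
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


records = read_fasta(handle)
print(records)
```

Any function that accepts an open file will accept such a handle, which is what lets the captured output be parsed without a temporary file.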
The above approach is fairly simple, but if you are dealing with very large output text the fact that all of stdout and stderr is loaded into memory as a string can be a potential drawback. Using the subprocess module we can work directly with handles instead:
>>> import subprocess
>>> import sys
>>> child = subprocess.Popen(str(muscle_cline),
...                          stdout=subprocess.PIPE,
...                          stderr=subprocess.PIPE,
...                          universal_newlines=True,
...                          shell=(sys.platform != "win32"))
>>> align = AlignIO.read(child.stdout, "fasta")
>>> print(align)
6.4.4 MUSCLE using stdin and stdout
We don't actually need to have our FASTA input sequences prepared in a file, because by default MUSCLE will read in the input sequences from standard input! Note this is a bit more advanced and fiddly, so don't bother with this technique unless you need to.
First, we'll need some unaligned sequences in memory as SeqRecord objects. For this demonstration I'm going to use a filtered version of the original FASTA file (using a generator expression), taking just six of the seven sequences:
>>> records = (r for r in SeqIO.parse("opuntia.fasta", "fasta") if len(r) < 900)
Then we create the MUSCLE command line, leaving the input and output to their defaults (stdin and stdout). I'm also going to ask for strict ClustalW format for the output.
>>> muscle_cline = MuscleCommandline(clwstrict=True)
>>> print(muscle_cline)
muscle -clwstrict
Now for the fiddly bits using the subprocess module, stdin and stdout:
>>> import subprocess
>>> import sys
>>> child = subprocess.Popen(str(muscle_cline),
...                          stdin=subprocess.PIPE,
...                          stdout=subprocess.PIPE,
...                          stderr=subprocess.PIPE,
...                          universal_newlines=True,
...                          shell=(sys.platform != "win32"))
That should start MUSCLE, but it will be sitting waiting for its FASTA input sequences, which we must supply via its stdin handle:
>>> SeqIO.write(records, child.stdin, "fasta")
6
>>> child.stdin.close()
After writing the six sequences to the handle, MUSCLE will still be waiting to see if that is all the FASTA sequences or not, so we must signal that this is all the input data by closing the handle. At that point MUSCLE should start to run, and we can ask for the output:
>>> align = AlignIO.read(child.stdout, "clustal")
>>> print(align)
Wow! There we are with a new alignment of just the six records, without having created a temporary FASTA input file, or a temporary alignment output file. However, a word of caution: dealing with errors with this style of calling external programs is much more complicated. It also becomes far harder to diagnose problems, because you can't try running MUSCLE manually outside of Biopython (because you don't have the input file to supply). There can also be subtle cross platform issues (e.g. Windows versus Linux, Python 2 versus Python 3), and how you run your script can have an impact (e.g. at the command line, from IDLE or an IDE, or as a GUI script). These are all generic Python issues though, and not specific to Biopython.
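The write, close, then read pattern can be tried without MUSCLE installed. In this sketch a child Python process stands in for the external tool; the one-line child program (which simply upper-cases its input) and the FASTA text are assumptions made for the illustration.

```python
import subprocess
import sys

# Stand-in for an external tool: read all of stdin, write it back upper-cased.
child_code = "import sys; sys.stdout.write(sys.stdin.read().upper())"

child = subprocess.Popen(
    [sys.executable, "-c", child_code],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    universal_newlines=True,
)

# Supply the input via the child's stdin handle...
child.stdin.write(">seq1\nacgt\n>seq2\nacga\n")
# ...and close it, which is the only way the child learns the input is complete.
child.stdin.close()

# Now the child can finish, and its output can be read from the stdout handle.
output = child.stdout.read()
errors = child.stderr.read()
child.stdout.close()
child.stderr.close()
return_code = child.wait()
print(output)
```

With large inputs or outputs this manual approach can deadlock when a pipe buffer fills, in which case child.communicate(input=...) is the safer standard library call, at the cost of holding everything in memory.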
If you find working directly with subprocess like this scary, there is an alternative. If you execute the tool with muscle_cline() you can supply any standard input as a big string, muscle_cline(stdin=...). So, provided your data isn't very big, you can prepare the FASTA input in memory as a string using StringIO (see Section 24.1):
>>> records = (r for r in SeqIO.parse("opuntia.fasta", "fasta") if len(r) < 900)
>>> handle = StringIO()
>>> SeqIO.write(records, handle, "fasta")
6
>>> data = handle.getvalue()
You can then run the tool and parse the alignment as follows:
>>> stdout, stderr = muscle_cline(stdin=data)
>>> align = AlignIO.read(StringIO(stdout), "clustal")
>>> print(align)
You might find this easier, but it does require more memory (RAM) for the strings used for the input FASTA and output Clustal formatted data.
6.4.5 EMBOSS needle and water
The EMBOSS suite includes the water and needle tools for Smith-Waterman algorithm local alignment, and Needleman-Wunsch global alignment. The tools share the same style interface, so switching between the two is trivial; we'll just use needle here.
Suppose you want to do a global pairwise alignment between two sequences, prepared in FASTA format as follows:
>HBA_HUMAN
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHG
KKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTP
AVHASLDKFLASVSTVLTSKYR
in a file alpha.faa, and secondly in a file beta.faa:
>HBB_HUMAN
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK
VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG
KEFTPPVQAAYQKVVAGVANALAHKYH
You can find copies of these example files with the Biopython source code under the Doc/examples/ directory.
Let's start by creating a complete needle command line object in one go:
>>> from Bio.Emboss.Applications import NeedleCommandline
>>> needle_cline = NeedleCommandline(asequence="alpha.faa", bsequence="beta.faa",
...                                  gapopen=10, gapextend=0.5, outfile="needle.txt")
>>> print(needle_cline)
needle -outfile=needle.txt -asequence=alpha.faa -bsequence=beta.faa -gapopen=10 -gapextend=0.5
Why not try running this by hand at the command prompt? You should see it does a pairwise comparison and records the output in the file needle.txt (in the default EMBOSS alignment file format).
Even if you have EMBOSS installed, running this command may not work: you might get a message about "command not found" (especially on Windows). This probably means that the EMBOSS tools are not on your PATH environment variable. You can either update your PATH setting, or simply tell Biopython the full path to the tool, for example:
>>> needle_cline = NeedleCommandline(r"C:\EMBOSS\needle.exe",
...                                  asequence="alpha.faa", bsequence="beta.faa",
...                                  gapopen=10, gapextend=0.5, outfile="needle.txt")
Remember in Python that for a default string \n or \t means a new line or a tab, which is why we've put a letter "r" at the start for a raw string.
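The effect of the raw string prefix is easy to see in isolation. This small sketch uses a made-up Windows path (C:\temp\needle.exe, chosen because it contains both \t and \n) and compares the three ways of writing it.

```python
# Without the r prefix, \t becomes a tab and \n becomes a newline.
plain = "C:\temp\needle.exe"

# With the r prefix, backslashes are kept literally.
raw = r"C:\temp\needle.exe"

# Doubling each backslash is the equivalent non-raw spelling.
escaped = "C:\\temp\\needle.exe"

print(repr(plain))
print(repr(raw))
print(raw == escaped)
```

The plain version silently produces a different path, which is why a raw string (or forward slashes) is the safer habit for Windows paths.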
At this point it might help to try running the EMBOSS tools yourself by hand at the command line, to familiarise yourself with the other options and compare them to the Biopython help text:
>>> help(NeedleCommandline)
...
Note that you can also specify (or change or look at) the settings like this:
>>> needle_cline = NeedleCommandline()
>>> needle_cline.asequence = "alpha.faa"
>>> needle_cline.bsequence = "beta.faa"
>>> needle_cline.gapopen = 10
>>> needle_cline.gapextend = 0.5
>>> needle_cline.outfile = "needle.txt"
>>> print(needle_cline)
needle -outfile=needle.txt -asequence=alpha.faa -bsequence=beta.faa -gapopen=10 -gapextend=0.5
>>> print(needle_cline.outfile)
needle.txt
Next we want to use Python to run this command for us. As explained above, for full control, we recommend you use the built-in Python subprocess module, but for simple usage the wrapper object usually suffices:
>>> stdout, stderr = needle_cline()
>>> print(stdout + stderr)
Needleman-Wunsch global alignment of two sequences
Next we can load the output file with Bio.AlignIO as discussed earlier in this chapter, as the emboss format:
>>> align = AlignIO.read("needle.txt", "emboss")
>>> print(align)
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTY...KYRHBA_HUMAN
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRF...KYHHBB_HUMAN
In this example, we told EMBOSS to write the output to a file, but you can tell it to write the output to stdout instead (useful if you don't want a temporary output file to get rid of; use stdout=True rather than the outfile argument), and also to read one of the inputs from stdin (e.g. asequence="stdin", much like in the MUSCLE example in the section above).
This has only scratched the surface of what you can do with needle and water. One useful trick is that the second file can contain multiple sequences (say five), and then EMBOSS will do five pairwise alignments.
6.4.6 Biopython's pairwise2
Biopython has its own module to make local and global pairwise alignments, Bio.pairwise2. This module contains essentially the same algorithms as water (local) and needle (global) from the EMBOSS suite (see above) and should return the same results.
Suppose you want to do a global pairwise alignment between the same two hemoglobin sequences from above (HBA_HUMAN, HBB_HUMAN) stored in alpha.faa and beta.faa:
>>> from Bio import pairwise2
>>> seq1 = SeqIO.read("alpha.faa", "fasta")
>>> seq2 = SeqIO.read("beta.faa", "fasta")
>>> alignments = pairwise2.align.globalxx(seq1.seq, seq2.seq)
As you see, we call the alignment function with align.globalxx. The tricky part is the last two letters of the function name (here: xx), which encode the scores and penalties for matches (and mismatches) and gaps. The first letter encodes the match score, e.g. x means that a match counts 1 while mismatches have no costs. With m, general values for either matches or mismatches can be defined (for more options see Biopython's API). The second letter encodes the cost for gaps: x means no gap costs at all, while with s different penalties for opening and extending a gap can be assigned. So, globalxx means that only matches between both sequences are counted.
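To make the xx scoring scheme concrete, here is a minimal dynamic programming sketch of the score it implies (match = 1, mismatch = 0, gaps free). It is an independent illustration written for this section, not the Bio.pairwise2 implementation; the function name global_xx_score and the short test sequences are made up. With these scores the optimal global score equals the length of the longest common subsequence.

```python
def global_xx_score(seq_a, seq_b):
    """Best global alignment score with match=1, mismatch=0 and free gaps."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # score[i][j] is the best score for aligning seq_a[:i] with seq_b[:j].
    # The first row and column stay 0 because gaps cost nothing.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            match = 1 if seq_a[i - 1] == seq_b[j - 1] else 0
            score[i][j] = max(
                score[i - 1][j - 1] + match,  # align the two residues
                score[i - 1][j],              # gap in seq_b
                score[i][j - 1],              # gap in seq_a
            )
    return score[-1][-1]


print(global_xx_score("ACCGT", "ACG"))
```

Because gaps are free, many different alignments usually reach the same top score, which is why this scoring scheme tends to return a large number of equally good alignments.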
Our variable alignments now contains a list of alignments (at least one) which have the same optimal score for the given conditions. In our example there are 80 different alignments with the score 72 (Bio.pairwise2 will return up to 1000 alignments). Have a look at one of these alignments:
>>> len(alignments)
80
>>> print(alignments[0])
('MVLSPADKTNVKAAWGKVGAHAG...YR','MVHLTPEEKSAVTALWGKV...YH',
72.0,0,217)
Each alignment is a tuple consisting of the two aligned sequences, the score, and the start and end positions of the alignment (in global alignments the start is always 0 and the end is the length of the alignment). Bio.pairwise2 has a function format_alignment for a nicer printout:
>>> print(pairwise2.format_alignment(*alignments[0]))
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYF...YR
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||...|||
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRF...YH
Score=72
Better alignments are usually obtained by penalizing gaps: higher costs for opening a gap and lower costs for extending an existing gap. For amino acid sequences match scores are usually encoded in matrices like PAM or BLOSUM. Thus, a more meaningful alignment for our example can be obtained by using the BLOSUM62 matrix, together with a gap open penalty of 10 and a gap extension penalty of 0.5 (using globalds). Note that the penalties are passed as negative numbers:
>>> from Bio.SubsMat.MatrixInfo import blosum62
>>> seq1 = SeqIO.read("alpha.faa", "fasta")
>>> seq2 = SeqIO.read("beta.faa", "fasta")
>>> alignments = pairwise2.align.globalds(seq1.seq, seq2.seq, blosum62, -10, -0.5)
>>> len(alignments)
2
>>> print(pairwise2.format_alignment(*alignments[0]))
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTY...KYR
||||||||||||||||||||||||||||||||||||||||||||...|||
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFF...KYH
Score=292.5
This alignment has the same score that we obtained earlier with EMBOSS needle using the same sequences and the same parameters.
Local alignments are called similarly with the function align.localXX, where again XX stands for a two letter code for the match and gap functions:
>>> from Bio.SubsMat.MatrixInfo import blosum62
>>> alignments = pairwise2.align.localds("LSPADKTNVKAA", "PEEKSAV", blosum62, -10, -1)
>>> print(pairwise2.format_alignment(*alignments[0]))
LSPADKTNVKAA
|||||||
PEEKSAV
Score=16
<BLANKLINE>
Instead of supplying a complete match/mismatch matrix, the match code m allows general match/mismatch values to be defined easily. The next example uses match/mismatch scores of 5/-4 and gap penalties (open/extend) of 2/0.5 using localms:
>>> alignments = pairwise2.align.localms("AGAACT", "GAC", 5, -4, -2, -0.5)
>>> print(pairwise2.format_alignment(*alignments[0]))
AGAACT
||||
GAC
Score=13
<BLANKLINE>
One useful keyword argument of the Bio.pairwise2.align functions is score_only. When set to True it will only return the score of the best alignment(s), but in a significantly shorter time. It will also allow the alignment of longer sequences before a memory error is raised.
Unfortunately, Bio.pairwise2 does not work with Biopython's multiple sequence alignment objects (yet). However, the module has some interesting advanced features: you can define your own match and gap functions (interested in testing affine logarithmic gap costs?), gap penalties and end gap penalties can be different for both sequences, sequences can be supplied as lists (useful if you have residues that are encoded by more than one character), etc. These features are hard (if at all) to realize with other alignment tools. For more details see the module's documentation in Biopython's API.
Chapter 7 BLAST
Hey, everybody loves BLAST right? I mean, geez, how can it get any easier to do comparisons between one of your sequences and every other sequence in the known world? But, of course, this section isn't about how cool BLAST is, since we already know that. It is about the problem with BLAST: it can be really difficult to deal with the volume of data generated by large runs, and to automate BLAST runs in general.
Fortunately, the Biopython folks know this only too well, so they've developed lots of tools for dealing with BLAST and making things much easier. This section details how to use these tools and do useful things with them.
Dealing with BLAST can be split up into two steps, both of which can be done from within Biopython. Firstly, running BLAST for your query sequence(s), and getting some output. Secondly, parsing the BLAST output in Python for further analysis.
Your first introduction to running BLAST was probably via the NCBI web service. In fact, there are lots of ways you can run BLAST, which can be categorised in several ways. The most important distinction is running BLAST locally (on your own machine), and running BLAST remotely (on another machine, typically the NCBI servers). We're going to start this chapter by invoking the NCBI online BLAST service from within a Python script.
NOTE: The following Chapter 8 describes Bio.SearchIO, an experimental module in Biopython. We intend this to ultimately replace the older Bio.Blast module, as it provides a more general framework handling other related sequence searching tools as well. However, until that is declared stable, for production code please continue to use the Bio.Blast module for dealing with NCBI BLAST.
7.1 Running BLAST over the Internet
We use the function qblast() in the Bio.Blast.NCBIWWW module to call the online version of BLAST. This has three non-optional arguments:
The first argument is the blast program to use for the search, as a lower case string. The options and descriptions of the programs are available at http://www.ncbi.nlm.nih.gov/BLAST/blast_program.shtml. Currently qblast only works with blastn, blastp, blastx, tblastn and tblastx.
The second argument specifies the databases to search against. Again, the options for this are available on the NCBI web pages at http://www.ncbi.nlm.nih.gov/BLAST/blast_databases.shtml.
The third argument is a string containing your query sequence. This can either be the sequence itself, the sequence in fasta format, or an identifier like a GI number.
The qblast function also takes a number of other optional arguments which are basically analogous to the different parameters you can set on the BLAST web page. We'll just highlight a few of them here:
The argument url_base sets the base URL for running BLAST over the internet. By default it connects to the NCBI, but one can use this to connect to an instance of NCBI BLAST running in the cloud. Please refer to the documentation for the qblast function for further details.
The qblast function can return the BLAST results in various formats, which you can choose with the optional format_type keyword: "HTML", "Text", "ASN.1", or "XML". The default is "XML", as that is the format expected by the parser, described in Section 7.3 below.
The argument expect sets the expectation or e-value threshold.
For more about the optional BLAST arguments, we refer you to the NCBI's own documentation, or that built into Biopython:
>>> from Bio.Blast import NCBIWWW
>>> help(NCBIWWW.qblast)
...
Note that the default settings on the NCBI BLAST website are not quite the same as the defaults on QBLAST. If you get different results, you'll need to check the parameters (e.g., the expectation value threshold and the gap values).
For example, if you have a nucleotide sequence you want to search against the nucleotide database (nt) using BLASTN, and you know the GI number of your query sequence, you can use:
>>> result_handle = NCBIWWW.qblast("blastn", "nt", "8332116")
Alternatively, if we have our query sequence already in a FASTA formatted file, we just need to open the file and read in this record as a string, and use that as the query argument:
>>> fasta_string = open("m_cold.fasta").read()
>>> result_handle = NCBIWWW.qblast("blastn", "nt", fasta_string)
We could also have read in the FASTA file as a SeqRecord and then supplied just the sequence itself:
>>> record = SeqIO.read("m_cold.fasta", format="fasta")
>>> result_handle = NCBIWWW.qblast("blastn", "nt", record.seq)
Supplying just the sequence means that BLAST will assign an identifier for your sequence automatically. You might prefer to use the SeqRecord object's format method to make a FASTA string (which will include the existing identifier):
>>> record = SeqIO.read("m_cold.fasta", format="fasta")
>>> result_handle = NCBIWWW.qblast("blastn", "nt", record.format("fasta"))
This approach makes more sense if you have your sequence(s) in a non-FASTA file format which you can extract using Bio.SeqIO (see Chapter 5).
Whatever arguments you give the qblast() function, you should get back your results in a handle object (by default in XML format). The next step would be to parse the XML output into Python objects representing the search results (Section 7.3), but you might want to save a local copy of the output file first. I find this especially useful when debugging my code that extracts info from the BLAST results (because re-running the online search is slow and wastes the NCBI computer time).
We need to be a bit careful since we can use result_handle.read() to read the BLAST output only once; calling result_handle.read() again returns an empty string.
>>> with open("my_blast.xml", "w") as out_handle:
...     out_handle.write(result_handle.read())
...
>>> result_handle.close()
After doing this, the results are in the file my_blast.xml and the original handle has had all its data extracted (so we closed it). However, the parse function of the BLAST parser (described in Section 7.3) takes a file-handle-like object, so we can just open the saved file for input:
>>> result_handle = open("my_blast.xml")
Now that we've got the BLAST results back into a handle again, we are ready to do something with them, so this leads us right into the parsing section (see Section 7.3 below). You may want to jump ahead to that now.
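The read-once behaviour of a handle, and the save-then-reopen workaround, can be demonstrated without contacting the NCBI. In this sketch a StringIO object with placeholder text stands in for the handle returned by qblast(), and the file is written to a temporary directory; both are assumptions for the illustration.

```python
import os
import tempfile
from io import StringIO

# Stand-in for the handle returned by an online search (placeholder content).
result_handle = StringIO("<BlastOutput>placeholder</BlastOutput>")

first = result_handle.read()    # all of the data
second = result_handle.read()   # nothing left: an empty string
result_handle.close()

# Save the data once, then reopen the file whenever a fresh handle is needed.
path = os.path.join(tempfile.mkdtemp(), "my_blast.xml")
with open(path, "w") as out_handle:
    out_handle.write(first)

with open(path) as saved_handle:
    reread = saved_handle.read()

print(repr(second), reread == first)
```

Keeping the saved file around means the parsing code can be rerun as often as needed without repeating the search.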
7.2 Running BLAST locally
7.2.1 Introduction
Running BLAST locally (as opposed to over the internet, see Section 7.1) has at least two major advantages:
Local BLAST may be faster than BLAST over the internet.
Local BLAST allows you to make your own database to search for sequences against.
Dealing with proprietary or unpublished sequence data can be another reason to run BLAST locally. You may not be allowed to redistribute the sequences, so submitting them to the NCBI as a BLAST query would not be an option.
Unfortunately, there are some major drawbacks too. Installing all the bits and getting it set up right takes some effort:
Local BLAST requires command line tools to be installed.
Local BLAST requires (large) BLAST databases to be set up (and potentially kept up to date).
To further confuse matters there are several different BLAST packages available, and there are also other tools which can produce imitation BLAST output files, such as BLAT.
7.2.2 Standalone NCBI BLAST+
The new NCBI BLAST+ suite was released in 2009. This replaces the old NCBI "legacy" BLAST package (see below).
This section will show briefly how to use these tools from within Python. If you have already read or tried the alignment tool examples in Section 6.4 this should all seem quite straightforward. First, we construct a command line string (as you would type in at the command line prompt if running standalone BLAST by hand). Then we can execute this command from within Python.
For example, taking a FASTA file of gene nucleotide sequences, you might want to run a BLASTX (translation) search against the non-redundant (NR) protein database. Assuming you (or your systems administrator) have downloaded and installed the NR database, you might run:
blastx -query opuntia.fasta -db nr -out opuntia.xml -evalue 0.001 -outfmt 5
This should run BLASTX against the NR database, using an expectation cut-off value of 0.001 and produce XML output to the specified file (which we can then parse). On my computer this takes about six minutes, a good reason to save the output to a file so you can repeat any analysis as needed.
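To see how such a command line breaks down into a program name plus option/value pairs, the standard library shlex module can split it the way a shell would. This is only an inspection aid written for this section; it does not run BLAST and is not part of the Biopython wrappers.

```python
import shlex

command = "blastx -query opuntia.fasta -db nr -out opuntia.xml -evalue 0.001 -outfmt 5"

# Split the string into the argument list a shell would pass to the program.
argv = shlex.split(command)

# After the program name, the arguments here alternate option, value.
options = dict(zip(argv[1::2], argv[2::2]))

print(argv[0])
for option, value in options.items():
    print(option, value)
```

An argument list like argv can be handed directly to subprocess.Popen or subprocess.run, which avoids quoting problems when file names contain spaces.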
From within Biopython we can use the NCBI BLASTX wrapper from the Bio.Blast.Applications module to build the command line string, and run it:
>>> from Bio.Blast.Applications import NcbiblastxCommandline
>>> help(NcbiblastxCommandline)
...
>>> blastx_cline = NcbiblastxCommandline(query="opuntia.fasta", db="nr", evalue=0.001,
...                                      outfmt=5, out="opuntia.xml")
>>> blastx_cline
NcbiblastxCommandline(cmd='blastx', out='opuntia.xml', outfmt=5, query='opuntia.fasta',
db='nr', evalue=0.001)
>>> print(blastx_cline)
blastx -out opuntia.xml -outfmt 5 -query opuntia.fasta -db nr -evalue 0.001
>>> stdout, stderr = blastx_cline()
In this example there shouldn't be any output from BLASTX to the terminal, so stdout and stderr should be empty. You may want to check the output file opuntia.xml has been created.
As you may recall from earlier examples in the tutorial, the opuntia.fasta file contains seven sequences, so the BLAST XML output should contain multiple results. Therefore use Bio.Blast.NCBIXML.parse() to parse it as described below in Section 7.3.
7.2.3 Other versions of BLAST
NCBI BLAST+ (written in C++) was first released in 2009 as a replacement for the original NCBI "legacy" BLAST (written in C) which is no longer being updated. There were a lot of changes: the old version had a single core command line tool blastall which covered multiple different BLAST search types (which are now separate commands in BLAST+), and all the command line options were renamed. Biopython's wrappers for the NCBI "legacy" BLAST tools have been deprecated and will be removed in a future release. To try to avoid confusion, we do not cover calling these old tools from Biopython in this tutorial.
You may also come across Washington University BLAST (WU-BLAST), and its successor, Advanced Biocomputing BLAST (AB-BLAST, released in 2009, not free/open source). These packages include the command line tools wu-blastall and ab-blastall, which mimicked blastall from the NCBI "legacy" BLAST suite. Biopython does not currently provide wrappers for calling these tools, but should be able to parse any NCBI compatible output from them.
7.3 Parsing BLAST output
As mentioned above, BLAST can generate output in various formats, such as XML, HTML, and plain text. Originally, Biopython had parsers for BLAST plain text and HTML output, as these were the only output formats offered at the time. Unfortunately, the BLAST output in these formats kept changing, each time breaking the Biopython parsers. Our HTML BLAST parser has been removed, but the plain text BLAST parser is still available (see Section 7.5). Use it at your own risk; it may or may not work, depending on which BLAST version you're using.
As keeping up with changes in BLAST became a hopeless endeavor, especially with users running different BLAST versions, we now recommend parsing the output in XML format, which can be generated by recent versions of BLAST. Not only is the XML output more stable than the plain text and HTML output, it is also much easier to parse automatically, making Biopython a whole lot more stable.
You can get BLAST output in XML format in various ways. For the parser, it doesn't matter how the output was generated, as long as it is in the XML format.
You can use Biopython to run BLAST over the internet, as described in Section 7.1.
You can use Biopython to run BLAST locally, as described in Section 7.2.
You can do the BLAST search yourself on the NCBI site through your web browser, and then save the results. You need to choose XML as the format in which to receive the results, and save the final BLAST page you get (you know, the one with all of the interesting results!) to a file.
You can also run BLAST locally without using Biopython, and save the output in a file. Again, you need to choose XML as the format in which to receive the results.
The important point is that you do not have to use Biopython scripts to fetch the data in order to be able to parse it. Doing things in one of these ways, you then need to get a handle to the results. In Python, a handle is just a nice general way of describing input to any info source so that the info can be retrieved using read() and readline() functions (see Section 24.1).
If you followed the code above for interacting with BLAST through a script, then you already have result_handle, the handle to the BLAST results. For example, using a GI number to do an online search:
>>> result_handle = NCBIWWW.qblast("blastn", "nt", "8332116")
If instead you ran BLAST some other way, and have the BLAST output (in XML format) in the file my_blast.xml, all you need to do is to open the file for reading:
>>> result_handle = open("my_blast.xml")
Now that we've got a handle, we are ready to parse the output. The code to parse it is really quite small. If you expect a single BLAST result (i.e., you used a single query):
>>> from Bio.Blast import NCBIXML
>>> blast_record = NCBIXML.read(result_handle)
or, if you have lots of results (i.e., multiple query sequences):
>>> blast_records = NCBIXML.parse(result_handle)
Just like Bio.SeqIO and Bio.AlignIO (see Chapters 5 and 6), we have a pair of input functions, read and parse, where read is for when you have exactly one object, and parse is an iterator for when you can have lots of objects. Instead of getting SeqRecord or MultipleSeqAlignment objects, though, we get BLAST record objects.
To be able to handle the situation where the BLAST file may be huge, containing thousands of results, NCBIXML.parse() returns an iterator. In plain English, an iterator allows you to step through the BLAST output, retrieving BLAST records one by one for each BLAST search result:
>>> blast_record = next(blast_records)
# ... do something with blast_record
>>> blast_record = next(blast_records)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
# No further records
Or, you can use a for-loop:
>>> for blast_record in blast_records:
...     # Do something with blast_record
Note though that you can step through the BLAST records only once. Usually, from each BLAST record you would save the information that you are interested in. If you want to save all returned BLAST records, you can convert the iterator into a list:
>>> blast_records = list(blast_records)
Now you can access each BLAST record in the list with an index as usual. If your BLAST file is huge though, you may run into memory problems trying to save them all in a list.
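The one-pass nature of an iterator is a general Python behaviour and can be seen with a plain generator. In this sketch parse_records is a hypothetical stand-in for a parser such as NCBIXML.parse(), and the "records" are just made-up strings.

```python
def parse_records(lines):
    """Yield one cleaned-up record per input line (hypothetical stand-in parser)."""
    for line in lines:
        yield line.strip()


records = parse_records(["query1\n", "query2\n"])

first = next(records)
second = next(records)

# A third call finds nothing left and raises StopIteration.
try:
    next(records)
    exhausted = False
except StopIteration:
    exhausted = True

# The iterator cannot be rewound: converting it to a list now gives nothing.
leftover = list(records)

# Converting a fresh iterator to a list keeps every record, at the cost of memory.
all_records = list(parse_records(["query1\n", "query2\n"]))

print(first, second, exhausted, leftover, all_records)
```

A for-loop handles StopIteration automatically, which is why it is the usual way to consume such an iterator.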
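Usually, you'll be running one BLAST search at a time. Then, all you need to do is to pick up the first (and only) BLAST record in blast_records:
>>> blast_records = NCBIXML.parse(result_handle)
>>> blast_record = next(blast_records)
or more elegantly:
>>> blast_record = NCBIXML.read(result_handle)
I guess by now you're wondering what is in a BLAST record.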
7.4 The BLAST record class
A BLAST Record contains everything you might ever want to extract from the BLAST output. Right now we'll just show an example of how to get some info out of the BLAST report, but if you want something in particular that is not described here, look at the info on the record class in detail, and take a gander into the code or automatically generated documentation; the docstrings have lots of good info about what is stored in each piece of information.
To continue with our example, let's just print out some summary info about all hits in our blast report greater than a particular threshold. The following code does this:
>>> E_VALUE_THRESH = 0.04
>>> for alignment in blast_record.alignments:
...     for hsp in alignment.hsps:
...         if hsp.expect < E_VALUE_THRESH:
...             print('****Alignment****')
...             print('sequence:', alignment.title)
...             print('length:', alignment.length)
...             print('e value:', hsp.expect)
...             print(hsp.query[0:75] + '...')
...             print(hsp.match[0:75] + '...')
...             print(hsp.sbjct[0:75] + '...')
This will print out summary reports like the following:
****Alignment****
sequence: >gb|AF283004.1|AF283004 Arabidopsis thaliana cold acclimation protein WCOR413-like protein
alpha form mRNA, complete cds
length: 783
e value: 0.034
tacttgttgatattggatcgaacaaactggagaaccaacatgctcacgtcacttttagtcccttacatattcctc...
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||...
tacttgttggtgttggatcgaaccaattggaagacgaatatgctcacatcacttctcattccttacatcttcttc...
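The filtering logic in the loop above depends only on the nesting of record, alignments and HSPs, so it can be tried on stand-in objects. In this sketch the namedtuple classes, the function significant_hits, and all titles and e-values are made up for illustration; they are not the Bio.Blast.Record classes.

```python
from collections import namedtuple

# Minimal stand-ins carrying only the attributes used by the filter.
Hsp = namedtuple("Hsp", "expect")
Alignment = namedtuple("Alignment", "title hsps")
Record = namedtuple("Record", "alignments")


def significant_hits(blast_record, threshold):
    """Return (title, e-value) for every HSP below the e-value threshold."""
    hits = []
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            if hsp.expect < threshold:
                hits.append((alignment.title, hsp.expect))
    return hits


# Made-up data: one alignment with two HSPs, one alignment with a weak HSP.
record = Record(alignments=[
    Alignment(title="hit A", hsps=[Hsp(expect=1e-30), Hsp(expect=0.5)]),
    Alignment(title="hit B", hsps=[Hsp(expect=0.2)]),
])

hits = significant_hits(record, threshold=0.04)
print(hits)
```

Collecting the hits into a list rather than printing them makes it easier to sort or post-process the results later.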
Basically, you can do anything you want to with the info in the BLAST report once you have parsed it. This will, of course, depend on what you want to use it for, but hopefully this helps you get started on doing what you need to do!
An important consideration for extracting information from a BLAST report is the type of objects that the information is stored in. In Biopython, the parsers return Record objects, either Blast or PSIBlast depending on what you are parsing. These objects are defined in Bio.Blast.Record and are quite complete.
Here are my attempts at UML class diagrams for the Blast and PSIBlast record classes. If you are good at UML and see mistakes/improvements that can be made, please let me know. The Blast class diagram is shown in Figure 7.4.
The PSIBlast record object is similar, but has support for the rounds that are used in the iteration steps of PSIBlast. The class diagram for PSIBlast is shown in Figure 7.4.
7.5 Deprecated BLAST parsers
Older versions of Biopython had parsers for BLAST output in plain text or HTML format. Over the years, we discovered that it is very hard to maintain these parsers in working order. Basically, any small change to the BLAST output in newly released BLAST versions tends to cause the plain text and HTML parsers to break. We therefore recommend parsing BLAST output in XML format, as described in Section 7.3.
Depending on which BLAST versions or programs you're using, our plain text BLAST parser may or may not work. Use it at your own risk!
7.5.1 Parsing plain-text BLAST output
The plain text BLAST parser is located in Bio.Blast.NCBIStandalone.
As with the XML parser, we need to have a handle object that we can pass to the parser. The handle must implement the readline() method and do this properly. The common ways to get such a handle are to either use the provided blastall or blastpgp functions to run the local blast, or to run a local blast via the command line, and then do something like the following:
>>> result_handle = open("my_file_of_blast_output.txt")
Well, now that we've got a handle (which we'll call result_handle), we are ready to parse it. This can be done with the following code:
>>> from Bio.Blast import NCBIStandalone
>>> blast_parser = NCBIStandalone.BlastParser()
>>> blast_record = blast_parser.parse(result_handle)
This will parse the BLAST report into a Blast Record class (either a Blast or a PSIBlast record, depending on what you are parsing) so that you can extract the information from it. In our case, let's just print out a quick summary of all of the alignments greater than some threshold value.
>>> E_VALUE_THRESH = 0.04
>>> for alignment in blast_record.alignments:
...     for hsp in alignment.hsps:
...         if hsp.expect < E_VALUE_THRESH:
...             print('****Alignment****')
...             print('sequence:', alignment.title)
...             print('length:', alignment.length)
...             print('e value:', hsp.expect)
...             print(hsp.query[0:75] + '...')
...             print(hsp.match[0:75] + '...')
...             print(hsp.sbjct[0:75] + '...')
If you also read Section 7.3 on parsing BLAST XML output, you'll notice that the above code is identical to what is used with the XML parser. Once you parse something into a record class you can deal with it independent of the format of the original BLAST info you were parsing. Pretty snazzy!
Sure, parsing one record is great, but I've got a BLAST file with tons of records. How can I parse them all? Well, fear not, the answer lies in the very next section.
7.5.2 Parsing a plain-text BLAST file full of BLAST runs
We can do this using the blast iterator. To set up an iterator, we first set up a parser, to parse our blast reports in Blast Record objects:
>>> blast_parser = NCBIStandalone.BlastParser()
Then we will assume we have a handle to a bunch of blast records, which we'll call result_handle. Getting a handle is described in full detail above in the blast parsing sections.
Now that we've got a parser and a handle, we are ready to set up the iterator with the following command:
>>> blast_iterator = NCBIStandalone.Iterator(result_handle, blast_parser)
The second option, the parser, is optional. If we don't supply a parser, then the iterator will just return the raw BLAST reports one at a time.
Now that we've got an iterator, we start retrieving blast records (generated by our parser) using next():
>>> blast_record = next(blast_iterator)
Each call to next will return a new record that we can deal with. Now we can iterate through these records and generate our old favorite, a nice little blast report:
>>> for blast_record in blast_iterator:
...     E_VALUE_THRESH = 0.04
...     for alignment in blast_record.alignments:
...         for hsp in alignment.hsps:
...             if hsp.expect < E_VALUE_THRESH:
...                 print('****Alignment****')
...                 print('sequence:', alignment.title)
...                 print('length:', alignment.length)
...                 print('e value:', hsp.expect)
...                 if len(hsp.query) > 75:
...                     dots = '...'
...                 else:
...                     dots = ''
...                 print(hsp.query[0:75] + dots)
...                 print(hsp.match[0:75] + dots)
...                 print(hsp.sbjct[0:75] + dots)
The iterator allows you to deal with huge blast records without any memory problems, since things are read in one at a time. I have parsed tremendously huge files without any problems using this.
7.5.3 Finding a bad record somewhere in a huge plain-text BLAST file
One really ugly problem that happens to me is that I'll be parsing a huge blast file for a while, and the parser will bomb out with a ValueError. This is a serious problem, since you can't tell if the ValueError is due to a parser problem, or a problem with the BLAST. To make it even worse, you have no idea where the parse failed, so you can't just ignore the error, since this could be ignoring an important data point.
We used to have to make a little script to get around this problem, but the Bio.Blast module now includes a BlastErrorParser which really helps make this easier. The BlastErrorParser works very similarly to the regular BlastParser, but it adds an extra layer of work by catching ValueErrors that are generated by the parser, and attempting to diagnose the errors.
Let's take a look at using this parser. First we define the file we are going to parse and the file to write the problem reports to:
>>> import os
>>> blast_file = os.path.join(os.getcwd(), "blast_out", "big_blast.out")
>>> error_file = os.path.join(os.getcwd(), "blast_out", "big_blast.problems")
Now we want to get a BlastErrorParser:
>>> error_handle = open(error_file, "w")
>>> blast_error_parser = NCBIStandalone.BlastErrorParser(error_handle)
Notice that the parser takes an optional argument of a handle. If a handle is passed, then the parser will write any blast records which generate a ValueError to this handle. Otherwise, these records will not be recorded.
Now we can use the BlastErrorParser just like a regular blast parser. Specifically, we might want to make an iterator that goes through our blast records one at a time and parses them with the error parser:
>>> result_handle = open(blast_file)
>>> iterator = NCBIStandalone.Iterator(result_handle, blast_error_parser)
We can read these records one at a time, but now we can catch and deal with errors that are due to problems with Blast (and not with the parser itself):
>>> try:
...     next_record = next(iterator)
... except NCBIStandalone.LowQualityBlastError as info:
...     print("LowQualityBlastError detected in id %s" % info.args[1])
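The same catch-and-continue pattern can be sketched with the standard library only. Everything here is hypothetical: LowQualityError, parse_report and the made-up reports merely imitate an exception that carries an error message and a record id in its args, so that bad records can be logged while the remaining ones are still parsed.

```python
class LowQualityError(ValueError):
    """Hypothetical error whose args are (message, record id)."""


def parse_report(record_id, text):
    """Pretend parser: fails when the sequence is nothing but N characters."""
    if not text.strip("N"):
        raise LowQualityError("nothing left to parse", record_id)
    return (record_id, len(text))


# Made-up raw reports: (record id, sequence text).
raw_reports = iter([("q1", "ACGT"), ("q2", "NNNN"), ("q3", "ACGTAC")])

parsed, problems = [], []
while True:
    try:
        record_id, text = next(raw_reports)
    except StopIteration:
        break  # no further reports
    try:
        parsed.append(parse_report(record_id, text))
    except LowQualityError as info:
        # info.args works on both Python 2 and Python 3.
        problems.append(info.args[1])

print(parsed)
print(problems)
```

Recording the ids of the failing records, rather than stopping at the first one, gives a complete list to investigate once the run has finished.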
The next() functionality is normally called indirectly via a for-loop. Right now the BlastErrorParser can generate the following errors:
ValueError: This is the same error generated by the regular BlastParser, and is due to the parser not being able to parse a specific file. This is normally either due to a bug in the parser, or some kind of discrepancy between the version of BLAST you are using and the versions the parser is able to handle.
LowQualityBlastError: When BLASTing a sequence that is of really bad quality (for example, a short sequence that is basically a stretch of one nucleotide), it seems that BLAST ends up masking out the entire sequence and ending up with nothing to parse. In this case it will produce a truncated report that causes the parser to generate a ValueError. LowQualityBlastError is reported in these cases. This error returns an info item with the following information:
    item[0]: The error message
    item[1]: The id of the input record that caused the error. This is really useful if you want to record all of the records that are causing problems.
As mentioned, with each error generated, the BlastErrorParser will write the offending record to the specified error_handle. You can then go ahead and look at these and deal with them as you see fit. Either you will be able to debug the parser with a single BLAST report, or you will find out about problems in your BLAST runs. Either way, it will definitely be a useful experience!
Hopefully the BlastErrorParser will make it much easier to debug and deal with large BLAST files.
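The general pattern described above (catch the ValueError, set the offending record aside, and keep going) can be sketched without Biopython at all. The following is only an illustration under stated assumptions: parse_record and parse_all are hypothetical helpers working on a toy "id<TAB>score" record format, and they are not part of Bio.Blast.

```python
import io


def parse_record(text):
    """Toy parser: a record is 'id<TAB>score'; raises ValueError on bad input."""
    record_id, score = text.split("\t")
    return record_id, float(score)


def parse_all(lines, error_handle=None):
    """Parse every record, diverting unparseable ones to error_handle."""
    good = []
    for line in lines:
        line = line.rstrip("\n")
        try:
            good.append(parse_record(line))
        except ValueError:
            # Keep the offending record so it can be inspected later,
            # instead of silently dropping a data point.
            if error_handle is not None:
                error_handle.write(line + "\n")
    return good


raw = ["seq1\t42.0\n", "seq2\tnot_a_number\n", "truncated\n", "seq3\t7.5\n"]
problems = io.StringIO()  # stands in for the error file handle
records = parse_all(raw, problems)
```

After the run, records holds only the two parseable entries, while problems.getvalue() contains the two rejected lines, ready for debugging.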
7.6 Dealing with PSI-BLAST
You can run the standalone version of PSI-BLAST (the legacy NCBI command line tool blastpgp, or its replacement psiblast) using the wrappers in the Bio.Blast.Applications module.
At the time of writing, the NCBI do not appear to support tools running a PSI-BLAST search via the internet.
Note that the Bio.Blast.NCBIXML parser can read the XML output from current versions of PSI-BLAST, but information like which sequences in each iteration are new or reused isn't present in the XML file. If you care about this information you may have more joy with the plain text output and the PSIBlastParser in Bio.Blast.NCBIStandalone.
7.7 Dealing with RPS-BLAST
You can run the standalone version of RPS-BLAST (either the legacy NCBI command line tool rpsblast, or its replacement with the same name) using the wrappers in the Bio.Blast.Applications module.
At the time of writing, the NCBI do not appear to support tools running an RPS-BLAST search via the internet.
You can use the Bio.Blast.NCBIXML parser to read the XML output from current versions of RPS-BLAST.
Chapter 8 BLAST and other sequence search tools (experimental code)
WARNING: This chapter of the Tutorial describes an experimental module in Biopython. It is being included in Biopython and documented here in the tutorial in a pre-final state to allow a period of feedback and refinement before we declare it stable. Until then the details will probably change, and any scripts using the current Bio.SearchIO would need to be updated. Please keep this in mind! For stable code working with NCBI BLAST, please continue to use Bio.Blast described in the preceding Chapter 7.
Biological sequence identification is an integral part of bioinformatics. Several tools are available for this, each with their own algorithms and approaches, such as BLAST (arguably the most popular), FASTA, HMMER, and many more. In general, these tools use your sequence to search a database of potential matches. With the growing number of known sequences (hence the growing number of potential matches), interpreting the results becomes increasingly hard as there could be hundreds or even thousands of potential matches. Naturally, manual interpretation of these search results is out of the question. Moreover, you often need to work with several sequence search tools, each with its own statistics, conventions, and output format. Imagine how daunting it would be when you need to work with multiple sequences using multiple search tools.
We know this too well ourselves, which is why we created the Bio.SearchIO submodule in Biopython. Bio.SearchIO allows you to extract information from your search results in a convenient way, while also dealing with the different standards and conventions used by different search tools. The name SearchIO is a homage to BioPerl's module of the same name.
In this chapter, we'll go through the main features of Bio.SearchIO to show what it can do for you. We'll use two popular search tools along the way: BLAST and BLAT. They are used merely for illustrative purposes, and you should be able to adapt the workflow to any other search tools supported by Bio.SearchIO in a breeze. You're very welcome to follow along with the search output files we'll be using. The BLAST output file and the BLAT output file are included with the Biopython source code under the Doc/examples/ folder. Both output files were generated using this sequence:
>mystery_seq
CCCTCTACAGGGAAGCGCTTTCTGTTGTCTGAAAGAAAAGAAAGTGCTTCCTTTTAGAGGG
The BLAST result is an XML file generated using blastn against the NCBI refseq_rna database. For BLAT, the sequence database was the February 2009 hg19 human genome draft and the output format is PSL.
We'll start from an introduction to the Bio.SearchIO object model. The model is the representation of your search results, thus it is core to Bio.SearchIO itself. After that, we'll check out the main functions in Bio.SearchIO that you may often use.
Now that we're all set, let's go to the first step: introducing the core object model.
8.1 The SearchIO object model
Despite the wildly differing output styles among many sequence search tools, it turns out that their underlying concept is similar:
The output file may contain results from one or more search queries.
In each search query, you will see one or more hits from the given search database.
In each database hit, you will see one or more regions containing the actual sequence alignment between your query sequence and the database sequence.
Some programs like BLAT or Exonerate may further split these regions into several alignment fragments (or blocks in BLAT and possibly exons in Exonerate). This is not something you always see, as programs like BLAST and HMMER do not do this.
Realizing this generality, we decided to use it as the base for creating the Bio.SearchIO object model. The object model consists of a nested hierarchy of Python objects, each one representing one concept outlined above. These objects are:
QueryResult, to represent a single search query.
Hit, to represent a single database hit. Hit objects are contained within QueryResult and in each QueryResult there are zero or more Hit objects.
HSP (short for high-scoring pair), to represent region(s) of significant alignments between query and hit sequences. HSP objects are contained within Hit objects and each Hit has one or more HSP objects.
HSPFragment, to represent a single contiguous alignment between query and hit sequences. HSPFragment objects are contained within HSP objects. Most sequence search tools like BLAST and HMMER unify HSP and HSPFragment objects, as each HSP will only have a single HSPFragment. However, there are tools like BLAT and Exonerate that produce HSPs containing multiple HSPFragment objects. Don't worry if this seems a tad confusing now, we'll elaborate more on these two objects later on.
These four objects are the ones you will interact with when you use Bio.SearchIO. They are created using one of the main Bio.SearchIO methods: read, parse, index, or index_db. The details of these methods are provided in later sections. For this section, we'll only be using read and parse. These functions behave similarly to their Bio.SeqIO and Bio.AlignIO counterparts:
read is used for search output files with a single query and returns a QueryResult object
parse is used for search output files with multiple queries and returns a generator that yields QueryResult objects
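The relationship between the two functions can be pictured as "read is parse plus a check that exactly one result exists". The sketch below uses hypothetical stand-in functions that operate on a plain list instead of a search output file; it only illustrates the contract and is not the Bio.SearchIO implementation.

```python
def parse(records):
    """Stand-in for a parser: lazily yields one result per query."""
    for record in records:
        yield record


def read(records):
    """Return the only result; raise ValueError for zero or several."""
    iterator = parse(records)
    try:
        first = next(iterator)
    except StopIteration:
        raise ValueError("No query results found")
    try:
        next(iterator)
    except StopIteration:
        return first  # exactly one result, as required
    raise ValueError("More than one query result found")


single = read(["query_1"])                   # one query: a single object back
many = list(parse(["query_1", "query_2"]))   # several queries: iterate instead
```

The design point is that a generator lets you stop early or stream through a large file, while the single-result function spares you the loop when you know the file holds one query.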
With that settled, let's start probing each Bio.SearchIO object, beginning with QueryResult.
8.1.1 QueryResult
The QueryResult object represents a single search query and contains zero or more Hit objects. Let's see what it looks like using the BLAST file we have:
>>> from Bio import SearchIO
>>> blast_qresult = SearchIO.read('my_blast.xml', 'blast-xml')
>>> print(blast_qresult)
Program: blastn (2.2.27+)
  Query: 42291 (61)
         mystery_seq
 Target: refseq_rna
   Hits:
            #  # HSP  ID + description

            0      1  gi|262205317|ref|NR_030195.1|  Homo sapiens microRNA 52...
            1      1  gi|301171311|ref|NR_035856.1|  Pan troglodytes microRNA...
            2      1  gi|270133242|ref|NR_032573.1|  Macaca mulatta microRNA...
           ~~~
           97      1  gi|356517317|ref|XM_003527287.1|  PREDICTED: Glycine ma...
           98      1  gi|297814701|ref|XM_002875188.1|  Arabidopsis lyrata su...
           99      1  gi|397513516|ref|XM_003827011.1|  PREDICTED: Pan panisc...
We've just begun to scratch the surface of the object model, but you can see that there's already some useful information. By invoking print on the QueryResult object, you can see:
The program name and version (blastn version 2.2.27+)
The query ID, description, and its sequence length (ID is 42291, description is mystery_seq, and it is 61 nucleotides long)
The target database to search against (refseq_rna)
A quick overview of the resulting hits. For our query sequence, there are 100 potential hits (numbered 0-99 in the table). For each hit, we can also see how many HSPs it contains, its ID, and a snippet of its description. Notice here that Bio.SearchIO truncates the hit table overview, by showing only hits numbered 0-29, and then 97-99.
Now let's check our BLAT results using the same procedure as above:
>>> blat_qresult = SearchIO.read('my_blat.psl', 'blat-psl')
>>> print(blat_qresult)
Program: blat (<unknown version>)
  Query: mystery_seq (61)
         <unknown description>
 Target: <unknown target>
   Hits:
            #  # HSP  ID + description

            0     17  chr19  <unknown description>
You'll immediately notice that there are some differences. Some of these are caused by the way the PSL format stores its details, as you'll see. The rest are caused by the genuine program and target database differences between our BLAST and BLAT searches:
The program name and version. Bio.SearchIO knows that the program is BLAT, but in the output file there is no information regarding the program version, so it defaults to <unknown version>.
The query ID, description, and its sequence length. Notice here that these details are slightly different from the ones we saw in BLAST. The ID is mystery_seq instead of 42291, there is no known description, but the query length is still 61. This is actually a difference introduced by the file formats themselves. BLAST sometimes creates its own query IDs and uses your original ID as the sequence description.
The target database is not known, as it is not stated in the BLAT output file.
And finally, the list of hits we have is completely different. Here, we see that our query sequence only hits the chr19 database entry, but in it we see 17 HSP regions. This should not be surprising, however, given that we are using a different program, each with its own target database.
All the details you saw when invoking print can be accessed individually using Python's object attribute access notation (a.k.a. the dot notation). There are also other format-specific attributes that you can access using the same method.
>>> print("%s %s" % (blast_qresult.program, blast_qresult.version))
blastn 2.2.27+
>>> print("%s %s" % (blat_qresult.program, blat_qresult.version))
blat <unknown version>
>>> blast_qresult.param_evalue_threshold    # blast-xml specific
10.0
For a complete list of accessible attributes, you can check each format-specific documentation, such as the ones for BLAST and for BLAT.
Having looked at using print on QueryResult objects, let's drill down deeper. What exactly is a QueryResult? In terms of Python objects, QueryResult is a hybrid between a list and a dictionary. In other words, it is a container object with all the convenient features of lists and dictionaries.
Like Python lists and dictionaries, QueryResult objects are iterable. Each iteration returns a Hit object:
>>> for hit in blast_qresult:
...     hit
Hit(id='gi|262205317|ref|NR_030195.1|', query_id='42291', 1 hsps)
...
To check how many items (hits) a QueryResult has, you can simply invoke Python's len function:
>>> len(blast_qresult)
100
>>> len(blat_qresult)
1
Like Python lists, you can retrieve items (hits) from a QueryResult using integer indices:
>>> blast_qresult[0]        # retrieves the top hit
>>> blast_qresult[-1]       # retrieves the last hit
Hit(id='gi|397513516|ref|XM_003827011.1|', query_id='42291', 1 hsps)
To retrieve multiple hits, you can slice QueryResult objects using the slice notation as well. In this case, the slice will return a new QueryResult object containing only the sliced hits:
>>> blast_slice = blast_qresult[:3]     # slices the first three hits
>>> print(blast_slice)
Program: blastn (2.2.27+)
  Query: 42291 (61)
         mystery_seq
 Target: refseq_rna
   Hits:
            #  # HSP  ID + description

Like Python dictionaries, you can also retrieve hits using the hit's ID. This is particularly useful if you know a given hit ID exists within a search query's results:
>>> blast_qresult['gi|262205317|ref|NR_030195.1|']
You can also get a full list of Hit objects using hits and a full list of Hit IDs using hit_keys:
>>> blast_qresult.hits
[...]       # list of all hits
>>> blast_qresult.hit_keys
[...]       # list of all hit IDs
What if you just want to check whether a particular hit is present in the query results? You can do a simple Python membership test using the in keyword:
>>> 'gi|262205317|ref|NR_030195.1|' in blast_qresult
True
>>> 'gi|262205317|ref|NR_030194.1|' in blast_qresult
False
Sometimes, knowing whether a hit is present is not enough; you also want to know the rank of the hit. Here, the index method comes to the rescue:
>>> blast_qresult.index('gi|301171437|ref|NR_035870.1|')
22
Remember that we're using Python's indexing style here, which is zero-based. This means our hit above is ranked at no. 23, not 22.
Also, note that the hit rank you see here is based on the native hit ordering present in the original search output file. Different search tools may order these hits based on different criteria.
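To make the "hybrid between a list and a dictionary" idea concrete, here is a minimal, self-contained sketch of such a container. The class name HitCollection and the toy hit IDs are hypothetical; this is not the Biopython QueryResult class, only an illustration of how integer indexing, slicing, ID lookup, membership tests and rank lookup can coexist on one object.

```python
from collections import OrderedDict


class HitCollection:
    """Hypothetical stand-in for a list/dict hybrid; not the Biopython class."""

    def __init__(self, hits):
        # hits: iterable of (hit_id, payload) pairs, kept in file order
        self._items = OrderedDict(hits)

    def __len__(self):
        return len(self._items)

    def __iter__(self):
        # Iterating yields the stored objects, not their IDs
        return iter(self._items.values())

    def __contains__(self, hit_id):
        return hit_id in self._items

    def __getitem__(self, key):
        if isinstance(key, str):      # dictionary-style lookup by ID
            return self._items[key]
        if isinstance(key, slice):    # slicing returns a new collection
            return HitCollection(list(self._items.items())[key])
        return list(self._items.values())[key]  # list-style integer index

    def index(self, hit_id):
        """Zero-based rank of a hit ID in the original ordering."""
        return list(self._items).index(hit_id)


hits = HitCollection([("NR_1", "first"), ("NR_2", "second"), ("XM_3", "third")])
top, last = hits[0], hits[-1]
rank = hits.index("XM_3") + 1  # convert the zero-based index to a 1-based rank
```

Note how slicing returns the same container type, so the sliced object keeps all of these behaviours.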
If the native hit ordering doesn't suit your taste, you can use the sort method of the QueryResult object. It is very similar to Python's list.sort method, with the addition of an option to create a new sorted QueryResult object or not.
Here is an example of using QueryResult.sort to sort the hits based on each hit's full sequence length. For this particular sort, we'll set the in_place flag to False so that sorting will return a new QueryResult object and leave our initial object unsorted. We'll also set the reverse flag to True so that we sort in descending order.
>>> for hit in blast_qresult[:5]:   # id and sequence length of the first five hits
...     print("%s %i" % (hit.id, hit.seq_len))
...
gi|262205317|ref|NR_030195.1| 61
gi|301171311|ref|NR_035856.1| 60
gi|270133242|ref|NR_032573.1| 85
gi|301171322|ref|NR_035857.1| 86
gi|301171267|ref|NR_035851.1| 80
>>> sort_key = lambda hit: hit.seq_len
>>> sorted_qresult = blast_qresult.sort(key=sort_key, reverse=True, in_place=False)
>>> for hit in sorted_qresult[:5]:
...     print("%s %i" % (hit.id, hit.seq_len))
...
gi|397513516|ref|XM_003827011.1| 6002
gi|390332045|ref|XM_776818.2| 4082
gi|390332043|ref|XM_003723358.1| 4079
gi|356517317|ref|XM_003527287.1| 3251
gi|356543101|ref|XM_003539954.1| 2936
The advantage of having the in_place flag here is that we're preserving the native ordering, so we may use it again later. You should note that this is not the default behavior of QueryResult.sort, however, which is why we needed to set the in_place flag to False explicitly.
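The difference between the two modes mirrors the difference between Python's list.sort (modifies the list, returns None) and sorted (returns a new list). A minimal sketch on plain tuples follows; sort_hits is a hypothetical helper written for this illustration, and the hit IDs are shortened versions of those in the listing above.

```python
def sort_hits(hits, key=None, reverse=False, in_place=True):
    """Sort a list of hits; mimic an in_place flag (illustrative only)."""
    if in_place:
        hits.sort(key=key, reverse=reverse)  # reorders the caller's list
        return None
    return sorted(hits, key=key, reverse=reverse)  # leaves the input untouched


# (shortened hit ID, full sequence length) pairs from the listing above
hits = [("NR_030195.1", 61), ("NR_035856.1", 60), ("NR_032573.1", 85)]

by_length = sort_hits(hits, key=lambda hit: hit[1], reverse=True, in_place=False)
native_first = hits[0]       # still the first hit of the native ordering
longest_first = by_length[0]
```

Keeping the original ordering around is useful because the native rank of a hit usually carries meaning assigned by the search tool.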
At this point, you know enough about QueryResult objects to make them work for you. But before we go on to the next object in the Bio.SearchIO model, let's take a look at two more sets of methods that could make it even easier to work with QueryResult objects: the filter and map methods.
If you're familiar with Python's list comprehensions, generator expressions or the built-in filter and map functions, you'll know how useful they are for working with list-like objects (if you're not, check them out!). You can use these built-in functions to manipulate QueryResult objects, but you'll end up with regular Python lists and lose the ability to do more interesting manipulations.
That's why QueryResult objects provide their own flavor of filter and map methods. Analogous to filter, there are hit_filter and hsp_filter methods. As their names imply, these methods filter a QueryResult object either on its Hit objects or on its HSP objects. Similarly, analogous to map, QueryResult objects also provide the hit_map and hsp_map methods. These methods apply a given function to all hits or HSPs in a QueryResult object, respectively.
Let's see these methods in action, beginning with hit_filter. This method accepts a callback function that checks whether a given Hit object passes the condition you set or not. In other words, the function must accept a single Hit object as its argument and return True or False.
Here is an example of using hit_filter to filter out Hit objects that only have one HSP:
>>> filter_func = lambda hit: len(hit.hsps) > 1     # the callback function
>>> len(blast_qresult)      # no. of hits before filtering
100
>>> filtered_qresult = blast_qresult.hit_filter(filter_func)
>>> len(filtered_qresult)   # no. of hits after filtering
37
>>> for hit in filtered_qresult[:5]:    # quick check for the hit lengths
...     print("%s %i" % (hit.id, len(hit.hsps)))
gi|301171322|ref|NR_035857.1| 2
gi|262205330|ref|NR_030198.1| 2
gi|301171447|ref|NR_035871.1| 2
gi|262205298|ref|NR_030190.1| 2
gi|270132717|ref|NR_032716.1| 2
hsp_filter works the same as hit_filter, only instead of looking at the Hit objects, it performs filtering on the HSP objects in each hit.
As for the map methods, they too accept a callback function as their argument. However, instead of returning True or False, the callback function must return the modified Hit or HSP object (depending on whether you're using hit_map or hsp_map).
Let's see an example where we're using hit_map to rename the hit IDs:
>>> def map_func(hit):
...     hit.id = hit.id.split('|')[3]   # renames 'gi|301171322|ref|NR_035857.1|' to 'NR_035857.1'
...     return hit
...
>>> mapped_qresult = blast_qresult.hit_map(map_func)
>>> for hit in mapped_qresult[:5]:
...     print(hit.id)
NR_030195.1
NR_035856.1
NR_032573.1
NR_035857.1
NR_035851.1
Again, hsp_map works the same as hit_map, but on HSP objects instead of Hit objects.
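The reason for container-specific filter and map methods, as opposed to Python's built-ins, is that the result stays the same kind of container. The sketch below shows that idea with a hypothetical Hits wrapper around plain dictionaries. It is not the Biopython implementation, and the choice to copy each item before mapping is an assumption made here so that the original container is left untouched.

```python
class Hits:
    """Hypothetical list wrapper whose filter/map return the same type."""

    def __init__(self, items):
        self.items = list(items)

    def __len__(self):
        return len(self.items)

    def hit_filter(self, func=None):
        # Keep only the hits for which the callback returns True
        return Hits(h for h in self.items if func is None or func(h))

    def hit_map(self, func=None):
        # Apply the callback to a copy of every hit (assumption of this sketch)
        return Hits(h if func is None else func(dict(h)) for h in self.items)


hits = Hits([
    {"id": "gi|262205317|ref|NR_030195.1|", "hsps": 1},
    {"id": "gi|301171322|ref|NR_035857.1|", "hsps": 2},
])

multi = hits.hit_filter(lambda hit: hit["hsps"] > 1)


def shorten(hit):
    hit["id"] = hit["id"].split("|")[3]  # keep only the accession part
    return hit


renamed = hits.hit_map(shorten)
```

Because both results are Hits objects again, the calls can be chained, for example hits.hit_filter(...).hit_map(...).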
8.1.2 Hit
Hit objects represent all query results from a single database entry. They are the second-level container in the Bio.SearchIO object hierarchy. You've seen that they are contained by QueryResult objects, but they themselves contain HSP objects.
Let's see what they look like, beginning with our BLAST search:
>>> blast_hit = blast_qresult[3]    # fourth hit from the query result
>>> print(blast_hit)
Query: 42291
       mystery_seq
  Hit: gi|301171322|ref|NR_035857.1| (86)
       Pan troglodytes microRNA mir-520c (MIR520C), microRNA
 HSPs:
          #   E-value  Bit score    Span      Query range              Hit range

          0   8.9e-20     100.47      60           [1:61]                [13:73]
          1   3.3e-06      55.39      60           [0:60]                [13:73]
You see that we've got the essentials covered here:
The query ID and description are present. A hit is always tied to a query, so we want to keep track of the originating query as well. These values can be accessed from a hit using the query_id and query_description attributes.
We also have the unique hit ID, description, and full sequence length. They can be accessed using id, description, and seq_len, respectively.
Finally, there's a table containing quick information about the HSPs this hit contains. In each row, we've got the important HSP details listed: the HSP index, its e-value, its bit score, its span (the alignment length including gaps), its query coordinates, and its hit coordinates.
Now let's contrast this with the BLAT search. Remember that in the BLAT search we had one hit with 17 HSPs.
>>> blat_hit = blat_qresult[0]      # the only hit
>>> print(blat_hit)
Query: mystery_seq
  Hit: chr19 (59128983)
 HSPs:

          0         ?          ?       ?           [0:61]    [54204480:54204541]
          1         ?          ?       ?           [0:61]    [54233104:54264463]
          2         ?          ?       ?           [0:61]    [54254477:54260071]
          3         ?          ?       ?           [1:61]    [54210720:54210780]
          4         ?          ?       ?           [0:60]    [54198476:54198536]
          5         ?          ?       ?           [0:61]    [54265610:54265671]
          6         ?          ?       ?           [0:61]    [54238143:54240175]
          7         ?          ?       ?           [0:60]    [54189735:54189795]
          8         ?          ?       ?           [0:61]    [54185425:54185486]
          9         ?          ?       ?           [0:60]    [54197657:54197717]
         10         ?          ?       ?           [0:61]    [54255662:54255723]
         11         ?          ?       ?           [0:61]    [54201651:54201712]
         12         ?          ?       ?           [8:60]    [54206009:54206061]
         13         ?          ?       ?          [10:61]    [54178987:54179038]
         14         ?          ?       ?           [8:61]    [54212018:54212071]
         15         ?          ?       ?           [8:51]    [54234278:54234321]
         16         ?          ?       ?           [8:61]    [54238143:54238196]
Here, we've got a similar level of detail as with the BLAST hit we saw earlier. There are some differences worth explaining, though:
The e-value and bit score column values. As BLAT HSPs do not have e-values and bit scores, the display defaults to '?'.
What about the span column? The span value is meant to display the complete alignment length, which consists of all residues and any gaps that may be present. The PSL format does not have this information readily available and Bio.SearchIO does not attempt to guess what it is, so we get a '?' similar to the e-value and bit score columns.
In terms of Python objects, Hit behaves almost the same as Python lists, but contains HSP objects exclusively. If you're familiar with lists, you should encounter no difficulties working with the Hit object.
Just like Python lists, Hit objects are iterable, and each iteration returns one HSP object it contains:
>>> for hsp in blast_hit:
...     hsp
HSP(hit_id='gi|301171322|ref|NR_035857.1|', query_id='42291', 1 fragments)
HSP(hit_id='gi|301171322|ref|NR_035857.1|', query_id='42291', 1 fragments)
You can invoke len on a Hit to see how many HSP objects it has:
>>> len(blast_hit)
2
>>> len(blat_hit)
17
You can use the slice notation on Hit objects, whether to retrieve a single HSP or multiple HSP objects. Like QueryResult, if you slice for multiple HSPs, a new Hit object will be returned containing only the sliced HSP objects:
>>> blat_hit[0]                 # retrieve single items
HSP(hit_id='chr19', query_id='mystery_seq', 1 fragments)
>>> sliced_hit = blat_hit[4:9]  # retrieve multiple items
>>> len(sliced_hit)
5
>>> print(sliced_hit)
Query: mystery_seq
  Hit: chr19 (59128983)
 HSPs:

          0         ?          ?       ?           [0:60]    [54198476:54198536]
          1         ?          ?       ?           [0:61]    [54265610:54265671]
          2         ?          ?       ?           [0:61]    [54238143:54240175]
          3         ?          ?       ?           [0:60]    [54189735:54189795]
          4         ?          ?       ?           [0:61]    [54185425:54185486]
You can also sort the HSPs inside a Hit, using the exact same arguments as the sort method you saw in the QueryResult object.
Finally, there are also the filter and map methods you can use on Hit objects. Unlike in the QueryResult object, Hit objects only have one variant of filter (Hit.filter) and one variant of map (Hit.map). Both Hit.filter and Hit.map work on the HSP objects a Hit has.
8.1.3 HSP
HSP (high-scoring pair) represents region(s) in the hit sequence that contain significant alignment(s) to the query sequence. It contains the actual match between your query sequence and a database entry. As this match is determined by the sequence search tool's algorithms, the HSP object contains the bulk of the statistics computed by the search tool. This also makes the distinction between HSP objects from different search tools more apparent compared to the differences you've seen in QueryResult or Hit objects.
Let's see some examples from our BLAST and BLAT searches. We'll look at the BLAST HSP first:
>>> blast_hsp = blast_qresult[0][0]    # first hit, first hsp
>>> print(blast_hsp)
      Query: 42291 mystery_seq
        Hit: gi|262205317|ref|NR_030195.1| Homo sapiens microRNA 520b (MIR520...
Query range: [0:61] (1)
  Hit range: [0:61] (1)
Quick stats: evalue 4.9e-23; bitscore 111.29
  Fragments: 1 (61 columns)
     Query - CCCTCTACAGGGAAGCGCTTTCTGTTGTCTGAAAGAAAAGAAAGTGCTTCCTTTTAGAGGG
             |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
       Hit - CCCTCTACAGGGAAGCGCTTTCTGTTGTCTGAAAGAAAAGAAAGTGCTTCCTTTTAGAGGG
Just like QueryResult and Hit, invoking print on an HSP shows its general details:
There are the query and hit IDs and descriptions. We need these to identify our HSP.
We've also got the matching range of the query and hit sequences. The slice notation we're using here is an indication that the range is displayed using Python's indexing style (zero-based, half-open). The number inside the parentheses denotes the strand. In this case, both sequences have the plus strand.
Some quick statistics are available: the e-value and bitscore.
There is information about the HSP fragments. Ignore this for now; it will be explained later on.
And finally, we have the query and hit sequence alignment itself.
These details can be accessed on their own using the dot notation, just like in QueryResult and Hit:
>>> blast_hsp.query_range
(0, 61)
>>> blast_hsp.evalue
4.91307e-23
They're not the only attributes available, though. HSP objects come with a default set of properties that makes it easy to probe their various details. Here are some examples:
>>> blast_hsp.hit_start         # start coordinate of the hit sequence
0
>>> blast_hsp.query_span        # how many residues in the query sequence
61
>>> blast_hsp.aln_span          # how long the alignment is
61
Check out the HSP documentation for a full list of these predefined properties.
Furthermore, each sequence search tool usually computes its own statistics / details for its HSP objects. For example, an XML BLAST search also outputs the number of gaps and identical residues. These attributes can be accessed like so:
>>> blast_hsp.gap_num       # number of gaps
0
>>> blast_hsp.ident_num     # number of identical residues
61
These details are format-specific; they may not be present in other formats. To see which details are available for a given sequence search tool, you should check the format's documentation in Bio.SearchIO. Alternatively, you may also use .__dict__.keys() for a quick list of what's available:
>>> blast_hsp.__dict__.keys()
['bitscore', 'evalue', 'ident_num', 'gap_num', 'bitscore_raw', 'pos_num', '_items']
Finally, you may have noticed that the query and hit attributes of our HSP are not just regular strings:
>>> blast_hsp.query
SeqRecord(seq=Seq('CCCTCTACAGGGAAGCGCTTTCTGTTGTCTGAAAGAAAAGAAAGTGCTTCCTTT...GGG', DNAAlphabet()), id='42291', name='aligned query sequence', description='mystery_
>>> blast_hsp.hit
SeqRecord(seq=Seq('CCCTCTACAGGGAAGCGCTTTCTGTTGTCTGAAAGAAAAGAAAGTGCTTCCTTT...GGG', DNAAlphabet()), id='gi|262205317|ref|NR_030195.1|', name='aligned hit sequence',
They are SeqRecord objects you saw earlier in Section 4! This means that you can do all sorts of interesting things you can do with SeqRecord objects on HSP.query and/or HSP.hit.
It should not surprise you now that the HSP object has an alignment property (aln) which is a MultipleSeqAlignment object:
>>> print(blast_hsp.aln)
DNAAlphabet() alignment with 2 rows and 61 columns
CCCTCTACAGGGAAGCGCTTTCTGTTGTCTGAAAGAAAAGAAAG...GGG 42291
CCCTCTACAGGGAAGCGCTTTCTGTTGTCTGAAAGAAAAGAAAG...GGG gi|262205317|ref|NR_030195.1|
Having probed the BLAST HSP, let's now take a look at HSPs from our BLAT results for a different kind of HSP. As usual, we'll begin by invoking print on it:
>>> blat_hsp = blat_qresult[0][0]       # first hit, first hsp
>>> print(blat_hsp)
      Query: mystery_seq <unknown description>
        Hit: chr19 <unknown description>
  Hit range: [54204480:54204541] (1)
Quick stats: evalue ?; bitscore ?
  Fragments: 1 (? columns)
Some of the outputs you may have already guessed. We have the query and hit IDs and descriptions and the sequence coordinates. Values for evalue and bitscore are '?' as BLAT HSPs do not have these attributes. But the biggest difference here is that you don't see any sequence alignments displayed. If you look closer, the PSL format itself does not have any hit or query sequences, so Bio.SearchIO won't create any sequence or alignment objects. What happens if you try to access HSP.query, HSP.hit, or HSP.aln? You'll get the default value for these attributes, which is None:
>>> blat_hsp.hit is None
True
>>> blat_hsp.query is None
True
>>> blat_hsp.aln is None
True
This does not affect other attributes, though. For example, you can still access the length of the query or hit alignment. Despite not displaying any sequences, the PSL format still has this information, so Bio.SearchIO can extract it:
>>> blat_hsp.query_span     # length of query match
61
>>> blat_hsp.hit_span       # length of hit match
61
Other format-specific attributes are still present as well:
>>> blat_hsp.score          # PSL score
61
>>> blat_hsp.mismatch_num   # the mismatch column
0
So far so good? Things get more interesting when you look at another variant of HSP present in our BLAT results. You might recall that in BLAT searches, sometimes we get our results separated into blocks. These blocks are essentially alignment fragments that may have some intervening sequence between them.
Let's take a look at a BLAT HSP that contains multiple blocks to see how Bio.SearchIO deals with this:
>>> blat_hsp2 = blat_qresult[0][1]      # first hit, second hsp
>>> print(blat_hsp2)
  Hit range: [54233104:54264463] (1)
Quick stats: evalue ?; bitscore ?
  Fragments:
               #    Span     Query range              Hit range

               0       ?          [0:18]    [54233104:54233122]
               1       ?         [18:61]    [54264420:54264463]
What's happening here? We still have some essential details covered: the IDs and descriptions, the coordinates, and the quick statistics are similar to what you've seen before. But the fragments detail is all different. Instead of showing 'Fragments: 1', we now have a table with two data rows.
This is how Bio.SearchIO deals with HSPs having multiple fragments. As mentioned before, an HSP alignment may be separated by intervening sequences into fragments. The intervening sequences are not part of the query-hit match, so they should not be considered part of either the query or the hit sequence. However, they do affect how we deal with sequence coordinates, so we can't ignore them.
Take a look at the hit coordinate of the HSP above. In the 'Hit range:' field, we see that the coordinate is [54233104:54264463]. But looking at the table rows, we see that not the entire region spanned by this coordinate matches our query. Specifically, the intervening region spans from 54233122 to 54264420.
Why, then, do the query coordinates seem to be contiguous, you ask? This is perfectly fine. In this case it means that the query match is contiguous (no intervening regions), while the hit match is not.
All these attributes are accessible from the HSP directly, by the way:
>>> blat_hsp2.hit_range         # hit start and end coordinates of the entire HSP
(54233104, 54264463)
>>> blat_hsp2.hit_range_all     # hit start and end coordinates of each fragment
[(54233104, 54233122), (54264420, 54264463)]
>>> blat_hsp2.hit_span          # hit span of the entire HSP
31359
>>> blat_hsp2.hit_span_all      # hit span of each fragment
[18, 43]
>>> blat_hsp2.hit_inter_ranges  # start and end coordinates of intervening regions in the hit sequence
[(54233122, 54264420)]
>>> blat_hsp2.hit_inter_spans   # span of intervening regions in the hit sequence
[31298]
Most of these attributes are not readily available from the PSL file we have, but Bio.SearchIO calculates them for you on the fly when you parse the PSL file. All it needs are the start and end coordinates of each fragment.
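The arithmetic behind these derived values is simple enough to reproduce by hand. The sketch below recomputes them from the two fragment coordinates shown above, using a hypothetical helper named summarize_fragments; it illustrates the calculation only and is not the code Bio.SearchIO runs internally.

```python
def summarize_fragments(fragments):
    """Derive whole-HSP and intervening-region values from fragment ranges.

    fragments: list of (start, end) tuples, zero-based and half-open.
    """
    frags = sorted(fragments)
    overall = (frags[0][0], max(end for _, end in frags))
    spans = [end - start for start, end in frags]
    # An intervening region runs from the end of one fragment
    # to the start of the next one.
    inter_ranges = [(frags[i][1], frags[i + 1][0]) for i in range(len(frags) - 1)]
    inter_spans = [end - start for start, end in inter_ranges]
    return {
        "range": overall,
        "span": overall[1] - overall[0],
        "span_all": spans,
        "inter_ranges": inter_ranges,
        "inter_spans": inter_spans,
    }


# Hit coordinates of the two fragments of the second BLAT HSP
summary = summarize_fragments([(54233104, 54233122), (54264420, 54264463)])
```

Note that the overall span (31359) equals the sum of the fragment spans (18 + 43) plus the intervening span (31298).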
What about the query, hit, and aln attributes? If the HSP has multiple fragments, you won't be able to use these attributes as they only fetch single SeqRecord or MultipleSeqAlignment objects. However, you can use their *_all counterparts: query_all, hit_all, and aln_all. These properties will return a list containing SeqRecord or MultipleSeqAlignment objects from each of the HSP fragments. There are other attributes that behave similarly, i.e. they only work for HSPs with one fragment. Check out the HSP documentation for a full list.
Finally, to check whether you have multiple fragments or not, you can use the is_fragmented property like so:
>>> blat_hsp2.is_fragmented     # BLAT HSP with 2 fragments
True
>>> blat_hsp.is_fragmented      # BLAT HSP from earlier, with one fragment
False
Before we move on, you should also know that we can use the slice notation on HSP objects, just like QueryResult or Hit objects. When you use this notation, you'll get an HSPFragment object in return, the last component of the object model.
8.1.4 HSPFragment
HSPFragment represents a single, contiguous match between the query and hit sequences. You could consider it the core of the object model and search result, since it is the presence of these fragments that determines whether your search has results or not.
In most cases, you don't have to deal with HSPFragment objects directly since not that many sequence search tools fragment their HSPs. When you do have to deal with them, what you should remember is that HSPFragment objects were written to be as compact as possible. In most cases, they only contain attributes directly related to sequences: strands, reading frames, alphabets, coordinates, the sequences themselves, and their IDs and descriptions.
These attributes are readily shown when you invoke print on an HSPFragment. Here's an example, taken from our BLAST search:
>>> blast_frag = blast_qresult[0][0][0]    # first hit, first hsp, first fragment
>>> print(blast_frag)
      Query: 42291 mystery_seq
        Hit: gi|262205317|ref|NR_030195.1| Homo sapiens microRNA 520b (MIR520...
  Hit range: [0:61] (1)
  Fragments: 1 (61 columns)
     Query - CCCTCTACAGGGAAGCGCTTTCTGTTGTCTGAAAGAAAAGAAAGTGCTTCCTTTTAGAGGG
             |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
       Hit - CCCTCTACAGGGAAGCGCTTTCTGTTGTCTGAAAGAAAAGAAAGTGCTTCCTTTTAGAGGG
At this level, the BLAT fragment looks quite similar to the BLAST fragment, save for the query and hit sequences which are not present:
>>> blat_frag = blat_qresult[0][0][0]    # first hit, first hsp, first fragment
>>> print(blat_frag)
  Hit range: [54204480:54204541] (1)
  Fragments: 1 (? columns)
In all cases, these attributes are accessible using our favorite dot notation. Some examples:
>>> blast_frag.query_start      # query start coordinate
0
>>> blast_frag.hit_strand       # hit sequence strand
1
>>> blast_frag.hit              # hit sequence, as a SeqRecord object
SeqRecord(seq=Seq('CCCTCTACAGGGAAGCGCTTTCTGTTGTCTGAAAGAAAAGAAAGTGCTTCCTTT...GGG', DNAAlphabet()), id='gi|262205317|ref|NR_030195.1|', name='aligned hit sequence',
8.2 A note about standards and conventions
Before we move on to the main functions, there is something you ought to know about the standards Bio.SearchIO uses. If you've worked with multiple sequence search tools, you might have had to deal with the many different ways each program deals with things like sequence coordinates. It might not have been a pleasant experience as these search tools usually have their own standards. For example, one tool might use one-based coordinates, while another uses zero-based coordinates. Or, one program might reverse the start and end coordinates if the strand is minus, while others don't. In short, these differences often create an unnecessary mess that must be dealt with.
We realize this problem ourselves and we intend to address it in Bio.SearchIO. After all, one of the goals of Bio.SearchIO is to create a common, easy-to-use interface to deal with various search output files. This means creating standards that extend beyond the object model you just saw.
Now, you might complain, "Not another standard!". Well, eventually we have to choose one convention or the other, so this is necessary. Plus, we're not creating something entirely new here; we are just adopting a standard we think is best for a Python programmer (it is Biopython, after all).
There are three implicit standards that you can expect when working with Bio.SearchIO:
The first one pertains to sequence coordinates. In Bio.SearchIO, all sequence coordinates follow Python's coordinate style: zero-based and half-open. For example, if in a BLAST XML output file the start and end coordinates of an HSP are 10 and 28, they would become 9 and 28 in Bio.SearchIO. The start coordinate becomes 9 because Python indices start from zero, while the end coordinate remains 28 as Python slices omit the last item in an interval.
The second is on sequence coordinate orders. In Bio.SearchIO, start coordinates are always less than or equal to end coordinates. This isn't always the case with all sequence search tools, as some of them have larger start coordinates when the sequence strand is minus.
The last one is on strand and reading frame values. For strands, there are only four valid choices: 1 (plus strand), -1 (minus strand), 0 (protein sequences), and None (no strand). For reading frames, the valid choices are integers from -3 to 3 and None.
Note that these standards only exist in Bio.SearchIO objects. If you write Bio.SearchIO objects into an output format, Bio.SearchIO will use the format's standard for the output. It does not force its standards onto your output file.
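The first two conventions amount to a small coordinate transformation. The following sketch shows it with a hypothetical helper, to_zero_based_half_open, which assumes the input is one-based and inclusive and that a tool may report start greater than end on the minus strand. It is meant only to illustrate the convention, not to reproduce any Bio.SearchIO parser.

```python
def to_zero_based_half_open(start, end):
    """Convert one-based, inclusive coordinates to Python-style ones.

    The smaller value always becomes the start, so reversed
    (minus-strand style) input is normalised as well.
    """
    low, high = min(start, end), max(start, end)
    return low - 1, high


plus = to_zero_based_half_open(10, 28)    # the example from the text
minus = to_zero_based_half_open(28, 10)   # a tool that swaps start and end

sequence = "A" * 30
# Python slicing now selects exactly the 19 residues 10..28 (one-based)
selected = sequence[plus[0]:plus[1]]
```

A convenient side effect of the half-open style is that the length of a region is simply end minus start.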
8.3 Reading search output files
There are two functions you can use for reading search output files into Bio.SearchIO objects: read and parse. They're essentially similar to the read and parse functions in other submodules like Bio.SeqIO or Bio.AlignIO. In both cases, you need to supply the search output file name and the file format name, both as Python strings. You can check the documentation for a list of format names Bio.SearchIO recognizes.
Bio.SearchIO.read is used for reading search output files with only one query and returns a QueryResult object. You've seen read used in our previous examples. What you haven't seen is that read may also accept additional keyword arguments, depending on the file format.
Here are some examples. In the first one, we use read just like previously to read a BLAST tabular output file. In the second one, we use a keyword argument to modify read so that it parses the BLAST tabular variant with comments in it:
>>>qresult=SearchIO.read('tab_2226_tblastn_003.txt','blasttab')
>>>qresult
QueryResult(id='gi|16080617|ref|NP_391444.1|',3hits)
>>>qresult2=SearchIO.read('tab_2226_tblastn_007.txt','blasttab',comments=True)
>>>qresult2
Thesekeywordargumentsdiffersamongfileformats.Checktheformatdocumentationtoseeifithaskeywordargumentsthatmodifiesitsparsersbehavior.
AsfortheBio.SearchIO.parse,itisusedforreadingsearchoutputfileswithanynumberofqueries.ThefunctionreturnsageneratorobjectthatyieldsaQueryResult
objectineachiteration.LikeBio.SearchIO.read,italsoacceptsformatspecifickeywordarguments:
>>>qresults=SearchIO.parse('tab_2226_tblastn_001.txt','blasttab')
>>>forqresultinqresults:
...print(qresult.id)
gi|16080617|ref|NP_391444.1|
gi|11464971:4101
>>>qresults2=SearchIO.parse('tab_2226_tblastn_005.txt','blasttab',comments=True)
>>>forqresultinqresults2:
...print(qresult.id)
random_s00
gi|16080617|ref|NP_391444.1|
gi|11464971:4101
8.4 Dealing with large search output files with indexing
Sometimes, you're handed a search output file containing hundreds or thousands of queries that you need to parse. You can of course use Bio.SearchIO.parse for this file, but that would be grossly inefficient if you need to access only a few of the queries. This is because parse will parse all queries it sees before it fetches your query of interest.
In this case, the ideal choice would be to index the file using Bio.SearchIO.index or Bio.SearchIO.index_db. If the names sound familiar, it's because you've seen them before in Section 5.4.2. These functions also behave similarly to their Bio.SeqIO counterparts, with the addition of format-specific keyword arguments.
Here are some examples. You can use index with just the filename and format name:
>>> idx = SearchIO.index('tab_2226_tblastn_001.txt', 'blasttab')
>>> sorted(idx.keys())
['gi|11464971:4-101', 'gi|16080617|ref|NP_391444.1|']
>>> idx['gi|16080617|ref|NP_391444.1|']
>>> idx.close()
Or also with the format-specific keyword argument:
>>> idx = SearchIO.index('tab_2226_tblastn_005.txt', 'blasttab', comments=True)
>>> sorted(idx.keys())
['gi|11464971:4-101', 'gi|16080617|ref|NP_391444.1|', 'random_s00']
>>> idx['gi|16080617|ref|NP_391444.1|']
>>> idx.close()
Or with the key_function argument, as in Bio.SeqIO:
>>> key_function = lambda id: id.upper()  # capitalizes the keys
>>> idx = SearchIO.index('tab_2226_tblastn_001.txt', 'blasttab', key_function=key_function)
>>> sorted(idx.keys())
['GI|11464971:4-101', 'GI|16080617|REF|NP_391444.1|']
>>> idx['GI|16080617|REF|NP_391444.1|']
>>> idx.close()
Bio.SearchIO.index_db works like index, only it writes the query offsets into an SQLite database file.
8.5 Writing and converting search output files
It is occasionally useful to be able to manipulate search results from an output file and write them again to a new file. Bio.SearchIO provides a write function that lets you do exactly this. It takes as its arguments an iterable returning QueryResult objects, the output filename to write to, the format name to write to, and optionally some format-specific keyword arguments. It returns a four-item tuple, which denotes the number of QueryResult, Hit, HSP, and HSPFragment objects that were written.
>>> qresults = SearchIO.parse('mirna.xml', 'blast-xml')  # read XML file
>>> SearchIO.write(qresults, 'results.tab', 'blast-tab')  # write to tabular file
(3, 239, 277, 277)
You should note that different file formats require different attributes of the QueryResult, Hit, HSP and HSPFragment objects. If these attributes are not present, writing won't work. In other words, you can't always write to the output format that you want. For example, if you read a BLAST XML file, you wouldn't be able to write the results to a PSL file, as PSL files require attributes not calculated by BLAST (e.g. the number of repeat matches). You can always set these attributes manually, if you really want to write to PSL, though.
Like read, parse, index, and index_db, write also accepts format-specific keyword arguments. Check out the documentation for a complete list of formats Bio.SearchIO can write to and their arguments.
Finally, Bio.SearchIO also provides a convert function, which is simply a shortcut for Bio.SearchIO.parse and Bio.SearchIO.write. Using the convert function, our example above would be:
>>> SearchIO.convert('mirna.xml', 'blast-xml', 'results.tab', 'blast-tab')
(3, 239, 277, 277)
As convert uses write, it is limited to format conversions that have all the required attributes. Here, the BLAST XML file provides all the default values a BLAST tabular file requires, so it works just fine. However, other format conversions are less likely to work since you need to manually assign the required attributes first.
Chapter 9 Accessing NCBI's Entrez databases
Entrez (http://www.ncbi.nlm.nih.gov/Entrez) is a data retrieval system that provides users access to NCBI's databases such as PubMed, GenBank, GEO, and many others. You can access Entrez from a web browser to manually enter queries, or you can use Biopython's Bio.Entrez module for programmatic access to Entrez. The latter allows you, for example, to search PubMed or download GenBank records from within a Python script.
The Bio.Entrez module makes use of the Entrez Programming Utilities (also known as EUtils), consisting of eight tools that are described in detail on NCBI's page at http://www.ncbi.nlm.nih.gov/entrez/utils/. Each of these tools corresponds to one Python function in the Bio.Entrez module, as described in the sections below. This module makes sure that the correct URL is used for the queries, and that no more than three requests are made every second, as required by NCBI (see Section 9.1).
The output returned by the Entrez Programming Utilities is typically in XML format. To parse such output, you have several options:
1. Use Bio.Entrez's parser to parse the XML output into a Python object
2. Use the DOM (Document Object Model) parser in Python's standard library
3. Use the SAX (Simple API for XML) parser in Python's standard library
4. Read the XML output as raw text, and parse it by string searching and manipulation.
For the DOM and SAX parsers, see the Python documentation. The parser in Bio.Entrez is discussed below.
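As a rough illustration of option 2, the following sketch uses the DOM parser from Python's standard library (xml.dom.minidom) on a small hand-written XML string. The string is a shortened stand-in that only imitates the shape of an EInfo reply; it is an assumption for this example, and no request is sent to NCBI:

```python
from xml.dom import minidom

# Shortened, hand-written stand-in for an EInfo reply (illustrative only).
xml_text = """<?xml version="1.0"?>
<eInfoResult>
<DbList>
<DbName>pubmed</DbName>
<DbName>protein</DbName>
<DbName>nucleotide</DbName>
</DbList>
</eInfoResult>"""

document = minidom.parseString(xml_text)
# Each <DbName> element has a single text node holding the database name.
db_names = [node.firstChild.data for node in document.getElementsByTagName("DbName")]
print(db_names)
```

The DOM approach leaves you to walk the tree and convert values yourself, which is the work the Bio.Entrez parser does for you.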
NCBI uses DTD (Document Type Definition) files to describe the structure of the information contained in XML files. Most of the DTD files used by NCBI are included in the Biopython distribution. The Bio.Entrez parser makes use of the DTD files when parsing an XML file returned by NCBI Entrez.
Occasionally, you may find that the DTD file associated with a specific XML file is missing in the Biopython distribution. In particular, this may happen when NCBI updates its DTD files. If this happens, Entrez.read will show a warning message with the name and URL of the missing DTD file. The parser will proceed to access the missing DTD file through the internet, allowing the parsing of the XML file to continue. However, the parser is much faster if the DTD file is available locally. For this purpose, please download the DTD file from the URL in the warning message and place it in the directory ...site-packages/Bio/Entrez/DTDs, containing the other DTD files. If you don't have write access to this directory, you can also place the DTD file in ~/.biopython/Bio/Entrez/DTDs, where ~ represents your home directory. Since this directory is read before the directory ...site-packages/Bio/Entrez/DTDs, you can also put newer versions of DTD files there if the ones in ...site-packages/Bio/Entrez/DTDs become outdated. Alternatively, if you installed Biopython from source, you can add the DTD file to the source code's Bio/Entrez/DTDs directory, and reinstall Biopython. This will install the new DTD file in the correct location together with the other DTD files.
The Entrez Programming Utilities can also generate output in other formats, such as the Fasta or GenBank file formats for sequence databases, or the MedLine format for the literature database, discussed in Section 9.12.
9.1 Entrez Guidelines
Before using Biopython to access the NCBI's online resources (via Bio.Entrez or some of the other modules), please read the NCBI's Entrez User Requirements. If the NCBI finds you are abusing their systems, they can and will ban your access!
To paraphrase:
For any series of more than 100 requests, do this at weekends or outside USA peak times. This is up to you to obey.
Use the http://eutils.ncbi.nlm.nih.gov address, not the standard NCBI Web address. Biopython uses this web address.
Make no more than three requests every second (relaxed from at most one request every three seconds in early 2009). This is automatically enforced by Biopython.
Use the optional email parameter so the NCBI can contact you if there is a problem. You can either explicitly set this as a parameter with each call to Entrez (e.g. include email="A.N.Other@example.com" in the argument list), or you can set a global email address:
>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"
Bio.Entrez will then use this email address with each call to Entrez. The example.com address is a reserved domain name specifically for documentation (RFC 2606). Please DO NOT use a random email; it's better not to give an email at all. The email parameter has been mandatory since June 1, 2010. In case of excessive usage, NCBI will attempt to contact a user at the email address provided prior to blocking access to the E-utilities.
If you are using Biopython within some larger software suite, use the tool parameter to specify this. You can either explicitly set the tool name as a parameter with each call to Entrez (e.g. include tool="MyLocalScript" in the argument list), or you can set a global tool name:
>>> Entrez.tool = "MyLocalScript"
The tool parameter will default to Biopython.
For large queries, the NCBI also recommend using their session history feature (the WebEnv session cookie string, see Section 9.15). This is only slightly more complicated.
In conclusion, be sensible with your usage levels. If you plan to download lots of data, consider other options. For example, if you want easy access to all the human genes, consider fetching each chromosome by FTP as a GenBank file, and importing these into your own BioSQL database (see Section 20.5).
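Bio.Entrez applies the request limit for you, but the idea behind it is simple. The sketch below shows one way such a minimum-interval throttle could be written; the class name MinIntervalLimiter and its parameters are hypothetical, and this is not how Bio.Entrez itself is implemented:

```python
import time


class MinIntervalLimiter:
    """Block so that successive calls are at least `interval` seconds apart.

    Hypothetical helper for illustration only; Bio.Entrez has its own
    built-in throttling, so you would not normally need this.
    """

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock  # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Sleep just long enough to respect the minimum interval."""
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


limiter = MinIntervalLimiter(interval=1.0 / 3)  # about three calls per second
for _ in range(3):
    limiter.wait()
    # a request to a remote service would go here
```

Calling wait() before each request keeps a script under the limit even when the requests themselves return quickly.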
9.2 EInfo: Obtaining information about the Entrez databases
EInfo provides field index term counts, last update, and available links for each of NCBI's databases. In addition, you can use EInfo to obtain a list of all database names accessible through the Entrez utilities:
>>> Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
>>> handle = Entrez.einfo()
>>> result = handle.read()
>>> handle.close()
The variable result now contains a list of databases in XML format:
>>> print(result)
<?xml version="1.0"?>
<!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD eInfoResult, 11 May 2002//EN"
"http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eInfo_020511.dtd">
<eInfoResult>
<DbList>
<DbName>pubmed</DbName>
<DbName>protein</DbName>
<DbName>nucleotide</DbName>
<DbName>nuccore</DbName>
<DbName>nucgss</DbName>
<DbName>nucest</DbName>
<DbName>structure</DbName>
<DbName>genome</DbName>
<DbName>books</DbName>
<DbName>cancerchromosomes</DbName>
<DbName>cdd</DbName>
<DbName>gap</DbName>
<DbName>domains</DbName>
<DbName>gene</DbName>
<DbName>genomeprj</DbName>
<DbName>gensat</DbName>
<DbName>geo</DbName>
<DbName>gds</DbName>
<DbName>homologene</DbName>
<DbName>journals</DbName>
<DbName>mesh</DbName>
<DbName>ncbisearch</DbName>
<DbName>nlmcatalog</DbName>
<DbName>omia</DbName>
<DbName>omim</DbName>
<DbName>pmc</DbName>
<DbName>popset</DbName>
<DbName>probe</DbName>
<DbName>proteinclusters</DbName>
<DbName>pcassay</DbName>
<DbName>pccompound</DbName>
<DbName>pcsubstance</DbName>
<DbName>snp</DbName>
<DbName>taxonomy</DbName>
<DbName>toolkit</DbName>
<DbName>unigene</DbName>
<DbName>unists</DbName>
</DbList>
</eInfoResult>
Since this is a fairly simple XML file, we could extract the information it contains simply by string searching. Using Bio.Entrez's parser instead, we can directly parse this XML file into a Python object:
>>> handle = Entrez.einfo()
>>> record = Entrez.read(handle)
Now record is a dictionary with exactly one key:
>>> record.keys()
[u'DbList']
The value stored under this key is the list of database names shown in the XML above:
>>> record["DbList"]
['pubmed', 'protein', 'nucleotide', 'nuccore', 'nucgss', 'nucest',
'structure', 'genome', 'books', 'cancerchromosomes', 'cdd', 'gap',
'domains', 'gene', 'genomeprj', 'gensat', 'geo', 'gds', 'homologene',
'journals', 'mesh', 'ncbisearch', 'nlmcatalog', 'omia', 'omim', 'pmc',
'popset', 'probe', 'proteinclusters', 'pcassay', 'pccompound',
'pcsubstance', 'snp', 'taxonomy', 'toolkit', 'unigene', 'unists']
For each of these databases, we can use EInfo again to obtain more information:
>>> handle = Entrez.einfo(db="pubmed")
>>> record = Entrez.read(handle)
>>> record["DbInfo"]["Description"]
'PubMed bibliographic record'
>>> record["DbInfo"]["Count"]
'17989604'
>>> record["DbInfo"]["LastUpdate"]
'2008/05/24 06:45'
Try record["DbInfo"].keys() for other information stored in this record. One of the most useful is a list of possible search fields for use with ESearch:
>>> for field in record["DbInfo"]["FieldList"]:
...     print("%(Name)s, %(FullName)s, %(Description)s" % field)
ALL, All Fields, All terms from all searchable fields
UID, UID, Unique number assigned to publication
FILT, Filter, Limits the records
TITL, Title, Words in title of publication
WORD, Text Word, Free text associated with publication
MESH, MeSH Terms, Medical Subject Headings assigned to publication
MAJR, MeSH Major Topic, MeSH terms of major importance to publication
AUTH, Author, Author(s) of publication
JOUR, Journal, Journal abbreviation of publication
AFFL, Affiliation, Author's institutional affiliation and address
...
That's a long list, but indirectly this tells you that for the PubMed database, you can do things like Jones[AUTH] to search the author field, or Sanger[AFFL] to restrict to authors at the Sanger Centre. This can be very handy, especially if you are not so familiar with a particular database.
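Field-restricted terms like these are plain strings, so they are easy to assemble in code. Here is a small sketch; the helper build_term is hypothetical (it is not part of Bio.Entrez), and it simply joins value[FIELD] pairs with AND:

```python
def build_term(**fields):
    """Join field-restricted search terms with AND.

    Hypothetical helper for illustration; field tags such as AUTH or AFFL
    should be taken from the FieldList of the database you are searching.
    """
    return " AND ".join(
        "{}[{}]".format(value, tag) for tag, value in sorted(fields.items())
    )


term = build_term(AUTH="Jones", AFFL="Sanger")
print(term)
```

The resulting string could then be passed as the term argument of an ESearch call.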
9.3 ESearch: Searching the Entrez databases
To search any of these databases, we use Bio.Entrez.esearch(). For example, let's search in PubMed for publications related to Biopython:
>>> handle = Entrez.esearch(db="pubmed", term="biopython")
>>> record = Entrez.read(handle)
>>> record["IdList"]
['19304878', '18606172', '16403221', '16377612', '14871861', '14630660', '12230038']
In this output, you see seven PubMed IDs (including 19304878, which is the PMID for the Biopython application note), which can be retrieved by EFetch (see Section 9.6).
You can also use ESearch to search GenBank. Here we'll do a quick search for the matK gene in Cypripedioideae orchids (see Section 9.2 about EInfo for one way to find out which fields you can search in each Entrez database):
>>> handle = Entrez.esearch(db="nucleotide", term="Cypripedioideae[Orgn] AND matK[Gene]", idtype="acc")
>>> record = Entrez.read(handle)
>>> record["Count"]
'348'
>>> record["IdList"]
['JQ660909.1', 'JQ660908.1', 'JQ660907.1', 'JQ660906.1', ..., 'JQ660890.1']
Each of the IDs (JQ660909.1, JQ660908.1, JQ660907.1, ...) is a GenBank identifier (accession number). See Section 9.6 for information on how to actually download these GenBank records.
Note that instead of a species name like Cypripedioideae[Orgn], you can restrict the search using an NCBI taxon identifier; here this would be txid158330[Orgn]. This isn't currently documented on the ESearch help page; the NCBI explained this in reply to an email query. You can often deduce the search term formatting by playing with the Entrez web interface. For example, including complete[prop] in a genome search restricts to just completed genomes.
As a final example, let's get a list of computational journal titles:
>>> handle = Entrez.esearch(db="nlmcatalog", term="computational[Journal]", retmax='20')
>>> record = Entrez.read(handle)
>>> print("{} computational journals found".format(record["Count"]))
117 computational journals found
>>> print("The first 20 are\n{}".format(record['IdList']))
The first 20 are
['101660833', '101664671', '101661657', '101659814', '101657941',
'101653734', '101669877', '101649614', '101647835', '101639023',
'101627224', '101647801', '101589678', '101585369', '101645372',
'101586429', '101582229', '101574747', '101564639', '101671907']
Again, we could use EFetch to obtain more information for each of these journal IDs.
ESearch has many useful options; see the ESearch help page for more information.
9.4 EPost: Uploading a list of identifiers
EPost uploads a list of UIs for use in subsequent search strategies; see the EPost help page for more information. It is available from Biopython through the Bio.Entrez.epost() function.
To give an example of when this is useful, suppose you have a long list of IDs you want to download using EFetch (maybe sequences, maybe citations, anything). When you make a request with EFetch, your list of IDs, the database, etc., are all turned into a long URL sent to the server. If your list of IDs is long, this URL gets long, and long URLs can break (e.g. some proxies don't cope well).
Instead, you can break this up into two steps, first uploading the list of IDs using EPost (this uses an HTTP POST internally, rather than an HTTP GET, getting round the long URL problem). With the history support, you can then refer to this long list of IDs, and download the associated data with EFetch.
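To see why long ID lists are a problem for GET-style requests, the following sketch builds query URLs with the standard library and compares their lengths. The base address is a placeholder on example.org rather than the real E-utilities address, the helper get_url is hypothetical, and the long ID list is made up purely to show the effect on length:

```python
from urllib.parse import urlencode

BASE = "https://example.org/efetch"  # placeholder, not the real E-utilities address


def get_url(ids):
    """Return the URL a GET request would use for the given list of IDs."""
    return BASE + "?" + urlencode({"db": "pubmed", "id": ",".join(ids)})


few = ["19304878", "18606172"]
many = [str(19304878 + i) for i in range(500)]  # made-up IDs, for length only

print(len(get_url(few)))
print(len(get_url(many)))
```

Every extra identifier adds to the URL (the separating commas are percent-encoded as well), whereas a POST request carries the same data in the request body.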
Let's look at a simple example to see how EPost works, uploading some PubMed identifiers:
>>> id_list = ["19304878", "18606172", "16403221", "16377612", "14871861", "14630660"]
>>> print(Entrez.epost("pubmed", id=",".join(id_list)).read())
<!DOCTYPE ePostResult PUBLIC "-//NLM//DTD ePostResult, 11 May 2002//EN"
"http://www.ncbi.nlm.nih.gov/entrez/query/DTD/ePost_020511.dtd">
<ePostResult>
<QueryKey>1</QueryKey>
<WebEnv>NCID_01_206841095_130.14.22.101_9001_1242061629</WebEnv>
</ePostResult>
The returned XML includes two important strings, QueryKey and WebEnv, which together define your history session. You would extract these values for use with another Entrez call such as EFetch:
>>> id_list = ["19304878", "18606172", "16403221", "16377612", "14871861", "14630660"]
>>> search_results = Entrez.read(Entrez.epost("pubmed", id=",".join(id_list)))
>>> webenv = search_results["WebEnv"]
>>> query_key = search_results["QueryKey"]
Section 9.15 shows how to use the history feature.
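If you ever need these two values without Bio.Entrez's parser, the reply is small enough to handle with the standard library. This sketch uses xml.etree.ElementTree on a copy of the reply shown above (without the DOCTYPE line); it is an alternative for illustration, not what Entrez.read does internally:

```python
import xml.etree.ElementTree as ET

# Copy of the EPost reply shown above, minus the DOCTYPE declaration.
xml_text = """<ePostResult>
<QueryKey>1</QueryKey>
<WebEnv>NCID_01_206841095_130.14.22.101_9001_1242061629</WebEnv>
</ePostResult>"""

root = ET.fromstring(xml_text)
query_key = root.findtext("QueryKey")  # text of the first matching child element
webenv = root.findtext("WebEnv")
print(query_key, webenv)
```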
9.5 ESummary: Retrieving summaries from primary IDs
ESummary retrieves document summaries from a list of primary IDs (see the ESummary help page for more information). In Biopython, ESummary is available as Bio.Entrez.esummary(). Using the search result above, we can for example find out more about the journal with ID 101660833:
>>> handle = Entrez.esummary(db="nlmcatalog", id="101660833")
>>> record = Entrez.read(handle)
>>> info = record[0]['TitleMainList'][0]
>>> print("Journal info\nid: {}\nTitle: {}".format(record[0]["Id"], info["Title"]))
Journal info
id: 101660833
Title: IEEE transactions on computational imaging.
9.6 EFetch: Downloading full records from Entrez
EFetch is what you use when you want to retrieve a full record from Entrez. This covers several possible databases, as described on the main EFetch Help page.
For most of their databases, the NCBI support several different file formats. Requesting a specific file format from Entrez using Bio.Entrez.efetch() requires specifying the rettype and/or retmode optional arguments. The different combinations are described for each database type on the pages linked from the NCBI EFetch web page (e.g. literature, sequences and taxonomy).
One common usage is downloading sequences in the FASTA or GenBank/GenPept plain text formats (which can then be parsed with Bio.SeqIO, see Sections 5.3.1 and 9.6). From the Cypripedioideae example above, we can download GenBank record EU490707 using Bio.Entrez.efetch:
>>> handle = Entrez.efetch(db="nucleotide", id="EU490707", rettype="gb", retmode="text")
>>> print(handle.read())
LOCUS       EU490707                1302 bp    DNA     linear   PLN 26-JUL-2016
DEFINITION  Selenipedium aequinoctiale maturase K (matK) gene, partial cds;
            chloroplast.
ACCESSION   EU490707
VERSION     EU490707.1
KEYWORDS    .
SOURCE      chloroplast Selenipedium aequinoctiale
  ORGANISM  Selenipedium aequinoctiale
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; Liliopsida; Asparagales; Orchidaceae;
            Cypripedioideae; Selenipedium.
REFERENCE   1  (bases 1 to 1302)
  AUTHORS   Neubig,K.M., Whitten,W.M., Carlsward,B.S., Blanco,M.A., Endara,L.,
            Williams,N.H. and Moore,M.
  TITLE     Phylogenetic utility of ycf1 in orchids: a plastid gene more
            variable than matK
  JOURNAL   Plant Syst. Evol. 277 (1-2), 75-84 (2009)
REFERENCE   2  (bases 1 to 1302)
  AUTHORS   Neubig,K.M., Whitten,W.M., Carlsward,B.S., Blanco,M.A.,
            Endara,C.L., Williams,N.H. and Moore,M.J.
  TITLE     Direct Submission
  JOURNAL   Submitted (14-FEB-2008) Department of Botany, University of
            Florida, 220 Bartram Hall, Gainesville, FL 32611-8526, USA
FEATURES             Location/Qualifiers
     source          1..1302
                     /organism="Selenipedium aequinoctiale"
                     /organelle="plastid:chloroplast"
                     /mol_type="genomic DNA"
                     /specimen_voucher="FLAS:Blanco 2475"
                     /db_xref="taxon:256374"
     gene            <1..>1302
                     /gene="matK"
     CDS             <1..>1302
                     /gene="matK"
                     /codon_start=1
                     /transl_table=11
                     /product="maturase K"
                     /protein_id="ACC99456.1"
/translation="IFYEPVEIFGYDNKSSLVLVKRLITRMYQQNFLISSVNDSNQKG
FWGHKHFFSSHFSSQMVSEGFGVILEIPFSSQLVSSLEEKKIPKYQNLRSIHSIFPFL
EDKFLHLNYVSDLLIPHPIHLEILVQILQCRIKDVPSLHLLRLLFHEYHNLNSLITSK
KFIYAFSKRKKRFLWLLYNSYVYECEYLFQFLRKQSSYLRSTSSGVFLERTHLYVKIE
HLLVVCCNSFQRILCFLKDPFMHYVRYQGKAILASKGTLILMKKWKFHLVNFWQSYFH
FWSQPYRIHIKQLSNYSFSFLGYFSSVLENHLVVRNQMLENSFIINLLTKKFDTIAPV
ISLIGSLSKAQFCTVLGHPISKPIWTDFSDSDILDRFCRICRNLCRYHSGSSKKQVLY
RIKYILRLSCARTLARKHKSTVRTFMRRLGSGLLEEFFMEEE"
ORIGIN
1attttttacgaacctgtggaaatttttggttatgacaataaatctagtttagtacttgtg
61aaacgtttaattactcgaatgtatcaacagaattttttgatttcttcggttaatgattct
121aaccaaaaaggattttgggggcacaagcattttttttcttctcatttttcttctcaaatg
181gtatcagaaggttttggagtcattctggaaattccattctcgtcgcaattagtatcttct
241cttgaagaaaaaaaaataccaaaatatcagaatttacgatctattcattcaatatttccc
301tttttagaagacaaatttttacatttgaattatgtgtcagatctactaataccccatccc
361atccatctggaaatcttggttcaaatccttcaatgccggatcaaggatgttccttctttg
421catttattgcgattgcttttccacgaatatcataatttgaatagtctcattacttcaaag
481aaattcatttacgccttttcaaaaagaaagaaaagattcctttggttactatataattct
541tatgtatatgaatgcgaatatctattccagtttcttcgtaaacagtcttcttatttacga
601tcaacatcttctggagtctttcttgagcgaacacatttatatgtaaaaatagaacatctt
661ctagtagtgtgttgtaattcttttcagaggatcctatgctttctcaaggatcctttcatg
721cattatgttcgatatcaaggaaaagcaattctggcttcaaagggaactcttattctgatg
781aagaaatggaaatttcatcttgtgaatttttggcaatcttattttcacttttggtctcaa
841ccgtataggattcatataaagcaattatccaactattccttctcttttctggggtatttt
901tcaagtgtactagaaaatcatttggtagtaagaaatcaaatgctagagaattcatttata
961ataaatcttctgactaagaaattcgataccatagccccagttatttctcttattggatca
1021ttgtcgaaagctcaattttgtactgtattgggtcatcctattagtaaaccgatctggacc
1081gatttctcggattctgatattcttgatcgattttgccggatatgtagaaatctttgtcgt
1141tatcacagcggatcctcaaaaaaacaggttttgtatcgtataaaatatatacttcgactt
1201tcgtgtgctagaactttggcacggaaacataaaagtacagtacgcacttttatgcgaaga
1261ttaggttcgggattattagaagaattctttatggaagaagaa
//
Please be aware that as of October 2016 GI identifiers are discontinued in favour of accession numbers. You can still fetch sequences based on their GI, but new sequences are no longer given this identifier. You should instead refer to them by the accession number, as done in the example.
The arguments rettype="gb" and retmode="text" let us download this record in the GenBank format.
Note that until Easter 2009, the Entrez EFetch API let you use "genbank" as the return type; however, the NCBI now insist on using the official return types of "gb" or "gbwithparts" (or "gp" for proteins) as described online. Also note that until Feb 2012, the Entrez EFetch API would default to returning plain text files, but now defaults to XML.
Alternatively, you could for example use rettype="fasta" to get the Fasta format; see the EFetch Sequences Help page for other options. Remember, the available formats depend on which database you are downloading from; see the main EFetch Help page.
If you fetch the record in one of the formats accepted by Bio.SeqIO (see Chapter 5), you could directly parse it into a SeqRecord:
>>> from Bio import Entrez, SeqIO
>>> handle = Entrez.efetch(db="nucleotide", id="EU490707", rettype="gb", retmode="text")
>>> record = SeqIO.read(handle, "genbank")
>>> handle.close()
>>> print(record)
ID: EU490707.1
Name: EU490707
Description: Selenipedium aequinoctiale maturase K (matK) gene, partial cds; chloroplast.
Number of features: 3
...
Seq('ATTTTTTACGAACCTGTGGAAATTTTTGGTTATGACAATAAATCTAGTTTAGTA...GAA', IUPACAmbiguousDNA())
Note that a more typical use would be to save the sequence data to a local file, and then parse it with Bio.SeqIO. This can save you having to re-download the same file repeatedly while working on your script, and places less load on the NCBI's servers. For example:
import os
from Bio import SeqIO
from Bio import Entrez
Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
filename = "EU490707.gbk"
if not os.path.isfile(filename):
    # Downloading...
    net_handle = Entrez.efetch(db="nucleotide", id="EU490707", rettype="gb", retmode="text")
    out_handle = open(filename, "w")
    out_handle.write(net_handle.read())
    out_handle.close()
    net_handle.close()
    print("Saved")
print("Parsing...")
record = SeqIO.read(filename, "genbank")
print(record)
To get the output in XML format, which you can parse using the Bio.Entrez.read() function, use retmode="xml":
>>> handle = Entrez.efetch(db="nucleotide", id="EU490707", retmode="xml")
>>> record = Entrez.read(handle)
>>> handle.close()
>>> record[0]["GBSeq_definition"]
'Selenipedium aequinoctiale maturase K (matK) gene, partial cds; chloroplast'
>>> record[0]["GBSeq_source"]
'chloroplast Selenipedium aequinoctiale'
So, that dealt with sequences. For examples of parsing file formats specific to the other databases (e.g. the MEDLINE format used in PubMed), see Section 9.12.
If you want to perform a search with Bio.Entrez.esearch(), and then download the records with Bio.Entrez.efetch(), you should use the WebEnv history feature; see Section 9.15.
9.7 ELink: Searching for related items in NCBI Entrez
ELink, available from Biopython as Bio.Entrez.elink(), can be used to find related items in the NCBI Entrez databases. For example, you can use this to find nucleotide entries for an entry in the gene database, and other cool stuff.
Let's use ELink to find articles related to the Biopython application note published in Bioinformatics in 2009. The PubMed ID of this article is 19304878:
>>> pmid = "19304878"
>>> record = Entrez.read(Entrez.elink(dbfrom="pubmed", id=pmid))
The record variable consists of a Python list, one for each database in which we searched. Since we specified only one PubMed ID to search for, record contains only one item. This item is a dictionary containing information about our search term, as well as all the related items that were found:
>>> record[0]["DbFrom"]
'pubmed'
>>> record[0]["IdList"]
['19304878']
The "LinkSetDb" key contains the search results, stored as a list consisting of one item for each target database. In our search results, we only find hits in the PubMed database (although subdivided into categories):
>>> len(record[0]["LinkSetDb"])
5
>>> for linksetdb in record[0]["LinkSetDb"]:
...     print(linksetdb["DbTo"], linksetdb["LinkName"], len(linksetdb["Link"]))
...
pubmed pubmed_pubmed 110
pubmed pubmed_pubmed_combined 6
pubmed pubmed_pubmed_five 6
pubmed pubmed_pubmed_reviews 5
pubmed pubmed_pubmed_reviews_five 5
The actual search results are stored under the "Link" key. In total, 110 items were found under standard search. Let's now look at the first search result:
>>> record[0]["LinkSetDb"][0]["Link"][0]
{u'Id': '19304878'}
This is the article we searched for, which doesn't help us much, so let's look at the second search result:
>>> record[0]["LinkSetDb"][0]["Link"][1]
{u'Id': '14630660'}
This paper, with PubMed ID 14630660, is about the Biopython PDB parser.
We can use a loop to print out all PubMed IDs:
>>> for link in record[0]["LinkSetDb"][0]["Link"]:
...     print(link["Id"])
19304878
14630660
18689808
17121776
16377612
12368254
...
Now that was nice, but personally I am often more interested to find out if a paper has been cited. Well, ELink can do that too, at least for journals in PubMed Central (see Section 9.15.3).
For help on ELink, see the ELink help page. There is an entire sub-page just for the link names, describing how different databases can be cross-referenced.
9.8 EGQuery: Global Query counts for search terms
EGQuery provides counts for a search term in each of the Entrez databases (i.e. a global query). This is particularly useful to find out how many items your search terms would find in each database without actually performing lots of separate searches with ESearch (see the example in 9.14.2 below).
In this example, we use Bio.Entrez.egquery() to obtain the counts for "Biopython":
>>> handle = Entrez.egquery(term="biopython")
>>> record = Entrez.read(handle)
>>> for row in record["eGQueryResult"]:
...     print(row["DbName"], row["Count"])
...
pubmed 6
pmc 62
journals 0
...
See the EGQuery help page for more information.
9.9 ESpell: Obtaining spelling suggestions
ESpell retrieves spelling suggestions. In this example, we use Bio.Entrez.espell() to obtain the correct spelling of Biopython:
>>> handle = Entrez.espell(term="biopythooon")
>>> record = Entrez.read(handle)
>>> record["Query"]
'biopythooon'
>>> record["CorrectedQuery"]
'biopython'
See the ESpell help page for more information. The main use of this is for GUI tools to provide automatic suggestions for search terms.
9.10 Parsing huge Entrez XML files
The Entrez.read function reads the entire XML file returned by Entrez into a single Python object, which is kept in memory. To parse Entrez XML files too large to fit in memory, you can use the function Entrez.parse. This is a generator function that reads records in the XML file one by one. This function is only useful if the XML file reflects a Python list object (in other words, if Entrez.read on a computer with infinite memory resources would return a Python list).
For example, you can download the entire Entrez Gene database for a given organism as a file from NCBI's ftp site. These files can be very large. As an example, on September 4, 2009, the file Homo_sapiens.ags.gz, containing the Entrez Gene database for human, had a size of 116576 kB. This file, which is in the ASN format, can be converted into an XML file using NCBI's gene2xml program (see NCBI's ftp site for more information):
gene2xml -b T -i Homo_sapiens.ags -o Homo_sapiens.xml
The resulting XML file has a size of 6.1 GB. Attempting Entrez.read on this file will result in a MemoryError on many computers.
The XML file Homo_sapiens.xml consists of a list of Entrez gene records, each corresponding to one Entrez gene in human. Entrez.parse retrieves these gene records one by one. You can then print out or store the relevant information in each record by iterating over the records. For example, this script iterates over the Entrez gene records and prints out the gene numbers and names for all current genes:
>>> handle = open("Homo_sapiens.xml")
>>> records = Entrez.parse(handle)
>>> for record in records:
...     status = record['Entrezgene_track-info']['Gene-track']['Gene-track_status']
...     if status.attributes['value'] == 'discontinued':
...         continue
...     geneid = record['Entrezgene_track-info']['Gene-track']['Gene-track_geneid']
...     genename = record['Entrezgene_gene']['Gene-ref']['Gene-ref_locus']
...     print(geneid, genename)
This will print:
1 A1BG
2 A2M
3 A2MP
8 AA
9 NAT1
10 NAT2
11 AACP
12 SERPINA3
13 AADAC
14 AAMP
15 AANAT
16 AARS
17 AAVS1
...
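The record-by-record idea is not specific to Bio.Entrez. As a rough analogy, the sketch below streams over a tiny in-memory XML document with the standard library's xml.etree.ElementTree.iterparse, discarding each record once it has been used. The element names (GeneSet, Gene, Id, Name) are invented for this example and are much simpler than the real Entrez Gene XML:

```python
import io
import xml.etree.ElementTree as ET

# Tiny made-up document; real Entrez Gene XML is far more deeply nested.
xml_bytes = b"""<GeneSet>
<Gene><Id>1</Id><Name>A1BG</Name></Gene>
<Gene><Id>2</Id><Name>A2M</Name></Gene>
</GeneSet>"""


def iter_genes(handle):
    """Yield (id, name) pairs one record at a time from a file-like object."""
    for event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag == "Gene":
            yield elem.findtext("Id"), elem.findtext("Name")
            elem.clear()  # drop this record's children to keep memory use low


for gene_id, name in iter_genes(io.BytesIO(xml_bytes)):
    print(gene_id, name)
```

Because only one record is held at a time, memory use stays roughly constant however many records the file contains.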
9.11 Handling errors
Three things can go wrong when parsing an XML file:
The file may not be an XML file to begin with
The file may end prematurely or otherwise be corrupted
The file may be correct XML, but contain items that are not represented in the associated DTD.
The first case occurs if, for example, you try to parse a Fasta file as if it were an XML file:
>>> handle = open("NC_005816.fna")  # a Fasta file
>>> record = Entrez.read(handle)
Traceback (most recent call last):
...
Bio.Entrez.Parser.NotXMLError: Failed to parse the XML data (syntax error: line 1, column 0). Please make sure that the input data are in XML format.
Here, the parser didn't find the <?xml ... tag with which an XML file is supposed to start, and therefore decides (correctly) that the file is not an XML file.
When your file is in the XML format but is corrupted (for example, by ending prematurely), the parser will raise a CorruptedXMLError. Here is an example of an XML file that ends prematurely:
<!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD eInfoResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eInfo_020511.dtd">
<eInfoResult>
<DbList>
<DbName>protein</DbName>
<DbName>nucleotide</DbName>
<DbName>nuccore</DbName>
<DbName>nucgss</DbName>
<DbName>nucest</DbName>
<DbName>structure</DbName>
<DbName>genome</DbName>
<DbName>books</DbName>
<DbName>cancerchromosomes</DbName>
<DbName>cdd</DbName>
which will generate the following traceback:
>>> Entrez.read(handle)
Traceback (most recent call last):
...
Bio.Entrez.Parser.CorruptedXMLError: Failed to parse the XML data (no element found: line 16, column 0). Please make sure that the input data are not corrupted.
Note that the error message tells you at what point in the XML file the error was detected.
The third type of error occurs if the XML file contains tags that do not have a description in the corresponding DTD file. This is an example of such an XML file:
<!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD eInfoResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eInfo_020511.dtd">
<eInfoResult>
<DbInfo>
<MenuName>PubMed</MenuName>
<Description>PubMed bibliographic record</Description>
<Count>20161961</Count>
<LastUpdate>2010/09/10 04:52</LastUpdate>
<FieldList>
<Field>
...
</Field>
</FieldList>
<DocsumList>
<Docsum>
<DsName>PubDate</DsName>
<DsType>4</DsType>
<DsTypeName>string</DsTypeName>
</Docsum>
<Docsum>
<DsName>EPubDate</DsName>
...
</DbInfo>
</eInfoResult>
In this file, for some reason the tag <DocsumList> (and several others) are not listed in the DTD file eInfo_020511.dtd, which is specified on the second line as the DTD for this XML file. By default, the parser will stop and raise a ValidationError if it cannot find some tag in the DTD:
>>> handle = open("einfo3.xml")
>>> record = Entrez.read(handle)
Traceback (most recent call last):
...
Bio.Entrez.Parser.ValidationError: Failed to find tag 'DocsumList' in the DTD. To skip all tags that are not represented in the DTD, please call Bio.Entrez.reado
Optionally, you can instruct the parser to skip such tags instead of raising a ValidationError. This is done by calling Entrez.read or Entrez.parse with the argument validate equal to False:
>>> handle = open("einfo3.xml")
>>> record = Entrez.read(handle, validate=False)
Of course, the information contained in the XML tags that are not in the DTD is not present in the record returned by Entrez.read.
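The difference between the first two error cases can be illustrated with the expat parser from Python's standard library. The sketch below uses one plausible heuristic: if parsing fails before any element was seen, the data probably was not XML at all; if it fails later, the XML is probably corrupted. The function classify_xml is hypothetical and is not part of Bio.Entrez; DTD validation (the third case) is not covered here:

```python
import xml.parsers.expat as expat


def classify_xml(text):
    """Return 'ok', 'not_xml' or 'corrupted' for a string of data.

    Hypothetical helper for illustration; a heuristic, not Bio.Entrez's API.
    """
    seen_elements = []
    parser = expat.ParserCreate()
    # Record every start tag so we know whether any XML element was parsed.
    parser.StartElementHandler = lambda name, attrs: seen_elements.append(name)
    try:
        parser.Parse(text, True)  # True marks this as the final chunk of data
    except expat.ExpatError:
        return "corrupted" if seen_elements else "not_xml"
    return "ok"


print(classify_xml(">seq1\nACGT\n"))                      # a Fasta-like string
print(classify_xml("<DbList><DbName>pubmed</DbName>"))    # ends prematurely
print(classify_xml("<DbList><DbName>pubmed</DbName></DbList>"))
```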
9.12 Specialized parsers
The Bio.Entrez.read() function can parse most (if not all) XML output returned by Entrez. Entrez typically allows you to retrieve records in other formats, which may have some advantages compared to the XML format in terms of readability (or download size).
To request a specific file format from Entrez using Bio.Entrez.efetch() requires specifying the rettype and/or retmode optional arguments. The different combinations are described for each database type on the NCBI efetch webpage.
One obvious case is that you may prefer to download sequences in the FASTA or GenBank/GenPept plain text formats (which can then be parsed with Bio.SeqIO, see Sections 5.3.1 and 9.6). For the literature databases, Biopython contains a parser for the MEDLINE format used in PubMed.
9.12.1 Parsing Medline records
You can find the Medline parser in Bio.Medline. Suppose we want to parse the file pubmed_result1.txt, containing one Medline record. You can find this file in Biopython's Tests\Medline directory. The file looks like this:
PMID- 12230038
OWN - NLM
STAT- MEDLINE
DA  - 20020916
DCOM- 20030606
LR  - 20041117
PUBM- Print
IS  - 1467-5463 (Print)
VI  - 3
IP  - 3
DP  - 2002 Sep
TI  - The Bio* toolkits--a brief overview.
PG  - 296-302
AB  - Bioinformatics research is often difficult to do with commercial software. The
      Open Source BioPerl, BioPython and Biojava projects provide toolkits with
...
We first open the file and then parse it:
>>> from Bio import Medline
>>> with open("pubmed_result1.txt") as handle:
...     record = Medline.read(handle)
...
The record now contains the Medline record as a Python dictionary:
>>> record["PMID"]
'12230038'
>>> record["AB"]
'Bioinformatics research is often difficult to do with commercial software.
The Open Source BioPerl, BioPython and Biojava projects provide toolkits with
multiple functionality that make it easier to create customised pipelines or
analysis. This review briefly compares the quirks of the underlying languages
and the functionality, documentation, utility and relative advantages of the
Bio counterparts, particularly from the point of view of the beginning
biologist programmer.'
The key names used in a Medline record can be rather obscure; use
>>> help(record)
for a brief summary.
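To make the dictionary layout less mysterious, here is a minimal sketch of how a tag-value flat file of this kind can be folded into a Python dictionary. It is not the Bio.Medline implementation: the function name parse_tagged_record and the shortened sample text are illustrative assumptions, and the sketch assumes that each tag occupies the first four columns followed by "- ", with continuation lines indented by six spaces.

```python
def parse_tagged_record(text):
    """Fold 'TAG - value' lines into a dict (illustrative, not Bio.Medline)."""
    record = {}
    key = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("      ") and key is not None:
            # Continuation line: extend the most recent value of the current tag.
            record[key][-1] += " " + line.strip()
        else:
            # New tag: columns 0-3 hold the tag, the value starts at column 6.
            key = line[:4].strip()
            value = line[6:].strip()
            record.setdefault(key, []).append(value)
    return record


# Shortened sample in the assumed layout (hypothetical excerpt for illustration).
sample = (
    "PMID- 12230038\n"
    "TI  - The Bio* toolkits--a brief overview.\n"
    "AU  - Mangalam H\n"
    "AB  - Bioinformatics research is often difficult to do with commercial software. The\n"
    "      Open Source BioPerl, BioPython and Biojava projects provide toolkits with\n"
)
parsed = parse_tagged_record(sample)
print(parsed["PMID"])
print(parsed["AB"])
```

For simplicity the sketch stores every value in a list. Bio.Medline itself returns plain strings for single-valued keys such as "PMID" (as shown above) and lists for repeatable keys such as "AU".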
To parse a file containing multiple Medline records, you can use the parse function instead:
>>> with open("pubmed_result2.txt") as handle:
...     for record in Medline.parse(handle):
...         print(record["TI"])
...
A high level interface to SCOP and ASTRAL implemented in python.
GenomeDiagram: a python package for the visualization of large-scale genomic data.
Open source clustering software.
PDB file parser and structure class implemented in Python.
Instead of parsing Medline records stored in files, you can also parse Medline records downloaded by Bio.Entrez.efetch. For example, let's look at all Medline records in PubMed related to Biopython:
>>> from Bio import Entrez
>>> handle = Entrez.esearch(db="pubmed", term="biopython")
>>> record = Entrez.read(handle)
>>> record["IdList"]
['19304878', '18606172', '16403221', '16377612', '14871861', '14630660', '12230038']
We now use Bio.Entrez.efetch to download these Medline records:
>>> idlist = record["IdList"]
>>> handle = Entrez.efetch(db="pubmed", id=idlist, rettype="medline", retmode="text")
Here, we specify rettype="medline", retmode="text" to obtain the Medline records in plain-text Medline format. Now we use Bio.Medline to parse these records:
>>> records = Medline.parse(handle)
>>> for record in records:
...     print(record["AU"])
['Cock PJ', 'Antao T', 'Chang JT', 'Chapman BA', 'Cox CJ', 'Dalke A', ..., 'de Hoon MJ']
['Munteanu CR', 'Gonzalez-Diaz H', 'Magalhaes AL']
['Casbon JA', 'Crooks GE', 'Saqi MA']
['Pritchard L', 'White JA', 'Birch PR', 'Toth IK']
['de Hoon MJ', 'Imoto S', 'Nolan J', 'Miyano S']
['Hamelryck T', 'Manderick B']
['Mangalam H']
For comparison, here we show an example using the XML format:
>>> handle = Entrez.efetch(db="pubmed", id=idlist, rettype="medline", retmode="xml")
>>> records = Entrez.read(handle)
>>> for record in records['PubmedArticle']:
...     print(record["MedlineCitation"]["Article"]["ArticleTitle"])
Biopython: freely available Python tools for computational molecular biology and
bioinformatics.
Enzymes/non-enzymes classification model complexity based on composition, sequence,
3D and topological indices.
A high level interface to SCOP and ASTRAL implemented in python.
GenomeDiagram: a python package for the visualization of large-scale genomic data.
Open source clustering software.
PDB file parser and structure class implemented in Python.
The Bio* toolkits--a brief overview.
Note that in both of these examples, for simplicity we have naively combined ESearch and EFetch. In this situation, the NCBI would expect you to use their history feature, as illustrated in Section 9.15.
9.12.2 Parsing GEO records
GEO (Gene Expression Omnibus) is a data repository of high-throughput gene expression and hybridization array data. The Bio.Geo module can be used to parse GEO-formatted data.
The following code fragment shows how to parse the example GEO file GSE16.txt into a record and print the record:
>>> from Bio import Geo
>>> handle = open("GSE16.txt")
>>> records = Geo.parse(handle)
>>> for record in records:
...     print(record)
You can search the "gds" database (GEO datasets) with ESearch:
>>> handle = Entrez.esearch(db="gds", term="GSE16")
>>> record = Entrez.read(handle)
>>> record["Count"]
'2'
>>> record["IdList"]
['200000016', '100000028']
From the Entrez website, UID "200000016" is GDS16 while the other hit "100000028" is for the associated platform, GPL28. Unfortunately, at the time of writing the NCBI don't seem to support downloading GEO files using Entrez (not as XML, nor in the Simple Omnibus Format in Text (SOFT) format).
However, it is actually pretty straightforward to download the GEO files by FTP from ftp://ftp.ncbi.nih.gov/pub/geo/ instead. In this case you might want ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SOFT/by_series/GSE16/GSE16_family.soft.gz (a compressed file, see the Python module gzip).
9.12.3 Parsing UniGene records
UniGene is an NCBI database of the transcriptome, with each UniGene record showing the set of transcripts that are associated with a particular gene in a specific organism. A typical UniGene record looks like this:
ID          Hs.2
TITLE       N-acetyltransferase 2 (arylamine N-acetyltransferase)
GENE        NAT2
CYTOBAND    8p22
GENE_ID     10
LOCUSLINK   10
HOMOL       YES
EXPRESS     bone| connective tissue| intestine| liver| liver tumor| normal| soft tissue/muscle tissue tumor| adult
RESTR_EXPR  adult
CHROMOSOME  8
STS         ACC=PMC310725P3 UNISTS=272646
STS         ACC=WIAF2120 UNISTS=44576
STS         ACC=G59899 UNISTS=137181
...
STS         ACC=GDB:187676 UNISTS=155563
PROTSIM     ORG=10090; PROTGI=6754794; PROTID=NP_035004.1; PCT=76.55; ALN=288
PROTSIM     ORG=9796; PROTGI=149742490; PROTID=XP_001487907.1; PCT=79.66; ALN=288
PROTSIM     ORG=9986; PROTGI=126722851; PROTID=NP_001075655.1; PCT=76.90; ALN=288
...
PROTSIM     ORG=9598; PROTGI=114619004; PROTID=XP_519631.2; PCT=98.28; ALN=288
SCOUNT      38
SEQUENCE    ACC=BC067218.1; NID=g45501306; PID=g45501307; SEQTYPE=mRNA
SEQUENCE    ACC=NM_000015.2; NID=g116295259; PID=g116295260; SEQTYPE=mRNA
SEQUENCE    ACC=D90042.1; NID=g219415; PID=g219416; SEQTYPE=mRNA
SEQUENCE    ACC=D90040.1; NID=g219411; PID=g219412; SEQTYPE=mRNA
SEQUENCE    ACC=BC015878.1; NID=g16198419; PID=g16198420; SEQTYPE=mRNA
SEQUENCE    ACC=CR407631.1; NID=g47115198; PID=g47115199; SEQTYPE=mRNA
SEQUENCE    ACC=BG569293.1; NID=g13576946; CLONE=IMAGE:4722596; END=5'; LID=6989; SEQTYPE=EST; TRACE=44157214
...
SEQUENCE    ACC=AU099534.1; NID=g13550663; CLONE=HSI08034; END=5'; LID=8800; SEQTYPE=EST
//
This particular record shows the set of transcripts (shown in the SEQUENCE lines) that originate from the human gene NAT2, encoding an N-acetyltransferase. The PROTSIM lines show proteins with significant similarity to NAT2, whereas the STS lines show the corresponding sequence-tagged sites in the genome.
To parse UniGene files, use the Bio.UniGene module:
>>> from Bio import UniGene
>>> input = open("myunigenefile.data")
>>> record = UniGene.read(input)
The record returned by UniGene.read is a Python object with attributes corresponding to the fields in the UniGene record. For example,
>>> record.ID
'Hs.2'
>>> record.title
'N-acetyltransferase 2 (arylamine N-acetyltransferase)'
The EXPRESS and RESTR_EXPR lines are stored as Python lists of strings:
>>> record.express
['bone', 'connective tissue', 'intestine', 'liver', 'liver tumor', 'normal', 'soft tissue/muscle tissue tumor', 'adult']
Specialized objects are returned for the STS, PROTSIM, and SEQUENCE lines, storing the keys shown in each line as attributes:
>>> record.sts[0].acc
'PMC310725P3'
>>> record.sts[0].unists
'272646'
and similarly for the PROTSIM and SEQUENCE lines.
To parse a file containing more than one UniGene record, use the parse function in Bio.UniGene:
>>> from Bio import UniGene
>>> input = open("unigenerecords.data")
>>> records = UniGene.parse(input)
>>> for record in records:
...     print(record.ID)
9.13 Using a proxy
Normally you won't have to worry about using a proxy, but if this is an issue on your network here is how to deal with it. Internally, Bio.Entrez uses the standard Python library urllib for accessing the NCBI servers. This will check an environment variable called http_proxy to configure any simple proxy automatically. Unfortunately this module does not support the use of proxies which require authentication.
You may choose to set the http_proxy environment variable once (how you do this will depend on your operating system). Alternatively you can set this within Python at the start of your script, for example:
import os
os.environ["http_proxy"] = "http://proxyhost.example.com:8080"
See the urllib documentation for more details.
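If you want to confirm that the setting is being picked up, the following small sketch may help. It assumes Python 3, where the relevant functions live in urllib.request, and reuses the placeholder proxy address from above; the variable name proxies is just for illustration.

```python
import os
import urllib.request

# Placeholder proxy address, used only for illustration.
os.environ["http_proxy"] = "http://proxyhost.example.com:8080"

# getproxies() reports the proxy settings that urllib will use by default,
# keyed by URL scheme.
proxies = urllib.request.getproxies()
print(proxies.get("http"))
```

If this prints None, the environment variable was probably set after the process had already cached its configuration elsewhere, or under a different name than the one urllib checks.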
9.14 Examples
9.14.1 PubMed and Medline
If you are in the medical field or interested in human issues (and many times even if you are not!), PubMed (http://www.ncbi.nlm.nih.gov/PubMed/) is an excellent source of all kinds of goodies. So like other things, we'd like to be able to grab information from it and use it in Python scripts.
In this example, we will query PubMed for all articles having to do with orchids (see section 2.3 for our motivation). We first check how many of such articles there are:
>>> from Bio import Entrez
>>> handle = Entrez.egquery(term="orchid")
>>> record = Entrez.read(handle)
>>> for row in record["eGQueryResult"]:
...     if row["DbName"] == "pubmed":
...         print(row["Count"])
...
463
Now we use the Bio.Entrez.esearch function to download the PubMed IDs of these 463 articles:
>>> handle = Entrez.esearch(db="pubmed", term="orchid", retmax=463)
>>> record = Entrez.read(handle)
>>> idlist = record["IdList"]
>>> print(idlist)
This returns a Python list containing all of the PubMed IDs of articles related to orchids:
['18680603', '18665331', '18661158', '18627489', '18627452', '18612381',
'18594007', '18591784', '18589523', '18579475', '18575811', '18575690',
...
Now that we've got them, we obviously want to get the corresponding Medline records and extract the information from them. Here, we'll download the Medline records in the Medline flat-file format, and use the Bio.Medline module to parse them:
>>> from Bio import Medline
>>> handle = Entrez.efetch(db="pubmed", id=idlist, rettype="medline",
...                        retmode="text")
>>> records = Medline.parse(handle)
NOTE - We've just done a separate search and fetch here; the NCBI much prefer you to take advantage of their history support in this situation. See Section 9.15.
Keep in mind that records is an iterator, so you can iterate through the records only once. If you want to save the records, you can convert them to a list:
>>> records = list(records)
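The "only once" behaviour is easy to see with a stand-in generator. The sketch below does not use Biopython; fake_parse is a hypothetical helper that merely behaves like a record iterator.

```python
def fake_parse(lines):
    """Stand-in for a parser such as Medline.parse: yields one record at a time."""
    for line in lines:
        yield {"TI": line}


records = fake_parse(["first title", "second title"])
first_pass = [record["TI"] for record in records]
second_pass = [record["TI"] for record in records]  # the iterator is now exhausted
print(first_pass)
print(second_pass)

# Converting to a list up front keeps the records available for reuse.
saved = list(fake_parse(["first title", "second title"]))
print(len(saved), len(saved))
```

The trade-off is memory: a list holds every record at once, whereas an iterator only ever holds the current one.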
Let's now iterate over the records to print out some information about each record:
>>> for record in records:
...     print("title:", record.get("TI", "?"))
...     print("authors:", record.get("AU", "?"))
...     print("source:", record.get("SO", "?"))
...     print("")
...
The output for this looks like:
title: Sex pheromone mimicry in the early spider orchid (ophrys sphegodes):
patterns of hydrocarbons as the key mechanism for pollination by sexual
deception [In Process Citation]
authors: ['Schiestl FP', 'Ayasse M', 'Paulus HF', 'Lofstedt C', 'Hansson BS',
'Ibarra F', 'Francke W']
source: J Comp Physiol [A] 2000 Jun;186(6):567-74
Especially interesting to note is the list of authors, which is returned as a standard Python list. This makes it easy to manipulate and search using standard Python tools. For instance, we could loop through a whole bunch of entries searching for a particular author with code like the following:
>>> search_author = "Waits T"
>>> for record in records:
...     if "AU" not in record:
...         continue
...     if search_author in record["AU"]:
...         print("Author %s found: %s" % (search_author, record["SO"]))
...
Hopefully this section gave you an idea of the power and flexibility of the Entrez and Medline interfaces and how they can be used together.
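Because the author field is an ordinary list, other standard-library tools apply just as easily. As a hedged sketch, here is one way to count how often each author appears across a set of records; the records below are hand-written stand-ins shaped like Bio.Medline dictionaries, not real search results.

```python
from collections import Counter

# Stand-in records shaped like Bio.Medline dictionaries (illustrative values).
records = [
    {"TI": "Paper one", "AU": ["Cock PJ", "Antao T"]},
    {"TI": "Paper two", "AU": ["de Hoon MJ", "Cock PJ"]},
    {"TI": "Paper without an author list"},
]

# record.get("AU", []) skips records that have no author field at all.
author_counts = Counter(
    author for record in records for author in record.get("AU", [])
)
print(author_counts.most_common(1))
```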
9.14.2 Searching, downloading, and parsing Entrez Nucleotide records
Here we'll show a simple example of performing a remote Entrez query. In section 2.3 of the parsing examples, we talked about using NCBI's Entrez website to search the NCBI nucleotide databases for info on Cypripedioideae, our friends the lady slipper orchids. Now, we'll look at how to automate that process using a Python script. In this example, we'll just show how to connect, get the results, and parse them, with the Entrez module doing all of the work.
First, we use EGQuery to find out the number of results we will get before actually downloading them. EGQuery will tell us how many search results were found in each of the databases, but for this example we are only interested in nucleotides:
>>> handle = Entrez.egquery(term="Cypripedioideae")
>>> record = Entrez.read(handle)
>>> for row in record["eGQueryResult"]:
...     if row["DbName"] == "nuccore":
...         print(row["Count"])
...
814
So, we expect to find 814 Entrez Nucleotide records (this is the number I obtained in 2008; it is likely to increase in the future). If you find some ridiculously high number of hits, you may want to reconsider if you really want to download all of them, which is our next step:
>>> handle = Entrez.esearch(db="nucleotide", term="Cypripedioideae", retmax=814, idtype="acc")
>>> record = Entrez.read(handle)
Here, record is a Python dictionary containing the search results and some auxiliary information. Just for information, let's look at what is stored in this dictionary:
>>> print(record.keys())
[u'Count', u'RetMax', u'IdList', u'TranslationSet', u'RetStart', u'QueryTranslation']
First, let's check how many results were found:
>>> print(record["Count"])
814
which is the number we expected. The 814 results are stored in record['IdList']:
>>> len(record["IdList"])
814
Let's look at the first five results:
>>> record["IdList"][:5]
['KX265015.1', 'KX265014.1', 'KX265013.1', 'KX265012.1', 'KX265011.1']
We can download these records using efetch. While you could download these records one by one, to reduce the load on NCBI's servers, it is better to fetch a bunch of records at the same time, shown below. However, in this situation you should ideally be using the history feature described later in Section 9.15.
>>> idlist = ",".join(record["IdList"][:5])
>>> print(idlist)
KX265015.1,KX265014.1,KX265013.1,KX265012.1,KX265011.1
>>> handle = Entrez.efetch(db="nucleotide", id=idlist, retmode="xml")
>>> records = Entrez.read(handle)
>>> len(records)
5
Each of these records corresponds to one GenBank record.
>>> print(records[0].keys())
[u'GBSeq_moltype', u'GBSeq_source', u'GBSeq_sequence',
u'GBSeq_primary-accession', u'GBSeq_definition', u'GBSeq_accession-version',
u'GBSeq_topology', u'GBSeq_length', u'GBSeq_feature-table',
u'GBSeq_create-date', u'GBSeq_other-seqids', u'GBSeq_division',
u'GBSeq_taxonomy', u'GBSeq_references', u'GBSeq_update-date',
u'GBSeq_organism', u'GBSeq_locus', u'GBSeq_strandedness']
>>> print(records[0]["GBSeq_primary-accession"])
DQ110336
>>> print(records[0]["GBSeq_other-seqids"])
['gb|DQ110336.1|', 'gi|187237168']
>>> print(records[0]["GBSeq_definition"])
Cypripedium calceolus voucher Davis 03-03 A maturase (matR) gene, partial cds;
mitochondrial
>>> print(records[0]["GBSeq_organism"])
Cypripedium calceolus
You could use this to quickly set up searches, but for heavy usage, see Section 9.15.
9.14.3 Searching, downloading, and parsing GenBank records
The GenBank record format is a very popular method of holding information about sequences, sequence features, and other associated sequence information. The format is a good way to get information from the NCBI databases at http://www.ncbi.nlm.nih.gov/.
In this example we'll show how to query the NCBI databases, to retrieve the records from the query, and then parse them using Bio.SeqIO, something touched on in Section 5.3.1. For simplicity, this example does not take advantage of the WebEnv history feature; see Section 9.15 for this.
First, we want to make a query and find out the ids of the records to retrieve. Here we'll do a quick search for one of our favorite organisms, Opuntia (prickly-pear cacti). We can do a quick search and get back the GIs (GenBank identifiers) for all of the corresponding records. First we check how many records there are:
>>> handle = Entrez.egquery(term="Opuntia AND rpl16")
>>> record = Entrez.read(handle)
>>> for row in record["eGQueryResult"]:
...     if row["DbName"] == "nuccore":
...         print(row["Count"])
...
9
Now we download the list of GenBank identifiers:
>>> handle = Entrez.esearch(db="nuccore", term="Opuntia AND rpl16")
>>> record = Entrez.read(handle)
>>> gi_list = record["IdList"]
>>> gi_list
['57240072', '57240071', '6273287', '6273291', '6273290', '6273289', '6273286',
'6273285', '6273284']
Now we use these GIs to download the GenBank records. Note that with older versions of Biopython you had to supply a comma-separated list of GI numbers to Entrez; as of Biopython 1.59 you can pass a list and this is converted for you:
>>> gi_str = ",".join(gi_list)
>>> handle = Entrez.efetch(db="nuccore", id=gi_str, rettype="gb", retmode="text")
If you want to look at the raw GenBank files, you can read from this handle and print out the result:
>>> text = handle.read()
>>> print(text)
LOCUS       AY851612                 892 bp    DNA     linear   PLN 10-APR-2007
DEFINITION  Opuntia subulata rpl16 gene, intron; chloroplast.
ACCESSION   AY851612
VERSION     AY851612.1  GI:57240072
KEYWORDS    .
SOURCE      chloroplast Austrocylindropuntia subulata
  ORGANISM  Austrocylindropuntia subulata
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
            Caryophyllales; Cactaceae; Opuntioideae; Austrocylindropuntia.
  AUTHORS   Butterworth,C.A. and Wallace,R.S.
...
In this case, we are just getting the raw records. To get the records in a more Python-friendly form, we can use Bio.SeqIO to parse the GenBank data into SeqRecord objects, including SeqFeature objects (see Chapter 5):
>>> from Bio import SeqIO
>>> handle = Entrez.efetch(db="nuccore", id=gi_str, rettype="gb", retmode="text")
>>> records = SeqIO.parse(handle, "gb")
We can now step through the records and look at the information we are interested in:
>>> for record in records:
...     print("%s, length %i, with %i features"
...           % (record.name, len(record), len(record.features)))
AY851612, length 892, with 3 features
AY851611, length 881, with 3 features
AF191661, length 895, with 3 features
Using this automated query and retrieval functionality is a big plus over doing things by hand. Although the module should obey the NCBI's max three queries per second rule, the NCBI have other recommendations like avoiding peak hours. See Section 9.1. In particular, please note that for simplicity, this example does not use the WebEnv history feature. You should use this for any non-trivial search and download work, see Section 9.15.
Finally, if you plan to repeat your analysis, rather than downloading the files from the NCBI and parsing them immediately (as shown in this example), you should just download the records once and save them to your hard disk, and then parse the local file.
9.14.4 Finding the lineage of an organism
Staying with a plant example, let's now find the lineage of the Cypripedioideae orchid family. First, we search the Taxonomy database for Cypripedioideae, which yields exactly one NCBI taxonomy identifier:
>>> handle = Entrez.esearch(db="Taxonomy", term="Cypripedioideae")
>>> record = Entrez.read(handle)
>>> record["IdList"]
['158330']
>>> record["IdList"][0]
'158330'
Now, we use efetch to download this entry in the Taxonomy database, and then parse it:
>>> handle = Entrez.efetch(db="Taxonomy", id="158330", retmode="xml")
>>> records = Entrez.read(handle)
Again, this record stores lots of information:
>>> records[0].keys()
[u'Lineage', u'Division', u'ParentTaxId', u'PubDate', u'LineageEx',
u'CreateDate', u'TaxId', u'Rank', u'GeneticCode', u'ScientificName',
u'MitoGeneticCode', u'UpdateDate']
We can get the lineage directly from this record:
>>> records[0]["Lineage"]
'cellular organisms; Eukaryota; Viridiplantae; Streptophyta; Streptophytina;
Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliophyta;
Liliopsida; Asparagales; Orchidaceae'
The record data contains much more than just the information shown here. For example, look under "LineageEx" instead of "Lineage" and you'll get the NCBI taxon identifiers of the lineage entries too.
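As a rough sketch of what working with "LineageEx" might look like, the snippet below pairs each lineage name with its taxon identifier. It assumes that each "LineageEx" entry is a dictionary with "ScientificName" and "TaxId" keys; the taxon dictionary is a hand-written, truncated stand-in for records[0], not actual Entrez output.

```python
# Hand-written stand-in for records[0]; real data would come from Entrez.read.
# Only the first three lineage entries are included, for illustration.
taxon = {
    "ScientificName": "Cypripedioideae",
    "LineageEx": [
        {"TaxId": "131567", "ScientificName": "cellular organisms"},
        {"TaxId": "2759", "ScientificName": "Eukaryota"},
        {"TaxId": "33090", "ScientificName": "Viridiplantae"},
    ],
}

# Pair each lineage name with its NCBI taxon identifier, in lineage order.
lineage_ids = [
    (entry["ScientificName"], entry["TaxId"]) for entry in taxon["LineageEx"]
]
for name, tax_id in lineage_ids:
    print("%s\t%s" % (tax_id, name))
```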
9.15 Using the history and WebEnv
Often you will want to make a series of linked queries. Most typically, running a search, perhaps refining the search, and then retrieving detailed search results. You can do this by making a series of separate calls to Entrez. However, the NCBI prefer you to take advantage of their history support, for example combining ESearch and EFetch.
Another typical use of the history support would be to combine EPost and EFetch. You use EPost to upload a list of identifiers, which starts a new history session. You then download the records with EFetch by referring to the session (instead of the identifiers).
9.15.1 Searching for and downloading sequences using the history
Suppose we want to search and download all the Opuntia rpl16 nucleotide sequences, and store them in a FASTA file. As shown in Section 9.14.3, we can naively combine Bio.Entrez.esearch() to get a list of Accession numbers, and then call Bio.Entrez.efetch() to download them all.
However, the approved approach is to run the search with the history feature. Then, we can fetch the results by reference to the search results, which the NCBI can anticipate and cache.
To do this, call Bio.Entrez.esearch() as normal, but with the additional argument of usehistory="y",
>>> from Bio import Entrez
>>> Entrez.email = "history.user@example.com"
>>> search_handle = Entrez.esearch(db="nucleotide", term="Opuntia[orgn] and rpl16",
...                                usehistory="y", idtype="acc")
>>> search_results = Entrez.read(search_handle)
>>> search_handle.close()
When you get the XML output back, it will still include the usual search results:
>>> acc_list = search_results["IdList"]
>>> count = int(search_results["Count"])
>>> assert count == len(acc_list)
However, you also get given two additional pieces of information, the WebEnv session cookie, and the QueryKey:
>>> webenv = search_results["WebEnv"]
>>> query_key = search_results["QueryKey"]
Having stored these values in variables webenv and query_key we can use them as parameters to Bio.Entrez.efetch() instead of giving the accession numbers as identifiers.
While for small searches you might be OK downloading everything at once, it is better to download in batches. You use the retstart and retmax parameters to specify which range of search results you want returned (starting entry using zero-based counting, and maximum number of results to return). Sometimes you will get intermittent errors from Entrez (HTTPError 5XX); we use a try/except block that pauses and retries to address this. For example,
# This assumes you have already run a search as shown above,
# and set the variables count, webenv, query_key
import time

try:
    from urllib.error import HTTPError  # for Python 3
except ImportError:
    from urllib2 import HTTPError  # for Python 2

batch_size = 3
out_handle = open("orchid_rpl16.fasta", "w")
for start in range(0, count, batch_size):
    end = min(count, start + batch_size)
    print("Going to download record %i to %i" % (start + 1, end))
    attempt = 0
    while attempt < 3:
        attempt += 1
        try:
            fetch_handle = Entrez.efetch(db="nucleotide",
                                         rettype="fasta", retmode="text",
                                         retstart=start, retmax=batch_size,
                                         webenv=webenv, query_key=query_key,
                                         idtype="acc")
            break  # success, so stop retrying this batch
        except HTTPError as err:
            # Retry only on server-side (5XX) errors; give up after 3 attempts
            if 500 <= err.code <= 599 and attempt < 3:
                print("Received error from server %s" % err)
                print("Attempt %i of 3" % attempt)
                time.sleep(15)
            else:
                raise
    data = fetch_handle.read()
    fetch_handle.close()
    out_handle.write(data)
out_handle.close()
For illustrative purposes, this example downloaded the FASTA records in batches of three. Unless you are downloading genomes or chromosomes, you would normally pick a larger batch size.
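The two ideas in this script, splitting a result set into retstart/retmax windows and retrying a failed request a limited number of times, can be tried out without touching the network. The following is only a sketch under stated assumptions: batch_ranges, fetch_with_retry and flaky_fetch are hypothetical helper names, and the simulated failure uses IOError as a stand-in for an HTTP 5XX error.

```python
import time


def batch_ranges(count, batch_size):
    """Yield (start, end) pairs covering records 0 .. count-1 in batches."""
    for start in range(0, count, batch_size):
        yield start, min(count, start + batch_size)


def fetch_with_retry(fetch, attempts=3, pause=0.0, retry_on=(IOError,)):
    """Call fetch() until it succeeds, pausing between failed attempts."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts, let the caller see the error
            time.sleep(pause)


# Demonstration with a stand-in fetch function that fails once, then succeeds.
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] == 1:
        raise IOError("simulated server error")
    return ">seq1\nACGT\n"


ranges = list(batch_ranges(7, 3))
data = fetch_with_retry(flaky_fetch)
print(ranges)
print(calls["n"])
```

In real use the fetch callable would wrap the Entrez.efetch call for one batch, the pause would be several seconds, and you would still want to inspect the HTTP status code so that only 5XX errors are retried, as in the script above.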
9.15.2 Searching for and downloading abstracts using the history
Here is another history example, searching for papers published in the last year about the Opuntia, and then downloading them into a file in Medline format:
from Bio import Entrez
import time

try:
    from urllib.error import HTTPError  # for Python 3
except ImportError:
    from urllib2 import HTTPError  # for Python 2

Entrez.email = "history.user@example.com"
search_results = Entrez.read(Entrez.esearch(db="pubmed",
                                            term="Opuntia[ORGN]",
                                            reldate=365, datetype="pdat",
                                            usehistory="y"))
count = int(search_results["Count"])
print("Found %i results" % count)

batch_size = 10
out_handle = open("recent_orchid_papers.txt", "w")
for start in range(0, count, batch_size):
    end = min(count, start + batch_size)
    print("Going to download record %i to %i" % (start + 1, end))
    attempt = 1
    while attempt <= 3:
        try:
            fetch_handle = Entrez.efetch(db="pubmed", rettype="medline",
                                         retmode="text", retstart=start,
                                         retmax=batch_size,
                                         webenv=search_results["WebEnv"],
                                         query_key=search_results["QueryKey"])
            break  # success, so stop retrying this batch
        except HTTPError as err:
            # Retry only on server-side (5XX) errors; give up after 3 attempts
            if 500 <= err.code <= 599 and attempt < 3:
                print("Received error from server %s" % err)
                print("Attempt %i of 3" % attempt)
                attempt += 1
                time.sleep(15)
            else:
                raise
    data = fetch_handle.read()
    fetch_handle.close()
    out_handle.write(data)
out_handle.close()
At the time of writing, this gave 28 matches, but because this is a date-dependent search, this will of course vary. As described in Section 9.12.1 above, you can then use Bio.Medline to parse the saved records.
9.15.3 Searching for citations
Back in Section 9.7 we mentioned ELink can be used to search for citations of a given paper. Unfortunately this only covers journals indexed for PubMed Central (doing it for all the journals in PubMed would mean a lot more work for the NIH). Let's try this for the Biopython PDB parser paper, PubMed ID 14630660:
>>> pmid = "14630660"
>>> results = Entrez.read(Entrez.elink(dbfrom="pubmed", db="pmc",
...                                    LinkName="pubmed_pmc_refs", id=pmid))
>>> pmc_ids = [link["Id"] for link in results[0]["LinkSetDb"][0]["Link"]]
>>> pmc_ids
['2744707', '2705363', '2682512', ..., '1190160']
Great, eleven articles. But why hasn't the Biopython application note been found (PubMed ID 19304878)? Well, as you might have guessed from the variable names, these are not actually PubMed IDs, but PubMed Central IDs. Our application note is the third citing paper in that list, PMCID 2682512.
So, what if (like me) you'd rather get back a list of PubMed IDs? Well we can call ELink again to translate them. This becomes a two step process, so by now you should expect to use the history feature to accomplish it (Section 9.15).
But first, taking the more straightforward approach of making a second (separate) call to ELink:
>>> results2 = Entrez.read(Entrez.elink(dbfrom="pmc", db="pubmed", LinkName="pmc_pubmed",
...                                     id=",".join(pmc_ids)))
>>> pubmed_ids = [link["Id"] for link in results2[0]["LinkSetDb"][0]["Link"]]
>>> pubmed_ids
['19698094', '19450287', '19304878', ..., '15985178']
This time you can immediately spot the Biopython application note as the third hit (PubMed ID 19304878).
Now, let's do that all again but with the history... TODO.
And finally, don't forget to include your own email address in the Entrez calls.
Chapter 10 Swiss-Prot and ExPASy
10.1 Parsing Swiss-Prot files
Swiss-Prot (http://www.expasy.org/sprot) is a hand-curated database of protein sequences. Biopython can parse the "plain text" Swiss-Prot file format, which is still used for the UniProt Knowledgebase which combined Swiss-Prot, TrEMBL and PIR-PSD. We do not (yet) support the UniProtKB XML file format.
10.1.1 Parsing Swiss-Prot records
In Section 5.3.2, we described how to extract the sequence of a Swiss-Prot record as a SeqRecord object. Alternatively, you can store the Swiss-Prot record in a Bio.SwissProt.Record object, which in fact stores the complete information contained in the Swiss-Prot record. In this section, we describe how to extract Bio.SwissProt.Record objects from a Swiss-Prot file.
To parse a Swiss-Prot record, we first get a handle to a Swiss-Prot record. There are several ways to do so, depending on where and how the Swiss-Prot record is stored:
Open a Swiss-Prot file locally:
>>> handle = open("myswissprotfile.dat")
Open a gzipped Swiss-Prot file:
>>> import gzip
>>> handle = gzip.open("myswissprotfile.dat.gz", "rt")
Open a Swiss-Prot file over the internet:
>>> from urllib.request import urlopen  # on Python 2: from urllib import urlopen
>>> handle = urlopen("http://www.somelocation.org/data/someswissprotfile.dat")
Open a Swiss-Prot file over the internet from the ExPASy database (see section 10.5.1):
>>> from Bio import ExPASy
>>> handle = ExPASy.get_sprot_raw(myaccessionnumber)
The key point is that for the parser, it doesn't matter how the handle was created, as long as it points to data in the Swiss-Prot format.
We can use Bio.SeqIO as described in Section 5.3.2 to get file format agnostic SeqRecord objects. Alternatively, we can use Bio.SwissProt to get Bio.SwissProt.Record objects, which are a much closer match to the underlying file format.
To read one Swiss-Prot record from the handle, we use the function read():
>>> from Bio import SwissProt
>>> record = SwissProt.read(handle)
This function should be used if the handle points to exactly one Swiss-Prot record. It raises a ValueError if no Swiss-Prot record was found, and also if more than one record was found.
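The "exactly one record" contract can be pictured as a thin layer over a record iterator. The sketch below illustrates that idea in plain Python 3; read_exactly_one is a hypothetical helper, and this is not claimed to be how Bio.SwissProt implements read().

```python
def read_exactly_one(records):
    """Return the single item from an iterable, else raise ValueError."""
    iterator = iter(records)
    try:
        first = next(iterator)
    except StopIteration:
        raise ValueError("No records found in handle")
    try:
        next(iterator)
    except StopIteration:
        # Nothing after the first record, so there was exactly one.
        return first
    raise ValueError("More than one record found in handle")


print(read_exactly_one(iter(["only record"])))
for bad in ([], ["one", "two"]):
    try:
        read_exactly_one(iter(bad))
    except ValueError as err:
        print("ValueError:", err)
```

The practical consequence is the same as described above: use read() when you know the handle holds a single record, and parse() whenever there may be zero or several.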
We can now print out some information about this record:
>>> print(record.description)
'RecName: Full=Chalcone synthase 3; EC=2.3.1.74; AltName: Full=Naringenin-chalcone synthase 3;'
>>> for ref in record.references:
...     print("authors:", ref.authors)
...     print("title:", ref.title)
...
authors: Liew C.F., Lim S.H., Loh C.S., Goh C.J.;
title: "Molecular cloning and sequence analysis of chalcone synthase cDNAs of
Bromheadia finlaysoniana.";
>>> print(record.organism_classification)
['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', ..., 'Bromheadia']
To parse a file that contains more than one Swiss-Prot record, we use the parse function instead. This function allows us to iterate over the records in the file.
For example, let's parse the full Swiss-Prot database and collect all the descriptions. You can download this from the ExPASy FTP site as a single gzipped file uniprot_sprot.dat.gz (about 300MB). This is a compressed file containing a single file, uniprot_sprot.dat (over 1.5GB).
As described at the start of this section, you can use the Python library gzip to open and uncompress a .gz file, like this:
>>> import gzip
>>> handle = gzip.open("uniprot_sprot.dat.gz", "rt")
However, uncompressing a large file takes time, and each time you open the file for reading in this way, it has to be decompressed on the fly. So, if you can spare the disk space you'll save time in the long run if you first decompress the file to disk, to get the uniprot_sprot.dat file inside. Then you can open the file for reading as usual:
>>> handle = open("uniprot_sprot.dat")
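If you would rather do the one-off decompression from Python than with an external tool, something along these lines should work. It is a sketch using only the standard library; decompress_gzip is a hypothetical helper name, and the demonstration runs on a tiny temporary file rather than the real database.

```python
import gzip
import os
import shutil
import tempfile


def decompress_gzip(gz_path, out_path):
    """Stream a .gz file to an uncompressed copy on disk."""
    with gzip.open(gz_path, "rb") as source, open(out_path, "wb") as target:
        # copyfileobj works in chunks, so the whole file is never held in memory.
        shutil.copyfileobj(source, target)
    return out_path


# Small self-contained demonstration using a temporary directory.
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "example.dat.gz")
with gzip.open(gz_path, "wb") as handle:
    handle.write(b"ID   EXAMPLE\n//\n")

out_path = decompress_gzip(gz_path, os.path.join(tmp_dir, "example.dat"))
with open(out_path, "rb") as handle:
    restored = handle.read()
print(restored)
```

For the case discussed here the call would be decompress_gzip("uniprot_sprot.dat.gz", "uniprot_sprot.dat"), run once before any parsing.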
As of June 2009, the full Swiss-Prot database downloaded from ExPASy contained 468851 Swiss-Prot records. One concise way to build up a list of the record descriptions is with a list comprehension:
>>> descriptions = [record.description for record in SwissProt.parse(handle)]
>>> len(descriptions)
468851
>>> descriptions[:5]
['RecName: Full=Protein MGF 100-1R;',
'RecName: Full=Protein MGF 100-1R;',
'RecName: Full=Protein MGF 100-1R;',
'RecName: Full=Protein MGF 100-1R;',
'RecName: Full=Protein MGF 100-2L;']
Or, using a for loop over the record iterator:
>>> descriptions = []
>>> for record in SwissProt.parse(handle):
...     descriptions.append(record.description)
...
>>> len(descriptions)
468851
Because this is such a large input file, either way takes about eleven minutes on my new desktop computer (using the uncompressed uniprot_sprot.dat file as input).
It is equally easy to extract any kind of information you'd like from Swiss-Prot records. To see the members of a Swiss-Prot record, use
>>> dir(record)
['__doc__', '__init__', '__module__', 'accessions', 'annotation_update',
'comments', 'created', 'cross_references', 'data_class', 'description',
'entry_name', 'features', 'gene_name', 'host_organism', 'keywords',
'molecule_type', 'organelle', 'organism', 'organism_classification',
'references', 'seqinfo', 'sequence', 'sequence_length',
'sequence_update', 'taxonomy_id']
10.1.2 Parsing the Swiss-Prot keyword and category list
Swiss-Prot also distributes a file keywlist.txt, which lists the keywords and categories used in Swiss-Prot. The file contains entries in the following form:
ID   2Fe-2S.
AC   KW-0001
DE   Protein which contains at least one 2Fe-2S iron-sulfur cluster: 2 iron
DE   atoms complexed to 2 inorganic sulfides and 4 sulfur atoms of
DE   cysteines from the protein.
SY   Fe2S2; [2Fe-2S] cluster; [Fe2S2] cluster; Fe2/S2 (inorganic) cluster;
SY   Di-mu-sulfido-diiron; 2 iron, 2 sulfur cluster binding.
GO   GO:0051537; 2 iron, 2 sulfur cluster binding
HI   Ligand: Iron; Iron-sulfur; 2Fe-2S.
HI   Ligand: Metal-binding; 2Fe-2S.
CA   Ligand.
//
ID   3D-structure.
AC   KW-0002
DE   Protein, or part of a protein, whose three-dimensional structure has
DE   been resolved experimentally (for example by X-ray crystallography or
DE   NMR spectroscopy) and whose coordinates are available in the PDB
DE   database. Can also be used for theoretical models.
HI   Technical term: 3D-structure.
CA   Technical term.
//
ID   3Fe-4S.
...
The entries in this file can be parsed by the parse function in the Bio.SwissProt.KeyWList module. Each entry is then stored as a Bio.SwissProt.KeyWList.Record, which is a Python dictionary.
>>> from Bio.SwissProt import KeyWList
>>> handle = open("keywlist.txt")
>>> records = KeyWList.parse(handle)
>>> for record in records:
...     print(record['ID'])
...     print(record['DE'])
This prints
2Fe-2S.
Protein which contains at least one 2Fe-2S iron-sulfur cluster: 2 iron atoms
complexed to 2 inorganic sulfides and 4 sulfur atoms of cysteines from the
protein.
...
10.2 Parsing Prosite records
Prosite is a database containing protein domains, protein families, functional sites, as well as the patterns and profiles to recognize them. Prosite was developed in parallel with Swiss-Prot. In Biopython, a Prosite record is represented by the Bio.ExPASy.Prosite.Record class, whose members correspond to the different fields in a Prosite record.
In general, a Prosite file can contain more than one Prosite record. For example, the full set of Prosite records, which can be downloaded as a single file (prosite.dat) from the ExPASy FTP site, contains 2073 records (version 20.24 released on 4 December 2007). To parse such a file, we again make use of an iterator:
>>> from Bio.ExPASy import Prosite
>>> handle = open("myprositefile.dat")
>>> records = Prosite.parse(handle)
We can now take the records one at a time and print out some information. For example, using the file containing the complete Prosite database, we'd find
>>> handle = open("prosite.dat")
>>> records = Prosite.parse(handle)
>>> record = next(records)
>>> record.accession
'PS00001'
>>> record.name
'ASN_GLYCOSYLATION'
>>> record.pdoc
'PDOC00001'
>>> record = next(records)
>>> record.accession
'PS00004'
>>> record.name
'CAMP_PHOSPHO_SITE'
>>> record.pdoc
'PDOC00004'
>>> record = next(records)
>>> record.accession
'PS00005'
>>> record.name
'PKC_PHOSPHO_SITE'
>>> record.pdoc
'PDOC00005'
and so on. If you're interested in how many Prosite records there are, you could use
>>> handle = open("prosite.dat")
>>> records = Prosite.parse(handle)
>>> n = 0
>>> for record in records: n += 1
...
>>> n
2073
To read exactly one Prosite record from the handle, you can use the read function:
>>> handle = open("mysingleprositerecord.dat")
>>> record = Prosite.read(handle)
This function raises a ValueError if no Prosite record is found, and also if more than one Prosite record is found.
10.3 Parsing Prosite documentation records
In the Prosite example above, the record.pdoc accession numbers 'PDOC00001', 'PDOC00004', 'PDOC00005' and so on refer to Prosite documentation. The Prosite documentation records are available from ExPASy as individual files, and as one file (prosite.doc) containing all Prosite documentation records.
We use the parser in Bio.ExPASy.Prodoc to parse Prosite documentation records. For example, to create a list of all accession numbers of Prosite documentation records, you can use
>>> from Bio.ExPASy import Prodoc
>>> handle = open("prosite.doc")
>>> records = Prodoc.parse(handle)
>>> accessions = [record.accession for record in records]
Again a read() function is provided to read exactly one Prosite documentation record from the handle.
10.4 Parsing Enzyme records
ExPASy's Enzyme database is a repository of information on enzyme nomenclature. A typical Enzyme record looks as follows:
ID   3.1.1.34
DE   Lipoprotein lipase.
AN   Clearing factor lipase.
AN   Diacylglycerol lipase.
AN   Diglyceride lipase.
CA   Triacylglycerol + H(2)O = diacylglycerol + a carboxylate.
CC   -!- Hydrolyzes triacylglycerols in chylomicrons and very low-density
CC       lipoproteins (VLDL).
CC   -!- Also hydrolyzes diacylglycerol.
PR   PROSITE; PDOC00110;
DR   P11151, LIPL_BOVIN;  P11153, LIPL_CAVPO;  P11602, LIPL_CHICK;
DR   P55031, LIPL_FELCA;  P06858, LIPL_HUMAN;  P11152, LIPL_MOUSE;
DR   O46647, LIPL_MUSVI;  P49060, LIPL_PAPAN;  P49923, LIPL_PIG;
DR   Q06000, LIPL_RAT;  Q29524, LIPL_SHEEP;
//
In this example, the first line shows the EC (Enzyme Commission) number of lipoprotein lipase (second line). Alternative names of lipoprotein lipase are "clearing factor lipase", "diacylglycerol lipase", and "diglyceride lipase" (lines 3 through 5). The line starting with "CA" shows the catalytic activity of this enzyme. Comment lines start with "CC". The "PR" line shows references to the Prosite Documentation records, and the "DR" lines show references to Swiss-Prot records. Not all of these entries are necessarily present in an Enzyme record.
In Biopython, an Enzyme record is represented by the Bio.ExPASy.Enzyme.Record class. This record derives from a Python dictionary and has keys corresponding to the two-letter codes used in Enzyme files. To read an Enzyme file containing one Enzyme record, use the read function in Bio.ExPASy.Enzyme:
>>> from Bio.ExPASy import Enzyme
>>> with open("lipoprotein.txt") as handle:
...     record = Enzyme.read(handle)
...
>>> record["ID"]
'3.1.1.34'
>>> record["DE"]
'Lipoprotein lipase.'
>>> record["AN"]
['Clearing factor lipase.', 'Diacylglycerol lipase.', 'Diglyceride lipase.']
>>> record["CA"]
'Triacylglycerol + H(2)O = diacylglycerol + a carboxylate.'
>>> record["PR"]
['PDOC00110']
>>> record["CC"]
['Hydrolyzes triacylglycerols in chylomicrons and very low-density lipoproteins
(VLDL).', 'Also hydrolyzes diacylglycerol.']
>>> record["DR"]
[['P11151', 'LIPL_BOVIN'], ['P11153', 'LIPL_CAVPO'], ['P11602', 'LIPL_CHICK'],
['P55031', 'LIPL_FELCA'], ['P06858', 'LIPL_HUMAN'], ['P11152', 'LIPL_MOUSE'],
['O46647', 'LIPL_MUSVI'], ['P49060', 'LIPL_PAPAN'], ['P49923', 'LIPL_PIG'],
['Q06000', 'LIPL_RAT'], ['Q29524', 'LIPL_SHEEP']]
The read function raises a ValueError if no Enzyme record is found, and also if more than one Enzyme record is found.
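To make the "keys correspond to the two-letter codes" idea concrete, here is a minimal plain-Python sketch that groups the lines of one Enzyme-style record by their line code. It is only an illustration: parse_enzyme_text is a hypothetical helper, not Biopython's parser, and unlike Bio.ExPASy.Enzyme it stores every field as a list of raw strings.

```python
def parse_enzyme_text(text):
    """Group the lines of one Enzyme-style record by two-letter line code.

    Hypothetical helper for illustration only; every value is kept as a
    list of raw strings, which is simpler than what Bio.ExPASy.Enzyme does.
    """
    record = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("//"):
            break  # end-of-record marker
        code, value = line[:2], line[2:].strip()
        record.setdefault(code, []).append(value)
    return record


example = """ID   3.1.1.34
DE   Lipoprotein lipase.
AN   Clearing factor lipase.
AN   Diacylglycerol lipase.
PR   PROSITE; PDOC00110;
//
"""

fields = parse_enzyme_text(example)
print(fields["ID"])
print(fields["AN"])
```

Repeated codes such as "AN" naturally become multi-element lists, which mirrors why record["AN"] above is a list.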
The full set of Enzyme records can be downloaded as a single file (enzyme.dat) from the ExPASy FTP site, containing 4877 records (release of 3 March 2009). To parse such a file containing multiple Enzyme records, use the parse function in Bio.ExPASy.Enzyme to obtain an iterator:
>>> from Bio.ExPASy import Enzyme
>>> handle = open("enzyme.dat")
>>> records = Enzyme.parse(handle)
We can now iterate over the records one at a time. For example, we can make a list of all EC numbers for which an Enzyme record is available:
>>> ecnumbers = [record["ID"] for record in records]
10.5 Accessing the ExPASy server
Swiss-Prot, Prosite, and Prosite documentation records can be downloaded from the ExPASy web server at http://www.expasy.org. Six kinds of queries are available from ExPASy:
get_prodoc_entry
To download a Prosite documentation record in HTML format
get_prosite_entry
To download a Prosite record in HTML format
get_prosite_raw
To download a Prosite or Prosite documentation record in raw format
get_sprot_raw
To download a Swiss-Prot record in raw format
sprot_search_ful
To search for a Swiss-Prot record
sprot_search_de
To search for a Swiss-Prot record
To access this web server from a Python script, we use the Bio.ExPASy module.
10.5.1 Retrieving a Swiss-Prot record
Let's say we are looking at chalcone synthases for Orchids (see section 2.3 for some justification for looking for interesting things about orchids). Chalcone synthase is involved in flavonoid biosynthesis in plants, and flavonoids make lots of cool things like pigment colors and UV protectants.
If you do a search on Swiss-Prot, you can find three orchid proteins for Chalcone Synthase, id numbers O23729, O23730, O23731. Now, let's write a script which grabs these, and parses out some interesting information.
First, we grab the records, using the get_sprot_raw() function of Bio.ExPASy. This function is very nice since you can feed it an id and get back a handle to a raw text record (no HTML to mess with!). We can then use Bio.SwissProt.read to pull out the Swiss-Prot record, or Bio.SeqIO.read to get a SeqRecord. The following code accomplishes what I just wrote:
>>> from Bio import ExPASy
>>> from Bio import SwissProt
>>> accessions = ["O23729", "O23730", "O23731"]
>>> records = []
>>> for accession in accessions:
...     handle = ExPASy.get_sprot_raw(accession)
...     record = SwissProt.read(handle)
...     records.append(record)
If the accession number you provided to ExPASy.get_sprot_raw does not exist, then SwissProt.read(handle) will raise a ValueError. You can catch ValueError exceptions to detect invalid accession numbers:
>>> for accession in accessions:
...     handle = ExPASy.get_sprot_raw(accession)
...     try:
...         record = SwissProt.read(handle)
...     except ValueError:
...         print("WARNING: Accession %s not found" % accession)
...     else:
...         records.append(record)
10.5.2 Searching Swiss-Prot
Now, you may remark that I knew the records' accession numbers beforehand. Indeed, get_sprot_raw() needs either the entry name or an accession number. When you don't have them handy, you can use one of the sprot_search_de() or sprot_search_ful() functions.
sprot_search_de() searches in the ID, DE, GN, OS and OG lines; sprot_search_ful() searches in (nearly) all the fields. They are detailed on http://www.expasy.org/cgi-bin/sprot-search-de and http://www.expasy.org/cgi-bin/sprot-search-ful respectively. Note that they don't search in TrEMBL by default (argument trembl). Note also that they return HTML pages; however, accession numbers are quite easily extractable:
>>> from Bio import ExPASy
>>> import re
>>> handle = ExPASy.sprot_search_de("Orchid Chalcone Synthase")
>>> # or:
>>> # handle = ExPASy.sprot_search_ful("Orchid and {Chalcone Synthase}")
>>> html_results = handle.read()
>>> if "Number of sequences found" in html_results:
...     ids = re.findall(r'HREF="/uniprot/(\w+)"', html_results)
... else:
...     ids = re.findall(r'href="/cgi-bin/niceprot\.pl\?(\w+)"', html_results)
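The extraction step can be tried offline. The sketch below applies the first regular expression from the session above to a small, made-up HTML fragment; the fragment and its link text are assumptions for illustration and do not reproduce an actual ExPASy results page.

```python
import re

# Made-up HTML fragment standing in for a search results page.
html_results = (
    "Number of sequences found: 2\n"
    '<a HREF="/uniprot/O23729">entry 1</a>\n'
    '<a HREF="/uniprot/O23730">entry 2</a>\n'
)

# Capture the accession that follows /uniprot/ in each link.
ids = re.findall(r'HREF="/uniprot/(\w+)"', html_results)
print(ids)
```

Because \w+ stops at the closing quote, only the accession itself is captured, so the resulting list can be fed straight back into get_sprot_raw().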
10.5.3 Retrieving Prosite and Prosite documentation records
Prosite and Prosite documentation records can be retrieved either in HTML format, or in raw format. To parse Prosite and Prosite documentation records with Biopython, you should retrieve the records in raw format. For other purposes, however, you may be interested in these records in HTML format.
To retrieve a Prosite or Prosite documentation record in raw format, use get_prosite_raw(). For example, to download a Prosite record and print it out in raw text format, use
>>> from Bio import ExPASy
>>> handle = ExPASy.get_prosite_raw('PS00001')
>>> text = handle.read()
>>> print(text)
To retrieve a Prosite record and parse it into a Bio.ExPASy.Prosite.Record object, use
>>> from Bio.ExPASy import Prosite
>>> handle = ExPASy.get_prosite_raw('PS00001')
>>> record = Prosite.read(handle)
The same function can be used to retrieve a Prosite documentation record and parse it into a Bio.ExPASy.Prodoc.Record object:
>>> from Bio.ExPASy import Prodoc
>>> handle = ExPASy.get_prosite_raw('PDOC00001')
>>> record = Prodoc.read(handle)
For non-existing accession numbers, ExPASy.get_prosite_raw returns a handle to an empty string. When faced with an empty string, Prosite.read and Prodoc.read will raise a ValueError. You can catch these exceptions to detect invalid accession numbers.
The functions get_prosite_entry() and get_prodoc_entry() are used to download Prosite and Prosite documentation records in HTML format. To create a web page showing one Prosite record, you can use
>>> handle = ExPASy.get_prosite_entry('PS00001')
>>> html = handle.read()
>>> with open("myprositerecord.html", "w") as out_handle:
...     out_handle.write(html)
...
and similarly for a Prosite documentation record:
>>> handle = ExPASy.get_prodoc_entry('PDOC00001')
>>> html = handle.read()
>>> with open("myprodocrecord.html", "w") as out_handle:
...     out_handle.write(html)
...
For these functions, an invalid accession number returns an error message in HTML format.
10.6 Scanning the Prosite database
ScanProsite allows you to scan protein sequences online against the Prosite database by providing a UniProt or PDB sequence identifier or the sequence itself. For more information about ScanProsite, please see the ScanProsite documentation as well as the documentation for programmatic access of ScanProsite.
You can use Biopython's Bio.ExPASy.ScanProsite module to scan the Prosite database from Python. This module both helps you to access ScanProsite programmatically, and to parse the results returned by ScanProsite. To scan for Prosite patterns in the following protein sequence:
MEHKEVVLLLLLFLKSGQGEPLDDYVNTQGASLFSVTKKQLGAGSIEECAAKCEEDEEFT
CRAFQYHSKEQQCVIMAENRKSSIIIRMRDVVLFEKKVYLSECKTGNGKNYRGTMSKTKN
you can use the following code:
>>> sequence = (
...     "MEHKEVVLLLLLFLKSGQGEPLDDYVNTQGASLFSVTKKQLGAGSIEECAAKCEEDEEFT"
...     "CRAFQYHSKEQQCVIMAENRKSSIIIRMRDVVLFEKKVYLSECKTGNGKNYRGTMSKTKN"
... )
>>> from Bio.ExPASy import ScanProsite
>>> handle = ScanProsite.scan(seq=sequence)
By executing handle.read(), you can obtain the search results in raw XML format. Instead, let's use Bio.ExPASy.ScanProsite.read to parse the raw XML into a Python object:
>>> result = ScanProsite.read(handle)
>>> type(result)
<class 'Bio.ExPASy.ScanProsite.Record'>
A Bio.ExPASy.ScanProsite.Record object is derived from a list, with each element in the list storing one ScanProsite hit. This object also stores the number of hits, as well as the number of search sequences, as returned by ScanProsite. This ScanProsite search resulted in six hits:
>>> result.n_seq
1
>>> result.n_match
6
>>> len(result)
6
>>> result[0]
{'signature_ac': u'PS50948', 'level': u'0', 'stop': 98, 'sequence_ac': u'USERSEQ1', 'start': 16, 'score': u'8.873'}
>>> result[1]
{'start': 37, 'stop': 39, 'sequence_ac': u'USERSEQ1', 'signature_ac': u'PS00005'}
>>> result[2]
>>> result[3]
>>> result[4]
>>> result[5]
Other ScanProsite parameters can be passed as keyword arguments; see the documentation for programmatic access of ScanProsite for more information. As an example, passing lowscore=1 to include matches with low level scores lets us find one additional hit:
>>> handle = ScanProsite.scan(seq=sequence, lowscore=1)
>>> result = ScanProsite.read(handle)
>>> result.n_match
7
Chapter 11 Going 3D: The PDB module
Bio.PDB is a Biopython module that focuses on working with crystal structures of biological macromolecules. Among other things, Bio.PDB includes a PDBParser class that produces a Structure object, which can be used to access the atomic data in the file in a convenient manner. There is limited support for parsing the information contained in the PDB header.
11.1 Reading and writing crystal structure files
11.1.1 Reading a PDB file
First we create a PDBParser object:
>>> from Bio.PDB.PDBParser import PDBParser
>>> p = PDBParser(PERMISSIVE=1)
The PERMISSIVE flag indicates that a number of common problems (see 11.7.1) associated with PDB files will be ignored (but note that some atoms and/or residues will be missing). If the flag is not present, a PDBConstructionException will be generated if any problems are detected during the parse operation.
The Structure object is then produced by letting the PDBParser object parse a PDB file (the PDB file in this case is called 'pdb1fat.ent', '1fat' is a user defined name for the structure):
>>> structure_id = "1fat"
>>> filename = "pdb1fat.ent"
>>> s = p.get_structure(structure_id, filename)
You can extract the header and trailer (simple lists of strings) of the PDB file from the PDBParser object with the get_header and get_trailer methods. Note however that many PDB files contain headers with incomplete or erroneous information. Many of the errors have been fixed in the equivalent mmCIF files. Hence, if you are interested in the header information, it is a good idea to extract information from mmCIF files using the MMCIF2Dict tool described below, instead of parsing the PDB header.
Now that is clarified, let's return to parsing the PDB header. The structure object has an attribute called header which is a Python dictionary that maps header records to their values.
Example (s is the Structure object parsed above):
>>> resolution = s.header['resolution']
>>> keywords = s.header['keywords']
The available keys are name, head, deposition_date, release_date, structure_method, resolution, structure_reference (which maps to a list of references), journal_reference, author, and compound (which maps to a dictionary with various information about the crystallized compound).
The dictionary can also be created without creating a Structure object, i.e. directly from the PDB file:
>>> from Bio.PDB import parse_pdb_header
>>> with open(filename, 'r') as handle:
...     header_dict = parse_pdb_header(handle)
...
11.1.2 Reading an mmCIF file
Similarly to the case of PDB files, first create an MMCIFParser object:
>>> from Bio.PDB.MMCIFParser import MMCIFParser
>>> parser = MMCIFParser()
Then use this parser to create a structure object from the mmCIF file:
>>> structure = parser.get_structure('1fat', '1fat.cif')
To have some more low level access to an mmCIF file, you can use the MMCIF2Dict class to create a Python dictionary that maps all mmCIF tags in an mmCIF file to their values. If there are multiple values (like in the case of tag _atom_site.Cartn_y, which holds the y coordinates of all atoms), the tag is mapped to a list of values. The dictionary is created from the mmCIF file as follows:
>>> from Bio.PDB.MMCIF2Dict import MMCIF2Dict
>>> mmcif_dict = MMCIF2Dict('1fat.cif')
Example: get the solvent content from an mmCIF file:
>>> sc = mmcif_dict['_exptl_crystal.density_percent_sol']
Example: get the list of the y coordinates of all atoms
>>> y_list = mmcif_dict['_atom_site.Cartn_y']
11.1.3 Reading files in the MMTF format
You can use the direct MMTFParser to read a structure from a file:
>>> from Bio.PDB.mmtf import MMTFParser
>>> structure = MMTFParser.get_structure("PDB/4CUP.mmtf")
Or you can use the same class to get a structure by its PDB ID:
>>> structure = MMTFParser.get_structure_from_url("4CUP")
This gives you a Structure object as if read from a PDB or mmCIF file.
You can also have access to the underlying data using the external MMTF library which Biopython is using internally:
>>> from mmtf import fetch
>>> decoded_data = fetch("4CUP")
For example you can access just the X-coordinate.
>>> print(decoded_data.x_coord_list)
...
11.1.4 Reading files in the PDB XML format
That's not yet supported, but we are definitely planning to support that in the future (it's not a lot of work). Contact the Biopython developers (biopython-dev@biopython.org) if you need this.
11.1.5 Writing PDB files
Use the PDBIO class for this. It's easy to write out specific parts of a structure too, of course.
Example: saving a structure
>>> from Bio.PDB.PDBIO import PDBIO
>>> io = PDBIO()
>>> io.set_structure(s)
>>> io.save('out.pdb')
If you want to write out a part of the structure, make use of the Select class (also in PDBIO). Select has four methods:
accept_model(model)
accept_chain(chain)
accept_residue(residue)
accept_atom(atom)
By default, every method returns 1 (which means the model/chain/residue/atom is included in the output). By subclassing Select and returning 0 when appropriate you can exclude models, chains, etc. from the output. Cumbersome maybe, but very powerful. The following code only writes out glycine residues:
>>> from Bio.PDB.PDBIO import PDBIO, Select
>>> class GlySelect(Select):
...     def accept_residue(self, residue):
...         if residue.get_resname() == 'GLY':
...             return True
...         else:
...             return False
...
>>> io = PDBIO()
>>> io.set_structure(s)
>>> io.save('gly_only.pdb', GlySelect())
If this is all too complicated for you, the Dice module contains a handy extract function that writes out all residues in a chain between a start and end residue.
11.2 Structure representation
The overall layout of a Structure object follows the so-called SMCRA (Structure/Model/Chain/Residue/Atom) architecture:
A structure consists of models
A model consists of chains
A chain consists of residues
A residue consists of atoms
This is the way many structural biologists/bioinformaticians think about structure, and provides a simple but efficient way to deal with structure. Additional stuff is essentially added when needed. A UML diagram of the Structure object (forget about the Disordered classes for now) is shown in Fig. 11.1. Such a data structure is not necessarily best suited for the representation of the macromolecular content of a structure, but it is absolutely necessary for a good interpretation of the data present in a file that describes the structure (typically a PDB or MMCIF file). If this hierarchy cannot represent the contents of a structure file, it is fairly certain that the file contains an error or at least does not describe the structure unambiguously. If a SMCRA data structure cannot be generated, there is reason to suspect a problem. Parsing a PDB file can thus be used to detect likely problems. We will give several examples of this in section 11.7.1.
Figure 11.1: UML diagram of SMCRA architecture of the Structure class used to represent a macromolecular structure. Full lines with diamonds denote aggregation, full lines with arrows denote referencing, full lines with triangles denote inheritance and dashed lines with triangles denote interface realization.
Structure, Model, Chain and Residue are all subclasses of the Entity base class. The Atom class only (partly) implements the Entity interface (because an Atom does not have children).
For each Entity subclass, you can extract a child by using a unique id for that child as a key (e.g. you can extract an Atom object from a Residue object by using an atom name string as a key, you can extract a Chain object from a Model object by using its chain identifier as a key).
Disordered atoms and residues are represented by DisorderedAtom and DisorderedResidue classes, which are both subclasses of the DisorderedEntityWrapper base class. They hide the complexity associated with disorder and behave exactly as Atom and Residue objects.
In general, a child Entity object (i.e. Atom, Residue, Chain, Model) can be extracted from its parent (i.e. Residue, Chain, Model, Structure, respectively) by using an id as a key.
>>> child_entity = parent_entity[child_id]
You can also get a list of all child Entities of a parent Entity object. Note that this list is sorted in a specific way (e.g. according to chain identifier for Chain objects in a Model object).
>>> child_list = parent_entity.get_list()
You can also get the parent from a child:
>>> parent_entity = child_entity.get_parent()
At all levels of the SMCRA hierarchy, you can also extract a full id. The full id is a tuple containing all ids starting from the top object (Structure) down to the current object. A full id for a Residue object, for example, is something like:
>>> full_id = residue.get_full_id()
>>> print(full_id)
("1abc", 0, "A", (" ", 10, "A"))
This corresponds to:
The Structure with id "1abc"
The Model with id 0
The Chain with id "A"
The Residue with id (" ", 10, "A").
The Residue id indicates that the residue is not a hetero-residue (nor a water) because it has a blank hetero field, that its sequence identifier is 10 and that its insertion code is "A".
To get the entity's id, use the get_id method:
>>> entity.get_id()
You can check if the entity has a child with a given id by using the has_id method:
>>> entity.has_id(entity_id)
The length of an entity is equal to its number of children:
>>> nr_children = len(entity)
It is possible to delete, rename, add, etc. child entities from a parent entity, but this does not include any sanity checks (e.g. it is possible to add two residues with the same id to one chain). This really should be done via a nice Decorator class that includes integrity checking, but you can take a look at the code (Entity.py) if you want to use the raw interface.
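The parent/child behaviour described in this section can be mimicked with a few lines of plain Python. The ToyEntity class below is a hypothetical stand-in written for illustration; it is not Bio.PDB's Entity class, and its duplicate-id check is a choice made for this sketch rather than something the real raw interface does.

```python
class ToyEntity:
    """Hypothetical SMCRA node for illustration; not Bio.PDB's Entity."""

    def __init__(self, entity_id):
        self.id = entity_id
        self.parent = None
        self.children = {}  # child id -> child entity

    def add(self, child):
        # Integrity check added in this sketch only.
        if child.id in self.children:
            raise ValueError("duplicate child id: %r" % (child.id,))
        child.parent = self
        self.children[child.id] = child

    def __getitem__(self, child_id):
        return self.children[child_id]

    def __len__(self):
        return len(self.children)

    def get_full_id(self):
        # Walk up to the root, then report the ids from the top down.
        ids = []
        node = self
        while node is not None:
            ids.append(node.id)
            node = node.parent
        return tuple(reversed(ids))


structure = ToyEntity("1abc")
model = ToyEntity(0)
chain = ToyEntity("A")
residue = ToyEntity((" ", 10, "A"))
structure.add(model)
model.add(chain)
chain.add(residue)

print(structure[0]["A"][(" ", 10, "A")].get_full_id())
print(len(chain))
```

Keying children by id is what makes chained lookups such as structure[0]["A"] possible, and the full id falls out of following the parent links back to the top.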
11.2.1 Structure
The Structure object is at the top of the hierarchy. Its id is a user given string. The Structure contains a number of Model children. Most crystal structures (but not all) contain a single model, while NMR structures typically consist of several models. Disorder in crystal structures of large parts of molecules can also result in several models.
11.2.2 Model
The id of the Model object is an integer, which is derived from the position of the model in the parsed file (they are automatically numbered starting from 0). Crystal structures generally have only one model (with id 0), while NMR files usually have several models. Whereas many PDB parsers assume that there is only one model, the Structure class in Bio.PDB is designed such that it can easily handle PDB files with more than one model.
As an example, to get the first model from a Structure object, use
>>> first_model = structure[0]
The Model object stores a list of Chain children.
11.2.3 Chain
The id of a Chain object is derived from the chain identifier in the PDB/mmCIF file, and is a single character (typically a letter). Each Chain in a Model object has a unique id. As an example, to get the Chain object with identifier "A" from a Model object, use
>>> chain_A = model["A"]
The Chain object stores a list of Residue children.
11.2.4 Residue
A residue id is a tuple with three elements:
The hetero-field (hetfield): this is
  'W' in the case of a water molecule;
  'H_' followed by the residue name for other hetero residues (e.g. 'H_GLC' in the case of a glucose molecule);
  blank for standard amino and nucleic acids.
  This scheme is adopted for reasons described in section 11.4.1.
The sequence identifier (resseq), an integer describing the position of the residue in the chain (e.g., 100);
The insertion code (icode); a string, e.g. 'A'. The insertion code is sometimes used to preserve a certain desirable residue numbering scheme. A Ser 80 insertion mutant (inserted e.g. between a Thr 80 and an Asn 81 residue) could e.g. have sequence identifiers and insertion codes as follows: Thr 80 A, Ser 80 B, Asn 81. In this way the residue numbering scheme stays in tune with that of the wild type structure.
The id of the above glucose residue would thus be ('H_GLC', 100, 'A'). If the hetero-flag and insertion code are blank, the sequence identifier alone can be used:
# Full id
>>> residue = chain[(' ', 100, ' ')]
# Shortcut id
>>> residue = chain[100]
The reason for the hetero-flag is that many, many PDB files use the same sequence identifier for an amino acid and a hetero-residue or a water, which would create obvious problems if the hetero-flag was not used.
Unsurprisingly, a Residue object stores a set of Atom children. It also contains a string that specifies the residue name (e.g. "ASN") and the segment identifier of the residue (well known to X-PLOR users, but not used in the construction of the SMCRA data structure).
Let's look at some examples. Asn 10 with a blank insertion code would have residue id (' ', 10, ' '). Water 10 would have residue id ('W', 10, ' '). A glucose molecule (a hetero residue with residue name GLC) with sequence identifier 10 would have residue id ('H_GLC', 10, ' '). In this way, the three residues (with the same insertion code and sequence identifier) can be part of the same chain because their residue ids are distinct.
In most cases, the hetflag and insertion code fields will be blank, e.g. (' ', 10, ' '). In these cases, the sequence identifier can be used as a shortcut for the full id:
# use full id
>>> res10 = chain[(' ', 10, ' ')]
# use shortcut
>>> res10 = chain[10]
Each Residue object in a Chain object should have a unique id. However, disordered residues are dealt with in a special way, as described in section 11.3.3.
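As a small illustration of why the hetero-field keeps ids distinct, the following sketch builds the three example ids with a hypothetical helper, make_residue_id, which is not part of Bio.PDB. It assumes a single space for blank fields, as in the examples above.

```python
def make_residue_id(resname, resseq, icode=" ", is_water=False, is_hetero=False):
    """Build a (hetfield, resseq, icode) tuple following the scheme above.

    Hypothetical helper for illustration; Bio.PDB builds these ids itself
    while parsing a file.
    """
    if is_water:
        hetfield = "W"
    elif is_hetero:
        hetfield = "H_" + resname
    else:
        hetfield = " "
    return (hetfield, resseq, icode)


ids = [
    make_residue_id("ASN", 10),
    make_residue_id("HOH", 10, is_water=True),
    make_residue_id("GLC", 10, is_hetero=True),
]
print(ids)
print(len(set(ids)))  # number of distinct ids among the three residues
```

All three residues share sequence identifier 10 and a blank insertion code, yet their ids differ in the first element, so they can coexist as keys in one chain.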
A Residue object has a number of additional methods:
>>> residue.get_resname()       # returns the residue name, e.g. "ASN"
>>> residue.is_disordered()     # returns 1 if the residue has disordered atoms
>>> residue.get_segid()         # returns the SEGID, e.g. "CHN1"
>>> residue.has_id(name)        # test if a residue has a certain atom
You can use is_aa(residue) to test if a Residue object is an amino acid.
11.2.5 Atom
The Atom object stores the data associated with an atom, and has no children. The id of an atom is its atom name (e.g. "OG" for the side chain oxygen of a Ser residue). An Atom id needs to be unique in a Residue. Again, an exception is made for disordered atoms, as described in section 11.3.2.
The atom id is simply the atom name (e.g. 'CA'). In practice, the atom name is created by stripping all spaces from the atom name in the PDB file.
However, in PDB files, a space can be part of an atom name. Often, calcium atoms are called 'CA..' in order to distinguish them from Cα atoms (which are called '.CA.'). In cases where stripping the spaces would create problems (i.e. two atoms called 'CA' in the same residue) the spaces are kept.
In a PDB file, an atom name consists of 4 chars, typically with leading and trailing spaces. Often these spaces can be removed for ease of use (e.g. an amino acid Cα atom is labeled ".CA." in a PDB file, where the dots represent spaces). To generate an atom name (and thus an atom id) the spaces are removed, unless this would result in a name collision in a Residue (i.e. two Atom objects with the same atom name and id). In the latter case, the atom name including spaces is tried. This situation can e.g. happen when one residue contains atoms with names ".CA." and "CA..", although this is not very likely.
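The naming rule can be sketched in a few lines. The function below is a hypothetical illustration of the rule as described in the text (remove spaces, fall back to the full name on a collision); it is not taken from Bio.PDB's source.

```python
def assign_atom_ids(full_names):
    """Derive atom ids from 4-character PDB atom names.

    Hypothetical sketch: spaces are removed unless the result collides
    with an id already used in the same residue, in which case the full
    name (spaces included) is kept.
    """
    ids = []
    seen = set()
    for full_name in full_names:
        atom_id = full_name.replace(" ", "")
        if atom_id in seen:
            atom_id = full_name  # keep the spaces to stay unique
        seen.add(atom_id)
        ids.append(atom_id)
    return ids


# " CA " is a C-alpha name, "CA  " a calcium name (real spaces, not dots).
print(assign_atom_ids([" N  ", " CA ", "CA  "]))
```

Only the second "CA" keeps its padding, which is the rare collision case the paragraph above describes.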
The atomic data stored includes the atom name, the atomic coordinates (including standard deviation if present), the B factor (including anisotropic B factors and standard deviation if present), the altloc specifier and the full atom name including spaces. Less used items like the atom element number or the atomic charge sometimes specified in a PDB file are not stored.
To manipulate the atomic coordinates, use the transform method of the Atom object. Use the set_coord method to specify the atomic coordinates directly.
An Atom object has the following additional methods:
>>> a.get_name()       # atom name (spaces stripped, e.g. "CA")
>>> a.get_id()         # id (equals atom name)
>>> a.get_coord()      # atomic coordinates
>>> a.get_vector()     # atomic coordinates as Vector object
>>> a.get_bfactor()    # isotropic B factor
>>> a.get_occupancy()  # occupancy
>>> a.get_altloc()     # alternative location specifier
>>> a.get_sigatm()     # standard deviation of atomic parameters
>>> a.get_siguij()     # standard deviation of anisotropic B factor
>>> a.get_anisou()     # anisotropic B factor
>>> a.get_fullname()   # atom name (with spaces, e.g. ".CA.")
To represent the atom coordinates, siguij, anisotropic B factor and sigatm, Numpy arrays are used.
The get_vector method returns a Vector object representation of the coordinates of the Atom object, allowing you to do vector operations on atomic coordinates. Vector implements the full set of 3D vector operations, matrix multiplication (left and right) and some advanced rotation-related operations as well.
As an example of the capabilities of Bio.PDB's Vector module, suppose that you would like to find the position of a Gly residue's Cβ atom, if it had one. Rotating the N atom of the Gly residue along the Cα-C bond over -120 degrees roughly puts it in the position of a virtual Cβ atom. Here's how to do it, making use of the rotaxis method (which can be used to construct a rotation around a certain axis) of the Vector module:
>>> from math import pi
>>> from Bio.PDB import rotaxis
# get atom coordinates as vectors
>>> n = residue['N'].get_vector()
>>> c = residue['C'].get_vector()
>>> ca = residue['CA'].get_vector()
# center at origin
>>> n = n - ca
>>> c = c - ca
# find rotation matrix that rotates n
# -120 degrees along the ca-c vector
>>> rot = rotaxis(-pi * 120.0/180.0, c)
# apply rotation to ca-n vector
>>> cb_at_origin = n.left_multiply(rot)
# put on top of ca atom
>>> cb = cb_at_origin + ca
This example shows that it's possible to do some quite nontrivial vector operations on atomic data, which can be quite useful. In addition to all the usual vector operations (cross (use **), and dot (use *) product, angle, norm, etc.) and the above mentioned rotaxis function, the Vector module also has methods to rotate (rotmat) or reflect (refmat) one vector on top of another.
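For readers who want to see what a rotation about an axis involves numerically, here is a plain-Python sketch based on Rodrigues' rotation formula. It is an independent illustration, not Bio.PDB's rotaxis implementation, and the function name rotate_about_axis is made up for this example; sign and multiplication conventions may differ from those of the Vector module.

```python
import math


def rotate_about_axis(point, axis, theta):
    """Rotate a 3D point by theta radians about an axis through the origin.

    Uses Rodrigues' formula:
    p' = p cos(t) + (k x p) sin(t) + k (k . p) (1 - cos(t)),
    where k is the unit vector along the axis.
    """
    norm = math.sqrt(sum(a * a for a in axis))
    kx, ky, kz = (a / norm for a in axis)
    px, py, pz = point
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    dot = kx * px + ky * py + kz * pz
    cross = (ky * pz - kz * py, kz * px - kx * pz, kx * py - ky * px)
    return tuple(
        p * cos_t + c * sin_t + k * dot * (1.0 - cos_t)
        for p, c, k in zip((px, py, pz), cross, (kx, ky, kz))
    )


# Rotate the x unit vector a quarter turn about the z axis.
rotated = rotate_about_axis((1.0, 0.0, 0.0), (0.0, 0.0, 1.0), math.pi / 2)
print([round(x, 6) for x in rotated])
```

The virtual Cβ construction above follows the same pattern: translate so that Cα sits at the origin, rotate about the Cα-C direction, then translate back.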
11.2.6 Extracting a specific Atom/Residue/Chain/Model from a Structure
These are some examples:
>>> model = structure[0]
>>> chain = model['A']
>>> residue = chain[100]
>>> atom = residue['CA']
Note that you can use a shortcut:
>>> atom = structure[0]['A'][100]['CA']
11.3 Disorder
Bio.PDB can handle both disordered atoms and point mutations (i.e. a Gly and an Ala residue in the same position).
11.3.1 General approach
Disorder should be dealt with from two points of view: the atom and the residue points of view. In general, we have tried to encapsulate all the complexity that arises from disorder. If you just want to loop over all Cα atoms, you do not care that some residues have a disordered side chain. On the other hand it should also be possible to represent disorder completely in the data structure. Therefore, disordered atoms or residues are stored in special objects that behave as if there is no disorder. This is done by only representing a subset of the disordered atoms or residues. Which subset is picked (e.g. which of the two disordered OG side chain atom positions of a Ser residue is used) can be specified by the user.
11.3.2 Disordered atoms
Disordered atoms are represented by ordinary Atom objects, but all Atom objects that represent the same physical atom are stored in a DisorderedAtom object (see Fig. 11.1). Each Atom object in a DisorderedAtom object can be uniquely indexed using its altloc specifier. The DisorderedAtom object forwards all uncaught method calls to the selected Atom object, by default the one that represents the atom with the highest occupancy. The user can of course change the selected Atom object, making use of its altloc specifier. In this way atom disorder is represented correctly without much additional complexity. In other words, if you are not interested in atom disorder, you will not be bothered by it.
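The forwarding idea can be sketched with Python's __getattr__ hook, which is only consulted when normal attribute lookup fails. ToyAtom and ToyDisorderedAtom below are hypothetical classes written for this illustration; they are a simplified model of the behaviour described above, not Bio.PDB's DisorderedAtom.

```python
class ToyAtom:
    """Hypothetical atom holding just an altloc and an occupancy."""

    def __init__(self, altloc, occupancy):
        self.altloc = altloc
        self.occupancy = occupancy

    def get_altloc(self):
        return self.altloc

    def get_occupancy(self):
        return self.occupancy


class ToyDisorderedAtom:
    """Hypothetical wrapper that forwards unknown calls to one child atom."""

    def __init__(self):
        self.children = {}  # altloc -> ToyAtom
        self.selected = None

    def add(self, atom):
        self.children[atom.altloc] = atom
        # By default the highest-occupancy atom is the selected one.
        if self.selected is None or atom.occupancy > self.selected.occupancy:
            self.selected = atom

    def disordered_select(self, altloc):
        self.selected = self.children[altloc]

    def __getattr__(self, name):
        # Reached only for attributes the wrapper itself does not define.
        if name in ("children", "selected"):
            raise AttributeError(name)
        return getattr(self.selected, name)


atom = ToyDisorderedAtom()
atom.add(ToyAtom("A", 0.3))
atom.add(ToyAtom("B", 0.7))
print(atom.get_altloc())  # forwarded to the highest-occupancy child
atom.disordered_select("A")
print(atom.get_altloc())  # forwarded to the explicitly selected child
```

Code that only calls ordinary atom methods never needs to know it is talking to a wrapper, which is the point of the design.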
Each disordered atom has a characteristic altloc identifier. You can specify that a DisorderedAtom object should behave like the Atom object associated with a specific altloc identifier:
>>> atom.disordered_select('A')  # select altloc A atom
>>> print(atom.get_altloc())
"A"
>>> atom.disordered_select('B')  # select altloc B atom
>>> print(atom.get_altloc())
"B"
11.3.3 Disordered residues
Common case
The most common case is a residue that contains one or more disordered atoms. This is evidently solved by using DisorderedAtom objects to represent the disordered atoms, and storing the DisorderedAtom object in a Residue object just like ordinary Atom objects. The DisorderedAtom will behave exactly like an ordinary atom (in fact the atom with the highest occupancy) by forwarding all uncaught method calls to one of the Atom objects (the selected Atom object) it contains.
Point mutations
A special case arises when disorder is due to a point mutation, i.e. when two or more point mutants of a polypeptide are present in the crystal. An example of this can be found in PDB structure 1EN2.
Since these residues belong to a different residue type (e.g. let's say Ser 60 and Cys 60) they should not be stored in a single Residue object as in the common case. In this case, each residue is represented by one Residue object, and both Residue objects are stored in a single DisorderedResidue object (see Fig. 11.1).
The DisorderedResidue object forwards all uncaught methods to the selected Residue object (by default the last Residue object added), and thus behaves like an ordinary residue. Each Residue object in a DisorderedResidue object can be uniquely identified by its residue name. In the above example, residue Ser 60 would have id "SER" in the DisorderedResidue object, while residue Cys 60 would have id "CYS". The user can select the active Residue object in a DisorderedResidue object via this id.
Example: suppose that a chain has a point mutation at position 10, consisting of a Ser and a Cys residue. Make sure that residue 10 of this chain behaves as the Cys residue.
>>> residue = chain[10]
>>> residue.disordered_select('CYS')
In addition, you can get a list of all Atom objects (i.e. all DisorderedAtom objects are 'unpacked' to their individual Atom objects) using the get_unpacked_list method of a (Disordered)Residue object.
11.4 Hetero residues
11.4.1 Associated problems
A common problem with hetero residues is that several hetero and non-hetero residues present in the same chain share the same sequence identifier (and insertion code). Therefore, to generate a unique id for each hetero residue, waters and other hetero residues are treated in a different way.
Remember that a Residue object has the tuple (hetfield, resseq, icode) as id. The hetfield is blank (' ') for amino and nucleic acids, and a string for waters and other hetero residues. The content of the hetfield is explained below.
11.4.2 Water residues
The hetfield string of a water residue consists of the letter 'W'. So a typical residue id for a water is ('W', 1, ' ').
11.4.3 Other hetero residues
The hetfield string for other hetero residues starts with 'H_' followed by the residue name. A glucose molecule e.g. with residue name 'GLC' would have hetfield 'H_GLC'. Its residue id could e.g. be ('H_GLC', 1, ' ').
11.5 Navigating through a Structure object
Parse a PDB file, and extract some Model, Chain, Residue and Atom objects
>>> from Bio.PDB.PDBParser import PDBParser
>>> parser = PDBParser()
>>> structure = parser.get_structure("test", "1fat.pdb")
>>> model = structure[0]
>>> chain = model["A"]
>>> residue = chain[1]
>>> atom = residue["CA"]
Iterating through all atoms of a structure
>>> p = PDBParser()
>>> structure = p.get_structure('X', 'pdb1fat.ent')
>>> for model in structure:
...     for chain in model:
...         for residue in chain:
...             for atom in residue:
...                 print(atom)
...
There is a shortcut if you want to iterate over all atoms in a structure:
>>> atoms = structure.get_atoms()
>>> for atom in atoms:
...     print(atom)
...
Similarly, to iterate over all atoms in a chain, use
>>> atoms = chain.get_atoms()
>>> for atom in atoms:
...     print(atom)
...
Iterating over all residues of a model
Or, if you want to iterate over all residues in a model:
>>> residues = model.get_residues()
>>> for residue in residues:
...     print(residue)
...
You can also use the Selection.unfold_entities function to get all residues from a structure:
>>> from Bio.PDB import Selection
>>> res_list = Selection.unfold_entities(structure, 'R')
or to get all atoms from a chain:
>>> atom_list = Selection.unfold_entities(chain, 'A')
Obviously, A=atom, R=residue, C=chain, M=model, S=structure. You can use this to go up in the hierarchy, e.g. to get a list of (unique) Residue or Chain parents from a list of Atoms:
>>> residue_list = Selection.unfold_entities(atom_list, 'R')
>>> chain_list = Selection.unfold_entities(atom_list, 'C')
For more info, see the API documentation.
Extract a hetero residue from a chain (e.g. a glucose (GLC) moiety with resseq 10)
>>> residue_id = ("H_GLC", 10, " ")
>>> residue = chain[residue_id]
Print all hetero residues in chain
>>> for residue in chain.get_list():
...     residue_id = residue.get_id()
...     hetfield = residue_id[0]
...     if hetfield[0] == "H":
...         print(residue_id)
...
Print out the coordinates of all CA atoms in a structure with B factor greater than 50
>>> for model in structure.get_list():
...     for chain in model.get_list():
...         for residue in chain.get_list():
...             if residue.has_id("CA"):
...                 ca = residue["CA"]
...                 if ca.get_bfactor() > 50.0:
...                     print(ca.get_coord())
...
Print out all the residues that contain disordered atoms
>>> for model in structure.get_list():
...     for chain in model.get_list():
...         for residue in chain.get_list():
...             if residue.is_disordered():
...                 resseq = residue.get_id()[1]
...                 resname = residue.get_resname()
...                 model_id = model.get_id()
...                 chain_id = chain.get_id()
...                 print(model_id, chain_id, resname, resseq)
...
Loop over all disordered atoms, and select all atoms with altloc A (if present)
This will make sure that the SMCRA data structure will behave as if only the atoms with altloc A are present.
>>> for model in structure.get_list():
...     for chain in model.get_list():
...         for residue in chain.get_list():
...             if residue.is_disordered():
...                 for atom in residue.get_list():
...                     if atom.is_disordered():
...                         if atom.disordered_has_id("A"):
...                             atom.disordered_select("A")
...
Extracting polypeptides from a Structure object
To extract polypeptides from a structure, construct a list of Polypeptide objects from a Structure object using PolypeptideBuilder as follows:
>>> from Bio.PDB.Polypeptide import PPBuilder
>>> model_nr = 1
>>> ppb = PPBuilder()
>>> polypeptide_list = ppb.build_peptides(structure[model_nr])
>>> for polypeptide in polypeptide_list:
...     print(polypeptide)
...
A Polypeptide object is simply a UserList of Residue objects, and is always created from a single Model (in this case model 1). You can use the resulting Polypeptide object to get the sequence as a Seq object or to get a list of Cα atoms as well. Polypeptides can be built using a C-N or a Cα-Cα distance criterion.
Example:
# Using C-N
>>> ppb = PPBuilder()
>>> for pp in ppb.build_peptides(structure):
...     print(pp.get_sequence())
...
# Using CA-CA
>>> from Bio.PDB.Polypeptide import CaPPBuilder
>>> ppb = CaPPBuilder()
>>> for pp in ppb.build_peptides(structure):
...     print(pp.get_sequence())
...
Note that in the above case only model 0 of the structure is considered by PolypeptideBuilder. However, it is possible to use PolypeptideBuilder to build Polypeptide objects from Model and Chain objects as well.
Obtaining the sequence of a structure
The first thing to do is to extract all polypeptides from the structure (as above). The sequence of each polypeptide can then easily be obtained from the Polypeptide objects. The sequence is represented as a Biopython Seq object, and its alphabet is defined by a ProteinAlphabet object.
Example:
>>> seq = polypeptide.get_sequence()
>>> print(seq)
Seq('SNVVE...', <class Bio.Alphabet.ProteinAlphabet>)
11.6Analyzingstructures
11.6.1Measuringdistances
Theminusoperatorforatomshasbeenoverloadedtoreturnthedistancebetweentwoatoms.
#Getsomeatoms
>>>ca1=residue1['CA']
>>>ca2=residue2['CA']
#Simplysubtracttheatomstogettheirdistance
>>>distance=ca1ca2
11.6.2 Measuring angles
Use the vector representation of the atomic coordinates, and the calc_angle function from the Vector module:
>>> vector1 = atom1.get_vector()
>>> vector2 = atom2.get_vector()
>>> vector3 = atom3.get_vector()
>>> angle = calc_angle(vector1, vector2, vector3)
11.6.3 Measuring torsion angles
Use the vector representation of the atomic coordinates, and the calc_dihedral function from the Vector module:
>>> vector4 = atom4.get_vector()
>>> angle = calc_dihedral(vector1, vector2, vector3, vector4)
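For readers who want to see what a torsion angle calculation involves, here is a standard-library sketch of the usual cross-product formula. It is an illustration under stated assumptions, not the Vector module's implementation: the function name dihedral_degrees, the helper names and the test points are all hypothetical.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral_degrees(p1, p2, p3, p4):
    """Torsion angle (degrees) around the p2-p3 axis for four points."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1 = _cross(b1, b2)  # normal of the plane through p1, p2, p3
    n2 = _cross(b2, b3)  # normal of the plane through p2, p3, p4
    x = _dot(n1, n2)
    y = _dot(_cross(n1, n2), b2) / math.sqrt(_dot(b2, b2))
    return math.degrees(math.atan2(y, x))


# Made-up points: the central bond runs along the z axis.
p1, p2, p3 = (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
cis = dihedral_degrees(p1, p2, p3, (1.0, 0.0, 1.0))
trans = dihedral_degrees(p1, p2, p3, (-1.0, 0.0, 1.0))
perpendicular = dihedral_degrees(p1, p2, p3, (0.0, 1.0, 1.0))
print(cis, trans, perpendicular)
```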
11.6.4 Determining atom-atom contacts
Use NeighborSearch to perform neighbor lookup. The neighbor lookup is done using a KD tree module written in C (see Bio.KDTree), making it very fast. It also includes a fast method to find all point pairs within a certain distance of each other.
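To make the "all point pairs within a certain distance" query concrete, here is a naive brute-force sketch using only the standard library. It compares every pair, so it is far slower than a KD tree on real structures; the function name pairs_within and the coordinates are hypothetical.

```python
import itertools
import math


def pairs_within(coords, radius):
    """Return index pairs (i, j), i < j, of points no farther apart than radius."""
    hits = []
    for (i, a), (j, b) in itertools.combinations(enumerate(coords), 2):
        distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
        if distance <= radius:
            hits.append((i, j))
    return hits


# Made-up coordinates forming two well-separated pairs of points.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0), (5.0, 1.0, 0.0)]
close_pairs = pairs_within(coords, 1.5)
print(close_pairs)
```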
11.6.5 Superimposing two structures
Use a Superimposer object to superimpose two coordinate sets. This object calculates the rotation and translation matrix that rotates two lists of atoms on top of each other in such a way that their RMSD is minimized. Of course, the two lists need to contain the same number of atoms. The Superimposer object can also apply the rotation/translation to a list of atoms. The rotation and translation are stored as a tuple in the rotran attribute of the Superimposer object (note that the rotation is right multiplying!). The RMSD is stored in the rms attribute.
The algorithm used by Superimposer comes from [17, Golub & Van Loan] and makes use of singular value decomposition (this is implemented in the general Bio.SVDSuperimposer module).
Example:
>>> sup = Superimposer()
# Specify the atom lists
# 'fixed' and 'moving' are lists of Atom objects
# The moving atoms will be put on the fixed atoms
>>> sup.set_atoms(fixed, moving)
# Print rotation/translation/rmsd
>>> print(sup.rotran)
>>> print(sup.rms)
# Apply rotation/translation to the moving atoms
>>> sup.apply(moving)
To superimpose two structures based on their active sites, use the active site atoms to calculate the rotation/translation matrices (as above), and apply these to the whole molecule.
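"Right multiplying" means each coordinate is treated as a row vector and multiplied by the rotation matrix on its right before the translation is added. The sketch below shows this with plain Python lists; apply_rotran is a hypothetical helper, and the rotation and translation are made-up values rather than output from Superimposer.

```python
def apply_rotran(coords, rot, tran):
    """Return coord * rot + tran for each (x, y, z) row vector in coords."""
    moved = []
    for coord in coords:
        moved.append(
            tuple(
                sum(coord[k] * rot[k][j] for k in range(3)) + tran[j]
                for j in range(3)
            )
        )
    return moved


# Made-up transformation: with row vectors, this matrix maps x onto y.
rot = [
    [0.0, 1.0, 0.0],
    [-1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]
tran = (1.0, 0.0, 0.0)

moved = apply_rotran([(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)], rot, tran)
print(moved)
```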
11.6.6 Mapping the residues of two related structures onto each other
First, create an alignment file in FASTA format, then use the StructureAlignment class. This class can also be used for alignments with more than two structures.
11.6.7 Calculating the Half Sphere Exposure
Half Sphere Exposure (HSE) is a new, 2D measure of solvent exposure [20]. Basically, it counts the number of Cα atoms around a residue in the direction of its side chain, and in the opposite direction (within a radius of 13 Å). Despite its simplicity, it outperforms many other measures of solvent exposure.
HSE comes in two flavors: HSE-α and HSE-β. The former only uses the Cα atom positions, while the latter uses the Cα and Cβ atom positions. The HSE measure is calculated by the HSExposure class, which can also calculate the contact number. The latter class has methods which return dictionaries that map a Residue object to its corresponding HSE-α, HSE-β and contact number values.
Example:
>>> hse = HSExposure()
# Calculate HSE-alpha
>>> exp_ca = hse.calc_hs_exposure(model, option='CA3')
# Calculate HSE-beta
>>> exp_cb = hse.calc_hs_exposure(model, option='CB')
# Calculate classical coordination number
>>> exp_fs = hse.calc_fs_exposure(model)
# Print HSE-alpha for a residue
>>> print(exp_ca[some_residue])
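The counting idea behind HSE can be sketched in a few lines of standard-library Python. This is an illustration of the concept only, assuming the side-chain direction is already known (for HSE-β it would be the Cα to Cβ vector); half_sphere_counts and all coordinates are hypothetical.

```python
import math


def half_sphere_counts(center, direction, other_ca_coords, radius=13.0):
    """Count points within radius of center, split by side of the plane
    through center that is perpendicular to direction.

    Returns (up, down): 'up' lies in the half sphere that direction points into.
    """
    up = down = 0
    for coord in other_ca_coords:
        offset = tuple(c - o for c, o in zip(coord, center))
        if math.sqrt(sum(v * v for v in offset)) >= radius:
            continue  # outside the sphere, ignored
        if sum(v * d for v, d in zip(offset, direction)) > 0:
            up += 1
        else:
            down += 1
    return up, down


# Made-up data: a residue at the origin whose side chain points along +z.
others = [(0.0, 0.0, 5.0), (1.0, 1.0, -3.0), (0.0, 0.0, 20.0), (3.0, 0.0, 4.0)]
hse_up, hse_down = half_sphere_counts((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), others)
print(hse_up, hse_down)
```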
11.6.8 Determining the secondary structure
For this functionality, you need to install DSSP (and obtain a license for it, free for academic use, see http://www.cmbi.kun.nl/gv/dssp/). Then use the DSSP class, which maps Residue objects to their secondary structure (and accessible surface area). The DSSP codes are listed in Table 11.1. Note that DSSP (the program, and thus by consequence the class) cannot handle multiple models!
Code  Secondary structure
H     α-helix
B     Isolated β-bridge residue
E     Strand
G     3-10 helix
I     π-helix
T     Turn
S     Bend
-     Other
Table 11.1: DSSP codes in Bio.PDB.
The DSSP class can also be used to calculate the accessible surface area of a residue. But see also section 11.6.9.
11.6.9 Calculating the residue depth
Residue depth is the average distance of a residue's atoms from the solvent accessible surface. It's a fairly new and very powerful parameterization of solvent accessibility. For this functionality, you need to install Michel Sanner's MSMS program (http://www.scripps.edu/pub/olsonweb/people/sanner/html/msms_home.html). Then use the ResidueDepth class. This class behaves as a dictionary which maps Residue objects to corresponding (residue depth, Cα depth) tuples. The Cα depth is the distance of a residue's Cα atom to the solvent accessible surface.
Example:
>>> rd = ResidueDepth(model, pdb_file)
>>> residue_depth, ca_depth = rd[some_residue]
You can also get access to the molecular surface itself (via the get_surface function), in the form of a Numeric Python array with the surface points.
11.7 Common problems in PDB files
It is well known that many PDB files contain semantic errors (not the structures themselves, but their representation in PDB files). Bio.PDB tries to handle this in two ways. The PDBParser object can behave in two ways: a restrictive way and a permissive way, which is the default.
Example:
# Permissive parser
>>> parser = PDBParser(PERMISSIVE=1)
>>> parser = PDBParser()  # The same (default)
# Strict parser
>>> strict_parser = PDBParser(PERMISSIVE=0)
In the permissive state (DEFAULT), PDB files that obviously contain errors are "corrected" (i.e. some residues or atoms are left out). These errors include:
    Multiple residues with the same identifier
    Multiple atoms with the same identifier (taking into account the altloc identifier)
These errors indicate real problems in the PDB file (for details see [18, Hamelryck and Manderick, 2003]). In the restrictive state, PDB files with errors cause an exception to occur. This is useful to find errors in PDB files.
Some errors however are automatically corrected. Normally each disordered atom should have a non-blank altloc identifier. However, there are many structures that do not follow this convention, and have a blank and a non-blank identifier for two disordered positions of the same atom. This is automatically interpreted in the right way.
Sometimes a structure contains a list of residues belonging to chain A, followed by residues belonging to chain B, and again followed by residues belonging to chain A, i.e. the chains are "broken". This is also correctly interpreted.
11.7.1 Examples
The PDBParser/Structure class was tested on about 800 structures (each belonging to a unique SCOP superfamily). This takes about 20 minutes, or on average 1.5 seconds per structure. Parsing the structure of the large ribosomal subunit (1FFK), which contains about 64000 atoms, takes 10 seconds on a 1000 MHz PC.
Three exceptions were generated in cases where an unambiguous data structure could not be built. In all three cases, the likely cause is an error in the PDB file that should be corrected. Generating an exception in these cases is much better than running the chance of incorrectly describing the structure in a data structure.
11.7.1.1 Duplicate residues
One structure contains two amino acid residues in one chain with the same sequence identifier (resseq 3) and icode. Upon inspection it was found that this chain contains the residues Thr A3, ..., Gly A202, Leu A3, Glu A204. Clearly, Leu A3 should be Leu A203. A couple of similar situations exist for structure 1FFK (which e.g. contains Gly B64, Met B65, Glu B65, Thr B67, i.e. residue Glu B65 should be Glu B66).
11.7.1.2 Duplicate atoms
Structure 1EJG contains a Ser/Pro point mutation in chain A at position 22. In turn, Ser 22 contains some disordered atoms. As expected, all atoms belonging to Ser 22 have a non-blank altloc specifier (B or C). All atoms of Pro 22 have altloc A, except the N atom which has a blank altloc. This generates an exception, because all atoms belonging to two residues at a point mutation should have a non-blank altloc. It turns out that this atom is probably shared by Ser and Pro 22, as Ser 22 misses the N atom. Again, this points to a problem in the file: the N atom should be present in both the Ser and the Pro residue, in both cases associated with a suitable altloc identifier.
11.7.2 Automatic correction
Some errors are quite common and can be easily corrected without much risk of making a wrong interpretation. These cases are listed below.
11.7.2.1 A blank altloc for a disordered atom
Normally each disordered atom should have a non-blank altloc identifier. However, there are many structures that do not follow this convention, and have a blank and a non-blank identifier for two disordered positions of the same atom. This is automatically interpreted in the right way.
11.7.2.2 Broken chains
Sometimes a structure contains a list of residues belonging to chain A, followed by residues belonging to chain B, and again followed by residues belonging to chain A, i.e. the chains are "broken". This is correctly interpreted.
11.7.3 Fatal errors
Sometimes a PDB file cannot be unambiguously interpreted. Rather than guessing and risking a mistake, an exception is generated, and the user is expected to correct the PDB file. These cases are listed below.
11.7.3.1 Duplicate residues
All residues in a chain should have a unique id. This id is generated based on:
    The sequence identifier (resseq).
    The insertion code (icode).
    The hetfield string ("W" for waters and "H_" followed by the residue name for other hetero residues)
    The residue names of the residues in the case of point mutations (to store the Residue objects in a DisorderedResidue object).
If this does not lead to a unique id something is quite likely wrong, and an exception is generated.
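The uniqueness check described above can be illustrated with a small standard-library sketch. It is not the parser's actual code: find_duplicate_ids is a hypothetical helper, the id is modelled here as a (hetfield, resseq, icode) tuple, and the residue records reuse the Thr A3 / Leu A3 situation from section 11.7.1.1.

```python
def find_duplicate_ids(residues):
    """Return (name, id) for every residue whose id was already seen in the chain.

    Each record is (name, hetfield, resseq, icode); the id used for the
    uniqueness check is the tuple (hetfield, resseq, icode).
    """
    seen = set()
    duplicates = []
    for name, hetfield, resseq, icode in residues:
        residue_id = (hetfield, resseq, icode)
        if residue_id in seen:
            duplicates.append((name, residue_id))
        seen.add(residue_id)
    return duplicates


# Illustrative chain A records; a blank hetfield and icode mean "ordinary residue".
chain_a = [
    ("THR", " ", 3, " "),
    ("GLY", " ", 202, " "),
    ("LEU", " ", 3, " "),   # clashes with THR 3; should have been resseq 203
    ("GLU", " ", 204, " "),
]
duplicates = find_duplicate_ids(chain_a)
print(duplicates)
```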
11.7.3.2 Duplicate atoms
All atoms in a residue should have a unique id. This id is generated based on:
    The atom name (without spaces, or with spaces if a problem arises).
    The altloc specifier.
If this does not lead to a unique id something is quite likely wrong, and an exception is generated.
11.8 Accessing the Protein Data Bank
11.8.1 Downloading structures from the Protein Data Bank
Structures can be downloaded from the PDB (Protein Data Bank) by using the retrieve_pdb_file method on a PDBList object. The argument for this method is the PDB identifier of the structure.
>>> pdbl = PDBList()
>>> pdbl.retrieve_pdb_file('1FAT')
The PDBList class can also be used as a command-line tool:
python PDBList.py 1fat
The downloaded file will be called pdb1fat.ent and stored in the current working directory. Note that the retrieve_pdb_file method also has an optional argument pdir that specifies a specific directory in which to store the downloaded PDB files.
The retrieve_pdb_file method also has some options to specify the compression format used for the download, and the program used for local decompression (default .Z format and gunzip). In addition, the PDB ftp site can be specified upon creation of the PDBList object. By default, the server of the Worldwide Protein Data Bank (ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/pdb/) is used. See the API documentation for more details. Thanks again to Kristian Rother for donating this module.
11.8.2 Downloading the entire PDB
The following commands will store all PDB files in the /data/pdb directory:
python PDBList.py all /data/pdb
python PDBList.py all /data/pdb -d
The API method for this is called download_entire_pdb. Adding the -d option will store all files in the same directory. Otherwise, they are sorted into PDB-style subdirectories according to their PDB IDs. Depending on the traffic, a complete download will take 2-4 days.
11.8.3 Keeping a local copy of the PDB up to date
This can also be done using the PDBList object. One simply creates a PDBList object (specifying the directory where the local copy of the PDB is present) and calls the update_pdb method:
>>> pl = PDBList(pdb='/data/pdb')
>>> pl.update_pdb()
One can of course make a weekly cron job out of this to keep the local copy automatically up to date. The PDB ftp site can also be specified (see API documentation).
PDBList has some additional methods that can be of use. The get_all_obsolete method can be used to get a list of all obsolete PDB entries. The changed_this_week method can be used to obtain the entries that were added, modified or obsoleted during the current week. For more info on the possibilities of PDBList, see the API documentation.
11.9 General questions
11.9.1 How well tested is Bio.PDB?
Pretty well, actually. Bio.PDB has been extensively tested on nearly 5500 structures from the PDB; all structures seemed to be parsed correctly. More details can be found in the Bio.PDB Bioinformatics article. Bio.PDB has been used/is being used in many research projects as a reliable tool. In fact, I'm using Bio.PDB almost daily for research purposes and continue working on improving it and adding new features.
11.9.2 How fast is it?
The PDBParser performance was tested on about 800 structures (each belonging to a unique SCOP superfamily). This takes about 20 minutes, or on average 1.5 seconds per structure. Parsing the structure of the large ribosomal subunit (1FFK), which contains about 64000 atoms, takes 10 seconds on a 1000 MHz PC. In short: it's more than fast enough for many applications.
11.9.3 Is there support for molecular graphics?
Not directly, mostly since there are quite a few Python based/Python aware solutions already, that can potentially be used with Bio.PDB. My choice is Pymol, BTW (I've used this successfully with Bio.PDB, and there will probably be specific PyMol modules in Bio.PDB soon/some day). Python based/aware molecular graphics solutions include:
    PyMol: http://pymol.sourceforge.net/
    Chimera: http://www.cgl.ucsf.edu/chimera/
    PMV: http://www.scripps.edu/~sanner/python/
    Coot: http://www.ysbl.york.ac.uk/~emsley/coot/
    CCP4mg: http://www.ysbl.york.ac.uk/~lizp/molgraphics.html
    mmLib: http://pymmlib.sourceforge.net/
    VMD: http://www.ks.uiuc.edu/Research/vmd/
    MMTK: http://starship.python.net/crew/hinsen/MMTK/
11.9.4 Who's using Bio.PDB?
Bio.PDB was used in the construction of DISEMBL, a web server that predicts disordered regions in proteins (http://dis.embl.de/), and COLUMBA, a website that provides annotated protein structures (http://www.columba-db.de/). Bio.PDB has also been used to perform a large scale search for active site similarities between protein structures in the PDB [19, Hamelryck, 2003], and to develop a new algorithm that identifies linear secondary structure elements [26, Majumdar et al., 2005].
Judging from requests for features and information, Bio.PDB is also used by several LPCs (Large Pharmaceutical Companies :-).
Chapter 12 Bio.PopGen: Population genetics
Bio.PopGen is a Biopython module supporting population genetics, available in Biopython 1.44 onwards.
The medium term objective for the module is to support widely used data formats, applications and databases. This module is currently under intense development and support for new features should appear at a rather fast pace. Unfortunately this might also entail some instability on the API, especially if you are using a development version. APIs that are made available on our official public releases should be much more stable.
12.1 GenePop
GenePop (http://genepop.curtin.edu.au/) is a popular population genetics software package supporting Hardy-Weinberg tests, linkage disequilibrium, population differentiation, basic statistics, Fst and migration estimates, among others. GenePop does not supply sequence based statistics as it doesn't handle sequence data. The GenePop file format is supported by a wide range of other population genetic software applications, thus making it a relevant format in the population genetics field.
Bio.PopGen provides a parser and generator of GenePop file format. Utilities to manipulate the content of a record are also provided. Here is an example on how to read a GenePop file (you can find example GenePop data files in the Test/PopGen directory of Biopython):
from Bio.PopGen import GenePop
with open("example.gen") as handle:
    rec = GenePop.read(handle)
This will read a file called example.gen and parse it. If you do print(rec), the record will be output again, in GenePop format.
The most important information in rec will be the loci names and population information (but there is more: use help(GenePop.Record) to check the API documentation). Loci names can be found on rec.loci_list. Population information can be found on rec.populations. Populations is a list with one element per population. Each element is itself a list of individuals; each individual is a pair composed of the individual name and a list of alleles (2 per marker). Here is an example for rec.populations:
[
    [
        ('Ind1', [(1, 2), (3, 3), (200, 201)]),
        ('Ind2', [(2, None), (3, 3), (None, None)]),
    ],
    [
        ('Other1', [(1, 1), (4, 3), (200, 200)]),
    ]
]
So we have two populations, the first with two individuals, the second with only one. The first individual of the first population is called Ind1; allelic information for each of the 3 loci follows. Please note that for any locus, information might be missing (see as an example, Ind2 above).
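Because rec.populations is an ordinary nested list, it can be explored with plain Python. The sketch below hard-codes the example structure shown above (so it runs without a GenePop file) and counts individuals per population and missing alleles per individual; the variable names are illustrative only.

```python
# The same nested structure as the rec.populations example above.
populations = [
    [
        ("Ind1", [(1, 2), (3, 3), (200, 201)]),
        ("Ind2", [(2, None), (3, 3), (None, None)]),
    ],
    [
        ("Other1", [(1, 1), (4, 3), (200, 200)]),
    ],
]

# Number of individuals in each population.
sizes = [len(population) for population in populations]

# Number of missing alleles (None) for each individual, across all loci.
missing = {
    name: sum(allele is None for genotype in genotypes for allele in genotype)
    for population in populations
    for name, genotypes in population
}

print(sizes)
print(missing)
```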
A few utility functions to manipulate GenePop records are made available. Here is an example:
from Bio.PopGen import GenePop
# Imagine that you have loaded rec, as per the code snippet above...
rec.remove_population(pos)
# Removes a population from a record, pos is the population position in
# rec.populations, remember that it starts on position 0.
# rec is altered.
rec.remove_locus_by_position(pos)
# Removes a locus by its position, pos is the locus position in
# rec.loci_list, remember that it starts on position 0.
# rec is altered.
rec.remove_locus_by_name(name)
# Removes a locus by its name, name is the locus name as in
# rec.loci_list. If the name doesn't exist the function fails
# silently.
# rec is altered.
rec_loci = rec.split_in_loci()
# Splits a record in loci, that is, for each locus, it creates a new
# record, with a single locus and all populations.
# The result is returned in a dictionary, each key being the locus name.
# The value is the GenePop record.
# rec is not altered.
rec_pops = rec.split_in_pops(pop_names)
# Splits a record in populations, that is, for each population, it creates
# a new record, with a single population and all loci.
# The result is returned in a dictionary, each key being
# the population name. As population names are not available in GenePop,
# they are passed in an array (pop_names).
# The value of each dictionary entry is the GenePop record.
# rec is not altered.
GenePop does not support population names, a limitation which can be cumbersome at times. Functionality to enable population names is currently being planned for Biopython. These extensions won't break compatibility in any way with the standard format. In the medium term, we would also like to support the GenePop web service.
Chapter 13 Phylogenetics with Bio.Phylo
The Bio.Phylo module was introduced in Biopython 1.54. Following the lead of SeqIO and AlignIO, it aims to provide a common way to work with phylogenetic trees independently of the source data format, as well as a consistent API for I/O operations.
Bio.Phylo is described in an open-access journal article [9, Talevich et al., 2012], which you might also find helpful.
13.1 Demo: What's in a Tree?
To get acquainted with the module, let's start with a tree that we've already constructed, and inspect it a few different ways. Then we'll colorize the branches, to use a special phyloXML feature, and finally save it.
Create a simple Newick file named simple.dnd using your favorite text editor, or use simple.dnd provided with the Biopython source code:
(((A,B),(C,D)),(E,F,G));
This tree has no branch lengths, only a topology and labelled terminals. (If you have a real tree file available, you can follow this demo using that instead.)
Launch the Python interpreter of your choice:
% ipython -pylab
For interactive work, launching the IPython interpreter with the -pylab flag enables matplotlib integration, so graphics will pop up automatically. We'll use that during this demo.
Now, within Python, read the tree file, giving the file name and the name of the format.
>>> from Bio import Phylo
>>> tree = Phylo.read("simple.dnd", "newick")
Printing the tree object as a string gives us a look at the entire object hierarchy.
>>> print(tree)
Tree(rooted=False, weight=1.0)
    Clade()
        Clade()
            Clade()
                Clade(name='A')
                Clade(name='B')
            Clade()
                Clade(name='C')
                Clade(name='D')
        Clade()
            Clade(name='E')
            Clade(name='F')
            Clade(name='G')
The Tree object contains global information about the tree, such as whether it's rooted or unrooted. It has one root clade, and under that, it's nested lists of clades all the way down to the tips.
The function draw_ascii creates a simple ASCII-art (plain text) dendrogram. This is a convenient visualization for interactive exploration, in case better graphical tools aren't available.
>>> Phylo.draw_ascii(tree)
                                                    ________________________ A
                           ________________________|
                          |                        |________________________ B
  ________________________|
 |                        |                         ________________________ C
 |                        |________________________|
_|                                                 |________________________ D
 |
 |                         ________________________ E
 |                        |
 |________________________|________________________ F
                          |
                          |________________________ G
<BLANKLINE>
If you have matplotlib or pylab installed, you can create a graphic using the draw function (see Fig. 13.1):
>>> tree.rooted = True
>>> Phylo.draw(tree)
13.1.1 Coloring branches within a tree
The functions draw and draw_graphviz support the display of different colors and branch widths in a tree. As of Biopython 1.59, the color and width attributes are available on the basic Clade object and there's nothing extra required to use them. Both attributes refer to the branch leading to the given clade, and apply recursively, so all descendent branches will also inherit the assigned width and color values during display.
In earlier versions of Biopython, these were special features of PhyloXML trees, and using the attributes required first converting the tree to a subclass of the basic tree object called Phylogeny, from the Bio.Phylo.PhyloXML module.
In Biopython 1.55 and later, this is a convenient tree method:
>>> tree = tree.as_phyloxml()
In Biopython 1.54, you can accomplish the same thing with one extra import:
>>> from Bio.Phylo.PhyloXML import Phylogeny
>>> tree = Phylogeny.from_tree(tree)
Note that the file formats Newick and Nexus don't support branch colors or widths, so if you use these attributes in Bio.Phylo, you will only be able to save the values in PhyloXML format. (You can still save a tree as Newick or Nexus, but the color and width values will be skipped in the output file.)
Now we can begin assigning colors. First, we'll color the root clade gray. We can do that by assigning the 24-bit color value as an RGB triple, an HTML-style hex string, or the name of one of the predefined colors.
>>> tree.root.color = (128, 128, 128)
Or:
>>> tree.root.color = "#808080"
Or:
>>> tree.root.color = "gray"
Colors for a clade are treated as cascading down through the entire clade, so when we colorize the root here, it turns the whole tree gray. We can override that by assigning a different color lower down on the tree.
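The three color spellings above describe the same 24-bit value. A small standard-library sketch of how such specifications can be normalized to one RGB triple follows; to_rgb and the three-entry name table are hypothetical stand-ins, not Bio.Phylo's own color handling.

```python
# Hypothetical, deliberately tiny table of named colors for this sketch.
NAMED_COLORS = {
    "gray": (128, 128, 128),
    "salmon": (250, 128, 114),
    "blue": (0, 0, 255),
}


def to_rgb(color):
    """Normalize an RGB triple, '#rrggbb' hex string or color name to a triple."""
    if isinstance(color, tuple):
        return color
    if color.startswith("#"):
        # Two hex digits per channel: red, green, blue.
        return tuple(int(color[i:i + 2], 16) for i in (1, 3, 5))
    return NAMED_COLORS[color]


specs = [(128, 128, 128), "#808080", "gray"]
normalized = [to_rgb(spec) for spec in specs]
print(normalized)
```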
Let's target the most recent common ancestor (MRCA) of the nodes named "E" and "F". The common_ancestor method returns a reference to that clade in the original tree, so when we color that clade "salmon", the color will show up in the original tree.
>>> mrca = tree.common_ancestor({"name": "E"}, {"name": "F"})
>>> mrca.color = "salmon"
If we happened to know exactly where a certain clade is in the tree, in terms of nested list entries, we can jump directly to that position in the tree by indexing it. Here, the index [0,1] refers to the second child of the first child of the root.
>>> tree.clade[0, 1].color = "blue"
Finally, show our work (see Fig. 13.1.1):
>>> Phylo.draw(tree)
Note that a clade's color includes the branch leading to that clade, as well as its descendents. The common ancestor of E and F turns out to be just under the root, and with this coloring we can see exactly where the root of the tree is.
My, we've accomplished a lot! Let's take a break here and save our work. Call the write function with a file name or handle (here we use standard output, to see what would be written) and the format phyloxml. PhyloXML saves the colors we assigned, so you can open this phyloXML file in another tree viewer like Archaeopteryx, and the colors will show up there, too.
>>> import sys
>>> Phylo.write(tree, sys.stdout, "phyloxml")
<phy:phyloxml xmlns:phy="http://www.phyloxml.org">
  <phy:phylogeny rooted="true">
    <phy:clade>
      <phy:branch_length>1.0</phy:branch_length>
      <phy:color>
        <phy:red>128</phy:red>
        <phy:green>128</phy:green>
        <phy:blue>128</phy:blue>
      </phy:color>
      <phy:clade>
        <phy:clade>
          <phy:clade>
            <phy:name>A</phy:name>
            ...
The rest of this chapter covers the core functionality of Bio.Phylo in greater detail. For more examples of using Bio.Phylo, see the cookbook page on Biopython.org:
http://biopython.org/wiki/Phylo_cookbook
13.2 I/O functions
Like SeqIO and AlignIO, Phylo handles file input and output through four functions: parse, read, write and convert, all of which support the tree file formats Newick, NEXUS, phyloXML and NeXML, as well as the Comparative Data Analysis Ontology (CDAO).
The read function parses a single tree in the given file and returns it. Careful: it will raise an error if the file contains more than one tree, or no trees.
>>> tree = Phylo.read("Tests/Nexus/int_node_labels.nwk", "newick")
>>> print(tree)
(Example files are available in the Tests/Nexus/ and Tests/PhyloXML/ directories of the Biopython distribution.)
To handle multiple (or an unknown number of) trees, use the parse function, which iterates through each of the trees in the given file:
>>> trees = Phylo.parse("../../Tests/PhyloXML/phyloxml_examples.xml", "phyloxml")
>>> for tree in trees:
...     print(tree)
Write a tree or iterable of trees back to file with the write function:
>>> trees = list(Phylo.parse("../../Tests/PhyloXML/phyloxml_examples.xml", "phyloxml"))
>>> tree1 = trees[0]
>>> others = trees[1:]
>>> Phylo.write(tree1, "tree1.nwk", "newick")
1
>>> Phylo.write(others, "other_trees.nwk", "newick")
12
Convert files between any of the supported formats with the convert function:
>>> Phylo.convert("tree1.nwk", "newick", "tree1.xml", "nexml")
1
>>> Phylo.convert("other_trees.xml", "phyloxml", "other_trees.nex", "nexus")
12
To use strings as input or output instead of actual files, use StringIO as you would with SeqIO and AlignIO:
>>> from io import StringIO
>>> handle = StringIO("(((A,B),(C,D)),(E,F,G));")
>>> tree = Phylo.read(handle, "newick")
13.3 View and export trees
The simplest way to get an overview of a Tree object is to print it:
>>> tree = Phylo.read("PhyloXML/example.xml", "phyloxml")
>>> print(tree)
Phylogeny(description='phyloXML allows to use either a "branch_length" attribute...', name='example from Prof. Joe Felsenstein's book "Inferring Phyl...', rooted=
    Clade()
        Clade(branch_length=0.06)
            Clade(branch_length=0.102, name='A')
            Clade(branch_length=0.23, name='B')
        Clade(branch_length=0.4, name='C')
This is essentially an outline of the object hierarchy Biopython uses to represent a tree. But more likely, you'd want to see a drawing of the tree. There are three functions to do this.
As we saw in the demo, draw_ascii prints an ASCII-art drawing of the tree (a rooted phylogram) to standard output, or an open file handle if given. Not all of the available information about the tree is shown, but it provides a way to quickly view the tree without relying on any external dependencies.
>>> tree = Phylo.read("example.xml", "phyloxml")
>>> Phylo.draw_ascii(tree)
             __________________ A
  __________|
_|          |___________________________________________ B
 |
 |___________________________________________________________________________ C
The draw function draws a more attractive image using the matplotlib library. See the API documentation for details on the arguments it accepts to customize the output.
>>> Phylo.draw(tree, branch_labels=lambda c: c.branch_length)
draw_graphviz draws an unrooted cladogram, but requires that you have Graphviz, PyDot or PyGraphviz, NetworkX, and matplotlib (or pylab) installed. Using the same example as above, and the dot program included with Graphviz, let's draw a rooted tree (see Fig. 13.3):
>>> Phylo.draw_graphviz(tree, prog='dot')
>>> import pylab
>>> pylab.show()  # Displays the tree in an interactive viewer
>>> pylab.savefig('phylo-dot.png')  # Creates a PNG file of the same graphic
(Tip: If you execute IPython with the -pylab option, calling draw_graphviz causes the matplotlib viewer to launch automatically without manually calling show().)
This exports the tree object to a NetworkX graph, uses Graphviz to lay out the nodes, and displays it using matplotlib. There are a number of keyword arguments that can modify the resulting diagram, including most of those accepted by the NetworkX functions networkx.draw and networkx.draw_graphviz.
The display is also affected by the rooted attribute of the given tree object. Rooted trees are shown with a "head" on each branch indicating direction (see Fig. 13.3):
>>> tree.rooted = True
>>> Phylo.draw_graphviz(tree)
The prog argument specifies the Graphviz engine used for layout. The default, twopi, behaves well for any size tree, reliably avoiding crossed branches. The neato program may draw more attractive moderately-sized trees, but sometimes will cross branches (see Fig. 13.3). The dot program may be useful with small trees, but tends to do surprising things with the layout of larger trees.
>>> Phylo.draw_graphviz(tree, prog="neato")
This viewing mode is particularly handy for exploring larger trees, because the matplotlib viewer can zoom in on a selected region, thinning out a cluttered graphic.
>>> tree = Phylo.read("apaf.xml", "phyloxml")
>>> Phylo.draw_graphviz(tree, prog="neato", node_size=0)
Note that branch lengths are not displayed accurately, because Graphviz ignores them when creating the node layouts. The branch lengths are retained when exporting a tree as a NetworkX graph object (to_networkx), however.
See the Phylo page on the Biopython wiki (http://biopython.org/wiki/Phylo) for descriptions and examples of the more advanced functionality in draw_ascii, draw_graphviz and to_networkx.
13.4 Using Tree and Clade objects
The Tree objects produced by parse and read are containers for recursive sub-trees, attached to the Tree object at the root attribute (whether or not the phylogenetic tree is actually considered rooted). A Tree has globally applied information for the phylogeny, such as rootedness, and a reference to a single Clade; a Clade has node- and clade-specific information, such as branch length, and a list of its own descendent Clade instances, attached at the clades attribute.
So there is a distinction between tree and tree.root. In practice, though, you rarely need to worry about it. To smooth over the difference, both Tree and Clade inherit from TreeMixin, which contains the implementations for methods that would be commonly used to search, inspect or modify a tree or any of its clades. This means that almost all of the methods supported by tree are also available on tree.root and any clade below it. (Clade also has a root property, which returns the clade object itself.)
13.4.1 Search and traversal methods
For convenience, we provide a couple of simplified methods that return all external or internal nodes directly as a list:
get_terminals
    makes a list of all of this tree's terminal (leaf) nodes.
get_nonterminals
    makes a list of all of this tree's nonterminal (internal) nodes.
These both wrap a method with full control over tree traversal, find_clades. Two more traversal methods, find_elements and find_any, rely on the same core functionality and accept the same arguments, which we'll call a "target specification" for lack of a better description. These specify which objects in the tree will be matched and returned during iteration. The first argument can be any of the following types:
    A TreeElement instance, which tree elements will match by identity, so searching with a Clade instance as the target will find that clade in the tree
    A string, which matches tree elements' string representation, in particular, a clade's name (added in Biopython 1.56)
    A class or type, where every tree element of the same type (or sub-type) will be matched
    A dictionary where keys are tree element attributes and values are matched to the corresponding attribute of each tree element. This one gets even more elaborate:
        If an int is given, it matches numerically equal attributes, e.g. 1 will match 1 or 1.0
        If a boolean is given (True or False), the corresponding attribute value is evaluated as a boolean and checked for the same
        None matches None
        If a string is given, the value is treated as a regular expression (which must match the whole string in the corresponding element attribute, not just a prefix). A given string without special regex characters will match string attributes exactly, so if you don't use regexes, don't worry about it. For example, in a tree with clade names Foo1, Foo2 and Foo3, tree.find_clades({"name": "Foo1"}) matches Foo1, {"name": "Foo.*"} matches all three clades, and {"name": "Foo"} doesn't match anything.
    Since floating-point arithmetic can produce some strange behavior, we don't support matching floats directly. Instead, use the boolean True to match every element with a nonzero value in the specified attribute, then filter on that attribute manually with an inequality (or exact number, if you like living dangerously).
    If the dictionary contains multiple entries, a matching element must match each of the given attribute values: think "and", not "or".
    A function taking a single argument (it will be applied to each element in the tree), returning True or False. For convenience, LookupError, AttributeError and ValueError are silenced, so this provides another safe way to search for floating-point values in the tree, or some more complex characteristic.
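The whole-string regular expression rule for string targets can be tried out with the standard re module. This sketch only mimics the matching rule on a plain list of names; it does not call Bio.Phylo, and match_names is a hypothetical helper.

```python
import re


def match_names(pattern, names):
    """Return the names that the pattern matches in full (not just as a prefix)."""
    return [name for name in names if re.fullmatch(pattern, name)]


clade_names = ["Foo1", "Foo2", "Foo3"]

exact = match_names("Foo1", clade_names)       # plain string: exact match only
wildcard = match_names("Foo.*", clade_names)   # regex: matches all three names
prefix = match_names("Foo", clade_names)       # a bare prefix matches nothing

print(exact, wildcard, prefix)
```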
After the target, there are two optional keyword arguments:
terminal
    A boolean value to select for or against terminal clades (a.k.a. leaf nodes): True searches for only terminal clades, False for non-terminal (internal) clades, and the default, None, searches both terminal and non-terminal clades, as well as any tree elements lacking the is_terminal method.
order
    Tree traversal order: "preorder" (default) is depth-first search, "postorder" is DFS with child nodes preceding parents, and "level" is breadth-first search.
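The three traversal orders are easiest to compare side by side. The sketch below walks the demo topology (((A,B),(C,D)),(E,F,G)) using a tiny hypothetical Node class; the internal node labels (root, ABCD, AB, CD, EFG) are invented for display, and none of this is Bio.Phylo code.

```python
from collections import deque


class Node:
    """Minimal stand-in for a clade: a name plus a list of child nodes."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def preorder(node):
    """Depth-first: parent before its children."""
    yield node.name
    for child in node.children:
        yield from preorder(child)


def postorder(node):
    """Depth-first: children before their parent."""
    for child in node.children:
        yield from postorder(child)
    yield node.name


def level_order(node):
    """Breadth-first: one level of the tree at a time."""
    queue = deque([node])
    while queue:
        current = queue.popleft()
        yield current.name
        queue.extend(current.children)


root = Node("root", [
    Node("ABCD", [
        Node("AB", [Node("A"), Node("B")]),
        Node("CD", [Node("C"), Node("D")]),
    ]),
    Node("EFG", [Node("E"), Node("F"), Node("G")]),
])

pre = list(preorder(root))
post = list(postorder(root))
level = list(level_order(root))
print(pre, post, level, sep="\n")
```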
Finally, the methods accept arbitrary keyword arguments which are treated the same way as a dictionary target specification: keys indicate the name of the element attribute to search for, and the argument value (string, integer, None or boolean) is compared to the value of each attribute found. If no keyword arguments are given, then any TreeElement types are matched. The code for this is generally shorter than passing a dictionary as the target specification: tree.find_clades({"name": "Foo1"}) can be shortened to tree.find_clades(name="Foo1").
(In Biopython 1.56 or later, this can be even shorter: tree.find_clades("Foo1"))
Now that we've mastered target specifications, here are the methods used to traverse a tree:
find_clades
    Find each clade containing a matching element. That is, find each element as with find_elements, but return the corresponding clade object. (This is usually what you want.)
    The result is an iterable through all matching objects, searching depth-first by default. This is not necessarily the same order as the elements appear in the Newick, Nexus or XML source file!
find_elements
    Find all tree elements matching the given attributes, and return the matching elements themselves. Simple Newick trees don't have complex sub-elements, so this behaves the same as find_clades on them. PhyloXML trees often do have complex objects attached to clades, so this method is useful for extracting those.
find_any
    Return the first element found by find_elements(), or None. This is also useful for checking whether any matching element exists in the tree, and can be used in a conditional.
Two more methods help navigating between nodes in the tree:
get_path
    List the clades directly between the tree root (or current clade) and the given target. Returns a list of all clade objects along this path, ending with the given target, but excluding the root clade.
trace
    List all clade objects between two targets in this tree, excluding the start and including the finish.
13.4.2 Information methods
These methods provide information about the whole tree (or any clade).
common_ancestor
    Find the most recent common ancestor of all the given targets. (This will be a Clade object). If no target is given, returns the root of the current clade (the one this method is called from); if 1 target is given, this returns the target itself. However, if any of the specified targets are not found in the current tree (or clade), an exception is raised.
count_terminals
    Counts the number of terminal (leaf) nodes within the tree.
depths
    Create a mapping of tree clades to depths. The result is a dictionary where the keys are all of the Clade instances in the tree, and the values are the distance from the root to each clade (including terminals). By default the distance is the cumulative branch length leading to the clade, but with the unit_branch_lengths=True option, only the number of branches (levels in the tree) is counted.
distance
    Calculate the sum of the branch lengths between two targets. If only one target is specified, the other is the root of this tree.
total_branch_length
    Calculate the sum of all the branch lengths in this tree. This is usually just called the "length" of the tree in phylogenetics, but we use a more explicit name to avoid confusion with Python terminology.
Therestofthesemethodsarebooleanchecks:
is_bifurcating
Trueifthetreeisstrictlybifurcatingi.e.allnodeshaveeither2or0children(internalorexternal,respectively).Therootmayhave3descendentsandstillbe
consideredpartofabifurcatingtree.
is_monophyletic
Testifallofthegiventargetscompriseacompletesubcladei.e.,thereexistsacladesuchthatitsterminalsarethesamesetasthegiventargets.Thetargetsshould
beterminalsofthetree.Forconvenience,thismethodreturnsthecommonancestor(MCRA)ofthetargetsiftheyaremonophyletic(insteadofthevalueTrue),and
Falseotherwise.
is_parent_of
Trueiftargetisadescendentofthistreenotrequiredtobeadirectdescendent.Tocheckdirectdescendentsofaclade,simplyuselistmembershiptesting:if
subcladeinclade:...
is_preterminal
TrueifalldirectdescendentsareterminalFalseifanydirectdescendentisnotterminal.
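The difference between the two modes of depths can be illustrated without Bio.Phylo at all. The following is only a sketch on a toy tree of nested tuples (a hypothetical structure, not the Clade class), and it simplifies by giving the root a depth of zero:

```python
# Toy tree: each clade is (name, branch_length, children).
# This layout is an assumption for illustration, not Bio.Phylo's data model.
toy_tree = ("root", 0.0, [
    ("X", 1.0, [
        ("A", 1.0, []),
        ("B", 2.0, []),
    ]),
    ("C", 3.0, []),
])


def toy_depths(clade, unit_branch_lengths=False, parent_depth=0.0, is_root=True):
    """Map each clade name to its distance from the root."""
    name, branch_length, children = clade
    if is_root:
        step = 0
    elif unit_branch_lengths:
        step = 1  # count branches (levels) only
    else:
        step = branch_length  # accumulate branch lengths
    depth = parent_depth + step
    result = {name: depth}
    for child in children:
        result.update(toy_depths(child, unit_branch_lengths, depth, is_root=False))
    return result


cumulative = toy_depths(toy_tree)
levels = toy_depths(toy_tree, unit_branch_lengths=True)
```

Here cumulative maps B to 3.0 (1.0 + 2.0), while levels maps it to 2, since two branches separate it from the root.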
13.4.3 Modification methods
These methods modify the tree in-place. If you want to keep the original tree intact, make a complete copy of the tree first, using Python's copy module:
tree = Phylo.read('example.xml', 'phyloxml')
import copy
newtree = copy.deepcopy(tree)
collapse
Deletes the target from the tree, relinking its children to its parent.
collapse_all
Collapse all the descendents of this tree, leaving only terminals. Branch lengths are preserved, i.e. the distance to each terminal stays the same. With a target specification (see above), collapses only the internal nodes matching the specification.
ladderize
Sort clades in-place according to the number of terminal nodes. Deepest clades are placed last by default. Use reverse=True to sort clades deepest-to-shallowest.
prune
Prunes a terminal clade from the tree. If the taxon is from a bifurcation, the connecting node will be collapsed and its branch length added to the remaining terminal node. This might no longer be a meaningful value.
root_with_outgroup
Reroot this tree with the outgroup clade containing the given targets, i.e. the common ancestor of the outgroup. This method is only available on Tree objects, not Clades.
If the outgroup is identical to self.root, no change occurs. If the outgroup clade is terminal (e.g. a single terminal node is given as the outgroup), a new bifurcating root clade is created with a 0-length branch to the given outgroup. Otherwise, the internal node at the base of the outgroup becomes a trifurcating root for the whole tree. If the original root was bifurcating, it is dropped from the tree.
In all cases, the total branch length of the tree stays the same.
root_at_midpoint
Reroot this tree at the calculated midpoint between the two most distant tips of the tree. (This uses root_with_outgroup under the hood.)
split
Generate n (default 2) new descendants. In a species tree, this is a speciation event. New clades have the given branch_length and the same name as this clade's root plus an integer suffix (counting from 0); for example, splitting a clade named A produces the sub-clades A0 and A1.
See the Phylo page on the Biopython wiki (http://biopython.org/wiki/Phylo) for more examples of using the available methods.
13.4.4 Features of PhyloXML trees
The phyloXML file format includes fields for annotating trees with additional data types and visual cues.
See the PhyloXML page on the Biopython wiki (http://biopython.org/wiki/PhyloXML) for descriptions and examples of using the additional annotation features provided by PhyloXML.
13.5 Running external applications
While Bio.Phylo doesn't infer trees from alignments itself, there are third-party programs available that do. These are supported through the module Bio.Phylo.Applications, using the same general framework as Bio.Emboss.Applications, Bio.Align.Applications and others.
Biopython 1.58 introduced a wrapper for PhyML (http://www.atgc-montpellier.fr/phyml/). The program accepts an input alignment in phylip-relaxed format (that's Phylip format, but without the 10-character limit on taxon names) and a variety of options. A quick example:
>>> from Bio import Phylo
>>> from Bio.Phylo.Applications import PhymlCommandline
>>> cmd = PhymlCommandline(input='Tests/Phylip/random.phy')
>>> out_log, err_log = cmd()
This generates a tree file and a stats file with the names [input filename]_phyml_tree.txt and [input filename]_phyml_stats.txt. The tree file is in Newick format:
>>> tree = Phylo.read('Tests/Phylip/random.phy_phyml_tree.txt', 'newick')
A similar wrapper for RAxML (http://sco.h-its.org/exelixis/software.html) was added in Biopython 1.60, and FastTree (http://www.microbesonline.org/fasttree/) in Biopython 1.62.
Note that some popular Phylip programs, including dnaml and protml, are already available through the EMBOSS wrappers in Bio.Emboss.Applications if you have the Phylip extensions to EMBOSS installed on your system. See Section 6.4 for some examples and clues on how to use programs like these.
13.6 PAML integration
Biopython 1.58 brought support for PAML (http://abacus.gene.ucl.ac.uk/software/paml.html), a suite of programs for phylogenetic analysis by maximum likelihood. Currently the programs codeml, baseml and yn00 are implemented. Due to PAML's usage of control files rather than command line arguments to control runtime options, usage of this wrapper strays from the format of other application wrappers in Biopython.
A typical workflow would be to initialize a PAML object, specifying an alignment file, a tree file, an output file and a working directory. Next, runtime options are set via the set_options() method or by reading an existing control file. Finally, the program is run via the run() method and the output file is automatically parsed to a results dictionary.
Here is an example of typical usage of codeml:
>>> from Bio.Phylo.PAML import codeml
>>> cml = codeml.Codeml()
>>> cml.alignment = "Tests/PAML/alignment.phylip"
>>> cml.tree = "Tests/PAML/species.tree"
>>> cml.out_file = "results.out"
>>> cml.working_dir = "./scratch"
>>> cml.set_options(seqtype=1,
...         verbose=0,
...         noisy=0,
...         RateAncestor=0,
...         model=0,
...         NSsites=[0, 1, 2],
...         CodonFreq=2,
...         cleandata=1,
...         fix_alpha=1,
...         kappa=4.54006)
>>> results = cml.run()
>>> ns_sites = results.get("NSsites")
>>> m0 = ns_sites.get(0)
>>> m0_params = m0.get("parameters")
>>> print(m0_params.get("omega"))
Existing output files may be parsed as well using a module's read() function:
>>> results = codeml.read("Tests/PAML/Results/codeml/codeml_NSsites_all.out")
>>> print(results.get("lnL max"))
Detailed documentation for this new module currently lives on the Biopython wiki: http://biopython.org/wiki/PAML
13.7 Future plans
Bio.Phylo is under active development. Here are some features we might add in future releases:
New methods
Generally useful functions for operating on Tree or Clade objects appear on the Biopython wiki first, so that casual users can test them and decide if they're useful before we add them to Bio.Phylo:
http://biopython.org/wiki/Phylo_cookbook
Bio.Nexus port
Much of this module was written during Google Summer of Code 2009, under the auspices of NESCent, as a project to implement Python support for the phyloXML data format (see 13.4.4). Support for Newick and Nexus formats was added by porting part of the existing Bio.Nexus module to the new classes used by Bio.Phylo.
Currently, Bio.Nexus contains some useful features that have not yet been ported to Bio.Phylo classes, notably calculating a consensus tree. If you find some functionality lacking in Bio.Phylo, try poking through Bio.Nexus to see if it's there instead.
We're open to any suggestions for improving the functionality and usability of this module; just let us know on the mailing list or our bug database.
Finally, if you need additional functionality not yet included in the Phylo module, check if it's available in another of the high-quality Python libraries for phylogenetics such as DendroPy (http://pythonhosted.org/DendroPy/) or PyCogent (http://pycogent.org/). Since these libraries also support standard file formats for phylogenetic trees, you can easily transfer data between libraries by writing to a temporary file or StringIO object.
Chapter 14 Sequence motif analysis using Bio.motifs
This chapter gives an overview of the functionality of the Bio.motifs package included in Biopython. It is intended for people who are involved in the analysis of sequence motifs, so I'll assume that you are familiar with basic notions of motif analysis. In case something is unclear, please look at Section 14.10 for some relevant links.
Most of this chapter describes the new Bio.motifs package included in Biopython 1.61 onwards, which is replacing the older Bio.Motif package introduced with Biopython 1.50, which was in turn based on two older former Biopython modules, Bio.AlignAce and Bio.MEME. It provides most of their functionality with a unified motif object implementation.
Speaking of other libraries, if you are reading this you might be interested in TAMO, another Python library designed to deal with sequence motifs. It supports more de-novo motif finders, but it is not a part of Biopython and has some restrictions on commercial use.
14.1 Motif objects
Since we are interested in motif analysis, we need to take a look at Motif objects in the first place. For that we need to import the Bio.motifs library:
>>> from Bio import motifs
and we can start creating our first motif objects. We can either create a Motif object from a list of instances of the motif, or we can obtain a Motif object by parsing a file from a motif database or motif finding software.
14.1.1 Creating a motif from instances
Suppose we have these instances of a DNA motif:
>>> from Bio.Seq import Seq
>>> instances = [Seq("TACAA"),
...              Seq("TACGC"),
...              Seq("TACAC"),
...              Seq("TACCC"),
...              Seq("AACCC"),
...              Seq("AATGC"),
...              Seq("AATGC"),
...             ]
then we can create a Motif object as follows:
>>> m = motifs.create(instances)
The instances are saved in an attribute m.instances, which is essentially a Python list with some added functionality, as described below. Printing out the Motif object shows the instances from which it was constructed:
>>> print(m)
TACAA
TACGC
TACAC
TACCC
AACCC
AATGC
AATGC
<BLANKLINE>
The length of the motif is defined as the sequence length, which should be the same for all instances:
>>> len(m)
5
The Motif object has an attribute .counts containing the counts of each nucleotide at each position. Printing this counts matrix shows it in an easily readable format:
>>> print(m.counts)
        0      1      2      3      4
A:   3.00   7.00   0.00   2.00   1.00
C:   0.00   0.00   5.00   2.00   6.00
G:   0.00   0.00   0.00   3.00   0.00
T:   4.00   0.00   2.00   0.00   0.00
<BLANKLINE>
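To make explicit what the counts matrix holds, the same numbers can be recomputed by hand from the instances. This is only a plain-Python sketch using strings and a dictionary of lists; it does not use the Motif class:

```python
# The seven motif instances from the example above, as plain strings.
instances = ["TACAA", "TACGC", "TACAC", "TACCC", "AACCC", "AATGC", "AATGC"]

letters = "ACGT"
length = len(instances[0])

# One row per nucleotide, one column per position in the motif.
counts = {letter: [0] * length for letter in letters}
for instance in instances:
    for position, letter in enumerate(instance):
        counts[letter][position] += 1

for letter in letters:
    print(letter, counts[letter])
```

Each column sums to the number of instances (7 here), which is what the normalization step described later relies on.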
You can access these counts as a dictionary:
>>> m.counts['A']
[3, 7, 0, 2, 1]
but you can also think of it as a 2D array with the nucleotide as the first dimension and the position as the second dimension:
>>> m.counts['T', 0]
4
>>> m.counts['T', 2]
2
>>> m.counts['T', 3]
0
You can also directly access columns of the counts matrix:
>>> m.counts[:, 3]
{'A': 2, 'C': 2, 'T': 0, 'G': 3}
Instead of the nucleotide itself, you can also use the index of the nucleotide in the sorted letters in the alphabet of the motif:
>>> m.alphabet
>>> m.alphabet.letters
'GATC'
>>> sorted(m.alphabet.letters)
['A', 'C', 'G', 'T']
>>> m.counts['A', :]
(3, 7, 0, 2, 1)
>>> m.counts[0, :]
(3, 7, 0, 2, 1)
The motif has an associated consensus sequence, defined as the sequence of letters along the positions of the motif for which the largest value in the corresponding columns of the .counts matrix is obtained:
>>> m.consensus
Seq('TACGC', IUPACUnambiguousDNA())
as well as an anticonsensus sequence, corresponding to the smallest values in the columns of the .counts matrix:
>>> m.anticonsensus
Seq('GGGTG', IUPACUnambiguousDNA())
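The definition of the consensus can be checked by hand. The sketch below works on a plain dictionary holding the counts shown above rather than on a Motif object; it only covers the consensus, because the anticonsensus depends on how ties between equally rare letters are broken, which this sketch does not try to reproduce:

```python
# Counts matrix from the example above, as a plain dict of rows.
counts = {
    "A": [3, 7, 0, 2, 1],
    "C": [0, 0, 5, 2, 6],
    "G": [0, 0, 0, 3, 0],
    "T": [4, 0, 2, 0, 0],
}
length = len(counts["A"])

# One column of the matrix, as a letter -> count mapping.
column3 = {letter: row[3] for letter, row in counts.items()}

# For each position, pick the letter with the largest count.
consensus = "".join(
    max(counts, key=lambda letter: counts[letter][i]) for i in range(length)
)
print(consensus)
```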
You can also ask for a degenerate consensus sequence, in which ambiguous nucleotides are used for positions where there are multiple nucleotides with high counts:
>>> m.degenerate_consensus
Seq('WACVC', IUPACAmbiguousDNA())
Here, W and V follow the IUPAC nucleotide ambiguity codes: W is either A or T, and V is A, C, or G [10]. The degenerate consensus sequence is constructed following the rules specified by Cavener [11].
We can also get the reverse complement of a motif:
>>> r = m.reverse_complement()
>>> r.consensus
Seq('GCGTA', IUPACUnambiguousDNA())
>>> r.degenerate_consensus
Seq('GBGTW', IUPACAmbiguousDNA())
>>> print(r)
TTGTA
GCGTA
GTGTA
GGGTA
GGGTT
GCATT
GCATT
<BLANKLINE>
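The printed instances of the reverse-complemented motif are simply the reverse complements of the original instances. A minimal plain-Python sketch (strings only, with a hypothetical helper name) reproduces them:

```python
# Complement table for unambiguous DNA.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Complement every base, then reverse the string."""
    return sequence.translate(COMPLEMENT)[::-1]


instances = ["TACAA", "TACGC", "TACAC", "TACCC", "AACCC", "AATGC", "AATGC"]
rc_instances = [reverse_complement(s) for s in instances]

for s in rc_instances:
    print(s)
```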
The reverse complement and the degenerate consensus sequence are only defined for DNA motifs.
14.1.2 Creating a sequence logo
If we have internet access, we can create a weblogo:
>>> m.weblogo("mymotif.png")
We should get our logo saved as a PNG in the specified file.
14.2 Reading motifs
Creating motifs from instances by hand is a bit boring, so it's useful to have some I/O functions for reading and writing motifs. There are not any really well established standards for storing motifs, but there are a couple of formats that are more used than others.
14.2.1 JASPAR
One of the most popular motif databases is JASPAR. In addition to the motif sequence information, the JASPAR database stores a lot of meta-information for each motif. The module Bio.motifs contains a specialized class jaspar.Motif in which this meta-information is represented as attributes:
matrix_id - the unique JASPAR motif ID, e.g. MA0004.1
name - the name of the TF, e.g. Arnt
collection - the JASPAR collection to which the motif belongs, e.g. CORE
tf_class - the structural class of this TF, e.g. Zipper-Type
tf_family - the family to which this TF belongs, e.g. Helix-Loop-Helix
species - the species to which this TF belongs, may have multiple values, these are specified as taxonomy IDs, e.g. 10090
tax_group - the taxonomic supergroup to which this motif belongs, e.g. vertebrates
acc - the accession number of the TF protein, e.g. P53762
data_type - the type of data used to construct this motif, e.g. SELEX
medline - the Pubmed ID of literature supporting this motif, may be multiple values, e.g. 7592839
pazar_id - external reference to the TF in the PAZAR database, e.g. TF0000003
comment - free form text containing notes about the construction of the motif
The jaspar.Motif class inherits from the generic Motif class and therefore provides all the facilities of any of the motif formats: reading motifs, writing motifs, scanning sequences for motif instances etc.
JASPAR stores motifs in several different ways including three different flat file formats and as an SQL database. All of these formats facilitate the construction of a counts matrix. However, the amount of meta-information described above that is available varies with the format.
The JASPAR sites format
The first of the three flat file formats contains a list of instances. As an example, these are the beginning and ending lines of the JASPAR Arnt.sites file showing known binding sites of the mouse helix-loop-helix transcription factor Arnt.
>MA0004 ARNT 1
CACGTGatgtcctc
>MA0004 ARNT 2
CACGTGggaggtac
>MA0004 ARNT 3
CACGTGccgcgcgc
...
>MA0004 ARNT 18
AACGTGacagccctcc
>MA0004 ARNT 19
AACGTGcacatcgtcc
>MA0004 ARNT 20
aggaatCGCGTGc
The parts of the sequence in capital letters are the motif instances that were found to align to each other.
We can create a Motif object from these instances as follows:
>>> from Bio import motifs
>>> with open("Arnt.sites") as handle:
...     arnt = motifs.read(handle, "sites")
...
The instances from which this motif was created are stored in the .instances property:
>>> print(arnt.instances[:3])
[Seq('CACGTG', IUPACUnambiguousDNA()), Seq('CACGTG', IUPACUnambiguousDNA()), Seq('CACGTG', IUPACUnambiguousDNA())]
>>> for instance in arnt.instances:
...     print(instance)
...
CACGTG
CACGTG
CACGTG
CACGTG
CACGTG
CACGTG
CACGTG
CACGTG
CACGTG
CACGTG
CACGTG
CACGTG
CACGTG
CACGTG
CACGTG
AACGTG
AACGTG
AACGTG
AACGTG
CGCGTG
The counts matrix of this motif is automatically calculated from the instances:
>>> print(arnt.counts)
        0      1      2      3      4      5
A:   4.00  19.00   0.00   0.00   0.00   0.00
C:  16.00   0.00  20.00   0.00   0.00   0.00
G:   0.00   1.00   0.00  20.00   0.00  20.00
T:   0.00   0.00   0.00   0.00  20.00   0.00
<BLANKLINE>
This format does not store any meta-information.
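The convention that the capitalized part of each record is the aligned motif instance is easy to apply by hand. The following is a rough sketch, not the parser Bio.motifs uses; the function name is hypothetical and only three of the records shown above are included:

```python
import re

# Three records in the sites layout shown above (header line, then sequence).
sites_text = """\
>MA0004 ARNT 1
CACGTGatgtcctc
>MA0004 ARNT 2
CACGTGggaggtac
>MA0004 ARNT 20
aggaatCGCGTGc
"""


def extract_instances(text):
    """Return the upper-case run from every sequence line."""
    found = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and header lines
        match = re.search(r"[A-Z]+", line)
        if match:
            found.append(match.group())
    return found


site_instances = extract_instances(sites_text)
print(site_instances)
```

Note that the capitalized run may start anywhere in the line, as in the last record.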
The JASPAR pfm format
JASPAR also makes motifs available directly as a count matrix, without the instances from which it was created. This pfm format only stores the counts matrix for a single motif. For example, this is the JASPAR file SRF.pfm containing the counts matrix for the human SRF transcription factor:
 2  9  0  1 32  3 46  1 43 15  2  2
 1 33 45 45  1  1  0  0  0  1  0  1
39  2  1  0  0  0  0  0  0  0 44 43
 4  2  0  0 13 42  0 45  3 30  0  0
We can create a motif for this count matrix as follows:
>>> with open("SRF.pfm") as handle:
...     srf = motifs.read(handle, "pfm")
...
>>> print(srf.counts)
        0      1      2      3      4      5      6      7      8      9     10     11
A:   2.00   9.00   0.00   1.00  32.00   3.00  46.00   1.00  43.00  15.00   2.00   2.00
C:   1.00  33.00  45.00  45.00   1.00   1.00   0.00   0.00   0.00   1.00   0.00   1.00
G:  39.00   2.00   1.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00  44.00  43.00
T:   4.00   2.00   0.00   0.00  13.00  42.00   0.00  45.00   3.00  30.00   0.00   0.00
<BLANKLINE>
As this motif was created from the counts matrix directly, it has no instances associated with it:
>>> print(srf.instances)
None
We can now ask for the consensus sequence of these two motifs:
>>> print(arnt.counts.consensus)
CACGTG
>>> print(srf.counts.consensus)
GCCCATATATGG
As with the instances file, no meta-information is stored in this format.
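Because a pfm file is just four rows of whitespace-separated counts, it can also be read with a few lines of plain Python. This sketch assumes the rows appear in the order A, C, G, T, as the printed counts matrix above suggests; it is an illustration rather than the Bio.motifs parser:

```python
# Contents of SRF.pfm as shown above, one row per nucleotide.
pfm_text = """\
 2  9  0  1 32  3 46  1 43 15  2  2
 1 33 45 45  1  1  0  0  0  1  0  1
39  2  1  0  0  0  0  0  0  0 44 43
 4  2  0  0 13 42  0 45  3 30  0  0
"""

# Assumption: rows are ordered A, C, G, T.
rows = [
    [int(field) for field in line.split()]
    for line in pfm_text.splitlines()
    if line.strip()
]
srf_counts = dict(zip("ACGT", rows))

# Consensus: the most frequent letter in each column.
length = len(rows[0])
srf_consensus = "".join(
    max(srf_counts, key=lambda letter: srf_counts[letter][i])
    for i in range(length)
)
print(srf_consensus)
```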
The JASPAR format jaspar
The jaspar file format allows multiple motifs to be specified in a single file. In this format each of the motif records consists of a header line followed by four lines defining the counts matrix. The header line begins with a > character (similar to the Fasta file format) and is followed by the unique JASPAR matrix ID and the TF name. The following example shows a jaspar formatted file containing the three motifs Arnt, RUNX1 and MEF2A:
>MA0004.1 Arnt
A [ 4 19  0  0  0  0 ]
C [16  0 20  0  0  0 ]
G [ 0  1  0 20  0 20 ]
T [ 0  0  0  0 20  0 ]
>MA0002.1 RUNX1
A [10 12  4  1  2  2  0  0  0  8 13 ]
C [ 2  2  7  1  0  8  0  0  1  2  2 ]
G [ 3  1  1  0 23  0 26 26  0  0  4 ]
T [11 11 14 24  1 16  0  0 25 16  7 ]
>MA0052.1 MEF2A
A [ 1  0 57  2  9  6 37  2 56  6 ]
C [50  0  1  1  0  0  0  0  0  0 ]
G [ 0  0  0  0  0  0  0  0  2 50 ]
T [ 7 58  0 55 49 52 21 56  0  2 ]
The motifs are read as follows:
>>> fh = open("jaspar_motifs.txt")
>>> for m in motifs.parse(fh, "jaspar"):
...     print(m)
TF name  Arnt
Matrix ID  MA0004.1
Matrix:
        0      1      2      3      4      5
A:   4.00  19.00   0.00   0.00   0.00   0.00
C:  16.00   0.00  20.00   0.00   0.00   0.00
G:   0.00   1.00   0.00  20.00   0.00  20.00
T:   0.00   0.00   0.00   0.00  20.00   0.00
TF name  RUNX1
Matrix ID  MA0002.1
Matrix:
        0      1      2      3      4      5      6      7      8      9     10
A:  10.00  12.00   4.00   1.00   2.00   2.00   0.00   0.00   0.00   8.00  13.00
C:   2.00   2.00   7.00   1.00   0.00   8.00   0.00   0.00   1.00   2.00   2.00
G:   3.00   1.00   1.00   0.00  23.00   0.00  26.00  26.00   0.00   0.00   4.00
T:  11.00  11.00  14.00  24.00   1.00  16.00   0.00   0.00  25.00  16.00   7.00
TF name  MEF2A
Matrix ID  MA0052.1
Matrix:
        0      1      2      3      4      5      6      7      8      9
A:   1.00   0.00  57.00   2.00   9.00   6.00  37.00   2.00  56.00   6.00
C:  50.00   0.00   1.00   1.00   0.00   0.00   0.00   0.00   0.00   0.00
G:   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   2.00  50.00
T:   7.00  58.00   0.00  55.00  49.00  52.00  21.00  56.00   0.00   2.00
Note that printing a JASPAR motif yields both the counts data and the available meta-information.
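The record layout described above (a > header line with matrix ID and TF name, then four bracketed rows) can be sketched as a small hand-written reader. This is a hypothetical illustration of the file layout only; the function name and the returned dictionaries are assumptions, and real files should be read with motifs.parse:

```python
import re

# Two of the records shown above, in the jaspar layout.
jaspar_text = """\
>MA0004.1 Arnt
A [ 4 19  0  0  0  0 ]
C [16  0 20  0  0  0 ]
G [ 0  1  0 20  0 20 ]
T [ 0  0  0  0 20  0 ]
>MA0052.1 MEF2A
A [ 1  0 57  2  9  6 37  2 56  6 ]
C [50  0  1  1  0  0  0  0  0  0 ]
G [ 0  0  0  0  0  0  0  0  2 50 ]
T [ 7 58  0 55 49 52 21 56  0  2 ]
"""


def parse_jaspar_text(text):
    """Return a list of dicts with matrix_id, name and counts per record."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header: matrix ID, then the TF name.
            matrix_id, name = line[1:].split(None, 1)
            records.append({"matrix_id": matrix_id, "name": name, "counts": {}})
        else:
            # Count row: nucleotide letter, then integers inside brackets.
            letter = line[0]
            numbers = [int(n) for n in re.findall(r"\d+", line)]
            records[-1]["counts"][letter] = numbers
    return records


jaspar_records = parse_jaspar_text(jaspar_text)
for record in jaspar_records:
    print(record["matrix_id"], record["name"], len(record["counts"]["A"]))
```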
Accessing the JASPAR database
In addition to parsing these flat file formats, we can also retrieve motifs from a JASPAR SQL database. Unlike the flat file formats, a JASPAR database allows storing of all possible meta-information defined in the JASPAR Motif class. It is beyond the scope of this document to describe how to set up a JASPAR database (please see the main JASPAR website). Motifs are read from a JASPAR database using the Bio.motifs.jaspar.db module. First connect to the JASPAR database using the JASPAR5 class which models the latest JASPAR schema:
>>> from Bio.motifs.jaspar.db import JASPAR5
>>>
>>> JASPAR_DB_HOST = <hostname>
>>> JASPAR_DB_NAME = <db_name>
>>> JASPAR_DB_USER = <user>
>>> JASPAR_DB_PASS = <password>
>>>
>>> jdb = JASPAR5(
...     host=JASPAR_DB_HOST,
...     name=JASPAR_DB_NAME,
...     user=JASPAR_DB_USER,
...     password=JASPAR_DB_PASS
... )
Now we can fetch a single motif by its unique JASPAR ID with the fetch_motif_by_id method. Note that a JASPAR ID consists of a base ID and a version number separated by a decimal point, e.g. MA0004.1. The fetch_motif_by_id method allows you to use either the fully specified ID or just the base ID. If only the base ID is provided, the latest version of the motif is returned.
>>> arnt = jdb.fetch_motif_by_id("MA0004")
Printing the motif reveals that the JASPAR SQL database stores much more meta-information than the flat files:
>>> print(arnt)
TF name  Arnt
Matrix ID  MA0004.1
Collection  CORE
TF class  Zipper-Type
TF family  Helix-Loop-Helix
Species  10090
Taxonomic group  vertebrates
Accession  ['P53762']
Data type used  SELEX
Medline  7592839
PAZAR ID  TF0000003
Comments
Matrix:
        0      1      2      3      4      5
A:   4.00  19.00   0.00   0.00   0.00   0.00
C:  16.00   0.00  20.00   0.00   0.00   0.00
G:   0.00   1.00   0.00  20.00   0.00  20.00
T:   0.00   0.00   0.00   0.00  20.00   0.00
We can also fetch motifs by name. The name must be an exact match (partial matches or database wildcards are not currently supported). Note that as the name is not guaranteed to be unique, the fetch_motifs_by_name method actually returns a list.
>>> motifs = jdb.fetch_motifs_by_name("Arnt")
>>> print(motifs[0])
TF name  Arnt
Matrix ID  MA0004.1
Collection  CORE
TF class  Zipper-Type
TF family  Helix-Loop-Helix
Species  10090
Taxonomic group  vertebrates
Accession  ['P53762']
Data type used  SELEX
Medline  7592839
PAZAR ID  TF0000003
Comments
Matrix:
        0      1      2      3      4      5
A:   4.00  19.00   0.00   0.00   0.00   0.00
C:  16.00   0.00  20.00   0.00   0.00   0.00
G:   0.00   1.00   0.00  20.00   0.00  20.00
T:   0.00   0.00   0.00   0.00  20.00   0.00
The fetch_motifs method allows you to fetch motifs which match a specified set of criteria. These criteria include any of the above described meta-information as well as certain matrix properties such as the minimum information content (min_ic in the example below), the minimum length of the matrix or the minimum number of sites used to construct the matrix. Only motifs which pass ALL the specified criteria are returned. Note that selection criteria which correspond to meta-information which allow for multiple values may be specified as either a single value or a list of values, e.g. tax_group and tf_family in the example below.
>>> motifs = jdb.fetch_motifs(
...     collection='CORE',
...     tax_group=['vertebrates', 'insects'],
...     tf_class='Winged Helix-Turn-Helix',
...     tf_family=['Forkhead', 'Ets'],
...     min_ic=12
... )
>>> for motif in motifs:
...     pass  # do something with the motif
Compatibility with Perl TFBS modules
An important thing to note is that the JASPAR Motif class was designed to be compatible with the popular Perl TFBS modules. Therefore some specifics about the choice of defaults for background and pseudocounts, as well as how information content is computed and sequences are searched for instances, are based on this compatibility criterion. These choices are noted in the specific subsections below.
Choice of background:
The Perl TFBS modules appear to allow a choice of custom background probabilities (although the documentation states that uniform background is assumed). However the default is to use a uniform background. Therefore it is recommended that you use a uniform background for computing the position-specific scoring matrix (PSSM). This is the default when using the Biopython motifs module.
Choice of pseudocounts:
By default, the Perl TFBS modules use a pseudocount equal to N * bg[nucleotide], where N represents the total number of sequences used to construct the matrix. To apply this same pseudocount formula, set the motif pseudocounts attribute using the jaspar.calculate_pseudocounts() function:
>>> motif.pseudocounts = motifs.jaspar.calculate_pseudocounts(motif)
Note that it is possible for the counts matrix to have an unequal number of sequences making up the columns. The pseudocount computation uses the average number of sequences making up the matrix. However, when normalize is called on the counts matrix, each count value in a column is divided by the total number of sequences making up that specific column, not by the average number of sequences. This differs from the Perl TFBS modules because the normalization is not done as a separate step and so the average number of sequences is used throughout the computation of the PSSM. Therefore, for matrices with unequal column counts, the PSSM computed by the motifs module will differ somewhat from the PSSM computed by the Perl TFBS modules.
Computation of matrix information content:
The information content (IC) or specificity of a matrix is computed using the mean method of the PositionSpecificScoringMatrix class. However of note, in the Perl TFBS modules the default behaviour is to compute the IC without first applying pseudocounts, even though by default the PSSMs are computed using pseudocounts as described above.
Searching for instances:
Searching for instances with the Perl TFBS motifs was usually performed using a relative score threshold, i.e. a score in the range 0 to 1. In order to compute the absolute PSSM score corresponding to a relative score one can use the equation:
>>> abs_score = (pssm.max - pssm.min) * rel_score + pssm.min
To convert the absolute score of an instance back to a relative score, one can use the equation:
>>> rel_score = (abs_score - pssm.min) / (pssm.max - pssm.min)
For example, using the Arnt motif before, let's search a sequence with a relative score threshold of 0.8.
>>> test_seq = Seq("TAAGCGTGCACGCGCAACACGTGCATTA", unambiguous_dna)
>>> arnt.pseudocounts = motifs.jaspar.calculate_pseudocounts(arnt)
>>> pssm = arnt.pssm
>>> max_score = pssm.max
>>> min_score = pssm.min
>>> abs_score_threshold = (max_score - min_score) * 0.8 + min_score
>>> for position, score in pssm.search(test_seq,
...                                    threshold=abs_score_threshold):
...     rel_score = (score - min_score) / (max_score - min_score)
...     print("Position %d: score = %5.3f, rel. score = %5.3f" % (
...           position, score, rel_score))
...
Position 2: score = 5.362, rel. score = 0.801
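The two conversion equations are inverses of each other, which can be checked with a small plain-Python sketch. The function names and the minimum and maximum scores used here are hypothetical values chosen for illustration, not the Arnt PSSM limits:

```python
def abs_from_rel(rel_score, min_score, max_score):
    """Map a relative score in [0, 1] onto the PSSM's absolute score range."""
    return (max_score - min_score) * rel_score + min_score


def rel_from_abs(abs_score, min_score, max_score):
    """Map an absolute PSSM score back onto the relative [0, 1] scale."""
    return (abs_score - min_score) / (max_score - min_score)


# Hypothetical PSSM limits, for illustration only.
example_min, example_max = -10.0, 10.0

threshold = abs_from_rel(0.8, example_min, example_max)
round_trip = rel_from_abs(threshold, example_min, example_max)
print(threshold, round_trip)
```

A relative score of 0 maps to the minimum PSSM score and a relative score of 1 to the maximum, so a threshold of 0.8 sits four fifths of the way up the achievable range.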
14.2.2 MEME
MEME [12] is a tool for discovering motifs in a group of related DNA or protein sequences. It takes as input a group of DNA or protein sequences and outputs as many motifs as requested. Therefore, in contrast to JASPAR files, MEME output files typically contain multiple motifs. This is an example.
The top of an output file generated by MEME shows some background information about MEME and the version of MEME used:
********************************************************************************
MEME - Motif discovery tool
********************************************************************************
MEME version 3.0 (Release date: 2004/08/18 09:07:01)
...
Further down, the input set of training sequences is recapitulated:
********************************************************************************
TRAINING SET
********************************************************************************
DATAFILE= INO_up800.s
ALPHABET= ACGT
Sequence name   Weight  Length  Sequence name   Weight  Length

CHO1            1.0000  800     CHO2            1.0000  800
FAS1            1.0000  800     FAS2            1.0000  800
ACC1            1.0000  800     INO1            1.0000  800
OPI3            1.0000  800
********************************************************************************
and the exact command line that was used:
********************************************************************************
COMMAND LINE SUMMARY
********************************************************************************
This information can also be useful in the event you wish to report a
problem with the MEME software.
command: meme -mod oops -dna -revcomp -nmotifs 2 -bfile yeast.nc.6.freq INO_up800.s
...
Next is detailed information on each motif that was found:
********************************************************************************
MOTIF 1 width = 12 sites = 7 llr = 95 E-value = 2.0e-001
********************************************************************************

Motif 1 Description

Simplified        A  :::9:a::::3:
pos.-specific     C  ::a:9:11691a
probability       G  ::::1::94:4:
matrix            T  aa:1::9::11:
To parse this file (stored as meme.dna.oops.txt), use
>>> with open("meme.dna.oops.txt") as handle:
...     record = motifs.parse(handle, "meme")
...
The motifs.parse command reads the complete file directly, so you can close the file after calling motifs.parse. The header information is stored in attributes:
>>> record.version
'3.0'
>>> record.datafile
'INO_up800.s'
>>> record.command
'meme -mod oops -dna -revcomp -nmotifs 2 -bfile yeast.nc.6.freq INO_up800.s'
>>> record.alphabet
>>> record.sequences
['CHO1', 'CHO2', 'FAS1', 'FAS2', 'ACC1', 'INO1', 'OPI3']
The record is an object of the Bio.motifs.meme.Record class. The class inherits from list, and you can think of record as a list of Motif objects:
>>> len(record)
2
>>> motif = record[0]
>>> print(motif.consensus)
TTCACATGCCGC
>>> print(motif.degenerate_consensus)
TTCACATGSCNC
In addition to these generic motif attributes, each motif also stores its specific information as calculated by MEME. For example,
>>> motif.num_occurrences
7
>>> motif.length
12
>>> evalue = motif.evalue
>>> print("%3.1g" % evalue)
0.2
>>> motif.name
'Motif 1'
In addition to using an index into the record, as we did above, you can also find it by its name:
>>> motif = record['Motif 1']
Each motif has an attribute .instances with the sequence instances in which the motif was found, providing some information on each instance:
>>> len(motif.instances)
7
>>> motif.instances[0]
Instance('TTCACATGCCGC', IUPACUnambiguousDNA())
>>> motif.instances[0].motif_name
'Motif 1'
>>> motif.instances[0].sequence_name
'INO1'
>>> motif.instances[0].start
620
>>> motif.instances[0].strand
'-'
>>> motif.instances[0].length
12
>>> pvalue = motif.instances[0].pvalue
>>> print("%5.3g" % pvalue)
1.85e-08
MAST
14.2.3 TRANSFAC
TRANSFAC is a manually curated database of transcription factors, together with their genomic binding sites and DNA binding profiles [27]. While the file format used in the TRANSFAC database is nowadays also used by others, we will refer to it as the TRANSFAC file format.
A minimal file in the TRANSFAC format looks as follows:
ID  motif1
P0      A      C      G      T
01      1      2      2      0      S
02      2      1      2      0      R
03      3      0      1      1      A
04      0      5      0      0      C
05      5      0      0      0      A
06      0      0      4      1      G
07      0      1      4      0      G
08      0      0      0      5      T
09      0      0      5      0      G
10      0      1      2      2      K
11      0      2      0      3      Y
12      1      0      3      1      G
//
This file shows the frequency matrix of motif motif1 of 12 nucleotides. In general, one file in the TRANSFAC format can contain multiple motifs. For example, this is the contents of the example TRANSFAC file transfac.dat:
VV  EXAMPLE January 15, 2013
XX
//
ID  motif1
P0      A      C      G      T
01      1      2      2      0      S
02      2      1      2      0      R
03      3      0      1      1      A
...
11      0      2      0      3      Y
12      1      0      3      1      G
//
ID  motif2
P0      A      C      G      T
01      2      1      2      0      R
02      1      2      2      0      S
...
09      0      0      0      5      T
10      0      2      0      3      Y
//
To parse a TRANSFAC file, use
>>> with open("transfac.dat") as handle:
...     record = motifs.parse(handle, "TRANSFAC")
...
The overall version number, if available, is stored as record.version:
>>> record.version
'EXAMPLE January 15, 2013'
Each motif in record is an instance of the Bio.motifs.transfac.Motif class, which inherits both from the Bio.motifs.Motif class and from a Python dictionary. The dictionary uses the two-letter keys to store any additional information about the motif:
>>> motif = record[0]
>>> motif.degenerate_consensus  # Using the Bio.motifs.Motif method
Seq('SRACAGGTGKYG', IUPACAmbiguousDNA())
>>> motif['ID']  # Using motif as a dictionary
'motif1'
TRANSFAC files are typically much more elaborate than this example, containing lots of additional information about the motif. Table 14.1 lists the two-letter field codes that are commonly found in TRANSFAC files:
Table 14.1: Fields commonly found in TRANSFAC files
AC  Accession number
AS  Accession numbers, secondary
BA  Statistical basis
BF  Binding factors
BS  Factor binding sites underlying the matrix
CC  Comments
CO  Copyright notice
DE  Short factor description
DR  External databases
DT  Date created/updated
HC  Subfamilies
HP  Superfamilies
ID  Identifier
NA  Name of the binding factor
OC  Taxonomic classification
OS  Species/Taxon
OV  Older version
PV  Preferred version
TY  Type
XX  Empty line; these are not stored in the Record.
Each motif also has an attribute .references containing the references associated with the motif, using these two-letter keys:
Table 14.2: Fields used to store references in TRANSFAC files
RN  Reference number
RA  Reference authors
RL  Reference data
RT  Reference title
RX  PubMed ID
Printing the motifs writes them out in their native TRANSFAC format:
>>> print(record)
VV  EXAMPLE January 15, 2013
XX
//
ID  motif1
XX
P0      A      C      G      T
01      1      2      2      0      S
02      2      1      2      0      R
03      3      0      1      1      A
04      0      5      0      0      C
05      5      0      0      0      A
06      0      0      4      1      G
07      0      1      4      0      G
08      0      0      0      5      T
09      0      0      5      0      G
10      0      1      2      2      K
11      0      2      0      3      Y
12      1      0      3      1      G
XX
//
ID  motif2
XX
P0      A      C      G      T
01      2      1      2      0      R
02      1      2      2      0      S
03      0      5      0      0      C
04      3      0      1      1      A
05      0      0      4      1      G
06      5      0      0      0      A
07      0      1      4      0      G
08      0      0      5      0      G
09      0      0      0      5      T
10      0      2      0      3      Y
XX
//
<BLANKLINE>
You can export the motifs in the TRANSFAC format by capturing this output in a string and saving it in a file:
>>> text = str(record)
>>> with open("mytransfacfile.dat", 'w') as out_handle:
...     out_handle.write(text)
...
14.3 Writing motifs
Speaking of exporting, let's look at export functions in general. We can use the format method to write the motif in the simple JASPAR pfm format:
>>> print(arnt.format("pfm"))
  4.00  19.00   0.00   0.00   0.00   0.00
 16.00   0.00  20.00   0.00   0.00   0.00
  0.00   1.00   0.00  20.00   0.00  20.00
  0.00   0.00   0.00   0.00  20.00   0.00
Similarly, we can use format to write the motif in the JASPAR jaspar format:
>>> print(arnt.format("jaspar"))
>MA0004.1  Arnt
A [  4.00  19.00   0.00   0.00   0.00   0.00]
C [ 16.00   0.00  20.00   0.00   0.00   0.00]
G [  0.00   1.00   0.00  20.00   0.00  20.00]
T [  0.00   0.00   0.00   0.00  20.00   0.00]
To write the motif in a TRANSFAC-like matrix format, use
>>> print(m.format("transfac"))
P0      A      C      G      T
01      3      0      0      4      W
02      7      0      0      0      A
03      0      5      0      2      C
04      2      2      3      0      V
05      1      6      0      0      C
XX
//
<BLANKLINE>
To write out multiple motifs, you can use motifs.write. This function can be used regardless of whether the motifs originated from a TRANSFAC file. For example,
>>> two_motifs = [arnt, srf]
>>> print(motifs.write(two_motifs, 'transfac'))
P0      A      C      G      T
01      4     16      0      0      C
02     19      0      1      0      A
03      0     20      0      0      C
04      0      0     20      0      G
05      0      0      0     20      T
06      0      0     20      0      G
XX
//
P0      A      C      G      T
01      2      1     39      4      G
02      9     33      2      2      C
03      0     45      1      0      C
04      1     45      0      0      C
05     32      1      0     13      A
06      3      1      0     42      T
07     46      0      0      0      A
08      1      0      0     45      T
09     43      0      0      3      A
10     15      1      0     30      T
11      2      0     44      0      G
12      2      1     43      0      G
XX
//
<BLANKLINE>
Or, to write multiple motifs in the jaspar format:
>>> two_motifs = [arnt, mef2a]
>>> print(motifs.write(two_motifs, "jaspar"))
>MA0004.1  Arnt
A [  4.00  19.00   0.00   0.00   0.00   0.00]
C [ 16.00   0.00  20.00   0.00   0.00   0.00]
G [  0.00   1.00   0.00  20.00   0.00  20.00]
T [  0.00   0.00   0.00   0.00  20.00   0.00]
>MA0052.1  MEF2A
A [  1.00   0.00  57.00   2.00   9.00   6.00  37.00   2.00  56.00   6.00]
C [ 50.00   0.00   1.00   1.00   0.00   0.00   0.00   0.00   0.00   0.00]
G [  0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   2.00  50.00]
T [  7.00  58.00   0.00  55.00  49.00  52.00  21.00  56.00   0.00   2.00]
14.4 Position-Weight Matrices
The .counts attribute of a Motif object shows how often each nucleotide appeared at each position along the alignment. We can normalize this matrix by dividing by the number of instances in the alignment, resulting in the probability of each nucleotide at each position along the alignment. We refer to these probabilities as the position-weight matrix. However, beware that in the literature this term may also be used to refer to the position-specific scoring matrix, which we discuss below.
Usually, pseudocounts are added to each position before normalizing. This avoids overfitting of the position-weight matrix to the limited number of motif instances in the alignment, and can also prevent probabilities from becoming zero. To add a fixed pseudocount to all nucleotides at all positions, specify a number for the pseudocounts argument:
>>> pwm = m.counts.normalize(pseudocounts=0.5)
>>> print(pwm)
        0      1      2      3      4
A:   0.39   0.83   0.06   0.28   0.17
C:   0.06   0.06   0.61   0.28   0.72
G:   0.06   0.06   0.06   0.39   0.06
T:   0.50   0.06   0.28   0.06   0.06
<BLANKLINE>
Alternatively, pseudocounts can be a dictionary specifying the pseudocounts for each nucleotide. For example, as the GC content of the human genome is about 40%, you may want to choose the pseudocounts accordingly:
>>> pwm = m.counts.normalize(pseudocounts={'A': 0.6, 'C': 0.4, 'G': 0.4, 'T': 0.6})
>>> print(pwm)
        0      1      2      3      4
A:   0.40   0.84   0.07   0.29   0.18
C:   0.04   0.04   0.60   0.27   0.71
G:   0.04   0.04   0.04   0.38   0.04
T:   0.51   0.07   0.29   0.07   0.07
<BLANKLINE>
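The numbers in both matrices follow from adding the pseudocount to each count and dividing by the column total, pseudocounts included. The sketch below redoes this arithmetic on a plain dictionary; the function name normalize_counts is hypothetical, and this is not the code behind m.counts.normalize:

```python
def normalize_counts(counts, pseudocounts=0.0):
    """Turn a counts matrix (dict of rows) into per-column probabilities."""
    letters = sorted(counts)
    if not isinstance(pseudocounts, dict):
        # A single number means the same pseudocount for every letter.
        pseudocounts = {letter: pseudocounts for letter in letters}
    length = len(counts[letters[0]])
    pwm = {letter: [] for letter in letters}
    for i in range(length):
        total = sum(counts[letter][i] + pseudocounts[letter] for letter in letters)
        for letter in letters:
            pwm[letter].append((counts[letter][i] + pseudocounts[letter]) / total)
    return pwm


counts = {
    "A": [3, 7, 0, 2, 1],
    "C": [0, 0, 5, 2, 6],
    "G": [0, 0, 0, 3, 0],
    "T": [4, 0, 2, 0, 0],
}

pwm_fixed = normalize_counts(counts, pseudocounts=0.5)
pwm_gc = normalize_counts(counts, {"A": 0.6, "C": 0.4, "G": 0.4, "T": 0.6})
print(round(pwm_fixed["A"][0], 2), round(pwm_gc["A"][0], 2))
```

For the first column with a fixed pseudocount of 0.5, A becomes (3 + 0.5) / (7 + 4 × 0.5) = 0.39, matching the printed matrix.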
The position-weight matrix has its own methods to calculate the consensus, anticonsensus, and degenerate consensus sequences:
>>> pwm.consensus
Seq('TACGC', IUPACUnambiguousDNA())
>>> pwm.anticonsensus
Seq('GGGTG', IUPACUnambiguousDNA())
>>> pwm.degenerate_consensus
Seq('WACNC', IUPACAmbiguousDNA())
Note that due to the pseudocounts, the degenerate consensus sequence calculated from the position-weight matrix is slightly different from the degenerate consensus sequence calculated from the instances in the motif:
>>> m.degenerate_consensus
Seq('WACVC', IUPACAmbiguousDNA())
The reverse complement of the position-weight matrix can be calculated directly from the pwm:
>>> rpwm = pwm.reverse_complement()
>>> print(rpwm)
        0      1      2      3      4
A:   0.07   0.07   0.29   0.07   0.51
C:   0.04   0.38   0.04   0.04   0.04
G:   0.71   0.27   0.60   0.04   0.04
T:   0.18   0.29   0.07   0.84   0.40
<BLANKLINE>
14.5 Position-Specific Scoring Matrices
Using the background distribution and the PWM with pseudocounts added, it is easy to compute the log-odds ratios, which tell us the log odds of a particular symbol
coming from the motif rather than from the background. We can use the .log_odds() method on the position-weight matrix:
>>> pssm = pwm.log_odds()
>>> print(pssm)
        0      1      2      3      4
A:   0.68   1.76  -1.91   0.21  -0.49
C:  -2.49  -2.49   1.26   0.09   1.51
G:  -2.49  -2.49  -2.49   0.60  -2.49
T:   1.03  -1.91   0.21  -1.91  -1.91
<BLANKLINE>
Here we can see positive values for symbols more frequent in the motif than in the background, and negative values for symbols more frequent in the background. A value of 0.0
means that it is equally likely to see a symbol in the background and in the motif.
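As a rough illustration of what a log-odds entry is, the sketch below computes log2(p/b) for each matrix entry in plain Python. It is not Biopython's code; the function name log_odds and the one-column example matrix are assumptions, with the probabilities taken as the exact fractions behind the first column of the pseudocount example (3.6/9, 0.4/9, 0.4/9, 4.6/9).

```python
import math


def log_odds(pwm, background=None):
    """Return log2(probability / background) for every entry of a PWM."""
    letters = sorted(pwm)
    if background is None:
        # Uniform background: every letter is equally likely.
        background = {letter: 1.0 / len(letters) for letter in letters}
    pssm = {}
    for letter in letters:
        row = []
        for p in pwm[letter]:
            if p > 0:
                row.append(math.log2(p / background[letter]))
            else:
                # A zero probability gives a log-odds score of minus infinity.
                row.append(float("-inf"))
        pssm[letter] = row
    return pssm


# One-column PWM (assumed values, see the lead-in).
pwm_column = {"A": [3.6 / 9], "C": [0.4 / 9], "G": [0.4 / 9], "T": [4.6 / 9]}
pssm_uniform = log_odds(pwm_column)
pssm_gc40 = log_odds(pwm_column, {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3})
print("%.2f %.2f" % (pssm_uniform["A"][0], pssm_gc40["A"][0]))
```

An entry is positive exactly when the motif probability exceeds the background probability of that letter, which is why changing the background shifts every row by a constant.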
This assumes that A, C, G, and T are equally likely in the background. To calculate the position-specific scoring matrix against a background with unequal probabilities
for A, C, G, T, use the background argument. For example, against a background with a 40% GC content, use
>>> background = {'A': 0.3, 'C': 0.2, 'G': 0.2, 'T': 0.3}
>>> pssm = pwm.log_odds(background)
>>> print(pssm)
        0      1      2      3      4
A:   0.42   1.49  -2.17  -0.05  -0.75
C:  -2.17  -2.17   1.58   0.42   1.83
G:  -2.17  -2.17  -2.17   0.92  -2.17
T:   0.77  -2.17  -0.05  -2.17  -2.17
<BLANKLINE>
The maximum and minimum score obtainable from the PSSM are stored in the .max and .min properties:
>>> print("%4.2f" % pssm.max)
6.59
>>> print("%4.2f" % pssm.min)
-10.85
The mean and standard deviation of the PSSM scores with respect to a specific background are calculated by the .mean and .std methods.
>>> mean = pssm.mean(background)
>>> std = pssm.std(background)
>>> print("mean = %0.2f, standard deviation = %0.2f" % (mean, std))
mean = 3.21, standard deviation = 2.59
A uniform background is used if background is not specified. The mean is particularly important, as its value is equal to the Kullback-Leibler divergence or relative
entropy, and is a measure for the information content of the motif compared to the background. As in Biopython the base-2 logarithm is used in the calculation of the
log-odds scores, the information content has units of bits.
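The link between the mean score and the relative entropy can be made explicit with a short sketch: the expected log-odds score under the motif model is the sum over columns and letters of p * log2(p/b). The function name pssm_mean and the two toy matrices are assumptions for illustration; this is not Biopython's implementation.

```python
import math


def pssm_mean(pwm, background):
    """Expected log-odds score under the motif model, summed over columns (in bits)."""
    length = len(next(iter(pwm.values())))
    total = 0.0
    for i in range(length):
        for letter, probabilities in pwm.items():
            p = probabilities[i]
            if p > 0:  # terms with p == 0 contribute nothing
                total += p * math.log2(p / background[letter])
    return total


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
# A column identical to the background carries no information.
flat = {"A": [0.25], "C": [0.25], "G": [0.25], "T": [0.25]}
# A column that is always 'A' carries log2(4) = 2 bits against a uniform background.
fixed = {"A": [1.0], "C": [0.0], "G": [0.0], "T": [0.0]}
print(pssm_mean(flat, uniform), pssm_mean(fixed, uniform))
```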
The .reverse_complement, .consensus, .anticonsensus, and .degenerate_consensus methods can be applied directly to PSSM objects.
14.6 Searching for instances
The most frequent use for a motif is to find its instances in some sequence. For the sake of this section, we will use an artificial sequence like this:
>>> test_seq = Seq("TACACTGCATTACAACCCAAGCATTA", m.alphabet)
>>> len(test_seq)
26
14.6.1 Searching for exact matches
The simplest way to find instances is to look for exact matches of the true instances of the motif:
>>> for pos, seq in m.instances.search(test_seq):
...     print("%i %s" % (pos, seq))
...
0 TACAC
10 TACAA
13 AACCC
We can do the same with the reverse complement (to find instances on the complementary strand):
>>> for pos, seq in r.instances.search(test_seq):
...     print("%i %s" % (pos, seq))
...
6 GCATT
20 GCATT
14.6.2 Searching for matches using the PSSM score
It is just as easy to look for positions that give rise to high log-odds scores against our motif:
>>> for position, score in pssm.search(test_seq, threshold=3.0):
...     print("Position %d: score = %5.3f" % (position, score))
...
Position 0: score = 5.622
Position -20: score = 4.601
Position 10: score = 3.037
Position 13: score = 5.738
Position -6: score = 4.601
The negative positions refer to instances of the motif found on the reverse strand of the test sequence, and follow the Python convention on negative indices. Therefore,
the instance of the motif at pos is located at test_seq[pos:pos+len(m)] both for positive and for negative values of pos.
You may notice the threshold parameter, here set arbitrarily to 3.0. This is in log2, so we are now looking only for words that are at least eight times more likely to occur under
the motif model than in the background. The default threshold is 0.0, which selects everything that looks more like the motif than the background.
You can also calculate the scores at all positions along the sequence:
>>> pssm.calculate(test_seq)
array([  5.62230396,  -5.6796999 ,  -3.43177247,   0.93827754,
        -6.84962511,  -2.04066086, -10.84962463,  -3.65614533,
        -0.03370807,  -3.91102552,   3.03734159,  -2.14918518,
        -0.6016975 ,   5.7381525 ,  -0.50977498,  -3.56422281,
        -8.73414803,  -0.09919716,  -0.6016975 ,  -2.39429784,
       -10.84962463,  -3.65614533], dtype=float32)
In general, this is the fastest way to calculate PSSM scores. The scores returned by pssm.calculate are for the forward strand only. To obtain the scores on the reverse
strand, you can take the reverse complement of the PSSM:
>>> rpssm = pssm.reverse_complement()
>>> rpssm.calculate(test_seq)
array([ -9.43458748,  -3.06172252,  -7.18665981,  -7.76216221,
        -2.04066086,  -4.26466274,   4.60124254,  -4.2480607 ,
        -8.73414803,  -2.26503372,  -6.49598789,  -5.64668512,
        -8.73414803, -10.84962463,  -4.82356262,  -4.82356262,
        -5.64668512,  -8.73414803,  -4.15613794,  -5.6796999 ,
         4.60124254,  -4.2480607 ], dtype=float32)
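The score at each position is simply the sum of the PSSM entries selected by the letters in the window starting there, so a sequence of length n and a motif of length L give n - L + 1 scores (22 for the 26-letter test sequence and the 5-column motif). The sketch below illustrates this in plain Python; the function names and the small two-column matrix are hypothetical and are not part of Biopython.

```python
def scan(pssm, sequence):
    """Score every window of the sequence against the PSSM (forward strand only)."""
    length = len(next(iter(pssm.values())))
    scores = []
    for start in range(len(sequence) - length + 1):
        window = sequence[start:start + length]
        scores.append(sum(pssm[letter][i] for i, letter in enumerate(window)))
    return scores


def reverse_complement(pssm):
    """Swap complementary letters and reverse the column order."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return {complement[letter]: row[::-1] for letter, row in pssm.items()}


# Hypothetical 2-column PSSM that favours the word "AC".
toy_pssm = {
    "A": [1.0, -1.0],
    "C": [-1.0, 1.0],
    "G": [-2.0, -2.0],
    "T": [-2.0, -2.0],
}
toy_seq = "ACGT"
forward = scan(toy_pssm, toy_seq)
reverse = scan(reverse_complement(toy_pssm), toy_seq)
# A reverse-strand hit at forward index i corresponds to position i - len(sequence).
best = max(range(len(reverse)), key=lambda i: reverse[i])
print(forward, reverse, best - len(toy_seq))
```

Here the best reverse-strand window is "GT", the reverse complement of "AC", and its negative position indexes the same slice of the sequence as the forward index does.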
14.6.3 Selecting a score threshold
If you want to use a less arbitrary way of selecting thresholds, you can explore the distribution of PSSM scores. Since the space for a score distribution grows
exponentially with motif length, we are using an approximation with a given precision to keep the computation cost manageable:
>>> distribution = pssm.distribution(background=background, precision=10**4)
The distribution object can be used to determine a number of different thresholds. We can specify the requested false-positive rate (probability of finding a motif
instance in background-generated sequence):
>>> threshold = distribution.threshold_fpr(0.01)
>>> print("%5.3f" % threshold)
4.009
or the false-negative rate (probability of not finding an instance generated from the motif):
>>> threshold = distribution.threshold_fnr(0.1)
>>> print("%5.3f" % threshold)
-0.510
or a threshold (approximately) satisfying some relation between the false-positive rate and the false-negative rate (fnr/fpr ≈ t):
>>> threshold = distribution.threshold_balanced(1000)
>>> print("%5.3f" % threshold)
6.241
or a threshold satisfying (roughly) the equality between the -log of the false-positive rate and the information content (as used in the patser software by Hertz and Stormo):
>>> threshold = distribution.threshold_patser()
>>> print("%5.3f" % threshold)
0.346
For example, in the case of our motif, you can get the threshold giving you exactly the same results (for this sequence) as searching for instances with a balanced threshold
with a rate of 1000.
>>> threshold = distribution.threshold_fpr(0.01)
>>> print("%5.3f" % threshold)
4.009
>>> for position, score in pssm.search(test_seq, threshold=threshold):
...     print("Position %d: score = %5.3f" % (position, score))
...
Position 0: score = 5.622
Position -20: score = 4.601
Position 13: score = 5.738
Position -6: score = 4.601
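To see what a false-positive-rate threshold means, the score distribution of a very short motif can be enumerated exactly. The sketch below does this by brute force over all words of the motif length; it is not the approximation Biopython uses, and the function name and the toy matrix are assumptions. It returns the lowest score such that the background probability of scoring at or above it does not exceed the requested rate.

```python
from itertools import product


def threshold_fpr_exact(pssm, background, fpr):
    """Lowest score whose background tail probability (score >= threshold) is <= fpr."""
    letters = sorted(pssm)
    length = len(pssm[letters[0]])
    probability_by_score = {}
    # Enumerate all words; feasible only for short motifs (4**length words).
    for word in product(letters, repeat=length):
        score = sum(pssm[letter][i] for i, letter in enumerate(word))
        probability = 1.0
        for letter in word:
            probability *= background[letter]
        probability_by_score[score] = probability_by_score.get(score, 0.0) + probability
    tail = 0.0
    threshold = float("inf")  # returned if even the best score is too likely
    for score in sorted(probability_by_score, reverse=True):
        if tail + probability_by_score[score] > fpr:
            break
        tail += probability_by_score[score]
        threshold = score
    return threshold


# Hypothetical 2-column PSSM and a uniform background.
toy_pssm = {
    "A": [1.0, -1.0],
    "C": [-1.0, 1.0],
    "G": [-2.0, -2.0],
    "T": [-2.0, -2.0],
}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(threshold_fpr_exact(toy_pssm, uniform, 0.1))
print(threshold_fpr_exact(toy_pssm, uniform, 0.2))
```

With 16 equally likely words, only "AC" (score 2) fits within a rate of 0.1, while a rate of 0.2 also admits the two words that score 0.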
14.7 Each motif object has an associated Position-Specific Scoring Matrix
To facilitate searching for potential TFBSs using PSSMs, both the position-weight matrix and the position-specific scoring matrix are associated with each motif. Using
the Arnt motif as an example:
>>> with open("Arnt.sites") as handle:
...     motif = motifs.read(handle, 'sites')
...
>>> print(motif.counts)
        0      1      2      3      4      5
A:   4.00  19.00   0.00   0.00   0.00   0.00
C:  16.00   0.00  20.00   0.00   0.00   0.00
G:   0.00   1.00   0.00  20.00   0.00  20.00
T:   0.00   0.00   0.00   0.00  20.00   0.00
<BLANKLINE>
>>> print(motif.pwm)
        0      1      2      3      4      5
A:   0.20   0.95   0.00   0.00   0.00   0.00
C:   0.80   0.00   1.00   0.00   0.00   0.00
G:   0.00   0.05   0.00   1.00   0.00   1.00
T:   0.00   0.00   0.00   0.00   1.00   0.00
<BLANKLINE>
>>> print(motif.pssm)
        0      1      2      3      4      5
A:  -0.32   1.93   -inf   -inf   -inf   -inf
C:   1.68   -inf   2.00   -inf   -inf   -inf
G:   -inf  -2.32   -inf   2.00   -inf   2.00
T:   -inf   -inf   -inf   -inf   2.00   -inf
<BLANKLINE>
The negative infinities appear here because the corresponding entry in the frequency matrix is 0, and we are using zero pseudocounts by default:
>>> for letter in "ACGT":
...     print("%s: %4.2f" % (letter, motif.pseudocounts[letter]))
...
A: 0.00
C: 0.00
G: 0.00
T: 0.00
If you change the .pseudocounts attribute, the position-frequency matrix and the position-specific scoring matrix are recalculated automatically:
>>> motif.pseudocounts = 3.0
>>> for letter in "ACGT":
...     print("%s: %4.2f" % (letter, motif.pseudocounts[letter]))
...
A: 3.00
C: 3.00
G: 3.00
T: 3.00
>>> print(motif.pwm)
        0      1      2      3      4      5
A:   0.22   0.69   0.09   0.09   0.09   0.09
C:   0.59   0.09   0.72   0.09   0.09   0.09
G:   0.09   0.12   0.09   0.72   0.09   0.72
T:   0.09   0.09   0.09   0.09   0.72   0.09
<BLANKLINE>
>>> print(motif.pssm)
        0      1      2      3      4      5
A:  -0.19   1.46  -1.42  -1.42  -1.42  -1.42
C:   1.25  -1.42   1.52  -1.42  -1.42  -1.42
G:  -1.42  -1.00  -1.42   1.52  -1.42   1.52
T:  -1.42  -1.42  -1.42  -1.42   1.52  -1.42
<BLANKLINE>
You can also set the .pseudocounts to a dictionary over the four nucleotides if you want to use different pseudocounts for them. Setting motif.pseudocounts to None resets
it to its default value of zero.
The position-specific scoring matrix depends on the background distribution, which is uniform by default:
>>> for letter in "ACGT":
...     print("%s: %4.2f" % (letter, motif.background[letter]))
...
A: 0.25
C: 0.25
G: 0.25
T: 0.25
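Again, if you modify the background distribution, the position-specific scoring matrix is recalculated:
>>> motif.background = {'A': 0.2, 'C': 0.3, 'G': 0.3, 'T': 0.2}
>>> print(motif.pssm)
        0      1      2      3      4      5
A:   0.13   1.78  -1.09  -1.09  -1.09  -1.09
C:   0.98  -1.68   1.26  -1.68  -1.68  -1.68
G:  -1.68  -1.26  -1.68   1.26  -1.68   1.26
T:  -1.09  -1.09  -1.09  -1.09   1.85  -1.09
<BLANKLINE>
Setting motif.background to None resets it to a uniform distribution:
>>> motif.background = None
>>> for letter in "ACGT":
...     print("%s: %4.2f" % (letter, motif.background[letter]))
...
A: 0.25
C: 0.25
G: 0.25
T: 0.25
If you set motif.background equal to a single value, it will be interpreted as the GC content:
>>> motif.background = 0.8
>>> for letter in "ACGT":
...     print("%s: %4.2f" % (letter, motif.background[letter]))
...
A: 0.10
C: 0.40
G: 0.40
T: 0.10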
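The interpretation of a single number as the GC content amounts to splitting that fraction evenly between C and G and the remainder evenly between A and T. A minimal sketch (the helper name background_from_gc is an assumption, not a Biopython function):

```python
def background_from_gc(gc_content):
    """Expand a GC fraction into a background distribution over A, C, G, T."""
    at_share = (1.0 - gc_content) / 2.0  # A and T share what is left
    gc_share = gc_content / 2.0          # C and G share the GC fraction
    return {"A": at_share, "C": gc_share, "G": gc_share, "T": at_share}


background = background_from_gc(0.8)
for letter in "ACGT":
    print("%s: %4.2f" % (letter, background[letter]))
```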
Note that you can now calculate the mean of the PSSM scores over the background against which it was computed:
>>> print("%f" % motif.pssm.mean(motif.background))
4.703928
as well as its standard deviation:
>>> print("%f" % motif.pssm.std(motif.background))
3.290900
and its distribution:
>>> distribution = motif.pssm.distribution(background=motif.background)
>>> threshold = distribution.threshold_fpr(0.01)
>>> print("%f" % threshold)
3.854375
Note that the position-weight matrix and the position-specific scoring matrix are recalculated each time you call motif.pwm or motif.pssm, respectively. If speed is an issue
and you want to use the PWM or PSSM repeatedly, you can save them as a variable, as in
>>> pssm = motif.pssm
14.8 Comparing motifs
Once we have more than one motif, we might want to compare them.
Before we start comparing motifs, I should point out that motif boundaries are usually quite arbitrary. This means we often need to compare motifs of different lengths, so
comparison needs to involve some kind of alignment. This means we have to take into account two things:
alignment of motifs
some function to compare aligned motifs
To align the motifs, we use ungapped alignment of PSSMs and substitute zeros for any missing columns at the beginning and end of the matrices. This means that
effectively we are using the background distribution for columns missing from the PSSM. The distance function then returns the minimal distance between motifs, as well
as the corresponding offset in their alignment.
To give an example, let us first load another motif, which is similar to our test motif m:
>>> with open("REB1.pfm") as handle:
...     m_reb1 = motifs.read(handle, "pfm")
...
>>> m_reb1.consensus
Seq('GTTACCCGG', IUPACUnambiguousDNA())
>>> print(m_reb1.counts)
        0      1      2      3      4      5      6      7      8
A:  30.00   0.00   0.00 100.00   0.00   0.00   0.00   0.00  15.00
C:  10.00   0.00   0.00   0.00 100.00 100.00 100.00   0.00  15.00
G:  50.00   0.00   0.00   0.00   0.00   0.00   0.00  60.00  55.00
T:  10.00 100.00 100.00   0.00   0.00   0.00   0.00  40.00  15.00
<BLANKLINE>
To make the motifs comparable, we choose the same values for the pseudocounts and the background distribution as for our motif m:
>>> m_reb1.pseudocounts = {'A': 0.6, 'C': 0.4, 'G': 0.4, 'T': 0.6}
>>> m_reb1.background = {'A': 0.3, 'C': 0.2, 'G': 0.2, 'T': 0.3}
>>> pssm_reb1 = m_reb1.pssm
>>> print(pssm_reb1)
        0      1      2      3      4      5      6      7      8
A:   0.00  -5.67  -5.67   1.72  -5.67  -5.67  -5.67  -5.67  -0.97
C:  -0.97  -5.67  -5.67  -5.67   2.30   2.30   2.30  -5.67  -0.41
G:   1.30  -5.67  -5.67  -5.67  -5.67  -5.67  -5.67   1.57   1.44
T:  -1.53   1.72   1.72  -5.67  -5.67  -5.67  -5.67   0.41  -0.97
<BLANKLINE>
We'll compare these motifs using the Pearson correlation. Since we want it to resemble a distance measure, we actually take 1 - r, where r is the Pearson correlation
coefficient (PCC):
>>> distance, offset = pssm.dist_pearson(pssm_reb1)
>>> print("distance = %5.3g" % distance)
distance = 0.239
>>> print(offset)
-2
This means that the best PCC between motif m and m_reb1 is obtained with the following alignment:
m:      bbTACGCbb
m_reb1: GTTACCCGG
where b stands for the background distribution. The PCC itself is roughly 1 - 0.239 = 0.761.
14.9 De novo motif finding
Currently, Biopython has only limited support for de novo motif finding. Namely, we support running xxmotif and also parsing of MEME. Since the number of motif
finding tools is growing rapidly, contributions of new parsers are welcome.
14.9.1 MEME
Let's assume you have run MEME on sequences of your choice with your favorite parameters and saved the output in the file meme.out. You can retrieve the motifs
reported by MEME by running the following piece of code:
>>> with open("meme.out") as handle:
...     motifsM = motifs.parse(handle, "meme")
...
>>> motifsM
[<Bio.motifs.meme.Motif object at 0xc356b0>]
Besides the most wanted list of motifs, the result object contains more useful information, accessible through properties with self-explanatory names:
.alphabet
.datafile
.sequence_names
.version
.command
The motifs returned by the MEME parser can be treated exactly like regular Motif objects (with instances). They also provide some extra functionality by adding
additional information about the instances.
>>> motifsM[0].consensus
Seq('CTCAATCGTA', IUPACUnambiguousDNA())
>>> motifsM[0].instances[0].sequence_name
'SEQ10;'
>>> motifsM[0].instances[0].start
3
>>> motifsM[0].instances[0].strand
'+'
>>> motifsM[0].instances[0].pvalue
8.71e-07
14.10 Useful links
Sequence motif in wikipedia
PWM in wikipedia
Consensus sequence in wikipedia
Comparison of different motif finding programs
Chapter 15 Cluster analysis
Cluster analysis is the grouping of items into clusters based on the similarity of the items to each other. In bioinformatics, clustering is widely used in gene expression
data analysis to find groups of genes with similar gene expression profiles. This may identify functionally related genes, as well as suggest the function of presently
unknown genes.
The Biopython module Bio.Cluster provides commonly used clustering algorithms and was designed with the application to gene expression data in mind. However, this
module can also be used for cluster analysis of other types of data. Bio.Cluster and the underlying C Clustering Library are described by De Hoon et al. [14].
The following four clustering approaches are implemented in Bio.Cluster:
Hierarchical clustering (pairwise centroid-, single-, complete-, and average-linkage);
k-means, k-medians, and k-medoids clustering;
Self-Organizing Maps;
Principal Component Analysis.
Data representation
The data to be clustered are represented by an n × m Numerical Python array data. Within the context of gene expression data clustering, typically the rows correspond to
different genes whereas the columns correspond to different experimental conditions. The clustering algorithms in Bio.Cluster can be applied both to rows (genes) and to
columns (experiments).
Missing values
Often in microarray experiments, some of the data values are missing, which is indicated by an additional n × m Numerical Python integer array mask. If mask[i,j]==0,
then data[i,j] is missing and is ignored in the analysis.
Random number generator
The k-means/medians/medoids clustering algorithms and Self-Organizing Maps (SOMs) include the use of a random number generator. The uniform random number
generator in Bio.Cluster is based on the algorithm by L'Ecuyer [25], while random numbers following the binomial distribution are generated using the BTPE algorithm
by Kachitvichyanukul and Schmeiser [23]. The random number generator is initialized automatically during its first call. As this random number generator uses a
combination of two multiplicative linear congruential generators, two (integer) seeds are needed for initialization, for which we use the system-supplied random number
generator rand (in the C standard library). We initialize this generator by calling srand with the epoch time in seconds, and use the first two random numbers generated by
rand as seeds for the uniform random number generator in Bio.Cluster.
15.1 Distance functions
In order to cluster items into groups based on their similarity, we should first define what exactly we mean by similar. Bio.Cluster provides eight distance functions,
indicated by a single character, to measure similarity, or conversely, distance:
'e': Euclidean distance;
'b': City-block distance.
'c': Pearson correlation coefficient;
'a': Absolute value of the Pearson correlation coefficient;
'u': Uncentered Pearson correlation (equivalent to the cosine of the angle between two data vectors);
'x': Absolute uncentered Pearson correlation;
's': Spearman's rank correlation;
'k': Kendall's τ.
The first two are true distance functions that satisfy the triangle inequality:
$$d(u, v) \le d(u, w) + d(w, v) \quad \text{for all } u, v, w,$$
and are therefore referred to as metrics. In everyday language, this means that the shortest distance between two points is a straight line.
The remaining six distance measures are related to the correlation coefficient, where the distance d is defined in terms of the correlation r by d = 1 - r. Note that these
distance functions are semi-metrics that do not satisfy the triangle inequality. For example, for
$$u = (1, 0, -1); \quad v = (1, 1, 0); \quad w = (0, 1, 1);$$
we find a Pearson distance d(u,w) = 1.8660, while d(u,v) + d(v,w) = 1.6340.
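The counterexample can be checked numerically. The sketch below computes the Pearson distance 1 - r in plain Python for the vectors u = (1, 0, -1), v = (1, 1, 0) and w = (0, 1, 1); the function name pearson_distance is an assumption and the code does not use Bio.Cluster.

```python
import math


def pearson_distance(x, y):
    """Return 1 - r, where r is the Pearson correlation of x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - covariance / math.sqrt(variance_x * variance_y)


u = (1, 0, -1)
v = (1, 1, 0)
w = (0, 1, 1)
direct = pearson_distance(u, w)
detour = pearson_distance(u, v) + pearson_distance(v, w)
print("%.4f %.4f" % (direct, detour))
```

The direct distance exceeds the distance via v, so the triangle inequality is violated.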
Euclidean distance
In Bio.Cluster, we define the Euclidean distance as
$$d = \frac{1}{n} \sum_{i=1}^{n} \left(x_i - y_i\right)^2.$$
Only those terms are included in the summation for which both $x_i$ and $y_i$ are present, and the denominator n is chosen accordingly. As the expression data $x_i$ and $y_i$ are
subtracted directly from each other, we should make sure that the expression data are properly normalized when using the Euclidean distance.
City-block distance
The city-block distance, alternatively known as the Manhattan distance, is related to the Euclidean distance. Whereas the Euclidean distance corresponds to the length of
the shortest path between two points, the city-block distance is the sum of distances along each dimension. As gene expression data tend to have missing values, in
Bio.Cluster we define the city-block distance as the sum of distances divided by the number of dimensions:
$$d = \frac{1}{n} \sum_{i=1}^{n} \left|x_i - y_i\right|.$$
This is equal to the distance you would have to walk between two points in a city, where you have to walk along city blocks. As for the Euclidean distance, the expression
data are subtracted directly from each other, and we should therefore make sure that they are properly normalized.
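The two definitions above, including the treatment of missing values, can be sketched in plain Python as follows. This illustrates the formulas and is not the Bio.Cluster implementation; the function names and the 0/1 mask lists are assumptions modelled on the mask convention described earlier (0 means missing).

```python
def present_pairs(x, y, mask_x=None, mask_y=None):
    """Keep only the positions at which both values are present."""
    if mask_x is None:
        mask_x = [1] * len(x)
    if mask_y is None:
        mask_y = [1] * len(y)
    return [(a, b) for a, b, ma, mb in zip(x, y, mask_x, mask_y) if ma and mb]


def euclidean(x, y, mask_x=None, mask_y=None):
    """Mean squared difference over the positions present in both vectors."""
    pairs = present_pairs(x, y, mask_x, mask_y)
    return sum((a - b) ** 2 for a, b in pairs) / len(pairs)


def cityblock(x, y, mask_x=None, mask_y=None):
    """Mean absolute difference over the positions present in both vectors."""
    pairs = present_pairs(x, y, mask_x, mask_y)
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


x = [1.0, 2.0, 5.0]
y = [2.0, 4.0, 0.0]
print(euclidean(x, y), cityblock(x, y))
# Mark the last value of y as missing: n drops from 3 to 2.
print(euclidean(x, y, mask_y=[1, 1, 0]), cityblock(x, y, mask_y=[1, 1, 0]))
```

Dividing by the number of usable positions keeps distances comparable between pairs of items that have different numbers of missing values. The sketch raises ZeroDivisionError if no position is present in both vectors.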
The Pearson correlation coefficient
The Pearson correlation coefficient is defined as
$$r = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{\sigma_x}\right) \left(\frac{y_i - \bar{y}}{\sigma_y}\right),$$
in which $\bar{x}$, $\bar{y}$ are the sample means of x and y respectively, and $\sigma_x$, $\sigma_y$ are the sample standard deviations of x and y. The Pearson correlation coefficient is a measure for
how well a straight line can be fitted to a scatterplot of x and y. If all the points in the scatterplot lie on a straight line, the Pearson correlation coefficient is either +1 or -1,
depending on whether the slope of the line is positive or negative. If the Pearson correlation coefficient is equal to zero, there is no correlation between x and y.
The Pearson distance is then defined as
$$d_P \equiv 1 - r.$$
As the Pearson correlation coefficient lies between -1 and 1, the Pearson distance lies between 0 and 2.
Absolute Pearson correlation
By taking the absolute value of the Pearson correlation, we find a number between 0 and 1. If the absolute value is 1, all the points in the scatterplot lie on a straight line
with either a positive or a negative slope. If the absolute value is equal to zero, there is no correlation between x and y.
The corresponding distance is defined as
$$d_A \equiv 1 - \left|r\right|,$$
where r is the Pearson correlation coefficient. As the absolute value of the Pearson correlation coefficient lies between 0 and 1, the corresponding distance lies between 0
and 1 as well.
In the context of gene expression experiments, the absolute correlation is equal to 1 if the gene expression profiles of two genes are either exactly the same or exactly
opposite. The absolute correlation coefficient should therefore be used with care.
Uncentered correlation (cosine of the angle)
In some cases, it may be preferable to use the uncentered correlation instead of the regular Pearson correlation coefficient. The uncentered correlation is defined as
$$r_U = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{x_i}{\sigma_x^{(0)}}\right) \left(\frac{y_i}{\sigma_y^{(0)}}\right),$$
where
$$\sigma_x^{(0)} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} x_i^2}; \qquad \sigma_y^{(0)} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} y_i^2}.$$
This is the same expression as for the regular Pearson correlation coefficient, except that the sample means $\bar{x}$, $\bar{y}$ are set equal to zero. The uncentered correlation may be
appropriate if there is a zero reference state. For instance, in the case of gene expression data given in terms of log-ratios, a log-ratio equal to zero corresponds to the
green and red signal being equal, which means that the experimental manipulation did not affect the gene expression.
The distance corresponding to the uncentered correlation coefficient is defined as
$$d_U \equiv 1 - r_U,$$
where $r_U$ is the uncentered correlation. As the uncentered correlation coefficient lies between -1 and 1, the corresponding distance lies between 0 and 2.
The uncentered correlation is equal to the cosine of the angle of the two data vectors in n-dimensional space, and is often referred to as such.
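That the uncentered correlation equals the cosine of the angle between the two data vectors can be verified numerically. The sketch below implements both expressions independently in plain Python; the function names and the example vectors are assumptions for illustration.

```python
import math


def uncentered_correlation(x, y):
    """Uncentered correlation: the Pearson formula with both means set to zero."""
    n = len(x)
    sigma_x = math.sqrt(sum(a * a for a in x) / n)
    sigma_y = math.sqrt(sum(b * b for b in y) / n)
    return sum((a / sigma_x) * (b / sigma_y) for a, b in zip(x, y)) / n


def cosine(x, y):
    """Cosine of the angle between x and y: dot product over the product of norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 7.0]
print(uncentered_correlation(x, y), cosine(x, y))
# Orthogonal vectors have zero uncentered correlation, hence distance 1.
print(1.0 - uncentered_correlation([1.0, 0.0], [0.0, 1.0]))
```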
Absolute uncentered correlation
As for the regular Pearson correlation, we can define a distance measure using the absolute value of the uncentered correlation:
$$d_{AU} \equiv 1 - \left|r_U\right|,$$
where $r_U$ is the uncentered correlation coefficient. As the absolute value of the uncentered correlation coefficient lies between 0 and 1, the corresponding distance lies
between 0 and 1 as well.
Geometrically, the absolute value of the uncentered correlation is equal to the cosine between the supporting lines of the two data vectors (i.e., the angle without taking
the direction of the vectors into consideration).
Spearman rank correlation
The Spearman rank correlation is an example of a non-parametric similarity measure, and tends to be more robust against outliers than the Pearson correlation.
To calculate the Spearman rank correlation, we replace each data value by its rank when the data in each vector are ordered by value. We then calculate the
Pearson correlation between the two rank vectors instead of the data vectors.
As in the case of the Pearson correlation, we can define a distance measure corresponding to the Spearman rank correlation as
$$d_S \equiv 1 - r_S,$$
where $r_S$ is the Spearman rank correlation.
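The two steps just described, ranking and then applying the Pearson correlation to the ranks, can be sketched as follows. The function names are assumptions, and giving tied values the average of their ranks is a common convention assumed here rather than something stated in the text.

```python
import math


def ranks(values):
    """Rank values from 1 to n; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(variance_x * variance_y)


def spearman_distance(x, y):
    """1 - r_S, with r_S the Pearson correlation of the two rank vectors."""
    return 1.0 - pearson(ranks(x), ranks(y))


# The outlier 100.0 lowers the Pearson correlation but not the rank correlation.
x = [1.0, 2.0, 3.0, 100.0]
y = [1.0, 4.0, 9.0, 16.0]
print(spearman_distance(x, y), 1.0 - pearson(x, y))
```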
Kendall's τ
Kendall's τ is another example of a non-parametric similarity measure. It is similar to the Spearman rank correlation, but instead of the ranks themselves only the relative
ranks are used to calculate τ (see Snedecor & Cochran [29]).
We can define a distance measure corresponding to Kendall's τ as
$$d_K \equiv 1 - \tau.$$
As Kendall's τ is always between -1 and 1, the corresponding distance will be between 0 and 2.
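For data without ties, Kendall's τ can be computed by counting concordant and discordant pairs. The sketch below uses this simple form (often called τ-a); how Bio.Cluster treats ties is not described here, so the example assumes there are none, and the function name is hypothetical.

```python
def kendall_tau(x, y):
    """Kendall's tau for data without ties: (concordant - discordant) / number of pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # The pair is concordant if x and y order items i and j the same way.
            direction = (x[i] - x[j]) * (y[i] - y[j])
            if direction > 0:
                concordant += 1
            elif direction < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2.0)


x = [1, 2, 3, 4]
y = [1, 3, 2, 4]   # one swapped pair out of six
tau = kendall_tau(x, y)
print(tau, 1.0 - tau)
```

Only the relative order within each pair matters, which is what "relative ranks" refers to: five of the six pairs agree and one disagrees, giving τ = 4/6.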
Weighting
For most of the distance functions available in Bio.Cluster, a weight vector can be applied. The weight vector contains weights for the items in the data vector. If the
weight for item i is $w_i$, then that item is treated as if it occurred $w_i$ times in the data. The weights do not have to be integers. For the Spearman rank correlation and
Kendall's τ, weights do not have a well-defined meaning and are therefore not implemented.
Calculating the distance matrix
The distance matrix is a square matrix with all pairwise distances between the items in data, and can be calculated by the function distancematrix in the Bio.Cluster
module:
>>> from Bio.Cluster import distancematrix
>>> matrix = distancematrix(data)
where the following arguments are defined:
data (required)
Array containing the data for the items.
mask (default: None)
Array of integers showing which data are missing. If mask[i,j]==0, then data[i,j] is missing. If mask==None, then all data are present.
weight (default: None)
The weights to be used when calculating distances. If weight==None, then equal weights are assumed.
transpose (default: 0)
Determines if the distances between the rows of data are to be calculated (transpose==0), or between the columns of data (transpose==1).
dist (default: 'e', Euclidean distance)
Defines the distance function to be used (see 15.1).
To save memory, the distance matrix is returned as a list of 1D arrays. The number of columns in each row is equal to the row number. Hence, the first row has zero
elements. An example of the return value is
[array([]),
 array([1.]),
 array([7., 3.]),
 array([4., 2., 6.])]
This corresponds to the distance matrix
$$\begin{pmatrix} 0 & 1 & 7 & 4 \\ 1 & 0 & 3 & 2 \\ 7 & 3 & 0 & 6 \\ 4 & 2 & 6 & 0 \end{pmatrix}.$$
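If a full square matrix is needed, the ragged lower-triangular representation can be expanded by mirroring each entry across the diagonal. A minimal sketch, using plain lists in place of the 1D arrays (the helper name to_square is an assumption; it only needs rows that can be iterated over, so it would accept 1D arrays as well):

```python
def to_square(rows):
    """Expand a ragged lower-triangular distance matrix into a full symmetric one."""
    n = len(rows)
    full = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(rows):
        for j, distance in enumerate(row):
            # Row i holds the distances to items 0 .. i-1; mirror them.
            full[i][j] = full[j][i] = float(distance)
    return full


rows = [[], [1.0], [7.0, 3.0], [4.0, 2.0, 6.0]]
for line in to_square(rows):
    print(line)
```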
15.2 Calculating cluster properties
Calculating the cluster centroids
The centroid of a cluster can be defined either as the mean or as the median of each dimension over all cluster items. The function clustercentroids in Bio.Cluster can
be used to calculate either:
>>> from Bio.Cluster import clustercentroids
>>> cdata, cmask = clustercentroids(data)
where the following arguments are defined:
data (required)
Array containing the data for the items.
mask (default: None)
Array of integers showing which data are missing. If mask[i,j]==0, then data[i,j] is missing. If mask==None, then all data are present.
clusterid (default: None)
Vector of integers showing to which cluster each item belongs. If clusterid is None, then all items are assumed to belong to the same cluster.
method (default: 'a')
Specifies whether the arithmetic mean (method=='a') or the median (method=='m') is used to calculate the cluster center.
transpose (default: 0)
Determines if the centroids of the rows of data are to be calculated (transpose==0), or the centroids of the columns of data (transpose==1).
This function returns the tuple (cdata, cmask). The centroid data are stored in the 2D Numerical Python array cdata, with missing data indicated by the 2D Numerical
Python integer array cmask. The dimensions of these arrays are (number of clusters, number of columns) if transpose is 0, or (number of rows, number of clusters) if
transpose is 1. Each row (if transpose is 0) or column (if transpose is 1) contains the averaged data corresponding to the centroid of each cluster.
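What the centroid calculation does for row-wise clustering can be illustrated without Bio.Cluster. The sketch below computes, for each cluster, the mean or the median of every column over the cluster's rows. The function name is an assumption, missing values are not handled, and every cluster number from 0 to the largest one is assumed to have at least one member.

```python
from statistics import mean, median


def cluster_centroids(data, clusterid, method="a"):
    """Per-cluster mean ('a') or median ('m') of each column; rows are the items."""
    average = mean if method == "a" else median
    centroids = []
    for cluster in range(max(clusterid) + 1):
        members = [row for row, cid in zip(data, clusterid) if cid == cluster]
        # zip(*members) iterates over the columns of the member rows.
        centroids.append([average(column) for column in zip(*members)])
    return centroids


data = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0], [8.0, 0.0]]
clusterid = [0, 0, 1, 0]
print(cluster_centroids(data, clusterid, method="a"))
print(cluster_centroids(data, clusterid, method="m"))
```

The result has one row per cluster and one column per data column, matching the (number of clusters, number of columns) shape described above for transpose equal to 0.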
Calculating the distance between clusters
Given a distance function between items, we can define the distance between two clusters in several ways. The distance between the arithmetic means of the two clusters
is used in pairwise centroid-linkage clustering and in k-means clustering. In k-medians clustering, the distance between the medians of the two clusters is used instead.
The shortest pairwise distance between items of the two clusters is used in pairwise single-linkage clustering, while the longest pairwise distance is used in pairwise
maximum-linkage clustering. In pairwise average-linkage clustering, the distance between two clusters is defined as the average over the pairwise distances.
To calculate the distance between two clusters, use
>>> from Bio.Cluster import clusterdistance
>>> distance = clusterdistance(data)
where the following arguments are defined:
data (required)
Array containing the data for the items.
mask (default: None)
Array of integers showing which data are missing. If mask[i,j]==0, then data[i,j] is missing. If mask==None, then all data are present.
index1 (default: 0)
A list containing the indices of the items belonging to the first cluster. A cluster containing only one item i can be represented either as a list [i], or as an integer i.
index2 (default: 0)
A list containing the indices of the items belonging to the second cluster. A cluster containing only one item i can be represented either as a list [i], or as an
integer i.
method (default: 'a')
Specifies how the distance between clusters is defined:
'a': Distance between the two cluster centroids (arithmetic mean);
'm': Distance between the two cluster centroids (median);
's': Shortest pairwise distance between items in the two clusters;
'x': Longest pairwise distance between items in the two clusters;
'v': Average over the pairwise distances between items in the two clusters.
transpose (default: 0)
If transpose==0, calculate the distance between the rows of data. If transpose==1, calculate the distance between the columns of data.
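The three definitions that depend only on pairwise item distances ('s', 'x' and 'v') can be illustrated directly from a distance matrix. The sketch below is not the clusterdistance function; its name, its signature, and the use of a full square matrix as input are assumptions made for illustration.

```python
def cluster_distance_from_matrix(matrix, index1, index2, method="s"):
    """Distance between two clusters computed from pairwise item distances."""
    pairwise = [matrix[i][j] for i in index1 for j in index2]
    if method == "s":      # shortest pairwise distance (single linkage)
        return min(pairwise)
    if method == "x":      # longest pairwise distance (maximum linkage)
        return max(pairwise)
    if method == "v":      # average over all pairwise distances
        return sum(pairwise) / len(pairwise)
    raise ValueError("method must be 's', 'x' or 'v'")


# The 4 x 4 example distance matrix from section 15.1.
matrix = [
    [0.0, 1.0, 7.0, 4.0],
    [1.0, 0.0, 3.0, 2.0],
    [7.0, 3.0, 0.0, 6.0],
    [4.0, 2.0, 6.0, 0.0],
]
for method in "sxv":
    print(method, cluster_distance_from_matrix(matrix, [0, 1], [2, 3], method))
```

The centroid-based methods 'a' and 'm' are left out because they need the original data rather than the pairwise distances.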
15.3 Partitioning algorithms
Partitioning algorithms divide items into k clusters such that the sum of distances over the items to their cluster centers is minimal. The number of clusters k is specified
by the user. Three partitioning algorithms are available in Bio.Cluster:
k-means clustering
k-medians clustering
k-medoids clustering
These algorithms differ in how the cluster center is defined. In k-means clustering, the cluster center is defined as the mean data vector averaged over all items in the
cluster. Instead of the mean, in k-medians clustering the median is calculated for each dimension in the data vector. Finally, in k-medoids clustering the cluster center is
defined as the item which has the smallest sum of distances to the other items in the cluster. This clustering algorithm is suitable for cases in which the distance matrix is
known but the original data matrix is not available, for example when clustering proteins based on their structural similarity.
The expectation-maximization (EM) algorithm is used to find this partitioning into k groups. In the initialization of the EM algorithm, we randomly assign items to
clusters. To ensure that no empty clusters are produced, we use the binomial distribution to randomly choose the number of items in each cluster to be one or more. We
then randomly permute the cluster assignments to items such that each item has an equal probability to be in any cluster. Each cluster is thus guaranteed to contain at least
one item.
We then iterate:
Calculate the centroid of each cluster, defined as either the mean, the median, or the medoid of the cluster;
Calculate the distances of each item to the cluster centers;
For each item, determine which cluster centroid is closest;
Reassign each item to its closest cluster, or stop the iteration if no further item reassignments take place.
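The iteration above can be sketched for the k-means case in plain Python. This is a simplified illustration and not the Bio.Cluster implementation: the function names are assumptions, the initial assignment is a shuffled round-robin rather than the binomial scheme described in the text, and the guard that keeps the last item of a cluster in place is a simple way of preventing empty clusters.

```python
import random


def mean_squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def kmeans_sketch(data, nclusters, seed=0, max_iterations=100):
    rng = random.Random(seed)
    # Random initial assignment in which every cluster receives at least one item.
    clusterid = [i % nclusters for i in range(len(data))]
    rng.shuffle(clusterid)
    for _ in range(max_iterations):
        # Step 1: centroid of each cluster (mean of each dimension).
        centroids = []
        for k in range(nclusters):
            members = [row for row, cid in zip(data, clusterid) if cid == k]
            centroids.append([sum(column) / len(members) for column in zip(*members)])
        # Steps 2-4: reassign each item to its closest centroid.
        counts = [clusterid.count(k) for k in range(nclusters)]
        changed = False
        for i, row in enumerate(data):
            closest = min(range(nclusters),
                          key=lambda k: mean_squared_distance(row, centroids[k]))
            # Never move the last remaining item out of a cluster.
            if closest != clusterid[i] and counts[clusterid[i]] > 1:
                counts[clusterid[i]] -= 1
                counts[closest] += 1
                clusterid[i] = closest
                changed = True
        if not changed:
            break  # no reassignments took place: converged
    return clusterid


data = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.0], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
print(kmeans_sketch(data, 2))
```

Which of the two groups is labelled 0 depends on the random initial assignment, but for these well-separated points the first three items end up together and so do the last three.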
To avoid clusters becoming empty during the iteration, in k-means and k-medians clustering the algorithm keeps track of the number of items in each cluster, and
prohibits the last remaining item in a cluster from being reassigned to a different cluster. For k-medoids clustering, such a check is not needed, as the item that functions
as the cluster centroid has a zero distance to itself, and will therefore never be closer to a different cluster.
As the initial assignment of items to clusters is done randomly, usually a different clustering solution is found each time the EM algorithm is executed. To find the optimal
clustering solution, the k-means algorithm is repeated many times, each time starting from a different initial random clustering. The sum of distances of the items to their
cluster center is saved for each run, and the solution with the smallest value of this sum will be returned as the overall clustering solution.
How often the EM algorithm should be run depends on the number of items being clustered. As a rule of thumb, we can consider how often the optimal solution was
found; this number is returned by the partitioning algorithms as implemented in this library. If the optimal solution was found many times, it is unlikely that better
solutions exist than the one that was found. However, if the optimal solution was found only once, there may well be other solutions with a smaller within-cluster sum of
distances. If the number of items is large (more than several hundreds), it may be difficult to find the globally optimal solution.
The EM algorithm terminates when no further reassignments take place. We noticed that for some sets of initial cluster assignments, the EM algorithm fails to converge
due to the same clustering solution reappearing periodically after a small number of iteration steps. We therefore check for the occurrence of such periodic solutions
during the iteration. After a given number of iteration steps, the current clustering result is saved as a reference. By comparing the clustering result after each subsequent
iteration step to the reference state, we can determine if a previously encountered clustering result is found. In such a case, the iteration is halted. If after a given number
of iterations the reference state has not yet been encountered, the current clustering solution is saved to be used as the new reference state. Initially, ten iteration steps are
executed before resaving the reference state. This number of iteration steps is doubled each time, to ensure that periodic behavior with longer periods can also be detected.
k-means and k-medians
The k-means and k-medians algorithms are implemented as the function kcluster in Bio.Cluster:
>>> from Bio.Cluster import kcluster
>>> clusterid, error, nfound = kcluster(data)
where the following arguments are defined:
data (required)
Array containing the data for the items.
nclusters (default: 2)
The number of clusters k.
mask (default: None)
Array of integers showing which data are missing. If mask[i,j]==0, then data[i,j] is missing. If mask==None, then all data are present.
transpose (default: 0)
Determines if rows (transpose is 0) or columns (transpose is 1) are to be clustered.
npass (default: 1)
The number of times the k-means/medians clustering algorithm is performed, each time with a different (random) initial condition. If initialid is given, the value
of npass is ignored and the clustering algorithm is run only once, as it behaves deterministically in that case.
method (default: 'a')
Describes how the center of a cluster is found:
method=='a': arithmetic mean (k-means clustering);
method=='m': median (k-medians clustering).
For other values of method, the arithmetic mean is used.
dist (default: 'e', Euclidean distance)
Defines the distance function to be used (see 15.1). Whereas all eight distance measures are accepted by kcluster, from a theoretical viewpoint it is best to use the
Euclidean distance for the k-means algorithm, and the city-block distance for k-medians.
initialid (default: None)
Specifies the initial clustering to be used for the EM algorithm. If initialid==None, then a different random initial clustering is used for each of the npass runs of
the EM algorithm. If initialid is not None, then it should be equal to a 1D array containing the cluster number (between 0 and nclusters-1) for each item. Each
cluster should contain at least one item. With the initial clustering specified, the EM algorithm is deterministic.
This function returns a tuple (clusterid, error, nfound), where clusterid is an integer array containing the number of the cluster to which each row or column was
assigned, error is the within-cluster sum of distances for the optimal clustering solution, and nfound is the number of times this optimal solution was found.
k-medoids clustering
The kmedoids routine performs k-medoids clustering on a given set of items, using the distance matrix and the number of clusters passed by the user:
>>> from Bio.Cluster import kmedoids
>>> clusterid, error, nfound = kmedoids(distance)
where the following arguments are defined:
distance (required)
The matrix containing the distances between the items; this matrix can be specified in three ways:
as a 2D Numerical Python array (in which only the left-lower part of the array will be accessed):
distance = array([[0.0, 1.1, 2.3],
                  [1.1, 0.0, 4.5],
                  [2.3, 4.5, 0.0]])
as a 1D Numerical Python array containing consecutively the distances in the left-lower part of the distance matrix:
distance = array([1.1, 2.3, 4.5])
as a list containing the rows of the left-lower part of the distance matrix:
distance = [array([]),
            array([1.1]),
            array([2.3, 4.5])
           ]
These three expressions correspond to the same distance matrix.
nclusters (default: 2)
The number of clusters k.
npass (default: 1)
The number of times the k-medoids clustering algorithm is performed, each time with a different (random) initial condition. If initialid is given, the value of
npass is ignored, as the clustering algorithm behaves deterministically in that case.
initialid (default: None)
Specifies the initial clustering to be used for the EM algorithm. If initialid==None, then a different random initial clustering is used for each of the npass runs of
the EM algorithm. If initialid is not None, then it should be equal to a 1D array containing the cluster number (between 0 and nclusters-1) for each item. Each
cluster should contain at least one item. With the initial clustering specified, the EM algorithm is deterministic.
This function returns a tuple (clusterid, error, nfound), where clusterid is an array containing the number of the cluster to which each item was assigned, error is the
within-cluster sum of distances for the optimal k-medoids clustering solution, and nfound is the number of times the optimal solution was found. Note that the cluster
number in clusterid is defined as the item number of the item representing the cluster centroid.
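Two ideas from this section lend themselves to a small illustration: the 1D and list-of-rows representations describe the same lower triangle, and the medoid of a cluster is the member with the smallest sum of distances to the other members. The sketch below uses plain lists instead of arrays, and all function names are assumptions; it is not part of Bio.Cluster.

```python
def rows_from_flat(flat):
    """Split a flat lower-triangular distance list into rows of length 0, 1, 2, ..."""
    rows = [[]]
    start = 0
    while start < len(flat):
        size = len(rows)  # row number i holds i distances
        rows.append(list(flat[start:start + size]))
        start += size
    return rows


def lookup(rows, i, j):
    """Distance between items i and j from the lower-triangular rows."""
    if i == j:
        return 0.0
    return rows[i][j] if i > j else rows[j][i]


def medoid(rows, members):
    """Member with the smallest sum of distances to the other members."""
    return min(members, key=lambda i: sum(lookup(rows, i, j) for j in members))


rows = rows_from_flat([1.1, 2.3, 4.5])
print(rows)
print(medoid(rows, [0, 1, 2]))
```

Because the medoid is itself an item, only the distance matrix is needed, which is why k-medoids can be used when the original data are unavailable.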
15.4 Hierarchical clustering
Hierarchical clustering methods are inherently different from the k-means clustering method. In hierarchical clustering, the similarity in the expression profile between genes or experimental conditions is represented in the form of a tree structure. This tree structure can be shown graphically by programs such as Treeview and Java Treeview, which has contributed to the popularity of hierarchical clustering in the analysis of gene expression data.
The first step in hierarchical clustering is to calculate the distance matrix, specifying all the distances between the items to be clustered. Next, we create a node by joining the two closest items. Subsequent nodes are created by pairwise joining of items or nodes based on the distance between them, until all items belong to the same node. A tree structure can then be created by retracing which items and nodes were merged. Unlike the EM algorithm, which is used in k-means clustering, the complete process of hierarchical clustering is deterministic.
Several flavors of hierarchical clustering exist, which differ in how the distance between subnodes is defined in terms of their members. In Bio.Cluster, pairwise single, maximum, average, and centroid linkage are available.
In pairwise single-linkage clustering, the distance between two nodes is defined as the shortest distance among the pairwise distances between the members of the two nodes.
In pairwise maximum-linkage clustering, alternatively known as pairwise complete-linkage clustering, the distance between two nodes is defined as the longest distance among the pairwise distances between the members of the two nodes.
In pairwise average-linkage clustering, the distance between two nodes is defined as the average over all pairwise distances between the items of the two nodes.
In pairwise centroid-linkage clustering, the distance between two nodes is defined as the distance between their centroids. The centroids are calculated by taking the mean over all the items in a cluster. As the distance from each newly formed node to existing nodes and items needs to be calculated at each step, the computing time of pairwise centroid-linkage clustering may be significantly longer than for the other hierarchical clustering methods. Another peculiarity is that (for a distance measure based on the Pearson correlation), the distances do not necessarily increase when going up in the clustering tree, and may even decrease. This is caused by an inconsistency between the centroid calculation and the distance calculation when using the Pearson correlation: whereas the Pearson correlation effectively normalizes the data for the distance calculation, no such normalization occurs for the centroid calculation.
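The single, maximum, and average linkage definitions above can be sketched in a few lines of plain Python. This is only an illustration of the definitions, not the Bio.Cluster implementation; the helper name linkage_distance and the distance values are assumptions made for the example.

```python
from itertools import product
from statistics import mean


def linkage_distance(members_a, members_b, distance, method="m"):
    """Distance between two nodes, given the items that each node contains.

    Hypothetical helper: distance is a full symmetric matrix (list of lists)
    and the method codes mirror the ones used by treecluster.
    """
    pairwise = [distance[i][j] for i, j in product(members_a, members_b)]
    if method == "s":  # single linkage: shortest pairwise distance
        return min(pairwise)
    if method == "m":  # maximum (complete) linkage: longest pairwise distance
        return max(pairwise)
    if method == "a":  # average linkage: mean over all pairwise distances
        return mean(pairwise)
    raise ValueError("unknown linkage method: %r" % method)


# Illustrative symmetric distance matrix for four items.
distance = [
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 2.0, 6.0],
    [4.0, 2.0, 0.0, 3.0],
    [5.0, 6.0, 3.0, 0.0],
]
node_a, node_b = [0, 1], [2, 3]
single = linkage_distance(node_a, node_b, distance, "s")
maximum = linkage_distance(node_a, node_b, distance, "m")
average = linkage_distance(node_a, node_b, distance, "a")
```

Centroid linkage is left out on purpose: as noted above, it needs the original data rather than the pairwise distances.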
For pairwise single-, complete-, and average-linkage clustering, the distance between two nodes can be found directly from the distances between the individual items. Therefore, the clustering algorithm does not need access to the original gene expression data, once the distance matrix is known. For pairwise centroid-linkage clustering, however, the centroids of newly formed subnodes can only be calculated from the original data and not from the distance matrix.
The implementation of pairwise single-linkage hierarchical clustering is based on the SLINK algorithm (R. Sibson, 1973), which is much faster and more memory-efficient than a straightforward implementation of pairwise single-linkage clustering. The clustering result produced by this algorithm is identical to the clustering solution found by the conventional single-linkage algorithm. The single-linkage hierarchical clustering algorithm implemented in this library can be used to cluster large gene expression data sets, for which conventional hierarchical clustering algorithms fail due to excessive memory requirements and running time.
Representing a hierarchical clustering solution
The result of hierarchical clustering consists of a tree of nodes, in which each node joins two items or subnodes. Usually, we are not only interested in which items or subnodes are joined at each node, but also in their similarity (or distance) as they are joined. To store one node in the hierarchical clustering tree, we make use of the class Node, which is defined in Bio.Cluster. An instance of Node has three attributes:
left
right
distance
Here, left and right are integers referring to the two items or subnodes that are joined at this node, and distance is the distance between them. The items being clustered are numbered from 0 to (number of items - 1), while clusters are numbered from -1 to -(number of items - 1). Note that the number of nodes is one less than the number of items.
To create a new Node object, we need to specify left and right; distance is optional.
>>> from Bio.Cluster import Node
>>> Node(2, 3)
(2, 3): 0
>>> Node(2, 3, 0.91)
(2, 3): 0.91
The attributes left, right, and distance of an existing Node object can be modified directly:
>>> node = Node(4, 5)
>>> node.left = 6
>>> node.right = 2
>>> node.distance = 0.73
>>> node
(6, 2): 0.73
An error is raised if left and right are not integers, or if distance cannot be converted to a floating-point value.
The Python class Tree represents a full hierarchical clustering solution. A Tree object can be created from a list of Node objects:
>>> from Bio.Cluster import Node, Tree
>>> nodes = [Node(1, 2, 0.2), Node(0, 3, 0.5), Node(-2, 4, 0.6), Node(-1, -3, 0.9)]
>>> tree = Tree(nodes)
>>> print(tree)
(1, 2): 0.2
(0, 3): 0.5
(-2, 4): 0.6
(-1, -3): 0.9
The Tree initializer checks if the list of nodes is a valid hierarchical clustering result:
>>> nodes = [Node(1, 2, 0.2), Node(0, 2, 0.5)]
>>> Tree(nodes)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ValueError: Inconsistent tree
Individual nodes in a Tree object can be accessed using square brackets:
>>> nodes = [Node(1, 2, 0.2), Node(0, -1, 0.5)]
>>> tree = Tree(nodes)
>>> tree[0]
(1, 2): 0.2
>>> tree[1]
(0, -1): 0.5
>>> tree[-1]
(0, -1): 0.5
As a Tree object is read-only, we cannot change individual nodes in a Tree object. However, we can convert the tree to a list of nodes, modify this list, and create a new tree from this list:
>>> tree = Tree([Node(1, 2, 0.1), Node(0, -1, 0.5), Node(-2, 3, 0.9)])
>>> print(tree)
(1, 2): 0.1
(0, -1): 0.5
(-2, 3): 0.9
>>> nodes = tree[:]
>>> nodes[0] = Node(0, 1, 0.2)
>>> nodes[1].left = 2
>>> tree = Tree(nodes)
>>> print(tree)
(0, 1): 0.2
(2, -1): 0.5
(-2, 3): 0.9
This guarantees that any Tree object is always well-formed.
To display a hierarchical clustering solution with visualization programs such as Java TreeView, it is better to scale all node distances such that they are between zero and one. This can be accomplished by calling the scale method on an existing Tree object:
>>> tree.scale()
This method takes no arguments, and returns None.
After hierarchical clustering, the items can be grouped into k clusters based on the tree structure stored in the Tree object by cutting the tree:
>>> clusterid = tree.cut(nclusters=1)
where nclusters (defaulting to 1) is the desired number of clusters k. This method ignores the top k-1 linking events in the tree structure, resulting in k separated clusters of items. The number of clusters k should be positive, and less than or equal to the number of items. This method returns an array clusterid containing the number of the cluster to which each item is assigned.
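To make the idea of cutting a tree concrete, the following sketch applies the rule described above (ignore the top k-1 linking events) to nodes written as plain (left, right, distance) tuples. It assumes the numbering convention of this section: a non-negative index is an item, and a negative index -j refers to the cluster created by the j-th node. The function cut_tree is a hypothetical helper, not the Tree.cut method, and the cluster labels it assigns may differ from those returned by Bio.Cluster.

```python
def cut_tree(nodes, nclusters=1):
    """Group items into nclusters clusters by ignoring the top nclusters-1 joins.

    nodes is a list of (left, right, distance) tuples in the order in which
    the joins were made (hypothetical stand-in for a Tree object).
    """
    nitems = len(nodes) + 1
    if not 1 <= nclusters <= nitems:
        raise ValueError("nclusters must be between 1 and the number of items")
    parent = list(range(nitems))  # union-find forest over the items

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    representative = []  # one member item of the cluster made by each node
    for left, right, _ in nodes[: nitems - nclusters]:
        a = left if left >= 0 else representative[-left - 1]
        b = right if right >= 0 else representative[-right - 1]
        parent[find(a)] = find(b)
        representative.append(a)

    labels = {}  # root item -> consecutive cluster number
    clusterid = []
    for item in range(nitems):
        root = find(item)
        clusterid.append(labels.setdefault(root, len(labels)))
    return clusterid


# Five items joined by four nodes; -1 and -2 refer to the first and second node.
nodes = [(1, 2, 0.2), (0, 3, 0.5), (-2, 4, 0.6), (-1, -3, 0.9)]
two_clusters = cut_tree(nodes, nclusters=2)
three_clusters = cut_tree(nodes, nclusters=3)
```

With two clusters, only the last join is ignored, so items 1 and 2 form one cluster and items 0, 3, and 4 the other.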
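Performing hierarchical clustering
To perform hierarchical clustering, use the treecluster function in Bio.Cluster.
>>> from Bio.Cluster import treecluster
>>> tree = treecluster(data)
where the following arguments are defined:
data
Array containing the data for the items.
mask (default: None)
Array of integers showing which data are missing. If mask[i][j]==0, then data[i][j] is missing. If mask==None, then all data are present.
transpose
Determines if rows (transpose==0) or columns (transpose==1) are to be clustered.
method (default: 'm')
Defines the linkage method to be used:
method=='s': pairwise single-linkage clustering
method=='m': pairwise maximum- (or complete-) linkage clustering
method=='c': pairwise centroid-linkage clustering
method=='a': pairwise average-linkage clustering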
To apply hierarchical clustering on a precalculated distance matrix, specify the distancematrix argument when calling the treecluster function instead of the data argument:
>>> from Bio.Cluster import treecluster
>>> tree = treecluster(distancematrix=distance)
In this case, the following arguments are defined:
distancematrix
The distance matrix, which can be specified in three ways:
as a 2D Numerical Python array (in which only the left-lower part of the array will be accessed):
distance = array([[0.0, 1.1, 2.3],
                  [1.1, 0.0, 4.5],
                  [2.3, 4.5, 0.0]])
as a 1D Numerical Python array containing consecutively the distances in the left-lower part of the distance matrix:
distance = array([1.1, 2.3, 4.5])
as a list containing the rows of the left-lower part of the distance matrix:
distance = [array([]),
            array([1.1]),
            array([2.3, 4.5])]
These three expressions correspond to the same distance matrix. As treecluster may shuffle the values in the distance matrix as part of the clustering algorithm, be sure to save this array in a different variable before calling treecluster if you need it later.
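The relation between the three layouts can be sketched with plain Python lists (used here instead of Numerical Python arrays so that the example has no dependencies). The helper names lower_triangle_rows and lower_triangle_flat are assumptions made for illustration; they are not part of Bio.Cluster.

```python
def lower_triangle_rows(full):
    """Rows of the left-lower part of a square matrix: row i holds i values."""
    return [list(full[i][:i]) for i in range(len(full))]


def lower_triangle_flat(full):
    """The same left-lower values, concatenated row by row into one list."""
    return [value for row in lower_triangle_rows(full) for value in row]


full = [
    [0.0, 1.1, 2.3],
    [1.1, 0.0, 4.5],
    [2.3, 4.5, 0.0],
]
rows = lower_triangle_rows(full)
flat = lower_triangle_flat(full)
```

Because both helpers build new lists, the original matrix in full is left untouched, which matches the advice above to keep a separate copy of the distances.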
method
The linkage method to be used:
method=='s': pairwise single-linkage clustering
method=='m': pairwise maximum- (or complete-) linkage clustering
method=='a': pairwise average-linkage clustering
While pairwise single-, maximum-, and average-linkage clustering can be calculated from the distance matrix alone, pairwise centroid-linkage cannot.
When calling treecluster, either data or distancematrix should be None.
This function returns a Tree object. This object contains (number of items - 1) nodes, where the number of items is the number of rows if rows were clustered, or the number of columns if columns were clustered. Each node describes a pairwise linking event, where the node attributes left and right each contain the number of one item or subnode, and distance the distance between them. Items are numbered from 0 to (number of items - 1), while clusters are numbered -1 to -(number of items - 1).
15.5 Self-Organizing Maps
Self-Organizing Maps (SOMs) were invented by Kohonen to describe neural networks (see for instance Kohonen, 1997 [24]). Tamayo (1999) first applied Self-Organizing Maps to gene expression data [30].
SOMs organize items into clusters that are situated in some topology. Usually a rectangular topology is chosen. The clusters generated by SOMs are such that neighboring clusters in the topology are more similar to each other than clusters far from each other in the topology.
The first step to calculate a SOM is to randomly assign a data vector to each cluster in the topology. If rows are being clustered, then the number of elements in each data vector is equal to the number of columns.
An SOM is then generated by taking rows one at a time, and finding which cluster in the topology has the closest data vector. The data vector of that cluster, as well as those of the neighboring clusters, are adjusted using the data vector of the row under consideration. The adjustment is given by
$$\Delta x_{\mathrm{cell}} = \tau \cdot \left( x_{\mathrm{row}} - x_{\mathrm{cell}} \right).$$
The parameter $\tau$ decreases at each iteration step. We have used a simple linear function of the iteration step:
$$\tau = \tau_{\mathrm{init}} \cdot \left( 1 - \frac{i}{n} \right),$$
where $\tau_{\mathrm{init}}$ is the initial value of $\tau$ as specified by the user, $i$ is the number of the current iteration step, and $n$ is the total number of iteration steps to be performed. While changes are made rapidly in the beginning of the iteration, at the end of iteration only small changes are made.
All clusters within a radius $R$ are adjusted to the gene under consideration. This radius decreases as the calculation progresses as
$$R = R_{\max} \cdot \left( 1 - \frac{i}{n} \right),$$
in which the maximum radius is defined as
$$R_{\max} = \sqrt{N_x^2 + N_y^2},$$
where $(N_x, N_y)$ are the dimensions of the rectangle defining the topology.
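The update rule and the two linear schedules can be written out as a short sketch. It assumes that the adjustment moves the cell's data vector a fraction tau of the way toward the row, and that both tau and the radius shrink linearly with the iteration step as described above. The function names som_schedule and adjust_cell are illustrative only and are not part of Bio.Cluster.

```python
import math


def som_schedule(step, nsteps, inittau=0.02, nxgrid=2, nygrid=1):
    """Return (tau, radius) for iteration step i = step out of n = nsteps."""
    fraction_left = 1.0 - step / nsteps
    tau = inittau * fraction_left  # tau = tau_init * (1 - i/n)
    rmax = math.hypot(nxgrid, nygrid)  # R_max = sqrt(Nx^2 + Ny^2)
    return tau, rmax * fraction_left  # R = R_max * (1 - i/n)


def adjust_cell(cell, row, tau):
    """Move a cell's data vector toward the row by the fraction tau."""
    return [c + tau * (r - c) for c, r in zip(cell, row)]


tau_start, radius_start = som_schedule(0, 100, inittau=0.02, nxgrid=3, nygrid=4)
tau_half, radius_half = som_schedule(50, 100, inittau=0.02, nxgrid=3, nygrid=4)
moved = adjust_cell([0.0, 2.0], [1.0, 0.0], tau=0.5)
```

On a 3 by 4 grid the radius starts at 5 cells and has halved by the midpoint of the run, so only cells close to the winning cell are still being adjusted late in the calculation.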
The function somcluster implements the complete algorithm to calculate a Self-Organizing Map on a rectangular grid. First it initializes the random number generator. The node data are then initialized using the random number generator. The order in which genes or microarrays are used to modify the SOM is also randomized. The total number of iterations in the SOM algorithm is specified by the user.
To run somcluster, use
>>> from Bio.Cluster import somcluster
>>> clusterid, celldata = somcluster(data)
where the following arguments are defined:
data (required)
Array containing the data for the items.
mask (default: None)
Array of integers showing which data are missing. If mask[i][j]==0, then data[i][j] is missing. If mask==None, then all data are present.
weight (default: None)
Contains the weights to be used when calculating distances. If weight==None, then equal weights are assumed.
nxgrid, nygrid (default: 2, 1)
The number of cells horizontally and vertically in the rectangular grid on which the Self-Organizing Map is calculated.
inittau (default: 0.02)
The initial value for the parameter τ that is used in the SOM algorithm. The default value for inittau is 0.02, which was used in Michael Eisen's Cluster/TreeView program.
niter (default: 1)
The number of iterations to be performed.
This function returns the tuple (clusterid, celldata):
clusterid:
An array with two columns, where the number of rows is equal to the number of items that were clustered. Each row contains the x and y coordinates of the cell in the rectangular SOM grid to which the item was assigned.
celldata:
An array with dimensions (nxgrid, nygrid, number of columns) if rows are being clustered, or (nxgrid, nygrid, number of rows) if columns are being clustered. Each element [ix][iy] of this array is a 1D vector containing the gene expression data for the centroid of the cluster in the grid cell with coordinates [ix][iy].
15.6 Principal Component Analysis
Principal Component Analysis (PCA) is a widely used technique for analyzing multivariate data. A practical example of applying Principal Component Analysis to gene expression data is presented by Yeung and Ruzzo (2001) [33].
In essence, PCA is a coordinate transformation in which each row in the data matrix is written as a linear sum over basis vectors called principal components, which are ordered and chosen such that each maximally explains the remaining variance in the data vectors. For example, an n × 3 data matrix can be represented as an ellipsoidal cloud of n points in three-dimensional space. The first principal component is the longest axis of the ellipsoid, the second principal component the second longest axis of the ellipsoid, and the third principal component is the shortest axis. Each row in the data matrix can be reconstructed as a suitable linear combination of the principal components. However, in order to reduce the dimensionality of the data, usually only the most important principal components are retained. The remaining variance present in the data is then regarded as unexplained variance.
The principal components can be found by calculating the eigenvectors of the covariance matrix of the data. The corresponding eigenvalues determine how much of the variance present in the data is explained by each principal component.
Before applying principal component analysis, typically the mean is subtracted from each column in the data matrix. In the example above, this effectively centers the ellipsoidal cloud around its centroid in 3D space, with the principal components describing the variation of points in the ellipsoidal cloud with respect to their centroid.
The function pca below first uses the singular value decomposition to calculate the eigenvalues and eigenvectors of the data matrix. The singular value decomposition is implemented as a translation in C of the Algol procedure svd [16], which uses Householder bidiagonalization and a variant of the QR algorithm. The principal components, the coordinates of each data vector along the principal components, and the eigenvalues corresponding to the principal components are then evaluated and returned in decreasing order of the magnitude of the eigenvalue. If data centering is desired, the mean should be subtracted from each column in the data matrix before calling the pca routine.
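Column centering itself is a small operation. The sketch below shows one way to do it with plain Python lists; the helper name center_columns and the data values are assumptions made for the example, and with Numerical Python arrays the same result is usually obtained with a single array subtraction.

```python
def center_columns(data):
    """Subtract the column mean from every column of a list-of-rows matrix."""
    nrows = len(data)
    columnmean = [sum(column) / nrows for column in zip(*data)]
    centered = [
        [value - m for value, m in zip(row, columnmean)] for row in data
    ]
    return columnmean, centered


# Illustrative 3 x 2 data matrix (three rows, two columns).
data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
columnmean, centered = center_columns(data)
```

After centering, each column sums to zero, so the cloud of points is positioned around its centroid.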
To apply Principal Component Analysis to a rectangular matrix data, use
>>> from Bio.Cluster import pca
>>> columnmean, coordinates, components, eigenvalues = pca(data)
This function returns a tuple columnmean, coordinates, components, eigenvalues:
columnmean
Array containing the mean over each column in data.
coordinates
The coordinates of each row in data with respect to the principal components.
components
The principal components.
eigenvalues
The eigenvalues corresponding to each of the principal components.
The original matrix data can be recreated by calculating columnmean + dot(coordinates, components).
15.7 Handling Cluster/TreeView-type files
Cluster/TreeView are GUI-based codes for clustering gene expression data. They were originally written by Michael Eisen while at Stanford University. Bio.Cluster contains functions for reading and writing data files that correspond to the format specified for Cluster/TreeView. In particular, by saving a clustering result in that format, TreeView can be used to visualize the clustering results. We recommend using Alok Saldanha's Java TreeView program (http://jtreeview.sourceforge.net/), which can display hierarchical as well as k-means clustering results.
An object of the class Record contains all information stored in a Cluster/TreeView-type data file. To store the information contained in the data file in a Record object, we first open the file and then read it:
>>> from Bio import Cluster
>>> with open("mydatafile.txt") as handle:
...     record = Cluster.read(handle)
...
This two-step process gives you some flexibility in the source of the data. For example, you can use
>>> import gzip  # Python standard library
>>> handle = gzip.open("mydatafile.txt.gz", "rt")
to open a gzipped file, or
>>> from io import TextIOWrapper  # Python standard library
>>> from urllib.request import urlopen  # Python standard library (Python 3)
>>> handle = TextIOWrapper(urlopen("http://somewhere.org/mydatafile.txt"))
to open a file stored on the Internet before calling read.
The read command reads the tab-delimited text file mydatafile.txt containing gene expression data in the format specified for Michael Eisen's Cluster/TreeView program. In this file format, rows represent genes and columns represent samples or observations. For a simple time course, a minimal input file would look like this:
YORF      0 minutes   30 minutes   1 hour   2 hours   4 hours
YAL001C   1           1.3          2.4      5.8       2.4
YAL002W   0.9         0.8          0.7      0.5       0.2
YAL003W   0.8         2.1          4.2      10.1      10.1
YAL005C   1.1         1.3          0.8                0.4
YAL010C   1.2         1            1.1      4.5       8.3
Each row (gene) has an identifier that always goes in the first column. In this example, we are using yeast open reading frame codes. Each column (sample) has a label in the first row. In this example, the labels describe the time at which a sample was taken. The first column of the first row contains a special field that tells the program what kind of objects are in each row. In this case, YORF stands for yeast open reading frame. This field can be any alphanumeric value. The remaining cells in the table contain data for the appropriate gene and sample. The 5.8 in row 2 column 4 means that the observed value for gene YAL001C at 2 hours was 5.8. Missing values are acceptable and are designated by empty cells (e.g. YAL005C at 2 hours).
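To illustrate how such a file maps onto a data array and a mask, here is a simplified reader written with the Python standard library only. It is a sketch for the minimal layout described above, not the parser used by Cluster.read; the function name parse_table and the choice to store 0.0 for a missing value are assumptions made for the example.

```python
import csv
import io

# Two data rows of the minimal file; the empty cell marks a missing value.
text = (
    "YORF\t0 minutes\t30 minutes\t1 hour\t2 hours\t4 hours\n"
    "YAL001C\t1\t1.3\t2.4\t5.8\t2.4\n"
    "YAL005C\t1.1\t1.3\t0.8\t\t0.4\n"
)


def parse_table(handle):
    """Read a minimal tab-delimited table into identifiers, data, and mask."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    uniqid, expid = header[0], header[1:]
    geneid, data, mask = [], [], []
    for fields in reader:
        if not fields:  # skip blank lines
            continue
        geneid.append(fields[0])
        # Pad short rows so that trailing empty cells count as missing.
        cells = fields[1:] + [""] * (len(expid) - len(fields) + 1)
        data.append([float(c) if c.strip() else 0.0 for c in cells])
        mask.append([1 if c.strip() else 0 for c in cells])  # 0 means missing
    return uniqid, expid, geneid, data, mask


uniqid, expid, geneid, data, mask = parse_table(io.StringIO(text))
```

The mask follows the convention used elsewhere in this chapter: an entry of 0 flags the corresponding data value as missing.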
The input file may contain additional information. A maximal input file would look like this:
YORF      NAME                         GWEIGHT   GORDER   0     30    1     2      4
EWEIGHT                                                   1     1     1     1      0
EORDER                                                    5     3     2     1      1
YAL001C   TFIIIC 138 KD SUBUNIT        1         1        1     1.3   2.4   5.8    2.4
YAL002W   UNKNOWN                      0.4       3        0.9   0.8   0.7   0.5    0.2
YAL003W   ELONGATION FACTOR EF1-BETA   0.4       2        0.8   2.1   4.2   10.1   10.1
YAL005C   CYTOSOLIC HSP70              0.4       5        1.1   1.3   0.8          0.4
The added columns NAME, GWEIGHT, and GORDER and rows EWEIGHT and EORDER are optional. The NAME column allows you to specify a label for each gene that is distinct from the ID in column 1.
A Record object has the following attributes:
data
The data array containing the gene expression data. Genes are stored row-wise, while microarrays are stored column-wise.
mask
This array shows which elements in the data array, if any, are missing. If mask[i,j]==0, then data[i,j] is missing. If no data were found to be missing, mask is set to None.
geneid
This is a list containing a unique description for each gene (i.e., ORF numbers).
genename
This is a list containing a description for each gene (i.e., gene name). If not present in the data file, genename is set to None.
gweight
The weights that are to be used to calculate the distance in expression profile between genes. If not present in the data file, gweight is set to None.
gorder
The preferred order in which genes should be stored in an output file. If not present in the data file, gorder is set to None.
expid
This is a list containing a description of each microarray, e.g. experimental condition.
eweight
The weights that are to be used to calculate the distance in expression profile between microarrays. If not present in the data file, eweight is set to None.
eorder
The preferred order in which microarrays should be stored in an output file. If not present in the data file, eorder is set to None.
uniqid
The string that was used instead of UNIQID in the data file.
After loading a Record object, each of these attributes can be accessed and modified directly. For example, the data can be log-transformed by taking the logarithm of record.data.
Calculating the distance matrix
To calculate the distance matrix between the items stored in the record, use
>>> matrix = record.distancematrix()
where the following arguments are defined:
transpose
Determines if the distances between the rows of data are to be calculated (transpose==0), or between the columns of data (transpose==1).
This function returns the distance matrix as a list of rows, where the number of columns of each row is equal to the row number (see section 15.1).
Calculating the cluster centroids
To calculate the centroids of clusters of items stored in the record, use
>>> cdata, cmask = record.clustercentroids()
clusterid (default: None)
Vector of integers showing to which cluster each item belongs. If clusterid is not given, then all items are assumed to belong to the same cluster.
method (default: 'a')
Specifies whether the arithmetic mean (method=='a') or the median (method=='m') is used to calculate the cluster center.
transpose
Determines if the centroids of the rows of data are to be calculated (transpose==0), or the centroids of the columns of data (transpose==1).
This function returns the tuple cdata, cmask; see section 15.2 for a description.
Calculating the distance between clusters
To calculate the distance between clusters of items stored in the record, use
>>> distance = record.clusterdistance()
index1 (default: 0)
A list containing the indices of the items belonging to the first cluster. A cluster containing only one item i can be represented either as a list [i], or as an integer i.
index2 (default: 0)
A list containing the indices of the items belonging to the second cluster. A cluster containing only one item i can be represented either as a list [i], or as an integer i.
method (default: 'a')
Specifies how the distance between clusters is defined:
'a': Distance between the two cluster centroids (arithmetic mean)
'm': Distance between the two cluster centroids (median)
's': Shortest pairwise distance between items in the two clusters
'x': Longest pairwise distance between items in the two clusters
'v': Average over the pairwise distances between items in the two clusters.
transpose
If transpose==0, calculate the distance between the rows of data. If transpose==1, calculate the distance between the columns of data.
Performing hierarchical clustering
To perform hierarchical clustering on the items stored in the record, use
>>> tree = record.treecluster()
transpose
Determines if genes or microarrays are being clustered. If transpose==0, genes (rows) are being clustered. If transpose==1, microarrays (columns) are clustered.
method (default: 'm')
Defines the linkage method to be used:
method=='s': pairwise single-linkage clustering
method=='m': pairwise maximum- (or complete-) linkage clustering
method=='c': pairwise centroid-linkage clustering
method=='a': pairwise average-linkage clustering
This function returns a Tree object. This object contains (number of items - 1) nodes, where the number of items is the number of rows if rows were clustered, or the number of columns if columns were clustered. Each node describes a pairwise linking event, where the node attributes left and right each contain the number of one item or subnode, and distance the distance between them. Items are numbered from 0 to (number of items - 1), while clusters are numbered -1 to -(number of items - 1).
Performing k-means or k-medians clustering
To perform k-means or k-medians clustering on the items stored in the record, use
>>> clusterid, error, nfound = record.kcluster()
npass (default: 1)
The number of times the k-means/-medians clustering algorithm is performed, each time with a different (random) initial condition. If initialid is given, the value of npass is ignored and the clustering algorithm is run only once, as it behaves deterministically in that case.
method (default: 'a')
Describes how the center of a cluster is found:
method=='a': arithmetic mean (k-means clustering);
method=='m': median (k-medians clustering).
For other values of method, the arithmetic mean is used.
This function returns a tuple (clusterid, error, nfound), where clusterid is an integer array containing the number of the cluster to which each row or column was assigned, error is the within-cluster sum of distances for the optimal clustering solution, and nfound is the number of times this optimal solution was found.
Calculating a Self-Organizing Map
To calculate a Self-Organizing Map of the items stored in the record, use
>>> clusterid, celldata = record.somcluster()
nxgrid, nygrid (default: 2, 1)
The number of cells horizontally and vertically in the rectangular grid on which the Self-Organizing Map is calculated.
inittau (default: 0.02)
The initial value for the parameter τ that is used in the SOM algorithm. The default value for inittau is 0.02, which was used in Michael Eisen's Cluster/TreeView program.
niter (default: 1)
The number of iterations to be performed.
This function returns the tuple (clusterid, celldata):
clusterid:
An array with two columns, where the number of rows is equal to the number of items that were clustered. Each row contains the x and y coordinates of the cell in the rectangular SOM grid to which the item was assigned.
celldata:
An array with dimensions (nxgrid, nygrid, number of columns) if rows are being clustered, or (nxgrid, nygrid, number of rows) if columns are being clustered. Each element [ix][iy] of this array is a 1D vector containing the gene expression data for the centroid of the cluster in the grid cell with coordinates [ix][iy].
Saving the clustering result
To save the clustering result, use
>>> record.save(jobname, geneclusters, expclusters)
jobname
The string jobname is used as the base name for names of the files that are to be saved.
geneclusters
This argument describes the gene (row-wise) clustering result. In case of k-means clustering, this is a 1D array containing the number of the cluster each gene belongs to. It can be calculated using kcluster. In case of hierarchical clustering, geneclusters is a Tree object.
expclusters
This argument describes the (column-wise) clustering result for the experimental conditions. In case of k-means clustering, this is a 1D array containing the number of the cluster each experimental condition belongs to. It can be calculated using kcluster. In case of hierarchical clustering, expclusters is a Tree object.
This method writes the text files jobname.cdt, jobname.gtr, jobname.atr, jobname*.kgg, and/or jobname*.kag for subsequent reading by the Java TreeView program. If geneclusters and expclusters are both None, this method only writes the text file jobname.cdt; this file can subsequently be read into a new Record object.
15.8 Example calculation
This is an example of a hierarchical clustering calculation, using single-linkage clustering for genes and maximum-linkage clustering for experimental conditions. As the Euclidean distance is being used for gene clustering, it is necessary to scale the node distances in genetree such that they are all between zero and one. This is needed for the Java TreeView code to display the tree diagram correctly. To cluster the experimental conditions, the uncentered correlation is being used. No scaling is needed in this case, as the distances in exptree are already between zero and two. The example data cyano.txt can be found in the data subdirectory.
>>> from Bio import Cluster
>>> with open("cyano.txt") as handle:
...     record = Cluster.read(handle)
...
>>> genetree = record.treecluster(method='s')
>>> genetree.scale()
>>> exptree = record.treecluster(dist='u', transpose=1)
>>> record.save("cyano_result", genetree, exptree)
This will create the files cyano_result.cdt, cyano_result.gtr, and cyano_result.atr.
Similarly, we can save a k-means clustering solution:
>>> with open("cyano.txt") as handle:
...     record = Cluster.read(handle)
...
>>> (geneclusters, error, ifound) = record.kcluster(nclusters=5, npass=1000)
>>> (expclusters, error, ifound) = record.kcluster(nclusters=2, npass=100, transpose=1)
>>> record.save("cyano_result", geneclusters, expclusters)
This will create the files cyano_result_K_G2_A2.cdt, cyano_result_K_G2.kgg, and cyano_result_K_A2.kag.
15.9 Auxiliary functions
median(data) returns the median of the 1D array data.
mean(data) returns the mean of the 1D array data.
version() returns the version number of the underlying C Clustering Library as a string.
Chapter 16 Supervised learning methods
Note the supervised learning methods described in this chapter all require Numerical Python (numpy) to be installed.
16.1 The Logistic Regression Model
16.1.1 Background and Purpose
Logistic regression is a supervised learning approach that attempts to distinguish K classes from each other using a weighted sum of some predictor variables $x_i$. The logistic regression model is used to calculate the weights $\beta_i$ of the predictor variables. In Biopython, the logistic regression model is currently implemented for two classes only (K = 2); the number of predictor variables has no predefined limit.
As an example, let's try to predict the operon structure in bacteria. An operon is a set of adjacent genes on the same strand of DNA that are transcribed into a single mRNA molecule. Translation of the single mRNA molecule then yields the individual proteins. For Bacillus subtilis, whose data we will be using, the average number of genes in an operon is about 2.4.
As a first step in understanding gene regulation in bacteria, we need to know the operon structure. For about 10% of the genes in Bacillus subtilis, the operon structure is known from experiments. A supervised learning method can be used to predict the operon structure for the remaining 90% of the genes.
For such a supervised learning approach, we need to choose some predictor variables $x_i$ that can be measured easily and are somehow related to the operon structure. One predictor variable might be the distance in base pairs between genes. Adjacent genes belonging to the same operon tend to be separated by a relatively short distance, whereas adjacent genes in different operons tend to have a larger space between them to allow for promoter and terminator sequences. Another predictor variable is based on gene expression measurements. By definition, genes belonging to the same operon have equal gene expression profiles, while genes in different operons are expected to have different expression profiles. In practice, the measured expression profiles of genes in the same operon are not quite identical due to the presence of measurement errors. To assess the similarity in the gene expression profiles, we assume that the measurement errors follow a normal distribution and calculate the corresponding log-likelihood score.
We now have two predictor variables that we can use to predict if two adjacent genes on the same strand of DNA belong to the same operon:
x1: the number of base pairs between them;
x2: their similarity in expression profile.
In a logistic regression model, we use a weighted sum of these two predictors to calculate a joint score S:
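$$S = \beta_0 + \beta_1 x_1 + \beta_2 x_2. \qquad (16.1)$$
The logistic regression model gives us appropriate values for the parameters $\beta_0$, $\beta_1$, $\beta_2$ using two sets of example genes:
OP: Adjacent genes, on the same strand of DNA, known to belong to the same operon;
NOP: Adjacent genes, on the same strand of DNA, known to belong to different operons.
In the logistic regression model, the probability of belonging to a class depends on the score via the logistic function. For the two classes OP and NOP, we can write this as
$$\Pr(\mathrm{OP} \mid x_1, x_2) = \frac{\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}{1 + \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2)} \qquad (16.2)$$
$$\Pr(\mathrm{NOP} \mid x_1, x_2) = \frac{1}{1 + \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2)} \qquad (16.3)$$
Using a set of gene pairs for which it is known whether they belong to the same operon (class OP) or to different operons (class NOP), we can calculate the weights $\beta_0$, $\beta_1$, $\beta_2$ by maximizing the log-likelihood corresponding to the probability functions (16.2) and (16.3).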
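Equations (16.2) and (16.3) are easy to evaluate by hand once the weights are known. The sketch below does so with the standard library only; the function name class_probabilities is an assumption for illustration and is not part of Bio.LogisticRegression. The weights are the ones reported for the trained model later in this section, taking the weight of the intergene distance as negative (as the text states) and the expression scores as negative log-likelihood values.

```python
import math


def class_probabilities(beta, x):
    """Return (Pr(OP | x), Pr(NOP | x)) following equations (16.2) and (16.3)."""
    score = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))  # S, eq. (16.1)
    p_op = 1.0 / (1.0 + math.exp(-score))  # equals exp(S) / (1 + exp(S))
    return p_op, 1.0 - p_op


# Weights beta_0, beta_1, beta_2 as printed for model.beta (signs as assumed above).
beta = [8.9830290157144681, -0.035968960444850887, 0.02181395662983519]

# A gene pair 6 base pairs apart with expression score -173.143442352.
p_op_close, p_nop_close = class_probabilities(beta, [6, -173.143442352])
# A gene pair 309 base pairs apart with expression score -271.005880394.
p_op_far, p_nop_far = class_probabilities(beta, [309, -271.005880394])
```

The two probabilities always sum to one, so a short intergene distance combined with a high similarity score pushes the pair toward class OP.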
16.1.2 Training the logistic regression model
Table 16.1: Adjacent gene pairs known to belong to the same operon (class OP) or to different operons (class NOP). Intergene distances are negative if the two genes overlap.
Gene pair     Intergene distance (x1)   Gene expression score (x2)   Class
cotJA-cotJB   -53                       -200.78                      OP
yesK-yesL     117                       -267.14                      OP
lplA-lplB     57                        -163.47                      OP
lplB-lplC     16                        -190.30                      OP
lplC-lplD     11                        -220.94                      OP
lplD-yetF     85                        -193.94                      OP
yfmT-yfmS     16                        -182.71                      OP
yfmF-yfmE     15                        -180.41                      OP
citS-citT     -26                       -181.73                      OP
citM-yflN     58                        -259.87                      OP
yfiI-yfiJ     126                       -414.53                      NOP
lipB-yfiQ     191                       -249.57                      NOP
yfiU-yfiV     113                       -265.28                      NOP
yfhH-yfhI     145                       -312.99                      NOP
cotY-cotX     154                       -213.83                      NOP
yjoB-rapA     147                       -380.85                      NOP
ptsI-splA     93                        -291.13                      NOP
Table 16.1 lists some of the Bacillus subtilis gene pairs for which the operon structure is known. Let's calculate the logistic regression model from these data:
>>> from Bio import LogisticRegression
>>> xs = [[-53, -200.78],
...       [117, -267.14],
...       [57, -163.47],
...       [16, -190.30],
...       [11, -220.94],
...       [85, -193.94],
...       [16, -182.71],
...       [15, -180.41],
...       [-26, -181.73],
...       [58, -259.87],
...       [126, -414.53],
...       [191, -249.57],
...       [113, -265.28],
...       [145, -312.99],
...       [154, -213.83],
...       [147, -380.85],
...       [93, -291.13]]
>>> ys = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
...       0, 0, 0, 0, 0, 0, 0]
>>> model = LogisticRegression.train(xs, ys)
Here, xs and ys are the training data: xs contains the predictor variables for each gene pair, and ys specifies if the gene pair belongs to the same operon (1, class OP) or different operons (0, class NOP). The resulting logistic regression model is stored in model, which contains the weights $\beta_0$, $\beta_1$, and $\beta_2$:
>>> model.beta
[8.9830290157144681, -0.035968960444850887, 0.02181395662983519]
Note that $\beta_1$ is negative, as gene pairs with a shorter intergene distance have a higher probability of belonging to the same operon (class OP). On the other hand, $\beta_2$ is positive, as gene pairs belonging to the same operon typically have a higher similarity score of their gene expression profiles. The parameter $\beta_0$ is positive due to the higher prevalence of operon gene pairs than non-operon gene pairs in the training data.
The function train has two optional arguments: update_fn and typecode. The update_fn can be used to specify a callback function, taking as arguments the iteration number and the log-likelihood. With the callback function, we can for example track the progress of the model calculation (which uses a Newton-Raphson iteration to maximize the log-likelihood function of the logistic regression model):
>>> def show_progress(iteration, loglikelihood):
...     print("Iteration:", iteration, "Log-likelihood function:", loglikelihood)
...
>>> model = LogisticRegression.train(xs, ys, update_fn=show_progress)
Iteration: 0 Log-likelihood function: -11.7835020695
The iteration stops once the increase in the log-likelihood function is less than 0.01. If no convergence is reached after 500 iterations, the train function returns with an AssertionError.
The optional keyword typecode can almost always be ignored. This keyword allows the user to choose the type of Numeric matrix to use. In particular, to avoid memory problems for very large problems, it may be necessary to use single-precision floats (Float8, Float16, etc.) rather than double, which is used by default.
16.1.3 Using the logistic regression model for classification
Classification is performed by calling the classify function. Given a logistic regression model and the values for x1 and x2 (e.g. for a gene pair of unknown operon structure), the classify function returns 1 or 0, corresponding to class OP and class NOP, respectively. For example, let's consider the gene pairs yxcE, yxcD and yxiB, yxiA:
Table 16.2: Adjacent gene pairs of unknown operon status.
Gene pair   Intergene distance (x1)   Gene expression score (x2)
yxcE-yxcD   6                         -173.143442352
yxiB-yxiA   309                       -271.005880394
The logistic regression model classifies yxcE, yxcD as belonging to the same operon (class OP), while yxiB, yxiA are predicted to belong to different operons:
>>> print("yxcE, yxcD:", LogisticRegression.classify(model, [6, -173.143442352]))
yxcE, yxcD: 1
>>> print("yxiB, yxiA:", LogisticRegression.classify(model, [309, -271.005880394]))
yxiB, yxiA: 0
(which, by the way, agrees with the biological literature).
To find out how confident we can be in these predictions, we can call the calculate function to obtain the probabilities (equations (16.2) and (16.3)) for class OP and NOP. For yxcE, yxcD we find
>>> q, p = LogisticRegression.calculate(model, [6, -173.143442352])
>>> print("class OP: probability =", p, "class NOP: probability =", q)
class OP: probability = 0.993242163503 class NOP: probability = 0.00675783649744
and for yxiB, yxiA
>>> q, p = LogisticRegression.calculate(model, [309, -271.005880394])
>>> print("class OP: probability =", p, "class NOP: probability =", q)
class OP: probability = 0.000321211251817 class NOP: probability = 0.999678788748
To get some idea of the prediction accuracy of the logistic regression model, we can apply it to the training data:
>>> for i in range(len(ys)):
...     print("True:", ys[i], "Predicted:", LogisticRegression.classify(model, xs[i]))
...
True: 1 Predicted: 1
True: 1 Predicted: 0
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
showing that the prediction is correct for all but one of the gene pairs. A more reliable estimate of the prediction accuracy can be found from a leave-one-out analysis, in which the model is recalculated from the training data after removing the gene to be predicted:
>>> for i in range(len(ys)):
...     model = LogisticRegression.train(xs[:i] + xs[i+1:], ys[:i] + ys[i+1:])
...     print("True:", ys[i], "Predicted:", LogisticRegression.classify(model, xs[i]))
...
True: 1 Predicted: 1
True: 1 Predicted: 0
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 1
True: 0 Predicted: 0
True: 0 Predicted: 0
The leave-one-out analysis shows that the prediction of the logistic regression model is incorrect for only two of the 17 gene pairs, which corresponds to a prediction accuracy of 88%.
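The 88% figure can be checked directly from the printed labels. The following is a minimal sketch in plain Python that does not call Biopython; the helper name accuracy and the two lists are introduced here for illustration only, with the labels copied from the leave-one-out output above (1 = OP, 0 = NOP).

```python
# Labels copied from the leave-one-out output above (1 = OP, 0 = NOP).
true_labels = [1] * 10 + [0] * 7
predicted = [1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0]


def accuracy(truth, guesses):
    """Return the fraction of positions where the guess equals the truth."""
    if len(truth) != len(guesses):
        raise ValueError("label lists must have the same length")
    hits = sum(1 for t, g in zip(truth, guesses) if t == g)
    return hits / len(truth)


loo_accuracy = accuracy(true_labels, predicted)
print("Accuracy: %.0f%%" % (100 * loo_accuracy))
```

Fifteen of the 17 predictions match, and 15/17 rounds to 88%.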
16.1.4 Logistic Regression, Linear Discriminant Analysis, and Support Vector Machines
The logistic regression model is similar to linear discriminant analysis. In linear discriminant analysis, the class probabilities also follow equations (16.2) and (16.3). However, instead of estimating the coefficients directly, we first fit a normal distribution to the predictor variables x. The coefficients are then calculated from the means and covariances of the normal distribution. If the distribution of x is indeed normal, then we expect linear discriminant analysis to perform better than the logistic regression model. The logistic regression model, on the other hand, is more robust to deviations from normality.
Another similar approach is a support vector machine with a linear kernel. Such an SVM also uses a linear combination of the predictors, but estimates the coefficients from the predictor variables x near the boundary region between the classes. If the logistic regression model (equations (16.2) and (16.3)) is a good description for x away from the boundary region, we expect the logistic regression model to perform better than an SVM with a linear kernel, as it relies on more data. If not, an SVM with a linear kernel may perform better.
Trevor Hastie, Robert Tibshirani, and Jerome Friedman: The Elements of Statistical Learning. Data Mining, Inference, and Prediction. Springer Series in Statistics, 2001. Chapter 4.4.
16.2 k-Nearest Neighbors
16.2.1 Background and purpose
The k-nearest neighbors method is a supervised learning approach that does not need to fit a model to the data. Instead, data points are classified based on the categories of the k nearest neighbors in the training data set.
In Biopython, the k-nearest neighbors method is available in Bio.kNN. To illustrate the use of the k-nearest neighbor method in Biopython, we will use the same operon data set as in Section 16.1.
16.2.2 Initializing a k-nearest neighbors model
Using the data in Table 16.1, we create and initialize a k-nearest neighbors model as follows:
>>> from Bio import kNN
>>> k = 3
>>> model = kNN.train(xs, ys, k)
where xs and ys are the same as in Section 16.1.2. Here, k is the number of neighbors that will be considered for the classification. For classification into two classes, choosing an odd number for k lets you avoid tied votes. The function name train is a bit of a misnomer, since no model training is done: this function simply stores xs, ys, and k in model.
16.2.3 Using a k-nearest neighbors model for classification
To classify new data using the k-nearest neighbors model, we use the classify function. This function takes a data point (x1, x2) and finds the k nearest neighbors in the training data set xs. The data point (x1, x2) is then classified based on which category (ys) occurs most among the k neighbors.
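The voting rule just described can be sketched in a few lines of plain Python. This is an illustration of the idea only, not the Bio.kNN implementation; the function names euclidean and knn_classify and the toy data are assumptions made for this sketch.

```python
from collections import Counter
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_classify(xs, ys, k, query, distance_fn=euclidean):
    """Return the most common label among the k training points nearest to query."""
    # Sort training indices by their distance to the query point.
    order = sorted(range(len(xs)), key=lambda i: distance_fn(query, xs[i]))
    votes = Counter(ys[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy training data (hypothetical): two well separated clusters.
toy_xs = [[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [10, 11]]
toy_ys = [1, 1, 1, 0, 0, 0]

print(knn_classify(toy_xs, toy_ys, 3, [1, 1]))
print(knn_classify(toy_xs, toy_ys, 3, [9, 9]))
```

With k = 3 and two classes a vote can never be tied, which is why an odd k is recommended above.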
For the example of the gene pairs yxcE, yxcD and yxiB, yxiA, we find:
>>> x = [6, -173.143442352]
>>> print("yxcE, yxcD:", kNN.classify(model, x))
yxcE, yxcD: 1
>>> x = [309, -271.005880394]
>>> print("yxiB, yxiA:", kNN.classify(model, x))
yxiB, yxiA: 0
In agreement with the logistic regression model, yxcE, yxcD are classified as belonging to the same operon (class OP), while yxiB, yxiA are predicted to belong to different operons.
The classify function lets us specify both a distance function and a weight function as optional arguments. The distance function affects which k neighbors are chosen as the nearest neighbors, as these are defined as the neighbors with the smallest distance to the query point. By default, the Euclidean distance is used. Instead, we could for example use the city-block (Manhattan) distance:
>>> def cityblock(x1, x2):
...     assert len(x1) == 2
...     assert len(x2) == 2
...     distance = abs(x1[0] - x2[0]) + abs(x1[1] - x2[1])
...     return distance
...
>>> x = [6, -173.143442352]
>>> print("yxcE, yxcD:", kNN.classify(model, x, distance_fn=cityblock))
yxcE, yxcD: 1
The weight function can be used for weighted voting. For example, we may want to give closer neighbors a higher weight than neighbors that are further away:
>>> from math import exp
>>> def weight(x1, x2):
...     assert len(x1) == 2
...     assert len(x2) == 2
...     return exp(-abs(x1[0] - x2[0]) - abs(x1[1] - x2[1]))
...
>>> x = [6, -173.143442352]
>>> print("yxcE, yxcD:", kNN.classify(model, x, weight_fn=weight))
yxcE, yxcD: 1
By default, all neighbors are given an equal weight.
To find out how confident we can be in these predictions, we can call the calculate function, which will calculate the total weight assigned to the classes OP and NOP. For the default weighting scheme, this reduces to the number of neighbors in each category. The result is indexed by class label, so weight[1] is the total for class OP and weight[0] the total for class NOP. For yxcE, yxcD, we find
>>> x = [6, -173.143442352]
>>> weight = kNN.calculate(model, x)
>>> print("class OP: weight =", weight[1], "class NOP: weight =", weight[0])
class OP: weight = 3.0 class NOP: weight = 0.0
which means that all three neighbors of x1, x2 are in the OP class, consistent with the classification above. As another example, for yesK, yesL we find
>>> x = [117, -267.14]
>>> weight = kNN.calculate(model, x)
>>> print("class OP: weight =", weight[1], "class NOP: weight =", weight[0])
class OP: weight = 1.0 class NOP: weight = 2.0
which means that one neighbor is an operon pair and two neighbors are non-operon pairs.
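To see how a weight function can change the outcome of the vote, here is a small plain-Python sketch of the tally. It is not the Bio.kNN code; the names knn_weights and exp_weight and the toy data are hypothetical, and the tally is assumed to be a dictionary keyed by class label.

```python
from math import exp


def cityblock(a, b):
    """City-block (Manhattan) distance between two points."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))


def exp_weight(a, b):
    """Weight that decays exponentially with the city-block distance."""
    return exp(-cityblock(a, b))


def knn_weights(xs, ys, k, query, weight_fn=None, distance_fn=cityblock):
    """Return {class label: total weight} over the k nearest training points."""
    if weight_fn is None:
        # Default scheme: every neighbor counts as one vote.
        weight_fn = lambda a, b: 1.0
    order = sorted(range(len(xs)), key=lambda i: distance_fn(query, xs[i]))
    totals = {label: 0.0 for label in set(ys)}
    for i in order[:k]:
        totals[ys[i]] += weight_fn(query, xs[i])
    return totals


# Toy data (hypothetical): one very close class-1 point, two farther class-0 points.
toy_xs = [[0, 0], [1, 0], [3, 0], [4, 0]]
toy_ys = [1, 0, 0, 1]
query = [0.2, 0]

equal = knn_weights(toy_xs, toy_ys, 3, query)
decayed = knn_weights(toy_xs, toy_ys, 3, query, weight_fn=exp_weight)
print("Equal weights:", equal)
print("Distance-decayed weights:", decayed)
```

With equal weights class 0 wins by two votes to one; with the decaying weight the single very close class-1 neighbor outweighs the two more distant class-0 neighbors.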
To get some idea of the prediction accuracy of the k-nearest neighbors approach, we can apply it to the training data:
>>> for i in range(len(ys)):
...     print("True:", ys[i], "Predicted:", kNN.classify(model, xs[i]))
...
True: 1 Predicted: 1
True: 1 Predicted: 0
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
showing that the prediction is correct for all but two of the gene pairs. A more reliable estimate of the prediction accuracy can be found from a leave-one-out analysis, in which the model is recalculated from the training data after removing the gene pair to be predicted:
>>> k = 3
>>> for i in range(len(ys)):
...     model = kNN.train(xs[:i] + xs[i+1:], ys[:i] + ys[i+1:], k)
...     print("True:", ys[i], "Predicted:", kNN.classify(model, xs[i]))
...
True: 1 Predicted: 1
True: 1 Predicted: 0
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 1
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 1
The leave-one-out analysis shows that the k-nearest neighbors model is correct for 13 out of 17 gene pairs, which corresponds to a prediction accuracy of 76%.
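The leave-one-out loop has the same shape whichever classifier is used, so it can be factored into a helper. The sketch below is plain Python with toy stand-ins; leave_one_out, toy_train and toy_classify are hypothetical names, and the toy classifier is a one-dimensional 1-nearest-neighbor rule chosen only to keep the example self-contained.

```python
def leave_one_out(xs, ys, train, classify):
    """Predict each point from a model trained on all the other points."""
    predictions = []
    for i in range(len(ys)):
        model = train(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        predictions.append(classify(model, xs[i]))
    return predictions


def toy_train(xs, ys):
    """Stand-in for a train function: just remember the labelled points."""
    return list(zip(xs, ys))


def toy_classify(model, x):
    """Stand-in classifier: label of the single nearest stored point."""
    nearest = min(model, key=lambda pair: abs(pair[0] - x))
    return nearest[1]


# Toy one-dimensional data (hypothetical); the last point sits between the clusters.
toy_xs = [1.0, 1.2, 1.4, 5.0, 5.2, 3.0]
toy_ys = [1, 1, 1, 0, 0, 0]

predicted = leave_one_out(toy_xs, toy_ys, toy_train, toy_classify)
correct = sum(1 for t, p in zip(toy_ys, predicted) if t == p)
print("Correct: %d of %d" % (correct, len(toy_ys)))
```

Under the assumption that the train and classify functions take the arguments shown in this section, the same helper could be called with kNN.train wrapped to supply k, and kNN.classify.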
16.3 Naïve Bayes
This section will describe the Bio.NaiveBayes module.
16.4 Maximum Entropy
This section will describe the Bio.MaximumEntropy module.
16.5 Markov Models
This section will describe the Bio.MarkovModel and/or Bio.HMM.MarkovModel modules.
Chapter 17 Graphics including GenomeDiagram
The Bio.Graphics module depends on the third party Python library ReportLab. Although focused on producing PDF files, ReportLab can also create encapsulated postscript (EPS) and SVG files. In addition to these vector based images, provided certain further dependencies such as the Python Imaging Library (PIL) are installed, ReportLab can also output bitmap images (including JPEG, PNG, GIF, BMP and PICT formats).
17.1 GenomeDiagram
17.1.1 Introduction
The Bio.Graphics.GenomeDiagram module was added to Biopython 1.50, having previously been available as a separate Python module dependent on Biopython. GenomeDiagram is described in the Bioinformatics journal publication by Pritchard et al. (2006) [2], which includes some example images. There is a PDF copy of the old manual here, http://biopython.org/DIST/docs/GenomeDiagram/userguide.pdf which has some more examples.
As the name might suggest, GenomeDiagram was designed for drawing whole genomes, in particular prokaryotic genomes, either as linear diagrams (optionally broken up into fragments to fit better) or as circular wheel diagrams. Have a look at Figure 2 in Toth et al. (2006) [3] for a good example. It also proved well suited to drawing quite detailed figures for smaller genomes such as phage, plasmids or mitochondria, for example see Figures 1 and 2 in Van der Auwera et al. (2009) [4] (shown with additional manual editing).
This module is easiest to use if you have your genome loaded as a SeqRecord object containing lots of SeqFeature objects, for example as loaded from a GenBank file (see Chapters 4 and 5).
17.1.2 Diagrams, tracks, feature-sets and features
GenomeDiagram uses a nested set of objects. At the top level, you have a diagram object representing a sequence (or sequence region) along the horizontal axis (or circle). A diagram can contain one or more tracks, shown stacked vertically (or radially on circular diagrams). These will typically all have the same length and represent the same sequence region. You might use one track to show the gene locations, another to show regulatory regions, and a third track to show the GC percentage.
The most commonly used type of track will contain features, bundled together in feature-sets. You might choose to use one feature-set for all your CDS features, and another for tRNA features. This isn't required (they can all go in the same feature-set), but it makes it easier to update the properties of just selected features (e.g. make all the tRNA features red).
There are two main ways to build up a complete diagram. Firstly, the top-down approach, where you create a diagram object, then use its methods to add track(s), use the track methods to add feature-set(s), and use their methods to add the features. Secondly, you can create the individual objects separately (in whatever order suits your code), and then combine them.
17.1.3 A top-down example
We're going to draw a whole genome from a SeqRecord object read in from a GenBank file (see Chapter 5). This example uses the pPCP1 plasmid from Yersinia pestis biovar Microtus. The file is included with the Biopython unit tests under the GenBank folder, or available online as NC_005816.gb from our website.
from reportlab.lib import colors
from reportlab.lib.units import cm
from Bio.Graphics import GenomeDiagram
from Bio import SeqIO
record = SeqIO.read("NC_005816.gb", "genbank")
We're using a top-down approach, so after loading in our sequence we next create an empty diagram, then add an (empty) track, and to that add an (empty) feature set:
gd_diagram = GenomeDiagram.Diagram("Yersinia pestis biovar Microtus plasmid pPCP1")
gd_track_for_features = gd_diagram.new_track(1, name="Annotated Features")
gd_feature_set = gd_track_for_features.new_set()
Now the fun part: we take each gene SeqFeature object in our SeqRecord, and use it to generate a feature on the diagram. We're going to color them blue, alternating between a dark blue and a light blue.
for feature in record.features:
    if feature.type != "gene":
        # Exclude this feature
        continue
    if len(gd_feature_set) % 2 == 0:
        color = colors.blue
    else:
        color = colors.lightblue
    gd_feature_set.add_feature(feature, color=color, label=True)
Now we come to actually making the output file. This happens in two steps: first we call the draw method, which creates all the shapes using ReportLab objects. Then we call the write method, which renders these to the requested file format. Note you can output in multiple file formats:
gd_diagram.draw(format="linear", orientation="landscape", pagesize='A4',
                fragments=4, start=0, end=len(record))
gd_diagram.write("plasmid_linear.pdf", "PDF")
gd_diagram.write("plasmid_linear.eps", "EPS")
gd_diagram.write("plasmid_linear.svg", "SVG")
Provided you have the dependencies installed, you can also do bitmaps, for example:
gd_diagram.write("plasmid_linear.png", "PNG")
Notice that the fragments argument, which we set to four, controls how many pieces the genome gets broken up into.
If you want to do a circular figure, then try this:
gd_diagram.draw(format="circular", circular=True, pagesize=(20*cm, 20*cm),
                start=0, end=len(record), circle_core=0.7)
gd_diagram.write("plasmid_circular.pdf", "PDF")
These figures are not very exciting, but we've only just got started.
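17.1.4 A bottom-up example
Now let's produce exactly the same figures, but using the bottom-up approach. This means we create the different objects directly (and this can be done in almost any order) and then combine them.
from reportlab.lib import colors
from reportlab.lib.units import cm
from Bio.Graphics import GenomeDiagram
from Bio import SeqIO
record = SeqIO.read("NC_005816.gb", "genbank")
# Create the feature set and its feature objects,
gd_feature_set = GenomeDiagram.FeatureSet()
for feature in record.features:
    if feature.type != "gene":
        # Exclude this feature
        continue
    if len(gd_feature_set) % 2 == 0:
        color = colors.blue
    else:
        color = colors.lightblue
    gd_feature_set.add_feature(feature, color=color, label=True)
# (this for loop is the same as in the previous example)
# Create a track, and a diagram
gd_track_for_features = GenomeDiagram.Track(name="Annotated Features")
gd_diagram = GenomeDiagram.Diagram("Yersinia pestis biovar Microtus plasmid pPCP1")
# Now have to glue the bits together...
gd_track_for_features.add_set(gd_feature_set)
gd_diagram.add_track(gd_track_for_features, 1)
You can now call the draw and write methods as before to produce a linear or circular diagram, using the code at the end of the top-down example above. The figures should be identical.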
17.1.5 Features without a SeqFeature
In the above example we used a SeqRecord's SeqFeature objects to build our diagram (see also Section 4.3). Sometimes you won't have SeqFeature objects, but just the coordinates for a feature you want to draw. You have to create a minimal SeqFeature object, but this is easy:
from Bio.SeqFeature import SeqFeature, FeatureLocation
my_seq_feature = SeqFeature(FeatureLocation(50, 100), strand=+1)
For strand, use +1 for the forward strand, -1 for the reverse strand, and None for both. Here is a short self-contained example:
from Bio.SeqFeature import SeqFeature, FeatureLocation
from Bio.Graphics import GenomeDiagram
from reportlab.lib.units import cm
gdd = GenomeDiagram.Diagram('Test Diagram')
gdt_features = gdd.new_track(1, greytrack=False)
gds_features = gdt_features.new_set()
# Add three features to show the strand options,
feature = SeqFeature(FeatureLocation(25, 125), strand=+1)
gds_features.add_feature(feature, name="Forward", label=True)
feature = SeqFeature(FeatureLocation(150, 250), strand=None)
gds_features.add_feature(feature, name="Strandless", label=True)
feature = SeqFeature(FeatureLocation(275, 375), strand=-1)
gds_features.add_feature(feature, name="Reverse", label=True)
gdd.draw(format='linear', pagesize=(15*cm, 4*cm), fragments=1,
         start=0, end=400)
gdd.write("GD_labels_default.pdf", "pdf")
The top part of the image in the next subsection shows the output (in the default feature color, pale green).
Notice that we have used the name argument here to specify the caption text for these features. This is discussed in more detail next.
17.1.6 Feature captions
Recall we used the following (where feature was a SeqFeature object) to add a feature to the diagram:
gd_feature_set.add_feature(feature, color=color, label=True)
In the example above the SeqFeature annotation was used to pick a sensible caption for the features. By default the following possible entries under the SeqFeature object's qualifiers dictionary are used: gene, label, name, locus_tag, and product. More simply, you can specify a name directly:
gd_feature_set.add_feature(feature, color=color, label=True, name="My Gene")
In addition to the caption text for each feature's label, you can also choose the font, position (this defaults to the start of the sigil; you can also choose the middle or the end) and orientation (for linear diagrams only, where this defaults to rotated by 45 degrees):
# Large font, parallel with the track
gd_feature_set.add_feature(feature, label=True, color="green",
                           label_size=25, label_angle=0)
# Very small font, perpendicular to the track (towards it)
gd_feature_set.add_feature(feature, label=True, color="purple",
                           label_position="end",
                           label_size=4, label_angle=90)
# Small font, perpendicular to the track (away from it)
gd_feature_set.add_feature(feature, label=True, color="blue",
                           label_position="middle",
                           label_size=6, label_angle=-90)
Combining each of these three fragments with the complete example in the previous section should give something like this:
We've not shown it here, but you can also set label_color to control the label's color (used in Section 17.1.9).
You'll notice the default font is quite small. This makes sense because you will usually be drawing many (small) features on a page, not just a few large ones as shown here.
17.1.7 Feature sigils
The examples above have all just used the default sigil for the feature, a plain box, which was all that was available in the last publicly released standalone version of GenomeDiagram. Arrow sigils were included when GenomeDiagram was added to Biopython 1.50:
# Default uses a BOX sigil
gd_feature_set.add_feature(feature)
# You can make this explicit:
gd_feature_set.add_feature(feature, sigil="BOX")
# Or opt for an arrow:
gd_feature_set.add_feature(feature, sigil="ARROW")
Biopython 1.61 added three more sigils:
# Box with corners cut off (making it an octagon)
gd_feature_set.add_feature(feature, sigil="OCTO")
# Box with jagged edges (useful for showing breaks in contigs)
gd_feature_set.add_feature(feature, sigil="JAGGY")
# Arrow which spans the axis with strand used only for direction
gd_feature_set.add_feature(feature, sigil="BIGARROW")
These are shown below. Most sigils fit into a bounding box (as given by the default BOX sigil), either above or below the axis for the forward or reverse strand, or straddling it (double the height) for strandless features. The BIGARROW sigil is different, always straddling the axis with the direction taken from the feature's strand.
17.1.8 Arrow sigils
We introduced the arrow sigils in the previous section. There are two additional options to adjust the shapes of the arrows. Firstly, the thickness of the arrow shaft, given as a proportion of the height of the bounding box:
# Full height shafts, giving pointed boxes:
gd_feature_set.add_feature(feature, sigil="ARROW", color="brown",
                           arrowshaft_height=1.0)
# Or, thin shafts:
gd_feature_set.add_feature(feature, sigil="ARROW", color="teal",
                           arrowshaft_height=0.2)
# Or, very thin shafts:
gd_feature_set.add_feature(feature, sigil="ARROW", color="darkgreen",
                           arrowshaft_height=0.1)
The results are shown below:
Secondly, the length of the arrow head, given as a proportion of the height of the bounding box (defaulting to 0.5, or 50%):
# Short arrow heads:
gd_feature_set.add_feature(feature, sigil="ARROW", color="blue",
                           arrowhead_length=0.25)
# Or, longer arrow heads:
gd_feature_set.add_feature(feature, sigil="ARROW", color="orange",
                           arrowhead_length=1)
# Or, very very long arrow heads (i.e. all head, no shaft, so triangles):
gd_feature_set.add_feature(feature, sigil="ARROW", color="red",
                           arrowhead_length=10000)
The results are shown below:
Biopython 1.61 adds a new BIGARROW sigil which always straddles the axis, pointing left for the reverse strand or right otherwise:
# A large arrow straddling the axis:
gd_feature_set.add_feature(feature, sigil="BIGARROW")
All the shaft and arrow head options shown above for the ARROW sigil can be used for the BIGARROW sigil too.
17.1.9 A nice example
Now let's return to the pPCP1 plasmid from Yersinia pestis biovar Microtus, and the top-down approach used in Section 17.1.3, but take advantage of the sigil options we've now discussed. This time we'll use arrows for the genes, and overlay them with strandless features (as plain boxes) showing the position of some restriction digest sites.
from reportlab.lib import colors
from reportlab.lib.units import cm
from Bio.Graphics import GenomeDiagram
from Bio import SeqIO
from Bio.SeqFeature import SeqFeature, FeatureLocation
record = SeqIO.read("NC_005816.gb", "genbank")
gd_diagram = GenomeDiagram.Diagram(record.id)
gd_track_for_features = gd_diagram.new_track(1, name="Annotated Features")
gd_feature_set = gd_track_for_features.new_set()
for feature in record.features:
    if feature.type != "gene":
        # Exclude this feature
        continue
    if len(gd_feature_set) % 2 == 0:
        color = colors.blue
    else:
        color = colors.lightblue
    gd_feature_set.add_feature(feature, sigil="ARROW",
                               color=color, label=True,
                               label_size=14, label_angle=0)
# I want to include some strandless features, so for an example
# will use EcoRI recognition sites etc.
for site, name, color in [("GAATTC", "EcoRI", colors.green),
                          ("CCCGGG", "SmaI", colors.orange),
                          ("AAGCTT", "HindIII", colors.red),
                          ("GGATCC", "BamHI", colors.purple)]:
    index = 0
    while True:
        index = record.seq.find(site, start=index)
        if index == -1:
            break
        feature = SeqFeature(FeatureLocation(index, index + len(site)))
        gd_feature_set.add_feature(feature, color=color, name=name,
                                   label=True, label_size=10,
                                   label_color=color)
        index += len(site)
gd_diagram.draw(format="linear", pagesize='A4', fragments=4,
                start=0, end=len(record))
gd_diagram.write("plasmid_linear_nice.pdf", "PDF")
gd_diagram.write("plasmid_linear_nice.eps", "EPS")
gd_diagram.write("plasmid_linear_nice.svg", "SVG")
gd_diagram.draw(format="circular", circular=True, pagesize=(20*cm, 20*cm),
                start=0, end=len(record), circle_core=0.5)
gd_diagram.write("plasmid_circular_nice.pdf", "PDF")
gd_diagram.write("plasmid_circular_nice.eps", "EPS")
gd_diagram.write("plasmid_circular_nice.svg", "SVG")
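The restriction site loop in the code above relies on the find method returning -1 once no further match exists. The same search pattern can be tried on a plain Python string; this is a minimal sketch in which the helper name find_sites and the toy sequence are made up for illustration.

```python
def find_sites(sequence, site):
    """Return (start, end) pairs for successive non-overlapping matches of site."""
    positions = []
    index = 0
    while True:
        index = sequence.find(site, index)
        if index == -1:
            # No further match, stop searching.
            break
        positions.append((index, index + len(site)))
        # Resume the search just past the match we have recorded.
        index += len(site)
    return positions


# Toy sequence (hypothetical) with two EcoRI sites and one BamHI site.
toy_sequence = "AAGAATTCTTGGATCCAAGAATTCGG"
print("EcoRI:", find_sites(toy_sequence, "GAATTC"))
print("BamHI:", find_sites(toy_sequence, "GGATCC"))
print("SmaI:", find_sites(toy_sequence, "CCCGGG"))
```

Each (start, end) pair uses the same zero-based, end-exclusive convention as FeatureLocation(index, index + len(site)) in the example.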
And the output:
17.1.10 Multiple tracks
All the examples so far have used a single track, but you can have more than one track, for example to show the genes on one and repeat regions on another. In this example we're going to show three phage genomes side by side to scale, inspired by Figure 6 in Proux et al. (2002) [5]. We'll need the GenBank files for the following three phage:
NC_002703 - Lactococcus phage Tuc2009, complete genome (38347 bp)
AF323668 - Bacteriophage bIL285, complete genome (35538 bp)
NC_003212 - Listeria innocua Clip11262, complete genome, of which we are focussing only on integrated prophage 5 (similar length).
You can download these using Entrez if you like, see Section 9.6 for more details. For the third record we've worked out where the phage is integrated into the genome, and slice the record to extract it (with the features preserved, see Section 4.7), and must also reverse complement to match the orientation of the first two phage (again preserving the features, see Section 4.9):
from Bio import SeqIO
A_rec = SeqIO.read("NC_002703.gbk", "gb")
B_rec = SeqIO.read("AF323668.gbk", "gb")
C_rec = SeqIO.read("NC_003212.gbk", "gb")[2587879:2625807].reverse_complement(name=True)
The figure we are imitating used different colors for different gene functions. One way to do this is to edit the GenBank file to record color preferences for each feature, something Sanger's Artemis editor does, and which GenomeDiagram should understand. Here however, we'll just hard code three lists of colors.
Note that the annotation in the GenBank files doesn't exactly match that shown in Proux et al.; they have drawn some unannotated genes.
from reportlab.lib.colors import red, grey, orange, green, brown, blue, lightblue, purple
A_colors=[red]*5+[grey]*7+[orange]*2+[grey]*2+[orange]+[grey]*11+[green]*4\
+[grey]+[green]*2+[grey,green]+[brown]*5+[blue]*4+[lightblue]*5\
+[grey,lightblue]+[purple]*2+[grey]
B_colors=[red]*6+[grey]*8+[orange]*2+[grey]+[orange]+[grey]*21+[green]*5\
+[grey]+[brown]*4+[blue]*3+[lightblue]*3+[grey]*5+[purple]*2
C_colors=[grey]*30+[green]*5+[brown]*4+[blue]*2+[grey,blue]+[lightblue]*2\
+[grey]*5
Now to draw them. This time we add three tracks to the diagram, and also notice they are given different start/end values to reflect their different lengths (this requires Biopython 1.59 or later).
from Bio.Graphics import GenomeDiagram
name = "Proux Fig 6"
gd_diagram = GenomeDiagram.Diagram(name)
max_len = 0
for record, gene_colors in zip([A_rec, B_rec, C_rec], [A_colors, B_colors, C_colors]):
    max_len = max(max_len, len(record))
    gd_track_for_features = gd_diagram.new_track(1,
                                                 name=record.name,
                                                 greytrack=True,
                                                 start=0, end=len(record))
    gd_feature_set = gd_track_for_features.new_set()
    i = 0
    for feature in record.features:
        if feature.type != "gene":
            # Exclude this feature
            continue
        gd_feature_set.add_feature(feature, sigil="ARROW",
                                   color=gene_colors[i], label=True,
                                   name=str(i+1),
                                   label_position="start",
                                   label_size=6, label_angle=0)
        i += 1
gd_diagram.draw(format="linear", pagesize='A4', fragments=1,
                start=0, end=max_len)
gd_diagram.write(name + ".pdf", "PDF")
gd_diagram.write(name + ".eps", "EPS")
gd_diagram.write(name + ".svg", "SVG")
The result:
I did wonder why in the original manuscript there were no red or orange genes marked in the bottom phage. Another important point is that here the phage are shown with different lengths. This is because they are all drawn to the same scale (they are different lengths).
The key difference from the published figure is that they have color-coded links between similar proteins, which is what we will do in the next section.
17.1.11 Cross-Links between tracks
Biopython 1.59 added the ability to draw cross links between tracks, both in simple linear diagrams as we will show here, and in linear diagrams split into fragments and circular diagrams.
Continuing the example from the previous section inspired by Figure 6 from Proux et al. 2002 [5], we would need a list of cross links between pairs of genes, along with a score or color to use. Realistically you might extract this from a BLAST file computationally, but here I have manually typed them in.
My naming convention continues to refer to the three phage as A, B and C. Here are the links we want to show between A and B, given as a list of tuples (percentage similarity score, gene in A, gene in B).
# Tuc2009 (NC_002703) vs bIL285 (AF323668)
A_vs_B=[
(99,"Tuc2009_01","int"),
(33,"Tuc2009_03","orf4"),
(94,"Tuc2009_05","orf6"),
(100,"Tuc2009_06","orf7"),
(97,"Tuc2009_07","orf8"),
(98,"Tuc2009_08","orf9"),
(98,"Tuc2009_09","orf10"),
(100,"Tuc2009_10","orf12"),
(100,"Tuc2009_11","orf13"),
(94,"Tuc2009_12","orf14"),
(87,"Tuc2009_13","orf15"),
(94,"Tuc2009_14","orf16"),
(94,"Tuc2009_15","orf17"),
(88,"Tuc2009_17","rusA"),
(91,"Tuc2009_18","orf20"),
(93,"Tuc2009_19","orf22"),
(71,"Tuc2009_20","orf23"),
(51,"Tuc2009_22","orf27"),
(97,"Tuc2009_23","orf28"),
(88,"Tuc2009_24","orf29"),
(26,"Tuc2009_26","orf38"),
(19,"Tuc2009_46","orf52"),
(77,"Tuc2009_48","orf54"),
(91,"Tuc2009_49","orf55"),
(95,"Tuc2009_52","orf60"),
]
Likewise for B and C:
# bIL285 (AF323668) vs Listeria innocua prophage 5 (in NC_003212)
B_vs_C=[
(42,"orf39","lin2581"),
(31,"orf40","lin2580"),
(49,"orf41","lin2579"),#terL
(54,"orf42","lin2578"),#portal
(55,"orf43","lin2577"),#protease
(33,"orf44","lin2576"),#mhp
(51,"orf46","lin2575"),
(33,"orf47","lin2574"),
(40,"orf48","lin2573"),
(25,"orf49","lin2572"),
(50,"orf50","lin2571"),
(48,"orf51","lin2570"),
(24,"orf52","lin2568"),
(30,"orf53","lin2567"),
(28,"orf54","lin2566"),
]
For the first and last phage these identifiers are locus tags; for the middle phage there are no locus tags so I've used gene names instead. The following little helper function lets us look up a feature using either a locus tag or gene name:
def get_feature(features, id, tags=("locus_tag", "gene")):
    """Search list of SeqFeature objects for an identifier under the given tags."""
    for f in features:
        for key in tags:
            # tag may not be present in this feature
            for x in f.qualifiers.get(key, []):
                if x == id:
                    return f
    raise KeyError(id)
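To see how the lookup behaves without loading a GenBank file, the helper can be exercised on stand-in objects. In this sketch the toy_features list is hypothetical and uses types.SimpleNamespace in place of real SeqFeature objects; only the qualifiers attribute matters to the helper, and it is assumed to be a dictionary mapping a tag to a list of strings.

```python
from types import SimpleNamespace


def get_feature(features, id, tags=("locus_tag", "gene")):
    """Search a list of feature-like objects for an identifier under the given tags."""
    for f in features:
        for key in tags:
            # The tag may not be present in this feature.
            for x in f.qualifiers.get(key, []):
                if x == id:
                    return f
    raise KeyError(id)


# Stand-ins for SeqFeature objects (hypothetical data for illustration).
toy_features = [
    SimpleNamespace(type="gene", qualifiers={"locus_tag": ["Tuc2009_01"]}),
    SimpleNamespace(type="gene", qualifiers={"gene": ["int"]}),
    SimpleNamespace(type="source", qualifiers={}),
]

by_locus_tag = get_feature(toy_features, "Tuc2009_01")
by_gene_name = get_feature(toy_features, "int")

try:
    get_feature(toy_features, "no_such_gene")
    missing_raised = False
except KeyError:
    missing_raised = True
print("Missing identifier raised KeyError:", missing_raised)
```

Raising KeyError for an unknown identifier means a typo in one of the hand-typed link lists stops the script instead of silently dropping a cross-link.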
We can now turn those lists of identifier pairs into SeqFeature pairs, and thus find their location coordinates. We can now add all that code and the following snippet to the previous example (just before the gd_diagram.draw(...) line; see the finished example script Proux_et_al_2002_Figure_6.py included in the Doc/examples folder of the Biopython source code) to add cross links to the figure:
from Bio.Graphics.GenomeDiagram import CrossLink
from reportlab.lib import colors
# Note it might have been clearer to assign the track numbers explicitly...
for rec_X, tn_X, rec_Y, tn_Y, X_vs_Y in [(A_rec, 3, B_rec, 2, A_vs_B),
                                         (B_rec, 2, C_rec, 1, B_vs_C)]:
    track_X = gd_diagram.tracks[tn_X]
    track_Y = gd_diagram.tracks[tn_Y]
    for score, id_X, id_Y in X_vs_Y:
        feature_X = get_feature(rec_X.features, id_X)
        feature_Y = get_feature(rec_Y.features, id_Y)
        color = colors.linearlyInterpolatedColor(colors.white, colors.firebrick, 0, 100, score)
        link_xy = CrossLink((track_X, feature_X.location.start, feature_X.location.end),
                            (track_Y, feature_Y.location.start, feature_Y.location.end),
                            color, colors.lightgrey)
        gd_diagram.cross_track_links.append(link_xy)
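The score-to-color step in the loop above is a linear interpolation between two colors. The following plain-Python sketch shows the arithmetic on RGB triples without needing ReportLab; the function name interpolate_color is made up here, and the firebrick value is assumed to be the usual named color #B22222 (178, 34, 34).

```python
WHITE = (1.0, 1.0, 1.0)
# Assumption: firebrick is the standard named color #B22222.
FIREBRICK = (178 / 255, 34 / 255, 34 / 255)


def interpolate_color(c0, c1, x0, x1, x):
    """Blend c0 (at x0) into c1 (at x1) linearly, one RGB channel at a time."""
    t = (x - x0) / (x1 - x0)
    return tuple(a + t * (b - a) for a, b in zip(c0, c1))


for score in (0, 50, 100):
    rgb = interpolate_color(WHITE, FIREBRICK, 0, 100, score)
    print(score, tuple(round(channel, 3) for channel in rgb))
```

A score of 0 gives white and a score of 100 gives the full dark red, so weak matches fade towards the page background while strong matches stand out.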
There are several important pieces to this code. First, the GenomeDiagram object has a cross_track_links attribute which is just a list of CrossLink objects. Each CrossLink object takes two sets of track-specific coordinates (here given as tuples; you can alternatively use a GenomeDiagram.Feature object instead). You can optionally supply a color, border color, and say if this link should be drawn flipped (useful for showing inversions).
You can also see how we turn the BLAST percentage identity score into a color, interpolating between white (0%) and a dark red (100%). In this example we don't have any problems with overlapping cross-links. One way to tackle that is to use transparency in ReportLab, by using colors with their alpha channel set. However, this kind of shaded color scheme combined with overlap transparency would be difficult to interpret. The result:
There is still a lot more that can be done within Biopython to help improve this figure. First of all, the cross links in this case are between proteins which are drawn in a strand-specific manner. It can help to add a background region (a feature using the BOX sigil) on the feature track to extend the cross link. Also, we could reduce the vertical height of the feature tracks to allocate more to the links instead; one way to do that is to allocate space for empty tracks. Furthermore, in cases like this where there are no large gene overlaps, we can use the axis-straddling BIGARROW sigil, which allows us to further reduce the vertical space needed for the track. These improvements are demonstrated in the example script Proux_et_al_2002_Figure_6.py included in the Doc/examples folder of the Biopython source code. The result:
Beyond that, finishing touches you might want to do manually in a vector image editor include fine tuning the placement of gene labels, and adding other custom annotation such as highlighting particular regions.
Although not really necessary in this example since none of the cross-links overlap, using a transparent color in ReportLab is a very useful technique for superimposing multiple links. However, in this case a shaded color scheme should be avoided.
17.1.12 Further options
You can control the tick marks to show the scale (after all, every graph should show its units), and the number of the grey-track labels.
Also, we have only used the FeatureSet so far. GenomeDiagram also has a GraphSet which can be used to show line graphs, bar charts and heat plots (e.g. to show plots of GC% on a track parallel to the features).
These options are not covered here yet, so for now we refer you to the User Guide (PDF) included with the standalone version of GenomeDiagram (but please read the next section first), and the docstrings.
17.1.13 Converting old code
If you have old code written using the standalone version of GenomeDiagram, and you want to switch it over to using the new version included with Biopython, then you will have to make a few changes, most importantly to your import statements.
Also, the older version of GenomeDiagram used only the UK spellings of color and center (colour and centre). You will need to change to the American spellings, although for several years the Biopython version of GenomeDiagram supported both.
For example, if you used to have:
from GenomeDiagram import GDFeatureSet, GDDiagram
gdd = GDDiagram("An example")
...
you could just switch the import statements like this:
from Bio.Graphics.GenomeDiagram import FeatureSet as GDFeatureSet, Diagram as GDDiagram
gdd = GDDiagram("An example")
...
and hopefully that should be enough. In the long term you might want to switch to the new names, but you would have to change more of your code:
from Bio.Graphics.GenomeDiagram import FeatureSet, Diagram
gdd = Diagram("An example")
...
or:
from Bio.Graphics import GenomeDiagram
gdd = GenomeDiagram.Diagram("An example")
...
If you run into difficulties, please ask on the Biopython mailing list for advice. One catch is that we have not included the old module GenomeDiagram.GDUtilities yet. This included a number of GC% related functions, which will probably be merged under Bio.SeqUtils later on.
17.2 Chromosomes
The Bio.Graphics.BasicChromosome module allows drawing of chromosomes. There is an example in Jupe et al. (2012) [6] (open access) using colors to highlight different gene families.
17.2.1 Simple Chromosomes
Here is a very simple example, for which we'll use Arabidopsis thaliana.
You can skip this bit, but first I downloaded the five sequenced chromosomes from the NCBI's FTP site ftp://ftp.ncbi.nlm.nih.gov/genomes/Arabidopsis_thaliana and then parsed them with Bio.SeqIO to find out their lengths. You could use the GenBank files for this, but it is faster to use the FASTA files for the whole chromosomes:
from Bio import SeqIO
entries = [("Chr I", "CHR_I/NC_003070.fna"),
           ("Chr II", "CHR_II/NC_003071.fna"),
           ("Chr III", "CHR_III/NC_003074.fna"),
           ("Chr IV", "CHR_IV/NC_003075.fna"),
           ("Chr V", "CHR_V/NC_003076.fna")]
for (name, filename) in entries:
    record = SeqIO.read(filename, "fasta")
    print(name, len(record))
This gave the lengths of the five chromosomes, which we'll now use in the following short demonstration of the BasicChromosome module:
from reportlab.lib.units import cm
from Bio.Graphics import BasicChromosome
entries = [("Chr I", 30432563),
           ("Chr II", 19705359),
           ("Chr III", 23470805),
           ("Chr IV", 18585042),
           ("Chr V", 26992728)]
max_len = 30432563  # Could compute this
telomere_length = 1000000  # For illustration
chr_diagram = BasicChromosome.Organism()
chr_diagram.page_size = (29.7*cm, 21*cm)  # A4 landscape
for name, length in entries:
    cur_chromosome = BasicChromosome.Chromosome(name)
    # Set the scale to the MAXIMUM length plus the two telomeres in bp,
    # want the same scale used on all five chromosomes so they can be
    # compared to each other
    cur_chromosome.scale_num = max_len + 2 * telomere_length
    # Add an opening telomere
    start = BasicChromosome.TelomereSegment()
    start.scale = telomere_length
    cur_chromosome.add(start)
    # Add a body - using bp as the scale length here.
    body = BasicChromosome.ChromosomeSegment()
    body.scale = length
    cur_chromosome.add(body)
    # Add a closing telomere
    end = BasicChromosome.TelomereSegment(inverted=True)
    end.scale = telomere_length
    cur_chromosome.add(end)
    # This chromosome is done
    chr_diagram.add(cur_chromosome)
chr_diagram.draw("simple_chrom.pdf", "Arabidopsis thaliana")
This should create a very simple PDF file, shown here:
This example is deliberately short and sweet. The next example shows the location of features of interest.
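The hard-coded max_len in the example above is marked "Could compute this". As a small sketch of one way to do so, the longest chromosome and the resulting shared scale can be derived from the same entries list in plain Python:

```python
# Chromosome names and lengths in bp, as used in the example above.
entries = [("Chr I", 30432563),
           ("Chr II", 19705359),
           ("Chr III", 23470805),
           ("Chr IV", 18585042),
           ("Chr V", 26992728)]
telomere_length = 1000000  # For illustration, as in the example above

# The longest chromosome sets the common scale for every chromosome drawn.
max_len = max(length for name, length in entries)
scale_num = max_len + 2 * telomere_length

print("Longest chromosome:", max_len, "bp")
print("Shared scale:", scale_num, "bp")
```

Computing the maximum avoids having to update a literal if the list of chromosomes changes.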
17.2.2 Annotated Chromosomes
Continuing from the previous example, let's also show the tRNA genes. We'll get their locations by parsing the GenBank files for the five Arabidopsis thaliana chromosomes. You'll need to download these files from the NCBI FTP site ftp://ftp.ncbi.nlm.nih.gov/genomes/Arabidopsis_thaliana, and preserve the subdirectory names or edit the paths below:
from reportlab.lib.units import cm
from Bio import SeqIO
from Bio.Graphics import BasicChromosome
entries = [("Chr I", "CHR_I/NC_003070.gbk"),
           ("Chr II", "CHR_II/NC_003071.gbk"),
           ("Chr III", "CHR_III/NC_003074.gbk"),
           ("Chr IV", "CHR_IV/NC_003075.gbk"),
           ("Chr V", "CHR_V/NC_003076.gbk")]
max_len = 30432563  # Could compute this
telomere_length = 1000000  # For illustration
chr_diagram = BasicChromosome.Organism()
chr_diagram.page_size = (29.7*cm, 21*cm)  # A4 landscape
for index, (name, filename) in enumerate(entries):
    record = SeqIO.read(filename, "genbank")
    length = len(record)
    features = [f for f in record.features if f.type == "tRNA"]
    # Record an Artemis style integer color in the feature's qualifiers,
    # 1 = Black, 2 = Red, 3 = Green, 4 = blue, 5 = cyan, 6 = purple
    for f in features:
        f.qualifiers["color"] = [index + 2]
    cur_chromosome = BasicChromosome.Chromosome(name)
    # Set the scale to the MAXIMUM length plus the two telomeres in bp,
    # want the same scale used on all five chromosomes so they can be
    # compared to each other
    cur_chromosome.scale_num = max_len + 2 * telomere_length
    # Add an opening telomere
    start = BasicChromosome.TelomereSegment()
    start.scale = telomere_length
    cur_chromosome.add(start)
    # Add a body - again using bp as the scale length here.
    body = BasicChromosome.AnnotatedChromosomeSegment(length, features)
    body.scale = length
    cur_chromosome.add(body)
    # Add a closing telomere
    end = BasicChromosome.TelomereSegment(inverted=True)
    end.scale = telomere_length
    cur_chromosome.add(end)
    # This chromosome is done
    chr_diagram.add(cur_chromosome)
chr_diagram.draw("tRNA_chrom.pdf", "Arabidopsis thaliana")
It might warn you about the labels being too close together (have a look at the forward strand, on the right hand side, of Chr I), but it should create a colorful PDF file, shown here:
Chapter18KEGG
KEGG(http://www.kegg.jp/)isadatabaseresourceforunderstandinghighlevelfunctionsandutilitiesofthebiologicalsystem,suchasthecell,theorganismandthe
ecosystem,frommolecularlevelinformation,especiallylargescalemoleculardatasetsgeneratedbygenomesequencingandotherhighthroughputexperimental
technologies.
PleasenotethattheKEGGparserimplementationinBiopythonisincomplete.WhiletheKEGGwebsiteindicatesmanyflatfileformats,onlyparsersandwritersfor
compound,enzyme,andmaparecurrentlyimplemented.However,agenericparserisimplementedtohandletheotherformats.
18.1 Parsing KEGG records
Parsing a KEGG record is as simple as using any other file format parser in Biopython. (Before running the following code, please open http://rest.kegg.jp/get/ec:5.4.2.2 with your web browser and save it as ec_5.4.2.2.txt.)
>>> from Bio.KEGG import Enzyme
>>> records = Enzyme.parse(open("ec_5.4.2.2.txt"))
>>> record = list(records)[0]
>>> record.classname
['Isomerases;', 'Intramolecular transferases;', 'Phosphotransferases (phosphomutases)']
>>> record.entry
'5.4.2.2'
Alternatively, if the input KEGG file has exactly one entry, you can use read:
>>> record = Enzyme.read(open("ec_5.4.2.2.txt"))
>>> record.classname
['Isomerases;', 'Intramolecular transferases;', 'Phosphotransferases (phosphomutases)']
>>> record.entry
'5.4.2.2'
The following section shows how to download the above enzyme using the KEGG API, as well as how to use the generic parser with data that does not have a custom parser implemented.
18.2 Querying the KEGG API
Biopython has full support for querying the KEGG API. All endpoints and methods documented by KEGG (http://www.kegg.jp/kegg/rest/keggapi.html) are supported. The interface performs some validation of queries, following rules defined on the KEGG site. However, invalid queries which return a 400 or 404 must be handled by the user.
First, here is how to extend the above example by downloading the relevant enzyme and passing it through the Enzyme parser.
>>> from Bio.KEGG import REST
>>> request = REST.kegg_get("ec:5.4.2.2")
>>> open("ec_5.4.2.2.txt", 'w').write(request.read())
>>> records = Enzyme.parse(open("ec_5.4.2.2.txt"))
>>> record = list(records)[0]
>>> record.classname
['Isomerases;', 'Intramolecular transferases;', 'Phosphotransferases (phosphomutases)']
>>> record.entry
'5.4.2.2'
Now, here's a more realistic example which combines several queries to the KEGG API. This will demonstrate how to extract a unique set of all human pathway gene symbols which relate to DNA repair. The steps that need to be taken to do so are as follows. First, we need to get a list of all human pathways. Secondly, we need to filter those for ones which relate to "repair". Lastly, we need to get a list of all the gene symbols in all repair pathways.
from Bio.KEGG import REST

human_pathways = REST.kegg_list("pathway", "hsa").read()

# Filter all human pathways for repair pathways
repair_pathways = []
for line in human_pathways.rstrip().split("\n"):
    entry, description = line.split("\t")
    if "repair" in description:
        repair_pathways.append(entry)

# Get the genes for pathways and add them to a list
repair_genes = []
for pathway in repair_pathways:
    pathway_file = REST.kegg_get(pathway).read()  # query and read each pathway

    # iterate through each KEGG pathway file, keeping track of which section
    # of the file we're in, only read the gene in each pathway
    current_section = None
    for line in pathway_file.rstrip().split("\n"):
        section = line[:12].strip()  # section names are within 12 columns
        if not section == "":
            current_section = section

        if current_section == "GENE":
            gene_identifiers, gene_description = line[12:].split("; ")
            gene_id, gene_symbol = gene_identifiers.split()
            if gene_symbol not in repair_genes:
                repair_genes.append(gene_symbol)

print("There are %d repair pathways and %d repair genes. The genes are:" % \
      (len(repair_pathways), len(repair_genes)))
print(", ".join(repair_genes))
The KEGG API wrapper is compatible with all endpoints. Usage is essentially replacing all slashes in the URL with commas and using that list as arguments to the corresponding method in the KEGG module. Here are a few examples from the API documentation (http://www.kegg.jp/kegg/docs/keggapi.html).
/list/hsa:10458+ece:Z5100 -> REST.kegg_list(["hsa:10458", "ece:Z5100"])
/find/compound/300-310/mol_weight -> REST.kegg_find("compound", "300-310", "mol_weight")
/get/hsa:10458+ece:Z5100/aaseq -> REST.kegg_get(["hsa:10458", "ece:Z5100"], "aaseq")
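The mapping from a REST path to method arguments can be sketched in plain Python. The helper rest_path_to_call below is hypothetical and is not part of Bio.KEGG; it only illustrates how the slashes separate the arguments, and how identifiers joined with "+" would become a Python list.

```python
def rest_path_to_call(path):
    """Split a KEGG REST path into (method name, argument list).

    Hypothetical helper for illustration only; it is not part of Bio.KEGG.
    """
    parts = path.strip("/").split("/")
    operation, raw_args = parts[0], parts[1:]
    args = []
    for arg in raw_args:
        # Several identifiers joined with "+" become one list argument
        args.append(arg.split("+") if "+" in arg else arg)
    return "kegg_" + operation, args


method, arguments = rest_path_to_call("/get/hsa:10458+ece:Z5100/aaseq")
print(method, arguments)
```

The method name returned here matches the naming pattern of the REST functions used above (kegg_list, kegg_find, kegg_get).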
Chapter 19 Bio.phenotype: analyse phenotypic data
This chapter gives an overview of the functionalities of the Bio.phenotype package included in Biopython. The scope of this package is the analysis of phenotypic data, which means parsing and analysing growth measurements of cell cultures. In its current state the package is focused on the analysis of high-throughput phenotypic experiments produced by the Phenotype Microarray technology, but future developments may include other platforms and formats.
19.1 Phenotype Microarrays
The Phenotype Microarray is a technology that measures the metabolism of bacterial and eukaryotic cells on roughly 2000 chemicals, divided in twenty 96-well plates. The technology measures the reduction of a tetrazolium dye by NADH, whose production by the cell is used as a proxy for cell metabolism; color development due to the reduction of this dye is typically measured once every 15 minutes. When cells are grown in a medium that sustains cell metabolism, the recorded phenotypic data resembles a sigmoid growth curve, from which a series of growth parameters can be retrieved.
19.1.1 Parsing Phenotype Microarray data
The Bio.phenotype package can parse two different formats of Phenotype Microarray data: the CSV (comma separated values) files produced by the machine's proprietary software, and JSON files produced by analysis software, like opm or DuctApe. The parser will return one PlateRecord object or a generator of PlateRecord objects, depending on whether the read or parse method is being used. You can test the parse function by using the Plates.csv file provided with the Biopython source code.
>>> from Bio import phenotype
>>> for record in phenotype.parse("Plates.csv", "pm-csv"):
...     print("%s %i" % (record.id, len(record)))
...
PM01 96
PM01 96
PM09 96
PM09 96
The parser returns a series of PlateRecord objects, each one containing a series of WellRecord objects (holding each well's experimental data) arranged in 8 rows and 12 columns; each row is indicated by an uppercase character from A to H, while columns are indicated by a two-digit number, from 01 to 12. There are several ways to access WellRecord objects from a PlateRecord object:
Well identifier
If you know the well identifier (row + column identifiers) you can access the desired well directly.
>>>record['A02']

Well plate coordinates
The same well can be retrieved by using the row and column numbers (0-based index).
>>> record = list(phenotype.parse("Plates.csv", "pm-csv"))[1]
>>> print(record[0, 1].id)
A02
Row or column coordinates
A series of WellRecord objects contiguous to each other in the plate can be retrieved in bulk by using the Python list slicing syntax on PlateRecord objects; rows and columns are numbered with a 0-based index.
>>> print(record[0])
Plate ID: PM09
Well: 12
Rows: 1
Columns: 12
PlateRecord('WellRecord['A01'], WellRecord['A02'], WellRecord['A03'], ..., WellRecord['A12']')
>>> print(record[:, 0])
Plate ID: PM09
Well: 8
Rows: 8
Columns: 1
PlateRecord('WellRecord['A01'], WellRecord['B01'], WellRecord['C01'], ..., WellRecord['H01']')
>>> print(record[:3, :3])
Plate ID: PM09
Well: 9
Rows: 3
Columns: 3
PlateRecord('WellRecord['A01'], WellRecord['A02'], WellRecord['A03'], ..., WellRecord['C03']')
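The relationship between a well identifier and its 0-based plate coordinates can be sketched as follows. The two helper functions are hypothetical (they are not part of Bio.phenotype) and assume a standard 96-well plate with rows A to H and columns 01 to 12.

```python
def well_to_index(well_id):
    """Convert an identifier such as 'A02' to 0-based (row, column).

    Hypothetical helper, assuming an 8 x 12 plate layout.
    """
    row = ord(well_id[0].upper()) - ord("A")
    column = int(well_id[1:]) - 1
    if not (0 <= row < 8 and 0 <= column < 12):
        raise ValueError("Not a valid 96-well identifier: %r" % well_id)
    return row, column


def index_to_well(row, column):
    """Convert 0-based (row, column) back to an identifier like 'A02'."""
    return "%s%02d" % (chr(ord("A") + row), column + 1)


print(well_to_index("A02"))  # row 0, column 1, as in record[0, 1] above
print(index_to_well(7, 11))
```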
19.1.2 Manipulating Phenotype Microarray data
19.1.2.1 Accessing raw data
The raw data extracted from the PM files consists of a series of tuples for each well, containing the time (in hours) and the colorimetric measure (in arbitrary units). Usually the instrument collects data every fifteen minutes, but that can vary between experiments. The raw data can be accessed by iterating on a WellRecord object; in the example below only the first ten time points are shown.
>>> well = record['A02']
>>> for time, signal in well:
...     print(time, signal)
...
(0.0, 12.0)
(0.25, 18.0)
(0.5, 27.0)
(0.75, 35.0)
(1.0, 37.0)
(1.25, 41.0)
(1.5, 44.0)
(1.75, 44.0)
(2.0, 44.0)
(2.25, 44.0)
[...]
This method, while providing a way to access the raw data, doesn't allow a direct comparison between different WellRecord objects, which may have measurements at different time points.
19.1.2.2 Accessing interpolated data
To make it easier to compare different experiments, and in general to allow a more intuitive handling of the phenotypic data, the module allows you to define a custom slicing of the time points that are present in the WellRecord object. Colorimetric data for time points that have not been directly measured are derived through a linear interpolation of the available data; otherwise a NaN is returned. This method only works in the time interval where actual data is available. Time intervals can be defined with the same syntax as list indexing; the default time interval is therefore one hour.
>>> well[:10]
[12.0, 37.0, 44.0, 44.0, 44.0, 44.0, 44.0, 44.0, 44.0, 44.0]
Different time intervals can be used, for instance five minutes:
>>> well[63:64:0.083]
[12.0, 37.0, 44.0, 44.0, 44.0, 44.0, 44.0, 44.0, 44.0, 44.0]
>>> well[9.55]
44.0
>>> well[63.33:73.33]
[113.31999999999999,
 117.0,
 120.31999999999999,
 128.0,
 129.63999999999999,
 132.95999999999998,
 136.95999999999998,
 140.0,
 142.0,
 nan]
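The idea behind the interpolated values can be illustrated with a small stand-alone sketch. This is not the Bio.phenotype implementation; the function interpolate is a hypothetical helper that assumes the measurements are sorted by time, and it returns NaN outside the measured interval. The data points are the first five raw values shown for well A02.

```python
import bisect


def interpolate(points, t):
    """Linearly interpolate the signal at time t.

    points is a list of (time, signal) tuples sorted by time.
    Hypothetical helper; returns NaN outside the measured interval.
    """
    times = [p[0] for p in points]
    if t < times[0] or t > times[-1]:
        return float("nan")
    i = bisect.bisect_left(times, t)
    if times[i] == t:
        return points[i][1]  # this time point was measured directly
    (t0, y0), (t1, y1) = points[i - 1], points[i]
    return y0 + (y1 - y0) * (t - t0) / (t1 - t0)


raw = [(0.0, 12.0), (0.25, 18.0), (0.5, 27.0), (0.75, 35.0), (1.0, 37.0)]
print(interpolate(raw, 0.125))  # halfway between the first two measurements
print(interpolate(raw, 1.0))    # a measured time point
print(interpolate(raw, 2.0))    # outside the measured interval
```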
19.1.2.3 Control well subtraction
Many Phenotype Microarray plates contain a control well (usually A01), that is, a well where the media shouldn't support any growth; the low signal produced by this well can be subtracted from the other wells. The PlateRecord objects have a dedicated function for that, which returns another PlateRecord object with the corrected data.
>>> corrected = record.subtract_control(control='A01')
>>> record['A01'][63]
336.0
>>> corrected['A01'][63]
0.0
19.1.2.4 Parameters extraction
Those wells where metabolic activity is observed show a sigmoid behavior for the colorimetric data. To allow an easier way to compare different experiments, a sigmoid curve can be fitted onto the data, so that a series of summary parameters can be extracted and used for comparisons. The parameters that can be extracted from the curve are:
Minimum (min) and maximum (max) signal
Average height (average_height)
Area under the curve (area)
Curve plateau point (plateau)
Curve slope during exponential metabolic activity (slope)
Curve lag time (lag).
All the parameters (except min, max and average_height) require the scipy library to be installed.
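The fit function uses three sigmoid functions:
Gompertz
$A e^{-e^{\left(\frac{\mu_m e}{A}(\lambda - t) + 1\right)}} + y_0$
Logistic
$\frac{A}{1 + e^{\left(\frac{4\mu_m}{A}(\lambda - t) + 2\right)}} + y_0$
Richards
$A \left(1 + v e^{1+v} + e^{\frac{\mu_m}{A}(1+v)\left(1+\frac{1}{v}\right)(\lambda - t)}\right)^{-\frac{1}{v}} + y_0$
Where:
$A$ corresponds to the plateau
$\mu_m$ corresponds to the slope
$\lambda$ corresponds to the lag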
These functions have been derived from this publication. The fit method by default tries first to fit the gompertz function: if it fails it will then try to fit the logistic and then the richards function. The user can also specify one of the three functions to be applied.
>>> well = record['A02']
>>> well.fit()
>>> print("Function fitted: %s" % well.model)
Function fitted: gompertz
>>> for param in ["area", "average_height", "lag", "max", "min",
...               "plateau", "slope"]:
...     print("%s\t%.2f" % (param, getattr(well, param)))
...
area	4414.38
average_height	61.58
lag	48.60
max	143.00
min	12.00
plateau	120.02
slope	4.99
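To get a feel for how the plateau, slope and lag parameters shape the curve, here is a stand-alone sketch of the Gompertz and logistic functions using only the standard library. It is not the fitting code used by Bio.phenotype; the function and argument names are our own, and the parameter values are simply the fitted values printed above, with the baseline y0 assumed to be zero.

```python
import math


def gompertz(t, plateau, slope, lag, y0=0.0):
    """Gompertz curve: A * exp(-exp(mu * e / A * (lag - t) + 1)) + y0."""
    return plateau * math.exp(
        -math.exp((slope * math.e / plateau) * (lag - t) + 1)) + y0


def logistic(t, plateau, slope, lag, y0=0.0):
    """Logistic curve: A / (1 + exp(4 * mu / A * (lag - t) + 2)) + y0."""
    return plateau / (1 + math.exp((4 * slope / plateau) * (lag - t) + 2)) + y0


# Fitted values from the example above (y0 assumed to be 0 for illustration)
A, mu, lag = 120.02, 4.99, 48.60
for hours in (24, 48, 72, 96):
    print("%3i h  gompertz %.2f  logistic %.2f"
          % (hours, gompertz(hours, A, mu, lag), logistic(hours, A, mu, lag)))
```

Both curves stay close to y0 well before the lag time and approach the plateau (plus y0) at late time points.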
19.1.3 Writing Phenotype Microarray data
PlateRecord objects can be written to file in the form of JSON files, a format compatible with other software packages such as opm or DuctApe.
>>> phenotype.write(record, "out.json", "pm-json")
1
Chapter 20 Cookbook: Cool things to do with it
Biopython now has two collections of "cookbook" examples: this chapter (which has been included in this tutorial for many years and has gradually grown), and http://biopython.org/wiki/Category:Cookbook which is a user contributed collection on our wiki.
We're trying to encourage Biopython users to contribute their own examples to the wiki. In addition to helping the community, one direct benefit of sharing an example like this is that you could also get some feedback on the code from other Biopython users and developers, which could help you improve all your Python code.
In the long term, we may end up moving all of the examples in this chapter to the wiki, or elsewhere within the tutorial.
20.1 Working with sequence files
This section shows some more examples of sequence input/output, using the Bio.SeqIO module described in Chapter 5.
20.1.1 Filtering a sequence file
Often you'll have a large file with many sequences in it (e.g. a FASTA file of genes, or a FASTQ or SFF file of reads), a separate shorter list of the IDs for a subset of sequences of interest, and want to make a new sequence file for this subset.
Let's say the list of IDs is in a simple text file, as the first word on each line. This could be a tabular file where the first column is the ID. Try something like this:
from Bio import SeqIO

input_file = "big_file.sff"
id_file = "short_list.txt"
output_file = "short_list.sff"

with open(id_file) as id_handle:
    wanted = set(line.rstrip("\n").split(None, 1)[0] for line in id_handle)
print("Found %i unique identifiers in %s" % (len(wanted), id_file))

records = (r for r in SeqIO.parse(input_file, "sff") if r.id in wanted)
count = SeqIO.write(records, output_file, "sff")
print("Saved %i records from %s to %s" % (count, input_file, output_file))
if count < len(wanted):
    print("Warning %i IDs not found in %s" % (len(wanted) - count, input_file))
Note that we use a Python set rather than a list; this makes testing membership faster.
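The identifier parsing and the set-based membership test can be tried out without any sequence files. In this sketch the ID list is held in an in-memory text handle and the "records" are plain strings standing in for SeqRecord identifiers; both are made-up illustrative data.

```python
import io

# Made-up ID file: first word on each line is the identifier
id_text = "read_2 extra columns\nread_5\tsome\tfields\nread_2 duplicate line\n"
with io.StringIO(id_text) as id_handle:
    wanted = set(line.rstrip("\n").split(None, 1)[0] for line in id_handle)

# Stand-ins for the identifiers of the records in a big sequence file
record_ids = ["read_%i" % i for i in range(1, 7)]

# Membership tests against a set take roughly constant time per lookup,
# while a list would be scanned from the start each time
kept = [rec_id for rec_id in record_ids if rec_id in wanted]
print("Found %i unique identifiers, kept %i records" % (len(wanted), len(kept)))
```

Note that the duplicate line for read_2 is collapsed automatically because a set holds each identifier only once.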
As discussed in Section 5.6, for a large FASTA or FASTQ file you would be better off, for speed, not using the high-level SeqIO interface but working directly with strings. This next example shows how to do this with FASTQ files; it is more complicated:
from Bio.SeqIO.QualityIO import FastqGeneralIterator

input_file = "big_file.fastq"
id_file = "short_list.txt"
output_file = "short_list.fastq"

with open(id_file) as id_handle:
    # Taking first word on each line as an identifier
    wanted = set(line.rstrip("\n").split(None, 1)[0] for line in id_handle)
print("Found %i unique identifiers in %s" % (len(wanted), id_file))

count = 0
with open(input_file) as in_handle:
    with open(output_file, "w") as out_handle:
        for title, seq, qual in FastqGeneralIterator(in_handle):
            # The ID is the first word in the title line (after the @ sign):
            if title.split(None, 1)[0] in wanted:
                # This produces a standard 4-line FASTQ entry:
                out_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
                count += 1
print("Saved %i records from %s to %s" % (count, input_file, output_file))
if count < len(wanted):
    print("Warning %i IDs not found in %s" % (len(wanted) - count, input_file))
20.1.2 Producing randomised genomes
Let's suppose you are looking at a genome sequence, hunting for some sequence feature: maybe extreme local GC% bias, or possible restriction digest sites. Once you've got your Python code working on the real genome it may be sensible to try running the same search on randomised versions of the same genome for statistical analysis (after all, any "features" you've found could just be there by chance).
For this discussion, we'll use the GenBank file for the pPCP1 plasmid from Yersinia pestis biovar Microtus. The file is included with the Biopython unit tests under the GenBank folder, or you can get it from our website, NC_005816.gb. This file contains one and only one record, so we can read it in as a SeqRecord using the Bio.SeqIO.read() function:
>>> from Bio import SeqIO
>>> original_rec = SeqIO.read("NC_005816.gb", "genbank")
So, how can we generate a shuffled version of the original sequence? I would use the built-in Python random module for this, in particular the function random.shuffle, but this works on a Python list. Our sequence is a Seq object, so in order to shuffle it we need to turn it into a list:
>>> import random
>>> nuc_list = list(original_rec.seq)
>>> random.shuffle(nuc_list)  # acts in situ!
Now, in order to use Bio.SeqIO to output the shuffled sequence, we need to construct a new SeqRecord with a new Seq object using this shuffled list. In order to do this, we need to turn the list of nucleotides (single letter strings) into a long string; the standard Python way to do this is with the string object's join method.
>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord
>>> shuffled_rec = SeqRecord(Seq("".join(nuc_list), original_rec.seq.alphabet),
...                          id="Shuffled", description="Based on %s" % original_rec.id)
...
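The list, shuffle and join steps work the same way on any Python string, which makes them easy to try out on their own. In this sketch the short sequence and the fixed seed are arbitrary choices for illustration, and shuffled_string is a hypothetical helper name.

```python
import random


def shuffled_string(sequence, rng=random):
    """Return a shuffled copy of a string (hypothetical helper).

    rng can be the random module or a random.Random instance; passing a
    seeded instance makes the shuffle reproducible.
    """
    letters = list(sequence)   # strings are immutable, so shuffle a list
    rng.shuffle(letters)       # acts in situ, returns None
    return "".join(letters)    # back to a single string


rng = random.Random(42)        # arbitrary seed, only for reproducibility
original = "GATGACGGTGT"
shuffled = shuffled_string(original, rng)
print(original, "->", shuffled)
```

Shuffling like this preserves the overall letter composition (and so the GC%), which is what makes the randomised sequences a reasonable baseline for comparison.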
Let's put all these pieces together to make a complete Python script which generates a single FASTA file containing 30 randomly shuffled versions of the original sequence.
This first version just uses a big for loop and writes out the records one by one (using the SeqRecord's format method described in Section 5.5.4):
import random
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

original_rec = SeqIO.read("NC_005816.gb", "genbank")

with open("shuffled.fasta", "w") as output_handle:
    for i in range(30):
        nuc_list = list(original_rec.seq)
        random.shuffle(nuc_list)
        shuffled_rec = SeqRecord(Seq("".join(nuc_list), original_rec.seq.alphabet),
                                 id="Shuffled%i" % (i + 1),
                                 description="Based on %s" % original_rec.id)
        output_handle.write(shuffled_rec.format("fasta"))
Personally I prefer the following version, using a function to shuffle the record and a generator expression instead of the for loop:
import random
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

def make_shuffle_record(record, new_id):
    nuc_list = list(record.seq)
    random.shuffle(nuc_list)
    return SeqRecord(Seq("".join(nuc_list), record.seq.alphabet),
                     id=new_id, description="Based on %s" % record.id)

original_rec = SeqIO.read("NC_005816.gb", "genbank")
shuffled_recs = (make_shuffle_record(original_rec, "Shuffled%i" % (i + 1))
                 for i in range(30))
SeqIO.write(shuffled_recs, "shuffled.fasta", "fasta")
20.1.3 Translating a FASTA file of CDS entries
Suppose you've got an input file of CDS entries for some organism, and you want to generate a new FASTA file containing their protein sequences, i.e. take each nucleotide sequence from the original file, and translate it. Back in Section 3.9 we saw how to use the Seq object's translate method, and the optional cds argument which enables correct translation of alternative start codons.
We can combine this with Bio.SeqIO as shown in the reverse complement example in Section 5.5.3. The key point is that for each nucleotide SeqRecord, we need to create a protein SeqRecord, and take care of naming it.
You can write your own function to do this, choosing suitable protein identifiers for your sequences, and the appropriate genetic code. In this example we just use the default table and add a prefix to the identifier:
from Bio.SeqRecord import SeqRecord

def make_protein_record(nuc_record):
    """Returns a new SeqRecord with the translated sequence (default table)."""
    return SeqRecord(seq=nuc_record.seq.translate(cds=True),
                     id="trans_" + nuc_record.id,
                     description="translation of CDS, using default table")
We can then use this function to turn the input nucleotide records into protein records ready for output. An elegant and memory efficient way to do this is with a generator expression:
from Bio import SeqIO
proteins = (make_protein_record(nuc_rec) for nuc_rec in
            SeqIO.parse("coding_sequences.fasta", "fasta"))
SeqIO.write(proteins, "translations.fasta", "fasta")
This should work on any FASTA file of complete coding sequences. If you are working on partial coding sequences, you may prefer to use nuc_record.seq.translate(to_stop=True) in the example above, as this wouldn't check for a valid start codon etc.
20.1.4 Making the sequences in a FASTA file upper case
Often you'll get data from collaborators as FASTA files, and sometimes the sequences can be in a mixture of upper and lower case. In some cases this is deliberate (e.g. lower case for poor quality regions), but usually it is not important. You may want to edit the file to make everything consistent (e.g. all upper case), and you can do this easily using the upper() method of the SeqRecord object (added in Biopython 1.55):
from Bio import SeqIO
records = (rec.upper() for rec in SeqIO.parse("mixed.fas", "fasta"))
count = SeqIO.write(records, "upper.fas", "fasta")
print("Converted %i records to upper case" % count)
How does this work? The first line is just importing the Bio.SeqIO module. The second line is the interesting bit: this is a Python generator expression which gives an upper case version of each record parsed from the input file (mixed.fas). In the third line we give this generator expression to the Bio.SeqIO.write() function and it saves the new upper case records to our output file (upper.fas).
The reason we use a generator expression (rather than a list or list comprehension) is that this means only one record is kept in memory at a time. This can be really important if you are dealing with large files with millions of entries.
20.1.5 Sorting a sequence file
Suppose you wanted to sort a sequence file by length (e.g. a set of contigs from an assembly), and you are working with a file format like FASTA or FASTQ which Bio.SeqIO can read, write (and index).
If the file is small enough, you can load it all into memory at once as a list of SeqRecord objects, sort the list, and save it:
from Bio import SeqIO
records = list(SeqIO.parse("ls_orchid.fasta", "fasta"))
records.sort(key=lambda r: len(r))
SeqIO.write(records, "sorted_orchids.fasta", "fasta")
The only clever bit is specifying a comparison method for how to sort the records (here we sort them by length). If you wanted the longest records first, you could flip the comparison or use the reverse argument:
from Bio import SeqIO
records = list(SeqIO.parse("ls_orchid.fasta", "fasta"))
records.sort(key=lambda r: -len(r))
SeqIO.write(records, "sorted_orchids.fasta", "fasta")
Now that's pretty straightforward, but what happens if you have a very large file and you can't load it all into memory like this? For example, you might have some next-generation sequencing reads to sort by length. This can be solved using the Bio.SeqIO.index() function.
from Bio import SeqIO
# Get the lengths and ids, and sort on length
len_and_ids = sorted((len(rec), rec.id) for rec in
                     SeqIO.parse("ls_orchid.fasta", "fasta"))
ids = reversed([id for (length, id) in len_and_ids])
del len_and_ids  # free this memory
record_index = SeqIO.index("ls_orchid.fasta", "fasta")
records = (record_index[id] for id in ids)
SeqIO.write(records, "sorted.fasta", "fasta")
First we scan through the file once using Bio.SeqIO.parse(), recording the record identifiers and their lengths in a list of tuples. We then sort this list to get them in length order, and discard the lengths. Using this sorted list of identifiers Bio.SeqIO.index() allows us to retrieve the records one by one, and we pass them to Bio.SeqIO.write() for output.
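The same "sort the (length, identifier) tuples, then look records up by identifier" pattern can be seen in miniature below. Here a plain dictionary of made-up sequences stands in for the object returned by Bio.SeqIO.index(), so only the sorting logic is being illustrated.

```python
# Made-up data; the dictionary stands in for the SeqIO.index() object
records = {
    "contig_a": "ACGT",
    "contig_b": "ACGTACGTAC",
    "contig_c": "AC",
}

# Tuples sort on their first element (the length), then on the identifier
len_and_ids = sorted((len(seq), name) for name, seq in records.items())

# Discard the lengths, and reverse to get the longest first
ids = reversed([name for (length, name) in len_and_ids])
del len_and_ids  # only the ordered identifiers are needed from here on

# Retrieve the records one by one, in sorted order
ordered = [(name, records[name]) for name in ids]
for name, seq in ordered:
    print(name, len(seq))
```

Only the small tuples are held in memory during sorting; the records themselves are fetched one at a time afterwards.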
These examples all use Bio.SeqIO to parse the records into SeqRecord objects which are output using Bio.SeqIO.write(). What if you want to sort a file format which Bio.SeqIO.write() doesn't support, like the plain text SwissProt format? Here is an alternative solution using the get_raw() method added to Bio.SeqIO.index() in Biopython 1.54 (see Section 5.4.2.2).
from Bio import SeqIO
# Get the lengths and ids, and sort on length
len_and_ids = sorted((len(rec), rec.id) for rec in
                     SeqIO.parse("ls_orchid.fasta", "fasta"))
ids = reversed([id for (length, id) in len_and_ids])
del len_and_ids  # free this memory
record_index = SeqIO.index("ls_orchid.fasta", "fasta")
with open("sorted.fasta", "wb") as out_handle:
    for id in ids:
        out_handle.write(record_index.get_raw(id))
Note that with Python 3 onwards, we have to open the file for writing in binary mode because the get_raw() method returns bytes strings.
As a bonus, because it doesn't parse the data into SeqRecord objects a second time it should be faster. If you only want to use this with FASTA format, we can speed this up one step further by using the low-level FASTA parser to get the record identifiers and lengths:
from Bio.SeqIO.FastaIO import SimpleFastaParser
from Bio import SeqIO
# Get the lengths and ids, and sort on length
with open("ls_orchid.fasta") as in_handle:
    len_and_ids = sorted((len(seq), title.split(None, 1)[0]) for
                         title, seq in SimpleFastaParser(in_handle))
ids = reversed([id for (length, id) in len_and_ids])
del len_and_ids  # free this memory
record_index = SeqIO.index("ls_orchid.fasta", "fasta")
with open("sorted.fasta", "wb") as out_handle:
    for id in ids:
        out_handle.write(record_index.get_raw(id))
20.1.6 Simple quality filtering for FASTQ files
The FASTQ file format was introduced at Sanger and is now widely used for holding nucleotide sequencing reads together with their quality scores. FASTQ files (and the related QUAL files) are an excellent example of per-letter-annotation, because for each nucleotide in the sequence there is an associated quality score. Any per-letter-annotation is held in a SeqRecord in the letter_annotations dictionary as a list, tuple or string (with the same number of elements as the sequence length).
One common task is taking a large set of sequencing reads and filtering them (or cropping them) based on their quality scores. The following example is very simplistic, but should illustrate the basics of working with quality data in a SeqRecord object. All we are going to do here is read in a file of FASTQ data, and filter it to pick out only those records whose PHRED quality scores are all above some threshold (here 20).
For this example we'll use some real data downloaded from the ENA sequence read archive, ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR020/SRR020192/SRR020192.fastq.gz (2MB) which unzips to a 19MB file SRR020192.fastq. This is some Roche 454 GS FLX single end data from virus infected California sea lions (see http://www.ebi.ac.uk/ena/data/view/SRS004476 for details).
First, let's count the reads:
from Bio import SeqIO
count = 0
for rec in SeqIO.parse("SRR020192.fastq", "fastq"):
    count += 1
print("%i reads" % count)
Now let's do a simple filtering for a minimum PHRED quality of 20:
from Bio import SeqIO
good_reads = (rec for rec in
              SeqIO.parse("SRR020192.fastq", "fastq")
              if min(rec.letter_annotations["phred_quality"]) >= 20)
count = SeqIO.write(good_reads, "good_quality.fastq", "fastq")
print("Saved %i reads" % count)
This pulled out only 14580 reads out of the 41892 present. A more sensible thing to do would be to quality trim the reads, but this is intended as an example only.
FASTQ files can contain millions of entries, so it is best to avoid loading them all into memory at once. This example uses a generator expression, which means only one SeqRecord is created at a time, avoiding any memory limitations.
Note that it would be faster to use the low-level FastqGeneralIterator parser here (see Section 5.6), but that does not turn the quality string into integer scores.
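If you do work with the raw quality strings, turning them into integer scores yourself is straightforward for standard Sanger FASTQ, which uses an ASCII offset of 33. The sketch below uses hypothetical helper names and made-up quality strings, and assumes non-empty reads.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into a list of integer scores.

    Hypothetical helper; the default offset of 33 is the Sanger FASTQ
    convention.
    """
    return [ord(letter) - offset for letter in quality_string]


def passes_filter(quality_string, threshold=20, offset=33):
    """True if every score in a (non-empty) read is at or above threshold."""
    return min(phred_scores(quality_string, offset)) >= threshold


print(phred_scores("II5"))   # 'I' is ASCII 73, '5' is ASCII 53
print(passes_filter("II5"))  # lowest score is exactly 20
print(passes_filter("II4"))  # lowest score is 19
```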
20.1.7 Trimming off primer sequences
For this example we're going to pretend that GATGACGGTGT is a 5' primer sequence we want to look for in some FASTQ formatted read data. As in the example above, we'll use the SRR020192.fastq file downloaded from the ENA (ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR020/SRR020192/SRR020192.fastq.gz).
By using the main Bio.SeqIO interface, the same approach would work with any other supported file format (e.g. FASTA files). However, for large FASTQ files it would be faster to use the low-level FastqGeneralIterator parser here (see the earlier example, and Section 5.6).
This code uses Bio.SeqIO with a generator expression (to avoid loading all the sequences into memory at once), and the Seq object's startswith method to see if the read starts with the primer sequence:
from Bio import SeqIO
primer_reads = (rec for rec in
                SeqIO.parse("SRR020192.fastq", "fastq")
                if rec.seq.startswith("GATGACGGTGT"))
count = SeqIO.write(primer_reads, "with_primer.fastq", "fastq")
print("Saved %i reads" % count)
That should find 13819 reads from SRR020192.fastq and save them to a new FASTQ file, with_primer.fastq.
Now suppose that instead you wanted to make a FASTQ file containing these reads but with the primer sequence removed? That's just a small change, as we can slice the SeqRecord (see Section 4.7) to remove the first eleven letters (the length of our primer):
from Bio import SeqIO
trimmed_primer_reads = (rec[11:] for rec in
                        SeqIO.parse("SRR020192.fastq", "fastq")
                        if rec.seq.startswith("GATGACGGTGT"))
count = SeqIO.write(trimmed_primer_reads, "with_primer_trimmed.fastq", "fastq")
print("Saved %i reads" % count)
Again, that should pull out the 13819 reads from SRR020192.fastq, but this time strip off the first eleven characters, and save them to another new FASTQ file, with_primer_trimmed.fastq.
Now, suppose you want to create a new FASTQ file where these reads have their primer removed, but all the other reads are kept as they were? If we want to still use a generator expression, it is probably clearest to define our own trim function:
from Bio import SeqIO
def trim_primer(record, primer):
    if record.seq.startswith(primer):
        return record[len(primer):]
    else:
        return record

trimmed_reads = (trim_primer(record, "GATGACGGTGT") for record in
                 SeqIO.parse("SRR020192.fastq", "fastq"))
count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
print("Saved %i reads" % count)
This takes longer, as this time the output file contains all 41892 reads. Again, we've used a generator expression to avoid any memory problems. You could alternatively use a generator function rather than a generator expression.
from Bio import SeqIO
def trim_primers(records, primer):
    """Removes perfect primer sequences at start of reads.

    This is a generator function, the records argument should
    be a list or iterator returning SeqRecord objects.
    """
    len_primer = len(primer)  # cache this for later
    for record in records:
        if record.seq.startswith(primer):
            yield record[len_primer:]
        else:
            yield record

original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
trimmed_reads = trim_primers(original_reads, "GATGACGGTGT")
count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
print("Saved %i reads" % count)
This form is more flexible if you want to do something more complicated where only some of the records are retained, as shown in the next example.
20.1.8 Trimming off adaptor sequences
This is essentially a simple extension to the previous example. We are going to pretend GATGACGGTGT is an adaptor sequence in some FASTQ formatted read data, again the SRR020192.fastq file from the ENA (ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR020/SRR020192/SRR020192.fastq.gz).
This time however, we will look for the sequence anywhere in the reads, not just at the very beginning:
from Bio import SeqIO

def trim_adaptors(records, adaptor):
    """Trims perfect adaptor sequences.

    This is a generator function, the records argument should
    be a list or iterator returning SeqRecord objects.
    """
    len_adaptor = len(adaptor)  # cache this for later
    for record in records:
        index = record.seq.find(adaptor)
        if index == -1:
            # adaptor not found, so won't trim
            yield record
        else:
            # trim off the adaptor
            yield record[index + len_adaptor:]

original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
trimmed_reads = trim_adaptors(original_reads, "GATGACGGTGT")
count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
print("Saved %i reads" % count)
Because we are using a FASTQ input file in this example, the SeqRecord objects have per-letter-annotation for the quality scores. By slicing the SeqRecord object the appropriate scores are used on the trimmed records, so we can output them as a FASTQ file too.
Compared to the output of the previous example where we only looked for a primer/adaptor at the start of each read, you may find some of the trimmed reads are quite short after trimming (e.g. if the adaptor was found in the middle rather than near the start). So, let's add a minimum length requirement as well:
from Bio import SeqIO

def trim_adaptors(records, adaptor, min_len):
    """Trims perfect adaptor sequences, checks read length.

    This is a generator function, the records argument should
    be a list or iterator returning SeqRecord objects.
    """
    len_adaptor = len(adaptor)  # cache this for later
    for record in records:
        len_record = len(record)  # cache this for later
        if len_record < min_len:
            # Too short to keep
            continue
        index = record.seq.find(adaptor)
        if index == -1:
            # adaptor not found, so won't trim
            yield record
        elif len_record - index - len_adaptor >= min_len:
            # after trimming this will still be long enough
            yield record[index + len_adaptor:]

original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
trimmed_reads = trim_adaptors(original_reads, "GATGACGGTGT", 100)
count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
print("Saved %i reads" % count)
By changing the format names, you could apply this to FASTA files instead. This code could also be extended to do a fuzzy match instead of an exact match (maybe using a pairwise alignment, or taking into account the read quality scores), but that will be much slower.
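The trimming decision itself can be checked on plain Python strings, without any sequence files. The function below is a hypothetical string-only sketch of the same logic: it returns the read to keep (possibly trimmed), or None for a read that should be dropped. The test reads are made up.

```python
def trim_adaptor(seq, adaptor, min_len):
    """Return the (possibly trimmed) read, or None if it should be dropped.

    Hypothetical string-only sketch of the find, trim and length check steps.
    """
    if len(seq) < min_len:
        return None                        # too short to keep
    index = seq.find(adaptor)
    if index == -1:
        return seq                         # adaptor not found, so won't trim
    trimmed = seq[index + len(adaptor):]   # keep what follows the adaptor
    return trimmed if len(trimmed) >= min_len else None


adaptor = "GATGACGGTGT"
reads = ["AA" + adaptor + "CCCCC", "ACGTACGT", "ACG", adaptor + "TT"]
for read in reads:
    print(read, "->", trim_adaptor(read, adaptor, 5))
```

With a minimum length of 5, the first read is trimmed, the second is kept unchanged, and the last two are dropped.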
20.1.9 Converting FASTQ files
Back in Section 5.5.2 we showed how to use Bio.SeqIO to convert between two file formats. Here we'll go into a little more detail regarding FASTQ files, which are used in second generation DNA sequencing. Please refer to Cock et al. (2009) [7] for a longer description. FASTQ files store both the DNA sequence (as a string) and the associated read qualities.
PHRED scores (used in most FASTQ files, and also in QUAL files, ACE files and SFF files) have become a de facto standard for representing the probability of a sequencing error (here denoted by $P_e$) at a given base using a simple base ten log transformation:
$Q_{\mathrm{PHRED}} = -10 \times \log_{10}(P_e)$ (20.1)
This means a wrong read ($P_e = 1$) gets a PHRED quality of 0, while a very good read like $P_e = 0.00001$ gets a PHRED quality of 50. While for raw sequencing data qualities higher than this are rare, with post processing such as read mapping or assembly, qualities of up to about 90 are possible (indeed, the MAQ tool allows for PHRED scores in the range 0 to 93 inclusive).
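The PHRED transformation is simple enough to check by hand, or with a couple of lines of Python. The function names here are our own, chosen for illustration.

```python
import math


def phred_from_error(p_error):
    """PHRED quality for an error probability (0 < p_error <= 1)."""
    return -10 * math.log10(p_error)


def error_from_phred(quality):
    """Error probability for a PHRED quality (the inverse transformation)."""
    return 10 ** (-quality / 10.0)


for p in (1.0, 0.1, 0.01, 0.00001):
    print("Pe = %g gives Q = %.1f" % (p, phred_from_error(p)))
```

Each additional 10 points of PHRED quality corresponds to a ten-fold drop in the estimated error probability.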
The FASTQ format has the potential to become a de facto standard for storing the letters and quality scores for a sequencing read in a single plain text file. The only fly in the ointment is that there are at least three versions of the FASTQ format which are incompatible and difficult to distinguish...
1. The original Sanger FASTQ format uses PHRED qualities encoded with an ASCII offset of 33. The NCBI are using this format in their Short Read Archive. We call this the fastq (or fastq-sanger) format in Bio.SeqIO.
2. Solexa (later bought by Illumina) introduced their own version using Solexa qualities encoded with an ASCII offset of 64. We call this the fastq-solexa format.
3. Illumina pipeline 1.3 onwards produces FASTQ files with PHRED qualities (which is more consistent), but encoded with an ASCII offset of 64. We call this the fastq-illumina format.
The Solexa quality scores are defined using a different log transformation:
$Q_{\mathrm{Solexa}} = -10 \times \log_{10}\left(\frac{P_e}{1 - P_e}\right)$ (20.2)
Given Solexa/Illumina have now moved to using PHRED scores in version 1.3 of their pipeline, the Solexa quality scores will gradually fall out of use. If you equate the error estimates ($P_e$) these two equations allow conversion between the two scoring systems, and Biopython includes functions to do this in the Bio.SeqIO.QualityIO module, which are called if you use Bio.SeqIO to convert an old Solexa/Illumina file into a standard Sanger FASTQ file:
from Bio import SeqIO
SeqIO.convert("solexa.fastq", "fastq-solexa", "standard.fastq", "fastq")
If you want to convert a new Illumina 1.3+ FASTQ file, all that gets changed is the ASCII offset because, although encoded differently, the scores are all PHRED qualities:
from Bio import SeqIO
SeqIO.convert("illumina.fastq", "fastq-illumina", "standard.fastq", "fastq")
Note that using Bio.SeqIO.convert() like this is much faster than combining Bio.SeqIO.parse() and Bio.SeqIO.write() because optimised code is used for converting between FASTQ variants (and also for FASTQ to FASTA conversion).
For good quality reads, PHRED and Solexa scores are approximately equal, which means since both the fastq-solexa and fastq-illumina formats use an ASCII offset of 64 the files are almost the same. This was a deliberate design choice by Illumina, meaning applications expecting the old fastq-solexa style files will probably be OK using the newer fastq-illumina files (on good data). Of course, both variants are very different from the original FASTQ standard as used by Sanger, the NCBI, and elsewhere (format name fastq or fastq-sanger).
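To see why the two scales agree for good reads, you can equate the error probability in the PHRED and Solexa definitions and solve for one score in terms of the other. The functions below are a stand-alone sketch of that algebra with names of our own choosing; they are not Biopython's own conversion functions, which live in Bio.SeqIO.QualityIO and may differ in how they handle rounding and edge cases.

```python
import math


def phred_from_solexa(q_solexa):
    """Convert a Solexa quality to the PHRED scale (same error probability)."""
    return 10 * math.log10(10 ** (q_solexa / 10.0) + 1)


def solexa_from_phred(q_phred):
    """Convert a PHRED quality (> 0) to the Solexa scale."""
    return 10 * math.log10(10 ** (q_phred / 10.0) - 1)


for q in (-5, 0, 10, 20, 40):
    print("Solexa %3i is PHRED %.4f" % (q, phred_from_solexa(q)))
```

At a Solexa score of 40 the two scales differ by less than 0.001, while a Solexa score of 0 (an error probability of one half) corresponds to a PHRED score of about 3.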
For more details, see the built-in help (also online):
>>> from Bio.SeqIO import QualityIO
>>> help(QualityIO)
...
20.1.10 Converting FASTA and QUAL files into FASTQ files
FASTQ files hold both sequences and their quality strings. FASTA files hold just sequences, while QUAL files hold just the qualities. Therefore a single FASTQ file can be converted to or from paired FASTA and QUAL files.
Going from FASTQ to FASTA is easy:
from Bio import SeqIO
SeqIO.convert("example.fastq", "fastq", "example.fasta", "fasta")
Going from FASTQ to QUAL is also easy:
from Bio import SeqIO
SeqIO.convert("example.fastq", "fastq", "example.qual", "qual")
However, the reverse is a little more tricky. You can use Bio.SeqIO.parse() to iterate over the records in a single file, but in this case we have two input files. There are several strategies possible, but assuming that the two files are really paired the most memory efficient way is to loop over both together. The code is a little fiddly, so we provide a function called PairedFastaQualIterator in the Bio.SeqIO.QualityIO module to do this. This takes two handles (the FASTA file and the QUAL file) and returns a SeqRecord iterator:
from Bio.SeqIO.QualityIO import PairedFastaQualIterator
for record in PairedFastaQualIterator(open("example.fasta"), open("example.qual")):
    print(record)
This function will check that the FASTA and QUAL files are consistent (e.g. the records are in the same order, and have the same sequence length). You can combine this with the Bio.SeqIO.write() function to convert a pair of FASTA and QUAL files into a single FASTQ file:
from Bio import SeqIO
from Bio.SeqIO.QualityIO import PairedFastaQualIterator
with open("example.fasta") as f_handle, open("example.qual") as q_handle:
    records = PairedFastaQualIterator(f_handle, q_handle)
    count = SeqIO.write(records, "temp.fastq", "fastq")
print("Converted %i records" % count)
20.1.11 Indexing a FASTQ file
FASTQ files are often very large, with millions of reads in them. Due to the sheer amount of data, you can't load all the records into memory at once. This is why the
examples above (filtering and trimming) iterate over the file looking at just one SeqRecord at a time.
However, sometimes you can't use a big loop or an iterator: you may need random access to the reads. Here the Bio.SeqIO.index() function may prove very helpful, as it
allows you to access any read in the FASTQ file by its name (see Section 5.4.2).
Again we'll use the SRR020192.fastq file from the ENA (ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR020/SRR020192/SRR020192.fastq.gz), although this is actually quite a
small FASTQ file with fewer than 50,000 reads:
>>> from Bio import SeqIO
>>> fq_dict = SeqIO.index("SRR020192.fastq", "fastq")
>>> len(fq_dict)
41892
>>> list(fq_dict.keys())[:4]
['SRR020192.38240', 'SRR020192.23181', 'SRR020192.40568', 'SRR020192.23186']
>>> fq_dict["SRR020192.23186"].seq
Seq('GTCCCAGTATTCGGATTTGTCTGCCAAAACAATGAAATTGACACAGTTTACAAC...CCG', SingleLetterAlphabet())
When testing this on a FASTQ file with seven million reads, indexing took about a minute, but record access was almost instant.
The example in Section 20.1.5 shows how you can use the Bio.SeqIO.index() function to sort a large FASTA file; this could also be used on FASTQ files.
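To give a feel for why indexing is slow once but lookups are then almost instant, here is a minimal pure-Python sketch of the general idea: scan the file once recording where each record starts, then seek straight to a record on demand. This is only an illustration under simplifying assumptions (strict four-line FASTQ records, made-up function names and toy data); it is not how Bio.SeqIO.index() is implemented.

```python
import os
import tempfile


def index_fastq_offsets(filename):
    """Map each read name to the byte offset where its record starts.

    Assumes simple four-line FASTQ records (no line wrapping).
    """
    offsets = {}
    with open(filename, "rb") as handle:
        while True:
            offset = handle.tell()
            title = handle.readline()
            if not title:
                break
            # The name is the first word after the leading "@"
            name = title[1:].split(None, 1)[0].decode("ascii")
            offsets[name] = offset
            for _ in range(3):  # skip the sequence, "+" and quality lines
                handle.readline()
    return offsets


def fetch_record(filename, offset):
    """Return the four lines of the record starting at the given offset."""
    with open(filename, "rb") as handle:
        handle.seek(offset)
        return [handle.readline().decode("ascii").rstrip() for _ in range(4)]


# Tiny made-up file, for illustration only
with tempfile.NamedTemporaryFile("w", suffix=".fastq", delete=False) as tmp:
    tmp.write("@read1 first\nACGT\n+\nIIII\n@read2\nTTGA\n+\n!!II\n")
offsets = index_fastq_offsets(tmp.name)
record_lines = fetch_record(tmp.name, offsets["read2"])
print(record_lines)
os.remove(tmp.name)
```

Only the names and offsets are held in memory, so the cost of the index grows with the number of reads rather than with the total file size.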
20.1.12 Converting SFF files
If you work with 454 (Roche) sequence data, you will probably have access to the raw data as a Standard Flowgram Format (SFF) file. This contains the sequence reads
(called bases) with quality scores and the original flow information.
A common task is to convert from SFF to a pair of FASTA and QUAL files, or to a single FASTQ file. These operations are trivial using the Bio.SeqIO.convert()
function (see Section 5.5.2):
>>> from Bio import SeqIO
>>> SeqIO.convert("E3MFGYR02_random_10_reads.sff", "sff", "reads.fasta", "fasta")
10
>>> SeqIO.convert("E3MFGYR02_random_10_reads.sff", "sff", "reads.qual", "qual")
10
>>> SeqIO.convert("E3MFGYR02_random_10_reads.sff", "sff", "reads.fastq", "fastq")
10
Remember the convert function returns the number of records, in this example just ten. This will give you the untrimmed reads, where the leading and trailing poor
quality sequence or adaptor will be in lower case. If you want the trimmed reads (using the clipping information recorded within the SFF file) use this:
>>> SeqIO.convert("E3MFGYR02_random_10_reads.sff", "sff-trim", "trimmed.fasta", "fasta")
10
>>> SeqIO.convert("E3MFGYR02_random_10_reads.sff", "sff-trim", "trimmed.qual", "qual")
10
>>> SeqIO.convert("E3MFGYR02_random_10_reads.sff", "sff-trim", "trimmed.fastq", "fastq")
10
If you run Linux, you could ask Roche for a copy of their "off instrument" tools (often referred to as the Newbler tools). This offers an alternative way to do SFF to
FASTA or QUAL conversion at the command line (but currently FASTQ output is not supported), e.g.
$ sffinfo -seq -notrim E3MFGYR02_random_10_reads.sff > reads.fasta
$ sffinfo -qual -notrim E3MFGYR02_random_10_reads.sff > reads.qual
$ sffinfo -seq -trim E3MFGYR02_random_10_reads.sff > trimmed.fasta
$ sffinfo -qual -trim E3MFGYR02_random_10_reads.sff > trimmed.qual
The way Biopython uses mixed case sequence strings to represent the trimming points deliberately mimics what the Roche tools do.
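Because the trimming points are encoded purely by letter case, you can recover the trimmed read from an untrimmed sequence string without going back to the SFF file. The helper below is a hypothetical illustration (its name and the toy read are made up), assuming the clipped flanks are lower case and the retained region is upper case.

```python
def trim_mixed_case(read):
    """Return the upper case core of a read, dropping the lower case flanks."""
    upper_positions = [i for i, letter in enumerate(read) if letter.isupper()]
    if not upper_positions:
        return ""  # the whole read was clipped
    return read[upper_positions[0]:upper_positions[-1] + 1]


trimmed = trim_mixed_case("tcagACGTTGCAnn")
print(trimmed)
```

In practice the "sff-trim" format shown above is the simpler route, since it also trims the quality scores consistently.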
For more information on the Biopython SFF support, consult the built-in help:
>>> from Bio.SeqIO import SffIO
>>> help(SffIO)
...
20.1.13 Identifying open reading frames
A very simplistic first step at identifying possible genes is to look for open reading frames (ORFs). By this we mean look in all six frames for long regions without stop
codons; an ORF is just a region of nucleotides with no in-frame stop codons.
Of course, to find a gene you would also need to worry about locating a start codon and possible promoters, and in Eukaryotes there are introns to worry about too.
However, this approach is still useful in viruses and Prokaryotes.
To show how you might approach this with Biopython, we'll need a sequence to search, and as an example we'll again use the bacterial plasmid, although this time we'll
start with a plain FASTA file with no pre-marked genes: NC_005816.fna. This is a bacterial sequence, so we'll want to use NCBI codon table 11 (see Section 3.9 about
translation).
>>> from Bio import SeqIO
>>> record = SeqIO.read("NC_005816.fna", "fasta")
>>> table = 11
>>> min_pro_len = 100
Here is a neat trick using the Seq object's split method to get a list of all the possible ORF translations in the six reading frames:
>>> for strand, nuc in [(+1, record.seq), (-1, record.seq.reverse_complement())]:
...     for frame in range(3):
...         length = 3 * ((len(record) - frame) // 3)  # Multiple of three
...         for pro in nuc[frame:frame + length].translate(table).split("*"):
...             if len(pro) >= min_pro_len:
...                 print("%s...%s - length %i, strand %i, frame %i" \
...                       % (pro[:30], pro[-3:], len(pro), strand, frame))
GCLMKKSSIVATIITILSGSANAASSQLIP...YRF - length 315, strand 1, frame 0
KSGELRQTPPASSTLHLRLILQRSGVMMEL...NPE - length 285, strand 1, frame 1
GLNCSFFSICNWKFIDYINRLFQIIYLCKN...YYH - length 176, strand 1, frame 1
VKKILYIKALFLCTVIKLRRFIFSVNNMKF...DLP - length 165, strand 1, frame 1
NQIQGVICSPDSGEFMVTFETVMEIKILHK...GVA - length 355, strand 1, frame 2
RRKEHVSKKRRPQKRPRRRRFFHRLRPPDE...PTR - length 128, strand 1, frame 2
TGKQNSCQMSAIWQLRQNTATKTRQNRARI...AIK - length 100, strand 1, frame 2
QGSGYAFPHASILSGIAMSHFYFLVLHAVK...CSD - length 114, strand -1, frame 0
IYSTSEHTGEQVMRTLDEVIASRSPESQTR...FHV - length 111, strand -1, frame 0
WGKLQVIGLSMWMVLFSQRFDDWLNEQEDA...ESK - length 125, strand -1, frame 1
RGIFMSDTMVVNGSGGVPAFLFSGSTLSSY...LLK - length 361, strand -1, frame 1
WDVKTVTGVLHHPFHLTFSLCPEGATQSGR...VKR - length 111, strand -1, frame 1
LSHTVTDFTDQMAQVGLCQCVNVFLDEVTG...KAA - length 107, strand -1, frame 2
RALTGLSAPGIRSQTSCDRLRELRYVPVSL...PLQ - length 119, strand -1, frame 2
Note that here we are counting the frames from the 5' end (start) of each strand. It is sometimes easier to always count from the 5' end (start) of the forward strand.
You could easily edit the above loop-based code to build up a list of the candidate proteins, or convert this to a list comprehension. Now, one thing this code doesn't do is
keep track of where the proteins are.
You could tackle this in several ways. For example, the following code tracks the locations in terms of the protein counting, and converts back to the parent sequence by
multiplying by three, then adjusting for the frame and strand:
from Bio import SeqIO
record = SeqIO.read("NC_005816.fna", "fasta")
table = 11
min_pro_len = 100
def find_orfs_with_trans(seq, trans_table, min_protein_length):
    answer = []
    seq_len = len(seq)
    for strand, nuc in [(+1, seq), (-1, seq.reverse_complement())]:
        for frame in range(3):
            trans = str(nuc[frame:].translate(trans_table))
            trans_len = len(trans)
            aa_start = 0
            aa_end = 0
            while aa_start < trans_len:
                # Next stop codon, or the end of the translation if none is left
                aa_end = trans.find("*", aa_start)
                if aa_end == -1:
                    aa_end = trans_len
                if aa_end - aa_start >= min_protein_length:
                    # Convert protein coordinates back to the parent sequence
                    if strand == 1:
                        start = frame + aa_start * 3
                        end = min(seq_len, frame + aa_end * 3 + 3)
                    else:
                        start = seq_len - frame - aa_end * 3 - 3
                        end = seq_len - frame - aa_start * 3
                    answer.append((start, end, strand,
                                   trans[aa_start:aa_end]))
                aa_start = aa_end + 1
    answer.sort()
    return answer
orf_list = find_orfs_with_trans(record.seq, table, min_pro_len)
for start, end, strand, pro in orf_list:
    print("%s...%s - length %i, strand %i, %i:%i" \
          % (pro[:30], pro[-3:], len(pro), strand, start, end))
And the output:
NQIQGVICSPDSGEFMVTFETVMEIKILHK...GVA - length 355, strand 1, 41:1109
WDVKTVTGVLHHPFHLTFSLCPEGATQSGR...VKR - length 111, strand -1, 491:827
KSGELRQTPPASSTLHLRLILQRSGVMMEL...NPE - length 285, strand 1, 1030:1888
RALTGLSAPGIRSQTSCDRLRELRYVPVSL...PLQ - length 119, strand -1, 2830:3190
RRKEHVSKKRRPQKRPRRRRFFHRLRPPDE...PTR - length 128, strand 1, 3470:3857
GLNCSFFSICNWKFIDYINRLFQIIYLCKN...YYH - length 176, strand 1, 4249:4780
RGIFMSDTMVVNGSGGVPAFLFSGSTLSSY...LLK - length 361, strand -1, 4814:5900
VKKILYIKALFLCTVIKLRRFIFSVNNMKF...DLP - length 165, strand 1, 5923:6421
LSHTVTDFTDQMAQVGLCQCVNVFLDEVTG...KAA - length 107, strand -1, 5974:6298
GCLMKKSSIVATIITILSGSANAASSQLIP...YRF - length 315, strand 1, 6654:7602
IYSTSEHTGEQVMRTLDEVIASRSPESQTR...FHV - length 111, strand -1, 7788:8124
WGKLQVIGLSMWMVLFSQRFDDWLNEQEDA...ESK - length 125, strand -1, 8087:8465
TGKQNSCQMSAIWQLRQNTATKTRQNRARI...AIK - length 100, strand 1, 8741:9044
QGSGYAFPHASILSGIAMSHFYFLVLHAVK...CSD - length 114, strand -1, 9264:9609
If you comment out the sort statement, then the protein sequences will be shown in the same order as before, so you can check this is doing the same thing. Here we have
sorted them by location to make it easier to compare to the actual annotation in the GenBank file (as visualised in Section 17.1.9).
If however all you want to find are the locations of the open reading frames, then it is a waste of time to translate every possible codon, including doing the reverse
complement to search the reverse strand too. All you need to do is search for the possible stop codons (and their reverse complements). Using regular expressions is an
obvious approach here (see the Python module re). These are an extremely powerful (but rather complex) way of describing search strings, which are supported in lots of
programming languages and also in command line tools like grep. You can find whole books about this topic!
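As a rough sketch of the regular expression idea, the following code locates stretches without in-frame stop codons on the forward strand only, working on a plain string. It assumes the three stop codons of NCBI table 11 (TAA, TAG and TGA); the function name, return format and toy sequence are made up for this illustration, and handling the reverse strand (by searching for the reverse complements of the stop codons) is left out.

```python
import re

# A lookahead is used so that overlapping candidate codons are all found
STOP_CODONS = re.compile(r"(?=(TAA|TAG|TGA))")


def find_orf_spans(seq, min_protein_length):
    """Return sorted (start, end, frame) spans with no in-frame stop codon.

    Coordinates are Python style, on the forward strand only.
    """
    seq = seq.upper()
    stops = {0: [], 1: [], 2: []}
    for match in STOP_CODONS.finditer(seq):
        stops[match.start() % 3].append(match.start())
    spans = []
    for frame in range(3):
        # End of the last complete codon in this frame
        frame_end = frame + 3 * ((len(seq) - frame) // 3)
        start = frame
        for stop in stops[frame] + [frame_end]:
            if (stop - start) // 3 >= min_protein_length:
                spans.append((start, stop, frame))
            start = stop + 3  # resume just after the stop codon
    return sorted(spans)


spans = find_orf_spans("ATGAAATAAGGGCCC", min_protein_length=2)
print(spans)
```

Unlike the translation-based function, the spans here exclude the stop codon itself, so the end coordinates will differ by three where a stop codon terminates the ORF.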
20.2 Sequence parsing plus simple plots
This section shows some more examples of sequence parsing, using the Bio.SeqIO module described in Chapter 5, plus the Python library matplotlib's pylab plotting
interface (see the matplotlib website for a tutorial). Note that to follow these examples you will need matplotlib installed, but without it you can still try the data parsing
bits.
20.2.1 Histogram of sequence lengths
There are lots of times when you might want to visualise the distribution of sequence lengths in a dataset, for example the range of contig sizes in a genome assembly
project. In this example we'll reuse our orchid FASTA file ls_orchid.fasta which has only 94 sequences.
First of all, we will use Bio.SeqIO to parse the FASTA file and compile a list of all the sequence lengths. You could do this with a for loop, but I find a list comprehension
more pleasing:
>>> from Bio import SeqIO
>>> sizes = [len(rec) for rec in SeqIO.parse("ls_orchid.fasta", "fasta")]
>>> len(sizes), min(sizes), max(sizes)
(94, 572, 789)
>>> sizes
[740, 753, 748, 744, 733, 718, 730, 704, 740, 709, 700, 726, ..., 592]
Now that we have the lengths of all the genes (as a list of integers), we can use the matplotlib histogram function to display it.
from Bio import SeqIO
sizes = [len(rec) for rec in SeqIO.parse("ls_orchid.fasta", "fasta")]
import pylab
pylab.hist(sizes, bins=20)
pylab.title("%i orchid sequences\nLengths %i to %i" \
            % (len(sizes), min(sizes), max(sizes)))
pylab.xlabel("Sequence length (bp)")
pylab.ylabel("Count")
pylab.show()
That should pop up a new window containing the following graph:
[Figure: histogram of the orchid sequence lengths]
Notice that most of these orchid sequences are about 740 bp long, and there could be two distinct classes of sequence here with a subset of shorter sequences.
Tip: Rather than using pylab.show() to show the plot in a window, you can also use pylab.savefig(...) to save the figure to a file (e.g. as a PNG or PDF).
20.2.2 Plot of sequence GC%
Another easily calculated quantity of a nucleotide sequence is the GC%. You might want to look at the GC% of all the genes in a bacterial genome for example, and
investigate any outliers which could have been recently acquired by horizontal gene transfer. Again, for this example we'll reuse our orchid FASTA file ls_orchid.fasta.
First of all, we will use Bio.SeqIO to parse the FASTA file and compile a list of all the GC percentages. Again, you could do this with a for loop, but I prefer this:
from Bio import SeqIO
from Bio.SeqUtils import GC
gc_values = sorted(GC(rec.seq) for rec in SeqIO.parse("ls_orchid.fasta", "fasta"))
Having read in each sequence and calculated the GC%, we then sorted them into ascending order. Now we'll take this list of floating point values and plot them with
matplotlib:
import pylab
pylab.plot(gc_values)
pylab.title("%i orchid sequences\nGC%% %0.1f to %0.1f" \
            % (len(gc_values), min(gc_values), max(gc_values)))
pylab.xlabel("Genes")
pylab.ylabel("GC%")
pylab.show()
As in the previous example, that should pop up a new window containing a graph:
[Figure: sorted GC% values of the orchid sequences]
If you tried this on the full set of genes from one organism, you'd probably get a much smoother plot than this.
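If you want to check what is being calculated without Biopython installed, the GC% is simple to compute by hand. The sketch below is a simplified stand-in (it counts only the letters G and C, ignoring ambiguity codes) applied to a few made-up toy sequences; the function name is hypothetical.

```python
def gc_percent(sequence):
    """Percentage of G and C letters in a nucleotide string (case insensitive)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0  # avoid dividing by zero on an empty sequence
    gc_count = sequence.count("G") + sequence.count("C")
    return 100.0 * gc_count / len(sequence)


# Made-up toy sequences, sorted into ascending GC% as in the text
toy_sequences = ["ACGT", "GGCC", "ATAT", "GGGA"]
gc_values = sorted(gc_percent(seq) for seq in toy_sequences)
print(gc_values)
```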
20.2.3 Nucleotide dot plots
A dot plot is a way of visually comparing two nucleotide sequences for similarity to each other. A sliding window is used to compare short sub-sequences to each other,
often with a mismatch threshold. Here for simplicity we'll only look for perfect matches (shown in black in the plot below).
To start off, we'll need two sequences. For the sake of argument, we'll just take the first two from our orchid FASTA file ls_orchid.fasta:
from Bio import SeqIO
with open("ls_orchid.fasta") as in_handle:
    record_iterator = SeqIO.parse(in_handle, "fasta")
    rec_one = next(record_iterator)
    rec_two = next(record_iterator)
We're going to show two approaches. Firstly, a simple naive implementation which compares all the window sized sub-sequences to each other to compile a similarity
matrix. You could construct a matrix or array object, but here we just use a list of lists of booleans created with a nested list comprehension:
window = 7
seq_one = str(rec_one.seq).upper()
seq_two = str(rec_two.seq).upper()
# Rows (i) follow seq_two on the y axis, columns (j) follow seq_one on the x axis.
# A perfect match gives False, which the gray color scheme draws in black.
data = [[(seq_two[i:i + window] != seq_one[j:j + window])
         for j in range(len(seq_one) - window)]
        for i in range(len(seq_two) - window)]
Note that we have not checked for reverse complement matches here. Now we'll use matplotlib's pylab.imshow() function to display this data, first requesting the
gray color scheme so this is done in black and white:
import pylab
pylab.gray()
pylab.imshow(data)
pylab.xlabel("%s (length %i bp)" % (rec_one.id, len(rec_one)))
pylab.ylabel("%s (length %i bp)" % (rec_two.id, len(rec_two)))
pylab.title("Dot plot using window size %i\n(allowing no mis-matches)" % window)
pylab.show()
That should pop up a new window containing a graph like this:
[Figure: dot plot of the first two orchid sequences drawn with pylab.imshow()]
As you might have expected, these two sequences are very similar, with a partial line of window sized matches along the diagonal. There are no off-diagonal matches
which would be indicative of inversions or other interesting events.
The above code works fine on small examples, but there are two problems applying this to larger sequences, which we will address below. First of all, this brute force
approach to the all-against-all comparisons is very slow. Instead, we'll compile dictionaries mapping the window sized sub-sequences to their locations, and then take the
set intersection to find those sub-sequences found in both sequences. This uses more memory, but is much faster. Secondly, the pylab.imshow() function is limited in the
size of matrix it can display. As an alternative, we'll use the pylab.scatter() function.
We start by creating dictionaries mapping the window sized sub-sequences to locations:
window = 7
dict_one = {}
dict_two = {}
for (seq, section_dict) in [(str(rec_one.seq).upper(), dict_one),
                            (str(rec_two.seq).upper(), dict_two)]:
    for i in range(len(seq) - window):
        section = seq[i:i + window]
        try:
            section_dict[section].append(i)
        except KeyError:
            section_dict[section] = [i]
# Now find any sub-sequences found in both sequences
# (Python 2.3 would require slightly different code here)
matches = set(dict_one).intersection(dict_two)
print("%i unique matches" % len(matches))
In order to use pylab.scatter() we need separate lists for the x and y coordinates:
# Create lists of x and y coordinates for scatter plot
x = []
y = []
for section in matches:
    for i in dict_one[section]:
        for j in dict_two[section]:
            x.append(i)
            y.append(j)
We are now ready to draw the revised dot plot as a scatter plot:
import pylab
pylab.cla()  # clear any prior graph
pylab.gray()
pylab.scatter(x, y)
pylab.xlim(0, len(rec_one) - window)
pylab.ylim(0, len(rec_two) - window)
pylab.xlabel("%s (length %i bp)" % (rec_one.id, len(rec_one)))
pylab.ylabel("%s (length %i bp)" % (rec_two.id, len(rec_two)))
pylab.title("Dot plot using window size %i\n(allowing no mis-matches)" % window)
pylab.show()
That should pop up a new window containing a graph like this:
[Figure: dot plot of the first two orchid sequences drawn with pylab.scatter()]
Personally I find this second plot much easier to read! Again note that we have not checked for reverse complement matches here. You could extend this example to do
this, and perhaps plot the forward matches in one color and the reverse matches in another.
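One possible way to make that extension is sketched below on plain strings. The helper names and the two toy sequences are made up for this illustration, and only unambiguous upper case DNA is handled. Unlike the code above, this sketch also includes the final window of each sequence.

```python
def kmer_positions(seq, window):
    """Map each window sized sub-sequence to the list of positions where it starts."""
    positions = {}
    for i in range(len(seq) - window + 1):
        positions.setdefault(seq[i:i + window], []).append(i)
    return positions


def reverse_complement(seq):
    """Reverse complement of an unambiguous upper case DNA string."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def dotplot_points(seq_one, seq_two, window):
    """Return (forward, reverse) lists of (x, y) match coordinates."""
    dict_one = kmer_positions(seq_one, window)
    dict_two = kmer_positions(seq_two, window)
    forward, reverse = [], []
    for section, starts in dict_one.items():
        rc_section = reverse_complement(section)
        for i in starts:
            forward.extend((i, j) for j in dict_two.get(section, []))
            reverse.extend((i, j) for j in dict_two.get(rc_section, []))
    return forward, reverse


forward, reverse = dotplot_points("AACCGGTTA", "TCCGGTTG", window=4)
print(sorted(forward))
print(sorted(reverse))
```

Each list can then be unpacked into x and y coordinates and passed to a separate pylab.scatter() call with its own color. Note that a sub-sequence which is its own reverse complement will appear in both lists.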
20.2.4 Plotting the quality scores of sequencing read data
If you are working with second generation sequencing data, you may want to try plotting the quality data. Here is an example using two FASTQ files containing paired
end reads, SRR001666_1.fastq for the forward reads, and SRR001666_2.fastq for the reverse reads. These were downloaded from the ENA sequence read archive FTP site
(ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001666/SRR001666_1.fastq.gz and ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001666/SRR001666_2.fastq.gz),
and are from E. coli; see http://www.ebi.ac.uk/ena/data/view/SRR001666 for details.
In the following code the pylab.subplot(...) function is used in order to show the forward and reverse qualities on two subplots, side by side. There is also a little bit of
code to only plot the first fifty reads.
import pylab
from Bio import SeqIO
for subfigure in [1, 2]:
    filename = "SRR001666_%i.fastq" % subfigure
    pylab.subplot(1, 2, subfigure)
    for i, record in enumerate(SeqIO.parse(filename, "fastq")):
        if i >= 50:
            break  # trick!
        pylab.plot(record.letter_annotations["phred_quality"])
    pylab.ylim(0, 45)
    pylab.ylabel("PHRED quality score")
    pylab.xlabel("Position")
pylab.savefig("SRR001666.png")
print("Done")
You should note that we are using the Bio.SeqIO format name fastq here because the NCBI has saved these reads using the standard Sanger FASTQ format with PHRED
scores. However, as you might guess from the read lengths, this data was from an Illumina Genome Analyzer and was probably originally in one of the two
Solexa/Illumina FASTQ variant file formats instead.
This example uses the pylab.savefig(...) function instead of pylab.show(...), but as mentioned before both are useful. Here is the result:
[Figure: PHRED quality score against position for the first fifty forward and reverse reads]
20.3 Dealing with alignments
This section can be seen as a follow on to Chapter 6.
20.3.1 Calculating summary information
Once you have an alignment, you are very likely going to want to find out information about it. Instead of trying to have all of the functions that can generate information
about an alignment in the alignment object itself, we've tried to separate out the functionality into separate classes, which act on the alignment.
Getting ready to calculate summary information about an object is quick to do. Let's say we've got an alignment object called alignment, for example read in using
Bio.AlignIO.read(...) as described in Chapter 6. All we need to do to get an object that will calculate summary information is:
from Bio.Align import AlignInfo
summary_align = AlignInfo.SummaryInfo(alignment)
The summary_align object is very useful, and will do the following neat things for you:
1. Calculate a quick consensus sequence (see section 20.3.2).
2. Get a position specific score matrix for the alignment (see section 20.3.3).
3. Calculate the information content for the alignment (see section 20.3.4).
4. Generate information on substitutions in the alignment (section 20.4 details using this to generate a substitution matrix).
20.3.2 Calculating a quick consensus sequence
The SummaryInfo object, described in section 20.3.1, provides functionality to calculate a quick consensus of an alignment. Assuming we've got a SummaryInfo object
called summary_align we can calculate a consensus by doing:
consensus = summary_align.dumb_consensus()
As the name suggests, this is a really simple consensus calculator, and will just add up all of the residues at each point in the consensus, and if the most common value is
higher than some threshold value will add the common residue to the consensus. If it doesn't reach the threshold, it adds an ambiguity character to the consensus. The
returned consensus object is a Seq object whose alphabet is inferred from the alphabets of the sequences making up the consensus. So printing the consensus would give:
consensus Seq('TATACATNAAAGNAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAAT
...', IUPACAmbiguousDNA())
You can adjust how dumb_consensus works by passing optional parameters:
the threshold
This is the threshold specifying how common a particular residue has to be at a position before it is added. The default is 0.7 (meaning 70%).
the ambiguous character
This is the ambiguity character to use. The default is N.
the consensus alphabet
This is the alphabet to use for the consensus sequence. If an alphabet is not specified then we will try to guess the alphabet based on the alphabets of the sequences in
the alignment.
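The threshold rule described above can be sketched in a few lines of plain Python. This is only an illustration of the general idea on a list of equal length strings, not the actual dumb_consensus code: the function name is made up, and ignoring gap characters when counting is an assumption of this sketch.

```python
from collections import Counter


def simple_consensus(sequences, threshold=0.7, ambiguous="N"):
    """Majority rule consensus of equal length aligned strings.

    Gap characters ("-" and ".") are ignored when counting; this is an
    assumption of the sketch rather than documented library behaviour.
    """
    consensus = []
    for column in zip(*sequences):
        counts = Counter(letter for letter in column if letter not in "-.")
        if not counts:
            consensus.append(ambiguous)  # column of gaps only
            continue
        letter, count = counts.most_common(1)[0]
        total = sum(counts.values())
        if count / float(total) >= threshold:
            consensus.append(letter)
        else:
            consensus.append(ambiguous)
    return "".join(consensus)


result = simple_consensus(["GTATC", "AT--C", "CTGTC"])
print(result)
```

Lowering the threshold makes the consensus less ambiguous, at the cost of reporting residues that are supported by fewer sequences.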
20.3.3 Position Specific Score Matrices
Position specific score matrices (PSSMs) summarize the alignment information in a different way than a consensus, and may be useful for different tasks. Basically, a
PSSM is a count matrix. For each column in the alignment, the number of occurrences of each alphabet letter is counted and totaled. The totals are displayed relative to some
representative sequence along the left axis. This sequence may be the consensus sequence, but can also be any sequence in the alignment. For instance for the alignment,
GTATC
AT--C
CTGTC
the PSSM is:
    G A T C
G   1 1 0 1
T   0 0 3 0
A   1 1 0 0
T   0 0 2 0
C   0 0 0 3
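The small table above can be reproduced with a few lines of plain Python, which may help make clear that each row is just a tally of one alignment column. The function name is made up for this sketch; it is not the Biopython PSSM class.

```python
def count_matrix(sequences, axis_seq, letters="GATC"):
    """Per-column letter counts, one (axis letter, counts) row per alignment column."""
    rows = []
    for axis_letter, column in zip(axis_seq, zip(*sequences)):
        counts = {letter: column.count(letter) for letter in letters}
        rows.append((axis_letter, counts))
    return rows


alignment = ["GTATC", "AT--C", "CTGTC"]
pssm = count_matrix(alignment, axis_seq="GTATC")
print("  " + " ".join("GATC"))
for axis_letter, counts in pssm:
    print(axis_letter + " " + " ".join(str(counts[letter]) for letter in "GATC"))
```

Gap characters are simply not counted, which is why the rows for the third and fourth columns sum to two rather than three.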
Let's assume we've got an alignment object called c_align. To get a PSSM with the consensus sequence along the side we first get a summary object and calculate the
consensus sequence:
summary_align = AlignInfo.SummaryInfo(c_align)
consensus = summary_align.dumb_consensus()
Now, we want to make the PSSM, but ignore any N ambiguity residues when calculating this:
my_pssm = summary_align.pos_specific_score_matrix(consensus,
                                                  chars_to_ignore=['N'])
Two notes should be made about this:
1. To maintain strictness with the alphabets, you can only include characters along the top of the PSSM that are in the alphabet of the alignment object. Gaps are not
included along the top axis of the PSSM.
2. The sequence passed to be displayed along the left side of the axis does not need to be the consensus. For instance, if you wanted to display the second sequence in
the alignment along this axis, you would need to do:
second_seq = alignment.get_seq_by_num(1)
my_pssm = summary_align.pos_specific_score_matrix(second_seq,
                                                  chars_to_ignore=['N'])
The command above returns a PSSM object. To print out the PSSM as shown above, we simply need to do a print(my_pssm), which gives:
    A   C   G   T
T  0.0 0.0 0.0 7.0
A  7.0 0.0 0.0 0.0
T  0.0 0.0 0.0 7.0
A  7.0 0.0 0.0 0.0
C  0.0 7.0 0.0 0.0
A  7.0 0.0 0.0 0.0
T  0.0 0.0 0.0 7.0
T  1.0 0.0 0.0 6.0
...
You can access any element of the PSSM by subscripting like your_pssm[sequence_number][residue_count_name]. For instance, to get the counts for the A residue in the
second element of the above PSSM you would do:
>>> print(my_pssm[1]["A"])
7.0
The structure of the PSSM class hopefully makes it easy both to access elements and to pretty print the matrix.
20.3.4 Information Content
A potentially useful measure of evolutionary conservation is the information content of a sequence.
A useful introduction to information theory targeted towards molecular biologists can be found at http://www.lecb.ncifcrf.gov/~toms/paper/primer/. For our purposes,
we will be looking at the information content of a consensus sequence, or a portion of a consensus sequence. We calculate information content at a particular column in a
multiple sequence alignment using the following formula:
$$IC_j = \sum_{i=1}^{N_a} P_{ij} \log\left(\frac{P_{ij}}{Q_i}\right)$$
where:
$IC_j$: The information content for the $j$-th column in an alignment.
$N_a$: The number of letters in the alphabet.
$P_{ij}$: The frequency of a particular letter $i$ in the $j$-th column (i.e. if G occurred 3 out of 6 times in an alignment column, this would be 0.5).
$Q_i$: The expected frequency of a letter $i$. This is an optional argument, usage of which is left at the user's discretion. By default, it is automatically assigned to 0.05
= 1/20 for a protein alphabet, and 0.25 = 1/4 for a nucleic acid alphabet. This is for getting the information content without any assumption of prior distributions.
When assuming priors, or when using a non-standard alphabet, you should supply the values for $Q_i$.
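As a worked illustration of the formula, the sketch below evaluates it for a single alignment column given as a plain string. The function name is made up, the uniform nucleotide default for the expected frequencies follows the description above, and letters that do not occur in the column are simply skipped (they contribute nothing to the sum).

```python
import math
from collections import Counter


def column_information_content(column, expected=None, log_base=2):
    """Sum of P_ij * log(P_ij / Q_i) over the letters seen in one column."""
    if expected is None:
        # Uniform expectation for a nucleic acid alphabet
        expected = {letter: 0.25 for letter in "ACGT"}
    counts = Counter(column)
    total = float(len(column))
    info = 0.0
    for letter, count in counts.items():
        p = count / total
        info += p * math.log(p / expected[letter], log_base)
    return info


conserved = column_information_content("AAAA")  # fully conserved column
uniform = column_information_content("ACGT")    # every letter equally common
print(conserved, uniform)
```

With base 2 logarithms a fully conserved nucleotide column carries two bits of information, while a column matching the expected frequencies carries none.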
Well, now that we have an idea what information content is being calculated in Biopython, let's look at how to get it for a particular region of the alignment.
First, we need to use our alignment to get an alignment summary object, which we'll assume is called summary_align (see section 20.3.1 for instructions on how to get
this). Once we've got this object, calculating the information content for a region is as easy as:
info_content = summary_align.information_content(5, 30,
                                                 chars_to_ignore=['N'])
Wow, that was much easier than the formula above made it look! The variable info_content now contains a float value specifying the information content over the
specified region (from 5 to 30 of the alignment). We specifically ignore the ambiguity residue N when calculating the information content, since this value is not
included in our alphabet (so we shouldn't be interested in looking at it!).
As mentioned above, we can also calculate relative information content by supplying the expected frequencies:
expect_freq = {
    'A': .3,
    'G': .2,
    'T': .3,
    'C': .2}
The expected frequencies should not be passed as a raw dictionary, but instead be passed as a SubsMat.FreqTable object (see section 22.2.2 for more information about FreqTables).
The FreqTable object provides a standard for associating the dictionary with an Alphabet, similar to how the Biopython Seq class works.
To create a FreqTable object from the frequency dictionary you just need to do:
from Bio.Alphabet import IUPAC
from Bio.SubsMat import FreqTable
e_freq_table = FreqTable.FreqTable(expect_freq, FreqTable.FREQ,
                                   IUPAC.unambiguous_dna)
Now that we've got that, calculating the relative information content for our region of the alignment is as simple as:
info_content = summary_align.information_content(5, 30,
                                                 e_freq_table=e_freq_table,
                                                 chars_to_ignore=['N'])
Now, info_content will contain the relative information content over the region in relation to the expected frequencies.
The value returned is calculated using base 2 as the logarithm base in the formula above. You can modify this by passing the parameter log_base as the base you want:
info_content = summary_align.information_content(5, 30, log_base=10,
                                                 chars_to_ignore=['N'])
By default nucleotide or amino acid residues with a frequency of 0 in a column are not taken into account when the relative information content for that column is
computed. If this is not the desired result, you can use pseudo_count instead.
info_content = summary_align.information_content(5, 30,
                                                 chars_to_ignore=['N'],
                                                 pseudo_count=1)
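In this case, the observed frequency $P_{ij}$ of a particular letter $i$ in the $j$-th column is computed as follows:
$$P_{ij} = \frac{n_{ij} + k \, Q_i}{N_j + k}$$
where:
$n_{ij}$: the number of times letter $i$ occurs in the $j$-th column.
$N_j$: the total number of letters counted in the $j$-th column.
$k$: the pseudo count you pass as argument.
$Q_i$: The expected frequency of the letter $i$ as described above.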
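A small numerical sketch may make the effect of the pseudo count clearer. It assumes that the numerator counts occurrences of the letter within the column and that the denominator uses the total number of letters in the column; the function name and the toy column are made up for this illustration.

```python
def observed_frequencies(column, expected, pseudo_count=0):
    """Apply (count + k * Q_i) / (total + k) to every letter with an expected frequency."""
    total = len(column)
    return {
        letter: (column.count(letter) + pseudo_count * q) / float(total + pseudo_count)
        for letter, q in expected.items()
    }


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
freqs = observed_frequencies("AAAA", uniform, pseudo_count=1)
print(freqs)
```

Letters that never occur in the column now get a small non-zero frequency, so they contribute to the information content instead of being skipped, and the frequencies still sum to one.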
Well, now you are ready to calculate information content. If you want to try applying this to some real life problems, it would probably be best to dig into the literature on
information content to get an idea of how it is used. Hopefully your digging won't reveal any mistakes made in coding this function!
20.4 Substitution Matrices
Substitution matrices are an extremely important part of everyday bioinformatics work. They provide the scoring terms for classifying how likely two different residues
are to substitute for each other. This is essential in doing sequence comparisons. The book "Biological Sequence Analysis" by Durbin et al. provides a really nice
introduction to Substitution Matrices and their uses. Some famous substitution matrices are the PAM and BLOSUM series of matrices.
Biopython provides a ton of common substitution matrices, and also provides functionality for creating your own substitution matrices.
20.4.1 Using common substitution matrices
20.4.2 Creating your own substitution matrix from an alignment
A very cool thing that you can do easily with the substitution matrix classes is to create your own substitution matrix from an alignment. In practice, this is normally done
with protein alignments. In this example, we'll first get a Biopython alignment object and then get a summary object to calculate info about the alignment. The file
protein.aln (also available online) contains the Clustalw alignment output.
>>> from Bio import AlignIO
>>> from Bio import Alphabet
>>> from Bio.Alphabet import IUPAC
>>> from Bio.Align import AlignInfo
>>> filename = "protein.aln"
>>> alpha = Alphabet.Gapped(IUPAC.protein)
>>> c_align = AlignIO.read(filename, "clustal", alphabet=alpha)
>>> summary_align = AlignInfo.SummaryInfo(c_align)
Sections 6.4.1 and 20.3.1 contain more information on doing this.
Now that we've got our summary_align object, we want to use it to find out the number of times different residues substitute for each other. To make the example more
readable, we'll focus on only amino acids with polar charged side chains. Luckily, this can be done easily when generating a replacement dictionary, by passing in all of
the characters that should be ignored. Thus we'll create a dictionary of replacements for only charged polar amino acids using:
>>> replace_info = summary_align.replacement_dictionary(["G", "A", "V", "L", "I",
...                                                      "M", "P", "F", "W", "S",
...                                                      "T", "N", "Q", "Y", "C"])
This information about amino acid replacements is represented as a Python dictionary which will look something like this (the order can vary):
{('R', 'R'): 2079.0, ('R', 'H'): 17.0, ('R', 'K'): 103.0, ('R', 'E'): 2.0,
 ('R', 'D'): 2.0, ('H', 'R'): 0, ('D', 'H'): 15.0, ('K', 'K'): 3218.0,
 ('K', 'H'): 24.0, ('H', 'K'): 8.0, ('E', 'H'): 15.0, ('H', 'H'): 1235.0,
 ('H', 'E'): 18.0, ('H', 'D'): 0, ('K', 'D'): 0, ('K', 'E'): 9.0,
 ('D', 'R'): 48.0, ('E', 'R'): 2.0, ('D', 'K'): 1.0, ('E', 'K'): 45.0,
 ('K', 'R'): 130.0, ('E', 'D'): 241.0, ('E', 'E'): 3305.0,
 ('D', 'E'): 270.0, ('D', 'D'): 2360.0}
This information gives us our accepted number of replacements, or how often we expect different things to substitute for each other. It turns out, amazingly enough, that
this is all of the information we need to go ahead and create a substitution matrix. First, we use the replacement dictionary information to create an Accepted Replacement
Matrix (ARM):
>>> from Bio import SubsMat
>>> my_arm = SubsMat.SeqMat(replace_info)
With this accepted replacement matrix, we can go right ahead and create our log odds matrix (i.e. a standard type Substitution Matrix):
>>> my_lom = SubsMat.make_log_odds_matrix(my_arm)
The log odds matrix you create is customizable with the following optional arguments:
exp_freq_table: You can pass a table of expected frequencies for each alphabet. If supplied, this will be used instead of the passed accepted replacement matrix
when calculating expected replacements.
logbase: The base of the logarithm taken to create the log odd matrix. Defaults to base 10.
factor: The factor to multiply each matrix entry by. This defaults to 10, which normally makes the matrix numbers easy to work with.
round_digit: The digit to round to in the matrix. This defaults to 0 (i.e. no digits).
Once you've got your log odds matrix, you can display it prettily using the function print_mat. Doing this on our created matrix gives:
>>> my_lom.print_mat()
D   2
E  -1   1
H  -5  -4   3
K -10  -5  -4   1
R  -4  -8  -4  -2   2
   D   E   H   K   R
Very nice. Now we've got our very own substitution matrix to play with!
20.5 BioSQL: storing sequences in a relational database
BioSQL is a joint effort between the OBF projects (BioPerl, BioJava etc) to support a shared database schema for storing sequence data. In theory, you could load a
GenBank file into the database with BioPerl, then using Biopython extract this from the database as a record object with features, and get more or less the same thing as
if you had loaded the GenBank file directly as a SeqRecord using Bio.SeqIO (Chapter 5).
Biopython's BioSQL module is currently documented at http://biopython.org/wiki/BioSQL which is part of our wiki pages.
Chapter 21 The Biopython testing framework
Biopython has a regression testing framework (the file run_tests.py) based on unittest, the standard unit testing framework for Python. Providing comprehensive tests for
modules is one of the most important aspects of making sure that the Biopython code is as bug-free as possible before going out. It also tends to be one of the most
undervalued aspects of contributing. This chapter is designed to make running the Biopython tests and writing good test code as easy as possible. Ideally, every module
that goes into Biopython should have a test (and should also have documentation!). All our developers, and anyone installing Biopython from source, are strongly
encouraged to run the unit tests.
21.1 Running the tests
When you download the Biopython source code, or check it out from our source code repository, you should find a subdirectory called Tests. This contains the key script
run_tests.py, lots of individual scripts named test_XXX.py, a subdirectory called output and lots of other subdirectories which contain input files for the test suite.
As part of building and installing Biopython you will typically run the full test suite at the command line from the Biopython source top level directory using the
following:
python setup.py test
This is actually equivalent to going to the Tests subdirectory and running:
python run_tests.py
You'll often want to run just some of the tests, and this is done like this:
python run_tests.py test_SeqIO.py test_AlignIO.py
When giving the list of tests, the .py extension is optional, so you can also just type:
python run_tests.py test_SeqIO test_AlignIO
To run the docstring tests (see section 21.3), you can use
python run_tests.py doctest
You can also skip any tests which have been set up with an explicit online component by adding --offline, e.g.
python run_tests.py --offline
By default, run_tests.py runs all tests, including the docstring tests.
If an individual test is failing, you can also try running it directly, which may give you more information.
Importantly, note that the individual unit test files come in two types:
Older simple print-and-compare scripts. These unit tests are essentially short example Python programs, which print out various output text. For a test file named
test_XXX.py there will be a matching text file called test_XXX under the output subdirectory which contains the expected output. All that the test framework does
is run the script, and check the output agrees.
Standard unittest-based tests. These will import unittest and then define unittest.TestCase classes, each with one or more sub-tests as methods starting with
test_ which check some specific aspect of the code. These tests should not print any output directly.
Currently, about half of the Biopython tests are unittest-style tests, and half are print-and-compare tests.
Running a simple print-and-compare test directly will usually give lots of output on screen, but does not check the output matches the expected output. If the test is failing
with an exception error, it should be very easy to locate where exactly the script is failing. For an example of a print-and-compare test, try:
python test_SeqIO.py
The unittest-based tests instead show you exactly which sub-section(s) of the test are failing. For example,
python test_Cluster.py
21.1.1 Running the tests using Tox
As with most Python projects, you can also use Tox to run the tests on multiple Python versions, provided they are already installed on your system.
We do not provide the configuration tox.ini file in our code base because of difficulties pinning down user-specific settings (e.g. executable names of the Python
versions). You may also be interested in testing Biopython against only a subset of the Python versions that we support.
If you are interested in using Tox, you may start with the example tox.ini shown below:
[tox]
envlist = py26, py27, pypy, py33, py34, jython
[testenv]
changedir = Tests
commands = {envpython} run_tests.py --offline
deps =
    numpy
Using the template above, executing tox will test your Biopython code against Python 2.6, Python 2.7, PyPy, Python 3.3, Python 3.4, and Jython. It assumes that those
Python executables are named accordingly: python2.6 for Python 2.6, and so on.
21.2 Writing tests
Let's say you want to write some tests for a module called Biospam. This can be a module you wrote, or an existing module that doesn't have any tests yet. In the examples
below, we assume that Biospam is a module that does simple math.
Each Biopython test can have three important files and directories involved with it:
1. test_Biospam.py: The actual test code for your module.
2. Biospam [optional]: A directory where any necessary input files will be located. If you have any output files that should be manually reviewed, output them here
(but this is discouraged) to prevent clogging up the main Tests directory. In general, use a temporary file/folder.
3. output/Biospam [for print-and-compare tests only]: This file contains the expected output from running test_Biospam.py. This file is not needed for unittest-style
tests, since there the validation is done in the test script test_Biospam.py itself.
It's up to you to decide whether you want to write a print-and-compare test script or a unittest-style test script. The important thing is that you cannot mix these two
styles in a single test script. Particularly, don't use unittest features in a print-and-compare test.
Any script with a test_ prefix in the Tests directory will be found and run by run_tests.py. Below, we show an example test script test_Biospam.py both for a print-and-
compare test and for a unittest-based test. If you put this script in the Biopython Tests directory, then run_tests.py will find it and execute the tests contained in it:
    $ python run_tests.py
    test_Ace ... ok
    test_AlignIO ... ok
    test_BioSQL ... ok
    test_BioSQL_SeqIO ... ok
    test_Biospam ... ok
    test_CAPS ... ok
    test_Clustalw ... ok
    ...
    ----------------------------------------------------------------------
    Ran 107 tests in 86.127 seconds
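The discovery rule described above (any script in the Tests directory whose name starts with test_) can be sketched in a few lines. This is only an illustration of the naming convention, not the actual code in run_tests.py; the function name find_test_scripts is hypothetical.

```python
import glob
import os


def find_test_scripts(tests_dir):
    """Return the names of test scripts (test_*.py) in tests_dir, sorted."""
    pattern = os.path.join(tests_dir, "test_*.py")
    # Strip the directory and the .py extension, e.g. "test_Biospam"
    names = [os.path.splitext(os.path.basename(path))[0]
             for path in glob.glob(pattern)]
    return sorted(names)
```

Calling find_test_scripts("Tests") from the top of a source checkout would then list names such as test_Ace and test_Biospam, in the same alphabetical order as the output shown above.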
21.2.1 Writing a print-and-compare test
A print-and-compare style test should be much simpler for beginners or novices to write: essentially it is just an example script using your new module.
Here is what you should do to make a print-and-compare test for the Biospam module.
1. Write a script called test_Biospam.py
     This script should live in the Tests directory.
     The script should test all of the important functionality of the module (the more you test the better your test is, of course!).
     Try to avoid anything which might be platform specific, such as printing floating point numbers without using an explicit formatting string to avoid having too many decimal places (different platforms can give very slightly different values).
2. If the script requires files to do the testing, these should go in the directory Tests/Biospam (if you just need something generic, like a FASTA sequence file, or a GenBank record, try and use an existing sample input file instead).
3. Write out the test output and verify the output to be correct.
   There are two ways to do this:
   a. The long way:
        Run the script and write its output to a file. On UNIX (including Linux and Mac OS X) machines, you would do something like: python test_Biospam.py > test_Biospam which would write the output to the file test_Biospam.
        Manually look at the file test_Biospam to make sure the output is correct. When you are sure it is all right and there are no bugs, you need to quickly edit the test_Biospam file so that the first line is: test_Biospam (no quotes).
        Copy the test_Biospam file to the directory Tests/output.
   b. The quick way:
        Run python run_tests.py -g test_Biospam.py. The regression testing framework is nifty enough that it'll put the output in the right place in just the way it likes it.
        Go to the output (which should be in Tests/output/test_Biospam) and double check the output to make sure it is all correct.
4. Now change to the Tests directory and run the regression tests with python run_tests.py. This will run all of the tests, and you should see your test run (and pass!).
5. That's it! Now you've got a nice test for your module ready to check in, or submit to Biopython. Congratulations!
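To make the idea behind the steps above concrete, here is a minimal sketch of a print-and-compare check: run a script, capture what it prints, and compare that with a stored expected-output file. It is not how run_tests.py is implemented; the function name output_matches is hypothetical, it assumes a modern Python 3 (subprocess.run), and it assumes the expected file starts with a line holding the test name, as described in step 3.

```python
import subprocess
import sys


def output_matches(script_path, expected_path):
    """Run script_path and compare its standard output with expected_path.

    Assumption: the first line of the expected file is the test name,
    so it is skipped before comparing.
    """
    result = subprocess.run([sys.executable, script_path],
                            stdout=subprocess.PIPE, universal_newlines=True)
    with open(expected_path) as handle:
        expected_lines = handle.read().splitlines()[1:]
    return result.stdout.splitlines() == expected_lines
```

Because the comparison is line by line on the printed text, anything platform dependent in the output (such as unformatted floating point numbers) would make this kind of check fail, which is why the steps above recommend explicit format strings.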
As an example, the test_Biospam.py test script to test the addition and multiplication functions in the Biospam module could look as follows:
    from __future__ import print_function
    from Bio import Biospam

    print("2 + 3 =", Biospam.addition(2, 3))
    print("9 - 1 =", Biospam.addition(9, -1))
    print("2 * 3 =", Biospam.multiplication(2, 3))
    print("9 * (- 1) =", Biospam.multiplication(9, -1))
We generate the corresponding output with python run_tests.py -g test_Biospam.py, and check the output file output/test_Biospam:
    test_Biospam
    2 + 3 = 5
    9 - 1 = 8
    2 * 3 = 6
    9 * (- 1) = -9
Often, the difficulty with larger print-and-compare tests is to keep track of which line in the output corresponds to which command in the test script. For this purpose, it is important to print out some markers to help you match lines in the input script with the generated output.
21.2.2 Writing a unittest-based test
We want all the modules in Biopython to have unit tests, and a simple print-and-compare test is better than no test at all. However, although there is a steeper learning curve, using the unittest framework gives a more structured result, and if there is a test failure this can clearly pinpoint which part of the test is going wrong. The sub-tests can also be run individually, which is helpful for testing or debugging.
The unittest framework has been included with Python since version 2.1, and is documented in the Python Library Reference (which I know you are keeping under your pillow, as recommended). There is also online documentation for unittest. If you are familiar with the unittest system (or something similar like the nose test framework), you shouldn't have any trouble. You may find looking at the existing examples within Biopython helpful too.
Here's a minimal unittest-style test script for Biospam, which you can copy and paste to get started:
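    import unittest
    from Bio import Biospam

    class BiospamTestAddition(unittest.TestCase):
        def test_addition1(self):
            result = Biospam.addition(2, 3)
            self.assertEqual(result, 5)

        def test_addition2(self):
            result = Biospam.addition(9, -1)
            self.assertEqual(result, 8)

    class BiospamTestDivision(unittest.TestCase):
        def test_division1(self):
            result = Biospam.division(3.0, 2.0)
            self.assertAlmostEqual(result, 1.5)

        def test_division2(self):
            result = Biospam.division(10.0, -2.0)
            self.assertAlmostEqual(result, -5.0)

    if __name__ == "__main__":
        runner = unittest.TextTestRunner(verbosity=2)
        unittest.main(testRunner=runner)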
In the division tests, we use assertAlmostEqual instead of assertEqual to avoid tests failing due to roundoff errors; see the unittest chapter in the Python documentation for details and for other functionality available in unittest (online reference).
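The roundoff issue mentioned above is easy to reproduce without Biospam. The following self-contained sketch (the class and function names are made up for illustration) shows a sum that is not exactly equal to the value you would expect, and that assertAlmostEqual accepts it anyway.

```python
import unittest


class RoundOffExample(unittest.TestCase):
    """Illustrates why exact comparison of floats is fragile."""

    def test_exact_comparison_differs(self):
        # 0.1 + 0.2 is not exactly 0.3 in binary floating point
        self.assertNotEqual(0.1 + 0.2, 0.3)

    def test_almost_equal_passes(self):
        # assertAlmostEqual compares after rounding the difference
        # to 7 decimal places by default
        self.assertAlmostEqual(0.1 + 0.2, 0.3)


def run_example():
    """Run the example test case and return the unittest result object."""
    suite = unittest.TestLoader().loadTestsFromTestCase(RoundOffExample)
    return unittest.TextTestRunner(verbosity=2).run(suite)
```

If a looser or tighter tolerance is needed, assertAlmostEqual also accepts a places or delta argument.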
These are the key points of unittest-based tests:
  Test cases are stored in classes that derive from unittest.TestCase and cover one basic aspect of your code.
  You can use methods setUp and tearDown for any repeated code which should be run before and after each test method. For example, the setUp method might be used to create an instance of the object you are testing, or open a file handle. The tearDown should do any tidying up, for example closing the file handle.
  The tests are prefixed with test_ and each test should cover one specific part of what you are trying to test. You can have as many tests as you want in a class.
  At the end of the test script, you can use
      if __name__ == "__main__":
          runner = unittest.TextTestRunner(verbosity=2)
          unittest.main(testRunner=runner)
  to execute the tests when the script is run by itself (rather than imported from run_tests.py). If you run this script, then you'll see something like the following:
    $ python test_BiospamMyModule.py
    test_addition1 (__main__.TestAddition) ... ok
    test_addition2 (__main__.TestAddition) ... ok
    test_division1 (__main__.TestDivision) ... ok
    test_division2 (__main__.TestDivision) ... ok

    ----------------------------------------------------------------------
    Ran 4 tests in 0.059s

    OK
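The setUp and tearDown methods listed in the key points above can be sketched as follows. This example is independent of Biospam: the test class HandleExample and the helper run_handle_example are hypothetical names, and the temporary file stands in for whatever resource your own tests need.

```python
import os
import tempfile
import unittest


class HandleExample(unittest.TestCase):
    def setUp(self):
        # Runs before every test method: create a small file and open a handle
        fd, self.path = tempfile.mkstemp(suffix=".txt")
        with os.fdopen(fd, "w") as out:
            out.write("line one\nline two\n")
        self.handle = open(self.path)

    def tearDown(self):
        # Runs after every test method: close the handle and remove the file
        self.handle.close()
        os.remove(self.path)

    def test_first_line(self):
        self.assertEqual(self.handle.readline(), "line one\n")

    def test_all_lines(self):
        self.assertEqual(len(self.handle.readlines()), 2)


def run_handle_example():
    """Run the example test case and return the unittest result object."""
    suite = unittest.TestLoader().loadTestsFromTestCase(HandleExample)
    return unittest.TextTestRunner(verbosity=2).run(suite)
```

Each test method gets a freshly opened handle, so the two tests do not interfere with each other even though both read from the same file.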
To indicate more clearly what each test is doing, you can add docstrings to each test. These are shown when running the tests, which can be useful information if a test is failing.
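    # Test bodies are the same as in the previous script and are omitted here.
    import unittest

    class BiospamTestAddition(unittest.TestCase):
        def test_addition1(self):
            """An addition test"""
        def test_addition2(self):
            """A second addition test"""

    class BiospamTestDivision(unittest.TestCase):
        def test_division1(self):
            """Now let's check division"""
        def test_division2(self):
            """A second division test"""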
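Running the script will now show you:
    $ python test_BiospamMyModule.py
    An addition test ... ok
    A second addition test ... ok
    Now let's check division ... ok
    A second division test ... ok

    ----------------------------------------------------------------------
    Ran 4 tests in 0.001s

    OK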
If your module contains docstring tests (see section 21.3), you may want to include those in the tests to be run. You can do so as follows by modifying the code under if __name__ == "__main__": to look like this:
    import doctest
    import sys

    if __name__ == "__main__":
        unittest_suite = unittest.TestLoader().loadTestsFromName("test_Biospam")
        doctest_suite = doctest.DocTestSuite(Biospam)
        suite = unittest.TestSuite((unittest_suite, doctest_suite))
        runner = unittest.TextTestRunner(sys.stdout, verbosity=2)
        runner.run(suite)
This is only relevant if you want to run the docstring tests when you execute python test_Biospam.py directly, for example if it has some complex run-time dependency checking.
In general, instead include the docstring tests by adding them to run_tests.py as explained below.
21.3 Writing doctests
Python modules, classes and functions support built-in documentation using docstrings. The doctest framework (included with Python) allows the developer to embed working examples in the docstrings, and have these examples automatically tested.
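As a minimal illustration of the idea, the sketch below embeds two examples in a docstring and runs them with the standard doctest module. The function count_gc and the helper run_doctests are hypothetical and are not part of Biopython.

```python
import doctest


def count_gc(sequence):
    """Count the G and C letters in a sequence string.

    >>> count_gc("ACGT")
    2
    >>> count_gc("")
    0
    """
    return sum(1 for letter in sequence.upper() if letter in "GC")


def run_doctests(func):
    """Run the docstring examples of func; return (failed, attempted) totals."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    tests = finder.find(func, name=func.__name__,
                        globs={func.__name__: func})
    for test in tests:
        runner.run(test)
    return runner.summarize(verbose=False)
```

The lines starting with >>> are executed, and the text that follows each one is compared with what the interpreter would print, so the examples double as documentation and as tests.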
Currently only a small part of Biopython includes doctests. The run_tests.py script takes care of running the doctests. For this purpose, at the top of the run_tests.py script is a manually compiled list of modules to test, which allows us to skip modules with optional external dependencies which may not be installed (e.g. the Reportlab and NumPy libraries). So, if you've added some doctests to the docstrings in a Biopython module, in order to have them included in the Biopython test suite, you must update run_tests.py to include your module. Currently, the relevant part of run_tests.py looks as follows:
    # This is the list of modules containing docstring tests.
    # If you develop docstring tests for other modules, please add
    # those modules here.
    DOCTEST_MODULES = ["Bio.Seq",
                       "Bio.SeqRecord",
                       "Bio.SeqIO",
                       "...",
                      ]
    # Silently ignore any doctests for modules requiring numpy!
    try:
        import numpy
        DOCTEST_MODULES.extend(["Bio.Statistics.lowess"])
    except ImportError:
        pass
Note that we regard doctests primarily as documentation, so you should stick to typical usage. Generally complicated examples dealing with error conditions and the like would be best left to a dedicated unit test.
Note that if you want to write doctests involving file parsing, defining the file location complicates matters. Ideally use relative paths assuming the code will be run from the Tests directory, see the Bio.SeqIO doctests for an example of this.
To run the docstring tests only, use
    $ python run_tests.py doctest
Note that the doctest system is fragile and care is needed to ensure your output will match on all the different versions of Python that Biopython supports (e.g. differences in floating point numbers).
21.4 Writing doctests in the Tutorial
This Tutorial you are reading has a lot of code snippets, which are often formatted like a doctest. We have our own system in file test_Tutorial.py to allow tagging code snippets in the Tutorial source to be run as Python doctests. This works by adding special %doctest comment lines before each verbatim block, e.g.
    %doctest
    \begin{verbatim}
    >>> len("ACGT")
    4
    \end{verbatim}
Often code examples are not self-contained, but continue from the previous verbatim block. Here we use the magic comment %cont-doctest as shown here:
    %cont-doctest
    \begin{verbatim}
    >>> Seq("ACGT") == Seq("ACGT", generic_dna)
    True
    \end{verbatim}
The special %doctest comment line can take a working directory (relative to the Doc/ folder) to use if you have any example data files, e.g. %doctest examples will use the Doc/examples folder, while %doctest ../Tests/GenBank will use the Tests/GenBank folder.
After the directory argument, you can specify any Python dependencies which must be present in order to run the test by adding lib:XXX to indicate import XXX must work, e.g. %doctest examples lib:numpy
You can run the Tutorial doctests via:
    $ python test_Tutorial.py
or:
    $ python run_tests.py test_Tutorial.py
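To give a feel for how tagged snippets could be collected from the LaTeX source, here is a rough sketch. It is not the implementation used in test_Tutorial.py; the function name extract_tagged_blocks is hypothetical, and it assumes that the tag line comes directly before \begin{verbatim} and that the continuation tag is spelled %cont-doctest.

```python
def extract_tagged_blocks(latex_source):
    """Return (tag_line, code) pairs for tagged verbatim blocks."""
    blocks = []
    tag = None    # most recent %doctest / %cont-doctest comment line
    code = None   # list of lines while inside a tagged verbatim block
    for line in latex_source.splitlines():
        stripped = line.strip()
        if code is not None:
            if stripped == r"\end{verbatim}":
                blocks.append((tag, "\n".join(code)))
                tag, code = None, None
            else:
                code.append(line)
        elif (stripped.startswith("%doctest")
              or stripped.startswith("%cont-doctest")):
            tag = stripped
        elif stripped == r"\begin{verbatim}":
            if tag is not None:
                code = []
        else:
            # Assumption: a tag only applies to the very next block
            tag = None
    return blocks
```

The tag line is kept whole, so any working directory or lib:XXX arguments after the tag remain available to whatever code goes on to run the collected examples.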
Chapter 22 Advanced
22.1 Parser Design
Many of the older Biopython parsers were built around an event-oriented design that includes Scanner and Consumer objects.
Scanners take input from a data source and analyze it line by line, sending off an event whenever they recognize some information in the data. For example, if the data includes information about an organism name, the scanner may generate an organism_name event whenever it encounters a line containing the name.
Consumers are objects that receive the events generated by Scanners. Following the previous example, the consumer receives the organism_name event, and then processes it in whatever manner necessary in the current application.
This is a very flexible framework, which is advantageous if you want to be able to parse a file format into more than one representation. For example, the Bio.GenBank module uses this to construct either SeqRecord objects or file-format-specific record objects.
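The Scanner and Consumer idea can be sketched with a toy example. The classes below are hypothetical and much simpler than the real Biopython parsers; the input format (lines starting with ORGANISM) is an assumption made purely for illustration.

```python
import io


class OrganismScanner(object):
    """Hypothetical scanner: reads a handle line by line and fires events."""

    def feed(self, handle, consumer):
        for line in handle:
            # Assumed toy format: organism lines look like "ORGANISM  <name>"
            if line.startswith("ORGANISM"):
                consumer.organism_name(line[len("ORGANISM"):].strip())


class ListConsumer(object):
    """Hypothetical consumer: collects the organism names in a list."""

    def __init__(self):
        self.names = []

    def organism_name(self, name):
        self.names.append(name)


class CountingConsumer(object):
    """A second consumer: builds a different result from the same events."""

    def __init__(self):
        self.count = 0

    def organism_name(self, name):
        self.count += 1


data = "LOCUS  X\nORGANISM  Homo sapiens\nORGANISM  Mus musculus\n"
names = ListConsumer()
OrganismScanner().feed(io.StringIO(data), names)
counter = CountingConsumer()
OrganismScanner().feed(io.StringIO(data), counter)
```

The same scanner drives both consumers, which is the flexibility the text describes: one pass over the format, several possible representations.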
More recently, many of the parsers added for Bio.SeqIO and Bio.AlignIO take a much simpler approach, but only generate a single object representation (SeqRecord and MultipleSeqAlignment objects respectively). In some cases the Bio.SeqIO parsers actually wrap another Biopython parser: for example, the Bio.SwissProt parser produces SwissProt format specific record objects, which get converted into SeqRecord objects.
22.2.1 SubsMat
This module provides a class and a few routines for generating substitution matrices, similar to BLOSUM or PAM matrices, but based on user-provided data.
Additionally, you may select a matrix from MatrixInfo.py, a collection of established substitution matrices. The SeqMat class derives from a dictionary:
    class SeqMat(dict)
The dictionary is of the form {(i1,j1):n1, (i1,j2):n2, ..., (ik,jk):nk} where i, j are alphabet letters, and n is a value.
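The tuple-keyed dictionary layout can be tried out with plain Python dictionaries. The sketch below does not use SeqMat itself; the function names to_half_matrix and half_matrix_size are hypothetical. It folds the (i, j) and (j, i) entries of a full matrix into a single alphabetically ordered key, which is the half-matrix form discussed in the usage notes for this module, and computes the number of entries such a half matrix has.

```python
def to_half_matrix(full_matrix):
    """Fold a full pair dictionary into a half matrix.

    Keys are (letter, letter) tuples; values for (i, j) and (j, i) are
    summed and stored under the alphabetically ordered key.
    """
    half = {}
    for (i, j), value in full_matrix.items():
        key = (i, j) if i <= j else (j, i)
        half[key] = half.get(key, 0) + value
    return half


def half_matrix_size(n):
    """Number of entries in a half matrix for an alphabet of n letters."""
    return n * (n + 1) // 2
```

For a 20-letter protein alphabet half_matrix_size(20) gives 210, compared with 400 entries for the full matrix.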
1. Attributes
   a. self.alphabet: a class as defined in Bio.Alphabet
   b. self.ab_list: a list of the alphabet's letters, sorted. Needed mainly for internal purposes
2. Methods
   a. __init__(self, data=None, alphabet=None, mat_name='', build_later=0):
      i. data: can be either a dictionary, or another SeqMat instance.
      ii. alphabet: a Bio.Alphabet instance. If not provided, construct an alphabet from data.
      iii. mat_name: matrix name, such as "BLOSUM62" or "PAM250"
      iv. build_later: default false. If true, user may supply only alphabet and empty dictionary, if intending to build the matrix later. This skips the sanity check of alphabet size vs. matrix size.
   b. entropy(self, obs_freq_mat)
      i. obs_freq_mat: an observed frequency matrix. Returns the matrix's entropy, based on the frequency in obs_freq_mat. The matrix instance should be LO or SUBS.
   c. sum(self)
      Calculates the sum of values for each letter in the matrix's alphabet, and returns it as a dictionary of the form {i1: s1, i2: s2, ..., in: sn}, where:
      i: an alphabet letter
      s: sum of all values in a half-matrix for that letter
      n: number of letters in alphabet.
   d. print_mat(self, f, format="%4d", bottomformat="%4s", alphabet=None)
      Prints the matrix to file handle f. format is the format field for the matrix values; bottomformat is the format field for the bottom row, containing matrix letters. Example output for a 3-letter alphabet matrix:
          A  23
          B  12  34
          C   7  22  27
              A   B   C
      The alphabet optional argument is a string of all characters in the alphabet. If supplied, the order of letters along the axes is taken from the string, rather than by alphabetical order.
3. Usage
   The following section is laid out in the order by which most people wish to generate a log-odds matrix. Of course, interim matrices can be generated and investigated. Most people just want a log-odds matrix, that's all.
   a. Generating an Accepted Replacement Matrix
      Initially, you should generate an accepted replacement matrix (ARM) from your data. The values in ARM are the counted number of replacements according to your data. The data could be a set of pairs or multiple alignments. So for instance if Alanine was replaced by Cysteine 10 times, and Cysteine by Alanine 12 times, the corresponding ARM entries would be:
          ('A','C'): 10, ('C','A'): 12
      As order doesn't matter, the user can instead provide only one entry:
          ('A','C'): 22
      A SeqMat instance may be initialized with either a full (first method of counting: 10, 12) or half (the latter method, 22) matrix. A full protein alphabet matrix would be of the size 20x20 = 400. A half matrix of that alphabet would be 20x20/2 + 20/2 = 210. That is because same-letter entries don't change (the matrix diagonal). Given an alphabet size of N:
      i. Full matrix size: N*N
      ii. Half matrix size: N(N+1)/2
      The SeqMat constructor automatically generates a half-matrix, if a full matrix is passed. If a half matrix is passed, letters in the key should be provided in alphabetical order: ('A','C') and not ('C','A').
      At this point, if all you wish to do is generate a log-odds matrix, please go to the section titled Example of Use. The following text describes the nitty-gritty of internal functions, to be used by people who wish to investigate their nucleotide/amino-acid frequency data more thoroughly.
   b. Generating the observed frequency matrix (OFM)
      Use:
          OFM = SubsMat._build_obs_freq_mat(ARM)
      The OFM is generated from the ARM, only instead of replacement counts, it contains replacement frequencies.
   c. Generating an expected frequency matrix (EFM)
      Use:
          EFM = SubsMat._build_exp_freq_mat(OFM, exp_freq_table)
      i. exp_freq_table: should be a FreqTable instance. See section 22.2.2 for detailed information on FreqTable. Briefly, the expected frequency table has the frequencies of appearance for each member of the alphabet. It is implemented as a dictionary with the alphabet letters as keys, and each letter's frequency as a value. Values sum to 1.
      The expected frequency table can (and generally should) be generated from the observed frequency matrix. So in most cases you will generate exp_freq_table using:
          >>> exp_freq_table = SubsMat._exp_freq_table_from_obs_freq(OFM)
          >>> EFM = SubsMat._build_exp_freq_mat(OFM, exp_freq_table)
      But you can supply your own exp_freq_table, if you wish.
   d. Generating a substitution frequency matrix (SFM)
      Use:
          SFM = SubsMat._build_subs_mat(OFM, EFM)
      Accepts an OFM and an EFM. Returns the quotient of the corresponding values (observed divided by expected).
   e. Generating a log-odds matrix (LOM)
      Use:
          LOM = SubsMat._build_log_odds_mat(SFM[, logbase=10, factor=10.0, round_digit=1])
      i. Accepts an SFM.
      ii. logbase: base of the logarithm used to generate the log-odds values.
      iii. factor: factor used to multiply the log-odds values. Each entry is generated by log(SFM[key]) * factor and rounded to the round_digit place after the decimal point, if required.
4. Example of use
   As most people would want to generate a log-odds matrix, with minimum hassle, SubsMat provides one function which does it all:
       make_log_odds_matrix(acc_rep_mat, exp_freq_table=None, logbase=10,
                            factor=10.0, round_digit=0):
   a. acc_rep_mat: user-provided accepted replacements matrix
   b. exp_freq_table: expected frequencies table. Used if provided; if not, generated from the acc_rep_mat.
   c. logbase: base of logarithm for the log-odds matrix. Default base 10.
   d. round_digit: number after decimal digit to which result should be rounded. Default zero.
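The chain of matrices described above (counts, observed frequencies, expected frequencies, log-odds) can be followed by hand with plain dictionaries. The sketch below is an illustrative re-implementation under stated assumptions, not the SubsMat code: the function names are hypothetical, the counts are made up, every count is assumed to be non-zero, and an off-diagonal pair in a half matrix is assumed to contribute half of its frequency to each of its two letters.

```python
import math


def observed_frequencies(arm):
    """Convert a half matrix of replacement counts into frequencies."""
    total = float(sum(arm.values()))
    return dict((pair, count / total) for pair, count in arm.items())


def letter_frequencies(ofm):
    """Expected letter frequencies derived from observed pair frequencies."""
    table = {}
    for (i, j), freq in ofm.items():
        if i == j:
            table[i] = table.get(i, 0.0) + freq
        else:
            # Assumption: split an off-diagonal pair evenly between letters
            table[i] = table.get(i, 0.0) + freq / 2.0
            table[j] = table.get(j, 0.0) + freq / 2.0
    return table


def log_odds(ofm, table, logbase=10, factor=10.0, round_digit=0):
    """Scaled, rounded log of observed over expected frequency per pair."""
    lom = {}
    for (i, j), observed in ofm.items():
        # In a half matrix, an off-diagonal key stands for (i, j) and (j, i)
        expected = table[i] * table[j] * (1 if i == j else 2)
        value = factor * math.log(observed / expected, logbase)
        lom[(i, j)] = round(value, round_digit)
    return lom


arm = {("A", "A"): 50, ("A", "C"): 20, ("C", "C"): 30}  # made-up counts
ofm = observed_frequencies(arm)
table = letter_frequencies(ofm)
lom = log_odds(ofm, table)
```

With these made-up counts the diagonal pairs come out positive (seen more often than expected by chance) and the A/C pair comes out negative (seen less often than expected).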
22.2.2 FreqTable
    FreqTable.FreqTable(UserDict.UserDict)
1. Attributes:
   a. alphabet: A Bio.Alphabet instance.
   b. data: frequency dictionary
   c. count: count dictionary (in case counts are provided).
2. Functions:
   a. read_count(f): read a count file from stream f. Then convert to frequencies.
   b. read_freq(f): read a frequency data file from stream f. Of course, we then don't have the counts, but it is usually the letter frequencies which are interesting.
3. Example of use: The expected count of the residues in the database is sitting in a file, whitespace delimited, in the following format (example given for a 3-letter alphabet):
       A 35
       B 65
       C 100
   And will be read using the FreqTable.read_count(file_handle) function.
   An equivalent frequency file:
       A 0.175
       B 0.325
       C 0.5
   Alternatively, the residue frequencies or counts can be passed as a dictionary. Example of a count dictionary (3-letter alphabet):
       {'A': 35, 'B': 65, 'C': 100}
   This means that the expected data counts give a frequency of 0.5 for 'C', 0.325 for 'B' and 0.175 for 'A' (out of a total of 200, the sum of A, B and C).
   A frequency dictionary for the same data would be:
       {'A': 0.175, 'B': 0.325, 'C': 0.5}
   Summing up to 1.
   When passing a dictionary as an argument, you should indicate whether it is a count or a frequency dictionary. Therefore the FreqTable class constructor requires two arguments: the dictionary itself, and FreqTable.COUNT or FreqTable.FREQ indicating counts or frequencies, respectively.
   Read expected counts. read_count will already generate the frequencies. Any one of the following may be done to generate the frequency table (ftab):
       >>> from Bio.SubsMat import FreqTable
       >>> ftab = FreqTable.FreqTable(my_frequency_dictionary, FreqTable.FREQ)
       >>> ftab = FreqTable.FreqTable(my_count_dictionary, FreqTable.COUNT)
       >>> ftab = FreqTable.read_count(open('myCountFile'))
       >>> ftab = FreqTable.read_freq(open('myFrequencyFile'))
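The relationship between the count file, the count dictionary and the frequency dictionary in the example above can be checked with a few lines of plain Python. This sketch does not use the FreqTable class; the function names read_counts and counts_to_frequencies are hypothetical.

```python
import io


def read_counts(handle):
    """Parse whitespace-delimited "letter count" lines from an open handle."""
    counts = {}
    for line in handle:
        if line.strip():
            letter, value = line.split()
            counts[letter] = int(value)
    return counts


def counts_to_frequencies(counts):
    """Divide each count by the total so that the values sum to 1."""
    total = float(sum(counts.values()))
    return dict((letter, n / total) for letter, n in counts.items())


counts = read_counts(io.StringIO("A 35\nB 65\nC 100\n"))
frequencies = counts_to_frequencies(counts)
```

Here the counts add up to 200, so the three frequencies are 35/200, 65/200 and 100/200, matching the frequency file shown above.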
Chapter 23 Where to go from here: contributing to Biopython
23.1 Bug Reports + Feature Requests
Getting feedback on the Biopython modules is very important to us. Open-source projects like this benefit greatly from feedback, bug reports (and patches!) from a wide variety of contributors.
The main forums for discussing feature requests and potential bugs are the Biopython mailing lists:
  biopython@biopython.org: An unmoderated list for discussion of anything to do with Biopython.
  biopython-dev@biopython.org: A more development-oriented list that is mainly used by developers (but anyone is free to contribute!).
Additionally, if you think you've found a new bug, you can submit it to our issue tracker at https://github.com/biopython/biopython/issues (this has replaced the older tracker hosted at http://redmine.open-bio.org/projects/biopython). This way, it won't get buried in anyone's Inbox and forgotten about.
23.2 Mailing lists and helping newcomers
We encourage all our users to sign up to the main Biopython mailing list. Once you've got the hang of an area of Biopython, we'd encourage you to help answer questions from beginners. After all, you were a beginner once.
23.3 Contributing Documentation
We're happy to take feedback or contributions either via a bug report or on the Mailing List. While reading this tutorial, perhaps you noticed some topics you were interested in which were missing, or not clearly explained. There is also Biopython's built-in documentation (the docstrings, these are also online), where again, you may be able to help fill in any blanks.
23.4 Contributing cookbook examples
As explained in Chapter 20, Biopython now has a wiki collection of user-contributed cookbook examples, http://biopython.org/wiki/Category:Cookbook. Maybe you can add to this?
23.5 Maintaining a distribution for a platform
We currently provide source code archives (suitable for any OS, if you have the right build tools installed), and Windows Installers which are just click and run. This covers all the major operating systems.
Most major Linux distributions have volunteers who take these source code releases, and compile them into packages for Linux users to easily install (taking care of dependencies etc). This is really great and we are of course very grateful. If you would like to contribute to this work, please find out more about how your Linux distribution handles this.
Below are some tips for certain platforms to maybe get people started with helping out:
Windows
  Windows products typically have a nice graphical installer that installs all of the essential components in the right place. We use Distutils to create an installer of this type fairly easily.
  You must first make sure you have a C compiler on your Windows computer, and that you can compile and install things (this is the hard bit; see the Biopython installation instructions for info on how to do this).
  Once you are set up with a C compiler, making the installer just requires doing:
      python setup.py bdist_wininst
  Now you've got a Windows installer. Congrats! At the moment we have no trouble shipping installers built on 32-bit Windows. If anyone would like to look into supporting 64-bit Windows that would be great.
RPMs
  RPMs are pretty popular package systems on some Linux platforms. There is lots of documentation on RPMs available at http://www.rpm.org to help you get started with them. To create an RPM for your platform is really easy. You just need to be able to build the package from source (having a C compiler that works is thus essential); see the Biopython installation instructions for more info on this.
  To make the RPM, you just need to do:
      python setup.py bdist_rpm
  This will create an RPM for your specific platform and a source RPM in the directory dist. This RPM should be good and ready to go, so this is all you need to do! Nice and easy.
Macintosh
  Since Apple moved to Mac OS X, things have become much easier on the Mac. We generally treat it as just another Unix variant, and installing Biopython from source is just as easy as on Linux. The easiest way to get all the GCC compilers etc installed is to install Apple's XCode. We might be able to provide click and run installers for Mac OS X, but to date there hasn't been any demand.
Once you've got a package, please test it on your system to make sure it installs everything in a good way and seems to work properly. Once you feel good about it, send it off to one of the Biopython developers (write to our main mailing list at biopython@biopython.org if you're not sure who to send it to) and you've done it. Thanks!
23.6 Contributing Unit Tests
Even if you don't have any new functionality to add to Biopython, but you want to write some code, please consider extending our unit test coverage. We've devoted all of Chapter 21 to this topic.
23.7 Contributing Code
There are no barriers to joining Biopython code development other than an interest in creating biology-related code in Python. The best place to express an interest is on the Biopython mailing lists: just let us know you are interested in coding and what kind of stuff you want to work on. Normally, we try to have some discussion on modules before coding them, since that helps generate good ideas; then just feel free to jump right in and start coding!
The main Biopython release tries to be fairly uniform and interworkable, to make it easier for users. You can read about some of the (fairly informal) coding style guidelines we try to use in Biopython in the contributing documentation at http://biopython.org/wiki/Contributing. We also try to add code to the distribution along with tests (see Chapter 21 for more info on the regression testing framework) and documentation, so that everything can stay as workable and well documented as possible (including docstrings). This is, of course, the most ideal situation; in many situations you'll be able to find other people on the list who will be willing to help add documentation or more tests for your code once you make it available. So, to end this paragraph like the last, feel free to start working!
Please note that to make a code contribution you must have the legal right to contribute it and license it under the Biopython license. If you wrote it all yourself, and it is not based on any other code, this shouldn't be a problem. However, there are issues if you want to contribute a derivative work: for example, something based on GPL or LGPL licensed code would not be compatible with our license. If you have any queries on this, please discuss the issue on the biopython-dev mailing list.
Another point of concern for any additions to Biopython regards any build-time or run-time dependencies. Generally speaking, writing code to interact with a standalone tool (like BLAST, EMBOSS or ClustalW) doesn't present a big problem. However, any dependency on another library, even a Python library (especially one needed in order to compile and install Biopython like NumPy), would need further discussion.
Additionally, if you have code that you don't think fits in the distribution, but that you want to make available, we maintain Script Central (http://biopython.org/wiki/Scriptcentral) which has pointers to freely available code in Python for bioinformatics.
Hopefully this documentation has got you excited enough about Biopython to try it out (and most importantly, contribute!). Thanks for reading all the way through!
Chapter 24 Appendix: Useful stuff about Python
If you haven't spent a lot of time programming in Python, many questions and problems that come up in using Biopython are often related to Python itself. This section tries to present some ideas and code that come up often (at least for us!) while using the Biopython libraries. If you have any suggestions for useful pointers that could go here, please contribute!
24.1 What the heck is a handle?
Handles are mentioned quite frequently throughout this documentation, and are also fairly confusing (at least to me!). Basically, you can think of a handle as being a wrapper around text information.
Handles provide (at least) two benefits over plain text information:
1. They provide a standard way to deal with information stored in different ways. The text information can be in a file, or in a string stored in memory, or the output from a command line program, or at some remote website, but the handle provides a common way of dealing with information in all of these formats.
2. They allow text information to be read incrementally, instead of all at once. This is really important when you are dealing with huge text files which would use up all of your memory if you had to load them all.
Handles can deal with text information that is being read (e.g. reading from a file) or written (e.g. writing information to a file). In the case of a read handle, commonly used functions are read(), which reads the entire text information from the handle, and readline(), which reads information one line at a time. For write handles, the function write() is regularly used.
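The three functions named above can be tried on a small throwaway file. The sketch below uses only the Python standard library; the file name example.txt and its contents are made up for illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "example.txt")

# Write handle: write() sends text to the file
with open(path, "w") as out_handle:
    out_handle.write("first line\nsecond line\n")

# Read handle: readline() returns one line at a time
with open(path) as in_handle:
    first = in_handle.readline()
    second = in_handle.readline()

# Read handle: read() returns all of the remaining text at once
with open(path) as in_handle:
    everything = in_handle.read()
```

Using the with statement closes each handle automatically at the end of its block, which avoids having to remember an explicit close() call.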
The most common usage for handles is reading information from a file, which is done using the built-in Python function open. Here, we open a handle to the file m_cold.fasta which you can download here (or find included in the Biopython source code as Doc/examples/m_cold.fasta).
    >>> handle = open("m_cold.fasta", "r")
    >>> handle.readline()
    ">gi|8332116|gb|BE037100.1|BE037100 MP14H09 MP Mesembryanthemum ...\n"
Handles are regularly used in Biopython for passing information to parsers. For example, since Biopython 1.54 the main functions in Bio.SeqIO and Bio.AlignIO have allowed you to use a filename instead of a handle:
    from Bio import SeqIO
    for record in SeqIO.parse("m_cold.fasta", "fasta"):
        print(record.id, len(record))
On older versions of Biopython you had to use a handle, e.g.
    from Bio import SeqIO
    handle = open("m_cold.fasta", "r")
    for record in SeqIO.parse(handle, "fasta"):
        print(record.id, len(record))
    handle.close()
This pattern is still useful: for example, suppose you have a gzip compressed FASTA file you want to parse:
    import gzip
    from Bio import SeqIO
    handle = gzip.open("m_cold.fasta.gz", "rt")
    for record in SeqIO.parse(handle, "fasta"):
        print(record.id, len(record))
    handle.close()
With our parsers for plain text files, under Python 3 it is essential to use gzip in text mode.
See Section 5.2 for more examples like this, including reading bzip2 compressed files.
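The difference between binary and text mode under Python 3 can be seen without Biopython. The sketch below writes a tiny gzip compressed file and reads its first line back in both modes; the file name and the sequence are made up for illustration.

```python
import gzip
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "example.fasta.gz")

# Write a tiny gzip compressed FASTA file (made-up sequence)
with gzip.open(path, "wt") as out_handle:
    out_handle.write(">seq1\nACGT\n")

# Binary mode (the default for gzip.open) gives bytes under Python 3 ...
with gzip.open(path, "rb") as handle:
    raw_line = handle.readline()

# ... while text mode ("rt") gives str, which plain text parsers expect
with gzip.open(path, "rt") as handle:
    text_line = handle.readline()
```

Here raw_line is a bytes object while text_line is an ordinary string, which is why the text mode handle is the one to pass to a parser for plain text formats.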
24.1.1 Creating a handle from a string
One useful thing is to be able to turn information contained in a string into a handle. The following example shows how to do this using StringIO from the Python standard library (the StringIO or cStringIO module under Python 2, the io module under Python 3):
    >>> my_info = 'A string\n with multiple lines.'
    >>> print(my_info)
    A string
     with multiple lines.
    >>> from io import StringIO  # Python 3; under Python 2 use: from StringIO import StringIO
    >>> my_info_handle = StringIO(my_info)
    >>> first_line = my_info_handle.readline()
    >>> print(first_line)
    A string
    <BLANKLINE>
    >>> second_line = my_info_handle.readline()
    >>> print(second_line)
     with multiple lines.
References
[1]
PeterJ.A.Cock,TiagoAntao,JeffreyT.Chang,BradA.Chapman,CymonJ.Cox,AndrewDalke,IddoFriedberg,ThomasHamelryck,FrankKauff,Bartek
Wilczynski,MichielJ.L.deHoon:Biopython:freelyavailablePythontoolsforcomputationalmolecularbiologyandbioinformatics.Bioinformatics25(11),
14221423(2009).doi:10.1093/bioinformatics/btp163,
[2]
LeightonPritchard,JenniferA.White,PaulR.J.Birch,IanK.Toth:GenomeDiagram:apythonpackageforthevisualizationoflargescalegenomicdata.
Bioinformatics22(5):616617(2006).doi:10.1093/bioinformatics/btk021,
[3]
IanK.Toth,LeightonPritchard,PaulR.J.Birch:Comparativegenomicsrevealswhatmakesanenterobacterialplantpathogen.AnnualReviewof
Phytopathology44:305336(2006).doi:10.1146/annurev.phyto.44.070505.143444,
[4]
GraldineA.vanderAuwera,JaroslawE.Krl,HaruoSuzuki,BrianFoster,RobvanHoudt,CelesteJ.Brown,MaxMergeay,EvaM.Top:Plasmidscapturedin
C.metalliduransCH34:definingthePromAfamilyofbroadhostrangeplasmids.AntonievanLeeuwenhoek96(2):193204(2009).doi:10.1007/s10482009
93169
[5]
CarolineProux,DouwevanSinderen,JuanSuarez,PilarGarcia,VictorLadero,GeraldF.Fitzgerald,FrankDesiere,HaraldBrssow:Thedilemmaofphage
taxonomyillustratedbycomparativegenomicsofSfi21LikeSiphoviridaeinlacticacidbacteria.JournalofBacteriology184(21):60266036(2002).
http://dx.doi.org/10.1128/JB.184.21.60266036.2002
[6]
FlorianJupe,LeightonPritchard,GrahamJ.Etherington,KatrinMacKenzie,PeterJACock,FrankWright,SanjeevKumarSharma1,DanBolser,GlennJBryan,
JonathanDGJones,IngoHein:IdentificationandlocalisationoftheNBLRRgenefamilywithinthepotatogenome.BMCGenomics13:75(2012).
http://dx.doi.org/10.1186/147121641375
[7]
PeterJ.A.Cock,ChristopherJ.Fields,NaohisaGoto,MichaelL.Heuer,PeterM.Rice:TheSangerFASTQfileformatforsequenceswithqualityscores,andthe
Solexa/IlluminaFASTQvariants.NucleicAcidsResearch38(6):17671771(2010).doi:10.1093/nar/gkp1137
[8]
PatrickO.Brown,DavidBotstein:ExploringthenewworldofthegenomewithDNAmicroarrays.NatureGenetics21(Supplement1),3337(1999).
doi:10.1038/4462
[9]
EricTalevich,BrandonM.Invergo,PeterJ.A.Cock,BradA.Chapman:Bio.Phylo:Aunifiedtoolkitforprocessing,analyzingandvisualizingphylogenetictrees
inBiopython.BMCBioinformatics13:209(2012).doi:10.1186/1471210513209
[10]
AthelCornishBowden:Nomenclatureforincompletelyspecifiedbasesinnucleicacidsequences:Recommendations1984.NucleicAcidsResearch13(9):
30213030(1985).doi:10.1093/nar/13.9.3021
[11]
DouglasR.Cavener:ComparisonoftheconsensussequenceflankingtranslationalstartsitesinDrosophilaandvertebrates.NucleicAcidsResearch15(4):1353
1361(1987).doi:10.1093/nar/15.4.1353
[12]
TimothyL.BaileyandCharlesElkan:Fittingamixturemodelbyexpectationmaximizationtodiscovermotifsinbiopolymers,ProceedingsoftheSecond
InternationalConferenceonIntelligentSystemsforMolecularBiology2836.AAAIPress,MenloPark,California(1994).
[13]
BradChapmanandJeffChang:Biopython:Pythontoolsforcomputationalbiology.ACMSIGBIONewsletter20(2):1519(August2000).
[14]
MichielJ.L.deHoon,SeiyaImoto,JohnNolan,SatoruMiyano:Opensourceclusteringsoftware.Bioinformatics20(9):14531454(2004).
doi:10.1093/bioinformatics/bth078
[15]
MichielB.Eisen,PaulT.Spellman,PatrickO.Brown,DavidBotstein:Clusteranalysisanddisplayofgenomewideexpressionpatterns.Proceedingsofthe
NationalAcademyofScienceUSA95(25):1486314868(1998).doi:10.1073/pnas.96.19.10943c
[16]
GeneH.Golub,ChristianReinsch:Singularvaluedecompositionandleastsquaressolutions.InHandbookforAutomaticComputation,2,(LinearAlgebra)(J.H.
WilkinsonandC.Reinsch,eds),134151.NewYork:SpringerVerlag(1971).
[17]
GeneH.Golub,CharlesF.VanLoan:Matrixcomputations,2ndedition(1989).
[18]
ThomasHamelryckandBernardManderick:11PDBparserandstructureclassimplementedinPython.Bioinformatics,19(17):23082310(2003)doi:
10.1093/bioinformatics/btg299.
[19]
ThomasHamelryck:Efficientidentificationofsidechainpatternsusingamultidimensionalindextree.Proteins51(1):96108(2003).doi:10.1002/prot.10338
[20]
ThomasHamelryck:AnaminoacidhastwosidesAnew2Dmeasureprovidesadifferentviewofsolventexposure.Proteins59(1):2948(2005).
doi:10.1002/prot.20379.
[21]
JohnA.Hartiga.Clusteringalgorithms.NewYork:Wiley(1975).
[22]
AnilL.Jain,RichardC.Dubes:Algorithmsforclusteringdata.EnglewoodCliffs,N.J.:PrenticeHall(1988).
[23]
VoratasKachitvichyanukul,BruceW.Schmeiser:BinomialRandomVariateGeneration.CommunicationsoftheACM31(2):216222(1988).
doi:10.1145/42372.42381
[24]
TeuvoKohonen:Selforganizingmaps,2ndEdition.BerlinNewYork:SpringerVerlag(1997).
[25]
PierreLEcuyer:EfficientandPortableCombinedRandomNumberGenerators.CommunicationsoftheACM31(6):742749,774(1988).
doi:10.1145/62959.62969
[26]
IndraneelMajumdar,S.SriKrishna,NickV.Grishin:PALSSE:Aprogramtodelineatelinearsecondarystructuralelementsfromproteinstructures.BMC
Bioinformatics,6:202(2005).doi:10.1186/147121056202.
[27]
V.Matys,E.Fricke,R.Geffers,E.Gssling,M.Haubrock,R.Hehl,K.Hornischer,D.Karas,A.E.Kel,O.V.KelMargoulis,D.U.Kloos,S.Land,B.Lewicki
Potapov,H.Michael,R.Mnch,I.Reuter,S.Rotert,H.Saxel,M.Scheer,S.Thiele,E.WingenderE:TRANSFAC:transcriptionalregulation,frompatternsto
profiles.NucleicAcidsResearch31(1):374378(2003).doi:10.1093/nar/gkg108
[28]
RobinSibson:SLINK:Anoptimallyefficientalgorithmforthesinglelinkclustermethod.TheComputerJournal16(1):3034(1973).
doi:10.1093/comjnl/16.1.30
[29]
GeorgeW.Snedecor,WilliamG.Cochran:Statisticalmethods.Ames,Iowa:IowaStateUniversityPress(1989).
[30]
PabloTamayo,DonnaSlonim,JillMesirov,QingZhu,SutisakKitareewan,EthanDmitrovsky,EricS.Lander,ToddR.Golub:Interpretingpatternsofgene
expressionwithselforganizingmaps:Methodsandapplicationtohematopoieticdifferentiation.ProceedingsoftheNationalAcademyofScienceUSA96(6):
29072912(1999).doi:10.1073/pnas.96.6.2907
[31]
RobertC.Tryon,DanielE.Bailey:Clusteranalysis.NewYork:McGrawHill(1970).
[32]
JohnW.Tukey:Exploratorydataanalysis.Reading,Mass.:AddisonWesleyPub.Co.(1977).
[33]
KaYeeYeung,WalterL.Ruzzo:PrincipalComponentAnalysisforclusteringgeneexpressiondata.Bioinformatics17(9):763774(2001).
doi:10.1093/bioinformatics/17.9.763
[34]
AlokSaldanha:JavaTreeviewextensiblevisualizationofmicroarraydata.Bioinformatics20(17):32463248(2004).
http://dx.doi.org/10.1093/bioinformatics/bth349
This document was translated from LaTeX by HEVEA.
