
PERL AND BIOPERL

CONTROL STRUCTURES

 “if” statement - first style


 if ($porridge_temp < 40) {
    print "too cold.\n";
}
elsif ($porridge_temp > 150) {
    print "too hot.\n";
}
else {
    print "just right\n";
}
CONTROL STRUCTURES

 “if” statement - second style


 statement if condition;
 print "\$index is $index" if $DEBUG;
 Single statements only
 Simple expressions only
 “unless” is a reverse “if”
 statement unless condition;
 print "millennium is here!" unless $year < 2000;
CONTROL STRUCTURES

 “for” loop - first style


 for (initial; condition; increment) { code }
 for ($i=0; $i<10; $i++) {
    print "hello\n";
}
 “for” loop - second style
 for [variable] (list) { code }
 for $name (@employees) {
    print "$name is an employee.\n";
}
THE FOR STATEMENT

 Syntax
for (START; STOP; ACTION) { BODY }
 Initially execute START statements once.
 Repeatedly execute BODY while STOP is true.
 Execute ACTION after each iteration.

 Example
for ($i=0; $i<10; $i++) {
    print("Iteration: $i\n");
}
THE FOREACH STATEMENT

 Syntax
foreach SCALAR ( ARRAY ) { BODY }
 Assign ARRAY element to SCALAR.
 Execute BODY.
 Repeat for each element in ARRAY.

 Example
@asTmp = qw(One Two Three);
foreach $s (@asTmp) { $s .= "sy "; } # $s aliases each element, so the array itself is changed
print(@asTmp); # Onesy Twosy Threesy
CONTROL STRUCTURES

 “while” loop
 while (condition) { code }
 $cars = 7;
while ($cars > 0) {
    print "cars left: ", $cars--, "\n";
}
 while ($game_not_over) {…}
CONTROL STRUCTURES

 “until” loop is opposite of “while”


 until (condition) { code }
 $cars = 7;
until ($cars <= 0) {
    print "cars left: ", $cars--, "\n";
}
 until ($game_over) {…}
CONTROL STRUCTURES

 Bottom-check Loops
 do { code } while (condition);
 do { code } until (condition);
 $value = 0;
do {
    print "Enter Value: ";
    $value = <STDIN>;
} until ($value > 0);
SUBROUTINES (FUNCTIONS)

 Defining a Subroutine
 sub name { code }
 Arguments passed in via “@_” list
 sub multiply {
my ($a, $b) = @_;
return $a * $b;
}
 Last value processed is the return value
(could have left out word “return”, above)
SUBROUTINES (FUNCTIONS)

 Calling a Subroutine
 subname; # no args, no return value
 subname (args);
 $retval = &subname (args);
 The “&” is optional so long as…
 subname is not a reserved word
 subroutine was defined before being called
SUBROUTINES (FUNCTIONS)

 Passing Arguments
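 Arguments arrive in @_ as aliases to the caller's variables; copying them with my (...) = @_ gives pass-by-value behavior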
 Lists are expanded
 @a = (5,10,15);
@b = (20,25);
&mysub(@a,@b);
 this passes five arguments: 5,10,15,20,25
 mysub can receive them as 5 scalars, or one array
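The flattening described above can be checked with a small sketch. The helper names count_args and count_each are hypothetical, not part of the slides; the second helper takes array references, which is one common way to keep two lists separate.

```perl
use strict;
use warnings;

# Hypothetical helper: reports how many arguments arrived in @_.
sub count_args {
    my @args = @_;          # copy the flattened argument list
    return scalar @args;    # number of elements received
}

# Hypothetical helper: takes two array references, so the
# two lists stay separate inside the subroutine.
sub count_each {
    my ($first, $second) = @_;
    return (scalar @$first, scalar @$second);
}

my @a = (5, 10, 15);
my @b = (20, 25);

my $flat  = count_args(@a, @b);     # both arrays are flattened into one list
my @sizes = count_each(\@a, \@b);   # references keep the arrays apart

print "flattened call saw $flat arguments\n";
print "reference call saw sizes @sizes\n";
```

Passing references is a design choice: the callee can then tell where one list ends and the next begins, which the flattened call cannot.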
SUBROUTINES (FUNCTIONS)

 Examples
 sub good1 {
my($a,$b,$c) = @_;
}
&good1 (@triplet);
 sub good2 {
my(@a) = @_;
}
&good2 ($one, $two, $three);
DEALING WITH HASHES

 keys( ) - get an array of all keys


 foreach (keys (%hash)) { … }
 values( ) - get an array of all values
 @array = values (%hash);
 each( ) - get key/value pairs
 while (@pair = each(%hash)) {
    print "element $pair[0] has $pair[1]\n";
}
DEALING WITH HASHES

 exists( ) - check if element exists


 if (exists $ARRAY{$key}) { … }
 delete( ) - delete one element
 delete $ARRAY{$key};
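The hash functions from the last two slides can be tried together in one short sketch. The %codon table is made-up illustration data, not something from the slides.

```perl
use strict;
use warnings;

# Hypothetical data: codon => amino acid
my %codon = (ATG => 'Met', TGG => 'Trp', TAA => 'Stop');

# keys() and values() return lists; hash order is not fixed, so sort for display.
my @sorted_keys   = sort keys %codon;
my @sorted_values = sort values %codon;

# each() hands back one key/value pair per call until the hash is exhausted.
my %copy;
while (my ($key, $value) = each %codon) {
    $copy{$key} = $value;
}

# exists() tests for a key; delete() removes the key and its value.
my $had_stop = exists $codon{TAA} ? 1 : 0;
delete $codon{TAA};
my $has_stop = exists $codon{TAA} ? 1 : 0;

print "keys: @sorted_keys\n";
print "values: @sorted_values\n";
print "TAA present before delete: $had_stop, after: $has_stop\n";
```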
OTHER USEFUL FUNCTIONS

 push( ), pop( )- stack operations on lists


 shift( ), unshift( ) - the same operations at the front of the list

 split( ) - split a string by separator


 @parts = split(/:/,$passwd_line);
 @words = split; # with no arguments, same as: split (' ', $_)
 splice( ) - remove/replace elements
 substr( ) - substrings of a string
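A minimal sketch of the list functions above, using made-up values. Each comment shows the list contents after that step.

```perl
use strict;
use warnings;

my @stack = (1, 2, 3);
push @stack, 4;                  # add at the end:      (1 2 3 4)
my $top = pop @stack;            # remove from the end: (1 2 3), $top is 4
unshift @stack, 0;               # add at the front:    (0 1 2 3)
my $first = shift @stack;        # remove from front:   (1 2 3), $first is 0

# splice(ARRAY, OFFSET, LENGTH, LIST) removes LENGTH elements starting
# at OFFSET and puts LIST in their place; it returns what was removed.
my @letters = qw(a b c d e);
my @removed = splice(@letters, 1, 2, 'X');   # @letters is now (a X d e)

# split breaks a string on a separator pattern.
my @parts = split /:/, 'root:x:0:0';

print "stack: @stack, top: $top, first: $first\n";
print "letters: @letters, removed: @removed\n";
print "parts: @parts\n";
```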
STRING MANIPULATION

 chop
 chop(VARIABLE)

 chop(LIST)

 index(STR, SUBSTR, POSITION)

 index(STR, SUBSTR)

 length(EXPR)
STRING MANIPULATION (CONT.)

 substr(EXPR, OFFSET, LENGTH)


 substr(EXPR, OFFSET)

 Example: string.pl
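The file string.pl is not reproduced in these slides. As a stand-in, the following sketch exercises the string functions listed above on a made-up DNA string; it should not be read as the original example.

```perl
use strict;
use warnings;

my $dna = "ATGGCCTAA\n";

my $removed = chop($dna);           # removes and returns the last character (here the newline)
my $len     = length($dna);         # number of characters left
my $pos     = index($dna, 'GCC');   # zero-based position of the first match
my $later   = index($dna, 'A', 1);  # same search, but starting at position 1
my $missing = index($dna, 'TTT');   # -1 when the substring is not found
my $codon   = substr($dna, 0, 3);   # 3 characters starting at offset 0
my $tail    = substr($dna, 6);      # everything from offset 6 to the end

print "length $len, GCC at $pos, next A at $later, TTT at $missing\n";
print "first codon $codon, tail $tail\n";
```

Note that chop removes the last character whatever it is; chomp removes it only when it is a newline.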
PATTERN MATCHING

 See if strings match a certain pattern


 syntax: string =~ pattern
 Returns true if it matches, false if not.
 Example: match “abc” anywhere in string:
 if ($str =~ /abc/) { … }
 But what about complex concepts like:
 between 3 and 5 numeric digits
 optional whitespace at beginning of line
PATTERN MATCHING

 Regular Expressions are a way to describe character patterns in a string
 Example: match “john” or “jon”
 /joh?n/
 Example: match money values
 /\$\d+\.\d\d/
 Complex Example: match times of the day
 /\d?\d:\d\d(:\d\d)? (AM|PM)?/i
PATTERN MATCHING

 Symbols with Special Meanings


 period . - any single character
 char set [0-9a-f] - one char matching these
 Abbreviations

 \d - a numeric digit [0-9]


 \w - a word character [A-Za-z0-9_]
 \s - whitespace char [ \t\n\r\f]
 \D, \W, \S - any character but \d, \w, \s
 \n, \r, \t - newline, carriage-return, tab
 \f, \e - formfeed, escape
 \b - word boundary
PATTERN MATCHING

 Symbols with Special Meanings


 asterisk * - zero or more occurrences
 plus sign + - one or more occurrences
 question mark ? - zero or one occurrences
 caret ^ - anchor to beginning of line
 dollar sign $ - anchor to end of line
 quantity {n,m} - between n and m
occurrences (inclusively)
 [A-Z]{2,4} means “2, 3, or 4 uppercase letters”.
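The symbols above can be combined into the patterns mentioned earlier ("between 3 and 5 numeric digits", "optional whitespace at beginning of line"). The test strings below are made up for illustration.

```perl
use strict;
use warnings;

my $name_re   = qr/joh?n/;            # "john" or "jon"
my $money_re  = qr/\$\d+\.\d\d/;      # a money value such as $12.50
my $digits_re = qr/^\s*\d{3,5}$/;     # optional leading whitespace, then 3 to 5 digits
my $upper_re  = qr/^[A-Z]{2,4}$/;     # exactly 2, 3, or 4 uppercase letters

# Hypothetical test strings, each paired with the pattern to try.
my @cases = (
    [ 'jon',         $name_re   ],
    [ 'joan',        $name_re   ],
    [ 'cost $12.50', $money_re  ],
    [ '  1234',      $digits_re ],
    [ '123456',      $digits_re ],
    [ 'ABC',         $upper_re  ],
);

foreach my $case (@cases) {
    my ($string, $re) = @$case;
    my $verdict = ($string =~ $re) ? 'matches' : 'does not match';
    print "'$string' $verdict\n";
}
```

Without the ^ and $ anchors, \d{3,5} would also match inside a longer run of digits, which is why the anchored forms are used here.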
PATTERN MATCHING

 Ways of Using Patterns


 Matching
 if ($line =~ /pattern/) { … }
 also written: m/pattern/
 Substitution
 $name =~ s/ASU/Arizona State University/;
 Translation
 $command =~ tr/A-Z/a-z/; # lowercase it
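A short sketch of the three uses side by side, with made-up strings. It also shows two details not spelled out above: parentheses in a match capture text, and tr/// returns a count.

```perl
use strict;
use warnings;

# Matching with captures: pull hour and minute out of a time string.
my ($hour, $minute) = '10:45 PM' =~ /(\d?\d):(\d\d)/;

# Substitution: s/// edits the variable in place and returns the number of changes.
my $name    = 'ASU Bioinformatics';
my $changed = ($name =~ s/ASU/Arizona State University/);

# Translation: tr/// maps characters one for one.
my $command = 'LIST Files';
$command =~ tr/A-Z/a-z/;

# With an empty replacement list, tr/// leaves the string alone and only counts.
my $dna = 'ATGGCC';
my $gc  = ($dna =~ tr/GC//);

print "hour $hour, minute $minute\n";
print "$name ($changed substitution)\n";
print "$command\n";
print "$dna has $gc G/C bases\n";
```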
COMMAND LINE ARGS

 $0 = program name
 @ARGV array of arguments to program

 zero-based index (default for all arrays)

 Example
 yourprog -a somefile
 $0 is “yourprog”
 $ARGV[0] is “-a”

 $ARGV[1] is “somefile”
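One way to handle the "yourprog -a somefile" example is sketched below. The parse_args helper and the meaning given to -a are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: separates the -a flag from file names.
sub parse_args {
    my @args = @_;
    my %opt = (all => 0, files => []);
    foreach my $arg (@args) {
        if ($arg eq '-a') {
            $opt{all} = 1;                  # flag seen
        }
        else {
            push @{ $opt{files} }, $arg;    # anything else is treated as a file name
        }
    }
    return \%opt;
}

my $opt = parse_args(@ARGV);

print "program name: $0\n";
print "number of arguments: ", scalar(@ARGV), "\n";
print "-a given: $opt->{all}\n";
print "files: @{ $opt->{files} }\n";
```

For anything beyond a flag or two, the standard Getopt::Long module is the usual choice.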
BASIC FILE I/O

 Reading a File
 open (FILEHANDLE, $filename) || die "open of $filename failed: $!";
while (<FILEHANDLE>) {
    chomp $_; # or just: chomp; removes the trailing newline only
    print "$_\n";
}
close FILEHANDLE;
BASIC FILE I/O

 Writing a File
 open (FILEHANDLE, ">$filename") || die "open of $filename failed: $!";
foreach (@data) {
    print FILEHANDLE "$_\n";
    # note, no comma after the filehandle!
}
close FILEHANDLE;
BASIC FILE I/O

 Predefined File Handles


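 STDIN - standard input, read with <STDIN>
 STDOUT - standard output (the default for print)
 STDERR - standard error output
 print STDERR "big bad error occurred\n";
 <> reads the files named in @ARGV, or STDIN if there are none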
READING WITH <>

 Reading from File


 $input = <MYFILE> ;
 Reading from Files Named on the Command Line
 $input = <> ;
 Reading from Standard Input
 $input = <> ;
 $input = <STDIN> ;
READING WITH <> (CONT.)

 Reading into Array Variable


 @an_array = <MYFILE> ;
 @an_array = <STDIN> ;
 @an_array = <> ;
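The difference between the scalar and array forms can be seen in a small sketch. To stay self-contained it reads from an in-memory string instead of a real file, and it uses a lexical filehandle with three-argument open; both are choices made here, not something the slides prescribe.

```perl
use strict;
use warnings;

my $text = "line one\nline two\nline three\n";

# Open the string as if it were a file (an in-memory filehandle).
open(my $fh, '<', \$text) or die "cannot open in-memory file: $!";

my $first = <$fh>;    # scalar context: reads exactly one line
my @rest  = <$fh>;    # list context: reads all remaining lines

close $fh;

chomp($first);        # strip the trailing newline
chomp(@rest);         # chomp also works on a whole array

print "first: $first\n";
print "remaining lines: ", scalar(@rest), "\n";
```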
PACKAGES

 Collect data & functions in a separate (“private”) namespace
 Reusable code
PACKAGES

 Access packages by file name or path:


 require "getopts.pl";
 require "/usr/local/lib/perl/getopts.pl";
 require "../lib/mypkg.pl";
PACKAGES

 Command: package pkgname;


 Stays in effect until next “package” or end of block { … } or
end of file.
 Default package is “main”
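A minimal sketch of how the package command separates names. The package name Counter and its contents are hypothetical.

```perl
use strict;
use warnings;

package Counter;            # everything below lives in the Counter namespace

our $count = 0;             # this is $Counter::count

sub bump {
    return ++$count;        # increments $Counter::count
}

package main;               # back to the default package

our $count = 100;           # this is $main::count, a different variable

Counter::bump();
Counter::bump();

my $pkg_count  = $Counter::count;   # changed by bump()
my $main_count = $main::count;      # untouched

print "Counter::count is $pkg_count\n";
print "main::count is $main_count\n";
```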
PACKAGES

 Package name in variables


 $pkg::counter = 0;
 Package name in subroutines
 sub pkg::mysub ( ) { … }
 &pkg::mysub($stuff);
 Old syntax in Perl 4
 sub pkg'mysub ( ) { … }
PACKAGES

#
# Get Day Of Month Package
#

package getDay;

sub main::getDayOfMonth {
local ($sec, $min, $hour, $mday) = localtime;
return $mday;
}
1; # otherwise “require” or “use” would fail
PACKAGES

 Calling the package


 require "/path/to/getDay.pl";
$day = &getDayOfMonth;
 In Perl 5, you can leave off “&” for previously defined
functions:
 $day = getDayOfMonth;
WHAT ARE PERL MODULES?
 Modules are collections of subroutines
 Encapsulate code for a related set of processes

 End in .pm so Foo.pm would be used as Foo

 Can form basis for Objects in Object Oriented programming
USING A SIMPLE MODULE
 List::Util is a set of List utilities functions
 Read the perldoc to see what you can do

 Follow the synopsis or individual function examples


LIST::UTIL

use List::Util;
my @list = 10..20;
my $sum = List::Util::sum(@list);
print "sum (@list) is $sum\n";

# or import the functions by name
use List::Util qw(shuffle sum);
my @list2 = (10,10,12,11,17,89);
my $sum2 = sum(@list2);
print "sum (@list2) is $sum2\n";

my @shuff = shuffle(@list2);
print "shuff is @shuff\n";
MODULE NAMING

 Module naming helps identify the purpose of the module
 The symbol :: separates directory names from the module file name; the parts map directly to a directory structure
 List::Util is therefore a module called Util.pm located in a directory called ‘List’
(MORE) MODULE NAMING
 Does not require inheritance or specific relationship
between modules that all start with the same directory
name
 Case MaTTerS! List::util will not work

 Read more about a module by running “perldoc Modulename”
MODULES AS OBJECTS
 Modules are collections of subroutines
 Can also manage data (aka state)

 Multiple instances can be created (instantiated)

 Can access module routines directly on object


OBJECT CREATION
 To instantiate a module call ‘new’
 Sometimes there are initialization values

 Objects are registered for cleanup when they are set to undefined (or when they go out of scope)
 Methods are called using -> because we are dereferencing the object.
SIMPLE MODULE AS OBJECT EXAMPLE
#!/usr/bin/perl -w
use strict;
use MyAdder;

my $adder = new MyAdder;

$adder->add(10);
print $adder->value, "\n";
$adder->add(10);
print $adder->value, "\n";

my $adder2 = new MyAdder(12);

$adder2->add(17);
print $adder2->value, "\n";

my $adder3 = MyAdder->new(75);
$adder3->add(7);
print $adder3->value, "\n";
WRITING A MODULE: INSTANTIATION
 Starts with package to define the module name
 multiple packages can be defined in a single module file -
but this is not recommended at this stage
 The method name new is usually used for instantiation
 bless is used to associate a data structure with a package, which turns it into an object
WRITING A MODULE: SUBROUTINES
 The first argument to a method is always a reference to the object (or the package name, for a class method such as new) - we usually call it ‘$self’ in the code.
 This is an implicit aspect of Object-Oriented Perl

 Write subroutines just like normal, but data associated with the object can be accessed through the $self reference.
WRITING A MODULE
package MyAdder;
use strict;

sub new {
    my ($package, $val) = @_;
    $val ||= 0;
    my $obj = bless { 'value' => $val }, $package;
    return $obj;
}
sub add {
    my ($self, $val) = @_;
    $self->{'value'} += $val;
}

sub value {
    my $self = shift;
    return $self->{'value'};
}

1; # a module file must return a true value
WRITING A MODULE II (ARRAY)
package MyAdder;
use strict;

sub new {
my ($package, $val) = @_;
$val ||= 0;
my $obj = bless [$val], $package;
return $obj;
}
sub add {
my ($self,$val) = @_;
$self->[0] += $val;
}

sub value {
my $self = shift;
return $self->[0];
}

1; # a module file must return a true value
USING THE MODULE

 Perl has to know where to find the module


 Uses a set of include paths
 type perl -V and look at the @INC variable
 Can also add to this path with the PERL5LIB
environment variable
 Can also specify an additional library path in the script: use lib '/path/to/lib';
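The search path can also be inspected from inside a script. In this sketch '/path/to/lib' is the placeholder directory from the slide and does not need to exist.

```perl
use strict;
use warnings;

# use lib runs at compile time and puts the directory at the front of @INC.
use lib '/path/to/lib';

# Perl searches these directories, in order, when it sees "use" or "require".
print "module search path:\n";
print "  $_\n" for @INC;

# List::Util maps to the relative file name List/Util.pm under one of those directories.
my $module   = 'List::Util';
(my $relpath = $module) =~ s{::}{/}g;
$relpath .= '.pm';
print "$module would be loaded from <one of the directories above>/$relpath\n";
```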
USING A MODULE AS AN OBJECT
 LWP is a perl library for WWW processing
 Will initialize an ‘agent’ to go out and retrieve web pages
for you
 Can be used to process the content that it downloads
LWP::USERAGENT
#!/usr/bin/perl -w
use strict;
use LWP::UserAgent;

my $url = 'http://us.expasy.org/uniprot/P42003.txt';
my $ua = LWP::UserAgent->new(); # initialize an object
$ua->timeout(10); # set the timeout value
my $response = $ua->get($url);
if ($response->is_success) {
# print $response->content; # or whatever
if( $response->content =~ /DE\s+(.+)\n/ ) {
print "description is '$1'\n";
}
if( $response->content =~ /OS\s+(.+)\n/ ) {
print "species is '$1'\n";
}
}
else {
die $response->status_line;
}
OVERVIEW OF BIOPERL TOOLKIT

 Bioperl is...
 A set of Perl modules for manipulating genomic and other
biological data
 An Open Source Toolkit with many contributors
 A flexible and extensible system for doing bioinformatics
data manipulation
SOME THINGS YOU CAN DO
 Read in sequence data from a file in standard formats
(FASTA, GenBank, EMBL, SwissProt,...)
 Manipulate sequences, reverse complement, translate
coding DNA sequence to protein.
 Parse a BLAST report, get access to every bit of data in
the report
 Dr. Mikler will post some detailed tutorials
MAJOR DOMAINS COVERED

 Sequences, Features, Annotations


 Pairwise alignment reports

 Multiple Sequence Alignments

 Bibliographic data

 Graphical Rendering of sequence tracks

 Database for features and sequences


ADDITIONAL DOMAINS

 Gene prediction parsers


 Trees, Parsing Phylogenetic and Molecular Evolution
software output
 Population Genetic data and summary statistics

 Taxonomy

 Protein Structure
SEQUENCE FILE FORMATS

 Simple formats - without features


 FASTA (Pearson), Raw, GCG
 Rich Formats - with features and annotations
 GenBank, EMBL
 Swissprot, GenPept
 XML - BSML, GAME, AGAVE, TIGRXML, CHADO
PARSING SEQUENCES

 Bio::SeqIO
 multiple drivers: genbank, embl, fasta,...
 Sequence objects
 Bio::PrimarySeq
 Bio::Seq
 Bio::Seq::RichSeq
LOOK AT THE SEQUENCE OBJECT

 Common (Bio::PrimarySeq) methods


 seq() - get the sequence as a string
 length() - get the sequence length
 subseq($s,$e) - get a subsequence
 translate(...) - translate to protein [DNA]
 revcom() - reverse complement [DNA]
 display_id() - identifier string
 description() - description string
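To make clear what subseq() and revcom() compute, here is a plain-Perl sketch that works on ordinary strings. It is only a stand-in for illustration, not how Bio::PrimarySeq implements these methods, and the helper names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: 1-based, inclusive coordinates, as subseq() uses.
sub subseq_1based {
    my ($dna, $start, $end) = @_;
    return substr($dna, $start - 1, $end - $start + 1);
}

# Hypothetical helper: reverse the string, then complement each base.
sub reverse_complement {
    my ($dna) = @_;
    my $rc = reverse $dna;           # scalar assignment reverses the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;    # A<->T, C<->G, case preserved
    return $rc;
}

my $dna = 'ATGGGTA';
print "length is ", length($dna), "\n";
print "bases 4-5 are ", subseq_1based($dna, 4, 5), "\n";
print "reverse complement is ", reverse_complement($dna), "\n";
```

The real methods return sequence objects and handle ambiguity codes, which this sketch ignores.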
DETAILED LOOK AT SEQS WITH
ANNOTATIONS

 Bio::Seq objects have the methods


 add_SeqFeature($feature) - attach feature(s)
 get_SeqFeatures() - get all the attached features.
 species() - a Bio::Species object
 annotation() - Bio::Annotation::Collection
FEATURES
 Bio::SeqFeatureI - interface
 Bio::SeqFeature::Generic - basic implementation

 SeqFeature::Similarity - some score info

 SeqFeature::FeaturePair - pair of features


SEQUENCE FEATURES
 Bio::SeqFeatureI - interface - GFF derived
 start(), end(), strand() for location information
 location() - Bio::LocationI object (to represent complex
locations)
 score,frame,primary_tag, source_tag - feature information
 spliced_seq() - for attached sequence, get the sequence
spliced.
SEQUENCE FEATURE (CONT.)
 Bio::SeqFeature::Generic
 add_tag_value($tag,$value) - add a tag/value pair
 get_tag_values($tag) - get all the values for this tag
 has_tag($tag) - test if a tag exists
 get_all_tags() - get all the tags
ANNOTATIONS

 Each Bio::Seq has a Bio::Annotation::Collection via $seq->annotation()
 Annotations are stored with keys like ‘comment’ and ‘reference’

 @com = $annotation->get_Annotations('comment')

 $annotation->add_Annotation('comment',$an)
ANNOTATIONS

 Annotation::Comment
 comment field
 Annotation::Reference
 author,journal,title, etc
 Annotation::DBLink
 database,primary_id,optional_id,comment

 Annotation::SimpleValue
CREATE A SEQUENCE OUT OF THIN AIR
use Bio::Seq;
my $seq = Bio::Seq->new(-seq => 'ATGGGTA',
                        -display_id => 'MySeq',
                        -description => 'a description');
print "bases 4-5 are ", $seq->subseq(4,5), "\n";
print "my whole sequence is ", $seq->seq(), "\n";
print "reverse complement is ",
    $seq->revcom->seq(), "\n";
READING IN A SEQUENCE
use Bio::SeqIO;
my $in = Bio::SeqIO->new(-format => 'genbank',
                         -file => 'file.gb');
while( my $seq = $in->next_seq ) {
    print "sequence name is ", $seq->display_id,
        " length is ", $seq->length, "\n";
    print "there are ", (scalar $seq->get_SeqFeatures),
        " features attached to this sequence and ",
        scalar $seq->annotation->get_Annotations('reference'),
        " reference annotations\n";
}
WRITING A SEQUENCE
use Bio::SeqIO;
# Let's convert swissprot to fasta format
my $in = Bio::SeqIO->new(-format => 'swiss',
                         -file => 'file.sp');
my $out = Bio::SeqIO->new(-format => 'fasta',
                          -file => '>file.fa');
while( my $seq = $in->next_seq ) {
    $out->write_seq($seq);
}
A DETAILED LOOK AT BLAST PARSING
 3 Components
 Result: Bio::Search::Result::ResultI
 Hit: Bio::Search::Hit::HitI
 HSP: Bio::Search::HSP::HSPI
BLAST PARSING SCRIPT
use Bio::SearchIO;
my $cutoff = '0.001';
my $file = 'BOSS_Ce.BLASTP';
my $in = new Bio::SearchIO(-format => 'blast',
                           -file => $file);
while( my $r = $in->next_result ) {
    print "Query is: ", $r->query_name, " ",
        $r->query_description, " ", $r->query_length, " aa\n";
    print " Matrix was ", $r->get_parameter('matrix'), "\n";
    while( my $h = $r->next_hit ) {
        last if $h->significance > $cutoff; # hits are sorted, so stop at the first weak one
        print "Hit is ", $h->name, "\n";
        while( my $hsp = $h->next_hsp ) {
            print " HSP Len is ", $hsp->length('total'), " ",
                " E-value is ", $hsp->evalue, " Bit score ",
                $hsp->score, " \n",
                " Query loc: ", $hsp->query->start, " ",
                $hsp->query->end, " ",
                " Subject loc: ", $hsp->hit->start, " ",
                $hsp->hit->end, "\n";
        }
    }
}
BLAST Report
Copyright (C) 1996-2000 Washington University, Saint Louis, Missouri USA.
All Rights Reserved.

Reference: Gish, W. (1996-2000) http://blast.wustl.edu

Query= BOSS_DROME Bride of sevenless protein precursor.


(896 letters)

Database: wormpep87
20,881 sequences; 9,238,759 total letters.
Searching....10....20....30....40....50....60....70....80....90....100% done

                                                                    Smallest
                                                                      Sum
                                                             High  Probability
Sequences producing High-scoring Segment Pairs:             Score  P(N)      N

F35H10.10 CE24945 status:Partially_confirmed TR:Q20073...     182  4.9e-11   1
M02H5.2 CE25951 status:Predicted TR:Q966H5 protein_id:...      86  0.15      1
ZC506.4 CE01682 locus:mgl-1 metatrophic glutamate recept...    91  0.18      1

……
USING THE SEARCH::RESULT OBJECT
use Bio::SearchIO;
use strict;
my $parser = new Bio::SearchIO(-format => 'blast', -file => 'file.bls');
while( my $result = $parser->next_result ){
    print "query name=", $result->query_name, " desc=",
        $result->query_description, ", len=", $result->query_length, "\n";
    print "algorithm=", $result->algorithm, "\n";
    print "db name=", $result->database_name, " #lets=",
        $result->database_letters, " #seqs=", $result->database_entries, "\n";
    print "available params ", join(',',
        $result->available_parameters), "\n";
    print "available stats ", join(',',
        $result->available_statistics), "\n";
    print "num of hits ", $result->num_hits, "\n";
}
USING THE SEARCH::HIT OBJECT
use Bio::SearchIO;
use strict;
my $parser = new Bio::SearchIO(-format => 'blast', -file => 'file.bls');
while( my $result = $parser->next_result ){
    while( my $hit = $result->next_hit ) {
        print "hit name=", $hit->name, " desc=", $hit->description,
            "\n len=", $hit->length, " acc=", $hit->accession, "\n";
        print "raw score ", $hit->raw_score, " bits ", $hit->bits,
            " significance/evalue=", $hit->evalue, "\n";
    }
}
TURNING BLAST INTO HTML

use Bio::SearchIO;
use Bio::SearchIO::Writer::HTMLResultWriter;

my $in = new Bio::SearchIO(-format => 'blast',
                           -file => shift @ARGV);

my $writer = new Bio::SearchIO::Writer::HTMLResultWriter();

my $out = new Bio::SearchIO(-writer => $writer,
                            -file => ">file.html");
$out->write_result($in->next_result);
TURNING BLAST INTO HTML

# to filter your output


my $MinLength = 100; # need a variable with scope outside the method
sub hsp_filter {
my $hsp = shift;
return 1 if $hsp->length('total') > $MinLength;
}
sub result_filter {
my $result = shift;
return $result->num_hits > 0;
}

my $writer = new Bio::SearchIO::Writer::HTMLResultWriter


(-filters => { 'HSP' => \&hsp_filter} );
my $out = new Bio::SearchIO(-writer => $writer);
$out->write_result($in->next_result);

# can also set the filter via the writer object


$writer->filter('RESULT', \&result_filter);
CUSTOM URL LINKS
@args = ( -nucleotide_url => $gbrowsedblink,
-protein_url => $gbrowsedblink
);
my $processor = new Bio::SearchIO::Writer::HTMLResultWriter(@args);
$processor->introduction(\&intro_with_overview);
$processor->hit_link_desc(\&gbrowse_link_desc);
$processor->hit_link_align(\&gbrowse_link_desc);

sub intro_with_overview {
my ($result) = @_;
my $f = &generate_overview($result,$result->{"_FILEBASE"});
$result->rewind();
return sprintf(
qq{
<center>
<b>Hit Overview<br>
Score: <font color="red">Red= (&gt;=200)</font>, <font color="purple">Purple 200-
80</font>, <font color="green">Green 80-50</font>, <font color="blue">Blue 50-40</font>,
<font color="black">Black &lt;40</font>
MULTIPLE SEQUENCE ALIGNMENTS
 Bio::AlignIO to read alignment files
 Produces Bio::SimpleAlign objects

 Interface and objects designed for round-tripping and some functional work
 Could really use an overhaul or a parallel MSA
representation
GETTING SEQUENCES FROM GENBANK
 Through Web Interface Bio::DB::GenBank (don’t
abuse!!)
 Alternative is to download all of genbank, index with
Bio::DB::Flat (will be much faster in long run)
SIMPLE SEQUENCE RETRIEVAL

use Bio::Perl;

my $acc = shift @ARGV; # accession number given on the command line
my $seq = get_sequence('genbank', $acc);

print "I got a sequence ", $seq->display_id, " for $acc\n";



SEQUENCE RETRIEVAL SCRIPT

#!/usr/bin/perl -w
use strict;

use Bio::DB::GenPept;
use Bio::DB::GenBank;
use Bio::SeqIO;

my $db = new Bio::DB::GenPept();


# my $db = new Bio::DB::GenBank(); # if you want NT seqs
# use STDOUT to write sequences
my $out = new Bio::SeqIO(-format => 'fasta');

my $acc = 'AB077698';
my $seq = $db->get_Seq_by_acc($acc);
if( $seq ) {
$out->write_seq($seq);
} else {
print STDERR "cannot find seq for acc $acc\n";
}
$out->close();
SEQUENCE RETRIEVAL FROM LOCAL
DATABASE

use Bio::DB::Flat;

my $db = new Bio::DB::Flat(-directory => '/tmp/idx',
                           -dbname => 'swissprot',
                           -write_flag => 1,
                           -format => 'fasta',
                           -index => 'binarysearch');

$db->make_index('/data/protein/swissprot');
my $seq = $db->get_Seq_by_acc('BOSS_DROME');
