Download as ppt, pdf, or txt
Download as ppt, pdf, or txt
You are on page 1of 34

Perl for Bioinformatics

Lecture 4

Variables - review

A variable name starts with a $


It contains a number or a text string
Use my to define a variable
Use = to assign a value
Use \ to stop the variable being
interpolated
Take care with variable names and with
changing the contents of variables

Conditional Blocks, review


An if test can be used to control a
command in a conditional block,
according to the outcome of a decision
made by comparing variables.
Its important to keep track of whether
variables are strings or numbers. Numbers
are compared with ==, strings with eq.
Its usual to indent the block to make it
easier to read the code

Arrays
An array can store multiple pieces of data.
They are essential for the most useful
functions of Perl. They can store data such as:
the lines of a text file (e.g. primer sequences)
a list of numbers (e.g. BLAST e values)

Arrays are designated with the symbol @

my @bases = (A, C, G, T);

Converting a variable to an array


split splits a variable into parts and puts them
in an array.
my $dnastring = "ACGTGCTA";
my @dnaarray = split //, $dnastring;
@dnaarray is now (A, C, G, T, G, C, T, A)
@dnaarray = split /T/, $dnastring;
@dnaarray is now (ACG, GC, A)

Converting an array to a variable

join combines the elements of an array into a


single scalar variable (a string)

$dnastring = join('', @dnaarray);


spacer
(empty here)

which array

Loops
A loop repeats a bunch of functions until it is done.
The functions are placed in a BLOCK some code
delimited with curly brackets {}
Loops are really useful with arrays.
The foreach loop is probably the most useful of all:
foreach my $base (@dnaarray) {
print "$base ;
}

Comparing strings
String comparison (is the text the same?)
eq (equal )
ne (not equal )
There are others but beware of them!

Getting part of a string


substr takes characters out of a string
$letter = substr($dnastring, $position, 1)

which string

where in
the string

how many
letters to take

Combining strings
Strings can be concatenated (joined).
Use the dot . operator
$seq1= ACTG;
$seq2= GGCTA;
$seq3= $seq1 . $seq2;
print $seq3;
ACTGGGCTA

Making Decisions - review


The if operator is generally used together
with numerical or string comparison
operators, inside an (expression).
numerical:
strings:

==, !=, >, <, ,


eq, ne

You can make decisions on each member of


an array using a loop which puts each part
of the array through the test, one at a time

More healthy exercise


Write a program that asks the user for a DNA
restriction site, and then tells them whether that
particular sequence matches the site for the
restriction enzyme EcoRI, or Bam HI, or Hind III.
Site for EcoR1: GAATTC
Bam H1: GGATCC
Hind III: AAGCTT

Arrays and loops - review


An array starts with @. It contains multiple
bits of data in a list-like format.
@bases = (A, C, G, T);

You can make decisions on each member


of an array using a foreach loop which
puts each part of the array through the
test, one at a time

Test time, again


Remember
keep track of whats in a variable
dont over-write a variable with another
value, unless you intend to
syntax and case are critical
lines end with a semicolon
brackets and quotes must match.

Opening and closing files


So we can input large amounts of data, Perl
has to read data out of files, and write results
into output files
This is done in two steps
First, you must give the file a name within the
script - this is known as a filehandle
Use the open command:
open MYFILE, exampleprotein.txt;

Reading a file
Once the file is open, you can read from it, line by
line, using the readline <> operator again
(put the filehandle between the angle brackets)

Perl reads files one line at a time, each time you


input data from the file, the next line is read:
open FILE1,exampleprotein.txt;
$line1 = <FILE1>;
chomp $line1;
$line2 = <FILE1>;

Using loops to read in a file


The while loop just keeps doing an expression while its true. So it
will keep reading lines from the file until it runs out.
The special variable $_ keeps track of the line of the file were on.
my $longsequence;
open FILE, exampleprotein.txt;
while (<FILE>){
$longsequence = $longsequence . $_;
chomp $longsequence;
}
close FILE;

This reads the whole file, and puts each line into the variable
$longsequence one at a time.

Now More Fun Excercises


Read a DNA sequence from a fasta
format file
Calculate the GC content.
What about the non-DNA characters in
the file?
>header lines with the name of the sequence
carriage returns !! You know this one.
blank spaces
Ns or Xs or unexpected letters

Writing to a File
Writing to a file is similar to reading from it
Use the > operator to open a file for writing:
open OUTPUT, >/home/class30/output.txt;

This creates a new file with that name, or overwrites


an existing file
Use >> to append text to an existing file
print to the file using the filehandle:
print OUTPUT $myoutputdata;

Some more stuff you need to


know
else
Instead of just letting the script go on if it fails
an if test, you can get it to execute a second
block of code if the statement in brackets isnt
true.

elsif
You can string a lot of ifs together using elsif
if ($site eq GAATTC {
print EcoR1 site\n;
}
elsif ($site eq CCATGG {
print BamHI site\n;
}
elsif ($site eq AAGCTT) {
print HindIII site\n;
}
else { #only happens if none of the

preceeding are true

die I cant find any of the sites I know\n;

Subscripts
Bioinformatics data often can be made into array
format:
multi-line sequence files
Microarray or statistics data in tab delimited
format

You can address part of the array as if it was a


variable using a subscript
@numbers = (8, 8, 8, 23984092, 8);
print $numbers[3]\n;
Please note the first element is number zero! Second is 1!

Regular Expressions
Sounds odd, doesnt it? It means a pattern that
the computer can match, in a standard format.
Very useful in bioinformatics work
DNA patterns
restriction sites
promoters/transcription factor binding sites
intron splice site

Protein patterns
conserved domains (motifs)
active sites
structural motifs (membrane spanning, signal peptide, etc.)

The Binding and Match Operators:


=~ / /
The =~ operator binds functions together
The // operator matches things to patterns
It can be translated as contains
The forward slashes contain the pattern to be
matched, like this:
if ($dnaseq =~ /GAATTC/) {print EcoRI site found\n}

A regular expression..
is a joy forever. And a pattern to match:
can be just a text string, such as:

/GATC/

it can have alternative characters:

/G[AT]TC/

or contain a wildcard that matches any character: /G.TC/


Or be something bizzare:

/\/[^\/]*\/\.\./

Perl Regular Expressions


It never ceases to amaze me what people
can do with regular expressions, but you
can match pretty much anything you can
think of and a lot you cant:
#man perlrequick

Alternative Characters
Square brackets within the match expression allow for
alternative characters:
if

($dna =~ /CAG[AT]CAG/)

This will match an DNA string that starts with CAG; has A or T in
the 4th position, followed by another CAG.

A vertical line within the /expression/ means or; it


allows you to look for either of two completely different
patterns:
if

($dna =~ /GAAT|ATTC/)

Special characters
Perl has a large set of special characters to use
in regular expressions:
the dot (.) matches any character
\d matches any digit (a number from 0-9)
\w matches any word character
(a letter or a number, not punctuation or
space)
\s matches white space (any amount)
\t matches a tab (useful for tab delimited files)
^ matches the beginning of a line
$ matches the end of a line
Knowing this makes you lots of fun at parties.

Special characters
What if you need to match text that contains a special
character?
Arent there dots at the end of sentences?

Now you have to use a backslash (\) to escape the


special meaning of that character:
if $onewordsentence =~ /\w+ \ ./
-This would match any text that has one or more text
characters, followed by a dot.
Finished.

Bringing it together
So now, when you think about it, you can:
Open a file
Check whether each line of the file contains a particular
pattern
Recover part of that line
Write it out to another file
So.. isnt that what you wanted to know?
But really, its very useful combined with the UNIX
command line.

A last exercise?...
Now were getting up to speed with Perl,
lets try something more fun:
Open up a BLAST output file
Spit out the name of the query sequence,
the top hit, and how many hits there were.

Only the beginning


Sadly, there is much, much more than this to the
Power of Perl.
You can make, create and download other
peoples websites
Make Linux and Windows graphical programs
Do almost anything on the internet
Interact with databases
And much much more

Why wont I teach you more


stuff?
Whoa!
Programming takes time to learn properly
Youve got the tools now to get started on a
programming project
We will go through some more Perl functions
in the later classes, especially modules such
as Bioperl.

Practice makes perfect


You can now practice your Perl skills and
understand a lot of the books and help files,
which are probably more useful.
#man perlintro
#man perlrequick
#perldoc bioperl
Also, check out Lincoln Steins course at:
http://stein.cshl.org/genome_informatics/index.html

You might also like