Stat 133 All Lectures
It's free!
A screenshot from http://www.R-project.org/
R can be run in interactive or batch modes. The
interactive mode is useful for trying out new analyses and
making sure your code is doing what you think it is. The
batch mode is useful for carrying out pre-defined analyses
in the background.
For now, we'll focus on the interactive mode.
When you fire up R, you'll see a prompt, like this:
At the prompt, you can type an expression. An expression
is a combination of letters/numbers/symbols which are
interpreted by a particular programming language
according to its rules. It then returns a value. We can also
say it evaluates to that value.
> 3 + 5
[1] 8
> 1:20
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13
[14] 14 15 16 17 18 19 20
>
> # This is a comment
>
> 30 + 10 / # I'm not done typing
+ 2
[1] 35
To store a value, we can assign it to a variable.
> x1 <- 32 %% 5
> print(x1)
[1] 2
> x2 <- 32 %/% 5
> x2 # In interactive mode, this prints the object
[1] 6
> ls() # List all my variables
[1] "x1" "x2"
> rm(x2) # Remove a variable
> ls()
[1] "x1"
Variable names must follow some rules.
A review of what we have covered so far: basic calculations in R,
high-level graphics, and the types of data structures and how to index them:
Vectors: [index]
> x[1:10]; x[-3]; x[x>3]
Matrices: [rowindex, colindex]
> m[1,2]; m[1:2, ]; m[ ,a]
Arrays: [index1, index2, ..., indexK]
> a[1, 3, ]; a[v == TRUE, , ]
Data frames: [rowindex, colindex], $name
> cars$Cars6; cars[, 3:4]; cars[cars$Junction == "7 to 8", ]
Lists: $name, [index], [[index]]
> ingredients$meat; ingredients[1:2]; ingredients[[1]]
Note: both $ and [[]] can index only one element.
Last time we started talking about the apply function.
Let's review how this works for matrices.
> args(apply)
function (X, MARGIN, FUN, ...)
NULL
> m <- matrix(1:4, nrow = 2)
> m
[,1] [,2]
[1,] 1 3
[2,] 2 4
> apply(m, 2, paste, collapse = "")
[1] "12" "34"
The arguments here are: the matrix; which dimension to operate on
(1 for rows, 2 for columns); the function; and any additional
arguments to FUN.
The lapply and sapply functions both apply a specified
function FUN to each element of a list. The former returns
a list object and the latter returns a vector when
possible. Again, both allow passing of additional
arguments to FUN through the ... argument.
> random.draws <- list(x1 = rnorm(10), x2 = rnorm(100000))
> lapply(random.draws, mean)
$x1
[1] 0.0827779
$x2
[1] 0.001470952
> sapply(random.draws, mean)
x1 x2
0.082777901 0.001470952
The tapply function allows us to apply a function to
different parts of a vector, where the parts are indexed by a
factor or list of factors.
Single factor:
> grp <- factor(rep(c("Control", "Treatment"), each = 4))
> grp
[1] Control Control Control Control
[5] Treatment Treatment Treatment Treatment
Levels: Control Treatment
>
> effect <- rnorm(8) # Make up some fake data
> tapply(effect, INDEX = grp, FUN = mean)
Control Treatment
0.2180109 -0.2433582
Multiple factors:
> sex <- factor(rep(c("Female", "Male"), times = 4))
> sex
[1] Female Male Female Male Female Male Female Male
Levels: Female Male
> tapply(effect, INDEX = list(grp, sex), FUN = mean)
Female Male
Control 0.3634973 0.07252456
Treatment -0.2860360 -0.20068040
Many data sets are stored as tables in text files. The
easiest way to read these into R is using either the
read.table or read.csv function.
As you can see in help(read.table), there are quite a few
options that can be changed. Some of the important ones
are
file - the file name or URL
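A hedged sketch of a typical call (the file names and option values here are made up for illustration):

> dat  <- read.table(file = "heights.txt", header = TRUE, sep = "")
> dat2 <- read.csv(file = "heights.csv")   # read.csv assumes comma-separated values with a header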
if/else statements
$\int f(x)\,dx = 1$
3. $P(a < X < b) = \int_a^b f(x)\,dx$
The CDF is $F(x) = \int_{-\infty}^{x} f(t)\,dt$.
In R, there are many built-in functions for handling
distributions, some of which we have seen already.
The prefixes of the functions indicate what they do:
d - evaluate the PDF
p - evaluate the CDF
q - evaluate the inverse CDF
r - take a random sample
Note that the functions prefixed by d, p, and q are all
calculating mathematical quantities.
However, once we have a random sample, we can also
estimate the PDF, CDF, and inverse CDF....
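For example, for the normal distribution the four functions are dnorm, pnorm, qnorm, and rnorm:

> dnorm(0)          # PDF of the standard normal at 0
> pnorm(1.96)       # CDF at 1.96, about 0.975
> qnorm(0.975)      # inverse CDF, about 1.96
> rnorm(5)          # five random draws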
A histogram is a type of density estimator.
Recall that
For each bin of a histogram (with lower endpoint a and
upper endpoint b), we count the number of observations
falling into the bin, i.e.
If we properly normalize each of these quantities, the
total area of the rectangles in the histogram is one, just
like the area under a PDF. You can do this automatically in
R with hist(x, prob = TRUE).
$\int_a^b f(x)\,dx = P(a < X < b)$
$\sum_{i=1}^{n} I\{a < X_i \le b\}$
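As a quick sketch with made-up data, we can check that the normalized histogram tracks the true PDF:

> x <- rnorm(1000)
> hist(x, prob = TRUE)          # total area of the bars is one
> curve(dnorm(x), add = TRUE)   # overlay the true PDF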
The empirical CDF uses the same sort of counting idea.
Dene
We are estimating a probability by a proportion. Another
way to think of it is that we estimate the PDF by a
discrete distribution which assigns probability 1/n to each
data point.
Exercise: Write a function which calculates the empirical
CDF. It should take vectors sample and x.
$\hat{F}(x) = \frac{1}{n}\sum_{i=1}^{n} I\{X_i \le x\}$
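One possible sketch of a solution (my.ecdf is a made-up name; R also has a built-in ecdf function):

my.ecdf <- function(sample, x) {
  # for each element of x, the proportion of sample values less than or equal to it
  sapply(x, function(xi) mean(sample <= xi))
}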
Finally, the quantile function in R returns the sample
quantiles, defined by
$\hat{F}^{-1}(q) = \inf\{x : \hat{F}(x) \ge q\}$
Note that we just plug in the empirical CDF to the
definition of the quantile function.
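For example (values are illustrative):

> x <- rnorm(100)
> quantile(x, probs = c(0.25, 0.5, 0.75))   # sample quartiles, including the median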
We'll talk more next time about specific distributions.
For now, let's consider the role that simulation can play in
helping us understand statistics.
We can think of probability theory as complementary to
statistical inference.
(Diagram: probability reasons from the distribution to the observed data; inference reasons from the observed data back to the distribution.)
A statistic is a function of a sample, for example the
sample mean or a sample quantile.
Statistics are often used as estimators of quantities of
interest about the distribution, called parameters.
Estimators are random variables; parameters are not.
In simple cases, we can study the distribution of the
statistic analytically. For example, we can prove that
under mild conditions the standard error of the sample
mean decreases at a rate proportional to $1/\sqrt{n}$.
In more complicated cases, we turn to simulation.
Whereas mathematical results are symbolic, in terms of
arbitrary parameters and sample size, on a computer we
must specify particular values.
A single experiment looks something like this:
To study the distribution of the statistic, we repeat the
whole experiment B times. The larger B is, the better our
approximation of the distribution.
(Diagram: a particular choice of parameters and sample size produces a sample $X_1, X_2, \ldots, X_n$, which is reduced to a single statistic.)
Steps in carrying out a simulation study:
1. Specify what makes up an individual experiment: sample
size, distributions, parameters, statistic of interest.
2. Write an expression or function to carry out an
individual experiment and return the statistic.
3. Determine what inputs, if any, to vary.
4. For each combination of inputs, repeat the experiment
B times, providing B samples of the statistic.
5. For each combination of input, summarize the empirical
CDF of the statistic of interest, e.g. look at the sample
mean or standard error.
6. State and/or plot the results.
Example: Find the standard error of the median when
sampling from the normal distribution. How does it vary
with the sample size and with the standard deviation?
Recall our example of a simulation study from last time:
Find the standard error of the median when sampling
from the normal distribution. How does it vary with the
sample size and with the standard deviation?
Steps in carrying out a simulation study:
1. Specify what makes up an individual experiment: sample
size, distributions, parameters, statistic of interest.
2. Write an expression or function to carry out an
individual experiment and return the statistic.
3. Determine what inputs, if any, to vary.
4. For each combination of inputs, repeat the experiment
B times, providing B samples of the statistic.
5. For each combination of input, summarize the empirical
CDF of the statistic of interest, e.g. look at the sample
mean or standard error.
6. State and/or plot the results.
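A minimal sketch of this study for a single combination of inputs (the values of n, sigma, and B are illustrative):

one.exp <- function(n, sigma) median(rnorm(n, sd = sigma))
B <- 1000
medians <- replicate(B, one.exp(n = 25, sigma = 1))
sd(medians)   # approximate standard error of the median

To vary the inputs, we can repeat this for each row of, say, expand.grid(n = c(10, 25, 100), sigma = c(1, 2)).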
A quick review of some other probability distributions
available in R:
abbreviation - Distribution
unif - Uniform(a, b): $f(x) = \frac{1}{b-a}$ for $a \le x \le b$, 0 otherwise
exp - Exponential($\lambda$): $f(x) = \lambda e^{-\lambda x}$ for $x > 0$, 0 otherwise
pois - Poisson($\lambda$): $f(x) = \frac{\lambda^x e^{-\lambda}}{x!}$ for $x = 0, 1, 2, \ldots$
binom - Binomial
What if a distribution is not available in R? For instance,
there is no built-in Bernoulli distribution.
Well, in this case you could just use binom with size = 1, or
sample(0:1, 1, prob = c(1 - p, p)).
We can also derive certain distributions from others. For
example, last week we sampled from the negative
binomial distribution by explicitly counting the number of
trials until we got the desired number of heads. Can we
sample from the Bernoulli in some other way?
$f(x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, \ldots, n$
If the CDF has a closed-form inverse (the quantile
function), there is a simple method for generating random
values from the distribution.
The inverse CDF method is simple:
1. Generate n samples from a standard uniform
distribution. Call this vector u. In R, u <- runif(n).
2. Take y <- F.inv(u), where F.inv computes the
inverse CDF of the distribution we want.
We can prove that the CDF of the random values
produced in this way is exactly F. For $U \sim \mathrm{Uniform}(0, 1)$,
$$P(U \le u) = \begin{cases} 0 & u < 0 \\ u & 0 \le u \le 1 \\ 1 & u > 1 \end{cases}$$
Therefore,
$$P(Y \le y) = P(F^{-1}(U) \le y) = P(F(F^{-1}(U)) \le F(y)) = P(U \le F(y)) = F(y).$$
We used the fact that F is nondecreasing in the second line.
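As a small sketch, consider the Exponential($\lambda$) distribution, where $F(x) = 1 - e^{-\lambda x}$ and $F^{-1}(u) = -\log(1 - u)/\lambda$:

> u <- runif(1000)
> y <- -log(1 - u) / 2   # same distribution as rexp(1000, rate = 2)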
Example: Triangle distribution with endpoints at a and b
and center at c.
We need to:
1. Find the CDF
2. Find the inverse CDF
3. Write a function to carry
out the inverse CDF method.
(Figure: density of a triangle distribution on [0, 2].)
$$f(x) = \begin{cases} \frac{2(x-a)}{(b-a)(c-a)} & a \le x < c \\ \frac{2(b-x)}{(b-a)(b-c)} & c \le x \le b \\ 0 & \text{otherwise} \end{cases}$$
We ended last time by talking about the inverse-CDF
method:
1. Generate n samples from a standard uniform
distribution. Call this vector u. In R, u <- runif(n).
2. Take y <- F.inv(u), where F.inv computes the
inverse CDF of the distribution we want.
Example: Triangle distribution with endpoints at a and b
and center at c.
We need to:
1. Find the CDF
2. Find the inverse CDF
3. Write a function to carry
out the inverse CDF method.
(Figure: density of a triangle distribution on [0, 2].)
$$f(x) = \begin{cases} \frac{2(x-a)}{(b-a)(c-a)} & a \le x < c \\ \frac{2(b-x)}{(b-a)(b-c)} & c \le x \le b \\ 0 & \text{otherwise} \end{cases}$$
Using the fact that the total area is one, and that the area
of a triangle is one half base times height, we find that
$$F(x) = \begin{cases} 0 & x < a \\ \frac{(x-a)^2}{(b-a)(c-a)} & a \le x < c \\ 1 - \frac{(b-x)^2}{(b-a)(b-c)} & c \le x \le b \\ 1 & x > b \end{cases}$$
Inverting this function, we have
$$F^{-1}(y) = \begin{cases} a + \sqrt{y(b-a)(c-a)} & 0 \le y < \frac{c-a}{b-a} \\ b - \sqrt{(1-y)(b-a)(b-c)} & \frac{c-a}{b-a} \le y \le 1 \end{cases}$$
Now we can write our function.
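A sketch of step 3, using the inverse CDF just derived (rtriangle is a made-up name, not a built-in function):

rtriangle <- function(n, a, b, c) {
  u <- runif(n)
  cutoff <- (c - a) / (b - a)
  ifelse(u < cutoff,
         a + sqrt(u * (b - a) * (c - a)),
         b - sqrt((1 - u) * (b - a) * (b - c)))
}
y <- rtriangle(1000, a = 0, b = 2, c = 0.5)   # parameter values are illustrative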
We'll finish off this section on simulation by going over
one more example of designing a simulation study, dealing
with the risk of the James-Stein estimator.
In 1956, Charles Stein rocked the world of statistics when
he proved that the maximum likelihood estimator (MLE)
is inadmissible (that is, we can always find a better
estimator) in this simple problem when $d \ge 3$.
Let
$$Y_i \stackrel{\text{indep}}{\sim} N(\theta_i, \sigma^2), \quad \theta_i \in \mathbb{R}, \quad i = 1, \ldots, d.$$
The MLE for the vector $\theta$ is just the vector of
observations $Y$, which seems intuitively sensible.
To state Stein's result, we first have to talk about risk.
Speaking somewhat more formally, a loss function
describes the consequences of using a particular
estimator $\hat{\theta}$ when the true parameter is $\theta$.
A common loss function is the squared error
$$L(\theta, \hat{\theta}) = (\theta - \hat{\theta})^2,$$
which we can generalize to multiple dimensions by
summing the squared errors in each dimension:
$$L(\theta, \hat{\theta}) = \sum_{i=1}^{d} (\theta_i - \hat{\theta}_i)^2 = ||\theta - \hat{\theta}||^2.$$
But $\hat{\theta}$ is random, because it depends on the data. We call
the expected value of the loss for a given $\theta$ the risk function.
For the MLE $\hat{\theta} = Y$, the risk is
$$E\sum_{i=1}^{d} (\theta_i - Y_i)^2 = \sum_{i=1}^{d} E(\theta_i - Y_i)^2 = d\sigma^2.$$
For $d \ge 3$, the James-Stein estimator is
$$\hat{\theta}_{JS} = \left(1 - \frac{(d-2)\sigma^2}{||Y||^2}\right) Y.$$
Recall the steps one more time:
1. Specify what makes up an individual experiment: sample
size, distributions, parameters, statistic of interest.
2. Write an expression or function to carry out an
individual experiment and return the statistic.
3. Determine what inputs, if any, to vary.
4. For each combination of inputs, repeat the experiment
B times, providing B samples of the statistic.
5. For each combination of input, summarize the empirical
CDF of the statistic of interest, e.g. look at the sample
mean or standard error.
6. State and/or plot the results.
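A rough sketch of a single James-Stein experiment and its replication (the choices of theta, sigma, and B below are illustrative):

js.experiment <- function(theta, sigma) {
  d <- length(theta)
  Y <- rnorm(d, mean = theta, sd = sigma)
  js <- (1 - (d - 2) * sigma^2 / sum(Y^2)) * Y
  c(mle = sum((Y - theta)^2), js = sum((js - theta)^2))   # squared-error losses
}
losses <- replicate(1000, js.experiment(theta = rep(0, 10), sigma = 1))
rowMeans(losses)   # approximate risks of the MLE and the James-Stein estimator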
One more quick note about summarizing your simulation
results....
When you are reporting the means of your simulated
distributions, it's a good idea to add an indication of your
uncertainty as well.
The Central Limit Theorem tells us that the sample mean of
the simulated distribution is, with a sufficiently large
sample, approximately normally distributed. We can use
this to form a 95% confidence interval:
$$\bar{X} \pm 2\,\mathrm{SD}/\sqrt{B}$$
Note that we have control over B, so we can make the
intervals as narrow as we like!
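For example, if stat is a hypothetical vector holding the B simulated values of the statistic:

> mean(stat) + c(-2, 2) * sd(stat) / sqrt(length(stat))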
UNIX Basics
Operating systems
An operating system (OS) is a piece of software that
controls the hardware and other pieces of software on
your computer.
The most popular OS today, Microsoft Windows, uses a
graphical user interface (GUI) for you to interact with the
OS. This is easy to learn but not very powerful.
UNIX, on the other hand, is hard at first to learn, but it
allows you vastly more control over what your computer
can do. There are actually many different flavors of
UNIX, but what we'll cover applies to almost all of them.
The differences between, say, Windows and UNIX stem
from an underlying philosophy about what software
should do.
Windows: Programs are large, multi-functional. Example:
Microsoft Word.
UNIX: Many small programs, which can be combined to
get the job done. A toolbox approach. Example: stop all
my (cgk's) processes whose name begins with "cat" and a
space:
ps -u cgk | grep "[0-9] cat " | awk '{print $2}' | xargs kill
The UNIX kernel is the part of the OS that actually
carries out basic tasks.
The UNIX shell is the user interface to the kernel. Like
flavors of UNIX, there are also many different shells. For
this course, it doesn't matter which one you use. The
default on the lab computers is called tcsh.
(Screenshot: the prompt - yours will differ.)
The first thing you need to know about UNIX is how to
work with directories and files. Technically, everything in
UNIX is a file, but it's easier to think of directories as you
would folders on Windows or Mac OS.
Directories are organized in an inverted tree structure.
To see the directory you're currently in, type the
command pwd (print working directory).
There are two special directories: The top-level
directory, named /, is called the root directory.
Your home directory, named ~, contains all your files. For
Mary, ~ and /users/mary mean the same thing.
To create a new directory, use the command mkdir. Then
to move into it, use cd.
$ pwd
/Users/cgk
$ mkdir unixexamples
$ cd unixexamples
$ ls
$ ls -a
. ..
ls -a means to show all files, including the hidden files
starting with a dot (.).
The two hidden files here are special and exist in every
directory. "." refers to the current directory, and ".."
refers to the directory above it.
This brings us to the distinction between relative and
absolute path names. (Think of a path like an address in
UNIX, telling you where you are in the directory tree.)
You may have noticed that I typed cd unixexamples, rather
than cd /Users/cgk/unixexamples.
The first is the relative path; the second is the absolute
path.
To refer to a file, you need to either be in the directory
where the file is located, or you need to refer to it using a
relative or absolute path name.
Example:
$ pwd
/Users/cgk/unixexamples
$ echo "Testing 1 2 3" > test.txt
$ ls
test.txt
$ cat test.txt
Testing 1 2 3
$ cd ..
$ cat test.txt
cat: test.txt: No such file or directory
$ cat unixexamples/test.txt
Testing 1 2 3
Note that file names must be unique within a particular
directory, but having, say, both /Users/cgk/test.txt and
/Users/cgk/unixexamples/test.txt is OK.
Commands, arguments, and options
We've already started using these; now let's define them
more precisely.
The general syntax for a UNIX command looks like this:
$ command -options argument1 argument2
(The number of arguments may vary.) An argument
comes at the end of the command line. It's usually the
name of a file or some text.
Example: move/rename a file.
$ mv test.txt newname.txt
Options come between the name of the command and
the arguments, and they tell the command to do
something other than its default. They're usually prefaced
with one or two hyphens.
$ pwd
/Users/cgk
$ rmdir unixexamples
rmdir: unixexamples: Directory not empty
$ rm -r unixexamples
$ ls
Desktop Movies Rlibs
Documents Music Sites
Icon? Pictures Work
Library Public bin
MathematicaFonts README
mathfonts.tar
To look at the syntax of any particular UNIX command,
type man (for manual) and then the name of the
command.
The two most important parts of the man page are
labeled SYNOPSIS and DESCRIPTION. These are very
much like the Usage and Arguments sections in R's help pages.
SYNOPSIS shows you the syntax for a particular
command. Bracketed arguments are optional.
DESCRIPTION tells you what all the options do.
Press the space bar to scroll forward through the man
page, b to go backward, and q to exit.
You can refer to multiple files at once using wildcards. The
most common one is the asterisk (*). It stands in for
anything (including nothing at all).
$ ls
AGing.txt Bing.xt Gagging.text Going.nxt ing.ext
$ ls G*
Gagging.text Going.nxt
$ ls *.xt
Bing.xt
The question mark (?) is similar, except it can only
represent a single character.
$ ls ?ing.xt
Bing.xt
Finally, square brackets match any single character from the
set listed within the brackets.
$ ls
AGing.txt Bing.xt Gagging.text Going.nxt ing.ext
$ ls [A-G]ing.*
Bing.xt
The wildcards can also be combined.
$ ls *G*
AGing.txt Gagging.text Going.nxt
$ ls *i*.*e*
Gagging.text ing.ext
We'll cover text matching in a lot more detail next week
when we talk about regular expressions.
A recap of commands so far
pwd - print working directory
ls - list contents of current directory
ls -a - list contents, including hidden files
mkdir - create a new directory
cd dname - change directory to dname
cd .. - change to parent directory
cd ~ - change to home directory
mv - move or rename a file
rm - remove a file
rm -r - remove all lower-level files
Here are a few more handy ones:
wc -l - count the number of lines in a file
$ wc -l halfdeg.elv
134845 halfdeg.elv
head -nx - look at the first x lines of a file
$ head -n5 halfdeg.elv
Tyndall Centre grim file created on 13.05.2003 at 13:52 by Dr. Tim Mitchell
.elv = elevation (km)
0.5deg MarkNew elev
[Long=-180.00, 180.00] [Lati= -90.00, 90.00] [Grid X,Y= 720, 360]
[Boxes= 67420] [Years=1975-1975] [Multi= 0.0010] [Missing=-999]
tail - like head, but look at the end of the file
cp - copy a file
$ cp unixexamples/Bing.xt .
cat - print the contents of a file
echo - write arguments
The real power in UNIX, however, comes from stringing
these commands together. We'll talk about this next
time.
Today, we'll talk about
Some interfaces between R and UNIX
- getting the results of UNIX commands from within R
- running R in BATCH mode and monitoring its progress
Putting UNIX commands together using
- redirection
- pipes
Manipulating data in UNIX using filtering commands in
combination with redirection and pipes
The system function in R allows you to execute a UNIX
command and either print the result to the screen or
store it as an R object (argument intern = TRUE).
> system("ls")
datagen.R group1.dat group2.dat group3.dat
> system("head -n2 *.dat")
==> group1.dat <==
height weight
65.4 134.9
==> group2.dat <==
height weight
65.7 145.7
==> group3.dat <==
height weight
63.8 138.9
Goal: Read in all the data files and put them in a
single matrix with an extra column for group.
Referring to your UNIX handout, what should our
strategy be?
> nlines <- system("wc -l *.dat", intern = TRUE)
> nlines
[1] " 3 group1.dat" " 4 group2.dat"
[3] " 3 group3.dat" " 10 total"
> nfiles <- length(nlines) - 1
> nlines <- as.numeric(substring(nlines[1:nfiles], 7, 8)) - 1
> nlines
[1] 2 3 2
> hw <- matrix(NA, nrow = sum(nlines), ncol = 3)
> startline <- 1
> for(group in 1:nfiles){
+ temp <- read.table(file = paste("group", group,
+ ".dat", sep = ""),
+ header = TRUE)
+ index <- startline:(startline+nrow(temp)-1)
+ hw[index,1:2] <- as.matrix(temp)
+ hw[index,3] <- as.matrix(group)
+ startline <- startline + nrow(temp)
+ }
> colnames(hw) <- c("height", "weight", "group")  # hw is a matrix, so use colnames
BATCH jobs in R are useful whenever
- You have a long job and you want to be able to use the
computer for other things in the meantime.
- You want to log out of the machine while the job is
running and come back to it later.
- You're running the job on a remote machine, and again
you want to log out.
- You want to be courteous to other users of the machine
by decreasing the priority of the job.
To start a BATCH job, use
nice R CMD BATCH scriptfile.R outfile.Rout &
Here scriptfile.R is the script to run, outfile.Rout is where
output that would otherwise be printed to the screen goes,
the trailing & indicates that you want to run the job in the
background, and nice gives the job lower priority.
A few other things to keep in mind:
- scriptfile.R should require no input from the user. For
example, don't use identify.
- Graphics should be created by surrounding the relevant
code with pdf(file = "filename.pdf") and dev.off(), rather
than using dev.print(pdf, file = "filename.pdf").
- In simulations it can be helpful to include a line like
if(i %% 10 == 0) print(paste("Iteration", i))
Then you can monitor it using tail -f outfile.Rout.
- By default, the workspace will be saved in .RData. You
can also save specific objects using the save function.
To see information about currently running processes,
just type top. There are arguments to top that allow you
to sort by CPU usage, memory, etc. See man top for more
details.
The number at the beginning of the line is called the
process ID, or PID.
To kill a particular process, type kill PID, substituting the
correct PID.
Sometimes you want to see the list of all processes in a
non-interactive way. For example, you might want to pipe
the results through a filter, as we'll discuss next.
On BSD UNIX systems (like the Apple machines in the
lab), ps -aux will list all processes.
On other systems, ps -ef does the trick.
We'll use ssh and sftp to start a new UNIX session on a
remote machine and to send files back and forth between
our computer and the remote computer.
To log into a statistics department machine, type
ssh -l uname scf-ugNN.berkeley.edu
where uname is the user name you've been assigned for
the course, and NN is a number between 01 and 27.
You will be prompted for your password. Your starting
directory is your home directory on the network. Note
this is the same no matter which department computer
you log into!
To transfer files back and forth, first type
sftp -l uname scf-ugNN.berkeley.edu
You can use pwd, cd, and ls just as you would at the usual
prompt to find the right remote file or directory. You can
also use lpwd, lcd, and lls to move around the local
machine.
To copy a file from the remote computer to your
computer, type get nameoffile.
To copy a file from your computer to the remote
computer, type put nameoffile.
Type exit to quit.
Redirection and pipes are really at the heart of the UNIX
philosophy, which is to have many small tools, each one
suited for a particular job.
Redirection refers to changing the input and output of
individual commands/programs.
The standard input or STDIN is usually your keyboard.
The standard output or STDOUT is usually your
terminal (monitor).
As an example, if we type cat at the prompt and hit
return, the computer will accept input from us until it hits
an end-of-file (EOF) marker, which on most systems is
CTRL-D. Each time we hit return, our input is printed
to the terminal.
We can redirect as follows:
> redirects STDOUT to a file
< redirects STDIN from a file
>> redirects STDOUT to a file, but appends rather than
overwriting.
(There's also a <<, but its use is more advanced than we'll
cover.)
Here are two examples:
$ cat > temp.txt
$ sort < temp.txt
Try it out!
The idea behind pipes is that rather than redirecting
output to a le, we redirect it into another program.
Another way to say this is that STDOUT of one program
is used as STDIN to another program.
A common use of pipes is to view the output of a
command through a pager, like less. This is particularly
useful if the output is very long.
$ ls | less
Note that the data ows from left to right. See the UNIX
handout for more details on less.
Pipe
A program through which data is piped is called a filter.
We've already seen a few filters: head, tail, and wc.
Two more common filters are
sort - sort lines of text files alphabetically
uniq - strip duplicate lines when they follow each other
$ cat somenumbers.txt
What will be the output of:
cat somenumbers.txt | sort
cat somenumbers.txt | uniq
cat somenumbers.txt | sort | uniq
$ cat somenumbers.txt | sort
One
One
One
Three
Two
Two
Two
$ cat somenumbers.txt | uniq
One
Two
One
Two
Three
One
$ cat somenumbers.txt | sort | uniq
One
Three
Two
Today: A quick wrap-up on UNIX filters, then on to
regular expressions.
There will be a short assignment posted on bSpace later
today to give you some practice with these.
Recall the filters we've seen so far:
head - first lines of a file
tail - last lines of a file
wc - word (or line or character or byte) count
sort - sort lines of text files alphabetically
uniq - strip duplicate lines when they follow each other
Here are two more useful filters:
grep - print lines matching a pattern (We'll talk about
patterns more shortly -- for now just think of the pattern
as requiring an exact match.)
$ grep save *.R
Print all lines in any file ending with .R which contain the
word (pattern) save.
cut - select portions of each line of a file
$ cut -d -f 3-7
Here are some practice problems:
On many systems, the file /etc/passwd shows information
about the registered users for the machine. A quick look
at the file shows there are also some notes at the top.
1. Determine the total number of users.
2. Sort the users and display the information for the last
five users, alphabetically speaking.
3. Show just the usernames for these entries.
4. Put the usernames in a file called lastusers.txt.
Regular Expressions
Regular expressions give us a powerful way of matching
patterns in text data.
Example 1: election data from three different datasets.
We know these are the same places, but how can the
computer recognize that?
Example 2: Creating variables that predict whether an
email is SPAM
- numbers or underscores in the sending address
- all capital letters in the subject line
- fake words like Vi@graa
- number of exclamation points in the subject line
- received time in the current time zone
Example 3: Mining the State of the Union addresses
How long are the speeches? How do the distributions of
certain words change over time? Which presidents have
given similar speeches?
The language of regular expressions allows us to carry
out some common tasks, such as
$$\mathrm{Cov}(X, Y) = E[(X - EX)(Y - EY)] = \sqrt{\mathrm{Var}(X)}\sqrt{\mathrm{Var}(Y)}\,\mathrm{Cor}(X, Y) = \sigma^2\,\mathrm{Cor}(X, Y) \quad \text{if } \mathrm{Var}(X) = \mathrm{Var}(Y) = \sigma^2$$
A few simplifying assumptions about the structure of the
covariance function are
1. stationarity - the covariance between Z(s) and Z(s') depends only on the relative locations, i.e. s - s'.
2. isotropy - the covariance between Z(s) and Z(s') depends only on the distance between the locations, i.e. ||s - s'||.
If both stationarity and isotropy hold, then the covariogram
can be used to study the form of the spatial covariance
function.
The (empirical) covariogram is a scatterplot that puts
distance on the x-axis and covariance on the y-axis.
If we had multiple replications, such as independent
observations in time, then for each pair of locations we
could calculate their covariance over those replications
and add one point to the covariogram. (How many
points would there be in all?)
However, if we have just a single replication, we can look
at pairs whose distance are within a given window of the
distance we want to plot.
Let
$$I_{d,\epsilon} = \{(i, j) : i \ne j,\; ||s_i - s_j|| \in (d - \epsilon, d + \epsilon)\}.$$
Then the empirical covariogram is
$$\hat{C}(d) = \frac{1}{\#I_{d,\epsilon}} \sum_{(i,j) \in I_{d,\epsilon}} (Z_i - \bar{Z}_{(i)})(Z_j - \bar{Z}_{(j)}),$$
where $\bar{Z}_{(i)}$ is the mean over all the i's and $\bar{Z}_{(j)}$ is the mean over all the j's.
Recall from last time:
Geostatistical data consist of observations associated with
a set of locations.
Today we'll talk about how to interpolate a geostatistical
data set.
Example: Average
surface ozone at
monitoring stations
Our goal: Estimate
ozone over a grid
of locations
(Figure: map of average surface ozone at the monitoring stations; longitude -95 to -75, latitude 34 to 42, ozone scale 35 to 65.)
Plotting the data
You'll need to load the packages fields and maps.
From fields, we use
image.plot - color plot of a data set on a regular grid
as.image - take an irregularly spaced data set and put it on
a grid with nrow rows and ncol columns.
From maps, we use
map("state", add = TRUE)
Other databases are available - try "county", "usa", "world".
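A hedged plotting sketch using the functions above; ozone.df (with columns lon, lat, and ozone) is a hypothetical data frame standing in for the data set, the grid size is arbitrary, and the nrow/ncol argument names follow the description above (they may be called nx/ny in some versions of fields):

library(fields)
library(maps)
gridded <- as.image(ozone.df$ozone, x = cbind(ozone.df$lon, ozone.df$lat),
                    nrow = 64, ncol = 64)
image.plot(gridded)        # color plot of the gridded values
map("state", add = TRUE)   # overlay state boundaries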
Now we plot the correlogram and use it to estimate the
spatial covariance.
(Figure: empirical covariogram; estimated covariance versus distance, distance from 0 to 500.)
The idea is that we t a parametric model to the
correlogram. In other words, we specify a functional form
for the covariance, where the function depends on
certain parameters, and then we estimate those
parameters.
A common parametric model takes the covariance to
decay exponentially with distance
$$\mathrm{Cov}(Z(s), Z(s')) = \sigma^2 \exp\{-||s - s'||/\rho\}$$
Here $\sigma^2$ is the variance at a single location, and $\rho$ controls the rate of decay; a higher $\rho$ means higher correlation at a given distance.
One way to estimate the parameters in the covariance
function is to use nonlinear least squares.
(Figure: the empirical covariogram again; estimated covariance versus distance.)
We minimize the sum of squared residuals over the parameters, using the function nls.
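A hedged sketch of this nonlinear least squares fit, assuming covgram is a data frame holding the empirical covariogram with columns dist and cov (hypothetical names), and rough starting values read off the plot:

fit <- nls(cov ~ sigma2 * exp(-dist / rho),
           data = covgram,
           start = list(sigma2 = 20, rho = 100))
coef(fit)   # estimates of sigma2 and rho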
Now we need to determine the set of locations at which
we want to do the prediction.
Remember expand.grid from when we were running
simulation studies? It works here too, just be sure to
convert to a matrix, rather than a data frame.
You need to determine the resolution of the grid. Higher
resolution looks better, but takes longer.
Also be sure not to extrapolate, i.e. predict beyond the
range of the data.
(Figure: a fairly dense grid of prediction locations, plotted as lon versus lat.)
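A small sketch of building such a grid with expand.grid (the longitude and latitude ranges are read off the plot and are illustrative):

grid.locs <- as.matrix(expand.grid(lon = seq(-95, -75, by = 0.5),
                                   lat = seq(32, 42, by = 0.5)))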
Ok, now it is time to predict.
We will use a linear combination of the observations to
give us a prediction at a new location. The only question
then is what weights to assign to each observation.
We choose the weights to minimize the variance of the
resulting estimate. The best predictor minimizes this
quantity, and we call it the BLUP, for best linear unbiased
predictor.
It turns out the weights for the BLUP are easy to derive if
you know the true covariance function....
First we form the covariance matrix for the observations.
This contains the covariances for every pair of
observations.
We calculate a similar matrix for each pair where one
location comes from the set of observations and the
other location comes from the new grid.
$$\Sigma_{ij} = \mathrm{Cov}(Z(s_i), Z(s_j)) = \sigma^2 \exp\{-||s_i - s_j||/\rho\}$$
$$\tilde{\Sigma}_{ij} = \mathrm{Cov}(Z(s_i), Z(\tilde{s}_j)) = \sigma^2 \exp\{-||s_i - \tilde{s}_j||/\rho\}, \quad \text{where } \tilde{s}_j \text{ is a new location.}$$
If the vector of observations Z has mean zero, the BLUP
at the new locations is simply
$$\tilde{\Sigma}^\top \Sigma^{-1} Z,$$
where the weight matrix $\tilde{\Sigma}^\top \Sigma^{-1}$ depends on particular values
of the parameters as well as the distances.
This procedure is called kriging, after Daniel Krige, a South
African gold miner.
Now, if we don't know the parameters, we can plug in our
estimates when we compute the covariance matrices. This
predictor is no longer the BLUP. (Why not?)
We can also estimate the variance in our predictions.
If the parameters are known, then the vector of variances
of the BLUP at each predicted location is
$$\sigma^2 \mathbf{1}_m - \mathrm{diag}\{\tilde{\Sigma}^\top \Sigma^{-1} \tilde{\Sigma}\}.$$
We plug in our estimates to approximate this variance.
The variance will be small when the predicted location is
close to the original locations.
In fact, with the covariance function we're currently using,
the variance will be exactly zero if we predict at a
location where we already have an observation. (It
interpolates.)
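A hedged sketch of the plug-in kriging computation, assuming obs.locs (an n x 2 matrix of observation locations), z (the mean-zero observations), grid.locs from above, and estimates sigma2.hat and rho.hat from the nls fit (all hypothetical names):

library(fields)   # for rdist()
Sigma     <- sigma2.hat * exp(-rdist(obs.locs) / rho.hat)             # obs-obs covariances
Sigma.new <- sigma2.hat * exp(-rdist(grid.locs, obs.locs) / rho.hat)  # new-obs covariances
weights   <- Sigma.new %*% solve(Sigma)
pred      <- weights %*% z                                            # kriging predictions
pred.var  <- sigma2.hat - diag(weights %*% t(Sigma.new))              # approximate variances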
Announcements:
There will be no lab tomorrow. I'll assign your group
projects next Tuesday, and we'll have a short in-lab
assignment on SQL next Friday.
Be sure to attend on Tuesday so you can make plans with
your group.
If you want to practice the SQL statements we talk about
today, there are lots of free SQL interpreters (programs
that mimic a database server) online.
For example, see http://sqlcourse.com/select.html.
Databases and SQL
A database is a collection of data with information about
how the data are organized (meta-data). A database server
is like a web server, but responds to requests for data
rather than web pages.
We'll talk about relational database management systems
(RDBMS) and how to communicate with them using the
structured query language (SQL).
Why use a database?
Data denition
Data access
Privilege management
We'll concentrate on data access, assuming the database
is already available and we have the needed privileges.
Topics:
Newton-Raphson algorithm
Then we'll move on to some statistical examples and how
to use the built-in optimization methods in R.
Since every maximization problem can be rewritten as a
minimization problem, using -f(x) rather than f(x), we'll
assume from now on that we're minimizing.
Golden section search
Assume f(x) has a single minimum on the interval [a, b].
The golden section search algorithm iteratively shrinks the
interval over which we're looking for the minimum, until
the length of the interval is less than some preset
tolerance.
The name golden section comes from the fact that at
each iteration, we choose a new point to evaluate so that
we can reuse one of the points from the last iteration. It
works out that the way to do this is to maintain the so-
called golden ratio between the distances between points.
Consider two line segments c and d. They are said to be
in golden ratio if their sum c+d is to c as c is to d.
$$\frac{c+d}{c} = \frac{c}{d} \equiv \varphi \;\Rightarrow\; \frac{\varphi d + d}{\varphi d} = \varphi \;\Rightarrow\; \frac{1}{\varphi} + 1 = \varphi \;\Rightarrow\; \varphi^2 - \varphi - 1 = 0 \;\Rightarrow\; \varphi = \frac{1 + \sqrt{5}}{2} \approx 1.618034$$
An example:
Start with
$$x_1 = b - (b - a)/\varphi, \qquad x_2 = a + (b - a)/\varphi.$$
Now compare $f(x_1)$ and $f(x_2)$. Since $f(x_1) < f(x_2)$,
we know the minimum must be in $[a, x_2]$.
(Figure: f(x) on [1, 5] with the points a, x1, x2, b marked.)
An example (continued):
This time $f(x_1) > f(x_2)$, so the minimum must be in $[x_1, b]$.
Add a new point and maintain the golden ratio.
(Figure: the bracketing interval after one step, with the new points a, x1, x2, b marked.)
An example (continued):
Keep going like this....
(Figure: the bracketing interval after several more steps.)
An example (continued):
(Figure: the final, much smaller bracketing interval.)
When b - a is sufficiently small, we stop and report
a minimum of (a + b)/2. One can show that the error is
at most $(1 - 1/\varphi)(b - a)$.
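A minimal sketch of the algorithm just described (the test function and tolerance are illustrative):

golden <- function(f, a, b, tol = 1e-6) {
  phi <- (1 + sqrt(5)) / 2
  x1 <- b - (b - a) / phi
  x2 <- a + (b - a) / phi
  while (b - a > tol) {
    if (f(x1) < f(x2)) {                       # minimum is in [a, x2]
      b <- x2; x2 <- x1; x1 <- b - (b - a) / phi
    } else {                                   # minimum is in [x1, b]
      a <- x1; x1 <- x2; x2 <- a + (b - a) / phi
    }
  }
  (a + b) / 2
}
golden(function(x) (x - 2)^2, 0, 5)   # returns approximately 2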
The Newton-Raphson algorithm can be used if the function
to be minimized has two continuous derivatives that may
be evaluated.
Again assume that there is a single minimum in [a, b]. If
the minimizing value $x^*$ is not at a or b, then $f'(x^*) = 0$.
If in addition $f''(x^*) > 0$, then $x^*$ is a minimum.
The main idea behind N-R is that if we have an initial
value $x_0$ that is close to the minimizing value, then we
can approximate
$$f'(x) \approx f'(x_0) + (x - x_0) f''(x_0).$$
Setting the right-hand side equal to zero gives
$$x_1 = x_0 - \frac{f'(x_0)}{f''(x_0)}.$$
We keep going in this way until $f'(x_n)$ is sufficiently close
to zero.
It's important to have a good initial guess; otherwise the
Taylor series approximation may be very poor and we
may even have $f(x_{n+1}) > f(x_n)$.
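A minimal sketch, assuming we can supply the first and second derivatives (fp and fpp are hypothetical names):

newton <- function(fp, fpp, x0, tol = 1e-8, maxit = 100) {
  x <- x0
  for (i in 1:maxit) {
    if (abs(fp(x)) < tol) break   # stop when f'(x) is close to zero
    x <- x - fp(x) / fpp(x)
  }
  x
}
# minimize f(x) = (x - 2)^2, so f'(x) = 2(x - 2) and f''(x) = 2
newton(function(x) 2 * (x - 2), function(x) 2, x0 = 0)   # returns 2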
We've already seen one example of numerical
optimization in action, when we used nonlinear least
squares to fit a curve to the covariogram for spatial data.
This is useful more generally, for nonlinear regression
models of the form
$$y = f(x, \beta) + \epsilon,$$
where y is the response variable, f is a nonlinear function,
x is a vector of covariates, $\beta$ is a vector of coefficients,
and $\epsilon \sim N(0, \sigma^2)$ is the error term.
Example: weight loss
Patients tend to lose weight at a diminishing rate. Here is
data from one patient with a linear fit superimposed.
(Figure: weight in kg (and lb) versus days, with a linear fit superimposed.)
Another proposed model is
$$y = \beta_0 + \beta_1 2^{-t/\theta} + \epsilon,$$
where $\beta_0$ is the ultimate lean weight (the asymptote),
$\beta_1$ is the total amount to be lost, and $\theta$ is the time
taken to lose half the amount remaining to be lost.
The function nls in R uses numerical optimization to find
the values of the parameters that minimize the sum of
squared errors
$$\sum_{i=1}^{n} (Y_i - f(x_i, \beta))^2.$$
The main arguments are a model formula, the data, and
starting values for the parameters.
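A hedged sketch of fitting the weight-loss model above with nls, using the wtloss data set from the MASS package (columns Days and Weight); the starting values are rough guesses:

library(MASS)
fit <- nls(Weight ~ b0 + b1 * 2^(-Days / theta),
           data = wtloss,
           start = list(b0 = 90, b1 = 95, theta = 120))
summary(fit)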
Numerical Optimization II: Fitting Generalized Linear Models
The normal linear model assumes that
1) the expected value of the outcome variable can be
expressed as a linear function of the explanatory
variables, and
2) the residuals (observations minus their expected
values) are independent and identically distributed with
a normal distribution.
Last time we talked about relaxing assumption 1), using
nonlinear regression models.
Today we'll talk about relaxing assumption 2), using what
are called generalized linear models.
First, a few words about the normal linear model.
With a single explanatory variable, it has the form
$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \qquad \epsilon_i \stackrel{\text{iid}}{\sim} N(0, \sigma^2), \quad i = 1, \ldots, n.$$
Recall that the least squares estimates of $\beta_0$ and $\beta_1$
minimize the residual sum of squares
$$RSS(\beta_0, \beta_1) = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2.$$
With a little calculus, we can minimize RSS explicitly....
The partial derivatives are
$$\partial RSS/\partial \beta_0 = -2 \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)$$
$$\partial RSS/\partial \beta_1 = -2 \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i) X_i$$
Setting each equal to zero and solving, we get
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}.$$
In this model, the least squares estimates are equal to the
maximum likelihood estimates, which we'll discuss next
time.
We can do similar calculations if we have more than one
explanatory variable.
Another way of thinking about what we've done in the
normal linear model is that we've expressed the mean of
the Y's as a linear combination of the X's:
$$E[Y_i] = \beta_0 + \beta_1 X_i + E[\epsilon_i] = \beta_0 + \beta_1 X_i.$$
To work with non-normal distributions, we're going to
slightly modify this idea.
First, a motivating example.
Prior to the launch of the space shuttle Challenger, there
was some debate about whether temperature had any
effect on the performance of a key part called an O-ring.
The following plot, with data from past flights, was used as
evidence that it was safe to launch at a temperature of 31°F.
One key problem with this analysis was that the
engineers left out the data from all the flights with no O-
ring problems, under the mistaken assumption that these
gave no extra information.
The solid rocket motors (labeled
3 and 4) are delivered to
Kennedy Space Center in four
pieces, and they are connected
on site using the O-rings. There
are actually two sets of O-rings
at each joint, but we'll focus on
the primary ones.
So in each launch, there are six
primary O-rings that can fail. If
any one fails, it can lead to a
catastrophic failure of the whole
shuttle.
Here are the data on past
failures of the primary O-
rings.
The data from past flights
come from rocket motors
that are retrieved from the
ocean after the flight. There
had been 24 shuttle
launches prior to
Challenger, of which the
rocket motors were
retrieved in 23 cases.
Temp Fail Date
1 66 0 4/12/81
2 70 1 11/12/81
3 69 0 3/22/82
4 68 0 11/11/82
5 67 0 4/4/83
6 72 0 6/18/83
7 73 0 8/30/83
8 70 0 11/28/83
9 57 1 2/3/84
10 63 1 4/6/84
11 70 1 8/30/84
12 78 0 10/5/84
13 67 0 11/8/84
14 53 2 1/24/85
15 67 0 4/12/85
16 75 0 4/29/85
17 70 0 6/17/85
18 81 0 7/29/85
19 76 0 8/27/85
20 79 0 10/3/85
21 75 2 10/30/85
22 76 0 11/26/85
23 58 1 1/12/86
We could fit a linear regression model to this data,
relating the expected number of failures to temperature.
Some problems with this approach are that
1) the residuals are clearly not iid normal
2) if we go out far enough, we actually predict a negative number of failures.
(Figure: number of O-ring failures versus temperature in degrees F, with a linear fit.)
Instead we will fit a logistic regression model.
This model is appropriate when the data have a binomial
distribution (counting the number of events out of n
trials), of which binary data is a special case with n = 1.
The expected value for a given trial is $p_i$, the probability
of an event when the explanatory variable $X = X_i$. We
relate this to the linear predictor using the logit function:
$$\log\frac{p_i}{1 - p_i} = \beta_0 + \beta_1 X_i$$
The ratio $p_i/(1 - p_i)$ is called the odds, so
the logit can also be called the log odds.
The case we are discussing (binomial outcome, logit
function) is a special case of a larger class of models called
generalized linear models.
Some other examples:
Normal outcome, identity link: $Y_i \sim N(\mu_i, \sigma^2)$, $\mu_i = \beta_0 + \beta_1 X_i$
Poisson outcome, log link: $Y_i \sim \mathrm{Pois}(\lambda_i)$, $\log(\lambda_i) = \beta_0 + \beta_1 X_i$
Note that in each case, the link function maps the space of
the parameter representing the mean of the distribution
($\mu_i$, $\lambda_i$, or $p_i$) to the real line, which is the space of the
linear predictor.
Generalized linear models can be t using an algorithm
called iteratively reweighted least squares. In R, this is
implemented in the function glm.
# First create a matrix with events and non-events
FN <- cbind(challenge$Fail, 6 - challenge$Fail)
# Fit using specified family, default link function
glm.fit <- glm(FN~Temp, data = challenge,
family = binomial)
# Now predict for a range of temperatures
tempseq <- seq(0, 90, length = 100)
pred <- predict(glm.fit, newdata = data.frame(Temp =
tempseq), se.fit = TRUE)
inv.logit <- function(x){1/(1+exp(-x))}
lines(tempseq, inv.logit(pred$fit))
lines(tempseq, inv.logit(pred$fit + 2*pred$se.fit), lty = 2)
lines(tempseq, inv.logit(pred$fit - 2*pred$se.fit), lty = 2)
The confidence interval at 31°F is quite wide, but the
point estimate probably still should have been cause for
alarm, especially since the temperature was colder than
anything that had been tried before.
(Figure: estimated probability of failure versus temperature in degrees F, with pointwise confidence bands.)
Interpretation of the coefficients is a bit trickier in logistic
regression models than it is in linear regression models.
Since
$$\log\frac{p_i}{1 - p_i} = \beta_0 + \beta_1 X_i,$$
when $X_i = 0$ we can say that the log odds equal $\beta_0$.
The interpretation regarding log odds is the easiest to
state but probably the hardest to understand.
Equivalently,
$$\frac{p_i}{1 - p_i} = \exp(\beta_0 + \beta_1 X_i),$$
so when $X_i = 0$ the odds equal $\exp(\beta_0)$.
Also, suppose $X_i = X_j + 1$. Then
$$\frac{p_i/(1 - p_i)}{p_j/(1 - p_j)} = \frac{\exp(\beta_0)\exp(\beta_1 X_i)}{\exp(\beta_0)\exp(\beta_1 X_j)} = \exp(\beta_1 (X_i - X_j)) = \exp(\beta_1).$$
So $\exp(\beta_1)$ gives the multiplicative change in odds
corresponding to a one unit change in X.
In particular, if X takes only the values 0 and 1, then $\exp(\beta_1)$
is the odds ratio for category 1 compared to category 0.
The interpretation in terms of probabilities is conditional
on other variables in the model, so we'll save it for after
we talk about using multiple regressors.
Another example, this time with multiple explanatory
variables
> library(MASS)
> birthwt[1:2,]
low age lwt race smoke ptl ht ui ftv bwt
85 0 19 182 2 0 0 0 1 0 2523
86 0 33 155 3 0 0 0 0 3 2551
'low' indicator of birth weight less than 2.5kg
'age' mother's age in years
'lwt' mother's weight in pounds at last menstrual period
'race' mother's race ('1' = white, '2' = black, '3' = other)
'smoke' smoking status during pregnancy
'ptl' number of previous premature labours
'ht' history of hypertension
'ui' presence of uterine irritability
'ftv' number of physician visits during the first trimester
'bwt' birth weight in grams
We are now confronted with the question of model
choice. There are a variety of principles that can guide us
here, but in the interest of time, let's consider one
criterion balancing goodness of fit with parsimony (the
number of parameters).
The Akaike information criterion is
$$AIC = 2k - 2\log(L),$$
where k is the number of parameters in the given model
and L is the maximized value of the likelihood for that
model. (For now, you can think of the likelihood as the
joint density of the data for a particular setting of the
parameter values.) Looking at this criterion, we favor
models with a lower value of AIC.
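In R, the AIC function computes this criterion for a fitted model, and step compares models with it. A small illustrative sketch with the birthwt data (the choice of explanatory variables here is arbitrary, not the model from lecture):

library(MASS)
fit <- glm(low ~ age + lwt + smoke + ht, data = birthwt, family = binomial)
AIC(fit)
step(fit)    # stepwise selection: drops or keeps terms to lower the AIC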
Numerical Optimization III: Maximizing Likelihoods
One of the canonical cases in which we need to
numerically optimize a function in statistics is to find the
maximum likelihood estimate.
For $X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} f(x; \theta)$, the likelihood function is
$$L(\theta) = \prod_{i=1}^{n} f(X_i; \theta).$$
The log-likelihood function is
$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(X_i; \theta).$$
The maximum likelihood estimator (MLE), which we'll
denote by $\hat{\theta}$, is the value of $\theta$ that maximizes $L(\theta)$.
(Note that this is equivalent to maximizing $\ell(\theta)$.)
Another important thing to note is that we can multiply
the likelihood by a constant (or add a constant to the log-
likelihood), and this does not change the location of the
maximum.
Therefore, we often work only with the part of the
likelihood that involves $\theta$. This part of the function is
called the kernel.
In simple cases, we can often find the MLE in closed form
by, for example, differentiating the log-likelihood with
respect to $\theta$, setting this equal to zero, and solving for $\theta$.
But things are often not this simple!
As an example, let's go back to the logistic regression
model. Remember, we have
$$Y_i \sim \mathrm{Ber}(p_i), \quad i = 1, \ldots, n,$$
where, inverting the logit function, we have
$$p_i = \frac{\exp\{\beta_0 + \beta_1 X_i\}}{1 + \exp\{\beta_0 + \beta_1 X_i\}},$$
and the likelihood function is
$$L(\beta_0, \beta_1) = \prod_{i=1}^{n} p_i^{Y_i} (1 - p_i)^{1 - Y_i},$$
substituting in the expression above.
We can't maximize this analytically as a function of $\beta_0$
and $\beta_1$, but we can easily write a function for the
likelihood or log-likelihood and have R do the work for
us....
# Function for the negative log-likelihood
logistic.nll <- function(beta, x, y, verbose = FALSE){
if(verbose) print(beta)
beta0 <- beta[1]; beta1 <- beta[2]
pvec <- exp(beta0 + beta1 * x) /
(1 + exp(beta0 + beta1 * x))
fvec <- y * log(pvec) + (1-y) * log(1 - pvec)
return(-sum(fvec))
}
# Use optim to minimize the nll
# par is a vector of starting values
# better starting values => faster convergence, and
# less chance of missing the global maximum
optim(par = c(0, 0), fn = logistic.nll,
x = x, y = y, verbose = TRUE)
In the case of logistic regression, minimizing the negative
log-likelihood using optim will give the same answer as
using glm with family = binomial.
However, there are many other models without built-in
functions like glm. One example is the spatial models we
discussed a few weeks ago.
Suppose we have a spatial field with mean zero and
covariance function
$$\mathrm{Cov}(Z(s_i), Z(s_j)) = \sigma^2 \exp\{-||s_i - s_j||/\rho\}.$$
Before, we estimated $\sigma^2$ and $\rho$ by finding the covariogram
and fitting a curve to it using nonlinear least squares.
If instead we model the vector of observations
$Z = (Z_1, Z_2, \ldots, Z_n)$ as multivariate normal, the likelihood of
$(\sigma^2, \rho)$ is
$$L(\sigma^2, \rho) = (2\pi)^{-n/2} |\Sigma(\sigma^2, \rho)|^{-1/2} \exp\{-Z^\top \Sigma(\sigma^2, \rho)^{-1} Z/2\},$$
where
$$\Sigma(\sigma^2, \rho)_{i,j} = \mathrm{Cov}(Z(s_i), Z(s_j)) = \sigma^2 \exp\{-||s_i - s_j||/\rho\}.$$
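A hedged sketch of maximizing this likelihood numerically, in the spirit of logistic.nll above; locs (an n x 2 matrix of locations) and z (the mean-zero observations) are hypothetical names:

spatial.nll <- function(theta, locs, z) {
  sigma2 <- theta[1]; rho <- theta[2]
  if (sigma2 <= 0 || rho <= 0) return(Inf)     # keep optim in the valid region
  D <- as.matrix(dist(locs))                   # pairwise distances
  Sigma <- sigma2 * exp(-D / rho)
  # negative log of the multivariate normal density, up to an additive constant
  as.numeric(0.5 * (determinant(Sigma, logarithm = TRUE)$modulus +
                    t(z) %*% solve(Sigma, z)))
}
optim(par = c(1, 1), fn = spatial.nll, locs = locs, z = z)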
A few words before we move on:
1) It's always preferable to find the MLEs in closed form if
you can. The answer is exact, and you avoid all the errors
that can be introduced in the numerical optimization,
including possibly converging to a local rather than a
global optimum.
2) If you do need to use numerical optimization, it's a
good idea to evaluate the likelihood (or log-likelihood)
over a grid of values first, to help you find a good starting
value.
3) There's a lot more theoretical detail concerning MLEs
that we don't have time to cover, importantly how to
estimate uncertainty. See Stat 135.
Nonparametric regression and scatterplot smoothing
We've looked at linear models and nonlinear models with
a specified form, but what if you don't know a good
function to relate two variables X and Y?
(Figure: head acceleration versus time for the motorcycle data.)
This data set shows
head acceleration
in a simulated
motorcycle accident,
used to test helmets.
This area of statistics is known as nonparametric regression
or scatterplot smoothing. Basically, we want to draw a
curve through the data that relates X and Y. More
formally, we suppose
$$Y_i = f(X_i) + \epsilon_i,$$
where $f$ is an unknown function and the $\epsilon_i$ are iid with
some common distribution, typically normal.
Now, if we don't put any restrictions on $f$, it's easy to get
a perfect fit to the data -- just draw a curve that passes
through all the points! But this curve is unlikely to give
good predictions for any future observations.
Aside: This actually gets at a fundamental idea in
statistics, called the bias-variance tradeoff. We can get a
very low bias estimator of $f$ by interpolating the
data, using a very wiggly curve. But this introduces a lot
of variance. So we look for a happy medium.
We won't cover the theoretical details here, but just keep
in mind this question of how much smoothing to do.
Back to the motorcycle data....
One of the simplest things we could do would be to fit a
high-degree polynomial.
But fitting a global polynomial this way isn't very
efficient.
How about breaking up the region of x and fitting a separate,
lower-degree polynomial in each region?
(Figure: motorcycle data with global polynomial fits of degree 5, 10, and 20.)
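A quick sketch of such a global polynomial fit (mcycle in the MASS package has columns times and accel):

library(MASS)
fit10 <- lm(accel ~ poly(times, 10), data = mcycle)
plot(mcycle$times, mcycle$accel)
lines(mcycle$times, fitted(fit10))   # degree-10 polynomial fit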
This type of model is known as a piecewise polynomial
model or regression splines.
The breakpoints, between which we have separate
polynomial functions, are called knots.
Typically we impose some constraints on the way the
functions match up at the knots, such as matching the
first and second derivatives.
So the modelling choices boil down to
1) where to put the knots
2) what degree polynomial to fit between the knots
More knots means less smoothing.
Motorcycle data with 6 knots:
(Figure: motorcycle data with regression spline fits of degree 1 and degree 3, using 6 knots.)
Motorcycle data with 9 knots:
(Figure: motorcycle data with regression spline fits of degree 1 and degree 3, using 9 knots.)
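A hedged sketch of a regression spline fit using the bs() basis from the splines package (the knot placement here is arbitrary):

library(MASS); library(splines)
knots <- quantile(mcycle$times, probs = seq(0.1, 0.9, length = 6))
fit <- lm(accel ~ bs(times, knots = knots, degree = 3), data = mcycle)
plot(mcycle$times, mcycle$accel)
lines(mcycle$times, fitted(fit))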
Smoothing spline models are defined in a slightly different
way. Within a class of functions, a smoothing spline
minimizes the penalized least squares criterion
$$\frac{1}{n}\sum_{i=1}^{n} (Y_i - f(X_i))^2 + \lambda \int f''(x)^2\,dx.$$
The parameter $\lambda$ controls how smooth the function is (in
terms of integrated second derivative).
We can specify $\lambda$ in terms of the equivalent degrees of
freedom of the model, or we can choose it in a data-based
way, using something called cross validation.
(Figure: motorcycle data with smoothing spline fits for df = 10, df = 20, and df chosen by cross-validation.)
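A short sketch with smooth.spline, comparing fixed degrees of freedom to the data-based choice (mcycle from MASS again):

library(MASS)
fit.df10 <- smooth.spline(mcycle$times, mcycle$accel, df = 10)
fit.cv   <- smooth.spline(mcycle$times, mcycle$accel)   # smoothing chosen by (generalized) cross-validation
plot(mcycle$times, mcycle$accel)
lines(fit.df10)
lines(fit.cv, lty = 2)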