Stat 133 All Lectures
Some benefits of R:
• Allows custom analyses and easy replicability
• High level language designed for statistics
• Active user community, lots of add-ons
• It’s free!
A screenshot from http://www.R-project.org/
R can be run in interactive or batch modes. The
interactive mode is useful for trying out new analyses and
making sure your code is doing what you think it is. The
batch mode is useful for carrying out pre-defined analyses
in the background.
> 3 + 5
[1] 8
> 1:20
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13
[14] 14 15 16 17 18 19 20
>
> # This is a comment
>
> 30 + 10 / # I'm not done typing
+ 2
[1] 35
To store a value, we can assign it to a variable.
> x1 <- 32 %% 5
> print(x1)
[1] 2
> x2 <- 32 %/% 5
> x2 # In interactive mode, this prints the object
[1] 6
> ls() # List all my variables
[1] "x1" "x2"
> rm(x2) # Remove a variable
> ls()
[1] "x1"
Variable names must follow some rules: they may contain letters, digits, periods, and underscores; they must begin with a letter or a period; and they are case-sensitive.
You can use the save and load functions to save specific
variables.
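For example (a quick sketch; the filename myvars.RData is just an illustration):

```r
x1 <- 32 %% 5
save(x1, file = "myvars.RData")  # write x1 to a file
rm(x1)                           # remove it from the workspace
load("myvars.RData")             # restore it from the file
x1                               # back again: 2
```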
For now, we’ll work with R’s built-in functions, and the
most important things to know are how to call the
function and how to get help when you need it.
First, determine the arguments.
> args(rnorm)
function (n, mean = 0, sd = 1)
NULL
> args(plot)
function (x, y, ...)
NULL
(In the output of args(rnorm) above, mean = 0 and sd = 1 are default values.)
> help(rnorm) # A shortened version of the real page:
Normal                 package:stats                 R Documentation
Description:
Random generation for the normal distribution with
mean equal to 'mean' and standard deviation equal to 'sd'.
Usage:
rnorm(n, mean = 0, sd = 1)
Arguments:
n: number of observations.
mean: vector of means.
sd: vector of standard deviations.
Details:
If 'mean' or 'sd' are not specified they assume the
default values of 0 and 1, respectively.
Value:
'rnorm' generates random deviates.
Source:
See RNG for how to select the algorithm and for
references to the supplied methods.
References:
Becker, R. A., Chambers, J. M. and Wilks, A. R.
(1988) _The New S Language_. Wadsworth & Brooks/Cole.
See Also:
'runif' and '.Random.seed'
Examples:
...
R has a number of built-in data types. The three most
basic types are numeric, character, and logical.
> mode(3.5)
[1] "numeric"
> mode("Hello")
[1] "character"
> mode(2 < 3)
[1] "logical"
If you are just joining the course this week, please see me
after class, in office hours, or send me an email if you have
not done so already.
Last time in our introduction to R, we learned how to
Can you guess what x will look like after each of the
following lines?
> 1:5
[1] 1 2 3 4 5
> 5:1
[1] 5 4 3 2 1
> seq(0, 10, by = 2)
[1] 0 2 4 6 8 10
> seq(0, 0.5, length = 6)
[1] 0.0 0.1 0.2 0.3 0.4 0.5
> seq(1, 0, by = -0.1)
[1] 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.0
> rep(c(0, 1), times = 5)
[1] 0 1 0 1 0 1 0 1 0 1
> rep(letters[1:5], each = 2)
[1] "a" "a" "b" "b" "c" "c" "d" "d" "e" "e"
R also has many built-in summary functions.
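For instance, a few of them applied to a small vector:

```r
x <- c(4, 8, 15, 16, 23, 42)
sum(x)     # total of all elements: 108
mean(x)    # average: 18
range(x)   # smallest and largest values: 4 42
length(x)  # number of elements: 6
```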
> args(paste)
function (..., sep = " ", collapse = NULL)
NULL
We’ll talk a lot more about working with text later in the
course.
We learned that one of the three main data types in R is
a logical vector, which is either TRUE or FALSE. To
understand how R operates on logical vectors, you need
to know a bit about Boolean algebra.
[Truth tables for "A and B" and "(not A) or B"]
The “not” operation just causes the statement following it
to switch its truth value. So (not TRUE) is FALSE and
(not FALSE) is TRUE. The compound statement A and B is
TRUE only if both A and B are TRUE. The compound
statement A or B is TRUE if either or both A or B is TRUE.
Today’s topics
> cars
Year Month Cars6 Cars13 Junction
1 1990 July 139246 138548 7 to 8
2 1990 July 134012 132908 9 to 10
3 1991 September 137055 136018 7 to 8
4 1991 September 133732 131843 9 to 10
5 1991 December 123552 121641 7 to 8
6 1991 December 121139 118723 9 to 10
7 1992 March 128293 125532 7 to 8
8 1992 March 124631 120249 9 to 10
9 1992 November 124609 122770 7 to 8
10 1992 November 117584 117263 9 to 10
The data.frame function will extract column names either
from arguments with a name = value construction, or from
the arguments themselves.
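A minimal sketch of both styles, reusing a few values from the cars data above:

```r
Year <- c(1990, 1990)
Cars6 <- c(139246, 134012)
# "Year" and "Cars6" are taken from the argument names themselves;
# "Month" comes from the name = value construction
mydf <- data.frame(Year, Month = c("July", "July"), Cars6)
names(mydf)   # "Year" "Month" "Cars6"
```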
> cars$Year
[1] 1990 1990 1991 1991 1991 1991 1992 1992 1992 1992
> cars$Month
[1] July July September September
[5] December December March March
[9] November November
Levels: December July March November September
Data frames are actually a special kind of list. As when
constructing a data frame, we specify the elements of a
list using either name = value or just value for each
argument. Unlike a data frame, lists are not displayed in
columns, and each element can have a different length.
They can also be indexed like vectors, using []. The result
will be another list.
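A small sketch:

```r
# Elements may be given as name = value or just value,
# and can have different lengths
mylist <- list(a = 1, "hello", b = 1:3)
mylist[2]       # single brackets return another list
mylist[["b"]]   # double brackets return the element itself: 1 2 3
length(mylist[2])   # 1 -- a list containing one element
```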
> regression.results[1]
$coefficients
(Intercept) x
0.08847387 2.99781408
> regression.results[[1]]
(Intercept) x
0.08847387 2.99781408
To summarize, the types of data structures we have
encountered so far are:
vector
matrix
array
list
data frame
Today’s topics:
Vectors: [index]
> x[1:10]; x[-3]; x[x>3]
Single factor:
If you don’t save your files as plain text, this won’t work,
since R cannot interpret any extra formatting commands.
So I do NOT recommend you use Microsoft Word.
> data()
Data sets in package 'datasets':
. . . many more
1. Barplots
> VADeaths
Rural Male Rural Female Urban Male Urban Female
50-54 11.7 8.7 15.4 8.4
55-59 18.1 11.7 24.3 13.6
60-64 26.9 20.3 37.0 19.3
65-69 41.0 30.9 54.6 35.1
70-74 66.0 54.3 71.1 50.0
> barplot(VADeaths, legend = TRUE)
[Figure: barplots of VADeaths, stacked and grouped, y-axis "Deaths per 1000", legend showing age groups 50-54 through 70-74. In the stacked version, the height of each bar shows the total.]
> Titanic
, , Age = Child, Survived = No
Sex
Class Male Female
1st 0 0
2nd 0 0
3rd 35 17
Crew 0 0
, , Age = Adult, Survived = No
Sex
Class Male Female
1st 118 4
2nd 154 13
3rd 387 89
Crew 670 3
[Figure: mosaic-style plot of Titanic passengers by class (1st, 2nd, 3rd, Crew)]
Studies of human perception show we are not very good
at comparing areas, volumes, or angles.
[Figure: "Histogram of precip", x-axis "precip" from 0 to 70, y-axis "Frequency"; each bar height is the number of observations falling in that interval]
There are several ways to change the cutoff points.
[Figure: two histograms of precip with different cutoff points, x-axis "precip", y-axis "Frequency"]
Again, let’s add meaningful axis labels and a title.
> hist(precip, breaks = 10, xlab = "Inches",
+ main = "Yearly Average Rainfall for US Cities")
[Figure: "Yearly Average Rainfall for US Cities", x-axis "Inches" from 10 to 70, y-axis "Frequency"]
4. Boxplots
[Figure: annotated boxplot of precip, y-axis "Inches". Labels: outliers plotted as individual points, upper whisker (extending to the upper quartile + 1.5 IQR), and upper quartile at the top of the box.]
> mtcars[1:2,1:5]
mpg cyl disp hp drat
Mazda RX4 21 6 160 110 3.9
Mazda RX4 Wag 21 6 160 110 3.9
> boxplot(mpg~cyl, data = mtcars, xlab = "Cylinders",
+ ylab = "Miles per Gallon",
+ main = "Fuel Consumption")
[Figure: "Fuel Consumption" boxplots of miles per gallon (roughly 10 to 30) for 4-, 6-, and 8-cylinder cars, with one outlier]
5. Scatterplots
> state.x77[1:2,1:4]
Population Income Illiteracy Life Exp
Alabama 3615 3624 2.1 69.05
Alaska 365 6315 1.5 69.31
> plot(state.x77[,"Income"], state.x77[,"Life Exp"])
[Figure: scatterplot of state.x77[, "Income"] against state.x77[, "Life Exp"], with life expectancies roughly 68 to 73]
> plot(state.x77[,"Income"], state.x77[,"Life Exp"],
+ xlab = "Per Capita Income (Dollars)",
+ ylab = "Life Expectancy (Years)",
+ main = "Income and Life Expectancy in U.S., 1970s")
[Figure: the same scatterplot with title "Income and Life Expectancy in U.S., 1970s", x-axis "Per Capita Income (Dollars)", y-axis "Life Expectancy (Years)"]
Are we supposed to compare length, area, or volume?
5. Graph data out of context.
6. Change scales in mid-axis.
7. Emphasize the trivial. (Ignore the important.)
8. Jiggle the baseline.
Sometimes varying the baseline is ok, if the main points of comparison are the first category and the total. This plot is bizarre in other ways.
9. Austria first!
10. Label (a) Illegibly, (b) Incompletely, (c) Incorrectly, and (d) Ambiguously.
11. More (dimensions) is murkier. More dimensions AND colors!
12. If it has been done well in the past, think of another way to do it.
On the other hand, here are some creative plotting techniques you may want to consider.
1. Letting the data points represent another variable. (E.J. Marey)
2. Using "small multiples"
3. Letting deformation represent a variable.
versus
An example from www.swivel.com
Critique:
- x-axis labels poorly located; put them at election years
- y-axis label misleading; these are numbers of counties
- use of color could be improved (e.g. red/blue)
Some R code for you to play with:
[Figure: line plot of number of counties by election year, legend showing Democrat and Republican]
> library("nameofpackage")
or
> library(nameofpackage)
Functions allow us to
> args(substring)
function (text, first, last = 1e+06)
NULL
The body of a function is a sequence of expressions:
{
expression 1
expression 2
return(value)
}
Flow control statements in R consist of
• if/else statements
• for and while loops
• break and next
• the switch function
Statements can be grouped together using curly braces
“{” and “}”. A group of statements is called a block. For
today’s lecture, the word statement will refer to either a
single statement or a block.
The basic syntax for an if/else statement is
if ( condition ) {
statement1
} else {
statement2
}
A chain of conditions takes the form
if (condition1 )
statement1
else if (condition2)
statement2
else if (condition3)
statement3
else
statement4
if (condition) {statement1}
else {statement2}
if ( !is.matrix(m) )
stop("m must be a matrix")
3. To handle common numerical errors
if ( dist == "normal" ){
return( rnorm(n) )
} else if (dist == "t"){
return(rt(n, df = 1, ncp = 0))
} else stop("distribution not implemented")
These if/else constructions are useful for global tests, not
tests applied to individual elements of a vector.
plot(Income, Donations,
col = ifelse(party == "Republican", "red", "blue"))
Looping is the repeated evaluation of a statement or block
of statements.
The syntax in R is
while (condition){
statement
}
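A small sketch of both loop types:

```r
# A for loop evaluates the statement once per element of a vector
total <- 0
for (i in 1:5) {
  total <- total + i
}
total   # 15

# A while loop repeats as long as the condition holds
n <- 1
while (n < 100) {
  n <- n * 2
}
n   # 128, the first power of 2 at or above 100
```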
switch(EXPR, ...)
F (x) = P (X ≤ x)
F^{-1}(q) = inf{x : F(x) > q}
Exercise: What does the inverse CDF for the coin flipping
example look like?
A random variable X is discrete if it takes countably many
values. We define the probability mass function for X by
f (x) = P (X = x)
F̂^{-1}(q) = inf{x : F̂(x) > q}
Probability
Inference
A statistic is a function of a sample, for example the
sample mean or a sample quantile.
[Diagram: a particular choice of parameters and sample size generates a sample X1, X2, ..., Xn, which is reduced to a single statistic]
abbreviation - Distribution

unif - Uniform(a, b): f(x) = 1/(b − a) for a ≤ x ≤ b, and 0 otherwise

exp - Exponential(λ): f(x) = λ e^{−λx} for x > 0, and 0 otherwise

pois - Poisson(λ): f(x) = e^{−λ} λ^x / x!, for x = 0, 1, 2, ...

binom - Binomial(n, p): f(x) = (n choose x) p^x (1 − p)^{n−x}, for x = 0, 1, ..., n
Well, in this case you could just use binom with size=1, or
sample(0:1, 1, prob = c(p, 1-p)).
Therefore,
P(Y ≤ y) = P(F^{-1}(U) ≤ y)
         = P(F(F^{-1}(U)) ≤ F(y))
         = P(U ≤ F(y))
         = F(y)
We need to:
1. Find the CDF
2. Find the inverse CDF
3. Write a function to carry out the inverse CDF method.

[Figure: density curve on [0, 2], y-axis "Density" from 0.0 to 1.0]
We ended last time by talking about the inverse-CDF
method:
We need to:
1. Find the CDF
2. Find the inverse CDF
3. Write a function to carry out the inverse CDF method.
Using the fact that the total area is one, and that the area of a triangle is 1/2 base x height, we find that

F(x) = 0 for x < a
F(x) = (x − a)^2 / ((b − a)(c − a)) for a ≤ x < c
F(x) = 1 − (b − x)^2 / ((b − a)(b − c)) for c ≤ x ≤ b
F(x) = 1 for b < x

Inverting this function, we have

F^{-1}(y) = a + sqrt(y (b − a)(c − a)) for 0 ≤ y < (c − a)/(b − a)
F^{-1}(y) = b − sqrt((1 − y)(b − a)(b − c)) for (c − a)/(b − a) ≤ y ≤ 1
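A sketch of step 3, pushing uniforms through this inverse CDF (the function name rtriangle and the defaults a = 0, c = 0.5, b = 2 are illustrative choices, not from the slides):

```r
rtriangle <- function(n, a = 0, c = 0.5, b = 2) {
  # Draw uniforms and apply the inverse CDF derived above
  y <- runif(n)
  cutoff <- (c - a) / (b - a)
  ifelse(y < cutoff,
         a + sqrt(y * (b - a) * (c - a)),
         b - sqrt((1 - y) * (b - a) * (b - c)))
}
z <- rtriangle(1000)   # 1000 draws, all between a and b
```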
Σ_{i=1}^{d} E[(θ_i − Y_i)^2] = dσ^2
$ pwd
/Users/cgk
$ mkdir unixexamples
$ cd unixexamples
$ ls
$ ls -a
. ..
The two hidden files here are special and exist in every
directory. “.” refers to the current directory, and “..”
refers to the directory above it.
This brings us to the distinction between relative and
absolute path names. (Think of a path like an address in
UNIX, telling you where you are in the directory tree.)
$ mv test.txt newname.txt
Options come between the name of the command and
the arguments, and they tell the command to do
something other than its default. They’re usually prefaced
with one or two hyphens.
$ pwd
/Users/cgk
$ rmdir unixexamples
rmdir: unixexamples: Directory not empty
$ rm -r unixexamples
$ ls
Desktop Movies Rlibs
Documents Music Sites
Icon? Pictures Work
Library Public bin
MathematicaFonts README
mathfonts.tar
To look at the syntax of any particular UNIX command,
type man (for “manual”) and then the name of the
command.
$ ls
AGing.txt Bing.xt Gagging.text Going.nxt ing.ext
$ ls G*
Gagging.text Going.nxt
$ ls *.xt
Bing.xt
$ ls
AGing.txt Bing.xt Gagging.text Going.nxt ing.ext
$ ls [A-G]ing.*
Bing.xt
$ ls *G*
AGing.txt Gagging.text Going.nxt
$ ls *i*.*e*
Gagging.text ing.ext
cp - copy a file
$ cp unixexamples/Bing.xt .
> system("ls")
datagen.R group1.dat group2.dat group3.dat
> system("head -n2 *.dat")
==> group1.dat <==
height weight
65.4 134.9

==> group2.dat <==
height weight
65.7 145.7

Goal: Read in all the data files and put them in a single matrix with an extra column for group.
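One possible way to carry out this goal (a sketch; the set-up lines fabricate three small files like the ones shown, with made-up values for the third group):

```r
# Set-up for the sketch: write three small example files
writeLines(c("height weight", "65.4 134.9"), "group1.dat")
writeLines(c("height weight", "65.7 145.7"), "group2.dat")
writeLines(c("height weight", "66.0 150.0"), "group3.dat")  # made-up third file

# Read each file and stack the rows, adding a group column
alldata <- NULL
for (i in 1:3) {
  grp <- read.table(paste("group", i, ".dat", sep = ""), header = TRUE)
  alldata <- rbind(alldata, cbind(as.matrix(grp), group = i))
}
alldata   # a single matrix with columns height, weight, group
```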
- You have a long job and you want to be able to use the
computer for other things in the meantime.
- You want to log out of the machine while the job is
running and come back to it later.
- You’re running the job on a remote machine, and again
you want to log out.
- You want to be courteous to other users of the machine
by decreasing the priority of the job.
To start a BATCH job, use R CMD BATCH:
$ R CMD BATCH myscript.R &
You can use pwd, cd, and ls just as you would at the usual
prompt to find the right remote file or directory. You can
also use lpwd, lcd, and lls to move around the local
machine.
(There’s also a <<, but its use is more advanced than we’ll
cover.)
Try it out!
The idea behind pipes is that rather than redirecting
output to a file, we redirect it into another program.
$ ls | less
Pipe
Note that the data flows from left to right. See the UNIX
handout for more details on less.
A program through which data is piped is called a filter.
We’ve already seen a few filters: head, tail, and wc.
$ cat somenumbers.txt | cut -d " " -f 3-7
Here are some practice problems:
> string
[1] "St John the Baptist Parish"
> if (substring(string, 1, 3) == "St ")
+ newstring <- paste("St. ",
+ substring(string, 4, nchar(string)), sep = "")
> newstring
[1] "St. John the Baptist Parish"
> strings <- c("a test", "and one and one is two",
+ "one two three")
> gsub("one", "1", strings)
[1] "a test" "and 1 and 1 is two" "1 two three"
> sub("one", "1", strings)
[1] "a test" "and 1 and one is two" "1 two three"
What about finding fake “words” such as rep1!c@ted or
Vi@graa? In this case, we’re looking for numbers and/or
punctuation surrounded by regular letters.
Example: [-+][0-9]
> Addresses
[1] "Cari Kaufman <cgk@stat.berkeley.edu"
[2] "depchairs03-04@uclink.berkeley.edu"
[3] "Chancey <_arkbound@deutschland.de>"
> grep("[[:digit:]_]", Addresses)
[1] 2 3
> Addresses[grep("[[:digit:]_]", Addresses)]
[1] "depchairs03-04@uclink.berkeley.edu"
[2] "Chancey <_arkbound@deutschland.de>"
Going back to our fake “words” example, what will this
match?
[[:alpha:]][[:digit:][:punct:]][[:alpha:]]
> newString
[1] "Re: 90 days" "Fancy rep1!c@ted watches" "Its me"
> gregexpr("[[:alpha:]][[:digit:][:punct:]][[:alpha:]]",
newString)
[[1]]
[1] -1
attr(,"match.length")
[1] -1
[[2]]
[1] 12
attr(,"match.length")
[1] 3
[[3]]
[1] -1
attr(,"match.length")
[1] -1
(No match for the first and third strings; for the second, a match of length 3 starting at position 12.)
Did we miss anything??
We didn’t find p1!c because it consists of four characters:
a letter, a digit, a punctuation mark, and another letter.
[[:alpha:]][[:digit:][:punct:]]+[[:alpha:]]
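A quick check that the + modifier now picks up the whole fake word (a sketch using regexpr and regmatches):

```r
pattern <- "[[:alpha:]][[:digit:][:punct:]]+[[:alpha:]]"
m <- regexpr(pattern, "rep1!c@ted")
regmatches("rep1!c@ted", m)   # "p1!c"
```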
^[^[:lower:]]+$
The position of a character in a pattern determines
whether it is treated as a meta character.
. ^ $ + ? ( ) [ ] { } | \
Announcements:
1. Literal characters
2. Character classes
3. Modifiers
http://www.usatoday.com/news/politicselections/
vote2004/PresidentialByCounty.aspx?
oi=P&rti=G&tf=l&sp=CA
paste(web, collapse = " ")
Goal: grab the county name, votes for Bush, and votes for
Kerry.
[1] "\t\t\t\t<td class=\"notch_white\" width=
\"153\"><b>Alameda</b></td><td class=\"notch_white\"
align=\"Right\" width=\"65\">1,141</td><td class=
\"notch_white\" align=\"Right\" width=\"70\">1,141</td><td
class=\"notch_white\" align=\"Right\" width=
\"60\">107,489</td><td class=\"notch_white\" align=\"Right
\" width=\"60\">326,675</td><td class=\"notch_white\"
align=\"Right\" width=\"60\">0</td>"
Advantages:
- easy to read, write, and process
- in standard cases, don’t need a lot of extra information
- data is self-describing
- format separates content from structure
- data can be easily merged and exchanged
- file is human-readable
- but file is also easily machine-generated
- standards are widely adopted
An attribute
XML is well-formed if it obeys certain syntax rules. The
rules for tags are
leaf nodes
Working with XML in R
The first thing we need to do is load the XML library.
> library(XML)
> doc <- xmlTreeParse("plant_catalog.xml")
> root <- xmlRoot(doc)
> class(root)
[1] "XMLNode"
Aside:
> xmlValue(oneplant[['COMMON']])
[1] "Bloodroot"
> xmlValue(oneplant[['BOTANICAL']])
[1] "Sanguinaria canadensis"
> oneplant[['COMMON']]
<COMMON>Bloodroot</COMMON>
There are special XML versions of lapply and sapply, named
xmlApply and xmlSApply. Each takes an XMLNode object as
its primary argument. They iterate over the node’s
children, invoking the given function.
The final is on paper (not computer) and will take 1-2 hrs.
First, a quick review of sapply and lapply. Remember:
lapply(1:3, function(x){x^2})
sapply(1:3, function(x){x^2})
myList <- list(a=1, b=2, c=3)
lapply(myList, function(x){x^2})
sapply(myList, function(x){x^2})
sapply(myList, log)
sapply(myList, log, base = 10)
sapply(myList, function(x, pow){x^pow}, pow = 3)
3) If the results of sapply cannot be simplified, then sapply
and lapply will return the same thing.
One of the common goals is prediction of the variable of interest at new locations.

[Figure: scatterplot of a spatially referenced variable, with values ranging from about −2 to 3]
2. Lattice data consist of measurements that are particular
to a certain geographic region, such as a county.
Modelling tends to
focus on the structure
of “neighborhoods.”
3. Point process data consist of the locations of particular
events. If there are also values associated with the event,
this is known as a marked point process.
Note:
if Var(X) = Var(Y) = σ²
A few simplifying assumptions about the structure of the
covariance function are
Ĉ(d) = (1 / #I_d) Σ_{(i,j) ∈ I_d} (Z_i − Z̄^{(i)})(Z_j − Z̄^{(j)}),

where Z̄^{(i)} is the mean over all the i’s and Z̄^{(j)} is the mean over all the j’s.
Recall from last time:
Example: Average surface ozone at monitoring stations

[Figure: map of station locations, longitude −95 to −75, latitude roughly 35 to 42, with ozone values of about 50 to 65]
Plotting the data
[Figure: estimated covariance (0 to about 30) plotted against distance]
The idea is that we fit a parametric model to the
correlogram. In other words, we specify a functional form
for the covariance, where the function depends on
certain parameters, and then we estimate those
parameters.
We minimize the sum of squared residuals over the parameters.

[Figure: estimated covariance versus distance, with the fitted covariance function]
Now we need to determine the set of locations at which
we want to do the prediction.
[Figure: regular grid of prediction locations, axes lon and lat (latitude roughly 32 to 42)]
Ok, now it is time to predict.
It turns out the weights for the BLUP are easy to derive if
you know the true covariance function....
First we form the covariance matrix for the observations.
This contains the covariances for every pair of
observations.
We also need γ, whose column j holds the covariances between the observations and new location j. If the vector of observations Z has mean zero, the BLUP is simply

\[ \gamma' \Sigma^{-1} Z \]

and the prediction variance is

\[ \sigma^2 1_m - \mathrm{diag}\{\gamma' \Sigma^{-1} \gamma\} \]
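As a sketch of how these formulas translate into R (a toy one-dimensional example with an assumed exponential covariance function, not from the lecture):

```r
# Toy BLUP (kriging) example with an assumed exponential covariance.
C <- function(h, sigma2 = 1, rho = 10) sigma2 * exp(-h / rho)

obs <- c(1, 2, 4)            # observation locations
z   <- c(0.5, 0.2, -0.1)     # mean-zero observations Z
new <- c(3, 5)               # new locations to predict at

Sigma <- C(abs(outer(obs, obs, "-")))   # covariances among observations
gamma <- C(abs(outer(obs, new, "-")))   # covariances: obs vs. new locations

blup <- t(gamma) %*% solve(Sigma) %*% z                    # gamma' Sigma^-1 Z
pvar <- C(0) - diag(t(gamma) %*% solve(Sigma) %*% gamma)   # sigma^2 1_m - diag{...}
```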
Topics:
• using SQL to extract info from RDBMSs
• relating these back to similar tasks in R
• using SQL from within R
There are tradeoffs in terms of what we choose to do
using SQL and what we do in R.
A database is made up of one or more two dimensional
tables, usually stored as files on the server.
Terminology: in SQL, a NULL represents a missing value.
R:
Chips[c("Pentium", "PentiumII"), ]
In SQL, the SELECT clause names attributes, or functions of attributes.
Additional clauses
For example, this will order the first seven, not give you
the top seven!
SHOW DATABASES;
SHOW TABLES IN database;
SHOW COLUMNS IN table;
DESCRIBE table;    -- same as SHOW COLUMNS IN table
mysql> USE albums;
Database changed
mysql> DESCRIBE Album;
+-------+------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| alid | bigint(20) | YES | MUL | NULL | |
| aid | double | YES | | NULL | |
| title | text | YES | | NULL | |
+-------+------------+------+-----+---------+-------+
3 rows in set (0.00 sec)
What is the structure of this database?

mysql> describe Artist;
+-------+--------+------+-----+---------+-------+
| Field | Type   | Null | Key | Default | Extra |
+-------+--------+------+-----+---------+-------+
| aid   | double | YES  | MUL | NULL    |       |
| name  | text   | YES  |     | NULL    |       |
+-------+--------+------+-----+---------+-------+
mysql> describe Track;
+----------+------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------+------------+------+-----+---------+-------+
| alid | bigint(20) | YES | | NULL | |
| aid | double | YES | | NULL | |
| title | text | YES | | NULL | |
| filesize | bigint(20) | YES | | NULL | |
| bitrate | double | YES | | NULL | |
| length | bigint(20) | YES | | NULL | |
+----------+------------+------+-----+---------+-------+
The individual tables don’t have much interesting
information, since they rely on the IDs.
mysql> SELECT title,length FROM Track
-> WHERE length = (SELECT max(length) FROM Track);
+------------+--------+
| title | length |
+------------+--------+
| The Lovers | 1561 |
+------------+--------+
1 row in set (0.00 sec)
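The joined result that follows links Track to Album and Artist through their IDs. The original query isn't shown; a hypothetical reconstruction (in R, assuming an RMySQL connection `con` to the albums database as set up later) might look like:

```r
# Hypothetical reconstruction of the three-way join behind the next result.
# 'con' is an RMySQL connection to the albums database.
library(RMySQL)
dbGetQuery(con, "
  SELECT Track.title, Album.title, Artist.name, Track.length
  FROM   Track, Album, Artist
  WHERE  Track.alid = Album.alid
    AND  Track.aid  = Artist.aid
    AND  Track.length = (SELECT max(length) FROM Track)")
```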
+------------+------------------------+------------+--------+
| title | title | name | length |
+------------+------------------------+------------+--------+
| The Lovers | Invitation to Openness | Les McCann | 1561 |
+------------+------------------------+------------+--------+
1 row in set (0.00 sec)
Note that it's very important to make all the links between the tables, or you will get unwanted rows in the result.
Another example: let’s make a table of the number of
artists with certain numbers of albums. (How many
artists have one album, how many have two, etc.)
First, this tells us how many albums each artist (aid) has:
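The query itself didn't survive extraction; a plausible sketch (hypothetical, using GROUP BY via an RMySQL connection `con` as set up just below) is:

```r
# Hypothetical sketch: albums per artist, then a table of those counts.
# 'con' is an RMySQL connection to the albums database.
library(RMySQL)
perArtist <- dbGetQuery(con, "
  SELECT aid, COUNT(*) AS nalbums
  FROM   Album
  GROUP  BY aid")
table(perArtist$nalbums)   # how many artists have 1 album, 2 albums, ...
```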
> library(RMySQL)
Loading required package: DBI
> drv <- dbDriver("MySQL")
> con <- dbConnect(drv, dbname = "albums",
+ user = "stat133", pass = "T0pSecr3t")
What do you think this will do? This says that the string is not finished, even though we hit return.
> head(albums)
title name tot
1 'Perk Up' Shelly Manne 2684
2 Kaleidoscope Sonny Stitt 2679
3 Red Garland's Piano (Remastere Red Garland 2676
4 Ask The Ages Sonny Sharrock 2675
5 Duo Charlie Hunter & Leon Parker 2667
6 Tenor Conclave Hank Mobley/Al Cohn/John Coltr 2665
Numerical Optimization
Function optimization refers to the problem of finding a
value of x to make some function f(x) as large (or as
small) as possible.
The golden ratio satisfies

\[ \frac{c+d}{c} = \frac{c}{d} = \varphi \]

Substituting \( c = d\varphi \):

\[ \frac{d\varphi + d}{d\varphi} = \frac{d\varphi}{d}
\;\Rightarrow\; \frac{\varphi + 1}{\varphi} = \varphi
\;\Rightarrow\; \varphi^2 - \varphi - 1 = 0
\;\Rightarrow\; \varphi = \frac{1 + \sqrt{5}}{2} \approx 1.618034 \]
An example:

Start with
\[ x_1 = b - (b - a)/\varphi, \qquad x_2 = a + (b - a)/\varphi \]
Now compare \( f(x_1) \) and \( f(x_2) \).

[Figure: f(x) plotted on [1, 5], with a, x1, x2, b marked on the axis]
Keep the subinterval containing the smaller of the two values, and maintain the golden ratio.

[Figures: three further iterations of the search, each showing f(x) on [1, 5] with the updated a, x1, x2, b]
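The iterations above can be sketched as a small R function (a minimal implementation, assuming f is unimodal on [a, b]):

```r
# Golden section search for a minimum of f on [a, b].
golden <- function(f, a, b, tol = 1e-6) {
  phi <- (1 + sqrt(5)) / 2
  x1 <- b - (b - a) / phi
  x2 <- a + (b - a) / phi
  while (abs(b - a) > tol) {
    if (f(x1) < f(x2)) {        # minimum must lie in [a, x2]
      b <- x2; x2 <- x1
      x1 <- b - (b - a) / phi
    } else {                    # minimum must lie in [x1, b]
      a <- x1; x1 <- x2
      x2 <- a + (b - a) / phi
    }
  }
  (a + b) / 2
}
golden(function(x) (x - 3)^2 + 1, 1, 5)   # close to 3
```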
Setting the RHS equal to zero gives

\[ x_1 = x_0 - \frac{f'(x_0)}{f''(x_0)} \]
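A minimal R sketch of this iteration (assuming we can supply the first two derivatives):

```r
# Newton's method for optimization: iterate x1 = x0 - f'(x0)/f''(x0).
newton <- function(fp, fpp, x0, tol = 1e-8, maxit = 100) {
  for (i in 1:maxit) {
    x1 <- x0 - fp(x0) / fpp(x0)
    if (abs(x1 - x0) < tol) break
    x0 <- x1
  }
  x0
}
# Minimize f(x) = (x - 3)^2 + 1, so f'(x) = 2(x - 3) and f''(x) = 2:
newton(function(x) 2 * (x - 3), function(x) 2, x0 = 0)   # 3
```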
\[ Y_i = f(x_i, \beta) + \varepsilon_i \]
where \( Y_i \) is the response variable, \( \beta \) is a vector of coefficients, \( x_i \) is a vector of covariates, and \( f \) is a nonlinear function.
[Figure: weight (lb and kg) vs. days for the weight-loss data, annotated with the ultimate lean weight (asymptote), the total amount remaining to be lost, and the time taken to lose half the amount to be lost]
The function nls in R uses numerical optimization to find
the values of the parameters that minimize the sum of
squared errors
\[ \sum_{i=1}^{n} (Y_i - f(x_i, \beta))^2 \]
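As a hedged sketch, the wtloss data from MASS (which may or may not be the data plotted above) can be fit this way, with b0 the asymptote, b1 the total amount to be lost, and th the half-life; the starting values are rough guesses:

```r
# Fit the weight-loss model with nls.
library(MASS)   # wtloss data: Days, Weight
fit <- nls(Weight ~ b0 + b1 * 2^(-Days / th),
           data  = wtloss,
           start = list(b0 = 90, b1 = 95, th = 120))
coef(fit)
```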
\[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad \varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2), \quad i = 1, \dots, n \]

The least squares estimates satisfy \( \hat\beta_0 = \bar Y - \hat\beta_1 \bar X \), and

\[ E[Y_i] = \beta_0 + \beta_1 X_i + E[\varepsilon_i] = \beta_0 + \beta_1 X_i \]
1) The errors are clearly not iid normal, and 2) if we go out far enough, we actually predict a negative number of failures.

[Figure: number of failures vs. temperature (degrees F), with a fitted least squares line]
Instead we will fit a logistic regression model.
Note that in each case, the link function maps the space of
the parameter representing the mean of the distribution
(µi , λi , or pi ) to the real line, which is the space of the linear
predictor.
Generalized linear models can be fit using an algorithm
called iteratively reweighted least squares. In R, this is
implemented in the function glm.
[Figure: estimated failure probability vs. temperature (degrees F), with the fitted logistic curve]
Interpretation of the coefficients is a bit trickier in logistic regression models than it is in linear regression models. Since
\[ p_i/(1 - p_i) = \exp(\beta_0 + \beta_1 X_i), \]
\( \exp(\beta_1) \) is the multiplicative change in the odds for a one-unit increase in \( X_i \).
> library(MASS)
> birthwt[1:2,]
low age lwt race smoke ptl ht ui ftv bwt
85 0 19 182 2 0 0 0 1 0 2523
86 0 33 155 3 0 0 0 0 3 2551
AIC = 2k − 2 log(L)
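A sketch of one possible model for these data (the covariates chosen here are an assumption, not necessarily the lecture's model):

```r
# Logistic regression of low birth weight on a few covariates.
library(MASS)   # birthwt data
fit <- glm(low ~ age + lwt + smoke, data = birthwt, family = binomial)
exp(coef(fit))   # odds ratios
AIC(fit)         # 2k - 2 log(L)
```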
\[ p_i = \frac{\exp\{\beta_0 + \beta_1 X_i\}}{1 + \exp\{\beta_0 + \beta_1 X_i\}} \]

and the likelihood function is

\[ L(\beta_0, \beta_1) = \prod_{i=1}^{n} p_i^{Y_i} (1 - p_i)^{1 - Y_i}, \]
Before, we estimated σ² and ρ by finding the covariogram and fitting a curve to it using nonlinear least squares. However, the MLEs are actually much better estimators.
The kernel of the likelihood function (for normal data)
looks like this:
This data set shows head acceleration in a simulated motorcycle accident.

[Figure: Acceleration vs. Times for the motorcycle data]
This area of statistics is known as nonparametric regression
or scatterplot smoothing. Basically, we want to draw a
curve through the data that relates X and Y. More
formally, we suppose
\[ Y_i = f(X_i) + \varepsilon_i \]
We won't cover the theoretical details here, but just keep in mind this question of how much smoothing to do.
Back to the motorcycle data....
[Figure: motorcycle data with polynomial fits of Degree = 10 and Degree = 20]

A single high-degree polynomial fit to this data is not efficient. What if we split up the region of interest and fit a lower-degree polynomial in each region?
This type of model is known as a piecewise polynomial
model or regression splines.
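A sketch of fitting such a model in R, using bs() from the splines package on the mcycle data from MASS (the choice of df here is an assumption; df = 12 with degree 3 corresponds to 9 interior knots):

```r
# Piecewise cubic fit via regression splines (B-spline basis).
library(MASS)      # mcycle data: times, accel
library(splines)
fit <- lm(accel ~ bs(times, df = 12, degree = 3), data = mcycle)
plot(mcycle$times, mcycle$accel, xlab = "Times", ylab = "Acceleration")
lines(mcycle$times, fitted(fit), lwd = 2)
```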
[Figure: motorcycle data with piecewise polynomial fits, Degree = 1 and Degree = 3]
Motorcycle data with 9 knots:
[Figure: motorcycle data with regression spline fits using 9 knots, Degree = 1 and Degree = 3]
Smoothing spline models are defined in a slightly different
way. Within a class of functions, a smoothing spline
minimizes the penalized least squares criterion
\[ \frac{1}{n} \sum_{i=1}^{n} (Y_i - f(X_i))^2 + \lambda \int f''(x)^2 \, dx \]
[Figure: motorcycle data with smoothing spline fits: df = 10, df = 20, and df chosen by cross-validation]
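In R, smooth.spline fits this criterion; a sketch on the mcycle data from MASS (by default the smoothing parameter is chosen by generalized cross-validation):

```r
# Smoothing splines with fixed df and with automatic smoothing (GCV).
library(MASS)   # mcycle data: times, accel
fit10 <- smooth.spline(mcycle$times, mcycle$accel, df = 10)
fitcv <- smooth.spline(mcycle$times, mcycle$accel)   # df chosen by GCV
fitcv$df
plot(mcycle$times, mcycle$accel, xlab = "Times", ylab = "Acceleration")
lines(fit10, lty = 2)
lines(fitcv, lwd = 2)
```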