
Statistics 133: Concepts in Computing with Data

Instructor: Dr. Cari Kaufman


cgk@stat.berkeley.edu
GSI: Daisy Huang
yanhuang@stat.berkeley.edu
What Are Data?
Numbers
Example: Traffic on I-80
Text
Example: SPAM or HAM?
Images, video, or audio
Example: Mary Jane ski area and Rifle Sight trail
Height taken from a digital
elevation model, with
overlaid high-resolution
photograph.
Plan your descent through the bumps and go for it.
Bump skiing does not get much harder than this. This pitch
is a long one and typically does not have much loose snow, so
technique is important even if you decide to traverse across to the
left to lose some speed. Bear to skier's left at the bottom of this
pitch and finish out the run on Feebleminded. Look for good snow on
the sides. However you get down this run you should feel like you skied
something hard and a bit wild -- and done it in view of all the folks comfortably
sitting on the SuperGauge chairs. You will not find many other black runs that
will stretch you like Riflesight Notch. - From the Mary Jane Project
Meta-data
Example: Shelters along the Appalachian trail
Course Expectations
Getting Started with R
Why use R?
Some of you may have used statistical software with a
GUI, like Minitab. You may also be familiar with other
programming languages, like C, Java, Python, etc.
In this class, we'll use the R programming language and
environment as our home base for performing many
data analytic tasks.
Some benefits of R:

Allows custom analyses and easy replicability

High level language designed for statistics

Active user community, lots of add-ons

It's free!
A screenshot from http://www.R-project.org/
R can be run in interactive or batch modes. The
interactive mode is useful for trying out new analyses and
making sure your code is doing what you think it is. The
batch mode is useful for carrying out pre-defined analyses
in the background.
For now, we'll focus on the interactive mode.
When you fire up R, you'll see a prompt, like this:
At the prompt, you can type an expression. An expression
is a combination of letters/numbers/symbols which are
interpreted by a particular programming language
according to its rules. It then returns a value. We can also
say it evaluates to that value.
> 3 + 5
[1] 8
> 1:20
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13
[14] 14 15 16 17 18 19 20
>
> # This is a comment
>
> 30 + 10 / # I'm not done typing
+ 2
[1] 35
To store a value, we can assign it to a variable.
> x1 <- 32 %% 5
> print(x1)
[1] 2
> x2 <- 32 %/% 5
> x2 # In interactive mode, this prints the object
[1] 6
> ls() # List all my variables
[1] "x1" "x2"
> rm(x2) # Remove a variable
> ls()
[1] "x1"
Variable names must follow some rules:

May not start with a digit or underscore (_)

May contain numbers, characters, and some punctuation - period and underscore are ok, but most others are not

Case-sensitive, so x and X are different
Advice on variable names:

Use meaningful names

Avoid names that already have a meaning in R. If in doubt, check:
> exists("pi")
[1] TRUE
There are several ways to save your objects for later.
You can use the save and load functions to save specific
variables.
> save(x1, file = "x1.RData")
> rm(x1)
> ls()
character(0)
> load(file = "x1.RData")
> ls()
[1] "x1"
When you quit R, you'll be asked whether you want to
save ALL the contents of your current workspace.
> q()
Save workspace image? [y/n/c]:
A function is a portion of code that performs a specic
task. Usually it takes some inputs, performs some
computations, and returns a value.
The inputs are called arguments to the function. When
you use a function with a particular set of arguments, you
are said to be calling the function. The computer evaluates
the function call and returns the output.
For now, we'll work with R's built-in functions, and the
most important things to know are how to call the
function and how to get help when you need it.
First, determine the arguments.
> args(rnorm)
function (n, mean = 0, sd = 1)
NULL
> args(plot)
function (x, y, ...)
The ... argument is special and we'll talk about it later.
When you call a function, you can specify the arguments
either by position or by name, or a combination.
> x <- 1:100
> y <- rnorm(100, sd = x) # Combination
> plot(x, y) # By position
(In args(rnorm), mean = 0 and sd = 1 are default values.)
(Scatterplot of y against x: x runs from 0 to 100, and the spread of y grows with x because sd = x.)
> help(rnorm) # A shortened version of the real page:
Normal package:stats R Documentation
The Normal Distribution
Description:
Random generation for the normal distribution with
mean equal to 'mean' and standard deviation equal to 'sd'.
Usage:
rnorm(n, mean = 0, sd = 1)
Arguments:
n: number of observations.
mean: vector of means.
sd: vector of standard deviations.
Details:
If 'mean' or 'sd' are not specified they assume the
default values of 0 and 1, respectively.
Value:
'rnorm' generates random deviates.
Source:
See RNG for how to select the algorithm and for
references to the supplied methods.
References:
Becker, R. A., Chambers, J. M. and Wilks, A. R.
(1988) _The New S Language_. Wadsworth & Brooks/Cole.
See Also:
'runif' and '.Random.seed'
Examples:
...
R has a number of built-in data types. The three most
basic types are numeric, character, and logical.
You can check the type using the mode function.
> mode(3.5)
[1] "numeric"
> mode("Hello")
[1] "character"
> mode(2 < 3)
[1] "logical"
Actually, the three types are numeric, character, and
logical vectors. There's no such thing as a scalar in R, just a
vector of length one.
A vector in R is a collection of values of the same type.
You can join vectors together using the c (for
concatenate) function.
> c(1.3, 2, 8/3)
[1] 1.300000 2.000000 2.666667
> c("a", "l", "q")
[1] "a" "l" "q"
> c(TRUE, FALSE, FALSE)
[1] TRUE FALSE FALSE
>
> c(1, 2, FALSE)
[1] 1 2 0
> c(1, 2, "c")
[1] "1" "2" "c"
The last two expressions illustrate implicit coercion. You
should try to avoid this in most situations.
The elements of a vector can have names.
> unfair.coin <- c("heads" = 0.55, "tails" = 0.45)
> unfair.coin
heads tails
0.55 0.45
> names(unfair.coin)
[1] "heads" "tails"
>
> # Another way to do it
> fair.coin <- c(0.5, 0.5)
> names(fair.coin) <- names(unfair.coin)
> fair.coin
heads tails
0.5 0.5
There are five ways to extract elements of a vector.
> unfair.coin[1] # 1) Inclusion by position
heads
0.55
> unfair.coin[-1] # 2) Exclusion by position
tails
0.45
> unfair.coin["heads"] # 3) By name
heads
0.55
> unfair.coin[unfair.coin > 0.5] # 4) By logical index
heads
0.55
> unfair.coin[] # 5) No index (include everything)
heads tails
0.55 0.45
A few announcements:
If you haven't gotten your computer account, be sure to
email Daisy (yanhuang@stat.berkeley.edu) ASAP.
If you are just joining the course this week, please see me
after class, in office hours, or send me an email if you have
not done so already.
Last time in our introduction to R, we learned how to

start and quit R in interactive mode

do basic calculations in R

assign, print, list, and remove variables

save some or all of the variables in the workspace

find the arguments for a function and use the help system

create numeric, character, and logical vectors and concatenate them using c()

name the elements of a vector

extract elements of a vector five different ways
A few notes before we move on...
You can assign values to variables using = rather than <- if
you like.
You can use c to concatenate existing vectors.
> x1 <- c(1, 8)
> x2 <- 2:5
> x3 <- 4
> c(x1, x2, x3)
[1] 1 8 2 3 4 5 4
Remember that unlike other languages you may have
used, R does not start indexing with 0. Also, it does not
allow mixing of positive and negative subscripts. (Why
not?)
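For example, mixing index signs is ambiguous about what the rest of the result should be, so R refuses (a small illustration; the exact error wording may differ across R versions):
> x <- 1:5
> x[c(1, -3)]
Error in x[c(1, -3)] : can't mix positive and negative subscripts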
Indexing by exclusion can be used to remove elements of
a vector.
> x <- 1:5
> x
[1] 1 2 3 4 5
> x <- x[-c(1,3,5)]
> x
[1] 2 4
There was a question about indexing by name when the
names are not unique. It appears that R returns only the
first element with that name. So I'd avoid repeating
names.
> x <- 1:2
> names(x) <- c("a", "a")
> x["a"]
a
1
Today, we'll cover

missing values and other special values

assigning parts of a vector using indexing

vector arithmetic and the recycling rule

making patterned vectors

some built-in summary functions for vectors

basic manipulation of character vectors

logical vectors and Boolean algebra

a new data type: factors

Next time: more complicated data structures, reading
data into R
Next week: graphics
The missing value symbol is NA. Note that this is different
from "NA", so don't include the quotation marks. You can
check for the presence of NA values using the is.na
function.
> x <- c(1, 5, NA)
> is.na(x)
[1] FALSE FALSE TRUE
Other special values are NaN, for "not a number", which
typically arises when you try to compute an
indeterminate form such as 0/0. The result of dividing a
non-zero number by zero is Inf (or -Inf).
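A quick illustration of these special values:
> 0/0
[1] NaN
> 1/0
[1] Inf
> -1/0
[1] -Inf
> is.nan(0/0)
[1] TRUE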
In general, the same indexing may be used to assign values
to elements of a vector. Make sure the vector exists
first, or you will get an error.
Can you guess what x will look like after each of the
following lines?
> x <- 1:10
> names(x) <- letters[1:10]
> x[1:2] <- 2:1 # By inclusion
> x[-(1:2)] <- 10:3 # By exclusion
> x["a"] <- 100 # By name
> x[x==100] <- NA # By logical index
> x[] <- 10 # No index
> x <- 10 # Watch out - what happens here?
A very important feature of R is that it can carry out
vectorized calculations. What this means is that basic
arithmetic, as well as many built-in R functions, will
operate on each element of a vector. This avoids much of
the looping that's used in lower-level languages.
> x <- 1:3
> x * 10
[1] 10 20 30
> x^2
[1] 1 4 9
> y <- 0:2
> x + y
[1] 1 3 5
> x / y
[1] Inf 2.0 1.5
When the vectors in a calculation are of different lengths,
R follows the recycling rule. That is, it starts repeating
elements from the shorter one.
> x <- 1:3
> y <- 1:2
> x + y
[1] 2 4 4
Warning message:
In x + y : longer object length is not a multiple of
shorter object length
We've actually used this before. It would be a good
exercise for you to go through the notes so far and
identify where R is applying the recycling rule.
R has a number of built-in functions for making patterned
vectors, including seq and rep. We've seen : many times,
which is just a special case of the seq function.
> 1:5
[1] 1 2 3 4 5
> 5:1
[1] 5 4 3 2 1
> seq(0, 10, by = 2)
[1] 0 2 4 6 8 10
> seq(0, 0.5, length = 6)
[1] 0.0 0.1 0.2 0.3 0.4 0.5
> seq(1, 0, by = -0.1)
[1] 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.0
> rep(c(0, 1), times = 5)
[1] 0 1 0 1 0 1 0 1 0 1
> rep(letters[1:5], each = 2)
[1] "a" "a" "b" "b" "c" "c" "d" "d" "e" "e"
R also has many built-in summary functions.
> x <- rnorm(100)
> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.92500 -0.71430 -0.19300 -0.09377 0.49810 2.68200
> mean(x)
[1] -0.09377121
> min(x)
[1] -1.925202
> max(x)
[1] 2.682179
> range(x)
[1] -1.925202 2.682179
> length(x)
[1] 100
> sum(x)
[1] -9.377121
> prod(x)
[1] 1.482105e-25
A handy way to make patterned character vectors is to
use the paste function.
> args(paste)
function (..., sep = " ", collapse = NULL)
NULL
The help says . . . represents "one or more R objects, to
be converted to character vectors." This actually depends
on the function, but "one or more R objects" is a good
way to think of it for now. For another example, see the
help for c().
Type help(paste) to see more about how this function
works.
Some examples using paste
> paste("Iteration", 1:3)
[1] "Iteration 1" "Iteration 2" "Iteration 3"
> paste("Iteration", 1:3, sep = "")
[1] "Iteration1" "Iteration2" "Iteration3"
> words <- c("Hi", "everyone")
> paste(words, collapse = " ")
[1] "Hi everyone"
> paste(letters[1:5], collapse = "-")
[1] "a-b-c-d-e"
> paste("Iteration", 1:3, sep = "", collapse = "-")
[1] "Iteration1-Iteration2-Iteration3"
The substr function allows us to extract parts of a string.
> some.letters <- paste(letters[1:5], collapse = "-")
> some.letters
[1] "a-b-c-d-e"
> substr(some.letters, start = 1, stop = 3)
[1] "a-b"
It also allows us to assign parts of a string.
> substr(some.letters, start = 1, stop = 3) <- "A*B"
> some.letters
[1] "A*B-c-d-e"
We'll talk a lot more about working with text later in the
course.
We learned that one of the three main data types in R is
a logical vector, which is either TRUE or FALSE. To
understand how R operates on logical vectors, you need
to know a bit about Boolean algebra.
Boolean algebra is a mathematical formalization of the
truth or falsity of statements. It has three operations,
which we'll call "not", "or", and "and". Boolean algebra
tells us how to evaluate the truth or falsity of compound
statements that are built using these operations. For
example, if A and B are statements, some compound
statements are
A and B
(not A) or B
The not operation just causes the statement following it
to switch its truth value. So (not TRUE) is FALSE and
(not FALSE) is TRUE. The compound statement A and B is
TRUE only if both A and B are TRUE. The compound
statement A or B is TRUE if either or both A or B is TRUE.
In R, we write ! for not, & for and, and | for or.
Note: all of these are vectorized!
> A <- c(TRUE, TRUE, FALSE, FALSE)
> B <- c(TRUE, FALSE, TRUE, FALSE)
> !A
[1] FALSE FALSE TRUE TRUE
> A & B
[1] TRUE FALSE FALSE FALSE
> A | B
[1] TRUE TRUE TRUE FALSE
We often need to test various conditions using the
relational operators. Again, these are vectorized and follow
the recycling rule.
> x <- 1:5
> x > 2
[1] FALSE FALSE TRUE TRUE TRUE
> x < 2
[1] TRUE FALSE FALSE FALSE FALSE
> x == 2
[1] FALSE TRUE FALSE FALSE FALSE
> x >= 2
[1] FALSE TRUE TRUE TRUE TRUE
> x <= 2
[1] TRUE TRUE FALSE FALSE FALSE
> x != 2
[1] TRUE FALSE TRUE TRUE TRUE
Two other useful functions that operate on logical vectors
are all and any. Can you guess what they do?
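To check your guess:
> x <- 1:5
> any(x > 4)
[1] TRUE
> all(x > 4)
[1] FALSE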
Logical vectors in R are just special representations of
numeric vectors filled with 1s and 0s. Treating them as 1s
and 0s in calculations where we'd otherwise use their
numeric value is one of those instances in which implicit
coercion is ok, even helpful.
> x <- rnorm(1000)
> sum(x > 0) # Number of times the condition is TRUE
[1] 468
> mean(x > 0) # Proportion of times the condition is TRUE
[1] 0.468
> y <- x * (x > 0) # Multiplying by an indicator variable
> min(y)
[1] 0
Factors are a special storage class in R used for
categorical data.
> group <- rep(c("control", "treatment"), each = 2)
> group
[1] "control" "control" "treatment" "treatment"
> group <- factor(group)
> group
[1] control control treatment treatment
Levels: control treatment
> levels(group)
[1] "control" "treatment"
Because the levels of a factor are internally coded as
integers, this is more efficient than using character
vectors. However, we still have the advantage of seeing
what the levels represent (rather than just the integer
codes).
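Continuing the group example, you can peek at the integer codes behind the factor:
> as.integer(group)
[1] 1 1 2 2
> unclass(group)
[1] 1 1 2 2
attr(,"levels")
[1] "control" "treatment"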
Announcement: There will be another Short
Assignment posted later today and due Monday night.
Today's topics

Data structures galore: matrices, arrays, data frames, and lists

More ways to operate efficiently on entire data structures and avoid looping
You can create a matrix in R using the matrix function. By
default, matrices in R are filled in column-major order.
You can fill them by row instead by setting the
byrow argument to TRUE. Note that the first argument to
matrix is a vector, so all elements must be of the same
type (numeric, character, or logical).
> m <- matrix(1:6, nrow = 2, ncol = 3)
> m
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
> m <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
> m
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
Assign names to the rows and columns of a matrix:
> rownames(m) <- letters[1:2]
> colnames(m) <- letters[1:3]
> m
a b c
a 1 2 3
b 4 5 6
Find the dimensions of a matrix:
> dim(m); nrow(m); ncol(m)
[1] 2 3
[1] 2
[1] 3
Exchange rows and columns:
> t(m) # t for transpose
a b
a 1 4
b 2 5
c 3 6
To index elements of a matrix, use the same five methods
of indexing we covered for vectors, but with the first
index for rows and the second for columns.
Note: by default the result is coerced to a vector if
possible, rather than a matrix with a single row or
column.
Can you guess what each line returns?
> m
a b c
a 1 2 3
b 4 5 6
> m[-1, 2] # Exclusion & inclusion by position
> m["a",] # By name, empty column index
> m[, c(TRUE, TRUE, FALSE)] # Empty row index, logical
To avoid the coercion to lower dimension, add the
argument drop = FALSE to the indexing.
> m[1, 1, drop = FALSE]
a
a 1
> m[1, , drop = FALSE] # Note empty column index!
a b c
a 1 2 3
You can also index a matrix using a single index, as you
would for a vector. The ordering is again in column-major
order. (How could you change this?)
> m
a b c
a 1 2 3
b 4 5 6
> m[2]
[1] 4
An array is like a matrix, but with arbitrary dimension.
The first argument is still a vector of elements to fill the
array, but the second argument is a vector of sizes in each
dimension.
> array(1:24, dim = c(2, 3, 4)) # A 3-D array
, , 1
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
, , 2
[,1] [,2] [,3]
[1,] 7 9 11
[2,] 8 10 12
, , 3
[,1] [,2] [,3]
[1,] 13 15 17
[2,] 14 16 18
Note: entries are filled
in such that the first index
varies the fastest and the
last index varies the slowest.
Data frames are also like matrices, but columns can be of
different types.
Example: Number of cars on Friday, 6th and Friday, 13th:
> cars
Year Month Cars6 Cars13 Junction
1 1990 July 139246 138548 7 to 8
2 1990 July 134012 132908 9 to 10
3 1991 September 137055 136018 7 to 8
4 1991 September 133732 131843 9 to 10
5 1991 December 123552 121641 7 to 8
6 1991 December 121139 118723 9 to 10
7 1992 March 128293 125532 7 to 8
8 1992 March 124631 120249 9 to 10
9 1992 November 124609 122770 7 to 8
10 1992 November 117584 117263 9 to 10
The data.frame function will extract column names either
from arguments with a name = value construction, or from
the arguments themselves.
> data.frame(let = letters[1:2], val = 1:2)
let val
1 a 1
2 b 2
> grp <- factor(rep(c("Control", "Treatment"), each = 2))
> effect <- rnorm(4, mean = rep(c(0, 10), each = 2))
> data.frame(grp, effect)
grp effect
1 Control 0.4145526
2 Control 0.6052182
3 Treatment 10.4363078
4 Treatment 10.7534556
> data.frame(1:2, rnorm(2))
X1.2 rnorm.2.
1 1 -0.58342447
Data frames can be indexed in all the ways that matrices
can. The result will be either a vector or another data
frame. Coercion to vectors can again be avoided using
the argument drop = FALSE.
We can also extract a column by name using the $
symbol.
> cars$Year
[1] 1990 1990 1991 1991 1991 1991 1992 1992 1992 1992
> cars$Month
[1] July July September September
[5] December December March March
[9] November November
Levels: December July March November September
Data frames are actually a special kind of list. As when
constructing a data frame, we specify the elements of a
list using either name = value or just value for each
argument. Unlike a data frame, lists are not displayed in
columns, and each element can have a different length.
> ingredients <- list(cheese = c("Cheddar", "Swiss"),
+ meat = c("Ham","Turkey", "Bologna"))
> ingredients
$cheese
[1] "Cheddar" "Swiss"
$meat
[1] "Ham" "Turkey" "Bologna"
Note that the elements are not associated with one
another by position, as they were in a given row of a data
frame.
You will often encounter lists as return values of function
calls in R.
> x <- 1:100
> y <- x * 3 + rnorm(100)
> regression.results <- lm(y~x) # Regress y on x
> is.list(regression.results)
[1] TRUE
> names(regression.results)
[1] "coefficients" "residuals" "effects"
[4] "rank" "fitted.values" "assign"
[7] "qr" "df.residual" "xlevels"
[10] "call" "terms" "model"
> regression.results$coef # Note partial matching
(Intercept) x
0.2433211 2.9950379
Lists can be indexed by name, using $.
They can also be indexed like vectors, using []. The result
will be another list.
> regression.results[1]
$coefficients
(Intercept) x
0.08847387 2.99781408
To extract individual elements of a list, enclose the index
in [[]]. The result will be coerced to a simpler structure,
depending on the element.
> regression.results[[1]]
(Intercept) x
0.08847387 2.99781408
To summarize, the types of data structures we have
encountered so far are:
vector
matrix
array
list
data frame
Matrices and arrays are actually just stored as vectors
with shape information, so our discussions of vectorized
calculations hold for matrices and arrays as well.
This is NOT true for lists and data frames.
Sometimes we want an operation to be applied to
individual dimensions of a matrix or array, or to each
element of a list.
Here R provides something called the apply mechanism.
This again avoids the need for looping through each
dimension or over each element of a list, as we would do
in lower-level languages.

apply for matrices and arrays

lapply and sapply for lists


> args(apply)
function (X, MARGIN, FUN, ...)
NULL
There's that . . . argument again. The help page for apply
has this to say about the arguments:

X the array to be used.
MARGIN a vector giving the subscripts which the
function will be applied over. 1 indicates
rows, 2 indicates columns, c(1,2)
indicates rows and columns.
FUN the function to be applied: see Details.
In the case of functions like +, %*%,
etc., the function name must be backquoted
or quoted.
... optional arguments to FUN.
Let's first talk about the MARGIN argument. This is a vector
representing the dimension(s) to which FUN will be
applied. Another way to think about it is that MARGIN gives
the dimension(s) we want to preserve.
> A <- array(1:12, dim = c(2, 3, 2))
> apply(A, 1, sum)
[1] 36 42
> apply(A, 2, sum)
[1] 18 26 34
> apply(A, c(1, 2), sum)
[,1] [,2] [,3]
[1,] 8 12 16
[2,] 10 14 18
We've seen that the . . . argument can stand for an
arbitrary number of objects, as in the c or paste
functions. Here it allows us to pass arguments through
from one function to another. Here, from apply to FUN.
> A[1,2,2] <- NA
> apply(A, 1, sum)
[1] NA 42
> apply(A, 1, sum, na.rm = TRUE)
[1] 27 42
Note that by using . . ., the author of the apply function
didn't need to specify all possible arguments that FUN
could take.
Announcements:
Regina Wu, Hana Ueda, and John Jimenez will be helping
to answer your questions on bSpace and in lab.
Homework 1 is due next Wednesday night. There have
been some problems on bSpace. Please make sure you
get a verification email when you upload your assignment.
Today's topics:

Review of data structures and how to index them

The apply mechanism, revisited

Reading and writing data from within R

Keeping track of your commands

High-level graphics
The types of data structures and how to index them:
Vectors: [index]
> x[1:10]; x[-3]; x[x>3]
Matrices: [rowindex, colindex]
> m[1,2]; m[1:2, ]; m[, "a"]
Arrays: [index1, index2, ..., indexK]
> a[1, 3, ]; a[v==TRUE,,]
Data frames: [rowindex, colindex], $name
> cars$Cars6; cars[,3:4]; cars[cars$Junction == "7 to 8",]
Lists: $name, [index], [[index]]
> ingredients$meat; ingredients[1:2]; ingredients[[1]]
Note: both $ and [[]] can index only one element.
Last time we started talking about the apply function.
Let's review how this works for matrices.
> args(apply)
function (X, MARGIN, FUN, ...)
NULL
> m <- matrix(1:4, nrow = 2)
> m
[,1] [,2]
[1,] 1 3
[2,] 2 4
> apply(m, 2, paste, collapse = "")
[1] "12" "34"
(In this call, m is the matrix; 2 is which dimension to operate on - 1 for rows, 2 for columns; paste is the function; collapse = "" is an additional argument passed on to FUN.)
The lapply and sapply functions both apply a specied
function FUN to each element of a list. The former returns
a list object and the latter returns a vector when
possible. Again, both allow passing of additional
arguments to FUN through the . . . argument.
> random.draws <- list(x1 = rnorm(10), x2 = rnorm(100000))
> lapply(random.draws, mean)
$x1
[1] 0.0827779
$x2
[1] 0.001470952
> sapply(random.draws, mean)
x1 x2
0.082777901 0.001470952
The tapply function allows us to apply a function to
different parts of a vector, where the parts are indexed by a
factor or list of factors.
Single factor:
> grp <- factor(rep(c("Control", "Treatment"), each = 4))
> grp
[1] Control Control Control Control
[5] Treatment Treatment Treatment Treatment
Levels: Control Treatment
>
> effect <- rnorm(8) # Make up some fake data
> tapply(effect, INDEX = grp, FUN = mean)
Control Treatment
0.2180109 -0.2433582
Multiple factors:
> sex <- factor(rep(c("Female", "Male"), times = 4))
> sex
[1] Female Male Female Male Female Male Female Male
Levels: Female Male
> tapply(effect, INDEX = list(grp, sex), FUN = mean)
Female Male
Control 0.3634973 0.07252456
Treatment -0.2860360 -0.20068040
Many data sets are stored as tables in text files. The
easiest way to read these into R is using either the
read.table or read.csv function.
As you can see in help(read.table), there are quite a few
options that can be changed. Some of the important ones
are

file - name or URL

header - are column names at the top of the file?

sep - what divides elements of the table

na.strings - symbol for missing values, like 9999

skip - number of lines at the top of the file to ignore

read.csv is like read.table, but with different defaults for
CSV (comma separated value) files.
By default, all strings are read in as factors.
If a file doesn't contain column names, you can add them
after the fact. Here's how I created the R objects for the
assignment last week:
> cars <- read.csv("~/Desktop/friday13thcars.csv",
+ header = FALSE)
> cars[1:2,]
V1 V2 V3 V4 V5
1 1990 July 139246 138548 7 to 8
2 1990 July 134012 132908 9 to 10
> names(cars) <- c("Year", "Month", "Cars6",
+ "Cars13", "Junction")
> cars[1:2,]
Year Month Cars6 Cars13 Junction
1 1990 July 139246 138548 7 to 8
2 1990 July 134012 132908 9 to 10
Earthquakes Example:
Data from the California Geological Survey
> CAquakes <- read.table(file = "http://www.consrv.ca.gov/
cgs/rghm/quakes/Documents/ms49epicenters.txt", header =
TRUE)
> dim(CAquakes)
[1] 383 4
> CAquakes[1:3,]
Date Latitude Longitude M
1 18001011 36.8 -121.5 5.5
2 18001122 32.9 -117.8 6.3
3 18030000 34.2 -118.1 5.5
> mode(CAquakes$Date)
[1] "numeric"
How can we extract the years/months/days from the
Date column?
> datechar <- as.character(CAquakes$Date)
> substring(datechar, 1, 4)[1:3]
[1] "1800" "1800" "1803"
> CAquakes$Year <- as.numeric(substring(datechar, 1, 4))
> CAquakes$Month <- as.numeric(substring(datechar, 5, 6))
> CAquakes$Day <- as.numeric(substring(datechar, 7, 8))
> CAquakes[1:3,]
Date Latitude Longitude M Year Month Day
1 18001011 36.8 -121.5 5.5 1800 10 11
2 18001122 32.9 -117.8 6.3 1800 11 22
3 18030000 34.2 -118.1 5.5 1803 0 0
> CAquakes$Month[CAquakes$Month == 0] <- NA
> CAquakes$Day[CAquakes$Day == 0] <- NA
> summary(CAquakes$Month)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.000 4.000 6.000 6.281 9.000 12.000 2.000
> summary(CAquakes$Day)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.00 9.00 18.00 16.64 24.00 31.00 3.00
To save your R commands, use a plain text editor. Here
are two I like:

R for Mac and Windows has a built-in text editor.
Access commands related to it, such as New Document
and Save, under the File menu. One nice feature is that it
automatically prints the arguments for functions at the
bottom of the window.

The Emacs editor has a special package called ESS, for
Emacs Speaks Statistics, that makes working with .R files
very easy. It's installed on all the 342 lab computers. It
includes keyboard shortcuts to evaluate the code, rather
than cutting and pasting. (See http://stat.ethz.ch/ESS/
refcard.pdf.)

Whichever editor you choose, you can run all the
commands in a particular file using source("myfile.R").
A few more notes:
If you don't save your files as plain text, this won't work,
since R cannot interpret any extra formatting commands.
So I do NOT recommend you use Microsoft Word.
If you're cutting and pasting from the R session window
back into the text editor, be sure not to copy the prompt
(> symbol) as well.
If you want to keep your results in your .R file, put a # in
front of each line to mark them as comments.
Graphics in R
Part 1: High-level graphics
functions
Well be working in this section with many of Rs built-in
data sets. To see a list of them, just type
> data()
Data sets in package 'datasets':
AirPassengers Monthly Airline Passenger Numbers
1949-1960
BJsales Sales Data with Leading Indicator
BJsales.lead (BJsales)
Sales Data with Leading Indicator
BOD Biochemical Oxygen Demand
CO2 Carbon Dioxide uptake in grass
plants
ChickWeight Weight versus age of chicks on different diets
. . . many more
1. Barplots
> x <- 1:5; names(x) <- letters[1:5]
> barplot(x)
(Barplot of x, with one bar for each of the names a through e.)
> VADeaths
Rural Male Rural Female Urban Male Urban Female
50-54 11.7 8.7 15.4 8.4
55-59 18.1 11.7 24.3 13.6
60-64 26.9 20.3 37.0 19.3
65-69 41.0 30.9 54.6 35.1
70-74 66.0 54.3 71.1 50.0
> barplot(VADeaths, legend = TRUE)
(Stacked barplot of VADeaths with a legend for the age groups.)
This stacked barplot makes it hard to read anything but the bottom category and the total.
Making a good plot in R is often a matter of iterative
improvement.
> barplot(VADeaths, beside = TRUE, legend = TRUE)
(Side-by-side barplot of VADeaths, with one bar per age group within each population group.)
> barplot(VADeaths, beside = TRUE, legend = TRUE,
+ ylab = "Deaths per 1000",
+ main = "Death rates in Virginia, 1940")
(Side-by-side barplot titled "Death rates in Virginia, 1940", with y-axis label "Deaths per 1000" and a legend for the age groups.)
Saving your plots as graphics files
If you call a high-level plot command, R will automatically
start a graphics device or window.
To save the contents of the already open device to a file,
use dev.print.
> barplot(VADeaths, legend = TRUE)
> dev.print(device = pdf, file = "mybar.pdf",
+ height = 5, width = 6) # Inches
> dev.print(device = jpeg, file = "mybar.jpeg",
+ height = 500, width = 600) # Pixels
See help(device) for a list of other graphics formats.
To close the device (shut the window), type
> dev.off()
Alternatively, you can open up the device with a given file
name, run the commands, then use dev.off(). The device
itself won't appear as a window. This is useful if you want
to run your commands in BATCH mode.
> pdf(file = "mybar.pdf", height = 6, width = 6)
> barplot(VADeaths, legend = TRUE)
> dev.off()
2. Pie charts
> pie(c(1, 1, 2), labels = letters[1:3])
(Pie chart with slices labeled a, b, and c.)
Note that elements of the vector are normalized by their sum, so that the total gives 100% of the pie.
> Titanic
, , Age = Child, Survived = No
Sex
Class Male Female
1st 0 0
2nd 0 0
3rd 35 17
Crew 0 0
, , Age = Adult, Survived = No
Sex
Class Male Female
1st 118 4
2nd 154 13
3rd 387 89
Crew 670 3
. . . two more matrices not printed here, with survivors
Did all groups have an equal survival rate?
> apply(Titanic, 1, sum) # Total passengers, each class
1st 2nd 3rd Crew
325 285 706 885
> pie(apply(Titanic, 1, sum), main = "Total Passengers")
> pie(apply(Titanic[,,,"Yes"], 1, sum),
+ main = "Survivors")
(Two pie charts, "Total Passengers" and "Survivors", each divided into 1st, 2nd, 3rd, and Crew.)
Studies of human perception show we are not very good
at comparing areas, volumes, or angles.

When making bar plots, start the axis at zero and keep all bars the same width, so that length and area are proportional.

Try to avoid pie charts for anything requiring a precise comparison.
3. Histograms
> precip[1:4] # Average annual precipitation in cities
Mobile Juneau Phoenix Little Rock
67.0 54.7 7.0 48.5
> hist(precip)
(Histogram of precip, with x-axis "precip" and y-axis "Frequency".)
The height of the
bars shows the
number of
observations falling
into each bin.
There are several ways to change the cutoff points.
> hist(precip, breaks = 10) # Only a suggestion
> hist(precip, breaks = seq(min(precip), max(precip),
+ length = 11)) # Force it
(Two histograms of precip: one with breaks = 10, one with the cutpoints forced by seq.)
Again, let's add meaningful axis labels and a title.
> hist(precip, breaks = 10, xlab = "Inches",
+ main = "Yearly Average Rainfall for US Cities")
(Histogram titled "Yearly Average Rainfall for US Cities", with x-axis "Inches" and y-axis "Frequency".)
4. Boxplots
> boxplot(precip, ylab = "Inches",
+ main = "Yearly Average Rainfall for US Cities")
(Boxplot titled "Yearly Average Rainfall for US Cities", with y-axis "Inches", annotated with the parts of a boxplot:)
Outlier
Upper whisker - Upper quartile + 1.5 IQR
Upper quartile
Median
Lower quartile
Lower whisker - Lower quartile - 1.5 IQR
Outliers
Inter-quartile range (IQR)
> mtcars[1:2,1:5]
mpg cyl disp hp drat
Mazda RX4 21 6 160 110 3.9
Mazda RX4 Wag 21 6 160 110 3.9
> boxplot(mpg~cyl, data = mtcars, xlab = "Cylinders",
+ ylab = "Miles per Gallon",
+ main = "Fuel Consumption")
(Boxplots of mpg by cyl, titled "Fuel Consumption", with x-axis "Cylinders" and y-axis "Miles per Gallon".)
5. Scatterplots
> state.x77[1:2,1:4]
Population Income Illiteracy Life Exp
Alabama 3615 3624 2.1 69.05
Alaska 365 6315 1.5 69.31
> plot(state.x77[,"Income"], state.x77[,"Life Exp"])
(Scatterplot of state.x77[, "Life Exp"] against state.x77[, "Income"], with the default axis labels.)
> plot(state.x77[,"Income"], state.x77[,"Life Exp"],
+ xlab = "Per Capita Income (Dollars)",
+ ylab = "Life Expectancy (Years)",
+ main = "Income and Life Expectancy in U.S., 1970s")
How can we label the interesting cases?
(Scatterplot titled "Income and Life Expectancy in U.S., 1970s", with x-axis "Per Capita Income (Dollars)" and y-axis "Life Expectancy (Years)".)
See Stat133Lecture5.R
Announcement: John will have office hours tonight from
8-10pm in the bSpace chatroom, and I will have office
hours tomorrow from 2:30-3:30.
A few loose ends from last time:
arguments to par - mar, oma, xaxt, and yaxt
functions - legend, locator, axis, abline
Graphics, Continued:
The Dirty Dozen
From Wainer, H. (1984) How to Display Data Badly.
The American Statistician, 38, 137-147.
Additional images from Tufte, E. The Visual Display of
Quantitative Information.
1. Show as few data as possible.
An example with lots
of chart junk, not to
mention visual
distortion
How many data points?
2. Hide what data you do show.
3. Ignore the visual metaphor, or reverse it mid-graph.
4. Only order matters.
Are we supposed to
compare length, area,
or volume?
5. Graph data out
of context.
6. Change scales in
mid-axis.
7. Emphasize the trivial.
(Ignore the important.)
8. Jiggle the baseline.
Sometimes varying the baseline is ok, if the main points of
comparison are the first category and the total. This plot
is bizarre in other ways.
9. Austria first!
10. Label (a) Illegibly,
(b) Incompletely,
(c) Incorrectly, and
(d) Ambiguously.
11. More (dimensions) is murkier.
More dimensions AND
colors!
12. If it has been done well in the past, think of another
way to do it.
On the other hand, here are some creative plotting
techniques you may want to consider.
1. Letting the data points represent another variable.
(E.J. Marey)
2. Using small
multiples
3. Letting deformation represent a variable.
versus
An example from www.swivel.com
Critique:
x-axis labels poorly located - put them at election years
y-axis label misleading - these are numbers of counties
use of color could be improved (e.g. red/blue)
Some R code for you to play with:
parties <- read.csv(file = "graph_26277754.csv", header = TRUE)
parties
names(parties) <- c("Democrat", "Republican", "Year")
order(parties$Year)
parties <- parties[order(parties$Year),]
party.mat <- t(as.matrix(parties[,1:2]))
barplot(party.mat, beside = TRUE)
barplot(party.mat, beside = TRUE, col = c("blue", "red"))
barplot(party.mat, beside = TRUE, col = c("blue", "red"),
names.arg = parties$Year)
barplot(party.mat, beside = TRUE, col = c("blue", "red"),
names.arg = parties$Year, legend = TRUE)
barplot(party.mat, beside = TRUE, col = c("blue", "red"),
names.arg = parties$Year, legend = TRUE,
ylim = c(0, 50))
title(main = "California Counties\nMajority Party of Registered Voters",
xlab = "Election Year", ylab = "Number of Counties")
dev.print(pdf, file = "countyvotes.pdf", height = 6, width = 7)
But does this new graph tell the whole story?
1992 1996 2000 2004 2008
Democrat
Republican
0
1
0
2
0
3
0
4
0
5
0
California Counties
Majority Party of Registered Voters
Election Year
N
u
m
b
e
r

o
f

C
o
u
n
t
i
e
s
Programming in R
So far we have relied on the built-in functionality of R to
carry out our analyses. In the next three lectures, we'll
cover

Importing packages into R

How to write your own functions

The meaning of environments and variable scope

How to use flow control mechanisms like if and for

Debugging your code when something goes wrong

Timing and speeding up your code
If a particular package is already installed on your system,
you can access its contents by typing
> library(nameofpackage)
or
> require(nameofpackage)
The authors of the package write documentation for the
functions and datasets included in it, which you can read
as usual using help().
All packages come with a reference manual, which you
can access by visiting CRAN. Go to http://cran.r-
project.org/, click on Packages, then scroll down for the
particular package. This is just a hard copy of the help
pages. A few packages also come with a tutorial.
Writing your own functions in R
Think about the code weve been writing so far in R. It
has been

made up of a list of commands, one after another

specic to the particular dataset were working with.


Functions allow us to

organize our code into individual tasks

reuse the same code on different datasets by making


the data an argument to the function.
For example, last time we simulated some data.
> beta0 <- 3; beta1 <- 1
> m <- beta0 + beta1 * x
> y <- rnorm(100, mean = m, sd = 10)
Then we plotted it and added a linear regression line.
> plot(x, y)
> ls.mod <- lm(y~x)
> abline(a = ls.mod$coef[1], b = ls.mod$coef[2])
What we want to do now is encapsulate the last three
lines into a function that we can apply to any x and y
vector, not just the ones currently in our workspace.
Anatomy of a function
The syntax for writing a function is
function ( arglist ) body
Typically we assign the function to a particular name. This
should describe what the function does.
myfunction <- function (arglist) body
A function without a name is called an orphan function.
These can be very powerfully used with the apply
mechanism. Stay tuned....
function ( arglist ) body
The keyword function just tells R that you want to create
a function.
Recall that the arguments to a function are its inputs,
which may have default values.
> args(substring)
function (text, first, last = 1e+06)
Here, if we do not explicitly specify last when we call
substring, it will be assigned the default value of 1e+06,
which is very large. (Why do you think this was chosen?)
A few notes on writing the arguments list
When you're writing your own function, it's good practice
to put the most important arguments first. Often these
will not have default values.
This allows the user of your function to easily specify the
arguments by position, e.g. plot(xvec, yvec) rather than
plot(x = xvec, y = yvec).
Next we have the body of the function, which typically
consists of expressions surrounded by curly brackets.
Think of these as performing some operations on the
input values given by the arguments.
{
expression 1
expression 2
return(value)
}
The return expression hands control back to the caller of
the function and returns a given value. If the function
returns more than one thing, this is done using a named
list, for example
return(list(total = sum(x), avg = mean(x))).
In the absence of a return expression, a function will
return the last evaluated expression. This is particularly
common if the function is short.
For example, I could write the simple function:
my.mean <- function(x) sum(x)/length(x)
Here I don't even need brackets {}, since there is only
one expression.
A return expression anywhere in the function will cause
the function to return control to the user immediately,
without evaluating the rest of the function. This is often
used in conjunction with if statements, which we'll come
to later.
Returning to our example, let's make a function to carry
out these three steps for any vectors x and y.
> plot(x, y)
> ls.mod <- lm(y~x)
> abline(a = ls.mod$coef[1], b = ls.mod$coef[2])
What should we call it? (What does it do?)
What will be the arguments? Should they have default
values?
What (if anything) should the function return?
What do you actually need to type into R to create this
function?
Also, looking ahead, what might go wrong with the
function?
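Here is one possible answer, written as a sketch; the name plot.with.line and the choice to return the fitted model invisibly are illustrative, not the only reasonable design:
plot.with.line <- function(x, y, ...){
  # Scatterplot of y against x; extra graphical arguments pass through via ...
  plot(x, y, ...)
  # Fit the least squares line and add it to the plot
  ls.mod <- lm(y ~ x)
  abline(a = ls.mod$coef[1], b = ls.mod$coef[2])
  # Return the fitted model invisibly so the caller can inspect it if desired
  invisible(ls.mod)
}
One thing that could go wrong, for instance, is calling it with x and y of different lengths, or with vectors containing missing values.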
Environments and variable scope
R has a special mechanism for allowing you to use the
same name in different places in your code and have it
refer to different objects.
For example, you want to be able to create new variables
in your functions and not worry if there are variables
with the same name already in the workspace.
The solution relies on environments.
When you call a function, R creates a new workspace
containing just the variables defined by the arguments of
that function. This collection of variables is called a frame.
> x <- 1; y <- 2
> lookatframe <- function(a, b, c) print(ls())
> lookatframe(a = 1, b = 2, c = 3)
[1] "a" "b" "c"
However, R has a way of accessing variables that are not
in the frame created by the function.
> lookatframe <- function(a, b, c){print(ls()); print(x)}
> lookatframe(a = 1, b = 2, c = 3)
[1] "a" "b" "c"
[1] 1
What is happening is that R is looking for variables with
that name in a sequence of environments. An environment
is just a frame (collection of variables) plus a pointer to
the next environment to look in.
In our example, R didn't find the variable x in the
environment defined by the function lookatframe, so it
went on to the next one. In this case, this was our main
workspace, which is called the Global Environment.
The next environment to look in is called the parent
environment. For the environment created by a function
call, this is just the environment we were in when we
called the function.
If R reaches the Global Environment and still can't find
the variable, it looks in something called the search path.
This is a list of additional environments, which is used for
packages of functions and user attached data.
You can see the search path by typing search().
Computing in R consists of sequentially evaluating
statements. Flow control structures allow us to control
which statements are evaluated and in what order.
In R these consist of

if/else statements

for and while loops

break and next

the switch function


Statements can be grouped together using curly braces
{ and }. A group of statements is called a block. For
today's lecture, the word statement will refer to either a
single statement or a block.
The basic syntax for an if/else statement is
if ( condition ) {
statement1
} else {
statement2
}
First, condition is evaluated. If the first element of the
result is TRUE then statement1 is evaluated. If the first
element of the result is FALSE then statement2 is evaluated.
Only the first element of the result is used.
If the result is numeric, 0 is treated as FALSE and any other
number as TRUE. If the result is not logical or numeric, or
if it is NA, you will get an error.
When we discussed Boolean algebra before, we met the
operators & (AND) and | (OR).
Recall that these are both vectorized operators.
If/else statements, on the other hand, are based on a
single, global condition. So we often see constructions
using any or all to express something related to the
whole vector, like
if ( any(x < -1 | x > 1) )
warning("Value(s) in x outside the interval [-1,1]")
(We'll discuss error handling more next time.)
There is another set of operators, && and ||, which are not
vectorized. In fact, they ignore all but the rst element of
whatever you give them.
The advantage in using them is that they only evaluate as
much as they need to in order to return TRUE or FALSE.
For example, in A && B, rst A will be evaluated. If it is
FALSE, R will immediately evaluate to FALSE for the whole
expression, and will not evaluate B.
Likewise, in A || B, R will immediately evaluate to TRUE if A
is TRUE.
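A small illustration of the difference (the warning text shown is approximate):
> x <- -1
> x > 0 && sqrt(x) > 2   # sqrt(x) is never evaluated
[1] FALSE
> x > 0 & sqrt(x) > 2    # sqrt(-1) is evaluated anyway
[1] FALSE
Warning message:
In sqrt(x) : NaNs produced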
The result of an if/else statement can be assigned. For
example,
> if ( any(x <= 0) ) y <- log(1+x) else y <- log(x)
is the same as
> y <- if ( any(x <= 0) ) log(1+x) else log(x)
Also, the else clause is optional. Another way to do the
above is
> if( any(x <= 0) ) x <- 1+x
> y <- log(x)
Note that this version changes x as well.
If/else statements can be nested.
if (condition1 )
statement1
else if (condition2)
statement2
else if (condition3)
statement3
else
statement4
The conditions are evaluated, in order, until one evaluates
to TRUE. Then the associated statement/block is evaluated.
The statement in the nal else clause is evaluated if none
of the conditions evaluates to TRUE.
A note about formatting if/else statements:
When the if statement is not in a block, the else (if
present) must appear on the same line as statement1 or
immediately following the closing brace. For example,
if (condition) {statement1}
else {statement2}
will be an error if not part of a larger block and/or
function, because R will evaluate the rst line.
Some common uses of if/else clauses
1. With logical arguments to tell a function what to do
corplot <- function(x, y, plotit = TRUE){
if ( plotit == TRUE ) plot(x, y)
cor(x,y)
}
2. To verify that the arguments of a function are as
expected
if ( !is.matrix(m) )
stop("m must be a matrix")
3. To handle common numerical errors
ratio <- if ( x!=0 ) y/x else NA
4. In general, to control which block of code is executed
if ( dist == "normal" ){
return( rnorm(n) )
} else if (dist == "t"){
return(rt(n, df = 1, ncp = 0))
} else stop("distribution not implemented")
These if/else constructions are useful for global tests, not
tests applied to individual elements of a vector.
However, there is a vectorized function called ifelse.
> args(ifelse)
function (test, yes, no)
For each element of test, the corresponding element of
yes is returned if the element is TRUE, and the
corresponding element of no is returned if it is FALSE.
(In this signature, test is an R object that can be coerced to logical, and yes and no are R objects of the same size as test.)
Some examples of ifelse
ratio <- ifelse(x!=0, y/x, NA) # (Compare with earlier)
US.indicator <- ifelse(country == "USA", 1, 0)
plot(Income, Donations,
col = ifelse(party == "Republican", "red", "blue"))
Looping is the repeated evaluation of a statement or block
of statements.
Much of what is handled using loops in other languages
can be more efficiently handled in R using vectorized
calculations or one of the apply mechanisms.
However, certain algorithms, such as those requiring
recursion, can only be handled by loops.
There are two main looping constructs in R: for and while.
For loops
A for loop repeats a statement or block of statements a
predefined number of times.
The syntax in R is
for ( name in vector ){
statement
}
For each element in vector, the variable name is set to the
value of that element and statement is evaluated.
vector often contains integers, but can be any valid type.
Some examples of for loops:
fibseq <- rep(NA, 100)
fibseq[1:2] <- 1
for(i in 3:100)
fibseq[i] <- fibseq[i-1] + fibseq[i-2]
datafiles <- paste("data", 1:10, ".RData", sep = "")
for(file in datafiles)
load(file)
While loops
A while loop repeats a statement or block of statements
for as many times as a particular condition is TRUE.
The syntax in R is
while (condition){
statement
}
condition is evaluated, and if it is TRUE, statement is
evaluated. This process continues until condition
evaluates to FALSE.
Exercise:
The expression sample(c(1, 0), size = 1, prob = c(p, 1-p))
simulates a random coin flip, where the coin has
probability p of coming up heads, represented by a 1.
Write a function that simulates flipping a coin until a fixed
number of heads are obtained. It should take the
probability p and the total number of heads total and
return the trial on which the final head was obtained.
This produces a single sample from the negative binomial
distribution.
coin.flips <- function(total.heads, p = 0.5){
current.heads <- n.trials <- 0
while(current.heads < total.heads){
n.trials <- n.trials + 1
if(sample(c(1,0), size = 1, prob = c(p, 1-p))){
current.heads <- current.heads + 1
}
}
return(n.trials)
}
Announcement:
Graded homework 2 will be posted on bSpace after class.
It is due next Friday at 11pm.
Please take advantage of the lab sessions tomorrow to
get started.
Continued from last time...
The expression sample(c(1, 0), size = 1, prob = c(p, 1-p))
simulates a random coin flip, where the coin has
probability p of coming up heads, represented by a 1.
Write a function that simulates flipping a coin until a fixed
number of heads are obtained. It should take the
probability p and the total number of heads total and
return the trial on which the final head was obtained.
This produces a single sample from the negative binomial
distribution.
Now write a function to take multiple samples from the
negative binomial distribution.
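One possible sketch, reusing the coin.flips function above; the wrapper name and the use of replicate are illustrative choices:
neg.binom.samples <- function(B, total.heads, p = 0.5){
  # Repeat the single-sample experiment B times and collect the trial counts
  replicate(B, coin.flips(total.heads, p))
}
Calling neg.binom.samples(1000, total.heads = 5) then returns a vector of 1000 draws whose histogram approximates the negative binomial distribution.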
The break statement causes an exit from the innermost
loop that is currently being executed. The next statement
immediately causes control to return to the start of the
loop. These are typically used in conjunction with an if
statement.
> for(i in 1:10){
+ if(i == 5)
+ break
+ }
> i
[1] 5
Notice that the name being iterated over in a for loop,
in this case i, still exists once the loop is done. This tells
you where you were when the break occurred.
The syntax for the switch function is
switch(EXPR, ...)
where the additional arguments specied by ... may be
named. EXPR is evaluated. If the result is a number
between 1 and the number of additional arguments then
the corresponding element of ... is evaluated and
returned. If EXPR returns a character string then that
string is used to match the names of the elements in ....
switch(distribution, normal = rnorm(1),
t = rt(1, df = 1, ncp = 0),
poisson = rpois(1, lambda = 1),
stop("distribution not implemented"))
Catching errors
1. The function stop stops execution of the current
expression and prints a specified error message.
> showstop <- function(x){
+ if(any(x < 0)) stop("x must be >= 0")
+ return("ok")
+ }
> showstop(1)
[1] "ok"
> showstop(c(-1, 1))
Error in showstop(c(-1, 1)) : x must be >= 0
2. A similar function is stopifnot. It has the advantage of
being able to take multiple conditions.
> showstopifnot <- function(x){
+ stopifnot(x>=0, x%%2 == 1)
+ return("ok")
+ }
> showstopifnot(1)
[1] "ok"
> showstopifnot(c(1, -1))
Error: all(x >= 0) is not TRUE
> showstopifnot(c(1,2))
Error: x%%2 == 1 is not all TRUE
3. Finally, warning just prints a warning message without
stopping the execution of the function.
> ratio.warn <- function(x, y){
+ if(any(y == 0))
+ warning("Dividing by zero")
+ return(x/y)
+ }
> ratio.warn(x = 1, y = c(1, 0))
[1] 1 Inf
Warning message:
In ratio.warn(x = 1, y = c(1, 0)) : Dividing by zero
> ratio.warn(x = 1:3, y = 1:2)
[1] 1 1 3
Warning message:
In x/y : longer object length is not a multiple of shorter
object length
Some debugging strategies
1. The traceback function prints the sequence of calls that
led to the last error. This can show you where in your
function something is going wrong.
It may not even be in the function itself, but in another
function that is being called within the original function.
> cv <- function(x) sd(x/mean(x))
> cv(0)
Error in var(x, na.rm = na.rm) : missing observations in
cov/cor
> traceback()
3: var(x, na.rm = na.rm)
2: sd(x/mean(x))
1: cv(0)
2. If you have some idea where the error is occurring, you
can use print to check that key variables are what you
think they are.
3. Consider commenting out lines of your code where
the error might occur, then adding them back in one by
one.
4. To step through the function, expression by expression,
and be able to print out any variable at each step, use the
debug function. Use undebug to turn off debugging.
> coin.flips(total.heads = 5)
debugging in: coin.flips(total.heads = 5)
debug: {
current.heads <- n.trials <- 0
while (current.heads < total.heads) {
n.trials <- n.trials + 1
if (sample(c(1, 0), size = 1, prob = c(p, 1 - p))) {
current.heads <- current.heads + 1
}
}
return(n.trials)
}
Browse[1]> n
debug: current.heads <- n.trials <- 0
Browse[1]>
What's about to be evaluated
While in the debugger, you can use the following
commands:
'n' (or just return) - Advance to the next step.
'c' - continue to the end of the current context: e.g. to
the end of the loop if within a loop or to the end of the
function.
'Q' - exit the browser and the current evaluation and
return to the top-level prompt.
You can also evaluate any valid R expression. For
example, you can type the names of variables to see their
current values.
Efficient programming
The first rule of efficient programming in R is to make use
of vectorized calculations and the apply mechanisms
whenever possible.
You can check how much time it takes to evaluate any
expression by wrapping it in system.time(). Units are in
seconds.
> system.time(normal.samples <- rnorm(1000000))
user system elapsed
0.196 0.013 0.221
(In the output, user is the CPU time for the R process, system is the CPU time used by the operating system on behalf of R, and elapsed is the wall clock time.)
> x <- y <- 1:100000
> time1 <- system.time({
+ z <- x[1] + y[1]
+ for(i in 2:100000)
+ z <- c(z, x[i] + y[i])})
> time2 <- system.time({
+ z <- rep(NA, 100000)
+ for(i in 1:100000)
+ z[i] <- x[i] + y[i]})
> time3 <- system.time(x+y)
> time1/time3
user system elapsed
41687.5 80872.0 83769.5
> time2/time3
user system elapsed
276.5 4.0 279.5
Simulation in R
First, a brief review of probability theory.
Probability allows us to quantify statements about the
chance of an event taking place. There are two formal
definitions of probability:

Frequentist: long-run relative frequency of an event occurring in repeated experiments.

Subjective/Bayesian: an individual's degree of belief in the occurrence of the event, given the evidence.

We will focus on the first definition. However, the basic
laws of probability are the same under both definitions.
A probability distribution assigns a number P(A) to each
event in the sample space (set of all possible outcomes).
P(A) must be between 0 and 1.
We may characterize the distribution using the cumulative
distribution function or CDF, defined by
F(x) = P(X <= x)
We call X the random variable.
Exercise: Graph the CDF for the random variable equal
to the number of heads in two independent coin flips.
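For a fair coin, P(X = 0) = 1/4, P(X = 1) = 1/2, and P(X = 2) = 1/4, so the CDF jumps to 0.25, 0.75, and 1 at x = 0, 1, and 2. A sketch of one way to draw it in R:
> x <- 0:2
> Fx <- cumsum(dbinom(x, size = 2, prob = 0.5))
> Fx
[1] 0.25 0.75 1.00
> plot(stepfun(x, c(0, Fx)), main = "CDF of the number of heads in two flips")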
Three important properties of the CDF are
1. F is non-decreasing: x1 < x2 implies F(x1) <= F(x2)
2. F is normalized: lim_{x -> -Inf} F(x) = 0 and lim_{x -> Inf} F(x) = 1
3. F is right-continuous: lim_{y -> x+} F(y) = F(x)
In fact a function mapping the real line to [0,1] is a CDF if
and only if it satisfies these three conditions.
The inverse CDF or quantile function is defined by
F^{-1}(q) = inf{ x : F(x) > q }
If you're not familiar with inf, just think of it as the
minimum.
Exercise: What does the inverse CDF for the coin flipping
example look like?
A random variable X is discrete if it takes countably many
values. We define the probability mass function for X by
f(x) = P(X = x)
A random variable X is continuous if there exists a
function f, called the probability density function (PDF), with
1. f(x) >= 0 for all x
2. int_{-Inf}^{Inf} f(x) dx = 1
3. P(a < X < b) = int_a^b f(x) dx
Note that
F(x) = int_{-Inf}^{x} f(t) dt
In R, there are many built-in functions for handling
distributions, some of which we have seen already.
The prefixes of the functions indicate what they do:
d - evaluate the PDF
p - evaluate the CDF
q - evaluate the inverse CDF
r - take a random sample
Note that the functions prefixed by d, p, and q are all
calculating mathematical quantities.
However, once we have a random sample, we can also
estimate the PDF, CDF, and inverse CDF....
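For example, for the normal distribution (we have already used the r version, rnorm, to take random samples):
> dnorm(0)       # PDF of the standard normal at 0
[1] 0.3989423
> pnorm(1.96)    # CDF: P(X <= 1.96)
[1] 0.9750021
> qnorm(0.975)   # inverse CDF (quantile function)
[1] 1.959964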
A histogram is a type of density estimator.
Recall that
int_a^b f(x) dx = P(a < X < b)
For each bin of a histogram (with lower endpoint a and
upper endpoint b), we count the number of observations
falling into the bin, i.e.
sum_{i=1}^{n} I{ a < X_i <= b }
If we properly normalize each of these quantities, the
total area of the rectangles in the histogram is one, just
like the area under a PDF. You can do this automatically in
R with hist(x, prob = TRUE).
The empirical CDF uses the same sort of counting idea.
Define
F_hat(x) = (1/n) * sum_{i=1}^{n} I{ X_i <= x }
We are estimating a probability by a proportion. Another
way to think of it is that we estimate the PDF by a
discrete distribution which assigns probability 1/n to each
data point.
Exercise: Write a function which calculates the empirical
CDF. It should take vectors sample and x.
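One possible solution sketch (R's built-in ecdf function is another way to check your answer):
my.ecdf <- function(sample, x){
  # For each value in x, the proportion of sample values less than or equal to it
  sapply(x, function(xi) mean(sample <= xi))
}
> my.ecdf(sample = c(1, 2, 2, 5), x = c(0, 2, 10))
[1] 0.00 0.75 1.00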
Finally, the quantile function in R returns the sample
quantiles, defined by
F_hat^{-1}(q) = inf{ x : F_hat(x) > q }
Note that we just plug in the empirical CDF to the
definition of the quantile function.
We'll talk more next time about specific distributions.
For now, let's consider the role that simulation can play in
helping us understand statistics.
We can think of probability theory as complementary to
statistical inference: probability reasons from the
distribution to the observed data, while inference reasons
from the observed data back to the distribution.
A statistic is a function of a sample, for example the sample mean or a sample quantile. Statistics are often used as estimators of quantities of interest about the distribution, called parameters. Estimators are random variables; parameters are not.
In simple cases, we can study the distribution of the statistic analytically. For example, we can prove that under mild conditions the standard error of the sample mean decreases at a rate proportional to 1/√n.
In more complicated cases, we turn to simulation. Whereas mathematical results are symbolic, in terms of arbitrary parameters and sample size, on a computer we must specify particular values.
A single experiment looks something like this:
particular choice of parameters and sample size  →  {X_1, X_2, ..., X_n}  →  single statistic
To study the distribution of the statistic, we repeat the whole experiment B times. The larger B is, the better our approximation of the distribution.
Steps in carrying out a simulation study:
1. Specify what makes up an individual experiment: sample
size, distributions, parameters, statistic of interest.
2. Write an expression or function to carry out an
individual experiment and return the statistic.
3. Determine what inputs, if any, to vary.
4. For each combination of inputs, repeat the experiment
B times, providing B samples of the statistic.
5. For each combination of input, summarize the empirical
CDF of the statistic of interest, e.g. look at the sample
mean or standard error.
6. State and/or plot the results.
Example: Find the standard error of the median when
sampling from the normal distribution. How does it vary
with the sample size and with the standard deviation?
Recall our example of a simulation study from last time:
Find the standard error of the median when sampling
from the normal distribution. How does it vary with the
sample size and with the standard deviation?
Steps in carrying out a simulation study:
1. Specify what makes up an individual experiment: sample
size, distributions, parameters, statistic of interest.
2. Write an expression or function to carry out an
individual experiment and return the statistic.
3. Determine what inputs, if any, to vary.
4. For each combination of inputs, repeat the experiment
B times, providing B samples of the statistic.
5. For each combination of input, summarize the empirical
CDF of the statistic of interest, e.g. look at the sample
mean or standard error.
6. State and/or plot the results.
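As one way to carry this out, here is a rough sketch (the particular n, sigma, and B values are arbitrary choices):
# One experiment: draw n normals with the given sd, return the median
one.exp <- function(n, sigma) median(rnorm(n, sd = sigma))

B <- 1000
design <- expand.grid(n = c(10, 50, 100), sigma = c(1, 2))
design$se.median <- NA
for (i in 1:nrow(design)) {
  meds <- replicate(B, one.exp(design$n[i], design$sigma[i]))
  design$se.median[i] <- sd(meds)   # estimated standard error of the median
}
design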
A quick review of some other probability distributions available in R (abbreviation - distribution):
unif - Uniform(a, b): f(x) = 1/(b-a) for a ≤ x ≤ b, and 0 otherwise
exp - Exponential(λ): f(x) = λ e^{-λx} for x > 0, and 0 otherwise
pois - Poisson(λ): f(x) = e^{-λ} λ^x / x!, for x = 0, 1, 2, ...
binom - Binomial(n, p): f(x) = (n choose x) p^x (1-p)^{n-x}, for x = 0, 1, ..., n
What if a distribution is not available in R? For instance, there is no built-in Bernoulli distribution. Well, in this case you could just use binom with size = 1, or sample(0:1, 1, prob = c(1-p, p)).
We can also derive certain distributions from others. For example, last week we sampled from the negative binomial distribution by explicitly counting the number of trials until we got the desired number of heads. Can we sample from the Bernoulli in some other way?
If the inverse CDF (the quantile function) is available in closed form, there is a simple method for generating random values from the distribution.
The inverse CDF method:
1. Generate n samples from a standard uniform distribution. Call this vector u. In R, u <- runif(n).
2. Take y <- F.inv(u), where F.inv computes the inverse CDF of the distribution we want.
We can prove that the CDF of the random values produced in this way is exactly F. Recall that for a standard uniform U,
P(U ≤ u) = 0 for u < 0,  u for 0 ≤ u ≤ 1,  and 1 for u > 1.
Therefore,
P(Y ≤ y) = P(F⁻¹(U) ≤ y) = P(F(F⁻¹(U)) ≤ F(y)) = P(U ≤ F(y)) = F(y)
We used the fact that F is nondecreasing in the second equality.
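For example, the Exponential(λ) distribution has CDF F(x) = 1 - e^{-λx}, so F⁻¹(u) = -log(1-u)/λ, and the method looks like this (a quick sketch; the value of lambda is arbitrary):
lambda <- 2
u <- runif(10000)
y <- -log(1 - u) / lambda              # inverse CDF method
summary(y)
summary(rexp(10000, rate = lambda))    # compare to R's built-in sampler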
Example: Triangle distribution with endpoints at a and b
and center at c.
We need to:
1. Find the CDF
2. Find the inverse CDF
3. Write a function to carry
out the inverse CDF method.
[Figure: density of the triangle distribution on (a, b) with peak at c.]
f(x) = 2(x-a) / ((b-a)(c-a)) for a ≤ x < c,  2(b-x) / ((b-a)(b-c)) for c ≤ x ≤ b,  and 0 otherwise
We ended last time by talking about the inverse-CDF
method:
1. Generate n samples from a standard uniform
distribution. Call this vector u. In R, u <- runif(n).
2. Take y <- F.inv(u), where F.inv computes the
inverse CDF of the distribution we want.
Example: Triangle distribution with endpoints at a and b
and center at c.
We need to:
1. Find the CDF
2. Find the inverse CDF
3. Write a function to carry
out the inverse CDF method.
[Figure: density of the triangle distribution on (a, b) with peak at c, as before.]
f(x) = 2(x-a) / ((b-a)(c-a)) for a ≤ x < c,  2(b-x) / ((b-a)(b-c)) for c ≤ x ≤ b,  and 0 otherwise
Using the fact that the total area is one, and that the area of a triangle is 1/2 base x height, we find that
F(x) = 0 for x < a
F(x) = (x-a)² / ((b-a)(c-a)) for a ≤ x < c
F(x) = 1 - (b-x)² / ((b-a)(b-c)) for c ≤ x ≤ b
F(x) = 1 for x > b
Inverting this function, we have
F⁻¹(y) = a + √(y (b-a)(c-a)) for 0 ≤ y < (c-a)/(b-a)
F⁻¹(y) = b - √((1-y)(b-a)(b-c)) for (c-a)/(b-a) ≤ y ≤ 1
Now we can write our function.
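Here is one way to write it, as a sketch (the name rtriangle and the default endpoints are just illustrative):
rtriangle <- function(n, a = 0, b = 2, c = 1) {
  u <- runif(n)
  breakpt <- (c - a) / (b - a)                    # F(c)
  ifelse(u < breakpt,
         a + sqrt(u * (b - a) * (c - a)),         # invert the left piece
         b - sqrt((1 - u) * (b - a) * (b - c)))   # invert the right piece
}

x <- rtriangle(10000)
hist(x, prob = TRUE)   # should look like the triangle density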
We'll finish off this section on simulation by going over one more example of designing a simulation study, dealing with the risk of the James-Stein estimator.
Let
Y_i ~ N(θ_i, σ²) independently, i = 1, ..., d.
In 1956, Charles Stein rocked the world of statistics when he proved that the maximum likelihood estimator (MLE) is inadmissible (that is, we can always find a better estimator) in this simple problem when d ≥ 3.
The MLE for the vector θ is just the vector of observations Y, which seems intuitively sensible.
To state Stein's result, we first have to talk about risk.
Speaking somewhat more formally, a loss function describes the consequences of using a particular estimator θ̂ when the true parameter is θ.
A common loss function is the squared error
L(θ, θ̂) = (θ - θ̂)²
which we can generalize to multiple dimensions by summing the squared errors in each dimension:
L(θ, θ̂) = Σ_{i=1}^d (θ_i - θ̂_i)² = ||θ - θ̂||²
But θ̂ is random, because it depends on the data. We call the expected value of the loss for given θ the risk function.
It's easy to calculate the risk (under squared error loss) of the MLE in this problem:
E[L(θ, Y)] = E[ Σ_{i=1}^d (θ_i - Y_i)² ] = Σ_{i=1}^d E[(θ_i - Y_i)²] = dσ²
What Stein proved was that when d ≥ 3, we can find another estimator whose risk is always less than that of the MLE, no matter what θ is.
A famous example is the James-Stein estimator:
θ̂_JS = (1 - (d-2)σ² / ||Y||²) Y
We'll now study the risk of the James-Stein estimator through simulation and compare it to the risk of the MLE.
Note that for a given data set in the simulation, we can calculate the loss, because we know θ. We can approximate the risk (the expected value of the loss) by generating many data sets and then calculating the sample mean of the vector of losses.
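A sketch of the study (the particular θ, d, σ, and B values here are arbitrary choices):
d <- 10; sigma <- 1; B <- 5000
theta <- rep(1, d)                       # true parameter, known in the simulation
loss.mle <- loss.js <- numeric(B)
for (b in 1:B) {
  y <- rnorm(d, mean = theta, sd = sigma)
  theta.js <- (1 - (d - 2) * sigma^2 / sum(y^2)) * y   # James-Stein estimate
  loss.mle[b] <- sum((y - theta)^2)
  loss.js[b]  <- sum((theta.js - theta)^2)
}
mean(loss.mle)   # approximates the MLE's risk, d * sigma^2
mean(loss.js)    # typically smaller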
Recall the steps one more time:
1. Specify what makes up an individual experiment: sample
size, distributions, parameters, statistic of interest.
2. Write an expression or function to carry out an
individual experiment and return the statistic.
3. Determine what inputs, if any, to vary.
4. For each combination of inputs, repeat the experiment
B times, providing B samples of the statistic.
5. For each combination of input, summarize the empirical
CDF of the statistic of interest, e.g. look at the sample
mean or standard error.
6. State and/or plot the results.
One more quick note about summarizing your simulation results....
When you are reporting the means of your simulated distributions, it's a good idea to add an indication of your uncertainty as well.
The Central Limit Theorem tells us that the sample mean of the simulated distribution is, with a sufficiently large sample, approximately normally distributed. We can use this to form a 95% confidence interval:
X̄ ± 2 SD / √B
Note that we have control over B, so we can make the intervals as narrow as we like!
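For example, continuing the James-Stein sketch above (loss.js and B as defined there), an approximate 95% interval for the estimated risk is:
mean(loss.js) + c(-2, 2) * sd(loss.js) / sqrt(B)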
UNIX Basics
Operating systems
An operating system (OS) is a piece of software that
controls the hardware and other pieces of software on
your computer.
The most popular OS today, Microsoft Windows, uses a
graphical user interface (GUI) for you to interact with the
OS. This is easy to learn but not very powerful.
UNIX, on the other hand, is hard at first to learn, but it allows you vastly more control over what your computer can do. There are actually many different flavors of UNIX, but what we'll cover applies to almost all of them.
The differences between, say, Windows and UNIX stem from an underlying philosophy about what software should do.
Windows: Programs are large, multi-functional. Example: Microsoft Word.
UNIX: Many small programs, which can be combined to get the job done. A toolbox approach. Example: stop all my (cgk's) processes whose name begins with "cat" and a space:
ps -u cgk | grep "[0-9] cat " | awk '{print $2}' | xargs kill
The UNIX kernel is the part of the OS that actually carries out basic tasks.
The UNIX shell is the user interface to the kernel. Like flavors of UNIX, there are also many different shells. For this course, it doesn't matter which one you use. The default on the lab computers is called tcsh.
(The screenshot shows the prompt - yours will differ.)
The first things you need to know about UNIX are how to work with directories and files. Technically, everything in UNIX is a file, but it's easier to think of directories as you would folders on Windows or Mac OS.
Directories are organized in an inverted tree structure. To see the directory you're currently in, type the command pwd (present working directory).
There are two special directories: the top level directory, named /, is called the root directory. Your home directory, named ~, contains all your files. For Mary, ~ and /users/mary mean the same thing.
To create a new directory, use the command mkdir. Then
to move into it, use cd.
$ pwd
/Users/cgk
$ mkdir unixexamples
$ cd unixexamples
$ ls
$ ls -a
. ..
ls -a means to show all files, including the hidden files starting with a dot (.).
The two hidden files here are special and exist in every directory: "." refers to the current directory, and ".." refers to the directory above it.
This brings us to the distinction between relative and absolute path names. (Think of a path like an address in UNIX, telling you where you are in the directory tree.)
You may have noticed that I typed cd unixexamples, rather than cd /Users/cgk/unixexamples. The first is the relative path; the second is the absolute path.
To refer to a file, you need to either be in the directory where the file is located, or you need to refer to it using a relative or absolute path name.
Example:
$ pwd
/Users/cgk/unixexamples
$ echo "Testing 1 2 3" > test.txt
$ ls
test.txt
$ cat test.txt
Testing 1 2 3
$ cd ..
$ cat test.txt
cat: test.txt: No such file or directory
$ cat unixexamples/test.txt
Testing 1 2 3
Note that file names must be unique within a particular directory, but having, say, both /Users/cgk/test.txt and /Users/cgk/unixexamples/test.txt is OK.
Commands, arguments, and options
We've already started using these; now let's define them more precisely. The general syntax for a UNIX command looks like this:
$ command -options argument1 argument2
(The number of arguments may vary.) An argument comes at the end of the command line. It's usually the name of a file or some text.
Example: move/rename a file.
$ mv test.txt newname.txt
Options come between the name of the command and the arguments, and they tell the command to do something other than its default. They're usually prefaced with one or two hyphens.
$ pwd
/Users/cgk
$ rmdir unixexamples
rmdir: unixexamples: Directory not empty
$ rm -r unixexamples
$ ls
Desktop Movies Rlibs
Documents Music Sites
Icon? Pictures Work
Library Public bin
MathematicaFonts README
mathfonts.tar
To look at the syntax of any particular UNIX command, type man (for manual) and then the name of the command.
The two most important parts of the man page are labeled SYNOPSIS and DESCRIPTION. These are very much like the Usage and Arguments sections in R's help pages. SYNOPSIS shows you the syntax for a particular command; bracketed arguments are optional. DESCRIPTION tells you what all the options do.
Press the space bar to scroll forward through the man page, b to go backward, and q to exit.
You can refer to multiple files at once using wildcards. The most common one is the asterisk (*). It stands in for anything (including nothing at all).
$ ls
AGing.txt Bing.xt Gagging.text Going.nxt ing.ext
$ ls G*
Gagging.text Going.nxt
$ ls *.xt
Bing.xt
The question mark (?) is similar, except it can only
represent a single character.
$ ls ?ing.xt
Bing.xt
Finally, square brackets match any single character from the set listed within the brackets.
$ ls
AGing.txt Bing.xt Gagging.text Going.nxt ing.ext
$ ls [A-G]ing.*
Bing.xt
The wildcards can also be combined.
$ ls *G*
AGing.txt Gagging.text Going.nxt
$ ls *i*.*e*
Gagging.text ing.ext
We'll cover text matching in a lot more detail next week when we talk about regular expressions.
A recap of commands so far:
pwd - print working directory
ls - list contents of current directory
ls -a - list contents, including hidden files
mkdir - create a new directory
cd dname - change directory to dname
cd .. - change to parent directory
cd ~ - change to home directory
mv - move or rename a file
rm - remove a file
rm -r - remove a directory and all lower-level files
Here are a few more handy ones:
wc -l - count the number of lines in a file
$ wc -l halfdeg.elv
134845 halfdeg.elv
head -n x - look at the first x lines of a file
$ head -n5 halfdeg.elv
Tyndall Centre grim file created on 13.05.2003 at 13:52 by Dr. Tim Mitchell
.elv = elevation (km)
0.5deg MarkNew elev
[Long=-180.00, 180.00] [Lati= -90.00, 90.00] [Grid X,Y= 720, 360]
[Boxes= 67420] [Years=1975-1975] [Multi= 0.0010] [Missing=-999]
tail - like head, but look at the end of the file
cp - copy a file
$ cp unixexamples/Bing.xt .
cat - print the contents of a file
echo - write arguments to the screen
The real power in UNIX, however, comes from stringing these commands together. We'll talk about this next time.
Today, we'll talk about
Some interfaces between R and UNIX
- getting the results of UNIX commands from within R
- running R in BATCH mode and monitoring its progress
Putting UNIX commands together using
- redirection
- pipes
Manipulating data in UNIX using filtering commands in combination with redirection and pipes
The system function in R allows you to execute a UNIX
command and either print the result to the screen or
store it as an R object (argument intern = TRUE).
> system("ls")
datagen.R group1.dat group2.dat group3.dat
> system("head -n2 *.dat")
==> group1.dat <==
height weight
65.4 134.9
==> group2.dat <==
height weight
65.7 145.7
==> group3.dat <==
height weight
63.8 138.9
Goal: Read in all the data files and put them in a single matrix with an extra column for group. Referring to your UNIX handout, what should our strategy be?
> nlines <- system("wc -l *.dat", intern = TRUE)
> nlines
[1] " 3 group1.dat" " 4 group2.dat"
[3] " 3 group3.dat" " 10 total"
> nfiles <- length(nlines) - 1
> nlines <- as.numeric(substring(nlines[1:nfiles], 7, 8)) - 1
> nlines
[1] 2 3 2
> hw <- matrix(NA, nrow = sum(nlines), ncol = 3)
> startline <- 1
> for(group in 1:nfiles){
+ temp <- read.table(file = paste("group", group,
+ ".dat", sep = ""),
+ header = TRUE)
+ index <- startline:(startline+nrow(temp)-1)
+ hw[index,1:2] <- as.matrix(temp)
+ hw[index,3] <- as.matrix(group)
+ startline <- startline + nrow(temp)
+ }
> colnames(hw) <- c("height", "weight", "group")
BATCH jobs in R are useful whenever
- You have a long job and you want to be able to use the computer for other things in the meantime.
- You want to log out of the machine while the job is running and come back to it later.
- You're running the job on a remote machine, and again you want to log out.
- You want to be courteous to other users of the machine by decreasing the priority of the job.
To start a BATCH job, use
nice R CMD BATCH scriptfile.R outfile.Rout &
Here nice gives the job lower priority, outfile.Rout collects what would otherwise be printed to the screen, and the trailing & indicates that you want to run the job in the background.
A few other things to keep in mind:
- scriptfile.R should require no input from the user. For example, don't use identify.
- Graphics should be created by surrounding the relevant code with pdf(file = "filename.pdf") and dev.off(), rather than using dev.print(pdf, file = "filename.pdf").
- In simulations it can be helpful to include a line like if(i %% 10 == 0) print(paste("Iteration", i)). Then you can monitor the job using tail -f outfile.Rout.
- By default, the workspace will be saved in .RData. You can also save specific objects using the save function.
(A skeleton script combining these tips appears below.)
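Putting these together, a BATCH script might look roughly like this (file names and the loop body are placeholders):
# scriptfile.R -- runs unattended, so no interactive input
pdf(file = "results.pdf")                 # send graphics to a file
B <- 1000
out <- numeric(B)
for (i in 1:B) {
  out[i] <- mean(rnorm(100))              # placeholder for the real computation
  if (i %% 10 == 0) print(paste("Iteration", i))   # progress visible via tail -f
}
hist(out)
dev.off()
save(out, file = "out.RData")             # save specific objects for later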
To see information about currently running processes,
just type top. There are arguments to top that allow you
to sort by CPU usage, memory, etc. See man top for more
details.
The number at the beginning of the line is called the
process ID, or PID.
To kill a particular process, type kill PID, substituting the
correct PID.
Sometimes you want to see the list of all processes in a non-interactive way. For example, you might want to pipe the results through a filter, as we'll discuss next.
On BSD UNIX systems (like the Apple machines in the
lab), ps -aux will list all processes.
On other systems, ps -ef does the trick.
We'll use ssh and sftp to start a new UNIX session on a remote machine and to send files back and forth between our computer and the remote computer.
To log into a statistics department machine, type
ssh -l uname scf-ugNN.berkeley.edu
where uname is the user name you've been assigned for the course, and NN is a number between 01 and 27.
You will be prompted for your password. Your starting directory is your home directory on the network. Note this is the same no matter which department computer you log into!
To transfer files back and forth, first type
sftp -l uname scf-ugNN.berkeley.edu
You can use pwd, cd, and ls just as you would at the usual prompt to find the right remote file or directory. You can also use lpwd, lcd, and lls to move around the local machine.
To copy a file from the remote computer to your computer, type get nameoffile. To copy a file from your computer to the remote computer, type put nameoffile. Type exit to quit.
Redirection and pipes are really at the heart of the UNIX
philosophy, which is to have many small tools, each one
suited for a particular job.
Redirection refers to changing the input and output of
individual commands/programs.
The standard input or STDIN is usually your keyboard.
The standard output or STDOUT is usually your
terminal (monitor).
As an example, if we type cat at the prompt and hit return, the computer will accept input from us until it hits an end-of-file (EOF) marker, which on most systems is CTRL-D. Each time we hit return, our input is printed to the terminal.
We can redirect as follows:
> redirects STDOUT to a file
< redirects STDIN from a file
>> redirects STDOUT to a file, but appends rather than overwriting
(There's also a <<, but its use is more advanced than we'll cover.)
Here are two examples:
$ cat > temp.txt
$ sort < temp.txt
Try it out!
The idea behind pipes is that rather than redirecting
output to a le, we redirect it into another program.
Another way to say this is that STDOUT of one program
is used as STDIN to another program.
A common use of pipes is to view the output of a
command through a pager, like less. This is particularly
useful if the output is very long.
$ ls | less
Note that the data flows from left to right. See the UNIX handout for more details on less.
A program through which data is piped is called a filter. We've already seen a few filters: head, tail, and wc.
Two more common filters are
sort - sort lines of text files alphabetically
uniq - strip duplicate lines when they follow each other
$ cat somenumbers.txt
What will be the output of:
cat somenumbers.txt | sort
cat somenumbers.txt | uniq
cat somenumbers.txt | sort | uniq
$ cat somenumbers.txt | sort
One
One
One
Three
Two
Two
Two
$ cat somenumbers.txt | uniq
One
Two
One
Two
Three
One
$ cat somenumbers.txt | sort | uniq
One
Three
Two
Today: A quick wrap up on UNIX filters, then on to regular expressions. There will be a short assignment posted on bSpace later today to give you some practice with these.
Recall the filters we've seen so far:
head - first lines of a file
tail - last lines of a file
wc - word (or line or character or byte) count
sort - sort lines of text files alphabetically
uniq - strip duplicate lines when they follow each other
Here are two more useful filters:
grep - print lines matching a pattern (We'll talk about patterns more shortly -- for now just think of the pattern as requiring an exact match.)
$ grep save *.R
Print all lines in any file ending with .R which contain the word (pattern) save.
cut - select portions of each line of a file, e.g. using a space as the field delimiter:
$ cut -d " " -f 3-7
Here are some practice problems:
On many systems, the file /etc/passwd shows information about the registered users for the machine. A quick look at the file shows there are also some notes at the top.
1. Determine the total number of users.
2. Sort the users and display the information for the last five users, alphabetically speaking.
3. Show just the usernames for these entries.
4. Put the usernames in a file called lastusers.txt.
Regular Expressions
Regular expressions give us a powerful way of matching
patterns in text data.
Example 1: election data from three different datasets.
We know these are the same places, but how can the
computer recognize that?
Example 2: Creating variables that predict whether an
email is SPAM
- numbers or underscores in the sending address
- all capital letters in the subject line
- fake words like Vi@graa
- number of exclamation points in the subject line
- received time in the current time zone
Example 3: Mining the State of the Union addresses
How long are the speeches? How do the distributions of
certain words change over time? Which presidents have
given similar speeches?
The language of regular expressions allows us to carry out some common tasks, such as
- extracting pieces of text in non-standard format - for example, finding all the links in an HTML document
- creating variables from information found in text
- cleaning and transforming text into a uniform format, resolving inconsistencies in format between files
- mining text by treating documents directly as data
Most importantly, we do this all programmatically rather than by hand, so that we can easily reproduce our work if needed.
Regular expressions are constructed from three things:
Literal characters, each matched only by the character itself.
Character classes, each matched by any member of the specified class. For example, [A-Z].
Modifiers, which operate on literal characters, character classes, or combinations of the two.
First let's go under the hood a little bit and think about the algorithms we could use to identify pattern matches made up of literal characters.
What strategy is this code using?
> string
[1] "St John the Baptist Parish"
> if (substring(string, 1, 3) == "St ")
+ newstring <- paste("St. ",
+ substring(string, 4, nchar(string)), sep = "")
> newstring
[1] "St. John the Baptist Parish"
Can you see any problems with it?
A more general approach is to split the input string into a vector of characters and then iterate over those characters looking for the particular string.
> string <- "The Slippery St Frances"
> characters <- unlist(strsplit(string, ""))
> characters
[1] "T" "h" "e" " " "S" "l" "i" "p" "p" "e" "r" "y"
[13] " " "S" "t" " " "F" "r" "a" "n" "c" "e" "s"
> possible <- which(characters == "S")
> possible
[1] 5 14
> substring(string, possible, possible + 2)
[1] "Sli" "St "
The regular expression St is made up of three literal
characters. The regular expression matching engine does
something very similar to what we just did.
The Slippery St Frances
|| |||
Found S________|| |||
Followed by t?__| No |||
Is it S?________| No ...||| Keep looking for an S
|||
Found S_________________|||
Followed by t?___________|| Yes
Followed by a blank?______| Yes - A match!
Luckily, we don't actually need to write our own functions for replacement. The R functions gsub() and sub() look for a pattern and replace it within a string with some other text.
The "g" in gsub() refers to "global". It changes all the matches, whereas sub() only replaces the first match.
> strings <- c("a test", "and one and one is two",
+ "one two three")
> gsub("one", "1", strings)
[1] "a test" "and 1 and 1 is two" "1 two three"
> sub("one", "1", strings)
[1] "a test" "and 1 and one is two" "1 two three"
What about finding fake words such as rep1!c@ted or Vi@graa? In this case, we're looking for numbers and/or punctuation surrounded by regular letters.
These concepts of numbers, punctuation, and regular letters get at the idea of equivalent characters or character classes.
We can enumerate any collection of characters within [ ]. Example: [Tt]his
If we put a caret (^) as the first character, this indicates that the equivalent characters are the complement of the enumerated characters.
The character - when used within the character class pattern identifies a range. Examples: [0-9], [A-Za-z]
If we want to include the character - in the set of characters to match, put it at the beginning of the character set to avoid confusion. Example: [-+][0-9]
Note that here we've created a pattern from a sequence of two sub-patterns.
There are also built-in character sets for commonly used collections. These can be used in conjunction with other characters, for example [[:digit:]_].
[[:alpha:]] - all alphabetic characters
[[:digit:]] - digits 0123456789
[[:alnum:]] - all alphabetic and numeric characters
[[:lower:]] - lower case alphabetic
[[:upper:]] - upper case alphabetic
[[:punct:]] - punctuation characters
[[:blank:]] - blank characters, i.e. space or tab
The grep() function in R works in somewhat the same
way as the UNIX command grep, although rather than
returning the matching strings, it returns the indices of the
elements for which there was a match.
However, you can easily use the indices to grab the
corresponding strings.
> Addresses
[1] "Cari Kaufman <cgk@stat.berkeley.edu"
[2] "depchairs03-04@uclink.berkeley.edu"
[3] "Chancey <_arkbound@deutschland.de>"
> grep("[[:digit:]_]", Addresses)
[1] 2 3
> Addresses[grep("[[:digit:]_]", Addresses)]
[1] "depchairs03-04@uclink.berkeley.edu"
[2] "Chancey <_arkbound@deutschland.de>"
Going back to our fake words example, what will this
match?
[[:alpha:]][[:digit:][:punct:]][[:alpha:]]
Can you foresee any problems with it?
> subjectLines
[1] "Re: 90 days" "Fancy rep1!c@ted watches" "It's me"
> grep("[[:alpha:]][[:digit:][:punct:]][[:alpha:]]",
subjectLines)
[1] 2 3
We can either remove the apostrophe first:
> newString <- gsub("'", "", subjectLines)
> grep("[[:alpha:]][[:digit:][:punct:]][[:alpha:]]",
newString)
[1] 2
Or we can specify the particular punctuation marks we're looking for:
> grep("[[:alpha:]][[:digit:]!@#$%^&*():;?,.][[:alpha:]]",
subjectLines)
[1] 2
gregexpr() shows exactly where the pattern was found:
> newString
[1] "Re: 90 days" "Fancy rep1!c@ted watches" "Its me"
> gregexpr("[[:alpha:]][[:digit:][:punct:]][[:alpha:]]",
newString)
[[1]]
[1] -1
attr(,"match.length")
[1] -1
[[2]]
[1] 12
attr(,"match.length")
[1] 3
[[3]]
[1] -1
attr(,"match.length")
[1] -1
(So there was no match in the first and third elements, and a match of length 3 starting at position 12 in the second.)
Did we miss anything?? We didn't find "p1!c" because it consists of four characters: a letter, a digit, a punctuation mark, and another letter.
To search for the more general pattern of any number of
digits or punctuation marks between letters, we use
[[:alpha:]][[:digit:][:punct:]]+[[:alpha:]]
The plus sign indicates that members from the second
character class (digits and punctuation) may appear one or
more times.
The plus sign is an example of a meta character.
More meta characters:
^ - as the first character in the pattern, an anchor for the beginning of the line; as the first character in [ ], exclude these characters
$ - end of line anchor
? - character or sub-pattern occurs zero or one time
+ - character or sub-pattern occurs one or more times
* - character or sub-pattern occurs zero or more times
. - any single character
[ ] - character class
- - range within a character class
( ) - group or sub-pattern
| - alternation, i.e. one sub-pattern or another
{ } - quantifier: {n} means exactly n repeats of the sub-pattern, {n,m} means n to m repeats, and {n,} means n or more repeats
What will this match?
^[^[:lower:]]+$
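The pattern above matches lines made up entirely of characters that are not lowercase letters. A quick check in R (grepl returns TRUE/FALSE rather than indices):
> grepl("^[^[:lower:]]+$", c("ABC 123", "abc", "HELLO"))
[1]  TRUE FALSE  TRUE
> grepl("^[[:digit:]]{3}-[[:digit:]]{4}$", c("555-1234", "5551234"))
[1]  TRUE FALSE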
The position of a character in a pattern determines whether it is treated as a meta character. Examples: [-+*/], [1-9]*
When you want to refer to one of these symbols literally, you need to precede it with a backslash (\). However, backslashes already have a special meaning in R's character strings -- they indicate control characters like newline (\n). So, to refer to these symbols in R's regular expressions, you need to precede them with two backslashes.
The characters for which you need to do this are:
. ^ $ + ? ( ) [ ] { } | \
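For example (a small illustration):
> gsub("\\.", "_", "my.file.txt")   # escaped: matches only the literal periods
[1] "my_file_txt"
> gsub(".", "_", "my.file.txt")     # unescaped . matches any single character
[1] "___________"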
Announcements:
There will be a new homework posted on bSpace today and due Wed., Oct. 29.
I will be out of town Wednesday - Sunday and able to respond to email only about once per day.
-- In class Thursday, Dr. Deborah Nolan will be our guest speaker. She'll cover some concepts you'll need in the new homework.
-- Daisy will be holding extra office hours on Thursday from 3:30-4:30 in the 342 Evans Hall lab.
Last time, we learned that regular expressions are made up of
1. Literal characters
2. Character classes
3. Modifiers
Today we'll cover some advanced concepts, including
- getting text data into R
- greedy matching
- tagging and back references
We'll do this all in the context of learning how to automatically grab text data from the web.
If I select a state here, I get a table with vote counts for that state. The URL is
http://www.usatoday.com/news/politicselections/vote2004/PresidentialByCounty.aspx?oi=P&rti=G&tf=l&sp=CA
If I click a few more states, I see the only thing that changes is the abbreviation at the end. So it's easy to create the URLs in R using paste.
The most flexible way to read text files, like web pages, into R is to use the readLines function. The result is a character vector where each element is a line in the file.
state <- "California"
# Use the built-in state names and abbreviations
abb <- state.abb[state == state.name]
web <- readLines(paste("http://www.usatoday.com/news/politicselections/vote2004/PresidentialByCounty.aspx?oi=P&rti=G&tf=l&sp=", abb, sep = ""))
If we wanted a single long character vector, we could simply say
paste(web, collapse = "")
However, it's often easier to process line by line.
Most web browsers allow you to look at the source file creating what you see on the screen.
The first county was Alameda. Searching for it in the file, we see some lines like this:
<td class="notch_medium" width="153"><b>County</b></td><td class="notch_medium" align="Right" width="65"><b>Total
Precincts</b></td><td class="notch_medium" align="Right" width="70"><b>Precincts Reporting</b></td><td class="notch_medium" align="Right"
width="60"><b>Bush</b></td><td class="notch_medium" align="Right" width="60"><b>Kerry</b></td><td class="notch_medium" align="Right"
width="60"><b>Nader</b></td>
</tr><tr>
<td class="notch_white" width="153"><b>Alameda</b></td><td class="notch_white" align="Right" width="65">1,141</td><td
class="notch_white" align="Right" width="70">1,141</td><td class="notch_white" align="Right" width="60">107,489</td><td class="notch_white"
align="Right" width="60">326,675</td><td class="notch_white" align="Right" width="60">0</td>
</tr><tr>
<td class="notch_light" width="153"><b>Alpine</b></td><td class="notch_light" align="Right" width="65">5</td><td
class="notch_light" align="Right" width="70">5</td><td class="notch_light" align="Right" width="60">311</td><td class="notch_light"
align="Right" width="60">373</td><td class="notch_light" align="Right" width="60">0</td>
</tr><tr>
<td class="notch_white" width="153"><b>Amador</b></td><td class="notch_white" align="Right" width="65">57</td><td
class="notch_white" align="Right" width="70">57</td><td class="notch_white" align="Right" width="60">10,479</td><td class="notch_white"
align="Right" width="60">6,211</td><td class="notch_white" align="Right" width="60">0</td>
</tr><tr>
Some additional searching reveals that width=153 only occurs in the lines of the table. Knowing something about HTML would have helped, but it wasn't necessary.
Using grep in R, we can find just those lines containing width=153. Removing the first one (the header of the table), we have a bunch of lines like this:
[1] "\t\t\t\t<td class=\"notch_white\" width=
\"153\"><b>Alameda</b></td><td class=\"notch_white\"
align=\"Right\" width=\"65\">1,141</td><td class=
\"notch_white\" align=\"Right\" width=\"70\">1,141</td><td
class=\"notch_white\" align=\"Right\" width=
\"60\">107,489</td><td class=\"notch_white\" align=\"Right
\" width=\"60\">326,675</td><td class=\"notch_white\"
align=\"Right\" width=\"60\">0</td>"
Goal: grab the county name, votes for Bush, and votes for
Kerry.
[1] "\t\t\t\t<td class=\"notch_white\" width=
\"153\"><b>Alameda</b></td><td class=\"notch_white\"
align=\"Right\" width=\"65\">1,141</td><td class=
\"notch_white\" align=\"Right\" width=\"70\">1,141</td><td
class=\"notch_white\" align=\"Right\" width=
\"60\">107,489</td><td class=\"notch_white\" align=\"Right
\" width=\"60\">326,675</td><td class=\"notch_white\"
align=\"Right\" width=\"60\">0</td>"
We can make our lives a little easier by first removing all the HTML tags. Note that they all start with < and end with >. So we might think to try
gsub(pattern = "<.*>", replacement = "", x = web[rowsoftable])
The issue is that the regular expression engine performs something called greedy matching. This means that it will always try to find the longest pattern that satisfies the match.
To get around this, we have to consider what we really want to match: in this case it isn't anything at all (denoted by .*), it's anything except a right angle bracket (denoted by [^>]*).
> newtable <- gsub(pattern = "<[^>]*>", replacement = " ",
+ x = web[rowsoftable])
> newtable[1:3]
[1] "\t\t\t\t Alameda 1,141 1,141 107,489 326,675 0 "
[2] "\t\t\t\t Alpine 5 5 311 373 0 "
[3] "\t\t\t\t Amador 57 57 10,479 6,211 0 "
[1] "\t\t\t\t Alameda 1,141 1,141 107,489 326,675 0 "
[2] "\t\t\t\t Alpine 5 5 311 373 0 "
[3] "\t\t\t\t Amador 57 57 10,479 6,211 0 "
There are a couple different ways we can now tackle the
problem. As an exercise, try doing it using strsplit.
(What regular expression can you use as the delimiter?)
Another way to do it is to use tagging and back references.
The idea is that we write a regular expression for the
whole pattern, and then we mark what we later want to
refer to using ().
> gsub(pattern = "([[:alnum:]]+)@([[:alnum:].]+)",
+ replacement = "\\1", "cgk@stat.berkeley.edu")
[1] "cgk"
We can do the same thing here, with pattern
pattern <- "^\t\t\t\t (.*) [[:digit:],]+ +[[:digit:],]+ +([[:digit:],]+) +([[:digit:],]+).*$"
counties <- gsub(pattern, "\\1", newtable)
bush <- gsub(pattern, "\\2", newtable)
kerry <- gsub(pattern, "\\3", newtable)
bush <- as.numeric(gsub(",", "", bush))
kerry <- as.numeric(gsub(",", "", kerry))
Make sure you understand what each part of the regular expression is doing.
Now we'll switch over to R to finish the job.
XML
Other than the text data we've just learned to work with, most of the data sets we've seen have been in the form of ASCII tables.
Date Time Lat Lon Depth Mag
1968/01/12 22:19:10.35 36.6453 -121.2497 6.84 3.00
1968/02/09 13:42:37.05 37.1527 -121.5448 8.49 3.00
1968/02/21 14:39:48.10 37.1783 -121.5780 6.95 3.80
1968/03/02 04:25:53.94 36.8343 -121.5447 5.35 3.00
1968/03/17 15:07:02.12 37.3088 -121.6615 4.39 3.00
1968/03/21 21:54:59.94 37.0378 -121.7407 11.86 4.30
Advantages:
- easy to read, write, and process
- in standard cases, don't need a lot of extra information
But these advantages can quickly disappear....
XML (for eXtensible Markup Language) is a standard for semantic, hierarchical representation of data. Relationships between pieces of data reflect relationships in the real world.
<state>
  <gml:name abbreviation="AL">
    ALABAMA
  </gml:name>
  <county>
    <gml:name>
      Autauga County
    </gml:name>
    <gml:location>
      <gml:coord>
        <gml:X>
          -86641472
        </gml:X>
        <gml:Y>
          32542207
        </gml:Y>
      </gml:coord>
    </gml:location>
  </county>
  ...
Some positive aspects of XML are
- data is self-describing
- format separates content from structure
- data can be easily merged and exchanged
- file is human-readable
- but file is also easily machine-generated
- standards are widely adopted
Some negative aspects are
- XML documents can be very verbose and hard to read
- it's so general that it's hard to develop tools for all cases
- files can be quite large due to a high amount of redundancy
XML has become quite popular in many scientific fields, and it is standard in many web applications for the exchange and visualization of data. We'll learn how to 1) read/process it, and 2) create it.
We'll do both of these things from within R, but first let's start with an overview of what XML documents look like.
The basic unit of XML code is called an element or node. It is made up of both markup and content. Markup consists of tags, attributes, and comments.
<CYL> 6 </CYL> <!-- CYL element with content 6 -->
<CYL> </CYL> <!-- CYL element with no content -->
<CYL/> <!-- another way to write it -->
<CYL size="2"> 6 </CYL> <!-- CYL element with an attribute -->
In the first line, <CYL> is the start tag, </CYL> is the end tag, 6 is the content, and the comment can go anywhere. In the last line, size is an attribute.
XML is well-formed if it obeys certain syntax rules. The rules for tags are
1. Tag names are case-sensitive; start and end tags must match exactly.
2. No spaces are allowed between the < and the tag name.
3. Tag names must begin with a letter and contain only alphanumeric characters.
4. An element must have both an opening and closing tag unless it is empty.
5. An empty element that does not have a closing tag must be of the form <tagname/>.
6. Tags must nest properly.
An example
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Edited with XML Spy v2006 (http://www.altova.com) -->
<CATALOG>
<PLANT>
<COMMON>Bloodroot</COMMON>
<BOTANICAL>Sanguinaria canadensis</BOTANICAL>
<ZONE>4</ZONE>
<LIGHT>Mostly Shady</LIGHT>
<PRICE>$2.44</PRICE>
<AVAILABILITY>031599</AVAILABILITY>
</PLANT>
<PLANT>
<COMMON>Columbine</COMMON>
<BOTANICAL>Aquilegia canadensis</BOTANICAL>
<ZONE>3</ZONE>
<LIGHT>Mostly Shady</LIGHT>
<PRICE>$9.37</PRICE>
<AVAILABILITY>030699</AVAILABILITY>
</PLANT>
<PLANT>
<COMMON>Marsh Marigold</COMMON>
<BOTANICAL>Caltha palustris</BOTANICAL>
<ZONE>4</ZONE>
<LIGHT>Mostly Sunny</LIGHT>
<PRICE>$6.81</PRICE>
<AVAILABILITY>051799</AVAILABILITY>
</PLANT>
</CATALOG>
Note how indentation makes it easier to check that the tags are correctly nested. The first two lines are the XML declaration and processing instructions.
In addition, we have the rules
7. All attributes must appear in quotes in a name = "value" format.
8. Isolated markup characters must be specified via entity references. For example, < is specified by &lt; and > is specified by &gt;.
9. All XML documents must contain a root node containing all the other nodes.
This brings us to the tree structure of an XML document.
There is only one root or document node in the tree, and
all the other nodes are contained within it.
We might also think of these other nodes as being
descendants of the root node. We use the language of a
family tree to refer to relationships between nodes.
- parents
- children
- siblings
- ancestors
- descendants
The terminal nodes in a tree are also known as leaf nodes.
Content always falls in a leaf node.
<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
<book category="COOKING">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
<book category="CHILDREN">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="WEB">
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>
</bookstore>
leaf nodes
Working with XML in R
The first thing we need to do is load the XML library.
> library(XML)
Then read the XML file into R using xmlTreeParse
> doc <- xmlTreeParse("plant_catalog.xml")
and extract the root node using xmlRoot.
> root <- xmlRoot(doc)
> class(root)
[1] "XMLNode"
Aside:
R allows for object-oriented programming. We're not going to do any of this style of programming ourselves, but it's helpful to know how to interpret it when we see it.
A class is a definition of a type of object. A class contains slots that are used to hold class-specific information. A particular object is called an instance of the class.
Methods are functions that are specialized for a certain class. Rather than being fully object-oriented, R uses what are called generic functions. These determine the type of object being operated on and then call the appropriate function. To see the classes for which the function has been defined, use methods(functionname).
Ok, back to XML.
xmlTreeParse implements what is called the DOM (Document Object Model) parser. It reads the entire file into memory.
We don't have time to cover it, but you should be aware of another parsing model called SAX (Simple API for XML). It reads the document incrementally and is more memory efficient, but it is trickier to use.
The tree structure is represented in R as a list of lists.
We can access an element within a node (i.e., a child),
using the usual [[ ]] indexing for lists.
> ## Look at one plant node
> oneplant <- root[[1]]
> class(oneplant)
[1] "XMLNode"
> print(oneplant)
<PLANT>
<COMMON>Bloodroot</COMMON>
<BOTANICAL>Sanguinaria canadensis</BOTANICAL>
<ZONE>4</ZONE>
<LIGHT>Mostly Shady</LIGHT>
<PRICE>$2.44</PRICE>
<AVAILABILITY>031599</AVAILABILITY>
</PLANT>
We've reached the leaf nodes.
We can access the content of the leaf nodes using the
function xmlValue.
> xmlValue(oneplant[['COMMON']])
[1] "Bloodroot"
> xmlValue(oneplant[['BOTANICAL']])
[1] "Sanguinaria canadensis"
Note: this removes the markup:
> oneplant[['COMMON']]
<COMMON>Bloodroot</COMMON>
There are special XML versions of lapply and sapply, named xmlApply and xmlSApply. Each takes an XMLNode object as its primary argument. They iterate over the node's children, invoking the given function.
Like lapply, xmlApply returns a list. Like sapply, xmlSApply returns a simpler data structure if possible. In this case, we can use xmlSApply to extract the names of all the plants.
> commons <- xmlSApply(root, function(x){
+ xmlValue(x[['COMMON']])})
> head(commons)
PLANT PLANT PLANT
"Bloodroot" "Columbine" "Marsh Marigold"
PLANT PLANT PLANT
"Cowslip" "Dutchman's-Breeches" "Ginger, Wild"
Now we can create the full dataframe.
> getvar <- function(x, var) xmlValue(x[[var]])
> res <- lapply(names(root[[1]]), function(var){
+ xmlSApply(root,getvar,var)})
> plants <- data.frame(res)
> names(plants) <- names(root[[1]])
What is this command doing?
An overview of where we are:
This week: finishing XML, then spatial data
No class next Tuesday; starting SQL on Thursday
We'll have one more graded assignment (spatial data) and one more short assignment (SQL).
Nov 18 -- I'll assign the group projects. These will involve analyzing the results of the presidential election.
Part 1 -- due Dec 1 -- data collection and planning
Part 2 -- due Dec 17 (day of the final) -- carrying out the analysis and completing your report
The final is on paper (not computer) and will take 1-2 hrs.
First, a quick review of sapply and lapply. Remember:
1) lapply and sapply can operate on either a list or a vector.
What will be the results of the following?
lapply(1:3, function(x){x^2})
sapply(1:3, function(x){x^2})
myList <- list(a=1, b=2, c=3)
lapply(myList, function(x){x^2})
sapply(myList, function(x){x^2})
2) You can always include additional arguments. Examples:
sapply(myList, log)
sapply(myList, log, base = 10)
sapply(myList, function(x, pow){x^pow}, pow = 3)
3) If the results of sapply cannot be simplied, then sapply
and lapply will return the same thing.
myList <- list(a=1:2, b=3:5, c=6)
lapply(myList, function(x){x^2})
sapply(myList, function(x){x^2})
(All of these are just examples to illustrate the differences
between sapply and lapply. Of course if you have a
vector, you can just use vectorized operations, e.g. vec^2.)
Finally, do you remember what plain old apply does?
The functions xmlApply and xmlSApply work like lapply and
sapply, except their arguments are XML nodes.
xmlApply returns a list (the elements may themselves be
XML nodes).
If it can, xmlSApply returns a vector or matrix. If not, it
also returns a list.
With all of these functions, always ask yourself
1) What do I want to operate on (iterate over)?
2) What do I want to produce?
Spatial Data
In most introductory statistics textbooks, we assume that
when there is more than one observation, they are iid
(independent and identically distributed). This makes the
theory of estimators using these observations very
analytically tractable.
However, one can easily think of instances where this is
not the case.
-- observations of some genetic variable will tend to be
closer within families than between families
-- variables that change over time can have a distribution
for current values that depends on past values
-- variables that are measured over space can have similar
values for nearby locations
Spatial data can be divided into three main types:
1. Geostatistical data associate a value (univariate case) or values (multivariate case) with a particular location. Clearly it wouldn't be appropriate to treat these data as iid. Often we fit a parametric model for the correlation. One of the most common goals is prediction of the variable of interest at new locations.
[Figure: map of point observation locations for a geostatistical data set, with a color scale running from roughly -2 to 3.]
2. Lattice data consist of measurements that are particular
to a certain geographic region, such as a county.
It's common to see lattice data when the data collection
method is controlled
by government
agencies. The census
is one example. A
lot of epidemiological
data looks like this
too.
Modelling tends to
focus on the structure
of neighborhoods.
3. Point process data consist of the locations of particular
events. If there are also values associated with the event,
this is known as a marked point process.
The earthquake data in your homework can be modeled
as a marked point process in space and time.
Some questions to study with
this type of data are
- Is the rate of events the same
everywhere? (Is it a homogeneous
point process?)
- Given this underlying rate, are
the events independent? Or do
they tend to cluster together?
Geostatistical data uses a continuous representation of space, with locations s ∈ S ⊂ R^3.
We write the covariance between the variable of interest at any two locations as
Cov(Z(s), Z(s')).
Note:
Cov(X, Y) = E[(X − EX)(Y − EY)]
          = sqrt(Var(X)) sqrt(Var(Y)) Cor(X, Y)
          = σ^2 Cor(X, Y)   if Var(X) = Var(Y) = σ^2
A few simplifying assumptions about the structure of the covariance function are
1. stationarity - the covariance between Z(s) and Z(s') depends only on the relative locations, i.e. s − s'.
2. isotropy - the covariance between Z(s) and Z(s') depends only on the distance between the locations, i.e. ||s − s'||.
If both stationarity and isotropy hold, then the covariogram
can be used to study the form of the spatial covariance
function.
The (empirical) covariogram is a scatterplot that puts
distance on the x-axis and covariance on the y-axis.
If we had multiple replications, such as independent
observations in time, then for each pair of locations we
could calculate their covariance over those replications
and add one point to the covariogram. (How many
points would there be in all?)
However, if we have just a single replication, we can look
at pairs whose distances are within a given window of the
distance we want to plot.
Let
I_{d,ε} = {(i, j) : i ≠ j, ||s_i − s_j|| ∈ (d − ε, d + ε)}.
Then the empirical covariogram is
Ĉ(d) = (1 / #I_{d,ε}) Σ_{(i,j) ∈ I_{d,ε}} (Z_i − Z̄_(i))(Z_j − Z̄_(j)),
where Z̄_(i) is the mean over all the i's appearing in I_{d,ε} and Z̄_(j) is the mean over all the j's.
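A minimal R sketch of this calculation (assuming z holds the observations, d.mat the matrix of pairwise distances between their locations, d.seq the distances at which to evaluate, and eps the half-width of the window; these object names are not from the slides):

# For each distance d, average the products of centered observations
# over all pairs (i, j) whose separation falls within (d - eps, d + eps)
empirical.covariogram <- function(z, d.mat, d.seq, eps) {
  sapply(d.seq, function(d) {
    in.bin <- d.mat > d - eps & d.mat < d + eps & upper.tri(d.mat)
    pairs <- which(in.bin, arr.ind = TRUE)   # the (i, j) index pairs in I
    mean((z[pairs[, 1]] - mean(z[pairs[, 1]])) *
         (z[pairs[, 2]] - mean(z[pairs[, 2]])))
  })
}

# Example use, with locs an n x 2 matrix of coordinates:
# d.mat <- as.matrix(dist(locs))
# plot(d.seq, empirical.covariogram(z, d.mat, d.seq, eps = 25))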
Recall from last time:
Geostatistical data consist of observations associated with
a set of locations.
Today we'll talk about how to interpolate a geostatistical
data set.
Example: Average
surface ozone at
monitoring stations
Our goal: Estimate
ozone over a grid
of locations
[Figure: map of average surface ozone at the monitoring stations -- longitude -95 to -75, latitude 34 to 42, values roughly 35 to 65]
Plotting the data
You'll need to load the packages fields and maps.
From fields, we use
image.plot - color plot of a data set on a regular grid
as.image - take an irregularly spaced data set and put it on a grid with nrow rows and ncol columns.
From maps, we use
map("state", add = TRUE)
Other databases are available - try "county", "usa", "world".
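For example, a sketch along these lines (assuming vectors lon, lat, and ozone holding the station coordinates and values; the object names are not from the slides):

library(fields)
library(maps)

# Put the irregularly spaced observations onto a regular grid and plot
ozone.image <- as.image(ozone, x = cbind(lon, lat))
image.plot(ozone.image, xlab = "lon", ylab = "lat")

# Overlay the state boundaries from the maps package
map("state", add = TRUE)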
Now we plot the covariogram and use it to estimate the spatial covariance.
[Figure: empirical covariogram -- estimated covariance (0 to 30) versus distance (0 to 500)]
The idea is that we fit a parametric model to the covariogram. In other words, we specify a functional form for the covariance, where the function depends on certain parameters, and then we estimate those parameters.
A common parametric model takes the covariance to decay exponentially with distance:
Cov(Z(s), Z(s')) = σ^2 exp{ −||s − s'|| / ρ }
Here σ^2 is the variance at a single location, and ρ controls the rate of decay; higher ρ means higher correlation at a given distance.
One way to estimate the parameters in the covariance
function is to use nonlinear least squares.
[Figure: empirical covariogram with the fitted exponential covariance curve]
Minimize the sum of squared residuals over the parameters using the function nls.
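A sketch of that fit, assuming d.seq and cov.hat hold the distances and estimated covariances from the covariogram (the starting values are rough guesses):

# Fit sigma2 * exp(-d / rho) to the empirical covariogram
cov.fit <- nls(cov.hat ~ sigma2 * exp(-d.seq / rho),
               start = list(sigma2 = 30, rho = 100))
sigma2.hat <- coef(cov.fit)["sigma2"]
rho.hat    <- coef(cov.fit)["rho"]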
Now we need to determine the set of locations at which
we want to do the prediction.
Remember expand.grid from when we were running
simulation studies? It works here too, just be sure to
convert to a matrix, rather than a data frame.
You need to determine the resolution of the grid. Higher
resolution looks better, but takes longer.
Also be sure not to extrapolate, i.e. predict beyond the
range of the data.
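For instance (the half-degree resolution is an arbitrary choice):

# Prediction grid covering the range of the observed locations,
# converted from a data frame to a matrix
pred.grid <- as.matrix(expand.grid(lon = seq(min(lon), max(lon), by = 0.5),
                                   lat = seq(min(lat), max(lat), by = 0.5)))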
A fairly dense grid
[Figure: a regular grid of prediction locations (lon, lat) covering the region of the observations]
Ok, now it is time to predict.
We will use a linear combination of the observations to
give us a prediction at a new location. The only question
then is what weights to assign to each observation.
We choose the weights to minimize the variance of the
resulting estimate. The best predictor minimizes this
quantity, and we call it the BLUP, for best linear unbiased
predictor.
It turns out the weights for the BLUP are easy to derive if
you know the true covariance function....
First we form the covariance matrix for the observations.
This contains the covariances for every pair of
observations.
We calculate a similar matrix for each pair where one
location comes from the set of observations and the
other location comes from the new grid.
Σ_{ij} = Cov(Z(s_i), Z(s_j)) = σ^2 exp{ −||s_i − s_j|| / ρ }
Σ*_{ij} = Cov(Z(s_i), Z(s*_j)) = σ^2 exp{ −||s_i − s*_j|| / ρ },   where s*_j is the j-th new location.
If the vector of observations Z has mean zero, the BLUP is simply
Σ*' Σ^(−1) Z
where Σ*' Σ^(−1) is the weight matrix -- it depends on the particular values of the parameters as well as on the distances.
This procedure is called kriging after Danie Krige, a South African mining engineer.
Now, if we don't know the parameters, we can plug in our estimates when we compute the covariance matrices. This predictor is no longer the BLUP. (Why not?)
We can also estimate the variance in our predictions.
If the parameters are known, then the vector of variances of the BLUP at each predicted location is
σ^2 1_m − diag{ Σ*' Σ^(−1) Σ* }
We plug in our estimates to approximate this variance.
The variance will be small when the predicted location is close to the original locations.
In fact, with the covariance function we're currently using, the variance will be exactly zero if we predict at a location where we already have an observation. (It interpolates.)
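Putting the pieces together, here is a sketch of the kriging computations, assuming obs.loc is the n x 2 matrix of observation locations, z the (mean zero) observations, pred.grid the new locations, and sigma2.hat and rho.hat the estimates from the nls fit; rdist, from the fields package, computes the matrix of distances between two sets of locations:

library(fields)
library(maps)

# Covariances among the observations (n x n) and between the
# observations and the prediction locations (n x m)
Sigma      <- sigma2.hat * exp(-rdist(obs.loc, obs.loc) / rho.hat)
Sigma.star <- sigma2.hat * exp(-rdist(obs.loc, pred.grid) / rho.hat)

# Kriging weights and predictions: t(Sigma.star) %*% solve(Sigma) %*% z
weights <- solve(Sigma, Sigma.star)          # n x m
z.pred  <- drop(t(weights) %*% z)            # one prediction per grid point

# Approximate prediction variances at the new locations
pred.var <- sigma2.hat - colSums(Sigma.star * weights)

# Plot the predicted surface
image.plot(as.image(z.pred, x = pred.grid))
map("state", add = TRUE)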
Announcements:
There will be no lab tomorrow. I'll assign your group
projects next Tuesday, and we'll have a short in-lab
assignment on SQL next Friday.
Be sure to attend on Tuesday so you can make plans with
your group.
If you want to practice the SQL statements we talk about
today, there are lots of free SQL interpreters (programs
that mimic a database server) online.
For example, see http://sqlcourse.com/select.html.
Databases and SQL
A database is a collection of data with information about
how the data are organized (meta-data). A database server
is like a web server, but responds to requests for data
rather than web pages.
We'll talk about relational database management systems
(RDBMS) and how to communicate with them using the
structured query language (SQL).
Why use a database?
- Coordinate synchronized access to data
- Change continually; give immediate access to live data
- Centralize data for backups
- Support client-server computing
- Control access to the data
An RDBMS has three main parts:
- Data definition
- Data access
- Privilege management
We'll concentrate on data access, assuming the database is already available and we have the needed privileges.
Topics:
- using SQL to extract info from RDBMSs
- relating these back to similar tasks in R
- using SQL from within R
There are tradeoffs in terms of what we choose to do using SQL and what we do in R.
A database is made up of one or more two-dimensional tables, usually stored as files on the server.
A very important concept in the design of a database is normalization. The idea is to remove as much redundancy as possible when creating the tables. We do this by breaking the full dataset into separate tables.
The "relational" in RDBMS comes from the fact that we then need to link the tables together.
For now let's talk about a single table....
A table is a rectangular arrangement of values, where a
row represents a case, and a column represents a variable
(just like a data frame in R).
Terminology
Object Statistics Database
Table Data frame Relation
Row Case Tuple
Column Variable Attribute
Row count Sample size Cardinality
Column count Dimension Degree
Row ID Row name Key
An entity is the general object of interest. For example, a
lab test. Each case is a particular occurrence of the entity.
This means that rows in the table are unique.
To identify each row, we use a key. A key is just an
attribute or a combination of attributes.
In the lab test example, there is a composite key of both
patient ID and date, since neither is necessarily unique.
In R, the row names of a dataframe play a similar role.
Queries and the SELECT statement
SQL allows us to interactively query the database to
reduce the data by subsetting, grouping, or aggregation.
Each database program tends to have its own version of
SQL, but they all support the same basic SQL statements.
(We say statements rather than commands because SQL
is referred to as a declarative rather than a programming
language.)
The SQL statement for retrieving data is the SELECT
statement. The result will always be another table.
A table called Chips:
The simplest possible query gives back everything:
SELECT * FROM Chips;
By convention, we display SQL commands in upper case.
Selecting by variables/attributes
Recall that in R, we can select particular variables
(columns) by name.
Chips[ , c("Mips", "Microns")]
The order of the variable names determines the order in
which they'll be returned in the resulting data frame.
The corresponding SQL query is
SELECT Mips, Microns FROM Chips;
Selecting by cases/tuples
Likewise, in R we can select cases from a dataframe using
their row names.
Chips[c("Pentium", "PentiumII"), ]
Two equivalent SQL queries are
SELECT * FROM Chips
WHERE Processor = 'Pentium' OR
      Processor = 'PentiumII';
SELECT * FROM Chips
WHERE Processor IN ('Pentium', 'PentiumII');
In both R and SQL, we can do both types of subsetting at
once.
R:
Chips[c("Pentium", "PentiumII"), c("Mips", "Microns")]
SQL:
SELECT Mips, Microns FROM Chips
WHERE Processor IN ('Pentium', 'PentiumII');
Generalizing, so far we have the syntax
SELECT attribute(s) FROM relation(s)
[WHERE constraints];
How would we pull the years of all 32-bit processors that
execute fewer than 250 million instructions per second,
1) in SQL, 2) in R?
[optional]
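One possible answer, assuming the Chips table has columns named Year and Bits for the release year and the word size (the real column names aren't shown on the slide):

SELECT Year FROM Chips
WHERE Bits = 32 AND Mips < 250;

In R, with the same assumed column names, the equivalent would be Chips$Year[Chips$Bits == 32 & Chips$Mips < 250].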
SQL offers limited features for summarizing data -- some
aggregate functions that operate over the rows of a table,
and some mathematical functions that operate on
individual values in a tuple.
The aggregate functions are
- COUNT - number of tuples
- SUM - total of all values for an attribute
- AVG - average value for an attribute
- MIN - minimum value for an attribute
- MAX - maximum value for an attribute
SELECT attribute(s) or functions of attributes FROM relation(s)
[WHERE constraints];
Additional clauses
The GROUP BY clause makes the aggregate functions in
SQL more useful. It enables the aggregates to be applied
to subsets of the tuples in a table.
SELECT Region, SUM(Amount) FROM Sales
GROUP BY Region;
The WHERE clause can't contain an aggregate function,
but the HAVING clause can be used to refer to the
groups to be selected.
SELECT Region, SUM(Amount) FROM Sales
GROUP BY Region HAVING SUM(Amount) > 100000;
A few other predicates and clauses are
DISTINCT - forces values of an attribute in the results
table to have unique values
NOT - negates conditions in WHERE or HAVING clause
LIMIT - limits the number of tuples returned
SELECT DISTINCT State FROM Sales
WHERE NOT Region IN ('East', 'West')
LIMIT 10;
The order of execution of the clauses in a SELECT
statement is as follows:
1. FROM: The working table is constructed.
2. WHERE: The WHERE clause is applied to each tuple of
the table, and only the rows that test TRUE are retained.
3. GROUP BY: The results are broken into groups of
tuples all with the same value of the GROUP BY clause.
4. HAVING: The HAVING clause is applied to each group
and only those that test TRUE are retained.
5. SELECT: The attributes not in the list are dropped and
options such as DISTINCT and LIMIT are applied.
Finally, another useful clause is ORDER BY. In MySQL it is applied after the clauses above but before LIMIT, so this example returns the seven largest amounts, sorted in descending order:
SELECT Location, Amount FROM Sales
ORDER BY Amount DESC LIMIT 7;
Moving on to multiple tables: an example
This is where normalization really comes into play. We
could have stored the same information in one big table:
Designing efficient databases is a topic in its own right. As
users of the database, we just need to understand the
relationships between tables.
An example
SELECT CID, SUM(Balance) AS Total
FROM Registration AS R, Accounts AS A
WHERE A.AcctNo = R.AcctNo
GROUP BY CID;
Your turn: Find the names and addresses of all customers
with accounts in the downtown branch of the bank.
It helps to think about order of execution (FROM, then
WHERE, then GROUP BY, then SELECT).
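One possible answer, assuming a Customers table holding CID, Name, and Address, and a Branch column in Accounts (the slide doesn't show the exact table layout):

SELECT C.Name, C.Address
FROM Customers AS C, Registration AS R, Accounts AS A
WHERE C.CID = R.CID AND R.AcctNo = A.AcctNo
  AND A.Branch = 'Downtown';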
Recall from last time how subsetting rows and columns in
SQL differs from doing it in R.
R:
Chips[c("Pentium", "PentiumII"), c("Mips", "Microns")]
SQL:
SELECT Mips, Microns FROM Chips
WHERE Processor IN ('Pentium', 'PentiumII');
Also remember the additional clauses GROUP BY and
WHERE.
The GROUP BY clause enables the aggregate functions to
be applied to subsets of the tuples in a table.
SELECT Region, SUM(Amount) FROM Sales
GROUP BY Region;
The WHERE clause can't contain an aggregate function,
but the HAVING clause can be used to refer to the
groups to be selected.
SELECT Region, SUM(Amount) FROM Sales
GROUP BY Region HAVING SUM(Amount) > 100000;
Interacting with databases and multiple tables
Each server may provide multiple databases, each of
which may contain multiple tables, each of which may
have multiple columns. The following commands are
useful for orienting yourself.
SHOW DATABASES;
SHOW TABLES IN database;
SHOW COLUMNS IN table;
DESCRIBE table;   (same thing as SHOW COLUMNS)
Today we'll get practice using the command line program
MySQL to interact with a database.
springer.cgk% mysql -u stat133 -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 9
Server version: 5.0.51a-3ubuntu5.1 (Ubuntu)
Type 'help;' or '\h' for help. Type '\c' to clear the
buffer.
mysql> SHOW DATABASES;
+--------------------+
| Database |
+--------------------+
| information_schema |
| albums |
| baseball |
| music |
+--------------------+
4 rows in set (0.00 sec)
You can log into springer from
any of the department machines.
After typing this command, enter
password T0pSecr3t
We'll try some examples with the database called albums.
mysql> SHOW TABLES IN albums;
+------------------+
| Tables_in_albums |
+------------------+
| Album |
| Artist |
| Track |
+------------------+
3 rows in set (0.00 sec)
mysql> DESCRIBE Album;
ERROR 1046 (3D000): No database selected
Note that to access the tables within a database, we can
use the same dot notation as before, e.g. albums.Album.
Or...
We can also rst say
mysql> USE albums;
Reading table information for completion of table and
column names
You can turn off this feature to get a quicker startup
with -A
Database changed
mysql> DESCRIBE Album;
+-------+------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| alid | bigint(20) | YES | MUL | NULL | |
| aid | double | YES | | NULL | |
| title | text | YES | | NULL | |
+-------+------------+------+-----+---------+-------+
3 rows in set (0.00 sec)
mysql> describe Album;
+-------+------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| alid | bigint(20) | YES | MUL | NULL | |
| aid | double | YES | | NULL | |
| title | text | YES | | NULL | |
+-------+------------+------+-----+---------+-------+
mysql> describe Artist;
+-------+--------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+--------+------+-----+---------+-------+
| aid | double | YES | MUL | NULL | |
| name | text | YES | | NULL | |
+-------+--------+------+-----+---------+-------+
mysql> describe Track;
+----------+------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------+------------+------+-----+---------+-------+
| alid | bigint(20) | YES | | NULL | |
| aid | double | YES | | NULL | |
| title | text | YES | | NULL | |
| filesize | bigint(20) | YES | | NULL | |
| bitrate | double | YES | | NULL | |
| length | bigint(20) | YES | | NULL | |
+----------+------------+------+-----+---------+-------+
What is the
structure of
this database?
The individual tables don't have much interesting
information, since they rely on the IDs.
mysql> SELECT * FROM Album LIMIT 5;
+------+------+-----------------------+
| alid | aid | title |
+------+------+-----------------------+
| 346 | 372 | 'Perk Up' |
| 226 | 235 | 'Round Midnight |
| 316 | 326 | 10 to 4 at the 5 Spot |
| 204 | 205 | 100% Pure Funk |
| 500 | 265 | 150 MPH |
+------+------+-----------------------+
5 rows in set (0.01 sec)
To see anything interesting, we have to link together
multiple tables.
(The LIMIT 5 is just so that the examples fit on the slides -- you'll want to remove it.)
mysql> SELECT Artist.name, Album.title FROM Artist, Album
-> WHERE Artist.aid = Album.aid LIMIT 5;
+----------------------+-----------------------+
| name | title |
+----------------------+-----------------------+
| Shelly Manne | 'Perk Up' |
| Kenny Burrell | 'Round Midnight |
| Pepper Adams Quintet | 10 to 4 at the 5 Spot |
| Jimmy McGriff | 100% Pure Funk |
| Louie Bellson | 150 MPH |
+----------------------+-----------------------+
5 rows in set (0.00 sec)
Combining two datasets in this way is called an inner join.
The arrow (->) appears above because I haven't entered the terminating ; yet. You shouldn't type it in.
Also, don't forget you can use AS to rename tables or columns. This can save a lot of typing.
mysql> SELECT ar.name AS Artist, al.title AS 'Album Name'
-> FROM Artist AS ar, Album AS al
-> WHERE ar.aid = al.aid LIMIT 5;
+----------------------+-----------------------+
| Artist | Album Name |
+----------------------+-----------------------+
| Shelly Manne | 'Perk Up' |
| Kenny Burrell | 'Round Midnight |
| Pepper Adams Quintet | 10 to 4 at the 5 Spot |
| Jimmy McGriff | 100% Pure Funk |
| Louie Bellson | 150 MPH |
+----------------------+-----------------------+
5 rows in set (0.00 sec)
Subqueries
The result of one query can be used to represent a value
in another query.
For example, say we wanted to find the artist and title of
the longest track in the database.
This gives the length, but not the other information:
mysql> SELECT MAX(length) FROM Track;
+-------------+
| MAX(length) |
+-------------+
| 1561 |
+-------------+
1 row in set (0.00 sec)
(Remember the aggregate
functions are COUNT,
SUM, AVG, MIN, and MAX.)
Now let's get the name of the track.
mysql> SELECT title, length FROM Track
    -> WHERE length = (SELECT MAX(length) FROM Track);
+------------+--------+
| title      | length |
+------------+--------+
| The Lovers |   1561 |
+------------+--------+
1 row in set (0.00 sec)
How do we get the artist and album information as well?
Ask yourself:
1. What tables do I need (for the FROM clause)?
2. What constraints do I need (for the WHERE clause)?
3. What columns do I want to SELECT?
4. Should I rename anything to save typing?
mysql> SELECT tr.title, al.title, ar.name, tr.length
    -> FROM Track AS tr, Album AS al, Artist AS ar
    -> WHERE tr.alid = al.alid AND tr.aid = al.aid
    ->   AND ar.aid = al.aid
    ->   AND length = (SELECT MAX(length) FROM Track);
+------------+------------------------+------------+--------+
| title      | title                  | name       | length |
+------------+------------------------+------------+--------+
| The Lovers | Invitation to Openness | Les McCann |   1561 |
+------------+------------------------+------------+--------+
1 row in set (0.00 sec)
Note that it's very important to make all the links between tables, or you will get unwanted rows in the table.
Another example: let's make a table of the number of
artists with certain numbers of albums. (How many
artists have one album, how many have two, etc.)
First, this tells us how many albums each artist (aid) has:
mysql> SELECT aid, COUNT(aid) AS ct FROM Album
-> GROUP BY aid LIMIT 5;
+------+----+
| aid | ct |
+------+----+
| 1 | 3 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
| 5 | 5 |
+------+----+
5 rows in set (0.01 sec)
mysql> SELECT ct, COUNT(ct) FROM
-> (SELECT aid, count(aid) AS ct FROM Album GROUP BY aid) AS x
-> GROUP BY ct ORDER BY ct;
+----+-----------+
| ct | COUNT(ct) |
+----+-----------+
| 1 | 318 |
| 2 | 63 |
| 3 | 30 |
| 4 | 20 |
| 5 | 7 |
| 6 | 4 |
| 7 | 3 |
| 8 | 2 |
| 9 | 2 |
| 10 | 1 |
| 11 | 1 |
| 12 | 2 |
| 14 | 1 |
+----+-----------+
13 rows in set (0.00 sec)
We're using the whole
table from the last slide
as a subquery here.
Remember the result
looked like:
+------+----+
| aid | ct |
+------+----+
| 1 | 3 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
| 5 | 5 |
Using MySQL with R
The RMySQL library allows you to connect to a database,
submit a query, and receive the results as a data frame. The
important functions are dbDriver, dbConnect, and
dbGetQuery.
> library(RMySQL)
Loading required package: DBI
> drv <- dbDriver("MySQL")
> con <- dbConnect(drv, dbname = "albums",
+ user = "stat133", pass = "T0pSecr3t")
This assumes you are already logged into springer. If you're
connecting from some other SCF machine, you'd need to
add the host='springer' argument to the dbConnect call.
We can grab each full table:
> album <- dbGetQuery(con,
+ statement = "SELECT * FROM Album")
> track <- dbGetQuery(con,
+ statement = "SELECT * FROM Track")
> artist <- dbGetQuery(con,
+ statement = "SELECT * FROM Artist")
Notice that we can get away with not using the
terminating (;) here.
The merge function in R performs an inner join. We can
use it to recreate the huge table with every column.
> album.and.artist <- merge(album, artist, by = "aid")
> full <- merge(track, album.and.artist, by = "alid")
> head(full)
alid aid.x title.x filesize bitrate length aid.y
1 1 2 S'Wonderful 3351 126.5808 217 2
2 1 2 Taboo 2528 126.0203 164 2
3 1 2 Just One Of Those Things 3753 128.0000 237 2
4 1 2 Yardbird Suite 2879 126.3243 187 2
5 1 2 It's The Talk Of The Town 2915 126.5238 189 2
6 1 2 Mighty Like A Rose 4503 126.7958 291 2
title.y name
1 Al Haig Trio And Sextets Featu Al Haig
2 Al Haig Trio And Sextets Featu Al Haig
3 Al Haig Trio And Sextets Featu Al Haig
4 Al Haig Trio And Sextets Featu Al Haig
5 Al Haig Trio And Sextets Featu Al Haig
6 Al Haig Trio And Sextets Featu Al Haig
Now we can do any processing we need to do.
On the other hand, with large databases it is often slow
or even impossible to load all the tables into R or create
one large dataframe. Then we can customize the query to
select just what we want and to do some processing on
the remote server.
> query <- "SELECT al.title, ar.name, SUM(tr.length) AS tot\
+ FROM Album AS al,Artist AS ar,Track AS tr\
+ WHERE tr.alid = al.alid AND tr.aid = ar.aid AND tr.aid = al.aid\
+ GROUP BY tr.alid\
+ HAVING tot BETWEEN 2400 AND 2700 ORDER BY tot DESC"
> albums <- dbGetQuery(con, statement = query)
What do you think this will do?
This says that the string
is not finished, even though
we hit return.
> head(albums)
title name tot
1 'Perk Up' Shelly Manne 2684
2 Kaleidoscope Sonny Stitt 2679
3 Red Garland's Piano (Remastere Red Garland 2676
4 Ask The Ages Sonny Sharrock 2675
5 Duo Charlie Hunter & Leon Parker 2667
6 Tenor Conclave Hank Mobley/Al Cohn/John Coltr 2665
Numerical
Optimization
Function optimization refers to the problem of finding a value of x to make some function f(x) as large (or as small) as possible.
In statistics, these problems often arise in the context of calculating estimates of model parameters.
- Nonlinear least squares
- Generalized linear models
- Maximum likelihood estimates
Sometimes we can find the solution explicitly, for example using the derivatives of f. But when the solution can't be found in closed form, we turn to numerical optimization.
There are very many techniques for numerical optimization, and we can't possibly cover them all.
We'll talk about two basic methods and how to program them:
- Golden section search
- Newton-Raphson algorithm
Then we'll move on to some statistical examples and how to use the built-in optimization methods in R.
Since every maximization problem can be rewritten as a minimization problem, using -f(x) rather than f(x), we'll assume from now on that we're minimizing.
Golden section search
Assume f(x) has a single minimum on the interval [a,b].
The golden section search algorithm iteratively shrinks the
interval over which were looking for the minimum, until
the length of the interval is less than some preset
tolerance.
The name golden section comes from the fact that at
each iteration, we choose a new point to evaluate so that
we can reuse one of the points from the last iteration. It
works out that the way to do this is to maintain the so-
called golden ratio between the distances between points.
Consider two line segments c and d. They are said to be in golden ratio if their sum c + d is to c as c is to d:
(c + d) / c = c / d = φ
Writing c = φ d,
(φ d + d) / (φ d) = 1 + 1/φ = φ   so   φ^2 − φ − 1 = 0   and   φ = (1 + √5) / 2 ≈ 1.618034
An example:
Start with
x_1 = b − (b − a)/φ   and   x_2 = a + (b − a)/φ.
Now compare f(x_1) and f(x_2). Since f(x_1) < f(x_2), we know the minimum must be in [a, x_2].
[Figure: f(x) on [a, b] with the interior points x_1 and x_2 marked]
An example:
This time f(x_1) > f(x_2), so the minimum must be in [x_1, b]. Add a new point and maintain the golden ratio.
[Figure: the interval after one step, with the new interior point added]
An example:
Keep going like this....
[Figure: the interval shrinking over successive iterations]
An example:
When b − a is sufficiently small, we stop and report a minimum of (a + b)/2. One can show that the error is at most a fixed fraction of (b − a).
[Figure: the final, much smaller bracketing interval]
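A minimal R implementation of the algorithm just described (a sketch; for simplicity it re-evaluates f at both interior points each iteration instead of reusing one of them):

golden <- function(f, a, b, tol = 1e-6) {
  phi <- (1 + sqrt(5)) / 2
  while (b - a > tol) {
    x1 <- b - (b - a) / phi
    x2 <- a + (b - a) / phi
    if (f(x1) < f(x2)) b <- x2 else a <- x1   # keep the half that contains the minimum
  }
  (a + b) / 2
}

# Example: minimize (x - 2)^2 + 1 on [1, 5]; should return roughly 2
golden(function(x) (x - 2)^2 + 1, 1, 5)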
The Newton-Raphson algorithm can be used if the function to be minimized has two continuous derivatives that may be evaluated.
Again assume that there is a single minimum in [a, b]. If the minimizing value x* is not at a or b, then
f'(x*) = 0.
If in addition f''(x*) > 0, then x* is a minimum.
The main idea behind N-R is that if we have an initial value x_0 that is close to the minimizing value, then we can approximate
f'(x) ≈ f'(x_0) + (x − x_0) f''(x_0)
Setting the right-hand side equal to zero gives
x_1 = x_0 − f'(x_0) / f''(x_0)
We keep going in this way until f'(x_n) is sufficiently close to zero.
It's important to have a good initial guess; otherwise the Taylor series approximation may be very poor and we may even have f(x_{n+1}) > f(x_n).
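A sketch of the iteration in R, assuming we can supply functions fp and fpp returning the first and second derivatives:

newton <- function(fp, fpp, x0, tol = 1e-8, max.iter = 100) {
  x <- x0
  for (i in 1:max.iter) {
    x <- x - fp(x) / fpp(x)          # one Newton-Raphson update
    if (abs(fp(x)) < tol) return(x)
  }
  warning("did not converge")
  x
}

# Example: minimize f(x) = x^2 - 4x, so fp(x) = 2x - 4 and fpp(x) = 2
newton(function(x) 2 * x - 4, function(x) 2, x0 = 10)   # returns 2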
We've already seen one example of numerical optimization in action, when we used nonlinear least squares to fit a curve to the covariogram for spatial data.
This is useful more generally, for nonlinear regression models of the form
y = f(x, β) + ε
where y is the response variable, f is a nonlinear function, x is a vector of covariates, β is a vector of coefficients, and ε ~ N(0, σ^2) is the error term.
Example: weight loss
Patients tend to lose weight at a diminishing rate. Here is data from one patient with a linear fit superimposed.
[Figure: weight (in kg and lb) versus days (0 to 250), with a straight-line fit]
Another proposed model is
y = β_0 + β_1 2^(−t/θ) + ε
where β_0 is the ultimate lean weight (asymptote), β_1 is the total amount to be lost, and θ is the time taken to lose half the amount remaining to be lost.
The function nls in R uses numerical optimization to find the values of the parameters that minimize the sum of squared errors
Σ_{i=1}^n (Y_i − f(x_i, β))^2
The main arguments are
- formula - outcome on the LHS, function on the RHS
- data - dataframe holding the variables
- start - vector of starting values
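For example, a sketch using the wtloss data frame from the MASS package (columns Days and Weight, in kg); the starting values are rough guesses based on the plot:

library(MASS)   # for the wtloss data

wt.fit <- nls(Weight ~ b0 + b1 * 2^(-Days / theta),
              data = wtloss,
              start = list(b0 = 90, b1 = 95, theta = 120))
summary(wt.fit)

plot(wtloss$Days, wtloss$Weight, xlab = "Days", ylab = "Weight (kg)")
lines(wtloss$Days, fitted(wt.fit))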
Numerical
Optimization II: Fitting
Generalized Linear
Models
The normal linear model assumes that
1) the expected value of the outcome variable can be
expressed as a linear function of the explanatory
variables, and
2) the residuals (observations minus their expected
values) are independent and identically distributed with
a normal distribution.
Last time we talked about relaxing assumption 1), using
nonlinear regression models.
Today we'll talk about relaxing assumption 2), using what
are called generalized linear models.
First, a few words about the normal linear model.
With a single explanatory variable, it has the form
Y_i = β_0 + β_1 X_i + ε_i,   where ε_i iid N(0, σ^2), i = 1, ..., n
Recall that the least squares estimates of β_0 and β_1 minimize the residual sum of squares
RSS(β_0, β_1) = Σ_{i=1}^n (Y_i − β_0 − β_1 X_i)^2
With a little calculus, we can minimize RSS explicitly....
∂RSS/∂β_0 = −2 Σ_{i=1}^n (Y_i − β_0 − β_1 X_i)
∂RSS/∂β_1 = −2 Σ_{i=1}^n (Y_i − β_0 − β_1 X_i) X_i
Setting each equal to zero and solving, we get
β̂_1 = Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ) / Σ_{i=1}^n (X_i − X̄)^2
β̂_0 = Ȳ − β̂_1 X̄
In this model, the least squares estimates are equal to the maximum likelihood estimates, which we'll discuss next time.
We can do similar calculations if we have more than one explanatory variable.
Another way of thinking about what we've done in the normal linear model is that we've expressed the mean of the Ys as a linear combination of the Xs:
E[Y_i] = β_0 + β_1 X_i + E[ε_i]
       = β_0 + β_1 X_i
To work with non-normal distributions, we're going to slightly modify this idea.
First, a motivating example.
Prior to the launch of the space shuttle Challenger, there
was some debate about whether temperature had any
effect on the performance of a key part called an O-ring.
The following plot, with data from past flights, was used as evidence that it was safe to launch at a temperature of 31F.
One key problem with this analysis was that the engineers left out the data from all the flights with no O-ring problems, under the mistaken assumption that these
gave no extra information.
The solid rocket motors (labeled
3 and 4) are delivered to
Kennedy Space Center in four
pieces, and they are connected
on site using the O-rings. There
are actually two sets of O-rings
at each joint, but we'll focus on
the primary ones.
So in each launch, there are six
primary O-rings that can fail. If
any one fails, it can lead to a
catastrophic failure of the whole
shuttle.
Here are the data on past
failures of the primary O-
rings.
The data from past flights come from rocket motors that are retrieved from the ocean after the flight. There
had been 24 shuttle
launches prior to
Challenger, of which the
rocket motors were
retrieved in 23 cases.
Temp Fail Date
1 66 0 4/12/81
2 70 1 11/12/81
3 69 0 3/22/82
4 68 0 11/11/82
5 67 0 4/4/83
6 72 0 6/18/83
7 73 0 8/30/83
8 70 0 11/28/83
9 57 1 2/3/84
10 63 1 4/6/84
11 70 1 8/30/84
12 78 0 10/5/84
13 67 0 11/8/84
14 53 2 1/24/85
15 67 0 4/12/85
16 75 0 4/29/85
17 70 0 6/17/85
18 81 0 7/29/85
19 76 0 8/27/85
20 79 0 10/3/85
21 75 2 10/30/85
22 76 0 11/26/85
23 58 1 1/12/86
We could fit a linear regression model to this data, relating the expected number of failures to temperature.
Some problems with this approach are that
1) the residuals are clearly not iid normal
2) if we go out far enough, we actually predict a negative number of failures.
[Figure: number of O-ring failures (0 to 6) versus temperature (0 to 80 degrees F), with the fitted line]
Instead we will fit a logistic regression model.
This model is appropriate when the data have a binomial distribution (counting the number of events out of n trials), of which binary data is a special case with n = 1.
The expected value for a given trial is p_i, the probability of an event when the explanatory variable X = X_i. We relate this to the linear predictor using the logit function:
log( p_i / (1 − p_i) ) = β_0 + β_1 X_i
The ratio p_i / (1 − p_i) is called the odds, so the logit can also be called the log odds.
The case we are discussing (binomial outcome, logit
function) is a special case of a larger class of models called
generalized linear models.
Some other examples:
Normal outcome, identity link:
Y_i ~ N(μ_i, σ^2),   μ_i = β_0 + β_1 X_i
Poisson outcome, log link:
Y_i ~ Pois(λ_i),   log(λ_i) = β_0 + β_1 X_i
Note that in each case, the link function maps the space of the parameter representing the mean of the distribution (μ_i, λ_i, or p_i) to the real line, which is the space of the linear predictor.
Generalized linear models can be fit using an algorithm called iteratively reweighted least squares. In R, this is implemented in the function glm.
# First create a matrix with events and non-events
FN <- cbind(challenge$Fail, 6 - challenge$Fail)
# Fit using specified family, default link function
glm.fit <- glm(FN ~ Temp, data = challenge, family = binomial)
# Now predict for a range of temperatures
tempseq <- seq(0, 90, length = 100)
pred <- predict(glm.fit, newdata = data.frame(Temp = tempseq), se.fit = TRUE)
inv.logit <- function(x){ 1 / (1 + exp(-x)) }
lines(tempseq, inv.logit(pred$fit))
lines(tempseq, inv.logit(pred$fit + 2*pred$se.fit), lty = 2)
lines(tempseq, inv.logit(pred$fit - 2*pred$se.fit), lty = 2)
The confidence interval at 31F is quite wide, but the point estimate probably still should have been cause for alarm, especially since the temperature was colder than anything that had been tried before.
[Figure: estimated probability of failure versus temperature (0 to 80 degrees F), with the fitted curve and pointwise confidence bands]
Interpretation of the coefficients is a bit trickier in logistic regression models than it is in linear regression models.
When E[Y_i] = β_0 + β_1 X_i, we can say that
- β_0 is the expected value of Y when X_i = 0
  (This is not always interesting; for example the temperature will almost never be 0F.)
- β_1 is the change in the expected value of Y due to a unit increase in X.
Now we have
log( p_i / (1 − p_i) ) = β_0 + β_1 X_i
We can interpret the parameters in terms of log odds, odds, or probabilities.
The interpretation regarding log odds is the easiest to state but probably the hardest to understand.
- β_0 is the log odds of an event when X_i = 0
- β_1 is the change in the log odds due to a unit increase in X.
Now, if we exponentiate both sides, we get
p_i / (1 − p_i) = exp(β_0 + β_1 X_i)
which implies that exp(β_0) is the value of the odds when X_i = 0.
Also, suppose X_i = X_j + 1. Then
[p_i / (1 − p_i)] / [p_j / (1 − p_j)] = exp(β_0) exp(β_1 X_i) / [exp(β_0) exp(β_1 X_j)]
                                      = exp(β_1 (X_i − X_j)) = exp(β_1)
So exp(β_1) gives the multiplicative change in odds corresponding to a one unit change in X.
In particular, if X takes only the values 0 and 1, then exp(β_1) is the odds ratio for category 1 compared to category 0.
The interpretation in terms of probabilities is conditional on other variables in the model, so we'll save it for after we talk about using multiple regressors.
Another example, this time with multiple explanatory
variables
> library(MASS)
> birthwt[1:2,]
low age lwt race smoke ptl ht ui ftv bwt
85 0 19 182 2 0 0 0 1 0 2523
86 0 33 155 3 0 0 0 0 3 2551
'low' indicator of birth weight less than 2.5kg
'age' mother's age in years
'lwt' mother's weight in pounds at last menstrual period
'race' mother's race ('1' = white, '2' = black, '3' = other)
'smoke' smoking status during pregnancy
'ptl' number of previous premature labours
'ht' history of hypertension
'ui' presence of uterine irritability
'ftv' number of physician visits during the first trimester
'bwt' birth weight in grams
We are now confronted with the question of model choice. There are a variety of principles that can guide us here, but in the interest of time, let's consider one criterion balancing goodness of fit with parsimony (the number of parameters).
The Akaike information criterion is
AIC = 2k − 2 log(L)
where k is the number of parameters in the given model and L is the maximized value of the likelihood for that model. (For now, you can think of the likelihood as the joint density of the data for a particular setting of the parameter values.) Looking at this criterion, we favor models with a lower value of AIC.
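As a sketch with the birthwt data, we can fit a couple of candidate logistic regressions and compare their AIC values directly; step() automates the search (the particular variable subsets here are just for illustration):

library(MASS)

fit1 <- glm(low ~ age + lwt + smoke, family = binomial, data = birthwt)
fit2 <- glm(low ~ age + lwt + smoke + ht + ui, family = binomial, data = birthwt)

AIC(fit1, fit2)   # lower AIC is preferred

# Automatic search over models, dropping/adding terms to lower the AIC
step(fit2)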
Numerical Optimization III: Maximizing Likelihoods
One of the canonical cases in which we need to numerically optimize a function in statistics is to find the maximum likelihood estimate.
For X_1, ..., X_n iid from f(x; θ), the likelihood function is
L(θ) = Π_{i=1}^n f(X_i; θ)
The log-likelihood function is
ℓ(θ) = log L(θ) = Σ_{i=1}^n log f(X_i; θ)
The maximum likelihood estimator (MLE), which we'll denote by θ̂, is the value of θ that maximizes L(θ). (Note that this is equivalent to maximizing ℓ(θ).)
Another important thing to note is that we can multiply the likelihood by a constant (or add a constant to the log-likelihood), and this does not change the location of the maximum.
Therefore, we often work only with the part of the likelihood that concerns θ. This part of the function is called the kernel.
In simple cases, we can often find the MLE in closed form by, for example, differentiating the log-likelihood with respect to θ, setting this equal to zero, and solving for θ.
But things are often not this simple!
As an example, let's go back to the logistic regression model. Remember, we have
Y_i ~ Ber(p_i), i = 1, ..., n
where, inverting the logit function, we have
p_i = exp{β_0 + β_1 X_i} / (1 + exp{β_0 + β_1 X_i})
and the likelihood function is
L(β_0, β_1) = Π_{i=1}^n p_i^{Y_i} (1 − p_i)^{1 − Y_i},
substituting in the expression above.
We can't maximize this analytically as a function of β_0 and β_1, but we can easily write a function for the likelihood or log-likelihood and have R do the work for us....
# Function for the negative log-likelihood
logistic.nll <- function(beta, x, y, verbose = FALSE){
if(verbose) print(beta)
beta0 <- beta[1]; beta1 <- beta[2]
pvec <- exp(beta0 + beta1 * x) /
(1 + exp(beta0 + beta1 * x))
fvec <- y * log(pvec) + (1-y) * log(1 - pvec)
return(-sum(fvec))
}
# Use optim to minimize the nll
# par is a vector of starting values
# better starting values => faster convergence, and
# less chance of missing the global maximum
optim(par = c(0, 0), fn = logistic.nll,
x = x, y = y, verbose = TRUE)
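The call above assumes that vectors x and y already exist; as a quick check, we might simulate some data and compare the optim result with glm (the true coefficient values below are arbitrary):

set.seed(1)
x <- rnorm(200)
p <- 1 / (1 + exp(-(-0.5 + 1.2 * x)))        # true beta0 = -0.5, beta1 = 1.2
y <- rbinom(200, size = 1, prob = p)

optim(par = c(0, 0), fn = logistic.nll, x = x, y = y)$par
coef(glm(y ~ x, family = binomial))          # should be very close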
In the case of logistic regression, minimizing the negative log-likelihood using optim will give the same answer as using glm with family = binomial.
However, there are many other models without built-in functions like glm. One example is the spatial models we discussed a few weeks ago.
Suppose we have a spatial field with mean zero and covariance function
Cov(Z(s_i), Z(s_j)) = σ^2 exp{ −||s_i − s_j|| / ρ }
Before, we estimated σ^2 and ρ by finding the covariogram and fitting a curve to it using nonlinear least squares.

However, the MLEs are actually much better estimators.
The kernel of the likelihood function (for normal data) looks like this:
|Σ(σ^2, ρ)|^(−1/2) exp{ −Z' Σ(σ^2, ρ)^(−1) Z / 2 }
where Z = (Z_1, Z_2, ..., Z_n)' is the vector of observations and Σ(σ^2, ρ) is the n by n matrix with
Σ(σ^2, ρ)_{i,j} = Cov(Z(s_i), Z(s_j)) = σ^2 exp{ −||s_i − s_j|| / ρ }
We can again use optim to find the MLEs numerically....
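A sketch of the negative log-likelihood and the optim call, assuming z is the vector of observations and d.mat the matrix of pairwise distances between their locations; optimizing over the logs of sigma^2 and rho keeps both parameters positive:

spatial.nll <- function(theta, z, d.mat) {
  sigma2 <- exp(theta[1])
  rho    <- exp(theta[2])
  Sigma  <- sigma2 * exp(-d.mat / rho)
  # Negative log of |Sigma|^(-1/2) exp(-z' Sigma^(-1) z / 2)
  0.5 * as.numeric(determinant(Sigma, logarithm = TRUE)$modulus) +
    0.5 * drop(t(z) %*% solve(Sigma, z))
}

mle <- optim(par = log(c(30, 100)), fn = spatial.nll, z = z, d.mat = d.mat)
exp(mle$par)   # estimates of sigma^2 and rho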
A few words before we move on:
1) It's always preferable to find the MLEs in closed form if you can. The answer is exact, and you avoid all the errors that can be introduced in the numerical optimization, including possibly converging to a local rather than a global optimum.
2) If you do need to use numerical optimization, it's a good idea to evaluate the likelihood (or log-likelihood) over a grid of values first, to help you find a good starting value.
3) There's a lot more theoretical detail concerning MLEs that we don't have time to cover, importantly how to estimate uncertainty. See Stat 135.
Nonparametric regression and scatterplot smoothing
We've looked at linear models and nonlinear models with a specified form, but what if you don't know a good function to relate two variables X and Y?
[Figure: acceleration versus time for the motorcycle data]
This data set shows head acceleration in a simulated motorcycle accident, used to test helmets.
This area of statistics is known as nonparametric regression or scatterplot smoothing. Basically, we want to draw a curve through the data that relates X and Y. More formally, we suppose
Y_i = f(X_i) + ε_i
where f is an unknown function and the ε_i are iid with some common distribution, typically normal.
Now, if we don't put any restrictions on f, it's easy to get a perfect fit to the data -- just draw a curve that passes through all the points! But this curve is unlikely to give good predictions for any future observations.
Aside: This actually gets at a fundamental idea in statistics, called the bias-variance tradeoff. We can get a very low-bias estimator of f by interpolating the data, using a very wiggly curve. But this introduces a lot of variance. So we look for a happy medium.
We won't cover the theoretical details here, but just keep in mind this question of how much smoothing to do.
Back to the motorcycle data....
One of the simplest things we could do would be to fit a high-degree polynomial.
But fitting a global polynomial this way isn't very efficient. How about breaking up the region of x and fitting a separate, lower-degree polynomial in each region?
[Figure: motorcycle data with global polynomial fits of degree 5, 10, and 20]
This type of model is known as a piecewise polynomial model or regression splines.
The breakpoints, between which we have separate polynomial functions, are called knots.
Typically we impose some constraints on the way the functions match up at the knots, such as maintaining the first and second derivatives.
So the modelling choices boil down to
1) where to put the knots
2) what degree polynomial to fit between the knots
More knots means less smoothing.
Motorcycle data with 6 knots:
[Figure: piecewise fits of degree 1 and degree 3 with 6 knots]
Motorcycle data with 9 knots:
[Figure: piecewise fits of degree 1 and degree 3 with 9 knots]
Smoothing spline models are defined in a slightly different way. Within a class of functions, a smoothing spline minimizes the penalized least squares criterion
(1/n) Σ_{i=1}^n (Y_i − f(X_i))^2 + λ ∫ f''(x)^2 dx
The parameter λ controls how smooth the function is (in terms of integrated second derivative).
We can specify λ in terms of the equivalent degrees of freedom of the model, or we can choose it in a data-based way, using something called cross validation.
[Figure: motorcycle data with smoothing spline fits -- df = 10, df = 20, and df chosen by cross-validation]
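In R, smoothing splines are fit with smooth.spline; here is a sketch on the motorcycle data (mcycle from the MASS package, with columns times and accel):

library(MASS)   # for the mcycle data

plot(mcycle$times, mcycle$accel, xlab = "Times", ylab = "Acceleration")

fit.df10 <- smooth.spline(mcycle$times, mcycle$accel, df = 10)
fit.cv   <- smooth.spline(mcycle$times, mcycle$accel)   # lambda chosen by (generalized) cross-validation

lines(fit.df10, lty = 2)
lines(fit.cv)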
