Stat 133 All Lectures
Some benefits of R:
• Allows custom analyses and easy replicability
• High level language designed for statistics
• Active user community, lots of add-ons
• It’s free!
A screenshot from http://www.R-project.org/
R can be run in interactive or batch modes. The
interactive mode is useful for trying out new analyses and
making sure your code is doing what you think it is. The
batch mode is useful for carrying out pre-defined analyses
in the background.
> 3 + 5
[1] 8
> 1:20
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13
[14] 14 15 16 17 18 19 20
>
> # This is a comment
>
> 30 + 10 / # I'm not done typing
+ 2
[1] 35
To store a value, we can assign it to a variable.
> x1 <- 32 %% 5
> print(x1)
[1] 2
> x2 <- 32 %/% 5
> x2 # In interactive mode, this prints the object
[1] 6
> ls() # List all my variables
[1] "x1" "x2"
> rm(x2) # Remove a variable
> ls()
[1] "x1"
Variable names must follow some rules: they may contain letters, digits, periods, and underscores; they must begin with a letter or a period; and they are case-sensitive.
You can use the save and load functions to save specific
variables.
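For example (a quick sketch; the filename myvars.RData is just an illustration):

```r
x1 <- 32 %% 5
save(x1, file = "myvars.RData")  # write x1 to a file
rm(x1)                           # remove it from the workspace
load("myvars.RData")             # restore it from the file
x1                               # back again: 2
```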
For now, we’ll work with R’s built-in functions, and the
most important things to know are how to call the
function and how to get help when you need it.
First, determine the arguments.
> args(rnorm)
function (n, mean = 0, sd = 1)
NULL
> args(plot)
function (x, y, ...)
NULL
(In the output of args(rnorm) above, mean = 0 and sd = 1 are default values.)
> help(rnorm) # A shortened version of the real page:
Normal                 package:stats                 R Documentation
Description:
Random generation for the normal distribution with
mean equal to 'mean' and standard deviation equal to 'sd'.
Usage:
rnorm(n, mean = 0, sd = 1)
Arguments:
n: number of observations.
mean: vector of means.
sd: vector of standard deviations.
Details:
If 'mean' or 'sd' are not specified they assume the
default values of 0 and 1, respectively.
Value:
'rnorm' generates random deviates.
Source:
See RNG for how to select the algorithm and for
references to the supplied methods.
References:
Becker, R. A., Chambers, J. M. and Wilks, A. R.
(1988) _The New S Language_. Wadsworth & Brooks/Cole.
See Also:
'runif' and '.Random.seed'
Examples:
...
R has a number of built-in data types. The three most
basic types are numeric, character, and logical.
> mode(3.5)
[1] "numeric"
> mode("Hello")
[1] "character"
> mode(2 < 3)
[1] "logical"
If you are just joining the course this week, please see me
after class, in office hours, or send me an email if you have
not done so already.
Last time in our introduction to R, we learned how to
Can you guess what x will look like after each of the
following lines?
> 1:5
[1] 1 2 3 4 5
> 5:1
[1] 5 4 3 2 1
> seq(0, 10, by = 2)
[1] 0 2 4 6 8 10
> seq(0, 0.5, length = 6)
[1] 0.0 0.1 0.2 0.3 0.4 0.5
> seq(1, 0, by = -0.1)
[1] 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.0
> rep(c(0, 1), times = 5)
[1] 0 1 0 1 0 1 0 1 0 1
> rep(letters[1:5], each = 2)
[1] "a" "a" "b" "b" "c" "c" "d" "d" "e" "e"
R also has many built-in summary functions.
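For instance, a few of them applied to a small vector:

```r
x <- c(4, 8, 15, 16, 23, 42)
sum(x)     # total of all elements: 108
mean(x)    # average: 18
range(x)   # smallest and largest values: 4 42
length(x)  # number of elements: 6
```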
> args(paste)
function (..., sep = " ", collapse = NULL)
NULL
We’ll talk a lot more about working with text later in the
course.
We learned that one of the three main data types in R is
a logical vector, which is either TRUE or FALSE. To
understand how R operates on logical vectors, you need
to know a bit about Boolean algebra.
[Truth tables for "A and B" and "(not A) or B"]
The “not” operation just causes the statement following it
to switch its truth value. So (not TRUE) is FALSE and
(not FALSE) is TRUE. The compound statement A and B is
TRUE only if both A and B are TRUE. The compound
statement A or B is TRUE if either or both A or B is TRUE.
Today’s topics
> cars
Year Month Cars6 Cars13 Junction
1 1990 July 139246 138548 7 to 8
2 1990 July 134012 132908 9 to 10
3 1991 September 137055 136018 7 to 8
4 1991 September 133732 131843 9 to 10
5 1991 December 123552 121641 7 to 8
6 1991 December 121139 118723 9 to 10
7 1992 March 128293 125532 7 to 8
8 1992 March 124631 120249 9 to 10
9 1992 November 124609 122770 7 to 8
10 1992 November 117584 117263 9 to 10
The data.frame function will extract column names either
from arguments with a name = value construction, or from
the arguments themselves.
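A minimal sketch of both styles, reusing a few values from the cars data above:

```r
Year <- c(1990, 1990)
Cars6 <- c(139246, 134012)
# "Year" and "Cars6" are taken from the argument names themselves;
# "Month" comes from the name = value construction
mydf <- data.frame(Year, Month = c("July", "July"), Cars6)
names(mydf)   # "Year" "Month" "Cars6"
```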
> cars$Year
[1] 1990 1990 1991 1991 1991 1991 1992 1992 1992 1992
> cars$Month
[1] July July September September
[5] December December March March
[9] November November
Levels: December July March November September
Data frames are actually a special kind of list. As when
constructing a data frame, we specify the elements of a
list using either name = value or just value for each
argument. Unlike a data frame, lists are not displayed in
columns, and each element can have a different length.
They can also be indexed like vectors, using []. The result
will be another list.
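A small sketch:

```r
# Elements may be given as name = value or just value,
# and can have different lengths
mylist <- list(a = 1, "hello", b = 1:3)
mylist[2]       # single brackets return another list
mylist[["b"]]   # double brackets return the element itself: 1 2 3
length(mylist[2])   # 1 -- a list containing one element
```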
> regression.results[1]
$coefficients
(Intercept) x
0.08847387 2.99781408
> regression.results[[1]]
(Intercept) x
0.08847387 2.99781408
To summarize, the types of data structures we have
encountered so far are:
vector
matrix
array
list
data frame
Today’s topics:
Vectors: [index]
> x[1:10]; x[-3]; x[x>3]
Single factor:
If you don’t save your files as plain text, this won’t work,
since R cannot interpret any extra formatting commands.
So I do NOT recommend you use Microsoft Word.
> data()
Data sets in package 'datasets':
. . . many more
1. Barplots
> VADeaths
Rural Male Rural Female Urban Male Urban Female
50-54 11.7 8.7 15.4 8.4
55-59 18.1 11.7 24.3 13.6
60-64 26.9 20.3 37.0 19.3
65-69 41.0 30.9 54.6 35.1
70-74 66.0 54.3 71.1 50.0
> barplot(VADeaths, legend = TRUE)
[Figure: barplots of VADeaths, stacked and grouped, y-axis "Deaths per 1000", legend showing age groups 50-54 through 70-74. In the stacked version, the height of each bar shows the total.]
> Titanic
, , Age = Child, Survived = No
Sex
Class Male Female
1st 0 0
2nd 0 0
3rd 35 17
Crew 0 0
, , Age = Adult, Survived = No
Sex
Class Male Female
1st 118 4
2nd 154 13
3rd 387 89
Crew 670 3
[Figure: mosaic-style plot of Titanic passengers by class (1st, 2nd, 3rd, Crew)]
Studies of human perception show we are not very good
at comparing areas, volumes, or angles.
[Figure: "Histogram of precip", x-axis "precip" from 0 to 70, y-axis "Frequency"; each bar height is the number of observations falling in that interval]
There are several ways to change the cutoff points.
[Figure: two histograms of precip with different cutoff points, x-axis "precip", y-axis "Frequency"]
Again, let’s add meaningful axis labels and a title.
> hist(precip, breaks = 10, xlab = "Inches",
+ main = "Yearly Average Rainfall for US Cities")
[Figure: "Yearly Average Rainfall for US Cities", x-axis "Inches" from 10 to 70, y-axis "Frequency"]
4. Boxplots
[Figure: annotated boxplot of precip, y-axis "Inches". Labels: outliers plotted as individual points, upper whisker (extending to the upper quartile + 1.5 IQR), and upper quartile at the top of the box.]
> mtcars[1:2,1:5]
mpg cyl disp hp drat
Mazda RX4 21 6 160 110 3.9
Mazda RX4 Wag 21 6 160 110 3.9
> boxplot(mpg~cyl, data = mtcars, xlab = "Cylinders",
+ ylab = "Miles per Gallon",
+ main = "Fuel Consumption")
[Figure: "Fuel Consumption" boxplots of miles per gallon (roughly 10 to 30) for 4-, 6-, and 8-cylinder cars, with one outlier]
5. Scatterplots
> state.x77[1:2,1:4]
Population Income Illiteracy Life Exp
Alabama 3615 3624 2.1 69.05
Alaska 365 6315 1.5 69.31
> plot(state.x77[,"Income"], state.x77[,"Life Exp"])
[Figure: scatterplot of state.x77[, "Income"] against state.x77[, "Life Exp"], with life expectancies roughly 68 to 73]
> plot(state.x77[,"Income"], state.x77[,"Life Exp"],
+ xlab = "Per Capita Income (Dollars)",
+ ylab = "Life Expectancy (Years)",
+ main = "Income and Life Expectancy in U.S., 1970s")
[Figure: the same scatterplot with title "Income and Life Expectancy in U.S., 1970s", x-axis "Per Capita Income (Dollars)", y-axis "Life Expectancy (Years)"]
Are we supposed to compare length, area, or volume?
5. Graph data out of context.
6. Change scales in mid-axis.
7. Emphasize the trivial. (Ignore the important.)
8. Jiggle the baseline.
Sometimes varying the baseline is ok, if the main points of comparison are the first category and the total. This plot is bizarre in other ways.
9. Austria first!
10. Label (a) Illegibly, (b) Incompletely, (c) Incorrectly, and (d) Ambiguously.
11. More (dimensions) is murkier. More dimensions AND colors!
12. If it has been done well in the past, think of another way to do it.
On the other hand, here are some creative plotting techniques you may want to consider.
1. Letting the data points represent another variable. (E.J. Marey)
2. Using "small multiples"
3. Letting deformation represent a variable.
versus
An example from www.swivel.com
Critique:
- x-axis labels poorly located; put them at election years
- y-axis label misleading; these are numbers of counties
- use of color could be improved (e.g. red/blue)
Some R code for you to play with:
[Figure: line plot of number of counties by election year, legend showing Democrat and Republican]
> library("nameofpackage")
or
> library(nameofpackage)
Functions allow us to
> args(substring)
function (text, first, last = 1e+06)
NULL
The body of a function is a sequence of expressions:
{
expression 1
expression 2
return(value)
}
Flow control statements in R consist of
• if/else statements
• for and while loops
• break and next
• the switch function
Statements can be grouped together using curly braces
“{” and “}”. A group of statements is called a block. For
today’s lecture, the word statement will refer to either a
single statement or a block.
The basic syntax for an if/else statement is
if ( condition ) {
statement1
} else {
statement2
}
A chain of conditions takes the form
if (condition1 )
statement1
else if (condition2)
statement2
else if (condition3)
statement3
else
statement4
if (condition) {statement1}
else {statement2}
if ( !is.matrix(m) )
stop("m must be a matrix")
3. To handle common numerical errors
if ( dist == "normal" ){
return( rnorm(n) )
} else if (dist == "t"){
return(rt(n, df = 1, ncp = 0))
} else stop("distribution not implemented")
These if/else constructions are useful for global tests, not
tests applied to individual elements of a vector.
plot(Income, Donations,
col = ifelse(party == "Republican", "red", "blue"))
Looping is the repeated evaluation of a statement or block
of statements.
The syntax in R is
while (condition){
statement
}
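A small sketch of both loop types:

```r
# A for loop evaluates the statement once per element of a vector
total <- 0
for (i in 1:5) {
  total <- total + i
}
total   # 15

# A while loop repeats as long as the condition holds
n <- 1
while (n < 100) {
  n <- n * 2
}
n   # 128, the first power of 2 at or above 100
```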
switch(EXPR, ...)
F (x) = P (X ≤ x)
F^{-1}(q) = inf{x : F(x) > q}
Exercise: What does the inverse CDF for the coin flipping
example look like?
A random variable X is discrete if it takes countably many
values. We define the probability mass function for X by
f (x) = P (X = x)
F̂^{-1}(q) = inf{x : F̂(x) > q}
Probability
Inference
A statistic is a function of a sample, for example the
sample mean or a sample quantile.
[Diagram: a particular choice of parameters and sample size generates a sample X1, X2, ..., Xn, which is reduced to a single statistic]
abbreviation - Distribution

unif - Uniform(a, b): f(x) = 1/(b − a) for a ≤ x ≤ b, and 0 otherwise

exp - Exponential(λ): f(x) = λ e^{−λx} for x > 0, and 0 otherwise

pois - Poisson(λ): f(x) = e^{−λ} λ^x / x!, for x = 0, 1, 2, ...

binom - Binomial(n, p): f(x) = (n choose x) p^x (1 − p)^{n−x}, for x = 0, 1, ..., n
Well, in this case you could just use binom with size=1, or
sample(0:1, 1, prob = c(p, 1-p)).
Therefore,
P(Y ≤ y) = P(F^{-1}(U) ≤ y)
         = P(F(F^{-1}(U)) ≤ F(y))
         = P(U ≤ F(y))
         = F(y)
We need to:
1. Find the CDF
2. Find the inverse CDF
3. Write a function to carry out the inverse CDF method.

[Figure: density curve on [0, 2], y-axis "Density" from 0.0 to 1.0]
We ended last time by talking about the inverse-CDF
method:
We need to:
1. Find the CDF
2. Find the inverse CDF
3. Write a function to carry out the inverse CDF method.
Using the fact that the total area is one, and that the area of a triangle is 1/2 base x height, we find that

F(x) = 0 for x < a
F(x) = (x − a)^2 / ((b − a)(c − a)) for a ≤ x < c
F(x) = 1 − (b − x)^2 / ((b − a)(b − c)) for c ≤ x ≤ b
F(x) = 1 for b < x

Inverting this function, we have

F^{-1}(y) = a + sqrt(y (b − a)(c − a)) for 0 ≤ y < (c − a)/(b − a)
F^{-1}(y) = b − sqrt((1 − y)(b − a)(b − c)) for (c − a)/(b − a) ≤ y ≤ 1
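A sketch of step 3, pushing uniforms through this inverse CDF (the function name rtriangle and the defaults a = 0, c = 0.5, b = 2 are illustrative choices, not from the slides):

```r
rtriangle <- function(n, a = 0, c = 0.5, b = 2) {
  # Draw uniforms and apply the inverse CDF derived above
  y <- runif(n)
  cutoff <- (c - a) / (b - a)
  ifelse(y < cutoff,
         a + sqrt(y * (b - a) * (c - a)),
         b - sqrt((1 - y) * (b - a) * (b - c)))
}
z <- rtriangle(1000)   # 1000 draws, all between a and b
```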
Σ_{i=1}^{d} E[(θ_i − Y_i)^2] = dσ^2
$ pwd
/Users/cgk
$ mkdir unixexamples
$ cd unixexamples
$ ls
$ ls -a
. ..
The two hidden files here are special and exist in every
directory. “.” refers to the current directory, and “..”
refers to the directory above it.
This brings us to the distinction between relative and
absolute path names. (Think of a path like an address in
UNIX, telling you where you are in the directory tree.)
$ mv test.txt newname.txt
Options come between the name of the command and
the arguments, and they tell the command to do
something other than its default. They’re usually prefaced
with one or two hyphens.
$ pwd
/Users/cgk
$ rmdir unixexamples
rmdir: unixexamples: Directory not empty
$ rm -r unixexamples
$ ls
Desktop Movies Rlibs
Documents Music Sites
Icon? Pictures Work
Library Public bin
MathematicaFonts README
mathfonts.tar
To look at the syntax of any particular UNIX command,
type man (for “manual”) and then the name of the
command.
$ ls
AGing.txt Bing.xt Gagging.text Going.nxt ing.ext
$ ls G*
Gagging.text Going.nxt
$ ls *.xt
Bing.xt
$ ls
AGing.txt Bing.xt Gagging.text Going.nxt ing.ext
$ ls [A-G]ing.*
Bing.xt
$ ls *G*
AGing.txt Gagging.text Going.nxt
$ ls *i*.*e*
Gagging.text ing.ext
cp - copy a file
$ cp unixexamples/Bing.xt .
> system("ls")
datagen.R group1.dat group2.dat group3.dat
> system("head -n2 *.dat")
==> group1.dat <==
height weight
65.4 134.9

==> group2.dat <==
height weight
65.7 145.7

Goal: Read in all the data files and put them in a single matrix with an extra column for group.
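One possible way to carry out this goal (a sketch; the set-up lines fabricate three small files like the ones shown, with made-up values for the third group):

```r
# Set-up for the sketch: write three small example files
writeLines(c("height weight", "65.4 134.9"), "group1.dat")
writeLines(c("height weight", "65.7 145.7"), "group2.dat")
writeLines(c("height weight", "66.0 150.0"), "group3.dat")  # made-up third file

# Read each file and stack the rows, adding a group column
alldata <- NULL
for (i in 1:3) {
  grp <- read.table(paste("group", i, ".dat", sep = ""), header = TRUE)
  alldata <- rbind(alldata, cbind(as.matrix(grp), group = i))
}
alldata   # a single matrix with columns height, weight, group
```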
- You have a long job and you want to be able to use the
computer for other things in the meantime.
- You want to log out of the machine while the job is
running and come back to it later.
- You’re running the job on a remote machine, and again
you want to log out.
- You want to be courteous to other users of the machine
by decreasing the priority of the job.
To start a BATCH job, use R CMD BATCH:
$ R CMD BATCH myscript.R &
You can use pwd, cd, and ls just as you would at the usual
prompt to find the right remote file or directory. You can
also use lpwd, lcd, and lls to move around the local
machine.
(There’s also a <<, but its use is more advanced than we’ll
cover.)
Try it out!
The idea behind pipes is that rather than redirecting
output to a file, we redirect it into another program.
$ ls | less
Pipe
Note that the data flows from left to right. See the UNIX
handout for more details on less.
A program through which data is piped is called a filter.
We’ve already seen a few filters: head, tail, and wc.
$ cat somenumbers.txt | cut -d " " -f 3-7
Here are some practice problems:
> string
[1] "St John the Baptist Parish"
> if (substring(string, 1, 3) == "St ")
+ newstring <- paste("St. ",
+ substring(string, 4, nchar(string)), sep = "")
> newstring
[1] "St. John the Baptist Parish"
> strings <- c("a test", "and one and one is two",
+ "one two three")
> gsub("one", "1", strings)
[1] "a test" "and 1 and 1 is two" "1 two three"
> sub("one", "1", strings)
[1] "a test" "and 1 and one is two" "1 two three"
What about finding fake “words” such as rep1!c@ted or
Vi@graa? In this case, we’re looking for numbers and/or
punctuation surrounded by regular letters.
Example: [-+][0-9]
> Addresses
[1] "Cari Kaufman <cgk@stat.berkeley.edu"
[2] "depchairs03-04@uclink.berkeley.edu"
[3] "Chancey <_arkbound@deutschland.de>"
> grep("[[:digit:]_]", Addresses)
[1] 2 3
> Addresses[grep("[[:digit:]_]", Addresses)]
[1] "depchairs03-04@uclink.berkeley.edu"
[2] "Chancey <_arkbound@deutschland.de>"
Going back to our fake “words” example, what will this
match?
[[:alpha:]][[:digit:][:punct:]][[:alpha:]]
> newString
[1] "Re: 90 days" "Fancy rep1!c@ted watches" "Its me"
> gregexpr("[[:alpha:]][[:digit:][:punct:]][[:alpha:]]",
newString)
[[1]]
[1] -1
attr(,"match.length")
[1] -1
[[2]]
[1] 12
attr(,"match.length")
[1] 3
[[3]]
[1] -1
attr(,"match.length")
[1] -1
(No match for the first and third strings; for the second, a match of length 3 starting at position 12.)
Did we miss anything??
We didn’t find p1!c because it consists of four characters:
a letter, a digit, a punctuation mark, and another letter.
[[:alpha:]][[:digit:][:punct:]]+[[:alpha:]]
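A quick check that the + modifier now picks up the whole fake word (a sketch using regexpr and regmatches):

```r
pattern <- "[[:alpha:]][[:digit:][:punct:]]+[[:alpha:]]"
m <- regexpr(pattern, "rep1!c@ted")
regmatches("rep1!c@ted", m)   # "p1!c"
```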
^[^[:lower:]]+$
The position of a character in a pattern determines
whether it is treated as a meta character.
. ^ $ + ? ( ) [ ] { } | \
Announcements:
1. Literal characters
2. Character classes
3. Modifiers
http://www.usatoday.com/news/politicselections/
vote2004/PresidentialByCounty.aspx?
oi=P&rti=G&tf=l&sp=CA
paste(web, collapse = " ")
Goal: grab the county name, votes for Bush, and votes for
Kerry.
[1] "\t\t\t\t<td class=\"notch_white\" width=
\"153\"><b>Alameda</b></td><td class=\"notch_white\"
align=\"Right\" width=\"65\">1,141</td><td class=
\"notch_white\" align=\"Right\" width=\"70\">1,141</td><td
class=\"notch_white\" align=\"Right\" width=
\"60\">107,489</td><td class=\"notch_white\" align=\"Right
\" width=\"60\">326,675</td><td class=\"notch_white\"
align=\"Right\" width=\"60\">0</td>"
Advantages:
- easy to read, write, and process
- in standard cases, don’t need a lot of extra information
- data is self-describing
- format separates content from structure
- data can be easily merged and exchanged
- file is human-readable
- but file is also easily machine-generated
- standards are widely adopted
An attribute
XML is well-formed if it obeys certain syntax rules. The
rules for tags are
leaf nodes
Working with XML in R
The first thing we need to do is load the XML library.
> library(XML)
> doc <- xmlTreeParse("plant_catalog.xml")
> root <- xmlRoot(doc)
> class(root)
[1] "XMLNode"
Aside:
> xmlValue(oneplant[['COMMON']])
[1] "Bloodroot"
> xmlValue(oneplant[['BOTANICAL']])
[1] "Sanguinaria canadensis"
> oneplant[['COMMON']]
<COMMON>Bloodroot</COMMON>
There are special XML versions of lapply and sapply, named
xmlApply and xmlSApply. Each takes an XMLNode object as
its primary argument. They iterate over the node’s
children, invoking the given function.
The final is on paper (not computer) and will take 1-2 hrs.
First, a quick review of sapply and lapply. Remember:
lapply(1:3, function(x){x^2})
sapply(1:3, function(x){x^2})
myList <- list(a=1, b=2, c=3)
lapply(myList, function(x){x^2})
sapply(myList, function(x){x^2})
sapply(myList, log)
sapply(myList, log, base = 10)
sapply(myList, function(x, pow){x^pow}, pow = 3)
3) If the results of sapply cannot be simplified, then sapply
and lapply will return the same thing.
One of the common goals is prediction of the variable of interest at new locations.

[Figure: scatterplot of a spatially referenced variable, with values ranging from about −2 to 3]
2. Lattice data consist of measurements that are particular
to a certain geographic region, such as a county.
Modelling tends to
focus on the structure
of “neighborhoods.”
3. Point process data consist of the locations of particular
events. If there are also values associated with the event,
this is known as a marked point process.
Note:
if Var(X) = Var(Y) = σ²
A few simplifying assumptions about the structure of the
covariance function are
Ĉ(d) = (1 / #I_d) Σ_{(i,j) ∈ I_d} (Z_i − Z̄^{(i)})(Z_j − Z̄^{(j)}),

where Z̄^{(i)} is the mean over all the i’s and Z̄^{(j)} is the mean over all the j’s.
Recall from last time:
Example: Average surface ozone at monitoring stations

[Figure: map of station locations, longitude −95 to −75, latitude roughly 35 to 42, with ozone values of about 50 to 65]
Plotting the data
[Figure: estimated covariance (0 to about 30) plotted against distance]
The idea is that we fit a parametric model to the
correlogram. In other words, we specify a functional form
for the covariance, where the function depends on
certain parameters, and then we estimate those
parameters.
We minimize the sum of squared residuals over the parameters.

[Figure: estimated covariance versus distance, with the fitted covariance function]
Now we need to determine the set of locations at which
we want to do the prediction.
[Figure: regular grid of prediction locations, axes lon and lat (latitude roughly 32 to 42)]
Ok, now it is time to predict.
It turns out the weights for the BLUP are easy to derive if
you know the true covariance function....
First we form the covariance matrix for the observations.
This contains the covariances for every pair of
observations.
We also need γ, whose column j holds the covariances between the observations and new location j. If the vector of observations Z has mean zero, the BLUP is simply

\[ \gamma' \Sigma^{-1} Z \]

and the prediction variance is

\[ \sigma^2 1_m - \mathrm{diag}\{\gamma' \Sigma^{-1} \gamma\} \]
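As a sketch of how these formulas translate into R (a toy one-dimensional example with an assumed exponential covariance function, not from the lecture):

```r
# Toy BLUP (kriging) example with an assumed exponential covariance.
C <- function(h, sigma2 = 1, rho = 10) sigma2 * exp(-h / rho)

obs <- c(1, 2, 4)            # observation locations
z   <- c(0.5, 0.2, -0.1)     # mean-zero observations Z
new <- c(3, 5)               # new locations to predict at

Sigma <- C(abs(outer(obs, obs, "-")))   # covariances among observations
gamma <- C(abs(outer(obs, new, "-")))   # covariances: obs vs. new locations

blup <- t(gamma) %*% solve(Sigma) %*% z                    # gamma' Sigma^-1 Z
pvar <- C(0) - diag(t(gamma) %*% solve(Sigma) %*% gamma)   # sigma^2 1_m - diag{...}
```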
Topics:
• using SQL to extract info from RDBMSs
• relating these back to similar tasks in R
• using SQL from within R
There are tradeoffs in terms of what we choose to do
using SQL and what we do in R.
A database is made up of one or more two dimensional
tables, usually stored as files on the server.
Terminology: in SQL, a NULL represents a missing value.
R:
Chips[c("Pentium", "PentiumII"), ]
In SQL, the SELECT clause names attributes, or functions of attributes.
Additional clauses
For example, this will order the first seven, not give you
the top seven!
SHOW DATABASES;
SHOW TABLES IN database;
SHOW COLUMNS IN table;
DESCRIBE table;    -- same as SHOW COLUMNS IN table
mysql> USE albums;
Database changed
mysql> DESCRIBE Album;
+-------+------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| alid | bigint(20) | YES | MUL | NULL | |
| aid | double | YES | | NULL | |
| title | text | YES | | NULL | |
+-------+------------+------+-----+---------+-------+
3 rows in set (0.00 sec)
What is the structure of this database?

mysql> describe Artist;
+-------+--------+------+-----+---------+-------+
| Field | Type   | Null | Key | Default | Extra |
+-------+--------+------+-----+---------+-------+
| aid   | double | YES  | MUL | NULL    |       |
| name  | text   | YES  |     | NULL    |       |
+-------+--------+------+-----+---------+-------+
mysql> describe Track;
+----------+------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------+------------+------+-----+---------+-------+
| alid | bigint(20) | YES | | NULL | |
| aid | double | YES | | NULL | |
| title | text | YES | | NULL | |
| filesize | bigint(20) | YES | | NULL | |
| bitrate | double | YES | | NULL | |
| length | bigint(20) | YES | | NULL | |
+----------+------------+------+-----+---------+-------+
The individual tables don’t have much interesting
information, since they rely on the IDs.
mysql> SELECT title,length FROM Track
-> WHERE length = (SELECT max(length) FROM Track);
+------------+--------+
| title | length |
+------------+--------+
| The Lovers | 1561 |
+------------+--------+
1 row in set (0.00 sec)
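The joined result that follows links Track to Album and Artist through their IDs. The original query isn't shown; a hypothetical reconstruction (in R, assuming an RMySQL connection `con` to the albums database as set up later) might look like:

```r
# Hypothetical reconstruction of the three-way join behind the next result.
# 'con' is an RMySQL connection to the albums database.
library(RMySQL)
dbGetQuery(con, "
  SELECT Track.title, Album.title, Artist.name, Track.length
  FROM   Track, Album, Artist
  WHERE  Track.alid = Album.alid
    AND  Track.aid  = Artist.aid
    AND  Track.length = (SELECT max(length) FROM Track)")
```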
+------------+------------------------+------------+--------+
| title | title | name | length |
+------------+------------------------+------------+--------+
| The Lovers | Invitation to Openness | Les McCann | 1561 |
+------------+------------------------+------------+--------+
1 row in set (0.00 sec)
Note that it's very important to make all the links between the tables, or you will get unwanted rows in the result.
Another example: let’s make a table of the number of
artists with certain numbers of albums. (How many
artists have one album, how many have two, etc.)
First, this tells us how many albums each artist (aid) has:
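The query itself didn't survive extraction; a plausible sketch (hypothetical, using GROUP BY via an RMySQL connection `con` as set up just below) is:

```r
# Hypothetical sketch: albums per artist, then a table of those counts.
# 'con' is an RMySQL connection to the albums database.
library(RMySQL)
perArtist <- dbGetQuery(con, "
  SELECT aid, COUNT(*) AS nalbums
  FROM   Album
  GROUP  BY aid")
table(perArtist$nalbums)   # how many artists have 1 album, 2 albums, ...
```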
> library(RMySQL)
Loading required package: DBI
> drv <- dbDriver("MySQL")
> con <- dbConnect(drv, dbname = "albums",
+ user = "stat133", pass = "T0pSecr3t")
What do you think this will do? This says that the string is not finished, even though we hit return.
> head(albums)
title name tot
1 'Perk Up' Shelly Manne 2684
2 Kaleidoscope Sonny Stitt 2679
3 Red Garland's Piano (Remastere Red Garland 2676
4 Ask The Ages Sonny Sharrock 2675
5 Duo Charlie Hunter & Leon Parker 2667
6 Tenor Conclave Hank Mobley/Al Cohn/John Coltr 2665
Numerical Optimization
Function optimization refers to the problem of finding a
value of x to make some function f(x) as large (or as
small) as possible.
The golden ratio satisfies

\[ \frac{c+d}{c} = \frac{c}{d} = \varphi \]

Substituting \( c = d\varphi \):

\[ \frac{d\varphi + d}{d\varphi} = \frac{d\varphi}{d}
\;\Rightarrow\; \frac{\varphi + 1}{\varphi} = \varphi
\;\Rightarrow\; \varphi^2 - \varphi - 1 = 0
\;\Rightarrow\; \varphi = \frac{1 + \sqrt{5}}{2} \approx 1.618034 \]
An example:

Start with
\[ x_1 = b - (b - a)/\varphi, \qquad x_2 = a + (b - a)/\varphi \]
Now compare \( f(x_1) \) and \( f(x_2) \).

[Figure: f(x) plotted on [1, 5], with a, x1, x2, b marked on the axis]
Keep the subinterval containing the smaller of the two values, and maintain the golden ratio.

[Figures: three further iterations of the search, each showing f(x) on [1, 5] with the updated a, x1, x2, b]
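The iterations above can be sketched as a small R function (a minimal implementation, assuming f is unimodal on [a, b]):

```r
# Golden section search for a minimum of f on [a, b].
golden <- function(f, a, b, tol = 1e-6) {
  phi <- (1 + sqrt(5)) / 2
  x1 <- b - (b - a) / phi
  x2 <- a + (b - a) / phi
  while (abs(b - a) > tol) {
    if (f(x1) < f(x2)) {        # minimum must lie in [a, x2]
      b <- x2; x2 <- x1
      x1 <- b - (b - a) / phi
    } else {                    # minimum must lie in [x1, b]
      a <- x1; x1 <- x2
      x2 <- a + (b - a) / phi
    }
  }
  (a + b) / 2
}
golden(function(x) (x - 3)^2 + 1, 1, 5)   # close to 3
```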
Setting the RHS equal to zero gives

\[ x_1 = x_0 - \frac{f'(x_0)}{f''(x_0)} \]
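A minimal R sketch of this iteration (assuming we can supply the first two derivatives):

```r
# Newton's method for optimization: iterate x1 = x0 - f'(x0)/f''(x0).
newton <- function(fp, fpp, x0, tol = 1e-8, maxit = 100) {
  for (i in 1:maxit) {
    x1 <- x0 - fp(x0) / fpp(x0)
    if (abs(x1 - x0) < tol) break
    x0 <- x1
  }
  x0
}
# Minimize f(x) = (x - 3)^2 + 1, so f'(x) = 2(x - 3) and f''(x) = 2:
newton(function(x) 2 * (x - 3), function(x) 2, x0 = 0)   # 3
```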
\[ Y_i = f(x_i, \beta) + \varepsilon_i \]
where \( Y_i \) is the response variable, \( \beta \) is a vector of coefficients, \( x_i \) is a vector of covariates, and \( f \) is a nonlinear function.
[Figure: weight (lb and kg) vs. days for the weight-loss data, annotated with the ultimate lean weight (asymptote), the total amount remaining to be lost, and the time taken to lose half the amount to be lost]
The function nls in R uses numerical optimization to find
the values of the parameters that minimize the sum of
squared errors
\[ \sum_{i=1}^{n} (Y_i - f(x_i, \beta))^2 \]
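As a hedged sketch, the wtloss data from MASS (which may or may not be the data plotted above) can be fit this way, with b0 the asymptote, b1 the total amount to be lost, and th the half-life; the starting values are rough guesses:

```r
# Fit the weight-loss model with nls.
library(MASS)   # wtloss data: Days, Weight
fit <- nls(Weight ~ b0 + b1 * 2^(-Days / th),
           data  = wtloss,
           start = list(b0 = 90, b1 = 95, th = 120))
coef(fit)
```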
\[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad \varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2), \quad i = 1, \dots, n \]

The least squares estimates satisfy \( \hat\beta_0 = \bar Y - \hat\beta_1 \bar X \), and

\[ E[Y_i] = \beta_0 + \beta_1 X_i + E[\varepsilon_i] = \beta_0 + \beta_1 X_i \]
1) The errors are clearly not iid normal, and 2) if we go out far enough, we actually predict a negative number of failures.

[Figure: number of failures vs. temperature (degrees F), with a fitted least squares line]
Instead we will fit a logistic regression model.
Note that in each case, the link function maps the space of
the parameter representing the mean of the distribution
(µi , λi , or pi ) to the real line, which is the space of the linear
predictor.
Generalized linear models can be fit using an algorithm
called iteratively reweighted least squares. In R, this is
implemented in the function glm.
[Figure: estimated failure probability vs. temperature (degrees F), with the fitted logistic curve]
Interpretation of the coefficients is a bit trickier in logistic regression models than it is in linear regression models. Since
\[ p_i/(1 - p_i) = \exp(\beta_0 + \beta_1 X_i), \]
\( \exp(\beta_1) \) is the multiplicative change in the odds for a one-unit increase in \( X_i \).
> library(MASS)
> birthwt[1:2,]
low age lwt race smoke ptl ht ui ftv bwt
85 0 19 182 2 0 0 0 1 0 2523
86 0 33 155 3 0 0 0 0 3 2551
AIC = 2k − 2 log(L)
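A sketch of one possible model for these data (the covariates chosen here are an assumption, not necessarily the lecture's model):

```r
# Logistic regression of low birth weight on a few covariates.
library(MASS)   # birthwt data
fit <- glm(low ~ age + lwt + smoke, data = birthwt, family = binomial)
exp(coef(fit))   # odds ratios
AIC(fit)         # 2k - 2 log(L)
```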
\[ p_i = \frac{\exp\{\beta_0 + \beta_1 X_i\}}{1 + \exp\{\beta_0 + \beta_1 X_i\}} \]

and the likelihood function is

\[ L(\beta_0, \beta_1) = \prod_{i=1}^{n} p_i^{Y_i} (1 - p_i)^{1 - Y_i}, \]
Before, we estimated σ² and ρ by finding the covariogram and fitting a curve to it using nonlinear least squares. However, the MLEs are actually much better estimators.
The kernel of the likelihood function (for normal data)
looks like this:
This data set shows head acceleration in a simulated motorcycle accident.

[Figure: Acceleration vs. Times for the motorcycle data]
This area of statistics is known as nonparametric regression
or scatterplot smoothing. Basically, we want to draw a
curve through the data that relates X and Y. More
formally, we suppose
\[ Y_i = f(X_i) + \varepsilon_i \]
We won't cover the theoretical details here, but just keep in mind this question of how much smoothing to do.
Back to the motorcycle data....
[Figure: motorcycle data with polynomial fits of Degree = 10 and Degree = 20]

A single high-degree polynomial fit to this data is not efficient. What if we split up the region of interest and fit a lower-degree polynomial in each region?
This type of model is known as a piecewise polynomial
model or regression splines.
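A sketch of fitting such a model in R, using bs() from the splines package on the mcycle data from MASS (the choice of df here is an assumption; df = 12 with degree 3 corresponds to 9 interior knots):

```r
# Piecewise cubic fit via regression splines (B-spline basis).
library(MASS)      # mcycle data: times, accel
library(splines)
fit <- lm(accel ~ bs(times, df = 12, degree = 3), data = mcycle)
plot(mcycle$times, mcycle$accel, xlab = "Times", ylab = "Acceleration")
lines(mcycle$times, fitted(fit), lwd = 2)
```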
[Figure: motorcycle data with piecewise polynomial fits, Degree = 1 and Degree = 3]
Motorcycle data with 9 knots:
[Figure: motorcycle data with regression spline fits using 9 knots, Degree = 1 and Degree = 3]
Smoothing spline models are defined in a slightly different
way. Within a class of functions, a smoothing spline
minimizes the penalized least squares criterion
\[ \frac{1}{n} \sum_{i=1}^{n} (Y_i - f(X_i))^2 + \lambda \int f''(x)^2 \, dx \]
[Figure: motorcycle data with smoothing spline fits: df = 10, df = 20, and df chosen by cross-validation]
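In R, smooth.spline fits this criterion; a sketch on the mcycle data from MASS (by default the smoothing parameter is chosen by generalized cross-validation):

```r
# Smoothing splines with fixed df and with automatic smoothing (GCV).
library(MASS)   # mcycle data: times, accel
fit10 <- smooth.spline(mcycle$times, mcycle$accel, df = 10)
fitcv <- smooth.spline(mcycle$times, mcycle$accel)   # df chosen by GCV
fitcv$df
plot(mcycle$times, mcycle$accel, xlab = "Times", ylab = "Acceleration")
lines(fit10, lty = 2)
lines(fitcv, lwd = 2)
```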