Bootstrap: lecture 18 May 12, 2004

Multivariate Hypothesis Tests


Wald-Wolfowitz Test on the Line
Pool the data, rank the pooled observations, and count the number of `runs': maximal sequences of consecutive observations that come from the same sample.
More precisely, suppose we have two samples of observations, which we will call the $X$'s and the $Y$'s. The null hypothesis is $F_X = F_Y$.
The test statistic is the number of runs $R$, that is, the number of consecutive sequences of identical labels. Write $m$ and $n$ for the two sample sizes and $N = m + n$; the null distribution of $R$ can be found by a combinatorial argument, and asymptotically

$$
W = \frac{R - \frac{2mn}{N} - 1}{\sqrt{\dfrac{2mn(2mn-N)}{N^{2}(N-1)}}} \sim N(0,1),
$$

as long as the ratio of the sample sizes stays bounded away from zero as they increase. Reject $H_0$ for small values of $R$: too few runs mean the two samples are segregated.
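For illustration, here is a minimal R sketch of the runs statistic and its normal approximation; the name runs_test is ours, and we assume no ties between the two samples.

runs_test <- function(x, y) {
  m <- length(x); n <- length(y); N <- m + n
  lab <- rep(c("X", "Y"), c(m, n))[order(c(x, y))]  # sample labels in pooled rank order
  R <- 1 + sum(lab[-1] != lab[-N])                  # number of runs
  W <- (R - 2 * m * n / N - 1) /
    sqrt(2 * m * n * (2 * m * n - N) / (N^2 * (N - 1)))
  c(R = R, W = W)                                   # reject H0 for small W (few runs)
}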

Smirnov Test on the Line


Sort the pooled univariate observations; $r_i$ is the number of $X$'s that have rank less than or equal to $i$, and $s_i$ is the same count for the $Y$'s. The statistic here is

$$
D = \max_i |d_i| \qquad \text{where} \qquad d_i = \frac{r_i}{m} - \frac{s_i}{n}.
$$

Reject for large values of $D$.
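A minimal R sketch of $D$; smirnov_D is a hypothetical helper name, and the built-in ks.test(x, y) returns the same statistic.

smirnov_D <- function(x, y) {
  m <- length(x); n <- length(y)
  pooled <- sort(c(x, y))
  r <- sapply(pooled, function(t) sum(x <= t))  # r_i: number of X's at or below pooled point i
  s <- sapply(pooled, function(t) sum(y <= t))  # s_i: same count for the Y's
  max(abs(r / m - s / n))                       # D = max |r_i/m - s_i/n|
}
# Check: smirnov_D(x, y) agrees with unname(ks.test(x, y)$statistic)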

Minimal Spanning Tree Based Tests


This method is inspired by the Wald-Wolfowitz test, and solves the problem that there is no natural multidimensional `ordering'.
The main reference is Friedman and Rafsky (1979), Ann. Statist. 7, pp. 697-717.
One may either use the multidimensional data directly, or first make the data two-dimensional by principal components (or another dimension-reducing technique) in order to be able to represent the tree in a graphic.

Minimal Spanning Tree Algorithm

This is not the travelling salesman problem. Algorithms include Prim's and Kruskal's; I have implemented the following greedy algorithm, which builds the tree incrementally: $T_0$ is a single point. Given $T_i$, add an edge to it by finding the minimal edge between a point in the tree and a point that is not, and join these to make $T_{i+1}$.
Algorithm:

1. Set tree(i) ← −n, for i = 1 : n − 1.

2. Do n − 1 times:
 dmin = min{dist(i, |tree(i)|) : i < n with tree(i) < 0}
 imin = the index at which the minimum is attained
 Add edge: tree(imin) ← −tree(imin)
 Update the list of nearest vertices: for i = 1 : n − 1 do:
– if tree(i) < 0 and dist(i, imin) < dist(i, −tree(i)), then tree(i) ← −imin

The output is a vector tree whose ith entry gives the vertex to which the ith point is connected (a negative entry marks a point not yet joined to the tree).
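A minimal R sketch of this construction, assuming a full symmetric distance matrix d; the name mst_prim and the usage line are ours, not part of the course code.

mst_prim <- function(d) {
  n <- nrow(d)
  tree <- rep(-n, n - 1)   # negative: point i not in the tree yet; |tree[i]| = nearest tree vertex
  for (step in 1:(n - 1)) {
    out  <- which(tree < 0)                          # points still outside the tree
    dmin <- sapply(out, function(i) d[i, -tree[i]])  # their distances into the tree
    imin <- out[which.min(dmin)]
    tree[imin] <- -tree[imin]                        # add edge (imin, tree[imin])
    for (i in which(tree < 0))                       # update nearest tree vertices
      if (d[i, imin] < d[i, -tree[i]]) tree[i] <- -imin
  }
  tree
}

tree <- mst_prim(as.matrix(dist(cbind(x, y))))       # usage on planar data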
Here are some examples (the mstree function below is from S-PLUS; the R package ape provides mst, used in the wine example later):
> mstree(vincr[1:10, ])
$x:
[1] 4.20959759 -3.70445704 4.54166603 7.22903347 4.89603710 2.28360796
[7] 0.00000000 -2.00589728 1.03156424 0.04926162
$y:
[1] -7.0463328 -6.4673190 0.4050449 -0.3888339 -2.2247181 0.0000000
[7] 0.0000000 -3.2911484 -0.8079203 -1.5704904
$mst:
[1] 10 8 6 3 3 7 10 10 10

mst: vector of length nrow(x)-1 describing the edges in the
minimal spanning tree. The ith value in this vector is an
observation number, indicating that this observation and
the ith observation should be linked in the minimal
spanning tree.

$order:
[,1] [,2]
[1,] 4 7
[2,] 3 10
[3,] 5 6
[4,] 6 9
[5,] 7 8
[6,] 10 3
[7,] 1 1
[8,] 9 4
[9,] 8 5
[10,] 2 2
order: matrix of size nrow(x) by 2, giving two types of
ordering. The first column presents the standard ordering
from one extreme of the minimal spanning tree to the
other: it starts on one end of a diameter and numbers the
points so that points close together in Euclidean space
tend to be close in the sequence. The second column
presents the radial ordering, based on distance from the
center of the minimal spanning tree. These can be used to
detect clustering. See below for graph theory definitions.

plot(x, y) # plot original data


mst <- mstree(cbind(x, y), plane=F) # minimal spanning tree
# show tree on plot
segments(x[seq(mst)], y[seq(mst)], x[mst], y[mst])

i <- rbind(iris[,,1], iris[,,2], iris[,,3])


tree <- mstree(i) # multivariate planing
plot(tree, type="n") # plot data in plane
text(tree, label=c(rep(1, 50), rep(2, 50), rep(3, 50))) # identify points

# get the absolute value stress e


distp <- dist(i)
dist2 <- dist(cbind(tree$x, tree$y))
e <- sum(abs(distp - dist2))/sum(distp)
[1] 0.1464094

Two-sample test

 Pool the two samples together.
 Make a minimal spanning tree.
 Count the number of `pure' edges, i.e. edges of the minimal spanning tree whose vertices come from the same sample (a permutation version is sketched below).
 An equivalent statistic is obtained by removing all the edges that have mixed colors and counting how many `separate' trees remain.
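A minimal R sketch of the permutation version, using the mst function from the ape package (also used in the wine example below); the names mst_test and pure_edges are ours.

library(ape)

pure_edges <- function(E, lab) sum(lab[E[, 1]] == lab[E[, 2]])  # count same-label edges

mst_test <- function(xy, lab, S = 1000) {
  A <- mst(dist(xy))                                 # 0/1 adjacency matrix of the MST
  E <- which(A == 1 & upper.tri(A), arr.ind = TRUE)  # its edge list
  obs  <- pure_edges(E, lab)
  null <- replicate(S, pure_edges(E, sample(lab)))   # relabel the points; the tree is fixed
  mean(null >= obs)                                  # permutation p-value
}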
This extends to the case of more than two levels of treatment by doing the following:

Many-sample test

 Pool the K samples together.
 Make a minimal spanning tree.
 Count the number of `pure' edges, i.e. edges of the minimal spanning tree whose vertices come from the same treatment level.
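The two-sample sketch above carries over unchanged: pure_edges only compares labels, so lab may take K levels. For instance, a hypothetical call on the pooled iris matrix i built above:

mst_test(i, rep(1:3, each = 50))   # K = 3 species labels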
Example: Wine data. The data consist of 14 physico-chemical composition variables (all continuous) that we are trying to relate to several categorical variables.
acp.vin <- princomp(vin[, 1:14], cor=T)
#Two components was the choice here
biplot(acp.vin)
library(ape)
mst.wine <- mst(dist(acp.vin$loadings))
> sum(vin[mst.wine,15]==vin[-78,15])
[1] 44

[Figure: biplot of the wine data in the first principal plane (Comp.1 vs Comp.2); observations numbered 1-78, with variable arrows for INTCOL, TANINS, INDGEL, BLEU, PH, ION, PVP, TEINTE, INDHCL, JAUNE, ANTCY, DAFLAV and ROUGE.]

randclass=function(S=1000,data=as.numeric(vin[-78,15]),
                   compar=as.numeric(vin[mst.wine,15])){
  same=rep(0,S)
  n=length(compar)
  for (i in (1:S))
    same[i]=sum(as.numeric(data==compar[sample(n,n)]))
  return(same)}

r1=randclass()
> max(r1)
[1] 39
> max(r2)
[1] 40

hist(r1,nclass=50)
Matlab MSTREE algorithm:

See the MSTree.m file in Richard Strauss's programs.

Convex Hulls as Multivariate Confidence Regions


% PCACOVB: Objective function for bootstrapping pcacov(). Returns a single
% row vector containing loadings and percvar values.
%
% Usage: retstr = ...
% pcacovb(X,not_used1,not_used2,not_used3,npc,loadtype,origload)
%
% X = [n x p] data matrix (obs x vars).
% grps = row or column vector of group identifiers.
% npc = number of leading discriminant functions for
% which scores are desired (default = groups-1).
% loadtype = optional boolean flag indicating the scaling for the
% loadings:
% 0: vector correlations [default];
% 1: regression coefficients;
% 2: squared loadings sum to unity.
% origload = [p x ndf] matrix of loadings from original analysis.
% --------------------------------------------------------------------------
% retstr = row vector containing loadings and percvar results.
%

% RE Strauss, 11/21/99, modified from discrimb.m


% 5/2/00 - isolated scores & loadings in LOADSCRS.

function retstr = pcacovb(X,nu1,nu2,nu3,npc,loadtype,origload)


covmat = cov(X); % Covariance matrix
[evects,evals] = eigen(covmat);
percvar = 100 * evals / sum(evals); % Percents variance
percvar = percvar(1:npc); % Retain subset

loadings = loadscrs(X,evects,npc,loadtype);

for d = 1:npc % Check if direction consistent with original


if (corr(loadings(:,d),origload(:,d),2)<0)
loadings(:,d) = -loadings(:,d);
end;
end;

retstr = [loadings(:)' percvar'];

return;

e =
0.7607 0.7768 0.5830 0.6054 -0.1665 -0.1424 -0.1411 -0.1228 -0.0256 -0.0220
0.7226 0.7371 0.2036 0.2725 0.2543 0.3298 0.4791 0.5041 -0.0910 -0.0816
0.8491 0.8535 -0.1031 -0.0931 0.1059 0.1219 0.0170 0.0381 0.4659 0.4750
0.7972 0.8070 -0.3006 -0.2642 0.3932 0.4286 -0.3134 -0.2483 -0.1006 -0.0887
0.8038 0.8108 -0.4817 -0.4511 -0.3214 -0.2997 0.0377 0.0954 -0.0510 -0.0456
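To turn bootstrap replicates like these into a confidence region, one can draw the convex hull of the bootstrapped loadings in the first principal plane. Below is a minimal sketch in R rather than Matlab; hull_region is a hypothetical helper, and the sign-alignment step that pcacovb performs is omitted here, so in practice arbitrary sign flips of the eigenvectors can inflate the hull.

hull_region <- function(X, var = 1, B = 1000) {
  L <- t(replicate(B, {
    Xb <- X[sample(nrow(X), replace = TRUE), ]  # bootstrap the observations
    eigen(cov(Xb))$vectors[var, 1:2]            # loadings of variable `var' on PC1, PC2
  }))
  plot(L, xlab = "PC1 loading", ylab = "PC2 loading")
  polygon(L[chull(L), ], border = "red")        # convex hull of the bootstrap cloud
}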

Convex Hull in first principal plane

[Figures: convex hulls for three variables; the three variables recentered; and the first two principal axes (Axis 1: 63%, Axis 2: 33%).]
