Statistics for the Four Population Test


Nassim Nicholas Taleb∗, Pierre Zalloua†, and Dan Platt‡
∗Tandon School of Engineering, New York University, †Lebanese American University, ‡IBM Research

First draft, February 21, 2019. Corresponding author: NNT1@nyu.edu

Fig. 1: The intuition of what $F_4$ is trying to capture, a graph inspired from Reich [1].

The four-population test [2], $F_4$, aims at revealing path-dependent structure in DNA mutations arising from the temporal divergence between two samples, by considering their variations alongside two other samples of known properties. It aims at finding the proportion of mutations in one population that are not present in the others, thus assessing differences in evolutionary paths.

In this note we present a test statistic for its standard error under the null hypothesis, assuming such variability is largely insulated from path dependence. The test statistic aims at finding confidence bands that are more tractable and analytically firmer than those obtained by the current techniques entailing jackknife/bootstrap methods.

I. THE PROBLEM

A. Initial

Let $X = (X_1, X_2, X_3, X_4)$ be a $(4 \times n)$ array of random variables with realizations in $\{0, 1, 2\}$ and joint probabilities $p_{\{0,0,0,0\}}, p_{\{0,0,0,1\}}, p_{\{0,0,0,2\}}, \ldots, p_{\{2,2,2,2\}}$ (the number of mutually exclusive combinations of $X$ equals $3^4$). Clearly we are dealing with a multivariate trinomial distribution (of the style sometimes dealt with heuristically in financial applications), but the multivariate aspect of it remains largely unexplored in mathematical statistics [3].

Fig. 2: An example of what the results show (from Platt et al., 2018), establishing a different origin for populations in Anatolia/Levant from the Arabian Peninsula. Anatolia_N and Levant_N stand for Neolithic Anatolia and Levant, respectively.

The intuition of the trinomial can be seen in the marginal distribution of, say, $X_1$:

$$X_1 = \begin{cases} 0 & \text{w.p. } \sum_{l=0}^{2}\sum_{m=0}^{2}\sum_{n=0}^{2} p_{\{0,l,m,n\}} \\ 1 & \text{w.p. } \sum_{l=0}^{2}\sum_{m=0}^{2}\sum_{n=0}^{2} p_{\{1,l,m,n\}} \\ 2 & \text{w.p. } \sum_{l=0}^{2}\sum_{m=0}^{2}\sum_{n=0}^{2} p_{\{2,l,m,n\}} \end{cases}$$

and

$$X_2 = \begin{cases} 0 & \text{w.p. } \sum_{k=0}^{2}\sum_{m=0}^{2}\sum_{n=0}^{2} p_{\{k,0,m,n\}} \\ 1 & \text{w.p. } \sum_{k=0}^{2}\sum_{m=0}^{2}\sum_{n=0}^{2} p_{\{k,1,m,n\}} \\ 2 & \text{w.p. } \sum_{k=0}^{2}\sum_{m=0}^{2}\sum_{n=0}^{2} p_{\{k,2,m,n\}} \end{cases}$$

etc.

We have the characteristic function

$$\chi(t_1, t_2, t_3, t_4) = \sum_{k=0}^{2}\sum_{l=0}^{2}\sum_{m=0}^{2}\sum_{n=0}^{2} e^{i(k t_1 + l t_2 + m t_3 + n t_4)}\, p_{\{k,l,m,n\}}.$$

We have, for $a, b = 1, \ldots, 4$,

$$\sigma_{a,b} = i^{-2} \left. \frac{\partial^2 \chi}{\partial t_a\, \partial t_b} \right|_{t_1=0,\, t_2=0,\, t_3=0,\, t_4=0}. \tag{1}$$
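The moments obtained from the characteristic function can be checked numerically on a small joint table. The following is a minimal sketch, not part of the original note: the function name `moments_from_joint_pmf` and the example table `p` (four independent components sharing one marginal) are assumptions chosen for illustration. It reports the raw cross moments $E(X_a X_b)$, which is what the second derivative in Eq. 1 evaluates to, together with the centered version obtained by subtracting $\mu_a \mu_b$.

```python
import numpy as np

def moments_from_joint_pmf(p):
    """Means and second moments of (X1, X2, X3, X4) from a joint pmf.

    p has shape (3, 3, 3, 3), with p[k, l, m, n] = P(X1=k, X2=l, X3=m, X4=n).
    Returns (mu, raw, cov): means, raw cross moments E[Xa Xb], and covariances.
    """
    p = np.asarray(p, dtype=float)
    # g[a][k, l, m, n] is the value taken by component a at cell (k, l, m, n).
    g = np.indices(p.shape).astype(float)
    mu = np.einsum("aklmn,klmn->a", g, p)
    raw = np.einsum("aklmn,bklmn,klmn->ab", g, g, p)
    cov = raw - np.outer(mu, mu)
    return mu, raw, cov

# Illustrative table (an assumption): independent components, same marginal.
q = np.array([0.5, 0.3, 0.2])
p = np.einsum("k,l,m,n->klmn", q, q, q, q)
mu, raw, cov = moments_from_joint_pmf(p)
print(mu)
print(np.round(cov, 6))
```

With independent components the off-diagonal covariances vanish, which gives a quick sanity check before moving to a dependent table.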
Writing $\sigma_a^2$ for $\sigma_{a,a}$, we express the $(4 \times 4)$ covariance structure by:

$$\Sigma_4 = \begin{pmatrix} \sigma_1^2 & \sigma_{1,2} & \sigma_{1,3} & \sigma_{1,4} \\ \sigma_{1,2} & \sigma_2^2 & \sigma_{2,3} & \sigma_{2,4} \\ \sigma_{1,3} & \sigma_{2,3} & \sigma_3^2 & \sigma_{3,4} \\ \sigma_{1,4} & \sigma_{2,4} & \sigma_{3,4} & \sigma_4^2 \end{pmatrix}$$

and the individual expectations:

$$M = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \\ \mu_4 \end{pmatrix},$$

both computed directly from $\chi(.)$.

The individual components of the covariance matrix can be unwieldy since, for instance,

$$\sigma_{1,2} = \sum_{m=0}^{2}\sum_{n=0}^{2} p_{\{1,1,m,n\}} + 2 \sum_{m=0}^{2}\sum_{n=0}^{2} \left( p_{\{1,2,m,n\}} + p_{\{2,1,m,n\}} + 2 p_{\{2,2,m,n\}} \right), \tag{2}$$

but the good news is that we can compute $\sigma_{1,2}$ and the other elements of the covariance matrix directly from the data, as one would in a time series, which allows us to bypass the various marginal probabilities.

B. The Four Population Estimator $F_4$

We are looking for the $n$-sample properties of

$$F_{4,n} = \frac{1}{n} \sum_{i=1}^{n} (x_{1,i} - x_{2,i})(x_{3,i} - x_{4,i}),$$

particularly its variance.

1) Law of the unconscious statistician (well, almost): Without knowing the distribution of the transformation $(X_1 - X_2)(X_3 - X_4)$, we can readily obtain its moments by means of a variant of the law of the unconscious statistician, and given finiteness of moments we can subsequently establish convergence to a univariate normal distribution. We note that the convolution of the characteristic function $\chi(., ., ., .)$ is easily computable, but highly unwieldy in its expression (the equation spans entire pages), and, in our case, unnecessary.

Using standard results, we have the pre-summed expected value:

$$E((X_1 - X_2)(X_3 - X_4)) = \sigma_{1,3} - \sigma_{1,4} - \sigma_{2,3} + \sigma_{2,4} + (\mu_1 - \mu_2)(\mu_3 - \mu_4) \tag{3}$$

(which is supposed to be 0 under the null) and the pre-summed variance $V((X_1 - X_2)(X_3 - X_4))$:

$$V(.) = 2 \lambda_2 (\mu_1 - \mu_2)(\mu_3 - \mu_4) + \lambda_3 (\mu_1 - \mu_2)^2 + \lambda_1 \left( \lambda_3 + (\mu_3 - \mu_4)^2 \right) + \lambda_2^2, \tag{4}$$

where

$$\lambda_1 = -2\sigma_{1,2} + \sigma_1^2 + \sigma_2^2,$$
$$\lambda_2 = \sigma_{1,3} - \sigma_{1,4} - \sigma_{2,3} + \sigma_{2,4},$$
$$\lambda_3 = -2\sigma_{3,4} + \sigma_3^2 + \sigma_4^2.$$

2) Application of the Central Limit Theorem: Going from the moments of $F_{4,1}$ to $F_{4,n}$ under the central limit theorem (henceforth CLT) without knowing the exact distribution requires (some) independence of realizations of $F_4$, not necessarily independence of the individual components $X_i$. We do not need to assume that the vectors $X_i = X_{i,1}, \ldots, X_{i,n}$ and $X_{i,k}$, $X_{i,l \neq p}$ are independent (in the sense that $E(X_{i,k} X_{i,l \neq p}) = 0$); in other words, $p_{\{.,.,.,.\}}$ is not indexed by any additional counter. We just need to assume that the dependence (and cross-dependence) structure of the individual $X_i$ wanes relative to that of $F_4$.

The problem is common in quantitative finance, where we encounter situations in which assets may exhibit serial dependence (expressed in an autocorrelation function showing some type of "memory") while their higher moments (or, as we have here, cross moments) do not.

Now consider that the vectors $X_i$ are indexed so as to match the same position for individual SNPs, in the manner we index time to variables and make them synchronous. We can thus use analyses homologous to those of time series statistics.

If we assume weak stationarity, that is, covariance-stationarity, the moments of $F_4$ should not be affected by the index $i$ for the vector $X_i$, or by blocks of path dependence; that is, beyond a certain size $n$ of SNPs, the variance of $F_{4,n}$ will be invariant with the sample size $n$. Covariance-stationarity can be further tested for by selecting sub-sections of data and computing the matrix, or, much more effectively, by simply checking whether the variance of $F_{4,n}$ as expressed below appears to be sample dependent.

But, critically, $F_4$ will be affected by a partial reshuffling, which is the entire point.

A Note on CLT and Mixing Conditions: Furthermore, note that it is a common myth that the CLT requires total independence of summands; even if there were some dependence, the CLT can be satisfied under a set of mixing conditions, see the Bradley surveys in [4] and addendum [5]. For the rate of convergence under dependence, see Tikhomirov [6].

Intuitively, we need to worry about serial dependence that causes the standard deviation of the $n$-summed variable to grow at a rate faster than $\sqrt{n}$, see [7] (which also gives a method to ascertain data sufficiency assuming independence).

Reaching the Gaussian, in addition, allows us to easily establish centiles in our analysis.

II. FINAL RESULT

We have the statistic for the summation:

$$E(F_{4,n}) = \sigma_{1,3} - \sigma_{1,4} - \sigma_{2,3} + \sigma_{2,4} + (\mu_1 - \mu_2)(\mu_3 - \mu_4) \tag{5}$$

$$V(F_{4,n}) = \frac{1}{n} V((X_1 - X_2)(X_3 - X_4)), \tag{6}$$

applying the results of Eq. 4.
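Eqs. 3 to 6 translate directly into a few lines of code once the mean vector and covariance matrix are given. The sketch below is illustrative only: the function name `f4_mean_and_variance` is an assumption, and $\lambda_2$ is taken to be the covariance combination $\sigma_{1,3} - \sigma_{1,4} - \sigma_{2,3} + \sigma_{2,4}$ that also appears in Eq. 3.

```python
import numpy as np

def f4_mean_and_variance(mu, sigma, n):
    """Mean and variance of F4 over n positions, from the first two moments.

    mu:    length-4 vector of component means.
    sigma: 4 x 4 covariance matrix of the components.
    n:     number of positions averaged in F4.
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    d12 = mu[0] - mu[1]
    d34 = mu[2] - mu[3]
    lam1 = sigma[0, 0] + sigma[1, 1] - 2.0 * sigma[0, 1]
    lam2 = sigma[0, 2] - sigma[0, 3] - sigma[1, 2] + sigma[1, 3]
    lam3 = sigma[2, 2] + sigma[3, 3] - 2.0 * sigma[2, 3]
    mean = lam2 + d12 * d34                       # Eq. 3, equal to Eq. 5
    var_one = (2.0 * lam2 * d12 * d34             # Eq. 4, single position
               + lam3 * d12 ** 2
               + lam1 * (lam3 + d34 ** 2)
               + lam2 ** 2)
    return mean, var_one / n                      # Eq. 6

# Zero means and uncorrelated unit-variance components (illustrative inputs).
mean, var = f4_mean_and_variance([0, 0, 0, 0], np.eye(4), n=100)
print(mean, var)
```

Passing the moments rather than the raw data keeps the formula separate from the choice of estimator for the means and covariances.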


Further, for a sufficiently large $n$ (estimated using methods in [7], anything > 100, which is orders of magnitude below the quantities used for $F_4$).
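Further computational simplification can come from centering $X_i' = X_i - \mu_i$ and recomputing the central covariance matrix; this yields

$$V(F_{4,n}') = \frac{1}{n} \left[ \left( \sigma'_{1,3} - \sigma'_{1,4} - \sigma'_{2,3} + \sigma'_{2,4} \right)^2 + \left( -2\sigma'_{1,2} + {\sigma'_1}^2 + {\sigma'_2}^2 \right) \left( -2\sigma'_{3,4} + {\sigma'_3}^2 + {\sigma'_4}^2 \right) \right]. \tag{7}$$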
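The step from Eq. 4 to Eq. 7 can be written out in one line. The primed $\lambda$'s below are notation introduced here for illustration only: they denote the quantities $\lambda_1, \lambda_2, \lambda_3$ recomputed from the centered covariance matrix. Since the centered variables have zero means, every term of Eq. 4 that carries a factor $(\mu_1 - \mu_2)$ or $(\mu_3 - \mu_4)$ drops out:

```latex
\mu_i' = 0 \;\Longrightarrow\;
V\big((X_1' - X_2')(X_3' - X_4')\big)
  = \lambda_1' \lambda_3' + {\lambda_2'}^2 ,
\qquad\text{hence}\qquad
V(F_{4,n}') = \frac{1}{n}\left( {\lambda_2'}^2 + \lambda_1' \lambda_3' \right).
```

Covariances are unchanged by subtracting a constant from each component, so the primed covariances coincide with the unprimed ones.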

1) Estimation: We can thus establish the variance of $F_4$ from direct estimation of the variances and covariances of the components, using standard maximum likelihood (or other) estimators.

2) Population under consideration: Our analysis is invariant to whether $X_i$ is a single individual or the mean of several.
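The estimation step can be sketched on data as follows. This is a hedged illustration, not the authors' implementation: the function name `f4_with_standard_error`, the use of the $1/n$ (maximum likelihood) covariance estimator, the z-score, and the simulated genotype array are all assumptions made for the example.

```python
import numpy as np

def f4_with_standard_error(x):
    """F4, its standard error from Eqs. 4 and 6, and the resulting z-score.

    x has shape (4, n): one row per population, one column per SNP,
    with entries in {0, 1, 2}.
    """
    x = np.asarray(x, dtype=float)
    n = x.shape[1]
    f4 = np.mean((x[0] - x[1]) * (x[2] - x[3]))
    mu = x.mean(axis=1)
    sigma = np.cov(x, bias=True)          # 1/n normalisation (ML estimator)
    d12, d34 = mu[0] - mu[1], mu[2] - mu[3]
    lam1 = sigma[0, 0] + sigma[1, 1] - 2.0 * sigma[0, 1]
    lam2 = sigma[0, 2] - sigma[0, 3] - sigma[1, 2] + sigma[1, 3]
    lam3 = sigma[2, 2] + sigma[3, 3] - 2.0 * sigma[2, 3]
    var_one = (2.0 * lam2 * d12 * d34 + lam3 * d12 ** 2
               + lam1 * (lam3 + d34 ** 2) + lam2 ** 2)   # Eq. 4
    se = np.sqrt(var_one / n)                             # Eq. 6
    return f4, se, f4 / se

# Simulated, mutually independent genotypes: an assumption standing in for data.
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(4, 5000))
f4, se, z = f4_with_standard_error(genotypes)
print(f4, se, z)
```

With the $1/n$ covariance estimator, the sample $F_4$ coincides exactly with the plug-in version of Eq. 3, so the point estimate and its standard error are built from the same moments.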

REFERENCES
[1] D. Reich, Who We Are and How We Got Here: Ancient DNA and the New Science of the Human Past. Oxford University Press, 2018.
[2] N. J. Patterson, P. Moorjani, Y. Luo, S. Mallick, N. Rohland, Y. Zhan, T. Genschoreck, T. Webster, and D. Reich, "Ancient admixture in human history," Genetics, pp. genetics–112, 2012.
[3] J. L. Teugels, "Some representations of the multivariate Bernoulli and binomial distributions," Journal of Multivariate Analysis, vol. 32, no. 2, pp. 256–268, 1990.
[4] R. C. Bradley, "Basic properties of strong mixing conditions," North Carolina Univ. at Chapel Hill, Center for Stochastic Processes, Tech. Rep., 1985.
[5] R. C. Bradley et al., "Basic properties of strong mixing conditions. A survey and some open questions," Probability Surveys, vol. 2, pp. 107–144, 2005.
[6] A. N. Tikhomirov, "On the convergence rate in the central limit theorem for weakly dependent random variables," Theory of Probability & Its Applications, vol. 25, no. 4, pp. 790–809, 1981.
[7] N. N. Taleb, "How much data do you need? An operational, pre-asymptotic metric for fat-tailedness," International Journal of Forecasting, 2018.
