
Unbiased Estimator for Deconvolving Human Blood into

Antibody Composition
Motivation: We have a HIV antibodies, each with a different
neutralization profile against v HIV-1 pseudo-viruses (measured in
IC50). Given human blood, tested against the same viruses, we want to
know the composition of antibodies, θ.
Caveat: Testing human blood against the panel of viruses gives
measurements in ID50, not IC50. IC50's do not add linearly.
Solution: Transform the data. The Gibbs free energy,

  ΔG_ij = RT ln(X_ij C_ij),

does add linearly across antibodies, where X_ij is the IC50 value of
antibody j on virus i. R, T, and C are known. An experimentally determined
monotonic calibration curve, f, tells us

  s ≈ f((ΔG) θ)

where s_i is the ID50 measurement of the blood versus virus i.


Presenter: Han Altae-Tran Authors, Year: I. Georgiev, N. Doria-Rose T. Zhou, Y. Kwon, et al. 2013
Unbiased Estimator for Deconvolving Human Blood into
Antibody Composition
Model: f⁻¹(s) ~ N((ΔG)θ, σ² I_{v×v}), unknown θ ∈ R^a, σ; known f
Estimand: g(θ) = θ
Decision Space: D = R^a    Loss: L(θ, d) = ||g(θ) − d||²₂
Initial estimator: Let r(s) = Spearman rank ordering of s. Example:
r((2.0, 9.2, 1.0)ᵀ) = (2, 3, 1)ᵀ. To avoid f and ΔG, the authors chose

  δ(X) = argmin_{θ ⪰ 0} ||r(s) − r(Xθ)||₁

Not UMVUE, not minimax, inadmissible

Alternative estimator: δ′(X) = ((ΔG)ᵀ(ΔG))⁻¹ (ΔG)ᵀ f⁻¹(s)
UMVUE
Preference: I would prefer δ over δ′ since θ ⪰ 0, but the UMVUE δ′
need not satisfy this property. Moreover, δ is sparse.


Presenter: Han Altae-Tran Authors, Year: I. Georgiev, N. Doria-Rose T. Zhou, Y. Kwon, et al. 2013
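A minimal Python sketch of the unbiased alternative estimator δ′ above (not the authors' code; the ΔG matrix, noise level, and composition are made-up toy values). It also shows why one might still prefer δ: the least-squares solution is not constrained to θ ⪰ 0 and can return negative weights.

import numpy as np

def ols_deconvolve(dG, f_inv_s):
    """Unbiased estimate of theta: ((dG)^T dG)^{-1} (dG)^T f^{-1}(s)."""
    theta_hat, *_ = np.linalg.lstsq(dG, f_inv_s, rcond=None)
    return theta_hat

rng = np.random.default_rng(0)
dG = rng.normal(size=(8, 3))                  # toy: v = 8 viruses, a = 3 antibodies
theta_true = np.array([0.5, 0.0, 1.2])        # nonnegative antibody composition
f_inv_s = dG @ theta_true + 0.05 * rng.normal(size=8)
print(ols_deconvolve(dG, f_inv_s))            # unbiased, but may go negative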
Minimax estimation of the Scale Parameter of Laplace
Distribution under Squared-Log Error Loss Function

Motivation: The classical Laplace distribution with mean zero and
variance 2 was introduced by Laplace in 1774. This distribution has
been used for modeling data that have heavier tails than those of the normal
distribution. The probability density function of the Laplace distribution
with location parameter a and scale parameter θ is given by:

  f(x | a, θ) = (1/(2θ)) exp(−|x − a|/θ),   −∞ < x < ∞

This paper aims to find a minimax estimator for the scale parameter
when the location parameter is known.

Presenter: Mona Azadkia Authors, Year: Huda Abdullah, Emad F.Al-Shareefi, 2015
Minimax estimation of the Scale Parameter of Laplace
Distribution under Squared-Log Error Loss Function
Model: Xᵢ ~ L(a, θ) independently, for unknown θ ∈ R⁺ and known a
Estimand: g(θ) = θ
Decision Space: D = R⁺    Loss: L(θ, d) = (log d − log θ)² = (log(d/θ))²
Initial estimator: δ(X) = (Σᵢ₌₁ⁿ |xᵢ − a|) / exp(ψ(n)), where ψ(n) = Γ′(n)/Γ(n) is the digamma
function
Minimax, Bayes, UMRUE
The prior distribution is Jeffreys' prior; that is, the prior density of θ is
proportional to 1/θ^{3n/2}.
Alternative estimator: δ′(X) = (Σᵢ₌₁ⁿ |xᵢ − a|)/n
Unbiased
Preference: I would prefer δ because of its minimaxity. Also, for large n
the bias is not too large.

Presenter: Mona Azadkia Authors, Year: Huda Abdullah, Emad F.Al-Shareefi, 2015
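A small simulation sketch comparing the two estimators of the Laplace scale θ above (my own illustration, not from the paper; the sample size and true θ are arbitrary).

import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(1)
a, theta, n = 0.0, 2.0, 50                     # known location, true scale, sample size
x = rng.laplace(loc=a, scale=theta, size=n)

s = np.sum(np.abs(x - a))
delta_minimax = s / np.exp(digamma(n))         # minimax/Bayes estimator from the slide
delta_unbiased = s / n                         # unbiased alternative
print(delta_minimax, delta_unbiased)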
Estimating the number of unseen species: How far can one
foresee?

Definition: Suppose one draws n samples from a discrete distribution
P (unknown) and those n samples contain S(n) unique elements. How
many new unique elements should one expect to "discover" if m further
samples are drawn from P?
E.g.: how often species of butterflies are trapped (n = 639, S(n) = 285).

  Frequency (i)   1    2    3    4    5
  Species (Φᵢ)    110  78   44   24   29

Question: How many new species (Uₙ(m)) of butterflies will be "discovered" if we
make m (= 100) more observations?

Data generating model (P): Multinomial distribution, Poisson.

Estimand g(θ): Uₙ(m).

Loss Function: ((Uₙ(m) − Uₙᴱ(m)) / m)², where E denotes the estimator.

Presenter: Vivek Kumar Bagaria Authors, Year: Alon Orlitsky, Ananda Suresh, Yihong Wu, 2015
Estimating the number of unseen species: How far can one
foresee?
UMRUE estimator: Uₙᵘ(m) = −Σᵢ₌₁ⁿ (−m/n)ⁱ Φᵢ
Estimator proposed: Uₙʰ(m) = −Σᵢ₌₁ⁿ (−m/n)ⁱ hᵢ Φᵢ, where hᵢ = P(Poi(r) ≥ i)
Not UMVUE / UMRUE, Not Bayes, May be admissible

(Uₙʰ(m))⁺ dominates Uₙʰ(m).

What's good then?

  The maximum loss under the Poisson and multinomial models is bounded
  The maximum loss is close to a lower bound that applies to
  all estimators
What is the next step?
  A few "parts" of the estimator aren't well motivated; one could
  improve them
  Extend this framework to various other models P.

Presenter: Vivek Kumar Bagaria Authors, Year: Alon Orlitsky, Ananda Suresh, Yihong Wu, 2015
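A minimal sketch of the proposed smoothed estimator Uₙʰ(m), following my reading of the formula above, with the Poisson smoothing parameter r left as an input; this is an illustration, not the authors' reference code.

import numpy as np
from scipy.stats import poisson

def unseen_species(phi, n, m, r):
    """phi[i-1] = number of species seen exactly i times among the n samples."""
    i = np.arange(1, len(phi) + 1)
    h = poisson.sf(i - 1, mu=r)                # h_i = P(Poi(r) >= i)
    return -np.sum((-m / n) ** i * h * np.asarray(phi, dtype=float))

# Butterfly-style toy input: Phi_1..Phi_5 from the slide, m = 100 new samples.
print(unseen_species([110, 78, 44, 24, 29], n=639, m=100, r=1.0))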
On improved shrinkage estimators for concave loss

Motivation:
Estimating a location vector of a multivariate model is of interest in
many statistical modeling settings. Spherically symmetric distributions
are commonly used. For a multivariate normal with squared error as the loss
function, James and Stein famously showed that the sample mean is
inadmissible, and proposed a minimax estimator using shrinkage
methods.
Contribution:
This paper develops new minimax estimators for a wider range of
distributions and loss functions. For multivariate Gaussian scale mixtures
and loss functions that are concave functions of squared error, this paper
expands the class of known minimax estimators. Similar to the
James-Stein estimator, the proposed minimax estimators are of the form:

  (1 − f(X))X

Presenter: Stephen Bates Authors, Year: Kubokawa, Marchand, and Strawderman, 2015
On improved shrinkage estimators for concave loss
Model: X is from an MVN scale mixture: X | θ ~ N(θ, V·I) with V a
non-negative r.v. with cdf h(v).
Estimand: The location parameter: g(θ) = θ.
Decision Space: D = Rᵖ
Loss: L(θ, d) = l(||θ − d||²)
Initial estimator:

  δₐ(X) = (1 − a · r(||X||²)/||X||²) X

where r(t) and a satisfy: 0 ≤ r(t) ≤ 1, r(t) non-decreasing, r(t)/t
non-increasing, 0 < a < 2/E*[1/||X||²]
Minimax, Not UMRUE, Not MRE (location equivariant)
Alternative estimator: δ₀(X) = X
MRE (location equivariant) when l(t) = t
Preference: If the only goal were to minimize the loss function L, then
δₐ is a better estimator because it dominates δ₀. However, if each
component of X were independently of interest, and the values were not
related in some way, then δ₀ would be more informative.
Presenter: Stephen Bates Authors, Year: Kubokawa, Marchand, and Strawderman, 2015
Parameter Estimation for Hidden Markov Models with
Intractable Likelihoods

Definition: A Hidden Markov model (HMM) is a statistical tool used to


model data output over time. It consists of a hidden process, {Xk }, and
an observed process {Yk }. The model obeys the Markov property: Yk |Xk
is conditionally independent of all preceding random variables.
Motivation: HMMs have a wide range of applications including speech
and handwriting recognition, stock market analysis, and population
genetics. We consider a simple finance application.
Model: One can model the returns of a set of stocks using an HMM
with two hidden states. Specifically, {Xₖ} ∈ {−1, 1}, and

  Yₖ | Xₖ ~ S_α(β, σ, Xₖ + μ),

where S_α(β, σ, Xₖ + μ) denotes the α-stable distribution with stability α,
skewness β, scale σ, and location Xₖ + μ. We assume α is known and
that we have an observed sequence y₁, ..., yₙ.

Presenter: Ahmed Bou-Rabee Authors, Year: Dean, Singh, Jasra, and Peters, 2015
Parameter Estimation for Hidden Markov Models with
Intractable Likelihoods
Estimand: g(θ) = (β, μ) = θ. Decision Space: D = [−1, 1] × (−∞, ∞).
Loss: For fixed ε > 0,

  L(θ, d) = 0 if ||θ − d||₂ < ε,  and 1 otherwise.

Estimator: We cannot express the likelihood analytically, so we
approximate it and take

  δ(Y) = arg sup_{θ∈Θ} p^ε_θ(y₁, ..., yₙ),

where p^ε_θ is the likelihood function of the perturbed process
{Xₖ, Yₖ + εZₖ} and Zₖ ~ U_{B(0,1)}.
Properties: Not Bayes, not UMRU, inadmissible.
Alternative estimator: Assume that θ has a prior distribution π. Choose

  δ′(Y) = arg sup_{θ∈Θ} p^ε(θ | y₁, ..., yₙ).

This is Bayes under prior π.

Preference: The second estimator, if a reasonable prior exists.
Presenter: Ahmed Bou-Rabee Authors, Year: Dean, Singh, Jasra, and Peters, 2015
SLOPE Adaptive Variable Selection via Convex
Optimization
Motivation: A common challenge in many scientific problems is the
ability to achieve correct model selection while also allowing for an
inferential selection framework. A recent estimator that is closely linked
to the LASSO and Benjamini-Hochberg (BH) is Sorted L-One Penalized
Estimation:

  β̂ = argmin_{b ∈ Rᵖ} (1/2)||y − Xb||²₂ + Σᵢ₌₁ᵖ λ_{BH(i)} |b|₍ᵢ₎

which controls the False Discovery Rate (FDR) at level q ∈ [0, 1) under
orthogonal designs, using the sequence

  λ_{BH(i)} = Φ⁻¹(1 − iq/2p)

This is equivalent to the isotonic formulation:

  min (1/2)||y − λ_BH − β||²₂
  subject to β₁ ≥ ... ≥ β_p ≥ 0
Presenter: Bryan Callaway Bogdan, van den Berg, Sabatti, Su, and Candes, 2014
SLOPE Adaptive Variable Selection via Convex
Optimization

Model: Gaussian sequence model with design X, XᵀX = I_p, such that

  Y = Xβ + z,   z ~ N(0, σ² I_p)

Estimand: g(β) = β
Decision Space: D = Rᵖ    Loss (ℓ₂): L(β, d) = ||g(β) − d||²₂
Initial estimator: Given an orthogonal design:

  δ(X, y) = prox_λ(Xᵀy) = argmin_β (1/2)||Xᵀy − β||²₂ + Σᵢ₌₁ᵖ λᵢ |β|₍ᵢ₎

where |β|₍ᵢ₎ is the i-th order statistic of the absolute values.

Not UMVUE / UMRUE, Bayes (under Laplacian priors), Minimax (asymptotically)
Alternative estimator: β̂ = XᵀY
Unbiased and UMVUE

Presenter: Bryan Callaway Bogdan, van den Berg, Sabatti, Su, and Candes, 2014
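A compact sketch of the SLOPE estimate under an orthogonal design, i.e. the prox of the sorted-ℓ1 penalty applied to Xᵀy. This is a PAVA-based illustration in the spirit of the stack-based prox algorithm in the SLOPE paper, not the authors' reference implementation; the level q and the toy input are arbitrary.

import numpy as np
from scipy.stats import norm

def prox_sorted_l1(v, lam):
    """Prox of b -> sum_i lam_i |b|_(i) at v; lam must be non-increasing >= 0."""
    sign, order = np.sign(v), np.argsort(-np.abs(v))
    w = np.abs(v)[order] - lam
    vals, lens = [], []                        # pool adjacent violators (non-increasing fit)
    for wi in w:
        vals.append(wi); lens.append(1)
        while len(vals) > 1 and vals[-1] > vals[-2]:
            l2, l1, v2, v1 = lens.pop(), lens.pop(), vals.pop(), vals.pop()
            vals.append((l1 * v1 + l2 * v2) / (l1 + l2)); lens.append(l1 + l2)
    x = np.empty_like(v, dtype=float)
    x[order] = np.clip(np.repeat(vals, lens), 0, None)
    return sign * x

p, q = 5, 0.1
lam_bh = norm.ppf(1 - np.arange(1, p + 1) * q / (2 * p))   # lambda_BH(i)
print(prox_sorted_l1(np.array([3.0, -0.2, 1.5, 0.1, -2.5]), lam_bh))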
Improved Minimax Estimation of a multivariate Normal
Mean under heteroscedasticity
Problem and Model: Let

  X = (X₁, ..., X_p)ᵀ ~ N(θ, Σ),  where θ = (θ₁, ..., θ_p)ᵀ is unknown and
  Σ = D = diag(d₁, d₂, ..., d_p)

Goal: Find an estimator δ(X) ∈ Rᵖ of θ under squared-error loss:

  L(δ(X), θ) = (δ(X) − θ)ᵀ(δ(X) − θ)

Suggested approach: develop a minimax shrinkage estimator of the form
δ(X) = (I − λA)X where A is known
Sketch of the proof:
  For a fixed A, λ_opt = tr(DA)/E_θ(XᵀAᵀAX)  ⟹  δ_{A,c} = (I − c/(XᵀAᵀAX) · A)X
  Theory tells us that δ_{A,c} is minimax if:
  0 ≤ c ≤ 2(tr(DA) − 2λ_max(DA))  ⟹  pick c*(A, D) = tr(DA) − 2λ_max(DA)

  Pick A such that A minimises the upper bound of the previous inequality
  under a Gaussian prior Λ = N(0, Γ) where Γ is diagonal
Presenter: Claire Donnat Authors, Year: Zhiqiang Tan, 2015
Improved Minimax estimator
Initial estimator:

  δ_A(X) = X − [Σⱼ₌₁ᵖ (a*ⱼ)²(dⱼ + γⱼ) / Σⱼ₌₁ᵖ (a*ⱼ)² Xⱼ²] · AX

where A = diag(a*₁, a*₂, ..., a*_p) is given by

  a*ⱼ = (Σₖ₌₁^{n₀} (dₖ + γₖ)/dₖ²)⁻¹ · (n₀ − 2)/dⱼ   for j = 1, ..., n₀
  a*ⱼ = dⱼ/(dⱼ + γⱼ)                               for j = n₀ + 1, ..., p

and n₀ = argmax{i : dᵢ a*ᵢ = max(a*ⱼ dⱼ, j ∈ [1, p])}
Minimax, not admissible, Not Bayes, not UMRUE
Alternative estimator: δ₀(X) = X
Unbiased, direct
Preference: For estimating θ, the unbiasedness property seems to lead
to a far greater risk (~ pσ² in the case of homoscedasticity), whereas δ_A has
far smaller risk:

  If n₀ ≥ 4,  R(θ, δ_A) ≤ Σᵢ₌₁ᵖ dᵢ − Σⱼ₌₅ᵖ dⱼ²/(dⱼ + γⱼ) < Σᵢ₌₁ᵖ dᵢ = R(θ, δ₀(X))
Presenter: Claire Donnat Authors, Year: Zhiqiang Tan, 2015


Bayesian estimation of adaptive bandwidth matrices in
multivariate kernel density estimation
Motivation: Kernel density estimation (KDE)
Nonparametric way to estimate probability density function of a
random variable
Essentially a generalization of the histogram
Uses kernel functions instead of step functions to locally
approximate the density
Given a random sample X₁, ..., Xₙ ∈ R^d from an unknown density f,
the standard KDE of f(x) has the form

  f̂(x) = (1/n) Σᵢ₌₁ⁿ K_H(x − Xᵢ)

where K_H(x − Xᵢ) is a kernel function centered at Xᵢ with bandwidth
matrix H.
This paper
Examines locally adaptive KDE where a variable bandwidth matrix
Hi is used for the kernel function centered at Xi
Uses Gaussian kernel functions; each Hi is a covariance matrix
Develops Bayes estimators for the variable bandwidth (covariance)
matrices
Presenter: Lin Fan Authors, Year: Zougab, Adjabi, and Kokonendji, 2014
Bayesian estimation of adaptive bandwidth matrices in
multivariate kernel density estimation
Estimand: Each Hᵢ used in the KDE f̂(x) = (1/n) Σᵢ₌₁ⁿ K_{Hᵢ}(x − Xᵢ),
where K_{Hᵢ}(x − Xᵢ) = (2π)^{−d/2} (det Hᵢ)^{−1/2} exp(−(1/2)(x − Xᵢ)ᵀ Hᵢ⁻¹ (x − Xᵢ))
Model: Likelihood: f̂₋ᵢ(xᵢ | Hᵢ, x_{j≠i}) = (1/(n−1)) Σⱼ₌₁,ⱼ≠ᵢⁿ K_{Hᵢ}(xⱼ − xᵢ)
Prior: W⁻¹(Q, r) for each Hᵢ, where Q is an SPD d×d matrix and r ≥ d
Decision Space: SPD d×d matrices    Loss: L(Hᵢ, Ĥᵢ) = tr((Hᵢ − Ĥᵢ)²)

Initial estimator:

  Ĥᵢ(X) = (1/(r − d)) · [Σⱼ₌₁,ⱼ≠ᵢⁿ (det Bᵢ,ⱼ)^{−(r+1)/2} Bᵢ,ⱼ] / [Σⱼ₌₁,ⱼ≠ᵢⁿ (det Bᵢ,ⱼ)^{−(r+1)/2}]

where Bᵢ,ⱼ = (Xⱼ − Xᵢ)(Xⱼ − Xᵢ)ᵀ + Q

Not UMVUE/UMRUE, Bayes wrt the inverse-Wishart prior, admissible

Alternative estimator: Ĥ̂ᵢ(X) = (1/(n−1)) Σⱼ₌₁,ⱼ≠ᵢⁿ (Xⱼ − Xᵢ)(Xⱼ − Xᵢ)ᵀ
MLE wrt the likelihood f_{xᵢ}(x_{j≠i} | Hᵢ) = Πⱼ₌₁,ⱼ≠ᵢⁿ K_{Hᵢ}(xⱼ − xᵢ)
Preference: For small sample size, prefer Ĥᵢ, as small-sample MLEs of
covariance matrices have been shown to have distorted eigenvalue
structure. For large sample size, prefer Ĥ̂ᵢ or the sample covariance
matrix. MLE and sample covariance matrices are usually consistent.
Presenter: Lin Fan Authors, Year: Zougab, Adjabi, and Kokonendji, 2014
Kernel Regularized Least Squares

Motivation: When the correct model specification is unknown (as is


generally the case in social science), researchers often engage in ad hoc
searches across polynomial and interaction terms in GLMs. More flexible
methods, such as neural nets and generalized additive models, are often
useful for prediction but difficult to interpret. By constructing a new set
of features based on the kernel

  k(xᵢ, xⱼ) = exp(−||xⱼ − xᵢ||²/σ²),    K_{i,j} = k(xᵢ, xⱼ)

and using Tikhonov regularization, it is possible to search systematically


over a broader space of smooth, continuous functions. By taking the
expectations of the derivative of these functions over the sample,
“coefficients” can be obtained. By taking derivatives at particular points,
heterogeneous effects can be discovered.

Presenter: Christian Fong Authors, Year: Hainmueller and Hazlett, 2014


Kernel Regularized Least Squares

Model: y = Kc with K 2 Rn⇥n observed and y, c 2 Rn unknown.


y obs = y + ✏ with y obs 2 Rn observed and ✏ 2 Rn unknown, E[✏|K] = 0
Estimand: g(c⇤ , ✏) = Kc⇤ ,
where c⇤ = argmin(y Kc)T (y Kc) cT Kc
c2Rn

Decision Space: D = Rn Loss: L(✓, d) = (g(c, ✏) d)2


1 obs
Initial estimator: (y, K) = K(K + I) y
Not Bayes, Not location MREE, Not location-scale MREE
0
Alternative estimator: (y, K) = K(In + K T K) 1
KT y
Bayes under ⇤ such that c ⇠ M V N (0, In ) and ✏ ⇠ M V N (0, In ).
Preference: I would generally prefer to 0 , because 0 makes a strong
assumption on the prior which will not in general be satisfied.

Presenter: Christian Fong Authors, Year: Hainmueller and Hazlett, 2014
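A minimal sketch of the initial KRLS estimator δ(y, K) = K(K + λI)⁻¹ y^obs with the Gaussian kernel above (my own illustration; the bandwidth σ², regularization λ, and toy data are arbitrary choices, not the paper's defaults).

import numpy as np

def gaussian_kernel(X, sigma2):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma2)

def krls_fit(X, y_obs, sigma2=1.0, lam=0.1):
    K = gaussian_kernel(X, sigma2)
    c = np.linalg.solve(K + lam * np.eye(len(y_obs)), y_obs)   # (K + lam I)^{-1} y^obs
    return K @ c                                               # fitted values K c

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))
y_obs = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)
print(krls_fit(X, y_obs)[:5])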


An Example of an Improvable Rao-Blackwell improvement,
inefficient MLE and unbiased generalized Bayes estimator.
Motivation:
• Data compression outside of exponential families can be difficult,
especially without complete sufficient statistics.
• The authors want to improve the Rao-Blackwellized estimator
E[X₁ | X₍₁₎, X₍ₙ₎] = (X₍₁₎ + X₍ₙ₎)/2; in the given model, T = (X₍₁₎, X₍ₙ₎) is
minimal sufficient but not complete.
• The authors show that this Rao-Blackwell improvement can be
uniformly dominated by another unbiased estimator.
Decision Problem:
• Model P = {U((1 − k)θ, (1 + k)θ) : θ ∈ Θ}, k ∈ (0, 1) known.
• Estimand g(θ) = θ.
• Decision space D = R.
• Loss function is squared error loss, L(δ) = (δ − θ)².

Presenter: Rina Friedberg Authors: Tal Galili, Isaac Meilijson, 2015 November 30, 2015 1/2
An Example of an Improvable Rao-Blackwell improvement,
inefficient MLE and unbiased generalized Bayes estimator.
Initial estimator:

  δ(X₁, ..., Xₙ) = [(1 − k)X₍₁₎ + (1 + k)X₍ₙ₎] / [2(k²(n−1)/(n+1) + 1)]

• Not UMRU/UMVU, not Bayes, not MRE, uniformly dominates the
Rao-Blackwell improvement
Proposed alternate δ′:

  δ′(X₁, ..., Xₙ) = [(1 + k)X₍₁₎ + (1 − k)X₍ₙ₎] / 2

• Not UMRU/UMVU, not Bayes, MRE, minimax

Preference: I would prefer to use δ′ for this task. Although δ dominates
in a large and useful class of unbiased estimators, with δ′ we get stronger
properties such as minimum risk equivariance and minimaxity, which I find
convincing arguments for using δ′ over δ.
Presenter: Rina Friedberg Authors: Tal Galili, Isaac Meilijson, 2015 November 30, 2015 2/2
Condition-number-regularized covariance estimation

Motivation: Estimation of high-dimensional covariance matrices is
known to be a difficult problem, has many applications, and is of current
interest to the larger statistics community. In many applications, including
the so-called "large p, small n" setting, the estimate of the covariance
matrix is required to be not only invertible, but also well-conditioned.
Although many regularization schemes attempt to do this, none of them
address the ill-conditioning problem directly.
Model: Xᵢ ~ N(μ, Σ) independently, for unknown Σ ∈ R^{p×p}, known μ
Estimand: Σ
Decision Space: D = {Σ̂ : Σ̂ ≻ 0}

Loss: L(Σ, Σ̂(X)) = ρ(Σ̂) = −log P(X | Σ̂)  if cond(Σ̂) ≤ κ_max,  and ∞ otherwise

In fact, this is a constrained MLE, but it can be interpreted as a Bayes
estimator.

Presenter: Pengfei Gao Authors, Year: Won, J. H., Lim, J., Kim, S. J., and Rajaratnam, B. (2013)
Condition-number-regularized covariance estimation
Initial estimator: Because the loss function does not depend on Σ, the Bayes
estimator under any prior distribution is the same, namely

  Σ̂ = argmin_{Σ̂} E(ρ(Σ̂) | X) = argmin_{cond(Σ̂) ≤ κ_max} −log P(X | Σ̂)

If we define Q as the eigenvector matrix and lᵢ as the eigenvalues of the sample
covariance matrix, then the solution can be expressed as
Σ̂ = Q diag(λ₁, ..., λ_p) Qᵀ, where

  λᵢ = τ*            if lᵢ ≤ τ*
  λᵢ = lᵢ            if τ* < lᵢ < κ_max τ*
  λᵢ = κ_max τ*      if κ_max τ* ≤ lᵢ

Biased (even as n goes to infinity), Bayes, Admissible

Alternative estimator: Σ̃ = (1/(n−1)) Σᵢ₌₁ⁿ Xᵢ Xᵢᵀ
UMRUE, ill-conditioned, overfitting
Preference: When p is comparable to n, the sample covariance matrix is
usually overfitted and ill-conditioned. By enforcing regularization of the
condition number, we can address ill-conditioning directly, as well as
overfitting. However, when p ≪ n, I prefer the UMRUE Σ̃.
Presenter: Pengfei Gao Authors, Year: Won, J. H., Lim, J., Kim, S. J., and Rajaratnam, B. (2013)
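A sketch of the eigenvalue truncation that defines the condition-number-regularized estimate above, assuming the lower truncation level τ* has already been chosen (in the paper τ* comes out of the constrained optimization; here it is simply an input).

import numpy as np

def cond_reg_cov(S, kappa_max, tau_star):
    """Clip the sample-covariance eigenvalues to [tau*, kappa_max * tau*]."""
    l, Q = np.linalg.eigh(S)
    lam = np.clip(l, tau_star, kappa_max * tau_star)
    return Q @ np.diag(lam) @ Q.T

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 10))                  # toy data with n not much larger than p
S = X.T @ X / (X.shape[0] - 1)
Sigma_hat = cond_reg_cov(S, kappa_max=10.0, tau_star=0.2)
print(np.linalg.cond(Sigma_hat))               # bounded by kappa_max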
Estimation of P(X > Y ) with Topp-Leone Distribution

Motivation: In order to analyze stress-strength reliability of a system, we


need to estimate the probability that stress, X, exceeds strength, Y . For
instance, if Y indicates the amount of pressure that a bridge can tolerate
and X indicates the stress of a flood, the reliability of the bridge is
related to the probability P(X > Y ).

Presenter: Nima Hamidi Authors, Year: Liu and Ziebart, 2014


Estimation of P(X > Y ) with Topp-Leone Distribution
Model: X₁, ..., Xₙ ~ TL(α) and Y₁, ..., Yₘ ~ TL(β) independent, for
unknown α and β, where TL(α) has density (0 < α < ∞ and 0 < x < 1)

  f(x; α) = 2α(1 − x) x^{α−1} (2 − x)^{α−1}

Estimand: g(α, β) = P_{α,β}(X₁ > Y₁) = p

Decision Space: D = R    Loss: L(θ, d) = (p − d)²
Initial estimator: Let ₂F₁(a, b; c; z) be the hypergeometric function and
define u := Σᵢ₌₁ⁿ log(2xᵢ − xᵢ²) and v := Σᵢ₌₁ᵐ log(2yᵢ − yᵢ²). Then,

  δ(X, Y) = ₂F₁(1, 1 − m; n; u/v)        if v ≤ u,
            1 − ₂F₁(1, 1 − n; m; v/u)    otherwise.

UMVUE, Not Bayes, Not minimax
Alternative estimator: δ′(X, Y) = (1/(nm)) Σᵢ Σⱼ I(Xᵢ > Yⱼ)
UMVUE in the non-parametric setting
Preference: The alternative estimator can be computed faster and it is
UMVUE in a larger class of densities. If we have good evidence that X
and Y have Topp-Leone densities, it is better to use the initial estimator. If
not, it is better to use the alternative.
Presenter: Nima Hamidi Authors, Year: Liu and Ziebart, 2014
Convergence rate of Bayesian tensor estimator and its
minimax optimality
Motivation: Tensor decomposition generalizes matrix factorization and
can represent many practical applications, including collaborative filtering,
multi-task learning, and spatio-temporal data analysis
Model:
True tensor A* ∈ R^{M₁×···×M_K} of order K
Observe n samples Dₙ = {(Xᵢ, Yᵢ)}ᵢ₌₁ⁿ
Linear model: Yᵢ = ⟨A*, Xᵢ⟩ + εᵢ, where εᵢ ~ N(0, σ²) i.i.d.
A* is assumed to be "low-rank" (A*_{j₁,...,j_K} = Σᵣ₌₁^{d′} U⁽¹⁾_{r,j₁} U⁽²⁾_{r,j₂} ··· U⁽ᴷ⁾_{r,j_K}
for small d′)
Estimand: g(A*) = A*
Decision Space: D = R^{M₁×···×M_K}
Loss: L(A*, Â) = ||A* − Â||²
Presenter: Bryan He Author: Taiji Suzuki, ICML 2015 1/2
Convergence rate of Bayesian tensor estimator and its
minimax optimality
Initial Estimator: Â = ∫ A Π(dA | Dₙ), where Π is the posterior distribution
of A given Dₙ
Bayes under the prior

  π(d′) ∝ ξ^{d′(M₁+···+M_K)}
  π(U⁽¹⁾, ..., U⁽ᴷ⁾ | d′) ∝ exp(−(d′/(2σ_P²)) Σₖ₌₁ᴷ Tr[U⁽ᵏ⁾ᵀ U⁽ᵏ⁾])

Admissible: Loss is convex → Bayes estimator is unique → admissible
Not UMRU: not even unbiased when A* ≠ 0_{M₁×···×M_K}
Alternative estimator: Â = (XᵀX)⁻¹XᵀY (standard linear regression)
UMVU estimator
Initial estimator is still preferred in most cases: the alternative
estimator is not even well-defined until n is large, and it has a large
variance
Presenter: Bryan He Author: Taiji Suzuki, ICML 2015 2/2
Optimal Shrinkage Estimation of Mean Parameters in
Family of Distributions With Quadratic Variance
This paper discusses simultaneous inference of mean parameters in a family of
distributions with quadratic variance function. (Normal, Poisson, Gamma, . . . )

Assume we have p independent observations Yᵢ, i = 1, ..., p that come from a
distribution with E(Yᵢ) = θᵢ ∈ Θ and Var(Yᵢ) = (ν₀ + ν₁θᵢ + ν₂θᵢ²)/τᵢ. Consider a class of
semi-parametric shrinkage estimators of the form

  θ̂ᵢ^{b,μ} = (1 − bᵢ)·Yᵢ + bᵢ·μ,   with bᵢ ∈ (0, 1]

This is inspired by the Bayes estimator, as each mean parameter θᵢ is a weighted
average of Yᵢ and the prior mean μ.

Under the sum of squared error loss we have

  l_p(θ, θ̂^{b,μ}) = (1/p) Σᵢ₌₁ᵖ (θᵢ − θ̂ᵢ^{b,μ})²

however, since it depends on the unknown θ, we need an estimator for the risk
R_p(θ, θ̂^{b,μ}) = E[l_p(θ, θ̂^{b,μ})] in order to find the optimal parameters b and μ.
Presenter: Qijia Jiang Authors, Year: Xie, Kou, and Brown, 2015 November 29, 2015 1/2
Optimal Shrinkage Estimation of Mean Parameters in
Family of Distributions With Quadratic Variance
Model: Yᵢ independent with E(Yᵢ) = θᵢ ∈ Θ unknown and quadratic variance
Estimand: g(θ) = E[l_p(θ, θ̂^{b,μ})] for fixed b, μ ∈ R
Decision Space: D = R    Loss: L(θ, d) = (g(θ) − d)²
Estimator:

  δ(Y) = (1/p) Σᵢ₌₁ᵖ [bᵢ² · (Yᵢ − μ)² + (1 − 2bᵢ) · V(Yᵢ)/(τᵢ + ν₂)]

where the νₖ are known constants and V(Yᵢ) = ν₀ + ν₁Yᵢ + ν₂Yᵢ².

τᵢ is assumed to be known and can be interpreted as the within-group sample size.
UMVUE, Not Bayes, Inadmissible

Alternative estimator: δ′(Y) = (δ(Y))⁺
Dominates δ
Preference: If the only goal were to estimate g(θ), I would prefer the dominating
δ′ over δ. However, choosing δ′ doesn't lead to better selection of b and μ.

Presenter: Qijia Jiang Authors, Year: Xie, Kou, and Brown, 2015 November 29, 2015 2/2
Minimax Estimators of Functionals (in particular, Entropy)
of Discrete Distributions

Motivation: In a wide variety of fields, it is important to estimate


functions of probability distributions. Particularly, estimating entropy is
important. For example, in machine learning, we might want to maximize
mutual information between input and output, where mutual information
is a function of entropy. Thus, finding a good estimator for entropy is of
practical importance. It might also be the case that the number of
samples available is lower than, or of the order of, the size of the sample
space. In this regime, other methods of estimating entropy often
underperform, and hence there is a need for a new and improved
estimator of entropy in such situations.

Presenter: Poorna Kumar Authors, Year: Jiao, Venkat, Han and Tsachy, 2015
Minimax Estimators of Functionals (in particular, Entropy)
of Discrete Distributions
Model: The set {P : P = (p₁, p₂, ..., p_S)} of discrete probability
distributions P of unknown support size S.
Estimand: Functionals of the form F(P) = Σᵢ₌₁ˢ f(pᵢ), where
f : (0, 1] → R is continuous. In particular, the entropy,
F(P) = −Σᵢ₌₁ˢ pᵢ ln(pᵢ)
Decision Space: D = R    Loss: L(F, F̂) = E_P(F(P) − F̂)²
Initial estimator: For a Bernoulli dist., f̂(p) = f(p̂) − f″(p̂) p̂(1 − p̂)/(2n).
When F is the entropy, we get −p̂ ln(p̂) + (1 − p̂)/(2n)
Not UMVUE / UMRUE (but consistent), Not minimax (but
rate-optimal, so approximately minimax), Inadmissible
Alternative estimator: Work has been done by others to find a Bayes
estimator under a Dirichlet prior for this problem, different from the
estimator used by the authors of this paper.
Preference: The Bayes estimator under a Dirichlet prior is only
consistent for n ≫ S. The estimator in our paper is consistent for
n ≫ S/ln(S), so it is preferable in this regime.
Presenter: Poorna Kumar Authors, Year: Jiao, Venkat, Han and Tsachy, 2015
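A sketch of the first-order bias-corrected plug-in entropy estimator described above, i.e. summing −p̂ ln p̂ + (1 − p̂)/(2n) over observed symbols. This is my own illustration of that formula, not the authors' minimax-rate-optimal construction (which additionally uses polynomial approximation in the small-probability regime).

import numpy as np

def entropy_bias_corrected(counts):
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    p = counts[counts > 0] / n
    plugin = -np.sum(p * np.log(p))            # plug-in (MLE) entropy, in nats
    return plugin + np.sum(1 - p) / (2 * n)    # per-symbol correction (1 - p)/(2n)

print(entropy_bias_corrected([110, 78, 44, 24, 29]))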
Estimation of a multivariate normal mean with a bounded
signal to noise ratio under scaled squared error loss (Kortbi
and Marchand, 2013)

Model: X ~ N_p(θ, σ²I_p) and S ~ σ²χ²_k, independent
Estimand: θ
Loss: L((θ, σ), d) = ||d − θ||²/σ²
Parameter Space: Ω_m = {(θ, σ) ∈ Rᵖ × R⁺ : ||θ||/σ ≤ m}
Decision Space: D = Rᵖ

Motivation: This problem setup has many applications. One
example is normal full-rank linear models:

  Y ~ N_p(Zβ, σ²I_p) with orthogonal design matrix Z, unknown
  parameter vector β, and constraint ||β||/σ ≤ m.

Presenter: Luke Lefebure


Proposed Estimator: δ₁(X, S) = (m²/(m² + p)) X
Minimax under the class of estimators linear in X, Not
UMVUE/UMRUE, Not MRE
Alternative Estimator: δ₂(X, S) = X
UMVUE/UMRUE
Preference: δ₁ is preferable in this setup because it makes use of
the constrained parameter space to shrink the estimate. Hence,
unbiasedness is not necessarily a desirable property.

Presenter: Luke Lefebure


Equivariant Minimax Dominators of the MLE in the Array
Normal Model

Motivation: Tensor data is common in multiple fields (signal processing,


machine learning, etc.). Most statistical analysis of tensors fit a model
X = ⇥ + E, where E represents some error term. Understanding the
residual variation E plays an important role in tasks such as prediction,
model-checking, and improved parameter estimation.
Model: X ~ N_{p₁×...×p_K}(0, σ²Σ_K ⊗ ... ⊗ Σ₁)
Estimand: g(θ) = (σ², Σ_K, ..., Σ₁), where σ² > 0, Σᵢ ∈ S⁺_{pᵢ}, |Σᵢ| = 1
Decision Space: R⁺ × S⁺_{p_K} × ... × S⁺_{p₁}
Loss: L(Σ, S) = (s²/σ²) Σₖ₌₁ᴷ (p/pₖ) tr[Sₖ Σₖ⁻¹] − Kp log(s²/σ²) − Kp
Presenter: Reginald Long Authors, Year: Gerard, Hoff, 2014


Equivariant Minimax Dominators of the MLE in the Array
Normal Model
Define: Eₖ = E[(σ² Tₖᵀ Σₖ Tₖ)⁻¹ | X],  Σ̂ᵢ(Γ, X) = Eᵢ/|Eᵢ|^{1/pᵢ},
σ̂²(Γ, X) = ((1/K) Σₖ₌₁ᴷ |Eₖ|^{−1/pₖ})⁻¹
Estimator: δ(X) = (σ̂²(I, X), Σ̂_K(I, X), ..., Σ̂₁(I, X)).
Minimax, UMREE under the lower triangular group, Inadmissible
Define: Sₖ(X) = ∫_{O_{p_K}×...×O_{p₁}} [Σ̂ₖ(Γ, X)/tr(Σ̂ₖ(Γ, X))] dΓ₁...dΓ_K
σ̃² = ∫_{O_{p_K}×...×O_{p₁}} σ̂²(Γ, X) dΓ₁...dΓ_K
Σ̃ₖ = Sₖ(X)/|Sₖ(X)|^{1/pₖ}

Alternative estimator: δ′(X) = (σ̃², Σ̃_K, ..., Σ̃₁).
Dominates δ, UMREE under the orthogonal group
Preference: I would pick the first estimator, since it has comparable
empirical results to the second estimator, and doesn't require
approximately evaluating an infeasible integral over the space of
orthogonal matrices.

Presenter: Reginald Long Authors, Year: Gerard, Hoff, 2014


Convergence rate of Bayesian tensor estimator and its
minimax optimality

Motivation: Low-rank tensor approximation is useful for many practical
applications, such as collaborative filtering, multitask learning, and
spatio-temporal data analysis. The problem is NP-hard in
general. The paper suggests a computationally tractable Bayes estimator
which is valid without using convex regularization.
A tensor A ∈ R^{M₁×M₂×···×M_K} has CP-rank d if there exist matrices U⁽ᵏ⁾
∈ R^{d×M_k} (k = 1, ..., K) such that A_{j₁,...,j_K} = Σᵣ₌₁ᵈ U⁽¹⁾_{r,j₁} U⁽²⁾_{r,j₂} ··· U⁽ᴷ⁾_{r,j_K}
and d is the minimum number to yield this decomposition (we do not
require orthogonality of the U⁽ᵏ⁾).
A general assumption is that the CP rank of the tensor A* is small. This method
is also adaptable to the case where the CP rank is not known.

Presenter: Rahul Makhijani Authors, Year:Taiji Suzuki, ICML 2015


Convergence rate of Bayesian tensor estimator and its
minimax optimality
Model: Yᵢ = ⟨A*, Xᵢ⟩ + εᵢ, where Xᵢ is a tensor in R^{M₁×M₂×···×M_K}, ⟨·,·⟩ is
the tensor inner product, and εᵢ is i.i.d. Gaussian noise N(0, σ²)
Estimand: g(θ) = A*, a low-rank tensor in R^{M₁×M₂×···×M_K}
Decision Space: D = R^{M₁×M₂×···×M_K}
Loss: L(A, A*) = ||A − A*||_{L₂(P(X))} (generalization error)
Properties of the estimator:
Biased, Bayes under a Gaussian prior, Inadmissible, not UMRU, not
UMVUE, nearly minimax optimal
Alternative estimator: A′ = argmin(l(A) + λ|A|_tr)
|A|_tr is the tensor trace norm
Solution of the convex regularized tensor problem
Preference: If the goal were to estimate A* with a quick convergence
rate, then I would prefer the convex estimator A′ over Â. However, the
Bayesian estimator suggested in the paper is valid in general, in the
absence of convexity as well, and is also adaptable to unknown rank.
Presenter: Rahul Makhijani Authors, Year:Taiji Suzuki, ICML 2015
Near Minimax Line Spectral Estimation
Motivation: Determine the locations and magnitudes of spectral lines
from noisy temporal samples. It is a fundamental problem in statistical
signal processing.
Model: Consider k sinusoids {e^{i2πjf_l}}_{l=1}^k, f_l ∈ [0, 1] unknown.
Let x* ∈ Cⁿ be n = 2m + 1 equi-spaced samples of their superposition
with unknown weights c_l ∈ C,

  x*ⱼ = Σ_{l=1}^k c_l · e^{i2πjf_l},   j ∈ {−m, ..., m}.

We observe a noisy version y ∈ Cⁿ,

  y = x* + ω,   ω ~ N(0, σ² Iₙ).

Estimand: x* ∈ Cⁿ and hence {c_l}_{l=1}^k, {f_l}_{l=1}^k and k.

Remark: To render the estimation feasible, we require that {f_l}_{l=1}^k
satisfy the separation condition

  min_{p≠q} |f_p − f_q| > 4/n.
Presenter: Song Mei Authors, Year: Tang, Bhaskar, and Recht, 2015
Near Minimax Line Spectral Estimation
Estimand: x* ∈ Cⁿ. Decision space: D = Cⁿ.
Loss: L(x̂, x*) = (1/n)||x̂ − x*||²₂.
Estimator: (Atomic norm soft thresholding.) Let τ > 0,

  x̂ = δ_τ(y) = argmin_z (1/2)||y − z||²₂ + τ||z||_A,

where ||·||_A is the atomic norm defined by

  ||z||_A = inf{ Σₐ cₐ : z = Σₐ cₐ a, a ∈ A, cₐ > 0 }

and A is the set of all sinusoid atoms with frequency in a continuous set.

Remark: This problem can be formulated as an SDP.
Properties:
1 Like the Lasso, not unbiased, not Bayes (as τ > 0).
2 Asymptotically minimax: Risk = O(σ² k log n / n), for τ = O(σ√(n log n)).
3 Some guarantees for the estimation of {c_l}_{l=1}^k and {f_l}_{l=1}^k.
Alternative estimator: δ₀(y) = y is unbiased. But it is not good in
that 1) it is dominated by δ_τ(y) for τ = O(σ√(n log n)) in the sparse
setting, and 2) the resulting {ĉ_l} and {f̂_l} are not sparse.

Presenter: Song Mei Authors, Year: Tang, Bhaskar, and Recht, 2015
Convex Point Estimation using Undirected Bayesian
Transfer Hierarchies
Motivation: Classical hierarchical Bayes models do not cope well with
scarcity of instances (X′X may not be invertible!). They can also be
computationally expensive due to the requirement of proper priors, and
may not lead to efficient estimates due to non-convexity. The proposed
undirected transfer learning in this paper leads to convex objectives. The
condition of proper priors is no longer needed, which allows for flexible
specification of joint distributions over transfer hierarchies.

Figure: Hierarchical Bayes model (left) vs. undirected Markov process (right)
Presenter: Jing Miao Authors: Gal Elidan, Ben Packer, Geremy Heitz, Daphne Koller
Convex Point Estimation using Undirected Bayesian
Transfer Hierarchies
Model: Xᵢⱼ ~ N(μ_c, Σ_c), unknown μ_c, Σ_c; θ_c ≡ (μ_c, Σ_c).
θ_c ~ Div(θ_par). (1) θ_par fixed; (2) θ_par ~ inverse gamma, known.
Loss: −log P(X, θ_c, θ_par) = −Σ log P(Xᵢⱼ | θ_c) + Σ div(θ_c, θ_par)

  L = 1 if |δ − θ_c| > ε (ε → 0),  and 0 otherwise

Estimand: θ_c = (μ_c, Σ_c) (estimated separately)

Decision Space: μ_c ∈ R^{ij}, Σ_c ∈ R^{ij×ij}.
Initial estimator: posterior mode
Bayes, admissible, Not UMRUE
Alternative estimator: δ′(μ_c) = X̄ᵢⱼ, δ′(Σ_c) = (1/(j(i−1))) Σ (Xᵢⱼ − X̄ᵢⱼ)²
UMRUE
Preference: In order to derive δ′, we must make sure n > p to avoid a
singular covariance matrix. But scarcity of observations is always a
problem; δ does not invoke such problems. Moreover, the added
divergence term (Σᵢ (1/λᵢ) div(θᵢ_c, θᵢ_par)) gives more flexibility.
Presenter: Jing Miao    Authors: Gal Elidan, Ben Packer, Geremy Heitz, Daphne Koller
Gaussian Sequence Model
• Consider the Gaussian sequence model with white noise,
  yᵢ = θᵢ + εzᵢ, i ∈ N, where θ ∈ Θ(a, c) := {Σᵢ₌₁^∞ aᵢ²θᵢ² ≤ c²}.

• Let a₁ = 0, a_{2k} = a_{2k+1} = (2k)^m for m > 0 (this corresponds to
  the Sobolev space of order m under the trigonometric basis for the
  model dX(t) = f(t)dt + εdW(t)). Define the minimax risk
  R_ε(m, c) := inf_{θ̂} sup_{θ∈Θ(a,c)} E_θ ||θ̂ − θ||², and let R_{ε,L}(m, c) be the
  corresponding minimax risk among linear estimators.

• Pinsker's theorem states that

  lim_{ε→0} ε^{−4m/(2m+1)} R_{ε,L}(m, c) = lim_{ε→0} ε^{−4m/(2m+1)} R_ε(m, c) = P_{m,c}

  where P_{m,c} := (c²(2m+1)/π^{2m})^{1/(2m+1)} (m/(m+1))^{2m/(2m+1)}.

  The linear minimax estimator is given by the shrinkage estimator
  θ̂ᵢ(y) = [1 − aᵢ/μ]₊ yᵢ, where μ is determined by
  Σᵢ₌₁^∞ ε² aᵢ[μ − aᵢ]₊ = c².
Zhu and Lafferty (2015), Quantized Nonparametric Estimation. Hongseok Namkoong 1
Communication/Storage Constraints

• Consider the same setting, but where the estimators are
  constrained to use storage C(θ̂) no greater than B_ε bits.

• Define the minimax risk in this storage-constrained setting as
  R_ε(m, c, B_ε) := inf_{θ̂: C(θ̂) ≤ B_ε} sup_{θ∈Θ(a,c)} E||θ̂ − θ||². Then,

  R_ε(m, c, B_ε) ≍ P_{m,c} ε^{4m/(2m+1)}  +  c²m^{2m}/(π^{2m} B_ε^{2m})
                   [estimation error]       [quantization error]

• Over-sufficient regime: B_ε ≫ ε^{−2/(2m+1)}, the number of bits is very large;
  the classical rate of convergence obtains with the same constant P_{m,c}.
  Sufficient regime: B_ε ~ ε^{−2/(2m+1)}, the minimax rate is still of the same
  order but with a new constant P_{m,c} + Q_{m,c}.
  Insufficient regime: B_ε ≪ ε^{−2/(2m+1)} but with B_ε → ∞, the rate
  deteriorates to lim_{ε→0} B_ε^{2m} R_ε(m, c, B_ε) = c²m^{2m}/π^{2m}.
Zhu and Lafferty (2015), Quantized Nonparametric Estimation. Hongseok Namkoong 2
Note on the Linearity of Bayesian Estimates in the
Dependent Case

Motivation: This paper shows the relationship between the MLE and
the Bayesian estimator when we assume dependent observations.
Especially when they are generated from Markov chains, it proves that
the Bayesian estimator is just a linear function of the MLE.
Model: X = (X0 , . . . , Xn ) is the first (n + 1) observations of a Markov
chain with a finite state space E = {1, 2, . . . , s} and the transition
matrix P = (pij ). We assume the distribution of X0 is known.
Estimand: g(θ) = (p_ij, (i, j) ∈ E², i ≠ j) ∈ [0, 1]^{s(s−1)}

Decision Space: D = [0, 1]^{s(s−1)}

Loss: L(θ, d) = (g(θ) − d)²

Presenter: Jae Hyuck Park Authors, Year: Souad Assoudou and Belkheir Essebbar, 2014
Note on the Linearity of Bayesian Estimates in the
Dependent Case
Initial estimator: the natural choice of prior distribution is the
multivariate beta distribution. Then the Bayes estimator of p_ij is

  p̄_ij = (N_n^{ij} + a_ij) / Σ_{j∈E} (N_n^{ij} + a_ij),   N_n^{ij} ≡ Σ_{t=1}^n I(X_{t−1} = i, X_t = j)

UMVUE / UMRUE, Bayes, Inadmissible

Alternative estimator: Under Jeffreys' prior distribution (assuming
s = 3),

  p̄₁ = [p₂₁p₃₂ + p₃₁(p₂₁ + p₂₃)] / [(p₁₂ + p₁₃)(p₃₂ + p₂₃) + p₃₁(p₁₂ + p₂₁ + p₂₃) + p₂₁(p₁₃ + p₃₂)]

Preference: I prefer the latter because we can utilize an
approximation scheme, the independent Metropolis-Hastings algorithm
(IMH).
Presenter: Jae Hyuck Park Authors, Year: Souad Assoudou and Belkheir Essebbar, 2014
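A sketch of the Bayes estimator p̄ᵢⱼ above: count transitions N_n^{ij} and add the prior pseudo-counts aᵢⱼ (my own illustration; the toy chain and the choice aᵢⱼ = 0.5 are arbitrary).

import numpy as np

def bayes_transition_matrix(chain, s, a=0.5):
    """Posterior-mean estimate of P under a Dirichlet(a,...,a) prior on each row."""
    N = np.zeros((s, s))
    for prev, nxt in zip(chain[:-1], chain[1:]):
        N[prev, nxt] += 1                      # N_n^{ij}
    return (N + a) / (N + a).sum(axis=1, keepdims=True)

chain = [0, 1, 1, 2, 0, 0, 1, 2, 2, 1, 0, 2]   # toy observed path on states {0, 1, 2}
print(bayes_transition_matrix(chain, s=3))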
Robust Nonparametric Confidence Intervals for
Regression-Discontinuity Designs
Motivation: Regression-discontinuity (RD) design has become one of the
most important quasi-experimental tools used by economists and social
scientists to study the effect of a particular treatment on a population.
However, they may be prone to sensitivity to the specific bandwidth
employed and so this paper proposes a more robust confidence interval
estimator using a novel way to estimate standard errors.
Model: Xi , i = 1, ..., n is a random sample with a density with respect
to the Lebesgue measure.
Estimand: ⌃, the middle matrix of a generalized Huber-Eicker-White
heteroskedasticity-robust standard error
Decision Space: D = R₊^{(p+1)×(q+1)}    Loss: L(Σ, d) = ||Σ − d||²_F
Initial estimator:

  Σ̂_{UV+,p,q}(hₙ, bₙ) = (1/n) Σᵢ₌₁ⁿ 1(Xᵢ ≥ 0) K_{hₙ}(Xᵢ) K_{bₙ}(Xᵢ) r_p(Xᵢ/hₙ) r_q(Xᵢ/bₙ)′ σ²_{UV+}(Xᵢ)
Presenter: Karthik Rajkumar Authors, Year: Calonico, Cattaneo, and Titiunik, 2014
Robust Nonparametric Confidence Intervals for
Regression-Discontinuity Designs
Admissible, generalized Bayes, Not unbiased
Alternative estimator: Based on nearest-neighbor estimators with a
fixed tuning parameter:

  Σ̂_{UV+,p,q}(hₙ, bₙ) = (1/n) Σᵢ₌₁ⁿ 1(Xᵢ ≥ 0) K_{hₙ}(Xᵢ) K_{bₙ}(Xᵢ) r_p(Xᵢ/hₙ) r_q(Xᵢ/bₙ)′ σ̂²_{UV+}(Xᵢ)

Here

  σ̂²_{UV+}(Xᵢ) = 1(Xᵢ ≥ 0) · (J/(J+1)) · (Uᵢ − Σⱼ₌₁ᴶ U_{l,j(i)}/J)(Vᵢ − Σⱼ₌₁ᴶ V_{l,j(i)}/J)

Asymptotically valid, more robust in finite samples

Preference: The alternative, if constrained by sample size; otherwise the
initial estimator, for its simplicity.
Presenter: Karthik Rajkumar Authors, Year: Calonico, Cattaneo, and Titiunik, 2014
Evan Rosenman

Ridge regression and asymptotic minimax


estimation over spheres of growing dimension
Lee Dicker, Rutgers University

Motivation: In the regression setting, the goal is to derive minimax
estimators for the parameter vector β when β is d-dimensional
and constrained to S_{d-1}(τ), the sphere of radius τ. Take d → ∞.

Focus is on the ridge regression estimator,

  β̂_t = (XᵀX + (d/t²)I_d)⁻¹ Xᵀy,

which estimates the coefficients and constrains them via a
regularization parameter d/t².

It turns out that the optimal regularization parameter is t = τ, but
this is typically unknown and must be estimated from the data.
Evan Rosenman

Ridge regression and asymptotic minimax


estimation over spheres of growing dimension
Model: x₁, …, xₙ ~ N(0, I_d); ε ~ N(0, Iₙ); y = Xβ + ε
Estimand: Want to estimate β, which lies in S_{d-1}(τ)
Risk Function: R(β, β*) = E(||β − β*||²)
Decision Space: R^d
Estimator: Ridge regression estimator, with an estimate of τ plugged in for t.

•  Not UMVUE (biased), Bayes (under β ~ N(0, (τ²/d) I_d))
•  Not minimax, but asymptotically minimax (as d → ∞)
Alternative: Standard OLS estimate β̂_OLS = (XᵀX)⁻¹Xᵀy,
which is UMVUE (and BLUE by the Gauss-Markov theorem)
Preference: The ridge estimator is preferred as it is more
robust to overfitting as d → ∞.
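A small sketch contrasting the ridge estimator (with the oracle choice t = τ) against OLS as d grows (my own illustration; Dicker's actual estimator replaces τ with a data-driven estimate, which is not reproduced here).

import numpy as np

rng = np.random.default_rng(4)
n, d, tau = 200, 150, 1.0
beta = rng.normal(size=d); beta *= tau / np.linalg.norm(beta)   # beta on the sphere S_{d-1}(tau)
X = rng.normal(size=(n, d))
y = X @ beta + rng.normal(size=n)

ridge = np.linalg.solve(X.T @ X + (d / tau**2) * np.eye(d), X.T @ y)
ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.sum((ridge - beta) ** 2), np.sum((ols - beta) ** 2))   # ridge typically much smaller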
Minimaxity in estimation of restricted and non-restricted scale parameter matrices

Introduction and motivation: In the estimation of the covariance matrix of a


multivariate normal distribution, the best equivariant estimator under the group
transformation of lower triangular matrices with positive diagonal elements,
which is also called the James-Stein estimator, is known to be minimax by the
Hunt-Stein theorem from the invariance approach.

However, finding a least favorable sequence of prior distributions has been an


open question for a long time. This paper addresses this classical problem and
accomplishes the specification of such a sequence.

This is an interesting issue in statistical decision theory. Moreover, in the case


that the parameter space is restricted, it is not clear whether the best
equivariant estimator maintains the minimax property.

Presenter: Matteo Sesia Authors, Year: Tsukuma and Kubokawa, 2015


Minimaxity in estimation of restricted and non-restricted scale parameter matrices

Model: S = XᵀX, where X ~ N_n(0, Σ) for unknown Σ ∈ R^{p×p}.

Estimand: g(Σ) = Σ.
Decision Space: D = {positive-definite p×p matrices}
Loss (Stein): L_S(Σ, δ) = tr(Σ⁻¹δ) − log|Σ⁻¹δ| − p.
Group of transforms: Cholesky decomposition S = TTᵀ, for T ∈ L⁺_p (lower
triangular with positive diagonal):

  G(p): (T, Σ) → (AT, AΣAᵀ),   A ∈ L⁺_p.

Estimator:

  δ^{JS}(S) = T D^{JS} Tᵀ,
  D^{JS}(S) = diag(d₁, ..., d_p),   dᵢ = 1/(n + p − 2i + 1).

Minimax, MREE, Inadmissible.
Dominating estimator:

  δ*(S) = R diag(λᵢ/(n + p − 2i + 1)) Rᵀ,   where S = R diag(λᵢ) Rᵀ

Dominates δ^{JS}, Minimax.
Preference: I would prefer δ* over δ^{JS}. However, δ* is also inadmissible.
Presenter: Matteo Sesia Authors, Year: Tsukuma and Kubokawa, 2015
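A sketch of the James-Stein-type minimax estimator δ^JS(S) = T D^JS Tᵀ above, built from the Cholesky factor of S (my own illustration on toy data; the dominating estimator δ* would apply the same shrinkage weights to the spectral decomposition instead).

import numpy as np

def james_stein_cov(S, n):
    p = S.shape[0]
    T = np.linalg.cholesky(S)                          # S = T T^t, T lower triangular
    d = 1.0 / (n + p - 2 * np.arange(1, p + 1) + 1)    # d_i = 1/(n + p - 2i + 1)
    return T @ np.diag(d) @ T.T

rng = np.random.default_rng(5)
n, p = 50, 4
X = rng.normal(size=(n, p))
S = X.T @ X
print(james_stein_cov(S, n))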
A Comparison of the Classical Estimators with the Bayes Estimators of One Parameter
Inverse Rayleigh Distribution
Motivation: The inverse Rayleigh distribution has many applications in
the area of reliability and survival studies. The object of this paper is to
evaluate the Bayes estimator under different priors and losses.

Model: f(x; Θ) = (2Θ/x³) exp(−Θ/x²),   x > 0, Θ > 0

Loss: L(Θ, Θ̂) = (Θ̂ − Θ)²
Initial estimator: Bayes estimators Θ̂₁ (under Jeffreys' prior) and Θ̂₂ (under an
exponential prior), both functions of T = Σᵢ₌₁ⁿ 1/xᵢ².

Presenter: Yixin Tang Author, Year: Huda A.Rasheed Siba Zaki Ismail 2015
A Comparison of the Classical Estimators with the Bayes Estimators of One Parameter
Inverse Rayleigh Distribution

Under L(Θ, Θ̂) = (Θ̂ − Θ)²: not admissible; biased;
Θ̂₁: Jeffreys' prior    Θ̂₂: Exponential prior

Alternative estimators:
Θ̂_MinMSE = (n − 2) / Σᵢ₌₁ⁿ (1/xᵢ²)    Min MSE, admissible

Θ̂_U = (n − 1) / Σᵢ₌₁ⁿ (1/xᵢ²)    UMVUE; inadmissible; dominated by Θ̂_MinMSE

The choice of Min MSE or Bayes depends on which loss function is
applied. When we choose mean squared error loss and bias is not a big concern,
it is better to use the Min MSE estimator.

Presenter: Yixin Tang Author, Year: Huda A.Rasheed Siba Zaki Ismail 2015
Bayes Minimax Estimation Under Power Priors of Location
Parameters for a Wide Class of Spherically Symmetric
Distributions

Motivation: In 1956, Charles Stein showed that, when estimating the


mean vector ✓ of a p-dimensional random vector with a normal
distribution with identity covariance matrix, estimators of the form

(1 − a/(||X||² + b))X

dominate the usual estimator X. Since then, a significant portion of


research in statistical theory has been focused on extending this result to
other families of distributions. This particular paper focuses on
spherically symmetric distributions, i.e. distributions that can be
expressed as f (kx ✓k2 ), where f is some real-valued function.

Presenter: Andy Tsao Authors, Year: Fourdrinier, Mezoued, Strawderman, 2013


Bayes Minimax Estimation Under Power Priors of Location
Parameters for a Wide Class of Spherically Symmetric
Distributions
Model: X ~ f(||x − θ||²) for fixed f : R → R and unknown θ ∈ Rᵖ
Estimand: Location parameter θ
Decision Space: D = Rᵖ    Loss: L(θ, d) = ||θ − d||²
Initial estimator:

  δ(X) = X + ∇M(||X||²)/m(||X||²),

where ∇ denotes the gradient operator and m, M are the marginal
densities with respect to f(t) and F(t) = ∫ₜ^∞ f(u)du, respectively.
Bayes, Minimax, Biased
Alternative estimator: δ₀(X) = X
Unbiased
Preference: It is shown in the paper that δ dominates δ₀ at the cost of
being biased. Therefore, I would only prefer δ₀ if unbiasedness were
extremely important.
Presenter: Andy Tsao Authors, Year: Fourdrinier, Mezoued, Strawderman, 2013
Estimating the Accuracies of Multiple Classifiers without
Labeled Data

Motivation: Consider a Machine Learning setup, with a twist: instead of


having labeled test data, you only have the predictions of multiple
classifiers over unlabeled test data. This happens when, for example, you
receive predictions from multiple experts, and you don’t have labels
because they’re expensive, or will be available only after making the
decision.
The paper tries to estimate the label of each data point (basically, to
come up with a meta-classifier) by estimating the accuracy of each
classifier, and combining their predictions accordingly.

Model: Consider a binary classification problem: X is an instance space
with output space Y = {−1, 1}. Now, let {fᵢ}ᵢ₌₁ᵐ be m ≥ 3 classifiers
operating on X. We have n unlabeled test instances, and the m × n
matrix of predictions Z, given by Zᵢⱼ = fᵢ(xⱼ).

Presenter: Viswajith Venugopal Authors, Year: Jaffe, Nadler and Kluger, 2015
Estimating the Accuracies of Multiple Classifiers without
Labeled Data

Estimand: (i) Estimate the sensitivity (ψ) and specificity (η) of each
classifier. (ii) Use this to estimate the true label of every data point.
Decision Space: (i) D = [0, 1]  (ii) D = {−1, 1} for each instance.
Loss: L(θ, d) = (g(θ) − d)² *
Initial estimator: (i) ψ̂ = (1/2)(1 + û + v̂ √((1−b)/(1+b))) and
η̂ = (1/2)(1 − û + v̂ √((1+b)/(1−b))), where b is the class imbalance, and û and v̂ are
unbiased estimates based on the mean and covariance. (ii)
ŷ_ML = sign(Σᵢ fᵢ(x) ln αᵢ + ln βᵢ), where αᵢ = ψᵢηᵢ/((1−ψᵢ)(1−ηᵢ)), βᵢ = ψᵢ(1−ψᵢ)/(ηᵢ(1−ηᵢ))
UMRUE, Admissible, Not Bayes
Alternative estimator: MAP estimate with prior probability p of +1.
Same ŷ, but with αᵢ = ψᵢᵖ ηᵢ^{1−p}/((1−ψᵢ)ᵖ(1−ηᵢ)^{1−p}), βᵢ = ψᵢᵖ(1−ψᵢ)ᵖ/(ηᵢ^{1−p}(1−ηᵢ)^{1−p})

Preference: If I had prior information that I wanted to incorporate, then
I would pick my alternative. The initial estimator is neutral a priori.
* The analysis is general enough to accommodate other reasonable loss functions.
Presenter: Viswajith Venugopal Authors, Year: Jaffe, Nadler and Kluger, 2015
Estimation of Functionals of Sparse Covariance Matrices
Motivation: High-dimensional statistical tests often lead to null
distributions that depend on functionals of correlation matrices. Take
two-sample hypothesis testing in high dimensions as an example. Suppose
that we observe two independent samples

  X₁⁽¹⁾, ..., X_{n₁}⁽¹⁾ ∈ Rᵖ iid ~ N(μ₁, Σ)  and  X₁⁽²⁾, ..., X_{n₂}⁽²⁾ ∈ Rᵖ iid ~ N(μ₂, Σ).

Let n = n₁ + n₂. The goal is to test H₀: μ₁ = μ₂ vs H₁: μ₁ ≠ μ₂. A
test for this problem is based on the statistic

  M = (X̄⁽¹⁾ − X̄⁽²⁾)ᵀ(X̄⁽¹⁾ − X̄⁽²⁾) − (n/(n₁n₂)) tr(Σ̂)

which is asymptotically normal under the null hypothesis, with

  var(M) = 2 (n(n − 1)/(n₁n₂)²) ||Σ||²_F (1 + o(1)).

In order to compute the critical value of the test, we need to estimate the
quadratic functional ||Σ||²_F. The paper investigates the optimal rate of
estimating ||Σ||²_F for a class of sparse covariance matrices (indeed
correlation matrices).
Presenter: Can Wang Authors, Year: Fan, Rigollet and Wang, 2015
Estimation of Functionals of Sparse Covariance Matrices
Model: X₁, ..., Xₙ iid ~ N(0, Σ) for unknown Σ = {σᵢⱼ}ᵢⱼ ∈ F_q(R), where
F_q(R) = {Σ ∈ S₊ᵖ : Σᵢ≠ⱼ |σᵢⱼ|^q ≤ R, diag(Σ) = I_p} for some q ∈ [0, 2),
R > 0
Estimand: g(Σ) = Σᵢ≠ⱼ σᵢⱼ²
Decision Space: D = R₊    Loss: L(Σ, d) = (g(Σ) − d)²
Initial estimator: Let Σ̂ = {σ̂ᵢⱼ}ᵢⱼ = (1/n) Σₖ₌₁ⁿ Xₖ Xₖᵀ. Then,

  δ(X) = Σᵢ≠ⱼ σ̂ᵢⱼ² I(|σ̂ᵢⱼ| > τ)

where τ = 2C₀ √(log p / n).
Inadmissible, Not UMVUE / UMRUE,
Minimax up to a constant

Alternative estimator: δ′(X) = Σᵢ≠ⱼ (σ̂ᵢⱼ² − 1/n) / (1 + 1/n)

UMVUE / UMRUE
Preference: If my goal were to use an unbiased estimator, then I would
prefer δ′ over δ. Otherwise, if Σ is believed to be sparse, then I would
prefer δ because it leverages the sparsity of Σ, and attains smaller
variance while not introducing much bias.
Presenter: Can Wang Authors, Year: Fan, Rigollet and Wang, 2015
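A sketch of the thresholding estimator δ(X) for the off-diagonal quadratic functional Σᵢ≠ⱼ σᵢⱼ² (my own illustration; the constant C₀ in the threshold is a tuning choice and the value used here is arbitrary).

import numpy as np

def quad_functional_threshold(X, C0=0.5):
    n, p = X.shape
    S = X.T @ X / n                                    # sigma_hat_ij
    tau = 2 * C0 * np.sqrt(np.log(p) / n)
    off = S - np.diag(np.diag(S))
    return np.sum(off**2 * (np.abs(off) > tau))

rng = np.random.default_rng(6)
Sigma = np.eye(30); Sigma[0, 1] = Sigma[1, 0] = 0.4    # sparse correlation structure
X = rng.multivariate_normal(np.zeros(30), Sigma, size=100)
print(quad_functional_threshold(X))                    # target: 2 * 0.4^2 = 0.32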
Estimating the shape parameter of a Pareto distribution
under restrictions

Motivation: Pareto distribution has found widespread applications in


studies of economic data. This distribution can be used to adequately
model quantities such as individual incomes. Economists have shown that
the shape parameter of a Pareto distribution can be used to represent
inequalities in income distributions. Thus deriving efficient estimates for
the unknown shape parameter is of interest under these situations. For
the existence of higher order moments of a Pareto distribution, the shape
parameter has to be bounded below by a specified constant.
Now let X₁, X₂, ..., Xₙ denote a random sample from a Pareto
distribution P(α, β), where α is a known scale parameter and β is the
shape parameter. We have prior knowledge that β ≥ β₀; our task is to
estimate β.

Presenter: Guanyang Wang Authors, Year: Tripathi , Kumar, and Petropoulos, 2014
Estimating the shape parameter of a Pareto distribution
under restrictions


Model: X₁, ..., Xₙ i.i.d. with density f(x; α, β) = βα^β / x^{β+1},
α ≤ x < ∞, 0 < α < ∞, β ≥ β₀, known α

Estimand: g(β) = β
Decision Space: D = [β₀, ∞)    Loss: L(g(β), d) = (d/β − 1)²
Initial estimator: δ_{g₀}(T) = g₀(T)/(nT), where T = (1/n) Σᵢ₌₁ⁿ ln(Xᵢ/α) and

  g₀(y) = (n − 2) [∫_y^∞ u^{n−2} e^{−nβ₀u} du] / [∫_y^∞ u^{n−3} e^{−nβ₀u} du]

Not UMVUE / UMRUE, Minimax, Admissible

Alternative estimator: (n − 1)/(nT), the UMVUE
Preference: If my only goal were to derive an unbiased estimator, then I
would prefer (n − 1)/(nT), which is the UMVUE. However, δ_{g₀}(T) is
minimax, admissible and generalized Bayes, therefore we may prefer
δ_{g₀}(T) if we do not require unbiasedness.

Presenter: Guanyang Wang Authors, Year: Tripathi , Kumar, and Petropoulos, 2014
Learning with Noisy Labels
Motivation: Designing supervised learning algorithms that can learn
from data sets with noisy labels is a problem of great practical
importance. General results have not been obtained before.
Model: (X, Y) ∈ X × {±1} iid ~ D, observed with class-conditional noise (CCN) as (X, Ỹ)
Estimand: bounded loss function l(t, y)
Decision Space: measurable functions
Loss: E_{(X,Y)~D}[ l(argmin_{f∈F} (1/n) Σᵢ₌₁ⁿ l̃(f(Xᵢ), Ỹᵢ), Y) ]
Initial estimator:

  l̃(t, y) ≡ [(1 − ρ_{−y}) l(t, y) − ρ_y l(t, −y)] / (1 − ρ₊₁ − ρ₋₁)

where
ρ₊₁ = P(Ỹ = −1 | Y = +1),  ρ₋₁ = P(Ỹ = +1 | Y = −1),  ρ₊₁ + ρ₋₁ < 1
Admissible, Not UMRUE, Not Bayes
Alternative estimator: No unique UMRUE exists
Preference: l̃(t, y) for empirical risk minimization under CCN
Presenter: Ken Wang Authors, Year: Natarajan, Dhillon, Ravikumar, and Tewari, 2013
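A sketch of the corrected loss l̃ above, which is unbiased for l under class-conditional noise (a direct transcription of the formula; the base loss and noise rates below are arbitrary examples).

def corrected_loss(loss, rho_plus, rho_minus):
    """Return l_tilde(t, y) = ((1 - rho_{-y}) l(t, y) - rho_y l(t, -y)) / (1 - rho_+1 - rho_-1)."""
    denom = 1.0 - rho_plus - rho_minus
    def l_tilde(t, y):
        rho_y = rho_plus if y == 1 else rho_minus      # rho_y
        rho_neg_y = rho_minus if y == 1 else rho_plus  # rho_{-y}
        return ((1 - rho_neg_y) * loss(t, y) - rho_y * loss(t, -y)) / denom
    return l_tilde

hinge = lambda t, y: max(0.0, 1.0 - y * t)             # example base loss
l_tilde = corrected_loss(hinge, rho_plus=0.2, rho_minus=0.1)
print(l_tilde(0.5, 1), l_tilde(0.5, -1))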
Toward Minimax Off-Policy Value Estimation

Motivation: In reinforcement learning, we often want to find the
estimated value of a new policy. If π is a distribution over actions and rₐ
is the reward for action a, we wish to calculate

  E_π[R] = Σₐ πₐ rₐ

The reward function is unknown, though, and since experiments can get
expensive, it is often not feasible to run the policy, so we instead have to
approximate the value off of previous observations. This paper examines
several estimators for doing so. The one we'll look at here is called the
importance sampling estimator.

Presenter: Colin Wei Authors, Year: Li, Munos, and Szepesvari, 2015
Toward Minimax Off-Policy Value Estimation
Model: Let A be a finite set of actions. We are given n actions
A₁, ..., Aₙ iid ~ π_D, where π_D is a known distribution. We are also
given real-valued rewards R₁, ..., Rₙ, where Rᵢ ~ μ(·|Aᵢ) for some
unknown distribution μ. Let r(a) = E_{R~μ(·|a)}[R]. In this paper, we
consider μ ∈ M_{σ², R_max}, where Var(μ) ≤ σ² and 0 ≤ r(a) ≤ R_max.
Estimand: Given another distribution π, we wish to estimate the value
v^π = E_{A~π, R~μ(·|A)}[R] = Σ_{a∈A} πₐ r(a).
Decision Space: D = R    Loss: L(θ, d) = (v^π − d)²
Initial estimator: v̂ = (1/n) Σᵢ₌₁ⁿ (π(Aᵢ)/π_D(Aᵢ)) Rᵢ
Not Minimax, Not Bayes, Not UMRUE
Other estimators: There is no UMRUE estimator for this loss.
Sketch of Proof: In the class of models we considered, we can show
that our estimator is UMRUE for the subclass where μ(·|a) is distributed
N(θₐ, σ²) for 0 ≤ θₐ ≤ R_max, for unknown parameters θₐ. However, we
can also show that it is not UMRUE when μ(·|a) is uniform; thus there is
no UMRUE for the entire model.
Presenter: Colin Wei Authors, Year: Li, Munos, and Szepesvari, 2015
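A sketch of the importance sampling value estimator v̂ above (my own toy bandit example; the policies and reward means below are arbitrary and used only to simulate data).

import numpy as np

rng = np.random.default_rng(7)
actions = np.array([0, 1, 2])
pi_D = np.array([0.5, 0.3, 0.2])                # logging policy (known)
pi = np.array([0.2, 0.3, 0.5])                  # target policy to evaluate
r = np.array([1.0, 0.5, 2.0])                   # unknown mean rewards (used only to simulate)

n = 1000
A = rng.choice(actions, size=n, p=pi_D)
R = r[A] + 0.1 * rng.normal(size=n)

v_hat = np.mean(pi[A] / pi_D[A] * R)            # importance sampling estimate of v^pi
print(v_hat, pi @ r)                            # compare with the true value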
Iterative hard thresholding methods for l0 regularized
convex cone programming

Motivation: Sparse approximation of signals is of great interest for the


signal processing community. Often, imaging data comes from a simple
structure embedded in a noisy environment.
In these settings, images are often believed to be sparse (zero in many
entries) in some transformed domain of the original signal, such as a
wavelet basis.
Model: Let B = {y ∈ Rᵐ : l ≤ yᵢ ≤ u, i = 1, ..., m} for known l, u.
X ~ N(Φθ, τ²I) for unknown θ ∈ B such that supp(θ) = K, with known τ,
Φ ∈ R^{n×m}, and K.

Presenter: Steve Yadlowsky Authors, Year: Zhaosong Lu, 2013


Iterative hard thresholding methods for l0 regularized
convex cone programming
Estimand: g(θ) = θ.
Decision Space: D = {θ ∈ B : supp(θ) = K}.
Loss: L(θ, d) = ||g(θ) − d||²₂
Initial estimator: Let {θᵗ}_{t∈N} be the sequence defined recursively as

  θᵗ⁺¹ = T_K( Π_B( θᵗ − (1/L) Φᵀ(Φθᵗ − x) ) ),

where

  T_K(v)ᵢ = vᵢ  if |vᵢ| > |vⱼ| for at least m − K distinct j,  and 0 otherwise,

and let δ(x) be a fixed point of this sequence.
Not UMVUE / UMRUE, Not Bayes, Inadmissible
Alternative estimator: δ′(X) = argmin_{θ∈B} ||Φθ − X||² + λ||θ||₀, for λ
chosen such that ||θ*||₀ = K. Solve by testing every support pattern of
θ and solving the constrained least squares problem.
Presenter: Steve Yadlowsky Authors, Year: Zhaosong Lu, 2013
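A minimal sketch of the iterative hard thresholding recursion above, with the box projection Π_B and the keep-K hard threshold T_K (my own implementation of that recursion; the step size 1/L is taken from the largest eigenvalue of ΦᵀΦ, and the toy problem is arbitrary).

import numpy as np

def iht(Phi, x, K, lo, hi, iters=300):
    L = np.linalg.eigvalsh(Phi.T @ Phi).max()          # Lipschitz constant of the gradient
    theta = np.zeros(Phi.shape[1])
    for _ in range(iters):
        grad = Phi.T @ (Phi @ theta - x)
        v = np.clip(theta - grad / L, lo, hi)          # projection onto the box B
        keep = np.argsort(-np.abs(v))[:K]              # T_K: keep the K largest entries
        theta = np.zeros_like(v)
        theta[keep] = v[keep]
    return theta

rng = np.random.default_rng(8)
Phi = rng.normal(size=(40, 100))
theta_true = np.zeros(100); theta_true[[3, 17, 60]] = [0.8, -0.5, 0.9]
x = Phi @ theta_true + 0.01 * rng.normal(size=40)
print(np.nonzero(iht(Phi, x, K=3, lo=-1.0, hi=1.0))[0])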
Unbiased Estimation for the General Half-Normal
Distribution

Motivation: The half-normal distribution, a special case of the folded normal
and truncated normal distributions, appears in noise magnitude models
and fatigue crack lifetime models. This paper studies unbiased estimation
of the location and scale parameters of the general half-normal distribution
HN(ξ, η), whose mean is ξ + η√(2/π):

  f(x) = (√2/(η√π)) exp(−(1/2)((x − ξ)/η)²) I_{[ξ,+∞)}(x)

In this presentation, we consider the unbiased estimator for the location
parameter of HN(ξ, η) proposed in the paper.

Model: X₁, ..., Xₙ iid ~ HN(ξ, η) for unknown ξ ∈ R and known η > 0.
Estimand: ξ    Decision Space: D = R    Loss: L(ξ, d) = ((ξ − d)/η)²
Presenter: Chulhee Yun Authors, Year: Nogales and Pérez, 2015


Unbiased Estimation for the General Half-Normal
Distribution
Initial estimator: Let X̄ = (1/n) Σᵢ₌₁ⁿ Xᵢ and X₍₁₎ = min(X₁, ..., Xₙ).
Also, Xᵢ can be written as Xᵢ = ξ + ηYᵢ, where Yᵢ = |Zᵢ|, Zᵢ iid ~ N(0, 1).

  ξ̂ = (√(2/π) X₍₁₎ − E[Y₍₁₎] X̄) / (√(2/π) − E[Y₍₁₎]),
  E[ξ̂] = (√(2/π)(ξ + ηE[Y₍₁₎]) − E[Y₍₁₎](ξ + η√(2/π))) / (√(2/π) − E[Y₍₁₎]) = ξ

Minimum Risk Location Equivariant

Minimum Risk Location-Scale Equivariant (if we don't know η)
Not Bayes (unbiased with nonzero MSE)
Alternative estimator: The posterior w.r.t. the prior Λ(ξ) = N(μ, λ²) is a
truncated normal distribution. The Bayes estimator is the posterior mean:

  α = ((n/η²)X̄ + (1/λ²)μ) / (n/η² + 1/λ²),   β = 1/(n/η² + 1/λ²),
  ξ̃ = α − √β · φ((X₍₁₎ − α)/√β) / Φ((X₍₁₎ − α)/√β)

Preference: If we don't have prior knowledge about ξ, ξ̂ is simpler,
unbiased, and is an MREE. I would prefer ξ̂.
Presenter: Chulhee Yun Authors, Year: Nogales and Pérez, 2015
Chebyshev Polynomials, Moment Matching, and Optimal
Estimation of the Unseen
The Story: (from Prof. Bradley Efron)
In the 1940s, the naturalist Corbet had spent two years trapping butterflies
in Malaya.

Corbet asked R. A. Fisher: how many new species would he see if he
returned to Malaya for another two years of trapping? Fisher gave
his answer:

  118 − 74 + 44 − 24 + 29 − 22 ... = 75 ± 20.9.    (1)

The "butterflies" can be people in daily life, words in a book, genes of
a human... Is this method optimal?
Presenter: Martin Jinye Zhang Authors, Year: Wu, and Yang, 2015
Chebyshev Polynomials, Moment Matching, and Optimal
Estimation of the Unseen
Model: P = (p₁, p₂, ..., p_k) unknown, pᵢ ≥ 1/k; X = (X₁, ..., Xₙ) iid ~ P.
Estimand: g(P) = Σᵢ I{pᵢ > 0}.
Decision Space: D = {0, 1, ..., k}    Loss: L(P, d) = (g(P) − d)²
Initial estimator: Let fⱼ be the number of elements that occur j times.
Then

  g̃(P) = Σⱼ₌₁^∞ α_L(j) fⱼ,    (2)

where

  α_L(j) = aⱼ j!/nʲ + 1   if j ≤ L,
           1              if j > L.    (3)

No UMVUE, Bayes, Inadmissible, Minimax (in order of sample size).

No unbiased estimator exists.
Preference: Stable and accurate. If Sir Fisher were asked how many
new species would be trapped over another 10 years, his estimator would
explode :)
Presenter: Martin Jinye Zhang Authors, Year: Wu, and Yang, 2015
